Commit

Merge pull request #1 from transparency-certified/lars-1
Adjusting the case study for language and accuracy
craig-willis authored Jan 26, 2024
2 parents 56c43a4 + a3dbce7 commit c96a05c
Showing 4 changed files with 77 additions and 65 deletions.
Binary file removed docs/case-profiles/.campus-cluster.md.swp
Binary file not shown.
30 changes: 19 additions & 11 deletions docs/case-profiles/bplim.md
@@ -5,17 +5,18 @@ The [Banco de Portugal Microdata Research
Laboratory](https://bplim.bportugal.pt/content/access-0) (BPLIM) provides access
to datasets about the Portuguese economy. The following summary is based on the
BPLIM [Guide for
Researchers](https://msites-dee-bplim-prd.azurewebsites.net/sites/default/files/guide_for_researchers_v202210.pdf).
Researchers](https://msites-dee-bplim-prd.azurewebsites.net/sites/default/files/guide_for_researchers_v202210.pdf) and [Guimarães (2023)](https://doi.org/10.1162/99608f92.54a00239).

## Data access
Data can only be accessed using BPLIM's dedicated servers. Under some
circumstances, researchers may be granted surrogate access to the server, where

Data can only be accessed using BPLIM's dedicated servers. Once approved,
external researchers are granted surrogate access to the server, where
BPLIM staff execute scripts written by an external researcher and outputs are
shared after review. Internal researchers have broader access to BPLIM datasets
while external researchers must apply for access to BPLIM data including a
detailed project description.

Datasets made available to external researchers are classified as low, medium,
Datasets made available to external researchers on the BPLIM servers are classified as low, medium,
or high confidentiality. Datasets classified with medium confidentiality have
modified values (perturbed data) and those with high confidentiality contain
modified values and are restricted to a subsample of observations. BPLIM may
@@ -24,9 +25,11 @@ develop their analysis, but these are not valid for research purposes.
Researchers must then request that BPLIM run their analysis on the original
data.

Datasets are assigned DOIs.

## Statistical software

BPLIM servers provide a Linux environment with Stata, R, and Julia. BPLIM staff
BPLIM servers provide a Linux environment with Stata, R, Julia, and Python. BPLIM staff
can assist researchers with installing additional packages for their project.
BPLIM has also developed custom Stata packages.

@@ -36,6 +39,16 @@ External researchers are never allowed to transfer data to or from BPLIM's
servers. For external researchers, BPLIM staff will verify that all outputs
conform to policies (termed "output verification").

## Processing

External researchers develop their code on the BPLIM servers, using the provided data. BPLIM staff validate (reproduce) the analysis by running the code on the same data, but on internal servers. They then modify the scripts to use the confidential data, produce output based on the confidential data, and after output verification, provide the output to the researcher.

More recently, BPLIM has been encouraging researchers to work with Singularity containers ([https://github.com/BPLIM/Containers](https://github.com/BPLIM/Containers)). BPLIM has developed an [app](https://github.com/BPLIM/ReplicationApp) that can already run the researchers' code within the container and lets the researcher verify that the code runs cleanly.

**The app could be augmented to generate a TRO. The app itself would be described as part of the TRS.**

> The initial TRO in the current usage is not useful to the researcher, since it only tells the BPLIM staff that the code runs cleanly. The confidential TRO generated by the app (TRS) on the confidential data needs vetting, similar to the [FSRDC case](caseprofile-rdc).
## Archiving

When a project is "closed", BPLIM will archive a copy of all analysis files and
@@ -44,11 +57,6 @@ unless the researcher explicitly request and justify why they must be archived.

## Citation

BPLIM requests that all datasets be cited

## Replication verification
BPLIM requests that all datasets be cited.

BPLIM will work with third parties responsible for verifying the replicability of
results (including data editors, certification services, or individual
researchers).

82 changes: 41 additions & 41 deletions docs/case-profiles/rdc.md
@@ -1,50 +1,48 @@
(caseprofile-rdc)=
# Research Data Centers
# Federal Research Data Centers

Jump to [TRACE in the RDC](caseprofile-trace-in-the-rdc).
Jump to [TRACE in the FSRDC](caseprofile-trace-in-the-rdc).

[Federal Statistical Research Data
Centers](https://www.census.gov/about/adrm/fsrdc.html) (FSRDC) provide secure
environments to support researchers using restricted access data. FSRDCs are
partnerships between US statistical agencies (e.g., Census Bureau, Bureau of
partnerships between US statistical agencies (e.g., U.S. Census Bureau, Bureau of
Labor Statistics, Bureau of Economic Analysis, National Center for Science and
Engineering Statistics) and research institutions. This case profile is
focused on the U.S. Census Bureau RDCs.

## Census Bureau RDCs

The following summary is based on the Census Bureau's Center for Economic
Studies (CES) [Researcher
Handbook](https://www.census.gov/content/dam/Census/programs-surveys/sipp/methodology/Researcher_Handbook_20091119.pdf).
Engineering Statistics) and research institutions.
The following summary is based on the FSRDC's [Researcher
Handbook](https://psurdc.psu.edu/sites/rdc/files/2021-07/Researcher_Handbook_1208020.pdf).
We use the example of Census Bureau data, but the case applies more generally.

### Inputs
CES microdata is confidential and protected under Title 13 of the U.S. Code.
Researchers must apply for access to RDC data. Available

Census Bureau data accessible via the FSRDC is confidential and protected under Title 13 of the U.S. Code.
Researchers must apply for access to data covered under Title 13. Available
[data](https://www.census.gov/topics/research/guidance/restricted-use-microdata.html)
includes administrative, demographic, and economic data products such as the
American Community Survey (ACS), American Housing Survey (AHS), Longitudinal
Employer-Household Dynamics (LEHD). Researchers must apply for access to CES
data.
Employer-Household Dynamics (LEHD).

Datasets are currently not assigned a persistent identifier.
Datasets are currently not assigned a persistent public identifier, though a Census Bureau internal database tracks data made available through the FSRDCs.

### Environment
Researchers access RDC systems via a thin-client (X-Terminal). A limited range
of software is available for researcher use. All data processing is conducted
on central servers running Red Hat Enterprise Linux. The RDC network is isolated
from other networks. Not mentioned in the handbook, the [Integrated Research
Environment](https://www2.census.gov/foia/events/2017-03/2017_03_16/7_Integrated_Research_Environment_IRE.pdf)
based on PBSPro is also available (since ~2017) and a [Cloud Research
Environment](https://www.census.gov/content/dam/Census/library/publications/2022/adrm/2022-CSRM-Annual-Report.pdf) is being prototyped. Both are within the Center for Enterprise Dissemination
(CED)

Researchers access FSRDC systems via a remote desktop interface (Citrix-based), which in turn connects to the internal compute cluster via a software thin-client (NX). The Linux compute cluster is known as the [Integrated Research
Environment](https://www2.census.gov/foia/events/2017-03/2017_03_16/7_Integrated_Research_Environment_IRE.pdf).
A controlled list
of software is available for researcher use. Jobs are scheduled using PBSPro. Nearly all data processing is conducted
on the IRE, and this case will concentrate on that modal access scenario. The FSRDC network is isolated
from other networks.

Other environments not considered in this case are a prototype [Cloud Research
Environment](https://www.census.gov/content/dam/Census/library/publications/2022/adrm/2022-CSRM-Annual-Report.pdf) and custom Windows access for specific software.

### Outputs and disclosure review

All outputs are subject to review for potential disclosure of confidential
information. If problems are found, researchers are asked to collapse
(combine) certain cells, suppress (replace) information, or reconsider outputs
altogether. Release of "intermediate output" (not included in publication) is
discouraged. CES provides SAS and Stata programs to help with disclosure analysis.
information. This disclosure review is conducted by the agency providing the data, *i.e.*, the Census Bureau's Disclosure Review Board (DRB) reviews output that is based on Census Bureau data, the Bureau of Labor Statistics reviews output that is produced from BLS data, etc.
To prepare disclosable output, researchers are asked to modify data by applying various disclosure avoidance techniques, such as collapsing
(combining) certain cells, suppressing or replacing information, or adding noise. Some output may not be disclosable. Release of "intermediate output" (not included in publication) is
discouraged. The FSRDC system provides SAS and Stata programs to help with disclosure analysis.

### Archiving

@@ -57,43 +55,45 @@ years.
## TRACE in the RDC

### Ingest of data

Data ingest should create permanent records of their existence at a point in
time. If preserved (ideal), should have a unique identifier, ideally public. If
not preserved or de-accessioned, a record of the when and why should be kept.

**In a nutshell: Assign DOI to input data**

### TRACE System description
A TRACE System description should be published. The format is yet undefined, but
can be an XML file with style applied (in essence, a webpage that has
structured, machine-readable content). It simply needs to exist. It is expected

A TRACE System description should be published. The content should conform to the [TRACE System](element-trace-system), and could be made human-readable as a webpage that has
structured, machine-readable content. It is expected
that a TRACE System description does not reveal sensitive information. It should
include principles of disclosure review. To back it up, in principle, a duly
include principles of disclosure review. In principle, a duly
authorized person may need to inspect the systems described by the TRACE system
description.

**In a nutshell: Publish a structured webpage, and be ready to back it up.**
**In a nutshell: Publish a structured webpage, and be ready to provide evidence of audit or inspection.**

### TRACE System itself

A likely version of the "ideal" TRACE system is a queue system with staging,
i.e., the researcher submits a job to the queue, which is staged to a different
server, which does not allow for interaction during the execution of all
i.e., the researcher submits a job to a special queue that "stages" (copies) to a different
file system or server, to ensure that no user interaction is possible during the execution of all
programs. In other words, all input data are read-only, all code uses only the
original, read-only input data, all code is copied from the researcher's space
into a separate area, and all output is captured. Output, code, and indicators
of the input data (hash) are cryptographically signed.

**In a nutshell: Implement a somewhat fancy PBS Pro queue on Census servers**
**In a nutshell: Implement a somewhat fancy PBSPro queue on Census servers**

This implements a confidential version of a TRO.
The TRO produced by this TRS would *a priori* be confidential, since it contains raw output that has not yet been vetted by the Census Bureau's DRB.
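
The staging idea can be illustrated with a minimal Python sketch. It is not the Census Bureau's or TRACE's implementation: the function names (`sha256_file`, `hash_tree`, `run_staged`), the directory layout, and the convention that the entry-point script receives the input and output directories as arguments are all assumptions made for illustration. The sketch copies the code to a fresh staging area, fingerprints the inputs with SHA-256, runs the code with standard input disconnected, and checks afterwards that the inputs are unchanged. It does not enforce read-only access or scheduling, which a real queue (e.g., PBSPro) and file-system permissions would have to provide.

```python
import hashlib
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root: Path) -> dict:
    """Map each file under root (relative POSIX path) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def run_staged(code_dir: Path, input_dir: Path, entry_point: str) -> dict:
    """Hypothetical staged run: copy code, execute it, capture everything."""
    # Fingerprint the inputs before anything runs.
    input_hashes = hash_tree(input_dir)

    # Stage: copy the researcher's code into a fresh, separate area.
    stage = Path(tempfile.mkdtemp(prefix="trace-stage-"))
    staged_code = stage / "code"
    shutil.copytree(code_dir, staged_code)
    code_hashes = hash_tree(staged_code)
    output_dir = stage / "output"
    output_dir.mkdir()

    # Standard input is connected to the null device, so the job cannot prompt.
    result = subprocess.run(
        [sys.executable, str(staged_code / entry_point), str(input_dir), str(output_dir)],
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        check=False,
    )

    return {
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
        "input_hashes": input_hashes,
        # Re-hash the inputs to detect any modification during the run.
        "inputs_unchanged": hash_tree(input_dir) == input_hashes,
        "code_hashes": code_hashes,
        "output_hashes": hash_tree(output_dir),
        "stage": str(stage),
    }
```

The returned dictionary holds raw, unvetted output, so in the setting described here it would itself be confidential until disclosure review is complete.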

### TRO

A TRO is the object that links the input data, code, and output, subject to all
the processing by the system. The output from the confidential TRO described
above can be submitted directly to DRB-like processes. The DRB typically
modifies output further, through a standardized but manual process. The output
above can be submitted directly to the DRB processes. The DRB may
modify output further, through a standardized (possibly manual) process. The output
from the DRB (modified code, modified output) is combined with information about
the original input data (identifiers, hash), digitally signed, and published.

**In a nutshell: Implement a variation of the current DRB process with digital
signatures and (delayed) publication (with persistent identifiers)**
**In a nutshell: Implement a variation of the current DRB process with digital signatures and (delayed) publication (with persistent identifiers)**
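
As a rough illustration of linking inputs, code, and output in one signed record, the sketch below builds a small JSON-serializable dictionary and signs its canonical form. Everything named here is hypothetical: the field names (`inputs`, `code`, `outputs`) are not taken from the TRACE specification, and HMAC-SHA256 with a shared key is used only because it ships with the Python standard library. A production system would more plausibly use asymmetric digital signatures, so that anyone can verify a record without holding the signing key.

```python
import hashlib
import hmac
import json


def build_tro(input_ids: dict, code_hashes: dict, output_hashes: dict) -> dict:
    """Assemble a hypothetical TRO-like record from identifiers and hashes."""
    return {"inputs": input_ids, "code": code_hashes, "outputs": output_hashes}


def canonical_bytes(record: dict) -> bytes:
    """Serialize deterministically so the same record always signs the same way."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(record: dict, key: bytes) -> str:
    """Return an HMAC-SHA256 tag over the canonical form (stand-in for a signature)."""
    return hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()


def verify_record(record: dict, signature: str, key: bytes) -> bool:
    """Check the tag in constant time; any change to the record makes this False."""
    return hmac.compare_digest(sign_record(record, key), signature)
```

Because the signature covers the input identifiers and hashes alongside the vetted output, a later change to any one of them would cause verification to fail.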
30 changes: 17 additions & 13 deletions docs/examples.md
@@ -45,7 +45,7 @@ American Economic Journal: Macroeconomics 13(2): 444–86.
* Requires 12 cores, 512GB RAM and ~3TB fast local storage. Run time was >12 hours.

Webb, Clayton; Linn, Suzanna; Lebo, Matthew, 2019, "Replication Data for: Beyond
the Unit Root Question: Uncertainty and Inference",
the Unit Root Question: Uncertainty and Inference", [Paper](https://doi.org/10.1111/ajps.12506)
[Replication package](https://doi.org/10.7910/DVN/ZBRTJH)
* Simulations were performed on the University of Kansas High Performance
Compute Cluster with each job requesting 4 nodes with 20 cores per node.
@@ -60,33 +60,37 @@ Public Goods: The Evidence from Deforestation",

(example-twitter)=
## Twitter

Oklobdzija, Stan; Kousser, Thad; Butler, Daniel, 2022, "Replication Data for: Do
Male and Female Legislators Have Different Twitter Communication Styles?",
https://doi.org/10.15139/S3/MHAAZV, UNC Dataverse.
* Uses Twitter data
Male and Female Legislators Have Different Twitter Communication Styles?", [Paper](https://doi.org/10.1017/spq.2022.16)
[Replication package](https://doi.org/10.15139/S3/MHAAZV)
* Uses Twitter data, which cannot be redistributed.

(example-international-agencies)=
## International statistical agencies
## International data

Hjortskov, Morten; Andersen, Simon Calmar; Jakobsen, Morten, 2018, "Replication
Data for: Encouraging Political Voices of Underrepresented Citizens through
Coproduction. Evidence from a Randomized Field Trial". [Replication
package](https://doi.org/10.7910/DVN/MZKJDR)
* Confidential data from Statistics Denmark
* Danish data is only accessible upon application and on servers hosted by Statistics Denmark.

Hager, Anselm; Hilbig, Hanno, 2019, "Replication Data for: Do Inheritance
Customs Affect Political and Social Inequality" [Replication
Customs Affect Political and Social Inequality" [Paper](https://doi.org/10.1111/ajps.12460) [Replication
package](https://doi.org/10.7910/DVN/ZUH3UG)
* Confidential data from German SOEP
* Uses data from German SOEP (a survey created by the German Institute for
Economic Research, DIW Berlin, a private research institute)
* Access requires signing a use agreement and prevents redistribution. Accessible data differ depending on location of researcher.
* Additional geolocated data can only be accessed on-site at DIW Berlin.

Bonhomme, Lamadon, and Manresa, forthcoming. "A distributional Framework for
matched employer-employee data". Econometrica. [Github
matched employer-employee data". Econometrica. [Paper](https://doi.org/10.3982/ECTA15722) [Github
repo](https://github.com/tlamadon/blm-replicate)
* Docker runs with synthetic data and is designed to run on confidential data
from Sweden.
* Injecting a bit of code would make this be able to run on Swedish data and
demonstrate input data, output results, and possibly some disclosure avoidance
review (not specified)
* Docker runs with synthetic data
* Docker is designed to run on confidential data from Sweden, accessible only upon application and from within Europe.

> NOTE: The use of Docker (=TRS) would make generating a TRO relatively straightforward, subject to the same caveats as the FSRDC case.

(example-ipums)=