---
title: "Reproducible Research"
subtitle: "Concepts and Tools"
author: "Allan Just"
date: today
format:
  revealjs:
    slide-number: true
    chalkboard:
      buttons: false
    preview-links: auto
editor: visual
---
## Objective of the workshop
::: {style="text-align: left; margin-top: 1em"}
### To demonstrate how reproducible research practices make research easier and more transparent
:::
## Agenda
- Project and data organization principles
- Writing more reusable code
- A quick tour of tools
- Playing with Git
- Data sharing and NIH rules (for Jan 2023 forward)
::: aside
With lots of examples from our lab
:::
## What do we mean by Reproducible Research?
- **Reproducibility:** You or someone else can exactly (numerically) regenerate your results from the same dataset
- **Replicability:** Someone gets similar results and reaches the same conclusion with a different dataset
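A minimal sketch of what "exactly (numerically)" can look like in practice, assuming a Python analysis with a random component. The function `bootstrap_mean` and the toy data are hypothetical and only for illustration:

```python
import random
import statistics


def bootstrap_mean(values, n_resamples=200, seed=2023):
    """Bootstrap estimate of the mean using a private, seeded generator."""
    rng = random.Random(seed)  # local generator: no hidden global state
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(resample))
    return statistics.fmean(means)


data = [2.1, 3.4, 1.9, 5.0, 4.2]  # toy dataset (hypothetical)

first_run = bootstrap_mean(data)
second_run = bootstrap_mean(data)

# Same data + same code + same seed -> identical numbers
print(first_run == second_run)  # prints True
```

Recording the seed next to the results is one simple way to let someone else rerun the analysis and compare the numbers.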
## Who is this for?
- **You** _(thanks for joining)_
- Small teams working on science projects with data
## Reproducibility as a component of Scientific Rigor {.smaller}
For my NIH proposals, I wrote the following text:
> **Scientific rigor and reproducibility:** We have several protocols in place to ensure rigor and reproducibility. In both exposure modeling and epidemiologic analyses, we use careful data cleaning, sanity checks and quality control. These protocols include (1) automated data quality checking using statistical reporting tools and audits; (2) prespecified analytic plans that are approved by the PI with mandated reporting of null results; (3) use of reproducible analytic tools such as R Markdown and Git repositories for statistical code; (4) required verification of analyses by a second data analyst; and (5) adherence to STROBE guidelines[^1] so that results are more likely interpretable and useful to the scientific and policy communities.
[^1]: [Strengthening Reporting in OBservational Epidemiology](https://www.equator-network.org/reporting-guidelines/strobe/)
# Project Organization and Data Management
## Good enough practices *-* Data management
- Save the raw data.
- Record all the steps used to process data.
- Anticipate the need to use multiple data sources, and use a unique identifier for every record.
- Submit data to a reputable DOI-issuing repository so that others can access and cite it.
::: footer
Publication: [Wilson et al. 2017. Good enough practices in scientific computing. PLoS Computational Biology](https://doi.org/10.1371/journal.pcbi.1005510)
:::
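Two of the bullets above (save the raw data, give every record a unique identifier) can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed workflow: the file name, the column names, and the `rec-0001` ID format are all hypothetical.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, so any change to raw data is detectable."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def add_record_ids(rows, prefix="rec"):
    """Return copies of the rows with a unique 'record_id' field added."""
    return [
        {"record_id": f"{prefix}-{i:04d}", **row}
        for i, row in enumerate(rows, start=1)
    ]


# Toy raw file in a temporary directory (stand-in for a raw data folder)
workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "visits_raw.csv"
raw_file.write_text("site,pm25\nA,8.1\nB,12.4\n")

checksum_before = sha256_of(raw_file)

# Read the raw data without modifying it; derived rows live in a new object
with raw_file.open(newline="") as f:
    rows = list(csv.DictReader(f))
records = add_record_ids(rows)

checksum_after = sha256_of(raw_file)
print(checksum_before == checksum_after)  # prints True: raw file untouched
```

Storing the checksum alongside the raw file makes it easy to confirm later that the starting point of the analysis has not changed.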
## Good enough practices *-* Software
- Place a brief explanatory comment at the start of every program.
- Decompose programs into functions.
- Give functions and variables meaningful names.
- Make dependencies and requirements explicit.
- Do not comment and uncomment sections of code to control a program's behavior.
- Provide a simple example or test data set.
- Submit code to a reputable DOI-issuing repository.
::: footer
Publication: [Wilson et al. 2017. Good enough practices in scientific computing. PLoS Computational Biology](https://doi.org/10.1371/journal.pcbi.1005510)
:::
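As a rough sketch of several of these practices together (opening comment, small functions, meaningful names, explicit dependencies, a tiny example dataset), assuming a Python script. The function names and the example records are hypothetical:

```python
"""Summarise exposure measurements by site (toy example).

Dependencies: Python 3.8+ standard library only.
"""
from collections import defaultdict
from statistics import fmean


def group_by_site(records):
    """Collect PM2.5 values into a dict keyed by site."""
    values_by_site = defaultdict(list)
    for record in records:
        values_by_site[record["site"]].append(record["pm25"])
    return dict(values_by_site)


def mean_by_site(records):
    """Return the mean PM2.5 value for each site."""
    return {
        site: fmean(values)
        for site, values in group_by_site(records).items()
    }


# Small example dataset shipped with the code so others can try it
EXAMPLE_RECORDS = [
    {"site": "A", "pm25": 8.0},
    {"site": "A", "pm25": 10.0},
    {"site": "B", "pm25": 12.0},
]

site_means = mean_by_site(EXAMPLE_RECORDS)
print(site_means)  # prints {'A': 9.0, 'B': 12.0}
```

Behavior is controlled by what is passed to the functions, not by commenting sections of code in and out.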
## Good enough practices *-* Collaboration
- Create an overview of your project.
- Create a shared "to-do" list for the project.
- Decide on communication strategies.
- Make the license explicit.
- Make the project citable.
::: footer
Publication: [Wilson et al. 2017. Good enough practices in scientific computing. PLoS Computational Biology](https://doi.org/10.1371/journal.pcbi.1005510)
:::
## Good enough practices *-* Project organization
- Put each project in its own directory, which is named after the project.
- Put text documents associated with the project in the doc directory.
- Put raw data and metadata in a data directory and files generated during cleanup and analysis in a results directory.
- Put project source code in a designated directory.
- Name all files to reflect their content or function.
::: footer
Publication: [Wilson et al. 2017. Good enough practices in scientific computing. PLoS Computational Biology](https://doi.org/10.1371/journal.pcbi.1005510)
:::
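One way to make this layout a habit is to script it. A minimal sketch, assuming the `doc`, `data`, and `results` directory names from the bullets above; `src` as the source code directory and the project name are assumptions for illustration:

```python
import tempfile
from pathlib import Path

# "doc", "data", "results" follow the bullets; "src" is an assumed name
PROJECT_DIRS = ["doc", "data", "results", "src"]


def create_project(parent, name):
    """Create a directory named after the project, with standard subdirectories."""
    root = Path(parent) / name
    for sub in PROJECT_DIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


# Hypothetical project name, created in a temporary location
project_root = create_project(tempfile.mkdtemp(), "heat_preterm_birth")
created = sorted(p.name for p in project_root.iterdir())
print(created)  # prints ['data', 'doc', 'results', 'src']
```

Because `exist_ok=True` is set, rerunning the function on an existing project leaves it unchanged.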
## Good enough practices *-* Tracking changes
- Back up (almost) everything created by a human being as soon as it is created.
- Keep changes small.
- Share changes frequently.
- Use a version control system.
::: footer
Publication: [Wilson et al. 2017. Good enough practices in scientific computing. PLoS Computational Biology](https://doi.org/10.1371/journal.pcbi.1005510)
:::
## Checking in on your data stewardship
![](images/horst-seaside-chats.png){fig-alt="cartoon of a group of scientists discussing data with laptop computers and whiteboard"}
::: footer
[Artwork by @allison_horst](https://twitter.com/allison_horst)
:::
## Openscapes and the "seaside chat"
Talking about data (analysis and stewardship)!
![](images/openscapes_hex_badge.png){fig-alt="hexagonal badge of the abstract Openscapes logo"}
::: footer
seriously -- check out [Openscapes](https://www.openscapes.org/)
:::
## Our group check-in (Spring 2022)
::: {.column-screen}
![](images/justlab-openscapes-pathway-2022-screenshot.png){fig-alt="screenshot of a busy spreadsheet with notes on reproducibility"}
:::
## Your group check-in
::: {style="text-align: center; margin-top: 2em"}
[Openscapes Template](https://docs.google.com/spreadsheets/d/14D5o9IG3UYmV_kxdUa1bqy1EYBF5iuD2i8b4jL3Tu5c/edit?usp=sharing){preview-link="true" style="text-align: center"}
:::
::: aside
This version is view-only, but you can make your own copy in Google Sheets!
:::
# Version control with *`Git`* and *`GitHub`*
## Motivation for using Git and GitHub
::: {style="text-align: center; margin-top: 1em"}
*"yourself from 3 months ago doesn't answer email"*
:::
. . .
![](images/github_friends_crop.png)
::: footer
Illustrations from the Openscapes blog [GitHub for supporting, contributing, and failing safely](https://www.openscapes.org/blog/2022/05/27/github-illustrated-series/) by Allison Horst and Julia Lowndes
:::
---
![](images/github_friends.png)
::: footer
Illustrations from the Openscapes blog [GitHub for supporting, contributing, and failing safely](https://www.openscapes.org/blog/2022/05/27/github-illustrated-series/) by Allison Horst and Julia Lowndes
:::
---
![](images/github_wickham_bryan_git_quote.png)
::: footer
Illustrations from the Openscapes blog [GitHub for supporting, contributing, and failing safely](https://www.openscapes.org/blog/2022/05/27/github-illustrated-series/) by Allison Horst and Julia Lowndes
:::
---
![](images/github_compare_text.png)
GitHub helps streamline our work
::: footer
Illustrations from the Openscapes blog [GitHub for supporting, contributing, and failing safely](https://www.openscapes.org/blog/2022/05/27/github-illustrated-series/) by Allison Horst and Julia Lowndes
:::
---
![](images/github_groundcrew_text.png)
GitHub helps us support
::: footer
Illustrations from the Openscapes blog [GitHub for supporting, contributing, and failing safely](https://www.openscapes.org/blog/2022/05/27/github-illustrated-series/) by Allison Horst and Julia Lowndes
:::
---
![](images/github_reuse.jpeg)
GitHub helps us reuse
::: footer
Illustrations from the Openscapes blog [GitHub for supporting, contributing, and failing safely](https://www.openscapes.org/blog/2022/05/27/github-illustrated-series/) by Allison Horst and Julia Lowndes
:::
---
![](images/github_clip.png)
GitHub helps us contribute
::: footer
Illustrations from the Openscapes blog [GitHub for supporting, contributing, and failing safely](https://www.openscapes.org/blog/2022/05/27/github-illustrated-series/) by Allison Horst and Julia Lowndes
:::
---
![](images/github_fall.png)
GitHub helps us fail safely
::: footer
Illustrations from the Openscapes blog [GitHub for supporting, contributing, and failing safely](https://www.openscapes.org/blog/2022/05/27/github-illustrated-series/) by Allison Horst and Julia Lowndes
:::
---
![](images/github_scoping.jpeg)
Harness the power of GitHub.
::: footer
Illustrations from the Openscapes blog [GitHub for supporting, contributing, and failing safely](https://www.openscapes.org/blog/2022/05/27/github-illustrated-series/) by Allison Horst and Julia Lowndes
:::
---
![](images/github_harness.jpeg)
GitHub for the win!
::: footer
Illustrations from the Openscapes blog [GitHub for supporting, contributing, and failing safely](https://www.openscapes.org/blog/2022/05/27/github-illustrated-series/) by Allison Horst and Julia Lowndes
:::
## Summary of advantages of using Git
- Continuously write/revise/delete code over the course of a project
- Try out things safely - you can always revert
- Go back in time
- Reuse code
- Share and synchronize code, work collaboratively
- Get feedback
- Rest easy with code backups
## Learning to use Git
::: {style="text-align: center; margin-top: 1em"}
[**Happy Git and GitHub for the useR**](https://happygitwithr.com/)
:::
::: {style="text-align: center; margin-top: 1em"}
[**Software Carpentry Lesson "Version Control with Git"**](https://swcarpentry.github.io/git-novice/)
:::
## Working toward a reproducible toolkit
![](images/pexels-polina-tankilevitch-3872373.jpg){fig-align="center" fig-alt="a salad bowl filled with a nice mix of different foods"}
::: footer
[Photo by Polina Tankilevitch from Pexels](https://www.pexels.com/photo/photo-of-sliced-vegetables-on-ceramic-plate-3872373/)
:::
# Examples from our lab
## Analysis of data from NIH repositories
- Reused 80 datasets deposited in NCBI's Gene Expression Omnibus to conduct new science
- Built a widely used software package ([ewastools](https://hhhh5.github.io/ewastools/))
- Useful teaching tool
::: {style="text-align: center; margin-top: 2em"}
[Zenodo repository for code: DOI 10.5281/zenodo.1172730](https://doi.org/10.5281/zenodo.1172730){style="text-align: center"}
:::
::: footer
Publication: [Heiss, J., Just, A. 2018. Identifying mislabeled and contaminated DNA methylation microarray data: an extended quality control toolset with examples from GEO. *Clin Epigenet*.](https://doi.org/10.1186/s13148-018-0504-1)
:::
## Analysis of public COVID-19 datasets
- Complex analysis integrating multiple public data sources
- Sharing the code plus output increases transparency
::: {style="text-align: center; margin-top: 1em"}
[*Nature Comms* - Rendered markdown with code and output](https://justlab.github.io/COVID_19_admin_disparities/code/NYC_COVID_Disparities.html){preview-link="true" style="text-align: center"}
:::
::: footer
Publication: [Carrión, D., et al. (2021). Neighborhood-level disparities and subway utilization during the COVID-19 pandemic in New York City. *Nature Communications*, 12(1).](https://doi.org/10.1038/s41467-021-24088-7)
:::
## Reproducible epidemiologic simulation
::: {style="text-align: center; margin-top: 2em"}
[Code repository with targets workflow](https://github.com/justlab/casecrossover_preterm_simulation){style="text-align: center"}
:::
::: footer
Publication: [Carrión, D., et al. (2022). The Case--Crossover Design Under Changing Baseline Outcome Risk: A Simulation of Ambient Temperature and Preterm Birth. *Epidemiology*.](https://doi.org/10.1097/EDE.0000000000001477)
:::
## Interactive exercise with RStudio Cloud
# Data sharing
## FAIR principles and data sharing expectations
::: incremental
- **F**indability
- **A**ccessibility
- **I**nteroperability
- **R**euse of digital assets
:::
[go-fair.org](https://www.go-fair.org/fair-principles/)
## Our lab uses **Zenodo**
Zenodo is a "Generalist Repository"
![](images/zenodo.png)
::: {.incremental}
- Submissions are assigned Digital Object Identifiers (prevents link rot)
- Versioned: DOIs for each version and one DOI linking to the most recent version
- Rich metadata
- Have it your way (any file format possible)
- No weird file names forced upon you
:::
## New NIH rules as of Jan 25, 2023
[sharing.nih.gov](https://sharing.nih.gov/)
. . .
::: {style="text-align: left; margin-top: 1em"}
**This is a big deal**
:::
::: {style="text-align: left; margin-top: 1em"}
[tweet thread by ISMMS professor Florian Krammer](https://twitter.com/florian_krammer/status/1485271552325300230)
:::
## A brief *informal* digest of the new NIH rules
::: {.incremental}
- Who: anyone funded in whole or in part by NIH
- What: a data management plan (2 pages: [link to draft template](https://grants.nih.gov/sites/default/files/DMS-Plan-blank-format-page.pdf){preview-link="true"}) that must cover any data generated by the project (regardless of publication); default is open sharing
- When: applications after Jan 25, 2023 (not retroactive); data must be shared by the earlier of publication or end of award
- Where: into appropriate scientific repositories
- Why: to improve science!
:::
::: footer
Again -- see [sharing.nih.gov](https://sharing.nih.gov/) for details
:::
## Summary
- Reproducible research in epidemiology **is** complicated
- There are many small steps you can take
- You and your collaborators will be the biggest beneficiaries
- Our community is moving toward more open science and NIH is making a big push
::: {style="text-align: left; margin-top: 2em"}
**email: `[email protected]`**
:::
::: {.footer}
Slides generated from [github.com/justlab/reproducible_research](https://github.com/justlab/reproducible_research)
:::