-
Notifications
You must be signed in to change notification settings - Fork 6
/
24-reproducibility.Rmd
82 lines (57 loc) · 3.51 KB
/
24-reproducibility.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
---
title: ''
output: html_document
---
# Reproducible {#reproduce}
We want our code to be reproducible so that:
* it can be used by others (both for collaboration and to allow effective review and accountability);
* it keeps working over time (protected from external changes);
* it can be easily reused by others in their own projects.
There are a number of steps that we can take to ensure that our code is as reproducible as possible.
## Manage project dependencies {#projdep}
Your project will depend on an number of external factors, such as software or packages. These dependencies may mean that your project won't work on others' machines or may not work on your machine at a later date (e.g. as external packages are updated over time). To ensure that this doesn't become an issue for your project, you should use some kind of dependency management tool.
### Dependency management tools {-}
+--------------+--------------------------------------------------------+
| Language | Tools |
+==============+========================================================+
| R | We recommend using [Conda][241]. Other alternatives are [Packrat][242] and [Renv][243]. |
+--------------+--------------------------------------------------------+
| Python | We recommend using [Conda][241] |
+--------------+--------------------------------------------------------+
| Javascript | Include third party library dependencies in the project as `.js` files |
+--------------+--------------------------------------------------------+
[241]:https://user-guidance.services.alpha.mojanalytics.xyz/appendix/conda/#content
[242]:https://rstudio.github.io/packrat/
[243]:https://blog.rstudio.com/2019/11/06/renv-project-environments-for-r/
### Include a git hash {-#githash}
If practical, the output of your code should include the git hash of the code that produced it. By doing so, the analysis should be
more reproducible, there is no ambiguity about the specific code that was used to generate it.
#### R {-}
You can access the git hash using either of the following code:
snippets.
```{r, results="hide"}
library(git2r)
repo <- repository(".")
print(repository_head(repo))
```
or
```{r, results="hide"}
print(system("git rev-parse --short HEAD", intern = TRUE))
```
#### Python {-}
You can access the git hash using the following code:
``` {python}
import subprocess
def get_git_revision_hash():
return subprocess.check_output(['git', 'rev-parse', 'HEAD'])
def get_git_revision_short_hash():
return subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD'])
```
## Format {#format}
If the output is a report, the write up should be fully reproducible, or as close as possible.
* Avoid workflows that require manually copying and pasting results between documents.
* For Python, consider using Jupyter notebooks. For R, use `rmarkdown`.
## Optimize for change {#change}
* Don’t try to solve every conceivable problem up-front, instead focus on making your code easy to change when needed.
* Don’t prematurely optimize - choose clarity over performance, unless there is a serious performance issue that needs to be addressed.
* Change can come in several forms, including hardware - your code will eventually be run on a colleague's machine or a server somewhere. Without over-complicating things, write your code with this in mind. For example, use relative paths (e.g. `./file_in_the_project_directory.R` rather than `/Users/my_username/development/my_project/file_in_the_project_directory.R`)