-
Notifications
You must be signed in to change notification settings - Fork 18
/
Copy pathpackage-data.Rmd
110 lines (81 loc) · 5.15 KB
/
package-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
# Package data {#data}
When developing a software package, an excellent practice is to give a
comprehensive illustration of the methods in the package using an existing
[experiment data package][exptdata-views], [annotation data][annotation-views]
or data in the `r BiocStyle::Biocpkg("ExperimentHub")` or `r BiocStyle::Biocpkg("AnnotationHub")`,
or submitting new data to those resources yourself.
If existing data is not available or applicable, or a new smaller dataset is
needed for examples in the package, data can be included either as a separate
data package (for larger amounts of data) or within the package (for smaller
datasets).
The Bioconductor Build system does not support git-lfs. This is not a current
option for storing large data. Large data sets must be included through the
`r BiocStyle::Biocpkg("ExperimentHub")`. See additional information at [Create A
Hub Package][].
## Experiment Data Package {#experiment-data-package}
Experimental data packages contain data specific to a particular analysis or
experiment. They often accompany a software package for use in the examples and
vignettes and in general are not updated regularly. If you need a general subset
of data for workflows or examples first check the `r BiocStyle::Biocpkg("AnnotationHub")`
resource for available files (e.g., BAM, FASTA, BigWig, etc.) or
`r BiocStyle::Biocpkg("ExperimentHub")` for more processed and predefined
_Bioconductor_ class structures.
[_Bioconductor_][] strongly encourages creating an experiment data package that
utilizes `r BiocStyle::Biocpkg("ExperimentHub")` or `r BiocStyle::Biocpkg("AnnotationHub")`
(See [Create A Hub Package][]) but a traditional package that encapsulates the
data is also acceptable with pre-approval from the [bioc-devel][bioc-devel-mail]
mailing list.
See the [Package Submission guidelines](#bioconductor-package-submissions) for
further submission procedures and if submitting related or circular dependent
packages please read [related package submission procedure][cirdep]
## Adding Data to Existing Package {#adding-data-to-existing-package}
[_Bioconductor_][] strongly encourages the use of existing datasets, but if not
available data can be included directly in the package for use in the examples
found in man pages, vignettes, and tests of your package. This is a good
reference by Hadley Wickham about [package data](http://r-pkgs.had.co.nz/data.html).
However, as mentioned in [The DESCRIPTION file](#description-lazydata) chapter,
_Bioconductor_ does not encourage using `LazyData: True` despite its
recommendation in this article.
Note. You might have to modify the usage section with `@usage data("mydata")`
Some key points are summarized in the following sections.
### Exported Data and the `data/` directory
Data in `data/` is exported to the user and readily available.
It is made available in an <i class="fab fa-r-project"></i> session through the use of `data()`.
It will require documentation concerning its creation and source information.
It is most often a `.RData` file created with `save()` but other types are acceptable as well, see `?data()`.
Please remember to compress the data.
For packages that need to use documented data within a function, we recommend
creating an environment for holding the data to avoid polluting the
global environment when using `data()`.
```r
data_env <- new.env(parent = emptyenv())
data("mydata", envir = data_env, package = "mypackage")
mydata <- data_env[["mydata"]]
```
Here we create a new and empty environment. Extract the data within the
environment and finally extract the data object from the environment using
double brackets.
### Raw Data and the `inst/extdata/` directory
It is often desirable to show a workflow which involves the parsing or loading of raw files.
[_Bioconductor_][] recommends finding existing raw data already provided in
another package or the hubs as previously stated.
However, if this is not applicable, raw data files should be included in the
`inst/extdata` directory. Files of these type are often accessed utilizing
`system.file()`. _Bioconductor_ requires documentation on these files in an
`inst/script/` directory. See [data documentation](#doc-inst-script).
### Internal data
Rarely, a package may require parsed data that is used internal but should not
be exported to the user. An `R/sysdata.rda` is often the best place to include
this type of data. Proper documentation of the data will be required.
### Other data
Downloads of files and external data from the web should be avoided.
If it is necessary, at minimum the files should be cached.
See `r BiocStyle::Biocpkg("BiocFileCache")` for Bioconductor recommended package
for caching of files. If a maintainer creates their own caching directory, it
should utilize standard caching directories `tools::R_user_dir(package,
which="cache")`. It is not allowed to download or write any files to a users
home directory, working directory, or installed package directory. Files should
be cached as stated above with `BiocFileCache` (preferred) or `R_user_dir` or
`tempdir()/tempfile()` if files should not be persistent. Please also see
sections on [querying web resources](#querying-web-resources) and [file
downloads](#web-querying-and-file-caching).