forked from fhdsl/Containers_for_Scientists
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path01-intro.qmd
74 lines (46 loc) · 5.14 KB
/
01-intro.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
```{r, include = FALSE}
ottrpal::set_knitr_image_path()
```
# Introduction
```{r, out.width = "100%", echo = FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1T5Lfei2UVou9b0qaUCrWXmkcIwAao-UcN4pHMPEE4CY/edit#slide=id.p")
```
## Target Audience
The course is intended for students in the biomedical sciences and researchers who use informatics tools in their research.
_This course is written for individuals who:_
- Are comfortable with bash and command line
- Write code for scientific projects
- Perhaps tried to use Docker or another containerization software before but felt overwhelmed or confused on how to get started
- Want to learn best practices for using containers
```{r, out.width = "100%", echo = FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1T5Lfei2UVou9b0qaUCrWXmkcIwAao-UcN4pHMPEE4CY/edit#slide=id.g2942096bb40_0_787")
```
## Topics covered
This course covers how to use containers for scientific software development. Scientific software can take many forms but all can benefit from the concepts of continuous integration (CI) and continuous deployment (CD). Containers play a critical role in CI/CD by providing a consistent, portable, and isolated environment for building, testing, and deploying software.
```{r, out.width = "100%", echo = FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1T5Lfei2UVou9b0qaUCrWXmkcIwAao-UcN4pHMPEE4CY/edit#slide=id.g2942096bb40_0_464")
```
This course builds on concepts introduced in the [Reproducibility](https://jhudatascience.org/Reproducibility_in_Cancer_Informatics/introduction.html) and [Advanced Reproducibility](https://jhudatascience.org/Adv_Reproducibility_in_Cancer_Informatics/introduction.html) courses from the ITCR Training Network. If you are unfamiliar with GitHub and/or do not have an account, we'd suggest you start with those courses by using the links above.
```{r, out.width = "100%", echo = FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g101867ebdaa_18_0")
```
## Motivation
Cancer datasets are plentiful, complicated, and hold untold amounts of information regarding cancer biology. Cancer researchers are working to apply their expertise to the analysis of these vast amounts of data but training opportunities to properly equip them in these efforts can be sparse. This includes training in reproducible data analysis methods.
Data analyses are generally not reproducible without direct contact with the original researchers and a substantial amount of time and effort [@BeaulieuJones2017]. Reproducibility in cancer informatics (as with other fields) is still not monitored or incentivized despite the fact that it is fundamental to the scientific method. Even without external incentives, many researchers strive for reproducibility in their own work but often lack the skills or training to do so effectively.
Equipping researchers with the skills to create reproducible data analyses increases the efficiency of everyone involved. By recognizing that biological data analysis code is a form of software development, we can try to adapt good development practices in scientific analyses and software contexts.
_Scientific software projects may include (but aren't limited to):_
- Software built as tools to be utilized by others to analyze biologically derived data.
- Code that is built primarily for analyzing one project's data.
- Code that is built as a workflow for a series of steps and analyses that might be reused among collaborators or within a lab.
- Any scripts and code that are built to handle data in a research setting.
- Any scripts and code a researcher might interact with.
One tool among many for creating reproducible analyses is utilizing containers. A container is a lightweight, portable, and isolated environment that encapsulates an application and its dependencies, enabling it to run consistently across different computing environments. Many individuals performing analyses on cancer data may not have formal training in software development and may be unfamiliar with the idea of containers.
## Curriculum
The course includes hands-on exercises for how to use, modify, share, and troubleshoot containers for scientific software development purposes.
**Goal of this course:**
Equip learners with basics skills and confidence to utilize containers within the context of scientific software analyses.
**What is not the goal**
This course is not meant to teach learners how to create complex containers, but instead introduce learners to basic fundamentals of continuous integration and continuous deployment. This course focuses on containers (Docker or Podman) and will not cover any other (perfectly fine) tools for CI/CD.
## How to use the course
Ideally you should follow along with the chapters and perform the activities as they are described. These activities involve using Docker or optionally Podman.
We also recommend that if you'd like to fully leverage your container skills after taking this course you may also enjoy our [GitHub Actions course](https://hutchdatascience.org/GitHub_Automation_for_Scientists/) that pairs well with the skillset taught here.