fhdsl · cansavvy · Nov 30, 2023 · Oct 16, 2023 · Oct 17, 2023 · Oct 17, 2023
diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd
@@ -1,2 +1,104 @@
+# Scientific software development best practices
 
-# Programming Practices
+## Learning Objectives
+
+```{r, fig.align='center', out.width="100%", echo = FALSE }
+ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g1013f9881e2_0_132")
+```
+
+## Science, and software, as iterative processes
+
+Scientific papers are often arranged as a list of methods and results, building on themselves more or less sequentially.
+Each figure follows from the previous figure or text description, to describe the data that support a hypothesis or illustrate a conclusion in a linear, "story"-like order.
+
+```{r, fig.align='center', out.width="100%", echo = FALSE }
+ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g24731cad425_0_0")
+```
+
+However, the process of _doing_ science is rarely linear.
+It is rarely realistic to run an experiment, write a manuscript, and publish the paper, in that order and with no other complications.
+Usually, one or more of these steps involves some amount of iteration:
+
+- You might do an experiment, then summarize it, then run more experiments based on the results to confirm/test/extend your findings
+- You might do an experiment, write a manuscript, then revise the manuscript based on feedback from other scientists
+- You might submit a manuscript, then a reviewer may request revisions or additional experiments, which will require you to go back and revisit your experimental setup and conclusions
+
+As scientists, we don't generally expect science to be a static, "write once and forget" process.
+
+```{r, fig.align='center', out.width="100%", echo = FALSE }
+ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_16")
+```
+
+The same idea applies to developing research software!
+Occasionally, you might write a script or program for a scientific study and use it exactly once, for a single well-defined purpose.
+More often, you'll write a script (or join several of them together in a more complex pipeline) and reuse it, possibly with changes or extensions as the project progresses.
+
+```{r, fig.align='center', out.width="100%", echo = FALSE }
+ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_46")
+```
+
+In this course, through the lens of automation, we hope to familiarize you with some of the skills necessary to think about research software in an iterative way, from the beginning of a research project.
+It is almost always easier to put good software development practices in place proactively, before a project has matured, than to retrofit them once the project has become complex enough to need them.
+
+Although software development is not generally rewarded directly in academia, it turns out that writing good software does have less obvious rewards, even within the traditional academic structure.
+For example, software that is easy to install tends to be cited more often [@mangul2019], and software that is more consistently maintained tends to be more accurate [@gardner2022].
+
+## Software complexity as a spectrum
+
+Not all software is complex, and not all software requires complex infrastructure (or automation, for that matter)!
+It can be useful to scale the software engineering infrastructure for a project in proportion to the complexity of the software itself:
+
+- Simple software (math, data transformations, procedural/rule-based scripts) requires simpler infrastructure.
+- Complex software (e.g. "pipelines" composed of many commands/software packages chained together, "libraries" that are intended to be reused in many different applications) requires more complex infrastructure, to check assumptions and test reproducibility at each step.
+
+```{r, fig.align='center', out.width="100%", echo = FALSE }
+ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_153")
+```
+
+We will take a closer look at two concrete examples, one on each end of the software complexity spectrum, in the next section.
+
+## Examples
+
+### t-test
+
+Imagine that you have two samples (lists/arrays of numbers) and you want to test whether their means differ by more than you would expect from chance alone.
+This is the setup for a two-sample _t_-test.
+_t_-tests are implemented as standard functions in R (`t.test()` in the built-in stats package) and Python (`scipy.stats.ttest_ind()` in SciPy), as well as in most other commonly used programming languages.
+
+```{r, fig.align='center', out.width="100%", echo = FALSE }
+ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_122")
+```
+
+In both R and Python, a _t_-test is a very well-defined, specific function that takes two lists of numbers and returns the _t_-statistic and _p_-value.
+Since this is a part of a standard, widely used library in each language, it is already tested as part of those libraries.
+In your own software, you might need to do some verification of your input (for instance, what happens if you pass an empty list of numbers?) but probably not too much, since you can be fairly confident that the _t_-test function does what it is documented to do in the programming language you choose to use.
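+
+As a minimal sketch of that kind of input check in R, you could validate the two samples before handing them to the built-in `t.test()`. The `compare_means()` helper and the example numbers below are hypothetical and not part of any library:
+
+```r
+# Hypothetical helper: validate the inputs, then delegate to the built-in t.test().
+compare_means <- function(x, y) {
+  # Reject inputs that cannot give a meaningful test result.
+  stopifnot(is.numeric(x), is.numeric(y))
+  if (length(x) < 2 || length(y) < 2) {
+    stop("each sample needs at least two observations")
+  }
+  result <- t.test(x, y)  # Welch's two-sample t-test by default
+  # Return just the t-statistic and the p-value as a named numeric vector.
+  c(t = unname(result$statistic), p = result$p.value)
+}
+
+# Made-up example data, for illustration only.
+control <- c(5.1, 4.9, 5.6, 5.8, 5.2)
+treated <- c(6.3, 6.1, 5.9, 6.8, 6.4)
+compare_means(control, treated)
+```
+
+By default, `t.test()` runs Welch's version of the test, which does not assume equal variances; it accepts `var.equal = TRUE` if the classic Student's version is what you need.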
+
+### Sequencing analysis
+
+Imagine that you have a list of reads from a sequencing machine, and you want to use these data to answer a biological question, or to make a plot/visualization to communicate a biological insight.
+This is a much less well-defined problem than our previous example, with many more independently operating components, and many more subjective decisions that a researcher must make along the way.
+
+```{r, fig.align='center', out.width="100%", echo = FALSE }
+ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_152")
+```
+
+Most sequencing analyses require multiple steps (i.e. different programs or scripts), and generate multiple intermediate files (e.g. read counts, normalized counts, quality information) that can be checked to verify that the pipeline is proceeding as expected.
+Sequencing analyses can also take hours or days to run, whereas the _t_-test example runs almost instantaneously.
+All of these potential axes of variation mean that 1) the set of steps that need to take place is more complex than in our previous example, and 2) having infrastructure to check that everything is running smoothly will be more beneficial than in our previous example.
+This is especially true in the case where the sequencing pipeline needs to be rerun or modified to generate different output.
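+
+One way to picture such a check on an intermediate file is a small function that stops the pipeline when a read-count table looks wrong. This is only a sketch: the `check_counts()` name, the specific checks, and the toy matrix are assumptions made for illustration, and a real pipeline would pick checks that match its own data:
+
+```r
+# Hypothetical check for one intermediate result: a genes-by-samples matrix of read counts.
+check_counts <- function(counts) {
+  stopifnot(
+    is.matrix(counts), is.numeric(counts),  # expected shape and type
+    !anyNA(counts),                         # no missing values
+    all(counts >= 0),                       # counts cannot be negative
+    all(colSums(counts) > 0)                # every sample has at least one read
+  )
+  invisible(counts)
+}
+
+# Made-up toy data standing in for a real read-count file.
+counts <- matrix(
+  c(10, 0, 3, 25, 7, 1),
+  nrow = 3,
+  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
+)
+check_counts(counts)  # passes silently; a failed check stops with an error
+```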
+
+## Automation for scientific software
+
+Good software practices do not necessarily have to rely on automation.
+However, as projects and infrastructure become more complex, it can be unwieldy to check and revise your results without some sort of automated process that kicks off the necessary steps with little human intervention.
+Steps that are involved might include: rerunning the software itself (often on new or modified input data), software testing, code style linting, rebuilding figures or processed datasets, and so on.
+Each of these can be individually run by hand, or combined in a central script that runs all the steps in order or in parallel, which can also be triggered manually.
+Such a central script can itself be considered a form of automation.
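+
+A central script of this kind can be quite small. The sketch below assumes three hypothetical step scripts under `scripts/`; the file names and the `run_steps()` helper are illustrative only:
+
+```r
+# run_all.R: hypothetical central script that runs each step in order.
+steps <- c(
+  "scripts/01_run_analysis.R",   # rerun the analysis on the current input data
+  "scripts/02_run_tests.R",      # run the software tests
+  "scripts/03_build_figures.R"   # rebuild figures from the fresh results
+)
+
+run_steps <- function(steps) {
+  for (step in steps) {
+    if (!file.exists(step)) {
+      stop("missing step script: ", step)
+    }
+    message("Running ", step)
+    # Evaluate each script in its own environment so that steps cannot
+    # silently depend on variables left behind by earlier steps.
+    source(step, local = new.env())
+  }
+  invisible(TRUE)
+}
+
+# Only run when all of the (hypothetical) step scripts are present.
+if (all(file.exists(steps))) {
+  run_steps(steps)
+}
+```
+
+Running `Rscript run_all.R` by hand would then execute every step in order, stopping at the first one that fails.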
+
+Automation via GitHub Actions, in contrast, can provide a "single point of truth": a single central script to run these steps, and a single set of (automated) criteria for when to run them.
+This eliminates the need for you to remember to run tests, or to clean up your code, or to rebuild figures, or to kick off similar standard processes or commands on your own.
+
+```{r, fig.align='center', out.width="100%", echo = FALSE }
+ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_55")
+```
+
+Later in the course, we will talk more specifically about what exactly automation via continuous integration looks like, and go into more depth as to its uses and benefits.
diff --git a/book.bib b/book.bib
@@ -38,3 +38,25 @@ @Book{Xie2020
   note = {ISBN 9780367563837},
   url = {https://bookdown.org/yihui/rmarkdown-cookbook},
 }
+
+@article{mangul2019,
+  title = {Challenges and recommendations to improve the installability and archival stability of omics computational tools},
+  author = {Mangul, Serghei and Mosqueiro, Thiago and Abdill, Richard J and Duong, Dat and Mitchell, Keith and Sarwal, Varuni and Hill, Brian and Brito, Jaqueline and Littman, Russell Jared and Statz, Benjamin and others},
+  journal = {PLoS Biology},
+  volume = {17},
+  number = {6},
+  pages = {e3000333},
+  year = {2019},
+  publisher = {Public Library of Science}
+}
+
+@article{gardner2022,
+  title = {Sustained software development, not number of citations or journal choice, is indicative of accurate bioinformatic software},
+  author = {Gardner, Paul P and Paterson, James M and McGimpsey, Stephanie and Ashari-Ghomi, Fatemeh and Umu, Sinan U and Pawlik, Aleksandra and Gavryushkin, Alex and Black, Michael A},
+  journal = {Genome Biology},
+  volume = {23},
+  number = {1},
+  pages = {1--13},
+  year = {2022},
+  publisher = {BioMed Central}
+}