From 2d0a821164bd7770beda65460e76d12410f3268a Mon Sep 17 00:00:00 2001 From: Jake Crawford Date: Mon, 16 Oct 2023 16:01:49 -0400 Subject: [PATCH 01/12] start working on programming practices chapter --- 02-programming-practices.Rmd | 38 +++++++++++++++++++++++++++++++++++- 1 file changed, 37 insertions(+), 1 deletion(-) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index b9753c00..2dfa88e9 100644 --- a/02-programming-practices.Rmd +++ b/02-programming-practices.Rmd @@ -1,2 +1,38 @@ +# Scientific software development best practices -# Programming Practices +## Learning Objectives + +In this chapter, we will introduce scientific computational analyses as a form of software development, and ... + +```{r, fig.align='center', out.width="100%", echo = FALSE } +ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g1013f9881e2_0_132") +``` + +## Science, and software, as iterative processes + +Scientific papers are often laid out as a list of methods and/or results, arranged more or less sequentially. +Each figure follows from the previous figure or text description, to linearly describe the data that support a hypothesis or illustrate a conclusion. + +```{r, fig.align='center', out.width="100%", echo = FALSE } +ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g24731cad425_0_0") +``` + +However, the modern process of _doing_ science, itself, is rarely linear. 
+It usually is not realistic to do an experiment, and write a manuscript, and publish the paper, in that order -- usually, there is some amount of iteration involved: +* You might do an experiment, then summarize it, then run more experiments based on the results to confirm/test/extend your findings +* You might do an experiment, write a manuscript, then revise the manuscript based on feedback from other scientists +* You might submit a manuscript, then a reviewer may request revisions or additional experiments, which will require you to go back and revisit your experimental setup and conclusions + +It’s not reasonable to expect it to be a static, "write-once and forget" process! + +```{r, fig.align='center', out.width="100%", echo = FALSE } +ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_16") +``` + +The same idea applies to developing research software. +Rarely, you might be able to write a script for a scientific study and use it once, for a single well-defined purpose. +But more often, you’ll write a script (or several of them joined together in a more complex pipeline) and reuse it, possibly with changes or extensions as the project progresses. 
+ +```{r, fig.align='center', out.width="100%", echo = FALSE } +ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_46") +``` From f7860f76147ccf78ee1b86d94a72cddd6e4b3460 Mon Sep 17 00:00:00 2001 From: Jake Crawford Date: Tue, 17 Oct 2023 10:42:08 -0400 Subject: [PATCH 02/12] outline and spectrum slide --- 02-programming-practices.Rmd | 38 +++++++++++++++++++++++++++++------- 1 file changed, 31 insertions(+), 7 deletions(-) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index 2dfa88e9..60a4b4d0 100644 --- a/02-programming-practices.Rmd +++ b/02-programming-practices.Rmd @@ -10,29 +10,53 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmX ## Science, and software, as iterative processes -Scientific papers are often laid out as a list of methods and/or results, arranged more or less sequentially. -Each figure follows from the previous figure or text description, to linearly describe the data that support a hypothesis or illustrate a conclusion. +Scientific papers are often arranged as a list of methods and/or results, building on itself more or less sequentially. +Each figure follows from the previous figure or text description, to describe the data that support a hypothesis or illustrate a conclusion in a linear, "story"-like order. ```{r, fig.align='center', out.width="100%", echo = FALSE } ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g24731cad425_0_0") ``` However, the modern process of _doing_ science, itself, is rarely linear. 
-It usually is not realistic to do an experiment, and write a manuscript, and publish the paper, in that order -- usually, there is some amount of iteration involved: +It is not realistic to do an experiment, and write a manuscript, and publish the paper, in that order and with no other complications -- usually, there is some amount of iteration involved on one or more of these steps: * You might do an experiment, then summarize it, then run more experiments based on the results to confirm/test/extend your findings * You might do an experiment, write a manuscript, then revise the manuscript based on feedback from other scientists * You might submit a manuscript, then a reviewer may request revisions or additional experiments, which will require you to go back and revisit your experimental setup and conclusions -It’s not reasonable to expect it to be a static, "write-once and forget" process! +As scientists, we don't generally expect science to be a static, "write once and forget" process. ```{r, fig.align='center', out.width="100%", echo = FALSE } ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_16") ``` -The same idea applies to developing research software. -Rarely, you might be able to write a script for a scientific study and use it once, for a single well-defined purpose. -But more often, you’ll write a script (or several of them joined together in a more complex pipeline) and reuse it, possibly with changes or extensions as the project progresses. +The same idea applies to developing research software! +Rarely, you might be able to write a script or program for a scientific study and use it once, for a single well-defined purpose. +But more often, you’ll write a script (or join several of them together in a more complex pipeline) and reuse it, possibly with changes or extensions as the project progresses. 
```{r, fig.align='center', out.width="100%", echo = FALSE } ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_46") ``` + +In this course, through the lens of automation, we hope to familiarize you with some of the skills necessary to think about research software iteratively, from the beginning of a research project. +It is almost always easier to put good software development practices in place proactively, before a project has matured, rather than adding them on once the project is sufficiently complex that they are necessary. + +## Software complexity as a spectrum + +Not all software is complex, and not all software requires complex infrastructure (or automation, for that matter)! +It can be useful to think about the complexity of software engineering infrastructure necessary for a project as scaling proportionally to the complexity of the software itself: +* Simple software (math, data transformations, procedural/rule-based scripts) requires simpler infrastructure. +* Complex software (e.g. "pipelines" composed of many commands/software packages chained together, "libraries" that are intended to be reused in many different applications) requires more complex infrastructure, to check assumptions and test reproducibility at each step. + +```{r, fig.align='center', out.width="100%", echo = FALSE } +ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_153") +``` + +We will take a closer look at two concrete examples, one on each end of the spectrum, in the next section. 
+ +## Examples + +### t-test + +### Sequencing analysis + +## Automation for scientific software From 4eb72ef8e953958cd1b0b935fe7824adaf2c8397 Mon Sep 17 00:00:00 2001 From: Jake Crawford Date: Tue, 17 Oct 2023 11:23:52 -0400 Subject: [PATCH 03/12] add examples --- 02-programming-practices.Rmd | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index 60a4b4d0..c83b890a 100644 --- a/02-programming-practices.Rmd +++ b/02-programming-practices.Rmd @@ -51,12 +51,35 @@ It can be useful to think about the complexity of software engineering infrastru ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_153") ``` -We will take a closer look at two concrete examples, one on each end of the spectrum, in the next section. +We will take a closer look at two concrete examples, one on each end of the software complexity spectrum, in the next section. ## Examples ### t-test +Imagine that you have two sampling distributions (lists/arrays of numbers) and you want to test whether the means of the distributions are statistically equivalent or not. +This is the setup for a _t_-test. +_t_-tests are implemented in standard functions in R (base library) and Python (scipy), as well as most other commonly used programming languages. + +```{r, fig.align='center', out.width="100%", echo = FALSE } +ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_122") +``` + +In both R and Python, a _t_-test is a very well-defined, specific function that takes two lists of numbers and returns the _t_-statistic and _p_-value. +Since this is a part of a standard, widely used library in each language, it is already tested as part of those libraries. 
+In your own software, you might need to do some verification of your input (for instance, what happens if you pass an empty list of numbers?) but probably not too much, since you can be fairly confident that the _t_-test function does what it is documented to do in the programming language you choose to use. + ### Sequencing analysis +Imagine that you have a list of reads from a sequencing machine, and you want to use these data to answer a biological question, or to make a plot/visualization to support a biological insight. +This is a much less well-defined problem than our previous example, with many more independently operating components, and many more subjective decisions that a researcher must make along the way. + +```{r, fig.align='center', out.width="100%", echo = FALSE } +ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_152") +``` + +Most sequencing analyses require multiple steps (i.e. different programs or scripts), and generate multiple intermediate files (e.g. read counts, normalized counts, quality information) that can be checked to verify that the pipeline is proceeding as expected. +Sequencing analyses can also take hours or days to run, as compared to the _t_-test example which runs effectively instantaneously. +All of these potential axes of variation mean that having infrastructure to check that everything is running smoothly is 1) more complex than our previous example, and 2) more necessary than our previous example, especially in the case where the sequencing pipeline needs to be rerun or modified. 
+ ## Automation for scientific software From f72411316d1a5f97d895a187d5c6b2cd721db3a5 Mon Sep 17 00:00:00 2001 From: Jake Crawford Date: Tue, 17 Oct 2023 12:20:10 -0400 Subject: [PATCH 04/12] add automation intro section --- 02-programming-practices.Rmd | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index c83b890a..7086b5ce 100644 --- a/02-programming-practices.Rmd +++ b/02-programming-practices.Rmd @@ -10,7 +10,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmX ## Science, and software, as iterative processes -Scientific papers are often arranged as a list of methods and/or results, building on itself more or less sequentially. +Scientific papers are often arranged as a list of methods and results, building on themselves more or less sequentially. Each figure follows from the previous figure or text description, to describe the data that support a hypothesis or illustrate a conclusion in a linear, "story"-like order. ```{r, fig.align='center', out.width="100%", echo = FALSE } @@ -43,7 +43,7 @@ It is almost always easier to put good software development practices in place p ## Software complexity as a spectrum Not all software is complex, and not all software requires complex infrastructure (or automation, for that matter)! -It can be useful to think about the complexity of software engineering infrastructure necessary for a project as scaling proportionally to the complexity of the software itself: +It can be useful to think about the complexity of software engineering infrastructure necessary for a project proportionally to the complexity of the software itself: * Simple software (math, data transformations, procedural/rule-based scripts) requires simpler infrastructure. * Complex software (e.g. 
"pipelines" composed of many commands/software packages chained together, "libraries" that are intended to be reused in many different applications) requires more complex infrastructure, to check assumptions and test reproducibility at each step. @@ -71,7 +71,7 @@ In your own software, you might need to do some verification of your input (for ### Sequencing analysis -Imagine that you have a list of reads from a sequencing machine, and you want to use these data to answer a biological question, or to make a plot/visualization to support a biological insight. +Imagine that you have a list of reads from a sequencing machine, and you want to use these data to answer a biological question, or to make a plot/visualization to communicate a biological insight. This is a much less well-defined problem than our previous example, with many more independently operating components, and many more subjective decisions that a researcher must make along the way. ```{r, fig.align='center', out.width="100%", echo = FALSE } @@ -80,6 +80,25 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmX Most sequencing analyses require multiple steps (i.e. different programs or scripts), and generate multiple intermediate files (e.g. read counts, normalized counts, quality information) that can be checked to verify that the pipeline is proceeding as expected. Sequencing analyses can also take hours or days to run, as compared to the _t_-test example which runs effectively instantaneously. -All of these potential axes of variation mean that having infrastructure to check that everything is running smoothly is 1) more complex than our previous example, and 2) more necessary than our previous example, especially in the case where the sequencing pipeline needs to be rerun or modified. 
+All of these potential axes of variation mean that having infrastructure to check that everything is running smoothly is 1) more complex than our previous example, and 2) more necessary than our previous example, especially in the case where the sequencing pipeline needs to be rerun or modified to generate different output. ## Automation for scientific software + +Good software practices do not necessarily have to be implemented using automation. +However, as projects and infrastructure become more complex, it can be unwieldy to check and revise your results without some sort of automated process to kick them off automatically, without too much human intervention. +Steps that are involved might include: rerunning the software itself (often on new or modified input data), software testing, code style linting, rebuilding figures or processed datasets, and so on. +Each of these can be individually run by hand, or combined in a central script that runs all the steps in order or in parallel, which can also be triggered manually. +Such a central script can itself be considered a form of automation. + +```{r, fig.align='center', out.width="100%", echo = FALSE } +ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_55") +``` + +Automation via GitHub Actions, in contrast, can provide a "single point of truth": a single central script to run these steps, and a single set of (automated) criteria for when to run them. +This eliminates the need for you to remember to run tests, or to clean up your code, or to rebuild figures, etc. on your own. 
+ +```{r, fig.align='center', out.width="100%", echo = FALSE } +ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_115") +``` + +Later in the course, we will talk more specifically about what exactly automation via continuous integration is, and go into more depth as to its uses and benefits. From 0a81530fb1ce6423a9cf1aaf069dbe8f152b4058 Mon Sep 17 00:00:00 2001 From: Jake Crawford Date: Tue, 17 Oct 2023 14:36:23 -0400 Subject: [PATCH 05/12] add references --- 02-programming-practices.Rmd | 5 +++-- book.bib | 22 ++++++++++++++++++++++ 2 files changed, 25 insertions(+), 2 deletions(-) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index 7086b5ce..14d4dd59 100644 --- a/02-programming-practices.Rmd +++ b/02-programming-practices.Rmd @@ -2,8 +2,6 @@ ## Learning Objectives -In this chapter, we will introduce scientific computational analyses as a form of software development, and ... - ```{r, fig.align='center', out.width="100%", echo = FALSE } ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g1013f9881e2_0_132") ``` @@ -40,6 +38,9 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmX In this course, through the lens of automation, we hope to familiarize you with some of the skills necessary to think about research software iteratively, from the beginning of a research project. It is almost always easier to put good software development practices in place proactively, before a project has matured, rather than adding them on once the project is sufficiently complex that they are necessary. +Although software development is not generally rewarded directly in academia, it turns out that writing good software does have less obvious rewards, even within the traditional academic structure. 
+For example, software that is easy to install tends to be cited more often [@mangul2019], and software that is more consistently maintained tends to be more accurate [@gardner2022]. + ## Software complexity as a spectrum Not all software is complex, and not all software requires complex infrastructure (or automation, for that matter)! diff --git a/book.bib b/book.bib index 0015b3f0..fe58a4ab 100644 --- a/book.bib +++ b/book.bib @@ -38,3 +38,25 @@ @Book{Xie2020 note = {ISBN 9780367563837}, url = {https://bookdown.org/yihui/rmarkdown-cookbook}, } + +@article{mangul2019, + title = {Challenges and recommendations to improve the installability and archival stability of omics computational tools}, + author = {Mangul, Serghei and Mosqueiro, Thiago and Abdill, Richard J and Duong, Dat and Mitchell, Keith and Sarwal, Varuni and Hill, Brian and Brito, Jaqueline and Littman, Russell Jared and Statz, Benjamin and others}, + journal = {PLoS Biology}, + volume = {17}, + number = {6}, + pages = {e3000333}, + year = {2019}, + publisher = {Public Library of Science San Francisco, CA USA} +} + +@article{gardner2022, + title={Sustained software development, not number of citations or journal choice, is indicative of accurate bioinformatic software}, + author={Gardner, Paul P and Paterson, James M and McGimpsey, Stephanie and Ashari-Ghomi, Fatemeh and Umu, Sinan U and Pawlik, Aleksandra and Gavryushkin, Alex and Black, Michael A}, + journal={Genome Biology}, + volume={23}, + number={1}, + pages={1--13}, + year={2022}, + publisher={BioMed Central} +} From 3cb7bc2754543f5fe3f4d3934690e4c4b03ff923 Mon Sep 17 00:00:00 2001 From: Jake Crawford Date: Thu, 19 Oct 2023 11:48:16 -0400 Subject: [PATCH 06/12] consolidate automation slide + edit text --- 02-programming-practices.Rmd | 15 ++++++--------- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index 14d4dd59..9b05bc8b 100644 --- a/02-programming-practices.Rmd +++ 
b/02-programming-practices.Rmd @@ -81,25 +81,22 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmX Most sequencing analyses require multiple steps (i.e. different programs or scripts), and generate multiple intermediate files (e.g. read counts, normalized counts, quality information) that can be checked to verify that the pipeline is proceeding as expected. Sequencing analyses can also take hours or days to run, as compared to the _t_-test example which runs effectively instantaneously. -All of these potential axes of variation mean that having infrastructure to check that everything is running smoothly is 1) more complex than our previous example, and 2) more necessary than our previous example, especially in the case where the sequencing pipeline needs to be rerun or modified to generate different output. +All of these potential axes of variation mean that 1) the set of steps that need to take place is more complex than our previous example, and 2) having infrastructure to check that everything is running smoothly will be more beneficial than our previous example. +This is especially true in the case where the sequencing pipeline needs to be rerun or modified to generate different output. ## Automation for scientific software -Good software practices do not necessarily have to be implemented using automation. +Good software practices do not necessarily have to rely on automation. However, as projects and infrastructure become more complex, it can be unwieldy to check and revise your results without some sort of automated process to kick them off automatically, without too much human intervention. Steps that are involved might include: rerunning the software itself (often on new or modified input data), software testing, code style linting, rebuilding figures or processed datasets, and so on. 
Each of these can be individually run by hand, or combined in a central script that runs all the steps in order or in parallel, which can also be triggered manually. Such a central script can itself be considered a form of automation. -```{r, fig.align='center', out.width="100%", echo = FALSE } -ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_55") -``` - Automation via GitHub Actions, in contrast, can provide a "single point of truth": a single central script to run these steps, and a single set of (automated) criteria for when to run them. -This eliminates the need for you to remember to run tests, or to clean up your code, or to rebuild figures, etc. on your own. +This eliminates the need for you to remember to run tests, or to clean up your code, or to rebuild figures, or to kick off similar standard processes or commands on your own. ```{r, fig.align='center', out.width="100%", echo = FALSE } -ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_115") +ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_55") ``` -Later in the course, we will talk more specifically about what exactly automation via continuous integration is, and go into more depth as to its uses and benefits. +Later in the course, we will talk more specifically about what exactly automation via continuous integration looks like, and go into more depth as to its uses and benefits. 
From c6d48bc734c6743e28744155eed1160dae450e60 Mon Sep 17 00:00:00 2001 From: Jake Crawford Date: Thu, 19 Oct 2023 12:02:33 -0400 Subject: [PATCH 07/12] fix bullets --- 02-programming-practices.Rmd | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index 9b05bc8b..c0d81907 100644 --- a/02-programming-practices.Rmd +++ b/02-programming-practices.Rmd @@ -17,9 +17,10 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmX However, the modern process of _doing_ science, itself, is rarely linear. It is not realistic to do an experiment, and write a manuscript, and publish the paper, in that order and with no other complications -- usually, there is some amount of iteration involved on one or more of these steps: -* You might do an experiment, then summarize it, then run more experiments based on the results to confirm/test/extend your findings -* You might do an experiment, write a manuscript, then revise the manuscript based on feedback from other scientists -* You might submit a manuscript, then a reviewer may request revisions or additional experiments, which will require you to go back and revisit your experimental setup and conclusions + +- You might do an experiment, then summarize it, then run more experiments based on the results to confirm/test/extend your findings +- You might do an experiment, write a manuscript, then revise the manuscript based on feedback from other scientists +- You might submit a manuscript, then a reviewer may request revisions or additional experiments, which will require you to go back and revisit your experimental setup and conclusions As scientists, we don't generally expect science to be a static, "write once and forget" process. 
@@ -35,7 +36,7 @@ But more often, you’ll write a script (or join several of them together in a m ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_46") ``` -In this course, through the lens of automation, we hope to familiarize you with some of the skills necessary to think about research software iteratively, from the beginning of a research project. +In this course, through the lens of automation, we hope to familiarize you with some of the skills necessary to think about research software in an iterative way, from the beginning of a research project. It is almost always easier to put good software development practices in place proactively, before a project has matured, rather than adding them on once the project is sufficiently complex that they are necessary. Although software development is not generally rewarded directly in academia, it turns out that writing good software does have less obvious rewards, even within the traditional academic structure. @@ -45,8 +46,9 @@ For example, software that is easy to install tends to be cited more often [@man Not all software is complex, and not all software requires complex infrastructure (or automation, for that matter)! It can be useful to think about the complexity of software engineering infrastructure necessary for a project proportionally to the complexity of the software itself: -* Simple software (math, data transformations, procedural/rule-based scripts) requires simpler infrastructure. -* Complex software (e.g. "pipelines" composed of many commands/software packages chained together, "libraries" that are intended to be reused in many different applications) requires more complex infrastructure, to check assumptions and test reproducibility at each step. + +- Simple software (math, data transformations, procedural/rule-based scripts) requires simpler infrastructure. +- Complex software (e.g. 
"pipelines" composed of many commands/software packages chained together, "libraries" that are intended to be reused in many different applications) requires more complex infrastructure, to check assumptions and test reproducibility at each step. ```{r, fig.align='center', out.width="100%", echo = FALSE } ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_153") From 092bedfb7b48e415d4081852c62d6f674cbebe8c Mon Sep 17 00:00:00 2001 From: Jake Crawford Date: Sun, 19 Nov 2023 16:10:04 -0500 Subject: [PATCH 08/12] apply changes from code review --- 02-programming-practices.Rmd | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index c0d81907..853eeda2 100644 --- a/02-programming-practices.Rmd +++ b/02-programming-practices.Rmd @@ -6,7 +6,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g1013f9881e2_0_132") ``` -## Science, and software, as iterative processes +## Science and software as iterative processes Scientific papers are often arranged as a list of methods and results, building on themselves more or less sequentially. Each figure follows from the previous figure or text description, to describe the data that support a hypothesis or illustrate a conclusion in a linear, "story"-like order. @@ -62,7 +62,7 @@ We will take a closer look at two concrete examples, one on each end of the soft Imagine that you have two sampling distributions (lists/arrays of numbers) and you want to test whether the means of the distributions are statistically equivalent or not. This is the setup for a _t_-test. -_t_-tests are implemented in standard functions in R (base library) and Python (scipy), as well as most other commonly used programming languages. 
+_t_-tests are implemented in standard functions in R (`base` library) and Python (`scipy`), as well as most other commonly used programming languages. ```{r, fig.align='center', out.width="100%", echo = FALSE } ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_122") @@ -74,7 +74,7 @@ In your own software, you might need to do some verification of your input (for ### Sequencing analysis -Imagine that you have a list of reads from a sequencing machine, and you want to use these data to answer a biological question, or to make a plot/visualization to communicate a biological insight. +Imagine that you have a list of reads from a [DNA sequencing](https://en.wikipedia.org/wiki/DNA_sequencing#High-throughput_methods) machine, and you want to use these data to answer a biological question, or to make a plot/visualization to communicate a biological insight. This is a much less well-defined problem than our previous example, with many more independently operating components, and many more subjective decisions that a researcher must make along the way. ```{r, fig.align='center', out.width="100%", echo = FALSE } @@ -89,16 +89,17 @@ This is especially true in the case where the sequencing pipeline needs to be re ## Automation for scientific software Good software practices do not necessarily have to rely on automation. -However, as projects and infrastructure become more complex, it can be unwieldy to check and revise your results without some sort of automated process to kick them off automatically, without too much human intervention. +However, complex projects can be unwieldy to check and revise in the absence of some sort of automated process to kick them off automatically, without too much human intervention. 
Steps that are involved might include: rerunning the software itself (often on new or modified input data), software testing, code style linting, rebuilding figures or processed datasets, and so on. Each of these can be individually run by hand, or combined in a central script that runs all the steps in order or in parallel, which can also be triggered manually. Such a central script can itself be considered a form of automation. -Automation via GitHub Actions, in contrast, can provide a "single point of truth": a single central script to run these steps, and a single set of (automated) criteria for when to run them. -This eliminates the need for you to remember to run tests, or to clean up your code, or to rebuild figures, or to kick off similar standard processes or commands on your own. +Automation like that of GitHub Actions, in contrast, can provide a "single point of truth": a single central script to run these steps, and a single set of (automated) criteria for when to run them. +This eliminates the need for you to remember to run tests, to clean up your code, to rebuild figures, or to kick off similar standard processes or commands on your own. ```{r, fig.align='center', out.width="100%", echo = FALSE } ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_55") ``` Later in the course, we will talk more specifically about what exactly automation via continuous integration looks like, and go into more depth as to its uses and benefits. 
+ From 44ef964b805c6b8730170f06a5d1017ff67cb851 Mon Sep 17 00:00:00 2001 From: Jake Crawford Date: Sun, 19 Nov 2023 16:19:13 -0500 Subject: [PATCH 09/12] some other review changes --- 02-programming-practices.Rmd | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index 853eeda2..f6bc8544 100644 --- a/02-programming-practices.Rmd +++ b/02-programming-practices.Rmd @@ -37,8 +37,6 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmX ``` In this course, through the lens of automation, we hope to familiarize you with some of the skills necessary to think about research software in an iterative way, from the beginning of a research project. -It is almost always easier to put good software development practices in place proactively, before a project has matured, rather than adding them on once the project is sufficiently complex that they are necessary. - Although software development is not generally rewarded directly in academia, it turns out that writing good software does have less obvious rewards, even within the traditional academic structure. For example, software that is easy to install tends to be cited more often [@mangul2019], and software that is more consistently maintained tends to be more accurate [@gardner2022]. @@ -62,7 +60,7 @@ We will take a closer look at two concrete examples, one on each end of the soft Imagine that you have two sampling distributions (lists/arrays of numbers) and you want to test whether the means of the distributions are statistically equivalent or not. This is the setup for a _t_-test. -_t_-tests are implemented in standard functions in R (`base` library) and Python (`scipy`), as well as most other commonly used programming languages. 
+_t_-tests are implemented in standard functions in R ((`base` library)[https://rdrr.io/r/base/base-package.html]) and Python ((`scipy` library)[https://scipy.org/]), as well as most other commonly used programming languages. ```{r, fig.align='center', out.width="100%", echo = FALSE } ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_122") @@ -90,8 +88,16 @@ This is especially true in the case where the sequencing pipeline needs to be re Good software practices do not necessarily have to rely on automation. However, complex projects can be unwieldy to check and revise in the absence of some sort of automated process to kick them off automatically, without too much human intervention. -Steps that are involved might include: rerunning the software itself (often on new or modified input data), software testing, code style linting, rebuilding figures or processed datasets, and so on. -Each of these can be individually run by hand, or combined in a central script that runs all the steps in order or in parallel, which can also be triggered manually. +Steps that are involved might include: + +- Rerunning the software itself (often on new or modified input data) +- Software testing +- Code style linting +- Rebuilding figures or processed datasets +- And many more! + +Each of these steps could be individually run by hand. +Alternatively, they could be combined in a central script that runs all the steps in order or in parallel, which can also be triggered manually. Such a central script can itself be considered a form of automation. Automation like that of GitHub Actions, in contrast, can provide a "single point of truth": a single central script to run these steps, and a single set of (automated) criteria for when to run them. 
From 0a063b81a6bf1bc89cffd122bf904b6575e60bc0 Mon Sep 17 00:00:00 2001 From: Jake Crawford Date: Sun, 19 Nov 2023 16:46:39 -0500 Subject: [PATCH 10/12] fix links --- 02-programming-practices.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index f6bc8544..dccbb6cf 100644 --- a/02-programming-practices.Rmd +++ b/02-programming-practices.Rmd @@ -60,7 +60,7 @@ We will take a closer look at two concrete examples, one on each end of the soft Imagine that you have two sampling distributions (lists/arrays of numbers) and you want to test whether the means of the distributions are statistically equivalent or not. This is the setup for a _t_-test. -_t_-tests are implemented in standard functions in R ((`base` library)[https://rdrr.io/r/base/base-package.html]) and Python ((`scipy` library)[https://scipy.org/]), as well as most other commonly used programming languages. +_t_-tests are implemented in standard functions in R ([`base` library](https://rdrr.io/r/base/base-package.html)) and Python ([`scipy` library](https://scipy.org/)), as well as most other commonly used programming languages. ```{r, fig.align='center', out.width="100%", echo = FALSE } ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_122") @@ -72,7 +72,7 @@ In your own software, you might need to do some verification of your input (for ### Sequencing analysis -Imagine that you have a list of reads from a (DNA sequencing)[https://en.wikipedia.org/wiki/DNA_sequencing#High-throughput_methods] machine, and you want to use these data to answer a biological question, or to make a plot/visualization to communicate a biological insight. 
+Imagine that you have a list of reads from a [DNA sequencing](https://en.wikipedia.org/wiki/DNA_sequencing#High-throughput_methods) machine, and you want to use these data to answer a biological question, or to make a plot/visualization to communicate a biological insight. This is a much less well-defined problem than our previous example, with many more independently operating components, and many more subjective decisions that a researcher must make along the way. ```{r, fig.align='center', out.width="100%", echo = FALSE } From 9f4537193a2b370f6f2f36f1c1bec4fb590a2dfb Mon Sep 17 00:00:00 2001 From: Jake Crawford Date: Sun, 26 Nov 2023 16:00:27 -0500 Subject: [PATCH 11/12] elaborate on sequencing complexity comments --- 02-programming-practices.Rmd | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index dccbb6cf..661f38b0 100644 --- a/02-programming-practices.Rmd +++ b/02-programming-practices.Rmd @@ -81,8 +81,11 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmX Most sequencing analyses require multiple steps (i.e. different programs or scripts), and generate multiple intermediate files (e.g. read counts, normalized counts, quality information) that can be checked to verify that the pipeline is proceeding as expected. Sequencing analyses can also take hours or days to run, as compared to the _t_-test example which runs effectively instantaneously. -All of these potential axes of variation mean that 1) the set of steps that need to take place is more complex than our previous example, and 2) having infrastructure to check that everything is running smoothly will be more beneficial than our previous example. -This is especially true in the case where the sequencing pipeline needs to be rerun or modified to generate different output. 
+ +This means that: + +- The set of steps that need to take place is more complex than our previous example, and each step in the analysis likely builds from previous steps. Finding errors early in the process can save a lot of time and effort in later steps. +- A longer or more complex set of steps often means there are more ambiguous/"gray area" decisions that need to be made along the way. This usually means more iterations or experiments, to explore what works and what doesn't. Introducing reproducible software practices from the ground up will help to make this exploratory process easier and clearer. ## Automation for scientific software From 89f593facbe58ad536ee99eef841d9c15666a252 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Thu, 30 Nov 2023 13:38:32 -0500 Subject: [PATCH 12/12] add a bit about complex decisions --- 02-programming-practices.Rmd | 1 + 1 file changed, 1 insertion(+) diff --git a/02-programming-practices.Rmd b/02-programming-practices.Rmd index 661f38b0..74ec2aba 100644 --- a/02-programming-practices.Rmd +++ b/02-programming-practices.Rmd @@ -74,6 +74,7 @@ In your own software, you might need to do some verification of your input (for Imagine that you have a list of reads from a [DNA sequencing](https://en.wikipedia.org/wiki/DNA_sequencing#High-throughput_methods) machine, and you want to use these data to answer a biological question, or to make a plot/visualization to communicate a biological insight. This is a much less well-defined problem than our previous example, with many more independently operating components, and many more subjective decisions that a researcher must make along the way. +Complex data analyses mean complex decisions! These decisions are often not cut and dry, and should rely on the scientific context of the data. In other words, analyses are often tailored to reflect the biology (or other science) and/or the experimental goals. 
```{r, fig.align='center', out.width="100%", echo = FALSE } ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_152")