
Programming practices chapter (now on a branch, rather than a fork) #14

Merged
merged 15 commits into from
Nov 30, 2023
104 changes: 103 additions & 1 deletion 02-programming-practices.Rmd
@@ -1,2 +1,104 @@
# Scientific software development best practices

# Programming Practices
## Learning Objectives

```{r, fig.align='center', out.width="100%", echo = FALSE }
ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g1013f9881e2_0_132")
```

## Science, and software, as iterative processes

Scientific papers are often arranged as a list of methods and results, building on themselves more or less sequentially.
Each figure follows from the previous figure or text description, to describe the data that support a hypothesis or illustrate a conclusion in a linear, "story"-like order.

```{r, fig.align='center', out.width="100%", echo = FALSE }
ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g24731cad425_0_0")
```

However, the modern process of _doing_ science is itself rarely linear.
It is rarely realistic to run an experiment, write a manuscript, and publish the paper, in that order and with no other complications.
Usually, one or more of these steps involves some amount of iteration:

- You might do an experiment, then summarize it, then run more experiments based on the results to confirm/test/extend your findings
- You might do an experiment, write a manuscript, then revise the manuscript based on feedback from other scientists
- You might submit a manuscript, then a reviewer may request revisions or additional experiments, which will require you to go back and revisit your experimental setup and conclusions

As scientists, we don't generally expect science to be a static, "write once and forget" process.

```{r, fig.align='center', out.width="100%", echo = FALSE }
ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_16")
```

The same idea applies to developing research software!
Rarely, you might be able to write a script or program for a scientific study and use it once, for a single well-defined purpose.
But more often, you’ll write a script (or join several of them together in a more complex pipeline) and reuse it, possibly with changes or extensions as the project progresses.

```{r, fig.align='center', out.width="100%", echo = FALSE }
ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_46")
```

In this course, through the lens of automation, we hope to familiarize you with some of the skills necessary to think about research software in an iterative way, from the beginning of a research project.
It is almost always easier to put good software development practices in place proactively, before a project has matured, rather than adding them on once the project is sufficiently complex that they are necessary.

Although software development is not generally rewarded directly in academia, it turns out that writing good software does have less obvious rewards, even within the traditional academic structure.
For example, software that is easy to install tends to be cited more often [@mangul2019], and software that is more consistently maintained tends to be more accurate [@gardner2022].

## Software complexity as a spectrum

Not all software is complex, and not all software requires complex infrastructure (or automation, for that matter)!
It can be useful to think of the software engineering infrastructure a project needs as proportional to the complexity of the software itself:

- Simple software (math, data transformations, procedural/rule-based scripts) requires simpler infrastructure.
- Complex software (e.g. "pipelines" composed of many commands/software packages chained together, "libraries" that are intended to be reused in many different applications) requires more complex infrastructure, to check assumptions and test reproducibility at each step.

```{r, fig.align='center', out.width="100%", echo = FALSE }
ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_153")
```

We will take a closer look at two concrete examples, one on each end of the software complexity spectrum, in the next section.

## Examples

### t-test

Imagine that you have two samples (lists/arrays of numbers) and you want to test whether their means differ by more than would be expected by chance.
This is the setup for a _t_-test.
_t_-tests are implemented in standard functions in R (`t.test`, in the `stats` package that ships with base R) and Python (`scipy.stats.ttest_ind` in SciPy), as well as in most other commonly used programming languages.

```{r, fig.align='center', out.width="100%", echo = FALSE }
ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g287bcb243d2_0_122")
```

In both R and Python, a _t_-test is a very well-defined, specific function that takes two lists of numbers and returns the _t_-statistic and _p_-value.
Since this is a part of a standard, widely used library in each language, it is already tested as part of those libraries.
In your own software, you might need to do some verification of your input (for instance, what happens if you pass an empty list of numbers?) but probably not too much, since you can be fairly confident that the _t_-test function does what it is documented to do in the programming language you choose to use.

Contributor

I'm a little confused by this sentence. Can you explain what you mean here?

Collaborator Author

I think what I'm trying to say is that a t-test (or the function to perform one) has a specific input and output, and is very commonly used in a wide variety of applications. So it's probably okay to assume that it's doing what its documentation says it will do, without verifying it yourself or writing your own tests for it, unless you're doing something really non-standard with it, or making a mistake in your own code (like the empty list example).

This would be in contrast to the sequencing pipeline example, where it's (as far as I know) not practical to have a single well-tested and widely used function that goes from raw reads to a volcano plot - there are a lot of very subjective decisions and data transformations that have to happen to get from point A to point B, and they can't be completely encapsulated or abstracted away, which makes it more important to make sure that each step is doing what you expect it to do.

I guess another way to think about this would be the difference between using a programming language as a calculator or a set of steps (which is more or less where I started, and where I think many people probably start), and using it to build more complex software with more extensive computing/testing requirements. I think as someone gravitates toward the latter, the argument for automation starts to make a bit more sense.

Does that make any sense to you? I guess I'll have to think about how to get all that across more concisely - let me know if you have ideas.

Contributor

Yes. I think this is a good point. Let's just figure out how we can get this point across more concisely.

It sounds to me like you are saying that as the complexity of an analysis increases, so do the decisions and parameters surrounding it. This means it's even more critical to carefully read and consider the documentation of the software you are using to figure out what best fits the goals for your data.

Contributor

Could probably make a toy graph illustrating this idea: the x axis is the complexity of the analysis, and the y axis is the number of decisions and parameters associated with the analysis.

(This idea would need some polishing).

Collaborator Author

Here's a graph I added to the slides (I'll link to it in the part of the text you referenced):

[Image: Screen Shot 2023-11-26 at 4 08 39 PM]

I also toyed with labeling the parts of the graph off the "y = x" line, but I'm not sure if this is helpful or counter-productive: maybe we shouldn't even mention overengineering since it's such an uncommon case in academic software (in my experience at least!)

[Image: Screen Shot 2023-11-26 at 4 08 55 PM]

Let me know what you think.

Contributor

Made some minor edits to these slides but overall I love the concept!!! Great visual.
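
To make the input-verification point from the _t_-test example concrete, here is a minimal sketch of the kind of check you might run on your own data before handing it to a library _t_-test. The helper name `check_samples` and its `min_size` default are assumptions made for illustration, not part of any library. The call to `scipy.stats.ttest_ind` is left commented out so the sketch has no dependencies beyond the standard library.

```python
import math


def check_samples(a, b, min_size=2):
    """Validate two samples before passing them to a library t-test.

    Hypothetical helper: the name and the min_size default are
    illustrative assumptions, not part of SciPy or any other library.
    """
    cleaned = []
    for name, sample in (("a", a), ("b", b)):
        # Convert to floats so non-numeric entries fail early and loudly.
        values = [float(x) for x in sample]
        if len(values) < min_size:
            raise ValueError(
                f"sample {name} needs at least {min_size} values, got {len(values)}"
            )
        if any(math.isnan(v) or math.isinf(v) for v in values):
            raise ValueError(f"sample {name} contains NaN or infinite values")
        cleaned.append(values)
    return cleaned[0], cleaned[1]


# Example usage (requires SciPy, so it is left commented out here):
# from scipy import stats
# a, b = check_samples([5.1, 4.9, 5.6], [6.2, 6.0, 5.8])
# result = stats.ttest_ind(a, b)
# print(result.statistic, result.pvalue)
```

The check covers only mistakes in your own inputs (empty lists, missing values). The statistic itself is still delegated to the well-tested library function.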


### Sequencing analysis

Imagine that you have a list of reads from a sequencing machine, and you want to use these data to answer a biological question, or to make a plot/visualization to communicate a biological insight.
This is a much less well-defined problem than our previous example, with many more independently operating components, and many more subjective decisions that a researcher must make along the way.

```{r, fig.align='center', out.width="100%", echo = FALSE }
ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_152")
```

Most sequencing analyses require multiple steps (i.e. different programs or scripts), and generate multiple intermediate files (e.g. read counts, normalized counts, quality information) that can be checked to verify that the pipeline is proceeding as expected.
Sequencing analyses can also take hours or days to run, whereas the _t_-test example runs effectively instantaneously.
All of these potential axes of variation mean that 1) the set of steps that need to take place is more complex than in our previous example, and 2) having infrastructure to check that everything is running smoothly will be more beneficial than in our previous example.

Contributor

Can we dive into this point a bit more?

For example, instead of saying it is more beneficial explain why it is. Perhaps a list of reasons (feel free to edit and add to what I've written here -- just giving example thoughts):

  1. Analysis steps likely build from previous steps. So finding errors or changes early on in the process can save a great amount of time.
  2. More complex analyses often mean more complex (gray area) type decisions. This may mean more iteration and experiments to figure out what works. Readily reproducible analyses that are solidly built from the ground up will help this process.

Collaborator Author (@jjc2718, Nov 26, 2023)

Ok, I tried adding a bit more detail in 9f45371 - let me know if that's closer to what you had in mind.

This is especially true in the case where the sequencing pipeline needs to be rerun or modified to generate different output.
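
As one illustration of checking an intermediate file, the sketch below sanity-checks a read-count table before later pipeline steps consume it. The table layout (tab-separated, one gene ID column followed by one column per sample) and the function name `check_counts_table` are assumptions made for this example. A real pipeline's intermediate formats will differ.

```python
import csv
import io


def check_counts_table(handle, min_genes=1):
    """Sanity-check a tab-separated read-count table.

    Assumed layout (hypothetical): a header row, then one row per gene
    with a gene ID followed by one non-negative integer count per sample.
    Returns the number of gene rows, or raises ValueError on a problem.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None or len(header) < 2:
        raise ValueError("expected a header with a gene column and at least one sample column")
    n_genes = 0
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
        for value in row[1:]:
            # Counts should be plain non-negative integers.
            if not value.isdigit():
                raise ValueError(f"line {line_no}: count {value!r} is not a non-negative integer")
        n_genes += 1
    if n_genes < min_genes:
        raise ValueError(f"expected at least {min_genes} genes, found {n_genes}")
    return n_genes


example = "gene\tsample1\tsample2\ngeneA\t10\t0\ngeneB\t3\t7\n"
print(check_counts_table(io.StringIO(example)))  # prints 2
```

Running a check like this right after the counting step means a truncated or malformed file is caught in seconds, rather than after hours of downstream computation.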

## Automation for scientific software

Good software practices do not necessarily have to rely on automation.
However, as projects and infrastructure become more complex, it can be unwieldy to check and revise your results without some sort of automated process that kicks off the necessary steps with minimal human intervention.
Steps that are involved might include: rerunning the software itself (often on new or modified input data), software testing, code style linting, rebuilding figures or processed datasets, and so on.
Each of these can be individually run by hand, or combined in a central script that runs all the steps in order or in parallel, which can also be triggered manually.
Such a central script can itself be considered a form of automation.
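
A central script of this kind can be quite small. The sketch below runs a list of steps in order and stops at the first failure. The step names and commands are placeholders (each one just runs a trivial Python command), and `run_steps` is a hypothetical helper name. In a real project the commands would be your own test runner, linter, figure scripts, and so on.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (name, command) pairs in order, stopping at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            raise RuntimeError(f"step {name!r} failed with exit code {result.returncode}")
        completed.append(name)
    return completed


# Placeholder steps (assumptions for illustration): swap in real commands.
STEPS = [
    ("tests", [sys.executable, "-c", "print('running tests')"]),
    ("figures", [sys.executable, "-c", "print('rebuilding figures')"]),
]

if __name__ == "__main__":
    run_steps(STEPS)
```

Stopping at the first failure is a design choice: later steps, such as rebuilding figures, usually should not run on top of a failed test step.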

Automation via GitHub Actions, in contrast, can provide a "single point of truth": a single central script to run these steps, and a single set of (automated) criteria for when to run them.
This eliminates the need for you to remember to run tests, or to clean up your code, or to rebuild figures, or to kick off similar standard processes or commands on your own.

```{r, fig.align='center', out.width="100%", echo = FALSE }
ottrpal::include_slide("https://docs.google.com/presentation/d/1x0Cnk2Wcsg8HYkmXnXo_0PxmYCxAwzVrUQzb8DUDvTA/edit#slide=id.g28fb47c580d_0_55")
```

Later in the course, we will talk more specifically about what exactly automation via continuous integration looks like, and go into more depth as to its uses and benefits.
22 changes: 22 additions & 0 deletions book.bib
@@ -38,3 +38,25 @@ @Book{Xie2020
note = {ISBN 9780367563837},
url = {https://bookdown.org/yihui/rmarkdown-cookbook},
}

@article{mangul2019,
title = {Challenges and recommendations to improve the installability and archival stability of omics computational tools},
author = {Mangul, Serghei and Mosqueiro, Thiago and Abdill, Richard J and Duong, Dat and Mitchell, Keith and Sarwal, Varuni and Hill, Brian and Brito, Jaqueline and Littman, Russell Jared and Statz, Benjamin and others},
journal = {PLoS Biology},
volume = {17},
number = {6},
pages = {e3000333},
year = {2019},
publisher = {Public Library of Science San Francisco, CA USA}
}

@article{gardner2022,
title={Sustained software development, not number of citations or journal choice, is indicative of accurate bioinformatic software},
author={Gardner, Paul P and Paterson, James M and McGimpsey, Stephanie and Ashari-Ghomi, Fatemeh and Umu, Sinan U and Pawlik, Aleksandra and Gavryushkin, Alex and Black, Michael A},
journal={Genome Biology},
volume={23},
number={1},
pages={1--13},
year={2022},
publisher={BioMed Central}
}