Render toc-less
github-actions[bot] committed Mar 3, 2025
1 parent 0cc9369 commit ab390a7
Showing 148 changed files with 8,169 additions and 1,126 deletions.
34 changes: 18 additions & 16 deletions docs/no_toc/01-intro.md
@@ -3,7 +3,7 @@

# Introduction

In this course we will explore a variety of tools that can assist with data analysis from a broad range of fields. The tools we will cover may take some time to get used to, but the payoff will be immeasurable. Not only are these skills valuable for career advancement, they will also make your work-life easier. The tools will enhance your ability to reproduce your work across similar projects, stay organized, collaborate with others effectively, and more.
In this course, we will explore a variety of tools that can assist with reproducible data analysis from a broad range of fields. The tools we will cover may take some time to get used to, but the payoff will be immeasurable. Not only are these skills valuable for career advancement, they will also make your work-life easier. The tools will enhance your ability to reproduce your work across similar projects, stay organized, collaborate with others effectively, and more. This course was funded as part of a series of courses in the [Training Module for Reproducible Data Science Research project](https://reporter.nih.gov/search/k_pXzn8wfUeEvaWpnzIToA/project-details/10663171).


## Motivation
@@ -17,25 +17,27 @@ This course will help learners to use tools that will make their data analytic w

This course is intended for people conducting data analyses at the level of a graduate student or higher. The course is designed so that the majority of the material is presented in a high-level manner that should be applicable to researchers working in a broad range of areas. The course is centered around the R programming language, a widely used statistical analysis software package.

## Curriculum

The course covers...

<img src="resources/images/01-intro_files/figure-html//1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA_g33bf0789107_101_14.png" alt="For individuals who: Are new to working in R or RStudio, are familiar with R but want to make their projects more organized, transparent, and reproducible, want to learn about making reproducible reports and want to track changes across projects over time with GitHub" width="100%" style="display: block; margin: auto;" />

## Learning Objectives
## Topics covered:

- Implement basic project organization tools:
- Setup and configure RStudio/RStudio projects for data analysis (`here` package and file structure/paths)
- Install and configure `ProjectTemplate` package for formalizing and automating workflows
- Apply the `pointblank` package for validation of tabular data
- Write functions and package them
- Apply the `testthat` package for building software unit tests
- Setup and use Git repositories for version control of code
- Interface with GitHub to share Git repositories for collaboration; execute GitHub-based workflows
- Pull Requests
- Code review
- Issues
- Discussions
This course will cover organization practices, coding practices, tools, and concepts for making your data analyses more reproducible in R.

We will cover important topics such as version control to track changes in documents over time, coding practices to make your code more transparent and to test your code, and methods for sharing your code and data in efficient and clear ways.

<img src="resources/images/01-intro_files/figure-html//1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA_g33bf0789107_101_33.png" alt="Concepts discussed in the Tools for Reproducible Workflows in R course: Why R is a great tool for reproducibility, major practices involved in reproducibility and methods to organize projects, how to use tools in RStudio to make your work more reproducible, How to make reproducible RMarkdown and Quarto reports, code practices to make your code more transparent, version control with GitHub to track changes over time and collaborate with others on projects, how to be transparent about software versions, how to share data and code publicly" width="100%" style="display: block; margin: auto;" />



## Curriculum

The course will cover the basics of configuring your projects to use tools and practices that make your analyses more reproducible.

We will also point to more advanced topics in other resources.

<img src="resources/images/01-intro_files/figure-html//1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA_g33c01fedb5a_0_1.png" alt="Overall Course Learning Objectives. This course will demonstrate how to: 1. Explain best practices for making analyses more reproducible and transparent, 2. Use special features in RStudio for efficiency and reproducibility, 3. Configure and organize projects for data analysis using the here package and the ProjectTemplate package, 4. Create reproducible reports using RMarkdown and Quarto, 5. Write custom functions for reuse of code, 6.Test functions with the testthat package, 7. Setup and use Git and GitHub to track changes over time., 8. Share data and code publicly " width="100%" style="display: block; margin: auto;" />

References will include @gillespie_efficient_2021, @riederer_column_2020, @timbers_data_nodate.

103 changes: 103 additions & 0 deletions docs/no_toc/02-why-R.md
@@ -0,0 +1,103 @@

# R for Reproducibility




## Learning Objectives

Before we jump into the additional tools that can help us work more efficiently and reproducibly in R, it is helpful to first discuss why we should consider R in the first place. After completing this section, you will be able to:

<img src="resources/images/02-why-R_files/figure-html//1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA_g21c5ab757ec_0_0.png" alt="Learning objectives are to be able to: 1. Explain why R can be especially helpful for transparent and reproducible data analyses, 2. Recognize that R has a very active and supportive community and locate access points to that community, 3. Compare R to other similar statistical and data analysis tools and programming languages, 4. Describe the unique benefits of R" width="100%" style="display: block; margin: auto;" />


## Why R

<img src="resources/images/02-why-R_files/figure-html//1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA_g21a84b32106_0_9.png" alt="Why R?" width="100%" style="display: block; margin: auto;" />

[R](https://www.r-project.org/) is a [programming language](https://en.wikipedia.org/wiki/Programming_language) for working with data, performing statistical analyses, and creating plots and graphics. It was developed in 1991 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand [@r_2023; @r_project]. Countless contributors have made R what it is today.

Several aspects of R make it a great option for creating reproducible data analyses.

<img src="resources/images/02-why-R_files/figure-html//1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA_gcf1264c749_0_135.png" alt="Why is R useful for Reproducibility? 1.It is free and open source, 2. The community, 3. It is designed for data wrangling and stats" width="100%" style="display: block; margin: auto;" />

## It is free and open source

The first is that R is free and [open source](https://opensource.com/resources/what-open-source).

<img src="resources/images/02-why-R_files/figure-html//1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA_g21c5ab757ec_1_0.png" alt="Cartoon of parrot saying: What!? R is free!! That's awesome!" width="100%" style="display: block; margin: auto;" />


The term **open source** means that the code is publicly available.
Thus all of the code involved in creating R is actually publicly available! This enables users to check what code is used in a particular **package** (a set of code that allows you to do various things) so that they can modify or build upon the code if they would like to.

In fact, many users create their own R **packages** to share their code with others. There are places such as the Comprehensive R Archive Network ([CRAN](https://cran.r-project.org/)) and elsewhere that allow users to publish their own packages for others to use.

<div class = "dictionary">
- **programming language** - A specified set of notations to tell a computer what to do
- **R** - Programming language for working with data to perform statistical analyses and for creating plots and other graphics
- **open source** - Code is publicly available
- **R package** - A set of code that can be shared between users

</div>

Why are these aspects good for reproducibility?

- Since R is free, it is accessible to anyone. Therefore, anyone could run your code if you shared it with them, without them needing to buy software.
- Since R is open source, if you use packages from others, people can determine what underlying code your code used (if you tell them what version you used - more on that later!).
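
As a minimal sketch of both points, the snippet below prints the source code of a function and records the versions in use. It relies only on functions that ship with R. The choice of `sd` from the `stats` package is just an illustration (an assumption for this example, not something the course prescribes); any function from an installed package could be inspected the same way.

```r
# Because R is open source, the code behind a function can be inspected.
# deparse() turns the function into lines of text that we can print.
sd_source <- deparse(stats::sd)
cat(sd_source, sep = "\n")

# Recording versions tells others exactly which code your analysis used.
r_version <- R.version.string
stats_version <- as.character(packageVersion("stats"))
cat("R:", r_version, "\n")
cat("stats package:", stats_version, "\n")
```

Sharing this kind of version information alongside your code lets a reader look up the same underlying source that your analysis ran on.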

## The community

R has a very rich and active community!

This makes it easier to reach out to others for help, find support, find tutorials, and more.

<img src="resources/images/02-why-R_files/figure-html//1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA_g21c5ab757ec_0_6.png" alt="Cartoon of parrot saying: The R community can support me to learn about R" width="100%" style="display: block; margin: auto;" />



There are several R community groups that are especially helpful:

- [R Ladies](https://rladies.org/) - a support group that is not just for ladies, but is open to anyone who wants to improve their R skills! There are local chapters in many large cities that often have in-person meetings.
- There are lots of useful resources, such as the [R for Data Science book](https://r4ds.had.co.nz/) (written by two developers at Posit, formerly called RStudio, which develops many core R packages), resources and online courses from the [Johns Hopkins Data Science Lab](https://jhudatascience.org/courses.html) including [Open Case Studies](https://www.opencasestudies.org/), resources and workshops from [Data Carpentry](https://datacarpentry.org/), [Dataquest](https://www.dataquest.io/v2/), [DataTrail](https://datatrail-jhu.github.io/DataTrail/) and more!

See this [link](https://jhudatascience.org/intro_to_r/resources.html) for more R resources.

Why is this rich community good for reproducibility?

- Overall your code has a better chance of being more accessible than if it were written in a language that is not open source or that has limited support.
- You can also find support to make sure your code does what you want it to, as well as support to make your code as reproducible as possible.

## Designed for data

R is a statistical programming language, meaning it was designed to help you analyze data. Data analysis is the main focus of the language. This is one of the major advantages of using R over other programming languages that have more general purposes.

Because of this, many people have designed useful packages that are especially relevant to:

1) Dealing with messy data in a systematic and reproducible way to get it into a state that is useful for data analysis
2) Producing statistical analysis of data
3) Creating effective plots of data

Although other options like [SPSS](https://www.ibm.com/products/spss-statistics) and [SAS](https://www.sas.com/) (which are not free!) can also be helpful for statistical analysis, R is especially powerful at getting messy data ready to analyze and for creating useful plots to represent patterns in data. Conveniently, R can do all of these steps in a data project and does not require users to switch between different programs to perform these tasks. R also helps create reports that can demonstrate to collaborators and others exactly how analysis was performed, aiding in the transparency of how the data was used from start to finish.

R can also import data from many different sources that other statistical software can't handle (including scraping data from websites or [PDFs](https://www.adobe.com/acrobat/about-adobe-pdf.html)). This allows users much more flexibility to use data as close to the source as possible. This can enable users to stop copying and pasting data and reduce the risk of human error. If you are interested, see [Open Case Studies](https://www.opencasestudies.org/) for more guidance on importing many different kinds of data.

<img src="resources/images/02-why-R_files/figure-html//1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA_g21c5ab757ec_0_78.png" alt="I created errors copying my data into Excel and spent hours figuring it out later! I’m glad R can help!" width="100%" style="display: block; margin: auto;" />


Why are these design features especially helpful for creating reproducible analyses?

1. It enables users to work with messy data and get it ready for analysis, as opposed to requiring users to use other programs. The `tidyverse`, a suite of very helpful packages, includes many data wrangling packages that make your code especially intuitive for others to read and understand.
1. Users can create effective plots using the same program as for data prep and analysis. The `ggplot2` package is famous for making really effective and customizable plots.
1. It helps create reports that can show the entire data analysis process from importing the data to making plots. `R Markdown` reports are very helpful for this.
1. It is easier to import data closer to the original source, rather than converting files or copying and pasting data, which can result in accidental modifications of the data.
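
To illustrate the idea of keeping import, cleaning, and analysis in one script, here is a small sketch in base R. The data, the column names `site` and `weight_kg`, and the cleaning rules are all hypothetical and exist only for this example. Base R is used so the snippet runs without installing anything; the `tidyverse` packages mentioned above offer alternative ways to write the same steps.

```r
# Hypothetical messy data, embedded as text so the example is self-contained.
# In practice this would usually be a file or URL read directly from the source.
raw_text <- "site,weight_kg
A, 4.2
B,NA
a,5.1
B, 3.9"

# Step 1: import the data with code instead of copying and pasting it.
raw <- read.csv(text = raw_text, strip.white = TRUE)

# Step 2: clean the data in a documented, repeatable way.
clean <- raw
clean$site <- toupper(clean$site)            # make site labels consistent
clean <- clean[!is.na(clean$weight_kg), ]    # drop rows with missing weights

# Step 3: analyze the cleaned data in the same script.
site_means <- aggregate(weight_kg ~ site, data = clean, FUN = mean)
print(site_means)
```

Because every step is written down as code, anyone rerunning the script on the same input gets the same cleaned data and the same summary, with no manual edits to trace.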


## Conclusion

In summary, R can be especially useful if you want to make your data analyses more transparent and reproducible for the following reasons:

1. It is free and open source, meaning that code that you might incorporate in your analyses is accessible to anyone, and others can use your code without needing to buy software.
2. There is a rich R community that can help you make the most out of your code and learn how to write your code in a more reproducible manner.
3. R is particularly powerful for preparing data for analysis and for creating visual representations of data. Beyond being free, these unique benefits make R a particularly good statistical tool.
4. R is specifically designed for analyzing data and supports the entire process, which makes it great for transparently showing how you actually worked with data from start to finish.