From 142df8b19751e8c3796e4b4794e025fe8a290995 Mon Sep 17 00:00:00 2001 From: "github-actions[bot]" Date: Wed, 16 Oct 2024 11:32:54 +0000 Subject: [PATCH] Render bookdown --- docs/09-soil_exploration_module.md | 10 +- docs/404.html | 28 +- docs/About.md | 10 +- docs/about-the-authors.html | 38 +- docs/activity-questions.html | 54 +-- docs/anvil-workspace.html | 28 +- docs/background.html | 28 +- docs/billing.html | 28 +- docs/biodigs-data.html | 28 +- ...klist-for-running-activities-on-anvil.html | 28 +- ...g-credit-for-professional-development.html | 28 +- docs/index.html | 30 +- docs/index.md | 2 +- docs/introduction.html | 313 +++++++++++++++ docs/notes-for-instructors.html | 28 +- docs/part-1.-examining-the-data.html | 377 ++++++++++++++++++ ...-summarizing-the-data-with-statistics.html | 367 +++++++++++++++++ docs/part-3.-visualizing-the-data.html | 334 ++++++++++++++++ docs/reference-keys.txt | 2 +- docs/references.html | 38 +- docs/research-team.html | 28 +- docs/search_index.json | 2 +- docs/setting-up-billing-on-anvil.html | 28 +- docs/setting-up-the-class-activity.html | 28 +- docs/support.html | 28 +- docs/using-rstudio-on-anvil.html | 32 +- 26 files changed, 1668 insertions(+), 277 deletions(-) create mode 100644 docs/introduction.html create mode 100644 docs/part-1.-examining-the-data.html create mode 100644 docs/part-2.-summarizing-the-data-with-statistics.html create mode 100644 docs/part-3.-visualizing-the-data.html diff --git a/docs/09-soil_exploration_module.md b/docs/09-soil_exploration_module.md index 1489ce9..a9d4c8a 100644 --- a/docs/09-soil_exploration_module.md +++ b/docs/09-soil_exploration_module.md @@ -1,9 +1,9 @@ -# (PART\*) Data Exploration {-} +# (PART\*) Student Activity {-} -# Exploring Soil Testing Data With R +# Introduction In this activity, you'll have a chance to become familiar with the BioDIGS soil testing data. This dataset includes information on the inorganic components of each soil sample, particularly metal concentrations. Human activity can increase the concentration of inorganic compounds in the soil. When cars drive on roads, compounds from the exhaust, oil, and other fluids might settle onto the roads and be washed into the soil. When we put salt on roads, parking lots, and sidewalks, the salts themselves will eventually be washed away and enter the ecosystem through both water and soil. Chemicals from factories and other businesses also leech into our environment. All of this means the concentration of heavy metals and other chemicals will vary among the soil samples collected for the BioDIGS project. @@ -24,7 +24,7 @@ This activity will teach you how to use the AnVIL platform to: 1. Create and interpret histograms and boxplots for variables in the soil testing data -## Part 1. Examining the Data +# Part 1. Examining the Data We will use the `BioDIGS` package to retrieve the data. We first need to install the package from where it is stored on GitHub. @@ -161,7 +161,7 @@ QUESTIONS: ::: -## Part 2. Summarizing the Data with Statistics +# Part 2. Summarizing the Data with Statistics Now that we have the dataset loaded, let's explore the data in more depth. @@ -303,7 +303,7 @@ QUESTIONS: ::: -## Part 3. Visualizing the Data +# Part 3. Visualizing the Data Often, it can be easier to immediately interpret data displayed as a plot than as a list of values. 
For example, we can more easily understand how the arsenic concentrations of the soil samples are distributed if we create histograms, compared to looking at point values like the mean, standard deviation, minimum, and maximum.
diff --git a/docs/404.html b/docs/404.html
index 860c893..899156f 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -6,7 +6,7 @@ Page not found | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/About.md b/docs/About.md index 8293d59..932325a 100644 --- a/docs/About.md +++ b/docs/About.md @@ -43,12 +43,12 @@ These credits are based on our [course contributors table guidelines](https://gi ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2024-09-09 +## date 2024-10-16 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date (UTC) lib source -## bookdown 0.39.1 2024-06-11 [1] Github (rstudio/bookdown@f244cf1) +## bookdown 0.40 2024-07-02 [1] CRAN (R 4.3.2) ## bslib 0.6.1 2023-11-28 [1] RSPM (R 4.3.0) ## cachem 1.0.8 2023-05-01 [1] RSPM (R 4.3.0) ## cli 3.6.2 2023-12-11 [1] RSPM (R 4.3.0) @@ -64,7 +64,7 @@ These credits are based on our [course contributors table guidelines](https://gi ## httpuv 1.6.14 2024-01-26 [1] RSPM (R 4.3.0) ## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.3.0) ## jsonlite 1.8.8 2023-12-04 [1] RSPM (R 4.3.0) -## knitr 1.47.3 2024-06-11 [1] Github (yihui/knitr@e1edd34) +## knitr 1.48 2024-07-07 [1] CRAN (R 4.3.2) ## later 1.3.2 2023-12-06 [1] RSPM (R 4.3.0) ## lifecycle 1.0.4 2023-11-07 [1] RSPM (R 4.3.0) ## magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.3.0) @@ -80,7 +80,7 @@ These credits are based on our [course contributors table guidelines](https://gi ## Rcpp 1.0.12 2024-01-09 [1] RSPM (R 4.3.0) ## remotes 2.4.2.1 2023-07-18 [1] RSPM (R 4.3.0) ## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.3.2) -## rmarkdown 2.27.1 2024-06-11 [1] Github (rstudio/rmarkdown@e1c93a9) +## rmarkdown 2.25 2023-09-18 [1] RSPM (R 4.3.0) ## sass 0.4.8 2023-12-06 [1] RSPM (R 4.3.0) ## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.3.0) ## shiny 1.8.0 2023-11-17 [1] RSPM (R 4.3.0) @@ -89,7 +89,7 @@ These credits are based on our [course contributors table guidelines](https://gi ## urlchecker 1.0.1 2021-11-30 [1] RSPM (R 4.3.0) ## usethis 2.2.3 2024-02-19 [1] RSPM (R 4.3.0) ## vctrs 0.6.5 2023-12-01 [1] RSPM (R 4.3.0) -## xfun 0.44.4 2024-06-11 [1] Github (yihui/xfun@9da62cc) +## xfun 0.48 2024-10-03 [1] CRAN (R 4.3.2) ## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.3.0) ## yaml 2.3.8 2023-12-11 [1] RSPM (R 4.3.0) ## diff --git a/docs/about-the-authors.html b/docs/about-the-authors.html index db0e0de..1b2eb2e 100644 --- a/docs/about-the-authors.html +++ b/docs/about-the-authors.html @@ -6,7 +6,7 @@ About the Authors | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    @@ -315,12 +315,12 @@

    About the Authors - Chapter 14 Activity Questions | BioDIGS: Exploring Soil Data + Chapter 17 Activity Questions | BioDIGS: Exploring Soil Data - + - + - + - + - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    @@ -221,10 +221,10 @@

    -
    -

    Chapter 14 Activity Questions

    -
    -

    14.1 Part 1. Examining the Data

    +
    +

    Chapter 17 Activity Questions

    +
    +

    17.1 Part 1. Examining the Data

1. What data is found in the column labeled “Fe_Mehlich3”? Why would we be interested in how much of this is in the soil? (You may have to search the internet for this answer.)

    2. What data is found in the column labeled “Base_Sat_pct”? What does this variable tell us about the soil?

    3. @@ -232,8 +232,8 @@

      14.1 Part 1. Examining the Data

      How many different regions are represented in the soil testing dataset? How many of them have soil testing data available?

    -
    -

    14.2 Part 2. Summarizing the Data with Statistics

    +
    +

    17.2 Part 2. Summarizing the Data with Statistics

    1. All the samples from Baltimore City and Montgomery County were collected from public park land. The parks sampled from Montgomery County were located in suburban and rural areas, compared to the urban parks sampled in Baltimore City. Why might the Montgomery County samples have a lower average arsenic concentration than the samples from Baltimore City?

    2. What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value?

    3. @@ -242,8 +242,8 @@

      14.2 Part 2. Summarizing the Data
4. Calculate both the mean and maximum values for concentrations that were determined using the Mehlich3 test. (HINT: change the string inside ends_with(), as well as the function you call in the summarize statement.) Which of these metals has the highest average and maximum concentrations, and in which region are they found?

    -
    -

    14.3 Part 3. Visualizing the Data

    +
    +

    17.3 Part 3. Visualizing the Data

    1. Create a histogram for iron concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations?

    2. Create a histogram for lead concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations?

    3. @@ -265,7 +265,7 @@

      14.3 Part 3. Visualizing the Data

    - +
    diff --git a/docs/anvil-workspace.html b/docs/anvil-workspace.html index af48702..c25f16b 100644 --- a/docs/anvil-workspace.html +++ b/docs/anvil-workspace.html @@ -6,7 +6,7 @@ Chapter 11 AnVIL Workspace | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/background.html b/docs/background.html index 1f0a24e..8af910f 100644 --- a/docs/background.html +++ b/docs/background.html @@ -6,7 +6,7 @@ Chapter 1 Background | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/billing.html b/docs/billing.html index d6d170c..6d1fa48 100644 --- a/docs/billing.html +++ b/docs/billing.html @@ -6,7 +6,7 @@ Chapter 5 Billing | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/biodigs-data.html b/docs/biodigs-data.html index 2a508c5..e1378d6 100644 --- a/docs/biodigs-data.html +++ b/docs/biodigs-data.html @@ -6,7 +6,7 @@ Chapter 4 BioDIGS Data | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/checklist-for-running-activities-on-anvil.html b/docs/checklist-for-running-activities-on-anvil.html index 8b11d8e..876d2f6 100644 --- a/docs/checklist-for-running-activities-on-anvil.html +++ b/docs/checklist-for-running-activities-on-anvil.html @@ -6,7 +6,7 @@ Chapter 7 Checklist for Running Activities on AnVIL | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/getting-credit-for-professional-development.html b/docs/getting-credit-for-professional-development.html index ea7a9a9..7aceebf 100644 --- a/docs/getting-credit-for-professional-development.html +++ b/docs/getting-credit-for-professional-development.html @@ -6,7 +6,7 @@ Chapter 10 Getting Credit for Professional Development | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/index.html b/docs/index.html index 48b1334..11b4e92 100644 --- a/docs/index.html +++ b/docs/index.html @@ -6,7 +6,7 @@ BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    @@ -223,7 +223,7 @@

    About this Book

    diff --git a/docs/index.md b/docs/index.md index 4f2e188..1f4ecec 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,6 +1,6 @@ --- title: "BioDIGS: Exploring Soil Data" -date: "September 09, 2024" +date: "October 16, 2024" site: bookdown::bookdown_site documentclass: book bibliography: book.bib diff --git a/docs/introduction.html b/docs/introduction.html new file mode 100644 index 0000000..572f81f --- /dev/null +++ b/docs/introduction.html @@ -0,0 +1,313 @@ + + + + + + + Chapter 13 Introduction | BioDIGS: Exploring Soil Data + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +
    + + +
    +
    + +
    + + + + + + + + +
    + +
    +
    +

    Chapter 13 Introduction

    +

In this activity, you’ll have a chance to become familiar with the BioDIGS soil testing data. This dataset includes information on the inorganic components of each soil sample, particularly metal concentrations. Human activity can increase the concentration of inorganic compounds in the soil. When cars drive on roads, compounds from the exhaust, oil, and other fluids might settle onto the roads and be washed into the soil. When we put salt on roads, parking lots, and sidewalks, the salts themselves will eventually be washed away and enter the ecosystem through both water and soil. Chemicals from factories and other businesses also leach into our environment. All of this means the concentration of heavy metals and other chemicals will vary among the soil samples collected for the BioDIGS project.

    +
    +

    13.1 Before You Start

    +

    If you do not already have a Google account that you would like to use for accessing Terra, create one now.

    +

    If you would like to create a Google account that is associated with your non-Gmail, institutional email address, follow these instructions.

    +
    +
    +

    13.2 Objectives

    +

    This activity will teach you how to use the AnVIL platform to:

    +
      +
    1. Open data from an R package
    2. +
    3. Examine objects in R
    4. +
    5. Calculate summary statistics for variables in the soil testing data
    6. +
    7. Create and interpret histograms and boxplots for variables in the soil testing data
    8. +
    +
    +
    +
    +
    + +
    +
    + +
    +
    +
    + + +
    +
    + + + + + + + + + + + + + diff --git a/docs/notes-for-instructors.html b/docs/notes-for-instructors.html index 952d801..5888956 100644 --- a/docs/notes-for-instructors.html +++ b/docs/notes-for-instructors.html @@ -6,7 +6,7 @@ Chapter 6 Notes for Instructors | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/part-1.-examining-the-data.html b/docs/part-1.-examining-the-data.html new file mode 100644 index 0000000..bded8f7 --- /dev/null +++ b/docs/part-1.-examining-the-data.html @@ -0,0 +1,377 @@ + + + + + + + Chapter 14 Part 1. Examining the Data | BioDIGS: Exploring Soil Data + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +
    + + +
    +
    + +
    + + + + + + + + +
    + +
    +
    +

    Chapter 14 Part 1. Examining the Data

    +

    We will use the BioDIGS package to retrieve the data. We first need to install the package from where it is stored on GitHub.

    +
    devtools::install_github("fhdsl/BioDIGSData")
    +
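If the devtools package isn’t already installed in your RStudio environment, you may first need to install it from CRAN (a one-time setup step that isn’t shown in the original activity):

install.packages("devtools")   # provides install_github(), used in the command above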

    Once you’ve installed the package, we can load the library and assign the soil testing data to an object. This command follows the code structure:

    +

    dataset_object_name <- stored_BioDIGS_dataset

    +
    library(BioDIGSData)
    +
    +soil.values <- BioDIGS_soil_data()
    +

It seems like the dataset loaded, but it’s always a good idea to verify. There are many ways to check, but the easiest approach (if you’re using RStudio) is to look at the Environment tab on the upper right-hand side of the screen. You should now have an object called soil.values that includes some number of observations for 28 variables. The observations refer to the number of rows in the dataset, while the variables tell you the number of columns. As long as neither of these is 0, you can be confident that your dataset loaded.
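If you prefer to check from the R console instead of the Environment tab, here is a quick sketch using base R (assuming the soil.values object created above):

dim(soil.values)    # number of rows (observations) and columns (variables)
head(soil.values)   # prints the first few rows of the dataset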

    +

    If the dataset loaded, you will see an object with non-zero observations and variables in the Environment tab

    +

Let’s take a quick look at the dataset. We can do this by clicking on the soil.values object in the Environment tab. (Note: this is equivalent to typing View(soil.values) in the R console.)

    +

    This will open a new window for us to scroll through the dataset.

    +

    You can click on the object in the Environment tab to open a new window that allows you to scroll through the loaded dataset

    +

    Well, the data definitely loaded, but those column names aren’t immediately understandable. What could As_EPA3051 possibly mean? In addition to the dataset, we need to load the data dictionary as well.

    +
    +

Data dictionary: a file containing the names, definitions, and attributes of the data in a database or dataset.

    +
    +

    In this case, the data dictionary can help us make sense of what sort of values each column represents. The data dictionary for the BioDIGS soil testing data is available in the R package (see code below), but we have also reproduced it here.

    +
    ?BioDIGS_soil_data()
    +
    +
      +
    • site_id Unique letter and number site name
    • +
    • full_name Full site name
    • +
    • As_EPA3051 Arsenic (mg/kg), EPA Method 3051A. Quantities < 3.0 are not detectable.
    • +
    • Cd_EPA3051 Cadmium (mg/kg), EPA Method 3051A. Quantities < 0.2 are not detectable.
    • +
    • Cr_EPA3051 Chromium (mg/kg), EPA Method 3051A
    • +
    • Cu_EPA3051 Copper (mg/kg), EPA Method 3051A
    • +
    • Ni_EPA3051 Nickel (mg/kg), EPA Method 3051A
    • +
    • Pb_EPA3051 Lead (mg/kg), EPA Method 3051A
    • +
    • Zn_EPA3051 Zinc (mg/kg), EPA Method 3051A
    • +
    • water_pH
    • +
    • A-E_Buffer_pH
    • +
    • OM_by_LOI_pct Organic Matter by Loss on Ignition
    • +
    • P_Mehlich3 Phosphorus (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • K_Mehlich3 Potassium (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • Ca_Mehlich3 Calcium (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • Mg_Mehlich3 Magnesium (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • Mn_Mehlich3 Manganese (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • Zn_Mehlich3 Zinc (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • Cu_Mehlich3 Copper (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • Fe_Mehlich3 Iron (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • B_Mehlich3 Boron (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • S_Mehlich3 Sulfur (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • Na_Mehlich3 Sodium (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • Al_Mehlich3 Aluminum (mg/kg), using the Mehlich 3 soil test extractant
    • +
    • Est_CEC Cation Exchange Capacity (meq/100g) at pH 7.0 (CEC)
    • +
    • Base_Sat_pct Base saturation (BS). This represents the percentage of CEC occupied by bases (Ca2+, Mg2+, K+, and Na+). The %BS increases with increasing soil pH. The availability of Ca2+, Mg2+, and K+ increases with increasing %BS.
    • +
• P_Sat_ratio Phosphorus saturation ratio. This is the ratio between the amount of phosphorus present in the soil and the total capacity of that soil to retain phosphorus. The ability of phosphorus to be bound in the soil is primarily a function of the iron (Fe) and aluminum (Al) content in that soil.
    • +
    +
    +

    Using the data dictionary, we find that the values in column As_EPA3051 give us the arsenic concentration in mg/kg of each soil sample, as determined by EPA Method 3051A. This method uses a combination of heat and acid to extract specific elements (like arsenic, cadmium, chromium, copper, nickel, lead, and zinc) from soil samples.

    +

    While arsenic can occur naturally in soils, higher levels suggest the soil may have been contaminated by mining, hazardous waste, or pesticide application. Arsenic is toxic to humans.

    +
    +

    QUESTIONS:

    +
      +
1. What data is found in the column labeled “Fe_Mehlich3”? Why would we be interested in how much of this is in the soil? (You may have to search the internet for this answer.)

    2. +
    3. What data is found in the column labeled “Base_Sat_pct”? What does this variable tell us about the soil?

    4. +
    +
    +

We can also look at just the names of all the columns in the R console using the colnames() command.

    +
    colnames(soil.values)
    +
    ##  [1] "site_id"       "site_name"     "type"          "As_EPA3051"   
    +##  [5] "Cd_EPA3051"    "Cr_EPA3051"    "Cu_EPA3051"    "Ni_EPA3051"   
    +##  [9] "Pb_EPA3051"    "Zn_EPA3051"    "water_pH"      "OM_by_LOI_pct"
    +## [13] "P_Mehlich3"    "K_Mehlich3"    "Ca_Mehlich3"   "Mg_Mehlich3"  
    +## [17] "Mn_Mehlich3"   "Zn_Mehlich3"   "Cu_Mehlich3"   "Fe_Mehlich3"  
    +## [21] "B_Mehlich3"    "S_Mehlich3"    "Na_Mehlich3"   "Al_Mehlich3"  
    +## [25] "Est_CEC"       "Base_Sat_pct"  "P_Sat_ratio"   "region"
    +
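If you would also like to see the type of each column along with its name, one option is base R’s str() function (an optional aside, not part of the original activity):

str(soil.values)    # shows each column's name, type, and first few values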

Most of the column names are found in the data dictionary, but the very last column (“region”) isn’t. How peculiar! Let’s look at what sort of values this particular column contains. The tab with the table of the soil.values object should still be open in the upper left pane of the RStudio window. If not, you can open it again by clicking on soil.values in the Environment pane, or by using the View() command.

    +
    View(soil.values)
    +

    Switch to the soil.values tab to look at what values are in the region column

    +

If you scroll to the end of the table, you can see that “region” seems to refer to the city or area where the samples were collected. For example, the first 6 samples all come from Baltimore City.
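One way to check this from the console (a sketch, assuming the same soil.values object) is to tabulate the region column:

table(soil.values$region)    # counts how many samples come from each region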

    +

    We can see the first samples in the dataset were collected in Baltimore City

    +

    You may notice that some cells in the soil.values table contain NA. This just means that the soil testing data for that sample isn’t available yet. We’ll take care of those values in the next part.
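If you’re curious how many samples are still waiting on soil testing data, a quick sketch that counts the NA values in one of the testing columns (using As_EPA3051 as the example):

sum(is.na(soil.values$As_EPA3051))    # number of samples with no arsenic measurement yet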

    +
    +

    QUESTIONS:

    +
      +
1. How many observations are in the soil testing values dataset that you loaded? What does each of these observations refer to?

    2. +
    3. How many different regions are represented in the soil testing dataset? How many of them have soil testing data available?

    4. +
    +
    +
    +
    +
    + +
    +
    + +
    +
    +
    + + +
    +
    + + + + + + + + + + + + + diff --git a/docs/part-2.-summarizing-the-data-with-statistics.html b/docs/part-2.-summarizing-the-data-with-statistics.html new file mode 100644 index 0000000..2d3b405 --- /dev/null +++ b/docs/part-2.-summarizing-the-data-with-statistics.html @@ -0,0 +1,367 @@ + + + + + + + Chapter 15 Part 2. Summarizing the Data with Statistics | BioDIGS: Exploring Soil Data + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +
    + + +
    +
    + +
    + + + + + + + + +
    + +
    +
    +

    Chapter 15 Part 2. Summarizing the Data with Statistics

    +

    Now that we have the dataset loaded, let’s explore the data in more depth.

    +

    First, we should remove those samples that don’t have soil testing data yet. We could keep them in the dataset, but removing them at this stage will make the analysis a little cleaner. In this case, as we know the reason the data are missing (and that reason will not skew our analysis), we can safely remove these samples. This will not be the case for every data analysis.

    +

We can remove the unanalyzed samples using the drop_na() function from the tidyr package. This function removes any rows from a table that contain NA in a particular column. This command follows the code structure:

    +

    dataset_new_name <- dataset %>% drop_na(column_name)

    +

The %>% is called a pipe, and it tells R that the commands after it should all be applied to the object in front of it. (In this case, we can filter out all samples missing a value for “As_EPA3051” as a proxy for samples without soil testing data.)

    +
    library(tidyr)
    +
    +soil.values.clean <- soil.values %>% drop_na(As_EPA3051)
    +

    Great! Now let’s calculate some basic statistics. For example, we might want to know what the mean (average) arsenic concentration is for all the soil samples. We can use a combination of two functions: pull() and mean(). pull() lets you extract a column from your table for statistical analysis, while mean() calculates the average value for the extracted column.

    +

    This command follows the code structure:

    +

    OBJECT %>% pull(column_name) %>% mean()

    +

pull() is a command from the dplyr package, which is part of the tidyverse, so we’ll need to load the tidyverse library before our command.

    +
    library(tidyverse)
    +
    +soil.values.clean %>% pull(As_EPA3051) %>% mean()
    +
    ## [1] 5.10875
    +

    We can run similar commands to calculate the standard deviation (sd), minimum (min), and maximum (max) for the soil arsenic values.

    +
    soil.values.clean %>% pull(As_EPA3051) %>% sd()
    +
    ## [1] 5.606926
    +
    soil.values.clean %>% pull(As_EPA3051) %>% min()
    +
    ## [1] 0
    +
    soil.values.clean %>% pull(As_EPA3051) %>% max()
    +
    ## [1] 27.3
    +
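As an aside (not part of the original activity), base R’s summary() function reports several of these statistics at once:

soil.values.clean %>% pull(As_EPA3051) %>% summary()
# returns the minimum, first quartile, median, mean, third quartile, and maximum in one call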

    The soil testing dataset contains samples from multiple geographic regions, so maybe it’s more meaningful to find out what the average arsenic values are for each region. We have to do a little bit of clever coding trickery for this using the group_by and summarize functions. First, we tell R to split our dataset up by a particular column (in this case, region) using the group_by function, then we tell R to summarize the mean arsenic concentration for each group.

    +

    When using the summarize function, we tell R to make a new table (technically, a tibble in R) that contains two columns: the column used to group the data and the statistical measure we calculated for each group.

    +

    This command follows the code structure:

    +

dataset %>%
    group_by(column_name) %>%
    summarize(mean(column_name))

    +
    soil.values.clean %>%
    +    group_by(region) %>%
    +    summarize(mean(As_EPA3051))
    +
    ## # A tibble: 2 × 2
    +##   region            `mean(As_EPA3051)`
    +##   <chr>                          <dbl>
    +## 1 Baltimore City                  5.56
    +## 2 Montgomery County               4.66
    +
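One optional refinement (an assumption on our part, not shown in the activity): you can name the new column inside summarize() so the output header is easier to read:

soil.values.clean %>%
    group_by(region) %>%
    summarize(mean_As = mean(As_EPA3051))    # the summary column is now labeled mean_As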

    Now we know that the mean arsenic concentration might be different for each region. If we compare the samples from Baltimore City and Montgomery County, the Baltimore City samples appear to have a higher mean arsenic concentration than the Montgomery County samples.

    +
    +

    QUESTIONS:

    +
      +
    1. All the samples from Baltimore City and Montgomery County were collected from public park land. The parks sampled from Montgomery County were located in suburban and rural areas, compared to the urban parks sampled in Baltimore City. Why might the Montgomery County samples have a lower average arsenic concentration than the samples from Baltimore City?

    2. +
    3. What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value?

    4. +
    5. Calculate the mean iron concentration by region. Which region has the highest mean iron concentration? What about the lowest?

    6. +
    +
    +

Let’s say we’re interested in looking at mean concentrations that were determined using EPA Method 3051. Given that there are seven of these measures in the soil.values dataset, it would be time consuming to run our code from above for each individual measure.

    +

We can add two helper functions to our summarize statement to calculate statistical measures for multiple columns at once: across(), which tells R to apply the calculation to multiple columns, and ends_with(), which tells R which columns should be included in the statistical calculation.

    +

    We are using ends_with because for this question, all the columns that we’re interested in end with the string ‘EPA3051’.

    +

    This command follows the code structure:

    +

dataset %>%
    group_by(column_name) %>%
    summarize(across(ends_with(common_column_name_ending), mean))

    +
    soil.values.clean %>%
    +    group_by(region) %>%
    +    summarize(across(ends_with('EPA3051'), mean))
    +
    ## # A tibble: 2 × 8
    +##   region       As_EPA3051 Cd_EPA3051 Cr_EPA3051 Cu_EPA3051 Ni_EPA3051 Pb_EPA3051
    +##   <chr>             <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
    +## 1 Baltimore C…       5.56      0.359       34.5       35.0       17.4       67.2
    +## 2 Montgomery …       4.66      0.402       29.9       24.3       23.4       38.7
    +## # ℹ 1 more variable: Zn_EPA3051 <dbl>
    +

    This is a much more efficient way to calculate statistics.
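If you want more than one statistic at a time, across() also accepts a named list of functions; here is a sketch along those lines (useful for the questions below):

soil.values.clean %>%
    group_by(region) %>%
    summarize(across(ends_with('EPA3051'), list(mean = mean, max = max)))
# produces one mean column and one max column for each EPA3051 measure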

    +
    +

    QUESTIONS:

    +
      +
1. Calculate the maximum values for concentrations that were determined using EPA Method 3051. (HINT: change the function you call in the summarize statement.) Which of these metals has the highest maximum concentration, and in which region is it found?

    2. +
3. Calculate both the mean and maximum values for concentrations that were determined using the Mehlich3 test. (HINT: change the string inside ends_with(), as well as the function you call in the summarize statement.) Which of these metals has the highest average and maximum concentrations, and in which region are they found?

    4. +
    +
    +
    +
    +
    + +
    +
    + +
    +
    +
    + + +
    +
    + + + + + + + + + + + + + diff --git a/docs/part-3.-visualizing-the-data.html b/docs/part-3.-visualizing-the-data.html new file mode 100644 index 0000000..45e64a8 --- /dev/null +++ b/docs/part-3.-visualizing-the-data.html @@ -0,0 +1,334 @@ + + + + + + + Chapter 16 Part 3. Visualizing the Data | BioDIGS: Exploring Soil Data + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    + + + +
    +
    + + +
    +
    + +
    + + + + + + + + +
    + +
    +
    +

    Chapter 16 Part 3. Visualizing the Data

    +

Often, it can be easier to immediately interpret data displayed as a plot than as a list of values. For example, we can more easily understand how the arsenic concentrations of the soil samples are distributed if we create histograms, compared to looking at point values like the mean, standard deviation, minimum, and maximum.

    +

One way to make histograms in R is with the hist() function. This function only requires that we tell R which column of the dataset we want to plot. (However, we also have the option to give the histogram a title and an x-axis label.)

    +

We can again use the pull() command and pipes (%>%) to choose the column we want from the soil.values.clean dataset and make a histogram of its values.

    +

    This combination of commands follows the code structure:

    +

dataset %>%
    pull(column_name) %>%
    hist(main = chart_title, xlab = x_axis_title)

    +
    soil.values.clean %>% 
    +    pull(As_EPA3051) %>% 
    +    hist(main = 'Histogram of Arsenic Concentration', 
    +         xlab ='Concentration in mg/kg' )
    +

    +

    We can see that almost all the soil samples had very low concentrations of arsenic (which is good news for the soil health!). In fact, many of them had arsenic concentrations close to 0, and only one sampling location appears to have high levels of arsenic.
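If the default bins feel too coarse or too fine, hist() also accepts a breaks argument suggesting how many bins to use; here is a sketch (the exact number is a choice, not something the activity specifies):

soil.values.clean %>%
    pull(As_EPA3051) %>%
    hist(breaks = 20,
         main = 'Histogram of Arsenic Concentration',
         xlab = 'Concentration in mg/kg')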

    +

We might also want to graphically compare arsenic concentrations among the geographic regions in our dataset. We can do this by creating boxplots. Boxplots are particularly useful for comparing the median, spread, and overall distribution among multiple groups.

    +

    In R, one way to create a boxplot is using the boxplot() function. We don’t need to use pipes for this command, but instead will specify what columns we want to use from the dataset inside the boxplot() function itself.

    +

    This command follows the code structure:

    +

boxplot(column_we're_plotting ~ grouping_variable,
        data = dataset,
        main = "Title of Graph",
        xlab = "x_axis_title",
        ylab = "y_axis_title")

    +
    boxplot(As_EPA3051 ~ region, 
    +        data = soil.values.clean,
    +        main = "Arsenic Concentration by Geographic Region",
    +        xlab = "Region",
    +        ylab = "Arsenic Concentration in mg/kg")
    +

    +

    By using a boxplot, we can quickly see that, while one sampling site within Baltimore City has a very high concentration of arsenic in the soil, in general there isn’t a difference in arsenic content between Baltimore City and Montgomery County.
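If you want to see exactly which values a boxplot flags as outliers, base R’s boxplot.stats() helper returns them (an optional aside, not part of the original activity):

boxplot.stats(soil.values.clean$As_EPA3051)$out    # arsenic values flagged as outliers for the dataset as a whole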

    +
    +

    QUESTIONS:

    +
      +
    1. Create a histogram for iron concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations?

    2. +
    3. Create a histogram for lead concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations?

    4. +
    5. Look at the maps for iron and lead on the BioDIGS website. Do the boxplots you created make sense, given what you see on these maps? Why or why not?

    6. +
    +
    + +
    +
    +
    + +
    +
    + +
    +
    +
    + + +
    +
    + + + + + + + + + + + + + diff --git a/docs/reference-keys.txt b/docs/reference-keys.txt index eeef067..e062dcb 100644 --- a/docs/reference-keys.txt +++ b/docs/reference-keys.txt @@ -53,7 +53,7 @@ slides launching-rstudio touring-rstudio pausing-rstudio -exploring-soil-testing-data-with-r +introduction before-you-start objectives-1 part-1.-examining-the-data diff --git a/docs/references.html b/docs/references.html index 2f8f0de..4bc7fbf 100644 --- a/docs/references.html +++ b/docs/references.html @@ -4,25 +4,25 @@ - Chapter 15 References | BioDIGS: Exploring Soil Data + Chapter 18 References | BioDIGS: Exploring Soil Data - + - + - + - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    @@ -221,8 +221,8 @@

    -
    -

    Chapter 15 References

    +
    +

    Chapter 18 References


    diff --git a/docs/research-team.html b/docs/research-team.html index 120f8ce..e407b4b 100644 --- a/docs/research-team.html +++ b/docs/research-team.html @@ -6,7 +6,7 @@ Chapter 2 Research Team | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/search_index.json b/docs/search_index.json index 8e92c91..9e3dc03 100644 --- a/docs/search_index.json +++ b/docs/search_index.json @@ -1 +1 @@ -[["index.html", "BioDIGS: Exploring Soil Data About this Book 0.1 Target Audience 0.2 Platform 0.3 Data", " BioDIGS: Exploring Soil Data September 09, 2024 About this Book This is a companion training guide for BioDIGS, a GDSCN project that brings a research experience into the classroom. This activity guides students through exploration of the BioDIGS soil data using the tidyverse in R. Students will learn basic data summarization, visualization, and mapping skills. Visit the BioDIGS (BioDiversity and Informatics for Genomics Scholars) website here for more information about this collaborative, distributed research project, including how you can get involved! The GDSCN (Genomics Data Science Community Network) is a consortium of educators who aim to create a world where researchers, educators, and students from diverse backgrounds are able to fully participate in genomic data science research. You can find more information about its mission and initiatives here. BioDIGS logo 0.1 Target Audience The activities in this guide are written for undergraduate students and beginning graduate students. Some sections require basic understanding of the R programming language, which is indicated at the beginning of the chapter. 0.2 Platform The activities in this guide are demonstrated on NHGRI’s AnVIL cloud computing platform. AnVIL is the preferred computing platform for the GDSCN. However, all of these activities can be done using your personal installation of R or using the online Galaxy portal. 0.3 Data The data generated by the BioDIGS project is available through the BioDIGS website, as well as through an AnVIL workspace. Data about the soil itself as well as soil metal content was generated by the Delaware Soil Testing Program at the University of Delaware. Sequences were generated by the Johns Hopkins University Genetic Resources Core Facility and by PacBio. "],["background.html", "Chapter 1 Background 1.1 What is genomics? 1.2 What is data science? 1.3 What is cloud computing? 1.4 Why soil microbes? 1.5 Heavy metals and human health", " Chapter 1 Background One critical aspect of an undergraduate STEM education is hands-on research. Undergraduate research experiences enhance what students learn in the classroom as well as increase a student’s interest in pursuing STEM careers (Russell2007?). It can also lead to improved scientific reasoning and increased academic performance overall (Buffalari2020?). However, many students at underresourced institutions like community colleges, Historically Black Colleges and Universities (HBCUs), tribal colleges and universities, and Hispanic-serving institutions have limited access to research opportunities compared to their cohorts at larger four-year colleges and R1 institutions. These students are also more likely to belong to groups that are already under-represented in STEM disciplines, particularly genomics and data science (Canner2017?; GDSCN2022?). The BioDIGS Project aims to be at the intersection of genomics, data science, cloud computing, and education. 1.1 What is genomics? Genomics broadly refers to the study of genomes, which are an organism’s complete set of DNA. This includes both genes and non-coding regions of DNA. Traditional genomics involves sequencing and analyzing the genome of individual species. 
Metagenomics expands genomics to look at the collective genomes of entire communities of organisms in an environmental sample, like soil. It allows researchers to study not just the genes of culturable or isolated organisms, but the entirety of genetic material present in a given environment. By using genomic techniques to survey the soil microbes, we can identify everything in the soil, including microbes that no one has identified before. We are doing both traditional genomics and metagenomics as part of BioDIGS. 1.2 What is data science? Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It includes collecting, cleaning, and combining data from multiple databases, exploring data and developing statistical and machine learning models to identify patterns in complex datasets, and creating tools to efficiently store, process, and access large amounts of data. 1.3 What is cloud computing? Cloud computing just means using the internet to get access to powerful computer resources like storage, servers, databases, networking tools, and specialized software programs. Instead of having to buy and maintain their own powerful computers, storage servers, and other systems, users can pay to use them through an internet connection as needed. Users only pay for what they need, when they actually use it, and professionals update and maintain the systems in large data centers. It is a particularly useful tool for researchers and students at smaller institutions with limited computational services, especially when working with complex databases. The genome assembly and analyses for BioDIGS have been done using the NHGRI AnVIL cloud computing platform, as well as Galaxy. 1.4 Why soil microbes? It can be challenging to include undergraduates in human genomic and health research, especially in a classroom context. Both human genetic data and human health data are protected data, which limits the sort of information students can access without undergoing specialized ethics training. However, the same sorts of data cleaning and analysis methods used for human genomic data are also used for microbial genomic data, which does not have the same sort of legal protections as human genetic data. This makes it ideal for training undergraduate students at the beginning of their careers and can be used to prepare students for future research in human genomics and health (Jurkowski2017?). Additionally, the microbes in the soil can have big impacts on our health (BrevikBurgess2014?). 1.5 Heavy metals and human health Human activities that change the landscape can also change what sorts of inorganic and abiotic compounds we find in the soil, particularly increasing the amount of heavy metals (Yan2020?). When cars drive on roads, compounds from the exhaust, oil, and other fluids might settle onto the roads and be washed into the soil. When we put salt on roads, parking lots, and sidewalks, the salts themselves will eventually be washed away and enter the ecosystem through both water and soil. Chemicals from factories and other businesses also leech into our environment. Previous research has demonstrated that in areas with more human activity, like cities, soils include greater concentrations of heavy metals than found in rural areas with limited human populations (Khan2023?; Wang2022?). 
Increased heavy metal concentrations also disproportionately affect lower-income and predominantly minority areas (Jones2022?). Research suggests that increased heavy metal concentration in soils has major impacts on the soil microbial community. In particular, increased heavy metal concentration is associated with an increase in soil bacteria that have antibiotic resistance markers (Gorovtsov2018?; Nguyen2019?; Sun2021?). "],["research-team.html", "Chapter 2 Research Team 2.1 Soil sampling", " Chapter 2 Research Team This project is coordinated by the Genomics Data Science Community Network (GDSCN). You can read more about the GDSCN and its mission at the network website. 2.1 Soil sampling This map shows the current sampling locations for the BioDIGS project. The extensive network of the GDSCN has made this data collection possible. Soil sampling for this project was done by both faculty and student volunteers from schools that aren’t traditional R1 research institutions. Many of the faculty are also members of the GDSCN. This list of locations reflects GDSCN institutions and friends of GDSCN who have collected soil samples. Annandale, VA: Northern Virginia Community College Atlanta, GA: Spelman College Baltimore, MD: College of Southern Maryland, Notre Dame College of Maryland, Towson University Bismark, ND: United Tribes Technical College El Paso, TX: El Paso Community College, The University of Texas at El Paso Fresno, CA: Clovis Community College Greensboro, NC: North Carolina A&T State University Harrisonburg, VA: James Madison University Honolulu, Hawai’i: University of Hawai’i at Mānoa Las Cruces, NM: Doña Ana Community College Montgomery County, MD: Montgomery College, Towson University Nashville, TN: Meharry Medical College New York, NY: Guttman Community College CUNY Petersburg, VA: Virginia State University Seattle, WA: North Seattle College, Pierce College Tsaile, AZ: Diné College "],["support.html", "Chapter 3 Support 3.1 Funding 3.2 Sponsors 3.3 Analytical and Computational Support", " Chapter 3 Support This project would not be possible without financial and technical support from many organizations and people. 3.1 Funding Funding for this project has been provided by the National Human Genome Research Institute (Contract # 75N92022P00232 awarded to Johns Hopkins University). 3.2 Sponsors PacBio and CosmosID have graciously donated supplies. Advances in Genome Biology and Technology provided funding support for several team members to attend AGBT 2024. 3.3 Analytical and Computational Support Computational support has been provided by NHGRI’s AnVIL cloud computing platform and Galaxy. "],["biodigs-data.html", "Chapter 4 BioDIGS Data 4.1 Sample Metadata 4.2 Soil Testing Data 4.3 Genomics and Metagenomics Data", " Chapter 4 BioDIGS Data There are currently three major kinds of data available from BioDIGS: sample metadata, soil testing data, and genomics and metagenomics data. All of these are available for use in your classroom. 4.1 Sample Metadata This dataset contains information about the samples themselves, including GPS coordinates for the sample location, date the sample was taken, and the site name. This dataset is also available from the BioDIGS website You can also see images of each sampling site and soil characteristics at the sample map. 4.2 Soil Testing Data This dataset includes basic information about the soil itself like pH, percentage of organic matter, variety of soil metal concentrations. The complete data dictionary is available here. 
The dataset is available at the BioDIGS website. This dataset was generated by the Delaware Soil Testing Program at the University of Delaware. 4.3 Genomics and Metagenomics Data You can access this data in both raw and processed forms. The Illumina and Nanopore sequences were generated at the Johns Hopkins University Genetic Resources Core Facility. PacBio sequencing was done by PacBio directly. More information coming soon! "],["billing.html", "Chapter 5 Billing 5.1 Create Google Billing Account 5.2 Add Terra to Google Billing Account 5.3 Add Members to Google Billing Account 5.4 Set Alerts for Google Billing 5.5 View Spend for Google Billing 5.6 Create Terra Billing Project 5.7 Add Member to Terra Billing Project 5.8 Disable Terra Billing Project", " Chapter 5 Billing In order to use AnVIL, you will need to set up a billing account and add members to it. These sections guide you through that process. 5.1 Create Google Billing Account Log in to the Google Cloud Platform console using your Google ID. Make sure to use the same Google account ID you use to log into Terra. If you are a first time user, don’t forget to claim your free credits! If you haven’t been to the console before, once you accept the Terms of Service you will be greeted with an invitation to “Try for Free.” Follow the instructions to sign up for a Billing Account and get your credits. Choose “Individual Account”. This “billing account” is just for managing billing, so you don’t need to be able to add your team members. You will need to give either a credit card or bank account for security. Don’t worry! You won’t be billed until you explicitly turn on automatic billing. You can view and edit your new Billing Account, by selecting “Billing” from the left-hand menu, or going directly to the billing console console.cloud.google.com/billing Clicking on the Billing Account name will allow you to manage the account, including accessing reports, setting alerts, and managing payments and billing. At any point, you can create additional Billing Accounts using the Create Account button. We generally recommend creating a new Billing Account for each funding source. 5.2 Add Terra to Google Billing Account This gives Terra permission to create projects and send charges to the Google Billing Account, and must be done by an administrator of the Google Billing Account. Terra needs to be added as a “Billing Account User”: Log in to the Google Cloud Platform console using your Google ID. Navigate to Billing You may be automatically directed to view a specific Billing Account. If you see information about a single account rather than a list of your Billing Accounts, you can get back to the list by clicking “Manage Billing Accounts” from the drop-down menu. Check the box next to the Billing Account you wish to add Terra to, click “ADD MEMBER”. Enter terra-billing@terra.bio in the text box. In the drop-down menu, mouse over Billing, then choose “Billing Account User”. Click “SAVE”. 5.3 Add Members to Google Billing Account Anyone you wish to add to the Billing Account will need their own Google ID. To add a member to a Billing Project: Log in to the Google Cloud Platform console using your Google ID. Navigate to Billing You may be automatically directed to view a specific Billing Account. If you see information about a single account rather than a list of your Billing Accounts, you can get back to the list by clicking “Manage Billing Accounts” from the drop-down menu. 
Check the box next to the Billing Account you wish to add a member to, click “ADD MEMBER”. Enter their Google ID in the text box. In the drop-down menu, mouse over Billing, then choose the appropriate role. Click “SAVE”. 5.4 Set Alerts for Google Billing Log in to the Google Cloud Platform console using the Google ID associated with your Google Cloud projects. Open the dropdown menu on the top left and click on Billing. You may be automatically directed to view a specific Billing Account. If you see information about a single account (and it’s not the one you’re interested in), you can get back to the list of all your Billing Accounts by clicking “Manage Billing Accounts” from the drop-down menu. Click on the name of the Billing Account you want to set alerts for. In the left-hand menu, click “Budgets & alerts”. Click the “Create Budget” tab. Enter a name for your budget, and then choose which projects you want to monitor. Then click “Next”. For Budget Type, select “Specified amount”. Enter the total budget amount for the month (you will set alerts at different thresholds in the next step). Click “Next” (do not click “Finish”). Enter the threshold amounts where you want to receive an alert. We recommend starting with 50% and 90%. You can set other alerts if you prefer. Check the box for “Email alerts to billing admins and users”, then click “Finish”. Now you (as the owner and admin), along with anyone you added with admin or user privileges (e.g. lab managers) will receive alerts when your lab members reach the specified spending thresholds. These emails will be sent to the Gmail accounts associated with the Billing Account. You can edit your budgets at any time by going to Billing > Budgets & alerts, and clicking on the name of the budget you want to edit. 5.5 View Spend for Google Billing You can always check your current spend through the Google Billing console, but remember There is a reporting delay (~1 day), so you cannot immediately see what an analysis cost Costs are reported at the level of Workspaces, so if there are multiple people using a Workspace, you will not be able to determine which of them was responsible for the charges. The Google Billing console displays information by Billing Account. To view spending: Log in to the Google Cloud Platform console using the Google ID associated with your Google Cloud projects. Open the dropdown menu on the top left and click on Billing. You may be automatically directed to view a specific Billing Account. If you see information about a single account (and it’s not the one you’re interested in), you can get back to the list of all your Billing Accounts by clicking “Manage Billing Accounts” from the drop-down menu. Click on the name of the Billing Account for the project you want to view. Look at the top of the Overview tab to see your month-to-date spending. Scroll further down the Overview tab to show your top projects. Click on the Reports tab to see more detailed information about each of your projects. This is probably the most useful tab for exploring costs of individual projects over time. Click on the Cost table tab to obtain a convenient table of spending per project. 5.6 Create Terra Billing Project Launch Terra and sign in with your Google account. If this is your first time logging in to Terra, you will need to accept the Terms of Service. In the drop-down menu on the left, navigate to “Billing”. Click the triple bar in the top left corner to access the menu. 
Click the arrow next to your name to expand the menu, then click “Billing”. You can also navigate there directly with this link: https://anvil.terra.bio/#billing On the Billing page, click the “+ CREATE” button to create a new Billing Project. Select GCP Billing Project (Google’s Platform). If prompted, select the Google account to use and give Terra permission to manage Google Cloud Platform billing accounts. Enter a unique name for your Terra Billing Project and select the appropriate Google Billing Account. The name of the Terra Billing Project must: Only contain lowercase letters, numbers and hyphens Start with a lowercase letter Not end with a hyphen Be between 6 and 30 characters Select the Google Billing Account to use. All activities conducted under your new Terra Billing Project will charge to this Google Billing Account. If prompted, give Terra permission to manage Google Cloud Platform billing accounts. Click “Create”. Your new Billing Project should now show up in the list of Billing Projects Owned by You. You can add additional members or can modify or deactivate the Billing Project at any time by clicking on its name in this list. The page doesn’t always update as soon as the Billing Project is created. If it’s been a couple of minutes and you don’t see a change, try refreshing the page. 5.7 Add Member to Terra Billing Project Launch Terra and sign in with your Google account. In the drop-down menu on the left, navigate to “Billing”. Click the triple bar in the top left corner to access the menu. Click the arrow next to your name to expand the menu, then click “Billing”. You can also navigate there directly with this link: https://anvil.terra.bio/#billing Click “Owned by You” and find the Billing Project. If you do not see the Billing Project in this list, then you are not an Owner and do not have permission to add members. Click on the name of the Billing Project. Click on the “Members” tab to view and manage members. Then click the “Add User” button. Enter the email address of the user or group you’d like to add the the Billing Project. If adding an individual, make sure to enter the account that they use to access AnVIL. If adding a Terra Group, use the Group email address, which can be found on the Terra Group management page. If this user or group will need to add and remove other users of the Billing Project, check the Owner box. Otherwise leave it unchecked. It’s often a good idea to have at least one other Owner of a Billing Project in order to avoid getting locked out, in case the original owner leaves or loses access to their account. Click “ADD USER”. You should now see the user or group listed in the Billing Project members, along with the appropriate role. They should now be able to use the Billing Project to fund work on AnVIL. If you need to remove members or modify their roles, you can do so at any time by clicking the teardrop button next to their name. 5.8 Disable Terra Billing Project By default this module includes a warning to make sure people understand they will lose access to their Workspace buckets. You can remove the warning from this module by setting AnVIL_module_settings$warning to FALSE before running cow::borrow_chapter: AnVIL_module_settings <- list( warning = FALSE ) cow::borrow_chapter( doc_path = "child/_child_terra_billing_project_disable.Rmd", repo_name = "jhudsl/AnVIL_Template" ) Disabling a Billing Project makes Workspace contents inaccessible! Disabling a Billing Project disables funding to all Workspaces funded by the Billing Project. 
You will be unable to compute in these Workspaces, and you will lose access to any data stored in the Workspace buckets. It is sometimes possible to restore access by reactivating billing, but Google makes no promises about whether or how long the data will be recoverable. Make sure everyone with Workspaces funded by the Billing Project has saved anything they want to keep in another location before disabling the Billing Project. To disable a Terra Billing Project (i.e. remove the Google Billing Account that funds the Terra Billing Project): Launch Terra and sign in with your Google account. In the drop-down menu on the left, navigate to “Billing”. Click the triple bar in the top left corner to access the menu. Click the arrow next to your name to expand the menu, then click “Billing”. You can also navigate there directly with this link: https://anvil.terra.bio/#billing Click “Owned by You” and find the Billing Project. If you do not see the Billing Project in this list, then you are not an Owner and do not have permission to add members. Click on the name of the Billing Project. If you don’t see information about the Billing Account, click on “View billing account” to expand the Billing Account information. You may be prompted to enter your login information again. You should see the name of the Google Billing Account that is funding this Terra Billing Project. Click on the teardrop icon next to the name of the Billing Account. Click “Remove Billing Account”. Click OK to confirm that you want to disable funding for this Billing Project. The page should now indicate that there is no linked billing account. If necessary, you can restore funding to the Billing Project and associated Workspaces by clicking the teardrop icon and selecting “Change Billing Account”. However, Google makes no promises about how long the Workspace contents will remain available after you disable funding, so it is best not to rely on them. "],["notes-for-instructors.html", "Chapter 6 Notes for Instructors", " Chapter 6 Notes for Instructors Although AnVIL is the preferred computational platform for the GDSCN, all activities can be run on different platforms. R-based activities can be run on your own personal installation of R or Posit(formerly called RStudio), depending on your needs. Galaxy-based activities can be run on both Galaxy on AnVIL and on the Galaxy web portal. You may also adapt these activities for other languages and platforms. "],["checklist-for-running-activities-on-anvil.html", "Chapter 7 Checklist for Running Activities on AnVIL Before the class begins After the class ends", " Chapter 7 Checklist for Running Activities on AnVIL If you choose to run these activities on AnVIL with your class, there are several things that you can do to make the experience easier. Before the class begins This checklist can serve as a reminder of the overall suggested steps to run an activity on AnVIL. You might find yourself changing these steps slightly as you become more familiar with AnVIL. 
Billing Obtain funding through the STRIDES program (optional) Request students make AnVIL IDs (Google IDs) Collect AnVIL IDs (Google IDs) from students Create Google Billing Account for your class Resources Create a Workspace for your class (optional) Notify Terra of your course dates and times Direct students to the Workspace Permissions Set up Groups to manage permissions AnVIL Group Class Workspace Terra Billing Projects* Instructor Owner Owner Teaching assistants Writer Owner Students Reader User After the class ends Resources Remind students to download any files they might need Tell students to delete their environments and persistent disks Billing Deactivate billing project "],["setting-up-billing-on-anvil.html", "Chapter 8 Setting up Billing on AnVIL 8.1 Creating a billing project 8.2 Adding Instructors as “Owner” 8.3 Adding Students as “User” 8.4 Understanding the various billing costs 8.5 Estimating costs before the class begins 8.6 How much does a class cost?", " Chapter 8 Setting up Billing on AnVIL The following will help you set up billing for your class. You will: * Set up a billing project for tracking costs * Add yourself and students to the billing project to grant permission to AnVIL resources * Learn about different sources of costs in AnVIL * Estimate costs for your class * Learn about how to track costs during your class 8.1 Creating a billing project First, create the Billing Project. Billing Project names must be globally unique and cannot exceed 30 characters. We suggest the name of the Billing Project should be a combination of institution-class- (e.g., “jhu-bmr2021-bill-1”). To create a Billing Project: Go to https://anvil.terra.bio/#billing Click “+CREATE” Type in your Billing Project name Select the appropriate Billing Account Click “CREATE BILLING PROJECT” You now have a unique Billing Project. 8.2 Adding Instructors as “Owner” Next, you want to give instructors permission to use the Billing Project to compute. To set instructor permissions: Go to https://anvil.terra.bio/#billing Select the “Owned by You” Billing Project sub-list Select the Billing Project you made in Instructor Billing Project Select the “Users” tab Click “+ Add User”. You will be prompted to add a “User email *”. Begin typing the instructor Group name set up in Instructor Group. You should see an email in the form (firecloud.org?) (e.g., jhu-bmr2021-instructors@firecloud.org). Ensure “Can manage users (Owner)” is selected Click “ADD USER” This step makes it so that co-instructors can edit permissions and administer the Billing Project as needed. While this means you and co-instructors can compute using the student Billing Project, this makes spending difficult to track. Instructors should always use the instructor Workspace to compute. This makes it much easier to track costs associated with instructors versus students. 8.3 Adding Students as “User” Next, you will add your student Group to the Billing Project so that they can compute. To set student permissions: Go to https://anvil.terra.bio/#billing Select the “Owned by You” Billing Project sub-list Select the Billing Project you made in Billing Project Select the “Users” tab Click “+ Add User”. You will be prompted to add a “User email *”. Begin typing the student Group name set up in Student Group. You should see an email in the form (firecloud.org?) (e.g., jhu-bmr2021-students@firecloud.org). Keep “Can manage users (Owner)” deselected. 
Click “ADD USER” 8.4 Understanding the various billing costs Costs in AnVIL fall into one of three categories: compute costs, storage costs, and network usage (egress) costs. Compute costs are those that students accrue when actively using an AnVIL Workspace. Students can clone a Workspace for no cost, but they will begin to accrue costs as soon as they set up a cloud environment. Compute costs are based on how many CPUs you need, as well as how much memory and storage space you choose. You can also pause the Workspace and pay a lower cost per hour than if you were to keep the Workspace running. Current prices can be found here. Storage costs are driven by the persistent disk. The persistent disk allows you to store data and installed programs/libraries for a low cost. Students can delete their Workspaces but maintain their persistent disk so they still have access to previous programs they have installed and previous files they’ve created. Current prices can be found here. Finally, network usage costs are those involved with transferring data between networks or downloading data from the cloud to your local computer. Current prices can be found here. 8.5 Estimating costs before the class begins AnVIL has a free AnVIL_Cost_Estimator that allows you estimate compute, storage, and network usage costs for your class. This is a Google sheet that you can tailor to fit your needs. Before you use it, make sure the prices are up to date by following the links at the bottom of the sheet. If you need to create a Budget Justification for your class, you can also use the free AnVIL_Budget_Justification template. 8.6 How much does a class cost? One of the advantages of billing projects in Terra is that you can keep track of the costs during real time. You can see how much each Workspace is costing while your course is happening, so there are no unexpected surprises at the end! Full details about billing in Terra can be found here. These instructions are adapted from Terra Support. To view the costs being accrued by each billing project, you can go to https://console.cloud.google.com/billing. At the top of the page, there is a dropdown menu. Choose the billing project name you’d like to view. Once you are in proper billing project, you click on “View detailed charges” in the Billing section on the far right. This takes you to a report of the detailed charges accrued by the billing account. Here, you will be able to see the total cost over a time range, as well as costs broken down by services. "],["setting-up-the-class-activity.html", "Chapter 9 Setting up the Class Activity 9.1 Overview of Class Setup 9.2 Collect Google IDs 9.3 Set Up Groups 9.4 Set Up Billing Projects 9.5 Set Permissions on the Workspace 9.6 Notify Terra", " Chapter 9 Setting up the Class Activity 9.1 Overview of Class Setup This section will show you how to organize your class to make it easier to administer access to your content. You will need to have a list of who will be taking your class, such as a course roster or sign-up list, as well as a list of additional instructors or teaching assistants. You can make changes later, so the list of students need not be final. 9.2 Collect Google IDs AnVIL IDs are based on Google accounts. Students – Contact students/participants to get their AnVIL IDs. These should be Gmail addresses or emails with GSuite capabilities. You can link students to Student Account Setup for instructions on what they should do. 
Co-instructors – If you will be working with other instructors, such as co-instructors or teaching assistants, you will need to collect their IDs as well. 9.3 Set Up Groups Reminder: Google Billing Accounts are managed on Google Cloud Platform and are used for organizing funding sources (e.g. credit cards, cloud credits). Terra Billing Projects are managed through Terra, and allow you to associate your Terra activity with the correct Google Billing Account. For a more detailed explanation, please see the chapter on Account Setup. We suggest creating two different Terra Billing Projects under the appropriate Billing Account that you created on cloud.google.com: one for students and one for co-instructors. The instructions below will walk you through how to set this up. Groups enable you to share your class Workspace and manage permissions for many people at once. We recommend starting with one Group for instructors and one Group for students. Instructor Group {#instructor-group} Create an informative, unique Group name for any co-instructors and teaching assistants. We suggest a combination of institution-class-role (e.g., “jhu-bmr2021-instructors”). Only letters, numbers, underscores, and dashes are allowed in Group names. To create a Group for instructors: Go to https://anvil.terra.bio/#groups Click “+ Create a New Group” Type in your instructor Group name Click “CREATE GROUP” You now have a unique instructor Group. Add Instructors as “Admin” (Instructor Group) Now that your instructor Group has been created, you should add any additional instructors. You should also ensure that they have the correct permissions. Go to https://anvil.terra.bio/#groups/ and click on the instructor Group name. This page should also be visible at https://anvil.terra.bio/#groups/<group-name>. Click on “+Add User”. You will be prompted to add the instructor’s AnVIL ID. Type in the instructor’s AnVIL ID Make sure “Can manage users (admin)” is selected Click ADD USER. This will take you back to the Group administration page. Make sure the newly added instructor displays “Admin” under “Roles” beside their AnVIL ID. Repeat this process for any additional co-instructors and teaching assistants. Student Group {#student-group} Next, you will create a Group for your students. Create an informative, unique Group name. We suggest a combination of institution-class-role (e.g., “jhu-bmr2021-students”). Only letters, numbers, underscores, and dashes are allowed in Group names. To create a Group for students: Go to https://anvil.terra.bio/#groups Click “+ Create a New Group” Type in your student Group name Click “CREATE GROUP” You now have a unique student Group. Add Instructors as “Admin” (Student Group) The next steps ensure any additional co-instructors and teaching assistants are able to administer the student Group in case you are unavailable. Follow the steps below to add each co-instructor in the student Group: Go to https://anvil.terra.bio/#groups/ and click on the student Group name. This page should be visible at https://anvil.terra.bio/#groups/<group-name>. Click on “+Add User”. You will be prompted to add the instructor’s AnVIL ID. Type in the instructor’s AnVIL ID Make sure “Can manage users (admin)” is selected Click ADD USER. This will take you back to the Group administration page. Make sure the newly added instructor displays “Admin” under “Roles” beside their AnVIL ID. Repeat this process for any additional co-instructors and teaching assistants. 
Add Students as “Member” Follow the steps below to add individual students to the student Group: Go to https://anvil.terra.bio/#groups/ and click on the student Group name. This page should be visible at https://anvil.terra.bio/#groups/<group-name>. Click on “+Add User”. You will be prompted to add an AnVIL ID. Type in the student’s AnVIL ID Click ADD USER Make sure the newly added student displays “Member” under “Roles” beside their AnVIL ID. At present, each student’s AnVIL ID must be added separately. Your instructor and student Groups are now set up. Group Email Lists Note that your newly created Groups have Group emails associated with them. Take note of these Group emails. You will use them for granting access to your class Billing Projects and Workspaces in the next steps. 9.4 Set Up Billing Projects Billing Projects in Terra help organize your compute costs. Like Groups, we suggest creating two different billing projects under the appropriate Billing Account that you created on cloud.google.com: one for students and one for co-instructors. Billing Project names must be globally unique and cannot exceed 30 characters. Instructor Billing Project {#instructor-billing-project} First, create the Billing Project for instructors. We suggest the name of the Billing Project should be a combination of institution-class-role (e.g., “jhu-bmr2021-instructors-bill-1”). To create a Billing Project for instructors: Go to https://anvil.terra.bio/#billing Click “+CREATE” Type in your instructor Billing Project name Select the appropriate Billing Account Click “CREATE BILLING PROJECT” You now have a unique instructor Billing Project. Add Instructors as “Owner” (Instructor Project) Next, you want to give instructors permission to use the Billing Project to compute. To set instructor permissions: Go to https://anvil.terra.bio/#billing Select the “Owned by You” Billing Project sub-list Select the Billing Project you made for instructors in Instructor Billing Project Select the “Users” tab Click “+ Add User”. You will be prompted to add a “User email *”. Begin typing the instructor Group name set up in Instructor Group. You should see an email in the form (firecloud.org?) (e.g., jhu-bmr2021-instructors@firecloud.org). Ensure “Can manage users (Owner)” is selected Click “ADD USER” Your instructor Billing Project is now set up. Student Billing Project {#student-billing-project} Next, create a student Billing Project. Again, we suggest a combination of institution-class-role (e.g., “jhu-bmr2021-students-bill-1”). To create a Billing Project for students: Go to https://anvil.terra.bio/#billing Click “+CREATE” Type in your student Billing Project name Select the appropriate Billing Account (same as above) Click “CREATE BILLING PROJECT” You now have a unique student Billing Project. Add Instructors as “Owner” (Student Project) You want to ensure any additional co-instructors and teaching assistants are able to administer the student Billing Project in case you are unavailable. To set instructor permissions: Go to https://anvil.terra.bio/#billing Select the “Owned by You” Billing Project sub-list Select the Billing Project you made for students in Student Billing Project Select the “Users” tab Click “+ Add User”. You will be prompted to add a “User email *”. Begin typing the instructor Group name set up in [### Set Up Groups]. You should see an email in the form (firecloud.org?) (e.g., jhu-bmr2021-instructors@firecloud.org). 
Ensure “Can manage users (Owner)” is selected Click “ADD USER” This step makes it so that co-instructors can edit permissions and administer the student Billing Project as needed. While this means you and co-instructors can compute using the student Billing Project, this makes spending difficult to track. Instructors should always use the instructor Billing Project to compute. This makes it much easier to track costs associated with instructors versus students. Add Students as “User” Next, you will add your student Group to the student Billing Project so that they can compute. To set student permissions: Go to https://anvil.terra.bio/#billing Select the “Owned by You” Billing Project sub-list Select the Billing Project you made for students in Student Billing Project Select the “Users” tab Click “+ Add User”. You will be prompted to add a “User email *”. Begin typing the student Group name set up in Student Group. You should see an email in the form (firecloud.org?) (e.g., jhu-bmr2021-students@firecloud.org). Keep “Can manage users (Owner)” deselected. Click “ADD USER” Your student Billing Project is now set up. 9.5 Set Permissions on the Workspace Finally, you will want to set up permissions for co-instructors and students to see the class Workspace you created with the development Billing Project in Developing Content. AnVIL users can take on the “Owner”, “Writer”, or “Reader” role for a Workspace. Add Instructors as “Owner” You should add your co-instructors and teaching assistants as “Owners” to the class Workspace. This is useful if they need to edit the course content or share the space with students on your behalf. To share and change permissions: Go to https://anvil.terra.bio/#workspaces and find your class Workspace you set up in Developing Content Click the teardrop button for your class Workspace Click “Share”. This will open a dialog box. Enter the name of the instructor Group (e.g., jhu-bmr2021-instructors). This will create a dropdown for the Group permissions in the box. Select this Group. Change permissions to “Owner” using the dropdown menu under the instructor Group Click “SAVE” This step makes it so that co-instructors can edit the original content of the Workspace as needed. While this means you and co-instructors can compute using the development Billing Project, this makes spending difficult to track. Instructors should instead clone the Workspace using the instructor Billing Project. This makes it much easier to track costs associated with this iteration of your class versus further iterations (e.g., the following semester or year). Add Students as “Reader” Next, add your students as “Readers” to the class Workspace. This means they will be able to view and clone the Workspace, but not make edits or perform computations. To share the Workspace: Click the teardrop button for your class Workspace Click “Share”. This will open a dialog box. Enter the name of the student Group. This will create a dropdown for the Group permissions in the box. Select this Group. Ensure permissions are set to “Reader” (default) Click “SAVE” You have now correctly set up your class permissions! 9.6 Notify Terra Contacting Terra ahead of your class time helps the Terra team avoid any major disruptions to your class. Contact Terra by submitting a request for a hold on scheduled maintenance and downtime. It’s also a good idea to ask about major changes planned for the time prior to your class. 
"],["getting-credit-for-professional-development.html", "Chapter 10 Getting Credit for Professional Development", " Chapter 10 Getting Credit for Professional Development We are happy to provide a letter to your supervisor, department head, or dean to indicate you’ve worked through this content and intend to use it in your class. "],["anvil-workspace.html", "Chapter 11 AnVIL Workspace 11.1 Create Google Account 11.2 Clone the Workspace", " Chapter 11 AnVIL Workspace You can easily access the data on AnVIL by cloning the dedicated workspace. These sections guide you through creating an AnVIL account and accessing the workspace. 11.1 Create Google Account If you do not already have a Google account that you would like to use for accessing Terra, create one now. If you would like to create a Google account that is associated with your non-Gmail, institutional email address, follow these instructions. 11.2 Clone the Workspace Launch Terra Locate the Workspace you want to clone. If a Workspace has been shared with you ahead of time, it will appear in “MY WORKSPACES”. You can clone a Workspace that was shared with you to perform your own analyses. In the screenshot below, no Workspaces have been shared. If a Workspace hasn’t been shared with you, navigate to the “FEATURED” or “PUBLIC” Workspace tabs. Use the search box to find the Workspace you want to clone. Click the teardrop button on the far right next to the Workspace you want to clone. Click “Clone”. You can also clone the Workspace from the Workspace Dashboard instead of the search results. You will see a popup box appear. Name your Workspace and select the appropriate Terra Billing Project. All activity in the Workspace will be charged to this Billing Project (regardless of who conducted it). Remember that each Workspace should have its own Billing Project. If you are working with protected data, you can set the Authorization Domain to limit who can be added to your Workspace. Note that the Authorization Domain cannot be changed after the Workspace is created (i.e. there is no way to make this Workspace shareable with a larger audience in the future). Workspaces by default are only visible to people you specifically share them with. Authorization domains add an extra layer of enforcement over privacy, but by nature make sharing more complicated. We recommend using Authorization Domains in cases where it is extremely important and/or legally required that the data be kept private (e.g. protected patient data, industry data). For data you would merely prefer not be shared with the world, we recommend relying on standard Workspace sharing permissions rather than Authorization Domains, as Authorization Domains can make future collaborations, publications, or other sharing complicated. Click “CLONE WORKSPACE”. The new Workspace should now show up under your Workspaces. "],["using-rstudio-on-anvil.html", "Chapter 12 Using RStudio on AnVIL 12.1 Video overview of RStudio on AnVIL 12.2 Launching RStudio 12.3 Touring RStudio 12.4 Pausing RStudio", " Chapter 12 Using RStudio on AnVIL In the next few steps, you will walk through how to get set up to use RStudio on the AnVIL platform. AnVIL is centered around different “Workspaces”. Each Workspace functions almost like a mini code laboratory - it is a place where data can be examined, stored, and analyzed. The first thing we want to do is to copy or “clone” a Workspace to create a space for you to experiment. Use a web browser to go to the AnVIL website. 
In the browser type: anvil.terra.bio Tip At this point, it might make things easier to open up a new window in your browser and split your screen. That way, you can follow along with this guide on one side and execute the steps on the other. Your instructor will give you information on which workspace you should clone. 12.1 Video overview of RStudio on AnVIL Here is a video tutorial that describes the basics of using RStudio on AnVIL. 12.1.1 Objectives Start compute for your RStudio environment Tour RStudio on AnVIL Stop compute to minimize expenses 12.1.2 Slides The slides for this tutorial are located here. 12.2 Launching RStudio AnVIL is very versatile and can scale up to use very powerful cloud computers. It’s very important that you select a cloud computing environment appropriate to your needs to avoid runaway costs. If you are uncertain, start with the default settings; it is fairly easy to increase your compute resources later, if needed, but harder to scale down. Note that, in order to use RStudio, you must have access to a Terra Workspace with permission to compute (i.e. you must be a “Writer” or “Owner” of the Workspace). Open Terra - use a web browser to go to anvil.terra.bio In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the name of your Workspace. You should be routed to a link that looks like: https://anvil.terra.bio/#workspaces/<billing-project>/<workspace-name>. Click on the cloud icon on the far right to access your Cloud Environment options. If you don’t see this icon, you may need to scroll to the right. In the dialogue box, click the “Settings” button under RStudio. You will see some configuration options for the RStudio cloud environment, and a list of costs because it costs a small amount of money to use cloud computing. Configure any settings you need for your cloud environment. If you are uncertain about what you need, the default configuration is a reasonable, cost-conservative choice. It is fairly easy to increase your compute resources later, if needed, but harder to scale down. Scroll down and click the “CREATE” button when you are satisfied with your setup. The dialogue box will close and you will be returned to your Workspace. You can see the status of your cloud environment by hovering over the RStudio icon. It will take a few minutes for Terra to request computers and install software. When your environment is ready, its status will change to “Running”. Click on the RStudio logo to open a new dialogue box that will let you launch RStudio. Click the launch icon to open RStudio. This is also where you can pause, modify, or delete your environment when needed. You should now see the RStudio interface with information about the version printed to the console. 12.3 Touring RStudio Next, we will be using RStudio and the package Glimma to create interactive plots. See this vignette for more information. The Bioconductor team has created a very useful package to programmatically interact with Terra and Google Cloud. Install the AnVIL package. It will make some steps easier as we go along. You can now quickly install precompiled binaries using the AnVIL package’s install() function. We will use it to install the Glimma package and the airway package. The airway package contains data stored in a SummarizedExperiment object. This data describes an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone.
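A rough sketch of these installation steps (assuming the BiocManager and Bioconductor AnVIL packages are available in your RStudio cloud environment; the exact commands shown on screen in the module may differ):
# install the AnVIL helper package from Bioconductor, if it is not already present
BiocManager::install("AnVIL")
# AnVIL's install() function fetches fast, precompiled binaries on AnVIL cloud environments
AnVIL::install(c("Glimma", "airway"))
# load the example RNA-Seq dataset (a SummarizedExperiment object)
library(airway)
data(airway)
# the interactive MDS plot described below is created with Glimma (see Glimma::glimmaMDS())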
{Note: for some of the packages, you will have to install packages from the CRAN repository, using the install.packages() function. The examples will show you which install method to use.} (Screenshot: the RStudio environment interface, with code typed into the console and highlighted.) Load the example data. The multidimensional scaling (MDS) plot is frequently used to explore differences in samples. When this data is MDS transformed, the first two dimensions explain the greatest variance between samples, and the amount of variance decreases monotonically with increasing dimension. The following code will launch a new window where you can interact with the MDS plot. Change the colour_by setting to “groups” so you can easily distinguish between groups. In this data, the “group” is the treatment. You can download the interactive html file by clicking on “Save As”. You can also download plots and other files created directly in RStudio. To download the following plot, click on “Export” and save in your preferred format to the default directory. This saves the file in your cloud environment. You should see the plot in the “Files” pane. Select this file and click “More” > “Export”. Select “Download” to save the file to your local machine. 12.4 Pausing RStudio You can view costs and make changes to your cloud environments from the panel on the far right of the page. If you don’t see this panel, you may need to scroll to the right. Running environments will have a green dot, and paused environments will have an orange dot. Hovering over the RStudio icon will show you the costs associated with your RStudio environment. Click on the RStudio icon to open the cloud environment settings. Click the Pause button to pause RStudio. This will take a few minutes. When the environment is paused, an orange dot will be displayed next to the RStudio icon. If you hover over the icon, you will see that it is paused, and has a small ongoing cost as long as it is paused. When you’re ready to resume working, you can do so by clicking the RStudio icon and clicking Resume. The right-hand side icon reminds you that you are accruing cloud computing costs. If you don’t see this icon, you may need to scroll to the right. You should minimize charges when you are not performing an analysis. You can do this by clicking on the RStudio icon and selecting “Pause”. This will release the CPU and memory resources for other people to use. Note that your work will be saved in the environment and continue to accrue a very small cost. This work will be lost if the cloud environment gets deleted. If there is anything you would like to save permanently, it’s a good idea to copy it from your compute environment to another location, such as the Workspace bucket, GitHub, or your local machine, depending on your needs. You can also pause your cloud environment(s) at https://anvil.terra.bio/#clusters. "],["exploring-soil-testing-data-with-r.html", "Chapter 13 Exploring Soil Testing Data With R 13.1 Before You Start 13.2 Objectives 13.3 Part 1. Examining the Data 13.4 Part 2. Summarizing the Data with Statistics 13.5 Part 3. Visualizing the Data", " Chapter 13 Exploring Soil Testing Data With R In this activity, you’ll have a chance to become familiar with the BioDIGS soil testing data.
This dataset includes information on the inorganic components of each soil sample, particularly metal concentrations. Human activity can increase the concentration of inorganic compounds in the soil. When cars drive on roads, compounds from the exhaust, oil, and other fluids might settle onto the roads and be washed into the soil. When we put salt on roads, parking lots, and sidewalks, the salts themselves will eventually be washed away and enter the ecosystem through both water and soil. Chemicals from factories and other businesses also leach into our environment. All of this means the concentration of heavy metals and other chemicals will vary among the soil samples collected for the BioDIGS project. 13.1 Before You Start If you do not already have a Google account that you would like to use for accessing Terra, create one now. If you would like to create a Google account that is associated with your non-Gmail, institutional email address, follow these instructions. 13.2 Objectives This activity will teach you how to use the AnVIL platform to: Open data from an R package Examine objects in R Calculate summary statistics for variables in the soil testing data Create and interpret histograms and boxplots for variables in the soil testing data 13.3 Part 1. Examining the Data We will use the BioDIGSData package to retrieve the data. We first need to install the package from where it is stored on GitHub. devtools::install_github("fhdsl/BioDIGSData") Once you’ve installed the package, we can load the library and assign the soil testing data to an object. This command follows the code structure: dataset_object_name <- stored_BioDIGS_dataset library(BioDIGSData) soil.values <- BioDIGS_soil_data() It seems like the dataset loaded, but it’s always a good idea to verify. There are many ways to check, but the easiest approach (if you’re using RStudio) is to look at the Environment tab on the upper right-hand side of the screen. You should now have an object called soil.values that includes some number of observations for 28 variables. The observations refer to the number of rows in the dataset, while the variables tell you the number of columns. As long as neither the observations nor the variables are 0, you can be confident that your dataset loaded.
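If you prefer to check from the R console (or are not using RStudio), a couple of base R commands provide the same information (a quick sketch):
dim(soil.values)   # number of rows (observations) and columns (variables)
str(soil.values)   # each column's name, type, and first few values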
Let’s take a quick look at the dataset. We can do this by clicking on the soil.values object in the Environment tab. (Note: this is equivalent to typing View(soil.values) in the R console.) This will open a new window for us to scroll through the dataset. Well, the data definitely loaded, but those column names aren’t immediately understandable. What could As_EPA3051 possibly mean? In addition to the dataset, we need to load the data dictionary as well. Data dictionary: a file containing the names, definitions, and attributes about data in a database or dataset. In this case, the data dictionary can help us make sense of what sort of values each column represents. The data dictionary for the BioDIGS soil testing data is available in the R package (see code below), but we have also reproduced it here. ?BioDIGS_soil_data() site_id Unique letter and number site name full_name Full site name As_EPA3051 Arsenic (mg/kg), EPA Method 3051A. Quantities < 3.0 are not detectable. Cd_EPA3051 Cadmium (mg/kg), EPA Method 3051A. Quantities < 0.2 are not detectable. Cr_EPA3051 Chromium (mg/kg), EPA Method 3051A Cu_EPA3051 Copper (mg/kg), EPA Method 3051A Ni_EPA3051 Nickel (mg/kg), EPA Method 3051A Pb_EPA3051 Lead (mg/kg), EPA Method 3051A Zn_EPA3051 Zinc (mg/kg), EPA Method 3051A water_pH A-E_Buffer_pH OM_by_LOI_pct Organic Matter by Loss on Ignition P_Mehlich3 Phosphorus (mg/kg), using the Mehlich 3 soil test extractant K_Mehlich3 Potassium (mg/kg), using the Mehlich 3 soil test extractant Ca_Mehlich3 Calcium (mg/kg), using the Mehlich 3 soil test extractant Mg_Mehlich3 Magnesium (mg/kg), using the Mehlich 3 soil test extractant Mn_Mehlich3 Manganese (mg/kg), using the Mehlich 3 soil test extractant Zn_Mehlich3 Zinc (mg/kg), using the Mehlich 3 soil test extractant Cu_Mehlich3 Copper (mg/kg), using the Mehlich 3 soil test extractant Fe_Mehlich3 Iron (mg/kg), using the Mehlich 3 soil test extractant B_Mehlich3 Boron (mg/kg), using the Mehlich 3 soil test extractant S_Mehlich3 Sulfur (mg/kg), using the Mehlich 3 soil test extractant Na_Mehlich3 Sodium (mg/kg), using the Mehlich 3 soil test extractant Al_Mehlich3 Aluminum (mg/kg), using the Mehlich 3 soil test extractant Est_CEC Cation Exchange Capacity (meq/100g) at pH 7.0 (CEC) Base_Sat_pct Base saturation (BS). This represents the percentage of CEC occupied by bases (Ca2+, Mg2+, K+, and Na+). The %BS increases with increasing soil pH. The availability of Ca2+, Mg2+, and K+ increases with increasing %BS. P_Sat_ratio Phosphorus saturation ratio. This is the ratio between the amount of phosphorus present in the soil and the total capacity of that soil to retain phosphorus. The ability of phosphorus to be bound in the soil is primarily a function of iron (Fe) and aluminum (Al) content in that soil. Using the data dictionary, we find that the values in column As_EPA3051 give us the arsenic concentration in mg/kg of each soil sample, as determined by EPA Method 3051A. This method uses a combination of heat and acid to extract specific elements (like arsenic, cadmium, chromium, copper, nickel, lead, and zinc) from soil samples. While arsenic can occur naturally in soils, higher levels suggest the soil may have been contaminated by mining, hazardous waste, or pesticide application. Arsenic is toxic to humans. QUESTIONS: What data is found in the column labeled “Fe_Mehlich3”? Why would we be interested in how much of this is in the soil? (You may have to search the internet for this answer.) What data is found in the column labeled “Base_Sat_pct”? What does this variable tell us about the soil? We can also look at just the names of all the columns in the R console using the colnames() command. colnames(soil.values) ## [1] "site_id" "site_name" "type" "As_EPA3051" ## [5] "Cd_EPA3051" "Cr_EPA3051" "Cu_EPA3051" "Ni_EPA3051" ## [9] "Pb_EPA3051" "Zn_EPA3051" "water_pH" "OM_by_LOI_pct" ## [13] "P_Mehlich3" "K_Mehlich3" "Ca_Mehlich3" "Mg_Mehlich3" ## [17] "Mn_Mehlich3" "Zn_Mehlich3" "Cu_Mehlich3" "Fe_Mehlich3" ## [21] "B_Mehlich3" "S_Mehlich3" "Na_Mehlich3" "Al_Mehlich3" ## [25] "Est_CEC" "Base_Sat_pct" "P_Sat_ratio" "region" Most of the column names are found in the data dictionary, but the very last column (“region”) isn’t. How peculiar! Let’s look at what sort of values this particular column contains. The tab with the table of the soil.values object should still be open in the upper left pane of the RStudio window. If not, you can open it again by clicking on soil.values in the Environment pane, or by using the View() command.
View(soil.values) If you scroll to the end of the table, we can see that “region” seems to refer to the city or area where the samples were collected. For example, the first 6 samples all come from Baltimore City. You may notice that some cells in the soil.values table contain NA. This just means that the soil testing data for that sample isn’t available yet. We’ll take care of those values in the next part. QUESTIONS: How many observations are in the soil testing values dataset that you loaded? What does each of these observations refer to? How many different regions are represented in the soil testing dataset? How many of them have soil testing data available? 13.4 Part 2. Summarizing the Data with Statistics Now that we have the dataset loaded, let’s explore the data in more depth. First, we should remove those samples that don’t have soil testing data yet. We could keep them in the dataset, but removing them at this stage will make the analysis a little cleaner. In this case, as we know the reason the data are missing (and that reason will not skew our analysis), we can safely remove these samples. This will not be the case for every data analysis. We can remove the unanalyzed samples using the drop_na() function from the tidyr package. This function removes any rows from a table that contain NA for a particular column. This command follows the code structure: dataset_new_name <- dataset %>% drop_na(column_name) The %>% is called a pipe, and it tells R that the commands after it should all be applied to the object in front of it. (In this case, we can filter out all samples missing a value for “As_EPA3051” as a proxy for samples without soil testing data.) library(tidyr) soil.values.clean <- soil.values %>% drop_na(As_EPA3051) Great! Now let’s calculate some basic statistics. For example, we might want to know what the mean (average) arsenic concentration is for all the soil samples. We can use a combination of two functions: pull() and mean(). pull() lets you extract a column from your table for statistical analysis, while mean() calculates the average value for the extracted column. This command follows the code structure: OBJECT %>% pull(column_name) %>% mean() pull() is a command from the tidyverse package, so we’ll need to load that library before our command. library(tidyverse) soil.values.clean %>% pull(As_EPA3051) %>% mean() ## [1] 5.10875 We can run similar commands to calculate the standard deviation (sd), minimum (min), and maximum (max) for the soil arsenic values. soil.values.clean %>% pull(As_EPA3051) %>% sd() ## [1] 5.606926 soil.values.clean %>% pull(As_EPA3051) %>% min() ## [1] 0 soil.values.clean %>% pull(As_EPA3051) %>% max() ## [1] 27.3
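If you would like several of these statistics at once, base R’s summary() function can be piped in the same way (a quick sketch; it reports the minimum, quartiles, median, mean, and maximum):
soil.values.clean %>% pull(As_EPA3051) %>% summary()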
The soil testing dataset contains samples from multiple geographic regions, so maybe it’s more meaningful to find out what the average arsenic values are for each region. We have to do a little bit of clever coding trickery for this using the group_by and summarize functions. First, we tell R to split our dataset up by a particular column (in this case, region) using the group_by function, then we tell R to summarize the mean arsenic concentration for each group. When using the summarize function, we tell R to make a new table (technically, a tibble in R) that contains two columns: the column used to group the data and the statistical measure we calculated for each group. This command follows the code structure: dataset %>% group_by(column_name) %>% summarize(mean(column_name)) soil.values.clean %>% group_by(region) %>% summarize(mean(As_EPA3051)) ## # A tibble: 2 × 2 ## region `mean(As_EPA3051)` ## <chr> <dbl> ## 1 Baltimore City 5.56 ## 2 Montgomery County 4.66 Now we know that the mean arsenic concentration might be different for each region. If we compare the samples from Baltimore City and Montgomery County, the Baltimore City samples appear to have a higher mean arsenic concentration than the Montgomery County samples. QUESTIONS: All the samples from Baltimore City and Montgomery County were collected from public park land. The parks sampled from Montgomery County were located in suburban and rural areas, compared to the urban parks sampled in Baltimore City. Why might the Montgomery County samples have a lower average arsenic concentration than the samples from Baltimore City? What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value? Calculate the mean iron concentration by region. Which region has the highest mean iron concentration? What about the lowest? Let’s say we’re interested in looking at mean concentrations that were determined using EPA Method 3051. Given that there are seven of these measures in the soil.values dataset, it would be time consuming to run our code from above for each individual measure. We can add two helper functions to our summarize statement to calculate statistical measures for multiple columns at once: the across() function, which tells R to apply the same calculation to multiple columns; and the ends_with() selector, which tells R which columns should be included in the statistical calculation. We are using ends_with because for this question, all the columns that we’re interested in end with the string ‘EPA3051’. This command follows the code structure: dataset %>% group_by(column_name) %>% summarize(across(ends_with(common_column_name_ending), mean)) soil.values.clean %>% group_by(region) %>% summarize(across(ends_with('EPA3051'), mean)) ## # A tibble: 2 × 8 ## region As_EPA3051 Cd_EPA3051 Cr_EPA3051 Cu_EPA3051 Ni_EPA3051 Pb_EPA3051 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Baltimore C… 5.56 0.359 34.5 35.0 17.4 67.2 ## 2 Montgomery … 4.66 0.402 29.9 24.3 23.4 38.7 ## # ℹ 1 more variable: Zn_EPA3051 <dbl> This is a much more efficient way to calculate statistics. QUESTIONS: Calculate the maximum values for concentrations that were determined using EPA Method 3051. (HINT: change the function you call in the summarize statement.) Which of these metals has the maximum concentration you see, and in which region is it found? Calculate both the mean and maximum values for concentrations that were determined using the Mehlich3 test. (HINT: change the string inside the ends_with() function, as well as the function you call in the summarize statement.) Which of these metals has the highest average and maximum concentrations, and in which region are they found?
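As the hints suggest, the same across() pattern works with any summary function and any set of columns. For example, swapping mean for median follows the same structure (a sketch using the same soil.values.clean object, not an answer to the questions above):
soil.values.clean %>% group_by(region) %>% summarize(across(ends_with('EPA3051'), median))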
13.5 Part 3. Visualizing the Data Often, it can be easier to immediately interpret data displayed as a plot than as a list of values. For example, we can more easily understand how the arsenic concentrations of the soil samples are distributed if we create histograms compared to looking at point values like mean, standard deviation, minimum, and maximum. One way to make histograms in R is with the hist() function. This function only requires that we tell R which column of the dataset we want to plot. (However, we also have the option to give R a histogram title and an x-axis label.) We can again use the pull() command and pipes (%>%) to choose the column we want from the soil.values.clean dataset and make a histogram of those values. This combination of commands follows the code structure: dataset %>% pull(column_name) %>% hist(main = chart_title, xlab = x_axis_title) soil.values.clean %>% pull(As_EPA3051) %>% hist(main = 'Histogram of Arsenic Concentration', xlab = 'Concentration in mg/kg' ) We can see that almost all the soil samples had very low concentrations of arsenic (which is good news for the soil health!). In fact, many of them had arsenic concentrations close to 0, and only one sampling location appears to have high levels of arsenic. We might also want to graphically compare arsenic concentrations among the geographic regions in our dataset. We can do this by creating boxplots. Boxplots are particularly useful when comparing the median, variation, and distribution of values among multiple groups. In R, one way to create a boxplot is using the boxplot() function. We don’t need to use pipes for this command, but instead will specify what columns we want to use from the dataset inside the boxplot() function itself. This command follows the code structure: boxplot(column_we’re_plotting ~ grouping_variable, data = dataset, main = “Title of Graph”, xlab = “x_axis_title”, ylab = “y_axis_title”) boxplot(As_EPA3051 ~ region, data = soil.values.clean, main = "Arsenic Concentration by Geographic Region", xlab = "Region", ylab = "Arsenic Concentration in mg/kg") By using a boxplot, we can quickly see that, while one sampling site within Baltimore City has a very high concentration of arsenic in the soil, in general there isn’t a difference in arsenic content between Baltimore City and Montgomery County. QUESTIONS: Create a histogram for iron concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations? Create a histogram for lead concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations? Look at the maps for iron and lead on the BioDIGS website. Do the boxplots you created make sense, given what you see on these maps? Why or why not?
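The same two plotting patterns can be reused for any numeric column in soil.values.clean. As a sketch to adapt for the questions above (shown here for zinc, which is not one of the questions):
soil.values.clean %>% pull(Zn_EPA3051) %>% hist(main = 'Histogram of Zinc Concentration', xlab = 'Concentration in mg/kg')
boxplot(Zn_EPA3051 ~ region, data = soil.values.clean, main = "Zinc Concentration by Geographic Region", xlab = "Region", ylab = "Zinc Concentration in mg/kg")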
"],["activity-questions.html", "Chapter 14 Activity Questions 14.1 Part 1. Examining the Data 14.2 Part 2. Summarizing the Data with Statistics 14.3 Part 3. Visualizing the Data", " Chapter 14 Activity Questions 14.1 Part 1. Examining the Data What data is found in the column labeled “Fe_Mehlich3”? Why would we be interested in how much of this is in the soil? (You may have to search the internet for this answer.) What data is found in the column labeled “Base_Sat_pct”? What does this variable tell us about the soil? How many observations are in the soil testing values dataset that you loaded? What does each of these observations refer to? How many different regions are represented in the soil testing dataset? How many of them have soil testing data available? 14.2 Part 2. Summarizing the Data with Statistics All the samples from Baltimore City and Montgomery County were collected from public park land. The parks sampled from Montgomery County were located in suburban and rural areas, compared to the urban parks sampled in Baltimore City. Why might the Montgomery County samples have a lower average arsenic concentration than the samples from Baltimore City? What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value? Calculate the mean iron concentration by region. Which region has the highest mean iron concentration? What about the lowest? Calculate the maximum values for concentrations that were determined using EPA Method 3051. (HINT: change the function you call in the summarize statement.) Which of these metals has the maximum concentration you see, and in which region is it found? Calculate both the mean and maximum values for concentrations that were determined using the Mehlich3 test. (HINT: change the string inside the ends_with() function, as well as the function you call in the summarize statement.) Which of these metals has the highest average and maximum concentrations, and in which region are they found? 14.3 Part 3. Visualizing the Data Create a histogram for iron concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations? Create a histogram for lead concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations? Look at the maps for iron and lead on the BioDIGS website. Do the boxplots you created make sense, given what you see on these maps? Why or why not? "],["about-the-authors.html", "About the Authors", " About the Authors These credits are based on our course contributors table guidelines.
Credits Names Pedagogy Content Developer Elizabeth Humphries Content Editors Ava Hoffman, Kate Isaac Project Directors Ava Hoffman, Michael Schatz, Jeff Leek, Frederick Tan Production Content Publisher Ira Gooding Technical Template Publishing Engineers Candace Savonen, Carrie Wright, Ava Hoffman Publishing Maintenance Engineer Candace Savonen Technical Publishing Stylists Carrie Wright, Candace Savonen Package Developers (ottrpal) John Muschelli, Candace Savonen, Carrie Wright Package Developer (BioDIGSData) Ava Hoffman Funding Funder National Human Genome Research Institute (NHGRI) Funding Staff Fallon Bachman, Jennifer Vessio, Emily Voeglein   ## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R version 4.3.2 (2023-10-31) ## os Ubuntu 22.04.4 LTS ## system x86_64, linux-gnu ## ui X11 ## language (EN) ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC ## date 2024-09-09 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date (UTC) lib source ## bookdown 0.39.1 2024-06-11 [1] Github (rstudio/bookdown@f244cf1) ## bslib 0.6.1 2023-11-28 [1] RSPM (R 4.3.0) ## cachem 1.0.8 2023-05-01 [1] RSPM (R 4.3.0) ## cli 3.6.2 2023-12-11 [1] RSPM (R 4.3.0) ## devtools 2.4.5 2022-10-11 [1] RSPM (R 4.3.0) ## digest 0.6.34 2024-01-11 [1] RSPM (R 4.3.0) ## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.3.0) ## evaluate 0.23 2023-11-01 [1] RSPM (R 4.3.0) ## fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.3.0) ## fs 1.6.3 2023-07-20 [1] RSPM (R 4.3.0) ## glue 1.7.0 2024-01-09 [1] RSPM (R 4.3.0) ## htmltools 0.5.7 2023-11-03 [1] RSPM (R 4.3.0) ## htmlwidgets 1.6.4 2023-12-06 [1] RSPM (R 4.3.0) ## httpuv 1.6.14 2024-01-26 [1] RSPM (R 4.3.0) ## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.3.0) ## jsonlite 1.8.8 2023-12-04 [1] RSPM (R 4.3.0) ## knitr 1.47.3 2024-06-11 [1] Github (yihui/knitr@e1edd34) ## later 1.3.2 2023-12-06 [1] RSPM (R 4.3.0) ## lifecycle 1.0.4 2023-11-07 [1] RSPM (R 4.3.0) ## magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.3.0) ## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.3.0) ## mime 0.12 2021-09-28 [1] RSPM (R 4.3.0) ## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0) ## pkgbuild 1.4.3 2023-12-10 [1] RSPM (R 4.3.0) ## pkgload 1.3.4 2024-01-16 [1] RSPM (R 4.3.0) ## profvis 0.3.8 2023-05-02 [1] RSPM (R 4.3.0) ## promises 1.2.1 2023-08-10 [1] RSPM (R 4.3.0) ## purrr 1.0.2 2023-08-10 [1] RSPM (R 4.3.0) ## R6 2.5.1 2021-08-19 [1] RSPM (R 4.3.0) ## Rcpp 1.0.12 2024-01-09 [1] RSPM (R 4.3.0) ## remotes 2.4.2.1 2023-07-18 [1] RSPM (R 4.3.0) ## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.3.2) ## rmarkdown 2.27.1 2024-06-11 [1] Github (rstudio/rmarkdown@e1c93a9) ## sass 0.4.8 2023-12-06 [1] RSPM (R 4.3.0) ## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.3.0) ## shiny 1.8.0 2023-11-17 [1] RSPM (R 4.3.0) ## stringi 1.8.3 2023-12-11 [1] RSPM (R 4.3.0) ## stringr 1.5.1 2023-11-14 [1] RSPM (R 4.3.0) ## urlchecker 1.0.1 2021-11-30 [1] RSPM (R 4.3.0) ## usethis 2.2.3 2024-02-19 [1] RSPM (R 4.3.0) ## vctrs 0.6.5 2023-12-01 [1] RSPM (R 4.3.0) ## xfun 0.44.4 2024-06-11 [1] Github (yihui/xfun@9da62cc) ## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.3.0) ## yaml 2.3.8 2023-12-11 [1] RSPM (R 4.3.0) ## ## [1] /usr/local/lib/R/site-library ## [2] /usr/local/lib/R/library ## ## ────────────────────────────────────────────────────────────────────────────── "],["references.html", "Chapter 15 References", " Chapter 15 References "],["404.html", "Page not found", " Page not found The page you requested 
cannot be found (perhaps it was moved or renamed). You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]] +[["index.html", "BioDIGS: Exploring Soil Data About this Book 0.1 Target Audience 0.2 Platform 0.3 Data", " BioDIGS: Exploring Soil Data October 16, 2024 About this Book This is a companion training guide for BioDIGS, a GDSCN project that brings a research experience into the classroom. This activity guides students through exploration of the BioDIGS soil data using the tidyverse in R. Students will learn basic data summarization, visualization, and mapping skills. Visit the BioDIGS (BioDiversity and Informatics for Genomics Scholars) website here for more information about this collaborative, distributed research project, including how you can get involved! The GDSCN (Genomics Data Science Community Network) is a consortium of educators who aim to create a world where researchers, educators, and students from diverse backgrounds are able to fully participate in genomic data science research. You can find more information about its mission and initiatives here. BioDIGS logo 0.1 Target Audience The activities in this guide are written for undergraduate students and beginning graduate students. Some sections require basic understanding of the R programming language, which is indicated at the beginning of the chapter. 0.2 Platform The activities in this guide are demonstrated on NHGRI’s AnVIL cloud computing platform. AnVIL is the preferred computing platform for the GDSCN. However, all of these activities can be done using your personal installation of R or using the online Galaxy portal. 0.3 Data The data generated by the BioDIGS project is available through the BioDIGS website, as well as through an AnVIL workspace. Data about the soil itself as well as soil metal content was generated by the Delaware Soil Testing Program at the University of Delaware. Sequences were generated by the Johns Hopkins University Genetic Resources Core Facility and by PacBio. "],["background.html", "Chapter 1 Background 1.1 What is genomics? 1.2 What is data science? 1.3 What is cloud computing? 1.4 Why soil microbes? 1.5 Heavy metals and human health", " Chapter 1 Background One critical aspect of an undergraduate STEM education is hands-on research. Undergraduate research experiences enhance what students learn in the classroom as well as increase a student’s interest in pursuing STEM careers (Russell2007?). It can also lead to improved scientific reasoning and increased academic performance overall (Buffalari2020?). However, many students at underresourced institutions like community colleges, Historically Black Colleges and Universities (HBCUs), tribal colleges and universities, and Hispanic-serving institutions have limited access to research opportunities compared to their cohorts at larger four-year colleges and R1 institutions. These students are also more likely to belong to groups that are already under-represented in STEM disciplines, particularly genomics and data science (Canner2017?; GDSCN2022?). The BioDIGS Project aims to be at the intersection of genomics, data science, cloud computing, and education. 1.1 What is genomics? Genomics broadly refers to the study of genomes, which are an organism’s complete set of DNA. This includes both genes and non-coding regions of DNA. Traditional genomics involves sequencing and analyzing the genome of individual species. 
Metagenomics expands genomics to look at the collective genomes of entire communities of organisms in an environmental sample, like soil. It allows researchers to study not just the genes of culturable or isolated organisms, but the entirety of genetic material present in a given environment. By using genomic techniques to survey the soil microbes, we can identify everything in the soil, including microbes that no one has identified before. We are doing both traditional genomics and metagenomics as part of BioDIGS. 1.2 What is data science? Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It includes collecting, cleaning, and combining data from multiple databases, exploring data and developing statistical and machine learning models to identify patterns in complex datasets, and creating tools to efficiently store, process, and access large amounts of data. 1.3 What is cloud computing? Cloud computing just means using the internet to get access to powerful computer resources like storage, servers, databases, networking tools, and specialized software programs. Instead of having to buy and maintain their own powerful computers, storage servers, and other systems, users can pay to use them through an internet connection as needed. Users only pay for what they need, when they actually use it, and professionals update and maintain the systems in large data centers. It is a particularly useful tool for researchers and students at smaller institutions with limited computational services, especially when working with complex databases. The genome assembly and analyses for BioDIGS have been done using the NHGRI AnVIL cloud computing platform, as well as Galaxy. 1.4 Why soil microbes? It can be challenging to include undergraduates in human genomic and health research, especially in a classroom context. Both human genetic data and human health data are protected data, which limits the sort of information students can access without undergoing specialized ethics training. However, the same sorts of data cleaning and analysis methods used for human genomic data are also used for microbial genomic data, which does not have the same sort of legal protections as human genetic data. This makes it ideal for training undergraduate students at the beginning of their careers and can be used to prepare students for future research in human genomics and health (Jurkowski2017?). Additionally, the microbes in the soil can have big impacts on our health (BrevikBurgess2014?). 1.5 Heavy metals and human health Human activities that change the landscape can also change what sorts of inorganic and abiotic compounds we find in the soil, particularly increasing the amount of heavy metals (Yan2020?). When cars drive on roads, compounds from the exhaust, oil, and other fluids might settle onto the roads and be washed into the soil. When we put salt on roads, parking lots, and sidewalks, the salts themselves will eventually be washed away and enter the ecosystem through both water and soil. Chemicals from factories and other businesses also leech into our environment. Previous research has demonstrated that in areas with more human activity, like cities, soils include greater concentrations of heavy metals than found in rural areas with limited human populations (Khan2023?; Wang2022?). 
Increased heavy metal concentrations also disproportionately affect lower-income and predominantly minority areas (Jones2022?). Research suggests that increased heavy metal concentration in soils has major impacts on the soil microbial community. In particular, increased heavy metal concentration is associated with an increase in soil bacteria that have antibiotic resistance markers (Gorovtsov2018?; Nguyen2019?; Sun2021?). "],["research-team.html", "Chapter 2 Research Team 2.1 Soil sampling", " Chapter 2 Research Team This project is coordinated by the Genomics Data Science Community Network (GDSCN). You can read more about the GDSCN and its mission at the network website. 2.1 Soil sampling This map shows the current sampling locations for the BioDIGS project. The extensive network of the GDSCN has made this data collection possible. Soil sampling for this project was done by both faculty and student volunteers from schools that aren’t traditional R1 research institutions. Many of the faculty are also members of the GDSCN. This list of locations reflects GDSCN institutions and friends of GDSCN who have collected soil samples. Annandale, VA: Northern Virginia Community College Atlanta, GA: Spelman College Baltimore, MD: College of Southern Maryland, Notre Dame College of Maryland, Towson University Bismark, ND: United Tribes Technical College El Paso, TX: El Paso Community College, The University of Texas at El Paso Fresno, CA: Clovis Community College Greensboro, NC: North Carolina A&T State University Harrisonburg, VA: James Madison University Honolulu, Hawai’i: University of Hawai’i at Mānoa Las Cruces, NM: Doña Ana Community College Montgomery County, MD: Montgomery College, Towson University Nashville, TN: Meharry Medical College New York, NY: Guttman Community College CUNY Petersburg, VA: Virginia State University Seattle, WA: North Seattle College, Pierce College Tsaile, AZ: Diné College "],["support.html", "Chapter 3 Support 3.1 Funding 3.2 Sponsors 3.3 Analytical and Computational Support", " Chapter 3 Support This project would not be possible without financial and technical support from many organizations and people. 3.1 Funding Funding for this project has been provided by the National Human Genome Research Institute (Contract # 75N92022P00232 awarded to Johns Hopkins University). 3.2 Sponsors PacBio and CosmosID have graciously donated supplies. Advances in Genome Biology and Technology provided funding support for several team members to attend AGBT 2024. 3.3 Analytical and Computational Support Computational support has been provided by NHGRI’s AnVIL cloud computing platform and Galaxy. "],["biodigs-data.html", "Chapter 4 BioDIGS Data 4.1 Sample Metadata 4.2 Soil Testing Data 4.3 Genomics and Metagenomics Data", " Chapter 4 BioDIGS Data There are currently three major kinds of data available from BioDIGS: sample metadata, soil testing data, and genomics and metagenomics data. All of these are available for use in your classroom. 4.1 Sample Metadata This dataset contains information about the samples themselves, including GPS coordinates for the sample location, date the sample was taken, and the site name. This dataset is also available from the BioDIGS website You can also see images of each sampling site and soil characteristics at the sample map. 4.2 Soil Testing Data This dataset includes basic information about the soil itself like pH, percentage of organic matter, variety of soil metal concentrations. The complete data dictionary is available here. 
The dataset is available at the BioDIGS website. This dataset was generated by the Delaware Soil Testing Program at the University of Delaware. 4.3 Genomics and Metagenomics Data You can access this data in both raw and processed forms. The Illumina and Nanopore sequences were generated at the Johns Hopkins University Genetic Resources Core Facility. PacBio sequencing was done by PacBio directly. More information coming soon! "],["billing.html", "Chapter 5 Billing 5.1 Create Google Billing Account 5.2 Add Terra to Google Billing Account 5.3 Add Members to Google Billing Account 5.4 Set Alerts for Google Billing 5.5 View Spend for Google Billing 5.6 Create Terra Billing Project 5.7 Add Member to Terra Billing Project 5.8 Disable Terra Billing Project", " Chapter 5 Billing In order to use AnVIL, you will need to set up a billing account and add members to it. These sections guide you through that process. 5.1 Create Google Billing Account Log in to the Google Cloud Platform console using your Google ID. Make sure to use the same Google account ID you use to log into Terra. If you are a first time user, don’t forget to claim your free credits! If you haven’t been to the console before, once you accept the Terms of Service you will be greeted with an invitation to “Try for Free.” Follow the instructions to sign up for a Billing Account and get your credits. Choose “Individual Account”. This “billing account” is just for managing billing, so you don’t need to be able to add your team members. You will need to give either a credit card or bank account for security. Don’t worry! You won’t be billed until you explicitly turn on automatic billing. You can view and edit your new Billing Account, by selecting “Billing” from the left-hand menu, or going directly to the billing console console.cloud.google.com/billing Clicking on the Billing Account name will allow you to manage the account, including accessing reports, setting alerts, and managing payments and billing. At any point, you can create additional Billing Accounts using the Create Account button. We generally recommend creating a new Billing Account for each funding source. 5.2 Add Terra to Google Billing Account This gives Terra permission to create projects and send charges to the Google Billing Account, and must be done by an administrator of the Google Billing Account. Terra needs to be added as a “Billing Account User”: Log in to the Google Cloud Platform console using your Google ID. Navigate to Billing You may be automatically directed to view a specific Billing Account. If you see information about a single account rather than a list of your Billing Accounts, you can get back to the list by clicking “Manage Billing Accounts” from the drop-down menu. Check the box next to the Billing Account you wish to add Terra to, click “ADD MEMBER”. Enter terra-billing@terra.bio in the text box. In the drop-down menu, mouse over Billing, then choose “Billing Account User”. Click “SAVE”. 5.3 Add Members to Google Billing Account Anyone you wish to add to the Billing Account will need their own Google ID. To add a member to a Billing Project: Log in to the Google Cloud Platform console using your Google ID. Navigate to Billing You may be automatically directed to view a specific Billing Account. If you see information about a single account rather than a list of your Billing Accounts, you can get back to the list by clicking “Manage Billing Accounts” from the drop-down menu. 
Check the box next to the Billing Account you wish to add a member to, click “ADD MEMBER”. Enter their Google ID in the text box. In the drop-down menu, mouse over Billing, then choose the appropriate role. Click “SAVE”. 5.4 Set Alerts for Google Billing Log in to the Google Cloud Platform console using the Google ID associated with your Google Cloud projects. Open the dropdown menu on the top left and click on Billing. You may be automatically directed to view a specific Billing Account. If you see information about a single account (and it’s not the one you’re interested in), you can get back to the list of all your Billing Accounts by clicking “Manage Billing Accounts” from the drop-down menu. Click on the name of the Billing Account you want to set alerts for. In the left-hand menu, click “Budgets & alerts”. Click the “Create Budget” tab. Enter a name for your budget, and then choose which projects you want to monitor. Then click “Next”. For Budget Type, select “Specified amount”. Enter the total budget amount for the month (you will set alerts at different thresholds in the next step). Click “Next” (do not click “Finish”). Enter the threshold amounts where you want to receive an alert. We recommend starting with 50% and 90%. You can set other alerts if you prefer. Check the box for “Email alerts to billing admins and users”, then click “Finish”. Now you (as the owner and admin), along with anyone you added with admin or user privileges (e.g. lab managers) will receive alerts when your lab members reach the specified spending thresholds. These emails will be sent to the Gmail accounts associated with the Billing Account. You can edit your budgets at any time by going to Billing > Budgets & alerts, and clicking on the name of the budget you want to edit. 5.5 View Spend for Google Billing You can always check your current spend through the Google Billing console, but remember There is a reporting delay (~1 day), so you cannot immediately see what an analysis cost Costs are reported at the level of Workspaces, so if there are multiple people using a Workspace, you will not be able to determine which of them was responsible for the charges. The Google Billing console displays information by Billing Account. To view spending: Log in to the Google Cloud Platform console using the Google ID associated with your Google Cloud projects. Open the dropdown menu on the top left and click on Billing. You may be automatically directed to view a specific Billing Account. If you see information about a single account (and it’s not the one you’re interested in), you can get back to the list of all your Billing Accounts by clicking “Manage Billing Accounts” from the drop-down menu. Click on the name of the Billing Account for the project you want to view. Look at the top of the Overview tab to see your month-to-date spending. Scroll further down the Overview tab to show your top projects. Click on the Reports tab to see more detailed information about each of your projects. This is probably the most useful tab for exploring costs of individual projects over time. Click on the Cost table tab to obtain a convenient table of spending per project. 5.6 Create Terra Billing Project Launch Terra and sign in with your Google account. If this is your first time logging in to Terra, you will need to accept the Terms of Service. In the drop-down menu on the left, navigate to “Billing”. Click the triple bar in the top left corner to access the menu. 
Click the arrow next to your name to expand the menu, then click “Billing”. You can also navigate there directly with this link: https://anvil.terra.bio/#billing On the Billing page, click the “+ CREATE” button to create a new Billing Project. Select GCP Billing Project (Google’s Platform). If prompted, select the Google account to use and give Terra permission to manage Google Cloud Platform billing accounts. Enter a unique name for your Terra Billing Project and select the appropriate Google Billing Account. The name of the Terra Billing Project must: Only contain lowercase letters, numbers and hyphens Start with a lowercase letter Not end with a hyphen Be between 6 and 30 characters Select the Google Billing Account to use. All activities conducted under your new Terra Billing Project will charge to this Google Billing Account. If prompted, give Terra permission to manage Google Cloud Platform billing accounts. Click “Create”. Your new Billing Project should now show up in the list of Billing Projects Owned by You. You can add additional members or can modify or deactivate the Billing Project at any time by clicking on its name in this list. The page doesn’t always update as soon as the Billing Project is created. If it’s been a couple of minutes and you don’t see a change, try refreshing the page. 5.7 Add Member to Terra Billing Project Launch Terra and sign in with your Google account. In the drop-down menu on the left, navigate to “Billing”. Click the triple bar in the top left corner to access the menu. Click the arrow next to your name to expand the menu, then click “Billing”. You can also navigate there directly with this link: https://anvil.terra.bio/#billing Click “Owned by You” and find the Billing Project. If you do not see the Billing Project in this list, then you are not an Owner and do not have permission to add members. Click on the name of the Billing Project. Click on the “Members” tab to view and manage members. Then click the “Add User” button. Enter the email address of the user or group you’d like to add the the Billing Project. If adding an individual, make sure to enter the account that they use to access AnVIL. If adding a Terra Group, use the Group email address, which can be found on the Terra Group management page. If this user or group will need to add and remove other users of the Billing Project, check the Owner box. Otherwise leave it unchecked. It’s often a good idea to have at least one other Owner of a Billing Project in order to avoid getting locked out, in case the original owner leaves or loses access to their account. Click “ADD USER”. You should now see the user or group listed in the Billing Project members, along with the appropriate role. They should now be able to use the Billing Project to fund work on AnVIL. If you need to remove members or modify their roles, you can do so at any time by clicking the teardrop button next to their name. 5.8 Disable Terra Billing Project By default this module includes a warning to make sure people understand they will lose access to their Workspace buckets. You can remove the warning from this module by setting AnVIL_module_settings$warning to FALSE before running cow::borrow_chapter: AnVIL_module_settings <- list( warning = FALSE ) cow::borrow_chapter( doc_path = "child/_child_terra_billing_project_disable.Rmd", repo_name = "jhudsl/AnVIL_Template" ) Disabling a Billing Project makes Workspace contents inaccessible! Disabling a Billing Project disables funding to all Workspaces funded by the Billing Project. 
You will be unable to compute in these Workspaces, and you will lose access to any data stored in the Workspace buckets. It is sometimes possible to restore access by reactivating billing, but Google makes no promises about whether or how long the data will be recoverable. Make sure everyone with Workspaces funded by the Billing Project has saved anything they want to keep in another location before disabling the Billing Project. To disable a Terra Billing Project (i.e. remove the Google Billing Account that funds the Terra Billing Project): Launch Terra and sign in with your Google account. In the drop-down menu on the left, navigate to “Billing”. Click the triple bar in the top left corner to access the menu. Click the arrow next to your name to expand the menu, then click “Billing”. You can also navigate there directly with this link: https://anvil.terra.bio/#billing Click “Owned by You” and find the Billing Project. If you do not see the Billing Project in this list, then you are not an Owner and do not have permission to add members. Click on the name of the Billing Project. If you don’t see information about the Billing Account, click on “View billing account” to expand the Billing Account information. You may be prompted to enter your login information again. You should see the name of the Google Billing Account that is funding this Terra Billing Project. Click on the teardrop icon next to the name of the Billing Account. Click “Remove Billing Account”. Click OK to confirm that you want to disable funding for this Billing Project. The page should now indicate that there is no linked billing account. If necessary, you can restore funding to the Billing Project and associated Workspaces by clicking the teardrop icon and selecting “Change Billing Account”. However, Google makes no promises about how long the Workspace contents will remain available after you disable funding, so it is best not to rely on them. "],["notes-for-instructors.html", "Chapter 6 Notes for Instructors", " Chapter 6 Notes for Instructors Although AnVIL is the preferred computational platform for the GDSCN, all activities can be run on different platforms. R-based activities can be run on your own personal installation of R or Posit(formerly called RStudio), depending on your needs. Galaxy-based activities can be run on both Galaxy on AnVIL and on the Galaxy web portal. You may also adapt these activities for other languages and platforms. "],["checklist-for-running-activities-on-anvil.html", "Chapter 7 Checklist for Running Activities on AnVIL Before the class begins After the class ends", " Chapter 7 Checklist for Running Activities on AnVIL If you choose to run these activities on AnVIL with your class, there are several things that you can do to make the experience easier. Before the class begins This checklist can serve as a reminder of the overall suggested steps to run an activity on AnVIL. You might find yourself changing these steps slightly as you become more familiar with AnVIL. 
Billing Obtain funding through the STRIDES program (optional) Request students make AnVIL IDs (Google IDs) Collect AnVIL IDs (Google IDs) from students Create Google Billing Account for your class Resources Create a Workspace for your class (optional) Notify Terra of your course dates and times Direct students to the Workspace Permissions Set up Groups to manage permissions AnVIL Group Class Workspace Terra Billing Projects* Instructor Owner Owner Teaching assistants Writer Owner Students Reader User After the class ends Resources Remind students to download any files they might need Tell students to delete their environments and persistent disks Billing Deactivate billing project "],["setting-up-billing-on-anvil.html", "Chapter 8 Setting up Billing on AnVIL 8.1 Creating a billing project 8.2 Adding Instructors as “Owner” 8.3 Adding Students as “User” 8.4 Understanding the various billing costs 8.5 Estimating costs before the class begins 8.6 How much does a class cost?", " Chapter 8 Setting up Billing on AnVIL The following will help you set up billing for your class. You will: * Set up a billing project for tracking costs * Add yourself and students to the billing project to grant permission to AnVIL resources * Learn about different sources of costs in AnVIL * Estimate costs for your class * Learn about how to track costs during your class 8.1 Creating a billing project First, create the Billing Project. Billing Project names must be globally unique and cannot exceed 30 characters. We suggest the name of the Billing Project should be a combination of institution-class- (e.g., “jhu-bmr2021-bill-1”). To create a Billing Project: Go to https://anvil.terra.bio/#billing Click “+CREATE” Type in your Billing Project name Select the appropriate Billing Account Click “CREATE BILLING PROJECT” You now have a unique Billing Project. 8.2 Adding Instructors as “Owner” Next, you want to give instructors permission to use the Billing Project to compute. To set instructor permissions: Go to https://anvil.terra.bio/#billing Select the “Owned by You” Billing Project sub-list Select the Billing Project you made in Instructor Billing Project Select the “Users” tab Click “+ Add User”. You will be prompted to add a “User email *”. Begin typing the instructor Group name set up in Instructor Group. You should see an email in the form (firecloud.org?) (e.g., jhu-bmr2021-instructors@firecloud.org). Ensure “Can manage users (Owner)” is selected Click “ADD USER” This step makes it so that co-instructors can edit permissions and administer the Billing Project as needed. While this means you and co-instructors can compute using the student Billing Project, this makes spending difficult to track. Instructors should always use the instructor Workspace to compute. This makes it much easier to track costs associated with instructors versus students. 8.3 Adding Students as “User” Next, you will add your student Group to the Billing Project so that they can compute. To set student permissions: Go to https://anvil.terra.bio/#billing Select the “Owned by You” Billing Project sub-list Select the Billing Project you made in Billing Project Select the “Users” tab Click “+ Add User”. You will be prompted to add a “User email *”. Begin typing the student Group name set up in Student Group. You should see an email in the form (firecloud.org?) (e.g., jhu-bmr2021-students@firecloud.org). Keep “Can manage users (Owner)” deselected. 
Click “ADD USER” 8.4 Understanding the various billing costs Costs in AnVIL fall into one of three categories: compute costs, storage costs, and network usage (egress) costs. Compute costs are those that students accrue when actively using an AnVIL Workspace. Students can clone a Workspace for no cost, but they will begin to accrue costs as soon as they set up a cloud environment. Compute costs are based on how many CPUs you need, as well as how much memory and storage space you choose. You can also pause the Workspace and pay a lower cost per hour than if you were to keep the Workspace running. Current prices can be found here. Storage costs are driven by the persistent disk. The persistent disk allows you to store data and installed programs/libraries for a low cost. Students can delete their Workspaces but maintain their persistent disk so they still have access to previous programs they have installed and previous files they’ve created. Current prices can be found here. Finally, network usage costs are those involved with transferring data between networks or downloading data from the cloud to your local computer. Current prices can be found here. 8.5 Estimating costs before the class begins AnVIL has a free AnVIL_Cost_Estimator that allows you estimate compute, storage, and network usage costs for your class. This is a Google sheet that you can tailor to fit your needs. Before you use it, make sure the prices are up to date by following the links at the bottom of the sheet. If you need to create a Budget Justification for your class, you can also use the free AnVIL_Budget_Justification template. 8.6 How much does a class cost? One of the advantages of billing projects in Terra is that you can keep track of the costs during real time. You can see how much each Workspace is costing while your course is happening, so there are no unexpected surprises at the end! Full details about billing in Terra can be found here. These instructions are adapted from Terra Support. To view the costs being accrued by each billing project, you can go to https://console.cloud.google.com/billing. At the top of the page, there is a dropdown menu. Choose the billing project name you’d like to view. Once you are in proper billing project, you click on “View detailed charges” in the Billing section on the far right. This takes you to a report of the detailed charges accrued by the billing account. Here, you will be able to see the total cost over a time range, as well as costs broken down by services. "],["setting-up-the-class-activity.html", "Chapter 9 Setting up the Class Activity 9.1 Overview of Class Setup 9.2 Collect Google IDs 9.3 Set Up Groups 9.4 Set Up Billing Projects 9.5 Set Permissions on the Workspace 9.6 Notify Terra", " Chapter 9 Setting up the Class Activity 9.1 Overview of Class Setup This section will show you how to organize your class to make it easier to administer access to your content. You will need to have a list of who will be taking your class, such as a course roster or sign-up list, as well as a list of additional instructors or teaching assistants. You can make changes later, so the list of students need not be final. 9.2 Collect Google IDs AnVIL IDs are based on Google accounts. Students – Contact students/participants to get their AnVIL IDs. These should be Gmail addresses or emails with GSuite capabilities. You can link students to Student Account Setup for instructions on what they should do. 
Co-instructors – If you will be working with other instructors, such as co-instructors or teaching assistants, you will need to collect their IDs as well. 9.3 Set Up Groups Reminder: Google Billing Accounts are managed on Google Cloud Platform and are used for organizing funding sources (e.g. credit cards, cloud credits). Terra Billing Projects are managed through Terra, and allow you to associate your Terra activity with the correct Google Billing Account. For a more detailed explanation, please see the chapter on Account Setup. We suggest creating two different Terra Billing Projects under the appropriate Billing Account that you created on cloud.google.com: one for students and one for co-instructors. The instructions below will walk you through how to set this up. Groups enable you to share your class Workspace and manage permissions for many people at once. We recommend starting with one Group for instructors and one Group for students. Instructor Group {#instructor-group} Create an informative, unique Group name for any co-instructors and teaching assistants. We suggest a combination of institution-class-role (e.g., “jhu-bmr2021-instructors”). Only letters, numbers, underscores, and dashes are allowed in Group names. To create a Group for instructors: Go to https://anvil.terra.bio/#groups Click “+ Create a New Group” Type in your instructor Group name Click “CREATE GROUP” You now have a unique instructor Group. Add Instructors as “Admin” (Instructor Group) Now that your instructor Group has been created, you should add any additional instructors. You should also ensure that they have the correct permissions. Go to https://anvil.terra.bio/#groups/ and click on the instructor Group name. This page should also be visible at https://anvil.terra.bio/#groups/<group-name>. Click on “+Add User”. You will be prompted to add the instructor’s AnVIL ID. Type in the instructor’s AnVIL ID Make sure “Can manage users (admin)” is selected Click ADD USER. This will take you back to the Group administration page. Make sure the newly added instructor displays “Admin” under “Roles” beside their AnVIL ID. Repeat this process for any additional co-instructors and teaching assistants. Student Group {#student-group} Next, you will create a Group for your students. Create an informative, unique Group name. We suggest a combination of institution-class-role (e.g., “jhu-bmr2021-students”). Only letters, numbers, underscores, and dashes are allowed in Group names. To create a Group for students: Go to https://anvil.terra.bio/#groups Click “+ Create a New Group” Type in your student Group name Click “CREATE GROUP” You now have a unique student Group. Add Instructors as “Admin” (Student Group) The next steps ensure any additional co-instructors and teaching assistants are able to administer the student Group in case you are unavailable. Follow the steps below to add each co-instructor in the student Group: Go to https://anvil.terra.bio/#groups/ and click on the student Group name. This page should be visible at https://anvil.terra.bio/#groups/<group-name>. Click on “+Add User”. You will be prompted to add the instructor’s AnVIL ID. Type in the instructor’s AnVIL ID Make sure “Can manage users (admin)” is selected Click ADD USER. This will take you back to the Group administration page. Make sure the newly added instructor displays “Admin” under “Roles” beside their AnVIL ID. Repeat this process for any additional co-instructors and teaching assistants. 
Add Students as “Member” Follow the steps below to add individual students to the student Group: Go to https://anvil.terra.bio/#groups/ and click on the student Group name. This page should be visible at https://anvil.terra.bio/#groups/<group-name>. Click on “+Add User”. You will be prompted to add an AnVIL ID. Type in the student’s AnVIL ID Click ADD USER Make sure the newly added student displays “Member” under “Roles” beside their AnVIL ID. At present, each student’s AnVIL ID must be added separately. Your instructor and student Groups are now set up. Group Email Lists Note that your newly created Groups have Group emails associated with them. Take note of these Group emails. You will use them for granting access to your class Billing Projects and Workspaces in the next steps. 9.4 Set Up Billing Projects Billing Projects in Terra help organize your compute costs. Like Groups, we suggest creating two different billing projects under the appropriate Billing Account that you created on cloud.google.com: one for students and one for co-instructors. Billing Project names must be globally unique and cannot exceed 30 characters. Instructor Billing Project {#instructor-billing-project} First, create the Billing Project for instructors. We suggest the name of the Billing Project should be a combination of institution-class-role (e.g., “jhu-bmr2021-instructors-bill-1”). To create a Billing Project for instructors: Go to https://anvil.terra.bio/#billing Click “+CREATE” Type in your instructor Billing Project name Select the appropriate Billing Account Click “CREATE BILLING PROJECT” You now have a unique instructor Billing Project. Add Instructors as “Owner” (Instructor Project) Next, you want to give instructors permission to use the Billing Project to compute. To set instructor permissions: Go to https://anvil.terra.bio/#billing Select the “Owned by You” Billing Project sub-list Select the Billing Project you made for instructors in Instructor Billing Project Select the “Users” tab Click “+ Add User”. You will be prompted to add a “User email *”. Begin typing the instructor Group name set up in Instructor Group. You should see an email in the form (firecloud.org?) (e.g., jhu-bmr2021-instructors@firecloud.org). Ensure “Can manage users (Owner)” is selected Click “ADD USER” Your instructor Billing Project is now set up. Student Billing Project {#student-billing-project} Next, create a student Billing Project. Again, we suggest a combination of institution-class-role (e.g., “jhu-bmr2021-students-bill-1”). To create a Billing Project for students: Go to https://anvil.terra.bio/#billing Click “+CREATE” Type in your student Billing Project name Select the appropriate Billing Account (same as above) Click “CREATE BILLING PROJECT” You now have a unique student Billing Project. Add Instructors as “Owner” (Student Project) You want to ensure any additional co-instructors and teaching assistants are able to administer the student Billing Project in case you are unavailable. To set instructor permissions: Go to https://anvil.terra.bio/#billing Select the “Owned by You” Billing Project sub-list Select the Billing Project you made for students in Student Billing Project Select the “Users” tab Click “+ Add User”. You will be prompted to add a “User email *”. Begin typing the instructor Group name set up in [### Set Up Groups]. You should see an email in the form (firecloud.org?) (e.g., jhu-bmr2021-instructors@firecloud.org). 
Ensure “Can manage users (Owner)” is selected Click “ADD USER” This step makes it so that co-instructors can edit permissions and administer the student Billing Project as needed. While this means you and co-instructors can compute using the student Billing Project, this makes spending difficult to track. Instructors should always use the instructor Billing Project to compute. This makes it much easier to track costs associated with instructors versus students. Add Students as “User” Next, you will add your student Group to the student Billing Project so that they can compute. To set student permissions: Go to https://anvil.terra.bio/#billing Select the “Owned by You” Billing Project sub-list Select the Billing Project you made for students in Student Billing Project Select the “Users” tab Click “+ Add User”. You will be prompted to add a “User email *”. Begin typing the student Group name set up in Student Group. You should see an email in the form (firecloud.org?) (e.g., jhu-bmr2021-students@firecloud.org). Keep “Can manage users (Owner)” deselected. Click “ADD USER” Your student Billing Project is now set up. 9.5 Set Permissions on the Workspace Finally, you will want to set up permissions for co-instructors and students to see the class Workspace you created with the development Billing Project in Developing Content. AnVIL users can take on the “Owner”, “Writer”, or “Reader” role for a Workspace. Add Instructors as “Owner” You should add your co-instructors and teaching assistants as “Owners” to the class Workspace. This is useful if they need to edit the course content or share the space with students on your behalf. To share and change permissions: Go to https://anvil.terra.bio/#workspaces and find your class Workspace you set up in Developing Content Click the teardrop button for your class Workspace Click “Share”. This will open a dialog box. Enter the name of the instructor Group (e.g., jhu-bmr2021-instructors). This will create a dropdown for the Group permissions in the box. Select this Group. Change permissions to “Owner” using the dropdown menu under the instructor Group Click “SAVE” This step makes it so that co-instructors can edit the original content of the Workspace as needed. While this means you and co-instructors can compute using the development Billing Project, this makes spending difficult to track. Instructors should instead clone the Workspace using the instructor Billing Project. This makes it much easier to track costs associated with this iteration of your class versus further iterations (e.g., the following semester or year). Add Students as “Reader” Next, add your students as “Readers” to the class Workspace. This means they will be able to view and clone the Workspace, but not make edits or perform computations. To share the Workspace: Click the teardrop button for your class Workspace Click “Share”. This will open a dialog box. Enter the name of the student Group. This will create a dropdown for the Group permissions in the box. Select this Group. Ensure permissions are set to “Reader” (default) Click “SAVE” You have now correctly set up your class permissions! 9.6 Notify Terra Contacting Terra ahead of your class time helps the Terra team avoid any major disruptions to your class. Contact Terra by submitting a request for a hold on scheduled maintenance and downtime. It’s also a good idea to ask about major changes planned for the time prior to your class. 
"],["getting-credit-for-professional-development.html", "Chapter 10 Getting Credit for Professional Development", " Chapter 10 Getting Credit for Professional Development We are happy to provide a letter to your supervisor, department head, or dean to indicate you’ve worked through this content and intend to use it in your class. "],["anvil-workspace.html", "Chapter 11 AnVIL Workspace 11.1 Create Google Account 11.2 Clone the Workspace", " Chapter 11 AnVIL Workspace You can easily access the data on AnVIL by cloning the dedicated workspace. These sections guide you through creating an AnVIL account and accessing the workspace. 11.1 Create Google Account If you do not already have a Google account that you would like to use for accessing Terra, create one now. If you would like to create a Google account that is associated with your non-Gmail, institutional email address, follow these instructions. 11.2 Clone the Workspace Launch Terra Locate the Workspace you want to clone. If a Workspace has been shared with you ahead of time, it will appear in “MY WORKSPACES”. You can clone a Workspace that was shared with you to perform your own analyses. In the screenshot below, no Workspaces have been shared. If a Workspace hasn’t been shared with you, navigate to the “FEATURED” or “PUBLIC” Workspace tabs. Use the search box to find the Workspace you want to clone. Click the teardrop button on the far right next to the Workspace you want to clone. Click “Clone”. You can also clone the Workspace from the Workspace Dashboard instead of the search results. You will see a popup box appear. Name your Workspace and select the appropriate Terra Billing Project. All activity in the Workspace will be charged to this Billing Project (regardless of who conducted it). Remember that each Workspace should have its own Billing Project. If you are working with protected data, you can set the Authorization Domain to limit who can be added to your Workspace. Note that the Authorization Domain cannot be changed after the Workspace is created (i.e. there is no way to make this Workspace shareable with a larger audience in the future). Workspaces by default are only visible to people you specifically share them with. Authorization domains add an extra layer of enforcement over privacy, but by nature make sharing more complicated. We recommend using Authorization Domains in cases where it is extremely important and/or legally required that the data be kept private (e.g. protected patient data, industry data). For data you would merely prefer not be shared with the world, we recommend relying on standard Workspace sharing permissions rather than Authorization Domains, as Authorization Domains can make future collaborations, publications, or other sharing complicated. Click “CLONE WORKSPACE”. The new Workspace should now show up under your Workspaces. "],["using-rstudio-on-anvil.html", "Chapter 12 Using RStudio on AnVIL 12.1 Video overview of RStudio on AnVIL 12.2 Launching RStudio 12.3 Touring RStudio 12.4 Pausing RStudio", " Chapter 12 Using RStudio on AnVIL In the next few steps, you will walk through how to get set up to use RStudio on the AnVIL platform. AnVIL is centered around different “Workspaces”. Each Workspace functions almost like a mini code laboratory - it is a place where data can be examined, stored, and analyzed. The first thing we want to do is to copy or “clone” a Workspace to create a space for you to experiment. Use a web browser to go to the AnVIL website. 
In the browser type: anvil.terra.bio Tip At this point, it might make things easier to open up a new window in your browser and split your screen. That way, you can follow along with this guide on one side and execute the steps on the other. Your instructor will give you information on which workspace you should clone. 12.1 Video overview of RStudio on AnVIL Here is a video tutorial that describes the basics of using RStudio on AnVIL. 12.1.1 Objectives Start compute for your RStudio environment Tour RStudio on AnVIL Stop compute to minimize expenses 12.1.2 Slides The slides for this tutorial are are located here. 12.2 Launching RStudio AnVIL is very versatile and can scale up to use very powerful cloud computers. It’s very important that you select a cloud computing environment appropriate to your needs to avoid runaway costs. If you are uncertain, start with the default settings; it is fairly easy to increase your compute resources later, if needed, but harder to scale down. Note that, in order to use RStudio, you must have access to a Terra Workspace with permission to compute (i.e. you must be a “Writer” or “Owner” of the Workspace). Open Terra - use a web browser to go to anvil.terra.bio In the drop-down menu on the left, navigate to “Workspaces”. Click the triple bar in the top left corner to access the menu. Click “Workspaces”. Click on the name of your Workspace. You should be routed to a link that looks like: https://anvil.terra.bio/#workspaces/<billing-project>/<workspace-name>. Click on the cloud icon on the far right to access your Cloud Environment options. If you don’t see this icon, you may need to scroll to the right. In the dialogue box, click the “Settings” button under RStudio. You will see some configuration options for the RStudio cloud environment, and a list of costs because it costs a small amount of money to use cloud computing. Configure any settings you need for your cloud environment. If you are uncertain about what you need, the default configuration is a reasonable, cost-conservative choice. It is fairly easy to increase your compute resources later, if needed, but harder to scale down. Scroll down and click the “CREATE” button when you are satisfied with your setup. The dialogue box will close and you will be returned to your Workspace. You can see the status of your cloud environment by hovering over the RStudio icon. It will take a few minutes for Terra to request computers and install software. When your environment is ready, its status will change to “Running”. Click on the RStudio logo to open a new dialogue box that will let you launch RStudio. Click the launch icon to open RStudio. This is also where you can pause, modify, or delete your environment when needed. You should now see the RStudio interface with information about the version printed to the console. 12.3 Touring RStudio Next, we will be using RStudio and the package Glimma to create interactive plots. See this vignette for more information. The Bioconductor team has created a very useful package to programmatically interact with Terra and Google Cloud. Install the AnVIL package. It will make some steps easier as we go along. You can now quickly install precompiled binaries using the AnVIL package’s install() function. We will use it to install the Glimma package and the airway package. The airway package contains a SummarizedExperiment data class. This data describes an RNA-Seq experiment on four human airway smooth muscle cell lines treated with dexamethasone. 
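A minimal sketch of that installation step, assuming the standard BiocManager route to Bioconductor (the Glimma and airway package names come from the text above; the exact chunk used in the workspace may differ):
# Sketch only -- assumes BiocManager is the install route; adjust to match your workspace.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")  # from CRAN
if (!requireNamespace("AnVIL", quietly = TRUE)) BiocManager::install("AnVIL")          # from Bioconductor
AnVIL::install(c("Glimma", "airway"))  # install() can use precompiled binaries, which is faster on a fresh cloud environment
library(Glimma)
library(airway)
data("airway")  # load the example SummarizedExperiment described above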
Note: for some of the packages, you will have to install packages from the CRAN repository using the install.packages() function. The examples will show you which install method to use. Load the example data. The multidimensional scaling (MDS) plot is frequently used to explore differences in samples. When this data is MDS transformed, the first two dimensions explain the greatest variance between samples, and the amount of variance decreases monotonically with increasing dimension. The following code will launch a new window where you can interact with the MDS plot. Change the colour_by setting to “groups” so you can easily distinguish between groups. In this data, the “group” is the treatment. You can download the interactive HTML file by clicking on “Save As”. You can also download plots and other files created directly in RStudio. To download the following plot, click on “Export” and save in your preferred format to the default directory. This saves the file in your cloud environment. You should see the plot in the “Files” pane. Select this file and click “More” > “Export”. Select “Download” to save the file to your local machine. 12.4 Pausing RStudio You can view costs and make changes to your cloud environments from the panel on the far right of the page. If you don’t see this panel, you may need to scroll to the right. Running environments will have a green dot, and paused environments will have an orange dot. Hovering over the RStudio icon will show you the costs associated with your RStudio environment. Click on the RStudio icon to open the cloud environment settings. Click the Pause button to pause RStudio. This will take a few minutes. When the environment is paused, an orange dot will be displayed next to the RStudio icon. If you hover over the icon, you will see that it is paused, and has a small ongoing cost as long as it is paused. When you’re ready to resume working, you can do so by clicking the RStudio icon and clicking Resume. The right-hand side icon reminds you that you are accruing cloud computing costs. If you don’t see this icon, you may need to scroll to the right. You should minimize charges when you are not performing an analysis. You can do this by clicking on the RStudio icon and selecting “Pause”. This will release the CPU and memory resources for other people to use. Note that your work will be saved in the environment and continue to accrue a very small cost. This work will be lost if the cloud environment gets deleted. If there is anything you would like to save permanently, it’s a good idea to copy it from your compute environment to another location, such as the Workspace bucket, GitHub, or your local machine, depending on your needs. You can also pause your cloud environment(s) at https://anvil.terra.bio/#clusters. "],["introduction.html", "Chapter 13 Introduction 13.1 Before You Start 13.2 Objectives", " Chapter 13 Introduction In this activity, you’ll have a chance to become familiar with the BioDIGS soil testing data. This dataset includes information on the inorganic components of each soil sample, particularly metal concentrations. Human activity can increase the concentration of inorganic compounds in the soil. 
When cars drive on roads, compounds from the exhaust, oil, and other fluids might settle onto the roads and be washed into the soil. When we put salt on roads, parking lots, and sidewalks, the salts themselves will eventually be washed away and enter the ecosystem through both water and soil. Chemicals from factories and other businesses also leach into our environment. All of this means the concentration of heavy metals and other chemicals will vary among the soil samples collected for the BioDIGS project. 13.1 Before You Start If you do not already have a Google account that you would like to use for accessing Terra, create one now. If you would like to create a Google account that is associated with your non-Gmail, institutional email address, follow these instructions. 13.2 Objectives This activity will teach you how to use the AnVIL platform to: Open data from an R package Examine objects in R Calculate summary statistics for variables in the soil testing data Create and interpret histograms and boxplots for variables in the soil testing data "],["part-1.-examining-the-data.html", "Chapter 14 Part 1. Examining the Data", " Chapter 14 Part 1. Examining the Data We will use the BioDIGS package to retrieve the data. We first need to install the package from where it is stored on GitHub. devtools::install_github("fhdsl/BioDIGSData") Once you’ve installed the package, we can load the library and assign the soil testing data to an object. This command follows the code structure: dataset_object_name <- stored_BioDIGS_dataset library(BioDIGSData) soil.values <- BioDIGS_soil_data() It seems like the dataset loaded, but it’s always a good idea to verify. There are many ways to check, but the easiest approach (if you’re using RStudio) is to look at the Environment tab on the upper right-hand side of the screen. You should now have an object called soil.values that includes some number of observations for 28 variables. The observations refer to the number of rows in the dataset, while the variables tell you the number of columns. As long as neither the observations nor the variables are 0, you can be confident that your dataset loaded. Let’s take a quick look at the dataset. We can do this by clicking on the soil.values object in the Environment tab. (Note: this is equivalent to typing View(soil.values) in the R console.) This will open a new window for us to scroll through the dataset. Well, the data definitely loaded, but those column names aren’t immediately understandable. What could As_EPA3051 possibly mean? In addition to the dataset, we need to load the data dictionary as well. Data dictionary: a file containing the names, definitions, and attributes about data in a database or dataset. In this case, the data dictionary can help us make sense of what sort of values each column represents. The data dictionary for the BioDIGS soil testing data is available in the R package (see code below), but we have also reproduced it here. ?BioDIGS_soil_data() site_id Unique letter and number site name full_name Full site name As_EPA3051 Arsenic (mg/kg), EPA Method 3051A. Quantities < 3.0 are not detectable. Cd_EPA3051 Cadmium (mg/kg), EPA Method 3051A. Quantities < 0.2 are not detectable. 
Cr_EPA3051 Chromium (mg/kg), EPA Method 3051A Cu_EPA3051 Copper (mg/kg), EPA Method 3051A Ni_EPA3051 Nickel (mg/kg), EPA Method 3051A Pb_EPA3051 Lead (mg/kg), EPA Method 3051A Zn_EPA3051 Zinc (mg/kg), EPA Method 3051A water_pH A-E_Buffer_pH OM_by_LOI_pct Organic Matter by Loss on Ignition P_Mehlich3 Phosphorus (mg/kg), using the Mehlich 3 soil test extractant K_Mehlich3 Potassium (mg/kg), using the Mehlich 3 soil test extractant Ca_Mehlich3 Calcium (mg/kg), using the Mehlich 3 soil test extractant Mg_Mehlich3 Magnesium (mg/kg), using the Mehlich 3 soil test extractant Mn_Mehlich3 Manganese (mg/kg), using the Mehlich 3 soil test extractant Zn_Mehlich3 Zinc (mg/kg), using the Mehlich 3 soil test extractant Cu_Mehlich3 Copper (mg/kg), using the Mehlich 3 soil test extractant Fe_Mehlich3 Iron (mg/kg), using the Mehlich 3 soil test extractant B_Mehlich3 Boron (mg/kg), using the Mehlich 3 soil test extractant S_Mehlich3 Sulfur (mg/kg), using the Mehlich 3 soil test extractant Na_Mehlich3 Sodium (mg/kg), using the Mehlich 3 soil test extractant Al_Mehlich3 Aluminum (mg/kg), using the Mehlich 3 soil test extractant Est_CEC Cation Exchange Capacity (meq/100g) at pH 7.0 (CEC) Base_Sat_pct Base saturation (BS). This represents the percentage of CEC occupied by bases (Ca2+, Mg2+, K+, and Na+). The %BS increases with increasing soil pH. The availability of Ca2+, Mg2+, and K+ increases with increasing %BS. P_Sat_ratio Phosphorus saturation ratio. This is the ratio between the amount of phosphorus present in the soil and the total capacity of that soil to retain phosphorus. The ability of phosphorus to be bound in the soil is primarily a function of iron (Fe) and aluminum (Al) content in that soil. Using the data dictionary, we find that the values in column As_EPA3051 give us the arsenic concentration in mg/kg of each soil sample, as determined by EPA Method 3051A. This method uses a combination of heat and acid to extract specific elements (like arsenic, cadmium, chromium, copper, nickel, lead, and zinc) from soil samples. While arsenic can occur naturally in soils, higher levels suggest the soil may have been contaminated by mining, hazardous waste, or pesticide application. Arsenic is toxic to humans. QUESTIONS: What data is found in the column labeled “Fe_Mehlich3”? Why would we be interested in how much of this is in the soil? (You may have to search the internet for this answer.) What data is found in the column labeled “Base_Sat_pct”? What does this variable tell us about the soil? We can also look at just the names of all the columns in the R console using the colnames() command. colnames(soil.values) ## [1] "site_id" "site_name" "type" "As_EPA3051" ## [5] "Cd_EPA3051" "Cr_EPA3051" "Cu_EPA3051" "Ni_EPA3051" ## [9] "Pb_EPA3051" "Zn_EPA3051" "water_pH" "OM_by_LOI_pct" ## [13] "P_Mehlich3" "K_Mehlich3" "Ca_Mehlich3" "Mg_Mehlich3" ## [17] "Mn_Mehlich3" "Zn_Mehlich3" "Cu_Mehlich3" "Fe_Mehlich3" ## [21] "B_Mehlich3" "S_Mehlich3" "Na_Mehlich3" "Al_Mehlich3" ## [25] "Est_CEC" "Base_Sat_pct" "P_Sat_ratio" "region" Most of the column names are found in the data dictionary, but the very last column (“region”) isn’t. How peculiar! Let’s look at what sort of values this particular column contains. The tab with the table of the soil.values object should still be open in the upper left pane of the RStudio window. If not, you can open it again by clicking on soil.values in the Environment pane, or by using the View() command. 
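If you prefer to check the loaded object from the console first, a few general-purpose commands give the same information (a sketch; dim(), head(), and dplyr::glimpse() are standard functions, not part of the original activity code):
dim(soil.values)             # number of observations (rows) and variables (columns)
head(soil.values)            # print the first few rows in the console
dplyr::glimpse(soil.values)  # compact overview of every column and its type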
If you scroll to the end of the table, we can see that “region” seems to refer to the city or area where the samples were collected. For example, the first 6 samples all come from Baltimore City. You may notice that some cells in the soil.values table contain NA. This just means that the soil testing data for that sample isn’t available yet. We’ll take care of those values in the next part. QUESTIONS: How many observations are in the soil testing values dataset that you loaded? What does each of these observations refer to? How many different regions are represented in the soil testing dataset? How many of them have soil testing data available? "],["part-2.-summarizing-the-data-with-statistics.html", "Chapter 15 Part 2. Summarizing the Data with Statistics", " Chapter 15 Part 2. Summarizing the Data with Statistics Now that we have the dataset loaded, let’s explore the data in more depth. First, we should remove those samples that don’t have soil testing data yet. We could keep them in the dataset, but removing them at this stage will make the analysis a little cleaner. In this case, as we know the reason the data are missing (and that reason will not skew our analysis), we can safely remove these samples. This will not be the case for every data analysis. We can remove the unanalyzed samples using the drop_na() function from the tidyr package. This function removes any rows from a table that contain NA for a particular column. This command follows the code structure: dataset_new_name <- dataset %>% drop_na(column_name) The `%>%` is called a pipe, and it tells R that the commands after it should all be applied to the object in front of it. (In this case, we can filter out all samples missing a value for “As_EPA3051” as a proxy for samples without soil testing data.) library(tidyr) soil.values.clean <- soil.values %>% drop_na(As_EPA3051) Great! Now let’s calculate some basic statistics. For example, we might want to know what the mean (average) arsenic concentration is for all the soil samples. We can use a combination of two functions: pull() and mean(). pull() lets you extract a column from your table for statistical analysis, while mean() calculates the average value for the extracted column. This command follows the code structure: OBJECT %>% pull(column_name) %>% mean() pull() is a command from the tidyverse package, so we’ll need to load that library before our command. library(tidyverse) soil.values.clean %>% pull(As_EPA3051) %>% mean() ## [1] 5.10875 We can run similar commands to calculate the standard deviation (sd), minimum (min), and maximum (max) for the soil arsenic values. soil.values.clean %>% pull(As_EPA3051) %>% sd() ## [1] 5.606926 soil.values.clean %>% pull(As_EPA3051) %>% min() ## [1] 0 soil.values.clean %>% pull(As_EPA3051) %>% max() ## [1] 27.3
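As an optional shortcut (not part of the original activity), base R’s summary() reports several of these statistics in a single call. A minimal sketch using the same pull() pattern:

# summary() prints the minimum, quartiles, median, mean, and maximum of the extracted column
soil.values.clean %>% pull(As_EPA3051) %>% summary()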
The soil testing dataset contains samples from multiple geographic regions, so maybe it’s more meaningful to find out what the average arsenic values are for each region. We have to do a little bit of clever coding trickery for this using the group_by and summarize functions. First, we tell R to split our dataset up by a particular column (in this case, region) using the group_by function, then we tell R to summarize the mean arsenic concentration for each group. When using the summarize function, we tell R to make a new table (technically, a tibble in R) that contains two columns: the column used to group the data and the statistical measure we calculated for each group. This command follows the code structure: dataset %>% group_by(column_name) %>% summarize(mean(column_name)) soil.values.clean %>% group_by(region) %>% summarize(mean(As_EPA3051)) ## # A tibble: 2 × 2 ## region `mean(As_EPA3051)` ## <chr> <dbl> ## 1 Baltimore City 5.56 ## 2 Montgomery County 4.66 Now we know that the mean arsenic concentration might be different for each region. If we compare the samples from Baltimore City and Montgomery County, the Baltimore City samples appear to have a higher mean arsenic concentration than the Montgomery County samples. QUESTIONS: All the samples from Baltimore City and Montgomery County were collected from public park land. The parks sampled from Montgomery County were located in suburban and rural areas, compared to the urban parks sampled in Baltimore City. Why might the Montgomery County samples have a lower average arsenic concentration than the samples from Baltimore City? What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value? Calculate the mean iron concentration by region. Which region has the highest mean iron concentration? What about the lowest? Let’s say we’re interested in looking at mean concentrations that were determined using EPA Method 3051. Given that there are 8 of these measures in the soil.values dataset, it would be time-consuming to run our code from above for each individual measure. We can modify our summarize statement to calculate statistical measures for multiple columns at once using two helpers: the across() function, which tells R to apply the calculation to multiple columns; and the ends_with() selection helper, which tells R which columns should be included in the statistical calculation. We are using ends_with because for this question, all the columns that we’re interested in end with the string 'EPA3051'. This command follows the code structure: dataset %>% group_by(column_name) %>% summarize(across(ends_with(common_column_name_ending), mean)) soil.values.clean %>% group_by(region) %>% summarize(across(ends_with('EPA3051'), mean)) ## # A tibble: 2 × 8 ## region As_EPA3051 Cd_EPA3051 Cr_EPA3051 Cu_EPA3051 Ni_EPA3051 Pb_EPA3051 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Baltimore C… 5.56 0.359 34.5 35.0 17.4 67.2 ## 2 Montgomery … 4.66 0.402 29.9 24.3 23.4 38.7 ## # ℹ 1 more variable: Zn_EPA3051 <dbl> This is a much more efficient way to calculate statistics. QUESTIONS: Calculate the maximum values for concentrations that were determined using EPA Method 3051. (HINT: change the function you call in the summarize statement.) Which of these metals has the largest maximum concentration, and in which region is it found? Calculate both the mean and maximum values for concentrations that were determined using the Mehlich3 test. (HINT: change the string inside ends_with(), as well as the function you call in the summarize statement.) Which of these metals has the highest average and maximum concentrations, and in which region are they found?
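For the two questions above, it may help to know that across() also accepts a named list of functions, so summarize() can compute several statistics per column in one pass. A minimal sketch (shown with mean and standard deviation rather than the statistics the questions ask for):

# A named list passed to across() creates one output column per function,
# for example As_EPA3051_mean and As_EPA3051_sd
soil.values.clean %>%
  group_by(region) %>%
  summarize(across(ends_with('EPA3051'), list(mean = mean, sd = sd)))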
"],["part-3.-visualizing-the-data.html", "Chapter 16 Part 3. Visualizing the Data", " Chapter 16 Part 3. Visualizing the Data Often, it can be easier to immediately interpret data displayed as a plot than as a list of values. For example, we can more easily understand how the arsenic concentrations of the soil samples are distributed if we create histograms compared to looking at point values like mean, standard deviation, minimum, and maximum. One way to make histograms in R is with the hist() function. This function only requires that we tell R which column of the dataset we want to plot. (However, we also have the option to tell R a histogram title and an x-axis label.) We can again use the pull() command and pipes (%>%) to choose the column we want from the soil.values.clean dataset and make a histogram of it. This combination of commands follows the code structure: dataset %>% pull(column_name) %>% hist(main = chart_title, xlab = x_axis_title) soil.values.clean %>% pull(As_EPA3051) %>% hist(main = 'Histogram of Arsenic Concentration', xlab = 'Concentration in mg/kg') We can see that almost all the soil samples had very low concentrations of arsenic (which is good news for the soil health!). In fact, many of them had arsenic concentrations close to 0, and only one sampling location appears to have high levels of arsenic. We might also want to graphically compare arsenic concentrations among the geographic regions in our dataset. We can do this by creating boxplots. Boxplots are particularly useful when comparing the median, variation, and distribution of values among multiple groups. In R, one way to create a boxplot is using the boxplot() function. We don’t need to use pipes for this command, but instead will specify what columns we want to use from the dataset inside the boxplot() function itself. This command follows the code structure: boxplot(column_we’re_plotting ~ grouping_variable, data = dataset, main = “Title of Graph”, xlab = “x_axis_title”, ylab = “y_axis_title”) boxplot(As_EPA3051 ~ region, data = soil.values.clean, main = "Arsenic Concentration by Geographic Region", xlab = "Region", ylab = "Arsenic Concentration in mg/kg") By using a boxplot, we can quickly see that, while one sampling site within Baltimore City has a very high concentration of arsenic in the soil, in general there isn’t a difference in arsenic content between Baltimore City and Montgomery County. QUESTIONS: Create a histogram for iron concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations? Create a histogram for lead concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations? Look at the maps for iron and lead on the BioDIGS website. Do the boxplots you created make sense, given what you see on these maps? Why or why not?
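(Optional alternative, not part of the original activity: the plots above use base R graphics. If you prefer ggplot2, which is installed as part of the tidyverse loaded earlier, roughly equivalent versions of the two arsenic plots might look like the following sketch; the bin count is an arbitrary choice.)

library(ggplot2)

# Histogram of arsenic concentrations
ggplot(soil.values.clean, aes(x = As_EPA3051)) +
  geom_histogram(bins = 20) +
  labs(title = "Histogram of Arsenic Concentration", x = "Concentration in mg/kg")

# Boxplot of arsenic concentration by region
ggplot(soil.values.clean, aes(x = region, y = As_EPA3051)) +
  geom_boxplot() +
  labs(title = "Arsenic Concentration by Geographic Region",
       x = "Region", y = "Arsenic Concentration in mg/kg")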
"],["activity-questions.html", "Chapter 17 Activity Questions 17.1 Part 1. Examining the Data 17.2 Part 2. Summarizing the Data with Statistics 17.3 Part 3. Visualizing the Data", " Chapter 17 Activity Questions 17.1 Part 1. Examining the Data What data is found in the column labeled “Fe_Mehlich3”? Why would we be interested in how much of this is in the soil? (You may have to search the internet for this answer.) What data is found in the column labeled “Base_Sat_pct”? What does this variable tell us about the soil? How many observations are in the soil testing values dataset that you loaded? What does each of these observations refer to? How many different regions are represented in the soil testing dataset? How many of them have soil testing data available? 17.2 Part 2. Summarizing the Data with Statistics All the samples from Baltimore City and Montgomery County were collected from public park land. The parks sampled from Montgomery County were located in suburban and rural areas, compared to the urban parks sampled in Baltimore City. Why might the Montgomery County samples have a lower average arsenic concentration than the samples from Baltimore City? What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value? Calculate the mean iron concentration by region. Which region has the highest mean iron concentration? What about the lowest? Calculate the maximum values for concentrations that were determined using EPA Method 3051. (HINT: change the function you call in the summarize statement.) Which of these metals has the largest maximum concentration, and in which region is it found? Calculate both the mean and maximum values for concentrations that were determined using the Mehlich3 test. (HINT: change the string inside ends_with(), as well as the function you call in the summarize statement.) Which of these metals has the highest average and maximum concentrations, and in which region are they found? 17.3 Part 3. Visualizing the Data Create a histogram for iron concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations? Create a histogram for lead concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations? Look at the maps for iron and lead on the BioDIGS website. Do the boxplots you created make sense, given what you see on these maps? Why or why not? "],["about-the-authors.html", "About the Authors", " About the Authors These credits are based on our course contributors table guidelines.
Credits Names Pedagogy Content Developer Elizabeth Humphries Content Editors Ava Hoffman, Kate Isaac Project Directors Ava Hoffman, Michael Schatz, Jeff Leek, Frederick Tan Production Content Publisher Ira Gooding Technical Template Publishing Engineers Candace Savonen, Carrie Wright, Ava Hoffman Publishing Maintenance Engineer Candace Savonen Technical Publishing Stylists Carrie Wright, Candace Savonen Package Developers (ottrpal) John Muschelli, Candace Savonen, Carrie Wright Package Developer (BioDIGSData) Ava Hoffman Funding Funder National Human Genome Research Institute (NHGRI) Funding Staff Fallon Bachman, Jennifer Vessio, Emily Voeglein   ## ─ Session info ─────────────────────────────────────────────────────────────── ## setting value ## version R version 4.3.2 (2023-10-31) ## os Ubuntu 22.04.4 LTS ## system x86_64, linux-gnu ## ui X11 ## language (EN) ## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC ## date 2024-10-16 ## pandoc 3.1.1 @ /usr/local/bin/ (via rmarkdown) ## ## ─ Packages ─────────────────────────────────────────────────────────────────── ## package * version date (UTC) lib source ## bookdown 0.40 2024-07-02 [1] CRAN (R 4.3.2) ## bslib 0.6.1 2023-11-28 [1] RSPM (R 4.3.0) ## cachem 1.0.8 2023-05-01 [1] RSPM (R 4.3.0) ## cli 3.6.2 2023-12-11 [1] RSPM (R 4.3.0) ## devtools 2.4.5 2022-10-11 [1] RSPM (R 4.3.0) ## digest 0.6.34 2024-01-11 [1] RSPM (R 4.3.0) ## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.3.0) ## evaluate 0.23 2023-11-01 [1] RSPM (R 4.3.0) ## fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.3.0) ## fs 1.6.3 2023-07-20 [1] RSPM (R 4.3.0) ## glue 1.7.0 2024-01-09 [1] RSPM (R 4.3.0) ## htmltools 0.5.7 2023-11-03 [1] RSPM (R 4.3.0) ## htmlwidgets 1.6.4 2023-12-06 [1] RSPM (R 4.3.0) ## httpuv 1.6.14 2024-01-26 [1] RSPM (R 4.3.0) ## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.3.0) ## jsonlite 1.8.8 2023-12-04 [1] RSPM (R 4.3.0) ## knitr 1.48 2024-07-07 [1] CRAN (R 4.3.2) ## later 1.3.2 2023-12-06 [1] RSPM (R 4.3.0) ## lifecycle 1.0.4 2023-11-07 [1] RSPM (R 4.3.0) ## magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.3.0) ## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.3.0) ## mime 0.12 2021-09-28 [1] RSPM (R 4.3.0) ## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.3.0) ## pkgbuild 1.4.3 2023-12-10 [1] RSPM (R 4.3.0) ## pkgload 1.3.4 2024-01-16 [1] RSPM (R 4.3.0) ## profvis 0.3.8 2023-05-02 [1] RSPM (R 4.3.0) ## promises 1.2.1 2023-08-10 [1] RSPM (R 4.3.0) ## purrr 1.0.2 2023-08-10 [1] RSPM (R 4.3.0) ## R6 2.5.1 2021-08-19 [1] RSPM (R 4.3.0) ## Rcpp 1.0.12 2024-01-09 [1] RSPM (R 4.3.0) ## remotes 2.4.2.1 2023-07-18 [1] RSPM (R 4.3.0) ## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.3.2) ## rmarkdown 2.25 2023-09-18 [1] RSPM (R 4.3.0) ## sass 0.4.8 2023-12-06 [1] RSPM (R 4.3.0) ## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.3.0) ## shiny 1.8.0 2023-11-17 [1] RSPM (R 4.3.0) ## stringi 1.8.3 2023-12-11 [1] RSPM (R 4.3.0) ## stringr 1.5.1 2023-11-14 [1] RSPM (R 4.3.0) ## urlchecker 1.0.1 2021-11-30 [1] RSPM (R 4.3.0) ## usethis 2.2.3 2024-02-19 [1] RSPM (R 4.3.0) ## vctrs 0.6.5 2023-12-01 [1] RSPM (R 4.3.0) ## xfun 0.48 2024-10-03 [1] CRAN (R 4.3.2) ## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.3.0) ## yaml 2.3.8 2023-12-11 [1] RSPM (R 4.3.0) ## ## [1] /usr/local/lib/R/site-library ## [2] /usr/local/lib/R/library ## ## ────────────────────────────────────────────────────────────────────────────── "],["references.html", "Chapter 18 References", " Chapter 18 References "],["404.html", "Page not found", " Page not found The page you requested cannot be found (perhaps it was moved or renamed). 
You may want to try searching to find the page's new location, or use the table of contents to find the page you are looking for. "]] diff --git a/docs/setting-up-billing-on-anvil.html b/docs/setting-up-billing-on-anvil.html index 3d7cbce..1b6ec31 100644 --- a/docs/setting-up-billing-on-anvil.html +++ b/docs/setting-up-billing-on-anvil.html @@ -6,7 +6,7 @@ Chapter 8 Setting up Billing on AnVIL | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/setting-up-the-class-activity.html b/docs/setting-up-the-class-activity.html index 779c5ad..253e261 100644 --- a/docs/setting-up-the-class-activity.html +++ b/docs/setting-up-the-class-activity.html @@ -6,7 +6,7 @@ Chapter 9 Setting up the Class Activity | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/support.html b/docs/support.html index 2da9739..46e7718 100644 --- a/docs/support.html +++ b/docs/support.html @@ -6,7 +6,7 @@ Chapter 3 Support | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    diff --git a/docs/using-rstudio-on-anvil.html b/docs/using-rstudio-on-anvil.html index 143c8fd..88fe024 100644 --- a/docs/using-rstudio-on-anvil.html +++ b/docs/using-rstudio-on-anvil.html @@ -6,7 +6,7 @@ Chapter 12 Using RStudio on AnVIL | BioDIGS: Exploring Soil Data - + @@ -22,7 +22,7 @@ - + @@ -30,7 +30,7 @@ - + @@ -169,23 +169,23 @@
  • 12.3 Touring RStudio
  • 12.4 Pausing RStudio
  • -
  • Data Exploration
  • -
  • 13 Exploring Soil Testing Data With R +
  • Student Activity
  • +
  • 13 Introduction
  • -
  • 14 Activity Questions +
  • 14 Part 1. Examining the Data
  • +
  • 15 Part 2. Summarizing the Data with Statistics
  • +
  • 16 Part 3. Visualizing the Data
  • +
  • 17 Activity Questions
  • About the Authors
  • -
  • 15 References
  • +
  • 18 References
  • This content was published with bookdown by:

    The Fred Hutch Data Science Lab

    @@ -346,7 +346,7 @@

    12.4 Pausing RStudio - +