From 913f685a355518f70055f766636543dffc266bb6 Mon Sep 17 00:00:00 2001 From: Elizabeth Humphries Date: Fri, 2 Feb 2024 17:11:49 -0500 Subject: [PATCH 1/7] create student anvil guide --- 08-student_anvil_guide.Rmd | 59 +++++++ 08-student_modules.Rmd | 141 ---------------- 09-data_exploration.Rmd | 319 +++++++++++++++++++++++++++++++++++++ _bookdown.yml | 3 +- 4 files changed, 380 insertions(+), 142 deletions(-) create mode 100644 08-student_anvil_guide.Rmd delete mode 100644 08-student_modules.Rmd create mode 100644 09-data_exploration.Rmd diff --git a/08-student_anvil_guide.Rmd b/08-student_anvil_guide.Rmd new file mode 100644 index 0000000..b57956c --- /dev/null +++ b/08-student_anvil_guide.Rmd @@ -0,0 +1,59 @@ +# (PART\*) Student Guide to AnVIL {-} + + +```{r, include = FALSE} +ottrpal::set_knitr_image_path() +``` + +# Using RStudio on AnVIL + +In the next few steps, you will walk through how to get set up to use RStudio on the AnVIL platform. AnVIL is centered around different “Workspaces”. Each Workspace functions almost like a mini code laboratory - it is a place where data can be examined, stored, and analyzed. The first thing we want to do is to copy or “clone” a Workspace to create a space for you to experiment. + +Use a web browser to go to the AnVIL website. In the browser type: + +``` +anvil.terra.bio +``` + +:::{.notice} +**Tip** +At this point, it might make things easier to open up a new window in your browser and split your screen. That way, you can follow along with this guide on one side and execute the steps on the other. +::: + +Your instructor will give you information on which workspace you should clone. + +## Video overview of RStudio on AnVIL + +```{r, echo = FALSE, results='asis'} +cow::borrow_chapter( + doc_path = "child/_child_rstudio_video.Rmd", + repo_name = "jhudsl/AnVIL_Template" +) +``` + +## Launching RStudio + +```{r, echo = FALSE, results='asis'} +cow::borrow_chapter( + doc_path = "child/_child_rstudio_launch.Rmd", + repo_name = "jhudsl/AnVIL_Template" +) +``` + +## Touring RStudio + +```{r, echo = FALSE, results='asis'} +cow::borrow_chapter( + doc_path = "child/_child_rstudio_tour.Rmd", + repo_name = "jhudsl/AnVIL_Template" +) +``` + +## Pausing RStudio + +```{r, echo = FALSE, results='asis'} +cow::borrow_chapter( + doc_path = "child/_child_rstudio_pause.Rmd", + repo_name = "jhudsl/AnVIL_Template" +) +``` diff --git a/08-student_modules.Rmd b/08-student_modules.Rmd deleted file mode 100644 index efb5178..0000000 --- a/08-student_modules.Rmd +++ /dev/null @@ -1,141 +0,0 @@ -```{r echo = FALSE} -knitr::opts_chunk$set(out.width = "100%") -``` - -# Student instructions - -Modules aimed at students in a course or workshop. - -
- -## Student Account Setup - -:::: {.borrowed_chunk} -```{r, echo = FALSE, results='asis'} -cow::borrow_chapter( - doc_path = "child/_child_student_create_account.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` -:::: - -## Student instructions for cloning a Workspace - -These instructions can be customized to a specific workspace by setting certain variables before running `cow::borrow_chapter()`. If these variables have not been set, reasonable defaults are provided (e.g. "ask your instructor"). - -### With no variables set: - -:::: {.borrowed_chunk} -```{r, echo = FALSE, results='asis'} -cow::borrow_chapter( - doc_path = "child/_child_student_workspace_clone.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` -:::: - -### With variables set: - -:::: {.borrowed_chunk} -```{r, echo = FALSE, results='asis'} -# Specify variables -AnVIL_module_settings <- list( - workspace_name = "Example_Workspace", - workspace_link = "http://example.com/", - billing_project = "Example Billing Project" -) - -cow::borrow_chapter( - doc_path = "child/_child_student_workspace_clone.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` -:::: - -## Student instructions for launching Jupyter - -The module below is specially customized for students, allowing you to give more specific instructions on the settings for their Jupyter environment. There are several other general purpose modules that may also be useful for students (e.g. Pausing Jupyter, Deleting Jupyter) that can be found in other chapters of this book. - -The following instructions can be customized by setting certain variables before running `cow::borrow_chapter()`. Developers should create these variables as a list `AnVIL_module_settings`. The following variables can be provided: - -- `audience` = Defaults to `general`, telling them to use the default Jupyter settings. If `audience` is set to `student`, it gives more specific instructions. -- `docker_image` = Optional, it will tell them how to set the image. -- `startup_script` = Optional, it will tell them how to set the script. - -### Using default Jupyter environment: - -:::: {.borrowed_chunk} -```{r, echo = FALSE, results='asis'} -# Specify variables -AnVIL_module_settings <- list( - audience = "student" -) - -cow::borrow_chapter( - doc_path = "child/_child_jupyter_launch.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` -:::: - -### Using custom Jupyter environment: - -:::: {.borrowed_chunk} -```{r, echo = FALSE, results='asis'} -# Specify variables -AnVIL_module_settings <- list( - audience = "student", - docker_image = "example docker", - startup_script = "example startup script" -) - -cow::borrow_chapter( - doc_path = "child/_child_jupyter_launch.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` -:::: - -## Student instructions for launching RStudio - -The module below is specially customized for students, allowing you to give more specific instructions on the settings for their RStudio environment. There are several other general purpose modules that may also be useful for students (e.g. Pausing RStudio, Deleting RStudio) that can be found in other chapters of this book. - -The following instructions can be customized by setting certain variables before running `cow::borrow_chapter()`. Developers should create these variables as a list `AnVIL_module_settings`. The following variables can be provided: - -- `audience` = Defaults to `general`, telling them to use the default RStudio settings. If `audience` is set to `student`, it gives more specific instructions. 
-- `docker_image` = Optional, it will tell them to open the customization dialogue and direct them on how to set the image. -- `startup_script` = Optional, it will tell them to open the customization dialogue and direct them on how to set the script. - -### Using default RStudio environment: - -:::: {.borrowed_chunk} -```{r, echo = FALSE, results='asis'} -# Specify variables -AnVIL_module_settings <- list( - audience = "student" -) - -cow::borrow_chapter( - doc_path = "child/_child_rstudio_launch.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` -:::: - -### Using custom RStudio environment: - -:::: {.borrowed_chunk} -```{r, echo = FALSE, results='asis'} -# Specify variables -AnVIL_module_settings <- list( - audience = "student", - docker_image = "example docker", - startup_script = "example startup script" -) - -cow::borrow_chapter( - doc_path = "child/_child_rstudio_launch.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` -:::: diff --git a/09-data_exploration.Rmd b/09-data_exploration.Rmd new file mode 100644 index 0000000..a472343 --- /dev/null +++ b/09-data_exploration.Rmd @@ -0,0 +1,319 @@ +# (PART\*) Data Exploration {-} + + +```{r, include = FALSE} +ottrpal::set_knitr_image_path() +``` + +# Exploring Soil Testing Data With R + +In this activity, you'll have a chance to become familiar with the BioDIGS soil testing data. This dataset includes information on the inorganic components of each soil sample, particularly metal concentrations. Human activity can increase the concentration of inorganic compounds in the soil. When cars drive on roads, compounds from the exhaust, oil, and other fluids might settle onto the roads and be washed into the soil. When we put salt on roads, parking lots, and sidewalks, the salts themselves will eventually be washed away and enter the ecosystem through both water and soil. Chemicals from factories and other businesses also leech into our environment. All of this means the concentration of heavy metals and other chemicals will vary among the soil samples collected for the BioDIGS project. + +## Before You Start + +```{r, echo = FALSE, results='asis'} +cow::borrow_chapter( + doc_path = "child/_child_google_create_account.Rmd", + repo_name = "jhudsl/AnVIL_Template" +) +``` + +## Objectives + +This activity will teach you how to use the AnVIL platform to: + +1. Import data into RStudio +1. Examine csv file that contains the soil testing data from the BioDIGS project +1. Calculate summary statistics for variables in the soil testing data +1. Create and interpret histograms and boxplots for variables in the soil testing data. + +## Getting Started + +In the next few steps, you will walk through how to get set up to use RStudio on the AnVIL platform. AnVIL is centered around different “Workspaces”. Each Workspace functions almost like a mini code laboratory - it is a place where data can be examined, stored, and analyzed. The first thing we want to do is to copy or “clone” a Workspace to create a space for you to experiment. + +Use a web browser to go to the AnVIL website. In the browser type: + +``` +anvil.terra.bio +``` + +:::{.notice} +**Tip** +At this point, it might make things easier to open up a new window in your browser and split your screen. That way, you can follow along with this guide on one side and execute the steps on the other. +::: + +Your instructor will give you information on which workspace you should clone. 
+ +### Video overview of RStudio on AnVIL + +```{r, echo = FALSE, results='asis'} +cow::borrow_chapter( + doc_path = "child/_child_rstudio_video.Rmd", + repo_name = "jhudsl/AnVIL_Template" +) +``` + +### Launching RStudio + +```{r, echo = FALSE, results='asis'} +cow::borrow_chapter( + doc_path = "child/_child_rstudio_launch.Rmd", + repo_name = "jhudsl/AnVIL_Template" +) +``` + +### Touring RStudio + +```{r, echo = FALSE, results='asis'} +cow::borrow_chapter( + doc_path = "child/_child_rstudio_tour.Rmd", + repo_name = "jhudsl/AnVIL_Template" +) +``` + +### Pausing RStudio + +```{r, echo = FALSE, results='asis'} +cow::borrow_chapter( + doc_path = "child/_child_rstudio_pause.Rmd", + repo_name = "jhudsl/AnVIL_Template" +) +``` + +## Part 1. Data Import + +We will use the `BioDIGS` package to retrieve the data. We first need to install the package from where it is stored on GitHub. + +```{r, message = FALSE, warning = FALSE, echo = FALSE} + +library(readr) +soil.values <- read_csv(file = "soil_testing_data.csv") +``` + + +```{r, message = FALSE, warning = FALSE, eval=F} + +devtools::install_github("fhdsl/BioDIGSData") +``` + +Once you've installed the package, we can load the library and assign the soil testing data to an _object_. This command follows the code structure: + +dataset_object_name <- stored_BioDIGS_dataset + +```{r, message = FALSE, warning = FALSE, eval=F} + +library(BioDIGSData) + +soil.values <- BioDIGS_soil_data() +``` + +It _seems_ like the dataset loaded, but it's always a good idea to verify. There are many ways to check, but the easiest approach (if you're using RStudio) is to look at the Environment tab on the upper right-hand side of the screen. You should now have an object called `soil.values` that includes some number of observations for 28 variables. The _observations_ refer to the number of rows in the dataset, while the _variables_ tell you the number of columns. As long as neither the observations or variables are 0, you can be confident that your dataset loaded. + +Let's take a quick look at the dataset. We can do this by clicking on soil.values object in the Environment tab. (Note: this is equivalent to typing `View(soil.values)` in the R console.) + +If the dataset loaded, you will see an object with non-zero observations and variables in the Environment tab + + +This will open a new window for us to scroll through the dataset. + +You can click on the object in the Environment tab to open a new window that allows you to scroll through the loaded dataset + +Well, the data definitely loaded, but those column names aren't immediately understandable. What could **As_EPA3051** possibly mean? In addition to the dataset, we need to load the _data dictionary_ as well. + +:::{.dictionary} + +**Data dictionary:** a file containing the names, definitions, and attributes about data in a database or dataset. + +::: + +In this case, the data dictionary can help us make sense of what sort of values each column represents. The data dictionary for the BioDIGS soil testing data is available in the R package (see code below), but we have also reproduced it here. + +```{r, message = FALSE, warning = FALSE, eval=FALSE} + +?BioDIGS_soil_data() +``` + +:::{.dictionary} + +- **site_id** Unique letter and number site name +- **full_name** Full site name +- **As_EPA3051** Arsenic (mg/kg), EPA Method 3051A. Quantities < 3.0 are not detectable. +- **Cd_EPA3051** Cadmium (mg/kg), EPA Method 3051A. Quantities < 0.2 are not detectable. 
- **Cr_EPA3051** Chromium (mg/kg), EPA Method 3051A
- **Cu_EPA3051** Copper (mg/kg), EPA Method 3051A
- **Ni_EPA3051** Nickel (mg/kg), EPA Method 3051A
- **Pb_EPA3051** Lead (mg/kg), EPA Method 3051A
- **Zn_EPA3051** Zinc (mg/kg), EPA Method 3051A
- **water_pH**
- **A-E_Buffer_pH**
- **OM_by_LOI_pct** Organic Matter by Loss on Ignition
- **P_Mehlich3** Phosphorus (mg/kg), using the Mehlich 3 soil test extractant
- **K_Mehlich3** Potassium (mg/kg), using the Mehlich 3 soil test extractant
- **Ca_Mehlich3** Calcium (mg/kg), using the Mehlich 3 soil test extractant
- **Mg_Mehlich3** Magnesium (mg/kg), using the Mehlich 3 soil test extractant
- **Mn_Mehlich3** Manganese (mg/kg), using the Mehlich 3 soil test extractant
- **Zn_Mehlich3** Zinc (mg/kg), using the Mehlich 3 soil test extractant
- **Cu_Mehlich3** Copper (mg/kg), using the Mehlich 3 soil test extractant
- **Fe_Mehlich3** Iron (mg/kg), using the Mehlich 3 soil test extractant
- **B_Mehlich3** Boron (mg/kg), using the Mehlich 3 soil test extractant
- **S_Mehlich3** Sulfur (mg/kg), using the Mehlich 3 soil test extractant
- **Na_Mehlich3** Sodium (mg/kg), using the Mehlich 3 soil test extractant
- **Al_Mehlich3** Aluminum (mg/kg), using the Mehlich 3 soil test extractant
- **Est_CEC** Cation Exchange Capacity (meq/100g) at pH 7.0 (CEC)
- **Base_Sat_pct** Base saturation (BS). This represents the percentage of CEC occupied by bases (Ca2+, Mg2+, K+, and Na+). The %BS increases with increasing soil pH. The availability of Ca2+, Mg2+, and K+ increases with increasing %BS.
- **P_Sat_ratio** Phosphorus saturation ratio. This is the ratio between the amount of phosphorus present in the soil and the total capacity of that soil to retain phosphorus. The ability of phosphorus to be bound in the soil is primarily a function of iron (Fe) and aluminum (Al) content in that soil.

:::

Using the data dictionary, we find that the values in column `As_EPA3051` give us the arsenic concentration in mg/kg of each soil sample, as determined by EPA Method 3051A. While arsenic can occur naturally in soils, higher levels suggest the soil may have been contaminated by mining, hazardous waste, or pesticide application. Arsenic is toxic to humans.

We can also look at just the names of all the columns in the R console using the `colnames()` command.

```{r, message = FALSE, warning = FALSE}

colnames(soil.values)
```

Most of the column names are found in the data dictionary, but the very last column ("region") isn't. How peculiar! Let's look at what sort of values this particular column contains. The tab with the table of the `soil.values` object should still be open in the upper left pane of the RStudio window. If not, you can open it again by clicking on `soil.values` in the Environment pane, or by using the `View()` command.

```{r, message = FALSE, warning = FALSE, eval = F}

View(soil.values)
```

Switch to the soil.values tab to look at what values are in the region column

If you scroll to the end of the table, we can see that "region" seems to refer to the city or area where the samples were collected. For example, the first 24 samples all come from Baltimore City.

We can see the first samples in the dataset were collected in Baltimore City

You may notice that some cells in the `soil.values` table contain _NA_. This just means that the soil testing data for that sample isn't available yet. We'll take care of those values in the next part.

::: {.reflection}
QUESTIONS:

1. How many observations are in the soil testing values dataset that you loaded? What does each of these observations refer to?

2. What data is found in the column labeled "Fe_Mehlich3"? Why would we be interested in how much of this is in the soil? (You may have to search the internet for this answer.)

:::

## Part 2. Data Summarization

Now that we have the dataset loaded, let's explore the data in more depth.

First, we should remove those samples that don't have soil testing data yet. We _could_ keep them in the dataset, but removing them at this stage will make the analysis a little cleaner. In this case, as we know the reason the data are missing (and that reason will not skew our analysis), we can safely remove these samples. This will not be the case for every data analysis.

We can remove the unanalyzed samples using the `drop_na()` function from the `tidyr` package. This function removes any rows from a table that contain _NA_ for a particular column. This command follows the code structure:

dataset_new_name <- dataset %>% drop_na(column_name)

The `%>%` is called a pipe, and it tells R that the commands after it should all be applied to the object in front of it. (In this case, we can filter out all samples missing a value for "As_EPA3051" as a proxy for samples without soil testing data.)

```{r, message = FALSE, warning = FALSE}

library(tidyr)

soil.values.clean <- soil.values %>% drop_na(As_EPA3051)
```

Great! Now let's calculate some basic statistics. For example, we might want to know the mean (average) arsenic concentration across the soil samples. According to the data dictionary, the arsenic concentrations are in the column labeled "As_EPA3051". We can use a combination of two functions: `pull()` and `mean()`. `pull()` lets you extract a column from your table for statistical analysis, while `mean()` calculates the average value for the extracted column.

This command follows the code structure:

OBJECT %>% pull(column_name) %>% mean()

`pull()` is a command from the `tidyverse` package, so we'll need to load that library before our command.

```{r, message = FALSE, warning = FALSE}

library(tidyverse)

soil.values.clean %>% pull(As_EPA3051) %>% mean()
```

We can run similar commands to calculate the standard deviation, minimum, and maximum for the soil arsenic values.

```{r, message = FALSE, warning = FALSE}

soil.values.clean %>% pull(As_EPA3051) %>% sd()
soil.values.clean %>% pull(As_EPA3051) %>% min()
soil.values.clean %>% pull(As_EPA3051) %>% max()
```

As you can see, the standard deviation of the arsenic concentrations is listed first, then the minimum concentration, and finally the maximum concentration.

The soil testing dataset contains samples from multiple geographic regions, so maybe it's more meaningful to find out what the average arsenic values are for each region. We have to do a little bit of clever coding trickery for this using the `group_by` and `summarize` functions. First, we tell R to split our dataset up by a particular column (in this case, region) using the `group_by` function, then we tell R to summarize the mean arsenic concentration for each group. Because there are several different functions with the name `summarize` in R, we have to specify that we want to use `summarize` from the `dplyr` package.
This command follows the code structure:

dataset %>%
  group_by(column_name) %>%
  dplyr::summarize(Mean = mean(column_name))

```{r, message = FALSE, warning = FALSE}

soil.values.clean %>%
  group_by(region) %>%
  dplyr::summarize(Mean = mean(As_EPA3051))
```

Now we know that the mean arsenic concentration might be different for each region, and appears higher for the Baltimore City samples than the Montgomery County samples.

::: {.reflection}
QUESTIONS:

3. What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value?

4. Calculate the mean iron concentration by region. Which region has the highest mean iron concentration? What about the lowest?

:::

## Part 3. Data Visualization

Often, it can be easier to interpret data displayed as a plot than as a list of values. For example, we can more easily understand how the arsenic concentrations of the soil samples are distributed by creating a histogram than by looking at point values like the mean, standard deviation, minimum, and maximum.

One way to make histograms in R is to use the `hist()` function. We can again use the `pull()` command and pipes (`%>%`) to choose the column we want from the `soil.values.clean` dataset and make a histogram of its values. Remember, this command follows the code structure:

dataset %>%
  pull(column_name) %>%
  hist(main = chart_title, xlab = x_axis_title)

In this case, we do _not_ have to specify the package the way we did with `dplyr::summarize`, because there's only one function called `hist()` in the packages we're using.

```{r, message = FALSE, warning = FALSE}

soil.values.clean %>%
  pull(As_EPA3051) %>%
  hist(main = 'Histogram of Arsenic Concentration',
       xlab = 'Concentration in mg/kg')
```

We can see that almost all the soil samples had very low concentrations of arsenic (which is good news for the soil health!). In fact, many of them had arsenic concentrations close to 0, and only one sampling location appears to have high levels of arsenic.

We might also want to graphically compare arsenic concentrations among the geographic regions in our dataset. We can do this by creating boxplots. Boxplots are particularly useful when comparing the mean, variation, and distributions among multiple groups. In R, one way to create a boxplot is using the `boxplot()` function. We don't need to use pipes for this command, but instead will specify what columns we want to use from the dataset inside the `boxplot()` function itself.

This command follows the code structure:

boxplot(arsenic_concentration ~ grouping_variable,
        data = dataset,
        main = "Title of Graph",
        xlab = "x_axis_title",
        ylab = "y_axis_title")

```{r, message = FALSE, warning = FALSE}
boxplot(As_EPA3051 ~ region, data = soil.values.clean,
        main = "Arsenic Concentration by Geographic Region",
        xlab = "Region",
        ylab = "Arsenic Concentration in mg/kg")
```

By using a boxplot, we can quickly see that, while one sampling site within Baltimore City has a very high concentration of arsenic in the soil, in general there isn't a difference in arsenic content between Baltimore City and Montgomery County.

::: {.reflection}
QUESTIONS:

5. Create a histogram for _iron_ concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations?

6.
Create a histogram for _lead_ concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations? + +::: diff --git a/_bookdown.yml b/_bookdown.yml index bbfc021..6cb3df7 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -9,7 +9,8 @@ rmd_files: ["index.Rmd", "05-anvil_onboarding.Rmd", "06-using_platforms_modules.Rmd", "07-instructor-guide.Rmd", - "08-data_exploration.Rmd", + "08-student_anvil_guide.Rmd" + "09-data_exploration.Rmd", "About.Rmd", "References.Rmd"] new_session: yes From 378124c6eb05de510b3b97cfdd4361c3f3fd8316 Mon Sep 17 00:00:00 2001 From: Elizabeth Humphries Date: Fri, 2 Feb 2024 17:16:54 -0500 Subject: [PATCH 2/7] silly commas --- 04-billing_modules.Rmd | 2 +- 07-instructor-guide.Rmd => 07-instructor_guide.Rmd | 0 _bookdown.yml | 4 ++-- 3 files changed, 3 insertions(+), 3 deletions(-) rename 07-instructor-guide.Rmd => 07-instructor_guide.Rmd (100%) diff --git a/04-billing_modules.Rmd b/04-billing_modules.Rmd index b184f1a..575fad9 100644 --- a/04-billing_modules.Rmd +++ b/04-billing_modules.Rmd @@ -1,7 +1,7 @@ ```{r echo = FALSE} knitr::opts_chunk$set(out.width = "100%") ``` -# (PART\*) Using AnVIL {-} +# (PART\*) AnVIL Overview {-} # Billing diff --git a/07-instructor-guide.Rmd b/07-instructor_guide.Rmd similarity index 100% rename from 07-instructor-guide.Rmd rename to 07-instructor_guide.Rmd diff --git a/_bookdown.yml b/_bookdown.yml index 6cb3df7..f982312 100644 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -8,8 +8,8 @@ rmd_files: ["index.Rmd", "04-billing_modules.Rmd", "05-anvil_onboarding.Rmd", "06-using_platforms_modules.Rmd", - "07-instructor-guide.Rmd", - "08-student_anvil_guide.Rmd" + "07-instructor_guide.Rmd", + "08-student_anvil_guide.Rmd", "09-data_exploration.Rmd", "About.Rmd", "References.Rmd"] From ee662bb1f62579323402c17a5995d67bd5cda44f Mon Sep 17 00:00:00 2001 From: Elizabeth Humphries Date: Fri, 2 Feb 2024 17:25:58 -0500 Subject: [PATCH 3/7] making instructor section title match others --- 07-instructor_guide.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/07-instructor_guide.Rmd b/07-instructor_guide.Rmd index 47c6c25..32c0388 100644 --- a/07-instructor_guide.Rmd +++ b/07-instructor_guide.Rmd @@ -1,4 +1,4 @@ -# (PART\*) Instructor Guide {-} +# (PART\*) Instructor Guide to AnVIL {-} # Notes for Instructors From 43486d5949a2e06ef0c7918357a373bcb5a19c7c Mon Sep 17 00:00:00 2001 From: Elizabeth Humphries Date: Fri, 2 Feb 2024 17:34:28 -0500 Subject: [PATCH 4/7] adding galaxy to student guide --- 08-student_anvil_guide.Rmd | 90 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 90 insertions(+) diff --git a/08-student_anvil_guide.Rmd b/08-student_anvil_guide.Rmd index b57956c..b7dc53b 100644 --- a/08-student_anvil_guide.Rmd +++ b/08-student_anvil_guide.Rmd @@ -57,3 +57,93 @@ cow::borrow_chapter( repo_name = "jhudsl/AnVIL_Template" ) ``` + + +# Using Galaxy on AnVIL + +In the next few steps, you will walk through how to get set up to use Galaxy on the AnVIL platform. AnVIL is centered around different “Workspaces”. Each Workspace functions almost like a mini code laboratory - it is a place where data can be examined, stored, and analyzed. The first thing we want to do is to copy or “clone” a Workspace to create a space for you to experiment. + +Use a web browser to go to the AnVIL website. 
In the browser type: + +``` +anvil.terra.bio +``` + +:::{.notice} +**Tip** +At this point, it might make things easier to open up a new window in your browser and split your screen. That way, you can follow along with this guide on one side and execute the steps on the other. +::: + +Your instructor will give you information on which workspace you should clone. After logging in, click “View Workspaces”. Select the “Public” tab. In the top search bar type the activity workspace. + +Clone the workspace by clicking the teardrop button (![teardrop button](resources/images/teardrop.png){#id .class width=25 height=20px}). And selecting “Clone”. + +```{r, echo=FALSE, fig.alt='Screenshot showing the teardrop button. The button has been clicked revealing the "clone" option. The Clone option and the teardrop button are highlighted.'} +ottrpal::include_slide("https://docs.google.com/presentation/d/182AOzMaiyrreinnsRX2VhH7YsVgvAp4xtIB_7Mzmk6I/edit#slide=id.ged15532ded_0_625") +``` + +In the first box, give your Workspace clone a name by adding an underscore (“_”) and your name. For example, “SARS-CoV-2-Genome_Ava_Hoffman”. Next, select the Billing project provided by your instructor. Leave the bottom two boxes as-is and click “CLONE WORKSPACE”. + +```{r, echo=FALSE, fig.alt='Screenshot showing the "clone a workspace" popout. The Workspace name, Billing Project, and Clone Workspace button have been filled in and highlighted.'} +ottrpal::include_slide("https://docs.google.com/presentation/d/182AOzMaiyrreinnsRX2VhH7YsVgvAp4xtIB_7Mzmk6I/edit#slide=id.ged15532ded_0_648") +``` + +## Starting Galaxy {#starting-galaxy} + +Galaxy is a great tool for performing bioinformatics analysis without having to update software or worry too much about coding. In order to use Galaxy, we need to create a cloud environment. This is like quickly renting a few computers from Google as the engine to power our Galaxy analysis. + +:::{.warning} +Currently, you will need to use Chrome or Safari as your browser for Galaxy cloud environments to work. +::: + +In your new Workspace, click on the “ANALYSES” tab. Next, click on “START”. You should see a popup window on the right side of the screen. Click on the Galaxy logo to proceed. + +```{r, echo=FALSE, fig.alt='Screenshot of the Workspace Notebooks tab. The notebook tab name and the plus button that starts a cloud environment for Galaxy have been highlighted,'} +ottrpal::include_slide("https://docs.google.com/presentation/d/182AOzMaiyrreinnsRX2VhH7YsVgvAp4xtIB_7Mzmk6I/edit#slide=id.ged15532ded_0_788") +``` + +Click on “NEXT” and “CREATE” to keep all settings as-is. + +```{r, echo=FALSE, fig.alt='The CREATE button among cloud environments has been highlighted.'} +ottrpal::include_slide("https://docs.google.com/presentation/d/182AOzMaiyrreinnsRX2VhH7YsVgvAp4xtIB_7Mzmk6I/edit#slide=id.ged15532ded_0_798") +``` + +Click on the Galaxy icon. + +```{r, echo=FALSE, fig.alt='The Galaxy icon appears if the environment has been successfully launched.'} +ottrpal::include_slide("https://docs.google.com/presentation/d/182AOzMaiyrreinnsRX2VhH7YsVgvAp4xtIB_7Mzmk6I/edit#slide=id.g2283b458fae_100_31") +``` + +You will see that the environment is still being set up. + +```{r, echo=FALSE, fig.alt='The status of the cloud computing environment shows that it is still being set up.'} +ottrpal::include_slide("https://docs.google.com/presentation/d/182AOzMaiyrreinnsRX2VhH7YsVgvAp4xtIB_7Mzmk6I/edit#slide=id.g2283b458fae_100_38") +``` + +This will take 8-10 minutes. When it is done, click “Open”. 
You might need to refresh the page. + +```{r, echo=FALSE, fig.alt='The Provisioning status text has changed to "Launch Galaxy" indicating the cloud environment is ready to use.'} +ottrpal::include_slide("https://docs.google.com/presentation/d/182AOzMaiyrreinnsRX2VhH7YsVgvAp4xtIB_7Mzmk6I/edit#slide=id.g2283b458fae_100_46") +``` + +:::{.notice} +Remember that you can refresh your browser or navigate away at any time. This is because the connection to the environment is in the cloud, not on your personal computer. +::: + +You can also follow along with the first ~2 minutes of [this video](https://jhudatascience.org/AnVIL_Book_Getting_Started/starting-galaxy.html) to start Galaxy on AnVIL. + +## Navigating Galaxy + +Notice the three main sections. + +**Tools** - These are all of the bioinformatics tool packages available for you to use. + +**The Main Dashboard** - This contains flash messages and posts when you first open Galaxy, but when we are using data this is the main interface area. + +**History** - When you start a project you will be able to see all of the documents in the project in the history. Now be aware, this can become very busy. Also the naming that Galaxy uses is not very intuitive, so you must make sure that you label your files with something that makes sense to you. + +```{r, echo=FALSE, fig.alt='Screenshot of the Galaxy landing page. The Tools and History headings have been highlighted.'} +ottrpal::include_slide("https://docs.google.com/presentation/d/182AOzMaiyrreinnsRX2VhH7YsVgvAp4xtIB_7Mzmk6I/edit#slide=id.ged15532ded_0_816") +``` + +On the welcome page, there are links to tutorials. You may try these out on your own. If you want to try a new analysis this is a good place to start. From fb19e0934d942aae27ea3f0bfa441eb2b48bd2c8 Mon Sep 17 00:00:00 2001 From: Elizabeth Humphries Date: Fri, 2 Feb 2024 17:36:15 -0500 Subject: [PATCH 5/7] fussing with language --- 08-student_anvil_guide.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/08-student_anvil_guide.Rmd b/08-student_anvil_guide.Rmd index b7dc53b..8a68028 100644 --- a/08-student_anvil_guide.Rmd +++ b/08-student_anvil_guide.Rmd @@ -90,7 +90,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/182AOzMaiyrreinns ## Starting Galaxy {#starting-galaxy} -Galaxy is a great tool for performing bioinformatics analysis without having to update software or worry too much about coding. In order to use Galaxy, we need to create a cloud environment. This is like quickly renting a few computers from Google as the engine to power our Galaxy analysis. +Galaxy is a free, relatively easy to use bioinformatics implementation package. It changes command line programs into GUI based programs and is a great tool for performing bioinformatics analysis without having to update software or worry too much about coding. In order to use Galaxy, we need to create a cloud environment. This is like quickly renting a few computers from Google as the engine to power our Galaxy analysis. :::{.warning} Currently, you will need to use Chrome or Safari as your browser for Galaxy cloud environments to work. 
From ee88b46f6b54f5db3bb661879c4b38598869038e Mon Sep 17 00:00:00 2001 From: Elizabeth Humphries Date: Fri, 2 Feb 2024 17:46:51 -0500 Subject: [PATCH 6/7] removing duplicate information --- 08-data_exploration.Rmd | 321 ---------------------------------------- 09-data_exploration.Rmd | 52 ------- 2 files changed, 373 deletions(-) delete mode 100644 08-data_exploration.Rmd diff --git a/08-data_exploration.Rmd b/08-data_exploration.Rmd deleted file mode 100644 index 13016d4..0000000 --- a/08-data_exploration.Rmd +++ /dev/null @@ -1,321 +0,0 @@ -# (PART\*) Data Exploration {-} - - -```{r, include = FALSE} -ottrpal::set_knitr_image_path() -``` - -# Exploring Soil Testing Data With R - -In this activity, you'll have a chance to become familiar with the BioDIGS soil testing data. This dataset includes information on the inorganic components of each soil sample, particularly metal concentrations. Human activity can increase the concentration of inorganic compounds in the soil. When cars drive on roads, compounds from the exhaust, oil, and other fluids might settle onto the roads and be washed into the soil. When we put salt on roads, parking lots, and sidewalks, the salts themselves will eventually be washed away and enter the ecosystem through both water and soil. Chemicals from factories and other businesses also leech into our environment. All of this means the concentration of heavy metals and other chemicals will vary among the soil samples collected for the BioDIGS project. - -## Before You Start - -```{r, echo = FALSE, results='asis'} -cow::borrow_chapter( - doc_path = "child/_child_google_create_account.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` - -## Objectives - -This activity will teach you how to use the AnVIL platform to: - -1. Get started working on AnVIL -1. Launch RStudio -1. Import data into RStudio -1. Examine csv file that contains the soil testing data from the BioDIGS project -1. Calculate summary statistics for variables in the soil testing data -1. Create and interpret histograms and boxplots for variables in the soil testing data. - -## Getting Started - -In the next few steps, you will walk through how to get set up to use RStudio on the AnVIL platform. AnVIL is centered around different “Workspaces”. Each Workspace functions almost like a mini code laboratory - it is a place where data can be examined, stored, and analyzed. The first thing we want to do is to copy or “clone” a Workspace to create a space for you to experiment. - -Use a web browser to go to the AnVIL website. In the browser type: - -``` -anvil.terra.bio -``` - -:::{.notice} -**Tip** -At this point, it might make things easier to open up a new window in your browser and split your screen. That way, you can follow along with this guide on one side and execute the steps on the other. -::: - -Your instructor will give you information on which workspace you should clone. 
- -### Video overview of RStudio on AnVIL - -```{r, echo = FALSE, results='asis'} -cow::borrow_chapter( - doc_path = "child/_child_rstudio_video.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` - -### Launching RStudio - -```{r, echo = FALSE, results='asis'} -cow::borrow_chapter( - doc_path = "child/_child_rstudio_launch.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` - -### Touring RStudio - -```{r, echo = FALSE, results='asis'} -cow::borrow_chapter( - doc_path = "child/_child_rstudio_tour.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` - -### Pausing RStudio - -```{r, echo = FALSE, results='asis'} -cow::borrow_chapter( - doc_path = "child/_child_rstudio_pause.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` - -## Part 1. Data Import - -We will use the `BioDIGS` package to retrieve the data. We first need to install the package from where it is stored on GitHub. - -```{r, message = FALSE, warning = FALSE, echo = FALSE} - -library(readr) -soil.values <- read_csv(file = "soil_testing_data.csv") -``` - - -```{r, message = FALSE, warning = FALSE, eval=F} - -devtools::install_github("fhdsl/BioDIGSData") -``` - -Once you've installed the package, we can load the library and assign the soil testing data to an _object_. This command follows the code structure: - -dataset_object_name <- stored_BioDIGS_dataset - -```{r, message = FALSE, warning = FALSE, eval=F} - -library(BioDIGSData) - -soil.values <- BioDIGS_soil_data() -``` - -It _seems_ like the dataset loaded, but it's always a good idea to verify. There are many ways to check, but the easiest approach (if you're using RStudio) is to look at the Environment tab on the upper right-hand side of the screen. You should now have an object called `soil.values` that includes some number of observations for 28 variables. The _observations_ refer to the number of rows in the dataset, while the _variables_ tell you the number of columns. As long as neither the observations or variables are 0, you can be confident that your dataset loaded. - -Let's take a quick look at the dataset. We can do this by clicking on soil.values object in the Environment tab. (Note: this is equivalent to typing `View(soil.values)` in the R console.) - -If the dataset loaded, you will see an object with non-zero observations and variables in the Environment tab - - -This will open a new window for us to scroll through the dataset. - -You can click on the object in the Environment tab to open a new window that allows you to scroll through the loaded dataset - -Well, the data definitely loaded, but those column names aren't immediately understandable. What could **As_EPA3051** possibly mean? In addition to the dataset, we need to load the _data dictionary_ as well. - -:::{.dictionary} - -**Data dictionary:** a file containing the names, definitions, and attributes about data in a database or dataset. - -::: - -In this case, the data dictionary can help us make sense of what sort of values each column represents. The data dictionary for the BioDIGS soil testing data is available in the R package (see code below), but we have also reproduced it here. - -```{r, message = FALSE, warning = FALSE, eval=FALSE} - -?BioDIGS_soil_data() -``` - -:::{.dictionary} - -- **site_id** Unique letter and number site name -- **full_name** Full site name -- **As_EPA3051** Arsenic (mg/kg), EPA Method 3051A. Quantities < 3.0 are not detectable. -- **Cd_EPA3051** Cadmium (mg/kg), EPA Method 3051A. Quantities < 0.2 are not detectable. 
-- **Cr_EPA3051** Chromium (mg/kg), EPA Method 3051A -- **Cu_EPA3051** Copper (mg/kg), EPA Method 3051A -- **Ni_EPA3051** Nickel (mg/kg), EPA Method 3051A -- **Pb_EPA3051** Lead (mg/kg), EPA Method 3051A -- **Zn_EPA3051** Zinc (mg/kg), EPA Method 3051A -- **water_pH** -- **A-E_Buffer_pH** -- **OM_by_LOI_pct** Organic Matter by Loss on Ignition -- **P_Mehlich3** Phosphorus (mg/kg), using the Mehlich 3 soil test extractant -- **K_Mehlich3 Potassium** (mg/kg), using the Mehlich 3 soil test extractant -- **Ca_Mehlich3** Calcium (mg/kg), using the Mehlich 3 soil test extractant -- **Mg_Mehlich3** Magnesium (mg/kg), using the Mehlich 3 soil test extractant -- **Mn_Mehlich3** Manganese (mg/kg), using the Mehlich 3 soil test extractant -- **Zn_Mehlich3** Zinc (mg/kg), using the Mehlich 3 soil test extractant -- **Cu_Mehlich3** Copper (mg/kg), using the Mehlich 3 soil test extractant -- **Fe_Mehlich3** Iron (mg/kg), using the Mehlich 3 soil test extractant -- **B_Mehlich3** Boron (mg/kg), using the Mehlich 3 soil test extractant -- **S_Mehlich3** Sulfur (mg/kg), using the Mehlich 3 soil test extractant -- **Na_Mehlich3** Sodium (mg/kg), using the Mehlich 3 soil test extractant -- **Al_Mehlich3** Aluminum (mg/kg), using the Mehlich 3 soil test extractant -- **Est_CEC** Cation Exchange Capacity (meq/100g) at pH 7.0 (CEC) -- **Base_Sat_pct** Base saturation (BS). This represents the percentage of CEC occupied by bases (Ca2+, Mg2+, K+, and Na+). The %BS increases with increasing soil pH. The availability of Ca2+, Mg2+, and K+ increases with increasing %BS. -- **P_Sat_ratio** Phosphorus saturation ratio. This is the ratio between the amount of phosphorus present in the soil and the total capacity of that soil to retain phosphorus. The ability of phosphorus to be bound in the soil is primary a function of iron (Fe) and aluminum (Al) content in that soil. - -::: - -Using the data dictionary, we find that the values in column `As_EPA3051` give us the arsenic concentration in mg/kg of each soil sample, as determined by EPA Method 3051A. While arsenic can occur naturally in soils, higher levels suggest the soil may have been contaminated by mining, hazardous waste, or pesticide application. Arsenic is toxic to humans. - -We can also look at just the names of all the columns using the R console using the `colnames()` command. - -```{r, message = FALSE, warning = FALSE} - -colnames(soil.values) -``` - -Most of the column names are found in the data dictionary, but the very last column ("region") isn't. How peculiar! Let's look at what sort of values this particular column contains. The tab with the table of the `soil.views` object should still be open in the upper left pane of the RStudio window. If not, you can open it again by clicking on `soils.view` in the Environment pane, or by using the `View()` command. - -```{r, message = FALSE, warning = FALSE, eval = F} - -View(soil.values) -``` - -Switch to the soil.values tab to look at what values are in the region column - - -If you scroll to the end of the table, we can see that "region" seems to refer to the city or area where the samples were collected. For example, the first 24 samples all come from Baltimore City. - -We can see the first samples in the dataset were collected in Baltimore City - - -You may notice that some cells in the `soil.values` table contain _NA_. This just means that the soil testing data for that sample isn't available yet. We'll take care of those values in the next part. - -::: {.reflection} -QUESTIONS: - -1. 
How many observations are in the soil testing values dataset that you loaded? What do each of these observations refer to? - -2. What data is found in the column labeled "Fe_Mehlich3"? Why would we be interested how much of this is in the soil? (You may have to search the internet for this answer.) - -::: - -## Part 2. Data Summarization - -Now that we have the dataset loaded, let's explore the data in more depth. - -First, we should remove those samples that don't have soil testing data yet. We _could_ keep them in the dataset, but removing them at this stage will make the analysis a little cleaner. In this case, as we know the reason the data are missing (and that reason will not skew our analysis), we can safely remove these samples. This will not be the case for every data analysis. - -We can remove the unanalyzed samples using the `drop_na()` function from the `tidyr` package. This function removes any rows from a table that contains _NA_ for a particular column. This command follows the code structure: - -dataset_new_name <- dataset %>% drop_na(column_name) - -The `%>% is called a pipe and it tells R that the commands after it should all be applied to the object in front of it. (In this case, we can filter out all samples missing a value for "As_EPA3051" as a proxy for samples without soil testing data.) - -```{r, message = FALSE, warning = FALSE} - -library(tidyr) - -soil.values.clean <- soil.values %>% drop_na(As_EPA3051) -``` - -Great! Now let's calculate some basic statistics. For example, we might want to know what the mean (average) lead concentration is for each soil sample. According to the data dictionary, the values for lead concentration are in the column labeled "Pb_EPA3051". We can use a combination of two functions: `pull()` and `mean()`.`pull()` lets you extract a column from your table for statistical analysis, while `mean()` calculates the average value for the extracted column. - -This command follows the code structure: - -OBJECT %>% pull(column_name) %>% mean() - -`pull()` is a command from the `tidyverse` package, so we'll need to load that library before our command. - -```{r, message = FALSE, warning = FALSE} - -library(tidyverse) - -soil.values.clean %>% pull(As_EPA3051) %>% mean() -``` - -We can run similar commands to calculate the standard deviation, minimum, and maximum for the soil arsenic values. - -```{r, message = FALSE, warning = FALSE} - -soil.values.clean %>% pull(As_EPA3051) %>% sd() -soil.values.clean %>% pull(As_EPA3051) %>% min() -soil.values.clean %>% pull(As_EPA3051) %>% max() -``` -As you can see, the standard deviation of the arsenic concentrations is listed first, then the minimum concentration, and finally the maximum concentration. - -The soil testing dataset contains samples from multiple geographic regions, so maybe it's more meaningful to find out what the average arsenic values are for each region. We have to do a little bit of clever coding trickery for this using the `group_by` and `summarize` functions. First, we tell R to split our dataset up by a particular column (in this case, region) using the `group_by` function, then we tell R to summarize the mean arsenic concentration for each group. Because there are several different functions with the name `summarize` in R, we have to specify that we want to use `summarize` from the `dplyr` package. 
This command follows the code structure: - -dataset %>% - group_by(column_name) %>% - dplyr::summarize(Mean = mean(column_name)) - -```{r, message = FALSE, warning = FALSE} - -soil.values.clean %>% - group_by(region) %>% - dplyr::summarize(Mean = mean(As_EPA3051)) -``` - -Now we know that the mean arsenic concentration might be different for each region, and appears higher for the Baltimore City samples than the Montgomery County samples. - -::: {.reflection} -QUESTIONS: - -3. What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value? - -2. Calculate the mean iron concentration by region. Which region has the highest mean iron concentration? What about the lowest? - -::: - -## Part 3. Data Visualization - -Often, it can be easier to immediately interpret data displayed as a plot than as a list of values. For example, we can more easily understand how the arsenic concentration of the soil samples are distributed if we create histograms compared to looking at point values like mean, standard deviation, minimum, and maximum. - -One way to make histograms in R is to use the `hist()` function. We can again use the `pull()` command and pipes (`%>%`) to choose the column we want from the `soil.values.clean` dataset and make a histogram of them. Remember, this command follows the code structure: - -dataset %>% - pull(column_name) %>% - hist(main = chart_title, xlab = x_axis_title) - -In this case, we do _not_ have to use the `dplyr::summarize` command before `hist()` because there's only one function called `hist()` in the packages we're using. - -```{r, message = FALSE, warning = FALSE} - -soil.values.clean %>% - pull(As_EPA3051) %>% - hist(main = 'Histogram of Arsenic Concentration', - xlab ='Concentration in mg/kg' ) -``` - -We can see that almost all the soil samples had very low concentrations of arsenic (which is good news for the soil health!). In fact, many of them had arsenic concentrations close to 0, and only one sampling location appears to have high levels of arsenic. - -We might also want to graphically compare arsenic concentrations among the geographic regions in our dataset. We can do this by creating boxplots. Boxplots are particularly useful when comparing the mean, variation, and distributions among multiple groups. In R, one way to create a boxplot is using the `boxplot()` function. We don't need to use pipes for this command, but instead will specify what columns we want to use from the dataset inside the `boxplot()` function itself. - -This command follows the code structure: - -boxplot(arsenic_concentration ~ grouping_variable, - data = dataset, - main = "Title of Graph", - xlab = "x_axis_title", - ylab = "y_axis_title") - -```{r, message = FALSE, warning = FALSE} -boxplot(As_EPA3051 ~ region, data = soil.values.clean, - main = "Arsenic Concentration by Geographic Region", - xlab = "Region", - ylab = "Arsenic Concentration in mg/kg") -``` - -By using a boxplot, we can quickly see that, while one sampling site within Baltimore City has a very high concentration of arsenic in the soil, in general there isn't a difference in arsenic content between Baltimore City and Montgomery County. - -::: {.reflection} -QUESTIONS: - -5. Create a histogram for _iron_ concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations? - -6. 
Create a histogram for _lead_ concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations? - -::: diff --git a/09-data_exploration.Rmd b/09-data_exploration.Rmd index a472343..74f1f08 100644 --- a/09-data_exploration.Rmd +++ b/09-data_exploration.Rmd @@ -27,58 +27,6 @@ This activity will teach you how to use the AnVIL platform to: 1. Calculate summary statistics for variables in the soil testing data 1. Create and interpret histograms and boxplots for variables in the soil testing data. -## Getting Started - -In the next few steps, you will walk through how to get set up to use RStudio on the AnVIL platform. AnVIL is centered around different “Workspaces”. Each Workspace functions almost like a mini code laboratory - it is a place where data can be examined, stored, and analyzed. The first thing we want to do is to copy or “clone” a Workspace to create a space for you to experiment. - -Use a web browser to go to the AnVIL website. In the browser type: - -``` -anvil.terra.bio -``` - -:::{.notice} -**Tip** -At this point, it might make things easier to open up a new window in your browser and split your screen. That way, you can follow along with this guide on one side and execute the steps on the other. -::: - -Your instructor will give you information on which workspace you should clone. - -### Video overview of RStudio on AnVIL - -```{r, echo = FALSE, results='asis'} -cow::borrow_chapter( - doc_path = "child/_child_rstudio_video.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` - -### Launching RStudio - -```{r, echo = FALSE, results='asis'} -cow::borrow_chapter( - doc_path = "child/_child_rstudio_launch.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` - -### Touring RStudio - -```{r, echo = FALSE, results='asis'} -cow::borrow_chapter( - doc_path = "child/_child_rstudio_tour.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` - -### Pausing RStudio - -```{r, echo = FALSE, results='asis'} -cow::borrow_chapter( - doc_path = "child/_child_rstudio_pause.Rmd", - repo_name = "jhudsl/AnVIL_Template" -) -``` ## Part 1. Data Import From e5d535798d5cdc7baa905536bc42cdaf00180a54 Mon Sep 17 00:00:00 2001 From: Elizabeth Humphries Date: Fri, 2 Feb 2024 18:01:16 -0500 Subject: [PATCH 7/7] little punctuation fixes --- 09-data_exploration.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/09-data_exploration.Rmd b/09-data_exploration.Rmd index 74f1f08..84fb5fd 100644 --- a/09-data_exploration.Rmd +++ b/09-data_exploration.Rmd @@ -23,9 +23,9 @@ cow::borrow_chapter( This activity will teach you how to use the AnVIL platform to: 1. Import data into RStudio -1. Examine csv file that contains the soil testing data from the BioDIGS project +1. Examine a csv file that contains the soil testing data from the BioDIGS project 1. Calculate summary statistics for variables in the soil testing data -1. Create and interpret histograms and boxplots for variables in the soil testing data. +1. Create and interpret histograms and boxplots for variables in the soil testing data ## Part 1. Data Import