From fcae90b3b8d8abde1ad482781651fc28fcba2185 Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
Date: Mon, 9 Sep 2024 20:42:58 +0000
Subject: [PATCH] Render bookdown
---
docs/09-soil_exploration_module.md | 130 +++++--
docs/10-module_questions.md | 31 ++
docs/404.html | 16 +-
docs/About.md | 2 +-
docs/about-the-authors.html | 22 +-
docs/activity-questions.html | 325 ++++++++++++++++++
docs/anvil-workspace.html | 16 +-
docs/background.html | 16 +-
docs/billing.html | 16 +-
docs/biodigs-data.html | 16 +-
...klist-for-running-activities-on-anvil.html | 16 +-
docs/exploring-soil-testing-data-with-r.html | 137 +++++---
...g-credit-for-professional-development.html | 16 +-
docs/index.html | 18 +-
docs/index.md | 2 +-
docs/notes-for-instructors.html | 16 +-
docs/reference-keys.txt | 10 +-
docs/references.html | 26 +-
docs/research-team.html | 16 +-
docs/resources/images/08-environment.png | Bin 0 -> 152563 bytes
docs/resources/images/08-region.png | Bin 0 -> 132335 bytes
.../images/08-scrolling_through_dataset.png | Bin 0 -> 148208 bytes
.../images/08-soil_values_object.png | Bin 0 -> 146884 bytes
.../figure-html/unnamed-chunk-13-1.png | Bin 18382 -> 0 bytes
.../figure-html/unnamed-chunk-14-1.png | Bin 23408 -> 18382 bytes
.../figure-html/unnamed-chunk-15-1.png | Bin 0 -> 23408 bytes
docs/search_index.json | 2 +-
docs/setting-up-billing-on-anvil.html | 16 +-
docs/setting-up-the-class-activity.html | 16 +-
docs/support.html | 16 +-
docs/using-rstudio-on-anvil.html | 16 +-
31 files changed, 737 insertions(+), 176 deletions(-)
create mode 100644 docs/10-module_questions.md
create mode 100644 docs/activity-questions.html
create mode 100644 docs/resources/images/08-environment.png
create mode 100644 docs/resources/images/08-region.png
create mode 100644 docs/resources/images/08-scrolling_through_dataset.png
create mode 100644 docs/resources/images/08-soil_values_object.png
delete mode 100644 docs/resources/images/09-soil_exploration_module_files/figure-html/unnamed-chunk-13-1.png
create mode 100644 docs/resources/images/09-soil_exploration_module_files/figure-html/unnamed-chunk-15-1.png
diff --git a/docs/09-soil_exploration_module.md b/docs/09-soil_exploration_module.md
index 566d7f2..1489ce9 100644
--- a/docs/09-soil_exploration_module.md
+++ b/docs/09-soil_exploration_module.md
@@ -18,13 +18,13 @@ If you would like to create a Google account that is associated with your non-Gm
This activity will teach you how to use the AnVIL platform to:
-1. Import data into RStudio
-1. Examine a csv file that contains the soil testing data from the BioDIGS project
+1. Open data from an R package
+1. Examine objects in R
1. Calculate summary statistics for variables in the soil testing data
1. Create and interpret histograms and boxplots for variables in the soil testing data
-## Part 1. Data Import
+## Part 1. Examining the Data
We will use the `BioDIGS` package to retrieve the data. We first need to install the package from where it is stored on GitHub.
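+
+As a quick reference, the install-and-load steps look roughly like the sketch below (the package name `BioDIGSData` comes from the GitHub repository, and `BioDIGS_soil_data()` is the loading function used in the next code chunk):
+
+``` r
+# Install the BioDIGS data package from GitHub (only needed once; requires devtools)
+devtools::install_github("fhdsl/BioDIGSData")
+
+# Load the package and assign the soil testing data to an object
+library(BioDIGSData)
+soil.values <- BioDIGS_soil_data()
+```
+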
@@ -49,10 +49,10 @@ soil.values <- BioDIGS_soil_data()
It _seems_ like the dataset loaded, but it's always a good idea to verify. There are many ways to check, but the easiest approach (if you're using RStudio) is to look at the Environment tab on the upper right-hand side of the screen. You should now have an object called `soil.values` that includes some number of observations for 28 variables. The _observations_ refer to the number of rows in the dataset, while the _variables_ tell you the number of columns. As long as neither the observations or variables are 0, you can be confident that your dataset loaded.
-Let's take a quick look at the dataset. We can do this by clicking on soil.values object in the Environment tab. (Note: this is equivalent to typing `View(soil.values)` in the R console.)
+Let's take a quick look at the dataset. We can do this by clicking on the `soil.values` object in the Environment tab. (Note: this is equivalent to typing `View(soil.values)` in the R console.)
This will open a new window for us to scroll through the dataset.
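+
+If you prefer checking from the console instead of the Environment tab, a couple of standard R commands (not specific to this chapter) give the same information:
+
+``` r
+# Number of rows (observations) and columns (variables) in the dataset
+dim(soil.values)
+
+# Open the spreadsheet-style viewer; same as clicking the object in RStudio
+View(soil.values)
+```
+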
@@ -105,7 +105,18 @@ In this case, the data dictionary can help us make sense of what sort of values
:::
-Using the data dictionary, we find that the values in column `As_EPA3051` give us the arsenic concentration in mg/kg of each soil sample, as determined by EPA Method 3051A. While arsenic can occur naturally in soils, higher levels suggest the soil may have been contaminated by mining, hazardous waste, or pesticide application. Arsenic is toxic to humans.
+Using the data dictionary, we find that the values in column `As_EPA3051` give us the arsenic concentration in mg/kg of each soil sample, as determined by [EPA Method 3051A](https://www.epa.gov/sites/default/files/2015-12/documents/3051a.pdf). This method uses a combination of heat and acid to extract specific elements (like arsenic, cadmium, chromium, copper, nickel, lead, and zinc) from soil samples.
+
+While arsenic can occur naturally in soils, higher levels suggest the soil may have been contaminated by mining, hazardous waste, or pesticide application. Arsenic is toxic to humans.
+
+::: {.reflection}
+QUESTIONS:
+
+1. What data is found in the column labeled "Fe_Mehlich3"? Why would we be interested in how much of this is in the soil? (You may have to search the internet for this answer.)
+
+2. What data is found in the column labeled "Base_Sat_pct"? What does this variable tell us about the soil?
+
+:::
We can also look at just the names of all the columns in the R console using the `colnames()` command.
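+
+The call itself only takes the dataset object; a minimal example:
+
+``` r
+# Print the names of all 28 columns in the soil testing data
+colnames(soil.values)
+```
+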
@@ -134,7 +145,7 @@ View(soil.values)
-If you scroll to the end of the table, we can see that "region" seems to refer to the city or area where the samples were collected. For example, the first 24 samples all come from Baltimore City.
+If we scroll to the end of the table, we can see that "region" seems to refer to the city or area where the samples were collected. For example, the first 6 samples all come from Baltimore City.
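+
+If we'd rather not scroll, one option for seeing every region at once is `dplyr::count()`; this is a standard dplyr helper rather than a command from this chapter:
+
+``` r
+library(dplyr)
+
+# Tally how many samples come from each region
+soil.values %>% count(region)
+```
+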
@@ -144,13 +155,13 @@ You may notice that some cells in the `soil.values` table contain _NA_. This jus
::: {.reflection}
QUESTIONS:
-1. How many observations are in the soil testing values dataset that you loaded? What do each of these observations refer to?
+3. How many observations are in the soil testing values dataset that you loaded? What do each of these observations refer to?
-2. What data is found in the column labeled "Fe_Mehlich3"? Why would we be interested how much of this is in the soil? (You may have to search the internet for this answer.)
+4. How many different regions are represented in the soil testing dataset? How many of them have soil testing data available?
:::
-## Part 2. Data Summarization
+## Part 2. Summarizing the Data with Statistics
Now that we have the dataset loaded, let's explore the data in more depth.
@@ -169,7 +180,7 @@ library(tidyr)
soil.values.clean <- soil.values %>% drop_na(As_EPA3051)
```
-Great! Now let's calculate some basic statistics. For example, we might want to know what the mean (average) lead concentration is for each soil sample. According to the data dictionary, the values for lead concentration are in the column labeled "Pb_EPA3051". We can use a combination of two functions: `pull()` and `mean()`.`pull()` lets you extract a column from your table for statistical analysis, while `mean()` calculates the average value for the extracted column.
+Great! Now let's calculate some basic statistics. For example, we might want to know what the mean (average) arsenic concentration is for all the soil samples. We can use a combination of two functions: `pull()` and `mean()`. `pull()` lets you extract a column from your table for statistical analysis, while `mean()` calculates the average value for the extracted column.
This command follows the code structure:
@@ -188,7 +199,7 @@ soil.values.clean %>% pull(As_EPA3051) %>% mean()
## [1] 5.10875
```
-We can run similar commands to calculate the standard deviation, minimum, and maximum for the soil arsenic values.
+We can run similar commands to calculate the standard deviation (`sd`), minimum (`min`), and maximum (`max`) for the soil arsenic values.
``` r
@@ -214,52 +225,98 @@ soil.values.clean %>% pull(As_EPA3051) %>% max()
```
## [1] 27.3
```
-As you can see, the standard deviation of the arsenic concentrations is listed first, then the minimum concentration, and finally the maximum concentration.
-The soil testing dataset contains samples from multiple geographic regions, so maybe it's more meaningful to find out what the average arsenic values are for each region. We have to do a little bit of clever coding trickery for this using the `group_by` and `summarize` functions. First, we tell R to split our dataset up by a particular column (in this case, region) using the `group_by` function, then we tell R to summarize the mean arsenic concentration for each group. Because there are several different functions with the name `summarize` in R, we have to specify that we want to use `summarize` from the `dplyr` package. This command follows the code structure:
+The soil testing dataset contains samples from multiple geographic regions, so maybe it's more meaningful to find out what the average arsenic values are for each region. We have to do a little bit of clever coding trickery for this using the `group_by` and `summarize` functions. First, we tell R to split our dataset up by a particular column (in this case, region) using the `group_by` function, then we tell R to summarize the mean arsenic concentration for each group.
+
+When using the `summarize` function, we tell R to make a new table (technically, a tibble in R) that contains two columns: the column used to group the data and the statistical measure we calculated for each group.
+
+This command follows the code structure:
dataset %>%
group_by(column_name) %>%
- dplyr::summarize(Mean = mean(column_name))
+ summarize(mean(column_name))
``` r
soil.values.clean %>%
group_by(region) %>%
- dplyr::summarize(Mean = mean(As_EPA3051))
+ summarize(mean(As_EPA3051))
```
```
## # A tibble: 2 × 2
-##   region             Mean
-##   <chr>             <dbl>
-## 1 Baltimore City     5.56
-## 2 Montgomery County  4.66
+##   region            `mean(As_EPA3051)`
+##   <chr>                         <dbl>
+## 1 Baltimore City                 5.56
+## 2 Montgomery County              4.66
```
-Now we know that the mean arsenic concentration might be different for each region, and appears higher for the Baltimore City samples than the Montgomery County samples.
+Now we know that the mean arsenic concentration might be different for each region. If we compare the samples from Baltimore City and Montgomery County, the Baltimore City samples appear to have a higher mean arsenic concentration than the Montgomery County samples.
::: {.reflection}
QUESTIONS:
-3. What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value?
+5. All the samples from Baltimore City and Montgomery County were collected from public park land. The parks sampled from Montgomery County were located in suburban and rural areas, compared to the urban parks sampled in Baltimore City. Why might the Montgomery County samples have a lower average arsenic concentration than the samples from Baltimore City?
+
+6. What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value?
-2. Calculate the mean iron concentration by region. Which region has the highest mean iron concentration? What about the lowest?
+7. Calculate the mean iron concentration by region. Which region has the highest mean iron concentration? What about the lowest?
:::
-## Part 3. Data Visualization
+Let's say we're interested in looking at mean concentrations that were determined using EPA Method 3051. Given that there are 7 of these measures in the `soil.values` dataset, it would be time-consuming to run our code from above for each individual measure.
+
+We can add the `across()` function to our `summarize` statement to calculate statistical measures for multiple columns at once. `across()` tells R to apply the summary function to multiple columns, while the `ends_with()` helper tells R which columns should be included in the statistical calculation.
+
+We are using `ends_with` because all the columns we're interested in for this question end with the string 'EPA3051'.
+
+This command follows the code structure:
+
+dataset %>%
+ group_by(column_name) %>%
+ summarize(across(ends_with(common_column_name_ending), mean))
+
+
+``` r
+soil.values.clean %>%
+ group_by(region) %>%
+ summarize(across(ends_with('EPA3051'), mean))
+```
+
+```
+## # A tibble: 2 × 8
+##   region       As_EPA3051 Cd_EPA3051 Cr_EPA3051 Cu_EPA3051 Ni_EPA3051 Pb_EPA3051
+##   <chr>             <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
+## 1 Baltimore C…       5.56      0.359       34.5       35.0       17.4       67.2
+## 2 Montgomery …       4.66      0.402       29.9       24.3       23.4       38.7
+## # ℹ 1 more variable: Zn_EPA3051 <dbl>
+```
+
+This is a _much_ more efficient way to calculate statistics.
+
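+If we want more than one statistic at a time, `across()` also accepts a named list of functions. Here is a sketch (standard dplyr, not part of the chapter's own code) that calculates both the mean and the maximum for every EPA 3051 column:
+
+``` r
+soil.values.clean %>%
+  group_by(region) %>%
+  summarize(across(ends_with('EPA3051'), list(mean = mean, max = max)))
+```
+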
+::: {.reflection}
+QUESTIONS:
+
+8. Calculate the maximum values for concentrations that were determined using EPA Method 3051. (HINT: change the function you call in the `summarize` statement.) Which of these metals has the highest maximum concentration, and in which region is it found?
+
+9. Calculate both the mean and maximum values for concentrations that were determined using the Mehlich3 test. (HINT: change the string inside `ends_with`, as well as the function you call in the `summarize` statement.) Which of these metals has the highest average and maximum concentrations, and in which region are they found?
+
+:::
+
+## Part 3. Visualizing the Data
Often, it can be easier to immediately interpret data displayed as a plot than as a list of values. For example, we can more easily understand how the arsenic concentration of the soil samples are distributed if we create histograms compared to looking at point values like mean, standard deviation, minimum, and maximum.
-One way to make histograms in R is to use the `hist()` function. We can again use the `pull()` command and pipes (`%>%`) to choose the column we want from the `soil.values.clean` dataset and make a histogram of them. Remember, this command follows the code structure:
+One way to make histograms in R is with the `hist()` function. This function only requires that we tell R which column of the dataset we want to plot. (However, we also have the option to give the histogram a title and an x-axis label.)
+
+We can again use the `pull()` command and pipes (`%>%`) to choose the column we want from the `soil.values.clean` dataset and make a histogram of its values.
+
+This combination of commands follows the code structure:
dataset %>%
pull(column_name) %>%
hist(main = chart_title, xlab = x_axis_title)
-In this case, we do _not_ have to use the `dplyr::summarize` command before `hist()` because there's only one function called `hist()` in the packages we're using.
-
``` r
soil.values.clean %>%
@@ -268,15 +325,17 @@ soil.values.clean %>%
xlab ='Concentration in mg/kg' )
```
-
+
We can see that almost all the soil samples had very low concentrations of arsenic (which is good news for the soil health!). In fact, many of them had arsenic concentrations close to 0, and only one sampling location appears to have high levels of arsenic.
-We might also want to graphically compare arsenic concentrations among the geographic regions in our dataset. We can do this by creating boxplots. Boxplots are particularly useful when comparing the mean, variation, and distributions among multiple groups. In R, one way to create a boxplot is using the `boxplot()` function. We don't need to use pipes for this command, but instead will specify what columns we want to use from the dataset inside the `boxplot()` function itself.
+We might also want to graphically compare arsenic concentrations among the geographic regions in our dataset. We can do this by creating boxplots. Boxplots are particularly useful when comparing the center (median), spread, and distribution of values among multiple groups.
+
+In R, one way to create a boxplot is using the `boxplot()` function. We don't need to use pipes for this command, but instead will specify what columns we want to use from the dataset inside the `boxplot()` function itself.
This command follows the code structure:
-boxplot(arsenic_concentration ~ grouping_variable,
+boxplot(column_to_plot ~ grouping_variable,
data = dataset,
main = "Title of Graph",
xlab = "x_axis_title",
@@ -284,23 +343,26 @@ boxplot(arsenic_concentration ~ grouping_variable,
``` r
-boxplot(As_EPA3051 ~ region, data = soil.values.clean,
+boxplot(As_EPA3051 ~ region,
+ data = soil.values.clean,
main = "Arsenic Concentration by Geographic Region",
xlab = "Region",
ylab = "Arsenic Concentration in mg/kg")
```
-
+
By using a boxplot, we can quickly see that, while one sampling site within Baltimore City has a very high concentration of arsenic in the soil, in general there isn't a difference in arsenic content between Baltimore City and Montgomery County.
::: {.reflection}
QUESTIONS:
-5. Create a histogram for _iron_ concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations?
+10. Create a histogram for _iron_ concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations?
-6. Create a histogram for _lead_ concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations?
+11. Create a histogram for _lead_ concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations?
-7. Look at the maps for [iron](https://biodigs.org/#iron_map) and [lead](https://biodigs.org/#lead_map) on the BioDIGS website. Do the boxplots you created make sense, given what you see on these maps? Why or why not?
+12. Look at the maps for [iron](https://biodigs.org/#iron_map) and [lead](https://biodigs.org/#lead_map) on the BioDIGS website. Do the boxplots you created make sense, given what you see on these maps? Why or why not?
:::
+
+
diff --git a/docs/10-module_questions.md b/docs/10-module_questions.md
new file mode 100644
index 0000000..f9a0eb3
--- /dev/null
+++ b/docs/10-module_questions.md
@@ -0,0 +1,31 @@
+
+# Activity Questions
+
+## Part 1. Examining the Data
+1. What data is found in the column labeled "Fe_Mehlich3"? Why would we be interested in how much of this is in the soil? (You may have to search the internet for this answer.)
+
+2. What data is found in the column labeled "Base_Sat_pct"? What does this variable tell us about the soil?
+
+3. How many observations are in the soil testing values dataset that you loaded? What do each of these observations refer to?
+
+4. How many different regions are represented in the soil testing dataset? How many of them have soil testing data available?
+
+## Part 2. Summarizing the Data with Statistics
+
+5. All the samples from Baltimore City and Montgomery County were collected from public park land. The parks sampled from Montgomery County were located in suburban and rural areas, compared to the urban parks sampled in Baltimore City. Why might the Montgomery County samples have a lower average arsenic concentration than the samples from Baltimore City?
+
+6. What is the mean iron concentration for samples in this dataset? What about the standard deviation, minimum value, and maximum value?
+
+7. Calculate the mean iron concentration by region. Which region has the highest mean iron concentration? What about the lowest?
+
+8. Calculate the maximum values for concentrations that were determined using EPA Method 3051. (HINT: change the function you call in the `summarize` statement.) Which of these metals has the highest maximum concentration, and in which region is it found?
+
+9. Calculate both the mean and maximum values for concentrations that were determined using the Mehlich3 test. (HINT: change the string inside `ends_with`, as well as the function you call in the `summarize` statement.) Which of these metals has the highest average and maximum concentrations, and in which region are they found?
+
+## Part 3. Visualizing the Data
+
+10. Create a histogram for _iron_ concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations?
+
+11. Create a histogram for _lead_ concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations?
+
+12. Look at the maps for [iron](https://biodigs.org/#iron_map) and [lead](https://biodigs.org/#lead_map) on the BioDIGS website. Do the boxplots you created make sense, given what you see on these maps? Why or why not?