add images and part 3

fhdsl · Feb 1, 2024 · 4dbb7ed · 4dbb7ed
1 parent a1002f3
commit 4dbb7ed
Showing 6 changed files with 137 additions and 8 deletions.
diff --git a/08-data_exploration.Rmd b/08-data_exploration.Rmd
@@ -7,20 +7,28 @@ ottrpal::set_knitr_image_path()
 
 # Exploring Soil Testing Data With R
 
-
+In this activity, you'll have a chance to become familiar with the BioDIGS soil testing data. This dataset includes information on the inorganic components of each soil sample. 
 
 ## Opening the environment
 
 ## Part 1. Data Import
 
-We will use the `readr` package to import the current soil testing dataset from the [BioDIGS](https://biodigs.org/#soil_data) website. The dataset is saved in a comma separated values, or csv, format, so we need to use the `read_csv` command. This command follows the code structure:
+We will use the `BioDIGSData` package to retrieve the data. We first need to install the package from GitHub, where it is stored.
+
+```{r, message = FALSE, warning = FALSE}
+
+devtools::install_github("fhdsl/BioDIGSData")
+```
 
-dataset <- read_csv(FILE)
+Once the package is installed, we can load the library and assign the soil testing data to an _object_. This command follows the code structure:
+
+dataset_object_name <- stored_BioDIGS_dataset
 
 ```{r, message = FALSE, warning = FALSE}
 
-library(readr)
-soil.values <- read_csv(file = "https://biodigs.org/session/c54558b5bd6750de7244bc29ad2b8e4b/download/soiltest_download?w=")
+library(BioDIGSData)
+
+soil.values <- BioDIGS_soil_data()
 ```
 
 It _seems_ like the dataset loaded, but it's always a good idea to verify. There are many ways to check, but the easiest approach (if you're using RStudio) is to look at the Environment tab on the upper right-hand side of the screen. You should now have an object called `soil.values` that includes some number of observations for 28 variables. The _observations_ refer to the number of rows in the dataset, while the _variables_ tell you the number of columns. As long as neither the observations nor the variables are 0, you can be confident that your dataset loaded.
@@ -42,7 +50,12 @@ Well, the data definitely loaded, but those column names aren't immediately unde
 
 :::
 
-In this case, the data dictionary can help us make sense of what sort of values each column represents. The data dictionary for the BioDIGS soil testing data is available on the website, but we have also reproduced it here.
+In this case, the data dictionary can help us make sense of what sort of values each column represents. The data dictionary for the BioDIGS soil testing data is available in the R package (see code below), but we have also reproduced it here.
+
+```{r, message = FALSE, warning = FALSE, eval=FALSE}
+
+?BioDIGS_soil_data()
+```
 
 :::{.dictionary}
 
@@ -136,8 +149,12 @@ This command follows the code structure:
 
 OBJECT %>% pull(column_name) %>% mean()
 
+`pull()` is a function from the `dplyr` package, which is part of the `tidyverse`, so we'll need to load the `tidyverse` library before running our command.
+
 ```{r, message = FALSE, warning = FALSE}
 
+library(tidyverse)
+
 soil.values.clean %>% pull(As_EPA3051) %>% mean()
 ```
 
@@ -149,7 +166,7 @@ soil.values.clean %>% pull(As_EPA3051) %>% sd()
 soil.values.clean %>% pull(As_EPA3051) %>% min()
 soil.values.clean %>% pull(As_EPA3051) %>% max()
 ```
-
+As you can see, the standard deviation of the arsenic concentrations is listed first, then the minimum concentration, and finally the maximum concentration.
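+
+If we'd like to see several of these statistics at once, one option is base R's `summary()` function, which reports the minimum, quartiles, median, mean, and maximum together. The sketch below uses a small made-up data frame called `example.soil` as a stand-in for `soil.values.clean`; its values are assumptions for illustration only, not real BioDIGS measurements.
+
+```r
+library(dplyr)
+
+# Hypothetical stand-in for soil.values.clean (values are made up)
+example.soil <- data.frame(
+  As_EPA3051 = c(1.5, 2.5, 3.0, 5.0)
+)
+
+# Pull the arsenic column and summarize it in a single call
+arsenic.summary <- example.soil %>%
+  pull(As_EPA3051) %>%
+  summary()
+
+arsenic.summary
+```
+
+Note that `summary()` does not report the standard deviation, so we would still need `sd()` for that value.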
 
 The soil testing dataset contains samples from multiple geographic regions, so maybe it's more meaningful to find out what the average arsenic values are for each region. We have to do a little bit of clever coding trickery for this using the `group_by` and `summarize` functions. First, we tell R to split our dataset up by a particular column (in this case, region) using the `group_by` function, then we tell R to summarize the mean arsenic concentration for each group. Because there are several different functions with the name `summarize` in R, we have to specify that we want to use `summarize` from the `dplyr` package. This command follows the code structure:
 
@@ -177,5 +194,50 @@ QUESTIONS:
 
 ## Part 3. Data Visualization
 
-You've calculated some statistics on the soil testing data. Often, it can be easier to immediately interpret these statistics in a plot than in a list of values. For example, we can get more easily understand how data values are distributed if we create histograms compared to looking at point values like mean, standard deviation, minimum, and maximum.
+Often, it can be easier to interpret data displayed as a plot than as a list of values. For example, we can more easily understand how the arsenic concentrations of the soil samples are distributed if we create a histogram than if we look at point values like the mean, standard deviation, minimum, and maximum.
+
+One way to make histograms in R is to use the `hist()` function. We can again use the `pull()` command and pipes (`%>%`) to choose the column we want from the `soil.values.clean` dataset and make a histogram of its values. This command follows the code structure:
+
+dataset %>%
+    pull(column_name) %>%
+    hist(main = chart_title, xlab = x_axis_title)
+
+In this case, we do _not_ have to specify a package name before `hist()` (the way we wrote `dplyr::summarize` earlier) because there's only one function called `hist()` in the packages we're using.
+
+```{r, message = FALSE, warning = FALSE}
+
+soil.values.clean %>% 
+    pull(As_EPA3051) %>% 
+    hist(main = 'Histogram of Arsenic Concentration',
+         xlab = 'Concentration in mg/kg')
+```
+
+We can see that almost all the soil samples had very low concentrations of arsenic (which is good news for the soil health!). In fact, many of them had arsenic concentrations close to 0, and only one sampling location appears to have high levels of arsenic. 
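+
+If we want the numbers behind the bars, one option is to call `hist()` with `plot = FALSE`, which returns the bin boundaries and counts instead of drawing the chart. This is only a sketch: `example.soil` is a made-up stand-in for `soil.values.clean`, and its values are assumptions for illustration, not real measurements.
+
+```r
+library(dplyr)
+
+# Hypothetical stand-in for soil.values.clean (values are made up)
+example.soil <- data.frame(
+  As_EPA3051 = c(0.5, 1.0, 1.2, 2.0, 2.5, 3.0, 18.0)
+)
+
+# Compute the histogram bins without drawing the plot
+arsenic.bins <- example.soil %>%
+  pull(As_EPA3051) %>%
+  hist(plot = FALSE)
+
+# Each row is one bar: where the bin starts, where it ends,
+# and how many samples fall inside it
+data.frame(
+  bin_start = head(arsenic.bins$breaks, -1),
+  bin_end   = tail(arsenic.bins$breaks, -1),
+  samples   = arsenic.bins$counts
+)
+```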
+
+We might also want to graphically compare arsenic concentrations among the geographic regions in our dataset. We can do this by creating boxplots. Boxplots are particularly useful when comparing the median, variation, and distribution of values among multiple groups. In R, one way to create a boxplot is using the `boxplot()` function. We don't need to use pipes for this command, but instead will specify which columns we want to use from the dataset inside the `boxplot()` function itself.
+
+This command follows the code structure:
+
+boxplot(arsenic_concentration ~ grouping_variable, 
+    data = dataset,
+    main = "Title of Graph",
+    xlab = "x_axis_title",
+    ylab = "y_axis_title")
+
+```{r, message = FALSE, warning = FALSE}
+boxplot(As_EPA3051 ~ region, data = soil.values.clean,
+        main = "Arsenic Concentration by Geographic Region",
+        xlab = "Region",
+        ylab = "Arsenic Concentration in mg/kg")
+```
+
+By using a boxplot, we can quickly see that, while one sampling site within Baltimore City has a very high concentration of arsenic in the soil, in general there doesn't appear to be a difference in arsenic content between Baltimore City and Montgomery County.
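+
+The thick line in the middle of each box is the group's median, and we can check those values numerically with `aggregate()`, which accepts the same `value ~ group` formula as `boxplot()`. The sketch below uses a made-up data frame, `example.soil`, as a stand-in for `soil.values.clean`; the arsenic values are assumptions for illustration only.
+
+```r
+# Hypothetical stand-in for soil.values.clean (values are made up)
+example.soil <- data.frame(
+  region = c("Baltimore City", "Baltimore City", "Baltimore City",
+             "Montgomery County", "Montgomery County", "Montgomery County"),
+  As_EPA3051 = c(2, 3, 20, 2, 3, 4)
+)
+
+# Median arsenic concentration for each region
+region.medians <- aggregate(As_EPA3051 ~ region,
+                            data = example.soil,
+                            FUN = median)
+
+region.medians
+```
+
+In this made-up example, one very high value in the first region does not change that region's median, which is why a single outlier site can sit far above a box without shifting the box itself.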
+
+::: {.reflection}
+QUESTIONS:
+
+5. Create a histogram for iron concentration, as well as a boxplot comparing iron concentration by region. Is the iron concentration similar among regions? Are there any outlier sites with unusually high or low iron concentrations?
+
+6. Create a histogram for _lead_ concentration, as well as a boxplot comparing lead concentration by region. Is the lead concentration similar among regions? Are there any outlier sites with unusually high or low lead concentrations?
 
+:::
diff --git a/resources/images/08-environment.png b/resources/images/08-environment.png
diff --git a/resources/images/08-region.png b/resources/images/08-region.png
diff --git a/resources/images/08-scrolling_through_dataset.png b/resources/images/08-scrolling_through_dataset.png
diff --git a/resources/images/08-soil_values_object.png b/resources/images/08-soil_values_object.png