```{r setup, include = FALSE}
library(emo)
options(scipen = 999)
```

## Summary: Data visualization

- `ggplot()` specifies what data to use and what variables will be mapped to where
- inside `ggplot()`, `aes(x = , y = , color = )` specify what variables correspond to what aspects of the plot in general
- make sure you are plotting what you think you are by checking the numbers!
- `facet_grid(~variable)` and `facet_wrap(~variable)` can be helpful to quickly split up your plot

## Summary: Factors

- the factor class allows us to have a different order from alphanumeric for categorical data
- we can change data to be a factor variable using `mutate()`, `as_factor()` (in the `forcats` package), or `factor()` functions and specifying the levels with the `levels` argument

See this [case study](https://www.opencasestudies.org/ocs-bp-co2-emissions/#Data_Analysis) for more information.

## Correlation

Function `cor()` computes correlation in R.

`cor.test()` tests whether the correlation is different from zero. The `alternative` argument specifies the alternative hypothesis:

- two.sided (default) means true correlation coefficient is not equal to 0
- greater means true correlation coefficient is > 0 (positive relationship)
- less means true correlation coefficient is < 0 (negative relationship)
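
For example, a quick illustration on made-up vectors (an added sketch, not part of the Circulator example below):

```{r}
# Two invented numeric vectors for illustration
a <- c(1, 3, 5, 7, 9)
b <- c(2, 4, 4, 8, 10)
cor.test(a, b, alternative = "two.sided")
```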


## GUT CHECK!

What class of data do you need to calculate a correlation?

B. Factor data

C. Numeric data

## Correlation {.codesmall}

Using the Charm City Circulator data.

```{r cor1, comment="", message = FALSE}
circ <- read_csv("https://jhudatascience.org/intro_to_r/data/Charm_City_Circulator_Ridership.csv")
head(circ)
```

## Correlation for two vectors

First, we compute correlation by providing two numeric vectors.

Like other functions, if there are `NA`s, you get `NA` as the result. But if you specify to use only the complete observations, then it will give you correlation using the non-missing data.

```{r}
# x and y must be numeric vectors
x <- circ %>% pull(orangeAverage)
y <- circ %>% pull(purpleAverage)
cor(x, y)
cor(x, y, use = "complete.obs")
```
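
## Correlation for two vectors: visualize

A scatterplot is a natural companion to the correlation coefficient. This is an added sketch: it assumes `ggplot2` is loaded (e.g., via `tidyverse`) and reuses `x` and `y` from above.

```{r}
# Annotate the scatterplot with the correlation coefficient
cor_label <- paste0("r = ", round(cor(x, y, use = "complete.obs"), 2))
circ %>%
  ggplot(aes(x = orangeAverage, y = purpleAverage)) +
  geom_point(size = 0.5) +
  annotate("text", x = 2000, y = 7500, label = cor_label)
```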


## Correlation for data frame columns

We can compute correlation for all pairs of columns of a data frame / matrix. This is often called *"computing a correlation matrix"*.

```{r}
circ_subset_Average <- circ %>% select(contains("Average"))
head(circ_subset_Average)
```

## Correlation for data frame columns

We can compute correlation for all pairs of columns of a data frame / matrix. This is often called *"computing a correlation matrix"*.

```{r}
cor(circ_subset_Average, use = "complete.obs")
```
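
## Correlation matrix: visualization

`corrplot()` (mentioned in the summary below) gives a quick visualization of a correlation matrix. A minimal added sketch, assuming the `corrplot` package is installed:

```{r}
# Plot the correlation matrix of the Average columns
library(corrplot)
corr_mat <- cor(circ_subset_Average, use = "complete.obs")
corrplot(corr_mat, type = "upper")
```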

## Correlation is not causation

```{r}
knitr::include_graphics(here::here("images/lyme_and_fried_chicken.png"))
```

[source](http://doi.org/10.1007/s10393-020-01472-1)


# T-test

## T-test
It tests the mean of a variable in one group.

- omits `NA` values in data

```{r}
x <- circ %>% pull(orangeAverage)
sum(is.na(x)) # count NAs in x
t.test(x)
```
## Running two-sample t-test in R

```{r}
x <- circ %>% pull(orangeAverage)
y <- circ %>% pull(purpleAverage)
sum(is.na(x))
sum(is.na(y)) # count NAs in x and y
t.test(x, y)
```

## T-test: retrieving information from the result with `broom` package

The `broom` package has a `tidy()` function that can organize results into a data frame so that they are easily manipulated (or nicely printed).

```{r broom, comment=""}
result <- t.test(x, y)
library(broom)
result_tidy <- tidy(result)
glimpse(result_tidy)
```

`r emo::ji("rotating_light")` You run an increased risk of Type I errors (a "false positive") when multiple hypotheses are tested simultaneously. `r emo::ji("rotating_light")`

Use the `p.adjust()` function on a vector of p-values. Use `method = ` to specify the adjustment method:

```{r}
my_pvalues <- c(0.049, 0.001, 0.31, 0.00001)
p.adjust(my_pvalues, method = "bonferroni") # multiply by number of tests
my_pvalues * 4
```

See [here](https://www.nature.com/articles/nbt1209-1135) for more about multiple testing correction. The Bonferroni correction is also often applied as a p-value threshold divided by the number of tests (0.05/number of tests).
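
For example, with the four p-values above, the threshold version of Bonferroni looks like this (an added illustration):

```{r}
# Compare each raw p-value to the adjusted threshold 0.05 / 4
my_pvalues < 0.05 / 4
```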

## Some other statistical tests

- `wilcox.test()` -- Wilcoxon signed rank test, Wilcoxon rank sum test
- `shapiro.test()` -- Shapiro-Wilk test of normality
- `ks.test()` -- Kolmogorov-Smirnov test
- `var.test()` -- Fisher’s F-Test
- `chisq.test()` -- Chi-squared test
- `aov()` -- Analysis of Variance (ANOVA)
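
For example, the Wilcoxon rank sum test (a non-parametric analogue of the two-sample t-test) can be run on the same vectors used above -- an added sketch:

```{r}
x <- circ %>% pull(orangeAverage)
y <- circ %>% pull(purpleAverage)
wilcox.test(x, y)
```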

## Summary

- use `cor()` to calculate correlation between two vectors, `cor.test()` can give more information
- `corrplot()` is nice for a quick visualization!
- `t.test()` one sample test to test the difference in mean of a single vector from zero (one input)
- `t.test()` two sample test to test the difference in means between two vectors (two inputs)
- `tidy()` in the `broom` package is useful for organizing and saving statistical test output
- remember to adjust p-values with `p.adjust()` when doing multiple tests on data

## Lab Part 1

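## Linear regression: fit the model

A minimal sketch of fitting the model summarized on the next slides -- it assumes `esdcomp` comes from the `faraway` package and that we model `visits` as a function of `hours`, as interpreted below:

```{r}
library(faraway) # assumed source of the esdcomp data
data(esdcomp)
fit <- glm(visits ~ hours, data = esdcomp)
fit
```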

## Linear regression: model summary

The `summary()` function returns a list that shows us some more detail.

```{r}
summary(fit)
```

## tidy results

The broom package can help us here too!

The estimate is the coefficient or slope -- for one change in hours worked (1 hour increase), we see 1.58 more visits. The error for this estimate is relatively small at 0.167. This relationship appears to be significant with a small p-value <0.001.

```{r}
tidy(fit) %>% glimpse()
```

## Linear regression: multiple predictors {.smaller}

Let's try adding another explanatory variable to our model, dollars per hour earned by the doctor (`revenue`). The meaning of coefficients is more complicated here.

```{r}
fit2 <- glm(visits ~ hours + revenue, data = esdcomp)
summary(fit2)
```

Can also use `tidy` and `glimpse` to see the output nicely.

```{r}
fit2 <- glm(visits ~ hours + revenue, data = esdcomp)
fit2 %>%
tidy() %>%
glimpse()
```

## Linear regression: factors

Factors get special treatment in regression models -- lowest level of the factor is the comparison group, and all other factors are **relative** to its values.

`residency` takes values Y or N to indicate whether the doctor is a resident.

```{r}
esdcomp %>% count(residency)

## Linear regression: factors {.smaller}

Yes relative to No -- baseline is no

```{r regressbaseline, comment="", fig.height=4,fig.width=8}
fit_3 <- glm(visits ~ residency, data = esdcomp)
summary(fit_3)
```

## Linear regression: factors {.smaller}

Comparison group is not listed -- treated as intercept. All other estimates are relative to the intercept.

```{r regress8, comment="", fig.height=4,fig.width=8}
circ <- read_csv("https://jhudatascience.org/intro_to_r/data/Charm_City_Circulator_Ridership.csv")
fit_4 <- glm(orangeBoardings ~ factor(day), data = circ)
summary(fit_4)
```

## Linear regression: factors {.smaller}

You can view estimates for the comparison group by removing the intercept in the GLM formula

`y ~ x - 1`

*Caveat* is that the p-values change, and interpretation is often confusing.

```{r regress9, comment="", fig.height=4, fig.width=8}
fit_6 <- glm(orangeBoardings ~ factor(day) - 1, data = circ)
summary(fit_6)
```

## Linear regression: interactions {.smaller}

You can also specify interactions between variables in a formula `y ~ x1 + x2 + x1 * x2`. This allows for not only the intercepts between factors to differ, but also the slopes with regard to the interacting variable.

```{r fig.height=4, fig.width=8}
fit_7 <- glm(visits ~ hours + residency + hours * residency, data = esdcomp)
summary(fit_7)
```

# Logistic regression

See `?family` documentation for details of family functions.

We will use data about breast cancer tumors.

"Data come from a study of breast cancer in Wisconsin. There are 681 cases of potentially cancerous tumors of which 238 are actually malignant. Determining whether a tumor is really malignant is traditionally determined by an invasive surgical procedure. The purpose of this study was to determine whether a new procedure called fine needle aspiration which draws only a small sample of tissue could be effective in determining tumor status."

```{r}
library(faraway) # the wbca data ships with the faraway package
data(wbca)
head(wbca)
```

## Logistic regression {.smaller}
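
A logistic regression is fit with `glm()` by setting `family = binomial(link = "logit")`. The model below is an added sketch -- the predictors (`Adhes`, `Epith`) and the object name `binom_fit` are illustrative choices:

```{r}
# Model tumor class (1 = benign, 0 = malignant) as a function of two cell features
binom_fit <- glm(Class ~ Adhes + Epith, data = wbca, family = binomial(link = "logit"))
summary(binom_fit)
```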

> An odds ratio (OR) is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.
Check out [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/).

Use `oddsratio(x, y)` from the `epitools` package to calculate odds ratios.

## Odds ratios

Using data on an outbreak of gastrointestinal (GI) illness and vanilla ice cream consumption:

```{r}
ice_cream %>% count(ill, vanilla.ice.cream)
```

## Odds ratios {.smaller}


```{r}
library(epitools)
response <- ice_cream %>% pull(ill)
predictor <- ice_cream %>% pull(vanilla.ice.cream)
oddsratio(predictor, response)
```

See this [case study](https://www.opencasestudies.org/ocs-bp-vaping-case-study/#Logistic_regression_%E2%80%9Cby_hand%E2%80%9D_and_by_model) for more information.

## Odds ratios {.smaller}

The odds ratio is 21.4. When the predictor is TRUE (i.e., the individual ate vanilla ice cream), the odds of the response (having acute GI illness) are about 21 times greater than when the predictor is FALSE (did not eat vanilla ice cream).

## Functions you might also see

- the `stat_cor()` function in the `ggpubr` package can add correlation coefficients and p-values as a layer to `ggplot` objects
- the `graphics::pairs()` or `GGally::ggpairs()` functions are also useful for visualizing correlations across variables in a data frame
- `acf()` in the `stats` package can compute autocorrelation and cross-correlation with lags
- calculate confidence intervals for intercept and slopes from `glm`/`lm` objects using `confint()`
- principal components analysis -- use `prcomp()`
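
For example, two of these in action (an added sketch, reusing objects defined earlier in these slides):

```{r}
# Scatterplot matrix of the Average columns (base graphics)
pairs(circ_subset_Average)

# Confidence intervals for the coefficients of the residency model
confint(fit_3)
```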

## Final note

Some final notes: