---
title: "R.U.M - Publication Ready Tables"
author:
  - name: Lucy Njoki Njuki
    affiliation: Centre for Epidemiology VS Arthritis, UoM, Manchester
subtitle: User-defined functions to create summary tables
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    highlight: espresso
    use_bookdown: TRUE
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
categories: ["R"]
---
```{r style, echo=FALSE, message=FALSE, warning=FALSE, results="hide"}
suppressPackageStartupMessages({
  library(knitr) # A General-Purpose Package for Dynamic Report Generation in R, CRAN v1.41
  library(rmarkdown) # Dynamic Documents for R, CRAN v2.19
  library(bookdown) # Authoring Books and Technical Documents with R Markdown, CRAN v0.33
  library(tidylog) # Logging for 'dplyr' and 'tidyr' Functions, CRAN v1.0.2
  library(tidyverse) # Easily Install and Load the 'Tidyverse', CRAN v1.3.2
  library(janitor) # Simple Tools for Examining and Cleaning Dirty Data, CRAN v2.2.0
  library(tidyr) # Tidy Messy Data, CRAN v1.2.1
  library(palmerpenguins) # Palmer Archipelago (Antarctica) Penguin Data, CRAN v0.1.1
  library(summarytools) # Tools to Quickly and Neatly Summarize Data, CRAN v1.0.1
})
options(width = 100)
```
\newpage

# Introduction

- Some environments, such as Microsoft Azure, do not support the default Viewer mode.
- Packages such as `gtsummary`, which rely on the Viewer to display their tables, are therefore impractical there.
- User-defined functions that summarise the data and return an ordinary data frame are a better fit.
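
As a minimal sketch of why a plain data frame is enough in such environments: `knitr::kable()` returns the table as lines of text, so it prints in any console or notebook. The built-in `mtcars` data and the explicit `format = "pipe"` are illustrative choices here, not part of the original workflow.

```r
library(knitr)

# kable() returns a character vector of table lines (class "knitr_kable"),
# so nothing has to be opened in a Viewer pane
tbl <- kable(head(mtcars[, 1:3], 3), format = "pipe")
print(tbl)
```

The same call works on any data frame, including the summary tables built in the sections that follow.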
```{r penguins_df, results='hide'}
penguins = penguins
```
## Structure of the dataframe
```{r skim_data_fn, warning=FALSE, message=FALSE}
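# Summarise the structure of a data frame: type, first value, missingness
# and number of distinct values for each variable.
skim_data <- function(df, vars = NULL) {
  df <- dplyr::as_tibble(df)
  if (is.null(vars)) vars <- names(df)
  # First class of each column, e.g. "factor", "numeric", "integer"
  variable_type <- sapply(vars, function(x) is(df[[x]])[1])
  # Number of missing values per column
  missing_count <- sapply(vars, function(x) sum(is.na(df[[x]])))
  # Number of distinct values per column (NA counts as a value)
  unique_count <- sapply(vars, function(x) dplyr::n_distinct(df[[x]]))
  data_count <- nrow(df)
  # First observation of each column, as text so that all types fit in one column
  example <- sapply(vars, function(x) as.character(df[[x]][1]))
  dplyr::tibble(
    variables = vars, types = variable_type,
    example = example,
    missing_count = missing_count,
    missing_percent = (missing_count / data_count) * 100,
    unique_count = unique_count,
    total_data = data_count - missing_count # number of non-missing values
  )
}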
```
- An example: Assess the structure of `penguins` data
```{r warning=FALSE, message=FALSE}
skim_data(penguins) |> knitr::kable(caption = "Structure of the penguin species dataset")
```
# Summary tables for numeric variables
```{r explore_numeric_fn, warning=FALSE, message=FALSE}
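# Min, max, median, mean and SD of every numeric column, ignoring missing values.
# `...` is accepted for compatibility but is not used.
explore_numeric <- function(df, ...) {
  df <- dplyr::as_tibble(df)
  df %>%
    dplyr::summarise(dplyr::across(
      .cols = where(is.numeric), # keep only the numeric variables
      # na.rm = TRUE is set inside each function; passing it through across() is deprecated in newer dplyr
      .fns = list(
        Min = ~ min(.x, na.rm = TRUE), Max = ~ max(.x, na.rm = TRUE),
        Median = ~ median(.x, na.rm = TRUE), Mean = ~ mean(.x, na.rm = TRUE),
        SD = ~ sd(.x, na.rm = TRUE)
      ),
      .names = "{col}_{fn}"
    ))
}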
```
- An example: summarise the numeric variables of the `penguins` data frame
```{r explore_numeric_fn_example, warning=FALSE, message=FALSE, results='hide'}
(table1 = explore_numeric(penguins))
```
```{r warning=FALSE, message=FALSE}
table1 |> knitr::kable(caption = 'Summary statistics for the numeric variables in the penguins data')
```
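
The result of `explore_numeric()` is a single row with one column per variable and statistic, which can get wide. One possible way to reshape it into one row per variable is sketched below. It assumes the `{col}_{fn}` naming used above; the `toy` data frame and the names `wide` and `long` are hypothetical and only keep the example self-contained.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with missing values
toy <- tibble(height_cm = c(150, 160, NA, 170), weight_kg = c(50, 60, 70, NA))

# One-row summary with columns named "<variable>_<statistic>"
wide <- toy %>%
  summarise(across(
    where(is.numeric),
    list(Min = ~ min(.x, na.rm = TRUE), Mean = ~ mean(.x, na.rm = TRUE)),
    .names = "{col}_{fn}"
  ))

# Split each column name at its last underscore: the first part becomes the
# `variable` column, the second part becomes the name of a statistic column
long <- wide %>%
  pivot_longer(everything(),
               names_to = c("variable", ".value"),
               names_pattern = "(.*)_(.*)")
long
```

Because the pattern is greedy, variable names that themselves contain underscores (such as `height_cm`) stay intact.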
# Summary tables for categorical variables
```{r explore_factors_fn}
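# Frequency and percentage of every level of the selected categorical variables.
explore_factors <- function(df, ...) {
  df <- dplyr::as_tibble(df)
  df %>%
    dplyr::select(...) %>%
    # convert to character so variables with different levels can be stacked
    dplyr::mutate(dplyr::across(dplyr::everything(), as.character)) %>%
    tidyr::pivot_longer(dplyr::everything(),
                        names_to = "variable", values_to = "variable_level") %>%
    dplyr::count(variable, variable_level) %>%
    dplyr::group_by(variable) %>%
    # percentage of observations within each variable
    dplyr::mutate(proportion = round(prop.table(n) * 100, digits = 2)) %>%
    dplyr::mutate(proportion_count = paste0(n, " (", proportion, "%)")) %>%
    dplyr::arrange(dplyr::desc(n), .by_group = TRUE) %>%
    dplyr::rename(frequency = n)
}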
```
- An example: summarise the categorical variables `species`, `island` and `sex` of the `penguins` data frame
```{r explore_factors_fn_example, warning=FALSE, message=FALSE, results='hide'}
(table2 = explore_factors(penguins, species, island, sex))
```
```{r warning=FALSE, message=FALSE}
table2 |> knitr::kable(caption = 'Summary statistics for the categorical variables in the penguins data', align = "c")
```
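
The percentage column in `explore_factors()` comes from `prop.table()`, which divides each count by the total. A minimal base R sketch of that step on a hypothetical toy vector (`colour` is made up for illustration):

```r
# Hypothetical toy vector of categories
colour <- c("red", "blue", "red", "red")

counts <- table(colour)                        # count per level
percent <- round(prop.table(counts) * 100, 2)  # share of each level, in percent

# Combine count and percentage in the "n (%)" style
label <- paste0(counts, " (", percent, "%)")
names(label) <- names(counts)
label
```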
# Combine the two summary tables

- `knitr::kable()` renders tables such as these two without the Viewer, which makes them easy to display on a platform like Microsoft Azure. Passing a list of tables to `kable()` combines them into one output with a shared caption.
```{r warning=FALSE, message=FALSE}
knitr::kable(
list(table2, table1),
caption = 'Summary statistics for penguins DF',
booktabs = TRUE, valign = 't'
)
```
\newpage

# Other functions

- Sometimes we need the mode of a variable, for example to find the most common International Statistical Classification of Diseases, 10th Revision (ICD-10) code recorded for a patient.
- Base R does not provide a function for the statistical mode: `mode()` returns the storage mode of an object instead.
- A custom function that calculates the mode solves this.
```{r get_mode, warning=FALSE, message=FALSE}
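getmode <- function(v) {
  uniqv <- unique(v)                # distinct values, in order of appearance
  tab <- tabulate(match(v, uniqv))  # how often each distinct value occurs
  uniqv[tab == max(tab)]            # every value tied for the highest count
}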
```
- An example: What are the most common `Sepal.Length` and `Petal.Length` values for each species?
```{r iris_df, results='hide', warning=FALSE, message=FALSE}
iris = iris
```
```{r warning=FALSE, message=FALSE}
(mode_example = iris %>%
  group_by(Species) %>%
  summarise(
    sepal_length_mode = getmode(Sepal.Length),
    petal_length_mode = getmode(Petal.Length)
  ) %>%
  kable(caption = "Example of mode", align = "c"))
```
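
`getmode()` returns every value tied for the highest count, so a group with ties yields more than one value. Where exactly one value per group is wanted, a single-valued variant can help. The sketch below is one possible approach: `getmode_first` is a hypothetical helper, not part of the original functions, and the codes are illustrative data only.

```r
# Same logic as getmode(): returns every value tied for the highest count
getmode <- function(v) {
  uniqv <- unique(v)
  tab <- tabulate(match(v, uniqv))
  uniqv[tab == max(tab)]
}

# Hypothetical single-valued variant: keeps only the first of any tied values
getmode_first <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

codes <- c("M05", "M06", "M05", "M06", "I10")  # illustrative ICD-10-style codes
getmode(codes)        # both tied codes
getmode_first(codes)  # first tied code only
```

Keeping the first tied value is an arbitrary choice; depending on the analysis, reporting all ties may be more honest.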
# R package: `summarytools`

- `summarytools::dfSummary()` produces basic descriptive statistics for both numeric and categorical variables.
- It also tries to plot the distribution of each variable, but <p style="color:red">these plots lack utility.</p>
- Finally, it reports duplicate and missing values in the dataset.

> No need for Viewer mode! `r emojifont::emoji("smiley")` `r emojifont::emoji("raised_hands")`
```{r df_summary_tables, warning=FALSE, message=FALSE}
# create a summary table using dfSummary function
(table_stat = dfSummary(penguins))
```
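
If the plots are not wanted, they can be left out of the summary. A minimal sketch, assuming the `graph.col` argument of `dfSummary()` behaves as described in the `summarytools` documentation. The built-in `iris` data and the name `table_no_graph` are used only to keep the example self-contained.

```r
library(summarytools)

# graph.col = FALSE asks dfSummary() to omit the graph column
table_no_graph <- dfSummary(iris, graph.col = FALSE)
print(table_no_graph)
```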
# Acknowledgement
1. Dr. Belay Birlie Yimer, Centre for Epidemiology VS Arthritis, UoM: major contributor to writing the functions `skim_data`, `explore_numeric` and `explore_factors`.
2. Lana Bojanic, Centre for Mental Health and Safety, UoM, Manchester
# More resources
1. [Deep Exploratory Data Analysis (EDA) in R](https://yuzar-blog.netlify.app/posts/2021-01-09-exploratory-data-analysis-and-beyond-in-r-in-progress/#summarytools)
2. [A sufficient Introduction to R](https://dereksonderegger.github.io/570L/12-user-defined-functions.html)