-
Notifications
You must be signed in to change notification settings - Fork 29
/
Copy pathstats.qmd
197 lines (147 loc) · 5.36 KB
/
stats.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
# Statistical Tests and Models
## Tests for Exploratory Data Analysis
A collection of functions are available from `scipy.stats`.
+ Comparing the locations of two samples
- `ttest_ind`: t-test for two independent samples
- `ttest_rel`: t-test for paired samples
- `ranksums`: Wilcoxon rank-sum test for two independent samples
- `wilcoxon`: Wilcoxon signed-rank test for paired samples
+ Comparing the locations of multiple samples
- `f_oneway`: one-way ANOVA
- `kruskal`: Kruskal-Wallis H-test
+ Tests for associations in contigency tables
- `chi2_contingency`: Chi-square test of independence of variables
- `fisher_exact`: Fisher exact test on a 2x2 contingency table
+ Goodness of fit
- `goodness_of_fit`: distribution could contain unspecified parameters
- `anderson`: Anderson-Darling test
- `kstest`: Kolmogorov-Smirnov test
- `chisquare`: one-way chi-square test
- `normaltest`: test for normality
Since R has a richer collections of statistical functions, we can call
R function from Python with `rpy2`. See, for example, a [blog on this
subject](https://rviews.rstudio.com/2022/05/25/calling-r-from-python-with-rpy2/).
For example, `fisher_exact` can only handle 2x2 contingency tables. For
contingency tables larger than 2x2, we can call `fisher.text()` from R through
`rpy2`.
See [this StackOverflow post](https://stackoverflow.com/questions/25368284/fishers-exact-test-for-bigger-than-2-by-2-contingency-table).
Note that the `.` in function names and arguments are replaced with `_`.
```{python}
import pandas as pd
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
stats = importr('stats')
jan23 = pd.read_csv("data/nyc_crashes_202301.csv")
jan23["injury"] = np.where(jan23["NUMBER OF PERSONS INJURED"] > 0, 1, 0)
m = pd.crosstab(jan23["injury"], jan23["BOROUGH"])
print(m)
res = stats.fisher_test(m.to_numpy(), simulate_p_value = True)
res[0][0]
```
<!-- ## scypy statistics functions -->
{{< include _scipy-stat.qmd >}}
## Linear Model
Let's simulate some data for illustrations.
```{python}
import numpy as np
nobs = 200
ncov = 5
np.random.seed(123)
x = np.random.random((nobs, ncov)) # Uniform over [0, 1)
beta = np.repeat(1, ncov)
y = 2 + np.dot(x, beta) + np.random.normal(size = nobs)
```
Check the shape of `y`:
```{python}
y.shape
```
Check the shape of `x`:
```{python}
x.shape
```
That is, the true linear regression model is
$$
y = 2 + x_1 + x_2 + x_3 + x_4 + x_5 + \epsilon.
$$
A regression model for the observed data can be fitted as
```{python}
import statsmodels.api as sma
xmat = sma.add_constant(x)
mymod = sma.OLS(y, xmat)
myfit = mymod.fit()
myfit.summary()
```
Questions to review:
+ How are the regression coefficients interpreted? Intercept?
+ Why does it make sense to center the covariates?
Now we form a data frame with the variables
```{python}
import pandas as pd
df = np.concatenate((y.reshape((nobs, 1)), x), axis = 1)
df = pd.DataFrame(data = df,
columns = ["y"] + ["x" + str(i) for i in range(1,
ncov + 1)])
df.info()
```
Let's use a formula to specify the regression model as in R, and fit
a robust linear model (`rlm`) instead of OLS. Note that the model specification
and the function interface is similar to R.
```{python}
import statsmodels.formula.api as smf
mymod = smf.rlm(formula = "y ~ x1 + x2 + x3 + x4 + x5", data = df)
myfit = mymod.fit()
myfit.summary()
```
For model diagnostics, one can check residual plots.
```{python}
import matplotlib.pyplot as plt
myOlsFit = smf.ols(formula = "y ~ x1 + x2 + x3 + x4 + x5", data = df).fit()
fig = plt.figure(figsize = (6, 6))
## residual versus x1; can do the same for other covariates
fig = sma.graphics.plot_regress_exog(myOlsFit, 'x1', fig=fig)
```
See more on [residual diagnostics and specification
tests](https://www.statsmodels.org/stable/stats.html#residual-diagnostics-and-specification-tests).
## Generalized Linear Regression
A linear regression model cannot be applied to presence/absence or
count data.
Binary or count data need to be modeled under a generlized
framework. Consider a binary or count variable $Y$ with possible
covariates $X$. A generalized model describes a transformation $g$
of the conditional mean $E[Y | X]$ by a linear predictor
$X^{\top}\beta$. That is
$$
g( E[Y | X] ) = X^{\top} \beta.
$$
The transformation $g$ is known as the link function.
For logistic regression with binary outcomes, the link function is
the logit function
$$
g(u) = \log \frac{u}{1 - u}, \quad u \in (0, 1).
$$
What is the interpretation of the regression coefficients in a
logistic regression? Intercept?
A logistic regression can be fit with `statsmodels.api.glm`.
Let's generate some binary data first by dichotomizing existing variables.
```{python}
import statsmodels.genmod as smg
eta = x.dot([2, 2, 2, 2, 2]) - 5
p = smg.families.links.Logit().inverse(eta)
df["yb"] = np.random.binomial(1, p, p.size)
```
Fit a logistic regression for `y1b`.
```{python}
mylogistic = smf.glm(formula = 'yb ~ x1 + x2 + x3 + x4 + x5', data = df,
family = smg.families.Binomial())
mylfit = mylogistic.fit()
mylfit.summary()
```
If we treat `y1b` as count data, a Poisson regression can be fitted.
```{python}
myPois = smf.glm(formula = 'yb ~ x1 + x2 + x3 + x4 + x5', data = df,
family = smg.families.Poisson())
mypfit = myPois.fit()
mypfit.summary()
```