-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathfinal_report.Rmd
324 lines (213 loc) · 8.36 KB
/
final_report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
---
title: "Final Report"
subtitle: "Environmental System Analysis"
author: "Vajira Lasantha"
date: "10/12/2021"
output:
html_document: default
word_document: default
pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Exercises in this report use the base datasets bundled with R in *`datasets`* package.
### 1. Influence of feed type on the weight of chickens
An experiment was conducted to measure and compare the effectiveness of various feed supplements on the growth rate of chickens. `chickwts` dataset from R *`datasets`* package provides the data from this study.
It has 71 observations on 2 variables; Weight, a numerical variable and feed, a categorical variable.
```{r chickwts summary}
summary(chickwts)
```
#### 1.1 Histogram of the weight of the chicken
```{r Chicken weight histogram}
hist(chickwts$weight, breaks=6, col='salmon')
```
#### 1.2 Testing for normality
```{r Test for Normality}
shapiro.test(chickwts$weight)
```
Shapiro-Wilks test for the wright data gives p-value of 0.21. Because p-value in greater than 0.05, the null hypothesis that the data follows normal distribution, cannot be rejected. Weight data are normally distributed.
#### 1.4 Comparison of means
**Between two types of feeds**
Alternative hypothesis: There is a significant difference in weight between chickens fed with Soybean feed and Sunflower feed.
Null Hypothesis: There is no significant difference in weight.
First, subset the data for Soybean and Sunflower from the original `chickwts` dataset.
```{r}
chick2 <- chickwts[chickwts$feed=='soybean' | chickwts$feed=='sunflower', ]
```
Then the data is checked for equal variance.
```{r test for equal variance}
bartlett.test(chick2$weight~chick2$feed)
```
In Bartlett test for equal variance, p-value is 0.72 (\> 0.05). Therefore, weight data show homogeneity of variance between different feed types. Because of this, student-t test is used for comparison of means.
```{r compare means}
t.test(chick2$weight~chick2$feed, var.equal=TRUE)
```
Null hypothesis is rejected.
There is a significant difference in means between chicken fed with soybean feed and sunflower feed.
**Between all six types of feed**
```{r equal variance}
bartlett.test(chickwts$weight~chickwts$feed)
```
In Bartlett test for equal variance, p-value is 0.66 (\> 0.05). Therefore, weight data show homogeneity of variance between different feed types. One-way ANOVA is used for comparison of means.
Null hypothesis: Means of the weights for six types of feed are equal.
```{r}
oneway.test(weight~feed, data = chickwts, var.equal = TRUE )
```
With the extremely small p-value, null hypothesis can be rejected. There is a significant difference between the weight of chicken fed with different types of feed.
### 2. Relationship between hair colour and eye colour
Hair and eye colour and sex of 592 students are recorded the `HairEyeColor` dataset from R *`datasets`* package.
```{r summary}
summary(HairEyeColor)
```
Data are in separate tables by sexes. They are combined to a single table.
```{r create a single table}
h <- margin.table(HairEyeColor, margin = c(1,2))
h
```
Chi-square test for hair and eye colour
```{r chisquare test}
chisq_h <- chisq.test(h)
chisq_h
```
p-value \< 2.2e-16, therefore there is significant difference between actual and expected distributions.
```{r check chi squared values}
chisq_h$residuals
```
Individual chi-squared residuals show strong associations/disassociations between some hair and eye colours.
There are strong associations between
- Blond hair and Blue eyes
- Black hair and Brown eyes
There are strong negative associations between
- Blond hair and Brown eyes
- Black hair and Blue eyes
### 3. Relationship between speed of cars and stopping distance
Data is from `cars` dataset. It has two variables, speed (mph) and stopping distance (ft).
```{r summary of data}
summary(cars)
```
**3.1 Scatter plot**
```{r plot the data}
plot(cars$speed, cars$dist)
```
**3.2 Correlation test**
```{r}
cor.test(cars$speed, cars$dist)
```
There is significant correlation between speed and stopping distance of cars.
**3.3 Regression Analysis**
Above data are fitted to a linear model using linear regression.
```{r}
SR <- lm(cars$speed~cars$dist)
summary(SR)
```
Linear regression model gives R^2^ value of 0.65.
### 4. Employment and economic indicators
`Longley` dataset presents 7 variables from 1947 to 1962 that can be used for predicting the number of employed persons in the USA.
Independent variable: Number of employed persons in USA for the year
Dependent variables:
• Year
• Population - Non-institutionalized above 14-year-old population
• Unemployed - Number of unemployed
• GNP -- Gross National Product
• GNP.deflator -- Price deflator centered on year 1954
• Armed.Forces -- Number of people in armed forces
A multiple regression model is fitted to this data
**4.1 Multiple regression analysis**
```{r Longley dataset summary}
summary(longley)
```
```{r Multiple Regression}
#Scaling the data
scl <- scale(longley)
scl_df <- data.frame(scl)
mlr <- lm(Employed~.,data= scl_df)
summary(mlr)
```
Multiple linear regression model is fitted to the data with R^2^ \> 99.5%.
**4.2 Multicolliniarity**
```{r check for multicollinearity}
library(car)
vif(mlr)
cor(scl_df)
```
VIF \> 10 and high correlation (\>0.95) between GNP and GNP-deflator, GNP and Population, GNP and Year, Population and Unemployed, Year and Unemployed, and Population and Year.
Some variables should be removed to reduce the multicolliniarity.
**4.3 Step-wise method**
Stepwise method of multiple regression is used for selecting a model with reduced multicollinearity.
```{r stepwise method}
library(MASS)
stp_mr <- stepAIC(mlr,direction = 'both')
summary(stp_mr)
```
Best model is selected. Employment is best predicted by GNP, Unemployment, Armed Forces and Year.
**4.4 Principal Component Analysis**
Principal component analysis is used on the same data, to reduce dimensionality.
```{r PCA}
rpca = prcomp(x=longley, scale=T)
summary(rpca)
rpca$x[,1]
rpca$x[,2]
```
**Plot of first two components**
```{r Plotting first two components}
plot(rpca$x[,1],rpca$x[,2])
```
Selection of the number of principal components
```{r pc selection}
cor01 <- cor(longley)
eigen01 <- eigen(cor01)
plot(eigen01$values, type = 'b', main = 'Scree Plot', xlab = 'Number')
abline(h=1, untf = FALSE)
```
Based on the elbow on scree plot and the Kaiser criterion, first two PCs are selected.
**4.5 Factor analysis**
```{r Factor Analysis}
rfa <- factanal(x=longley, factors = 2, rotation = 'promax', scores = 'Bartlett')
rfa
```
All variables are well represented in Factor 1. Variables that increase annually i.e., GNP, GNP deflator, Population, Year and Employed have the largest values. Factor 2 mainly consists of Unemployment and Armed forces.
**Factor Mapping**
```{r Factor mapping}
plot(rfa$loadings)
```
**4.6 Path Analysis**
Relationships described in the following figure in expected to exist among the variables in `longley` dataset.
![](images/dia1.PNG "Path analysis model")
```{r Path analysis }
shp.model <- 'Employed~Population+Unemployed+Armed.Forces+GNP
GNP~GNP.deflator
GNP.deflator~Year
Population~Year
Unemployed~Population
Armed.Forces~Unemployed'
library(lavaan)
shp.fit <- sem(shp.model, data=longley)
summary(shp.fit, standardized=TRUE, rsquare=TRUE, fit.measures=TRUE)
```
Comparative Fit Index (CFI) is 0.772
Akaike (AIC) is 621.571
RMSEA is 0.648
Modification indices of the model are investigated for further improve the model.
```{r modification of the model}
modificationindices(shp.fit)
```
Two new paths, GNP\~Population and Armed.Forces\~GNP.deflation are added to the model as suggested by the high modification indices.
![Modified model](images/dia2-02.PNG "Modified model")
```{r Path analysis - modified}
shp.model <- 'Employed~Population+Unemployed+Armed.Forces+GNP
GNP~GNP.deflator+Population
GNP.deflator~Year
Population~Year
Unemployed~Population
Armed.Forces~Unemployed+GNP.deflator'
library(lavaan)
shp.fit <- sem(shp.model, data=longley)
summary(shp.fit, standardized=TRUE, rsquare=TRUE, fit.measures=TRUE)
```
Comparative Fit Index (CFI) is 0.843
Akaike (AIC) is 598.411
RMSEA is 0.589
Shown by the increased fit and reduced AIC and RMSEA, the model has been improved by the modifications.
============================================================
Github repository <https://github.com/VajiraL/UT_EnvSys21>