---
title: "Wine Quality Segmentation: K-Means Clustering"
author: "Neha More"
date: "December 1, 2017"
output: html_document
---
### Problem Statement:

Wine quality depends on many factors, such as alcohol content, the presence of sulphates, and pH. The taste, smell and potency of a wine are determined by its chemical ingredients and their proportions. A restaurant needs to group its wines into categories based on their ingredients and label them accordingly for its different categories of customers.

### Aim:

Using the K-means clustering algorithm, group the wines into an appropriate number of distinct clusters so that the wines within each cluster have similar properties.

### Data Dictionary:

winequality-red.csv consists of 1599 observations of wines, each described by 12 variables.
Use the variables pH, alcohol, sulphates and total.sulfur.dioxide to segment the dataset with the K-means clustering algorithm.

### Initial Setup:
```{r}
library(dplyr)
library(ggplot2)
library(cluster)
```
### Step 1: Reading and standardizing the dataset as per the requirements.
```{r}
wine=read.csv("winequality-red.csv",sep=";")
glimpse(wine) #1599 obs and 12 variables.
```
##### Let's select the four variables that we expect to help form good clusters.
```{r}
wine=wine %>%
  select(pH,sulphates,alcohol,total.sulfur.dioxide)
glimpse(wine)
```
##### Let's standardise each variable of the wine dataset. K-means relies on Euclidean distance, so variables with larger ranges would otherwise dominate.
```{r}
# z-score: subtract the mean and divide by the standard deviation
md=function(x){
  return((x-mean(x))/sd(x))
}
```
```{r}
wine_std=wine %>%
  mutate(pH=md(pH),
         sulphates=md(sulphates),
         alcohol=md(alcohol),
         total.sulfur.dioxide=md(total.sulfur.dioxide))
```
##### Let's check the summaries of the wine and wine_std datasets:
```{r}
summary(wine)
summary(wine_std)
```
##### Thus our dataset is standardized and can be used as input to the algorithm.
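
As a quick check of the standardization step, the sketch below applies the same z-score function to a small synthetic data frame and confirms that each column ends up with mean 0 and standard deviation 1. The `toy` data are hypothetical and stand in for the wine variables; base R's `scale()` is shown only as an equivalent alternative, not as the method used in this analysis.

```r
# Hypothetical toy data standing in for two of the wine variables.
set.seed(1)
toy <- data.frame(pH = rnorm(50, mean = 3.3, sd = 0.15),
                  alcohol = rnorm(50, mean = 10.4, sd = 1.1))

# Same z-score transformation as md(): subtract the mean, divide by the sd.
toy_md <- function(x) (x - mean(x)) / sd(x)
toy_std <- as.data.frame(lapply(toy, toy_md))

# Each standardized column should have mean ~0 and standard deviation 1.
toy_means <- sapply(toy_std, mean)
toy_sds <- sapply(toy_std, sd)
print(round(toy_means, 10))
print(toy_sds)

# scale() centers and scales each column in the same way by default.
toy_same <- all.equal(as.matrix(toy_std), scale(toy), check.attributes = FALSE)
print(toy_same)
```
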
### Step 2: Run K-means clustering for k=1 to 15 and determine the best value of K.
#### We will find clusters for k=1 to 15, record the within-cluster sum of squares (SSW) for each K, and then plot K vs SSW.
```{r}
mydata <- wine_std
```
```{r}
# SSW for k=1: with a single cluster it equals the total sum of squares
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
```
```{r}
# for k=2 to 15, run kmeans() on the data with i centers and store the total SSW in wss[i]
for (i in 2:15) wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
```
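
Because `kmeans()` starts from randomly chosen centres, the SSW curve can vary slightly from run to run. The sketch below shows one way to make such a loop repeatable, assuming a fixed seed and several random starts per k. The seed, the `nstart = 25` value and the synthetic `toy_data` are illustrative assumptions, not settings used in this analysis.

```r
# Hypothetical toy data: three well-separated 2-D groups, for illustration only.
set.seed(42)
toy_data <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
                  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
                  matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2))

toy_kmax <- 8
toy_wss <- numeric(toy_kmax)

# For k = 1 the within-cluster sum of squares equals the total sum of squares.
toy_wss[1] <- (nrow(toy_data) - 1) * sum(apply(toy_data, 2, var))

for (k in 2:toy_kmax) {
  # nstart = 25 reruns k-means from 25 random starts and keeps the best fit;
  # tot.withinss is the sum of the per-cluster withinss values.
  toy_wss[k] <- kmeans(toy_data, centers = k, nstart = 25)$tot.withinss
}
print(round(toy_wss, 2))
```

On data with three clear groups like this, the curve drops steeply up to k = 3 and flattens afterwards, which is the "elbow" pattern to look for.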
##### Let's plot SSW vs K:
```{r}
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares",col="mediumseagreen",pch=12)
```
##### The plot shows an elbow around K=5 or 6, so either is a reasonable choice of K.
##### We therefore choose 5 groups.
##### Hence let's run the K-means algorithm for k=5 and find clusters in mydata (i.e. wine_std).
### Step 3: Use k=5 and form 5 clusters in the wine_std dataset with the K-means algorithm.
```{r}
# run kmeans on mydata with k=5 and store the result in fit
fit <- kmeans(mydata,5)
```
fit$cluster gives the cluster to which each observation is assigned. Let's store it in the mydata dataset in a column named cluster.
```{r}
mydata$cluster=fit$cluster
# also store the labels in wine_std, which the plots below use:
wine_std$cluster=fit$cluster
```
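
To illustrate what a fitted `kmeans` object contains, the sketch below fits two clusters on hypothetical toy data and inspects the labels, the cluster sizes and the centres. The data and object names are assumptions made for the example. Note that k-means starts from random centres, so the numbering of clusters can change between runs unless a seed is set.

```r
# Hypothetical toy data: two well-separated groups with two named features.
set.seed(7)
toy_mat <- rbind(matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
                 matrix(rnorm(60, mean = 4, sd = 0.4), ncol = 2))
colnames(toy_mat) <- c("pH", "alcohol")

toy_fit <- kmeans(toy_mat, centers = 2, nstart = 10)

# toy_fit$cluster holds one integer label per row, in the original row order,
# so it can be attached to the data as a new column.
toy_labelled <- data.frame(toy_mat, cluster = toy_fit$cluster)

# Cluster sizes, obtained two equivalent ways.
print(table(toy_labelled$cluster))
print(toy_fit$size)

# One row of centre coordinates per cluster.
print(toy_fit$centers)
```
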
### Step 4: Making pairwise profiling plots and labelling wines with respect to their ingredients:
##### Plotting alcohol vs pH clusters:
```{r}
ggplot(wine_std,aes(pH,alcohol,color=as.factor(cluster)))+geom_point()
```
##### Plotting pH vs sulphates groups:
```{r}
ggplot(wine_std,aes(pH,sulphates,color=as.factor(cluster)))+geom_point()
```
##### Plotting pH vs total.sulfur.dioxide groups:
```{r}
ggplot(wine_std,aes(pH,total.sulfur.dioxide,color=as.factor(cluster)))+geom_point()
```
##### Plotting alcohol vs sulphates groups:
```{r}
ggplot(wine_std,aes(alcohol,sulphates,color=as.factor(cluster)))+geom_point()
```
##### Plotting alcohol vs total.sulfur.dioxide groups:
```{r}
ggplot(wine_std,aes(alcohol,total.sulfur.dioxide,color=as.factor(cluster)))+geom_point()
```
##### Plotting sulphates vs total.sulfur.dioxide groups:
```{r}
ggplot(wine_std,aes(sulphates,total.sulfur.dioxide,color=as.factor(cluster)))+geom_point()
```
### Inferences from plots:

> Cluster 1: low pH, high sulphates, low alcohol

> Cluster 2: high pH, low sulphates, high alcohol, low total.sulfur.dioxide

> Cluster 3: low alcohol, low sulphates, high total.sulfur.dioxide

> Cluster 4: high alcohol, low pH, low total.sulfur.dioxide

> Cluster 5: low alcohol, low sulphates, low total.sulfur.dioxide

### Step 5: Numerical inferences:
```{r}
apply(wine,2,function(x)tapply(x,wine_std$cluster,mean))
```
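
The call above computes, for every original (unstandardized) variable, the mean within each cluster. The sketch below reproduces that pattern on a tiny hand-made data frame so the shape of the result is easy to see, and shows `aggregate()` as an equivalent alternative. The values and labels are hypothetical.

```r
# Hypothetical toy data and cluster labels, for illustration only.
toy_wine <- data.frame(pH = c(3.1, 3.2, 3.5, 3.6),
                       alcohol = c(9.5, 9.7, 12.0, 12.4))
toy_cluster <- c(1, 1, 2, 2)

# Same pattern as above: one tapply() per column.
# The result is a matrix with one row per cluster and one column per variable.
toy_means_apply <- apply(toy_wine, 2, function(x) tapply(x, toy_cluster, mean))
print(toy_means_apply)

# Equivalent result as a data frame, with the cluster label as its own column.
toy_means_agg <- aggregate(toy_wine, by = list(cluster = toy_cluster), FUN = mean)
print(toy_means_agg)
```

Reading the table row by row gives the numeric profile of each cluster, which can then be compared against the impressions from the plots.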
## **Conclusion:**
##### *Thus we see that:*
##### *pH is high in cluster 2 and low in cluster 1.*
##### *Sulphates is high in cluster 1 and low in cluster 3.*
##### *Alcohol is high in clusters 2 and 4 and low in the rest of the clusters (1, 3, 5).*
##### *total.sulfur.dioxide is high in cluster 3 and low in cluster 4.*
### Step 6: Plotting silhouette.
##### Let's plot the silhouette and determine whether the members of the 5 clusters were well assigned:
```{r}
diss=daisy(wine_std)
sk=silhouette(wine_std$cluster,diss)
plot(sk)
```
##### Average silhouette width is 0.4.
#### *A positive average width of this size suggests that the observations are reasonably, though not strongly, well assigned to their clusters.*
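
One point to keep in mind: `daisy(wine_std)` is called after the `cluster` column was added to `wine_std`, so the label column also enters the distance computation. The sketch below, on hypothetical toy data, shows one way to compute the silhouette from the feature columns only and to read off the average width numerically. It uses base R's `dist()` as an assumed stand-in for `daisy()`, and the width obtained this way on the wine data may differ from the figure reported above.

```r
library(cluster)

# Hypothetical toy data: two well-separated groups.
set.seed(3)
toy_feat <- rbind(matrix(rnorm(80, mean = 0, sd = 0.5), ncol = 2),
                  matrix(rnorm(80, mean = 4, sd = 0.5), ncol = 2))
toy_km <- kmeans(toy_feat, centers = 2, nstart = 10)

# Distances are computed from the feature columns only, not the labels.
toy_diss <- dist(toy_feat)
toy_sk <- silhouette(toy_km$cluster, toy_diss)

# Average silhouette width over all observations (column "sil_width").
toy_avg_width <- mean(toy_sk[, "sil_width"])
print(toy_avg_width)

# summary() reports the same overall average plus one average per cluster.
toy_sk_summary <- summary(toy_sk)
print(toy_sk_summary$avg.width)
print(toy_sk_summary$clus.avg.widths)
```

Widths close to 1 indicate observations that sit well inside their cluster, values near 0 indicate observations on the border between two clusters, and negative values suggest a likely misassignment.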