-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathfinal-project.Rmd
387 lines (273 loc) · 21.1 KB
/
final-project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
---
title: "Exploring the Expedia Recommendation Algorithm"
author: 'Group 35: Max Ming Yi Koh (1007972785) and Kevin Le (1007952805)'
date: "March 31, 2022"
output:
beamer_presentation:
theme: Pittsburgh
colortheme: orchid
fonttheme: structurebold
slide_level: 2
pdf_document: default
classoption: aspectratio=169
fontsize: 11pt
urlcolor: blue
---
```{r, echo=FALSE, message=FALSE, warning=FALSE}
# echo=FALSE will stop the code chunk from appearing in the knit document
# warning=FALSE and message=FALSE will stop R messages from appearing in the knit document
library(tidyverse)
library(dplyr)
# here is the data for this project
expedia_data <- read_csv("ExpediaSearchData.csv")
# see the Project Overview page for more information:
# https://q.utoronto.ca/courses/235890/pages/project-overview
```
## General Motivations and Dataset Description
* Expedia generates revenue by reselling bookings (purchased in bulks at discounted pricing) or charging commissions from hoteliers.$^{[1]}$
* Thus, an effective recommendation algorithm that recognizes consumer needs is crucial to improve user experience and optimize the profitability of the business.
* We formulate 3 research questions that answer these 2 broad scoped questions:
1. How effective is Expedia's recommendation algorithm?
2. What are some factors that affect purchasing decisions of consumers?
* The investigation uses a dataset of 1,000 Expedia user searches along with certain variables related to users and the top 3 recommended properties for each user search between June $1^{st}$, 2021 and July $31^{st}$, 2021.
* Unless restated, you may assume the sample of the 3 research questions consists of the users that made these 1,000 Expedia searches while the population consists of all Expedia users who made a search between June $1^{st}$, 2021 and July $31^{st}$, 2021.
## Data Summary
Below is a table of variables used in the 3 research problems. Note that {n} is a placeholder for integers 1, 2, or 3.
| Variable | Description |
|-----------------|-----------------------------------------------------------------------------------------------------|
| is_trans{n} | whether the consumer transacted the n$^{th}$ displayed property within 180 minutes of a user search|
| is_drr{n} | whether the n$^{th}$ displayed property is discounted |
| num_clicks{n} | number of clicks for the n$^{th}$ displayed property within 180 minutes of a user search |
| checkin_date | stay start date |
| checkout_date | stay end date |
| adult_count | number of adults on the trip |
| child_count | number of children on the trip |
## Research Question 1 - Introduction
**Research Question**: What is the proportion of Expedia users between June $1^{st}$, 2021 and July $31^{st}$, 2021 who purchased one of the top 3 recommended properties within 180 minutes of a search?
**Research Motivation**:
* The proportion of users who purchase a top 3 recommended property is a good metric to measure the effectiveness of Expedia's recommendation algorithm.
* For instance, a higher proportion may imply that the algorithm is effective at recognizing user needs while a lower proportion may imply that the algorithm is less effective at recognizing user needs.
## Research Question 1 - Data Wrangling
* We applied the `select()` function to obtain the required variables, namely is_trans1, is_trans2 and is_trans3.
* Using is_trans1, is_trans2 and is_trans3 columns, we applied the `mutate()` function to create a new variable named trans_made which indicates whether any transactions for the top 3 recommended properties were made by a user within 180 minutes of his or her search.
* trans_made is set to TRUE if (at least) one top 3 property is transacted within 180 minutes of a user search. Otherwise, trans_made is set to FALSE.
## Research Question 1 - Data Visualization
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=2.5, fig.width=3.7, fig.align='center'}
expedia_data_q1 <- expedia_data %>%
select(is_trans1, is_trans2, is_trans3) %>%
mutate(is_trans1 = ifelse(!is.na(is_trans1) & is_trans1 == 1, 1, 0),
is_trans1 = ifelse(!is.na(is_trans2) & is_trans2 == 1, 1, 0),
is_trans1 = ifelse(!is.na(is_trans3) & is_trans3 == 1, 1, 0),
trans_made = ifelse(is_trans1 + is_trans2 + is_trans3 > 0, TRUE, FALSE)
)
expedia_data_q1 %>%
ggplot(aes(x=trans_made)) +
geom_bar(color="black", fill="gray") +
labs(x="Transaction made by users for any of the top 3\nrecommended properties within 180 minutes of \na user search",
y="Count",
title="Number of users who transacted \nand have not transacted a top 3 \nrecommended property within 180 \nminutes of a user search")
```
This figure shows that 9 out of 1,000 consumers in the sample purchased a top 3 listing recommended by Expedia within 180 minutes of their search.
## Research Question 1 - Statistical Analysis
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=3, fig.width=4, fig.align='left', results="hide"}
set.seed(15)
n <- 3000
sample_means <- rep(NA, 3000)
for (i in 1:n) {
sample <- expedia_data_q1 %>% sample_n(size=1000, replace=TRUE)
sample_means[i] <- as.numeric(sample %>% summarize(mean(trans_made)))
}
sample_means <- tibble(trans_proportion = sample_means)
confidence_interval <- quantile(sample_means$trans_proportion, c(0.025, 0.975))
```
* We assume that the sample is representative of the population in order to perform bootstrapping.
* By resampling from the sample of 1,000 users for 3,000 repetitions and choosing a 95% confidence level, we find the confidence interval for the proportion of users between June $1^{st}$, 2021 and July $31^{st}$, 2021 who purchased a top 3 recommended property within 180 minutes is (`r confidence_interval[1]`, `r confidence_interval[2]`).
* A 95% confidence level means that 95% of confidence intervals generated in a similar manner (i.e., resampling from the sample of 1,000 user data 3,000 times) will capture the true proportion of consumers between June $1^{st}$, 2021 and July $31^{st}$, 2021 who purchased a top 3 recommended property within 180 minutes.
* The width of the confidence interval (i.e., $0.010 - 0.001 = 0.009$) is very narrow. So, we expect the true proportion to be very similar to the estimate made.
## Research Question 2 - Introduction
**Research Question**: Do the mean number of clicks received within 180 minutes of a user search differ for top 3 recommended discounted and non-discounted listings between June $1^{st}$, 2021 and July $31^{st}$, 2021?
**Hypothesis:**
The population contains top 3 recommended properties for each search result between June $1^{st}$, 2021 and July $31^{st}$, 2021.
$H_0: \mu_{no\ discount} - \mu_{discount} = 0$
$H_1: \mu_{no\ discount} - \mu_{discount} \neq 0$
where $\mu_{no\ discount}$ and $\mu_{discount}$ are the mean number of clicks for non-discounted and discounted top 3 listings respectively between June $1^{st}$, 2021 and July $31^{st}$, 2021.
**Research Motivation**:
* Hypothetically, if users are more interested in discounted listings, Expedia may increase the number of discounted listings to attract more consumers.
## Research Question 2 - Data Wrangling
* We applied the `select()` function to obtain the required variables, namely num_clicks1, num_clicks2, num_clicks3, is_drr1, is_drr2 and is_drr3.
* To tidy the data (due to change in the population of interest), we ran a for loop to reshape the tibble to have rows which represent listings instead of user search results. The new tibble has columns num_clicks and is_drr.
* num_clicks represents the number of clicks received by a top 3 recommended property within 180 minutes of a user search.
* is_drr represents whether a top 3 listed property is discounted.
* We applied the `group_by()` and `summarise()` functions to obtain the mean number of clicks received within 180 minutes of a user search for discounted and non-discounted top 3 listings.
## Research Question 2 - Visualization
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=3, fig.width=4.2, fig.align='center'}
untidy_expedia_data_q2 <- expedia_data %>%
select(num_clicks1, is_drr1, num_clicks2, is_drr2, num_clicks3, is_drr3)
expedia_data_q2 <- tibble(num_clicks=c(NA), is_drr=c(NA))
n <- nrow(untidy_expedia_data_q2)
for (i in 1:n) {
expedia_data_q2 <- add_row(expedia_data_q2,
num_clicks = as.numeric(untidy_expedia_data_q2[i, "num_clicks1"]),
is_drr = as.numeric(untidy_expedia_data_q2[i, "is_drr1"]))
expedia_data_q2 <- add_row(expedia_data_q2,
num_clicks = as.numeric(untidy_expedia_data_q2[i, "num_clicks2"]),
is_drr = as.numeric(untidy_expedia_data_q2[i, "is_drr2"]))
expedia_data_q2 <- add_row(expedia_data_q2,
num_clicks = as.numeric(untidy_expedia_data_q2[i, "num_clicks3"]),
is_drr = as.numeric(untidy_expedia_data_q2[i, "is_drr3"]))
}
expedia_data_q2 <- expedia_data_q2 %>%
filter(!is.na(num_clicks) & !is.na(is_drr)) %>%
mutate(is_drr = ifelse(is_drr == 1, TRUE, FALSE))
expedia_data_q2_summary <- expedia_data_q2 %>%
group_by(is_drr) %>%
summarise(mean_num_clicks=mean(num_clicks))
expedia_data_q2_summary %>%
ggplot(aes(x=is_drr, y=mean_num_clicks)) +
geom_bar(stat="identity") +
labs(x="Property discounted",
y="Mean number of clicks\nwithin 180 minutes \nof a user search",
title="Average number of clicks within 180 \nminutes of a user search for discounted \nand non-discounted top 3 properties")
```
## Research Question 2 - Statistical Analysis
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=3, fig.width=4, fig.align='left', results="hide"}
test_stat <- expedia_data_q2_summary %>%
summarize(test_stat=diff(mean_num_clicks)) %>%
as.numeric()
set.seed(15)
n <- 5000
simulated_values <- rep(NA, nrow(expedia_data_q2))
for (i in 1:n) {
simdata <- expedia_data_q2 %>%
mutate(is_drr = sample(is_drr))
simulated_values[i] <- simdata %>%
group_by(is_drr) %>%
summarize(mean_num_clicks=mean(num_clicks)) %>%
summarize(mean_clicks_diff = diff(mean_num_clicks)) %>%
as.numeric()
}
simulated_values <- tibble(mean_clicks_diff = simulated_values)
more_extreme_simulated_values <- simulated_values %>%
filter(abs(mean_clicks_diff) >= abs(test_stat)) %>%
summarize(size = n())
p_value <- as.numeric(more_extreme_simulated_values) / n
```
* The calculated test statistic from the dataset for the difference between the mean number of clicks received within 180 minutes of a user search for non-discounted and discounted top 3 properties is `r round(test_stat, digits=4)`.
* This means that within the sample, discounted top 3 properties get `r -round(test_stat, digits=4)` less clicks than non-discounted top 3 properties on average within 180 minutes of a user search.
* After running 5,000 simulations under the assumption that the null hypothesis is true by shuffling the label id_drr indicating whether a property is discounted, the p-value is found to be `r round(p_value, digits=4)`.
* Since the p-value is between 0.01 and 0.001, there is strong evidence against the null hypothesis.
## Research Question 3 - Introduction
**Research Question**: Do the number of adults and children on a trip affect the length of
travel for Expedia searches made between June $1^{st}$, 2021 and July $31^{st}$, 2021?
**Hypothesis**:
$H_0: \beta_{i} = 0$
$H_1: \beta_{i} \neq 0$
where $\beta_{i}$ is a slope coefficient for the linear regression model where the predictors are the number of children and adults, and the response is the length of travel.
**Research Motivation**:
* If a correlation exists between the type and number of travelers, and length of travel, Expedia may want to consider this correlation when booking properties in bulk.
* For instance, if the correlation is found to be positive, larger properties which can fit more people may be booked for a longer time interval.
## Research Question 3 - Data Wrangling
* We applied the `select()` function to obtain the required variables, namely adult_count, child_count, checkin_date and checkout_date.
* We applied the `mutate()` function to create a new variable called travel_length.
* travel_length represents the time difference between checkout_date and checkin_date in days.
* To split the dataset into training and testing data when performing linear regression, we applied the `rowid_to_column()` function to add a unique identifier to each row in the tibble.
## Research Question 3 - Statistical Analysis
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=3, fig.width=4, fig.align='left', results="hide"}
expedia_data_q3 <- expedia_data %>%
select(adult_count, child_count, checkin_date, checkout_date) %>%
filter(!is.na(checkin_date) & !is.na(checkout_date)) %>%
mutate(travel_length = as.numeric(checkout_date - checkin_date)) %>%
select(adult_count, child_count, travel_length) %>%
rowid_to_column()
set.seed(15)
n <- nrow(expedia_data_q3)
training_indices <- sample(1:n, size=round(0.8 * n))
glimpse(expedia_data_q3)
train <- expedia_data_q3 %>% filter(rowid %in% training_indices)
y_train <- train$travel_length
test <- expedia_data_q3 %>% filter(!(rowid %in% training_indices))
y_test <- test$travel_length
mod1_train <- lm(travel_length ~ adult_count, data = train)
mod2_train <- lm(travel_length ~ child_count, data = train)
mod3_train <- lm(travel_length ~ adult_count + child_count, data = train)
mod4_train <- lm(travel_length ~ adult_count * child_count, data = train)
yhat_mod1_test <- predict(mod1_train, newdata = test)
yhat_mod2_test <- predict(mod2_train, newdata = test)
yhat_mod3_test <- predict(mod3_train, newdata = test)
yhat_mod4_test <- predict(mod4_train, newdata = test)
yhat_mod1_train <- predict(mod1_train, newdata = train)
yhat_mod2_train <- predict(mod2_train, newdata = train)
yhat_mod3_train <- predict(mod3_train, newdata = train)
yhat_mod4_train <- predict(mod4_train, newdata = train)
mod1_test_RMSE <- sqrt(sum((y_test - yhat_mod1_test)^2 / nrow(test)))
mod2_test_RMSE <- sqrt(sum((y_test - yhat_mod2_test)^2 / nrow(test)))
mod3_test_RMSE <- sqrt(sum((y_test - yhat_mod3_test)^2 / nrow(test)))
mod4_test_RMSE <- sqrt(sum((y_test - yhat_mod4_test)^2 / nrow(test)))
mod1_train_RMSE <- sqrt(sum((y_train - yhat_mod1_train)^2 / nrow(train)))
mod2_train_RMSE <- sqrt(sum((y_train - yhat_mod2_train)^2 / nrow(train)))
mod3_train_RMSE <- sqrt(sum((y_train - yhat_mod3_train)^2 / nrow(train)))
mod4_train_RMSE <- sqrt(sum((y_train - yhat_mod4_train)^2 / nrow(train)))
model_comparison_table <- tibble(Model = c("1", "2", "3", "4"),
RMSE_testdata = c(mod1_test_RMSE, mod2_test_RMSE,
mod3_test_RMSE, mod4_test_RMSE),
RMSE_traindata = c(mod1_train_RMSE, mod2_train_RMSE,
mod3_train_RMSE, mod4_train_RMSE),
ratio_of_RMSEs = RMSE_testdata / RMSE_traindata)
# model_comparison_table
# summary(mod2_train)$r.squared
# summary(mod3_train)$r.squared
# summary(mod4_train)$r.squared
# summary(mod2_train)$coefficients
# summary(mod3_train)$coefficients
# summary(mod4_train)$coefficients
slope <- summary(mod1_train)$coefficients[2,1]
r_squared <- summary(mod1_train)$r.squared
p_value <- summary(mod1_train)$coefficients[2,4]
```
* We created 4 linear regression models which differ by number of predictors (e.g. 1 predictor versus 2 predictors) and whether predictor variables interact.
* By picking the "best" model based on prediction accuracy, low signs of overfitting and simplicity, we find that the "best" model is a simple linear regression model (a model with 1 predictor) that uses the number of adults to predict the length of a trip.
* The extremely small p-value of `r round(p_value, digits=6)` suggests there is very strong evidence against the null hypothesis that there is no relationship between the number of adults on a trip and the length of a trip. Thus, we reject the null hypothesis.
* The calculated slope of `r round(slope, digits=4)` implies that for each additional adult, the length of trip increases by `r round(slope, digits=4)` days on average.
* However, the $R^2$ value of `r round(r_squared, 4)` for the linear regression model implies only `r round(r_squared, 4) * 100`% of variability in the length of a trip is accounted by the number of adults on the trip.
## Research Question 3 - Visualization
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=2.5, fig.width=4, fig.align='center'}
expedia_data_q3 %>% ggplot(aes(x=adult_count, y=travel_length)) +
geom_point() +
geom_smooth(se=FALSE, method="lm") +
labs(x="Number of adults",
y="Length of travel (days)",
title="How number of adults affects length of \ntravel") +
theme_minimal() +
scale_x_continuous(breaks=c(0,1,2,3,4,5,6,7,8,9,10)) +
scale_y_continuous(breaks=c(0,10,20,30))
```
This is a figure of the "best" linear regression model which shows the association between the number of adults on a trip and the length of travel.
## Limitations
* We assume that data filled in by users is accurate. Inaccurate data entries (e.g. typos) may skew the statistical results.
* We assume that the sample is representative of the population. Otherwise, statistical methods (especially bootstrapping) will not yield accurate results.
* For research questions 1 and 2, the variables is_trans{n} and num_clicks{n} reflect user events like transaction or clicks within 180 minutes of a user search. So, statistical inferences for such events are limited to 180 minutes within a user search.
* For research question 2, transactions are a better metric of consumer interest compared to clicks received by a listing. However, only 9 properties are transacted out of the 3,000 properties. So, we used the number of clicks as the metric.
* For research question 3, we intended to research how the number of adults, children and infants affect the length of travel. However, only 2 user searches had infants in their travel group. So, we removed the number of infants as a potential predictor.
## Overall Conclusion - Looking Ahead Part 1
Here are some closing thoughts for each research question.
For **research question 1**,
* The 95% confidence interval for the proportion of top 3 recommended listings transacted within 180 minutes is (`r confidence_interval[1]`, `r confidence_interval[2]`).
* The bounds of the interval are low which implies that the recommendation algorithm has room for improvement in terms of increasing the proportion of users who make a transaction within 180 minutes of a user search.
* We recommend Expedia invest in research and development of recommendation algorithms as such algorithms have potential of directly improving the profitability of Expedia.
## Overall Conclusion - Looking Ahead Part 2
For **research question 2**,
* There is strong evidence against the null hypothesis that the mean number of clicks within 180 minutes of a user search for discounted and non-discounted top 3 properties is the same between June $1^{st}$, 2021 and July $31^{st}$, 2021.
* We recommend Expedia perform A/B testing with 2 variations of the Expedia website which differ by the number of discounted listings.$^{[2]}$
* This allows Expedia to find out how user consumption behaviours change based on change in the number of discounted properties.
* Knowing this may help improve the recommendation algorithm in terms of recommending more or less discounted properties to suit user preferences.
* At the same time, understanding consumption behaviours may lead to the development of new marketing strategies in terms of discounting more or less properties to maximize profit.
## Overall Conclusion - Looking Ahead Part 3
For **research question 3**,
* It was found that as the number of adults increases, the length of travel increases for Expedia searches made between June $1^{st}$, 2021 and July $31^{st}$, 2021.
* Expedia can consider the linear regression model when purchasing properties in bulk or advising hoteliers regarding the expected intervals of booking.
* For instance, when booking larger properties (which can fit more people) for resale, Expedia should book these properties for longer period to reduce chances of them not being transacted.
* However, the low $R^2$ value of `r round(r_squared, 4)` of the model implies that there are other variables (apart from number of adults) that explain the variability in travel length.
* To get a more "complete" model, Expedia may choose to explore how other variables (including those that do not appear in the current dataset) affect the length of travel.
## Citations
1. https://www.nasdaq.com/articles/how-expedia-makes-most-its-money-2017-08-28
2. https://vwo.com/ab-testing/