final-project.Rmd

---
title: "Exploring the Expedia Recommendation Algorithm"
author: 'Group 35: Max Ming Yi Koh (1007972785) and Kevin Le (1007952805)'
date: "March 31, 2022"
output:
  beamer_presentation:
    theme: Pittsburgh
    colortheme: orchid
    fonttheme: structurebold
    slide_level: 2
  pdf_document: default
classoption: aspectratio=169
fontsize: 11pt
urlcolor: blue
---


```{r, echo=FALSE, message=FALSE, warning=FALSE}
# echo=FALSE will stop the code chunk from appearing in the knit document
# warning=FALSE and message=FALSE will stop R messages from appearing in the knit document
library(tidyverse)
library(dplyr)
# here is the data for this project
expedia_data <- read_csv("ExpediaSearchData.csv")
# see the Project Overview page for more information: 
# https://q.utoronto.ca/courses/235890/pages/project-overview
```

## General Motivations and Dataset Description

* Expedia generates revenue by reselling bookings (purchased in bulks at discounted pricing) or charging commissions from hoteliers.$^{[1]}$

* Thus, an effective recommendation algorithm that recognizes consumer needs is crucial to improve user experience and optimize the profitability of the business.

* We formulate 3 research questions that answer these 2 broad scoped questions:
  1. How effective is Expedia's recommendation algorithm? 
  2. What are some factors that affect purchasing decisions of consumers? 

* The investigation uses a dataset of 1,000 Expedia user searches along with certain variables related to users and the top 3 recommended properties for each user search between June $1^{st}$, 2021 and July $31^{st}$, 2021.

* Unless restated, you may assume the sample of the 3 research questions consists of the users that made these 1,000 Expedia searches while the population consists of all Expedia users who made a search between June $1^{st}$, 2021 and July $31^{st}$, 2021.

## Data Summary
Below is a table of variables used in the 3 research problems. Note that {n} is a placeholder for integers 1, 2, or 3. 
 
| Variable      	| Description                                                                                  	      |
|-----------------|-----------------------------------------------------------------------------------------------------|
| is_trans{n}   	|  whether the consumer transacted the n$^{th}$ displayed property within 180 minutes of a user search|
| is_drr{n}     	|  whether the n$^{th}$ displayed property is discounted                                              |
| num_clicks{n} 	|  number of clicks for the n$^{th}$ displayed property within 180 minutes of a user search           |
| checkin_date  	|  stay start date                                                                                    |
| checkout_date 	|  stay end date                                                                                      |
| adult_count   	|  number of adults on the trip                                                                       |
| child_count   	|  number of children on the trip                                                                     |


## Research Question 1 - Introduction
**Research Question**: What is the proportion of Expedia users between June $1^{st}$, 2021 and July $31^{st}$, 2021 who purchased one of the top 3 recommended properties within 180 minutes of a search? 

**Research Motivation**: 

* The proportion of users who purchase a top 3 recommended property is a good metric to measure the effectiveness of Expedia's recommendation algorithm. 

* For instance, a higher proportion may imply that the algorithm is effective at recognizing user needs while a lower proportion may imply that the algorithm is less effective at recognizing user needs. 

## Research Question 1 - Data Wrangling  
* We applied the `select()` function to obtain the required variables, namely is_trans1, is_trans2 and is_trans3. 

* Using is_trans1, is_trans2 and is_trans3 columns, we applied the `mutate()` function to create a new variable named trans_made which indicates whether any transactions for the top 3 recommended properties were made by a user within 180 minutes of his or her search.

* trans_made is set to TRUE if (at least) one top 3 property is transacted within 180 minutes of a user search. Otherwise, trans_made is set to FALSE.

## Research Question 1 - Data Visualization
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=2.5, fig.width=3.7, fig.align='center'}
expedia_data_q1 <- expedia_data %>% 
  select(is_trans1, is_trans2, is_trans3) %>% 
  mutate(is_trans1 = ifelse(!is.na(is_trans1) & is_trans1 == 1, 1, 0),
         is_trans1 = ifelse(!is.na(is_trans2) & is_trans2 == 1, 1, 0),
         is_trans1 = ifelse(!is.na(is_trans3) & is_trans3 == 1, 1, 0), 
         trans_made = ifelse(is_trans1 + is_trans2 + is_trans3 > 0, TRUE, FALSE)
  ) 

expedia_data_q1 %>%  
  ggplot(aes(x=trans_made)) + 
  geom_bar(color="black", fill="gray") + 
  labs(x="Transaction made by users for any of the top 3\nrecommended properties within 180 minutes of \na user search",
       y="Count", 
       title="Number of users who transacted \nand have not transacted a top 3 \nrecommended property within 180 \nminutes of a user search")
```

This figure shows that 9 out of 1,000 consumers in the sample purchased a top 3 listing recommended by Expedia within 180 minutes of their search.

## Research Question 1 - Statistical Analysis
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=3, fig.width=4, fig.align='left', results="hide"}
set.seed(15) 
n <- 3000
sample_means <- rep(NA, 3000)

for (i in 1:n) {
  sample <- expedia_data_q1 %>% sample_n(size=1000, replace=TRUE)
  sample_means[i] <- as.numeric(sample %>% summarize(mean(trans_made)))
}

sample_means <- tibble(trans_proportion = sample_means) 
confidence_interval <- quantile(sample_means$trans_proportion, c(0.025, 0.975))
```

* We assume that the sample is representative of the population in order to perform bootstrapping. 

* By resampling from the sample of 1,000 users for 3,000 repetitions and choosing a 95% confidence level, we find the confidence interval for the proportion of users between June $1^{st}$, 2021 and July $31^{st}$, 2021 who purchased a top 3 recommended property within 180 minutes is (`r confidence_interval[1]`, `r confidence_interval[2]`).
  
* A 95% confidence level means that 95% of confidence intervals generated in a similar manner (i.e., resampling from the sample of 1,000 user data 3,000 times) will capture the true proportion of consumers between June $1^{st}$, 2021 and July $31^{st}$, 2021 who purchased a top 3 recommended property within 180 minutes.

* The width of the confidence interval (i.e., $0.010 - 0.001 = 0.009$) is very narrow. So, we expect the true proportion to be very similar to the estimate made. 

## Research Question 2 - Introduction
**Research Question**: Do the mean number of clicks received within 180 minutes of a user search differ for top 3 recommended discounted and non-discounted listings between June $1^{st}$, 2021 and July $31^{st}$, 2021? 

**Hypothesis:** 
The population contains top 3 recommended properties for each search result between June $1^{st}$, 2021 and July $31^{st}$, 2021. 

$H_0: \mu_{no\ discount} - \mu_{discount} = 0$

$H_1: \mu_{no\ discount} - \mu_{discount} \neq 0$ 

where $\mu_{no\ discount}$ and $\mu_{discount}$ are the mean number of clicks for non-discounted and discounted top 3 listings respectively between June $1^{st}$, 2021 and July $31^{st}$, 2021. 

**Research Motivation**: 

* Hypothetically, if users are more interested in discounted listings, Expedia may increase the number of discounted listings to attract more consumers. 

## Research Question 2 - Data Wrangling  
* We applied the `select()` function to obtain the required variables, namely num_clicks1, num_clicks2, num_clicks3, is_drr1, is_drr2 and is_drr3. 

* To tidy the data (due to change in the population of interest), we ran a for loop to reshape the tibble to have rows which represent listings instead of user search results. The new tibble has columns num_clicks and is_drr. 

* num_clicks represents the number of clicks received by a top 3 recommended property within 180 minutes of a user search. 
  
* is_drr represents whether a top 3 listed property is discounted. 

* We applied the `group_by()` and `summarise()` functions to obtain the mean number of clicks received within 180 minutes of a user search for discounted and non-discounted top 3 listings. 

## Research Question 2 - Visualization
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=3, fig.width=4.2, fig.align='center'}
untidy_expedia_data_q2 <- expedia_data %>% 
  select(num_clicks1, is_drr1, num_clicks2, is_drr2, num_clicks3, is_drr3)

expedia_data_q2 <- tibble(num_clicks=c(NA), is_drr=c(NA))
n <- nrow(untidy_expedia_data_q2)

for (i in 1:n) {
  expedia_data_q2 <- add_row(expedia_data_q2, 
                             num_clicks = as.numeric(untidy_expedia_data_q2[i, "num_clicks1"]),
                             is_drr = as.numeric(untidy_expedia_data_q2[i, "is_drr1"]))
  expedia_data_q2 <- add_row(expedia_data_q2, 
                             num_clicks = as.numeric(untidy_expedia_data_q2[i, "num_clicks2"]),
                             is_drr = as.numeric(untidy_expedia_data_q2[i, "is_drr2"]))
  expedia_data_q2 <- add_row(expedia_data_q2, 
                             num_clicks = as.numeric(untidy_expedia_data_q2[i, "num_clicks3"]),
                             is_drr = as.numeric(untidy_expedia_data_q2[i, "is_drr3"]))
}

expedia_data_q2 <- expedia_data_q2 %>% 
  filter(!is.na(num_clicks) & !is.na(is_drr)) %>% 
  mutate(is_drr = ifelse(is_drr == 1, TRUE, FALSE))

expedia_data_q2_summary <- expedia_data_q2 %>% 
  group_by(is_drr) %>% 
  summarise(mean_num_clicks=mean(num_clicks))

expedia_data_q2_summary %>% 
  ggplot(aes(x=is_drr, y=mean_num_clicks)) + 
  geom_bar(stat="identity") + 
  labs(x="Property discounted",
       y="Mean number of clicks\nwithin 180 minutes \nof a user search",
       title="Average number of clicks within 180 \nminutes of a user search for discounted \nand non-discounted top 3 properties")
```

## Research Question 2 - Statistical Analysis
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=3, fig.width=4, fig.align='left', results="hide"}
test_stat <- expedia_data_q2_summary %>%  
  summarize(test_stat=diff(mean_num_clicks)) %>%  
  as.numeric()

set.seed(15) 

n <- 5000
simulated_values <- rep(NA, nrow(expedia_data_q2))

for (i in 1:n) {
  simdata <- expedia_data_q2 %>% 
    mutate(is_drr = sample(is_drr)) 
  
  simulated_values[i] <- simdata %>% 
    group_by(is_drr) %>% 
    summarize(mean_num_clicks=mean(num_clicks)) %>%  
    summarize(mean_clicks_diff = diff(mean_num_clicks)) %>%  
    as.numeric()
}

simulated_values <- tibble(mean_clicks_diff = simulated_values) 

more_extreme_simulated_values <- simulated_values %>% 
  filter(abs(mean_clicks_diff) >= abs(test_stat)) %>%  
  summarize(size = n())

p_value <- as.numeric(more_extreme_simulated_values) / n
```

* The calculated test statistic from the dataset for the difference between the mean number of clicks received within 180 minutes of a user search for non-discounted and discounted top 3 properties is `r round(test_stat, digits=4)`. 

* This means that within the sample, discounted top 3 properties get `r -round(test_stat, digits=4)` less clicks than non-discounted top 3 properties on average within 180 minutes of a user search. 

* After running 5,000 simulations under the assumption that the null hypothesis is true by shuffling the label id_drr indicating whether a property is discounted, the p-value is found to be `r round(p_value, digits=4)`. 

* Since the p-value is between 0.01 and 0.001, there is strong evidence against the null hypothesis. 

## Research Question 3 - Introduction
**Research Question**: Do the number of adults and children on a trip affect the length of
travel for Expedia searches made between June $1^{st}$, 2021 and July $31^{st}$, 2021?

**Hypothesis**: 

$H_0: \beta_{i} = 0$

$H_1: \beta_{i} \neq 0$

where $\beta_{i}$ is a slope coefficient for the linear regression model where the predictors are the number of children and adults, and the response is the length of travel. 

**Research Motivation**: 

* If a correlation exists between the type and number of travelers, and length of travel, Expedia may want to consider this correlation when booking properties in bulk.

* For instance, if the correlation is found to be positive, larger properties which can fit more people may be booked for a longer time interval.

## Research Question 3 - Data Wrangling  

* We applied the `select()` function to obtain the required variables, namely adult_count, child_count, checkin_date and checkout_date. 

* We applied the `mutate()` function to create a new variable called travel_length. 

* travel_length represents the time difference between checkout_date and checkin_date in days. 

* To split the dataset into training and testing data when performing linear regression, we applied the `rowid_to_column()` function to add a unique identifier to each row in the tibble. 

## Research Question 3 - Statistical Analysis
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=3, fig.width=4, fig.align='left', results="hide"}
expedia_data_q3 <- expedia_data %>%  
  select(adult_count, child_count, checkin_date, checkout_date) %>%  
  filter(!is.na(checkin_date) & !is.na(checkout_date)) %>%  
  mutate(travel_length = as.numeric(checkout_date - checkin_date)) %>%  
  select(adult_count, child_count, travel_length) %>%  
  rowid_to_column()

set.seed(15) 
n <- nrow(expedia_data_q3)
training_indices <- sample(1:n, size=round(0.8 * n))
glimpse(expedia_data_q3)

train <- expedia_data_q3 %>% filter(rowid %in% training_indices)
y_train <- train$travel_length
test <- expedia_data_q3 %>% filter(!(rowid %in% training_indices)) 
y_test <- test$travel_length

mod1_train <- lm(travel_length ~ adult_count, data = train) 
mod2_train <- lm(travel_length ~ child_count, data = train) 
mod3_train <- lm(travel_length ~ adult_count + child_count, data = train) 
mod4_train <- lm(travel_length ~ adult_count * child_count, data = train) 

yhat_mod1_test <- predict(mod1_train, newdata = test) 
yhat_mod2_test <- predict(mod2_train, newdata = test) 
yhat_mod3_test <- predict(mod3_train, newdata = test) 
yhat_mod4_test <- predict(mod4_train, newdata = test) 

yhat_mod1_train <- predict(mod1_train, newdata = train)
yhat_mod2_train <- predict(mod2_train, newdata = train) 
yhat_mod3_train <- predict(mod3_train, newdata = train) 
yhat_mod4_train <- predict(mod4_train, newdata = train) 

mod1_test_RMSE <- sqrt(sum((y_test - yhat_mod1_test)^2 / nrow(test)))
mod2_test_RMSE <- sqrt(sum((y_test - yhat_mod2_test)^2 / nrow(test)))
mod3_test_RMSE <- sqrt(sum((y_test - yhat_mod3_test)^2 / nrow(test)))
mod4_test_RMSE <- sqrt(sum((y_test - yhat_mod4_test)^2 / nrow(test)))

mod1_train_RMSE <- sqrt(sum((y_train - yhat_mod1_train)^2 / nrow(train)))
mod2_train_RMSE <- sqrt(sum((y_train - yhat_mod2_train)^2 / nrow(train)))
mod3_train_RMSE <- sqrt(sum((y_train - yhat_mod3_train)^2 / nrow(train)))
mod4_train_RMSE <- sqrt(sum((y_train - yhat_mod4_train)^2 / nrow(train)))

model_comparison_table <- tibble(Model = c("1", "2", "3", "4"), 
                                 RMSE_testdata = c(mod1_test_RMSE, mod2_test_RMSE,
                                                   mod3_test_RMSE, mod4_test_RMSE),
                                 RMSE_traindata = c(mod1_train_RMSE, mod2_train_RMSE,
                                                    mod3_train_RMSE, mod4_train_RMSE),
                                 ratio_of_RMSEs = RMSE_testdata / RMSE_traindata)

# model_comparison_table
# summary(mod2_train)$r.squared
# summary(mod3_train)$r.squared
# summary(mod4_train)$r.squared
# summary(mod2_train)$coefficients
# summary(mod3_train)$coefficients
# summary(mod4_train)$coefficients

slope <- summary(mod1_train)$coefficients[2,1]
r_squared <- summary(mod1_train)$r.squared
p_value <- summary(mod1_train)$coefficients[2,4]
```

* We created 4 linear regression models which differ by number of predictors (e.g. 1 predictor versus 2 predictors) and whether predictor variables interact.

* By picking the "best" model based on prediction accuracy, low signs of overfitting and simplicity, we find that the "best" model is a simple linear regression model (a model with 1 predictor) that uses the number of adults to predict the length of a trip.

* The extremely small p-value of `r round(p_value, digits=6)` suggests there is very strong evidence against the null hypothesis that there is no relationship between the number of adults on a trip and the length of a trip. Thus, we reject the null hypothesis. 

* The calculated slope of `r round(slope, digits=4)` implies that for each additional adult, the length of trip increases by `r round(slope, digits=4)` days on average. 

* However, the $R^2$ value of `r round(r_squared, 4)` for the linear regression model implies only `r round(r_squared, 4) * 100`% of variability in the length of a trip is accounted by the number of adults on the trip.

## Research Question 3 - Visualization
```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.height=2.5, fig.width=4, fig.align='center'}
expedia_data_q3 %>% ggplot(aes(x=adult_count, y=travel_length)) + 
  geom_point() + 
  geom_smooth(se=FALSE, method="lm") + 
  labs(x="Number of adults",
       y="Length of travel (days)",
       title="How number of adults affects length of \ntravel") + 
  theme_minimal() + 
  scale_x_continuous(breaks=c(0,1,2,3,4,5,6,7,8,9,10)) + 
  scale_y_continuous(breaks=c(0,10,20,30))
```
This is a figure of the "best" linear regression model which shows the association between the number of adults on a trip and the length of travel.

## Limitations 

* We assume that data filled in by users is accurate. Inaccurate data entries (e.g. typos) may skew the statistical results. 

* We assume that the sample is representative of the population. Otherwise, statistical methods (especially bootstrapping) will not yield accurate results. 

* For research questions 1 and 2, the variables is_trans{n} and num_clicks{n} reflect user events like transaction or clicks within 180 minutes of a user search. So, statistical inferences for such events are limited to 180 minutes within a user search.

* For research question 2, transactions are a better metric of consumer interest compared to clicks received by a listing. However, only 9 properties are transacted out of the 3,000 properties. So, we used the number of clicks as the metric. 

* For research question 3, we intended to research how the number of adults, children and infants affect the length of travel. However, only 2 user searches had infants in their travel group. So, we removed the number of infants as a potential predictor. 

## Overall Conclusion - Looking Ahead Part 1
Here are some closing thoughts for each research question.

For **research question 1**, 

* The 95% confidence interval for the proportion of top 3 recommended listings transacted within 180 minutes is (`r confidence_interval[1]`, `r confidence_interval[2]`). 

* The bounds of the interval are low which implies that the recommendation algorithm has room for improvement in terms of increasing the proportion of users who make a transaction within 180 minutes of a user search.

* We recommend Expedia invest in research and development of recommendation algorithms as such algorithms have potential of directly improving the profitability of Expedia. 

## Overall Conclusion - Looking Ahead Part 2
For **research question 2**, 

* There is strong evidence against the null hypothesis that the mean number of clicks within 180 minutes of a user search for discounted and non-discounted top 3 properties is the same between June $1^{st}$, 2021 and July $31^{st}$, 2021. 

* We recommend Expedia perform A/B testing with 2 variations of the Expedia website which differ by the number of discounted listings.$^{[2]}$

* This allows Expedia to find out how user consumption behaviours change based on change in the number of discounted properties.

* Knowing this may help improve the recommendation algorithm in terms of recommending more or less discounted properties to suit user preferences.

* At the same time, understanding consumption behaviours may lead to the development of new marketing strategies in terms of discounting more or less properties to maximize profit.

## Overall Conclusion - Looking Ahead Part 3
For **research question 3**, 

* It was found that as the number of adults increases, the length of travel increases for Expedia searches made between June $1^{st}$, 2021 and July $31^{st}$, 2021.

* Expedia can consider the linear regression model when purchasing properties in bulk or advising hoteliers regarding the expected intervals of booking. 

* For instance, when booking larger properties (which can fit more people) for resale, Expedia should book these properties for longer period to reduce chances of them not being transacted. 

* However, the low $R^2$ value of `r round(r_squared, 4)` of the model implies that there are other variables (apart from number of adults) that explain the variability in travel length. 

* To get a more "complete" model, Expedia may choose to explore how other variables (including those that do not appear in the current dataset) affect the length of travel. 

## Citations 

1. https://www.nasdaq.com/articles/how-expedia-makes-most-its-money-2017-08-28

2. https://vwo.com/ab-testing/