---
title: "NBA Game Outcome Prediction"
author: "Darsh Chaurasia"
date: "2024-09-29"
output:
  html_document: default
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Description
In this project, I predict the outcome of NBA games with machine learning models built on key game statistics. The dataset includes points scored, field goal percentages, assists, rebounds, and whether the home team won or lost. I begin by exploring and preprocessing the data: handling missing values, creating new features such as the point difference, and converting categorical variables into dummy variables for modeling. Exploratory data analysis, using visualizations like histograms and heatmaps, then shows how the statistics relate to game outcomes. Next, I train a Random Forest model to predict whether the home team wins, using features like points, assists, and rebounds, and evaluate it with metrics such as accuracy and precision. I conclude by discussing how statistics like field goal percentage and point difference relate to game outcomes and by suggesting improvements for future work.
# Importing Libraries
For this project, I will use several important R libraries. I will rely on **dplyr** and **tidyr** for efficient data manipulation, allowing me to clean and transform the dataset by handling missing values, creating new features, and converting categorical variables. To load the dataset, I will use **readr**, which will help me easily import the data into R. For visualizations, I will utilize **ggplot2** to create plots like histograms and bar charts, and **corrplot**/**ggcorrplot** to visualize correlation matrices in a clear and informative way. For building the machine learning model, I will choose **caret**, which simplifies model training, data splitting, and evaluation. I will use **randomForest** to build the predictive model itself, as it's a robust and popular method for classification tasks. Finally, I will employ **pROC** to evaluate the model's performance, generating ROC curves and calculating metrics like AUC to assess prediction accuracy.
```{r, message = FALSE}
# Data manipulation and cleaning
library(dplyr)
library(tidyr)
library(readr)
# Data visualization
library(ggplot2)
library(corrplot)
library(ggcorrplot)
# Machine learning and modeling
library(caret)
library(randomForest)
# Performance evaluation
library(pROC)
```
# Importing the data
```{r, message = FALSE}
nba_data <- read_csv("nba.csv")
```
# View the first few rows of the dataset
```{r}
head(nba_data)
```
# Quick Overview of the Data
## Summary of the dataset
```{r}
summary(nba_data)
```
\vspace{3cm}
## Visualizing the Distribution of Points Scored by the Home Team
```{r}
# Histogram of points scored by the home team, with a density overlay
ggplot(nba_data, aes(x = pts_home)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, fill = "steelblue",
                 color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Distribution of Home Team Points",
       x = "Home Team Points",
       y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 15),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12)) +
  # Dashed line and label at the mean (na.rm guards against missing scores)
  geom_vline(aes(xintercept = mean(pts_home, na.rm = TRUE)), color = "blue",
             linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean(nba_data$pts_home, na.rm = TRUE), y = 0.01, label = "Mean",
           color = "blue", angle = 90, vjust = -0.5)
```
# Pre-Processing the Data
This section handles missing values, creates new features, and converts categorical variables.
## Handling Missing Values
```{r}
# Count missing values across the whole dataset
sum(is.na(nba_data))
# Impute missing numeric values with the column median
nba_data <- nba_data %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))
```
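To make the imputation rule concrete, the sketch below applies the same median replacement to a small made-up data frame. The name `toy_games`, the helper `impute_median`, and all values are illustrative assumptions, not taken from `nba.csv`; only base R is used so the example runs on its own.

```r
# Toy data frame with one missing value per numeric column (values are made up)
toy_games <- data.frame(
  pts_home = c(100, NA, 110, 120),
  ast_home = c(20, 25, NA, 30),
  team_home = c("A", "B", "C", "D")
)

# Hypothetical helper: replace NA in a numeric vector with the observed median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the rule to numeric columns only; the team column is left untouched
numeric_cols <- sapply(toy_games, is.numeric)
toy_games[numeric_cols] <- lapply(toy_games[numeric_cols], impute_median)

toy_games
```

The median is less sensitive to unusually high-scoring or low-scoring games than the mean, which is one reason it is a common default for this kind of imputation.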
## Creating New Features
Create a new feature representing the point difference between the home and away teams.
```{r}
# Create a new feature: Point Difference
nba_data <- nba_data %>%
  mutate(PointDifference = pts_home - pts_away)
```
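As a small illustration of what this feature carries, the sketch below builds the point difference on made-up scores and derives the outcome from its sign. It assumes that a home win simply means the home team outscored the visitors; `toy_scores` and its values are hypothetical.

```r
# Made-up final scores for three games (not from nba.csv)
toy_scores <- data.frame(
  pts_home = c(110, 98, 105),
  pts_away = c(102, 101, 99)
)

# Same definition as the PointDifference feature above
toy_scores$PointDifference <- toy_scores$pts_home - toy_scores$pts_away

# Assumption: a home win means the home team scored more points
toy_scores$home_team_win <- as.integer(toy_scores$PointDifference > 0)

toy_scores
```

Under that assumption the sign of the difference already encodes the result, so the feature is best read as a post-game descriptor of the outcome.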
## Converting Categorical Variables to Dummy Variables
Convert the home and away team names into dummy variables for modeling.
```{r}
# Convert categorical variables into factors
nba_data$team_home <- as.factor(nba_data$team_home)
nba_data$team_away <- as.factor(nba_data$team_away)
# Use dummy encoding for team names
nba_data_encoded <- model.matrix(~ team_home + team_away + 0, data = nba_data) %>%
  as.data.frame()
# Combine the dummy variables back with the original dataset
# (excluding the original team columns)
nba_data <- cbind(nba_data_encoded, nba_data %>% select(-team_home, -team_away))
```
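The column layout produced by `model.matrix()` is easier to see on a tiny example. The sketch below uses three made-up team codes (`toy_matchups` is hypothetical) and the same formula as the chunk above.

```r
# Made-up matchups with three teams (not from nba.csv)
toy_matchups <- data.frame(
  team_home = factor(c("BOS", "LAL", "MIA", "BOS")),
  team_away = factor(c("LAL", "MIA", "BOS", "MIA"))
)

# "+ 0" drops the intercept, so the first factor keeps one column per level;
# the second factor still drops its first level to avoid collinearity
toy_encoded <- as.data.frame(
  model.matrix(~ team_home + team_away + 0, data = toy_matchups)
)

colnames(toy_encoded)
toy_encoded
```

Each row has exactly one `team_home*` column set to 1, while a row whose away team is the dropped reference level (here `BOS`) has zeros in every `team_away*` column.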
\vspace{3cm}
# Exploratory Data Analysis (EDA)
Exploratory analysis to understand the relationships between different game statistics and the outcome.
## Distribution of Game Outcomes
```{r}
# Visualize the distribution of game outcomes (win/loss)
ggplot(nba_data, aes(x = factor(home_team_win), fill = factor(home_team_win))) +
  geom_bar() +
  scale_fill_manual(values = c("0" = "red", "1" = "blue"),
                    labels = c("0" = "Loss", "1" = "Win")) +
  labs(title = "Distribution of Home Team Wins and Losses",
       x = "Home Team Outcome (0 = Loss, 1 = Win)",
       y = "Count",
       fill = "Outcome") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 15),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        legend.position = "right")
```
## Correlation Analysis
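- $r$ is the Pearson correlation coefficient
- $x_i$ and $y_i$ are the paired observations of the two variables, with means $\bar{x}$ and $\bar{y}$
- $n$ is the number of data points
- the numerator is the sum of the products of each pair's deviations from the means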
\vspace{3cm}
$$
r = \frac{ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) }
{ \sqrt{ \sum_{i=1}^{n} (x_i - \bar{x})^2 } \cdot \sqrt{ \sum_{i=1}^{n} (y_i - \bar{y})^2 } }
$$
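The formula above can be checked numerically. The sketch below computes $r$ term by term on made-up values (`x_toy` and `y_toy` are illustrative, not from the dataset) and compares the result with R's built-in `cor()`.

```r
# Made-up paired observations (not from nba.csv)
x_toy <- c(100, 105, 110, 120, 98)
y_toy <- c(22, 24, 27, 30, 20)

# Deviations from the means
dx <- x_toy - mean(x_toy)
dy <- y_toy - mean(y_toy)

# Numerator: sum of products of deviations
# Denominator: product of the square roots of the sums of squared deviations
r_manual <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

r_manual
all.equal(r_manual, cor(x_toy, y_toy))  # compare with the built-in Pearson correlation
```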
\newpage
```{r}
# Correlation calculation in R
# Select numerical columns from the dataset
nba_num_data <- nba_data %>%
  select(pts_home, pts_away, PointDifference, ast_home, reb_home, ast_away, reb_away)
# Calculate correlation matrix
cor_matrix <- cor(nba_num_data)
# View the correlation matrix
cor_matrix
```
```{r}
# Correlation heatmap of cor_matrix, computed in the previous chunk
# (ggcorrplot is already loaded in the library chunk)
ggcorrplot(cor_matrix,
           method = "square",                   # Use squares to represent the correlation
           type = "lower",                      # Display only the lower triangle of the matrix
           lab = TRUE,                          # Show correlation coefficients
           lab_size = 4,
           colors = c("red", "white", "blue"),  # Low, mid, and high ends of the gradient
           title = "Correlation Heatmap of Game Statistics",
           ggtheme = theme_minimal())
```
# Model Creation
## Selecting a Machine Learning Algorithm
I will use a Random Forest model to predict whether the home team will win.
```{r, warning=FALSE, message=FALSE}
# Define the response variable and features
# (randomForest, caret, and dplyr are already loaded in the library chunk)
response <- nba_data$home_team_win
features <- nba_data %>% select(pts_home, pts_away, fg_pct_home, ast_home, reb_home)
# Split the data into training and testing sets (80% training, 20% testing)
set.seed(123)
train_index <- createDataPartition(response, p = 0.8, list = FALSE)
train_data <- features[train_index, ]
train_labels <- response[train_index]
test_data <- features[-train_index, ]
test_labels <- response[-train_index]
# Train a Random Forest model. With a numeric 0/1 response, randomForest fits a
# regression forest, so its predictions are scores between 0 and 1
model_rf <- randomForest(x = train_data, y = train_labels)
```
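As one possible variation, not the setup used in this report, the sketch below fits a classification forest by passing a factor response, which lets `predict()` return class probabilities and `importance()` report a per-feature score. All data are simulated; `sim_games`, `sim_win`, and the assumed link between shooting percentage and winning are illustrative only.

```r
library(randomForest)

set.seed(123)

# Simulated games (made up): higher shooting percentage makes a home win more likely
n_games <- 200
sim_games <- data.frame(
  fg_pct_home = runif(n_games, 0.35, 0.55),
  ast_home = rpois(n_games, 24),
  reb_home = rpois(n_games, 44)
)
win_prob <- plogis(25 * (sim_games$fg_pct_home - 0.45))
sim_win <- factor(rbinom(n_games, 1, win_prob), levels = c(0, 1))

# Simple 80/20 split by random row index
train_rows <- sample(n_games, size = 0.8 * n_games)

# A factor response makes randomForest fit a classification forest
sim_rf <- randomForest(x = sim_games[train_rows, ], y = sim_win[train_rows], ntree = 200)

# Class probabilities for the held-out rows: one column per class level
sim_prob <- predict(sim_rf, sim_games[-train_rows, ], type = "prob")
head(sim_prob)

# Mean decrease in Gini impurity per feature
importance(sim_rf)
```

Probabilities from `type = "prob"` can be thresholded or passed directly to an ROC analysis, and the importance table gives a rough ranking of which inputs the forest relied on.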
## Applying Model to Test Data
```{r}
# Predict on the test data
predictions_rf <- predict(model_rf, test_data)
# View predictions
head(predictions_rf)
```
# Model Results
## Confusion Matrix and Accuracy
```{r}
# Predict on the test data (numeric scores between 0 and 1)
predictions_rf_prob <- predict(model_rf, test_data)
# Convert scores to binary class labels (using 0.5 as threshold)
predictions_rf <- ifelse(predictions_rf_prob > 0.5, 1, 0)
# Ensure that both the predictions and test labels are factors with the same levels
test_labels <- factor(test_labels, levels = c(0, 1))
predictions_rf <- factor(predictions_rf, levels = c(0, 1))
# Create a confusion matrix, treating a home win (1) as the positive class;
# without `positive`, caret uses the first level (0), so precision and recall
# would describe losses instead of wins
conf_matrix <- confusionMatrix(predictions_rf, test_labels, positive = "1")
# Extract accuracy, precision, and recall
accuracy <- conf_matrix$overall['Accuracy']
precision <- conf_matrix$byClass['Pos Pred Value']
recall <- conf_matrix$byClass['Sensitivity']
# Print accuracy, precision, and recall
print(accuracy)
print(precision)
print(recall)
```
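For reference, the three metrics can be computed by hand from the four cells of a confusion matrix. The sketch below does so on made-up labels (`actual_toy` and `predicted_toy` are illustrative), treating a home win as the positive class.

```r
# Made-up labels and predictions (1 = home win, 0 = home loss)
actual_toy    <- c(1, 1, 1, 0, 0, 1, 0, 1, 0, 0)
predicted_toy <- c(1, 1, 0, 0, 1, 1, 0, 1, 0, 0)

# Cell counts, with a home win (1) as the positive class
tp <- sum(predicted_toy == 1 & actual_toy == 1)  # wins predicted as wins
fp <- sum(predicted_toy == 1 & actual_toy == 0)  # losses predicted as wins
fn <- sum(predicted_toy == 0 & actual_toy == 1)  # wins predicted as losses
tn <- sum(predicted_toy == 0 & actual_toy == 0)  # losses predicted as losses

accuracy_toy  <- (tp + tn) / length(actual_toy)  # share of correct predictions
precision_toy <- tp / (tp + fp)                  # share of predicted wins that were wins
recall_toy    <- tp / (tp + fn)                  # share of actual wins that were found

c(accuracy = accuracy_toy, precision = precision_toy, recall = recall_toy)
```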
## ROC Curve
```{r,message=FALSE, warning=FALSE}
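library(pROC)
# ROC analysis needs the continuous scores, not the thresholded 0/1 labels;
# thresholded labels would reduce the curve to a single operating point
scores_rf <- predict(model_rf, test_data)
# Compute the ROC curve and AUC
roc_curve <- roc(test_labels, as.numeric(scores_rf))
plot(roc_curve, main = "ROC Curve")
auc(roc_curve)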
```
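AUC has a direct interpretation: it is the probability that a randomly chosen win receives a higher score than a randomly chosen loss, with ties counted as half. The sketch below computes it that way on made-up scores (`labels_toy` and `scores_toy` are illustrative) using only base R.

```r
# Made-up labels (1 = home win) and continuous model scores
labels_toy <- c(1, 1, 0, 1, 0, 0)
scores_toy <- c(0.9, 0.7, 0.6, 0.4, 0.3, 0.1)

pos_scores <- scores_toy[labels_toy == 1]
neg_scores <- scores_toy[labels_toy == 0]

# Compare every positive score with every negative score
pairwise_gt <- outer(pos_scores, neg_scores, ">")
pairwise_eq <- outer(pos_scores, neg_scores, "==")

# AUC: share of pairs ranked correctly, with ties counted as half
n_pairs <- length(pos_scores) * length(neg_scores)
auc_toy <- (sum(pairwise_gt) + 0.5 * sum(pairwise_eq)) / n_pairs

auc_toy
```

Here eight of the nine win/loss pairs are ranked correctly, so the AUC is 8/9.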
# Conclusion
In conclusion, statistics such as field goal percentage and point difference were closely associated with whether the home team wins. The Random Forest model reached an accuracy of `r round(accuracy, 3)` on the held-out test set, with reasonable precision and recall.
# Limitations
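While the model performed well, there is room for improvement. One limitation is that the features include the final scores (`pts_home` and `pts_away`), which by themselves determine the winner, so the reported accuracy reflects post-game information rather than a pre-game forecast. Another is that the model doesn't account for additional basketball metrics like turnovers or fouls; incorporating these statistics could improve the model's predictive power.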