README

CSE 5243 - Final Project
Zhuoer Wang, Zhe Huang

Data
Our data can be retrieved here: https://www.kaggle.com/c/expedia-hotel-recommendations/data

FinalReport.pdf: Final group report and individual report

Directory
The "code" folder contains all the code we wrote:
	RFpredictor.py: the final model we used for the prediction
	downSize.py: downsize the data
	booking.py: retrieve only the booking records in the training set
	ca.py: format the prediction output generated by R for Kaggle submission
	EDAforWhole.R: perform EDA on the whole training set
	EDAforSample.R: perform EDA on the sample set
	Preparedata.R: handle missing values and remove unused features
	DT.R: 5-fold cross-validation of the DT model
	KNN.R: 5-fold cross-validation of the KNN model
	NB.R: 5-fold cross-validation of the Naïve Bayes model
	NN.R: 5-fold cross-validation of the Neural Network model
	RandomForest.R: 5-fold cross-validation of the RandomForest model
	SVM.R: 5-fold cross-validation of the SVM model
	Others.R: code used for PCA and basic parameter tuning using R packages
	Outputgeneration.R: generate the prediction output for the test set
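
To illustrate the kind of filtering booking.py performs, here is a minimal sketch that keeps only booking rows. It is not the actual booking.py: the function name and file paths are hypothetical, and it assumes the training file has an is_booking column whose value is 1 for bookings.

```python
import csv


def filter_bookings(in_path, out_path, flag_column="is_booking"):
    """Copy only the rows whose flag column equals "1"; return how many were kept.

    The column name is an assumption about the Kaggle training file.
    """
    kept = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row[flag_column] == "1":
                writer.writerow(row)
                kept += 1
    return kept
```

Reading row by row keeps memory use low, which may matter because the full training file is large.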

The "predictions" folder contains all the predictions we made at the different stages of developing our random forest model. You can submit them directly to Kaggle at https://www.kaggle.com/c/expedia-hotel-recommendations/submissions/attach for MAP@5 evaluation.
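
For reference, MAP@5 can also be estimated locally. The sketch below is not part of our code; the function names are hypothetical, and it assumes each record has exactly one true hotel cluster, so the average precision of a record is 1/rank of the first correct guess among the top 5 (0 if it is missing).

```python
def apk(actual, predicted, k=5):
    """Average precision at k for one record with a single true label."""
    for rank, label in enumerate(predicted[:k], start=1):
        if label == actual:
            return 1.0 / rank
    return 0.0


def mapk(actuals, predictions, k=5):
    """Mean of apk over all records; returns 0.0 for an empty input."""
    if not actuals:
        return 0.0
    total = sum(apk(a, p, k) for a, p in zip(actuals, predictions))
    return total / len(actuals)
```

For example, a correct first guess for one record and a correct second guess for another gives (1 + 0.5) / 2 = 0.75.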

Run
All of the Python libraries we use can be installed with pip install (lib_name)
Run RFpredictor.py: python3 RFpredictor.py
- Set training set path: line 10
- Set testing set path: line 40
- Cross-validation on the training set: uncomment lines 36 and 37
- Predictions are written to "result.csv" in the default directory
Run the other Python scripts: python2 (name).py

DT.R, KNN.R, NB.R, NN.R, RandomForest.R, and SVM.R are the complete R code for building each model and performing 5-fold cross-validation.
Load the data, then run Preparedata.R to perform the basic data manipulation shared by all 6 models. After that, each model can be run; install the required packages before running each model.
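
As a rough illustration of the 5-fold procedure (written in Python rather than R, and not taken from our scripts), the sketch below splits row indices into 5 folds and yields one train/test pair per fold. The function name, the seed, and the round-robin fold assignment are assumptions made for this example.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    folds = [indices[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Each row appears in exactly one test fold, so averaging the 5 fold scores uses every row once for evaluation.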
Others.R includes code that was used for parameter testing and principal component analysis, and may be reused in the future. Note that this code can be applied to different models with minor adjustments, so it is included only once. Other parameter testing was done by manually trying different input values and is not included in the file.
Outputgeneration.R can be used to write the test results to a CSV file.
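
A minimal sketch of writing predictions in a Kaggle-style submission layout is shown below. It is not our ca.py or Outputgeneration.R: the function name is hypothetical, and the layout (an "id,hotel_cluster" header with up to 5 space-separated clusters per row) is our assumption about the competition's sample submission, so check it against the file provided on Kaggle.

```python
import csv


def write_submission(predictions, out_path, k=5):
    """Write (row_id, ranked_clusters) pairs as a two-column CSV.

    ranked_clusters is a list of cluster ids, best guess first; only the
    first k are written, joined by single spaces.
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "hotel_cluster"])  # assumed header
        for row_id, clusters in predictions:
            writer.writerow([row_id, " ".join(str(c) for c in clusters[:k])])
```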