A Kaggle competition to predict whether a user will download an app after clicking a mobile app ad.
You can download the data from the competition page.
I built an XGBoost model fitted on numerically encoded data. The model was trained on 55,000,000
records from the train dataset.
Training took between 25 and 45 minutes (it varies with data size) on a machine with 16 GB of RAM and 8 cores. The algorithm implementation is very robust even with very large datasets.
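
As a rough illustration, the training setup might look like the sketch below. It assumes the competition's `train.csv` with the usual TalkingData columns (`ip`, `app`, `device`, `os`, `channel`, `is_attributed`); the file path, row count, and parameter values are placeholders, not the exact configuration used here.

```python
import pandas as pd
import xgboost as xgb

# Assumed dtypes for the TalkingData columns; keeps memory usage low.
dtypes = {
    'ip': 'uint32', 'app': 'uint16', 'device': 'uint16',
    'os': 'uint16', 'channel': 'uint16', 'is_attributed': 'uint8',
}

# Load a slice of the training data (55M rows were used here).
train = pd.read_csv('train.csv', nrows=55_000_000, dtype=dtypes,
                    usecols=list(dtypes.keys()))

X = train.drop(columns='is_attributed')
y = train['is_attributed']

# Illustrative parameters only; the actual tuned values are not listed in this README.
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'eta': 0.1,
    'tree_method': 'hist',  # histogram-based training, fast on large data
}

dtrain = xgb.DMatrix(X, label=y)
model = xgb.train(params, dtrain, num_boost_round=200)
```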
After some extensive hyperparameter tuning, I got an AUC of 0.9638 on the public leaderboard.
The model predictions (my two top-scoring submissions):
- The submission file is available via this Google Drive link
A LightGBM model has many hyperparameters, and they need careful tuning; a good way to do that is grid search optimization, as in the sketch below.
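
A minimal grid-search sketch over a few LightGBM hyperparameters, using scikit-learn's `GridSearchCV`; the parameter grid and the data variables (`X`, `y`) are illustrative assumptions, not the grid actually used.

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; widen or narrow it based on available compute.
param_grid = {
    'num_leaves': [31, 63, 127],
    'learning_rate': [0.05, 0.1],
    'n_estimators': [200, 500],
}

search = GridSearchCV(
    estimator=LGBMClassifier(objective='binary'),
    param_grid=param_grid,
    scoring='roc_auc',  # the competition metric
    cv=3,
)
search.fit(X, y)  # X, y: numerically encoded features and the is_attributed target
print(search.best_params_, search.best_score_)
```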
According to the LightGBM documentation, to increase the prediction accuracy of the algorithm you can start by doing one of the following (a parameter sketch follows the list):
- Use a large `max_bin` (but it may be slower)
- Use a small `learning_rate` with a large `num_iterations`
- Use a large `num_leaves` (but it may cause over-fitting)
- Use bigger training data
- Try dart; you can choose the `boosting_type` of the algorithm from `gbdt`, `dart`, `rf`, or `goss`
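
For instance, these suggestions translate into a LightGBM parameter dictionary along these lines; the concrete values are placeholders chosen for illustration.

```python
import lightgbm as lgb

# Accuracy-oriented settings from the tips above; values are illustrative.
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'dart',  # alternatives: 'gbdt', 'rf', 'goss'
    'max_bin': 511,           # large max_bin: finer splits, slower training
    'learning_rate': 0.02,    # small learning_rate ...
    'num_leaves': 255,        # large num_leaves can over-fit; watch validation AUC
}

dtrain = lgb.Dataset(X, label=y)  # X, y as before
# ... paired with a large num_iterations
booster = lgb.train(params, dtrain, num_boost_round=2000)
```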
To Do:
- TalkingData provides a huge amount of data (200 million records); if you have a powerful enough machine, you would definitely want to train on the whole dataset.
- You can try to implement a deep learning model for such a huge dataset.
- Try to do more feature engineering and see if the results improve.
- Try downsampling, given the imbalanced ratio of fraud to non-fraud records (see the sketch after this list).
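
A minimal downsampling sketch, assuming the data sits in a pandas DataFrame `train` with the `is_attributed` target; it balances the classes by sampling the majority (non-download) class down to the minority count.

```python
import pandas as pd

# Split by class: is_attributed == 1 marks an app download (the rare class).
pos = train[train['is_attributed'] == 1]
neg = train[train['is_attributed'] == 0]

# Sample the majority class down to the minority class size.
neg_down = neg.sample(n=len(pos), random_state=42)

# Shuffle the balanced frame before training.
balanced = pd.concat([pos, neg_down]).sample(frac=1, random_state=42)
```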