This project shows how to use Spark MLlib and build an ML Dashboard to:
- Expose our ML model via an endpoint so that users can play with and tweak the model parameters
- Quickly test the model with the new parameters using sample test data.
Logistic regressions, decision trees, SVMs, neural networks, etc. have a set of structural choices that one must make before actually fitting the model parameters. For example, within the logistic regression family, you can build separate models using either L1 or L2 regularization penalties. Within the decision tree family, you can have different models with different structural choices such as the depth of the tree, pruning thresholds, or even the splitting criteria. These structural choices are called hyperparameters.
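In Spark ML, these structural choices are set on the estimator before `fit()` is called. The snippet below is only an illustration; the regParam and maxDepth values are arbitrary:

```scala
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}

// Logistic regression family: elasticNetParam = 1.0 selects an L1 penalty, 0.0 selects L2
val lrWithL1 = new LogisticRegression().setRegParam(0.01).setElasticNetParam(1.0)
val lrWithL2 = new LogisticRegression().setRegParam(0.01).setElasticNetParam(0.0)

// Decision tree family: depth and splitting criterion are chosen before fitting
val dt = new DecisionTreeClassifier().setMaxDepth(5).setImpurity("entropy")
```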
Traditionally, for a data scientist, building a classification model is an iterative process of coming up with a model, tweaking its hyperparameters and testing it using test data. If the results do not match expectations, this can lead to yet another iteration of tweaking the model and evaluating it. In this project, I am proposing a solution to shorten the data scientist's iteration turnaround time and improve their efficiency:
- Load your trained model using Spark ML
- Have a dashboard where hyperparameters such as regularization parameters and thresholds are exposed for the user to tweak
- Quickly test the model with the new parameters on a test dataset (a minimal sketch of the load-and-test steps follows this list)
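Here is a minimal sketch of the load-and-test steps, assuming the trained pipeline was saved as a `PipelineModel` with Spark ML's persistence API; the paths, the JSON test-data format and the `text` column name are placeholders, not the project's actual layout:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ml-dashboard").getOrCreate()

// Reload the persisted pipeline (path is a placeholder)
val model = PipelineModel.load("trained_model")

// Score a test dataset; the pipeline is assumed to expect a "text" column
val testDf = spark.read.json("data/test.json")
val predictions = model.transform(testDf)
predictions.select("prediction").show()
```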
Model used for testing in this project: For demo purposes, I've implemented a model using Spark 2.1 ML to classify news documents into the Science or NonScience category. I've done this using K-Fold CrossValidation on an ML Pipeline. Further details on the trained model can be found here.
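The training code itself is documented separately, but the general shape of a K-Fold cross-validated pipeline in Spark 2.1 ML looks roughly like the sketch below; the column names, grid values and number of folds are assumptions, not the project's exact choices:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Text -> tokens -> term-frequency features -> binary classifier
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Grid of hyperparameters to search over during cross-validation
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1000, 10000))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

// val cvModel = cv.fit(trainingDf)   // trainingDf: DataFrame with "text" and "label" columns
```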
Dashboard Inputs submitted by user:
- Model params: As you can see in the above demo, I have exposed the following four parameters of this model for the user to play with and test:
- LogisticRegression - Threshold
- LogisticRegression - RegularizationParam
- LogisticRegression - Max Iterations
- HashingTF - Number of Features
- Test Data to evaluate: Folder containing the documents to test
Initial values of the model params displayed in the dashboard: These params are initialised with the respective default values that the model was trained with.
Dashboard Output: Table with 2 columns, DocumentName and ClassificationResult (whether or not the document is a science document). A sketch of building this table is shown below.
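A rough sketch of how the back end could turn the submitted test folder into that two-column table, assuming `model` is the (re-fitted) pipeline from the sketches above, that its first stage reads a `text` column, and that prediction 1.0 corresponds to the Science label; the folder path and label mapping are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Read every file in the user-supplied folder as one document: (file path, contents)
val testDocs = spark.sparkContext
  .wholeTextFiles("path/to/test/folder")
  .toDF("DocumentName", "text")

// Score the documents and map the numeric prediction to a readable label
val predictions = model.transform(testDocs)
val result = predictions.select(
  $"DocumentName",
  when($"prediction" === 1.0, "Science").otherwise("NonScience").as("ClassificationResult"))

result.show(truncate = false)
```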
Let's start tweaking the parameters to verify that the dashboard works.
- mvn clean install
- spark-submit --class com.spoddutur.MainClass <PATH_TO_20news-bydate.jar_FILE>
- data: Contains training and test news data taken from scikit-learn.
- predictions.json: Final output of our trained model predictions on test data.
- trained_model: Final model we trained (see the persistence sketch after this list)
- src/main/scala/com/spoddutur/MainApp.scala: Main class of this project.
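For reference, artifacts like `trained_model` and `predictions.json` can be produced with Spark's persistence APIs. The exact calls used by this project may differ, so treat this as a sketch; `cvModel` and `result` refer to the values from the earlier sketches:

```scala
// Persist the fitted cross-validated model so the dashboard can reload it later
cvModel.write.overwrite().save("trained_model")

// Persist the test-data predictions as JSON
result.write.mode("overwrite").json("predictions.json")
```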
- Spark 2.1 and Spark ML
- Scala 2.11
This project should be a good starting point for building an ML Dashboard where you can plug in your own models and quickly verify how they classify corner-case test data.