Added details to quickstart documentation on missing r-squared value #473

Merged
merged 10 commits into from
Jun 13, 2018

Contributor

@mitevpi mitevpi commented Jun 9, 2018

This PR is in response to #406

Added more detail on where the value we use to decide between models comes from. Linked to the sklearn .score() documentation rather than diving into the details on the quickstart page.

Member

@bbengfort bbengfort left a comment

@mitevpi thank you for expanding that paragraph; it certainly adds more detail! Just a few typos and a quick suggestion. Thanks again!


Finally the residuals are colored by training and test set. This helps us identify errors in creating train and test splits. If the test error doesn't match the train error then our model is either overfit or underfit. Otherwise it could be an error in shuffling the dataset before creating the splits.

- Because our coefficient of determination for this model is 0.328, let's see if we can fit a better model using *regularization*, and explore another visualizer at the same time.
+ Along with generating the residuals plot, we also measured the peformance, or "scored" our model on the test data above: ``visualizer.score(X_test, y_test)``. Because we used a Linear Regression model, the `scoring consists of finding the R-squared value of the data <http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score>`_, which is a statistical measure of how close the data are to the fitted regression line. The R-squared value of any model may vary slightly between prediction/test runs, however it should generally be comparable. In our case, the R-squared value for this model was only 0.328, suggestion that linear correlation may not be the most appropriate to use for fitting this data. Let's see if we can fit a better model using *regularization*, and explore another visualizer at the same time.
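
For readers following the paragraph above: the value a linear regression's `score` reports is the coefficient of determination, R² = 1 - SS_res / SS_tot. A minimal pure-Python sketch of that formula follows. The helper name `r_squared` and the toy values are hypothetical illustrations, not part of Yellowbrick or scikit-learn, and the sketch assumes the true values are not all identical (otherwise SS_tot is zero).

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Hypothetical helper for illustration only; assumes y_true is not constant.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean of y_true.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (not the quickstart's dataset).
y_test = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]    # predictions match exactly
mean_only = [6.0, 6.0, 6.0, 6.0]  # always predicts the mean of y_test

print(r_squared(y_test, perfect))    # 1.0
print(r_squared(y_test, mean_only))  # 0.0
```

Read this way, a score of 0.328 means the model removes roughly a third of the squared error that a constant mean prediction would leave on the test set.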
Member

minor typos:

- peformance --> performance
- Linear Regression --> `LinearRegression` or "linear regression"
- was only 0.328, suggestion --> suggesting

Also, how about: "we also measured the performance by 'scoring' our model on the test data, e.g. the code snippet `visualizer.score(X_test, y_test)`"?

Contributor Author

Thanks for the review! I'll make the changes and re-commit.

Member

@bbengfort bbengfort left a comment

Thank you again @mitevpi for working on the documentation - looks great!

@bbengfort bbengfort merged commit 90355a3 into DistrictDataLabs:develop Jun 13, 2018
@mitevpi mitevpi deleted the quickstart-documentation branch June 13, 2018 20:55
2 participants