Added details to quickstart documentation on missing r-squared value #473
Conversation
Syncing from original
Syncing latest from DistrictDataLabs head.
@mitevpi thank you for expanding that paragraph; it certainly adds more detail! Just a few typos and a quick suggestion. Thanks again!
docs/quickstart.rst
Outdated
Finally the residuals are colored by training and test set. This helps us identify errors in creating train and test splits. If the test error doesn't match the train error then our model is either overfit or underfit. Otherwise it could be an error in shuffling the dataset before creating the splits.

- Because our coefficient of determination for this model is 0.328, let's see if we can fit a better model using *regularization*, and explore another visualizer at the same time.
+ Along with generating the residuals plot, we also measured the peformance, or "scored" our model on the test data above: ``visualizer.score(X_test, y_test)``. Because we used a Linear Regression model, the `scoring consists of finding the R-squared value of the data <http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score>`_, which is a statistical measure of how close the data are to the fitted regression line. The R-squared value of any model may vary slightly between prediction/test runs, however it should generally be comparable. In our case, the R-squared value for this model was only 0.328, suggestion that linear correlation may not be the most appropriate to use for fitting this data. Let's see if we can fit a better model using *regularization*, and explore another visualizer at the same time.
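The scoring step this paragraph describes can be sketched with plain scikit-learn. This is a minimal illustration on synthetic data (the dataset, coefficients, and seed below are made up for the example), so it will not reproduce the 0.328 value from the quickstart's own dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the quickstart's dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# LinearRegression.score returns the coefficient of determination
# (R-squared) on the held-out data; it varies slightly between splits.
r_squared = model.score(X_test, y_test)
print(r_squared)
```

Yellowbrick's ``visualizer.score(X_test, y_test)`` delegates to the wrapped estimator's ``score`` method in the same way, which is why the residuals plot and the R-squared value come from the same call.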
minor typos:

- peformance --> performance
- Linear Regression --> LinearRegression or "linear regression"
- was only 0.328, suggestion --> was only 0.328, suggesting

also, how about "we also measured the performance by "scoring" our model on the test data, e.g. the code snippet ``visualizer.score(X_test, y_test)``."?
Thanks for the review! I'll make the changes and re-commit.
Thank you again @mitevpi for working on the documentation - looks great!
This PR is in response to #406.
Added more details and an explanation of where the R-squared value comes from, since that value is what we use to decide which model to try next. Linked to the source sklearn ``.score()`` documentation rather than diving into details on the quickstart page.