Releases: EducationalTestingService/skll

SKLL 2.0

24 Oct 16:58
6472139

This is a major new release, probably the largest since SKLL 1.0 came out! It includes dozens of new features, bugfixes, and documentation updates!

⚡️ SKLL 2.0 is backwards incompatible with previous versions of SKLL and might yield different results compared to previous versions even with the same data and same settings. ⚡️

💥 Incompatible Changes 💥

  • Python 2.7 is no longer supported since the underlying version of scikit-learn no longer supports it (Issue #497, PR #506).

  • The configuration field objective has been deprecated and replaced with objectives, which allows specifying multiple tuning objectives for grid search (Issue #381, PR #458).

  • Grid search is now enabled by default both in the API and when using a configuration file (Issue #463, PR #465).

  • The Predictor class previously provided by the generate_predictions utility script is no longer available. If you were relying on this class, you should just load the model file and call Learner.predict() instead (Issue #562, PR #566).

  • There are no longer any default grid search objectives since the choice of objective is best left to the user. Note that since grid search is enabled by default, you must either choose an objective or explicitly disable grid search (Issue #381, PR #458).

  • mean_squared_error is no longer supported as a metric. Use neg_mean_squared_error instead (Issue #382, PR #470).

  • The cv_folds_file configuration file field is now just called folds_file (Issue #382, PR #470).

  • Running an experiment with the learning_curve task now requires specifying metrics in the Output section instead of objectives in the Tuning section (Issue #382, PR #470).

  • Previously, when reading in CSV/TSV files, missing data was automatically imputed as zeros. This is not appropriate in all cases, so it is no longer done and blanks are retained as is. Missing values will need to be explicitly dropped or replaced (see below) before using the file with SKLL (Issue #364, PRs #475 & #518).

  • pandas and seaborn are now direct dependencies of SKLL, and not optional (Issues #455 & #364, PRs #475 & #508).
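
The field and metric renames listed above can be applied to an existing configuration file mechanically. The following is a minimal sketch using only the Python standard library; it is not part of SKLL. The helper migrate_config, the sample configuration, and its paths are hypothetical, and the sketch assumes that the old objective field held a single bare metric name and that objectives takes a list.

```python
import configparser
import io

# Hypothetical pre-2.0 configuration; names, paths, and values are made up.
OLD_CONFIG = """
[General]
experiment_name = demo
task = cross_validate

[Input]
train_directory = train
featuresets = [["demo_features"]]
learners = ["Ridge"]
cv_folds_file = folds.csv

[Tuning]
grid_search = true
objective = mean_squared_error

[Output]
results = output
"""

# Renames taken from the list of incompatible changes above.
FIELD_RENAMES = {"objective": "objectives", "cv_folds_file": "folds_file"}
METRIC_RENAMES = {"mean_squared_error": "neg_mean_squared_error"}


def migrate_config(text):
    """Return the configuration text with the 2.0 field names applied."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    for section in parser.sections():
        for old_name, new_name in FIELD_RENAMES.items():
            if not parser.has_option(section, old_name):
                continue
            value = parser.get(section, old_name)
            parser.remove_option(section, old_name)
            if old_name == "objective":
                # Assumption: `objectives` takes a list, so wrap the
                # single metric and apply the metric rename if needed.
                value = "['{}']".format(METRIC_RENAMES.get(value, value))
            parser.set(section, new_name, value)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


new_config = configparser.ConfigParser()
new_config.read_string(migrate_config(OLD_CONFIG))
print(new_config.get("Tuning", "objectives"))
print(new_config.get("Input", "folds_file"))
```

Since grid search is now on by default and has no default objective, a migrated file that never set objective would still need either an explicit objectives entry or grid search turned off.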

💡 New features 💡

  • CSVReader/CSVWriter & TSVReader/TSVWriter now use pandas as the backend rather than custom code that relied on the csv module. This leads to significant speedups, especially for very large files (~5x for reading and ~10x for writing)! The speedup comes at the cost of a moderate increase in memory consumption. See detailed benchmarks here (Issue #364, PRs #475 & #518).

  • SKLL models now have a new pipeline attribute which makes it easy to manipulate and use them in scikit-learn, if needed (Issue #451, PR #474).

  • scikit-learn updated to 0.21.3 (Issue #457, PR #559).

  • The SKLL conda package is now a generic Python package which means the same package works on all platforms and on all Python versions >= 3.6. This package is hosted on the new, public ETS anaconda channel.

  • SKLL learner hyperparameters have been updated to match the new scikit-learn defaults and those upcoming in 0.22.0 (Issue #438, PR #533).

  • Intermediate results for the grid search process are now available in the results.json files (Issue #431, #471).

  • The K models trained for each split of a K-fold cross-validation experiment can now be saved to disk (Issue #501, PR #505).

  • Missing values in CSV/TSV files can be dropped/replaced both via the command line and the API (Issue #540, PR #542).

  • Warnings from scikit-learn are now captured in SKLL log files (Issue #441, PR #480).

  • Learner.model_params() and, consequently, the print_model_weights utility script now work with models trained on hashed features (Issue #444, PR #466).

  • The print_model_weights utility script can now output feature weights sorted by class labels to improve readability (Issue #442, PR #468).

  • The skll_convert utility script can now convert feature files that do not contain labels (Issue #426, PR #453).
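
Because blanks in CSV/TSV files are no longer imputed as zeros, they have to be dropped or replaced explicitly. The sketch below shows the two choices on a tiny made-up file using only the standard library csv module. It does not use SKLL's own command-line options or API arguments, and the helper names drop_blanks and replace_blanks are hypothetical. With pandas, the same two operations correspond to DataFrame.dropna() and DataFrame.fillna().

```python
import csv
import io

# Made-up feature file with two blank cells.
RAW = "id,f1,f2,y\nEX1,1.0,,a\nEX2,,2.5,b\nEX3,3.0,4.0,a\n"


def read_rows(text):
    """Parse CSV text into a list of dicts, leaving blanks as empty strings."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_blanks(rows):
    """Keep only the rows in which every cell has a value."""
    return [row for row in rows if all(value != "" for value in row.values())]


def replace_blanks(rows, fill):
    """Return copies of the rows with each blank cell replaced by `fill`."""
    return [
        {key: (fill if value == "" else value) for key, value in row.items()}
        for row in rows
    ]


rows = read_rows(RAW)
print([row["id"] for row in drop_blanks(rows)])
print(replace_blanks(rows, "0")[0])
```

Dropping loses the two incomplete examples here, while replacing keeps them at the cost of introducing values that were never observed; which one is appropriate depends on the data.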

🛠 Bugfixes & Improvements 🛠

  • Fix several bugs in how various tuning objectives and output metrics were computed (Issues #545 & #548, PR #551).

  • Fix how pos_label_str is documented, read in, and used for classification tasks (Issues #550 & #570, PRs #566 & #571).

  • Fix several bugs in the generate_predictions utility script and streamline its implementation to not rely on an externally specified positive label or index but rather read it from the model file or infer it (Issues #484 & #562, PR #566).

  • Fix bug due to overlap between tuning objectives and metrics that could prevent metric computation (Issue #564, PR #567).

  • Using an externally specified folds_file for grid search now works for evaluate and predict tasks, not just train (Issue #536, PR #538).

  • Fix incorrect application of sampling before feature scaling in Learner.predict() (Issue #472, PR #474).

  • Disable feature sampling for MultinomialNB learner since it cannot handle negative values (Issue #473, PR #474).

  • Add missing logger attribute to Learner.FilteredLeaveOneGroupOut (Issue #541, PR #543).

  • Fix FeatureSet.has_labels to recognize list of None objects which is what happens when you read in an unlabeled data set and pass label_col=None (Issue #426, PR #453).

  • Fix bug in ARFFWriter that adds/removes label_col from the field names even if it's None to begin with (Issue #452, PR #453).

  • Do not produce unnecessary warnings for learning curves (Issue #410, PR #458).

  • Show a warning when applying feature hashing to multiple feature files (Issue #461, PR #479).

  • Fix loading issue for saved MultinomialNB models (Issue #573, PR #574).

  • Reduce memory usage for learning curve experiments by explicitly closing matplotlib figure instances after they are saved.

  • Improve SKLL’s cross-platform operation by explicitly reading and writing files as UTF-8 in readers and writers and by using the newline parameter when writing files.
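
The last item above refers to a general Python idiom rather than anything specific to SKLL. A minimal sketch of that idiom, with a made-up file name and made-up rows, might look like this; how SKLL's readers and writers apply it internally is not shown here.

```python
import csv
import os
import tempfile

rows = [["id", "text"], ["EX1", "naïve café"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.csv")

    # newline="" stops Python from translating the line endings that the
    # csv module writes itself; the encoding is stated explicitly instead
    # of relying on the platform default.
    with open(path, "w", encoding="utf-8", newline="") as out_file:
        csv.writer(out_file).writerows(rows)

    # Read the raw bytes to inspect what actually landed on disk.
    with open(path, "rb") as raw_file:
        raw_bytes = raw_file.read()

    # Read the file back with the same encoding and newline settings.
    with open(path, "r", encoding="utf-8", newline="") as in_file:
        round_tripped = list(csv.reader(in_file))

print(round_tripped == rows)
```

Without newline="", writing on Windows can turn each row terminator into a doubled carriage return, and without an explicit encoding the bytes on disk depend on the machine's locale.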

📖 Documentation Updates 📖

  • Reorganize documentation to explicitly document all types of output files and link them to the corresponding configuration fields in the Output section (Issue #459, PR #568).

  • Add new interactive tutorial that uses a Jupyter notebook hosted on binder (Issue #448, PRs #547 & #552).

  • Add a new page to official documentation explaining how the SKLL code is organized for new developers (Issue #511, PR #519).

  • Update SKLL contribution guidelines and link to them from official documentation (Issues #498 & #514, PRs #503 & #519).

  • Update documentation to indicate that pandas and seaborn are now direct dependencies and not optional (Issue #553, PR #563).

  • Update LogisticRegression learner documentation to talk explicitly about penalties and solvers (Issue #490, PR #500).

  • Properly document the internal conversion of string labels to ints/floats and possible edge cases (Issue #436, PR #476).

  • Add feature scaling to Boston regression example (Issue #469, PR #478).

  • Several other additions/updates to documentation (Issue #459, PR #568).

✔️ Tests ✔️

  • Make tests into a package so that we can do something like from skll.tests.utils import X etc. (Issue #530, PR #531).

  • Add new tests based on SKLL examples so that we would know if examples ever break with any SKLL updates (Issues #529 & #544, PR #546).

  • Tweak tests to make test suite runnable on Windows (and pass!).

  • Add Azure Pipelines integration for automated test builds on Windows.

  • Add several new comprehensive tests for all new features and bugfixes, and remove older, unnecessary tests. See various PRs above for details.

  • Current code coverage for SKLL tests is at 95%, the highest it has ever been!

🔍 Other changes 🔍

  • Replace prettytable with the more actively maintained tabulate (Issue #356, PR #467).

  • Make sure entire codebase complies with PEP8 (Issue #460, PR #568).

  • Update the year to 2019 everywhere (Issue #447, PRs #456 & #568).

  • Update TravisCI configuration to use conda_requirements.txt for building environment (PR #515).

👩‍🔬 Contributors 👨‍🔬

(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Supreeth Baliga (@SupreethBaliga), Jeremy Biggs (@jbiggsets), Aoife Cahill (@aoifecahill), Ananya Ganesh (@ananyaganesh), R. Gokul (@rgokul), Binod Gyawali (@bndgyawali), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), Robert Pugh (@Lguyogiro), Maxwell Schwartz (@maxwell-schwartz), Eugene Tsuprun (@etsuprun), Avijit Vajpayee (@AVajpayeeJr), Mengxuan Zhao (@chaomenghsuan)

SKLL 1.5.3

14 Dec 19:12
f08b38d

This is a minor release of SKLL with the most notable change being compatibility with the latest version of scikit-learn (v0.20.1).

What's new

  • SKLL is now compatible with scikit-learn v0.20.1 (Issue #432, PR #439).
  • GradientBoostingClassifier and GradientBoostingRegressor now accept sparse matrices as input (Issue #428, PR #429).
  • The model_params property now works for SVC learners with a linear kernel (Issue #425, PR #443).
  • Improved documentation (Issue #423, PR #437).
  • Update generate_predictions to output the probabilities for all classes instead of just the first class (Issue #430, PR #433). Note: this change breaks backward compatibility with previous SKLL versions since the output file now always includes a column header.

Bugfixes

  • Fixed broken links in documentation (Issues #421 and #422, PR #437).
  • Fixed data type conversion in NDJWriter (Issue #416, PR #440).
  • Properly handle the possible combinations of trained model and prediction set vectorizers in Learner.predict (Issue #414, PR #445).

Other changes

  • Make the tests for MLPClassifier and MLPRegressor go faster (by turning off grid search) to prevent Travis CI from timing out (issue #434, PR #435).

SKLL 1.5.2

12 Apr 17:01
a10b28e

This is a hot fix release that addresses a single issue.

Learner instances created via the from_file() method did not get loggers associated with them. As a result, any warning generated for such a learner instance would have led to an AttributeError exception.
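
One common way this kind of bug arises is an alternate constructor that restores saved attributes without going through __init__. The sketch below illustrates that pattern and a fix with a hypothetical TinyLearner class; it is not SKLL's Learner, and whether SKLL's from_file() failed for exactly this reason is an assumption.

```python
import logging


class TinyLearner:
    """Hypothetical stand-in for a learner; not SKLL's actual class."""

    def __init__(self, name, logger=None):
        self.name = name
        self.logger = logger if logger else logging.getLogger(__name__)

    @classmethod
    def from_state_buggy(cls, state):
        # Bypasses __init__, so `logger` is never set on the new instance.
        learner = cls.__new__(cls)
        learner.__dict__.update(state)
        return learner

    @classmethod
    def from_state_fixed(cls, state, logger=None):
        learner = cls.from_state_buggy(state)
        # Attach a logger after restoring the saved attributes.
        learner.logger = logger if logger else logging.getLogger(__name__)
        return learner

    def warn(self, message):
        self.logger.warning(message)


# What a saved model might contain: everything except the logger.
state = {"name": "demo"}

try:
    TinyLearner.from_state_buggy(state).warn("something looks off")
    failed = False
except AttributeError:
    failed = True

# The fixed constructor can emit the same warning without an exception.
TinyLearner.from_state_fixed(state).warn("something looks off")
print(failed)
```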

SKLL 1.5.1

31 Jan 18:04
f195e4b

This is primarily a bug fix release.

Bugfixes

  • Generate the "folds_file" warnings only when "folds_file" is specified (issue #404, PR #405).
  • Modify Learner.save() to deal properly with reading in and re-saving older models (issue #406, PR #407).
  • Fix regression that caused the output directories to not be automatically created (issue #408, PR #409).

SKLL 1.5

14 Dec 20:27

This is a major new release of SKLL.

What's new

  • Several new scikit-learn learners included along with reasonable default parameter grids for tuning, where appropriate (issues #256 & #375, PR #377).
    • BayesianRidge
    • DummyRegressor
    • HuberRegressor
    • Lars
    • MLPRegressor
    • RANSACRegressor
    • TheilSenRegressor
    • DummyClassifier
    • MLPClassifier
    • RidgeClassifier
  • Allow computing any number of additional evaluation metrics in addition to the tuning objective (issue #350, PR #384).
  • Rename cv_folds_file configuration option to folds_file. The former is still supported with a deprecation warning but will be removed in the next release (PR #367).
  • Add a new configuration option use_folds_file_for_grid_search which controls whether the inner-loop grid-search in a cross-validation experiment with a custom folds file also uses the folds from the file. It's set to True by default. Setting it to False means that the inner loop uses regular 3-fold cross-validation and ignores the file (PR #367).
  • Also add a keyword argument called use_custom_folds_for_grid_search to the Learner.cross_validate() method (PR #367).
  • Learning curves can now be plotted from existing summary files using the new plot_learning_curves command line utility (issue #346, PR #396).
  • Overhaul logging in SKLL. All messages are now logged both to the console (if running interactively) and to log files. Read more about the SKLL log files in the Output Files section of the documentation (issue #369, PR #380).
  • neg_log_loss is now available as an objective function for classification (issue #327, PR #392).
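
To picture what use_folds_file_for_grid_search changes, the sketch below builds the inner grid-search splits both ways for one outer fold. It is an illustration only and not SKLL's code: the helper names and the fold mapping are hypothetical, it assumes that reusing the folds file means holding out one remaining fold label at a time, and the "regular 3-fold" case is approximated by a plain unshuffled, unstratified split.

```python
def outer_splits(fold_of):
    """Yield (held_out_fold, train_ids, test_ids), one per fold label."""
    for held_out in sorted(set(fold_of.values())):
        train = sorted(i for i, f in fold_of.items() if f != held_out)
        test = sorted(i for i, f in fold_of.items() if f == held_out)
        yield held_out, train, test


def inner_splits(train_ids, fold_of, use_folds_file, k=3):
    """Return (inner_train, inner_validation) pairs for the grid search."""
    if use_folds_file:
        # Reuse the fold labels from the file among the training ids.
        groups = sorted({fold_of[i] for i in train_ids})
        return [
            ([i for i in train_ids if fold_of[i] != g],
             [i for i in train_ids if fold_of[i] == g])
            for g in groups
        ]
    # Otherwise ignore the file and cut the training ids into k parts.
    return [
        ([i for n, i in enumerate(train_ids) if n % k != part],
         [i for n, i in enumerate(train_ids) if n % k == part])
        for part in range(k)
    ]


# Made-up mapping of example ids to folds, as a folds file might provide.
fold_of = {
    "EX01": "A", "EX02": "A", "EX03": "B", "EX04": "B", "EX05": "C",
    "EX06": "C", "EX07": "D", "EX08": "D", "EX09": "E", "EX10": "E",
}

held_out, train_ids, test_ids = next(outer_splits(fold_of))
from_file = inner_splits(train_ids, fold_of, use_folds_file=True)
regular = inner_splits(train_ids, fold_of, use_folds_file=False)
print(held_out, len(from_file), len(regular))
```

With five folds in the file, the first outer split leaves four fold labels for the inner loop, so reusing the file yields four inner splits, whereas ignoring it yields three splits that may mix examples from different file folds.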

Changes

  • SKLL now supports Python 3.6. Although Python 3.4 and 3.5 will still work, 3.6 is now the officially supported Python 3 version. Python 2.7 is still supported (issue #355, PR #360).
  • The required version of scikit-learn has been bumped up to 0.19.1 (issue #328, PR #330).
  • The learning curve y-limits are now computed a bit more intelligently (issue #389, PR #390).
  • Raise a warning if ablation flag is used for an experiment that uses train_file/test_file - this is not supported (issue #313, PR #392).
  • Raise a warning if both fixed_parameters and param_grids are specified (issue #185, PR #297).
  • Disable grid search if no default parameter grids are available in SKLL and the user doesn't provide parameter grids either (issue #376, PR #378).
  • SKLL has a copy of scikit-learn's DictVectorizer because it needs some custom functionality. Most (but not all) of our modifications have now been merged into scikit-learn so our custom version is now significantly condensed down to just a single method (issue #263, PR #374).
  • Improved outputs for cross-validation tasks (issues #349 & #371, PRs #365 & #372)
    • Do not show the full folds dictionary in the log when a folds file is specified, as was erroneously done before.
    • Show number of cross-validation folds in results to be via folds file if a folds file is specified.
    • Show grid search folds in results to be via folds file if the grid search ends up using the folds file.
    • Do not show the stratified folds information in results when a folds file is specified.
    • Show the value of use_folds_file_for_grid_search in results when appropriate.
    • Show grid search related information in results only when we are actually doing grid search.
  • The Travis CI plan was broken up into multiple jobs in order to get around the 50 minute limit (issue #385, PR #387).
  • For the conda package, some of the dependencies are now sourced from the conda-forge channel.

Bugfixes

  • Fix the bug that was causing the inner grid-search loop of a cross-validation experiment to use a single job instead of the number specified via grid_search_jobs (issue #363, PR #367).
  • Fix unbound variable in readers.py (issue #340, PR #392).
  • Fix bug when running a learning curve experiment via gridmap (issue #386, PR #390).
  • Fix a mismatch between the default number of grid search folds and the default number of slots requested via gridmap (issue #342, PR #367).

Documentation

  • Update documentation and tests for all of the above changes and new features.
  • Update tutorial and installation instructions (issues #383 and #394, PR #399).
  • Standardize all of the function and method docstrings to be NumPy style. Add docstrings where missing (issue #373, PR #397).

SKLL 1.3

13 Feb 19:48

This is a major new release of SKLL.

New features

  • You can now generate learning curves for multiple learners, multiple feature sets, and multiple objectives in a single experiment by using task=learning_curve in the configuration file. See documentation for more details (issue #221, PR #332).

Changes

  • The required version of scikit-learn has been bumped up to 0.18.1 (issue #328, PR #330).
  • SKLL now uses the MKL backend on macOS/Linux instead of OpenBLAS when used as a conda package.

Bugfixes

  • Fix deprecation warning when using Learner.model_params() (issue #325, PR #329).
  • Update the definitions of SKLL F1 metrics as a result of scikit-learn upgrade (issue #325, PR #330).
  • Bring documentation for SVC parameter grids up to date with the code (issue #334, PR #337).
  • Update documentation to make it clear that the SKLL conda package is only available for Python 3.4. For other Python versions, users should use pip.

SKLL 1.2.1

20 May 19:10

This is primarily a bug fix release but also adds a major new API feature.

New API Feature:

  • If you use the SKLL API, you can now create FeatureSet instances directly from pandas data frames (issue #261, PR #292).

Bugfixes:

  • Correctly parse floats in scientific notation, e.g., when specifying parameter grids and/or fixed parameters (issue #318, PR #320)
  • print_model_weights now correctly handles models trained with fit_intercept=False (issue #322, PR #323).
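
As an illustration of the scientific-notation fix above, the sketch below contrasts a parser that only recognizes floats written with a decimal point against one that accepts anything Python's float() accepts. Both helpers are hypothetical and neither is SKLL's actual configuration parser; the tokens stand in for values from a parameter grid.

```python
def naive_parse(token):
    """Treat a token as a float only if it is digits, a dot, then digits."""
    whole, dot, frac = token.partition(".")
    if dot and whole.isdigit() and frac.isdigit():
        return float(token)
    return token


def robust_parse(token):
    """Convert a token to float whenever float() accepts it."""
    try:
        return float(token)
    except ValueError:
        return token


# Made-up tokens as they might appear in a parameter grid.
tokens = ["0.5", "1e-4", "2.5E+1", "auto"]
print([naive_parse(token) for token in tokens])
print([robust_parse(token) for token in tokens])
```

The naive version silently leaves 1e-4 and 2.5E+1 as strings, which is the kind of failure that surfaces later as a type error inside the learner rather than at configuration time.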

SKLL 1.2

24 Feb 01:32

This release includes major changes as well as a number of bugfixes.

Changes:

  • The required version of scikit-learn has been bumped up to 0.17.1 (issue #273, PRs #288 and #308)
  • You can now optionally save cross-validation folds to a file for later analysis (issue #259, PR #262)
  • Update documentation to be clear about when two FeatureSet instances are deemed equal (issue #272, PR #294)
  • You can now specify multiple objective functions for parameter tuning (issue #115, PR #291)

Bugfixes:

  • Use a fixed random state when doing non-stratified k-fold cross-validation (issue #247, PR #286)
  • Fix errors when reusing relative paths in output section (issue #252, PR #287)
  • print_model_weights now works correctly for multi-class logistic regression models (issue #274, PR #267)
  • Correctly raise an IOError if the config file is not correctly specified (issue #275, PR #281)
  • The evaluate task does not crash when the test data has labels that were not seen in training data (issue #279, PR #290)
  • The fit() method for rescaled versions of learners now works correctly when not doing grid search (issue #304, PR #306)
  • Fix minor typos in the documentation and tutorial.
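
The point of a fixed random state for non-stratified k-fold cross-validation is that the same data always lands in the same folds. The sketch below shows the idea with the standard library only; the helper kfold_indices and the seed value are hypothetical and this is not how SKLL builds its folds. In scikit-learn, KFold accepts a random_state argument for the same purpose when shuffling is enabled.

```python
import random


def kfold_indices(n_examples, n_folds, seed):
    """Shuffle example indices with a private, seeded generator and deal
    them into n_folds folds."""
    indices = list(range(n_examples))
    # A dedicated Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(indices)
    return [sorted(indices[k::n_folds]) for k in range(n_folds)]


first = kfold_indices(10, 3, seed=123)
second = kfold_indices(10, 3, seed=123)
print(first == second)
```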

SKLL 1.1.1

23 Oct 17:20

This is a minor bugfix release. It fixes:

  • Issue where a FileExistsError would be raised when processing many configs (PR #260)
  • Instance of cv_folds instead of num_cv_folds in the documentation (PR #248).
  • Crash with print_model_weights and Logistic Regression models without intercepts (issue #250, PR #251)
  • Division by zero error when there was only one example (issue #253, PR #254)

SKLL 1.1.0

20 Jul 13:35

The biggest changes in this release are that the required version of scikit-learn has been bumped up to 0.16.1 and config file parsing is much more robust and gives much better error messages when users make mistakes.

Implemented enhancements

  • Base estimators other than the defaults are now supported for AdaBoost classifiers and regressors (#238)
  • User can now specify number of cross-validation folds to use in the config file (#222)
  • Decision Trees and Random Forests no longer need dense inputs (#207)
  • Stratification during cross-validation is now optional (#160)

Fixed bugs

  • Bug when checking if hasher_features is a valid option (#234)
  • Invalid/missing/duplicate options in configuration are now detected (#223)
  • Stop modifying global numpy random seed (#220)
  • Relative paths specified in the config file are now relative to the config file location instead of to the current directory (#213)
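
Resolving relative paths against the configuration file's location rather than the current directory can be sketched as follows. The helper resolve_config_path and the example paths are hypothetical and are not taken from SKLL's code.

```python
import os


def resolve_config_path(config_file, value):
    """Resolve `value` against the directory that contains `config_file`.

    Absolute paths are returned unchanged.
    """
    if os.path.isabs(value):
        return value
    config_dir = os.path.dirname(config_file)
    # normpath collapses components such as ".." without touching the disk.
    return os.path.normpath(os.path.join(config_dir, value))


# Made-up layout: the config lives in exp/configs, the data in exp/data.
config_file = os.path.join("exp", "configs", "run.cfg")
print(resolve_config_path(config_file, "train"))
print(resolve_config_path(config_file, os.path.join("..", "data")))
```

With this behavior, an experiment gives the same result whether it is launched from the configuration file's own directory or from anywhere else.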

Closed issues

  • Incompatibility with the latest version of scikit-learn (v0.16.1) (#235, #241, #233)
  • Learner.model_params will return weights with the wrong sign if sklearn is fixed (#111)
