add changelog
RobinL committed Nov 7, 2021
1 parent b6e0dd3 commit 7d1a303
Showing 1 changed file with 30 additions and 16 deletions.
46 changes: 30 additions & 16 deletions CHANGELOG.md
@@ -1,25 +1,45 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [2.0.0]

### Changed

- Term frequency adjustments are now calculated directly from a term frequency lookup table, making them more accurate
- Term frequency adjustments are now part of the iterative EM estimation step, improving convergence
- All internal calculations now use Bayes factors (match weights) rather than probabilities, which simplifies the maths
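
To illustrate why working in Bayes factors can simplify the maths, here is a minimal sketch, not Splink's internal code. It assumes the standard Fellegi-Sunter setup in which each comparison has an `m` probability (agreement given a match) and a `u` probability (agreement given a non-match). The function names and the example values are hypothetical.

```python
import math
from typing import Iterable


def bayes_factor(m: float, u: float) -> float:
    """Bayes factor for one comparison: P(observation | match) / P(observation | non-match)."""
    return m / u


def posterior_match_probability(prior: float, factors: Iterable[float]) -> float:
    """Combine a prior match probability with independent Bayes factors.

    Working in odds, each comparison contributes one multiplication,
    and the result is converted back to a probability only at the end.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * math.prod(factors)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical m and u values for two comparison columns.
factors = [bayes_factor(0.9, 0.1), bayes_factor(0.8, 0.2)]
probability = posterior_match_probability(0.5, factors)
```

With these illustrative inputs the two factors are roughly 9 and 4, so the posterior odds are about 36 to 1. Taking log2 of each factor would turn the product into a sum of weights.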

### Added

- Splink now outputs `match_weight`, the log2(Bayes Factor) of the match score.
- New `splink.charts.save_offline_chart` function that produces charts that work in air-gapped environments with no internet connection
- New `splink.cluster.clusters_at_thresholds` function that clusters at one or more match thresholds
- The `splink.truth.roc_chart` function now allows several ROC curves to be plotted on a single chart, to compare the accuracy of different models
- Splink now includes a slower Python implementation of jaro_winkler, in case users are having trouble with the string similarity jar
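
As a rough illustration of how a log2 match weight relates to a match probability, the sketch below converts between the two. It assumes the weight is the log2 of the odds that correspond to the match probability. The helper names are hypothetical and are not part of the Splink API.

```python
import math


def probability_to_match_weight(probability: float) -> float:
    """log2 of the odds implied by a probability strictly between 0 and 1."""
    return math.log2(probability / (1.0 - probability))


def match_weight_to_probability(match_weight: float) -> float:
    """Invert the conversion: odds = 2 ** weight, probability = odds / (1 + odds)."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)
```

Under this assumption a weight of 0 corresponds to even odds (probability 0.5), and each additional unit of weight doubles the odds of a match.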

### Removed

- Since term frequency adjustments are no longer an ex-post step, there's no longer a need for them to be calculated separately. Splink therefore no longer outputs `tf_adjusted_match_prob`. Instead, TF adjustments are included within `match_probability`

## [1.0.5]

### Fixed

- Bug that meant default numerical case statements were not available. See [here](https://github.com/moj-analytical-services/splink/issues/189). Thanks to [geobetts](https://github.com/geobetts)

### Changed

- `m` and `u` probabilities are now reset to `None` rather than `0` in EM iteration when they cannot be estimated
- Now use `_repr_pretty_` rather than `__repr__` so that objects display nicely in Jupyter Lab; `__repr__` had been interfering with the interpretation of stack trace errors
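
The `_repr_pretty_` hook mentioned above is part of IPython's pretty-printing protocol: IPython calls it with a printer object and a `cycle` flag when displaying a value. The sketch below shows the general pattern on a hypothetical class, `ModelSummary`, which is not part of Splink.

```python
class ModelSummary:
    """Hypothetical class, for illustration only; not part of Splink."""

    def __init__(self, name: str, n_params: int) -> None:
        self.name = name
        self.n_params = n_params

    def _repr_pretty_(self, p, cycle: bool) -> None:
        # IPython and Jupyter call this hook when displaying the object.
        # `p` is the pretty-printer; `cycle` is True for self-referencing objects.
        if cycle:
            p.text("ModelSummary(...)")
        else:
            p.text(f"ModelSummary(name={self.name!r}, n_params={self.n_params})")
```

Because `__repr__` is left untouched, plain `repr()` calls, including those made while formatting tracebacks, still produce Python's default short representation.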

## [1.0.3] - 2020-02-04

### Fixed

- Bug whereby Splink lowercased case expressions, see [here](https://github.com/moj-analytical-services/splink/issues/174)

## [1.0.2] - 2020-02-02

### Changed
@@ -39,34 +59,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- m and u history charts now display barchart correctly

## [1.0.0] - 2020-01-20

### Added

- Charts now feature improved tooltips, and have a cleaner appearance. Many are now zoomable
- Charts now display better in Jupyter Lab, especially the html file produced by `all_charts_write_html_file()`
- `m` and `u` probabilities charts can now be produced from `Settings` objects
- The user can now combine settings objects using `ModelCombiner` from `splink.combine_models`

### Changed

A number of **backwards incompatible** changes have been made for Splink 1.0.

- The main `Splink` API is different. Instead of `Splink(...,df=df)` for dedupe and `Splink(...,df_l=df_l,df_r=df_r)` for linking, the user provides an argument `df_or_dfs`, which is either a single DataFrame or a list of DataFrames. This allows linking n>2 datasets.
- When linking multiple dataframes, the user must now include a `source_dataset` column (default name `source_dataset`, configurable via `source_dataset_column_name` in the settings dict)
- The `Params` class is now called `Model` in the `model.py` module.
- The on-disk (json) format of the `Model` object has changed and is incompatible with `Params`
- The new `Model` class now uses the same representation for parameters as the Settings object, reducing duplicate code. Internal functions now have `settings` or `model` as function arguments, never both.
- Vega-Lite chart definitions are now stored in JSON files in `splink/files/chart_defs`
- All case statement generation functions are now consistently named, with all names starting `sql_gen_case_stmt_`
- Fixed `case_statements.sql_gen_case_smnt_strict_equality_2` which previously behaved differently to all other case functions
- All case statements now have a default threshold of exact equality on their top gamma level
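
The `df_or_dfs` argument described above accepts either a single DataFrame or a list of DataFrames. A minimal sketch of how such an argument could be normalised into a list is shown below. The helper `normalise_inputs` is hypothetical and is not taken from the Splink codebase; it treats its inputs as opaque objects rather than assuming a particular DataFrame type.

```python
from typing import Any, List


def normalise_inputs(df_or_dfs: Any) -> List[Any]:
    """Return the input datasets as a list, whether one or several were passed.

    Hypothetical helper: a single object is wrapped in a one-element list,
    while a list or tuple is copied into a new list.
    """
    if isinstance(df_or_dfs, (list, tuple)):
        dfs = list(df_or_dfs)
    else:
        dfs = [df_or_dfs]
    if not dfs:
        raise ValueError("At least one input dataset is required")
    return dfs
```

Downstream code can then loop over the returned list regardless of how many datasets were supplied, which is one way an interface like this can support linking more than two datasets.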








