-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathNEWS
100 lines (66 loc) · 3.66 KB
/
NEWS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
CHANGES IN VERSION 0.5.0
BUG FIXES
- Severe bug: `greedy` and `select_greedy` can give incorrect results. There
was a bug in `greedy` (also affecting `select_greedy`) where keeping track
of the number of times records from the second data set are linked had an
error. This has been fixed.
CHANGES IN VERSION 0.4.0
NEW FEATURES
- Additional arguments to `compare_vars` are passed on to the comparison
function.
- Added `inplace` argument to `predict`.
- The `greedy` function is now exported.
- The `identical` function has been removed. It was already deprecated. The
identical `cmp_identical` can be used instead.
BUG FIXES
- `identical` called itself resulting in a stack overflow. This has been
fixed.
- In extreme cases where the one of the m-probabilities converged to zero
`problink_em` gave an error. The algorithm will now converge to a small
value and give a warning (as this usually means that the algorithm didn't
converge to a valid solution.
CHANGES IN VERSION 0.3.2
NEW FEATURES
- The `select_unique` function has been added. This function filters pairs and
removes pairs that have been matched more than once. For pairs that are
matched more than once (with a high enough link quality) it is not possible
to decide which link is the correct one. In some linkage scenarios (with a
focus on reducing false matches) it is then preferable to remove both.
- The `score_simple` function has been added which calculates a weighted sum
of the comparison vectors. Different weights are allowed for agreement,
non-agreement and missing values for each of the variables in the comparison
vector.
- The `merge_pairs` function can combine different sets of pairs for the same
two data sets. For example, combine two sets of pairs generated with
different blocking variables.
- The functions `identical`, `jaro_winkler`, `lcs` and `jaccard` are
deprecated and will be removed in future versions of the package. Instead
use the functions `cmp_identical`, `cmp_jarowinkler`, `cmp_lcs` and
`cmp_jaccard`. The function `identical` caused a conflict with a function in
base R. Also this, hopefully, stresses that these functions return a
function.
- `select_greedy` has `n` and `m` arguments that allow the user to specify
what type of linkage is allows (1-to-1, n-to-1, 1-to-m, n-to-m).
- `select_greedy` has an `include_ties` option that will include multiple pairs
for a record when these pairs have an equal weight/score.
- `pair_minsim` and `cluster_pair_minsim` now have an `on_blocking` option
that functions similar to the `on` argument of `pair_blocking`: pairs are
only generated when the agree exactly on the blocking keys.
- When selecting pairs using a threshold. Pairs are now selected when their
score is above or *equal to* the threshold. This has an effect for all of
the `select_` functions.
BUG FIXES
- `select_greedy` gave an error when the set of pairs is empty.
- The `by_x` and `by_y` arguments were not handled correctly. Fixed.
- The result from `cluster_collect` can now be used without introducing
a warning from `data.table`.
CHANGES IN VERSION 0.2.0
NEW FEATURES:
- `merge_pairs` combines two sets of pairs into one (like `rbind`). Especially
useful in combination with `pair_blocking` to generate pairs that are blocked
on multiple keys (e.g. they have match on either `postcode` or `lastname`).
BUG FIXES:
- `pairs_minsim` gave an error when the number of records squared is larger
the maximum integer value.
- `pairs_minsim` did not generate pairs when one of the keys contained
missing values.