From 1bdf02c067eaac0b393b006ac624d2e68c1f9456 Mon Sep 17 00:00:00 2001
From: Robin Linacre
Date: Sat, 7 Dec 2024 11:45:24 +0000
Subject: [PATCH 1/4] add modelling tips first draft

---
 .../splink_fundamentals/modelling_tips.md     | 39 +++++++++++++++++++
 1 file changed, 39 insertions(+)
 create mode 100644 docs/topic_guides/splink_fundamentals/modelling_tips.md

diff --git a/docs/topic_guides/splink_fundamentals/modelling_tips.md b/docs/topic_guides/splink_fundamentals/modelling_tips.md
new file mode 100644
index 0000000000..5da26b39ce
--- /dev/null
+++ b/docs/topic_guides/splink_fundamentals/modelling_tips.md
@@ -0,0 +1,39 @@
+---
+tags:
+  - Modelling
+---
+
+# Next steps: Top Tips for Building your own model
+
+This page summarises some recommendations for how to approach building a new Splink model, to get an accurate model as quickly as possible.
+
+At a high level, we recommend beginning with a small sample and a basic model, then iteratively adding complexity to resolve issues and improve performance.
+
+## General workflow
+
+- **For large datasets, start by linking a small non-random sample**. Building a model is often a highly iterative process and you don't want long processing times slowing down your iteration cycle. Most of the modelling can be conducted on a small sample, and only once that's working, re-run everything on the full dataset. You need a **non-random** sample of about 10,000 records. By non-random, I mean a sample that retains lots of matches - for instance, all people aged over 70, or all people with a first name starting with the characters 'pa'. You should aim to be able to run your full training and prediction script in less than a minute. Remember to set a lower value (say `1e6`) of the `target_rows` when calling `estimate_u_using_random_sampling()` during this iteration process, but then increase it in the final full-dataset run to a much higher value, maybe `1e8`.
+
+- **Start simple, and iterate**. It's often tempting to start with a complex model, with many granular comparison levels, in an attempt to reflect the real world closely. Instead, I recommend starting with a simple, rough and ready model where most comparisons have 2-3 levels (exact match, possibly a fuzzy level, and everything else). The purpose is to get to the point of looking at prediction results as quickly as possible using e.g. the comparison viewer. You can then start to look for where your simple model is getting it wrong, and use that as the basis for improving your model, iterating until you're seeing good results.
+
+## Blocking
+
+- **Many strict `blocking_rules_for_prediction` are generally better than few loose rules.** Each individual blocking rule is likely to exclude many true matches. But between them, it should be implausible that a truly matching record 'falls through' all the blocking rules. Many of our models have between about 10-15 `blocking_rules_for_prediction`.
+
+- **Analyse the number of comparisons before running predict**. Use the tools in `splink.blocking_analysis` to validate that your rules aren't going to create a vast number of comparisons before asking Splink to create those comparisons.
+
+## EM training
+
+- *Predictions usually aren't very sensitive to `m` probabilities being a bit wrong*. The hardest model parameters to estimate are the `m` probabilities. It's fairly common for Expectation Maximisation to yield 'bad' (implausible) values. Luckily, the accuracy of your model is usually not particularly sensitive to the `m` probabilities - the `u` probabilities drive the match weights. In many cases, you'll get good results by simply setting the `m` probabilities using 'expert judgement' (i.e. guessing).
+
+- *Convergence problems are often indicative of the need for further data cleaning*. Whilst predictions often aren't terribly sensitive to `m` probabilities, question why the estimation procedure is producing bad parameter estimates. To do this, it's often enough to look at a variety of predictions to see if you can spot edge cases where the model is not doing what's expected. For instance, we may find matches where the first name is `Mr.`. By fixing this and re-estimating, the parameter estimates make more sense.
+
+- **Blocking rules for EM training do not need high recall**. The purpose of blocking rules for EM training is to find a subset of records which includes a reasonably balanced mix of matches and non-matches. There is no requirement that these records contain all, or even most, of the matches. To check whether parameter estimates are the result of a biased sample of matches, you can use `linker.visualisations.parameter_estimate_comparisons_chart`.
+
+## Working with large datasets
+
+To optimise memory usage and performance:
+
+- **Avoid pandas for input/output**. Whilst Splink supports inputs as pandas dataframes, and you can convert results to pandas using `.as_pandas_dataframe()`, we recommend against this for large datasets. For large datasets, use the concept of a dataframe that's native to your database backend. For example, if you're using Spark, it's best to read your files using Spark and pass Spark dataframes into Splink, and save any outputs using `splink_dataframe.as_spark_dataframe`. With duckdb, use the inbuilt duckdb csv/parquet reader, and output via `splink_dataframe.as_duckdbpyrelation`.
+
+- **Avoid pandas for data cleaning**. You will generally get substantially better performance by performing data cleaning in SQL using your chosen backend rather than using pandas.

From 392143a2f84c4789a66b36acfb4b6e57580c4c96 Mon Sep 17 00:00:00 2001
From: Robin Linacre
Date: Sat, 7 Dec 2024 11:53:30 +0000
Subject: [PATCH 2/4] improve

---
 .../topic_guides/splink_fundamentals/modelling_tips.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/topic_guides/splink_fundamentals/modelling_tips.md b/docs/topic_guides/splink_fundamentals/modelling_tips.md
index 5da26b39ce..91124abb86 100644
--- a/docs/topic_guides/splink_fundamentals/modelling_tips.md
+++ b/docs/topic_guides/splink_fundamentals/modelling_tips.md
@@ -11,11 +11,11 @@ At a high level, we recommend beginning with a small sample and a basic model, t
 ## General workflow

+- **For large datasets, start by linking a small non-random sample**. Building a model is an iterative process and you don't want long processing times slowing down your iteration cycle. Most of the modelling can be conducted on a small sample, and only once that's working, re-run everything on the full dataset. You need a **non-random** sample of about 10,000 records. By non-random, I mean a sample that retains lots of matches - for instance, all people aged over 70, or all people with a first name starting with the characters 'pa'. You should aim to be able to run your full training and prediction script in less than a minute. Remember to set a lower value (say `1e6`) of the `target_rows` when calling `estimate_u_using_random_sampling()` during this iteration process, but then increase it in the final full-dataset run to a much higher value, maybe `1e8`.

+- **Start simple, and iterate**. It's often tempting to start with a complex model, with many granular comparison levels, in an attempt to reflect the real world closely. Instead, start with a simple, rough and ready model where most comparisons have 2-3 levels (exact match, possibly a fuzzy level, and everything else). The purpose is to get to the point of looking at prediction results as quickly as possible using e.g. the comparison viewer. You can then start to look for where your simple model is getting it wrong, and use that as the basis for improving your model, iterating until you're seeing good results.

-## Blocking
+## Blocking rules for prediction

 - **Many strict `blocking_rules_for_prediction` are generally better than few loose rules.** Each individual blocking rule is likely to exclude many true matches. But between them, it should be implausible that a truly matching record 'falls through' all the blocking rules. Many of our models have between about 10-15 `blocking_rules_for_prediction`.

 - **Analyse the number of comparisons before running predict**. Use the tools in `splink.blocking_analysis` to validate that your rules aren't going to create a vast number of comparisons before asking Splink to create those comparisons.

 ## EM training

+- *Predictions usually aren't very sensitive to `m` probabilities being a bit wrong*. The hardest model parameters to estimate are the `m` probabilities. It's fairly common for Expectation Maximisation to yield 'bad' (implausible) values. Luckily, the accuracy of your model is usually not particularly sensitive to the `m` probabilities - the `u` probabilities drive the match weights. If you're having problems, consider fixing some `m` probabilities by expert judgement - see [here](https://github.com/moj-analytical-services/splink/pull/2379) for how.

 - *Convergence problems are often indicative of the need for further data cleaning*. Whilst predictions often aren't terribly sensitive to `m` probabilities, question why the estimation procedure is producing bad parameter estimates. To do this, it's often enough to look at a variety of predictions to see if you can spot edge cases where the model is not doing what's expected. For instance, we may find matches where the first name is `Mr.`. By fixing this and re-estimating, the parameter estimates make more sense.

+- **Blocking rules for EM training do not need high recall**. The purpose of blocking rules for EM training is to find a subset of records which includes a reasonably balanced mix of matches and non-matches. There is no requirement that these records contain all, or even most, of the matches. For more, see [here](https://moj-analytical-services.github.io/splink/topic_guides/blocking/model_training.html). To check whether parameter estimates are the result of a biased sample of matches, you can use `linker.visualisations.parameter_estimate_comparisons_chart`.

 ## Working with large datasets

From 6e30a899682bbd9f1a2333628cec582b4f41c91e Mon Sep 17 00:00:00 2001
From: Robin Linacre
Date: Sat, 7 Dec 2024 12:04:09 +0000
Subject: [PATCH 3/4] move to building your own model

---
 .../tutorials/08_building_your_own_model.md}  | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)
 rename docs/{topic_guides/splink_fundamentals/modelling_tips.md => demos/tutorials/08_building_your_own_model.md} (96%)

diff --git a/docs/topic_guides/splink_fundamentals/modelling_tips.md b/docs/demos/tutorials/08_building_your_own_model.md
similarity index 96%
rename from docs/topic_guides/splink_fundamentals/modelling_tips.md
rename to docs/demos/tutorials/08_building_your_own_model.md
index 91124abb86..145812f280 100644
--- a/docs/topic_guides/splink_fundamentals/modelling_tips.md
+++ b/docs/demos/tutorials/08_building_your_own_model.md
@@ -1,11 +1,7 @@
----
-tags:
-  - Modelling
----
-
 # Next steps: Top Tips for Building your own model

-This page summarises some recommendations for how to approach building a new Splink model, to get an accurate model as quickly as possible.
+Now that you've completed the tutorial, this page summarises some recommendations for how to approach building a new Splink model, to get an accurate model as quickly as possible.

 At a high level, we recommend beginning with a small sample and a basic model, then iteratively adding complexity to resolve issues and improve performance.

From 4b4a1f5e1c74bf80959d3eb1cd40039bec756417 Mon Sep 17 00:00:00 2001
From: Robin Linacre
Date: Tue, 10 Dec 2024 19:30:35 +0000
Subject: [PATCH 4/4] modelling guide

---
 .../tutorials/00_Tutorial_Introduction.ipynb  |  7 ++--
 .../tutorials/08_building_your_own_model.md   | 34 +++++++++++++------
 mkdocs.yml                                    |  1 +
 3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/docs/demos/tutorials/00_Tutorial_Introduction.ipynb b/docs/demos/tutorials/00_Tutorial_Introduction.ipynb
index 2395141ce7..49baa76155 100644
--- a/docs/demos/tutorials/00_Tutorial_Introduction.ipynb
+++ b/docs/demos/tutorials/00_Tutorial_Introduction.ipynb
    "\n",
    "The seven parts are:\n",
    "\n",
    "- [1. Data prep pre-requisites](./01_Prerequisites.ipynb)\n",
    "\n",
    "- [2. Exploratory analysis](./02_Exploratory_analysis.ipynb)  \n",
    "  \"Open\n",
    "\n",
    "\n",
    "- [7. Evaluation](./07_Evaluation.ipynb)  \n",
    "  \"Open\n",
    "\n",
    "\n",
+   "- [8. Building your own model](./08_building_your_own_model.md)  \n",
    "\n",
    "\n",
    "Throughout the tutorial, we use the duckdb backend, which is the recommended option for smaller datasets of up to around 1 million records on a normal laptop.\n",

diff --git a/docs/demos/tutorials/08_building_your_own_model.md b/docs/demos/tutorials/08_building_your_own_model.md
index 145812f280..93a8b556ed 100644
--- a/docs/demos/tutorials/08_building_your_own_model.md
+++ b/docs/demos/tutorials/08_building_your_own_model.md
-# Next steps: Top Tips for Building your own model
+# Next steps: Tips for Building your own model

-Now that you've completed the tutorial, this page summarises some recommendations for how to approach building a new Splink model, to get an accurate model as quickly as possible.
+Now that you've completed the tutorial, this page summarises some recommendations for how to approach building your own Splink models.
+
+These recommendations should help you create an accurate model as quickly as possible. They're particularly applicable if you're working with large datasets, where you can get slowed down by long processing times.

-At a high level, we recommend beginning with a small sample and a basic model, then iteratively adding complexity to resolve issues and improve performance.
+In a nutshell, we recommend beginning with a small sample and a basic model, then iteratively adding complexity to resolve issues and improve performance.

 ## General workflow

+- **For large datasets, start by linking a small non-random sample of records**. Building a model is an iterative process of writing data cleaning code, training models, finding issues, and circling back to fix them. You don't want long processing times slowing down this iteration cycle.
+
+    Most of your code can be developed against a small sample of records, and only once that's working, re-run everything on the full dataset.
+
+    You need a **non-random** sample of perhaps about 10,000 records. The sample must be non-random because it must retain lots of matches - for instance, retain all people aged over 70, or all people with a first name starting with the characters `pa`. You should aim to be able to run your full training and prediction script in less than a minute.
+
+    Remember to set a lower value (say `1e6`) of the `target_rows` when calling `estimate_u_using_random_sampling()` during this iteration process, but then increase it in the final full-dataset run to a much higher value, maybe `1e8`, since large values of `target_rows` can cause long processing times even on relatively small datasets.
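+
+    As an illustration, with the DuckDB backend the non-random sample might be pulled out like this (a sketch: `people.parquet` and the column name are hypothetical, and the `pa` filter is just one way of keeping a match-rich subset - the reduced sampling target appears in the sketch under the next tip):
+
+    ```python
+    import duckdb
+
+    # A non-random slice that retains matches: true duplicates of a person
+    # whose first name starts with 'pa' will also satisfy this filter,
+    # so the sample keeps plenty of matching pairs to learn from
+    df_sample = duckdb.sql("""
+        SELECT *
+        FROM read_parquet('people.parquet')
+        WHERE lower(first_name) LIKE 'pa%'
+    """)
+    ```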
+
+- **Start with a simple model**. It's often tempting to start by designing a complex model, with many granular comparison levels in an attempt to reflect the real world closely.
+
+    Instead, we recommend starting with a simple, rough and ready model where most comparisons have 2-3 levels (exact match, possibly a fuzzy level, and everything else). The idea is to get to the point of looking at prediction results as quickly as possible using e.g. the comparison viewer. You can then start to look for where your simple model is getting it wrong, and use that as the basis for improving your model, iterating until you're seeing good results.
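+
+    For example, a rough first pass might look something like this sketch (Splink 4 API, illustrative column names; `df_sample` is the sample from the previous tip, and recent Splink releases name the sampling-size parameter `max_pairs` rather than `target_rows`):
+
+    ```python
+    import splink.comparison_library as cl
+    from splink import DuckDBAPI, Linker, SettingsCreator, block_on
+
+    # Deliberately rough first model: 2-3 levels per comparison
+    settings = SettingsCreator(
+        link_type="dedupe_only",
+        comparisons=[
+            cl.JaroWinklerAtThresholds("first_name", [0.9]),
+            cl.JaroWinklerAtThresholds("surname", [0.9]),
+            cl.ExactMatch("dob"),
+            cl.ExactMatch("postcode"),
+        ],
+        blocking_rules_to_generate_predictions=[
+            block_on("surname", "dob"),
+            block_on("postcode"),
+        ],
+    )
+
+    linker = Linker(df_sample, settings, db_api=DuckDBAPI())
+
+    # Small sampling target while iterating; raise to ~1e8 for the final run
+    linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
+
+    # Two EM sessions so every comparison is estimated at least once
+    linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
+    linker.training.estimate_parameters_using_expectation_maximisation(block_on("first_name", "surname"))
+
+    df_predictions = linker.inference.predict()
+
+    # Inspect where the rough model goes wrong, then iterate
+    linker.visualisations.comparison_viewer_dashboard(df_predictions, "comparison_viewer.html", overwrite=True)
+    ```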

 ## Blocking rules for prediction

 - **Analyse the number of comparisons before running predict**. Use the tools in `splink.blocking_analysis` to validate that your rules aren't going to create a vast number of comparisons before asking Splink to create those comparisons.
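+
+    For example (a sketch - the candidate rule and column names are illustrative):
+
+    ```python
+    from splink import DuckDBAPI, block_on
+    from splink.blocking_analysis import count_comparisons_from_blocking_rule
+
+    # Count the comparisons a candidate rule would generate *before*
+    # asking Splink to actually create them
+    counts = count_comparisons_from_blocking_rule(
+        table_or_tables=df_sample,
+        blocking_rule=block_on("surname", "dob"),
+        link_type="dedupe_only",
+        db_api=DuckDBAPI(),
+    )
+    print(counts)
+    ```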

+- **Many strict `blocking_rules_for_prediction` are generally better than few loose rules.** Whilst individually, strict blocking rules are likely to exclude many true matches, between them it should be implausible that a truly matching record 'falls through' all the rules. Many strict rules often result in far fewer overall comparisons than a small number of loose rules. In practice, many of our real-life models have between about 10-15 `blocking_rules_for_prediction`.

 ## EM training

+- **Predictions usually aren't very sensitive to `m` probabilities being a bit wrong**. The hardest model parameters to estimate are the `m` probabilities. It's fairly common for Expectation Maximisation to yield 'bad' (implausible) values. Luckily, the accuracy of your model is usually not particularly sensitive to the `m` probabilities - it's usually the `u` probabilities that drive the biggest match weights. If you're having problems, consider fixing some `m` probabilities by expert judgement - see [here](https://github.com/moj-analytical-services/splink/pull/2379) for how.
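+
+    The linked PR describes support for fixing `m` probabilities directly; a low-tech alternative sketch is to save the trained model and apply the judgement by hand:
+
+    ```python
+    # Save the trained model, hand-edit any implausible m_probability
+    # values in the JSON, then reload the adjusted model
+    linker.misc.save_model_to_json("model.json", overwrite=True)
+
+    # ... edit m_probability values in model.json ...
+
+    from splink import DuckDBAPI, Linker
+    linker = Linker(df_sample, "model.json", db_api=DuckDBAPI())
+    ```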

+- **Convergence problems are often indicative of the need for further data cleaning**. Whilst predictions often aren't terribly sensitive to `m` probabilities, question why the estimation procedure is producing bad parameter estimates. To do this, it's often enough to look at a variety of predictions to see if you can spot edge cases where the model is not doing what's expected. For instance, we may find matches where the first name is `Mr`. By fixing this and re-estimating, the parameter estimates often make more sense.

+- **Blocking rules for EM training do not need high recall**. The purpose of blocking rules for EM training is to find a subset of records which includes a reasonably balanced mix of matches and non-matches. There is no requirement that these records contain all, or even most, of the matches. For more, see [here](https://moj-analytical-services.github.io/splink/topic_guides/blocking/model_training.html). To check whether parameter estimates are the result of a biased sample of matches, you can use `linker.visualisations.parameter_estimate_comparisons_chart`.

 ## Working with large datasets

 To optimise memory usage and performance:

 - **Avoid pandas for input/output**. Whilst Splink supports inputs as pandas dataframes, and you can convert results to pandas using `.as_pandas_dataframe()`, we recommend against this for large datasets. For large datasets, use the concept of a dataframe that's native to your database backend. For example, if you're using Spark, it's best to read your files using Spark and pass Spark dataframes into Splink, and save any outputs using `splink_dataframe.as_spark_dataframe`. With duckdb, use the inbuilt duckdb csv/parquet reader, and output via `splink_dataframe.as_duckdbpyrelation`.
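+
+    For example, with duckdb (a sketch reusing the `settings` object from the earlier example; `people.parquet` is hypothetical):
+
+    ```python
+    import duckdb
+    from splink import DuckDBAPI, Linker
+
+    # Read with DuckDB's native parquet reader rather than pandas
+    df = duckdb.read_parquet("people.parquet")
+
+    linker = Linker(df, settings, db_api=DuckDBAPI())
+    predictions = linker.inference.predict()
+
+    # Keep the results native to the backend and write out without pandas
+    predictions.as_duckdbpyrelation().write_parquet("predictions.parquet")
+    ```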

 - **Avoid pandas for data cleaning**. You will generally get substantially better performance by performing data cleaning in SQL using your chosen backend rather than using pandas.

+- **Turn off intermediate columns when calling `predict()`**. Whilst during the model development phase it is useful to set `retain_intermediate_calculation_columns=True` and `retain_intermediate_calculation_columns_for_prediction=True` in your settings, you should generally turn these off when calling `predict()`. This will result in a much smaller table as your result set. If you want waterfall charts for individual pairs, you can use [`linker.inference.compare_two_records`](../../api_docs/inference.md).
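+
+    For a single pair, the inputs to a waterfall chart can be generated on demand along these lines (a sketch - the record values are invented and `linker` is the trained model from above):
+
+    ```python
+    record_1 = {"unique_id": 1, "first_name": "Pauline", "surname": "Smith", "dob": "1984-01-02", "postcode": "SW1A 1AA"}
+    record_2 = {"unique_id": 2, "first_name": "Paula", "surname": "Smith", "dob": "1984-01-02", "postcode": "SW1A 1AA"}
+
+    df_pair = linker.inference.compare_two_records(record_1, record_2)
+    linker.visualisations.waterfall_chart(df_pair.as_record_dict())
+    ```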

diff --git a/mkdocs.yml b/mkdocs.yml
index ac45bb8d8c..7c58414073 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -110,6 +110,7 @@ nav:
       - 5. Predicting results: "demos/tutorials/05_Predicting_results.ipynb"
       - 6. Visualising predictions: "demos/tutorials/06_Visualising_predictions.ipynb"
       - 7. Evaluation: "demos/tutorials/07_Evaluation.ipynb"
+      - 8. Tips for building your own model: "demos/tutorials/08_building_your_own_model.md"
     - Examples:
       - Introduction: "demos/examples/examples_index.md"
      - DuckDB: