Commit: modelling guide

RobinL committed Dec 10, 2024
1 parent 6e30a89 commit 4b4a1f5
Showing 3 changed files with 28 additions and 14 deletions.
7 changes: 3 additions & 4 deletions docs/demos/tutorials/00_Tutorial_Introduction.ipynb
@@ -14,10 +14,7 @@
"\n",
"The seven parts are:\n",
"\n",
"- [1. Data prep pre-requisites](./01_Prerequisites.ipynb) <a target=\"_blank\" href=\"https://colab.research.google.com/github/moj-analytical-services/splink/blob/master/docs/demos/tutorials/00_Tutorial_introduction.ipynb\">\n",
" <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
"</a>\n",
"\n",
"- [1. Data prep pre-requisites](./01_Prerequisites.ipynb)\n",
"\n",
"- [2. Exploratory analysis](./02_Exploratory_analysis.ipynb) <a target=\"_blank\" href=\"https://colab.research.google.com/github/moj-analytical-services/splink/blob/master/docs/demos/tutorials/02_Exploratory_analysis.ipynb\">\n",
" <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
@@ -43,6 +43,8 @@
" <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
"</a>\n",
"\n",
"- [8. Building your own model](./08_building_your_own_model.md) \n",
"\n",
"\n",
"Throughout the tutorial, we use the duckdb backend, which is the recommended option for smaller datasets of up to around 1 million records on a normal laptop.\n",
"\n",
34 changes: 24 additions & 10 deletions docs/demos/tutorials/08_building_your_own_model.md
@@ -1,29 +1,40 @@

# Next steps:Top Tips for Building your own model
# Next steps: Tips for building your own model

Now that you've completed the tutorial, this page summarises some recommendations for how to approach building a new Splink model, to get an accurate model as quickly as possible.
Now that you've completed the tutorial, this page summarises some recommendations for how to approach building your own Splink models.

At a high level, we recommend beginning with a small sample and a basic model, then iteratively adding complexity to resolve issues and improve performance.
These recommendations should help you create an accurate model as quickly as possible. They're particularly applicable if you're working with large datasets, where you can get slowed down by long processing times.

In a nutshell, we recommend beginning with a small sample and a basic model, then iteratively adding complexity to resolve issues and improve performance.

## General workflow

- **For large datasets, start by linking a small non-random sample**. Building a model is an iterative process and you don't want long processing times slowing down your iteration cycle. Most of the modelling can be conducted on a small sample, and only once that's working, re-run everything on the full dataset. You need a **non-random** sample of about 10,000 records. By non-random, I mean a sample that retains lots of matches - for instance, all people aged over 70, or all people with a first name starting with the characters 'pa'. You should aim to be able to run your full training and prediction script in less than a minute. Remember to set a lower value (say `1e6`) of the `target_rows` when calling `estimate_u_using_random_sampling()` during this iteration process, but then increase in the final full-dataset run to a much higher value, maybe `1e8`.
- **For large datasets, start by linking a small non-random sample of records**. Building a model is an iterative process of writing data cleaning code, training models, finding issues, and circling back to fix them. You don't want long processing times slowing down this iteration cycle.

- **Start simple, and iterate**. It's often tempting to start by a complex model, with many granular comparison levels, in an attempt to reflect the real world closely. Instead, start with with a simple, rough and ready model where most comparisons have 2-3 levels (exact match, possibly a fuzzy level, and everything else). The purpose is to get to the point of looking at prediction results as quickly as possible using e.g. the comparison viewer. You can then start to look for where your simple model is getting it wrong, and use that as the basis for improving your model, and iterating until you're seeing good results.
Most of your code can be developed against a small sample of records; only once that's working should you re-run everything on the full dataset.

## Blocking rules for prediction
You need a **non-random** sample of roughly 10,000 records. The sample must be non-random because it must retain lots of matches - for instance, retain all people aged over 70, or all people with a first name starting with the characters `pa`. You should aim to be able to run your full training and prediction script in less than a minute.

Remember to set a lower value (say `1e6`) for `target_rows` when calling `estimate_u_using_random_sampling()` during this iteration process, then increase it to a much higher value, maybe `1e8`, for the final full-dataset run, since large values of `target_rows` can cause long processing times even on relatively small datasets. A minimal sketch of this workflow follows.
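
As a concrete illustration, here is a rough sketch of the workflow just described. It is not the tutorial's own code: the file name, column names, the filter used to build the non-random sample, and the comparisons are assumptions, and the argument controlling the number of sampled pairs is called `max_pairs` in recent Splink releases (referred to as `target_rows` above) - check the name for your version.

```python
import duckdb
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

# Build a small *non-random* sample: keep everyone whose first name starts
# with 'pa', so that plenty of true matches survive within the sample.
# Assumes the data has a unique_id column (Splink's default).
df_sample = duckdb.sql("""
    SELECT *
    FROM read_parquet('people.parquet')   -- hypothetical input file
    WHERE lower(first_name) LIKE 'pa%'
""").df()

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[block_on("surname", "dob")],
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
    ],
)

linker = Linker(df_sample, settings, db_api=DuckDBAPI())

# Keep the pair sample small while iterating; raise this substantially
# (e.g. to 1e8) for the final run on the full dataset.
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
```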

- **Many strict `blocking_rules_for_prediction` are generally better than few loose rules.** Each individual blocking rule is likely to exclude many true matches. But between them, it should be implausible that a truly matching record 'falls through' all the blockinges. Many of our models have between about 10-15 `blocking_rules_for_prediction`
- **Start with a simple model**. It's often tempting to start by designing a complex model, with many granular comparison levels in an attempt to reflect the real world closely.

Instead, we recommend starting with a simple, rough-and-ready model where most comparisons have 2-3 levels (exact match, possibly a fuzzy level, and everything else). The idea is to get to the point of looking at prediction results as quickly as possible using, for example, the comparison viewer. You can then look for where your simple model is getting things wrong, and use that as the basis for improving the model, iterating until you're seeing good results.
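
For example, a rough-and-ready set of comparisons might use functions like `ExactMatch` and `JaroWinklerAtThresholds` from `splink.comparison_library` - the column names and thresholds below are purely illustrative:

```python
import splink.comparison_library as cl

# Each comparison has at most three levels:
# exact match, one fuzzy level, and 'everything else'.
simple_comparisons = [
    cl.JaroWinklerAtThresholds("first_name", [0.9]),
    cl.JaroWinklerAtThresholds("surname", [0.9]),
    cl.ExactMatch("dob"),
    cl.ExactMatch("postcode"),
]
```

These can then be passed to `SettingsCreator(comparisons=...)`, and individual comparisons swapped for more granular ones once the comparison viewer shows where the simple versions fall short.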

## Blocking rules for prediction

- **Analyse the number of comparisons before running predict**. Use the tools in `splink.blocking_analysis` to check that your rules won't generate a vast number of comparisons before asking Splink to compute them (see the sketch at the end of this section).

- **Many strict `blocking_rules_for_prediction` are generally better than a few loose rules.** Whilst, individually, strict blocking rules are likely to exclude many true matches, between them it should be implausible that a truly matching record 'falls through' all of the rules. Many strict rules often generate far fewer comparisons overall than a small number of loose rules. In practice, many of our real-life models have around 10-15 `blocking_rules_for_prediction`.
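
A sketch covering both points above, assuming a DuckDB backend, illustrative column names, and the `count_comparisons_from_blocking_rule` helper from `splink.blocking_analysis` (check the helper names available in your Splink version):

```python
import pandas as pd
from splink import DuckDBAPI, block_on
from splink.blocking_analysis import count_comparisons_from_blocking_rule

df = pd.read_parquet("people.parquet")  # hypothetical input
db_api = DuckDBAPI()

# Several strict rules: each one requires agreement on two or more fields,
# but a true match would have to disagree on all of them to be missed.
blocking_rules = [
    block_on("first_name", "surname", "dob"),
    block_on("surname", "postcode"),
    block_on("first_name", "dob"),
    block_on("email"),
]

# Check how many comparisons each rule generates *before* running predict().
for rule in blocking_rules:
    result = count_comparisons_from_blocking_rule(
        table_or_tables=df,
        blocking_rule=rule,
        link_type="dedupe_only",
        db_api=db_api,
    )
    print(rule, result)
```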


## EM training

- *Predictions usually aren't very sensitive to `m` probabilities being a bit wrong*. The hardest model parameters to estimate are the `m` probabilities. It's fairly common for Expectation Maximisation to yield 'bad' (implausble) values. Luckily, the accuracy of your model is usually not particularly sensitive to the `m` probabilities - the `u` probabilities drive the match weights. If you're having problems, consider fixing some `m` probabilities by expert judgement - see [here](https://github.com/moj-analytical-services/splink/pull/2379) for how.
- **Predictions usually aren't very sensitive to `m` probabilities being a bit wrong**. The hardest model parameters to estimate are the `m` probabilities. It's fairly common for Expectation Maximisation to yield 'bad' (implausible) values. Luckily, the accuracy of your model is usually not particularly sensitive to the `m` probabilities - it's usually the `u` probabilities that drive the biggest match weights. If you're having problems, consider fixing some `m` probabilities by expert judgement - see [here](https://github.com/moj-analytical-services/splink/pull/2379) for how.

- *Convergece problems are often indicative of the need for further data cleaning*. Whilst predictions often aren't terribly sensitive to `m` probabilities, question why the estimation procedue is producing bad parameter estimates. To do this, it's often enough to look at a variety of predictions to see if you can spot edge cases where the model is not doing what's expected. For instance, we may find matches where the first name is `Mr.`. By fixing this and reestimating, the parameter estimates make more sense.
- **Convergence problems are often indicative of the need for further data cleaning**. Whilst predictions often aren't terribly sensitive to `m` probabilities, it's worth asking why the estimation procedure is producing bad parameter estimates. To do this, it's often enough to look at a variety of predictions to see if you can spot edge cases where the model is not doing what's expected. For instance, we may find matches where the first name is `Mr`. By fixing this and re-estimating, the parameter estimates often make more sense.

- **Blocking rules for EM training do not need high recall**. The purpose of blocking rules for EM training is to find a subset of records which include a reasonably balanced mix of matches and non matches. There is no requirement that these records neet to contain all even most of the matches. For more see [here](https://moj-analytical-services.github.io/splink/topic_guides/blocking/model_training.html) To double check that parameter estimates are a result of a biased sample of matches, you can use `linker.visualisations.parameter_estimate_comparisons_chart`.
- **Blocking rules for EM training do not need high recall**. The purpose of blocking rules for EM training is to find a subset of records which includes a reasonably balanced mix of matches and non-matches. There is no requirement that these records contain all, or even most, of the matches. For more, see [here](https://moj-analytical-services.github.io/splink/topic_guides/blocking/model_training.html). To check whether parameter estimates are being skewed by a biased sample of matches, you can use `linker.visualisations.parameter_estimate_comparisons_chart`.
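
For instance, continuing from a `linker` built as in the earlier sampling sketch, two EM sessions with different training blocking rules can be compared using the chart mentioned above (the rules and columns are illustrative):

```python
from splink import block_on

# Each session blocks on a different field, so that every comparison's
# m probabilities are estimated in at least one session.
linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)

# Large disagreements between the sessions' estimates can indicate a biased
# training sample, or the need for further data cleaning.
linker.visualisations.parameter_estimate_comparisons_chart()
```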

## Working with large datasets

@@ -33,3 +44,6 @@ To optimise memory usage and performance:

- **Avoid pandas for data cleaning**. You will generally get substantially better performance by performing data cleaning in SQL using your chosen backend rather than using pandas (a minimal SQL sketch follows this list).

- **Turn off intermediate columns when calling `predict()`**. Whilst it is useful during the model development phase to set `retain_intermediate_calculation_columns=True` and
`retain_intermediate_calculation_columns_for_prediction=True` in your settings, you should generally turn these off for the final `predict()` run. This results in a much smaller table as your result set. If you want waterfall charts for individual pairs, you can use [`linker.inference.compare_two_records`](../../api_docs/inference.md).
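
A sketch of the first of those two settings, assuming it is passed to `SettingsCreator` under the name used above (check the exact option names for your Splink version; the comparisons and blocking rules are illustrative):

```python
import splink.comparison_library as cl
from splink import SettingsCreator, block_on

shared = dict(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
    ],
    blocking_rules_to_generate_predictions=[block_on("surname", "dob")],
)

# Model development: keep intermediate columns so waterfall charts and the
# comparison viewer have everything they need.
dev_settings = SettingsCreator(**shared, retain_intermediate_calculation_columns=True)

# Final full-dataset run: drop them for a much smaller predictions table.
prod_settings = SettingsCreator(**shared, retain_intermediate_calculation_columns=False)
```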
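
And to illustrate the 'avoid pandas' recommendation above, a minimal DuckDB SQL cleaning step (the file and column names are hypothetical):

```python
import duckdb

# Standardise names, dates and postcodes in SQL; DuckDB parallelises this and
# avoids materialising intermediate pandas DataFrames in memory.
df_clean = duckdb.sql("""
    SELECT
        unique_id,
        lower(trim(first_name))                   AS first_name,
        lower(trim(surname))                      AS surname,
        try_cast(dob AS DATE)                     AS dob,
        regexp_replace(postcode, '\\s', '', 'g')  AS postcode
    FROM read_parquet('raw_people.parquet')
""").df()
```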

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -110,6 +110,7 @@ nav:
- 5. Predicting results: "demos/tutorials/05_Predicting_results.ipynb"
- 6. Visualising predictions: "demos/tutorials/06_Visualising_predictions.ipynb"
- 7. Evaluation: "demos/tutorials/07_Evaluation.ipynb"
- 8. Tips for building your own model: "demos/tutorials/08_building_your_own_model.md"
- Examples:
- Introduction: "demos/examples/examples_index.md"
- DuckDB:
