Add postcode data ingestion job #5966

ClaudiaGC1339 · 2025-02-18T16:13:07Z

Description of change

Dataflow currently exports postcode and region data from DataWorkspace to the Datahub S3 bucket, though less frequently. This PR schedules tasks to ingest the postcode data nightly.

Checklist

Has this branch been rebased on top of the current main branch?

Explanation: The branch should not be stale or have conflicts at the time reviews are requested.

Is the CircleCI build passing?

General points

Other things to check

Make sure fixtures/test_data.yaml is maintained when updating models
Consider the admin site when making changes to models
Use select-/prefetch-related field lists in views and search apps, and update them when fields are added
Make sure the README is updated e.g. when adding new environment variables

See docs/CONTRIBUTING.md for more guidelines.

codecov · 2025-02-18T16:30:07Z

Codecov Report

Attention: Patch coverage is 88.09524% with 10 lines in your changes missing coverage. Please review.

Project coverage is 96.62%. Comparing base (bebfa68) to head (7511281).

Files with missing lines	Patch %	Lines
datahub/metadata/tasks.py	78.57%	6 Missing and 3 partials ⚠️
datahub/metadata/test/factories.py	93.75%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5966      +/-   ##
==========================================
- Coverage   96.65%   96.62%   -0.03%     
==========================================
  Files        1081     1084       +3     
  Lines       25394    25474      +80     
  Branches     1676     1681       +5     
==========================================
+ Hits        24544    24614      +70     
- Misses        694      700       +6     
- Partials      156      160       +4


oliverjwroberts

@ClaudiaGC1339 this is great work so far! Ingesting postcode data (or any other reference data, for that matter) in this manner is a very sensible approach compared with migrations that only ingest a snapshot.

cron-scheduler.py

datahub/metadata/constants.py

oliverjwroberts · 2025-02-19T18:50:21Z

datahub/metadata/migrations/0090_postcodedata.py

+            fields=[
+                ('disabled_on', models.DateTimeField(blank=True, null=True)),
+                ('id', models.UUIDField(default=uuid.uuid4, primary_key=True, serialize=False)),
+                ('name', models.TextField(blank=True)),


It feels like having both the name and postcode field is somewhat unnecessary, and we should choose one or the other.

However, I'd be keen to get others' opinions on this.

baarkerlounger · Feb 25, 2025

What is a name in the context of a postcode? Do we have an example of what this field would contain?

datahub/metadata/models.py

oliverjwroberts · 2025-02-19T19:02:44Z

datahub/metadata/metadata.py

@@ -104,3 +105,8 @@
 )
 registry.register(metadata_id='fdi-value', model=models.FDIValue)
 registry.register(metadata_id='export-barrier', model=models.ExportBarrierType)
+registry.register(


It might be worth adding a test to check that the serializer returns postcode instances in the intended format. For example, a GET request to the metadata endpoint should return a list of postcode instances.

oliverjwroberts · 2025-02-19T19:15:45Z

datahub/metadata/tasks.py

+        if serializer.is_valid():
+            primary_key = serializer.validated_data.pop('id')
+            queryset = PostcodeData.objects.filter(pk=primary_key)
+            instance, created = queryset.update_or_create(


Because we aren't updating postcode records in the first instance, this update_or_create call is probably redundant. Instead you could do something like:

    def _process_record(self, record: dict) -> None:
        serializer = self.serializer_class(data=record)
        if serializer.is_valid():
            # Drop the incoming id: setting an id from the incoming data may
            # raise an error when we've told Django to auto-generate a UUID.
            serializer.validated_data.pop('id', None)
            instance = PostcodeData.objects.create(**serializer.validated_data)
            self.created_ids.append(str(instance.id))
        else:
            self.errors.append({
                'record': record,
                'errors': serializer.errors,
            })

This may also fix some of the test coverage.

baarkerlounger · Feb 25, 2025

Why aren't we updating postcode records? Given that the point of this exercise is to improve accuracy, it seems like we probably should, unless there's a reason not to. I don't think the data size is prohibitive.
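To make the create-only versus update-or-create trade-off concrete, here is a minimal, framework-free sketch of update-or-create semantics. It assumes an in-memory dictionary in place of the PostcodeData table; `PostcodeStore` and `upsert` are hypothetical names for illustration, not code from this PR.

```python
import uuid


class PostcodeStore:
    """In-memory stand-in for the PostcodeData table (hypothetical)."""

    def __init__(self):
        self.rows = {}          # primary key -> dict of field values
        self.created_ids = []
        self.updated_ids = []

    def upsert(self, record: dict) -> None:
        """Create the row if its id is new, otherwise overwrite its fields."""
        fields = dict(record)   # copy so the caller's dict is not mutated
        # Fall back to a fresh UUID when the record carries no id.
        primary_key = str(fields.pop('id', None) or uuid.uuid4())
        if primary_key in self.rows:
            self.rows[primary_key].update(fields)
            self.updated_ids.append(primary_key)
        else:
            self.rows[primary_key] = fields
            self.created_ids.append(primary_key)
```

With this shape, re-ingesting a record whose region has changed upstream lands in `updated_ids` rather than being skipped, which is the behaviour the question above is asking for.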

oliverjwroberts · 2025-02-19T19:16:10Z

datahub/metadata/tasks.py

+            self.existing_ids = set(PostcodeData.objects.values_list(
+                'id', flat=True))
+
+        postcode_data_id = record.get('id')


Will the incoming records have an ID? Maybe we want to ignore these if we are setting our own UUID?
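If the incoming id is ignored, skipping already-ingested records needs some other key. A rough sketch, assuming the postcode string itself can serve as that key; `normalise_postcode` and `prepare_record` are hypothetical helpers, not part of this PR:

```python
import uuid


def normalise_postcode(postcode: str) -> str:
    """Upper-case and strip all whitespace so 'sw1a 1aa' matches 'SW1A1AA'."""
    return ''.join(postcode.split()).upper()


def prepare_record(record: dict, existing_postcodes: set):
    """Return a new row with a locally generated UUID, or None if already stored."""
    key = normalise_postcode(record['postcode'])
    if key in existing_postcodes:
        return None
    existing_postcodes.add(key)
    # Discard the upstream id and assign our own.
    row = {name: value for name, value in record.items() if name != 'id'}
    row['id'] = uuid.uuid4()
    return row
```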

oliverjwroberts · 2025-02-19T20:01:15Z

datahub/metadata/test/factories.py

+
+    postcode = factory.Faker('postcode')
+    modified_on = '2025-10-08T08:06:53+00:00'
+    postcode_region = factory.Faker('postcode_region')


The value passed into the factory.Faker() call is the name of the faker provider. I don't think there is one called postcode_region. It might be more appropriate to select a random UK region here instead?
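A small sketch of the "random UK region" idea, written with the standard library only. The `UK_REGIONS` list and `build_postcode_record` are assumptions for illustration; in the factory itself, something like `factory.Faker('random_element', elements=UK_REGIONS)` would presumably play the same role.

```python
import random

# Assumed list of UK regions (the nine English regions plus the three nations).
UK_REGIONS = [
    'North East', 'North West', 'Yorkshire and The Humber',
    'East Midlands', 'West Midlands', 'East of England',
    'London', 'South East', 'South West',
    'Wales', 'Scotland', 'Northern Ireland',
]


def build_postcode_record(rng: random.Random) -> dict:
    """Build a test record whose region is drawn from a known list."""
    return {
        'postcode': 'SW1A 1AA',  # fixed placeholder value
        'postcode_region': rng.choice(UK_REGIONS),
    }
```

Passing in a seeded `random.Random` keeps the generated test data reproducible between runs.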

oliverjwroberts · 2025-02-19T20:04:32Z

datahub/metadata/test/test_ingest_postcode_data.py

+        ingestion_task._process_record(record)
+
+        assert len(ingestion_task.created_ids) == 1
+        assert len(ingestion_task.updated_ids) == 0


The updated ids may not be applicable here, as per the above comment, if we aren't updating postcode instances.

oliverjwroberts · 2025-02-19T20:05:48Z

datahub/metadata/test/test_ingest_postcode_data.py

+
+        assert ingestion_task._should_process_record(record) is False
+
+    def test_process_record_creates_postcode_data_instance(self, ingestion_task):


To help with code coverage, it may also be worth adding a similar test calling the ingestion function directly and asserting the log messages appear, and an instance is created. See test_ingestion_task_success in test_ingest_eyb_marketing for an example.
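The shape of such a test can be sketched without Django or pytest. Everything here is hypothetical: `ingest_postcodes`, the log messages, and the list used as a store are stand-ins, and with pytest the `caplog` fixture would replace the hand-rolled handler.

```python
import logging

logger = logging.getLogger('postcode_ingestion_sketch')


def ingest_postcodes(records: list, store: list) -> None:
    """Hypothetical ingestion entry point that logs its progress."""
    logger.info('Postcode ingestion started')
    for record in records:
        store.append(record)
    logger.info('Postcode ingestion finished: %d created', len(records))


class ListHandler(logging.Handler):
    """Collects formatted log messages so a test can assert on them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def run_with_captured_logs(records: list):
    """Run the ingestion and return (stored records, log messages)."""
    handler = ListHandler()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    store = []
    try:
        ingest_postcodes(records, store)
    finally:
        logger.removeHandler(handler)
    return store, handler.messages
```

A test then calls `run_with_captured_logs` with one record and asserts both that one instance was stored and that the start and finish messages were logged.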

ClaudiaGC1339 force-pushed the CPS-618-add-postcode-ingestion-job branch from 79c607f to 9b4beb6 Compare February 19, 2025 15:03

oliverjwroberts reviewed Feb 19, 2025

ClaudiaGC1339 added 13 commits February 20, 2025 14:28

WIP: Add postcode data ingestion job

e2b0053

Create postcode data model

fa338dd

Add processing of records

e427286

Refactor postcode model

19ad4a8

Update postcode data ingestion task

ea24185

Add postcode data serializer

bb6be51

WIP: Add postcode data factory and ingestion tests

bbc3f8a

Move postcode ingestion tests to separate file

33bfecc

Add test for the processing of a postcode data instance

341a596

Update code

96619a4

Add postcode data model migration

3801798

Removed redundant code

a93145a

Add tests and update code

2e2f933

baarkerlounger force-pushed the CPS-618-add-postcode-ingestion-job branch from 9b4beb6 to c4d9f03 Compare February 20, 2025 14:31

baarkerlounger added 2 commits February 20, 2025 14:44

Sync postcodes weekly

3b95edf

Fix typo

7511281

baarkerlounger force-pushed the CPS-618-add-postcode-ingestion-job branch from 3705618 to 7511281 Compare February 20, 2025 14:44

baarkerlounger added 2 commits February 25, 2025 11:46

Single constant definition

4f4438e

Rename postcode region

5f4b00c
