Add postcode data ingestion job #5966
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5966      +/-   ##
==========================================
- Coverage   96.65%   96.62%   -0.03%
==========================================
  Files        1081     1084       +3
  Lines       25394    25474      +80
  Branches     1676     1681       +5
==========================================
+ Hits        24544    24614      +70
- Misses        694      700       +6
- Partials      156      160       +4
☔ View full report in Codecov by Sentry.
Force-pushed from 79c607f to 9b4beb6
@ClaudiaGC1339 this is great work so far! Ingesting postcode data (or any other reference data, for that matter) in this manner is a very sensible approach compared to migrations that only ingest a one-off snapshot.
fields=[
    ('disabled_on', models.DateTimeField(blank=True, null=True)),
    ('id', models.UUIDField(default=uuid.uuid4, primary_key=True, serialize=False)),
    ('name', models.TextField(blank=True)),
It feels like having both the name and postcode fields is somewhat unnecessary, and we should choose one or the other. However, I'd be keen to get others' opinions on this.
What is a name in the context of a postcode? Do we have an example of what this field would contain?
@@ -104,3 +105,8 @@
)
registry.register(metadata_id='fdi-value', model=models.FDIValue)
registry.register(metadata_id='export-barrier', model=models.ExportBarrierType)
registry.register( |
It might be worth adding a test to check the serializer returns a Postcode instance in the intended format. For example, a test that sends a GET request to the metadata endpoint and asserts it returns a list of postcode instances.
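A rough sketch of what that test could look like (the URL name, the factory and its import path are assumptions for illustration, not the PR's actual code):

import pytest
from rest_framework import status
from rest_framework.reverse import reverse

# Hypothetical factory and import path, purely for illustration.
from datahub.metadata.test.factories import PostcodeDataFactory

pytestmark = pytest.mark.django_db


def test_postcode_metadata_endpoint_returns_list(api_client):
    """The metadata endpoint returns the registered postcode instances."""
    PostcodeDataFactory.create_batch(2)

    # 'postcode' assumes the metadata_id passed to registry.register().
    url = reverse('api-v4:metadata:postcode')
    response = api_client.get(url)

    assert response.status_code == status.HTTP_200_OK
    assert len(response.json()) == 2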
if serializer.is_valid():
    primary_key = serializer.validated_data.pop('id')
    queryset = PostcodeData.objects.filter(pk=primary_key)
    instance, created = queryset.update_or_create(
Because we aren't updating postcode records in the first instance, this update_or_create call is probably redundant. Instead you could do something like:
def _process_record(self, record: dict) -> None:
    serializer = self.serializer_class(data=record)
    if serializer.is_valid():
        # Drop the incoming id: setting an id from the incoming data may raise
        # an error when we've told Django to auto-generate a UUID.
        serializer.validated_data.pop('id')
        instance = PostcodeData.objects.create(**serializer.validated_data)
        self.created_ids.append(str(instance.id))
    else:
        self.errors.append({
            'record': record,
            'errors': serializer.errors,
        })
This may also fix some of the test coverage.
Why aren't we updating postcode records? Given the point of this exercise is to improve accuracy, it seems like we probably should, unless there's a reason not to? I don't think the data size is prohibitive.
self.existing_ids = set(PostcodeData.objects.values_list(
    'id', flat=True))

postcode_data_id = record.get('id')
Will the incoming records have an ID? Maybe we want to ignore these if we are setting our own UUID?
postcode = factory.Faker('postcode')
modified_on = '2025-10-08T08:06:53+00:00'
postcode_region = factory.Faker('postcode_region')
The value passed into the factory.Faker() call is the name of the faker provider, and I don't think there is one called postcode_region. It might be more appropriate to select a random UK region here instead?
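As a rough illustration (the model path and region names below are placeholders, not the PR's actual values):

import factory
import factory.fuzzy

# Placeholder list of UK region names, purely for illustration.
UK_REGIONS = [
    'North East',
    'North West',
    'Yorkshire and The Humber',
    'East Midlands',
    'West Midlands',
    'East of England',
    'London',
    'South East',
    'South West',
]


class PostcodeDataFactory(factory.django.DjangoModelFactory):
    postcode = factory.Faker('postcode', locale='en_GB')
    postcode_region = factory.fuzzy.FuzzyChoice(UK_REGIONS)

    class Meta:
        model = 'metadata.PostcodeData'  # placeholder app label / model name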
ingestion_task._process_record(record)

assert len(ingestion_task.created_ids) == 1
assert len(ingestion_task.updated_ids) == 0
The updated ids may not be applicable here, as per the above comment, if we aren't updating postcode instances.
assert ingestion_task._should_process_record(record) is False

def test_process_record_creates_postcode_data_instance(self, ingestion_task):
To help with code coverage, it may also be worth adding a similar test that calls the ingestion function directly and asserts that the log messages appear and an instance is created. See test_ingestion_task_success in test_ingest_eyb_marketing for an example.
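Something along these lines, assuming hypothetical names for the task function, the file-path fixture and the log message (mirror whatever test_ingest_eyb_marketing actually does):

import logging

import pytest

# Hypothetical import paths/names, purely for illustration.
from datahub.metadata.models import PostcodeData
from datahub.metadata.tasks import postcode_data_ingestion_task


@pytest.mark.django_db
def test_ingestion_task_success(caplog, test_file_path):
    """Calling the task directly creates instances and logs its progress."""
    with caplog.at_level(logging.INFO):
        postcode_data_ingestion_task(test_file_path)

    # The exact log message is a placeholder for whatever the task emits.
    assert 'Postcode data ingestion finished' in caplog.text
    assert PostcodeData.objects.exists()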
Force-pushed from 9b4beb6 to c4d9f03
Force-pushed from 3705618 to 7511281
Description of change
Dataflow currently exports postcode and region data from DataWorkspace to the Datahub S3 bucket at a lower frequency; this PR schedules tasks to ingest the postcode data nightly.
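For reference, a nightly schedule entry could look something like the sketch below. This assumes a Celery beat-style scheduler and a hypothetical task path; substitute the project's actual scheduling mechanism and task name.

from celery.schedules import crontab

# Hypothetical task path; runs every night at 01:00 UTC.
CELERY_BEAT_SCHEDULE = {
    'ingest-postcode-data-nightly': {
        'task': 'datahub.metadata.tasks.postcode_data_ingestion_task',
        'schedule': crontab(minute=0, hour=1),
    },
}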
Checklist
Has this branch been rebased on top of the current main branch?
Explanation
The branch should not be stale or have conflicts at the time reviews are requested.
Is the CircleCI build passing?
General points
Other things to check
fixtures/test_data.yaml is maintained when updating models
See docs/CONTRIBUTING.md for more guidelines.