Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add postcode data ingestion job #5966

Draft
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

ClaudiaGC1339
Copy link
Contributor

@ClaudiaGC1339 ClaudiaGC1339 commented Feb 18, 2025

Description of change

Dataflow currently exports postcode and region data from DataWorkspace to the Datahub S3 bucket with less frequency however this PR schedules tasks to ingest postcode data nightly.

Checklist

  • Has this branch been rebased on top of the current main branch?

    Explanation

    The branch should not be stale or have conflicts at the time reviews are requested.

  • Is the CircleCI build passing?

General points

Other things to check

  • Make sure fixtures/test_data.yaml is maintained when updating models
  • Consider the admin site when making changes to models
  • Use select-/prefetch-related field lists in views and search apps, and update them when fields are added
  • Make sure the README is updated e.g. when adding new environment variables

See docs/CONTRIBUTING.md for more guidelines.

Copy link

codecov bot commented Feb 18, 2025

Codecov Report

Attention: Patch coverage is 88.09524% with 10 lines in your changes missing coverage. Please review.

Project coverage is 96.62%. Comparing base (bebfa68) to head (7511281).

Files with missing lines Patch % Lines
datahub/metadata/tasks.py 78.57% 6 Missing and 3 partials ⚠️
datahub/metadata/test/factories.py 93.75% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5966      +/-   ##
==========================================
- Coverage   96.65%   96.62%   -0.03%     
==========================================
  Files        1081     1084       +3     
  Lines       25394    25474      +80     
  Branches     1676     1681       +5     
==========================================
+ Hits        24544    24614      +70     
- Misses        694      700       +6     
- Partials      156      160       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ClaudiaGC1339 ClaudiaGC1339 force-pushed the CPS-618-add-postcode-ingestion-job branch from 79c607f to 9b4beb6 Compare February 19, 2025 15:03
Copy link
Contributor

@oliverjwroberts oliverjwroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ClaudiaGC1339 this is great work so far! Ingesting postcode data (or any other reference data for that matter) in this manner is a very sensible approach. That is, compared to migrations that only ingest a snapshot.

fields=[
('disabled_on', models.DateTimeField(blank=True, null=True)),
('id', models.UUIDField(default=uuid.uuid4, primary_key=True, serialize=False)),
('name', models.TextField(blank=True)),
Copy link
Contributor

@oliverjwroberts oliverjwroberts Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like having both the name and postcode field is somewhat unnecessary, and we should choose one or the other.

However, I'd be keen to get others' opinions on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a name in the context of a postcode? Do we have an example of what this field would contain?

@@ -104,3 +105,8 @@
)
registry.register(metadata_id='fdi-value', model=models.FDIValue)
registry.register(metadata_id='export-barrier', model=models.ExportBarrierType)
registry.register(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding a test to check the serializer returns a Postcode instance in the intended format? For example, if you were to send a GET request to the metadata endpoint, that it returns a list of postcode instances.

if serializer.is_valid():
primary_key = serializer.validated_data.pop('id')
queryset = PostcodeData.objects.filter(pk=primary_key)
instance, created = queryset.update_or_create(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we aren't updating postcode records in the first instance, this update_or_create call is probably redundant. Instead you could do something like:

def _process_record(self, record: dict) -> None):
    serializer = self.serializer_class(data=record)
    if serializer.is_valid():
        serializer.validated_data.pop('id')  # because setting an id from the incoming data may raise an error when we've told Django to auto generate a UUID
        instance = Postcode.objects.create(**serializer.validated_data)
        self.created_ids.append(str(instance.id))
    else:
        self.errors.append({
            'record': record,
            'errors': serializer.errors,
        })

This may also fix some of the test coverage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why aren't we updating postcode records? Given the point of this exercise is to improve accuracy it seems like we probably should unless there's a reason not? I don't think the data size is prohibitive?

self.existing_ids = set(PostcodeData.objects.values_list(
'id', flat=True))

postcode_data_id = record.get('id')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the incoming records have an ID? Maybe we want to ignore these if we are setting our own UUID?


postcode = factory.Faker('postcode')
modified_on = '2025-10-08T08:06:53+00:00'
postcode_region = factory.Faker('postcode_region')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value passed into the factory.Faker() call is the name of the faker provider. I don't think there is one called postcode_region. It might be more appropriate to select a random UK region here instead?

ingestion_task._process_record(record)

assert len(ingestion_task.created_ids) == 1
assert len(ingestion_task.updated_ids) == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated ids may not be applicable here, as per the above comment, if we aren't updating postcode instance.


assert ingestion_task._should_process_record(record) is False

def test_process_record_creates_postcode_data_instance(self, ingestion_task):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To help with code coverage, it may also be worth adding a similar test calling the ingestion function directly and asserting the log messages appear, and an instance is created. See test_ingestion_task_success in test_ingest_eyb_marketing for an example.

@baarkerlounger baarkerlounger force-pushed the CPS-618-add-postcode-ingestion-job branch from 9b4beb6 to c4d9f03 Compare February 20, 2025 14:31
@baarkerlounger baarkerlounger force-pushed the CPS-618-add-postcode-ingestion-job branch from 3705618 to 7511281 Compare February 20, 2025 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants