
TIMX-432 Rework dataset partitions to only year, month, day #15

Merged
merged 4 commits into from
Dec 12, 2024

Conversation

jonavellecuerdo
Contributor

@jonavellecuerdo jonavellecuerdo commented Dec 9, 2024

Purpose and background context

This PR updates the TIMDEXDataset partitioning scheme to use the [year, month, day] of the "run date", and removes the partition_values option from the TIMDEXDataset.write method and the DatasetRecord.to_dict serialization method.

Please read the Working Development Note | Rework Dataset Partitions to use only Year/Month/Day in the engineering plan (shoutout to @ghukill for writing an awesome document that provides helpful context for a rather significant change re: TIMDEXDataset partitions!).

How can a reviewer manually see the effects of these changes?

  1. Review the updated unit tests and verify all are passing.
  2. Run the following commands in an IPython terminal.
from tests.utils import generate_sample_records
from timdex_dataset_api import *

new_dataset = TIMDEXDataset(location="/tmp/my_dataset")
sample_records_alma = generate_sample_records(100) # source="alma", run_date="2024-12-01"
sample_records_libguides = generate_sample_records(150, timdex_record_id_prefix="libguides", source="libguides") # source="libguides", run_date="2024-12-01"

written_files_alma = new_dataset.write(sample_records_alma)
written_files_libguides = new_dataset.write(sample_records_libguides)

# check output
new_dataset.reload()
print(written_files_alma[0].path) # output like: /tmp/my_dataset/year=2024/month=12/day=01/c9461fb3-3f52-4fcb-a7bf-b1024b6a1853-0.parquet
print(written_files_libguides[0].path) # output like: /tmp/my_dataset/year=2024/month=12/day=01/42dfee49-1682-4e67-84c5-60f3a90e0970-0.parquet
print(new_dataset.row_count) # output must equal 250

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES - These changes will require any callers of the write method (i.e., Transmogrifier) to explicitly set these required fields when creating the iterator of DatasetRecord instances.

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* These changes simplify the partitioning schema for the TIMDEXDataset,
allowing the app to take advantage of PyArrow's memory-efficient
processes for reading and writing Parquet datasets. Furthermore, the
new partitioning schema will result in a more efficient, coherent
folder structure when writing datasets. For more details, see:
https://mitlibraries.atlassian.net/wiki/spaces/IN/pages/4094296066/Engineering+Plan+Parquet+Datasets+for+TIMDEX+ETL#Rework-Dataset-Partitions-to-use-only-Year-%2F-Month-%2F-Day.

How this addresses that need:
* Update TIMDEX_DATASET_SCHEMA to include [year, month, day]
* Update DatasetRecord attrs to include [year, month, day] and
  set [source, run_date, run_type, run_id, action] as primary columns
* Add __post_init__ method to DatasetRecord to derive partition values
  from 'run_date'
* Remove 'partition' values from DatasetRecord.to_dict
* Remove the 'partition_values' argument from TIMDEXDataset.write to reduce
  complexity and have the write method use DatasetRecord partition
  columns instead.
* Update unit tests to use new partitions and remove deprecated tests

Side effects of this change:
* The new partitioning schema introduces a 3-level folder structure
within TIMDEXDataset.location (i.e. the base path of the dataset)
for [year, month, day], where the leaf node will contain parquet files
for every source run.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-432
Comment on lines +32 to +34
pa.field("year", pa.string()),
pa.field("month", pa.string()),
pa.field("day", pa.string()),
jonavellecuerdo (Contributor, Author):

The partition columns are set as pa.string() objects to support zero-padded months and days.
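A small standalone illustration (not the library's code) of why string-typed partition values matter here: integer values would drop the zero-padding in hive-style partition paths, while strings keep directories lexically sortable.

```python
from datetime import datetime

run_date = datetime(2024, 2, 1)

# integer partition values lose zero-padding in hive-style paths
int_path = f"year={run_date.year}/month={run_date.month}/day={run_date.day}"

# string partition values keep it, so directories sort lexically
str_path = run_date.strftime("year=%Y/month=%m/day=%d")

print(int_path)  # year=2024/month=2/day=1
print(str_path)  # year=2024/month=02/day=01
```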

@jonavellecuerdo jonavellecuerdo self-assigned this Dec 9, 2024
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review December 9, 2024 21:01
@ghukill (Collaborator) left a comment

This partitioning rework is getting at the core of not only this library, but also the architecture of the dataset! Thanks for taking on this work.

Submitting a "Request changes" now specifically related to the existing_data_behavior setting during writing, which I think needs updating given these changes.

But also left some comments around the DatasetRecord dataclass, specifically how we handle these somewhat special year, month, and day fields.

Despite these requests and comments, I think it's looking real good so far.

timdex_dataset_api/dataset.py (review thread resolved)
tests/utils.py (review thread resolved, outdated)
timdex_dataset_api/record.py (review thread resolved, outdated)
month: str | None = None
day: str | None = None

def __post_init__(self) -> None:
ghukill (Collaborator):

See comment above about potentially making year, month, and day dynamic properties, which would have bearing on this method.

jonavellecuerdo (Contributor, Author):

All of our shared learnings from our many discussions led to the updates in this commit: 5e532d3. Please take a look at the commit message, which pulls from our discussion and provides more context for the changes introduced!


ghukill commented Dec 10, 2024

@jonavellecuerdo - one last request. Can you bump the version number here? At some point we may want to explore setting the version number via the GitHub release version when installing, but until then, bumping the version number here helps other applications that install locally (e.g. Transmog) pick up the update.

I'd propose maybe a v0.3.0? Normally this would definitely be a major bump, but until a) we hit v1.0 and b) we're using this in other applications, I don't think it matters much.


coveralls commented Dec 11, 2024

Pull Request Test Coverage Report for Build 12301839367

Details

  • 24 of 24 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 98.592%

Totals:
  • Change from base Build 12237422707: 0.0%
  • Covered Lines: 140
  • Relevant Lines: 142

💛 - Coveralls

Comment on lines +8 to +9
def strict_date_parse(date_string: str) -> datetime:
return datetime.strptime(date_string, "%Y-%m-%d").astimezone(UTC)
jonavellecuerdo (Contributor, Author):

Initially thought we could use dateutil.parser as a converter, but it seems that the parser will make some assumptions for almost correctly formatted date strings. For example:

dateutil.parser.parse("-02-01") -> datetime.datetime(2024, 2, 1)

It will interpret the provided value as a relative date and fill in the missing gaps with default values (i.e., the current year).

ghukill (Collaborator):

Nice catch. That could have been quietly problematic if not noticed!

source_record: bytes = field()
transformed_record: bytes = field()
source: str = field()
run_date: datetime = field(converter=strict_date_parse)
jonavellecuerdo (Contributor, Author):

It's worth noting that the __init__ method assumes users are passing (a) strings that are (b) strictly formatted as "YYYY-MM-DD" (Python date format code "%Y-%m-%d"). Initially, we were allowing users to provide either a string or a datetime.datetime object, but that introduces some complexity. Since we are primarily using the run_date field to derive year, month, day values, I think what we have now is sufficient.

ghukill (Collaborator) commented Dec 11, 2024:

At first I was a bit disappointed that we can no longer instantiate (or set) run_date with a datetime object, given that's effectively our target format. But, if development of this library has demonstrated anything, it's that it may yet require change as we progress. Keeping it simple and strict now feels like a great path; it's easier to extend functionality later than to rein in expansive options/features that we don't even have concrete use cases for.

@ghukill (Collaborator) left a comment

I think it looks great.

Given the greenfield nature of this library, and how others will rely on it, I think a lot of these lines removed are a good thing. I'm feeling good about this direction of keeping this library simple and strict, and extending when we need more.

It's not really documented anywhere, but this library effectively is the opinionation of the TIMDEX parquet dataset, so it makes sense that architectural considerations would ripple through this.

timdex_dataset_api/record.py (review thread resolved)
@ehanson8 left a comment

Looks great!

tests/test_dataset_write.py (review thread resolved, outdated)
…arquet files

Why these changes are being introduced:
* Since the TIMDEXDataset partitions are now the [year, month, day]
of the 'run_date', parquet files from different source runs
will be written to the same partition. The previous configuration
of existing_data_behavior="delete_matching" would result in
the deletion of any existing parquet files from the partition directory
with every source run, which is not the desired outcome.
To support the new partitions, this updates the configuration
existing_data_behavior="overwrite_or_ignore" which will
ignore any existing data and will only overwrite files with the
same filename.

How this addresses that need:
* Set existing_data_behavior="overwrite_or_ignore" in ds.write_dataset method call
* Add unit tests to demonstrate updated existing_data_behavior

Side effects of this change:
* In the event that multiple runs are performed for the same 'source' and
'run_date', which is unlikely to occur, parquet files from both runs will
exist in the partitioned directory. DatasetRecords can still be uniquely
identified via the 'run_id' column.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-432
Why these changes are being introduced:
* Reworking the dataset partitions to use the [year, month, day]
of the 'run_date' means that parquet files for different 'source' runs
on the same 'run_date' get written to the same partition directory.
Therefore, it is crucial that the timdex_dataset_api.write method
retrieves the correct partition columns from the (batches) of DatasetRecord
objects. The DatasetRecord class has been refactored to adhere
to the following criteria:

1. When writing to the dataset, and therefore serializing DatasetRecord objects,
   year, month, day should be derived from the run_date and should not be modifiable
2. If possible, avoid parsing a datetime string 3 times for each partition column

How this addresses that need:
* Refactor DatasetRecord to use attrs
* Define custom strict_date_parse converter method for 'run_date' field
* Simplify serialization method to rely on converter for 'run_date'
  error handling
* Remove DatasetRecord.validate
* Include attrs as a dependency

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-432
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-432-rework-dataset-partitions branch from 8bf085a to 8a30ca3 Compare December 12, 2024 17:41
@jonavellecuerdo jonavellecuerdo merged commit e1c0c6a into main Dec 12, 2024
2 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-432-rework-dataset-partitions branch December 12, 2024 17:45
@jonavellecuerdo jonavellecuerdo changed the title Rework dataset partitions to only year, month, day TIMX-432 Rework dataset partitions to only year, month, day Dec 13, 2024
4 participants