
TIMX-432 Rework dataset partitions to only year, month, day #15

Merged
merged 4 commits into from
Dec 12, 2024

Conversation

jonavellecuerdo
Contributor

@jonavellecuerdo jonavellecuerdo commented Dec 9, 2024

Purpose and background context

This PR updates the TIMDEXDataset partitioning scheme to use the [year, month, day] of the "run date", and removes the partition_values option from the TIMDEXDataset.write method and the DatasetRecord.to_dict serialization method.

Please read the Working Development Note | Rework Dataset Partitions to use only Year/Month/Day in the engineering plan (shoutout to @ghukill for writing an awesome document that provides helpful context for a rather significant change re: TIMDEXDataset partitions!).

How can a reviewer manually see the effects of these changes?

  1. Review the updated unit tests and verify all are passing.
  2. Run the following commands in an IPython terminal.
from tests.utils import generate_sample_records
from timdex_dataset_api import *

new_dataset = TIMDEXDataset(location="/tmp/my_dataset")
sample_records_alma = generate_sample_records(100) # source="alma", run_date="2024-12-01"
sample_records_libguides = generate_sample_records(150, timdex_record_id_prefix="libguides", source="libguides") # source="libguides", run_date="2024-12-01"

written_files_alma = new_dataset.write(sample_records_alma)
written_files_libguides = new_dataset.write(sample_records_libguides)

# check output
new_dataset.reload()
print(written_files_alma[0].path) # output like: /tmp/my_dataset/year=2024/month=12/day=01/c9461fb3-3f52-4fcb-a7bf-b1024b6a1853-0.parquet
print(written_files_libguides[0].path) # output like: /tmp/my_dataset/year=2024/month=12/day=01/42dfee49-1682-4e67-84c5-60f3a90e0970-0.parquet
print(new_dataset.row_count) # output must equal 250

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES - These changes will require any callers of the write method (i.e., Transmogrifier) to explicitly set these required fields when creating the iterator of DatasetRecord instances.

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* These changes simplify the partitioning schema for the TIMDEXDataset,
allowing the app to take advantage of PyArrow's memory-efficient
processes for reading and writing Parquet datasets. Furthermore, the
new partitioning schema will result in a more efficient, coherent
folder structure when writing datasets. For more details, see:
https://mitlibraries.atlassian.net/wiki/spaces/IN/pages/4094296066/Engineering+Plan+Parquet+Datasets+for+TIMDEX+ETL#Rework-Dataset-Partitions-to-use-only-Year-%2F-Month-%2F-Day.

How this addresses that need:
* Update TIMDEX_DATASET_SCHEMA to include [year, month, day]
* Update DatasetRecord attrs to include [year, month, day] and
  set [source, run_date, run_type, run_id, action] as primary columns
* Add __post_init__ method to DatasetRecord to derive partition values
  from 'run_date'
* Remove 'partition' values from DatasetRecord.to_dict
* Remove the 'partition_values' argument from TIMDEXDataset.write to reduce
  complexity and have the write method use DatasetRecord partition
  columns instead.
* Update unit tests to use new partitions and remove deprecated tests

Side effects of this change:
* The new partitioning schema introduces a 3-level folder structure
within TIMDEXDataset.location (i.e. the base path of the dataset)
for [year, month, day], where the leaf node will contain parquet files
for every source run.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-432
Comment on lines +32 to +34
pa.field("year", pa.string()),
pa.field("month", pa.string()),
pa.field("day", pa.string()),
jonavellecuerdo (Contributor, Author):

The partition columns are set as pa.string() objects to support zero-padded months and days.
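A small standalone illustration (not the library's code) of why string-typed partition values matter here: integer values would drop the zero-padding in hive-style partition paths, while strings keep directories lexically sortable.

```python
from datetime import datetime

run_date = datetime(2024, 2, 1)

# integer partition values lose zero-padding in hive-style paths
int_path = f"year={run_date.year}/month={run_date.month}/day={run_date.day}"

# string partition values keep it, so directories sort lexically
str_path = run_date.strftime("year=%Y/month=%m/day=%d")

print(int_path)  # year=2024/month=2/day=1
print(str_path)  # year=2024/month=02/day=01
```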

@jonavellecuerdo jonavellecuerdo self-assigned this Dec 9, 2024
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review December 9, 2024 21:01
@ghukill (Collaborator) left a comment

This partitioning rework is getting at the core of not only this library, but also the architecture of the dataset! Thanks for taking on this work.

Submitting a "Request changes" now specifically related to the existing_data_behavior setting during writing, which I think needs updating given these changes.

But also left some comments around the DatasetRecord dataclass, specifically how we handle these somewhat special year, month, and day fields.

Despite these requests and comments, I think it's looking real good so far.

timdex_dataset_api/dataset.py (review thread resolved)
tests/utils.py (review thread resolved, outdated)
timdex_dataset_api/record.py (review thread resolved, outdated)
month: str | None = None
day: str | None = None

def __post_init__(self) -> None:
ghukill (Collaborator):

See comment above about potentially making year, month, and day dynamic properties, which would have bearing on this method.

jonavellecuerdo (Contributor, Author):

All of our shared learnings from our many discussions led to the updates in this commit: 5e532d3. Please take a look at the commit message, which pulls from our discussion and provides more context for the changes introduced!


ghukill commented Dec 10, 2024

@jonavellecuerdo - one last request. Can you bump the version number here? At some point we may want to explore setting the version number via the GitHub release version when installing, but until then, bumping the version number here helps other applications that install locally (e.g. Transmog) pick up the update.

I'd propose maybe a v0.3.0? Normally this would definitely be a major bump, but until a) we hit v1.0 and b) we're using this in other applications, I don't think it matters much.


coveralls commented Dec 11, 2024

Pull Request Test Coverage Report for Build 12301839367

Details

  • 24 of 24 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 98.592%

Totals:
  • Change from base Build 12237422707: 0.0%
  • Covered Lines: 140
  • Relevant Lines: 142

💛 - Coveralls

Comment on lines +8 to +9
def strict_date_parse(date_string: str) -> datetime:
return datetime.strptime(date_string, "%Y-%m-%d").astimezone(UTC)
jonavellecuerdo (Contributor, Author):

Initially thought we could use dateutil.parser as a converter, but it seems that the parser will make some assumptions for almost correctly formatted date strings. For example:

dateutil.parser.parse("-02-01") -> datetime.datetime(2024, 2, 1)

It will interpret the provided value as a relative date and fill in the missing gaps with default values (i.e., the current year).

ghukill (Collaborator):

Nice catch. That could have been quietly problematic if not noticed!

source_record: bytes = field()
transformed_record: bytes = field()
source: str = field()
run_date: datetime = field(converter=strict_date_parse)
jonavellecuerdo (Contributor, Author):

It's worth noting that the __init__ method assumes users are passing (a) strings that are (b) strictly formatted as "YYYY-MM-DD" (Python date format code "%Y-%m-%d"). Initially, we were allowing users to provide either a string or a datetime.datetime object, but that introduces some complexity. Since we are primarily using the run_date field to derive year, month, day values, I think what we have now is sufficient.

ghukill (Collaborator) commented Dec 11, 2024:

At first I was a bit disappointed that we can no longer instantiate (or set) run_date with a datetime object, given that's effectively our target format. But, if development of this library has demonstrated anything, it's that it may yet require change as we progress. Keeping it simple and strict now feels like a great path; it's easier to extend functionality later than to rein in expansive options/features that we don't even have concrete use cases for.

@ghukill (Collaborator) left a comment

I think it looks great.

Given the greenfield nature of this library, and how others will rely on it, I think a lot of these lines removed are a good thing. I'm feeling good about this direction of keeping this library simple and strict, and extending when we need more.

It's not really documented anywhere, but this library effectively is the opinionation of the TIMDEX parquet dataset, so it makes sense that architectural considerations would ripple through this.

timdex_dataset_api/record.py (review thread resolved)
@ehanson8 left a comment

Looks great!

tests/test_dataset_write.py (review thread resolved, outdated)
…arquet files

Why these changes are being introduced:
* Since the TIMDEXDataset partitions are now the [year, month, day]
of the 'run_date', parquet files from different source runs
will be written to the same partition. The previous configuration
of existing_data_behavior="delete_matching" would result in
the deletion of any existing parquet files from the partition directory
with every source run, which is not the desired outcome.
To support the new partitions, this updates the configuration
existing_data_behavior="overwrite_or_ignore" which will
ignore any existing data and will only overwrite files with the
same filename.

How this addresses that need:
* Set existing_data_behavior="overwrite_or_ignore" in ds.write_dataset method call
* Add unit tests to demonstrate updated existing_data_behavior

Side effects of this change:
* In the event that multiple runs are performed for the same 'source' and
'run_date', which is unlikely to occur, parquet files from both runs will
exist in the partitioned directory. DatasetRecords can still be uniquely
identified via the 'run_id' column.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-432
Why these changes are being introduced:
* Reworking the dataset partitions to use the [year, month, day]
of the 'run_date' means that parquet files for different 'source' runs
on the same 'run_date' get written to the same partition directory.
Therefore, it is crucial that the timdex_dataset_api.write method
retrieves the correct partition columns from the (batches) of DatasetRecord
objects. The DatasetRecord class has been refactored to adhere
to the following criteria:

1. When writing to the dataset, and therefore serializing DatasetRecord objects,
   year, month, day should be derived from the run_date and should not be modifiable
2. If possible, avoid parsing a datetime string 3 times for each partition column

How this addresses that need:
* Refactor DatasetRecord to use attrs
* Define custom strict_date_parse converter method for 'run_date' field
* Simplify serialization method to rely on converter for 'run_date'
  error handling
* Remove DatasetRecord.validate
* Include attrs as a dependency

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-432
@jonavellecuerdo jonavellecuerdo force-pushed the TIMX-432-rework-dataset-partitions branch from 8bf085a to 8a30ca3 Compare December 12, 2024 17:41
@jonavellecuerdo jonavellecuerdo merged commit e1c0c6a into main Dec 12, 2024
2 checks passed
@jonavellecuerdo jonavellecuerdo deleted the TIMX-432-rework-dataset-partitions branch December 12, 2024 17:45
@jonavellecuerdo jonavellecuerdo changed the title Rework dataset partitions to only year, month, day TIMX-432 Rework dataset partitions to only year, month, day Dec 13, 2024
4 participants