TIMX-432 Rework dataset partitions to only year, month, day #15
Conversation
Why these changes are being introduced:

* These changes simplify the partitioning schema for the TIMDEXDataset, allowing the app to take advantage of PyArrow's memory-efficient processes for reading and writing Parquet datasets. Furthermore, the new partitioning schema will result in a more efficient, coherent folder structure when writing datasets. For more details, see: https://mitlibraries.atlassian.net/wiki/spaces/IN/pages/4094296066/Engineering+Plan+Parquet+Datasets+for+TIMDEX+ETL#Rework-Dataset-Partitions-to-use-only-Year-%2F-Month-%2F-Day.

How this addresses that need:

* Update TIMDEX_DATASET_SCHEMA to include [year, month, day]
* Update DatasetRecord attrs to include [year, month, day] and set [source, run_date, run_type, run_id, action] as primary columns
* Add post_init method to DatasetRecord to derive partition values from 'run_date'
* Remove 'partition' values from DatasetRecord.to_dict
* Remove 'partition_values' mixin from TIMDEXDataset.write to reduce complexity and have the write method utilize DatasetRecord partition columns instead
* Update unit tests to use new partitions and remove deprecated tests

Side effects of this change:

* The new partitioning schema introduces a 3-level folder structure within TIMDEXDataset.location (i.e. the base path of the dataset) for [year, month, day], where the leaf node will contain parquet files for every source run (illustrated below).

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TIMX-432
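For a sense of the resulting layout, here is a hypothetical dataset after two source runs on the same date (directory and file names are illustrative only; the exact naming depends on the partitioning flavor used):

```
<TIMDEXDataset.location>/
└── 2024/
    └── 12/
        └── 01/
            ├── alma-run.parquet
            └── dspace-run.parquet
```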
pa.field("year", pa.string()), | ||
pa.field("month", pa.string()), | ||
pa.field("day", pa.string()), |
The partition columns are set as `pa.string()` objects to support zero-padded months and days.
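A quick illustration of the zero-padding (values are hypothetical; `strftime` is shown only to demonstrate the string forms):

```python
from datetime import datetime

run_date = datetime(2024, 2, 1)
# As strings, single-digit months and days keep their leading zero,
# so partition directories sort lexicographically in date order.
year, month, day = (run_date.strftime(f) for f in ("%Y", "%m", "%d"))
assert (year, month, day) == ("2024", "02", "01")
```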
This partitioning rework is getting at the core of not only this library, but also the architecture of the dataset! Thanks for taking on this work.

Submitting a "Request changes" now, specifically related to the `existing_data_behavior` setting during writing, which I think needs updating given these changes. But I also left some comments around the `DatasetRecord` dataclass, specifically how we handle these somewhat special `year`, `month`, and `day` fields.

Despite these requests and comments, I think it's looking really good so far.
timdex_dataset_api/record.py
```python
month: str | None = None
day: str | None = None

def __post_init__(self) -> None:
```
See comment above about potentially making `year`, `month`, and `day` dynamic properties, which would have bearing on this method.
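For illustration only, a minimal sketch of the dynamic-properties idea being floated here (this is not necessarily what landed; see the commit referenced below):

```python
from datetime import datetime

class DatasetRecord:
    """Abridged sketch: year/month/day are derived on access, never stored."""

    def __init__(self, run_date: datetime) -> None:
        self.run_date = run_date

    @property
    def year(self) -> str:
        return self.run_date.strftime("%Y")

    @property
    def month(self) -> str:
        return self.run_date.strftime("%m")

    @property
    def day(self) -> str:
        return self.run_date.strftime("%d")
```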
All of our shared learnings from our many discussions led to the updates in this commit: 5e532d3. Please take a look at the commit message, which pulls from our discussion and provides more context for the changes introduced!
@jonavellecuerdo - one last request. Can you bump the version number here? At some point we may want to explore setting the version number via the GitHub release version when installing, but until then, bumping the version number here helps local installs by other applications (e.g. Transmog) to update. I'd propose maybe a …
Pull Request Test Coverage Report for Build 12301839367

💛 - Coveralls
```python
from datetime import UTC, datetime

def strict_date_parse(date_string: str) -> datetime:
    return datetime.strptime(date_string, "%Y-%m-%d").astimezone(UTC)
```
Initially thought we could use `dateutil.parser` as a converter, but it seems that the parser will make some assumptions for almost-correctly formatted date strings. For example:

`dateutil.parser.parse("-02-01") -> datetime.datetime(2024, 2, 1)`

It will interpret the provided value as a relative date and fill in the missing gaps with default values (i.e., the current year).
Nice catch. That could have been quietly problematic if not noticed!
```python
source_record: bytes = field()
transformed_record: bytes = field()
source: str = field()
run_date: datetime = field(converter=strict_date_parse)
```
It's worth noting that the `__init__` method assumes that users are passing (a) strings (b) strictly formatted as "YYYY-MM-DD" (Python date format code `"%Y-%m-%d"`). Initially, we were allowing users to provide either a `string` or a `datetime.datetime` object, but that introduces some complexity... Since we are primarily using the `run_date` field to derive `year`, `month`, `day` values, I think what we have now is sufficient.
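To make the strictness concrete, a hypothetical instantiation (assuming only the fields shown in the excerpt above; the real class carries additional fields):

```python
from datetime import UTC, datetime

# A strictly formatted "YYYY-MM-DD" string is accepted:
record = DatasetRecord(
    source_record=b"<record/>",
    transformed_record=b"{}",
    source="alma",
    run_date="2024-12-01",
)

# A datetime object (or any other non-string) is rejected by the converter:
try:
    DatasetRecord(
        source_record=b"<record/>",
        transformed_record=b"{}",
        source="alma",
        run_date=datetime(2024, 12, 1, tzinfo=UTC),
    )
except TypeError:
    pass  # strptime requires a str, so the converter raises
```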
At first I was a bit disappointed that we can no longer instantiate (or set) `run_date` with a datetime object, given that's effectively our target format. But, if development of this library has demonstrated anything, it's that it may yet require change as we progress. Keeping it simple and strict now feels like a great path; it's easier to extend functionality later than to rein in expansive options/features that we don't even have concrete use cases for.
I think it looks great.
Given the greenfield nature of this library, and how others will rely on it, I think a lot of these lines removed are a good thing. I'm feeling good about this direction of keeping this library simple and strict, and extending when we need more.
It's not really documented anywhere, but this library effectively is the opinionation of the TIMDEX parquet dataset, so it makes sense that architectural considerations would ripple through this.
Looks great!
…arquet files

Why these changes are being introduced:

* Since the TIMDEXDataset partitions are now the [year, month, day] of the 'run_date', parquet files from different source runs will be written to the same partition. The previous configuration of existing_data_behavior="delete_matching" would result in the deletion of any existing parquet files from the partition directory with every source run, which is not the desired outcome. To support the new partitions, this updates the configuration to existing_data_behavior="overwrite_or_ignore", which will ignore any existing data and will only overwrite files with the same filename.

How this addresses that need:

* Set existing_data_behavior="overwrite_or_ignore" in the ds.write_dataset method call (see the sketch after this commit message)
* Add unit tests to demonstrate the updated existing_data_behavior

Side effects of this change:

* In the event that multiple runs are performed for the same 'source' and 'run_date', which is unlikely to occur, parquet files from both runs will exist in the partitioned directory. DatasetRecords can still be uniquely identified via the 'run_id' column.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TIMX-432
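A minimal sketch of the behavior described above (illustrative only: the table contents, base directory, and hive-style partitioning flavor are assumptions, not the library's actual write path):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table(
    {"source": ["alma"], "year": ["2024"], "month": ["12"], "day": ["01"]}
)

ds.write_dataset(
    table,
    base_dir="timdex-dataset",
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema(
            [("year", pa.string()), ("month", pa.string()), ("day", pa.string())]
        ),
        flavor="hive",
    ),
    # Leave existing files in the partition directory alone; only files
    # sharing the same filename are overwritten.
    existing_data_behavior="overwrite_or_ignore",
)
```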
Why these changes are being introduced:

* Reworking the dataset partitions to use the [year, month, day] of the 'run_date' means that parquet files for different 'source' runs on the same 'run_date' get written to the same partition directory. Therefore, it is crucial that the timdex_dataset_api.write method retrieves the correct partition columns from the (batches of) DatasetRecord objects. The DatasetRecord class has been refactored to adhere to the following criteria: 1. When writing to the dataset, and therefore serializing DatasetRecord objects, year, month, day should be derived from the run_date and should not be modifiable. 2. If possible, avoid parsing a datetime string 3 times for each partition column.

How this addresses that need (sketched below):

* Refactor DatasetRecord to use attrs
* Define a custom strict_date_parse converter method for the 'run_date' field
* Simplify the serialization method to rely on the converter for 'run_date' error handling
* Remove DatasetRecord.validate
* Include attrs as a dependency

Side effects of this change:

* None

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TIMX-432
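Sketching the shape of that refactor (abridged and hypothetical; the real class carries more fields than shown here):

```python
from datetime import UTC, datetime

from attrs import define, field

def strict_date_parse(date_string: str) -> datetime:
    # Accept only "YYYY-MM-DD"; anything else raises immediately.
    return datetime.strptime(date_string, "%Y-%m-%d").astimezone(UTC)

@define
class DatasetRecord:
    source: str = field()
    run_date: datetime = field(converter=strict_date_parse)

    def to_dict(self) -> dict:
        # run_date was parsed exactly once by the converter; all three
        # partition values derive from the stored datetime.
        return {
            "source": self.source,
            "run_date": self.run_date,
            "year": self.run_date.strftime("%Y"),
            "month": self.run_date.strftime("%m"),
            "day": self.run_date.strftime("%d"),
        }
```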
8bf085a to 8a30ca3 (Compare)
Purpose and background context

The main updates in this PR are updating the `TIMDEXDataset` partitioning scheme to use the `[year, month, day]` of the "run date" and removing the `partition_values` option from the `TIMDEXDataset.write` method and the `DatasetRecord.to_dict` method (serialization method).

Please read the Working Development Note | Rework Dataset Partitions to use only Year/Month/Day in the engineering plan (shoutout to @ghukill for writing an awesome document to provide helpful context for a rather significant change re: `TIMDEXDataset` partitions!).

How can a reviewer manually see the effects of these changes?
Includes new or updated dependencies?
YES - attrs (see commit above)
Changes expectations for external applications?

YES - These changes will require any callers of the write method (i.e., Transmogrifier) to explicitly set these required fields when creating the iterator of `DatasetRecord` instances.

What are the relevant tickets?
Developer
Code Reviewer(s)