Commit
Why these changes are being introduced:
* Reworking the dataset partitions to use the [year, month, day] of the 'run_date' means that parquet files for different 'source' runs on the same 'run_date' get written to the same partition directory. Therefore, it is crucial that the timdex_dataset_api.write method retrieves the correct partition columns from the batches of DatasetRecord objects.

The DatasetRecord class has been refactored to adhere to the following criteria:
1. When writing to the dataset, and therefore serializing DatasetRecord objects, year, month, and day should be derived from the run_date and should not be modifiable.
2. If possible, avoid parsing a datetime string 3 times, once for each partition column.

How this addresses that need:
* Refactor DatasetRecord to use attrs
* Define custom strict_date_parse converter method for 'run_date' field
* Simplify serialization method to rely on converter for 'run_date' error handling
* Remove DatasetRecord.validate
* Include attrs as a dependency

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-432