Batch import metadata #35
Conversation
dr_factory.py: Filters backup_metadata so that the scheduler does not resume entries with the status running during batch imports of metadata. If the scheduler resumed them, it would overwrite the metadata while the import is still running.
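To make the filtering idea concrete, a minimal sketch follows; it is not dr_factory.py's actual code, and the table name, column name, and variable names are assumptions used only for illustration.

# Hypothetical illustration: export everything except in-flight runs so a restored
# scheduler cannot resume them while the batched import is still writing metadata.
EXPORT_FILTER = "state != 'running'"
EXPORT_SQL = f"COPY (SELECT * FROM dag_run WHERE {EXPORT_FILTER}) TO STDOUT"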
Appreciate your code contribution, @mathiasflorin! I plan to work on this PR sometime next week. Thank you! |
Meanwhile, please make sure all tests are passing and don't worry about the Publish to TestPyPi failure. Thank you! |
test_dr_factory_2_4.py: Add expected_export_filters to the tests. test_base_table.py: Adjust tests to assert that copy_expert is called with StringIO("".join(batch)). Add test_restore_batch_size to ensure that commit is called twice when the file contains more rows than the batch size.
I have modified the tests to accommodate the batch import approach. Additionally, I have added a new test to verify that the commit is called twice when the data to be loaded exceeds the batch size. |
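As a rough illustration of how those assertions can be made, here is a self-contained pytest-style sketch that mocks the database cursor. The restore_in_batches helper, the COPY statement, and the table name are assumptions standing in for the project's actual restore method, not its real code.

from io import StringIO
from itertools import islice
from unittest.mock import MagicMock


def restore_in_batches(cursor, lines, batch_size):
    # Stream `lines` into the database in chunks of `batch_size`, committing once per chunk.
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        cursor.copy_expert("COPY my_table FROM STDIN", StringIO("".join(batch)))
        cursor.connection.commit()


def test_restore_batch_size():
    cursor = MagicMock()
    rows = [f"{i}\tvalue\n" for i in range(3)]  # three rows with a batch size of two -> two commits
    restore_in_batches(cursor, rows, batch_size=2)
    assert cursor.copy_expert.call_count == 2
    assert cursor.connection.commit.call_count == 2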
Coverage report: Click to see where and how coverage changed
This report was generated by python-coverage-comment-action |
Hi @crupakheti, I've already added an additional test that covers batches of sample data. I have added tests to achieve 100% code coverage. |
Thank you, @mathiasflorin ! Yes, please. Let's try to maintain the coverage at 100%. |
Great change! These backup files could get very large, so batching the file-to-database import is a great improvement. I have two overall comments:
1 - I'm wondering if you were able to somewhat load-test the new import logic with a particularly large backup file, as this change is targeted at improving the scalability of the import step. For example, one thing that comes to mind is that frequent COPY-commit commands in quick succession could be expensive or high-latency due to row locking and transaction overhead. Some of these can be mitigated by chunking the commits themselves (a rough sketch of this idea follows after these comments). And maybe a question to @crupakheti, is there benchmarking for some of the more resource-heavy operations in the MWAA DR tool?
2 - I want to call out that this doesn't close #31, as that issue relates to the export (backup) dag, while this change is focused on chunking the import (restore) dag.
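For reference, a minimal sketch of that commit-chunking idea; this is an assumption about how it could look, not what this PR implements, and the function name, table name, and commits_every parameter are illustrative.

from io import StringIO


def restore_with_chunked_commits(cursor, batches, commits_every=10):
    # Issue one COPY per batch but commit only every `commits_every` batches,
    # amortizing transaction overhead across several COPY calls.
    pending = False
    for i, batch in enumerate(batches, start=1):
        cursor.copy_expert("COPY my_table FROM STDIN", StringIO("".join(batch)))
        pending = True
        if i % commits_every == 0:
            cursor.connection.commit()
            pending = False
    if pending:
        cursor.connection.commit()  # flush any batches still awaiting a commit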
I have tested this code with CSV files containing several million records. I have implemented your suggestion to use itertools.islice for yielding lines instead of looping over the lines of the file. Thank you for the recommendation! |
Sorry for the confusion. You are correct that this change does not close #31. I will update the pull request description accordingly. This change addresses the memory pressure issue during import rather than export. |
…store_more_than_batch_size and test_restore_cursor_exception
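For reference, a minimal sketch of the itertools.islice batching pattern mentioned above; this is an illustration under assumed names (read_batches, path, batch_size), not the tool's actual implementation.

from itertools import islice


def read_batches(path, batch_size=10000):
    # Yield lists of at most `batch_size` lines without reading the whole file into memory.
    with open(path, "r") as handle:
        while True:
            batch = list(islice(handle, batch_size))
            if not batch:
                return
            yield batch

Each yielded batch can then be streamed to the database with copy_expert and committed, keeping memory use bounded by the batch size.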
I will likely not have time to get to it this week, but I am +1 for biasing to merge this, and creating a separate Issue to track this improvement proposal. @crupakheti What do you think? |
Thank you for your contribution! Your code has been merged into |
What type of PR is this? (check all applicable)
Description
Use batch import instead of loading one large file.
Related Tickets & Documents
Linting
Have you done linting by issuing the ./build.sh lint command?
Testing
Have you run testing by issuing the ./build.sh unit command?
Documentation
Have you updated the README as appropriate for this PR?
Changelog
Have you updated the CHANGELOG with the changes you are making in this PR?
Please use Added, Removed, and/or Changed subsections under the Unreleased section.
Terms of Contribution
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.