Batch import metadata #35

Closed

Conversation

mathiasflorin
Contributor

@mathiasflorin mathiasflorin commented Nov 20, 2024

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Dependency Update
  • Release

Description

Use batched imports instead of loading one large file into memory at once.


Related Tickets & Documents

  • Related Issue #
  • Closes #

Linting

Have you done linting by issuing the ./build.sh lint command?

  • Yes
  • No, I need help with linting

Testing

Have you run testing by issuing the ./build.sh unit command?

  • Yes
  • No, but I tested the export and import

Documentation

Have you updated the README as appropriate for this PR?

  • Yes
  • No, README does not need any changes for this PR

Changelog

Have you updated the CHANGELOG with the changes you are making in this PR?
Please use Added, Removed, and/or Changed subsections under the Unreleased section.

  • Yes
  • No, I need help

Terms of Contribution

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

dr_factory.py: Filter backup_metadata so that the scheduler does not resume a backup_metadata run with the status running during the batch import of metadata. If the scheduler resumed such a run, it would overwrite the metadata while the import is still running.
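
As a rough illustration of the kind of filter described above (not the actual dr_factory.py code): assuming the export of the dag_run table is driven by a SQL SELECT and that the factory passes per-table export filters, the idea could look like the sketch below; the constant and function names are hypothetical.

```python
# Illustrative sketch only -- the names below are hypothetical, not the real
# dr_factory.py API. The idea: exclude backup_metadata DAG runs that are still
# in the "running" state from the export, so that after the restore the
# scheduler does not resume that run and overwrite metadata mid-import.
DAG_RUN_EXPORT_FILTER = "NOT (dag_id = 'backup_metadata' AND state = 'running')"

def build_dag_run_export_sql(columns):
    """Build the SELECT used to export the dag_run table, with the filter applied."""
    return f"SELECT {', '.join(columns)} FROM dag_run WHERE {DAG_RUN_EXPORT_FILTER}"
```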
@crupakheti crupakheti self-requested a review November 20, 2024 16:25
@crupakheti crupakheti self-assigned this Nov 20, 2024
@crupakheti
Contributor

Appreciate your code contribution, @mathiasflorin! I plan to work on this PR sometime next week. Thank you!

@crupakheti
Contributor

Meanwhile, please make sure all tests are passing and don't worry about the Publish to TestPyPi failure. Thank you!

@crupakheti crupakheti added the enhancement New feature or request label Nov 20, 2024
test_dr_factory_2_4.py: Add expected_export_filters to the tests.

test_base_table.py:

Adjust tests to assert that copy_expert is called with StringIO("".join(batch)).

Add test_restore_batch_size to ensure that commit is called twice when the file contains more rows than the batch size.
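
For readers skimming the commit list, a minimal sketch of the batched restore behaviour these tests assert, assuming a psycopg2-style connection and an open backup file; restore_in_batches and batch_size are illustrative names rather than the actual base_table.py implementation:

```python
from io import StringIO
from itertools import islice


def restore_in_batches(connection, copy_sql, backup_file, batch_size=1000):
    """Stream a backup file into the database in fixed-size batches.

    Each batch of lines is joined into an in-memory buffer, loaded via
    copy_expert, and committed, so the whole file never has to be held
    in memory at once.
    """
    cursor = connection.cursor()
    while True:
        batch = list(islice(backup_file, batch_size))
        if not batch:
            break
        cursor.copy_expert(copy_sql, StringIO("".join(batch)))
        connection.commit()  # one commit per batch; two commits when rows > batch_size
```

With a `COPY ... FROM STDIN` statement as `copy_sql`, a file holding more rows than one batch drives the loop twice, which matches the test_restore_batch_size expectation that commit is called twice.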
@mathiasflorin
Contributor Author

> Meanwhile, please make sure all tests are passing and don't worry about the Publish to TestPyPi failure. Thank you!

I have modified the tests to accommodate the batch import approach. Additionally, I have added a new test to verify that the commit is called twice when the data to be loaded exceeds the batch size.


github-actions bot commented Nov 21, 2024

Coverage report

Files covered:
  • config.py
  • assets/dags/mwaa_dr/framework/factory: base_dr_factory.py, default_dag_factory.py
  • assets/dags/mwaa_dr/framework/model: active_dag_table.py, base_table.py, connection_table.py, variable_table.py
  • lib/dr_constructs: airflow_cli.py
  • lib/functions: airflow_cli_client.py, airflow_cli_function.py
  • lib/stacks: mwaa_base_stack.py, mwaa_primary_stack.py, mwaa_secondary_stack.py

This report was generated by python-coverage-comment-action

@mathiasflorin
Contributor Author

mathiasflorin commented Nov 21, 2024

Hi @crupakheti, I've already added an additional test that covers batches of sample data, and I have added tests to achieve 100% code coverage.

@crupakheti
Contributor

Thank you, @mathiasflorin ! Yes, please. Let's try to maintain the coverage at 100%.


@mayushko26 mayushko26 left a comment


Great change! These backup files could get very large, so batching the file-to-database import is a great improvement. I have two overall comments:

1 - I'm wondering if you were able to somewhat load-test the new import logic with a particularly large backup file, as this change is targeted at improving the scalability of the import step. For example, one thing that comes to mind is that frequent COPY-commit commands in quick succession could be expensive or high-latency due to row locking and transaction overhead. Some of these can be mitigated by chunking the commits themselves. And maybe a question to @crupakheti, is there benchmarking for some of the more resource-heavy operations in the MWAA DR tool?

2 - I want to call out that this doesn't close #31, as that issue relates to the export (backup) dag, while this change is focused on chunking the import (restore) dag.

Review comment on assets/dags/mwaa_dr/framework/model/base_table.py (outdated, resolved)
@mathiasflorin
Contributor Author

mathiasflorin commented Nov 22, 2024

> Great change! These backup files could get very large, so batching the file-to-database import is a great improvement. I have two overall comments:

> 1 - I'm wondering if you were able to somewhat load-test the new import logic with a particularly large backup file, as this change is targeted at improving the scalability of the import step. For example, one thing that comes to mind is that frequent COPY-commit commands in quick succession could be expensive or high-latency due to row locking and transaction overhead. Some of these can be mitigated by chunking the commits themselves. And maybe a question to @crupakheti, is there benchmarking for some of the more resource-heavy operations in the MWAA DR tool?

I have tested this code with CSV files containing several million records. I have implemented your suggestion to use itertools.islice, but it still takes approximately 20 minutes on my test system to restore the metadata.

> 2 - I want to call out that this doesn't close #31, as that issue relates to the export (backup) dag, while this change is focused on chunking the import (restore) dag.

Sorry for the confusion. You are correct that this change does not close #31. I will update the pull request description accordingly. This change addresses the memory pressure issue during import rather than export.
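
Picking up the commit-overhead point from the review above, one possible mitigation (suggested in the review, not implemented in this PR) is to commit every N batches rather than after each COPY; this sketch assumes the same psycopg2-style connection as before, and commit_every is an illustrative parameter:

```python
from io import StringIO
from itertools import islice


def restore_with_chunked_commits(connection, copy_sql, backup_file,
                                 batch_size=1000, commit_every=10):
    """Batched restore variant that commits every `commit_every` batches
    to reduce per-COPY transaction overhead (illustrative only)."""
    cursor = connection.cursor()
    pending = 0  # batches loaded since the last commit
    while True:
        batch = list(islice(backup_file, batch_size))
        if not batch:
            break
        cursor.copy_expert(copy_sql, StringIO("".join(batch)))
        pending += 1
        if pending >= commit_every:
            connection.commit()
            pending = 0
    if pending:
        connection.commit()  # flush any remaining uncommitted batches
```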

…store_more_than_batch_size and test_restore_cursor_exception
… itertools.islice for yielding lines instead of looping over the lines of the file. Thank you for the recommendation!
@mayushko26

I will likely not have time to get to it this week, but I am +1 for biasing toward merging this and creating a separate issue to track this improvement proposal. @crupakheti, what do you think?

@crupakheti
Contributor

Thank you for your contribution! Your code has been merged into main through PR #38. 🙏

@crupakheti crupakheti closed this Feb 14, 2025
Labels
enhancement New feature or request
3 participants