Batch import metadata #35

Closed

Conversation

mathiasflorin
Contributor

@mathiasflorin mathiasflorin commented Nov 20, 2024

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update
  • Dependency Update
  • Release

Description

Use batched imports instead of loading one large file into memory at once.


Related Tickets & Documents

  • Related Issue #
  • Closes #

Linting

Have you done linting by issuing the ./build.sh lint command?

  • Yes
  • No, I need help with linting

Testing

Have you run testing by issuing the ./build.sh unit command?

  • Yes
  • No, but I tested the export and import

Documentation

Have you updated the README as appropriate for this PR?

  • Yes
  • No, README does not need any changes for this PR

Changelog

Have you updated the CHANGELOG with the changes you are making in this PR?
Please use Added, Removed, and/or Changed subsections under the Unreleased section.

  • Yes
  • No, I need help

Terms of Contribution

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

dr_factory.py: Filter backup_metadata so that the scheduler does not resume a backup_metadata run with the status running during the batch import of metadata. If the scheduler resumed such a run, it would overwrite the metadata while the import is still running.
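
As a rough illustration of the kind of filter described above (not the actual dr_factory.py code): assuming the export of the dag_run table is driven by a SQL SELECT and that the factory passes per-table export filters, the idea could look like the sketch below; the constant and function names are hypothetical.

```python
# Illustrative sketch only -- the names below are hypothetical, not the real
# dr_factory.py API. The idea: exclude backup_metadata DAG runs that are still
# in the "running" state from the export, so that after the restore the
# scheduler does not resume that run and overwrite metadata mid-import.
DAG_RUN_EXPORT_FILTER = "NOT (dag_id = 'backup_metadata' AND state = 'running')"

def build_dag_run_export_sql(columns):
    """Build the SELECT used to export the dag_run table, with the filter applied."""
    return f"SELECT {', '.join(columns)} FROM dag_run WHERE {DAG_RUN_EXPORT_FILTER}"
```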
@crupakheti crupakheti self-requested a review November 20, 2024 16:25
@crupakheti crupakheti self-assigned this Nov 20, 2024
@crupakheti
Contributor

Appreciate your code contribution, @mathiasflorin! I plan to work on this PR sometime next week. Thank you!

@crupakheti
Contributor

Meanwhile, please make sure all tests are passing and don't worry about the Publish to TestPyPi failure. Thank you!

@crupakheti crupakheti added the enhancement New feature or request label Nov 20, 2024
test_dr_factory_2_4.py: Add expected_export_filters to the tests.

test_base_table.py:

Adjust tests to assert that copy_expert is called with StringIO("".join(batch)).

Add test_restore_batch_size to ensure that commit is called twice when the file contains more rows than the batch size.
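
For readers skimming the commit list, a minimal sketch of the batched restore behaviour these tests assert, assuming a psycopg2-style connection and an open backup file; restore_in_batches and batch_size are illustrative names rather than the actual base_table.py implementation:

```python
from io import StringIO
from itertools import islice


def restore_in_batches(connection, copy_sql, backup_file, batch_size=1000):
    """Stream a backup file into the database in fixed-size batches.

    Each batch of lines is joined into an in-memory buffer, loaded via
    copy_expert, and committed, so the whole file never has to be held
    in memory at once.
    """
    cursor = connection.cursor()
    while True:
        batch = list(islice(backup_file, batch_size))
        if not batch:
            break
        cursor.copy_expert(copy_sql, StringIO("".join(batch)))
        connection.commit()  # one commit per batch; two commits when rows > batch_size
```

With a `COPY ... FROM STDIN` statement as `copy_sql`, a file holding more rows than one batch drives the loop twice, which matches the test_restore_batch_size expectation that commit is called twice.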
@mathiasflorin
Contributor Author

> Meanwhile, please make sure all tests are passing and don't worry about the Publish to TestPyPi failure. Thank you!

I have modified the tests to accommodate the batch import approach. Additionally, I have added a new test to verify that the commit is called twice when the data to be loaded exceeds the batch size.


github-actions bot commented Nov 21, 2024

Coverage report

Files covered:
  • config.py
  • assets/dags/mwaa_dr/framework/factory: base_dr_factory.py, default_dag_factory.py
  • assets/dags/mwaa_dr/framework/model: active_dag_table.py, base_table.py, connection_table.py, variable_table.py
  • lib/dr_constructs: airflow_cli.py
  • lib/functions: airflow_cli_client.py, airflow_cli_function.py
  • lib/stacks: mwaa_base_stack.py, mwaa_primary_stack.py, mwaa_secondary_stack.py

This report was generated by python-coverage-comment-action

@mathiasflorin
Contributor Author

mathiasflorin commented Nov 21, 2024

Hi @crupakheti, I've already added an additional test that covers batches of sample data, and I have added tests to achieve 100% code coverage.

@crupakheti
Contributor

Thank you, @mathiasflorin ! Yes, please. Let's try to maintain the coverage at 100%.


@mayushko26 mayushko26 left a comment


Great change! These backup files could get very large, so batching the file-to-database import is a great improvement. I have two overall comments:

1 - I'm wondering if you were able to somewhat load-test the new import logic with a particularly large backup file, as this change is targeted at improving the scalability of the import step. For example, one thing that comes to mind is that frequent COPY-commit commands in quick succession could be expensive or high-latency due to row locking and transaction overhead. Some of these can be mitigated by chunking the commits themselves. And maybe a question to @crupakheti, is there benchmarking for some of the more resource-heavy operations in the MWAA DR tool?

2 - I want to call out that this doesn't close #31, as that issue relates to the export (backup) dag, while this change is focused on chunking the import (restore) dag.

Review comment on assets/dags/mwaa_dr/framework/model/base_table.py (outdated, resolved)
@mathiasflorin
Contributor Author

mathiasflorin commented Nov 22, 2024

> Great change! These backup files could get very large, so batching the file-to-database import is a great improvement. I have two overall comments:

> 1 - I'm wondering if you were able to somewhat load-test the new import logic with a particularly large backup file, as this change is targeted at improving the scalability of the import step. For example, one thing that comes to mind is that frequent COPY-commit commands in quick succession could be expensive or high-latency due to row locking and transaction overhead. Some of these can be mitigated by chunking the commits themselves. And maybe a question to @crupakheti, is there benchmarking for some of the more resource-heavy operations in the MWAA DR tool?

I have tested this code with CSV files containing several million records. I have implemented your suggestion to use itertools.islice, but it still takes approximately 20 minutes on my test system to restore the metadata.

> 2 - I want to call out that this doesn't close #31, as that issue relates to the export (backup) dag, while this change is focused on chunking the import (restore) dag.

Sorry for the confusion. You are correct that this change does not close #31. I will update the pull request description accordingly. This change addresses the memory pressure issue during import rather than export.
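
Picking up the commit-overhead point from the review above, one possible mitigation (suggested in the review, not implemented in this PR) is to commit every N batches rather than after each COPY; this sketch assumes the same psycopg2-style connection as before, and commit_every is an illustrative parameter:

```python
from io import StringIO
from itertools import islice


def restore_with_chunked_commits(connection, copy_sql, backup_file,
                                 batch_size=1000, commit_every=10):
    """Batched restore variant that commits every `commit_every` batches
    to reduce per-COPY transaction overhead (illustrative only)."""
    cursor = connection.cursor()
    pending = 0  # batches loaded since the last commit
    while True:
        batch = list(islice(backup_file, batch_size))
        if not batch:
            break
        cursor.copy_expert(copy_sql, StringIO("".join(batch)))
        pending += 1
        if pending >= commit_every:
            connection.commit()
            pending = 0
    if pending:
        connection.commit()  # flush any remaining uncommitted batches
```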

…store_more_than_batch_size and test_restore_cursor_exception
… itertools.islice for yielding lines instead of looping over the lines of the file. Thank you for the recommendation!
@mayushko26

I will likely not have time to get to it this week, but I am +1 for biasing toward merging this and creating a separate issue to track this improvement proposal. @crupakheti, what do you think?

@crupakheti
Contributor

Thank you for your contribution! Your code has been merged into main through PR #38. 🙏

@crupakheti crupakheti closed this Feb 14, 2025
Labels
enhancement New feature or request
3 participants