Log replay deduplication can ignore deletion vectors #701

Open
scovich opened this issue Feb 19, 2025 · 0 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@scovich
Collaborator

scovich commented Feb 19, 2025

Please describe why this is necessary.

Currently, log replay deduplication is based on a hash set of (path, dvId) pairs, which wastes space.

In the Delta spec under action reconciliation we read that:

A given snapshot of a Delta table consists of ... a collection of add actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.

Describe the functionality you are proposing.

Our log replay seen hash set should track just path, not (path, dvId) pairs. Doing so would simplify the visitor (it would no longer need to extract the dvId at all) and would leave us tracking O(1) entries per path instead of O(k), where k is the number of DV changes on each file.

This should be a localized change to scan/log_replay.rs. Just delete struct FileActionKey, update the hash set to use String instead, and clean up the resulting mess in scan::AddRemoveDedupVisitor.
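
To make the space argument concrete, here is a minimal sketch, not the kernel's actual code. PairKey is a hypothetical stand-in for the current key type (the real FileActionKey may have different fields), and seen_set_sizes and check_and_record_seen are illustrative names that do not come from scan/log_replay.rs. It only compares how many entries each kind of seen set holds; it deliberately leaves out add-versus-remove handling and any ordering concerns within a commit.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the current key: one entry per (path, dvId) pair.
/// The field names here are assumptions made for illustration.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct PairKey {
    path: String,
    dv_unique_id: Option<String>,
}

/// Feeds a sequence of (path, dvId) observations into both kinds of seen set
/// and returns the final sizes as (pair-keyed, path-keyed).
fn seen_set_sizes(actions: &[(&str, Option<&str>)]) -> (usize, usize) {
    let mut seen_pairs: HashSet<PairKey> = HashSet::new();
    let mut seen_paths: HashSet<String> = HashSet::new();
    for &(path, dv) in actions {
        // Pair-keyed set: a new DV on the same file creates a new entry.
        seen_pairs.insert(PairKey {
            path: path.to_string(),
            dv_unique_id: dv.map(|s| s.to_string()),
        });
        // Path-keyed set: at most one entry per file, however many DVs it had.
        seen_paths.insert(path.to_string());
    }
    (seen_pairs.len(), seen_paths.len())
}

/// Path-only check-and-record: returns true if the path was already seen,
/// otherwise records it and returns false.
fn check_and_record_seen(seen: &mut HashSet<String>, path: &str) -> bool {
    // HashSet::insert returns false when the value was already present.
    !seen.insert(path.to_string())
}
```

For a file whose deletion vector changed twice, plus one untouched file, the pair-keyed set holds four entries while the path-keyed set holds two, which is the O(k) versus O(1) difference described above.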

Additional context

No response

@scovich scovich added enhancement New feature or request good first issue Good for newcomers labels Feb 19, 2025