Please describe why this is necessary.
Currently, log replay deduplication is based on a hash set of `(path, dvId)` pairs, but this is a waste of space. In the Delta spec, under action reconciliation, we read that:

> A given snapshot of a Delta table consists of ... a collection of add actions with unique path keys, corresponding to the newest (path, deletionVector.uniqueId) pair encountered for each path.

Describe the functionality you are proposing.
Our log replay `seen` hash set should be tracking just `path`, not `(path, dvid)` pairs. Doing so would make the visitor less complex (it would no longer need to pull the dvid at all), and would leave us tracking O(1) entries per path instead of O(k), where k is the number of DV changes on each file.
This should be a localized change to `scan/log_replay.rs`: delete `struct FileActionKey`, update the hash set to use `String` instead, and clean up the resulting mess in `scan::AddRemoveDedupVisitor`.
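To make the proposal concrete, here is a minimal sketch of path-only deduplication during a newest-first log replay. It is not the kernel's actual code: `FileAction` and `replay_newest_first` are hypothetical names invented for illustration, and the real visitor works on columnar batches rather than a slice of enums. The sketch also assumes actions arrive strictly newest-first with at most one action per path per commit; a commit that both removes and re-adds the same path (as a DV update may do) would need extra care that this sketch does not attempt.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified file action (not the kernel's real type).
#[derive(Debug, Clone, PartialEq)]
enum FileAction {
    Add { path: String, dv_id: Option<String> },
    Remove { path: String },
}

impl FileAction {
    /// The path is the only part of the action used as the dedup key.
    fn path(&self) -> &str {
        match self {
            FileAction::Add { path, .. } | FileAction::Remove { path } => path,
        }
    }
}

/// Replays `actions` (assumed ordered newest-first) and returns the adds
/// that survive, keeping only the newest action seen for each path.
fn replay_newest_first(actions: &[FileAction]) -> Vec<FileAction> {
    // One entry per path, no matter how many DV changes the file has had.
    let mut seen: HashSet<String> = HashSet::new();
    let mut live = Vec::new();
    for action in actions {
        // `insert` returns false when the path was already present, which
        // means a newer action for this path has already been processed.
        if !seen.insert(action.path().to_string()) {
            continue;
        }
        // The newest action for this path is an add, so the file is live.
        // A newest action that is a remove only marks the path as seen.
        if let FileAction::Add { .. } = action {
            live.push(action.clone());
        }
    }
    live
}
```

With this shape, three successive adds of the same path under different DVs still occupy a single `seen` entry, and only the newest add is returned.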
Additional context
No response