Engines now return FileMeta with correct millisecond timestamps #565

OussamaSaoudi-db · 2024-12-06T03:13:23Z

What changes are proposed in this pull request?

This PR fixes the list_from method used by both the DefaultEngine and the SyncEngine so that it returns a correct FileMeta struct. Previously, both engines returned the timestamp as the number of seconds since the Unix epoch. However, FileMeta specifies that the modification time is in milliseconds since the Unix epoch:

pub struct FileMeta {
    /// The fully qualified path to the object
    pub location: Url,
    /// The last modified time as milliseconds since unix epoch
    pub last_modified: i64,
    /// The size in bytes of the object
    pub size: usize,
}
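
To make the unit difference concrete, here is a minimal sketch using only the standard library. The helper names `secs_since_epoch` and `millis_since_epoch` are hypothetical and are not part of the kernel; they only illustrate the two units discussed above.

```rust
use std::convert::TryInto; // explicit import keeps this compiling on pre-2021 editions
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper: whole seconds since the Unix epoch
/// (the unit the engines reportedly returned before this fix).
fn secs_since_epoch(t: SystemTime) -> Option<i64> {
    t.duration_since(UNIX_EPOCH).ok()?.as_secs().try_into().ok()
}

/// Hypothetical helper: milliseconds since the Unix epoch
/// (the unit documented on `FileMeta::last_modified`).
fn millis_since_epoch(t: SystemTime) -> Option<i64> {
    t.duration_since(UNIX_EPOCH).ok()?.as_millis().try_into().ok()
}
```

For an instant 1_733_454_803_250 ms after the epoch, the first helper yields `Some(1733454803)` and the second `Some(1733454803250)`, so a consumer expecting milliseconds would misread the first value as a date in January 1970.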

This PR affects the following public APIs

This changes the behaviour of the following:

SyncFilesystemClient::list_from
ObjectStoreFileSystemClient::list_from
any transitive users of these methods

How was this change tested?

The tests compare the FileMeta returned by each engine against the expected FileMeta, requiring the timestamp to fall within 10 seconds of the file's creation time.

OussamaSaoudi-db · 2024-12-06T03:17:47Z

Since I'm using the filesystem to get a ground truth for the tests, should I:

1. Try to remove the OS dependence in the tests
2. Only run these tests on Unix systems
3. Remove the tests

codecov · 2024-12-06T03:18:14Z

Codecov Report

Attention: Patch coverage is 75.00000% with 16 lines in your changes missing coverage. Please review.

Project coverage is 82.26%. Comparing base (3b456e4) to head (64b174f).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
kernel/src/lib.rs	50.00%	4 Missing and 6 partials ⚠️
kernel/src/engine/sync/fs_client.rs	66.66%	0 Missing and 6 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #565      +/-   ##
==========================================
- Coverage   82.32%   82.26%   -0.06%     
==========================================
  Files          71       71              
  Lines       15734    15777      +43     
  Branches    15734    15777      +43     
==========================================
+ Hits        12953    12979      +26     
- Misses       2164     2167       +3     
- Partials      617      631      +14


scovich · 2024-12-06T15:10:24Z

> Since I'm using the filesystem to get a ground truth for the tests, should I:
> 1. Try to remove the OS dependence in the tests
> 2. Only run these tests on Unix systems
> 3. Remove the tests

It seems like we should either make the tests OS-independent, or have OS-specific code for each test so we verify it works? But then again this is only for local filesystem which is of little interest to prod code. The really useful test would validate that object store returns millisecond resolution timestamps, but that would require tests that access cloud storage.

OussamaSaoudi-db · 2024-12-06T17:10:24Z

> this is only for local filesystem which is of little interest to prod code. The really useful test would validate that object store returns millisecond resolution timestamps, but that would require tests that access cloud storage.

Does that mean I should remove the tests for now?

scovich · 2024-12-06T21:12:20Z

kernel/src/engine/sync/fs_client.rs

+                            .ok()
+                            .and_then(|modified| {
+                                modified.duration_since(SystemTime::UNIX_EPOCH).ok()
+                            })
+                            .and_then(|modified| modified.as_millis().try_into().ok())
                            .unwrap_or(0);


Are we converting Result to Option three times here? Why is that needed?

Suggested change

-                            .ok()
-                            .and_then(|modified| {
-                                modified.duration_since(SystemTime::UNIX_EPOCH).ok()
-                            })
-                            .and_then(|modified| modified.as_millis().try_into().ok())
-                            .unwrap_or(0);
+                            .and_then(|modified| modified.duration_since(SystemTime::UNIX_EPOCH))
+                            .and_then(|modified| modified.as_millis().try_into())
+                            .unwrap_or(0);

Or do types not match in some annoying way?
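
On the type question: `Result::and_then` keeps the error type fixed, while the three fallible steps here fail with three different error types (`io::Error`, `SystemTimeError`, `TryFromIntError`). Chaining the `Result`s directly would therefore most likely not compile without a `map_err` on each step, which is presumably why the original converts to `Option` each time. A minimal sketch, assuming the input is the `io::Result<SystemTime>` that `Metadata::modified` returns; the function name `millis_or_zero` is hypothetical:

```rust
use std::convert::TryInto; // explicit import keeps this compiling on pre-2021 editions
use std::io;
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper mirroring the Option-based chain from the diff.
/// Each `.ok()` discards a *different* error type, so the chain type-checks.
fn millis_or_zero(modified: io::Result<SystemTime>) -> i64 {
    modified
        .ok() // drops io::Error
        .and_then(|m| m.duration_since(UNIX_EPOCH).ok()) // drops SystemTimeError
        .and_then(|d| d.as_millis().try_into().ok()) // drops TryFromIntError
        .unwrap_or(0) // any failure collapses to 0
}
```

The cost of this shape is that every failure silently becomes `0`, which a caller cannot tell apart from a real timestamp of exactly the epoch.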

Honestly, all this monadic chaining feels like a poor substitute for a missing helper method...

impl TryFrom<DirEntry> for FileMeta {
    type Error = Error;

    fn try_from(ent: DirEntry) -> DeltaResult<FileMeta> {
        let metadata = ent.metadata()?;
        let last_modified = metadata.modified()?.duration_since(SystemTime::UNIX_EPOCH)?;
        Ok(FileMeta {
            location: Url::from_file_path(ent.path())?,
            last_modified: last_modified.as_millis().try_into()?,
            size: metadata.len() as usize,
        })
    }
}

and then

let it = all_ents
    .into_iter()
    .sorted_by_key(|ent| ent.path())
    .map(TryFrom::try_from);

Woah that's clean 👌 I'm doing a lot of map_err, and I'm wondering if we should do From conversions for common things like Url::from_* and TryInto 🤔

yea seems reasonable? make an issue for us to think about?

We already have the From conversion for url::ParseError, so ? should Just Work? The problem with TryInto is the error type is parametrized, so there's no way to know in advance what might be needed. We'd have to go case by case (when in doubt, try ? and hopefully there's already an impl From for it to use).
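
To illustrate how `?` relies on `From` impls, here is a minimal std-only sketch of the helper discussed in this thread. `SketchError` and `SketchMeta` are hypothetical stand-ins for the kernel's `Error` and `FileMeta`, and `PathBuf` replaces `Url` so that the example needs no external crates; the kernel's actual types and impls may differ.

```rust
use std::convert::{TryFrom, TryInto}; // explicit imports for pre-2021 editions
use std::fs::DirEntry;
use std::num::TryFromIntError;
use std::path::PathBuf;
use std::time::{SystemTime, SystemTimeError};

/// Hypothetical error type; one `From` impl per source error lets `?` convert.
#[derive(Debug)]
enum SketchError {
    Io(std::io::Error),
    Time(SystemTimeError),
    IntConversion(TryFromIntError),
}

impl From<std::io::Error> for SketchError {
    fn from(e: std::io::Error) -> Self {
        SketchError::Io(e)
    }
}

impl From<SystemTimeError> for SketchError {
    fn from(e: SystemTimeError) -> Self {
        SketchError::Time(e)
    }
}

impl From<TryFromIntError> for SketchError {
    fn from(e: TryFromIntError) -> Self {
        SketchError::IntConversion(e)
    }
}

/// Hypothetical stand-in for `FileMeta`.
#[derive(Debug)]
struct SketchMeta {
    location: PathBuf,
    /// Milliseconds since the Unix epoch.
    last_modified: i64,
    size: u64,
}

impl TryFrom<DirEntry> for SketchMeta {
    type Error = SketchError;

    fn try_from(ent: DirEntry) -> Result<Self, Self::Error> {
        let metadata = ent.metadata()?; // io::Error -> SketchError
        let since_epoch = metadata
            .modified()? // io::Error -> SketchError
            .duration_since(SystemTime::UNIX_EPOCH)?; // SystemTimeError -> SketchError
        Ok(SketchMeta {
            location: ent.path(),
            last_modified: since_epoch.as_millis().try_into()?, // TryFromIntError -> SketchError
            size: metadata.len(),
        })
    }
}
```

The error type of `TryInto` depends on the source and target types (here `u128` to `i64` yields `TryFromIntError`), which matches the remark above that such conversions have to be handled case by case.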

scovich · 2024-12-06T21:28:24Z

kernel/src/engine/sync/fs_client.rs

+        let expected_timestamp = metadata
+            .modified()?
+            .duration_since(UNIX_EPOCH)?
+            .as_millis()
+            .try_into()?;


This is the exact same code the iterator uses... so it's not actually testing for correctness but rather testing that the code is equivalent. Not sure what the better test would be? Maybe we write a file and manually set its mtime, then verify the FileMeta has the same timestamp?

(same criticism applies to the default client test)

maybe create the file then just assert that the timestamp we get back is within 10s or 60s or something?

Implemented a test that checks that the FileMeta timestamp falls within the last minute.

seems good for now :)
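
For reference, the "manually set its mtime" idea raised earlier in this thread could look roughly like the sketch below. It assumes Rust 1.75 or newer for `File::set_modified`, and it reads the timestamp back with `std::fs::metadata` as a stand-in for the engine's `list_from` call; the helper name is hypothetical. A whole-second mtime is used so that filesystems with coarse timestamp resolution do not affect the comparison.

```rust
use std::convert::TryFrom; // explicit import keeps this compiling on pre-2021 editions
use std::fs::File;
use std::io;
use std::path::Path;
use std::time::{Duration, UNIX_EPOCH};

/// Hypothetical helper: create `path`, pin its mtime to `secs` seconds after
/// the Unix epoch, then return the mtime read back, in milliseconds.
fn write_with_mtime_and_read_millis(path: &Path, secs: u64) -> io::Result<i64> {
    let file = File::create(path)?;
    // Assumption: `File::set_modified` is available (stable since Rust 1.75).
    file.set_modified(UNIX_EPOCH + Duration::from_secs(secs))?;
    drop(file);

    // Stand-in for the engine call under test.
    let modified = std::fs::metadata(path)?.modified()?;
    let since_epoch = modified
        .duration_since(UNIX_EPOCH)
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
    i64::try_from(since_epoch.as_millis())
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))
}
```

Because the expected value is fixed in advance rather than derived by the same conversion code, a test built this way checks the unit itself: an mtime of 1_700_000_000 seconds must come back as 1_700_000_000_000.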

scovich

LGTM

scovich · 2024-12-09T15:58:07Z

kernel/src/engine/default/filesystem.rs

+        // The [`FileMeta`]s must be greater than 1 minute ago
+        let allowed_time = begin_time - Duration::from_secs(60);


qq: Why do we need such a big safety margin, out of curiosity? All files are PUT after this timestamp, so they should not have smaller timestamps?

I think I did come across a timing issue where it failed even though the PUT happened beforehand. This was when I'd set the margin to 0. I could make the bounds tighter, but I didn't want to make the test remotely flaky.

Could we also just assert the file_meta time is < current time (and could include a little safety margin as well)? Right now we only have a bound on one end (I don't think this would catch the case of us accidentally returning nanos instead of millis for example)
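
A two-sided check along these lines might be sketched as follows. The helper name `within_window` and its signature are hypothetical; the point is only that an upper bound rejects values in a too-fine unit (nanoseconds), while the lower bound rejects values in a too-coarse unit (seconds).

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical helper: does `last_modified_ms` fall inside
/// [begin - margin, end + margin], with all values compared in milliseconds?
fn within_window(
    last_modified_ms: i64,
    begin: SystemTime,
    end: SystemTime,
    margin: Duration,
) -> bool {
    // Convert a SystemTime to milliseconds since the epoch (0 if before it).
    let to_ms = |t: SystemTime| -> i128 {
        t.duration_since(UNIX_EPOCH)
            .map(|d| d.as_millis() as i128)
            .unwrap_or(0)
    };
    let margin_ms = margin.as_millis() as i128;
    let ts = last_modified_ms as i128;
    to_ms(begin) - margin_ms <= ts && ts <= to_ms(end) + margin_ms
}
```

In a test, `begin` would be captured before the files are written and `end` after listing them, with `margin` chosen large enough to avoid flakiness.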

zachschuermann

LGTM pending one fix!

zachschuermann · 2024-12-09T17:24:52Z

kernel/src/engine/default/filesystem.rs

+        // The [`FileMeta`]s must be greater than 1 minute ago
+        let allowed_time = begin_time - Duration::from_secs(60);


Could we also just assert the file_meta time is < current time (and could include a little safety margin as well)? Right now we only have a bound on one end (I don't think this would catch the case of us accidentally returning nanos instead of millis for example)


Fix timestamp millisecond and add tests to check

c7d41a8

github-actions bot assigned OussamaSaoudi-db Dec 6, 2024

OussamaSaoudi-db changed the title ~~Fix timestamp millisecond and add tests to check~~ Engines now return FileMeta with correct timestamps Dec 6, 2024

OussamaSaoudi-db requested review from zachschuermann, nicklan and scovich December 6, 2024 03:16

OussamaSaoudi-db changed the title ~~Engines now return FileMeta with correct timestamps~~ Engines now return FileMeta with correct millisecond timestamps Dec 6, 2024

scovich reviewed Dec 6, 2024


OussamaSaoudi-db added 4 commits December 7, 2024 16:35

Fix file meta tests

dcc2d49

make tests use 60 seconds instead

b3bde76

Move FileMeta construction to TryFrom

97e61a6

remove compilation error

7103d27

OussamaSaoudi-db requested a review from scovich December 8, 2024 00:51

OussamaSaoudi-db mentioned this pull request Dec 9, 2024

Helper methods for CDF Physical to Logical Transformation #579

Merged

scovich approved these changes Dec 9, 2024


zachschuermann approved these changes Dec 9, 2024


OussamaSaoudi-db added 3 commits December 9, 2024 09:55

Fixup tests

dd71d00

change test to use absolute, bring down to 10s

c765eae

Merge branch 'main' into file_modification_fixup

64b174f

OussamaSaoudi-db merged commit bea3326 into delta-io:main Dec 9, 2024
18 of 20 checks passed
