chunk cache #31

assafvayner · 2024-10-01T00:25:37Z

Chunk Cache implementation.

Caches xorb ranges on the file system.

assafvayner · 2024-10-01T00:25:57Z

This PR needs more iterations

rajatarya · 2024-10-02T03:53:17Z

Are you ready for this PR to be reviewed now?

assafvayner · 2024-10-02T15:48:54Z

yes

rajatarya

I'd like a high-level explanation of the design, either in a README.md file in the crate or in the docstrings for DiskCache or ChunkCache. It should highlight that this cache stores chunk ranges from xorbs (are they chunk-aligned?) on disk using a layout of cache_dir/xorb merklehash/base64_encode(start,end,checksum).

Tests look good, but I want to better understand concurrency around the HashMap. Imagine Python code is spinning up multiple Python threads (and down the road actual system threads): how would they interact with this cache?

(The use case is hf_hub.download_snapshot(), which downloads a full repo.)
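To make the concurrency question concrete, here is a minimal sketch of one common way to share a HashMap-backed index across threads: wrap it in `Arc<Mutex<...>>` so every reader and writer takes the lock first. This is an assumption for discussion, not a description of what this PR does; `SharedIndex`, `insert`, `range_count`, and `insert_from_threads` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical in-memory index: xorb hash (as a string) -> cached chunk ranges.
/// Cloning the handle is cheap; all clones share the same map.
#[derive(Clone, Default)]
pub struct SharedIndex {
    inner: Arc<Mutex<HashMap<String, Vec<(u32, u32)>>>>,
}

impl SharedIndex {
    /// Record that chunks `[start, end)` of `key` are cached.
    pub fn insert(&self, key: &str, start: u32, end: u32) {
        // The lock is held only for the duration of this call.
        let mut map = self.inner.lock().unwrap();
        map.entry(key.to_string()).or_default().push((start, end));
    }

    /// Number of ranges recorded for `key` (0 if the key is unknown).
    pub fn range_count(&self, key: &str) -> usize {
        let map = self.inner.lock().unwrap();
        map.get(key).map_or(0, |ranges| ranges.len())
    }
}

/// Spawn `n` threads that each insert one range for the same key,
/// then wait for all of them to finish.
pub fn insert_from_threads(index: &SharedIndex, n: u32) {
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let index = index.clone();
            thread::spawn(move || index.insert("xorb-a", i, i + 1))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```

A single mutex serializes all access, which is simple but can become a bottleneck. It also only protects the in-memory map, so files on disk would still need their own coordination.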

rajatarya · 2024-10-02T17:34:47Z

chunk_cache/src/lib.rs

+        &mut self,
+        key: &Key,
+        range: &Range,
+        chunk_byte_indicies: &[u32],


typo: indicies -> indices

(in chunk_byte_indicies).

rajatarya · 2024-10-02T17:35:24Z

chunk_cache/src/lib.rs

+
+pub use disk_cache::DiskCache;
+
+pub trait ChunkCache {


Is the practice to put comments on the trait or on the implementation?

I'd like to know why this is a trait, what its responsibilities are, and what its intended uses are.
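As a rough illustration of documenting at the trait level, the sketch below puts the contract (what callers may rely on) on the trait and leaves storage details to the implementation's own doc comment. `RangeCache`, `MemoryCache`, and the method signatures are hypothetical and deliberately simpler than the `ChunkCache` trait in this PR.

```rust
use std::collections::HashMap;

/// A cache of bytes for chunk ranges, keyed by an opaque string.
///
/// Being a trait lets callers swap the storage backend (disk, memory,
/// a no-op cache for tests) without changing call sites.
pub trait RangeCache {
    /// Returns the bytes stored for chunks `[start, end)` of `key`,
    /// or `None` on a cache miss.
    fn get(&self, key: &str, start: u32, end: u32) -> Option<Vec<u8>>;

    /// Stores `data` for chunks `[start, end)` of `key`,
    /// replacing any previous entry for the same range.
    fn put(&mut self, key: &str, start: u32, end: u32, data: &[u8]);
}

/// In-memory implementation, intended for tests.
/// Entries are kept in a `HashMap` and never evicted.
#[derive(Default)]
pub struct MemoryCache {
    entries: HashMap<(String, u32, u32), Vec<u8>>,
}

impl RangeCache for MemoryCache {
    fn get(&self, key: &str, start: u32, end: u32) -> Option<Vec<u8>> {
        self.entries.get(&(key.to_string(), start, end)).cloned()
    }

    fn put(&mut self, key: &str, start: u32, end: u32, data: &[u8]) {
        self.entries
            .insert((key.to_string(), start, end), data.to_vec());
    }
}
```

Doc comments on trait methods are shown by rustdoc on every implementor, so the behavioural contract only has to be written once.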

rajatarya · 2024-10-02T17:36:27Z

chunk_cache/src/error.rs

+    Parse(String),
+    #[error("bad range")]
+    BadRange,
+    #[error("cache is empty when it is presumed no empty")]


Maybe rephrase: "cache is unexpectedly empty"

rajatarya · 2024-10-02T17:38:33Z

chunk_cache/src/disk_cache.rs

+const BASE64_ENGINE: GeneralPurpose = BASE64_URL_SAFE;
+
+#[derive(Debug, Clone)]
+pub struct DiskCache {


Please add comments on the pub struct.

What is the purpose of this struct? What should someone looking at this code three months from now, without any other context, understand about it?

rajatarya · 2024-10-02T17:39:02Z

chunk_cache/src/disk_cache.rs

+}
+
+impl DiskCache {
+    pub fn initialize<T: Into<PathBuf>>(


Please add comments on pub fn - especially things like initialize().

Important to explain the template/generic usage here.
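For reference, a small sketch of what a `T: Into<PathBuf>` bound buys the caller: the function accepts `&str`, `String`, `&Path`, or `PathBuf` and converts once at the boundary. `DirCache` and its single-argument `initialize` are hypothetical stand-ins; the real signature is truncated in the diff above and may take more parameters.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for a cache rooted at a directory.
pub struct DirCache {
    root: PathBuf,
}

impl DirCache {
    /// Creates a cache rooted at `cache_root`.
    ///
    /// The generic bound means callers can pass any type that converts
    /// into a `PathBuf` (`&str`, `String`, `&Path`, `PathBuf`) without
    /// converting it themselves.
    pub fn initialize<T: Into<PathBuf>>(cache_root: T) -> Self {
        DirCache {
            root: cache_root.into(),
        }
    }

    /// The directory this cache stores its files under.
    pub fn root(&self) -> &Path {
        &self.root
    }
}
```

A doc comment along these lines on the real `initialize()` would answer the question of why the parameter is generic.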

rajatarya · 2024-10-02T18:10:45Z

chunk_cache/src/disk_cache/cache_file_header.rs

+}
+
+impl CacheFileHeader {
+    pub fn new<T: Into<Vec<u32>>>(chunk_byte_indicies: T) -> Self {


rajatarya · 2024-10-02T18:10:51Z

chunk_cache/src/disk_cache/cache_file_header.rs

+        }
+    }
+
+    pub fn deserialize<R: Read + Seek>(reader: &mut R) -> Result<Self, ChunkCacheError> {


rajatarya · 2024-10-02T18:11:00Z

chunk_cache/src/disk_cache/cache_file_header.rs

+        })
+    }
+
+    pub fn serialize<W: Write>(&self, writer: &mut W) -> Result<usize, std::io::Error> {


rajatarya · 2024-10-02T18:11:54Z

chunk_cache/src/disk_cache/cache_file_header.rs

+    }
+}
+
+pub fn read_u32<R: Read>(reader: &mut R) -> Result<u32, std::io::Error> {


Why are read_u32 and write_u32 outside of the impl? They seem like useful helper functions, but they should be in the impl, right?
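For context, helpers like these do not need `Self`, and module-level free functions are a common Rust pattern for that case. The sketch below shows one plausible shape. The `read_u32` signature matches the diff above; the `write_u32` signature and the little-endian byte order are assumptions, since neither is visible in this excerpt.

```rust
use std::io::{self, Read, Write};

/// Reads exactly four bytes and decodes them as a little-endian `u32`.
/// Returns an error if fewer than four bytes are available.
pub fn read_u32<R: Read>(reader: &mut R) -> Result<u32, io::Error> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}

/// Writes `value` as four little-endian bytes.
/// (Assumed signature; the PR's version is not shown in this excerpt.)
pub fn write_u32<W: Write>(writer: &mut W, value: u32) -> Result<(), io::Error> {
    writer.write_all(&value.to_le_bytes())
}
```

If the helpers are only meaningful for the header format, dropping `pub` so they stay private to the module is another option besides moving them into the impl.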

rajatarya · 2024-10-02T18:13:16Z

chunk_cache/src/disk_cache/file_name.rs

+
+const BASE64_ENGINE: GeneralPurpose = BASE64_URL_SAFE;
+/// A file name is represented as the start index and end index of chunks for the given xorb
+/// and a timestamp of last successful access or put


Is the timestamp missing from the struct? Or should this part be removed from the comment?

rajatarya · 2024-10-02T20:23:18Z

Does the file IO handle partially written files? Do you want to leverage atomic file moves in *nix OSes? (write cache file to tmp dir and then move it to final location).
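A minimal sketch of the write-then-rename idea, assuming a temporary sibling file in the same directory as the final path (a rename across file systems can fail, so a separate tmp dir on another mount would not work). `write_atomically` and the `.tmp` suffix are hypothetical; nothing here is taken from the PR.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `data` to `final_path` by writing a temporary sibling file first
/// and renaming it into place, so readers never observe a partial file
/// at `final_path`.
pub fn write_atomically(final_path: &Path, data: &[u8]) -> io::Result<()> {
    // Build "<final_path>.tmp" next to the destination.
    let mut tmp_name = final_path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    let mut file = File::create(&tmp_path)?;
    file.write_all(data)?;
    // Flush file contents to disk before publishing the file.
    file.sync_all()?;

    // On Unix, rename replaces the destination atomically when both
    // paths are on the same file system.
    fs::rename(&tmp_path, final_path)
}
```

The fixed `.tmp` suffix means two concurrent writers of the same entry would collide, so a real implementation would likely want a unique temporary name per writer.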

assafvayner added 2 commits September 27, 2024 18:22

dissapointing save

235cea4

updates

b26f89f

assafvayner requested review from rajatarya, seanses and jgodlew October 1, 2024 00:25

assafvayner added 7 commits October 1, 2024 10:30

tests and lint

8ffb3ba

size_of

c14e28f

modify for build

ff0df1a

more size_of imports

f156c5a

lint

5219929

fix fail case

3be9a27

test updates

5ebb2ad

rajatarya reviewed Oct 2, 2024

assafvayner closed this Oct 8, 2024

assafvayner deleted the assaf/chunk_cache branch November 18, 2024 20:58
