Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chunk cache #31

Closed
wants to merge 9 commits into from
Closed

chunk cache #31

wants to merge 9 commits into from

Conversation

assafvayner
Copy link
Contributor

Chunk Cache implementation.

  • caching xorb ranges on file system

@assafvayner
Copy link
Contributor Author

This PR needs more iterations

@rajatarya
Copy link
Collaborator

Are you ready for this PR to be reviewed now?

@assafvayner
Copy link
Contributor Author

yes

Copy link
Collaborator

@rajatarya rajatarya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like some high level explanation of the design - either in a readme.md file in the crate or in the docstrings for DiskCache or ChunkCache - something that highlights that this cache stores chunk ranges from xorbs (are they chunk-aligned?) on disk using a layout of cache_dir/xorb merklehash/base64_encode(start,end,checksum).

Tests look good, but want to better understand concurrency around HashMap - imagine python code is spinning up multiple Python threads (and down the road actual system threads) - how would they interact with this cache?

(use case is hf_hub.download_snapshot() which downloads full repo).

&mut self,
key: &Key,
range: &Range,
chunk_byte_indicies: &[u32],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: indicies -> indices

(in chunk_byte_indicies).


pub use disk_cache::DiskCache;

pub trait ChunkCache {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the practice to put comments on the trait or on the implementation?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to know why this is a trait, what its responsibilities are, and what its intended uses are.

Parse(String),
#[error("bad range")]
BadRange,
#[error("cache is empty when it is presumed no empty")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rephrase: "cache is unexpectedly empty"

const BASE64_ENGINE: GeneralPurpose = BASE64_URL_SAFE;

#[derive(Debug, Clone)]
pub struct DiskCache {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comments on the pub struct.

What is the purpose of this struct? What should someone looking at this code 3mo from now without any other context understand about it?

}

impl DiskCache {
pub fn initialize<T: Into<PathBuf>>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comments on pub fn - especially things like initialize().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important to explain the template/generic usage here.

}

impl CacheFileHeader {
pub fn new<T: Into<Vec<u32>>>(chunk_byte_indicies: T) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments

}
}

pub fn deserialize<R: Read + Seek>(reader: &mut R) -> Result<Self, ChunkCacheError> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments.

})
}

pub fn serialize<W: Write>(&self, writer: &mut W) -> Result<usize, std::io::Error> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments

}
}

pub fn read_u32<R: Read>(reader: &mut R) -> Result<u32, std::io::Error> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are read_u32 and write_u32 outside of the impl? They seem like useful helper functions, but they should be in the impl, right?


const BASE64_ENGINE: GeneralPurpose = BASE64_URL_SAFE;
/// A file name is represented as the start index and end index of chunks for the given xorb
/// and a timestamp of last successful access or put
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is timestamp missing from the struct? Or should this part be removed from comment?

@rajatarya
Copy link
Collaborator

Does the file IO handle partially written files? Do you want to leverage atomic file moves in *nix OSes? (write cache file to tmp dir and then move it to final location).

@assafvayner assafvayner closed this Oct 8, 2024
@assafvayner assafvayner deleted the assaf/chunk_cache branch November 18, 2024 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants