-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chunk cache #31
chunk cache #31
Conversation
This PR needs more iterations |
Are you ready for this PR to be reviewed now? |
yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like some high level explanation of the design - either in a readme.md file in the crate or in the docstrings for DiskCache or ChunkCache - something that highlights that this cache stores chunk ranges from xorbs (are they chunk-aligned?) on disk using a layout of cache_dir/xorb merklehash/base64_encode(start,end,checksum).
Tests look good, but want to better understand concurrency around HashMap - imagine python code is spinning up multiple Python threads (and down the road actual system threads) - how would they interact with this cache?
(use case is hf_hub.download_snapshot() which downloads full repo).
&mut self, | ||
key: &Key, | ||
range: &Range, | ||
chunk_byte_indicies: &[u32], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: indicies -> indices
(in chunk_byte_indicies).
|
||
pub use disk_cache::DiskCache; | ||
|
||
pub trait ChunkCache { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the practice to put comments on the trait or on the implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to know why this is a trait, what its responsibilities are, and what its intended uses are.
Parse(String), | ||
#[error("bad range")] | ||
BadRange, | ||
#[error("cache is empty when it is presumed no empty")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rephrase: "cache is unexpectedly empty"
const BASE64_ENGINE: GeneralPurpose = BASE64_URL_SAFE; | ||
|
||
#[derive(Debug, Clone)] | ||
pub struct DiskCache { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments on the pub struct.
What is the purpose of this struct? What should someone looking at this code 3mo from now without any other context understand about it?
} | ||
|
||
impl DiskCache { | ||
pub fn initialize<T: Into<PathBuf>>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comments on pub fn - especially things like initialize().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important to explain the template/generic usage here.
} | ||
|
||
impl CacheFileHeader { | ||
pub fn new<T: Into<Vec<u32>>>(chunk_byte_indicies: T) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments
} | ||
} | ||
|
||
pub fn deserialize<R: Read + Seek>(reader: &mut R) -> Result<Self, ChunkCacheError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments.
}) | ||
} | ||
|
||
pub fn serialize<W: Write>(&self, writer: &mut W) -> Result<usize, std::io::Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments
} | ||
} | ||
|
||
pub fn read_u32<R: Read>(reader: &mut R) -> Result<u32, std::io::Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are read_u32
and write_u32
outside of the impl? They seem like useful helper functions, but they should be in the impl, right?
|
||
const BASE64_ENGINE: GeneralPurpose = BASE64_URL_SAFE; | ||
/// A file name is represented as the start index and end index of chunks for the given xorb | ||
/// and a timestamp of last successful access or put |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is timestamp missing from the struct? Or should this part be removed from comment?
Does the file IO handle partially written files? Do you want to leverage atomic file moves in *nix OSes? (write cache file to tmp dir and then move it to final location). |
Chunk Cache implementation.