Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PERF] prefetch blocks for fulltext index writer #3370

Merged
merged 6 commits into from
Jan 23, 2025

Conversation

codetheweb
Copy link
Contributor

@codetheweb codetheweb commented Dec 30, 2024

Description of changes

Prefetchs blocks in parallel when writing the full text index rather than fetching the blocks sequentially on-demand. Fetching in parallel saves us quite a bit of time and having prefetch as a distinct operator also allows us to mark it as I/O.

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Tested by disabling the write-through cache and observing that the prefetch step pulled in the correct blocks.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

n/a

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

rust/index/src/fulltext/types.rs Outdated Show resolved Hide resolved
rust/blockstore/src/arrow/ordered_blockfile_writer.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @sanketkedia should weigh in on this before merge, as we have operators for prefetch and established patterns for it

@codetheweb
Copy link
Contributor Author

I think @sanketkedia should weigh in on this before merge, as we have operators for prefetch and established patterns for it

I will ask
I originally sent more context to Robert in a DM, should have included it here:

I didn't implement prefetching in an operator in this first pass because it doesn't fit well into the upstream SegmentWriter abstraction & associated operators

@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch from 8b144d3 to 154c080 Compare January 10, 2025 23:36
@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch 2 times, most recently from 8ed8d3b to 931a5f2 Compare January 10, 2025 23:54
@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch from 931a5f2 to 323b15e Compare January 13, 2025 22:34
@codetheweb codetheweb changed the base branch from main to fix-arrow-cached-check-bug January 13, 2025 22:35
Copy link
Contributor Author

codetheweb commented Jan 13, 2025

@codetheweb codetheweb force-pushed the fix-arrow-cached-check-bug branch from ce5521e to 2c3884c Compare January 14, 2025 18:30
@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch from 323b15e to 31ee004 Compare January 14, 2025 18:31
@codetheweb codetheweb marked this pull request as ready for review January 14, 2025 18:35
@codetheweb codetheweb changed the base branch from fix-arrow-cached-check-bug to graphite-base/3370 January 15, 2025 02:09
@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch from 31ee004 to f7ecfd9 Compare January 15, 2025 02:09
@codetheweb codetheweb changed the base branch from graphite-base/3370 to main January 15, 2025 02:10
@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch from f7ecfd9 to a13d405 Compare January 15, 2025 02:10
@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch from a13d405 to 8c7288e Compare January 16, 2025 00:53
@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch from e0f7533 to 577da0e Compare January 16, 2025 18:35
@codetheweb codetheweb requested a review from HammadB January 17, 2025 00:51
let num_block_cache_hits = self
.num_block_cache_hits
.load(std::sync::atomic::Ordering::Relaxed);
let num_block_cache_misses = self
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Foyer already exposes the underlying hit rate and miss rate - i think this is redundant?

cache_hit: opentelemetry::metrics::Counter<u64>,

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if we are doing it - i i think it should be a metric not a trace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yep missed that
I think it's actually kinda useful having it on the trace; if all you have is a metric point with a low cache hit rate it could be pretty hard to debug
will leave out for now

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah makes sense

@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch 3 times, most recently from 77c6a76 to f7c73c4 Compare January 22, 2025 19:21
@@ -271,6 +277,21 @@ impl CompactOrchestrator {
let input = PartitionInput::new(records, self.max_partition_size);
let task = wrap(operator, input, ctx.receiver());
self.send(task, ctx).await;

let segments = self.get_all_segments().await.unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not putting this in the initial_tasks() method from the Orchestrator trait as that would require making initial_tasks() async and my understanding is that it's intended to be a static list of tasks. A cleaner solution would perhaps be adding a GetSegment operator and dispatching that from initial_tasks().)

@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch 3 times, most recently from c547c2f to a264c8f Compare January 22, 2025 20:51
@@ -100,6 +100,15 @@ impl BlockfileProvider {
};
Ok(())
}

pub async fn prefetch(&self, id: &uuid::Uuid) -> Result<usize, Box<dyn ChromaError>> {
Copy link
Contributor Author

@codetheweb codetheweb Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a new top-level API, not attached to a reader or writer
I think this makes more sense semantically since we're not fetching the blocks for any particular reader or writer instance and it lets us avoid K & V generics

@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch from a264c8f to 8241a65 Compare January 22, 2025 21:01
@codetheweb codetheweb requested a review from HammadB January 22, 2025 21:02
@@ -62,6 +80,29 @@ impl ArrowBlockfileProvider {
}
}

pub async fn prefetch(&self, id: &Uuid) -> Result<usize, ArrowBlockfileProviderPrefetchError> {
let block_ids = self
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just fetch the root and then do this? The abstractions feel a bit off to me here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fetching the root requires K to be known, which would break a general-purpose prefetch operator that has no knowledge of the concrete K/V types for each file path.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah that sucks, might be nice to comment that

@HammadB
Copy link
Collaborator

HammadB commented Jan 23, 2025

Barring any fallout from my question above this makes sense to me. Thanks for the attention to quality.

@codetheweb codetheweb force-pushed the perf-prefetch-fulltext-index-writer-blocks branch 2 times, most recently from 3440b66 to 8241a65 Compare January 23, 2025 21:46
@codetheweb codetheweb enabled auto-merge (squash) January 23, 2025 21:49
@codetheweb codetheweb merged commit b848326 into main Jan 23, 2025
79 checks passed
@codetheweb codetheweb deleted the perf-prefetch-fulltext-index-writer-blocks branch January 23, 2025 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants