guardrails: attribution slow on dotcom #54950
We are doing searches for large strings and the performance is surprisingly slow. I think it is zoekt that is just slow in this case. We are setting max repo match limits. My current theory is that the cost is in constructing the ngram iterator. Once the iterator has been constructed, I would suspect it is then fast for large queries, since large queries are more likely to contain two rare trigrams; just finding those trigrams is the issue. I'm trying to think up some SearchResults.Stats we can add, or other observability, to try to confirm this. @stefanhengl wdyt?
I think your suspicion might be correct. The way we get the frequencies is just inefficient. We could store the frequencies in the btree, which would save us one read of the posting list's simple section. In addition, we could sort the ngrams before searching for the frequencies. This should reduce disk access because the ngrams in the buckets are sorted too. EDIT:
Right now we construct the btree on read. For the buckets, which remain on disk, we just reinterpret the existing ngram section. If we want to attach metadata to the items in the buckets we have to change the index format, because I assume we would like to store the frequency right next to the ngram instead of getting it from the simple section of the ngram's posting list. Sorting the ngrams before we get the frequencies might already give us some benefit without changing the format.
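For illustration, a minimal Go sketch of the "sort before looking up frequencies" idea. The `indexShard` interface, `Frequency` method, and `frequenciesSorted` helper are hypothetical stand-ins, not zoekt's actual internals; the sketch also shows the early exit when a trigram is missing, since a zero-frequency trigram rules out the whole shard.

```go
// Sketch only: hypothetical stand-ins for zoekt's shard reader, illustrating
// sorted frequency lookups with an early exit on a missing trigram.
package sketch

import "sort"

type ngram uint64

// indexShard stands in for the on-disk shard. Frequency is assumed to read
// the ngram's entry from the (sorted) on-disk ngram section.
type indexShard interface {
	Frequency(g ngram) (count uint32, ok bool)
}

// frequenciesSorted looks the ngrams up in ascending order, so consecutive
// reads touch nearby pages, and stops as soon as one ngram is absent.
func frequenciesSorted(shard indexShard, grams []ngram) (map[ngram]uint32, bool) {
	sorted := append([]ngram(nil), grams...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	freqs := make(map[ngram]uint32, len(sorted))
	for _, g := range sorted {
		count, ok := shard.Frequency(g)
		if !ok || count == 0 {
			return nil, false // query cannot match anything in this shard
		}
		freqs[g] = count
	}
	return freqs, true
}
```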
Assuming we are not saturating the disk, we could add some concurrency to the frequency lookup.
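A rough sketch of what bounded concurrency for the frequency lookups could look like, assuming the disk is not already saturated; `lookup` is a placeholder for whatever reads a single ngram's frequency, and the worker limit is arbitrary.

```go
// Sketch only: fan out frequency reads with a bounded number of workers.
package sketch

import "sync"

func concurrentFrequencies(grams []uint64, lookup func(uint64) uint32, workers int) map[uint64]uint32 {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		freqs = make(map[uint64]uint32, len(grams))
		sem   = make(chan struct{}, workers) // caps in-flight disk reads
	)
	for _, g := range grams {
		g := g
		wg.Add(1)
		sem <- struct{}{}
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			count := lookup(g) // one read of the ngram's frequency
			mu.Lock()
			freqs[g] = count
			mu.Unlock()
		}()
	}
	wg.Wait()
	return freqs
}
```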
I spent a bunch of time adding general observability stuff to the search client, then I realised I can just look into our zoekt instrumentation, since zoekt seems to be the root cause. So I did a query which took 5s. This is faster than I have been seeing but still too slow; it is likely faster thanks to turning on case sensitivity last night. Got this from honeycomb for that query:

{
"Timestamp": "2023-06-22T07:47:31.590105624Z",
"actor": "0",
"category": "SearchAll",
"duration_ms": 5189,
"events": 4,
"filematches": 2,
"opts.estimate_doc_count": false,
"opts.flush_wall_time_ms": 0,
"opts.max_doc_display_count": 5,
"opts.max_wall_time_ms": 59999,
"opts.shard_max_match_count": 10000,
"opts.shard_repo_max_match_count": 1,
"opts.total_max_match_count": 100000,
"opts.use_document_ranks": false,
"opts.whole": false,
"pod_name": "sourcegraph-frontend-8c6b48cb4-j8szf",
"query": "(and case_content_substr:\"const data: Event = { ...JSON.parse(message.data), type: message.event }\" branch=\"HEAD\" rawConfig:RcOnlyPublic|RcNoForks|RcNoArchived)",
"stats.content_bytes_loaded": 126962274,
"stats.crashes": 0,
"stats.file_count": 2,
"stats.files_considered": 4423,
"stats.files_loaded": 4307,
"stats.files_skipped": 0,
"stats.flush_reason": "none",
"stats.index_bytes_loaded": 5659533,
"stats.match_count": 2,
"stats.ngram_matches": 7255,
"stats.regexps_considered": 0,
"stats.shard_files_considered": 0,
"stats.shards_scanned": 8921,
"stats.shards_skipped": 0,
"stats.shards_skipped_filter": 537798,
"stats.wait_ms": 0,
"stream.total_send_time_ms": 0
}

Alright, so when it comes to our options, they seem sub-optimal given we are not doing ranking. We don't need 100k results if we are only returning 5. I don't think this is the root cause of the slowness of this query, but for more common patterns it will be an issue.
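For reference, a sketch of what tighter limits could look like for attribution search, where at most 5 results are displayed and there is no ranking. The field names are inferred from the opts.* keys in the trace above and should be checked against the zoekt version in use; the concrete values are illustrative only.

```go
// Sketch only: limits sized for "we only ever show 5 results, unranked".
package sketch

import (
	"time"

	"github.com/sourcegraph/zoekt"
)

func attributionSearchOptions() *zoekt.SearchOptions {
	return &zoekt.SearchOptions{
		MaxWallTime:            10 * time.Second,
		MaxDocDisplayCount:     5,  // we only display a handful of results
		ShardRepoMaxMatchCount: 1,  // one match per repo is enough for attribution
		ShardMaxMatchCount:     10, // was 10000 in the trace above
		TotalMaxMatchCount:     10, // was 100000; no ranking, so no need for the extra work
	}
}
```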
Stats like files_considered (4423) and files_loaded (4307) are super low, which means we only attempted to search a very small number of files. So the slowness is related to the index, not to scanning file contents.
shards_skipped_filter is 537798 vs only 8921 shards scanned, which means we ruled out the majority of shards. That filtering does interact with the index to look up which trigrams exist.
126MB of data is not a lot to do substring searches over. The access will be somewhat random though, which could account for the 5s time. 5MB of index is not much at all. One surprising thing is that those stats give us an average file size of about 29KB (126962274 bytes / 4307 files loaded), which is large for code; for example, the ts and tsx files in our monorepo have an average size of 3.5KB. But I don't think that is the root cause. I'm feeling much more confident in the idea that this is related to how we look up trigrams needing improvement. This is actually a place where bloom filters would be useful again. I am going to add a statistic in zoekt to track how much work we do in matchtree construction to try and narrow this down.
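The kind of statistic meant here could be as simple as a counter around the frequency lookup used during matchtree construction; the wrapper below is purely illustrative and is not zoekt's actual Stats struct.

```go
// Sketch only: count how many index frequency reads a query's matchtree
// construction performs, so the number can be reported per search.
package sketch

type countingFrequencies struct {
	lookup       func(gram uint64) uint32
	NgramLookups int // incremented once per frequency read
}

func (c *countingFrequencies) Frequency(gram uint64) uint32 {
	c.NgramLookups++
	return c.lookup(gram)
}
```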
It would at least halve the amount of IO we do, which is great. But alone I don't think this is good enough.
Indeed, I am thinking along those lines. Given that in most cases we find a trigram which rules out the need to do the search, implementing a binary search with a sorted candidate set will make us more friendly to the page cache.
I am unsure how much this would help; I always assume we are saturating the disk already. But this is a good idea to check.
You could also just disable the btree locally and compare. This would be a smoking gun indeed.
I added some attribution-specific observability and had in my plans for today to:
I don't think I'll get to the latter two today, so I will attempt them on Monday. Below are my full notes.
This issue is still in progress. Since my last update on Friday I have mostly been focused on helping out with the release of 5.1. However, I did send out a draft PR to discuss how to instrument zoekt. It is a little trickier than I had first hoped: sourcegraph/zoekt#606
tl;dr: abandoning btree instrumentation; will just instrument ngram lookup counts. A btree lookup should only cause at most 1 page fault.

**btree and page alignment**

The non-leaf nodes exist in memory. The leaf nodes are where we do the binary search, and each should represent a page. I don't think we actually do anything to make them page aligned; @stefanhengl mentioned he couldn't affect the number of page faults in his testing, so he didn't add this logic (if I remember what he said correctly). Given we are doing the binary search on one page of data, it isn't worth capturing how many accesses we do here, and to find the right page we only access in-memory data. Given both of those, it is likely only useful to track how many lookups we do in total, so we do not need btree-specific instrumentation.

I decided to go down the rabbit hole to work out how we could do page alignment if we wanted it. For mmap in zoekt we do what is called a NULL mmap of the file. We make the length aligned with the page size (although I don't think this is necessary for a NULL mmap). A NULL mmap means the OS picks where the memory starts, which means it will be page aligned. To confirm we do a NULL mmap I followed our mmap calls all the way down to the syscall, and confirmed the first argument (the address hint) is NULL. Out of interest, we can get hold of the page size from the standard library; see the sketch after this comment.

**reproducing locally**

Discussed doing this with just one big repo, but we need lots of shards to reproduce. When looking at the stats above you can see how many shards dotcom has (over 500k were skipped by the filter alone).
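A minimal sketch of the NULL-mmap and page-size points above. Go's syscall.Mmap takes no address argument, so it always performs a NULL mmap and the returned mapping is page aligned by construction; os.Getpagesize gives the page size. This is a generic Unix sketch, not zoekt's actual mmap code.

```go
// Sketch only (Unix): map a whole file read-only with a NULL mmap.
package sketch

import (
	"os"
	"syscall"
)

func mmapWholeFile(f *os.File) ([]byte, error) {
	fi, err := f.Stat()
	if err != nil {
		return nil, err
	}
	pageSize := os.Getpagesize()
	// Round the length up to a page multiple, mirroring the length alignment
	// described above; the kernel would round up internally anyway.
	length := (int(fi.Size()) + pageSize - 1) &^ (pageSize - 1)
	// syscall.Mmap has no addr parameter: the address hint is always NULL,
	// so the OS chooses a page-aligned start for the mapping.
	return syscall.Mmap(int(f.Fd()), 0, length, syscall.PROT_READ, syscall.MAP_SHARED)
}
```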
I can somewhat reproduce locally and have a way to measure improvements here, but it isn't ideal. Jump to the conclusion below for next steps. This also appears in the search-scratch journal.

**methodology**

I would run the below command twice to fill up the page cache first. The interesting thing to look at in profiles is how much time is taken by ngramFrequency. My changes for testing are at "zoekt: implement mode which has same behaviour as attribution search" (sourcegraph/zoekt#613).
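For context, one common way to capture both kinds of profiles mentioned below is to expose the standard pprof endpoints plus fgprof (github.com/felixge/fgprof), which also samples off-CPU time such as goroutines blocked on disk reads. This is a generic setup sketch, not necessarily how the zoekt instrumentation is wired up; the port is arbitrary.

```go
// Sketch only: serve on-CPU (pprof) and on+off-CPU (fgprof) profiles.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux

	"github.com/felixge/fgprof"
)

func main() {
	http.DefaultServeMux.Handle("/debug/fgprof", fgprof.Handler())
	// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// go tool pprof http://localhost:6060/debug/fgprof?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```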
**cpu profiles**

Interestingly, the btree is faster for me locally. It seems that without the btree, the "combinedNgramOffset" code did much more work in sort.Search. With the btree, ngramFrequency is at 33% of total run time vs 42% without it.
Interestingly, splitNGrams is taking 16% of the time. If we introduce an intermediate language that we compile once per search, rather than per shard, this will go away. Additionally, when looking at real-life profiles I often see regex compilation being high, which would also go away with that feature. However, in practice I imagine splitNGrams does not take up that much time, since on my local machine the page cache pretty much doesn't miss.

**full profiles (fgprof)**

This will hopefully be a better measure since it includes off-CPU time. But I found it quite hard to compare these profiles. The time iterateNgrams spent in ngramFrequency was:
**end to end time**

I just checked how many searches I could do in 10s. This was pretty close between the two, but my machine won't thrash the page cache as much as production, so I don't think this is a valid way to measure.
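The crude end-to-end check described above amounts to a timed loop; a sketch, with runSearch standing in for whatever issues the attribution query against the local zoekt instance.

```go
// Sketch only: count how many searches complete in a fixed window.
package sketch

import "time"

func searchesWithin(window time.Duration, runSearch func() error) (int, error) {
	deadline := time.Now().Add(window)
	n := 0
	for time.Now().Before(deadline) {
		if err := runSearch(); err != nil {
			return n, err
		}
		n++
	}
	return n, nil
}
```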
**conclusion**

Locally it was hard to reproduce large discrepancies in overall time, since the page cache would have been well utilized. However, when looking at proportional time spent in ngramFrequency, the btree came out ahead. A nice win for the btree is that it is faster (and more memory efficient) than the combinedNgramOffset implementation we added to replace the in-memory map. Proposal:
@stefanhengl and I did some pairing on next steps. Copy-pasted below is the journal entry at https://github.com/sourcegraph/search-scratch/blob/master/2023/journal.org#keeganstefan-btree-pairing-2023-07-14-fri-1418. The list is ordered from highest priority to lowest.
Since the last update we have done the following:
We deployed sorted ngram lookups too soon after the performance monitoring rollout and only had about 1h30m of data from before the change. The sorted ngram lookup appears to drastically reduce variance (which implies less IO), but there are still some spikes. We have reverted the deployment to collect more performance data from before the change, i.e. we want to compare a similar 24-hour period before declaring success. We don't have enough scale on these queries to look at p99/etc. The p75 is around 900ms, which is likely acceptable for LLM filtering but possibly not for completions. I will reach out to get an idea of what success looks like here. Note: I am looking at zoekt in honeycomb for this, which excludes slowness from gitserver/postgres/etc. Additionally, prometheus is misleading. Next up:
Expand below for some graphs of the perf. All graphs include about 2h of the unoptimized code running.

**Rollout of sorted ngrams**

**First 24 hours**
Comparing the 24-hour periods of sorted ngrams, it is very clear it drastically reduced the variance. See below for the heatmaps of the two queries (sorted ngrams is the left/day before). We just deployed a further improvement which should cut the IO done by the btree in half in the case of attribution search; we will monitor how that improves performance. Our investigation into spikes and traces concluded that these were general slowdowns that happen across all zoekt searches. The causes generally seem to be a large number of searches. Next up:
After gathering more data we are happy with the performance, so gonna close this out :) There are still some general zoekt slowdowns that happen on dotcom, but we will follow up when we feel the pain more acutely.
Getting 5s+ attribution search times for search queries which should be very very fast. This is proving a bit tricky to track down due to unreliability in our traces at the moment.