
guardrails: attribution slow on dotcom #54950

Closed
keegancsmith opened this issue Jun 20, 2023 · 14 comments

@keegancsmith
Member

Getting 5s+ attribution search times for search queries which should be very very fast. This is proving a bit tricky to track down due to unreliability in our traces at the moment.

@keegancsmith keegancsmith self-assigned this Jun 20, 2023
@keegancsmith keegancsmith added perf dot-com Applies to sourcegraph.com only labels Jun 20, 2023
@keegancsmith
Member Author

We are doing searches for large strings and the performance is surprisingly slow. I think it is zoekt that is just slow in this case. We are setting max repo match limits.

My current theory is that iterateNgrams is slow in this use case. The theory still needs to be tested, but I believe we spend far longer there than we do for normal searches.

iterateNgrams constructs the document iterator based on the ngram index. To do this it looks up every ngram in the query (and possibly all casings) to find the rarest ngrams (or an example of an ngram which doesn't exist). Because the lookup for this is via a btree, this likely does a lot of random accesses across all shards.

I suspect that once the iterator has been constructed, the search itself is fast for large queries, since large queries are more likely to contain two rare trigrams. Just finding those trigrams is the issue.
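
To make the suspected hot path concrete, here is a rough Go sketch of the work described above. This is not zoekt's actual code: splitTrigrams and the freq callback are hypothetical stand-ins for the real ngram splitting and the per-shard btree lookup.

// A simplified sketch of the lookup work described above, not zoekt's actual
// code: splitTrigrams and the freq callback stand in for the real ngram
// splitting and the per-shard btree lookup.
package main

import "fmt"

type ngram uint64

// splitTrigrams packs each run of three runes into a single uint64
// (21 bits per rune, similar to zoekt's ngram encoding).
func splitTrigrams(q string) []ngram {
	rs := []rune(q)
	var out []ngram
	for i := 0; i+3 <= len(rs); i++ {
		out = append(out, ngram(rs[i])<<42|ngram(rs[i+1])<<21|ngram(rs[i+2]))
	}
	return out
}

// rarestNgram returns the least frequent trigram of the query, or skip=true
// if any trigram is absent from the index (the shard cannot match at all).
// Each freq call is a separate index lookup, i.e. a potential random access.
func rarestNgram(query string, freq func(ngram) (uint32, bool)) (rarest ngram, skip bool) {
	best := ^uint32(0)
	for _, ng := range splitTrigrams(query) {
		c, ok := freq(ng)
		if !ok {
			return 0, true
		}
		if c < best {
			best, rarest = c, ng
		}
	}
	return rarest, false
}

func main() {
	// Fake per-shard frequencies, purely for illustration.
	index := map[ngram]uint32{}
	for i, ng := range splitTrigrams("const data: Event = {") {
		index[ng] = uint32(100 + i)
	}
	ng, skip := rarestNgram("const data", func(n ngram) (uint32, bool) {
		c, ok := index[n]
		return c, ok
	})
	fmt.Println(ng, skip)
}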

I'm trying to think up some SearchResults.Stats we can add, or other observability, to try to confirm this.

@stefanhengl wdyt?

@stefanhengl
Member

stefanhengl commented Jun 21, 2023

I think your suspicion might be correct. The way we get the frequencies is just inefficient. We could store the frequencies in the btree, which would save us one read of the posting list's simple section. In addition we could sort the ngrams before looking up the frequencies. This should reduce disk access because the ngrams in the buckets are sorted too.

EDIT:

We could store the frequencies in the btree

Right now we construct the btree on read. For the buckets, which remain on disk, we just reinterpret the existing ngram section. If we want to attach metadata to the items in the buckets we have to change the index format, because I assume we would like to store the frequency right next to the ngram instead of getting it from the simple section of the ngram's posting list.

Sorting the ngrams before we get the frequencies might already give us some benefits without changing the format.
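
Something along these lines (a minimal sketch, assuming an ngram type that orders the same way as the on-disk ngram section; the freq callback again stands in for the btree lookup):

package sketch

import "sort"

type ngram uint64

// sortedFrequencies sorts the candidate ngrams before looking them up, so
// that consecutive lookups tend to hit the same (already cached) btree
// bucket instead of jumping around the file. Purely illustrative.
func sortedFrequencies(ngrams []ngram, freq func(ngram) uint32) map[ngram]uint32 {
	sorted := append([]ngram(nil), ngrams...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	out := make(map[ngram]uint32, len(sorted))
	for _, ng := range sorted {
		out[ng] = freq(ng)
	}
	return out
}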

@stefanhengl
Member

Assuming we are not saturating the disk, we could add some concurrency to frequency lookup in iterateNgrams.
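
Something like the sketch below, assuming the lookups are independent and the disk has spare capacity (not real zoekt code; freq stands in for the frequency lookup):

package sketch

import "sync"

type ngram uint64

// concurrentFrequencies fans the per-ngram frequency lookups out over a
// bounded number of goroutines so several reads can be in flight at once.
// Only worth it if the disk is not already saturated.
func concurrentFrequencies(ngrams []ngram, freq func(ngram) uint32, parallelism int) []uint32 {
	out := make([]uint32, len(ngrams))
	sem := make(chan struct{}, parallelism) // bounds in-flight lookups
	var wg sync.WaitGroup
	for i, ng := range ngrams {
		wg.Add(1)
		sem <- struct{}{}
		go func(i int, ng ngram) {
			defer wg.Done()
			defer func() { <-sem }()
			out[i] = freq(ng)
		}(i, ng)
	}
	wg.Wait()
	return out
}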

@keegancsmith
Member Author

I spent a bunch of time adding general observability to the search client, then I realised I can just look at our zoekt instrumentation, since zoekt seems to be the root cause. So I ran a query which took 5s. This is faster than I have been seeing but still too slow; it is likely faster thanks to turning on case sensitivity last night. Got this from honeycomb for that query:

{
  "Timestamp": "2023-06-22T07:47:31.590105624Z",
  "actor": "0",
  "category": "SearchAll",
  "duration_ms": 5189,
  "events": 4,
  "filematches": 2,
  "opts.estimate_doc_count": false,
  "opts.flush_wall_time_ms": 0,
  "opts.max_doc_display_count": 5,
  "opts.max_wall_time_ms": 59999,
  "opts.shard_max_match_count": 10000,
  "opts.shard_repo_max_match_count": 1,
  "opts.total_max_match_count": 100000,
  "opts.use_document_ranks": false,
  "opts.whole": false,
  "pod_name": "sourcegraph-frontend-8c6b48cb4-j8szf",
  "query": "(and case_content_substr:\"const data: Event = { ...JSON.parse(message.data), type: message.event }\" branch=\"HEAD\" rawConfig:RcOnlyPublic|RcNoForks|RcNoArchived)",
  "stats.content_bytes_loaded": 126962274,
  "stats.crashes": 0,
  "stats.file_count": 2,
  "stats.files_considered": 4423,
  "stats.files_loaded": 4307,
  "stats.files_skipped": 0,
  "stats.flush_reason": "none",
  "stats.index_bytes_loaded": 5659533,
  "stats.match_count": 2,
  "stats.ngram_matches": 7255,
  "stats.regexps_considered": 0,
  "stats.shard_files_considered": 0,
  "stats.shards_scanned": 8921,
  "stats.shards_skipped": 0,
  "stats.shards_skipped_filter": 537798,
  "stats.wait_ms": 0,
  "stream.total_send_time_ms": 0
}

Alright so when it comes to our options, they seem sub-optimal given we are not doing ranking. We don't need 100k results if we are only returning 5. I don't think this is the root cause of the slowness of this query but for more common patterns it will be an issue.

"stats.files_considered": 4423,
"stats.files_loaded": 4307,

These numbers are super low, which means we only attempted to search a very small number of files. So our slowness is related to index-related operations.

"stats.shards_scanned": 8921,
"stats.shards_skipped": 0,
"stats.shards_skipped_filter": 537798,

This means we ruled out the majority of shards. Ruling shards out does interact with the index to look up which trigrams exist.

"stats.content_bytes_loaded": 126 962 274,
"stats.index_bytes_loaded":     5 659 533,

126 MB of data is not a lot to do substring searches over. The access will be somewhat random, though, which could account for the 5s time. 5 MB of index is not much at all.

One surprising thing is that those stats give an average file size of about 29 KB (126,962,274 bytes / 4,307 files), which is large for code. For example, the ts and tsx files in our monorepo have an average size of 3.5 KB. But I don't think that is the root cause.

I'm feeling much more confident in the idea that how we look up trigrams needs improvement. This is actually a place where bloom filters would be useful again.

I am going to add a statistic to zoekt to track how much work we do in matchtree construction to try and narrow this down.

@keegancsmith
Member Author

We could store the frequencies in the btree

Right now we construct the btree on read. For the buckets, which remain on disk, we just reinterpret the existing ngram section. If we want to attach metadata to the items in the buckets we have to change the index format, because I assume we would like to store the frequency right next to the ngram instead of getting it from the simple section of the ngram's posting list.

It would at least halve the amount of IO we do, which is great. But on its own I don't think that is good enough.

Sorting the ngrams before we get the frequencies might already give us some benefits without changing the format.

Indeed, I am thinking along these lines. Given that in most cases we find a trigram which rules out the need to do the search, implementing a binary search over a sorted candidate set will be more friendly to the page cache.

Assuming we are not saturating the disk, we could add some concurrency to frequency lookup in iterateNgrams.

I am unsure how much this would help; I always assume we are saturating the disk already. But it is a good idea to check.

@stefanhengl
Member

You could also just disable the btree locally and compare. This would be a smoking gun indeed.

@keegancsmith
Member Author

I added some attribution-specific observability and had in my plans for today to:

  • add a stat to zoekt related to ngram lookups
  • reproduce locally

I don't think I'll get to the latter two today, so will attempt them on Monday. Below are my full notes.

*** TODO stat around ngram lookups
    SCHEDULED: <2023-06-26 Mon>
    - Rescheduled from "[2023-06-23 Fri]" on [2023-06-23 Fri 13:42]
This is just to confirm that this value is much higher. It doesn't really tell
us how the IO is behaving.
*** TODO try reproduce locally
    SCHEDULED: <2023-06-26 Mon>
    - Rescheduled from "[2023-06-23 Fri]" on [2023-06-23 Fri 13:42]
- [ ] get a large corpus running locally. Maybe snapshot prod which only has
  public code into a GCS bucket.
- [ ] reproduce an attribution search with same opts
- [ ] reproduce with btree turned off
*** TODO implement smart frequency lookup
This is something to try out once we can reproduce locally.

@keegancsmith
Member Author

This issue is still in progress. Since my last update on Friday I have mostly been focused on helping out with the release of 5.1. However, I did send out a draft PR to discuss how to instrument zoekt: sourcegraph/zoekt#606. It is a little trickier than I had first hoped.

@keegancsmith
Member Author

tl;dr: abandoning btree instrumentation and will just instrument ngram lookup counts. A btree lookup should cause at most one page fault.

btree and page alignment

The non-leaf nodes exist in memory. The leaf nodes are where we do the binary search, and each should roughly represent a page. I don't think we actually do anything to make them page aligned; @stefanhengl mentioned he couldn't affect the number of page faults in his testing, so he didn't add that logic (if I remember what he said correctly).

Given we are doing the binary search on one page of data, it isn't worth capturing how many accesses we do here. And to find the page we only access in-memory data. Given both of those, it's likely only useful to track how many lookups we do in total, so we don't need btree-specific instrumentation.

I decided to go down the rabbit hole to work out how we could do page alignment if we wanted it. For mmap in zoekt we do what is called a NULL mmap of the file. We make the length aligned with the page size (although I don't think this is necessary for a NULL mmap). A NULL mmap means the OS picks where to start the mapping, which means it will be page aligned. To confirm we do a NULL mmap, I followed our mmap calls all the way down to the syscall and confirmed the first argument (void *addr) was 0.

Out of interest, we can get hold of page size with os.Getpagesize() (and use it in our implementation of bufferSize). Given our mmap files are page aligned, we could probably construct the btree along page boundaries.
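
For reference, a small illustration of the page-alignment arithmetic. This is not something zoekt does today; the bucket offset is made up.

package main

import (
	"fmt"
	"os"
)

func main() {
	page := os.Getpagesize() // typically 4096

	// Hypothetical byte offset of a btree bucket inside the mmapped shard.
	// Because a NULL mmap starts page aligned, rounding bucket offsets to
	// page boundaries would keep each bucket within a single page.
	bucketOffset := 123456
	pageStart := bucketOffset &^ (page - 1) // round down to a page boundary
	fmt.Printf("page size %d: bucket at offset %d lives in the page starting at %d\n",
		page, bucketOffset, pageStart)
}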

reproducing locally

We discussed doing this with just one big repo, but we need lots of shards to reproduce the problem. Looking at the stats, you can see that "stats.shards_skipped_filter": 537798 plus "stats.shards_scanned": 8921 equals the total number of shards in our cluster (546,719). I.e. it looks like our time is dominated by ruling out shards. We confirmed those two numbers sum to the number of shards by checking the metric zoekt_shards_total.

@lrhacker lrhacker transferred this issue from sourcegraph/sourcegraph-public-snapshot Jul 13, 2023
@keegancsmith keegancsmith transferred this issue from sourcegraph/cody Jul 14, 2023
@keegancsmith
Member Author

I can somewhat reproduce locally and have a way to measure improvements here, but it isn't ideal. Jump to the conclusion below for next steps. This also appears in the search-scratch journal.

methodology

I would run the below command twice to fill up the page cache first. The interesting thing to look at in the profiles is how much time is taken by ngramFrequency in iterateNgrams, since that represents the work done to look up in the btree. I also added back the old in-memory map based code for experimentation.

My changes for testing are at zoekt#613: "zoekt: implement mode which has same behaviour as attribution search".

$ echo -n "const data: Event = { ...JSON.parse(message.data), type: message.event }" | go run ./cmd/zoekt -v -exact_stdin -index_dir ~/zoekt-dotcom-subset-2023-06-29 -cpu_profile profile.cpu
2023/07/13 09:54:26 loading 854 shard(s): github.com%2F214140846%2FTOGO_School_Miniprograme_v16.00000.zoekt, github.com%2F2fd%2Fgraphdoc_v16.00000.zoekt, github.com%2F5t3ph%2Fa11y-color-tokens_v16.00000.zoekt, github.com%2F97-things%2F97-things-every-programmer-should-know_v16.00000.zoekt, github.com%2FADmad%2Fcakephp-jwt-auth_v16.00000.zoekt... 849 more
2023/07/13 09:54:26 query: case_content_substr:"const data: Event = { ...JSON.parse(message.data), type: message.event }"
2023/07/13 09:54:26 Displaying matches...
client/cody-shared/src/sourcegraph-api/completions/browserClient.ts:48:                const data: Event = { ...JSON.parse(message.data), type: message.event }
2023/07/13 09:54:27 stats: zoekt.Stats{ContentBytesLoaded:81980, IndexBytesLoaded:441, Crashes:0, Duration:1481452, FileCount:1, ShardFilesConsidered:0, FilesConsidered:5, FilesLoaded:5, FilesSkipped:0, ShardsScanned:12, ShardsSkipped:0, ShardsSkippedFilter:841, MatchCount:1, NgramMatches:5, NgramLookups:11954, Wait:2860, RegexpsConsidered:0, FlushReason:0x0}

cpu profiles

Interestingly, the btree is faster for me locally. It seems that without the btree the "combinedNgramOffset" code did much more work in sort.Search. The btree has ngramFrequency at 33% of total run time vs 42%:

  • inmem: 11%
  • btree: 33%
  • ascii: 42%

Interestingly, splitNGrams is taking 16% of the time. If we introduce an intermediate language that we compile once per search rather than per shard, this will go away. Additionally, when looking at real-life profiles I often see regex compilation ranking high, which would also go away with that feature. However, in practice I imagine splitNGrams does not take up that much time, since on my local machine the page cache pretty much doesn't miss.

full profiles (fgprof)

This will hopefully be a better measure since it includes off-CPU time. But I found it quite hard to compare these profiles. The time iterateNgrams spent in ngramFrequency was:

  • inmem: 0.23 / 0.79 = 0.29
  • btree: 1.13 / 1.65 = 0.68
  • ascii: 1.73 / 2.15 = 0.80

end to end time

I just checked how many searches I could do in 10s. This was pretty close between them, but my machine won't thrash the page cache as much as production does, so I don't think this is a valid way to measure.

  • inmem: 685.523250 searches/s
  • btree: 647.085918 searches/s
  • ascii: 633.632068 searches/s

conclusion

Locally it was hard to reproduce large discrepancies in overall time, since the page cache would have been well utilized. However, looking at the proportional time spent in iterateNgrams, it is clear we can interact with the disk more effectively.

A nice win for the btree is that it is faster (and more memory efficient) than the combinedNgramOffset implementation we added to replace the in-memory map.

Proposal:

  • implement batch frequency lookups in the btree
  • maybe: an IR between matchtree and query to avoid repeating expensive/heap work like regex compilation and splitNGrams (roughly sketched below).
  • remove combinedNgramOffset support
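
To make the IR bullet concrete, here is a rough sketch of what a once-per-search compiled form could look like. This is completely hypothetical; none of these types exist in zoekt.

package sketch

import "regexp"

type ngram uint64

// compiledQuery is a hypothetical intermediate representation that is built
// once per search and then reused by every shard, instead of repeating the
// regex compilation and trigram splitting per shard.
type compiledQuery struct {
	substr string
	ngrams []ngram        // trigrams of substr, computed once
	re     *regexp.Regexp // compiled once instead of once per shard
}

func compileQuery(substr, pattern string, split func(string) []ngram) (*compiledQuery, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return &compiledQuery{substr: substr, ngrams: split(substr), re: re}, nil
}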

@keegancsmith
Member Author

@stefanhengl and I did some pairing on next steps. Copy-paste below of the journal entry at https://github.com/sourcegraph/search-scratch/blob/master/2023/journal.org#keeganstefan-btree-pairing-2023-07-14-fri-1418

List is ordered highest prio to lowest.

  • experiment with sorting before lookup
    • improves chances we do ngram lookups in the same btree bucket.
    • got a PR for this, will see effect in production next week.
    • still need to deploy and then measure in prod.
    • keegan on monday.
  • experiment with looking up ngrams without doing the posting list lookup.
    • If we find any missing ngrams we can abort early and avoid all IO to do with posting lists.
    • alternative to striping
    • requires an extra |ngram| allocation
    • stefan on monday, then keegan joins in when done measuring above.
  • tiny bloom filter?
    • per btree
    • maybe per block
    • if anyMissing([]ngram) { return early } (a toy sketch follows this list).
    • could experiment with this at load time, so probably worth it?
    • check out xorfilter and related newness.
  • IR to make stuff free.
    • unsure if it improves attribution slowness. But feels like a good general improvement.
  • stripe ngram with simplesection so that when we do a lookup, we only do one disk access instead of two.
    • this requires reindexing so is operationally a little more difficult.
  • Batch btree lookups.
    • Get(ngram) Get([]ngram)
    • I don't think this will be worth it. Complicated and there are only 2 inner levels in the btree. So the sorting above likely has the same perf.
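
As promised in the bloom filter bullet, a toy sketch of the anyMissing idea. Nothing here exists in zoekt; the sizing and hash choice are made up.

package sketch

type ngram uint64

// tinyBloom is a toy per-btree (or per-bucket) filter that answers "is this
// trigram possibly present?" so a query with any definitely-missing trigram
// can skip the shard without touching the btree or the posting lists at all.
type tinyBloom struct {
	bits [8]uint64 // 512 bits; real sizing would depend on the ngram count
}

func (b *tinyBloom) add(n ngram) {
	h1, h2 := uint(n%512), uint((n>>21)%512)
	b.bits[h1/64] |= 1 << (h1 % 64)
	b.bits[h2/64] |= 1 << (h2 % 64)
}

func (b *tinyBloom) mayContain(n ngram) bool {
	h1, h2 := uint(n%512), uint((n>>21)%512)
	return b.bits[h1/64]&(1<<(h1%64)) != 0 && b.bits[h2/64]&(1<<(h2%64)) != 0
}

// anyMissing reports whether some query trigram is definitely not in the
// shard, in which case the whole shard can be skipped early.
func anyMissing(b *tinyBloom, ngrams []ngram) bool {
	for _, n := range ngrams {
		if !b.mayContain(n) {
			return true
		}
	}
	return false
}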

@keegancsmith
Member Author

Since the last update we have done the following:

  • Implemented performance monitoring via search-blitz.
  • Implemented and rolled out sorted ngrams lookups.
  • Implemented avoiding posting list lookups.

We deployed sorted ngrams lookups too soon after the performance monitoring rollout and only had about 1h30m of data before the change. The sorted ngrams lookup appears to drastically reduce variance (which implies less IO), but there are still some spikes.

We have reverted the deployment to allow collecting more performance data from before the change; i.e. we want to compare a similar 24-hour period before declaring success.

We don't have enough scale on these queries to look at p99/etc. The p75 is around 900ms, which is likely acceptable for LLM filtering but possibly not for completions. I will reach out to get an idea of what success looks like here. Note: I am looking at zoekt in honeycomb for this, which excludes slowness from gitserver/postgres/etc. Additionally, Prometheus is misleading for this.

Next up:

  • Analyse data after 24h of old zoekt code running.
  • Deploy posting list lookups code and then analyse after 24 hours.
  • Dive into some traces collected to look for patterns.

Expand below for some graphs of the perf:

All graphs include about 2h of the unoptimized code running

Rollout of sorted ngrams

[screenshot]

First 24 hours

[screenshot]

one_line query heatmap

https://ui.honeycomb.io/sourcegraph/datasets/search-zoekt/result/fQKXB4hh6KU

many_lines query heatmap

https://ui.honeycomb.io/sourcegraph/datasets/search-zoekt/result/3WjamSobCm3

@keegancsmith
Member Author

Comparing the 24-hour periods with and without sorted ngrams, it is very clear the change drastically reduced the variance. See below for the heatmaps of the two queries (sorted ngrams is the left/day before).

[heatmap screenshots]

We just deployed a further improvement which should cut the IO done by the btree in half in the case of attribution search. We will monitor how that improves performance.

Our investigation into the spikes and traces concluded that these were general slowdowns that happen across all zoekt searches. The causes generally seem to be a large number of searches with count:all. The source of those seems to be queries generated by our app while browsing. We need to investigate further, but in general some sort of priority service/etc would be useful before GA to guarantee SLAs for more important queries.

Next up:

  • We are shifting our focus to discuss/plan next job fair work a bit.
  • After that there are more stats we might add to help diagnose zoekt performance.
  • Additionally we need to clean up tracing in zoekt. It is missing important information as well as being suboptimal when viewed via Google traces.
  • Analyse data once we collect enough data points on the latest version of zoekt.

@keegancsmith
Member Author

After gathering more data we are happy with the performance so gonna close this out :)

There are still some general zoekt slowdowns that happen on dotcom, but we will follow up when we feel the pain more acutely.
