Support per index cache strategy and time based caching condition #5650

Open
ftong2020 opened this issue Jan 25, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@ftong2020

ftong2020 commented Jan 25, 2025

Is your feature request related to a problem? Please describe.
Thanks for providing such an amazing piece of work. Quickwit provides (almost) everything we need for our platform.
Our workload pattern is very similar to what is described in #5445, at a much smaller scale. We currently have 11 indexes. By total split size, 2 indexes range from 100TB to 150TB, 1 index is about 20TB, and the others are well below 1TB.

The LRU cache strategy is very brittle against "big scans" that run every now and then (fewer than 10 times a day). Some work (#5469) has been done to support an LFU strategy, which might work, but it still lacks flexibility.
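To illustrate the churn, here is a minimal toy simulation (not Quickwit's actual split cache). The `LruCache` class, the split ids, and the capacity of 10 entries are all made up for the example; a real cache would be bounded by bytes rather than entry count.

```python
from collections import OrderedDict


class LruCache:
    """Toy LRU cache keyed by split id; capacity counts entries, not bytes."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, split_id: str) -> None:
        if split_id in self.entries:
            self.entries.move_to_end(split_id)  # mark as most recently used
            self.hits += 1
            return
        self.misses += 1  # a miss stands in for a download from object storage
        self.entries[split_id] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used


def run_workload() -> tuple[int, int]:
    """Return (total hits, misses on the hot set after the big scan)."""
    cache = LruCache(capacity=10)
    hot = [f"hot-{i}" for i in range(10)]
    for split_id in hot:  # warm the cache with the hot splits
        cache.get(split_id)
    for i in range(100):  # one big scan touches 100 cold splits once each
        cache.get(f"cold-{i}")
    misses_before = cache.misses
    for split_id in hot:  # the hot splits were evicted by the scan
        cache.get(split_id)
    return cache.hits, cache.misses - misses_before
```

In this toy setup a single scan pushes every hot split out of the cache, so the next round of "hot" queries goes back to object storage for all of them.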

In our case, caching is not for performance. Quickwit with no disk cache is blazing fast, which is where Quickwit's engineering truly shines. Long-range queries with term conditions (trace_id = xxx) cannot be cached effectively anyway, and downloading all splits to local disk won't help.

The actual value of the cache for us is that S3 requests are greatly reduced for repeated queries on the same data, which saves money and makes some patterns economically viable (100+ TPS of reads on the last x days of data).

Describe the solution you'd like
To mitigate the cache churn issue, I would like Quickwit to support the following features:

  1. Support a customized cache strategy for each index instead of for the whole cluster (also mentioned in Tunable Cache Eviction Policies #5445). For most write-heavy, read-never workloads (yep, logs), users are not very sensitive to latency, so we can simply disable the cache, which saves tons of disk space.
  2. Support a time range condition for cache fetching. A new configuration "cache_within" can be specified for each index, and only splits within that time range will be downloaded. For example, if someone queries an index with cache_within set to 7d (7 days), only splits less than 7 days old relative to now will be downloaded and cached; other splits simply stay in object storage.
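A rough sketch of how the two features could combine into one decision, to make the proposal concrete. Everything here is hypothetical: `IndexCachePolicy`, its fields, and `should_cache_split` do not exist in Quickwit, and measuring a split's age from the end of its time range (rather than from its creation time) is an assumption the proposal leaves open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class IndexCachePolicy:
    """Hypothetical per-index cache settings; not an existing Quickwit config."""

    enabled: bool = True
    cache_within: Optional[timedelta] = None  # None means no age limit


def should_cache_split(
    policy: IndexCachePolicy, split_time_range_end: datetime, now: datetime
) -> bool:
    """Return True if a split touched by a query should be downloaded and cached."""
    if not policy.enabled:
        return False  # feature 1: cache disabled for this index
    if policy.cache_within is None:
        return True
    # Feature 2: assumed rule, the split qualifies while its newest
    # document is younger than the configured window.
    return now - split_time_range_end < policy.cache_within


if __name__ == "__main__":
    now = datetime(2025, 1, 25, tzinfo=timezone.utc)
    traces = IndexCachePolicy(cache_within=timedelta(days=7))
    logs = IndexCachePolicy(enabled=False)
    print(should_cache_split(traces, now - timedelta(days=2), now))
    print(should_cache_split(traces, now - timedelta(days=30), now))
    print(should_cache_split(logs, now - timedelta(days=2), now))
```

Splits that fail the check would be read directly from object storage for that query and never written to the local disk cache.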

Describe alternatives you've considered
For feature request 1, we evaluated a potential solution that uses 2 searcher clusters to handle 2 groups of indexes (with cache / without cache). It is not hard to add some routing logic in the HTTP proxy in front of Quickwit, since we have to do authentication there anyway.
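The routing part of that workaround could look roughly like the sketch below. The hostnames and index ids are placeholders, and it assumes requests use a path of the form `/api/v1/<index-id>/search` with a single index id; multi-index queries and other endpoints would need extra handling.

```python
# Hypothetical searcher endpoints; replace with the real cluster addresses.
CACHED_SEARCHERS = "http://searchers-cached.internal:7280"
UNCACHED_SEARCHERS = "http://searchers-nocache.internal:7280"

# Illustrative ids of indexes whose queries should hit the cached cluster.
CACHED_INDEXES = {"traces", "metrics-recent"}


def pick_upstream(request_path: str) -> str:
    """Choose a searcher cluster from a path like /api/v1/<index-id>/search."""
    parts = request_path.strip("/").split("/")
    # Anything that does not look like an index-scoped path gets no index id
    # and therefore falls through to the uncached cluster.
    index_id = parts[2] if len(parts) >= 3 and parts[:2] == ["api", "v1"] else ""
    return CACHED_SEARCHERS if index_id in CACHED_INDEXES else UNCACHED_SEARCHERS
```

The downside is the one that motivates this request: the cache decision lives outside Quickwit and has to be kept in sync with the index list by hand.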

For feature request 2, we believe there is no alternative.

Additional context
Out of curiosity, what's the plan for this project? Are you willing to take contributions? If that is OK, we would like to try working on this issue.

@ftong2020 ftong2020 added the enhancement New feature or request label Jan 25, 2025
@esatterwhite
Collaborator

esatterwhite commented Jan 25, 2025

To add to this, if there are per-index settings, I would want a way to define them in the index template, and more specifically to define an index pattern that determines which indexes those settings apply to.

We create many indexes daily in a dynamic fashion simply by inserting data, and we don't want to hand-hold that process.

For us it would be a situation where just 1 of our users has a sudden increase in ingested data that necessitates an increase in cache capacity for their specific set of indexes, but not the entire cluster. The problem becomes how to do that. Do we have to reapply index templates, or would there be an API to adjust certain settings?

As it stands, it seems we would need multiple templates for customer-specific cache settings, which could mean hundreds to thousands of templates.

@ftong2020
Author

> To add to this, if there are per-index settings, I would want a way to define them in the index template, and more specifically to define an index pattern that determines which indexes those settings apply to.
>
> We create many indexes daily in a dynamic fashion simply by inserting data, and we don't want to hand-hold that process.

This falls under the "apply index template on dynamically created index" feature, which is out of the scope of this issue. TBH, I am not aware of this feature (where is it documented?). If the cache is configurable per index, then specifying the configuration in the template will do just fine.

Oh, maybe we should consider caching per tag as well, since splits are created independently for each tag as long as the tag cardinality does not exceed the threshold.

> For us it would be a situation where just 1 of our users has a sudden increase in ingested data that necessitates an increase in cache capacity for their specific set of indexes, but not the entire cluster. The problem becomes how to do that. Do we have to reapply index templates, or would there be an API to adjust certain settings?

Quickwit's upcoming 0.9 release will include the index update feature. Can't wait for that release! Updating the cache eviction configuration of an index is nothing but updating its definition. Since splits in the cache are immutable, I believe it is pretty straightforward to implement as well.

In terms of cascading updates (index template updates triggering index updates), I have no idea. What if you want to apply changes to only some of the indexes? What if an index has been manually updated, should we override it? I believe manually calling the index update API is much simpler and more flexible.

> As it stands, it seems we would need multiple templates for customer-specific cache settings, which could mean hundreds to thousands of templates.

Looks like your index configuration comes with very complicated logic :). If that is the case, externalizing index creation might be a better solution.

@esatterwhite
Collaborator

> "apply index template on dynamically created index" feature, which is out of the scope of this issue. TBH, I am not aware of this feature (where is it documented?)

I am not sure if it is documented, but yes, you can specify index ID patterns on a template, like Elasticsearch, and when inserting documents for indexes that do not exist, they will be created with the first template that matches their ID.

This works today, at least through the Elasticsearch _bulk endpoint. We leverage this functionality quite heavily.
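As a rough illustration of that matching behavior (not Quickwit's actual implementation), the sketch below picks the first template in a list whose glob-style pattern matches the new index ID. The `IndexTemplate` class, the `cache_settings` field, and the template and index IDs are invented for the example; how Quickwit really orders competing templates is not covered here.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class IndexTemplate:
    template_id: str
    index_id_patterns: list[str]
    # Hypothetical: where per-index cache settings could live if they existed.
    cache_settings: dict = field(default_factory=dict)


def find_template(
    index_id: str, templates: list[IndexTemplate]
) -> Optional[IndexTemplate]:
    """Return the first template with a pattern matching index_id, if any."""
    for template in templates:
        if any(fnmatchcase(index_id, p) for p in template.index_id_patterns):
            return template
    return None


TEMPLATES = [
    IndexTemplate("customer-a", ["customer-a-*"], {"cache_within": "7d"}),
    IndexTemplate("default-logs", ["*-logs-*"], {"enabled": False}),
]
```

With these placeholder templates, `customer-a-logs-2025-01-25` would pick up the `customer-a` cache settings, while `customer-b-logs-2025-01-25` would fall through to `default-logs`. Customer-specific overrides then need one template per customer, which is the scaling concern raised above.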

Your idea sounds useful, but care should be taken that it doesn't become a maintenance nightmare at larger scales.
