
Add request limit cache #1275

Merged · JesseAbram merged 17 commits into master on Feb 6, 2025
Conversation

@JesseAbram (Member) commented Jan 31, 2025

This originally started off as a threaded test for multiple txs. While writing it, I found that the increment_or_wipe_request_limit function made a reserve_key kvdb call that did not handle the tx overload well and caused failures. This was solved by adding a cache and moving the request limit into it.

The cache probably should have existed before; I will open an issue about moving things out of the kvdb that shouldn't have been stored there. #1280

I will also open an issue to create a deletion strategy for the request limit, as it is currently O(n) (this problem existed with the kvdb too). #1281

We should also maybe merge the other app_state hashmaps into the cache. #1282

@JesseAbram changed the title from "Overload signing test" to "Add Cache" on Feb 4, 2025
@JesseAbram marked this pull request as ready for review on February 5, 2025 at 00:21
@ameba23 (Contributor) left a comment

The signing stress test is great ⭐

And the cache is definitely an improvement, since this stuff does not need to be stored on disk.

But I would have done it a bit differently - rather than a generalised cache, I would have separate data structures for the different things we need to store.

So for the request limit I would have the AppState property be:

request_limit: Arc<RwLock<HashMap<VerifyingKey, RequestLimitStorage>>>

and have similar properties for the other things we need to store (we could put them in a sub-struct if AppState is starting to get messy).

The advantages:

  • Less time spent waiting for a lock to become free: with everything behind a single RwLock, the chances of it being locked are much higher. I think putting them all behind the same Arc is fine.
  • Faster, because you don't need to serialize and deserialize values.
  • The type system guarantees that entries are well formed - impossible to have a failed lookup or corrupt value.
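For illustration, a minimal sketch of what that could look like - the RequestLimitStorage fields and the VerifyingKey alias here are stand-ins for the example, not the crate's actual definitions:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Stand-in types for the sketch; the real crate defines its own.
type VerifyingKey = [u8; 33];

#[derive(Clone)]
struct RequestLimitStorage {
    block_number: u32,
    request_amount: u32,
}

#[derive(Clone, Default)]
struct AppState {
    // One typed map per concern instead of a single HashMap<String, Vec<u8>>.
    request_limit: Arc<RwLock<HashMap<VerifyingKey, RequestLimitStorage>>>,
}

impl AppState {
    // Typed lookup: no serde round trip, and a missing key is just None.
    fn request_limit_for(&self, key: &VerifyingKey) -> Option<RequestLimitStorage> {
        self.request_limit.read().ok()?.get(key).cloned()
    }
}
```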

impl AppState {
    /// Setup AppState with given secret keys
    pub fn new(
        configuration: Configuration,
        kv_store: KvManager,
        pair: sr25519::Pair,
        x25519_secret: StaticSecret,
        cache: Cache,
@ameba23 (Contributor):

nit: It looks like we always call this constructor with the cache being empty - so it would be slightly less code to create a new cache here rather than have it be passed in.

@JesseAbram (Member, Author):

Ya, but if we want to test some pre-existing state it would be easier to pass it in... however, I did create the unsafe calls for that.

@HCastano (Collaborator):

I would also initialize it to Default internally instead of passing it in. There are methods to write into the cache which can be used for testing.
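A sketch of that suggestion, with stand-in types rather than the crate's own:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Stand-in for the Cache alias discussed further down in this review.
type Cache = Arc<RwLock<HashMap<String, Vec<u8>>>>;

struct AppState {
    cache: Cache,
}

impl AppState {
    // No `cache` parameter: every call site passed an empty cache anyway.
    fn new() -> Self {
        Self { cache: Cache::default() }
    }
}
```

Tests that need pre-populated state would then write through the cache methods after construction.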

crates/threshold-signature-server/src/lib.rs - two outdated review threads, resolved
self.clear_poisoned_cache();
let cache =
    self.cache.read().map_err(|_| anyhow!("Error getting read_from_cache lock"))?;
Ok(cache[key].clone())
@ameba23 (Contributor):

We should definitely use .get here and return an Option, as this will panic if the key does not exist.

Also, it gives us a way to check whether the key exists and get it if it does in one go. Otherwise we end up locking the RwLock once for the existence check, unlocking, then immediately locking again to get the value.
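A sketch of the suggested shape, assuming the same anyhow-based error handling as the snippet above:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

use anyhow::anyhow;

struct AppState {
    cache: Arc<RwLock<HashMap<String, Vec<u8>>>>,
}

impl AppState {
    // One lock acquisition: `get` checks existence and fetches the value in
    // one go, and a missing key becomes None rather than a panic.
    fn read_from_cache(&self, key: &str) -> anyhow::Result<Option<Vec<u8>>> {
        let cache =
            self.cache.read().map_err(|_| anyhow!("Error getting read_from_cache lock"))?;
        Ok(cache.get(key).cloned())
    }
}
```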

@@ -344,6 +352,46 @@ impl AppState {
))
}

pub fn write_to_cache(&self, key: String, value: Vec<u8>) -> anyhow::Result<()> {
@ameba23 (Contributor):

nit: I'm not super keen on using anyhow here, as strictly speaking this is library code. It probably doesn't really matter, but in the best case we'd have a CacheError.
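A sketch of such an error type, assuming the thiserror crate; the variant names are illustrative, not from the PR:

```rust
use thiserror::Error;

// A dedicated error type instead of anyhow, so library callers can match on
// the failure cases.
#[derive(Debug, Error)]
pub enum CacheError {
    #[error("Cache lock poisoned")]
    LockPoisoned,
    #[error("No value found for key `{0}`")]
    NotFound(String),
}
```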

@JesseAbram (Member, Author):

I'm OK with that. I saw your method; I think I was planning a method with multiple errors and then never ended up writing one.

@JesseAbram (Member, Author) commented Feb 5, 2025

The signing stress test is great ⭐

And the cache is definitely an improvement, since this stuff does not need to be stored on disk.

But I would have done it a bit differently - rather than a generalised cache, I would have separate data structures for the different things we need to store.

So for the request limit I would have the AppState property be:

request_limit: Arc<RwLock<HashMap<VerifyingKey, RequestLimitStorage>>>

and have similar properties for the other things we need to store (we could put them in a sub-struct if AppState is starting to get messy).

The advantages:

  • Less time spent waiting for a lock to become free: with everything behind a single RwLock, the chances of it being locked are much higher. I think putting them all behind the same Arc is fine.
  • Faster, because you don't need to serialize and deserialize values.
  • The type system guarantees that entries are well formed - impossible to have a failed lookup or corrupt value.

Hmmmm, I see the benefit. The issue is that it means writing a lot more code: instead of a generalised get or write call, it becomes multiple functions per storage unit. I'm on the fence. Actually, don't respond here - let's talk about it on #1282.

@HCastano (Collaborator) left a comment

I'm with Peg here - we should be leveraging the type system more, and also defining the cache a bit better (but I guess that's being discussed in #1282).


@@ -344,6 +352,46 @@ impl AppState {
))
}

pub fn write_to_cache(&self, key: String, value: Vec<u8>) -> anyhow::Result<()> {
@HCastano (Collaborator):

All these public methods should be documented.

        self.cache.clear_poison()
    }
}

// TODO delete from cache
@HCastano (Collaborator):

Leftover TODO

@@ -344,6 +352,46 @@ impl AppState {
))
}

pub fn write_to_cache(&self, key: String, value: Vec<u8>) -> anyhow::Result<()> {
    self.clear_poisoned_cache();
@HCastano (Collaborator):

Not sure we should just be bypassing this check every time we try and write.

For example, let's say there's a thread that's updating the request limit and it keeps panicking when trying to update the limit. By bypassing this we could end up with inconsistent data here, potentially leading to the limit just being ignored.
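For comparison, a sketch of the stricter alternative - surfacing the poisoned state to the caller instead of clearing it on every write (names hypothetical):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

struct Cache {
    inner: RwLock<HashMap<String, Vec<u8>>>,
}

impl Cache {
    // Refuse to write through a poisoned lock; the caller decides whether
    // clearing the poison and retrying is actually safe.
    fn write(&self, key: String, value: Vec<u8>) -> Result<(), String> {
        let mut map = self
            .inner
            .write()
            .map_err(|_| "cache lock poisoned; refusing to write".to_string())?;
        map.insert(key, value);
        Ok(())
    }
}
```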

@JesseAbram (Member, Author):

Only for one block, which would be better than a poisoned cache forever making it unable to sign.

Comment on lines 261 to 262
/// A global cache type for the TSS
pub type Cache = Arc<RwLock<HashMap<String, Vec<u8>>>>;
@HCastano (Collaborator):

Kinda strange to have this between the struct definition and the implementation block.

Comment on lines 261 to 262
/// A global cache type for the TSS
pub type Cache = Arc<RwLock<HashMap<String, Vec<u8>>>>;
@HCastano (Collaborator):

By using a Vec<u8> value we lose type guarantees here for no real benefit. We know exactly what type(s) we want to store, so IMO we should just use them directly.

@@ -602,7 +601,7 @@ pub async fn check_for_key(account: &str, kv: &KvManager) -> Result<bool, UserEr
/// Checks the request limit
pub async fn request_limit_check(
rpc: &LegacyRpcMethods<EntropyConfig>,
kv_store: &KvManager,
app_state: &AppState,
@HCastano (Collaborator):

I'd try to limit the scope of what we pass in here. We don't need the whole of AppState, just the cache.

@JesseAbram (Member, Author):

Currently all the helper functions are on AppState; #1282 should handle that by creating a sub-impl for the cache.

@@ -429,6 +440,84 @@ async fn signature_request_with_derived_account_works() {
clean_tests();
}

#[tokio::test]
#[serial]
async fn signature_request_overload() {
@HCastano (Collaborator):

What are the actual assertions needed for this test? I don't see anything (either success or failure)


@HCastano (Collaborator) commented Feb 5, 2025

@JesseAbram

@dvdplm suggested that we can probably get away with a non-poisoning lock here instead of explicitly ignoring the poisoned lock. parking_lot::RwLock fits the description here, so maybe consider using that?
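For reference, a minimal sketch of the parking_lot variant (assuming the parking_lot crate as a dependency):

```rust
use std::collections::HashMap;

use parking_lot::RwLock;

struct Cache {
    inner: RwLock<HashMap<String, Vec<u8>>>,
}

impl Cache {
    // parking_lot locks never poison: `write()` returns the guard directly,
    // so there is no Result to handle and no clear_poison step.
    fn write(&self, key: String, value: Vec<u8>) {
        self.inner.write().insert(key, value);
    }

    fn read(&self, key: &str) -> Option<Vec<u8>> {
        self.inner.read().get(key).cloned()
    }
}
```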

@JesseAbram (Member, Author):
@JesseAbram

@dvdplm suggested that we can probably get away with a non-poisoning lock here instead of explicitly ignoring the poisoned lock. parking_lot::RwLock fits the description here, so maybe consider using that?

From the docs: "No poisoning, the lock is released normally on panic." That doesn't help the initial concern of the data not being updated... which, again, is mostly being off by one, which isn't a big deal - but locking up the signing flow is. Anyway, I can look into it more on Monday and make an issue.

@HCastano (Collaborator) commented Feb 5, 2025

From the docs: "No poisoning, the lock is released normally on panic." That doesn't help the initial concern of the data not being updated... which, again, is mostly being off by one, which isn't a big deal - but locking up the signing flow is. Anyway, I can look into it more on Monday and make an issue.

One of the benefits here is that we lean into the fact that we're not offering any guarantees around a non-poisoned lock state - as opposed to hiding that from the caller by clearing the poisoned state internally

@JesseAbram (Member, Author):

From the docs: "No poisoning, the lock is released normally on panic." That doesn't help the initial concern of the data not being updated... which, again, is mostly being off by one, which isn't a big deal - but locking up the signing flow is. Anyway, I can look into it more on Monday and make an issue.

One of the benefits here is that we lean into the fact that we're not offering any guarantees around a non-poisoned lock state - as opposed to hiding that from the caller by clearing the poisoned state internally

Right, but that means we either do this for all the RwLocks here or just this one? I'm writing this one; I know that a poisoned lock should not lock up the request limit, but I'm not sure about the other RwLocks in app_state. Would you rather be explicit about this or have two different RwLock libraries?

@JesseAbram changed the title from "Add Cache" to "Add Request limit cache" on Feb 5, 2025
@JesseAbram changed the title from "Add Request limit cache" to "Add request limit cache" on Feb 5, 2025
@HCastano (Collaborator) commented Feb 5, 2025

Right, but that means we either do this for all the RwLocks here or just this one? I'm writing this one; I know that a poisoned lock should not lock up the request limit, but I'm not sure about the other RwLocks in app_state. Would you rather be explicit about this or have two different RwLock libraries?

Not sure - depends on the other data. Let's leave it as is for now, and make a concrete decision in #1282?

@JesseAbram (Member, Author):

Right, but that means we either do this for all the RwLocks here or just this one? I'm writing this one; I know that a poisoned lock should not lock up the request limit, but I'm not sure about the other RwLocks in app_state. Would you rather be explicit about this or have two different RwLock libraries?

Not sure - depends on the other data. Let's leave it as is for now, and make a concrete decision in #1282?

Lol, pretty much what I was thinking. Also, to be honest, talking about this mutex problem is fun - I'd like to bring it up on our next call... just 'cause it's fun.

@JesseAbram merged commit 0fed38b into master on Feb 6, 2025 · 8 checks passed
@JesseAbram deleted the overloaded-signing-test branch on February 6, 2025 at 01:37