Watch not triggering reconciliation #1316
Comments
This appears to be a lock-up somewhere in the streaming code. It triggers consistently on this branch with a simple example, in both --release and debug builds, with either openssl or rustls. I instrumented the example by forking it and adding console instrumentation with this change:
- tracing_subscriber::fmt::init();
+ use tracing_subscriber::{prelude::*, Registry};
+ let logger = tracing_subscriber::fmt::layer().compact();
+ let console = console_subscriber::ConsoleLayer::builder()
+ .retention(std::time::Duration::from_secs(60))
+ .spawn();
+ let collector = Registry::default().with(console).with(logger);
+ tracing::subscriber::set_global_default(collector).unwrap();
and running it with
RUSTFLAGS="--cfg tokio_unstable" RUST_LOG=trace,runtime=trace,tokio=trace cargo run --example crd_watcher --release
which shows these tasks stuck: which i guess come from these lines:
that doesn't super help me yet. the task screenshot has a lot of similarities with this old hyper issue though: hyperium/hyper#2312 |
I tried one of the older workarounds: hyperium/hyper#2312 (comment) with
diff --git a/kube-client/src/client/builder.rs b/kube-client/src/client/builder.rs
index bfaa945..2073c99 100644
--- a/kube-client/src/client/builder.rs
+++ b/kube-client/src/client/builder.rs
@@ -100,7 +100,10 @@ impl TryFrom<Config> for ClientBuilder<BoxService<Request<hyper::Body>, Response
connector.set_read_timeout(config.read_timeout);
connector.set_write_timeout(config.write_timeout);
- hyper::Client::builder().build(connector)
+ hyper::Client::builder()
+ .pool_idle_timeout(std::time::Duration::from_millis(0))
+ .pool_max_idle_per_host(0)
+ .build(connector)
};
let stack = ServiceBuilder::new().layer(config.base_uri_layer()).into_inner();
That does seem to make most of these tasks disappear from the console list, but the example does not resume, so it has no practical effect: the example is still locked up. |
ok, using an aggressive combination of timeouts, the example actually does seem to move along, but sloooowly, at a snail's pace (about 1 iteration a minute, as it hits the lockup very frequently). i also tried setting the 3 timeouts to (5s, 5s, 5s) just to check if that helps much, and yes, a bit, but not a whole lot; we are then at the mercy of the watcher backoff. running the example without console took 9m to complete all iterations in release mode against a local cluster, and that's pretty bad considering how much of that is idle.
|
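To make the "timeout, then wait for backoff" behaviour described above concrete, here is a minimal std-only model. It does not use kube, hyper, or tokio, and the names `run_with_timeout` and `retry_with_backoff` are hypothetical helpers invented for this sketch. It only illustrates the general idea, under the assumption that a hung operation is abandoned after a timeout and retried after a fixed pause.

```rust
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `op` on a helper thread and stops waiting after `timeout`.
/// A hung `op` keeps running on its thread; only the caller is unblocked.
fn run_with_timeout<T, F>(op: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out; ignore that error.
        let _ = tx.send(op());
    });
    rx.recv_timeout(timeout).ok()
}

/// Hypothetical helper: retries `op(attempt)` up to `max_attempts` times,
/// sleeping `backoff` after every attempt that timed out.
/// Returns the value together with the attempt number that produced it.
fn retry_with_backoff<T, F>(
    op: F,
    max_attempts: u32,
    timeout: Duration,
    backoff: Duration,
) -> Option<(T, u32)>
where
    T: Send + 'static,
    F: Fn(u32) -> T + Send + Sync + 'static,
{
    let op = Arc::new(op);
    for attempt in 1..=max_attempts {
        let op = Arc::clone(&op);
        if let Some(value) = run_with_timeout(move || (*op)(attempt), timeout) {
            return Some((value, attempt));
        }
        // Every lock-up costs one full timeout plus one backoff pause.
        thread::sleep(backoff);
    }
    None
}
```

In this model each lock-up costs a full timeout plus a backoff pause before any progress is made, which is consistent with the example only crawling forward once timeouts are set.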
Ok, I've been playing around with this a bit more and I am starting to question the validity of the repro here tbh.
Simpler Examples
Have tried several other examples.
This one avoids the teardown + recreate (single controller), just focusing on the messages, but it still will do a single startup and teardown, so we re-run it a bunch of times.
hyperfine -w 0 -r 100 'RUST_LOG=trace cargo run --example=crd_watcher_single --release'
Benchmark 1: RUST_LOG=trace cargo run --example=crd_watcher_single --release
Time (mean ± σ): 17.034 s ± 1.486 s [User: 0.365 s, System: 0.113 s]
Range (min … max): 14.714 s … 20.305 s 100 runs
This one only starts the controller, runs a single iteration with only two patches + reconciles and shuts down. It takes 1s to run. Explicitly just trying to start the controller from scratch to see if it could happen on startup alone.
hyperfine -w 0 -r 1000 'RUST_LOG=trace cargo run --example=crd_watcher_mono --release'
Benchmark 1: RUST_LOG=trace cargo run --example=crd_watcher_mono --release
Time (mean ± σ): 1.132 s ± 0.004 s [User: 0.074 s, System: 0.039 s]
Range (min … max): 1.123 s … 1.153 s 1000 runs
so it's something particular about this particular trashing of controllers and associated objects. (if that's all that's broken, then i can live with it at the moment - i.e. it's not great, but it's not a normal use case either).
Suggestions
Some things that could be worth looking into;
that could explain why idle timeouts and max idle connections help recover (albeit very slowly). if so, that could indicate a bug in hyper, and it could be worth building with a more instrumented version of hyper to figure this out.
we are using
as a side-note, you are probably not even draining the stream. expecting one event per patch is not necessarily valid; you could have bookmark events. anyway.. it's possible this does represent a real problem, and it's possible that this should work, but it's also a very strange way of doing things. i'm going to step away from this for now, until more information surfaces. if anyone else has input i'll leave it open. |
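The side note about draining the stream can be sketched with a simplified model. The `WatchEvent` enum and the `drain_until_applied` function below are hypothetical stand-ins written for this illustration; they are not kube's actual event type or API, and matching on a `generation` field is an assumption about how a caller might recognise the state it is waiting for. The point is only that a consumer should read events until it sees that state, instead of assuming exactly one event per patch.

```rust
/// Simplified, hypothetical stand-in for a watch event (not kube's type).
#[derive(Debug, Clone, PartialEq)]
enum WatchEvent {
    Applied { name: String, generation: u64 },
    Deleted { name: String },
    /// Carries no object change; may show up between "real" events.
    Bookmark,
}

/// Consumes events until object `name` is seen at `wanted_generation` or newer.
/// Returns how many events were consumed, or `None` if the stream ended first.
fn drain_until_applied<I>(events: I, name: &str, wanted_generation: u64) -> Option<usize>
where
    I: IntoIterator<Item = WatchEvent>,
{
    let mut consumed = 0;
    for event in events {
        consumed += 1;
        match event {
            WatchEvent::Applied { name: seen, generation }
                if seen.as_str() == name && generation >= wanted_generation =>
            {
                return Some(consumed);
            }
            // Bookmarks, deletions, and stale or unrelated updates are skipped,
            // not counted as "the" event for a patch.
            _ => {}
        }
    }
    None
}
```

With an input such as `[Bookmark, Applied(gen 1), Bookmark, Applied(gen 2)]`, a caller waiting for generation 2 has to consume four events, not two, which is why counting events per patch can leave a test waiting on the wrong thing.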
Hey @zlepper, I may be way off here, but... do you still see this issue on kube 0.87.1? I ask because we run a decent sized kube project, and we hit a similar problem after upgrading to kube 0.86. Our issue was a little tricky to reproduce, but we did so via a simulation script that every 100ms takes one of the following actions:
On kube 0.86, this simulation script will eventually restart the controller, attempt a deletion and then hang forever. On kube 0.87.1, this no longer arises. In fact, I've narrowed the bug fix down to this commit (e78d488), which is present in kube 0.87.1 and not in kube 0.86. This was a surprise to us as #1324 and #1322 seem to indicate that the bug fix in e78d488 relates only to controllers running with bounded concurrency, but we're actually using the default (unbounded) and still saw the bug. Anyhow, perhaps try kube 0.87.1 and see if that resolves your issue? |
Having tested it for a couple of days it does indeed seem to have solved the issue, so great :D And thank you so much for the heads up |
@clux Is there perhaps a case to be made for yanking v0.86? The bug caused a prod incident for us, and if @zlepper's codebase has hit it too I wonder if other folks might inadvertently upgrade to this version and run into problems too. It seems that v0.87.1 doesn't exhibit the issue any more, so anybody on v0.86 should have a fairly straightforward upgrade path? |
Yeah, I think that's probably a good idea. Particularly since |
kube v0.86 (and its individual workspace crates) is now yanked as a result of this; https://crates.io/crates/kube/versions |
Current and expected behavior
Run this example: https://github.com/zlepper/kube/blob/main/examples/crd_watcher.rs
At some point it will stall without any progress.
The example sets up and tears down a controller many times to simulate a process being restarted (it was the only easy way I could reproduce this without having to manually start a process over and over).
Possible solution
No response
Additional context
I originally wrote on Discord, so this thread might contain some additional context from Clux. https://discord.com/channels/500028886025895936/1164148573571661926
Environment
Local Docker Desktop kubernetes.
Azure kubernetes (AKS).
Configuration and features
No response
Affected crates
No response
Would you like to work on fixing this bug?
no