Use mlocked KES internally #1374
base: main
Conversation
Force-pushed from d7d3c8f to 3c994f0, then from 3c994f0 to cfddb02.
Resolved review threads on:
- ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Block/Forging.hs
- ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Block/Abstract.hs (outdated)
- ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/HardFork/Combinator/Basics.hs (outdated)
- ...ros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/HardFork/Combinator/Embed/Unary.hs (outdated)
- ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/HardFork/Combinator/Forging.hs (outdated)
- ouroboros-consensus-cardano/src/ouroboros-consensus-cardano/Ouroboros/Consensus/Cardano/Node.hs (two threads, outdated)
- ouroboros-consensus-cardano/changelog.d/20250130_093803_tdammers_mlocked_kes_rebase.md (two threads, one outdated)
Review comment on cabal.project (outdated):
I'm unsure we want to merge this as is. If we did, Consensus would be unreleasable until those other dependencies are released. I think we should wait until there are no more source-repository-package stanzas.
Agreed. This PR depends on another PR anyway, so it's not in a merge-ready state; I just want to get it as ready as I can.
I do wonder if this impacts the forging loop. I have been following the call hierarchy a bit, and it turns out that:
Will these invoke the KES agent? If so, they might impact the forging loop and therefore block production and diffusion.
No. The only situation where this can block block forging is when the block forging thread first starts up; at that point, it will not have a KES key until a connection to the KES agent has been established and a KES key has been received. This is different from the current situation, where the KES key is loaded from disk before the block forging thread starts up.
Can you quantify “for a short time”, please? SPOs are quite sensitive to timing issues during block forging. If it's significant, there might be operational procedures that could mitigate the issue (e.g. if it's related to the timing of KES key evolution or KES agent restarts).
Depends on the scenario.

Evolving keys will temporarily block the block forging threads for the time it takes to evolve the key; however, in the current code the block forging thread itself already evolves keys as needed before attempting to forge, so that delay stays the same, modulo thread switching overhead.

Pushing entirely new keys works differently. Currently, installing a new key requires shutting down the node kernel, reloading the configuration from disk, and restarting the node kernel. In the new situation, a KES agent can simply push a new key, and it will be replaced in the live block forging threads. Block forging will be blocked for the time it takes to swap out the key, but since this is effectively just an in-memory pointer swap, it will be extremely fast, much faster than restarting the entire node kernel.

The only situation where an extra delay is introduced that wasn't there before is when the node kernel first starts up. Right now, we load the KES key from disk before starting the node kernel, so when the block forging thread starts, it already has the KES key. In the new situation, we will instead pass it the address of a KES agent, so the block forging thread will not have the KES key yet when it starts up, and it will block until it has established a KES agent connection and received a valid KES key. This shouldn't take significantly longer than loading a KES key from disk (it may in fact be faster, since the KES agent serves the key from RAM), but the delay occurs at a different point: after starting the block forging thread rather than before. It also means that if the KES agent connection fails, the block forging thread may remain blocked indefinitely. In other words, failing to obtain a valid KES key will no longer prevent the block forging threads from starting; it will instead block them indefinitely.

TL;DR: I expect the overall delays to be shorter, but the order of operations during startup is different, causing potentially different failure modes.
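As a rough illustration of the two behaviours described above (the forging thread blocking until a key arrives, and key installation being a plain in-memory swap), here is a minimal sketch using assumed types; it is not the code in this PR:

```haskell
-- Hypothetical sketch only; the real PR uses ouroboros-consensus types, not these.
-- It illustrates two points from the discussion above: the forging thread blocks
-- until a key is present, and installing a new key is just an in-memory swap.
module KesSwapSketch where

import Control.Concurrent.STM

-- Placeholder for an mlocked KES sign key plus its operational certificate.
data HotKey = HotKey { hkOpCertNumber :: Int, hkKesPeriod :: Int }

-- Shared slot written by the KES agent connection, read by forging threads.
newtype KeySlot = KeySlot (TVar (Maybe HotKey))

newKeySlot :: IO KeySlot
newKeySlot = KeySlot <$> newTVarIO Nothing

-- The forging thread blocks (retries) until a key has been received.
waitForKey :: KeySlot -> IO HotKey
waitForKey (KeySlot var) = atomically $ do
  mk <- readTVar var
  maybe retry pure mk

-- Installing a key pushed by the agent is a single atomic swap;
-- forging threads observe the new key on their next read.
installKey :: KeySlot -> HotKey -> IO ()
installKey (KeySlot var) key = atomically $ writeTVar var (Just key)
```

In the real implementation the slot would hold the mlocked sign key together with its OpCert, and the old key would be securely forgotten when replaced; the sketch only shows the blocking and swapping behaviour.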
Hi @tdammers, thanks! I'm trying to make sure we can answer any questions that SPOs might have, so that we encourage adoption.
Part of the point is that SPOs can currently control exactly when the KES evolution takes place, so they can avoid doing this when they're scheduled to make a block. That is, they can avoid any timing impact if they are careful with their procedures. In the new system, I think they would need to kill/suspend/block the KES agent to achieve the same effect (something that is presumably undesirable).
This situation is interesting, since it could impact unattended operation and scheduled restarts. It might be possible to mitigate this with OS process monitors (i.e. make sure the KES agent is running before starting a block producer). Would the node proceed "silently" in this case (i.e. would it look as if it were working, but just fail to make blocks)? (I wouldn't be too worried about small increases in startup time, of course; that can be accounted for.) Also, I'm assuming there's absolutely no need to have the KES agent running if you are e.g. operating relays, a full node wallet, or some other service (I can't see why you would need to, since KES keys are only needed for block producers!)
OK, so I've been assuming that each KES evolution is valid for the exact duration of its corresponding 36-hour period, which implies that there is no point doing the evolution any earlier or later than the exact moment the period flips over. The current code is in line with this: the block forging threads will always check whether the KES key needs evolving before attempting to forge a block (this was the case previously, and still is), and AFAICT there is no way for SPOs to control exactly when that happens. Please correct me if I'm mistaken about this.

Also note that the KES agent does not interact with the node in order to evolve KES keys; this happens independently on both sides. The only reason the KES agent evolves keys at all is to ensure forward security on that end (we need to erase old evolutions of the key everywhere, not just inside the node process), but it will not push out a new key just because it has evolved. Killing, suspending, or blocking the KES agent will change absolutely nothing about how keys are evolved inside the node process.

The only thing SPOs can control is when an entirely new KES key is installed, and this will still be the case with the KES agent system, because a KES agent will only push a key in two situations:
Absolutely, yes. Although you can say the same about invalid configurations that prevent a node process from starting up (or, worse, that would cause a node process to start up, but not become a block forging node).
It doesn't need to run before starting the block producer; it just needs to run at some point before the block producer is supposed to start producing blocks. If you start both together, having the agent start a couple of milliseconds after the node process wouldn't be an issue. The KES agent is definitely not supposed to be down for extended periods of time, though, so the plan is to run it as a systemd unit and hook into the OS's logging and process management infrastructure. But of course it can also run under some other process management system.
There will be trace logging (I'm currently working on that), and we could possibly add some functionality for querying the block forging state of a node, something like:
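(Purely as a hypothetical illustration, not the author's proposal: none of the names below exist in ouroboros-consensus, and the actual design is the subject of issue #1405 mentioned further down. The sketch only shows the kind of states such a query might want to distinguish.)

```haskell
-- Hypothetical sketch of what a "block forging state" query could report.
module ForgingStateSketch where

import Data.Word (Word64)

data ForgingState
  = NotConfigured
      -- ^ No forging credentials and no KES agent address configured.
  | WaitingForKesKey
      -- ^ Forging thread started, but no valid KES key received yet.
  | Forging
      { fsOpCertNumber   :: Word64  -- ^ OpCert issue number of the installed key
      , fsKesPeriod      :: Word64  -- ^ KES period the key is currently evolved to
      , fsAgentConnected :: Bool    -- ^ Whether the KES agent connection is currently up
      }
  deriving (Show)
```

The separate connection flag reflects the point made below: a node can keep forging with the key it already holds even while the agent connection is temporarily down.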
One thing that might complicate matters a bit is that it's possible for a node to be actively forging blocks despite having temporarily lost the connection to its KES agent - as long as it still holds a valid KES key (or one that can be evolved into a valid one), it will happily continue, make regular reconnection attempts, and pick up any new KES keys once the connection comes back.
No, I don't know how that works. I haven't looked into node / CLI source code at all so far. My guess would be that the best way to monitor this would be via trace events, or by directly querying the node kernel about its block forging threads, if any.
Correct. If you don't need to forge any blocks, you can ignore the KES agent entirely. Just like you wouldn't configure an initial KES key with the current setup, you would simply not configure a KES agent address, and the node would run as a non-block-forging instance.

Edit: I've opened an issue (#1405) regarding the block forging query, FYI.
This changes Consensus such that mlocked KES keys are used internally.
This is important groundwork for supporting KES agents in the future. In this form, the code will still load KES keys from disk, which is unsound, but the internal machinery is ready to also accept KES keys from other sources, and once loaded, KES keys will be handled appropriately (kept in mlocked RAM at all times, securely erased when expired).
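To make the "other sources" point concrete, here is a hedged sketch with placeholder names (none of these are the PR's actual API): the forging machinery only cares that it obtains an mlocked key from some source and that the key is securely forgotten afterwards.

```haskell
-- Hypothetical sketch (assumed names, not the PR's actual API) of accepting
-- KES keys from different sources while guaranteeing secure erasure.
module KeySourceSketch where

import Control.Exception (bracket)

-- Placeholder for an mlocked KES sign key; the real code uses the mlocked
-- types from cardano-crypto-class.
data MLockedKesKey = MLockedKesKey

data KesKeySource
  = FromDisk FilePath  -- current (unsound) behaviour: read a key file at startup
  | FromAgent String   -- future behaviour: receive keys over a KES agent connection

-- Acquire a key from whatever source is configured, use it, and securely
-- erase it afterwards, even if an exception occurs.
withKesKey :: KesKeySource -> (MLockedKesKey -> IO a) -> IO a
withKesKey source = bracket (acquire source) securelyForget
  where
    acquire :: KesKeySource -> IO MLockedKesKey
    acquire _ = pure MLockedKesKey  -- placeholder acquisition

    securelyForget :: MLockedKesKey -> IO ()
    securelyForget _ = pure ()      -- placeholder for zeroing the mlocked memory
```

The bracket pattern is the relevant design point: secure erasure happens regardless of how the key was obtained and even if the using code throws.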
This also involves a restructuring of the HotKey data structure, which now manages not only a KES SignKey, but also the corresponding OpCert. This is necessary for two reasons:

Supersedes #1284.
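As a rough sketch of that restructuring (field and constructor names are assumptions, not the PR's definitions), the hot key bundles the sign key with its OpCert so both are replaced together:

```haskell
-- Illustrative sketch only; these are not the actual HotKey definitions in
-- this PR. The point is that the sign key and its OpCert travel together, so
-- a pushed key replaces both at once and a key without a matching OpCert can
-- never be used for forging.
module HotKeySketch where

import Data.Word (Word64)

data MLockedSignKeyKES = MLockedSignKeyKES   -- placeholder for the mlocked sign key
data OpCert = OpCert { ocertN :: Word64 }    -- placeholder for the operational certificate

data HotKeyState
  = NoHotKey
      -- ^ No key installed yet (e.g. still waiting for the KES agent).
  | HotKeyInstalled
      { hkSignKey   :: MLockedSignKeyKES  -- current evolution of the KES sign key
      , hkOpCert    :: OpCert             -- OpCert issued for this sign key
      , hkKesPeriod :: Word64             -- KES period the key is evolved to
      }
```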