Use mlocked KES internally #1374

Status: Open. Wants to merge 7 commits into base branch main.

Conversation

@tdammers:

This changes Consensus such that mlocked KES keys are used internally.

This is important groundwork for supporting KES agents in the future. In this form, the code will still load KES keys from disk, which is unsound, but the internal machinery is ready to also accept KES keys from other sources, and once loaded, KES keys will be handled appropriately (kept in mlocked RAM at all times, securely erased when expired).
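To make the intended lifecycle concrete, here is a minimal, self-contained Haskell sketch. All names in it (`MLockedKey`, `allocMLockedKey`, `secureErase`) are placeholders standing in for the real mlocked-memory primitives, not the actual API:

```haskell
import Control.Exception (bracket)

-- Placeholder for a KES sign key held in mlocked RAM.
data MLockedKey = MLockedKey

-- Hypothetical: allocate the key in mlocked memory, so it can never be
-- swapped out to disk.
allocMLockedKey :: IO MLockedKey
allocMLockedKey = pure MLockedKey

-- Hypothetical: overwrite the key material with zeros, then unlock and
-- free the memory.
secureErase :: MLockedKey -> IO ()
secureErase _ = pure ()

-- The key never escapes this bracket: it is erased exactly once, even if
-- an exception is thrown, which is the forward-security property the PR
-- is after.
withMLockedKey :: (MLockedKey -> IO a) -> IO a
withMLockedKey = bracket allocMLockedKey secureErase
```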

This also involves a restructuring of the HotKey data structure, which now manages not only a KES SignKey, but also the corresponding OpCert. This is necessary for two reasons:

  • With a KES agent, the OpCert will be provided along with the SignKey; this is the easiest way to make sure that the OpCert always matches the SignKey it is used with.
  • With the new structure, KES keys can be replaced in a live node kernel without having to restart it and reload the entire configuration. Because of this, the HotKey, which manages the dynamic part of the node kernel's signing mechanism, needs to be able to replace not just the SignKey (which it already did, in order to handle key evolution), but also the OpCert (which will not change when a SignKey evolves, but will change when a new SignKey is provided). A sketch of this structure follows below.
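As a rough illustration of the new structure (a sketch only; `SignKeyKES` and `OCert` here are stand-in placeholder types, and the actual types in the PR will differ), the key and its OpCert live behind a single mutable reference, so they can only ever be replaced together:

```haskell
import Control.Concurrent.MVar

-- Stand-ins for the real mlocked KES sign key and operational certificate.
data SignKeyKES = SignKeyKES
data OCert = OCert

-- Key and OpCert are bundled in one record, so a swap replaces both
-- atomically and they can never get out of sync.
data KESCredentials = KESCredentials
  { kesSignKey :: !SignKeyKES
  , kesOpCert  :: !OCert
  }

newtype HotKey = HotKey (MVar KESCredentials)

-- Install an entirely new key together with its matching OpCert.
installCredentials :: HotKey -> SignKeyKES -> OCert -> IO ()
installCredentials (HotKey var) sk oc =
  modifyMVar_ var (\_old -> pure (KESCredentials sk oc))
```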

Supersedes #1284.

@tdammers force-pushed the tdammers/mlocked-kes-rebase branch from d7d3c8f to 3c994f0 on January 30, 2025 09:44
@tdammers force-pushed the tdammers/mlocked-kes-rebase branch from 3c994f0 to cfddb02 on January 30, 2025 10:15
@tdammers marked this pull request as ready for review on February 12, 2025 15:49
Review comment on cabal.project (outdated):

Contributor:

I'm unsure we want to merge this as is. If we did so, Consensus would be unreleasable until those other dependencies are released. I think we should wait until there are no more source-repository-package stanzas.

@tdammers (Author):

Agree. This PR depends on another PR anyway, so it's not in a merge-ready state. I just want to have it as ready as I can.

@jasagredo (Contributor):

I do wonder if this impacts the forging loop. I have been following the call hierarchy a bit, and it turns out that:

  • `checkShouldForge` → `updateForgeState` → `HotKey.evolve`
  • `forgeBlock` → `forgeShelleyBlock` → `mkHeader` → `forgePraosFields` → `HotKey.sign`

Will these invoke the KES agent? If so, they might impact the forging loop and therefore block production and diffusion.

@tdammers (Author):

> I do wonder if this impacts the forging loop. I have been following the call hierarchy a bit, and it turns out that:
>
>   • `checkShouldForge` → `updateForgeState` → `HotKey.evolve`
>   • `forgeBlock` → `forgeShelleyBlock` → `mkHeader` → `forgePraosFields` → `HotKey.sign`
>
> Will these invoke the KES agent? If so, they might impact the forging loop and therefore block production and diffusion.

No.

The way the KES agent connectivity will work is that mkHotKey will spawn a separate thread that connects to a KES agent and replaces the credentials stored inside the HotKey whenever the KES agent sends any. evolve and sign will block for a short amount of time while the key is being replaced, but other than that, the block forging will be completely independent from the KES agent connection.

The only situation where this can block the block forging is when the block forging thread first starts up; at this point, it will not have a KES key until a connection to the KES agent is established and a KES key is received. This is different from the current situation, where the KES key is loaded from disk before the block forging thread starts up.
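A hedged sketch of that pattern (self-contained; `receiveFromAgent` is hypothetical, standing in for the real agent protocol): an STM variable holds the credentials, the agent thread writes into it, and the forging side blocks only until the first key arrives:

```haskell
import Control.Concurrent.STM
import Control.Monad (forever)

data KESCredentials = KESCredentials  -- placeholder for key + OpCert

newtype HotKey = HotKey (TVar (Maybe KESCredentials))

-- Starts out with no key at all.
mkHotKeySketch :: IO HotKey
mkHotKeySketch = HotKey <$> newTVarIO Nothing

-- Hypothetical stand-in for the agent connection: blocks until the agent
-- pushes a fresh key/OpCert pair.
receiveFromAgent :: IO KESCredentials
receiveFromAgent = pure KESCredentials

-- Background thread spawned alongside the HotKey: every push from the
-- agent is swapped into the shared state; the swap itself is a single
-- atomic write.
agentLoop :: HotKey -> IO ()
agentLoop (HotKey var) = forever $ do
  creds <- receiveFromAgent
  atomically (writeTVar var (Just creds))

-- Forging side (evolve/sign): retries (i.e. blocks) only while no key has
-- been received yet; once a key is present, reads are effectively free.
currentCredentials :: HotKey -> IO KESCredentials
currentCredentials (HotKey var) =
  atomically (readTVar var >>= maybe retry pure)
```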

@kevinhammond:

Can you quantify “for a short time”, please? SPOs are quite sensitive to timing issues during block forging. If it's significant, there might be operational procedures that could mitigate the issue (e.g. if it's related to the timing of KES key evolution or a KES agent restart).

@tdammers (Author):

> Can you quantify “for a short time”, please? SPOs are quite sensitive to timing issues during block forging. If it's significant, there might be operational procedures that could mitigate the issue (e.g. if it's related to the timing of KES key evolution or a KES agent restart).

Depends on the scenario.

Evolving keys will temporarily block the block forging threads for the time it takes to evolve the key; however, in the current code, the block forging thread itself will evolve keys as needed before attempting to forge, so that delay is going to be the same, modulo thread switching overhead.
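To make that concrete, a minimal sketch of the evolution path (all names are placeholders, not the real API): the forging thread evolves the key in place before signing, with no agent round-trip involved:

```haskell
import Control.Concurrent.MVar

-- Placeholders: a sign key annotated with its current KES period.
data SignKeyKES = SignKeyKES
newtype KESPeriod = KESPeriod Word deriving (Eq, Ord)

data KESCredentials = KESCredentials
  { kesPeriod  :: !KESPeriod
  , kesSignKey :: !SignKeyKES
  }

-- Hypothetical single-step evolution; the real implementation also
-- securely erases the previous evolution as it goes.
evolveOnce :: SignKeyKES -> SignKeyKES
evolveOnce = id

-- Evolve in place until the key matches the current wall-clock period.
-- The MVar is held only while evolving, which is the blocking window
-- discussed above; it is the same work the forging thread does today.
evolveTo :: KESPeriod -> MVar KESCredentials -> IO ()
evolveTo target var = modifyMVar_ var (pure . go)
  where
    go c@(KESCredentials p sk)
      | p < target = go (KESCredentials (next p) (evolveOnce sk))
      | otherwise  = c
    next (KESPeriod n) = KESPeriod (n + 1)
```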

Pushing entirely new keys works differently. In the current situation, installing a new key requires shutting down the node kernel, reloading the configuration from disk, and restarting the node kernel. In the new situation, a KES agent can just push a new key, and it will be replaced in the live block forging threads. The block forging will be blocked for the time it takes to swap out the key, but since this is effectively just an in-memory pointer swap, it's going to be extremely fast - much faster than restarting the entire node kernel.

The only situation where an extra delay is introduced that wasn't there before is when the node kernel first starts up. Right now, we load the KES key from disk before starting up the node kernel, so when the block forging thread starts, it already has the KES key. In the new situation, we will instead pass it the address of a KES agent, so the block forging thread will not have the KES key yet when it starts up, and it will block until it has established a KES agent connection and received a valid KES key. This process shouldn't take significantly longer than loading a KES key from disk (it may, in fact, even be faster, since the KES agent serves the key from RAM), but the delay occurs at a different point: after starting the block forging thread rather than before. It also means that if the KES agent connection fails, the block forging thread may remain blocked indefinitely.

In other words, failing to obtain a valid KES key will no longer prevent the block forging threads from starting; it will instead block them indefinitely.

TL;DR: I expect the overall delays to be shorter, but the order of operations during startup is different, causing potentially different failure modes.

@kevinhammond:

Hi @tdammers

Thanks! I'm trying to make sure we can answer any questions that SPOs might have, so that we encourage adoption.

> Evolving keys will temporarily block the block forging threads for the time it takes to evolve the key; however, in the current code, the block forging thread itself will evolve keys as needed before attempting to forge, so that delay is going to be the same, modulo thread switching overhead.
>
> Pushing entirely new keys works differently. In the current situation, installing a new key requires shutting down the node kernel, reloading the configuration from disk, and restarting the node kernel. In the new situation, a KES agent can just push a new key, and it will be replaced in the live block forging threads. The block forging will be blocked for the time it takes to swap out the key, but since this is effectively just an in-memory pointer swap, it's going to be extremely fast - much faster than restarting the entire node kernel.

Part of the point is that SPOs can currently control exactly when the KES evolution takes place, so they can avoid doing this when they're scheduled to make a block. That is, they can avoid any timing impact if they are careful with their procedures. In the new system, I think they would need to kill/suspend/block the KES agent to achieve the same effect (something that is presumably undesirable).

> In other words, failing to obtain a valid KES key will no longer prevent the block forging threads from starting; it will instead block them indefinitely.

This situation is interesting, since it could impact unattended operation and scheduled restarts. It might be possible to mitigate this by having OS process monitors (i.e. make sure the KES agent is running before starting a block producer). Would the node proceed "silently" in this case (i.e. would it look as if it was working, but just fail to make blocks)?

Do we know how it is currently detected? (A process monitor can presumably check whether the block forging thread has started.) Basically, this looks like a change that needs to be considered by those writing e.g. CNCLI.

(I wouldn't be too worried about small increases in startup time, of course - that can be accounted for).

Also, I'm assuming there's absolutely no need to have the KES agent running if you are e.g. operating relays/a full node wallet or some other service (I can't see why you would need to do this since KES keys are only needed for block producers!)

@tdammers (Author) commented Feb 27, 2025:

@kevinhammond

> Part of the point is that SPOs can currently control exactly when the KES evolution takes place, so they can avoid doing this when they're scheduled to make a block. That is, they can avoid any timing impact if they are careful with their procedures. In the new system, I think they would need to kill/suspend/block the KES agent to achieve the same effect (something that is presumably undesirable).

OK, so I've been assuming that each KES evolution is valid for exactly the duration of its corresponding 36-hour period, which implies that there is no point in doing the evolution any earlier or later than the exact moment the period flips over. The current code is in line with this: the block forging threads will always check whether the KES key needs evolving before attempting to forge a block (this has been the case previously, and still is), and AFAICT there is no way for SPOs to control exactly when that happens. Please correct me if I'm mistaken about this.

Also note that the KES agent does not interact with the node in order to evolve KES keys - this happens independently on both sides. The only reason the KES agent evolves keys at all is to ensure forward security on that end (we need to erase old evolutions of the key everywhere, not just inside the node process), but it will not push out a new key just because it has evolved. Killing, suspending, or blocking the KES agent will change absolutely nothing about how keys are evolved inside the node process.

The only thing SPOs can control is when an entirely new KES key is installed, but this will still be the case with the KES agent system, because a KES agent will only push a key in two situations:

  • When a node starts up and first connects to the agent (delaying this makes no sense, because without that first key, you can't forge anything anyway).
  • When the user installs a new key into the agent. However, this requires a manual intervention, just like the process it replaces, so SPOs are still completely in control as to when this happens.

> This situation is interesting, since it could impact unattended operation and scheduled restarts.

Absolutely, yes. Although you can say the same about invalid configurations that prevent a node process from starting up (or, worse, that would cause a node process to start up, but not become a block forging node).

> It might be possible to mitigate this by having OS process monitors (i.e. make sure the KES agent is running before starting a block producer).

It doesn't need to run before starting the block producer; it just needs to run at some point before the block producer is supposed to start producing blocks. If you start both together, having the agent start a couple of milliseconds after the node process wouldn't be an issue.

The KES agent is definitely not supposed to be down for extended amounts of time, though, so the plan is to run it as a systemd unit and hook into the OS's logging and process management infrastructure. But of course it can also run under some other process management system.

> Would the node proceed "silently" in this case (i.e. would it look as if it was working, but just fail to make blocks)?

There will be trace logging (I'm currently working on that), and we could possibly add some functionality for querying the block forging state of a node, with states something like the following (sketched as a data type after the list):

  • "Not a block forging node": this node has not been configured as a block forging node
  • "No agent": this node has been configured to load keys from a KES agent, but hasn't been able to make a connection on the configured address
  • "No key": this node has successfully connected to a KES agent, but hasn't received any KES key yet
  • "Forging": this node is currently forging blocks

One thing that might complicate matters a bit is that it's possible for a node to be actively forging blocks despite having temporarily lost the connection to its KES agent - as long as it still holds a valid KES key (or one that can be evolved into a valid one), it will happily continue, make regular reconnection attempts, and pick up any new KES keys once the connection comes back.

> Do we know how it is currently detected? (A process monitor can presumably check whether the block forging thread has started.)

No, I don't know how that works. I haven't looked into node / CLI source code at all so far. My guess would be that the best way to monitor this would be via trace events, or by directly querying the node kernel about its block forging threads, if any.

> Also, I'm assuming there's absolutely no need to have the KES agent running if you are e.g. operating relays/a full node wallet or some other service (I can't see why you would need to do this since KES keys are only needed for block producers!)

Correct. If you don't need to forge any blocks, you can ignore the KES agent entirely. Just like you wouldn't configure an initial KES key with the current setup, you would simply not configure a KES agent address, and the node would run as a non-block-forging instance.

-- edit --

I've opened an issue (#1405) regarding the block forging query thing, FYI.
