-
Notifications
You must be signed in to change notification settings - Fork 302
Post Incident Review Deposit Processing Performance
On March 15 2022 from approximately 17:02 UTC (slot 3378308) a large number of Teku nodes began exhibiting increased garbage collection activity resulting in high CPU usage. Depending on the node this persisted for 30-60 minutes and may have caused the node to fall behind.
A smaller recurrance of this occurred from approximately slot 3380288 and a third smaller still at approximately slot 3386466.
For 20 minutes prior to the initial increased GC activity, monitoring showed Teku consistently lagging one slot behind, indicating slow import of blocks. During this period CPU usage was significantly lower than normal.
A similar pattern was observed for the second and third events.
During the major incidents mentioned above, most Teku nodes could not keep up with the current beacon chain head. This resulted in varying combinations of increased rates of incorrect head votes, increased attestation inclusion delays or entirely missed attestations. Some nodes may have crashed with OutOfMemoryError
.
Outside of the major incidents, any block containing deposits is likely to have been imported late resulting in Teku producing attestations with the incorrect head block.
Validator keys were not at risk.
Teku 22.3.2 has been released with a fix for the performance issue in deposit processing that was the root cause of this problem. Users are recommended to upgrade to 22.3.2.
Further work is on-going to improve the resiliency and performance of Teku and avoid the escalation of problems seen in this incident. See the detailed technical analysis for further information.
Due to the way eth1 voting works, deposits are included in the beacon chain in batches, typically about 7 hours apart. Each block can contain up to 16 deposits so the batch of deposits to be included when the eth1 voting process agrees on a new eth1 chain head are spread across as many blocks as required to fit them in.
Prior to the first incident an unusually large number of deposits, around 4000, were included in the beacon chain. These were included in blocks 3378203 to 3378453 (250 slots).
Prior to the second incident of increased GC activity, deposits were included from blocks 3380265 to 3389388 (123 slots).
Prior to the third incident of increased GC activity, deposits were included from blocks 3386401 to 3386500 (99 slots).
There is a correlation between the number of deposits being included and the severity of the performance problems Teku nodes experience. Notably the processing of deposits starts some time prior to GC and CPU activity increasing. Since then deposit levels have returned to more normal, lower levels and Teku nodes and the impact of performance issues was limited to a period of incorrect head votes while processing blocks containing deposits, without triggering the major impacts with significantly increased GC and CPU.
As part of processing deposits, beacon nodes must determine if the deposit is for a new validator or for an existing validator by checking if the public key in the deposit matches the public key of an existing validator. Teku's implementation of this check was not using the validator public key caches but inefficiently iterating through all validators in the state to check their public key. As the number of validators has increased, that check has become more expensive.
This inefficient loop explains the period of time where Teku nodes were consistently one block behind and the increase in incorrect head votes during periods where deposits are included. However with blocks limited to just 16 deposits, Teku is able to import each block within the 12 second slot time as seen by the period of time where Teku is consistently one slot behind. Only after a consistent period of deposits being included is there a change from being one slot behind to dramatically falling behind the chain head.
This change is shown in the graph below of "trailing distance": current_slot - head_block_slot.
Note that due to the way Prometheus samples data and the potential for empty slots on the beacon chain it is normal for the trailing distance to toggle between 0 and 1. From just after 16:40 the trailing distance begins toggling between 1 and 2. Then from just after 17:00 the trailing distance grows more rapidly until at 17:10 the node is making essentially no progress on importing blocks. Eventually the node switches back to long sync mode and recovers.
Understanding this change in behaviour requires looking at the GC activity.
This shows GC time for young (blue) and old generations (yellow) overlaid on the previous graph of trailing distance (green). On the far left of the graph we see normal GC behaviour with no old generation GC being required and young generation mostly under 2% with periodic spikes (typically aligning with epoch transitions). As Teku begins trailing one slot behind we see more regular and larger spikes of young GC. The two inflection points in trailing distance correlate to two points where old generation GC occurs.
The increase in young GC activity suggests there was an increase in the rate of object allocation. To verify this was caused by the same inefficient deposit processing loop we monitoring all memory allocations in a profiler while syncing the portion of the beacon chain containing deposits from the first incident. With the inefficient loop a total of 117Gb of memory was allocated and the fixed version only allocated 342Mb. Note that this measures allocations, not total memory requirements at a single point in time.
Modern JVMs are extremely efficient at allocating and collecting short-lived objects, particularly when only used as a local variable within a method. However, given the extremely high rate of allocation caused by this inefficient loop, the allocations "overflow" the young generation and are promoted to the old generation which is significantly more expensive to collect. The details of this process and the various trade-offs the garbage collector makes varies with different JVMs and memory levels but given a high enough rate of memory allocation for a long enough period, eventually a full garbage collection will be triggered. This is the spike in the yellow old gen GC time in the graphs.
Old gen GC by itself uses some CPU but we can see that even after the old gen GC completes, the slow block processing continues. The final contributing factor here is the way Teku leverages a cache of beacon states to enable fast access to the data required for processing blocks and validating incoming gossip. These states are held in a cache using a Java feature called soft references.
A soft reference, tells the garbage collector that the referenced object (the BeaconState
in this case) is useful but not essential. Normally the object is held in memory but if memory is tight, it can be garbage collected to free up space. If the state is collected but is later required again, Teku can regenerate it by starting from an earlier state that is available and applying blocks to regenerate the required state.
Teku is intelligent about handling these regeneration tasks, detecting an optimal state to start from and queuing requests that would benefit from starting from the result of another request.
Using soft references in this way can significantly increase resilience and avoid crashing with OutOfMemoryError
during periods of higher memory usage such as long periods of non-finalization.
By looking at the number of cached states, we can see that when the old gen GC ran, it collected nearly all the states in the cache.
Given the number of cached states fairly quickly increased again, it's clear that many of the collected states turned out to be required again immediately. Regenerating them however required processing the same blocks containing deposits that initially caused the problem. Thus, the regeneration was slower than usual and created a significant amount of new garbage to collect, causing additional garbage collection events which remove states from the cache again.
Since states are used for both importing blocks and validating other gossip like attestations, there is significant demand for state regeneration and it is this process that results in the "death spiral" which prevents Teku from making any real progress to importing blocks.
Generally nodes recovered from this issue by falling far enough behind the beacon chain head that the long sync process began, during which Teku unsubscribes from gossip since it knows it's behind the chain head and cannot validate or use the information from gossip. Since the majority of a node's workload is generated by gossip, unsubscribing allows it to much more rapidly make progress. That in turn rebuilds the in-memory caches and is aided by the finalized checkpoint progressing to beyond the blocks with deposits, removing the original source of the issue.
Nodes with a sufficiently large max heap size either didn't exhibit the same kind of old gen GC that caused issues or were able to make sufficient progress between collections that they recovered without trigging the long sync process.
Teku 22.3.2 has now been released which replaces the inefficient loop over all validators during deposit processing with a fast lookup via the validator public key register Teku maintains. This resolves the root cause of the issue by ensuring that deposits are efficiently processed in terms of both CPU and memory allocation.
Work has begun to improve the way Teku utilises it's cached states to avoid causing cascading failures like this one when the cache is suddenly collected. In particular, a significant amount of gossip can be validated using a small number of states, reducing the number of states that need to be regenerated.
We are also reviewing the sizings and usage of the different types of cache that Teku uses to ensure that they are balanced correctly and memory usage is prioritised appropriately.
These improvements will ship in future releases as they are ready.