20.08.2021 Orphaned Blocks on Ethereum: Incident Postmortem
On Friday night last week (Aug 20), a misconfiguration in Chorus One’s Lido Ethereum beacon chain setup led to a network-wide drop in participation of about 1% on the beacon chain (as measured by attestation rates). The issue was fixed within hours of its discovery by the teams involved, and only minor financial impact occurred.
The following post-mortem was primarily written by the Chorus One team and highlights the steps they took, and will be taking, to avoid such issues going forward.
We want to thank everyone that contributed to identifying the issue and finding a solution.
Detailed Analysis
A helpful summary of the issue and follow-on discussions can also be found in these Twitter threads by Ben Edgington and Danny Ryan. A data-driven analysis has also been made by Shyam Sridhar.
Validators on Ethereum consist of two pieces: a beacon chain node and a validator client (a helpful overview of this can be found here). A beacon chain node can serve multiple validator clients, which is why we operate 3 beacon chain nodes at Chorus One to service the 4,000 Lido validators we are currently running. The incident on Friday occurred because the Lighthouse validator client, which we currently use exclusively, queries beacon nodes in the order in which they are specified in the configuration, rather than in a round-robin or randomised fashion. This is by design: it avoids situations where multiple beacon chain nodes queried by a single validator client are out of sync, which would cause errors and inefficiencies in attestation production, and it keeps the validator client simple and less prone to bugs.
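To illustrate the behaviour described above, the sketch below shows the ordered-fallback pattern in miniature: every validator client holding the same list will keep preferring the same first node for as long as it responds. The hostnames are placeholders and the health check uses the standard Beacon API endpoint /eth/v1/node/health; this is an illustration of the pattern, not Lighthouse’s actual implementation.

```python
# Illustration only: ordered-fallback selection of a beacon node. Every
# client configured with the same list will always prefer the same first
# node while it stays responsive.

import requests

BEACON_NODES = [
    "http://beacon-1.internal:5052",   # placeholder hostnames
    "http://beacon-2.internal:5052",
    "http://beacon-3.internal:5052",
]

def first_responsive_node(nodes=BEACON_NODES):
    """Return the first node in listed order that answers a health check."""
    for url in nodes:
        try:
            r = requests.get(f"{url}/eth/v1/node/health", timeout=2)
            if r.status_code == 200:
                return url
        except requests.RequestException:
            continue
    raise RuntimeError("no responsive beacon node")
```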
However, despite us running multiple nodes, having all validator clients configured identically meant that all queries were hitting a single node. That node remained alive but became less responsive, delaying the inclusion of a number of attestations and blocks on the chain and ultimately leading to some blocks becoming orphaned. The orphaned blocks affected network-wide attestation rates because of how the beacon chain works: as explained by Ben, they reduced the available blockspace, causing attestations to overflow into other blocks, with some attestations not being included at all. As a result, other validators on the beacon chain were also slightly penalized.
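To make the blockspace effect concrete, here is a rough, illustrative calculation using beacon chain constants (at most 128 aggregate attestations per block, up to 64 committees per slot, 32 slots per epoch). The per-slot aggregate counts are assumptions for illustration, not measurements from the incident.

```python
# Rough illustration of how orphaned blocks squeeze attestation blockspace.
# A block carries at most MAX_ATTESTATIONS aggregates; with a large validator
# set there are up to 64 committees per slot, so at least 64 aggregates are
# produced per slot even under perfect aggregation.

MAX_ATTESTATIONS_PER_BLOCK = 128
SLOTS_PER_EPOCH = 32

def unincluded_aggregates(orphaned_blocks, aggregates_per_slot):
    """Aggregates produced in an epoch minus the capacity of the remaining canonical blocks."""
    produced = SLOTS_PER_EPOCH * aggregates_per_slot
    capacity = (SLOTS_PER_EPOCH - orphaned_blocks) * MAX_ATTESTATIONS_PER_BLOCK
    return max(0, produced - capacity)

# With perfect aggregation (64 aggregates per slot) there is ample headroom,
# even with a few orphaned blocks:
print(unincluded_aggregates(orphaned_blocks=2, aggregates_per_slot=64))    # 0
# In practice aggregation is imperfect, so the number of distinct aggregates
# per slot is higher and the headroom shrinks quickly:
print(unincluded_aggregates(orphaned_blocks=2, aggregates_per_slot=125))   # 160 left over
```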
Within hours of the issue being spotted, we were able to adjust our Lighthouse configuration, and our validators have been performing flawlessly again since. Detection of the issue was slower than it should have been, which can be traced back to imperfect monitoring around attestations and orphaned blocks: none of the Lighthouse metrics we monitor and alert on showed any problems. We are in the process of adjusting our systems to take these events into account, and want to again thank the diligent teams involved in uncovering the issue for their help.
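As a sketch of what on-chain alerting can look like, the snippet below polls a beacon node’s standard HTTP API and flags slots with no canonical block, which covers both missed and orphaned proposals. The BEACON_API URL and the alert() hook are placeholders, the endpoint paths follow the public Beacon API specification, and reorg handling is omitted for brevity.

```python
# Minimal sketch: poll a beacon node's standard HTTP API and flag slots at
# which no canonical block exists (a missed or orphaned proposal).

import time
import requests

BEACON_API = "http://beacon-1.internal:5052"   # placeholder
SECONDS_PER_SLOT = 12

def head_slot():
    """Current canonical head slot, via /eth/v1/beacon/headers/head."""
    r = requests.get(f"{BEACON_API}/eth/v1/beacon/headers/head", timeout=5)
    r.raise_for_status()
    return int(r.json()["data"]["header"]["message"]["slot"])

def has_canonical_block(slot):
    """True if a canonical block exists at the given slot; 404 means it does not."""
    r = requests.get(f"{BEACON_API}/eth/v1/beacon/headers/{slot}", timeout=5)
    if r.status_code == 404:
        return False
    r.raise_for_status()
    return True

def alert(message):
    # Placeholder: wire this into the real alerting pipeline (pager, chat, ...).
    print(f"ALERT: {message}")

if __name__ == "__main__":
    last_checked = head_slot()
    while True:
        time.sleep(SECONDS_PER_SLOT)
        head = head_slot()
        # Slots strictly between the last checked slot and the current head
        # that have no canonical block were either missed or orphaned.
        for slot in range(last_checked + 1, head):
            if not has_canonical_block(slot):
                alert(f"no canonical block at slot {slot} (missed or orphaned proposal)")
        last_checked = max(last_checked, head)
```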
Conclusion and Next Steps
We have updated our validator client configuration to balance queries more evenly across our beacon nodes. In addition, we are working on improving our monitoring and observability setup, especially around alerting based on on-chain events, and investigating which metrics emitted by clients would have alerted us to Friday’s issue sooner. In the coming weeks, we will be diversifying our node setup to include another client (likely Prysm) to be more resilient to this type of correlated failure.
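As an illustration of the configuration change, the sketch below staggers the beacon node ordering across validator client instances so that each instance prefers a different primary node while keeping the others as fallbacks. The hostnames are placeholders, and the --beacon-nodes flag reflects Lighthouse’s documented fallback support at the time; exact flags may differ by version.

```python
# Minimal sketch (hostnames and exact CLI usage are illustrative): derive a
# distinct beacon node ordering per validator client instance so that each
# instance has a different primary node under listed-order querying.

BEACON_NODES = [
    "http://beacon-1.internal:5052",
    "http://beacon-2.internal:5052",
    "http://beacon-3.internal:5052",
]

def rotated_order(instance_index, nodes=BEACON_NODES):
    """Rotate the node list so each validator client instance gets a different primary."""
    shift = instance_index % len(nodes)
    return nodes[shift:] + nodes[:shift]

if __name__ == "__main__":
    for i in range(3):
        ordering = ",".join(rotated_order(i))
        # The resulting string would be passed to that instance, e.g.:
        #   lighthouse vc --beacon-nodes <ordering> ...
        print(f"vc-{i}: --beacon-nodes {ordering}")
```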
From a Lido perspective, we currently don’t have just-in-time monitoring for all our node operators, just a somewhat lagging analytics service. That’s partially by choice (we think that operators shouldn’t rely on a centralized setup for monitoring and must implement it themselves), and partially due to the lack of good technical solutions for just-in-time analytics. We have greenlit a project to develop robust reporting and analytics on validator performance (see the relevant LEGO RFP), which will help identify similar issues in the future. Additionally, as part of our journey to Trustless Staking, we will also start gathering data on various validator metrics that are not easily drawn from on-chain data today (e.g. client diversity), to enable more holistic risk management and impact analysis for these types of events.
Another way to mitigate correlated-failure risks like this is to onboard more node operators to Lido, so that a single incident does not impact aggregate Lido operations too much. Such a process is underway now.
Finally, it is pertinent for us to investigate how a relatively small number of orphaned blocks from a small percentage of the network’s validators was sufficient to cause the impact we have seen. As part of this effort, we hope to understand to what extent the network can tolerate such misbehaviour, unintentional or otherwise, and whether we can propose protocol-level changes to ensure the ongoing stability of the Ethereum beacon chain.