Post Mortem: Lido on Ethereum RockLogic GmbH Slashing Incident
Incident Summary and Root Cause
At 13:02 UTC on April 13, 2023, Lido DAO contributors alerted RockLogic GmbH (“RockLogic”), a Node Operator participating in the Lido on Ethereum protocol, to a slashing event in progress that affected 11 of the validators it operates. A full list of the impacted validators is provided in APPENDIX B below.
Update: The RockLogic slashing-burn omnibus will commence in the week beginning the 20th of June.
Over the course of the next two hours, the affected cluster was brought offline to mitigate further risk, and the RockLogic team successfully identified the root cause. The slashing was caused by the duplication of validator keys in two different active clusters; this produced double votes, which led to attester slashings of 11 validators. A full post mortem from RockLogic's perspective is available in APPENDIX A below. A full timeline of the incident can be found in section “4. Timeline” below.
On April 11th, a cluster (A) of 500 validator keys experienced an outage following an Execution Layer (EL) client database corruption, and the keys were subsequently failed over to a different cluster (B). This was done by removing the keys from the initial cluster (A) and re-importing them into an existing cluster (B) that already held another 500 keys. RockLogic did not fully shut down or completely wipe cluster A, which would have made double-signing impossible; instead, they relied on strong evidence that the deletion had worked as intended (confirmation of the key deletion and a later re-query of the key manager). The EL client on cluster A was restored on April 12th, and no slashing occurred at that time, which appeared to confirm that the keys had been successfully removed from the cluster.
However, following an update to the beacon node and validator clients (BN+VC, Prysm) of cluster A on April 13th, the clients were restarted, which caused an unexpected re-import of the deleted validator keys and led to the 11 validator slashings, beginning in epoch 194182 and ending in epoch 194183. On April 14th, Preston van Loon of Prysmatic Labs was instrumental in conducting a speedy and thorough investigation of the root cause together with the RockLogic team. Prysmatic Labs has confirmed that the misleading confirmation of key deletion and the subsequent unexpected re-import were caused by a bug (as evidenced in issue 12281 of their code repository). (EDIT Apr 21: this bug has been fixed as of Prysm v4.0.3.)
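To illustrate why two active copies of the same key lead to an attester slashing, the sketch below models the two slashable conditions (double vote and surround vote) as they are commonly described for the Ethereum consensus layer. The class and function names are simplified for illustration, and the block roots are hypothetical placeholders, not data from this incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    root: str


@dataclass(frozen=True)
class AttestationData:
    slot: int
    beacon_block_root: str
    source: Checkpoint
    target: Checkpoint


def is_double_vote(a: AttestationData, b: AttestationData) -> bool:
    # Two *different* attestations that vote for the same target epoch.
    return a != b and a.target.epoch == b.target.epoch


def is_surround_vote(a: AttestationData, b: AttestationData) -> bool:
    # Attestation a's source/target span strictly surrounds attestation b's.
    return a.source.epoch < b.source.epoch and b.target.epoch < a.target.epoch


def is_slashable(a: AttestationData, b: AttestationData) -> bool:
    return is_double_vote(a, b) or is_surround_vote(a, b)


# Hypothetical example: the same validator key signs on cluster A and on
# cluster B in the same epoch, but the two nodes see different head blocks.
source = Checkpoint(epoch=194181, root="0xaaaa")
target = Checkpoint(epoch=194182, root="0xbbbb")
from_cluster_a = AttestationData(6213824, "0x1111", source, target)
from_cluster_b = AttestationData(6213824, "0x2222", source, target)

print(is_slashable(from_cluster_a, from_cluster_b))  # True (double vote)
print(is_slashable(from_cluster_a, from_cluster_a))  # False (identical data)
```

Note that two byte-for-byte identical attestations are not slashable; the risk arises because two independently running nodes rarely produce identical attestation data for every duty.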
The incident began at 12:50 UTC and was resolved (by bringing the remaining non-slashed validator keys back online) at 15:30 UTC. As of 10:56 UTC on April 14, 2023, total penalties amount to 11.1945 ETH (including offline penalties for the entire cluster, which was deactivated during the investigation). Because the 11 slashed validators will continue to incur penalties until their scheduled withdrawal on May 20th, total penalties and missed rewards by the time they are withdrawn, including the cluster's downtime penalties, are projected to be ~13.77 ETH.
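As a cross-check of the reported times against the slashing epochs (194182 and 194183), the following sketch converts an epoch number to its UTC start time. The genesis time and the slot and epoch lengths are Ethereum mainnet parameters that are not stated in this post mortem, so treat them as assumptions of the example.

```python
from datetime import datetime, timedelta, timezone

# Assumed Ethereum mainnet parameters (not taken from this report).
GENESIS_TIME = datetime(2020, 12, 1, 12, 0, 23, tzinfo=timezone.utc)
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32


def epoch_start_utc(epoch: int) -> datetime:
    """Return the UTC time at which the given epoch begins."""
    seconds = epoch * SLOTS_PER_EPOCH * SECONDS_PER_SLOT
    return GENESIS_TIME + timedelta(seconds=seconds)


first_epoch_start = epoch_start_utc(194182)
last_epoch_end = epoch_start_utc(194183 + 1)

print(first_epoch_start.isoformat())  # 2023-04-13T12:45:11+00:00
print(last_epoch_end.isoformat())     # 2023-04-13T12:57:59+00:00
```

Under these assumptions the two epochs span roughly 12:45 to 12:58 UTC on April 13, which is consistent with the reported incident start at 12:50 UTC and the alert at 13:02 UTC.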
Impact
The impact on stakers (stETH holders) from a penalties and missed rewards perspective is analysed below:
Compared to the average daily protocol rewards that accrue to stETH holders, the total projected impact of 13.77 ETH is ~2.4% of daily rewards, or 0.0023% of total protocol TVL as of the time of writing.
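The figures quoted in this report can be related to each other with a little arithmetic. The sketch below uses only numbers stated above (11.1945 ETH accrued, 13.77 ETH projected, ~2.4% of daily rewards); the derived values are back-calculations from rounded inputs, not reported figures, and the variable names are illustrative.

```python
from decimal import Decimal

accrued_eth = Decimal("11.1945")    # penalties as of April 14, 10:56 UTC
projected_eth = Decimal("13.77")    # projected total at withdrawal
share_of_daily = Decimal("0.024")   # "~2.4% of daily rewards" (rounded)

# Further penalties and missed rewards expected after the April 14 snapshot.
remaining_eth = projected_eth - accrued_eth

# Daily staker rewards implied by the rounded 2.4% figure (approximate).
implied_daily_rewards_eth = projected_eth / share_of_daily

print(remaining_eth)              # 2.5755
print(implied_daily_rewards_eth)  # 573.75
```

Because the percentage is rounded, the implied daily rewards figure should be read as an order-of-magnitude check rather than a precise value.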
Resolution
Following the incident, RockLogic shut down the cluster of 1000 validators to ensure that no further slashing could take place. Upon further analysis, RockLogic deleted the Consensus Layer client (Prysm) to remove any potentially stored key data, queued or buffered messages, and node data. Over the following hours, RockLogic successfully reactivated the remaining 989 validators with no further slashing events.
Regarding possible compensation, RockLogic requested that the Lido DAO utilise its cover fund to compensate stakers for damages and lost rewards. This decision was finalised and enacted on June 30th 2023 through an on-chain vote.
- Forum post: research.lido.fi/t/slashing-incident-involving-rocklogic-gmbh-validators-april-13-2023/4399/13?u=izzy
- Snapshot vote: snapshot.org/#/lido-snapshot.eth/proposal/0x78bbc81011457ffcc0d2183de2a813869708d4f9996f4af3df8b669510950cf3
- On-chain vote to utilise funds from the cover fund and burn them (thereby compensating stakers): vote.lido.fi/vote/160
Timeline
Action Items
- RockLogic will continue to work closely with Prysmatic Labs to assess the proposed fix to the issue identified and roll out the fix on relevant setups, and coordinate with other Node Operators.
- Node Operators participating in the Lido on Ethereum protocol will review their setups to ensure that doppelganger protection is utilised where possible, that key migration implementations and activities are thoroughly checked, and that additional precautions are taken when key migrations are performed (e.g. a total wipe of the initial cluster).
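One of the checks mentioned above, verifying that no key is loaded in two clusters at once, can be sketched as follows. The example assumes that each cluster's key listing has the shape of a standard Ethereum Keymanager API keystore listing (a `data` array of objects with a `validating_pubkey` field); the function names and the sample public keys are hypothetical. As this incident showed, a clean key manager listing is not sufficient on its own, since the reported deletion state was itself misleading, so a check like this complements rather than replaces a full wipe of the source cluster.

```python
from __future__ import annotations


def pubkeys_from_listing(listing: dict) -> set[str]:
    """Extract normalised (lower-case) validator public keys from a listing."""
    return {entry["validating_pubkey"].lower() for entry in listing.get("data", [])}


def find_duplicate_keys(listing_a: dict, listing_b: dict) -> set[str]:
    """Return the public keys that appear in both cluster listings."""
    return pubkeys_from_listing(listing_a) & pubkeys_from_listing(listing_b)


# Hypothetical, shortened public keys for illustration only.
cluster_a = {"data": [{"validating_pubkey": "0xAB01"}, {"validating_pubkey": "0xab02"}]}
cluster_b = {"data": [{"validating_pubkey": "0xab02"}, {"validating_pubkey": "0xab03"}]}

duplicates = find_duplicate_keys(cluster_a, cluster_b)
if duplicates:
    print("Do not start both clusters; duplicated keys:", sorted(duplicates))
```

In practice such a check would be run before every restart of either cluster, not only at migration time, because the re-import in this incident happened on a later client restart.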
APPENDIX A
RockLogic GmbH Incident Report
1. Timeline
2. Root Cause
The root cause was double votes by validators whose keys were imported on two different nodes. This duplication followed an image version update (4.0.1 -> 4.0.2) and a reboot of the Consensus and Validator Clients (Prysm) to apply it. It seems that this process caused some kind of re-import of the previously deleted keys. However, completely wiping the node beforehand would have prevented this issue in the first place.
3. Action Points
- Further investigations to verify the root cause of the key re-import.
- Expand internal monitoring.
- Security checks of client configurations (e.g. that doppelganger protection is enabled).
- Documented and clear instructions for the migration of keys.