Post Mortem: Lido on Ethereum RockLogic GmbH Slashing Incident

in Post Mortem by Izzy

 

Incident Summary and Root Cause

At 13:02 UTC on April 13, 2023, Lido DAO contributors alerted the RockLogic GmbH (“RockLogic”) Node Operator, which participates in the Lido on Ethereum protocol, to a slashing event affecting 11 of the validators it operates. A full list of the impacted validators is provided in APPENDIX B below.

 

Update: The RockLogic slashing-burn omnibus will commence on the week beginning the 20th of June.

 

Over the course of the next two hours, the affected cluster was brought offline to mitigate potential further risk, and the RockLogic team successfully identified the root cause. The slashing was caused by the duplication of validator keys in two different active clusters; this resulted in double votes, which led to attester slashings of 11 validators. A full post mortem from RockLogic's perspective is available in APPENDIX A below. A full timeline of the incident can be found in the “Timeline” section below.
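To make the term "double vote" concrete, the following is a minimal sketch of the attester slashing conditions as they are defined in the Ethereum consensus specification (a double vote is two different attestations for the same target epoch; a surround vote is the other slashable case). The class and function names are illustrative, and the example roots, slot, and source epoch are hypothetical values, not data from the incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    root: str


@dataclass(frozen=True)
class AttestationData:
    slot: int
    index: int
    beacon_block_root: str
    source: Checkpoint
    target: Checkpoint


def is_double_vote(a: AttestationData, b: AttestationData) -> bool:
    # Two *different* attestations that vote for the same target epoch.
    return a != b and a.target.epoch == b.target.epoch


def is_surround_vote(a: AttestationData, b: AttestationData) -> bool:
    # Attestation a surrounds attestation b.
    return a.source.epoch < b.source.epoch and b.target.epoch < a.target.epoch


def is_slashable(a: AttestationData, b: AttestationData) -> bool:
    return is_double_vote(a, b) or is_surround_vote(a, b)


# Hypothetical example: the same key is active in two clusters whose beacon
# nodes see different heads, so each signs different data for one epoch.
source = Checkpoint(epoch=194181, root="0xsource")        # hypothetical
att_cluster_a = AttestationData(
    slot=6213840, index=0, beacon_block_root="0xaaa",     # hypothetical root
    source=source, target=Checkpoint(epoch=194182, root="0xaaa"),
)
att_cluster_b = AttestationData(
    slot=6213840, index=0, beacon_block_root="0xbbb",     # hypothetical root
    source=source, target=Checkpoint(epoch=194182, root="0xbbb"),
)
```

Under these assumptions, `is_slashable(att_cluster_a, att_cluster_b)` is true, while two byte-identical attestations would not be slashable. This is why running the same key in two places is dangerous even when both setups are otherwise healthy.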

 

On April 11th, a cluster (A) of 500 validator keys experienced an outage following an Execution Layer (EL) client database corruption, and the keys were subsequently failed over to another cluster (B). This was done by removing the keys from the initial cluster (A) and re-importing them into an existing cluster (B) of another 500 keys. RockLogic did not fully shut down or completely wipe cluster A, which would have made double-signing impossible; instead, they relied on strong evidence that the deletion had worked as intended (confirmation of the key deletion and a later re-query of the key manager). The EL client on cluster A was restored on April 12th, and at the time no slashing occurred, which proved that the keys had been successfully removed from the cluster. However, following an update to the BN+VC clients (Prysm) of cluster A on April 13th, a restart of the clients was performed, which caused an unexpected re-import of the deleted validator keys and led to the 11 validator slashings beginning in epoch 194182 and ending in epoch 194183.

On April 14th, Preston van Loon from Prysmatic Labs was instrumental in conducting a speedy and thorough investigation of the root cause together with the RockLogic team. The cause of the misleading confirmation of key deletion and the subsequent unexpected re-import has been confirmed by Prysmatic Labs to be a bug (as evidenced in issue 12281 of their code repository). (EDIT Apr 21: this bug has been addressed and fixed as of Prysm v4.0.3)

 

The incident began at 12:50 UTC and was resolved (by bringing the remaining non-slashed validator keys back online) at 15:30 UTC. As of 10:56 UTC on April 14, 2023, total penalties amount to 11.1945 ETH (including offline penalties for the entire cluster deactivated during the investigation). As the 11 slashed validators continue to incur penalties before their scheduled withdrawal on May 20th, total penalties and missed rewards by the time the slashed validators are withdrawn, including the downtime penalties of the cluster, are projected to be ~13.77 ETH.

 

Impact

The impact on stakers (stETH holders) from a penalties and missed rewards perspective is analysed below:

| Description | Amounts |
| --- | --- |
| Initial slashing penalties | Penalties: 11 ETH (Actual) (1 ETH penalty per validator slashed) |
| Additional slashing penalties (i.e. due to correlated slashing multiplier) | 0* (*Projected. No additional slashing penalties are expected, as thousands of additional validators would need to be slashed within the correlated 36-day period to trigger a penalty of 1 ETH per validator.) |
| Slashing-subsequent validator duty inactivity penalties and missed rewards (attestations) | Attestation penalties*: 0.8276 ETH. Missed rewards: Attestation penalties*: 0.8276 ETH; Missed head vote rewards: 0.2896 ETH. (*Projected. Assumes a projected base reward of 474, an attestation penalty of ~9183.75 gwei, an attestation reward of ~12798 gwei, and a slashings vector of 8192 epochs, incurred until the validators become withdrawable on the Beacon Chain.) |
| Slashing-subsequent validator duty inactivity penalties (missed proposals and/or sync committees) | Missed proposal rewards*: 0.362 ETH. Sync committee penalties / rewards: N/A**. (*Projected. An unlikely estimate which assumes that each validator would have made one proposal within the 36 days; average proposal reward over 2 weeks = 32,927,752 Gwei. **See the “interesting edge case” note in “Other Penalties”.) |
| Slashing-subsequent inactivity leak | 0* (*Projected. No inactivity leak is expected, as the network is not having issues finalising.) |
| Penalties and missed rewards of the associated cluster de-activated during the slashing investigation | Penalties: 0.1742 ETH. Missed rewards: 0.2974 ETH |
 

Compared to the average daily protocol rewards which accrue to stETH holders, the total projected impact of 13.77 ETH is ~2.4% of daily rewards, or 0.0023% of total protocol TVL at the time of writing.
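The projected figures above can be approximately reproduced with a few lines of arithmetic. The sketch below is not the authors' model: it only multiplies the per-epoch values quoted in the table notes by 8192 epochs and 11 validators, and then sums the line items. It assumes that the 0.8276 ETH figure, which appears twice in the table (once under penalties and once under missed rewards), is counted both times, and that the proposal estimate corresponds to one missed proposal per slashed validator.

```python
GWEI_PER_ETH = 10**9
SLASHED_VALIDATORS = 11
PROJECTION_EPOCHS = 8192  # slashings vector length quoted in the table notes


def projected_eth(gwei_per_epoch: float) -> float:
    """Per-epoch amount for one validator, projected over all slashed validators."""
    return gwei_per_epoch * PROJECTION_EPOCHS * SLASHED_VALIDATORS / GWEI_PER_ETH


# Attestation penalty of ~9183.75 gwei per epoch, as quoted in the table notes.
attestation_penalties = projected_eth(9183.75)

# Assumption: one missed proposal per validator at the quoted average reward.
missed_proposals = 32_927_752 * SLASHED_VALIDATORS / GWEI_PER_ETH

# Line items in ETH, copied from the table (key names are illustrative).
line_items = {
    "initial_slashing_penalties": 11.0,
    "correlated_slashing_penalties": 0.0,
    "attestation_penalties": 0.8276,
    "missed_rewards_second_0_8276_entry": 0.8276,
    "missed_head_vote_rewards": 0.2896,
    "missed_proposal_rewards": 0.362,
    "inactivity_leak": 0.0,
    "cluster_downtime_penalties": 0.1742,
    "cluster_downtime_missed_rewards": 0.2974,
}
total_projected_eth = sum(line_items.values())

print(f"attestation penalties: {attestation_penalties:.4f} ETH")
print(f"missed proposals:      {missed_proposals:.4f} ETH")
print(f"total of line items:   {total_projected_eth:.4f} ETH")
```

With these inputs the line items sum to roughly 13.78 ETH, which is in line with the ~13.77 ETH projection quoted in this post.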

 

Resolution

Following the incident, RockLogic shut down the cluster of 1000 validators to ensure no further slashing could take place. Upon further analysis, RockLogic deleted the Consensus Layer client (Prysm) to remove any potential stored key data, queued/buffered messages, and node data. Over the following hours, RockLogic reactivated the remaining 989 validators successfully without any further slashing event taking place.

 

Regarding possible compensation, RockLogic requested that the Lido DAO utilise its cover fund to compensate stakers for damages and lost rewards. This decision was finalised and enacted on June 30th 2023 through an on-chain vote.

 

Timeline

April 13, 13:02 UTC

Lido contributor observes validator slashings taking place on the Ethereum network and begins investigation.

April 13, 13:03 UTC

Confirmed by Lido contributor that the slashing involves validators operated by the Node Operator RockLogic GmbH as a part of the Lido on Ethereum protocol. 

April 13, 13:04 UTC

Lido contributor notifies RockLogic of a slashing event taking place for the validators they operate.

April 13, 13:05 UTC

RockLogic acknowledges the issue.

April 13, 13:06 UTC

Confirmation that 11 validators in total have been slashed.

April 13, 13:09 UTC

A call is held to diagnose and explore remediation of the issue. 

April 13, 13:15 UTC

The RockLogic team describes the series of events from their perspective. The cluster of 1000 validators (which includes the 11 slashed validators) is shut down out of caution while RockLogic investigates the root cause.

The cluster of 1000 validators includes 500 (Cluster A) which had been imported into Cluster B on April 11 as a failover due to a corrupted Nethermind database. 

April 13, 13:15-14:00 UTC

Additional analysis of the situation by the RockLogic team to determine potential next steps.

April 13, 13:28 UTC

Communication is shared via the @Lidofinance twitter account notifying the community of a slashing event.

April 13, 14:00 UTC

RockLogic deletes the Consensus Layer clients (BN+VC) (Prysm) to remove any stored key data, queued/buffered messages, and node data. 

April 13, 14:05 UTC

RockLogic reinstalls the Consensus Layer and Beacon Node and re-syncs utilising checkpoint sync from another RockLogic-operated node.

April 13, 14:07 UTC

RockLogic imports the validator keys for 50 validators corresponding to the affected cluster (Cluster A). 

April 13, 14:20 UTC

Three consecutive epochs of successful attestations are observed for the 50 validators with no additional slashings observed. An additional 50 validators are imported from Cluster A.

April 13, 14:26 UTC

Successful attestations are observed for the 2nd batch of 50 validators across multiple epochs.

April 13, 14:34 UTC

An additional 200 validators are re-activated from Cluster A and successfully attest.

April 13, 14:54 UTC

Final 200 validators from Cluster A are re-activated and successfully attest.

April 13, 15:10 UTC

First 100 validators of Cluster B (500 total) are reactivated and successfully attest.

April 13, 15:30 UTC

Remaining 400 validators of Cluster B are reactivated and successfully attest. Incident is considered remediated as the 11 validators are confirmed to be slashed and non-recoverable, and the remaining validators in the associated clusters have been brought online successfully and correctly. 

April 14, 16:40 UTC

Following retrieval of logs from the nuked system by the RockLogic team and extensive testing, the issue was reproduced in a test environment and Prysmatic Labs joined to help further debug.

 

Action Items

  • RockLogic will continue to work closely with Prysmatic Labs to assess the proposed fix to the issue identified and roll out the fix on relevant setups, and coordinate with other Node Operators.
  • Node Operators participating in the Lido on Ethereum protocol will review setup to ensure doppelganger protection is utilised where possible, key migration implementations / activities are thoroughly checked, and additional precautions are taken when key migrations are performed (e.g. total wipe of initial cluster).
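As one illustration of the kind of check a key migration procedure could include, the sketch below looks for validator public keys that are reported as loaded in more than one cluster. The function names and the input format (a mapping from a cluster name to a list of hex-encoded public keys) are hypothetical. Note that a check like this depends on each client reporting its loaded keys accurately, which is exactly what failed in this incident, so it can only complement, and not replace, a full wipe of the initial cluster and doppelganger protection.

```python
from typing import Dict, List, Set


def normalize_pubkey(pubkey: str) -> str:
    """Lower-case a hex public key and make sure it carries a 0x prefix."""
    key = pubkey.strip().lower()
    return key if key.startswith("0x") else "0x" + key


def find_duplicate_keys(clusters: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return every public key that is loaded in more than one cluster.

    `clusters` maps a cluster name to the public keys that its validator
    client reports as loaded (hypothetical input format).
    """
    seen: Dict[str, Set[str]] = {}
    for cluster_name, pubkeys in clusters.items():
        for pubkey in pubkeys:
            seen.setdefault(normalize_pubkey(pubkey), set()).add(cluster_name)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


# Hypothetical, shortened keys for illustration only.
example_duplicates = find_duplicate_keys({
    "cluster_a": ["0xAB", "0xcd"],
    "cluster_b": ["ab", "0xef"],
})
```

An empty result is a necessary condition before bringing migrated keys online, but it is not sufficient on its own for the reason given above.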

 

APPENDIX A

 

RockLogic GmbH Incident Report

1. Timeline

| Time | Event | Description |
| --- | --- | --- |
| 11.04., 07:30 UTC | 500 validators offline | The corruption of a Nethermind database caused 500 validators to go offline, so the NO migrated the keys onto another machine (keys removed from cluster A and imported into cluster B). |
| 11.04., 09:30 UTC | Resync Nethermind | The resync of Nethermind was initiated to get a backup node up. |
| 12.04., late morning | Estimated time at which Nethermind finished syncing | By this point Nethermind should definitely have finished syncing. If the keys had not been removed, the node would have started staking right away. |
| 13.04., 12:27 UTC | Update Prysm | The Prysm Docker image was updated from 4.0.1 to 4.0.2 for the Consensus and Validator Client. The containers were restarted afterwards to apply the update. |
| 13.04., 12:50 UTC | First slashings | Slot 6213852, with the first 2 slashings. |
| 13.04., 13:13 UTC | Shutdown VC | Shutdown of the Prysm validator client on the node that was causing the slashing. |
| 13.04., a couple of minutes later | Nuke node | Complete wipe of the node that caused the slashing. |

 

2. Root Cause

The root cause was double votes of validators imported on 2 different nodes. This duplication was due to an image version update followed by a reboot of Consensus and Validator Client (Prysm) to apply the update (4.0.1 -> 4.0.2). It seems that this process caused some kind of re-import of the previously deleted keys. However, nuking the node beforehand would have prevented this issue in the first place.

 

3. Action Points

  • Further investigations to verify the root cause of the key re-import.
  • Expand internal monitoring
  • Security checks of client configurations (e.g. that doppelganger protection is enabled)
  • Documented and clear instructions for the migration of keys.

 

APPENDIX B

Slashed validators

| Validator | Time | Slot | Epoch |
| --- | --- | --- | --- |
| https://beaconcha.in/validator/459890 | April 13, 12:51:47 UTC | 6213857 | 194183 |
| https://beaconcha.in/validator/459098 | April 13, 12:51:35 UTC | 6213856 | 194183 |
| https://beaconcha.in/validator/459140 | April 13, 12:51:35 UTC | 6213856 | 194183 |
| https://beaconcha.in/validator/459225 | April 13, 12:51:23 UTC | 6213855 | 194182 |
| https://beaconcha.in/validator/458237 | April 13, 12:51:23 UTC | 6213855 | 194182 |
| https://beaconcha.in/validator/459093 | April 13, 12:51:11 UTC | 6213854 | 194182 |
| https://beaconcha.in/validator/458562 | April 13, 12:51:11 UTC | 6213854 | 194182 |
| https://beaconcha.in/validator/458038 | April 13, 12:50:59 UTC | 6213853 | 194182 |
| https://beaconcha.in/validator/459803 | April 13, 12:50:59 UTC | 6213853 | 194182 |
| https://beaconcha.in/validator/459103 | April 13, 12:50:47 UTC | 6213852 | 194182 |
| https://beaconcha.in/validator/458566 | April 13, 12:50:47 UTC | 6213852 | 194182 |
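The slot, epoch, and time columns above are related by fixed Beacon Chain parameters, so any one of them can be cross-checked against the others. The sketch below uses the Ethereum mainnet genesis time (1606824023, i.e. 2020-12-01 12:00:23 UTC), 12-second slots, and 32 slots per epoch; the function names are illustrative.

```python
from datetime import datetime, timezone

MAINNET_GENESIS_TIME = 1606824023  # Beacon Chain genesis, 2020-12-01 12:00:23 UTC
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32


def slot_to_epoch(slot: int) -> int:
    """Epoch that contains the given slot."""
    return slot // SLOTS_PER_EPOCH


def slot_to_utc(slot: int) -> datetime:
    """Start time of the given slot as a timezone-aware UTC datetime."""
    timestamp = MAINNET_GENESIS_TIME + slot * SECONDS_PER_SLOT
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# First and last slots listed in the table.
for slot in (6213852, 6213857):
    print(slot, slot_to_epoch(slot), slot_to_utc(slot).isoformat())
```

For the first listed slot, 6213852, this gives epoch 194182 and a start time of 12:50:47 UTC on April 13, 2023, matching the table. Slot 6213856 is the first slot of epoch 194183, which is why the slashings span two epochs.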