Post Mortem: Lido on Ethereum Launchnodes Slashing Incident

in Post Mortem by Lido

Update (November 28, 2023): as of November 16, 2023, the 20 validators in question are withdrawable and have therefore stopped accumulating penalties. View the full update here: https://research.lido.fi/t/slashing-incident-involving-launchnodes-validators-oct-11-2023/5631/4

 

Incident Summary and Root Cause

At 15:55 UTC on October 11, 2023, Lido DAO contributors alerted the Launchnodes Node Operator to a slashing event in progress, which ultimately affected 20 of the validators that Launchnodes operates using the Lido protocol. A full list of the affected validators is provided in Appendix B below.

 

Within 10 minutes, the affected clusters were taken offline to mitigate further risk, and the Launchnodes team began investigating the root cause. The slashing was caused by suboptimal fallback procedures executed during datacenter connectivity issues. In an attempt to restore validator connectivity, multiple validator client instances (the initial instance and a manually activated fallback instance) were pointed at a single Web3signer instance. Slashing protection was not enabled at the Web3signer level, and the initial instance was not blocked from reaching the signer (e.g. via firewall rules). As a result, both instances signed attestations for the same loaded validators, producing double votes that led to attester slashings of 20 validators.
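To illustrate why signer-level slashing protection matters in this scenario, here is a minimal sketch of the kind of check a slashing-protection database performs before an attestation is signed. It is not Web3signer's actual implementation. The double-vote and surround-vote conditions follow the Ethereum consensus specification, while the class and method names (`Attestation`, `SlashingProtectionDB`, `check_and_record`), the public key and the signing roots are hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attestation:
    source_epoch: int
    target_epoch: int
    signing_root: str  # stands in for the hash of the attestation data


def is_slashable(prev: Attestation, new: Attestation) -> bool:
    """Return True if signing `new` after `prev` would be slashable."""
    # Double vote: two different attestations for the same target epoch.
    double_vote = (
        prev.target_epoch == new.target_epoch
        and prev.signing_root != new.signing_root
    )
    # Surround vote: one source/target span strictly contains the other.
    surrounds = (
        prev.source_epoch < new.source_epoch
        and new.target_epoch < prev.target_epoch
    )
    surrounded = (
        new.source_epoch < prev.source_epoch
        and prev.target_epoch < new.target_epoch
    )
    return double_vote or surrounds or surrounded


@dataclass
class SlashingProtectionDB:
    """Hypothetical in-memory record of attestations signed per validator key."""
    signed: dict = field(default_factory=dict)

    def check_and_record(self, pubkey: str, att: Attestation) -> bool:
        history = self.signed.setdefault(pubkey, [])
        if any(is_slashable(prev, att) for prev in history):
            return False  # refuse to sign
        if att not in history:
            history.append(att)
        return True


db = SlashingProtectionDB()
# The original validator client attests first (epochs are illustrative)...
first = db.check_and_record("0xabc", Attestation(234935, 234936, "root-A"))
# ...then a fallback client requests a different vote for the same target.
second = db.check_and_record("0xabc", Attestation(234935, 234936, "root-B"))
print(first, second)  # True False
```

Because both validator clients go through the same signer, a shared check of this kind would reject the second, conflicting request regardless of which client sent it.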

 

The fallback validator client was brought online and connected to Web3signer after an attempt had been made to deactivate the nodes attached to the original validator client instance by moving the associated EL node’s data container.

 

A full post mortem from Launchnodes’ perspective is available in Appendix A below. A full timeline of the incident can be found in the “Timeline” section below.

 

Impact

The impact on stakers (stETH holders) from a penalties and missed rewards perspective is analysed below:

 

Description

Amounts

Initial slashing penalties

Penalties: 20 ETH (actual; 1 ETH penalty per validator slashed)

Additional slashing penalties (i.e. due to correlated slashing multiplier)

0*

* Projected. No additional slashing penalties expected, as thousands of additional validators would need to be slashed within the correlated 36-day period to trigger a penalty of 1 ETH per validator.
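The footnote above can be checked with a rough sketch of the correlation penalty formula from the Ethereum consensus specification (Bellatrix values: multiplier of 3, 1 ETH balance increment). The active validator count used here is an illustrative assumption, not a figure from this report, and every validator is assumed to have a 32 ETH effective balance.

```python
GWEI_PER_ETH = 10**9
EFFECTIVE_BALANCE_INCREMENT = GWEI_PER_ETH   # 1 ETH
MAX_EFFECTIVE_BALANCE = 32 * GWEI_PER_ETH
PROPORTIONAL_SLASHING_MULTIPLIER = 3         # Bellatrix value


def correlation_penalty_gwei(slashed_validators: int, active_validators: int) -> int:
    """Correlation penalty for one 32 ETH validator, using integer arithmetic.

    Simplification: all validators are assumed to hold a 32 ETH effective balance.
    """
    total_balance = active_validators * MAX_EFFECTIVE_BALANCE
    slashed_balance = slashed_validators * MAX_EFFECTIVE_BALANCE
    adjusted = min(slashed_balance * PROPORTIONAL_SLASHING_MULTIPLIER, total_balance)
    numerator = MAX_EFFECTIVE_BALANCE // EFFECTIVE_BALANCE_INCREMENT * adjusted
    return numerator // total_balance * EFFECTIVE_BALANCE_INCREMENT


ASSUMED_ACTIVE_VALIDATORS = 870_000  # illustrative assumption only

penalty_20 = correlation_penalty_gwei(20, ASSUMED_ACTIVE_VALIDATORS)

# Smallest number of slashed validators that yields a non-zero penalty.
threshold = next(
    n for n in range(1, ASSUMED_ACTIVE_VALIDATORS)
    if correlation_penalty_gwei(n, ASSUMED_ACTIVE_VALIDATORS) > 0
)
print(penalty_20, threshold)
```

Because the result is rounded down to whole increments, the penalty stays at zero until the slashed share of stake is large enough to produce a full 1 ETH. Under the assumed validator count, that takes roughly nine thousand slashed validators within the window.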

Slashing-subsequent validator duty inactivity penalties and missed rewards (attestations)

Attestation penalties*: 1.197 ETH

Missed Attestation Rewards*: 1.668 ETH

–Missed Target + Source Rewards: 1.235 ETH

–Missed Head Rewards: 0.432 ETH 


* Projected, assuming a base reward of 377, an attestation penalty of ~7304.75 gwei and an attestation reward of ~10179 gwei per validator per epoch, incurred over the 8192-epoch slashings vector, i.e. until the validators become withdrawable on the Beacon Chain.
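The projected attestation figures can be reproduced with simple arithmetic. The sketch below uses the per-epoch penalty and reward quoted above and assumes they apply per validator per epoch for the full 8192 epochs. The split between source/target and head rewards is assumed to follow the Altair participation weights (14, 26 and 14 out of 64) from the consensus specification.

```python
GWEI_PER_ETH = 10**9

# Figures quoted in the report.
SLASHED_VALIDATORS = 20
EPOCHS_UNTIL_WITHDRAWABLE = 8192      # slashings vector length
ATTESTATION_PENALTY_GWEI = 7304.75    # assumed per validator per epoch
ATTESTATION_REWARD_GWEI = 10179       # assumed per validator per epoch

# Altair participation weights from the consensus specification.
TIMELY_SOURCE_WEIGHT = 14
TIMELY_TARGET_WEIGHT = 26
TIMELY_HEAD_WEIGHT = 14
ATTESTATION_WEIGHT = TIMELY_SOURCE_WEIGHT + TIMELY_TARGET_WEIGHT + TIMELY_HEAD_WEIGHT

validator_epochs = SLASHED_VALIDATORS * EPOCHS_UNTIL_WITHDRAWABLE

penalties_eth = validator_epochs * ATTESTATION_PENALTY_GWEI / GWEI_PER_ETH
missed_eth = validator_epochs * ATTESTATION_REWARD_GWEI / GWEI_PER_ETH
missed_source_target_eth = (
    missed_eth * (TIMELY_SOURCE_WEIGHT + TIMELY_TARGET_WEIGHT) / ATTESTATION_WEIGHT
)
missed_head_eth = missed_eth * TIMELY_HEAD_WEIGHT / ATTESTATION_WEIGHT

print(
    f"{penalties_eth:.3f} {missed_eth:.3f} "
    f"{missed_source_target_eth:.3f} {missed_head_eth:.3f}"
)
# 1.197 1.668 1.235 0.432
```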

Slashing-subsequent validator duty inactivity penalties (missed proposals and/or sync committees)

Missed proposal rewards*: 0.198 ETH
Sync committee penalties / rewards: N/A**

* Projected, based on an expected value of 6.14 proposals across all 20 validators (average proposal reward over 2 weeks = 32,258,570 gwei).
** See the “interesting edge case” note in “Other Penalties”.

Slashing-subsequent inactivity leak

0*


* Projected. There is no inactivity leak expected as the network is not having issues finalising.

Penalties and missed rewards of the associated cluster (excluding the 20 slashed validators), which was de-activated during the slashing investigation

Penalties: 2.188 ETH

Missed rewards: 3.426 ETH


Sum total of projected penalties and missed rewards of all impacted validators

28.677 ETH 
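As a cross-check, the total can be rebuilt from the individual rows of the table. The sketch below uses only figures quoted in this section. The dictionary keys are shorthand labels introduced for illustration, and the missed proposal reward is recomputed from the expected proposal count and average proposal reward given in the footnote.

```python
from decimal import Decimal

GWEI_PER_ETH = Decimal(10**9)

# Missed proposal rewards: expected proposals x average proposal reward.
expected_proposals = Decimal("6.14")
avg_proposal_reward_gwei = Decimal(32_258_570)
missed_proposals_eth = (
    expected_proposals * avg_proposal_reward_gwei / GWEI_PER_ETH
).quantize(Decimal("0.001"))

# Shorthand labels for the rows of the impact table.
line_items_eth = {
    "initial slashing penalties": Decimal("20"),
    "correlated slashing penalties": Decimal("0"),
    "attestation penalties": Decimal("1.197"),
    "missed attestation rewards": Decimal("1.668"),
    "missed proposal rewards": missed_proposals_eth,
    "inactivity leak": Decimal("0"),
    "cluster downtime penalties": Decimal("2.188"),
    "cluster downtime missed rewards": Decimal("3.426"),
}

total_eth = sum(line_items_eth.values())
print(missed_proposals_eth, total_eth)  # 0.198 28.677
```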

 

Resolution

Following the incident, Launchnodes shut down multiple clusters totalling 2582 validators (including the 20 slashed) to ensure no further slashing could take place. In order to prevent the slashing from spreading, Launchnodes nuked the original node clients & data (EL+CL nodes and validator clients) and the original Web3signer instance. Over the following hours, Launchnodes reactivated the remaining 2562 validators successfully without any further slashing event taking place, with slashing protection enabled on the new Web3signer instance.

 

Regarding staker compensation, Launchnodes has already disbursed 25.663 ETH to cover the initial slashing penalties and the rewards missed due to infrastructure downtime, so stakers suffered no reduction in rewards on the day of the slashing. Launchnodes has also pledged to compensate for any additional penalties that the slashed validators incur until they are withdrawn from the network.

 

Timeline

The order and timing of events are outlined below:

 

Oct 11, 15:41 UTC

Lido DAO contributor monitoring alerts fire, noting that a majority of Launchnodes-operated validators are offline.

Oct 11, 15:41 UTC

Lido DAO contributors notify Node Operator Launchnodes of the offline validators. Launchnodes acknowledges, confirming that its internal alerting worked and that the issue is being investigated.

The cause of the outage is local data center issues, and troubleshooting is in progress.

Oct 11, 15:47 UTC

Offline validators gradually come back online after the Node Operator switches to a set of fallback nodes.

Oct 11, 15:53 UTC

The ethereum-head-watcher (slashing monitoring system) alert fires, indicating that 2 validators have been slashed. A Lido DAO contributor observes validator slashings taking place on the Ethereum network and begins investigating. Validator slashings continue for the following 10 slots.

Oct 11, 15:55 UTC

Lido DAO contributors notify the Node Operator of the slashings, which are corroborated by internal monitoring.

Oct 11, 16:02 UTC

Launchnodes confirms that its validator infrastructure has been shut off.

Oct 11, 16:05 UTC

20 validators are confirmed slashed in total.

Oct 11, 16:27 UTC

The Lido Twitter account posts a public update about the slashing incident: https://twitter.com/LidoFinance/status/1712142945783013393

Oct 11, 16:43 UTC

The data center continues to have issues, preventing root cause analysis from proceeding at the desired pace.

Oct 11, 18:52 UTC

Launchnodes executes a transaction to remove one undeposited key from the registry, resetting their “vetted keys” (i.e. staking limit) to the currently used number of keys so that no more stake would be allocated to the operator.

Tx: https://etherscan.io/tx/0x55bf362106c6f1f1a8a8632b60fb05c0b3ab5fc8e1cbd7459797ff8c10f35a0b

Oct 11, 20:00 UTC

Launchnodes prepares a plan to restore connectivity to the remaining validators by nuking the original nodes and Web3signer instance and using encrypted backups of key material to spin up a new temporary instance until data center connectivity can be fully restored. This is a viable solution because the bare-metal server is reachable even though the Kubernetes clusters are not.

Oct 11, 22:28 UTC

The Node Operator has nuked the server hosted in Data Centre 1 and has set up the new Web3signer instance. The Web3signer slashing db has also been enabled.

Oct 11, 22:52 UTC

After taking the above-mentioned mitigation steps, the Node Operator prepares to bring the first validators back online by gradually loading keys into the Web3signer instance and then monitoring performance.

Oct 11, 23:23 UTC

The Node Operator brings the first 10 validators back online; successful attestations are observed.

Oct 12, 00:02 UTC

Additional 90 validators are brought back online following no observed issues.

Oct 12, 00:50 UTC

Next 400 validators are brought back online to observe performance. 500 total validators now actively attesting. 

Oct 12, 01:34 UTC

Additional 500 validators are brought back online. Performance monitoring continues with no issues observed since re-onlining began. 

Oct 12, 02:52 UTC

Additional 500 validators are brought back online. Performance monitoring continues with no issues observed since re-onlining began. 

Oct 12, 04:23 UTC

Additional 500 validators are brought online after removing the 20 slashed keys from those being uploaded. Performance monitoring continues with no issues observed since re-onlining began. 2000 validators in total are now actively attesting.

Oct 12, 05:13 UTC

Additional 500 validators are brought back online. Performance monitoring continues with no issues observed since re-onlining began. 

Oct 12, 05:31 UTC

Final 100 validators brought back online. Performance monitoring continues with no issues observed.

Oct 12, 06:00 UTC

Launchnodes and Lido DAO contributors work together to calculate the day’s lost rewards and the estimated total impact of slashing and downtime until the slashed validators are exited.

Oct 12, 09:48 UTC

Launchnodes submits a compensation transaction for the day’s rewards reduction.

Oct 12, 11:02 UTC

Tweets are posted with a status update (all offline validators back up, estimated slashing penalties calculated, Launchnodes has compensated stakers for the daily rewards reduction): https://twitter.com/LidoFinance/status/1712423359340818926

Oct 13, 07:30 UTC

Root cause analysis concluded by Launchnodes.

 

Action Items

  • Enable Web3signer slashing database (already confirmed as done).
  • Launchnodes to work on plan for setting up infra anew on baremetal using updated risk mitigation processes.
  • Launchnodes to communicate plan and updated risk mitigation and anti-slashing processes to Lido DAO community.
  • Launchnodes to proceed with shutdown of interim infra and bringing up validators on baremetal infra.

 

Appendix A

Launchnodes Incident Report

 

Timeline & Root Cause

Timeline

October 11th


14:34 UTC


System Outages at DC1



Launchnodes’ internal monitoring systems raised alerts that core components of Launchnodes’ infrastructure in their DC1 ‘bare metal’ Data Centre environment were sporadically down.


Launchnodes had already noticed intermittent connectivity issues through its monitoring dashboard and was investigating.  Initially this was believed to be due to activation of multiple new nodes in DC1.


Investigation of Connectivity Outage

Further investigation showed that Launchnodes’ node clusters were inaccessible, due to a failure of DC1’s Virtual Private Connection.


Access to Launchnodes’ servers was possible; however, access to the node clusters was not.  Node connectivity was intermittent, with missed attestations noted on some nodes.


Escalation to Data Centre Provider

Tickets were raised immediately with DC1 support to restore connectivity, including evidence of the problem from logs, and ping tests to different servers.


October 11th


15:35 UTC

EL-CL Services Down

Further alerts were generated, notifying Launchnodes that key Execution Layer-Consensus Layer services were down.

October 11th


15:41 UTC

Lido DAO Notifications

Lido DAO members confirmed what Launchnodes’ monitoring showed: Validators were offline.  Launchnodes explained the ongoing DC1 connectivity issues.

October 11th


15:45 UTC

Decision to Failover to Backup Data Centre

After investigating the outage and with no imminent resolution expected at DC1, Launchnodes’ team decided to fail over to a 2nd ‘cold standby’ data centre, DC2.

Detaching Besu storage at DC1



Launchnodes has bare metal servers that constitute an independent Kubernetes cluster in DC1.  The Besu service runs on that cluster and uses Besu storage that is local to the server.


As the Kubernetes clusters were inaccessible at DC1, but access to the server remained possible, Launchnodes elected to move the Besu storage in order to detach it from the Besu service.


This was carried out to prevent validators from attesting, even if the connectivity to the nodes was restored, as the EL-CL pair would not function without synchronisation with the latest head.


Preparing for Failover to Backup Data Centre, DC2

Launchnodes began preparing to activate its ‘cold standby’ backup environment, in the expectation that the nodes at the primary site had been rendered permanently offline.



Begin Provisioning Failover Nodes with Existing Web3 Signer

Launchnodes runs web3 signers remotely from its node infrastructure.  This is an architectural choice: it allows the Web3 signer to act as a ‘kill switch’ when Validator nodes need to be stopped from attesting because connectivity is erratic or nodes are inaccessible.


Nodes at the failover Data Centre DC2 were configured to utilize the existing Web3 signer, already loaded with keys.  


Launchnodes started the services for Pre-synced Beacon and Geth nodes, and began to bring node clusters online in the failover Data Centre.


October 11th


15:55 UTC

Notification of Slashing 

Launchnodes’ monitoring systems detected a slashing event on 2 Validators.


This was immediately confirmed by the Lido team through alert messages from Lido DAO contributors.


Slashing took place on 18 further validators.


Lido DAO contributors requested that all nodes be deactivated to avoid further issues.


October 11th 


16:02 UTC

Disabling of Nodes



Launchnodes completed deactivating all of its node infrastructure by manually stopping all Validator services at DC1, as advised by Lido DAO contributors.





October 11th


16:04 UTC

Root Cause Analysis

Launchnodes began investigating the root cause of the slashing incident.  Node infrastructure at DC1 remained inaccessible.

October 11th


16:16 UTC

Lido Communications

Launchnodes reviewed and confirmed the accuracy of Lido’s proposed tweets about the incident.

October 11th


17:37 UTC

Launchnodes Pledge to Lido Stakers

Launchnodes tweets, “Launchnodes will reimburse all losses incurred to Lido.”

October 11th


18:52 UTC

Staking Limit Reset

Launchnodes resets its “vetted keys”, to prevent further stake being allocated.

October 11th


20:00 UTC

Plan to Restore Service

Launchnodes prepared a step-by-step plan to safely and securely restore service to the ‘cold standby’ Validators in DC2.


This involved fully decommissioning the original nodes at DC1, destroying the servers and the web3 signer instance.


October 11th


22:28 UTC

Failover Nodes and Web3 Signer Instance Ready

Launchnodes completed setup and syncing of the Execution and Consensus layer node infrastructure.


A fresh Web3 signer instance was configured, with keys loaded from secure backup.  Web3signer slashing db was also enabled.

October 11th


22:52 UTC

Validators Online

Validator nodes were brought back online, with a measured, cautious approach proposed by Launchnodes and agreed by Lido DAO contributors.


Keys were steadily loaded on the web3signer, with care to exclude keys for validators that had already been slashed.


Validators were brought back online in batches of 10, 90, 400, 500, 500, 500, 500 and 100, with careful monitoring of performance at each stage.
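The batched approach described above could be sketched as follows. This is an illustrative outline rather than Launchnodes’ actual tooling: the function names (`plan_batches`, `staged_rollout`), the `load_keys` and `healthy` callbacks, and the placeholder key names are all hypothetical. The listed batch sizes are treated as upper bounds.

```python
from typing import Callable, Iterable, List, Set


def plan_batches(
    keys: Iterable[str], slashed: Set[str], batch_sizes: List[int]
) -> List[List[str]]:
    """Split non-slashed keys into batches no larger than the given sizes."""
    remaining = [k for k in keys if k not in slashed]
    batches = []
    for size in batch_sizes:
        if not remaining:
            break
        batches.append(remaining[:size])
        remaining = remaining[size:]
    if remaining:
        batches.append(remaining)  # anything left goes into a final batch
    return batches


def staged_rollout(
    batches: List[List[str]],
    load_keys: Callable[[List[str]], None],
    healthy: Callable[[], bool],
) -> int:
    """Load one batch at a time and stop at the first failed health check."""
    loaded = 0
    for batch in batches:
        load_keys(batch)
        loaded += len(batch)
        if not healthy():
            break
    return loaded


# Placeholder keys: 2582 validators, of which the first 20 stand in for slashed ones.
all_keys = [f"key-{i}" for i in range(2582)]
slashed_keys = set(all_keys[:20])
batches = plan_batches(all_keys, slashed_keys, [10, 90, 400, 500, 500, 500, 500, 100])

signer_keys: List[str] = []
total_loaded = staged_rollout(batches, signer_keys.extend, lambda: True)
print([len(b) for b in batches], total_loaded)
```

In this illustration the last batch is smaller than requested, because only 2562 non-slashed keys remain to be loaded.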


October 11th


Ongoing

Monitoring

Launchnodes continued to monitor node performance throughout the night.

October 12th


06:00 UTC

Impact Assessment

Lido DAO contributors and Launchnodes review the impact of the slashing.



October 12th


06:40 UTC

Making stETH Stakers Whole

Launchnodes commits to ensuring that there is no negative financial impact to any Lido staker as a result of this incident.


Launchnodes offers to disburse the calculated rewards impact for the first day to the Lido protocol Execution Layer Rewards Vault before the rebasing scheduled for 12:00 UTC.

October 12th


09:48 UTC

Compensation Submitted

Launchnodes transfers 25.663 ETH in compensation to the Lido EL Rewards Vault, with an agreement that any further losses resulting from this incident will also be compensated.

October 12th


Ongoing

Infrastructure Review and Optimisation

Launchnodes reviews its infrastructure and processes, in order to implement guaranteed safeguards against future slashing incidents.

 

Root Cause

The root cause was Launchnodes’ failure to transition to its ‘cold standby’ Data Centre, DC2, in an optimal way.

 

This resulted in nodes being active across 2 different Data Centres simultaneously - a scenario that should not have occurred.

 

Several actions could have prevented nodes from being slashed, including:

  • Destroying the DC1 node cluster before failing over to DC2.
  • Destroying the web3 signer before failing over to DC2.
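The preventive actions listed above, together with the signer-level safeguards noted in the incident summary, could be expressed as a checklist that must pass before a standby validator client is allowed to connect to the signer. The following is a hypothetical sketch. `FailoverState`, its fields and `failover_blockers` are illustrative names, and each field represents something an operator would have to verify by other means.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailoverState:
    # All fields are hypothetical inputs that an operator would have to verify.
    primary_cluster_destroyed: bool
    original_signer_destroyed: bool
    primary_blocked_from_signer: bool          # e.g. via firewall rules
    signer_slashing_protection_enabled: bool


def failover_blockers(state: FailoverState) -> List[str]:
    """Return the reasons why failing over now would be unsafe (empty if none)."""
    blockers = []
    # The primary must be unable to obtain signatures by at least one means.
    primary_fenced = (
        state.primary_cluster_destroyed
        or state.original_signer_destroyed
        or state.primary_blocked_from_signer
    )
    if not primary_fenced:
        blockers.append("primary validator client can still reach the signer")
    if not state.signer_slashing_protection_enabled:
        blockers.append("signer-level slashing protection is disabled")
    return blockers


# A state resembling the one described in this report.
incident_like = FailoverState(
    primary_cluster_destroyed=False,
    original_signer_destroyed=False,
    primary_blocked_from_signer=False,
    signer_slashing_protection_enabled=False,
)
for reason in failover_blockers(incident_like):
    print("BLOCKED:", reason)
```

Any one of the fencing conditions is enough to clear the first blocker. The slashing-protection check is kept separate so that it still catches conflicting requests if the fencing turns out to be incomplete.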

 

Appendix B

Slashed validators

 

Slashed Validators | Slashed by | Reason                | Slot    | Epoch
-------------------+------------+-----------------------+---------+-------
964922             | 890138     | Attestation Violation | 7517976 | 234936
964396             | 890138     | Attestation Violation | 7517976 | 234936
964371             | 574681     | Attestation Violation | 7517975 | 234936
964360             | 574681     | Attestation Violation | 7517975 | 234936
964104             | 189894     | Attestation Violation | 7517974 | 234936
963910             | 189894     | Attestation Violation | 7517974 | 234936
963894             | 742440     | Attestation Violation | 7517973 | 234936
963841             | 742440     | Attestation Violation | 7517973 | 234936
963820             | 175591     | Attestation Violation | 7517972 | 234936
963578             | 175591     | Attestation Violation | 7517972 | 234936
963574             | 284709     | Attestation Violation | 7517971 | 234936
963403             | 284709     | Attestation Violation | 7517971 | 234936
965141             | 535608     | Attestation Violation | 7517970 | 234936
963975             | 535608     | Attestation Violation | 7517970 | 234936
963781             | 418448     | Attestation Violation | 7517969 | 234936
963358             | 418448     | Attestation Violation | 7517969 | 234936
963194             | 281420     | Attestation Violation | 7517968 | 234936
963275             | 281420     | Attestation Violation | 7517968 | 234936
962852             | 940614     | Attestation Violation | 7517967 | 234936
962807             | 940614     | Attestation Violation | 7517967 | 234936
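The slots listed above can be related to the timeline with a short conversion, sketched here using the Ethereum mainnet genesis time (1606824023), 12-second slots and 32 slots per epoch. The helper names `slot_to_utc` and `slot_to_epoch` are introduced for illustration.

```python
from datetime import datetime, timezone

MAINNET_GENESIS_TIME = 1606824023  # 2020-12-01 12:00:23 UTC
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32


def slot_to_utc(slot: int) -> datetime:
    """Start time of a mainnet slot as a timezone-aware UTC datetime."""
    timestamp = MAINNET_GENESIS_TIME + slot * SECONDS_PER_SLOT
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def slot_to_epoch(slot: int) -> int:
    return slot // SLOTS_PER_EPOCH


first_slot, last_slot = 7517967, 7517976
print(slot_to_utc(first_slot).isoformat(), slot_to_epoch(first_slot))
# 2023-10-11T15:53:47+00:00 234936
print(slot_to_utc(last_slot).isoformat(), slot_to_epoch(last_slot))
# 2023-10-11T15:55:35+00:00 234936
```

The ten consecutive slots therefore span roughly 15:53 to 15:55 UTC on October 11, which lines up with the first slashing alert in the timeline.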