An update as of November 28, 2023: as of November 16, 2023, the 20 validators in question are now withdrawable and have thus stopped accumulating penalties. View the full update here: https://research.lido.fi/t/slashing-incident-involving-launchnodes-validators-oct-11-2023/5631/4.
Incident Summary and Root Cause
At 15:55 UTC on October 11, 2023, Lido DAO contributors alerted the Node Operator Launchnodes to a slashing event in progress that ultimately affected 20 of the validators they operate as participants in the Lido protocol. A full list of the impacted validators is provided in APPENDIX B below.
Within 10 minutes, the affected clusters were brought offline to mitigate further risk, and the Launchnodes team began investigating the root cause. The root cause of the slashing was a non-optimal fallback procedure executed during data centre connectivity issues. In an attempt to restore validator connectivity, multiple validator client instances (an initial instance and a manually activated fallback instance) were pointed at a single Web3signer instance, without slashing protection enabled at the Web3signer level and without blocking the initial instance from the signer (e.g. via firewall rules). This caused double votes for the loaded validators, which led to attester slashings of 20 validators.
The fallback validator client was brought online and connected to the Web3signer after an attempt had been made to deactivate the nodes attached to the original validator client instance by moving the associated EL node's data container.
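For context, the condition that makes two such attestations slashable is the "double vote" rule from the Ethereum consensus specification: any pair of distinct attestations by the same validator that target the same epoch. The snippet below is a simplified Python rendering of that spec check (`is_slashable_attestation_data`), with minimal stand-in data classes rather than the real spec types; it is an illustration of why two clients signing through one unprotected signer get slashed, not Launchnodes' or Web3signer's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    root: str


@dataclass(frozen=True)
class AttestationData:
    source: Checkpoint
    target: Checkpoint
    beacon_block_root: str


def is_slashable_attestation_data(data_1: AttestationData, data_2: AttestationData) -> bool:
    """Simplified version of the consensus-spec check used to build attester slashings."""
    double_vote = data_1 != data_2 and data_1.target.epoch == data_2.target.epoch
    surround_vote = (
        data_1.source.epoch < data_2.source.epoch
        and data_2.target.epoch < data_1.target.epoch
    )
    return double_vote or surround_vote


# Two validator clients performing the same duty through one signer can easily
# produce two distinct votes for the same target epoch (e.g. different head roots):
a1 = AttestationData(Checkpoint(100, "0xaa"), Checkpoint(101, "0xbb"), beacon_block_root="0x01")
a2 = AttestationData(Checkpoint(100, "0xaa"), Checkpoint(101, "0xbb"), beacon_block_root="0x02")
assert is_slashable_attestation_data(a1, a2)  # double vote -> attester slashing
```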
A full post mortem from Launchnodes’ perspective is available in APPENDIX A below. A full timeline of the incident can be found in the “Timeline” section below.
Impact
The impact on stakers (stETH holders), in terms of penalties and missed rewards, is analysed below:
| Description | Amount |
| --- | --- |
| Initial slashing penalties (actual) | 20 ETH (1 ETH penalty per slashed validator) |
| Additional slashing penalties (i.e. due to the correlated slashing multiplier) | 0 ETH [1] |
| Slashing-subsequent validator duty inactivity penalties and missed rewards (attestations) | Attestation penalties [2]: 1.197 ETH<br>Missed attestation rewards [2]: 1.668 ETH (missed target + source rewards: 1.235 ETH; missed head rewards: 0.432 ETH) |
| Slashing-subsequent validator duty inactivity penalties (missed proposals and/or sync committees) | Missed proposal rewards [3]: 0.198 ETH<br>Sync committee penalties / rewards: N/A [4] |
| Slashing-subsequent inactivity leak | 0 ETH [5] |
| Penalties and missed rewards of the associated cluster (excluding the 20 slashed validators) deactivated during the slashing investigation | Penalties: 2.188 ETH<br>Missed rewards: 3.426 ETH |
| Sum total of projected penalties and missed rewards of all impacted validators | 28.677 ETH |

[1] Projected. No additional slashing penalties are expected, as thousands of additional validators would need to be slashed within the correlated 36-day period to trigger a penalty of 1 ETH per validator.
[2] Projected, assuming a projected base reward of 377 gwei, an attestation penalty of ~7,304.75 gwei per epoch, and an attestation reward of ~10,179 gwei per epoch, incurred over the 8,192-epoch slashing vector until the validators become withdrawable on the Beacon Chain.
[3] Projected, based on an expected value of 6.14 proposals across the 20 validators (average proposal reward over two weeks: 32,258,570 gwei).
[4] See the “interesting edge case” note in “Other Penalties”.
[5] Projected. No inactivity leak is expected, as the network is not having issues finalising.
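As a sanity check, the projected figures above can be reproduced directly from the stated assumptions (a 1 ETH initial penalty per slashed validator, the per-epoch attestation penalty and reward from footnote [2], the 8,192-epoch slashing vector, and the expected 6.14 missed proposals from footnote [3]). The snippet below is only a back-of-the-envelope reproduction of the table, not the calculation actually used by the contributors:

```python
GWEI = 10**9
SLASHED_VALIDATORS = 20
SLASHING_VECTOR_EPOCHS = 8192  # epochs until the slashed validators become withdrawable

# Per-epoch / per-event figures taken from the table footnotes (projected, in gwei)
ATTESTATION_PENALTY_GWEI = 7304.75
ATTESTATION_REWARD_GWEI = 10179
EXPECTED_PROPOSALS = 6.14
AVG_PROPOSAL_REWARD_GWEI = 32_258_570

initial_penalties = 1.0 * SLASHED_VALIDATORS  # 1 ETH per slashed validator -> 20 ETH
attestation_penalties = (
    ATTESTATION_PENALTY_GWEI * SLASHING_VECTOR_EPOCHS * SLASHED_VALIDATORS / GWEI
)  # ~1.197 ETH
missed_attestation_rewards = (
    ATTESTATION_REWARD_GWEI * SLASHING_VECTOR_EPOCHS * SLASHED_VALIDATORS / GWEI
)  # ~1.668 ETH
missed_proposal_rewards = EXPECTED_PROPOSALS * AVG_PROPOSAL_REWARD_GWEI / GWEI  # ~0.198 ETH

# Cluster-wide impact while deactivated during the investigation (excluding the 20 slashed)
cluster_penalties = 2.188
cluster_missed_rewards = 3.426

total = (
    initial_penalties
    + attestation_penalties
    + missed_attestation_rewards
    + missed_proposal_rewards
    + cluster_penalties
    + cluster_missed_rewards
)
print(round(total, 3))  # ~28.677 ETH, matching the table's sum total
```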
Resolution
Following the incident, Launchnodes shut down multiple clusters totalling 2,582 validators (including the 20 slashed) to ensure no further slashing could take place. To prevent the slashing from spreading, Launchnodes nuked the original node clients and data (EL and CL nodes and validator clients) as well as the original Web3signer instance. Over the following hours, Launchnodes successfully reactivated the remaining 2,562 validators, with slashing protection enabled on the new Web3signer instance, and no further slashing took place.
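For illustration, the protection that was missing during the failover and is now enabled on the new Web3signer instance works roughly as follows: before signing, the signer compares the requested attestation against the last source/target epochs it has recorded for that key (in the spirit of the EIP-3076 slashing-protection rules) and refuses anything that could be a double or surround vote. This is a simplified sketch of that idea, not Web3signer's actual implementation:

```python
from dataclasses import dataclass


@dataclass
class SigningRecord:
    last_source_epoch: int = -1
    last_target_epoch: int = -1


# pubkey -> last signed attestation epochs; a real signer persists this in its slashing DB
records: dict[str, SigningRecord] = {}


def may_sign_attestation(pubkey: str, source_epoch: int, target_epoch: int) -> bool:
    """Refuse to sign anything that could be a double or surround vote for this key."""
    rec = records.setdefault(pubkey, SigningRecord())
    if source_epoch < rec.last_source_epoch:
        return False  # would surround a previously signed vote
    if target_epoch <= rec.last_target_epoch:
        return False  # same or older target epoch: potential double vote
    rec.last_source_epoch = source_epoch
    rec.last_target_epoch = target_epoch
    return True


# First client signs the duty; a second request for the same target epoch
# (e.g. from a duplicate validator client) is refused instead of being signed.
assert may_sign_attestation("0xval1", source_epoch=100, target_epoch=101) is True
assert may_sign_attestation("0xval1", source_epoch=100, target_epoch=101) is False
```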
Regarding staker compensation, Launchnodes has already disbursed 25.663 ETH to cover the initial slashing penalties and the missed rewards due to infrastructure downtime, meaning stakers suffered no reduced rewards on the day of the slashing. Launchnodes has also pledged to compensate for the additional penalties that the slashed validators will accrue until they are withdrawn from the network.
Timeline
The order and timing of events is outlined below:
| Time (UTC) | Event |
| --- | --- |
| Oct 11, 15:41 | Lido DAO contributor monitoring alerts fire, noting that a majority of Launchnodes-operated validators are offline. |
| Oct 11, 15:41 | Lido DAO contributors notify Node Operator Launchnodes of the offline validators. Launchnodes acknowledges, confirms that internal alerting worked, and reports that the issue is being investigated. The cause of the outage is local data centre issues and troubleshooting is in progress. |
| Oct 11, 15:47 | Offline validators gradually come back online after the Node Operator switches to a set of fallback nodes. |
| Oct 11, 15:53 | An ethereum-head-watcher (slashing monitoring system) alert fires, indicating that 2 validators were slashed. A Lido DAO contributor observes validator slashings taking place on the Ethereum network and begins investigation. Validator slashings continue for the following 10 slots. |
| Oct 11, 15:55 | Lido DAO contributors notify the Node Operator of the slashings, which are corroborated by internal monitoring. |
| Oct 11, 16:02 | Launchnodes confirms validator infrastructure has been shut off. |
| Oct 11, 16:05 | 20 validators are confirmed slashed in total. |
| Oct 11, 16:27 | The Lido twitter account provides a public update on the slashing incident: https://twitter.com/LidoFinance/status/1712142945783013393 |
| Oct 11, 16:43 | The data centre continues to have issues, preventing root cause analysis from proceeding at the desired pace. |
| Oct 11, 18:52 | Launchnodes executes a transaction to remove one undeposited key from the registry, resetting its “vetted keys” (i.e. staking limit) to the number of keys currently in use so that no more stake would be allocated to the operator. Tx: https://etherscan.io/tx/0x55bf362106c6f1f1a8a8632b60fb05c0b3ab5fc8e1cbd7459797ff8c10f35a0b |
| Oct 11, 20:00 | Launchnodes prepares a plan to restore connectivity to the remaining validators by nuking the original nodes and Web3signer instance and using encrypted backups of key material to spin up a new temporary instance until data centre connectivity can be fully restored. As the bare metal server is reachable but the Kubernetes clusters are not, this is a viable solution. |
| Oct 11, 22:28 | The Node Operator has nuked the server hosted in Data Centre 1 and set up the new Web3signer instance. The Web3signer slashing database has also been enabled. |
| Oct 11, 22:52 | After taking the above mitigation steps, the Node Operator prepares to bring the first validators back online by gradually loading keys into the Web3signer instance and monitoring performance. |
| Oct 11, 23:23 | The Node Operator brings the first 10 validators back online; successful attestations are observed. |
| Oct 12, 00:02 | An additional 90 validators are brought back online following no observed issues. |
| Oct 12, 00:50 | The next 400 validators are brought back online to observe performance. 500 validators in total are now actively attesting. |
| Oct 12, 01:34 | An additional 500 validators are brought back online. Performance monitoring continues with no issues observed since re-onlining began. |
| Oct 12, 02:52 | An additional 500 validators are brought back online. Performance monitoring continues with no issues observed. |
| Oct 12, 04:23 | An additional 500 validators are brought online after removing the 20 slashed keys from those being uploaded. 2,000 validators in total are now actively attesting. |
| Oct 12, 05:13 | An additional 500 validators are brought back online. Performance monitoring continues with no issues observed. |
| Oct 12, 05:31 | The final 100 validators are brought back online. Performance monitoring continues with no issues observed. |
| Oct 12, 06:00 | Launchnodes and Lido DAO contributors work together on lost-rewards calculations for the day and on the estimated total impact of slashing and downtime until the slashed validators are exited. |
| Oct 12, 09:48 | Launchnodes submits the compensation transaction for the day's rewards reduction. |
| Oct 12, 11:02 | Tweets are posted with a status update (all offline validators back up, estimated slashing penalties calculated, Launchnodes has compensated stakers for the daily rewards reduction): https://twitter.com/LidoFinance/status/1712423359340818926 |
| Oct 13, 07:30 | Root cause analysis concluded by Launchnodes. |
Action Items
- Enable Web3signer slashing database (already confirmed as done).
- Launchnodes to work on a plan for setting up infrastructure anew on bare metal using updated risk-mitigation processes.
- Launchnodes to communicate plan and updated risk mitigation and anti-slashing processes to Lido DAO community.
- Launchnodes to proceed with shutting down the interim infrastructure and bringing validators back up on bare metal infrastructure.
Appendix A
Launchnodes Incident Report
Timeline & Root Cause
Timeline
| Date & time (UTC) | Event | Details |
| --- | --- | --- |
| October 11th, 14:34 | System Outages at DC1 | Launchnodes’ internal monitoring systems raised alerts that core components of Launchnodes’ infrastructure in their DC1 ‘bare metal’ data centre environment were sporadically down. Launchnodes had already noticed intermittent connectivity issues through its monitoring dashboard and was investigating; initially this was believed to be due to the activation of multiple new nodes in DC1. |
| | Investigation of Connectivity Outage | Further investigation showed that Launchnodes’ node clusters were inaccessible due to a failure of DC1’s Virtual Private Connection. Access to Launchnodes’ servers was possible, but access to node clusters was not. Node connectivity was intermittent, with missed attestations noted on some nodes. |
| | Escalation to Data Centre Provider | Tickets were raised immediately with DC1 support to restore connectivity, including evidence of the problem from logs and ping tests to different servers. |
| October 11th, 15:35 | EL-CL Services Down | Further alerts were generated, notifying Launchnodes that key Execution Layer-Consensus Layer services were down. |
| October 11th, 15:41 | Lido DAO Notifications | Lido DAO members confirmed Launchnodes’ monitoring of validators being offline. Launchnodes explained the ongoing DC1 connectivity issues. |
| October 11th, 15:45 | Decision to Fail Over to Backup Data Centre | After investigating the outage, and with no imminent resolution expected at DC1, Launchnodes’ team decided to fail over to a second ‘cold standby’ data centre, DC2. |
| | Detaching Besu Storage at DC1 | Launchnodes has bare metal servers that constitute an independent Kubernetes cluster in DC1, on which the Besu service runs; the Besu storage is local to that server and is used by the Besu service. As the Kubernetes clusters were inaccessible at DC1 but access to the server remained possible, Launchnodes elected to move the Besu storage to detach it from the Besu service. This was carried out to prevent validators from attesting even if connectivity to the nodes was restored, as the EL-CL pair would not function without synchronisation to the latest head. |
| | Preparing for Failover to Backup Data Centre, DC2 | Launchnodes began preparing to enliven its ‘cold standby’ backup environment, in the expectation that the nodes at the primary site had been rendered permanently offline. |
| | Begin Provisioning Failover Nodes with Existing Web3signer | Launchnodes runs Web3signer remotely from its node infrastructure. This is an architectural choice, as it enables the Web3signer to act as a ‘kill switch’ if validator nodes need to be stopped from attesting while connectivity is erratic or nodes are inaccessible. Nodes at the failover data centre DC2 were configured to use the existing Web3signer, already loaded with keys. Launchnodes started the services for pre-synced Beacon and Geth nodes and began to bring node clusters online in the failover data centre. |
| October 11th, 15:55 | Notification of Slashing | Launchnodes’ monitoring systems detected a slashing event on 2 validators. This was immediately confirmed by the Lido team through alert messages from Lido DAO contributors. Slashing took place on 18 further validators. Lido DAO contributors requested that all nodes be deactivated to avoid further issues. |
| October 11th, 16:02 | Disabling of Nodes | Launchnodes completed deactivating all of its node infrastructure by manually stopping all validator services at DC1, as advised by Lido DAO contributors. |
| October 11th, 16:04 | Root Cause Analysis | Launchnodes began investigating the root cause of the slashing incident. Node infrastructure at DC1 remained inaccessible. |
| October 11th, 16:16 | Lido Communications | Launchnodes reviewed and agreed on the accuracy of Lido’s proposed tweets about the incident. |
| October 11th, 17:37 | Launchnodes Pledge to Lido Stakers | Launchnodes tweets, “Launchnodes will reimburse all losses incurred to Lido.” |
| October 11th, 18:52 | Staking Limit Reset | Launchnodes resets its “vetted keys”, to prevent further stake being allocated. |
| October 11th, 20:00 | Plan to Restore Service | Launchnodes prepared a step-by-step plan to safely and securely restore service to the ‘cold standby’ validators in DC2. This involved fully decommissioning the original nodes at DC1, destroying the servers and the Web3signer instance. |
| October 11th, 22:28 | Failover Nodes and Web3signer Instance Ready | Launchnodes completed setup and syncing of the execution and consensus layer node infrastructure. A fresh Web3signer instance was configured, with keys loaded from secure backup and the Web3signer slashing database enabled. |
| October 11th, 22:52 | Validators Online | Validator nodes were brought back online, with a measured, cautious approach proposed by Launchnodes and agreed by Lido DAO contributors. Keys were steadily loaded onto the Web3signer, with care taken to exclude keys for validators that had already been slashed. Validators were brought back online in batches of 10, 90, 400, 500, 500, 500, 500, and 100, with careful monitoring of performance at each stage. |
| October 11th, ongoing | Monitoring | Launchnodes continued to monitor node performance throughout the night. |
| October 12th, 06:00 | Impact Assessment | Lido DAO contributors and Launchnodes reviewed the impact of the slashing. |
| October 12th, 06:40 | Making stETH Stakers Whole | Launchnodes committed to ensuring that there is no negative financial impact to any Lido staker as a result of this incident, and offered to disburse the calculated rewards impact for the first day to the Lido protocol Execution Layer Rewards Vault before the rebase scheduled for 12:00 UTC. |
| October 12th, 09:48 | Compensation Submitted | Launchnodes transferred compensation of 25.663 ETH to the Lido EL Rewards Vault, with an agreement that any further losses resulting from this incident would also be compensated. |
| October 12th, ongoing | Infrastructure Review and Optimisation | Launchnodes reviews its infrastructure and processes in order to implement guaranteed safeguards against future slashing incidents. |
Root Cause
The root cause was Launchnodes’ failure to transition across to its ‘cold standby’ data centre, DC2, in an optimal way.
This resulted in nodes being active across two different data centres simultaneously - a scenario that should not have occurred.
Several actions could have prevented the nodes from being slashed, including:
- Destroying the DC1 node cluster before failing over to DC2.
- Destroying the Web3signer before failing over to DC2.
Appendix B
Slashed validators