Post Mortem: Lido on Ethereum Launchnodes Slashing Incident
An update as of November 28th, 2023: as of Nov 16, 2023, the 20 validators in question are now withdrawable and have thus stopped accumulating penalties. View full update here: https://research.lido.fi/t/slashing-incident-involving-launchnodes-validators-oct-11-2023/5631/4.
Incident Summary and Root Cause
At 15:55 UTC on October 11 2023, Lido DAO contributors alerted the Launchnodes Node Operator of a slashing event taking place which ultimately affected 20 of the validators that they operate as users of the Lido protocol. A full list of the validators impacted is provided in APPENDIX B below.
Within 10 minutes, the affected clusters were brought offline to mitigate potential further risk, and the Launchnodes team began to investigate the root cause. The root cause of the slashing was the execution of non-optimal fallback procedures during datacenter connectivity issues. In an attempt to restore validator connectivity, multiple validator client instances (an initial instance and a manually activated fallback instance) were pointed to a single Web3signer instance. Slashing protection was not enabled at the Web3signer level, and the initial instance was not blocked from reaching the signer (e.g. via firewall rules). This caused double votes to occur for the loaded validators, which led to attester slashings of 20 validators.
The fallback validator client was brought online and connected to Web3signer after an attempt had been made to deactivate the nodes attached to the original validator client instance by moving the associated EL node’s data container.
A full post mortem from Launchnodes’ perspective is available in APPENDIX A below. A full timeline of the incident can be found in the “Timeline” section below.
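The double-vote condition described above can be illustrated with a minimal sketch. The function `is_slashable_attestation_data` follows the rule in the Ethereum consensus specification (double vote or surround vote). Everything else is hypothetical and for illustration only: the `SignerWithProtection` class, the public key, and the roots and epochs are made up, and the in-memory history is a simplification of what a real slashing protection database does, not a description of Web3signer's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    root: str


@dataclass(frozen=True)
class AttestationData:
    slot: int
    beacon_block_root: str
    source: Checkpoint
    target: Checkpoint


def is_slashable_attestation_data(a: AttestationData, b: AttestationData) -> bool:
    """Consensus-spec rule: a and b are slashable if they form a double vote
    (different data, same target epoch) or if a surrounds b."""
    double_vote = a != b and a.target.epoch == b.target.epoch
    surround_vote = a.source.epoch < b.source.epoch and b.target.epoch < a.target.epoch
    return double_vote or surround_vote


class SignerWithProtection:
    """Hypothetical remote signer that remembers what each key has signed."""

    def __init__(self) -> None:
        self._signed: dict[str, list[AttestationData]] = {}

    def sign_attestation(self, pubkey: str, data: AttestationData) -> bool:
        """Return True if signing is allowed, False if it would be slashable."""
        history = self._signed.setdefault(pubkey, [])
        for previous in history:
            # The surround check is directional, so test both orderings.
            if is_slashable_attestation_data(previous, data) or \
                    is_slashable_attestation_data(data, previous):
                return False
        if data not in history:
            history.append(data)
        return True


# Two validator clients with different views of the chain head ask the same
# signer to attest for the same validator in the same target epoch.
source = Checkpoint(epoch=100, root="0xaa")
target = Checkpoint(epoch=101, root="0xbb")
primary_vote = AttestationData(3232, "0x01", source, target)
fallback_vote = AttestationData(3232, "0x02", source, target)

signer = SignerWithProtection()
first_request_signed = signer.sign_attestation("0xvalidator", primary_vote)
second_request_signed = signer.sign_attestation("0xvalidator", fallback_vote)
```

In this sketch the first request is signed and the second is refused, because the two messages differ while sharing a target epoch. Without the history check, both would be signed and the validator would be slashable.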
Impact
The impact on stakers (stETH holders) from a penalties and missed rewards perspective is analysed below:
Description | Amounts |
Initial slashing penalties | Penalties: 20 ETH |
Additional slashing penalties (i.e. due to correlated slashing multiplier) | 0* (*Projected. No additional slashing penalties are expected, as thousands of additional validators would need to be slashed within the correlated 36-day period to trigger a penalty of 1 ETH per validator.) |
Slashing-subsequent validator duty inactivity penalties and missed rewards (attestations) | Attestation penalties*: 1.197 ETH; Missed Target + Source Rewards: 1.235 ETH; Missed Head Rewards: 0.432 ETH |
Slashing-subsequent validator duty inactivity penalties (missed proposals and/or sync committees) | Missed proposal rewards*: 0.198 ETH |
Slashing-subsequent inactivity leak | 0* |
Penalties and missed rewards of the associated cluster (excluding 20 slashed validators) de-activated during the Slashing investigation | Penalties: 2.188 ETH Missed rewards: 3.426 ETH |
Sum total of projected penalties and missed rewards of all impacted validators | 28.677 ETH |
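The initial and correlated slashing penalties in the table above can be reproduced with a short sketch of the consensus-spec arithmetic. The constants are the Bellatrix-era values from the Ethereum consensus specification. The total active balance of 28,000,000 ETH is an assumption chosen for illustration, not a figure from this report, so the resulting threshold is indicative only.

```python
GWEI_PER_ETH = 10**9
EFFECTIVE_BALANCE_INCREMENT = 1 * GWEI_PER_ETH
MAX_EFFECTIVE_BALANCE = 32 * GWEI_PER_ETH
MIN_SLASHING_PENALTY_QUOTIENT_BELLATRIX = 32
PROPORTIONAL_SLASHING_MULTIPLIER_BELLATRIX = 3

# ASSUMPTION: illustrative network-wide active balance, not taken from the report.
ASSUMED_TOTAL_ACTIVE_BALANCE = 28_000_000 * GWEI_PER_ETH


def initial_penalty(effective_balance: int) -> int:
    """Penalty applied immediately when a validator is slashed (in Gwei)."""
    return effective_balance // MIN_SLASHING_PENALTY_QUOTIENT_BELLATRIX


def correlation_penalty(effective_balance: int,
                        total_slashed_balance: int,
                        total_active_balance: int) -> int:
    """Penalty applied midway through the withdrawability delay (in Gwei),
    scaled by how much stake was slashed in the surrounding window."""
    adjusted = min(
        total_slashed_balance * PROPORTIONAL_SLASHING_MULTIPLIER_BELLATRIX,
        total_active_balance,
    )
    numerator = effective_balance // EFFECTIVE_BALANCE_INCREMENT * adjusted
    return numerator // total_active_balance * EFFECTIVE_BALANCE_INCREMENT


def validators_for_nonzero_penalty(total_active_balance: int) -> int:
    """Smallest number of 32 ETH validators slashed in the window that yields
    a correlation penalty of at least one increment (1 ETH) per validator."""
    increments = MAX_EFFECTIVE_BALANCE // EFFECTIVE_BALANCE_INCREMENT
    per_validator = (increments * PROPORTIONAL_SLASHING_MULTIPLIER_BELLATRIX
                     * MAX_EFFECTIVE_BALANCE)
    return -(-total_active_balance // per_validator)  # ceiling division


slashed_count = 20
initial_total = slashed_count * initial_penalty(MAX_EFFECTIVE_BALANCE)
correlated = correlation_penalty(
    MAX_EFFECTIVE_BALANCE,
    slashed_count * MAX_EFFECTIVE_BALANCE,
    ASSUMED_TOTAL_ACTIVE_BALANCE,
)
threshold = validators_for_nonzero_penalty(ASSUMED_TOTAL_ACTIVE_BALANCE)

print(initial_total / GWEI_PER_ETH)  # 20.0 ETH, matching the table
print(correlated / GWEI_PER_ETH)     # 0.0 ETH
print(threshold)                     # 9115 under the assumed total balance
```

Because the penalty is rounded down to whole 1 ETH increments, 20 slashed validators produce no correlation penalty. Under the assumed total balance, roughly nine thousand validators would have to be slashed in the same window before each slashed validator lost an additional 1 ETH.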
Resolution
Following the incident, Launchnodes shut down multiple clusters totalling 2582 validators (including the 20 slashed) to ensure no further slashing could take place. In order to prevent the slashing from spreading, Launchnodes nuked the original node clients & data (EL+CL nodes and validator clients) and the original Web3signer instance. Over the following hours, Launchnodes reactivated the remaining 2562 validators successfully without any further slashing event taking place, with slashing protection enabled on the new Web3signer instance.
Regarding staker compensation, Launchnodes has already disbursed 25.663 ETH to cover the initial slashing penalties and missed rewards due to infrastructure downtime, meaning that stakers suffered no reduced rewards on the day of the slashing, and has pledged to also compensate for additional penalties that the slashed validators will receive until they are withdrawn from the network.
Timeline
The order and timing of events is outlined below:
Oct 11, 15:41 UTC | Lido DAO contributor monitoring alerts fire, noting that a majority of Launchnodes operated validators are offline. |
Oct 11, 15:41 UTC | Lido DAO contributors notify Node Operator Launchnodes of offline validators. Launchnodes acknowledges and confirms internal alerting worked and that the issue is being investigated. Cause of the outage is local data center issues and troubleshooting is in progress. |
Oct 11, 15:47 UTC | Offline validators are gradually coming back online after the Node Operator has switched to a set of fallback nodes. |
Oct 11, 15:53 UTC | ethereum-head-watcher (slashing monitoring system) alert fires, indicating that 2 validators were slashed. Lido DAO contributor observes validator slashings taking place on the Ethereum network and begins investigation. Validator slashings continue for the following 10 slots. |
Oct 11, 15:55 UTC | Lido DAO contributors notify Node Operator of slashings, which are corroborated by internal monitoring. |
Oct 11, 16:02 UTC | Launchnodes confirms validator infrastructure has been shut off |
Oct 11, 16:05 UTC | 20 validators confirmed slashed in total |
Oct 11, 16:27 UTC | Lido twitter account provides public update notifying of the slashing incident: https://twitter.com/LidoFinance/status/1712142945783013393 |
Oct 11, 16:43 UTC | Data center continues to have issues, preventing root cause analysis from proceeding at the desired pace. |
Oct 11, 18:52 UTC | Launchnodes executes a transaction to remove one undeposited key from the registry, resetting their “vetted keys” (i.e. staking limit) to the currently used number of keys so that no more stake would be allocated to the operator. Tx: https://etherscan.io/tx/0x55bf362106c6f1f1a8a8632b60fb05c0b3ab5fc8e1cbd7459797ff8c10f35a0b |
Oct 11, 20:00 UTC | Launchnodes prepares a plan to restore connectivity to the remaining validators by nuking the original nodes and Web3signer instance and using encrypted backups of key material to spin up a new temporary instance until data center connectivity can be fully restored. As the baremetal server is reachable but the Kubernetes clusters are not, this is a viable solution. |
Oct 11, 22:28 UTC | Node Operator has nuked the server hosted in Data Centre 1 and has set up the new Web3signer instance. Web3signer slashing db has also been enabled. |
Oct 11, 22:52 UTC | After taking above mentioned mitigation steps, Node Operator prepares to bring the first validators back online by gradually loading keys into the Web3signer instance and then monitoring performance. |
Oct 11, 23:23 UTC | Node Operator brings first 10 validators back online, successful attestations are observed. |
Oct 12, 00:02 UTC | Additional 90 validators are brought back online following no observed issues. |
Oct 12, 00:50 UTC | Next 400 validators are brought back online to observe performance. 500 total validators now actively attesting. |
Oct 12, 01:34 UTC | Additional 500 validators are brought back online. Performance monitoring continues with no issues observed since re-onlining began. |
Oct 12, 02:52 UTC | Additional 500 validators are brought back online. Performance monitoring continues with no issues observed since re-onlining began. |
Oct 12, 04:23 UTC | Additional 500 validators are brought online after removing 20 slashed keys from those being uploaded. Performance monitoring continues with no issues observed since re-onlining began. 2000 total actively attesting. |
Oct 12, 05:13 UTC | Additional 500 validators are brought back online. Performance monitoring continues with no issues observed since re-onlining began. |
Oct 12, 05:31 UTC | Final 100 validators brought back online. Performance monitoring continues with no issues observed. |
Oct 12, 06:00 UTC | Launchnodes and Lido DAO contributors work together on lost rewards calculations for the day and estimated total impact of slashing and downtime until the slashed validators are exited. |
Oct 12, 09:48 UTC | Launchnodes submits compensation transaction for the day’s rewards reduction. |
Oct 12, 11:02 UTC | Tweets posted with status update (all offline validators back up, estimated slashing penalties calculated, Launchnodes has compensated stakers for daily rewards reduction) https://twitter.com/LidoFinance/status/1712423359340818926 |
Oct 13, 07:30 UTC | Root cause analysis concluded by Launchnodes. |
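The staged re-onlining in the timeline above (batches of 10, 90, 400, 500, 500, 500, 500 and 100, with slashed keys excluded and performance checked between batches) can be sketched as follows. This is an illustrative outline only, not Launchnodes' actual tooling: `plan_batches`, `staged_rollout`, the `load_keys` and `is_healthy` callbacks, and the placeholder key names are all hypothetical. The batch sizes are treated as upper bounds, so with the 2562 remaining validators stated in this report the final batch in this sketch holds fewer than the nominal 100 keys.

```python
from typing import Callable, List, Sequence, Set

# Nominal batch sizes as listed in the timeline.
BATCH_SIZES = [10, 90, 400, 500, 500, 500, 500, 100]


def plan_batches(all_keys: Sequence[str],
                 slashed_keys: Set[str],
                 batch_sizes: Sequence[int]) -> List[List[str]]:
    """Split the non-slashed keys into consecutive batches of growing size."""
    eligible = [key for key in all_keys if key not in slashed_keys]
    batches: List[List[str]] = []
    cursor = 0
    for size in batch_sizes:
        if cursor >= len(eligible):
            break
        batches.append(eligible[cursor:cursor + size])
        cursor += size
    if cursor < len(eligible):
        # Any keys beyond the planned sizes go into one final batch.
        batches.append(eligible[cursor:])
    return batches


def staged_rollout(batches: Sequence[Sequence[str]],
                   load_keys: Callable[[Sequence[str]], None],
                   is_healthy: Callable[[], bool]) -> int:
    """Load one batch at a time and stop as soon as a health check fails.
    Returns the number of keys loaded."""
    loaded = 0
    for batch in batches:
        load_keys(batch)
        loaded += len(batch)
        if not is_healthy():
            break
    return loaded


# Placeholder data: 2582 keys in total, 20 of which belong to slashed validators.
keys = [f"key-{i}" for i in range(2582)]
slashed = set(keys[:20])
batches = plan_batches(keys, slashed, BATCH_SIZES)

signer_keys: List[str] = []  # stands in for the keys held by the new signer
loaded_count = staged_rollout(batches, signer_keys.extend, lambda: True)

print([len(batch) for batch in batches])
print(loaded_count)
```

Starting with a very small batch limits the exposure if the new setup still has a fault, and the health check between batches gives a natural point to halt before more keys are at risk.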
Action Items
- Enable Web3signer slashing database (already confirmed as done).
- Launchnodes to work on plan for setting up infra anew on baremetal using updated risk mitigation processes.
- Launchnodes to communicate plan and updated risk mitigation and anti-slashing processes to Lido DAO community.
- Launchnodes to proceed with shutdown of interim infra and bringing up validators on baremetal infra.
Appendix A
Launchnodes Incident Report
Timeline & Root Cause
Timeline
October 11th 14:34 UTC | System Outages at DC1 | Launchnodes’ internal monitoring systems raised alerts that core components of Launchnodes’ infrastructure in their DC1 ‘bare metal’ Data Centre environment were sporadically down. Launchnodes had already noticed intermittent connectivity issues through its monitoring dashboard and was investigating. Initially this was believed to be due to activation of multiple new nodes in DC1. |
Investigation of Connectivity Outage | Further investigation showed that Launchnodes’ node clusters were inaccessible due to a failure of DC1’s Virtual Private Connection. Access to Launchnodes’ servers was possible; however, access to the node clusters was not. Node connectivity was intermittent, with missed attestations noted on some nodes. | |
Escalation to Data Centre Provider | Tickets were raised immediately with DC1 support to restore connectivity, including evidence of the problem from logs, and ping tests to different servers. | |
October 11th 15:35 UTC | EL-CL Services Down | Further alerts were generated, notifying Launchnodes that key Execution Layer-Consensus Layer services were down. |
October 11th 15:41 UTC | Lido DAO Notifications | Lido DAO members confirmed Launchnodes’ monitoring of Validators being offline. Launchnodes explained the ongoing DC1 connectivity issues. |
October 11th 15:45 UTC | Decision to Failover to Backup Data Centre | After investigating the outage and with no imminent resolution expected at DC1, Launchnodes’ team decided to fail over to a 2nd ‘cold standby’ data centre, DC2. |
Detaching Besu storage at DC1 | Launchnodes has bare metal servers that constitute an independent Kubernetes cluster in DC1, on which the Besu service runs. The Besu storage is local to that server and is used by the Besu service. As the Kubernetes clusters were inaccessible at DC1, but access to the server remained possible, Launchnodes elected to move the Besu storage to detach it from the Besu service. This was carried out to prevent validators from attesting even if connectivity to the nodes was restored, as the EL-CL pair would not function without synchronisation with the latest head. | |
Preparing for Failover to Backup Data Centre, DC2 | Launchnodes began preparing to enliven its ‘cold standby’ backup environment, in the expectation that the nodes at the primary site were rendered permanently offline. | |
Begin Provisioning Failover Nodes with Existing Web3 Signer | Launchnodes runs web3 signers remotely from its node infrastructure. This is an architectural choice, as this enables the Web3 signer to act as a ‘kill switch’ in the event of needing to stop Validator nodes from attesting when connectivity is erratic or nodes are inaccessible. Nodes at the failover Data Centre DC2 were configured to utilize the existing Web3 signer, already loaded with keys. Launchnodes started the services for Pre-synced Beacon and Geth nodes, and began to bring node clusters online in the failover Data Centre. | |
October 11th 15:55 UTC | Notification of Slashing | Launchnodes’ monitoring systems detected a slashing event on 2 Validators. This was immediately confirmed by the Lido team through alert messages from Lido DAO contributors. Slashing took place on 18 further validators. Lido DAO contributors request that all nodes be deactivated to avoid further issues. |
October 11th 16:02 UTC | Disabling of Nodes | Launchnodes completed deactivating all of its node infrastructure by manually stopping all Validator services at DC1, as advised by Lido DAO contributors. |
October 11th 16:04 UTC | Root Cause Analysis | Launchnodes began investigating root cause of the slashing incident. Node infrastructure at DC1 remained inaccessible. |
October 11th 16:16 UTC | Lido Communications | Launchnodes reviewed and confirmed the accuracy of Lido’s proposed tweets about the incident. |
October 11th 17:37 UTC | Launchnodes Pledge to Lido Stakers | Launchnodes tweets, “Launchnodes will reimburse all losses incurred to Lido.” |
October 11th 18:52 UTC | Staking Limit Reset | Launchnodes resets its “vetted keys”, to prevent further stake being allocated. |
October 11th 20:00 UTC | Plan to Restore Service | Launchnodes prepared a step-by-step plan to safely and securely restore service to the ‘cold standby’ Validators in DC2. This involved fully decommissioning the original nodes at DC1, destroying the servers and the web3 signer instance. |
October 11th 22:28 UTC | Failover Nodes and Web3 Signer Instance Ready | Launchnodes completed setup and syncing of the Execution and Consensus layer node infrastructure. A fresh Web3 signer instance was configured, with keys loaded from secure backup. Web3signer slashing db was also enabled. |
October 11th 22:52 UTC | Validators Online | Validator nodes were brought back online, with a measured, cautious approach proposed by Launchnodes and agreed by Lido DAO contributors. Keys were steadily loaded on the web3signer, with care to exclude keys for validators that had already been slashed. 10, 90, 400, 500, 500, 500, 500, 100 validators were brought back online in batches, with careful monitoring of performance at each stage. |
October 11th Ongoing | Monitoring | Launchnodes continued to monitor node performance throughout the night. |
October 12th 06:00 UTC | Impact Assessment | Lido DAO contributors and Launchnodes review the impact of the slashing. |
October 12th 06:40 UTC | Making stETH Stakers Whole | Launchnodes commits to ensuring that there is no negative financial impact to any Lido staker as a result of this incident. Offers to disburse the calculated rewards impact for the first day to the Lido protocol Execution Layer Rewards Vault before the rebasing scheduled for 12:00 UTC. |
October 12th 09:48 UTC | Compensation Submitted | Launchnodes submits a compensation transaction of 25.663 ETH to the Lido EL Rewards Vault, with an agreement that any further losses resulting from this incident would also be compensated. |
October 12th Ongoing | Infrastructure Review and Optimisation | Launchnodes reviews its infrastructure and processes, in order to implement guaranteed safeguards against future slashing incidents. |
Root Cause
The root cause was Launchnodes’ failure to transition across to its ‘cold standby’ Data Centre, DC2, in an optimal way.
This resulted in nodes being active across 2 different Data Centres simultaneously - a scenario that should not have occurred.
Several actions could have prevented the nodes from being slashed, including:
- Destroying the DC1 node cluster before failing over to DC2.
- Destroying the web3 signer before failing over to DC2.
Appendix B
Slashed validators
Slashed Validators | Slashed by | Reason | Slot | Epoch |
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation | ||||
Attestation Violation |