
Unraveling the December 5th Hyper-V Cluster Blackout: A 10-Node Outage
The customer reported a complete outage of their virtual machine environment. This Root Cause Analysis (RCA) examines the events leading to the failure, identifies contributing factors, and proposes mitigation strategies, despite the challenges posed by incomplete diagnostic tracing.
Root Cause Analysis Report
Executive Summary
This RCA report analyzes the events that occurred on December 5th, leading to an outage of a 10-node Hyper-V cluster that recovered within the same day. The outage, beginning at approximately 1:58 AM UTC, involved the simultaneous removal of all nodes from the cluster, virtual machine failures, and storage subsystem errors.
Introduction
Environment
Cluster Nodes
Name | Node Number |
---|---|
Machine1 | 16 |
Machine2 | 17 |
Machine3 | 18 |
Machine4 | 19 |
Machine5 | 20 |
Machine6 | 21 |
Machine7 | 22 |
Machine8 | 23 |
Machine9 | 24 |
Machine10 | 11 |
Cluster Networks
Name | Prefix List | Address | Metric |
---|---|---|---|
Network-A | PrefixList1 | Address1 | 3984 |
Network-B | PrefixList2 | Address2 | 7984 |
Network-C | PrefixList3 | Address3 | 7984 |
Network-D | PrefixList4 | Address4 | 7984 |
Network-E | PrefixList5 | Address5 | 3984 |
Network-F | PrefixList6 | Address6 | 2960 |
Network-G | PrefixList7 | Address7 | 2040 |
Network-H | PrefixList8 | Address8 | 2040 |
Cluster Shared Volumes
Volume | Disk Number | Disk Name |
---|---|---|
Volume1 | 7 | DiskName1 |
Volume2 | 3 | DiskName2 |
Volume3 | 12 | DiskName3 |
Volume4 | 13 | DiskName4 |
Volume5 | 14 | DiskName5 |
Volume6 | 16 | DiskName6 |
Volume7 | 17 | DiskName7 |
Volume8 | 18 | DiskName8 |
Volume9 | 19 | DiskName9 |
Volume10 | 20 | DiskName10 |
Volume11 | 21 | DiskName11 |
Volume12 | 8 | DiskName12 |
Volume13 | 22 | DiskName13 |
Volume14 | 23 | DiskName14 |
Volume15 | 1 | DiskName15 |
Volume16 | 24 | DiskName16 |
Volume17 | 26 | DiskName17 |
Volume18 | 27 | DiskName18 |
Volume19 | 25 | DiskName19 |
Volume20 | 9 | DiskName20 |
Volume21 | 10 | DiskName21 |
Volume22 | 5 | DiskName22 |
Volume23 | 4 | DiskName23 |
Volume24 | 11 | DiskName24 |
Volume25 | 2 | DiskName25 |
Purpose
This document provides a detailed Root Cause Analysis (RCA) of the cluster outage that occurred on December 5th. The primary goal is to identify the root cause(s) of the outage, understand the sequence of events that led to the failure, and recommend corrective actions to prevent similar incidents in the future.
Incident Description
Multiple disconnections from the cluster and the clustered VMs were detected during the early afternoon of December 5th (AEDT).
Methodology
The investigation involved analyzing the System and Application event logs from all servers within the cluster.
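The cross-node log correlation used in this investigation can be sketched as below. The file layout is an assumption (one CSV per node, e.g. exported with `Get-WinEvent ... | Export-Csv` and containing `TimeCreated`, `Id`, and `ProviderName` columns), not the exact procedure that was followed.

```python
import csv
from datetime import datetime

# Event IDs of interest from this incident (FailoverClustering, Disk, Ntfs).
KEY_EVENTS = {1135, 1177, 1681, 5157, 153, 51, 157, 55}

def load_events(path, node):
    """Read one node's exported event log (assumed CSV layout with
    TimeCreated, Id, ProviderName columns) and keep key event IDs."""
    events = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            event_id = int(row["Id"])
            if event_id in KEY_EVENTS:
                events.append({
                    "node": node,
                    "time": datetime.fromisoformat(row["TimeCreated"]),
                    "id": event_id,
                    "provider": row["ProviderName"],
                })
    return events

def build_timeline(events_by_node):
    """Merge every node's events into one chronological timeline,
    which is how the cross-node sequence in Findings was assembled."""
    timeline = [e for events in events_by_node.values() for e in events]
    timeline.sort(key=lambda e: e["time"])
    return timeline
```

Sorting the merged stream by timestamp is what makes the "simultaneous across all ten nodes" pattern visible, rather than reading each node's log in isolation.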
Findings
Chronology and summary
On December 5th, at approximately 1:58 AM UTC, the 10-node Hyper-V cluster experienced a catastrophic failure. The sequence of events was as follows:
- Precursor Events (December 4th):
- Sporadic node evictions were observed, which were attributed to planned activity.
- Disk I/O retries and paging errors were logged across multiple nodes, indicating potential storage issues.
- Authentication errors (Event ID 40970) were reported, suggesting problems with domain controllers or trust relationships.
- Filter Manager failures (Event ID 3) occurred on some nodes, pointing to file system driver or service issues.
- Major Outage (December 5th, 1:58 AM - 2:00 AM):
- Simultaneous Node Removal: All cluster nodes were simultaneously removed from the active failover cluster membership (Event ID 1135).
- VMs Unmonitored: Virtual machines on all nodes entered an unmonitored state (Event ID 1681).
- CSV Failures: Cluster Shared Volumes (CSVs) became paused due to STATUS_USER_SESSION_DELETED or STATUS_UNEXPECTED_NETWORK_ERROR (Event ID 5157).
- Quorum Loss: The cluster service shut down due to a loss of quorum (Event ID 1177).
- Cluster Service Termination: The Cluster Service terminated unexpectedly (Event ID 7024/7031).
- Disk and Resource Failures: Physical disk resources encountered errors during termination (Event ID 1795), and NTFS reported non-retryable errors (Event ID 137, 140). Disk I/O retries were logged on multiple nodes.
- Post-Outage (December 5th onwards):
- Nodes continued to be intermittently removed from the cluster.
- Hyper-V VmSwitch errors (Event ID 106) appeared.
- Disk surprise removals (Event ID 157) and paging errors (Event ID 51) persisted, particularly with Disk 28.
- NTFS corruption (Event ID 55) was detected.
- Disk signature issues (Event ID 158, 58) were reported.
- Authentication errors (Event ID 40970) continued.
- Filter Manager failures (Event ID 3) persisted.
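The quorum loss in the chronology above follows directly from membership arithmetic. Under a simple node-majority model (a sketch; dynamic quorum in recent Windows Server versions adjusts votes at runtime, and any witness vote is not accounted for here), a partition survives only if it holds a strict majority of votes:

```python
def has_quorum(total_votes, reachable_votes):
    """Node-majority quorum sketch: a partition keeps the cluster
    running only with a strict majority of configured votes."""
    return reachable_votes >= total_votes // 2 + 1

# 10-node cluster: the majority threshold is 10 // 2 + 1 = 6 votes.
assert has_quorum(10, 6)
assert not has_quorum(10, 5)

# When every node is isolated at once (Event 1135 on all ten nodes),
# each node sees only its own vote, so every partition fails the
# majority test -- consistent with the Event 1177 shutdowns.
assert not has_quorum(10, 1)
```

This is why a network fault that severs all heartbeat paths at once takes the whole cluster down rather than failing over: no surviving partition can claim a majority.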
Root Causes
The root cause of the outage appears to be a combination of a major network failure and storage subsystem issues, with potential contributing factors from node-specific problems.
- Primary Cause: Network Connectivity Failure:
- The simultaneous removal of all nodes from the cluster at 1:58 AM on December 5th is the strongest indicator of a widespread network outage.
- Possible causes include:
- Switch/Router Failure: A failure in a core network switch or router.
- Network Configuration Error: An erroneous network configuration change affecting all nodes (e.g., VLAN, routing, firewall).
- Cable/Port Issues: While less likely to be simultaneous across all nodes, a widespread problem affecting multiple network cables or ports cannot be entirely ruled out.
- Secondary Cause: Storage Subsystem Failure:
- The numerous disk-related errors (I/O retries, paging errors, surprise removals, NTFS corruption) point to a significant problem within the storage subsystem.
- Possible causes include:
- SAN/Storage Array Issue: A problem with the SAN or storage array itself (e.g., controller failure, disk failure, configuration error).
- Storage Network Issues: If storage is accessed over a separate network (e.g., iSCSI or Fibre Channel), a problem with this network could be a contributing factor.
- Disk Failures: Multiple disk failures, especially concerning Disk 28, are evident.
- Contributing Factors:
- Node Machine1: This node exhibited recurring problems, suggesting a potential hardware or software issue that might have exacerbated the cluster instability.
- Filter Manager Issues: The Filter Manager errors on several nodes could have contributed to volume access problems.
- Authentication Errors: The authentication errors on December 4th might indicate underlying issues with domain controllers or trust relationships, potentially impacting cluster communication.
Conclusion
The catastrophic failure of the 10-node Hyper-V cluster on December 5th was primarily triggered by a major network outage that simultaneously isolated all nodes, causing a loss of quorum and cluster service termination. This was compounded by a significant storage subsystem failure, evidenced by widespread disk errors and corruption. While the network failure was the likely instigator, the underlying storage issues severely impacted recovery and contributed to the severity of the outage. The sporadic problems observed on December 4th, including authentication and Filter Manager errors, might have further destabilized the environment, although their direct role in the main event is less clear.
Recommendation
Immediate Actions:
- Network Infrastructure Investigation:
- Thoroughly investigate the network infrastructure, focusing on the time around 1:58 AM on December 5th.
- Examine network device logs (switches, routers, firewalls) for errors, failures, or configuration changes.
- Analyze network monitoring data for any anomalies or outages.
- Storage Subsystem Assessment:
- Conduct a comprehensive health check of the storage subsystem (SAN/storage array, storage network, individual disks).
- Examine storage controller logs and performance metrics.
- Prioritize Disk 28: Investigate and replace or thoroughly test Disk 28, which exhibited numerous errors.
- Address NTFS corruption on affected volumes (Machine1, Machine4).
- Node Machine1 Diagnostics:
- Perform in-depth diagnostics on Machine1 (hardware checks, driver updates, OS file integrity checks).
- Consider temporarily removing this node from the cluster until the issue is resolved.
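To support the Disk 28 and Machine1 actions above, the disk-related events can be tallied per physical disk to confirm which disks dominate the error volume. This is a hedged sketch: it assumes the exported messages contain the `\Device\HarddiskN` form seen in Event ID 51 entries, which may not hold for every provider's wording.

```python
import re
from collections import Counter

DISK_ERROR_IDS = {51, 157}  # paging errors and surprise removals
DISK_RE = re.compile(r"Harddisk(\d+)")

def disk_error_counts(events):
    """events: list of (event_id, message) pairs from the System log.
    Counts errors per disk number parsed from the message text
    (assumed to contain the \\Device\\HarddiskN form)."""
    counts = Counter()
    for event_id, message in events:
        if event_id in DISK_ERROR_IDS:
            match = DISK_RE.search(message)
            if match:
                counts[int(match.group(1))] += 1
    return counts
```

A skewed count concentrated on one disk (as reported here for Disk 28) points at that disk or its path, whereas an even spread across all disks would instead implicate the shared storage fabric.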
Long-Term Actions:
- Network Redundancy:
- Implement redundant network infrastructure; although multiple cluster networks are already defined, they appear to share the same underlying physical infrastructure, which should be verified.
- Review network topology and consider implementing a more robust design to prevent single points of failure.
- Storage Redundancy and Monitoring:
- Ensure storage redundancy (e.g., RAID, redundant controllers) is properly configured and functioning.
- Implement comprehensive storage monitoring to detect and alert on potential issues proactively.
- Filter Driver Management:
- Address the Filter Manager errors (Event ID 3) by updating or reinstalling file system filter drivers.
- Authentication and Domain Health:
- Investigate and resolve the authentication errors (Event ID 40970) observed on December 4th.
- Verify the health and stability of domain controllers and trust relationships.
- Regular Maintenance and Testing:
- Establish a regular maintenance schedule for the cluster, including hardware checks, software updates, and failover testing.
- Conduct periodic disaster recovery drills to ensure the cluster can recover from various failure scenarios.
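The proactive storage monitoring recommended above can be sketched as a sliding-window alert on disk I/O retry events (Event ID 153). The threshold and window below are illustrative assumptions, not tuned values for this environment:

```python
from collections import defaultdict
from datetime import timedelta

# Hypothetical alert policy: flag a node when more than RETRY_THRESHOLD
# Event 153 occurrences fall inside any WINDOW-length interval.
RETRY_THRESHOLD = 20
WINDOW = timedelta(minutes=15)

def retry_alerts(events, threshold=RETRY_THRESHOLD, window=WINDOW):
    """events: list of (node, timestamp) pairs for Event 153 occurrences.
    Returns the set of nodes whose retry count inside any sliding
    window exceeds the threshold."""
    alerts = set()
    by_node = defaultdict(list)
    for node, ts in events:
        by_node[node].append(ts)
    for node, times in by_node.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Shrink the window from the left until it spans <= `window`.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 > threshold:
                alerts.add(node)
                break
    return alerts
```

Had a rule like this been active on December 4th, the cross-node burst of Event 153 retries could have raised an alert before the December 5th outage.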
Appendices
Note: This table is a combined view, meaning events with the same ID and similar descriptions are grouped together, even if they occurred on different nodes. The "Affected Nodes" column indicates which nodes reported that specific event in the provided logs.
Time (UTC) | Event IDs | Provider | Description | Affected Nodes | Implications |
---|---|---|---|---|---|
Dec 4th, various times | 1011 | Microsoft-Windows-FailoverClustering | Cluster nodes evicted from the failover cluster. | Machine1, Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Indicates underlying cluster instability, potentially network or storage related. |
Dec 4th, various times | 153 | Disk | IO operation at logical block address was retried. | Machine1, Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Suggests storage subsystem issues, potentially disk or controller related. |
Dec 4th, various times | 51 | Disk | An error was detected on the device during a paging operation. | Machine1, Machine2, Machine4, Machine5, Machine6 | Indicates problems with specific disks or storage connectivity. |
Dec 4th, various times | 40970 | LsaSrv | The Security System has detected a downgrade attempt during authentication. | Machine2, Machine3, Machine6 | Potential issue with domain controllers or trust relationships. |
Dec 4th, various times | 3 | Microsoft-Windows-FilterManager | Filter Manager failed to attach to volume | Machine2, Machine3, Machine4 | Problems with file system drivers or services. |
1:58:36 AM | 1135 | Microsoft-Windows-FailoverClustering | Cluster node removed from the active failover cluster membership. | Machine1, Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Major event: Indicates a widespread network or quorum failure, leading to the outage. |
1:58:36 AM | 1681 | Microsoft-Windows-FailoverClustering | Virtual machines on nodes entered an unmonitored state. | Machine1, Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Consequence of node removal; VMs are no longer managed by the cluster. |
1:58:37 AM | 5157 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume entered a paused state due to 'STATUS_USER_SESSION_DELETED(c0000203)' or 'STATUS_UNEXPECTED_NETWORK_ERROR(c00000c4)'. | Machine1, Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Indicates storage connectivity issues, possibly related to the network outage. |
1:58:37 AM | 1673 | Microsoft-Windows-FailoverClustering | Cluster node has entered the isolated state. | Machine1, Machine2 | The node is unable to communicate with other nodes in the cluster. |
1:58:37 AM | 1177 | Microsoft-Windows-FailoverClustering | The Cluster service is shutting down because quorum was lost. | Machine1, Machine2, Machine3, Machine4 | Critical event: The cluster has lost the necessary number of nodes to function. |
1:58:37 AM | 7024,7031 | Service Control Manager | The Cluster Service service terminated unexpectedly. | Machine1, Machine2, Machine3 | Cluster service has crashed. |
1:58:37 AM | 1795 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource encountered an error while attempting to terminate. | Machine1, Machine2, Machine3 | Problems with releasing storage resources during the outage. |
1:58:40 AM - 2:00:39 AM | 153 | Disk | The IO operation at the logical block address was retried. | Machine1, Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Indicates ongoing storage I/O issues. |
1:58:40 AM | 137, 140 | Ntfs | The default transaction resource manager encountered a non-retryable error and could not start. | Machine1, Machine2, Machine3 | NTFS file system errors, possibly due to abrupt storage disconnection. |
2:01:02 AM - 2:02:33 AM | 21502 | Microsoft-Windows-Hyper-V-High-Availability | Live migration of VM failed / VM failed to stop during resource initialization. | Machine1, Machine2, Machine3 | Problems with VM configurations or storage. |
2:01:07 AM - 2:07:23 AM | 1069, 21502 | Microsoft-Windows-FailoverClustering, Microsoft-Windows-Hyper-V-High-Availability | Cluster resource of type 'Virtual Machine' failed. Live migration of VM failed due to processor-specific features not supported on the destination. | Machine1, Machine2, Machine3 | VMs are failing to start or are becoming unavailable. VM configuration issue; processor compatibility mode needs to be enabled for live migration. |
2:04:52 AM - 2:25:25 AM | 1135 | Microsoft-Windows-FailoverClustering | Cluster node removed from the active failover cluster membership. | Machine1, Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Nodes are being removed due to communication issues or potentially hardware/software problems. |
2:04:52 AM - 2:25:25 AM | 1681 | Microsoft-Windows-FailoverClustering | Virtual machines on nodes entered an unmonitored state. | Machine1, Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | VMs are no longer being monitored due to node isolation/removal. |
2:04:52 AM - 2:25:25 AM | 1673 | Microsoft-Windows-FailoverClustering | Cluster node has entered the isolated state. | Machine1, Machine2 | Nodes are unable to communicate with each other. |
Dec 5th, various times after outage | 106 | Microsoft-Windows-Hyper-V-VmSwitch | VmSwitch errors | Machine1, Machine2, Machine3 | Virtual networking issues. |
Dec 5th and later | 157 | Disk | Disk has been surprise removed. | Machine1, Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Indicates a serious problem with the storage subsystem, potentially disks or controllers. |
Dec 5th and later | 51 | Disk | An error was detected on device \Device\Harddisk during a paging operation. | Machine1, Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Ongoing disk errors, particularly with Harddisk28, indicating potential disk failures or severe storage connectivity issues. |
Dec 5th, 10:27:59 | 55 | Ntfs | A corruption was discovered in the file system structure on volume | Machine1, Machine4 | File system corruption on specific volumes, requiring further investigation and potential repair. |
Dec 4th, 6th, 7th, various times | 158, 58 | Disk, partmgr | Disk has the same disk identifiers as one or more disks connected to the system / The disk signature of disk is equal to the disk signature of disk. | Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Potential disk configuration issue, duplicate disk identifiers causing conflicts. |
Dec 5th and later | 16, 18, 23 | mpio | MPIO related events, likely related to disk removal and path failures. | Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Multi-Path I/O errors, indicating problems with storage connectivity and redundancy. |
Dec 5th and later | 3 | Microsoft-Windows-FilterManager | Filter Manager failed to attach to volume | Machine3, Machine4, Machine5 | Problems with file system filter drivers, potentially impacting volume access and functionality. |
Dec 5th, 3:15:46 AM | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource 'GenericVM' of type 'Virtual Machine' in clustered role 'GenericVM Resources' failed. | Machine1 | Indicates a specific issue with the virtual machine 'GenericVM'. This could be a problem with the virtual machine itself, its configuration, or the underlying host. |
Dec 6th, 3:17:09 AM | 1085 | Microsoft-Windows-GroupPolicy | Windows failed to apply the Group Policy Local Users and Groups settings. | Machine4 | Suggests an issue with Group Policy processing. It might be related to network connectivity, domain controller communication, or local security policy issues. |
Dec 4th, 12:43:48 AM & Dec 7th, 3:34 PM | 7031 | Service Control Manager | The GenericAgentService service terminated unexpectedly. | Machine1, Machine7 | Indicates a problem with the GenericAgentService, which could affect performance and stability. |
Disclaimer: this RCA was fabricated based on a potential event; all of this information was generated in my own lab.