Unraveling the December 5th Hyper-V Cluster Blackout: A 10-Node Outage

The customer reported a complete outage of their virtual machine environment. This Root Cause Analysis (RCA) explores the events leading to the failure, identifies contributing factors, and proposes mitigation strategies, despite the challenges posed by incomplete diagnostic tracing.

 

Root Cause Analysis Report

Executive Summary

This RCA report analyzes the events of December 5th that led to an outage of a 10-node Hyper-V cluster, which recovered within the same day. The outage, beginning at approximately 1:58 AM UTC, involved the simultaneous removal of all nodes from the cluster, virtual machine failures, and storage subsystem issues.

Introduction

Environment

Cluster Nodes

Name      | Node Number
Machine1  | 16
Machine2  | 17
Machine3  | 18
Machine4  | 19
Machine5  | 20
Machine6  | 21
Machine7  | 22
Machine8  | 23
Machine9  | 24
Machine10 | 11

Cluster Networks

Name      | Prefix List | Address  | Metric
Network-A | PrefixList1 | Address1 | 3984
Network-B | PrefixList2 | Address2 | 7984
Network-C | PrefixList3 | Address3 | 7984
Network-D | PrefixList4 | Address4 | 7984
Network-E | PrefixList5 | Address5 | 3984
Network-F | PrefixList6 | Address6 | 2960
Network-G | PrefixList7 | Address7 | 2040
Network-H | PrefixList8 | Address8 | 2040

Cluster Shared Volumes

Volume   | Disk Number | Disk Name
Volume1  | 7           | DiskName1
Volume2  | 3           | DiskName2
Volume3  | 12          | DiskName3
Volume4  | 13          | DiskName4
Volume5  | 14          | DiskName5
Volume6  | 16          | DiskName6
Volume7  | 17          | DiskName7
Volume8  | 18          | DiskName8
Volume9  | 19          | DiskName9
Volume10 | 20          | DiskName10
Volume11 | 21          | DiskName11
Volume12 | 8           | DiskName12
Volume13 | 22          | DiskName13
Volume14 | 23          | DiskName14
Volume15 | 1           | DiskName15
Volume16 | 24          | DiskName16
Volume17 | 26          | DiskName17
Volume18 | 27          | DiskName18
Volume19 | 25          | DiskName19
Volume20 | 9           | DiskName20
Volume21 | 10          | DiskName21
Volume22 | 5           | DiskName22
Volume23 | 4           | DiskName23
Volume24 | 11          | DiskName24
Volume25 | 2           | DiskName25

Purpose

This document provides a detailed Root Cause Analysis (RCA) of the cluster outage that occurred on December 5th. The primary goal is to identify the root cause(s) of the outage, understand the sequence of events that led to the failure, and recommend corrective actions to prevent similar incidents in the future.

Incident Description

Multiple disconnections from the cluster and the clustered VMs were detected in the early afternoon of December 5th AEDT, consistent with the approximately 1:58 AM UTC start time (AEDT is UTC+11).

Methodology

The investigation involved analyzing the System and Application event logs from all servers within the cluster.
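
For repeatability, the following is a minimal PowerShell sketch of how this evidence could be gathered from every node; the destination path and the two-day collection window are assumptions, not values from the original investigation:

    # Generate the failover cluster debug log from every node.
    # -TimeSpan is in minutes; 2880 minutes covers the 48 hours around the incident.
    Get-ClusterLog -Destination 'C:\Temp\ClusterLogs' -UseLocalTime -TimeSpan 2880

    # Export each node's System event log for the same window.
    $start = (Get-Date).Date.AddDays(-2)   # placeholder: set to cover Dec 4th-5th
    foreach ($node in (Get-ClusterNode).Name) {
        Get-WinEvent -ComputerName $node -FilterHashtable @{ LogName = 'System'; StartTime = $start } |
            Export-Csv -NoTypeInformation -Path "C:\Temp\ClusterLogs\$node-System.csv"
    }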

Findings

Chronology and Summary

On December 5th, at approximately 1:58 AM UTC, the 10-node Hyper-V cluster experienced a catastrophic failure. The sequence of events was as follows (a query sketch for reproducing these event IDs follows this list):

  • Precursor Events (December 4th):
    • Sporadic node evictions were observed, which were attributed to planned activity.
    • Disk I/O retries and paging errors were logged across multiple nodes, indicating potential storage issues.
    • Authentication errors (Event ID 40970) were reported, suggesting problems with domain controllers or trust relationships.
    • Filter Manager failures (Event ID 3) occurred on some nodes, pointing to file system driver or service issues.
  • Major Outage (December 5th, 1:58 AM - 2:00 AM):
    • Simultaneous Node Removal: All cluster nodes were simultaneously removed from the active failover cluster membership (Event ID 1135).
    • VMs Unmonitored: Virtual machines on all nodes entered an unmonitored state (Event ID 1681).
    • CSV Failures: Cluster Shared Volumes (CSVs) became paused due to STATUS_USER_SESSION_DELETED or STATUS_UNEXPECTED_NETWORK_ERROR (Event ID 5157).
    • Quorum Loss: The cluster service shut down due to a loss of quorum (Event ID 1177).
    • Cluster Service Termination: The Cluster Service terminated unexpectedly (Event ID 7024/7031).
    • Disk and Resource Failures: Physical disk resources encountered errors during termination (Event ID 1795), and NTFS reported non-retryable errors (Event ID 137, 140). Disk I/O retries were logged on multiple nodes.
  • Post-Outage (December 5th onwards):
    • Nodes continued to be removed from the cluster intermittently.
    • Hyper-V VmSwitch errors (Event ID 106) appeared.
    • Disk surprise removals (Event ID 157) and paging errors (Event ID 51) persisted, particularly with Disk 28.
    • NTFS corruption (Event ID 55) was detected.
    • Disk signature issues (Event ID 158, 58) were reported.
    • Authentication errors (Event ID 40970) continued.
    • Filter Manager failures (Event ID 3) persisted.
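
The timeline above can be reproduced from any node's System log by querying the cited event IDs directly; a minimal sketch (note that some Hyper-V events, such as VmSwitch Event ID 106, may be written to dedicated operational channels rather than the System log):

    # Event IDs referenced in the chronology, pulled from the local System log.
    $ids = 1135, 1681, 5157, 1177, 7024, 7031, 1795, 137, 140, 153, 51, 157, 55, 40970, 3
    Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = $ids } -ErrorAction SilentlyContinue |
        Sort-Object TimeCreated |
        Select-Object TimeCreated, Id, ProviderName, Message |
        Format-Table -AutoSize -Wrap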

Root Causes

The root cause of the outage appears to be a combination of a major network failure and storage subsystem issues, with potential contributing factors from node-specific problems.

  • Primary Cause: Network Connectivity Failure:
    • The simultaneous removal of all nodes from the cluster at 1:58 AM UTC on December 5th is the strongest indicator of a widespread network outage (see the verification sketch after this list).
    • Possible causes include:
      • Switch/Router Failure: A failure in a core network switch or router.
      • Network Configuration Error: An erroneous network configuration change affecting all nodes (e.g., VLAN, routing, firewall).
      • Cable/Port Issues: While less likely to be simultaneous across all nodes, a widespread problem affecting multiple network cables or ports cannot be entirely ruled out.
  • Secondary Cause: Storage Subsystem Failure:
    • The numerous disk-related errors (I/O retries, paging errors, surprise removals, NTFS corruption) point to a significant problem within the storage subsystem.
    • Possible causes include:
      • SAN/Storage Array Issue: A problem with the SAN or storage array itself (e.g., controller failure, disk failure, configuration error).
      • Storage Network Issues: If storage is accessed over a separate network (e.g., iSCSI or Fibre Channel), a problem with this network could be a contributing factor.
      • Disk Failures: Multiple disk failures, especially concerning Disk 28, are evident.
  • Contributing Factors:
    • Node Machine1: This node exhibited recurring problems, suggesting a potential hardware or software issue that might have exacerbated the cluster instability.
    • Filter Manager Issues: The Filter Manager errors on several nodes could have contributed to volume access problems.
    • Authentication Errors: The authentication errors on December 4th might indicate underlying issues with domain controllers or trust relationships, potentially impacting cluster communication.
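
Once connectivity is restored, the network hypothesis can be checked against the cluster's own view of its networks and quorum configuration; a minimal sketch:

    # Cluster networks and per-node interfaces: heartbeat networks that all share
    # one physical path failing together would support the network-outage theory.
    Get-ClusterNetwork | Format-Table Name, State, Role, Metric -AutoSize
    Get-ClusterNetworkInterface | Format-Table Node, Network, State -AutoSize

    # Quorum configuration that was in effect when Event ID 1177 was logged.
    Get-ClusterQuorum | Format-List *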

Conclusion

The catastrophic failure of the 10-node Hyper-V cluster on December 5th was primarily triggered by a major network outage that simultaneously isolated all nodes, causing a loss of quorum and cluster service termination. This was compounded by a significant storage subsystem failure, evidenced by widespread disk errors and corruption. While the network failure was the likely instigator, the underlying storage issues severely impacted recovery and contributed to the severity of the outage. The sporadic problems observed on December 4th, including authentication and Filter Manager errors, might have further destabilized the environment, although their direct role in the main event is less clear.

Recommendations

Immediate Actions:

  1. Network Infrastructure Investigation:
    • Thoroughly investigate the network infrastructure, focusing on the time around 1:58 AM on December 5th.
    • Examine network device logs (switches, routers, firewalls) for errors, failures, or configuration changes.
    • Analyze network monitoring data for any anomalies or outages.
  2. Storage Subsystem Assessment:
    • Conduct a comprehensive health check of the storage subsystem (SAN/storage array, storage network, individual disks).
    • Examine storage controller logs and performance metrics.
    • Prioritize Disk 28: Investigate and replace or thoroughly test Disk 28, which exhibited numerous errors.
    • Address NTFS corruption on the affected volumes (Machine1, Machine4); see the storage health-check sketch after this list.
  3. Node Machine1 Diagnostics:
    • Perform in-depth diagnostics on Machine1 (hardware checks, driver updates, OS file integrity checks).
    • Consider temporarily removing this node from the cluster until the issue is resolved.
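
The storage assessment in item 2 could start with a per-node health sweep along these lines; a minimal sketch, where the E: drive letter is a placeholder (CSV volumes have no drive letter and are normally repaired through the cluster):

    # Enumerate physical disk health; Disk 28 was the repeated offender in the logs.
    Get-PhysicalDisk |
        Format-Table DeviceId, FriendlyName, HealthStatus, OperationalStatus -AutoSize

    # Reliability counters help spot failing media before outright failure.
    Get-PhysicalDisk | Get-StorageReliabilityCounter |
        Format-Table DeviceId, ReadErrorsTotal, WriteErrorsTotal, Temperature -AutoSize

    # Non-disruptive online scan for the NTFS corruption flagged by Event ID 55.
    Repair-Volume -DriveLetter E -Scan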

Long-Term Actions:

  1. Network Redundancy:
    • Implement redundant network infrastructure. Multiple cluster networks already exist, so the simultaneous failure of every node suggests they share the same underlying physical infrastructure; identify and eliminate that shared dependency.
    • Review network topology and consider implementing a more robust design to prevent single points of failure.
  2. Storage Redundancy and Monitoring:
    • Ensure storage redundancy (e.g., RAID, redundant controllers) is properly configured and functioning.
    • Implement comprehensive storage monitoring to detect and alert on potential issues proactively.
  3. Filter Driver Management:
    • Address the Filter Manager errors (Event ID 3) by updating or reinstalling file system filter drivers.
  4. Authentication and Domain Health:
    • Investigate and resolve the authentication errors (Event ID 40970) observed on December 4th.
    • Verify the health and stability of domain controllers and trust relationships.
  5. Regular Maintenance and Testing:
    • Establish a regular maintenance schedule for the cluster, including hardware checks, software updates, and failover testing (see the validation sketch after this list).
    • Conduct periodic disaster recovery drills to ensure the cluster can recover from various failure scenarios.
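
As noted in item 5, cluster validation can be scripted into the maintenance schedule; a minimal sketch (the 'Storage' category can take Cluster Shared Volumes offline, so include it only inside a maintenance window):

    # Non-disruptive validation categories for routine health checks.
    Test-Cluster -Include 'Inventory', 'Network', 'System Configuration'

    # Full validation including storage tests: run during a maintenance window only.
    # Test-Cluster -Include 'Inventory', 'Network', 'System Configuration', 'Storage'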

Appendices

Note: This table is a combined view, meaning events with the same ID and similar descriptions are grouped together, even if they occurred on different nodes. The "Affected Nodes" column indicates which nodes reported that specific event in the provided logs.

Time (UTC) | Event IDs | Provider | Description | Affected Nodes | Implications
Dec 4th, various times | 1011 | Microsoft-Windows-FailoverClustering | Cluster node was evicted from the failover cluster. | All nodes (Machine1-Machine10) | Indicates underlying cluster instability, potentially network or storage related.
Dec 4th, various times | 153 | Disk | The IO operation at the logical block address was retried. | All nodes (Machine1-Machine10) | Suggests storage subsystem issues, potentially disk or controller related.
Dec 4th, various times | 51 | Disk | An error was detected on the device during a paging operation. | Machine1, Machine2, Machine4, Machine5, Machine6 | Indicates problems with specific disks or storage connectivity.
Dec 4th, various times | 40970 | LsaSrv | The Security System has detected a downgrade attempt during authentication. | Machine2, Machine3, Machine6 | Potential issue with domain controllers or trust relationships.
Dec 4th, various times | 3 | Microsoft-Windows-FilterManager | Filter Manager failed to attach to a volume. | Machine2, Machine3, Machine4 | Problems with file system filter drivers or services.
Dec 5th, 1:58:36 AM | 1135 | Microsoft-Windows-FailoverClustering | Cluster node was removed from the active failover cluster membership. | All nodes (Machine1-Machine10) | Major event: indicates a widespread network or quorum failure, leading to the outage.
Dec 5th, 1:58:36 AM | 1681 | Microsoft-Windows-FailoverClustering | Virtual machines on the nodes entered an unmonitored state. | All nodes (Machine1-Machine10) | Consequence of node removal; VMs are no longer managed by the cluster.
Dec 5th, 1:58:37 AM | 5157 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume entered a paused state due to 'STATUS_USER_SESSION_DELETED(c0000203)' or 'STATUS_UNEXPECTED_NETWORK_ERROR(c00000c4)'. | All nodes (Machine1-Machine10) | Indicates storage connectivity issues, possibly related to the network outage.
Dec 5th, 1:58:37 AM | 1673 | Microsoft-Windows-FailoverClustering | Cluster node entered the isolated state. | Machine1, Machine2 | The node is unable to communicate with other nodes in the cluster.
Dec 5th, 1:58:37 AM | 1177 | Microsoft-Windows-FailoverClustering | The Cluster service is shutting down because quorum was lost. | Machine1, Machine2, Machine3, Machine4 | Critical event: the cluster lost the number of nodes required to function.
Dec 5th, 1:58:37 AM | 7024, 7031 | Service Control Manager | The Cluster Service service terminated unexpectedly. | Machine1, Machine2, Machine3 | The cluster service crashed.
Dec 5th, 1:58:37 AM | 1795 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource encountered an error while attempting to terminate. | Machine1, Machine2, Machine3 | Problems releasing storage resources during the outage.
Dec 5th, 1:58:40 AM - 2:00:39 AM | 153 | Disk | The IO operation at the logical block address was retried. | All nodes (Machine1-Machine10) | Indicates ongoing storage I/O issues.
Dec 5th, 1:58:40 AM | 137, 140 | Ntfs | The default transaction resource manager encountered a non-retryable error and could not start. | Machine1, Machine2, Machine3 | NTFS file system errors, possibly due to abrupt storage disconnection.
Dec 5th, 2:01:02 AM - 2:02:33 AM | 21502 | Microsoft-Windows-Hyper-V-High-Availability | Live migration of a VM failed / a VM failed to stop during resource initialization. | Machine1, Machine2, Machine3 | Problems with VM configurations or storage.
Dec 5th, 2:01:07 AM - 2:07:23 AM | 1069, 21502 | Microsoft-Windows-FailoverClustering, Microsoft-Windows-Hyper-V-High-Availability | Cluster resource of type 'Virtual Machine' failed; live migration failed because processor-specific features are not supported on the destination. | Machine1, Machine2, Machine3 | VMs failing to start or becoming unavailable; processor compatibility mode needs to be enabled for live migration (see the sketch after this table).
Dec 5th, 2:04:52 AM - 2:25:25 AM | 1135 | Microsoft-Windows-FailoverClustering | Cluster node was removed from the active failover cluster membership. | All nodes (Machine1-Machine10) | Nodes are being removed due to communication issues or potentially hardware/software problems.
Dec 5th, 2:04:52 AM - 2:25:25 AM | 1681 | Microsoft-Windows-FailoverClustering | Virtual machines on the nodes entered an unmonitored state. | All nodes (Machine1-Machine10) | VMs are no longer monitored due to node isolation/removal.
Dec 5th, 2:04:52 AM - 2:25:25 AM | 1673 | Microsoft-Windows-FailoverClustering | Cluster node entered the isolated state. | Machine1, Machine2 | Nodes are unable to communicate with each other.
Dec 5th, various times after the outage | 106 | Microsoft-Windows-Hyper-V-VmSwitch | VmSwitch errors. | Machine1, Machine2, Machine3 | Virtual networking issues.
Dec 5th and later | 157 | Disk | Disk has been surprise removed. | All nodes (Machine1-Machine10) | Indicates a serious problem with the storage subsystem, potentially disks or controllers.
Dec 5th and later | 51 | Disk | An error was detected on device \Device\Harddisk during a paging operation. | All nodes (Machine1-Machine10) | Ongoing disk errors, particularly on Harddisk28, indicating potential disk failures or severe storage connectivity issues.
Dec 5th, 10:27:59 | 55 | Ntfs | A corruption was discovered in the file system structure on a volume. | Machine1, Machine4 | File system corruption on specific volumes, requiring investigation and potential repair.
Dec 4th, 6th, 7th, various times | 158, 58 | Disk, partmgr | Disk has the same disk identifiers as one or more disks connected to the system / the disk signature of one disk is equal to the disk signature of another disk. | Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Potential disk configuration issue; duplicate disk identifiers causing conflicts.
Dec 5th and later | 16, 18, 23 | mpio | MPIO events, likely related to disk removal and path failures. | Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Multi-Path I/O errors, indicating problems with storage connectivity and redundancy.
Dec 5th and later | 3 | Microsoft-Windows-FilterManager | Filter Manager failed to attach to a volume. | Machine3, Machine4, Machine5 | Problems with file system filter drivers, potentially impacting volume access and functionality.
Dec 5th, 3:15:46 AM | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource 'GenericVM' of type 'Virtual Machine' in clustered role 'GenericVM Resources' failed. | Machine1 | A specific issue with the virtual machine 'GenericVM': the VM itself, its configuration, or the underlying host.
Dec 6th, 3:17:09 AM | 1085 | Microsoft-Windows-GroupPolicy | Windows failed to apply the Group Policy Local Users and Groups settings. | Machine4 | Group Policy processing issue, possibly related to network connectivity, domain controller communication, or local security policy.
Dec 4th, 12:43:48 AM & Dec 7th, 3:34 PM | 7031 | Service Control Manager | The GenericAgentService service terminated unexpectedly. | Machine1, Machine7 | A problem with GenericAgentService, which could affect performance and stability.
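
The live-migration failures recorded above (Event ID 21502, processor-specific features not supported on the destination) are typically resolved by enabling processor compatibility mode; a minimal sketch using the 'GenericVM' name from the table (the VM must be powered off to change this setting):

    # Enable processor compatibility so the VM can live-migrate between hosts
    # with different CPU feature sets. Requires the VM to be powered off.
    Stop-VM -Name 'GenericVM'
    Set-VMProcessor -VMName 'GenericVM' -CompatibilityForMigrationEnabled $true
    Start-VM -Name 'GenericVM'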

References for recorded error messages:

  • Event IDs 153, 140: Guidance for troubleshooting data corruption and disk errors - Windows Server | Microsoft Learn
  • Event ID 51: Information about Event ID 51 - Windows Server | Microsoft Learn
  • Event IDs 1135, 1069: Troubleshoot cluster issue with Event ID 1135 - Windows Server | Microsoft Learn
  • Event ID 1177: Event ID 1177 — Quorum and Connectivity Needed for Quorum | Microsoft Learn
  • Event ID 5157 (and 5120): Event ID 5120 Cluster Shared Volume troubleshooting guidance - Windows Server | Microsoft Learn

 

 

Disclaimer: This RCA was fabricated based on a hypothetical event; all of the information was generated in my own lab.