Unraveling the December 5th Hyper-V Cluster Blackout: A 10-Node Outage

The customer reported a complete outage of their virtual machine environment. This Root Cause Analysis (RCA) explores the events leading to the failure, identifies contributing factors, and proposes mitigation strategies, despite the challenges posed by incomplete diagnostic tracing.

 

Root Cause Analysis Report

Executive Summary

This RCA report analyzes the events of December 5th that led to an outage of a 10-node Hyper-V cluster, which recovered within the same day. The outage, beginning at approximately 1:58 AM UTC, involved the simultaneous removal of all nodes from the cluster, virtual machine failures, and storage subsystem issues.

Introduction

Environment

Cluster Nodes

Name      | Node Number
Machine1  | 16
Machine2  | 17
Machine3  | 18
Machine4  | 19
Machine5  | 20
Machine6  | 21
Machine7  | 22
Machine8  | 23
Machine9  | 24
Machine10 | 11

Cluster Networks

Name      | Prefix List | Address  | Metric
Network-A | PrefixList1 | Address1 | 3984
Network-B | PrefixList2 | Address2 | 7984
Network-C | PrefixList3 | Address3 | 7984
Network-D | PrefixList4 | Address4 | 7984
Network-E | PrefixList5 | Address5 | 3984
Network-F | PrefixList6 | Address6 | 2960
Network-G | PrefixList7 | Address7 | 2040
Network-H | PrefixList8 | Address8 | 2040

Cluster Shared Volumes

Volume   | Disk Number | Disk Name
Volume1  | 7           | DiskName1
Volume2  | 3           | DiskName2
Volume3  | 12          | DiskName3
Volume4  | 13          | DiskName4
Volume5  | 14          | DiskName5
Volume6  | 16          | DiskName6
Volume7  | 17          | DiskName7
Volume8  | 18          | DiskName8
Volume9  | 19          | DiskName9
Volume10 | 20          | DiskName10
Volume11 | 21          | DiskName11
Volume12 | 8           | DiskName12
Volume13 | 22          | DiskName13
Volume14 | 23          | DiskName14
Volume15 | 1           | DiskName15
Volume16 | 24          | DiskName16
Volume17 | 26          | DiskName17
Volume18 | 27          | DiskName18
Volume19 | 25          | DiskName19
Volume20 | 9           | DiskName20
Volume21 | 10          | DiskName21
Volume22 | 5           | DiskName22
Volume23 | 4           | DiskName23
Volume24 | 11          | DiskName24
Volume25 | 2           | DiskName25

Purpose

This document provides a detailed Root Cause Analysis (RCA) of the cluster outage that occurred on December 5th. The primary goal is to identify the root cause(s) of the outage, understand the sequence of events that led to the failure, and recommend corrective actions to prevent similar incidents in the future.

Incident Description

Multiple disconnections from the cluster and the clustered VMs were detected in the early afternoon of December 5th AEDT, consistent with the approximately 1:58 AM UTC start time (AEDT is UTC+11).

Methodology

The investigation involved analyzing the System and Application event logs from all servers within the cluster.
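
For repeatability, the following is a minimal PowerShell sketch of how this evidence could be gathered from every node; the destination path and the two-day collection window are assumptions, not values from the original investigation:

    # Generate the failover cluster debug log from every node.
    # -TimeSpan is in minutes; 2880 minutes covers the 48 hours around the incident.
    Get-ClusterLog -Destination 'C:\Temp\ClusterLogs' -UseLocalTime -TimeSpan 2880

    # Export each node's System event log for the same window.
    $start = (Get-Date).Date.AddDays(-2)   # placeholder: set to cover Dec 4th-5th
    foreach ($node in (Get-ClusterNode).Name) {
        Get-WinEvent -ComputerName $node -FilterHashtable @{ LogName = 'System'; StartTime = $start } |
            Export-Csv -NoTypeInformation -Path "C:\Temp\ClusterLogs\$node-System.csv"
    }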

Findings

Chronology and Summary

On December 5th, at approximately 1:58 AM UTC, the 10-node Hyper-V cluster experienced a catastrophic failure. The sequence of events was as follows (a query sketch for reproducing these event IDs follows this list):

  • Precursor Events (December 4th):
    • Sporadic node evictions were observed, which were attributed to planned activity.
    • Disk I/O retries and paging errors were logged across multiple nodes, indicating potential storage issues.
    • Authentication errors (Event ID 40970) were reported, suggesting problems with domain controllers or trust relationships.
    • Filter Manager failures (Event ID 3) occurred on some nodes, pointing to file system driver or service issues.
  • Major Outage (December 5th, 1:58 AM - 2:00 AM):
    • Simultaneous Node Removal: All cluster nodes were simultaneously removed from the active failover cluster membership (Event ID 1135).
    • VMs Unmonitored: Virtual machines on all nodes entered an unmonitored state (Event ID 1681).
    • CSV Failures: Cluster Shared Volumes (CSVs) became paused due to STATUS_USER_SESSION_DELETED or STATUS_UNEXPECTED_NETWORK_ERROR (Event ID 5157).
    • Quorum Loss: The cluster service shut down due to a loss of quorum (Event ID 1177).
    • Cluster Service Termination: The Cluster Service terminated unexpectedly (Event ID 7024/7031).
    • Disk and Resource Failures: Physical disk resources encountered errors during termination (Event ID 1795), and NTFS reported non-retryable errors (Event ID 137, 140). Disk I/O retries were logged on multiple nodes.
  • Post-Outage (December 5th onwards):
    • Nodes continued to be removed from the cluster intermittently.
    • Hyper-V VmSwitch errors (Event ID 106) appeared.
    • Disk surprise removals (Event ID 157) and paging errors (Event ID 51) persisted, particularly with Disk 28.
    • NTFS corruption (Event ID 55) was detected.
    • Disk signature issues (Event ID 158, 58) were reported.
    • Authentication errors (Event ID 40970) continued.
    • Filter Manager failures (Event ID 3) persisted.
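
The timeline above can be reproduced from any node's System log by querying the cited event IDs directly; a minimal sketch (note that some Hyper-V events, such as VmSwitch Event ID 106, may be written to dedicated operational channels rather than the System log):

    # Event IDs referenced in the chronology, pulled from the local System log.
    $ids = 1135, 1681, 5157, 1177, 7024, 7031, 1795, 137, 140, 153, 51, 157, 55, 40970, 3
    Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = $ids } -ErrorAction SilentlyContinue |
        Sort-Object TimeCreated |
        Select-Object TimeCreated, Id, ProviderName, Message |
        Format-Table -AutoSize -Wrap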

Root Causes

The root cause of the outage appears to be a combination of a major network failure and storage subsystem issues, with potential contributing factors from node-specific problems.

  • Primary Cause: Network Connectivity Failure:
    • The simultaneous removal of all nodes from the cluster at 1:58 AM UTC on December 5th is the strongest indicator of a widespread network outage (see the verification sketch after this list).
    • Possible causes include:
      • Switch/Router Failure: A failure in a core network switch or router.
      • Network Configuration Error: An erroneous network configuration change affecting all nodes (e.g., VLAN, routing, firewall).
      • Cable/Port Issues: While less likely to be simultaneous across all nodes, a widespread problem affecting multiple network cables or ports cannot be entirely ruled out.
  • Secondary Cause: Storage Subsystem Failure:
    • The numerous disk-related errors (I/O retries, paging errors, surprise removals, NTFS corruption) point to a significant problem within the storage subsystem.
    • Possible causes include:
      • SAN/Storage Array Issue: A problem with the SAN or storage array itself (e.g., controller failure, disk failure, configuration error).
      • Storage Network Issues: If storage is accessed over a separate network (e.g., iSCSI or Fibre Channel), a problem with this network could be a contributing factor.
      • Disk Failures: Multiple disk failures, especially concerning Disk 28, are evident.
  • Contributing Factors:
    • Node Machine1: This node exhibited recurring problems, suggesting a potential hardware or software issue that might have exacerbated the cluster instability.
    • Filter Manager Issues: The Filter Manager errors on several nodes could have contributed to volume access problems.
    • Authentication Errors: The authentication errors on December 4th might indicate underlying issues with domain controllers or trust relationships, potentially impacting cluster communication.
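
Once connectivity is restored, the network hypothesis can be checked against the cluster's own view of its networks and quorum configuration; a minimal sketch:

    # Cluster networks and per-node interfaces: heartbeat networks that all share
    # one physical path failing together would support the network-outage theory.
    Get-ClusterNetwork | Format-Table Name, State, Role, Metric -AutoSize
    Get-ClusterNetworkInterface | Format-Table Node, Network, State -AutoSize

    # Quorum configuration that was in effect when Event ID 1177 was logged.
    Get-ClusterQuorum | Format-List *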

Conclusion

The catastrophic failure of the 10-node Hyper-V cluster on December 5th was primarily triggered by a major network outage that simultaneously isolated all nodes, causing a loss of quorum and cluster service termination. This was compounded by a significant storage subsystem failure, evidenced by widespread disk errors and corruption. While the network failure was the likely instigator, the underlying storage issues severely impacted recovery and contributed to the severity of the outage. The sporadic problems observed on December 4th, including authentication and Filter Manager errors, might have further destabilized the environment, although their direct role in the main event is less clear.

Recommendations

Immediate Actions:

  1. Network Infrastructure Investigation:
    • Thoroughly investigate the network infrastructure, focusing on the time around 1:58 AM on December 5th.
    • Examine network device logs (switches, routers, firewalls) for errors, failures, or configuration changes.
    • Analyze network monitoring data for any anomalies or outages.
  2. Storage Subsystem Assessment:
    • Conduct a comprehensive health check of the storage subsystem (SAN/storage array, storage network, individual disks).
    • Examine storage controller logs and performance metrics.
    • Prioritize Disk 28: Investigate and replace or thoroughly test Disk 28, which exhibited numerous errors.
    • Address NTFS corruption on the affected volumes (Machine1, Machine4); see the storage health-check sketch after this list.
  3. Node Machine1 Diagnostics:
    • Perform in-depth diagnostics on Machine1 (hardware checks, driver updates, OS file integrity checks).
    • Consider temporarily removing this node from the cluster until the issue is resolved.
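
The storage assessment in item 2 could start with a per-node health sweep along these lines; a minimal sketch, where the E: drive letter is a placeholder (CSV volumes have no drive letter and are normally repaired through the cluster):

    # Enumerate physical disk health; Disk 28 was the repeated offender in the logs.
    Get-PhysicalDisk |
        Format-Table DeviceId, FriendlyName, HealthStatus, OperationalStatus -AutoSize

    # Reliability counters help spot failing media before outright failure.
    Get-PhysicalDisk | Get-StorageReliabilityCounter |
        Format-Table DeviceId, ReadErrorsTotal, WriteErrorsTotal, Temperature -AutoSize

    # Non-disruptive online scan for the NTFS corruption flagged by Event ID 55.
    Repair-Volume -DriveLetter E -Scan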

Long-Term Actions:

  1. Network Redundancy:
    • Implement redundant network infrastructure. Multiple cluster networks already exist, so the simultaneous failure of every node suggests they share the same underlying physical infrastructure; identify and eliminate that shared dependency.
    • Review network topology and consider implementing a more robust design to prevent single points of failure.
  2. Storage Redundancy and Monitoring:
    • Ensure storage redundancy (e.g., RAID, redundant controllers) is properly configured and functioning.
    • Implement comprehensive storage monitoring to detect and alert on potential issues proactively.
  3. Filter Driver Management:
    • Address the Filter Manager errors (Event ID 3) by updating or reinstalling file system filter drivers.
  4. Authentication and Domain Health:
    • Investigate and resolve the authentication errors (Event ID 40970) observed on December 4th.
    • Verify the health and stability of domain controllers and trust relationships.
  5. Regular Maintenance and Testing:
    • Establish a regular maintenance schedule for the cluster, including hardware checks, software updates, and failover testing (see the validation sketch after this list).
    • Conduct periodic disaster recovery drills to ensure the cluster can recover from various failure scenarios.
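
As noted in item 5, cluster validation can be scripted into the maintenance schedule; a minimal sketch (the 'Storage' category can take Cluster Shared Volumes offline, so include it only inside a maintenance window):

    # Non-disruptive validation categories for routine health checks.
    Test-Cluster -Include 'Inventory', 'Network', 'System Configuration'

    # Full validation including storage tests: run during a maintenance window only.
    # Test-Cluster -Include 'Inventory', 'Network', 'System Configuration', 'Storage'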

Appendices

Note: This table is a combined view, meaning events with the same ID and similar descriptions are grouped together, even if they occurred on different nodes. The "Affected Nodes" column indicates which nodes reported that specific event in the provided logs.

Time (UTC) | Event IDs | Provider | Description | Affected Nodes | Implications
Dec 4th, various times | 1011 | Microsoft-Windows-FailoverClustering | Cluster node was evicted from the failover cluster. | All nodes (Machine1-Machine10) | Indicates underlying cluster instability, potentially network or storage related.
Dec 4th, various times | 153 | Disk | The IO operation at the logical block address was retried. | All nodes (Machine1-Machine10) | Suggests storage subsystem issues, potentially disk or controller related.
Dec 4th, various times | 51 | Disk | An error was detected on the device during a paging operation. | Machine1, Machine2, Machine4, Machine5, Machine6 | Indicates problems with specific disks or storage connectivity.
Dec 4th, various times | 40970 | LsaSrv | The Security System has detected a downgrade attempt during authentication. | Machine2, Machine3, Machine6 | Potential issue with domain controllers or trust relationships.
Dec 4th, various times | 3 | Microsoft-Windows-FilterManager | Filter Manager failed to attach to a volume. | Machine2, Machine3, Machine4 | Problems with file system filter drivers or services.
Dec 5th, 1:58:36 AM | 1135 | Microsoft-Windows-FailoverClustering | Cluster node was removed from the active failover cluster membership. | All nodes (Machine1-Machine10) | Major event: indicates a widespread network or quorum failure, leading to the outage.
Dec 5th, 1:58:36 AM | 1681 | Microsoft-Windows-FailoverClustering | Virtual machines on the nodes entered an unmonitored state. | All nodes (Machine1-Machine10) | Consequence of node removal; VMs are no longer managed by the cluster.
Dec 5th, 1:58:37 AM | 5157 | Microsoft-Windows-FailoverClustering | Cluster Shared Volume entered a paused state due to 'STATUS_USER_SESSION_DELETED(c0000203)' or 'STATUS_UNEXPECTED_NETWORK_ERROR(c00000c4)'. | All nodes (Machine1-Machine10) | Indicates storage connectivity issues, possibly related to the network outage.
Dec 5th, 1:58:37 AM | 1673 | Microsoft-Windows-FailoverClustering | Cluster node entered the isolated state. | Machine1, Machine2 | The node is unable to communicate with other nodes in the cluster.
Dec 5th, 1:58:37 AM | 1177 | Microsoft-Windows-FailoverClustering | The Cluster service is shutting down because quorum was lost. | Machine1, Machine2, Machine3, Machine4 | Critical event: the cluster lost the number of nodes required to function.
Dec 5th, 1:58:37 AM | 7024, 7031 | Service Control Manager | The Cluster Service service terminated unexpectedly. | Machine1, Machine2, Machine3 | The cluster service crashed.
Dec 5th, 1:58:37 AM | 1795 | Microsoft-Windows-FailoverClustering | Cluster physical disk resource encountered an error while attempting to terminate. | Machine1, Machine2, Machine3 | Problems releasing storage resources during the outage.
Dec 5th, 1:58:40 AM - 2:00:39 AM | 153 | Disk | The IO operation at the logical block address was retried. | All nodes (Machine1-Machine10) | Indicates ongoing storage I/O issues.
Dec 5th, 1:58:40 AM | 137, 140 | Ntfs | The default transaction resource manager encountered a non-retryable error and could not start. | Machine1, Machine2, Machine3 | NTFS file system errors, possibly due to abrupt storage disconnection.
Dec 5th, 2:01:02 AM - 2:02:33 AM | 21502 | Microsoft-Windows-Hyper-V-High-Availability | Live migration of a VM failed / a VM failed to stop during resource initialization. | Machine1, Machine2, Machine3 | Problems with VM configurations or storage.
Dec 5th, 2:01:07 AM - 2:07:23 AM | 1069, 21502 | Microsoft-Windows-FailoverClustering, Microsoft-Windows-Hyper-V-High-Availability | Cluster resource of type 'Virtual Machine' failed; live migration failed because processor-specific features are not supported on the destination. | Machine1, Machine2, Machine3 | VMs failing to start or becoming unavailable; processor compatibility mode needs to be enabled for live migration (see the sketch after this table).
Dec 5th, 2:04:52 AM - 2:25:25 AM | 1135 | Microsoft-Windows-FailoverClustering | Cluster node was removed from the active failover cluster membership. | All nodes (Machine1-Machine10) | Nodes are being removed due to communication issues or potentially hardware/software problems.
Dec 5th, 2:04:52 AM - 2:25:25 AM | 1681 | Microsoft-Windows-FailoverClustering | Virtual machines on the nodes entered an unmonitored state. | All nodes (Machine1-Machine10) | VMs are no longer monitored due to node isolation/removal.
Dec 5th, 2:04:52 AM - 2:25:25 AM | 1673 | Microsoft-Windows-FailoverClustering | Cluster node entered the isolated state. | Machine1, Machine2 | Nodes are unable to communicate with each other.
Dec 5th, various times after the outage | 106 | Microsoft-Windows-Hyper-V-VmSwitch | VmSwitch errors. | Machine1, Machine2, Machine3 | Virtual networking issues.
Dec 5th and later | 157 | Disk | Disk has been surprise removed. | All nodes (Machine1-Machine10) | Indicates a serious problem with the storage subsystem, potentially disks or controllers.
Dec 5th and later | 51 | Disk | An error was detected on device \Device\Harddisk during a paging operation. | All nodes (Machine1-Machine10) | Ongoing disk errors, particularly on Harddisk28, indicating potential disk failures or severe storage connectivity issues.
Dec 5th, 10:27:59 | 55 | Ntfs | A corruption was discovered in the file system structure on a volume. | Machine1, Machine4 | File system corruption on specific volumes, requiring investigation and potential repair.
Dec 4th, 6th, 7th, various times | 158, 58 | Disk, partmgr | Disk has the same disk identifiers as one or more disks connected to the system / the disk signature of one disk is equal to the disk signature of another disk. | Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Potential disk configuration issue; duplicate disk identifiers causing conflicts.
Dec 5th and later | 16, 18, 23 | mpio | MPIO events, likely related to disk removal and path failures. | Machine2, Machine3, Machine4, Machine5, Machine6, Machine7, Machine8, Machine9, Machine10 | Multi-Path I/O errors, indicating problems with storage connectivity and redundancy.
Dec 5th and later | 3 | Microsoft-Windows-FilterManager | Filter Manager failed to attach to a volume. | Machine3, Machine4, Machine5 | Problems with file system filter drivers, potentially impacting volume access and functionality.
Dec 5th, 3:15:46 AM | 1069 | Microsoft-Windows-FailoverClustering | Cluster resource 'GenericVM' of type 'Virtual Machine' in clustered role 'GenericVM Resources' failed. | Machine1 | A specific issue with the virtual machine 'GenericVM': the VM itself, its configuration, or the underlying host.
Dec 6th, 3:17:09 AM | 1085 | Microsoft-Windows-GroupPolicy | Windows failed to apply the Group Policy Local Users and Groups settings. | Machine4 | Group Policy processing issue, possibly related to network connectivity, domain controller communication, or local security policy.
Dec 4th, 12:43:48 AM & Dec 7th, 3:34 PM | 7031 | Service Control Manager | The GenericAgentService service terminated unexpectedly. | Machine1, Machine7 | A problem with GenericAgentService, which could affect performance and stability.
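
The live-migration failures recorded above (Event ID 21502, processor-specific features not supported on the destination) are typically resolved by enabling processor compatibility mode; a minimal sketch using the 'GenericVM' name from the table (the VM must be powered off to change this setting):

    # Enable processor compatibility so the VM can live-migrate between hosts
    # with different CPU feature sets. Requires the VM to be powered off.
    Stop-VM -Name 'GenericVM'
    Set-VMProcessor -VMName 'GenericVM' -CompatibilityForMigrationEnabled $true
    Start-VM -Name 'GenericVM'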

References for recorded error messages:

  • Event IDs 153, 140: Guidance for troubleshooting data corruption and disk errors - Windows Server | Microsoft Learn
  • Event ID 51: Information about Event ID 51 - Windows Server | Microsoft Learn
  • Event IDs 1135, 1069: Troubleshoot cluster issue with Event ID 1135 - Windows Server | Microsoft Learn
  • Event ID 1177: Event ID 1177 — Quorum and Connectivity Needed for Quorum | Microsoft Learn
  • Event ID 5157 (and 5120): Event ID 5120 Cluster Shared Volume troubleshooting guidance - Windows Server | Microsoft Learn

 

 

Disclaimer: This RCA was fabricated based on a hypothetical event; all of the information was generated in my own lab.