Deciphering the December 11th Server Crash: An In-Depth RCA of Bugcheck 0x9E
A critical Hyper-V host server experienced a crash (Bugcheck 0x9E) due to stalled I/O operations within the storage stack. This Root Cause Analysis (RCA) utilizes memory dump analysis to pinpoint the cause, identify contributing factors, and recommend critical remediation steps.
Root Cause Analysis Report
Analysis of a server crash (Bugcheck 0x9E) on a Hyper-V host, traced to storage latency and an outdated HBA driver. This report details the findings from the memory dump analysis and provides recommendations for remediation.
Executive Summary
On December 11th, 2024, server Machine1 experienced a system crash (Bugcheck 0x9E: USER_MODE_HEALTH_MONITOR) due to a prolonged I/O operation to a clustered disk. Analysis of the memory dump revealed that a thread within the Resource Hosting Subsystem (rhs.exe) became unresponsive while waiting for a disk capacity query (IOCTL_DISK_GET_LENGTH_INFO) to complete. This triggered the Failover Clustering watchdog mechanism, leading to the system crash. The investigation also uncovered a significantly outdated Host Bus Adapter (HBA) driver and widespread high I/O latency across multiple disks, pointing to issues within the storage stack.
Incident Description
A critical server, Machine1, part of a Hyper-V cluster, crashed unexpectedly on December 11th, 2024. The system generated a memory dump file, which was analyzed to determine the root cause.
Technical Analysis and Findings
Our analysis of the memory dump indicates that the crash was triggered by a failure in the Resource Hosting Subsystem (RHS), a critical component of Windows Failover Clustering. An unresponsive storage resource left RHS unable to complete its health checks, and the cluster watchdog then deliberately initiated a bugcheck so the node could restart and the cluster could recover.
1. Unresponsive I/O Request:
The root cause was traced to a stuck I/O Request Packet (IRP) at address `ffffcf89c8a39010`. This IRP was stalled for approximately 20 minutes while attempting to retrieve disk capacity information (`IOCTL_DISK_GET_LENGTH_INFO`).

- The IRP's major function code was `IRP_MJ_DEVICE_CONTROL` (0xe), indicating a device control operation.
- The `!irp` command output confirmed the request had reached the `\Driver\Disk` driver, meaning the upper layers of the I/O stack had finished their processing and the operation was awaiting handling and completion by the disk driver.
- The last completion routine to process the IRP was `nt!RawCompletionRoutine`.
- The last device to process the IRP was `DR9` (device object `ffffc386416ba050`, owned by `\Driver\Disk`).
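For context, the sketch below shows how the same `IOCTL_DISK_GET_LENGTH_INFO` query can be issued from user mode. It is a minimal illustration, not the code path RHS actually takes, and the drive path `\\.\PhysicalDrive9` is assumed for the sake of the example. The point it demonstrates is that the call is synchronous: if the disk driver never completes the IRP, the calling thread blocks indefinitely.

```c
/* Minimal user-mode sketch: issue IOCTL_DISK_GET_LENGTH_INFO synchronously.
 * Requires Administrator rights; a stalled IRP in the disk driver would
 * leave this call blocked indefinitely. */
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    /* \\.\PhysicalDrive9 is a placeholder; DR9 was the affected device. */
    HANDLE hDisk = CreateFileW(L"\\\\.\\PhysicalDrive9", GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE,
                               NULL, OPEN_EXISTING, 0, NULL);
    if (hDisk == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    GET_LENGTH_INFORMATION info;
    DWORD bytes = 0;

    /* Synchronous call: the thread blocks until the disk stack completes
     * the IRP, just like the rhs.exe health-check thread did. */
    if (DeviceIoControl(hDisk, IOCTL_DISK_GET_LENGTH_INFO,
                        NULL, 0, &info, sizeof(info), &bytes, NULL)) {
        printf("Disk length: %lld bytes\n", (long long)info.Length.QuadPart);
    } else {
        fprintf(stderr, "DeviceIoControl failed: %lu\n", GetLastError());
    }

    CloseHandle(hDisk);
    return 0;
}
```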
2. Stuck Thread in rhs.exe:
The delayed IRP caused a thread (ID `0x7c0`) within the `rhs.exe` process (`ffffcf89c6fa27c0`) to become blocked while waiting for the operation to complete. This thread was responsible for managing the health of a clustered disk resource. The following call stack snippet illustrates the issue:
```
 # Call Site
 0 nt!KiSwapContext+0x76
 1 nt!KiSwapThread+0x17d
 2 nt!KiCommitThreadWait+0x14f
 3 nt!KeWaitForSingleObject+0x377
 4 CLASSPNP!ClassReadDriveCapacity+0x125   <-- waiting to read drive capacity
 5 disk!DiskIoctlGetLengthInfo+0x46        <-- requesting disk length information
```
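The stack above is the classic kernel synchronous-IRP idiom: build a device-control IRP, send it down the stack, and block on an event until the lower driver completes it. A simplified sketch of that idiom is shown below; it is not the actual `CLASSPNP` source, and `QueryDiskLength` with its parameters is a hypothetical helper written for illustration.

```c
/* Kernel-mode sketch (WDK idiom): send IOCTL_DISK_GET_LENGTH_INFO down the
 * stack and block until the lower driver completes it -- the same kind of
 * wait seen at frame 3 (nt!KeWaitForSingleObject) above. */
#include <ntddk.h>
#include <ntdddisk.h>

NTSTATUS QueryDiskLength(PDEVICE_OBJECT LowerDevice,
                         PGET_LENGTH_INFORMATION LengthInfo)
{
    KEVENT event;
    IO_STATUS_BLOCK iosb;
    PIRP irp;
    NTSTATUS status;

    KeInitializeEvent(&event, NotificationEvent, FALSE);

    /* Build a device-control IRP whose completion signals our event. */
    irp = IoBuildDeviceIoControlRequest(IOCTL_DISK_GET_LENGTH_INFO,
                                        LowerDevice,
                                        NULL, 0,
                                        LengthInfo, sizeof(*LengthInfo),
                                        FALSE,          /* not internal */
                                        &event, &iosb);
    if (irp == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    status = IoCallDriver(LowerDevice, irp);
    if (status == STATUS_PENDING) {
        /* If the disk stack never completes the IRP, this wait never
         * returns -- the condition that left the rhs.exe thread stuck. */
        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
        status = iosb.Status;
    }
    return status;
}
```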
3. RHS Timeout and Bugcheck:
RHS monitors resource health. If a resource does not respond within a defined timeout (default: 5 minutes), RHS attempts to terminate it. Because of the stuck `IOCTL_DISK_GET_LENGTH_INFO`, the resource became unresponsive, and RHS's termination attempt also failed. After a 20-minute wait (four times the default timeout), the system raised bugcheck `0x9E` with the following parameters:

- Arg1: `ffffcf89c6fa27c0` (the `rhs.exe` process)
- Arg2: `00000000000004b0` (timeout of 1200 seconds, i.e. 20 minutes)
- Arg3: `0000000000000065` (indicates `WatchdogSourceRhsResourceDeadlockPhysicalDisk`)
- Arg4: `0000000000000000`
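To make the escalation concrete, the following simplified model assumes a 5-minute per-check timeout and a bugcheck after four consecutive misses, matching the 1200-second value in Arg2. It is a user-mode approximation of the watchdog's logic, not cluster service code; `MonitorResource` and `hHealthCheckDone` are hypothetical names.

```c
/* Simplified user-mode model of the RHS watchdog escalation, under the
 * report's assumptions: 5-minute health-check timeout, bugcheck after four
 * consecutive misses (4 x 300 s = 1200 s, matching Arg2 = 0x4b0). */
#include <windows.h>
#include <stdio.h>

#define HEALTH_CHECK_TIMEOUT_MS (5 * 60 * 1000)  /* default resource timeout */
#define DEADLOCK_MULTIPLIER     4                /* escalation threshold     */

/* hHealthCheckDone models the event a healthy resource would signal; with
 * the IOCTL stuck, it never gets signaled. */
void MonitorResource(HANDLE hHealthCheckDone)
{
    for (int missed = 1; missed <= DEADLOCK_MULTIPLIER; missed++) {
        if (WaitForSingleObject(hHealthCheckDone,
                                HEALTH_CHECK_TIMEOUT_MS) == WAIT_OBJECT_0)
            return;                              /* resource responded */
        printf("Health check %d timed out; trying to terminate resource...\n",
               missed);
        /* RHS would attempt termination here; in the incident, termination
         * also hung on the same stuck I/O. */
    }
    printf("Deadlock threshold reached: the real watchdog raises 0x9E here.\n");
}

int main(void)
{
    /* A never-signaled event models the unresponsive disk resource. */
    HANDLE h = CreateEventW(NULL, TRUE, FALSE, NULL);
    MonitorResource(h);                          /* returns after ~20 minutes */
    CloseHandle(h);
    return 0;
}
```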
4. Multiple Pending I/Os:
The IRP `ffffcf89c8a39010` was not the only delayed request. We found at least 94 pending I/O requests across 12 disks, indicating a broader storage issue. For example:

- `\Device\Harddisk9\DR9` had 84 errors and 22 pending IRPs.
- `\Device\Harddisk3\DR3` had 76 errors and 37 pending IRPs.
The analysis also revealed high I/O latency on multiple disks connected to the Hitachi storage array. For instance, a thread in the System process had been stuck for 42 seconds waiting on an I/O operation to `\Device\Harddisk4\DR4`:
```
Process                   Thread           CID   UserTime KernelTime ContextSwitches Wait Reason Time    State
System (ffffc3863cc386c0) ffffc38642b40800 4.ccc 0s       6s.922     263985          Executive   42s.750 Waiting

 # Call Site
 0 nt!KiSwapContext+0x76
 1 nt!KiSwapThread+0x17d
 2 nt!KiCommitThreadWait+0x14f
 3 nt!KeWaitForSingleObject+0x377
 4 NTFS!NtfsWaitOnIo+0x1e
 5 NTFS!NtfsNonCachedIo+0x425
 6 NTFS!NtfsCommonWrite+0x36e8
 7 NTFS!NtfsFsdWrite+0x1d8
 8 FLTMGR!FltpLegacyProcessingAfterPreCallbacksCompleted+0x1a6
 9 FLTMGR!FltpDispatch+0xb6
 a nt!IoSynchronousPageWriteEx+0x138
 b nt!MiSynchronousPageWrite+0x29
 c nt!MiIssueSynchronousFlush+0x72
 d nt!MiFlushSectionInternal+0xae9
 e nt!MmFlushSection+0x1a8
 f nt!CcFlushCachePriv+0x60e
10 nt!CcFlushCache+0x22
11 nt!CcWriteBehindInternal+0x15a
12 nt!CcWriteBehind+0x76
13 nt!CcWorkerThread+0x212
14 nt!ExpWorkerThread+0xe9
15 nt!PspSystemThreadStartup+0x41
16 nt!KxStartSystemThread+0x16
```
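Latency at the disk level propagates straight up this path into blocked threads. As a rough illustration, even a plain user-mode flush of cached data must wait for the storage stack to finish the write; the sketch below times such a flush (the file path `C:\temp\flushtest.bin` is an assumption for illustration; in practice you would place it on the volume under investigation).

```c
/* Sketch: time a synchronous flush from user mode. When the underlying disk
 * is slow, FlushFileBuffers blocks until the storage stack completes the
 * write, much like the cache-manager worker thread above. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative temp file; place it on the volume under investigation. */
    HANDLE h = CreateFileW(L"C:\\temp\\flushtest.bin", GENERIC_WRITE,
                           0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    char buf[4096] = {0};
    DWORD written = 0;
    WriteFile(h, buf, sizeof(buf), &written, NULL);  /* lands in the cache */

    ULONGLONG start = GetTickCount64();
    FlushFileBuffers(h);      /* forces dirty data to disk; blocks until the
                                 storage stack finishes */
    printf("Flush took %llu ms\n", GetTickCount64() - start);

    CloseHandle(h);
    return 0;
}
```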
5. Outdated HBA Driver:
The Host Bus Adapter (HBA) driver (`ql2300i.sys`) is significantly outdated (timestamp: 2015-08-17), raising concerns about compatibility issues or long-since-fixed bugs.
```
Base             End              Module name   Time stamp            Path
==========================================================================================================
fffff80448380000 fffff80448512000 ql2300i       2015-08-17 04:20:31   \SystemRoot\System32\drivers\ql2300i.sys
```
No hardware issue was reported by the HBA driver.
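The 2015-08-17 value is the PE link timestamp recorded in the binary's file header, the same field the debugger's module list reports. A small sketch for reading it from any driver file is below; the hard-coded path is illustrative.

```c
/* Sketch: read the PE TimeDateStamp (link time) of a driver binary, the same
 * field the debugger's module list reports as "Time stamp". */
#include <windows.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    /* Hard-coded path is illustrative; point it at the driver to check. */
    const char *path = "C:\\Windows\\System32\\drivers\\ql2300i.sys";
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 1; }

    IMAGE_DOS_HEADER dos;
    IMAGE_NT_HEADERS nt;
    if (fread(&dos, sizeof(dos), 1, f) != 1 ||
        fseek(f, dos.e_lfanew, SEEK_SET) != 0 ||
        fread(&nt, sizeof(nt), 1, f) != 1) {
        fprintf(stderr, "Read error.\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    if (dos.e_magic != IMAGE_DOS_SIGNATURE || nt.Signature != IMAGE_NT_SIGNATURE) {
        fprintf(stderr, "Not a valid PE image.\n");
        return 1;
    }

    /* TimeDateStamp counts seconds since 1970-01-01 UTC. */
    time_t stamp = (time_t)nt.FileHeader.TimeDateStamp;
    printf("Link timestamp: %s", ctime(&stamp));  /* expect 2015-08-17 here */
    return 0;
}
```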
Conclusion
The bugcheck 0x9E was a direct result of a stalled I/O operation that prevented a critical Failover Cluster resource from responding to health checks. The high number of pending I/O requests and the outdated HBA driver strongly suggest a problem within the storage stack. While no direct hardware errors were reported, a hardware-level malfunction cannot be entirely ruled out.
Recommendations
- Prioritize the driver update: The outdated HBA driver (`ql2300i.sys`) is the most likely culprit. We strongly recommend updating to the latest version from the hardware vendor (HPE).
- Engage the storage vendor: Contact the storage vendor (Hitachi) for further investigation; they should diagnose potential issues within the storage system (controller, cabling, etc.).
- Continue monitoring: After updating the driver, closely monitor the system for any further I/O delays or instability; a sample latency-monitoring sketch follows below.
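As a starting point for that monitoring, the sketch below polls the `\PhysicalDisk(_Total)\Avg. Disk sec/Transfer` performance counter through the PDH API and flags samples above a threshold. The 25 ms threshold and the 5-second interval are assumptions to tune for the environment, not vendor guidance.

```c
/* Sketch: poll average physical-disk latency via the PDH API and flag spikes.
 * The 25 ms threshold and 5 s interval are assumed starting points. */
#include <windows.h>
#include <pdh.h>
#include <stdio.h>
#pragma comment(lib, "pdh.lib")

int main(void)
{
    PDH_HQUERY query;
    PDH_HCOUNTER counter;

    if (PdhOpenQueryW(NULL, 0, &query) != ERROR_SUCCESS) return 1;
    /* Average seconds per transfer across all physical disks. */
    if (PdhAddEnglishCounterW(query,
            L"\\PhysicalDisk(_Total)\\Avg. Disk sec/Transfer",
            0, &counter) != ERROR_SUCCESS) return 1;

    PdhCollectQueryData(query);               /* prime the first sample */
    for (int i = 0; i < 12; i++) {            /* ~1 minute of samples */
        Sleep(5000);
        PdhCollectQueryData(query);

        PDH_FMT_COUNTERVALUE value;
        if (PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE,
                                        NULL, &value) == ERROR_SUCCESS) {
            double ms = value.doubleValue * 1000.0;
            printf("Avg disk latency: %.2f ms%s\n",
                   ms, ms > 25.0 ? "  <-- above threshold" : "");
        }
    }
    PdhCloseQuery(query);
    return 0;
}
```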
Debugging Details
| Item | Details |
|---|---|
| Computer Name | Machine1 |
| Kernel Version | Machine Version |
| Product | PII |
| Edition build lab | PII |
| Dump Time | Wed Dec 11 12:49:36.341 2024 (UTC+1:00) |
| System Uptime | 54 days 20:51:31.931 |
| System Manufacturer | HPE |
| System Product Name | PII |
| Processor | Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz |
| Bugcheck Info | Watchdog bugcheck 0x9E |
| Dump Type | Kernel Summary (kernel address space available; user address space may not be) |
| Build Revision | PII |
References
- Understanding how Failover Clustering Recovers from Unresponsive Resources (Microsoft Community Hub): the underlying mechanism of bugcheck 0x9E.
- Bug Check 0x9E USER_MODE_HEALTH_MONITOR (Microsoft Learn, Windows drivers): explanation of the bugcheck arguments.
Disclaimer: this RCA was fabricated from a hypothetical event; all of the information was generated in my own lab.