Deciphering the December 11th Server Crash: An In-Depth RCA of Bugcheck 0x9E
A critical Hyper-V host server experienced a crash (Bugcheck 0x9E) due to stalled I/O operations within the storage stack. This Root Cause Analysis (RCA) utilizes memory dump analysis to pinpoint the cause, identify contributing factors, and recommend critical remediation steps.
Root Cause Analysis Report
Analysis of a server crash (Bugcheck 0x9E) on a Hyper-V host, traced to storage latency and an outdated HBA driver. This report details the findings from the memory dump analysis and provides recommendations for remediation.
Executive Summary
On December 11th, 2024, server Machine1 experienced a system crash (Bugcheck 0x9E: USER_MODE_HEALTH_MONITOR) due to a prolonged I/O operation to a clustered disk. Analysis of the memory dump revealed that a thread within the Resource Hosting Subsystem (rhs.exe) became unresponsive while waiting for a disk capacity query (IOCTL_DISK_GET_LENGTH_INFO) to complete. This triggered the Failover Clustering watchdog mechanism, leading to the system crash. The investigation also uncovered a significantly outdated Host Bus Adapter (HBA) driver and widespread high I/O latency across multiple disks, pointing to issues within the storage stack.
Incident Description
A critical server, Machine1, part of a Hyper-V cluster, crashed unexpectedly on December 11th, 2024. The system generated a memory dump file, which was analyzed to determine the root cause.
Technical Analysis and Findings
Our analysis of the memory dump indicates that the crash was triggered by a failure in the Resource Hosting Subsystem (RHS), a critical component of Windows Failover Clustering. An unresponsive storage resource left RHS unable to complete its health checks, and the cluster watchdog then deliberately initiated a bugcheck so the node could restart and the cluster could recover.
1. Unresponsive I/O Request:
The root cause was traced to a stuck I/O Request Packet (IRP) at address `ffffcf89c8a39010`. This IRP was stalled for approximately 20 minutes while attempting to retrieve disk capacity information (`IOCTL_DISK_GET_LENGTH_INFO`).

- The IRP's major function code was `IRP_MJ_DEVICE_CONTROL` (0xe), indicating a device control operation.
- The `!irp` command output confirmed the request had reached the `\Driver\Disk` driver, meaning the upper layers of the I/O stack had finished their processing and the operation was awaiting handling and completion by the disk driver.
- The last completion routine to process the IRP was `nt!RawCompletionRoutine`.
- The last device to process the IRP was `DR9` (device object `ffffc386416ba050`, owned by `\Driver\Disk`).
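For context, the sketch below shows how the same `IOCTL_DISK_GET_LENGTH_INFO` query can be issued from user mode. It is a minimal illustration, not the code path RHS actually takes, and the drive path `\\.\PhysicalDrive9` is assumed for the sake of the example. The point it demonstrates is that the call is synchronous: if the disk driver never completes the IRP, the calling thread blocks indefinitely.

```c
/* Minimal user-mode sketch: issue IOCTL_DISK_GET_LENGTH_INFO synchronously.
 * Requires Administrator rights; a stalled IRP in the disk driver would
 * leave this call blocked indefinitely. */
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    /* \\.\PhysicalDrive9 is a placeholder; DR9 was the affected device. */
    HANDLE hDisk = CreateFileW(L"\\\\.\\PhysicalDrive9", GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE,
                               NULL, OPEN_EXISTING, 0, NULL);
    if (hDisk == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    GET_LENGTH_INFORMATION info;
    DWORD bytes = 0;

    /* Synchronous call: the thread blocks until the disk stack completes
     * the IRP, just like the rhs.exe health-check thread did. */
    if (DeviceIoControl(hDisk, IOCTL_DISK_GET_LENGTH_INFO,
                        NULL, 0, &info, sizeof(info), &bytes, NULL)) {
        printf("Disk length: %lld bytes\n", (long long)info.Length.QuadPart);
    } else {
        fprintf(stderr, "DeviceIoControl failed: %lu\n", GetLastError());
    }

    CloseHandle(hDisk);
    return 0;
}
```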
2. Stuck Thread in rhs.exe:
The delayed IRP caused a thread (ID `0x7c0`) within the `rhs.exe` process (`ffffcf89c6fa27c0`) to become blocked while waiting for the operation to complete. This thread was responsible for managing the health of a clustered disk resource. The following call stack snippet illustrates the issue:
```
 # Call Site
 0 nt!KiSwapContext+0x76
 1 nt!KiSwapThread+0x17d
 2 nt!KiCommitThreadWait+0x14f
 3 nt!KeWaitForSingleObject+0x377
 4 CLASSPNP!ClassReadDriveCapacity+0x125   <-- waiting to read drive capacity
 5 disk!DiskIoctlGetLengthInfo+0x46        <-- requesting disk length information
```
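The stack above is the classic kernel synchronous-IRP idiom: build a device-control IRP, send it down the stack, and block on an event until the lower driver completes it. A simplified sketch of that idiom is shown below; it is not the actual `CLASSPNP` source, and `QueryDiskLength` with its parameters is a hypothetical helper written for illustration.

```c
/* Kernel-mode sketch (WDK idiom): send IOCTL_DISK_GET_LENGTH_INFO down the
 * stack and block until the lower driver completes it -- the same kind of
 * wait seen at frame 3 (nt!KeWaitForSingleObject) above. */
#include <ntddk.h>
#include <ntdddisk.h>

NTSTATUS QueryDiskLength(PDEVICE_OBJECT LowerDevice,
                         PGET_LENGTH_INFORMATION LengthInfo)
{
    KEVENT event;
    IO_STATUS_BLOCK iosb;
    PIRP irp;
    NTSTATUS status;

    KeInitializeEvent(&event, NotificationEvent, FALSE);

    /* Build a device-control IRP whose completion signals our event. */
    irp = IoBuildDeviceIoControlRequest(IOCTL_DISK_GET_LENGTH_INFO,
                                        LowerDevice,
                                        NULL, 0,
                                        LengthInfo, sizeof(*LengthInfo),
                                        FALSE,          /* not internal */
                                        &event, &iosb);
    if (irp == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    status = IoCallDriver(LowerDevice, irp);
    if (status == STATUS_PENDING) {
        /* If the disk stack never completes the IRP, this wait never
         * returns -- the condition that left the rhs.exe thread stuck. */
        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
        status = iosb.Status;
    }
    return status;
}
```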
3. RHS Timeout and Bugcheck:
RHS monitors resource health. If a resource does not respond within a defined timeout (default: 5 minutes), RHS attempts to terminate it. Because of the stuck `IOCTL_DISK_GET_LENGTH_INFO`, the resource became unresponsive, and RHS's termination attempt also failed. After a 20-minute wait (four times the default timeout), the system raised bugcheck `0x9E` with the following parameters:

- Arg1: `ffffcf89c6fa27c0` (the `rhs.exe` process)
- Arg2: `00000000000004b0` (timeout of 1200 seconds, i.e. 20 minutes)
- Arg3: `0000000000000065` (indicates `WatchdogSourceRhsResourceDeadlockPhysicalDisk`)
- Arg4: `0000000000000000`
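To make the escalation concrete, the following simplified model assumes a 5-minute per-check timeout and a bugcheck after four consecutive misses, matching the 1200-second value in Arg2. It is a user-mode approximation of the watchdog's logic, not cluster service code; `MonitorResource` and `hHealthCheckDone` are hypothetical names.

```c
/* Simplified user-mode model of the RHS watchdog escalation, under the
 * report's assumptions: 5-minute health-check timeout, bugcheck after four
 * consecutive misses (4 x 300 s = 1200 s, matching Arg2 = 0x4b0). */
#include <windows.h>
#include <stdio.h>

#define HEALTH_CHECK_TIMEOUT_MS (5 * 60 * 1000)  /* default resource timeout */
#define DEADLOCK_MULTIPLIER     4                /* escalation threshold     */

/* hHealthCheckDone models the event a healthy resource would signal; with
 * the IOCTL stuck, it never gets signaled. */
void MonitorResource(HANDLE hHealthCheckDone)
{
    for (int missed = 1; missed <= DEADLOCK_MULTIPLIER; missed++) {
        if (WaitForSingleObject(hHealthCheckDone,
                                HEALTH_CHECK_TIMEOUT_MS) == WAIT_OBJECT_0)
            return;                              /* resource responded */
        printf("Health check %d timed out; trying to terminate resource...\n",
               missed);
        /* RHS would attempt termination here; in the incident, termination
         * also hung on the same stuck I/O. */
    }
    printf("Deadlock threshold reached: the real watchdog raises 0x9E here.\n");
}

int main(void)
{
    /* A never-signaled event models the unresponsive disk resource. */
    HANDLE h = CreateEventW(NULL, TRUE, FALSE, NULL);
    MonitorResource(h);                          /* returns after ~20 minutes */
    CloseHandle(h);
    return 0;
}
```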
4. Multiple Pending I/Os:
The IRP `ffffcf89c8a39010` was not the only delayed request. We found at least 94 pending I/O requests across 12 disks, indicating a broader storage issue. For example:

- `\Device\Harddisk9\DR9` had 84 errors and 22 pending IRPs.
- `\Device\Harddisk3\DR3` had 76 errors and 37 pending IRPs.
The analysis also revealed high I/O latency on multiple disks connected to the Hitachi storage array. For instance, a thread in the System process had been stuck for 42 seconds waiting on an I/O operation to `\Device\Harddisk4\DR4`:
```
Process                   Thread           CID   UserTime KernelTime ContextSwitches Wait Reason Time    State
System (ffffc3863cc386c0) ffffc38642b40800 4.ccc 0s       6s.922     263985          Executive   42s.750 Waiting

 # Call Site
 0 nt!KiSwapContext+0x76
 1 nt!KiSwapThread+0x17d
 2 nt!KiCommitThreadWait+0x14f
 3 nt!KeWaitForSingleObject+0x377
 4 NTFS!NtfsWaitOnIo+0x1e
 5 NTFS!NtfsNonCachedIo+0x425
 6 NTFS!NtfsCommonWrite+0x36e8
 7 NTFS!NtfsFsdWrite+0x1d8
 8 FLTMGR!FltpLegacyProcessingAfterPreCallbacksCompleted+0x1a6
 9 FLTMGR!FltpDispatch+0xb6
 a nt!IoSynchronousPageWriteEx+0x138
 b nt!MiSynchronousPageWrite+0x29
 c nt!MiIssueSynchronousFlush+0x72
 d nt!MiFlushSectionInternal+0xae9
 e nt!MmFlushSection+0x1a8
 f nt!CcFlushCachePriv+0x60e
10 nt!CcFlushCache+0x22
11 nt!CcWriteBehindInternal+0x15a
12 nt!CcWriteBehind+0x76
13 nt!CcWorkerThread+0x212
14 nt!ExpWorkerThread+0xe9
15 nt!PspSystemThreadStartup+0x41
16 nt!KxStartSystemThread+0x16
```
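Latency at the disk level propagates straight up this path into blocked threads. As a rough illustration, even a plain user-mode flush of cached data must wait for the storage stack to finish the write; the sketch below times such a flush (the file path `C:\temp\flushtest.bin` is an assumption for illustration; in practice you would place it on the volume under investigation).

```c
/* Sketch: time a synchronous flush from user mode. When the underlying disk
 * is slow, FlushFileBuffers blocks until the storage stack completes the
 * write, much like the cache-manager worker thread above. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative temp file; place it on the volume under investigation. */
    HANDLE h = CreateFileW(L"C:\\temp\\flushtest.bin", GENERIC_WRITE,
                           0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    char buf[4096] = {0};
    DWORD written = 0;
    WriteFile(h, buf, sizeof(buf), &written, NULL);  /* lands in the cache */

    ULONGLONG start = GetTickCount64();
    FlushFileBuffers(h);      /* forces dirty data to disk; blocks until the
                                 storage stack finishes */
    printf("Flush took %llu ms\n", GetTickCount64() - start);

    CloseHandle(h);
    return 0;
}
```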
5. Outdated HBA Driver:
The Host Bus Adapter (HBA) driver (`ql2300i.sys`) is significantly outdated (timestamp: 2015-08-17), raising concerns about compatibility issues or long-since-fixed bugs.
```
Base             End              Module name   Time stamp            Path
==========================================================================================================
fffff80448380000 fffff80448512000 ql2300i       2015-08-17 04:20:31   \SystemRoot\System32\drivers\ql2300i.sys
```
No hardware issue was reported by the HBA driver.
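The 2015-08-17 value is the PE link timestamp recorded in the binary's file header, the same field the debugger's module list reports. A small sketch for reading it from any driver file is below; the hard-coded path is illustrative.

```c
/* Sketch: read the PE TimeDateStamp (link time) of a driver binary, the same
 * field the debugger's module list reports as "Time stamp". */
#include <windows.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    /* Hard-coded path is illustrative; point it at the driver to check. */
    const char *path = "C:\\Windows\\System32\\drivers\\ql2300i.sys";
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 1; }

    IMAGE_DOS_HEADER dos;
    IMAGE_NT_HEADERS nt;
    if (fread(&dos, sizeof(dos), 1, f) != 1 ||
        fseek(f, dos.e_lfanew, SEEK_SET) != 0 ||
        fread(&nt, sizeof(nt), 1, f) != 1) {
        fprintf(stderr, "Read error.\n");
        fclose(f);
        return 1;
    }
    fclose(f);

    if (dos.e_magic != IMAGE_DOS_SIGNATURE || nt.Signature != IMAGE_NT_SIGNATURE) {
        fprintf(stderr, "Not a valid PE image.\n");
        return 1;
    }

    /* TimeDateStamp counts seconds since 1970-01-01 UTC. */
    time_t stamp = (time_t)nt.FileHeader.TimeDateStamp;
    printf("Link timestamp: %s", ctime(&stamp));  /* expect 2015-08-17 here */
    return 0;
}
```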
Conclusion
The bugcheck 0x9E was a direct result of a stalled I/O operation that prevented a critical Failover Cluster resource from responding to health checks. The high number of pending I/O requests and the outdated HBA driver strongly suggest a problem within the storage stack. While no direct hardware errors were reported, a hardware-level malfunction cannot be entirely ruled out.
Recommendations
- Prioritize the driver update: The outdated HBA driver (`ql2300i.sys`) is the most likely culprit. We strongly recommend updating to the latest version from the hardware vendor (HPE).
- Engage the storage vendor: Contact the storage vendor (Hitachi) for further investigation; they should diagnose potential issues within the storage system (controller, cabling, etc.).
- Continue monitoring: After updating the driver, closely monitor the system for any further I/O delays or instability; a sample latency-monitoring sketch follows below.
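As a starting point for that monitoring, the sketch below polls the `\PhysicalDisk(_Total)\Avg. Disk sec/Transfer` performance counter through the PDH API and flags samples above a threshold. The 25 ms threshold and the 5-second interval are assumptions to tune for the environment, not vendor guidance.

```c
/* Sketch: poll average physical-disk latency via the PDH API and flag spikes.
 * The 25 ms threshold and 5 s interval are assumed starting points. */
#include <windows.h>
#include <pdh.h>
#include <stdio.h>
#pragma comment(lib, "pdh.lib")

int main(void)
{
    PDH_HQUERY query;
    PDH_HCOUNTER counter;

    if (PdhOpenQueryW(NULL, 0, &query) != ERROR_SUCCESS) return 1;
    /* Average seconds per transfer across all physical disks. */
    if (PdhAddEnglishCounterW(query,
            L"\\PhysicalDisk(_Total)\\Avg. Disk sec/Transfer",
            0, &counter) != ERROR_SUCCESS) return 1;

    PdhCollectQueryData(query);               /* prime the first sample */
    for (int i = 0; i < 12; i++) {            /* ~1 minute of samples */
        Sleep(5000);
        PdhCollectQueryData(query);

        PDH_FMT_COUNTERVALUE value;
        if (PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE,
                                        NULL, &value) == ERROR_SUCCESS) {
            double ms = value.doubleValue * 1000.0;
            printf("Avg disk latency: %.2f ms%s\n",
                   ms, ms > 25.0 ? "  <-- above threshold" : "");
        }
    }
    PdhCloseQuery(query);
    return 0;
}
```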
Debugging Details
| Item | Details |
|---|---|
| Computer Name | Machine1 |
| Kernel Version | Machine Version |
| Product | PII |
| Edition build lab | PII |
| Dump Time | Wed Dec 11 12:49:36.341 2024 (UTC+1:00) |
| System Uptime | 54 days 20:51:31.931 |
| System Manufacturer | HPE |
| System Product Name | PII |
| Processor | Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz |
| Bugcheck Info | Watchdog bugcheck 0x9E |
| Dump Type | Kernel Summary (kernel address space available; user address space may not be) |
| Build Revision | PII |
References
- Understanding how Failover Clustering Recovers from Unresponsive Resources (Microsoft Community Hub): the underlying mechanism of bugcheck 0x9E.
- Bug Check 0x9E USER_MODE_HEALTH_MONITOR (Microsoft Learn, Windows drivers): explanation of the bugcheck arguments.
Disclaimer: this RCA was fabricated from a hypothetical event; all of the information was generated in my own lab.