The Case of the Vanishing LUN: An MPIO Mystery on Windows Server 2022

Incorrect Vital Product Data (VPD) can have profound consequences. This RCA explores how stale Target Port Group identifiers on a Storage Vendor array led to a critical LUN becoming inaccessible in a Windows Server 2022 environment.

 

Root Cause Analysis Report

Storage Vendor MPIO LUN Disappearance

Executive Summary

This RCA report analyzes an issue where a Storage Vendor LUN (Logical Unit Number) became inaccessible (hidden) after a server reboot in a Windows Server 2022 environment using MPIO (Multipath I/O). The root cause was identified as outdated Target Port Group (TPG) identifiers present in the VPD (Vital Product Data) page 83 data returned by the Storage Vendor storage array. This incorrect information misled the Microsoft Device Specific Module (DSM) during device enumeration, causing the LUN to be unclaimed.

Introduction

Environment

Component          Details
Operating System   Windows Server 2022
Storage Array      Storage Vendor
Multipathing       Microsoft MPIO, using the Microsoft DSM
Connectivity       iSCSI
TPGs
  • Pair 1: 1000/1001 (TPs: 1000:1, 1001:2)
  • Pair 2: 5096/5097 (TPs: 5096:4097, 5097:4098)
  • Problematic (Stale): 13e8, 13e9 (hexadecimal, as shown in the VPD dump)

Purpose

This document details the root cause analysis of the disappearing Storage Vendor LUN after a server reboot. The goal is to identify the root cause, outline the sequence of events, and recommend corrective actions to prevent recurrence.

Incident Description

After a planned reboot of the Windows Server 2022 host, a previously accessible Storage Vendor LUN became inaccessible. The LUN was no longer visible in Disk Management, and applications relying on it failed. This occurred after the following sequence:

  1. Initial setup with TPGs 1000/1001 connected; LUN visible.
  2. TPGs 5096/5097 added; LUN remained visible.
  3. TPGs 1000/1001 removed; LUN remained visible.
  4. Server rebooted; LUN disappeared.

Methodology

The investigation involved:

  • Log Review: Examined Windows Event Logs (System, Application, MPIO).
  • MPIO Tracing: Enabled MPIO and MSDSM ETW tracing.
  • Live Kernel Dump: Collected a live kernel dump.
  • VPD Data Analysis: Examined VPD page 83 data.
  • DSM Code Review: Analyzed relevant Microsoft DSM code.

Findings

Chronology and Summary

  1. Pre-Reboot (Working State): The LUN was accessible through TPGs 5096/5097. TPGs 1000/1001 were removed. mpclaim -v confirmed only TPGs 5096/5097 were active.
  2. Reboot: Server rebooted (planned maintenance).
  3. Post-Reboot (Issue): LUN no longer visible.

Event Logs:

  • Two MPIO Event ID 47 events shortly after boot (TPs 5096:4097 and 5097:4098).
  • No other disk/storage errors in System log around the boot.

MPIO Tracing (Disconnect/Reconnect):

MPIOAddSingleDevice (FFFFA38E24D91050): PDO was not claimed by any DSM
MPIODeviceRegistration() - MPIODeviceRegistration (FFFFAB83343C2050): Failed to add device (FFFFA38E24D91050) with status (c000000e)
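
When triaging a larger trace, it can help to pull the failing status codes out mechanically. The following is a minimal sketch, assuming trace lines in the same textual shape as the two shown above; the function name failed_registrations and the one-entry lookup table are illustrative, not part of any Microsoft tool. The only status mapped here is c000000e, which is STATUS_NO_SUCH_DEVICE.

```python
import re

# Illustrative one-entry subset of NTSTATUS codes.
# 0xC000000E is STATUS_NO_SUCH_DEVICE.
NTSTATUS_NAMES = {
    0xC000000E: "STATUS_NO_SUCH_DEVICE",
}

# Matches the "with status (c000000e)" suffix seen in the trace lines above.
STATUS_RE = re.compile(r"with status \(([0-9a-fA-F]{8})\)")


def failed_registrations(trace_lines):
    """Yield (status_code, status_name) for each line that reports a status."""
    for line in trace_lines:
        match = STATUS_RE.search(line)
        if match:
            code = int(match.group(1), 16)
            yield code, NTSTATUS_NAMES.get(code, "UNKNOWN")


trace = [
    "MPIOAddSingleDevice (FFFFA38E24D91050): PDO was not claimed by any DSM",
    "MPIODeviceRegistration() - MPIODeviceRegistration (FFFFAB83343C2050): "
    "Failed to add device (FFFFA38E24D91050) with status (c000000e)",
]
results = list(failed_registrations(trace))
for code, name in results:
    print(f"0x{code:08X} {name}")
```

Only the second trace line carries a status, so the sketch reports a single STATUS_NO_SUCH_DEVICE entry for this excerpt.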

VPD Data Analysis:

Kernel dump taken with .\tss.ps1 -liveKD both -sha_mpio -sha_msdsm -nogpresult -nosdp -nobasiclog -noupdate -noxray. Information extracted using !scsikd.scsiinquiry.

Harddisk1 (VPD Page 83 - Snippet):

 000: 00 83 00 70  02 01 00 20  00 00 00 00  00 00 00 00 | ...p ... .... ....
 010: 20 4c 55 4e  20 77 4f 6a  34 5a 3f 56  6b 7a 44 37 |  LUN  wOj 4Z?V kzD7
 020: 38 20 20 20  20 20 20 20  01 03 00 10  60 0a 09 80 | 8         .... `...
 030: 77 4f 6a 34  5a 3f 56 6b  7a 44 37 38  01 02 00 10 | wOj4 Z?Vk zD78 ....
 040: 3f 56 6b 7a  44 37 38 00  00 a0 98 77  4f 6a 34 5a | ?Vkz D78. ...w Oj4Z
 050: 01 13 00 10  60 0a 09 80  00 00 00 02  c0 a8 00 8e | .... `... .... ....
 060: 00 00 0c bc  01 14 00 04  00 00 10 01  01 15 00 04 | .... .... .... ....
 070: 00 00 13 e8                                        | ....

Harddisk3 (VPD Page 83 - Snippet):

 000: 00 83 00 70  02 01 00 20  00 00 00 00  00 00 00 00 | ...p ... .... ....
 010: 20 4c 55 4e  20 77 4f 6a  34 5a 3f 56  6b 7a 44 37 |  LUN  wOj 4Z?V kzD7
 020: 38 20 20 20  20 20 20 20  01 03 00 10  60 0a 09 80 | 8         .... `...
 030: 77 4f 6a 34  5a 3f 56 6b  7a 44 37 38  01 02 00 10 | wOj4 Z?Vk zD78 ....
 040: 3f 56 6b 7a  44 37 38 00  00 a0 98 77  4f 6a 34 5a | ?Vkz D78. ...w Oj4Z
 050: 01 13 00 10  60 0a 09 80  00 00 00 02  c0 a8 00 8f | .... `... .... ....
 060: 00 00 0c bc  01 14 00 04  00 00 10 02  01 15 00 04 | .... .... .... ....
 070: 00 00 13 e9                                        | ....

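The final designator in each page (type 5, Target Port Group, value at offset 070) carries identifier 13e8 on Harddisk1 and 13e9 on Harddisk3. The VPD data therefore included the stale TPG identifiers 13e8 and 13e9, which were no longer valid.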
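
To make the dump easier to read, the following minimal sketch walks the designators in the Harddisk1 page and extracts the Relative Target Port (type 4) and Target Port Group (type 5) values. It assumes the standard Device Identification layout: a 4-byte page header followed by designators that each begin with their own 4-byte header. The names walk_designators and port_identifiers are illustrative and were not part of the tooling used in the investigation.

```python
import struct

# Harddisk1 VPD page 0x83 bytes, copied from the dump above.
PAGE_HEX = """
00 83 00 70 02 01 00 20 00 00 00 00 00 00 00 00
20 4c 55 4e 20 77 4f 6a 34 5a 3f 56 6b 7a 44 37
38 20 20 20 20 20 20 20 01 03 00 10 60 0a 09 80
77 4f 6a 34 5a 3f 56 6b 7a 44 37 38 01 02 00 10
3f 56 6b 7a 44 37 38 00 00 a0 98 77 4f 6a 34 5a
01 13 00 10 60 0a 09 80 00 00 00 02 c0 a8 00 8e
00 00 0c bc 01 14 00 04 00 00 10 01 01 15 00 04
00 00 13 e8
"""


def walk_designators(page):
    """Yield (designator_type, payload) for each designator in the page."""
    if page[1] != 0x83:
        raise ValueError("not a Device Identification page")
    (page_length,) = struct.unpack_from(">H", page, 2)
    offset, end = 4, 4 + page_length
    while offset + 4 <= end:
        designator_type = page[offset + 1] & 0x0F  # low nibble of byte 1
        length = page[offset + 3]
        yield designator_type, page[offset + 4 : offset + 4 + length]
        offset += 4 + length


def port_identifiers(page):
    """Return (relative_target_port, target_port_group); None where absent."""
    rtp = tpg = None
    for designator_type, payload in walk_designators(page):
        if designator_type == 0x4:    # Relative Target Port: last two bytes
            rtp = int.from_bytes(payload[2:4], "big")
        elif designator_type == 0x5:  # Target Port Group: last two bytes
            tpg = int.from_bytes(payload[2:4], "big")
    return rtp, tpg


page = bytes.fromhex("".join(PAGE_HEX.split()))
rtp, tpg = port_identifiers(page)
print(f"relative target port: 0x{rtp:04x} ({rtp})")
print(f"target port group:    0x{tpg:04x} ({tpg})")
```

For the Harddisk1 bytes this prints 0x1001 (4097) for the relative target port and 0x13e8 (5096) for the target port group. Printing both bases is deliberate: the dump is hexadecimal, so values should be converted to a common base before they are compared with identifiers quoted in decimal elsewhere in this report.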

Root Causes

  • Primary Cause: Stale TPG identifiers (13e8 and 13e9) in VPD page 83 data from the Storage Vendor array. This caused the Microsoft DSM to incorrectly process the device, resulting in it not being claimed.

Detailed Explanation and DSM Code Analysis

The Microsoft DSM (MSDSM) handles TPG management and device claiming as follows:

  1. Device Enumeration: DSM sends SCSI inquiry commands (including VPD page 83).
  2. TPG Processing: DSM parses VPD page 83 for TPG identifiers.
  3. Relevant DSM Functions:
    • DsmpBuildTargetPortGroupEntry(): Allocates new TPG entry.
    • DsmpBuildTargetPortListEntry(): Creates entries for Target Ports.
    • DsmpUpdateTargetPortGroupEntry(): Updates existing TPG entry.
    • DsmpFindTargetPortGroupEntry(): Looks up an existing TPG entry.
    • DsmpFindTargetPortGroup(): Looks up a TPG.

    Overall flow:

    1. Get standard inquiry data; check SPC-3 compliance.
    2. Create device serial number; populate device info.
    3. Create device name.
    4. If ALUA supported, send "Report Target Port Groups".
    5. Find/create TPG.
    6. Build/update TPG and Target Port info.
    7. If both implicit/explicit ALUA transitions allowed, disable implicit.
    8. Get VPD page 0x83 (if no matches).
    9. Match type 0x5 identifiers with "Report Target Port Groups". Use SCSI address if no match.
    10. Create controller list, removing stale entries.
  4. Failure Mechanism: Stale TPG identifiers (13e8, 13e9) caused:
    • DSM read VPD page 83 with incorrect identifiers.
    • DSM attempted to create/update entries for non-existent TPGs.
    • Conflict prevented DSM from claiming device via valid TPGs (5096/5097).
    • Resulted in the MPIOAddSingleDevice and MPIODeviceRegistration failures seen in the trace (status c000000e, STATUS_NO_SUCH_DEVICE).
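
Step 9 of the flow (match the type 0x5 identifier against the "Report Target Port Groups" data, otherwise use the SCSI address) can be sketched as follows. This is a conceptual illustration only, not the MSDSM implementation: the TargetPortGroup class, the find_group and resolve_path functions, and all identifier values are hypothetical and chosen purely for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TargetPortGroup:
    """One group as it might appear in a Report Target Port Groups response."""
    group_id: int
    relative_port_ids: list = field(default_factory=list)


def find_group(reported_groups, vpd_group_id) -> Optional[TargetPortGroup]:
    """Return the reported group whose ID equals the VPD type 0x5 identifier."""
    for group in reported_groups:
        if group.group_id == vpd_group_id:
            return group
    return None


def resolve_path(reported_groups, vpd_group_id, scsi_address):
    """Prefer a TPG match; otherwise fall back to the SCSI address."""
    group = find_group(reported_groups, vpd_group_id)
    if group is not None:
        return ("tpg", group.group_id)
    return ("scsi_address", scsi_address)


# Illustrative values only; these IDs do not come from the incident.
groups = [TargetPortGroup(0x10, [1]), TargetPortGroup(0x11, [2])]
address = (2, 0, 0, 1)  # hypothetical port, bus, target, LUN tuple

matched = resolve_path(groups, 0x11, address)    # ID present in the response
unmatched = resolve_path(groups, 0x2A, address)  # ID absent from the response
print(matched, unmatched)
```

The second call shows the situation the failure mechanism describes: an identifier read from VPD page 83 that has no counterpart among the reported groups, so the lookup yields nothing and the fallback is taken.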

Conclusion

The root cause was stale TPG identifiers in the VPD page 83 data. This misled the Microsoft DSM, preventing the LUN from being claimed after reboot.

Recommendation

Immediate Actions:

  1. Correct VPD Data: Update VPD page 83 on the Storage Vendor array to remove stale TPG identifiers 13e8 and 13e9. VPD data should only reflect currently active TPGs (5096/5097). Consult Storage Vendor documentation. This is the critical corrective action.

Long-Term Actions:

  1. Configuration Management: Implement strict Storage Vendor configuration management to prevent stale data, especially in VPD page 83.
  2. Monitoring: Monitor for MPIO Event ID 47 and related errors.
  3. Storage Vendor Best Practices: Verify VPD data accuracy after TPG configuration changes.

Appendices

Appendix A: Relevant Event Log Entries

Time         Event ID  Provider  Description
Post-Reboot  47        MPIO      MPIOAddSingleDevice (...): PDO was not claimed by any DSM
Post-Reboot  47        MPIO      MPIODeviceRegistration() (...): Failed to add device (...) with status (c000000e)

Appendix B: DSM Code Snippets (Conceptual)


// Conceptual representation of DsmpBuildTargetPortGroupEntry
DsmpBuildTargetPortGroupEntry(...) {
  // ...
  // Get Target Port Group information from VPD page 83
  // ...

  for each TargetPortIdentifier in TargetPortGroupsDescriptor {
    DsmpBuildTargetPortListEntry(TargetPortIdentifier); // Create entry for each port
  }
  // ...
}

// Conceptual representation of DsmpUpdateTargetPortGroupEntry
DsmpUpdateTargetPortGroupEntry(...) {
  // ...
  // Get Target Port Group information from VPD page 83
  // ...

  for each TargetPortIdentifier in TargetPortGroupsDescriptor {
    if (TargetPortListEntryExists(TargetPortIdentifier)) {
      // Update existing entry if needed
    } else {
      DsmpBuildTargetPortListEntry(TargetPortIdentifier); // Create entry for new port
    }
  }

  // Remove entries for any Target Ports that are no longer present
  // ...
}

Appendix C: VPD data reading

The VPD page data is shown in hexadecimal format. It can be interpreted as follows:

  • Offset: (e.g., 000, 010, 020...) - Byte offset from the start of the page (hex), not a memory address.
  • Hexadecimal Data: (e.g., 00 83 00 70...) - Actual data (each hex pair is one byte).
  • ASCII Interpretation: (e.g., ...p ... .... ....) - ASCII representation of the same bytes (. for non-printable).

VPD page 83 is for "Device Identification," including these key designators:

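  • Designator Type (low 4 bits of the second header byte; bits 5-4 hold the association, so 13, 14, and 15 in the dump are types 3, 4, and 5 associated with the target port):
    • 00: Vendor Specific
    • 01: T10 Vendor ID
    • 02: EUI-64
    • 03: NAA
    • 04: Relative Target Port
    • 05: Target Port Group
    • 08: SCSI Name String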
  • Code Set:
    • 01: Binary
    • 02: ASCII
    • 03: UTF-8
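
As a small worked example of the header fields, the sketch below decodes a single 4-byte designator header into its code set, association, designator type, and payload length. The describe_header function and the lookup tables are illustrative helpers written for this appendix; the field positions follow the standard Device Identification designator layout.

```python
CODE_SETS = {0x1: "Binary", 0x2: "ASCII", 0x3: "UTF-8"}
ASSOCIATIONS = {0x0: "Logical unit", 0x1: "Target port", 0x2: "Target device"}
DESIGNATOR_TYPES = {
    0x0: "Vendor Specific",
    0x1: "T10 Vendor ID",
    0x2: "EUI-64",
    0x3: "NAA",
    0x4: "Relative Target Port",
    0x5: "Target Port Group",
    0x8: "SCSI Name String",
}


def describe_header(header: bytes) -> dict:
    """Decode the 4-byte header that precedes every designator payload."""
    if len(header) != 4:
        raise ValueError("a designator header is exactly 4 bytes")
    code_set = header[0] & 0x0F            # low nibble of byte 0
    association = (header[1] >> 4) & 0x03  # bits 5-4 of byte 1
    designator_type = header[1] & 0x0F     # low nibble of byte 1
    return {
        "code_set": CODE_SETS.get(code_set, "Reserved"),
        "association": ASSOCIATIONS.get(association, "Reserved"),
        "type": DESIGNATOR_TYPES.get(designator_type, "Other"),
        "length": header[3],
    }


# Header of the last designator in the Harddisk1 dump: 01 15 00 04
tpg_header = describe_header(bytes.fromhex("01150004"))
print(tpg_header)
```

For the bytes 01 15 00 04 the result is a Binary code set, Target port association, Target Port Group type, and a 4-byte payload, which is why the second byte reads 15 in the dump rather than 05.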

 

Disclaimer: This RCA was fabricated based on a potential event; all of the information was generated in my own lab.