RCA 11: Application relying on Windows Failover Cluster fails

A Windows application that relies on Windows Failover Cluster to provide a shared network drive between two data centers failed, taking down a critical banking transaction app.

 

Cluster Node 1: ABCK-TCVN-PAB01(Active)

Cluster Node 2: XYZK-TCVN-PAB01

 

The T: drive on this VM was an RDM LUN (datastore name: ABCD-0052-0-149), and this RDM is presented from the datastore in the HO datacentre.

 

During the initial investigation, "host storage path redundancy or storage connectivity is lost" events were registered on multiple ESXi hosts in the same cluster that ABCT-PROD-PRP10 belongs to, but the events stopped around 11:30pm, 17 April. It is not yet clear whether this is related to the issue.

 

Migrated ABCK-TCVN-PAB01 from host ABCT-PROD-PRP10 to ABCT-PROD-PRP12; the issue reappeared after about two hours of running.

 

Time Period: 09:00am 17-April-2019 to 04:00am 18-April-2019

 

GMT equivalent: 11:00 PM 16th April to 6:00 PM 17th April

 

 

ABCK-TCVN-PAB01

 

  • Tried to find events at 11:00 PM GMT on 16th April but was not able to find any in the VMware logs.
  • However, we can see the events for the vMotion where the VM is moved from ABCT-PROD-prp12 to ABCT-PROD-PRP10 and then later back from ABCT-PROD-PRP10 to ABCT-PROD-prp12.

 

2019-04-17T00:50:39.554Z| vmx| I120: MigrateSetInfo: state=1 srcIp=<10.205.53.68> dstIp=<10.205.53.66> mid=1555462239311043 uuid=30373237-3132-4753-4835-343657564c41 priority=low
2019-04-17T00:50:45.197Z| vmx| W110: VMX has left the building: 0.

 

 

2019-04-17T14:50:13.775Z| vmx| I120: MigrateSetInfo: state=1 srcIp=<10.205.53.66> dstIp=<10.205.53.68> mid=1555512613539894 uuid=30373237-3132-4753-4835-343657564c38 priority=high

2019-04-17T14:50:21.038Z| vmx| W110: VMX has left the building: 0.

 

  • We can only see Remote Connection Failure events from inside the VM at the time of the issue:

2019-04-17T17:36:22.260Z| vmx| I120: GuestRpc: Got error for connection 2803:Remote connection failure
2019-04-17T17:38:02.993Z| vmx| I120: GuestRpc: Got error for connection 2903:Remote connection failure
2019-04-17T18:01:47.073Z| vmx| I120: GuestRpc: Got error for connection 3003:Remote connection failure
2019-04-17T18:03:23.651Z| vmx| I120: GuestRpc: Got error for connection 3103:Remote connection failure
2019-04-17T18:05:04.020Z| vmx| I120: GuestRpc: Got error for connection 3203:Remote connection failure
2019-04-17T18:06:44.153Z| vmx| I120: GuestRpc: Got error for connection 3303:Remote connection failure

 

RPC Associated Event: https://support.hpe.com/hpsc/doc/public/display?docId=mmr_kc-0132884
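The GuestRpc failures above can be pulled out of a vmware.log copy with a quick filter. A minimal sketch, using sample lines in place of the real log (the /tmp path and sample data are illustrative):

```shell
# Illustrative only: sample lines stand in for the real vmware.log.
cat > /tmp/vmware.log <<'EOF'
2019-04-17T17:36:22.260Z| vmx| I120: GuestRpc: Got error for connection 2803:Remote connection failure
2019-04-17T17:37:00.000Z| vmx| I120: some unrelated entry
2019-04-17T17:38:02.993Z| vmx| I120: GuestRpc: Got error for connection 2903:Remote connection failure
EOF

# Count GuestRpc remote-connection failures.
grep -c 'GuestRpc.*Remote connection failure' /tmp/vmware.log   # prints 2
```

The same filter against the live log shows the failures clustering between 17:36 and 18:06 GMT, matching the issue window.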

 

 

________________________________________________Host Side Events_____________________________________________

 

Hostname: ABCT-PROD-PRP10

Esxi Version: VMware ESXi 6.0.0 build-3620759 (ESXi 6.0 Update 2)

 

ABCD-0052-0-149  naa.60060e8007c5e9000030c5e900000052  59117ab3-c3557ada-3dd6-9cdc71782ff0  13210601  96894594  2589590  LUN:149  fc  4  HITACHI Fibre Channel

 

  • Reviewed the logs on the host for the day 2019-04-16 but was not able to find any events other than the below:

:var$ grep -ilr 2019-04-16 *

run/log/vmauthd.1

run/log/syslog.3

run/log/hostd-probe.2

run/log/fdm.log

run/log/vmksummary.0

log/tdlog/fdm.all

log/tdlog/syslog.all

 

  • We can see a few log entries for the 17th, but by then the machine had been moved to another host.

:var$ grep -ilr 2019-04-17 *

run/log/vmauthd.1

run/log/syslog.3

run/log/hostd-probe.2

run/log/vobd.log

run/log/fdm.log

run/log/vmksummary.0

run/log/vmauthd.0

log/tdlog/vobd.all

log/tdlog/fdm.all

log/tdlog/syslog.all
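Note that `grep -il` only lists the file names that match; to see the matching entries themselves, drop `-l` and add context. A minimal sketch against a stand-in directory tree (the /tmp paths and sample entries are made up):

```shell
# Build a stand-in /var-style tree with one matching and one non-matching log.
mkdir -p /tmp/var/run/log
printf '2019-04-16T23:48:13Z fdm: sample entry\n' > /tmp/var/run/log/fdm.log
printf '2019-04-15T10:00:00Z older entry\n'       > /tmp/var/run/log/old.log

# -i ignore case, -l list file names only, -r recurse.
(cd /tmp/var && grep -ilr 2019-04-16 .)           # prints ./run/log/fdm.log

# Same search, but show the matching lines with one line of context.
(cd /tmp/var && grep -ir -C1 2019-04-16 .)
```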

 

 

  • On this host we can see that the passthrough disk below was attached.

 

VMDK: XYZK-TCVN-PAB01.vmdk
physical/virtual RDM: Passthrough Raw Device Mapping
vml.ID: vml.0200b6000060060e8007c5e9000030c5e900abce284f50454abcde
naa.ID: naa.60060e8007c5e9000030c5e900abce28
Display Name: HITACHI OPEN-V

 

  • Tried to review the hostd and vmkernel logs, but was not able to find anything around the timeline at which the issue was recorded.

 

FDM Logs:

 

  • From the FDM Logs we can see no connectivity issues between the cluster nodes.

 

2019-04-16T23:06:52.500Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-16T23:16:53.449Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-16T23:46:56.339Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a]  hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-16T23:48:13.150Z info fdm[5880DB70] [Originator@6876 sub=Invt opID=SWI-65630605] [DsStateChange::SaveToInventory] Unset _accessible for datastore (/vmfs/volumes/56d3c849-e6ba7d5c-483a-b05adab9bfb0).
2019-04-16T23:56:57.279Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0

 

2019-04-17T18:07:45.768Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a]  hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=73 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-17T18:17:45.869Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=73 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-17T18:18:13.699Z info fdm[588D0B70] [Originator@6876 sub=Invt opID=SWI-2b12d47b] [DsStateChange::SaveToInventory] Unset _accessible for datastore (/vmfs/volumes/56d3c849-e6ba7d5c-483a-b05adab9bfb0).
2019-04-17T18:48:13.714Z info fdm[588D0B70] [Originator@6876 sub=Invt opID=SWI-589a0135] [DsStateChange::SaveToInventory] Unset _accessible for datastore (/vmfs/volumes/56d3c849-e6ba7d5c-483a-b05adab9bfb0).
2019-04-17T18:57:46.201Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=73 vm-metadata-version=16896 slv-mst-tdiff-sec=0

 

VOBD Logs:

 

  • In the VOBD logs we can see that device naa.60060e8007c5e9000030c5e900abce28, a LUN presented to the host, has space utilization issues; this is the same RDM LUN that is presented to the VM on which we noticed the issue.

 

2019-04-17T00:50:45.388Z: [scsiCorrelator] 34211865239745us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T00:50:45.389Z: [scsiCorrelator] 34212071285263us: [esx.problem.scsi.device.thinprov.atquota] Space utilization on thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded configured threshold. Affected datastores (if any): Unknown.
2019-04-17T04:41:47.117Z: [scsiCorrelator] 34225726877050us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T04:41:47.118Z: [scsiCorrelator] 34225933014227us: [esx.problem.scsi.device.thinprov.atquota] Space utilization on thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded configured threshold. Affected datastores (if any): Unknown.
2019-04-17T07:40:57.925Z: [scsiCorrelator] 34236477615305us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T07:40:57.926Z: [scsiCorrelator] 34236683822432us: [esx.problem.scsi.device.thinprov.atquota] Space utilization on thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded configured threshold. Affected datastores (if any): Unknown.
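The at-quota warnings above repeat every few hours; a quick way to confirm how many devices are affected is to tally events per naa ID. A minimal sketch over sample vobd.log lines (the /tmp path and sample data are illustrative):

```shell
# Illustrative sample: three at-quota events for one device.
cat > /tmp/vobd.log <<'EOF'
2019-04-17T00:50:45.388Z: [scsiCorrelator] [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T04:41:47.117Z: [scsiCorrelator] [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T07:40:57.925Z: [scsiCorrelator] [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
EOF

# Tally at-quota events per device (count, then naa ID).
grep 'thinprov.atquota' /tmp/vobd.log | grep -o 'naa\.[0-9a-f]*' | sort | uniq -c
```

Against the live vobd.log, only the RDM LUN appears in the tally, which narrows the space pressure to that one device.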

 

Hostname: ABCT-PROD-prp12

Esxi Version: VMware ESXi 6.0.0 build-3620759 (ESXi 6.0 Update 2)

 

Tried to review the hostd and vmkernel logs, but was not able to find anything around the timeline at which the issue was recorded.

 

 

________________________________________________Windows Side Analysis_____________________________________________

 

System Info:ABCK-TCVN-PAB01

 

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCK-TCVN-PAB01

System Manufacturer        VMware, Inc.

System Model        VMware Virtual Platform

System Type        x64-based PC

System SKU        

Processor        Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 2297 Mhz, 2 Core(s), 2 Logical Processor(s)

BIOS Version/Date        Phoenix Technologies LTD 6.00, 21/09/2015

Time Zone        AUS Eastern Standard Time

 

Cluster Logs:

 

  • From the cluster logs we can see a generic authorization error; however, this is not associated with the issue we are facing:

 

000008cc.000012ec::2019/04/16-23:45:46.982 WARN  [CHANNEL 10.210.53.29:~53934~] failure, status (10054)
000008cc.000012ec::2019/04/16-23:45:46.982 ERR   mscs_security::BaseSecurityContext::DoAuthenticate_static: Closed(1236)' because of 'channel to remote endpoint 10.210.53.29:~53934~ has failed with status (10054)'
000008cc.00001adc::2019/04/16-23:45:46.982 WARN  mscs::ListenerWorker::operator (): HrError(0x000004d4)' because of '[SV] Authentication or Authorization Failed'

  • Here we can see that the issue started around 5:50 PM GMT, when the resource states began changing from Online to Offline.

 

000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(IP Address 10.207.51.129) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(tcVisionP) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(tcVISION_PROD_ORA Manager) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(tcVISION_PROD Manager) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(File Server (\\tcVisionP)) Online-->WaitingToGoOffline.

000008cc.00001e0c::2019/04/17-17:50:33.593 INFO  [RCM] TransitionToState(File Server (\\tcVisionP)) OfflinePending-->OfflineSavingCheckpoints.
000008cc.00001e0c::2019/04/17-17:50:33.593 INFO  [RCM] TransitionToState(File Server (\\tcVisionP)) OfflineSavingCheckpoints-->Offline.

  • The service went completely offline; after this, the resource group started to move from ABCK-TCVN-PAB01(1) to XYZK-TCVN-PAB01(2):

 

00000e68.00002018::2019/04/17-17:50:45.600 INFO  [RES] Generic Service <tcVISION_PROD Manager>: Service is now offline.

00000e68.0000131c::2019/04/17-17:50:45.643 WARN  [RES] IPv4 Address Resources: Notify port is about to be closed. Exiting…
000008cc.00002310::2019/04/17-17:50:45.645 INFO  [RCM] TransitionToState(IP Address 10.207.51.129) OfflineSavingCheckpoints–>Offline.
000008cc.000028f0::2019/04/17-17:50:45.660 WARN  [NETFTAPI] Failed to query parameters for 10.207.51.129 (status 0x80070490)
000008cc.00002310::2019/04/17-17:50:45.880 INFO  [RCM] move of group tcVisionP from ABCK-TCVN-PAB01(1) to XYZK-TCVN-PAB01(2) of type MoveType::Manual is about to succeed, failoverCount=0, lastFailoverTime=1601/01/01-00:00:00.000 targeted=true
000008cc.00002414::2019/04/23-02:07:49.858 ERR   [API] AccessCheck[AndAuditAlarm] failed.  status = 0x00000005
000008cc.00002358::2019/04/23-02:14:52.649 ERR   [API] AccessCheck[AndAuditAlarm] failed.  status = 0x00000005
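The failover timeline can be reconstructed from a cluster.log copy by filtering on the RCM state transitions and the group move. A minimal sketch over sample lines (the /tmp path and sample data are illustrative):

```shell
# Illustrative sample lines from a cluster.log copy.
cat > /tmp/cluster.log <<'EOF'
000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(File Server (\\tcVisionP)) Online-->WaitingToGoOffline.
000008cc.00001e0c::2019/04/17-17:50:33.593 INFO  [RCM] TransitionToState(File Server (\\tcVisionP)) OfflineSavingCheckpoints-->Offline.
000008cc.00002310::2019/04/17-17:50:45.880 INFO  [RCM] move of group tcVisionP from ABCK-TCVN-PAB01(1) to XYZK-TCVN-PAB01(2) of type MoveType::Manual is about to succeed
EOF

# Pull out only state transitions and group moves, in time order.
grep -E 'TransitionToState|move of group' /tmp/cluster.log
```

Run against the full log, this shows the resources going offline at 17:50:33 GMT and the group move starting twelve seconds later.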

  • From the system logs we can see errors pointing to a capacity utilization threshold being crossed:

 

TimeGenerated: 17-04-2019 10.50
EntryType: Warning
MachineName: ABCK-TCVN-PAB01.rba.gov.au
EventID: 144
Source: disk
Message: Disk 1 has crossed a capacity utilization threshold.

TimeGenerated: 17-04-2019 10.53
EntryType: Warning
MachineName: ABCK-TCVN-PAB01.rba.gov.au
EventID: 140
Source: Microsoft-Windows-Ntfs
Message: The description for Event ID '140' in Source 'Microsoft-Windows-Ntfs' cannot be found. The following information is part of the event: 'T:', '\Device\HarddiskVolume3', '0x80000011'

 

 

Conclusion:

 

  • From the events and logs captured, we were not able to find much in the host logs other than the thin-provisioned storage exceeding its utilization threshold; the Windows logs point to the same issue.
  • Since we have only seen this issue twice and there is not much information in the logs from the host end, we cannot yet conclude that it was caused by the storage issue.

 

Action Plan:

 

  • From the logs I can see that the datastore ABCD-0052-0-149 has a size of 0.488 TB, of which 0.406 TB shows as free; but since the device is thin provisioned, we would have to check whether enough storage is actually provisioned on the hardware end.
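For scale, the datastore-level figures above work out to roughly 17% used, which is consistent with the point that any shortfall would be on the array (hardware) side rather than in VMFS free space. A quick check of the arithmetic:

```shell
# Datastore figures from the logs: 0.488 TB provisioned, 0.406 TB free.
awk 'BEGIN { total=0.488; free=0.406; used=total-free;
             printf "used=%.3fTB pct=%.0f%%\n", used, used/total*100 }'
# prints used=0.082TB pct=17%
```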

 

 

  • I found this information from the command I ran to capture the RDM info for the VMDKs associated with the VMs.

 

  • As per the Command: # vmkfstools -q /vmfs/volumes/59117ab3-c3557ada-3dd6-9cdc71782ff0/XYZK-TCVN-PAB01/XYZK-TCVN-PAB01.vmdk

 

Disk /vmfs/volumes/59117ab3-c3557ada-3dd6-9cdc71782ff0/XYZK-TCVN-PAB01/XYZK-TCVN-PAB01.vmdk is a Passthrough Raw Device Mapping

Maps to: vml.0200b6000060060e8007c5e9000030c5e900abce284f50454abcde 

 

  • The vml ID mentioned above is associated with the naa ID which I shared with you over email.

 

VMDK: XYZK-TCVN-PAB01.vmdk
physical/virtual RDM: Passthrough Raw Device Mapping
vml.ID: vml.0200b6000060060e8007c5e9000030c5e900abce284f50454abcde
naa.ID: naa.60060e8007c5e9000030c5e900abce28
Display Name: HITACHI OPEN-V

 

  • I am not aware of the vROps terminology, but I will check with one of my colleagues and get back to you.

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
