RCA 11: Application relying on Windows Failover Cluster fails

A Windows application that relies on Windows Failover Cluster to provide a shared network drive between two data centers failed, taking down a critical banking transaction app.

 

Cluster Node 1: ABCK-TCVN-PAB01(Active)

Cluster Node 2: XYZK-TCVN-PAB01

 

The T: drive on this VM was an RDM LUN (datastore name: ABCD-0052-0-149), and this RDM is presented from the datastore in the HO datacentre.

 

During the initial investigation, "host storage path redundancy or storage connectivity is lost" events were registered on multiple ESXi hosts in the same cluster that ABCT-PROD-PRP10 belongs to, but the events stopped around 11:30pm, 17 April. It is not yet clear whether this is related to the issue.

 

Migrated ABCK-TCVN-PAB01 from host ABCT-PROD-PRP10 to ABCT-PROD-PRP12; the issue reappeared after about two hours of running.

 

Time Period: 09:00am 17-April-2019 to 04:00am 18-April-2019

 

GMT equivalent: 11:00 PM 16th April to 6:00 PM 17th April

 

 

ABCK-TCVN-PAB01

 

  • Tried to find events at 11:00 PM GMT on 16th April but was not able to find any in the VMware logs.
  • However, we can see the events for the vMotion where the VM is moved from ABCT-PROD-prp12 to ABCT-PROD-PRP10 and then later back from ABCT-PROD-PRP10 to ABCT-PROD-prp12.

 

2019-04-17T00:50:39.554Z| vmx| I120: MigrateSetInfo: state=1 srcIp=<10.205.53.68> dstIp=<10.205.53.66> mid=1555462239311043 uuid=30373237-3132-4753-4835-343657564c41 priority=low
2019-04-17T00:50:45.197Z| vmx| W110: VMX has left the building: 0.

 

 

2019-04-17T14:50:13.775Z| vmx| I120: MigrateSetInfo: state=1 srcIp=<10.205.53.66> dstIp=<10.205.53.68> mid=1555512613539894 uuid=30373237-3132-4753-4835-343657564c38 priority=high

2019-04-17T14:50:21.038Z| vmx| W110: VMX has left the building: 0.

 

  • We can only see Remote Connection Failure events from inside the VM at the time of the issue:

2019-04-17T17:36:22.260Z| vmx| I120: GuestRpc: Got error for connection 2803:Remote connection failure
2019-04-17T17:38:02.993Z| vmx| I120: GuestRpc: Got error for connection 2903:Remote connection failure
2019-04-17T18:01:47.073Z| vmx| I120: GuestRpc: Got error for connection 3003:Remote connection failure
2019-04-17T18:03:23.651Z| vmx| I120: GuestRpc: Got error for connection 3103:Remote connection failure
2019-04-17T18:05:04.020Z| vmx| I120: GuestRpc: Got error for connection 3203:Remote connection failure
2019-04-17T18:06:44.153Z| vmx| I120: GuestRpc: Got error for connection 3303:Remote connection failure

 

RPC Associated Event: https://support.hpe.com/hpsc/doc/public/display?docId=mmr_kc-0132884
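The GuestRpc failures above can be pulled out of a vmware.log copy with a quick filter. A minimal sketch, using sample lines in place of the real log (the /tmp path and sample data are illustrative):

```shell
# Illustrative only: sample lines stand in for the real vmware.log.
cat > /tmp/vmware.log <<'EOF'
2019-04-17T17:36:22.260Z| vmx| I120: GuestRpc: Got error for connection 2803:Remote connection failure
2019-04-17T17:37:00.000Z| vmx| I120: some unrelated entry
2019-04-17T17:38:02.993Z| vmx| I120: GuestRpc: Got error for connection 2903:Remote connection failure
EOF

# Count GuestRpc remote-connection failures.
grep -c 'GuestRpc.*Remote connection failure' /tmp/vmware.log   # prints 2
```

The same filter against the live log shows the failures clustering between 17:36 and 18:06 GMT, matching the issue window.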

 

 

________________________________________________Host Side Events_____________________________________________

 

Hostname: ABCT-PROD-PRP10

Esxi Version: VMware ESXi 6.0.0 build-3620759 (ESXi 6.0 Update 2)

 

ABCD-0052-0-149  naa.60060e8007c5e9000030c5e900000052  59117ab3-c3557ada-3dd6-9cdc71782ff0  13210601  96894594  2589590  LUN:149  fc  4  HITACHI Fibre Channel

 

  • Reviewed the logs on the host for the day 2019-04-16 but was not able to find any events other than the below:

:var$ grep -ilr 2019-04-16 *

run/log/vmauthd.1

run/log/syslog.3

run/log/hostd-probe.2

run/log/fdm.log

run/log/vmksummary.0

log/tdlog/fdm.all

log/tdlog/syslog.all

 

  • We can see a few log entries for the 17th, but by then the machine had been moved to another host.

:var$ grep -ilr 2019-04-17 *

run/log/vmauthd.1

run/log/syslog.3

run/log/hostd-probe.2

run/log/vobd.log

run/log/fdm.log

run/log/vmksummary.0

run/log/vmauthd.0

log/tdlog/vobd.all

log/tdlog/fdm.all

log/tdlog/syslog.all
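Note that `grep -il` only lists the file names that match; to see the matching entries themselves, drop `-l` and add context. A minimal sketch against a stand-in directory tree (the /tmp paths and sample entries are made up):

```shell
# Build a stand-in /var-style tree with one matching and one non-matching log.
mkdir -p /tmp/var/run/log
printf '2019-04-16T23:48:13Z fdm: sample entry\n' > /tmp/var/run/log/fdm.log
printf '2019-04-15T10:00:00Z older entry\n'       > /tmp/var/run/log/old.log

# -i ignore case, -l list file names only, -r recurse.
(cd /tmp/var && grep -ilr 2019-04-16 .)           # prints ./run/log/fdm.log

# Same search, but show the matching lines with one line of context.
(cd /tmp/var && grep -ir -C1 2019-04-16 .)
```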

 

 

  • On this host we can see that the passthrough disk below was attached.

 

VMDK: XYZK-TCVN-PAB01.vmdk
physical/virtual RDM: Passthrough Raw Device Mapping
vml.ID: vml.0200b6000060060e8007c5e9000030c5e900abce284f50454abcde
naa.ID: naa.60060e8007c5e9000030c5e900abce28
Display Name: HITACHI OPEN-V

 

  • Tried to review the hostd and vmkernel logs, but was not able to find anything around the timeline at which the issue was recorded.

 

FDM Logs:

 

  • From the FDM Logs we can see no connectivity issues between the cluster nodes.

 

2019-04-16T23:06:52.500Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-16T23:16:53.449Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-16T23:46:56.339Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a]  hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-16T23:48:13.150Z info fdm[5880DB70] [Originator@6876 sub=Invt opID=SWI-65630605] [DsStateChange::SaveToInventory] Unset _accessible for datastore (/vmfs/volumes/56d3c849-e6ba7d5c-483a-b05adab9bfb0).
2019-04-16T23:56:57.279Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0

 

2019-04-17T18:07:45.768Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a]  hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=73 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-17T18:17:45.869Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=73 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-17T18:18:13.699Z info fdm[588D0B70] [Originator@6876 sub=Invt opID=SWI-2b12d47b] [DsStateChange::SaveToInventory] Unset _accessible for datastore (/vmfs/volumes/56d3c849-e6ba7d5c-483a-b05adab9bfb0).
2019-04-17T18:48:13.714Z info fdm[588D0B70] [Originator@6876 sub=Invt opID=SWI-589a0135] [DsStateChange::SaveToInventory] Unset _accessible for datastore (/vmfs/volumes/56d3c849-e6ba7d5c-483a-b05adab9bfb0).
2019-04-17T18:57:46.201Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=73 vm-metadata-version=16896 slv-mst-tdiff-sec=0

 

VOBD Logs:

 

  • In the VOBD logs we can see that device naa.60060e8007c5e9000030c5e900abce28, a LUN presented to the host, has space utilization issues; this is the same RDM LUN that is presented to the VM on which we noticed the issue.

 

2019-04-17T00:50:45.388Z: [scsiCorrelator] 34211865239745us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T00:50:45.389Z: [scsiCorrelator] 34212071285263us: [esx.problem.scsi.device.thinprov.atquota] Space utilization on thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded configured threshold. Affected datastores (if any): Unknown.
2019-04-17T04:41:47.117Z: [scsiCorrelator] 34225726877050us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T04:41:47.118Z: [scsiCorrelator] 34225933014227us: [esx.problem.scsi.device.thinprov.atquota] Space utilization on thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded configured threshold. Affected datastores (if any): Unknown.
2019-04-17T07:40:57.925Z: [scsiCorrelator] 34236477615305us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T07:40:57.926Z: [scsiCorrelator] 34236683822432us: [esx.problem.scsi.device.thinprov.atquota] Space utilization on thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded configured threshold. Affected datastores (if any): Unknown.
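The at-quota warnings above repeat every few hours; a quick way to confirm how many devices are affected is to tally events per naa ID. A minimal sketch over sample vobd.log lines (the /tmp path and sample data are illustrative):

```shell
# Illustrative sample: three at-quota events for one device.
cat > /tmp/vobd.log <<'EOF'
2019-04-17T00:50:45.388Z: [scsiCorrelator] [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T04:41:47.117Z: [scsiCorrelator] [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T07:40:57.925Z: [scsiCorrelator] [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
EOF

# Tally at-quota events per device (count, then naa ID).
grep 'thinprov.atquota' /tmp/vobd.log | grep -o 'naa\.[0-9a-f]*' | sort | uniq -c
```

Against the live vobd.log, only the RDM LUN appears in the tally, which narrows the space pressure to that one device.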

 

Hostname: ABCT-PROD-prp12

Esxi Version: VMware ESXi 6.0.0 build-3620759 (ESXi 6.0 Update 2)

 

Tried to review the hostd and vmkernel logs, but was not able to find anything around the timeline at which the issue was recorded.

 

 

________________________________________________Windows Side Analysis_____________________________________________

 

System Info:ABCK-TCVN-PAB01

 

OS Name        Microsoft Windows Server 2012 R2 Standard

Version        6.3.9600 Build 9600

Other OS Description         Not Available

OS Manufacturer        Microsoft Corporation

System Name        ABCK-TCVN-PAB01

System Manufacturer        VMware, Inc.

System Model        VMware Virtual Platform

System Type        x64-based PC

System SKU        

Processor        Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 2297 Mhz, 2 Core(s), 2 Logical Processor(s)

BIOS Version/Date        Phoenix Technologies LTD 6.00, 21/09/2015

Time Zone        AUS Eastern Standard Time

 

Cluster Logs:

 

  • From the cluster logs we can see a generic authorization error; however, this is not associated with the issue we are facing:

 

000008cc.000012ec::2019/04/16-23:45:46.982 WARN  [CHANNEL 10.210.53.29:~53934~] failure, status (10054)
000008cc.000012ec::2019/04/16-23:45:46.982 ERR   mscs_security::BaseSecurityContext::DoAuthenticate_static: Closed(1236)' because of 'channel to remote endpoint 10.210.53.29:~53934~ has failed with status (10054)'
000008cc.00001adc::2019/04/16-23:45:46.982 WARN  mscs::ListenerWorker::operator (): HrError(0x000004d4)' because of '[SV] Authentication or Authorization Failed'

  • Here we can see that the issue started around 5:50 PM GMT, when the resource states began changing from Online to Offline.

 

000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(IP Address 10.207.51.129) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(tcVisionP) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(tcVISION_PROD_ORA Manager) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(tcVISION_PROD Manager) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(File Server (\\tcVisionP)) Online-->WaitingToGoOffline.

000008cc.00001e0c::2019/04/17-17:50:33.593 INFO  [RCM] TransitionToState(File Server (\\tcVisionP)) OfflinePending-->OfflineSavingCheckpoints.
000008cc.00001e0c::2019/04/17-17:50:33.593 INFO  [RCM] TransitionToState(File Server (\\tcVisionP)) OfflineSavingCheckpoints-->Offline.

  • The service went completely offline; after this, the resource group started to move from ABCK-TCVN-PAB01(1) to XYZK-TCVN-PAB01(2):

 

00000e68.00002018::2019/04/17-17:50:45.600 INFO  [RES] Generic Service <tcVISION_PROD Manager>: Service is now offline.

00000e68.0000131c::2019/04/17-17:50:45.643 WARN  [RES] IPv4 Address Resources: Notify port is about to be closed. Exiting…
000008cc.00002310::2019/04/17-17:50:45.645 INFO  [RCM] TransitionToState(IP Address 10.207.51.129) OfflineSavingCheckpoints–>Offline.
000008cc.000028f0::2019/04/17-17:50:45.660 WARN  [NETFTAPI] Failed to query parameters for 10.207.51.129 (status 0x80070490)
000008cc.00002310::2019/04/17-17:50:45.880 INFO  [RCM] move of group tcVisionP from ABCK-TCVN-PAB01(1) to XYZK-TCVN-PAB01(2) of type MoveType::Manual is about to succeed, failoverCount=0, lastFailoverTime=1601/01/01-00:00:00.000 targeted=true
000008cc.00002414::2019/04/23-02:07:49.858 ERR   [API] AccessCheck[AndAuditAlarm] failed.  status = 0x00000005
000008cc.00002358::2019/04/23-02:14:52.649 ERR   [API] AccessCheck[AndAuditAlarm] failed.  status = 0x00000005
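The failover timeline can be reconstructed from a cluster.log copy by filtering on the RCM state transitions and the group move. A minimal sketch over sample lines (the /tmp path and sample data are illustrative):

```shell
# Illustrative sample lines from a cluster.log copy.
cat > /tmp/cluster.log <<'EOF'
000008cc.000028f0::2019/04/17-17:50:33.581 INFO  [RCM] TransitionToState(File Server (\\tcVisionP)) Online-->WaitingToGoOffline.
000008cc.00001e0c::2019/04/17-17:50:33.593 INFO  [RCM] TransitionToState(File Server (\\tcVisionP)) OfflineSavingCheckpoints-->Offline.
000008cc.00002310::2019/04/17-17:50:45.880 INFO  [RCM] move of group tcVisionP from ABCK-TCVN-PAB01(1) to XYZK-TCVN-PAB01(2) of type MoveType::Manual is about to succeed
EOF

# Pull out only state transitions and group moves, in time order.
grep -E 'TransitionToState|move of group' /tmp/cluster.log
```

Run against the full log, this shows the resources going offline at 17:50:33 GMT and the group move starting twelve seconds later.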

  • From the system logs we can see errors pointing to a capacity utilization threshold being crossed:

 

TimeGenerated: 17-04-2019 10.50
EntryType: Warning
MachineName: ABCK-TCVN-PAB01.rba.gov.au
EventID: 144
Source: disk
Message: Disk 1 has crossed a capacity utilization threshold.

TimeGenerated: 17-04-2019 10.53
EntryType: Warning
MachineName: ABCK-TCVN-PAB01.rba.gov.au
EventID: 140
Source: Microsoft-Windows-Ntfs
Message: The description for Event ID '140' in Source 'Microsoft-Windows-Ntfs' cannot be found. The following information is part of the event: 'T:', '\Device\HarddiskVolume3', '0x80000011'

 

 

Conclusion:

 

  • From the events and logs captured, we were not able to find much in the host logs other than the thin-provisioned storage exceeding its utilization threshold; the Windows logs point to the same issue.
  • Since we have only seen this issue twice and there is not much information in the logs from the host end, we cannot yet conclude that it was caused by the storage issue.

 

Action Plan:

 

  • From the logs I can see that the datastore ABCD-0052-0-149 has a size of 0.488 TB, of which 0.406 TB shows as free; but since the device is thin provisioned, we would have to check whether enough storage is actually provisioned on the hardware end.
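For scale, the datastore-level figures above work out to roughly 17% used, which is consistent with the point that any shortfall would be on the array (hardware) side rather than in VMFS free space. A quick check of the arithmetic:

```shell
# Datastore figures from the logs: 0.488 TB provisioned, 0.406 TB free.
awk 'BEGIN { total=0.488; free=0.406; used=total-free;
             printf "used=%.3fTB pct=%.0f%%\n", used, used/total*100 }'
# prints used=0.082TB pct=17%
```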

 

 

  • I found this information from the command I ran to capture the RDM info for the VMDKs associated with the VMs.

 

  • As per the Command: # vmkfstools -q /vmfs/volumes/59117ab3-c3557ada-3dd6-9cdc71782ff0/XYZK-TCVN-PAB01/XYZK-TCVN-PAB01.vmdk

 

Disk /vmfs/volumes/59117ab3-c3557ada-3dd6-9cdc71782ff0/XYZK-TCVN-PAB01/XYZK-TCVN-PAB01.vmdk is a Passthrough Raw Device Mapping

Maps to: vml.0200b6000060060e8007c5e9000030c5e900abce284f50454abcde 

 

  • The vml ID mentioned above is associated with the naa ID which I shared with you over email.

 

VMDK: XYZK-TCVN-PAB01.vmdk
physical/virtual RDM: Passthrough Raw Device Mapping
vml.ID: vml.0200b6000060060e8007c5e9000030c5e900abce284f50454abcde
naa.ID: naa.60060e8007c5e9000030c5e900abce28
Display Name: HITACHI OPEN-V

 

  • I am not aware of the vROps terminology, but I will check with one of my colleagues and get back to you.

 

Ashutosh Dixit

I am currently working as a Senior Technical Support Engineer with VMware Premier Services for Telco. Before this, I worked as a Technical Lead with Microsoft Enterprise Platform Support for Production and Premier Support. I am an expert in High-Availability, Deployments, and VMware Core technology along with Tanzu and Horizon.
