A critical banking transaction application failed. The application is a Windows application relying on a Windows Failover Cluster to provide a shared network drive between two data centres.
Cluster Node 1: ABCK-TCVN-PAB01 (Active)
Cluster Node 2: XYZK-TCVN-PAB01
The T: drive on this VM is an RDM LUN (datastore name: ABCD-0052-0-149); this RDM is presented from the datastore in the HO datacentre.
During the initial investigation, "host storage path redundancy or storage connectivity is lost" events were registered on multiple ESXi hosts in the same cluster that ABCT-PROD-PRP10 is part of, but the events stopped around 11:30 PM on 17 April. It is not yet clear whether these events are related to this issue.
After ABCK-TCVN-PAB01 was migrated from host ABCT-PROD-PRP10 to ABCT-PROD-PRP12, the issue reappeared after about two hours of running.
Time Period (local, AEST): 09:00 AM 17-April-2019 to 04:00 AM 18-April-2019
GMT: 11:00 PM 16-April-2019 to 6:00 PM 17-April-2019
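The local-to-GMT conversion above can be sanity-checked with GNU date. This is a sketch that assumes the VM's "AUS Eastern Standard Time" setting corresponds to Australia/Sydney, which was on AEST (UTC+10) in mid-April 2019:

```shell
# Convert the local incident-window boundaries (AEST) to UTC with GNU date.
# Australia/Sydney is an assumption based on the VM's reported time zone.
date -u -d 'TZ="Australia/Sydney" 2019-04-17 09:00' '+%Y-%m-%d %H:%M UTC'
# -> 2019-04-16 23:00 UTC
date -u -d 'TZ="Australia/Sydney" 2019-04-18 04:00' '+%Y-%m-%d %H:%M UTC'
# -> 2019-04-17 18:00 UTC
```

Both values agree with the GMT window stated above.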
ABCK-TCVN-PAB01
- Tried to find events at 11:00 PM GMT on 16 April, but was not able to find any in the VMware logs.
- However, we can see the vMotion events where the VM is moved from ABCT-PROD-PRP12 to ABCT-PROD-PRP10 and later back from ABCT-PROD-PRP10 to ABCT-PROD-PRP12.
2019-04-17T00:50:39.554Z| vmx| I120: MigrateSetInfo: state=1 srcIp=<10.205.53.68> dstIp=<10.205.53.66> mid=1555462239311043 uuid=30373237-3132-4753-4835-343657564c41 priority=low
2019-04-17T00:50:45.197Z| vmx| W110: VMX has left the building: 0.
2019-04-17T14:50:13.775Z| vmx| I120: MigrateSetInfo: state=1 srcIp=<10.205.53.66> dstIp=<10.205.53.68> mid=1555512613539894 uuid=30373237-3132-4753-4835-343657564c38 priority=high
2019-04-17T14:50:21.038Z| vmx| W110: VMX has left the building: 0.
- The only errors we can see at the time of the issue are GuestRpc remote connection failures from inside the VM:
2019-04-17T17:36:22.260Z| vmx| I120: GuestRpc: Got error for connection 2803:Remote connection failure
2019-04-17T17:38:02.993Z| vmx| I120: GuestRpc: Got error for connection 2903:Remote connection failure
2019-04-17T18:01:47.073Z| vmx| I120: GuestRpc: Got error for connection 3003:Remote connection failure
2019-04-17T18:03:23.651Z| vmx| I120: GuestRpc: Got error for connection 3103:Remote connection failure
2019-04-17T18:05:04.020Z| vmx| I120: GuestRpc: Got error for connection 3203:Remote connection failure
2019-04-17T18:06:44.153Z| vmx| I120: GuestRpc: Got error for connection 3303:Remote connection failure
RPC Associated Event: https://support.hpe.com/hpsc/doc/public/display?docId=mmr_kc-0132884
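All of the GuestRpc failures can be pulled from the VM's vmware.log in one pass; a minimal sketch, run here against a sample file standing in for /vmfs/volumes/&lt;datastore&gt;/&lt;vm&gt;/vmware.log on the host:

```shell
# Extract GuestRpc remote-connection failures from a vmware.log.
# /tmp/vmware.log.sample is a stand-in built from lines quoted above.
cat > /tmp/vmware.log.sample <<'EOF'
2019-04-17T17:36:22.260Z| vmx| I120: GuestRpc: Got error for connection 2803:Remote connection failure
2019-04-17T14:50:21.038Z| vmx| W110: VMX has left the building: 0.
EOF
grep 'GuestRpc: Got error' /tmp/vmware.log.sample
```

Only the GuestRpc line is printed; other vmx entries are filtered out.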
____________________ Host Side Events ____________________
Hostname: ABCT-PROD-PRP10
Esxi Version: VMware ESXi 6.0.0 build-3620759 (ESXi 6.0 Update 2)
ABCD-0052-0-149 | naa.60060e8007c5e9000030c5e900000052 | 59117ab3-c3557ada-3dd6-9cdc71782ff0 | 13210601 | 96894594 | 2589590 | LUN:149 | fc | 4 | HITACHI Fibre Channel
- Reviewed the logs on the host for 2019-04-16 but was not able to find any events other than the below:
:var$ grep -ilr 2019-04-16 .
run/log/vmauthd.1
run/log/syslog.3
run/log/hostd-probe.2
run/log/fdm.log
run/log/vmksummary.0
log/tdlog/fdm.all
log/tdlog/syslog.all
- We can see a few logs for the 17th, but by then the machine had been moved to another host.
:var$ grep -ilr 2019-04-17 .
run/log/vmauthd.1
run/log/syslog.3
run/log/hostd-probe.2
run/log/vobd.log
run/log/fdm.log
run/log/vmksummary.0
run/log/vmauthd.0
log/tdlog/vobd.all
log/tdlog/fdm.all
log/tdlog/syslog.all
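Rather than grepping for a whole day, the search can be narrowed to the GMT issue window (roughly 17:00 to 18:59 on 17 April). A sketch of the pattern, shown against a small sample directory rather than the host's /var:

```shell
# List files containing entries inside the issue window (2019-04-17 17:00-18:59 GMT).
# /tmp/varlog is a stand-in for the log tree under /var on the ESXi host.
mkdir -p /tmp/varlog
cat > /tmp/varlog/vobd.sample <<'EOF'
2019-04-17T18:05:04.020Z: sample entry inside the window
2019-04-17T07:40:57.925Z: sample entry outside the window
EOF
grep -lr '2019-04-17T1[78]:' /tmp/varlog
```

Only files with at least one in-window entry are listed, which keeps the review focused on the failure period.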
- On this host we can see that the below passthrough RDM disk was attached:
VMDK | physical/virtual RDM | vml.ID | naa.ID | Display Name
XYZK-TCVN-PAB01.vmdk | Passthrough Raw Device Mapping | vml.0200b6000060060e8007c5e9000030c5e900abce284f50454abcde | naa.60060e8007c5e9000030c5e900abce28 | HITACHI OPEN-V
- Tried to review the hostd and vmkernel logs, but was not able to find anything around the timeline in which the issue was recorded.
FDM Logs:
- From the FDM logs we can see no connectivity issues between the cluster nodes:
2019-04-16T23:06:52.500Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-16T23:16:53.449Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-16T23:46:56.339Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-16T23:48:13.150Z info fdm[5880DB70] [Originator@6876 sub=Invt opID=SWI-65630605] [DsStateChange::SaveToInventory] Unset _accessible for datastore (/vmfs/volumes/56d3c849-e6ba7d5c-483a-b05adab9bfb0).
2019-04-16T23:56:57.279Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=72 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-17T18:07:45.768Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=73 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-17T18:17:45.869Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=73 vm-metadata-version=16896 slv-mst-tdiff-sec=0
2019-04-17T18:18:13.699Z info fdm[588D0B70] [Originator@6876 sub=Invt opID=SWI-2b12d47b] [DsStateChange::SaveToInventory] Unset _accessible for datastore (/vmfs/volumes/56d3c849-e6ba7d5c-483a-b05adab9bfb0).
2019-04-17T18:48:13.714Z info fdm[588D0B70] [Originator@6876 sub=Invt opID=SWI-589a0135] [DsStateChange::SaveToInventory] Unset _accessible for datastore (/vmfs/volumes/56d3c849-e6ba7d5c-483a-b05adab9bfb0).
2019-04-17T18:57:46.201Z info fdm[5878BB70] [Originator@6876 sub=Cluster opID=SWI-3ab50c2a] hostId=host-36062 state=Slave master=host-36066 isolated=false host-list-version=210 config-version=73 vm-metadata-version=16896 slv-mst-tdiff-sec=0
VOBD Logs:
- In the VOBD logs we can see that device naa.60060e8007c5e9000030c5e900abce28, a LUN presented to the host, has space-utilization warnings for a thin-provisioned device. This is the same RDM LUN presented to the VM on which we noticed the issue:
2019-04-17T00:50:45.388Z: [scsiCorrelator] 34211865239745us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T00:50:45.389Z: [scsiCorrelator] 34212071285263us: [esx.problem.scsi.device.thinprov.atquota] Space utilization on thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded configured threshold. Affected datastores (if any): Unknown.
2019-04-17T04:41:47.117Z: [scsiCorrelator] 34225726877050us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T04:41:47.118Z: [scsiCorrelator] 34225933014227us: [esx.problem.scsi.device.thinprov.atquota] Space utilization on thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded configured threshold. Affected datastores (if any): Unknown.
2019-04-17T07:40:57.925Z: [scsiCorrelator] 34236477615305us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T07:40:57.926Z: [scsiCorrelator] 34236683822432us: [esx.problem.scsi.device.thinprov.atquota] Space utilization on thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded configured threshold. Affected datastores (if any): Unknown.
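The frequency of these atquota warnings can be summarised per device straight from vobd.log; a minimal sketch over the vob entries quoted above, using a sample file in place of /var/run/log/vobd.log:

```shell
# Count thin-provision atquota warnings per device identifier in a vobd.log.
cat > /tmp/vobd.log.sample <<'EOF'
2019-04-17T00:50:45.388Z: [scsiCorrelator] 34211865239745us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T04:41:47.117Z: [scsiCorrelator] 34225726877050us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
2019-04-17T07:40:57.925Z: [scsiCorrelator] 34236477615305us: [vob.scsi.device.thinprov.atquota] Space utilization on the thin-provisioned device naa.60060e8007c5e9000030c5e900abce28 exceeded the configured threshold
EOF
grep -o 'naa\.[0-9a-f]*' /tmp/vobd.log.sample | sort | uniq -c
```

This confirms whether the warnings all point at the one RDM LUN or are spread across multiple devices.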
Hostname: ABCT-PROD-prp12
Esxi Version: VMware ESXi 6.0.0 build-3620759 (ESXi 6.0 Update 2)
- Tried to review the hostd and vmkernel logs, but was not able to find anything around the timeline in which the issue was recorded.
________________________________________________Windows Side Analysis_____________________________________________
System Info: ABCK-TCVN-PAB01
OS Name Microsoft Windows Server 2012 R2 Standard
Version 6.3.9600 Build 9600
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name ABCK-TCVN-PAB01
System Manufacturer VMware, Inc.
System Model VMware Virtual Platform
System Type x64-based PC
System SKU
Processor Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 2297 Mhz, 2 Core(s), 2 Logical Processor(s)
BIOS Version/Date Phoenix Technologies LTD 6.00, 21/09/2015
Time Zone AUS Eastern Standard Time
Cluster Logs:
- From the cluster logs we can see a generic authorization error; however, this is not associated with the issue we are facing:
000008cc.000012ec::2019/04/16-23:45:46.982 WARN [CHANNEL 10.210.53.29:~53934~] failure, status (10054)
000008cc.000012ec::2019/04/16-23:45:46.982 ERR mscs_security::BaseSecurityContext::DoAuthenticate_static: Closed(1236)' because of 'channel to remote endpoint 10.210.53.29:~53934~ has failed with status (10054)'
000008cc.00001adc::2019/04/16-23:45:46.982 WARN mscs::ListenerWorker::operator (): HrError(0x000004d4)' because of '[SV] Authentication or Authorization Failed'
- Here we can see that the issue started around 5:50 PM GMT, when the resource states started changing from Online to Offline:
000008cc.000028f0::2019/04/17-17:50:33.581 INFO [RCM] TransitionToState(IP Address 10.207.51.129) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO [RCM] TransitionToState(tcVisionP) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO [RCM] TransitionToState(tcVISION_PROD_ORA Manager) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO [RCM] TransitionToState(tcVISION_PROD Manager) Online-->WaitingToGoOffline.
000008cc.000028f0::2019/04/17-17:50:33.581 INFO [RCM] TransitionToState(File Server (\\tcVisionP)) Online-->WaitingToGoOffline.
000008cc.00001e0c::2019/04/17-17:50:33.593 INFO [RCM] TransitionToState(File Server (\\tcVisionP)) OfflinePending-->OfflineSavingCheckpoints.
000008cc.00001e0c::2019/04/17-17:50:33.593 INFO [RCM] TransitionToState(File Server (\\tcVisionP)) OfflineSavingCheckpoints-->Offline.
- The service went completely offline; after this, the resource group started to move from ABCK-TCVN-PAB01(1) to XYZK-TCVN-PAB01(2):
00000e68.00002018::2019/04/17-17:50:45.600 INFO [RES] Generic Service <tcVISION_PROD Manager>: Service is now offline.
00000e68.0000131c::2019/04/17-17:50:45.643 WARN [RES] IPv4 Address Resources: Notify port is about to be closed. Exiting…
000008cc.00002310::2019/04/17-17:50:45.645 INFO [RCM] TransitionToState(IP Address 10.207.51.129) OfflineSavingCheckpoints-->Offline.
000008cc.000028f0::2019/04/17-17:50:45.660 WARN [NETFTAPI] Failed to query parameters for 10.207.51.129 (status 0x80070490)
000008cc.00002310::2019/04/17-17:50:45.880 INFO [RCM] move of group tcVisionP from ABCK-TCVN-PAB01(1) to XYZK-TCVN-PAB01(2) of type MoveType::Manual is about to succeed, failoverCount=0, lastFailoverTime=1601/01/01-00:00:00.000 targeted=true
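The resource-state timeline can be reconstructed mechanically by filtering the cluster log for the RCM TransitionToState lines; a minimal sketch over sample lines taken from above (with `/tmp/cluster.log.sample` standing in for the real cluster.log):

```shell
# Summarise RCM state transitions (resource name plus from-->to state) from a cluster log.
cat > /tmp/cluster.log.sample <<'EOF'
000008cc.000028f0::2019/04/17-17:50:33.581 INFO [RCM] TransitionToState(tcVisionP) Online-->WaitingToGoOffline.
000008cc.00001e0c::2019/04/17-17:50:33.593 INFO [RCM] TransitionToState(File Server (\\tcVisionP)) OfflineSavingCheckpoints-->Offline.
EOF
sed -n 's/.*TransitionToState(\(.*\)) \([A-Za-z]*-->[A-Za-z]*\)\./\1 \2/p' /tmp/cluster.log.sample
```

This condenses the log into one line per transition, which makes the 17:50:33 offline cascade easy to see at a glance.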
000008cc.00002414::2019/04/23-02:07:49.858 ERR [API] AccessCheck[AndAuditAlarm] failed. status = 0x00000005
000008cc.00002358::2019/04/23-02:14:52.649 ERR [API] AccessCheck[AndAuditAlarm] failed. status = 0x00000005
- From the system logs we can see an error pointing to a capacity-utilization threshold being crossed:
TimeGenerated | EntryType | MachineName | EventID | Source | Message
17-04-2019 10.50 | Warning | ABCK-TCVN-PAB01.rba.gov.au | 144 | disk | Disk 1 has crossed a capacity utilization threshold.
17-04-2019 10.53 | Warning | ABCK-TCVN-PAB01.rba.gov.au | 140 | Microsoft-Windows-Ntfs | The description for Event ID '140' in Source 'Microsoft-Windows-Ntfs' cannot be found. The following information is part of the event: 'T:', '\Device\HarddiskVolume3', '0x80000011'
Conclusion:
- From the events and logs captured, we were not able to find much in the host logs other than the issue associated with the thin-provisioned storage exceeding its threshold; the Windows logs point to the same issue.
- Since we have seen this issue twice and there is not much information in the host-side logs, we cannot yet conclude that it was caused by the storage issue.
Action Plan:
- From the logs I can see that datastore ABCD-0052-0-149 has a size of 0.488 TB, with 0.406 TB showing as free. However, since it is thin provisioned, we would have to check whether enough storage is actually provisioned on the hardware end.
- I found this information from the command that was run to capture the RDM info for the VMDKs associated with the VMs.
- Command: # vmkfstools -q /vmfs/volumes/59117ab3-c3557ada-3dd6-9cdc71782ff0/XYZK-TCVN-PAB01/XYZK-TCVN-PAB01.vmdk
Disk /vmfs/volumes/59117ab3-c3557ada-3dd6-9cdc71782ff0/XYZK-TCVN-PAB01/XYZK-TCVN-PAB01.vmdk is a Passthrough Raw Device Mapping
Maps to: vml.0200b6000060060e8007c5e9000030c5e900abce284f50454abcde
- The VML ID mentioned above is associated with the NAA ID that I shared with you by email.
VMDK | physical/virtual RDM | vml.ID | naa.ID | Display Name
XYZK-TCVN-PAB01.vmdk | Passthrough Raw Device Mapping | vml.0200b6000060060e8007c5e9000030c5e900abce284f50454abcde | naa.60060e8007c5e9000030c5e900abce28 | HITACHI OPEN-V
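As a quick sanity check on the mapping, the NAA device identifier appears as a substring of the VML identifier in the values above, so the association can be verified mechanically:

```shell
# Check that the RDM's vml.ID embeds the expected naa.ID
# (the identifiers are the ones recorded in this case).
VML=vml.0200b6000060060e8007c5e9000030c5e900abce284f50454abcde
NAA=naa.60060e8007c5e9000030c5e900abce28
case "$VML" in
  *"${NAA#naa.}"*) echo "vml.ID matches naa.ID" ;;
  *)               echo "MISMATCH" ;;
esac
# -> vml.ID matches naa.ID
```

This gives an independent confirmation that the vmkfstools output and the device table refer to the same LUN.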
- I am not familiar with the vROps terminology, but I will check with one of my colleagues and get back to you.