RAID consistency check caused storage failure?

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Tue Jan 10, 2017 9:22 am

That's my theory anyway; when I cancelled it, everything came back.

I have two Nodes hosting several Roles. I have seen (more than once, until I learned how to deal with driver updates better) the Roles go to FAILED because they've lost their storage. I resolve the problem by connecting to StarWind, turning the Role off, then starting it again, and everything is fine. Luckily I'm not in a position to lose data when it happens.

This morning I did a RAID consistency check on the drive holding storage for several Roles and they went to FAILED because they had lost their storage. Since it was the only thing I'd done I cancelled the check and everything came back. I'll try it again at the weekend when I can stop Starwind and fail the Roles over like I do when I update drivers and firmware.

This is my first live Cluster, my first Starwind. I'm learning as I go so I'm quite ready to accept that I'm missing the point somewhere. If a Role owned by Node 2 has storage on CSV2 and there's a problem with one copy of CSV2, why is it not simply switching to the other copy and letting me know? What am I doing wrong?

The issue seems to have been with StarWind. I've spent most of today going through my iSCSI/MPIO configuration, and it all appears to be exactly as described in the documentation, so it should work as intended. The iSCSI settings for one Node, for example:

Disk0 CLS01-CSV2
Starwind HAImage2 replicates over 172.16.110.11/172.16.111.11 & 172.16.110.22/172.16.111.22
Name From To
CSV2 local Default 127.0.0.1
CSV2 partner cable 0 172.16.210.11 172.16.210.22
CSV2 partner cable 1 172.16.211.11 172.16.211.22
iqn.2008-08.com.starwindsoftware:cls01-csv2
Disk 2 Port 1: Bus 0: Target 1: LUN 0
MPIO (Fail Over Only)
• 0x770100001 Active ffffe00031167010-4000013700000002*
• 0x770100005 Standby ffffe00031167010-4000013700000006
• 0x770100006 Standby ffffe00031167010-4000013700000007
iqn.2008-08.com.starwindsoftware:cls02-csv2
Disk 2 Port 1: Bus 0: Target 5: LUN 0
MPIO (Fail Over Only)
• 0x770100001 Active ffffe00031167010-4000013700000002
• 0x770100005 Standby ffffe00031167010-4000013700000006*
• 0x770100006 Standby ffffe00031167010-4000013700000007
Disk 2 Port 1: Bus 0: Target 6: LUN 0
MPIO (Fail Over Only)
• 0x770100001 Active ffffe00031167010-4000013700000002
• 0x770100005 Standby ffffe00031167010-4000013700000006
• 0x770100006 Standby ffffe00031167010-4000013700000007*
Disk1 CLS01-CSV1
Starwind HAImage1 replicates over 172.16.110.11/172.16.111.11 & 172.16.110.22/172.16.111.22
Name From To
CSV1 local Default 127.0.0.1
CSV1 partner cable 0 172.16.210.11 172.16.210.22
CSV1 partner cable 1 172.16.211.11 172.16.211.22
iqn.2008-08.com.starwindsoftware:cls01-csv1
Disk 3 Port 1: Bus 0: Target 0: LUN 0
MPIO (Fail Over Only)
• 0x770100000 Active ffffe00031167010-4000013700000001*
• 0x770100003 Standby ffffe00031167010-4000013700000004
• 0x770100004 Standby ffffe00031167010-4000013700000005
iqn.2008-08.com.starwindsoftware:cls02-csv1
Disk 3 Port 1: Bus 0: Target 3: LUN 0
MPIO (Fail Over Only)
• 0x770100000 Active ffffe00031167010-4000013700000001
• 0x770100003 Standby ffffe00031167010-4000013700000004*
• 0x770100004 Standby ffffe00031167010-4000013700000005
Disk 3 Port 1: Bus 0: Target 4: LUN 0
MPIO (Fail Over Only)
• 0x770100000 Active ffffe00031167010-4000013700000001
• 0x770100003 Standby ffffe00031167010-4000013700000004
• 0x770100004 Standby ffffe00031167010-4000013700000005*
Disk2
Starwind HAImage3 replicates over 172.16.110.11/172.16.111.11 & 172.16.110.22/172.16.111.22
Name From To
Witness local Default 127.0.0.1
iqn.2008-08.com.starwindsoftware:cls01-witness1
Disk 5 Port 1: Bus 0: Target 2: LUN 0
MPIO (Round Robin)
• 0x770100002 Active ffffe00031167010-4000013700000003*
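For anyone trying to reproduce this listing, the same path and session information can be pulled from an elevated PowerShell prompt on each node with the built-in Windows tooling (a sketch; the disk numbers are MPIO disk numbers from my layout above, not Windows disk numbers):

```powershell
# Show every MPIO disk, its load-balance policy and the state of each path
mpclaim -s -d

# Drill into one MPIO disk, e.g. MPIO disk 2
mpclaim -s -d 2

# List iSCSI sessions and connections to confirm which targets are reachable
Get-IscsiSession | Select-Object TargetNodeAddress, IsConnected
Get-IscsiConnection | Select-Object InitiatorAddress, TargetAddress
```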
Michael (staff)
Staff
Posts: 319
Joined: Thu Jul 21, 2016 10:16 am

Fri Jan 13, 2017 5:05 pm

Hello D9BDCEA2CE,
Your theory is correct: when one node fails, everything should switch to the other node, so in Failover Cluster Manager the CSV should change owner without any interruption.
Firstly, please check from the StarWind Management Console that all StarWind devices are synchronized on both nodes.
Secondly, please check the connections in the Microsoft iSCSI Initiator on both nodes. You can follow the document here: https://www.starwindsoftware.com/starwi ... -v-cluster.
I assume that during the RAID consistency check, StarWind marked the devices as Not Synchronized because delays on the underlying storage exceeded 10 seconds, but it is not clear why the storage failed in the Cluster. Possibly because of a misconfiguration of the iSCSI connections.
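Both checks can also be done quickly from an elevated PowerShell prompt on each node (a sketch; the cmdlets below are the standard Windows Server iSCSI ones, not StarWind-specific):

```powershell
# Confirm each StarWind target is still logged in
Get-IscsiTarget | Select-Object NodeAddress, IsConnected

# Confirm the discovery portals match the documented configuration
Get-IscsiTargetPortal | Select-Object TargetPortalAddress, TargetPortalPortNumber
```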
Let us know about it!
D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Mon Jan 16, 2017 11:13 am

I never got round to it over the weekend; however, I've come in this morning and there seems to be a problem. The StarWind Virtual SAN service on one node was stopped, and when I started it the system began synchronisation. However, that doesn't seem to be working. I'm getting these and similar messages:

High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv1, all synchronization Connections with Partner Node iqn.2008-08.com.starwindsoftware:cls02-csv1 lost
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv1, command "" execution time is 15906 ms (storage performance degradation)
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv1, current Node State has changed to "Not synchronized"
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv2, current Node State has changed to "Synchronizing"
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv1, current Node Synchronization failed, Synchronizer is Partner Node iqn.2008-08.com.starwindsoftware:cls02-csv1

The RAID says it's fine. The network connections look like they're working.
D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Mon Jan 16, 2017 12:25 pm

Michael (staff) wrote: Hello D9BDCEA2CE,
Your theory is correct: when one node fails, everything should switch to the other node, so in Failover Cluster Manager the CSV should change owner without any interruption.
Firstly, please check from the StarWind Management Console that all StarWind devices are synchronized on both nodes.
Secondly, please check the connections in the Microsoft iSCSI Initiator on both nodes. You can follow the document here: https://www.starwindsoftware.com/starwi ... -v-cluster.
I assume that during the RAID consistency check, StarWind marked the devices as Not Synchronized because delays on the underlying storage exceeded 10 seconds, but it is not clear why the storage failed in the Cluster. Possibly because of a misconfiguration of the iSCSI connections.
Let us know about it!
I followed something very similar to that (StarWind Virtual SAN™ Hyper-Converged 2 Nodes Scenario: 2 Nodes with Hyper-V Cluster) when I set it up. I've been through all the iSCSI and MPIO settings in some detail, trying to get a grip on it. So far I have the following, which is quite a lot of info but will hopefully help. The main thing I can see is that everything appears to be the same on both nodes, with the exception of the Witness disc:

Code: Select all

Cluster Node 1 [192.168.1.43]
iSCSI Initiator iqn.1991-05.com.microsoft:cls01.consulthyperion.com (no authentication)
MPIO (adapter is MS iSCSI Initiator) and iSCSI Targets (port 3260)

Discover Portals - iSCSI initiator, Discovery, Discover Portal (adapter is MS iSCSI Initiator)
127.0.0.1:3260		Default (local Starwind copy)
172.16.210.11:3260	172.16.210.22 (remote Starwind copy, cable 0)
172.16.211.11:3260	172.16.211.22 (remote Starwind copy, cable 1)

Disk0 CSV2
Starwind HAImage2 replicates from 172.16.110.11/172.16.111.11 to 172.16.110.22/172.16.111.22

CSV2 local		Default		127.0.0.1
CSV2 partner cable 0	172.16.210.11	172.16.210.22
CSV2 partner cable 1	172.16.211.11	172.16.211.22

	iqn.2008-08.com.starwindsoftware:cls01-csv2
	Disk 2 Port 1: Bus 0: Target 1: LUN 0
	MPIO (Fail Over Only)
	•	0x770100001	Active	ffffe00031167010-4000013700000002*
	•	0x770100005	Standby	ffffe00031167010-4000013700000006
	•	0x770100006	Standby	ffffe00031167010-4000013700000007

	iqn.2008-08.com.starwindsoftware:cls02-csv2
	Disk 2 Port 1: Bus 0: Target 5: LUN 0
	MPIO (Fail Over Only)
	•	0x770100001	Active	ffffe00031167010-4000013700000002
	•	0x770100005	Standby	ffffe00031167010-4000013700000006*
	•	0x770100006	Standby	ffffe00031167010-4000013700000007
	Disk 2 Port 1: Bus 0: Target 6: LUN 0
	MPIO (Fail Over Only)
	•	0x770100001	Active	ffffe00031167010-4000013700000002
	•	0x770100005	Standby	ffffe00031167010-4000013700000006
	•	0x770100006	Standby	ffffe00031167010-4000013700000007*

Disk1 CSV1
Starwind HAImage1 replicates from 172.16.110.11/172.16.111.11 to 172.16.110.22/172.16.111.22

CSV1 local		Default		127.0.0.1
CSV1 partner cable 0	172.16.210.11	172.16.210.22
CSV1 partner cable 1	172.16.211.11	172.16.211.22

	iqn.2008-08.com.starwindsoftware:cls01-csv1
	Disk 3 Port 1: Bus 0: Target 0: LUN 0
	MPIO (Fail Over Only)
	•	0x770100000	Active	ffffe00031167010-4000013700000001*
	•	0x770100003	Standby	ffffe00031167010-4000013700000004
	•	0x770100004	Standby	ffffe00031167010-4000013700000005

	iqn.2008-08.com.starwindsoftware:cls02-csv1
	Disk 3 Port 1: Bus 0: Target 3: LUN 0
	MPIO (Fail Over Only)
	•	0x770100000	Active	ffffe00031167010-4000013700000001
	•	0x770100003	Standby	ffffe00031167010-4000013700000004*
	•	0x770100004	Standby	ffffe00031167010-4000013700000005
	Disk 3 Port 1: Bus 0: Target 4: LUN 0
	MPIO (Fail Over Only)
	•	0x770100000	Active	ffffe00031167010-4000013700000001
	•	0x770100003	Standby	ffffe00031167010-4000013700000004
	•	0x770100004	Standby	ffffe00031167010-4000013700000005*

Disk2 Witness
Starwind HAImage3 replicates from 172.16.110.11/172.16.111.11 to 172.16.110.22/172.16.111.22

Witness local		Default		127.0.0.1

	iqn.2008-08.com.starwindsoftware:cls01-witness1
	Disk 5 Port 1: Bus 0: Target 2: LUN 0
	MPIO (Round Robin)
	•	0x770100002	Active	ffffe00031167010-4000013700000003*

 

Cluster Node 2 [192.168.1.44]
iSCSI Initiator iqn.1991-05.com.microsoft:cls01.consulthyperion.com (no authentication)
MPIO (adapter is MS iSCSI Initiator) and iSCSI Targets (port 3260)

Discover Portals - iSCSI initiator, Discovery, Discover Portal (adapter is MS iSCSI Initiator)
127.0.0.1:3260	Default (local Starwind copy)
172.16.210.22:3260	172.16.210.11 (remote Starwind copy, cable 0)
172.16.211.22:3260	172.16.211.11 (remote Starwind copy, cable 1)

Disk0 CSV1
Starwind HAImage1 replicates from 172.16.110.22/172.16.111.22 to 172.16.110.11/172.16.111.11

CSV2 local		Default		127.0.0.1
CSV2 partner cable 0	172.16.210.22	172.16.210.11
CSV2 partner cable 1	172.16.211.22	172.16.211.11

	iqn.2008-08.com.starwindsoftware:cls02-csv1
	Disk 2 Port 1: Bus 0: Target 4: LUN 0
	MPIO (Fail Over Only)
	•	0x770100004	Active	ffffe0007e719010-4000013700000005*
	•	0x770100000	Standby	ffffe0007e719010-4000013700000001
	•	0x770100001	Standby	ffffe0007e719010-4000013700000002
	iqn.2008-08.com.starwindsoftware:cls01-csv1
	Disk 2 Port 1: Bus 0: Target 0: LUN 0
	MPIO (Fail Over Only)
	•	0x770100004	Active	ffffe0007e719010-4000013700000005
	•	0x770100000	Standby	ffffe0007e719010-4000013700000001*
	•	0x770100001	Standby	ffffe0007e719010-4000013700000002
	Disk 2 Port 1: Bus 0: Target 1: LUN 0
	MPIO (Fail Over Only)
	•	0x770100004	Active	ffffe0007e719010-4000013700000005
	•	0x770100000	Standby	ffffe0007e719010-4000013700000001
	•	0x770100001	Standby	ffffe0007e719010-4000013700000002*

Disk1 CSV2
Starwind HAImage2 replicates from 172.16.110.22/172.16.111.22 & 172.16.110.11/172.16.111.11

CSV1 local		Default		127.0.0.1
CSV1 partner cable 0	172.16.210.22	172.16.210.11
CSV1 partner cable 1	172.16.211.22	172.16.211.11

	iqn.2008-08.com.starwindsoftware:cls02-csv2
	Disk 3 Port 1: Bus 0: Target 5: LUN 0
	MPIO (Fail Over Only)
	•	0x770100005	Active	ffffe0007e719010-4000013700000006*
	•	0x770100002	Standby	ffffe0007e719010-4000013700000003
	•	0x770100003	Standby	ffffe0007e719010-4000013700000004
	iqn.2008-08.com.starwindsoftware:cls01-csv2
	Disk 3 Port 1: Bus 0: Target 2: LUN 0
	MPIO (Fail Over Only)
	•	0x770100005	Active	ffffe0007e719010-4000013700000006
	•	0x770100002	Standby	ffffe0007e719010-4000013700000003*
	•	0x770100003	Standby	ffffe0007e719010-4000013700000004
	Disk 3 Port 1: Bus 0: Target 3: LUN 0
	MPIO (Fail Over Only)
	•	0x770100005	Active	ffffe0007e719010-4000013700000006
	•	0x770100002	Standby	ffffe0007e719010-4000013700000003
	•	0x770100003	Standby	ffffe0007e719010-4000013700000004*

Disk2 Witness
Starwind HAImage3 replicates from 172.16.110.22/172.16.111.22 & 172.16.110.11/172.16.111.11

Witness local		Default		127.0.0.1

	iqn.2008-08.com.starwindsoftware:cls01-witness1
	Disk 5 Port 1: Bus 0: Target 6: LUN 0
	MPIO (Round Robin)
	•	0x770100006	Active	ffffe0007e719010-4000013700000007*
Last edited by D9BDCEA2CE on Mon Jan 16, 2017 1:48 pm, edited 2 times in total.
D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Mon Jan 16, 2017 12:30 pm

I've been reading https://www.starwindsoftware.com/techni ... luster.pdf this morning. I've found one thing which doesn't match the document I used to set mine up.

My nodes have, amongst other things, a pair of crossover 10Gbps cables between them. At the moment I have StarWind synchronisation and iSCSI running over both cables. This document seems to be saying I should have synchronisation on one and iSCSI on the other; is that right?

I'm beginning to suspect this is wrong:
[Attachment: Synchronisation.png]
What if:
I remove the w.x.110.z interface from each of those, then restore it as Synchronisation and Heartbeat.
Then I remove the w.x.111.z and the 192.168.1.43 interfaces altogether.
Then I go to iSCSI Initiator Properties, Discovery and remove the w.x.210.z interface.

That leaves me with Heartbeat and Synch on one cable and iSCSI on the other. Assuming that's the goal, would that work?
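If anyone wants the non-GUI route, the Discovery step in my plan can be scripted too; a sketch with the addresses from my listing above (the StarWind console changes would still have to be made in the console):

```powershell
# Remove the w.x.210.z discovery portal on Node 1
# (portal on the partner node, bound to the local initiator address)
Remove-IscsiTargetPortal -TargetPortalAddress 172.16.210.22 -InitiatorPortalAddress 172.16.210.11

# Refresh the target list against the remaining portal
Update-IscsiTarget
```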
Last edited by D9BDCEA2CE on Mon Jan 16, 2017 1:36 pm, edited 2 times in total.
D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Mon Jan 16, 2017 12:34 pm

I'm going to tell myself that you all deal with this stuff, so you know how I feel. Forgive my rambling, but I decided to try the old standby: restart it. I moved the Roles and stopped StarWind on the node which isn't synchronised and didn't appear to be getting anywhere. I've just seen that it's made it to 1%! Apparently I'll have to wait several days before I can do anything else, so I'll just have to find something else to do until then so I don't tear all my hair out, unless someone has a suggestion. :wink:

Edit: That didn't hold, though; it reset to 0% when I went back to check on it. :(
Michael (staff)
Staff
Posts: 319
Joined: Thu Jul 21, 2016 10:16 am

Mon Jan 16, 2017 6:11 pm

Updating the community:
We have investigated the logs and found that the devices were not synchronized because of an underlying storage issue.
As for the network reconfiguration in the StarWind Management Console, it can be done as suggested: just use one cable for Synchronization and the other one for iSCSI and Heartbeat.
D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Thu Feb 02, 2017 7:57 am

I've learned that you can't rely on the RAID management software to tell you if there's a problem, so now I have a filtered event log pulling items from the System log in the Task Category "Storage Service", which has much more useful information.
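The same filtered view can be pulled ad hoc from PowerShell (a sketch; it filters on the task category's display name, which is how it appears in Event Viewer):

```powershell
# Recent System-log entries in the "Storage Service" task category
Get-WinEvent -LogName System -MaxEvents 500 |
    Where-Object { $_.TaskDisplayName -eq 'Storage Service' } |
    Select-Object TimeCreated, Id, LevelDisplayName, Message
```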

With Michael's help I also sorted out my StarWind/iSCSI config, which was out of date according to the current notes from StarWind he linked to above. Those of you who can read MPIO reports (another thing I have learned to do recently) will see what I mean: basically I had too many paths and the wrong MPIO policy.
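For anyone tidying up the same way, the per-disk MPIO policy can be checked and set with the built-in mpclaim tool (a sketch; 1 is Fail Over Only in mpclaim's policy numbering, and the disk number is the MPIO disk number shown by `mpclaim -s -d`):

```powershell
# Current policy and paths for MPIO disk 2
mpclaim -s -d 2

# Set MPIO disk 2 to Fail Over Only (policy 1)
mpclaim -l -d 2 1
```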
Michael (staff)
Staff
Posts: 319
Joined: Thu Jul 21, 2016 10:16 am

Mon Feb 06, 2017 9:35 am

Thank you for your feedback :)