
Starwind Heartbeat lynchpin results in all paths down

Posted: Fri Nov 18, 2016 6:00 am
by jimbyau
Hi Everyone,

I spent a bit of time with StarWind support last night, and they confirmed for me that the following behaviour is by design:

2 physical StarWind nodes

In each server the configuration is:
1 x 10Gb dual-port NIC. The first port handles the replication channel (crossover); the second port handles iSCSI requests from iSCSI clients via a separate iSCSI switch.
1 x 1Gb NIC handling the heartbeat channel (crossover).
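
To make the layout explicit, here is a quick sketch of the channel assignment as a Python dictionary. This is purely illustrative notation of my own, not StarWind configuration syntax:

Code:
# Illustrative only: channel layout per node (hypothetical names).
NODES = {
    "NODE1": {  # primary
        "10Gb port 1": "sync/replication channel (crossover to NODE2)",
        "10Gb port 2": "iSCSI client traffic (via the iSCSI switch)",
        "1Gb NIC":     "heartbeat channel (crossover to NODE2)",
    },
    "NODE2": {  # secondary
        "10Gb port 1": "sync/replication channel (crossover to NODE1)",
        "10Gb port 2": "iSCSI client traffic (via the iSCSI switch)",
        "1Gb NIC":     "heartbeat channel (crossover to NODE1)",
    },
}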

If I simulate a fault scenario equivalent to the 10Gb NIC in the primary node failing, both the replication channel and the iSCSI-client channel fail together. What happens is this: StarWind uses the heartbeat channel NIC and shuts down the secondary node to prevent split-brain. The primary node is left marked as synchronised and serving requests, but that is the node whose NIC has died, and it has no path to the iSCSI network. The end result is that BOTH nodes are now unavailable to serve client requests, and all paths are down to the iSCSI clients.

I simulated this outage twice, once on the phone with support, and they confirmed the behaviour is correct. The net result is that a single NIC failure in the primary node that affects both replication and client requests results in no storage availability whatsoever.

What I want to know is why this logic cannot be improved. This was the first fault test I tried, and there was zero storage tolerance. What are the available workarounds?

In the fault I have outlined above, the logical thing to do was to keep NODE2 online: it was still reachable on the iSCSI channel, where NODE1 was not. The only logic in your heartbeat protocol is to shut down the secondary node and keep the primary online, even if the primary node has NO comms on the iSCSI client channels. Even a simple check of each node's iSCSI channels to verify connectivity to the iSCSI client network would let you choose the best node to remain online and still avoid a split-brain scenario (see the sketch below).
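
To illustrate the kind of check I mean, here is a rough sketch in Python. This is purely hypothetical pseudo-logic on my part (the function names and probe method are made up), not StarWind's actual heartbeat code:

Code:
import socket

def iscsi_reachable(local_iscsi_ip: str, probe_ip: str, port: int = 3260) -> bool:
    """Crude self-check a node could run on its own iSCSI-facing port:
    attempt a TCP connect, bound to the local iSCSI interface, to something
    on the iSCSI client network (another portal, the switch, etc.)."""
    try:
        with socket.create_connection((probe_ip, port), timeout=2,
                                      source_address=(local_iscsi_ip, 0)):
            return True
    except OSError:
        return False

def choose_survivor(primary_ok: bool, secondary_ok: bool) -> str:
    """Each node runs its own check and reports the result over the
    heartbeat link; the node that still has an iSCSI path stays online."""
    if secondary_ok and not primary_ok:
        return "secondary"   # my scenario: NODE1's 10Gb NIC is dead, keep NODE2
    if primary_ok and not secondary_ok:
        return "primary"
    return "primary"         # tie: fall back to the existing "primary wins" rule

In reality each node would run its own probe and exchange the result over the heartbeat channel before fencing anything; the point is simply that iSCSI-side reachability should feed into the decision, rather than always keeping the primary.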

Could we not implement a witness-only node with connectivity to the iSCSI network, to determine which node to shut down when the primary loses all comms? The current logic has a single lynchpin that results in complete downtime even when one node is entirely healthy. What makes storage unavailable to the iSCSI network is the split-brain logic of the heartbeat channel, not any fault on the secondary node. StarWind shuts down the healthy node!
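
To be clear about what I am asking for, here is a rough sketch of witness-style arbitration. Again, this is hypothetical logic of my own, not anything from StarWind:

Code:
# Hypothetical witness arbitration (my own sketch, not StarWind's design).
# A small witness host sits on the iSCSI network. When the storage nodes
# lose their sync channel, the witness reports which node it can still
# reach on the iSCSI side, and only that node keeps serving I/O.

def witness_decision(primary_reachable: bool, secondary_reachable: bool) -> str:
    if primary_reachable and not secondary_reachable:
        return "keep primary, fence secondary"
    if secondary_reachable and not primary_reachable:
        return "keep secondary, fence primary"   # my failure case
    if primary_reachable and secondary_reachable:
        return "keep primary, fence secondary"   # tie-break, same as today
    return "fence both"  # witness sees neither node: stay safe

A third vote on the iSCSI side would let the cluster fence the node that has actually lost its client path, instead of always fencing the secondary.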

The only workaround I can think of is to install additional NICs in the primary node to cater for iSCSI connectivity, and hope the whole driver stack for that chipset type does not fail, which I have seen happen multiple times! But a simple improvement in the heartbeat logic would avoid this.

Can I please get a solution for this? I am quite surprised that such a simple fault has not been addressed in the fault-tolerance logic of your product by now. The ability to roll out a basic witness server with connectivity on the iSCSI channel would solve this problem entirely!

Cheers.

Re: Starwind Heartbeat lynchpin results in all paths down

Posted: Thu Nov 24, 2016 3:02 pm
by John (staff)
Hello Jimbyau,

StarWind is now in the process of deploying StarWind R5. R5 might be ready in a month. The new version will include the ability to add a witness node, so we can choose which host is working properly. The heartbeat protocol will be upgraded for R5's needs.

Have a great day!

