HA Active / Active Data Integrity

BillArchway · Fri May 09, 2014 1:43 am

I have a question. We have a 2 node HA Configuration with several ESXi hosts having both nodes as iSCSI paths.

So if I understand Active / Active, at any one point in time, different hosts could be accessing the same LUN on different nodes. Both are making updates.

So what happens if I lose my sync channel because of a NIC or switch failure?

Neither node is down. But it can no longer sync updates. Which Node becomes the survivor? Doesn't one of them have to lose data?

I guess the same issue is relevant in a 3 node configuration.

Thanks,

Fri May 09, 2014 11:55 am

The situation you're describing is referred to as a "split brain" issue and is well known in clustered environments.

2 (or any other even number of) nodes require redundant heartbeat networks running between them. So with no sync channels alive, one (or a group) of the nodes would turn itself (themselves) OFF, having no access to the so-called "master" token.

3 (or any other odd number of) nodes are immune, as they have a quorum (voting majority).
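
To make the majority/tie-break idea concrete, here is a minimal sketch in Python. It is only a toy model of the reasoning above, not StarWind's actual algorithm; the function names and the `holds_master_token` flag are made up for illustration.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def survives_partition(visible: int, cluster_size: int,
                       holds_master_token: bool = False) -> bool:
    """Decide whether a node keeps serving I/O after a partition.

    visible counts the node itself plus every partner it can still reach.
    holds_master_token is a hypothetical tie-breaker for even-sized clusters.
    """
    if visible >= majority(cluster_size):
        return True  # this side has the quorum
    if cluster_size % 2 == 0 and visible * 2 == cluster_size:
        # Exact tie, only possible with an even node count:
        # the side without the token turns itself off.
        return holds_master_token
    return False


# 2 nodes, sync lost: only the token holder stays up.
print(survives_partition(1, 2, holds_master_token=True))   # True
print(survives_partition(1, 2, holds_master_token=False))  # False
# 3 nodes split 2|1: the pair has quorum, the lone node does not.
print(survives_partition(2, 3), survives_partition(1, 3))  # True False
```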

BillArchway · Fri May 09, 2014 1:46 pm

Thanks for the response.

This is not important. I'm just curious.

So in terms of a StarWind HA perspective, I have 2 nodes. ESX is happily load balanced between the 2. We lose sync and node 2 shuts down (or node 2 crashes). ESX happily switches to the remaining valid path and keeps working. I've seen that happen and it worked as advertised.

But what happens to the pending update on Node 2 before it shut down? How does ESX know that the last update didn't happen? Does the node not report the update to ESX as successful until it is sync'ed?

I assume in this scenario, that once the sync link is restored (or node 2 comes back online) a full sync would be required.

Again, thanks. This is the programmer in me, lol.

Fri May 09, 2014 3:49 pm

"Pending updates" actually never happen. A write is confirmed by a node only once all partner nodes have it either in cache (write-back) or on disk (write-through). If the requester gets no response within some period of time, a timeout happens, the request gets aborted, the path is marked as "failed" and the request is submitted to the other node. That's the way MPIO works.
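
Roughly, the acknowledge-after-replication and path-failover behaviour can be sketched like this. It is a simplified Python model, not the real iSCSI/MPIO stack: `Node`, `Initiator`, the `alive` flag and the use of `TimeoutError` are all assumptions made for the example.

```python
class Node:
    """Toy storage node that acknowledges a write only after its partners have it."""

    def __init__(self, name):
        self.name = name
        self.store = {}      # lba -> data
        self.partners = []
        self.alive = True

    def write(self, lba, data):
        if not self.alive:
            # A dead or shut-down node never answers; the host sees a timeout.
            raise TimeoutError(f"{self.name} did not respond")
        for partner in self.partners:
            if partner.alive:            # partners that dropped out resync later
                partner.store[lba] = data
        self.store[lba] = data
        return True                      # acknowledgement goes out only now


class Initiator:
    """Toy host-side multipathing: on timeout, mark the path failed and retry."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.failed = set()

    def write(self, lba, data):
        for node in self.paths:
            if node.name in self.failed:
                continue
            try:
                node.write(lba, data)
                return node.name         # which path completed the request
            except TimeoutError:
                self.failed.add(node.name)
        raise IOError("all paths failed")


node1, node2 = Node("node1"), Node("node2")
node1.partners, node2.partners = [node2], [node1]
host = Initiator([node2, node1])

print(host.write(1, b"x"))   # node2, and node1 holds the block as well
node2.alive = False          # node 2 shuts itself down
print(host.write(2, b"y"))   # node1, after the path to node2 is marked failed
```

The point is that the host never gets a "success" for a write that exists on only one of the in-sync nodes, so there is nothing to roll back when a node disappears.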

Full sync or fast sync depends on the amount of data (delta) that differs between the nodes. With a big difference it makes sense to stream ALL the data instead of processing the changed-block tracker bitmap (big writes go faster than a bunch of small out-of-order ones).
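
As an illustration of that trade-off, a sketch in Python of picking the sync mode from a changed-block bitmap. The 50% threshold and the function name are arbitrary assumptions for the example, not the product's real values.

```python
def choose_sync(dirty_bitmap, full_sync_threshold=0.5):
    """Pick a sync mode from a changed-block bitmap (one flag per block).

    Returns (mode, blocks_to_copy). full_sync_threshold is a made-up cut-off:
    above it, streaming everything sequentially is assumed to beat many
    small scattered writes.
    """
    dirty = [i for i, bit in enumerate(dirty_bitmap) if bit]
    if not dirty:
        return "none", []
    if len(dirty) / len(dirty_bitmap) >= full_sync_threshold:
        return "full", list(range(len(dirty_bitmap)))
    return "fast", dirty


print(choose_sync([0, 1, 0, 0, 0, 0, 0, 0]))  # ('fast', [1])
print(choose_sync([1, 1, 1, 0]))              # ('full', [0, 1, 2, 3])
```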

BillArchway · Fri May 09, 2014 4:35 pm

Great. Makes sense.

Thanks,

Fri May 09, 2014 4:57 pm