HA Network Failover Problem

jeffhamm · Tue Oct 04, 2011 1:04 pm

Anatoly,

Let me clarify - in my scenario there is no longer a NODE1; only NODE2 is left. Let's assume that NODE1 had a total hardware failure, and it will be 2-5 business days before I get replacement hardware for NODE1. I can't wait 2-5 business days and have all my Hyper-V virtual machines down during that time.

How do I bring the NODE2 targets back online when NODE1 is completely gone? When I try to do a Full Sync on Node2 when Node1 is down, I get the following error message:

Thanks!
Jeff

jeffhamm · Tue Oct 04, 2011 6:14 pm

PS - I know I can't do a Full Sync between the nodes when Node1 is down, but I thought maybe that was the way to bring the targets back online on Node2

Do I have to recreate my targets as non-HA targets on Node2 to get them to come back online?

Thanks,
Jeff

lodiver · Tue Oct 04, 2011 8:41 pm

Hey Jeff

if you have H-A configured correctly, then node 2 should stay running when you lose node 1

when you get replacement hardware for node 1 then you will need to recreate your H-A (use existing image file for node 2 and create a new image file for node 1) and sync in the proper direction.
(start the sync from node 1)

jeffhamm · Tue Oct 04, 2011 9:59 pm

I'm simulating different fail-over scenarios before going into production:

1) If I "pull the power cord" on NODE1, then NODE2 continues to service Hyper-V hosts and their virtual machines with no interruption in service

BUT

2) If I "pull out all the network cables" from NODE1, the result is "split-brain", and then NODE2 also goes offline.

What I am pulling my hair out trying to figure out is how in this 2nd scenario to get my targets back online on NODE2. I have not been able to figure it out or gotten good feedback on how to do this yet. Any ideas are greatly appreciated!

Thanks,
Jeff

Tue Oct 04, 2011 10:10 pm

YOU DO NOT HAVE A SPLIT BRAIN ISSUE. Split brain is a data mismatch between the two nodes. It can happen in only one case: both nodes are alive and all network paths between them are gone. If you've put the heartbeat on the client network (the proper way) and the clients are still connected, only one node will continue serving requests, because the heartbeat is used to learn that the sync channel is dead but the partner node is alive. If the isolated node has lost its client connections as well, no data corruption can physically happen, because the clients talk only to the other node, not to the one you've completely isolated but left running.
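
To make the condition above concrete, here is a minimal sketch in Python. It is only a toy model of the reasoning in this post, not how StarWind actually decides anything; the `ClusterState` class, its field names, and `split_brain_possible` are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical snapshot of a two-node HA pair (illustrative names only)."""
    node1_alive: bool
    node2_alive: bool
    sync_link_up: bool          # dedicated synchronization channel
    heartbeat_link_up: bool     # heartbeat over the client network
    clients_reach_node1: bool
    clients_reach_node2: bool


def split_brain_possible(state: ClusterState) -> bool:
    """Return True only if the two nodes could end up with mismatched data.

    Assumption taken from the post: both nodes must be alive, every path
    between them must be down, and clients must be able to write to both.
    """
    both_alive = state.node1_alive and state.node2_alive
    nodes_isolated = not (state.sync_link_up or state.heartbeat_link_up)
    clients_reach_both = state.clients_reach_node1 and state.clients_reach_node2
    return both_alive and nodes_isolated and clients_reach_both


# Scenario 1: power cord pulled on NODE1 (NODE1 is simply dead).
power_pulled = ClusterState(False, True, False, False, False, True)

# Scenario 2: all network cables pulled from NODE1 (alive, but unreachable).
cables_pulled = ClusterState(True, True, False, False, False, True)

# Worst case: both alive, no path between them, clients can write to both.
partitioned = ClusterState(True, True, False, False, True, True)

for name, state in [("power pulled", power_pulled),
                    ("cables pulled", cables_pulled),
                    ("partitioned", partitioned)]:
    print(name, "->", split_brain_possible(state))
```

In this model neither of the two test scenarios from the thread counts as split brain, because the clients can only ever write to NODE2. It says nothing about why NODE2 goes offline in the second scenario; that is a separate question.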

Tue Oct 04, 2011 10:12 pm

http://en.wikipedia.org/wiki/High-availability_cluster

jeffhamm · Wed Oct 05, 2011 1:37 am

Okay - I don't have a split brain issue

But I do still have an issue

See below - how do I bring the targets on NODE2 back online, assuming NODE1 never comes back online again?

Thanks,
Jeff

Wed Oct 05, 2011 7:03 am

From what I understand you have TWO issues now:

1) The one listed directly above (no way to synchronize a single node if its partner is AWOL forever).

2) The second node stops responding after you've isolated the first one (removed all network cables).

Please confirm.

jeffhamm · Wed Oct 05, 2011 2:01 pm

Correct - I have both of these issues. How do you sync a single node if its partner is AWOL forever?

Thanks,
Jeff

Thu Oct 06, 2011 8:23 am

If your second node is dead forever, the only way to bring your data back online is to delete the existing HA devices and create Basic image files with a header size of 65536 (you'll see the corresponding box in the target wizard).

Thu Oct 06, 2011 8:54 am

That's the same thing the user "lodiver" kindly suggested a few days ago. Back to your second case - our support guys will look into the possibility of running a remote session to your HA cluster.

jeffhamm · Fri Oct 07, 2011 2:07 am

Okay, regarding Issue 1, I was able to come up with a pretty decent DR scenario if I ever run into this in Production:

- Delete HA Targets on NODE2 and recreate as basic disk images on NODE2, using same Aliases and IQNs. Hyper-V and VMs are now happy

- Create New Temporary Targets on NODE2
- Perform a Storage Migration of VMs from the "original" targets to the new temp targets. About 10 seconds of downtime per VM
- Delete the "original" targets on NODE2
- Bring failed node (NODE1) back online; delete all HA targets on it
- Recreate HA Targets on NODE1, using same Aliases and IQNs. Use existing disk images on both NODES, but specify NODE2 as the source node ("synchronize with partner disk")

At this point the SAN is back up just the way it was prior to NODE1 going down, but now I need to migrate the VMs back to the HA Targets:

- Perform a Storage Migration of VMs from the temp targets back to the original HA Targets. About 10 seconds of downtime per VM
- After all of the VMs are migrated back to the HA Targets, remove the temp targets from NODE2
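
For anyone who wants to keep this as a checklist, the steps above can be written down as an ordered runbook. This is only a sketch: the step wording is taken from the list above, while the `Step` class, the phase labels, and `format_runbook` are hypothetical names for illustration. Nothing here talks to StarWind or Hyper-V; it only prints the order of operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One manual action in the recovery procedure (illustrative structure)."""
    phase: str    # made-up grouping label, not from any product
    where: str    # NODE1, NODE2, or the VMs being migrated
    action: str


RUNBOOK = [
    Step("restore service", "NODE2",
         "Delete HA targets and recreate them as basic disk images, "
         "using the same aliases and IQNs"),
    Step("move VMs off", "NODE2", "Create new temporary targets"),
    Step("move VMs off", "VMs",
         "Storage-migrate from the original targets to the temporary targets"),
    Step("move VMs off", "NODE2", "Delete the original targets"),
    Step("rebuild HA", "NODE1",
         "Bring the node back online and delete all HA targets on it"),
    Step("rebuild HA", "NODE1",
         "Recreate HA targets with the same aliases and IQNs, using existing "
         "disk images on both nodes and NODE2 as the sync source"),
    Step("move VMs back", "VMs",
         "Storage-migrate from the temporary targets back to the HA targets"),
    Step("move VMs back", "NODE2", "Remove the temporary targets"),
]


def format_runbook(steps):
    """Return the steps as numbered lines, preserving their order."""
    return [f"{i}. [{s.phase}] {s.where}: {s.action}"
            for i, s in enumerate(steps, start=1)]


if __name__ == "__main__":
    print("\n".join(format_runbook(RUNBOOK)))
```

The order matters: the temporary targets have to exist before the first migration, and the HA targets have to be recreated and synchronized from NODE2 before the VMs are moved back.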

Regarding the 2nd issue, I am interested in working with support; what would be the next steps?

Thanks,
Jeff
PS - I really appreciate all of your help on this one

Fri Oct 07, 2011 3:07 am

Jeff, please check your inbox.

I tried to reach you today but only got an autoreply message.