HA Network Failover Problem

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Tue Oct 04, 2011 1:04 pm

Anatoly,

Let me clarify - in my scenario there is no longer a NODE1; only NODE2 is left. Let's assume that NODE1 had a total hardware failure, and it will be 2-5 business days before I get replacement hardware for NODE1. I can't wait 2-5 business days and have all my Hyper-V virtual machines down during that time.

How do I bring the NODE2 targets back online when NODE1 is completely gone? When I try to do a Full Sync on Node2 when Node1 is down, I get the following error message:

Image

Thanks!
Jeff
jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Tue Oct 04, 2011 6:14 pm

PS - I know I can't do a Full Sync between the nodes when Node1 is down, but I thought maybe that was the way to bring the targets back online on Node2 :)

Do I have to recreate my targets as non-HA targets on Node2 to get them to come back online?

Thanks,
Jeff
lodiver
Posts: 6
Joined: Mon Aug 23, 2010 11:05 am

Tue Oct 04, 2011 8:41 pm

Hey Jeff

If you have H-A configured correctly, then node 2 should stay running when you lose node 1.

When you get replacement hardware for node 1, you will need to recreate your H-A (use the existing image file for node 2 and create a new image file for node 1) and sync in the proper direction (start the sync from node 1).
jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Tue Oct 04, 2011 9:59 pm

I'm simulating different fail-over scenarios before going into production:

1) If I "pull the power cord" on NODE1, then NODE2 continues to service the Hyper-V hosts and their virtual machines with no interruption in service.

BUT

2) If I "pull out all the network cables" from NODE1, the result is "split-brain", and then NODE2 also goes offline.

What I am pulling my hair out over is how to get my targets back online on NODE2 in this 2nd scenario. I have not been able to figure it out, and I have not gotten good feedback on how to do this yet. Any ideas are greatly appreciated! :)

Thanks,
Jeff
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Tue Oct 04, 2011 10:10 pm

YOU DO NOT HAVE A SPLIT-BRAIN ISSUE. Split brain is a data mismatch between the two nodes. It can happen in only one case: both nodes are alive and all network paths between them are gone. If you've put the heartbeat on the client network (the proper way) and have clients connected, only one node will continue serving requests, because the heartbeat is used to learn that the sync channel is dead but the partner node is alive. If the clients are disconnected as well, no data corruption can physically happen, because the clients talk only to the other node, not to the one you've completely isolated but kept active.
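
The reasoning above can be sketched in a few lines of Python. This is only an illustration of the decision described in this post, not StarWind's actual implementation: the names `NodeView`, `decide`, `split_brain_possible` and the `wins_tie_break` priority flag are all hypothetical, and the tie-break rule in particular is an assumption made so the example is complete.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP_SERVING = "keep serving"
    STOP_SERVING = "stop serving"


@dataclass(frozen=True)
class NodeView:
    """What one HA node can observe about its partner (hypothetical model)."""
    sync_link_up: bool    # synchronization channel to the partner is reachable
    heartbeat_up: bool    # heartbeat over the client network is reachable
    wins_tie_break: bool  # assumed priority flag; exactly one node holds it


def decide(view: NodeView) -> Action:
    """Decide whether this node should keep serving client requests."""
    if view.sync_link_up:
        # Normal operation: writes are mirrored, both nodes serve.
        return Action.KEEP_SERVING
    if view.heartbeat_up:
        # Sync is dead but the partner is alive: only one node may continue.
        if view.wins_tie_break:
            return Action.KEEP_SERVING
        return Action.STOP_SERVING
    # No path to the partner at all: it is powered off or fully isolated.
    # Clients can only be talking to one node, so this node stays active.
    return Action.KEEP_SERVING


def split_brain_possible(both_nodes_alive: bool, any_inter_node_path_up: bool) -> bool:
    """Split brain needs both nodes alive and every path between them gone."""
    return both_nodes_alive and not any_inter_node_path_up
```

In this model a node that sees neither the sync channel nor the heartbeat stays active, so from NODE2's point of view "power cord pulled on NODE1" and "all cables pulled from NODE1" look the same, and NODE2 would be expected to keep serving in both cases.
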
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Tue Oct 04, 2011 10:12 pm

Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Wed Oct 05, 2011 1:37 am

Okay - I don't have a split brain issue :roll:

But I do still have an issue :cry:

See below - how do I bring the targets on NODE2 back online, assuming NODE1 never comes back online again?

Image

Thanks,
Jeff
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Wed Oct 05, 2011 7:03 am

From what I understand you have TWO issues now:

1) The one listed directly above (no way to synchronize a single node if its partner is AWOL forever).

2) The second node stops responding after you've isolated the first one (removed all network cables).

Please confirm.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Wed Oct 05, 2011 2:01 pm

Correct - I have both of these issues. How do you sync a single node if its partner is AWOL forever?

Thanks,
Jeff
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Thu Oct 06, 2011 8:23 am

If your other node is dead for good, the only way to bring your data back online is to delete the existing HA devices and create Basic image file devices with a 65536 header size (you'll see the corresponding box in the target wizard).
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Thu Oct 06, 2011 8:54 am

That's the same thing the user "lodiver" kindly suggested a few days ago. Back to your second case: our support guys will look into running a remote session on your HA cluster.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

jeffhamm
Posts: 47
Joined: Mon Jan 03, 2011 6:43 pm

Fri Oct 07, 2011 2:07 am

Okay, regarding Issue 1, I was able to come up with a pretty decent DR scenario if I ever run into this in Production:

- Delete HA Targets on NODE2 and recreate as basic disk images on NODE2, using same Aliases and IQNs. HyperV and VMs are now happy :)
- Create New Temporary Targets on NODE2
- Perform a Storage Migration of VMs from the "original" targets to the new temp targets. About 10 seconds of downtime per VM
- Delete the "original" targets on NODE2
- Bring failed node (NODE1) back online; delete all HA targets on it
- Recreate HA Targets on NODE1, using same Aliases and IQNs. Use existing disk images on both nodes, but specify NODE2 as the source node ("synchronize with partner disk")

At this point the SAN is back up just the way it was prior to NODE1 going down, but now I need to migrate the VMs back to the HA Targets:

- Perform a Storage Migration of VMs from the temp targets back to the original HA Targets. About 10 seconds of downtime per VM
- After all of the VMs are migrated back to the HA Targets, remove the temp targets from NODE2
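
For anyone who wants to keep the steps above as a checklist, here is a minimal Python sketch that walks through them in order and stops at the first step that is not confirmed. It does not call any StarWind or Hyper-V API; every step is still done by hand, and the names `RECOVERY_RUNBOOK` and `run_runbook` are made up for this example.

```python
from typing import Callable, List

# The manual steps from the post above, in the order they must be done.
RECOVERY_RUNBOOK: List[str] = [
    "NODE2: delete HA targets, recreate as basic disk images (same aliases and IQNs)",
    "NODE2: create new temporary targets",
    "Storage-migrate VMs from the original targets to the temporary targets",
    "NODE2: delete the original targets",
    "NODE1: bring the node back online and delete all HA targets on it",
    "NODE1: recreate HA targets (same aliases and IQNs), with NODE2 as the sync source",
    "Storage-migrate VMs from the temporary targets back to the HA targets",
    "NODE2: remove the temporary targets",
]


def run_runbook(steps: List[str], confirm: Callable[[int, str], bool]) -> List[int]:
    """Walk the steps in order; stop at the first one that is not confirmed.

    Returns the 1-based numbers of the steps that were confirmed as done.
    """
    completed: List[int] = []
    for number, step in enumerate(steps, start=1):
        if not confirm(number, step):
            break  # never skip ahead: later steps depend on earlier ones
        completed.append(number)
    return completed
```

One way to use it interactively is `run_runbook(RECOVERY_RUNBOOK, lambda n, s: input(f"{n}. {s} - done? [y/n] ") == "y")`, which prints each step and waits for a "y" before moving on.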

Regarding the 2nd issue, I am interested in working with support; what would be the next steps?

Thanks,
Jeff
PS - I really appreciate all of your help on this one
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Fri Oct 07, 2011 3:07 am

Jeff, Please check your inbox :)
I tried to reach you today but only got an autoreply message.
Max Kolomyeytsev
StarWind Software