HA Node went down hard - how do I bring it back online?

jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Thu Jan 10, 2013 10:49 pm

One of my two HA Nodes went down hard (disk controller issues). I have a total of 9 HA Targets. 5 stayed online and available through the 2nd HA unit, but 4 went offline altogether. I had to go into the Management console on the 2nd unit and under Devices right click on the device, select synchronization, and then select itself as the source. Then it came back online almost immediately and was available to be accessed from my Hyper-V cluster.

I am in the process of repairing my 1st HA node. Fortunately, the RAID set with the OS and the StarWind program was not damaged, and it boots up fine. However, once I finish repairing the other RAID sets, what is the correct procedure to bring it back online? I know it will need to do a slow sync, but my concern is bringing the 1st HA node online and then having it interfere with the 2nd HA unit before the sync completes. Is there a way I can tell the 1st HA unit to not come online until it syncs up with the 2nd HA unit? We're running build 6.0.5189.

Thanks,
Jeff
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Jan 14, 2013 2:10 pm

As far as I understand, your main concern is keeping your datastores available. Please correct me if I'm wrong.
If so, you can achieve what you want. You just need to remove all the partners of the HA device from node 1 (the one that failed) through the Replication Manager (right-click the device and choose the corresponding option). After that, create new partners on this server and then synchronize the partners. As you mentioned, a normal sync will take place (not FastSync).
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Mon Jan 14, 2013 5:25 pm

Thanks Anatoly - you are correct that I want to continue to keep my datastores online running on Node2 during this process. Some more questions for clarification:

- Do I really have to remove all the partners? Will Node1 not "see" the partners on Node 2 and just auto sync (full sync) with them?

- If I do need to remove the partners on Node1, I'm not clear on what "remove partners" and "create new partners" mean, at least not when reviewing from the Replication Manager screen. Do you mean "Remove replica" and "Add replica"?

- After completing the procedure above do I need to do anything on the 2nd (good) HA node?

Thanks!
Jeff
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Wed Jan 16, 2013 2:21 pm

- Do I really have to remove all the partners? Will Node1 not "see" the partners on Node 2 and just auto sync (full sync) with them?
As far as I understood, those image files could have been corrupted by the failure, so I would create new ones.
- If I do need to remove the partners on Node1, I'm not clear on what "remove partners" and "create new partners" mean, at least not when reviewing from the Replication Manager screen. Do you mean "Remove replica" and "Add replica"?
Yes, that is exactly what I meant - remove replica and add it.
- After completing the procedure above do I need to do anything on the 2nd (good) HA node?
No, just wait until the sync finishes.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Fri Jan 25, 2013 6:53 pm

I've replaced the hardware, and am trying to open the StarWind Management Console on the (formerly) crashed HA Node. However, when I try to run the console, the splash screen comes up, but then goes away and the management console does not open. In the interest of being safe, I have disabled all NICs on the server to prevent the old HA node from accidentally taking down the "good" HA node. What could cause the StarWind Management Console not to open?

Thanks,
Jeff
jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Fri Jan 25, 2013 6:56 pm

Never mind - I was able to open it in an RDP session (enabled the network, not StarWind, management NIC). Kinda weird though.
jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Fri Jan 25, 2013 7:00 pm

But I still cannot connect :(

When I try to connect to 127.0.0.1, I get the following error message:

"StarWind Services not found on host"

I have checked, and the StarWind Service is running; I have tried stopping and re-starting it for good measure. Any ideas?
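
One way to narrow this down is to check whether anything is actually accepting TCP connections on the address the console is pointed at. The following is a minimal Python sketch, not a StarWind tool. It assumes the management port is 3261 (an assumption; adjust it to your setup), and `is_listening` is a hypothetical helper name.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: 3261 is the management port on this installation.
    MGMT_PORT = 3261
    for host in ("127.0.0.1",):
        state = "open" if is_listening(host, MGMT_PORT) else "closed"
        print(f"{host}:{MGMT_PORT} is {state}")
```

If the loopback address reports closed while the service is running, the service is probably bound to a different interface. Adding the node's other IP addresses to the tuple shows which one it answers on.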

Thanks,
Jeff
jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Fri Jan 25, 2013 8:58 pm

And I'm not sure of the correct order for doing all of this. Is this the correct order?

- Remove the replica from the remaining "good" node pointing to the old node

- Bring the old node back on (having issues with this one, see above)

- Add replica. Is this initiated from the good node or the repaired node?

- Wait for them to sync up

- Done?

PS - do I need to remove the paths to the old node from the Hyper-V servers' iSCSI/MPIO?
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Jan 28, 2013 12:44 pm

Can you confirm that the StarWind service is running on that node?
I'd also like to clarify:
When I try to connect to 127.0.0.1, I get the following error message:

"StarWind Services not found on host"
Are you connecting to the host in the StarWind Management Console or to the target in the iSCSI initiator?
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Mon Jan 28, 2013 2:27 pm

Yes - the StarWind Service is running. It starts without any errors. I have tried stopping and restarting the service, rebooting the server, etc. repeatedly.

Yes - I am connecting to the host in the StarWind Management Console, not trying to connect to StarWind as an iSCSI target.
jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Mon Jan 28, 2013 6:06 pm

I think I noted this earlier, but I have disabled almost all of the network connections so that this unit does not come online and get accidentally accessed by my Hyper-V hosts. Could having the networking disabled be part of the problem?
jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Tue Jan 29, 2013 4:13 pm

I think I may see the issue. I think the only interface I specified as a management interface is the same one I designated as the heartbeat channel between my two nodes, so I don't think the service is listening on the loopback address (127.0.0.1). I need to change the IP address for the management interface. I was poking around the Starwind.CFG file and found the following section:

<BCastEnable value="yes"/>
<BCastInterface value="0.0.0.0"/>
<BCastPort value="3261"/>
<CtlInterface value="10.20.221.1"/>
<listen value="*,"/>
<nolisten value="10.30.221.1:3260"/>
</connections>

Is there a way I can modify the file so that the StarWind management interface will listen on 127.0.0.1?

Thanks,
Jeff
jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Tue Jan 29, 2013 7:43 pm

OK, I was able to connect by changing the <CtlInterface value> to 127.0.0.1
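
For anyone who would rather script that change than edit the file by hand, here is a minimal Python sketch. It is not an official StarWind procedure: `set_ctl_interface` is a hypothetical helper, it assumes the file contains exactly one `CtlInterface` element written as in the fragment quoted earlier in this thread, and it works on the file's text only. Backing up the file and stopping the service before changing the real file is probably wise.

```python
import re


def set_ctl_interface(cfg_text: str, new_ip: str) -> str:
    """Return cfg_text with the CtlInterface value replaced by new_ip."""
    pattern = re.compile(r'(<CtlInterface\s+value=")[^"]*(")')
    # A callable replacement avoids any escaping issues in new_ip.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + new_ip + m.group(2), cfg_text
    )
    if count != 1:
        # Refuse to guess if the element is missing or duplicated.
        raise ValueError(f"expected one CtlInterface element, found {count}")
    return new_text


if __name__ == "__main__":
    # Sample text copied from the fragment posted above, not a full config.
    fragment = (
        '<BCastPort value="3261"/>\n'
        '<CtlInterface value="10.20.221.1"/>\n'
        '<listen value="*,"/>\n'
    )
    print(set_ctl_interface(fragment, "127.0.0.1"))
```

Working on the text with a narrow pattern, rather than re-serializing the XML, leaves every other line of the file byte-for-byte unchanged.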

Now I need help with the proper order of restoring the old node :)

Is this a good procedure below?

- Remove all HA Targets from the crashed node

- On the good node, use the Replication Manager from the Devices section and use Remove Replica to remove all references to the old node

What would be the next step? Can I simply use Add Replica from the Replication Manager on the good node, or do I need to create Targets on the old node first? Also, should I clear out all of the old iSCSI connections to the old node in Hyper-V before adding it back as a replica?

Thanks!
Jeff
jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Wed Jan 30, 2013 3:14 pm

One more thing - the "good" node is (or was) my secondary node; will that cause any issues with adding the other back into replication? Do I need to "mark it as primary", or is StarWind smart enough to take care of that? 8)
jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Wed Jan 30, 2013 10:51 pm

Sorry to keep bugging y'all, but is there a PDF or tech document that outlines the steps to bring the failed node back online?

Thanks!
Jeff