Can't log back in after power outage - XenServer 5.5

mogulbumm · Wed Feb 02, 2011 2:14 pm

We had a datacenter power outage (VERY rare) and everything power cycled.

Two StarWind units set up in HA mode. Both came back online, but I had to delete the HA image on the second unit and resync.

I am now following the process to get Citrix XenServer to see the units again. I tried reconnecting the storage to no avail, so I'm starting from scratch with the commands, and I still cannot get XenServer to log on to the StarWind units.

Is there a process to delete or remove the "saved" login info and start over? I obviously do NOT want to lose any data on the StarWind units, but I'm at a loss. It is NOT easy to work with these in a disaster scenario with XenServer.

Wed Feb 02, 2011 5:07 pm

Hello, to get everything working you need to detach and forget the Datastore, then disconnect the targets. After this, reconnect the targets and re-add the Datastore. The procedure is identical for all iSCSI targets with Xen, so it's more of a Xen fault.
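
For reference, the sequence above might look roughly like this from the XenServer console. This is only a sketch under assumptions, not an official StarWind or Citrix procedure: the UUIDs, IQN, and portal address are placeholders, and the script just prints the commands so they can be reviewed and run by hand. Forgetting the SR and deleting the iSCSI node record only remove the "saved" information on the XenServer side; neither touches the data on the target.

```python
def recovery_commands(pbd_uuid, sr_uuid, iqn, portal):
    """Return the commands, in order, as argv lists. Nothing is executed."""
    node = ["iscsiadm", "-m", "node", "-T", iqn, "-p", portal]
    return [
        # 1. Detach (unplug the PBD) and forget the SR.
        ["xe", "pbd-unplug", "uuid=" + pbd_uuid],
        ["xe", "sr-forget", "uuid=" + sr_uuid],
        # 2. Disconnect the target and drop the saved node record.
        node + ["-u"],
        node + ["-o", "delete"],
        # 3. Rediscover the target and log back in.
        ["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal],
        node + ["-l"],
    ]


if __name__ == "__main__":
    # Placeholder values (assumptions); replace with your own.
    for cmd in recovery_commands("<pbd-uuid>", "<sr-uuid>",
                                 "iqn.2008-08.com.example:target1",
                                 "192.0.2.10:3260"):
        print(" ".join(cmd))
```

Re-adding the Datastore afterwards is left out on purpose, since the exact arguments depend on how the SR was originally created.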

mogulbumm · Wed Feb 02, 2011 7:52 pm

Xen will discover the IQN but not the LUN (error: "Unable to add iSCSI device. Check your settings and try again").

From the Xen console, the iscsiadm discovery command finds the targets.

When I try to log in with iscsiadm -m node -L all, I get the following errors:

iscsiadm: initiator reported error (4 - encountered connection failure)
(note: this is on the crossover interface LUN)

iscsiadm: initiator reported error (5 - encountered iSCSI login failure)
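
A small sketch for pulling these error codes out of captured iscsiadm output when several targets are involved. The code numbers and messages are the ones quoted above; the hint strings are my own assumptions for triage, not official documentation.

```python
import re

# Hints are assumptions for triage only; the codes and messages come from
# the iscsiadm output quoted above.
HINTS = {
    4: "connection failure: check cabling, IP address and portal reachability",
    5: "login failure: target answered but the login did not complete",
}

ERROR_RE = re.compile(r"initiator reported error \((\d+) - ([^)]*)\)")


def parse_errors(output):
    """Return a list of (code, message, hint) tuples found in the output."""
    results = []
    for match in ERROR_RE.finditer(output):
        code = int(match.group(1))
        results.append((code, match.group(2), HINTS.get(code, "no hint")))
    return results


if __name__ == "__main__":
    sample = (
        "iscsiadm: initiator reported error (4 - encountered connection failure)\n"
        "iscsiadm: initiator reported error (5 - encountered iSCSI login failure)\n"
    )
    for code, message, hint in parse_errors(sample):
        print(code, "|", message, "|", hint)
```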

The units are in the process of syncing (55% complete after a day and a half).

Do I have to wait until they are fully in sync? If so, how can this possibly be workable in production if you have a problem???? (3 days to resync before we can reconnect)
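
A rough check of that "3 days" figure, assuming the sync proceeds at a constant rate (which it may not): 55% in a day and a half extrapolates to about 2.7 days in total. The function name is just for illustration.

```python
def estimate_resync(elapsed_hours, fraction_done):
    """Linear extrapolation. Returns (total_hours, remaining_hours)."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_hours / fraction_done
    return total, total - elapsed_hours


if __name__ == "__main__":
    # 55% complete after a day and a half (36 hours).
    total, remaining = estimate_resync(elapsed_hours=36, fraction_done=0.55)
    print(f"total ~{total:.1f} h ({total / 24:.1f} days), "
          f"remaining ~{remaining:.1f} h")
```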

Last question - if I completely delete the data on the HA unit, can I recreate the HA device without a sync?

This seems to be an issue with any unexpected shutdown or power loss. In the past we have resolved this by deleting the info on both units and just starting from scratch, but that is no longer an option now that we have data we need. Please advise!!!

sls · Thu Feb 03, 2011 8:08 am

We have the same problem and we are running VMware.
http://www.starwindsoftware.com/forums/ ... t2319.html

Because data is completely inaccessible while the HA nodes are being resynced, for whatever reason, we have had to take StarWind HA out of our production environment until StarWind acknowledges and fixes this deadly problem.

This issue should be pretty easy to reproduce in their lab. Just set up two nodes in HA and one ESX or XenServer host. Once HA is set up and a few VMs are running on the HA target, pull the power on all of the boxes and power them back up. The issue should appear when the HA needs to resync.

The issue is that the initiator can connect to and see the LUN on the HA target that is being synced, but you can't read from or write to it until the full sync is done. If you have a couple hundred GB of data, you just wait a few hours. If you have over 10 TB of data, as a data center might, you are dead in the water in this scenario.

The other thing we discovered is even more deadly than the wait to restore data access. If your HA nodes all lose power at the same time, you probably don't care which node has the most up-to-date data. In our last total power failure, however, the HA nodes did not go down together, because one of the nodes is on a bigger UPS unit. That node kept running for another 10 minutes, which we did not know, until its UPS also ran out of juice and cut the power.

An hour later, when power was restored, we powered everything back up and found that the VMs would not start because the iSCSI targets were inaccessible. We did pretty much the same thing you did: resynced the HA images. The software did not tell us that the second node had more recent data. We ended up syncing the image from the first node (the one that lost power first) onto the second node (the one that lost power 10 minutes later). Losing 10 minutes of data on 50 VMs, including Exchange servers and SQL servers, is a disaster.

I just wish the software had alerted us: the last read/write on the image file on the first node was at 21:23 and the last read/write on the image file on the second node was at 21:35, are you sure you want to sync the 2nd node's image from the 1st???? Or maybe the software could simply tell me that the data on the second node has 132 thousand more transactions than the first node. Either way would have prevented me from making such a mistake. We only figured out the error by looking at the UPS event log a day later. Pretty sad.

The StarWind software needs a more reliable way to handle failure recovery in an HA setup. We are just crossing our fingers to see when this will happen.
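
To make the request concrete, here is a minimal sketch of the kind of check I mean. It is not anything StarWind provides: the function name is made up, the date is a placeholder, and the last-write times are assumed to come from somewhere trustworthy (for us it would have been the UPS event log).

```python
from datetime import datetime


def pick_sync_source(last_writes):
    """last_writes maps node name -> datetime of its last write.

    Returns (source_node, warnings), where source_node holds the newest data.
    """
    source = max(last_writes, key=last_writes.get)
    newest = last_writes[source]
    warnings = [
        f"{node} is {newest - ts} behind {source}; "
        f"syncing FROM {node} would lose data"
        for node, ts in last_writes.items()
        if ts < newest
    ]
    return source, warnings


if __name__ == "__main__":
    # Times from the example above; the date itself is a placeholder.
    writes = {
        "node1": datetime(2011, 2, 1, 21, 23),
        "node2": datetime(2011, 2, 1, 21, 35),
    }
    source, warnings = pick_sync_source(writes)
    print("sync source:", source)
    for line in warnings:
        print("WARNING:", line)
```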

Thu Feb 03, 2011 10:57 am

1) It does not ALWAYS happen. But it does happen to at least some customers, so we're treating this issue seriously. An upcoming version (not sure about 5.6, as it's pretty much stabilized already) will have synchronization bandwidth throttling.

2) Version 5.6 already has alert notification, monitoring and logging. So after V5.6 we'll also add a "last write" log event. Thanks for pointing this out!

mogulbumm · Thu Feb 03, 2011 5:18 pm

Anton, is there ANY way to speed up this process??? I've been down TWO DAYS waiting for a resync to complete and I cannot log in in the interim.

Is there any other way around this?

Thu Feb 03, 2011 5:19 pm

Grab V5.6, as it also addresses this issue.
