Recovery time during an iSCSI outage

lbroyles · Mon Dec 12, 2011 4:37 pm

I've been having problems where one of the nodes experiences iSCSI errors on the sync channel, which causes the partner node to go offline and fall out of sync. I've been working on this problem with Max. Right now it only happens about once every 3 weeks (so it is getting better), but when it does happen it causes major problems because HA is not recovering the way I expected.

The issue is that when this happens, all of the VMs on my 3 ESXi servers stop. The console shows the VMs as running, but no user can connect to them. Email is offline, SQL Server doesn't respond, file servers are offline, and so forth. Then after about 5-10 minutes (which seems like forever when everything halts in a production environment), most of the VMs recover and resume right where they left off. Exchange, SQL Server, and so forth just start going again as if nothing had happened, without restarting. A handful of the file servers need to be restarted because they experience a system fault.

So my question is: why do all the VMs have an issue when the partner node goes offline in this manner? Why doesn't the system simply recover and continue with little or no impact on the users? Is there some sort of setting in ESXi that is waiting for a timeout to occur that I can fix?

Thanks,
Loren.

Tue Dec 13, 2011 10:52 pm

I'll discuss your case with Max in a couple of hours (he's just back from a trip to Germany). Initially it looks like we need to find and fix two things: 1) the errors on the sync channel (if they are repeatable they should be pinpointed and killed; I'm pretty sure it's a broken switch, a bad cable, or a NIC going crazy) and 2) why primary storage stops responding when the partner goes AWOL. The second one sounds like either our issue or still something with the network, and so somehow related to the first. It should not do what it does now...
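
To pinpoint when the sync channel actually drops, a rough sketch like the one below could timestamp each loss and recovery so the times can be matched against switch and NIC logs. This is not an official tool, and it rests on assumptions: that the partner node accepts TCP connections on the standard iSCSI port 3260 on its sync interface, and that `192.0.2.10` (a placeholder address) stands in for that interface. The names `probe`, `outage_windows`, and `watch` are made up for illustration. A successful TCP connect only shows the link is reachable; it does not prove the iSCSI session itself is healthy.

```python
import socket
import time
from datetime import datetime


def probe(host, port=3260, timeout=2.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage windows.

    start is the first failed sample, end is the first good sample after it.
    end is None if the link was still down at the last sample.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def watch(host, interval=5.0, port=3260):
    """Poll forever and print a line each time the link changes state."""
    was_ok = True
    while True:
        ok = probe(host, port)
        if ok != was_ok:
            state = "UP" if ok else "DOWN"
            print(f"{datetime.now().isoformat()} {host}:{port} {state}")
            was_ok = ok
        time.sleep(interval)


# Example (placeholder address, runs until interrupted):
# watch("192.0.2.10")
```

Running `watch` from each node against its partner's sync address would give two logs to compare. If only one direction reports drops, that points toward a specific NIC or cable rather than the switch.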
