Recovery time during an iSCSI outage
Posted: Mon Dec 12, 2011 4:37 pm
I've been having problems where one of the nodes experiences iSCSI errors on the sync channel, which causes the partner node to go offline and fall out of sync. I've been working on this problem with Max. Right now it only happens about once every 3 weeks (so it is getting better), but when it does happen it causes major problems, because HA is not recovering the way I thought it would.
The issue is that when this happens, all of the VMs on my 3 ESXi servers stop. The console shows that the VMs are running, but no user can connect to them. Email is offline, SQL Server doesn't respond, file servers are offline, and so forth. Then after about 5-10 minutes (which seems like forever when everything halts in a production environment), most of the VMs recover and start functioning right where they left off. Exchange, SQL Server, and so forth just start going again without restarting, as if nothing had happened. A handful of the file servers need to be restarted because they experience a system fault.
So my question is: why do all the VMs have an issue when the partner node goes offline in this manner? Why doesn't the system simply recover and continue with little or no impact on the users? Is there some setting in ESXi that is waiting for a timeout to occur, and that I can adjust?
Thanks,
Loren.