Hi,
We have set up a Starwind HA cluster providing 8 iSCSI targets of 500GB each. Each 500GB image file is stored on its own 500GB RAID1 array. These targets provide 8 cluster shared volumes to a 2-node Hyper-V cluster. At present, however, we are not actively using the HA feature because of the issue surrounding taking both Starwind nodes down at the same time. Good to read that this is on Starwind's list, so to speak; for me a total power failure that hits both nodes and outlasts my UPS battery is more likely than any other serious fault.
Everything had worked well for 2 months but then, completely unrelated to the Starwind product itself, we experienced a drive failure in one of the RAID1 arrays. User error, however, then led to the complete loss of the array. This meant that the iSCSI target was taken offline, as was the related cluster shared volume.
This in itself was not a major issue: we lost the Hyper-V images on this volume, but we have backups and no data is stored on the affected VMs themselves. The other targets continued to operate, but one of them would go offline at 15 minute intervals and then return almost immediately. This would crash the Hyper-V machines on that volume. Careful reading of both the Windows and Starwind logs showed that the Hyper-V cluster was polling for the failed target at exactly the same interval as the volume disappeared and reappeared. It is not clear from the logs why this was happening.
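For anyone wanting to repeat this kind of log comparison, here is a rough sketch of the timestamp matching in Python. It assumes the event times have already been pulled out of the Windows and Starwind logs into plain lists; the function name `match_events`, the 30 second tolerance and the sample times are all made up for illustration and are not taken from our logs.

```python
from datetime import datetime, timedelta

def match_events(poll_times, offline_times, tolerance=timedelta(seconds=30)):
    """Pair each offline event with the nearest poll event within the tolerance."""
    matches = []
    for off in offline_times:
        # Nearest poll to this offline event; None if there are no polls at all.
        nearest = min(poll_times, key=lambda p: abs(off - p), default=None)
        if nearest is not None and abs(off - nearest) <= tolerance:
            matches.append((nearest, off))
    return matches

# Illustrative timestamps only (hypothetical, not from the real logs).
base = datetime(2000, 1, 1, 9, 0, 0)
polls = [base + timedelta(minutes=15 * i) for i in range(4)]
offlines = [p + timedelta(seconds=5) for p in polls]

pairs = match_events(polls, offlines)
# Gaps between consecutive offline events, to confirm a regular interval.
intervals = [b - a for a, b in zip(offlines, offlines[1:])]
```

If every offline event ends up paired with a poll and the gaps are all the same length, that supports the idea that the polling and the dropouts are linked, although it does not explain why.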
At the time I decided this was not too serious and focussed my efforts on working out how we had managed to break a simple RAID 1 array and on recovering the lost VMs from backup.
Two weeks on, I have today tested the HA feature because this would have mitigated the RAID failure above. I removed the original 5th target and created an HA target in its place. Everything seemed to go well, with no problems. I rebooted each node individually, re-synced, and all was well. I was concerned, however, about the power failure scenario, so I put this to the test by powering down both SAN nodes one after the other. As expected, this caused the target to go offline, stating that neither SAN node was synchronized. I began a full synchronisation. This is where the first unexpected issue arose: the target remained offline, and hence so did the 5th cluster shared volume. I was expecting that once I intervened and started the sync the target would immediately come back online. What I really hadn't expected, however, was that one of the other targets began going offline. My original issue had returned. Each time the Hyper-V cluster polled for target 5, another volume would go offline and then come back. Again I can't fathom this from the logs, and it is a different volume this time.
It is also worth pointing out that with each poll the same volume goes offline; it isn't random, except that the first time this happened it affected volume 3 and on the second occasion volume 1.
I could do with and would appreciate some help with this issue. This is a live environment and we have 7 virtual machines running. We have good backups but I still feel very vulnerable.
Thanks in advance to anyone who reads this.
Alex