Hi guys,
This is just a post to showcase how a simple fault that caused big problems was solved by accident. I was thrilled and had a facepalm moment once I figured it out.
For quite a while (a year or so?) I've been running StarWind Free as the shared storage for a W2K8 R2 Hyper-V cluster. Every once in a while I'd get an iSCSI crash which would either put the CSV into redirected mode or leave it in a failed state, and when that happens the VMs BSOD. To get out of redirected mode I have to take the CSVs offline, then disconnect and reconnect the iSCSI targets. Since this was happening randomly and infrequently, I didn't know what to make of it, so I never put any production VMs in the cluster. Really, the cluster components haven't been doing much of anything for the past year other than wasting electricity and running VMs from local storage. My ProLiant servers have dual redundant power supplies connected to separate UPSes, and the iSCSI NICs have their own switch.
Over the weekend, I logged into a seldom-accessed physical server with a single power supply and noticed the shutdown tracker screen at login, asking for a reason for the unexpected shutdown. I checked the time stamp. Hmmm. Light goes off. Back to the cluster, check the time stamp of the iSCSI failure. Same. Go to the server rack. See a red X on the 2nd UPS. Failed battery. So the servers with redundant power supplies never hiccupped, but the switch that handles the iSCSI traffic was plugged into the UPS with the bad battery. When the UPS "flickered", so did the iSCSI switch, which killed the CSVs, which killed the VMs. Aha! I thought I had split my iSCSI team across switches, but apparently not.
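If anyone wants to do the same timestamp correlation without clicking through Event Viewer on each box, something like the quick Python sketch below would do it. It's purely illustrative (not part of my setup) and just shells out to wevtutil on the local machine to pull the most recent unexpected-shutdown entries (Event ID 6008) from the System log, so the times can be lined up against when the CSVs failed.

```python
import subprocess

# Event ID 6008 = "The previous system shutdown at <time> was unexpected."
UNEXPECTED_SHUTDOWN_ID = 6008

def recent_unexpected_shutdowns(count=5):
    """Return the newest unexpected-shutdown events from the local System log."""
    # wevtutil qe = query events; /rd:true = newest first; /f:text = readable output
    cmd = [
        "wevtutil", "qe", "System",
        "/q:*[System[(EventID={0})]]".format(UNEXPECTED_SHUTDOWN_ID),
        "/c:{0}".format(count),
        "/rd:true",
        "/f:text",
    ]
    return subprocess.check_output(cmd).decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(recent_unexpected_shutdowns())
```

Run it on each suspect host and compare the timestamps it prints against the iSCSI/CSV failure times in the cluster logs.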
So, the adage about the weakest link in the chain was proven for me. As with most IT problems, enough digging turned up a logical cause. Back to work...