Starwind, Hyper-V, and Windows 2003: delayed write failure
Posted: Fri May 15, 2009 4:57 pm
Hi,
Short: I think this is more a problem with Windows 2003, but it could perhaps be addressed by a new feature in Starwind.
Long:
I'm using Starwind as iSCSI storage for a Windows 2008 cluster running Hyper-V. On this cluster there are a number of virtual machines, mostly Windows 2008 and Windows 2003. The server running Starwind is running Windows 2008. It has a nice Areca RAID controller, and lots of disks, arranged as RAID 1 volumes.
Each VM has some config data, and one or more VHD files, held in a single iSCSI target served by Starwind. Each target is an .img file.
I decided to move some of my less important .img files from a fast 10K rpm RAID 1 to a slower 7,200 rpm RAID 1. At the time, the VMs on those targets had been removed, so Hyper-V wasn't using the targets. First, I deleted the target in the Starwind console. Then I moved the .img file from one disk to the other using Windows Explorer. Once the copy had finished, I recreated the target, pointing Starwind at the new location of the .img file.
While the .img was being copied, *all* the Windows 2003 VMs started having "Delayed Write Failures" to their local disks (which are actually VHDs stored on iSCSI targets). Windows 2008 VMs were fine. The Windows 2003 VMs quickly became unstable and had to be rebooted. I've since repeated the test a few times, and I get the same problem every time. The only thing I haven't tried is restarting Starwind. I think that's what I will have to do: gracefully shut down every VM on the cluster, move the .img files around, reconfigure the targets on Starwind, restart Starwind, and finally start up the VMs.
I think the problem is that during the file copy, there is a hell of a lot more disk activity than usual on both the source and destination disks, and because both of these were simultaneously being used by Starwind for the running VMs, delayed write failures started occurring. The cache on my RAID controller didn't help, as copying a large .img would quickly fill it. I think that because Starwind was unaware of the disk activity, it couldn't do anything. The Windows 2008 VMs didn't have a problem, so I guess Microsoft have made Windows a bit more tolerant in this situation. What I can't really explain is why some of my VMs, which were on a third RAID pair not involved in the move, had the same problem!
What Starwind could do to help, potentially, is have a function to move .imgs from disk to disk, so that initiators using targets held on the same disks can be told to expect problems, and hopefully that will stop this issue. If that's impossible (e.g. if iSCSI doesn't support that) then I don't know if there is a solution - apart from not using Windows 2003 VMs!
I'm also worried that any other disk activity on the Starwind box could cause similar problems - e.g. defragging a drive holding .img files.
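If the copy saturating the disks really is the cause, one possible workaround in the meantime might be to throttle the copy rather than let Windows Explorer run flat out. Below is a minimal sketch in Python of what I mean; the function name, the chunk size, and the rate limit are just illustrative assumptions, and it has nothing to do with any actual Starwind feature. I can't say whether a slower copy would actually avoid the delayed write failures.

```python
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=1024 * 1024):
    """Copy src to dst in chunks, sleeping to stay under max_bytes_per_sec.

    Returns the number of bytes copied.
    """
    copied = 0
    start = time.monotonic()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            # How long the copy should have taken so far at the target rate.
            expected = copied / max_bytes_per_sec
            elapsed = time.monotonic() - start
            if expected > elapsed:
                # We are ahead of schedule, so pause to leave the disks
                # some headroom for other I/O.
                time.sleep(expected - elapsed)
    return copied
```

For example, `throttled_copy(r"D:\targets\vm1.img", r"E:\targets\vm1.img", 20 * 1024 * 1024)` would cap the copy at roughly 20 MB/s (both paths are made up). The right limit would depend on how much headroom the arrays have, so it would need some experimenting.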