deiruch wrote: I tested some more, also with the new beta version. Performance is fine for small LUNs (I tested 20GB and 50GB). The delays only happen with a 512GB LUN.
I placed the images on dynamic mirrored volumes with 7.2k RPM SATA (consumer) drives. They deliver around 80 MB/s burst and 10ms latency. Network bandwidth is 960 Mbit/s during live migrations; latency for 4k packets is at or below 1ms.
deiruch wrote: Tested with Iometer, default access spec and a 2GB file:
Host direct: 164 IOPS, 0.32MB/s, 6.1ms avg response time
Host iSCSI, 20GB: 134 IOPS, 0.26MB/s, 7.5ms avg response time
Host iSCSI, 400GB: 123 IOPS, 0.24MB/s, 8.1ms avg response time
VM iSCSI, 20GB: 148 IOPS, 0.29MB/s, 6.8ms avg response time
VM iSCSI, 400GB: 125 IOPS, 0.25MB/s, 8.0ms avg response time
The really interesting part is that the last test had a maximum response time of 10 SECONDS. This happened in no other test and is reproducible.
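(For what it's worth, the numbers above are internally consistent with a single outstanding IO: throughput divided by IOPS gives roughly the 2 KB transfer size of Iometer's default access spec, and IOPS multiplied by the average response time comes out at about one outstanding request, per Little's law. A minimal sketch of that cross-check, assuming those test parameters:)
[code]
# Cross-check of the Iometer results above (assumes ~2 KB transfers from the
# default access spec and the figures quoted in the post).
tests = [
    ("Host direct",       164, 0.32, 6.1),
    ("Host iSCSI, 20GB",  134, 0.26, 7.5),
    ("Host iSCSI, 400GB", 123, 0.24, 8.1),
    ("VM iSCSI, 20GB",    148, 0.29, 6.8),
    ("VM iSCSI, 400GB",   125, 0.25, 8.0),
]

for name, iops, mb_per_s, avg_ms in tests:
    transfer_kb = mb_per_s * 1024 / iops        # implied transfer size per IO
    outstanding = iops * (avg_ms / 1000.0)      # Little's law: L = X * R
    print(f"{name:18s}  ~{transfer_kb:.1f} KB/IO  ~{outstanding:.2f} outstanding IOs")
[/code]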
deiruch wrote: I'm fine with the results, except for the max latency of up to 10 seconds (!) in the 400GB test.
I repeated the test with a 400GB Iometer test file on the host (DAS) and could not reproduce latencies in that range, even when doing lots of other IO during the benchmark. The worst case I could reproduce was 1.3 seconds.
Next I ran the old test again, but this time I created a 410GB partition and tested once via StarWind and once directly, with the same results as before: the VM-over-StarWind test produced latency spikes of several seconds, while the VM-only and host-only tests had no such spikes.
Next I'll look at the stack traces of the involved applications and at the drive queues.
deiruch wrote: I was able to reproduce the problem with a 20GB LUN. As you can see, the request is stuck somewhere between the filesystem of the StarWind drive (P:) and the filesystem of the image (2x C:). The queues of the mirrored parent drives are empty (= no request hanging); the queue of the child drive contains one item (= one request hanging).
The setup is as follows:
A single server runs StarWind with a 20GB LUN; the server connects to itself. MPIO is not in use. I created a new partition (P:), formatted it with NTFS and then tried to copy some data to the partition.
This means I can ignore Hyper-V, clustering and MPIO (because all three are not in use), and the hardware (because otherwise the disk queue would show a length of 1 for one of the two C: drives).
Agree? Disagree? Suggestions what to do next?
Update: Wait a minute! The queue length is 0 everywhere. But the "active time" is 100% on the iSCSI drive...
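(One way to capture what the screenshots show, queue length and active time, across the whole duration of a hang is to sample the standard PerfMon PhysicalDisk counters with typeperf. The script below is only a sketch of that idea; the counter paths are the stock Windows ones, while the output file name and sample counts are arbitrary assumptions:)
[code]
import subprocess

counters = [
    r"\PhysicalDisk(*)\Current Disk Queue Length",
    r"\PhysicalDisk(*)\% Disk Time",
    r"\PhysicalDisk(*)\Avg. Disk sec/Transfer",
]

# 120 one-second samples, written to CSV for later correlation with the hang.
subprocess.run(
    ["typeperf", *counters, "-si", "1", "-sc", "120", "-o", "disk_counters.csv", "-y"],
    check=True,
)
[/code]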
In reply to "Do you have StarWind cache enabled? If YES, it's the Lazy Writer keeping data in RAM for a while instead of putting it to disk. It's normal.":
Yes, I did tests with small and big caches. The lazy writer should not influence the queue length, however. It performs IO after some delay, but it does not issue very long-lasting IO requests. Also, there's no problem in the underlying storage (I don't see long IOs there), and that's where the lazy writer writes to. So I doubt this is the problem.
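(To illustrate that reasoning about the lazy writer: a write-back cache delays when data reaches the disk, but each foreground write is acknowledged as soon as it lands in RAM, so by itself it should not produce multi-second response times on individual requests. A purely illustrative toy model, where the 5-second flush delay and block count are made-up values:)
[code]
import queue
import threading
import time

dirty = queue.Queue()

def lazy_writer(flush_delay=5.0):
    # Background flusher: data sits in RAM for a while before hitting "disk".
    while True:
        block = dirty.get()
        time.sleep(flush_delay)
        print(f"flushed block {block} ~{flush_delay:.0f}s after it was written")

threading.Thread(target=lazy_writer, daemon=True).start()

for block in range(3):
    start = time.perf_counter()
    dirty.put(block)                      # foreground "write" lands in the cache
    ack_ms = (time.perf_counter() - start) * 1000
    print(f"write {block} acknowledged in {ack_ms:.3f} ms")

time.sleep(16)                            # give the lazy writer time to flush
[/code]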
In reply to "10 seconds is crazy. Iometer has some issues with queue management (no exact ordering of fired requests). Can you try another tool or a real-world benchmark (your app of choice)?":
This is not an Iometer issue. Even if it issued completely bogus IO, the requests should never take more than a second or so to handle. I first noticed the issue during an ATTO benchmark run and later when copying files with Explorer. So, again, I doubt that Iometer is the problem (#1 it can't be and #2 other apps show the same problem).
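(A simple Iometer-independent reproduction, along the lines of what was suggested, could be a script that issues synchronous writes against the suspect drive and logs any request that takes longer than a second. This is a sketch under assumed parameters; the path P:\latency_probe.bin, the 64 KB block size and the request count are hypothetical, not something from the thread:)
[code]
import os
import time

PATH = r"P:\latency_probe.bin"   # hypothetical file on the StarWind-backed drive
BLOCK = b"\0" * 64 * 1024        # 64 KB synchronous writes
REQUESTS = 10_000

worst = 0.0
with open(PATH, "wb", buffering=0) as f:
    for i in range(REQUESTS):
        start = time.perf_counter()
        f.write(BLOCK)
        os.fsync(f.fileno())     # push the write through the OS cache
        elapsed = time.perf_counter() - start
        worst = max(worst, elapsed)
        if elapsed > 1.0:
            print(f"request {i}: {elapsed:.2f} s")

print(f"worst-case write latency: {worst:.3f} s")
os.remove(PATH)
[/code]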
In reply to "Inside a VM things could be very different. Let's sort out the non-virtualized problems first.":
The problems are more or less the same in a VM stored on the target and in one stored outside of it.