deiruch wrote: I tested some more, also with the new beta version. Performance is fine for small LUNs (I tested 20GB and 50GB). The delays only happen with a 512GB LUN.
I placed the images on dynamic mirrored volumes with 7.2k RPM SATA (consumer) drives. They deliver around 80 MB/s burst and 10ms latency. Network bandwidth is 960 Mbit/s during live migrations; latency for 4k packets is at or below 1ms.
deiruch wrote: Tested with Iometer, default access spec and a 2GB file:
Host direct: 164 IOPS, 0.32MB/s, 6.1ms avg response time
Host iSCSI, 20GB: 134 IOPS, 0.26MB/s, 7.5ms avg response time
Host iSCSI, 400GB: 123 IOPS, 0.24MB/s, 8.1ms avg response time
VM iSCSI, 20GB: 148 IOPS, 0.29MB/s, 6.8ms avg response time
VM iSCSI, 400GB: 125 IOPS, 0.25MB/s, 8.0ms avg response time
The really interesting part is that the last test had a maximum response time of 10 SECONDS. This happened in no other test and is reproducible.
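(For what it's worth, the numbers above are internally consistent with a single outstanding IO: throughput divided by IOPS gives roughly the 2 KB transfer size of Iometer's default access spec, and IOPS multiplied by the average response time comes out at about one outstanding request, per Little's law. A minimal sketch of that cross-check, assuming those test parameters:)
[code]
# Cross-check of the Iometer results above (assumes ~2 KB transfers from the
# default access spec and the figures quoted in the post).
tests = [
    ("Host direct",       164, 0.32, 6.1),
    ("Host iSCSI, 20GB",  134, 0.26, 7.5),
    ("Host iSCSI, 400GB", 123, 0.24, 8.1),
    ("VM iSCSI, 20GB",    148, 0.29, 6.8),
    ("VM iSCSI, 400GB",   125, 0.25, 8.0),
]

for name, iops, mb_per_s, avg_ms in tests:
    transfer_kb = mb_per_s * 1024 / iops        # implied transfer size per IO
    outstanding = iops * (avg_ms / 1000.0)      # Little's law: L = X * R
    print(f"{name:18s}  ~{transfer_kb:.1f} KB/IO  ~{outstanding:.2f} outstanding IOs")
[/code]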
deiruch wrote: I'm fine with the results, except for the max latency of up to 10 seconds (!) in the 400GB test.
I repeated the test with a 400GB Iometer test file on the host (DAS) and could not reproduce latencies in that range, even when doing lots of other IO during the benchmark. The worst case I could reproduce was 1.3 seconds.
Next I ran the old test again, but this time I created a 410GB partition and tested once via StarWind and once directly, with the same results as before: the VM-over-StarWind test produced latency spikes of several seconds, while the VM-only and host-only tests had no such spikes.
Next I'll look at the stack traces of the involved applications and at the drive queues.
deiruch wrote: I was able to reproduce the problem with a 20GB LUN. As you can see, the request is stuck somewhere between the filesystem of the StarWind drive (P:) and the filesystem of the image (2x C:). The queues of the mirrored parent drives are empty (= no request hanging); the queue of the child drive contains one item (= one request hanging).
The setup is as follows:
A single server runs StarWind with a 20GB LUN; the server connects to itself. MPIO is not in use. I created a new partition (P:), formatted it with NTFS and then tried to copy some data to the partition.
This means I can ignore Hyper-V, clustering and MPIO (because all three are not in use), and the hardware (because otherwise the disk queue would show a length of 1 for one of the two C: drives).
Agree? Disagree? Suggestions what to do next?
Update: Wait a minute! The queue length is 0 everywhere. But the "active time" is 100% on the iSCSI drive...
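(One way to capture what the screenshots show, queue length and active time, across the whole duration of a hang is to sample the standard PerfMon PhysicalDisk counters with typeperf. The script below is only a sketch of that idea; the counter paths are the stock Windows ones, while the output file name and sample counts are arbitrary assumptions:)
[code]
import subprocess

counters = [
    r"\PhysicalDisk(*)\Current Disk Queue Length",
    r"\PhysicalDisk(*)\% Disk Time",
    r"\PhysicalDisk(*)\Avg. Disk sec/Transfer",
]

# 120 one-second samples, written to CSV for later correlation with the hang.
subprocess.run(
    ["typeperf", *counters, "-si", "1", "-sc", "120", "-o", "disk_counters.csv", "-y"],
    check=True,
)
[/code]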
In reply to "Do you have StarWind cache enabled? If YES, it's the Lazy Writer keeping data in RAM for a while instead of putting it to disk. It's normal.":
Yes, I did tests with small and big caches. The lazy writer should not influence the queue length, however. It performs IO after some delay, but it does not issue very long-lasting IO requests. Also, there's no problem in the underlying storage (I don't see long IOs there), and that's where the lazy writer writes to. So I doubt this is the problem.
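(To illustrate that reasoning about the lazy writer: a write-back cache delays when data reaches the disk, but each foreground write is acknowledged as soon as it lands in RAM, so by itself it should not produce multi-second response times on individual requests. A purely illustrative toy model, where the 5-second flush delay and block count are made-up values:)
[code]
import queue
import threading
import time

dirty = queue.Queue()

def lazy_writer(flush_delay=5.0):
    # Background flusher: data sits in RAM for a while before hitting "disk".
    while True:
        block = dirty.get()
        time.sleep(flush_delay)
        print(f"flushed block {block} ~{flush_delay:.0f}s after it was written")

threading.Thread(target=lazy_writer, daemon=True).start()

for block in range(3):
    start = time.perf_counter()
    dirty.put(block)                      # foreground "write" lands in the cache
    ack_ms = (time.perf_counter() - start) * 1000
    print(f"write {block} acknowledged in {ack_ms:.3f} ms")

time.sleep(16)                            # give the lazy writer time to flush
[/code]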
In reply to "10 seconds is crazy. Iometer has some issues with queue management (no exact ordering of fired requests). Can you try another tool or a real-world benchmark (your app of choice)?":
This is not an Iometer issue. Even if it issued completely bogus IO, the requests should never take more than a second or so to handle. I first noticed the issue during an ATTO benchmark run and later when copying files with Explorer. So, again, I doubt that Iometer is the problem (#1 it can't be and #2 other apps show the same problem).
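(A simple Iometer-independent reproduction, along the lines of what was suggested, could be a script that issues synchronous writes against the suspect drive and logs any request that takes longer than a second. This is a sketch under assumed parameters; the path P:\latency_probe.bin, the 64 KB block size and the request count are hypothetical, not something from the thread:)
[code]
import os
import time

PATH = r"P:\latency_probe.bin"   # hypothetical file on the StarWind-backed drive
BLOCK = b"\0" * 64 * 1024        # 64 KB synchronous writes
REQUESTS = 10_000

worst = 0.0
with open(PATH, "wb", buffering=0) as f:
    for i in range(REQUESTS):
        start = time.perf_counter()
        f.write(BLOCK)
        os.fsync(f.fileno())     # push the write through the OS cache
        elapsed = time.perf_counter() - start
        worst = max(worst, elapsed)
        if elapsed > 1.0:
            print(f"request {i}: {elapsed:.2f} s")

print(f"worst-case write latency: {worst:.3f} s")
os.remove(PATH)
[/code]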
In reply to "Inside a VM things could be very different. Let's sort out the non-virtualized problems first.":
The problems are more or less the same in a VM stored on the target and in one stored outside of it.