I have two ESXi 5.0 hosts and one storage server. Both ESXi hosts have two interfaces dedicated to iSCSI, and the storage has four. Each interface is configured in its own network segment. All connections go through two separate HP V1910 series switches (to get real failover), without jumbo frames (the results are the same when I use direct connections). My goal is to have both failover and load balancing, but... well... it works in a somewhat unexpected way. To clarify my issues I present two tests.
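For reference, each host sees the StarWind LUN over both of its iSCSI interfaces; a quick way to double-check the paths and the active policy is something like this (the naa.* device ID below is only a placeholder for the actual LUN identifier):
Code:
# list all paths to the LUN and their state (placeholder device ID)
esxcli storage core path list --device=naa.xxxxxxxxxxxxxxxx

# show which path selection policy (Fixed / Round Robin) is currently active
esxcli storage nmp device list --device=naa.xxxxxxxxxxxxxxxx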
TEST 1
I log in to ESXi host A over SSH and run the following command:
Code:
# sequential write of 10 GiB of zeros in 1 MB blocks
dd if=/dev/zero of=10GiB_v1.dat bs=1M count=10240
My expectation was something around 200 MB/s (two 1 GbE paths aggregated), but I get only ~90 MB/s, as can be seen in the following screenshots.
The first one shows the 5-second average write speed from esxtop; it ranges from 85 to 90 MB/s. The second one shows what is happening with the network cards: the data transfers complement each other! When one NIC is maxed out, the second one sits at zero... The third screenshot shows what happens when the same command is started with the multipathing policy set to Fixed and the policy is changed to Round Robin during the transfer. In the beginning one channel is used, which is expected. But when the policy changes, the transfer is split equally between both paths, yet the sum stays the same as for a single path (1 x GbE). I admit I expected the transfer speed to double...
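For completeness, the mid-transfer policy switch was done roughly with these commands (again, the device ID is a placeholder):
Code:
# start the test with the Fixed policy...
esxcli storage nmp device set --device=naa.xxxxxxxxxxxxxxxx --psp=VMW_PSP_FIXED
# ...and switch to Round Robin while dd is still running
esxcli storage nmp device set --device=naa.xxxxxxxxxxxxxxxx --psp=VMW_PSP_RR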
TEST 2
Finally, I ran the above test on both hosts simultaneously (starting with the Fixed policy and switching to Round Robin on both hosts during the transfer). [Unfortunately I cannot upload more than three attachments, so I will post those screenshots in the next post.] This test showed that two parallel transfers are possible without interfering with each other (neither dd command slowed down because of the other transfer). But the effect known from the single-host experiment is exactly the same: instead of 4 x 1 GbE I get 4 x 0.5 GbE. After that I'm pretty sure the limit is not the network but the ESXi round robin algorithm, or the way the StarWind Target handles round robin.
SUMMARY
In all tests the Round Robin policy was set to "IOPS" mode, with the I/O operations limit for switching paths set to 1.
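That limit was applied roughly like this (placeholder device ID once more):
Code:
# switch paths after every single I/O instead of the default 1000
esxcli storage nmp psp roundrobin deviceconfig set --device=naa.xxxxxxxxxxxxxxxx --type=iops --iops=1

# verify the resulting round robin settings
esxcli storage nmp psp roundrobin deviceconfig get --device=naa.xxxxxxxxxxxxxxxx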
What the heck is going on? In theory round robin works, but why do I get only half of the aggregated wire speed? ESXi is the most recent build (5.0.0 build 768111), and so is StarWind (5.8.2013). Is it a networking problem? I doubt it; the switch logs do not show any conflicts, dropped packets, etc. Why can't I get the aggregated wire speed? And I'm not sure where the problem actually is: the StarWind iSCSI Target or ESXi. Any help, comments, questions and discussion are appreciated.