HA Write Performance

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

jimbyau
Posts: 22
Joined: Wed Nov 21, 2012 2:12 am

Fri Aug 08, 2014 6:01 pm

I have discovered, after testing our v8 SANs, that when you compare the write performance of a virtual disk running on a single SAN with that of a virtual disk running in a 2-node synchronous HA configuration, the write performance of the synchronous storage is around 50% or less, as shown below.

Is this normal behaviour for HA storage? If I were to add a tertiary node, should we expect another 50% write reduction? What should the expectation be?
2014-05-03_2227.png (13.32 KiB)
2014-05-03_2229.png (12.1 KiB)
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Fri Aug 08, 2014 6:19 pm

Two-node HA is always slower than single-node because of the extra overhead in transferring the data across the synchronisation network. The "I'm finished writing" acknowledgement can't be returned until both nodes have confirmed they have written the block.
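To make that ordering concrete, here is a toy latency model of a synchronous mirrored write (the numbers are made up for illustration, and this is not StarWind's actual implementation — just the general shape of the problem):

```python
# Toy model of per-write latency: single node vs. 2-node synchronous mirror.
# DISK_WRITE_MS and SYNC_RTT_MS are assumed, illustrative figures.

DISK_WRITE_MS = 0.5   # assumed time to commit a block locally
SYNC_RTT_MS   = 0.1   # assumed round trip on the sync network

def single_node_latency_ms():
    # ACK returns as soon as the local write completes
    return DISK_WRITE_MS

def ha_latency_ms():
    # Node A writes locally while the block travels to Node B, but the
    # "I'm finished writing" ACK waits for the slower of the two paths.
    remote_path = SYNC_RTT_MS + DISK_WRITE_MS
    return max(DISK_WRITE_MS, remote_path)

print(single_node_latency_ms())  # ≈0.5 ms
print(ha_latency_ms())           # ≈0.6 ms
```

The point is that the HA write can never be faster than the remote leg, so per-command latency always goes up; how much throughput drops depends on how well the sync network hides that extra leg.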

Which is why StarWind suggests you throw as much capacity at the synchronisation network as possible, e.g. a dedicated 10GbE network interface with a simple cross-over cable between the two nodes. Even a shared 10GbE switch in the middle adds unwanted overhead.

Should it be 50% slower? Don't know - what are you using for synchronisation network?

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Fri Aug 08, 2014 6:22 pm

So if you're only "just" happy with the performance of your single-node SAN, then you should also consider other ways to offset this overhead when switching to HA, such as SSD L2 caching, as much RAM as you can muster for write-back caching, bigger caches in the disk controller, etc.
jimbyau
Posts: 22
Joined: Wed Nov 21, 2012 2:12 am

Fri Aug 08, 2014 6:43 pm

Agreed, which is why I have 2 x 10GbE sync channels, with crossover optics by Solarflare & Arista.

We also use SSD caching at the hardware level, but the reality is that if the additional overhead is in the networking between the SANs, SSD caching does not help even if I switch it on for writes. The testing I performed was with write-back memory caching enabled with 8GB RAM, and no reads or other I/O against the LUN, which does not leave much overhead except the performance of the StarWind sync channel.

I tried various configurations: removing the crossovers and pumping the sync data through our Arista switches to see if that affected performance, which it didn't. I tried turning LSO/RSO on and off, and tried various RSS settings and buffers on the NICs, but performance did not vary at all - the result is the same.

I had to come to the conclusion that there is a 50% write performance reduction with the StarWind HA that I was not able to overcome. I was reasonably happy with the single-node performance, but the write performance on HA is still a little too slow for our requirements. I see no way to overcome it based on my tests - I have $35,000 of network kit running our two SANs, with everything chosen specifically for low latency. Since the read performance is pretty fantastic, I am forced to conclude that the bottleneck is a design limitation of the synchronisation. If there is any way to make the write acknowledgement faster at the software level, it would greatly improve IOPS on HA writes, but I cannot achieve it with hardware!

Thanks
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Fri Aug 08, 2014 10:43 pm

Rob is absolutely correct: you have one extra network "hop" between the nodes, as a packet P delivered to Node A needs to be sent from Node A -> Node B and confirmation received before Node A can fire back the "data is here" confirmation to the original caller. So YES, writes will be slower in an HA config than in a non-HA one (reads are another story: their performance does INCREASE, because the number of independent MPIO paths to the same mirrored data is increased).

How slow? It depends on a) how fast the backbone connection really is (Rob is correct again) - so running a symmetric config is not recommended, always use something MUCH faster for the backbone (we don't need switches for that, so the setup is an inexpensive one) - and b) how loaded your config is. Normally you have a pipeline, so with a reasonably long queue (you'll never see fewer than ~64 commands pending in Hyper-V or vSphere write queues) commands are handled at different stages of completion: command N is just being fired from the caller to Node A, command N-1 is on its way from Node A -> Node B, command N-2 is returning an ACK from Node B -> Node A, and command N-3 is being reported "done" by Node A to the caller.

ATTO Disk Benchmark is far from the best tool to measure performance here, so we always recommend either Intel IOMeter or your production config simplified (better: AND, not OR). However... in your case the numbers are TOO low even for ATTO, so I'd suggest welcoming our engineers to a remote session to see what could be improved (I don't like that reads did not increase and actually decreased, which means MPIO is not kicking in for some reason - that's already a "full stop" here).
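Anton's pipeline argument can be sketched numerically. The stage times below are invented for illustration (this is not a model of StarWind internals): with a deep queue, different commands occupy different stages at once, so sustained throughput is bounded by the slowest single stage rather than by the full round-trip latency.

```python
# Illustrative four-stage write pipeline, per Anton's description.
# All stage times are assumed, not measured.

STAGE_MS = {
    "caller -> Node A": 0.05,
    "Node A -> Node B": 0.05,
    "Node B ACK -> Node A": 0.05,
    "Node A done -> caller": 0.05,
}

def total_time_ms(n_commands, queue_depth):
    per_cmd_latency = sum(STAGE_MS.values())
    if queue_depth == 1:
        # strictly sequential: each command waits for the previous one
        return n_commands * per_cmd_latency
    # fully pipelined: after the pipe fills, one command completes
    # per slowest-stage interval
    slowest = max(STAGE_MS.values())
    return per_cmd_latency + (n_commands - 1) * slowest

print(total_time_ms(64, queue_depth=1))   # ≈12.8 ms
print(total_time_ms(64, queue_depth=64))  # ≈3.35 ms
```

With a queue depth of 64 (what Hyper-V/vSphere typically keep in flight), the pipelined run finishes in roughly a quarter of the sequential time in this toy model, which is why per-command latency alone does not dictate HA throughput.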
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

jimbyau
Posts: 22
Joined: Wed Nov 21, 2012 2:12 am

Sat Aug 09, 2014 4:25 am

Hi Anton,

What do you mean by a symmetric config ?

I don't see how I can make the backbone any faster. I cannot saturate the links between the SANs, nor improve the transactional performance with hardware - I've installed 2 x 10GbE crossovers between the SANs for the backbone link using Solarflare adapters and Arista optics, at $2,000 apiece; I cannot buy anything "faster". I toyed with testing Emulex cards for the SANs, but that is just more of the same.

I turned on Direct I/O in my benchmarks and bypassed any write caching on the virtual machine to get better statistics on the iSCSI. When I use the default settings, the "production performance" benchmark does improve to more acceptable levels, but we can thank Windows & RAM for that, not the StarWind storage network. The reality is, it reveals how badly writes suffer in a multi-node HA environment. I realise this problem isn't necessarily unique to StarWind, and I am just floating it in the community for ideas on how to improve it, as it is a limitation of highly available storage and it prevents write-heavy customers from embracing this kind of storage technology.

I wonder if StarWind has considered alternatives for the backbone link? Emulex offers an API stack for TCP that is supposed to be insanely faster than the Windows TCP stack and allows direct access to the hardware - it's a common approach for transactional performance in the share-trading industry; Windows cannot deliver at this level.

Regarding the read performance, I turned off round-robin MPIO on the VMware HA LUNs. I found that when running benchmark testing from a single virtual machine, a fixed path to a single SAN had better read performance than a round-robin configuration, even when I reduced the round-robin IOP count to 1 or 3 from the default. With MPIO enabled, the throughput was not very consistent and was on average 20-30% slower. We are using Emulex 10GbE hardware iSCSI controllers with two ports per card. Without round-robin I can max the reads at around 900+ MB/s (line speed) off a single SAN, but I have to use a 4MB, 8MB, or 16MB block size to do that. I cannot achieve those predictable results with round-robin enabled. It appears there is additional overhead in the round-robin MPIO process that is hurting performance in my configuration.

So to increase the overall throughput of our HA environment across multiple SANs and virtual machines, I have been manually load-balancing the paths of different virtual disks to pull the read data off different SANs. This improves overall performance across the entire environment, but does nothing for an individual VM's read throughput. Since I am happy with the VM read performance, I do not mind this approach.

Thanks everyone.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Sat Aug 09, 2014 8:46 am

1) Symmetric config is when uplinks from your hypervisor hosts are the same as a backbone between storage HA nodes.

2) "Direct I/O" on ATTO does not bypass cache. Also you should be using Overlapped I/O policy and not Neither as you get a huge latency between multiple I/Os fired by benchmark. VMware does not work this way.

3) Yes, we'll do native InfiniBand soon. But in your case issue is in another place.

4) Before diving deeper into details, please a) provide a diagram of the interconnections and some config details (so we don't have to guess which hypervisor you use, whether you are hyper-converged or "Compute and Storage Separated", which storage back end you use - FLAT or LSFS - and what caches you run, etc.), and b) be ready to organize a remote session with our engineers. Because from what I'm getting, your setup should run much faster than it runs now.

5) Backbone: you're not going to get 100% utilization of a fast backbone with "pulsating" traffic from a single VM.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Sat Aug 09, 2014 1:33 pm

jimbyau wrote:
Agreed, which is why I have 2 x 10GbE sync channels, with crossover optics by Solarflare & Arista.
Well I'd say you've got your sync network pretty sorted! If I've done my math right, the raw speed of 2 x 10GbE = 2,560MB/s?

20Gbit/s = 20 x 1024 x 1024 x 1024 bit/s; ÷ 8 ÷ 1024 ÷ 1024 = 2,560MB/s

I admit I have ZERO idea how much overhead there is in transferring a 512-byte block from A to B, as there are many, many layers in the way, but let's SAY it's a whopping 50% overhead, so that the raw transfer speed is reduced to ~1,300MB/s. I would guess the real figure is somewhere between the two.

Here's my complete guess at the top-level IO path:

Code:

ATTO -> Windows disk system -> iSCSI -> StarWind node #1
At this point I guess the I/O splits into two parallel tasks:
	Task #1: Node #1 -> disk controller -> disk -> ACK
	Task #2: Node #1 -> network stack -> 2 x 10GbE network -> node #2 network stack -> StarWind node #2 -> disk controller -> disk -> ACK -> network stack -> network -> network stack -> StarWind node #1
At this point, both acknowledgements have been received
ACK -> iSCSI -> Windows disk system -> ATTO
On a single node, the bottleneck is most likely the disk. On the 2-node setup, there is a certain amount of parallelism going on, as the disk writes can be carried out at the same time, but node #2 is bound to be behind node #1 because of the requirement to transfer the block over the sync network.

So given that made-up-but-probably-too-low figure of 1,300MB/s above over the network, does the network feel like the bottleneck?

Well, looking at the OP's 512MB figures, they are getting a pretty good 1,100MB/s write speed @ 512MB block size. So *if* the network throughput is around 1,300MB/s, then you've got two similar *sequential* I/O speeds.

So for writing a single 512MB block, the total time could be close to TWICE the single-node time: a 512MB block @ 1,300MB/s across the network and then @ 1,100MB/s to the disk. The return ACK won't take long, as there is only a small amount of data going back.

But of course, there are two completely unqualified assumptions in here:

1. The overhead of a network is 50% - I hope it isn't that!
2. StarWind does everything sequentially - parallelism/queues could allow block #2 to be coming across the network whilst block #1 is being written

#2 could massively reduce the overhead of the network on a multi-block transfer.

I know it's never that clean and this is hugely simplified but 50% to me seems too high.
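A quick back-of-envelope check of the numbers above (all figures are the assumed/observed ones from this thread, not measurements of StarWind itself):

```python
# Sequential vs. overlapped transfer of one 512MB block, using the
# thread's assumed figures: ~1,300MB/s effective sync network and
# ~1,100MB/s disk write speed.

BLOCK_MB  = 512
NET_MBPS  = 1300   # assumed effective sync-network throughput
DISK_MBPS = 1100   # assumed disk write throughput

def sequential_ms(block_mb):
    # assumption #2 above: network transfer, then disk write, in series
    return (block_mb / NET_MBPS + block_mb / DISK_MBPS) * 1000

def pipelined_ms(block_mb):
    # if transfer and write overlap chunk-by-chunk, the slower stage dominates
    return block_mb / min(NET_MBPS, DISK_MBPS) * 1000

print(round(sequential_ms(BLOCK_MB)))  # ≈859 ms
print(round(pipelined_ms(BLOCK_MB)))   # ≈465 ms
```

So under these assumptions, a fully sequential design costs nearly double the disk-only time, while chunk-level pipelining brings the HA write back to roughly the speed of the slowest stage - which is exactly why assumption #2 matters so much.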

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Sat Aug 09, 2014 1:49 pm

jimbyau wrote:
Since the read performance is pretty fantastic, I am forced to conclude the performance bottleneck is the design limitation of the synchronisation.
I'd say it's a little early to say that with certainty, but I know where you're coming from. I'm doing a lot of lab work at the moment, and even with my 2 x 1GbE sync channels, synchronisation does seem slower than gut instinct says it should be.

Even synchronising a disk the first time when adding it to HA seems to take much longer than you'd expect - one tends to know how fast you can copy a 50GB file across the network. In fact, I've just done it on one of our Hyper-V nodes with v6 StarWind systems. This isn't using 10GbE - it's using 4 x 1GbE MPIO. Reading and writing to the same cluster volume (so really thrashing those SAS hard disks), it took about 5 minutes to copy a 50GB VHDX file - a transfer speed of 150MB/s, which is pretty reasonable considering the older disk system we have under there.

But in my lab, which can sustain similar I/O speeds (because I'm using SSD), the sync time for a 50GB volume is measured in hours, not minutes. Okay, I've only got 2 x 1GbE in the lab, but that's only twice as slow. If one copies the same volume of data between the two E: drives across the single 1GbE LAN, it's typically faster.
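For reference, the copy figure above checks out arithmetically (using binary GB, as file-copy tools usually report):

```python
# Sanity check: 50GB copied at 150MB/s should take roughly 5-6 minutes.
size_mb = 50 * 1024        # 50GB expressed in MB
rate_mb_per_s = 150        # observed copy throughput
seconds = size_mb / rate_mb_per_s
print(seconds / 60)        # ≈5.7 minutes, consistent with "about 5 minutes"
```

Which makes the hours-long initial HA sync of the same 50GB over 2 x 1GbE look all the more out of proportion.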

I'm currently making some adjustments to my lab, so I can't give any figures here - more later...

But considering it should be possible to parallelise the network sync and the write to the mirrored node, I would not expect 50% as a design aim...

Cheers, Rob.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Sat Aug 09, 2014 1:50 pm

BTW - have you tried removing disk I/O as much as possible from the equation by trying this with a RAM disk?

Cheers, Rob.
jimbyau
Posts: 22
Joined: Wed Nov 21, 2012 2:12 am

Sun Aug 10, 2014 2:22 am

robnicholson wrote:
I'd say it's a little early to say that for certainty but I know where you are coming from. I'm doing a lot of lab work at the moment and even with my 2 x 1GbE sync channels, synchronisation does seem lower that gut instinct feels it should be.

Even synchronising a disk the first time when adding to HA seems to take much longer than you'd expect,
Cheers, Rob.
I've noticed this as well, but I think there is more to it than just copying the data. There also seems to be some logic in the process for throttling, to allow other requests to be served. Even when my arrays were not in use at all, changing the sync priority on the virtual disk would change the synchronisation time, even though there were no other requests against that disk array.
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Mon Aug 11, 2014 11:02 am

Lab is nearly back online so some more tests this week.
barrysmoke
Posts: 86
Joined: Tue Oct 15, 2013 5:11 pm

Wed Aug 13, 2014 3:38 am

Did you nttcp-test your network connections between the nodes? Are jumbo frames set up correctly all the way through?
You mention a lab, so are your SANs running virtual? You mentioned VMware, so if the SANs are virtual, you have to use the paravirtual SCSI driver in the Windows install to get any kind of real disk performance out of them.

Also, test a thick iSCSI target - I understand LSFS has some performance limitations currently.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am
Contact:

Wed Aug 13, 2014 11:22 am

Rob, may we schedule the remote session to your system? I think it will be the most efficient way to get to the bottom of the issue.
If you are OK with that, please email me (av@starwind.com) and we'll schedule the time.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
robnicholson
Posts: 359
Joined: Thu Apr 14, 2011 3:12 pm

Wed Aug 13, 2014 11:38 am

Hi Anatoly - I think you need to do that with the original poster jimbyau. I've got involved in this thread because I'm doing lots of lab work on v8 and HA. My lab work isn't quite finished yet.

Cheers, Rob.