HA Sync Performance


anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sat Sep 18, 2010 5:02 pm

Guys! The current StarWind HA design does not report "OK" to the writer before the data has actually been written to both disks (first and second node). While this is super-stable and robust, it limits the whole system to the performance of the HA cross-link. The upcoming version (5.6 I guess, should be released before the end of the year) will allow TWO write policies: 1) as is, the write is "ACK"-ed only when both disks have physically written the data, and 2) the write is "ACK"-ed as soon as it is stored in both nodes' CACHE. Together with improved cross-link handling (we'll do our own trunking and will utilize a high-performance, iSCSI-based interconnection protocol), this will limit HA node performance to CACHE / 2. So a two-node configuration will work FASTER than a single-node one (round-robin, with requests cross-split, of course). Some of the improvements will be available in beta versions as "optional" logic MUCH earlier. So apply for the beta program and stay tuned :)
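
To make the difference between the two policies easier to picture, here is a minimal Python sketch of the idea (purely illustrative, not StarWind code; Node, HaDevice and every method name here are invented):

Code:

import threading

class Node:
    """Toy storage node: one dict stands in for the disk, another for the write-back cache."""
    def __init__(self, name):
        self.name = name
        self.disk = {}
        self.cache = {}

    def commit_to_disk(self, lba, data):
        self.disk[lba] = data          # the "physical" write

    def cache_store(self, lba, data):
        self.cache[lba] = data         # sits in RAM only


class HaDevice:
    """Two-node mirror illustrating the two ACK policies described above."""
    def __init__(self, local, partner, policy="write-through"):
        self.local, self.partner, self.policy = local, partner, policy

    def write(self, lba, data):
        if self.policy == "write-through":
            # Policy 1 (current): ACK only after BOTH disks hold the data,
            # so every write waits on the sync link plus the partner's disk.
            self.local.commit_to_disk(lba, data)
            self.partner.commit_to_disk(lba, data)     # crosses the sync link
        else:
            # Policy 2 (planned): ACK once the block is in BOTH nodes' caches;
            # the physical flush happens later, off the write path.
            self.local.cache_store(lba, data)
            self.partner.cache_store(lba, data)        # crosses the sync link
            threading.Thread(target=self._flush, args=(lba, data)).start()
        return "ACK"                                    # writer unblocks here

    def _flush(self, lba, data):
        self.local.commit_to_disk(lba, data)
        self.partner.commit_to_disk(lba, data)


ha = HaDevice(Node("node1"), Node("node2"), policy="cache-ack")
ha.write(0, b"block")

The only real difference is where "ACK" is returned relative to the slow disk commits; in the cache-ack case only the sync-link hop and a memory copy sit on the write path.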
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

camealy
Posts: 77
Joined: Fri Sep 10, 2010 5:54 am

Sat Sep 18, 2010 6:29 pm

I definitely will stay tuned, but what is the best practice to squeeze every bit of performance out of the current version? Do the MPIO settings, or pointing to different primary vs. secondary HA img's on the same array, matter?

Also, if performance out of HA just isn't enough, how does the mirror mode work? Is it just as slow?
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sun Sep 19, 2010 12:52 pm

1) Enable write-back cache on both nodes.

2) Use multiple physical connections between the two nodes and create a virtual "trunk" channel out of them.

3) Use the Round-Robin MPIO policy, as it splits the load between the two nodes and utilizes the full-duplex nature of 1 and 10 GbE (see the sketch below).
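
As a rough picture of why Round-Robin helps, here is a tiny, hypothetical Python sketch (the session names are made up) of how such a policy alternates I/O across the two paths:

Code:

from itertools import cycle

# Hypothetical illustration of what a Round-Robin MPIO policy does: consecutive
# I/O requests alternate between the two iSCSI sessions (one per HA node), so
# traffic flows over both links and in both directions at once, which is what
# exploits the full-duplex nature of the NICs.
paths = cycle(["session_to_node1", "session_to_node2"])

def dispatch(io_request):
    path = next(paths)                     # next path in round-robin order
    print(f"{io_request} -> {path}")

for i in range(4):
    dispatch(f"write #{i}")                # write #0 -> node1, #1 -> node2, ...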
camealy wrote:I definitely will stay tuned, but what is the best practice to squeeze every bit of performance out of the current version? Do the MPIO settings, or pointing to different primary vs. secondary HA img's on the same array, matter?

Also, if performance out of HA just isn't enough, how does the mirror mode work? Is it just as slow?
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

camealy
Posts: 77
Joined: Fri Sep 10, 2010 5:54 am

Sun Sep 19, 2010 3:14 pm

That is exactly how we have all our HAs set up. So until the performance is usable for high-random-write environments, would mirroring perform any better? Or would that still be limited by the dual writes?
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sun Sep 19, 2010 6:24 pm

OK. So what hardware configuration do you have, and what is the performance of your resulting cross-link channel?

No. A mirror write is "ACK"-ed only when both disks have the data written to them.
camealy wrote:That is exactly how we have all our HAs set up. So until the performance is usable for high-random-write environments, would mirroring perform any better? Or would that still be limited by the dual writes?
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

camealy
Posts: 77
Joined: Fri Sep 10, 2010 5:54 am

Sun Sep 19, 2010 7:53 pm

One of our client's current configs is below...

-4x DL160 G5/G6 Hyper-V cluster hosts
-Quad-port Intel Gigabit ET NIC: 2 ports in a Static Link Aggregation bond for iSCSI access, 1 for heartbeat, 1 for Hyper-V, onboard for mgmt
-2x ML350 G6 StarWind hosts in an HA set
-Dual-port Intel Gigabit ET NIC in an SLA bond for iSCSI targets, another dual-port Broadcom HP card bonded for sync, onboard for mgmt
-P410 RAID card with 512MB battery-backed cache in each
-6x 3.5in 300GB 15K SAS drives (2 RAID 5 arrays in each StarWind host so the img's sit on separate arrays)
-RAID arrays and host partitions all matched to a 64K cluster size
-Have tried D-Link (higher-end series), HP, and Allied Telesis switches with similar results (D-Link being the fastest due to the larger port buffers)
-VLAN for sync, VLAN for iSCSI, VLAN for heartbeat, not used for anything else (separate switches for management and VMs)
-Jumbo frames all around
-StarWind img's of 500GB on each array, primary and partner on separate servers (read cache only because the data is CRITICAL); cluster hosts access targets through Round-Robin MPIO
-Everything top to bottom is 2008 R2

Running about 8 virtual machines, a mix of database and file servers, plus 1 RDP server. Performance gets horrible when any decent traffic is going on. Sometimes I will even get alerts from our managed-services software that the servers are not responding and have gone down when one large sequential write or read (i.e. a backup) is happening on one of the VMs. Disk queue lengths measured directly on the StarWind host arrays sit at 0-1 even during intense load, while disk queues inside the cluster VMs go up to 25-50... basically the OS is constantly waiting on the "drives".

I was working with Bob before he left, and he was going to try out some things to help us: fix all the bugs we were experiencing first, then the performance issue. (We are a reseller and have been trying to make this a repeatable, reliable, and well-performing solution for our clients.) One of our largest deployments went down several times, the last time during an attempted upgrade. At that point Bob reconnected us to one side of the HA set and performance at least doubled. It has been great for 3-4 months, but we have no high availability on our data. We have deployed this three other times for our clients and they are all having the performance problem, some with as few as 3 VMs.

Thanks for any help you can offer!


Kurt
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sun Sep 19, 2010 8:02 pm

First question is: how fast is your cross-link actually? Can you run iPerf and NTttcp over the bonded connection to see how much data it can pump and what latency it has?
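
If iPerf or NTttcp isn't at hand, a crude sanity check of the raw one-way TCP throughput over the sync link can be done with a few lines of Python (a hypothetical sketch; HOST, PORT and the 10-second duration are arbitrary and need adjusting to your sync NIC addresses):

Code:

import socket, sys, time

HOST, PORT = "10.0.0.2", 5001      # hypothetical sync-link IP of the partner node
CHUNK = b"\0" * (1 << 20)          # 1 MiB sent per call
SECONDS = 10                       # length of the test

def server():                      # run this on one node
    srv = socket.socket()
    srv.bind(("", PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    while conn.recv(1 << 20):      # just drain whatever arrives
        pass

def client():                      # run this on the other node
    s = socket.create_connection((HOST, PORT))
    sent, start = 0, time.time()
    while time.time() - start < SECONDS:
        s.sendall(CHUNK)
        sent += len(CHUNK)
    elapsed = time.time() - start
    print(f"~{sent * 8 / elapsed / 1e6:.0f} Mbit/s one way")

if __name__ == "__main__":
    # "python nettest.py server" on one node, "python nettest.py client" on the other;
    # run a client in each direction at the same time to see full-duplex behaviour.
    server() if sys.argv[1] == "server" else client()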
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

camealy
Posts: 77
Joined: Fri Sep 10, 2010 5:54 am

Sun Sep 19, 2010 8:46 pm

If I were to run these on a live system (with the VMs paused), do these tools have the potential to take down my sync channel?
heitor_augusto
Posts: 9
Joined: Mon Sep 13, 2010 2:20 pm

Sun Sep 19, 2010 10:32 pm

anton (staff) wrote: While this is super-stable and robust, it limits the whole system to the performance of the HA cross-link.
So if I have two storage arrays with a local write performance of 400 MBps, connected by a 1 Gbit cross-link, theoretically the performance of an HA device should be approximately 112 MBps, and the cross-link should be used at 100%. Is this correct?
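
For reference, the arithmetic behind that ~112 MBps ceiling looks roughly like this (the overhead percentage is a rough assumption, not a StarWind figure):

Code:

# Back-of-the-envelope for a 1 Gbit sync link:
link_bits_per_s = 1_000_000_000            # raw 1 GbE signalling rate
raw_bytes_per_s = link_bits_per_s / 8      # 125 MB/s
overhead = 0.10                            # ~10% lost to Ethernet/IP/TCP/iSCSI framing (assumed)
usable = raw_bytes_per_s * (1 - overhead)
print(f"~{usable / 1e6:.0f} MB/s")         # ~112 MB/s ceiling for HA writes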

In our case, using the ATTO benchmark in a Windows 2008 VM, utilization of the link is around 50%, limiting the write performance of the HA device to approximately 57 MBps.

The Windows 2008 VM is running on a VMware ESX 4.1 host with four 1 Gbit paths to the HA device. The path policy is set to Fixed, and the tests were performed using four different paths.

We also observed that when there is a fault on one of the nodes and we force a fast synchronization, utilization of the cross-link is also around 50% during the synchronization.

A test with NTttcp shows a throughput of 939.078 Mbit/s on the cross-link.
camealy
Posts: 77
Joined: Fri Sep 10, 2010 5:54 am

Sun Sep 19, 2010 11:38 pm

960.07 Mbps and sub-1ms latency during the test

anton (staff) wrote:First question is: how fast is your cross-link actually? Can you run iPerf and NTttcp over the bonded connection to see how much data it can pump and what latency it has?
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Sep 20, 2010 3:58 pm

1) This sucks, as you should be getting doubled performance with 2 NICs. I'd expect 1.7-1.8 Gbps at least (in a single direction).

2) You need to run TWO tests at the same time to measure the full-duplex capability. In RR mode the two nodes send data in both directions at the same time.

3) Which app are we talking about? iPerf or NTttcp?
camealy wrote:960.07 Mbps and sub-1ms latency during the test

anton (staff) wrote:First question is: how fast is your cross-link actually? Can you run iPerf and NTttcp over the bonded connection to see how much data it can pump and what latency it has?
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

camealy
Posts: 77
Joined: Fri Sep 10, 2010 5:54 am

Mon Sep 20, 2010 5:33 pm

I had hoped it would use both as well, but in all our tests LACP and SLA bonds will keep each individual TCP session on 1 leg of the bond. If there is a network whiz out there who knows a way around this, I am all ears!
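
For what it's worth, that behaviour is inherent to how 802.3ad/static aggregation distributes traffic; here is a tiny, hypothetical sketch of the idea (real switches and teaming drivers hash MACs, IPs and/or ports in various combinations):

Code:

# Rough illustration of typical LACP / static-LAG load balancing: every packet
# of a given flow produces the same hash, so the whole session sticks to one leg.
def pick_leg(src_ip, dst_ip, src_port, dst_port, legs=2):
    return hash((src_ip, dst_ip, src_port, dst_port)) % legs

# One iSCSI or sync session = one 5-tuple = the same leg for every packet:
print(pick_leg("10.0.0.1", "10.0.0.2", 49152, 3260))
print(pick_leg("10.0.0.1", "10.0.0.2", 49152, 3260))   # identical result

# Only additional sessions (e.g. MPIO with several connections, or multiple
# sync channels) can hash onto the other leg and use the second port:
print(pick_leg("10.0.0.1", "10.0.0.2", 49153, 3260))   # may land on the other leg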

Those numbers were from NTttcp, and I can test again in both directions, but regardless, we aren't even seeing half of the 120MB-plus bandwidth this test shows (unless we are connected to one side only, with the exact same hardware).
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Tue Sep 21, 2010 2:45 am

Another way to test the sync network performance is to set up a RAM disk target on one node and benchmark it from the other. I was struggling to beat 60% of 10GbE until I did this: http://blogs.technet.com/b/winserverper ... t-vms.aspx (it may only apply to Hyper-V VMs though - I run StarWind as a Hyper-V VM, but your NIC driver may have similar settings).

Results using Atto:
peak of 676MB/sec write, 814MB/sec read - when both nodes set to 2MB send and receive buffers
642MB/sec write, 981MB/sec read - with 4MB buffers

ATTO only allows a queue depth of 10; maybe with more I could have maxed out the 10GbE - I had CPU to spare, and it's a direct connection, no switch.

This was a 512MB RAM disk, formatted NTFS with 8192-byte allocation units. I'm using jumbo frames and the recommended TCP/IP tweaks (well, the ones that apply in a Hyper-V VM).

I know that this isn't exactly the same scenario as HA sync, but it is a good way of testing that the network is OK. Any modern server should have no trouble maxing out 1GbE. 10GbE is a little harder!
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Wed Sep 22, 2010 6:33 pm

Here are some results from within a VM, testing an HA target with WB cache.

The path is like this:

Windows (SBS 2008) running on a Hyper-V cluster
VHD on CSV on starwind target
single 10GbE port on hyper-v cluster node
path 1: 10GbE switch, 10GbE to starwind 1
path 2: 10GbE switch, 1GbE to another switch, 10GbE to Starwind 2 (so 10% of max speed of path 1)
Starwind running inside hyper-v VM
Hyper-V passthrough disk (not VHD)
Areca 1680ix RAID controller on each box
RAID 1 of 7200rpm 2.5" SATA disks

Other aspects
256MB of WB cache on HA target, with very long cache expiry
Intel NICs, RSS enabled, 4 RSS queues, 9k jumbo frames
8k allocation units used (except on actual hard drives, where 4k was used)

Test with ATTO: 32MB (to fit in cache with plenty to spare), overlapped I/O, 1K, 2K, 4K and 8K tests, QD of 10.

Test 1 - round robin, so 50% of I/O goes over the slower path

Code:

1K - Write 182270 (178MB/sec) - Read 203606 (199MB/sec)
2K - Write 168979 (165MB/sec) - Read 198312 (194MB/sec)
4K - Write 175398 (171MB/sec) - Read 198782 (194MB/sec)
8K - Write 172651 (169MB/sec) - Read 183859 (180MB/sec)
Test 2 - failover only, so 0% of I/O goes over the slower path

Code:

1K - Write 105875 (103MB/sec) - Read 511148 (499MB/sec)
2K - Write  79512 ( 78MB/sec) - Read 512407 (500MB/sec)
4K - Write  97585 ( 95MB/sec) - Read 478171 (467MB/sec)
8K - Write  96262 ( 94MB/sec) - Read 461174 (450MB/sec)
Test 3 - least queue depth

Code:

1K - Write 189154 (184MB/sec) - Read 329773 (322MB/sec)
2K - Write 188155 (184MB/sec) - Read 319956 (312MB/sec)
4K - Write 183313 (179MB/sec) - Read 320328 (313MB/sec)
8K - Write 186413 (182MB/sec) - Read 297218 (290MB/sec)
Test 4 - least blocks

Code:

1K - Write 188232 (183MB/sec) - Read 325528 (318MB/sec)
2K - Write 181162 (177MB/sec) - Read 285743 (279MB/sec)
4K - Write 183313 (179MB/sec) - Read 304292 (297MB/sec)
8K - Write 180967 (177MB/sec) - Read 273117 (267MB/sec)
Going by what Anton has said, the main bottleneck is that even with WB cache, writes aren't acknowledged until they are committed to disk on the second server.
Edit: ignore the next sentence - tested using ATTO on the StarWind servers, and the underlying RAID can indeed do at least 200MB/sec.
I assume that the first server ACKs as soon as the data is in cache, otherwise my writes would not be as fast as they are (the underlying RAID 1 cannot achieve 100MB/sec).

If I make both paths 10GbE then I'd expect round robin etc to come up to 500MB/sec reads and 180MB/sec writes.

My Areca's WB cache doesn't seem to be contributing much to performance - will have to look into this.

Edit: on testing using ATTO on the StarWind servers against the same RAID 1 used for the HA img, I get 200MB/sec on one server and 500MB/sec on the other. Probable reason: the slower server has a lot more RAID volumes in use. I will have to test again with the Areca cache disabled so that only the StarWind cache is in play.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Oct 25, 2010 9:11 pm

Grab the new beta... We've probably doubled both reads and writes in it (and IOPS should jump through the roof). We'd be very interested in seeing checks on some real-life 10 GbE configs :) Thanks!
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software
