HA Sync Performance


DavidMcKnight
Posts: 39
Joined: Mon Sep 06, 2010 2:59 pm

Wed Sep 08, 2010 9:33 pm

Okay, let me give some more detail on my setup:

I am running Starwind for a VMware environment. My VMware hosts, like my datastore boxes, have both 10gig ports teamed, with one port set active and the other set standby. There are two paths between any host and any single iSCSI target, that target being replicated on two datastores (hence two paths). I have set VMware with a preferred datastore path, the primary node of the HA. So in theory all the VM hosts (5 of them) have only one active path, to the primary node of the HA datastore (no round robin). The primary node should then be handling all the requests and passing any write changes down to the secondary node.

With that in mind, I'm curious about the data path when HA is running. I have a 10gig network connection, and I'm assuming it's a 10gig full-duplex connection; for simplicity, let's say for the moment that we're not using any cache. If a series of iSCSI packets comes in to Starwind, and those packets represent data that is supposed to be written back to the datastore, what happens to that block of data? Starwind passes one copy of the data to the hard drive controller (in my case an Areca RAID controller) and passes a second copy back out the network card (in my case an Intel 10Gb dual port CX4 card), destined for the IP address of the secondary datastore? The secondary datastore receives the HA data via Microsoft's iSCSI initiator software. Somehow, on that secondary datastore, Starwind is monitoring the iSCSI initiator traffic, and when it sees a WRITE command it sends a copy of the data to the hard drive controller. Is this simplistic narrative correct?
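
To make my mental model concrete, here is a rough sketch in plain Python of the flow I'm imagining. Every name here is made up and this is purely a conceptual model, not StarWind internals:

    # Conceptual sketch of the HA write path as I picture it (not StarWind code).
    from dataclasses import dataclass

    @dataclass
    class WriteRequest:
        lba: int       # logical block address from the iSCSI WRITE command
        data: bytes    # payload carried in the iSCSI PDUs

    class Node:
        def __init__(self, name):
            self.name = name
            self.blocks = {}                  # stands in for the RAID controller / array

        def commit_to_controller(self, req):
            self.blocks[req.lba] = req.data   # handed to the local RAID controller

    class PrimaryNode(Node):
        def __init__(self, name, partner):
            super().__init__(name)
            self.partner = partner            # secondary node reached over the sync NIC

        def handle_iscsi_write(self, req):
            # 1) one copy goes to the local controller (the Areca card in my case)
            self.commit_to_controller(req)
            # 2) a second copy goes back out the 10Gb NIC to the partner's IP,
            #    and the partner commits it to its own controller
            self.partner.commit_to_controller(req)
            # 3) presumably the WRITE is only acknowledged to the ESX initiator
            #    at this point, so both copies exist - that's the part I'd like confirmed
            return "SCSI: GOOD"

    secondary = Node("datastore-B")
    primary = PrimaryNode("datastore-A", partner=secondary)
    print(primary.handle_iscsi_write(WriteRequest(lba=42, data=b"\x00" * 512)))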
camealy
Posts: 77
Joined: Fri Sep 10, 2010 5:54 am

Fri Sep 10, 2010 2:25 pm

We are having the exact same performance problem with HA Starwind. It seems to run fine if we re-attach our .img files to one side only, but the moment HA starts up we see huge performance lags.

We have one large internal deployment and two client-side deployments of HA, and they all perform identically across a variety of hardware: HP servers, whitebox servers, Broadcom NICs, Intel NICs, HP switches, D-Link switches, Allied Telesis stacked switches, 15K SAS drives, RAID 5, RAID 10... Nothing seems to change the performance other than moving to one side only.

What is the resolution? Would performance increase if I tried more of a DR approach like mirroring, or would that perform just as badly as the HA setup?

Thanks for any assistance you can offer!
heitor_augusto
Posts: 9
Joined: Mon Sep 13, 2010 2:20 pm

Mon Sep 13, 2010 2:23 pm

We are testing a Starwind HA setup and are also having significant performance issues. Here are the details:

- Starwind 5.4
- Windows 2008 R2 Standard
- Two storage boxes: Intel S3420GPLC, 3ware 9650SE-24M8, 24 Western Digital WD5001AALS hard drives
- Sync link using two Cat6 crossover cables and Intel 82574L Gigabit network cards

If we attach a VMware ESX 4.1 host to a single target (no HA), we can write data at the expected rate of 117 MB/s.

However, if we start up HA, even with an 8 GB write-back cache enabled, performance drops to around 30 MB/s. In this case, bandwidth usage on the sync link doesn't go above 45%, showing there's no bottleneck there.
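
For reference, a quick back-of-the-envelope check (Python, rough numbers, ignoring iSCSI/TCP overhead) of how those figures compare to the 1 GbE line rate - the HA figure is nowhere near saturating either the front-end link or the sync link:

    # Quick sanity check on the numbers above (rough, ignores protocol overhead).
    GBE_MBPS = 1_000_000_000 / 8 / 1_000_000   # raw 1 GbE payload ceiling, ~125 MB/s
    no_ha = 117    # MB/s measured against a single target
    with_ha = 30   # MB/s measured with HA enabled
    print(f"no HA:   ~{no_ha / GBE_MBPS:.0%} of one GbE link")
    print(f"with HA: ~{with_ha / GBE_MBPS:.0%} of one GbE link")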

Has anyone had any success in sorting this out yet?
Constantin (staff)

Mon Sep 13, 2010 2:26 pm

We are now testing the synchronization process to see whether the low performance is caused by our software or not. I'll post an update tomorrow or the day after.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Sep 13, 2010 2:29 pm

Both nodes have write-back cache enabled?
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

camealy
Posts: 77
Joined: Fri Sep 10, 2010 5:54 am

Mon Sep 13, 2010 2:57 pm

I am not speaking for the other user's setup, but we were told that in a critical-uptime environment (which is the only place we use HA) write-back was not recommended. I assume there is no mechanism built into the sync between nodes that would protect against a system failure the way the battery does on a RAID card?

Thanks,

Kurt
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Mon Sep 13, 2010 5:21 pm

I've only just started looking at the performance of HA; until now I have been concentrating on reliability. I've just posted some numbers in the beta forum (I'm testing StarWind_5.5_beta_20100821). Anyway, with WB cache I've managed to hit maximums of 500+ MB/sec reads in tests where the data fits into cache and I use the Failover Only MPIO policy.

A lot of people seem to be experiencing issues when using the VMware initiator, and my tests are with the Windows one, so this could be a factor.

Theory:
Although Starwind HA is like RAID 1 in that data ends up on two "disks", it's not quite the same. With controller-based RAID 1 you've got two drives being written to almost in lock step - the two drives' heads stay in sync with only a tiny latency between them. Starwind sits on top of the file system, and in turn on top of the disks / RAID cards, so the drive heads on one node are going to be moving independently of the other, which means more latency.

With round robin MPIO, a write comes in on one Starwind box, which then has to send the data to the other before it can confirm back to the client that it has been written. At the same time, writes are coming into the other node and the same process has to happen in the other direction. I can't see how this can happen without a lot more disk thrashing - effectively your sequential I/O becomes more random. So with round robin MPIO it's unreasonable to expect RAID 1 performance characteristics from Starwind HA (synchronous replication). Write-back caching (on the RAID card and/or Starwind) should mitigate this, provided you don't overrun the caches.
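
To put very rough numbers on that (every figure below is invented - service times, sync RTT, one outstanding write - it's only meant to show the shape of the effect):

    # Toy model of acknowledged-write latency under synchronous replication.
    # All figures are assumptions, not measurements.

    def ha_write_ms(local_ms, partner_ms, sync_rtt_ms):
        # An HA write can only be acknowledged once the slower of the two
        # nodes has committed it, plus the round trip on the sync link.
        return max(local_ms, partner_ms) + sync_rtt_ms

    SEQ_MS = 0.5    # assumed cost of a write landing where the head already is
    SEEK_MS = 8.0   # assumed cost when interleaved traffic forces a seek first
    RTT_MS = 0.3    # assumed sync-link round trip

    cases = {
        "single node, sequential": SEQ_MS,
        "HA, failover-only MPIO": ha_write_ms(SEQ_MS, SEQ_MS, RTT_MS),
        "HA, round-robin MPIO": ha_write_ms(SEEK_MS, SEEK_MS, RTT_MS),
    }

    for label, ms in cases.items():
        throughput = (64 / 1024) / (ms / 1000)   # MB/s at 64 KB writes, one at a time
        print(f"{label:26s} ~{ms:4.1f} ms/write  ->  ~{throughput:5.1f} MB/s")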

Rhetorical question: if you had a single drive capable of, say, a maximum of 100 MB/sec when transferring a single large file, would you really expect to be able to transfer two large files simultaneously at 50 MB/sec each?

I'm not denying the existence of a problem, just trying to think of a reason for it... And everything I've put above about round robin MPIO causing more disk thrashing than failover must be irrelevant in DavidMcKnight's case, as he's verified that he's using failover MPIO.
camealy
Posts: 77
Joined: Fri Sep 10, 2010 5:54 am

Mon Sep 13, 2010 6:22 pm

Could you provide more details on the MPIO setup you are using? A lot of our installations were done by Bob when he was there.
heitor_augusto
Posts: 9
Joined: Mon Sep 13, 2010 2:20 pm

Mon Sep 13, 2010 7:02 pm

anton (staff) wrote: Both nodes have write-back cache enabled?
Yes, both nodes have write-cache and queueing (NCQ) enabled.
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Mon Sep 13, 2010 7:17 pm

camealy wrote: Could you provide more details on the MPIO setup you are using? A lot of our installations were done by Bob when he was there.
MPIO is a client-side thing, not a setting on Starwind. For Windows 2008 R2: in the iSCSI initiator, once you've set up your paths, select one of them and click "Devices", then click "MPIO". You have the following load-balancing policy choices:

Failover only
Round robin (the default)
Round robin with subset
Least Queue Depth
Weighted Paths
Least Blocks

If you choose Failover only you can set which path is active and which standby.

I expect that VMware's initiator has similar options.
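
If you'd rather script it than click through the GUI, the same policies should be settable with Microsoft's mpclaim.exe. I'm going from memory on the flags and the policy numbers, so treat this as a sketch and check "mpclaim /?" on your own boxes first:

    # Rough scripted equivalent of the GUI steps above, using mpclaim.exe.
    # Flags and policy numbers are from memory - verify with "mpclaim /?".
    # Run from an elevated prompt on the Windows initiator (e.g. the Hyper-V
    # host), not on the Starwind box; VMware hosts need their own tooling.
    import subprocess

    # Policy numbers in the same order as the GUI list above (assumed mapping).
    POLICIES = {
        "failover_only": 1,
        "round_robin": 2,
        "round_robin_with_subset": 3,
        "least_queue_depth": 4,
        "weighted_paths": 5,
        "least_blocks": 6,
    }

    def set_policy_for_all_mpio_disks(policy: str) -> None:
        """Apply one load-balancing policy to every MPIO-claimed disk."""
        subprocess.run(["mpclaim.exe", "-l", "-m", str(POLICIES[policy])], check=True)

    def show_mpio_disks() -> str:
        """Dump the MPIO disk list so you can confirm the policy took."""
        result = subprocess.run(["mpclaim.exe", "-s", "-d"],
                                check=True, capture_output=True, text=True)
        return result.stdout

    if __name__ == "__main__":
        set_policy_for_all_mpio_disks("failover_only")
        print(show_mpio_disks())

Note that choosing which path is active and which is standby still has to be done per path in the Devices > MPIO dialog as far as I know; the command line only sets the policy itself.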
camealy
Posts: 77
Joined: Fri Sep 10, 2010 5:54 am

Mon Sep 13, 2010 7:39 pm

That is good info. I believe all our Hyper-V clusters are set up with Round Robin to the Starwind HA. Is there a performance benefit to going Failover Only? And if so, what is the downside from an actual failure standpoint?
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Tue Sep 14, 2010 9:34 am

camealy wrote: That is good info. I believe all our Hyper-V clusters are set up with Round Robin to the Starwind HA. Is there a performance benefit to going Failover Only? And if so, what is the downside from an actual failure standpoint?
The performance benefit should come from two things: 1) less disk thrashing, and 2) one path being faster than the other, as in my network. If you are fully 1GbE or fully 10GbE then you won't see benefit 2), but you should still see 1).

You can still do active/active, by the way, without disk contention, if each .img is held on a different RAID set - e.g. make a different Starwind server preferred for each HA target. Best performance is going to be one .img per RAID set. Also, if you are write-heavy or do a lot of random I/O then you may want to stay away from RAID 5 and 6. I stick to RAID 1 mostly, and sometimes use RAID 10.

Downside: I've not tested the different load balancing policies enough to see if there is a difference in failover time, but failover reliability doesn't seem to be an issue. See my post on the beta forum: http://www.starwindsoftware.com/forums/ ... t2199.html - if you are using 2008 R2 on the Starwind or Hyper-V boxes, you may need a hotfix from Microsoft.

One annoyance with Failover Only is that if you do have a failover and then bring the server back up, Microsoft MPIO doesn't automatically fail back to your preferred route. Maybe Weighted Paths could effectively achieve this - give your preferred path a weight of 0 and the other a weight of 1.
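
My (untested) understanding of Weighted Paths is that the DSM simply sends I/O down the lowest-weighted path that is still up, which would give you failback for free. A little model of that assumption:

    # Toy model of the Weighted Paths idea above: I/O goes to the lowest-weighted
    # path that is currently up. Behaviour assumed, not verified.

    def pick_path(paths):
        available = [p for p in paths if p["up"]]
        return min(available, key=lambda p: p["weight"])["name"]

    paths = [
        {"name": "path to preferred node", "weight": 0, "up": True},
        {"name": "path to partner node", "weight": 1, "up": True},
    ]

    print(pick_path(paths))      # preferred path while it is healthy
    paths[0]["up"] = False
    print(pick_path(paths))      # fails over to the weight-1 path
    paths[0]["up"] = True
    print(pick_path(paths))      # and returns to the preferred path automatically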
ChrisB
Posts: 4
Joined: Tue Aug 31, 2010 1:00 am

Tue Sep 14, 2010 11:01 pm

In our testing the bottleneck has not been the disks thrashing as you suggest, for three reasons: the disk-activity meter is rather low when we are hitting this limit (~20% active at most); we've tried load balancing policies that only utilize one path; and it also happens during a sync from box to box (an entirely sequential copy from one SAN to the other).
camealy
Posts: 77
Joined: Fri Sep 10, 2010 5:54 am

Wed Sep 15, 2010 12:50 am

That is why I am wondering if the MPIO adjustments will really help. We run performance monitors on the Starwind hosts, and disk queue length is non-existent during times of high load. Yet inside a Hyper-V VM running in a Starwind-hosted cluster, the disk queues go up to 30-35. We have one setup with an 8-drive 15K RPM SAS array on both ends of the HA, all 4Gb LACP links through stacked switches, jumbo frames, etc., and it runs horribly under load. Switch it to one side and it runs fine.
DavidMcKnight
Posts: 39
Joined: Mon Sep 06, 2010 2:59 pm

Wed Sep 15, 2010 4:16 pm

Let's break this down into smaller pieces.

Those of you who are using HA and can set up the following tests, please answer these questions:

1. With no HA enabled, what speeds are you seeing when transferring data to and from both datastores?

2. With an HA target created, what speeds are you seeing when transferring data to and from HA volumes and non-HA volumes on either/both datastores?

3. What are the size and RAID level of the HA volume you're running these tests on?

4. What caching mode did you use on your HA and non-HA volumes inside Starwind? (e.g. HA 32 MB WB, non-HA 16 MB WT, or perhaps none)

5. What caching mode is your RAID card using for your HA and non-HA volumes? (e.g. card has 2 GB cache, RAID array/volume set to WT)


Thanks

David McKnight