We've noticed some peculiar behaviour when our HA images are doing a full sync.
First, our test setup is one RAID1 volume on each server. The servers run Windows Server 2012 R2 with 16 GB RAM and have the following network adapters:
Access Team - 4x 1 Gbps NICs (4 Gbps) teamed using Windows (these are not used to connect or synchronize via Starwind)
iSCSI 1 - 1 Gbps NIC - HB
iSCSI 2 - 1 Gbps NIC - HB
iSCSI 3 - 1 Gbps NIC - HB
Sync 1 - 1 Gbps NIC - HB & Sync
Sync 2 - 1 Gbps NIC - HB & Sync
Sync 3 - 1 Gbps NIC - HB & Sync
The three sync NICs are directly connected to each other using crossover cables. The iSCSI NICs are connected to a switch that is spec'd to handle 16 Gbps. The other NIC team is connected to a separate switch and network.
We're running the drives on an LSI MegaRAID 9280-24i4e with the latest drivers and firmware. We're on the latest V6 build of Starwind, and our NIC drivers are the latest ones straight from the Intel website.
Now for the problem. When we run a full sync on a single volume, we expect to see approximately 50% of each sync adapter in use (I'm not sure where I read that, but it was somewhere on the website). Our testing instead shows the NICs running at approximately 60-80 Mbps. Even across 3 channels, that's only about 200 Mbps of sync traffic, which is nowhere near the full capacity of those links (3 Gbps) or even the expected performance at 50% (1.5 Gbps). It also means syncing a 500 GB volume takes pretty much an entire day, which is ridiculous and unacceptable.
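To put that in perspective (rough numbers, ignoring protocol overhead): 500 GB is about 4,000 Gb, so at the expected 1.5 Gbps a full sync should finish in roughly 45 minutes, and even at the ~200 Mbps we actually observe it should be closer to 5-6 hours, so the effective average rate over the whole sync seems to be even lower than what we see at the NICs.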
On closer inspection, we found that the NICs in the access team that we thought were operating at 1 Gbps were actually only running at 100 Mbps. Taken across all 4 NICs, that gives 400 Mbps. We then noticed that the total traffic across the sync NICs was roughly 200 Mbps, or half the total capacity of that team. Our guess was that the Starwind service was treating that team's bandwidth as its top speed and setting its limits based on that. With this hypothesis in mind, we corrected the issue preventing those NICs from operating at full speed and tested again.
Still only 60-80 Mbps per sync NIC.
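For anyone wanting to repeat the check, this is roughly how we now verify the negotiated link speed on every adapter before a test run (a minimal sketch using Python's psutil; it isn't specific to Starwind and the output format is just our own):

```python
import psutil

# Print each adapter's state and negotiated link speed.
# psutil reports speed in Mbps, or 0 if the OS doesn't expose it.
for name, stats in psutil.net_if_stats().items():
    state = "up" if stats.isup else "down"
    print(f"{name:25s} {state:5s} {stats.speed:>6d} Mbps")
```

Had we run something like this earlier, the access-team NICs negotiating at 100 Mbps would have jumped out immediately.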
Next we decided to test the actual throughput of the HDDs and NICs. Using IOMeter, the disk performance looked completely normal (I didn't record the numbers at the time, but they were not unusual). We then tested the bandwidth across each NIC, including the teamed NICs, and they all ran at their expected speeds.
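For the NIC bandwidth checks we just pushed raw TCP between the two boxes over each link; something like the following Python sketch (the port number and addresses are arbitrary placeholders, bind to the IP of the NIC you want to exercise) gives a rough point-to-point figure:

```python
import socket
import sys
import time

CHUNK = 1 << 20          # 1 MiB payload per send
DURATION = 10            # seconds the client pushes data
PORT = 5201              # arbitrary test port

def server(bind_ip):
    # Receive and discard data, then report throughput.
    with socket.create_server((bind_ip, PORT)) as srv:
        conn, _ = srv.accept()
        total, start = 0, time.time()
        while (data := conn.recv(CHUNK)):
            total += len(data)
        elapsed = time.time() - start
        print(f"received {total / 1e6:.0f} MB in {elapsed:.1f}s "
              f"= {total * 8 / elapsed / 1e6:.0f} Mbps")

def client(server_ip):
    # Push zero-filled buffers as fast as possible for DURATION seconds.
    payload = b"\0" * CHUNK
    with socket.create_connection((server_ip, PORT)) as conn:
        end = time.time() + DURATION
        while time.time() < end:
            conn.sendall(payload)

if __name__ == "__main__":
    role, ip = sys.argv[1], sys.argv[2]   # e.g. "server 172.16.10.1"
    server(ip) if role == "server" else client(ip)
```

Run the server role on one node with the address of the link under test, and the client role on the other node pointing at that same address, one link at a time.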
We've also found that accessing the iSCSI volumes and running performance tests against them gives results consistent with the speeds we see directly on the disks, minus some overhead, so the targets themselves seem fine. What is odd is that when we do write testing, the write speed reported by IOMeter seems to have no correlation with the bandwidth used on the sync channels. I suppose that could just be an oddity of what IOMeter is doing and how Starwind figures out what to synchronize.
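In case it helps anyone reproduce the comparison, this is roughly how we watched the sync adapters while the write test was running (a small Python/psutil sketch; the adapter names below are just our labels, substitute whatever Windows calls yours):

```python
import time
import psutil

# Names of the sync adapters as they appear in Windows (our labels).
SYNC_NICS = ["Sync 1", "Sync 2", "Sync 3"]

def sample_sync_traffic(interval=1.0, samples=60):
    """Print the per-second transmit rate (Mbps) on each sync adapter."""
    prev = psutil.net_io_counters(pernic=True)
    for _ in range(samples):
        time.sleep(interval)
        cur = psutil.net_io_counters(pernic=True)
        line = []
        for nic in SYNC_NICS:
            delta = cur[nic].bytes_sent - prev[nic].bytes_sent
            line.append(f"{nic}: {delta * 8 / interval / 1e6:6.1f} Mbps")
        print("  ".join(line))
        prev = cur

if __name__ == "__main__":
    sample_sync_traffic()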
At this point we're out of ideas for what else could be causing the problem. When every other operation on these NICs and HDDs performs normally and only Starwind does not, we're stuck with the conclusion that the software is either not configured properly or isn't working properly.
Does anyone know what next steps we can take?