ESXi 5 to 2008 R2 based StarWind

Software-based VM-centric and flash-friendly VM storage + free version


NetWise
Posts: 7
Joined: Sat Feb 18, 2012 6:00 am

Sat Feb 18, 2012 6:23 am

I'm trying to figure out if I'm doing something wrong. I've installed Windows 2008 R2 on a Dell PowerEdge 2950, Quad Core, 8GB RAM, 6x146GB SAS internal. I installed StarWind and set it up with two add-in NICs for iSCSI, leaving the onboard NICs for data. These two iSCSI NICs are on a different subnet, 10.0.1.x. I've created the target and disk, and ensured the networking portion only sees/selects the two iSCSI NICs. I set the two NICs to Jumbo Frames (MTU=9000), and set them up on a separate switch configured for Jumbo Frames.
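For what it's worth, a quick way I've been sanity-checking that jumbo frames actually survive end-to-end from the StarWind box is a don't-fragment ping at full size - a rough sketch below, assuming Python 3 is installed on the Windows server and using made-up 10.0.1.x addresses for the ESXi iSCSI VMkernel ports:

# Jumbo-frame sanity check from the Windows SAN box.
# 8972 bytes of ICMP payload + 28 bytes of headers = a 9000-byte packet;
# -f sets Don't Fragment, so the ping only succeeds if every hop allows MTU 9000.
import subprocess

VMKERNEL_IPS = ["10.0.1.11", "10.0.1.12"]  # example ESXi iSCSI VMkernel addresses

for ip in VMKERNEL_IPS:
    result = subprocess.run(
        ["ping", "-f", "-l", "8972", "-n", "2", ip],
        capture_output=True, text=True,
    )
    status = "OK" if "TTL=" in result.stdout else "FAILED (check switch/NIC MTU)"
    print(ip + ": jumbo frames " + status)

The same check works from the ESXi side with vmkping -d -s 8972 against the SAN's iSCSI IPs.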

On the ESXi side, I've:

Created a new vSwitch and set it for Jumbo Frames, MTU=9000
Added two VMkernel port groups - iSCSI1, iSCSI2 - one per NIC.
Each port group is set up to have only one pNIC assigned as active, with the other set to unused.
Each port group is set to Jumbo Frames, MTU=9000
Each port group has the Failback=No option set.

In the software iSCSI initiator, I have provided dynamic discovery targets for each of the two IPs on the StarWind SAN, and bound both VMkernel ports to the HBA.

I'm able to scan for and find the target, and make a VMFS volume. The volume is set for Round Robin, and IOPS=1.
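For anyone wanting the command-line equivalent of the Round Robin / IOPS=1 setting, something along these lines should do it from the ESXi shell (a rough sketch - the naa. identifier is a placeholder for the StarWind LUN, and ESXi 5's shell should already have a Python interpreter):

# Switch the StarWind LUN to Round Robin and lower the path-switch threshold to 1 I/O.
import subprocess

DEVICE = "naa.XXXXXXXXXXXXXXXX"  # placeholder - find the real ID with "esxcli storage nmp device list"

# Select the Round Robin path selection policy for the device.
subprocess.check_call(["esxcli", "storage", "nmp", "device", "set",
                       "--device", DEVICE, "--psp", "VMW_PSP_RR"])

# Change paths after every I/O instead of the default 1000.
subprocess.check_call(["esxcli", "storage", "nmp", "psp", "roundrobin", "deviceconfig", "set",
                       "--device", DEVICE, "--type", "iops", "--iops", "1"])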

Whenever I copy data, clone, or svMotion to the volume, it just takes forever. When I look at Task Manager and the NICs on the SAN, I see it using very little of the network.

I've been fighting this now for a couple of days, thinking maybe it's because I have older SuperMicro servers, PCI-X or PCI-E NICs, maybe I did something wrong with the Windows install, the switches could be bad, ESXi could be wrong, etc. So for the heck of it, I brought home a surplus EqualLogic PS5000XV from work to try. Same cables, same switches, same settings, same iSCSI setup on ESXi. Added the target IPs, rescanned, found and used the volume - I'm *easily* getting 200MB/sec sustained off two NICs' worth of bandwidth at the higher block sizes, and svMotion/cloning is only limited by the fact that I have one SATA disk per host as local storage to migrate to the SAN.
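(For context, ~200MB/sec is right about the practical ceiling for two gigabit paths once you allow for protocol overhead - quick back-of-the-envelope, with the ~90% efficiency figure just a rough assumption:

# Rough ceiling for 2 x 1GbE iSCSI paths with Round Robin across both.
LINKS = 2
LINK_GBPS = 1.0                 # gigabits per second per NIC
PROTOCOL_EFFICIENCY = 0.9       # rough allowance for Ethernet/IP/TCP/iSCSI overhead

raw_mb_s = LINKS * LINK_GBPS * 1000 / 8        # 250 MB/s raw
usable_mb_s = raw_mb_s * PROTOCOL_EFFICIENCY   # ~225 MB/s usable

print("Raw: %.0f MB/s, usable: ~%.0f MB/s" % (raw_mb_s, usable_mb_s))

So the EqualLogic numbers tell me the network side of my lab is basically saturated.)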

Now for what it's worth, I had the same sort of issues with the Microsoft iSCSI Target v3.3 QFE. So I'm willing to believe it's something in that server. But it's a clean install of Windows 2008 R2, all Dell drivers installed, firmware is current, and I've tried a few different NICs, including the onboards vs. add-ins, etc. Nothing seems to be working as I'd expect.

Now I've seen some details indicating that for MPIO, I should have two different vSwitches with one pNIC/port group per vSwitch, and each should be on its own subnet. Granted my experience is with EqualLogic, and every vendor is different, but I've built out 8 data centers using 12 EQLs and never had to do that.
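Either way, my understanding is that the part that actually matters for MPIO is that each VMkernel port ends up bound to the software iSCSI adapter. As a rough sketch, this is the shell equivalent of what I did through the client (the adapter and vmk names are just examples - check "esxcli iscsi adapter list" and the host's networking config for the real ones):

# Bind the two iSCSI VMkernel ports to the software iSCSI adapter (ESXi 5).
import subprocess

ADAPTER = "vmhba33"              # software iSCSI adapter - example name
ISCSI_VMKS = ["vmk1", "vmk2"]    # one VMkernel port per physical NIC

for vmk in ISCSI_VMKS:
    subprocess.check_call(["esxcli", "iscsi", "networkportal", "add",
                           "--adapter", ADAPTER, "--nic", vmk])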

Am I missing something? I'm really hoping I can prove this works, so that I can use it as a home lab, and we might be able to get it for the surplus lab at work for the guys to learn and work on, as well as possibly some ROBO-type installs. (So this isn't just an "I like EqualLogic" troll post - I honestly am trying.)
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Feb 20, 2012 4:20 pm

Hello NetWise,

To provide you with a solution we need to know the following information:
- What StarWind target device is used in the test?
- Have you had a chance to run real benchmarks of StarWind, and can you provide us with the numbers?
- What vendor are the NICs installed in the SAN server from? Have you installed the latest version of the drivers?
- Is there a switch in the system?
- Have you benchmarked the HDDs where StarWind is installed?

Also, there is a document for pre-production SAN benchmarking:
http://www.starwindsoftware.com/images/ ... _guide.pdf
And a list of advanced settings which should be implemented in order to gain higher performance in iSCSI environments:
http://www.starwindsoftware.com/forums/ ... t2293.html
http://www.starwindsoftware.com/forums/ ... t2296.html

Have you tried disabling Delayed ACK (go to vSphere Client, Configuration, Storage Adapters, iSCSI adapter Properties ==> Advanced, last option)?
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
NetWise
Posts: 7
Joined: Sat Feb 18, 2012 6:00 am

Mon Feb 20, 2012 5:39 pm

Thanks for the reply - and I have positive progress to report!

Initially my problem was that the ESXi hosts would scan for and find the LUNs and I could make a VMFS volume - but upon trying to copy to them, it just... never seemed to even start.

So I've tried a number of things:

* The onboard Broadcoms on the PowerEdge 2950 are MUCH better.
* They require the latest or more up-to-date drivers. Definitely use the Dell-provided drivers rather than Broadcom's or Windows Update's, etc.
* The HP NC7170 / Intel 1000 MT compatible PCI-X cards should be tossed - they're definitely part of the problem.
* The switch was an older PowerConnect 5224, which is pretty ancient. I swapped in a 5424 from work, and that's better (10%?), although it didn't change drastically. It validates that the 5224 will likely work but is not ideal, which I went in understanding anyway.
* The StarWind target is the latest I think, only downloaded last week - 5.8.1889.
* Two sets of disks, both on a Dell PERC5/i RAID controller with BBU cache - 4x146GB RAID5 15K SAS (Tier1) and 2x1TB RAID1 7.2K SATA (Tier2)
* Benchmarks on the host itself are great, using ATTO, and match what I've come to expect from a PERC5/i historically. Obviously above a certain point the gains slow down and peak, but that's normal. But something weird - when I run ATTO in a VM on the same LUN/target, my benchmarks are good up to about a 32KB block size, and then the writes start coming back down. Reads continue to go up similar to the local disk, but are obviously limited by the 2x 1GbE connections. I need to get some screenshots so I can paste them (there's also a rough block-size sweep sketch just after this list). I've assumed that perhaps the writes drop at the high end because of network contention or lack of write cache at the host side, based on other screenshots I've seen in the forum.
* I have not yet tried disabling Delayed ACK. I understood this was sort of a last resort and not typically recommended?
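For anyone who wants to reproduce the block-size sweep without ATTO, here's a very rough sketch (Python 3, run inside the VM against a drive letter on the iSCSI-backed volume - the path and sizes are just examples, and it's sequential-write-only with a single fsync at the end, so treat the numbers as ballpark):

# Very rough sequential-write sweep across block sizes, ATTO-style.
import os, time

TEST_FILE = r"E:\bench.tmp"   # placeholder path on the volume under test
TOTAL_MB = 256                # amount written per block size

for block_kb in (4, 32, 64, 256, 1024):
    block = os.urandom(block_kb * 1024)
    count = (TOTAL_MB * 1024) // block_kb
    start = time.time()
    with open(TEST_FILE, "wb", buffering=0) as f:
        for _ in range(count):
            f.write(block)
        os.fsync(f.fileno())   # make sure the data actually reached the target
    elapsed = time.time() - start
    print("%5d KB blocks: %6.1f MB/s" % (block_kb, TOTAL_MB / elapsed))

os.remove(TEST_FILE)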

Later on today I'll ensure I get the following:

* Screen shot the local benchmarks
* Screen shot the VM based benchmarks on both targets
* Try with and without the Delayed Ack
* Review the recommended settings

Is there a recommendation for switch settings at all? Historically I have always been told that spanning tree should either be set to RSTP/portfast or disabled if you're confident no one will accidentally plug in another switch and force an election, flow control should be on for all ports, and Jumbo Frames should be enabled, typically MTU 9000.

Thanks for the assistance, definitely starting to get to where I want to be.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Wed Feb 22, 2012 2:12 pm

NetWise wrote:* The onboard Broadcoms on the PowerEdge 2950 are MUCH better.
Perfect!
NetWise wrote:* The switch was an older PowerConnect 5224, which is pretty ancient. I swapped in a 5424 from work, and that's better (10%?), although it didn't change drastically. It validates that the 5224 will likely work but is not ideal, which I went in understanding anyway.
It could be a possible cause of the issue, so we just need to exclude it from the list.
NetWise wrote:* The StarWind target is the latest I think, only downloaded last week - 5.8.1889.
Actually this is not the latest - you can always check the number of the latest build here.
NetWise wrote:* Benchmarks on the host itself are great, using ATTO, and match what I've come to expect from a PERC5/i historically. Obviously above a certain point the gains slow down and peak, but that's normal. But something weird - when I run ATTO in a VM on the same LUN/target, my benchmarks are good up to about a 32KB block size, and then the writes start coming back down. Reads continue to go up similar to the local disk, but are obviously limited by the 2x 1GbE connections. I need to get some screenshots so I can paste them. I've assumed that perhaps the writes drop at the high end because of network contention or lack of write cache at the host side, based on other screenshots I've seen in the forum.
Is the VM stored on the local RAID (I mean not on the SAN target)? If yes, then you should pay attention to that RAID.
NetWise wrote:* I have not yet tried disabling Delayed ACK. I understood this was sort of a last resort and not typically recommended?
That could be a possible reason too. You need to "play" with it.
NetWise wrote:Is there a recommendation for switch settings at all? Historically I have always been told that spanning tree should either be set to RSTP/portfast or disabled if you're confident no one will accidentally plug in another switch and force an election, flow control should be on for all ports, and Jumbo Frames should be enabled, typically MTU 9000.
I think it's better to disable it, and to keep the same jumbo frame size there as on your NICs.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com