VMware Performance Tuning

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

EinsteinTaylor
Posts: 22
Joined: Thu Aug 30, 2012 12:28 am

Thu Aug 30, 2012 12:38 am

Hey everyone,
I have spent the last couple of days reading the forums on tuning StarWind for VMware, and while I have picked up a lot, I have also noticed that much of it is out of date, so I'm hoping someone can help me get the best performance possible using the most current information available. We are coming off an OpenFiler system where performance was always lacking, but I had always assumed that was as good as it could get. Now that I have seen what StarWind can do, I am somewhat obsessed with squeezing every last IOP out of this system.

The eventual goal is a 2 node HA setup. We have already built the two boxes and purchased the licensing, but right now I am just trying to squeeze the max performance out of a standalone target before throwing in the HA variable.

Testing setup:
Dell server with 8 x 7200 RPM 750GB drives in RAID 50.
Windows Server 2008 R2 SP1 x64
4-core Xeon with 14 GB RAM
2 NICs for management, and a 4-port Intel PRO/1000 ET for storage, bonded in dynamic aggregation mode

Cisco 2960 switch stack with LACP EtherChannel (mode active) configured for both SAN nodes.
SAN is on a separate VLAN from management traffic
Jumbo frames enabled from end to end

3 x ESXi 5 hosts, each with 2 management NICs and a quad-port Intel PRO/1000.
On the ESXi hosts I have created a vSwitch with 4 VMkernel ports, each with exactly one NIC mapped. I have then bound each VMkernel port to the iSCSI adapter as discussed in all of the multipathing guides, changed the path selection policy to Round Robin, and set iops=1 (roughly as sketched below).
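
In case it helps anyone reading, this is roughly what I ran from the ESXi shell to do the binding and the Round Robin / iops=1 change (just a sketch from memory; the vmhba number and the naa device ID are placeholders for my actual ones):

# Bind each storage VMkernel port to the software iSCSI adapter (vmhba33 is a placeholder)
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk1
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk2
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk3
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk4

# Set the StarWind device to Round Robin (the naa ID is a placeholder)
esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR

# Switch paths after 1 I/O instead of the default 1000
esxcli storage nmp psp roundrobin deviceconfig set --device naa.xxxxxxxxxxxxxxxx --type iops --iops 1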


I am using DiskTT as the benchmarking tool because it is quick and easy and doesn't require a PhD to run :)
When I run DiskTT on the drive that I am presenting the virtual disk from, I see speeds in excess of 200 MB/s. When I create an image file, mount the VMFS datastore, and Storage vMotion a VM into that datastore, the best I've been able to squeeze out of the VM is about 55 MB/s. That doesn't even saturate one gigabit link, much less the 4 bonded links, so I think there is room for improvement.

I have not made any of the Windows TCP tweaks yet because one of the articles I read said that with 2008 and newer, the only thing that needs to be set is jumbo frames. Can anyone help me with some tweaks to bring the VM performance closer to the performance of the storage server itself?
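
For reference, these are the global TCP settings the older guides keep mentioning; below is a sketch of how I would check and, if needed, change them, though I have not actually applied any of the changes yet:

:: Show the current global TCP parameters (autotuning, chimney offload, RSS, etc.)
netsh int tcp show global

:: The tweaks the older articles mention - NOT applied yet, listed for discussion only
netsh int tcp set global autotuninglevel=normal
netsh int tcp set global rss=enabled
netsh int tcp set global chimney=disabled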

Thanks!
Bohdan (staff)
Staff
Posts: 435
Joined: Wed May 23, 2007 12:58 pm

Thu Aug 30, 2012 7:54 am

Hi

Two questions:
1) Could you please show us your interconnection diagram?
Something like the attached image. You may use Paint, Visio, or even a scanned freehand drawing.

2) Could you also try ATTO Disk Benchmark (against the local disk and then against the disk connected via iSCSI), which also does not require a PhD to run :lol:
http://www.attotech.com/products/produc ... _Benchmark
Try the default queue depth of 4 first, and then 10.
Attachments: 2ha-.png
EinsteinTaylor
Posts: 22
Joined: Thu Aug 30, 2012 12:28 am

Thu Aug 30, 2012 3:25 pm

Attached is the diagram of our setup (iSCSI.png). I will run the benchmarks you requested shortly and post the results.
EinsteinTaylor
Posts: 22
Joined: Thu Aug 30, 2012 12:28 am

Thu Aug 30, 2012 4:11 pm

Local server, queue depth 4: local_4.png
Local server, queue depth 10: local_10.png
VM, queue depth 4: VM_4.png
EinsteinTaylor
Posts: 22
Joined: Thu Aug 30, 2012 12:28 am

Thu Aug 30, 2012 4:12 pm

VM, queue depth 10: VM_10.png
Bohdan (staff)
Staff
Posts: 435
Joined: Wed May 23, 2007 12:58 pm

Thu Aug 30, 2012 4:22 pm

I don't see dedicated synchronization channels on your diagram.
Does that mean you are using the LACP channels for both iSCSI data traffic and HA synchronization traffic?
The total bandwidth of the synchronization channels on each HA node should be equal to or greater than the total bandwidth of the incoming iSCSI data channels. In your case, with 4 x 1 Gb iSCSI data channels (about 4 Gb/s in total), that means 4 or more 1 Gb sync channels.
EinsteinTaylor
Posts: 22
Joined: Thu Aug 30, 2012 12:28 am

Thu Aug 30, 2012 4:26 pm

We are performing synchronization over the same bonded NIC as the iSCSI data, although the target in question *is not* an HA target/device. It is just a simple virtual image file presented as a single target. There are currently no VMs on our HA target, so I/O and sync activity should be minimal.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Thu Aug 30, 2012 5:39 pm

I need you to clarify one thing: are you using HA? If yes, then please note that it is strongly recommended to use directly connected cables for the synchronization channel rather than going through the switch.
Also, you can configure several NICs for the sync channel from within our console, which is basically what we recommend instead of using LACP, as my colleague mentioned above.
Additionally, I can see you are using RAID 50, so I'd like you to know the following:
The recommended RAID levels for implementing HA are RAID 0, 1, or 10; RAID 5 and 6 are not recommended due to their low write performance.
The performance of a RAID array depends directly on the stripe size used. There are no exact recommendations on which stripe size to use; it is a test-based choice. As best practice, we recommend first setting the value recommended by the vendor and running tests, then setting a larger value and testing again, and finally setting a smaller value and testing once more. These three results should guide you to the optimal stripe size. In some configurations a smaller stripe size such as 4k or 8k gives better performance, while in other cases 64k, 128k, or even 256k performs better.
The performance of the HA device will depend on the performance of the underlying RAID array, so it is up to the customer to determine the optimal stripe size.

And the last one: what build number are you running?

Thank you
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
EinsteinTaylor
Posts: 22
Joined: Thu Aug 30, 2012 12:28 am

Thu Aug 30, 2012 6:05 pm

There is an HA target/device configured; however, that is not the target I am testing against, and there are no VMs in the datastore that points to the HA device.

Regarding the RAID 50, I completely understand the write penalty of a RAID 5-based level; however, the performance of the local RAID 50 disk is quite good. It's just that when I put an iSCSI target on that same disk and test a VM on it, it performs at only about 25% of what the raw disk actually does. I will move the sync network to a pair of free NICs so that it does not go through the switch, but with no I/O happening on the HA device I suspect we won't see much difference.

I'm not really sure whether stripe size is an issue here or not, again because the local performance is good; but maybe the Windows allocation unit size is different from the VMFS block size and I need to adjust something?
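
If the allocation unit size turns out to matter, I assume I can check it on the volume holding the image file with fsutil ("D:" is just a placeholder for the actual drive letter):

:: "Bytes Per Cluster" in the output is the NTFS allocation unit size
fsutil fsinfo ntfsinfo D: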

Part of my initial question is: other than jumbo frames, what tweaks still apply and should be set for Server 2008 R2?

The management console shows version 6.0.4768.
EinsteinTaylor
Posts: 22
Joined: Thu Aug 30, 2012 12:28 am

Thu Aug 30, 2012 6:12 pm

So for now I have removed the HA targets, so HA is completely out of the picture. I have run a directly connected CAT 6 cable that I will use for sync once I re-enable HA.

At this point we are just testing the performance of one standalone target and the results are still the same as I posted above.
Bohdan (staff)
Staff
Posts: 435
Joined: Wed May 23, 2007 12:58 pm

Fri Aug 31, 2012 7:30 am

EinsteinTaylor wrote: So for now I have removed the HA targets, so HA is completely out of the picture. I have run a directly connected CAT 6 cable that I will use for sync once I re-enable HA.

At this point we are just testing the performance of one standalone target and the results are still the same as I posted above.
I assume you have already performed iperf tests to check the network and make sure its bandwidth is near wire speed, similar to the following. Right?

Server side: iperf -s -w 512K
Client side: iperf -c 192.168.22.2 -P 32 -w 64K -l 64K


OK, let's try one more test: on the StarWind server, create a RAM disk target (1-4 GB), connect it over the network with the iSCSI initiator on another machine, and run ATTO Disk Benchmark against it. This test measures the network most directly, because the RAM disk's own performance is much higher than the network's.
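
A quick sketch of attaching the RAM disk target from the command line with the built-in Microsoft iSCSI initiator (you can do the same from the GUI; the portal IP and target IQN below are only placeholders):

:: Register the StarWind server as a target portal (IP is a placeholder)
iscsicli QAddTargetPortal 192.168.22.2

:: List the advertised targets and note the RAM disk target's IQN
iscsicli ListTargets

:: Quick login to the RAM disk target (IQN is a placeholder)
iscsicli QLoginTarget iqn.2008-08.com.starwindsoftware:ramdisk1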
EinsteinTaylor
Posts: 22
Joined: Thu Aug 30, 2012 12:28 am

Fri Aug 31, 2012 2:32 pm

I ran the iperf commands you listed and was able to saturate a gigabit link every single time. I have created a 4 GB RAM disk on one of the StarWind servers and used the Microsoft iSCSI initiator on the other StarWind server to mount it. I am attaching the ATTO output for queue depths 4 and 10. Unfortunately, I don't know enough about ATTO to interpret the output, so I could probably use a little help there. Watching the bandwidth graphs in Task Manager, though, I see it sitting at around 22-24%, which on a 4-NIC bonded link sounds like it is saturating one gigabit line (which is what you would expect for a single connection).
ramdisk_4.png
ramdisk_10.png
Bohdan (staff)
Staff
Posts: 435
Joined: Wed May 23, 2007 12:58 pm

Fri Aug 31, 2012 2:44 pm

EinsteinTaylor wrote: I ran the iperf commands you listed and was able to saturate a gigabit link every single time.
I found this explanation useful
http://communities.intel.com/thread/19600


On the ATTO screenshots you can see that the maximum performance is capped at about 123 MB/s, which is the throughput of a single 1 Gb interface (1 Gb/s is roughly 120-125 MB/s).

In order to get higher values, I would suggest using 4 standalone connections with MPIO instead of LACP.
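
After the bond is split into four standalone interfaces and the ESXi hosts reconnect over MPIO, something like this should confirm the result (a rough sketch; the IP address is a placeholder):

# Each StarWind LUN should now show four active paths
esxcli storage core path list

# Confirm Round Robin is still the active path selection policy per device
esxcli storage nmp device list

# Check that jumbo frames survive end to end on each storage VMkernel port
# (8972 bytes = 9000 minus IP/ICMP headers; the IP is a placeholder)
vmkping -d -s 8972 192.168.22.2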