ESXi iSCSI initiator WRITE speed


anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sun Apr 10, 2011 9:19 pm

Delayed ACK is not the same thing as Nagle. We're actually suffering from both :) If you're interested, here's a very good article about what Nagle and Delayed ACK, when both enabled, can do to so-called "send-send-recv" applications (an iSCSI stack is one of them):

http://www.stuartcheshire.org/papers/NagleDelayedAck/

Yes, please give this one a try and let us know whether it helped in your case or not. Thank you!
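
For what it's worth, delayed ACK can also be tuned on the Windows box running the StarWind target: it is controlled per network interface by the TcpAckFrequency registry value (setting it to 1 disables delayed ACK on that interface; a reboot is required). A minimal sketch from an elevated command prompt, with {interface GUID} as a placeholder for the GUID of the NIC carrying iSCSI traffic:

rem List the interface GUIDs and find the one bound to the iSCSI NIC (match by IP address)
reg query "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces" /s /f IPAddress

rem TcpAckFrequency=1 sends an ACK for every segment, i.e. disables delayed ACK on that interface
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{interface GUID}" /v TcpAckFrequency /t REG_DWORD /d 1 /f
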
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

michal
Posts: 7
Joined: Wed Feb 23, 2011 4:04 am

Mon Apr 11, 2011 7:30 am

Just a quick post; again, no "proper" results to post or show yet, only subjective and limited testing, as I'm busy with school for the next two weeks. These impressions are based on a few quick tests, the main one being a simple file transfer from a fast file server over SMB.

All testing was done with an MTU of 1500 (no jumbo frames in this round of testing).

- Disabling Delayed ACK (ESXi), VM Windows write caching default/enabled: works well, 33-38 MB/s. HOWEVER, I do not think this is a "proper" solution, though it is a decent workaround in the meantime.
- Disabling Delayed ACK (ESXi), VM Windows write caching disabled: works well, 33-38 MB/s; thoughts and comments same as above.
- Disabling Delayed ACK (ESXi), VM Windows install, write cache setting unknown: 10 minutes.

- Enabling Delayed ACK (ESXi), VM Windows write caching default/enabled: issues, 11-15 MB/s, which is obviously what this thread is about.
- Enabling Delayed ACK (ESXi), VM Windows write caching disabled: works, 35-45 MB/s; performance seems best in *CERTAIN* workloads, in others on par with disabling Delayed ACK. A decent workaround, but it must be applied on a per-VM basis.
- Enabling Delayed ACK (ESXi), VM Windows install, write cache setting unknown: 32 minutes.

Like I said above, disabling Delayed ACK on the ESXi host seems to work. With that said, having a mandatory ACK for every packet increases network traffic enough that it seems to "rob" *up to* 20% of the performance in certain workloads, but not much in others. Maybe this is an issue on the VMware side with their initiator, but whatever the case, something is making Windows cache management "unhappy", as Anton said. :lol:

In the meantime, disabling Delayed ACK, or leaving it enabled and disabling the Windows write cache, seems to be a decent workaround.

I will follow this up with full proper testing, screenshots and various logs when I can.
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Mon Apr 11, 2011 9:56 am

Great, thanks for the update! I will try this in my test environment too.
Max Kolomyeytsev
StarWind Software
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Apr 11, 2011 10:29 am

Good job! Thank you!

P.S. With a 1500-byte MTU there's no chance of seeing speeds anywhere close to wire speed. Please give Jumbo frames a try :) Thank you again!
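
For anyone trying this on ESX(i) 4.x, the MTU change looks roughly like the commands below, run from the host's Tech Support Mode shell (or the equivalent vicfg-* commands in the vCLI). vSwitch1, the "iSCSI1" port group and the addresses are just example values, on 4.x the VMkernel port has to be deleted and re-created to change its MTU, and the physical switch ports and the StarWind server NICs need to be set to 9000 as well:

esxcfg-vswitch -m 9000 vSwitch1                                        # raise the MTU on the iSCSI vSwitch (example name)
esxcfg-vmknic -l                                                       # note the current iSCSI VMkernel port's IP and port group
esxcfg-vmknic -d "iSCSI1"                                              # remove the VMkernel port (example port group name)
esxcfg-vmknic -a -i 192.168.10.11 -n 255.255.255.0 -m 9000 "iSCSI1"    # re-create it with MTU 9000 (example addressing)
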
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

CyberNBD
Posts: 25
Joined: Fri Mar 25, 2011 10:56 pm

Mon Apr 11, 2011 6:35 pm

The ESXi setting seems to have a slightly different meaning indeed. It has more to do with detecting network performance and congestion.
The mechanisms seem to be a little different on the two sides, which results in poor performance.

I've had some background on this from doing some Cisco self-study, and basically it comes down to this:
* For every packet that is sent, the receiver sends an ACK packet back.
* When starting a transmission, the sender sends one packet and waits for an ACK.
* When the ACK is received in time, the sender increases the number of packets it sends per ACK.
* When, after a while, the sender doesn't receive an ACK in time, or at all, it assumes there are network issues and falls back to the previous number of packets.
* Once in a while the sender tries to increase the number of packets again, to detect whether network performance has improved / the issues are resolved.

That way the sender can detect and adapt itself to network performance and congestion. It's understandable that if the two sides don't use exactly the same settings to accomplish this, it can result in serious performance issues (resending packets in different ways when ACKs are lost, or waiting for ACKs that are never going to come, etc.).

I found this VMware KB article about it: http://kb.vmware.com/selfservice/micros ... Id=1002598
Strangely enough, they talk about poor read speeds while we are experiencing poor write speeds. Interesting :)
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Apr 11, 2011 7:07 pm

The idea is to have a single ACK packet for multiple sent packets that are assumed to be small in size, so that we don't get a flood of ACK packets for, say, single-byte TCP sends. But for iSCSI it results in a penalty, as sending the next packets gets delayed.
CyberNBD wrote:The ESXi setting seems to have a slightly different meaning indeed. It has more to do with detecting network performance and congestion.
The mechanisms seem to be a little different on the two sides, which results in poor performance.

I've had some background on this from doing some Cisco self-study, and basically it comes down to this:
* For every packet that is sent, the receiver sends an ACK packet back.
* When starting a transmission, the sender sends one packet and waits for an ACK.
* When the ACK is received in time, the sender increases the number of packets it sends per ACK.
* When, after a while, the sender doesn't receive an ACK in time, or at all, it assumes there are network issues and falls back to the previous number of packets.
* Once in a while the sender tries to increase the number of packets again, to detect whether the network issues are resolved.

That way the sender can detect and adapt itself to network performance and congestion. It's understandable that if the two sides don't use exactly the same settings to accomplish this, it can result in serious performance issues (resending packets in different ways when ACKs are lost, or waiting for ACKs that are never going to come, etc.).

I found this VMware KB article about it: http://kb.vmware.com/selfservice/micros ... Id=1002598
Strangely enough, they talk about poor read speeds while we are experiencing poor write speeds. Interesting :)
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

CyberNBD
Posts: 25
Joined: Fri Mar 25, 2011 10:56 pm

Wed Apr 13, 2011 5:40 pm

Since I'm going to use some sort of bundling of 1Gig connections in my final setup, I started doing some tests using multipathing. Same setup as before; I just enabled the second onboard Broadcom NIC on each server and configured MP following best practices. Both iSCSI paths are on separate /27 subnets.

Some results:
First, using the disabled Delayed TCP ACK setting within VMware, like in the last tests:


Using direct I/O, MP doesn't seem to affect small transfer sizes, but at larger sizes the speed goes up quite a bit.
Using non-direct I/O, however, MP doesn't affect speeds at all :| ?

The following results seem even more interesting:
I re-enabled Delayed TCP ACK within VMware and restarted the ESXi machine.


Those two almost seem to be the opposite of what we've learned so far?
To be sure I wasn't messing things up somehow, I just disabled one of the two paths within VMware, going back to a single 1Gig connection without changing anything else. I ran the benchmarks again and everything went back to "normal": direct I/O nothing wrong, non-direct I/O poor writes.
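
In case it helps anyone reproduce this, the devices, their paths and the path selection policy can be checked from the ESX(i) 4.x shell along these lines (worth double-checking against your own build):

esxcfg-mpath -b           # brief listing of each device and the paths behind it, to confirm both paths are active
esxcli nmp device list    # shows each device with its current path selection policy (VMW_PSP_RR, VMW_PSP_FIXED, ...)
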
CyberNBD
Posts: 25
Joined: Fri Mar 25, 2011 10:56 pm

Wed Apr 13, 2011 7:35 pm

MS initiator graphs:


Same thing here. Direct I/O improves at the higher transfer sizes; cached I/O shows little difference.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Thu Apr 14, 2011 3:37 pm

I don't think we're seeing anything different from what we've seen before. MPIO is supposed to be faster, and it is faster, 1.5-2 times, like it should be. I don't have an answer for why cached MPIO is slower than non-cached MPIO.

P.S. And you should NOT team adapters for iSCSI. For clients use MPIO, and for the cross-link please wait for the post-V5.7 release with the custom sync channel MPIO stack. Still no teaming!
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Fri Apr 22, 2011 3:41 pm

I started testing last week and ran into the same performance problems everyone has been seeing. I changed the setup around and have been doing a lot of testing this week with the following configuration, using ESXi 4.1 Update 1 Enterprise Plus with StarWind 5.6 Enterprise HA Unlimited. The 2 ESXi hosts are HP DL380 G7's with dual 3.33 GHz hex-cores, 192 GB RAM, and 2 dual-port HP 10GbE SFP+ NICs. The StarWind servers are each SuperMicro motherboards with dual 2.4 GHz quad-core 5600-series Xeons, 72 GB RAM, 2 dual-port HP 10GbE SFP+ NICs, 2 LSI 9280-8E RAID controllers, and 2 Areca 1880-24 RAID controllers. This testing has been done using a 24-drive RAID 60 on a single Areca, with Seagate 7.2K SAS 2TB drives, in each of the servers. The switches are two 24-port 10GbE SFP+ HP ProCurve 6600's with VLANs set up to separate LAN and iSCSI traffic, plus a 5400-series for the 1GbE connections; each 6600 has a 10GbE LAN link and a 10GbE iSCSI link, with the different traffic types separated into VLANs.

The StarWind servers are set up with no NIC teaming for the iSCSI or Sync channels. There are 2 10GbE Sync channels, each with separate IPs. The Sync NICs are connected directly to each other using direct-attach SFP+ cables. The 2 10GbE iSCSI NICs are each in separate subnets and plugged into different switches.

The ESXi servers are set up with a virtual switch with dual 1GbE NICs teamed for management, a virtual switch with dual 1GbE NICs teamed for vMotion, a virtual switch with 2 10GbE NICs teamed for LAN, and 2 virtual switches each containing a single 10GbE NIC, each connected to a different 6600, with IPs for the respective iSCSI NICs connected to the switches for the SANs.

I set up a 2008 R2 SP1 Enterprise VM with 16 GB RAM and 4 vCPUs for the testing. The OS is on a LUN with 256 MB of write-back cache. I also set up four 2 TB LUNs with 512 MB of write-back cache each. I set the paths to Round Robin for each LUN, so there are 4 active I/O paths per LUN. Then I made the most important change: I went into the console of the ESXi servers and changed the number of IOPS per path from 1000 to 1. This made the writes 10X as fast.
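
For anyone who wants to try the same tweak, on ESX(i) 4.x the per-path IOPS limit for a Round Robin device is changed per LUN with something along these lines (naa.xxxx is a placeholder for the actual device identifier; please double-check the syntax against your build before relying on it):

esxcli nmp roundrobin setconfig --device naa.xxxx --type "iops" --iops 1    # issue only 1 I/O down a path before switching to the next
esxcli nmp roundrobin getconfig --device naa.xxxx                           # verify the new setting took effect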

Here is a sample using SQLIO to write to 4 files, each on a separate LUN mounted into an NTFS folder. Note that this is not using Windows caching, and the file sizes are twice the size of the write-back cache for each LUN. With Windows caching on, the numbers are over 100K IOPS and 6.5 GB/s, which is where the 4 vCPUs max out.

C:\Program Files (x86)\SQLIO>sqlio -b64 -BN -o2 -fsequential -kW -Fparam.txt
sqlio v1.5.SG
parameter file used: param.txt
file d:\corporate\it\testfile.dat with 2 threads (0-1) using mask 0x0 (0)
file d:\corporate\technical\testfile.dat with 2 threads (2-3) using mask 0x0 (0)
file d:\corporate\hstv\testfile.dat with 2 threads (4-5) using mask 0x0 (0)
file d:\corporate\online\testfile.dat with 2 threads (6-7) using mask 0x0 (0)
8 threads writing for 30 secs to files d:\corporate\it\testfile.dat, d:\corporate\technical\testfile.dat, d:\corporate\hstv\testfile.dat and d:\corporate\online\testfile.dat
using 64KB sequential IOs
enabling multiple I/Os per thread with 2 outstanding
buffering set to not use file nor disk caches (as is SQL Server)
using specified size: 1000 MB for file: d:\corporate\it\testfile.dat
using specified size: 1000 MB for file: d:\corporate\technical\testfile.dat
using specified size: 1000 MB for file: d:\corporate\hstv\testfile.dat
using specified size: 1000 MB for file: d:\corporate\online\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 10046.23
MBs/sec: 627.88
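
The param.txt itself isn't shown above, but from the output it would look roughly like this (SQLIO's parameter file has one line per test file: path, number of threads, affinity mask, size in MB); treat it as a reconstruction rather than the exact file used:

d:\corporate\it\testfile.dat 2 0x0 1000
d:\corporate\technical\testfile.dat 2 0x0 1000
d:\corporate\hstv\testfile.dat 2 0x0 1000
d:\corporate\online\testfile.dat 2 0x0 1000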


I haven't even started playing with Jumbo frames yet, all 4 of these LUNs are set up to use the same Sync channel right now, and this is just on the 7.2K drives connected to a single Areca instead of the 15K drives connected to the LSI cards. Also, I'm anxiously awaiting the next version of StarWind with its performance upgrades. So there is probably still a lot more speed I can squeeze out of this.
CyberNBD
Posts: 25
Joined: Fri Mar 25, 2011 10:56 pm

Fri Apr 22, 2011 7:39 pm

rchisholm wrote:Then I made the most important change. I went into the console of the ESXi servers and changed the number of IOPS per path from 1000 to 1. This made the writes 10X as fast.
I didn't mention this when posting my results, but I used 3 IOPS per path for this. I will see if I can compare the results using different settings.
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Sat Apr 23, 2011 12:31 am

CyberNBD wrote:
rchisholm wrote:Then I made the most important change. I went into the console of the ESXi servers and changed the number of IOPS per path from 1000 to 1. This made the writes 10X as fast.
I didn't mention this when posting my results, but I used 3 IOPS per path for this. I will see if I can compare the results using different settings.
Did you set up each of your iSCSI NICs in ESXi in separate virtual switches? When I had a single virtual switch and the standard 1000 IOPS, I got about 6 MB/s. Going to separate virtual switches for each iSCSI NIC gave me about 60 MB/s with sufficient load. Then I did the IOPS change, which brought it over 600 MB/s. So, with the changes combined, it increased the write speed 100X.
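
For reference, with one VMkernel port per vSwitch/NIC like that, the ports also get bound to the software iSCSI adapter so that each NIC shows up as its own path; on ESX(i) 4.x that is done roughly as follows (vmhba33 and the vmk names are example values):

esxcli swiscsi nic add -n vmk1 -d vmhba33    # bind the first iSCSI VMkernel port to the software iSCSI adapter
esxcli swiscsi nic add -n vmk2 -d vmhba33    # bind the second one
esxcli swiscsi nic list -d vmhba33           # confirm both ports are bound, then rescan the adapter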

I'm going to continue testing with different IOPS settings and frame sizes. I will also be testing with Xen next week. I still need to do some testing with the fast RAID's also.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sat Apr 23, 2011 6:00 pm

VLANs or virtual switches should not affect / boost performance. So it's something else...

Yes, please keep us updated. Thank you!
rchisholm wrote:
CyberNBD wrote:
rchisholm wrote:Then I made the most important change. I went into the console of the ESXi servers and changed the number of IOPS per path from 1000 to 1. This made the writes 10X as fast.
I didn't mention this when posting my results, but I used 3 IOPS per path for this. I will see if I can compare the results using different settings.
Did you set up each of your iSCSI NICs in ESXi in separate virtual switches? When I had a single virtual switch and the standard 1000 IOPS, I got about 6 MB/s. Going to separate virtual switches for each iSCSI NIC gave me about 60 MB/s with sufficient load. Then I did the IOPS change, which brought it over 600 MB/s. So, with the changes combined, it increased the write speed 100X.

I'm going to continue testing with different IOPS settings and frame sizes. I will also be testing with Xen next week. I still need to do some testing with the fast RAID's also.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Sat Apr 23, 2011 8:44 pm

I left out that the NICs were teamed in the single virtual switch. That probably had to do with the performance problem.
anton (staff) wrote:VLANs or virtual switches should not affect / boost performance. So it's something else...

Yes, please keep us updated. Thank you!
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sat Apr 23, 2011 10:11 pm

You should skip using NIC teaming for routing iSCSI traffic. For clients use properly configured MPIO, and for the sync channel please stick with failover (V5.8 should have custom MPIO for the sync channel; this feature is not in V5.7, sorry...).
rchisholm wrote:I left out that the NICs were teamed in the single virtual switch. That probably had to do with the performance problem.
anton (staff) wrote:VLANs or virtual switches should not affect / boost performance. So it's something else...

Yes, please keep us updated. Thank you!
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software
