"Lost access to volume..."

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Tue Apr 02, 2013 10:59 am

Hi Volker,
I'm wondering if you're using similar operating systems on these hosts?
Normally we don't do testing in mixed OS environments, so things may behave incorrectly there.
imrevo wrote: I'll replace the NICs in w2k3-storage. But the question is: as there are 5 HA-targets @ w2k12.micro1 with 4 secondary targets @ w2k8-micro1 and only 1 secondary target (the failing one) @ w2k3-storage (Windows 2008, wrong naming, sorry), how can it be that the connection to a HA-storage dies completely when one of the targets might have a problem with one of 3 NICs?

Tell me which logs you need.

I might be wrong, but even if one of the 2 HA hosts fails, my ESX should report a loss of redundancy at most, but never a complete loss of access!

bye
Volker
Max Kolomyeytsev
StarWind Software
imrevo
Posts: 26
Joined: Tue Jan 12, 2010 9:20 am
Location: Germany

Thu Apr 04, 2013 9:27 am

Hi Max,

as soon as the new NICs arrive for the 2 new servers I bought, I'll set up everything with w2k12. Hopefully this weekend. But I really think it's the NIC in w2k3-storage (Windows 2008 storage server), as the other Windows 2008 box doesn't show any problems in the HA cluster with Windows 2012.

bye
Volker
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Mon Apr 08, 2013 8:24 am

Sounds great, please keep us posted!
Max Kolomyeytsev
StarWind Software
imrevo
Posts: 26
Joined: Tue Jan 12, 2010 9:20 am
Location: Germany

Tue Apr 09, 2013 5:36 am

Good morning,
Max (staff) wrote:Sounds great, please keep us posted!
Apart from a little problem due to misalignment of the 2003 VMs (will be fixed soon), almost everything looks fine now. Almost:

Doing a storage vmotion of a 25Gig VM that is powered off takes 15 minutes. There are 3 NICs dedicated to iSCSI in the ESXi and 2 NICs dedicated to iSCSI target in each of the two Starwind hosts making a total of 12 paths for every LUN as all NICs are in the same subnet. Round robin is enabled. Network utilization at the Starwind host shows only a 4-6% usage of every NIC during a storage vmotion. During ATTO benchmarking I can see a load of up to 50% per target-NIC though.

Tweaks performed for ESXi:

- disabled DelayedAck for every iSCSI target
- enabled jumbo frames for vmkernels and vSwitches dedicated to iSCSI
- iops set to 5 (also tried 1000, 1, 3, 10, 15, 20, 50, 100 and 500)
- round robin enabled
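
For reference, a sketch of how the round-robin and IOPS tweaks above are usually applied from the ESXi shell; the naa. device identifier is a placeholder for the actual LUN (list yours with esxcli storage nmp device list):

Code: Select all

```shell
# Placeholder device ID -- substitute your LUN's naa. identifier
DEV=naa.60000000000000000000000000000000

# Set the path selection policy for this LUN to round robin
esxcli storage nmp device set --device $DEV --psp VMW_PSP_RR

# Lower the round-robin path-switch trigger from the default
# 1000 IOPS down to 5, as in the tweak list above
esxcli storage nmp psp roundrobin deviceconfig set \
    --device $DEV --type iops --iops 5
```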

Tweaks performed for the target side (Windows Server 2012 with current Starwind release):

- GlobalMaxTcpWindowSize = 0x1400000
- TcpWindowSize = 0x1400000 (per interface, not globally; your guide is wrong here)
- TcpAckfrequency = 1 (per interface)
- TCPNodelay = 1 (per interface)
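
For reference, a sketch of how these values can be set with reg add; {interface-GUID} is a placeholder for each iSCSI NIC's GUID under the Interfaces key (note that on Server 2012 the legacy window-size values are largely superseded by receive-window autotuning):

Code: Select all

```shell
set KEY=HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

rem Global maximum TCP window size (legacy value)
reg add "%KEY%" /v GlobalMaxTcpWindowSize /t REG_DWORD /d 0x1400000 /f

rem Per-interface values -- repeat for each iSCSI NIC's {interface-GUID}
reg add "%KEY%\Interfaces\{interface-GUID}" /v TcpWindowSize /t REG_DWORD /d 0x1400000 /f
reg add "%KEY%\Interfaces\{interface-GUID}" /v TcpAckFrequency /t REG_DWORD /d 1 /f
reg add "%KEY%\Interfaces\{interface-GUID}" /v TCPNoDelay /t REG_DWORD /d 1 /f
```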

and as per your guide:

Code: Select all

netsh int tcp set heuristics disabled
netsh int tcp set global autotuninglevel=normal
netsh int tcp set global ecncapability=enabled
netsh int tcp set global rss=enabled
netsh int tcp set global chimney=enabled
netsh int tcp set global dca=enabled

(autotuninglevel=normal: really? Shouldn't that be "disabled" or "experimental"?)
btw.:

Code: Select all

netsh int tcp set global congestionprovider=ctcp


returns an error on Windows 2012.
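
As far as I can tell, Server 2012 dropped the congestionprovider switch from netsh; the equivalent is exposed through the PowerShell NetTCPSetting cmdlets instead (a sketch; run elevated, and the template names may differ per system):

Code: Select all

```shell
# Inspect current templates and their congestion providers
Get-NetTCPSetting | Select-Object SettingName, CongestionProvider

# Enable CTCP on the custom templates
Set-NetTCPSetting -SettingName InternetCustom -CongestionProvider CTCP
Set-NetTCPSetting -SettingName DatacenterCustom -CongestionProvider CTCP
```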

The LUNs are using write-back caching with a size of 2GB, all NICs are 1GBit.

Any ideas why storage vmotion is slow while ATTO shows good results?

bye
Volker
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Tue Apr 16, 2013 10:16 am

Hi Volker,
I have 3 suggestions:
1. Earlier I noticed low performance when VMFS volumes used different block sizes.
2. You may get down to 28 MB/s if you're doing Storage vMotion to an image file on the same storage array.
3. Some additional command may be issued from the VMware side only during Storage vMotion, e.g. VAAI UNMAP (StarWind has temporarily limited VAAI support).
Max Kolomyeytsev
StarWind Software