"Lost access to volume..."

imrevo · Sun Mar 17, 2013 6:45 pm

Good evening,

given:
vSphere 5, Build 469512
Windows 2008 Storage Server
StarWind iSCSI SAN v6.0.0 (Build 20121220, [SwSAN], Win64)

While trying to migrate a VM from local storage to a deduplicated storage, I get the following "event" in vSphere:

"Lost access to volume 514606a6-6a49ac6e-8150-001ec9ed7ffe (DEDUP002-W2K3-STORAGE-001) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly."

The Starwind logfile says:

3/17 19:17:10.579 15a8 Ssc: *** SscScsi_InquiryHandler: INQUIRY VPD page 0xb0 is not supported!
3/17 19:17:51.857 1140 T[5,1aa2]: Management command: abort task (CmdSN 7402, ITT 0x6a220000).
3/17 19:17:51.857 1140 T[5,1aa1]: LUN 0, state 0x40, immediate 0, ITT 0x6a220000, CmdSn 7402, TTT 0x3542
3/17 19:17:51.857 1140 T[5,1aa1]: read/write 01, read length 0, read done 0, write length 512, write done 512, DATA-IN PDUs 0
3/17 19:17:51.857 1140 T[5,1aa1]: DataSN 0, R2TSN 0, status 0, status class 0, status detail 0, response 0, counter 0, authStage 0
3/17 19:17:51.857 1140 T[5,1aa1]: CDB 0000 2a 00 00 00 ae b0 00 00 01 00 00 00 00 00 00 00 *...®°..........
3/17 19:17:51.857 1140 T[5,1aa1] Ssc 000000009F819AD0 - aborted, but still being executed.
3/17 19:17:53.167 179c T[6,5ed7]: Management command: abort task (CmdSN 32808, ITT 0x1f8d0000).
3/17 19:17:53.167 179c T[6,5ed6]: LUN 0, state 0x20, immediate 0, ITT 0x1f8d0000, CmdSn 32808, TTT 0xbdac
3/17 19:17:53.167 179c T[6,5ed6]: read/write 01, read length 0, read done 0, write length 512, write done 512, DATA-IN PDUs 0
3/17 19:17:53.167 179c T[6,5ed6]: DataSN 0, R2TSN 0, status 0, status class 0, status detail 0, response 0, counter 0, authStage 0
3/17 19:17:53.167 179c T[6,5ed6]: CDB 0000 2a 00 00 00 ac 20 00 00 01 00 00 00 00 00 00 00 *...¬ ..........
3/17 19:17:53.167 179c T[6,5ed6] Ssc 000000009F8371F0 - aborted.
3/17 19:17:53.167 179c T[6,5ed6]: Aborted task completed (state 0x600)
3/17 19:17:53.869 1140 T[5,1aa5]: Management command: abort task (CmdSN 7403, ITT 0x6c220000).
3/17 19:17:53.869 1140 T[5,1aa3]: LUN 0, state 0x20, immediate 0, ITT 0x6c220000, CmdSn 7403, TTT 0x3546
3/17 19:17:53.869 1140 T[5,1aa3]: read/write 01, read length 0, read done 0, write length 512, write done 512, DATA-IN PDUs 0
3/17 19:17:53.869 1140 T[5,1aa3]: DataSN 0, R2TSN 0, status 0, status class 0, status detail 0, response 0, counter 0, authStage 0
3/17 19:17:53.869 1140 T[5,1aa3]: CDB 0000 2a 00 00 00 ae b0 00 00 01 00 00 00 00 00 00 00 *...®°..........
3/17 19:17:53.869 1140 T[5,1aa3] Ssc 000000009F819FF0 - aborted.
3/17 19:17:53.869 1140 T[5,1aa3]: Aborted task completed (state 0x600)
3/17 19:17:55.180 179c T[6,5eda]: Management command: abort task (CmdSN 32809, ITT 0x218d0000).
3/17 19:17:55.180 179c T[6,5ed8]: LUN 0, state 0x20, immediate 0, ITT 0x218d0000, CmdSn 32809, TTT 0xbdb0
3/17 19:17:55.180 179c T[6,5ed8]: read/write 01, read length 0, read done 0, write length 512, write done 512, DATA-IN PDUs 0
3/17 19:17:55.180 179c T[6,5ed8]: DataSN 0, R2TSN 0, status 0, status class 0, status detail 0, response 0, counter 0, authStage 0
3/17 19:17:55.180 179c T[6,5ed8]: CDB 0000 2a 00 00 00 ac 20 00 00 01 00 00 00 00 00 00 00 *...¬ ..........
3/17 19:17:55.180 179c T[6,5ed8] Ssc 000000009F837710 - aborted.
3/17 19:17:55.180 179c T[6,5ed8]: Aborted task completed (state 0x600)
3/17 19:17:55.258 179c T[6,5edb]: Management command: abort task (CmdSN 32804, ITT 0x1b8d0000).
3/17 19:17:55.258 179c T[6,5ed2]: LUN 0, state 0x40, immediate 0, ITT 0x1b8d0000, CmdSn 32804, TTT 0xbda4
3/17 19:17:55.258 179c T[6,5ed2]: read/write 01, read length 0, read done 0, write length 131072, write done 131072, DATA-IN PDUs 0
3/17 19:17:55.258 179c T[6,5ed2]: DataSN 0, R2TSN 0, status 0, status class 0, status detail 0, response 0, counter 0, authStage 0
3/17 19:17:55.258 179c T[6,5ed2]: CDB 0000 2a 00 20 2e c8 00 00 01 00 00 00 00 00 00 00 00 *. .È...........
[to be continued]
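
For readers following the log: the CDB lines can be decoded by hand. Below is a minimal Python sketch, assuming the standard SCSI WRITE(10) layout (opcode 0x2A, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and a 512-byte logical block, which is what the "write length" fields above suggest. `decode_write10` is a hypothetical helper name, not part of any StarWind or VMware tool.

```python
def decode_write10(cdb_hex, block_size=512):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) < 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    # block_size is an assumption; the log does not state it explicitly.
    return {"lba": lba, "blocks": blocks, "bytes": blocks * block_size}


# First 10 bytes of two CDB dumps copied from the log excerpt above.
small = decode_write10("2a 00 00 00 ae b0 00 00 01 00")
large = decode_write10("2a 00 20 2e c8 00 00 01 00 00")
print(small)
print(large)
```

Under those assumptions, most of the aborted commands are single-sector (512-byte) writes, and the last one is a 128 KiB write, which matches the "write length 512" and "write length 131072" values the target logged.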

What's wrong? There's nothing in the Windows event log, and I have a NIC dedicated to iSCSI traffic only on both the ESXi host and the StarWind host.

bye
Volker

eickst · Mon Mar 18, 2013 1:04 am

Can you create a new VM using the iSCSI storage? Or does it only happen when trying to migrate the existing one?

Max (staff) · Wed Mar 20, 2013 11:40 am

Hi!
It looks like the fix is included with the current build on the website:
http://www.starwindsoftware.com/registration-iscsi-san
Also, could you please tell us what caching parameters you are using for the DD device?
Performance issues may occur if the DD device cache is 64 MB or less.

imrevo · Wed Mar 20, 2013 1:59 pm

Hi Max,

Max (staff) wrote: It looks like the fix is included with the current build on the website:
http://www.starwindsoftware.com/registration-iscsi-san

Hm, I downloaded my version 1 week ago (build20121220) and I downloaded build 20130115 right now. But where is the build from 21.2.2013 mentioned in http://www.starwindsoftware.com/forums/ ... t2069.html including several iSCSI fixes?

It would be really nice to have some kind of release information on the homepage / download page and in the filename (instead of just "starwind.exe").

Max (staff) wrote: Also, could you please tell us what caching parameters you are using for the DD device?
Performance issues may occur if the DD device cache is 64 MB or less.

Cache is 512MB or larger, for any disk.

bye
Volker

Thu Mar 21, 2013 11:34 am

imrevo wrote: Hm, I downloaded my version 1 week ago (build20121220) and I downloaded build 20130115 right now. But where is the build from 21.2.2013 mentioned in starwind-f5/starwind-the-most-recent-version-t2069.html including several iSCSI fixes?

That page was updated too early:) The information in that topic is already fixed and up to date.

There is one question that was asked that hasn't been answered yet:

eickst wrote: Can you create a new VM using the iSCSI storage? Or does it only happen when trying to migrate the existing one?

imrevo · Fri Mar 22, 2013 7:26 am

Hi,

quick update:

- added more NICs to the iSCSI-target servers
- exchanged all network cables
- done some tweaking (round robin, iops=1, Jumbo frames for iSCSI-sync-channel and iSCSI-target-channel)
- installed a new switch used for sync-traffic only
- reconfigured the HA-targets to use 2 sync- and 1 heartbeat-channel
- checked the physical disks

Result:
Same problem ("lost access to ..."), but this time to one of the HA-targets that has 4(!) valid and active paths. I think I should take a closer look at my ESX hosts....

I'll add 2 more NICs to the ESX hosts in the next few days, hopefully this afternoon. I'll keep you posted.

Btw.: Can you confirm that build 20130115 is the latest release?

bye
Volker

imrevo · Sat Mar 23, 2013 1:37 pm

Hi there,

quick update again:

- added an additional NIC to each ESX
- reconfigured vSphere networking
- changed MTU to 9000 not only for the iSCSI-vmkernel port but for the corresponding vSwitch as well
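
The MTU change in the last bullet matters because a jumbo frame only gets through if every hop on the iSCSI path accepts it; the usable MTU is the smallest value along the path. Below is a minimal Python sketch of that check. The hop names and the 1500-byte starting value for the vSwitch (the ESXi default) are illustrative assumptions, not values taken from this setup.

```python
def effective_mtu(hops):
    """Return the smallest MTU on the path; larger frames will not pass end to end."""
    return min(hops.values())


def undersized_hops(hops, wanted=9000):
    """List the hops whose MTU is below the wanted jumbo-frame size."""
    return sorted(name for name, mtu in hops.items() if mtu < wanted)


# Hypothetical path before and after raising the vSwitch MTU.
before = {
    "vmk-iscsi": 9000,        # vmkernel port already set to 9000
    "vSwitch-iscsi": 1500,    # assumed to still be at the default
    "physical-switch": 9000,
    "target-nic": 9000,
}
after = dict(before)
after["vSwitch-iscsi"] = 9000

print(effective_mtu(before), undersized_hops(before))
print(effective_mtu(after), undersized_hops(after))
```

A mismatch like the "before" case is worth ruling out early, since large iSCSI writes may stall while small ones still succeed.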

No more errors so far. Benchmarking shows 75MB/sec writing, 220MB/sec reading (ATTO) with iops set to 5. 1, 3, 10 and 1000 iops show worse results, so I'll stick to 5.
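
For anyone wanting to repeat the IOPS comparison, the Round Robin limit is set per device. The Python sketch below only builds the esxcli argument lists; it does not run them. The option names are given as recalled for ESXi 5.x and should be checked with `esxcli storage nmp psp roundrobin deviceconfig set --help` on the host. The device identifier is the one that appears in the event log later in this thread, and `round_robin_commands` is a hypothetical helper name.

```python
def round_robin_commands(device, iops):
    """Build esxcli argument lists that select Round Robin and set its IOPS limit."""
    if iops < 1:
        raise ValueError("iops must be a positive integer")
    return [
        # Switch the path selection policy for this device to Round Robin.
        ["esxcli", "storage", "nmp", "device", "set",
         "--device", device, "--psp", "VMW_PSP_RR"],
        # Change paths after this many I/Os instead of the default.
        ["esxcli", "storage", "nmp", "psp", "roundrobin", "deviceconfig", "set",
         "--device", device, "--type", "iops", "--iops", str(iops)],
    ]


cmds = round_robin_commands("eui.048ef4db207df6cf", 5)
for argv in cmds:
    print(" ".join(argv))
```

Building the commands as lists keeps them easy to loop over when testing several IOPS values (1, 3, 5, 10, 1000) against the same benchmark.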

Still to do: update to the latest Starwind build.

As soon as everything is done I'll post the config and the benchmarks.

The purpose of all this is to create a "showroom" for potential customers, currently (due to licensing limitations) with a 2-node HA cluster.

I'll keep you posted.

bye
Volker

Anatoly (staff) · Mon Mar 25, 2013 2:16 pm

Have you had a chance to upgrade the SAN? Do you have any update for this topic?

BTW, one thing that I'd like to mention: you can show a 2-node config to your customers and be sure that 3 nodes will behave the same way. Another option is to contact sales and ask for a 3-node trial key (BTW, you should get direct support for this project in that case).

imrevo · Tue Mar 26, 2013 6:29 am

Hi Max,

Anatoly (staff) wrote: Have you had a chance to upgrade the SAN? Do you have any update for this topic?

I did, and I have.

See next post.

Anatoly (staff) wrote: BTW, one thing that I'd like to mention: you can show a 2-node config to your customers and be sure that 3 nodes will behave the same way. Another option is to contact sales and ask for a 3-node trial key (BTW, you should get direct support for this project in that case).

Well, it is actually not a project but a setup to help me acquire projects.

As a reseller, I'm in contact with Tatiana, and that works very well.

bye
Volker

imrevo · Tue Mar 26, 2013 10:31 am

Good morning,

finally, here it is: Setup example for a low budget HA solution for SMB

Comments are welcome, projects as well.

bye
Volker

Anatoly (staff) · Thu Mar 28, 2013 4:46 pm

Before we dive deeper into the project comments: have you upgraded the SAN? Has it solved the issue?

imrevo · Sat Mar 30, 2013 7:16 am

Hi Anatoly,

Anatoly (staff) wrote: Before we dive deeper into the project comments: have you upgraded the SAN? Has it solved the issue?

I performed the update just a few minutes ago, after the error reoccurred yesterday.

bye
Volker

Anatoly (staff) · Mon Apr 01, 2013 9:32 am

So the upgrade helped to solve the issue, correct?

imrevo · Mon Apr 01, 2013 3:25 pm

Successfully restored access to volume 5146dc2a-a7143242-8c0a-001ec9ed7ffe (HA001-W2K12-MICRO1-W2K3-STORAGE) following connectivity issues.
info
01.04.2013 16:41:49
esx02.celtic-ads.imre.home

Lost access to volume 5146dc2a-a7143242-8c0a-001ec9ed7ffe (HA001-W2K12-MICRO1-W2K3-STORAGE) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
01.04.2013 16:41:49
HA001-W2K12-MICRO1-W2K3-STORAGE

Alarm 'Host memory usage' on esx02.celtic-ads.imre.home changed from Yellow to Green
info
01.04.2013 14:27:17
esx02.celtic-ads.imre.home

Successfully restored access to volume 5146dc2a-a7143242-8c0a-001ec9ed7ffe (HA001-W2K12-MICRO1-W2K3-STORAGE) following connectivity issues.
info
01.04.2013 14:25:49
esx02.celtic-ads.imre.home

Lost access to volume 5146dc2a-a7143242-8c0a-001ec9ed7ffe (HA001-W2K12-MICRO1-W2K3-STORAGE) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
01.04.2013 14:25:41
HA001-W2K12-MICRO1-W2K3-STORAGE

Alarm 'Cannot connect to storage': an SNMP trap for entity esx02.celtic-ads.imre.home was sent
info
01.04.2013 14:22:47
esx02.celtic-ads.imre.home

Alarm 'Cannot connect to storage' on esx02.celtic-ads.imre.home changed from Gray to Gray
info
01.04.2013 14:22:47
esx02.celtic-ads.imre.home

Alarm 'Cannot connect to storage' on esx02.celtic-ads.imre.home triggered an action
info
01.04.2013 14:22:47
esx02.celtic-ads.imre.home

Alarm 'Cannot connect to storage' on esx02.celtic-ads.imre.home changed from Gray to Gray
info
01.04.2013 14:22:47
esx02.celtic-ads.imre.home

Path redundancy to storage device eui.048ef4db207df6cf (Datastores: HA001-W2K12-MICRO1-W2K3-STORAGE) restored. Path vmhba34:C0:T12:L0 is active again.
info
01.04.2013 14:22:34
HA001-W2K12-MICRO1-W2K3-STORAGE

Path redundancy to storage device eui.048ef4db207df6cf degraded. Path vmhba34:C1:T12:L0 is down. Affected datastores: HA001-W2K12-MICRO1-W2K3-STORAGE.
warning
01.04.2013 14:22:33
HA001-W2K12-MICRO1-W2K3-STORAGE

Path redundancy to storage device eui.048ef4db207df6cf degraded. Path vmhba34:C0:T12:L0 is down. Affected datastores: HA001-W2K12-MICRO1-W2K3-STORAGE.
warning
01.04.2013 14:22:33
HA001-W2K12-MICRO1-W2K3-STORAGE

Successfully restored access to volume 5146dc2a-a7143242-8c0a-001ec9ed7ffe (HA001-W2K12-MICRO1-W2K3-STORAGE) following connectivity issues.
info
01.04.2013 14:22:28
esx02.celtic-ads.imre.home

Lost access to volume 5146dc2a-a7143242-8c0a-001ec9ed7ffe (HA001-W2K12-MICRO1-W2K3-STORAGE) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
01.04.2013 14:22:08
HA001-W2K12-MICRO1-W2K3-STORAGE

Successfully restored access to volume 5146dc2a-a7143242-8c0a-001ec9ed7ffe (HA001-W2K12-MICRO1-W2K3-STORAGE) following connectivity issues.
info
01.04.2013 14:21:53
esx02.celtic-ads.imre.home

Lost access to volume 5146dc2a-a7143242-8c0a-001ec9ed7ffe (HA001-W2K12-MICRO1-W2K3-STORAGE) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
01.04.2013 14:21:20
HA001-W2K12-MICRO1-W2K3-STORAGE
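
To put numbers on the events above, each "Lost access" entry can be paired with the next "Successfully restored" entry. Below is a minimal Python sketch; the timestamps are copied from the event list (re-ordered oldest first, since vSphere shows newest first), and `outage_seconds` is a hypothetical helper name.

```python
from datetime import datetime

# (timestamp, kind) pairs taken from the vSphere events above, oldest first.
EVENTS = [
    ("01.04.2013 14:21:20", "lost"),
    ("01.04.2013 14:21:53", "restored"),
    ("01.04.2013 14:22:08", "lost"),
    ("01.04.2013 14:22:28", "restored"),
    ("01.04.2013 14:25:41", "lost"),
    ("01.04.2013 14:25:49", "restored"),
    ("01.04.2013 16:41:49", "lost"),
    ("01.04.2013 16:41:49", "restored"),
]


def outage_seconds(events):
    """Pair each 'lost' event with the next 'restored' event; return (start, seconds)."""
    outages = []
    lost_at = None
    for stamp, kind in events:
        when = datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S")
        if kind == "lost":
            lost_at = when
        elif kind == "restored" and lost_at is not None:
            seconds = int((when - lost_at).total_seconds())
            outages.append((lost_at.strftime("%H:%M:%S"), seconds))
            lost_at = None
    return outages


outages = outage_seconds(EVENTS)
for start, seconds in outages:
    print(start, seconds, "s")
```

By that pairing, the datastore was unreachable for 33, 20 and 8 seconds in the afternoon, and the 16:41:49 loss was restored within the same second.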

I'll replace the NICs in w2k3-storage. But the question is: as there are 5 HA-targets @ w2k12.micro1 with 4 secondary targets @ w2k8-micro1 and only 1 secondary target (the failing one) @ w2k3-storage (Windows 2008, wrong naming, sorry), how can it be that the connection to an HA storage dies completely when one of the targets might have a problem with one of 3 NICs?

Tell me which logs you need.

I might be wrong, but even if one of 2 HA hosts fails, my ESX host should report a loss of redundancy at most, but never a complete loss of access!

bye
Volker

imrevo · Mon Apr 01, 2013 3:27 pm

Anatoly (staff) wrote: So the upgrade helped to solve the issue, correct?

Nope.