Starwind Target becomes unresponsive

Mon May 14, 2012 3:35 pm

Can you please update StarWind to the latest build and try the dedup devices again?
If it fails again, please provide us with the StarWind logs (just send them to support@starwindsoftware.com).

Thanks

thewafflecaust · Tue May 15, 2012 8:25 am

So I removed one of the basic devices and added a dedupe device.

Same result as previous. Target froze, VMs go unresponsive. I have sent the logs through to the support email.

Sigh. I dunno if it would make much difference, but the underlying physical devices I'm using are SSDs. That wouldn't cause a problem for dedupe, would it?

Tue May 15, 2012 1:09 pm

I'd also ask you to give us more detailed information about your system:
What OS are you running on the StarWind server?
What size are the devices?
How much RAM have you assigned, and to which cache?
How much RAM does the SAN box have in total?
What block size have you used for the deduplication devices?

thewafflecaust · Tue May 15, 2012 1:19 pm

Windows Server 2008 R2 Standard - SP1 only, no other patches.
Physical devices are 120GB SSDs. Starwind devices are 110GB.
No caching on any devices.
8GB total on the SAN box.
I have tried auto and 256k. Same result.

thewafflecaust · Tue May 15, 2012 1:31 pm

Very interesting: your previous question prompted me to put a cache on the dedupe device. It is currently processing about 500mb/s of IO from two hosts and isn't freezing.

thewafflecaust · Tue May 15, 2012 1:50 pm

Another oddity: when I turn on round-robin MPIO, the second path becomes active and sends traffic for a short interval (about 10 minutes). Then I get the error below and the second path goes idle, even though the overall storage operation continues (I'm using Storage vMotion to generate traffic to the SAN box).

5/15 23:38:35.458 b20 T[14,1e7e5]: Management command: abort task (CmdSN 124895, ITT 0xe2e70100).
5/15 23:38:35.458 b20 T[14,1e7e4]: LUN 200, state 0x40, immediate 0, ITT 0xe2e70100, CmdSn 124895, TTT 0x3cfc8
5/15 23:38:35.458 b20 T[14,1e7e4]: read/write 10, read length 32768, read done 0, write length 0, write done 0, DATA-IN PDUs 0
5/15 23:38:35.458 b20 T[14,1e7e4]: DataSN 0, R2TSN 0, status 0, status class 0, status detail 0, response 0, counter 0, authStage 0
5/15 23:38:35.458 b20 T[14,1e7e4]: CDB
0000 28 00 03 eb c6 00 00 00 40 00 00 00 00 00 00 00 (..ëÆ...@.......
5/15 23:38:35.458 b20 T[14,1e7e4] Ssc 0000000003380EB0 - aborted, but still being executed.
5/15 23:38:35.473 80c T[14,1e7e4]: Aborted task completed (state 0x600)
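
For anyone reading the log above, the CDB line can be decoded by hand. Here is a minimal Python sketch, assuming the first ten bytes are a standard SCSI READ(10) command (opcode 0x28) and that the device uses 512-byte logical blocks. The block size is an assumption, not something the log states, and the names `decode_read10` and `ASSUMED_BLOCK_SIZE` are made up for illustration.

```python
# First ten bytes of the CDB, copied from the log line above.
cdb = bytes.fromhex("28 00 03 eb c6 00 00 00 40 00")

# Assumption: 512-byte logical blocks (not stated anywhere in the log).
ASSUMED_BLOCK_SIZE = 512


def decode_read10(cdb: bytes) -> dict:
    """Split a READ(10) CDB into opcode, starting LBA and block count."""
    if len(cdb) < 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length in blocks
    return {"opcode": cdb[0], "lba": lba, "blocks": blocks}


decoded = decode_read10(cdb)
byte_count = decoded["blocks"] * ASSUMED_BLOCK_SIZE
print(decoded, byte_count)
```

Under that assumption the transfer length of 64 blocks comes to 32768 bytes, which agrees with the "read length 32768" reported two lines earlier in the log. So the aborted task looks like an ordinary 32 KB read rather than anything unusual.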

Tue May 15, 2012 3:28 pm

OK, could you please provide us with a screenshot of the management console showing the descriptions of all devices?

Thank you

thewafflecaust · Wed May 16, 2012 7:10 am

http://i.imgur.com/hM0OS.png

Is that what you are after?

Wed May 16, 2012 10:53 am

Yes! Thank you.

OK, and can you please check the CPU and RAM load on the SAN server while you are trying to connect to the target with more than one ESX host and it hangs?

thewafflecaust · Wed May 16, 2012 11:33 am

I put a huge amount of load on it tonight over 3 active paths and none of the paths died, so I suspect yesterday's issue was an isolated one.

Do you have any particular idea as to why I need to have caching active to prevent the dedupe devices from freezing?

Fri May 18, 2012 2:44 pm

Well, it is because the engine needs some space to function properly.
So, would it be possible for you to share the previously requested information?

thewafflecaust · Sat May 19, 2012 1:43 am

OK, and can you please check the CPU and RAM load on the SAN server while you are trying to connect to the target with more than one ESX host and it hangs?

Usually it was 20-30% cpu usage (under active load) and between 2-4GB RAM usage depending on how I had the target devices set up. There was no apparent correlation between system utilisation and the target going unresponsive when two hosts tried to access it.

If the engine needs space for proper dedupe functionality why is no cache even an option on dd?

Sat May 19, 2012 1:10 pm

DD is cached in the same way RAW images are. It has its own level of caching (or, better said, buffering), which is required by the DD design and is rather small (a few hundred megabytes). L1 (RAM) cache is layered on top of it in any case.
