Log On to Target - Service Unavailable

Caillin
Posts: 30
Joined: Tue Apr 02, 2013 1:03 am

Tue May 28, 2013 4:43 am

I've just finished setting up a lab environment to test the Native-SAN product with 3 nodes. I have the software installed as well as all the required roles/features on Server 2012 Standard.
I've configured the HA Volume and device, and confirmed that it's synchronised properly.

When I get to the part of connecting to the iSCSI target via the MS initiator, I get an error that says "Log On to Target - Service Unavailable".
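For reference, the connection attempt boils down to roughly this via the iSCSI initiator cmdlets (the portal address and target IQN are the ones that appear in the log below):

# Register the StarWind node as a target portal, list its targets, then try to log on
New-IscsiTargetPortal -TargetPortalAddress 192.168.0.144
Get-IscsiTarget
Connect-IscsiTarget -NodeAddress "iqn.2008-08.com.starwindsoftware:127.0.0.1-testlun1" -IsPersistent $true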

I think this may have something to do with the fact that the HA LUN is configured on the system drive (as a virtual IMG file). I had planned on using 16GB USB flash drives as the "disk arrays" for testing, but it seems that SW doesn't see removable storage as a suitable storage volume.

I've attached the relevant sections of the log below:

5/28 14:21:49.310 a30 Srv: Accepted iSCSI connection from 192.168.0.144:49364 to 192.168.0.144:3260. (Id = 0x12)
5/28 14:21:49.310 a30 C[12], FREE: Event - CONNECTED.
5/28 14:21:49.310 a30 C[12], XPT_UP: T3.
5/28 14:21:49.310 88c C[12], XPT_UP: Login request: ISID 0x400001370000, TSIH 0x0000.
5/28 14:21:49.310 88c C[12], XPT_UP: Event - LOGIN.
5/28 14:21:49.310 88c C[12], IN_LOGIN: T4.
5/28 14:21:49.310 88c Params: <<< String param 'InitiatorName': received 'iqn.1991-05.com.microsoft:starwindtest1.pickard.local', accepted 'iqn.1991-05.com.microsoft:starwindtest1.pickard.local'
5/28 14:21:49.310 88c Params: <<< Enum param 'SessionType': received 'Normal', accepted 'Normal'
5/28 14:21:49.310 88c Params: <<< String param 'TargetName': received 'iqn.2008-08.com.starwindsoftware:127.0.0.1-testlun1', accepted 'iqn.2008-08.com.starwindsoftware:127.0.0.1-testlun1'
5/28 14:21:49.310 88c Params: <<< Enum param 'AuthMethod': received 'None', accepted 'None'
5/28 14:21:49.325 88c HA: CHADevice::RegisterSession: Client initiator iqn.1991-05.com.microsoft:starwindtest1.pickard.local try to register session for target 'iqn.2008-08.com.starwindsoftware:127.0.0.1-testlun1'...
5/28 14:21:49.325 88c HA: CHADevice::RegisterSession: Client session can't be registered because node is not active!
5/28 14:21:49.325 88c HA: CHADevice::RegisterSession: Return code 21.
5/28 14:21:49.325 88c Tgt: *ERROR* 'iqn.2008-08.com.starwindsoftware:127.0.0.1-testlun1' can't register session. The device 'HAImage1' may be owned by a local process!
5/28 14:21:49.325 88c T[12,1]: *ERROR* Login request: device open failed.
5/28 14:21:49.325 88c C[12], IN_LOGIN: recvData returned 10058
5/28 14:21:49.325 e84 C[12], IN_LOGIN: Event - LOGIN_REJECT.
5/28 14:21:49.325 88c C[12], IN_LOGIN: *** 'recv' thread: recv failed 10058.
5/28 14:21:49.325 e84 C[12], FREE: T7.
Caillin
Posts: 30
Joined: Tue Apr 02, 2013 1:03 am

Tue May 28, 2013 4:52 am

Update: For some reason it will connect fine to the third node, but not the first or second. The second node can also connect to the third node, but not to nodes one and two.
Caillin
Posts: 30
Joined: Tue Apr 02, 2013 1:03 am

Tue May 28, 2013 5:00 am

Ok, I just went back into the console and checked, and it's re-syncing again, although really slowly. I guess somewhere in the deployment a resync was triggered, which is why the other two targets aren't online.
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Tue May 28, 2013 8:16 am

Hi Caillin,
Please keep us posted!
Also, do you see any information about the synchronization taking place on the nodes?
Max Kolomyeytsev
StarWind Software
Caillin
Posts: 30
Joined: Tue Apr 02, 2013 1:03 am

Tue May 28, 2013 10:41 pm

Yes, nodes 1 and 2 had started a full sync from node 3. It's at 85% done after 18 hours though, so it's taking quite a long time. This is for a 20GB img file over a dedicated 100Mbit link. By the look of the attached perfmon, it's only syncing at 2.2Mbit, which seems awfully low. An SMB file copy runs at the full 100Mbit.
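As a rough sanity check on those numbers (treating GB and Mbit loosely):

# 20GB image at the observed 2.2Mbit/s sync rate
$hours = (20 * 8 * 1000) / 2.2 / 3600   # image size in Mbit, divided by rate, divided by seconds per hour
$hours   # roughly 20 hours, which lines up with being ~85% done after 18 hours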
Attachments
SW3Speed.png
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Thu May 30, 2013 11:31 am

Caillin,
May I ask if you have any updates for us? Did the HA devices get synchronized? Have you been able to connect to the first node?
Thank you
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
Caillin
Posts: 30
Joined: Tue Apr 02, 2013 1:03 am

Fri May 31, 2013 12:55 am

All 3 targets are up now on all 3 nodes after they finished syncing. I've got the 2012 cluster set up, successfully created a CSV, and built a guest HA VM; live migration between the three nodes works like a charm.
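For anyone following along, the cluster/CSV part boils down to something like this in PowerShell (the cluster name, node names and IP are just placeholders):

# Build the 3-node cluster, then promote the StarWind-backed disk to a Cluster Shared Volume
New-Cluster -Name LabCluster -Node Node1,Node2,Node3 -StaticAddress 192.168.0.200
Get-ClusterAvailableDisk | Add-ClusterDisk
Add-ClusterSharedVolume -Name "Cluster Disk 1"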

I ran NTTTCP between node 1 and node 2, and it seems to cap out at about 3Mbit (300KB/s) on the sync channel, so the sync slowness, and the general extreme slowness inside the VM, can be attributed to poor networking.
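For reference, an NTTTCP run over the sync subnet looks something like the commands below (switch syntax is from the standard Microsoft ntttcp.exe build, so worth double-checking against the tool's help output; the 172.16.20.x address is just an example):

# On node 2 (receiver), bound to its sync-channel address
ntttcp.exe -r -m 4,*,172.16.20.2 -t 60

# On node 1 (sender), pushing to node 2's sync-channel address
ntttcp.exe -s -m 4,*,172.16.20.2 -t 60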

As I'm using 3x Zotac AD12 nano PCs with 2x USB NICs in each as a lab environment, I wasn't exactly expecting great performance lol. Just mainly testing how easy the software is to configure, and how well it works with 2012 clustering and CSVs.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Jun 03, 2013 4:19 pm

Well, I'm glad that it all went well and is working almost seamlessly (the performance isn't great due to the weak network).

Please do not hesitate to contact us should you have any additional questions. It would be our pleasure to further assist you!
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
Caillin
Posts: 30
Joined: Tue Apr 02, 2013 1:03 am

Mon Jun 03, 2013 11:57 pm

Just a question regarding proper procedures and outcomes for different shutdown scenarios. I've built a bit of a picture from going through forum posts, but I'm a little confused as new features have been added, so it would be good to get some clear outlines for the current version.

We plan to have write-back cache enabled on the 3 nodes. From what I understand:

Scenario 1: If a single node is gracefully shut down, then write-back stays enabled on the remaining two nodes, and MPIO throughput drops by 33%. If that node is brought back up, this will trigger a quick sync.
Scenario 2: If one node is already gracefully shut down, and then a second node is gracefully shut down (I know you shouldn't do this, just trying to understand the software), then the write-back cache is flushed and the array is set to write-through, as there is no longer redundancy until at least one other node returns and re-syncs. When the other nodes are brought online, this will trigger a full sync.
Scenario 3: If a single node is shut down dirty (power pulled), then write-back cache stays enabled, as there is still HA on the 2 remaining nodes. When the node returns, it is fully synced.
Scenario 4: Two nodes are shut down dirty. The write-back cache gets flushed and set to write-through, and the two downed nodes are fully synced when they come back online.
Scenario 5: All nodes are shut down dirty at the same time (UPS failure). It seems that in this event there is inevitably going to be some data corruption.

If we lose power to the building and we have enough time to gracefully shut down the entire environment, what is the proper procedure? If the nodes are gracefully shut down one at a time, can they then be brought back online, one after the other, with only a fast sync, or will each node need a full sync?

I'm going to be testing these scenarios, but I would love official feedback from you guys on best practices for a 3-node HA cluster with Native-SAN, as this is a fairly new offering that doesn't seem to have as much love in the tech docs as the older solutions.

Cheers.
Caillin
Posts: 30
Joined: Tue Apr 02, 2013 1:03 am

Tue Jun 04, 2013 1:05 am

Also, one more question on LAN/NIC configuration, as the diagrams currently available don't show a 3-node Hyper-V Native-SAN arrangement and best practice. Attached is the best diagram I have, which Anatoly sent me last month.

I plan on having 2 x dual-port 10GbE NICs per server + 1 x quad-port 1GbE NIC.
The sync channels will be directly connected to each other's 10GbE NICs with no switches, as per the attached diagram (Sync 1, 2, 3 & 4).

Will it suffice to use two of the 1GbE ports on each server for the iSCSI/Heartbeat traffic?
If I am only providing iSCSI targets to the other 2 nodes, is it OK to direct-connect those 1GbE iSCSI/Heartbeat channels to each other and not use a switch?
Should I add a third 10GbE card and use it for iSCSI/Heartbeat traffic? The current SAN is seeing around 2500 IOPS at the 95th percentile with the current 30-guest Hyper-V environment.

Just wanting to be 100% on the best practice configuration before pulling the trigger so to speak.
Attachments
Native-SAN.png
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Jun 10, 2013 11:23 am

Caillin wrote:
Scenario 1: If a single node is gracefully shut down, then write-back stays enabled on the remaining two nodes, and MPIO throughput drops by 33%. If that node is brought back up, this will trigger a quick sync.
Scenario 2: If one node is already gracefully shut down, and then a second node is gracefully shut down (I know you shouldn't do this, just trying to understand the software), then the write-back cache is flushed and the array is set to write-through, as there is no longer redundancy until at least one other node returns and re-syncs. When the other nodes are brought online, this will trigger a full sync.
Scenario 3: If a single node is shut down dirty (power pulled), then write-back cache stays enabled, as there is still HA on the 2 remaining nodes. When the node returns, it is fully synced.
Scenario 4: Two nodes are shut down dirty. The write-back cache gets flushed and set to write-through, and the two downed nodes are fully synced when they come back online.
Scenario 5: All nodes are shut down dirty at the same time (UPS failure). It seems that in this event there is inevitably going to be some data corruption.
Here are the situations where a full synchronization starts instead of a fast one:
1) The HA device was configured to use write-back cache and one of the servers was not shut down correctly (i.e. hard reset, power outage, etc.);
2) Write errors were detected on the disk;
3) The partner that should be the source of synchronization is in a state other than "Synchronized";
4) The initial synchronization was interrupted for any reason;
5) The HA device was extended;
6) The partner that should be the source of synchronization gained the "Synchronized" state via the "Mark as synchronized" command.

In all other cases where synchronization is initiated automatically, a fast sync should take place.

I hope that answers all the "Full or Fast synchronization" questions.

There is a possibility that data will not be flushed from cache to disk if a node shuts down due to an electricity failure (i.e. the power cord being pulled).

P.S. You'll see the 33% performance drop after one node goes down only when running a 3-node HA storage cluster.
Caillin wrote:
If we lose power to the building and we have enough time to gracefully shut down the entire environment, what is the proper procedure? If the nodes are gracefully shut down one at a time, can they then be brought back online, one after the other, with only a fast sync, or will each node need a full sync?

We have a document, HA Maintenance and Configuration Changes, which basically answers your question. Please refer to the "Preparing an HA device for prolonged downtime" section.
Caillin wrote:
Will it suffice to use two of the 1GbE ports on each server for the iSCSI/Heartbeat traffic?

It will be enough for proper functioning of the HA device(s). Heartbeat is a low-resource technology (it uses ~200MB per month), and it helps to avoid the split-brain issue, so it is highly recommended to have a heartbeat on each data link (except the sync channel, which already comes with heartbeat functionality).
Caillin wrote:
If I am only providing iSCSI targets to the other 2 nodes, is it OK to direct-connect those 1GbE iSCSI/Heartbeat channels to each other and not use a switch?

That should work :) (see the sketch after my answers below)
Caillin wrote:
Should I add a third 10GbE card and use it for iSCSI/Heartbeat traffic? The current SAN is seeing around 2500 IOPS at the 95th percentile with the current 30-guest Hyper-V environment.

As I've mentioned above, the heartbeat is not resource-hungry, so you just need to keep your system balanced. Please refer to the Networking section of our Best Practices documentation.
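To illustrate the direct-connected setup: each node logs on to its partner's target over the dedicated link, pinning the session to the right NIC by specifying both portal addresses. A rough sketch (the 172.16.x.x addresses and the partner IQN below are made up for the example):

# From node 1, log on to node 2's target over the direct 1GbE iSCSI link
New-IscsiTargetPortal -TargetPortalAddress 172.16.10.2 -InitiatorPortalAddress 172.16.10.1
Connect-IscsiTarget -NodeAddress "iqn.2008-08.com.starwindsoftware:node2-testlun1" `
    -TargetPortalAddress 172.16.10.2 -InitiatorPortalAddress 172.16.10.1 `
    -IsPersistent $true -IsMultipathEnabled $true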
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
Caillin
Posts: 30
Joined: Tue Apr 02, 2013 1:03 am

Tue Jun 11, 2013 1:37 am

Thanks for the replies Anatoly. So the main potential issue with using write-back cache is that a UPS failure can lead to data corruption. To ameliorate this, should we use a second UPS on at least one of the 3 nodes?
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Thu Jun 13, 2013 3:40 pm

Generally speaking - yes.
As an option, you can write a script that shuts down the StarWind service/server the moment a total power outage is detected, but once again, that will not protect you if the UPS unit itself fails.
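A minimal sketch of such a script, assuming the service is registered as "StarWindService" (please check the actual service name with Get-Service on your installation):

# Triggered by the UPS software once it reports running on battery / low charge
Stop-Service -Name "StarWindService" -Force   # stop the StarWind service first (a graceful stop should let it flush its cache)
Stop-Computer -Force                          # then shut the node down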
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com