Slow connection and VM collapsing when testing FT

Piterskuy
Posts: 9
Joined: Mon Sep 02, 2013 8:40 am

Wed Sep 04, 2013 11:45 am

Hello!

I have a question about synchronization and fault tolerance (FT).

I installed StarWind Native SAN for Hyper-V on 2 servers running Windows Server 2012 with Hyper-V and created an HA cluster.
I use the Free license and OpenVPN for the heartbeat channel (but I don't think that matters for this question).
The connection between the servers is 100 Mbit.

So I tried to test fault tolerance. I restarted my first server, and the VM started on the second server - everything was OK.
Then I tried a hardware restart, and the VM ended up corrupted. I had to create a new VM because that one was broken.

So my question is: could the 100 Mb/s connection (instead of 10 Gb/s) have caused this, or could there be another problem?
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Wed Sep 04, 2013 1:11 pm

One by one:
1. A 100 Mbps connection doesn't meet our system requirements.
2. Which server did you restart when you say "hardware restart"?
3. Which server was hosting the VM when you restarted it?
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
Piterskuy
Posts: 9
Joined: Mon Sep 02, 2013 8:40 am

Wed Sep 04, 2013 1:45 pm

2 servers: VM-ONE and VM-TWO

1. I know, but it's the only hardware I have to test the software on.
2-3. VM-ONE hosted the VM. I reset VM-ONE (by pressing the reset button) and that's all. VM-TWO started hosting the VM, but the VM came up corrupted. I had to delete it and create a new one from the virtual hard disk.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Wed Sep 04, 2013 2:46 pm

1. Well, we cannot guarantee that this configuration will work - you can't do 300 km/h in a Daewoo :)
2-3. Can you confirm that the synchronization process had finished at the moment of the reset? What errors do you see in the Windows logs related to this issue?
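If it helps, here is a rough sketch of how the relevant entries could be pulled with PowerShell on the host - the time window below is a placeholder, substitute the actual time of the reset:

Code:
# Placeholder time window around the hard reset - adjust to your own timestamps.
$start = Get-Date '2013-09-04 11:30'
$end   = Get-Date '2013-09-04 12:00'

# Critical/Error/Warning entries from the System log (disk, iSCSI and cluster
# problems normally land here).
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Level = 1,2,3; StartTime = $start; EndTime = $end } |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, ProviderName, Message -Wrap

# The Hyper-V worker channel often records why a guest came back dirty after a
# failover (channel name as on Server 2012; skip if it is not present).
Get-WinEvent -FilterHashtable @{ LogName = 'Microsoft-Windows-Hyper-V-Worker-Admin'; StartTime = $start; EndTime = $end }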
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
Piterskuy
Posts: 9
Joined: Mon Sep 02, 2013 8:40 am

Thu Sep 05, 2013 3:31 pm

1. Nice example! :D
2-3. The only thing I can't confirm is the online synchronization between the servers at the moment of the reset, because I don't know how to check online synchronization. The main synchronization of the data, done while the servers were idle, had finished!
I have the logs, but they contain a lot of useless information. I can attach the file and tell you the interval where the restart happened.
(As I remember, the reset was at the junction of these two files.)
Attachments
starwindlogs.rar
As I remember, the reset was at the junction of these two files
(112.66 KiB) Downloaded 370 times
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Sep 09, 2013 7:59 am

OK, let's stick with 2-3.

Well, that is weird - we've tested an analogous configuration multiple times and haven't seen the behaviour you describe.
Can I ask whether you have configured everything according to our guidelines and recommendations?
http://www.starwindsoftware.com/starwin ... al-servers
http://www.starwindsoftware.com/starwin ... ces-manual
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
Piterskuy
Posts: 9
Joined: Mon Sep 02, 2013 8:40 am

Mon Sep 23, 2013 2:49 pm

Hello!

I replaced my slow network adapter with a 1 Gbit one, and I haven't seen this bug again.

But I have 2 questions.
1. After a hardware reset I get an automatic full resynchronization of my cluster, so 50 GB of storage took more than an hour to synchronize. Is this normal? I want to build a 2 TB cluster and don't want to wait 2 days :)

2. After a hardware reset on VM-TWO, which was hosting the VM, I see on VM-ONE that the cluster disconnects from Failover Cluster Manager, and afterwards I have to connect to the cluster manually. The relevant entries from the Windows log:
Code: Event 1135
17:35:35
Node "VM-TWO" was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected, such as hubs, switches, or bridges.

Code: Event 1177
17:35:56
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected, such as hubs, switches, or bridges.
I also attach the full logs.
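(For reference, one way to pull just these two events out of the System log instead of reading the whole export - the event IDs are the ones shown above, the rest is generic PowerShell:)

Code:
# Membership-lost (1135) and quorum-lost (1177) events from the Failover Clustering provider.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1135, 1177
} | Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, Message -Wrap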

P.S. I configured everything (or at least tried to) according to the manuals you sent. I have the quorum, the cluster, the disks attached through the iSCSI initiator, and so on.
Attachments
LOGS.7z
(49.3 KiB) Downloaded 349 times
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Sep 23, 2013 3:52 pm

1. The performance depends on a lot of things, and it is a pretty time-consuming process to find out where the bottleneck is (a rough back-of-the-envelope estimate is sketched below). Let me know if you want to dive deeper into this investigation.
2. I see that the witness disk is dropping from online, but since at least one node was up you should still see the quorum presented to the cluster. I suspect there is some minor misconfiguration. Could you double-check that you have configured everything according to our guidelines described in this document (especially the part about connecting to the HA device)?
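A rough back-of-the-envelope estimate for point 1, assuming the full resync is limited by the replication link and that "50 GB in a bit over an hour" is representative (the ~112 MB/s figure is just the usual payload ceiling of a single 1 GbE link):

Code:
# Back-of-the-envelope resync estimate.
$observedGB    = 50
$observedHours = 1.0
$effectiveMBps = ($observedGB * 1024) / ($observedHours * 3600)   # ~14 MB/s observed

$targetGB  = 2 * 1024                                             # 2 TB device
$scaledHrs = ($targetGB / $observedGB) * $observedHours           # ~41 h at the same rate

# For comparison: a single 1 GbE link carries roughly 112 MB/s of payload,
# which would cover 2 TB in about 5 hours if the disks keep up.
$wireHours = ($targetGB * 1024) / 112 / 3600

'{0:N1} MB/s effective, ~{1:N0} h for 2 TB at that rate, ~{2:N1} h at 1 GbE wire speed' -f $effectiveMBps, $scaledHrs, $wireHours

So at the observed rate 2 TB really would take close to two days; the gap between ~14 MB/s and ~112 MB/s is what the investigation would need to explain (network, disks, or something else).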
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
Piterskuy
Posts: 9
Joined: Mon Sep 02, 2013 8:40 am

Mon Sep 30, 2013 1:39 pm

Thanks a lot for the docs!
I did everything according to them.
Now everything works great with live or dynamic migration, or when the host is restarted normally (via the "Start" button). But when I do a "hardware" restart on the host machine, the VM is closed and relaunched on the working host. So there are no crashes (of the VM or the cluster), but the VM restarts. What am I doing wrong?
I have one idea that may help you: when configuring the HA storage I connected to the target via the iSCSI initiator and selected the "Enable multi-path" checkbox in the "Connect To Target" dialog, but when I opened this connection again the checkbox was not selected.
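In case it is relevant to the multi-path point, a minimal sketch of how the iSCSI connection can be made multipath-enabled and persistent from PowerShell on Windows Server 2012 - the portal address and target filter are placeholders, not the actual values from this setup, and this is not taken from the StarWind guide:

Code:
# One-time MPIO setup (the feature needs a reboot after installation).
Install-WindowsFeature Multipath-IO
Enable-MSDSMAutomaticClaim -BusType iSCSI        # let MPIO claim iSCSI disks automatically

# Register the portal and connect with multipath + persistence so the session
# is restored after a reboot.
New-IscsiTargetPortal -TargetPortalAddress '10.0.0.1'            # placeholder portal IP
Get-IscsiTarget |
    Where-Object { $_.NodeAddress -like '*ha*' } |               # placeholder filter for the HA target IQN
    Connect-IscsiTarget -IsMultipathEnabled $true -IsPersistent $true

# Verify what is actually connected: IsPersistent should be True, and
# 'mpclaim -s -d' should list more than one path once both portals are added.
Get-IscsiSession | Format-Table TargetNodeAddress, IsPersistent, IsConnected

Whether or not the GUI checkbox re-displays as ticked, Get-IscsiSession and mpclaim show what is actually in effect.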
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Tue Oct 01, 2013 3:19 pm

Well, that is expected behaviour due to the Microsoft Hyper-V architecture. You need a "clean" shutdown to get a Live Migration, not a hard reset.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
Piterskuy
Posts: 9
Joined: Mon Sep 02, 2013 8:40 am

Wed Oct 02, 2013 8:02 am

So I can't configure a fault-tolerant cluster using StarWind and Hyper-V.
Could VMware + StarWind, or some other software + StarWind, solve this problem?
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Wed Oct 02, 2013 9:14 am

Of course you can (as far as I know 8) ) - VMware has RDMA, which replicates the data from the VM's RAM across all hosts and so ensures FT.
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com