Split Brain issues, Scale Out File server
Posted: Fri Jul 30, 2021 2:48 pm
Hello. We have a cluster with 2 nodes and a witness disk for quorum. The nodes are 2 virtual machines running Windows Server 2016, connected by 3 network interfaces: heartbeat, sync and cluster management.
The cluster has 2 disks: 1 CSV and 1 witness disk. Both disks are connected over iSCSI.
We configured the StarWind software following this guide: https://www.starwindsoftware.com/resour ... r-2012-r2/
The configuration is: Failover strategy: Heartbeat, Mode: Synchronous.
Lately we have experienced synchronization problems between the nodes. If communication between the nodes is interrupted, one of them is marked as "not synchronized" and does not resume synchronization automatically. One workaround that StarWind support gave us was to disconnect and reconnect the synchronization interface.
Last weekend, the cluster nodes lost communication with each other for 5 minutes due to network problems. The shared folders hosted on the cluster were not available from the client computers. There was a total failure of service.
In the StarWind management console, all disks were marked red (not synchronized). In the Microsoft failover cluster console, all disks were marked offline, and so were all cluster roles.
Investigating the Windows cluster log (cluster.log), I found that the cluster entered a split brain condition at that moment. I cannot understand why this happened. My assumption is that, in addition to the heartbeat interface, the cluster has a witness disk connected over iSCSI, which should provide quorum so that the service does not fail and the two nodes do not both enter a split brain condition.
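To make my assumption explicit, here is a minimal sketch of the vote arithmetic I have in mind. It assumes a simple majority rule over 3 votes (node A, node B, witness disk) and ignores dynamic quorum and dynamic witness adjustments; the function name `has_quorum` is only illustrative and is not part of any Microsoft or StarWind tool.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """Return True if the reachable votes form a strict majority."""
    return votes_reachable > total_votes // 2


# Assumed vote layout: node A + node B + witness disk.
TOTAL_VOTES = 3

# A node that loses its partner but still owns the witness disk: 2 of 3 votes.
node_with_witness = has_quorum(2, TOTAL_VOTES)

# A node that loses both its partner and the witness disk: 1 of 3 votes.
node_alone = has_quorum(1, TOTAL_VOTES)

print("node + witness keeps quorum:", node_with_witness)
print("isolated node keeps quorum:", node_alone)
```

Under this rule I expected one node to keep the witness and stay online. The sketch also shows that a node which cannot reach the witness disk is left with a single vote, so if the witness disk itself became unavailable during the outage, neither node would hold a majority.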
Could you help me understand what happened here?