Replication and Heartbeat channel down?

jortie
Posts: 12
Joined: Wed Sep 27, 2017 12:34 am

Wed Oct 11, 2017 2:17 pm

Hi,

I have a two node cluster. Node 2 reports that both sync and heartbeat channels are down to node 1.
The replication manager on Node 1 shows the sync and heartbeat channel towards node 2 as up.

I can ping the IP addresses on both sides for sync and heartbeat. Also telnet to port 3260 succeeds in all directions.
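For anyone repeating the port test above without telnet, here is a minimal Python sketch. The function names and the example addresses are hypothetical and not from this thread. Note that a successful TCP connect only shows that something is listening on port 3260; it does not prove that an iSCSI login would succeed.

```python
import socket

ISCSI_PORT = 3260  # the port tested with telnet in the post above


def tcp_port_open(host, port=ISCSI_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_all(addresses, port=ISCSI_PORT, timeout=3.0):
    """Map each address to whether its port accepted a TCP connection."""
    return {addr: tcp_port_open(addr, port, timeout) for addr in addresses}


# Example call (placeholder addresses, replace with the real sync and
# heartbeat IPs of the partner node):
# print(check_all(["192.0.2.10", "192.0.2.11"]))
```

Run it from each node against the partner's sync and heartbeat addresses so that both directions are covered.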

This is the log part I have on node 2.

10/11 15:20:45.265 2fe0 HA: CHAPartnerNode::SendPartnerNodeVersionRequestCommand: SendCustomControlScsiCommand(HA_CHANNEL_TYPE_SYNC) failed, error code 1168, scsi status = 0!
10/11 15:20:45.265 2fe0 HA: CHAPartnerNode::SendPartnerNodeVersionRequestCommand: Try to get partner node version through heartbeat channel.
10/11 15:20:45.265 2fe0 HA: CHAPartnerISCSIChannelManager::SendCustomControlScsiCommand: Valid channel not found!
10/11 15:20:45.265 2fe0 HA: CHAPartnerNode::SendPartnerNodeVersionRequestCommand: EXITing with failure, SendCustomControlScsiCommand(HA_CHANNEL_TYPE_HEARTBEAT) failed, error code 1168, scsi status = 0!
10/11 15:20:45.265 2fe0 HA: CHAPartnerNode::SendGetPartnerNodeInfoCommand: EXITing with failure, partner node version update failed!
10/11 15:20:47.459 1b98 Sw: *** iscsi_tcp_dispatch: iscsi_service failed with error: iscsi_service: socket error, connection lost.. Reestablish connection to the target iqn.2008-08.com.starwindsoftware:s001.clr001.syso.local-ssd
10/11 15:20:47.459 1b98 Sw: *** MountTarget: Failed to log in to target(iqn.2008-08.com.starwindsoftware:s001.clr001.syso.local-ssd). Error message: iscsi_service: socket error, connection lost..
10/11 15:20:47.459 1b98 HA: CNIXInitiator::MountTarget: unable to mount the target (1223)!
10/11 15:20:47.640 5d8 Sw: *** iscsi_tcp_dispatch: iscsi_service failed with error: iscsi_service: socket error, connection lost.. Reestablish connection to the target iqn.2008-08.com.starwindsoftware:192.168.0.100-witness
10/11 15:20:47.640 5d8 Sw: *** MountTarget: Failed to log in to target(iqn.2008-08.com.starwindsoftware:192.168.0.100-witness). Error message: iscsi_service: socket error, connection lost..
10/11 15:20:47.640 5d8 HA: CNIXInitiator::MountTarget: unable to mount the target (1223)!
10/11 15:20:47.862 2044 Sw: *** iscsi_tcp_dispatch: iscsi_service failed with error: iscsi_service: socket error, connection lost.. Reestablish connection to the target iqn.2008-08.com.starwindsoftware:s001.clr001.syso.local-coldstorage
10/11 15:20:47.862 2044 Sw: *** MountTarget: Failed to log in to target(iqn.2008-08.com.starwindsoftware:s001.clr001.syso.local-coldstorage). Error message: iscsi_service: socket error, connection lost..
10/11 15:20:47.862 2044 HA: CNIXInitiator::MountTarget: unable to mount the target (1223)!
10/11 15:20:48.279 2fe0 HA: CHAPartnerISCSIChannelManager::SendCustomControlScsiCommand: Valid channel not found!
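
To get an overview of a longer log like the one above, the failures can be counted per target and per error code. This is only a sketch: the names `summarize` and `WIN32_ERRORS` are made up for illustration, and reading 1168 and 1223 as Win32 system error codes is an assumption that nothing in this thread confirms.

```python
import re
from collections import Counter

# Assumption: the numeric codes in the log are Win32 system error codes.
WIN32_ERRORS = {
    1168: "ERROR_NOT_FOUND (element not found)",
    1223: "ERROR_CANCELLED (operation was canceled)",
}

# Matches the target name inside "Failed to log in to target(...)".
LOGIN_FAILURE = re.compile(r"Failed to log in to target\(([^)]+)\)")
# Matches "error code N" or "unable to mount the target (N)".
ERROR_CODE = re.compile(r"error code (\d+)|unable to mount the target \((\d+)\)")


def summarize(log_text):
    """Return (failed logins per target, occurrences per error code)."""
    targets = Counter()
    codes = Counter()
    for line in log_text.splitlines():
        match = LOGIN_FAILURE.search(line)
        if match:
            targets[match.group(1)] += 1
        for first, second in ERROR_CODE.findall(line):
            codes[int(first or second)] += 1
    return targets, codes
```

Calling `summarize(open(path).read())` on a saved log returns two counters; looking the codes up in `WIN32_ERRORS` gives a readable label where the assumption holds.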

Any thoughts on what is going on?
Boris (staff)
Staff
Posts: 805
Joined: Fri Jul 28, 2017 8:18 am

Wed Oct 11, 2017 3:19 pm

jortie,

If you select a device, right-click it and go to Replication Node Interfaces, try deleting one heartbeat interface that is marked with a red cross and then re-adding it. Also, what are your NICs (vendor, model), and are Jumbo Frames (MTU) enabled on the iSCSI/Sync network?
jortie

Wed Oct 11, 2017 6:22 pm

Re-adding didn't make a difference. As soon as I re-add the interface, it immediately shows a red cross. I have jumbo frames enabled on the replication NICs. The heartbeat NIC doesn't have jumbo frames enabled.

Replication NICs are Intel X540.
Heartbeat is Broadcom NetXtreme Gigabit. I just disabled Ethernet@WireSpeed. It doesn't make a difference, though.
Boris (staff)

Thu Oct 12, 2017 8:08 am

In certain cases, disabling Jumbo frames helps to solve this issue. Please try disabling Jumbo frames on both nodes and report how the links behave.
jortie

Thu Oct 12, 2017 8:50 am

Jumbo frames were enabled on one of the Intel NICs but not on the other. Jumbo frames are now disabled on both nodes. Both Broadcom NICs already had an MTU of 1500.
I deleted all interfaces and added them again. They instantly showed up with a red cross.
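
A mismatch like this can be confirmed end to end with a don't-fragment ping sized to the MTU. The sketch below only does the arithmetic and builds the Windows `ping` command (`-f` sets Don't Fragment, `-l` sets the payload size, `-n` the count); the function names and the example address are hypothetical.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    if mtu <= IPV4_HEADER + ICMP_HEADER:
        raise ValueError("MTU too small for an ICMP echo request")
    return mtu - IPV4_HEADER - ICMP_HEADER


def windows_ping_command(host, mtu, count=2):
    """Build a Windows ping command that fails if the path MTU is smaller."""
    return ["ping", "-f", "-l", str(max_ping_payload(mtu)), "-n", str(count), host]


# Example (placeholder address): a 9000-byte MTU allows an 8972-byte payload,
# a 1500-byte MTU allows 1472 bytes.
# print(" ".join(windows_ping_command("192.0.2.10", 9000)))
```

If the large ping fails while the 1472-byte one succeeds, some NIC or switch port on that path is not passing jumbo frames.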
Boris (staff)

Thu Oct 12, 2017 10:41 am

Could you post screenshots of the Replication Node Interfaces window for the same device on both servers?
jortie

Thu Oct 12, 2017 11:43 am

Here we go
Attachments
Knipsel3.PNG (41.75 KiB)
Knipsel2.PNG (55.84 KiB)
Knipsel.PNG (62.53 KiB)
jortie

Thu Oct 12, 2017 11:43 am

And the last one
Attachments
Knipsel4.PNG (40.41 KiB)
Boris (staff)

Thu Oct 12, 2017 12:04 pm

Thank you.
Actually, I was asking for Replication Node Interfaces, not Replication Manager. Could you post that for at least HAimage1 from both servers?
jortie

Thu Oct 12, 2017 3:04 pm

Sorry about that.
Attachments
node1: node1.PNG (11.92 KiB)
node2: node2.PNG (12.08 KiB)
Ivan (staff)
Staff
Posts: 172
Joined: Thu Mar 09, 2017 6:30 pm

Thu Oct 12, 2017 6:44 pm

Hello jortie,
Could you please share a screenshot of the "Network" tab on each StarWind node, like the one in the picture below?
Thank you.
Attachments
Screenshot_2.png (27.39 KiB)
jortie

Thu Oct 12, 2017 11:21 pm

Here you go!
Attachments
Knipsel11.PNG (234.65 KiB)
Ivan (staff)

Fri Oct 13, 2017 11:53 am

Hello jortie,
Thank you for the screenshots.
Could you please check "Windows features and Roles", remove the features shown in the screenshot below (if they exist), and reboot the server?
Thank you.
Attachments
Screenshot_9.png (31.11 KiB)
jortie

Fri Oct 13, 2017 4:31 pm

They are not installed on either node.
Ivan (staff)

Fri Oct 13, 2017 4:40 pm

Hello jortie,
Thank you for your reply.
Have you tried rebooting the second node?
Could you please collect the logs and share them with us?
For quicker and easier log collection from StarWind nodes, feel free to use the script from our knowledge base article below:
https://knowledgebase.starwindsoftware. ... collector/
You can upload the collected logs to any cloud storage (Dropbox, Google Drive, OneDrive, etc.) and share the download link.