VSAN Management service port randomly dead

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

alexandru-bagu
Posts: 6
Joined: Tue Jul 02, 2024 8:55 am

Tue Jul 02, 2024 9:11 am

Hello!

We have set up the vSAN CVM Free version with two ESXi 7 nodes in a cluster with heartbeat synchronization. Everything works great, except that the management port randomly dies. By the management port I mean 3261. This is the second time I have found that the wine process that normally listens on port 3261 is no longer listening for new TCP connections.
Even when the management port is down, the iSCSI targets are online and synchronizing just fine, but because the management service is unreachable, the CVM UI reports that the LUN is not highly available (even though it is).

Are you guys aware of this?
I suspect the culprit might be a vulnerability scan we run every week. After I detected the last fault in the management service, I decided to pay close attention to when the process stops listening on port 3261, and the event coincided with the scan (or rather, the ports died a few minutes into the scan).
The scanner itself also uses a lot of CPU, and one of the hosts was/is CPU starved during the process. I don't believe CPU starvation would cause this issue, though.
The scanner is Greenbone's OpenVAS, running the latest version.

Let me know what information I can provide to help avoid this issue.
yaroslav (staff)
Staff
Posts: 3427
Joined: Mon Nov 18, 2019 11:11 am

Tue Jul 02, 2024 9:27 am

Welcome to StarWind Forum.
The problem seems to be related to the vulnerability scanner, not StarWind CVM. We generally recommend excluding 3261 and 3260 from the firewall and scans.
P.S. In the future, for the NVMe-oF Target with StarWind NVMe-oF initiator, you will need to exclude 4420 too.
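If the scanner takes an explicit list of ports to scan rather than a list of exclusions, the recommendation above can be expressed as port ranges. A minimal Python sketch, assuming the full TCP range 1-65535; the function name is hypothetical, and the exact port-list syntax a given scanner expects may differ:

```python
def port_ranges_excluding(excluded, lo=1, hi=65535):
    """Return comma-separated ranges covering lo..hi minus the excluded ports."""
    ranges = []
    start = lo
    for port in sorted(set(excluded)):
        if port < lo or port > hi:
            continue  # ignore ports outside the requested range
        if port > start:
            ranges.append((start, port - 1))
        start = port + 1
    if start <= hi:
        ranges.append((start, hi))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


# Ports named in this thread: 3260 (iSCSI), 3261 (management), 4420 (NVMe-oF)
STARWIND_PORTS = {3260, 3261, 4420}
```

Calling port_ranges_excluding(STARWIND_PORTS) gives a range string that skips those three ports and can be adapted to the scanner's own port-list format.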
alexandru-bagu
Posts: 6
Joined: Tue Jul 02, 2024 8:55 am

Tue Jul 02, 2024 9:48 am

While I understand that the issue might be the vulnerability scanner (that is still not confirmed, anyway), because of the current landscape more and more people will run this sort of scan and end up in my situation.
Moreover, unless the vulnerability scanner has some NVT specific to the StarWind management service, it is not doing anything out of the ordinary, and that means that something is poorly handled in the socket listener of the StarWind management software for it to randomly stop listening.

I am sure this issue does not affect only the free version, as it's basically the same software as the licensed one.
Obviously, since I don't want to deal with this issue, in my case I will add the exclusion.
yaroslav (staff)
Staff
Posts: 3427
Joined: Mon Nov 18, 2019 11:11 am

Tue Jul 02, 2024 10:08 am

As you mentioned, the console disconnects shortly after the scanner runs. I often see this with other users, where the firewall blocks port 3261.
What I am trying to say is that the service is unlikely to stop on its own or randomly. There is a clear network blip for the service, which is treated as a "network interruption".
Let me know if you have more questions.
May I ask whether you can reconnect after the blip?
alexandru-bagu
Posts: 6
Joined: Tue Jul 02, 2024 8:55 am

Tue Jul 02, 2024 10:18 am

That's the thing, it's dead. There's no way to connect back to it.

To restore connectivity I have to take down the whole cluster: stop the starwind-virtual-san service on all nodes, start starwind-virtual-san on the node I know is up to date, mark it as synchronized, then start starwind-virtual-san on the other node and let it synchronize on its own.
yaroslav (staff)
Staff
Posts: 3427
Joined: Mon Nov 18, 2019 11:11 am

Tue Jul 02, 2024 11:08 am

You do not need to take the cluster down; follow this procedure to restart the VM: https://knowledgebase.starwindsoftware. ... installed/
Please reach out to support@starwind.com (use this thread and 1180298 as your references).
alexandru-bagu
Posts: 6
Joined: Tue Jul 02, 2024 8:55 am

Tue Jul 02, 2024 11:25 am

That procedure does not help because the nodes are reported as "Not Synchronized" in the CVM and starwind-virtual-san is not listening on port 3261 on either node.

What I did today was stop starwind-virtual-san on the 2nd node, restart starwind-virtual-san on the first node, mark the first node as synchronized using the PowerShell module (otherwise it is unusable by the ESXi hosts), then start starwind-virtual-san on the 2nd node, which triggered a full resync on the 2nd node.
Restarting the VM would have done nothing to help, because when the VM came back online it would see that it was not synchronized, and I would possibly end up worse off: it wouldn't be able to communicate with the other node, because the management service would be dead on the other node as well.

Again, I am not here for advice on how to deal with this issue because I already have a handle on it. I am posting here because this seems like a major flaw that you guys probably want to fix.
The management service is unreachable (netstat -nl shows that no service is listening on 3261/tcp) until the service starwind-virtual-san is restarted. The management service seems to stop listening on that port on both nodes at the same time (probably because the scanner is doing parallel scans).
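For reference, the same check can be made from another machine. A minimal Python sketch, assuming a plain TCP connect is a fair stand-in for "is anything listening"; the function name and the host in the usage comment are hypothetical. Unlike netstat -nl on the node itself, a failed connect from a remote host can also mean a firewall is in the way.

```python
import socket


def accepts_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False


# Hypothetical usage (replace the address with a real node):
#   accepts_tcp("192.0.2.10", 3261)  # management port
#   accepts_tcp("192.0.2.10", 3260)  # iSCSI port
```

Running it against both nodes during and after a scan would show whether 3261 stops answering while 3260 keeps working.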
yaroslav (staff)
Staff
Posts: 3427
Joined: Mon Nov 18, 2019 11:11 am

Tue Jul 02, 2024 12:04 pm

Thanks for your update.
3261 is not used for synchronization or iSCSI; it is used only for management. Interrupting the management link alone does not lead to the issue, unless of course all the links were interrupted at the same time.
"What I did today was stop starwind-virtual-san on the 2nd node, restart starwind-virtual-san on the first node, using the powershell"
This is a mishandled restart procedure.
"I am posting here because this seems like a major flaw that you guys probably want to fix."
StarWind VSAN should be able to reconnect after an interruption over 3261; if it does not, I need to look into it. That's why I need to collect the logs from the system and examine them in detail, and why I asked you to reach out to support.
Please reach out to us to let our dev team look into this incident in detail.
alexandru-bagu
Posts: 6
Joined: Tue Jul 02, 2024 8:55 am

Tue Jul 02, 2024 12:28 pm

I will reach out to the support email momentarily.

Like I said, the nodes were in sync even though the management TCP service was not available. My issue was that the CVM was reporting that the nodes were out of sync (hence limited availability), I assume because they couldn't communicate with each other to say otherwise.
If it does happen again I will try restarting just one of the nodes to see if it shows as synchronized afterward.
yaroslav (staff)
Staff
Posts: 3427
Joined: Mon Nov 18, 2019 11:11 am

Tue Jul 02, 2024 12:48 pm

Let us resume cooperation in the support ticket. Thanks.
asshley675
Posts: 5
Joined: Tue Jul 02, 2024 12:53 pm
Location: Thailand

Wed Jul 03, 2024 6:00 am

Try adjusting scan settings to reduce load on your ESXi hosts during critical times. Also, ensure proper CPU allocation and monitor for any network or firewall impacts. Document issues for troubleshooting, and consider consulting VMware or Greenbone support for tailored advice. This approach should help maintain stability in your setup.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Thu Sep 12, 2024 6:09 pm

Any news OP can share with us?
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software
