Odd CVM Appliance Issue



mkilgore
Posts: 4
Joined: Tue Oct 22, 2024 7:21 pm

Tue Oct 22, 2024 7:28 pm

I have two CVM nodes configured according to the Hyper-V Failover Cluster CVM documentation. Everything worked great for about a week, but then, due to a network issue, one of the appliances got a bad IP via DHCP (instead of its assigned reservation). I resolved it by setting the IPs to static, which seems to have mostly worked.

The issue, however, is that one of the CVMs sees both appliances as online, working, and functioning (we'll call it CVM01), while the other (CVM02) sees only itself as online. To make things even weirder, the LUN shows everything as online and working correctly, and if a CVM goes offline it properly shows that it needs to sync.

To make matters worse, when CVM01 goes offline (during cluster-aware Windows Updates), the failover server bombs out on the storage and both virtual disks disappear, as if CVM02 weren't available for connection at all (despite being online).

I've double-checked that SR-IOV is disabled on the Hyper-V interfaces, and both CVM appliances can ping each other properly on all interfaces. I've also double- and triple-checked the MPIO and iSCSI settings on the Windows host machines. I'm just kind of lost at this point.
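The checks I ran boil down to roughly the following on each Hyper-V host (a sketch using the in-box NetAdapter, iSCSI, and MPIO tooling; mpclaim ships with the MPIO feature):

Code:

# Confirm SR-IOV is off on the physical NICs
Get-NetAdapterSriov | Select-Object Name, Enabled

# Review the iSCSI portals and sessions the host knows about
Get-IscsiTargetPortal | Select-Object TargetPortalAddress, TargetPortalPortNumber
Get-IscsiSession | Select-Object TargetNodeAddress, IsConnected, IsPersistent

# Summarize MPIO paths per disk
mpclaim -s -d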

Also interesting: CVM01 has 4 sessions on each LUN, while CVM02 has just 2 sessions on each LUN, despite Windows MPIO on the device showing a connection to both of them equally.
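Comparing the per-target session counts is a one-liner (a sketch; run on each host):

Code:

# Count iSCSI sessions per target node address
Get-IscsiSession | Group-Object TargetNodeAddress | Select-Object Name, Count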

This is a personal cluster on the free license. The trial has expired, so there is no access to the WebUI functionality.
yaroslav (staff)
Staff
Posts: 3424
Joined: Mon Nov 18, 2019 11:11 am

Tue Oct 22, 2024 8:17 pm

Thanks for sharing your story, and sorry to read about these issues.
The trial license takes the storage down when it expires.
What you experienced points to one of the nodes not being synchronized: once the only active node went down, the storage disappeared. There's nothing strange here.
A few suggestions: disable SR-IOV, make sure to use static IPs, and have a redundant network configuration.
See more here: https://www.starwindsoftware.com/system-requirements and https://www.starwindsoftware.com/best-p ... practices/.
I'd suggest starting by reviewing the IPs and assigning the ones you used for DATA and REPLICATION during the original setup.

See also the restart procedure https://knowledgebase.starwindsoftware. ... installed/.
mkilgore
Posts: 4
Joined: Tue Oct 22, 2024 7:21 pm

Tue Oct 22, 2024 8:32 pm

I should note that the free license (not the trial) has been assigned for a while now.

SR-IOV is disabled (and has been since the start). The management side is now configured with static IPs (since the DHCP reservation issue first appeared), with redundant switches and NICs, and the Data and Sync interfaces have always been static.

I've restarted both nodes now, and there is no change: CVM02 still shows CVM01 as offline, while CVM01 shows both nodes as online.

For the IP addresses, the Mgmt IPs are 172.31.125.155 for CVM01 and 172.31.125.159 for CVM02.

For the Data side, CVM01 lives on 172.16.10.4, while CVM02 lives on 172.16.10.5 (the hosts are .1 and .2).

For the Sync side, CVM01 lives on 172.16.20.4, while CVM02 lives on 172.16.20.5 (the hosts are .1 and .2).

Both the Data and Sync interfaces are connected directly to each other via direct-attach copper cables, and everything pings properly (the hosts to each other, to CVM01 and CVM02, and so on) across the Data, Sync, and Mgmt NICs.
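A quick sweep like this confirms reachability from each host (a sketch using the IPs listed above):

Code:

# Ping every CVM interface from this host
$targets = @(
    '172.31.125.155', '172.31.125.159'   # Mgmt: CVM01, CVM02
    '172.16.10.4',    '172.16.10.5'      # Data: CVM01, CVM02
    '172.16.20.4',    '172.16.20.5'      # Sync: CVM01, CVM02
)
foreach ($ip in $targets) {
    '{0}  reachable={1}' -f $ip, (Test-Connection -ComputerName $ip -Count 2 -Quiet)
}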

For additional information, I used this guide (which was emailed with the free license) to configure everything, and I followed it very closely, with only minor deviations where needed for my environment: https://www.starwindsoftware.com/resour ... shell-cli/
yaroslav (staff)
Staff
Posts: 3424
Joined: Mon Nov 18, 2019 11:11 am

Tue Oct 22, 2024 10:16 pm

Please pull the support bundles from both nodes; I will ask the dev team to look into it. Do you have the date of the incident? Also, please log a call with us at support@starwind.com, using 1232570 and this thread link as your references.
mkilgore
Posts: 4
Joined: Tue Oct 22, 2024 7:21 pm

Wed Oct 23, 2024 1:25 pm

I have reached out to support as per your message.

This started roughly in August, shortly after I spun up the instance, but I haven't had time to do any real debugging until this week due to other obligations.
yaroslav (staff)
Staff
Posts: 3424
Joined: Mon Nov 18, 2019 11:11 am

Wed Oct 23, 2024 1:57 pm

Did you try restarting the affected node?
mkilgore
Posts: 4
Joined: Tue Oct 22, 2024 7:21 pm

Wed Oct 23, 2024 2:08 pm

Both nodes have been restarted a few times since the issues started, and yesterday they also rebooted as part of an update.
alexandru-bagu
Posts: 6
Joined: Tue Jul 02, 2024 8:55 am

Sat Dec 07, 2024 4:40 pm

Did you find a solution for this issue?

I am running into the same issue right after updating the CVMs.

*EDIT*

I had a look at the logs of the management web app and I can see the following:

Code:

2024-12-07 16:47:31.787 00017 DEBUG    | [RemoteHostClient]: Send GET request: 'https://10.229.0.20/api/v1/pools?nodeId=19759F2C-2738-11EF-81AC-2F66235916ED'
2024-12-07 16:47:31.790 00017 DEBUG    | [RemoteHostClient]: GET response: Length: 0. [Satus: Unauthorized, Error: '', StatusDescription: Unauthorized, Encoding: ]
2024-12-07 16:47:31.790 00017 DEBUG    | [RemoteHostClient]: GET Request time spend: 0 sec
2024-12-07 16:47:31.790 00017 INFO     | [RemoteHostClient]: Throw respose exception:
2024-12-07 16:47:31.790 00017 DEBUG    | [RemoteHostClient]: Response exception message: Request failed with status code Unauthorized
2024-12-07 16:47:31.790 00017 ERROR    | [RemoteStorageAppliance]: Receive remote storage pools error: Request failed with status code Unauthorized
2024-12-07 16:47:31.790 00017 DEBUG    | [RemoteStorageAppliance]: Stack trace:    at RemoteNodeClient.RemoteHostClient.ThrowResultException(RestResponse response)
   at RemoteNodeClient.RemoteHostClient.HandleResponseErrors(RestResponse response)
   at RemoteNodeClient.RemoteHostClient.ExecuteRequestAsync[T](Func`1 request)
   at RemoteNodeClient.PartnerNodeClient.GetStoragePoolsAsync(String nodeId)
   at StarStack.Storage.Management.Storage.RemoteStorageAppliance.GetStoragePools(ICommandContext context)
It looks like 10.229.0.21 (the CVM that has the issue) is unable to query the API of 10.229.0.20 (the one that has no issues).
Is there a way to force a reauthentication?
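One way to reproduce the failure outside the web app is to replay the GET from the log (a sketch: the URL is taken verbatim from the log above, and -SkipCertificateCheck assumes PowerShell 7+ because of the appliance's self-signed certificate):

Code:

# Replay the partner-API call the management app is making;
# a stale inter-node credential should surface as 401 Unauthorized
$uri = 'https://10.229.0.20/api/v1/pools?nodeId=19759F2C-2738-11EF-81AC-2F66235916ED'
try {
    Invoke-RestMethod -Uri $uri -SkipCertificateCheck
} catch {
    $_.Exception.Message
}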
yaroslav (staff)
Staff
Posts: 3424
Joined: Mon Nov 18, 2019 11:11 am

Sat Dec 07, 2024 6:45 pm

Posting the follow-up.
PROBLEM
One node reports the other as down, while the node reported as "offline" sees both nodes as ONLINE.
This is a known problem; it can sometimes happen to a CVM during configuration or an update.

FIX
Remove and re-add the node labeled as OFFLINE from the Appliances menu.

CHANGES MADE
Removed a redundant connection from the favorite targets (host-side sketch below).
Removed the OFFLINE node from the Appliances view of the affected CVM and re-added it.
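On the Windows hosts, the favorite-target part can be checked and cleaned with the in-box iSCSI cmdlets (a sketch; the session identifier is a placeholder, and the OFFLINE-node re-add itself is done in the CVM web UI):

Code:

# List persistent (favorite) iSCSI sessions and look for duplicates per target
Get-IscsiSession | Where-Object IsPersistent |
    Select-Object SessionIdentifier, TargetNodeAddress

# Remove the redundant favorite so it is not re-established at boot
# (replace the identifier with the duplicate found above)
Unregister-IscsiSession -SessionIdentifier 'fffffa800d008430-4000013700000004'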