Updating NIC firmware/drivers

Software-based VM-centric and flash-friendly VM storage + free version


D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Fri Aug 21, 2015 1:37 pm

Recently I had an exciting morning when I updated the firmware and drivers for the NICs on one of my Nodes in a two-node cluster. Before I started I failed all the Roles over to the other Node, so I thought I was safe. Then I had to restart the Node several times as part of the update process. By the time I was done several Roles had failed, saying they couldn't access their discs. The Roles had apparently failed more than once within 6 hours and been left stopped, as per the failure policy that's set up (the default).

For example, I have a Role, let's call it Server1. Server1 has a preferred owner of Node1, then Node2. Storage for Server1 is on CSV1, which is in Starwind and synchronised between Node1 and Node2. There's also a CSV2 - Roles with their storage on CSV1 failed, Roles with storage on CSV2 didn't.

I had assumed that if a Role is moved from Node1 to Node2 then I can safely restart Node1, as storage requests for Server1 will be handled by the copy of the CSV it uses for storage on Node2.

Am I doing it wrong? How can I tell? I spent a long time on the plan before I set it all up, I'm pretty sure it's right and normally I can install updates, failover a node, restart it and fail it back without problems. Repeatedly restarting a node seems to have upset the system somehow and I'd like to understand why.

I can imagine you might need more information, let me know. I think once I get what happened I'll have learnt something important... :)
darklight
Posts: 185
Joined: Tue Jun 02, 2015 2:04 pm

Tue Aug 25, 2015 9:11 am

Hi, may I ask how much time passed between your restarts?
Restarting too quickly will definitely cause problems with the Windows clustering service, which then doesn't start normally (and neither does the cluster replication between nodes), and that can lead to abnormal behaviour.
D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Tue Aug 25, 2015 1:08 pm

It was several times in one morning. It sounds like I need to space out my updates if I try to do it again.
Rajesh.Rangarajan
Posts: 6
Joined: Tue Jun 23, 2015 4:43 pm

Mon Aug 31, 2015 9:54 am

D9BDCEA2CE wrote:It sounds like I need to space out my updates if I try to do it again.
- Definitely a good idea.

I would also recommend putting the StarWind service into "manual" mode on the host where you are performing multiple restarts. This may help you avoid a full synchronization once the maintenance is finished.
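For reference, that suggestion can be carried out from an elevated PowerShell prompt on the node under maintenance. A minimal sketch; the service name "StarWindService" is an assumption, so look it up with Get-Service first, as the actual name may differ on your install:

```powershell
# Find the StarWind service first - the exact service name may differ per install
Get-Service | Where-Object { $_.DisplayName -like "*StarWind*" }

# Switch it to manual start and stop it before the maintenance restarts
Set-Service -Name "StarWindService" -StartupType Manual
Stop-Service -Name "StarWindService"

# After maintenance: restore automatic start and bring the service back up
Set-Service -Name "StarWindService" -StartupType Automatic
Start-Service -Name "StarWindService"
```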
Vladislav (Staff)
Staff
Posts: 180
Joined: Fri Feb 27, 2015 4:31 pm

Wed Sep 16, 2015 8:57 pm

Thanks for your help Rajesh and darklight.

D9BDCEA2CE, do you have any questions?
D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Wed Sep 23, 2015 10:37 am

Hmmm...

Today I decided to update the NIC drivers on one of my nodes - I like to keep firmware and drivers up to date where possible. I left it all alone after the last incident but I've been aware the two nodes were using different drivers and felt I should resolve it. I planned to space the updates out as recommended but this morning it didn't even involve a restart.

I paused one node and failed the roles to the other node. That went fine. Then I upgraded the NIC drivers on the paused node, including the ones responsible for the Starwind connection between the nodes. Then I resumed the paused node and failed the roles back - two roles had lost contact with their discs and one seemed OK but failed when I tried to interact with it. I was able to get them going again by shutting them down and starting them from 'Off' but obviously that's not optimal.

I've checked Critical Events for the roles concerned but there's nothing there.

I do see these events on the node which was running:
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv2, all heartbeat Connections with Partner Node iqn.2008-08.com.starwindsoftware:cls02-csv2 lost
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv2, current Node passed to "Not synchronized" State, Reason is Synchronization Connections with Partner iqn.2008-08.com.starwindsoftware:cls02-csv2 were lost
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv2, current Node State has changed to "Not synchronized"
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-witness1, all heartbeat Connections with Partner Node iqn.2008-08.com.starwindsoftware:cls02-witness1 lost

These events are repeated; they occurred while I was updating the drivers on the other node (CLS02). I had assumed, perhaps incorrectly, that the data would be retrieved automatically from the working node when the one I was updating became unavailable.

Does this sound familiar to anyone? I'm about to start picking through exactly what I've done so I can see if I can figure it out.

Please let me know if I've failed to explain anything clearly, I'm learning this stuff as I go along.
D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Wed Sep 23, 2015 11:11 am

More info:

These all happened at about 08:37
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv1, all heartbeat Connections with Partner Node iqn.2008-08.com.starwindsoftware:cls02-csv1 lost
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv1, current Node passed to "Not synchronized" State, Reason is Synchronization Connections with Partner iqn.2008-08.com.starwindsoftware:cls02-csv1 were lost
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv1, current Node State has changed to "Not synchronized"
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv2, all heartbeat Connections with Partner Node iqn.2008-08.com.starwindsoftware:cls02-csv2 lost
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv2, current Node passed to "Not synchronized" State, Reason is Synchronization Connections with Partner iqn.2008-08.com.starwindsoftware:cls02-csv2 were lost
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-csv2, current Node State has changed to "Not synchronized"
High Availability Device iqn.2008-08.com.starwindsoftware:cls01-witness1, all heartbeat Connections with Partner Node iqn.2008-08.com.starwindsoftware:cls02-witness1 lost

Then at about 08:38 the Failover Clustering system says
Cluster node 'CLS02' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Cluster Shared Volume 'CSV1' ('Cluster Disk 1') has entered a paused state because of '(80000011)'. All I/O will temporarily be queued until a path to the volume is reestablished.
Cluster Shared Volume 'CSV1' ('Cluster Disk 1') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.

I have my notes for the exam I'm working on (70-414) and my record of what I did setting it all up so I'll dismantle what's going on - you never know, I may find the answer myself, I'll make notes here as I go along.
Last edited by D9BDCEA2CE on Wed Sep 23, 2015 2:18 pm, edited 1 time in total.
D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Wed Sep 23, 2015 1:29 pm

Each cluster node has five discs: two are physical and three are listed as STARWIND STARWIND Multi-Path Disk Device.

They correspond to CSV1, CSV2 and Witness1.

On Node 1: CSV1 is Offline, CSV2 is online, Witness1 is online.
On Node 2: CSV1 is Online, CSV2 is offline, Witness1 is offline.

I think that means that for each Starwind disc there's only one version online. Is that right? I'm thinking that there's one live copy and the other is synchronised until it's needed.

This morning I interfered with Node 2, the roles which had trouble store their data in CSV1. Next question, why didn't CSV1 just use the copy on Node 1?
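To see how the cluster views each CSV and its underlying paths, these built-in commands may help (a sketch; mpclaim ships with the Windows MPIO feature, and Get-ClusterSharedVolumeState requires Server 2012 R2 or later):

```powershell
# Which node owns each Cluster Shared Volume right now
Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State

# Per-node view of each CSV (direct vs. redirected I/O), Server 2012 R2+
Get-ClusterSharedVolume | Get-ClusterSharedVolumeState

# MPIO path summary for all multi-path disks
mpclaim -s -d
```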
D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Wed Sep 23, 2015 1:34 pm

Reading back through this post (see Rajesh's comment), I'm beginning to wonder - if I had stopped the Starwind Virtual SAN service on Node2 before I did the maintenance, would Starwind have brought the copy of CSV1 it has on Node1 online and used that until I started the service on Node2 again?

That would seem to be a sensible response given what the software is supposed to be doing. I'm just confused, because what I'm seeing makes me doubt that if a node failed, the other node would be able to take over properly. Why doesn't the working node just take over?

I'm not convinced this is right:
On Node 1: CSV1 is Offline, CSV2 is online, Witness1 is online.
On Node 2: CSV1 is Online, CSV2 is offline, Witness1 is offline.

However MPIO shows three active paths to the CSVs, like this one:
MPIO Disk1: 03 Paths, Fail Over Only, Symmetric Access
SN: 56AA055973E343B5
Supported Load Balance Policies: FOO RR RRWS LQD WP LB

Path ID State SCSI Address Weight
---------------------------------------------------------------------------
0000000077010001 Active/Optimized 001|000|001|000 0
TPG_State: Active/Optimized , TPG_Id: 2, TP_Id: 2
Adapter: Microsoft iSCSI Initiator... (B|D|F: 000|000|000)
Controller: 46616B65436F6E74726F6C6C6572 (State: Active)

0000000077010006 Standby 001|000|006|000 0
TPG_State: Active/Optimized , TPG_Id: 1, TP_Id: 1
Adapter: Microsoft iSCSI Initiator... (B|D|F: 000|000|000)
Controller: 46616B65436F6E74726F6C6C6572 (State: Active)

0000000077010005 Standby 001|000|005|000 0
TPG_State: Active/Optimized , TPG_Id: 1, TP_Id: 1
Adapter: Microsoft iSCSI Initiator... (B|D|F: 000|000|000)
Controller: 46616B65436F6E74726F6C6C6572 (State: Active)

I suppose what I'm getting at is - why is what I'm doing here different from what happens when I drain the roles from a node and restart it to install updates?
Rajesh.Rangarajan
Posts: 6
Joined: Tue Jun 23, 2015 4:43 pm

Mon Sep 28, 2015 12:01 pm

Hello,

Did you try restarting the StarWind service after updating the NIC drivers? StarWind "binds" to the available network adapters after each restart, and there is a high chance that it simply "lost" the NICs when you updated the drivers.
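If the binding was lost, restarting the service should let it re-enumerate the adapters. A sketch, again assuming the service name "StarWindService" (check with Get-Service, it may differ):

```powershell
# Restart the StarWind service so it re-binds to the updated NICs
Restart-Service -Name "StarWindService"

# Confirm it came back up
Get-Service -Name "StarWindService"
```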
Vladislav (Staff)
Staff
Posts: 180
Joined: Fri Feb 27, 2015 4:31 pm

Thu Nov 05, 2015 10:16 pm

Thank you, Rajesh!

Anything else we can help with here? :)
D9BDCEA2CE
Posts: 25
Joined: Tue Dec 09, 2014 1:51 pm

Tue May 17, 2016 2:07 pm

It's been a while but I find myself hunting around the internet and coming across answers from a while ago so I'll finish the story...

What I wasn't doing - and what Rajesh was directing me towards, although I didn't understand at the time - was stopping the Starwind service before I fiddled about with things. It doesn't like being disconnected and reconnected too often in a given period. A single restart for updates is OK, but updating drivers and firmware on NICs tends to make the connections flick on and off quite a lot.

I have just, over a few days, replaced all the network drivers on both my nodes without incident.

On Node 1 I first paused the cluster node and drained the roles. Then I set the Starwind services to manual and stopped them. This means everything is relying on Node 2. Then I carefully noted the network configuration so I could rebuild it later. On Node 1 I was able to replace the network drivers and restart the machine without disturbing any of the virtual machines. Once the work was done and I restored the network configuration I restarted and reconfigured the Starwind services, waited for them to show properly in the console, then resumed the cluster node and passed the roles back.
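For anyone repeating this, the sequence above maps onto PowerShell roughly as follows. This is a sketch, not a tested script; the node name "Node1" and the service name "StarWindService" are assumptions you should verify for your environment:

```powershell
# 1. Pause the cluster node and drain the roles to the partner node
Suspend-ClusterNode -Name "Node1" -Drain

# 2. Stop StarWind and set it to manual so repeated restarts
#    don't flap the synchronization channel
Set-Service -Name "StarWindService" -StartupType Manual
Stop-Service -Name "StarWindService"

# 3. Record the network configuration so it can be restored afterwards
Get-NetAdapter | Format-Table Name, InterfaceDescription, MacAddress
Get-NetIPAddress | Format-Table InterfaceAlias, IPAddress, PrefixLength

# ... update NIC drivers/firmware, restarting as many times as needed ...

# 4. Restore the service, wait for StarWind to resynchronise in the
#    console, then resume the node and fail the roles back
Set-Service -Name "StarWindService" -StartupType Automatic
Start-Service -Name "StarWindService"
Resume-ClusterNode -Name "Node1" -Failback Immediate
```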

Having done a lot more reading in the time since my previous comments I left everything alone for a couple of days so I didn't set off any alarms about too many failures in a given period, then I repeated the whole process for the other node.

All my hosted machines worked smoothly throughout the process. Excellent.

Thank you for your help folks.
Dmitry (staff)
Staff
Posts: 82
Joined: Fri Mar 18, 2016 11:46 am

Wed May 18, 2016 9:42 am

Thanks a lot for your story, it can be very helpful for the whole community.
We really appreciate your contribution to our community.