Couple of troubleshooting questions

elaw
Posts: 14
Joined: Thu Jul 18, 2013 1:59 pm
Location: Bedford, MA

Thu Jul 18, 2013 2:10 pm

Hello!

I'm evaluating Native SAN, have it set up in a test environment, and it's generally working. But I'm sporadically getting "all heartbeat connections with the partner node...were lost" events in the log. It only seems to happen once every few hours, and the connection always comes back within a few seconds.

My first question: is there any way to change which networks are used for heartbeat and sync once a device is set up? I'd like to see if this is related to the specific network being used for the heartbeat channel.

Next: any other general tips for troubleshooting this? The heartbeat channel runs over a gigabit ethernet network shared with a bunch of other stuff (that's why I'd like to try changing it), but I'm also running the free version of SAN and NAS on two different servers on this same network and they don't experience this problem.

I've tried doing some throughput and ping tests on the network and everything seems fine. Throughput averages > 950 Mbps, and latency is always < 1 ms. I can see where heavy network traffic might cause this kind of issue, but one of the events happened at 3:41 this morning when nobody was here and nothing was going on, so that explanation doesn't really fit.
btoups2
Posts: 6
Joined: Wed Apr 10, 2013 1:59 am

Thu Jul 18, 2013 2:27 pm

We had that issue, not exactly sure what the cause was, but we performed the following steps.

On Every NIC
1. Firmware update
2. Driver Update
3. Verify all advanced settings (Offload and such)

Get this sorted out before you go into production as it caused VM corruption for us.
elaw
Posts: 14
Joined: Thu Jul 18, 2013 1:59 pm
Location: Bedford, MA

Fri Jul 19, 2013 12:08 pm

Thanks!

I updated firmware and drivers but the problem continues. Any suggestions on what advanced settings to look at, and what they should be set to? This is a brand-new box so everything's set to default. The NIC is an Intel I350t if it matters.
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Tue Jul 23, 2013 3:14 pm

Hello
elaw wrote:
My first question: is there any way to change which networks are used for heartbeat and sync once a device is set up? I'd like to see if this is related to the specific network being used for the heartbeat channel.

Yes, it is possible. This question is answered here.

elaw wrote:
Next: any other general tips for troubleshooting this? The heartbeat channel runs over a gigabit ethernet network shared with a bunch of other stuff (that's why I'd like to try changing it), but I'm also running the free version of SAN and NAS on two different servers on this same network and they don't experience this problem.

Do you mean that you are running two SAN software products on the same machines?

Can I also ask you to provide the community with a detailed network diagram of your system and the build number that you are running?

Thanks
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
elaw
Posts: 14
Joined: Thu Jul 18, 2013 1:59 pm
Location: Bedford, MA

Tue Jul 23, 2013 4:49 pm

We've actually got a webex set up with someone this afternoon so hopefully we'll get this resolved then. But to answer your questions...

The software is not installed on the same machines. We've got SAN and NAS installed on a couple of Dell NAS boxes that run Windows 2008, with one HA drive set up between the two. That was our initial test setup - the HA drive is accessed by two Windows 2012 VMs set up in a cluster. And the whole setup works fine.

The "problem setup" is two different servers, both identical and brand-new (as in, there's nothing else on them) and running Windows 2012. Native SAN is set up on them, and we're going to be using them as failover-clustered Hyper-V servers. We've set up one VM on them as a test, and other than the intermittent heartbeat loss this setup is working well too.

I don't have a network diagram but it's pretty simple so I'll try to describe it verbally. Each of the problem servers has 4 NIC ports. Teaming is not used.

We have two adjacent buildings that are connected to each other with 1-gigabit fibers. In each building, there is a stack of network switches for "general use" connecting both servers and workstations together, and a network switch for servers only. The "general use" stacks in each building are connected to each other with fiber, and the "servers only" switches in each building are connected to each other with their own fibers.

The management IPs on the servers (10.1.10.13 and 10.2.10.2) are used for the heartbeat and are connected to the "general use" switches. A dedicated subnet for just those two servers (10.192.3.0/24) is used for synchronization. It uses separate NIC ports on each server and is connected to the "servers only" switches.
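
For anyone wanting to double-check a layout like this, here is a minimal sketch that confirms the heartbeat addresses sit outside the synchronization subnet. The addresses are the ones given above; the helper name `in_sync_subnet` and the sample address 10.192.3.5 are made up for illustration.

```python
import ipaddress

# Addresses taken from the description above.
HEARTBEAT_IPS = ["10.1.10.13", "10.2.10.2"]
SYNC_SUBNET = ipaddress.ip_network("10.192.3.0/24")


def in_sync_subnet(ip, subnet=SYNC_SUBNET):
    """Return True if the address falls inside the synchronization subnet."""
    return ipaddress.ip_address(ip) in subnet


for ip in HEARTBEAT_IPS:
    # Both heartbeat addresses are expected to be outside the sync subnet.
    print(ip, "in sync subnet:", in_sync_subnet(ip))

# 10.192.3.5 is a hypothetical host address inside the sync subnet.
print("10.192.3.5 in sync subnet:", in_sync_subnet("10.192.3.5"))
```

This only checks addressing. It says nothing about which physical switches or fibers the traffic actually crosses.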

I should add that our original test setup, the one with SAN and NAS on the other two servers, is set up exactly the same way and is connected to the same network switches. That setup was running, and continues to run, with no problems.

Oh and on the build number, this is what I see in the log: "StarWind iSCSI SAN v6.0.0 (Build 20130527, [SwSAN], Win64)". Looking in the log from one of the servers that's not experiencing the problem, I see the same except the build number is 20130131.
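
If there are several servers to compare, the build number can be pulled out of that log line with a short script. This is only a sketch: the regular expression is fitted to the one banner line quoted above, the second banner is reconstructed from the "same except the build number" remark, and `parse_banner` is a name chosen for illustration.

```python
import re

# Pattern fitted to the banner line quoted above; other log formats may differ.
BANNER_RE = re.compile(r"v(?P<version>[\d.]+) \(Build (?P<build>\d{8})")


def parse_banner(line):
    """Return (version, build) from a banner line, or None if it does not match."""
    match = BANNER_RE.search(line)
    if match is None:
        return None
    return match.group("version"), int(match.group("build"))


problem = parse_banner("StarWind iSCSI SAN v6.0.0 (Build 20130527, [SwSAN], Win64)")
# Reconstructed: the post says this line is the same apart from the build number.
working = parse_banner("StarWind iSCSI SAN v6.0.0 (Build 20130131, [SwSAN], Win64)")

print("problem pair:", problem)
print("working pair:", working)
print("builds differ:", problem[1] != working[1])
```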
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Jul 29, 2013 10:28 am

Thanks for the provided information.

Well, since you had a remote session with one of our tech engineers, I think your system should be performing well now. Can you confirm that?

Thanks
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
elaw
Posts: 14
Joined: Thu Jul 18, 2013 1:59 pm
Location: Bedford, MA

Mon Jul 29, 2013 11:53 am

Well... I think it's a little early to say.

As of yesterday morning, nothing we had tried worked. They seem convinced it's a network connectivity issue, but we tried updating adapter firmware & drivers, turning off all power-saving settings, and changing all the cables with no effect. The network switches report 0 errors on the ports that are being used for that function. I wrote a batch file that I ran on one server that repeatedly pinged the other one, and out of 86,000 pings it only failed once! I ran the same file on one of the two servers that's not having issues (pinging the other one in its pair) and it failed at a much higher rate: 57 times out of 270,000 pings.
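
The original batch file isn't shown, so here is a rough Python equivalent of that kind of test, plus the loss rates worked out from the numbers above. It assumes the Windows `ping` command (`-n` for count, `-w` for timeout in milliseconds). The function names are made up for illustration.

```python
import subprocess


def loss_rate(failed, total):
    """Fraction of pings that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def ping_once(host, timeout_ms=1000):
    """Send one echo request with the Windows ping command.

    Treats exit code 0 as success. Caveat: Windows ping may also exit 0 on a
    "Destination host unreachable" reply, so checking the output text would
    be stricter.
    """
    result = subprocess.run(
        ["ping", "-n", "1", "-w", str(timeout_ms), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def count_failures(host, attempts):
    """Ping the host repeatedly and return how many attempts failed."""
    return sum(0 if ping_once(host) else 1 for _ in range(attempts))


# Figures reported above: 1 failure in 86,000 and 57 failures in 270,000.
problem_pair = loss_rate(1, 86_000)
healthy_pair = loss_rate(57, 270_000)
print(f"problem pair loss: {problem_pair:.4%}")
print(f"healthy pair loss: {healthy_pair:.4%}")
```

By these figures the pair without heartbeat errors actually drops pings roughly 18 times as often as the problem pair, so plain ICMP loss does not explain the difference.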

Anyhow, yesterday I finally decided to change the heartbeat to go over a different connection between the two problem servers. The process was a bit of a headache... following the procedure in the article you linked, it never asked me what kind of synchronization to do, it just did a full one. So I got to sit there and watch it sync for several hours. But since then no errors have been reported in the log. If that remains the case for another day or two, I'll consider it fixed.
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Wed Jul 31, 2013 11:14 am

Hi Elaw,
Thank you for keeping us updated.
The full sync is the most painful part of the procedure and we'll eliminate it in the upcoming v8.
As for the NIC problems, we had a few heartbeat bugs submitted and fixed within the last two weeks. These were quite hard to identify.
The systems, however, were close to the ones you've described.
I will inform you as soon as we have an official release with these fixes, in case the problem returns.
Max Kolomyeytsev
StarWind Software
elaw
Posts: 14
Joined: Thu Jul 18, 2013 1:59 pm
Location: Bedford, MA

Wed Jul 31, 2013 11:41 am

Great! When the new release comes out, what's involved in updating? Just install it on top of what I already have?

Thinking back a little, I'm realizing the heartbeat isn't the only glitch I've been seeing. If I leave the StarWind management console open, once in a while, very infrequently (like every few hours), it'll say it's lost its connection to the server. If I wait a few seconds and click "connect", it'll reconnect with no trouble. So it's behaving in a similar fashion to the heartbeat connection.

In those cases I know it's not a problem with the NIC, because I'm connected to the server via RDP over the same NIC and RDP doesn't lose its connection.

I don't know if it's a factor or just a coincidence, but the problem NIC is the same one carrying the "outside world" interface to the cluster.

And just an update: since I've moved the heartbeat connection to the other NIC, there are still no errors. Yay! :D
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Mon Aug 05, 2013 10:36 am

That's quite interesting. I can only think of interference from other activity on that network.
By the way, do you have any additional meters other than RDP?
In my experience, RDP has survived a network cable disconnect of several seconds without even greying out the session window.
It's definitely something I'll discuss with our R&D.
Will there be any chance to ask you for some assistance with diagnosing connection loss in your scenario?
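
As one possible extra meter alongside RDP, a small TCP probe can record when new connections to the server fail. This is just a sketch, not a StarWind tool: the names `tcp_probe` and `monitor` are made up, and the port you probe is your choice (3261 in the usage note below is an assumption about the management port, so substitute whatever your service listens on).

```python
import socket
import time
from datetime import datetime


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def monitor(host, port, interval=1.0, rounds=60):
    """Probe repeatedly and return the timestamps of the failed attempts."""
    failures = []
    for _ in range(rounds):
        if not tcp_probe(host, port):
            failures.append(datetime.now())
        time.sleep(interval)
    return failures
```

For example, `monitor("10.1.10.13", 3261, interval=1.0, rounds=3600)` would probe once a second for about an hour. Each probe opens a fresh connection, so it catches short outages that an already-established session such as RDP might ride through.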
Max Kolomyeytsev
StarWind Software
elaw
Posts: 14
Joined: Thu Jul 18, 2013 1:59 pm
Location: Bedford, MA

Mon Aug 05, 2013 12:10 pm

I'd be happy to do anything I can with diagnosis, as long as it's not too intrusive on the server, as we're now using that machine in production.

The one big thing I can think of that's different between the two pairs of servers (other than that one is running SAN and NAS and the other is running Native SAN, and the StarWind build numbers are slightly different) is that the problem pair is clustered, and the network interface in question is used for cluster management, whereas the other two servers are not clustered. So I wonder if the problem might be related to the virtualized-IP weirdness that goes with clustering.
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Mon Aug 12, 2013 9:58 am

We have scheduled to reproduce the issue in-house. I will keep you updated.
Max Kolomyeytsev
StarWind Software