After the time change, sync channel dropped and iscsi connec

dmansfield
Posts: 14
Joined: Sun Mar 14, 2010 8:34 pm

Mon Mar 14, 2011 6:14 pm

Good afternoon. Moments after the time change on Sunday, our 10 Gigabit sync channel dropped and our iSCSI connections were lost. Errors compounded from there and both servers self-instructed a reboot. Because both servers rebooted at approximately the same time, both of our HA datastores were thrown out of sync, and we were forced into an 8+ hour recovery. We have two StarWind HA Enterprise servers running primarily HA datastores. The servers are Dell T710s running Windows Server 2008 R2. We contacted Dell; they scoured our log files and found no hardware errors. Please help us determine the cause of Sunday's issues so that we can prevent a recurrence. Thank you.
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Mar 14, 2011 7:11 pm

1) There must be a reason the connection dropped. Either the Windows Event Log or the StarWind log should contain entries about this event. Could you check both, please?

2) Who instructed both servers to reboot? StarWind doesn't do this, so who was it?

Once we know why the connection was lost and what instructed the machines to reboot, we can determine what is at fault and how to keep this from happening again. Thanks!
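
One way to narrow the Windows Event Log check to the minutes around the time change is a time-bounded query for the built-in wevtutil tool. The sketch below only builds and prints such a command; the function names are made up for illustration, and the date and UTC offset are assumptions, since the thread does not state the servers' time zone.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def build_window_query(center_utc, minutes=30):
    """Build an XPath filter for critical, error and warning events
    within +/- `minutes` of a timezone-aware datetime."""
    center_utc = center_utc.astimezone(timezone.utc)
    start = center_utc - timedelta(minutes=minutes)
    end = center_utc + timedelta(minutes=minutes)
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"
    return (
        "*[System[(Level=1 or Level=2 or Level=3) and "
        f"TimeCreated[@SystemTime>='{start.strftime(fmt)}' and "
        f"@SystemTime<='{end.strftime(fmt)}']]]"
    )


def build_wevtutil_command(log_name, query):
    """Return the argument list for a wevtutil query in text format."""
    return ["wevtutil", "qe", log_name, f"/q:{query}", "/f:text"]


if __name__ == "__main__":
    # Assumption: the change happened at 02:00 local time on 2011-03-13 in a
    # UTC-5 zone, i.e. 07:00 UTC. Adjust for the servers' real time zone.
    center = datetime(2011, 3, 13, 7, 0, tzinfo=timezone.utc)
    command = build_wevtutil_command("System", build_window_query(center))
    # Print only; run the resulting line from an elevated prompt on the server.
    print(subprocess.list2cmdline(command))
```

The same query can be pointed at the Application log by changing the log name. The StarWind log is a separate file and is not covered by this sketch.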
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

dmansfield

Thu Mar 17, 2011 7:04 pm

Dell indicates that the issue is most likely caused by TOE being enabled on the Intel 10Gb NICs. Before applying their recommendations, I want to make sure none of them will affect StarWind, especially the commands run from the elevated command prompt. Dell recommends the following:

Step 1. Install the latest tcpip.sys for Windows Server 2008 R2: http://support.microsoft.com/kb/2386184

Step 2. Run the following commands from an elevated command prompt:

netsh int tcp set global rss=disabled

netsh int tcp set global netdma=disabled

netsh int tcp set global chimney=disabled

netsh int tcp set global autotuninglevel=disabled

netsh int tcp set global congestionprovider=none

Step 3. Disable TOE - Intel Configuration
a. Open Device Manager.
b. NOTE: Not every option listed below is available or exists on the Advanced tab.
c. On each Intel NIC in Device Manager, disable the following:
i. Offload Receive IP Checksum
ii. Offload Receive TCP Checksum
iii. Offload TCP Segmentation
iv. Offload Transmit IP Checksum
v. Offload Transmit TCP Checksum
vi. IPV4 Checksum Offload
vii. Large Send Offload v2 (IPV4)
viii. Large Send Offload v2 (IPV6)
ix. Receive-Side Scaling
x. TCP Checksum Offload (IPV4)
xi. TCP Checksum Offload (IPV6)
xii. UDP Checksum Offload (IPV4)
xiii. UDP Checksum Offload (IPV6)

Step 4. Install SP1 for Windows Server 2008 R2

Step 5. Have a StarWind engineer review the iSCSI configuration. Dell commented that it is an unsupported configuration the way it is set up.

I have sent this information to StarWind tech support but also wanted to post it here for other forum readers.

Thank you
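
To make the Step 2 changes easier to review and undo, it may help to record the TCP global settings before and after applying them. The sketch below only assembles the command lines as text. The settings are copied from Dell's Step 2, `netsh int tcp show global` is the standard command for displaying the current values, and the function and variable names are hypothetical.

```python
# Settings copied from Dell's Step 2 above; this is not StarWind guidance.
DELL_TCP_GLOBALS = {
    "rss": "disabled",
    "netdma": "disabled",
    "chimney": "disabled",
    "autotuninglevel": "disabled",
    "congestionprovider": "none",
}

# Displays the current TCP global parameters so they can be saved first.
SHOW_COMMAND = "netsh int tcp show global"


def build_set_commands(settings):
    """Turn each setting into one 'netsh int tcp set global' command line."""
    return [f"netsh int tcp set global {name}={value}"
            for name, value in settings.items()]


def build_change_plan(settings):
    """Show the current state, apply each setting, then show the state again."""
    return [SHOW_COMMAND, *build_set_commands(settings), SHOW_COMMAND]


if __name__ == "__main__":
    # Print only; copy the lines into an elevated command prompt to run them.
    for line in build_change_plan(DELL_TCP_GLOBALS):
        print(line)
```

Saving the output of the first command gives a record of the previous values to restore if the changes hurt performance.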
anton (staff)

Thu Mar 17, 2011 7:21 pm

Good. One question so far.

"Dell commented that it is an unsupported configuration the way it is set up."

What does that mean exactly? Could you please clarify?

Thank you very much!

Anton
dmansfield

Thu Mar 17, 2011 8:38 pm

By "Good" are you saying the proposed changes from Dell are okay to implement and won't conflict with StarWind? I am not sure what Dell is referring to about the iSCSI settings. They did say that it may be fine, but it is not a setup they are used to seeing. Can we do a remote session with one of your engineers to take a quick look?
anton (staff)

Fri Mar 18, 2011 8:17 am

OK to try. Performance will probably be affected in any case, and we'd see that pretty quickly.
dmansfield

Wed Jul 06, 2011 2:32 pm

Update for anyone running Intel 10Gb NICs for the sync channel: we were running two direct connections in an "Adapter Fault Tolerance" team, with one NIC active and the other in standby. This teaming is what was causing the servers to crash. We removed the team and now run with just one NIC enabled on each side, and everything works.
anton (staff)

Wed Jul 06, 2011 9:35 pm

Thank you very much for your update!

P.S. With V5.8 it will no longer be necessary to team NICs. Quite the opposite :))
clayton@mcc911.org
Posts: 17
Joined: Fri Dec 17, 2010 4:00 pm

Thu Jul 07, 2011 9:16 pm

You still never said why the servers rebooted?
anton (staff)

Fri Jul 08, 2011 6:27 am

StarWind never forces machines to reboot, and it is a user-land component, so even if StarWind crashes it does not take down the whole system. That's why the answer is "I don't know".

It's definitely not StarWind but something else (software or hardware), so I'd suggest starting with the system event log and looking for "red light" (error) messages from other components.
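
As a rough starting point for that search, the sketch below builds a wevtutil query for System log event IDs that Windows commonly records around a restart. The ID list is an assumption about what is worth checking first, not something from this thread, and event IDs are not unique across providers, so each hit still needs to be read. The function names are hypothetical.

```python
import subprocess

# Common restart-related event IDs in the Windows System log (an assumed
# starting list; IDs can be reused by other providers).
REBOOT_EVENT_IDS = {
    41: "Kernel-Power: system rebooted without cleanly shutting down",
    1001: "BugCheck: the computer rebooted from a bugcheck",
    1074: "User32: a process or user initiated the restart",
    6008: "EventLog: the previous shutdown was unexpected",
}


def build_reboot_query(event_ids):
    """Build an XPath filter matching any of the given event IDs."""
    clause = " or ".join(f"EventID={event_id}" for event_id in sorted(event_ids))
    return f"*[System[({clause})]]"


def build_command(query, count=20):
    """Return wevtutil arguments: newest events first, limited to `count`."""
    return ["wevtutil", "qe", "System", f"/q:{query}",
            f"/c:{count}", "/rd:true", "/f:text"]


if __name__ == "__main__":
    # Print only; run the resulting line from an elevated prompt on the server.
    print(subprocess.list2cmdline(build_command(build_reboot_query(REBOOT_EVENT_IDS))))
```

An event 1074 would name the process that requested the restart, which speaks directly to the question of who instructed the reboot. Events 41, 1001 or 6008 without a matching 1074 would instead point to a crash or power loss.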