Troubleshooting "not synchronized" status

wallewek · Sat Sep 30, 2017 9:56 pm

I'm having an issue with a pair of newly created storage devices (the witness ones are working fine).

For starters, I'm getting warnings showing up as "Partner ... is not synchronized" or "Current node is not synchronized" depending on where I look.

1. Nowhere can I find any indication if it is actually _trying_ to synchronize, or to tell it to start. It just sits there saying it's not.

2. The error messages give no hint as to _why_ it's not synchronized. There's no diagnostic information whatever.

3. This forum interface does not seem to allow one to search for an exact phrase like "not synchronized", so as a result I have a great mass of useless results when I try to see what else might have been posted on the topic.

So before I start trying to guess why it's not synchronized -- what am I missing? How do I find answers to these questions?

If I look at the events tab following a reboot of one of the hosts, I see that the witness device has entries saying "Synchronizing" and "Synchronized". I don't see anything like that for the other device. I see synchronization connection established for all devices, but that's as far as it goes.

It's almost as if something hasn't gotten the message to start synch.

It's looking a bit like there are some old configuration settings hanging around from the previous storage device definition, mucking things up. I see several "device added" entries in the Events log for devices that don't exist anymore.

I suspect I have iSCSI issues, and I'll be investigating that, but I wanted some clarification on this synch status thing. What does one do in a case like this?

Follow-on question: how can I check or re-set to ensure iSCSI targets have "Allow multiple connections" enabled? All I can see is a way to set it during device creation using the Add Device Wizard. What about later?

-- Ken

Mon Oct 02, 2017 4:55 pm

Hello wallewek,
Thank you for your interest in the StarWind solution.
First of all, could you please describe your configuration? Network for Sync/iSCSI (throughput, connection type (switch or direct)), type of disks, RAID array level.
Could you collect the logs and share them with us? For quicker and easier log collection from StarWind nodes, please do not hesitate to use the script from our knowledge base article below:
https://knowledgebase.starwindsoftware. ... collector/. You can share a download link from any cloud service (Google Drive, OneDrive for Business, Dropbox, etc.)

In the meantime, you can check our KB article on basic troubleshooting at the link below:
https://knowledgebase.starwindsoftware. ... ting-note/

In most cases, you can find the reason for the unsynchronized state of StarWind devices in the System and Application logs (delays, network connectivity problems, etc.)

Thank you

wallewek · Mon Oct 02, 2017 8:52 pm

Thank you Ivan,

I will try to get you that info. In the meantime, here is some key info.

It appears my configuration is virtually identical to that described in the thread
https://forums.starwindsoftware.com/vie ... f=5&t=4792 created less than a month ago.
Even the circumstances are comparable.

These are a pair of identical, automatically-updated small Server 2016 servers with unmirrored SATA 7200 RPM drives, each with four 1GB NICs in separate subnets, connected via an HP managed 1GB switch. I have checked the network connections carefully, and I am seeing no signs of errors or high load. Telnet to ports works fine.
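For anyone repeating the Telnet check mentioned above, here is a minimal Python sketch of the same idea. It assumes the iSCSI targets listen on the standard iSCSI port 3260; that port and the host addresses in the usage note are assumptions, not values taken from this setup. A successful TCP connect only shows that the port is reachable. It says nothing about whether an iSCSI logon to a particular target will succeed.

```python
import socket

ISCSI_PORT = 3260  # registered iSCSI port; assumed to be what the targets use


def port_is_open(host: str, port: int = ISCSI_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_hosts(hosts, port: int = ISCSI_PORT, timeout: float = 3.0) -> dict:
    """Map each host to True/False depending on whether the port answered."""
    return {host: port_is_open(host, port, timeout) for host in hosts}
```

Usage would look like check_hosts(["10.0.1.11", "10.0.1.12"]), where both addresses are placeholders for the sync/iSCSI interfaces of the two nodes.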

IMPORTANT NOTES
My problems started when I deleted previously created lsfs storage images to replace them with thick provisioning. Those lsfs images were working and synching fine; they were just too small, so I had to delete and recreate them.

After creating the new, totally empty thick CSV image and replicating it, I could not get iSCSI connections to work properly for the new image(s). The original, unmodified witness disks continue to chug along, connected and synched, with no problem.

Using the iSCSI Initiator Properties tool, I can connect to the first server's new CSV target, from either host, no problem. But I cannot connect to the second server's new CSV target at all, not locally or remotely. I get the "Service Unavailable" message, even after rebooting.

And, of course, with no iSCSI connection there can be no synch at all.

I had earlier noted the KB article https://knowledgebase.starwindsoftware. ... kb4019215/
but have not yet applied the update. I wasn't clear whether it is relevant for Server 2016 Standard. But I did use the KB recommendation to clean up old disconnected iSCSI targets, which helped somewhat.

Would you recommend applying it for Server 2016?

-- Ken

Tue Oct 03, 2017 4:38 am

Ken,
Although it was mainly Win2012 that was hit by the issue, a small share of Win2016 systems (approximately 5%, maybe) suffered as well. Applying the updates would be a reasonable step in troubleshooting your issues.

wallewek · Tue Oct 03, 2017 7:39 pm

OK, issue not resolved, here is current status.

I couldn't find a Microsoft Server 2016 update directly corresponding to the Server 2012 one described above, and of course 2016 updates are managed very differently. But I did find a big update for Server 2016 -- the latest -- that hadn't been applied yet. So I tried it on the second server (only). Made zero difference.

So I tried tearing down the CSV image install and doing it all over again. Same result: the first server can access its own image fine, and the second server can access the first server's image fine too, but neither server's iSCSI configuration can connect to the replica image at all. No how, no way.

So I tried tearing the CSV down _again_ and this time created the primary copy on the second server. By now I'm getting pretty quick at going through the steps. Anyway, this is fascinating: the connection problem was now reversed. I can connect to the primary image on the second server (where it was first created) no problem at all, from either server. But again, neither server can access the new _replica_ CSV image, either locally or remotely. As soon as I try to connect to the replica iSCSI target from either server, I instantly get a Log On to Target -- Service Unavailable error.

Oh, and BTW, I accidentally connected to the pre-existing Witness image remotely on the server I couldn't connect to the new CSV on. No problem at all. I disconnected it immediately, as I understand Witness disks aren't supposed to be cross-connected (not entirely clear why), but it's interesting that I had no problem with it.

So... it looks like there's something munged up with the Replica creation. No clue what. But I think I might try creating the replica manually on the second server and then connect to it there.

By the way, even though the iSCSI targets won't connect, the recipient (replica) image is showing as replicating! I cannot fathom how that is possible.

-- Ken

Wed Oct 04, 2017 9:29 am

Ken,

Please submit a support case and indicate the forum topic you came from (and/or your forum nickname).

wallewek · Thu Oct 05, 2017 5:16 am

OK, support case submitted, logs from both hosts plus iSCSI screenshots submitted in a single zip via the case submission interface.

Note, it appears that creating the storage in reverse (as described in the Tue Oct 03 posting) did eventually result in a pair of synchronized images, but the iSCSI connections (or not) still don't make any sense. (See attached iSCSI screenshots.) I haven't yet tried to do anything with those drives.

It might also be worth noting, I tried having the firewall totally turned off on both servers. No change.

Related topic threads:
"Why can't I extend a device/image?" https://forums.starwindsoftware.com/vie ... f=5&t=4811
"Re: Extending the HA image, not the virtual disk." https://forums.starwindsoftware.com/vie ... ice#p27003 (September postings only)

Thanks for taking this issue seriously. I really hope I haven't done anything stupid. Pretty much all of my work has been based on the technical paper "StarWind Virtual SAN Hyper-Converged 2 Nodes Scenario with Hyper-V Cluster" Published: August 2016
https://www.starwindsoftware.com/starwi ... -v-cluster

-- Ken

Thu Oct 05, 2017 10:21 am

I confirm I received a support case, yet there were no logs attached, just the screenshots and a Word doc zipped. I have sent you a request for logs as well. Check your email.

wallewek · Thu Oct 05, 2017 3:19 pm

I beg your pardon, I made a mistake when I tried to create a single zip containing both log collections. It would really help if your case submission website allowed the multiple attachments you requested.

I have now sent both log zips in separate email replies.

-- Ken

Mon Oct 09, 2017 1:16 pm

Ken,
After investigating the logs, we found that one of your nodes appears to have bad blocks on its physical drive. This is why you were not able to get the system working properly. To investigate the issue further, I would highly recommend checking disk health on both of your nodes.
Please keep me updated about your findings.

wallewek · Tue Oct 10, 2017 5:47 am

Thanks Boris, but this issue was actually NOT about synchronization. It was about iSCSI target connections, or rather the inability to make them. As I wrote on October 2nd,

After creating the new, totally empty thick CSV image and replicating it, I could not get iSCSI connections to work properly for the new image(s)...

Using the iSCSI Initiator Properties tool, I can connect to the first server's new CSV target, from either host, no problem. But I cannot connect to the second server's new CSV target at all, not locally or remotely. I get the "Service Unavailable" message, even after rebooting.

And, of course, with no iSCSI connection there can be no synch at all.

Here's a screenshot of the iSCSI connection failure message for clarity. I got this _immediately_, _only_ when trying to connect to the newly created _replica_ target, from _either_ host. There's no opportunity for synch to even get started.

Attachment: iSCSI logon fail.png (23.39 KiB)

I eventually resolved the issue by reversing the direction of image creation -- creating the first image on the second server, the one I couldn't make replica image iSCSI target connections to, and creating the replica on the first server, the one where I created the first image before.

Since October 5th, when I sent those logs in, I have conversed by email with Ivan Ischenko and Boris Yurchenko (yourself?) about this. My last emailed status report read:

So far, so good! The iSCSI targets look right, the favourites look right, images are synching fine and now recognized by Windows as CSVs. I have one Hyper-V VM cluster node in and configured for HA, seems to move OK. I’m planning to reboot both hosts, check MPIO etc., and import more VMs into the cluster, we’ll see how that goes. And I plan to update the forum so others can gain the benefit of the experience.

But I wish I knew what caused the problem, and how to prevent it in the future. Here are the key points so far as I can see:

1. The problem with creating iSCSI replica targets appeared to have been triggered by deleting the old images and creating the new ones at the same location with the different filesystem type. No clue why, they weren’t even the same name, although they were in the same location -- maybe I did something wrong. But there appeared to be some kind of iSCSI target conflict, due to hidden issues that didn’t appear anywhere I could see. Never did figure out what that issue was.

2. The way I resolved it was to create the primary image for the new CSV on the other host, i.e. going in the reverse direction, and creating the replica on the host where I created the primary before. That worked, albeit with some initial oddities. But after reboots and iSCSI it started to work.

3. While I did apply the latest Server 2016 rollup patch, it’s not clear to me that it had anything to do with the issue at all. The StarWind KB article referring to the Server 2012 patch wasn’t of any help, really.

Like I say, I really wish I knew what the problem was, or even how to diagnose it! I never got any help locating the actual cause of the connect failures. I think StarWind should think about how to address this in the future: Server 2016 is going to be more and more common.

Any thoughts on that?

At this point, since creating the primary and replica images in the reverse direction, I've had no further issues with iSCSI targets or synchronization. I still want to do a full cluster one-server-at-a-time reboot to confirm that there aren't any further problems, but I feel pretty confident it won't be an issue.

Oh, and BTW, to take your concern about bad blocks seriously, I did today run CHKDSK /R /SCAN on all drives. Zero errors. Pretty sure that's got nothing to do with it.

-- Ken

Tue Oct 10, 2017 4:36 pm

Ken,
Check the logs for this error, first reported on September 27:
17:41:52.528 13e4: IMG: *** ImageFile_IoCompleted: Disk operation failed. Disk path: F:\Storage1\Storage1.img. Error code: (23) (StarWind log)
This had not been resolved by October 3.
The latest event in the Windows system log says:
kmHV3.domain.net Error The device, \Device\Harddisk1\DR1, has a bad block. Disk 10/3/2017 13:00
The order in which the disks are created does not affect the ability to connect to the targets. Please check your logs for messages indicating that the disk issues were resolved (something like a RAID rebuild).
What I see from your story is that one of your sides WAS NOT synchronized. At the same time, an iSCSI initiator can connect only to a target that IS synchronized, to prevent data corruption on the disk. This behavior is by design. In your case, you reported the device was not synchronized on the node, and this resulted from bad blocks on your physical drives that prevented the disk from getting synchronized.
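For readers who want to look for the same kind of entry in their own logs, here is a rough Python sketch that pulls the disk path and error code out of lines shaped like the StarWind log entry quoted above. The regular expression is inferred from that single sample line, so the format is an assumption. The small lookup table also assumes the number in parentheses is a Win32 system error code; if so, 23 is ERROR_CRC (data error, cyclic redundancy check), which would be consistent with the bad block event in the Windows log.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one sample line in this thread (an assumption).
DISK_ERROR_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+\S+\s+IMG:.*?"
    r"Disk operation failed\.\s+Disk path:\s+(?P<path>.+?)\.\s+"
    r"Error code:\s+\((?P<code>\d+)\)"
)

# A few Win32 system error codes; assumes the log reports Win32 codes.
WIN32_ERRORS = {
    5: "ERROR_ACCESS_DENIED",
    21: "ERROR_NOT_READY",
    23: "ERROR_CRC",
}


class DiskError(NamedTuple):
    time: str
    path: str
    code: int
    name: str


def parse_disk_error(line: str) -> Optional[DiskError]:
    """Return the parsed fields of a disk failure line, or None if it does not match."""
    m = DISK_ERROR_RE.search(line.strip())
    if m is None:
        return None
    code = int(m.group("code"))
    return DiskError(m.group("time"), m.group("path"), code,
                     WIN32_ERRORS.get(code, "unknown"))


def find_disk_errors(lines):
    """Collect every disk failure entry found in an iterable of log lines."""
    return [e for e in map(parse_disk_error, lines) if e is not None]
```

Feeding it the lines of a log file, for example find_disk_errors(open(path, errors="replace")) with path pointing at a log of your own, returns one record per failed disk operation, so repeated failures against the same image file stand out quickly.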

Can you reproduce your steps in the order you did them when having an issue and confirm you still face the same behavior? Thanks for your efforts.

wallewek · Wed Oct 11, 2017 6:44 pm

Thank you Boris,

I've checked the Windows System Event logs, and do confirm that those bad block errors did show up as recently as October 3rd (a week ago). Nothing since then. As I've had no other issues with the drives, and haven't made any drive-related changes, it's a bit of a mystery. FWIW, these are "bare" drives, not hidden behind RAID controllers, and I am now monitoring their health with HD Sentinel (which reports them as healthy).
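One way to double-check the "nothing since then" observation is to scan an exported copy of the System log for bad block entries and look at the newest one. The sketch below is only an illustration: it assumes each exported line resembles the entry quoted earlier in this thread (message text followed by a month/day/year timestamp), and the function names are made up for this example.

```python
import re
from datetime import datetime

# Line shape assumed from the single event quoted in this thread.
BAD_BLOCK_RE = re.compile(
    r"The device, (?P<device>\S+), has a bad block\..*?"
    r"(?P<stamp>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2})"
)


def bad_block_events(lines):
    """Return (device, datetime) pairs for every bad block line, oldest first."""
    events = []
    for line in lines:
        m = BAD_BLOCK_RE.search(line)
        if m:
            # Assumes US-style month/day/year timestamps in the export.
            when = datetime.strptime(m.group("stamp"), "%m/%d/%Y %H:%M")
            events.append((m.group("device"), when))
    return sorted(events, key=lambda e: e[1])


def last_bad_block(lines):
    """Return the most recent (device, datetime) pair, or None if there are none."""
    events = bad_block_events(lines)
    return events[-1] if events else None
```

If last_bad_block returns a date that keeps moving forward between checks, the drive is still reporting errors; if it stays put, the errors have at least stopped being logged.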

It may be interesting that the system that was logging these errors (KMHV3) is the one where I created the original image, not the replica. When I reversed that, the problem went away -- but then, so did the drive errors. Weird.

Could you clarify something for me please, Boris? Up to now, I've assumed that StarWind used iSCSI for storage synchronization. It now appears I was mistaken -- that something else is used for replication, and that (from what you say) StarWind is not only fully capable of replicating without iSCSI, it will disable iSCSI access to the replication target if synchronization is incomplete.

Is that correct? Do you have any documentation that clarifies how synchronization works?

It also sounds as though, when creating a replica image partner, one should not even attempt to create or adjust iSCSI targets until synchronizing is done. Is that correct?

I think it would be really helpful if the user interface and/or documentation was more clear about things like this, or about errors that occur. I was really in the dark about these mysterious failures.

Final note: as per your request, I've just created another image and replica on the same hosts, in the same direction, as the original problem ones, and it succeeded without problems, including synch and (once that was complete) iSCSI connections, in both directions. So I think this issue is probably resolved.

-- Ken

Delo123 · Thu Oct 12, 2017 12:52 pm

If the disk showed bad blocks, there are bad blocks. Maybe changing the replication direction means these blocks are not currently being read or written, but I wouldn't trust that disk anymore, even more so because it doesn't seem to be a redundant disk.

Thu Oct 12, 2017 1:16 pm

Ken,

It appears that you had problems with HA replication only while you had bad blocks on your physical drive. After they had gone away, creation of the HA replica went flawlessly. It is a mystery where your bad blocks have gone, but I believe this is out of scope for StarWind.

Your assumption on the iSCSI protocol being used for synchronization is correct. Yet, to prevent any troubles with information replication and consistency, the StarWind VSAN service marks the non-synchronized partner as non-accessible to iSCSI connections, as I mentioned before. I cannot go into more details on this, as the mechanism that backs the synchronization process is one of the core things of the commercial product and shall not be disclosed.
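Purely as an illustration of the behaviour described in this thread, and not of how StarWind implements it, the gating rule can be written as a toy model: a partner accepts initiator logons only while it is synchronized, and otherwise the initiator sees the "Service Unavailable" error reported earlier. All class and method names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class SyncState(Enum):
    SYNCHRONIZED = "synchronized"
    SYNCHRONIZING = "synchronizing"
    NOT_SYNCHRONIZED = "not synchronized"


@dataclass
class HaPartner:
    """Hypothetical stand-in for one side of an HA device."""
    name: str
    state: SyncState

    def logon(self) -> str:
        """Mimic the outcome an initiator sees when logging on to this partner's target."""
        if self.state is SyncState.SYNCHRONIZED:
            return "connected"
        # Any other state is refused, so stale data is never served.
        return "Service Unavailable"
```

In this model, a replica that is still synchronizing refuses logons exactly like one that never started, which matches the symptom of an instant logon failure on the replica target while the primary target stays reachable.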

It also sounds as though, when creating a replica image partner, one should not even attempt to create or adjust iSCSI targets until synchronizing is done. Is that correct?

As for iSCSI target connections, I would recommend that you create an HA device before connecting to it, rather than creating a standalone image, connecting to it and replicating it later.

I am really happy your issue got resolved and you no longer have troubles with StarWind.