I'm losing my files on LSFS


joonas
Posts: 8
Joined: Tue May 09, 2017 3:57 am

Tue May 09, 2017 4:15 am

Hi!

We have a StarWind SAN with HA LSFS and a flat quorum witness without HA. On top of it runs a Hyper-V 2016 Failover Cluster.

Yesterday I decided to upgrade our StarWind SAN with 10GbE adapters (Mellanox ConnectX-2 with the latest WinOF drivers). I shut down our "secondary" server first and added the adapter. Then I shut down all cluster VMs and brought all cluster disks (CSV and quorum witness) offline from Failover Cluster Manager.

I shut down the primary SAN and added the adapter. After turning the primary back on, I installed the drivers and restarted it. I checked that the primary was up and that the cluster could connect to it. All clustered disks were online! All seemed ok. Then I started the secondary. During the driver installation it rebooted and started to sync the HA device.

Then I tried to start a VM. Error: couldn't bring the Virtual Machine configuration online!!! All disks seemed ok. So I went to see \\SRV01\d$\ClusterStorage\Volume2\Virtual Machines\Virtual Machines. Only empty folders!!! No files! :shock:

At that time the VHDs in C:\ClusterStorage\Volume2\Virtual Hard Disks seemed ok. So I decided to create a new VM with an existing VHD. This seemed to work fine, but then I noticed something awful: files from C:\ClusterStorage\Volume2\Virtual Hard Disks also started to disappear :shock: :cry:

Is there anything I can do now? Several huge VMs are now just gone!!! Please help me! Is there a way to use some kind of Recovery Mode and mount the LSFS from an older .spspx file?

I have started a DSR recovery from our backup server, but the backup is a day old for some of the machines :(
Vitaliy (Staff)
Staff
Posts: 35
Joined: Wed Jul 27, 2016 9:55 am

Tue May 09, 2017 5:39 pm

Hello Joonas,

Is there any chance you could share the logs from both hosts?
For quicker and easier log collection from StarWind VMs, please use our knowledge base article:
https://knowledgebase.starwindsoftware. ... ogs-bat-2/

We are going to investigate them with R&D guys.
joonas
Posts: 8
Joined: Tue May 09, 2017 3:57 am

Sun May 21, 2017 5:20 am

These files have been uploaded to your FTP site.
Michael (staff)
Staff
Posts: 319
Joined: Thu Jul 21, 2016 10:16 am

Thu May 25, 2017 8:44 am

Hello Joonas,
Thank you for the logs.
We will provide you with the investigation results as soon as possible.
Michael (staff)
Staff
Posts: 319
Joined: Thu Jul 21, 2016 10:16 am

Wed May 31, 2017 5:24 pm

Hello Joonas,
I am sorry for the delay in our response.
Our R&D team investigated the logs and unfortunately did not find the reason why the files disappeared. Many events in the StarWind logs point to huge delays (reads and writes) on the underlying storage, which caused synchronization issues between the nodes: the device was not able to synchronize.
To restore the data, please add the LSFS device on the SRV01 node as a standalone one (Add device (advanced)->Hard Disk Device->Virtual Disk->Use an existing Virtual Disk).
Since HA synchronizes using snapshots, the added LSFS device should have at least one snapshot. According to the logs, one was created at 21:39:
5/8 21:39:08.096 7e2c Sw: CSnapshotManager::OnSnapshotCreateComplete: Snapshot 'ScheduledSnapshot' [ id: 51] has been successfully created, UID '384301'.
Then select the newly added device, open Snapshot Manager and mount the available snapshot as a new target. Connect the mounted target via iSCSI Initiator, and the data should be available on the disk that appears.
The same operation can be done on node SRV04, but we see from the logs that device synchronization failed there several times due to storage delays, so it might contain invalid data.
Please let us know if it helped.
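For anyone who wants to check their own logs for snapshots, here is a minimal Python sketch that scans log lines for snapshot-creation events like the one quoted above. The pattern is modelled on that single line only; other StarWind builds may format the event differently, and the function name `find_created_snapshots` is just an illustrative choice.

```python
import re

# Pattern modelled on the one log line quoted in this thread (assumption:
# other StarWind versions may word or lay out this event differently).
SNAPSHOT_CREATED = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) \S+ Sw: "
    r"CSnapshotManager::OnSnapshotCreateComplete: Snapshot '(?P<name>[^']*)' "
    r"\[ id: (?P<id>\d+)\] has been successfully created, UID '(?P<uid>[^']*)'"
)


def find_created_snapshots(lines):
    """Return one dict (date, time, name, id, uid) per snapshot-creation event."""
    events = []
    for line in lines:
        match = SNAPSHOT_CREATED.search(line.strip())
        if match:
            events.append(match.groupdict())
    return events
```

Passing the lines of a log file to `find_created_snapshots` gives a list of the snapshots the service reported as created, which can help confirm that at least one snapshot exists before trying to mount it.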
joonas
Posts: 8
Joined: Tue May 09, 2017 3:57 am

Wed May 31, 2017 7:24 pm

I keep getting an error when adding the new device:
Failed to create Device. Underlying storage not found. (May be it has not been added yet?) [OK]

What should I do?
Michael (staff)
Staff
Posts: 319
Joined: Thu Jul 21, 2016 10:16 am

Fri Jun 02, 2017 7:48 am

It looks like the files for the L2 cache have been removed.
To fix it, please edit the LSFS configuration file (.swdsk) and remove the information about the L2 cache. You can follow the KB here: https://knowledgebase.starwindsoftware. ... dance/661/
As a result, you should get a file similar to this:

<?xml version="1.0" encoding="UTF-8"?>
<header signature="StarWind" version="1.0">
  <device active="true" plugin="lsfs" name="lsfs">
    <storages>
      <storage id="1" type="device" name="lsfs" lun="0x0">
        <interval size="2560000"/>
        <inquiry>
          <serial_id>583E6394D3DEA4B3</serial_id>
          <vendor id="STARWIND"/>
          <product id="STARWIND        " revision="0001"/>
          <eui_64>583E6394D3DEA4B3</eui_64>
        </inquiry>
        <geometry>
          <sector size="512" psize="4096"/>
          <track sectors="16"/>
          <cylinder tracks="32" count="65535"/>
        </geometry>
      </storage>
    </storages>
  </device>
  <system>
    <resources>
      <storages>
        <storage id="3" name="My Computer\D\lsfs1\lsfs1.spsp" type="file">
          <interval size="2560000"/>
        </storage>
      </storages>
      <network/>
    </resources>
  </system>
  <node id="1" name="lsfs" shut="false" active="true">
    <storages>
      <storage_ref id="3"/>
    </storages>
    <parameters>
      <DeduplicationEnabled>yes</DeduplicationEnabled>
      <SnapshotsDisabled>no</SnapshotsDisabled>
      <BlockSize>4096</BlockSize>
      <Compressed>no</Compressed>
      <HashType>2</HashType>
      <mode>1</mode>
      <DataCacheSize>0</DataCacheSize>
      <EsxCompatibleMode>no</EsxCompatibleMode>
      <AutoDefrag>yes</AutoDefrag>
      <VerifyData>no</VerifyData>
      <SupportDeletion>yes</SupportDeletion>
      <CacheMetadata>no</CacheMetadata>
      <SnapshotFlushPeriod>10000</SnapshotFlushPeriod>
      <SnapshotOffset>0</SnapshotOffset>
      <SnapshotTimestamp>0</SnapshotTimestamp>
      <SnapshotID>0</SnapshotID>
      <IsDeviceStorage>0</IsDeviceStorage>
    </parameters>
  </node>
</header>
Please let me know whether the device is added successfully.
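As a sanity check after hand-editing the file, a short Python sketch like the one below can list the file-backed storages declared in the config and flag any `storage_ref` that no longer points at a declared storage. It assumes the element layout shown in the sample above (`system/resources/storages/storage` and `node/storages/storage_ref`); the function name `summarize_swdsk` is hypothetical and this is not a StarWind tool.

```python
import xml.etree.ElementTree as ET


def summarize_swdsk(swdsk_text):
    """Summarize storages in a .swdsk document laid out like the sample above."""
    root = ET.fromstring(swdsk_text)  # root is the <header> element
    storages = root.findall("./system/resources/storages/storage")
    declared_ids = {s.get("id") for s in storages}
    # Names of file-backed storages, exactly as written in the config.
    file_names = [s.get("name") for s in storages if s.get("type") == "file"]
    # References from the node section that match no declared storage.
    refs = [r.get("id") for r in root.findall("./node/storages/storage_ref")]
    dangling = [ref for ref in refs if ref not in declared_ids]
    return {"file_storages": file_names, "dangling_refs": dangling}
```

An empty `dangling_refs` list and a `file_storages` list containing only files that still exist on disk suggest the edit left the config self-consistent. It does not guarantee that the device will mount.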
joonas
Posts: 8
Joined: Tue May 09, 2017 3:57 am

Fri Jun 02, 2017 8:33 am

Is there a way to make sure this will never happen again? Some kind of test to run?

This is actually the third time that we have lost data because of a sync failure. We are considering moving away from StarWind if there is no solution for this.
Michael (staff)
Staff
Posts: 319
Joined: Thu Jul 21, 2016 10:16 am

Fri Jun 02, 2017 10:25 am

Hello Joonas,
Were you able to add the device and mount the snapshot?
joonas
Posts: 8
Joined: Tue May 09, 2017 3:57 am

Tue Jun 13, 2017 2:33 pm

I'm sorry, but it tells me that the device mount has failed... So I can't mount that device and thus I can't access Snapshot Manager.

But I need to make absolutely sure that we are not going to encounter this anymore. This is actually the 4th time that we have lost data because of some odd sync problem between the servers. We have actually changed the hardware twice to make this more reliable...
Michael (staff)
Staff
Posts: 319
Joined: Thu Jul 21, 2016 10:16 am

Thu Jun 15, 2017 12:24 pm

Hello Joonas,
To fix the mounting issue, please try the following workaround:
1) Stop the StarWind service;
2) Locate the StarWind device folder and rename the extensions of the .splumap and .spvmap files in the folder where the failed devices are located;
3) Start the StarWind service;
4) Wait for the mounting to complete.
As for the synchronization question, I did indeed see huge delays during device synchronization in your logs. Please check whether network and storage performance are acceptable.
I am looking forward to hearing from you.
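The rename in step 2 of the workaround can be scripted. The sketch below is only an illustration: the post does not say what the new extension should be, so appending `.old` is an assumption, and `rename_map_files` is a hypothetical helper name. Run it only while the StarWind service is stopped, and try it with `dry_run=True` first.

```python
from pathlib import Path

# Extensions named in the workaround above.
MAP_SUFFIXES = (".splumap", ".spvmap")


def rename_map_files(device_folder, new_suffix=".old", dry_run=True):
    """Append new_suffix to every .splumap/.spvmap file in device_folder.

    Returns the list of (old_path, new_path) pairs. With dry_run=True
    nothing is renamed; the pairs only show what would happen.
    """
    planned = []
    # sorted() materializes the listing before any rename takes place.
    for path in sorted(Path(device_folder).iterdir()):
        if path.is_file() and path.suffix.lower() in MAP_SUFFIXES:
            target = path.with_name(path.name + new_suffix)
            planned.append((path, target))
            if not dry_run:
                path.rename(target)
    return planned
```

Appending a suffix instead of replacing the extension keeps the original file name recoverable if the rename needs to be undone.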
joonas
Posts: 8
Joined: Tue May 09, 2017 3:57 am

Wed Jun 21, 2017 12:33 pm

Sorry, still doesn't mount...

Is there a way to do some testing regarding synchronization issues?

I'll have two 10G Ethernet adapters that I'm going to use in the system for sync. The actual iSCSI traffic is on 1G Ethernet.
We are using WD 6TB Red Pro spindles in RAID1 and some Kingston SSDs in RAID1 for cache.
Anything that sounds bad here?
Michael (staff)
Staff
Posts: 319
Joined: Thu Jul 21, 2016 10:16 am

Thu Jun 22, 2017 4:38 pm

Hello Joonas,

Could you please collect the logs one more time and upload them to our FTP? We are going to investigate the issue with device mounting. Please make sure that the logs cover the time when the device failed to mount.

As for the network configuration: on the storage performance side, the WD 6TB Red Pro gives about 1.5 Gbps on read operations, while the Kingston SSDs should give about 5 Gbps.
I would suggest assigning one 10 Gbps port to StarWind synchronization traffic and another 10 Gbps port to iSCSI traffic. That will cover the storage operations, while the 1 Gbps links can be used for the StarWind heartbeat.
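The reasoning behind this suggestion is simple arithmetic: a data path runs no faster than its slowest component. The Python sketch below uses only the approximate figures quoted in this post and ignores protocol overhead, RAID effects and caching, so treat it as a rough illustration and not a measurement.

```python
def effective_gbps(storage_gbps, link_gbps):
    """A path can go no faster than its slowest component."""
    return min(storage_gbps, link_gbps)


# Approximate figures quoted in the post above (not measured here).
STORAGE = {"WD 6TB Red Pro (read)": 1.5, "Kingston SSDs": 5.0}
LINKS = {"1 GbE": 1.0, "10 GbE": 10.0}

for storage_name, storage_gbps in STORAGE.items():
    for link_name, link_gbps in LINKS.items():
        limit = effective_gbps(storage_gbps, link_gbps)
        bottleneck = "link" if link_gbps < storage_gbps else "storage"
        print(f"{storage_name} over {link_name}: ~{limit} Gbps, limited by {bottleneck}")
```

With these numbers a 1 GbE link is the limiting factor for both storage tiers, while on 10 GbE the storage itself becomes the limit.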
joonas
Posts: 8
Joined: Tue May 09, 2017 3:57 am

Fri Jun 30, 2017 4:40 am

Hi!

I recollected the logs. They are on the FTP server in folder under my name.

I was reading https://www.starwindsoftware.com/techni ... ctices.pdf and realized that it actually recommends running both sync and iSCSI on the same two adapters when using a hyperconverged setup. Have I understood correctly? Would it be good to have 10G: sync+iSCSI; 10G: sync+iSCSI; 1G: heartbeat + management; 1G: general VM internet traffic?

Are there any tools I could use to test my hw performance / configuration?

Thanks,
Joonas
Vitaliy (Staff)
Staff
Posts: 35
Joined: Wed Jul 27, 2016 9:55 am

Mon Jul 03, 2017 6:16 pm

Hello Joonas,
joonas wrote: I recollected the logs. They are on the FTP server in folder under my name.
I have sent them to our R&D guys. Once I get a report I will update you.
joonas wrote: Would it be good to have 10G: sync+iSCSI; 10G: sync+iSCSI; 1G: heartbeat + management; 1G: general VM internet traffic?
It would not be good. We do not recommend mixing iSCSI and Sync traffic. The best approach is to use one 10 Gb port for iSCSI and the other for Sync.
An additional 1 Gb NIC would be a good fit for one more Heartbeat channel.
joonas wrote: Are there any tools I could use to test my hw performance / configuration?
Yes, we use the Microsoft diskspd utility. You can find it here:
https://gallery.technet.microsoft.com/D ... e-6cd2f223

Thank you.
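As a starting point for such a test, here is a small Python sketch that assembles a diskspd command line. The flags are standard diskspd options (`-c` create a test file of the given size, `-b` block size, `-d` duration in seconds, `-t` threads, `-o` outstanding I/Os per thread, `-w` write percentage, `-r` random I/O, `-Sh` disable software and hardware caching, `-L` collect latency statistics). The default values and the helper name `diskspd_args` are illustrative assumptions, not StarWind recommendations.

```python
def diskspd_args(target, block="4K", seconds=60, threads=4, outstanding=8,
                 write_pct=30, file_size="10G", random_io=True):
    """Build an argument list for diskspd.exe; values are illustrative only."""
    args = [
        "diskspd.exe",
        f"-c{file_size}",    # create the test file with this size
        f"-b{block}",        # block size per I/O
        f"-d{seconds}",      # test duration in seconds
        f"-t{threads}",      # threads per target
        f"-o{outstanding}",  # outstanding I/Os per thread
        f"-w{write_pct}",    # percentage of writes
        "-Sh",               # bypass software and hardware caches
        "-L",                # report latency statistics
    ]
    if random_io:
        args.append("-r")    # random instead of sequential access
    args.append(target)      # test file path; must NOT be an existing data file
    return args
```

The resulting list can be handed to `subprocess.run` on the storage host. Point the target at a new scratch file on the volume under test, never at an existing .spsp or VHD file.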