Massive data corruption after SW reboot
Posted: Mon Jul 27, 2015 2:57 pm
We have had several occurrences of massive corruption after our StarWind servers were rebooted.
We have about 60TB of data served behind each SW box.
Each box has 3 iSCSI targets, each serving a 20TB volume to ESXi hosts.
Each 20TB LUN is a flat file (no LSFS) with an 8GB L1 cache and a 60GB SSD L2 cache in WB mode.
The last time we rebooted both StarWind boxes, each hung at the shutdown screen for over an hour.
We were forced to power them off manually. When they came back up and the VMs were brought back online, we had massive (and I mean MASSIVE) file corruption across all volumes.
It took many days of chkdsk repairs to get things back to normal, and several hundred GB of data was corrupted beyond recovery.
We later tried updating the SW software on one box in case the problem was a bug. While stopping the SW service it hung for well over an hour again, and the process eventually died with a timeout error. Once we restarted SW and brought the volumes back online, the VMs once again showed extensive corruption.
Our investigations have led us to believe that the use of the L2 cache on the SSDs is the cause of the issue. When monitoring disk writes to the 20TB LUNs, we see huge write volumes going to the SSD cache but very little going to the actual main datastores.
I think the cache flush algorithm is at fault: data is being written to the SSD cache but not (or at least not fully) destaged to the main volume.
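For reference, this is roughly how we sampled the per-disk write rates on the SW host (a minimal sketch using Python and the psutil package; the PhysicalDrive names below are placeholders, not our real layout):
[code]
import time
import psutil

# Placeholders: map these to your real disks
# (see psutil.disk_io_counters(perdisk=True).keys() for the names on your host).
CACHE_DISK = "PhysicalDrive2"   # the L2 cache SSD
DATA_DISK = "PhysicalDrive1"    # the disks backing the 20TB flat image

def write_bytes():
    counters = psutil.disk_io_counters(perdisk=True)
    return {d: counters[d].write_bytes for d in (CACHE_DISK, DATA_DISK)}

INTERVAL = 10  # seconds between samples
before = write_bytes()
time.sleep(INTERVAL)
after = write_bytes()

for disk in before:
    rate_mib = (after[disk] - before[disk]) / INTERVAL / 2.0 ** 20
    print("%s: %.1f MiB/s written" % (disk, rate_mib))
[/code]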
My assumption is that when the SW process is stopped during a server shutdown, all the data sitting in the SSD cache has to be written out to disk. That takes a long time, causes the SW process to hang at the stopping stage, and it eventually bombs out without finishing. Once the server is rebooted, the L2 cache is discarded by SW and the unwritten data is lost, leading to our corruption.
So I have some questions:
Why is the cache not flushed to the main datastore periodically, or at least as soon as the disk queue has dropped?
Why is the SW process taking so long to stop?
Why is the data sitting in the L2 cache being dumped on a server reboot? This is a serious flaw. No RAID controller would dump data sitting in its cache waiting to be written after a reboot, so why does StarWind?
Is there any way to manually request SW to write all cached data to disk?
We are going to need to reboot our servers again someday. That is inevitable.
We need a way to sync the cache to disk while the service is running, and to stop the SW service gracefully without suffering more corruption.
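For now, the only workaround we can think of is to stop the service ourselves ahead of a reboot and wait it out, rather than letting the Windows shutdown timeout kill it mid-flush. A rough sketch of the idea (the service name "StarWindService" is an assumption on my part, so verify the real name with sc query first):
[code]
import subprocess
import time

SERVICE = "StarWindService"  # assumption: check the actual service name with `sc query`

# Ask the service to stop, then wait as long as the flush needs instead of
# letting the OS shutdown timeout kill it part-way through destaging.
subprocess.call(["sc", "stop", SERVICE])

deadline = time.time() + 4 * 3600  # allow up to 4 hours
while time.time() < deadline:
    out = subprocess.check_output(["sc", "query", SERVICE]).decode(errors="ignore")
    if "STOPPED" in out:
        print("Service stopped cleanly - safe to reboot.")
        break
    time.sleep(30)
else:
    print("Service still not stopped - do NOT power off yet.")
[/code]
That at least gives us some visibility into whether the flush has actually finished before anyone pulls the power, but it is a workaround, not a fix.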
I have been told elsewhere on this forum that the use of L2 cache in WB mode is not recommended.
I completely disagree: we have been using this caching approach on Nexenta and Solaris with ZFS for years to massively increase performance and have never seen an adverse effect from a reboot.
What can we do to fix this issue on Starwind?