Issue Repairing HA

andrew.kns
Posts: 8
Joined: Wed Sep 11, 2024 8:04 pm

Wed Sep 11, 2024 8:31 pm

Firstly, I am using a 2-node vSAN Free hyperconverged deployment where the vSAN controller runs as a virtual machine and the RAID controller is passed through directly to the VM on each node.

After performing routine maintenance on my master node that involved a reboot, all but one of my LUNs recovered to HA status. LUN 'VMPool01' became split into 'VMPool01-01' and 'VMPool01-02'. Normally the alias on the master node is hidden and only the partner node alias shows in the web dashboard; for this LUN, however, both aliases were now showing. There also appeared to be some sync issues, so I decided to clear the configuration on the master node and resync from the partner node.

After modifying the StarWindX script AddHaPartner.ps1 and running it, I got the error '200 Failed: operation can not be completed..', which prompted me to run the RemoveHaPartner.ps1 script first. After that I ran AddHaPartner.ps1 several more times with no success, now getting a different error: '200 Failed: invalid partner info..'

My next idea was to clear the config from the master node in case this was causing the issue. I SSH'd into the CVM console and removed the header files for the LUN from the storage volume as well as the configuration header files from the '/opt/starwind/starwind-virtual-san/drive_c/starwind/headers/' directory. This did not resolve the issue and I am still plagued with a '200 Failed' code for invalid partner info.

Node 1 (Master):
Version: 1.6.578.7343
Nic 1: 10.10.10.28
Nic 2: 192.168.10.10
Nic 3: 192.168.20.10

Node 2 (Partner):
Version: 1.6.578.7343
Nic 1: 10.10.10.141
Nic 2: 192.168.10.20
Nic 3: 192.168.20.20


Script:

Code: Select all

param($addr="10.10.10.141", $port=3261, $user="root", $password="starwind", $deviceName="HAImage5",
	$addr2="10.10.10.28", $port2=$port, $user2=$user, $password2=$password,
#secondary node
	$imagePath2="VSA Storage\mnt\sdb1\volume0",
	$imageName2="masterVMPool01",
	$createImage2=$true,
	$targetAlias2="VMPool01-01",
	$autoSynch2=$true,
	$poolName2="sdb1",
	$syncSessionCount2=1,
	$aluaOptimized2=$true,
	$syncInterface2="#p1={0}:3260" -f "192.168.20.10",
    $hbInterface2="",
	$bmpType=1,
	$bmpStrategy=0,
	$bmpFolderPath="",
    $selfSyncInterface="#p1={0}:3260" -f "192.168.20.20",
	$selfHbInterface="#p1={0}:3260" -f "192.168.10.20"
	)
	
Import-Module StarWindX

try
{
    Enable-SWXLog
    
    $server = New-SWServer $addr $port $user $password
    $server.Connect()

	$device = Get-Device $server -name $deviceName
	if( !$device )
	{
		Write-Host "Device not found" -foreground red
		return
	}

    $node = new-Object Node
    $node.HostName = $addr2
    $node.HostPort = $port2
    $node.Login = $user2
    $node.Password = $password2
    $node.ImagePath = $imagePath2
    $node.ImageName = $imageName2
    $node.CreateImage = $createImage2
    $node.TargetAlias = $targetAlias2
    $node.SyncInterface = $syncInterface2
    $node.HBInterface = $hbInterface2
	$node.AutoSynch = $autoSynch2
	$node.SyncSessionCount = $syncSessionCount2
	$node.ALUAOptimized = $aluaOptimized2
	$node.PoolName = $poolName2
	$node.BitmapStoreType = $bmpType
	$node.BitmapStrategy = $bmpStrategy
	$node.BitmapFolderPath = $bmpFolderPath

    # Note: $selfBmpFolderPath is not declared in the param block above, so it is $null here.
    Add-HAPartner $device $node $selfSyncInterface $selfHbInterface $selfBmpFolderPath
}
catch
{
	Write-Host $_ -foreground red 
}
finally
{
	$server.Disconnect()
}
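
The script above builds its sync and heartbeat values with the -f format operator, producing strings of the form "#p1=<ip>:3260". A minimal sketch of a helper that builds the same shape and rejects values that are not IP addresses is shown here. New-InterfaceString is a hypothetical name, not part of StarWindX, and which index and which node's address belong in which parameter is not something this sketch decides.

```powershell
# Hypothetical helper (not part of StarWindX): builds an interface string in
# the "#p<index>=<ip>:<port>" shape used by the scripts in this thread.
function New-InterfaceString {
    param(
        [Parameter(Mandatory = $true)][int]$Index,
        [Parameter(Mandatory = $true)][string]$Address,
        [int]$Port = 3260
    )
    # Reject anything .NET cannot parse as an IP address.
    $parsed = $null
    if (-not [System.Net.IPAddress]::TryParse($Address, [ref]$parsed)) {
        throw "Not a valid IP address: $Address"
    }
    return "#p{0}={1}:{2}" -f $Index, $Address, $Port
}

# Example values only: NIC 3 and NIC 2 of Node 1 as listed in the post.
$sync = New-InterfaceString -Index 1 -Address "192.168.20.10"
$hb   = New-InterfaceString -Index 1 -Address "192.168.10.10"
Write-Host "sync=$sync hb=$hb"
```

Building the strings in one place makes it harder to leave one of them empty by accident, as happened with $hbInterface2 in the script above.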
yaroslav (staff)
Staff
Posts: 3599
Joined: Mon Nov 18, 2019 11:11 am

Wed Sep 11, 2024 9:14 pm

The script does not look entirely correct.
Please see the sample scripts here:
viewtopic.php?f=5&t=6852&p=37208&hilit=HINT8#p37208
viewtopic.php?f=5&t=6863&p=37292&hilit= ... ps1#p37292
Please also make sure to clean up the affected node by removing the following:
1. The corresponding img and header from the /mnt/... directory.
2. The corresponding directory and headers in /opt/starwind/starwind-virtual-san/drive_c/starwind/headers
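
Before deleting anything, it can help to list what would be affected. The following is a minimal sketch of a report-only search; Find-LeftoverItem is a hypothetical helper, the /mnt/sdb1/volume0 path is an assumption derived from the imagePath in the poster's script, and running it on the CVM assumes PowerShell is installed there, which this thread does not confirm.

```powershell
# Hypothetical helper: lists files and directories under the given roots whose
# names contain the image name. It only reports; it deletes nothing.
function Find-LeftoverItem {
    param(
        [Parameter(Mandatory = $true)][string[]]$Root,
        [Parameter(Mandatory = $true)][string]$ImageName
    )
    foreach ($dir in $Root) {
        if (-not (Test-Path -LiteralPath $dir)) {
            Write-Warning "Path not found: $dir"
            continue
        }
        Get-ChildItem -LiteralPath $dir -Recurse -Force |
            Where-Object { $_.Name -like "*$ImageName*" } |
            Select-Object -ExpandProperty FullName
    }
}

# Assumed paths and image name taken from this thread; adjust to your layout.
Find-LeftoverItem -Root @(
    "/mnt/sdb1/volume0",
    "/opt/starwind/starwind-virtual-san/drive_c/starwind/headers"
) -ImageName "masterVMPool01"
```

Reviewing the printed list first makes it less likely that a header belonging to a healthy LUN is removed by mistake.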

Also, make sure the StarWindX module (i.e., Management Console) is up to date.
andrew.kns
Posts: 8
Joined: Wed Sep 11, 2024 8:04 pm

Wed Sep 11, 2024 9:22 pm

Yaroslav,

Thank you for the reply. Unless I am mistaken, the scripts you linked create a new HA LUN. Can they also be used to repair an existing one?

Please let me know.
Andrew

For reference, this is the script used to create the LUN before it broke:

Code: Select all

param($addr="10.10.10.28", $port=3261, $user="root", $password="starwind",
	$addr2="10.10.10.141", $port2=$port, $user2=$user, $password2=$password,
#common
	$initMethod="Clear",
	$size=10240000,
	$sectorSize=512,
	$failover=0,
	$bmpType=1,
	$bmpStrategy=0,
#primary node
	$imagePath="VSA Storage\mnt\sdb1\volume0",
	$imageName="masterVMPool01",
	$createImage=$true,
	$storageName="",
	$targetAlias="VMPool01-01",
	$poolName="sdb1",
	$syncSessionCount=1,
	$aluaOptimized=$true,
	$cacheMode="none",
	$cacheSize=0,
	$syncInterface="#p2={0}:3260" -f "192.168.20.20",
	$hbInterface="#p2={0}:3260" -f "192.168.10.20",
	$createTarget=$true,
	$bmpFolderPath="",
#secondary node
	$imagePath2="VSA Storage\mnt\sdb1\volume0",
	$imageName2="partnerVMPool01",
	$createImage2=$true,
	$storageName2="",
	$targetAlias2="VMPool01-02",
	$poolName2="sdb1",
	$syncSessionCount2=1,
	$aluaOptimized2=$false,
	$cacheMode2=$cacheMode,
	$cacheSize2=$cacheSize,
	$syncInterface2="#p1={0}:3260" -f "192.168.20.10",
	$hbInterface2="#p1={0}:3260" -f "192.168.10.10",
	$createTarget2=$true,
	$bmpFolderPath2=""
	)
	
Import-Module StarWindX

try
{
	Enable-SWXLog

	$server = New-SWServer -host $addr -port $port -user $user -password $password

	$server.Connect()

	$firstNode = new-Object Node

	$firstNode.HostName = $addr
	$firstNode.HostPort = $port
	$firstNode.Login = $user
	$firstNode.Password = $password
	$firstNode.ImagePath = $imagePath
	$firstNode.ImageName = $imageName
	$firstNode.Size = $size
	$firstNode.CreateImage = $createImage
	$firstNode.StorageName = $storageName
	$firstNode.TargetAlias = $targetAlias
	$firstNode.SyncInterface = $syncInterface
	$firstNode.HBInterface = $hbInterface
	$firstNode.PoolName = $poolName
	$firstNode.SyncSessionCount = $syncSessionCount
	$firstNode.ALUAOptimized = $aluaOptimized
	$firstNode.CacheMode = $cacheMode
	$firstNode.CacheSize = $cacheSize
	$firstNode.FailoverStrategy = $failover
	$firstNode.CreateTarget = $createTarget
	$firstNode.BitmapStoreType = $bmpType
	$firstNode.BitmapStrategy = $bmpStrategy
	$firstNode.BitmapFolderPath = $bmpFolderPath
    
	#
	# device sector size. Possible values: 512 or 4096(May be incompatible with some clients!) bytes. 
	#
	$firstNode.SectorSize = $sectorSize
    
	$secondNode = new-Object Node

	$secondNode.HostName = $addr2
	$secondNode.HostPort = $port2
	$secondNode.Login = $user2
	$secondNode.Password = $password2
	$secondNode.ImagePath = $imagePath2
	$secondNode.ImageName = $imageName2
	$secondNode.CreateImage = $createImage2
	$secondNode.StorageName = $storageName2
	$secondNode.TargetAlias = $targetAlias2
	$secondNode.SyncInterface = $syncInterface2
	$secondNode.HBInterface = $hbInterface2
	$secondNode.SyncSessionCount = $syncSessionCount2
	$secondNode.ALUAOptimized = $aluaOptimized2
	$secondNode.CacheMode = $cacheMode2
	$secondNode.CacheSize = $cacheSize2
	$secondNode.FailoverStrategy = $failover
	$secondNode.CreateTarget = $createTarget2
	$secondNode.BitmapFolderPath = $bmpFolderPath2
        
	$device = Add-HADevice -server $server -firstNode $firstNode -secondNode $secondNode -initMethod $initMethod
    
	while ($device.SyncStatus -ne [SwHaSyncStatus]::SW_HA_SYNC_STATUS_SYNC)
	{
		$syncPercent = $device.GetPropertyValue("ha_synch_percent")
		Write-Host "Synchronizing: $($syncPercent)%" -foreground yellow

		Start-Sleep -m 2000

		$device.Refresh()
	}
}
catch
{
	Write-Host $_ -foreground red 
}
finally
{
	$server.Disconnect()
}
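
The synchronization loop in the script above has no exit condition other than reaching the synchronized state, so it runs indefinitely if the device never gets there. A minimal sketch of a generic polling helper with a timeout follows. Wait-Condition is a hypothetical name, and the commented usage assumes $device behaves as it does in the script above.

```powershell
# Hypothetical helper: polls a condition until it returns $true or the
# timeout passes. Returns $true on success and $false on timeout.
function Wait-Condition {
    param(
        [Parameter(Mandatory = $true)][scriptblock]$Condition,
        [int]$TimeoutSeconds = 3600,
        [int]$IntervalMilliseconds = 2000
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while (-not (& $Condition)) {
        if ((Get-Date) -ge $deadline) { return $false }
        Start-Sleep -Milliseconds $IntervalMilliseconds
    }
    return $true
}

# Assumed usage with the $device object from the script above:
# $ok = Wait-Condition -TimeoutSeconds 7200 -Condition {
#     $device.Refresh()
#     $device.SyncStatus -eq [SwHaSyncStatus]::SW_HA_SYNC_STATUS_SYNC
# }
# if (-not $ok) { Write-Host "Synchronization did not finish in time" -foreground red }
```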
yaroslav (staff)
Staff
Posts: 3599
Joined: Mon Nov 18, 2019 11:11 am

Wed Sep 11, 2024 9:39 pm

There's something fundamentally wrong in the way you are referring to those IP addresses. That's why I shared similar scripts that may give some clues on how you can introduce them into the script.
andrew.kns
Posts: 8
Joined: Wed Sep 11, 2024 8:04 pm

Wed Sep 11, 2024 11:54 pm

UPDATE: RESOLVED

After fiddling with the script, I copied the node object used to create the HA device from my initial creation script. Using this, I was able to repair the HA LUN with the following script. It is syncing now.

NOTE: I found that I had to reboot the vSAN controller after every failed attempt. Not doing so led to additional failures that I could not explain.

Code: Select all

param($addr="10.10.10.28", $port=3261, $user="root", $password="starwind", $deviceName="HAImage5",
	$addr2="10.10.10.141", $port2=$port, $user2=$user, $password2=$password,
#common
	$initMethod="Clear",
	$size=10240000,
	$sectorSize=512,
	$failover=0,
	$bmpType=1,
	$bmpStrategy=0,
#primary node
	$imagePath="VSA Storage\mnt\sdb1\volume0",
	$imageName="masterVMPool01",
	$createImage=$true,
	$storageName="",
	$targetAlias="VMPool01-01",
	$poolName="sdb1",
	$syncSessionCount=1,
	$aluaOptimized=$true,
	$cacheMode="none",
	$cacheSize=0,
	$syncInterface="#p2={0}:3260" -f "192.168.20.20",
	$hbInterface="#p2={0}:3260" -f "192.168.10.20",
	$createTarget=$true,
	$bmpFolderPath="",

    $selfSyncInterface="#p1=192.168.20.10:3260",
	$selfHbInterface="#p1=192.168.10.10:3260"
	)
	
Import-Module StarWindX

try
{
	Enable-SWXLog

	$server = New-SWServer -host $addr2 -port $port -user $user -password $password

	$server.Connect()

    $device = Get-Device $server -name $deviceName
	if( !$device )
	{
		Write-Host "Device not found" -foreground red
		return
	}


	$firstNode = new-Object Node

	$firstNode.HostName = $addr
	$firstNode.HostPort = $port
	$firstNode.Login = $user
	$firstNode.Password = $password
	$firstNode.ImagePath = $imagePath
	$firstNode.ImageName = $imageName
	$firstNode.Size = $size
	$firstNode.CreateImage = $createImage
	$firstNode.StorageName = $storageName
	$firstNode.TargetAlias = $targetAlias
	$firstNode.SyncInterface = $syncInterface
	$firstNode.HBInterface = $hbInterface
	$firstNode.PoolName = $poolName
	$firstNode.SyncSessionCount = $syncSessionCount
	$firstNode.ALUAOptimized = $aluaOptimized
	$firstNode.CacheMode = $cacheMode
	$firstNode.CacheSize = $cacheSize
	$firstNode.FailoverStrategy = $failover
	$firstNode.CreateTarget = $createTarget
	$firstNode.BitmapStoreType = $bmpType
	$firstNode.BitmapStrategy = $bmpStrategy
	$firstNode.BitmapFolderPath = $bmpFolderPath
    
	#
	# device sector size. Possible values: 512 or 4096(May be incompatible with some clients!) bytes. 
	#
	$firstNode.SectorSize = $sectorSize

        
	Add-HAPartner $device $firstNode $selfSyncInterface $selfHbInterface
    
}
catch
{
	Write-Host $_ -foreground red 
}
finally
{
	$server.Disconnect()
}
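
Comparing this script with the first one in the thread, the interface parameters carry the opposite node's addresses: here the node object's SyncInterface holds a Node 2 address and $selfSyncInterface holds a Node 1 address, whereas the failing script had them the other way round. This is an observation from the two scripts, not a documented StarWindX rule. A minimal sketch of a check that makes the difference visible is shown here; both function names are hypothetical.

```powershell
# Hypothetical helper: splits "#p2=192.168.20.20:3260" into its parts.
function ConvertFrom-InterfaceString {
    param([Parameter(Mandatory = $true)][string]$Value)
    $m = [regex]::Match($Value, '^#p(\d+)=([^:]+):(\d+)$')
    if (-not $m.Success) { throw "Unexpected interface string: '$Value'" }
    [pscustomobject]@{
        Index   = [int]$m.Groups[1].Value
        Address = $m.Groups[2].Value
        Port    = [int]$m.Groups[3].Value
    }
}

# Hypothetical helper: $true when the address in the string is in $Expected.
function Test-InterfaceOwner {
    param(
        [Parameter(Mandatory = $true)][string]$Value,
        [Parameter(Mandatory = $true)][string[]]$Expected
    )
    $Expected -contains (ConvertFrom-InterfaceString -Value $Value).Address
}

# NIC addresses as listed in the first post.
$node1 = "10.10.10.28", "192.168.10.10", "192.168.20.10"
$node2 = "10.10.10.141", "192.168.10.20", "192.168.20.20"

# Working script: the node object's SyncInterface is a Node 2 address.
Test-InterfaceOwner -Value "#p2=192.168.20.20:3260" -Expected $node2   # True
# Working script: $selfSyncInterface is a Node 1 address.
Test-InterfaceOwner -Value "#p1=192.168.20.10:3260" -Expected $node1   # True
# Failing script: the node object's SyncInterface was a Node 1 address.
Test-InterfaceOwner -Value "#p1=192.168.20.10:3260" -Expected $node2   # False
```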
yaroslav (staff)
Staff
Posts: 3599
Joined: Mon Nov 18, 2019 11:11 am

Thu Sep 12, 2024 7:22 am

Thanks for your update and sharing your script.
andrew.kns
Posts: 8
Joined: Wed Sep 11, 2024 8:04 pm

Thu Sep 12, 2024 11:51 am

Following some more reboots to complete the maintenance, the LUN has broken again in the same way that it did originally.
[Attachment: Screenshot 2024-09-12 074150.png]
Is this a bug?
yaroslav (staff)
Staff
Posts: 3599
Joined: Mon Nov 18, 2019 11:11 am

Thu Sep 12, 2024 12:13 pm

Sorry to read that.
Could you please let me know what exactly was done?
You can reach out to support@starwind.com, mentioning this thread and using 1213496 as your reference number.
andrew.kns
Posts: 8
Joined: Wed Sep 11, 2024 8:04 pm

Thu Sep 12, 2024 12:39 pm

Prior to the additional restarts, I validated that all my LUNs were 'Highly Available'. I only restarted Node 1, the same node that caused the original issue. My troubleshooting work involved multiple restarts; once I had completed the maintenance, I checked the StarWind SAN dashboard web app and found what is shown in the screenshot in my previous post.

I will reach out to support now.
andrew.kns
Posts: 8
Joined: Wed Sep 11, 2024 8:04 pm

Sat Sep 14, 2024 12:23 am

Public Update:

We found some discrepancies in the StarWind configuration file. There appeared to be multiple image file objects for the same file on disk, the one that was experiencing issues. After clearing all the configuration for the affected LUN, the following was completed:
- Removed partner target from node 2 that pointed to node 1
- Deleted header files from the CVM configuration files
- Deleted the LUN files from the disk array

Following a reboot of the node 1 CVM, I used the Add-HAPartner script I posted above to rebuild the HA LUN. This appears to have resolved my issue completely.

The root cause appears to be that earlier failed attempts to repair the LUN left bad configuration behind in StarWind.cfg.
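
A quick way to spot this kind of leftover is to count how many lines of the configuration file mention the image name. The following is a minimal sketch that does a plain text search and makes no assumption about the file's internal layout, which is not shown in this thread. Get-ConfigReference is a hypothetical helper and the file path is a placeholder for a copy of StarWind.cfg.

```powershell
# Hypothetical helper: returns "<line number>: <line>" for every line of a
# text file that contains the given name (literal match, not a regex).
function Get-ConfigReference {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [Parameter(Mandatory = $true)][string]$Name
    )
    Select-String -LiteralPath $Path -Pattern $Name -SimpleMatch |
        ForEach-Object { "{0}: {1}" -f $_.LineNumber, $_.Line.Trim() }
}

# Placeholder path: work on a copy of the file, never the live one.
$copy = "./StarWind.cfg.copy"
if (Test-Path -LiteralPath $copy) {
    $hits = @(Get-ConfigReference -Path $copy -Name "masterVMPool01")
    Write-Host "$($hits.Count) line(s) mention the image:"
    $hits
}
```

More hits than expected for a single image would be a reason to look closer or to involve support before attempting another repair.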