HA iSCSI as Fault Tolerant Service

Software-based VM-centric and flash-friendly VM storage + free version

Moderators: anton (staff), art (staff), Max (staff), Anatoly (staff)

danisoto
Posts: 27
Joined: Thu Jan 26, 2012 12:21 pm

Mon Dec 03, 2012 11:06 am

Hi,

After testing the latest release of the iSCSI target with HA support (great! 128GB for free is good for our tests), I found that the current implementation is not sufficient for Fault Tolerant services.

We ran several tests, and this one is very illustrative (Build 20121115):

* Two nodes (A and B) with an HA iSCSI target fully synchronized. A: Master; B: Slave.
* "Controlled shutdown" of node A.
* Node B keeps running and the iSCSI target still works.
* "Controlled shutdown" of node B.
* Obviously, the iSCSI target disappears.
* After restarting node B, the iSCSI target does not start and the volume remains out of sync forever.

Please, this behaviour is completely unacceptable! With CONTROLLED shutdowns there can be no data discrepancy. The system must be able to auto-enable the target volume when no data is lost. If both nodes are powered off (with a controlled shutdown), the system NEVER auto-synchronizes!

I suggest re-reading my past thread about an automatic timeout for auto-sync in case of power failure:
http://www.starwindsoftware.com/forums/ ... t2670.html

I prefer a system that does not need ANY kind of user intervention in case of errors. We can tolerate some errors, as we use a fault tolerant system on top of the shared volume. But the system needs to restart the target in ANY situation.

Please, can you improve this?

Thank you!
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Dec 03, 2012 11:36 am

This has been discussed with a lot of customers, and we finally came to a set of decisions / axioms:

1) Both nodes down @ the same time **IS** an emergency situation. Planned downtime should not be planned this way.

2) After the WHOLE cluster is down, operator intervention is required to pick the synchronization direction, as we cannot make this decision ourselves.

Making a long story short: for planned downtime DO NOT take down both nodes (or all three, if you use a 3-way mirror) @ the same time, as it makes ZERO sense for anybody.
danisoto wrote:Hi,

After testing the latest release of the iSCSI target with HA support (great! 128GB for free is good for our tests), I found that the current implementation is not sufficient for Fault Tolerant services.

We ran several tests, and this one is very illustrative (Build 20121115):

* Two nodes (A and B) with an HA iSCSI target fully synchronized. A: Master; B: Slave.
* "Controlled shutdown" of node A.
* Node B keeps running and the iSCSI target still works.
* "Controlled shutdown" of node B.
* Obviously, the iSCSI target disappears.
* After restarting node B, the iSCSI target does not start and the volume remains out of sync forever.

Please, this behaviour is completely unacceptable! With CONTROLLED shutdowns there can be no data discrepancy. The system must be able to auto-enable the target volume when no data is lost. If both nodes are powered off (with a controlled shutdown), the system NEVER auto-synchronizes!

I suggest re-reading my past thread about an automatic timeout for auto-sync in case of power failure:
http://www.starwindsoftware.com/forums/ ... t2670.html

I prefer a system that does not need ANY kind of user intervention in case of errors. We can tolerate some errors, as we use a fault tolerant system on top of the shared volume. But the system needs to restart the target in ANY situation.

Please, can you improve this?

Thank you!
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

danisoto
Posts: 27
Joined: Thu Jan 26, 2012 12:21 pm

Mon Dec 03, 2012 12:25 pm

Hi Anton,

Shocking news for us! In this case we need to look for other software.
Why not make these functions optional?

Regards!
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Mon Dec 03, 2012 12:55 pm

I'd like to clarify one technical point: your datastores WILL come up automatically after a failure of both nodes. All you need to do is turn them on; the HA devices will "identify" which mirror contains the actual data, synchronization will start automatically, and the datastores will come online at that moment.


I hope we are on the same page on this.

Thank you
Best regards,
Anatoly Vilchinsky
Global Engineering and Support Manager
www.starwind.com
av@starwind.com
lohelle
Posts: 144
Joined: Sun Aug 28, 2011 2:04 pm

Mon Dec 03, 2012 2:29 pm

danisoto:
Say you shut down node A and run on node B for a few minutes, and then you shut down node B as well.

Scenario 1: You turn on only node A. You really do not want node A to start serving clients automatically, as it does not hold the most recent data.

Scenario 2: You turn on only node B. This "should" be the most recent data, but the software does not know whether you forced node A to be "synchronized" and it then crashed again (or was shut down).

Scenario 3: Both nodes start, and once they can exchange "info", the sync process from the correct partner starts.
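The three scenarios above boil down to a simple decision rule: a lone node cannot safely declare itself the sync source, while two live nodes can negotiate. A minimal Python sketch of that logic follows; the `NodeState` type and the `last_write_seq` counter are illustrative assumptions, not StarWind internals.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeState:
    name: str
    online: bool
    last_write_seq: Optional[int]  # monotonic write counter saved at shutdown

def decide_resync(a: NodeState, b: NodeState) -> str:
    """Pick a sync source following the three scenarios above."""
    if a.online and b.online:
        # Scenario 3: both up -> exchange info and sync from the newer copy.
        source = a if (a.last_write_seq or 0) >= (b.last_write_seq or 0) else b
        return f"sync from {source.name}"
    # Scenarios 1 and 2: a lone node cannot know whether its partner
    # holds newer data, so it must wait for an operator decision.
    return "wait for operator"
```

Note how only Scenario 3 yields an automatic decision; with a single survivor the safe answer is always to wait.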
danisoto
Posts: 27
Joined: Thu Jan 26, 2012 12:21 pm

Mon Dec 03, 2012 2:57 pm

Hi Anatoly,

I'd like to clarify: our tests with the latest version do not work as you described in some (critical) cases.


Test environment:

* Windows 2012
* StarWind iSCSI SAN v6.0.0 (Build 20121115, [SwSAN], Win64) Free license
* Device1: IBV1 40GB on node A (name: "40gb-primary")
* Device2: IBV1 40GB on node B (name: "40gb-secondary")
* Target1: "HA-40GB-Primary" on node A
* Target2: "HA-40GB-Secondary" on node B
* Two gigabit NICS on each node
* "Auto synchronization after failure" enabled
* Cache: Write-back caching 512MB

Action shutdown = controlled power off
Action crash = reset, plug remove, hard power off


Our results:

* Test 1: Both nodes up, restart node B --> CORRECT, fast resync and target always on!
* Test 2: Both nodes up, restart node A --> CORRECT, fast resync and target always on!

* Test 3: Both nodes up, crash node B --> CORRECT, full resync and target always on!
* Test 4: Both nodes up, crash node A --> CORRECT, full resync and target always on!

* Test 5: Both nodes up, shutdown node B, restart node A --> FAIL, target offline!
* Test 6: ...after Test 5, boot node B --> CORRECT, full resync and target online!

* Test 7: Both nodes up, shutdown node A, shutdown node B, boot node A --> FAIL, target offline!
* Test 8: ...after Test 7, boot node B --> CORRECT, full resync and target online!

* Test 9: Both nodes up, crash node A, restart node B --> FAIL, target offline!
* Test 10: ...after Test 9, boot node A --> CORRECT, full resync and target online!

* Test 11: Both nodes up, shutdown node A, crash node B, boot node B --> FAIL, target offline!
* Test 12: ...after Test 11, boot node A --> CORRECT, full resync and target online! (why? Test 14 is similar but fails!)

* Test 13: Both nodes up, shutdown node B, crash node A, boot node B --> FAIL, target offline!
* Test 14: ...after Test 13, boot node A --> FAIL, both nodes "Not synchronized". Manual "Mark as Synchronized" needed!

* Test 15: Both nodes up, crash node B, crash node A, boot node A --> FAIL, target offline!
* Test 16: ...after Test 15, boot node B --> FAIL, both nodes "Not synchronized". Manual "Mark as Synchronized" needed!

* Test 17: Both nodes up, crash both, boot nodes A & B at the same time --> FAIL, both nodes "Not synchronized". Manual "Mark as Synchronized" needed!
We appreciate your support and effort! And we can accept the FAIL behaviour when only one node is running (Tests 5, 7, 9, 11, 13 & 15).
Nevertheless, if you could add simple (and optional) support for FT, we would appreciate it! (for cases 14, 16 & 17)

Our suggestion: add a timeout that automatically marks the Primary as "synchronized" after booting when both nodes are marked "not synchronized". Obviously, this would be an extremely advanced and risky optional operation. We are not asking you to force this behaviour, only to add it as an option.

Without this behaviour it's impossible to run a Fault Tolerant service on StarWind iSCSI SAN.

Regards!
Last edited by danisoto on Mon Dec 03, 2012 3:46 pm, edited 5 times in total.
danisoto
Posts: 27
Joined: Thu Jan 26, 2012 12:21 pm

Mon Dec 03, 2012 3:10 pm

lohelle wrote:danisoto:
Say you shut down node A and run on node B for a few minutes, and then you shut down node B as well.

Scenario 1: You turn on only node A. You really do not want node A to start serving clients automatically, as it does not hold the most recent data.

Scenario 2: You turn on only node B. This "should" be the most recent data, but the software does not know whether you forced node A to be "synchronized" and it then crashed again (or was shut down).

Scenario 3: Both nodes start, and once they can exchange "info", the sync process from the correct partner starts.
Hi Lohelle,

Our concern is about both nodes running AND both marked as "not synchronized".
In this case we need deterministic behaviour: "Always mark the PRIMARY as SYNCED".

We can't accept user intervention just to carry out a STATIC algorithm like "both nodes up, both out of sync, then set primary as synced".

Offering this behaviour as an option:

1) It's very easy to implement at this point.
2) It doesn't disturb any client.
3) It's an easy way to enable a simple FT service.

I know there are no guarantees of consistency, but our filesystem on top of the iSCSI volume can tolerate errors!

Please, reconsider your thinking.
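The static tie-break being requested can be written down in a few lines. This is only a sketch of the proposal, not an existing StarWind feature; the `force_primary_wins` flag is a hypothetical name for the opt-in setting.

```python
from typing import Optional

def pick_sync_source(primary_synced: bool, secondary_synced: bool,
                     force_primary_wins: bool = False) -> Optional[str]:
    """Return which node should act as sync source, or None to wait for an operator."""
    if primary_synced and not secondary_synced:
        return "primary"
    if secondary_synced and not primary_synced:
        return "secondary"
    if not primary_synced and not secondary_synced:
        # Both nodes came back "not synchronized". With the optional flag on,
        # the primary is declared the source even though the secondary may
        # hold newer writes -- i.e. data loss is explicitly accepted.
        return "primary" if force_primary_wins else None
    return "either"  # both already in sync, nothing to do
```

With the flag off, the behaviour is unchanged: the cluster waits for the operator, exactly as the staff describe.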
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Dec 03, 2012 5:38 pm

Instead of looking at other software, you need to re-think what you're doing. If you let all nodes go down, there is no way to know whose data is the most recent. You cannot trust the time recorded in the log: if the cluster went down, time may have gone out of sync as well. You need to power up the cluster, manually mount the volumes in read-only mode and check what's there to figure out whose data is the most recent. A human operator can do that; a machine cannot.

It's not a big deal to assign, say, a "master token" to some node of the cluster, or to use timestamps to figure out who wrote last. But it's not reliable.
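The clock-drift objection is easy to demonstrate: if the "newest replica" decision is based on wall-clock timestamps, a node whose clock ran slow loses even when it actually wrote last. Illustrative sketch only, with made-up timestamp values:

```python
def newest_by_timestamp(ts_a: float, ts_b: float) -> str:
    """Naive rule: treat the replica with the later recorded timestamp as newest."""
    return "A" if ts_a >= ts_b else "B"

# Node B actually accepted the last writes, but its clock ran slow, so the
# timestamp it recorded at shutdown is *older* than node A's.
ts_a = 1000.0  # A's last-write time, per A's clock
ts_b = 995.0   # B's last-write time, per B's slow clock, despite writing later
chosen = newest_by_timestamp(ts_a, ts_b)  # picks "A" -> B's newer writes are lost
```

This is exactly why timestamp-based auto-selection is "not reliable": the rule is deterministic, but it can deterministically pick the wrong node.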
danisoto wrote:Hi Anton,

Shocking news for us! In this case we need to look for other software.
Why not make these functions optional?

Regards!
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Mon Dec 03, 2012 5:51 pm

What you describe here is the easiest way to wipe out, say, recent SQL transactions accepted by node B.
danisoto wrote:
lohelle wrote:danisoto:
Say you shut down node A and run on node B for a few minutes, and then you shut down node B as well.

Scenario 1: You turn on only node A. You really do not want node A to start serving clients automatically, as it does not hold the most recent data.

Scenario 2: You turn on only node B. This "should" be the most recent data, but the software does not know whether you forced node A to be "synchronized" and it then crashed again (or was shut down).

Scenario 3: Both nodes start, and once they can exchange "info", the sync process from the correct partner starts.
Hi Lohelle,

Our concern is about both nodes running AND both marked as "not synchronized".
In this case we need deterministic behaviour: "Always mark the PRIMARY as SYNCED".

We can't accept user intervention just to carry out a STATIC algorithm like "both nodes up, both out of sync, then set primary as synced".

Offering this behaviour as an option:

1) It's very easy to implement at this point.
2) It doesn't disturb any client.
3) It's an easy way to enable a simple FT service.

I know there are no guarantees of consistency, but our filesystem on top of the iSCSI volume can tolerate errors!

Please, reconsider your thinking.
danisoto
Posts: 27
Joined: Thu Jan 26, 2012 12:21 pm

Mon Dec 03, 2012 6:22 pm

danisoto wrote:
lohelle wrote:danisoto:
Say you shut down node A and run on node B for a few minutes, and then you shut down node B as well.

Scenario 3: Both nodes start, and once they can exchange "info", the sync process from the correct partner starts.
Hi Lohelle,

Our concern is about both nodes running AND both marked as "not synchronized".
In this case we need deterministic behaviour: "Always mark the PRIMARY as SYNCED".
After SEVERAL tests, the results are inconsistent:

1) Both nodes in sync.
2) Controlled shutdown of one node.
3) After some time, shutdown/crash/reboot of the active node.
4) When both nodes boot, sometimes they synchronize and sometimes they don't! Different order & same order --> different results!

Can you explain the currently implemented algorithm?
Thank you!
danisoto
Posts: 27
Joined: Thu Jan 26, 2012 12:21 pm

Mon Dec 03, 2012 6:51 pm

anton (staff) wrote:Instead of looking at other software, you need to re-think what you're doing. If you let all nodes go down, there is no way to know whose data is the most recent. You cannot trust the time recorded in the log: if the cluster went down, time may have gone out of sync as well. You need to power up the cluster, manually mount the volumes in read-only mode and check what's there to figure out whose data is the most recent. A human operator can do that; a machine cannot.
Hi Anton,

First, thank you for your time!

Let me clarify two points:

1) In our tests, automatic resync DOES NOT WORK in all cases, even if you CORRECTLY SHUT DOWN one node. If one node is down, and AFTER that the other goes down... why can't you determine the correct state after boot? When both nodes are up (I'm not talking about when only one node is up), the current build DOES NOT RESYNC automatically in all cases. Currently, your software resyncs if the second node shuts down correctly. But when the second (the only active) node crashes, sometimes both nodes resync after booting, and sometimes they don't.

2) For an FT system, you need to provide deterministic behaviour in every situation (that's the theory). So, if you designate a PRIMARY and a SECONDARY, then when BOTH nodes are running you know who the master is: it's always the primary. When the entire cluster crashes, after boot you always perform the same task again: "mark primary as synced". No other option is feasible. On the other hand, when only one node was running before the crash, you are in the first case (explained above).

None of this is about the correctness of the data on the volume. It's only about the consistency of data between nodes. You can guarantee this consistency, and with an error-tolerant filesystem on top we can operate without errors, but only if the volume is always enabled after boot.

This is like using write caches without a UPS: you *CAN* lose data. I know I can lose data if I enable the optional "set primary as synced when both nodes say 'not synchronized' and the software can't automatically determine who has the most recent data". But that depends on what data you put on it, don't you think?

Obviously, this option isn't reliable for databases and similar systems, but it is acceptable for filesystems with self-correcting features.

Please, try to understand our requirements.
Thank you!
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Tue Dec 04, 2012 9:18 pm

1) We'll check why it behaves this way. Either "yes" or "no".

2) It's not a big deal to sync from whoever holds the "master" token. If you're fine with data loss, of course.
danisoto
Posts: 27
Joined: Thu Jan 26, 2012 12:21 pm

Wed Dec 05, 2012 8:32 am

Hi Anton,

Thank you for, at least, checking the current behaviour and for considering a simple solution for the future.

I hope to receive good news in the next (or following) version.

Regards!
Anatoly (staff)
Staff
Posts: 1675
Joined: Tue Mar 01, 2011 8:28 am

Fri Dec 07, 2012 1:07 pm

I'm glad that we are on the same page now.

Well, let's wait for the good news :)
danisoto
Posts: 27
Joined: Thu Jan 26, 2012 12:21 pm

Mon Dec 10, 2012 8:23 am

Anatoly (staff) wrote:I'm glad that we are on the same page now.

Well, let's wait for the good news :)
Great! Please send me info when a beta is available for testing.

Regards!