Large HA SAN Build

rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Tue Nov 30, 2010 7:45 pm

All the parts for the new HA SAN for our vSphere and XenDesktop environments are currently coming in. The planned build is:

2 identical machines, each with:

2008 R2 64bit Server
Chenbro 9U chassis with 50 3.5" drive bays
2 X 2U LSI 24-drive SAS2 JBODs
SuperMicro X8DTH-6F motherboard
2 X Xeon E5620 2.4 GHz Quad Core
6 X 4GB DDR3 RAM for 24 GB
2 X HP NC522SFP Dual Port 10GbE NIC with 2 ports for iSCSI and 2 ports for sync
2 X LSI 9280-8E with batteries
2 X Areca 1880IX-24 1GB Cache with batteries
2 X 500 GB SAS2 7.2K Constellation hard drives in RAID 1 for the OS on the integrated LSI SAS controller
24 X 2 TB SAS2 7.2K Constellation in RAID 60 on the Areca 1880 (can add another 24 drives for future expansion)
40 X 146 GB 15K SAS2 Savvio drives, 20 in each JBOD, in RAID 10s
8 X 32 GB Intel X25-E SSD drives with 4 in each JBOD for LSI CacheCade

The 2 vSphere Enterprise Plus servers for the virtual SQL servers will use the drives in the JBODs and will be set up as:

HP DL380G7 Dual Xeon X5680 3.33 GHz Hex Core
192 GB DDR3 RAM
2 X HP NC522SFP Dual Port 10GbE NIC
2 X 146 GB 10K SAS in RAID 1 for OS

The 3 XenDesktop hosts will use the drives in the Chenbros and will be set up as:

HP DL360G7 Dual Xeon X5660 2.8 GHz Hex Core
72 GB DDR3 RAM
2 X HP NC522SFP Dual Port 10GbE NIC
2 X 146 GB 10K SAS in RAID 1 for OS

The switches are 2 HP ProCurve 6600-24XG for the 10GbE network and 2 HP ProCurve 6600-48G-4XG for the 1GbE network.

Anyone see any problems with this build?
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Wed Dec 01, 2010 3:48 pm

IMHO RAID 60 is not the best idea. Why not RAID 10 instead? The remaining budget can easily be spent on software backup solutions.
Max Kolomyeytsev
StarWind Software
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Wed Dec 01, 2010 6:50 pm

Obviously I'm going with RAID 10 for the DB stuff, but I was thinking RAID 60 for the user file storage, since it will see about 90% read activity and a 24-drive RAID 60 gives us about 67% more storage than a 24-drive RAID 10 for the same cost and number of drives. Since there is 1 GB of cache (upgradable to 4 GB) on the 1880s and I'm putting 24 GB of RAM in each box, I was thinking I could set the battery-backed controller cache to write-back and dedicate some write-back cache from RAM to the volumes going on the RAID 60 to offset the increased write I/O cost. If you don't think this will work well, let me know, because I haven't built them yet and can easily set the storage up as RAID 10 instead.
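
For reference, the capacity math I'm using (just a quick sketch in Python; it assumes the 24-drive RAID 60 is built as two 12-drive RAID 6 spans):

DRIVES = 24
SIZE_TB = 2.0

raid10_usable = (DRIVES / 2) * SIZE_TB            # half the drives hold mirror copies -> 24 TB
raid60_usable = 2 * (DRIVES / 2 - 2) * SIZE_TB    # each RAID 6 span loses 2 drives to parity -> 40 TB

print(round(raid60_usable / raid10_usable - 1, 2))  # 0.67, i.e. about 67% more usable space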
Max (staff)
Staff
Posts: 533
Joined: Tue Apr 20, 2010 9:03 am

Wed Dec 01, 2010 6:59 pm

Ha, you didn't mention anything about the load. If the RAID 60 will be loaded mostly with reads, then it makes sense.
I'm not ready to calculate any exact numbers off the top of my head, but generally the idea sounds very good.
Max Kolomyeytsev
StarWind Software
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Wed Dec 01, 2010 7:26 pm

Sorry I didn't mention the load yesterday. I got pulled away while typing up the description to talk to the people installing the air conditioners for our new data center where this stuff is going to live.

The databases are much more read than write, but we end up doing very large writes at certain points of the day. Some of our tables are tens of millions of rows, and some of the groups of tables that get joined on a regular basis run to hundreds of millions of rows. We also have some stored procedures that hit tempdb pretty hard. I'm planning on creating multiple RAID 10s on the LSI JBODs to handle these different types of traffic and using the SSDs for CacheCade.

What I would like to know for this is:
Can I safely dedicate writeback cache through StarWind on these volumes with HA, or do I need to rely on the horsepower of the hardware only?

Edit:
Also, I wasn't thinking about the write cache for the XenDesktops. You were right to point out that RAID 60 may be a problem; I'll have to do RAID 10 for the 2 TB drives as well. I'm used to regular VMs in ESX and wasn't thinking about the large increase in writes I'm going to see with XenDesktop.
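
For what it's worth, the rough write-IOPS math behind that change (the ~75 IOPS per 7.2K SAS spindle is just my assumption; the usual small-write penalties are 2 for RAID 10 and 6 for RAID 6):

DRIVES = 24
PER_DRIVE_IOPS = 75                      # assumed figure for a 7.2K SAS spindle
RAID10_PENALTY, RAID6_PENALTY = 2, 6     # back-end I/Os per host write for each layout

raid10_write_iops = DRIVES * PER_DRIVE_IOPS / RAID10_PENALTY
raid60_write_iops = DRIVES * PER_DRIVE_IOPS / RAID6_PENALTY

print(raid10_write_iops, raid60_write_iops)   # ~900 vs ~300 random write IOPS

Roughly a 3x difference in random write capability from the same spindles, which matters a lot more for desktop VMs than it did for mostly-read file storage.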
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Thu Dec 02, 2010 11:33 am

Make sure you have a good winch to lift those servers into the racks - even with all the drives and PSUs removed, those chassis are going to be very heavy!

Personally I would have gone for the Supermicro double-sided chassis, and I think that where databases are concerned, SLC SSDs or even over-provisioned MLC SSDs are a better choice than 15K SAS if you're after maximum IOPS.

Write-back cache: I tested the recently released build of StarWind 5.5 quite extensively. This included running a continuous write/verify job against the write-back cache of an HA target. I ran this for five days with no errors, and while doing that I had each StarWind node reboot once per hour, which meant there was a failover every 30 minutes. You will have to decide for yourself whether that's good enough, and maybe come up with your own tests (SQLIO might be a good one).
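
If it helps, the kind of continuous write/verify loop I mean is roughly this (a sketch in Python; the test-file path is a placeholder, and the read-back can be served from the OS cache unless you use unbuffered I/O, so a proper tool like SQLIO is still the better test):

import hashlib
import os
import random
import time

TEST_FILE = r"E:\ha-test\verify.bin"    # placeholder path on a volume backed by the HA target
BLOCK = 64 * 1024                       # 64 KB blocks
BLOCKS = 16 * 1024                      # ~1 GB working set

def payload(n):
    # deterministic per-block pattern, so it can be recomputed at verify time
    return hashlib.md5(str(n).encode()).digest() * (BLOCK // 16)

deadline = time.time() + 5 * 24 * 3600  # run for five days, as in my test
with open(TEST_FILE, "wb+") as f:
    while time.time() < deadline:
        n = random.randrange(BLOCKS)
        f.seek(n * BLOCK)
        f.write(payload(n))
        f.flush()
        os.fsync(f.fileno())            # push the write out of the OS cache towards the target
        f.seek(n * BLOCK)
        if f.read(BLOCK) != payload(n): # note: read-back may still come from the OS cache
            raise RuntimeError("verify failed at block %d" % n)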

One issue, which will be addressed in a future release, is that HA sync doesn't yet use the full bandwidth of 10GbE. I got about 10% in each direction, which means max sequential write speeds are limited to about 200 MB/sec. However, this is per target, so if you have multiple LUNs you will use that fat pipe more effectively. Also, there is a workaround if you are prepared to use Windows software RAID over iSCSI LUNs. This gives you the full 10GbE bandwidth for writes as well as reads, but it means that any shutdown of a StarWind node requires a full RAID rebuild, which you will have to monitor from your SQL boxes.

Definitely get independent UPS units for your StarWind boxes, regardless of whether your datacentre has UPS and generator backup. The reason is that if you still lose power, you want the StarWind boxes to shut down gracefully, and you want one to shut down before the other. You will probably want graceful shutdowns of your SQL Servers too.

I've personally moved away from Areca, as I've had problems with their cards and my motherboards (the same as yours). However, that was previous-gen Areca, and the new 1880s use the same LSI 2108 ROC as the 9280s you've got, so hopefully all should be well. I don't know if this applies to Areca, but on LSI you can't do online capacity expansion of RAID 10, 50, or 60. So if you ever fill a RAID volume, you really need to create a new, bigger one and then forklift the data. It might be worth having spare, empty JBODs to make this easier.

CacheCade has a max of 32 SSDs and 512GB per controller, so you may want to max this out to get the best performance. I would see how you go with your current spec vs no CacheCade, and consider adding more X25-Es if you see a real benefit.
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Thu Dec 02, 2010 6:11 pm

The Chenbros arrived the other day. The FedEx guy was very unhappy. They weighed 130 lbs each.

I already have a server with 3 Areca 1680s and 48 Intel X25-E 32GB drives in RAID 10s, and the users of the database didn't notice any difference in speed over the 15K SAS drives it was on before. Our database is more of a DSS than an OLTP, so it's not too surprising. Lots of big sequential reads and some big sequential writes. There are some random accesses too, of course, but they are usually small, and good indexing and caching tend to take care of them. However, the database is growing very quickly, and we've gone quite a bit further on scaling it up than the manufacturer of our specialized software thought was possible. Therefore, we are getting ready to split the database across multiple database servers. Sticking with all SSDs is a rather expensive option that we don't get a large benefit from, but SSD CacheCade can be implemented with the stack of spare SSDs we already have.

Good to hear that the write-back cache is stable, but disappointing to hear that it is slow on the HA sync. The software manufacturer only supported a single database file (and tempdb file) until I questioned them on it when I took over here. Now they recommend a tempdb file per core and breaking the database into 5 files for the major groups of tables. I was going to put the OS, tempdb, log, and 5 db files onto separate targets anyway, so hopefully that will be able to utilize the sync better too. That way each of the db servers will be using 8 targets.
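
As a very rough sanity check (taking your ~200 MB/sec per-target figure as a given, and ~1250 MB/sec as the raw payload rate of a 10GbE link before protocol overheads):

PER_TARGET_MBPS = 200          # approximate per-target HA sync ceiling quoted above
TARGETS = 8                    # OS, tempdb, log, plus the 5 data files
LINK_MBPS = 1250               # ~10 Gbit/s expressed in MB/s, before overheads

print(min(PER_TARGET_MBPS * TARGETS, LINK_MBPS))
# 1250 -> with 8 targets the 10GbE pipe, not the per-target cap, should become the limit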

What problems did you run into with the older Arecas?
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Thu Dec 02, 2010 6:41 pm

Sure, if your access pattern is sequential then SSDs don't provide such a huge benefit. Having said that, the 1680 is a bit limited by PCIe 1.0 - the most I could push was about 1 GB/sec to cache. Of course, that's before iSCSI and Ethernet take their toll, but the 1880 (and the LSI 9280, same ROC) can double that, so if you've got access to 40GbE...

I had an Areca 1680ix-24, and I found that quite often after a reboot the card wouldn't see any drives connected to it when installed in that Supermicro motherboard. The supplier tried changing power supplies and even did a motherboard swap, yet the card works OK in other servers. So it's a specific issue between that motherboard (or perhaps the Intel 5520 chipset) and that RAID card. I have an identical server which is OK, but it doesn't get rebooted very often, so as a precaution I'm upgrading both to LSI. Although I will miss the 4 GB cache, having SAS-2 and access to the CacheCade and FastPath features should make up for it. Plus, I've found it's easier to get LSI spares and so on than Areca, at least in the UK.

The sync performance issue isn't too bad, as it's per target and I have a lot of targets, and I'm sure it will be addressed soon. The benefits of this build of 5.5, even compared to the last rock-solid build, are huge, and I'm glad that it's now released and supported.
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Thu Dec 02, 2010 7:02 pm

I ran into the 1 GB/s barrier on the 1680s also, and I hit a 125K IOPS ceiling on each. I ended up striping 2 of them in Windows, with 12 of the X25-Es in RAID 10 on each, to get to about 2 GB/s and over 200K IOPS of sequential reads with the cache set to write-back on a 100 GB test file. Too bad our software really won't take advantage of all that horsepower.

I may actually have the problem you are describing but just don't know it. I have only shut down that server once in the year it's been running.

I'll let you know if there are any issues with the motherboard and the 1880s. They are getting upgraded to 4 GB of cache each now; the IT director just put in the PO for me. :D
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sat Dec 04, 2010 4:58 pm

Stay away from cache on controllers in a non-HA configuration (even with a battery). The battery does its job if the array is fine, but if you manage to have a power loss while the array is in a degraded state (a member disk is AWOL, or is throwing a high SMART error rate and has started to work really slowly), your multi-gigabyte WB cache will be on its own... Many gigabytes of transactions will be lost, and recovery from backup will take hours, as the full content has to be restored.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Sat Dec 04, 2010 6:48 pm

anton (staff) wrote: Stay away from cache on controllers in a non-HA configuration (even with a battery). The battery does its job if the array is fine, but if you manage to have a power loss while the array is in a degraded state (a member disk is AWOL, or is throwing a high SMART error rate and has started to work really slowly), your multi-gigabyte WB cache will be on its own... Many gigabytes of transactions will be lost, and recovery from backup will take hours, as the full content has to be restored.
These will be in an HA configuration. The controllers will have batteries, and the redundant power supplies in the systems will be spread across multiple PDU circuits on multiple UPSes, with a diesel generator backing them up. The XenDesktops will be on the Arecas with 4 GB of cache; the databases will be on the LSIs with 512 MB of cache.

Is this a safe setup?
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands

Sat Dec 04, 2010 7:32 pm

I think yes.

P.S. ...and it will be even safer with triple-node rather than double-node HA.
rchisholm wrote:
These will be in an HA configuration. The controllers will have batteries, and the redundant power supplies in the systems will be spread across multiple PDU circuits on multiple UPSes, with a diesel generator backing them up. The XenDesktops will be on the Arecas with 4 GB of cache; the databases will be on the LSIs with 512 MB of cache.

Is this a safe setup?
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Sat Dec 04, 2010 7:49 pm

Can you describe triple-node HA for me? I have licensing for 4 nodes; I just don't necessarily have the hardware in the budget, but unexpected big projects do pop up from time to time. :wink:
anton (staff) wrote:I think yes.

P.S. ...and it will be even safer with triple-node rather than double-node HA.
Constantin (staff)

Mon Dec 06, 2010 3:45 pm

Triple-node HA is a future option of StarWind. Currently it's not available.
rchisholm
Posts: 63
Joined: Sat Nov 27, 2010 7:38 pm

Wed Jan 05, 2011 3:21 pm

We ran into a little trouble trying to get the 4 GB cache modules for the 1880s, so we're going to stick with the 1 GB that came with each of them and instead add an extra 48 GB of RAM to each of the StarWind nodes, for 72 GB each.

What's the maximum amount of system write cache that tends to be useful for an individual LUN?