Crazy HA RAM disk idea

Software-based VM-centric and flash-friendly VM storage + free version


User avatar
Aitor_Ibarra
Posts: 163
Joined: Wed Nov 05, 2008 1:22 pm
Location: London

Fri Oct 08, 2010 12:12 am

I can't actually try this at the moment as one of my servers is out for repair.

Anyway, I've had this idea...

Currently, you can't have an HA target that uses a RAM disk, and write performance with the write-back cache is held back a bit because writes aren't acknowledged until both nodes have written to disk, not just to cache (a less cautious option is coming soon).

But what if you had a small target that had to be HA but needed maximum write performance? Could you hit both 1 million IOPS and 1GB/sec?

1) Take 2 servers with at least six CPU cores each, and 3GB of RAM more than you need for your target. E.g. if you want a 5GB target, you need 8GB of RAM. You also need 10GbE NICs, or better still 40GbE...
2) Install Windows 2008 R2, the Hyper-V role and the free StarWind RAM disk. Create a RAM disk that automounts on startup. Format it as NTFS and give it a drive letter, let's say R:\
3) Write two batch files, startup_ram.bat and shutdown_ram.bat:

Code:

REM startup_ram.bat
COPY c:\ramdisk.vhd r:\ramdisk.vhd /y

Code:

REM shutdown_ram.bat
COPY r:\ramdisk.vhd c:\ramdisk.vhd /y
4) Use the Hyper-V Manager (or whatever tool you prefer) to create a fixed-size VHD at c:\ramdisk.vhd
5) Run startup_ram.bat and time how long it takes (speed will be limited by your HD)
6) Create a new VM with a virtual SCSI controller and r:\ramdisk.vhd as a disk on that controller
7) Set the VM startup properties to delayed start, using the time you got in 5) plus a few extra seconds/minutes
8) Create a group policy / local system policy / scheduled task to run startup_ram.bat at startup and shutdown_ram.bat at shutdown (a schtasks sketch follows after this list)
9) Give your VM four cores and 1GB RAM, and set up the virtual NICs
10) Install Windows and StarWind in the VM
11) Repeat on the other server
12) In StarWind, set up an HA target with the img stored on the RAM disk.
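
For step 8, here is a minimal sketch of the scheduled-task route, assuming the batch files sit in C:\ (the task name is made up). The shutdown half has no schtasks trigger, so it is easiest to hook in as a local Group Policy shutdown script:

Code:

REM register_startup_task.bat -- run once, elevated; sketch only
REM Runs startup_ram.bat at boot as SYSTEM, before the delayed VM start fires
schtasks /create /tn "StarWind RAM disk restore" /tr "C:\startup_ram.bat" /sc onstart /ru SYSTEM

REM There is no shutdown trigger in schtasks, so add shutdown_ram.bat via
REM gpedit.msc > Computer Configuration > Windows Settings >
REM Scripts (Startup/Shutdown) > Shutdown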

Now...
if you restart the server gracefully, the VM will be shut down gracefully, all writes to the img will be finished, the VM will close the VHD, and the batch file will copy the VHD to the hard drive. When the server starts, the RAM disk is created, the VHD is copied back, then the VM starts, and StarWind resyncs the target.

Who needs a TMS RAMSAN?!

...and you could probably do this without virtualising, provided StarWind doesn't mind storing the img on a RAM disk (in Hyper-V, it won't be able to tell!)

- store the img on R:\ rather than inside a VHD, and adjust the bat files accordingly
- set the StarWind service to Manual start, add a net start to startup_ram.bat after the copy and a net stop to shutdown_ram.bat before the copy (a sketch of the adjusted files follows below)
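
A rough sketch of those adjusted batch files, assuming the img file is called target.img and the service is registered as "StarWindService" (check the real name with sc query before relying on it):

Code:

REM startup_ram.bat (non-virtualised variant)
REM Restore the img to the RAM disk, then bring StarWind up
COPY c:\target.img r:\target.img /y
NET START StarWindService

REM shutdown_ram.bat (non-virtualised variant)
REM Stop StarWind so the img is closed, then persist it to the hard drive
NET STOP StarWindService
COPY r:\target.img c:\target.img /y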
User avatar
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Mon Oct 25, 2010 8:58 pm

I don't think it sounds like a production environment (RAM is still measured in gigabytes while data turns out to be terabytes if not petabytes), but I don't think it will take us a lot of time to implement something like this in the software. So we'll put this one into the roadmap as a "test task" for one of the newly hired software developers. Thanks a lot for this suggestion!
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

Timothy Mitchell
Posts: 4
Joined: Fri Jan 18, 2013 8:07 am

Fri Jan 18, 2013 8:41 am

I had a similar idea with a practical production need that led me to stumble upon this thread.

We have a number of large Hyper-V SAN servers that host VMs as they are built, before migrating them to standard diskless hosts (16x 2.8GHz processors, 256GB RAM, quad 10Gb/s SFP+ NICs, 24x 512GB SSDs, 48x 1TB HDDs).

The systems are configured to share out storage (using the Microsoft iSCSI Software Target) to the Hyper-V hosts and the VMs that run on them.

In our configuration we used differencing disks to significantly reduce the volume of disk space needed for the VMs that we run. When Hyper-V is configured to use a differencing disk, most of the I/Os land on the parent disk. 300+ VMs later, a few of these parent disks (about 40GB in size) account for over 90% of our disk I/Os. Hyper-V will not natively cache the parent disk in RAM, and the parent disk is always read-only.

So... we created a 40GB RAM disk and shared it out using a Server 2012 SMB file share and, voilà, no more bottleneck.

We implemented a PowerShell command, similar to the startup batch file in the post above, that must run before the SMB file sharing service starts, and it works perfectly.
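
For anyone wanting to copy the idea, here is a minimal batch-file sketch of that startup step (Timothy's actual script is PowerShell; the paths here are made up, and LanmanServer is the Windows Server/SMB service, assumed to be set to Manual start so the copy always runs first):

Code:

REM populate_parent_cache.bat -- run at boot, before SMB sharing comes up
REM Copy the read-only parent VHDs from the SSD volume onto the RAM disk share
COPY d:\parents\*.vhd r:\parents\ /y

REM Now bring up file sharing so the Hyper-V hosts can reach the parent disks
NET START LanmanServer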

We are currently finalizing testing for a production rollout so that we can increase the number of VMs and hosts without increasing the number of SAN systems to host the parent disk.
User avatar
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Fri Jan 18, 2013 9:08 am

I'm not sure why you stick with a single-controller design and RAM storage with nothing backing it - these things don't look "safe" to me.

Also, I don't understand why you can't use StarWind with massive amounts of memory as a distributed write-back cache (a flash tier is also coming soon). That would be safe and fast.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

sammybendover
Posts: 25
Joined: Sun Jul 15, 2012 10:05 pm

Mon Jan 21, 2013 7:56 pm

How soon do you anticipate this? I am anxiously awaiting an SSD cache system :)

How would you recommend implementing the SSDs? As RAID 0, 10, or 1? I have 4 drives total that I am planning on using.
User avatar
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Mon Jan 21, 2013 9:24 pm

There are many linked things inside V8 and we cannot release a stripped-down version faster. Sorry for this.

RAID 0 for performance.
sammybendover wrote:How soon do you anticipate this? I am anxiously awaiting a ssd cache system :)

How would you recommend implementing the SSD's? As a RAID 0, 10, or 1? I have 4 drives total I am planning on using.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Fri Jan 25, 2013 6:00 pm

Does the StarWind cache also operate as a Read Cache (in addition to write back)? If so, could you accomplish the same thing by assigning the same amount of memory (or more) to the cache during target creation? For example, I could create a 5 GB HA target, and then specify a 5 GB cache. Would this give similar performance to using the RAM disks?
User avatar
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Fri Jan 25, 2013 8:32 pm

1) "Write back" is a cache flush policy. Cache always accelerates reads and sometimes writes.

http://en.wikipedia.org/wiki/Cache_(computing)

(see the "Writing policies" section)

2) No. Cache accelerates frequently used data only. If you have sustained writes to different addresses, or reads from untouched addresses, the cache will fill up quickly (first case) or operate in cache-miss mode (second case), and you'll have the system working SLOWER because of the latency the cache adds. So a RAM disk is ALWAYS faster.
jhamm@logos-data.com wrote:Does the StarWind cache also operate as a Read Cache (in addition to write back)? If so, could you accomplish the same thing by assigning the same amount of memory (or more) to the cache during target creation? For example, I could create a 5 GB HA target, and then specify a 5 GB cache. Would this give similar performance to using the RAM disks?
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Fri Jan 25, 2013 8:58 pm

"and you'll have system working SLOWER because of a cache provided latency."

Does this mean it is better to have a smaller cache as opposed to a bigger cache? I thought the bigger the cache the better the performance; am I wrong there?

Thanks,
Jeff
User avatar
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Fri Jan 25, 2013 9:10 pm

No, it does not mean anything like that. It means a cache helps with some particular workloads (VM I/O is a good example). With some others it's useless (pure random reads), with some it's parasitic, slowing everything down (heavy sequential writes, say video capture), and with some it's simply dangerous (single controller, databases with transactions).
jhamm@logos-data.com wrote:"and you'll have system working SLOWER because of a cache provided latency."

Does this mean it is better to have a smaller cache as opposed to a bigger cache? I thought the bigger the cache the better the performance; am I wrong there?

Thanks,
Jeff
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

jhamm@logos-data.com
Posts: 78
Joined: Fri Mar 13, 2009 10:11 pm

Fri Jan 25, 2013 9:27 pm

So if my workloads are exclusively VMs, a big cache is good then?
User avatar
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Fri Jan 25, 2013 9:29 pm

If you have a dual-controller (node) or a triple-controller (node) design - superb.
jhamm@logos-data.com wrote:So if my workloads are exclusively VMs, big cache is good then?
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

Timothy Mitchell
Posts: 4
Joined: Fri Jan 18, 2013 8:07 am

Mon Feb 18, 2013 9:52 am

It makes sense and is perfectly safe.

The whole point of a parent differencing disk is that all traffic to it is read-only. When you have multiple child disks, the read demand on the parent disk rises with each new child disk. All changes the VM makes to the drive (write traffic) are sent to the child differencing disk, which is stored through a different SMB share directly on the SSDs without passing through the RAM disk.

The startup command, which has to run before the SMB service starts, repopulates the read-only parent disks onto the RAM disk from the SSDs when the SAN unit powers up.

In our configuration the parent disk holds a SYSPREP'd Windows Server 2012 OS image. This gives us a huge reduction in the storage space required for each VM because the OS files are deduplicated. The counterpoint is that the parent disk now receives most of the read traffic needed to run the OS.

More information on differencing disks:
http://technet.microsoft.com/en-us/libr ... s.10).aspx

We have been stress and failure testing these since my original post, and they run, crash under load (because we pulled the power cord), shut down, and start up without us having to do anything. The RAM drive system is hosted across multiple SAN units, leaving the parent disk available in case one SAN unit fails.

Currently we have tested a single SAN pair supporting over 2,000 VMs, whereas before the read load from the parent disk consumed so much of the SSDs' I/O that they could only support about 300 VMs.
anton (staff) wrote:I'm not sure why you stick with a single controller design, nothing backended RAM storage - these things don't look "safe" to me.

Also I don't understand why you cannot use StarWind and use massive amounts of memory to be used as a distributed write back cache (flash tier is also coming soon). That would be safe and fast.
User avatar
anton (staff)
Site Admin
Posts: 4021
Joined: Fri Jun 18, 2004 12:03 am
Location: British Virgin Islands
Contact:

Mon Feb 18, 2013 4:01 pm

There are many ways to skin a cat (c) ...

You can handle everything at the hypervisor level (diff clones), or you can use deduplicated storage, where the SAN will take care of mapping the same data to the same blocks and caching them deduplicated. The result would be pretty much the same.

I still don't buy your approach with non-redundant SANs and cannot figure out how failover happens.
Regards,
Anton Kolomyeytsev

Chief Technology Officer & Chief Architect, StarWind Software

Timothy Mitchell
Posts: 4
Joined: Fri Jan 18, 2013 8:07 am

Thu Feb 21, 2013 3:43 pm

anton (staff) wrote:There are many ways to skin a cat (c) ...

You can handle everything on hypervisor level (diff clones) or you may use deduplicated storage where SAN will take care of mapping the same data to the same blocks
and caching them deduplicated. Result would be pretty much the same.
For any normal file system this would be true. However, when you consider the impact that live migration or vMotion has on any storage system, local or networked, you would always want to use differencing disks.

Why?

During any form of live migration, all disks directly associated with a VM must be transferred between hosts. If the hypervisor is not aware of the deduplication method, then when it transfers a VM between hosts it will copy the full contents of the drive and not just the unique section.

For instance, take two systems running a web server: the first uses a single virtual disk for all operating system data, and the second uses a differencing disk for the same information. On the first, the total VHD size is 30GB; on the second, the parent VHD is 25GB and the differencing disk is 5GB. During a live migration of the first web server, the entire 30GB VHD must be copied between servers for the migration to occur. For the second web server to live migrate, only the differencing disk is transferred between hosts, provided the parent disk is available at the same local or network path. If the parent disk is not at the same path, the system will prompt you to include the parent disk in the migration.

The argument can be made that a good deduplicating SAN can detect duplicate information being read by one server and written by another and allow the copy to happen faster. However, in my experience this places a high and unnecessary load on your network for transferring the information between systems, and on the memory and processors of your SAN for detecting the duplicate information.

To push this separation architecture to its most efficient form, use a differencing disk for your operating system drive and have the OS in the virtual machine directly connect an additional drive letter to a separate iSCSI or SMB shared disk (or disks) for any installed applications, stored files, and log files (except iSCSI and SMB logging). This reduces the size of your differencing disk to typically less than half a gigabyte and allows live migrations of typically large systems such as SQL or Exchange to complete in a matter of seconds on a standard gigabit network.
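
As a concrete illustration of the layout being described, a differencing child can be created against a shared, read-only parent image with diskpart; all paths and names here are made up for the example:

Code:

REM make_child_vhd.bat -- create a differencing VHD for one VM's OS drive
REM The child stays small because only that VM's changes land in it;
REM the big SYSPREP'd parent is never written to and is shared by many VMs
echo create vdisk file="d:\children\web01.vhd" parent="p:\parents\ws2012-sysprep.vhd" > %TEMP%\make_child.txt
diskpart /s %TEMP%\make_child.txt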

anton (staff) wrote:I still don't buy your approach with non-redundant SANs and cannot figure out how failover happens.
As for the non-redundant SANs: we are running redundancy on our SANs, and our environment requires N+2, no-single-point-of-failure operation on all systems. This means that we must be able to take two failures or configuration errors within any service layer and continue to operate.

Each individual SAN is deployed as a pair running failover clustering. The clustering service is only for iSCSI and SMB VHD access. The active device within the cluster handles VHD access, and the passive device runs the Volume Shadow Copy service for creating backups without impacting production performance. The passive device is also configured as preferred for DFS traffic and offsite archiving.

Each of our sites contains 3 pairs of SAN units. Each member within a cluster runs on a separate SAN pair. SQL is replicated using transactional synchronization. Exchange databases are replicated using DAGs. DNS and Active Directory use their own replication. Web front ends are configured with secondary SQL servers. UAG load balancers use NLB-based VIPs for accessible IPs and are configured to detect web server errors in case multiple SQL failures leave a web front end orphaned from both of its SQL servers. And finally, DFS runs directly on the SANs for file replication and availability.

Site-to-site replication is handled through SQL transaction logging, Exchange DAGs, and DFS. Virtualized GTM load balancers on anycast IP addresses redirect CNAMEs to the nearest available site. This design allows our sites to run independently of each other in the case of a complete network outage or area isolation. Any split-brain clustering issues are handled through the transactional synchronization system when the sites are able to communicate with each other again.