StorSimple Deep Dive 2 – Data Efficiencies

It's been an exciting few weeks since my first deep dive post into StorSimple last month, and a very busy end to the Microsoft year! It's been great seeing the interest from customers when we start discussions about being able to leverage the cloud without having to change their application architecture.

In my first StorSimple deep dive, found here, I talked about StorSimple from a hardware and platform perspective. In this post I want to talk about the efficiencies that StorSimple offers.

What do you mean by efficiency?

When you talk about efficiencies, it can mean many things to different people. In this case I’m talking about the way StorSimple optimises data for performance and capacity, and moves data between tiers based on a smart algorithmic approach that views data at a block (sub-LUN/file) level. That is really good… but of course other companies do this as well, so what differentiates StorSimple? The biggest differentiator is that the lowest tier isn’t SATA/NL-SAS but the cloud. When we are talking about low-cost, highly available storage, it is hard to beat the economies of scale that the cloud can offer, and of course it goes without saying <AzurePlug> that Windows Azure is the best example of this, with our price structure the same for any DC in the world and the fact we geo-replicate all data by default </AzurePlug>.

How much of my data is really hot?

When data is created it is hot and will be referenced quite often; after some time, however, this data generally grows cold very quickly. You only want to look at those old photos of your cat on the rare occasion when you have a new meme to create. Generally, anything north of 85% of the data can be cold. Keeping this data unoptimised and on local disk is obviously not the most effective use of your technology budget.
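
To put a rough number on that, here is a minimal Python sketch of the hot/cold split. The 85% figure is the rule of thumb above; the 100 TB data set and the `split_hot_cold` helper are made up purely for illustration.

```python
def split_hot_cold(total_tb: float, cold_fraction: float = 0.85):
    """Return (hot_tb, cold_tb) for a data set of total_tb terabytes."""
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    cold_tb = total_tb * cold_fraction
    return total_tb - cold_tb, cold_tb


# Hypothetical 100 TB data set, using the 85% rule of thumb.
hot, cold = split_hot_cold(100.0)
print(f"hot: {hot:.1f} TB, cold: {cold:.1f} TB")
```

In other words, under that assumption only around 15 TB of a 100 TB data set would need to sit on fast local storage at any one time.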

Life of a Block with StorSimple

So, into the deep dive bit! I’m going to talk about the life of a block and how we treat it from a StorSimple perspective. I’ve done a series of whiteboards… or rather drawn on my screen with <MS_Plug>my touch-enabled Windows 8 laptop</MS_Plug>… below to explain how we treat a block. Block sizes on StorSimple are variable, and we generally select the block size based on the kind of workload running on a specific volume (LUN). StorSimple deduplicates, compresses and encrypts data before it is tiered off to the cloud; this generally provides between 3x and 20x space savings, dependent on workload.

  1. A block is first written into NVRAM (battery-backed DRAM that is replicated between controllers).
  2. The blocks are then written down to a linear tier, which is made up of eMLC SSD drives, so low latency and high IO.
  3. Blocks of data generally don’t stay on the linear tier for long, unless they are subject to continuous IO requests. Blocks are taken, near immediately, down to the dedupe tier. This data remains on SSD, with the low latency and performance you expect, but it is deduplicated inline, at a block level, before arriving here, providing significant space savings.
  4. As the blocks start to cool they are then taken to a SAS tier and compressed inline on the way down. This all happens at a block (sub-file/LUN) level, so if a VM system file, for example, was located on StorSimple, the parts of that file which are hot remain on SSD while the majority of the capacity, which is infrequently accessed, will be taken to SAS.
  5. As the StorSimple appliance starts to use up its local capacity it will encrypt the coldest blocks of data and tier them off to the cloud using RESTful APIs. When this is Windows Azure, three copies of the data will be kept in the primary data centre and three copies in the partner data centre by default. Suddenly you can use the cost and availability efficiencies of the cloud without having to change your application, operating system or, most importantly, the way you view your files. The data is encrypted with AES 256-bit encryption, and the private key is specified by the customer.
  6. In the event that a block is called back from the cloud, it will be a seamless process. The metadata, which is always stored locally on the appliance, knows exactly which block of data is required. StorSimple will make a RESTful API call over HTTPS to bring the data back to the appliance in a very efficient manner, as the data is compressed and deduplicated and there is no need to search for the location of the block. Not only will the block you require be recalled, but other corresponding blocks will be pre-fetched, based on the read pattern, for further performance optimisation. In my whiteboard example I’ve shown block “F” being recalled to the local appliance. This data will be stored on the deduped SSD tier, as it is now hot data once more. This back-end process is totally transparent, and the only thing you will notice is slightly higher latency on the blocks which have been tiered to the cloud.
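
The six steps above can be sketched as a toy model in Python. This is only an illustration of the ideas (fingerprint-based deduplication, compression, a local metadata map, and recall from a cloud store), not StorSimple's actual implementation. The `ToyAppliance` class, its method names, the choice of SHA-256 fingerprints and zlib, and the dict standing in for the cloud are all assumptions made for the example. Encryption is left out because the Python standard library has no AES, and the separate NVRAM, linear, dedupe and SAS tiers are collapsed into a single local store.

```python
import hashlib
import zlib


class ToyAppliance:
    """Toy model: a local store, a dict standing in for the cloud, and local metadata."""

    def __init__(self) -> None:
        self.local = {}    # fingerprint -> compressed block held on the appliance
        self.cloud = {}    # fingerprint -> compressed block tiered to the "cloud"
        self.volume = []   # metadata: ordered fingerprints, always kept locally

    def write(self, block: bytes) -> str:
        fp = hashlib.sha256(block).hexdigest()
        self.volume.append(fp)
        # Deduplicate: only store the payload if this fingerprint is new.
        if fp not in self.local and fp not in self.cloud:
            self.local[fp] = zlib.compress(block)
        return fp

    def tier_to_cloud(self, fp: str) -> None:
        # A real appliance would encrypt before upload; omitted in this sketch.
        self.cloud[fp] = self.local.pop(fp)

    def read(self, fp: str) -> bytes:
        if fp not in self.local:
            # Recall: the metadata already names the object, so no search is needed.
            self.local[fp] = self.cloud[fp]
        return zlib.decompress(self.local[fp])


box = ToyAppliance()
a = box.write(b"A" * 4096)
f = box.write(b"F" * 4096)
box.write(b"A" * 4096)       # duplicate of "A": no new payload is stored
box.tier_to_cloud(f)         # block "F" has gone cold
data = box.read(f)           # block "F" is recalled and is local again
print(len(box.volume), "logical blocks,", len(box.local), "stored locally")
```

Whether the cloud copy is kept after a recall is another assumption here; the sketch simply leaves it in place.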

How do we know who is insane (I mean cold)…and how

Easy! Whoever has this stamp is insane…


Now that my obsession with the Simpsons is addressed, how do we decide which blocks are cold and when they are tiered, and what happens, from a performance perspective, when blocks of data need to be accessed and they are located on a cloud provider?

StorSimple uses a Weighted Storage Layout (WSL), which works through the process below:

  • BlockRank – All volume data is dynamically broken into “chunks”, analysed and weighted based on frequency of use, age, and other factors
  • Frequently used data remains on SSD for fast access
  • Less frequently used data is compressed and stored on SAS
  • As the appliance starts to fill up, optimised data is encrypted and tiered to the cloud
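
The exact BlockRank weighting isn't spelled out here, so the Python sketch below uses a hypothetical formula just to show the shape of the idea: an access count that decays with age, using a one-day half-life. The `ChunkStats` fields, the `heat_score` function and the half-life value are all assumptions, not StorSimple's algorithm.

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    chunk_id: str
    access_count: int    # accesses seen in the observation window
    last_access: float   # timestamp of the last access, in seconds


def heat_score(stats: ChunkStats, now: float, half_life_s: float = 86_400.0) -> float:
    """Hypothetical weighting: access count halved for every half-life of age."""
    age = max(0.0, now - stats.last_access)
    return stats.access_count * 0.5 ** (age / half_life_s)


def rank_coldest_first(chunks, now):
    """Return chunks ordered from coldest (first to tier down) to hottest."""
    return sorted(chunks, key=lambda c: heat_score(c, now))


now = 1_000_000.0
chunks = [
    ChunkStats("A", access_count=50, last_access=now - 60),          # busy a minute ago
    ChunkStats("F", access_count=50, last_access=now - 7 * 86_400),  # busy a week ago
    ChunkStats("C", access_count=2, last_access=now - 3_600),        # rarely used
]
order = [c.chunk_id for c in rank_coldest_first(chunks, now)]
print(order)
```

With these made-up numbers, the chunk that was busy a week ago ranks colder than the one touched only twice in the last hour, which is the kind of trade-off between frequency and age that a weighting like this has to make.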

But what about what I think?!

Automation is great, but what if I want to manually influence the priority in which data sets are tiered to the cloud? StorSimple offers a solution for this as well. You can manually set a volume to “local preferred” so that it is the last data set to be tiered to the cloud; it is only tiered off once all other data sets have been tiered off and the appliance is running out of local capacity.

Examples of data sets you might set to prefer local are:

  • Log files
  • Database files
  • VM system files
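
As a rough sketch of how a “local preferred” flag could influence tiering order, the Python below sorts volumes so that anything not marked local preferred is tiered first, coldest first within each group. The `Volume` class, the `heat` field and the volume names are hypothetical, and the real appliance tiers at a block level rather than moving whole volumes.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    local_preferred: bool
    heat: float   # hypothetical heat score; lower means colder


def tiering_order(volumes):
    """Non-preferred volumes first, then local preferred; coldest first in each group."""
    # False sorts before True, so local preferred volumes end up last.
    return sorted(volumes, key=lambda v: (v.local_preferred, v.heat))


volumes = [
    Volume("sql-logs", local_preferred=True, heat=0.1),
    Volume("file-share", local_preferred=False, heat=5.0),
    Volume("archive", local_preferred=False, heat=0.2),
]
names = [v.name for v in tiering_order(volumes)]
print(names)
```

Note that the log volume goes last even though it has the lowest heat score, because the flag takes priority over how cold the data looks.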
