Monday, March 21, 2011

Data Lifecycle Management

I have been spending a significant amount of time lately researching how computer data storage has evolved over the last five years. As many of you know, I earn my living as a consultant who develops complex IT solutions for my clients - matching appropriate technology solutions to real business problems. Data Lifecycle Management is a hot topic these days!

Business-grade data storage solutions generally come in two flavors: Network Attached Storage (NAS) and Storage Area Network (SAN). What typically differentiates the two is how the host computers connect to them. NAS devices attach to the Ethernet network and serve data over TCP/IP protocols such as CIFS (Windows) and NFS (UNIX/Linux). A newer standard called iSCSI, which carries block storage traffic over ordinary IP networks, is seeing a rapid rise in popularity.

NAS devices are attractive to smaller organizations because they are less expensive to deploy. They are simply attached to the existing Ethernet network, and then they expose their data volumes to the host computers. SAN devices require Fibre Channel cabling and switching gear to connect the hosts. This is significantly more expensive than NAS, so it is considered a "higher-end" solution. It is also considered to be more secure, as the users of the applications and the storage sit on physically separate networks, using different access protocols.

In almost all data storage systems, the device is a collection of computer hard disks in an enclosure. It also houses the controller heads (the brains that read and write the data) and the means of connection, be it Ethernet or Fibre Channel. The brains provide different ways of striping the data across the disks, mitigating the risk that a hardware failure of a single drive permanently destroys data.
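
To make the striping idea concrete, here is a minimal sketch in Python (the block contents and the three-disk layout are made up for illustration) of how a simple parity scheme, similar in spirit to RAID 5, can rebuild the contents of a failed drive from the surviving ones:

from functools import reduce

# One stripe spread across three "data disks" (hypothetical 8-byte blocks).
data_disks = [b"blk-A001", b"blk-B002", b"blk-C003"]

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# The controller writes a parity block to a fourth disk.
parity = xor_blocks(data_disks)

# Simulate losing the second disk: XOR the parity with the surviving
# data blocks to reconstruct the missing block.
rebuilt = xor_blocks([data_disks[0], data_disks[2], parity])

assert rebuilt == data_disks[1]
print("Rebuilt block:", rebuilt)

The point is not the arithmetic itself, but that the array can tolerate the failure of a single drive without losing any data, at the cost of one disk's worth of parity.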

Newer data storage systems have evolved from the traditional ones to include different types of hard disk in the same array. The heads (brains) understand that the different types of disks have different price:performance ratios. So rather than treating all workloads the same, those with higher performance requirements - real-time applications or databases - are intelligently migrated to the faster disks, while those with low performance requirements move to the less expensive, lower-performance disks.
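
As a rough illustration of the idea (the tier names, IOPS thresholds, and volumes below are entirely made up, not taken from any particular vendor), a tiering engine might place each volume on the fastest class of disk its workload is busy enough to justify:

# Hypothetical sketch of automated tiering: hot workloads earn their way
# onto expensive, fast disks; quiet ones settle on cheap, slow ones.

TIERS = [
    # (tier name, minimum sustained IOPS needed to justify it)
    ("ssd", 5000),
    ("fast_sas", 500),
    ("sata", 0),
]

def choose_tier(recent_iops):
    """Return the fastest tier whose threshold this workload meets."""
    for name, threshold in TIERS:
        if recent_iops >= threshold:
            return name
    return TIERS[-1][0]

volumes = {"orders_db": 12000, "web_logs": 800, "old_scans": 12}
for volume, iops in volumes.items():
    print(f"{volume}: {iops} IOPS -> {choose_tier(iops)}")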

Then there are other software advances in the heads (brains) which help by deduplicating the stored data. Imagine you wrote a paper on encyclopedias. Rather than writing out the word encyclopedia every time it appears in the paper, it gets written once, and all subsequent copies are stubs which simply point back to the first - like an abbreviation. But the heads can do this with every word in the paper! This can lead to storage efficiency gains of up to 90%! There are other means of making the data storage more efficient as well, such as thin provisioning.
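
A minimal sketch of that pointer idea, in Python (word-level, to match the analogy; real arrays work on fixed-size blocks and content hashes, but the principle is the same):

def dedup_words(text):
    """Store each unique word once; replace every repeat with a pointer."""
    store = []      # unique words, kept in first-seen order
    index = {}      # word -> its position in the store
    pointers = []   # the "paper", rewritten as references into the store
    for word in text.split():
        if word not in index:
            index[word] = len(store)
            store.append(word)          # first occurrence: keep the actual word
        pointers.append(index[word])    # every occurrence is just a pointer
    return store, pointers

paper = "the encyclopedia says the encyclopedia is the best encyclopedia"
store, pointers = dedup_words(paper)
print("unique words stored:", store)
print("pointers:", pointers)

The more repetition there is in the data (backups of backups, hundreds of nearly identical virtual machine images), the bigger the savings, which is where figures like 90% come from.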

As you can see, organizations will go to great lengths to protect their data. The final tier in this Data Lifecycle Management paradigm is the role of tape. In traditional data storage systems, data would be backed up onto tapes, which are then sent offsite for safekeeping. Companies such as Iron Mountain have made a great business out of managing the logistics of off-site data storage for other organizations.

But tape is not without its problems. While it IS remarkably inexpensive in terms of cost per gigabyte stored, it is also somewhat fragile. Like most magnetic media, it is very sensitive to magnetic fields, and simply holding your tapes too close to a mobile phone places them at risk!

Since the price of disk-based storage is rapidly coming down, and high-performance flash disks (no spinning platters - these are solid state!) are becoming commonplace, the concept of "near line" storage is really taking hold. Organizations acquire a second, inexpensive storage array and use it to store copies of the Production data.

This near line storage allows for all kinds of operational efficiencies. In the event that a user accidentally deletes an important file, it is quickly and easily restored from the near line storage system. In a traditional system, the operator would have to identify which tape the file was on, order the tape back from the off-site facility, and after it arrives, restore the file. This could take a significant amount of time.

With data archival systems (like traditional tape backup systems), there are three key factors: Recovery Time Objective (RTO), Recovery Point Objective (RPO), and - of course - cost. RTO is the amount of time it takes from when the file was deleted to when it has been restored. RPO is how far back in time the most recent copy of the deleted file was made, which could be measured in minutes, hours, or days! The total exposure is the sum of RTO + RPO. Finally, cost is tied to those targets: conventional wisdom is that the shorter the outage window, the more expensive the protection scheme.
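
A quick worked example, in Python, with purely illustrative numbers for the two approaches described above:

# Purely illustrative: nightly tape backups vs. hourly near line disk copies.
scenarios = {
    # name: (RPO in hours - how old the last good copy may be,
    #        RTO in hours - how long the restore itself takes)
    "nightly tape, stored off-site": (24.0, 8.0),
    "hourly copies to near line disk": (1.0, 0.5),
}

for name, (rpo, rto) in scenarios.items():
    print(f"{name}: up to {rpo} h of lost work + {rto} h to restore "
          f"= {rpo + rto} h of total exposure")

The tape scenario is far cheaper per gigabyte, but the near line scenario shrinks the exposure from more than a day to under two hours - exactly the trade-off the conventional wisdom describes.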

So the final tier of Data Lifecycle Management is still tape, but it is used more for long-term data retention. When the organization looks at the age of its data, it may no longer be cost-effective to keep stale, unused files on the near line storage system. Tools are available to track when data is accessed and migrate it down through the storage tiers based on policies. For example, a policy can state that anything not accessed in 14 days should be moved from the primary storage system to the near line storage system. And if it remains untouched there for 30 more days, it gets marked for long-term data retention, offsite on tapes.
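
A hypothetical policy engine implementing exactly that example (the thresholds match the 14-day and 30-day figures above; the file names and dates are invented) could be as simple as:

from datetime import datetime, timedelta

PRIMARY_TO_NEARLINE = timedelta(days=14)        # untouched 14 days -> near line
NEARLINE_TO_TAPE = timedelta(days=14 + 30)      # untouched 30 more days -> tape

def placement(last_accessed, now):
    """Decide which tier a file belongs on, based only on its last access time."""
    idle = now - last_accessed
    if idle >= NEARLINE_TO_TAPE:
        return "tape (long-term retention, off-site)"
    if idle >= PRIMARY_TO_NEARLINE:
        return "near line storage"
    return "primary storage"

now = datetime(2011, 3, 21)
for name, days_idle in [("budget.xls", 3), ("q3_report.doc", 20), ("2008_scans.zip", 200)]:
    print(name, "->", placement(now - timedelta(days=days_idle), now))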


The opinions expressed in this post are purely those of the author. Opinions are like noses; everyone has one and they are entitled to it!
