Dear StorageMojo: replace 4PiB of tape with object storage?

by Robin Harris on Tuesday, 10 June, 2014

An architect and regular StorageMojo reader faces a perplexing problem: should he move his organization’s archive from tape to object storage? The first question is, of course, is it technically feasible?

The economics come next.

Here’s his problem, or rather opportunity, lightly edited:

I’m really curious about where to turn next with yet another infrastructure project.

Picture a large HSM environment currently storing about 4PiB+ of data, with all of the bells and whistles as far as data validation at the tape layer (Data Integrity Validation via Oracle T10K-C/D) and ZFS at the top for disk consistency. SAM-FS (Storage & Archive Manager) is the data mover/stack in use.

What of object storage in this space? Are we there yet? Can we consider something like a commercialised/packaged up Ceph to be able to pull off the preservation and “forever” feats of big HSM environments with multiple tape copy scale?

My gut feeling is that Ceph has a place and will be a rockstar filesystem in the end, and might just take a place alongside ZFS as a stalwart of modern “just use it” thinking, but I do wonder if Ceph has been used at that kind of storage scale with the sole intent of a very high throughput, high performance (think: 10+GB/sec, 100k+ IOPS) file serving platform ATOP an object storage container BEING RELIED UPON FOR a long term archiving and preservation infrastructure.

Are we there yet? Is the tapeless world now possible with all the goodness and data integrity capabilities that tape offers without the vendor smarm/marketing and “tape is dead” rhetoric? It is almost embarrassing to hear such words at press and conference events. Not sure how company execs still get away with saying it without being pelted with tomatoes.

What about it, object storage vendors, can you meet his needs without bashing tape?

The StorageMojo take
Tape has been ceding market share for the last 60 years. But it’s still here. The recent work by IBM and Fujifilm holds the potential for tape to begin to gain market share, if they can meet their aggressive density growth rates.

Until that happens, the rising economic threshold for tape makes scale out object storage ever more attractive. But attractive even at 4PiB?

Also, there are several other options besides Ceph: Scality, OpenStack, QStar, Quantum, Object Matrix, Amplidata, DDN’s WOS and Cleversafe, among others.

Vendors, start your engines.

Courteous comments welcome, of course. Vendors encouraged to comment but, please, identify your company and try to add as much value as you can. Please ask any questions in the comments or email StorageMojo and they’ll be passed on to the architect.

jan June 10, 2014 at 7:00 am

I don’t think Ceph is really there yet at scale. Also economically it will use a lot more space as it doesn’t have erasure coding [Ed note: Stale data. Ceph has had Erasure Coding since the Firefly release.] so you are looking at multiple full copies.

If you want performance in object storage I would look at Scality. And if you want more than just object storage, take a look at SGI’s Infinite Storage Gateway, which OEMs Scality and lets you use object storage plus tape. The integration has some way to go, but I think it’s getting there.

Quantum may be able to do it, but you are locked into everything Quantum. Cleversafe can probably do it, but I think they’re more for very large archives than for performance object storage.

Ian Colle June 10, 2014 at 8:34 am

Perhaps one of our multi-PB production customers could talk to you about how Ceph scales. 😉

Ceph has had Erasure Coding since the Firefly release.

For details on how to configure it, see http://www.ceph.com/docs/master/rados/operations/erasure-code-profile/

Joe Arnold June 10, 2014 at 8:35 am

Joe Arnold – CEO/cofounder of SwiftStack – chiming in.

Swift object storage comes out of the OpenStack project. The durability characteristics for long-term storage are very good. Three examples: First, the ability to self-heal when there is a device failure means that data continues to be protected even if an operator does nothing. Second, the ability to deploy across multiple data centers further reduces risk. Lastly, there is the ability to continuously scrub data to re-validate objects and replace any that have become corrupt.

Dig more into the architecture here: https://swiftstack.com/openstack-swift/architecture/

Erasure code development is in the works with Intel/Box/SwiftStack – which will further reduce storage costs: http://www.intel.com/content/www/us/en/storage/swift-with-erasure-coding-storage-demo.html

Multi-petabyte archives can be done extremely cost effectively with very high durability levels. SwiftStack is one of the largest contributors to the open source project within OpenStack, and we also provide a software platform that makes object storage across commodity hardware incredibly easy to deploy and scale.

Example use cases for archive:
Internal IT (with HP – short): http://youtu.be/XPUnRB9LIH8
Life sciences (with FHCRC – longer): http://youtu.be/ImrYe0CBL-E

Ross Turk June 10, 2014 at 8:38 am

@jan – Ceph does have erasure coding. It was released in Ceph Firefly a few weeks ago. That, along with the cache tier pools, brings flexibility around performance/cost for this use case. It’s a bit on the new side, but worth looking into!

-Ross (Red Hat)

matt June 10, 2014 at 9:16 am

Ceph is SLOW and lack of erasure-coding means it’s pointless to use. Cleversafe’s prices are insane. DDN is probably the fastest of the bunch but is hardly inexpensive and is playing catch up to Cleversafe in the notion of globally-distributed.

I get the impression the author doesn’t understand what object storage is for. It’s not for real time access. It’s for async requests on essentially static and largish objects. Would I store say Oracle database exports or archive logs into S3/Clever/WoS? Yes. But I wouldn’t dream of trying to run a high iops “fileserver” off of that same stack.

4PB of HSM probably cost a mint. Object storage is really just (much) faster tape, but the workload characteristics are essentially the same.

John Dickinson June 10, 2014 at 10:55 am

I’m the project technical lead for OpenStack Swift. It’s an object storage system, and I’ve seen quite a few migrations from more traditional storage systems, including tape, to Swift. It’s really interesting to look at why people are choosing to move from tape-based systems to Swift.

We all acknowledge that storage requirements (for everyone) are growing rapidly. More and more servers, more and more users, and the changing demographics of how applications are used all contribute to this massive growth in storage. And one could use tape-based systems to get very low cost storage, but the high access times that tape gives you means that tape isn’t a good fit for data that is consumed by modern apps.

With web and mobile apps, users require data portability, and current data consumption models center around on-demand access to a very long tail of data. Both of these realities require a high-availability storage system. Tape can’t provide that, and so storage engines like Swift are needed to provide cheap, durable, available storage that supports the high concurrency required in these use cases.

These characteristics are why object storage systems, and OpenStack Swift specifically, are being adopted. In addition to some of the largest public cloud service providers, Swift is being adopted for video streaming, gaming, enterprise IT, scientific research, in-house sync-and-share backends, and more. Swift meets the high availability and concurrency access requirements of these applications, and it scales to massive levels. Combine that with the fact that it’s open-source and doesn’t have any hardware lock-in, and you’ve got a storage engine that offers a compelling solution to modern storage problems.

John Wilkins June 10, 2014 at 11:43 am

@Matt

Ceph does have erasure coding. See:
http://ceph.com/docs/master/architecture/#erasure-coding
http://ceph.com/docs/master/rados/operations/erasure-code-profile/

Ceph writes data and replicates it before acknowledging the write to the client to provide a guarantee that it wrote the data successfully. That means that you won’t get a false sense of security about a write from mechanisms like drive caches. While that may introduce latency, it provides a guarantee that Ceph wrote the data to disk multiple times successfully.

Ceph also provides a means to accelerate I/O with the same guarantee: Ceph provides cache tiering too. A cache tier comprising fast SSD drives can address I/O latency for a subset of the cluster while still providing the cost efficiency of an erasure coded backing storage pool.

See http://ceph.com/docs/master/rados/operations/cache-tiering/ for details.

matt June 10, 2014 at 3:51 pm

re: Ceph has erasure-coding. Fine, I wasn’t aware of that very recent development. Interesting description of how Ceph does erasure-coding which is not how S3 does it at all. Or was the Ceph description watered down so much that it lost accuracy? Or was this an “optimization” with the intent that non-degraded reads were really fast (compared to S3) and low cost in CPU?

In any event, object-storage (cached or otherwise) is great for write-rarely, read many times. It really is just tape on steroids. I run across a disturbing number of people who think they want to run their virtual machine disks on object storage and have a very hard time understanding that’s the worst possible use case. (want to store a snap-shot of the VMDK for posterity? sure, go ahead) For anything that’s write it, and forget about it being lost due to bit-rot or node/cartridge failure, EC-based object storage is the right tool.

matt June 10, 2014 at 4:41 pm

excellent paper if you haven’t read it even if it’s a bit dated. http://web.eecs.utk.edu/~plank/plank/papers/CS-08-625.pdf

Ian Colle June 11, 2014 at 3:31 am

RE: the above paper, Ceph uses jerasure.

https://github.com/ceph/jerasure

Philippe Nicolas June 11, 2014 at 12:31 pm

If I understand the question correctly: “Is it possible to replace 4PB of tape with Object Storage for archive use?”

Of course, yes. There are plenty of examples where object storage solutions replace tape environments, and many where they are selected instead of tape. Let me explain why in detail.

From a technology perspective, storing and preserving 4PB of data with an object storage technology is exactly what all object storage vendors do every day in many verticals and industries. But it doesn’t mean they store archive data or production data. Archiving implicitly means storing data for a pretty long time, and this data is the only and unique copy, in fact the last one. So it has to be highly protected. Object storage solutions use essentially 2 protection mechanisms, replication and erasure coding (EC), and some of them can extend this to remote sites, with a stretched cluster for instance. By the numbers, just to put things in perspective, 4PB represents fewer than 2.5 racks of 4U chassis, each holding 60 4TB drives, protected by a simple EC model with, let’s say, a 1.5 ratio between raw and usable storage.
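
The rack estimate above can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope calculation: the 42U rack height and the use of decimal units (1PB = 1000TB) are assumptions not stated in the comment, and `racks_needed` is a hypothetical helper name.

```python
import math


def racks_needed(usable_pb, raw_to_usable=1.5, drives_per_chassis=60,
                 drive_tb=4, chassis_u=4, rack_u=42):
    """Return (chassis count, racks) needed to hold usable_pb of protected data.

    rack_u=42 and decimal units (1 PB = 1000 TB) are assumptions.
    """
    raw_tb = usable_pb * 1000 * raw_to_usable      # raw capacity after EC overhead
    chassis_tb = drives_per_chassis * drive_tb     # raw capacity of one chassis
    chassis = math.ceil(raw_tb / chassis_tb)       # whole chassis only
    return chassis, chassis * chassis_u / rack_u


chassis, racks = racks_needed(4)
print(f"{chassis} chassis, about {racks:.2f} racks")
```

With these inputs the result is 25 chassis, or roughly 2.4 racks, which is consistent with the "fewer than 2.5 racks" figure.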

But how is it integrated with the environment? Is it fully transparent for the user? And at what cost?

Integration with a data mover can be made at multiple levels, with object APIs or file interfaces, depending on the capabilities offered by the platform. We have often heard the term Active Archive, meaning that access is transparent and pretty fast, without the penalty of offline data. In the end the archive platform is pretty similar to a big file server, a bit slower than production file servers if I compare the two, but both offer easy data/file access for an application or a user.

Plenty of examples of such products exist on the market, supporting multiple object storage solutions. Data management can follow an HSM-like or a pure archiving approach, with the difference that HSM maintains a link/pointer/stub on primary storage to make the secondary environment a transparent extension. Another key advantage is the random access of disk drives, without limitations such as the number of tape drives, which introduces latency and a bottleneck when the number of requests exceeds the number of tape drives. In that case, requests are queued and the user waits for the data, reducing the value of the approach.

The cost is also very attractive, and it can be even better if you consider additional features such as deduplication, compression, cold/spin-down/MAID or zero-watt disk.

In both cases, object storage and tape, what improves the user experience is the capability to find, retrieve and serve content very fast with content indexing and search capabilities.

At Scality, we have multiple integrations with these data mover engines supporting large environments with object or file integration, depending on the user’s need. We have deployed this in various verticals with great success, sometimes replacing tape and sometimes being selected instead of tape.

Without any doubt, and as seen multiple times, Object Storage is a great and natural choice for archive and tape replacement. The key question now is: what is the last tape advantage that object storage can’t deliver? Think about that and you realize that the answer is not so obvious. Some people say mobility, but even there alternatives can be found. And I’m not sure mobility is what people expect from an archive.

Josh June 12, 2014 at 6:54 pm

It’s not surprising that NetApp’s StorageGRID isn’t in the conversation yet as it’s usually confined to the Healthcare space, but it would be worth looking into. There has been some press regarding F5’s ARX HSM integration, but it can be integrated with many other similar solutions. It offers NFS, CIFS and RESTful front-ends and utilizes the E-Series platform on the backend (DDP!), including the not-too-shabby 4U/60 drive enclosures for density.
http://www.netapp.com/us/media/ds-3038.pdf
(I work for a NetApp Partner)

Mark Pastor June 16, 2014 at 7:44 am

Mark Pastor – product marketing manager at Quantum – my 2 cents…

I agree with many of the previous comments. Erasure code technology protects a single (and only) instance of data very effectively, and it eliminates the cost and process complexities associated with replication to RAID or other secondary copy storage.

I respectfully disagree with the comment from Jan above, “… but you are locked into everything Quantum…”. Our object storage platform solutions offer NAS (CIFS/NFS) gateways as well as S3 native interface – keeping things very open for interfacing to your HSM system of choice. While our StorNext file system is one of the strongest shared workflow file systems on the market, and can offer substantial benefits for tiered storage architectures, it is not the only on-ramp to the Lattus object storage platform.

From a performance perspective, Lattus object storage systems have been deployed as part of active workflow environments in demanding media and entertainment and federal applications and elsewhere – performance of access is not a concern, particularly where tape is the incumbent. http://www.quantum.com/customerstories/theark/index.aspx

While many customers implement object storage in place of tape, there are still many that value the combination of these tiers. The benefits of policies associated with tape are still relevant. Some organizations require a physical copy of data located offsite. Others prefer to segment their content: the more valuable content will reside on object storage, and the content that is only retained for compliance or other non-business relevancy can be moved to tape, which continues to be the most cost effective storage technology out there. Yes, object storage systems are getting very close, but depending on the use case and scale of the less active data, tape is still a valuable option.

The new wave of object storage systems offer two key benefits:
1. Erasure code technology, which is useful in eliminating unnecessary replication of data, improves performance of the overall environment as well as substantially reducing storage costs.
2. An object interface (among others) that usually delivers a cloud, global access to content model. This valuable archive data can now have highly responsive, global access and availability.
A key aspect of a best of breed solution is the elimination of long term storage refresh and data migration woes. Where Quantum’s Lattus solution differs is that you can invest in a solution based on today’s storage component technology, and incrementally and non-disruptively replace storage component technology modules, and have data re-spread behind the scenes to the new components when you are ready to retire the old components. While the system you own in five years may not have any of the original storage components in it, your data will never leave the system.

There are some exciting new technologies associated with tape, like LTFS, that can really strengthen a tape-based archive. Here is a discussion of LTFS as well as erasure code technology. https://blog.quantum.com/index.php/the-boundary-between-primary-data-and-archive-data-has-blurred/

Good luck with your new architecture – it is a great time to be laying that out.

John ( other John ) June 20, 2014 at 1:54 am

Can I ask a question to all here?

Is not the idea of object stores to have the application directly address the underlying data, with persistence handled by an API?

As in very close to database functionality?

I have a primary confusion, because object stores make sense to me when you can take a URI and have that replicated in many places. I think this was covered by John Dickinson better above. In the context of needing to have a file available in multiple locations. Then a single address is great if the store system can replicate that to where it is needed.

My primary confusion extends a bit, in the sense that I think of object stores as being best used when you design an application for them.

File systems have been improving a lot lately, and taking on light database-like functionality. Microsoft at least publicly told the world they wanted a database fs as early as Win95. ReFS on Windows Server is interesting to look at for the way it was approached.

I think lots of applications which do not quite fit a database requirement, but which need certain kinds of transactional integrity, can be suited to rich file system design. Another rarely noted Microsoft technology is TxF, link below.

When you combine application logic with simple calls to a persistent store, you can get huge design wins when building an application. I’m Microsoft biased, because I found a niche of work, when involved with already MS hosted environments, in leveraging overlooked functionality. Some of the tech just sitting there unused can give you features that would normally be extolled by the big storage vendors. It’s a cute thing to be able to say, “oh, and you just got HSM through Storage Spaces so I only bought a much smaller flash disk for you, oh, and we’ve just allocated a couple of cores to running parity and you have file redundancy replication to near line rust…” There’s a lot in there which can make a huge difference in a smaller shop. Tell me you have a billion dollar transactional gig, and I’ll probably look to Hitachi, but that’s another game. I don’t want to shill, rather to draw attention to an area which I think is overlooked, where you can do big iron things in small environments, and I’m not NDA bound to not share benchmarks and other data. My side angle on this is that I think I’d earn more pitching the systems mentioned here, but I sometimes think those who work with big storage overlook something that could be interesting.

Okay, digression over. I get the feeling from this question that there’s an interest in looking at an object store system from the point of view of running existing applications in place right now, and leveraging object store benefits more later. It feels like a “what can I get now, that can run my existing app, and will that automatically benefit from all the theoretical facilities straight away?” I also feel the question might be about whether you can buy a storage system to replace what they have already, that offers capability for future possible redesign. Certainly if you can get a whole feature set in the same price bracket and lose nothing on just expanding what you have, that’s a nice thing.

It’s a large store already. All the bells and whistles as said.

Is there a lot of cold data?

Because object storage could be well suited to tiering as far as Amazon Glacier, if that maps nicely. The devil is in the metadata and the front end. But hypothetically, turning over big chunks of a petabyte size store to cold storage would be a nice win. Even if some redesign is required.

What I do not get, and would really appreciate any comment you can offer, is why object storage works if you don’t treat it primarily in the application. I thought the whole point is to offload persistence to the fs and to have global addressing, so possibly many applications can reference the same data. Medical records sound like the thing for this. So do the kind of OneNote container files that we’ve been using to dump the background to sales calls, bundling lots of kinds of files, even the archive if compliance phone call recordings are going in there. You can’t do full ACID on such a lot of varied data, but you can at least have ACID working at the revision save level. A URI for john_work_customerone could be appended john_work_customer__one_day_date just like VMS adds [n] after files for versioning, and the simplicity of that when writing a front end access system is neat. I set up a forced full copy on appending a telephone record, for example, so there’s a time stamp of the file of pitch materials and RFQs and references to match the last interaction, same with sending or receiving an email. The little app just wants to see where we were at with draft contracts and what documents had been exchanged, whenever there’s progress. Anyhow, that’s just smarter versioning. But with OneNote you just have a file, no matter how many people want to access that. Having one file with a single URI would be great, because you want that file close to multiple access points. If the fs is able to give one locator for a local file, know when to replicate that because you are looking at it from another location, or when mobile, and maintain integrity, that’s great. That’s the sort of thing I’m looking for.

But OneNote is not a database. I mean I cannot enforce transactional integrity at a level of editing the documents and get point in time rollback, because apart from it’s a kind of container, there’s too much complexity there. So having a really good file system that acts like a database for blobs, is the next best thing, and a definite win.

What I would like explained to me is why you would run a real database atop an object file store?

Surely this is making too many layers?

If you have a good backup software, maybe that’s doing HSM also, but you’ll get drivers for the database you use, which understand the file format and the database behavior so that data doesn’t get munched.

If you have an application that treats the fs as a database of sorts, you either don’t have that understanding at a driver level of the app, or you need to put some logic in somewhere to figure out the status of a file and what its needs are. Obviously if a big query is run infrequently and the fs thinks it’s cold data and shoves it off fast storage, you’ll take a hit. Possibly just having a front end optimizer that tells you there’s a wait for that data will satisfy a patient user. Basically there has to be some knowledge of what’s going on with the files, and the app will call a storage API to get a look at availability.

I think I’m assuming HSM is going on, because the question put top-notch HSM front and center in the current setup. What I am grappling with is the idea that object storage can manage a lot of extra metadata about a file that can be gotten by an application. But where is the logic placed? Is it transparent totally, as in you can sling normal files at an object store through a driver and some magic happens behind the scenes? Or is it a two way conversation the app has to have with the file system to get the best results?

Please forgive me my confusion. I may be describing my own problem or opportunity, rather than helping here. Is there anything inherent in an object fs that I am misunderstanding? I mean, it does seem possible to just plug in some systems, such as the Quantum one, and they’ll do an awful lot, out of the box. Do object file systems have the kind of software interface exposure I am thinking about? Is an object store something you can treat as general purpose right now? Would not the extra layer of abstraction be harmful to a big running transactional database? Are there gains to be made but the biggest gains come if you write your application to talk direct to the file system?

A connected question I have is, if object stores could make a performance break beyond the curve, would a vendor start getting access to an application layer in terms of a business relying on that, and would that in turn be a lock in? I can imagine that a super file system might replace many kinds of databases, especially the NoSQL kind, and it would be cool to just start thinking of everything as files, at a stretch that data ingest might be as simple, or crude, as copy and paste a folder. I do get the impression that storage companies are trying to move higher up the software stack, to get close to applications, and that might be a very good thing. I’m going to have a long re-read of the comments above, thank you all for chiming in, there’s about no place to hear it straight about these things, and you almost never get to hear a vendor explaining their thinking without getting feet under a table. I’d love to see more of these “ask storagemojo” entries, but I would love more that the questions were more detailed as to setup and aspirations. It’s very hard to figure out a generic response to what sounds a bit like “if I buy this new gear, will I get the cool features and not be fired?”. Sorry, I don’t mean to denigrate the questioner one bit, but I would have maybe understood this better if the question had included why they thought object storage was relevant, and how they thought they could get a win, and also if they are contemplating a new architecture for their applications.

All best from me ~john

(the Microsoft file system link)
http://en.wikipedia.org/wiki/Transactional_NTFS

John ( one more John ) July 1, 2014 at 7:55 pm

Yes, Ceph does have erasure codes now. But all erasure codes are NOT the same. The ones used in Ceph are either very primitive ( k = 2, m = 1 ) or very computationally expensive ( k = 10, m = 4 ): 1) k = 2, m = 1 is just the most inefficient RAID 5 ( 2 data disks with 1 parity disk ); and 2) k = 10, m = 4 uses the most computationally expensive Reed Solomon code ( just google it if you don’t know what Reed Solomon code is ). So Ceph either cannot provide a high enough degree of data protection for long-term archival use cases with its RAID 5 equivalent, or it demands a lot of computation.
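
To put the two profiles mentioned above side by side, here is a minimal sketch of the raw-to-usable ratio and fault tolerance of a k+m erasure code. It assumes an ideal (MDS) code such as Reed-Solomon, where any k of the k+m chunks suffice to rebuild an object. The function names are hypothetical, and this says nothing about how Ceph implements its codes or what they cost in CPU. Three-way replication is included only as a reference point.

```python
def ec_profile(k, m):
    """Raw-to-usable ratio and chunk losses tolerated by an ideal k+m erasure code."""
    return (k + m) / k, m


def replication_profile(copies):
    """Same two figures for plain replication with the given number of full copies."""
    return float(copies), copies - 1


profiles = {
    "EC k=2, m=1": ec_profile(2, 1),
    "EC k=10, m=4": ec_profile(10, 4),
    "3x replication": replication_profile(3),
}
for name, (ratio, losses) in profiles.items():
    print(f"{name}: {ratio:.2f}x raw per usable byte, survives {losses} lost chunk(s)")
```

Under these assumptions k = 2, m = 1 stores 1.5 bytes per usable byte and survives one loss, while k = 10, m = 4 stores 1.4 and survives four.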

More importantly, for each large object, Ceph can use at most 10 disks in parallel with k = 10, m = 4 ( forget about k = 2, m = 1, which is a joke for a production configuration ), even if you have hundreds of disks available for use. This means I/O performance is bad, which matches test results in our evaluations, and also one of the above comments.

I hope Ceph will get much better engineering talent to integrate real advanced erasure codes and make it at least reliable enough, at low cost, for archival use cases, now that it is part of Red Hat. Jerasure is just an academic toy library to demonstrate concepts, not for commercial production storage systems.

Girish January 24, 2015 at 8:19 pm

If the use case is data archival only, can a tape based solution be economically more viable than object storage (disk)?

Three points to consider –

1. Capital cost of a disk based storage system vs. cost of a tape based system (tape library/standalone drives + tapes)
2. Usable life of the archival system – Typical life of a disk drive/storage server is around 3-4 years, max 5 years, right? Typical life of a tape – documents state nearly 30 years (source: http://searchdatabackup.techtarget.com/tip/How-long-does-tape-last-really) – practically, let’s consider 10 to 12 years to ensure tapes remain readable by the newer tape drives in use at that time. If the requirement is to keep the data around for more than 3-4 years, then a tape based approach is definitely more economical.
3. Power consumption – Disks are always on, spinning, consuming power whereas tape is an offline medium; write the data, done with it. Might need to periodically read tapes to do some form of data consistency checks but apart from that, zero power consumption.
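
Point 2 can be turned into a rough count of how many times the media has to be bought over a retention period. This is a sketch only: the 30-year retention period is a hypothetical figure, the lifetimes are the ones quoted above, and prices, power and migration labor are deliberately left out.

```python
import math


def media_generations(retention_years, media_life_years):
    """Number of times the media must be purchased to cover the retention period."""
    return math.ceil(retention_years / media_life_years)


RETENTION_YEARS = 30  # hypothetical retention requirement, not from the comment

print("disk (4-year life):", media_generations(RETENTION_YEARS, 4))
print("tape (10-year life):", media_generations(RETENTION_YEARS, 10))
```

With these inputs disk is bought eight times and tape three times. Whether that makes tape cheaper depends on the per-generation costs, which the sketch does not model.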

Just something to consider..

Girish

Steve February 11, 2015 at 11:01 am

Girish, there are other mitigating factors that may make disk more desirable than tape to organizations:

1. Human capital to manage media: Often, not always, media must be removed from libraries and stored in an external location. Creating and managing a process by which to do that can get expensive.

2. While tapes may last 30 years, LTO’s timeline obsoletes media long before its life expectancy expires. This means that organizations will need to keep “old” tape drives around for as long as data might need to be restored. They will also need to consider keeping “old” copies of their backup application around, in order to ensure that “old” data can be restored via those “old” tape drives. Let’s not forget about legacy versions of operating systems to run that “old” backup application.

The alternative will be to invest in “conversion” infrastructure. Read data off of LTO5 for example, write it to LTO8.

The value proposition for either really depends on the organization itself and what they’re comfortable committing to.

Jacob Marley February 12, 2015 at 12:13 am

Steve, if a tape-based solution lasts longer than the 3 to 4 years a disk-based solution does, tape still wins.
