I’ve had the pleasure of moderating a half-dozen panel discussions on Big Data and object storage in the last few months. It’s been a learning experience.
Big Data has always been as big as we could afford, be it block, file or object. Google’s MapReduce, running on massive commodity-based scale-out GFS object storage, inspired the current Big Data craze.
If storage were the only problem we could stop with a discussion of Amplidata, the object storage company that sponsored StorageMojo’s participation (check out the StorageMojo object storage video to get a quick overview). But there’s much more to it than that.
Defining Big Data
Ever notice that the number of disk drives in most infrastructures stays fairly constant? The growth in drive capacity parallels the growth in data, so disk drive populations should be stable.
Thus a definition of Big Data based on the number of drives is simple and flexible over time. Defining the lower bound of Big Data as an application that uses 100 large capacity SATA drives means that with 4TB drives you aren’t doing Big Data unless you have 400TB of raw storage.
Smaller data sets can be vital to business or research; this isn’t a “mine is bigger than yours” competition. But as a rule of thumb, you can probably live with filers if you’re certain your application will never need more than 100 drives. If it will, you should consider object storage.
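To make the drive-count arithmetic concrete, here is a minimal Python sketch. The 100-drive bound and the 4TB example come from the definition above; the function names and the other drive sizes are my own illustrative assumptions.

```python
import math

# Lower bound proposed in the text: an application that uses 100 drives.
BIG_DATA_MIN_DRIVES = 100


def raw_threshold_tb(drive_tb, min_drives=BIG_DATA_MIN_DRIVES):
    """Raw capacity (TB) at which an application reaches the drive-count bound."""
    return drive_tb * min_drives


def drives_needed(raw_tb, drive_tb):
    """Whole drives required to hold raw_tb of raw storage."""
    return math.ceil(raw_tb / drive_tb)


def is_big_data(raw_tb, drive_tb, min_drives=BIG_DATA_MIN_DRIVES):
    """True when the raw capacity needs at least min_drives drives."""
    return drives_needed(raw_tb, drive_tb) >= min_drives


# The threshold moves with drive capacity (2 and 8 are hypothetical sizes).
for size_tb in (2, 4, 8):
    print(f"{size_tb}TB drives: Big Data starts at {raw_threshold_tb(size_tb)}TB raw")
```

The point of the definition shows up in the loop: the drive count stays fixed while the capacity threshold scales with whatever drive size is current.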
There is no doubt that advanced erasure codes – variously referred to as rateless or fountain codes – are technically and financially superior to the older and much more common Reed-Solomon erasure codes for Big Data.
They have lower overhead and offer higher redundancy and better security than RS codes. Want to survive 4 failures with ≈50% overhead? Rateless codes are the answer.
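The "4 failures with ≈50% overhead" figure is easy to check with back-of-the-envelope arithmetic. The sketch below assumes an idealized code in which an object is split into k data fragments plus m redundant fragments and any k of them can rebuild it. That is an assumption for illustration, not a description of any vendor's implementation; real rateless codes typically need slightly more than k fragments, which is one reason for the "≈". The 8+4 layout and the function names are hypothetical.

```python
def erasure_overhead(k, m):
    """Extra storage for k data + m redundant fragments, as a fraction of the data size."""
    return m / k


def replication_overhead(failures):
    """Extra storage, as a fraction of the data size, when surviving
    `failures` lost copies by keeping failures + 1 full copies."""
    return float(failures)


failures = 4
k, m = 8, failures  # hypothetical layout: 8 data fragments + 4 redundant

print(f"erasure coding {k}+{m}: {erasure_overhead(k, m):.0%} overhead")
print(f"replication, {failures + 1} copies: {replication_overhead(failures):.0%} overhead")
```

Under these assumptions an 8+4 layout survives 4 lost fragments at 50% overhead, while plain replication needs five full copies, or 400% overhead, for the same protection.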
Moving Big Data
Shipping boxes of big SATA drives overnight is still the high-bandwidth solution, assuming the drives survive, but the physical hassle is prohibitive for most operations. Assuming you need to move gigabytes regularly, you need two things: a fast transport and a storage system that can sustain high write performance, which not all can.
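For a sense of why the shipping box is still the bandwidth champion, here is a rough calculation. The numbers are assumptions for illustration: 100 drives of 4TB each, a 24-hour door-to-door transit, and a 10 Gbit/s link running at full line rate. Time spent loading and unloading the drives is ignored, which flatters the box.

```python
def shipped_gbit_per_s(drives, drive_tb, transit_hours):
    """Effective bandwidth of a box of drives in transit, in Gbit/s (decimal units)."""
    total_bits = drives * drive_tb * 1e12 * 8
    return total_bits / (transit_hours * 3600) / 1e9


def transfer_hours(total_tb, link_gbit_per_s):
    """Hours to move total_tb over a link sustained at link_gbit_per_s."""
    total_bits = total_tb * 1e12 * 8
    return total_bits / (link_gbit_per_s * 1e9) / 3600


box = shipped_gbit_per_s(drives=100, drive_tb=4, transit_hours=24)
wire = transfer_hours(total_tb=400, link_gbit_per_s=10)

print(f"100 x 4TB drives overnight: about {box:.1f} Gbit/s")
print(f"400TB over a 10 Gbit/s link: about {wire:.1f} hours")
```

With those assumptions the box works out to about 37 Gbit/s, while the same 400TB takes roughly 89 hours on a saturated 10 Gbit/s link.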
The friendly folks at Aspera software have figured out how to achieve line speed data rates across long distances, a trick TCP has never learned. They can feed the data directly to an object storage system.
Object storage isn’t a drop-in replacement for NAS: the common REST interface is much simpler than NFS 4.1 or SMB 3.0. Many new apps are now written to use REST, but what about legacy apps that expect NAS?
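To show how small the REST surface is, here is a sketch of the three basic object operations expressed as plain HTTP requests with Python's standard library. The endpoint, bucket and key names are hypothetical, and the requests are only built, not sent. Real object stores also require authentication headers, which are omitted here.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint; substitute your object store's address.
ENDPOINT = "https://objects.example.com"


def object_url(bucket, key):
    """URL for an object; the key is percent-encoded but '/' is kept."""
    return f"{ENDPOINT}/{bucket}/{urllib.parse.quote(key)}"


def put_request(bucket, key, data):
    """Store a whole object: HTTP PUT with the object bytes as the body."""
    return urllib.request.Request(
        object_url(bucket, key),
        data=data,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


def get_request(bucket, key):
    """Fetch an object: HTTP GET on the same URL."""
    return urllib.request.Request(object_url(bucket, key), method="GET")


def delete_request(bucket, key):
    """Remove an object: HTTP DELETE on the same URL."""
    return urllib.request.Request(object_url(bucket, key), method="DELETE")


req = put_request("genomics", "runs/2013/sample 01.bam", b"example bytes")
print(req.get_method(), req.full_url)
```

Everything here is an HTTP verb applied to a URL. A legacy app written against NAS expects file-system calls such as open, seek and lock instead, which is why it can't simply be pointed at an object store.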
At least one vendor, Panzura, offers a NAS front end to object storage. Their Global Cloud Storage System serves up NAS protocols while using object storage on the back end.
But they only support a few back-end options, so you’d better like their choices. There are options besides Panzura, though.
The business problem
The technology is there, but as I asked at one of the panels, “if this stuff is so freakin’ great, why isn’t everyone using it?” It isn’t only enterprise IT inertia.
There’s another problem: business units and enterprises aren’t used to seeing storage as horizontal infrastructure, where scale-out storage wins. Even major adopters of object storage typically use a single app to justify an investment.
Networks are a horizontal infrastructure. We buy them anyway. It is a solvable problem that I’ll be writing more about later.
The StorageMojo take
A few years ago there was debate about public vs private clouds. I continue to think there are business cases to be made for both.
But they aren’t the same cases.
As the commoditization of scale-out architectures grows and the cost of public cloud storage becomes an issue, the advantage that public providers offer today will decline. AWS is driving app development to the S3 API, and those apps will migrate easily to private clouds.
The growth of private clouds won’t crater AWS. Yet AWS will enable private clouds.
Courteous comments welcome, of course. How do you define Big Data?