I’ve had the pleasure of moderating a half-dozen panel discussions on Big Data and object storage in the last few months. It’s been a learning experience.
Big Data has always been as big as we could afford, be it block, file or object. Google’s MapReduce, running on massive commodity-based scale-out GFS storage, inspired the current Big Data craze.
If storage were the only problem we could stop with a discussion of Amplidata, the object storage company that sponsored StorageMojo’s participation (check out the StorageMojo object storage video to get a quick overview). But there’s much more to it than that.
Defining Big Data
Ever notice that the number of disk drives in most infrastructures stays fairly constant? The growth in drive capacity parallels the growth in data, so disk drive populations should be stable.
Thus a definition of Big Data based on the number of drives is simple and flexible over time. Defining the lower bound of Big Data as an application that uses 100 large capacity SATA drives means that with 4TB drives you aren’t doing Big Data unless you have 400TB of raw storage.
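To make the rule of thumb concrete, here’s a rough back-of-the-envelope sketch in Python; the drive capacities are illustrative, not a product list:

    # Rough sketch of the rule of thumb above: ~100 large-capacity SATA drives.
    DRIVE_COUNT_THRESHOLD = 100

    def big_data_floor_tb(drive_capacity_tb):
        """Raw capacity below which, by this definition, you aren't doing Big Data."""
        return DRIVE_COUNT_THRESHOLD * drive_capacity_tb

    for capacity_tb in (2, 3, 4):
        print(f"{capacity_tb}TB drives: Big Data starts around {big_data_floor_tb(capacity_tb)}TB raw")
    # With 4TB drives the floor is 400TB raw, as noted above.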
Smaller data sets can be vital to business or research; this isn’t a “mine is bigger than yours” competition. But as a rule of thumb, you can probably live with filers if you’re certain your application will never need more than 100 drives; if it will, you should consider object storage.
Technology
There is no doubt that advanced erasure codes – variously referred to as rateless or fountain codes – are technically and financially superior to the older and much more common Reed-Solomon erasure codes for Big Data.
They are lower overhead, offer higher redundancy and better security than RS codes. Want to survive 4 failures with ≈50% overhead? Rateless codes are the answer.
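To put numbers on “survive 4 failures with ≈50% overhead”, here’s a minimal sketch of the arithmetic for any erasure-coded layout with k data and m parity fragments; the 8+4 split is my example, not a vendor configuration:

    def erasure_overhead(k_data, m_parity):
        """Overhead and failure tolerance of a k+m erasure-coded layout.
        For an optimal code, any k of the k+m fragments rebuild the object,
        so the layout survives m failures at m/k extra capacity."""
        return {"fragments": k_data + m_parity,
                "survives_failures": m_parity,
                "capacity_overhead": f"{m_parity / k_data:.0%}"}

    # 8 data + 4 parity -> survives 4 failures at 50% overhead, versus 400%
    # overhead for five-way replication with the same failure tolerance.
    print(erasure_overhead(8, 4))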
Moving Big Data
Shipping boxes of big SATA drives overnight is still the high-bandwidth solution – assuming the drives survive – but the physical hassle is prohibitive for most operations. Assuming you need to move gigabytes regularly, you need two things: a fast transport and a storage system that can manage sustained write performance – which not all can.
The friendly folks at Aspera software have figured out how to achieve line speed data rates across long distances, a trick TCP has never learned. They can feed the data directly to an object storage system.
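For a feel of why a single long-haul TCP stream falls short, here’s a rough sketch using the well-known Mathis approximation for steady-state TCP throughput; the link numbers are made up for illustration:

    from math import sqrt

    def tcp_throughput_mbps(mss_bytes, rtt_ms, loss_rate):
        """Mathis et al. approximation: throughput <= MSS / (RTT * sqrt(loss)).
        Long round-trip times and even tiny loss rates crush a single stream."""
        bytes_per_sec = mss_bytes / ((rtt_ms / 1000.0) * sqrt(loss_rate))
        return bytes_per_sec * 8 / 1e6

    # Hypothetical cross-country link: 1460-byte segments, 80 ms RTT, 0.01% loss.
    print(f"{tcp_throughput_mbps(1460, 80, 0.0001):.0f} Mbps per stream")
    # Most of a 10 Gbps pipe goes unused, which is the gap tools like Aspera target.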
Integration
Object storage isn’t a drop-in replacement for NAS: the common REST interface is much simpler than NFS 4.1 or SMB 3.0. Many new apps are now written to use REST, but what about legacy apps that expect NAS?
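For context, this is roughly what that “much simpler” REST interface looks like in practice: a minimal sketch of an object PUT and GET over plain HTTP, with the endpoint, bucket and authentication handling as placeholders rather than any particular vendor’s API. Legacy NAS applications speak nothing like this, which is the gap the vendors below try to fill.

    import requests  # plain HTTP client; request signing omitted for brevity

    ENDPOINT = "https://objects.example.com"  # hypothetical object store
    BUCKET, KEY = "backups", "2013/q1/archive.tgz"

    # Store an object: one HTTP PUT, no mount, no locks, no session state.
    with open("archive.tgz", "rb") as f:
        requests.put(f"{ENDPOINT}/{BUCKET}/{KEY}", data=f)

    # Retrieve it: one HTTP GET.
    obj = requests.get(f"{ENDPOINT}/{BUCKET}/{KEY}")
    open("restored.tgz", "wb").write(obj.content)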
At least one vendor, Panzura, offers a NAS front end to object storage. Their Global Cloud Storage System serves up NAS protocols while using object storage on the back end.
But they only support a few back-end options, so you’d better like their choices. There are options besides Panzura, though.
The business problem
The technology is there, but as I asked at one of the panels, “if this stuff is so freakin’ great, why isn’t everyone using it?” It isn’t only enterprise IT inertia.
There’s another problem: business units and enterprises aren’t used to seeing storage as horizontal infrastructure, where scale-out storage wins. Even major adopters of object storage typically use a single app to justify an investment.
Networks are a horizontal infrastructure. We buy them anyway. It is a solvable problem that I’ll be writing more about later.
The StorageMojo take
A few years ago there was debate about public vs private clouds. I continue to think there are business cases to be made for both.
But they aren’t the same cases.
As the commoditization of scale-out architectures grows and the cost of public cloud storage becomes an issue, the advantage that public providers offer today will decline. AWS is driving app development to the S3 API, and those apps will migrate easily to private clouds.
The growth of private clouds won’t crater AWS. Yet AWS will enable private clouds.
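A quick sketch of why that migration should be easy: S3 SDKs such as boto3 let the same application code target a different, S3-compatible endpoint by changing configuration rather than code. The private endpoint and credentials below are placeholders:

    import boto3  # AWS SDK for Python

    # Same S3 API calls, different target: swap the endpoint, keep the code.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.private.example.com",  # hypothetical private cloud
        aws_access_key_id="APP_KEY",
        aws_secret_access_key="APP_SECRET",
    )
    s3.put_object(Bucket="analytics", Key="results/run-001.json", Body=b"{}")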
Courteous comments welcome, of course. How do you define Big Data?
What is big data?
It’s actually quite simple: You have big data whenever a single host isn’t enough to either store or process your data.
What does it mean?
Suppose you have a PostgreSQL database and you run into scaling problems. There’s a choice now: either get “better” hardware so that you can continue to work on a single host, or split the database to span multiple hosts.
Suppose you have a file store and all disks are full. You can either buy larger disks or use some distributed storage system where you just add hosts to expand the total storage capacity.
In both cases you are dealing with big data.
Big data (for me) isn’t about some threshold of X MB of data. It’s simply the case where you decide you need a distributed system to handle your data.
Hi Robin,
Great article… I’m glad to see someone is brave enough to draw a line in the sand. However, I half-heartedly disagree with your definition.
Today, Big Data is a problem space surrounding the harmonization of analytics for combined unstructured and structured data from multiple sources both local and distributed. It’s complicated by issues of scale, costs, performance, provenance, consistency, tampering and availability.
Many storage companies believe they need to respond to capture the big data market. Not to insult anyone, but most storage companies lack a detailed understanding of the use cases, applications and the value of integration. A good example is forcing object storage into the space when there are other, more compatible alternatives. Sisyphus pushing the rock up that hill will probably have an easier time than those force-feeding object storage.
I would have to question the current value of advanced erasure codes in this model; I’m a strong advocate of T10 DIF and end-to-end data integrity checking. However, today’s issues with correctness, provenance, consistency and tampering of data seem to be a more overwhelming problem that may outweigh the value provided by new erasure codes.
The big data technologies, like Hadoop, Clojure and dozens of others, are an interesting part of the space. Many of these NoSQL technologies are applying methods that were vetted in the early and mid-’80s. If they didn’t work back then on data sets that were 50MB, what makes anyone think they will work any better on data sets of 50TB?
I think you are correct: there are markets for both private and public big data. Is the interface going to be object-based? Well, probably not – not even if they claim some proprietary, protected RESTful interface is “open”.
Today, we cannot separate commodity analytics from Big Data. Data processing is the core enabler of the market; without the analytics we have no need for big data.
Glad to see StorageMojo is going strong and staying relevant.
Big Data… 3 sources for anything that is factual.
We have DATA, but what is needed is good INFORMATION.
There is lots of information published and placed on the Cloud.
But human vs. machine generated: on a good day a human types 60-100 wpm, MapReduced prior to writing.
Today, the HPC market must process tera-, peta- and exa-scale data to say something intelligible.
Using “MapReduce” to find “INFORMATION” (a full table scan) is like pausing for minutes before thinking and being able to “compute” a response.
7 seconds? It’s big data (because it is probably being run on a Hadoop cluster).
Reference: look at the 2012 election – more compute.
Researching HDS vs. EMC and then posting a blog with a synopsis from the web to hone a storage “pro/con” requires big data.
How much data does it take to validate? It’s a curious exercise, but I would presume it’s not a megabyte.
Thanks for the blogs, posts, pricelists and keeping Mojo!
-TAJ, SolutionsArchitect.com
“There is no doubt that advanced erasure codes – variously referred to as rateless or fountain codes – are technically and financially superior to the older and much more common Reed-Solomon erasure codes for Big Data.
They are lower overhead, offer higher redundancy and better security than RS codes. Want to survive 4 failures with ≈50% overhead? Rateless codes are the answer.”
Yeah, right. Except that Reed-Solomon codes are optimal (MDS), and there are no optimal rateless codes known to date. Which means that at a given level of redundancy, RS offers strictly better security.
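For readers weighing the two claims, here is a small sketch of the commenter’s point: an MDS code such as Reed-Solomon can rebuild from any k of its n fragments, while rateless/fountain codes typically need somewhat more than k symbols to decode. The 10% reception overhead below is purely illustrative; real designs vary.

    import math

    def worst_case_failures_tolerated(n, k, epsilon=0.0):
        """With n fragments stored, decoding needs ceil(k * (1 + epsilon)) of them,
        so the layout tolerates n minus that many losses in the worst case."""
        return n - math.ceil(k * (1 + epsilon))

    n, k = 16, 12  # hypothetical 12+4 layout, ~33% capacity overhead
    print("MDS (Reed-Solomon):    ", worst_case_failures_tolerated(n, k))        # 4
    print("Rateless, 10% overhead:", worst_case_failures_tolerated(n, k, 0.10))  # 2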
I used to be a regular reader of StorageMojo, but recently I started wondering whether it’s worth it. I feel that the quality of posts has dropped significantly.
Hasn’t HDS been doing object storage with NAS access for a while now? (HDI for the HCP system, and BlueArc, I think, natively does both.)