From the category archives:

Architecture

Cold storage

by Robin Harris on Monday, 29 June, 2009

As the economics of data storage push more and more data onto disks, the energy efficiency of data storage is ever more critical. Storage is anti-entropic, so keeping bits organized requires energy. How can we minimize that energy input?

Data cooling is the major reason disk drives have remained a viable storage strategy for 50 years. The IOPS/MB has dropped steadily for decades, yet disks remain the preferred tool outside of very low latency or high-bandwidth applications.

Looking forward to massive scale-out storage infrastructures the data will get even cooler. Copan’s MAID architecture, which turns disks off when not in use, is a rational extension of the cool data concept.

As data continues to cool we will eventually see millions of disk drives – along with tapes – sitting idle. But even if we have cold archive disks, one of disk’s big advantages over tape is the ease with which data can be spread over multiple drives for data protection.

Not RAID 5
You can’t count on any one hard drive actually restarting after a few months or years of idle time. Nor can you expect that any specific sector will be readable. Cold data requires even more advanced – energy efficient, disaster-tolerant – storage techniques than RAID arrays offer today.

Oh, and they need to be cheap too. Which means RAID arrays won’t get this business. What about open source software?

Erasure coding
In A Performance Evaluation and Examination of Open-Source Erasure Coding Libraries For Storage (pdf) James S. Plank, Jianqiang Luo, Catherine D. Schuman, Lihao Xu, and Zooko Wilcox-O’Hearn examine 5 open source implementations of 5 different erasure codes: Reed-Solomon, Cauchy Reed-Solomon, Even-Odd, Row Diagonal Parity and Minimal Density RAID 6 codes.

Typical storage system with erasure coding – figure from the paper

Several companies – including Cleversafe, NetApp and Panasas – use erasure codes today to ensure higher data availability. What Plank et al. wanted to know is how well these codes work and what system designers need to know to use them effectively.
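For readers new to the idea, here is a minimal sketch of the simplest possible erasure code: a single XOR parity block, the RAID 4/5 idea. It is not one of the codes the paper benchmarks, and the function names are made up for illustration. It only shows how a lost block can be rebuilt from the survivors.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def encode(data_blocks):
    """Return the data blocks plus one parity block (k data + 1 parity)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    """Rebuild the single block at lost_index from the surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

# Three equal-length data blocks, each imagined to live on its own drive.
data = [b"cold", b"data", b"bits"]
stripe = encode(data)          # a fourth drive holds the parity block
rebuilt = recover(stripe, 1)   # pretend drive 1 never spun back up
```

Single parity survives only one lost block per stripe. Tolerating two or more failures is where the Reed-Solomon and RAID 6 codes from the paper come in.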

The OSS implementations tested are:

  • Luby, a C version of CRS.
  • Zfec, a highly tuned Reed-Solomon library.
  • Jerasure, a GNU LGPL C library that includes RS, CRS and 3 MDR6 among others.
  • Cleversafe released an open source version of their dispersed storage system, from which the authors used just the erasure coding parts.
  • EVENODD/RDP, patented codes not available to the public and included for performance comparison.

Most important result
The study found that while tuning boosts performance and some architectures are much faster than others,

Given the speeds of current disks, the libraries explored here perform at rates that are easily fast enough to build high performance, reliable storage systems.

Translation: this isn’t string theory.

Other findings include:

  • The RAID 6 codes out-perform the general purpose codes.
  • For non-RAID 6 codes, the Cauchy Reed-Solomon performs much better than straight RS.
  • CPU architectural features, such as cache size and memory behavior, make it hard to predict an optimal data structure for a given code configuration.
  • The code’s memory and cache footprint can have a large impact on performance.
  • Specialized RAID 6 codes hold promise for creating efficient storage that can withstand numerous concurrent disk failures.
  • Multicore performance issues are largely unexplored.

The StorageMojo take
The architect for a planned commercial 200 PB cold-storage infrastructure confessed that he can see how to get to 25 PB today, but not beyond. Yet they have no choice but to start building now.

This market’s eventual structure may parallel that of today’s tape silo market: several hundred large customers who are continuously churning through rolling upgrades of media and servers.

Right now, tape silos still enjoy an economic advantage over disks. But it looks like disks have more degrees of freedom to improve their cold storage economics than tape.

In just 5 years the first exabyte cold storage systems will be on the drawing boards. It is time for disk companies to get serious about a tape-replacing archival disk. And for clever startups to focus on this emerging market.

Courteous comments welcome, of course.


Not a filesystem, not a database.

by Robin Harris on Wednesday, 17 June, 2009

Jeff Darcy has a good post on key data stores, like Amazon’s Dynamo, and how they differ from filesystems and databases. He relates his transition from a filesystem purist to a more flexible perspective.

The thing that really changed my mind about this was an observation in the Dynamo paper: strong consistency reduces availability. I’ve always thought of data availability in terms of data not being lost or stranded on the other side of a failed network connection. The Dynamo insight is that many applications have to do a lot of work within a small acceptable-response-time window, and to make sure that they fit into that window they have to impose deadlines on all sub-operations including data access. If consistency issues make data unavailable within that deadline then they’ve made it unavailable period, with practically the same effect as if the data were unavailable in any other sense.

In short, while there is a class of applications where traditional consistency is important, there is an emerging class where strong consistency isn’t affordable or necessary. Good stuff.
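A toy model makes Jeff’s point concrete. This is a sketch under loud assumptions, not Dynamo’s actual protocol: the latencies are invented, the function name is hypothetical, and a read is simply treated as finished once the required number of replicas have answered.

```python
def read_outcome(replica_latencies_ms, required_replies, deadline_ms):
    """Return 'ok' if enough replicas answer before the deadline, else 'unavailable'.

    The read completes when the required_replies-th fastest replica responds.
    """
    completion_ms = sorted(replica_latencies_ms)[required_replies - 1]
    return "ok" if completion_ms <= deadline_ms else "unavailable"

# Hypothetical latencies for three replicas; one is slow (busy, or far away).
latencies = [4, 9, 250]
deadline = 50

relaxed = read_outcome(latencies, required_replies=1, deadline_ms=deadline)
strict = read_outcome(latencies, required_replies=3, deadline_ms=deadline)
```

Waiting for every replica, the strict case, misses the deadline because of one straggler, so the data is effectively unavailable. Accepting the first answer meets the deadline at the price of possibly reading a stale copy.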

Another point
Many of the features that make up these non-FS/non-DB stores seem to have a lot in common with object storage. In a highly mobile world the whole idea of placing a file in cyberspace by a path name is anachronistic at best: it could be, physically, almost anywhere and is most likely in several places at once.

The StorageMojo take
While the name “object” is problematic for market acceptance, the concept of managing objects in a flat address space – like the web itself – is a better fit for a mobile networked world. There is a major opportunity to move file management infrastructure forward to reflect the world we now live in rather than a 35 year old server environment.
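As a rough illustration of a flat address space, here is a minimal in-memory sketch. The class name is hypothetical, and using a content hash as the object ID is my assumption for the example; real object stores assign IDs in various ways.

```python
import hashlib

class FlatObjectStore:
    """Toy in-memory object store: one flat key space, no directories."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        """Store data and return its object ID (SHA-256 of the content)."""
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        """Fetch an object by ID; where it physically lives is not the caller's concern."""
        return self._objects[object_id]

store = FlatObjectStore()
oid = store.put(b"frame 0001")
```

The caller holds an ID rather than a path, so the object can move or be replicated without the name changing.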

Courteous comments welcome, of course. Thanks to Wes Felter’s Hack the Planet blog for the link to Jeff’s post.


Outrageously cool new hard drive

by Robin Harris on Monday, 15 June, 2009

DataSlide has come out of stealth mode with a very creative SSD replacement technology. They call it a Hard Rectangular Disk or HRD.

Here’s their quick overview:

DataSlide applies technology in new, patented ways to achieve unprecedented high performance 160,000 IOPS & 500MB/sec and low power <4 Watts for a magnetic storage device:

  1. A piezoelectric actuator keeps the rectangular media in precise motion
  2. A diamond solid lubricant coating protects the surfaces for years of worry free service
  3. A massively parallel 2D array of magnetic heads reads from or writes to up to 64 embedded heads at a time

Here’s a diagram, courtesy DataSlide:

But that’s not all. According to the redoubtable Chris Mellor at The Register a

. . . 2-dimensional array of 64 read-write heads, operating in parallel, . . . positioned above an piezo-electric-driven oscillating rectangular recording surface. . . .

The data organization compared to a disk drive looks like this:
courtesy DataSlide

Chris also reports that Oracle’s Embedded Global Business Unit is working with DataSlide to incorporate a database to create a “smart” storage device for use in I/O intensive “multiple concurrent stream” applications.

The company says the drive is at the prototype stage and uses existing high-volume production technologies, including perpendicular recording media, semiconductor lithographic heads and LCD glass treatments.

The StorageMojo take
DataSlide has taken much from IBM’s Millipede concept and reimagined it using common technologies. While much remains to be done to productize the prototype, the fact of such architectural creativity should spur new thinking at the hard drive companies.

Of course, just like SSDs, with such low latencies it doesn’t make much sense to stick the device at the end of a long, complex, high-latency interconnect chain. PCI-e HRD card, anyone?

Also, the relatively low capacity – 36GB – of the prototype device suggests it may slot in between larger capacity SSDs and DRAM. Until we know the economics though that is almost baseless speculation.

Let’s hope they can get it to market in less than 3 years. And let the based speculation begin!

Courteous comments welcome, of course. This post was updated from the original with the diagrams and some minor edits.


Atmos gets no love from EMC sales

by Robin Harris on Tuesday, 9 June, 2009

A couple of reliable informants tell me the same story: EMC’s Atmos is in a fight for its life. Symm and Clariion sales people are treating the newborn product as a competitor, not another EMC product.

The dozen or so Atmos sales people – yes, they have a tiny dedicated salesforce – are finding the well poisoned almost anywhere they go. Issues such as performance, stability, quality and future support of Atmos are reportedly being raised. Perfectly fair questions for any v.1.0 product – but they’re usually asked by competitors.

To be fair to the EMC field, the Atmos product web page is not up to EMC’s usual standards – a customer testimonial is conspicuous by its absence. Nor is there much on the business case for cloud storage.

Sales people don’t get medals for being the first to sell a radical new product. The experienced ones stay away until they get good reports from someone they trust. With Atmos that could be a while.

The StorageMojo take
In EMC’s famously sales-driven culture the local offices are used to doing as they please. As long as they make their numbers Hopkinton doesn’t mind.

But Atmos is different. Scale out architectures are the future of the industry and Atmos is EMC’s entry into the race. But EMC’s sales force doesn’t want to sell an immature product – or an architecture that will replace much of their current revenue with cheap commodity capacity.

The Atmos team is rumored to have a 1 year dispensation from making money or even many sales. That may need to be extended a couple of years.

At some point EMC’s sales force will need to get on board with Atmos. EMC better hope that some other scale-out vendor doesn’t get in those accounts first.

Courteous comments welcome, of course.


Configure a 100 TB HD video infrastructure

by Robin Harris on Sunday, 7 June, 2009

The video folks have an interesting set of problems: large needs; major bandwidth; time-critical collaboration; lots of metadata; and more. Like budgets. I do some video production myself and empathize.

They are today where most of us will be in 10 years: lots of large files; local and remote sharing; processor and bandwidth intensive operations; large archives of wanted and rarely accessed files. Today high-end video folks are working at 2k, 4k and, sometimes, 8k video resolutions – and 10 years from now I wouldn’t be surprised if home users were too.

What prompts this is a note I received from, well, I’ll let him introduce himself.

I have a boutique post-production company and I’m a filmmaker. We are small, under a dozen, but swell to a few times that size with freelancers on a project-by-project basis. Because we work with very high resolution media, we need a lot of space, and very high throughput to each user. . . . [W]e’re all working with 2K and 4K media (300 and 1200MBps respectively to EACH user) and 3D animation rendering. . . . We use a mix of Linux, Windows, and OS X clients. In total, we could easily make use of 100TB+ right now, and prefer to stop archiving everything to tape and deleting it, but rather migrate to another tier of storage but keep in one global namespace with the tape just for disaster recovery. We also need security administration.

I can’t find a storage system that does all this. DataDirect Networks seems to be the du jour high-end storage for my industry, and supposing I’m willing to finance that big-ticket brand, they still don’t have a filing system answer. They’re suggesting StorNext or CXFS, and I know the multi-user scalability and expansion limitations well (can anybody say “forklift”?).

The closest I’ve come is Lustre. It seems like it would fit the bill nicely, especially since we’re savvy to integrate in-house, except that it is Linux only, and NFS/CIFS gateways don’t seem like a great idea. I keep hearing they’re working on at least a Windows client, but who knows when it will be ready?

Can you help at all? What have I overlooked? Doesn’t anyone make what I’m looking for?

Short answer to last question:
No.

Longer answer:
No. But there are workarounds.

For those new to video, here’s an abbreviated chart of some video rates in megabytes per second:
[Chart adapted from Integrity Data Systems, which offers the whole chart. Aspect ratios and frame rates left out.]
Update: Larry Jordan, a writer and trainer in video editing, graciously wrote to let me know that the above data rates are uncompressed – and that most production houses would use compressed data. The amount of compression varies based on the codec as Larry explains in this informative post. End update.
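The arithmetic behind uncompressed rates is simple enough to sketch. The frame sizes below (2048×1556 and 4096×3112, common full-aperture film scan sizes), the 4 bytes per pixel and the 24 frames per second are my assumptions, not figures taken from the chart. With them, the results land close to the 300 and 1200 MB/s our correspondent quotes.

```python
def uncompressed_rate_mb_per_s(width, height, bytes_per_pixel, frames_per_second):
    """Uncompressed video data rate in MB/s (1 MB = 1,000,000 bytes)."""
    return width * height * bytes_per_pixel * frames_per_second / 1_000_000

# Assumed full-aperture scan sizes at 24 fps, 4 bytes per pixel.
rate_2k = uncompressed_rate_mb_per_s(2048, 1556, 4, 24)   # roughly 306 MB/s
rate_4k = uncompressed_rate_mb_per_s(4096, 3112, 4, 24)   # roughly 1,224 MB/s

# Size of a single 4k frame under the same assumptions, in MB.
frame_4k_mb = 4096 * 3112 * 4 / 1_000_000                 # roughly 51 MB
```

Doubling the resolution in each dimension quadruples the rate, which is why the jump from 2k to 4k is so painful for interconnects.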

Issue 1: Interconnects
GigE won’t even handle 32-bit RGB standard def video. And when you get into HD video it gets hairier fast. Trunk multiple GigE’s? 10GbE? 4x Infiniband? FC? eSATA or PCI-e direct attached storage?

Issue 2: Virtualization
A single address space is a wonderful thing. You’ll need a software layer that clusters multiple boxes. You’ll also probably want to build an archive infrastructure that is distinct from your higher performance working set storage, but some vendors will disagree.

Likely software suspects include IBRIX, Parascale, Caringo, MatrixStore, Bycast and Permabit.

On the combined HW/SW side there’s Panasas and Isilon. Something tells me there are some other options, like HP’s Extreme Data Storage 9100, that are also applicable.

Lustre is not a product I would recommend since it was designed for HPC, a market where PhDs work as sysadmins. Sun may have tamed it since they bought it, but it is a non-trivial piece of software.

Come one, come all
StorageMojo readers are invited to offer their 2¢ worth. Architecting is non-trivial, especially if money is an object.

Update:
Our interlocutor wrote in to add some detail:

thanks for the response. Here’s some answers:

– We can manage expensive interfaces like 10GigE and Infiniband QDR. We’ve been paying for dual-channel 4Gb FC for the past few years, after all. I just want to also allow standard Gigabit connections to the cheap seats without a lot of complexity. So I guess the jargon for that would be “multiprotocol” switching?

– The large naming space might be a luxury. The fact is that jobs come in one of three general sizes, and we could have volumes of that size waiting to take on new jobs as they come in, so at least there is one namespace per job. As you said, capacity is cheap…

– Truth is I am pretty savvy, but other than that we have a lot of power desktop users but not sysadmin types. I contract some people with steady part-time work, but it has been our business model to try to keep as many of our full-time people on the creative and producing side as possible, and not in support/administration.

The one thing I don’t understand is what you say about Infiniband not being so great when there’s lots of node churn?

I know what you mean about DAS, but I think I’ve ruled out distributing the data through push/pull from a central repository. The fact is jobs just move too fast through here for that, and we often have about two seconds notice that we need to bring a certain job’s data to System X, Y or Z to do work on it. It’s very dynamic.

I see some brands in your blog post I haven’t checked on yet.

What turned me onto Lustre is that Frantic Films in London has deployed it. They’re the only ones AFAIK.
End update.

The StorageMojo take
Some thoughts on the infrastructure issues.

Capacity is cheap, network bandwidth is expensive. Raw SATA disk is less than $0.10/GB. 10GbE switch ports are over a grand apiece. Infiniband is better from a price/performance perspective, but not as friendly for networks where there is much node churn – unless that’s been fixed in the last few years.
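A back-of-the-envelope sketch of that comparison, using the prices above as bounds. The 24-port count is purely an assumed seat count for a shop this size, and the function names are hypothetical.

```python
def raw_disk_cost_usd(capacity_tb, usd_per_gb=0.10):
    """Upper bound on raw disk cost, using 1 TB = 1,000 GB."""
    return capacity_tb * 1_000 * usd_per_gb

def switch_port_cost_usd(ports, usd_per_port=1_000):
    """Lower bound on 10GbE switch port cost."""
    return ports * usd_per_port

disk = raw_disk_cost_usd(100)       # 100 TB of raw SATA: at most about $10,000
ports = switch_port_cost_usd(24)    # 24 ports (assumed): at least $24,000
```

Even before adapters, cabling and controllers, the ports for a couple dozen seats can cost more than the raw disks for the whole 100 TB.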

Direct attached storage will give you the best performance – especially with 4k. The new PCI-e attached arrays from JMR and others can offer up to 4,000 MB/sec bandwidth. Stripe across 4 of those and you’ll be able to handle 8k.

Transaction processing is well on its way to niche status, like mainframes and hierarchical databases that once ruled the earth. It is a big file world out there and the files are getting bigger every year.

Courteous comments welcome, of course. I’ve done work for many of these folks – but not all – at one time or another.


Why we’re getting vertical – again

by Robin Harris on Monday, 18 May, 2009

Until the 1980s the computer industry was characterized by a vertical integration of the major players. They produced their own CPUs, operating systems, applications, networks, peripherals, interconnects, and in some cases clusters.

With the advent of the PC and Ethernet the industry had for the first time a high-volume computer and network. The IBM PC’s use of a Microsoft operating system and an Intel processor in an inadvertent open architecture set in motion a new set of economic forces that in less than 10 years drove several billion dollar plus minicomputer companies out of business.

Likewise, Ethernet volume drove the development of very low-cost networking components. With the broad acceptance of the TCP/IP protocol stack the die was cast, and existing network architectures, including IBM’s Systems Network Architecture, DECnet and new ones that used token ring or token bus architectures, were crushed.

Increasing volumes drove cost down the learning curve. Intel, after getting out of the DRAM business, used its gusher of money to drive process technology faster than any of its competitors could. Board and system vendors, able to concentrate on a single CPU architecture, drove their costs down creating an economic implosion that wiped out most competing chip architectures in less than a decade.

Likewise, Microsoft’s DOS and Windows operating systems became effective standards for the high-volume computer business. Application vendors either migrated to Microsoft or died along with their minicomputer hosts.

The horizontal industry
In a decade the structure of the industry was radically changed: large vertically oriented computer companies such as DEC, Wang, Prime, Data General, CDC, and most of the seven dwarves ceased to exist. In their place arose a new group of horizontally integrated companies such as Intel, Microsoft, Cisco, Oracle, SAP and, in services, IBM Global Services.

In this new world the battles took place within these horizontal layers: Intel versus AMD and SPARC; Microsoft versus Linux; Oracle versus Informix and MySQL.

The vertical reintegration
But after 2 decades where it was obvious that horizontal integration was the winning strategy, we’ve seen a U-turn towards vertical business models again.

  • Cisco moving into servers – and with that big warchest, how about storage too?
  • EMC buying VMware to offer virtual servers. And selling real servers in Atmos.
  • Oracle buying Sun and saying they’ll offer fully integrated HW/SW systems.
  • Apple stocking up on chip guys.
  • Other buys, like HP buying LeftHand and EDS, Dell & EqualLogic, IBM & XIV, that point to more integrated offers.

What’s going on?
Ask yourself what drove companies horizontal.

  • Economy of scale. $10B companies could maintain credible R&D on all the pieces, but smaller companies couldn’t – creating a market for vendors who focused on 1 layer.
  • Standards. Whether de jure or de facto, standards such as TCP/IP, MS-DOS, NetWare, SCSI and IDE opened people’s eyes to multi-vendor infrastructures and freedom from lock-in.
  • Distribution costs. Dedicated account teams can keep CIOs happy, but down market the margin dollars aren’t there for that kind of handholding. Enter the VAR channel and the distributors who support them.
  • Increased capital intensity. With multi-billion dollar chip plants, coming investments for patterned media & HAMR, 10 & 40 GigE, small companies just couldn’t afford to stay in the game.
  • Margin cherry-picking. Disk drive vendors did the work, but the array products got the margins. Likewise, Intel got great margins while server vendors didn’t – and the same with Microsoft and Cisco.

What’s changing?
The dynamics are fascinating. This is only a partial list.

  • Wall Street. If you want a higher stock price you need to show Wall St. that you can and will grow. When, like Cisco, you dominate your segment, what else can you do?
  • Vertical is cheap. Companies are cheap right now. Lots of open source software. Fabless semiconductor design, commodity infrastructures, scale-out storage and compute: it just isn’t that expensive to move up the stack.
  • Best defense. Cisco served notice on HP’s and IBM’s server business. EMC is plucking the high-margin software from commodity servers. Oracle could be packaging up dedicated app/database/server/storage racks and containers. Or selling you a service that does most of the same stuff.
  • Solutions, not products. Sun made great building blocks – but customers don’t have the people to put them together and VARs aren’t getting the margins to do it for them. “I want an X that will do Y” is the customer demand. Package it up and win the sale.
  • Shrinking margins. There is no shelter from this storm. Cisco knew its free ride was ending as IBM and HP looked for more revenue – see “best defense” above.

The StorageMojo take
Blood on the streets. More M&A. Shifting battle lines.

“Co-opetition” is shifting to plain old bare-knuckle competition.

Vendors can say goodbye to 60% gross margins. Point product diversity will increase until the product landscape stabilizes. That will be about 5 years.

Courteous comments welcome, of course.


DAS: the biggest surprise at NAB ‘09

by Robin Harris on Wednesday, 22 April, 2009

Direct attached storage may catch on
PCI-e DAS is getting traction in the media world. At least a dozen vendors – all smaller – were showing it, and customers were responding.

JMR’s BlueStor is promising over 4 GB/sec with PCI-e attach. In a world where a single 4k frame is almost 50 MB, that is speed production companies need.

More on NAB later.

The StorageMojo take
Beth Pariseau noted the DAS movement at SNW earlier this month. This isn’t just a Hollywood moment. There’s more to this nascent DAS resurgence than the need for speed.

  • Multi-core systems. Multi-core, multi-thread systems are like a cluster in a box – only cheaper. DAS looks like a SAN to an 8 core system.
  • Management. When you can easily attach several dozen TB of cheap SATA to a physical machine, who needs a SAN? Not to mention the optical PCI-e extension cables.
  • Cost. There’s something that looks a lot like worldwide depression going down. DAS is cheap(er) and as long as systems scale inside the box a SAN offers few advantages.

A DAS resurgence. Will wonders never cease.

Courteous comments welcome, of course.


Private clouds won’t fly

by Robin Harris on Friday, 17 April, 2009

Massive economies of scale make cloud computing and storage inevitable. But if the scale required for economic clouds exceeds the capacity requirements of the largest enterprises “private clouds” won’t fly against their public counterparts.

Some cool implications:

  • Clouds favor smaller users. Large enterprises have too many economic and contractual risks to embrace public cloud infrastructure.
  • Public clouds are cheaper. As cloud technology evolves expect further increases in economies of scale that enable cloud providers to further undercut enterprise computing and storage costs while still earning a healthy profit on the services they provide.
  • Small is beautiful. And profitable. The low costs of cloud-based infrastructure combined with the global reach of the Internet drive large enterprises to focus even more on what only large enterprises can do – like developing and marketing trillions of dollars of toxic securities.

McKinsey weighs in
Ken Brill and his merry band at The Uptime Institute have a McKinsey discussion document available for download called Clearing the air on cloud computing. It offers the most concise definition of cloud computing yet.

Definition: Clouds are hardware-based services offering compute, network and storage capacity where:

  1. Hardware management is highly abstracted from the buyer
  2. Buyers incur infrastructure costs as variable OPEX
  3. Infrastructure capacity is highly elastic (up or down)

McKinsey also makes a useful distinction between cloud services and cloud infrastructure. Cloud services comply with two of the key requirements: hardware abstraction and elastic infrastructure. In a service the buyers do not incur infrastructure costs as OPEX.

Under this definition cloud storage and cloud computing are both clouds. Application is irrelevant to taxonomy. Google App Engine is a cloud. Gmail is a cloud service.

McKinsey has a couple of key observations:

Clouds already make sense for many small and medium-size businesses, but technical, operational and financial hurdles will need to be overcome before clouds will be used extensively by large public and private enterprises

Rather than create unrealizable expectations for “internal clouds,” CIOs should focus now on the immediate benefits of virtualizing server storage, network operations, and other critical building blocks

In other words, there is lower hanging fruit for large enterprises.

The StorageMojo take
If history is any guide cloud infrastructure will find its way into large enterprises the way PCs and before them minicomputers did: through the initiative of LOB executives.

Enterprise clouds such as Cisco’s new UCS promise to enable the glass house to compete with the cloud house. That strategy is necessary but not sufficient: internal clouds won’t have the scale and the capacity to offer internal customers the variable OPEX and the highly elastic capacity.

Yet the glass house need only provide a fraction of the flexibility of an Amazon to win most of the business – in the midterm. In the long term nimble competitors whose cost advantage comes from cloud computing will force large enterprises to outsource non-essential data intensive services.

Cloud computing is entering the peak of the hype cycle. A year from now many data center managers will be mocking the unfulfilled promise of cloud computing, just as they earlier mocked PCs, minicomputers and Ethernet.

Like those successful assaults on the glass house, cloud infrastructure will take a decade or more to flank large enterprise IT. Smart IT managers will stay tuned to the cloud’s inexorable economies of scale.

Courteous comments welcome, of course.
