StorageMojo





Robin Harris    


Write off-loading enterprise storage

July 20th, 2008 by Robin Harris in Architecture, Enterprise, Future Tech

It isn’t clear how serious the enterprise storage vendors and their customers are about reducing energy consumption. A server may have 4-8 cores, consuming 50 W when idle, attached to 8, 16 or even 24 drives each pulling 8 W at idle.

High-end drives, whose demise is widely predicted, may consume 12 W at idle. If they are serious, storage is a good place to start.
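A back-of-the-envelope sketch of those idle numbers, using only the 50 W server and 8 W per-drive figures quoted above. The function name is illustrative, and real servers and drives will vary:

```python
SERVER_IDLE_W = 50.0  # idle server power quoted above
DRIVE_IDLE_W = 8.0    # idle power per drive quoted above


def drive_share_of_idle_power(n_drives, drive_w=DRIVE_IDLE_W, server_w=SERVER_IDLE_W):
    """Fraction of total idle power (server plus drives) drawn by the drives."""
    drives_w = n_drives * drive_w
    return drives_w / (drives_w + server_w)


for n in (8, 16, 24):
    print(f"{n} drives: {n * DRIVE_IDLE_W:.0f} W of disk, "
          f"{drive_share_of_idle_power(n):.0%} of idle power")
```

Even with 8 drives the disks draw more than the idle server does, which is why storage is the obvious target.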

But how?
A recent paper from Microsoft Research in Cambridge, Write Off-Loading: Practical Power Management for Enterprise Storage (pdf) by Dushyanth Narayanan, Austin Donnelly and Antony Rowstron, studies the issue. The traditional view is that enterprise workloads are too intense to generate savings by spinning down disks.

The team analyzed block level traces from 36 volumes in an enterprise data center and found that significant idle periods exist. They found that a technique they call write off-loading can save 60% of the energy used by enterprise disk drives.

Ring for the MAID
Main memory caches are good for handling reads but their lack of persistence means they are not effective for writes. That is the impetus for the write off-loading techniques.

Blocks intended for one volume are redirected to other storage in the data center. During write-intensive periods the disks are spun down and the writes redirected. Blocks are off-loaded temporarily, for as much as several hours, and are reclaimed in the background after the home volume disks are spun up.

The team reports

Write off-loading modifies the per volume access patterns, creating idle periods during which all the volume’s disks can be spun down. For our traces this causes volumes to be idle for 79% of the time on average. The cost of doing this is that when a read occurs for a non-off-loaded block it incurs a significant latency while the disks spin up. However our results show that this is rare.

Locality of reference hasn’t gone away.

Yes, you can spin disks down in the enterprise
The Microsoft team used servers in their Cambridge research facility to measure volume access patterns. This isn’t hard-core OLTP but there are generic server functions such as user home directories, project directories, print server, firewall, Web staging, Web/SQL server, terminal server and a media server.

They acknowledge that for TPC-C and TPC-H benchmarks disks are too busy to benefit from write off-loading. Nonetheless, even OLTP systems have significant variations in their workloads. At night for example, traffic might be light enough to power down many array disks.

The team took a week’s worth of traces. The total number of requests was 434 million, with 70% reads. They found that peak loads were substantially higher than average loads. This over-provisioning enables the power savings of write off-loading.

They also found that the overall workload is read dominated. Yet 19 of the 36 traced volumes had 5 writes for every read.

How write off-loading works
A dedicated manager is responsible for each volume. The manager decides whether to spin the disks up or down and also when and where to off-load writes.

The manager off-loads blocks to one or more loggers for temporary storage. The storage could be a disk or SSD but the team only tested disk-based loggers.

Loggers support four remote operations: write, read, invalidate and reclaim. They write the blocks and the associated metadata, including the source manager identity, the logical block numbers and a version number.

The invalidate request includes the version number and the logger marks the corresponding versions as invalid. A reclaim is like a read except the logger can return any valid range it is holding for the requesting manager.

Their implementation uses a log-based on-disk layout.

The manager determines when to off-load blocks and when to reclaim them while ensuring consistency and performing failure recovery. The manager fields all read and write requests, handing them off to loggers and/or caches as needed.
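To make the manager/logger split concrete, here is a minimal in-memory sketch of the four logger operations and a manager that uses them. It is my own simplification, not the paper's code: the class and method names are assumptions, there is a single logger, blocks live in a Python dict rather than an on-disk log, and failure recovery is left out entirely.

```python
class Logger:
    """Temporary home for off-loaded blocks (in memory here, not an on-disk log)."""

    def __init__(self):
        self._store = {}  # (manager_id, lbn) -> {version: data}

    def write(self, manager_id, lbn, version, data):
        self._store.setdefault((manager_id, lbn), {})[version] = data

    def read(self, manager_id, lbn):
        """Return (version, data) for the newest valid version, or None."""
        versions = self._store.get((manager_id, lbn))
        if not versions:
            return None
        latest = max(versions)
        return latest, versions[latest]

    def invalidate(self, manager_id, lbn, version):
        versions = self._store.get((manager_id, lbn), {})
        versions.pop(version, None)
        if not versions:
            self._store.pop((manager_id, lbn), None)

    def reclaim(self, manager_id):
        """Like read, but returns any valid (lbn, version, data) held for the manager."""
        for (owner, lbn), versions in self._store.items():
            if owner == manager_id and versions:
                latest = max(versions)
                return lbn, latest, versions[latest]
        return None


class Manager:
    """Fields reads and writes for one volume and decides when to off-load."""

    def __init__(self, manager_id, logger):
        self.manager_id = manager_id
        self.logger = logger
        self.home = {}         # lbn -> data on the home volume
        self.offloaded = {}    # lbn -> version currently held by the logger
        self.spun_up = True
        self.next_version = 0

    def spin_down(self):
        self.spun_up = False

    def write(self, lbn, data):
        old = self.offloaded.pop(lbn, None)
        if self.spun_up:
            self.home[lbn] = data
        else:
            self.next_version += 1
            self.logger.write(self.manager_id, lbn, self.next_version, data)
            self.offloaded[lbn] = self.next_version
        if old is not None:
            self.logger.invalidate(self.manager_id, lbn, old)  # older copy is stale

    def read(self, lbn):
        if lbn in self.offloaded:
            return self.logger.read(self.manager_id, lbn)[1]
        self.spun_up = True    # a read of a non-off-loaded block forces a spin-up
        return self.home.get(lbn)

    def reclaim_all(self):
        """Spin up and pull every off-loaded block back to the home volume."""
        self.spun_up = True
        while True:
            item = self.logger.reclaim(self.manager_id)
            if item is None:
                break
            lbn, version, data = item
            if self.offloaded.get(lbn) == version:
                self.home[lbn] = data
                del self.offloaded[lbn]
            self.logger.invalidate(self.manager_id, lbn, version)
```

The point of the sketch is the asymmetry: writes and reads of recently written blocks never touch the home volume while it is spun down, and only a read of a block the logger doesn't hold pays the spin-up penalty.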

Performance
Write off-loading is vulnerable to 10-15 second delays when a read forces a disk to spin up. 1% of the read requests had a response time of more than 1 second.

The write performance is equivalent to array performance in 99.999% of the cases. Here’s a figure that gives results for the “least idle” servers.


The tested configurations:

  • baseline: Volumes are never spun down. This gives
    no energy savings and no performance overhead.
  • vanilla: Volumes spin down when idle, and spin up
    again on the next request, whether read or write.
  • machine-level off-load: Write off-loading is enabled but managers can only off-load writes to loggers running on the same server: here the “server” is the original traced server, not the test bed replay server.
  • rack-level off-load: Managers can off-load writes to any logger in the rack.

And this differs from MAID how?
In a massive array of idle disks (MAID) a small number of the disks are kept spinning to act as a cache while the rest are spun down. This requires additional disks per volume. Copan Systems claims power savings of 75% with their “enterprise MAID” product. [Note to Copan - I'd be happy to have you compare your approach in the comments.]

Write off-loading does not require additional disks per volume or new hardware. The technique can use any unused data storage on the LAN.

The StorageMojo take
Can write off-loading become a viable commercial product? If Microsoft were to commercialize it in Windows Server at a low price it certainly could. Given the general reluctance of Redmond to productize MR concepts I wouldn’t expect anything soon. Too bad.

What this also underscores is the continued development of tightly coupled storage and server architectures for cost-effective solutions with unique benefits. The ability to relax some constraints of the (increasingly atypical) “typical” enterprise data center work load shows what can be accomplished through creative architecture.

As the leading OS vendor, Microsoft has an unparalleled opportunity to bring these ideas to market and create functional differentiation with Linux. I hope someone with clout in Redmond is looking at this.

Comments welcome, of course. What could be more appropriate in an era of massive write-offs?

Design Tradeoffs for SSD Performance

July 15th, 2008 by Robin Harris in Architecture, Future Tech, SSD/Flash Disk

A new Usenix paper looks at NAND flash SSD performance. From a team at Microsoft Research and the University of Wisconsin, including Ted Wobber who worked on last year’s A Design for High-Performance Flash Disks [see Flash chance for the StorageMojo take on that excellent paper - a post Ted was kind enough to review and comment on].

Design Tradeoffs for SSD Performance (by Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse and Rina Panigrahy) takes a deep dive into flash translation layer (FTL) issues. As the authors note, flash vendors keep their FTL designs secret, so the team developed a NAND flash simulator to look at how design choices affected performance.

What they found
They ran several workloads on their trace-based simulator, including TPC-C, Exchange and some file system benchmarks. They found several critical issues in SSD design.

  • Data placement. Needed for wear leveling and load balancing.
  • Parallelism. Single flash chips aren’t very fast so they need to work together.
  • Write ordering. Small random writes are a killer.
  • Workload management. You can optimize for sequential or random workloads, but managing both well is hard.

Canonical part
The paper’s discussion of flash memory is based on the spec for Samsung’s K9XXG08UXM 4 GB Single Level Cell (SLC) package. Other parts may differ, but NAND physics are the basic challenge.

The Samsung part has two 2 GB dies (chips) in the package. Each die has 8192 blocks - a block is 64 pages of 4 KB each - organized into 4 planes of 2048 blocks. The dies can be addressed independently, while cross-plane operations are limited to planes 0 & 1 or 2 & 3. Each page has 128 bytes for metadata.
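The geometry is easy to sanity-check. This sketch just multiplies out the numbers quoted above, and the constant names are mine:

```python
KB = 1024
GB = 1024 ** 3

PAGE_BYTES = 4 * KB            # data area of one page
METADATA_BYTES_PER_PAGE = 128  # per-page metadata area
PAGES_PER_BLOCK = 64
BLOCKS_PER_PLANE = 2048
PLANES_PER_DIE = 4
DIES_PER_PACKAGE = 2

block_bytes = PAGE_BYTES * PAGES_PER_BLOCK
blocks_per_die = BLOCKS_PER_PLANE * PLANES_PER_DIE
die_bytes = block_bytes * blocks_per_die
package_bytes = die_bytes * DIES_PER_PACKAGE

print(block_bytes // KB, "KB per erase block")
print(blocks_per_die, "blocks per die")
print(die_bytes // GB, "GB per die")
print(package_bytes // GB, "GB per package")
print(f"{METADATA_BYTES_PER_PAGE / PAGE_BYTES:.1%} metadata overhead per page")
```

Note the 256 KB erase block: that is the unit the erase-before-write rule works on, 64 times larger than the page you actually wanted to write.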

Cross-plane operations are a form of parallelism. The Samsung part also provides a copy-back operation so one page can be copied to another without transporting the data off of the die. Copy-back is limited to copies within the same flash plane of 2048 blocks.

Expensive writes
NAND flash is a type of EEPROM. About the only characteristics it shares with disks are block structure and persistence. To write - or as the flash guys say program - it must first be erased. And you can’t just erase a 4 KB page - you have to erase an entire block.

An erase operation takes 1.5ms, making it considerably more expensive than a read or a write. To maintain a supply of empty blocks a cleaning process - garbage collection - runs when the free block supply gets low.

SLC flash is good for about 100,000 writes, so not only do you have to manage the full block erasure problem, but you also have to manage the life span of each block - the wear-leveling problem.
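Here is a toy model of how those three constraints (no in-place update, whole-block erase, limited erase cycles) interact. It is a sketch under loud assumptions, not the paper's simulator or any vendor's FTL: a page-mapped layout, greedy cleaning that picks the block with the most stale pages, cleaning triggered only when a single empty block is left, and a crude wear-leveling rule that opens the least-erased empty block. All names and sizes are made up.

```python
import random


class TinyFTL:
    """Toy page-mapped flash translation layer with greedy cleaning."""

    def __init__(self, n_blocks=8, pages_per_block=4, n_logical=16):
        if n_logical > (n_blocks - 1) * pages_per_block:
            raise ValueError("need at least one spare block for cleaning")
        self.pages_per_block = pages_per_block
        self.n_logical = n_logical
        # A slot holds a logical page number, or None once that copy is stale.
        self.blocks = [[] for _ in range(n_blocks)]
        self.erases = [0] * n_blocks
        self.map = {}       # logical page -> (block, slot)
        self.active = 0     # block currently accepting programs
        self.programs = 0   # physical page programs, including cleaning copies

    def _append(self, lpn):
        blk = self.blocks[self.active]
        blk.append(lpn)
        self.map[lpn] = (self.active, len(blk) - 1)
        self.programs += 1

    def _clean(self):
        # Greedy victim: the full block with the most stale pages.
        full = [b for b, blk in enumerate(self.blocks) if blk and b != self.active]
        victim = max(full, key=lambda b: self.blocks[b].count(None))
        for lpn in self.blocks[victim]:
            if lpn is not None:
                self._append(lpn)       # relocate data that is still valid
        self.blocks[victim] = []        # erase the whole block
        self.erases[victim] += 1

    def write(self, lpn):
        if not 0 <= lpn < self.n_logical:
            raise IndexError(lpn)
        if lpn in self.map:
            b, s = self.map[lpn]
            self.blocks[b][s] = None    # no in-place update: old copy goes stale
        if len(self.blocks[self.active]) == self.pages_per_block:
            empty = [b for b, blk in enumerate(self.blocks) if not blk]
            # Crude wear leveling: open the least-erased empty block.
            self.active = min(empty, key=lambda b: self.erases[b])
            if len(empty) == 1:
                self._clean()           # free supply is low: reclaim a block
        self._append(lpn)


ftl = TinyFTL()
rng = random.Random(0)
host_writes = 1000
for _ in range(host_writes):
    ftl.write(rng.randrange(ftl.n_logical))
print("write amplification:", ftl.programs / host_writes)
print("erase counts per block:", ftl.erases)
```

Shrink the spare capacity (raise `n_logical` toward 28 with the defaults) and the cleaning copies pile up, which is the over-provisioning trade-off in miniature.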

[Wear-leveling will become even more acute with next-gen 3 and 4 level cells. Speculation is that the write spec could drop as low as 1,000 per cell.]

Here is a table of the operational flash parameters for the Samsung part from the paper:

SSD controller architecture
The flash packages of course are only the building blocks of an SSD. Much of the magic comes from the architecture and optimizations of the SSD controller logic. This is a generalized block diagram for an SSD controller:

Key elements:

  • Host interconnect SATA, USB, FC, PCI-e
  • Buffer management for pending and satisfied requests.
  • Multiplexer to manage instruction and data transport along the serial connections to the flash packages.
  • Processor to manage request flow and mappings from the logical block address to physical flash locations.
  • RAM for the processor.

On a cheap USB thumb drive all these elements may be integrated into a single chip. On a high-performance Fibre Channel SSD these elements may be separated on their own PC board.

The size of the flash packages also has an impact on cost and architecture. A 32 GB SSD built with the Samsung parts would require 136 pins at the controller. Larger SSDs may not have enough pins for full interconnection between the controller and the flash packages, requiring additional engineering trade-offs.

Faking it
Borrowing a simulator, DiskSim from Garth Gibson’s Parallel Data Lab at CMU, the team modified it to reflect SSD latency and architecture. Features unique to SSDs, such as multiple request queues, logical block maps, cleaning and wear-leveling states were added.

Workloads
They used a collection of workload traces they named TPC-C, Exchange, IOzone and Postmark, as well as a group of microbenchmarks generated by DiskSim.

The TPC-C trace came from a large-scale configuration comprising 14 HP MSA1500 FC controllers, each supporting 28 36 GB disks. Exemplifying the current high-end OLTP problem, each controller had over a terabyte of disk, but the benchmark used only 160 GB of that capacity.

The Exchange server was similarly over-configured with 6 RAID controllers each running 1 TB capacity, while the 15 minute trace utilized only 250 GB of that with a 3 reads for every 2 writes workload.

Microbenchmarks
These were run using 4 KB I/Os. With cleaning enabled the write operations include the extra overhead. Sequential I/Os have less cleaning overhead. Note cleaning has a ~30% hit to the random write rate.

Trade-off summary
The researchers looked at several design techniques:

  • large allocation pool
  • large page size
  • over provisioning
  • ganging
  • striping

These deserve some explanation.

A large allocation pool is convenient for achieving performance, but there is a cost. If the page size is small, there is more overhead of managing the pages.

If the page size is large, it is easier to manage the pages, but writes smaller than the page size require a read-modify-write operation, which kills performance.
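To put a number on the read-modify-write penalty, here is a minimal sketch. The 16 KB page is a made-up example size, not a figure from the paper, and the function names are mine. It counts only the bytes programmed and ignores the extra read.

```python
def pages_touched(offset, length, page_size):
    """Number of pages a write of `length` bytes at byte `offset` lands on."""
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


def needs_read_modify_write(offset, length, page_size):
    """True when the write covers only part of at least one page."""
    return offset % page_size != 0 or length % page_size != 0


def write_amplification(offset, length, page_size):
    """Bytes programmed per byte the host asked to write."""
    return pages_touched(offset, length, page_size) * page_size / length


KB = 1024
for page in (4 * KB, 16 * KB):
    print(f"4 KB write, {page // KB} KB page: "
          f"RMW={needs_read_modify_write(0, 4 * KB, page)}, "
          f"amplification={write_amplification(0, 4 * KB, page):.1f}x")
```

An aligned 4 KB write is free of overhead on a 4 KB page, but on the hypothetical 16 KB page the device has to read the page, patch a quarter of it and program all 16 KB back.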

Over provisioning reduces the cleaning overhead, at the cost of more expensive storage.

Ganging requires more explanation. A flash package is made of one or more dies or chips. The serial interface to the flash packages is a primary bottleneck for SSD performance. Spreading a write across multiple serial interfaces is an obvious way to improve performance. The cost comes in the interconnect density between the packages and the dies.

If a write can be interleaved across multiple flash packages, read or write bandwidth can be substantially improved. The ability to place multiple packages in an SSD, and to interleave operations across those packages, is key to the performance improvements that SSD vendors have been advertising.
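A rough sketch of why interleaving helps, assuming simple round-robin striping of logical pages across packages and a fixed per-page transfer cost. Both are my assumptions for illustration; the 100 µs figure is invented, and the model ignores controller and bus contention.

```python
def package_for(logical_page, n_packages):
    """Round-robin striping: consecutive logical pages go to consecutive packages."""
    return logical_page % n_packages


def transfer_time_us(n_pages, n_packages, per_page_us):
    """Time to move n_pages when every package works on its share in parallel."""
    busiest_share = -(-n_pages // n_packages)  # ceiling division
    return busiest_share * per_page_us


PER_PAGE_US = 100  # hypothetical cost to move one page over one serial interface

for n_packages in (1, 2, 4, 8):
    t = transfer_time_us(64, n_packages, PER_PAGE_US)
    print(f"{n_packages} package(s): 64 sequential pages in {t} us")
```

In this idealized model throughput scales linearly with the number of packages. A real controller runs out of pins and bandwidth well before that.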

The StorageMojo take
This paper is too rich in detail to summarize well. If understanding SSD controller design is important there is no substitute for a careful read.

The net is that engineers have many options in configuring and managing flash devices inside a solid state disk. The interaction of these design choices with applications is likely to remain a fruitful area of study for years to come.

Expect to see many performance oddities as new solid state disk designs are released. This is a different world than disk drives. There is much innovation and much to learn.

A macro longer-term trade-off is the extent to which SSD vendors should attempt to alter operating system behavior to better match SSDs. In the short term designers must conform to today’s disk I/O oriented operating systems. In the long term however, there must be major opportunities to tweak operating systems to enhance solid-state disk performance.

For this reason SSDs may find their best short term market to be inside storage arrays where array vendors have complete control over the interface to the array software. This will be no small advantage as array vendors struggle to remain relevant in a world where high performance solid state disks have the potential to replace midsize arrays.

Comments welcome, of course.

Update:
Ted Wobber kindly wrote in with a comment I’m reproducing in full, since he does a better job of getting to the heart of the matter than I did:

I think the bottom line is that flash devices are a lot more complicated than you might think they would be. At first glance, the conventional wisdom is that something constructed out of solid-state circuitry should be fundamentally simpler than a device with very small parts moving at high speed. However, you have to remember that NAND-flash is built on quantum tunneling, and while the software layers that build up from there don’t involve advanced physics, the properties of the medium create complexities and tradeoffs that might not be expected.

We don’t talk with SSD vendors at a great level of detail since we’d prefer not to be under NDA unless there is a good reason. However, informal discussions and other materials I’ve seen have convinced me that our evaluation of the state of affairs isn’t far from the truth. It’s my opinion that most manufacturers are well aware of these sorts of tradeoffs, and they carefully consider them along with the requirements of their target markets and cost structures. The point of our article was to talk about these tradeoffs in an academic forum unconstrained by IP issues, and to begin to tease apart the tangle of related issues.

In sum, SSDs constitute a marvelous step forward and are really useful in many applications. However, they are not a panacea, at least not yet.

/Ted

Thank you, Ted.

Testing, testing, 1 2 3 . . .

July 7th, 2008 by Robin Harris in Architecture, Disk, SSD/Flash Disk

George Ou weighs in
Many good points have been made about the problems with the Tom’s Hardware flash SSD tests. My former colleague George Ou, late of ZDnet, weighed in with an excellent summary of the TH testing problems:

The tests are very flawed.  If you read the results, the SSDs with the worst power consumption aren’t the ones getting the worst battery life.  The ones with great performance and above average power consumption turn out to be the worst on battery life WITH THE TEST THEY RAN.
 
What this says is that Tomshardware’s measurements weren’t wrong, but what they were measuring was wrong.
 
The load test was not well controlled.  The SSDs with great performance allowed the benchmark to run faster which cranked the CPU more.  The difference in the CPU state is what explains the discrepancy in their data.
 
A proper measurement would have done a fixed amount of CPU work and a fixed amount of storage work and then you can see how long the battery lasts.  They could have simply played a movie off the storage system and let it play until the battery died.  Videos are great because they’re fixed computational workload and fixed storage workload.
 
This is yet another example of bad science by Tomshardware.

I don’t buy the “play a movie” test - that only tests playing a movie - but I do accept that Tom’s Hardware didn’t do a great job of testing. So what?

I’ll be returning to the testing issues shortly - after pausing for this disclosure.

Disclosure: I’m biased towards notebook flash drives
Unlike, AFAIK, any of the commenters - pro or con - I used a flash-based Windows notebook every day for 5 years and loved it. It had a 10 hour battery life, a full-size keyboard and a sleep mode that really worked. Bliss!

I also paid an extra 20% - $400 back when the dollar was worth something - for the dinky 10 MB CF card it used. It was worth every penny.

Based on my sample size of 1 (me) here’s WHY it was worth an extra 20%:

  • Battery life. The Omnibook 300 went from 5 hours to 10 hours of battery life with flash.

Factors that didn’t matter:

  • Performance: I never compared the disk to the flash, but the performance was “good enough” with either.
  • Durability: nobody gets 5 years out of a notebook drive, but crashing wasn’t a liability since all docs were copied to an external system.
  • Boot up time: sleep mode worked perfectly, so I’d reboot once a month at most. I did not care about boot time.
  • Multi-media workloads: while I agree with George that a video provides a good fixed workload, notebook SSDs are aimed at business travelers whose workloads commonly allow drives to spin down. But this is a topic that deserves a deeper look.
  • Capacity. The Omnibook had a compression utility that effectively doubled capacity to 20 MB. But it was easy to copy stuff off the ‘book - Laplink - so it never felt cramped.

Those are my biases. They may or may not be the biases of Mr. Road Warrior - but I suspect they are close. End disclosure.

Testing, testing, testing
Performance testing is a black art. That’s why test driving applications remains popular: there are so many variables that predictions based on benchmarks are close to useless.

Because of that I prefer to look at the preponderance of evidence rather than a single benchmark or set of tests. More data points paint a clearer picture.

For example, the single most positive SSD test I’ve found is Anandtech’s MacBook Air SSD. Similar results from another test are here.

Battery Life Test (H:MM)                   80GB 4200RPM HDD   64GB SSD   % Improvement
Wireless Internet + MP3                    4:16               4:59       16.8%
DVD Playback                               3:25               3:56       15.1%
Heavy Downloading + XviD + Web Browsing    2:26               2:42       11.0%

Bottom line best case: a 17% improvement. Not zero but not, as most reviewers concluded, enough to justify the price.
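For anyone who wants to check the arithmetic, this sketch recomputes the improvement column from the H:MM run times in the table. The helper names are mine:

```python
def minutes(hmm):
    """Convert an 'H:MM' run time to minutes."""
    hours, mins = hmm.split(":")
    return int(hours) * 60 + int(mins)


def improvement_pct(hdd, ssd):
    """Percentage gain in run time of the SSD over the HDD."""
    return (minutes(ssd) - minutes(hdd)) / minutes(hdd) * 100


results = {
    "Wireless Internet + MP3": ("4:16", "4:59"),
    "DVD Playback": ("3:25", "3:56"),
    "Heavy Downloading + XviD + Web Browsing": ("2:26", "2:42"),
}

for test, (hdd, ssd) in results.items():
    gain = minutes(ssd) - minutes(hdd)
    print(f"{test}: +{gain} min, {improvement_pct(hdd, ssd):.1f}%")
```

The best case works out to 43 extra minutes on a run of a little over four hours.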

Ars Technica also reviewed the MBA SSD and had mixed results. They concluded:

. . . I had high hopes for the battery life on the SSD model. Unfortunately, I was met with only moderate gains when there were any at all.

More Anandtech
Anandtech also tested a high-end Memoright SSD in a high-end MacBook Pro. Here are their results:

Battery Life in Hours (Higher is Better)            MacBook Pro (Hitachi 5400RPM)   MacBook Pro (Memoright SSD)
Wireless Internet Browsing + MP3 Playback           5.13 hours                      5.0 hours
DVD Playback                                        3.88 hours                      3.58 hours
Heavy Downloading + XviD Playback + Web Browsing    3.38 hours                      3.37 hours
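The same kind of check on the MacBook Pro numbers, this time from decimal hours. Again the helper name is mine and the inputs are only the figures in the table:

```python
results_hours = {
    "Wireless Internet Browsing + MP3 Playback": (5.13, 5.0),
    "DVD Playback": (3.88, 3.58),
    "Heavy Downloading + XviD Playback + Web Browsing": (3.38, 3.37),
}


def change_pct(hdd_hours, ssd_hours):
    """Percentage change in battery life going from the HDD to the SSD."""
    return (ssd_hours - hdd_hours) / hdd_hours * 100


for test, (hdd, ssd) in results_hours.items():
    delta_min = (ssd - hdd) * 60
    print(f"{test}: {delta_min:+.0f} min, {change_pct(hdd, ssd):+.1f}%")
```

In this run the SSD trails the disk in all three tests, by anywhere from under a minute to about 18 minutes.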

The StorageMojo take
All workload testing is a compromise - but the preponderance of the evidence is clear: significant - i.e. 40% or better - notebook power advantages just aren’t there. UMPCs that can’t afford a disk - flash will win. Notebooks? Hasta la vista, baby.

The one SSD advantage that is yet to be debunked is durability. Someone made a case that just the maintenance advantages alone justify SSDs for enterprise notebooks. And it may be that simple.

Yet even there, the issues of hard CapEx dollars against softer expense dollars will work against SSDs.

Maybe the next gen of flash controllers will solve all the problems and usher in the age of flash storage everywhere. But piddly 20-30 minute gains for an extra $300 won’t do it.

Comments welcome, of course. Just so everyone knows: I haven’t done any work in the last few years for either flash drive or disk drive vendors. I wish them both the best.

StorageMojo at SNIA Symposium

July 7th, 2008 by Robin Harris in Off-Topic

If your company is a SNIA member and you’re in the Bay Area the Storage Networking Industry Association Symposium might be the excuse you’re looking for to cut work on a lovely summer day.

I’ll be delivering a keynote address on Wednesday morning, July 23rd, at the St. Claire. Think of it as an interactive Animatronic version of StorageMojo.

Topic: Crossing the Next Storage Chasm: 5 New Technologies that will Change your Data Center. The blurb:

New technologies are changing the face of storage. Robin Harris, analyst and author of the StorageMojo blog, looks at 5 of them, including flash SSDs, 10 GigE, and Google-scale storage. Get the incisive StorageMojo take on these topics and what you really need to know.

Notice I left myself some wiggle room. What do you think the other 2 topics should be?

The StorageMojo take
The data center has more pieces in motion today than ever before. The possibilities are almost infinite, but budgets and attention spans aren’t.

As an industry we don’t do a very good job of a) listening to customers and b) responding with insightful solutions. How can the industry help itself and customers get through the maze? I have a few suggestions.

Comments welcome, of course.

Notebook SSDs are dead

July 2nd, 2008 by Robin Harris in Disk, Future Tech, SSD/Flash Disk

It’s all over but the shouting
The scoop: the gap between notebook SSD promise and performance has been growing steadily. Now a review in Tom’s Hardware puts the final nail in the coffin. The title says it all:

The SSD Power Consumption Hoax : Flash SSDs Don’t Improve Your Notebook Battery Runtime – they Reduce It

By as much as an hour. A winner with the stupid high-end notebook demographic. The Paris Hilton market.

Ouch. Oops. Who knew?

Or who should have known?

Details
There’s a longer piece with some detail at Storage Bits but here’s the summary:

  • A Crucial SSD - costing $25/GB - used more power - 1.6 W at idle - than any 2.5″ notebook drive requires.
  • A Memoright 32 GB drive used a full 2 W at idle
  • An Mtron 32 GB flash drive reduced battery life by almost an hour.
  • The slowest drive - a year old Sandisk SSD 5000 - almost equaled the Hitachi 7200 RPM Travelstar’s energy use. But the SSD offers fewer IOPS than the hard drive!
  • They tested against a 200 GB Hitachi Travelstar 7k200, but other 2.5″ 7200 RPM drives have similar power envelopes.

And, of course, a 5400 RPM drive is more efficient. And a 160 GB 1.8″ drive is even more efficient, roomier and cheaper than any of the SSDs TH tested.

My guess on the not-easily-or-quickly-fixed culprit? The flash control logic - disk translation layer - needs cycles for wear leveling, garbage collection, buffer and cache management, flash mux/demux and the SATA interface - with frequent background operations even when the drive is idle.

And don’t forget the 20 volts required to write a cell.

Tom’s singles out Crucial for special mention:

Users who purchase this drive because of Crucial’s statements such as “low power consumption” and the product being ideal for “users who want longer battery life” will most likely be disappointed. While the total battery runtime certainly depends on the workload — we used Mobilemark 07 — the minimum and maximum power consumption measurements prove that Crucial’s statements of low power consumption are in fact wrong: 1.6 W idle power is more than any 2.5” notebook hard drive requires.

Did anyone even think to check the facts? At least one engineer had to know - and he told someone.

What’s the dynamic?
Some will say I’m premature, like when I said HD DVD was dead a year ago. But think about the market dynamic:

  • Cool but costly new technology needs early adopters
  • Based on the marketing, hip high-end adopters spring for a costly status symbol with claimed road-warrior features
  • But the supposed advantages don’t exist, so the early adopters feel like chumps
  • Word of mouth stops. Who wants to admit they were suckered?
  • Notebook SSDs slip into obscurity as enterprise and very low-end SSDs move into the spotlight

Making early investors/adopters look stupid is not a winning strategy.

The StorageMojo take
The notebook SSD vendors have dug themselves a very deep hole. How to fix?

  1. Stop digging. A month in detox would help. Some encounter group time with the HD DVD folks.
  2. Form a serious performance consortium and get real about performance, power and longevity.
  3. Do the hard work of getting notebook operating systems better optimized for flash. Use Linux and OS X to beat Microsoft into some semblance of cooperation. Do the engineering for Apple - they’re open source, right? If Apple does it, it’s cool - and you need cool.

What the SSD guys will do:

  • Deny and obfuscate. “Not representative. Slanted. Unfair. Conspiracy.”
  • Claim next gen will fix all problems.
  • Performance, performance, performance. Which is a weak reed as well.
  • Point to cost curves showing that, without a doubt, flash overtakes disk in 5 years.

And then hope the smart, techy, affluent road warrior demographic has a short memory. Good luck with that.

Comments welcome, of course.

The Hitz report

July 1st, 2008 by Robin Harris in Off-Topic

The NetApp/Sun patent battle continues. I don’t see how NetApp can win this, given the Supreme Court’s Teleflex decision, which makes prior art a question that can be appealed all the way to the Supreme Court.

But the company is doggedly pursuing the battle, and Dave Hitz’s recent declaration - which he hoped would remain private - has been unsealed.

It is an illuminating document.

Lame logic
Early on, Dave points to Sun’s Jeff Bonwick’s statement that NetApp’s WAFL was

. . . the first commercial file system to use the copy-on-write tree of blocks approach to file system consistency.

As if that proves anything. Sun is arguing that earlier non-commercial research experimented with those and other techniques, establishing the prior art and invalidating NetApp’s patents. One NetApp patent has already been removed from the litigation and I expect more to follow.

Fear and trembling
Hitz goes on to say

Because Sun is exploiting NetApp’s patented technology for free and creating interest in ZFS by giving it away for free, it does not have to cover the true cost of incorporating ZFS into the Sun Fire X4500 and marketing it. Sun is thus able to undercut NetApp’s pricing on a per gigabyte basis, like any counterfeiter. This negatively affects NetApp’s ability to compete in the storage space. In responding to normal market pressures, NetApp would have to consider shrinking its normal profit margins. Reduced profit margins in this marketplace can be permanent and difficult to quantify.

One would have to believe that if Sun were paying a reasonable license fee to NetApp the Sun Fire x4500 wouldn’t be competitive with NetApp’s products. That doesn’t compute: the x4500 is a box of disks with a commodity motherboard. It’s the packaging density and cost amortization across 48 drives that gives the x4500 its $/GB advantage.

NetApp could do the same tomorrow - and I hope they’re working on it. They’d enjoy the same cost structure as Sun. Sun still has all the costs of building, debugging and marketing a complex product so NetApp would have the cost advantage.

Losing in the court of public opinion
Dave later comments on the public campaign Sun is waging against NetApp:

I am painfully aware that IP litigation is not favorably viewed by many members of the open source community. Indeed, the mere participation in a lawsuit can bear a reputational cost. Aside from the obvious monetary costs of protracted litigation and the distraction of resources from normal business functions, if the Court grants Sun’s motion for a partial stay, NetApp will suffer irreparable harm to its reputation because it will prolong this whole matter rather than allowing for a prompt disposition.

Newsflash: NetApp has suffered harm and will suffer further harm no matter how this gets resolved. If they win, they lose. In the court of public opinion it would be better if they lost.

The court did grant the partial stay. And unsealed Dave’s declaration.

The StorageMojo take
So sad and unnecessary. I understand the impulse to protect one’s intellectual property. I’ve done it myself on occasion.

NetApp’s biggest misperception is that WAFL is somehow central to the success they are enjoying today. That was true about 10 years ago. Guys, your average F500 CIO today could care less about WAFL.

NetApp is growing because they offer a compelling value proposition of quality products, relevant services and worldwide support. WAFL certainly supports that, but as NetApp execs note much of their recent success is due to the integration software that NetApp now offers.

WAFL is a small piece of the picture. Sun could copy it line for line and still not have a quarter of what NetApp offers.

NetApp faces challenges. Storage commoditization threatens all vendors’ traditional 60% gross margins. The GX integration is problematic and the bottom line benefit uncertain. EMC’s move into cloud file services is a clever flanking strategy.

But letting fear drive you isn’t the answer. Boldness and innovation - NetApp’s traditional strengths - are the way to a profitable and high-growth future. Sun is a distraction, not a direction.

Comments welcome, of course. Disclosure: I’ve met Dave Hitz a couple of times and he is a genuinely fine person. If you think I’ve pulled some punches here that’s the reason.

Update: A commenter felt I didn’t get Dave’s point across because I’d edited the quote to what I - perhaps mistakenly - thought were the Most Significant Bits. Here’s the salient part of ⁋5 of Dave’s declaration:

5. Sun’s ZFS technology appears to be a conscious reimplementation of NetApp’s innovative WAFL filesystem, as admitted by the creators of ZFS: “The file system that has come closest to our design principles, other than ZFS itself, is WAFL . . . the first commercial file system to use the copy-on-write tree of blocks approach to file system consistency.”

I still don’t follow the logic that Bonwick’s acknowledgment of WAFL’s technical features means that ZFS is a “conscious reimplementation” of WAFL. Evidently the judge wasn’t persuaded either.
End update.

David Caminer: app design for 1st business computer

June 29th, 2008 by Robin Harris in Enterprise

Sometimes we forget how young the computer revolution is. The death 10 days ago of David Caminer, who led the application programming for the world’s first business computer, the Lyons Electronic Office (LEO), is a reminder.

LEO performed its first business calculation - with 2,000 words of memory - on November 17, 1951, evaluating costs and margins on baked goods for J. Lyons & Company, a British chain of tea shops. Mr. Caminer was the systems analyst for the project, which grew into an early computer company that eventually became part of ICL.

From the obituary in the Independent

In 1947 a Lyons fact-finding team visited the United States to catch up on new developments in office methods. They learned for the first time about the newly invented electronic computer. No machine had yet been built, but they learned that Maurice Wilkes at Cambridge University was as far ahead as anyone in constructing a machine. On its return to England, the team made contact with Wilkes, who agreed to supply the design information to Lyons, and Lyons agreed to provide some additional finance and manpower to the project.

The Cambridge machine sprang into life in May 1949, and Lyons then proceeded to construct a copy of the machine. A Cambridge engineer, John Pinkerton, led on the hardware side, while Caminer was put in charge of application development.

As today, many early computer projects went disastrously wrong. Not so at Lyons. Although the technology was radical and innovative, Caminer’s approach to the computerisation of business processes was utterly conservative. He assumed that what could go wrong would go wrong. He therefore set out on a learning curve – computerising simple jobs first, and gradually taking on ones that were critical to the business, such as payroll and stock control. Caminer was an early advocate of management by exception, using the computer to bring critical issues to the attention of management.

Like some current computer industry luminaries, Mr. Caminer was politically active, campaigning against British Fascist Oswald Mosley in the 30s and 40s and apartheid later, welcoming Bishop Desmond Tutu to his Borough.

Read more. The New York Times obit. An appreciation from Frank Land at the Leo Computers Society web site.

The first LEO ran for over 13 years - presaging IT’s “if it ain’t broke, why fix it?” mentality.


A LEO computer [courtesy the LEO Computers Society]

The StorageMojo take
As with so many revolutionary 20th century technologies - jet aircraft, radar, antibiotics - the British had an early lead that Americans eventually erased. Arguably the British lead in commercial business computers was the largest of all.

Given Mr. Caminer’s success in bringing large IT projects in on time, we should probably be sorry that we didn’t learn more from him and his methods.

Comments welcome, of course.

Optimism and manycore computing

June 26th, 2008 by Robin Harris in Architecture, Clusters, Future Tech

The parallel computing/manycore initiatives may be missing the point. The challenge of manycore computing is to burn up as many CPU cycles as possible doing things that we don’t do today because the computational cost is too great. Making existing apps go faster is secondary.

Today’s focus on creating manycore development platforms like OS X.vi server’s Grand Central may be a subset of where the real action will be. Maybe current levels of parallelization are good enough for most apps. So what does that leave?

How else can we use manycore computing?
Some thoughts:

Application speed up. That won’t be the big win for current apps - most feel current processors are fast enough - look at the popularity of the Eee. But I’d love Handbrake to rip my DVDs faster.

Advanced UI capabilities such as voice recognition that are loosely coupled independent processes. Your application won’t run any faster, but it will be easier to use. This is an area Microsoft is looking at. Historically, the UI has been a major consumer of improved CPU and display capability.

New forms of communication and entertainment, such as 3D virtual worlds. This is an extension of the video editing market. And just think of the storage requirements!

Communities of cellular automata. One core, one or a few automata. For example, Brian Tung’s and Leonard Kleinrock’s 1996 paper Using Finite State Automata to Produce Self-Optimization and Self-Control discusses using automata to guide a group of agents to cooperate on a task in a distributed systems environment.

Optimistic computing defined by David Jefferson in a 1990 ACM paper titled Virtual Time II: Storage Management in Distributed Simulation as

An optimistic simulation mechanism is one that takes risks by performing speculative computation, which, if subsequently determined to be correct, saves time, but which, if incorrect, must be rolled back.

Update: Rethinking virtualization because once a core costs $3 and you’ve got 32 or 64 of them in a $2k server, why would you spend hundreds of dollars on software to create virtual machines when you’ve got dozens of real ones?

There’s value in easy migration of virtual machines from one physical server to another. A “thin” virtualization layer atop a manycore OS - Windows 7? - could enable Microsoft to take back VMware’s market cap and reassert control of the entire OS stack.
End update.
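Jefferson’s definition of optimistic computing is easier to picture with a toy. What follows is a minimal sketch, not Jefferson’s Time Warp mechanism: a single process handles events the moment they arrive, snapshots its state before each one, and rolls back when a “straggler” with an older timestamp shows up. The class name, the update rule and the event values are all made up for illustration, and real optimistic simulators also need things this sketch omits, such as cancelling messages already sent to other processes.

```python
class OptimisticProcess:
    """Toy process that executes events speculatively and rolls back on stragglers."""

    def __init__(self):
        self.state = 0
        self.processed = []   # (timestamp, value) pairs in timestamp order
        self.snapshots = []   # state saved just before each processed event
        self.rollbacks = 0

    def receive(self, timestamp, value):
        # Undo every already-processed event that is newer than the arrival.
        redo = []
        while self.processed and self.processed[-1][0] > timestamp:
            redo.append(self.processed.pop())
            self.state = self.snapshots.pop()
        if redo:
            self.rollbacks += 1
        # Apply the new event, then replay the undone ones in timestamp order.
        for ts, v in [(timestamp, value)] + redo[::-1]:
            self._apply(ts, v)

    def _apply(self, ts, v):
        self.snapshots.append(self.state)
        self.state = self.state * 2 + v   # hypothetical order-dependent update
        self.processed.append((ts, v))


p = OptimisticProcess()
p.receive(1, 1)   # processed immediately
p.receive(3, 3)   # processed immediately, on the bet that nothing older is coming
p.receive(2, 2)   # straggler: event 3 is rolled back, then 2 and 3 are replayed
```

When events arrive in order the process never waits and never pays for the snapshots it took. The cost only shows up on a bad bet, which is the trade that spare cores make cheap.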

High desert optimist
Many performance enhancements already use optimistic concepts. But the ability to throw massive computes from networks on a chip - oh, and how about reconfiguring those on-chip networks on the fly - could take us in directions we, or at least I, can’t imagine.

The StorageMojo take
The first effort with any new technology is to recreate what you could do with the old technology. It is only with the 2nd generation that the truly innovative stuff enabled by the new technology gets built.

Consider this an effort to short-circuit that historical process.

Comments welcome, of course. Thanks to Prof. West for pointing out the Jefferson paper to me.

IT is a factory; the Web is a playground

June 24th, 2008 by Robin Harris in Architecture, Enterprise

Over on O’Reilly Radar, Nat Torkington does a neat riff on the enterprise SOA movement. He likens enterprise IT to a stern father:

. . . with strict rules, transgressors to be punished;. . .

while the Web is:

. . . the nurturing parent (the API provider) who encourages experimentation, self-development, and happiness.

It is an amusing read, but like lots of developers and engineers, Nat misunderstands enterprise IT’s motivation. They aren’t into control for the sake of control. (Well, some of them are, because some people are like that. But that isn’t the key reason.)

Control is a means to an end. The goal is production. Enterprise IT is a factory. The Web is a playground.

Expecting the two to be similar is a fundamental confusion. If you were put in charge of Goldman’s IT, you’d turn into a control freak too.

Statistical process control
Factories produce more and higher quality goods by reducing variability. Variability creates problems that cost money, either warranty costs or greater downtime/setup costs.
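For readers who haven’t met statistical process control, here is a simplified sketch of the core idea: measure a process while it is behaving, set limits a few standard deviations either side of the mean, and flag anything that lands outside. The function names and the sample numbers are hypothetical, and a real control chart usually estimates sigma more carefully than this.

```python
import statistics

def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) limits from measurements of a stable process."""
    mean = statistics.mean(baseline)
    sd = statistics.pstdev(baseline)
    return mean - sigmas * sd, mean + sigmas * sd

def out_of_control(samples, limits):
    """Return the samples that fall outside the control limits."""
    lower, upper = limits
    return [x for x in samples if x < lower or x > upper]

# Hypothetical job run times in minutes.
baseline = [10, 10, 10, 10, 12, 8]
limits = control_limits(baseline)
flagged = out_of_control([9, 11, 14, 6], limits)
```

With these numbers the limits come out at roughly 6.5 and 13.5 minutes, so the 14 and the 6 get flagged. Note that the fast run is flagged too: the factory cares about surprise in either direction.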

Enterprise IT is a factory
I first learned this truth when I was selling to engineers for development and to manufacturing for MRP. The engineers were all about the money and the freedom to tinker.

The manufacturing guys just wanted it to work. Save a few bucks on a 3rd party expansion rack? Why? Any glitch would wipe out the savings. So they wouldn’t go there.

The Web is a playground
Sure, there are people, like me, for whom the Web is instrumental in their work. I have backups for everything. The big destination sites do the same.

But for most of us the Web is something more casual: entertainment; shopping; news; communication. As long as it usually works we’re fine. The local cable loop goes down for a couple of hours and we’ll survive.

The StorageMojo take
The engineering and manufacturing cultures are very different, even though both groups are technical. This is why the gap between Silicon Valley and enterprise IT is so wide: the SV engineers think they get IT. And they don’t.

If you can show IT how your product reduces variability in their environment, giving them more certainty about production, you will have their attention. NUMA architectures, for example, add variability, despite higher average performance on tuned workloads.

So you could predict they wouldn’t be successful in the enterprise.

Words like “flexibility,” “experimentation” and “mashup” just don’t compute in the enterprise infrastructure. I’ve been as frustrated by the IT mindset as anyone, but complaining won’t change it. They are doing the best they can with the tools they have.

Want to do something great? Give IT better tools for managing variability.

Comments welcome, of course.

Short videos from Seattle Scalability Conference

June 20th, 2008 by Robin Harris in Off-Topic

I’ve put together a couple of ~3 minute video excerpts from the Seattle Scalability Conference last Saturday. I’ve edited them to be useful standalone intros. Maybe they’ll entice you to learn more.

Chapel: productive parallel programming at scale
Bradford Chamberlain of Cray talks about a new language that he and his colleagues are developing. It isn’t released to the public yet, but he is looking for collaborators interested in moving it beyond a pure HPC focus.

Chapel appears to dramatically simplify parallel programming, if the code samples are any indication.

This is only 3 minutes out of 30, so if this whets your appetite be sure to look for the full video - shot on better equipment - on YouTube. As of this writing it isn’t up yet.

Carmen: a scalable science cloud
This is 3 minutes from early in a talk that Paul Watson of Newcastle University gave on cloud computing for neuroscience research. Neuroscience has a number of issues - including 100,000 researchers worldwide - that lend themselves to a cloud approach.

The full talk is up on Google Video.

Commenters on my ZDnet blog
inform me that Microsoft has solved all these multicore programming problems. Maybe the next scalability conference should be held in Redmond.

It’s official: ZFS in Mac OS 10.6 server

June 19th, 2008 by Robin Harris in Architecture, Information Management

Can single-user OS X be far behind?
Here’s the official Apple announcement:

For business-critical server deployments, Snow Leopard Server adds read and write support for the high-performance, 128-bit ZFS file system, which includes advanced features such as storage pooling, data redundancy, automatic error correction, dynamic volume expansion, and snapshots.

The StorageMojo take
Cool! And only 2 years later than I’d predicted. I’m an optimist.

As I noted almost 2 years ago:

StorageMojo.com has devoted time to this issue because today’s computer business is largely driven by consumer computing, not enterprise computing. Putting a really modern integrated file and storage management system on a consumer OS would raise the bar for everyone else.

I stand by that.

Comments welcome, of course.
Want to know more about ZFS? I’ve been hot on it for over a year. See:

Cloud computing podcast

June 16th, 2008 by Robin Harris in Future Tech

Gary Orenstein has published a podcast of a discussion we had a couple of weeks ago about cloud computing.

Cloudy days on the hype cycle
Cloud computing and storage is still climbing the hype cycle. Remember client-server computing? It was going to change the world. It did, but not as we expected. Now it is an invisible part of the infosphere.

Likewise cloud computing. It is another arrow in the quiver, not a howitzer. The critical issue is how creatively and transparently we utilize it. No doubt many of us will be surprised.

In 15 years cloud computing will be as obvious to users as client-server is today.

The StorageMojo take
The podcast discusses other issues in cloud computing and storage. Kudos to Gary for putting on the cloud computing series.

Comments welcome, of course. I’ve done work for Gary’s employer, Gear6, in the past. This discussion was conducted gratis.

Seattle Scalability Conference quick take

June 16th, 2008 by Robin Harris in Architecture, Clusters, Future Tech

I’m relaxing in beautiful Port Townsend, Washington today, under the gray skies of the coldest June in almost 100 years. The fire in the wood-burning stove and Frank’s strong coffee provide the good cheer.

Temporal compare
My comments are more impressionistic than considered. No “best of” selections now.

Comparing this year’s conference to last year’s is tricky. The Googlers who selected the papers didn’t profess a theme, choosing what they found interesting. So it may be a Rorschach inkblot test to see a pattern in the 2 conferences, but I do.

Last year’s conference focused on cluster scalability - building really big clusters that go beyond the 8,000 or so node clusters Google uses. Jeffrey Dean last year was open about Google’s desire to knit their data centers into a single global name space.

This year the focus moved up the stack to file systems and programming languages. The problem of multi-core chips seemed especially pertinent.

Bradford Chamberlain’s Chapel language attacks the issue of programming multicore/processor systems and sounded promising [download a technical pdf on Chapel here].

Vijay Menon’s “Scalable multiprocessor programming via transactional memory” seeks to replace clustering’s traditional reliance on threads and locks with an atomic transactional model of file access. He noted that Azul Systems uses hardware transactional memory in their 800+ core Java servers.

And there was more.

The StorageMojo take
Scalability is a key problem. The Googlers’ desire to involve industry as well as academe gives this conference a dual personality that I like. At its best we see ideas beginning to morph into platforms.

The slow take will be coming as I look further into the papers that were presented. In the meantime Garth Gibson, CMU prof and RAID paper co-author, made some interesting comments on the earlier Scalability Conference post.

Comments welcome, of course. Looking forward to returning to NoAZ tomorrow.

Off to Seattle

June 12th, 2008 by Robin Harris in Off-Topic

That’s right: the second Seattle Conference on Scalability - sponsored by Google - is this Saturday [see a couple of posts back for more info]. I’m also attending the bonus meeting in Fremont Friday evening.

I’m bringing the video production backpack and I’ll try to get some video clips up if I capture something short & interesting. Sunday I’m going to get some Father’s Day love and then up to charming Port Townsend for a couple of days R&R with Frank.

If you’ve spent time in PT, you know Frank. So no guarantees on the video.

The StorageMojo take
The StorageMojo team has been celebrating the 500 post mark - by not posting. But now it’s back to work.

If you’re at the Conference look me up. Always pleased to meet StorageMojo readers - even occasional ones - or people who could be StorageMojo readers.

Roadrunner’s backing store

June 11th, 2008 by Robin Harris in Architecture, Clusters, Disk, NAS, IP, iSCSI, SAN, FC

I wrote a short piece on ZDnet about Los Alamos National Lab’s new Cell Broadband Engine-based supercomputer, Roadrunner. With ~14k v.3 Cell processors - an earlier version powers the PS3 game console - and another ~7k dual core Opterons, the Roadrunner’s ~3,250 compute nodes pack a lot of compute cycles.

The key compute element is the new version of the PS3 chip, called a PowerXCell 8i Processor. It features 8x faster double-precision floating point and over 25 GB/sec of memory bandwidth. And it can address 64 GB RAM. There are 4 8i’s per compute node.

Nothing I read mentioned the disk storage - until the friendly Panasas PR person suggested I talk to Larry Jones, VP Product Marketing. Panasas is providing the back end storage for Roadrunner.

I did, and here’s what I learned.

LANL storage infrastructure
LANL’s 6 supercomputers + Roadrunner share the Panasas storage through LANL-developed I/O nodes. While Roadrunner itself uses dual-data-rate 4x Infiniband for internode communication, the I/O nodes attach to Panasas through trunked GigE.

The advantage of the I/O nodes is that the entire Panasas storage pool is available to each supercomputer. Lots of bandwidth.

Roadrunner currently has about 80 TB of RAM, roughly 24 GB per compute node. That works out to about 4 GB RAM per processor.
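A quick check of that arithmetic, as a sketch. The node count and the 4 Cells per node come from the figures above. The 2 Opterons per node is my assumption, inferred from ~7k Opterons across ~3,250 nodes, and I’m using decimal units (1 TB = 1,000 GB).

```python
TOTAL_RAM_GB = 80 * 1000     # "about 80 TB", decimal units
COMPUTE_NODES = 3250         # "~3,250 compute nodes"
CELLS_PER_NODE = 4           # "4 8i's per compute node"
OPTERONS_PER_NODE = 2        # assumption: ~7k Opterons / ~3,250 nodes

ram_per_node_gb = TOTAL_RAM_GB / COMPUTE_NODES
processors_per_node = CELLS_PER_NODE + OPTERONS_PER_NODE
ram_per_processor_gb = ram_per_node_gb / processors_per_node

print(f"{ram_per_node_gb:.1f} GB per node, "
      f"{ram_per_processor_gb:.1f} GB per processor")
```

That comes to a bit under 25 GB per node and about 4 GB per processor, in line with the rounded figures above.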

The jobs these machines run are huge. A simulation can run 6 months or more. Depending on criticality a job gets checkpointed every hour or maybe once a day.

The Panasas installation at LANL, begun in 2003, is currently 2 PB. Assuming an average of 500 GB drives, that means 4,000 disk drives.

Panasas uses 5 trunked GigE links to each of the 8 controllers in a single rack. They are now in beta for 10 GigE, which reduces link count from 40 to 8 per rack while doubling bandwidth.
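The drive count and the link numbers are simple enough to sanity-check in a few lines. This is a back-of-the-envelope sketch: it uses decimal units and nominal line rates (1 and 10 Gbit/s per link), and it assumes one 10 GigE link per controller, which is my reading of “40 to 8 per rack” rather than something Panasas stated.

```python
def drive_count(capacity_tb, drive_gb):
    """Drives needed for a raw capacity, ignoring redundancy and spares."""
    return capacity_tb * 1000 // drive_gb

def rack_links(controllers, links_per_controller, gbit_per_link):
    """Return (link count, aggregate nominal Gbit/s) for one rack."""
    links = controllers * links_per_controller
    return links, links * gbit_per_link

drives = drive_count(2000, 500)              # 2 PB on 500 GB drives
gige_links, gige_gbit = rack_links(8, 5, 1)  # today: 5 trunked GigE per controller
ten_links, ten_gbit = rack_links(8, 1, 10)   # assumed: one 10 GigE per controller

print(drives, gige_links, gige_gbit, ten_links, ten_gbit)
```

So 4,000 drives, and per rack a move from 40 links at a nominal 40 Gbit/s to 8 links at a nominal 80 Gbit/s: one fifth the cabling for twice the bandwidth.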

The hot rodders at LANL should like that.

The StorageMojo take
Roadrunner’s 80 TB RAM is a sizable storage infrastructure in its own right. Keeping it fed and backed up is a major job.

Consumerization of IT is a common concept - but what we see here is the consumerization of HPC: Playstation CPUs; SATA drives; Linux OS; air cooling. The old model of highly customized kit for HPC is dead.

Which is a good thing for the rest of us. We get some of the smartest people in computing working on platforms that we might also use, developing applications that otherwise would never be available to the consumer market.

I’ll never run molecular dynamics codes, but maybe my kids will. After all, I can now edit feature length movies on my desktop. Who would have believed that just 20 years ago?

Comments welcome, of course. Disclosure: I did some work for Panasas last year and - who knows? - might do some more in the future. I like the team and the way they are pushing pNFS.


