by Robin Harris on Saturday, 17 January, 2009

Truism: flash is not the same as disk. So why don’t we take advantage of that – rather than hiding it?

Partly it is the human SOP: first build the old thing out of the new stuff. Not to mention the commercial allure of hundreds of millions of SATA interfaces in the wild.

Helping us move on is a paper, Transactional Flash (pdf), by researchers Vijayan Prabhakaran, Thomas L. Rodeheffer and Lidong Zhou of Microsoft Research. Vijayan has also co-authored flash papers with Ted Wobber et al., noted elsewhere on StorageMojo.

Flash is a good fit.
The authors note that the essence of all transactional constructs is to avoid in-place data modification – enabling roll back to a known state. Since flash SSDs can’t re-write data in place, TransFlash makes a virtue of flash necessity.

Flash SSD architectures also have much parallelism, due to the use of many flash chips, each including multiple planes and blocks, with multiple I/O paths to support garbage collection and wear-leveling – and now, WriteAtomic.

Finally, the data scattering caused by avoiding in-place data rewrites – typically through copy-on-write strategies – is not the problem for flash that it is for disks: flash excels at fast random reads.

What is TransFlash?
TransFlash is a flash SSD with 3 important enhancements:

  • It exports a transactional interface, WriteAtomic.
  • The flash controller implements a cyclic commit that uses flash’s per-page metadata storage – typically 128 bytes – instead of the common independent commit record.
  • Both of these features are implemented in the flash translation layer controller firmware – no hardware engineering required.
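
To make the cyclic commit idea concrete, here is a minimal Python sketch, not the authors' firmware. It assumes each page's metadata stores a link to the next page of the same transaction, with the last page linking back to the first, so a transaction counts as committed only when the whole cycle is on flash. The names FlashPage, write_cycle, is_committed and the stop_after crash switch are hypothetical, invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical page identifier: (logical page number, version).
PageId = Tuple[int, int]

@dataclass
class FlashPage:
    data: bytes
    next_link: PageId  # assumed to live in the page's metadata area

def write_cycle(flash: Dict[PageId, FlashPage],
                pages: Dict[PageId, bytes],
                stop_after: Optional[int] = None) -> None:
    """Write each page with a link to the next page; the last links to the first.

    stop_after simulates a crash after that many page writes.
    """
    ids = list(pages)
    for i, pid in enumerate(ids):
        if stop_after is not None and i >= stop_after:
            return  # simulated power loss: the remaining pages never land
        flash[pid] = FlashPage(pages[pid], ids[(i + 1) % len(ids)])

def is_committed(flash: Dict[PageId, FlashPage], start: PageId) -> bool:
    """Follow links from start; committed only if they lead back to start."""
    seen = set()
    current = start
    while current not in seen:
        page = flash.get(current)
        if page is None:
            return False  # broken cycle: a page of the transaction is missing
        seen.add(current)
        current = page.next_link
    return current == start

txn = {(1, 0): b"a", (2, 0): b"b", (3, 0): b"c"}

complete = {}
write_cycle(complete, txn)
print(is_committed(complete, (1, 0)))

crashed = {}
write_cycle(crashed, txn, stop_after=2)
print(is_committed(crashed, (1, 0)))
```

Note that no separate commit record is written: the evidence of commit is the closed cycle itself, which is the point of using the per-page metadata.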

The authors named their invention TxFlash, but I like TransFlash better since Tx also abbreviates transmit. It also sounds sexier, a rare quality in computer science naming. Really guys, it will help commercial adoption.

WriteAtomic model
The key API construct is described thusly:

TxFlash exports a new interface, WriteAtomic (p1 . . . pn), which allows an application to specify a transaction with a set of page writes, p1 to pn. TxFlash ensures atomicity, i.e., either all the pages are written or none are modified. TxFlash further provides isolation among multiple WriteAtomic calls. Before it is committed, a WriteAtomic operation can be aborted by calling an Abort. By ensuring atomicity, isolation, and durability, TxFlash guarantees consistency for transactions with WriteAtomic calls.
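
The all-or-nothing behavior described in that passage can be sketched with a toy in-memory model. This is an illustration under stated assumptions, not the TxFlash API: the paper's WriteAtomic is a single call, while this sketch splits it into hypothetical begin_write_atomic, commit and abort steps so the abort path is visible. The class name ToyTxFlash is invented, and isolation is modeled crudely by rejecting a transaction that touches a page already staged by another uncommitted one.

```python
from typing import Dict, Optional

class ToyTxFlash:
    """Toy model: staged pages stay invisible until the transaction commits."""

    def __init__(self) -> None:
        self.committed: Dict[int, bytes] = {}          # logical page -> visible data
        self.pending: Dict[int, Dict[int, bytes]] = {}  # txn id -> staged pages
        self._next_txn = 0

    def begin_write_atomic(self, pages: Dict[int, bytes]) -> int:
        """Stage a set of page writes out of place; nothing is visible yet."""
        for staged in self.pending.values():
            if staged.keys() & pages.keys():
                # Assumed isolation policy: conflicting transactions are refused.
                raise ValueError("page already in an uncommitted transaction")
        txn = self._next_txn
        self._next_txn += 1
        self.pending[txn] = dict(pages)
        return txn

    def commit(self, txn: int) -> None:
        """Make every staged page of the transaction visible together."""
        self.committed.update(self.pending.pop(txn))

    def abort(self, txn: int) -> None:
        """Drop the staged pages; the committed state is untouched."""
        self.pending.pop(txn)

    def read(self, page: int) -> Optional[bytes]:
        return self.committed.get(page)

dev = ToyTxFlash()
t1 = dev.begin_write_atomic({1: b"old-1", 2: b"old-2"})
dev.commit(t1)

t2 = dev.begin_write_atomic({1: b"new-1", 2: b"new-2"})
dev.abort(t2)
print(dev.read(1), dev.read(2))  # still the old versions after the abort
```

Because the old pages are never overwritten in place, aborting is just a matter of forgetting the staged copies, which is why the interface maps so naturally onto flash.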

The authors compare 3 commit protocols – traditional commit, simple cyclic commit, and back pointer cyclic commit – and evaluate their resource requirements. The table shows that the new commit protocols reduce I/O overhead, differing in their treatment of aborted transactions.
The simple cyclic commit has to erase aborted transactions before any new writes can be written to the same page. This could slow response times if aborted transactions are common.

Compared to traditional commits, the new protocols double transaction throughput because they don’t require additional commit writes and write ordering. The gain is largest for small transactions; for large ones, data transfer time dominates.
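
A back-of-the-envelope calculation shows why small transactions gain the most. The sketch below assumes a deliberately simplified cost model that is not taken from the paper: a traditional commit costs one extra commit-record write per transaction, every page write costs the same, and write-ordering stalls are ignored. The function names are hypothetical.

```python
def writes_traditional(n_pages: int) -> int:
    """Data pages plus one commit record (simplifying assumption)."""
    return n_pages + 1

def writes_cyclic(n_pages: int) -> int:
    """Cyclic commit: the commit evidence rides in the page metadata."""
    return n_pages

def write_ratio(n_pages: int) -> float:
    """How many times more page writes the traditional protocol issues."""
    return writes_traditional(n_pages) / writes_cyclic(n_pages)

for n in (1, 4, 16, 64):
    print(f"{n:3d} pages: {write_ratio(n):.3f}x the writes")
```

Under these assumptions a one-page transaction needs twice the writes with a traditional commit, while at 64 pages the extra commit record is under 2% of the work.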

End-to-end benefit
The authors’ simulations with a pseudo-device driver under various workloads found that TransFlash adds minimal overhead. The big win is in file system complexity, which:

. . . can be reduced by using the transactional primitives from the storage system. For example, the journaling module of TxExt3 contains about 3300 LOC when compared to 7900 LOC in Ext3. Most of the reduction were due to the absence of recovery and revoke features and journal-specific abstraction.

The StorageMojo take
TransFlash works on multiple levels:

  • It simplifies a longstanding problem with little required device investment.
  • It creates a high-value storage interface – with its attendant margin enhancement opportunities – for an industry whose current margin cows will soon die.
  • It reduces file system complexity – an under-appreciated issue – while improving performance for small write transactions.

History will favor back pointer cyclic commit (BPCC) as Moore’s Law drives flash translation layer controller performance up and flash storage costs down. Unless someone comes up with something even better.

Whether or not TransFlash ever sees the light of day, the paper is a welcome reminder of the benefits of pushing the envelope. With all the new storage technologies coming online we’ll have many opportunities to change the I/O landscape in coming years.

Courteous comments welcome, of course.


the storage anarchist January 19, 2009 at 4:17 am

Interesting concept. But I wonder if this could be implemented in a manner that would meet the high-availability requirements of most transactional applications. I guess you could mirror the WriteAtomic operation to two separate flash drives/devices. But remotely replicating the entire transaction would seem to stretch the limits of practicality.

RC January 19, 2009 at 12:28 pm

You can either make flash devices smarter to work around file systems optimized for disk type layouts, or make file systems smarter to work with flash.

In general, the dumber approach wins, because the flash vendor will use the gimmick to differentiate their product.

John F February 4, 2009 at 4:46 pm

Speaking of Microsoft research into SSD, I found this one from November 2008 quite interesting as well.

In it, you’ll find

“Broadly we found that SSDs will fully replace disks once the SSDs’ cost per GB drops by 1–3 orders of magnitude. Waiting for full replacement is not necessary, however. Using the solid-state storage as a caching tier is more promising in the short to medium-term: a small amount of solid-state storage used as a read cache benefits up to 25% of our traced workloads, and for 10% of them this can be done cost-effectively already at today’s SSD prices.”

Fazal Majid March 10, 2009 at 11:04 am

It’s really interesting that this Microsoft team used Linux (ext3 filesystem) rather than NT as its testbed.
