A transaction processing system for NVRAM

Adapting to NVRAM is going to be a lengthy process. This was pointed out by a recent paper. More on that later.

Thankfully, Intel wildly pre-announced 3D XPoint. That has spurred OS and application vendors to consider how it might affect their products.

As we saw with the adoption of SSDs, it takes time to unravel the assumptions built into products. Take databases: they spent decades optimizing for hard drives, and when SSDs came along many of those optimizations became detrimental.

Durable transactions
On the face of it, it shouldn’t be that hard. You want a durable transaction, and you have persistent NVRAM. Are we good here?

Nope.

In a paper published by Microsoft Research, DUDETM: Building Durable Transactions with Decoupling for Persistent Memory, the authors (Mengxing Liu, Mingxing Zhang, Kang Chen, Xuehai Qian, Yongwei Wu, Jinglei Ren) go into the issues:

While persistent memory provides non-volatility, it is challenging for an application to ensure correct recovery from the persistent data on a system crash, namely, crash consistency. A solution . . . is using crash-consistent durable transaction[s]. . . .

Most implementations of durable transactions enforce crash consistency through logging. However, the. . . dilemma between undo and redo logging is essentially a trade-off between update redirection cost and persist ordering cost.
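
To make that trade-off concrete, here is a toy Python sketch of the two logging styles. It is not from the paper: the class names, the `persist_barriers` and `log_lookups` counters, and the cost model (one ordering barrier per undo-logged write, one log lookup per redo-logged read) are all assumptions made for illustration. Plain dicts stand in for persistent memory.

```python
class UndoLogTx:
    """Undo logging: update in place, but save the old value first."""

    def __init__(self, data):
        self.data = data            # stand-in for persistent data
        self.undo_log = []
        self.persist_barriers = 0   # assumed cost model, not a measurement

    def write(self, key, value):
        self.undo_log.append((key, self.data.get(key)))
        self.persist_barriers += 1  # old value must be durable before the overwrite
        self.data[key] = value

    def read(self, key):
        return self.data[key]       # reads go straight to the data

    def commit(self):
        self.persist_barriers += 1  # data durable before the log is discarded
        self.undo_log.clear()


class RedoLogTx:
    """Redo logging: buffer new values in a log, apply them at commit."""

    def __init__(self, data):
        self.data = data
        self.redo_log = {}
        self.persist_barriers = 0
        self.log_lookups = 0        # every read is redirected through the log

    def write(self, key, value):
        self.redo_log[key] = value  # original data untouched until commit

    def read(self, key):
        self.log_lookups += 1
        if key in self.redo_log:
            return self.redo_log[key]
        return self.data[key]

    def commit(self):
        self.persist_barriers += 1  # the whole log is persisted once
        self.data.update(self.redo_log)
        self.redo_log.clear()


def run(tx):
    """Hypothetical workload: three writes, each followed by a read."""
    for key in ("a", "b", "c"):
        tx.write(key, 0)
        tx.read(key)
    tx.commit()
    return tx


undo = run(UndoLogTx({"a": 1, "b": 2, "c": 3}))
redo = run(RedoLogTx({"a": 1, "b": 2, "c": 3}))
print(undo.persist_barriers, redo.persist_barriers, redo.log_lookups)
```

In this toy model the undo version pays four ordering points and no redirected reads, while the redo version pays one ordering point and three log lookups. Real costs depend on the hardware and the implementation, but the shape of the dilemma is the same.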

The authors make a bold claim:

[O]ur investigation demonstrates that it is possible to make the best of both worlds while supporting both dynamic and static transactions. The key insight of our solution is decoupling a durable transaction into three fully asynchronous steps.

Solution
To create a fully decoupled transaction system for NVRAM, the researchers made three key design decisions.

A single, shared, cross-transaction shadow memory.
An out-of-the-box transactional memory (TM).
A redo log as the only way to transfer updates from shadow memory to persistent memory.

These design choices enabled building an ACID transaction in three decoupled, asynchronous, steps.

Perform: execute the transaction in a shadow memory, and produce a redo log for the transaction.
Persist: flush the redo log of each transaction to persistent memory in an atomic manner.
Reproduce: modify original data in persistent memory according to the persisted redo log.
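
The three steps can be sketched in a few lines of Python. This is a minimal, single-threaded toy, not the authors' implementation: the class `DecoupledTxSystem`, its method names, and the use of dicts and deques as stand-ins for shadow DRAM, persistent memory, and the log regions are all assumptions for illustration. In the real system the steps run asynchronously; here they are called by hand so the ordering is easy to follow.

```python
from collections import deque


class DecoupledTxSystem:
    """Toy model of Perform / Persist / Reproduce (hypothetical names)."""

    def __init__(self, initial):
        self.persistent = dict(initial)  # stand-in for data in persistent memory
        self.shadow = dict(initial)      # volatile mirror of the persistent data
        self.volatile_logs = deque()     # redo logs produced by Perform
        self.persisted_logs = deque()    # redo logs made durable by Persist
        self.next_tx_id = 0

    def perform(self, updates):
        """Step 1: run the transaction in shadow memory and emit a redo log."""
        tx_id = self.next_tx_id
        self.next_tx_id += 1
        redo = []
        for key, value in updates.items():
            self.shadow[key] = value
            redo.append((key, value))
        self.volatile_logs.append((tx_id, redo))
        return tx_id

    def persist(self):
        """Step 2: move each transaction's whole redo log to the durable log."""
        while self.volatile_logs:
            self.persisted_logs.append(self.volatile_logs.popleft())

    def reproduce(self):
        """Step 3: replay durable redo logs, in order, onto persistent data."""
        while self.persisted_logs:
            _, redo = self.persisted_logs.popleft()
            for key, value in redo:
                self.persistent[key] = value

    def crash_and_recover(self):
        """Simulated crash: volatile state is lost, durable logs are replayed."""
        self.volatile_logs.clear()
        self.reproduce()
        self.shadow = dict(self.persistent)


db = DecoupledTxSystem({"a": 1, "b": 2})
db.perform({"a": 10})
db.persist()             # the first transaction is now durable
db.perform({"b": 20})    # performed but never persisted
db.crash_and_recover()
print(db.persistent)     # {'a': 10, 'b': 2}
```

The transaction that reached the Persist step survives the simulated crash; the one that only reached Perform does not. That is the sense in which durability hangs on the redo log rather than on the shadow copy.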

Performance
The paper is lengthy and a recommended read for those professionally interested in transaction processing on NVRAM. But here’s their performance summary.

Our evaluation results show that DUDETM adds guarantees of crash consistency and durability to TinySTM by adding only 7.4% ∼ 24.6% overhead, and is 1.7× to 4.4× faster than existing works Mnemosyne and NVML.

The StorageMojo take
As we’ve seen with the transition from hard drives to SSDs, unwinding decades of engineered-in assumptions in the rest of the stack is a matter of years, not months. There’s the issue of rearchitecting basic systems, such as transaction processing or databases, and then the hard work of stepwise enhancement of those new architectures as we gain knowledge about how they intersect with the new technology and workloads.

There are going to be many opportunities for startups that focus on NVRAM. The technology is coming quickly and with more technology diversity – there are several types of NVRAM already available, with more on the way, and each has different trade-offs – which means that the opportunities for creativity are legion.

Courteous comments welcome, of course.

2 Comments

Colby on Monday, 19 June, 2017 at 12:02 pm

Robin,
As usual, your ability to see–and communicate–the forest AND the trees is a pleasure.

Thanks for reminding us: “unwinding decades of engineered-in assumptions in the rest of stack is a matter of years, not months,” and highlighting “the opportunities for creativity” embedded in the current state by how we got here.

As to where we’re going, as this is part of the rapidly growing body of theory and practice in general optimistic concurrency control, I expect cross-pollination, particularly in distributed implementations. For instance, how might network OCC findings be applied to DUDETM in a rack-scale distributed fabric, or a more distributed–and mobile–environment? Global/local optimization in distributed systems is and will always remain an extremely difficult problem, though there are imperfect yet effective non-obvious approaches that are only improving.

Love your work.

Colby

KD Mann on Tuesday, 20 June, 2017 at 12:10 pm

Great article Robin, and thanks for bringing attention to this important topic.

Flash SSDs were supposed to be a panacea for all persistent memory bottlenecks, and we’ve seen how that didn’t play out. Intel’s 3D XPoint was supposed to be byte-addressable, but here we are in mid-2017 and the only implementation available is still a block device (Optane). The performance gains over conventional SSDs are an order of magnitude (or more) short of the claims made by Intel/Micron, and without byte-addressability it’s definitely NOT NVRAM; it’s just another SSD.

The essence of this paper (which I agree is excellent) seems to be “we have to cache everything in DRAM, and then use a redo log to make it work”. Not sure why they called it “shadow memory” instead of a DRAM cache.

From the paper:

“To realize the efficient decoupled execution, we make the following design choices. First, we maintain a single shared, cross-transaction shadow memory, which is logically a volatile mirror or cache of the whole persistent memory space.”

The importance of this paper is that it illustrates that (once again) the latest advances in non-volatile memory, even assuming we someday can buy non-volatile memory that is byte-addressable, will still not be as good as caching in DRAM, and certainly not the panacea that the marketing hypesters are selling us.

This reminds me of Jim Gray’s famous quote: “Tape is dead, disk is tape, flash is disk, but RAM locality is king.”

It would appear that RAM locality is still King 🙂