Block I/O has been a stalwart of SCSI, IDE and SATA interfaces for over 30 years. But sharing those devices hasn’t been easy and certainly has only rarely, if ever, made it into enterprise production systems. That’s why we have expensive Fibre Channel SANs and NAS boxes.
The advent of superfast block storage devices – enabled by NAND flash – has turned the entire storage industry on its head. I/Os are expensive on disks; NAND flash makes them cheap. But because flash performance is so high and capacity is so costly, it makes sense to share it among multiple servers.
Why don’t we share SCSI devices? One reason is economic: the cost of SCSI disk drives and SCSI storage arrays isn’t high enough to justify extra infrastructure.
But a more pertinent reason is that keeping track of who has used what block when – keeping data consistent – is hard. Hard because we don’t want added latency and because blocks are small and many. Large data structures with lots of updates? Latency.
In “Beyond Block I/O: Implementing A Distributed Shared Log In Hardware”, researchers Michael Wei of UC San Diego and John D Davis, Ted Wobber, Mahesh Balakrishnan and Dahlia Malkhi of Microsoft Research propose a network storage interface that addresses these problems, making it possible to share a single block device, or a cluster of them, across many servers.
They tie a shared log-structured filesystem – named Corfu, after an island near Paxos in the Ionian Sea – to a hardware-based I/O sequencer that, as the name suggests, hands out log access slots to clients. The client requests an address and the sequencer delivers a global log address (GLA).
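To make the sequencer's role concrete, here is a minimal Python sketch. It assumes the sequencer is nothing more than a counter that returns the next free global log address; the class and method names are invented for illustration and are not taken from the paper.

```python
import threading


class Sequencer:
    """Hypothetical sequencer: hands out increasing global log addresses (GLAs)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_gla = 0

    def next_token(self):
        # Each caller gets a distinct address, so no two clients
        # are ever told to append at the same log position.
        with self._lock:
            gla = self._next_gla
            self._next_gla += 1
            return gla


seq = Sequencer()
first = seq.next_token()
second = seq.next_token()
print(first, second)
```

The point of the design is that the sequencer only orders appends; the data itself never passes through it.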
A shared log is a powerful tool for coordinating large numbers of I/O requests and clients. The Shared Log Interface Controller or SLICE is the hardware that translates Corfu log offsets onto individual slice virtual addresses (SVA) which are mapped to slice physical addresses (SPA).
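The two-step translation can be sketched as follows. The round-robin striping of log offsets across slices is an assumption made for this example, not a detail confirmed by the post, and `locate` and `AddressMap` are hypothetical names.

```python
def locate(gla, num_slices):
    """Stripe a global log address round-robin across slices (an assumed layout)."""
    slice_index = gla % num_slices
    sva = gla // num_slices
    return slice_index, sva


class AddressMap:
    """Per-slice SVA -> SPA table; a plain dict stands in for the hardware map."""

    def __init__(self):
        self._sva_to_spa = {}
        self._next_free_spa = 0

    def bind(self, sva):
        """Assign the next free physical page to a newly written SVA."""
        if sva in self._sva_to_spa:
            raise ValueError(f"SVA {sva} is already mapped")
        spa = self._next_free_spa
        self._sva_to_spa[sva] = spa
        self._next_free_spa += 1
        return spa

    def lookup(self, sva):
        """Return the physical address for a previously bound SVA."""
        return self._sva_to_spa[sva]


slice_index, sva = locate(gla=5, num_slices=4)
amap = AddressMap()
spa = amap.bind(sva)
print(slice_index, sva, spa)
```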
The goal is to enable high-throughput distributed transactional applications in data center or cloud infrastructures. The approach persists everything onto the global log and maintains metadata in memory for fast access, and the team has already had good experiences with systems built on top of this infrastructure. These include ZooKeeper, the Hyder database, a general-purpose transactional key-value store, a state machine replication library and a log-structured virtual drive.
In addition to enabling the coordination of multiple clients accessing the block storage, the shared log also enables clustering of multiple flash storage devices. All clients write to the tail of a single log and read from its body concurrently.
The slice API has several key features:
- The entire address space is write-once only. This eliminates the need for coordinated metadata support.
- To support the write-once semantics the address space of every slice grows infinitely, limited only by the device’s lifetime.
- Configuration changes are managed through the concept of a configuration epoch. The slice can then deny service to any client which is not aware of the current configuration.
- Commands are provided to mark epoch addresses as read-only and to trim written pages so that they can be erased during garbage collection.
- The write-once semantics and the need to garbage collect mean that addresses cannot be recycled – thus the infinite address space.
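The features above can be sketched as a toy in-memory slice. This is a loose illustration only: the method names, the exceptions and the simplified `seal` (which just bumps the epoch rather than marking addresses read-only) are assumptions, not the paper's actual command set.

```python
class StaleEpochError(Exception):
    """Client is using an out-of-date configuration epoch."""


class WriteOnceError(Exception):
    """Address was already written (or trimmed) and cannot be reused."""


class Slice:
    """Hypothetical write-once slice with epochs and trim."""

    def __init__(self):
        self.epoch = 0
        self._pages = {}       # SVA -> data
        self._trimmed = set()  # SVAs released for garbage collection

    def _check_epoch(self, client_epoch):
        if client_epoch != self.epoch:
            raise StaleEpochError(f"slice is at epoch {self.epoch}")

    def write(self, client_epoch, sva, data):
        self._check_epoch(client_epoch)
        # Write-once: an address is never recycled, even after a trim.
        if sva in self._pages or sva in self._trimmed:
            raise WriteOnceError(f"SVA {sva} already used")
        self._pages[sva] = data

    def read(self, client_epoch, sva):
        self._check_epoch(client_epoch)
        return self._pages[sva]  # KeyError if unwritten or trimmed

    def trim(self, client_epoch, sva):
        self._check_epoch(client_epoch)
        self._pages.pop(sva, None)
        self._trimmed.add(sva)

    def seal(self, client_epoch):
        """Simplified: move to a new epoch so stale clients are refused."""
        self._check_epoch(client_epoch)
        self.epoch += 1
        return self.epoch


s = Slice()
s.write(0, sva=42, data=b"hello")
new_epoch = s.seal(0)
print(s.read(new_epoch, 42))
```

Because a slice never accepts a second write to an address, it needs no lock table or ownership metadata to arbitrate between clients.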
Virtual to physical address mapping with Cuckoo hashing – a technique they tested against Chain hashing – used about 1.8MB per GB of capacity.
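For readers unfamiliar with cuckoo hashing, here is a minimal sketch of the general technique, not the paper's hardware implementation. Each key has one candidate slot in each of two tables; an insert that finds its slot occupied evicts the occupant, which then moves to its slot in the other table. The two hash functions are deliberately toy ones, and keys are assumed to be non-negative integers such as SVAs.

```python
class CuckooMap:
    """Toy two-table cuckoo hash map for non-negative integer keys."""

    def __init__(self, capacity=16, max_kicks=32):
        self._size = capacity
        self._max_kicks = max_kicks
        self._tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which, key):
        # Toy hash functions, chosen only so the example is easy to follow.
        if which == 0:
            return key % self._size
        return (key // self._size) % self._size

    def get(self, key):
        # A lookup probes at most two slots, one per table.
        for which in (0, 1):
            entry = self._tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        raise KeyError(key)

    def put(self, key, value):
        # Update in place if the key is already stored.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self._tables[which][i]
            if entry is not None and entry[0] == key:
                self._tables[which][i] = (key, value)
                return
        entry = (key, value)
        which = 0
        for _ in range(self._max_kicks):
            i = self._slot(which, entry[0])
            displaced = self._tables[which][i]
            self._tables[which][i] = entry
            if displaced is None:
                return
            # Evicted entry tries its slot in the other table.
            entry = displaced
            which = 1 - which
        raise RuntimeError("insert failed; a real table would rehash or grow")


m = CuckooMap(capacity=16)
m.put(1, 100)   # lands in table 0, slot 1
m.put(17, 200)  # same slot in table 0, so key 1 is kicked to table 1
print(m.get(1), m.get(17))
```

The attraction for a hardware design is the bounded lookup cost: two probes at most, with no chains to walk.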
The prototype SLICE implementation used a Beehive many-core design on an FPGA with a GigE port, a SATA SSD and 2GB of DRAM – an implementation that can fully saturate a GigE link. A product would likely use an ASIC and onboard flash and an upgrade to 10GigE.
A low-cost SLICE controller is important since potentially hundreds could support a single shared log. The key is the data structure that supports the persistent mapping from the 64-bit slice virtual address (SVA) space to the physical addresses. The SVA space isn’t infinite: the number of blocks times the number of flash program/erase cycles defines its effective size.
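A quick worked example shows why 64 bits is ample. The device figures below (a 64 GiB device, 4 KiB blocks, 3,000 program/erase cycles) are hypothetical values picked for the arithmetic, not numbers from the paper.

```python
def effective_sva_space(num_blocks, pe_cycles):
    """Addresses a slice can ever hand out: blocks times program/erase cycles."""
    return num_blocks * pe_cycles


# Hypothetical device: 64 GiB of flash in 4 KiB blocks, 3,000 P/E cycles.
num_blocks = (64 * 2**30) // (4 * 2**10)
lifetime_addresses = effective_sva_space(num_blocks, pe_cycles=3000)
bits_needed = (lifetime_addresses - 1).bit_length()

print(num_blocks, lifetime_addresses, bits_needed)
```

Under these assumptions the lifetime address count fits in far fewer than 64 bits, so the address space outlasts the flash.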
This data structure is similar to those in an SSD Flash Translation Layer, and may be one of those architectural leverage points that could create a cost-effective and compelling device. The prototype used an SSD instead of raw flash, so the trade-offs from the latter remain to be investigated.
In a single instance test the SLICE prototype was on par in performance with a Xeon-based client, while the reads/watt were an order of magnitude better and appends/watt were ≈5x better.
In a scale test, the SLICE controller scaled linearly up to 8 nodes, achieving over 200k reads and almost 200k appends. At about 570k tokens per second the sequencer begins to max out the network, so a 10 GigE network, a high-performance NIC and batching sequence numbers could exceed 1 million tokens per second and require more than 64 SLICE controllers for maximum performance.
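Batching sequence numbers could look something like the sketch below, in which a client reserves a run of consecutive log addresses in one request instead of one address per request. The `reserve` interface is an assumption made for illustration; the post does not describe how the authors would implement batching.

```python
import threading


class BatchingSequencer:
    """Hypothetical sequencer that hands out ranges of log addresses."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_gla = 0

    def reserve(self, count):
        """Reserve `count` consecutive addresses with a single request."""
        if count < 1:
            raise ValueError("count must be at least 1")
        with self._lock:
            start = self._next_gla
            self._next_gla += count
        return range(start, start + count)


seq = BatchingSequencer()
batch_a = seq.reserve(4)
batch_b = seq.reserve(2)
print(list(batch_a), list(batch_b))
```

One network round trip then covers many appends, which is how batching would ease the network bottleneck at the sequencer.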
The StorageMojo take
The architectural implications of cheap I/O continue to unfold. In combination with the benefits of massive scale – where reads/watt makes great sense – the opportunities to build efficient commodity-based infrastructures seem to multiply.
As costs for network bandwidth, flash capacity and compute cycles continue to decline the benefits of rethinking enterprise class infrastructure keep growing. Low-level device sharing – now confined behind costly array controllers – is another surprising option.
Courteous comments welcome, of course.
Update/expansion from Mahesh Balakrishnan, one of the authors:
I did a careful read of your post and it’s a very accurate analysis of the work. . . . [T]he difficulty of sharing block devices is further compounded if you require fault-tolerance. We found that it’s very hard to consistently replicate updates across a set of drives if the drives are passive (i.e., they support only read/write operations and can’t actively participate as coordinators in replication protocols) and expose a conventional read/write address space.
Essentially, multiple clients writing to the same block on a replica set can enter race conditions where they overwrite each other’s updates in different orders on different replicas, creating inconsistency. The write-once address space (along with a chain replication protocol) solves that problem; when two clients try to write different values to the same block, the one that reaches the first replica earlier ‘wins’. Of course, the write-once address space helps even for non-replicated storage; as you point out, safely mediating access to a single drive’s conventional address space from multiple clients would require some form of coordination metadata or concurrency control mechanism external to the drive.
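The race Mahesh describes, and how write-once addresses resolve it, can be illustrated with a small sketch. It assumes a fixed chain of replicas written in order and ignores failures and recovery; the names are hypothetical and this is not the authors' protocol code.

```python
class WriteOnceReplica:
    """Passive replica whose blocks can each be written only once."""

    def __init__(self):
        self._blocks = {}

    def write(self, addr, value):
        """Return True if the write was accepted, False if addr was taken."""
        if addr in self._blocks:
            return False
        self._blocks[addr] = value
        return True

    def read(self, addr):
        return self._blocks.get(addr)


def chain_write(chain, addr, value):
    """Write to replicas in chain order; only the winner at the head proceeds."""
    if not chain[0].write(addr, value):
        return False  # another client reached the first replica earlier
    for replica in chain[1:]:
        replica.write(addr, value)
    return True


chain = [WriteOnceReplica() for _ in range(3)]
a_won = chain_write(chain, addr=7, value=b"from client A")
b_won = chain_write(chain, addr=7, value=b"from client B")
print(a_won, b_won, [r.read(7) for r in chain])
```

Because the loser stops at the head of the chain, every replica ends up holding the same value, whichever client arrives first.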
Mahesh also noted that this work is part of a larger effort to build a new stack. More on that later on StorageMojo.
A beginning of Microsoft’s answer to VSAN?
vSAN is more of an answer to Microsoft’s storage spaces. At a first read, this seems more like Microsoft’s answer to Ceph, but goes way beyond that.
Having said that the implications of this research will take me a reasonable amount of time to grok completely, so I’m probably a little off.
OMG ! Check out the diagram on Page 6 of the whitepaper … Token Ring LIVES in advanced storage design ! #FCoTR is actually an architectural possibility …
On a more serious note, the idea of allowing multiple servers to directly access a large solid-state address space without sharding is really interesting, but without looking through the rest of the research, it seems like problems associated with the distributed metadata management and garbage collection have been somewhat trivialized. In addition, the centralised sequencer makes me worried about single points of failure and domain scalability.
OTOH, it’s clear these guys are way smarter than me, so I wouldn’t be entirely surprised if my concerns are unwarranted, but even so it looks like there’s lots of hard engineering work required to turn something like this into a product I’d be betting a business on.
Corfu and Paxoi are islands in the Ionian sea