Boxwood, like Gaul, is divided into three parts
Boxwood is structured as several interdependent layered services. Sounds good, but what does it mean? First, recall what we want from our ideal storage infrastructure:
- Fault tolerance – which means redundancy and, for really bad failures like corruption and asteroids, backup
- Adaptability – which means load balancing and capacity expansion
- Consistency – when a write completes, that data is persistent and available to all applications (databases, filesystems and the like) despite failures, system load or competing service demands
So we need mechanisms that guarantee all that and more. Oh, and can we please do all this as cheaply as possible?
Let’s get physical
Boxwood assumes a physical infrastructure of clustered systems with locally attached disks. Interconnected by Gigabit Ethernet, cluster members communicate through low-level (i.e. fast and lightweight) remote procedure calls (RPCs). There are no RAID controllers, yet data remains highly available even when components fail. The software is the important part.
Taking it from the top, i.e. the application facing side, Boxwood consists of three major layers:
- Distributed B-tree module. Oddly, no one knows what B-tree stands for, if anything. It is generally considered the best general-purpose data structure for secondary storage. Wikipedia’s article is a good place for a quick refresher. Boxwood’s B-tree is optimized to allow single B-tree ACID transactions without locks, reducing overhead.
- Below that is the chunk manager, which the B-tree module uses for storage. The chunk manager hides disk details, presenting only sector-aligned sequences of bytes, each identified by a globally unique handle. It performs four basic operations: allocate/deallocate and read/write. For network economy each chunk manager is implemented as a pair of servers on two different machines that use RPCs to stay in sync. One is the primary and handles allocation and deallocation, while both perform reads and writes on their locally attached disks. A large cluster may have thousands of these chunk manager pairs.
- Under the chunk manager is a virtualized disk called a replicated logical device, or RLDev. As the name suggests, every block is stored in two copies, using a technique called chained declustering which has been shown to offer both high availability and excellent load balancing in the event of a failure, without the physical and economic overhead of RAID.
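To make chained declustering concrete, here is a minimal sketch of its placement rule, assuming the classic scheme in which block i's primary copy sits on disk i mod n and its secondary on the next disk in the chain. The names and structure are mine, not Boxwood's:

```python
# Sketch of chained-declustering placement: each block has a primary
# copy on one disk and a secondary copy on the next disk in the chain.
# Illustrative only - not Boxwood's actual code.

def placement(block: int, n_disks: int) -> tuple[int, int]:
    """Return (primary_disk, secondary_disk) for a block."""
    primary = block % n_disks
    secondary = (primary + 1) % n_disks
    return primary, secondary

def read_from(block: int, n_disks: int, failed: set[int]) -> int:
    """Pick a live replica; reads shift to the secondary when the
    primary's disk is down, so one disk failure loses no data."""
    primary, secondary = placement(block, n_disks)
    if primary not in failed:
        return primary
    if secondary not in failed:
        return secondary
    raise IOError(f"both replicas of block {block} are unavailable")
```

Because each disk's secondaries belong to a *different* neighbor than its primaries, a failure spreads the extra read load around the chain instead of doubling the load on one mirror partner, which is the load-balancing claim above.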
Would you like services with that?
Actually, you have little choice. Boxwood requires some cluster-wide services to make this work, because it expects components to come and go. So everyone needs to know the existing cluster state, who owns what, and who to trust. These services are:
- Paxos services
- Lock services
- Transaction services
- Failure detection
Boxwood clients rely on three kinds of state information to do their jobs: lock service state, RLDev configuration and chunk manager configuration. The Paxos service maintains that information for clients and is essential for recovering or reconstructing those states after a failure. Typically the state information includes server, disk and network status and configuration.
Paxos itself is replicated across several machines, as befits an essential service. Yet Paxos is invoked only when there are failures in the system, so it is not a bottleneck.
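Paxos deserves (and has) whole papers of its own, but the acceptor's rule – the part that lets a majority agree on state despite failures – is short. A toy single-decree sketch, not Boxwood's implementation:

```python
# Toy single-decree Paxos acceptor. Boxwood uses Paxos to keep a small
# amount of global state consistent across replicas; this shows only
# the core promise/accept rule, not a full implementation.

class Acceptor:
    def __init__(self):
        self.promised = -1          # highest ballot number promised
        self.accepted = (-1, None)  # (ballot, value) last accepted

    def prepare(self, ballot):
        """Phase 1: promise to ignore lower-numbered proposals, and
        report any value already accepted so it can be preserved."""
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, ballot, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "rejected"
```

The key property: once a majority of acceptors has accepted a value, any later proposer's prepare phase will learn of it and is obliged to re-propose that same value, so the cluster's agreed state survives crashes.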
Boxwood implements a simple (to CompSci Ph.D.s) general-purpose distributed lock manager. All clusters have the problem of controlling who gets to write data, so lock managers are essential for maintaining data integrity. Before any server writes to an existing RLDev, it requests permission to do so from the lock server. If no other process holds a lock on that RLDev, the write proceeds. If another process does hold a lock, the lock service asks it to release the lock, which it does once its own write is complete.
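The handshake above amounts to a callback-style lock: a request for a held lock triggers a release request to the current holder, which gives the lock up once its write finishes. A minimal sketch with invented names (the real service also supports shared modes and lease-based recovery):

```python
# Minimal callback-style lock manager sketch. When a requested lock is
# already held, the holder is asked (via its callback) to release it
# once its current write completes. Illustrative only, not Boxwood's.

class LockServer:
    def __init__(self):
        self.holders = {}  # resource -> (owner, release_callback)

    def acquire(self, resource, owner, on_release_request):
        """Grant the lock if free; otherwise nudge the holder."""
        if resource not in self.holders:
            self.holders[resource] = (owner, on_release_request)
            return True
        # Ask the current holder to release when its write completes.
        _, callback = self.holders[resource]
        callback(resource)
        return False  # caller retries after the holder releases

    def release(self, resource, owner):
        """Only the current holder may release the lock."""
        if self.holders.get(resource, (None,))[0] == owner:
            del self.holders[resource]
```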
Transaction and logging service
As befits a database-oriented infrastructure, Boxwood provides simple transaction and logging support for clients. Logging supports both redo and undo operations. The logs themselves are stored on RLDevs, which are available to any machine in the cluster, so service recovery can be performed by any machine.
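A log that supports both redo and undo stores before- and after-images of each change, so recovery can roll committed work forward and uncommitted work back. A minimal sketch of that idea (the record layout is illustrative, not Boxwood's):

```python
# Sketch of undo/redo log recovery: each record carries the before- and
# after-image of a change, tagged with its transaction. Recovery redoes
# committed transactions and undoes the rest. Layout is illustrative.

def recover(store: dict, log: list, committed: set) -> dict:
    """Replay the log against a store: redo committed records in
    forward order, then undo uncommitted records in reverse order."""
    for rec in log:  # redo pass, forward
        if rec["txn"] in committed:
            store[rec["key"]] = rec["after"]
    for rec in reversed(log):  # undo pass, backward
        if rec["txn"] not in committed:
            store[rec["key"]] = rec["before"]
    return store
```

Because the log lives on an RLDev rather than a local disk, any surviving machine can run this recovery pass for a failed service, which is the point of the paragraph above.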
Things break. The real question is “how do we know?” In Boxwood, servers exchange keepalive messages with a collection of observer systems. The design keeps The Byzantine Generals Problem in mind, so even a machine that wrongly believes it sees a failed server (“I see dead people”) due to network problems or simple idiocy cannot keep the failure detector from doing its job. A server whose keepalives stop being acknowledged by its observers considers itself dead and stops serving; observers that stop receiving keepalives from a server declare it dead after a timeout. A simple, effective and low-overhead method of detecting failures. In a really large Google-killer cluster, there will be plenty.
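The scheme is essentially a lease: a server may act only while its keepalives are being acknowledged, and observers wait at least that long before declaring it dead, so the two views cannot disagree in a dangerous direction. A toy sketch with made-up timeout values:

```python
# Toy keepalive/lease failure detector. A server that has not had a
# keepalive acknowledged within the lease period must consider itself
# dead and stop serving; observers declare it dead only after the same
# period, by which time it has already stopped. Parameters are mine,
# not Boxwood's.

LEASE = 5.0  # seconds a server may keep serving after its last ack

def server_alive(now: float, last_ack: float) -> bool:
    """The server's own view: stop serving once the lease expires."""
    return (now - last_ack) < LEASE

def observer_declares_dead(now: float, last_keepalive: float) -> bool:
    """The observers' view: declare death only after a full lease,
    so a 'dead' server is guaranteed to have stopped serving."""
    return (now - last_keepalive) >= LEASE
```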
Layers + services = Boxwood
Let’s step back for a minute and consider what all this means. We have a low-cost infrastructure using standard Windows servers as building blocks. Locally attached disks – no RAID hardware – Gig E interconnect, and a set of services that are mostly clustered and highly scalable themselves.
These services are designed – unlike Google’s Bigtable and GFS – to support transaction processing’s ACID requirements and database apps. At the same time, all these abstractions dramatically simplify the application programmer’s job by providing services like RLDevs, transaction undo and redo primitives, failure detection and more that, at the application level, can be darned difficult to provide reliably. Is that sounding cooler, grasshopper?
Part III will be arriving Real Soon Now. Boxwood performance, scuttlebutt about where Microsoft is testing this out and more. Stay tuned.
This ‘grasshopper’ thinks this still sounds very much like the core strategy in Sistina/Redhat GFS.
Perhaps someone knowledgeable in GFS could comment..?