Next week I’m flying to Seattle to attend a one day conference on scalability hosted by Google’s Kirkland office.
It is a great set of presentations with leading edge practitioners. Here’s the agenda, presenters and edited descriptions of the topics:
- Keynote I: MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets – Jeff Dean
Search is one of the most important applications used on the internet, but it also poses some of the most interesting challenges in computer science. Providing high-quality search requires understanding across a wide range of computer science disciplines, from lower-level systems issues like computer architecture and distributed systems to applied areas like information retrieval, machine learning, data mining, and user interface design. In this talk, I’ll highlight some of the behind-the-scenes pieces of infrastructure that we’ve built in order to operate Google’s services.
- Breakout I: Lustre File System – Peter Braam
This lecture will explain the Lustre architecture and then focus on how scalability was achieved. We will address many aspects of scalability mostly from the field and some from future requirements, from having 25,000 clients in the Red Storm computer to offering exabytes of storage. Performance is an important focus and we will discuss how Lustre serves up over 100GB/sec today going to 100TB/sec in the coming years. It will deliver millions of metadata operations per second in a cluster and, write 10’s of thousands of small files per second on a single node. If you like big numbers (but less than a Gogol) please come to this talk.
- Breakout I: Building A Scalable Resource Management Layer for Grid Computing – Khalid Ahmed
We will show how to build a centralized dynamic load information collection service that can handle up to 5000 nodes/20,000 cpus in a single cluster. The service is able to gather a variety of system level metrics and is extensible to collect up to 256 dynamic or static attributes of a node and actively feed them to a centralized master. A built-in election algorithm ensures timely failover of the master service ensuring high-availability without the need for specialized interconnects.
This building block is extended to multiple clusters that can be organized hierarchically to support a single resource management domain that can span multiple data centers. We believe the current architecture could scale to 100,000 nodes/400,000 cpus. Additional services such as a distributed process execution service, and a policy-based resource allocation engine which leverage this core scale-out clustering service are described. The protocols, communication overheads, and various design tradeoffs that were made the development of these services will be presented along with experimental results from various tests, simulations and production environments.
- Breakout II: VeriSign’s Global DNS Infrastructure – Patrick Quaid, Scott Courtney
VeriSign’s global network of nameservers for the .com and .net domains sees 500,000 DNS queries per second during its daily peak, and ten times that or more during attacks. By adding new servers and bandwidth, we’ve recently increased capacity to handle many times that query volume. Name and address changes are distributed to these nameservers every 15 seconds — from a provisioning system that routinely receives one million domain updates in an hour. In this presentation we describe VeriSign’s production DNS implementation as a context for discussing our approach to highly scalable, highly reliable architectures. We will talk about the underlying Advanced Transactional Lookup and Signaling software, which is used to handle database extraction, validation, distribution and name resolution. We also will show the central heads-up display that rolls up statistics reported from each component in the infrastructure.
- Breakout II: Using MapReduce on Large Geographic Datasets & Google Talk: Lessons in Building Scalable Systems – Barry Brumitt, Reza Behforooz
- Keynote II: Description TBD – Marissa Mayer
- Breakout III: Stream Control Transmission Protocol’s Additional Reliability and Fault Tolerance – Brad Penoff, Mike Tsai, and Alan Wagner
The Stream Control Transmission Protocol (SCTP) is a newly standardized transport protocol that provides additional mechanisms for reliability beyond that of TCP. The added reliability and fault tolerance of SCTP may function better for MapReduce-like distributed applications on large commodity clusters.
SCTP has the following features that provide additional levels of reliability and fault tolerance. Selective acknowledgment (SACK) is built-in to the protocol with the ability to express larger gaps than TCP; as a result, SCTP outperforms TCP under loss. For cluster nodes with multiple interfaces, SCTP supports multihoming, which transparently provides failover in the event of network path failure. SCTP has the stronger CRC32c checksum which is necessary with high data rates and large scale systems. SCTP also allows multiple streams within a single connection, providing a solution to the head- of-line blocking problem present in TCP-based farming applications like Google’s MapReduce. Like TCP, SCTP provides a reliable data stream by default, but unlike TCP, messages can optionally age or reliability can be disabled altogether. The SCTP API provides both a one-to-one (like TCP) and a one-to-many (like UDP) socket style; use of a one-to-many style socket can reduce the number of file descriptors required by an application, making it more scalable.
The additional scalability and fault tolerance come at a cost. The CRC32c checksum calculation currently is not off-loaded to any NIC available on the market, so it must be performed by the host CPU. In high bandwidth environments with no loss, SACK processing may become a burden on the host CPU.
- Breakout III: Scalable Test Selection Using Source Code Deltas – Ryan Gerard
As the number of automated regression tests increase, the ability to run all of them in a reasonable amount of time becomes more and more difficult, and simply doesn’t scale. Since we are looking for regressions, it is useful to hone in on the parts of the code that have changed from the last run to help select a small subset of tests that are likely to find the regression. In this way we are only running the tests that need to be run as your system gets larger and the number of possible tests scales outward. We have devised a method to select a subset of tests from an existing test set for scalable regression testing based on source code changes, or deltas.
- Breakout IV: YouTube Scalability – Cuong Do
This talk will discuss some of the scalability challenges that have arisen during YouTube’s short but extraordinary history. YouTube has grown incredibly rapidly despite having had only a handful of people responsible for scaling the site. Topics of discussion will include hardware scalability, software scalability, and database scalability.
- Breakout IV: Challenges in Building an Infinite Scalable Datastore – Swami Sivasubramanian, Werner Vogels
In this talk, we will present the design of one of our internal datastores, HASS. HASS is designed to be “always” available, i.e., it will always accept read/write requests even if disks are failing, routes are flapping or if datacenters are being destroyed by tornados. HASS is designed for incremental scalability where adding or removing nodes can be done easily and the load gets evenly distributed among the nodes uniformly without requiring any operator intervention. In this talk, we will focus on a single and one of the most crucial ideas in HASS’s design: its ability to partition data. HASS uses consistent hashing to partition its data across its storage nodes. The basic consistent hashing algorithm is well understood in the academic literature and several research systems have been designed using it. In this talk, we will discuss our experiences with using the basic consistent hashing algorithm and the optimizations we performed to achieve more uniform load distribution and ease of operation.
MapReduce is a programming model and library designed to simplify distributed processing of huge datasets on large clusters of computers. This is achieved by providing a general mechanism which largely relieves the programmer from having to handle challenging distributed computing problems such as data distribution, process coordination, fault tolerance, and scaling.
Since launching Google Talk in the summer of 2005, we have integrated the service with two large existing products: Gmail and orkut. Each of these integrations provided unique scalability challenges as we had to handle a sudden big increase in the number of users.
Which ones should I attend?
I’m torn between a couple of the breakout options. Lustre vs. scalable resource management. YouTube vs. infinitely scalable datastore.
I know some of you folks are intimately involved with these topics, so I’d appreciate your suggestions, not only for which to attend, but what questions you’d like to see addressed. If some of you are also going to be there I’d also be pleased to meet f2f as well.
That last breakout session is a really tough choice. How can I be in two places at once?
While I’m up there I’m also hoping to tour Isilon’s lab and see their gear in action.
Comments and suggestions welcome. Last I heard the conference was full with a waiting list.