<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>StorageMojo &#187; Information Management</title>
	<atom:link href="http://storagemojo.com/category/information-management/feed/" rel="self" type="application/rss+xml" />
	<link>http://storagemojo.com</link>
	<description>Data storage info &#38; analysis</description>
	<lastBuildDate>Fri, 20 Jan 2012 06:10:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Nimble Storage architecture video</title>
		<link>http://storagemojo.com/2011/08/03/nimble-storage-architecture-video/</link>
		<comments>http://storagemojo.com/2011/08/03/nimble-storage-architecture-video/#comments</comments>
		<pubDate>Wed, 03 Aug 2011 23:26:15 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Backup]]></category>
		<category><![CDATA[Information Management]]></category>
		<category><![CDATA[SOHO/SMB]]></category>
		<category><![CDATA[Video]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2483</guid>
		<description><![CDATA[I sat down with Nimble Storage co-founder and VP of engineering Varun Mehta to discuss their architecture &#8211; and shoot some video. Varun has been part of several Valley success stories &#8211; NetApp, Sun, Data Domain &#8211; and has a first hand perspective on disruptive technologies. Varun and co-founder Umesh Maheshwari &#8211; a brilliant architect [...]]]></description>
			<content:encoded><![CDATA[<p>I sat down with <a href="http://www.nimblestorage.com/" target="_blank">Nimble Storage</a> co-founder and VP of engineering Varun Mehta to discuss their architecture &#8211; and shoot some video. Varun has been part of several Valley success stories &#8211; NetApp, Sun, Data Domain &#8211; and has a first-hand perspective on disruptive technologies.</p>
<p>Varun and co-founder Umesh Maheshwari &#8211; a brilliant architect and a very nice guy &#8211; designed the Nimble product that he discusses. Take 4 minutes to learn more about <i>Innovations in Storage Architecture at Nimble Storage</i>:</p>
<p><object width="500" height="306"><param name="movie" value="http://www.youtube.com/v/KxQVmSe_o3M?version=3"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/KxQVmSe_o3M?version=3" type="application/x-shockwave-flash" width="500" height="306" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
<p>Or you can see it in HD on <a href="http://www.youtube.com/watch?v=KxQVmSe_o3M" target="_blank">YouTube</a>.</p>
<p><strong>The StorageMojo take</strong><br />
The Nimble guys have great technology, but they&#8217;ve also put together a compelling value proposition: collapse 3 time-consuming and complex workflows &#8211; primary storage, backup and archiving &#8211; into 1 appliance. Include all the needed software, price it well, target under-served mid-sized companies and you have a recipe for another Valley success. </p>
<p>The tech trends they&#8217;re riding will only get better. But the business trends are in their favor as well. SMBs today have many TB of data and little staff to manage it &#8211; or capital to invest. With Congress ensuring that America operates well below capacity for years to come, the times favor thrifty solutions like Nimble&#8217;s.</p>
<p><strong>Courteous comments welcome, of course.</strong><br />
Nimble bought my time for this video, but I made all editorial decisions.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2011/08/03/nimble-storage-architecture-video/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Google&#8217;s Megastore</title>
		<link>http://storagemojo.com/2011/04/20/googles-megastore/</link>
		<comments>http://storagemojo.com/2011/04/20/googles-megastore/#comments</comments>
		<pubDate>Wed, 20 Apr 2011 16:50:29 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2349</guid>
		<description><![CDATA[Megastore handles over 3 billion writes and 20 billion reads daily on almost 8 PB of primary data across many global data centers. In a paper by Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Léon, Yawei Li, Alexander Lloyd, Vadim Yushprakh titled Megastore: Providing Scalable, Highly Available Storage [...]]]></description>
			<content:encoded><![CDATA[<p>Megastore handles over 3 billion writes and 20 billion reads daily on almost 8 PB of primary data across many global data centers.</p>
<p>In a paper titled <a href="http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf" target="_blank">Megastore: Providing Scalable, Highly Available Storage for Interactive Services</a>, Google engineers Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Léon, Yawei Li, Alexander Lloyd and Vadim Yushprakh describe how it works. From the abstract:</p>
<blockquote><p>
Megastore is a storage system developed to meet the requirements of today&#8217;s interactive online services. Megastore blends the scalability of a NoSQL data store with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high-availability. We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between data centers.
</p></blockquote>
<p><strong>The mission</strong><br />
Support Internet apps such as Google&#8217;s AppEngine. </p>
<ul>
<li>Scale to millions of users</li>
<li>Responsive despite Internet latencies to impatient users</li>
<li>Easy for developers</li>
<li>Fault resilience from drive failures to data center loss and everything in between</li>
<li>Low-latency synchronous replication to distant sites</li>
</ul>
<p><strong>The how</strong><br />
Scale by partitioning the data store and replicating each partition separately, providing full ACID semantics within partitions but limited consistency guarantees across them. Offer some traditional database features if they scale with tolerable latency.</p>
<p>The key assumptions are that data for many apps can be partitioned, for example by user, and that a selected set of DB features can make developers productive.</p>
<p><strong>Availability and scale</strong><br />
To achieve availability and global scale the designers implemented two key architectural features:</p>
<ul>
<li>For availability, a synchronous, fault-tolerant log replicator optimized for long-distance links</li>
<li>For scale, data partitioned into small databases each with its own replicated log</li>
</ul>
<p>Rather than implement a master/slave or optimistic replication strategy, the team decided to use Paxos, a consensus algorithm that does not require a master, with a novel extension. A single Paxos log would soon become a bottleneck with millions of users so each partition gets its own replicated Paxos log.</p>
<p>Data is partitioned into entity groups which are synchronously replicated over a wide area while the data itself is stored in NoSQL storage. ACID transaction records within the entities are replicated using Paxos.</p>
<p>For transactions across entities, the synchronous replication requirement is relaxed and an asynchronous message queue is used. Thus it&#8217;s key that entity group boundaries reflect application usage and user expectations.</p>
<p><strong>Entities</strong><br />
An e-mail account is a natural entity. But defining other entities is more complex.</p>
<p>Geographic data lacks natural granularity. For example, the globe is divided into non-overlapping entities. Changes across these geographic entities use (expensive) two-phase commits.</p>
<p>The design problem: entities large enough to make two-phase commits uncommon but small enough to keep transaction rates low.</p>
<p>Each entity has a root table and may have child tables. Each child table has a single root table. Example: a user&#8217;s root table may have each of the user&#8217;s photo collections as a child. Most applications find natural entity group boundaries.</p>
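<p>A rough sketch of the entity group idea in Python. Everything here is my own illustration, not Megastore&#8217;s API: the <code>EntityGroup</code> class, its <code>commit</code> method, the <code>group_for</code> helper and the plain list standing in for the group&#8217;s replicated log are all hypothetical names. The point is only that a root row, its child rows and one log travel together.</p>

```python
from dataclasses import dataclass, field


@dataclass
class EntityGroup:
    """One root row plus its child rows, all sharing a single log (hypothetical)."""
    root_key: str
    root: dict
    children: dict = field(default_factory=dict)  # (table, child_key) -> row
    log: list = field(default_factory=list)       # stand-in for the group's replicated log

    def commit(self, mutations):
        """Record a batch of child-row mutations as one log entry, then apply it."""
        self.log.append(list(mutations))
        for table, child_key, row in mutations:
            self.children[(table, child_key)] = row
        return len(self.log)  # log position of this transaction


def group_for(groups, user_id):
    """Route every operation for a user to that user's entity group."""
    if user_id not in groups:
        groups[user_id] = EntityGroup(root_key=user_id, root={"user": user_id})
    return groups[user_id]


groups = {}
alice = group_for(groups, "alice")
pos = alice.commit([
    ("photo_collection", "vacation", {"photos": 12}),
    ("photo_collection", "pets", {"photos": 3}),
])
```

<p>Both photo collections land in one log entry for the one group, which is the property that makes a transaction inside a group cheap compared with one that spans groups.</p>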
<p><strong>API</strong><br />
The insight driving the API is that the big win is scalable performance rather than a rich query language. Thus a focus on controlling physical locality and hierarchical layouts.</p>
<p>For example, joins are implemented in application code. Queries specify scans or lookups against particular tables and indexes. Therefore, the application needs to understand the data schema to perform well.</p>
<p><strong>Replication</strong><br />
Megastore uses Paxos to manage synchronous replication. But in order to make Paxos practical despite high latencies the team developed some optimizations:</p>
<ul>
<li><strong>Fast reads.</strong> Current reads are usually from local replicas since most writes succeed on all replicas.</li>
<li><strong>Fast writes.</strong> Since most apps repeatedly write from the same region, the initial writer is granted priority for further replica writes. Using local replicas and reducing write contention for distant replicas minimizes latency.</li>
<li><strong>Replica types.</strong> In addition to full replicas Megastore has 2 other replica types:
<ul>
<li><i>Witness replicas</i> vote in Paxos rounds and store the write-ahead log but, to keep storage costs low, do not store entity data or indexes. They are also tiebreakers when there isn&#8217;t a quorum.</li>
<li><i>Read-only replicas</i> are the inverse: nonvoting replicas that contain full snapshots of the data. Their data may be slightly stale but they help disseminate the data over a wide area without slowing writes.</li>
</ul>
</li>
</ul>
<p><strong>Architecture</strong><br />
What does Megastore look like in practice? Here&#8217;s an example. </p>
<p><a href="http://storagemojo.com/wp-content/uploads//2011/04/megastore_arch.png"><img src="http://storagemojo.com/wp-content/uploads//2011/04/megastore_arch.png" alt="" title="megastore_arch" width="460" height="310" class="aligncenter size-full wp-image-2350" /></a></p>
<p>A Megastore client library is installed on the app server. It implements Paxos and other algorithms such as read replica selection. The app server has a local replica written to a local <a href="http://storagemojo.com/2006/09/07/googles-bigtable-distributed-storage-system-pt-i/" target="_blank">BigTable</a> instance.</p>
<p>A <i>coordinator server</i> tracks a set of entity groups and observes all Paxos writes. The coordinator is simpler than BigTable and serves local reads.</p>
<p>Concurrent with writing local data to BigTable and the coordinator, the Megastore library is also writing to a second full replica: a replication server and a second coordinator. The stateless replication servers handle the writes to the remote BigTable while the lower latency coordinator handles any reads from the remote replica.</p>
<p>Failures may leave writes abandoned or in an uncertain state. The replication servers scan for incomplete writes and offer no-op values via Paxos to complete them.</p>
<p><strong>Availability</strong><br />
As coordinator servers do most local reads their availability is critical to maintaining Megastore&#8217;s performance. The coordinators use an out-of-band protocol to track other coordinators and use Google&#8217;s Chubby distributed lock service to obtain remote locks. If the coordinator loses a majority of its locks it will consider all entities in its purview to be out of date until the locks are regained and the coordinator is current.</p>
<p>There are a variety of network and race conditions that can affect coordinator availability. The team believes the simplicity of the coordinator architecture and the coordinators&#8217; light network traffic make the availability risks acceptable.</p>
<p><strong>Performance</strong><br />
Because Megastore is geographically distributed, application servers in different locations may initiate writes to the same entity group simultaneously. Only one of them will succeed and the other writers will have to retry.</p>
<p>Limiting writes to a few per second per entity group makes contention insignificant; e-mail is one example.</p>
<p>For multiuser applications with higher write requirements developers can shard entity groups more finely or batch user operations into fewer transactions. Fine-grained advisory locks and sequencing transactions are other techniques to handle higher write loads.</p>
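<p>The one-winner-and-retry behavior can be sketched as an optimistic append against a log position. This is a single-process toy with no real Paxos in it, and <code>GroupLog</code>, <code>try_append</code> and <code>write_with_retry</code> are names I made up for the illustration.</p>

```python
class GroupLog:
    """Stand-in for one entity group's replicated log (no real Paxos here)."""

    def __init__(self):
        self.entries = []

    def try_append(self, expected_position, value):
        """Accept the write only if the log has not advanced since the writer read it."""
        if expected_position != len(self.entries):
            return False
        self.entries.append(value)
        return True


def write_with_retry(log, make_value, max_attempts=5):
    """Read the current position, attempt the write, and retry on conflict."""
    for attempt in range(1, max_attempts + 1):
        position = len(log.entries)
        if log.try_append(position, make_value(position)):
            return attempt
    raise RuntimeError("too much contention on this entity group")


group_log = GroupLog()
seen_by_a = len(group_log.entries)  # both writers read position 0
seen_by_b = len(group_log.entries)
a_won = group_log.try_append(seen_by_a, "write from A")
b_won = group_log.try_append(seen_by_b, "write from B")  # loses: the log moved on
attempts = write_with_retry(group_log, lambda pos: f"write from B at {pos}")
```

<p>Writer B loses the first round because A already claimed position 0, then succeeds after re-reading the log. With only a few writes per second per group that retry is rare, which is why e-mail fits so well.</p>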
<p><strong>The real world</strong><br />
Megastore has been deployed for several years and more than 100 production applications use it today. The paper provides these figures on availability and average latencies.</p>
<p><a href="http://storagemojo.com/wp-content/uploads//2011/04/megastore_availability_dist.png"><img src="http://storagemojo.com/wp-content/uploads//2011/04/megastore_availability_dist.png" alt="" title="megastore_availability_dist" width="416" height="327" class="aligncenter size-full wp-image-2351" /></a><br />
<a href="http://storagemojo.com/wp-content/uploads//2011/04/megastore_avg_latencies.png"><img src="http://storagemojo.com/wp-content/uploads//2011/04/megastore_avg_latencies.png" alt="" title="megastore_avg_latencies" width="418" height="343" class="aligncenter size-full wp-image-2352" /></a></p>
<p>The high availability of the system architecture creates a nice-to-have problem: small transient errors on top of persistent uncorrected problems can cause much larger problems. </p>
<p>Fault tolerance makes finding underlying faults more difficult. The price of fault tolerance is eternal vigilance.</p>
<p>As the architecture diagram suggests, Megastore doesn&#8217;t manage BigTable. Developers must optimize the storage for their app.</p>
<p><strong>The StorageMojo take</strong><br />
As Brewer&#8217;s <a href="http://en.wikipedia.org/wiki/CAP_theorem" target="_blank">CAP theorem</a> showed, a distributed system can&#8217;t provide consistency, availability and partition tolerance to all nodes at the same time. But this paper shows that by making smart choices we can get darn close as far as human users are concerned.</p>
<p>If Microsoft Office &#8211; or an open-source analog &#8211; could plug into a productized version of Megastore this could become popular for private cloud implementations: LAN performance in the office and global availability on the road. What&#8217;s not to like?</p>
<p>But whether that happens or not, the paper demonstrates again the value of Internet scale infrastructure thinking. Enterprise vendors would never have developed Megastore, but now that we&#8217;ve seen it work we can begin applying its principles to smaller scale problems.</p>
<p><strong>Courteous comments welcome, of course.</strong>  If this overview intrigues I urge you to read the entire paper as there are some interesting pieces I&#8217;ve left out.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2011/04/20/googles-megastore/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Hyder: a flash-based scale-out database</title>
		<link>http://storagemojo.com/2011/01/24/hyder-a-flash-based-scale-out-database/</link>
		<comments>http://storagemojo.com/2011/01/24/hyder-a-flash-based-scale-out-database/#comments</comments>
		<pubDate>Mon, 24 Jan 2011 07:36:35 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Future Tech]]></category>
		<category><![CDATA[Information Management]]></category>
		<category><![CDATA[SSD/Flash Disk]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2239</guid>
		<description><![CDATA[Talked to a company last week whose cloud app handles several billion transactions per month on a cluster. Sounds like SSDs could help them but how? In a paper from the latest 5th Biennial Conference on Innovative Data Systems Research (CIDR &#8217;11) researchers Philip A. Bernstein and Colin W. Reid of Microsoft and Sudipto Das [...]]]></description>
			<content:encoded><![CDATA[<p>Talked to a company last week whose cloud app handles several billion transactions per month on a cluster. Sounds like SSDs could help them but how?</p>
<p>In a paper from the latest <a href="http://www.cidrdb.org/cidr2011/" target="_blank">5th Biennial Conference on Innovative Data Systems Research</a> (CIDR &#8217;11) researchers Philip A. Bernstein and Colin W. Reid of Microsoft and Sudipto Das of UC Santa Barbara have a suggestion: <a href="http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper2.pdf" target="_blank">Hyder – A Transactional Record Manager for Shared Flash</a> (pdf).</p>
<p>As underlying hardware changes &#8211; faster networks, large memories, multi-core CPUs and SSDs &#8211; database software architectures may change too. <i>Hyder</i> architecture supports</p>
<blockquote><p>
. . . reads and writes on indexed records within classical multi-step transactions. It is designed to run on a cluster of servers that have shared access to a large pool of network-addressable raw flash chips. . . . Hyder uses a data-sharing architecture that scales out without partitioning the database or application.
</p></blockquote>
<p><strong>No partition scale-out</strong><br />
Today, most popular database clusters partition the database across multiple servers. Done well this works, but at some cost. The database design is non-trivial: cross-partition transactions, cache coherence, load balancing, scaling and multi-server debugging are knotty issues which translate into higher design and operation costs.</p>
<p>Hyder eliminates partitioning, distributed programming, layers of cache, remote procedure calls and load balancing. All servers can read and write the entire database &#8211; so any server can execute any transaction. Load-balancing is simple: direct new transactions to lightly-loaded servers.</p>
<p>Each update transaction runs on one machine and writes to a shared log &#8211; so there&#8217;s no 2-phase commit. And no 2-phase <strike>commit</strike> locking, which can force performance off a cliff when workloads spike.</p>
<p>The 3 main components of Hyder are the <i>log</i>, the <i>index</i> and the <i>roll-forward algorithm</i>.</p>
<p><strong>Log</strong><br />
The log runs on multiple flash devices &#8211; chips, DIMMs or ??? &#8211; and writes multi-page log records across multiple devices with parity to enable log recovery after device failures.</p>
<p>Hyder uses a <i>multi-versioned</i> database &#8211; old record versions aren&#8217;t updated in place; only the most recent version of a record is used &#8211; which has several useful properties:</p>
<ul>
<li>Server caches are inherently coherent since only the most recent versions of records are used.</li>
<li>Data can be read while writes are in progress.</li>
<li>Queries that can be decomposed can be run across multiple servers concurrently for a faster response time.</li>
</ul>
<p>[This may seem like voodoo to ACIDheads. A good technical intro to multi-versioning concurrency control (MVCC) is <a href="http://www.rtcmagazine.com/articles/view/101612" target="_blank">Multi-core software: to gain speed, eliminate resource contention</a>.]</p>
<p>Servers run a cache update process that keeps them current with updated records. Server caches don&#8217;t have to be identical and the cache invalidate messages that most clusters use for cache coherency aren&#8217;t needed.</p>
<p>All log writes are idempotent appends, so if a write fails the server can simply reissue the write. The authors describe several error modes and how Hyder handles them.</p>
<p><strong>Index</strong><br />
The index stores the database as a search tree with each node a [key, payload] pair. The tree can store, for example, a relational database. The index tree is also represented in the log.</p>
<p>Tree nodes are not updated in place. When node <i>n</i> is updated, a new copy &#8211; <i>n&#8217;</i> &#8211; is created. Then, of course, the parent node must be updated and so on up the tree.</p>
<p>A binary tree minimizes the number of node updates, but can be processor intensive. The optimal tree structure for Hyder is not yet resolved.</p>
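<p>Copying a node and then every parent up to the root is the classic path-copying trick. Here is a minimal sketch on a plain, unbalanced binary search tree. The <code>Node</code>, <code>upsert</code> and <code>lookup</code> names are mine, and the paper&#8217;s actual tree structure and log layout are not modeled.</p>

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Node:
    """Immutable [key, payload] tree node; updates create copies instead."""
    key: int
    payload: Any
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def upsert(node, key, payload):
    """Return a new root; only the nodes on the path to `key` are copied."""
    if node is None:
        return Node(key, payload)
    if key == node.key:
        return Node(key, payload, node.left, node.right)
    if key < node.key:
        return Node(node.key, node.payload, upsert(node.left, key, payload), node.right)
    return Node(node.key, node.payload, node.left, upsert(node.right, key, payload))


def lookup(node, key):
    """Walk down from a root and return the payload, or None if absent."""
    while node is not None and node.key != key:
        node = node.left if key < node.key else node.right
    return None if node is None else node.payload


v1 = None
for k in (50, 30, 70, 20):
    v1 = upsert(v1, k, f"rec{k}")
v2 = upsert(v1, 20, "rec20-updated")  # new root; the old root v1 is still a valid snapshot
```

<p>After the update, <code>v1</code> still reads the old record while <code>v2</code> reads the new one, and the untouched right subtree is shared between the two versions rather than copied.</p>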
<p>Garbage collection is an issue. Each node pointer includes the ID of the oldest reachable data element. An element older than any that is pointed to by a node is garbage.</p>
<p><strong>Roll-forward algorithm</strong><br />
This is the key process of Hyder.</p>
<p>When a record update begins, one server executes the transaction. The server is given a copy of the latest database root, a static snapshot of the entire database.</p>
<p>The updates are stored in a local cache and after execution the after-images are gathered into an <i>intention</i> record, which is broadcast to all servers and appended to the log. The update&#8217;s readset is included in the intention record, to ensure all intentions are properly ordered, none are lost, and the offset is made known to all servers.</p>
<p>Each server can assemble a local copy of the tail of the log, which is used to determine if there are conflicting updates. The <i>meld</i> procedure manages conflicting updates.</p>
<p>Appending the intention to the database log doesn&#8217;t commit the transaction. The intention references the static snapshot of the latest database root. The meld procedure determines if any committed transactions since the snapshot conflict with the intention. </p>
<p>If they don&#8217;t, all is good. If they do, the transaction is aborted.</p>
<p>All servers roll forward using meld and don&#8217;t message each other about committed and failed transactions. Therefore there is no lock manager and no 2-phase commit.</p>
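<p>To make the commit-or-abort decision concrete, here is a much simplified sketch of a meld-style check. It uses a flat dictionary instead of the paper&#8217;s index tree, and the <code>Intention</code> fields and the conflict rule (abort if anything the transaction read or wrote was changed by a commit after its snapshot) are my assumptions for illustration, not the authors&#8217; algorithm.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intention:
    snapshot: int        # log offset of the database state the transaction read (0 = empty)
    readset: frozenset   # keys the transaction read
    writes: tuple        # ((key, value), ...) after-images


def meld(shared_log):
    """Replay the log in order. The result depends only on the log, so every
    server that runs this reaches the same commit/abort decisions without messaging."""
    state = {}
    last_write = {}      # key -> offset of the committed intention that last wrote it
    outcomes = []
    for offset, intent in enumerate(shared_log, start=1):
        touched = intent.readset | {key for key, _ in intent.writes}
        conflict = any(last_write.get(key, 0) > intent.snapshot for key in touched)
        if conflict:
            outcomes.append("abort")
            continue
        for key, value in intent.writes:
            state[key] = value
            last_write[key] = offset
        outcomes.append("commit")
    return state, outcomes


shared_log = [
    Intention(snapshot=0, readset=frozenset(), writes=(("x", 1),)),       # offset 1
    Intention(snapshot=1, readset=frozenset({"x"}), writes=(("y", 2),)),  # offset 2
    Intention(snapshot=1, readset=frozenset({"y"}), writes=(("y", 3),)),  # offset 3
]
state, outcomes = meld(shared_log)
```

<p>The third intention read <code>y</code> from a snapshot taken before the second one committed its write to <code>y</code>, so it aborts. No server had to ask another server anything to reach that conclusion.</p>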
<p><strong>Contention</strong><br />
Losing the lock manager and 2-phase commit should help performance unless other points of contention throttle the system. Hyder&#8217;s points of contention include appending intentions to the log, melding the log at each server, and aborting transactions.</p>
<p>Intention appends are serial. The lower the write latency the more appends can be written. A 10&#181;s write latency means at most 100k TPS.</p>
<p>Network latency adds to write latency. Faster switches improve append performance.</p>
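<p>The arithmetic behind that ceiling, as a back-of-the-envelope calculation. The function name and the 10 microsecond network figure are assumptions for illustration only.</p>

```python
def max_appends_per_second(flash_write_us, network_us=0.0):
    """Upper bound on strictly serial appends: one per (write + network) latency."""
    return 1_000_000 / (flash_write_us + network_us)


flash_only = max_appends_per_second(10)        # the 10 microsecond write latency case
with_network = max_appends_per_second(10, 10)  # assumed extra 10 microseconds on the wire
```

<p>Doubling the per-append latency halves the ceiling, which is why faster switches matter as much as faster flash.</p>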
<p>The abort rate depends on the number of concurrent transactions that conflict. Fast transactions reduce the probability of aborts by reducing the number of concurrent transactions. </p>
<p>The worst case is a record subject to multiple updates from different servers. Detecting high-conflict transactions and serializing them by forcing them onto 1 server would reduce the hot data performance hit.</p>
<p><strong>Performance</strong><br />
The authors model Hyder&#8217;s performance with a focus on the high-contention corner cases. In general, the tests show linear scaling as servers are added. </p>
<p>The problems come when the underlying hardware limits are exceeded. Increasing execution times mean more aborts and performance falls off a cliff. From the paper:</p>
<p><a href="http://storagemojo.com/wp-content/uploads//2011/01/hyder_thrashing.jpg"><img src="http://storagemojo.com/wp-content/uploads//2011/01/hyder_thrashing.jpg" alt="" title="hyder_thrashing" width="475" height="286" class="aligncenter size-full wp-image-2240" /></a></p>
<p><strong>The StorageMojo take</strong><br />
We&#8217;ve been building disk workarounds for decades. We now tend to assume those workarounds are fundamental architectural requirements rather than hacks.</p>
<p>The <i>Hyder</i> paper asks us to imagine a world where non-volatile mass storage is fast and cheap &#8211; and how we could re-architect basic systems to be faster and cheaper too.</p>
<p>The authors&#8217; conclusion is a fair assessment:</p>
<blockquote><p>
Many variations of the Hyder architecture and algorithms would be worth exploring. There may also be opportunities to use Hyder’s logging and meld algorithms with some modification in other contexts, such as file systems and middleware. We suggested a number of directions for future work throughout the paper. No doubt there are many other directions as well.
</p></blockquote>
<p><strong>Courteous comments welcome, of course.</strong> I hope to get to some of the other CIDR papers before <a href="" target="_blank">FAST &#8217;11</a> snows me under.  <strong>Update:</strong> Phil Bernstein was kind enough to scan the post and I&#8217;ve updated 1 minor error. He also mentioned that it won the Best Paper award at the conference. Those CIDR folks have great taste in papers, don&#8217;t they?</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2011/01/24/hyder-a-flash-based-scale-out-database/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>quFiles: The right file at the right time</title>
		<link>http://storagemojo.com/2010/02/24/qufiles-the-right-file-at-the-right-time/</link>
		<comments>http://storagemojo.com/2010/02/24/qufiles-the-right-file-at-the-right-time/#comments</comments>
		<pubDate>Wed, 24 Feb 2010 19:43:08 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Future Tech]]></category>
		<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1910</guid>
		<description><![CDATA[The official best paper winner at FAST &#8217;10 isn&#8217;t one of the several I excerpted. I&#8217;m listening to the presentation as I write &#8211; trying live blogging &#8211; while following a fast talking presenter. The winning paper is quFiles: The right file at the right time by Kaushik Veeraraghavan, Jason Flinn and Brian Noble of [...]]]></description>
			<content:encoded><![CDATA[<p>The official best paper winner at FAST &#8217;10 isn&#8217;t one of the several I excerpted. I&#8217;m listening to the presentation as I write &#8211; trying live blogging &#8211; while following a fast-talking presenter.</p>
<p>The winning paper is <strong>quFiles: The right file at the right time</strong> by Kaushik Veeraraghavan, Jason Flinn and Brian Noble of the University of Michigan and Edmund B. Nightingale of Microsoft Research.<br />
From the abstract:</p>
<blockquote><p>
A quFile is a unifying abstraction that simplifies data management by encapsulating different physical representations of the same logical data. Similar to a quBit (quantum bit), the particular representation of the logical data displayed by a quFile is not determined until the moment it is needed. The representation returned by a quFile is specified by a data-specific policy that can take into account context such as the application requesting the data, the device on which data is accessed, screen size, and battery status.
</p></blockquote>
<p>One application is video files that may be played back on a variety of devices with differing resolutions, compute and graphics engines, codecs, editing capability and storage. There is one quFile that encapsulates several versions of the file &#8211; even different versions of the same file &#8211; and which one is returned depends on the device requesting the file.</p>
<p>The key is that every device asks for the same file name, simplifying file management on the server and file distribution. quFiles are space efficient, adding little to file size, while their compute overhead is in the single-digit percents. And no application changes are required.</p>
<p><strong>The StorageMojo take</strong><br />
FAST&#8217;s Best Paper isn&#8217;t necessarily StorageMojo&#8217;s Best Paper, yet this is a worthy candidate. Hiding the gory details of file types and network requirements from users is a Good Thing. I particularly like the support for file versioning, a feature I grew to love on the VMS operating system, but one not widely appreciated today.</p>
<p>Congratulations to the team on their win. </p>
<p><strong>Courteous comments welcome, of course.</strong> I&#8217;ll provide a link to the conference papers once the USENIX folks make them public.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2010/02/24/qufiles-the-right-file-at-the-right-time/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A petascale parallel database</title>
		<link>http://storagemojo.com/2010/02/08/a-petascale-parallel-database/</link>
		<comments>http://storagemojo.com/2010/02/08/a-petascale-parallel-database/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 03:01:06 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1891</guid>
		<description><![CDATA[MapReduce and its open source version, Hadoop, are parallel data analysis tools. A few lines of code can drive massive data reductions across thousands of nodes. Cool. Powerful though it is, Hadoop isn&#8217;t a database. Classic structured data analysis of the model/load/process type isn&#8217;t what it was designed for. That&#8217;s where the paper HadoopDB: An [...]]]></description>
			<content:encoded><![CDATA[<p>MapReduce and its open source version, Hadoop, are parallel data analysis tools. A few lines of code can drive massive data reductions across thousands of nodes.</p>
<p>Cool.</p>
<p>Powerful though it is, Hadoop isn&#8217;t a database. Classic <i>structured</i> data analysis of the model/load/process type isn&#8217;t what it was designed for.</p>
<p>That&#8217;s where the paper <a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.html" target="_blank">HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads</a> (pdf) comes in. Written by Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz and Alexander Rasin (the first four at Yale, the last at Brown), the paper proposes a method for building an open-source, commodity hardware-based, massively scalable, shared-nothing, analytical parallel database.</p>
<p><strong>What it is</strong><br />
HadoopDB coordinates SQL queries across multiple independent database nodes using Hadoop as the task coordinator and network communication layer. It uses the scheduling and job tracking of Hadoop while it intelligently pushes much of the query processing into the individual database nodes.</p>
<p>There are four components to HadoopDB.</p>
<ul>
<li>Database Connector. Each node has its own independent database. The connector is the interface between the database and Hadoop&#8217;s task trackers. A MapReduce job supplies the Connector with an SQL query and other parameters. The Connector executes the SQL query on the database and returns results as key-value pairs. It can be implemented to support a variety of databases.</li>
<li>Catalog. The information needed to access the databases and metadata such as cluster data sets, replica locations and data partitions is kept in the catalog.</li>
<li>Data loader. The data loader is responsible for two jobs. First, it executes a MapReduce job over Hadoop that reads the raw data files and partitions them into as many parts as there are nodes in the cluster. Second, it loads the partitions into the local file system of each node and chunks them according to a system-wide parameter.</li>
<li>SQL to MapReduce to SQL planner. The planner provides a parallel database front end to enable SQL queries. The planner transforms the queries into MapReduce jobs and optimizes the query plans for efficiency. This is the secret sauce of HadoopDB.</li>
</ul>
<p>HadoopDB complements the Hadoop infrastructure and does not replace it. Analysts have both available as needed.</p>
<p><strong>Heterogeneity</strong><br />
A key issue for Internet-scale systems is the ability to run in a heterogeneous environment where multi-year build-outs and rolling node replacement are the norm. That means that some nodes will be faster than others. HadoopDB breaks the work down into small tasks and moves them from slow to fast nodes automagically.</p>
<p><strong>Results</strong><br />
The authors ran some benchmarks on Amazon&#8217;s EC2 to test performance. The HadoopDB load times were about 10x those of Hadoop, but the higher performance of HadoopDB usually justified the longer setup time.</p>
<p>The authors found that HadoopDB was able to approach the performance of parallel database systems on much lower cost hardware and free software. Given the youth of the project, one can expect higher performance as improvements are made.</p>
<p><strong>The killer app for private clouds?</strong><br />
MapReduce and Hadoop are already in wide use among Internet-scale datacenters. As companies begin to understand and correlate social media, web activity and ad response rates, the demand for large-scale parallel database processing will grow. But will they want to ship it out to Amazon?</p>
<p>Depending on the quantity and sensitivity of the data, many organizations may prefer to keep the processing in-house. Private scale-out Hadoop clusters may become the poor company&#8217;s data warehouse of choice.</p>
<p><strong>The StorageMojo take</strong><br />
HadoopDB is more science project than commercial tool today. Yet the project demonstrates the feasibility of using scale-out compute/storage clusters for work that today typically requires proprietary high-end scale-up system architectures.</p>
<p>If capital costs are reduced by two thirds with a commodity/FOSS architecture, companies could afford to hire the expertise required to make it work. The free software/paid support model will prove quite successful in this space.</p>
<p><strong>Courteous comments welcome, of course.</strong>  </p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2010/02/08/a-petascale-parallel-database/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Oracle+Sun storage: wiser &amp; brighter</title>
		<link>http://storagemojo.com/2010/01/27/oraclesun-storage-wiser-brighter/</link>
		<comments>http://storagemojo.com/2010/01/27/oraclesun-storage-wiser-brighter/#comments</comments>
		<pubDate>Thu, 28 Jan 2010 02:11:23 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Enterprise]]></category>
		<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1882</guid>
		<description><![CDATA[While everyone else was watching the Apple iPad intro I was watching Oracle&#8217;s John Fowler talk about their systems and storage strategy. I like the iPad, but the O+S strategy could reshape the storage industry. More details will emerge and many decisions still remain but the basic elements are clear: Focus on direct sales. In [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>While everyone else was watching the Apple iPad intro I was watching Oracle&#8217;s John Fowler talk about their systems and storage strategy. I like the iPad, but the O+S strategy could reshape the storage industry.</p>
<p>More details will emerge and many decisions still remain but the basic elements are clear:</p>
<ul>
<li>Focus on direct sales. In the mid-1990s, when I joined Sun, the tenacity and aggressiveness of their direct sales force was a welcome change. Direct sales forces are expensive, but losing touch with your customers is even costlier. The combo&#8217;s unique value propositions can&#8217;t be sold by channels today. In 5 years &#8211; maybe.</li>
<li>A dedicated storage sales force. Generalist salespeople with millimeter deep storage product and application knowledge can&#8217;t compete with EMC and NetApp. Storage specialists aren&#8217;t easy to develop, so they&#8217;ll hire them &#8211; and they promise top commissions.</li>
<li>Deep integration of ZFS into storage systems. A software company <i>should</i> like a software solution to many of the biggest storage problems. Putting real muscle behind ZFS will help thousands of enterprise customers to rethink their high-performance data protection strategies.</li>
<li>Flash everywhere. Sun has done some creative things with flash already, such as Logzilla, and Oracle sees that much more can be done.</li>
</ul>
<p>Not mentioned &#8211; not that it should have been &#8211; is the fate of ZFS on Mac OS X. That would be a boost for all concerned.</p>
<p><strong>The StorageMojo take</strong><br />
Sun&#8217;s primary storage business has been a <a href="http://storagemojo.com/2004/10/27/suns-sorry-storage-story/" target="_blank">black smoking crater of disaster</a> for over a decade. And it didn&#8217;t help StorageTek to have them answer to know-nothings.</p>
<p>Despite that, Sun engineers outside the storage group developed innovative and game-changing technologies that the company couldn&#8217;t capitalize on. With Oracle&#8217;s investment, now they can.</p>
<p>No database/systems company can be successful without a healthy and very competitive storage team &#8212; and the high gross margins don&#8217;t hurt. With a hard-nosed focus on application performance, marketing competence and continued innovation, the O+S storage group could be a fun place to work. They are hiring!</p>
<p>It will take Oracle 12 to 18 months to develop the kind of customer traction that will make other storage vendors sit up and take notice. But Larry Ellison isn&#8217;t planning to lose and there is no reason he should.</p>
<p>Storage competition in the enterprise is about to get cranked up several notches. And that is a good thing for all customers.</p>
<p><strong>Courteous comments welcome, of course.</strong>  </p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2010/01/27/oraclesun-storage-wiser-brighter/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Coolness @ Storage Visions/CES 2010</title>
		<link>http://storagemojo.com/2010/01/08/coolness-storage-visionsces-2010/</link>
		<comments>http://storagemojo.com/2010/01/08/coolness-storage-visionsces-2010/#comments</comments>
		<pubDate>Fri, 08 Jan 2010 18:29:52 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Future Tech]]></category>
		<category><![CDATA[Information Management]]></category>
		<category><![CDATA[SSD/Flash Disk]]></category>

		<guid isPermaLink="false">http://storagemojo.com/2010/01/08/coolness-storage-visionsces-2010/</guid>
		<description><![CDATA[In no particular order, cool stuff at Storage Visions 2010 and CES. Mobo-mounted SSD. Soligen has announced an SSD that mounts on motherboards. The drive mounts firmly, requires no special cooling and takes little board space. Tiny USB drive. Verbatim has announced a tiny USB thumb drive that is a fraction the size of most [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>In no particular order, cool stuff at <a href="http://www.storagevisions.com/" target="_blank">Storage Visions 2010</a> and CES.</p>
<ul>
<li>Mobo-mounted SSD. <a href="http://www.soligencorp.com/" target="_blank">Soligen</a> has announced an SSD that mounts on motherboards. The drive mounts firmly, requires no special cooling and takes little board space.</li>
<li>Tiny USB drive. Verbatim has announced a tiny USB thumb drive that is a fraction the size of most current thumb drives. Call it a thumbnail drive. Perfect for keychains.</li>
<li>Super Talent is showing a 2 TB PCI-e SSD and claiming strong performance. At $6k gamers won&#8217;t buy it, but enterprises might.</li>
<li><a href="http://www.raidon.com.tw/" target="_blank">Raidon</a> is showing a nice collection of 2.5&#8243; drive enclosures, including 8 drive arrays. Not much larger than a 5.25&#8243; drive. Can&#8217;t find them all on the web yet, though.</li>
<li>A 32 GB Class 6 Micro SD is close to announcement. <i>Micro.</i></li>
<li>Supermicro showed a 48 drive JBOD/36 drive server chassis. The server is almost as dense as Sun&#8217;s Thumper &#8211; and drives are front and rear accessible.</li>
<li>Eye-fi&#8217;s Wi-Fi enabled SD cards don&#8217;t handle AVCHD video files, but they&#8217;re working on it. With all the consumer, prosumer and even pro camcorders using SD cards, this will be a popular market for them.</li>
<li>How about a double-ended flash drive: one end for personal; the other for work? Developed with the help of the social community at <a href="http://www.quirky.com/" target="_blank">Quirky.com</a>. They pay developers and influencers a percentage of the revenues. Cool!</li>
<li><a href="http://www.poketypoke.com/" target="_blank">PoketyPoke</a> is a con-call management service that reminds you of your con-calls and optionally records them and provides transcripts for $9/hr. I like.</li>
</ul>
<p><strong>In other news</strong><br />
I moderated a too-short panel on Cloud storage at Storage Visions. Several technologies are out there that will change the current economics and application profiles of online storage. The field is young.</p>
<p>Got an update on USB 3.0 from <a href="http://www.symwave.com/" target="_blank">Symwave</a>, the fabless IC firm that makes USB 3.0 chips. Bottom line: unlike USB 2.0, whose marketing made promises the protocol could not keep, the new version can achieve over 400 MB/sec.</p>
<p>Here&#8217;s the <strong>30 seconds over USB 3.0</strong> video:<br />
<object width="480" height="295"><param name="movie" value="http://www.youtube.com/v/jn802nnObvI&#038;hl=en_US&#038;fs=1&#038;"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/jn802nnObvI&#038;hl=en_US&#038;fs=1&#038;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="295"></embed></object></p>
<p><strong>The StorageMojo take</strong><br />
No blockbuster, sector-defining new products. But many stepwise enhancements that move us forward.</p>
<p>USB 3.0 is going to push consumer storage as we can move gigabytes in seconds rather than minutes. But it looks like Apple is poised to miss this one &#8211; which could cost them a big chunk of their pro market.</p>
<p><strong>Courteous comments welcome, of course.</strong>  Fixed the pooched hyperlinks and a couple of other minor edits.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2010/01/08/coolness-storage-visionsces-2010/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Tiny server clusters</title>
		<link>http://storagemojo.com/2009/12/06/tiny-server-clusters/</link>
		<comments>http://storagemojo.com/2009/12/06/tiny-server-clusters/#comments</comments>
		<pubDate>Mon, 07 Dec 2009 04:38:55 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Future Tech]]></category>
		<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1717</guid>
		<description><![CDATA[Virtual machines (VMs) solve the problem of many tiny servers on a big server. VMs are a logical outgrowth of Moore&#8217;s Law: server CPUs got bigger, faster, than the apps required. And Windows Server didn&#8217;t handle multiple apps well. But the growth of 100 megawatt Internet-scale data centers has architects rethinking efficiency-at-scale. As James Hamilton [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Virtual machines (VMs) solve the problem of many tiny servers on a big server. VMs are a logical outgrowth of Moore&#8217;s Law: server CPUs got bigger, faster, than the apps required. And Windows Server didn&#8217;t handle multiple apps well. </p>
<p>But the growth of 100 megawatt Internet-scale data centers has architects rethinking efficiency-at-scale. As James Hamilton put it in his presentation<br />
<a href="http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_ISCA2009.pdf" target="_blank">Internet-Scale Service Infrastructure Efficiency</a> (pdf):</p>
<blockquote><p>
Single dimensional performance measurements are not interesting at scale unless balanced against cost
</p></blockquote>
<p>Therefore: work done per $; per joule; and per rack.</p>
<p><strong>Microslice server</strong><br />
Because CPU performance has grown so much faster than storage &#8211; disk and DRAM &#8211; over the last 30 years, powerful multicore CPUs are spending much of their time idling. The microslice server idea: build servers from slower, cheaper and much more power-efficient CPUs.</p>
<p>Amazon has done just that. A microslice prototype jointly developed with <a href="http://www.sgi.com/" target="_blank">SGI</a> &#8211; formerly Rackable &#8211; using a lower power Athlon 4850e CPU handled over 9x the requests per second (RPS) of a rack of conventional servers. </p>
<p><a href="http://storagemojo.com/wp-content/uploads//2009/12/microslice_test.jpg"><img src="http://storagemojo.com/wp-content/uploads//2009/12/microslice_test.jpg" alt="microslice_test" title="microslice_test" width="475" height="218" class="aligncenter size-full wp-image-1718" /></a><br />
And the server cost just $500, used 1/5th the power and provided about 70% of the performance (RPS) of the much costlier server. Higher density &#8211; something like 6 servers per rack unit &#8211; provided the rack-level performance. </p>
<p><strong>Disk Workload from Hell</strong><br />
At October&#8217;s 22nd ACM Symposium on Operating Systems Principles (SOSP), David G. Andersen, Jason Franklin, Amar Phanishayee, Lawrence Tan and Vijay Vasudevan &#8211; all from Carnegie Mellon University &#8211; and Michael Kaminsky (Intel Research Pittsburgh) presented <a href="http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf" target="_blank">FAWN: A Fast Array of Wimpy Nodes</a>, a Best Paper award winner.</p>
<p>FAWN&#8217;s goal: maximizing queries per Joule in a high performance key-value storage system. Key-value stores are seeing increasing use in Internet-scale systems &#8211; the key is a unique identifier for the associated value.</p>
<p>The paper explains:</p>
<blockquote><p>
The workloads these systems support share several characteristics: they are I/O, not computation, intensive, requiring random access over large datasets; they are massively parallel, with thousands of concurrent, mostly-independent operations; their high load requires large clusters to support them; and the size of objects stored is typically small, e.g., 1 KB values for thumbnail images, 100s of bytes for wall posts, twitter messages, etc.
</p></blockquote>
<p>The paper describes both the hardware &#8211; which uses 500 MHz embedded processors, 256 MB DRAM and 4 GB CF flash &#8211; and the software &#8211; a log-structured per-node datastore that optimizes flash performance. The net/net: FAWN is over 6x more efficient &#8211; on queries per second &#8211; than conventional systems. </p>
<p>At 1/5th the cost. And 1/8th the power.</p>
<p><strong>The StorageMojo take</strong><br />
This is more important than it looks. The Internet guys are optimizing for power, something most businesses ignore. But the low cost and performance of these nodes is attractive to everyone else. </p>
<p>Back in the day, DEC sold a lot of 3 node DSSI VAXclusters. Why? They were cheap(er) and if you lost a node you still had 2/3rds of your system.</p>
<p>In 2010 I expect to see low-end, cluster-based storage systems that offer multi-node resilience at low cost. Not just purchase price either, but service costs as well. A node went down? We&#8217;ll overnight you a new one.</p>
<p>The low-end is about to get a lot more interesting.</p>
<p><strong>Courteous comments welcome, of course.</strong> The other SOSP best paper is fascinating too: <a href="http://www.sigops.org/sosp/sosp09/papers/dobrescu-sosp09.pdf" target="_blank">RouteBricks: Exploiting Parallelism to Scale Software Routers</a>. I hope I have time to post on it. </p>
<p>And BTW, Intel is also showing a microslice proto.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2009/12/06/tiny-server-clusters/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Mac ZFS is dead</title>
		<link>http://storagemojo.com/2009/10/27/mac-zfs-is-dead/</link>
		<comments>http://storagemojo.com/2009/10/27/mac-zfs-is-dead/#comments</comments>
		<pubDate>Tue, 27 Oct 2009 07:58:42 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1664</guid>
		<description><![CDATA[Ding, dong. PC file system progress took a giant step back this week with the news on MacOSforge that Apple&#8217;s ZFS project has been discontinued. ZFS Project Shutdown 2009-10-23 The ZFS project has been discontinued. The mailing list and repository will also be removed shortly. Apple announced in June &#8217;08 that Snow Leopard server would [...]]]></description>
			<content:encoded><![CDATA[<p></p><p><strong>Ding, dong.</strong><br />
PC file system progress took a giant step back this week with the <a href="http://zfs.macosforge.org/" target="_blank">news</a> on MacOSforge that Apple&#8217;s ZFS project has been discontinued. </p>
<blockquote><p>
ZFS Project Shutdown 2009-10-23<br />
The ZFS project has been discontinued. The mailing list and repository will also be removed shortly.
</p></blockquote>
<p>Apple announced in June &#8217;08 that Snow Leopard server would support ZFS. But things came apart early this year. </p>
<p><strong>What happened?</strong><br />
Jeff Bonwick, ZFS architect, <a href="http://mail.opensolaris.org/pipermail/zfs-discuss/2009-October/033125.html" target="_blank">posted</a> Saturday on an earlier quoted comment:</p>
<blockquote><p>
> Apple can currently just take the ZFS CDDL code and incorporate it<br />
> (like they did with DTrace), but it may be that they wanted a &#8220;private<br />
> license&#8221; from Sun (with appropriate technical support and<br />
> indemnification), and the two entities couldn&#8217;t come to mutually<br />
> agreeable terms.</p>
<p>I cannot disclose details, but that is the essence of it.</p>
<p>Jeff
</p></blockquote>
<p><strong>Indemnification?</strong><br />
Sun is being sued by NetApp claiming that ZFS infringes on NetApp patents. If NetApp won, Apple would find itself in a tough position unless Sun shouldered the financial damage. That&#8217;s indemnification.</p>
<p>IMHO Sun has a good case that NetApp&#8217;s patents will be invalidated by prior art. But with all their other problems and the Oracle purchase it was a headache they, Oracle and Apple didn&#8217;t need.</p>
<p><strong>Where does Apple go from here?</strong><br />
Apple has hired some smart file system engineers and <a href="http://jobs.apple.com/index.ajs?method=mExternal.showJob&#038;RID=42559" target="_blank">wants to hire more</a> to work on &#8220;state-of-the-art file system technologies for Mac OS X.&#8221;</p>
<p>I&#8217;m not convinced: it sounds like standard HR boilerplate and a snare for the unwary. But hey! it could happen.</p>
<p>But writing new file systems isn&#8217;t easy. It takes 5-7 years for a new file system to achieve the maturity needed to support large-scale deployment. Even replacing QuickTime is non-trivial.</p>
<p>So if Apple is starting from scratch we have a long wait for real innovation to appear. Like Mac OS XII.</p>
<p><strong>What about Microsoft?</strong><br />
Meanwhile Redmond&#8217;s file system gurus are well aware of NTFS issues. They&#8217;re making stepwise enhancements. </p>
<p>But as the NTFS and HFS+ architectures age and the pace of storage innovation increases the gap between what is and what could be grows. It&#8217;s like putting a 1001 hp Bugatti engine in a Model T: the power is there but you can&#8217;t use it.</p>
<p><strong>The StorageMojo take</strong><br />
I already hate software patents &#8211; but that&#8217;s another post. As long as the law allows, companies will try to enforce them.</p>
<p>Why didn&#8217;t Apple cut a deal with NetApp directly? Probably for the same reason Sun didn&#8217;t: money. Apple has a lot more of it than Sun, but Steve is a tightwad, especially when it comes to storage. </p>
<p>NetApp could have raised their visibility in the consumer market by cutting a deal with Apple, but NetApp&#8217;s management isn&#8217;t thinking strategically about the low-end of the market, as the rapidity of StoreVault&#8217;s entrance and exit demonstrated. True, they have bigger issues, but multi-tasking is supposed to be a corporate strength.</p>
<p>Consumers are generating masses of video and photos at an accelerating pace &#8211; and they&#8217;ll need reliable, available and dirt-easy storage. Lots of it. </p>
<p>Let EMC supply it!</p>
<p>Until the Next New Thing in file systems rolls out of Cupertino, Redmond or, maybe, Redwood City, consumers will be stuck with too many BSODs, missing or corrupted files and app crashes. Let&#8217;s hope we don&#8217;t have to wait too many more years.</p>
<p><strong>Comments welcome, of course.</strong>  An earlier version of this was posted on <a href="http://blogs.zdnet.com/storage/" target="_blank">Storage Bits</a>. Can you spot the dozen or so differences?</p>
<p>And there is a Google code <a href="http://code.google.com/p/maczfs/" target="_blank">page</a> for MacZFS for you diehards out there.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2009/10/27/mac-zfs-is-dead/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>RDBMS: going the way of the mainframe?</title>
		<link>http://storagemojo.com/2009/09/14/rdbms-going-like-mainframes/</link>
		<comments>http://storagemojo.com/2009/09/14/rdbms-going-like-mainframes/#comments</comments>
		<pubDate>Tue, 15 Sep 2009 04:34:11 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Future Tech]]></category>
		<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1588</guid>
		<description><![CDATA[High-end big iron storage arrays have long owned the transaction processing market. The big relational database systems need all the I/O and availability you can give them. But what if we didn&#8217;t need big relational databases? What then? RDBMS &#8211; RIP? On his ACM blog, Michael Stonebraker, a database guru, says that relational databases may [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>High-end big iron storage arrays have long owned the transaction processing market. The big relational database systems need all the I/O and availability you can give them.</p>
<p>But what if we didn&#8217;t need big relational databases? What then?</p>
<p><strong>RDBMS &#8211; RIP?</strong><br />
On his <a href="http://cacm.acm.org/blogs/blog-cacm/32212-the-end-of-a-dbms-era-might-be-upon-us/fulltext" target="_blank">ACM blog</a>, Michael Stonebraker, a database guru, says that relational databases may be nearing the end of the line. He says that their one-size-fits-all philosophy and 1980s code base are at the end of their useful life.</p>
<p>Why? Quoting Stonebraker:</p>
<blockquote><p>
If we examine the non-trivial sized DBMS markets, it turns out that the current relational DBMSs can be beaten by approximately a factor of 50 in most any market I can think of.</p>
<p>In the data warehouse market, a column store beats a row store by approximately a factor of 50 on typical business intelligence queries. . . .</p>
<p>In the online transaction processing (OLTP) market, a lightweight main memory DBMS beats a row store by a factor of 50. . . .</p>
<p>In the science DBMS market, users have never liked relational DBMSs and want a non-relational model and query facility. . . .</p>
<p>Text applications have never used relational DBMSs. This was pointed out to me most clearly by Eric Brewer nearly 15 years ago in the early days of Inktomi. He wanted to use a relational DBMS to store the results of web crawling, but found relational DBMSs to be two orders of magnitude slower than a homebrew system. . . .</p>
<p>Even in XML, where the current major vendors have spent a great deal of energy extending their engines, it is claimed that specialized engines, such as Mark Logic or Tamino, run circles around the major vendors according to a private communication by Dave Kellogg.</p>
<p>In summary, one can leverage at least the following ideas to get superior performance:</p>
<p>A non-relational data model. . . . .</p>
<p>A different implementation of tables. . . . </p>
<p>A different implementation of transactions. . . .
</p></blockquote>
<p>Mr. Stonebraker&#8217;s comments have interesting storage implications. First, big iron storage arrays may not have the relational database management market to rely on much longer. </p>
<p>Second, what happens to storage system engineering when we no longer have one basic data management model to design for? And that is without considering the effect of a 50 times faster database on applications.</p>
<p>In the hardware world a 50 times speed up has 2 major effects: existing problems increase their resolution to absorb the additional compute cycles; and new applications &#8211; both low and high end &#8211; become economically feasible.</p>
<p>Is there 50 times more data we would collect from existing applications if we had a 50 times faster database? Or will we be running enterprise data management applications on hardware with the power of a netbook? Great power savings. Not so great for hardware vendors.</p>
<p>Mr. Stonebraker theorizes that the DBMS replacement will be a collection of vertical market specific engines. Each, no doubt, with its own storage I/O profile.</p>
<p><strong>The StorageMojo take</strong><br />
Just as the ground has shifted under storage vendors in the last decade, it may be that DBMS vendors face the same <strike>problem</strike> opportunity in the coming decade. </p>
<p>If past experience is any guide, the storage industry will face multiple challenges supporting these new data management models, even as their high performance and lower (relative) costs drive new waves of application invention and adoption.</p>
<p>Only one thing is certain: much more data will be collected and, therefore, stored. The opportunities keep on coming, whether we are ready for them or not.</p>
<p><strong>Courteous comments welcome, of course.</strong> For an interesting dissent, check out <a href="http://www.daniel-lemire.com/blog/archives/2009/09/16/relational-databases-are-they-obselete/" target="_blank">Daniel Lemire&#8217;s blog.</a></p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2009/09/14/rdbms-going-like-mainframes/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Why did Apple drop ZFS?</title>
		<link>http://storagemojo.com/2009/08/31/why-did-apple-drop-zfs/</link>
		<comments>http://storagemojo.com/2009/08/31/why-did-apple-drop-zfs/#comments</comments>
		<pubDate>Mon, 31 Aug 2009 07:04:51 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Future Tech]]></category>
		<category><![CDATA[Information Management]]></category>
		<category><![CDATA[Marketing]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1557</guid>
		<description><![CDATA[With the release of Snow Leopard it is now official: no ZFS &#8211; anywhere &#8211; in Mac OS 10.6. Given that Apple went to the trouble of announcing it last year as part of Snow Leopard Server this is quite a reversal. The question is why? Many theories I wrote this up on ZDnet Friday. [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>With the release of Snow Leopard it is now official: no ZFS &#8211; anywhere &#8211; in Mac OS 10.6. Given that Apple went to the trouble of announcing it last year as part of Snow Leopard Server this is quite a reversal.</p>
<p>The question is why?</p>
<p><strong>Many theories</strong><br />
I wrote this up on <a href="http://blogs.zdnet.com/storage/?p=584" target="_blank">ZDnet</a> Friday. At the time my theory was that the integration schedule or migration issues turned out to be less manageable than once thought. Or maybe NIH reared its parochial head. </p>
<p>ZDnet readers wrote in with ideas as well, the most popular being that technical problems with ZFS itself forced the issue. I discounted this because, after all, ZFS is in production in large, I/O-intensive environments. If it were fundamentally broken we&#8217;d know by now. </p>
<p>I follow the ZFS discussion list and while there are issues, they aren&#8217;t showstopper bugs.</p>
<p><strong>A new narrative</strong><br />
But then a couple of sources came in with a new angle: that Sun&#8217;s licensing demands killed the deal. Sun prefers the CDDL and may have asked for some extra protections, including Apple&#8217;s promise not to seek damages should Sun lose the ZFS patent infringement suits initiated by NetApp, and those demands caused Apple to reconsider the business risk of ZFS.</p>
<p>Sun could, of course, GPL ZFS, but it may also be that the ZFS engineering team &#8211; like other Sun engineers &#8211; rejected GPL. I&#8217;d love to get some comment from the ZFS team &#8211; very bright guys all &#8211; because this reminds me of the late &#8217;80s at DEC when senior people begged DEC founder and CEO Ken Olsen to essentially open source some of DEC&#8217;s advanced software, like VMS, VMSclusters and DECnet.  </p>
<p>Ken, a very smart engineer who shepherded DEC from a $70,000 startup to a $14 billion company, couldn&#8217;t see the business sense in giving away what the company had spent millions developing. So that leadership technology withered as DEC cratered.</p>
<p>The NetApp lawsuit may have come into play, making patent risk pertinent and potentially costly. Given that and the other CDDL-related risks, plus engineering opposition to GPL, Apple must have reluctantly stepped away. Apple would like the bragging rights over Windows 7 that ZFS would give it, but in this narrative Sun&#8217;s pre-acquisition turmoil and tougher-than-expected licensing terms killed the deal.</p>
<p><strong>Going forward</strong><br />
Now that Oracle is acquiring Sun, things look brighter. Oracle is already bankrolling a GPL&#8217;d ZFS clone &#8211; btrfs &#8211; that will take years to reach the level of maturity that ZFS now enjoys. Once they own ZFS, why wouldn&#8217;t they GPL it and call it good?</p>
<p><strong>Update:</strong> Also, Oracle is in a stronger position to negotiate a settlement with NetApp over the ZFS/WAFL patent suits. After all, why would a storage company want the world&#8217;s largest database company as an enemy? <strong>End update.</strong></p>
<p><strong>The StorageMojo take</strong><br />
This is speculation of course and no doubt missing many specifics. But what is public &#8211; that Apple announced ZFS in June 2008, included a read-only CLI version in Leopard Server and is not shipping it in August 2009 &#8211; is evidence enough that things went awry. What other than a license issue would cause Apple to step away from even the read-only CLI version in Snow Leopard Server?</p>
<p>The ZFS team has produced a game-changing file system/volume manager. The chance to get it into the hands of tens of millions of Mac users &#8211; and to influence Redmond&#8217;s file system strategy &#8211; seems to this outsider an opportunity of a lifetime. </p>
<p>If the ZFS engineering team opposed this &#8211; and I&#8217;d love to hear their take &#8211; I encourage them to reconsider. Marketers often ask the question &#8220;would you prefer 100% of nothing or 40% of something huge?&#8221; </p>
<p>Once the acquisition of Sun is complete, I hope Oracle quickly GPLs ZFS and cuts a deal with Apple. It will be good for them, for ZFS and for the entire industry.</p>
<p><strong>Courteous comments welcome, of course.</strong>  I worked for Sun for 3 years in the mid-90s and despite the many problems in the storage group I remain impressed by much of the company&#8217;s culture and accomplishments.</p>
<p><strong>Update:</strong> I got the indemnification issue backwards in the original post and I thank those readers who deciphered my intent. For those who didn&#8217;t, I corrected it. While I was at it I made some other edits for clarity. <strong>End update.</strong></p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2009/08/31/why-did-apple-drop-zfs/feed/</wfw:commentRss>
		<slash:comments>59</slash:comments>
		</item>
		<item>
		<title>Google File System v2, part 2</title>
		<link>http://storagemojo.com/2009/08/18/google-file-system-v2-part-2/</link>
		<comments>http://storagemojo.com/2009/08/18/google-file-system-v2-part-2/#comments</comments>
		<pubDate>Tue, 18 Aug 2009 23:50:41 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1537</guid>
		<description><![CDATA[Bigtable to the rescue (sort of) In Part 1, Sean Quinlan, a Google engineer, related how the original GFS single master architecture became a bottleneck. But since Google controls its entire software stack from OS to apps, it could compensate by tweaking the apps and the infrastructure, like Bigtable. Bigtable is Google&#8217;s structured storage system. [...]]]></description>
			<content:encoded><![CDATA[<p></p><p><strong>Bigtable to the rescue (sort of)</strong><br />
In Part 1, Sean Quinlan, a Google engineer, related how the original GFS single master architecture became a bottleneck. But since Google controls its entire software stack from OS to apps, it could compensate by tweaking the apps and the infrastructure, like Bigtable.</p>
<p>Bigtable is Google&#8217;s structured storage system. If you need a refresher &#8211; I did &#8211; check out <a href="http://storagemojo.com/2006/09/07/googles-bigtable-distributed-storage-system-pt-i/" target="_blank">Google&#8217;s Bigtable Distributed Storage System</a>.</p>
<p>The short version from 3 years ago is:</p>
<blockquote><p>
Google’s Bigtable is essentially a massive, distributed 3-D spreadsheet. It doesn’t do SQL, there is limited support for atomic transactions, nor does it support the full relational database model. In short, in these and other areas, the Google team made design trade-offs to enable the scalability and fault-tolerance Google apps require.  . . .Bigtable today supports almost 400 Google apps with data stores ranging up to several hundred terabytes.
</p></blockquote>
<p>Bigtable has a distributed lock server, Chubby, that coordinates the several thousand nodes in large Bigtable clusters. Presumably that is why Bigtable has been able to scale to handle many of the problems the single-master GFS has created.</p>
<p>But &#8211; and there&#8217;s always a but &#8211; Quinlan says that Bigtable isn&#8217;t an optimal solution to the many files/small files problem:</p>
<blockquote><p>
. . . [U]sing BigTable . . . as a way of fighting the file-count problem where you might have otherwise used a file system to handle that — then you would not end up employing all of BigTable&#8217;s functionality . . . . BigTable isn&#8217;t really ideal . . . in that it requires resources for its own operations that are nontrivial. Also, it has a garbage-collection policy that&#8217;s not super-aggressive, so that might not be the most efficient way to use your space. . . . people who have been using BigTable purely to deal with the file-count problem probably haven&#8217;t been terribly happy, but . . .  it is one way for people to handle that problem.
</p></blockquote>
<p>GFS was designed to maximize bandwidth to disk as the crawlers sluiced data back to Google. Low latency was a non-goal. But as Google offered more user-facing apps, latency became important. Sean notes:</p>
<blockquote><p>
. . . if you&#8217;re writing a file, it will typically be written in triplicate—meaning you&#8217;ll actually be writing to three chunkservers. Should one of those chunkservers die . . . the GFS master will notice the problem and schedule what we call a <i>pullchunk</i>, which means it will basically replicate one of those chunks. That will get you back up to three copies, and then the system will pass control back to the client, which will continue writing.</p>
<p>When we do a pullchunk we limit it to something on the order of 5-10 MB a second. So, for 64 MB, you&#8217;re talking about 10 seconds. . . . If you are working on Gmail, however, and you&#8217;re trying to write a mutation that represents some user action, then getting stuck for a minute is really going to mess you up.
</p></blockquote>
<p><strong>Consistency, <i>consistency</i>, cOnsiStencY </strong><br />
Since GFS and Bigtable were designed to run on massive pools of commodity hardware, failures and faults were a given. </p>
<p>For example, disk drives would tell Linux they supported some IDE versions when they really didn&#8217;t, leading to silent data corruption when drives and kernels disagreed about the drive&#8217;s state. GFS includes rigorous end-to-end checksumming to protect data from network and storage corruption, but other decisions compromised data consistency.</p>
<p>GFS simply assumes that there will be times that stale data is returned to applications. Data appended to an open file won&#8217;t be seen until the file is reopened.</p>
<p>Given that Google owned GFS, Bigtable and the apps, it seemed acceptable to ask the apps to handle some problems. But some of the inherent problems are hard ones.</p>
<p>If a client crashes in the middle of a write, data could be left in an indeterminate state. The RecordAppend operation supported multiple writers to a single file, so if a primary writer failed you could end up with multiple inconsistent copies of the data in a single file &#8211; with different versions of the file in different chunks.</p>
<p>These things may not happen all that often, but it&#8217;s Murphy&#8217;s Law: if they can happen, they will. With several million servers and dozens of data centers, it is a continuing headache.</p>
<p><strong>Snapshot?</strong><br />
Sean makes an interesting comment about the GFS snapshot feature, which he calls &#8220;the most general-purpose snapshot capability you can imagine.&#8221;</p>
<blockquote><p>
I also think it&#8217;s interesting that the snapshot feature hasn&#8217;t been used more since it&#8217;s actually a very powerful feature. . . from a file-system point of view, it really offers a pretty nice piece of functionality.
</p></blockquote>
<p>I&#8217;d like to hear from Google app developers why they didn&#8217;t use the snapshot feature. I suspect it is an interesting set of reasons.</p>
<p><strong>The StorageMojo take</strong><br />
Google engineers have been hard at work for the last 2 years building a distributed master system that will work better with Bigtable to fix many of the current problems.</p>
<p>Still, it is amazing that in one year four or five people could put together a file system critical to Google&#8217;s success for almost 10 years. It looks creaky now, but it has also scaled far beyond what its developers expected.</p>
<p>&#8220;Scalability&#8221; is one of the most abused words in the IT marketing lexicon. It is often used where &#8220;expandability&#8221; is more appropriate. </p>
<p>That GFS has scaled 1,000x or more is a benchmark for Internet data center infrastructure. With billions of people still not on the web and the growth of sensor networks, machine translation and other scale-intensive apps, 1,000x is the new normal.</p>
<p>Get used to it. Plan for it.</p>
<p><strong>Courteous comments welcome, of course.</strong>  </p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2009/08/18/google-file-system-v2-part-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Not a filesystem, not a database.</title>
		<link>http://storagemojo.com/2009/06/17/not-a-filesystem-not-a-database/</link>
		<comments>http://storagemojo.com/2009/06/17/not-a-filesystem-not-a-database/#comments</comments>
		<pubDate>Wed, 17 Jun 2009 17:48:35 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Future Tech]]></category>
		<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1441</guid>
		<description><![CDATA[Jeff Darcy has a good post on key data stores, like Amazon&#8217;s Dynamo, and how they differ from filesystems and databases. He relates his transition from a filesystem purist to a more flexible perspective. The thing that really changed my mind about this was an observation in the Dynamo paper: strong consistency reduces availability. I&#8217;ve [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Jeff Darcy has a good <a href="http://pl.atyp.us/wordpress/?p=2142" target="_blank">post</a> on key data stores, like Amazon&#8217;s Dynamo, and how they differ from filesystems and databases. He relates his transition from a filesystem purist to a more flexible perspective.</p>
<blockquote><p>
The thing that really changed my mind about this was an observation in the Dynamo paper: strong consistency reduces availability. I&#8217;ve always thought of data availability in terms of data not being lost or stranded on the other side of a failed network connection. The Dynamo insight is that many applications have to do a lot of work within a small acceptable-response-time window, and to make sure that they fit into that window they have to impose deadlines on all sub-operations including data access. If consistency issues make data unavailable within that deadline then they&#8217;ve made it unavailable period, with practically the same effect as if the data were unavailable in any other sense.
</p></blockquote>
<p>In short, while there is a class of applications where traditional consistency is important, there is an emerging class where strong consistency isn&#8217;t affordable or necessary. Good stuff.</p>
<p><strong>Another point</strong><br />
Many of the features that make up these non-FS/non-DB stores seem to have a lot in common with object storage. In a highly mobile world the whole idea of placing a file in cyberspace by a path name is anachronistic at best: it could be, physically, almost anywhere and is most likely in several places at once.</p>
<p><strong>The StorageMojo take</strong><br />
While the name &#8220;object&#8221; is problematic for market acceptance, the concept of managing objects in a flat address space &#8211; like the web itself &#8211; is a better fit for a mobile networked world. There is a major opportunity to move file management infrastructure forward to reflect the world we now live in rather than a 35-year-old server environment.</p>
<p><strong>Courteous comments welcome, of course.</strong> Thanks to Wes Felter&#8217;s <a href="http://wmf.editthispage.com/" target="_blank">Hack the Planet</a> blog for the link to Jeff&#8217;s post.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2009/06/17/not-a-filesystem-not-a-database/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Btrfs vs ZFS &#8211; OMG!</title>
		<link>http://storagemojo.com/2009/05/20/btrfs-vs-zfs-omg/</link>
		<comments>http://storagemojo.com/2009/05/20/btrfs-vs-zfs-omg/#comments</comments>
		<pubDate>Wed, 20 May 2009 23:49:32 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1366</guid>
		<description><![CDATA[Am @ Interop today &#8211; a nice, relaxing 250 mile drive from home &#8211; so this isn&#8217;t a standard StorageMojo post. Think of it as an expanded tweet. Part of what Oracle gets with Sun is ZFS. And part of what Chris Mason of Oracle is working on is Btrfs &#8211; B-Tree or &#8220;butter&#8221; FS [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Am @ Interop today &#8211; a nice, relaxing 250 mile drive from home &#8211; so this isn&#8217;t a standard StorageMojo post. Think of it as an expanded tweet.</p>
<p>Part of what Oracle gets with Sun is ZFS. And part of what Chris Mason of Oracle is working on is Btrfs &#8211; B-Tree or &#8220;butter&#8221; FS &#8211; seen as a Linux answer to ZFS. With a GPL license.</p>
<p>With <a href="http://linuxupdate.blogspot.com/2009/01/btrfs-next-generation-file-system-for.html" target="_blank">many of the same features</a> &#8211; such as parent-stored checksums and snapshots &#8211; Btrfs provides important new functionality to Linux. But if ZFS is an Oracle property, how hard could it be to change the licensing and open it up to the Linux community?</p>
<p><strong>The StorageMojo take</strong><br />
I&#8217;m asking the question, not answering it.  License T&#038;C&#8217;s are important, but if the bottom line is that CDDL is incompatible with GPL, will Oracle be able to fix that? Will they want to? </p>
<p>Or does Linux really need AZFS &#8211; Almost ZFS? </p>
<p><strong>Courteous comments welcome, of course.</strong>  </p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2009/05/20/btrfs-vs-zfs-omg/feed/</wfw:commentRss>
		<slash:comments>38</slash:comments>
		</item>
	</channel>
</rss>

