<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>StorageMojo &#187; Clusters</title>
	<atom:link href="http://storagemojo.com/category/clusters/feed/" rel="self" type="application/rss+xml" />
	<link>http://storagemojo.com</link>
	<description>Data storage info &#38; analysis</description>
	<lastBuildDate>Fri, 20 Jan 2012 06:10:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Gridstore snags Geoff Barrall</title>
		<link>http://storagemojo.com/2012/01/10/gridstore-snags-geoff-barrall/</link>
		<comments>http://storagemojo.com/2012/01/10/gridstore-snags-geoff-barrall/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 17:09:37 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Clusters]]></category>
		<category><![CDATA[NAS, IP, iSCSI]]></category>
		<category><![CDATA[SOHO/SMB]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2568</guid>
		<description><![CDATA[BlueArc and Drobo founder Geoff Barrall has a new perch: Gridstore, one of the companies I&#8217;ve been following for almost 3 years. Geoff is the new executive chairman. Formal announcement is expected this week. Gridstore&#8217;s concept is a low-cost scale-out NAS appliance designed for office environments. Each box is a small, low-power node with a [...]]]></description>
			<content:encoded><![CDATA[<p></p><p><a href="http://www.bluearc.com/" target="_blank">BlueArc</a> and <a href="http://www.drobo.com/" target="_blank">Drobo</a> founder Geoff Barrall has a new perch: <a href="http://gridstore.com/" target="_blank">Gridstore</a>, one of the companies I&#8217;ve been <a href="http://www.zdnet.com/blog/storage/google-style-storage-comes-to-the-smb/1323" target="_blank">following</a> for almost 3 years. Geoff is the new executive chairman. Formal announcement is expected this week.</p>
<p>Gridstore&#8217;s concept is a low-cost scale-out NAS appliance designed for office environments. Each box is a small, low-power node with a couple of TB. Stack &#8216;em for as much redundancy, capacity and performance as you want.</p>
<p>Think of it as the consumerization of hyper-scale technology. <a href="http://www.nutanix.com/" target="_blank">Nutanix</a> writ small.</p>
<p><strong>Gridstore details</strong><br />
Gridstore is offering a low-cost, scale-out network file server for $500 a node. That is too cheap for the enterprise storage companies to sell directly.</p>
<p>Founded 5 years ago, Gridstore got a beta out in 2010 and has been shipping for well over a year. The product is a Microsoft CIFS protocol file server, using Microsoft&#8217;s storage server software. Running on small, 25 watt Atom-based boxes, a 6 node configuration is the size of a bread box.</p>
<p>Like other scale-out NAS systems, the Gridstore NAS has no single point of failure and can survive multiple node failures without going down or losing data.</p>
<p>They call their redundancy scheme RAIDg. When you set up a volume you dial in how many faults you want to survive and the software handles the rest.</p>
<p>Today the number of faults they can handle is limited to half the number of nodes, minus one: a 6 node configuration can handle the loss of 2 nodes. They expect to relax that requirement in the future.</p>
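<p>A minimal sketch of that limit, assuming integer (floor) division for odd node counts, which the post does not specify. <code>max_node_faults</code> is a hypothetical helper for illustration, not part of Gridstore&#8217;s software:</p>

```python
def max_node_faults(nodes: int) -> int:
    """Half the number of nodes, minus one (hypothetical helper).

    Floor division for odd node counts is an assumption; the post only
    gives the 6 node case.
    """
    if not nodes >= 2:
        raise ValueError("a grid needs at least two nodes")
    return nodes // 2 - 1


for n in (2, 4, 6, 8):
    print(f"{n} nodes: survives {max_node_faults(n)} node failure(s)")
```

<p>Under this reading a 2 node grid tolerates no node loss at all, which is why the limit matters most for small offices.</p>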
<p><strong>The StorageMojo take</strong><br />
Haven&#8217;t spoken to Geoff about this, but Gridstore seems like a natural for him. If there&#8217;s a theme to his many endeavors, it&#8217;s making advanced NAS technology more accessible.</p>
<p>Gridstore fits the bill nicely. If there&#8217;s one complaint about Drobo, it&#8217;s the lack of box-level redundancy. Gridstore answers this objection, at a higher price point.</p>
<p>Drobo &#8211; over 200,000 units sold &#8211; has blazed a trail for bringing advanced storage technology to the masses at affordable prices. They may be the first, but as Gridstore and others demonstrate, they won&#8217;t be the last.</p>
<p><strong>Courteous comments welcome, of course.</strong> Hoping to make it to CES later this week. Readers: anyone I should make a point to see?</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2012/01/10/gridstore-snags-geoff-barrall/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>The network is choking our storage</title>
		<link>http://storagemojo.com/2011/10/20/the-network-is-choking-our-storage/</link>
		<comments>http://storagemojo.com/2011/10/20/the-network-is-choking-our-storage/#comments</comments>
		<pubDate>Thu, 20 Oct 2011 17:03:08 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Future Tech]]></category>
		<category><![CDATA[SAN, FC]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2533</guid>
		<description><![CDATA[Amazon Web Services architect James Hamilton has been posting on network issues for over a year and researching them much longer. As Ethernet becomes the de facto SAN technology, his views become more relevant to the larger storage market. Critique Part of Mr. Hamilton&#8217;s concern is the structure of the networking industry: the high margins; [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Amazon Web Services architect James Hamilton has been <a href="http://perspectives.mvdirona.com/2011/10/01/ChangesInNetworkingSystems.aspx" target="_blank">posting</a> on network issues for over a year and researching them much longer. As Ethernet becomes the <i>de facto</i> SAN technology, his views become more relevant to the larger storage market.</p>
<p><strong>Critique</strong><br />
Part of Mr. Hamilton&#8217;s concern is the structure of the networking industry: the high margins; the dominance of a single player, Cisco; the closed technology; and the heavy vertical integration. All antithetical to the dynamics that have driven server costs down so successfully in the last 20 years.</p>
<p>These are issues the storage industry knows too well. But Mr. Hamilton is more concerned about the waste the current high-cost industry structure causes.</p>
<p>Waste?</p>
<p><strong>Workload placement</strong><br />
The cost of network bandwidth leads to network over-subscription. Networks are configured as tree topologies: the further you move from end nodes the worse the over-subscription.</p>
<p>As described in the 2009 Microsoft Research paper <a href="http://research.microsoft.com/pubs/80693/vl2-sigcomm09-final.pdf" target="_blank">VL2: A Scalable and Flexible Data Center Network</a>:</p>
<blockquote><p>
. . . the capacity between different branches of the tree is typically over-subscribed by factors of 1:5 or more, with paths through the highest levels of the tree oversubscribed by factors of 1:80 to 1:240. This limits communication between servers to the point that it fragments the server pool — congestion and computation hot-spots are prevalent even when spare capacity is available elsewhere.
</p></blockquote>
<p>This throttles data center performance by limiting server-to-server bandwidth, fragmenting resources and reducing network utilization. The latter reflects the redundant paths needed in case of switch failure: ≈50% or more of costly data center bandwidth goes unused.</p>
<p>As might be expected, big Internet data centers like Amazon&#8217;s have complex and unpredictable workloads. They need lots of bandwidth between all servers all the time.</p>
<p><strong>A solution</strong><br />
The VL2 paper describes an experimental solution to these problems that includes <i>location-specific</i> and <i>application-specific</i> addressing, multi-path traffic load balancing and a novel directory design that efficiently handles lookups and updates to network mappings.</p>
<p>In a 75-node test cluster the design moved 2.75TB of data in 395 seconds &#8211; 94% of maximum network bandwidth &#8211; at a fraction of the cost of current enterprise networks. The paper calculates that a cloud-service scale network with no over-subscription could be built with commodity switches at <strong>1/14th the cost</strong> of a traditional data center Ethernet.</p>
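<p>To put those figures in more familiar units, here is a back-of-the-envelope conversion. It assumes decimal terabytes (10**12 bytes), which the post does not state, and the function name is purely illustrative:</p>

```python
# Figures quoted in the post.
DATA_TB = 2.75
SECONDS = 395
NODES = 75


def aggregate_gbps(terabytes: float, seconds: float) -> float:
    """Average aggregate throughput in Gbit/s, assuming decimal TB (10**12 bytes)."""
    bits = terabytes * 1e12 * 8
    return bits / seconds / 1e9


total = aggregate_gbps(DATA_TB, SECONDS)
per_node = total / NODES
print(f"aggregate: {total:.1f} Gbit/s, per node: {per_node:.2f} Gbit/s")
```

<p>That works out to roughly 56 Gbit/s across the cluster, or about three quarters of a Gbit/s sustained per server.</p>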
<p>Whoa!</p>
<p><strong>The StorageMojo take</strong><br />
VC and engineering dollars follow high-growth markets. What Google, Amazon and Microsoft want, they get. With the rapid growth of public cloud services the network over-subscription problem will get solved. </p>
<p>Merchant silicon from Broadcom, Intel and Marvell is making a tried-and-true Moore&#8217;s Law attack on hardware cost. The protocol stack is tougher, but several open-source industry initiatives are under way with support from major companies. Progress will be slower than hoped, but within 3 years we&#8217;ll have a viable stack to build on.</p>
<p>Where does this leave the networking industry? That depends on where you sit.</p>
<p>Cisco will be the biggest loser, because they&#8217;ve been the biggest winner with the current model. They may need to pull an IBM and move big into services if they want to stick around. Ironically, Cisco&#8217;s UCS product line &#8211; which bakes in the tree-structured network &#8211; has further motivated broader industry action.</p>
<p>The rest of the industry can go after this emerging market with a lower-GM business model. Not all of them will, but it will be a critical success factor. </p>
<p>The big winner will be storage. Scale-out storage relies on spraying data across multiple racks for maximum availability, utilization and performance. Cheaper, faster, better scale-out networks will only drive storage demand.</p>
<p>For most of us this is an academic problem today. Lightly used systems &#8211; such as for backup and archiving &#8211; don&#8217;t see Amazon&#8217;s problems. But in 5 years this will be common even outside the public cloud providers.</p>
<p>Just as IT users have benefited from Google&#8217;s push on energy efficiency and much more, they will also benefit from much lower cost and more scalable networks.</p>
<p><strong>Courteous comments welcome, of course.</strong> I can&#8217;t help but continue to marvel at how dumb Cisco&#8217;s UCS has turned out to be. It&#8217;s a gift that keeps on giving.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2011/10/20/the-network-is-choking-our-storage/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>RAMCloud is the new flash</title>
		<link>http://storagemojo.com/2011/10/05/ramcloud-is-the-new-flash/</link>
		<comments>http://storagemojo.com/2011/10/05/ramcloud-is-the-new-flash/#comments</comments>
		<pubDate>Thu, 06 Oct 2011 01:03:30 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[SSD/Flash Disk]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2529</guid>
		<description><![CDATA[Sometimes in the midst of the endless tweaking needed to maximize storage performance one just wants to say &#8220;screw it! Put everything in RAM!&#8221; And that&#8217;s just what RAMCloud does. Disk is the new tape, flash the new disk, DRAM the new flash. RAMCloud is a research paper (pdf) and an open software project. The [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Sometimes in the midst of the endless tweaking needed to maximize storage performance one just wants to say &#8220;screw it! Put everything in RAM!&#8221; And that&#8217;s just what RAMCloud does.</p>
<p><strong>Disk is the new tape, flash the new disk, DRAM the new flash.</strong><br />
RAMCloud is a <a href="http://www.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf" target="_blank">research paper</a> (pdf) and an <a href="http://fiz.stanford.edu:8081/display/ramcloud/Home" target="_blank">open software project</a>. The goal is enterprise-class availability with every bit of active data stored in DRAM, not disk or flash, for maximum performance. It is a key-value object store today, though as pure software that could change.</p>
<p>It&#8217;s the brainchild of John Ousterhout, a Stanford prof who invented Tcl back in the 80s at Berkeley. </p>
<p><strong>Isn&#8217;t DRAM volatile and costly?</strong><br />
Right on both counts, grasshopper, so RAMCloud isn&#8217;t a 1 for 1 disk-style architecture. No Google FS-style triple replication here, or RAID-style erasure coding.</p>
<p>Instead RAMCloud uses <i>buffered logging</i>:</p>
<blockquote><p>
. . . a single copy of each object is stored in DRAM of a primary server and copies are kept on the disks of two or more backup servers; each server acts as both primary and backup. However, the disk copies are not updated synchronously during write operations. Instead, the primary server updates its DRAM and forwards log entries to the backup servers, where they are stored temporarily in DRAM.
</p></blockquote>
<p>Instead of working around crashes &#8211; using multiple object copies as scale-out storage does &#8211; RAMCloud rebuilds the lost data at high speed from the backups&#8217; DRAM logs or disk drives. That&#8217;s possible because all the log data is in DRAM or spread across many disks.</p>
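<p>The buffered-logging idea can be sketched as a toy model. This is an illustration under loud assumptions, not RAMCloud&#8217;s code: the class and function names are hypothetical, each backup receives the full log (the real system scatters log data across many backups), and the flush to disk is called by hand rather than running asynchronously.</p>

```python
class Backup:
    """Toy backup server: buffers log entries in memory, flushes to 'disk' later."""

    def __init__(self):
        self.dram_buffer = []  # log entries held temporarily in DRAM
        self.disk = []         # stand-in for the backup's disk

    def append(self, entry):
        self.dram_buffer.append(entry)

    def flush(self):
        # Asynchronous in a real system; explicit here to keep the sketch simple.
        self.disk.extend(self.dram_buffer)
        self.dram_buffer.clear()

    def log(self):
        # Entries in write order: flushed ones first, then those still buffered.
        return self.disk + self.dram_buffer


class Primary:
    """Toy primary: the single DRAM copy of each object lives here."""

    def __init__(self, backups):
        self.dram = {}
        self.backups = backups

    def write(self, key, value):
        self.dram[key] = value            # update DRAM first
        for backup in self.backups:       # then forward the log entry
            backup.append((key, value))


def recover(backup):
    """Rebuild a crashed primary's data by replaying one backup's log."""
    data = {}
    for key, value in backup.log():
        data[key] = value                 # later entries overwrite earlier ones
    return data


backups = [Backup(), Backup()]
primary = Primary(backups)
primary.write("a", 1)
backups[0].flush()                        # one entry reaches disk
primary.write("a", 2)
primary.write("b", 3)
recovered = recover(backups[0])           # pretend the primary just crashed
print(recovered)
```

<p>Replaying the log in write order is what lets a single in-memory copy stand in for full replication: the newest entry for each key wins.</p>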
<p>In a recent paper (<a href="http://www.stanford.edu/~ouster/cgi-bin/papers/ramcloud-recovery.pdf" target="_blank">Fast Crash Recovery in RAMCloud</a>) (pdf) Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum (co-founder of VMware) go into more detail on this critical feature.</p>
<p>The key elements are:</p>
<ul>
<li><strong>Scale.</strong> Servers scatter their backup data across all other servers so thousands of disks can serve the recovery.</li>
<li><strong>Log-structure.</strong> Reduces complexity and offers high performance.</li>
<li><strong>Randomization.</strong> Many decisions need to be made in a large cluster. Randomized choices replace deterministic coordination that would consume CPU, time and bandwidth, so decisions are made faster and with less overhead.</li>
<li><strong>Dynamic tablets.</strong> The key-value store tracks resource usage within a single table and ensures that no single partition is too large for fast restores.</li>
</ul>
<p>DRAM is volatile so the log replication data is spread to other servers on other racks for redundancy before being committed to disk. Still, total system write throughput is limited by the disk write speed, whose limits are a key reason people are moving from disks. Flash drives may help, but other techniques, such as log truncation and sharding, make it possible to get good performance from several thousand SATA drives.</p>
<p>How good? The team reports that in a 60 node cluster they recover 35GB in 1.6 seconds. With more nodes larger partitions should be restored even faster. Scale is good.</p>
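<p>A quick calculation shows what that implies per server. Only the three quoted figures come from the post; the even split across nodes is an assumption made for illustration:</p>

```python
# Figures quoted in the post.
RECOVERED_GB = 35
SECONDS = 1.6
NODES = 60

# Aggregate recovery rate for the whole cluster.
cluster_rate = RECOVERED_GB / SECONDS          # GB per second

# Assumes the work is spread evenly across all nodes.
per_node_mb = cluster_rate / NODES * 1000      # MB per second per node

print(f"cluster: {cluster_rate:.1f} GB/s, per node: {per_node_mb:.0f} MB/s")
```

<p>Each node only has to contribute a few hundred MB/s, which is why spreading recovery across more nodes helps.</p>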
<p><strong>Lights out!</strong><br />
Power failures wipe all the data in DRAM. The obvious defense is to avoid failures: combine battery backup with diesel generator sets. Power ride-through will handle interruptions into the hundreds of milliseconds.</p>
<p>But who is going to trust that? That&#8217;s why future commercial implementations will insist on logging to stable storage, such as flash SSDs.</p>
<p>They&#8217;re getting cheaper fast &#8211; faster than DRAM &#8211; which will make this a common approach. </p>
<p><strong>Cost</strong><br />
Professor Ousterhout kindly sent a short note about cost, correctly noting that</p>
<blockquote><p>
. . . if you measure cost/operation, DRAM is roughly 100x cheaper than disk, since a disk can only perform about 100-200 operations/second.  This is why RAMCloud makes sense for data-intensive applications. . . .
</p></blockquote>
<p>While you and I might find that persuasive, too many enterprises don&#8217;t. The deep conservatism of the storage culture &#8211; both figuratively and literally &#8211; makes cost a good excuse to stay with the tried and true, and easy to explain to CFOs. </p>
<p>The good news for the company I hope he is starting is that the primacy of $/GB is slowly eroding as customers see the system level savings from fast storage. SSD vendors and companies like TMS and Kaminario are breaking trail for RAMCloud.</p>
<p><strong>The StorageMojo take</strong><br />
Make no mistake: RAMCloud is a research project, not a commercial product, years and million$ away from commercial application. But the concept is promising.</p>
<p>Imagine a world where data layout doesn&#8217;t matter, where apps are optimized for sub-millisecond storage, where 100 byte I/Os are faster and just as efficient as 8KB I/Os. The architectural implications are huge and would take a decade or more to absorb.</p>
<p>RAMCloud raises the thorny issue of tiering: getting hot data on the hot storage and everything else off to disk. There are OK answers for tiering but nothing insanely great. </p>
<p>RAMCloud shows we&#8217;re far from the end of the line in what storage can do. Faster, better, arguably cheaper: 2 out of 3 ain&#8217;t bad.</p>
<p><strong>Courteous comments welcome, of course.</strong> A shorter version of this post appeared on <a href="http://www.zdnet.com/blog/storage/ramcloud-puts-everything-in-dram/1546" target="_blank">ZDNet</a>.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2011/10/05/ramcloud-is-the-new-flash/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>NoSQL in the metadata engine room</title>
		<link>http://storagemojo.com/2011/10/03/nosql-in-the-metadata-engine-room/</link>
		<comments>http://storagemojo.com/2011/10/03/nosql-in-the-metadata-engine-room/#comments</comments>
		<pubDate>Mon, 03 Oct 2011 18:59:44 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Future Tech]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2525</guid>
		<description><![CDATA[One more datapoint and we&#8217;ll have a trend: NoSQL databases managing metadata. It&#8217;s obvious in retrospect: use a scalable big data tool to handle scale-out metadata. Maybe not a requirement today, but surely will be with even bigger data tomorrow. Metadata is a fraction of the user data set, but it gets hammered much more. [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>One more datapoint and we&#8217;ll have a trend: NoSQL databases managing metadata. It&#8217;s obvious in retrospect: use a scalable big data tool to handle scale-out metadata. Maybe not a requirement today, but surely will be with even bigger data tomorrow.</p>
<p>Metadata is a fraction of the user data set, but it gets hammered much more. As more metadata is found useful the hammering will get more insistent.</p>
<p><strong>Nutanix</strong><br />
<a href="http://www.nutanix.com/" target="_blank">Nutanix</a>, whose CTO and co-founder, Mohit Aron, was a developer of the Google File System, uses MapReduce. Nutanix achieves its scale due to its distributed-metadata, masterless architecture &#8211; powered by MapReduce jobs that run in the background.</p>
<p><strong>Druva</strong><br />
<a href="http://www.druva.com/" target="_blank">Druva</a>, a backup company for mobile devices, also uses a NoSQL database to manage storage metadata. They say they&#8217;ve found that NoSQL scales over an order of magnitude better than relational in similar applications.</p>
<p><strong>Somebody else</strong><br />
A company that shall remain nameless is porting Hadoop to their backend. The customer won&#8217;t be able to access Hadoop for their work &#8211; it is strictly for the system&#8217;s internal use.</p>
<p>It is a proof of concept so it isn&#8217;t a 3rd data point, but they see the potential advantages. Call it data point 2½. </p>
<p><strong>The StorageMojo take</strong><br />
Small advances are the building blocks of disruption. RAID made it possible to build available storage using cheap disks. Consumer adoption of PCs made disks even cheaper. Moore&#8217;s Law made RAID controllers cheaper and faster, or faster and more capable. </p>
<p>A virtuous circle of disruption.</p>
<p>The basic architecture of scale-out storage systems &#8211; purpose-built software on clustered commodity hardware &#8211; has been stable. But this is the beginning of scale-out storage 2.0: taking scale-out technology developed for users and incorporating it into the storage infrastructure itself.</p>
<p>These ideas are bubbling up among the latest startups and among the establishment players. At some point the old RAID architectures will be well and truly broken, able to compete in smaller and smaller niches until the revenue can&#8217;t justify more investment. </p>
<p>Of course vendors have been making RAID controllers out of servers for years now, and those servers can run any software they want. But at some point the explicit and implicit assumptions in the old architecture crash into current realities &#8211; either in cost, development time, feature completeness or management overhead &#8211; and then we move on.</p>
<p><strong>Courteous comments welcome, of course.</strong> I learned about Nutanix at the last <a href="http://techfieldday.com/" target="_blank">Tech Field Day</a> &#8220;The Independent IT Influencer Event&#8221; which paid for my travel expenses to Silicon Valley.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2011/10/03/nosql-in-the-metadata-engine-room/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Storage @VMworld 2011</title>
		<link>http://storagemojo.com/2011/09/12/storage-vmworld-2011/</link>
		<comments>http://storagemojo.com/2011/09/12/storage-vmworld-2011/#comments</comments>
		<pubDate>Mon, 12 Sep 2011 16:53:32 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Enterprise]]></category>
		<category><![CDATA[SSD/Flash Disk]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2519</guid>
		<description><![CDATA[VMworld is the best storage show I&#8217;ve seen in years. VMware&#8217;s severe storage problems leave users hungry for solutions &#8211; and your friendly neighborhood storage industry is happy to oblige. It&#8217;s almost as if VMware were owned by a storage company. Flash everywhere Fusion-io, Nimble Storage, Nimbus Data, Avere, Pure and more were talking about [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>VMworld is the best storage show I&#8217;ve seen in years. VMware&#8217;s severe storage problems leave users hungry for solutions &#8211; and your friendly neighborhood storage industry is happy to oblige.</p>
<p>It&#8217;s almost as if VMware were owned by a storage company.</p>
<p><strong>Flash everywhere</strong><br />
<a href="http://www.fusionio.com/" target="_blank">Fusion-io</a>, <a href="http://www.nimblestorage.com/" target="_blank">Nimble Storage</a>, <a href="http://www.nimbusdata.com/" target="_blank">Nimbus Data</a>, <a href="http://www.averesystems.com/" target="_blank">Avere</a>, <a href="http://www.purestorage.com/" target="_blank">Pure</a> and more were talking about how well flash supports VMware. Fixes VDI boot storms, deduped VMDKs, I/O bound servers and much more.</p>
<p><strong>Pure Storage</strong><br />
Here is <a href="http://www.purestorage.com/" target="_blank">Pure&#8217;s</a> Matt Kixmoeller giving a nifty demo in this 50 second video:</p>
<p><object width="500" height="306"><param name="movie" value="http://www.youtube.com/v/7_7ps2ci8tk?version=3"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/7_7ps2ci8tk?version=3" type="application/x-shockwave-flash" width="500" height="306" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
<p>Not exactly sure what those thousand VMs were doing. Maybe Pure will comment.</p>
<p><strong>Falconstor</strong><br />
I lost track of <a href="http://www.falconstor.com/" target="_blank">Falconstor</a> due to their OEM focus and sprawling product line. New CEO Jim McNiel has refocused the company &#8211; with the help of former Cheyenne teammates &#8211; on backup, business continuity/DR, dedup and virtualization.</p>
<p>Their clustered Network Storage Server turns all of Fstor&#8217;s products into tin-wrapped software suitable for channel partners. Takeaway: forget what you knew about them; they are a new company.</p>
<p><strong><a href="http://www.virsto.com/" target="_blank">Virsto</a></strong><br />
While the release of their storage hypervisor for VMware makes them seem like a new company, Virsto has been shipping product for over a year, but on Hyper-V, not VMware. Microsoft lost interest in server virtualization and Virsto moved on.</p>
<p>Their product is a virtual appliance that:</p>
<blockquote><p>
. . . runs in each host, creating a transparent virtual storage layer that is thin provisioned, fully cluster-aware, supports very rapid snapshot and clone creation, and scales to support tens of thousands of high performance snapshots and clones.</p>
<p>Virsto . . . decouple[s] application performance from any dependence on the rotational latencies and seek times of underlying disk associated with random writes. All random writes are sequentialized and written directly to a transparent logging device . . . and then asynchronously de-staged to primary storage. . . .
</p></blockquote>
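<p>The write path in that description can be sketched as a toy model. This is not Virsto&#8217;s implementation: the <code>WriteLog</code> class and its methods are hypothetical, in-memory structures stand in for the logging device and primary storage, and de-staging is triggered by hand rather than asynchronously.</p>

```python
class WriteLog:
    """Toy model: random writes become sequential appends, de-staged later."""

    def __init__(self):
        self.log = []      # append-only log of (block address, data)
        self.primary = {}  # stand-in for primary storage, keyed by block address

    def write(self, block, data):
        # Whatever the target address, the write is a sequential append.
        self.log.append((block, data))

    def read(self, block):
        # The newest logged copy wins; otherwise fall back to primary storage.
        for logged_block, data in reversed(self.log):
            if logged_block == block:
                return data
        return self.primary.get(block)

    def destage(self):
        # Apply logged writes to primary storage in order, then empty the log.
        for block, data in self.log:
            self.primary[block] = data
        count = len(self.log)
        self.log.clear()
        return count


store = WriteLog()
store.write(907, "x")   # scattered block addresses
store.write(12, "y")
store.write(907, "z")   # overwrite of an earlier block
latest = store.read(907)
destaged = store.destage()
print(latest, destaged, store.primary)
```

<p>The application sees its write complete after the append, so seek times on the underlying disks only matter during the background de-stage.</p>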
<p>Net/net: high performance virtual storage regardless of underlying physical storage. Virsto offers a free trial &#8211; if you try it, let me know how it works.</p>
<p><strong>But wait! There&#8217;s more!</strong><br />
Cloud-related products from <a href="http://www.storsimple.com/" target="_blank">StorSimple</a>, <a href="http://amax.com/default.asp" target="_blank">AMAX</a> and <a href="http://raidundant.com/v2/" target="_blank">Raidundant</a> continue to pick at the problem of how/when/where cloud integrates with the enterprise.</p>
<p><strong>The StorageMojo take</strong><br />
Many cool products and ideas. The storage problems of many virtual machines are not unlike those of earlier time-shared virtual memory systems, but the scale is much greater. </p>
<p>And when the scale is greater the problem is fundamentally different. As virtualization grows we&#8217;ll need to see more creative answers beyond deduplication and flash.</p>
<p><strong>Courteous comments welcome, of course.</strong> Message to SNIA: storage networking is passé. Time to retool for the world of virtual machines, NoSQL databases, scale-out storage and flash-enabled architectures. A new name would be a start.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2011/09/12/storage-vmworld-2011/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Open source storage array</title>
		<link>http://storagemojo.com/2011/07/20/open-source-storage-array/</link>
		<comments>http://storagemojo.com/2011/07/20/open-source-storage-array/#comments</comments>
		<pubDate>Thu, 21 Jul 2011 00:37:44 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2458</guid>
		<description><![CDATA[Most business files are only opened a few times, yet remain valuable enough to keep on line, just in case. That cold data is normally stored on high-performance, high-price NAS boxes at $$/GB. Why? 2 years ago Backblaze, an online backup provider, open-sourced their storage pod design: 45 drives in a box (see Build a [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Most business files are only opened a few times, yet remain valuable enough to keep on line, just in case. That cold data is normally stored on high-performance, high-price NAS boxes at $$/GB.</p>
<p>Why?</p>
<p>2 years ago <a href="http://www.backblaze.com/" target="_blank">Backblaze</a>, an online backup provider, open-sourced their storage pod design: 45 drives in a box (see <a href="http://www.zdnet.com/blog/storage/build-a-raid-6-array-for-100tb/603" target="_blank">Build a RAID 6 array for $100/TB</a>). Now they&#8217;re back with v2: 45 3TB drives in a box with higher performance.</p>
<p>Backblaze now has over 16PB of storage pods in production.<br />
<a href="http://storagemojo.com/wp-content/uploads//2011/07/backblaze_computer_room.jpg"><img src="http://storagemojo.com/wp-content/uploads//2011/07/backblaze_computer_room.jpg" alt="" title="backblaze_computer_room" width="470" height="337" class="aligncenter size-full wp-image-2460" /></a><br />
<strong>Now for the good news</strong><br />
Backblaze isn&#8217;t in the box building business. They designed the storage pod for their backup business and released the plans out of the goodness of their hearts and for the free publicity.</p>
<p>I&#8217;ve thought that this could be a viable business for someone who <i>doesn&#8217;t</i> want to be the next NetApp or Isilon. Someone happy to build and ship boxes on a cost-plus basis to people who understand and can support a fault-tolerant software layer above the box, but who don&#8217;t have time to chase down miscellaneous hardware from vendors who prefer to sell in bulk.</p>
<p>That vendor has emerged: <a href="http://protocase.com/products/index.php?e=Backblaze" target="_blank">Protocase</a>, the quick-turn enclosure shop that builds Backblaze&#8217;s enclosures.</p>
<p>I spoke to Protocase co-founder Doug Milburn &#8211; a PhD in mechanical engineering &#8211; today. Protocase will announce a complete just-add-drives storage pod: an assembled, tested and software-loaded box. Look for it in 2-4 weeks, priced at ≈$6k. With another $5500 for 45 3TB drives, it will come in at less than $90 per raw TB.</p>
<p>Why no drives? That&#8217;s the lion&#8217;s share of the cost and also the fastest to decline in price. Protocase doesn&#8217;t need the inventory exposure, and tech-savvy shoppers can probably do better anyway. BTW, Backblaze has had good experience with the Hitachi HDS5C3030ALA630 drive.</p>
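<p>The cost-per-TB figure can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in this post (a pod at about $6k, $5500 for drives, 45 drives of 3TB each); the function and variable names are my own and the pod price is approximate.</p>

```python
# Figures quoted in the post; the pod price is approximate.
POD_PRICE_USD = 6000        # just-add-drives pod
DRIVES_COST_USD = 5500      # drives bought separately
DRIVE_COUNT = 45
DRIVE_CAPACITY_TB = 3


def cost_per_raw_tb(pod_price, drives_cost, drive_count, drive_tb):
    """Total hardware cost divided by raw (pre-RAID) capacity in TB."""
    raw_tb = drive_count * drive_tb
    return (pod_price + drives_cost) / raw_tb


raw_capacity_tb = DRIVE_COUNT * DRIVE_CAPACITY_TB
usd_per_tb = cost_per_raw_tb(
    POD_PRICE_USD, DRIVES_COST_USD, DRIVE_COUNT, DRIVE_CAPACITY_TB
)
print(f"{raw_capacity_tb} TB raw at ${usd_per_tb:.2f}/TB")
```

<p>Usable capacity after RAID and file system overhead will be lower than the raw figure, so the cost per usable TB is somewhat higher.</p>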
<p><strong>The StorageMojo take</strong><br />
This will help energize the private cloud market by reducing the entry price. Amazon and Google don&#8217;t use NetApp or EMC. Why should you?</p>
<p>And the savings over renting cloud storage can be substantial as this Backblaze chart suggests:<br />
<a href="http://storagemojo.com/wp-content/uploads//2011/07/backblaze_pb_cost.jpg"><img src="http://storagemojo.com/wp-content/uploads//2011/07/backblaze_pb_cost.jpg" alt="" title="backblaze_pb_cost" width="470" height="368" class="aligncenter size-full wp-image-2461" /></a><br />
True, Amazon provides many more services, but if you need petabytes for mini-bucks, this is hard to beat.</p>
<p><strong>Courteous comments welcome, of course.</strong> Read about the v2 storage pod in the Backblaze <a href="http://blog.backblaze.com/2011/07/20/petabytes-on-a-budget-v2-0revealing-more-secrets" target="_blank">blog post</a>. Or get the shorter version in my ZDnet post <a href="http://www.zdnet.com/blog/storage/build-a-135tb-array-for-7384/1453" target="_blank">&#8220;Build a 135TB array for $7,384&#8221;</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2011/07/20/open-source-storage-array/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
		</item>
		<item>
		<title>Amazon&#8217;s EBS outage</title>
		<link>http://storagemojo.com/2011/04/29/amazons-ebs-outage/</link>
		<comments>http://storagemojo.com/2011/04/29/amazons-ebs-outage/#comments</comments>
		<pubDate>Fri, 29 Apr 2011 17:26:37 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2360</guid>
		<description><![CDATA[Amazon&#8217;s outage was caused by a failure of the underlying storage &#8211; the Elastic Block Store. Here&#8217;s what they learned. EBS The Elastic Block Store (EBS) is a distributed and replicated storage optimized for consistent and low latency I/O from EC2 instances. EBS runs on clusters that store data and serve requests and a set [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Amazon&#8217;s outage was caused by a failure of the underlying storage &#8211; the Elastic Block Store. Here&#8217;s what they learned.</p>
<p><strong>EBS</strong><br />
The Elastic Block Store (EBS) is a distributed, replicated block store optimized for consistent, low-latency I/O from EC2 instances. EBS consists of clusters that store data and serve requests, plus a set of control services that coordinate and propagate I/Os.</p>
<p>Each EBS cluster consists of EBS nodes where data is replicated and I/Os are served. Nodes are connected by 2 networks: a primary high-bandwidth network for traffic between the EBS nodes and EC2 server instances; and a slower replication network intended as a backup and for reliable internode communication.</p>
<p>Newly written data is replicated ASAP. An EBS node searches the cluster for a node with enough capacity, connects to it and replicates the data, usually in milliseconds.</p>
<p>If connectivity to a node it is replicating to is lost, the node assumes the other node failed and tries to find another node to hold the replica. In the meantime it holds onto all data until it can confirm the data is replicated.</p>
<p><strong>The outage</strong><br />
During a network change on April 21 to upgrade primary network capacity, a mistake occurred: the primary network data traffic was shifted to the slower secondary network.</p>
<p>The secondary network couldn&#8217;t handle the traffic, which isolated many nodes in the cluster. Having lost contact with the nodes they were replicating to, the remaining EBS nodes sought new nodes, but the few remaining nodes were quickly overwhelmed in a retry storm.</p>
<p>The now-degraded secondary network then slammed the coordinating control services. Because they were configured with a long timeout, the retry requests backed up and the control services suffered thread starvation.</p>
<p>Once a large number of I/O requests were backed up, the control services had no ability to service I/O requests and began to fail I/O requests from other Amazon availability zones. Within two hours the Amazon team had identified this issue and disabled all new <code>create volume</code> requests in the cluster.</p>
<p>But then another bug kicked in.</p>
<p>A <a href="http://en.wikipedia.org/wiki/Race_condition" target="_blank">race condition</a> in EBS caused nodes to fail when closing a large number of replication requests. Because there were so many replication requests, the race condition caused even more EBS nodes to fail, re-creating the need to replicate even more data, and again the control services were overwhelmed.</p>
<p><strong>Recovery</strong><br />
The Amazon team got control of the replication storms in about 12 hours. Then the problem was recovering customer data.</p>
<p>Amazon optimizes its systems to protect customer data. When a node fails it is not reused until its data is replicated.</p>
<p>But since so many nodes had failed, the only way to ensure no customer data was lost was by adding more physical capacity &#8211; no easy chore &#8211; but that wasn&#8217;t all.</p>
<p>The replication mechanisms had been throttled to control the storm, so adding physical capacity also meant delicate management of the many queued replication requests. It took the team 2 days to implement a process.</p>
<p><strong>Amazon Relational Database Service</strong><br />
The Amazon Relational Database Service (RDS) uses EBS for database and log storage. RDS can be configured to operate within a single Amazon zone or replicated across multiple zones. Customers with a single-zone RDS were quite likely to be affected, but 2.5% of multi-zone RDS customers were affected as well due to another bug.</p>
<p><strong>Lessons learned</strong><br />
The network upgrade process will be further automated to prevent a similar mistake. But the more important issue is to keep a cluster from entering a replication storm. One factor is to increase the amount of free capacity in each EBS cluster.</p>
<p>Retry logic will change as well: nodes will back off sooner and focus on reestablishing connections before retrying further. And of course, the race condition bug will be fixed.</p>
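<p>For readers who have not met the idea, here is a minimal sketch of retry backoff. It shows exponential backoff with random jitter, a common general technique; it is not Amazon&#8217;s code, and every name and default value below (<code>backoff_delays</code>, <code>find_replica_target</code>, the 0.1 second base, the 30 second cap) is an assumption made for illustration.</p>

```python
import random


def backoff_delays(base_s=0.1, cap_s=30.0, attempts=6, rng=None):
    """Return one randomized wait per retry round (exponential backoff, full jitter)."""
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles each round until it reaches cap_s.
        ceiling = min(cap_s, base_s * (2 ** attempt))
        # Picking a random wait below the ceiling keeps nodes from retrying in lockstep.
        delays.append(rng.uniform(0.0, ceiling))
    return delays


def find_replica_target(try_connect, candidates, sleep, delays):
    """Try every candidate once per round, waiting between rounds instead of hammering."""
    for delay in delays:
        for node in candidates:
            if try_connect(node):
                return node
        sleep(delay)
    return None  # caller decides what to do when every round failed
```

<p>Passing <code>time.sleep</code> as <code>sleep</code> gives real waits; passing a stub makes the logic easy to test. The point of the growing, randomized delay is that a cluster full of nodes searching for replica targets spreads its requests out rather than producing a retry storm.</p>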
<p>Finally, Amazon has learned it must improve the isolation between zones. They will tune timeout logic to prevent thread exhaustion, increase control services awareness of zone loads and, finally, move more control services into each EBS cluster.</p>
<p><strong>The StorageMojo take</strong><br />
Data center opponents of cloud computing will point with alarm to this incident to make the case that they are still needed. But they forget that today&#8217;s enterprise gear is reliable only because of the many failures that led to better error handling.</p>
<p>While painful for the affected, the Amazon team&#8217;s response shows a level of openness and transparency that few enterprise infrastructure vendors ever display. Of course, that is due to the public nature of these large cloud failures; nevertheless the outcome is commendable.</p>
<p>But the battle is not only between large public clouds and private enterprise infrastructures, but between architectures. Traditionally, enterprise infrastructures have focused on increasing MTBF. Cloud architectures, on the other hand, have focused on fast MTTR &#8211; Mean Time To Repair.</p>
<p>What can be scaled up can also be scaled down. Not every application is suitable for public cloud hosting. But small-scale, commodity-based, self-managing infrastructures are very doable. They are the bigger threat to the large proprietary hardware vendors of today.</p>
<p><strong>Courteous comments welcome, of course.</strong> I speculated in <a href="http://www.zdnet.com/blog/storage/amazons-experience-fault-tolerance-and-fault-finding/1354" target="_blank">Amazon&#8217;s experience: fault tolerance and fault finding</a> about the cause of the failure, but I was wrong. A failure precipitated by a network upgrade? Way-y-y too simple.</p>
]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2011/04/29/amazons-ebs-outage/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Google&#8217;s Megastore</title>
		<link>http://storagemojo.com/2011/04/20/googles-megastore/</link>
		<comments>http://storagemojo.com/2011/04/20/googles-megastore/#comments</comments>
		<pubDate>Wed, 20 Apr 2011 16:50:29 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2349</guid>
		<description><![CDATA[Megastore handles over 3 billion writes and 20 billion reads daily on almost 8 PB of primary data across many global data centers. In a paper by Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Léon, Yawei Li, Alexander Lloyd, Vadim Yushprakh titled Megastore: Providing Scalable, Highly Available Storage [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Megastore handles over 3 billion writes and 20 billion reads daily on almost 8 PB of primary data across many global data centers. </p>
<p>In a paper by Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Léon, Yawei Li, Alexander Lloyd and Vadim Yushprakh, titled <a href="http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf" target="_blank">Megastore: Providing Scalable, Highly Available Storage for Interactive Services</a>, Google engineers describe how it works. From the abstract:</p>
<blockquote><p>
Megastore is a storage system developed to meet the requirements of today&#8217;s interactive online services. Megastore blends the scalability of a NoSQL data store with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high-availability. We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between data centers.
</p></blockquote>
<p><strong>The mission</strong><br />
Support Internet apps such as Google&#8217;s AppEngine. </p>
<ul>
<li>Scale to millions of users</li>
<li>Responsive despite Internet latencies to impatient users</li>
<li>Easy for developers</li>
<li>Fault resilience from drive failures to data center loss and everything in between</li>
<li>Low-latency synchronous replication to distant sites</li>
</ul>
<p><strong>The how</strong><br />
Scale by partitioning the data store and replicating each partition separately, providing full ACID semantics within partitions but limited consistency guarantees across them. Offer some traditional database features if they scale with tolerable latency.</p>
<p>The key assumptions are that data for many apps can be partitioned, for example by user, and that a selected set of DB features can make developers productive.</p>
<p><strong>Availability and scale</strong><br />
To achieve availability and global scale the designers implemented two key architectural features:</p>
<ul>
<li>For availability, a synchronous, fault-tolerant log replicator optimized for long-distance links</li>
<li>For scale, data partitioned into small databases each with its own replicated log</li>
</ul>
<p>Rather than implement a master/slave or optimistic replication strategy, the team decided to use Paxos, a consensus algorithm that does not require a master, with a novel extension. A single Paxos log would soon become a bottleneck with millions of users so each partition gets its own replicated Paxos log.</p>
<p>Data is partitioned into entity groups which are synchronously replicated over a wide area while the data itself is stored in NoSQL storage. ACID transaction records within the entities are replicated using Paxos.</p>
<p>For transactions across entities, the synchronous replication requirement is relaxed and an asynchronous message queue is used. Thus it&#8217;s key that entity group boundaries reflect application usage and user expectations.</p>
<p><strong>Entities</strong><br />
An e-mail account is a natural entity. But defining other entities is more complex.</p>
<p>Geographic data lacks natural granularity. For example, the globe is divided into non-overlapping entities. Changes across these geographic entities use (expensive) two-phase commits.</p>
<p>The design problem: entities large enough to make two-phase commits uncommon but small enough to keep transaction rates low.</p>
<p>Each entity has a root table and may have child tables. Each child table has a single root table. Example: a user&#8217;s root table may have each of the user&#8217;s photo collections as a child. Most applications find natural entity group boundaries.</p>
<p><strong>API</strong><br />
The insight driving the API is that the big win is scalable performance rather than a rich query language. Thus a focus on controlling physical locality and hierarchical layouts.</p>
<p>For example, joins are implemented in application code. Queries specify scans or lookups against particular tables and indexes. Therefore, the application needs to understand the data schema to perform well.</p>
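<p>To make &#8220;joins in application code&#8221; concrete, here is a minimal sketch using the user-and-photos example from above. The dictionaries are hypothetical in-memory stand-ins for a root table and a child table, and the function names are mine; none of this is the Megastore API.</p>

```python
# Hypothetical stand-in for a root table, keyed by user id.
users = {
    101: {"name": "alice"},
    102: {"name": "bob"},
}

# Hypothetical child table. Keys are (user_id, photo_id), so one user's rows
# sort next to each other, which is the locality the API is designed around.
photos = {
    (101, 1): {"tag": "beach"},
    (101, 2): {"tag": "dinner"},
    (102, 1): {"tag": "hiking"},
}


def photos_for_user(user_id):
    """Scan the child table, keeping only rows that belong to one parent key."""
    return [
        (key[1], row)
        for key, row in sorted(photos.items())
        if key[0] == user_id
    ]


def join_user_photos(user_id):
    """Application-side join: one root lookup plus one child scan."""
    user = users[user_id]
    return [
        (user["name"], photo_id, row["tag"])
        for photo_id, row in photos_for_user(user_id)
    ]


print(join_user_photos(101))
```

<p>The application has to know that photos are keyed under the user in order to write <code>photos_for_user</code> at all, which is the sense in which it must understand the schema.</p>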
<p><strong>Replication</strong><br />
Megastore uses Paxos to manage synchronous replication. But in order to make Paxos practical despite high latencies the team developed some optimizations:</p>
<ul>
<li><strong>Fast reads.</strong> Current reads are usually from local replicas since most writes succeed on all replicas.</li>
<li><strong>Fast writes.</strong> Since most apps repeatedly write from the same region, the initial writer is granted priority for further replica writes. Using local replicas and reducing write contention for distant replicas minimizes latency.</li>
<li><strong>Replica types.</strong> In addition to full replicas Megastore has 2 other replica types:
<ul>
	<li><i>Witness replicas</i> vote in Paxos rounds and store the write-ahead log but, to keep storage costs low, do not store entity data or indexes. They are also tiebreakers when there isn&#8217;t a quorum.</li>
	<li><i>Read-only replicas</i> are the inverse: nonvoting replicas that contain full snapshots of the data. Their data may be slightly stale but they help disseminate the data over a wide area without slowing writes.</li>
</ul>
</li>
</ul>
<p><strong>Architecture</strong><br />
What does Megastore look like in practice? Here&#8217;s an example. </p>
<p><a href="http://storagemojo.com/wp-content/uploads//2011/04/megastore_arch.png"><img src="http://storagemojo.com/wp-content/uploads//2011/04/megastore_arch.png" alt="" title="megastore_arch" width="460" height="310" class="aligncenter size-full wp-image-2350" /></a></p>
<p>A Megastore client library is installed on the app server. It implements Paxos and other algorithms such as read replica selection. The app server has a local replica written to a local <a href="http://storagemojo.com/2006/09/07/googles-bigtable-distributed-storage-system-pt-i/" target="_blank">BigTable</a> instance.</p>
<p>A <i>coordinator server</i> tracks a set of entity groups and observes all Paxos writes. The coordinator is simpler than BigTable and serves local reads.</p>
<p>Concurrent with writing local data to BigTable and the coordinator, the Megastore library is also writing to a second full replica: a replication server and a second coordinator. The stateless replication servers handle the writes to the remote BigTable while the lower latency coordinator handles any reads from the remote replica.</p>
<p>Failures may leave writes abandoned or in an uncertain state. The replication servers scan for incomplete writes and offer no-op values via Paxos to complete them.</p>
<p><strong>Availability</strong><br />
Because coordinator servers handle most local reads, their availability is critical to maintaining Megastore&#8217;s performance. The coordinators use an out-of-band protocol to track other coordinators and use Google&#8217;s Chubby distributed lock service to obtain remote locks. If the coordinator loses a majority of its locks, it will consider all entities in its purview to be out of date until the locks are regained and the coordinator is current.</p>
<p>There are a variety of network and race conditions that can affect coordinator availability. The team believes the simplicity of the coordinator architecture and the coordinators&#8217; light network traffic make the availability risks acceptable.</p>
<p><strong>Performance</strong><br />
Because Megastore is geographically distributed, application servers in different locations may initiate writes to the same entity group simultaneously. Only one of them will succeed and the other writers will have to retry.</p>
<p>For applications that write to an entity group only a few times per second, e-mail for example, contention is insignificant.</p>
<p>For multiuser applications with higher write requirements developers can shard entity groups more finely or batch user operations into fewer transactions. Fine-grained advisory locks and sequencing transactions are other techniques to handle higher write loads.</p>
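<p>The &#8220;one writer wins, the others retry&#8221; behavior can be sketched in a few lines. In Megastore the winner for a log position is decided by Paxos across replicas; the toy class below replaces that with a simple local check, so treat it as an illustration of the retry pattern only. <code>EntityGroupLog</code> and <code>write_with_retry</code> are hypothetical names.</p>

```python
class EntityGroupLog:
    """Toy single-process stand-in for one entity group's replicated log."""

    def __init__(self):
        self.entries = []

    def next_position(self):
        return len(self.entries)

    def try_append(self, position, value):
        """Accept the write only if `position` is still the next free slot."""
        if position != len(self.entries):
            return False  # another writer already claimed this slot
        self.entries.append(value)
        return True


def write_with_retry(log, make_value, max_attempts=5):
    """Read the next position, attempt the write, and retry after a lost race."""
    for _ in range(max_attempts):
        position = log.next_position()
        if log.try_append(position, make_value(position)):
            return position
    raise RuntimeError("too much contention on this entity group")
```

<p>If two writers both read position 0, the first <code>try_append</code> succeeds and the second fails; the loser rereads the position and lands at 1. The more writers per entity group, the more attempts are wasted, which is why sharding groups more finely or batching operations helps.</p>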
<p><strong>The real world</strong><br />
Megastore has been deployed for several years, and more than 100 production applications use it today. The paper provides these figures on availability and average latencies.</p>
<p><a href="http://storagemojo.com/wp-content/uploads//2011/04/megastore_availability_dist.png"><img src="http://storagemojo.com/wp-content/uploads//2011/04/megastore_availability_dist.png" alt="" title="megastore_availability_dist" width="416" height="327" class="aligncenter size-full wp-image-2351" /></a><br />
<a href="http://storagemojo.com/wp-content/uploads//2011/04/megastore_avg_latencies.png"><img src="http://storagemojo.com/wp-content/uploads//2011/04/megastore_avg_latencies.png" alt="" title="megastore_avg_latencies" width="418" height="343" class="aligncenter size-full wp-image-2352" /></a></p>
<p>The high availability of the system architecture creates a nice-to-have problem: small transient errors on top of persistent uncorrected problems can cause much larger problems. </p>
<p>Fault tolerance makes finding underlying faults more difficult. The price of fault tolerance is eternal vigilance.</p>
<p>As the architecture diagram suggests, Megastore doesn&#8217;t manage BigTable. Developers must optimize the storage for their app.</p>
<p><strong>The StorageMojo take</strong><br />
As Brewer&#8217;s <a href="http://en.wikipedia.org/wiki/CAP_theorem" target="_blank">CAP theorem</a> showed, a distributed system can&#8217;t provide consistency, availability and partition tolerance to all nodes at the same time. But this paper shows that by making smart choices we can get darn close as far as human users are concerned.</p>
<p>If Microsoft Office &#8211; or an open-source analog &#8211; could plug into a productized version of Megastore this could become popular for private cloud implementations: LAN performance in the office and global availability on the road. What&#8217;s not to like?</p>
<p>But whether that happens or not, the paper demonstrates again the value of Internet scale infrastructure thinking. Enterprise vendors would never have developed Megastore, but now that we&#8217;ve seen it work we can begin applying its principles to smaller scale problems.</p>
<p><strong>Courteous comments welcome, of course.</strong>  If this overview intrigues I urge you to read the entire paper as there are some interesting pieces I&#8217;ve left out.</p>
]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2011/04/20/googles-megastore/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Hyder: a flash-based scale-out database</title>
		<link>http://storagemojo.com/2011/01/24/hyder-a-flash-based-scale-out-database/</link>
		<comments>http://storagemojo.com/2011/01/24/hyder-a-flash-based-scale-out-database/#comments</comments>
		<pubDate>Mon, 24 Jan 2011 07:36:35 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Architecture]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Future Tech]]></category>
		<category><![CDATA[Information Management]]></category>
		<category><![CDATA[SSD/Flash Disk]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2239</guid>
		<description><![CDATA[Talked to a company last week whose cloud app handles several billion transactions per month on a cluster. Sounds like SSDs could help them but how? In a paper from the latest 5th Biennial Conference on Innovative Data Systems Research (CIDR &#8217;11) researchers Philip A. Bernstein and Colin W. Reid of Microsoft and Sudipto Das [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Talked to a company last week whose cloud app handles several billion transactions per month on a cluster. Sounds like SSDs could help them but how?</p>
<p>In a paper from the latest <a href="http://www.cidrdb.org/cidr2011/" target="_blank">5th Biennial Conference on Innovative Data Systems Research</a> (CIDR &#8217;11) researchers Philip A. Bernstein and Colin W. Reid of Microsoft and Sudipto Das of UC Santa Barbara have a suggestion: <a href="http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper2.pdf" target="_blank">Hyder – A Transactional Record Manager for Shared Flash</a> (pdf).</p>
<p>As underlying hardware changes &#8211; faster networks, large memories, multi-core CPUs and SSDs &#8211; database software architectures may change too. <i>Hyder</i> architecture supports</p>
<blockquote><p>
. . . reads and writes on indexed records within classical multi-step transactions. It is designed to run on a cluster of servers that have shared access to a large pool of network-addressable raw flash chips. . . . Hyder uses a data-sharing architecture that scales out without partitioning the database or application.
</p></blockquote>
<p><strong>No partition scale-out</strong><br />
Today, most popular database clusters partition the database across multiple servers. Done well this works, but at some cost. The database design is non-trivial: cross-partition transactions, cache coherence, load balancing, scaling and multi-server debugging are knotty issues that translate into higher design and operation costs.</p>
<p>Hyder eliminates partitioning, distributed programming, layers of cache, remote procedure calls and load balancing. All servers can read and write the entire database &#8211; so any server can execute any transaction. Load-balancing is simple: direct new transactions to lightly-loaded servers.</p>
<p>Each update transaction runs on one machine and writes to a shared log &#8211; so there&#8217;s no 2-phase commit. And no 2-phase <strike>commit</strike> locking, which can force performance off a cliff when workloads spike.</p>
<p>The 3 main components of Hyder are the <i>log</i>, the <i>index</i> and the <i>roll-forward algorithm</i>.</p>
<p><strong>Log</strong><br />
The log runs on multiple flash devices &#8211; chips, DIMMs or ??? &#8211; and writes multi-page log records across multiple devices with parity to enable log recovery after device failures.</p>
<p>Hyder uses a <i>multi-versioned</i> database &#8211; old record versions aren&#8217;t updated in place; only the most recent version of a record is used &#8211; which has a few useful properties:</p>
<ul>
<li>Server caches are inherently coherent since only the most recent versions of records are used.</li>
<li>Data can be read while writes are in progress.</li>
<li>Queries that can be decomposed can be run across multiple servers concurrently for a faster response time.</li>
</ul>
<p>[This may seem like voodoo to ACIDheads. A good technical intro to multi-versioning concurrency control (MVCC) is <a href="http://www.rtcmagazine.com/articles/view/101612" target="_blank">Multi-core software: to gain speed, eliminate resource contention</a>.]</p>
<p>Servers run a cache update process that keeps them current with updated records. Server caches don&#8217;t have to be identical and the cache invalidate messages that most clusters use for cache coherency aren&#8217;t needed.</p>
<p>All log writes are idempotent appends, so if a write fails the server can simply reissue the write. The authors describe several error modes and how Hyder handles them.</p>
<p><strong>Index</strong><br />
The index stores the database as a search tree with each node a [key, payload] pair. The tree can store, for example, a relational database. The index tree is also represented in the log.</p>
<p>Tree nodes are not updated in place. When node <i>n</i> is updated, a new copy &#8211; <i>n&#8217;</i> &#8211; is created. Then, of course, the parent node must be updated and so on up the tree.</p>
<p>A binary tree minimizes the number of node updates, but can be processor intensive. The optimal tree structure for Hyder is not yet resolved.</p>
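<p>The copy-on-update idea is easier to see in code. Below is a minimal sketch of path copying on an unbalanced binary search tree: an update creates new copies of the nodes from the changed node up to the root and shares everything else with the old version. The <code>Node</code> layout and function names are my assumptions for illustration; the paper&#8217;s actual node format and tree type may differ.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Immutable [key, payload] tree node; frozen so it cannot be updated in place."""
    key: int
    payload: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def upsert(root, key, payload):
    """Return a new root; only nodes on the path to `key` are copied."""
    if root is None:
        return Node(key, payload)
    if key == root.key:
        return Node(key, payload, root.left, root.right)
    if key > root.key:
        return Node(root.key, root.payload, root.left, upsert(root.right, key, payload))
    return Node(root.key, root.payload, upsert(root.left, key, payload), root.right)


def lookup(root, key):
    """Walk down from a given root; any old root is still a valid snapshot."""
    while root is not None and root.key != key:
        root = root.right if key > root.key else root.left
    return None if root is None else root.payload


# Build version 1, then update key 30 to get version 2.
v1 = None
for k, p in [(50, "a"), (30, "b"), (70, "c")]:
    v1 = upsert(v1, k, p)
v2 = upsert(v1, 30, "b2")
```

<p>After the update, <code>v1</code> still reads the old payload for key 30 and <code>v2</code> reads the new one, while the untouched subtree under key 70 is the same object in both versions. Every old root is therefore a static snapshot of the whole database.</p>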
<p>Garbage collection is an issue. Each node pointer includes the ID of the oldest reachable data element. An element older than any that is pointed to by a node is garbage.</p>
<p><strong>Roll-forward algorithm</strong><br />
This is the key process of Hyder.</p>
<p>When a record update begins, one server executes the transaction. The server is given a copy of the latest database root, a static snapshot of the entire database.</p>
<p>The updates are stored in a local cache and after execution the after-images are gathered into an <i>intention</i> record, which is broadcast to all servers and appended to the log. The update&#8217;s readset is included in the intention record, to ensure all intentions are properly ordered, none are lost, and the offset is made known to all servers.</p>
<p>Each server can assemble a local copy of the tail of the log, which is used to determine if there are conflicting updates. The <i>meld</i> procedure manages conflicting updates.</p>
<p>Appending the intention to the database log doesn&#8217;t commit the transaction. The intention references the static snapshot of the latest database root. The meld procedure determines if any committed transactions since the snapshot conflict with the intention. </p>
<p>If they don&#8217;t, all is good. If they do, the transaction is aborted.</p>
<p>All servers roll forward using meld and don&#8217;t message each other about committed and failed transactions. Therefore there is no lock manager and no 2-phase commit.</p>
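<p>Here is a heavily simplified sketch of the roll-forward idea. The real meld works on the index tree; this version tracks conflicts per key, which is my simplification, and the <code>Intention</code> fields and the <code>meld</code> signature are hypothetical. What it does preserve is the key property: the commit or abort decision depends only on the log, so every server that reads the same log reaches the same answer without exchanging messages.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intention:
    snapshot: int        # log offset of the last intention the transaction saw (-1 = empty)
    readset: frozenset   # keys the transaction read
    writes: tuple        # (key, value) pairs: the after-images


def meld(log):
    """Roll forward over the log in order and return (state, decisions)."""
    state = {}
    last_writer = {}     # key -> offset of the committed intention that last wrote it
    decisions = []
    for offset, intent in enumerate(log):
        touched = set(intent.readset) | {k for k, _ in intent.writes}
        # Conflict: a committed intention changed one of our keys after our snapshot.
        conflict = any(last_writer.get(k, -1) > intent.snapshot for k in touched)
        if conflict:
            decisions.append("abort")
            continue
        for k, v in intent.writes:
            state[k] = v
            last_writer[k] = offset
        decisions.append("commit")
    return state, decisions


log = [
    Intention(-1, frozenset(), (("x", 1),)),        # offset 0
    Intention(0, frozenset({"x"}), (("x", 2),)),    # offset 1, saw offset 0
    Intention(0, frozenset({"x"}), (("x", 3),)),    # offset 2, also saw only offset 0
    Intention(0, frozenset(), (("y", 9),)),         # offset 3, touches a different key
]
state, decisions = meld(log)
```

<p>The intention at offset 2 read <code>x</code> from a snapshot taken before offset 1 committed a new value, so it aborts; the others commit. An aborted intention stays in the log but has no effect on the state.</p>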
<p><strong>Contention</strong><br />
Losing the lock manager and 2-phase commit should help performance unless other points of contention throttle the system. Hyder&#8217;s points of contention include appending intentions to the log, melding the log at each server, and aborting transactions.</p>
<p>Intention appends are serial. The lower the write latency, the more appends can be written per second. A 10µs write latency means a ceiling of about 100k TPS.</p>
<p>Network latency adds to write latency. Faster switches improve append performance.</p>
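<p>The arithmetic behind that ceiling is simple enough to sketch. This assumes appends are strictly serial, so each one costs the full latency, and that network latency simply adds to write latency; the 10µs network figure in the second call is a made-up value for illustration.</p>

```python
def max_appends_per_second(write_latency_us, network_latency_us=0.0):
    """Upper bound for strictly serial appends: one append per total latency."""
    return 1_000_000 / (write_latency_us + network_latency_us)


print(max_appends_per_second(10))       # flash write latency only
print(max_appends_per_second(10, 10))   # same write latency plus a hypothetical network hop
```

<p>Under these assumptions, a network hop that costs as much as the write itself halves the append rate, which is why faster switches matter.</p>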
<p>The abort rate depends on the number of concurrent transactions that conflict. Fast transactions reduce the probability of aborts by reducing the number of concurrent transactions. </p>
<p>The worst case is a record subject to multiple updates from different servers. Detecting high-conflict transactions and serializing them by forcing them onto 1 server would reduce the hot data performance hit.</p>
<p><strong>Performance</strong><br />
The authors model Hyder&#8217;s performance with a focus on the high-contention corner cases. In general, the tests show linear scaling as servers are added. </p>
<p>The problems come when the underlying hardware limits are exceeded. Increasing execution times mean more aborts and performance falls off a cliff. From the paper:</p>
<p><a href="http://storagemojo.com/wp-content/uploads//2011/01/hyder_thrashing.jpg"><img src="http://storagemojo.com/wp-content/uploads//2011/01/hyder_thrashing.jpg" alt="" title="hyder_thrashing" width="475" height="286" class="aligncenter size-full wp-image-2240" /></a></p>
<p><strong>The StorageMojo take</strong><br />
We&#8217;ve been building disk workarounds for decades. We now tend to assume those workarounds are fundamental architectural requirements rather than hacks.</p>
<p>The <i>Hyder</i> paper asks us to imagine a world where non-volatile mass storage is fast and cheap &#8211; and how we could re-architect basic systems to be faster and cheaper too.</p>
<p>The authors&#8217; conclusion is a fair assessment:</p>
<blockquote><p>
Many variations of the Hyder architecture and algorithms would be worth exploring. There may also be opportunities to use Hyder’s logging and meld algorithms with some modification in other contexts, such as file systems and middleware. We suggested a number of directions for future work throughout the paper. No doubt there are many other directions as well.
</p></blockquote>
<p><strong>Courteous comments welcome, of course.</strong> I hope to get to some of the other CIDR papers before <a href="" target="_blank">FAST &#8217;11</a> snows me under.  <strong>Update:</strong> Phil Bernstein was kind enough to scan the post and I&#8217;ve updated 1 minor error. He also mentioned that it won the Best Paper award at the conference. Those CIDR folks have great taste in papers, don&#8217;t they?</p>
]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2011/01/24/hyder-a-flash-based-scale-out-database/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Making data Vanish</title>
		<link>http://storagemojo.com/2010/07/09/making-data-vanish/</link>
		<comments>http://storagemojo.com/2010/07/09/making-data-vanish/#comments</comments>
		<pubDate>Fri, 09 Jul 2010 23:55:45 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Future Tech]]></category>
		<category><![CDATA[Security & Public Policy]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2079</guid>
		<description><![CDATA[Given how hard it is to save data you want (see The Universe hates your data) to keep, losing data on the web should be easy. It isn&#8217;t, because it gets stored so many places in its travels. Problem But the power of the web means that silliness can now be stored and found with [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Given how hard it is to save data you <i>want</i> (see <a href="http://www.zdnet.com/blog/storage/the-universe-hates-your-data/975" target="_blank">The Universe hates your data</a>) to keep, losing data on the web should be easy. It isn&#8217;t, because it gets stored so many places in its travels.</p>
<p><strong>Problem</strong><br />
But the power of the web means that silliness can now be stored and found with the speed of a Google search. You don&#8217;t want sexy love notes &#8211; or pictures &#8211; to a former flame posted after infatuation ends. </p>
<p>Or maybe you want to discuss relationship, health or work problems with a friend over email &#8211; and don&#8217;t want your musings to be later shared with others. Wouldn&#8217;t it be nice to know that such messages will become unreadable even if your friend is unreliable?</p>
<p>Researchers built a prototype service &#8211; Vanish &#8211; that seeks to:</p>
<blockquote><p>
. . . ensure that all copies of certain data become unreadable after a user-specified time, without any specific action on the part of a user, without needing to trust any single third party to perform the deletion, and even if an attacker obtains both a cached copy of that data and the user&#8217;s cryptographic keys and passwords.
</p></blockquote>
<p>That&#8217;s a tall order. Their 1st proof-of-concept failed. But they are continuing the fight.</p>
<p><strong>Vanish</strong><br />
In <a href="http://vanish.cs.washington.edu/pubs/usenixsec09-geambasu.pdf" target="_blank">Vanish: Increasing Data Privacy with Self-Destructing Data</a> Roxana Geambasu, Tadayoshi Kohno, Amit A. Levy and Henry M. Levy of the University of Washington computer science department present an architecture and a prototype to do just that.</p>
<p>Ironically, the project uses the same kind of P2P infrastructure that preserves and distributes data: the Vuze BitTorrent client&#8217;s distributed hash table (DHT).</p>
<p>The basic idea is this: Vanish encrypts your data with a random key, sprinkles pieces of the key across random nodes of the DHT, and then destroys its local copy of the key. You tell the system when to destroy the key and your data goes <i>poof!</i> </p>
<p>They developed a data structure called a <i>Vanishing Data Object</i> (VDO) that encapsulates user data and prevents the content from persisting. And the data becomes unreadable even if the attacker gets a pristine copy of the VDO from before its expiration and all the associated keys and passwords.</p>
<p>Here&#8217;s a timeline for that attack:</p>
<p><a href="http://storagemojo.com/wp-content/uploads//2010/07/vdo_usage_and_attack.jpg"><img src="http://storagemojo.com/wp-content/uploads//2010/07/vdo_usage_and_attack.jpg" alt="" title="vdo_usage_and_attack" width="475" height="208" class="aligncenter size-full wp-image-2083" /></a><br />
<strong>DHT overview</strong></p>
<blockquote><p>
A DHT is a distributed, peer-to-peer (P2P) storage network. . . . DHTs like Vuze generally exhibit a put/get interface for reading and storing data, which is implemented internally by three operations: <code>lookup, get</code>, and <code>store</code>. The data itself consists of an (<i>index, value</i>) pair. Each node in the DHT manages a part of an astronomically large index name space (e.g., 2<sup>160</sup> values for Vuze).
</p></blockquote>
<p>DHTs are available, scalable, broadly distributed and decentralized with rapid node churn. All these properties are ideal for an infrastructure that has to withstand a wide variety of attacks.</p>
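<p>As a rough illustration of that put/get interface, here is a minimal in-memory sketch in Python. It assumes a Kademlia-style rule in which a value lives on the node whose ID is closest to the index by XOR distance. The names <code>ToyDHT</code>, <code>to_id</code> and <code>drop_node</code> are hypothetical, and none of this is the Vuze implementation.</p>

```python
import hashlib

ID_BITS = 160  # size of the index name space: 2**160 values


def to_id(name: str) -> int:
    """Map an arbitrary string to a 160-bit integer using SHA-1."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


class ToyDHT:
    """In-memory stand-in for a DHT; every 'node' is just a dict."""

    def __init__(self, node_names):
        self.nodes = {to_id(n): {} for n in node_names}

    def lookup(self, index: int) -> int:
        """Return the ID of the node currently responsible for this index."""
        return min(self.nodes, key=lambda node_id: node_id ^ index)

    def store(self, index: int, value: bytes) -> None:
        self.nodes[self.lookup(index)][index] = value

    def get(self, index: int):
        return self.nodes[self.lookup(index)].get(index)

    def drop_node(self, node_id: int) -> None:
        """Simulate churn: a departing node takes its data with it."""
        del self.nodes[node_id]


dht = ToyDHT([f"node-{i}" for i in range(8)])
index = to_id("some-share")
dht.store(index, b"share bytes")
assert dht.get(index) == b"share bytes"

# Once the responsible node leaves, the value is gone.
dht.drop_node(dht.lookup(index))
assert dht.get(index) is None
```

<p>The last two lines show the churn behavior in miniature: with no replication, a single departing node is enough to make a stored value unreachable.</p>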
<p><strong>Vanish architecture</strong><br />
<a href="http://storagemojo.com/wp-content/uploads//2010/07/vanish_system_architecture.jpg"><img src="http://storagemojo.com/wp-content/uploads//2010/07/vanish_system_architecture.jpg" alt="" title="vanish_system_architecture" width="462" height="220" class="aligncenter size-full wp-image-2082" /></a><br />
Data (D) is encrypted (E) with key (K) to deliver ciphertext (C). Then K is split into N shares &#8211; K<sub>1</sub>,&#8230;,K<sub>N</sub> &#8211; and distributed across the DHT using a random access key (L) and a secure pseudo-random number generator. The K split uses a redundant erasure code so that a user-definable subset of the N shares can reconstruct the key.</p>
<p>The erasure codes are needed because DHTs lose data due to node churn. It is a bug that is also a feature for secure destruction of data.</p>
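<p>The split described above, where a subset of the N shares can rebuild K, is the defining property of threshold secret sharing. Here is a minimal sketch using Shamir&#8217;s scheme, offered as an illustration only: the prime, the function names and the 7-of-10 parameters are assumptions, not taken from the Vanish code, and a real system should use a vetted cryptographic library.</p>

```python
import secrets

PRIME = 2**521 - 1  # a Mersenne prime, comfortably larger than a 256-bit key


def split_key(key: int, n: int, threshold: int):
    """Split key into n shares; any `threshold` of them recover it."""
    if not 0 <= key < PRIME:
        raise ValueError("key out of range")
    if not 1 <= threshold <= n:
        raise ValueError("need 1 <= threshold <= n")
    # Random polynomial of degree threshold-1 whose constant term is the key.
    coeffs = [key] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_key(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = secrets.randbits(256)
shares = split_key(key, n=10, threshold=7)
assert recover_key(shares[:7]) == key  # any 7 shares suffice
assert recover_key(shares[3:]) == key  # a different 7 work too
```

<p>Losing up to 3 of the 10 shares to churn is harmless here. Once more than 3 are gone, the key cannot be rebuilt, which is exactly the failure Vanish wants after expiration.</p>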
<p><strong>Prototype</strong><br />
They built a Firefox plug-in for Gmail to create self-destructing emails and another &#8211; FireVanish &#8211; for making any text in a web input box self-destructing. They also built a file app, so you can make any file self-destructing. Handy for Word backup files that you don&#8217;t want to keep around.</p>
<p>The major change to the Vuze BitTorrent client was less than 50 lines of code to prevent <code>lookup</code> sniffing attacks. Those changes only affect the client, not the DHT.</p>
<p>The Vanish proto was <a href="http://z.cs.utexas.edu/users/osa/unvanish/" target="_blank">cracked</a> by a group of researchers at UT Austin, Princeton, and U of Michigan. They found that an eavesdropper could collect the key shards from the DHT and reassemble the &#8220;vanished&#8221; content.</p>
<p>Who is going to collect all the shard-like pieces on DHTs? Other than the NSA and other major intelligence services, probably no one. For extra security the data can be encrypted before VDO encapsulation.</p>
<p><strong>The StorageMojo take</strong><br />
The Internet is paid for with our loss of privacy. Young people may think it no great loss; check back in 20 years and we&#8217;ll see what you think then.</p>
<p>It is slowly dawning on the public that their lives are an open book on the Internet. Expect a growing market for private communication and storage if ease-of-use and trust issues can be resolved.</p>
<p>You don&#8217;t have to be Tiger Woods to want to keep your private life private. I hope the Vanish team succeeds.</p>
<p><strong>Courteous comments welcome, of course.</strong>  Figures courtesy of the Vanish team.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2010/07/09/making-data-vanish/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Room at the top</title>
		<link>http://storagemojo.com/2010/06/09/room-at-the-top/</link>
		<comments>http://storagemojo.com/2010/06/09/room-at-the-top/#comments</comments>
		<pubDate>Wed, 09 Jun 2010 07:39:16 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Enterprise]]></category>
		<category><![CDATA[SSD/Flash Disk]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2052</guid>
		<description><![CDATA[Kaminario has introduced the world&#8217;s fastest SAN storage, the K2. If time is money, this is for you. DRAM Kaminario&#8217;s K2 is fast because DRAM, not disk, is the primary storage. DRAM&#8217;s low latency, high bandwidth and durability breaks the tight link between capacity and performance that disks and flash impose. No need for excess [...]]]></description>
			<content:encoded><![CDATA[<p></p><p><a href="http://www.kaminario.com/" target="_blank">Kaminario</a> has introduced the world&#8217;s fastest SAN storage, the K2. If time is money, this is for you.</p>
<p><strong>DRAM</strong><br />
Kaminario&#8217;s K2 is fast because DRAM, not disk, is the primary storage. DRAM&#8217;s low latency, high bandwidth and durability break the tight link between capacity and performance that disks and flash impose. No need for excess capacity to ensure enough IOPS, bandwidth or service life.</p>
<p><strong>The product</strong><br />
Kaminario is a software company. However, they configure customer systems and install the software to order. No home-baked integration here. </p>
<p>The basic hardware unit is a Dell blade server. The blade servers are either I/O directors or data nodes. The Dell server chassis is a passive box &#8211; no active components on the backplane &#8211; but some customers opt for dual chassis for redundancy out of caution.</p>
<p><strong>I/O directors</strong><br />
The I/O directors use 8 Gb Fibre Channel to servers and 10Gig Ethernet to data nodes. The company says they can saturate both due to proprietary software optimizations.</p>
<p>Using FC switches, each I/O director can talk to multiple servers. Each I/O director can handle 150,000 random IOPS.<br />
<div id="attachment_2054" class="wp-caption aligncenter" style="width: 475px">
	<a href="http://storagemojo.com/wp-content/uploads//2010/06/kaminario_architecture.jpg"><img src="http://storagemojo.com/wp-content/uploads//2010/06/kaminario_architecture.jpg" alt="" title="kaminario_architecture" width="475" height="364" class="size-full wp-image-2054" /></a>
	<p class="wp-caption-text">K2 architecture - courtesy Kaminario</p>
</div><br />
<strong>Data nodes</strong><br />
Each data node supports up to 288 GB of ECC DRAM. All the data nodes have battery backup and 2 disks for de-staging data to persistent storage. Background de-staging during idle time reduces backup times during power failures.</p>
<p>The minimum config is 2 I/O directors and 4 data nodes with 500 GB of capacity. That&#8217;s 300,000 IOPS. They&#8217;ve been tested to 10 nodes and 1.5 million random read/write IOPS with support for 16 nodes &#8211; and double the IOPS &#8211; reportedly coming soon.</p>
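<p>Those figures line up if IOPS scale linearly with the number of I/O directors, which is an assumption here rather than a Kaminario statement. A quick check in Python, using only the numbers quoted above and reading &#8220;10 nodes&#8221; as 10 I/O directors:</p>

```python
IOPS_PER_DIRECTOR = 150_000  # per-director figure quoted above


def cluster_iops(io_directors: int) -> int:
    """Aggregate random IOPS, assuming perfectly linear scaling."""
    return io_directors * IOPS_PER_DIRECTOR


print(cluster_iops(2))   # minimum config -> 300000
print(cluster_iops(10))  # largest tested config -> 1500000
```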
<p><strong>Under the covers</strong><br />
The I/O directors are clustered so when 1 fails the others pick up the load. The switched back end 10Gig Ethernet enables all I/O directors to access all data nodes.</p>
<p>The replication default is 2 copies of all data on different blades. Plus copies on disk. </p>
<p>All this runs on standard Dell blade servers. No specialized, low-volume RAID controllers or power-hungry disk shelves. </p>
<p><strong>Software</strong><br />
The secret sauce is the software. Kaminario doesn&#8217;t say much about how they do what they do. In any high-performance cluster maintaining metadata coherence across nodes is one of the tough problems.</p>
<p>They did say they maintain hash tables that enable very short updates to all I/O directors after writes. I also suspect they have implemented a low-latency backend update protocol. Metadata serving is distributed across the cluster.</p>
<p>They must also have some creative ways to max out FC links. I&#8217;d like to know more.</p>
<p><strong>Management</strong><br />
With storage this fast they say you need little tuning. Lay LUNs across the data nodes and fasten your seatbelt. The software includes optimizations, like pseudo-random block layout to minimize contention, automatic load balancing and demand-based block replication. </p>
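<p>Kaminario hasn&#8217;t published its layout algorithm, so the following is only a sketch of the general idea. It swaps in rendezvous (highest-random-weight) hashing: every data node gets a hash score for each block, and the top-scoring nodes hold the copies, so consecutive blocks scatter across the cluster and replicas never share a node. The function name, SHA-256 and the node counts are assumptions for illustration; the default of 2 copies mirrors the replication default mentioned earlier.</p>

```python
import hashlib
from collections import Counter


def place_block(lun: str, block: int, num_nodes: int, copies: int = 2):
    """Return `copies` distinct data-node indexes for one block."""
    if not 1 <= copies <= num_nodes:
        raise ValueError("need 1 <= copies <= num_nodes")

    def score(node: int) -> bytes:
        # Deterministic pseudo-random score for this (block, node) pair.
        return hashlib.sha256(f"{lun}:{block}:{node}".encode("utf-8")).digest()

    return sorted(range(num_nodes), key=score)[:copies]


# Count how many block copies land on each of 4 data nodes.
load = Counter()
for block in range(10_000):
    for node in place_block("lun0", block, num_nodes=4):
        load[node] += 1
print(sorted(load.items()))
```

<p>With a decent hash the per-node counts come out close to even, which is the point: no hot node, and no tuning needed to get there.</p>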
<p>If your app calls for it you can tune chunk sizes and set replication policies. Kaminario says K2 is much easier to manage than typical high-performance storage &#8211; you don&#8217;t have to worry about disk-induced issues like stride.</p>
<p>Management is kept out of the data path on a dedicated GigE network.</p>
<p><strong>Support</strong><br />
Kaminario says they have designed the product and their organization to provide mission-critical Enterprise support. The visible elements from configuration control and software installation to phone home and remote diagnostics back that up.</p>
<p><strong>Who needs this?</strong><br />
If you are hammering a few TB of data for stock trading, real-time business intelligence or TLA government work, this could be the ticket.</p>
<p><strong>Pricing</strong><br />
If you have to ask. . . .</p>
<p>Kaminario has a unique approach: pay for performance:</p>
<blockquote><p>
. . . we price the solution based on the customer IOPS and capacity needs, so basically the way we present such a platform price is by $/GB/IOPS.
</p></blockquote>
<p>I *think* small configs start around $200k. For the performance market price is something like #7 on the list. The first 3 are performance/availability &#8211; 2 sides of the same coin, really.</p>
<p>This removes SPEC shadow puppetry between application requirements and storage performance. Of course, you have to know what performance you want. But anyone who&#8217;s performance tuning high-end arrays will know that.</p>
<p><strong>The StorageMojo take</strong><br />
Kaminario is opening a new niche at the performance end of the market.</p>
<p>The current Big Storage vendors claim that they too can do a million IOPS. And they can, for millions. A price that makes a few TB of DRAM look cheap.</p>
<p>Since high-end disk &#8211; ≈$1/GB retail &#8211; makes up 5-10% of the cost of a high-end array, replacing disk with DRAM might be expected to double the cost of an array. But K2 does away with all the low-volume kit &#8211; controllers, shared cache, disk packaging and more &#8211; and replaces it with high-volume blade hardware. That lowers costs a lot.</p>
<p>Kaminario has opened a new niche: hyper-performance data storage. While a few TB doesn&#8217;t sound like much, it is more text than all but the world&#8217;s largest libraries place on miles of shelves.</p>
<p>The data arms race has kicked up another few notches. It is more competition for the big iron arrays where they least expected it: at the high end of the market. </p>
<p><strong>Courteous comments welcome, of course.</strong>  </p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2010/06/09/room-at-the-top/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Another scale-out storage vendor bought</title>
		<link>http://storagemojo.com/2010/05/11/another-scale-out-storage-vendor-bought/</link>
		<comments>http://storagemojo.com/2010/05/11/another-scale-out-storage-vendor-bought/#comments</comments>
		<pubDate>Tue, 11 May 2010 17:15:44 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Video]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=2029</guid>
		<description><![CDATA[Harmonic is acquiring video production infrastructure and storage provider Omneon for $274 million. They&#8217;d raised about $100M since their founding. Omneon Video Networks is a specialized storage company that provides broadcast quality storage for digital media, along with the gear needed to convert video streams to bits. They do clustering, in their MediaGrid product, a [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>Harmonic is <a href="http://www.marketwire.com/press-release/Harmonic-Announces-Definitive-Agreement-to-Acquire-Omneon-NASDAQ-HLIT-1256022.htm" target="_blank">acquiring</a> video production infrastructure and storage provider Omneon for $274 million. They&#8217;d raised about $100M since their founding.</p>
<p>Omneon Video Networks is a specialized storage company that provides broadcast quality storage for digital media, along with the gear needed to convert video streams to bits. They do clustering, in their MediaGrid product, a sophisticated architecture that can handle a 7&#215;24 beating.</p>
<p>Founded in 1998, venture-backed Omneon started offering storage in response to customer demand. They chose a commodity-based cluster and built their own storage software, MediaGrid, whose architecture hews to the post-array Google-style storage model:</p>
<ul>
<li>No RAID – slices are replicated one or more times based on policy or demand</li>
<li>Single global namespace</li>
<li>Out-of-band meta-data servers manage content servers</li>
</ul>
<p>Omneon’s content servers do more than serve content. They put their unused CPU power to work doing jobs like transcoding – translating content from one format like HD to iPhone-suitable QuickTime.</p>
<p><strong>The StorageMojo take</strong><br />
Omneon is more than a storage company, but their storage made them a competitor to Isilon in the broadcast market. Harmonic is big in the rest of the video workflow, especially distribution in multiple formats. It looks like the 2 firms complement each other nicely.</p>
<p>Omneon was not a pure play storage company. But the fact that they were able to build a competitive storage product as an adjunct to their main business points up how low the barriers to entry are in scale-out storage.</p>
<p><strong>Courteous comments welcome, of course.</strong> I&#8217;m still at EMC World. YottaYotta&#8217;s technology is front and center in the VPLEX product. More on that later.</p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2010/05/11/another-scale-out-storage-vendor-bought/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>NetApp buys Bycast</title>
		<link>http://storagemojo.com/2010/04/13/netapp-buys-bycast-2/</link>
		<comments>http://storagemojo.com/2010/04/13/netapp-buys-bycast-2/#comments</comments>
		<pubDate>Tue, 13 Apr 2010 16:07:00 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Marketing]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1997</guid>
		<description><![CDATA[Brilliant NetApp is buying Bycast, the little-known but likely most successful scale-out file storage company. Bycast has several hundred customers, many installed petabytes, leadership in a growing market segment &#8211; medical imaging &#8211; and a compelling value proposition. What they didn&#8217;t have was market presence. Most of their sales came through OEM deals with IBM [...]]]></description>
			<content:encoded><![CDATA[<p></p><p><strong>Brilliant</strong><br />
NetApp is buying Bycast, the little-known but likely most successful scale-out file storage company. Bycast has several hundred customers, many installed petabytes, leadership in a growing market segment &#8211; medical imaging &#8211; and a compelling value proposition.</p>
<p>What they didn&#8217;t have was market presence. Most of their sales came through OEM deals with IBM and HP, who rebranded Bycast&#8217;s software &#8211; the <a href="http://storagemojo.com/2010/03/17/brocades-unraveling/" target="_blank">Brocade</a> problem.</p>
<p>They also had the common Canadian reluctance to promote themselves. No marketing VP. What marketing efforts they made followed &#8220;big company&#8221; models &#8211; something few small companies can afford.</p>
<p>Their most effective spokesman was their CTO, co-founder and <a href="http://intotheinfrastructure.blogspot.com/" target="_blank">blogger</a> David Slik. Why he got detailed to SNIA committees is a mystery.</p>
<p><strong>The StorageMojo take</strong><br />
This is a brilliant move by NetApp &#8211; as long as they execute. They can&#8217;t afford another Spinnaker.</p>
<p>Tom Georgens has been putting his stamp on the executive team. What they do with Bycast will be a good first test.</p>
<p>The most interesting angle: Bycast&#8217;s replication and resiliency means you don&#8217;t need to back up a properly configured cluster. Which means you don&#8217;t need Data Domain. Hmm-m-m?</p>
<p><strong>Courteous comments welcome, of course.</strong> I did some work for Bycast and I&#8217;m a fan of their technology. </p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2010/04/13/netapp-buys-bycast-2/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>A petascale parallel database</title>
		<link>http://storagemojo.com/2010/02/08/a-petascale-parallel-database/</link>
		<comments>http://storagemojo.com/2010/02/08/a-petascale-parallel-database/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 03:01:06 +0000</pubDate>
		<dc:creator>Robin Harris</dc:creator>
				<category><![CDATA[Cloud computing & storage]]></category>
		<category><![CDATA[Clusters]]></category>
		<category><![CDATA[Information Management]]></category>

		<guid isPermaLink="false">http://storagemojo.com/?p=1891</guid>
		<description><![CDATA[MapReduce and its open source version, Hadoop, are parallel data analysis tools. A few lines of code can drive massive data reductions across thousands of nodes. Cool. Powerful though it is, Hadoop isn&#8217;t a database. Classic structured data analysis of the model/load/process type isn&#8217;t what it was designed for. That&#8217;s where the paper HadoopDB: An [...]]]></description>
			<content:encoded><![CDATA[<p></p><p>MapReduce and its open source version, Hadoop, are parallel data analysis tools. A few lines of code can drive massive data reductions across thousands of nodes. </p>
<p>Cool.</p>
<p>Powerful though it is, Hadoop isn&#8217;t a database. Classic <i>structured</i> data analysis of the model/load/process type isn&#8217;t what it was designed for.</p>
<p>That&#8217;s where the paper <a href="http://db.cs.yale.edu/hadoopdb/hadoopdb.html" target="_blank">HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads</a> (pdf) comes in. Written by Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz and Alexander Rasin (the first 4 at Yale, the last at Brown), the paper proposes a method for building an open-source, commodity hardware-based, massively scalable, shared-nothing, analytical parallel database.</p>
<p><strong>What it is</strong><br />
HadoopDB coordinates SQL queries across multiple independent database nodes using Hadoop as the task coordinator and network communication layer. It uses the scheduling and job tracking of Hadoop while it intelligently pushes much of the query processing into the individual database nodes.</p>
<p>There are four components to HadoopDB.</p>
<ul>
<li>Database Connector. Each node has its own independent database. The connector is the interface between the database and Hadoop&#8217;s task trackers. A MapReduce job supplies the Connector with an SQL query and other parameters. The Connector executes the SQL query on the database and returns results as key-value pairs. It can be implemented to support a variety of databases.</li>
<li>Catalog. The information needed to access the databases and metadata such as cluster data sets, replica locations and data partitions is kept in the catalog.</li>
<li>Data loader. The data loader is responsible for two jobs. First, it executes a MapReduce job over Hadoop that reads the raw data files and partitions them into as many parts as there are nodes in the cluster. Second, it loads the partitions into the local file system of each node and chunks them according to a system-wide parameter.</li>
<li>SQL to MapReduce to SQL planner. The planner provides a parallel database front end to enable SQL queries. The planner transforms the queries into MapReduce jobs and optimizes the query plans for efficiency. This is the secret sauce of HadoopDB.</li>
</ul>
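<p>To make the division of labor concrete, here is a toy sketch in Python that uses the standard library&#8217;s <code>sqlite3</code> as the per-node database. It hash-partitions rows across nodes (the data loader), runs the same SQL on every node and returns key-value pairs (the Database Connector), then merges the partial results (the reduce step). All function and table names are hypothetical, and there is no Hadoop, catalog or planner here, so treat it as an analogy rather than HadoopDB&#8217;s actual code.</p>

```python
import sqlite3
import zlib
from collections import Counter


def load_partitions(rows, num_nodes):
    """Data loader: hash-partition rows into one database per node."""
    nodes = [sqlite3.connect(":memory:") for _ in range(num_nodes)]
    for db in nodes:
        db.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    for url, revenue in rows:
        # crc32 gives a partition choice that is stable across runs.
        node = zlib.crc32(url.encode("utf-8")) % num_nodes
        nodes[node].execute("INSERT INTO visits VALUES (?, ?)", (url, revenue))
    return nodes


def connector(db, sql):
    """Database Connector: run SQL locally, return (key, value) pairs."""
    return db.execute(sql).fetchall()


def run_query(nodes, sql):
    """Map: push the SQL to every node. Reduce: sum the values per key."""
    totals = Counter()
    for db in nodes:
        for key, value in connector(db, sql):
            totals[key] += value
    return dict(totals)


rows = [("a.com", 1.0), ("b.com", 2.5), ("a.com", 4.0), ("c.com", 0.5)]
nodes = load_partitions(rows, num_nodes=3)
result = run_query(nodes, "SELECT url, SUM(revenue) FROM visits GROUP BY url")
assert result == {"a.com": 5.0, "b.com": 2.5, "c.com": 0.5}
```

<p>The aggregation runs inside each node&#8217;s database, so only small partial results cross the network. That is the sense in which query processing gets pushed down to the nodes.</p>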
<p>HadoopDB complements the Hadoop infrastructure and does not replace it. Analysts have both available as needed.</p>
<p><strong>Heterogeneity</strong><br />
A key issue for Internet-scale systems is the ability to run in a heterogeneous environment where multi-year build-outs and rolling node replacement are the norm. That means that some nodes will be faster than others. HadoopDB breaks the work down into small tasks and moves them from slow to fast nodes automagically.</p>
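<p>A small simulation shows why small tasks help. In this sketch, which is an illustration and not Hadoop&#8217;s or HadoopDB&#8217;s scheduler, each node pulls the next task as soon as it is free, so faster nodes naturally end up doing more of the work. The node speeds and task count are made-up numbers.</p>

```python
import heapq


def simulate(node_speeds, num_tasks, task_work=1.0):
    """Greedy pull scheduling: whichever node frees up first takes the next task.

    Returns (finish_time, tasks_done_per_node).
    """
    # Heap of (time the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(node_speeds))]
    heapq.heapify(free_at)
    done = [0] * len(node_speeds)
    finish = 0.0
    for _ in range(num_tasks):
        now, node = heapq.heappop(free_at)
        end = now + task_work / node_speeds[node]
        done[node] += 1
        finish = max(finish, end)
        heapq.heappush(free_at, (end, node))
    return finish, done


# Two slow nodes and one node that is 4x faster share 60 equal tasks.
finish, done = simulate(node_speeds=[1.0, 1.0, 4.0], num_tasks=60)
print(finish, done)
```

<p>For comparison, a static split of 20 tasks per node would leave the job waiting 20 time units on the slow nodes. With pull scheduling the fast node absorbs most of the tasks and the whole job finishes in about half that time.</p>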
<p><strong>Results</strong><br />
The authors ran some benchmarks on Amazon&#8217;s EC2 to test performance. The HadoopDB load times were about 10x that of Hadoop, but the higher performance of HadoopDB usually justified the longer setup time.</p>
<p>The authors found that HadoopDB was able to approach the performance of parallel database systems on much lower cost hardware and free software. Given how young the project is, one can expect higher performance as improvements are made.</p>
<p><strong>The killer app for private clouds?</strong><br />
MapReduce and Hadoop are already in wide use among Internet-scale datacenters. As companies begin to understand and correlate social media, web activity and ad response rates, the demand for large-scale parallel database processing will grow. But will they want to ship it out to Amazon?</p>
<p>Depending on the quantity and sensitivity of the data, many organizations may prefer to keep the processing in-house. Private scale-out Hadoop clusters may become the poor company&#8217;s data warehouse of choice.</p>
<p><strong>The StorageMojo take</strong><br />
HadoopDB is more science project than commercial tool today. Yet the project demonstrates the feasibility of using scale-out compute/storage clusters for work that today typically requires proprietary high-end scale-up system architectures.</p>
<p>If capital costs are reduced by two-thirds with a commodity/FOSS architecture, companies could afford to hire the expertise required to make it work. The free software/paid support model will prove quite successful in this space.</p>
<p><strong>Courteous comments welcome, of course.</strong>  </p>
			]]></content:encoded>
			<wfw:commentRss>http://storagemojo.com/2010/02/08/a-petascale-parallel-database/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

