<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Isilon&#8217;s Cluster Technology.  Pt. 1.0</title>
	<atom:link href="http://storagemojo.com/2007/01/26/isilons-cluster-technology-pt-05/feed/" rel="self" type="application/rss+xml" />
	<link>http://storagemojo.com/2007/01/26/isilons-cluster-technology-pt-05/</link>
	<description>Data storage info &#38; analysis</description>
	<lastBuildDate>Tue, 07 Feb 2012 16:02:02 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: hirni</title>
		<link>http://storagemojo.com/2007/01/26/isilons-cluster-technology-pt-05/comment-page-1/#comment-18919</link>
		<dc:creator>hirni</dc:creator>
		<pubDate>Tue, 30 Jan 2007 11:34:08 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=359#comment-18919</guid>
		<description>Some while ago I had the chance to get a root-shell for some minutes on an isilon box. (you can get a shell - that&#039;s good)
And I looked a bit around :-)
All the observations are just from a short &#039;look around&#039;:

they definitely run FreeBSD 4.x (I don&#039;t remember the x)
(while Ontap-GX for example runs AFAIK FreeBSD 6.2 !)

The FS SEEMS like a clustered-FS, which all nodes mount under /onefs.
Internally the box calls the fs &#039;efs&#039; - whatever that stands for.
Unlike a traditional array - they stripe/distribute FILES, not blocks, and they can RE-stripe files - which seems to make it possible to change RAID-levels on individual files.
(and to reconstruct of course - which is a regen of parity/data)

Because they do not want to expose any details of their &#039;efs&#039; to the world, each node acts as an NFS-server.
(or CIFS server - but that&#039;s just samba)
This has now several effects IMHO:

a.) clients don&#039;t need any fs-driver to access it 
     (which sounds good)
     OTOH - clustered NFS isn&#039;t really trivial, and causes issues.

b.) each of the nodes acts as its own NFS-server for /onefs.
So all nodes &#039;see&#039; and &#039;export&#039; the same FS via multiple nodes.
But this raises all kinds of questions and issues, which I couldn&#039;t check:
  1.) is the NFS-reply-cache replicated ?

  2.) is there an optional NFS-failover ? (I couldn&#039;t see any vif)
       (no need for reply-cache replication if there&#039;s no NFS-failover)

  3.) you have to take care which clients mount which brick - as it could otherwise create &#039;hot bricks&#039; ...
NFS itself has ZERO mechanisms to &#039;move/migrate&#039; a mount from one IP to another IP without unmount/mount on the client.
(this is something NTAP tries to avoid talking about too)
       
c.) the generic arch isn&#039;t capable of scaling single-client or single-stream performance to more than one gigE.
With a client, you could mount /onefs several times, but whenever a client writes to a file - this stream goes only over one mountpoint - and hence over one isln-interface/brick.
So the max sustainable read/write is 1 gigE PER FILE - while reality says it&#039;s even less (rumours say ~75mb/sec write speed)
Maybe they&#039;ll go to 10gigE frontends - then it&#039;ll change - maybe.

Also - if several clients (mounted on different bricks) read/write to the same file - what&#039;s the realistic aggregate speed ?
(depends on the style of caching/locking)

d.) the IB-backend:
implies, that the remote-disks are connected via (TCP) IP !
(otherwise you couldn&#039;t use gigE instead of IB)

Latency: like every distributed FS, which PROXIES requests to the back-end on other nodes - it normally causes latency - esp. for small files...
As this system is from my understanding for large-files - no problem - but for home-dirs/small files - hmm ...
(esp. if many clients mounted on different bricks do many creates/renames/removes.)

e.) metadata coordination:
Like every distributed FS - some metadata operations MUST BE serialized (like mkdir/rmdir/creat()/rename()/unlink()/link() ).
You can&#039;t parallelize &#039;mkdir&#039; or &#039;rm/mv&#039; :-))
Not sure how they handle this - they&#039;re very quiet on this.
But as the system is limited in size - you could do local caching - depends on the number of files etc... (memory consumption)

f.) failure behavior:
Like with every distributed FS - the failure-behavior is definitely tricky - and is definitely different from non-clustered FSs.

 1.) If a node goes down (OS crash or power) - the other nodes have to decide what to do:
     reconstruct all files from the failed node - and reinitialize the node (clean) when it rejoins. (very inefficient)
     or hope and wait that the node comes back. (need to wait)
     (how long do they wait until they give up hope ?)

 Can you write during this time NEW files ? (I think - yes)
 Can you write/append/modify during this time to EXISTING FILES ?
 (likely not - but maybe yes - don&#039;t know)

Reconstruction of failed nodes:
Seems like the system is FILE based - so reconstruction of failed nodes only recons REALLY-USED-space, not like block-arrays everything ... 
But still - with 6+ disks per brick - and assuming they&#039;re close to full - this can take time ... - maybe 1 day ?? :-)

So the question remains what this system is REALLY good for:

a.) it&#039;s mainly useless for HPC - at least for the workloads I know.
too slow single-stream perf. - NFS has issues with N-to-1 writes.
(multiple clients write through different bricks to the same file)

b.) it&#039;s mainly useless for &#039;commercial&#039; NAS
homedirectory-files (small) won&#039;t go too well, and some good space/quota management doesn&#039;t seem to exist.

c.) not sure whether it&#039;s useful for databases (like ORCL) - esp. would need more details on how the crash/recovery of individual nodes is handled.

d.) To me it looks more like a READ-optimized archive system.
(esp. the degraded write/append/modify has questions)
So somewhere at web-farms or other high-read-load scenarios.
But for such areas - most cheaper systems should do well too.

hirni</description>
		<content:encoded><![CDATA[<p>Some while ago I had the chance to get a root-shell for some minutes on an isilon box. (you can get a shell &#8211; that&#8217;s good)<br />
And I looked a bit around <img src='http://storagemojo.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /><br />
All the observations are just from a short &#8216;look around&#8217;:</p>
<p>they definitely run FreeBSD 4.x (I don&#8217;t remember the x)<br />
(while Ontap-GX for example runs AFAIK FreeBSD 6.2 !)</p>
<p>The FS SEEMS like a clustered-FS, which all nodes mount under /onefs.<br />
Internally the box calls the fs &#8216;efs&#8217; &#8211; whatever that stands for.<br />
Unlike a traditional array &#8211; they stripe/distribute FILES, not blocks, and they can RE-stripe files &#8211; which seems to make it possible to change RAID-levels on individual files.<br />
(and to reconstruct of course &#8211; which is a regen of parity/data)</p>
<p>Because they do not want to expose any details of their &#8216;efs&#8217; to the world, each node acts as an NFS-server.<br />
(or CIFS server &#8211; but that&#8217;s just samba)<br />
This has now several effects IMHO:</p>
<p>a.) clients don&#8217;t need any fs-driver to access it<br />
     (which sounds good)<br />
     OTOH &#8211; clustered NFS isn&#8217;t really trivial, and causes issues.</p>
<p>b.) each of the nodes acts as its own NFS-server for /onefs.<br />
So all nodes &#8216;see&#8217; and &#8216;export&#8217; the same FS via multiple nodes.<br />
But this raises all kinds of questions and issues, which I couldn&#8217;t check:<br />
  1.) is the NFS-reply-cache replicated ?</p>
<p>  2.) is there an optional NFS-failover ? (I couldn&#8217;t see any vif)<br />
       (no need for reply-cache replication if there&#8217;s no NFS-failover)</p>
<p>  3.) you have to take care which clients mount which brick &#8211; as it could otherwise create &#8216;hot bricks&#8217; &#8230;<br />
NFS itself has ZERO mechanisms to &#8216;move/migrate&#8217; a mount from one IP to another IP without unmount/mount on the client.<br />
(this is something NTAP tries to avoid talking about too)</p>
<p>c.) the generic arch isn&#8217;t capable of scaling single-client or single-stream performance to more than one gigE.<br />
With a client, you could mount /onefs several times, but whenever a client writes to a file &#8211; this stream goes only over one mountpoint &#8211; and hence over one isln-interface/brick.<br />
So the max sustainable read/write is 1 gigE PER FILE &#8211; while reality says it&#8217;s even less (rumours say ~75mb/sec write speed)<br />
Maybe they&#8217;ll go to 10gigE frontends &#8211; then it&#8217;ll change &#8211; maybe.</p>
<p>Also &#8211; if several clients (mounted on different bricks) read/write to the same file &#8211; what&#8217;s the realistic aggregate speed ?<br />
(depends on the style of caching/locking)</p>
<p>d.) the IB-backend:<br />
implies, that the remote-disks are connected via (TCP) IP !<br />
(otherwise you couldn&#8217;t use gigE instead of IB)</p>
<p>Latency: like every distributed FS, which PROXIES requests to the back-end on other nodes &#8211; it normally causes latency &#8211; esp. for small files&#8230;<br />
As this system is from my understanding for large-files &#8211; no problem &#8211; but for home-dirs/small files &#8211; hmm &#8230;<br />
(esp. if many clients mounted on different bricks do many creates/renames/removes.)</p>
<p>e.) metadata coordination:<br />
Like every distributed FS &#8211; some metadata operations MUST BE serialized (like mkdir/rmdir/creat()/rename()/unlink()/link() ).<br />
You can&#8217;t parallelize &#8216;mkdir&#8217; or &#8216;rm/mv&#8217; <img src='http://storagemojo.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> )<br />
Not sure how they handle this &#8211; they&#8217;re very quiet on this.<br />
But as the system is limited in size &#8211; you could do local caching &#8211; depends on the number of files etc&#8230; (memory consumption)</p>
<p>f.) failure behavior:<br />
Like with every distributed FS &#8211; the failure-behavior is definitely tricky &#8211; and is definitely different from non-clustered FSs.</p>
<p> 1.) If a node goes down (OS crash or power) &#8211; the other nodes have to decide what to do:<br />
     reconstruct all files from the failed node &#8211; and reinitialize the node (clean) when it rejoins. (very inefficient)<br />
     or hope and wait that the node comes back. (need to wait)<br />
     (how long do they wait until they give up hope ?)</p>
<p> Can you write during this time NEW files ? (I think &#8211; yes)<br />
 Can you write/append/modify during this time to EXISTING FILES ?<br />
 (likely not &#8211; but maybe yes &#8211; don&#8217;t know)</p>
<p>Reconstruction of failed nodes:<br />
Seems like the system is FILE based &#8211; so reconstruction of failed nodes only recons REALLY-USED-space, not like block-arrays everything &#8230;<br />
But still &#8211; with 6+ disks per brick &#8211; and assuming they&#8217;re close to full &#8211; this can take time &#8230; &#8211; maybe 1 day ?? <img src='http://storagemojo.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>So the question remains what this system is REALLY good for:</p>
<p>a.) it&#8217;s mainly useless for HPC &#8211; at least for the workloads I know.<br />
too slow single-stream perf. &#8211; NFS has issues with N-to-1 writes.<br />
(multiple clients write through different bricks to the same file)</p>
<p>b.) it&#8217;s mainly useless for &#8216;commercial&#8217; NAS<br />
homedirectory-files (small) won&#8217;t go too well, and some good space/quota management doesn&#8217;t seem to exist.</p>
<p>c.) not sure whether it&#8217;s useful for databases (like ORCL) &#8211; esp. would need more details on how the crash/recovery of individual nodes is handled.</p>
<p>d.) To me it looks more like a READ-optimized archive system.<br />
(esp. the degraded write/append/modify has questions)<br />
So somewhere at web-farms or other high-read-load scenarios.<br />
But for such areas &#8211; most cheaper systems should do well too.</p>
<p>hirni</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Robin Harris</title>
		<link>http://storagemojo.com/2007/01/26/isilons-cluster-technology-pt-05/comment-page-1/#comment-18834</link>
		<dc:creator>Robin Harris</dc:creator>
		<pubDate>Tue, 30 Jan 2007 00:10:29 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=359#comment-18834</guid>
		<description>Richard,

They don&#039;t charge more for Infiniband, which is a plus, given its much lower latency and higher bandwidth. As I recall though, Iband switches are actually much simpler than ethernet switches. So not terribly surprising. I&#039;m sure Isilon gets a major performance boost from it.

I haven&#039;t looked that closely at the actual hardware, so thanks for the overview. I&#039;ve asked Isilon to comment, so let&#039;s see if they do.

Wes,

What a great site! Thanks for the link.</description>
		<content:encoded><![CDATA[<p>Richard,</p>
<p>They don&#8217;t charge more for Infiniband, which is a plus, given its much lower latency and higher bandwidth. As I recall though, Iband switches are actually much simpler than ethernet switches. So not terribly surprising. I&#8217;m sure Isilon gets a major performance boost from it.</p>
<p>I haven&#8217;t looked that closely at the actual hardware, so thanks for the overview. I&#8217;ve asked Isilon to comment, so let&#8217;s see if they do.</p>
<p>Wes,</p>
<p>What a great site! Thanks for the link.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Wes Felter</title>
		<link>http://storagemojo.com/2007/01/26/isilons-cluster-technology-pt-05/comment-page-1/#comment-18829</link>
		<dc:creator>Wes Felter</dc:creator>
		<pubDate>Mon, 29 Jan 2007 23:18:11 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=359#comment-18829</guid>
		<description>For those reading along at home who prefer PDF: http://www.pat2pdf.org/pat2pdf/foo.pl?number=7,146,524</description>
		<content:encoded><![CDATA[<p>For those reading along at home who prefer PDF: <a href="http://www.pat2pdf.org/pat2pdf/foo.pl?number=7,146,524" rel="nofollow">http://www.pat2pdf.org/pat2pdf/foo.pl?number=7,146,524</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Richard</title>
		<link>http://storagemojo.com/2007/01/26/isilons-cluster-technology-pt-05/comment-page-1/#comment-18769</link>
		<dc:creator>Richard</dc:creator>
		<pubDate>Mon, 29 Jan 2007 12:43:22 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=359#comment-18769</guid>
		<description>Isilon OneFS looks very much like a &#039;replay&#039; of the NetApp GX architecture ... messaging and locking done over a standardized (but more expensive) cluster inter-connect, provided by a third party InfiniBand switch.

At this point in time ... it seems to be specialized for ‘read mostly’ sequential multi-stream video delivery... if the reported 96 KB chunk size is correct.

Each node contains 12 disks which are locally managed as a RAID group, protected by Reed Solomon algorithms … which is RAID 6.

So far… not too much mojo.

As usual, the issue of performance and scalability is more complex.

It would help if someone at Isilon could come up with a system diagram &amp; comment on their locking mechanism &amp; dataflow, much as NetApp did on ...

http://drunkendata.com/?p=622

and … http://gridguy.net/?p=16#comments</description>
		<content:encoded><![CDATA[<p>Isilon OneFS looks very much like a &#8216;replay&#8217; of the NetApp GX architecture &#8230; messaging and locking done over a standardized (but more expensive) cluster inter-connect, provided by a third party InfiniBand switch.</p>
<p>At this point in time &#8230; it seems to be specialized for ‘read mostly’ sequential multi-stream video delivery&#8230; if the reported 96 KB chunk size is correct.</p>
<p>Each node contains 12 disks which are locally managed as a RAID group, protected by Reed Solomon algorithms … which is RAID 6.</p>
<p>So far… not too much mojo.</p>
<p>As usual, the issue of performance and scalability is more complex.</p>
<p>It would help if someone at Isilon could come up with a system diagram &amp; comment on their locking mechanism &amp; dataflow, much as NetApp did on &#8230;</p>
<p><a href="http://drunkendata.com/?p=622" rel="nofollow">http://drunkendata.com/?p=622</a></p>
<p>and … <a href="http://gridguy.net/?p=16#comments" rel="nofollow">http://gridguy.net/?p=16#comments</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Magda</title>
		<link>http://storagemojo.com/2007/01/26/isilons-cluster-technology-pt-05/comment-page-1/#comment-18390</link>
		<dc:creator>David Magda</dc:creator>
		<pubDate>Sat, 27 Jan 2007 03:30:17 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=359#comment-18390</guid>
		<description>For anyone that cares the patent is number 7,146,524 at the USPTO. It&#039;s entitled &quot;Systems and methods for providing a distributed file system incorporating a virtual hot spare&quot;</description>
		<content:encoded><![CDATA[<p>For anyone that cares the patent is number 7,146,524 at the USPTO. It&#8217;s entitled &#8220;Systems and methods for providing a distributed file system incorporating a virtual hot spare&#8221;</p>
]]></content:encoded>
	</item>
</channel>
</rss>

