Unexpectedly, this has turned into Isilon Week here at StorageMojo. I think everyone is excited by Isilon's successful IPO, the first, I hope, of many for other storage startups.
I’ve already commented on Isilon’s surprisingly uninformative website. They have cool technology, so you’d think they’d want to talk about it. Maybe not talking about technology is how one has a successful IPO these days.
But wait! There’s more!
Isilon was granted a patent on December 5, 2006, naming inventors Sujal M. Patel (company founder), Paul A. Mikesell, Darren P. Schack and Aaron J. Passey. And wonder of wonders: the patent is surprisingly readable! If you've read many patents, you know most of them read like architecture papers rendered into insurance-company legalese. The Isilon patent doesn't. It is still 15 pages of fine print, broken up by the USPTO's really weird online publishing protocol, with the modules that make up the system dissected into an out-of-order presentation. But compared to most patents it is a paragon of clarity, even though one really dumb error crept in – see later in this post.
Some friends showed up; the 3-2-1 Margaritas started flowing: See you tomorrow!
OK, so it’s the day after tomorrow – and the search for storage nirvana continues . . . .
One caveat: I’m using the patent rather than a technical paper on the actual product to explicate Isilon’s architecture. Patents are typically written to embody a lot more functionality than the first gen products whose IP they are protecting. So what I’m describing here may or may not be part of Isilon’s shipping products. That said, my gut tells me that while there may be features that haven’t been implemented, the patent is, in fact, illustrative of the Isilon architecture. Isilon guys are welcome to chime in and correct any misperceptions. I see Isilon folks visiting regularly, so don’t be shy. Sujal?
Further, the patent actually covers what they call a “virtual hot spare”, but it seems to describe most of their system.
The Isilon layer cake recipe:
The core of Isilon's offering is supposed to be the Intelligent File System (IFS). Using a standard NAS protocol, the user requests a file. That request goes to Isilon's Linux-based server, where the kernel-space Virtual File System (VFS) receives it. The VFS maintains a buffer cache that stores metadata generated by the lower layers of the IFS. The VFS layer talks to the Local File System layer, which
. . . maintains the hierarchical naming system of the file system and sends directory and filename requests to the layer below, the Local File Store layer. The Local File System layer handles metadata data structure lookup and management.
The Local File System layer speaks, in turn, to the Local File Store layer – don’t worry, the quiz will be open-book – which translates the logical data request to a specific block request. That request goes to the Storage Device layer, which hosts the disk driver.
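To make the layering concrete, here's a minimal sketch of how a read request might trickle down the stack as the patent describes it. The class names, methods and block size are my own inventions for illustration; they don't correspond to anything in Isilon's shipping code.

```python
# Hypothetical sketch of the IFS layer stack in the patent; names invented.

class StorageDeviceLayer:
    """Hosts the disk driver and reads raw blocks off the device."""
    def read_block(self, block_id: int) -> bytes:
        return b"\x00" * 4096                    # real device driver I/O would go here

class LocalFileStoreLayer:
    """Translates logical data requests into specific block requests."""
    def __init__(self, device: StorageDeviceLayer):
        self.device = device
    def read(self, logical_offset: int, length: int) -> bytes:
        block_id = logical_offset // 4096        # assumed block size
        return self.device.read_block(block_id)[:length]

class LocalFileSystemLayer:
    """Maintains the hierarchical namespace and metadata data structures."""
    def __init__(self, store: LocalFileStoreLayer):
        self.store = store
    def read_file(self, path: str, length: int) -> bytes:
        logical_offset = hash(path) % (1 << 20)  # stand-in for a real name lookup
        return self.store.read(logical_offset, length)

class VirtualFileSystem:
    """Kernel-space entry point; buffer-caches metadata from the layers below."""
    def __init__(self, local_fs: LocalFileSystemLayer):
        self.local_fs = local_fs
        self.buffer_cache = {}
    def handle_nas_read(self, path: str, length: int) -> bytes:
        if path not in self.buffer_cache:
            self.buffer_cache[path] = self.local_fs.read_file(path, length)
        return self.buffer_cache[path]
```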
That’s the description of the IFS. Notice anything missing?
Right!
Nothing coordinates IFS across the cluster. That piece is handled, according to the patent’s tortured taxonomy, by the Smart Storage Units. Which could be running their functionality in hardware, firmware or software. See what I mean about patent language? And this is a readable one!
Modular Smart Storage
The Smart Storage Unit (SSU) consists of a management module, a processing module, a cache, a stack and a storage device. The management module does about what you'd expect: monitoring, error logging and such.
The real work gets done in the processing module, which consists of another set of modules:
- Block allocation manager
- Block cache module
- Local block manager
- Remote block manager
- Block device module
Here’s a description of each:
- Block allocation manager consists of three submodules
- Block Request Translator Module receives incoming READ requests, performs name lookups, locates the appropriate devices, and pulls the data from the device. The module sends a data request to the local or remote block manager module depending on whether the block of data is stored locally or remotely in another smart storage unit. It can also respond to device failures by requesting parity data to rebuild lost data.
- Forward Allocator Module (FAM) allocates device blocks for writes based upon redundancy, capacity and performance. It receives statistics from other SSUs and uses those statistics to optimize the distribution of new data. The statistics include measurements of CPU, network and disk utilization. It also receives latency information from the remote block managers and may underutilize slow SSUs, if possible, based on the redundancy settings. Latency is logged and reported; reasons for slow performance might include bad network cards or a device being hammered by demand. (A sketch of this kind of allocation decision follows the list below.)
A variety of strategies are used to allocate the data, such as striping data across multiple SSUs. The file system handles the striping so disks of different sizes and performance can be used. The module looks up the root metadata data structure for disk device information and calculates the number of smart storage units across which the file data should be spread using performance or other rules. The FAM may provide no data redundancy, or parity or mirroring, while also taking into account SSU capacity, performance or network or CPU utilization in allocating incoming data.
- The Failure Recovery Module (FRM) recovers data no longer available due to a device failure. The remote block manager detects failures and notifies the FRM. It locates data blocks that no longer meet redundancy requirements and recreates data from parity information and requests the FAM to allocate space. Sysadmins can limit rebuild resource consumption. The FRM is where the virtual hot spare comes in. It’s a set of idle storage blocks distributed among blocks present on the SSUs. It sounds cool, yet it looks like all it does is reserve some blocks for rebuild purposes.
- Block Cache Module manages caching, name lookups and metadata data structures. It caches data and metadata blocks using the Least Recently Used caching algorithm, though it may vary the caching protocol in response to the system's performance levels.
- Local Block Manager manages the allocation, storage, and retrieval of data blocks stored – you guessed it! – locally.
- Remote Block Manager Module manages inter-device communication, including block requests, block responses, and the detection of remote device failures. The module resides at the Local File System layer.
- Block Device Module hosts the device driver for the particular piece of disk hardware used by the file system.
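Here's the sketch promised above: a toy version of the kind of decision the Forward Allocator might make, picking target SSUs for a new stripe from per-unit utilization and latency statistics. The data structure, weights and thresholds are my assumptions, not Isilon's actual algorithm.

```python
# Toy forward-allocation logic: pick target SSUs for a new write from
# reported utilization statistics. Weights and thresholds are assumed.

from dataclasses import dataclass

@dataclass
class SSUStats:
    name: str
    cpu_util: float       # 0.0 - 1.0
    net_util: float       # 0.0 - 1.0
    disk_util: float      # 0.0 - 1.0
    latency_ms: float     # recent latency reported by the remote block manager

def choose_targets(stats: list[SSUStats], stripe_width: int,
                   max_latency_ms: float = 50.0) -> list[str]:
    """Return the SSUs a new file stripe would be spread across."""
    # Prefer to skip slow units, but only if enough remain to satisfy
    # the requested stripe width (cf. the redundancy settings).
    fast = [s for s in stats if s.latency_ms <= max_latency_ms]
    candidates = fast if len(fast) >= stripe_width else stats

    # Rank by a simple combined load score; a real allocator would also
    # weigh free capacity and the configured protection level.
    ranked = sorted(candidates, key=lambda s: s.cpu_util + s.net_util + s.disk_util)
    return [s.name for s in ranked[:stripe_width]]

cluster = [
    SSUStats("ssu1", 0.20, 0.30, 0.10, 12.0),
    SSUStats("ssu2", 0.90, 0.80, 0.95, 140.0),   # hammered, or a bad NIC
    SSUStats("ssu3", 0.40, 0.20, 0.30, 18.0),
    SSUStats("ssu4", 0.10, 0.10, 0.20, 9.0),
]
print(choose_targets(cluster, stripe_width=3))   # ['ssu4', 'ssu1', 'ssu3']
```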
Continued tomorrow, Tuesday.
Oh, and that error? The patent says
This parity information is used to perform data recovery when a disk failure occurs. The lost data is recalculated from taking the bitwise XOR of the remaining disks’ data blocks and the parity information. In typical RAID systems, the data is unrecoverable until a replacement disk is inserted into the array to rebuild the lost data.
Of course, a typical RAID system keeps reading and writing data after a disk failure; otherwise it wouldn't be much use. What I suppose they meant was that the redundancy doesn't get recreated until a replacement disk is inserted. Even that isn't quite right: a hot spare is often allocated, so the rebuild starts automatically. An odd oversight.
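For anyone who hasn't stared at parity math lately, here's a tiny worked example of the XOR recovery the patent is describing: with single parity, any one lost block can be recomputed from the surviving data blocks plus the parity block.

```python
# Minimal single-parity XOR reconstruction, RAID-5 style.

def xor_blocks(*blocks: bytes) -> bytes:
    """Bitwise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # data blocks on three disks
parity = xor_blocks(d0, d1, d2)          # parity block on a fourth disk

# The disk holding d1 dies: rebuild its block from the survivors + parity.
assert xor_blocks(d0, d2, parity) == d1
```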
Update: further research has left me in doubt about Isilon’s host OS. I said Linux above, but a couple of references indicate it may be FreeBSD. I’ve invited the Isilon folks to comment, so maybe they’ll straighten this out.
Update II: Got a detailed comment from a reader who’s looked – briefly – under the covers of the Isilon box. Recommended!
That’s all for today. Comments welcome, of course.
For anyone who cares, the patent is number 7,146,524 at the USPTO. It's entitled “Systems and methods for providing a distributed file system incorporating a virtual hot spare.”
Isilon OneFS looks very much like a ‘replay’ of the NetApp GX architecture … messaging and locking done over a standardized (but more expensive) cluster interconnect, provided by a third-party InfiniBand switch.
At this point in time … it seems to be specialized for ‘read-mostly’ sequential multi-stream video delivery … if the reported 96 KB chunk size is correct.
Each node contains 12 disks which are locally managed as a RAID group, protected by Reed-Solomon algorithms … which is RAID 6.
So far… not too much mojo.
As usual, the issue of performance and scalability is more complex.
It would help if someone at Isilon could come up with a system diagram & comment on their locking mechanism & dataflow, much as NetApp did on …
http://drunkendata.com/?p=622
and … http://gridguy.net/?p=16#comments
For those reading along at home who prefer PDF: http://www.pat2pdf.org/pat2pdf/foo.pl?number=7,146,524
Richard,
They don’t charge more for InfiniBand, which is a plus, given its much lower latency and higher bandwidth. As I recall, though, InfiniBand switches are actually much simpler than Ethernet switches, so that's not terribly surprising. I'm sure Isilon gets a major performance boost from it.
I haven’t looked that closely at the actual hardware, so thanks for the overview. I’ve asked Isilon to comment, so let’s see if they do.
Wes,
What a great site! Thanks for the link.
A while ago I had the chance to get a root shell for a few minutes on an Isilon box. (you can get a shell – that’s good)
And I looked around a bit 🙂
All the observations are just from a short ‘look around’:
they definitely run FreeBSD 4.x (I don't remember the x)
(while ONTAP GX, for example, runs FreeBSD 6.2, AFAIK!)
The FS SEEMS like a clustered-FS, which all nodes mount under /onefs.
Internally the box calls the fs ‘efs’ – whatever that stands for.
Unlike a traditional array – they stripe/distribute FILES, not blocks, and they can RE-stripe files – which seems to make it possible to change RAID-levels on individual files.
(and to reconstruct of course – which is a regen of parity/data)
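If that observation is right, protection level would be a per-file metadata property rather than an array-wide setting. A purely speculative sketch of what that might look like – nothing here is based on Isilon's actual on-disk structures:

```python
# Speculative sketch: per-file protection as metadata, so changing the
# "RAID level" of one file just means re-striping that file's blocks.

from dataclasses import dataclass, field

@dataclass
class FileLayout:
    path: str
    protection: str                        # e.g. "mirror-2", "parity+1"
    stripe_nodes: list[str] = field(default_factory=list)

    def restripe(self, new_protection: str, new_nodes: list[str]) -> None:
        """Rewrite only this file's stripes; the rest of the cluster is untouched."""
        self.protection = new_protection
        self.stripe_nodes = new_nodes

clip = FileLayout("/onefs/media/clip.mpg", "parity+1", ["node1", "node2", "node3"])
clip.restripe("parity+2", ["node1", "node2", "node3", "node4"])
```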
Because they do not want to expose any details of their ‘efs’ to the world, each node acts as an NFS server.
(or a CIFS server – but that’s just Samba)
This now has several effects IMHO:
a.) clients don’t need any fs-driver to access it
(which sounds good)
OTOH – clustered NFS isn’t really trivial, and causes issues.
b.) each of the nodes acts as its own NFS server for /onefs.
So all nodes ‘see’ and ‘export’ the same FS via multiple nodes.
But this raises all kinds of questions and issues, which I couldn’t check:
1.) is the NFS-reply-cache replicated ?
2.) is there an optional NFS-failover ? (I couldn’t see any vif)
(no need for reply-cache replication if there’s no NFS-failover)
3.) you have to take care which clients mount which brick – as it could otherwise create ‘hot bricks’ …
NFS itself has ZERO mechanisms to ‘move/migrate’ a mount from one IP to another IP without unmount/mount on the client.
(this is something NTAP tries to avoid talking about too)
c.) the generic architecture isn’t capable of scaling single-client or single-stream performance to more than one GigE.
On a client, you could mount /onefs several times, but whenever a client writes to a file, that stream goes over only one mountpoint – and hence over one Isilon interface/brick.
So the max sustainable read/write is 1 GigE PER FILE – while reality says it’s even less (rumours say ~75 MB/sec write speed).
Maybe they’ll go to 10gigE frontends – then it’ll change – maybe.
Also – if several clients (mounted on different bricks) read/write to the same file – what’s the realistic aggregate speed ?
(depends on the style of caching/locking)
d.) the IB-backend:
implies that the remote disks are connected via (TCP) IP !
(otherwise you couldn’t use gigE instead of IB)
Latency: like every distributed FS that PROXIES requests to the back end on other nodes, it normally incurs latency – esp. for small files…
As this system is, from my understanding, meant for large files – no problem – but for home dirs/small files – hmm …
(esp. if many clients mounted on different bricks do many creates/renames/removes.)
e.) metadata coordination:
Like every distributed FS – some metadata operations MUST GET serialized (like mkdir/rmdir/creat()/rename()/unlink()/link() ).
You can’t parallelize ‘mkdir’ or ‘rm/mv’ :-))
Not sure how they handle this – they’re very quiet on this.
But as the system is limited in size – you could do local caching – depends on the number of files etc… (memory consumption)
f.) failure behavior:
Like with every distributed FS – the failure-behavior is definitely tricky – and is definitely different from non-clustered FSs.
1.) If a node goes down (OS crash or power) – the other nodes have to decide what to do:
reconstruct all files from the failed node – and reinitialize the node (clean) when it rejoins. (very inefficient)
or hope and wait for the node to come back. (need to wait)
(how long do they wait until they give up hope ?)
Can you write NEW files during this time ? (I think – yes)
Can you write/append/modify EXISTING FILES during this time ?
(likely not – but maybe yes – don’t know)
Reconstruction of failed nodes:
Seems like the system is FILE based – so reconstruction of a failed node only rebuilds the space that is REALLY USED, not everything the way block arrays do …
But still – with 6+ disks per brick – and assuming they’re close to full – this can take time … – maybe 1 day ?? 🙂
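As a rough sanity check on that guess – using numbers I'm assuming, not anything Isilon has published:

```python
# Back-of-the-envelope rebuild-time estimate; every input is an assumption.
disks_per_node = 12            # per the article; the shell session above saw "6+"
disk_size_gb = 250             # assumed circa-2006 drive size
rebuild_rate_mb_s = 50         # assumed effective reconstruction throughput

total_mb = disks_per_node * disk_size_gb * 1024
hours = total_mb / rebuild_rate_mb_s / 3600
print(f"~{hours:.0f} hours")   # ~17 hours – most of a day, for a nearly full node
```

So ‘maybe 1 day’ is in the right ballpark if the node is close to full.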
So the question remains what this system is REALLY good for:
a.) it’s mainly useless for HPC – at least for the workloads I know.
single-stream perf. is too slow, and NFS has issues with N-to-1 writes.
(multiple clients write through different bricks to the same file)
b.) it’s mainly useless for ‘commercial’ NAS:
home-directory files (small) won’t go too well, and good space/quota management doesn’t seem to exist.
c.) not sure whether it’s useful for databases (like ORCL) – esp. since I’d need more details on how the crash/recovery of individual nodes is handled.
d.) To me it looks more like a READ-optimized archive system.
(esp. since degraded write/append/modify raises questions)
So somewhere like web farms or other high-read-load scenarios.
But in such areas, most cheaper systems should do well too.
hirni