Parascale’s CTO on what’s different about Parascale

by Robin Harris on Thursday, 4 October, 2007

Is Parascale new or old?
There were many good reader questions about Parascale’s announcement. Even though I’ve done some work for them I didn’t know the answers, so I invited their CTO, Cameron Bahar, to respond. He sent me a text-only email, which I’ve decorated with some HTML to improve readability.

CTO Cameron Bahar:

Hi Robin,

We are delighted by the interest shown in both the file management challenges that Parascale seeks to address…and in our newly-announced solution. Your readers bring up many important issues, especially in regards to how existing solutions compare to Parascale. Permit me to try to group these questions into categories and to highlight how Parascale is different.

HPC solutions. High Performance Computing (HPC) solutions are typically implemented with kernel code and employ custom client-side software to achieve high bandwidth. For example, Lustre has been successful at many national labs, as mentioned in one post. Parascale is targeting a different market. Parascale is all about industry standards. We support the NFS, HTTP, and FTP protocols because we don’t expect our customers to recompile their applications. We want our software to be simple to use, and to scale in capacity and bandwidth for our target digital content applications.

Archival solutions. Several companies, including Archivas, have delivered archival systems. These solutions are generally WORM (write once read many) systems and disallow updates to existing files. By comparison, Parascale is POSIX-compliant and designed to support large read/write bandwidth—not always a requirement for archiving. Finally, when a large vendor has acquired one of these technologies (e.g. HDS-Archivas), it is usually shipped as a rack of pre-installed appliances, limiting the choice of hardware provider and hardware configuration.

Clustered file systems. Shared-disk clustered file systems such as Red Hat GFS have the characteristics of traditional distributed file systems: tight cache coherency, distributed lock management, and a symmetric topology. Scalability of these file systems is generally limited to 16 or 32 nodes due to heavy cache-coherency traffic and message passing between nodes.

Members of our engineering team have written several clustered file systems in previous undertakings. From that experience we elected to adopt a very different architecture for Parascale. For starters, we elected to adopt a loosely-coupled architecture for scalability. Further, we chose not to write a new file system. File systems are very delicate (as we know by having written them in the past) and they take 5-7 years to fully stabilize and stop corrupting data. We simply aggregate existing file systems to present a “virtual file system” layer to clients/applications over standard protocols.

Appliances versus software. NAS appliances are ideal for many markets, like SMBs and enterprise workgroups, that need simplicity of installation and for which scalability in volume and bandwidth are not key requirements. Appliances generally employ hardware highly-customized for serving files, including hardware features like NVRAM to boost write-performance and RAID controllers for data redundancy.

Parascale seeks to solve a different problem: the management of large digital content repositories. Think of video on demand, photo archives, medical imaging, seismic data, and genomics data. Don’t fault us for being inappropriate as secondary storage for an RDBMS. We didn’t design Parascale for block storage because many excellent products already address this market.

We’ve constrained our solution to run as an application (with no kernel code) on industry-standard servers, as qualified by Red Hat. We want our customers to enjoy the very latest advances in server hardware (motherboards, processors, memory, disks) available from Dell, HP and others. And we want our customers to be able to buy servers from their “regular hardware vendor.”

Parascale’s software-only solution lets our customers tune the disk capacity, CPU, RAM, I/O and network bandwidth independently—as required by the application at hand. Growth can be incremental—one disk drive or server at a time. You never have to discard hardware or licenses. Another useful benefit of a software-only solution is that other applications can coexist on the Parascale storage nodes, allowing data mining, trans-coding, encryption, or compression on the servers where the data resides. This is not possible with closed appliances.

What qualifies as a “software-only” file storage solution? Our perspective is, first, that the software has to support standard network file access protocols like NFS, HTTP, or CIFS. You can store files in an RDBMS, but that doesn’t make it a software-only file management solution. Second, the disk drives must be direct-attached to the servers. Shared-disk distributed or parallel file systems (over SAN) are software products, but don’t qualify because they require specialized SAN hardware on the back end.

Finally, because all our engineering resources are focused on software, we’ve been able to innovate (with patents to prove it) and to deliver features like transparent, automated file migration (to eliminate server hot spots) and replication (to raise read bandwidth). And our roadmap promises a lot more innovation to follow!

Asked another way, where does Parascale fit in the market? Choose us if:

  • You want industry-standard hardware (e.g. because you want to run applications on the storage nodes, or because you have corporate hardware standards).
  • You need more bandwidth than one server/head can provide.
  • You need the benefits of data mobility across servers (e.g. migration to balance data and eliminate hot spots, replication to increase read bandwidth, smart load balancing to optimize system performance).

Lastly, Parascale aspires to be new and modern in its business model. When our product goes into production, we plan to let you download our software and try it out at no cost. We’re confident you’ll like it. Our pricing is per-spindle, so you never have to deploy or pay for storage capacity before you need it. And if a drive fails, replace it with a new drive at the manufacturers’ current sweet spot; we’re not trying to make money on advances by the disk drive manufacturers.

Hope I’ve addressed some of the questions posted. I applaud the thoughtful discussion that your post has prompted.

Best,

Cameron

Comments welcome, of course.


Bill Todd Friday, 5 October, 2007 at 7:21 am

OK – I’ll take a shot:

1. Since Lustre leverages the local Linux file systems on its servers, I’m not sure where it uses the ‘kernel code’ that Cameron suggests it does. Presumably it does use special client-side software to handle its data mapping and direct access – just as pNFS will need to do. If Parascale does *not* use such custom software, that just means that each time data is read from a distributed file it needs an extra mapping hop through whatever Storage Node is coordinating that specific client’s access to the file (assuming that this Storage Node does not then have to consult the Control Node to get the mapping information, introducing yet more overhead), and that it must actually pass data *through* the coordinating Storage Node when writing to such files. That is not exactly something to brag about (at least implicitly, by failing to mention the down-side of their design choice), and not exactly insignificant given their stated large-file design center.

It would be trivial to extend a product like Lustre to handle such mapping for clients unwilling to run Lustre software locally to optimize access. It would be considerably less trivial to extend Parascale’s product to optimize access as Lustre does. And while an additional network hop (or even a couple) doesn’t amount to all that much for a read request compared to the cost of fetching data from disk and then streaming MBs of it across the wire directly to the recipient, an additional hop for a multi-MB write is significant indeed. Perhaps they just write off write-intensive environments as being uninteresting, just as they write off small-file/database environments – but Cameron explicitly claims large write bandwidth as a goal above, and that’s a bit difficult to achieve when all the write activity for a distributed file (at least from a single client) is funneled through a single coordinating node.

2. His discussion of ‘Clustered file systems’ conspicuously focuses upon shared-disk environments, and with good reason: his already somewhat over-blown description of their limitations (VMS shared-storage clusters nominally accommodate up to 96 nodes and have been built considerably larger than that) completely evaporates when considering shared-nothing, partitioned cluster FSs (of which Parascale’s is just the newest kid on the block).

3. Leveraging the maturity of an existing file system is good, as long as it doesn’t preclude using something better down the road if it should come along. But stating that file systems “take 5-7 years to fully stabilize and stop corrupting data” just demonstrates that he’s never worked in a competent, professional development environment (where releasing such a product would sink an OS – and quite possibly its parent corporation – long before that amount of time had elapsed).

Furthermore, suggesting (as their Architecture Web page does) that using the Storage-Node-local FS eliminates anything like *all* block-management overhead (not that this overhead is anything like as heavy as he claims, unless the workload is insanely write-intensive or you’re using a brain-damaged, non-extent-based file system) is disingenuous when talking about distributed large files (again, their alleged design center).

4. Software-only solutions are a double-edged sword: they allow use of any hardware the user cares to throw at them, and they depend on that hardware (and therefore on the user’s choice of it) to work as they expect it to. This is great for users able to qualify entire systems themselves, and not so great for the rest (who may discover problems after they’ve become wholly dependent on the system). Whether Red Hat certification is sufficient to mitigate such concerns I wouldn’t know.

And even as we speak NVRAM is moving from the ‘custom hardware’ category (battery-backed cards, if you could find and afford them) to commodity (USB drives): it certainly would be nice to see even a ‘software-only’ solution able to capitalize on such commodity hardware if it were present.

5. His discussion of data migration, hot spots, and load-balancing simply indicates that Parascale hasn’t solved tuning problems any more than most of its competitors have.

6. Moving back to their ‘Architecture’ page, the statement there that “The efficiency of maintaining all metadata on one node greatly outweighs any risk of a single point of failure” is absolute rubbish, and should raise major red flags to any prospective purchasers. They do a hand-wave suggesting that the solution is hardening that single node, but fail to go into any specifics about exactly what level of hardening is available (and how well their distributed system accommodates things like Control Node fail-over handled by third-party mechanisms).

7. And then they have the nerve to suggest that a clustered Control Node implementation might be *less* scalable than their single Control Node configuration! Even a shared-disk Control Node cluster would offer far greater scalability, and a shared-nothing partitioned metadata cluster even more (leaving aside the virtues of distributing partitioned, shared-nothing metadata across the entire system to achieve a high degree of automatic load-balancing across all access patterns: as I observed above, additional network hops are small potatoes compared with disk accesses, and in any event can largely be avoided by inexpensive distributed path-caching mechanisms).

8. By contrast, their choice to replicate data across Storage Nodes rather than attempt to harden such nodes individually is by and large a good call (though one might observe a call that Lustre came up with years ago). Even here, though, they attempt to over-sell it by stating that it halves the time required to replace a server or disk (whereas in fact regardless of how many *sources* may exist there’s still only a single *destination* whose bandwidth constrains the rebuild speed).

Unless, of course, they’re distributing replicas of different files on the same server to different replica servers. If so, again this is a double-edged sword: it can indeed yield improved rebuild parallelization, but turns RAID-1-style availability characteristics (where you lose access to data only if the two nodes serving it fail) to RAID-5-style availability (where you may lose access to *some* data if *any* two Storage Nodes in the system fail – or possibly if just any two *disks* in the system fail).
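Bill's availability argument is easy to check with a toy enumeration. The node count and the two placement schemes below are hypothetical, chosen only to make the contrast concrete: with paired mirroring, few double failures are fatal; with replicas scattered across all nodes, every double failure can lose some data.

```python
from itertools import combinations

# Tiny model: N nodes, each data item stored as two replicas.
# "Paired" mirrors node i onto node i+1 (RAID-1-style).
# "Scattered" spreads replicas so every node pair shares some item
# (RAID-5-style availability, faster parallel rebuild).
N = 6
paired = [frozenset((i, i + 1)) for i in range(0, N, 2)]
scattered = [frozenset(p) for p in combinations(range(N), 2)]

def fatal_pairs(placement):
    """Count the 2-node failures that destroy both replicas of
    at least one data item under the given placement."""
    return sum(1 for f in combinations(range(N), 2)
               if frozenset(f) in placement)

# With 6 nodes there are 15 possible double failures: paired
# mirroring makes only 3 of them fatal; scattering makes all 15
# fatal for some item, in exchange for rebuild parallelism.
```

This is exactly the double-edged sword Bill describes: the scattered layout rebuilds faster because many source nodes participate, but it widens the set of failure combinations that can lose data.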

And offering the *option* to harden individual Storage Nodes via mirroring them while sharing a set of back-end disks has merit: in particular, it allows node failure to be decoupled from storage failure and the effective use of RAID-5-style shared storage (rather than the mirrored storage which is required with inter-node replication; whether they actually offer RAID-5/6 striping across Storage Nodes or just throw it out as food for thought is not clear, but what is clear is that its overhead is far greater than local RAID-5/6 implementations). SATA II mechanisms may even allow such back-end disks to be shared inexpensively without resorting to the use of SAS.

9. Suggesting that approaches which run in kernel mode somehow intrinsically cannot accommodate running applications on Storage Nodes is poppycock: in fact, they’d run faster if the file system implementation were in the kernel. Whether Parascale chose a user space implementation to avoid any GPL issues or simply because it’s far easier to create and debug, from the standpoint of performance and features it’s clearly second-best.

That about covers it, I think. Parascale’s product may be well-implemented and quite useful to a certain customer base, but it’s hardly novel in concept, unique in capabilities, or as wonderful as their market-speak would have one believe.

- bill

Cameron Bahar Saturday, 29 December, 2007 at 10:36 pm

Hi Bill, you pose a number of interesting comments. I agree with some and feel there are areas where you make incorrect assumptions about our design goals and reach conclusions that do not apply to our solution or target markets. I will clarify these below. We are always looking for the best technical talent to help Parascale become even better, so we welcome your insights.

On the specific points that you raised:

Question 1: Lustre’s own web site reports that it requires a new Linux kernel with Lustre-specific changes. Kernel features include changes to ext3 and locking functions added to the VFS layer. Please refer to this lustre.org link and the text below for details.

http://wiki.lustre.org/index.php?title=Lustre_Howto#Download_Packages
Lustre requires a number of patches to the core Linux kernel, mostly to export new functions, add features to ext3, and add a new locking path to the VFS. One can either patch their own kernel using patches from the Lustre source tarball or download a pre-patched kernel RPM along with the matching lustre-lite-utils RPM.

Also, we aim to be much simpler to use than Lustre!

http://www.ists.dartmouth.edu/serenyi.pdf
A quote from Peter Braam, Lustre’s creator: “It’s not like backing your car out of the driveway. Installing Lustre is more like launching the space shuttle, with pieces of foam falling off.”

Also, requiring custom client software instead of supporting industry-standard protocols largely limits solutions like Lustre to a smaller set of groups; they are not usually widely adopted in the enterprise. I think NFS/CIFS/HTTP/FTP are the clear winners in the enterprise.

Your point about double-hopping or proxying a request from a remote storage node if the file is not present on the storage node receiving the request is quite valid. We do try to maintain affinity between clients and storage nodes, but do keep in mind that our design goal is NOT to stripe one very LARGE file across a number of nodes as you suggest which is common in HPC applications. Parascale isn’t aiming at the HPC market where multiple clients have to write to a single LARGE file at the same time and where write path parallelism to a single file is important. Parascale aims at serving many requests from many clients to many files stored on many storage nodes for read or write for large(ish) files. It can also serve multiple requests for the same file and be opportunistic about it because it has options: serve it out of the cache on one blade or through a separate interface from another blade. With NFS 2/3 there is no file-based redirection in the NFS client protocol, so this necessitates a double-hop architecture commonly used by most relevant solutions today. With HTTP this is not the case and this is not an issue. And with support for pNFS and NFSv4 in the near future, this double-hop scenario is also eliminated.
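The double-hop versus redirect distinction Cameron draws can be shown schematically. Everything here is illustrative (the node names, the location table, and the functions are my own sketch, not Parascale code): a protocol without client-side redirection forces the entry node to proxy remote data, while a protocol like HTTP lets it redirect the client to the owning node.

```python
# Hypothetical location map a control/metadata service would hold.
FILE_LOCATION = {"movie.mp4": "node-B"}

def read_via_proxy(entry_node, fname):
    """NFSv3-style: the client is pinned to entry_node, so a file
    stored elsewhere must be fetched and relayed (two hops)."""
    owner = FILE_LOCATION[fname]
    if owner == entry_node:
        return f"{entry_node} serves {fname} locally"
    return f"{entry_node} proxies {fname} from {owner} (double hop)"

def read_via_redirect(entry_node, fname):
    """HTTP-style (or pNFS/NFSv4-style layouts): the entry node
    answers with a redirect and the client re-issues the request
    directly to the owning node, eliminating the relay."""
    owner = FILE_LOCATION[fname]
    if owner == entry_node:
        return f"{entry_node} serves {fname} locally"
    return f"302 redirect: client fetches {fname} from {owner}"
```

Client/storage-node affinity, as Cameron notes, is the mitigation for the proxy path: the more often the entry node is also the owner, the less the double hop costs in practice.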

Point #2: I was a fan of VMS clusters.

Point #3: As I explained in Point 1, our stated design goal is NOT what you infer (striped LARGE files). As to how long it takes to ship a STABLE filesystem, we have all done our stints on filesystems and your opinion is respected. I believe that it indeed DOES take 5-7 years to get a stable fs if you do it from scratch. People’s definition of “stable” differs. Veritas VxFS was stable 3 years after it was conceived — for some definition of “stable”. Major issues continued to plague it for years, and were addressed and squashed over time. Sun’s UFS was “stable” for years, and yet continued being plagued by pathological bugs. Newer filesystems continue to exhibit data corruption and other failures today. We believe leveraging an existing filesystem is a better idea than writing yet another filesystem from scratch.

Block management overhead is partitioned between storage nodes, without centrally kept freelists or allocation maps. Allocation overhead isn’t terribly heavy in extent-based filesystems such as VxFS, but when you scale the filesystem to some reasonable size, the maps become hard to manage. This is why the drift in the filesystem design has been toward the “object model” — NASD, OSD, Lustre, Panasas, Google FS, Parascale — all of these are going down the path of separating block management from namespace management.
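The namespace/block-management split Cameron describes can be sketched with two tiny classes. The names and structure are assumptions of mine, not any of the systems listed: the point is only that the central service maps paths to (node, object id) pairs, while each storage node allocates its own object ids locally, so no freelist or allocation map is kept centrally.

```python
import itertools

class NamespaceService:
    """Central metadata: maps file paths to lists of (node name,
    object id) pairs, but holds no block-allocation state at all."""
    def __init__(self):
        self.files = {}

    def record(self, path, locations):
        self.files[path] = locations

    def lookup(self, path):
        return self.files[path]

class StorageNode:
    """Each node manages its own objects locally (here, a dict keyed
    by a per-node id), so allocation overhead is partitioned across
    nodes instead of coordinated through one shared map."""
    def __init__(self, name):
        self.name = name
        self.objects = {}
        self._ids = itertools.count()

    def put(self, data):
        oid = next(self._ids)       # purely local allocation decision
        self.objects[oid] = data
        return oid
```

A file can then span nodes: the namespace records where each piece lives, and a reader consults the namespace once, then fetches pieces from the owning nodes directly.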

Point #4: We intend to leverage NVRAM, SSD, FLASH and other interesting technologies as they become relevant to our target applications.

Point #5: What we have built is a great foundation for many innovative features that can simply be added to self-tune the system.

Points 6&7: I absolutely agree with your points and we do not suggest that a single control node is always the best option. A clustered HA pair and partitioned metadata services are useful solutions—but these are not yet features of our software. Currently we support a pair of Control Nodes, either shared-disk or non-shared-disk with block-level mirroring. In the near future we intend to offer more choices for larger scale and higher availability. I think the info on the website is simply suggesting that symmetric traditional architectures find it difficult to scale to a large number of nodes, largely because of cache-coherency and cluster-control traffic, and that the Parascale architecture avoids this problem by being asymmetric.

Point 8: Hey! Great that we agree on advantages of replication across servers. There are some good recent technical papers that show the clear benefits of file-based replication across servers in terms of availability and reliability versus RAID within a single system. The replication factor is user-configurable, so it can be turned up or down depending on the HA requirements.
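A user-configurable replication factor reduces, mechanically, to choosing r distinct nodes per file. This sketch is hypothetical (rendezvous-style hashing is my own choice of placement, not Parascale's documented method), but it shows how the same code serves any r the administrator dials in.

```python
def place_replicas(nodes, file_id, r):
    """Pick r distinct storage nodes for a file by ranking nodes on a
    per-(file, node) hash, so turning the replication factor up or
    down just extends or trims the same deterministic ranking."""
    if r > len(nodes):
        raise ValueError("replication factor exceeds node count")
    ranked = sorted(nodes, key=lambda n: hash((file_id, n)))
    return ranked[:r]
```

Raising r for a hot or critical file simply takes more nodes from the front of its ranking; existing replicas stay where they are, which keeps re-replication traffic proportional to the change.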

Point 9: I definitely agree that kernel mode systems will usually perform better than user-mode code mostly because they avoid context switches; but kernel mode solutions tend to be more difficult to install and much less portable across different platforms. The idea here is to scale-out to increase bandwidth and capacity, not to have a single very fast server.
