Configure a 100 TB HD video infrastructure

by Robin Harris on Sunday, 7 June, 2009

The video folks have an interesting set of problems: large needs; major bandwidth; time-critical collaboration; lots of metadata; and more. Like budgets. I do some video production myself and empathize.

They are today where most of us will be in 10 years: lots of large files; local and remote sharing; processor and bandwidth intensive operations; large archives of wanted and rarely accessed files. Today high-end video folks are working at 2k, 4k and, sometimes, 8k video resolutions – and 10 years from now I wouldn’t be surprised if home users were too.

What prompts this is a note I received from, well, I’ll let him introduce himself.

I have a boutique post-production company and I’m a filmmaker. We are small, under a dozen, but swell to a few times that size with freelancers on a project-by-project basis. Because we work with very high resolution media, we need a lot of space, and very high throughput to each user. . . . [W]e’re all working with 2K and 4K media (300 and 1200MBps respectively to EACH user) and 3D animation rendering. . . . We use a mix of Linux, Windows, and OS X clients. In total, we could easily make use of 100TB+ right now, and prefer to stop archiving everything to tape and deleting it, but rather migrate to another tier of storage but keep in one global namespace with the tape just for disaster recovery. We also need security administration.

I can’t find a storage system that does all this. DataDirect Networks seems to be the du jour high-end storage for my industry, and supposing I’m willing to finance that big-ticket brand, they still don’t have a filing system answer. They’re suggesting StorNext or CXFS, and I know the multi-user scalability and expansion limitations well (can anybody say “forklift”?).

The closest I’ve come is Lustre. It seems like it would fit the bill nicely, especially since we’re savvy to integrate in-house, except that it is Linux only, and NFS/CIFS gateways don’t seem like a great idea. I keep hearing they’re working on at least a Windows client, but who knows when it will be ready?

Can you help at all? What have I overlooked? Doesn’t anyone make what I’m looking for?

Short answer to last question:
No.

Longer answer:
No. But there are workarounds.

For those new to video, here’s an abbreviated chart of some video rates in megabytes per second:
[Chart: video data rates. Adapted from Integrity Data Systems, which offers the whole chart. Aspect ratios and frame rates left out.]
Update: Larry Jordan, a writer and trainer in video editing, graciously wrote to let me know that the above data rates are uncompressed – and that most production houses would use compressed data. The amount of compression varies based on the codec as Larry explains in this informative post. End update.

Issue 1: Interconnects
GigE won’t even handle 32-bit RGB standard def video. And when you get into HD video it gets hairier fast. Trunk multiple GigE’s? 10GbE? 4x Infiniband? FC? eSATA or PCI-e direct attached storage?
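
A rough way to frame these choices is to divide a link's nominal rate by the per-user rates quoted in the letter above (300 MB/s for 2K, 1200 MB/s for 4K). The link figures in this sketch are my assumptions: nominal line rates divided by 8, with protocol overhead ignored, so real throughput will be lower.

```python
# Nominal link rates in MB/s. Assumed figures: line rate in bits/s
# divided by 8, ignoring protocol overhead.
LINK_MB_PER_S = {
    "GigE": 125,
    "10GbE": 1250,
}

# Per-user stream rates quoted in the reader's letter.
STREAM_MB_PER_S = {"2K": 300, "4K": 1200}


def streams_per_link(link, fmt):
    """Whole streams of format `fmt` that fit on one `link` at nominal rate."""
    return LINK_MB_PER_S[link] // STREAM_MB_PER_S[fmt]


for link in LINK_MB_PER_S:
    for fmt in STREAM_MB_PER_S:
        print(f"{link:>6} carries {streams_per_link(link, fmt)} x {fmt}")
```

Even at nominal rates a single 10GbE link leaves almost no headroom for one 4K stream, which is why trunking and faster interconnects come up at all.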

Issue 2: Virtualization
A single address space is a wonderful thing. You’ll need a software layer that clusters multiple boxes. You’ll also probably want to build an archive infrastructure that is distinct from your higher performance working set storage, but some vendors will disagree.

Likely software suspects include IBRIX, Parascale, Caringo, MatrixStore, Bycast and Permabit.

On the combined HW/SW side there’s Panasas and Isilon. Something tells me there are some other options, like HP’s Extreme Data Storage 9100, that are also applicable.

Lustre is not a product I would recommend since it was designed for HPC, a market where PhDs work as sysadmins. Sun may have tamed it since they bought it, but it is a non-trivial piece of software.

Come one, come all
StorageMojo readers are invited to offer their 2¢ worth. Architecting is non-trivial, especially if money is an object.

Update:
Our interlocutor wrote in to add some detail:

thanks for the response. Here’s some answers:

– We can manage expensive interfaces like 10GigE and Infiniband QDR. We’ve been paying for dual-channel 4Gb FC for the past few years, after all. I just want to also allow standard Gigabit connections to the cheap seats without a lot of complexity. So I guess the jargon for that would be “multiprotocol” switching?

– The large naming space might be a luxury. The fact is that jobs come in one of three general sizes, and we could have volumes of that size waiting to take on new jobs as they come in, so at least there is one namespace per job. As you said, capacity is cheap…

– Truth is I am pretty savvy, but other than that we have a lot of power desktop users but not sysadmin types. I contract some people with steady part-time work, but it has been our business model to try to keep as many of our full-time people on the creative and producing side as possible, and not in support/administration.

The one thing I don’t understand is what you say about Infiniband not being so great when there’s lots of node churn?

I know what you mean about DAS, but I think I’ve ruled out distributing the data through push/pull from a central repository. The fact is jobs just move too fast through here for that, and we often have about two seconds notice that we need to bring a certain job’s data to System X, Y or Z to do work on it. It’s very dynamic.

I see some brands in your blog post I haven’t checked on yet.

What turned me onto Lustre is that Frantic Films in London has deployed it. They’re the only ones AFAIK.
End update.

The StorageMojo take
Some thoughts on the infrastructure issues.

Capacity is cheap, network bandwidth is expensive. Raw SATA disk is less than $0.10/GB. 10GbE switch ports are over a grand apiece. Infiniband is better from a price/performance perspective, but not as friendly for networks where there is much node churn – unless that’s been fixed in the last few years.
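
To put numbers on that, here is a back-of-the-envelope sketch using only the two prices above. The dozen-seat count comes from the reader's letter; giving every seat its own 10GbE switch port is my assumption, and client NICs, cabling and enclosures are left out.

```python
# Prices from the paragraph above, treated as bounds.
DISK_USD_PER_TB = 100      # "less than $0.10/GB" x 1000 GB per TB (upper bound)
PORT_USD_10GBE = 1000      # "over a grand apiece" (lower bound)


def raw_disk_cost(terabytes):
    """Upper bound on raw SATA disk cost in dollars."""
    return terabytes * DISK_USD_PER_TB


def switch_port_cost(seats):
    """Lower bound on 10GbE switch port cost, one port per seat (assumed)."""
    return seats * PORT_USD_10GBE


print("100 TB raw disk: under $", raw_disk_cost(100))
print("12 switch ports: over  $", switch_port_cost(12))
```

On these figures the switch ports for a dozen seats already cost more than the raw disk for 100 TB.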

Direct attached storage will give you the best performance – especially with 4k. The new PCI-e attached arrays from JMR and others can offer up to 4,000 MB/sec bandwidth. Stripe across 4 of those and you’ll be able to handle 8k.

Transaction processing is well on its way to niche status, like mainframes and hierarchical databases that once ruled the earth. It is a big file world out there and the files are getting bigger every year.

Courteous comments welcome, of course. I’ve done work for many of these folks – but not all – at one time or another.

{ 29 comments }

jrad June 7, 2009 at 8:04 pm

How about something like green-bytes.com, x4500 sata, read/write ssd cache with ZFS+ (Build in block level dedup and compression).

Or Ibrix + Coraid is a pretty scalable combination.

-J

Max June 7, 2009 at 8:47 pm

A rather notable absence I see in your list of interconnects to address Issue 1 is SAS (serial-attached SCSI). This has over the past few years become my personal favorite, since it’s remarkably inexpensive compared to FC, Infiniband, and externalized PCIe, and outperforms typical configurations of them.

In fact, I have found that the PCIe slot a SAS HBA is plugged into is a bottleneck, which makes sense, since PCIe lanes are 250MB/s compared to SAS’s 300MB/s. To avoid that, one would have to plug a 4-port HBA into an x8 PCIe slot or an 8-port HBA into an x16 slot (though I couldn’t suggest a currently available HBA with an x16 interface).
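
The slot arithmetic can be checked with a short sketch. It uses only the two per-lane and per-port figures quoted in the comment; the function name is hypothetical and real HBAs will lose some of this to overhead.

```python
PCIE_LANE_MB_S = 250   # per-lane figure from the comment
SAS_PORT_MB_S = 300    # per-port figure from the comment


def hba_throughput(sas_ports, pcie_lanes):
    """Return (usable MB/s, limiting side) for a SAS HBA in a PCIe slot."""
    sas = sas_ports * SAS_PORT_MB_S
    pcie = pcie_lanes * PCIE_LANE_MB_S
    if pcie < sas:
        return pcie, "PCIe"
    return sas, "SAS"


for ports, lanes in [(4, 4), (4, 8), (8, 8), (8, 16)]:
    mb_s, limit = hba_throughput(ports, lanes)
    print(f"{ports}-port HBA in x{lanes} slot: {mb_s} MB/s, limited by {limit}")
```

A 4-port card only stops being slot-bound at x8, and an 8-port card at x16, which matches the pairing suggested above.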

2GB/s on two 4x cables on a $300 HBA (or RAID card) in an x8 slot should handle all but the 32-bit 4K. Even a single 4x cable would handle the 1200MB/s. Since, presumably, we’re talking about inherently sequential I/O, 72 1.5TB SATA disks would handle that fine at $9400. The onboard-expander enclosures for them from SuperMicro would add another $3600. Add some labor cost for assembly, and call it $14k. Host-based RAID1 can be done for twice the price.
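
A quick tally of that build, as a sketch. Every input is a figure from the paragraph above; the only thing added is the division showing how little each disk has to sustain for one 1200MB/s stream.

```python
DISKS = 72
TB_PER_DISK = 1.5
DISK_COST_USD = 9400        # all 72 disks
ENCLOSURE_COST_USD = 3600   # expander enclosures
BUILT_COST_USD = 14000      # the comment's rounded figure including labor
STREAM_MB_S = 1200          # one 4K stream

raw_tb = DISKS * TB_PER_DISK                     # raw capacity
parts_usd = DISK_COST_USD + ENCLOSURE_COST_USD   # before labor
mirrored_usd = 2 * BUILT_COST_USD                # "twice the price" for RAID1
per_disk_mb_s = STREAM_MB_S / DISKS              # sequential load per disk

print(f"raw capacity: {raw_tb} TB")
print(f"parts: ${parts_usd}, mirrored build: ${mirrored_usd}")
print(f"per-disk load for one stream: {per_disk_mb_s:.1f} MB/s")
```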

Sharing, of course, is another matter. One could, of course, plug all the hosts into the same SAS bus(es), but this would require purchasing a clustered volume manager and/or filesystem. I think Symantec/Veritas has such a thing now, but I can’t speak from any direct experience with it. It may well be cheaper to go with link-aggregated (aka trunked) 10GE, even at $1k per end.

I agree generally with your response on Issue 2, that virtualization, management, and that abstraction layer should be considered separately. The vendors disagree because they’d charge 5-10 times my solution ($28k hardware plus a similar amount in software).

Patrick Osborne June 7, 2009 at 9:03 pm

IBRIX would love to help you architect a solution around your bandwidth, capacity and price requirements. Pixar, Dreamworks and Disney all use our solution (on vastly different server, storage and networking platforms) for rendering and post-production as well as other boutique firms in the US and APAC.

These types of customers like the flexibility of the software approach so they can choose best-of-breed hardware deployments that meet their needs. Another popular feature is IBRIX’s built-in Data Tiering feature that allows the user to migrate files between tiers of storage based on user-defined policies, i.e. SSD -> SAS -> SATA based on rules written around file metadata and access patterns. This allows sysadmins to store data in the right place at the right time, for the right price.

Protocols to access the filesystem include industry-standards like NFS/CIFS/HTTP in addition to providing a proprietary client for Linux and Windows platforms, all of which can be accessed concurrently through all protocols.

Send us an email at sales@ibrix.com if we can be of help.

Robin Harris June 7, 2009 at 9:07 pm

E-SATA. Knew I was missing something. Not as fast as PCI-e but good enough.

Robin

Wes Felter June 7, 2009 at 9:23 pm

This sounds like a call for pNFS-RDMA over IB. Too bad that’s not out of the lab yet.

Nick June 8, 2009 at 12:13 am

Hi Robin,

The trend we have noticed in the UK is that small post houses and creative agencies do not want to allocate budget to hiring tech personnel to manage their data, instead it is a task assigned to one of the team whose time is better spent working on content for a client. The requirements as we have seen it are:

– Plug it in and it works.
– Fits in the workflow (plays nicely with DAM/MAM apps, Final Cut Server and CatDV becoming increasingly popular with boutique post shops)
– Minimal manual intervention for managing and retrieving content in a growing archive.

The main problem is that most of the industry resists the idea of spending any sort of bucks at the ‘boring’ end of the process. Red cameras are sexy, VFX apps and tin are cool. Archive is not. This is changing to a degree as the likes of Panasonic and Red are changing the game. No tape in the camera means no tape to put on the shelf as archive. Also the ability to re-use, re-purpose content is slowly driving adoption of disk-based solutions for nearline or deep archiving (see BBC DMI etc).

On the point of disk based solutions …whilst you can use the Mac OS version of MatrixStore as a software only option we have recently launched a software/hardware combo in the form of low cost Supermicro chassis with cpu and storage, Linux OS, Linux FS and MatrixStore software layered on top. (www.thematrixstore.com)

Christoph June 8, 2009 at 5:17 am

CXFS sounds right, it scales way beyond 100 TB and easily delivers thousands of MB/s if you have the right hardware (20 resp. 25 GB/s at our site).

a01:~> df -h /ptmp1 /ptmp2
Filesystem          Size  Used Avail Use% Mounted on
/dev/cxvm/TP17-32   279T  177T  103T  64% /ptmp1
/dev/cxvm/TP1-16    279T  113T  167T  41% /ptmp2

It can be complex and there is a maximum of ~64 nodes in a cluster, but of course you are free to re-export using CIFS or NFS.

Michael Kilian June 8, 2009 at 6:28 am

Blackwave.tv has an interesting solution. It can provide many streams of video of any bitrate of any length through a 10 GigE interface and uses SATA drives to do it (reducing cost). While our platform is designed for large scale distribution of content (e.g., 10,000 users each at 2Mbps), it can be configured for smaller, higher bandwidth applications as well. Bandwidth starts at 10Gbps and can scale; storage starts at 48 TB and can scale.

We are an HTTP device. No file systems, no volumes to mess with, self-healing of drives. Uploads are via FTP or HTTP put, downloads are HTTP get. That may make it trickier for a post-production environment.

Take a look at our web site and if you have any questions, please don’t hesitate to shoot me an e-mail and I’ll do my best to answer them.

— Mike Kilian
Blackwave CTO

Jeff Denworth June 8, 2009 at 7:57 am

Hi All – so, I’d like to settle a few questions here:

a) “DDN still doesn’t have a file system answer.”… Not true.
DDN has been delivering file-level solutions now for over 3 years, combining them with our award-winning bandwidth and capacity optimized real-time storage systems. In fact, we now have 3 file-level solutions we provide to the marketplace. Of interest to this thread:
– our xStreamScaler platform (http://www.ddn.com/xstream-scaler) is designed for post & broadcast workflows and delivers very cost effective infrastructure for SD/HD/2K and 4K environments. The system has native support for all of the OSs you’re considering and can drive up to 4K streams to any of your clients – as well, we work with all of the major MAM players to integrate this directly into your workflow. This is likely the best option for you, and will deliver well in excess of what any 10GbE or NAS or IB (no OS X support) option can deliver.
– our ExaScaler product: DDN has combined our storage portfolio with the Lustre file system… while this has seen success in many markets (particularly HPC and grid computing) we typically do not offer it in the post & broadcast industry as it only today has native Linux client support. While it is used occasionally in the media market, we find that it is generally relegated to scalable animation. DDN has delivered technology to power the world’s largest Lustre system at the US Department of Energy (clocking in at 240GB/s). We have Lustre experts around the world that can help deploy very complex architectures.

b) DDN solutions require “forklift” upgrades to scale… Also not true.
One of the primary reasons that people use our open storage products with scale-out, tiered storage systems is because we can simply add our systems into an existing infrastructure non-disruptively. Similarly, our single-system scale (up to 1200 HDDs managed by a single system) will allow you to grow your system capacity at “bare metal” prices once the infrastructure is purchased. Need additional performance, or a new tier of storage?… simply purchase another system and add it to your existing environment. Our xStreamScaler product features this exact storage pooling concept and customers around the world are using hybrid infrastructure – leveraging our newest gear for real-time, online performance pools while repurposing older storage within the same namespace as secondary or archive tiers.

We just announced a single customer in Spain (CATA) that has achieved the ability to deliver 4 x 4K concurrent streams (~ 4800MB/s) with a single DDN storage array by using our xStreamScaler platform. NO one gets 4K more than DDN – we architect environments like the one you are looking for everyday… you’re well within our comfort zone. Scale-up, down or out…we can get you there. http://www.ddn.com/index.php?id=206

That said – please consider giving us another call… I think there’s still much to learn about DDN and the scaling and cost benefits of our technology.

Thanks,
Jeff Denworth
VP, Marketing

Kent Langley June 8, 2009 at 8:02 am

Here are some links to info that wasn’t mentioned for research.

I don’t know if it’s helpful, but this thread reminded me of Don Macaskill’s talk about the Sun Unified Storage platform devices..

Here’s the talk..
http://www.youtube.com/watch?v=2WEx_XTjPvE

Here’s the storage..
http://www.sun.com/storage/disk_systems/unified_storage/

They seem pretty impressive as potential building blocks. They are using Arista networks 10G switching which is also worth a look. It drives the costs of 10G down quite a bit it seems.
http://www.aristanetworks.com/en/Index

You might want to look at GlusterFS as well. They just had a new release that looks very nice. Version 2 is the new hotness there.
http://www.gluster.org/

I’d certainly be interested to build that on top of Arista + 7xxx unified storage and run some tests.

Nice post. Thought provoking.

Kent

David Magda June 8, 2009 at 2:24 pm

Some benchmarks for the Sun 7000-series stuff:

http://blogs.sun.com/brendan/entry/my_sun_storage_7410_perf
http://blogs.sun.com/brendan/

NFS gets ~1.9 GB/s, CIFS about 1 GB/s:

http://blogs.sun.com/brendan/entry/cifs_at_1_gbyte_sec

This is for files that are in DRAM (the 7410 can have up to 128 GB); stream from disk is slower for both. Microsoft is actually financing an open-source NFSv4 client for Windows:

http://blogs.zdnet.com/microsoft/?p=2582

Anand Babu June 8, 2009 at 2:25 pm

Here is a little more background on GlusterFS, I would encourage you to try it out as well 😉
– GlusterFS is a clustered file system that runs on off the shelf x86 hardware
– It supports multiple protocols: GlusterFS native, NFS, CIFS, HTTP, and FTP
– Supported interconnects: 1GbE, 10GbE, and InfiniBand
– Single global namespace and volume manager that scales to PB’s

A likely configuration for you would be 6-8 servers (say an HP DL320 or Dell 1950) with 12-24 disks per server with IB or 10GbE. GlusterFS runs in user space making it relatively easy to install and configure.

You can download it here: http://gluster.org/download.php
If you have any questions please contact us at: sales@gluster.com. We are also offering a 30 day free trial subscription if you are interested.

Best,
AB
Gluster CTO

David Magda June 8, 2009 at 3:22 pm

Looking at some of the other videos from MySQLConf ’09, there’s this little tidbit about the forthcoming Sun Flash Array 5100:

http://www.youtube.com/watch?v=QPagpPQTaQY#t=6m55s

Specs listed as:
. 1U
. 4 TB
. 1M iops per 1U (that’s “M” for mega, i.e., million)
. 10 GB/s
. etc.

Brad Winett June 8, 2009 at 9:04 pm

I don’t believe there is any easy, fully-shared solution today for 4K (and I’ve been doing DI storage implementations for about 7 or 8 years so I’ve seen most of them). As usual, the devil is in the details. You can “easily” assemble a SAN-based system (some more “easy” than others) that can support the 1.2GB/s per stream needed for 4K. You can use StorNext or CXFS to share the SAN storage – but while facilities often try just one big filesystem to start, the reality has been that they fall back to multiple filesystems, don’t allow much sharing in real-time, and spend tons of sysadmin time “data-wrangling” to keep the workstations attached to the data they need at a given point in time to try and maximize facility efficiency.

Why? Because after just a little time the file systems get fragmented and performance degrades to the point that you can’t sustain that all-important 24 frames/second. Oh – and you’d better not let anybody use an unbounded application like file moves or copies because those will gobble as many resources – both disk and metadata – as they can. Also – you don’t really want people doing animation or VFX rendering to hit the same disk storage and particularly metadata server because rendering is usually a (take a look at the average renderman file size) tiny-file, high-I/O application and that creams the metadata server; and the mix of small random I/O is poison to simultaneously streaming large sequential (2K or 4K) files. And don’t even think of deleting a 28,000 frame reel while you are trying to keep a colorist happy (well, never happy but maybe “content”…)– nothing will make a metadata server go bananas for a while like that!
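
To see why that reel delete hurts, here is the arithmetic as a sketch. It takes the 1.2GB/s and 24 frames/second figures from this comment and assumes, as my own simplification, that each frame is stored as one file, so deleting the reel means one metadata operation per frame.

```python
STREAM_MB_S = 1200     # 4K stream rate from the comment
FPS = 24               # frames per second
REEL_FRAMES = 28_000   # reel size from the comment

frame_mb = STREAM_MB_S / FPS               # size of one frame in MB
reel_tb = REEL_FRAMES * frame_mb / 1e6     # reel size in TB (decimal)
reel_minutes = REEL_FRAMES / FPS / 60      # running time
file_deletes = REEL_FRAMES                 # assumed: one file per frame

print(f"{frame_mb:.0f} MB per frame")
print(f"{reel_tb:.1f} TB and {reel_minutes:.1f} minutes per reel")
print(f"{file_deletes} file deletes to remove it")
```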

So, shared SANs have the bandwidth, but generally not the random (particularly multiple streams in a fragmented file system) performance and any out-of-band metadata server will have trouble handling a truly “all-shared-all-the-time” single filesystem.

How about parallel (NAS) file systems like Lustre, Gluster or IBRIX? Well, first forget about using 10GigEthernet to do real-time 4K (maybe somebody has done some magic work with bonding or link aggregation, but I haven’t seen the ability to sustain 1.2GB/s minimum to a client workstation with well-behaved latencies). Maybe a parallel mounting (pNFS?) solution with a couple of 10GigE NICs in the workstations will get there eventually (frankly I think 40 or 100GigE in a fast PCIe slot next year will be far simpler). So, if you absolutely, positively have to sustain real-time 4K 24 fps today, Infiniband is probably the only realistic option for a parallel FS (by the way Jeff, there are OS-X drivers available for IB cards).

To do this right on a parallel file system, I’d argue that you have to stripe the files across the storage nodes. I’d guess that eliminates IBRIX (sorry, but no amount of caching in each node will do much in a multi-stream, 1.2GB/s stream environment) even with their IBRIX client.

Open-source parallel systems have a shot; but most of them are tuned for extremely large block sizes (not good in mixed environments), have centralized metadata services not good for random mixed-I/O environments (even Lustre metadata services can get hit hard depending on the I/O), and are the hardest scalable storage systems to keep running, bar none. Most, like Lustre, will require a client agent – and then you have the issue of supporting different flavors of Linux, Windows & Mac clients (and maybe stuff like legacy SGI XFS, etc). Worst of all, – none of them can be considered remotely “user-friendly”. Most do not do well in environments with Windows at all, either (cross-platform locking, access control, etc.).

By the way – a note about Hierarchical Storage Management (HSM) systems that can migrate data back & forth among storage pools. While great in some applications, in DI they rarely work. Most rely on placing “stub” files on your main storage (this doesn’t apply to IBRIX’ implementation that Patrick mentioned) – and the reality in DI is that sysadmins wipe file systems all the time to get a new fresh, un-fragmented file system (versus running a defrag that can take forever). Wipe a file system = lose all the stubs!

Full disclosure time. I worked for DDN for over 10 years, worked for IBRIX, and I’m now at Isilon. Isilon has been in the M&E market since its inception, however up until this year Isilon has frankly kind of ignored the DI market – until many of our media customers that used our products in other areas begged us to look at it. Isilon’s unmatched (really unmatched!) ease of use, ease-of-scalability and balanced systems architecture isn’t something that has been found in DI before. Frankly, DI storage systems have been a beast to care for and feed.

We’ve done extensive testing and characterization of 2K DI (and have had our customers implement systems unbeknownst to us!). Our inherent cross-node file-striping, parallel metadata & locking services, and resistance to fragmentation really fit very well into what DI needs. Unfortunately – we are limited at 10GigE so real-time 2K is all we can do today (no real-time 4K). People can use us for a truly shared, truly scalable real-time 2K DI workflow for as many streams as one could ever want – but for 4K we are limited to a transfer in/out process (at 10GigE speeds). Is that a limitation? Today it is, but the world’s not perfect (sigh).

I think you’ll find that the world’s not perfect for ANY 4K DI implementation at this point unfortunately. So pick your poison – live with a complicated SAN and have multiple file systems that a data-wrangler moves around daily (or hourly depending on how you rent out your suites) and maintains an iron-fisted grip on who does what at any point in time, or for your 4K jobs go with a transfer in/out approach (or get maybe 8 frames/second “real-time”…) and use something like Isilon for an easy, streamlined 2K workflow.

Blake Golliher June 9, 2009 at 9:40 am

I’ll toss GPFS from IBM into the ring. DDN is an OEM of it. I foresee this company having to support some element of partitioning out filesystems, but probably by workload. I’ve seen companies try the ‘single filesystem, it would be soooo easy!’ thing and it’s not pretty when it fails. Less eggs per basket is better. The random small file IO stuff can hit flash on a jbod (equallogic maybe?) or 15k disk from DDN, while the highly sequential throughput stuff can pound away on a 9900 or something. All of this IO can happen under the GPFS umbrella in a small number of filesystems. It scales to many petabytes and large numbers of nodes, like 32k in the latest version. (I don’t work for DDN, but I’m a fan.)

Steve June 9, 2009 at 4:31 pm

You should check out the Omneon MediaGrid. It can easily handle over 100 TB of replicated storage. There are several large broadcasters using it for video editing with similar bandwidth requirements.

Joe Landman June 9, 2009 at 6:05 pm

Reading over some of the checklist and the comments was interesting. On the checklist:

First, they need up to 1200 MB/s per user, using Linux, Windows, and OS-X. Up to a dozen people now (potentially 14.4GB/s, not 1.2GB/s), possibly larger. So smaller bandwidth/non-scalable bandwidth really need not apply. This (strongly) suggests a cluster storage system, not a single box.

Second, they need 100TB+ now.

Third, they need global namespace … and direct access for Windows … which is the odd one out in terms of client support. Lustre is mentioned.

Ok, first, the only way you are going to hit these bandwidths is with a distributed model. I am talking the 14GB/s with 12 users simultaneously slamming on the 1200 MB/s.

Second, don’t disparage Infiniband, just because your preferred vendor/solution doesn’t include it. QDR IB can handle these rates, and you can put it on desktop units. We do. You can do longer runs with DDR IB and fibre … we quoted this out for a customer. A bit more expensive than CX4, but if you need it, you can have it. Moreover, you can channel bond the IB and do multi-rail stuff. IB is the indicated solution, likely with the fibre links. ConnectX or Qlogic cards in the desktops. Nice Mellanox/Qlogic based switch doing QDR on the back end. Add in a 10GbE blade into the frame for multi-protocol switching if you want to tie this into your gigabit ethernet network (don’t go there, 110 MB/s won’t cut it).

Third, global namespace (e.g. single file system) suggests a distributed or centralized metadata service. One may be better than others for this use case. I’d argue for distributed in this scenario, and that suggests solutions like GlusterFS. No current windows client, ask Anand if one is planned. Lustre is possible, but again, that windows client gets in the way. And I should point out that Lustre does not require a team of Ph.D.s to administer it.

Ok, hardware side. FWIW, we have designed a number of storage clusters as of late, including smaller sustained 4 and 8 GB/s performance using Lustre and GlusterFS respectively. There is nothing magical about this, other than using good hardware as the basis. We have units that will provide 96TB raw in 5U at ~2GB/s per unit, that you can use as the basis for this (see this link for details), without breaking budgets, or causing CFOs to go catatonic.
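
As a sizing sketch from the figures in this comment (12 users at 1200 MB/s, units of 96TB raw at ~2GB/s): the code assumes bandwidth adds up linearly across units, which is my simplification and optimistic for any real cluster.

```python
import math

USERS = 12
PER_USER_MB_S = 1200
UNIT_MB_S = 2000       # "~2GB/s per unit"
UNIT_RAW_TB = 96
CAPACITY_TB = 100      # the "100TB+ now" requirement


def units_for_bandwidth(users=USERS):
    """Units needed if per-unit bandwidth scaled linearly (assumed)."""
    return math.ceil(users * PER_USER_MB_S / UNIT_MB_S)


def units_for_capacity(tb=CAPACITY_TB):
    """Units needed on raw capacity alone."""
    return math.ceil(tb / UNIT_RAW_TB)


n = max(units_for_bandwidth(), units_for_capacity())
print(f"bandwidth needs {units_for_bandwidth()} units, capacity needs {units_for_capacity()}")
print(f"{n} units give {n * UNIT_RAW_TB} TB raw")
```

Bandwidth, not capacity, sets the unit count here, so the cluster ends up with far more raw space than the 100TB asked for.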

All told, if this person wants to solve this, their best solution would likely look a great deal like a GlusterFS cluster, followed closely by a Lustre cluster. It’s just that windows issue. That’s going to be the hard aspect to this. There is no simple solution … possibly apart from running windows in a VM on a muscular Linux desktop with an IB card in it. Not perfect, but much better than “No”. And for the highly reluctant pure windows users, you can have the bootup log in to a VM user, and fire up VMWare workstation right away, so it looks simply like a slow booting, but otherwise, very fast windows box.

Just my thoughts, and of course I am biased given who I work for …

Steve Jones June 10, 2009 at 12:01 am

I too would look at the SUN unified storage offerings as the basic building block using 10GbE. There is always the option of aggregating 100BaseT for a cheaper option. The considerable advantage of it is that it should be very easy to manage and long term support should be guaranteed with Oracle’s interests in SUN and open storage. With Open Solaris & ZFS there is a good software base. There are other vendors who will be using the same software stack. It scales fairly large in a single box (and should be higher yet when 2TB disks are available).

It might also be that some of the built-in functionality like snapshotting will be useful too, and the HA clustering might be important too.

Archival software is a different issue – I’m old fashioned enough to like my layering approaches clean and tend to want my basic storage devices do that well rather than try and perform too many functions. When it comes down to it, a storage device offering a standardised interface can be swapped out – something which has lots of added functionality can lock you in.

Uday June 10, 2009 at 3:19 pm

As Robin mentioned, you should build a ParaScale cloud to solve your storage problems. ParaScale cloud storage software aggregates storage from heterogeneous commodity servers to form a massively scalable storage pool that can be viewed, accessed and managed from a single point.

– Trying it out is a simple software download and install on any commodity hardware running Linux. You can try us out by downloading a free 4 TB cloud at http://www.parascale.com, and grow it to PB scale on demand. You choose your favorite hardware vendor and can select any hw configuration and SATA/SAS/FC based on your performance requirements/cost. You purchase any server that can run RHEL or CentOS (e.g. SuperMicro xxx, white boxes). You can also mix and match servers, repurposing servers into the cloud if needed.

– The cloud offers a single namespace and every node within the cloud can serve client requests independently. This is not a gateway implementation and there is no chokepoint.
– The cloud software performs load balancing automatically, eliminating the need for an external load balancer. Each user who uses NFS will get redirected to a different part of the cloud and the load will be spread across the cloud.
– We work over regular bonded interfaces, and so the network will also not be a bottleneck with our solution. GE, 10gE, IB etc – whatever Linux supports we support
– We simplify managing PBs of data by supporting policy based management operations like automatic data migration, capacity balancing and replication (if required). All this is included in the free download, and you can try it without having to install any add-ons. If you are familiar with Redhat Linux or CentOS admin, that’s all that is required.
– We are optimized for large file reads/writes
– We do not support CIFS today – but windows clients can access the cloud via WebDav quite effectively. We have also found Windows NFS clients to be quite effective and with good performance.
– You can eliminate “forklift” from your vocabulary. The ParaScale software provides a layer of persistence, permanence. Underneath this software layer you can add new servers as old ones die. The cloud continues with the latest and greatest hardware out there.

Wes Felter June 10, 2009 at 4:26 pm

Joe, I think a better Windows solution would be for a few nodes to re-export the Gluster/Lustre filesystem using Samba w/ CTDB. No VMs needed.

Steve, I don’t see how you’d get multiple 1.2GB/s streams out of Sun Unified Storage (short of time-traveling into the future to get pNFS).

Steve Jones June 11, 2009 at 5:43 am

Having looked again, I’d agree: that 4K stuff is immense. Maybe you can aggregate 10GbE ports, but I doubt any software stack would handle the data volumes in a fully shared system. The performance figures available for NFS streaming appear to be for the combined output to multiple clients, not single-stream speed.

The data volumes here are seriously crazy – like my 12MP DSLR taking 30 frames a second. At 1.2GBps, that’s over 1TB for just 15 minutes.

But this also shows the problem with spinning disks – 1.2GBps would take perhaps 20 of them in parallel to have a chance of sustaining just one stream. Pulling 60MBps reliably off of 20 devices all at the same time is difficult enough. Add into that contention through multiple streams, small random I/Os for file system, meta-data and so on and there would have to be many dozens of disks. I suspect that the strain of multiple streams and all the buffering going on could also saturate the memory bandwidth available.
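As a rough check of the arithmetic in the two paragraphs above, here is a minimal Python sketch. It assumes decimal units (1 TB = 1000 GB) and the commenter's figure of 60MBps sustained per disk; the function names are made up for illustration, and the disk count ignores contention, metadata I/O and RAID overhead, so it is a floor rather than a sizing recommendation.

```python
import math


def stream_volume_tb(rate_gb_per_s, minutes):
    """Data volume in TB (decimal units) written by one stream."""
    return rate_gb_per_s * minutes * 60 / 1000


def disks_needed(stream_mb_per_s, per_disk_mb_per_s, streams=1):
    """Minimum number of disks to sustain the streams, ignoring all overhead."""
    return math.ceil(streams * stream_mb_per_s / per_disk_mb_per_s)


if __name__ == "__main__":
    # 15 minutes of a single 1.2 GB/s stream
    print(f"15 min at 1.2 GB/s: {stream_volume_tb(1.2, 15):.2f} TB")
    # One 1200 MB/s stream spread over disks that each sustain 60 MB/s
    print(f"Disks for one stream: {disks_needed(1200, 60)}")
    # Several concurrent streams scale the floor linearly
    print(f"Disks for six streams: {disks_needed(1200, 60, streams=6)}")
```

The linear scaling is the optimistic case; as the comment notes, contention and small random I/Os push the real number well above this floor.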

Ahead of pNFS if somebody can invent some form of layer that allows for the interleaving of files over multiple, independent links and storage devices then I suppose the problem might be soluble using several commodity NAS units whilst allowing for full sharing. Using a file-based protocol rather than block-based at least makes the shared access easier. I suspect it is fairly easy to kludge together, but whether there is enough CPU horsepower in a client machine to do all the buffer shuffling at that speed, I don’t know. Also there are probably single threaded parts of client IP stacks that could throttle the whole thing, even if there were multiple targets.
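The interleaving layer imagined above is essentially round-robin striping. The following Python sketch shows only the address mapping, under the assumption of a fixed stripe size and a fixed number of targets; `stripe_map` and its parameters are hypothetical names, not part of any NAS product or of pNFS, and the sketch does no actual I/O or buffering.

```python
from typing import List, Tuple


def stripe_map(offset: int, length: int, stripe_size: int,
               targets: int) -> List[Tuple[int, int, int]]:
    """Split a byte range of a logical file into per-target pieces.

    Returns (target_index, offset_on_target, chunk_length) tuples,
    with stripes assigned to targets in round-robin order.
    """
    pieces = []
    end = offset + length
    while offset < end:
        stripe_no = offset // stripe_size   # which stripe this byte falls in
        within = offset % stripe_size       # position inside that stripe
        chunk = min(stripe_size - within, end - offset)
        target = stripe_no % targets        # round-robin target choice
        target_offset = (stripe_no // targets) * stripe_size + within
        pieces.append((target, target_offset, chunk))
        offset += chunk
    return pieces


if __name__ == "__main__":
    # A 10-byte read from offset 0, 4-byte stripes, 3 targets
    for piece in stripe_map(0, 10, 4, 3):
        print(piece)
```

A client would issue each piece to its target in parallel and reassemble the results in order, which is where the buffer shuffling and CPU cost mentioned above would come in.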

Thierry June 11, 2009 at 5:49 am

There are 3 well known companies that provide really good solutions tailored for broadcast/film industries.
1) You’ve got Quantel with its Genetic Engineering solution.
2) You’ve got Omneon with its MediaGrid solution already mentioned by Steve.
3) You’ve got Avid with its Unity ISIS solution.

Joe Kraska June 12, 2009 at 3:30 pm

Well,

Panasas is remarkably affordable these days, offering most of the benefits of Lustre in a turnkey appliance form factor, without the baggage. At the prices Panasas is able to offer, I’d suggest you consider only tiering off old data, and do it in a separate silo: i.e., don’t try to get fancy. Caveat: they have no Mac driver, and their Windows driver is not QUITE out yet.

DDN is entering this market with a clustered storage appliance also. It, too, will be an easy to use turnkey clustered storage appliance…

Others have pointed out Isilon, which is the easiest to use of the bunch, but cannot approach the performance requirements that the first two can. Note that if you don’t need the highest performance that Panasas and Lustre offer you, you should do some thinkin’: when Isilon says they are the easiest enterprise storage to deploy and use, they ain’t lyin’.

As for ILM, I might suggest you just go for some software that moves older files over to something cheap when the time is right. I think IBM will happily sell you TSM for that… and it’s cheap.

So, make your own:

Put Panasas, Isilon, DDN in Tier 1
Put a Tape Robot (Sun, IBM, Quantum/ADIC) in Tier 3

Notice I do not list a Tier 2. I do NOT think this is needed. I believe you can negotiate Panasas, Isilon, DDN, and the like, to be close enough to Tier 2 prices so as to not need a classical Tier 2 at all…

Joe.

Dustin June 16, 2009 at 7:40 am

Part of what is so daunting about this is understanding the jargon and what all of these things do. I was interested in this article because I’m in a situation similar to your sample subject, but I got lost when I saw “Issue 1: Interconnects” because I have no idea what that is. Ugh…

Chuck June 25, 2009 at 2:00 pm

For years (literally) I’ve whined to folks that it’s all about the interconnect. Not that I’m a prophet; on the contrary, I’m a systems guy (very un-prophet like). However, when you start talking about terabyte disk drives behind a channel that the manufacturer claims will run flat out at 105MB/sec (see disk drive data sheets), you only get to ask for maybe 100 or so different pieces of data from that huge pool every second. The problem isn’t storage any more; the problem is making sure what you need is already in RAM (or at least a whole lot closer to the interconnect than sitting dead on a platter 180 degrees away from being visible).
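To put numbers on that gap, here is a small Python sketch using the figures from the comment (a 1TB drive, 105MB/sec sequential, roughly 100 requests per second). The 4KB request size is an assumption added for illustration, as are the function names; units are decimal.

```python
def sequential_read_hours(capacity_tb, mb_per_s):
    """Hours needed to read the whole drive at full sequential speed."""
    return capacity_tb * 1_000_000 / mb_per_s / 3600


def random_read_mb_per_s(iops, request_kb):
    """Effective throughput when every request needs its own seek."""
    return iops * request_kb / 1000


if __name__ == "__main__":
    # Best case: streaming the entire 1 TB drive end to end
    print(f"Full sequential read: {sequential_read_hours(1, 105):.1f} hours")
    # Worst case: 100 random requests per second, 4 KB each (assumed size)
    print(f"Random 4 KB reads: {random_read_mb_per_s(100, 4):.1f} MB/s")
```

Under these assumptions the same drive delivers a few hundred kilobytes per second on small random reads, which is the argument for keeping hot data in RAM.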

Jeff Brue July 7, 2009 at 7:34 pm

I technically manage a 40-person post facility in SoCal. I’ve given up on storage vendors actually doing what makes sense for this space. The bigger problem, though, as usual, is knowledge about the medium. I.e., those data rates are great… but what about when they’re DPX files and you have to worry not only about throughput but IOPS…

So with that said, I’m about to embark on a DIY with SSD. I’m projecting 6 workstations at 10 bit 4K with Intel SSDs. One storage vendor put a price tag on that throughput at north of half a million… the DIY at under 40.
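The DPX point can be made concrete with a short Python sketch. It rests on several assumptions that are not in the comment: full-aperture frame sizes (2048×1556 for 2K, 4096×3112 for 4K), 10-bit RGB packed into one 32-bit word per pixel, 24 frames per second, file headers ignored, and one file per frame. The function names are illustrative only.

```python
BYTES_PER_PIXEL = 4  # assumed: three 10-bit components packed in a 32-bit word


def frame_bytes(width, height):
    """Approximate size of one DPX frame, ignoring the file header."""
    return width * height * BYTES_PER_PIXEL


def stream_mb_per_s(width, height, fps=24):
    """Sustained throughput for one real-time stream, in decimal MB/s."""
    return frame_bytes(width, height) * fps / 1_000_000


def files_per_second(streams, fps=24):
    """Lower bound on file opens per second with one file per frame."""
    return streams * fps


if __name__ == "__main__":
    print(f"2K stream: {stream_mb_per_s(2048, 1556):.0f} MB/s")
    print(f"4K stream: {stream_mb_per_s(4096, 3112):.0f} MB/s")
    # Six 4K workstations, as projected in the comment above
    print(f"Six 4K streams: {6 * stream_mb_per_s(4096, 3112) / 1000:.1f} GB/s, "
          f"at least {files_per_second(6)} file opens per second")
```

Under these assumptions the per-stream rates land close to the 300 and 1200MBps quoted in the original letter, and every second of playback also means dozens of separate file opens, which is why IOPS matters alongside raw bandwidth.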

Jean July 9, 2009 at 11:56 am

I would go with Sun QFS servers over standard 10Gbit links and switches. This is similar to what Sun has done at a few places in this market segment.

Sun has won a few TV stations with their Unified Storage in editing sections with Apple FCP or Avid. A single 7410 with 128GB RAM and a few SSDs in the head can do 1.2 to 5.5GB/sec depending on file size and the number of disks behind it. No complex setup like QFS or any similar disk sharing here. Huge saving.

Rafael Morbeck January 22, 2010 at 7:48 pm

I’m a cinema student from Brazil, 24yo. I think u should see this:
http://www.ramsan.com/products/ramsan-6200.htm

I/Os per second: 5,000,000+
Capacity: 100 TB of Flash
Bandwidth: 60 GB per second
Latency: writes 80 microseconds, reads 250 microseconds

Rob January 31, 2015 at 3:19 pm

Nearline storage, 10GbE at good prices, up to a petabyte of usable storage. We are a small Welsh company in the UK and go up against the big boys in this space, such as Avid and Isilon. Check us out… may be what you’re looking for.
