Cleversafe, Again
The New York Times has a readable article about Cleversafe. StorageMojo.com commented on Cleversafe in June and July (see Cleversafe: Yet Another Online Storage Startup and Coolest Remote Data Services).
The money quote:
The Cleversafe design could lead to a communal Internet storage system that Mr. Patterson called “hippie storage.” The idea is similar to SETI@Home, the shared computing system that allows PC users to contribute idle time on their machines to create a distributed supercomputer.
It sounds a lot more like BitTorrent when described like that. The difference, that the stored bits aren’t readable by themselves so the data is secure, is the key to creating a private resource from a public network.
The Network Is The Storage
The article also gets into the impact that broadband internet is having on storage: internet-enabled distributed file systems; web storage services; efficient secure storage. The problem with all these schemes is that network bandwidth is so costly compared to storage capacity. Amazon’s S3, for example, charges almost 3x as much to upload and retrieve a gigabyte as it does to store it for a month (S3 charges $0.15/GB per month for storage and $0.20/GB for bandwidth).
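As a quick check on that ratio, here is a minimal sketch of the arithmetic using only the two prices quoted above. It assumes the $0.20/GB bandwidth charge applies once on upload and once on retrieval; the function names are made up for illustration.

```python
# Prices quoted in the post, in US dollars.
STORAGE_PER_GB_MONTH = 0.15   # storage, per GB per month
TRANSFER_PER_GB = 0.20        # bandwidth, per GB moved

def round_trip_cost(gb: float) -> float:
    """Cost to upload and later retrieve `gb` gigabytes.

    Assumption: the transfer price is charged in each direction.
    """
    return 2 * TRANSFER_PER_GB * gb

def storage_cost(gb: float, months: float = 1.0) -> float:
    """Cost to keep `gb` gigabytes stored for `months` months."""
    return STORAGE_PER_GB_MONTH * gb * months

# Ratio of moving one gigabyte in and out versus storing it for a month.
ratio = round_trip_cost(1) / storage_cost(1)
print(f"round trip costs {ratio:.2f}x one month of storage")
```

Under that assumption the round trip is $0.40 against $0.15 for the month of storage, a bit under 2.7x, which is where “almost 3x” comes from.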
Gilder’s Fever Dream Remains Just That
George Gilder, the late-’90s prophet of the Telecosm, foresaw a world where network bandwidth would be both plentiful and cheap, a vision not so different from that of the atomic energy visionaries of the 1950s who spoke confidently of energy too cheap to meter. Alas, both were wrong. Networks are expensive compared to local access and always will be. While increases in network bandwidth and speed allow networks to do more every year, their growth rate is far exceeded by the growth of stored data. The network tail does not wag the storage dog – no matter how long the tail is.
Actually, Storage Is The Network
Storage and networks have long been recognized as partial substitutes for each other: caching substitutes storage for bandwidth and access time, whether it is an L2 cache on a CPU or Akamai’s content delivery network storing multiple copies across the web. We use networks to connect pools of storage and skim off the most valuable content. Using broadband networks for massive storage is one of those intriguing theoretical what-ifs that will remain forever just beyond our grasp.
Cleversafe May Have Accidentally Designed Something Great
And not a safe backup infrastructure, either. They may have designed the next generation of storage array. Not RAID anything, nothing encrypted, yet safer and more reliable than any existing array. Data parceled out across hundreds of disks, so no hotspots; lots of spindles for I/O, no single disk drive, or even several, containing reconstructible data; perhaps riding on a cheap, fast network storage protocol like AoE.
Sure, hitching up Cleversafe with a backup data compression appliance on the front end would answer some of those network bandwidth issues. But the real win could be in the data center, where a secure, high-performance infrastructure could be built out of standard components.
Cleversafe has an open-source component. Why not?
As always, comments welcome.
I think Google’s file system is already much more advanced as a model for datacenter implementations…. for the average consumer though this is an interesting new idea. One item where this causes some issues is in privacy, if I store my MS Money backup file on my “storage” it may just be on twenty different computers…. and five of those computers may be established online criminals. There needs to be a way to protect this, maybe integration into a contact list, or some form of a trust ranking (e.g. ebay) etc…
Josh, you raise a good point about security – perhaps the Cleversafe algorithm won’t turn p2p networks into secure storage. At least, not without some more work. Even though I don’t care for encryption as a personal data security tool – maybe adding that to the mix would make it safe to use public resources.
The privacy implications of massive storage are another area of interest – and I’ve been working on another article about that very thing. I think it is a bad business decision on Google’s part – and any other search engine that does it – to save everyone’s search history forever. Yet the benefits of massive storage are powerful. Stay tuned.
I also agree that Cleversafe’s technology is no substitute for a re-architected data center along the lines of GooFS or ZFS. Yet it seems to me that it does provide a layer of security – all those disk drives containing sensitive info for sale on eBay – that could co-exist nicely with a ZFS. It also seems like it could provide another model of a fast, RAID-less, secure storage array. I’m hardly the person to try to engineer such a beast. I only welcome new thinking about the old problem of creating fast, low-cost massive storage. Cleversafe may have inadvertently created just such a model.
Re: Storage Is The Network
Actually, I recall reading about a primitive audio computer that used a waveguide as a data storage mechanism; it played a sound in one end, and when the sound exited the waveguide it was read by a microphone. I imagine the same thing could be done with the Internet in a variety of ways; for example, by sending data to nonexistent accounts that mimics spam, and the bounce comes back for you to read again. Alternately, you could use the contents of an ICMP echo request (ping) packet; the echo response comes back with the first N octets of the payload the same as the request.
If I was told correctly, the Cray supercomputers didn’t use buffers; they used copper traces on the PCBs for storage (electricity travels through copper at a precise speed).
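A rough way to see how much such a “delay line” could hold: the data in flight at any moment is the link bandwidth multiplied by the round-trip time. The sketch below is only that arithmetic; the 100 Mbit/s and 100 ms figures are hypothetical numbers chosen for illustration, not measurements.

```python
def bits_in_flight(bandwidth_bps: float, round_trip_s: float) -> float:
    """Bandwidth-delay product: bits that are 'stored' in transit."""
    return bandwidth_bps * round_trip_s

# Hypothetical link: 100 Mbit/s with a 100 ms round trip.
capacity_bits = bits_in_flight(100e6, 0.1)
capacity_megabytes = capacity_bits / 8 / 1e6
print(f"about {capacity_megabytes:.2f} MB in flight")
```

Even on a fast link the capacity is on the order of a megabyte, and every bit has to be re-sent continuously to stay “stored.”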
Also, if the peers don’t know which slice belongs to which data set, then it could be difficult to reconstruct the data even without crypto. Note that the secret splitting system involves lots of randomly-generated bits to create the secret shares.
Adding crypto on top of this should be an easy matter.
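To make the point about random bits concrete, here is a minimal sketch of the simplest kind of secret splitting, XOR splitting. This is not Cleversafe’s algorithm: it is an all-or-nothing scheme in which every share is needed to rebuild the data, and the names `split_secret` and `join_shares` are made up for illustration.

```python
import secrets

def split_secret(data: bytes, n: int) -> list[bytes]:
    """Split `data` into n shares; all n are needed to recover it."""
    if n < 2:
        raise ValueError("need at least two shares")
    # n - 1 shares are pure random bytes, each as long as the data.
    shares = [secrets.token_bytes(len(data)) for _ in range(n - 1)]
    # The last share is the data XORed with every random share.
    last = data
    for share in shares:
        last = bytes(a ^ b for a, b in zip(last, share))
    shares.append(last)
    return shares

def join_shares(shares: list[bytes]) -> bytes:
    """XOR all shares together to recover the original data."""
    out = bytes(len(shares[0]))
    for share in shares:
        out = bytes(a ^ b for a, b in zip(out, share))
    return out

original = b"MS Money backup"
shares = split_secret(original, 5)
recovered = join_shares(shares)
print(recovered == original)
```

Each share on its own is indistinguishable from random noise, which is why a peer holding one slice learns nothing from it. The cost is that every share is as large as the original data.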
This reminds me of this program:
http://duplicity.nongnu.org/
My plan was to set up a peering relationship with a friend, where we each use half our disks and leave half for the other person, who would back up nightly or something using duplicity. It’s basically rsync plus crypto, which beats SSH or SSL/TLS because even the peer can’t decrypt it.
One more thought.
This reminds me of a DEFCON talk where the doxpara.com guy showed how you could do many things with DNS servers, since you can query them and force them to cache data for a time determined by the owner of the DNS zone file (i.e. you).
You can:
1) Use them as a huge network for storing files with tremendous bandwidth (DNS servers tend to be well-provisioned and massive network parallelism makes it perform like BitTorrent).
2) Use them for streaming audio or video efficiently; the demonstration involved serving DNS records that encoded digital audio that was sampled from a SF radio station, and then “broadcast” through a public DNS server.
3) Use them for covert channels (e.g. he demonstrated SSH tunnelled over DNS).
Food for thought.
The main downside of Cleversafe seems to be that IDA is much less space-efficient than FEC. And the benefits of FEC for storage were already established by OceanStore and other projects.