Virtualization is the answer. Now, what was the question?
The drumbeat for virtualization as the answer for the storage world’s ills continues unabated. Yet I wonder if we are virtualizing the right things and, if we are, doing it in the right way.
I got into the computer business in 1981, when virtual memory superminicomputers were still the coming thing. Folks had figured out that, due to the cost of memory *and* the cost of dealing with fixed memory capacities, memory was the right thing to virtualize.
Yet the early implementations were clunky and prone to non-productive behaviors like thrashing. It took a while to engineer a virtual memory system that provided a good illusion of physical memory. How many people even know they are using virtual memory today?
The Turing test for virtualization
You can’t tell whether it is virtual or real.
As scary as lions, tigers and bears?
Maybe blocks should be.
Grand virtualization architecture visions
The dotcom boom saw at least a couple of dozen storage virtualization startups funded, at least for a while. A surprising number still survive. Major storage companies launched storage virtualization programs.
- Virtualization in HBAs
- Virtualization in switches
- Virtualization in appliances
And more.
Blocks are the problem. What is the answer?
My thought: maybe OSD has the right idea. Maybe by virtualizing a really basic and largely irrelevant resource – blocks – we can advance virtualization without a costly rejiggering of everything else in storage.
I pinged a very smart and highly experienced engineer I know to ask him what he thought about OSD. A member of the T10 committee on OSD, he asked that I not give his name. He considers his comments a SWAG, and he is not professionally given to making unresearched statements.
His response had a couple of threads.
The first is what OSD could do.
One of the most interesting things about OSD was that you could provide a certain amount of information about an object in the SCSI command set defined for it. This could have been the basis for some security, access management, and information life cycle management solutions.
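To make that concrete, here is a rough sketch, in Python, of the difference between a device that only sees block addresses and one that sees objects carrying attributes it can act on. None of this is the real T10 OSD command set; the class names, attributes and policy check are invented purely for illustration.

```python
# Illustrative only: not the T10 OSD command set, just the shape of the idea.

class BlockDevice:
    """Knows nothing about the data: just LBA -> bytes."""
    def __init__(self):
        self.blocks = {}

    def write(self, lba, data):
        self.blocks[lba] = data          # no idea who owns this or why

    def read(self, lba):
        return self.blocks.get(lba)


class ObjectStore:
    """Each object carries data *and* attributes the device can act on."""
    def __init__(self):
        self.objects = {}                # object_id -> (data, attributes)

    def create(self, object_id, data, **attributes):
        # Hypothetical attributes: owner, retention class, and so on.
        self.objects[object_id] = (data, dict(attributes))

    def read(self, object_id, requester):
        data, attrs = self.objects[object_id]
        # Because the device knows the attributes, it can enforce policy
        # itself instead of trusting every host on the SAN.
        if attrs.get("owner") not in (None, requester):
            raise PermissionError(f"{requester} may not read {object_id}")
        return data


store = ObjectStore()
store.create(42, b"payroll records", owner="hr", retention="7y")
print(store.read(42, requester="hr"))    # allowed; anyone else is refused
```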
My takeaway is that OSD could offer new infrastructure for managing data, were we to adopt it.
His second point is where OSD fits.
I have always believed that people cared about files, and that blocks and objects were just interesting ways to construct files. Database and transaction processing applications may be a significant exception to that because they attempt to optimize their behavior at a much lower level, though that may be a temporary expedient due to present performance limitations.
Just as people once programmed in ones and zeros before moving to assembler, perhaps blocks are the ones and zeros of the age of massive storage. We have to stop thinking about them to achieve useful virtualization, and let the machines handle blocks so we don’t have to.
Comments welcome – we got some good ones on the first OSD post – thank you all. Moderation is a virtue. Have a good weekend.
The real issue is how data is mapped to permanent storage. Blocks provide a way to map data onto the storage address spaces of disk media. Available storage space is mapped as unused addresses within finite boundaries by the filing system (including databases). The way free storage space is allocated for storing data follows some old assumptions of how disk drives work, including the inherent mechanical latencies of disk drives.
The question is: is there a way to change the mapping method of storage addresses from blocks to something else, such as objects, that assumes storage is part of a virtualized subsystem instead of a bare disk drive? Furthermore, what would the reporting mechanism be that would allow a filing system to know where the free space is, along with any intelligence about how to use it optimally and how to achieve balanced, predictable performance as the free space is consumed?
Finally, can such a mapping system be reliable and recoverable so that micro-errors on the storage media can be transparent to the rest of the system?
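A toy allocator shows the kind of bookkeeping every filing system does on top of a block address space today; the names and the allocation policy below are invented for illustration, not any real filesystem’s format.

```python
# A toy sketch of block mapping: the filesystem treats the device as a
# finite array of block addresses and keeps its own free-space map.

class BlockAllocator:
    """Filesystem-side view: the device is just N addressable blocks."""
    def __init__(self, total_blocks):
        self.free = set(range(total_blocks))   # the "unused addresses"
        self.extents = {}                      # file -> list of block numbers

    def allocate(self, name, nblocks):
        if nblocks > len(self.free):
            raise IOError("no space left on device")
        # Old assumption baked in: pick low, close-together addresses
        # because seek distance on a mechanical disk used to matter.
        chosen = sorted(self.free)[:nblocks]
        self.free -= set(chosen)
        self.extents[name] = chosen
        return chosen

    def release(self, name):
        self.free |= set(self.extents.pop(name))


fs = BlockAllocator(total_blocks=1000)
print(fs.allocate("mail.db", 8))   # the filesystem decides *where*; the device obeys
```

Under an object-style interface, this bookkeeping, and the old seek-distance assumptions baked into it, would move down into the virtualized subsystem, which is exactly why the reporting mechanism asked about above becomes the interesting question.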
My hunch is that we won’t see a new mapping method until we see the underlying devices change – such as solid state storage. As the bottom of the food chain in PC systems, disk drive vendors find it nearly impossible to innovate due to the economic challenges of recouping their R&D costs when the rest of the computing industry only allows them the thinnest of margins. Large scale OSD development is a huge business risk for a disk drive vendor and for that reason, I don’t see OSD as happening inside disk drives.
When you take mechanical latencies completely out of the equation, then it will be obvious that different access and allocation modes will be needed to have competitive price/performance ratios. Then the processor, system and operating system vendors will be forced to look at a whole new way of moving data and working with storage.
So when will that be – ever? It seems inevitable that at some point solid state storage will finally catch up with magnetic. Of all the technologies being researched today, my intuition favors phase change storage due to its performance and capacity potential. But that’s all it is right now – potential.
People have been predicting it for decades, why should things suddenly change now? They might not. The stakes are high and breakthrough technologies will only emerge if R&D companies are willing to invest sufficiently in solving snaky engineering problems. I think companies like Intel, Toshiba and Samsung really want to find new fundamental sources of revenue and margin in the industry. I just don’t know if any of them are willing to take on the enormous economic risks.
Flash proved that solid state could compete with magnetic on a small scale. It did not change the storage mapping method, but that’s because flash’s performance is not that world-changing. If a new solid state storage technology emerges for enterprise storage applications, it will have to have sizable economic and performance advantages. The performance advantages would likely result from improvements to the access and mapping methods. Blocks would be replaced by something else, maybe OSD, as part of a major infrastructure upgrade.
Oh yeah, in our lifetime?
Somebody will no doubt correct me if I’m wrong, but I believe the IBM AS/400 (now iSeries) architecture is designed around a single address space that uses 64-bit addressing. All files live in that address space, including programs. Running a program means addressing the instructions in that space, and the operating system’s virtual memory moves the program into RAM. When you access a file, you actually just access another portion of that space and the operating system virtualizes that file as necessary.
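Roughly, the single-level-store idea looks something like this toy model; it is purely illustrative and nothing here reflects the real AS/400 implementation.

```python
# A toy single-level store: programs and files are just ranges in one big
# virtual address space, and touching a range pages it into RAM on demand.

PAGE = 4096

class SingleLevelStore:
    """One flat address space; RAM is only a cache of it."""
    def __init__(self):
        self.backing = {}     # page number -> bytes (the single address space)
        self.ram = {}         # pages currently resident

    def place(self, address, data):
        for i in range(0, len(data), PAGE):
            self.backing[(address + i) // PAGE] = data[i:i + PAGE]

    def touch(self, address):
        page = address // PAGE
        if page not in self.ram:              # "page fault": bring it in
            self.ram[page] = self.backing.get(page, b"\0" * PAGE)
        return self.ram[page]


store = SingleLevelStore()
store.place(0x10000000, b"a program")   # running it = addressing this range
store.place(0x20000000, b"a file")      # a file is just another range
print(store.touch(0x20000000)[:6])      # faulted in like any other memory
```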
IIRC, OSD or something OSD-like is already in use in the Lustre clustered filesystem. I think what they do is first build a block-to-OSD abstraction layer and then build their clusterfs on top of the OSD abstraction.
Gosh! This is almost heretical but RIGHT ON! Great insight…
FWIW, when and if Storage is ready, open and willing for Solutions, Marc Farley will be a part of them.
AFAIK, the vendor feeding frenzy is still going full speed.
I once proposed that Storage cache could determine the application’s type (Performance or bandwidth) from the request. This would eliminate having to predetermine the hardware configuration for Performance (OLTP) or bandwidth. For example, most Storage is used for Performance (OLTP), but most of it is also backed up. Backups are a bandwidth application.
With the cache holding the application’s “needs” information, it could then determine whether the application request was OLTP, OLCP, OLAP or bandwidth and route the request to the appropriate backend.
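Roughly, the idea could look something like this toy classifier; the thresholds, names and heuristics are all made up for illustration, and a real cache would have far better information to work with.

```python
# Toy request classifier: small random I/O looks like OLTP (performance),
# large or sequential I/O looks like backup/analytics (bandwidth).

from collections import namedtuple

# lba is the starting block address; length is in 512-byte blocks
Request = namedtuple("Request", "lba length")

class ClassifyingCache:
    def __init__(self, seq_gap=8, big_io=256):
        self.next_expected = None
        self.seq_gap = seq_gap      # how close counts as "sequential"
        self.big_io = big_io        # 256 blocks = 128 KB

    def classify(self, req):
        sequential = (self.next_expected is not None and
                      abs(req.lba - self.next_expected) <= self.seq_gap)
        self.next_expected = req.lba + req.length
        if req.length >= self.big_io or sequential:
            return "bandwidth"
        return "performance"

    def route(self, req):
        return f"{req.length}-block request -> {self.classify(req)} back end"


cache = ClassifyingCache()
print(cache.route(Request(lba=1000, length=8)))      # performance (OLTP-ish)
print(cache.route(Request(lba=9000, length=2048)))   # bandwidth (backup-ish)
```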
The NetApp guys almost croak every time I bring this up. Every other vendor looks very wise and says nothing.
OLCP seems to have dropped out of the acronym soup. I put it in just to see if anyone remembers it.
The “Best of All Possible Worlds” would be for Storage to be able to reconfigure on the fly between Performance and bandwidth needs. Failing that, the Storage would need to be pre-configured for both Performance and bandwidth. This speaks to two copies of the Information: one for Performance and one for Backups, which must be kept in sync. This makes for a very busy Storage box.
“Roll Your Own” hybrid NAS/SAN solves this problem nicely. The NAS Cluster head serves the Information from the SAN for all Performance needs. The Backups run against the FC or IP SAN for bandwidth needs (depending on whether you have block or file needs). The Storage controllers are still a PITA but there are workarounds. Multiple channels is one. Two types of controllers are really needed; or a switch-hitting multi-mode controller? Multi-ported? Beyond the scope of the Hitchhiker’s Guide to the Storage Galaxy? I thought the HDS TagmaStore did this with ease? Virtualized controllers?
How about Gene Amdahl’s Storage design that used PC motherboards and Objects? It seems to have died a quick death. Was it deserved?
Gene’s design can be done today with component parts for the hardware. You will have to write your own Manageware due to the sad state of Storage Manageware.
Stone Age Storage is breaking my heart.
Look at the sad state of SRM (Storage Resource Manager). Instead of HighGround buying Sun and morphing a “badly in need of change” Product line to enhance and support the SRM, Sun bought HighGround and morphed the SRM into an unrecognizable mess.
With decent, COTS Manageware a process like ILM is a no-brainer…
But where’s the profit in that?
Great comments! Thanks.
IIRC, Phillip is right about the AS400. Flat address space, everything there. No difference between disk and RAM. In the late 1980s the AS/400 business was as big as all of DEC’s VAX business. A business computing appliance.
Marc, when ZFS gets wrung out I think the micro-errors – by which I assume you mean ghost writes, wrong-block reads and the like – will be pretty well taken care of. While I posited turning disk drives into OSDs it would probably make more sense to do it on storage servers like the x4500. I’m working up to something here so stay tuned.
PJ, I hadn’t heard that about Lustre. I’ll check into it some more.
Robert, I share your high opinion of Marc. Likewise, stay tuned for some stuff I am working on that relates to the OSD posts.
Robin
I wrote a response to your previous object post, but then read this one and decided it made more sense to post it here. I don’t have time to visit StorageMojo all that frequently, so tend to join discussions late. Ever considered using a modified forum format where you still get to pick the topics but discussions move to the top of the list when they’re added to?
Anyway: The reason OSDs haven’t gotten anywhere in the decade-plus that they’ve been touted as ‘the coming thing’ is that individual disks just aren’t the right level at which to handle objects. Objects often (even usually) must span multiple disks – to achieve robustness via redundancy, to achieve performance via striping, even just to achieve size (because there’s no guarantee that the disk will be large enough – and worse yet, will *remain* large enough to avoid the need to move a humongous object somewhere else if it should grow, the alternative being to fragment it and manage multiple address-range-based allocations, just as you do with block extents).
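A toy illustration of why the placement decision ends up above the individual disk; the chunk size and layout below are invented, but any striped or replicated object needs a map like this somewhere above the drives.

```python
# Toy striping: one object dealt out across several "disks" in fixed-size
# chunks. The placement map cannot live on any single disk.

def stripe(obj_data, disks, chunk_size=64 * 1024):
    """Split one object into chunks and deal them across several disks."""
    placement = []                          # (disk index, offset in object)
    for i in range(0, len(obj_data), chunk_size):
        disk = (i // chunk_size) % len(disks)
        disks[disk].append(obj_data[i:i + chunk_size])
        placement.append((disk, i))
    return placement                        # this map has to live above the disks


disks = [[], [], [], []]                    # four "OSDs" (just lists here)
layout = stripe(b"x" * (300 * 1024), disks)
print(len(layout), "chunks spread across", len(disks), "disks")
# Any one disk holds fragments, not the object -- which is why Lustre's
# "OSDs" are whole server nodes rather than individual drives.
```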
That’s why the ‘OSDs’ used in Lustre (to pick one of the few examples where they’re claimed to be used at all) aren’t disks but entire storage server nodes (even an entire server node may offer insufficient availability or capacity for some objects, but at least it’s convenient, since the ‘object’ layer is simply a Linux file system). That’s why OSD’s proponents have spent so much time struggling with doomed efforts to do things like move concurrency control down to the disk level (something I tried to explain to Garth Gibson nearly a decade ago, and which he may finally have given up on at some point). That’s why AS/400 uses a conventional storage system underneath its single-address-space veneer (i.e., while programming is object-oriented, storage is not).
Disk manufacturers would love to find a way to add value to their products to gain a leg up on their competition. A bit over a decade ago Seagate was aggressively pursuing a way to build concurrency control into their drives, and the Global File System project at the University of Minnesota bought into this enough to base GFS on it (later on, they had to separate out the lock module to allow other approaches after disk-resident locking proved to be the bust they should have realized it would be, but they were apparently as mesmerized by the potential of a new and lucrative product as Seagate was). Gibson began somewhat more modestly in the early ’90s exploring ‘Network Attached Secure Disks’ (NASD), which actually had some modest potential utility (placing enough intelligence in the disk to let it handle secure links over an unsecured SAN), but apparently, like the UMinn crowd, got an itch for glory and moved on to far more intelligent (but far less useful, at least if implemented at the individual disk level) ‘OSDs’ (another Seagate-supported effort).
Disk-level OSDs have never looked good from anything below about the 30,000-foot level. The above only gets down to 2,000 feet or so – and the closer you get to actual design and implementation, the less sense disk-level objects make. One significant impediment is about to disappear: having to maintain metadata on the disk describing what space is in use and where the objects reside (we’re about to have flash on disks where that could be kept persistently and without adding another disk-access overhead to every write operation, the alternative until now having been a log-structured approach that would require a full disk scan on restart). But given what you still have to do *above* the disk level even if disks implement some form of opaquely-addressable ‘objects’ with no added overhead at all, the value-add of disk-level objects is just about nil.
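To illustrate the metadata point, here is a toy comparison of the two restart paths, replaying a log versus reloading a map persisted in flash; the record format is invented and nothing here is anyone’s actual on-disk layout.

```python
# Toy comparison: recovering the "what space is used, where do objects live"
# map after a restart, with and without persistent flash for the map.

import json

def recover_by_log_scan(log):
    """Log-structured design: rebuild the space map by replaying every record."""
    space_map = {}
    for record in log:                      # restart cost grows with history
        if record["op"] == "write":
            space_map[record["obj"]] = record["extent"]
        else:                               # "delete"
            space_map.pop(record["obj"], None)
    return space_map

def recover_from_flash(blob):
    """With a little on-drive flash: just reload the saved map at power-on."""
    return json.loads(blob)


log = [{"op": "write", "obj": "obj-1", "extent": [0, 128]},
       {"op": "write", "obj": "obj-2", "extent": [128, 64]},
       {"op": "delete", "obj": "obj-1"}]

snapshot = json.dumps(recover_by_log_scan(log))   # what the flash would hold
print(recover_from_flash(snapshot))               # {'obj-2': [128, 64]}
```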