Reprint “Jonathan Livingston Seagull” Every 5 Years?

by Robin Harris on Monday, 19 June, 2006

Imagine we had to reprint all the world’s books every 5 years to preserve them. Would we? Could we? That is the flip side of Quick Disk Erase (the previous post): data preservation. Storage pros focus on device availability and redundancy, but these are the short-term problems. The long-term problem is whether, even if we still have the data stored, we will be able to read it.

As a long-time computer user I’ve run into that problem many times. My first computer project used IBM punch cards – useless after school ended. My first computer was the original Apple // with the ROM-based Integer BASIC and the 2k ROM assembler/disassembler that I used to program Life. All the data storage was on audio cassette tapes since I couldn’t afford $800 for a single 144k floppy. All that data was lost when I sold the machine to help pay for grad school and to buy the TI-59 programmable calculator that got me through statistics, calculus and financial modeling – all pre-VisiCalc. All that was lost too, despite persistent storage on the TI’s little mag cards, when the TI keyboard gave out. Finally I got a Mac Plus, and while my data has transferred easily since then, even to Windows machines, I’ve had losses. Like when I encrypted a 90k text document I’d written, using an application that got corrupted in a machine transition. No application, no document. Ouch!

Several critical threads here about data rot:

  • Hardware dependencies, like obsolete media (cassette tapes, floppies, Zip disks), will bite, as will interface dependencies – good luck getting that printer port disk driver working on Vista
  • Application dependencies, like app-specific encryption and proprietary file formats, will bite
  • Document dependencies, such as formatting, may not lead to total data loss, but can keep a document from being 100% faithful to the original
  • And of course, there are catastrophic losses: disk drive crash, fire, flood, tornado, earthquake or whatever

In the Internet Age we have no choice but to work to keep data accessible. Applications, formats and devices are changing too quickly not to. Innovation will slow down some day, but that day is still far away.

On-line publishing, a growing trend, reflects the issue nicely. In the old world, publishers published and libraries bought and stored. In the on-line world all kinds of people publish, and who stores? And how? The LOCKSS organization is a group of libraries and publishers whose name is an acronym for “Lots of Copies Keep Stuff Safe”. LOCKSS is

open source, peer-to-peer software that functions as a persistent access preservation system. Information is delivered via the web, and stored using a sophisticated but easy to use caching system.
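
To make the “lots of copies” idea concrete, here is a minimal Python sketch. It is not the LOCKSS software or its actual protocol, just an illustration under simple assumptions: several sites each hold a copy of the same document, each copy is fingerprinted with SHA-256, and any copy that disagrees with the majority is flagged for repair. The function names and the site names (library_a and so on) are made up for the example.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 fingerprint of one stored copy."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(copies: dict[str, bytes]) -> list[str]:
    """Return the names of copies whose fingerprint differs from the majority.

    Assumes a clear majority of copies is still intact; with an even
    split this simple vote cannot tell which side is the good one.
    """
    digests = {name: digest(data) for name, data in copies.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, d in digests.items() if d != majority_digest)


# Hypothetical holdings: three sites, one with a silently corrupted copy.
copies = {
    "library_a": b"Jonathan Livingston Seagull",
    "library_b": b"Jonathan Livingston Seagull",
    "library_c": b"Jonathan Livingston Seagu11",
}
damaged = find_damaged_copies(copies)
print(damaged)
```

With a single copy there is nothing to compare against, which is the point of keeping lots of them: a damaged copy can be spotted and replaced from one of the others.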

I’ll be writing more this week about digital asset preservation, both personal and institutional.

Before I close here, a couple more sites for the serious digital preservationist. First is the British Digital Preservation Coalition, whose site has some good, free content. Second, Digitization 101 covers a wide range of digital preservation news.

Robert Pearson June 19, 2006 at 9:26 pm

Even though I love your Blog, I was beginning to feel all alone on the planet again. This post “Reprint “Jonathan Livingston Seagull” Every 5 Years?” strikes at the heart of the matter.

For the 85,000-foot view of the problem let’s start with this post I made on Jeff Tash’s ITscout Blog. http://itscout.blogspot.com/
Jeff is very good at Enterprise Architecture. I prefer Visual Information Architecture but both are, IMHO, simply stirring the “Technology” pot, not solving the problem.

“Wonderful presentations. Once again you display that exceptional grasp of very intricate, tightly woven entities, concepts and processes and produce a ‘picture worth a thousand words’.

Your topics bring to mind some of W. Edwards Deming’s basic statements:

In the 1970s, Dr. Deming’s philosophy was summarized by some of his Japanese proponents with the following ‘a’-versus-‘b’ comparison:
(a) When people and organizations focus primarily on quality, quality defined by the following ratio:
Quality = {results of work efforts}/{all costs}
then quality tends to increase and costs fall over time.
(b) However, when people and organizations focus primarily on COST, then costs tend to rise and quality declines over time.
(http://storagemojo.com/ and I are in exact agreement on this. Vendors have fostered and maintained the “cash cow” approach using this very technique)

A Lesser Category of Obstacles:
1. …
2. Relying on technology to solve problems.

One of my favorites is from Walter A. Shewhart

His more conventional work led him to formulate the statistical idea of tolerance intervals and to propose his data presentation rules, which are listed below:

1. Data has no meaning apart from its context.
2. Data contains both signal and noise. To be able to extract Information, one must separate the signal from the noise within the data.

[I changed “information” to “Information”]”

Two conclusions by me:
1) ROI (Return on Investment) should always be emphasized over TCO (Total Cost of Ownership) – Deming’s equation

2) The only real value is in the Information. Technology is a “profit-enabler” to allow the Information to be separated from the Data, and in particular, the Noise in the Data. Other than that, Technology has no value, it can never produce ROI by itself and is always 100% TCO.

In “1)” above I equate ROI with “Quality”. This is not exactly correct.
What has happened is the subtle and insidious shift in the Triangle of Unobtainium. The Triangle consists of the three points:
1) Quality (Good)
2) Cost (Cheap)
3) Speed (Fast)

Only two can be achieved at any one time. Never all three.
“Quality and Speed” were once the standard. Now it is “Speed and Cost”.
There are some “softer” issues here like—
“We have to understand it to sell it!”
“We have to see what it does for us!”
“We can’t manage it if we don’t understand it!”

I leave you with this thought, “If Information ever existed in your Information Domain, does it still exist? This means that once created, Information in your Domain has infinite Life. Whether Online, Nearline or Offline.”
What is your Expiry policy?
Do you want infinite Persistence?
Can you afford it?

This is my interpretation of what you are saying.
