Imagine we had to reprint all the world’s books every 5 years to preserve them. Would we? Could we? That is the flip side of Quick Disk Erase (the previous post): data preservation. Storage pros focus on device availability and redundancy, but these are the short-term problems. The long-term problem is different: even if we still have the data stored, will we be able to read it?

As a long-time computer user I’ve run into that problem many times. My first computer project used IBM punch cards – useless after school ended. My first computer was the original Apple // with the ROM-based Integer BASIC and the 2k ROM assembler/disassembler that I used to program Life. All the data storage was on audio cassette tapes since I couldn’t afford $800 for a single 144k floppy. All that data was lost when I sold the machine to help pay for grad school and to buy the TI-59 programmable calculator that got me through statistics, calculus and financial modeling – all pre-VisiCalc. All of that work was lost too, despite being saved on the TI’s little mag cards, when the TI keyboard gave out. Finally I got a Mac Plus, and while my data has transferred easily since then, even to Windows machines, I’ve had losses. Like when I used an application to encrypt a 90k text document I’d written, and the application got corrupted in a machine transition. No application, no document. Ouch!

Several critical threads here about data rot:

  • Hardware dependencies will bite: obsolete media (cassette tapes, floppies, Zip disks) as well as obsolete interfaces. Good luck getting that printer-port disk driver working on Vista
  • Application dependencies, like app-specific encryption and proprietary file formats, will bite
  • Document dependencies, such as formatting, may not lead to total data loss, but can keep a document from being 100% faithful to the original
  • And of course, there are catastrophic losses: disk drive crash, fire, flood, tornado, earthquake or whatever

In the Internet Age we have no choice but to work to keep data accessible. Applications, formats and devices are changing too quickly not to. Innovation will slow down some day, but that day is still far away.

On-line publishing, a growing trend, reflects the issue nicely. In the old world, publishers published and libraries bought and stored. In the on-line world all kinds of people publish, and who stores? And how? The LOCKSS organization is a group of libraries and publishers whose name is an acronym for “Lots of Copies Keep Stuff Safe”. LOCKSS is

open source, peer-to-peer software that functions as a persistent access preservation system. Information is delivered via the web, and stored using a sophisticated but easy to use caching system.
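To make the “lots of copies” idea concrete, here is a minimal Python sketch of how several copies of a document could be checked against each other and a damaged one repaired. It is only an illustration of the general principle, not the actual LOCKSS software or protocol; the function names (`find_damaged_copies`, `repair`) and the simple majority vote are my own assumptions.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 fingerprint of one copy's content."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Compare fingerprints across copies (hypothetical helper).

    Returns the most common fingerprint and the names of the copies
    that disagree with it. With a tie, the first fingerprint seen wins,
    so this only makes sense when a clear majority of copies agree.
    """
    digests = {name: digest(data) for name, data in copies.items()}
    majority_digest, _count = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(name for name, d in digests.items() if d != majority_digest)
    return majority_digest, damaged


def repair(copies: dict[str, bytes]) -> dict[str, bytes]:
    """Overwrite every disagreeing copy with content from the majority."""
    majority_digest, _damaged = find_damaged_copies(copies)
    good = next(data for data in copies.values() if digest(data) == majority_digest)
    return {name: good for name in copies}


if __name__ == "__main__":
    # Three made-up sites hold the same article; one copy has rotted.
    copies = {
        "library_a": b"Lots of Copies Keep Stuff Safe",
        "library_b": b"Lots of Copies Keep Stuff Safe",
        "library_c": b"Lots of Copies Keep Stuff Sa\x00e",
    }
    _, damaged = find_damaged_copies(copies)
    print("damaged copies:", damaged)
    print("all copies match after repair:", len(set(repair(copies).values())) == 1)
```

The point of the sketch is that no single copy has to be trusted: as long as most copies survive intact, the damaged ones can be spotted and replaced.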

I’ll be writing more this week about digital asset preservation, both personal and institutional.

Before I close here, a couple more sites for the serious digital preservationist. First is the British Digital Preservation Coalition, whose site has some good, free content. Second, Digitization 101 covers a wide range of digital preservation news.