Creating an Historical Archive

by Robin Harris on Monday, 19 March, 2007

Culinary history: old wedding cake in the freezer?
A longtime friend of mine is working with the Culinary Historians of New York on a project to gather and preserve the records of a Depression-era WPA project. According to the CHNY:

The mission of “America Eats”– part of the New Deal and abandoned at the outset of WWII– was to send writers and photographers nationwide to document community eating in America from church suppers and clambakes to barbecues and holiday meals. The diverse flavors chronicled in these documents have lain forgotten in scattered archives and are only now being brought to light.

As you’d imagine, this is a volunteer organization made up of foodies, not IT gurus. I’m no IT guru either, but not knowing that, my friend asked me for help.

Easier to find than preserve
She wrote:

. . . we are trying to organize a search for these scattered and lost WPA documents inc. photographs that are buried in attics, historical societies, and some collections in the Library of Congress. We hope to “digitize” them to preserve in a central location for present and future food scholars to access.

So I asked myself, “Self, how would you build a historical archive?”
In response, I wrote:

CHNY has two problems: getting the materials digitized and then preserving the digitized copies for posterity.

Scanning, the easier problem, IMHO
Scanners can digitize textual and photographic materials quite handily. For text, 300 dpi (dots per inch) is fine. Photographs should be scanned at a minimum of 600 dpi. Higher dpi is better; most scanners will do at least 1200 dpi and many will go up to 2400 dpi and beyond. Higher dpi results in larger files, which may be harder to store, edit or share, yet if you don’t have the resolution to start with you can’t create it later.
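To get a feel for how fast files grow with resolution, here is a rough back-of-the-envelope sketch. It assumes uncompressed 24-bit color and a letter-size page (8.5 x 11 inches); both are my assumptions for illustration, and the function name is made up. Real files saved as JPEG or compressed TIFF will be smaller.

```python
def scan_size_mb(width_in, height_in, dpi, bits_per_pixel=24):
    """Approximate uncompressed size of a scan in megabytes (10**6 bytes)."""
    # Pixel count grows with the square of the dpi setting.
    pixels = (width_in * dpi) * (height_in * dpi)
    return pixels * bits_per_pixel / 8 / 1_000_000


if __name__ == "__main__":
    # Letter-size page at the resolutions mentioned above.
    for dpi in (300, 600, 1200, 2400):
        size = scan_size_mb(8.5, 11, dpi)
        print(f"{dpi:>5} dpi: about {size:,.0f} MB uncompressed")
```

The point to notice is that doubling the dpi quadruples the size, so going from 300 to 2400 dpi multiplies storage by 64.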

Perfectly adequate text scanners start at $50, while very good photo scanners are available for $400. Photographs of particular interest can be commercially digitized in drum scanners for the very highest resolution and quality. Negatives and slides can be scanned by film scanners that range from $400 to $1200 depending on speed and quality.

Creating an archive of scanned documents
Preserving the digitized data is the more difficult problem. Over the decades file formats may change, data storage devices become obsolete – think 8 track tape – and media decays. There is only one strategy that I would trust and it goes by the acronym LOCKSS: Lots Of Copies Keeps Stuff Safe.

For CHNY I would save every file in at least three formats and distribute the copies on at least three media. For photos use JPEG, PDF and TIFF file formats. For text use ASCII text, PDF and PNG formats. For media, store complete collections on DVD, on server-attached hard drives and on tape, backed up using ZMANDA, a commercial variety of the open-source AMANDA, which writes backups that can be read without the application.

Ship the DVDs to people who will store the content on their web-servers and make new DVDs for people – DVDs you can burn yourself only have a life of 5-10 years. Also, print out complete copies of the data on archival quality equipment and media and donate them to a couple of archives at research libraries.

This may sound like overkill – it does to me, a little bit – and others may have different opinions as to the best file formats, but the basic LOCKSS strategy is your best bet. Once you’ve gone to the trouble of gathering the source material you never want to have to do that again. So preserve it with LOCKSS.
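Lots of copies only help if you can tell a good copy from a damaged one. One common way to do that is to record a checksum for every file in the master collection and compare each copy against that list from time to time. What follows is only a minimal sketch under my own assumptions: the function names are hypothetical, SHA-256 is my choice of checksum, and this is not how the LOCKSS software itself works.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit_copy(manifest, copy_root):
    """List (path, problem) pairs for files missing or changed in a copy."""
    copy_root = Path(copy_root)
    problems = []
    for rel_path, expected in manifest.items():
        candidate = copy_root / rel_path
        if not candidate.is_file():
            problems.append((rel_path, "missing"))
        elif sha256_of(candidate) != expected:
            problems.append((rel_path, "changed"))
    return problems
```

Build the manifest once from the master collection, keep it alongside every copy, and run the audit whenever a DVD or drive is checked. An empty list means that copy still matches the master.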

The StorageMojo question
That was all off the top of my head. I know some of you are smarter about this stuff than I am, so please, what would you do?

I suspect that many small and non-profit organizations have the same problem. If we put our heads together maybe we can put something together that will help a lot of people.

Comments welcome, especially in this case. Moderation turned on to keep spam out of the comments.

Update: I meant to put in a reference to the actual LOCKSS site and didn’t. I thank the commenter for reminding me of that. So I put in a reference above.

5 comments

joseph martins March 19, 2007 at 3:20 pm

Robin,

When I have an opportunity I’ll put together a list of resources for you. And I’d be happy to discuss some of the strategies with you.

While you’re thinking about the preservation of digital and non-digital assets, consider the following:

1. The original document. For example the actual US Constitution, or an MS Word file.
2. A likeness. For example a digital image of the Constitution, or a copy of a digital file in another format.
3. The content of the original document. For example the text of the Constitution decoupled from the presentation.

Clark Hodge March 20, 2007 at 9:37 am

Robin,

I like your idea of LOCKSS – Lots Of Copies Keeps Stuff Safe. And that it’s not just copies of the same thing on the same media, but multiple formats and multiple media. Better to protect the information from a single point of failure.

I’d like to add ‘long term commitment’ to your list. Right now we don’t have much in the technology world that is meant to last truly long periods of time. Not protocols, not media, not hardware, not standards and especially (unfortunately) not the interesting stuff – the information! What’s needed is a long term commitment to the maintenance and preservation of the information.

All those copies won’t mean anything if they go ‘stagnant’, no one reads them for 50 years and none of the media is readable. That long term commitment means regular checks (audits) of all of the media, regular migrations of both format and media, audit logs for data transformations … (for the digital stuff at least!).

Even the traditional paper archivists commit to environmental controls in their archives – with spot checks looking for mold, fading or other degradation.

I write a blog – “Fixed Content Fixations” at http://www.storageswitch.com/blog that targets issues around long term data (records, information) storage.

..clark

joseph martins March 20, 2007 at 12:30 pm

Interesting blog Clark – I just finished writing a few comments to it. My only suggestion would be to expand your perspective beyond storage if you’re going to target issues around long-term storage. I noticed your links read like a who’s who of the storage industry. 95% of the long-term archival research is going on outside storage.

Andrew March 21, 2007 at 6:20 pm

G’day,

Scanned documents should be stored in a lossless format such as TIFF. A product I have been watching for some time is the open-source Alfresco.

Alfresco is being run by the lads who originally set up the very costly Documentum 🙂

I am using it on a small scale at the moment but am evaluating it for a very large scale deployment…

Cheers

Robin Harris March 22, 2007 at 12:12 pm

Andrew,

I’ll have to check out Alfresco. Also noticed that the former CEO of Documentum left EMC a couple of weeks ago. Wonder what that means?

Robin
