Digital Archiving: The Impossible Dream?
Information storage is easy; anyone can throw things into a box. Retrieval is where the real power lies: the ability to sort and retrieve records efficiently, in a timely and useful manner. This is, of course, where computers really shine -- they make it possible to store and sort vast quantities of data easily and quickly. The drawback is the transient nature of both the technologies and the media.
All of us ace sysadmins have excellent short-term backup and storage schemes. (That's my story at least, and I'm sticking to it.) But what about planning for the future? Shakespeare's bar bills and court records have survived some 400 years. The Dead Sea Scrolls go back about 2000 years. Cave paintings date back over 30,000 years. Yet in this here fabulously advanced 21st century, neither analog nor digital creations seem to have anywhere near this longevity.
David Mandel, of Mandel and Associates, says, "Modern digital media are very bad in terms of long term storage. As libraries, historical societies, and other institutions become more and more digital, we are storing more and more of our culture on transient media that require constant maintenance by highly skilled people. If something happens in our society (war, economic depression, total social chaos, etc.) and we have to mothball everything for a generation or two, we risk losing a lot of information of cultural importance."
There are two aspects of modern data storage to consider: storage medium and data format. The technology changes quickly, and the options seem to grow even faster: punch cards, 9-track tapes, 1/2" reel-to-reel tapes, paper tape, floppy disks, all the various incarnations of hard drives, optical disks, and so on. Even if you have the right hardware to read obsolete media, and the media are still readable, file formats become obsolete. If you can't decode the data, it has no value, no matter how well it's been preserved.
Any magnetic medium is a bit scary for me; it feels too vulnerable. Ed Sawicki of Alcpress has this to say on the subject: "I studied this issue a while back for a customer. I did some informal testing as well. I had an extensive library of diskettes that was indexed with a database so I knew when the diskette was placed in the library. I noticed that errors started occurring in diskettes that were over 6 years old, and by 10 years the failure rate was nearly 50 percent. At 15 years, the majority of the diskettes were unreadable."
CD-ROMs and DVDs
CD-ROMs and DVDs theoretically should be more stable and last longer, but manufacturers are reluctant to make any specific longevity claims. Some people insist that "CDs will spontaneously de-laminate, you mark my words, and then you'll be sorry." Well, I don't know. I have six-year-old commercial data CDs that are fine, and older music CDs that still play. Home-burned discs are not as dependable, though, because they are burned with lower-power, less-precise lasers than the commercial variety.
Still, if Shakespeare's bar bills can survive some 400 years, why can't we do better in this day and age? There are two old-fashioned but dependable means of archiving that still do the job well.
Microfilm, properly stored and handled, should have a lifespan of 100-200 years, with some experts claiming up to 500 years. If microfilm readers vanish, the film is still readable with a magnifying glass, which is a technology dating from before Galileo's time, so I imagine it won't be going away anytime soon.
Paper is still king. It's inexpensive, needs no special technology to read, and lasts decades, if not centuries. And with computers, there's no need to retype all those pages; simply use a high-quality, high-volume OCR (optical character recognition) scanner to convert the pages back into digital files. I think it is safe to assume that technologies to convert printed text into digital files will not go away.
One digital archiving option is to move old archives onto new media. Copy those old tapes, diskettes, and hard drives onto modern media while they are still usable. This is how I manage my personal files; every year I spend a day sorting and recording onto new media. As a business strategy, however, the difficulties are obvious: staff turn over, management changes, companies merge, etc. It's time-consuming and expensive, and someone has to keep track and make it happen. Errors can also be introduced in the copying. Making checksums of everything can help catch errors, but that requires even more time and effort. Another potential issue is that the more the data are handled, the greater the risk of damaging something.
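To give a feel for what the checksum chore involves, here is a minimal sketch in Python. It assumes the old archive and the fresh copy are both mounted as ordinary directory trees; the function names (sha256_of, build_manifest, compare_copies) and the choice of SHA-256 are mine, picked for illustration, not a prescribed method.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(old_root, new_root):
    """Return (missing, corrupted) file lists for the new copy."""
    old = build_manifest(old_root)
    new = build_manifest(new_root)
    # Files present in the old archive but absent from the copy.
    missing = sorted(name for name in old if name not in new)
    # Files present in both whose contents no longer match.
    corrupted = sorted(
        name for name in old if name in new and old[name] != new[name]
    )
    return missing, corrupted
```

Calling compare_copies on the old and new directory trees returns two lists; if both are empty, every file made it across intact. Saving the manifest alongside the archive would also let next year's copy be checked against this year's.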
But short of inventing a miraculous new gadget that's guaranteed to work forever, a rotation strategy may be the most practical method. Ed Sawicki reports, "I now archive to optical media, and I've put two new CD-ROM drives in storage for the future. My archives tend to be cumulative -- I'm backing up the same old data along with the new data simultaneously, so the ages of the media are not as critical."
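A rotation strategy only works if someone knows how old each piece of media is. As a rough sketch, assuming you keep an index of media labels and the dates they were written (much like the diskette database described earlier), a few lines of Python can flag what is due for re-copying. The five-year default, the labels, and the function names are my own assumptions for illustration; pick a threshold that suits your media.

```python
from datetime import date


def years_between(earlier, later):
    """Whole years elapsed between two dates."""
    years = later.year - earlier.year
    if (later.month, later.day) < (earlier.month, earlier.day):
        years -= 1
    return years


def due_for_refresh(index, today, max_age_years=5):
    """Return labels of media at least max_age_years old, oldest first."""
    overdue = [
        (written, label)
        for label, written in index.items()
        if years_between(written, today) >= max_age_years
    ]
    return [label for written, label in sorted(overdue)]


# Hypothetical index: media label -> date it was written.
index = {
    "diskette-017": date(1996, 3, 14),
    "tape-002": date(2000, 11, 2),
    "cdr-041": date(2003, 6, 30),
}
print(due_for_refresh(index, today=date(2004, 1, 15)))  # ['diskette-017']
```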
The difficulty in this scheme is carrying old file formats forward into newer programs. For example, what's to be done with Quicken files from ten years ago? Or Word documents or Excel spreadsheets? Importing them into newer versions is yet more work, and the imports don't always result in identical copies of the files, as formatting is often lost and errors can be introduced. What about keeping copies of the original programs and the necessary hardware to run them? A good idea, perhaps, but finding parts for older machines can be difficult.
Another idea floating about is to record, on good ole paper, all the technical specs of the stored archives: file encoding, hardware specs, and whatever else is needed to re-create the means to access the data. While persuading the owners of closed, proprietary file formats to go along with such a scheme is probably more difficult than actually implementing it, there are a number of open formats that can be considered when planning for the long term.
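What might such a paper record look like? Here is one possible sketch, in Python, that builds a plain-text sheet to print and store with the media. The layout, the field names, and the idea of recording each file's first few bytes (which often identify the format) are my own assumptions, not an established standard; a real sheet would want far more detail about encodings and hardware.

```python
from pathlib import Path


def describe_file(path, magic_length=8):
    """Collect basic facts about one archived file."""
    path = Path(path)
    with open(path, "rb") as handle:
        leading = handle.read(magic_length)
    return {
        "name": path.name,
        "size_bytes": path.stat().st_size,
        # The first bytes of a file often reveal its format.
        "leading_bytes": " ".join(f"{byte:02x}" for byte in leading),
    }


def description_sheet(paths, medium, encoding_notes):
    """Build a plain-text sheet, suitable for printing, describing an archive."""
    lines = [
        "ARCHIVE DESCRIPTION SHEET",
        f"Medium: {medium}",
        f"Encoding notes: {encoding_notes}",
        "",
        "Files (name, size in bytes, leading bytes in hex):",
    ]
    for path in paths:
        info = describe_file(path)
        lines.append(
            f"  {info['name']}, {info['size_bytes']}, {info['leading_bytes']}"
        )
    return "\n".join(lines)
```

Printing the result of description_sheet for each disc or tape, and filing the page with the media, keeps the description on the one medium that needs no decoder.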
What about preserving movies, music, or software? For small programs it may be practical to print out the source code, but a one-million-line program would fill about 20,000 printed pages. The Linux 2.4 kernel, for example, runs to roughly three million lines of code, or about 60,000 pages. Large databases are not practical to keep as hard copy, either. The bottom line is that we are well past the point of being able to reduce everything to paper.
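The page counts above work out to 50 lines per printed page (1,000,000 lines divided by 20,000 pages). For anyone who wants to run the numbers on their own code, here is a small sketch; the 50-line figure is inferred from those numbers, and the ream estimate assumes single-sided printing on standard 500-sheet reams.

```python
import math

LINES_PER_PAGE = 50    # implied by 1,000,000 lines -> 20,000 pages
SHEETS_PER_REAM = 500  # standard ream; single-sided printing assumed


def pages_needed(lines_of_code, lines_per_page=LINES_PER_PAGE):
    """Printed pages required for a given number of source lines."""
    return math.ceil(lines_of_code / lines_per_page)


def reams_needed(lines_of_code):
    """Reams of paper required, one page per sheet."""
    return math.ceil(pages_needed(lines_of_code) / SHEETS_PER_REAM)


print(pages_needed(1_000_000))  # 20000
print(reams_needed(1_000_000))  # 40
```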
Huzzah for Clerks and Librarians
It is said that being a sysadmin is the most varied job a person can take on -- technician, strategist, diplomat, and now librarian. In other words, it's a great job for versatile and creative thinkers like ourselves.
Please email me with your ideas for long-term archiving.