Preserving Digital History: What We Can Do Today to Help Tomorrow’s Historians
Once you’ve taken care of these initial steps to extend the life expectancy of your website, you can begin to think about what to do with your well-documented and structured files over time. Absolutely the first thing you should think about is safe storage. Questions about storage are similar to the questions we raise about a library or archive: Where should you keep your website’s files? Who can access (and possibly alter) them? What happens if there’s a fire or a flood? No matter how much attention you pay to documenting and standardizing your code, poorly thought-out storage of your website can erase, quite literally, your entire effort. Having a robust storage plan provides further peace of mind that the digital materials you create about the past will have a future.
The fundamental rule of storage is that you should have copies, or “backups,” of the files and data that make up your website. This may seem too obvious to mention. But the history of the interaction between historians and their computers is littered with tragic stories of papers, archival research, notes, data, dissertations, and manuscripts lost in countless ways. Some readers of this book will recognize themselves in these catastrophes, and we hope they have been chastened into a backup regime. Those who have not experienced the loss of computer-based work are merely victims waiting to meet their inevitable fate. For some reason, though lots of people talk about backing up, fewer people than you might imagine are actually doing it. We may curse the complexity of computer technology and feel annoyed that computer scientists haven’t solved many of the problems that plague us, but with respect to backing up, we have seen the enemy, and it is us.
Backing up need not be convoluted or complex. First, store copies in more than one place, for example, at home and in your office, or in a safe-deposit box at your bank and your desk drawer. In addition to the Famous Trials files on his web server, Douglas Linder keeps a copy of his site on his personal computer and another copy on a tape backup stored at a different location.20 Second, refresh, or re-create, these copies on a regular schedule, such as every Friday morning or on the first of every month (depending on your level of paranoia). Third, check to make sure, every so often, that these copies are readable. After all, your web server is not all that could fail; backups can fail as well. Fourth, institute social processes that support your backup plan. Tell others or record on paper (or in the README file) the location of backups so they are available if you (or your site’s steward) are not.
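The first and third of these steps (keeping copies in more than one place, and checking that the copies are still readable) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how Linder or anyone else actually backs up a site: the function names (back_up, unreadable_or_changed) and the folder names in the usage note are hypothetical, and the script assumes your backup locations appear as ordinary folders, such as a mounted external drive.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def back_up(site: Path, destinations: list[Path]) -> None:
    """Copy the site into each destination, replacing any older copy there."""
    for dest in destinations:
        if dest.exists():
            shutil.rmtree(dest)
        shutil.copytree(site, dest)


def unreadable_or_changed(site: Path, backup: Path) -> list[str]:
    """List files whose backup copy is missing or differs from the original."""
    original = tree_digests(site)
    copy = tree_digests(backup) if backup.exists() else {}
    return [name for name, digest in original.items() if copy.get(name) != digest]
```

You might call back_up(Path("site"), [Path("/mnt/office_drive/site"), Path("/mnt/home_drive/site")]) every Friday morning, and run unreadable_or_changed against each copy every so often; an empty list means every file in the backup could be read and matches the original. Note that back_up overwrites the previous copy at each location, so it does not by itself preserve older versions.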
One of these backups should sit on a hard drive separate from the server’s hard drive. Hard drives are generally the fastest way to shuttle information, and so in case of a problem, it is nice to have a hard drive backup around. Second (or third or fourth) copies may be on some kind of removable media, that is, a format that can be easily transferred from place to place. The candidates include tapes, which are very high-capacity digital versions of cassettes that look not unlike what we used to slide into a car stereo or VCR; removable magnetic media, which store information on thin, revolving platters in a way that is essentially identical to hard drives but which can be separated from their readers and are thus more portable than hard drives; optical formats, which use lasers to scan and imprint single shiny platters, known to most of us first as CDs (and CD-ROMs) and now DVDs (and DVD recordable formats for PCs); and most recently solid-state formats, which are essentially memory chips like the ones in your computer, encased in plastic, and which function as tiny drives when inserted in specially configured slots.
According to the Cornell University Library’s somewhat chilling “Obsolete and Endangered Media Chamber of Horrors,” no fewer than thirty-two distinct media formats for backing up digital information have emerged since the advent of modern computing. This includes (cue the funeral dirge) once revered technologies such as the 5 1/4 inch floppy, Sony’s line of WORM disks, Syquest cartridges, and IBM’s half-inch tapes (superseded, of course, by quarter-inch tapes). Even formats that seem to live forever, such as the 3 1/2 inch “floppy” (strangely, with a hard plastic shell) introduced with the first Macintosh computer twenty years ago, have declined in popularity and will soon join less celebrated formats in the dustbin of history; Dell Computer, the world’s largest computer company, recently dropped the 3 1/2 inch floppy from its line of desktop and laptop computers.21
To be sure, all of today’s formats work as advertised, providing copies of your files that can be moved easily and stored in secure, small places like a safe-deposit box given their slender form-factor. The costs for both the media and the drive to record and play that media vary widely. CD-ROMs and their optical successors generally have both cheap media (less than a dollar for each disc) and cheap drives (under $100). Not surprisingly, they dominate the low end of the storage market. The drives for magnetic media can also be inexpensive (often under $200) though the media themselves cost more (a Zip disk that holds the same amount of data as a CD-ROM costs roughly thirty times the price). Tape drives dominate the upper end of the market, with expensive drives (from several hundred to thousands of dollars) and media ($2–$50 per cartridge). With solid-state media like the popular “thumb” or “keychain” USB drives, the device is both the drive and the storage, so it is not expandable by plugging in additional discs or cartridges. When you need more storage, you have to buy another device. Many guides would advise you to run some calculations to figure out the cost per megabyte or gigabyte of storage (including the cost of the drive and the number of tapes, discs, or devices you will likely need), but most websites will fit easily on any of these media, with room to spare.
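For readers who do want to run the calculation those guides recommend, the arithmetic is simple: add the cost of the drive to the cost of however many discs or cartridges the data requires, then divide by the amount of data. The sketch below is only an illustration; the function name backup_cost is hypothetical, and any prices or capacities you feed it are your own assumptions rather than figures we vouch for.

```python
def backup_cost(drive_cost: float, media_cost: float,
                media_capacity_mb: int, data_mb: int) -> dict:
    """Estimate what it costs to back up data_mb megabytes on one format.

    drive_cost        price of the drive (0 if the device is its own drive)
    media_cost        price of one disc, cartridge, or device
    media_capacity_mb how many megabytes one unit of media holds
    """
    # Ceiling division: a partly filled disc still has to be bought.
    units = -(-data_mb // media_capacity_mb)
    total = drive_cost + units * media_cost
    return {
        "units": units,                        # discs or cartridges needed
        "total": total,                        # drive plus media
        "per_gb": total / (data_mb / 1000),    # cost per gigabyte stored
    }
```

As a purely hypothetical comparison, backup_cost(100.0, 1.0, 700, 2000) describes a 2,000-megabyte site on 700-megabyte optical discs with a $100 drive, while backup_cost(200.0, 30.0, 700, 2000) describes the same site on magnetic cartridges priced at thirty times as much per unit. For a small site the price of the drive dominates either total, which is one reason the exercise matters less than the guides suggest.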
Those interested in long-term preservation still face two considerations more significant than cost when choosing a format. The first is the obvious question of the inherent longevity of the medium itself, that is, how long the bits on the volume can remain uncorrupted. As we noted at the beginning of this chapter, this question is quite difficult to answer, even with the best research tools at the disposal of the National Institute of Standards and Technology. Two hundred years, NIST’s high-end estimate for optical media such as CDs and DVDs, sounds terrific, whereas its low-end estimate of twenty years seems barely adequate for “long-term” preservation. Given the additional problems of improper storage and the extreme variability of manufacturing quality that NIST outlines in its report on the archiving of digital information on these formats, some of these backups may become worthless in merely a few years.22
A second, perhaps equally important consideration can be called, lightheartedly, the “lemmings” question: What is everyone else using? As with web technologies, we recommend staying with formats that are popular because popularity alone is highly correlated with readability and usability over the long term. In the same way that Microsoft’s effective monopoly on computer operating systems means that there is a far larger base of software developers for the Windows platform than for niche platforms such as Apple’s Macintosh, strong market forces will increase the likelihood that information stored on the most popular formats will be readable in the distant future, though it may involve some cost in the outlying years. Currently the most popular storage format is the CD-R, and it will likely soon be one of the DVD recordable formats. Probably the best advice we can provide at this point is to choose one of these optical formats, and then follow NIST’s one-page preservation guide for your chosen format. The Virginia Center for Digital History (VCDH) backs up its scans of nineteenth-century newspapers onto CDs.23
For sites that change frequently, such as a historical community site with a message board or a site that collects history online, as well as sites whose loss would affect many people, you need to consider a more rigorous backup scheme, including incremental backups (copies of what has changed since the previous backup) and full backups of the entire set of files. The number and timing of your incremental and full backups will vary depending on how frequently your site changes and how important it is to be able to “roll back” your site to prior versions in case of problems. You also need to keep full backups from prior weeks and months. Because a file can become corrupted without your knowledge (in some deep cul-de-sac of your site), if you merely have last week’s full backup and the file was corrupted a month ago, that more recent backup will not help you retrieve the unblemished version of the file because the backed-up copy itself will be corrupted. VCDH, which runs a number of popular history sites including Valley of the Shadow, backs up its entire server daily, on an incremental basis, onto magnetic tape that is stored at an off-site location.24
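The difference between a full and an incremental backup can be made concrete with a short Python sketch. It is an illustration under our own assumptions, not a description of VCDH’s tape system: the function names, the dated folder labels, and the manifest.json file (a list of every file’s checksum at the time of the backup) are all hypothetical conventions invented for the example.

```python
import hashlib
import json
import shutil
from pathlib import Path


def snapshot(site: Path) -> dict[str, str]:
    """Record a SHA-256 digest for every file under the site directory."""
    return {
        p.relative_to(site).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in site.rglob("*")
        if p.is_file()
    }


def full_backup(site: Path, archive: Path, label: str) -> Path:
    """Copy the whole site into archive/<label>-full and save its manifest."""
    target = archive / f"{label}-full"
    shutil.copytree(site, target / "files")
    (target / "manifest.json").write_text(json.dumps(snapshot(site), indent=2))
    return target


def incremental_backup(site: Path, archive: Path, label: str, previous: Path) -> Path:
    """Copy only the files that are new or changed since the previous backup."""
    old = json.loads((previous / "manifest.json").read_text())
    new = snapshot(site)
    target = archive / f"{label}-incr"
    target.mkdir(parents=True, exist_ok=True)
    for name, digest in new.items():
        if old.get(name) != digest:
            dest = target / "files" / name
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(site / name, dest)
    # The manifest lists every current file, so the next incremental
    # backup can compare against this one.
    (target / "manifest.json").write_text(json.dumps(new, indent=2))
    return target
```

Because each backup lands in its own dated folder (for example, a label of "2004-07-01"), older full backups are kept rather than overwritten, which is what lets you reach back past a file that was silently corrupted weeks ago. Restoring a given day means starting from the most recent full backup and applying each later incremental in order.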
Mirroring, or the constant paralleling of data between two or more computers, is the most robust form of storage, but is probably only necessary for those sites that change so frequently that a failure of the main web server would result in the loss of significant amounts of information, or for sites that serve information that must be available constantly, even in the face of a computer crash. Under normal circumstances, the mirror provides a perfect, ongoing backup. In case of disaster, it can transfer data back to the original computer that has failed, or even be hooked up to the Internet, allowing your site to live on without visitors even noticing the swap. (Meanwhile, you will have to repair the main server and set it up to mirror your mirror.) Mirrors are the computer technology that allowed many Wall Street businesses to continue functioning with little interruption—or loss of critical financial data—in the aftermath of September 11.25
Commercial web hosts offer all of these backup options, though with increasing costs as the frequency of the backups increases. Most commercial hosts provide regular backups as part of a basic package, though “regular” can vary widely from once a month to once a day—a big difference for sites that change a lot. Be sure to check the fine print for your host’s backup schedule. Universities tend to have once-a-week or once-a-day backups for their shared servers. (In all of these cases, we should note that backups tend to occur during the wee hours of the night when few people are accessing the server.) ISPs generally have the least reliable backup and restoration schemes. Regardless of who hosts your website, you should still make your own backups and keep them in safe places.
20 Douglas O. Linder, email to Joan Fragaszy, 30 June 2004.
21 Cornell University Library, “Chamber of Horrors: Obsolete and Endangered Media,” Digital Preservation Tutorial, ↪link 8.21a; Tom Mainelli, “Dell Drops Floppy Drives in New PCs,” PCWorld.com, 5 February 2003, ↪link 8.21b.
23 National Institute of Standards and Technology, “Digital Data Preservation Program CD and DVD Archiving: Quick Reference Guide for Care and Handling,” Digital Data Preservation Program, ↪link 8.23; Kimberly A. Tryka, email to Joan Fragaszy, 20 July 2004.
25 Helen Meredith, “Learning All About Disaster Recovery,” Australian Financial Review, Supplement, 1 May 2002, 16. The principle of mirroring is at the center of the digital preservation strategy advanced by the LOCKSS program (Lots of Copies Keep Stuff Safe), which hopes to sustain digital documents and artifacts through a federation of library servers with redundant copies of accessioned files. For an overview of LOCKSS, see ↪link 8.25.