Preserving Digital History: What We Can Do Today to Help Tomorrow’s Historians

The Long-Term Fate of Your Site

Although you may enter a web project with great enthusiasm, no one can guarantee that their site will stay up forever. Books have a natural preservation path: multiple copies get stored in libraries and personal collections. Websites, on the other hand, simply disappear without proper and ongoing shepherding. In 2002, the History Department of the University of California, Riverside, took down its H-GIG website, a broad historical community site with more than 1,700 links to other history websites, after deciding that the department could no longer expend the energy to maintain it. They left a rueful and telling note behind: “It has been impossible for the History Department to keep H-GIG up-to-date. Most of the directories and pages have been deleted.”26 Unfortunately such bad news travels slowly on the web; two years after H-GIG’s demise, more than fifty other sites continue to link to it.

Clearly you will need to consider how to maintain and update your site over time, or hand off those tasks to someone else. Virtually all sites, even relatively static ones, require such stewardship. There have already been several versions of HTML, enough to “break,” or render partially unreadable, many of the pages created in the early 1990s. Who is going to “translate” your pages to the web languages of the future? Many scripting languages (like PHP, discussed in the appendix) also change in ways that alter how web pages must be written to function properly, so scripted pages need attention over time. Even simpler elements of a web page, such as links to other pages on the web, must be checked periodically, or a site may maintain internal integrity but appear out of touch with the wider world. If your site links to other sites, you need an ongoing process for checking those external links so that broken ones are caught. Several low-cost programs will check links for you and highlight broken ones, and the World Wide Web Consortium provides a free link-checking service, but someone still has to correct or delete the broken links.27 Unlike books on library shelves, websites therefore require regular attention, and you need to plan accordingly.
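To make the idea of periodic link checking concrete, here is a minimal sketch in Python of what such a check might involve. It is an illustration under stated assumptions, not a description of how any of the programs mentioned above work: the function names (external_links, fetch_status, broken_links) are hypothetical, and the sketch treats any link that cannot be reached or that returns a status of 400 or higher as broken.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def external_links(html, base_url):
    """Return absolute http(s) links that point to a host other than base_url's."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    found = []
    for href in collector.links:
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc != base_host:
            found.append(absolute)
    return found


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    request = Request(url, headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # The server answered, but with an error code such as 404.
        return error.code
    except (URLError, OSError, ValueError):
        # No usable answer at all: bad host name, timeout, malformed URL.
        return None


def broken_links(html, base_url, status_of=fetch_status):
    """List (url, status) pairs for external links that look broken.

    Assumption of this sketch: "broken" means unreachable or status >= 400.
    """
    report = []
    for url in external_links(html, base_url):
        status = status_of(url)
        if status is None or status >= 400:
            report.append((url, status))
    return report
```

Calling broken_links(page_html, "http://your-site.example/index.html") on the text of one of your pages would return the links that need a human decision: correct them or delete them. The status_of parameter exists only so the logic can be tried without touching the network.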

Perhaps the best insurance for your website’s existence is finding a permanent home for it in an institution that is in the business of long-term preservation. Libraries, archives, and historical and professional societies are in that business, and have the social systems—and funding—to maintain historical materials. Although some libraries do not have well-developed digital preservation plans or systems at the present time, and although plenty of web projects from such organizations and archives have disappeared over time, the fundamental nature of these institutions—focused on preserving resources and working for the public trust—should lead to such systems in the future. At the very least, these institutions have stable structures—they don’t move, change email addresses, and lose things (like laptops) as much as you do. If you have built a site worth preserving, talk to someone at one of these institutions that might be interested in your topic. Douglas Linder has an arrangement with the University of Missouri-Kansas City Law School that they will maintain his site indefinitely should he leave the institution and decide not to take his site with him.28 Some sites created through ad hoc institutional arrangements have found new homes with other organizations. For example, after DoHistory lost (through the death of the principal investigator, Richard P. Rogers) its active connection to the Harvard Film Study Center, they transferred custody of the site to CHNM because we have a long-term commitment to work in digital history.

In addition to making such arrangements if possible, you should check with any institutions you are affiliated with or that might have an interest in your site to see if they have set up special repository software on their web server. Probably the best known and most widely implemented of these programs is MIT Libraries’ and Hewlett-Packard’s DSpace, which is currently in an early release (1.2 at this writing) that is being tested by dozens of universities and libraries. Although it involves complex technology behind the scenes, for end users like historians, DSpace looks more or less like a simple web interface through which you can search for documents stored in the system and into which you can upload your own documents or data. With its focus on single objects, such as articles, datasets, or images, DSpace currently seems less prepared to accept whole websites, however, because websites are interconnected sets of heterogeneous objects—some text, data, and images, pulled together with HTML and linked both internally and externally.29

Although it uses somewhat different technologies, Cornell University’s and the University of Virginia’s Fedora project has a similar repository mission to DSpace. Because it has a greater focus than DSpace on linking objects together and relies heavily on XML, the more sophisticated cousin of HTML, Fedora shows more promise as a system to store and preserve whole websites (in addition to articles, datasets, and images). But it, too, is in an early stage with some usability issues still to be resolved. Historians looking for a long-term home for their digital materials should keep an eye out for these two programs and other digital repositories as they become available. The best of such repositories are OAIS compliant, meaning that they follow the Open Archival Information System reference model, a highly regarded framework for digital archives that outlines standardized, rigorous methods for acquiring, maintaining, and disseminating digital materials.30

Although these advanced software solutions sound promising, we would be remiss without discussing—perhaps in hushed tones if technophiles are around—the idea of saving your digital materials in formats other than the ones they are currently in. For example, why not just print everything out on acid-free paper—the poor historian’s version of the New York Times millennium capsule? For some websites, this method seems fine. For instance, text-heavy, noninteractive, relatively small sites are easy to print out and save in case of disaster—take your web essay and turn it into a traditional paper one. It would take a great deal of work, however, to retype even a modest site back into a computer format.

For websites that are large or have multimedia elements, the print-out solution seems less worthy. One of the major problems with printing out is that such analog backups lose much of the uniqueness of the digital realm that we highlighted in the introduction—for example, the ability of machines to scan digital materials rapidly for matches to searches, the ability to jump around digital text, the specific structure that can only be reproduced online. For instance, key features of DoHistory—the ability to search by keywords through Martha Ballard’s Diary or the “Magic Lens” that helps you read eighteenth-century handwriting—have no print analog. These same issues hold true for preservation techniques that do the equivalent of printing out, but in a digital format—for instance, converting a website to static PDF files or turning interactive web pages into TIFF graphics files. Though such conversions maintain the fonts and overall look of a site (unlike many print-outs), they still lose much in the translation—too much, in our view.

You can, of course, do these things while simultaneously trying the other preservation methods we have discussed in this chapter. Indeed a basic principle of preservation is to keep the originals intact while exploring archival techniques with parallel copies. Because digital copies are so cheap, it does not hurt to have copies of digital documents and images in a variety of formats; if you are lucky, one or more will be readable in the distant future. For example, while keeping many files in Microsoft Word format, you could also use Word’s “Save As…” menu function to convert these files (while keeping the originals) to basic text (ASCII or Unicode), as well as more complex, non-Microsoft formats, such as the Rich Text Format (RTF). If you have spent a lot of money digitizing photographs for a website in a high-quality TIFF format, why not buy another hard drive and store them as JPEGs and GIFs as well? Only the largest collections will render this problematic from a cost standpoint, and it may increase the odds that one of the three graphics formats will be viewable decades from now. With digital video, of course, storage remains a serious and costly problem, making such parallelism less attractive and some kind of software solution seem more worthy of pursuit.31
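The same principle of parallel copies can be applied to web pages themselves. The following Python sketch is only an illustration, not a recommended tool: it assumes the page is stored as a UTF-8 HTML file, and the names html_to_text and save_parallel_copies, as well as the ".utf8.txt" and ".ascii.txt" file endings, are made up for this example. It reads the original page without altering it and writes two basic-text copies beside it, one in Unicode and one in ASCII.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Gather the visible text of a page, skipping script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside <script>/<style>, with runs of
        # whitespace collapsed to single spaces.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(" ".join(data.split()))


def html_to_text(html):
    """Return the page's text, one line per text fragment found."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)


def save_parallel_copies(html_path):
    """Write Unicode and ASCII text copies beside html_path.

    The original file is only read, never modified. Characters that ASCII
    cannot represent are written as "?" in the ASCII copy.
    """
    source = Path(html_path)
    text = html_to_text(source.read_text(encoding="utf-8"))
    unicode_copy = source.with_name(source.stem + ".utf8.txt")
    ascii_copy = source.with_name(source.stem + ".ascii.txt")
    unicode_copy.write_text(text, encoding="utf-8")
    ascii_copy.write_text(text, encoding="ascii", errors="replace")
    return unicode_copy, ascii_copy
```

Such copies lose the links, layout, and interactivity of the page, so they supplement the original rather than replace it; their value lies in being readable by almost any software.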

26 “Error 404: Page Not Found,” H-GIG, ↪link 8.26.

27 Shareware link checkers can be downloaded from ↪link 8.27a. Alert LinkRunner is a good program for PCs, and can be found at ↪link 8.27b. W3C’s link checker is at ↪link 8.27c.

28 Douglas O. Linder, email.

29 DSpace Federation, DSpace, ↪link 8.29.

30 On Fedora, see “Fedora: The Flexible Extensible Digital Object Repository Architecture,” The Fedora Project: An Open-Source Digital Repository Management System, ↪link 8.30a. Because of a trademark dispute with Red Hat Inc., the Fedora project may have to change its name, even though it has used it since 1998, five years before Red Hat adopted it for one of its software releases. See ↪link 8.30b and also Thornton Staples, Ross Wayland, and Sandra Payette, “The Fedora Project: An Open-source Digital Object Repository System,” D-Lib Magazine, April 2003, ↪link 8.30c. Outside of the United States, Greenstone software, which grew out of the New Zealand Digital Library Project at the University of Waikato, offers another turnkey solution for setting up a digital library. With support from UNESCO and with documentation in five languages and an interface that can be modified to more than thirty other languages, Greenstone has gained a modest following worldwide, including in the United States. It seems more popular as a display and search technology for finite digitization projects (e.g., New York Botanical Garden’s rare book collection at ↪link 8.30d) than a continuously updated archival system like DSpace. On OAIS compliance, see Don Sawyer, “ISO ‘Reference Model for an Open Archival Information System (OAIS)’” (paper presented to USDA Digital Publications Preservation Steering Committee, 19 February 1999), ↪link 8.30e. Originally postulated by the Consultative Committee for Space Data Systems, a coalition of international space agencies (including NASA) that was trying to figure out how to store digital data from space missions for the long term, OAIS provides a framework of both systems and people and an associated set of common definitions (such as what a “data object” is) that should be applicable (CCSDS claims) to any archive, from small, personal ones to international, nearly boundless ones. OAIS provides a model that should enable individual digital archives to store materials effectively and sustain themselves over the long run. Note that this is, by CCSDS’ own admission, a high-level conceptual framework, not a ground-level working model. See ↪link 8.30f.

31 See Mary Ide, Dave MacCarn, Thom Shepard, and Leah Weisse, “Understanding the Preservation Challenge of Digital Television,” and Howard D. Wactlar and Michael G. Christel, “Digital Video Archives: Managing Through Metadata,” in Building a National Strategy for Preservation: Issues in Digital Media Archiving (Washington, D.C.: Council on Library and Information Resources and the Library of Congress, 2002), 67–79 and 80–95, ↪link 8.31a and ↪link 8.31b. For an example of the challenges of digital video, see the EVIA Digital Archive, a joint effort of Indiana University and the University of Michigan to archive approximately 150 hours of ethnomusicological video, ↪link 8.31c.