Preserving Digital History: What We Can Do Today to Help Tomorrow’s Historians

What to Preserve?

Think for a moment about the preservation of another precious commodity: wine. Connoisseurs of wine might have cheap bottles standing upright in the heat of their kitchen—a poor way to store wine you want to keep around for a long time, but fine if you plan to drink it soon or don’t care if it gets knocked over by a dog’s wagging tail. But at the same time they almost certainly hold their bottles of Château Lafite Rothschild on their side at close to 55 degrees Fahrenheit and 70 percent humidity in a dark cellar (ideal conditions for long-term storage of first-growth Bordeaux). Fine wine, of course, merits far more attention and care than everyday wine. Expense of replacement, rarity, quality, and related elements factor into how much cost and effort we expend on storing such objects for the future. Librarians and archivists have always considered such questions, storing some documents in the equivalent of the kitchen and others in the equivalent of a wine cellar, and historians interested in preserving digital materials should likely begin their analysis of their long-term preservation needs by asking similar questions about their web creations. What’s worth preserving?

The U.S. National Archives and Records Administration (NARA), entrusted to preserve federal government materials such as the papers of presidents and U.S. Army correspondence, has a helpful set of appraisal guidelines they use in deciding what to classify as “permanent”—that is, documents and records that they will expend serious effort and money to preserve. (Although your archival mission will likely differ in nature and scope from NARA’s ambitious mission to capture “essential evidence” related to the “rights of American citizens,” the “actions of federal officials,” and the “national experience,” their basic principles still hold true regardless of an archive’s scale.) Many of these straightforward guidelines will sound familiar to historians. For example, you should try to determine the long-term value of a document or set of documents by asking such questions as:

    Is the information (they hold) unique? How significant is the source and context of the records? How significant are the records for research (current and projected in the future)?

Other questions place the materials being considered into a wider perspective. For example, “Do these records serve as a finding aid to other permanent records?” and “Are the records related to other permanent records?” In other words, by themselves some records have little value, but they may provide insight into other collections, without which those other collections may suffer. It therefore may be worth preserving materials that taken by themselves have little perceived value. Finally, for documents not clearly worth saving but also not obvious candidates for the trash bin, NARA’s guidance is to ask questions related to the ease of preservation and access in the future: “How usable are the records?” (i.e., are they deteriorating to such an extent as to make them unreadable in the near future?); “What are the cost considerations for long-term maintenance of the records?” (e.g., are they on paper that may decay and thus require expensive preservation work?); “What is the volume of records?” (i.e., the more records there are, the more it will cost to store them).7 This list of appraisal questions comes out of a well-established archival tradition in which objects such as the parchment of the Declaration of Independence and United States Constitution stand at the top of a preservation hierarchy and receive the greatest attention and resources (including expensive containers and argon gas), and less valuable records such as the casual letters of the lowest ranking bureaucrat receive the least amount of attention and resources. In NARA’s physical world of preservation, this hierarchy is surely prudent and justified.

NARA’s sensible archiving questions take on a wholly different character in the digital, nonphysical online world, however. Questions relating to deterioration—at least in the sense of light, water, fire, and insect damage—are irrelevant. The tenth copy of an email is as fresh and readable as the “original.” “Volume” is an even odder question to ask about digital materials. What is “a lot” or perhaps even “too much,” and when do we start worrying about that frightening amount? On one of its email servers, the White House generated roughly 40 million email messages in Clinton’s eight years in office. In 2000, at the end of his second term, the average email was 18.5 kilobytes. Assume for the sake of argument (and ease of calculation) that the policy wonks in Clinton’s staff were as verbose as they were prolific, writing a higher-than-average 25 kilobytes per email throughout the 1990s. That would equal roughly a thousand million kilobytes, or a million million bytes—that is, 1 terabyte, or the equivalent of a thousand Encyclopedia Britannicas—of text that needs to be stored to preserve one key piece of the electronic record of the forty-second president of the United States. That certainly sounds like a preposterous amount of storage—or we should say that sounded like a preposterous amount of storage, because by the time this book is in print, there will almost certainly be computers for the home market shipping with that amount of space on the hard drive. At Clinton’s 1993 inauguration, one terabyte of storage cost roughly $5 million; today you can purchase the same amount of digital space for $500.8
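For readers who want to check the arithmetic, the estimate above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: it assumes decimal units (1 kilobyte = 1,000 bytes and 1 terabyte = a million million bytes), the function and variable names are ours and purely illustrative, and "today" in the cost comparison means the time of writing, not the present day.

```python
# Back-of-the-envelope check of the storage estimate in the paragraph above.
# Assumption: decimal units (1 KB = 1,000 bytes; 1 TB = 10**12 bytes).
BYTES_PER_KB = 1_000
BYTES_PER_TB = 10**12


def archive_size_tb(message_count: int, avg_kb_per_message: float) -> float:
    """Return the total size, in terabytes, of a collection of messages."""
    total_bytes = message_count * avg_kb_per_message * BYTES_PER_KB
    return total_bytes / BYTES_PER_TB


# Figures taken from the text: 40 million messages, padded to 25 KB each.
padded_estimate_tb = archive_size_tb(40_000_000, 25)

# The same collection at the 18.5 KB average reported for 2000.
average_estimate_tb = archive_size_tb(40_000_000, 18.5)

# Cost of one terabyte: roughly $5 million in 1993 versus $500 at the time of writing.
price_drop_factor = 5_000_000 / 500

print(f"At 25 KB per message:   {padded_estimate_tb:.2f} TB")
print(f"At 18.5 KB per message: {average_estimate_tb:.2f} TB")
print(f"Storage became about {price_drop_factor:,.0f} times cheaper")
```

Even at the deliberately generous 25 kilobytes per message, the whole collection fits in a single terabyte, and at the measured average it would need only about three quarters of that.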

In the predigital age, it would have been impossible to think that a researcher could store copies of every letter to or from, say, Truman’s White House, in a shoebox under his or her desk, but that is precisely where we are headed. The low cost of storage (getting radically less expensive every year, unlike paper) means that it very well may be possible or even desirable to save everything ever written in our digital age.9 The selection criteria that form the core of almost all traditional archiving theories may fall away in the face of being able to save it all. This possibility is deeply troubling to many archivists and librarians because it destroys one of the pillars of archiving—that some things are worth saving due to a perceived importance, whereas other things can be lost to time with few repercussions. It also raises the specter that we may not be able to locate documents of value in a sea of undifferentiated digital files.

Surely this selection or appraisal process remains relevant to any discussion of preservation, including digital preservation, but the possibility of saving it all, or even saving most of it, presents opportunities for historians and archivists that should not be neglected. Archives can be far more democratic and inclusive in this new age. They may also satisfy many audiences at once, unlike traditional archives, by providing a less hierarchical way of approaching stored materials. Blaise Pascal famously said that “the heart has reasons that reason cannot understand”; we have found that in the world of digital preservation, researchers have reasons for using archives that their creators cannot understand.

Or predict. In 2003, most of the visitors to our September 11 Digital Archive came to the site via a search engine, having typed in (unsurprisingly) “September 11” or “9/11.” But because of the breadth of our online archive (over 150,000 digital objects), 228 visitors found our site useful for exploring “teen slang,” 421 were searching for information on the “USS Comfort” (one of the Navy’s hospital ships), and 157 were simply looking for a “map of lower Manhattan.” In other words, thousands of visitors came to our site for reasons that had absolutely nothing to do with September 11, 2001. Historians should take note of this very real possibility when considering what they may want to preserve. Brewster Kahle, the founder of the Internet Archive, likes to say that his archive may hold the early writings of a future president or other figure that historians will likely cherish information about in decades to come. Assessing which websites to save in order to capture that information in the present, however, is incredibly difficult—indeed, perhaps impossible.10

The NARA questions about the relationship between materials under consideration for archiving and those already preserved also take on different meanings in the digital era. One of the great strengths of the web is its interconnectedness—almost every site links to others. Such linkages make it difficult to answer the question of whether a set of digital documents under consideration for preservation is relevant to other preserved materials. Because of the interconnectedness of the web, the best archive of a specific historical site is probably one that stores the site embedded in a far larger archive of the web itself. But archiving a significant portion of the web to accompany your own site is practical only for the very few with tremendous resources, and the best course of action for most historians is to focus on the simpler preservation tactics we now explore.11

7 U.S. National Archives and Records Administration, “Records Management: Strategic Directions: Appraisal Policy,” ↪link 8.7a. The Pitt Project, an influential, early effort at developing an approach to archiving electronic records, takes a very different approach, focusing on “records as evidence” rather than “information.” “Records,” David Bearman and Jennifer Trant explain, “are that which was created in the conduct of business” and provide “evidence of transactions.” Data or information, by contrast, Bearman “dismisses as non-archival and unworthy of the archivist’s attention.” See David Bearman and Jennifer Trant, “Electronic Records Research Working Meeting, 28–30 May 1997: A Report from the Archives Community,” D-Lib Magazine 3, nos. 7–8 (July–August 1997), ↪link 8.7b. Linda Henry offers a sweeping attack on Bearman and other advocates of a “new paradigm” in electronic records management and a defense of the approach of Theodore Schellenberg, who shaped practice at NARA during his long career there, in “Schellenberg in Cyberspace,” American Archivist 61 (Fall 1998): 309–27.

8 Adrienne M. Woods, “Building the Archives of the Future,” Quarterly 2, no. 6 (December 2001), ↪link 8.8a; “Internet,” How Much Information? 2000, ↪link 8.8b; Steve Gilheany, “Projecting the Cost of Magnetic Disk Storage Over the Next 10 Years,” Burghell Associates – Content Management Integrators, ↪link 8.8c.

9 Roy Rosenzweig, “Scarcity or Abundance? Preserving the Past in a Digital Era,” American Historical Review 108, no. 3 (June 2003): 735–62, ↪link 8.9.

10 Center for History and New Media and American Social History Project, The September 11 Digital Archive, ↪link 8.10; Joseph Menn, “Net Archive Turns Back 10 Billion Pages of Time,” Los Angeles Times, 25 October 2001.

11 Since 1996, the Internet Archive (IA) has been saving copies of millions of websites approximately once every month or two. Visiting the IA’s Wayback Machine interface (↪link 8.11), you can explore what a site looked like a month, a year, or several years ago. Coverage is quite spotty (for less popular sites just the home page is saved, some images and other pieces may be missing from early snapshots, and the IA’s computers have trouble getting inside some database-driven websites), but it is already useful if you need to retrieve information that you may have posted on the web but that is long deleted from your own hard drive. On a similar scale, the search engine giant Google has cached versions of most websites it indexes. Because the spidering of sites by commercial entities such as Google raises copyright concerns, many people argue that the Library of Congress should declare that it has the right to archive all websites in the name of copyright deposit, as has been done in some other countries. See Rosenzweig, “Scarcity or Abundance?”