Becoming Digital: Preparing Historical Materials for the Web
Why Digitize the Past? Costs and Benefits
The aura of excitement that surrounds the new has created an implicit bias that favors digitization over a more conservative maintenance of analog historical artifacts. This was particularly true in the dot-com boom years of the mid- to late 1990s. An advertisement on the web in those heady days captured the prevalent mixture of opportunity and anxiety. Three little red schoolhouses stand together in a field. A pulsing green line or wire lights up one of the schools with a pulse of energy and excitement, casting the others into shadow. “Intraschool is Coming to a District Near You,” a sign flashes. “Don’t Be Left Behind!” That same fear of being “left behind” pushed many historians, librarians, and archivists into the digitizing game. But some leading figures in library circles like Abby Smith warned about getting swept away in the enthusiasm. “We should be cautious about letting the radiance of the bright future blind us to [the] limitations” of “this new technology,” she admonished other stewards of archival resources in 1999.3
For Smith and others, one of the most important of those limitations is intrinsic to the technology. Whereas analog data is a varying and continuous stream, digital data is only a sampling of the original data that is then encoded into the 1s and 0s that a computer understands. The continuous sweep of the second hand on a wristwatch compared to a digital alarm clock that changes its display in discrete units aptly illustrates the difference. As Smith nicely observes, “analog information can range from the subtle tones and gradations of the chiaroscuro in a Berenice Abbott photograph of Manhattan in the early morning light, to the changes in the volume, tone, and pitch recorded on a tape that might, when played back on equipment, turn out to be the basement tapes of Bob Dylan.” But digitization turns the “gradations that carry meaning in analog forms” into precise numerical values that lose at least a little bit of that meaning.4
But how much of that meaning is lost depends, in large part, on how much information you gather when you digitize. One dimension of this, as the excellent NINCH Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Resources explains, is the “density of data” captured: how much or how frequently the original is being sampled, a calculation that is reflected in the sampling rate for audio or the resolution for images. A second dimension is the breadth or depth of information gathered in each sample.5 For example, if you gathered just one “bit” of information (the smallest unit of computer memory or storage) about a tiny section of a painting, you would be able to represent that detail only as black or white, which would be highly distorting for a work by Monet. But with 24 bits, you would have millions of colors at your disposal and could thus better approximate, though never fully match, the rich rainbow hues of Monet’s Water Lilies.
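The arithmetic behind bit depth can be sketched in a few lines of Python. This is only an illustration, not a digitization tool: the function names and the 3000 by 2000 pixel scan are hypothetical, and the sketch assumes brightness is expressed on a scale from 0.0 (black) to 1.0 (white) and that the image is stored without compression.

```python
def distinct_values(bits_per_sample: int) -> int:
    """Number of different values a sample of the given bit depth can hold."""
    return 2 ** bits_per_sample


def quantize(brightness: float, bits: int) -> float:
    """Snap a brightness in [0.0, 1.0] to the nearest of 2**bits evenly spaced levels.

    With bits=1 the only available levels are 0.0 (black) and 1.0 (white).
    """
    if not 0.0 <= brightness <= 1.0:
        raise ValueError("brightness must lie between 0.0 and 1.0")
    top_level = distinct_values(bits) - 1
    return round(brightness * top_level) / top_level


def uncompressed_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Storage needed for an uncompressed image, rounded up to whole bytes."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8


if __name__ == "__main__":
    # Hypothetical scan size, chosen only to make the comparison concrete.
    for depth in (1, 8, 24):
        print(depth, "bits:", distinct_values(depth), "values,",
              uncompressed_bytes(3000, 2000, depth), "bytes")
```

One bit per sample yields two possible values, whereas 24 bits yield 16,777,216, the “millions of colors” mentioned above. The same scan also needs twenty-four times as much uncompressed storage at 24 bits as at 1 bit, which is why richer sampling is also costlier sampling.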
Capturing more information and sampling more frequently makes digitizing more expensive. It takes longer to gather and transmit more complete information, and it costs more to store it. Fortunately the stunning rise in computer power, the equally stunning drop in the cost of digital storage, and the significant (but less stunning) increase in the speed of computer networks have made these costs much less daunting than before. But even in the best of circumstances, the move from analog to digital generally entails a loss of information, although the significance of that loss is the subject of continuing and sometimes acrimonious debate. For example, partisans of (digital) music CDs played with solid-state (digital) amplifiers tout their quality and reliability, with the thousandth playback as crisp and unblemished as the first, whereas devotees of (analog) vinyl records amplified by (analog) tube amplifiers enthuse about their “authentic,” “warmer” sound, despite the occasional scratchiness of an old platter. “Taking an analog recording of a live concert,” writes one analogista, “and reducing it to 0s and 1s is not unlike a root canal: by extracting the nerves, the tooth is killed in order to save it.”6
At first glance, the analog versus digital debate would seem to apply to sound and images and not to text, which has the same precise quality in print as digital data. A letter is an “s” or a “t”; it can’t fall on a spectrum between them. But some comparable issues in the digitization of text point us to the largest concerns about the digitization process. Text digitizers also need to worry about the “density of data” collected. For example, should a digitized text capture just the letters and words or also information about paragraphs, headings, centering, spacing, indentations, and pagination? What about handwritten notes? Novelist Nicholson Baker blasted libraries for digitizing (and then disposing of) their card catalogs, thereby losing valuable information in the knowing marginalia scribbled over the years.7
Faithful digital representation is even more difficult with manuscripts. Take a look at the Library of Congress’s online versions of the Works Progress Administration (WPA) Life Histories, which require some complex notation just to represent a small handwritten correction:
{Begin deleted text}Nosy{End deleted text} {Begin inserted text}{Begin handwritten}Noisy{End handwritten}{End inserted text}
And this is a mostly typed, twentieth-century text; medieval manuscripts present much thornier difficulties, including different forms of the same letters and a plethora of superscripting, subscripting, and other hard-to-reproduce written formats.8
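To see what such notation buys the reader, here is a minimal Python sketch that recovers both the typist's original reading and the corrected reading from a marked-up string. It handles only the three tag pairs quoted above; the full Library of Congress tag set, the helper names, and the surrounding sample sentence are assumptions made for illustration.

```python
import re

# Only the tag pairs quoted in the example are handled; any other tags
# in the real transcriptions are outside the scope of this sketch.
DELETED = re.compile(r"\{Begin deleted text\}(.*?)\{End deleted text\}", re.DOTALL)
INSERTED = re.compile(r"\{Begin inserted text\}(.*?)\{End inserted text\}", re.DOTALL)
HANDWRITTEN_TAGS = re.compile(r"\{(?:Begin|End) handwritten\}")


def _tidy(text: str) -> str:
    """Collapse runs of whitespace left behind when tagged spans are removed."""
    return re.sub(r"\s+", " ", text).strip()


def corrected_reading(marked_up: str) -> str:
    """Text as it stands after the correction: deletions dropped, insertions kept."""
    text = DELETED.sub("", marked_up)
    text = INSERTED.sub(r"\1", text)
    return _tidy(HANDWRITTEN_TAGS.sub("", text))


def original_reading(marked_up: str) -> str:
    """Text as first typed: deletions restored, insertions dropped."""
    text = INSERTED.sub("", marked_up)
    text = DELETED.sub(r"\1", text)
    return _tidy(HANDWRITTEN_TAGS.sub("", text))


if __name__ == "__main__":
    # The words "a" and "street" are hypothetical context around the quoted markup.
    sample = ("a {Begin deleted text}Nosy{End deleted text} "
              "{Begin inserted text}{Begin handwritten}Noisy{End handwritten}"
              "{End inserted text} street")
    print(original_reading(sample))
    print(corrected_reading(sample))
```

A plain transcription would have to choose one reading and discard the other. The markup preserves both, along with the fact that the correction was made by hand, and the verbosity of the notation is the price of that fidelity.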
It may be impossible (or at least very difficult) to move from analog to digital with no loss of information; what you really need to ask is the cost of representing the original as closely as possible. In other words, not only does digitization mean a loss (albeit in some cases a very modest one), it also incurs a cost. Technological advances have gone much further in improving our ability to make faithful digital surrogates than they have in reducing the costs of doing so. If you are contemplating a digitization project, you need to consider those costs soberly and what they might mean for you or your organization.9
The need for such an assessment is great because the costs are not always obvious. We naturally tend to focus on the literal costs of moving documents into digital form: paying someone to type a handwritten document or employing a student to operate a scanner. But this neglects other crucial and expensive parts of the process, especially preparing and selecting the materials to be digitized and assembling information about the materials, what the librarians call “metadata.” Steve Puglia of the U.S. National Archives and Records Administration calculates that only one-third of the costs in digitization projects stem from actual digital conversion; an equal third goes for cataloging and descriptive metadata and the final third is spent on administrative costs, overhead, and quality control.10
First-time digitizers typically overestimate the production costs and underestimate the intellectual costs such as those associated with making the right selections and providing the most helpful metadata. Even a sophisticated library team at the University of Virginia reports that they “dramatically underestimated the labor and time” in preparing the documents for a digitizing project on Walter Reed and yellow fever.11 An equally important, but even less often considered, cost is maintaining the digital data, as Chapter 8 covers in greater depth.
This recitation of the costs and difficulties of digitization might sound prohibitively gloomy. If digitization is imperfect, difficult, and expensive, why bother? Because digitization offers stunning advantages. We don’t want to talk you out of digitizing the past, but rather encourage you to carefully weigh the problems and expenses against the benefits.
Among the many benefits of digital history we outlined in this book’s introduction, digitization particularly highlights the advantages of access, in a number of senses. It can mean new access, for example, to historical sources that are otherwise unavailable because of their fragility. Pierre-Charles L'Enfant's original 1791 plan for the city of Washington is so brittle and deteriorated that the Library of Congress no longer allows researchers to examine it. But now millions can view the digital reproduction on the library’s website. Most Library of Congress documents are not quite this delicate, but like many other primary source collections, they cannot be browsed easily in analog form. Traditionally, researchers faced the painstaking process of ordering up boxes of items in order to find what they were seeking. Sometimes you could not study the archival documents (e.g., glass plate and film negatives) without prior conversions into readable or viewable media (e.g., prints). Digitization, by contrast, permits quick and easy browsing of large volumes of material.12
Even more remarkable, the remote access to documents and archives that digitization (and global computer networks) makes possible has dramatically broadened the number of people who can research the past. Just two decades ago, research in the Library of Congress’s collection of early Edison motion pictures required an expensive trip to Washington, D.C. Now, high school students from Bangor, Maine, to Baja California, have instant access. The library reports that in 2003 approximately 15 million people visited American Memory, more people than have worked in the library’s reading rooms in its 200-year history and 1,500 times the number who annually use the manuscript reading room.13
This exciting prospect of universal, democratic access to our cultural heritage should always be tempered by a clear-headed analysis of whether the audience for the historical materials is real rather than hypothetical. Local historians would ecstatically greet a fully digitized and searchable version of their small-town newspaper, but it would not justify hundreds of thousands of dollars in digitizing costs. Nor would it make much sense to digitize a collection of personal papers that attracts one or two researchers per year. The archive that holds these papers could spend the money more effectively on travel grants to prospective researchers. “The mere potential for increased access,” the Society of American Archivists warns, “does not add value to an underutilized collection.” Of course, digitization can dramatically increase the use of previously neglected collections by making inaccessible materials easily discoverable. The Making of America collection largely draws from books from the University of Michigan’s remote storage facility that had rarely been borrowed in more than thirty years. Yet researchers now access the same “obscure” books 40,000 times a month.14
Digital searching most dramatically transforms access to collections. This finer grained access will revolutionize the way historians do research. Most obviously, digital word searching is orders of magnitude faster and more accurate than skimming through printed or handwritten texts. Even Thomas Jefferson scholars who have devoted their lives to studying the third president appreciate the ability to quickly find a quotation they remember reading in a letter years earlier. But the emergence of vast digital corpora (for example, the full texts of major newspapers) opens up possible investigations that could not have been considered before because of the human limits on scanning reams of paper or rolls of microfilm. Such quantitative digital additions may lead to qualitative changes in the way historical research is done.
As yet, the benefits of digital searching have not been brought to images or audio, although computer scientists are struggling to make that possible. If they succeed, they will transform research in these sources, too. But, even now, these other media also benefit from a new level of accessibility. Consider, for example, images that the naked eye cannot readily decipher. The digitization of the L’Enfant plan has made it possible to discern Thomas Jefferson’s handwritten editorial annotations, which had become illegible on the original. Similarly, users of the Anglo-Saxon Beowulf manuscript in the British Library could not see letters on the edges of each page because of a protective paper frame added to the manuscript in the mid-nineteenth century. Digitization with a high-end digital camera and fiber-optic lighting revealed the missing letters. Some of those missing letters offer the only extant record of certain Old English words. Art historians may eventually use computer programs like the University of Maastricht’s Authentic software, which can find patterns in digitized paintings to help with dating and attribution.15
3 Abby Smith, Why Digitize? (Washington, D.C.: Council on Library and Information Resources, 1999), 1. See similarly Paul Conway, “Overview: Rationale for Digitization and Preservation,” in Sitts, ed., Handbook for Digital Projects, 16.
4 Smith, Why Digitize? 2. See also NINCH Guide, 227; “Analog Versus Digital: The Difference Between Signals and Data,” Vermont Telecom Advancement Center, ↪link 3.4; Steven Puglia, “Technical Primer,” in Sitts, ed., Handbook for Digital Projects, 93–95. Digital imaging cannot reproduce the chemical, biological, or textual makeup of the analog form, which allows, for example, carbon dating or fingerprint identification.
5 NINCH Guide, 228–30.
6 University of Georgia Language Laboratories, “The Great Analog Versus Digital Debate,” VoicePrint Online, viewed online April 2004, but not available as of September 2004.
7 Nicholson Baker, “Discards: Annals of Scholarship,” New Yorker (4 April 1994), 64–86.
8 “Conversation in a Park,” American Life Histories: Manuscripts from the Federal Writers’ Projects, 1936–1940, ↪link 3.8; NINCH Guide, 230.
9 Smith, Why Digitize? provides an excellent overview.
10 Steve Puglia, “Revisiting Costs” (paper presented at The Price of Digitization: New Cost Models for Cultural and Educational Institutions, New York City, 8 April 2003), ↪link 3.10a. See also Steven Puglia, “The Costs of Digital Imaging Projects,” RLG DigiNews 3.5 (15 October 1999), ↪link 3.10b.
11 Joan Echtenkamp Klein and Linda M. Lisanti, Digitizing History: The Final Report of the IMLS Philip S. Hench Walter Reed and Yellow Fever Collection Digitization Project (Charlottesville: Claude Moore Health Sciences Library, University of Virginia Health System, 2001), ↪link 3.11a. For a detailed discussion of selection criteria and procedures, see Diane Vogt-O’Connor, “Selection of Material for Scanning,” in Sitts, ed., Handbook for Digital Projects, 45–73; Assessing the Costs of Conversion: Making of America IV: The American Voice 1850–1876 (Ann Arbor: University of Michigan Digital Library Services, 2001), 6, ↪link 3.11b.
12 Ricky L. Erway, “Options for Digitizing Visual Materials,” in Going Digital: Strategies for Access, Preservation, and Conversion of Collections to a Digital Format, ed. Donald L. DeWitt (New York: Haworth Press, 1998), 124; Smith, Why Digitize? 8; “Original Plan of Washington, D.C.,” American Treasures of the Library of Congress, ↪link 3.12. According to Franziska Frey, “Millions of negatives are never used only because their image content is not readily available to the user.” “Working with Photographs,” Sitts, ed., Handbook for Digital Projects, 122.
13 The library reported 8,890,221 “hits” from June to December 2003, but it appears that they really mean “visits.” Marilyn K. Parr, email to Roy Rosenzweig, 7 May 2004; Annual Report of the Librarian of Congress 2001 (Washington, D.C.: Library of Congress, 2001), 102, 121 ↪link 3.13a.
14 Smith, Why Digitize? 12; Society of American Archivists Council, The Preservation of Digitized Reproductions (Chicago: Society of American Archivists, 1997), ↪link 3.14; Christina Powell, email to Roy Rosenzweig, 4 August 2004.
15 NINCH Guide, 40–41; Kevin Kiernan, “Electronic Beowulf,” University of Kentucky, ↪link 3.15; Douglas Heingartner, “A Computer That Has an Eye for Van Gogh,” New York Times, 13 June 2004, Arts & Leisure section, 1.