Becoming Digital: Preparing Historical Materials for the Web

Digitizing Text: What Do You Want to Provide?

Let’s say your cost-benefit analysis has convinced you that digitization makes sense. The audience, you conclude, is relatively large, scattered around the nation or globe, and eager to get their hands (or rather mice) on the materials you have. What does it actually mean to place a digital text online? Digitized text is any kind of text that can be read on a computer, but that text can take many forms. Decisions about which form you should choose depend on the state of the original, the audience for the digitized version, and your budget.

The simplest format is a “page image” produced by scanning a printed page or a roll of microfilm. These digital photocopies have three major advantages and an equal number of serious drawbacks. First, you can create them quickly and easily. Second, good page images closely represent the original. The page image of the WPA life interview mentioned earlier not only shows you the handwritten insert of the editor but also indicates precisely where he inserted it. Third, page images give a “feel” for the original. Students can read the printed text of Harry Truman’s diary entry for 25 July 1945, in which he contemplates the dropping of the atomic bomb. But the impact grows when they see the words “the Japs are savages, ruthless, merciless and fanatic” written in his own hand.

So, why not just stick to page images? As mere visual representations of text, page images cannot be searched or manipulated in the same ways as machine-readable text. A student looking for Truman’s views on the Japanese and the atomic bomb in a series of page images from his diary would need to read every page, just as with the analog originals. In addition, image files are much larger than text, which makes them much slower to download to your audience’s browsers and slower to print as well. The files that the Universal Library at Carnegie Mellon University uses to present individual page images of the New York Times are about 1 megabyte in size. By contrast, that amount of plain (machine-readable) text would take up about 30 kilobytes, a mere 3 percent of the image. As a result, even with a fast computer and a high-speed connection, it takes about twenty seconds to turn a page on the Carnegie Mellon site. Page images of detailed folios of text can also be difficult to examine on most computer monitors with their limited size and resolution (although some software programs might allow you to segment the image into smaller chunks). With unaided eyes, you can browse an entire page of a printed newspaper easily; it is possible but more difficult to do that with microfilm; it is impossible on a standard computer monitor.
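To make the size comparison concrete, here is a rough back-of-the-envelope sketch in Python. The two file sizes come from the paragraph above; the connection speeds are illustrative assumptions, not measurements of the Carnegie Mellon site, and the calculation ignores server and rendering time, so it will not reproduce the twenty-second figure exactly.

```python
# Sizes taken from the discussion above (approximate, decimal units).
PAGE_IMAGE_BYTES = 1_000_000   # about 1 megabyte per page image
PLAIN_TEXT_BYTES = 30_000      # about 30 kilobytes for the same page as text


def transfer_seconds(size_bytes, bits_per_second):
    """Idealized transfer time: file size in bits divided by line speed."""
    return size_bytes * 8 / bits_per_second


# Text as a fraction of the image size.
ratio = PLAIN_TEXT_BYTES / PAGE_IMAGE_BYTES
print(f"Text is about {ratio:.0%} of the image size")

# Hypothetical connection speeds, chosen only for illustration.
assumed_speeds = {
    "dial-up modem (56 kbps)": 56_000,
    "broadband (1 Mbps)": 1_000_000,
}

for label, speed in assumed_speeds.items():
    image_time = transfer_seconds(PAGE_IMAGE_BYTES, speed)
    text_time = transfer_seconds(PLAIN_TEXT_BYTES, speed)
    print(f"{label}: image {image_time:.1f} s, text {text_time:.1f} s")
```

Whatever speed you assume, the proportion stays the same: the page image takes roughly thirty-three times as long to arrive as the equivalent plain text.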

Finally, providing more detailed metadata, which some digitizing projects do to help users find content within page images, erases some of this format’s inherent savings. Even without metadata, machine-readable texts come ready to be searched. You can find your way through them by using a simple word search, as with the “Find” command in Microsoft Word. Large collections of page images, however, usually need additional information and tools so that readers can locate their desired topic or folio.
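The difference can be sketched in a few lines of Python. This is a minimal illustration, not a description of how any particular project works: the page texts, file names, and keyword lists below are hypothetical placeholders, and `find_pages` and `find_images` are names invented for this example.

```python
def find_pages(pages, term):
    """Return the page numbers whose text contains the term (case-insensitive)."""
    needle = term.casefold()
    return [number for number, text in pages.items() if needle in text.casefold()]


def find_images(image_keywords, term):
    """Return image files whose hand-assigned keywords include the term."""
    needle = term.casefold()
    return [
        filename
        for filename, keywords in image_keywords.items()
        if needle in (keyword.casefold() for keyword in keywords)
    ]


# Machine-readable text: searchable as soon as it exists.
# (Placeholder sentences, not quotations from any real document.)
pages = {
    1: "Placeholder text for the first page of a transcript.",
    2: "A second page that happens to mention the atomic bomb.",
    3: "A third page about something else entirely.",
}
print(find_pages(pages, "Atomic Bomb"))

# Page images: nothing to search until someone supplies metadata.
# (Hypothetical file names and keywords.)
image_keywords = {
    "page-001.tif": ["introduction"],
    "page-002.tif": ["atomic bomb", "diary"],
    "page-003.tif": [],
}
print(find_images(image_keywords, "atomic bomb"))
```

The text search works on every word on every page at no extra cost. The image search finds only what a cataloger thought to record, which is where the additional expense comes in.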

This discussion of page images points to the obvious advantages and disadvantages of machine-readable texts—they are searchable and easy to work with (you can readily copy and paste passages of text, for example) but more expensive to produce and less likely to faithfully represent the original. Not surprisingly, a variety of hybrid approaches have developed to mitigate these two disadvantages. Some websites link page images with uncorrected transcripts automatically produced by software (more on this in a moment)—the approach taken by JSTOR, the massive online database of 460 scholarly journals. A related approach, which does not, however, offer any cost savings, links page images with machine-readable text proofread by humans. For example, the Franklin D. Roosevelt Presidential Library and Museum website combines images of important presidential records with fully corrected and formatted text.16

16 “The Safe Files,” Franklin D. Roosevelt Presidential Library and Museum, ↪link 3.16. PDFs make it easy to combine page images with either “dirty” or corrected text created by optical character recognition.