Becoming Digital: Preparing Historical Materials for the Web
As with text, turning analog visual materials into digital objects can be as easy as operating a photocopier. You can digitize many images with the same inexpensive scanners or cameras as you might use with text. But some visual materials (slides, transparencies, large format documents like maps, or fine linear illustrations such as engravings in the nineteenth-century press) require more specialized equipment, photographic expertise, or scanning techniques. In addition, you will often face the question of whether to work from an original or a surrogate like a slide. Although the original yields a higher quality digital reproduction, it also often increases costs because large or fragile originals require complex, special handling.
Also, as with text, the goals of your website, your intended audience, and your budget will shape your digitizing plan. You might, for example, be creating a website for your course and want a few images for purely illustrative purposes. In that case, you can do some quick scans from books on an inexpensive scanner; the standard should be whether it is large and detailed enough to convey the information you want your students to see. At the other end of the spectrum, you might be creating an online archive of color lithographs from the early twentieth century. In that case, you need to focus on a wide range of issues, ranging from fidelity to the original to long-term preservation. As with many of the more technical subjects we discuss in this book, image capture and management is a much larger and more complex topic than we can discuss in detail here, and those contemplating a major project will want to consult more specialized guides.39
In digitizing images, as with all digitizing, the quality of the digital image rests on the quality of the original, the digitizing method employed, the skill of the person doing the digitizing, and the degree to which the digital copy has adequately “sampled” the analog original. For images, as with scanned texts, the two key measures of digital sampling are bit depth and resolution. Close up, a digital image looks as if the pointillist painter Georges Seurat decided to use graph paper instead of a blank canvas, carefully dabbing a single dot of color into each box of the grid. The breadth of the color palette is called the bit depth, ranging, as already noted, from 1 for simple black and white to 24 (or higher) for a virtually limitless palette of colors that Seurat and his Impressionist friends would have envied. Digital image size is expressed by two numbers, the width in dots times the height in dots (otherwise known as pixels on a computer screen). For instance, 900 x 1,500 would be the image size of a 3" x 5" photograph scanned at 300 dots per inch (3 times 300 by 5 times 300).
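The arithmetic just described can be sketched in a few lines of Python; the function names here are ours, purely for illustration:

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel size of a scan: physical size (in inches) times scanning resolution."""
    return width_in * dpi, height_in * dpi

def uncompressed_bytes(width_px, height_px, bit_depth):
    """Raw storage for an uncompressed image: bit depth per pixel, 8 bits per byte."""
    return width_px * height_px * bit_depth // 8

# The 3" x 5" photograph from the example, scanned at 300 dpi:
w, h = pixel_dimensions(3, 5, 300)
print(w, h)                          # 900 1500
print(uncompressed_bytes(w, h, 24))  # 4050000 bytes, about 4 MB at 24-bit color
print(uncompressed_bytes(w, h, 1))   # 168750 bytes for 1-bit black and white
```

The same scan that needs roughly 4 MB in full color takes a small fraction of that at a bit depth of 1, which is why bit depth matters as much as resolution when budgeting storage.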
Note that digitization disassociates an image from its real-world size. With different settings on our scanner, we could digitize that same 3" x 5" photograph at 100 dpi and arrive at a much “smaller” image size of 300 x 500. Of course, the second digital image would still represent the original in its totality; it would just be less fine-grained than the first, as if Seurat used a thicker brush to make his dots. But when displayed side by side on the same computer monitor (“painted” onto the screen using identically sized pixels), the first image would look much larger than the second. (In contrast, if you blow up the second to be as large as the first and compare a detail, you will see that it begins to break up, or “pixelate.”) To make matters slightly more confusing, monitors have not one but two characteristics that are sometimes called “resolution”: the number of pixels per inch of screen (often referred to as ppi and roughly equivalent to dpi) and the total number of pixels on the screen horizontally and vertically. Generally the first “resolution” is a single number between 72 and 96 ppi, and it denotes the density (and thus clarity and detail) of a monitor; the second “resolution” ranges from 640 x 480 to 1,920 x 1,440 or more, and it is a measure of how much fits on the screen at one time. Our 100-dpi scanned photograph would fit perfectly well on a monitor with an 800 x 600 setting (because 300 x 500 is less in each direction), whereas our 300-dpi version would overflow the sides, thus forcing us to scroll to see it all, as we might do with a long essay in Microsoft Word. Thus the physical size of your image on the screen depends on how your monitor is set. If you change your screen resolution from 640 x 480 to 1,024 x 768, your image will appear smaller because its constituent pixels are smaller (more of them now having to cram onto the same size display).
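The screen-fit logic works out as in this small sketch (again, the function name is ours, for illustration):

```python
def fits_on_screen(img_w, img_h, screen_w, screen_h):
    """True if the whole image is visible at once, without scrolling."""
    return img_w <= screen_w and img_h <= screen_h

# The same 3" x 5" photo scanned at 100 dpi and at 300 dpi,
# viewed on a monitor set to 800 x 600:
print(fits_on_screen(300, 500, 800, 600))   # True: the 100-dpi scan fits
print(fits_on_screen(900, 1500, 800, 600))  # False: the 300-dpi scan overflows
```

Note that neither check says anything about physical inches; only the pixel counts of image and screen matter.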
The three most common formats for digital images are TIFF, JPEG, and GIF. (Almost everyone uses their acronyms rather than full names, and pronounces them tiff, JAY-peg, and jiff or giff, but in case you are curious, they stand for Tagged Image File Format, Joint Photographic Experts Group, and Graphics Interchange Format.) Uncompressed TIFFs are the highest quality of the three and have become, as one guide suggests, “a de facto industry standard.”40 Unfortunately you pay for that quality with very large file sizes, which means increased storage costs and very slow downloads for your visitors. The Library of Congress’s uncompressed TIFF version of Dorothea Lange’s famous 1936 “Migrant Mother” photo is 55 megabytes, which could leave even a visitor with a high-speed connection waiting impatiently. Moreover, because this archival TIFF is 6,849 x 8,539 pixels in size, it cannot be viewed easily on a standard computer monitor without a great deal of scrolling back and forth.
Computer scientists have developed clever file compression formulas to deal with this problem of huge file sizes. For example, you can compress TIFF files by as much as one-third without throwing away any needed information (what is called “lossless” compression). Even so, they are still much too big for a typical website. JPEGs, which use a “lossy” compression algorithm (meaning that they throw away information, albeit much of it information that the human eye doesn’t notice), are significantly smaller. How much you save in file size between TIFF and JPEG depends on the quality settings you use in the JPEG compression and the overall resolution of the image. But the savings can be quite dramatic. For example, the Library of Congress’s high-resolution JPEG of Lange’s “Migrant Mother” is only 163 kilobytes, hundreds of times smaller than the archival TIFF. Of course, much of the savings comes from the smaller pixel dimensions, which are 821 x 1,024 for the JPEG, and additional savings come from an almost imperceptible reduction in the color palette and some fancy mathematics used to condense the information that remains. An even lower resolution (513 x 640) version, less than half that file size, appears lickety-split even on the screen of a modem-bound web surfer and still provides a very nice rendering of the famous photograph. Not surprisingly, JPEGs have emerged as the most common delivery format for photographs on the web.
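Some rough arithmetic on the Library of Congress figures quoted above separates the two sources of savings, the smaller pixel dimensions and the lossy compression itself (the variable names and the breakdown are our illustration, not the Library's):

```python
# Figures quoted above for the Library of Congress's "Migrant Mother" files.
tiff_px = 6_849 * 8_539   # pixels in the archival TIFF
jpeg_px = 821 * 1_024     # pixels in the high-resolution JPEG
tiff_kb = 55 * 1024       # uncompressed TIFF: 55 MB, expressed in kilobytes
jpeg_kb = 163             # JPEG file size in kilobytes

pixel_ratio = tiff_px / jpeg_px   # roughly a factor of 70 from fewer pixels
size_ratio = tiff_kb / jpeg_kb    # roughly a factor of 350 in file size

# The remainder, about a factor of five, is the compression itself.
print(round(pixel_ratio), round(size_ratio), round(size_ratio / pixel_ratio, 1))
```

In other words, most of the "hundreds of times smaller" figure comes from resampling to fewer pixels; the lossy mathematics accounts for the rest.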
The main web-based alternatives to JPEGs are GIFs. Ostensibly, GIFs do not work as well for large photos or other images with subtle changes in shade and color, but they render simple graphics, cartoons, or line drawings with large areas of solid color and high contrast more effectively. On the web, this distinction is partly apparent; for example, most logos you see on websites are GIFs, whereas most photos are JPEGs. But a quick check in an image processing tool like Adobe Photoshop or Macromedia Fireworks shows that the difference in how most photographs look in JPEG and GIF is not as dramatic as conventional wisdom suggests, and many photographs and complex images on the web are indeed GIFs, not JPEGs.41
If you can get very good looking images with low or medium resolution GIFs or JPEGs, why bother with bulky, high-resolution TIFFs? Perhaps the most important reason is to create a high-quality archival master copy that can be used for a variety of purposes. For example, if tomorrow’s computer scientists develop a nifty new compression scheme to replace JPEG, you will be able to go back to the TIFF and recompress it. Trying to do so from the JPEG will only result in further degradation of the image (another round of “lossy” sampling on top of the first round used to create the JPEG). Or, if inexpensive 21-inch monitors start to become standard and the audience for larger sized images thus expands significantly, you will want to resize your JPEGs working from the original TIFFs. Similarly, if you wanted to “clean up” the image (a controversial move when dealing with famous works like the Lange photo), you would want the best possible “original” to start with, and you could allow people to go back to the master to compare your changes. A high-resolution image also lets you single out a particular detail (Migrant Mother’s eyes, for example) and study it closely. The higher quality (and especially higher resolution) images are also important to people who are going to use them in print rather than on the web. If you provide a digital image for a print publication, the publisher will likely request something at 300 dpi or higher because print can render much finer detail than a monitor. A 100-dpi JPEG or GIF printed in a newspaper, magazine, or book generally looks horrible. But if you are only displaying the image on a computer monitor, you will have to shrink down a 600-dpi image, and at the same on-screen dimensions it will look no better than the 100-dpi version.
These advantages of high-quality TIFFs may be disadvantages from the perspective of some institutions. Many art museums (or artists for that matter) don’t want people “fooling around” with their images, and altering images is much harder to do if you are starting with a low-resolution JPEG. Many museums also earn significant revenues by charging fees for the use of their images in publications. Thus many sites will present images online at only 72 ppi and then provide higher quality images to those who pay a fee, which allows them to control usage. (Those sites that are particularly vigilant about protecting rights, like the commercial Corbis image library, display their images with visible watermarks and sell them with invisible digital signatures to dissuade unauthorized reproduction.) But sites like the Library of Congress that generally provide copyright-free images and are not concerned about earning licensing fees from their collections generously offer high-quality, uncompressed TIFFs.42
As this discussion suggests, the methods by which you digitize images and the formats in which you present them depend on your budget and the goals of your site, but guides uniformly recommend that you create the highest quality digital masters you can afford at the time of the initial scan, and then make other, lesser derivatives (e.g., JPEGs) appropriate to particular circumstances. Typically your digital masters should be uncompressed TIFFs with a spatial resolution of at least 3,000 lines across the longest dimension of the image and a dynamic range of at least 8 bits for grayscale images and at least 24 bits for color images, scanned at 300 to 600 dpi. Scan your images at the highest possible quality, but keep in mind that an image scanned at 3,000 x 3,000 lines and 24 bits (in other words, a 10" x 10" image scanned at 300 dpi) produces a 27 MB file. The William Blake Archive at the University of Virginia reports that some of its digital masters are as large as 100 MB.43 As such, the quality of your digital masters necessarily depends on the amount of storage and staff time available.
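The storage arithmetic behind these recommendations is easy to sketch (the function name and the collection size of 500 are ours, chosen only to illustrate the scale involved):

```python
def master_size_mb(width_px, height_px, bit_depth=24):
    """Uncompressed size in megabytes (decimal) of one digital master."""
    return width_px * height_px * bit_depth / 8 / 1_000_000

# The example above: 3,000 x 3,000 pixels at 24-bit color.
print(master_size_mb(3000, 3000))  # 27.0 MB

# A hypothetical collection of 500 such masters, doubled if you also keep
# an uncorrected working copy of each master (see the discussion of
# correction below):
budget_gb = 500 * master_size_mb(3000, 3000) * 2 / 1000
print(budget_gb, "GB")
```

Even a modest collection of masters at this specification, in other words, quickly runs to tens of gigabytes before any derivatives are made.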
These digital masters, then, provide a resource from which to produce files for other uses. For example, for the web you will want to compress your large TIFFs into JPEGs or GIFs to allow for quicker load times and to fit standard computer displays.44 You will likely also want to create a small “thumbnail” version of each image in either GIF or JPEG format, which will load very quickly and make it possible to browse multiple images on a single page, as we suggest as a design principle in Chapter 4.
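Producing such derivatives can be automated. This sketch uses the Pillow imaging library (our choice of tool, not one named in the text) and a synthetic stand-in for a scanned master; in practice you would open your archival TIFF instead:

```python
from PIL import Image

# Stand-in for a master scan; in practice: master = Image.open("master.tif")
master = Image.new("RGB", (900, 1500), "gray")

# A web derivative: shrink a copy in place, preserving aspect ratio,
# and save as a compressed JPEG. The master itself is left untouched.
web = master.copy()
web.thumbnail((600, 600))
web.save("photo_web.jpg", quality=85)

# A small thumbnail for browsing pages, at a lower JPEG quality.
thumb = master.copy()
thumb.thumbnail((150, 150))
thumb.save("photo_thumb.jpg", quality=75)

print(web.size, thumb.size)  # (360, 600) (90, 150)
```

Because `thumbnail` only ever shrinks, the derivatives cannot accidentally upsample; note also that both are made from copies, so the master stays pristine, in keeping with the archival practice described above.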
Although text digitization can be laborious, it is generally straightforward: you have either correctly rendered a word or not. But image digitization can require much more human judgment. After your initial scan, you need to make sure your images are as close to the original as possible. The Blake Archive, which lavishes great care on getting “true” images, reports that the “correction” process can take as long as several hours for a single image, and even a typical image requires thirty minutes. But for them, it is “a key step in establishing the scholarly integrity” of the project. Among the many things the Blake Archive and most other digital imaging projects correct for are image sharpness, completeness, orientation, artifacts, tone, color palette, detail, and noise. Photoshop, or its less expensive and less complex sibling Photoshop Elements, is generally an essential tool in the correction process. Most guides, moreover, recommend in the interest of archival integrity that these corrections not be done to the digital masters but rather to derivative files, even if those derivatives are high-quality exact replicas of the masters. Even this small extra step, intended to ensure that the original remains uncorrupted, can add considerable storage costs to your project: if you follow this tenet, you essentially double your storage needs right at the start. Some projects also “enhance” digitized images, but others worry about maintaining “authenticity” and debate, for example, whether bright white tones should be restored to yellowed photographs.45
The other laborious manual process in image digitizing is the provision of metadata, identifying data about the image. Metadata becomes very important when the files themselves are not text-searchable. If the images are just going to be used on your website, it might be enough simply to provide descriptive “alt” tags in your HTML code (see Chapter 4), both so you can remember what they are and so they can be read by the audio browsers used by the blind. On the other hand, if you are creating a larger reference collection, you will probably want to use a more elaborate metadata standard such as Dublin Core or METS (see Chapter 8) that will provide descriptive information about the content and creation of the original image; technical information about the digitized file such as resolution, compression, and scanning process; structural information about the file’s relationship to derivative or associated files; and administrative information about rights management and preservation.46
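To make the idea concrete, a minimal Dublin Core record for the Lange photograph discussed above might look like the following. This is a hedged illustration, not an actual catalog record: the element choices and values are ours, and a production METS record would add the technical, structural, and administrative sections described above.

```xml
<!-- Illustrative Dublin Core record; values are examples only -->
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Migrant Mother</dc:title>
  <dc:creator>Lange, Dorothea</dc:creator>
  <dc:date>1936</dc:date>
  <dc:type>Image</dc:type>
  <dc:format>image/tiff</dc:format>
  <dc:description>Archival scan; 6,849 x 8,539 pixels, uncompressed TIFF</dc:description>
  <dc:rights>Rights statement goes here</dc:rights>
</record>
```

Even a record this small captures the essentials a later user or system needs: who made the original, when, and what kind of digital file represents it.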
39 A good, but somewhat dated, introduction is Howard Besser and Jennifer Trant, Introduction to Imaging: Issues in Constructing an Image Database (Santa Monica, Calif.: Getty Art History Information Program, 1995). See also NINCH Guide, 102–19; Puglia and Roginski, NARA Guidelines for Digitizing Archival Materials for Electronic Access; Erway, “Options for Digitizing Visual Materials”; Council on Library and Information Resources, “File Formats for Digital Masters,” in Guides to Quality in Visual Resource Imaging, ↪link 3.39a; CDL Technical Architecture and Standards Workgroup, Best Practices for Image Capture (Berkeley: California Digital Library, 2001), ↪link 3.39b; Steven Puglia, “Technical Primer,” in Sitts, ed., Handbook for Digital Projects, 93–111; Colorado Digitization Project Scanning Working Group (hereafter CDP), General Guidelines for Scanning (Denver: Colorado Digitization Project, 1999), ↪link 3.39c.
40 Erway, “Options for Digitizing Visual Materials,” 127.
41 JPEG works better for photos with a large range of hues and compresses photos better than GIF, but it loses its compression advantages over GIF at small sizes (e.g., 200 x 200 pixels). A check of Google’s image search tool confirms the large number of photos rendered as GIFs on the web. In 1995, Unisys, which owns the compression algorithm used in GIFs, announced that it would charge royalties for software programs that output into GIFs. As a result, PNG (Portable Network Graphics) was developed as a nonproprietary alternative to GIF and one with some attractive additional features. But so far it has failed to attract a wide following, and the recent expiration of the patents on the compression used in GIF suggests that PNG will never become a major graphics format for the web. Paul Festa, “GIF Patent to Expire, Will PNG Survive?” CNET News.com (9 June 2003), ↪link 3.41a. Another image format that you might encounter is the proprietary “MrSID” (.sid) format used with very large digital images such as maps. The compression used allows for the “zoom in” capability that you can see in the Library of Congress map projects in American Memory. See David Yehling Allen, “Creating and Distributing High Resolution Cartographic Images,” RLG DigiNews 4.1 (15 April 1998), ↪link 3.41b.
42 See ↪link 3.42. Corbis charges $1,000 per year for you to use a relatively low resolution JPEG of the Migrant Mother photo on your website, but you can download a better quality version from the Library of Congress for free. According to the Western States Digital Imaging guide (p. 23), electronic watermarks “can be easily overcome through manipulation.”
43 NINCH Guide, 108; CDP, General Guidelines for Scanning, 21–24; Puglia and Roginski, NARA Guidelines for Digitizing Archival Materials for Electronic Access, 2–3; CDL Technical Architecture and Standards Workgroup, Best Practices for Image Capture; William Blake Archive, “The Persistence of Vision: Images and Imaging at the William Blake Archive,” RLG DigiNews 4.1 (15 February 2000), ↪link 3.43.
44 Conversion from one file format to another can be done easily with commonly available software like Adobe Photoshop. If you have many images to convert, you can use batch-processing tools such as Tif2gif, DeBabelizer, or ImageMagick.
45 Blake Archive, “The Persistence of Vision.” The inclusion of “targets”—grayscales and color bars—in every scanned image or at least in every scanned batch can make this process more efficient. CDP, General Guidelines for Scanning, 25, 27. See also Anne R. Kenney, “Digital Benchmarking for Conversion and Access,” in Moving Theory into Practice: Digital Imaging for Libraries and Archives, ed. Anne R. Kenney and Oya Y. Rieger (Mountain View, Calif.: Research Libraries Group, 2000), 24–60. In some cases, however, it may be more efficient to do color correction in the scanning process. CDP, General Guidelines for Scanning, 3, ↪link 3.45.
46 This information can either sit in stand-alone text files or databases or, as the Blake Archive has done, can be incorporated into the “header information” of the image files themselves. In all cases, image files should be logically and consistently named, for instance by incorporating existing catalog numbers directly into the filenames of digital counterparts. Blake Archive, “The Persistence of Vision.”