Becoming Digital: Preparing Historical Materials for the Web

Who Does the Digitizing? Should You Do It Yourself?

Whether you set out to digitize video, sound, images, or text, you still need to ask whether you should do the conversion in-house or “outsource” it to a commercial vendor. As with much in the digital world, experts give the ambiguous answer that “it depends.” In the fall of 2000 and the spring of 2001, researchers from Glasgow University’s Humanities Advanced Technology and Information Institute interviewed people involved in thirty-six major cultural heritage digitization projects. More than half reported that they both digitized in-house and used commercial vendors. Most of the rest did all the work on their own premises; only two relied exclusively on outside companies.52

The project directors cited a range of reasons for their choices. The most common explanation they offered for doing projects onsite was that the original materials were rare or fragile. For example, the Oriental Institute of the University of Chicago cited “the cost and risk of transporting the original materials.” Often, fragile materials are visual (e.g., the maps in the Library of Congress’s vast cartographic collection) and local digitizing offers the further advantage of precise control over image quality—a serious concern, as we have seen. “The primary argument for digitizing in-house,” Janet Gertz of Columbia University Libraries writes in an overview of the subject, “is that it gives the institution close control over all procedures, handling of materials, and quality of products.” Sometimes, projects have chosen to do the work themselves in order to develop digitizing expertise. Gertz notes that “working in-house is a good way to learn the technical side of digitization,” which may prove useful “even when most work in the future will be sent to vendors.”53

Although a few projects cited savings from digitizing in-house, project directors much more often pointed to expenses as the reason to hire outside vendors. Many major projects such as those at Cornell University also favor outsourcing because the preset prices insulate them from any unexpected increases in costs. And though some projects want to develop local digitizing capabilities, others seek to avoid the considerable investment in staff training and equipment that in-house digitizing requires. That software, hardware, and technical standards continue to change rapidly means that such investments are not a one-time matter. Thus the National Monuments Record’s Images of England project, which is creating a snapshot of English heritage through photos of 370,000 buildings of architectural or historic interest, has outsourced all its digitizing because “there were no resources in-house and they did not want to invest in the type of equipment and specialized staff that . . . this project required.”54 As the commercial digitization business grows, vendors are more likely than even well-funded large libraries to have specialized staffs and the latest technology.

For these reasons, outsourcing has become increasingly common for large digitizing projects. Even as early as 2000, Paula De Stefano, head of the Preservation Department at NYU, observed “the trend . . . to use outside contractors” after the initial wave of “demonstration projects.” As David Seaman, the head of the Digital Library Federation, notes, outsourcing has become a “less mysterious” and more appealing option for those who work in cultural heritage projects, as large commercial vendors have begun to court their business, as the prices have dropped, and as vendors have become more willing to take on smaller jobs. At the same time, the prospect of setting up an in-house digitization lab has become more daunting and increasingly requires a multiyear commitment to expensive hardware, software, and especially staff. Dan Pence, the head of the Systems Integration Group, estimates, for example, that the equipment simply to capture page images from eleven fragile nineteenth-century scientific volumes would total about $60,000. Often commercial vendors have an advantage in dealing with both generic materials (e.g., thousands of ordinary book pages) because of the scale of their operations and nonstandard materials (e.g., large-format maps) because they own expensive specialized equipment.55

Of course, outsourcing also entails costs that go beyond the bill that arrives at the end of the project. These include the staff time to solicit bids, select a vendor, and monitor a contract. Gertz lists more than thirty items you should include in a request for proposal (RFP) from a vendor and more than twenty criteria for assessing bids from contractors. A close study of the digital conversion costs in University of Michigan’s Making of America project notes that preparing the RFP for scanning vendors “consumed several days of the project manager’s time,” and that was for a relatively simple contract that involved only page images and not OCR or typing. Yet the investment in a carefully prepared RFP can reap substantial savings; Michigan received fourteen bids on their contract, with prices ranging from ten cents to four dollars per page.56

Those considering outsourcing will also want to talk to colleagues about their experiences with particular vendors before they sign on the dotted line. Some organizations, like the Research Libraries Group, provide online lists of data conversion service bureaus about which they have received “positive reports.”57 Even with a signed contract in hand, you still need to select and prepare the materials for scanning and check on the quality of the work produced by the vendor, preferably before a major portion of the digital conversion has been done.

One topic not emphasized in official guides on outsourcing digitization, but known and discussed by everyone involved, is that outsourcing generally means sending the work to the Philippines, India, or China, and that the most important cost savings come from the considerably lower wages that prevail in those countries—sometimes as little as one-eighth or one-tenth what the same jobs pay in the United States. A medical transcriptionist (a position probably comparable to many jobs in data conversion) earns between $1.50 and $2.00 per hour in India and $13 per hour in the United States. Not surprisingly, in the current political climate of concern over the outsourcing of American jobs, most cultural institutions would rather avoid talking about where their digitizing is being done. (Offshore vendors point out that the same skittishness does not extend to the purchase of the computers and scanners for in-house digitization; little, if any, of this equipment is manufactured in the United States.) Interviewed by Computerworld for an article describing how Innodata Isogen (probably the largest digitizer of cultural heritage materials) had digitized in the Philippines the records of the U.S. Exploring Expedition of 1838-1842, Martin Kalfatovic, head of the Smithsonian Institution Libraries’ New Media Office, explained perhaps a tad defensively that “in terms of the marketplace, there aren’t onshore options.”58

Some commercial digitizing work is still done in the United States, but it is much more likely to involve preliminary scanning of rare and fragile objects—for example, setting up a local image scanning operation and then sending the images overseas for further processing. In addition, work that requires relatively little labor—running automatic OCR on standard texts—can be done at competitive prices here. Quite commonly, projects employ a hybrid approach—doing parts locally that are either more economical or that are dictated by the condition of the materials. Sometimes that means creating page images of rare or fragile materials that can’t leave the premises, although even in this case, some vendors—Luna Imaging, for example—will set up shop at your location and do the work for you.

If you are not worried about the materials, it may be cheaper to send them off for scanning, as the Million Book project at Carnegie Mellon University is doing by crating up books and shipping them to India. The University of Michigan sent the volumes in their Making of America project to Mexico for the scanning of page images (an expensive proposition because someone had to place fragile, but not necessarily rare, pages individually on a scanner) but did the automatic OCR in their own facility. In fact, the University of Michigan offers data conversion services to external clients. But for large-scale and labor-intensive historical projects (e.g., those requiring careful setup, proofing, coding, or rekeying), the offshore vendors dominate. In-country “capability has essentially disappeared,” says a leading museum consultant.59

The advice to seriously consider commercial vendors applies most clearly to large-scale projects, especially those subject to bulk processing, those that require substantial manual work (keying, correcting, or mark-up), and those that don’t involve rare or fragile materials. By contrast, you should handle small batches of documents with your own scanner and OCR software or by sending them to a local typist. But it is hard to know where to draw the line between “large” and “small.” Most large commercial vendors would probably disdain projects involving fewer than 5,000 pages of text, fewer than ten books, or a price tag of less than $10,000 unless they saw it as a pilot project that might lead to a larger contract later. Moreover, because commercial vendors charge set-up costs running into the thousands of dollars, you will wind up paying much more per page for a small job than a large one. Not surprisingly, vendors give their lowest prices to giant corporations like Thomson, ProQuest, and EBSCO, whose digitizing projects can run into the millions of dollars. If your project seems too small for a commercial vendor but not something you want to do yourself, you might investigate whether other groups within your institution or even at other institutions with whom you have alliances may be interested in bundling together a project with yours. Or, you might find a local vendor who digitizes smaller batches of materials, although they will generally lack the ability to deal with specialized materials or mark-up.
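
The effect of set-up costs on small jobs can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name, the $2,000 set-up fee, and the $0.25 per-page rate are hypothetical figures chosen for the example, not prices quoted by any vendor or reported in the studies cited here.

```python
def effective_cost_per_page(pages, setup_fee, rate_per_page):
    """Return the all-in cost per page: the set-up fee spread over
    every page, plus the vendor's per-page rate."""
    if pages <= 0:
        raise ValueError("pages must be a positive number")
    return setup_fee / pages + rate_per_page


# Hypothetical figures, for illustration only (not real vendor prices).
SETUP_FEE = 2000.00   # assumed one-time set-up charge, in dollars
RATE_PER_PAGE = 0.25  # assumed per-page conversion rate, in dollars

if __name__ == "__main__":
    for pages in (500, 5000, 50000):
        cost = effective_cost_per_page(pages, SETUP_FEE, RATE_PER_PAGE)
        print(f"{pages:>6} pages: ${cost:.2f} per page")
```

Under these assumed numbers, a 500-page job costs many times more per page than a 50,000-page job, even though the vendor's per-page rate is identical. That is the arithmetic behind bundling a small project with others.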

Considerations about whether outsourcing or in-house work is less expensive are essentially irrelevant for the very large number of history digitization projects carried out with very small budgets or no budget at all. If you have more time (or staff) than money, you are likely to do the work yourself. We happily note that the “do-it-yourself” spirit has driven some of the most exciting and pioneering history web efforts. Those contemplating that path should take inspiration from those who have digitized large corpuses of historical materials without the aid of grants or even research assistants. Jim Zwick, for example, has personally digitized—mostly with an ordinary flatbed scanner and OmniPage OCR software—the tens of thousands of pages of documents that make up the 1,250 web pages in his widely used Anti-Imperialism website. Although he reports that he has devoted “thousands of hours” to the effort, he also notes that he is “satisfied with the time it took” because “the site has received far more use and been far more influential than I originally expected.”60 Building your own website and digitizing your own documents may not be the quickest or the most cost-effective route to getting on the History Web, but like building your own house, it can be the most satisfying. The same goes for designing your site, to which we now turn.

52 NINCH Guide—Interview Reports.

53 Ibid., 2.2, 6.2, 9.2; 14.2, 17.2, 19.2, 21.2, 22.2; Janet Gertz, “Vendor Relations,” in Sitts, ed., Handbook for Digital Projects, 151–52. See similarly Stephen Chapman and William Comstock, “Digital Imaging Production Services at the Harvard College Library,” RLG DigiNews 4.6 (15 December 2000), ↪link 3.53.

54 NINCH Guide--Interview Reports, 4.2, 13.2, 15.2. See English Heritage National Monuments Record, Images of England, ↪link 3.54.

55 Paula De Stefano, “Digitization for Preservation and Access,” in Preservation: Issues and Planning, eds. Paul N. Banks and Roberta Pilette (Chicago: American Library Association, 2000), 318–19; David Seaman, interview, 10 May 2004; Peter Kaufman, interview, 30 April 2004; Rhind-Tutt, interview; Pence, “Ten Ways to Spend $100,000 on Digitization.”

56 Gertz, “Vendor Relations,” 155–57; Assessing the Costs of Conversion, 13, 20.

57 Erway, “Options for Digitizing Visual Materials,” 129–30. See ↪link 3.57 for RLG listings.

58 Ashok Deo Bardhan and Cynthia Kroll, “The New Wave of Outsourcing,” Research Report: Fisher Center for Real Estate and Urban Economics (Fall 2003), 5, ↪link 3.58a; Patrick Thibodeau, “U.S. History Moves Online, with Offshore Help,” Computerworld (16 January 2004), ↪link 3.58b. Such jobs, although poorly paid by American standards, are generally viewed as desirable in India and other locations outside the United States. See John Lancaster, “Outsourcing Delivers Hope to India: Young College Graduates See More Options for Better Life,” Washington Post (8 May 2004), ↪link 3.58c; Katherine Boo, “Best Job in Town,” New Yorker (5 July 2004), 54ff. In addition to Innodata, the largest companies working in data conversion for cultural heritage organizations are probably Apex, Data Conversion Laboratory, Inc. (DCL), and TechBooks.

59 Seaman, interview; Assessing the Costs of Conversion; “Frequently Asked Questions About the Million Book Project,” Carnegie Mellon University Libraries, ↪link 3.59; Thibodeau, “U.S. History Moves Online.”

60 Jim Zwick, interview, 30 March 2004.