Becoming Digital: Preparing Historical Materials for the Web

To Mark Up, Or Not To Mark Up

Still another approach to maintaining fidelity to the original—but often a costly one—is to create marked-up text, which offers complete machine readability without much of the loss of information that often accompanies the move from analog to digital. Text mark-up can take many forms, but all of them involve classifying the components of the document according to format, logical structure, or context. As we discussed in Chapter 2, HTML uses mark-up for presentational purposes. For example, the tag <i> indicates that the text that follows should be displayed in italics—perhaps indicating the title of a book, a foreign name, the name of a ship, or a point to be emphasized. But more sophisticated encoding schemes also capture structural and descriptive aspects of the text. They might, for example, identify all dates and names; indicate whether something is a footnote, a chapter title, or a caption; precisely specify indentations, margins, and poetry line breaks; or even designate the title (e.g., Senator, Governor) of a speaker.
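The difference is easiest to see side by side. In the two lines below, the first uses ordinary presentational HTML; the second uses descriptive tags of the kind an encoding scheme might define (the tag names in the second line are our own invented illustration, not part of any particular standard):

    <i>Moby-Dick</i> tells the story of the <i>Pequod</i>.
    <title type="book">Moby-Dick</title> tells the story of the <name type="ship">Pequod</name>.

From the first line, a machine learns only that two phrases should be italicized; from the second, it learns that one phrase is a book title and the other the name of a ship, and it can index, sort, or search them accordingly.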

Lou Burnard, Assistant Director of the Oxford University Computing Services, explains that mark-up makes “explicit (to a machine) what is implicit (to a person),” adds “value by supplying multiple annotations,” and facilitates “re-use of the same material in different formats, in different contexts and for different users.”17 Not only can you reproduce the text with greater visual fidelity, you can also examine it in much more complex ways. You could, for example, search only the footnotes or captions for a particular word. Even more powerful are the ways that multiple texts could be manipulated. You could automatically generate a list of all books cited in the Voltaire papers, or you could create a timeline of all events mentioned in the papers of Dwight Eisenhower.
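To make this concrete, suppose a collection tags every footnote as a <note> element and every date as a <date> element (hypothetical but typical tag names). A single query in XPath, the standard W3C language for addressing parts of an XML document, could then confine a search to the footnotes alone, something impossible with unmarked text:

    //note[contains(., 'Voltaire')]
    //date

The first expression retrieves only those footnotes containing the word “Voltaire”; the second gathers every tagged date, the raw material for an automatically generated timeline.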

Of course, more expansive and exciting possibilities emerge only when a large number of people or small projects follow a single mark-up scheme, or a large online collection thoroughly and consistently implements such a scheme. HTML works because everyone agrees that <b>, not <bold>, means bold face. Achieving agreement is a social, cultural, and political process and is thus a much harder problem than surmounting the technical difficulty of, say, getting a computer monitor to display the slanting characters that indicate italics. Because the theory and practice of text mark-up are so complicated (and because mark-up nirvana has not yet been achieved), we urge those who plan to go down this road to consult the many technical works available on the subject.18 Here, we offer only a brief overview of the main approaches that have emerged so far.

Document mark-up predates the Internet and even computers. Traditionally, copy editors manually marked up manuscripts for typesetters, telling them, for example, to set the chapter title in “24 Times Roman.” Computerized typesetting meant that mark-up had to be computer readable, but the specific codes differed depending on the software program. In 1967, an engineer named William Tunnicliffe proposed that the solution to this Babel of codes lay in separating the information content of documents from their format. Two years later, Charles Goldfarb, Edward Mosher, and Raymond Lorie created the Generalized Markup Language for IBM by drawing on the generic coding ideas of Tunnicliffe and New York book designer Stanley Rice. IBM, the dominant computer company of the 1960s and 1970s, made extensive use of GML (an acronym for both Generalized Markup Language and Goldfarb, Mosher, and Lorie), which emerged in 1986, after a long standardization process, as the international standard SGML (Standard Generalized Markup Language).19

SGML—unlike HTML, which is a specific language derived from SGML—does not provide predefined classifications, or mark-up tags. Instead, it is a “meta-language” with a grammar and vocabulary that make it possible to define any set of tags. This great flexibility means that different groups, from the Department of Defense to the automobile industry to humanists, can define their own specialized mark-up languages. Using SGML requires first creating a Document Type Definition (DTD): you need, in effect, to develop a specialized language based on the meta-language of SGML or adopt one that has already been created. This great openness and flexibility is also the Achilles’ heel of SGML because it makes implementation difficult, intimidating, time-consuming, and expensive—a particular problem for nontechnical and poorly funded historians. Insiders joke that SGML stands for “Sounds Good, Maybe Later.”20
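A DTD is, at bottom, a formal list of the tags a document may contain and the order in which they may appear. As a minimal sketch (the element names here are our own invention, not any established standard), a DTD for a collection of letters might read:

    <!ELEMENT letter     (date, salutation?, body, signature)>
    <!ELEMENT date       (#PCDATA)>
    <!ELEMENT salutation (#PCDATA)>
    <!ELEMENT body       (#PCDATA)>
    <!ELEMENT signature  (#PCDATA)>

The first declaration says that every letter must contain a date, an optional salutation (the question mark marks it optional), a body, and a signature, in that order; #PCDATA indicates that an element holds ordinary text. Even this toy example suggests why writing a DTD amounts to designing a small specialized language.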

But at least some humanists decided to plunge immediately into the deep end of the SGML pool. In 1987, more than thirty electronic text experts began developing a common text-encoding scheme based on SGML for humanities documents. Three years later, this “Text Encoding Initiative” (TEI) published a first draft of its mark-up guidelines, a DTD for humanists working with electronic texts. Not until 1994, however, did the first “official” guidelines emerge. The goal of TEI was not to visually represent authentic texts, a task for which it is not that well adapted, but rather to offer them in a machine-readable form that allows automated tools to process and analyze them far more deftly than plain, unmarked-up text. The advantages of automatic processing increase exponentially with the size of the corpus of material—say, all Shakespeare plays or even all early modern drama—and when this corpus has been marked up in a common scheme.21 Historians with properly marked-up texts could ask when the term “McCarthyism” became widespread in the speeches and writings of senators (as compared to governors and congressmen) or when Southern (versus Northern) women diarists started talking about “love” and “passion.”
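A simplified sketch suggests how such questions become answerable. The elements below (<sp> for a speech, <speaker>, and <date>) are genuine TEI elements, though a complete TEI document requires much more, and the particular encoding choices here are our own illustration:

    <sp who="#mccarthy">
      <speaker>Senator Joseph McCarthy</speaker>
      <p>Speech delivered on <date when="1950-02-09">February 9, 1950</date> …</p>
    </sp>

Because the speaker and the date are recorded explicitly rather than buried in prose, a program can restrict a word search to the speeches of senators or sort every occurrence of a term chronologically.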

Thus the benefits of encoding for historians reside not simply in adherence to a particular encoding standard, which may enable a computer to scan diverse texts at once—all diaries on the web, for example. They rest even more, perhaps, in how the structuring of digital texts and the careful human indexing of their contents allow new historical questions to be asked and answered. Stephen Rhind-Tutt, the president of Alexander Street Press, a company that focuses on creating digital texts and databases that are highly structured and indexed to enable powerful searching, argues strongly that the considerable investment required to mark up texts pays off by enabling “new ways of exploring, analyzing, and discovering information” and permitting researchers “to examine hypotheses much more quickly than before.” He notes, for example, that his company’s structured databases on early American encounters allow users to answer such questions as “Were the encounters between the Jesuits and the Huron more violent than those between the Franciscans and the Huron?”22

TEI made SGML a viable standard for humanities texts but not an easy one to follow. “There is no doubt,” acknowledges one guide to creating electronic texts, “that the TEI’s DTD and Guidelines can appear rather daunting at first, especially if one is unfamiliar with the descriptive mark-up, text encoding issues, or SGML/XML applications.” In other words, historians without a strong technical background or a considerable amount of time and (monk-like) patience should be cautious before diving into this more robust form of digitization, despite its apparent advantages. They should also consider alliances with partners—perhaps a university library or press—who have already developed the necessary technical expertise. Learning to collaborate is an essential skill for any digital historian, and this is one area where collaboration may be unavoidable. The greatest benefits of mark-up come with its most careful and detailed implementations, but the more careful and detailed the mark-up, the greater the expense. The first exercise in a course taught by Lou Burnard is to decide what you are going to mark up in several thousand pages of text, and then to halve your budget and repeat the exercise.23

Some humanists and technologists question whether the benefits of complex mark-up justify the time and effort. They argue that more automated methods can achieve “good enough” results. Exhibit A for them is Google, which manages in a fraction of a second to come up with surprisingly good search results on the heterogeneous, often poorly formatted text of the World Wide Web. “Doing things really well makes them too expensive for many institutions,” argues computer scientist Michael Lesk, who favors providing more material at lower costs even if it means lower quality.24

Fortunately, two related developments have eased the pain of complex mark-up of electronic texts. The first was the development in the mid-1990s of a much simpler set of TEI tags—known as “TEI Lite”—to support “90 percent of the needs of 90 percent of the TEI user community.” It quickly became the most widely implemented version of TEI. The learning curve was further eased by the emergence of XML, a significantly easier-to-use subset of SGML—sometimes called “SGML Lite.”25
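In the XML version of the guidelines, a minimal TEI document, the kind of skeleton TEI Lite was designed to make manageable, looks roughly like this (we use the namespace of the current guidelines, which varies by version; the header records bibliographic information about the electronic file and its source, and the ellipses stand in for real content):

    <TEI xmlns="http://www.tei-c.org/ns/1.0">
      <teiHeader>
        <fileDesc>
          <titleStmt><title>…</title></titleStmt>
          <publicationStmt><p>…</p></publicationStmt>
          <sourceDesc><p>…</p></sourceDesc>
        </fileDesc>
      </teiHeader>
      <text>
        <body>
          <p>…</p>
        </body>
      </text>
    </TEI>

Everything beyond this skeleton (tagged names, dates, speakers, and the like) can be added as a project’s ambitions and budget allow.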

TEI and especially TEI Lite have increasingly become standards for projects that mount scholarly editions of online texts, especially literary texts. Those considering applying to major funders like the National Endowment for the Humanities or the Institute of Museum and Library Services for support for text digitization projects will likely need to follow these standards. Although many historians use TEI-marked-up resources, surprisingly few have organized such projects. The TEI website lists 115 projects using TEI but puts only 24 in the category of “historical materials.” And more than half of these—for example, the “Miguel de Cervantes Digital Library”—are largely literary. Most of the other projects fall into categories like “language corpora,” “literary texts,” and “classical and medieval literature and language,” in part perhaps a reflection of the greater applicability and utility of mark-up for highly structured literary forms like poetry than for the more heterogeneous texts studied by the majority of historians.26

Most historians and historical projects continue to present their texts in plain old HTML—a reflection of both their lack of technical sophistication and their greater interest in the “meaning” of a text than in its structural and linguistic features. This is not to say that they wouldn’t benefit from closer attention to the structure of texts, but they have not made this a priority. Text marked up with a standard scheme like TEI or indexed precisely in a database is superior to unformatted and unstructured words on a screen (especially for large projects and ones that expect to grow over time), but the journey to achieving that format can be long, treacherous, and expensive.27

17 Lou Burnard, “Digital Texts with XML and the TEI,” Text Encoding Initiative, ↪link 3.17.

18 See, for example, NINCH Guide, chapter 5 and appendix B; Digital Library Forum, “Metadata,” in A Framework of Guidance for Building Good Digital Collections, ↪link 3.18; Creating and Documenting Electronic Texts, chapter 4.

19 Dennis G. Watson, “Brief History of Document Markup,” University of Florida, Electronic Data Information Source, ↪link 3.19a; Harvey Bingham, “SGML: In Memory of William W. Tunnicliffe,” Cover Pages, ↪link 3.19b; SGML Users’ Group, “A Brief History of the Development of SGML,” Charles F. Goldfarb’s SGML Source Home Page, ↪link 3.19c.

20 Creating and Documenting Electronic Texts, 5.1.1; Shermin Voshmgir, “XML Tutorial,” JavaCommerce, ↪link 3.20.

21 Text Encoding Initiative, ↪link 3.21a. See also David Mertz, “An XML Dialect for Archival and Complex Documents,” IBM, ↪link 3.21b.

22 Stephen Rhind-Tutt, “A Different Direction for Electronic Publishers—How Indexing Can Increase Functionality,” Technicalities (April 2001), ↪link 3.22.

23 Creating and Documenting Electronic Texts, 5.2.2; Burnard, “Digital Texts with XML and the TEI.” Not only are the TEI guidelines complex, but there have never been sufficiently easy tools for working in TEI. Moreover, scholars began creating SGML/TEI documents at about the same time as the web burst on the scene, and web browsers cannot read SGML. For detailed scholarly discussions of TEI, see the special issue of Computers and the Humanities, 33 (1999).

24 Michael Lesk, “The Future Is a Foreign Country” (paper presented at The Price of Digitization: New Cost Models for Cultural and Educational Institutions, New York City, 8 April 2003), ↪link 3.24a. Even projects with the intellectual, financial, and technical resources to implement SGML/TEI have bumped up against the fundamental limits of the coding schemes for representing complex electronic texts. Jerome McGann, the leading figure at the leading national center for digital scholarship in the humanities, the Institute for Advanced Technology in the Humanities (IATH) at the University of Virginia, has written about the frustrations in trying to integrate a visual and presentational approach (exemplified by hypertext) with one rooted in the logical and conceptual approach exemplified by SGML. Jerome McGann, “Imagining What You Don’t Know: The Theoretical Goals of the Rossetti Archive,” ↪link 3.24b.

25 “Text Encoding Initiative (TEI),” Cover Pages, ↪link 3.25a; Lou Burnard, “Prefatory Note,” Text Encoding Initiative, ↪link 3.25b; “The TEI FAQ,” Text Encoding Initiative, ↪link 3.25c. For a good brief overview of XML versus SGML, see Data Conversion Laboratory, “DCL’s FAQ,” Data Conversion Laboratory, ↪link 3.25d.

26 “Projects Using the TEI,” Text Encoding Initiative, ↪link 3.26a. One of the few TEI-compliant projects organized by historians is the “Model Editions Partnership: Historical Editions in the Digital Age,” which is a consortium of several major historical papers projects. But even after nine years, the only visible result is the production of about a dozen “mini-editions”—not all of which work on all browsers and platforms. The Model Editions Partnership, ↪link 3.26b.

27 Stephen Chapman estimates that text mark-up doubles the cost of digitization, but mark-up can range from very light to full SGML. “Considerations for Project Management,” in Sitts, ed., Handbook for Digital Projects, 42. For large and growing projects, descriptive mark-up in HTML may wind up costing more in the long run than structural mark-up in XML (e.g., TEI) because it is very expensive to repurpose such texts and to take advantage of more complex means of data mining.