Preserving Digital History: What We Can Do Today to Help Tomorrow’s Historians

Documentation

The Commission on Preservation and Access (now merged into the Council on Library and Information Resources, the group that has done the most to promote the need for digital preservation in the United States) and the Research Libraries Group issued a report entitled Preserving Digital Information in 1996, and its primary conclusion still holds true: “The first line of defense against loss of valuable digital information rests with the creators, providers and owners of digital information.”12 Historians can take some concrete steps in the process of creating content for the web that should increase the likelihood that this content will last into the future, and, more important, that it will still be readable, usable, or playable in the long term. In particular, they should make sure that their website’s underlying code is easy to maintain through the use of documentation and technical conventions, and that it does not rely too heavily on specific technologies to run properly.

Historians are great documenters and in turn are great lovers of documentation. We adore seeing footnotes at the bottom of the page, even if we follow them only rarely. Indexes delight us with the prospect of a topic easily found. The only thing better than a cocksure preface to a new historical interpretation is a self-deprecating foreword to a second edition of that book decrying the author’s earlier, rash hubris. This preference for documentation is not merely a pedantic predilection; good documentation allows current and future readers to understand where a text or object came from, who created it, what its constituent pieces are, and how it relates to other texts or objects. Like history itself, documentation contextualizes, relates, and records for posterity, and it helps to make a source more reliable.

The same values hold true for digital materials. Documentation, or little notes within the underlying code, provides guidance to current and future developers of your site so that they have the means to understand how your creations function, what the pieces are, and how to alter them with as little impact as possible on other parts of your website. The ease with which you can put web pages on the Internet will tempt you to spend all your time on that which everyone will see (the stuff visible on the screen), while ignoring the less flashy parts—the digital equivalents of the footnotes and preface. On many occasions we, too, have succumbed to this temptation. Documentation takes time and thus involves a trade-off: after all, you could spend that same time improving the site’s design, digitizing more documents, or adding features that your audience might appreciate. Unsurprisingly, creators of websites often end up abandoning overly stringent or complex documentation regimes. We suspect that many scholars would go into print with incomplete footnotes if not for the stern hand of an editor. But websites go to press without those external prods to jar us out of our laziness.

Programmers, however, know that poorly documented code causes problems in the long term, and sometimes even in the short term, as others take over a project and scratch their heads at what the prior programmer has done, or an insufficiently caffeinated mind fails to comprehend what it did just a few days ago. Rigorous documentation provides a digital document like a web page with something akin to transparency, so that you and others can understand its pieces, structure, and functionality, now and in the future. So try to annotate your website as you create it, just as you would with a book or essay. Doing so in a thoughtful manner will allow you and others to modify and migrate the site in the future with less trouble, which in turn means that the site is more likely to live on when technological changes inevitably come.

The first step toward such website documentation is to use comment tags in HTML. Comment tags are fairly simple, involving greater-than and less-than signs, some dashes, and an exclamation point, like so:

<!-- put your comments here -->

These comments are not visible on the web page, but those who are interested can see them by using an option in their browser to call up the page’s code (generally by clicking on “source” under the “view” menu).

You might place an overall comment near the top of each web page noting when the page was created and who created and last modified it, so that those people can be tracked down if there’s a problem with the page in the future. (Alternatively, you could put the creation or last change date and the initials of the person responsible at the bottom of the visible page, i.e., without using the invisible comment tags.) Beyond this initial comment field at the top of the page, you may want to add comments at the top of any section of the file for which it is not utterly obvious what the ensuing code does. The official website of Thomas Jefferson’s home, Monticello, has helpful comments for each section of its web pages, detailing in capital letters the purpose of that section. For instance, the comments note where the site navigation begins and ends, and where important images are loaded into the page.13
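
As a rough illustration of the commenting practice described above, a page might be annotated along the following lines. This is only a sketch: the page title, names, dates, and file paths are invented for the example, and it is not copied from the Monticello site or any other real website.

```html
<!DOCTYPE html>
<!--
  PAGE: Introduction to a hypothetical letters collection
  CREATED: 17 September 2002 by J. Smith (hypothetical name)
  LAST MODIFIED: 3 November 2004 by A. Jones (added link to search page)
-->
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Introduction to the Letters</title>
</head>
<body>

  <!-- SITE NAVIGATION BEGINS -->
  <ul>
    <li><a href="index.html">Home</a></li>
    <li><a href="letters.html">Letters</a></li>
    <li><a href="search.html">Search</a></li>
  </ul>
  <!-- SITE NAVIGATION ENDS -->

  <!-- MAIN IMAGE: scanned first page of the earliest letter -->
  <img src="images/letter-001.jpg" alt="First page of the earliest letter">

  <!-- INTRODUCTORY TEXT BEGINS -->
  <p>This collection gathers the correspondence of one family.</p>
  <!-- INTRODUCTORY TEXT ENDS -->

</body>
</html>
```

One caution if you adapt this: a pair of adjacent dashes should not appear inside the text of a comment, because it can be mistaken for the end of the comment.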

Comments on any “scripts,” or small programs within web pages that handle the more complex or interactive features of advanced sites, are perhaps more important because these bits of programming in languages other than HTML tend to be more complicated and thus less transparent than the basic code of a page. The Monticello website provides an admirably brief explanatory comment above the JavaScript code that permits visitors to move easily between areas of the site. If you hand off the creation of this programming to others, be sure to tell them (or write into a contract if you are using a professional) that you expect clear and consistent documentation. Files are much easier to handle in the long run with these internal notes, and the notes make you less dependent on particular individuals (or your own long-term memory if you do it yourself).
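
To give a sense of what a brief explanatory comment above a script might look like, here is a minimal sketch of a menu that sends visitors to different areas of a site. The function name goToSection, the element id, the initials, and the addresses are all hypothetical; this is our own illustration and not the script used on the Monticello website.

```html
<!-- SECTION MENU BEGINS: lets visitors jump between areas of the site -->
<select id="section-menu">
  <option value="">Go to a section...</option>
  <option value="/house/">The House</option>
  <option value="/gardens/">The Gardens</option>
</select>

<script>
  // goToSection: sends the browser to the address chosen in the menu.
  // Written by A.J., 3 November 2004 (hypothetical initials and date).
  // Each option's value holds a page address. The first option is only
  // a prompt and has an empty value, so choosing it does nothing.
  function goToSection(menu) {
    var address = menu.value;
    if (address !== "") {
      window.location.href = address;
    }
  }

  // Run goToSection whenever the visitor picks a different option.
  document.getElementById("section-menu").onchange = function () {
    goToSection(this);
  };
</script>
<!-- SECTION MENU ENDS -->
```

Note that the comments inside the script use JavaScript's own comment syntax (two slashes) rather than HTML comment tags, since the two languages mark comments differently.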

In addition to these occasional intra-file comments, you should create an overall text file describing the contents of your website or its sections, often called a “README” file and frequently named README.txt. But feel free to use a filename that you deem appropriate, such as notes.txt or perhaps if_you_are_reading_this_file_I_am_long_gone.txt. Write in this brief document as a historian of your own work. Why did you create it and who else has worked on it? Has it changed significantly over time? Have new parts of the site arisen, and if so, when? Have other parts of the site withered away, and if so, why? For instance, a section of the README file for large changes to the overall look of the site might read like a terse diary, with the most recent changes at the top:

3 November 2004 – Added search engine

23 October 2003 – Two new features added

17 September 2002 – Initial release of the site

You can also note in this file where other important elements of the site reside, if not in the same directory as the README file. For instance, if your site uses a database, this critical piece of your site’s content probably sits in a distant part of the web server, and it helps to point to that location in case someone needs to move your site to a new server. When we took on the hosting duties for the Film Study Center at Harvard University’s DoHistory website, their README file, which included extremely useful information about the architecture of the site and the technologies used, simplified the transfer.14

In short, good documentation situates and chronicles your website. When Dollar Consulting helped the Smithsonian plan how to preserve its many web resources, they noted that the large institution had a “too many chefs in the kitchen” problem, with many different people altering and updating the same web pages. Because digital files show no signs of erasure or revision, Dollar suggested recording such changes over time—an “audit trail” similar to what we have suggested.15 Experiment with this idea and our other commenting protocols; they are merely suggestions. The real key to good documentation is not a specific format or style but clarity and consistency.

We should acknowledge that our somewhat breezy approach to documentation would not sit well with some librarians who are deeply committed not just to internal consistency but to shared methods of documentation across all digital projects, to “interoperability” and “standards,” in the widely used buzzwords. Indeed, a great deal of the effort by librarians and archivists focuses on developing standard ways of describing digital objects (whether individual files or entire websites) so that they may be cogently scanned and simply retrieved in the near or distant future. The technologies these parties propose revolve around metadata, or classifications of digital objects, either placed within those objects (e.g., as XML tags within text) or in association with those objects (i.e., as the digital equivalent of a card catalog). Created by a working group of experts in preservation and having found a home and advocate in the Library of Congress, the Metadata Encoding and Transmission Standard (METS) provides its users with a specific XML descriptive standard for structuring documents in a digital archive (unlike, though complementary with, the more abstract framework of the Reference Model for an Open Archival Information System, or OAIS). This includes information about the ownership and rights of these documents (author, intellectual property rights, provenance), as well as standard ways to structure, group, and move these documents across and within archival systems. Although METS can be used to encode digital content itself in addition to providing information about the content, the Metadata Object Description Schema (MODS), also promoted by the Library of Congress and developed in concert with archival experts, more narrowly provides a standardized method for referencing digital materials, as a card catalog entry references a book using its title, author, and subject. The Dublin Core Metadata Initiative promotes an even more restricted set of document references (fifteen main elements like title and author) that may prove to be helpful, like MODS, for electronic bibliographic references and digital archives. A broader schema from the Society of American Archivists and the Library of Congress, the Encoded Archival Description (EAD), uses XML to create finding aids for digital archives.16
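
For readers curious what such descriptive metadata can look like when placed within a web page itself, the sketch below embeds a handful of the fifteen Dublin Core elements in the head of an HTML file, using the common convention of meta tags whose names begin with "DC." The title, author, dates, and rights statement are invented for illustration, and this is only one of several ways the elements can be recorded; nothing in the schemas discussed above requires this particular form.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Letters from a Hypothetical Farm, 1850-1860</title>

  <!-- DUBLIN CORE METADATA BEGINS: describes this page for catalogs and search tools -->
  <!-- The link element states that the "DC." prefix refers to the Dublin Core element set -->
  <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
  <meta name="DC.title" content="Letters from a Hypothetical Farm, 1850-1860">
  <meta name="DC.creator" content="Jane Historian">
  <meta name="DC.subject" content="Agriculture; Family correspondence">
  <meta name="DC.date" content="2004-11-03">
  <meta name="DC.type" content="Text">
  <meta name="DC.format" content="text/html">
  <meta name="DC.language" content="en">
  <meta name="DC.rights" content="Copyright 2004 Jane Historian">
  <!-- DUBLIN CORE METADATA ENDS -->
</head>
<body>
  <h1>Letters from a Hypothetical Farm, 1850-1860</h1>
</body>
</html>
```

Like comment tags, these entries are invisible to ordinary visitors but can be read by anyone (or any program) that examines the page's source.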

Despite the considerable weight behind schemas such as METS, MODS, Dublin Core, and EAD, the quest to classify digital works with standardized, descriptive metadata and thus ostensibly to make them more accessible in the future has encountered substantial criticism from both technologists and humanists. Writer and Electronic Frontier Foundation evangelist Cory Doctorow, among others, has dismissed this pursuit of the standardization of digital object descriptions as an unachievable “meta-utopia.” He skeptically catalogs some elements of human nature that militate against standardization and the implementation of reliable metadata schemas: people are “lazy” and “stupid,” and they “lie,” which makes it troubling to depend on their metadata for future retrieval. Other aspects of schemas may cause further trouble, such as the fact that “schemas aren’t neutral,” i.e., they are full of the biases of their creators and, more simply, that “there’s more than one way to describe something.” In other words, metadata schemas require such a high level of attention, compliance, and descriptive acuity as to be practically unattainable in the real world. What we are inevitably left with, Doctorow argues, are heterogeneous, somewhat confusing, partially described objects, and so in the end we have to muddle through anyway. (Or hope that a Google comes along to make decent sense of it all, despite the enormity of the corpus and the flaws within it.)17

To be sure, it would be difficult to get all or even most historians to agree on a set of rules for digital storage, even if uniformly following those rules would allow for a level of interoperability of our sites and retrievability of our work that we might all desire. But we should note that other groups of diverse people have been able to settle on common metadata standards, which have been mutually beneficial for their field or industry. For instance, the many scientists involved in the Human Genome Project decided on a standard to describe their voluminous data, thus enabling the combination of disparate gene studies. Similarly, competing manufacturers of musical equipment were able to agree on a standard for describing electronic music (Musical Instrument Digital Interface, or MIDI) so that various components would work together seamlessly. And we should not forget that whatever our views of librarians and archivists, historians have been able to do much of their research because of the helpful metadata found in card catalogs and classification schemas such as the MARC (MAchine-Readable Cataloging) format that came out of a Library of Congress–led initiative starting three decades ago. (MARC is now moving online, with Google and Yahoo having imported the library records for two million books so that links to them appear next to links for web content.) Without metadata that enables us to find documents easily and that allows various historical collections to be at least partially interoperable, we might find ourselves in a tomorrow with an obscured view of yesterday.18

12 Preserving Digital Information: Report of the Task Force on Archiving of Digital Information (Washington, D.C.: Commission on Preservation and Access, and Mountain View, Calif.: Research Libraries Group, 1996), ↪link 8.12.

13 Thomas Jefferson Foundation, Monticello: The Home of Thomas Jefferson, ↪link 8.13.

14 Film Study Center, Harvard University, DoHistory, ↪link 8.14. Many computer operating systems ignore standard alphabetizing rules and put the README file at the top of an alphabetized file list of a website so that it is the first thing you encounter in a sea of website-related files (the capitalization also helps to make it stand out).

15 Dollar Consulting, Archival Preservation of Smithsonian Web Resources: Strategies, Principles, and Best Practices (Washington, D.C.: Smithsonian Institution, 2001), ↪link 8.15.

16 Library of Congress, Metadata Encoding and Transmission Standard, ↪link 8.16a; Library of Congress, Metadata Object Description Schema, ↪link 8.16b; Dublin Core Metadata Initiative, ↪link 8.16c; Library of Congress and the Society of American Archivists, Encoded Archival Description (EAD), ↪link 8.16d.

17 Cory Doctorow, “Meta-crap: Putting the Torch to the Seven Straw-Men of the Meta-Utopia,” The WELL, ↪link 8.17.

18 Daniel J. Cohen, “Digital History: The Raw and the Cooked,” Rethinking History 8 (June 2004): 337–40; Online Computer Library Center, Open WorldCat Pilot, ↪link 8.18a; Barbara Quint, “OCLC Project Opens WorldCat Records to Google,” Information Today, ↪link 8.18b. Our thanks to Abby Smith for pointing out the successful metadata cases of the Human Genome Project and MIDI.