November 08, 2004

Archives/Research Idea

I chose the Wright American Fiction, 1851-1875 website, a product of Indiana University’s Digital Library Program. The goal of the program and website is to digitize every novel published in the U.S. between 1851 and 1875. The program uses Lyle Wright's bibliography American Fiction, 1851-1875 as its guide. At present, the project has digitized 2,887 volumes by 1,450 authors.

My idea for a historical research and writing project would be to track the portrayal of Irish-Americans in American novels between 1851 and 1875. Based on a review of the Wright American Fiction website, I believe this project could be carried out much more easily with this digital archive than with a print-based archive.

The most obvious advantage of the digital archive is that it provides access to almost 3,000 novels in one convenient location. It is doubtful that many libraries or archives, other than the Library of Congress, physically hold that many 19th-century novels. Even if a researcher could visit such an archive, it would be impractical and time-consuming to request 3,000 separate books.

Far more important than simple access, though, is that the website allows researchers to search the entire collection in multiple ways. The easiest is the Simple Search function, which finds a single word or phrase anywhere in the collection. Researchers can also browse the entire collection by author or perform a word search by browsing through lists of all unique words in the text of all novels. This latter function, called the Word Index, is convenient for researchers who have developed an initial list of key words or concepts as part of the early research process. Searching these word lists can give researchers a quick feel for how salient their initial theses are. As in print indexes, the website lists words alphabetically, so a researcher can browse for variations among words or concepts. For instance, I searched on “Erin” in the Word Index and got 119 “hits” to browse on the results page. But the results page also listed 14 other words in alphabetical order, 7 above and 7 below the searched word. Thus, I was also able to quickly find results for “erie,” “erin’s,” and “erin-go-bragh.”
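To make the Word Index behavior concrete, here is a minimal sketch in Python of how an alphabetical index could return a searched word together with its neighbors. This is only my guess at the idea, not the site's actual code: the function name `index_window`, the `radius` parameter, and the tiny word list are all invented for illustration.

```python
from bisect import bisect_left

def index_window(sorted_words, query, radius=7):
    """Return up to `radius` index entries on each side of `query`,
    plus the query itself when it appears in the index."""
    pos = bisect_left(sorted_words, query)  # where the query sits alphabetically
    start = max(0, pos - radius)
    found = pos < len(sorted_words) and sorted_words[pos] == query
    # When the query is in the index, the window must extend one entry further.
    end = pos + radius + (1 if found else 0)
    return sorted_words[start:end]

# Hypothetical miniature index (the real one covers every unique word).
words = sorted({"erie", "erin", "erin's", "erin-go-bragh", "ermine", "errand"})
window = index_window(words, "erin", radius=2)
print(window)
```

With the default radius of 7 this would give the searched word plus 14 neighbors, which matches what I saw on the results page.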

Of even greater interest for a historical researcher are the Advanced Search functions, which fall into three categories. First, Citation searches allow searches by author, title, publisher, place of publication, or year of publication. (One small gripe: on the search homepage this search is labeled “Citation Searches,” but on all other links it is referred to as “Bibliography search.”) Next are Boolean searches, in which a researcher can find combinations of two or three words in a given paragraph. Related to this is the third advanced search type, the Proximity Search. This allows the searcher to find places where two or three words or phrases occur near each other (within 40, 80, or 120 characters). Both the Boolean and Proximity Searches can then be filtered by multiple categories, including author, title, publication date, and publisher.
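The difference between the Boolean and Proximity searches is easier to see in a small sketch. The Python below is my own illustration, not how the archive is implemented: the function names, the sample sentences, and the way distance is measured (characters between the end of one match and the start of the other) are all assumptions.

```python
import re

def boolean_and(paragraphs, terms):
    """Return the paragraphs that contain every term as a whole word (case-insensitive)."""
    patterns = [re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE) for t in terms]
    return [p for p in paragraphs if all(pat.search(p) for pat in patterns)]

def within_proximity(text, first, second, max_chars=40):
    """True if some occurrence of `first` and `second` lies within `max_chars` characters.
    Assumption: distance is counted from the end of the earlier match to the start of the later one."""
    firsts = list(re.finditer(re.escape(first), text, re.IGNORECASE))
    seconds = list(re.finditer(re.escape(second), text, re.IGNORECASE))
    for a in firsts:
        for b in seconds:
            if a.start() <= b.start():
                gap = b.start() - a.end()
            else:
                gap = a.start() - b.end()
            if gap <= max_chars:
                return True
    return False

# Invented sample paragraphs, not text from the collection.
paragraphs = [
    "The Irish porter poured a drink for the traveller.",
    "An Irish lad told a story.",
    "She would not drink.",
]
both = boolean_and(paragraphs, ["Irish", "drink"])
near = within_proximity(paragraphs[0], "Irish", "drink", max_chars=40)
print(both, near)
```

A Boolean search only asks whether the words share a paragraph, while a Proximity Search also asks how far apart they are, so the second is the stricter test of whether two ideas are really being linked.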

By using these search functions in concert, advanced concepts can be researched much more easily, comprehensively, and quickly than in the print medium. For instance, for my research project, I could search on multiple combinations of the words “Irish,” “drink,” “alcohol,” “amusing,” “story,” “dirty,” etc. to determine whether the stereotypes that Irish immigrants drank a lot, were good storytellers, and were slovenly appeared in 19th-century novels, and whether they changed over time.
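As a rough sketch of how such combined searches might be tallied over time, the Python below counts passages that mention “Irish” together with each stereotype word, grouped by publication year. Everything here is hypothetical: the function name, the sample passages, and the crude substring matching are my own stand-ins, and a real project would work from the archive's search results instead.

```python
from collections import Counter

ANCHOR = "irish"
STEREOTYPE_TERMS = ["drink", "alcohol", "amusing", "story", "dirty"]

def tally_by_year(passages):
    """Count, per (year, term), the passages that mention both the anchor word and the term.
    Matching is plain lowercase substring matching, which is crude but enough for a sketch."""
    counts = Counter()
    for year, text in passages:
        lowered = text.lower()
        if ANCHOR not in lowered:
            continue  # skip passages that never mention the anchor word
        for term in STEREOTYPE_TERMS:
            if term in lowered:
                counts[(year, term)] += 1
    return counts

# Invented (year, passage) pairs for illustration only.
sample = [
    (1853, "The Irish cook told an amusing story."),
    (1871, "An Irish labourer asked for a drink."),
    (1871, "A dirty street."),
]
counts = tally_by_year(sample)
print(counts)
```

Comparing the counts for early and late years would give a first, very rough indication of whether a given stereotype became more or less common.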

The structure and interface of the website are very utilitarian. Little thought has been given to design. The site’s homepage provides only text (besides a small banner with illustrations and title) in a simple layout. No frames or tables are used to separate the different search functions and information into easily readable sections. Instead, the text is separated by simple hard returns, with bolded titles as the only formatting. The homepage (which is saddled with a web address that is not easily remembered: http://www.letrs.indiana.edu/web/w/wright2/) also provides a link to the Wright American Fiction Project, but no link is provided back to the main search page. Instead, a user must use the browser’s back button.

While short on splashy design, the archive is very user-friendly and intuitive. Each page provides clear and quick navigation to all major areas of the website. However, the results page is not easy to read. Results are provided 25 at a time, and each “hit” allows the researcher to go directly to the “hit” in the document, the table of contents (if provided), and/or the first page image. While each of these options is convenient, they are presented as a mass of text on the results page that is not easily readable.

A strength of the website is its presentation of the source documents. Each page of the almost 3,000 novels has been digitally scanned. Thus, each page can be viewed as an image, and a browse button allows sequential viewing of each page in the digitized book. However, the project’s collection has not been uniformly processed. Of the 2,887 digitized volumes, only 973 have been fully edited and encoded; the other 1,914 remain to be edited. Originally, the text of all of the volumes was created using Optical Character Recognition (OCR) software. The 1,914 unedited novels can be searched and browsed using the digital page images, but they have not been proofread or corrected and may contain errors, a standard problem with OCR files that have not been reviewed. The other 973 novels have had errors corrected and SGML encoding added. These books can be viewed not only as page images but also as text. The SGML encoding has also allowed this smaller group to be given tables of contents, allowing for quicker and more focused viewing. The project plans to eventually correct and encode the remaining 1,914 novels.

In summary, while the website has a basic layout, it is generally user-friendly and allows a massive amount of information to be searched easily and in complex ways.

Posted by Matt Mc at November 8, 2004 01:09 PM