RLG FAQ

Handwriting Recognition for Historical Documents

Author: Richard Entlich

OCR (Optical Character Recognition) seems to be widely used for providing searchable indexes of printed texts that have been scanned. Is it possible to do something similar with handwritten manuscripts and correspondence?

OCR Background

OCR is used to generate machine-readable text from printed documents. These are generally legacy documents from before the electronic publishing era, but may also be printed documents for which the original machine-readable text was discarded or lost.

OCR of printed text is a well-developed technology that has steadily improved in accuracy and flexibility. Early systems were limited to interpreting numerals printed in special fonts; present-day OCR software can handle a multitude of fonts, character sets, languages, and page attributes. For extremely clean and well-scanned documents, the resulting text may be good enough for direct display. More commonly, the OCR output is somewhat "dirty" (i.e., contains errors) but is still accurate enough to form the basis of a quite usable machine-searchable index. Accuracy rates of 99.5% and higher (at the character level) are achievable for good-quality source documents.

Handwriting Recognition Background

The conversion of handwriting to machine-readable text is usually referred to as handwriting recognition (HR). Computer scientists recognize two distinct classes of handwriting recognition. The better known of these is on-line HR, a real-time process usually employing a special stylus and pressure-sensitive tablet that allows the direction and order of the writer's strokes to be monitored while writing. First popularized by the Apple Newton MessagePad, on-line HR is now available on most PDAs (Personal Digital Assistants).

The process of converting an existing handwritten document into machine-readable text is called off-line HR and is more closely analogous to OCR. Off-line HR is a far more daunting computing task and, as a result, is not as mature a technology as either OCR or on-line HR. The reasons are not hard to fathom.

Unlike printed text (that is, machine-produced type), handwriting is subject to almost infinite variation. Cursive writing, in particular, can easily defeat human attempts to interpret it, as anyone who has attempted to decipher a doctor's handwriting can attest. Machine interpretation relies on reducing the scanned image to some kind of recognizable pattern. Patterns may be missed because of vague word boundaries, overlapping letters, and great variations in the slant, spacing, and shape of letters. Such variations may be modest within the writings of a single author, but are tremendously magnified across multiple authors. Further hampering recognition, handwritten documents tend to be "noisier" than printed ones due to smudging, staining, stray marks, underlining, and cross-outs.

Thus, early work in off-line HR, like that in OCR, focused on small, simple character sets such as numerals. Even today, much research and development is focused on highly constrained tasks such as reading cities, states and zip codes on hand-addressed mail, interpreting the dollar amount line on bank checks, or deciphering business forms, such as tax returns.

Methods for Off-line Handwriting Recognition of Historical Documents


Figure 1. A portion of a scanned page from the Library of Congress's George Washington manuscripts. Rectangles have been drawn to show where the words were segmented, and dark lines resulting from the scanning process have been removed from the sides. Note that the segmentation process is not perfect: "Winchester" in the fifth line and "Nicholas" in the next-to-last line have each been divided into two parts.[1]

A small but steady stream of computer scientists has been trying to tackle the difficult task of deciphering cursive handwriting. The desire to improve access to large collections of important historical manuscripts has motivated most of this work. Scanned versions of the papers of Isaac Newton, U.S. presidents (especially George Washington), and the Archives of the Indies in Seville, Spain, to name a few, have served as recent experimental fodder.

In most cases, the objective of these experiments is less ambitious than full machine transcription of handwriting. Instead, the goal is usually to recognize a subset of the most commonly used vocabulary (anywhere from a few hundred to one or two thousand words), usually within the writings of a single author. That vocabulary then serves as an index to support text queries. Limitations on vocabulary and authorship are intended to simplify the computational task so it can be done in a reasonable period of time, at an acceptable cost, and with a usable degree of accuracy.

Here are descriptions of a few of the different techniques being investigated:

Character segmentation attempts to identify individual characters and build them into words. This is exceedingly difficult to do with any degree of accuracy.

Word segmentation attempts to detect word boundaries, often supplemented by other document cleaning and filtering operations such as artifact removal, normalization of slant, smoothing, and binarization (converting grayscale images to bitonal). An effort can then be made to recognize the pattern made by an entire word and convert it to machine readable form without trying to identify individual characters.
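To make the word segmentation idea concrete, here is a minimal sketch (in Python, using only NumPy) of a gap-based segmenter for a single line of text: the image is binarized with a global threshold, the per-column ink counts are smoothed, and runs of empty columns longer than a minimum gap are treated as word boundaries. The function names and the threshold and gap values are illustrative only and are not taken from the research described here.

import numpy as np

def binarize(gray, threshold=128):
    """Convert an 8-bit grayscale line image (2-D array) to bitonal: True = ink."""
    return gray < threshold

def smooth_profile(profile, window=5):
    """Moving-average smoothing of a 1-D per-column ink profile to suppress noise."""
    kernel = np.ones(window) / window
    return np.convolve(profile, kernel, mode="same")

def segment_words(gray, threshold=128, min_gap=12):
    """Return (start, end) column ranges of candidate words in a text-line image.
    Runs of at least min_gap empty columns are treated as inter-word gaps."""
    ink = binarize(gray, threshold)
    profile = smooth_profile(ink.sum(axis=0).astype(float))
    words, in_word, start, gap = [], False, 0, 0
    for col, value in enumerate(profile):
        if value > 0:
            if not in_word:
                in_word, start = True, col
            gap = 0
        elif in_word:
            gap += 1
            if gap >= min_gap:
                words.append((start, col - gap))
                in_word = False
    if in_word:
        words.append((start, len(profile) - 1))
    return words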

Original grayscale image
Binarized with artifacts removed
With slant correction

Figure 2. Processing and normalization steps on a segmented word image prior to image matching.[2]

Word spotting is a form of off-line HR using word segmentation. In word spotting, the segmented words are first normalized to minimize variation and then similar images, which hopefully represent the same word, are clustered together. These groupings are called equivalence classes. No machine interpretation is done, only image matching. The groups of matched words are then displayed to a human operator who provides the text equivalent. Figure 3 shows a simplified diagram of the word spotting process, though stop words like "the" and "that" would normally be discarded. A subset of the most frequently occurring remaining words is used to create an index of the document.
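The equivalence-class idea can be illustrated with a toy sketch: word images are crudely normalized to a fixed size and then greedily grouped whenever a simple pixel-difference distance falls below a threshold. The actual research uses far more sophisticated matching, and the distance measure and threshold below are placeholders, but the cluster-then-label workflow is the same in spirit.

import numpy as np

def normalize_word(img, shape=(32, 128)):
    """Crudely normalize a bitonal word image to a fixed size by nearest-neighbor sampling."""
    rows = np.linspace(0, img.shape[0] - 1, shape[0]).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, shape[1]).astype(int)
    return img[np.ix_(rows, cols)].astype(float)

def image_distance(a, b):
    """Mean pixel-wise difference between two normalized word images."""
    return np.abs(a - b).mean()

def cluster_words(word_images, threshold=0.15):
    """Greedy grouping into equivalence classes: each image joins the first class
    whose representative is within the threshold, otherwise it starts a new class."""
    classes = []  # list of (representative_image, member_indices)
    for idx, img in enumerate(word_images):
        norm = normalize_word(img)
        for rep, members in classes:
            if image_distance(norm, rep) < threshold:
                members.append(idx)
                break
        else:
            classes.append((norm, [idx]))
    return classes

# A human operator would then label each class once ("Winchester", "Alexandria", ...),
# and the labels, with the page and line positions of the member images, become the index.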

Word spotting has also been applied in multiple-author environments where word segmentation is not feasible, using different image matching techniques.


Figure 3. A conceptual diagram of the word spotting technique for indexing matched word images.[3]

Statistical methods built on word segmentation are also being explored. Within a set of documents by a single author, a training subset is word segmented and manually transcribed. The images of the words are described using a highly formalized language based on the features (size, sequence of hills and valleys, etc.) of the particular image. The statistical correlation of the transcribed words with their feature-based descriptions is recorded for the entire set of training documents. Subsequently, a textual query can be made against a set of documents from the same collection that have been word segmented and feature described, but not transcribed. The query returns a set of word images (within a single line of the original document) most likely to match the query terms.
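A much-simplified sketch of this statistical approach: from the transcribed training subset, count how often each discretized feature value co-occurs with each word, then score untranscribed word images against a query term under a naive independence assumption. The features used here (a width bin, ascender count, and descender count) and all the numbers are invented for illustration and do not reflect the feature language used in the cited work.

import math
from collections import defaultdict

def train(training_set):
    """training_set: (word_text, feature_tuple) pairs from the manually transcribed
    subset. Counts how often each feature value co-occurs with each word."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for word, features in training_set:
        for position, value in enumerate(features):
            counts[word][(position, value)] += 1
        totals[word] += 1
    return counts, totals

def score(query, features, model, smoothing=1.0, bins=10):
    """Log-likelihood that an untranscribed word image with these discretized
    features depicts the query word, assuming the features are independent."""
    counts, totals = model
    total = totals.get(query, 0)
    result = 0.0
    for position, value in enumerate(features):
        seen = counts[query].get((position, value), 0)
        result += math.log((seen + smoothing) / (total + smoothing * bins))
    return result

# Toy example: features are (width bin, ascender count, descender count).
model = train([("Fort", (3, 1, 0)), ("Fort", (3, 1, 0)), ("Winchester", (9, 2, 0))])
candidates = {"page 12, line 4, word 7": (3, 1, 0),
              "page 12, line 4, word 8": (9, 2, 0)}
print(max(candidates, key=lambda name: score("Fort", candidates[name], model)))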

Transcript mapping is a technique used when a transcript of a handwritten document has been created, but it is unknown how the transcript corresponds to the location of words (pages, lines, and line position) in the original document. The existence of a transcript defines the vocabulary of the document, leaving the still non-trivial task of determining precisely where those words occurred.
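One simple way to think about transcript mapping is as a sequence alignment problem: the transcript words and the segmented word boxes are matched by dynamic programming, using a crude cost such as the mismatch between a word's character count and a box's pixel width. The sketch below takes that approach; the cost function and skip penalty are purely illustrative and are not drawn from the cited work.

def map_transcript(words, box_widths, px_per_char=14.0, skip=20.0):
    """Align transcript words to segmented word boxes by dynamic programming.

    words:      transcript words, in reading order
    box_widths: pixel widths of the segmented word images, in reading order
    Returns (word, box_index) pairs for the cheapest alignment; skips are allowed
    on either side, e.g. for over- or under-segmented boxes."""
    n, m = len(words), len(box_widths)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * skip
    for j in range(1, m + 1):
        cost[0][j] = j * skip
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = abs(len(words[i - 1]) * px_per_char - box_widths[j - 1])
            cost[i][j] = min(cost[i - 1][j - 1] + match,
                             cost[i - 1][j] + skip,
                             cost[i][j - 1] + skip)
    pairs, i, j = [], n, m  # trace back to recover the assignments
    while i > 0 and j > 0:
        match = abs(len(words[i - 1]) * px_per_char - box_widths[j - 1])
        if cost[i][j] == cost[i - 1][j - 1] + match:
            pairs.append((words[i - 1], j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + skip:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))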

Commentary

The amount of research activity and the variety of clever techniques being utilized in off-line HR should be gratifying for the archivists who maintain, and the scholars who utilize, handwritten historical documents. However, it should be noted that none of the work described here appears ready to emerge from the laboratory anytime soon.

Unconstrained machine transcription of handwriting appears particularly far off, and may be unachievable. Even a less ambitious goal, such as software that reliably creates partial indexes from good-quality, single-author material, is unlikely to be met within the next several years.

However, enough progress seems to have been made for librarians, archivists, and scholars to become more involved in the ongoing research. Until now, there appears to have been little participation by those parties other than to provide sample documents and, on occasion, to serve on advisory boards.

For librarians and archivists, the future potential for machine transcription should at least be considered when handwritten historical documents are digitized, particularly large collections by authors with legible handwriting. Since documents deemed worthy of digitization are likely to be of greater than usual significance, they are also good candidates for transcription and/or indexing. Accurate off-line HR depends on scans with minimal noise and artifacts, so some additional effort to create very clean scans may be merited.

For those documents deemed significant enough to merit full manual transcription, the transcripts should record page, line, and word position to facilitate the creation of indexes that can pinpoint a search term's location in the scanned document. (This presumes the document can be word-segmented, so the nature of the author's handwriting is again a consideration.)
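Such a positional transcript could be kept in a very simple structured form, for example one record per word. The field names below are invented for illustration; any equivalent scheme that records page, line, and word position would serve.

import json

# A hypothetical positional transcript record: one entry per word, so that an index
# built later can point back to the exact spot on the scanned page.
entry = {
    "document": "gw-series2-letterbook1",  # collection identifier (invented)
    "page": 27,
    "line": 5,
    "word": 3,                             # position within the line
    "text": "Winchester",
}
print(json.dumps(entry, indent=2))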

Archivists could also advise computer scientists about how best to produce indexes that would interoperate smoothly with existing machine-readable finding aid standards such as EAD (Encoded Archival Description).

From the computer science side, more consultation with archivists and librarians familiar with the scanning of historical documents could avoid certain costly mistakes. For example, some of the researchers spent time cleaning up highly compressed JPEG files that suffered from severe artifacting around the text, instead of starting out with uncompressed or losslessly compressed TIFFs.

Others have worked with low-resolution grayscale images that they have binarized using static thresholding techniques (that is, a single threshold value was used to binarize an entire page or collection of pages). Historical documents are usually scanned at 8-bit grayscale because they tend to be too tonally rich for satisfactory bitonal capture. However, some of the computer scientists seemed unaware of the availability of scanning software capable of dynamic thresholding and automatic background detection and suppression. Such software can produce bitonal scans with uniform contrast and text legibility even from originals with stains, fading, and uneven ink density.
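The difference between the two approaches can be shown in a few lines of Python (using NumPy and SciPy): a static threshold compares every pixel against one global value, while a dynamic (locally adaptive) threshold compares each pixel against the mean of its neighborhood, so stains and fading are largely suppressed. The block size and offset values here are illustrative, not recommendations.

import numpy as np
from scipy import ndimage

def static_binarize(gray, threshold=128):
    """One global threshold for the whole page; fails on stained or faded regions."""
    return gray < threshold

def dynamic_binarize(gray, block=51, offset=10):
    """Local-mean thresholding: each pixel is compared with the average of its
    neighborhood, so gradual background variation (stains, fading) is suppressed."""
    local_mean = ndimage.uniform_filter(gray.astype(float), size=block)
    return gray < (local_mean - offset)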

In the meantime, if it is discovered that certain scanning practices would substantially improve the prospect for usable HR of historical documents, those standards should be promulgated to libraries and archives for consideration.

Finally, it is unclear what role scholars are playing in the development of systems for HR on their behalf. Though most of the details of crafting a successful off-line HR system fall within computer science and closely related realms, there are certain questions that only the end users of historical documents are in a good position to answer.

What corpora of historical documents would benefit most from being made searchable? If a search vocabulary has to be whittled down in size in order to reduce the computational load, which terms should be given priority for retention? Should the most common terms be kept, or should personal names, place names, or dates be preferred? What degree of inaccuracy can be tolerated before an index loses its value?

Conclusion

There is as yet no commercial or open source software for automatic transcription of, or the creation of searchable indexes from, handwritten historical documents. However, it is an active area of research and progress is being made. Continued advancement depends on the availability of funding. Librarians, archivists, and scholars may be able to push the agenda more effectively by partnering with computer scientists who share an interest in solving this challenging problem and improving access to significant historical archives.

Further Reading

Note: Much of the literature of off-line HR is highly technical. Some of the following papers provide a general overview of the subject, while others are best read for their abstracts, introductions, and conclusions (unless, of course, hidden Markov models and affine transforms are your cup of tea). All documents are PDFs.

Kane, Shaun, Andrew Lehman, Elizabeth Partridge, "Indexing George Washington's Handwritten Manuscripts: A Study of Word Matching Techniques." Technical Report of the Center for Intelligent Information Retrieval, University of Massachusetts, 2001.

Keaton, Patricia, Hayit Greenspan and Rodney Goodman, "Keyword Spotting for Cursive Document Retrieval," Proceedings of the IEEE Workshop on Document Image Analysis (DIA '97), in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '97), June 1997, San Juan, Puerto Rico, pp. 74-81.

Koerich, A. L., R. Sabourin, C. Y. Suen, "Large Vocabulary Off-Line Handwriting Recognition: A Survey," Pattern Analysis and Applications, v. 6, no. 2, pp. 97-121, July 2003.

Manmatha, R., "Word Spotting: Indexing Handwritten Manuscripts," DLI2/IMLS/NSDL Principal Investigators Meeting, Portland, Oregon, July 17-18, 2002.

Plamondon, Réjean and Sargur N. Srihari, "On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, v. 22, no. 1, January 2000.

Rath, Toni M., Victor Lavrenko and R. Manmatha, "A Statistical Approach to Retrieving Historical Manuscript Images without Recognition." Technical Report of the Center for Intelligent Information Retrieval, University of Massachusetts, 2003.

Tomai, Catalin I., Bin Zhang and Venu Govindaraju, "Transcript mapping for Historic Handwritten Document Images," Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition (IWFHR'02), pp. 413-418, September, 2002.

Verma, B., M. Blumenstein and S. Kulkarni, "Recent Achievements in Off-Line Handwriting Recognition Systems," International Conference on Computational Intelligence and Multimedia Applications (ICCIMA '98), Melbourne, Australia, pp. 27-33, 1998.

Notes

[1] Image courtesy of R. Manmatha, Center for Intelligent Information Retrieval, University of Massachusetts.

[2] Originally published in Rath, T.M., S. Kane, A. Lehman, E. Partridge and R. Manmatha, "Indexing for a Digital Library of George Washington's Manuscripts: A Study of Word Matching Techniques," Technical Report of the Center for Intelligent Information Retrieval, University of Massachusetts, 2002. Used with permission.

[3] Adapted from Manmatha, R., "Word Spotting: Indexing Handwritten Manuscripts," DLI2/IMLS/NSDL Principal Investigators Meeting, Portland, Oregon, July 17-18, 2002. Used with permission.


Copyright 2004 RLG.