Place: Large Lecture Room
Affiliation: Computer Science Department, University of Fribourg
This talk addresses recent advances in pattern recognition methods for handwriting recognition in historical documents. These methods aim to extract textual content automatically from digitized manuscript images. Once their text is available, millions of historical manuscript images could be integrated into digital libraries, which would help preserve our cultural heritage by making it readily accessible to researchers and the public.
Two state-of-the-art strategies for modeling and recognizing characters, words, and sentences are discussed. The first is a generative strategy using hidden Markov models (HMMs); the second is a discriminative strategy using a special form of recurrent neural networks (NNs). These learning-based systems are generic in the sense that they can learn character appearance models for arbitrary alphabetic languages, provided that training samples are available. They operate at the level of text lines and thus avoid prior word and character segmentation, which is error-prone in the presence of touching characters, broken characters, variable word spacing, and difficult image conditions caused, e.g., by paper texture, damaged parchment, faded ink, and ink bleed-through.
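To illustrate how recognition at the text-line level can avoid prior character segmentation, the following is a minimal sketch of Viterbi decoding with a toy HMM. It is not the system presented in the talk. The two character states, the frame symbols "thin" and "wide", and all probabilities are hypothetical values chosen for illustration, and each character is reduced to a single state for brevity. Character boundaries are not given as input; they fall out of the most likely state path.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for a sequence of frame symbols."""
    # best[t][s]: log-probability of the best path that ends in state s at frame t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score into s
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

def collapse(path):
    """Merge runs of identical states into one character each (a simplification:
    doubled letters would need multi-state character models to be kept apart)."""
    chars = []
    for s in path:
        if not chars or chars[-1] != s:
            chars.append(s)
    return chars

def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}

# Hypothetical toy model: two characters, two quantized frame symbols
states = ["a", "b"]
log_start = {"a": math.log(0.5), "b": math.log(0.5)}
log_trans = logs({"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.3, "b": 0.7}})
log_emit = logs({"a": {"thin": 0.9, "wide": 0.1}, "b": {"thin": 0.2, "wide": 0.8}})

# Frame symbols as a sliding window over a text line image might produce them
frames = ["thin", "thin", "wide", "wide", "wide", "thin"]
path = viterbi(frames, states, log_start, log_trans, log_emit)
print(path, collapse(path))
```

In this sketch the frame sequence is decoded as a whole, so the positions where the state path changes act as character boundaries found during recognition rather than before it.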
Four subproblems of handwriting recognition in historical documents are addressed in this talk: ground truth creation, automatic transcription, keyword spotting, and transcription alignment. Experimental results are presented for several historical scripts and languages. The IAM historical document database (IAM-HistDB) includes Latin texts from the 9th century written in Carolingian minuscule (Saint Gall database), medieval German texts from the 13th century written in Gothic minuscule (Parzival database), and longhand English texts from the 18th century (George Washington database). The experimental results are promising in terms of accuracy, speed, and cost for indexing historical documents in digital libraries.