Learning to Represent Handwritten Shapes and Words for Matching and Recognition

CVC has a new PhD on its record!

Jon Almazán successfully defended his dissertation in Computer Science on October 21, 2014, and is now a Doctor of Philosophy from the Universitat Autònoma de Barcelona.

Download thesis

What is the thesis about?

Writing is one of the most important forms of communication, and for centuries handwriting was the most reliable way to preserve knowledge. Even today, despite the development of printing and electronic devices, handwriting is still broadly used for taking notes, making annotations, and sketching ideas. A huge number of handwritten documents, some of them of incalculable cultural value, have recently been digitized so that they can be easily accessed. This has made it necessary to develop methods able to extract information from these document images.

Transferring to computers the ability to understand handwritten text or recognize handwritten shapes has been the goal of much research, owing to its importance for many different fields. However, designing good representations for handwritten shapes, e.g. symbols or words, is a very challenging problem because of the large variability of these kinds of shapes. One consequence of working with handwritten shapes is that representations need to be robust, i.e., able to adapt to large intra-class variability. They need to be discriminative, i.e., able to learn what the differences between classes are. And they need to be efficient, i.e., fast to compute and to compare. Unfortunately, current techniques of handwritten shape representation for matching and recognition do not fulfill some or all of these requirements.

Throughout this thesis we focus on the problem of learning to represent handwritten shapes for retrieval and recognition tasks. Concretely, in the first part of the thesis we address the general problem of representing any kind of handwritten shape. We first present a novel shape descriptor based on a deformable grid that deals with large deformations by adapting to the shape, and whose cells can be used to extract different features. We then propose to use this descriptor to learn statistical models, based on the Active Appearance Model, that jointly learn the variability in structure and texture of a given class.

In the second part we focus on a concrete application: representing handwritten words for the tasks of word spotting, where the goal is to find all instances of a query word in a dataset of images, and recognition. First, we address the segmentation-free problem and propose an unsupervised, sliding-window-based approach that achieves state-of-the-art results on two public datasets. Second, we address the more challenging multi-writer problem, where the variability in words increases dramatically. We describe an approach in which both word images and text strings are embedded in a common vectorial subspace, so that those representing the same word lie close together. This is achieved by a combination of label embedding, attribute learning, and a common subspace regression, leading to a low-dimensional, unified representation of word images and strings. The resulting method allows one to perform both image and text searches, as well as image transcription, in a unified framework. We evaluate our methods on different public datasets of both handwritten documents and natural images, showing results comparable to or better than the state of the art on spotting and recognition tasks.
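To give a flavor of the label embedding idea, here is a minimal sketch in the spirit of a pyramidal histogram-of-characters (PHOC) embedding, a common instantiation of this line of work: a text string is mapped to a fixed-length binary attribute vector, and word images can then be mapped into the same space via attribute classifiers and a common subspace regression. The alphabet, pyramid levels, and overlap threshold below are illustrative assumptions, not necessarily the exact configuration used in the thesis.

    # Sketch of a PHOC-style label embedding (illustrative, not the
    # thesis's exact setup): each pyramid level splits the word into
    # equal regions, and each region records which characters occur in it.
    import numpy as np

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"
    LEVELS = (1, 2, 3)  # assumed pyramid levels: whole word, halves, thirds

    def phoc(word):
        """Binary character-occurrence vector, one block per pyramid region."""
        word = word.lower()
        n = len(word)
        blocks = []
        for level in LEVELS:
            for region in range(level):
                lo, hi = region / level, (region + 1) / level
                bits = np.zeros(len(ALPHABET))
                for i, ch in enumerate(word):
                    # character i occupies the normalized span [i/n, (i+1)/n)
                    c_lo, c_hi = i / n, (i + 1) / n
                    overlap = min(hi, c_hi) - max(lo, c_lo)
                    # assign the character to a region if at least half
                    # of its span falls inside that region (assumed rule)
                    if ch in ALPHABET and overlap / (c_hi - c_lo) >= 0.5:
                        bits[ALPHABET.index(ch)] = 1
                blocks.append(bits)
        return np.concatenate(blocks)

    print(phoc("word").shape)  # (1 + 2 + 3) * 26 = 156 attributes

Because the embedding of a string is deterministic, a query can be typed text or a word image once both are projected into the shared space, which is what makes search and transcription possible within a single framework.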