Document Image Classification and Retrieval

Traditional approaches to document retrieval focus on conversion to electronic text followed by indexing of the text content. Recently, some work in the community has focused on indexing document image content directly. In this talk, we will give an overview of work at Maryland on classification and indexing that scales to millions of documents. First, we present a learning-based approach for computing structural similarities among document images for unsupervised exploration in large document collections. The approach is based on multiple levels of content and structure. At a local level, a bag-of-visual-words representation based on SURF features provides an effective way of computing content similarity. The document is then recursively partitioned, and a histogram of codewords is computed for each partition. Structural similarity is computed using a random forest classifier trained with these histogram features. We experiment with three diverse datasets of document images varying in size, degree of structural similarity, and types of document images. Second, we present a scalable algorithm for segmentation-free content retrieval in document images. The contributions of this work include the use of the SURF feature for image passage retrieval, a novel indexing algorithm for efficient retrieval of SURF features, and a method to filter results using the orientation of local features and geometric constraints. Results demonstrate that logo, signature block, and stamp retrieval can be performed with high accuracy and scaled efficiently to large datasets.
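
As a rough illustration of the recursive partitioning step described in the abstract, the sketch below concatenates codeword histograms over a page and its sub-regions. It is a minimal sketch under stated assumptions, not the speaker's implementation: the quadrant split, the fixed depth, the function name partition_histograms, and the (x, y, codeword) keypoint format are all hypothetical, and the SURF extraction and codebook quantization are assumed to have happened already.

```python
from typing import List, Tuple

# Hypothetical input format: (x, y, codeword index) for each local feature,
# assumed to come from SURF keypoints already quantized against a codebook.
Keypoint = Tuple[float, float, int]


def partition_histograms(
    keypoints: List[Keypoint],
    width: float,
    height: float,
    vocab_size: int,
    depth: int,
) -> List[int]:
    """Concatenate codeword histograms over a recursive quadrant partition.

    Assumption: each region is split into four equal quadrants down to
    `depth` levels. The abstract does not say how the page is partitioned.
    """
    features: List[int] = []

    def recurse(x0: float, y0: float, x1: float, y1: float, level: int) -> None:
        # Histogram of codewords for the features inside this region.
        # Regions are half-open, so points exactly on the right or bottom
        # page edge are not counted.
        hist = [0] * vocab_size
        for x, y, word in keypoints:
            if x0 <= x < x1 and y0 <= y < y1:
                hist[word] += 1
        features.extend(hist)

        if level < depth:
            xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
            recurse(x0, y0, xm, ym, level + 1)  # top-left
            recurse(xm, y0, x1, ym, level + 1)  # top-right
            recurse(x0, ym, xm, y1, level + 1)  # bottom-left
            recurse(xm, ym, x1, y1, level + 1)  # bottom-right

    recurse(0.0, 0.0, float(width), float(height), 0)
    return features


if __name__ == "__main__":
    # Four features on a 100 x 100 page, vocabulary of three codewords.
    points = [(10, 10, 0), (90, 10, 1), (10, 90, 1), (90, 90, 2)]
    print(partition_histograms(points, 100, 100, vocab_size=3, depth=1))
```

With depth 1 and three codewords, the resulting vector has 15 entries: one histogram for the whole page followed by one per quadrant. Vectors of this kind could then serve as input features for a classifier such as the random forest mentioned in the abstract.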

Bio: Dr. David Doermann is a senior research scientist in UMIACS. He received a B.Sc. degree in Computer Science and Mathematics from Bloomsburg University in 1987, and an M.Sc. degree in 1989 in the Department of Computer Science at the University of Maryland, College Park. He continued his studies in the Computer Vision Laboratory, where he earned a Ph.D. in 1993. Since 1993, he has served as co-director of the Laboratory for Language and Media Processing in the University of Maryland's Institute for Advanced Computer Studies and as an adjunct member of the graduate faculty.

His team of researchers focuses on topics related to document image analysis and multimedia information processing. Recent intelligent document image analysis projects include page decomposition, structural analysis and classification, page segmentation, logo recognition, document image compression, duplicate document image detection, image-based retrieval, character recognition, generation of synthetic OCR data, and signature verification. In video processing, projects have centered on the segmentation of compressed-domain video sequences, structural representation and classification of video, detection of reformatted video sequences, and the performance evaluation of automated video analysis algorithms.

In 2002, he received an Honorary Doctorate of Technology Sciences from the University of Oulu for his contributions to digital media processing and document analysis research. He is a founding co-editor of the International Journal on Document Analysis and Recognition, has been the General Chair or Co-Chair of over half a dozen international conferences and workshops, and was the General Chair of the International Conference on Document Analysis and Recognition (ICDAR) held in Washington DC in 2013. He has over 30 journal publications and over 160 refereed conference papers.

He is a fellow of the IEEE and IAPR and is currently a program manager at DARPA in the Information Innovation Office (I2O).