Leveraging Scene Text Information for Image Interpretation

CVC has a new PhD on its record!

Andrés Mafla successfully defended his dissertation in Computer Science on November 21, 2022, and he is now a Doctor of Philosophy from the Universitat Autònoma de Barcelona.

Download thesis

What is the thesis about?

Until recently, most computer vision models remained illiterate, largely ignoring the semantically rich and explicit information contained in scene text. Recent progress in scene text detection and recognition has allowed exploring its role in a diverse set of open computer vision problems, e.g. image classification, image-text retrieval, image captioning, and visual question answering, to name a few. The explicit semantics of scene text require specific modeling, similar to language. However, scene text is a particular signal that has to be interpreted from a comprehensive perspective that encapsulates all the visual cues in an image. Incorporating this information is a straightforward task for humans, but if we are unfamiliar with a language or script, achieving a complete understanding of the world around us is impossible (e.g. when visiting a foreign country with a different alphabet). Despite the importance of scene text, modeling it requires considering the several ways in which it interacts with an image, as well as processing and fusing an additional modality.

In this thesis, we mainly focus on two tasks: scene text-based fine-grained image classification and cross-modal retrieval. In both tasks we identify limitations of current approaches and propose plausible solutions. Concretely, in each chapter:

i) We define a compact way to embed scene text that generalizes to words unseen at training time while performing in real time.

ii) We incorporate the previously learned scene text embedding to create an image-level descriptor that overcomes optical character recognition (OCR) errors and is well suited to the fine-grained image classification task.

iii) We design a region-level reasoning network that learns semantic interactions among salient visual regions and scene text instances.

iv) We employ scene text information in image-text matching and introduce the Scene Text Aware Cross-Modal Retrieval (StacMR) task. We gather a dataset that incorporates scene text and design a model suited to the newly studied modality.

v) We identify the drawbacks of current retrieval metrics in cross-modal retrieval and propose an image captioning metric as a way of better evaluating the semantics of retrieved results. Ample experimentation shows that incorporating such semantics into a model (a possible formulation is sketched below) yields semantically better results while requiring significantly less data to converge.
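To give a flavour of the last point, the sketch below shows one plausible way a semantic adaptive margin could be wired into a bidirectional triplet loss for image-text matching: the margin demanded from a negative caption shrinks when a captioning-style metric says it is semantically close to the ground-truth caption. The function name, the PyTorch framing, and the use of a precomputed caption-similarity matrix are assumptions made for this illustration, not the exact formulation from the thesis.

```python
import torch
import torch.nn.functional as F

def semantic_adaptive_triplet_loss(img_emb, txt_emb, sem_sim, base_margin=0.2):
    """Bidirectional triplet loss whose margin adapts to caption semantics.

    img_emb : (B, D) image embeddings
    txt_emb : (B, D) caption embeddings (row i matches image i)
    sem_sim : (B, B) similarity between captions i and j in [0, 1], assumed to be
              precomputed offline with a captioning metric (hypothetical choice)
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    scores = img_emb @ txt_emb.t()                 # cosine similarities, (B, B)
    pos = scores.diag()                            # similarity of matching pairs

    # Negatives whose captions are semantically close to the positive caption
    # are pushed away less aggressively: the margin shrinks as sem_sim grows.
    margin = base_margin * (1.0 - sem_sim)

    mask = 1.0 - torch.eye(scores.size(0), device=scores.device)
    cost_i2t = (margin + scores - pos.unsqueeze(1)).clamp(min=0) * mask  # image -> text
    cost_t2i = (margin + scores - pos.unsqueeze(0)).clamp(min=0) * mask  # text -> image
    return cost_i2t.mean() + cost_t2i.mean()
```

With sem_sim set to all zeros the margin becomes constant and the loss reduces to the standard fixed-margin triplet loss commonly used in image-text matching, which is the sanity check any adaptive variant along these lines should satisfy.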

Keywords: Computer Vision, Pattern Recognition, Deep Learning, Scene Text Image Retrieval, Fine-Grained Image Retrieval, Cross-Modal Retrieval, Image-Text Matching, Vision and Language, Scene Text Aware Cross-Modal Retrieval, Semantic Adaptive Margin, COCO-Text Captioned (CTC) Dataset.