
Following The Pattern: Scene Text Spotting Guided by Regular Expressions
Sergi García Bordils successfully defended his dissertation in Computer Science on October 21, 2024, and he now holds a Doctor of Philosophy degree from the Universitat Autònoma de Barcelona.
What is the thesis about?
Scene-Text Recognition (STR) is a sub-field of computer vision that tackles the problem of text localization and recognition in natural images. Since scene-text provides crucial semantic information for high-level tasks, continued research interest has resulted in great leaps in performance. Much of this success is thanks to the surge of deep learning, which has significantly pushed the capabilities of STR models. However, these models adopt a purely generic approach toward text extraction, where all text is treated indiscriminately and the possible semantics of the textual content are ignored. We identify and study two main disadvantages that are a consequence of this generic nature.
The first one is the reliance on vocabulary priors by the recognition step, which can degrade recognition performance on unseen words and morphological constructions. The second one is related to the detection granularity, which we define as the boundary at which the network separates text into individual instances. Most networks establish this localization boundary at word level. If our downstream application requires textual expressions that feature spaces or line breaks, generic STR detectors will split them into different instances.
First, we study the phenomenon of vocabulary reliance with the creation of the Out-of-Vocabulary challenge, a novel STR benchmark that distinguishes between performance on seen and unseen vocabulary.
Using this benchmark, we organized a competition where participants had to train their models for both in-vocabulary and out-of-vocabulary performance. The evaluation of the participants' models and our baselines allowed us to assess the impact of language reliance on STR.
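As a rough illustration of what distinguishing seen from unseen vocabulary can look like, the sketch below scores recognition results separately for ground-truth words that appear in the training vocabulary and for those that do not. The function name, the exact-match accuracy metric and the case-insensitive vocabulary lookup are assumptions made for this example; they are not taken from the benchmark's actual evaluation protocol.

```python
def split_word_accuracy(samples, train_vocabulary):
    """Compute exact-match accuracy separately for seen and unseen words.

    samples: iterable of (ground_truth, prediction) string pairs.
    train_vocabulary: iterable of words seen during training.
    Returns a dict with one accuracy per subset (None if a subset is empty).
    """
    # Assumption: vocabulary membership is checked case-insensitively.
    vocab = {word.lower() for word in train_vocabulary}

    # For each subset we keep [correct, total].
    counts = {"in_vocabulary": [0, 0], "out_of_vocabulary": [0, 0]}

    for ground_truth, prediction in samples:
        subset = (
            "in_vocabulary"
            if ground_truth.lower() in vocab
            else "out_of_vocabulary"
        )
        counts[subset][1] += 1
        # Assumption: a prediction counts only if it matches exactly.
        if prediction == ground_truth:
            counts[subset][0] += 1

    return {
        subset: (correct / total if total else None)
        for subset, (correct, total) in counts.items()
    }


if __name__ == "__main__":
    # Toy data, invented for the example.
    vocabulary = ["hello", "world", "foo"]
    results = [
        ("hello", "hello"),
        ("world", "word"),
        ("foo", "foo"),
        ("xk42", "xk42"),
        ("zq9", "zq8"),
    ]
    print(split_word_accuracy(results, vocabulary))
```

Reporting the two numbers side by side makes a gap between seen and unseen words visible, which a single aggregate score would hide.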
Then, we introduce the task of Structured Scene-Text Spotting, a novel task where STR models have to spot the text instances that match a given regular expression (regex). We also introduce the Structured Scene-Text Spotting Test dataset, which contains many classes of text that follow regular expressions. These instances are not found in any vocabulary and contain spaces and multi-line text, which allows us to probe our two main concerns about generic STR. As opposed to these generic models, we propose leveraging the given regex directly in the spotting pipeline, guiding the detection and recognition to directly spot the target text, while ignoring the rest.
We show that our two proposed approaches, STEP and STEPup, obtain better end-to-end results than the generic STR baselines.
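To make the task concrete, the sketch below shows a naive post-hoc filter over the word-level output of a generic spotter: because the target expression may contain spaces, consecutive words have to be re-joined before they can be tested against the regex. This is not STEP or STEPup, which use the regex inside the spotting pipeline; it only illustrates the granularity problem described above. The function name, the `max_words` limit and the example patterns are hypothetical.

```python
import re


def match_structured_text(words, pattern, max_words=4):
    """Find runs of consecutive words that fully match a regex.

    words: recognized word strings, assumed to be in reading order.
    pattern: regular expression describing the target text.
    max_words: hypothetical cap on how many words one instance may span.
    """
    regex = re.compile(pattern)
    matches = []

    for start in range(len(words)):
        last = min(start + max_words, len(words))
        for end in range(start + 1, last + 1):
            # Re-join word-level instances with single spaces, since a
            # generic detector splits the expression at word boundaries.
            candidate = " ".join(words[start:end])
            if regex.fullmatch(candidate):
                matches.append(candidate)

    return matches


if __name__ == "__main__":
    # Invented word-level output of a generic spotter.
    spotted = ["OPEN", "1234", "ABC", "until", "21/10/2024"]

    # Hypothetical plate-like pattern that contains a space.
    print(match_structured_text(spotted, r"\d{4} [A-Z]{3}"))

    # Hypothetical date pattern contained in a single word.
    print(match_structured_text(spotted, r"\d{2}/\d{2}/\d{4}"))
```

A filter like this depends on the reading order and on every word being recognized correctly, since a single wrong character prevents the match.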
Keywords
Scene-Text, STR, Text Detection, Text Recognition, Text Detection and Recognition, Regex Spotting, Structured Text Spotting.
