Place: CVC Sala d’actes
Dr. Gregory Rogez – Naver Labs Europe
Dr. Carles Fernandez Tena – Herta Security
Dr. David Masip Rodó – Universitat Oberta de Barcelona
Dr. Jordi Gonzàlez Sabaté – Computer Vision Center
Dr. Josep M. Gonfaus – Computer Vision Center
Dr. F. Xavier Roca Marvà – Computer Vision Center
Fine-grained recognition, i.e. identifying similar subcategories of the same superclass, is central to human activity. Recognizing a friend, finding bacteria in microscopic imagery, or discovering a new kind of galaxy are just a few examples. However, fine-grained image recognition remains a challenging computer vision task, since the differences between two images of the same category can overwhelm the differences between two images of different fine-grained categories. In this regime, where the difference between two categories resides in subtle input changes, excessively invariant CNNs discard the details that help to discriminate between categories and focus on more obvious changes, yielding poor classification performance. On the other hand, CNNs with too much capacity tend to memorize instance-specific details, which causes overfitting.
In this thesis, motivated by the potential impact of automatic fine-grained image recognition, we tackle these challenges and demonstrate that proper alignment of the inputs, multiple levels of attention, regularization, and explicit modeling of the output space result in more accurate fine-grained recognition models that generalize better and are more robust to intra-class variation. Concretely, we study the different stages of the neural network pipeline: input pre-processing, attention to regions, feature activations, and the label space. At each stage, we address issues that hinder recognition performance on various fine-grained tasks, and devise a solution in each chapter: i) We deal with the sensitivity to input alignment in fine-grained human facial motion, such as expressions of pain. ii) We introduce an attention mechanism that allows CNNs to choose and process in detail the most discriminative regions of the image. iii) We further extend attention mechanisms to act on the network activations, allowing networks to correct their predictions by looking back at certain regions at different levels of abstraction. iv) We propose a regularization loss that prevents high-capacity neural networks from memorizing instance details by means of almost-identical feature detectors. v) Finally, we study the advantages of explicitly modeling the output space within the error-correcting framework. As a result, we demonstrate that attention and regularization, together with proper treatment of the input and output spaces, are promising directions for overcoming the problems of fine-grained image recognition.