Place: Large Lecture Room
Affiliation: Computer Vision Centre and Dept. of Computer Science, UAB.
The release of challenging datasets containing vast numbers of images requires the development of efficient image representations and algorithms that can manipulate these large-scale datasets. Nowadays, the Bag-of-Words (BoW) image representation is the most successful approach to object and scene classification tasks. Its main drawback, however, is that it discards important spatial information. Spatial pyramids (SP) have been successfully applied to incorporate spatial information into the BoW-based image representation. The standard SP approach works by repeatedly subdividing the image into increasingly finer sub-regions, doubling the number of divisions along each axis, and computing histograms of features over the resulting sub-regions. Given the remarkable performance of spatial pyramids, their growing number of applications across a broad range of vision problems, and their inclusion of geometric information, a natural question arises: what are the limits of spatial pyramids? Within the SP framework, finding the optimal spatial image representation, one that copes with its foremost shortcomings, namely its high dimensionality and the rigidity of the resulting image representation, remains an active research domain. In summary, the main concern of this thesis is to explore the limits of spatial pyramids and to propose solutions to these shortcomings. This thesis studies the problem of obtaining compact, adaptive, yet informative spatial image representations in the context of object and scene classification tasks. In the first part of this thesis, we analyze the implications of directly applying state-of-the-art compression techniques to obtain compact BoW-based image representations within the context of spatial pyramids.
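The standard SP construction described above can be sketched as follows; function and parameter names are illustrative, not taken from the thesis:

```python
import numpy as np

def spatial_pyramid_histogram(points, words, vocab_size, num_levels, width, height):
    """Build a spatial pyramid histogram of visual words.

    points: (N, 2) array of (x, y) feature locations.
    words:  (N,) array of visual-word indices in [0, vocab_size).
    At level l the image is split into 2**l x 2**l cells (doubling the
    divisions along each axis per level); a BoW histogram is computed
    per cell and all histograms are concatenated.
    """
    parts = []
    for level in range(num_levels):
        divisions = 2 ** level
        cell_w = width / divisions
        cell_h = height / divisions
        for row in range(divisions):
            for col in range(divisions):
                # select the features falling inside this cell
                in_cell = ((points[:, 0] // cell_w == col) &
                           (points[:, 1] // cell_h == row))
                parts.append(np.bincount(words[in_cell], minlength=vocab_size))
    return np.concatenate(parts)
```

Note how the dimensionality grows as vocab_size * (1 + 4 + 16 + ...), which is exactly the high-dimensionality issue the thesis targets.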
We then introduce a novel SP compression technique that works on two levels: (i) compressing the least informative spatial pyramid features, followed by (ii) compressing the least informative SP regions, for the purpose of obtaining a compact and adaptable SP. We also introduce a new texture descriptor that represents local image texture together with its spatial layout. Texture is represented as a compact vector descriptor suitable for use in standard kernel-based learning algorithms. Experimental results show that texture information achieves classification performance similar to, and sometimes better than, methods using only shape or appearance information. The resulting spatial pyramid representation demonstrates significantly improved performance on challenging scene classification tasks. In the second part of this thesis, we present a novel technique for building adaptive spatial pyramids. In particular, we investigate several approaches for learning adaptive spatial pyramids specially tailored to the task at hand. To this end, we analyze the use of (i) standard generic 3D scene geometries, where the geometry of a scene is estimated from image statistics taken from a single image, and (ii) discriminative spatial partitionings generated by an information-theoretic approach. The proposed method is tested on several challenging benchmark object classification datasets. The results clearly demonstrate the effectiveness of adaptive spatial representations steered by the 3D scene geometry present in images. In the third part of this thesis, we investigate the problem of obtaining compact spatial pyramid image representations for object and scene classification tasks. We present a novel framework that compresses the spatial pyramid image representation by up to an order of magnitude without any significant reduction in accuracy.
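The idea of discarding the least informative pyramid dimensions can be illustrated with a minimal sketch. The thesis's actual selection criterion is information-theoretic; as a stand-in, this hypothetical function scores each dimension by the spread of its per-class means, a simple discriminability proxy:

```python
import numpy as np

def prune_pyramid_dims(X, y, keep_fraction=0.5):
    """Keep only the most informative spatial-pyramid dimensions.

    X: (num_images, D) stacked pyramid histograms; y: class labels.
    Scores each dimension by the variance of its per-class means
    (an illustrative proxy, not the thesis's exact criterion) and
    keeps the top fraction of dimensions.
    """
    classes = np.unique(y)
    class_means = np.stack([X[y == c].mean(axis=0) for c in classes])
    scores = class_means.var(axis=0)   # large spread across classes = informative
    d_keep = max(1, int(keep_fraction * X.shape[1]))
    kept = np.sort(np.argsort(scores)[::-1][:d_keep])
    return X[:, kept], kept
```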
Moreover, we investigate the optimal combination of multiple features, such as color and shape, within the context of our novel compact pyramid representation. Finally, we investigate the importance of spatial knowledge within the context of color constancy as an application. To this end, we present a novel framework for estimating the image illuminant based on spatial 3D geometry, learning the most appropriate color constancy algorithm to use for every image region. The final image illuminant is obtained as a weighted combination of the individual illuminant estimates obtained per region. We test our method and compare its performance to that of previous state-of-the-art methods, showing that the set of innovations introduced here leads to a significant increase in performance on challenging color constancy datasets.
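The final fusion step, a weighted combination of per-region illuminant estimates, can be sketched as follows; the function name and weighting scheme are illustrative assumptions, not the thesis's exact formulation:

```python
import numpy as np

def combine_illuminants(estimates, weights):
    """Fuse per-region illuminant estimates into one global estimate.

    estimates: (R, 3) array, one RGB illuminant estimate per region,
    e.g. produced by the color constancy algorithm selected for that
    region. weights: (R,) non-negative region weights. Returns the
    weighted average, normalized to unit length as is conventional
    for illuminant vectors.
    """
    estimates = np.asarray(estimates, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize the weights
    fused = (w[:, None] * estimates).sum(axis=0)  # weighted combination
    return fused / np.linalg.norm(fused)
```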