Coloring bag-of-words based image representations


Place: Large Lecture Room - CVC
Affiliation: Computer Vision Centre and Universitat de Barcelona

The bag-of-words image representation is among the most successful approaches to object and scene recognition. Within the bag-of-words framework, the optimal fusion of multiple cues, such as shape, texture, and color, remains an active research domain. There are two main approaches to incorporating color information within the bag-of-words framework. The first, called early fusion, fuses color and shape at the feature level, producing a joint color-shape vocabulary. The second, called late fusion, concatenates the histogram representations of color and shape, each obtained independently. In this work, we analyze the theoretical implications of both early and late feature fusion. We demonstrate that both approaches are sub-optimal for a subset of object categories, and we propose compact and efficient image representations that combine color and shape cues.
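The distinction between the two fusion schemes can be sketched as follows. This is an illustrative NumPy implementation, not the thesis code: it assumes hard nearest-neighbor assignment to pre-built vocabularies, and all function names (`quantize`, `early_fusion_histogram`, `late_fusion_histogram`) are hypothetical.

```python
import numpy as np

def quantize(features, vocabulary):
    """Assign each local descriptor to its nearest visual word (hard assignment)."""
    # Squared Euclidean distances: (n_features, n_words)
    d = ((features[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def early_fusion_histogram(shape_feats, color_feats, joint_vocab):
    """Early fusion: concatenate color and shape descriptors per patch,
    then quantize against a single joint color-shape vocabulary."""
    joint = np.hstack([shape_feats, color_feats])
    words = quantize(joint, joint_vocab)
    h = np.bincount(words, minlength=len(joint_vocab)).astype(float)
    return h / h.sum()

def late_fusion_histogram(shape_feats, color_feats, shape_vocab, color_vocab):
    """Late fusion: quantize each cue against its own vocabulary,
    then concatenate the two independently obtained histograms."""
    hs = np.bincount(quantize(shape_feats, shape_vocab), minlength=len(shape_vocab))
    hc = np.bincount(quantize(color_feats, color_vocab), minlength=len(color_vocab))
    return np.hstack([hs / hs.sum(), hc / hc.sum()])
```

Note the trade-off the sketch makes visible: early fusion binds the cues at each patch (a word encodes a color-shape combination) at the cost of a larger joint vocabulary, while late fusion keeps the vocabularies small but loses the binding between cues.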

We propose a novel method for recognizing object categories from multiple cues: shape and color are processed separately and combined by modulating the shape features with category-specific color attention. Color is used to compute bottom-up and top-down attention maps, which then modulate the weights of the shape features: in regions with high attention, shape features are given more weight than in regions with low attention.
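The modulation step above can be sketched as follows, assuming patches have already been quantized into shape and color words. Here the top-down attention for class c at a patch is taken to be p(class c | color word), which is one plausible reading of "category specific color attention"; the function names are hypothetical.

```python
import numpy as np

def attention_histogram(shape_words, attention, n_words):
    """Accumulate shape words weighted by the per-patch attention value,
    instead of counting every occurrence equally."""
    h = np.zeros(n_words)
    for w, a in zip(shape_words, attention):
        h[w] += a
    return h / max(h.sum(), 1e-12)

def class_specific_representation(shape_words, color_words,
                                  p_class_given_color, n_shape_words):
    """For each category, modulate the shape words by the top-down color
    attention p(class | color word) at each patch; the final image
    representation concatenates one attention-weighted histogram per class."""
    hists = []
    for c in range(p_class_given_color.shape[0]):
        att = p_class_given_color[c, color_words]   # attention per patch
        hists.append(attention_histogram(shape_words, att, n_shape_words))
    return np.hstack(hists)
```

In this sketch a patch whose color strongly suggests a category (e.g. red for fire trucks) contributes more of its shape evidence to that category's histogram, which is the intended effect of attention modulation.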

Spatial pyramids have been successfully applied to incorporate spatial information into bag-of-words based image representations. A major drawback, however, is that they lead to high-dimensional image representations. We present a novel framework for obtaining a compact pyramid representation. The approach reduces the size of a high-dimensional pyramid representation by up to an order of magnitude without loss of accuracy. Moreover, we investigate the optimal combination of multiple features in the context of our compact pyramid representation.
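One standard way to compress such a representation, sketched below, is to greedily merge histogram dimensions whose class-conditional distributions p(class | dimension) are most similar, so that little discriminative information is lost. This is an illustrative stand-in (using symmetric KL divergence as the merge criterion), not necessarily the exact compression method of the thesis.

```python
import numpy as np

def compress_dimensions(H, y, target_dim):
    """Greedily merge the two dimensions of a high-dimensional (pyramid)
    histogram representation whose class-conditional distributions are most
    similar, until only target_dim dimensions remain.
    H: (n_images, D) histograms; y: (n_images,) class labels."""
    classes = np.unique(y)
    counts = np.stack([H[y == c].sum(0) for c in classes])          # (C, D)
    P = counts / np.maximum(counts.sum(0, keepdims=True), 1e-12)    # p(c | d)
    mass = counts.sum(0).astype(float)
    groups = [[d] for d in range(H.shape[1])]

    def sym_kl(p, q):
        p, q = p + 1e-12, q + 1e-12
        return float((p * np.log(p / q)).sum() + (q * np.log(q / p)).sum())

    while len(groups) > target_dim:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = sym_kl(P[:, i], P[:, j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        w = mass[i] + mass[j]                       # mass-weighted merge of j into i
        P[:, i] = (mass[i] * P[:, i] + mass[j] * P[:, j]) / max(w, 1e-12)
        mass[i] = w
        groups[i] += groups[j]
        del groups[j]
        P = np.delete(P, j, axis=1)
        mass = np.delete(mass, j)

    # Project each image onto the merged dimensions by summing within groups.
    Hc = np.stack([H[:, g].sum(1) for g in groups], axis=1)
    return Hc, groups
```

Because merged dimensions are simply summed, the total histogram mass of every image is preserved while the dimensionality drops to `target_dim`.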

Finally, we propose a novel technique that builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling the joint-cue distribution from independently learned cue distributions is statistically more robust, for typical classification problems, than attempting to estimate the dependent joint-cue distribution empirically. We use information-theoretic vocabulary compression to find discriminative combinations of cues; the resulting vocabulary of portmanteau words is compact, has the cue-binding property, and supports individual weighting of cues in the final image representation.
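The key observation, estimating the joint-cue distribution from independently learned per-cue distributions, can be sketched as follows. Assuming the cues are independent given the class, Bayes' rule gives p(c | s, t) ∝ p(c | s) p(c | t) / p(c); the function name and interfaces below are illustrative.

```python
import numpy as np

def joint_class_distribution(p_c_given_s, p_c_given_t, p_c):
    """Estimate p(class | shape word s, color word t) for every compound
    (s, t) word under class-conditional independence of the cues:
        p(c | s, t) ∝ p(c | s) * p(c | t) / p(c)
    Each of the C x S and C x T tables can be estimated reliably from far
    fewer samples than the full C x (S*T) joint table."""
    C, S = p_c_given_s.shape
    _, T = p_c_given_t.shape
    joint = (p_c_given_s[:, :, None] * p_c_given_t[:, None, :]) / p_c[:, None, None]
    joint = joint.reshape(C, S * T)                       # one column per compound word
    return joint / np.maximum(joint.sum(0, keepdims=True), 1e-12)
```

The resulting S x T product vocabulary of compound words is then compressed by information-theoretic clustering, merging compound words with similar class distributions, to obtain the compact portmanteau vocabulary described above.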