Place: Large Lecture Room
Affiliation: Researcher in Computer Vision and Machine Learning, LEAR Team, INRIA, Grenoble, France
Bag-of-visual-words models treat images as orderless sets of local regions and represent them by visual word frequency histograms. Implicitly, regions are assumed to be independent and identically distributed (iid), which is a very poor assumption from a modelling perspective.
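To make the representation concrete, here is a minimal sketch of a visual word frequency histogram. The function names, the toy 2-D descriptors, and the 3-word codebook are all hypothetical illustrations, not taken from the talk; real systems use high-dimensional descriptors and codebooks learned by clustering.

```python
from collections import Counter


def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))


def bow_histogram(descriptors, codebook):
    """Visual word frequency histogram; the order of regions is discarded."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = len(descriptors)
    if total == 0:
        return [0.0] * len(codebook)
    return [counts[k] / total for k in range(len(codebook))]


# Toy data (hypothetical values): four local descriptors, three visual words.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
hist = bow_histogram(descriptors, codebook)
```

Shuffling `descriptors` leaves `hist` unchanged, which is exactly the orderless property; the histogram also treats every region as an independent draw, which is the iid assumption questioned here.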
In this talk I’ll introduce non-iid models by treating the parameters of bag-of-words models as latent variables which are integrated out, rendering all local regions dependent. Using the Fisher kernel we encode an image by the gradient of the data log-likelihood w.r.t. hyper-parameters that control priors on the model parameters. In fact, our models naturally generate transformations similar to taking square-roots, providing an explanation of why such non-linear transformations have proven successful in practice. Using variational inference we extend the basic model to include Gaussian mixtures over local descriptors, and latent topic models to capture the co-occurrence structure of visual words, both improving performance. Our models yield state-of-the-art image categorization performance using linear classifiers, without using non-linear kernels or (approximate) explicit embeddings thereof (such as by taking the square-root of the features).
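As a rough illustration of how integrating out the parameters can produce a square-root-like transformation, the sketch below assumes the simplest such model: a multinomial over visual words with a Dirichlet prior (a Dirichlet compound multinomial). This specific prior, the function names, and the toy counts are assumptions made for illustration and are not confirmed by the abstract. Under that assumption, the gradient of the log-likelihood w.r.t. the hyper-parameter alpha_k is psi(alpha_k + n_k) - psi(alpha_k) + psi(sum alpha) - psi(sum alpha + N), where psi is the digamma function and n_k is the count of word k.

```python
import math


def digamma(x):
    """Digamma function for x > 0, via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:  # shift the argument up so the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def dirichlet_multinomial_gradient(counts, alpha):
    """Gradient of the log-likelihood w.r.t. each Dirichlet hyper-parameter.

    Assumes word counts follow a multinomial whose parameters have a
    Dirichlet(alpha) prior and are integrated out (an illustrative choice).
    """
    total_count = sum(counts)
    total_alpha = sum(alpha)
    shared = digamma(total_alpha) - digamma(total_alpha + total_count)
    return [digamma(a + n) - digamma(a) + shared for n, a in zip(counts, alpha)]


# Toy visual word counts (hypothetical) with a symmetric prior.
counts = [0, 1, 4]
alpha = [1.0, 1.0, 1.0]
grad = dirichlet_multinomial_gradient(counts, alpha)
```

Each component grows with its count, but with diminishing increments: going from 0 to 1 occurrence changes the feature more than going from 3 to 4. That concave, discounting behaviour is qualitatively similar to taking the square-root of the histogram.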
This talk is based on our upcoming CVPR’12 paper, which can be found here: http://hal.inria.fr/hal-00685943/PDF/paper_final.pdf