Bag-of-visual-words models treat images as orderless sets of local regions
and represent them by visual word frequency histograms. Implicitly,
regions are assumed to be independent and identically distributed
(iid), which is a very poor assumption from a modelling perspective.
In this talk I'll introduce non-iid models by treating the parameters of
bag-of-words models as latent variables which are integrated out,
rendering all local regions dependent. Using the Fisher kernel we encode
an image by the gradient of the data log-likelihood w.r.t.
hyper-parameters that control priors on the model parameters. In fact,
our models naturally generate transformations similar to taking
square-roots, providing an explanation of why such non-linear
transformations have proven successful in practice. Using variational
inference we extend the basic model to include Gaussian mixtures over
local descriptors, and latent topic models to capture the co-occurrence
structure of visual words, both improving performance. Our models
yield state-of-the-art image categorization performance using linear
classifiers, without using non-linear kernels, or (approximate) explicit
embeddings thereof (such as by taking the square-root of the features).
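
As a rough illustration of how integrating out the parameters produces a
square-root-like transformation, the sketch below assumes the simplest case:
a multinomial over visual words with a Dirichlet prior whose hyper-parameters
are `alpha`. This particular parameterization and the names `digamma`,
`log_likelihood` and `fisher_score` are assumptions made for illustration,
not taken from the paper, and the normalization by the Fisher information
matrix is left out.

```python
import math


def digamma(x):
    """Digamma function for x > 0: upward recurrence, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        # psi(x) = psi(x + 1) - 1 / x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def log_likelihood(counts, alpha):
    """Log marginal likelihood of the visual words of one image.

    The multinomial parameters are integrated out under a Dirichlet(alpha)
    prior, so the words are no longer independent (assumed toy model).
    """
    alpha_sum = sum(alpha)
    total = sum(counts)
    ll = math.lgamma(alpha_sum) - math.lgamma(alpha_sum + total)
    for n_k, a_k in zip(counts, alpha):
        ll += math.lgamma(a_k + n_k) - math.lgamma(a_k)
    return ll


def fisher_score(counts, alpha):
    """Gradient of log_likelihood with respect to each hyper-parameter alpha_k."""
    alpha_sum = sum(alpha)
    total = sum(counts)
    shared = digamma(alpha_sum) - digamma(alpha_sum + total)
    return [digamma(a_k + n_k) - digamma(a_k) + shared
            for n_k, a_k in zip(counts, alpha)]


if __name__ == "__main__":
    counts = [0, 1, 4, 16]          # visual word histogram of one image
    alpha = [1.0, 1.0, 1.0, 1.0]    # flat prior, chosen only for the example
    for n_k, g_k in zip(counts, fisher_score(counts, alpha)):
        print(n_k, round(g_k, 4))
```

Under these assumptions each component of the score grows with the word
count only through the digamma function, which is concave and behaves like a
logarithm for large arguments. Going from 4 to 16 occurrences therefore adds
less than going from 0 to 4, a discounting effect comparable to taking the
square root of the histogram.
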
This talk is based on our upcoming CVPR'12 paper, which can be found