Walkthrough: Variational autoencoder (VAE)

What's a VAE?

• if this is too long to read, these slides explain the key points

'Classic' autoencoders¶

• loss function $||\mathbf{x} - \mathbf{y}||^2$: want to reconstruct the original input
• $\mathbf{z}$ is a compact, low-dimensional ($p \ll n$) representation of the input $\mathbf{x}$
• the bottleneck forces the network to learn how to represent the training set $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
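As a sketch of the idea (not the notebook's actual code), here is a minimal dense autoencoder forward pass in NumPy; the layer sizes, weight initialization, and activations are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 784, 2                          # input dim (a flattened 28x28 digit) and code dim, p << n
W_enc = rng.normal(0, 0.01, (p, n))    # hypothetical encoder weights
W_dec = rng.normal(0, 0.01, (n, p))    # hypothetical decoder weights

def encode(x):
    # z: compact p-dimensional code (the bottleneck)
    return np.tanh(W_enc @ x)

def decode(z):
    # y: reconstruction with values in (0, 1), like normalized pixels
    return 1.0 / (1.0 + np.exp(-(W_dec @ z)))

x = rng.random(n)                      # a fake input sample
z = encode(x)
y = decode(z)
loss = np.sum((x - y) ** 2)            # the reconstruction loss ||x - y||^2
```

Training would then adjust `W_enc` and `W_dec` by gradient descent on this loss; only the forward pass is shown here.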

Applications:¶

• denoising, completion
• discriminant feature learning to feed some classifier
• unsupervised training of individual layers of large convnets
• manifold learning, dimensionality reduction

Comparison of separability of 2-dimensional codes generated by an autoencoder (right) and PCA (left) on the MNIST dataset

Variational autoencoder¶

• It's a generative model: given a dataset $X$, generate new samples like those in $X$ but not identical to any of them
• It learns the parameters of an approximation of the underlying probability distribution $p(X)$ so as to

• draw new samples
• compute the probability of a new sample
• The catch is that the $\mathbf{x}_i \in X$ are very high-dimensional! e.g. 784 dimensions for MNIST
• The resemblance to a classic AE lies only in the network architecture (and even that is not exactly the same, see below)

Loss function¶

It can be seen (that's the difficult part, see the slides for an explanation) that, in order to maximize the likelihood of the training set $P(X \mid \varphi, \theta)$, the loss is the sum of

• the Kullback–Leibler divergence between the distribution induced in latent space by the encoder on the data, $q_\varphi(z \,|\, x)$, and some prior chosen for $z$, $p(z)$, such as $\mathcal{N}(0, 1)$. For this choice, $-D_{KL}$ simplifies to $\frac{1}{2}\displaystyle\sum_{i=1}^p \left(1 + \log(\sigma^2_{z_i}) - \mu^2_{z_i} - \sigma^2_{z_i}\right)$
• a reconstruction error, the log-likelihood $\displaystyle\sum_{i=1}^n \log p(x_i \,|\, z)$: if reconstruction were perfect, i.e. $z$ always produced $x_i$, each term would reach its maximum $\log(1) = 0$

In the code, they assume $p(x_i \,|\, z)$ is a Bernoulli $\Rightarrow \displaystyle\sum_{i=1}^n x_i \log(\mu_{y_i}) + (1-x_i) \log(1-\mu_{y_i})$.

For $p(x_i \,|\, z)$ Gaussian $\mathcal{N}(\mu_{x_i}, \sigma_{x_i})$, this error would be $\displaystyle\sum_{i=1}^n \log(\sigma_{x_i}^2) + (x_i - \mu_{x_i})^2/\sigma_{x_i}^2$ (up to additive and multiplicative constants)
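The two loss terms above can be written out directly; this is a sketch of the formulas only (no network), and the function names are my own, not the notebook's:

```python
import numpy as np

def kl_term(mu_z, log_var_z):
    # -D_KL(q(z|x) || N(0, 1)) = 1/2 * sum(1 + log(sigma^2) - mu^2 - sigma^2),
    # the term the ELBO maximizes; log_var_z = log(sigma_z^2)
    return 0.5 * np.sum(1.0 + log_var_z - mu_z**2 - np.exp(log_var_z))

def bernoulli_log_lik(x, mu_y, eps=1e-7):
    # sum_i x_i * log(mu_yi) + (1 - x_i) * log(1 - mu_yi),
    # clipped to avoid log(0)
    mu_y = np.clip(mu_y, eps, 1.0 - eps)
    return np.sum(x * np.log(mu_y) + (1.0 - x) * np.log(1.0 - mu_y))

# Sanity checks: when q(z|x) matches the prior N(0, 1), the KL penalty vanishes;
# when the reconstruction is perfect, the log-likelihood is log(1) = 0.
assert kl_term(np.zeros(2), np.zeros(2)) == 0.0
x = np.array([0.0, 1.0, 1.0])
assert np.isclose(bernoulli_log_lik(x, x), 0.0, atol=1e-5)
```

In practice both terms are summed (with signs flipped, to minimize) over a mini-batch to form the training loss.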

Implementations¶

A simple implementation, see #11. You need to download everything for it to run.

A better but longer implementation, which we have adapted.