## Walkthrough: Variational autoencoder (VAE)

What's a VAE?

• if this is too long to read, these slides explain the key points

### 'Classic' autoencoders

• loss function $||\mathbf{x} - \mathbf{y}||^2$, where $\mathbf{y}$ is the network's output: we want to reconstruct the original input
• $\mathbf{z}$ is a compact, low-dimensional ($p \ll n$) representation of the input $\mathbf{x}$
• the bottleneck forces the network to learn how to represent the training set $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
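
To make the roles of $\mathbf{x}$, $\mathbf{z}$ and $\mathbf{y}$ concrete, here is a minimal sketch of one forward pass through a linear autoencoder in plain Python. The dimensions, the random untrained weights and the helper names (`linear`, `squared_error`) are assumptions made for illustration and do not come from the implementations referenced in this walkthrough.

```python
import random

def linear(weights, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def squared_error(x, y):
    """Squared Euclidean distance ||x - y||^2."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

random.seed(0)
n, p = 8, 2  # input dimension n and bottleneck dimension p (p < n)

# Hypothetical untrained weights, drawn at random for illustration only.
encoder_w = [[random.gauss(0, 0.1) for _ in range(n)] for _ in range(p)]
decoder_w = [[random.gauss(0, 0.1) for _ in range(p)] for _ in range(n)]

x = [random.random() for _ in range(n)]  # one input vector
z = linear(encoder_w, x)                 # compact code of length p
y = linear(decoder_w, z)                 # reconstruction of length n
loss = squared_error(x, y)               # quantity that training would minimize
```

Training would adjust `encoder_w` and `decoder_w` to reduce `loss` over the whole training set; a real network would also add nonlinearities and biases.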

### Applications:

• denoising, completion
• discriminant feature learning to feed some classifier
• unsupervised training of individual layers of large convnets
• manifold learning, dimensionality reduction

Comparison of separability of 2-dimensional codes generated by an autoencoder (right) and PCA (left) on the MNIST dataset

### Variational autoencoder

• It's a generative model: given a dataset $X$, generate new samples that resemble those in $X$ but are not identical to any of them
• Learns the parameters of an approximation of the underlying probability distribution $p(X)$ so as to

• draw new samples
• compute the probability of a new sample
• The catch is that the $\mathbf{x}_i \in X$ are very high dimensional, e.g. 784 dimensions ($28 \times 28$ pixels) for MNIST
• Resemblance with AE is just the network architecture (though it's not exactly the same, see below)

### Loss function

It can be shown (this is the difficult part, see the slides for an explanation) that maximizing the likelihood of the training set $P(X \; | \; \varphi, \theta)$ leads to a loss that is the sum of

• the Kullback-Leibler divergence between the distribution in latent space induced by the encoder on the data, $q_\varphi(z \,|\, x)$, and some prior chosen for $z$, $p(z)$, such as $\mathcal{N}(0,1)$. For a Gaussian encoder with mean $\mu_{z_i}$ and variance $\sigma^2_{z_i}$ in each latent dimension, this simplifies to $\displaystyle -\frac{1}{2}\sum_{i=1}^p \left(1 + \log(\sigma^2_{z_i}) - \mu_{z_i}^2 - \sigma^2_{z_i}\right)$
• a reconstruction error, the negative of the log-likelihood $\displaystyle\sum_{i=1}^n \log p(x_i \,|\, z)$; if reconstruction were perfect, i.e. $z$ always produced $x_i$, each term would be $\log(1) = 0$
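
As a small illustration of the first term, the sketch below evaluates the standard closed form of the KL divergence between a diagonal Gaussian and a standard normal prior, $-\frac{1}{2}\sum_i (1 + \log \sigma^2_{z_i} - \mu_{z_i}^2 - \sigma^2_{z_i})$. The function name and the choice to pass the log-variance rather than the variance are assumptions made here for illustration, not taken from the referenced implementations.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.

    mu and log_var are sequences of equal length (one entry per latent dimension).
    """
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv)
        for m, lv in zip(mu, log_var)
    )

# The divergence vanishes when the encoder output matches the prior exactly...
kl_same = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# ...and grows as the mean moves away from 0 or the variance away from 1.
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

Working with the log-variance keeps the variance positive without any constraint on the encoder output, which is probably why many implementations parameterize it this way.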

In the code, they assume $p(x_i \,|\, z)$ is a Bernoulli with mean $\mu_{y_i}$, so the log-likelihood is $\displaystyle\sum_{i=1}^n x_i \log(\mu_{y_i}) + (1-x_i) \log(1-\mu_{y_i})$.
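
A minimal sketch of this Bernoulli log-likelihood, $\sum_i x_i \log(\mu_{y_i}) + (1 - x_i)\log(1 - \mu_{y_i})$, in plain Python. The function name and the clamping constant `eps` are assumptions added here to avoid evaluating $\log(0)$; they are not taken from the referenced code.

```python
import math

def bernoulli_log_likelihood(x, mu_y, eps=1e-12):
    """Sum over pixels of x_i*log(mu_i) + (1 - x_i)*log(1 - mu_i).

    x holds the target pixel values in [0, 1]; mu_y holds the decoder
    outputs, interpreted as Bernoulli means.
    """
    total = 0.0
    for xi, mi in zip(x, mu_y):
        mi = min(max(mi, eps), 1.0 - eps)  # clamp to keep both logs finite
        total += xi * math.log(mi) + (1.0 - xi) * math.log(1.0 - mi)
    return total

# An uninformative decoder (all means 0.5) pays log(0.5) per pixel.
ll_uninformative = bernoulli_log_likelihood([1.0, 0.0], [0.5, 0.5])
# A perfect reconstruction gives a log-likelihood of (almost exactly) zero.
ll_perfect = bernoulli_log_likelihood([1.0, 0.0], [1.0, 0.0])
```

The negative of this quantity is the usual binary cross-entropy, so the loss term to minimize would be `-bernoulli_log_likelihood(x, mu_y)`.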

For $p(x_i \,|\, z)$ Gaussian, $\mathcal{N}(\mu_{x_i}, \sigma_{x_i}^2)$, this error would be, up to an additive constant, $\displaystyle\frac{1}{2}\sum_{i=1}^n \left(\log(\sigma_{x_i}^2) + (x_i - \mu_{x_i})^2/\sigma_{x_i}^2\right)$
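
For comparison, here is a sketch of the Gaussian case written as a full negative log-likelihood, i.e. including the factor $\frac{1}{2}$ and the constant $\log(2\pi)$, neither of which changes the minimizer. The function name and the log-variance parameterization are assumptions made for illustration.

```python
import math

def gaussian_nll(x, mu, log_var):
    """Negative log-likelihood of x under N(mu, diag(exp(log_var))).

    Equals 0.5 * sum( log(2*pi) + log(sigma_i^2) + (x_i - mu_i)^2 / sigma_i^2 ).
    """
    return 0.5 * sum(
        math.log(2.0 * math.pi) + lv + (xi - m) ** 2 / math.exp(lv)
        for xi, m, lv in zip(x, mu, log_var)
    )

# With unit variance the data-dependent part is half the squared error.
nll_exact = gaussian_nll([0.0], [0.0], [0.0])  # only the constant remains
nll_off = gaussian_nll([1.0], [0.0], [0.0])    # adds 0.5 * (1 - 0)^2
```

With the variance fixed to 1 this reduces, up to a constant, to half the squared error $||\mathbf{x} - \mathbf{y}||^2$ used by the classic autoencoder.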

### Implementations

A simple implementation, see #11. You need to download everything for it to run.

A better but longer implementation, which we have adapted.