X-Git-Url: https://fleuret.org/cgi-bin/gitweb/gitweb.cgi?a=blobdiff_plain;f=elbo.tex;h=563ec3c3bf98e3e250d2fc5fa4deffb26dc74ac8;hb=d3ec58e881d629993d490e7b6b3a6a5f7492fc8b;hp=fe91565f8fa6d21c72fee5ef2b87de8006efc3b3;hpb=05c0721d2f8b578a8a27ed2085dc9812d2249f88;p=tex.git

diff --git a/elbo.tex b/elbo.tex
index fe91565..563ec3c 100644
--- a/elbo.tex
+++ b/elbo.tex
@@ -76,24 +76,25 @@
 \setlength{\abovedisplayshortskip}{2ex}
 \setlength{\belowdisplayshortskip}{2ex}
 
-\vspace*{-4ex}
+\vspace*{-3ex}
 
 \begin{center}
 {\Large The Evidence Lower Bound}
 
-\vspace*{1ex}
+\vspace*{2ex}
 
 Fran\c cois Fleuret
 
+%% \vspace*{2ex}
+
 \today
 
-\vspace*{-1ex}
+%% \vspace*{-1ex}
 
 \end{center}
 
-Given i.i.d training samples $x_1, \dots, x_N$ that follows an unknown
-distribution $\mu_X$, we want to fit a model $p_\theta(x,z)$ to it,
-maximizing
+Given i.i.d.\ training samples $x_1, \dots, x_N$, we want to fit a
+model $p_\theta(x,z)$ to them, maximizing
 %
 \[
 \sum_n \log \, p_\theta(x_n).
@@ -134,6 +135,8 @@ since this maximization pushes that KL term down, it also aligns
 $p_\theta(z \mid x_n)$ and $q(z)$, and we may get a worse
 $p_\theta(x_n)$ to bring $p_\theta(z \mid x_n)$ closer to $q(z)$.
 
+\medskip
+
 However, all this analysis is still valid if $q$ is a parameterized
 function $q_\alpha(z \mid x_n)$ of $x_n$. In that case, if we optimize
 $\theta$ and $\alpha$ to maximize
@@ -145,5 +148,20 @@ $\theta$ and $\alpha$ to maximize
 it maximizes $\log \, p_\theta(x_n)$ and brings $q_\alpha(z \mid x_n)$
 close to $p_\theta(z \mid x_n)$.
 
+\medskip
+
+A point that may be important in practice is
+%
+\begin{align*}
+ & \expect_{Z \sim q_\alpha(z \mid x_n)} \left[ \log \frac{p_\theta(x_n,Z)}{q_\alpha(Z \mid x_n)} \right] \\
+ & = \expect_{Z \sim q_\alpha(z \mid x_n)} \left[ \log \frac{p_\theta(x_n \mid Z) p_\theta(Z)}{q_\alpha(Z \mid x_n)} \right] \\
+ & = \expect_{Z \sim q_\alpha(z \mid x_n)} \left[ \log \, p_\theta(x_n \mid Z) \right] \\
+ & \hspace*{7em} - \dkl(q_\alpha(z \mid x_n) \, \| \, p_\theta(z)).
+\end{align*}
+%
+This form is useful because for certain $p_\theta$ and $q_\alpha$, for
+instance if they are Gaussian, the KL term can be computed exactly
+instead of through sampling, which removes one source of noise in the
+optimization process.
 
 \end{document}
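A note supplementing the last added paragraph: when $q_\alpha(z \mid x_n)$ is a diagonal Gaussian $\mathcal{N}(\mu_n, \operatorname{diag}(\sigma_n^2))$ and the prior $p_\theta(z)$ is a standard Gaussian $\mathcal{N}(0, I)$ (a common choice, though not one the patch commits to), the KL term admits the standard closed form

```latex
\dkl\big(q_\alpha(z \mid x_n) \,\|\, p_\theta(z)\big)
  = \frac{1}{2} \sum_{d=1}^{D}
    \left( \sigma_{n,d}^2 + \mu_{n,d}^2 - 1 - \log \sigma_{n,d}^2 \right),
```

where $D$ is the latent dimension. This is the expression typically evaluated analytically, while the reconstruction term $\expect_{Z \sim q_\alpha(z \mid x_n)} \left[ \log \, p_\theta(x_n \mid Z) \right]$ is still estimated by sampling.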