\begin{document}
\setlength{\abovedisplayskip}{2ex}
\setlength{\belowdisplayskip}{2ex}
\setlength{\abovedisplayshortskip}{2ex}
\setlength{\belowdisplayshortskip}{2ex}

\vspace*{-3ex}
\begin{center}
{\Large The Evidence Lower Bound}
\vspace*{2ex}

Fran\c cois Fleuret
%% \vspace*{2ex}

\today
%% \vspace*{-1ex}
\end{center}
Given i.i.d.\ training samples $x_1, \dots, x_N$, we want to fit a
model $p_\theta(x,z)$ to them, maximizing
%
\[
\sum_n \log \, p_\theta(x_n).
\]
%
If we do not have an analytical form of the marginal $p_\theta(x_n)$
but only the expression of $p_\theta(x_n,z)$, we can get an estimate
of the marginal by sampling $z$ with any distribution $q$
%
\begin{align*}
p_\theta(x_n) & = \int p_\theta(x_n, z) \, dz \\
 & = \int \frac{p_\theta(x_n, z)}{q(z)} \, q(z) \, dz \\
 & = \expect_{Z \sim q(z)} \left[\frac{p_\theta(x_n,Z)}{q(Z)}\right].
\end{align*}
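This identity can be checked numerically on a hypothetical toy joint distribution over finite $x$ and $z$ (the probability table below is made up for illustration): for any $q$ with full support, the importance-weighted Monte Carlo average converges to the exact marginal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy joint p(x, z): rows index z in {0, 1, 2},
# columns index x in {0, 1}. The numbers are made up; they sum to 1.
p_joint = np.array([[0.10, 0.05],
                    [0.20, 0.15],
                    [0.30, 0.20]])
x = 0

# Exact marginal p(x), summing the joint over z
p_marginal = p_joint[:, x].sum()

# Any sampling distribution q over z with full support
q = np.array([0.5, 0.25, 0.25])

# Monte Carlo estimate of E_{Z ~ q}[ p(x, Z) / q(Z) ]
z = rng.choice(3, size=200_000, p=q)
estimate = np.mean(p_joint[z, x] / q[z])

print(p_marginal, estimate)
```

The estimator is unbiased for any such $q$; its variance, however, depends on how close $q$ is to the posterior $p(z \mid x)$.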
%
So if we sample a
$Z$ with $q$ and maximize
%
\begin{equation*}
\frac{p_\theta(x_n,Z)}{q(Z)},
\end{equation*}
%
we do maximize $p_\theta(x_n)$ on average.
But we want to maximize $\sum_n \log \, p_\theta(x_n)$. If we use the
$\log$ of the previous expression, we can decompose its average value
as
%
\begin{align*}
\expect_{Z \sim q(z)} \left[ \log \frac{p_\theta(x_n,Z)}{q(Z)} \right]
 & = \log \, p_\theta(x_n) - \dkl(q(z) \, \| \, p_\theta(z \mid x_n)).
\end{align*}
%
Since the KL divergence is non-negative, this quantity is a lower
bound of $\log \, p_\theta(x_n)$: the Evidence Lower Bound.
Maximizing it both increases $\log \, p_\theta(x_n)$ and reduces the
divergence between $p_\theta(z \mid x_n)$ and $q(z)$, and we may get a
worse $p_\theta(x_n)$ to bring $p_\theta(z \mid x_n)$ closer to
$q(z)$.
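On a hypothetical toy discrete joint (numbers made up for illustration), the expectations are finite sums, so both the lower-bound property and the exact size of the gap can be verified:

```python
import numpy as np

# Hypothetical toy joint p(x, z): rows index z in {0, 1, 2},
# columns index x in {0, 1}. The numbers are made up; they sum to 1.
p_joint = np.array([[0.10, 0.05],
                    [0.20, 0.15],
                    [0.30, 0.20]])
x = 0

p_x = p_joint[:, x].sum()          # exact marginal p(x)
posterior = p_joint[:, x] / p_x    # exact posterior p(z | x)

q = np.array([0.5, 0.25, 0.25])    # any q over z with full support

# ELBO as an exact finite sum: E_{Z ~ q}[ log p(x, Z) / q(Z) ]
elbo = np.sum(q * np.log(p_joint[:, x] / q))

# D_KL(q || p(z | x)), the gap between the ELBO and log p(x)
kl = np.sum(q * np.log(q / posterior))

print(elbo, np.log(p_x), kl)
```

Here `elbo <= np.log(p_x)` always holds, and `elbo + kl` equals `np.log(p_x)` up to floating-point error, matching the decomposition.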
\medskip

However, all this analysis is still valid if $q$ is a parameterized
function $q_\alpha(z \mid x_n)$ of $x_n$. In that case, if we optimize
$\theta$ and $\alpha$ to maximize
%
\begin{equation*}
\sum_n \expect_{Z \sim q_\alpha(z \mid x_n)} \left[ \log \frac{p_\theta(x_n,Z)}{q_\alpha(Z \mid x_n)} \right],
\end{equation*}
%
it maximizes $\log \, p_\theta(x_n)$ and brings $q_\alpha(z \mid
x_n)$ close to $p_\theta(z \mid x_n)$.
\medskip

A point that may be important in practice is
%
\begin{align*}
 & \expect_{Z \sim q_\alpha(z \mid x_n)} \left[ \log \frac{p_\theta(x_n,Z)}{q_\alpha(Z \mid x_n)} \right] \\
 & = \expect_{Z \sim q_\alpha(z \mid x_n)} \left[ \log \frac{p_\theta(x_n \mid Z) p_\theta(Z)}{q_\alpha(Z \mid x_n)} \right] \\
 & = \expect_{Z \sim q_\alpha(z \mid x_n)} \left[ \log \, p_\theta(x_n \mid Z) \right] \\
 & \hspace*{7em} - \dkl(q_\alpha(z \mid x_n) \, \| \, p_\theta(z)).
\end{align*}
%
This form is useful because for certain $p_\theta$ and $q_\alpha$, for
instance if they are Gaussian, the KL term can be computed exactly
instead of through sampling, which removes one source of noise in the
optimization process.
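As a sketch of that point, here is the standard closed-form KL between a diagonal Gaussian $q_\alpha(z \mid x_n) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and a standard Gaussian prior $p_\theta(z) = \mathcal{N}(0, I)$, compared against a Monte Carlo estimate of the same quantity (the values of $\mu$ and $\sigma$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary parameters of a diagonal Gaussian q(z) = N(mu, diag(sigma^2));
# the prior is p(z) = N(0, I).
mu = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.5])

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ),
# 0.5 * sum_d (sigma_d^2 + mu_d^2 - 1 - log sigma_d^2)
kl_exact = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# Monte Carlo estimate of the same KL: E_q[ log q(Z) - log p(Z) ]
z = mu + sigma * rng.standard_normal((500_000, 2))
log_q = -0.5 * np.sum(((z - mu) / sigma)**2 + np.log(2 * np.pi * sigma**2), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_exact, kl_mc)
```

The two values agree, but the Monte Carlo estimate fluctuates from run to run, which is exactly the noise the closed-form KL removes from the gradient.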
\end{document}