-Given a training i.i.d train samples $x_1, \dots, x_N$ that follows an
-unknown distribution $\mu_X$, we want to fit a model $p_\theta(x,z)$
-to it, maximizing
+Given i.i.d. training samples $x_1, \dots, x_N$, we want to fit a model
+$p_\theta(x,z)$ to them, maximizing
$p_\theta(z \mid x_n)$ and $q(z)$, and we may get a worse
$p_\theta(x_n)$ to bring $p_\theta(z \mid x_n)$ closer to $q(z)$.
However, all this analysis is still valid if $q$ is a parameterized
function $q_\alpha(z \mid x_n)$ of $x_n$. In that case, if we optimize
$\theta$ and $\alpha$ to maximize
it maximizes $\log \, p_\theta(x_n)$ and brings $q_\alpha(z \mid
x_n)$ close to $p_\theta(z \mid x_n)$.
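
Why one objective can achieve both effects follows from the standard
decomposition of the log-marginal, sketched here for a single sample
$x_n$. This assumes the maximized quantity is the usual evidence lower
bound; the notation $\mathbb{D}_{\mathrm{KL}}$ for the Kullback-Leibler
divergence is introduced only for this sketch, and $\mathbb{E}$ and
$\mathbb{D}$ assume the \texttt{amssymb} package is loaded.
\[
\log p_\theta(x_n)
= \mathbb{E}_{Z \sim q_\alpha(z \mid x_n)}
  \left[ \log \frac{p_\theta(x_n, Z)}{q_\alpha(Z \mid x_n)} \right]
+ \mathbb{D}_{\mathrm{KL}}
  \left( q_\alpha(z \mid x_n) \, \| \, p_\theta(z \mid x_n) \right).
\]
Since the divergence is non-negative, the expectation is a lower bound
on $\log p_\theta(x_n)$. For a fixed $\theta$ the left-hand side does
not depend on $\alpha$, so increasing the expectation over $\alpha$ can
only shrink the divergence, while increasing it over $\theta$ raises
the lower bound on $\log p_\theta(x_n)$.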