Update.

[tex.git] / inftheory.tex
diff --git a/inftheory.tex b/inftheory.tex

index eb9e988..954fd06 100644 (file)
--- a/inftheory.tex
+++ b/inftheory.tex
@@ -1,7 +1,11 @@
  %% -*- mode: latex; mode: reftex; mode: flyspell; coding: utf-8; tex-command: "pdflatex.sh" -*-
  
-\documentclass[10pt,a4paper,twoside]{article}
-\usepackage[paperheight=18cm,paperwidth=10cm,top=5mm,bottom=20mm,right=5mm,left=5mm]{geometry}
+%% Any copyright is dedicated to the Public Domain.
+%% https://creativecommons.org/publicdomain/zero/1.0/
+%% Written by Francois Fleuret <francois@fleuret.org>
+
+\documentclass[11pt,a4paper,oneside]{article}
+\usepackage[paperheight=15cm,paperwidth=8cm,top=2mm,bottom=15mm,right=2mm,left=2mm]{geometry}
  %\usepackage[a4paper,top=2.5cm,bottom=2cm,left=2.5cm,right=2.5cm]{geometry}
  \usepackage[utf8]{inputenc}
  \usepackage{amsmath,amssymb,dsfont}
@@ -16,8 +20,8 @@
  \usetikzlibrary{tikzmark}
  \usetikzlibrary{decorations.pathmorphing}
  \usepackage[round]{natbib}
-%\usepackage{cmbright}
-%\usepackage{showframe}
+\usepackage[osf]{libertine}
+\usepackage{microtype}
  
  \usepackage{mleftright}
  
@@ -27,89 +31,127 @@
  \setmuskip{\thickmuskip}{3.5mu} % by default it is equal to 5 mu
  
  \setlength{\parindent}{0cm}
-\setlength{\parskip}{12pt}
+\setlength{\parskip}{1ex}
  %\renewcommand{\baselinestretch}{1.3}
  %\setlength{\tabcolsep}{0pt}
  %\renewcommand{\arraystretch}{1.0}
  
  \def\argmax{\operatornamewithlimits{argmax}}
  \def\argmin{\operatornamewithlimits{argmin}}
-\def\expect{\mathds{E}}
  
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%% The \todo command
-\newcounter{nbdrafts}
-\setcounter{nbdrafts}{0}
-\makeatletter
-\newcommand{\checknbdrafts}{
-\ifnum \thenbdrafts > 0
-\@latex@warning@no@line{*WARNING* The document contains \thenbdrafts \space draft note(s)}
-\fi}
-\newcommand{\todo}[1]{\addtocounter{nbdrafts}{1}{\color{red} #1}}
-\makeatother
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\def\given{\,\middle\vert\,}
+\def\proba{\operatorname{P}}
+\newcommand{\seq}{{S}}
+\newcommand{\expect}{\mathds{E}}
+\newcommand{\variance}{\mathds{V}}
+\newcommand{\empexpect}{\hat{\mathds{E}}}
+\newcommand{\mutinf}{\mathds{I}}
+\newcommand{\empmutinf}{\hat{\mathds{I}}}
+\newcommand{\entropy}{\mathds{H}}
+\newcommand{\empentropy}{\hat{\mathds{H}}}
+\newcommand{\ganG}{\mathbf{G}}
+\newcommand{\ganD}{\mathbf{D}}
+\newcommand{\ganF}{\mathbf{F}}
+
+\newcommand{\dkl}{\mathds{D}_{\mathsf{KL}}}
+\newcommand{\djs}{\mathds{D}_{\mathsf{JS}}}
+
+\newcommand*{\vertbar}{\rule[-1ex]{0.5pt}{2.5ex}}
+\newcommand*{\horzbar}{\rule[.5ex]{2.5ex}{0.5pt}}
+
+\def\positionalencoding{\operatorname{pos-enc}}
+\def\concat{\operatorname{concat}}
+\def\crossentropy{\LL_{\operatorname{ce}}}
  
  \begin{document}
  
+\vspace*{-3ex}
+
+\begin{center}
+{\Large Some bits of Information Theory}
+
+Fran\c cois Fleuret
+
+January 19, 2024
+
+\vspace*{1ex}
+
+\end{center}
+
  Information Theory is awesome so here is a TL;DR about Shannon's entropy.
  
-This field is about quantifying the amount ``of information'' contained
-in a signal and how much can be transmitted under certain conditions.
+The field is originally about quantifying the amount of
+``information'' contained in a signal and how much can be transmitted
+under certain conditions.
  
-What makes it awesome IMO is that it is very intuitive, and like thermodynamics in Physics it give exact bounds about what is possible or not.
+What makes it awesome is that it is very intuitive, and like
+thermodynamics in Physics, it gives exact bounds about what is
+possible or not.
  
  \section{Shannon's Entropy}
  
-This is the key concept from which everything is defined.
+Shannon's entropy is the key concept from which everything is defined.
  
-Imagine that you have a distribution of probabilities p on a finite set of symbols and that you generate a stream of symbols by sampling them one after another independently with that distribution.
+Imagine that you have a distribution of probabilities $p$ on a finite
+set of symbols, and that you generate a stream of symbols by sampling
+them one after another independently with that distribution.
  
-To transmit that stream, for instance with bits over a communication line, you can design a coding that takes into account that the symbols are not all as probable, and decode on the other side.
+To transmit that stream, for instance with bits over a communication
+line, you can design a coding that takes into account that the symbols
+are not all as probable, and decode on the other side.
  
-For instance if $P('\!\!A')=1/2$, $P('\!\!B')=1/4$, and
-$P('\!\!C')=1/4$ you would transmit ``0'' for a ``A'' and ``10'' for a
-``B'' and ``11'' for a ``C'', 1.5 bits on average.
+For instance if $\proba('\!\!A')=1/2$, $\proba('\!\!B')=1/4$, and
+$\proba('\!\!C')=1/4$ you would transmit ``$0$'' for a ``A'' and ``$10$'' for a
+``B'' and ``$11$'' for a ``C'', 1.5 bits on average.
  
-If the symbol is always the same, you transmit nothing, if they are equiprobable you need $\log_2$(nb symbols) etc.
+If the symbol is always the same, you transmit nothing, if they are
+equiprobable you need $\log_2$(nb symbols) etc.
  
-Shannon's Entropy (in base 2) is the minimum number of bits you have to emit on average to transmit that stream.
+Shannon's Entropy (in base 2) is the minimum number of bits you have
+to emit on average per symbol to transmit that stream.
  
-It has a simple formula:
+It has a simple analytical form:
  %
  \[
- H(p) = - \sum_k p(k) \log_2 p(k)
+ \entropy(p) = - \sum_k p(k) \log_2 p(k)
  \]
  %
-where by convention $o \log_2 0 = 0$.
+where by convention $0 \log_2 0 = 0$.
  
-It is often seen as a measure of randomness since the more deterministic the distribution is, the less you have to emit.
+It is often seen as a measure of randomness since the more
+deterministic the distribution is, the less you have to emit.
  
-The codings above are "Huffman coding", which reaches the Entropy
-bound only for some distributions. The "Arithmetic coding" does it
-always.
+The examples above correspond to "Huffman coding", which reaches the
+Entropy bound only for some distributions. A more sophisticated scheme
+called "Arithmetic coding" does it always.
  
  From this perspective, many quantities have an intuitive
-value. Consider for instance sending pairs of symbols (X, Y).
+value. Consider for instance sending pairs of symbols $(X, Y)$.
  
-If these two symbols are independent, you cannot do better than sending one and the other separately, hence
+If these two symbols are independent, you cannot do better than
+sending one and the other separately, hence
  %
  \[
-H(X, H) = H(X) + H(Y).
+\entropy(X, Y) = \entropy(X) + \entropy(Y).
  \]
  
-However, imagine that the second symbol is a function of the first Y=f(X). You just have to send X since Y can be computed from it on the other side.
+However, imagine that the second symbol is a function of the first
+Y=f(X). You just have to send $X$ since $Y$ can be computed from it on the
+other side.
  
  Hence in that case
  %
  \[
-H(X, Y) = H(X).
+\entropy(X, Y) = \entropy(X).
  \]
  
  An associated quantity is the mutual information between two random
  variables, defined with
  %
  \[
-I(X;Y) = H(X) + H(Y) - H(X,Y),
+\mutinf(X;Y) = \entropy(X) + \entropy(Y) - \entropy(X,Y),
  \]
  %
  that quantifies the amount of information shared by the two variables.
@@ -118,77 +160,86 @@ that quantifies the amount of information shared by the two variables.
  
  Conditional entropy is the average of the entropy of the conditional distribution:
  %
-\begin{align*}
-&H(X \mid Y)\\
- &= \sum_y p(Y=y) H(X \mid Y=y)\\
-       &= \sum_y P(Y=y) \sum_x P(X=x \mid Y=y) \log P(X=x \mid Y=y)
-\end{align*}
+\begin{equation*}
+\entropy(X \mid Y) = \sum_y \proba(Y=y) \entropy(X \mid Y=y)
+\end{equation*}
+%
+with
+%
+\begin{eqnarray*}
+\entropy(X \mid Y=y) \hspace*{13.5em} \\
+ = \sum_x \proba(X=x \mid Y=y) \log \proba(X=x \mid Y=y)
+\end{eqnarray*}
  
-Intuitively it is the [minimum average] number of bits required to describe X given that Y is known.
+Intuitively it is the [minimum average] number of bits required to describe $X$ given that $Y$ is known.
  
-So in particular, if X and Y are independent, getting the value of $Y$
+So in particular, if $X$ and $Y$ are independent, getting the value of $Y$
  does not help at all, so you still have to send all the bits for $X$,
  hence
  %
  \[
-  H(X \mid Y)=H(X)
+  \entropy(X \mid Y)=\entropy(X),
  \]
-
-if X is a deterministic function of Y then
+%
+and if $X$ is a deterministic function of $Y$ then
  %
  \[
-  H(X \mid Y)=0
+  \entropy(X \mid Y)=0.
  \]
  
-And since if you send the bits for Y and then the bits to describe X given that Y, you have sent (X, Y), we have the chain rule:
+And if you send the bits for $Y$ and then the bits to describe $X$
+given that $Y$, you have sent $(X, Y)$, hence the chain rule:
  %
  \[
-H(X, Y) = H(Y) + H(X \mid Y).
+\entropy(X, Y) = \entropy(Y) + \entropy(X \mid Y).
  \]
-
+%
  And then we get
  %
  \begin{align*}
-I(X;Y) &= H(X) + H(Y) - H(X,Y)\\
-       &= H(X) + H(Y) - (H(Y) + H(X \mid Y))\\
-       &= H(X) - H(X \mid Y).
+I(X;Y) &= \entropy(X) + \entropy(Y) - \entropy(X,Y)\\
+       &= \entropy(X) + \entropy(Y) - (\entropy(Y) + \entropy(X \mid Y))\\
+       &= \entropy(X) - \entropy(X \mid Y).
  \end{align*}
  
  \section{Kullback-Leibler divergence}
  
  Imagine that you encode your stream thinking it comes from
-distribution $q$ while it comes from $p$. You would emit more bits than
-the optimal $H(p)$, and that supplement is $D_{KL}(p||q)$ the
-Kullback-Leibler divergence between $p$ and $q$.
+distribution $q$ while it comes from $p$. You would emit more bits
+than the optimal $\entropy(p)$, and that excess of bits is
+$\dkl(p||q)$ the Kullback-Leibler divergence between $p$ and $q$.
  
  In particular if $p=q$
  %
  \[
- D_{KL}(p\|q)=0,
+ \dkl(p\|q)=0,
  \]
  %
  and if there is a symbol $x$ with $q(x)=0$ and $p(x)>0$, you cannot encode it and
  %
  \[
- D_{KL}(p\|q)=+\infty.
+ \dkl(p\|q)=+\infty.
  \]
  
  Its formal expression is
  %
  \[
-D_{KL}(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)
+\dkl(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)
  \]
  %
  that can be understood as a value called the cross-entropy between $p$ and $q$
  %
  \[
-H(p,q) = -\sum_x p(x) \log q(x)
+\entropy(p,q) = -\sum_x p(x) \log q(x)
  \]
  %
  minus the entropy of p
  \[
-H(p) = -\sum_x p(x) \log p(x).
+\entropy(p) = -\sum_x p(x) \log p(x).
  \]
  
-Notation horror: if $X$ and $Y$ are random variables $H(X, Y)$ is the entropy of their joint law, and if $p$ and $q$ are distributions, $H(p,q)$ is the cross-entropy between them.
+Notation horror: if $X$ and $Y$ are random variables $\entropy(X, Y)$ is the
+entropy of their joint law, and if $p$ and $q$ are distributions,
+$\entropy(p,q)$ is the cross-entropy between them.
+
  \end{document}