inftheory.tex

   1 %% -*- mode: latex; mode: reftex; mode: flyspell; coding: utf-8; tex-command: "pdflatex.sh" -*-
   2
   3 %% Any copyright is dedicated to the Public Domain.
   4 %% https://creativecommons.org/publicdomain/zero/1.0/
   5 %% Written by Francois Fleuret <francois@fleuret.org>
   6
   7 \documentclass[11pt,a4paper,oneside]{article}
   8 \usepackage[paperheight=15cm,paperwidth=8cm,top=2mm,bottom=15mm,right=2mm,left=2mm]{geometry}
   9 %\usepackage[a4paper,top=2.5cm,bottom=2cm,left=2.5cm,right=2.5cm]{geometry}
  10 \usepackage[utf8]{inputenc}
  11 \usepackage{amsmath,amssymb,dsfont}
  12 \usepackage[pdftex]{graphicx}
  13 \usepackage[colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=blue]{hyperref}
  14 \usepackage{tikz}
  15 \usetikzlibrary{arrows,arrows.meta,calc}
  16 \usetikzlibrary{patterns,backgrounds}
  17 \usetikzlibrary{positioning,fit}
  18 \usetikzlibrary{shapes.geometric,shapes.multipart}
  19 \usetikzlibrary{patterns.meta,decorations.pathreplacing,calligraphy}
  20 \usetikzlibrary{tikzmark}
  21 \usetikzlibrary{decorations.pathmorphing}
  22 \usepackage[round]{natbib}
  23 \usepackage[osf]{libertine}
  24 \usepackage{microtype}
  25
  26 \usepackage{mleftright}
  27
  28 \newcommand{\setmuskip}[2]{#1=#2\relax}
  29 \setmuskip{\thinmuskip}{1.5mu} % by default it is equal to 3 mu
  30 \setmuskip{\medmuskip}{2mu} % by default it is equal to 4 mu
  31 \setmuskip{\thickmuskip}{3.5mu} % by default it is equal to 5 mu
  32
  33 \setlength{\parindent}{0cm}
  34 \setlength{\parskip}{1ex}
  35 %\renewcommand{\baselinestretch}{1.3}
  36 %\setlength{\tabcolsep}{0pt}
  37 %\renewcommand{\arraystretch}{1.0}
  38
  39 \def\argmax{\operatornamewithlimits{argmax}}
  40 \def\argmin{\operatornamewithlimits{argmin}}
  41
  42 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  43
  44 \def\given{\,\middle\vert\,}
  45 \def\proba{\operatorname{P}}
  46 \newcommand{\seq}{{S}}
  47 \newcommand{\expect}{\mathds{E}}
  48 \newcommand{\variance}{\mathds{V}}
  49 \newcommand{\empexpect}{\hat{\mathds{E}}}
  50 \newcommand{\mutinf}{\mathds{I}}
  51 \newcommand{\empmutinf}{\hat{\mathds{I}}}
  52 \newcommand{\entropy}{\mathds{H}}
  53 \newcommand{\empentropy}{\hat{\mathds{H}}}
  54 \newcommand{\ganG}{\mathbf{G}}
  55 \newcommand{\ganD}{\mathbf{D}}
  56 \newcommand{\ganF}{\mathbf{F}}
  57
  58 \newcommand{\dkl}{\mathds{D}_{\mathsf{KL}}}
  59 \newcommand{\djs}{\mathds{D}_{\mathsf{JS}}}
  60
  61 \newcommand*{\vertbar}{\rule[-1ex]{0.5pt}{2.5ex}}
  62 \newcommand*{\horzbar}{\rule[.5ex]{2.5ex}{0.5pt}}
  63
  64 \def\positionalencoding{\operatorname{pos-enc}}
  65 \def\concat{\operatorname{concat}}
  66 \def\crossentropy{\LL_{\operatorname{ce}}}
  67
  68 \begin{document}
  69
  70 \vspace*{-3ex}
  71
  72 \begin{center}
  73 {\Large Some bits of Information Theory}
  74
  75 Fran\c cois Fleuret
  76
  77 January 19, 2024
  78
  79 \vspace*{1ex}
  80
  81 \end{center}
  82
  83 Information Theory is awesome so here is a TL;DR about Shannon's entropy.
  84
  85 The field is originally about quantifying the amount of
  86 ``information'' contained in a signal and how much can be transmitted
  87 under certain conditions.
  88
  89 What makes it awesome is that it is very intuitive, and like
  90 thermodynamics in Physics, it gives exact bounds about what is
  91 possible or not.
  92
  93 \section{Shannon's Entropy}
  94
  95 Shannon's entropy is the key concept from which everything is defined.
  96
  97 Imagine that you have a distribution of probabilities $p$ on a finite
  98 set of symbols, and that you generate a stream of symbols by sampling
  99 them one after another independently with that distribution.
 100
 101 To transmit that stream, for instance with bits over a communication
 102 line, you can design a coding that takes into account that the symbols
 103 are not all as probable, and decode on the other side.
 104
 105 For instance if $\proba('\!\!A')=1/2$, $\proba('\!\!B')=1/4$, and
 106 $\proba('\!\!C')=1/4$ you would transmit ``$0$'' for a ``A'' and ``$10$'' for a
 107 ``B'' and ``$11$'' for a ``C'', 1.5 bits on average.
 108
 109 If the symbol is always the same, you transmit nothing, if they are
 110 equiprobable you need $\log_2$(nb symbols) etc.
 111
 112 Shannon's Entropy (in base 2) is the minimum number of bits you have
 113 to emit on average per symbol to transmit that stream.
 114
 115 It has a simple analytical form:
 116 %
 117 \[
 118  \entropy(p) = - \sum_k p(k) \log_2 p(k)
 119 \]
 120 %
 121 where by convention $0 \log_2 0 = 0$.
 122
 123 It is often seen as a measure of randomness since the more
 124 deterministic the distribution is, the less you have to emit.
 125
 126 The examples above correspond to "Huffman coding", which reaches the
 127 Entropy bound only for some distributions. A more sophisticated scheme
 128 called "Arithmetic coding" does it always.
 129
 130 From this perspective, many quantities have an intuitive
 131 value. Consider for instance sending pairs of symbols $(X, Y)$.
 132
 133 If these two symbols are independent, you cannot do better than
 134 sending one and the other separately, hence
 135 %
 136 \[
 137 \entropy(X, Y) = \entropy(X) + \entropy(Y).
 138 \]
 139
 140 However, imagine that the second symbol is a function of the first
 141 Y=f(X). You just have to send $X$ since $Y$ can be computed from it on the
 142 other side.
 143
 144 Hence in that case
 145 %
 146 \[
 147 \entropy(X, Y) = \entropy(X).
 148 \]
 149
 150 An associated quantity is the mutual information between two random
 151 variables, defined with
 152 %
 153 \[
 154 \mutinf(X;Y) = \entropy(X) + \entropy(Y) - \entropy(X,Y),
 155 \]
 156 %
 157 that quantifies the amount of information shared by the two variables.
 158
 159 \section{Conditional Entropy}
 160
 161 Conditional entropy is the average of the entropy of the conditional distribution:
 162 %
 163 \begin{equation*}
 164 \entropy(X \mid Y) = \sum_y \proba(Y=y) \entropy(X \mid Y=y)
 165 \end{equation*}
 166 %
 167 with
 168 %
 169 \begin{eqnarray*}
 170 \entropy(X \mid Y=y) \hspace*{13.5em} \\
 171  = \sum_x \proba(X=x \mid Y=y) \log \proba(X=x \mid Y=y)
 172 \end{eqnarray*}
 173
 174 Intuitively it is the [minimum average] number of bits required to describe $X$ given that $Y$ is known.
 175
 176 So in particular, if $X$ and $Y$ are independent, getting the value of $Y$
 177 does not help at all, so you still have to send all the bits for $X$,
 178 hence
 179 %
 180 \[
 181   \entropy(X \mid Y)=\entropy(X),
 182 \]
 183 %
 184 and if $X$ is a deterministic function of $Y$ then
 185 %
 186 \[
 187   \entropy(X \mid Y)=0.
 188 \]
 189
 190 And if you send the bits for $Y$ and then the bits to describe $X$
 191 given that $Y$, you have sent $(X, Y)$, hence the chain rule:
 192 %
 193 \[
 194 \entropy(X, Y) = \entropy(Y) + \entropy(X \mid Y).
 195 \]
 196 %
 197 And then we get
 198 %
 199 \begin{align*}
 200 I(X;Y) &= \entropy(X) + \entropy(Y) - \entropy(X,Y)\\
 201        &= \entropy(X) + \entropy(Y) - (\entropy(Y) + \entropy(X \mid Y))\\
 202        &= \entropy(X) - \entropy(X \mid Y).
 203 \end{align*}
 204
 205 \section{Kullback-Leibler divergence}
 206
 207 Imagine that you encode your stream thinking it comes from
 208 distribution $q$ while it comes from $p$. You would emit more bits
 209 than the optimal $\entropy(p)$, and that excess of bits is
 210 $\dkl(p||q)$ the Kullback-Leibler divergence between $p$ and $q$.
 211
 212 In particular if $p=q$
 213 %
 214 \[
 215  \dkl(p\|q)=0,
 216 \]
 217 %
 218 and if there is a symbol $x$ with $q(x)=0$ and $p(x)>0$, you cannot encode it and
 219 %
 220 \[
 221  \dkl(p\|q)=+\infty.
 222 \]
 223
 224 Its formal expression is
 225 %
 226 \[
 227 \dkl(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)
 228 \]
 229 %
 230 that can be understood as a value called the cross-entropy between $p$ and $q$
 231 %
 232 \[
 233 \entropy(p,q) = -\sum_x p(x) \log q(x)
 234 \]
 235 %
 236 minus the entropy of p
 237 \[
 238 \entropy(p) = -\sum_x p(x) \log p(x).
 239 \]
 240
 241 Notation horror: if $X$ and $Y$ are random variables $\entropy(X, Y)$ is the
 242 entropy of their joint law, and if $p$ and $q$ are distributions,
 243 $\entropy(p,q)$ is the cross-entropy between them.
 244
 245 \end{document}