inftheory.tex

   1 %% -*- mode: latex; mode: reftex; mode: flyspell; coding: utf-8; tex-command: "pdflatex.sh" -*-
   2
   3 %% Any copyright is dedicated to the Public Domain.
   4 %% https://creativecommons.org/publicdomain/zero/1.0/
   5 %% Written by Francois Fleuret <francois@fleuret.org>
   6
   7 \documentclass[10pt,a4paper,twoside]{article}
   8 \usepackage[paperheight=18cm,paperwidth=10cm,top=5mm,bottom=20mm,right=5mm,left=5mm]{geometry}
   9 %\usepackage[a4paper,top=2.5cm,bottom=2cm,left=2.5cm,right=2.5cm]{geometry}
  10 \usepackage[utf8]{inputenc}
  11 \usepackage{amsmath,amssymb,dsfont}
  12 \usepackage[pdftex]{graphicx}
  13 \usepackage[colorlinks=true,linkcolor=blue,urlcolor=blue,citecolor=blue]{hyperref}
  14 \usepackage{tikz}
  15 \usetikzlibrary{arrows,arrows.meta,calc}
  16 \usetikzlibrary{patterns,backgrounds}
  17 \usetikzlibrary{positioning,fit}
  18 \usetikzlibrary{shapes.geometric,shapes.multipart}
  19 \usetikzlibrary{patterns.meta,decorations.pathreplacing,calligraphy}
  20 \usetikzlibrary{tikzmark}
  21 \usetikzlibrary{decorations.pathmorphing}
  22 \usepackage[round]{natbib}
  23
  24 \usepackage{mleftright}
  25
  26 \newcommand{\setmuskip}[2]{#1=#2\relax}
  27 \setmuskip{\thinmuskip}{1.5mu} % by default it is equal to 3 mu
  28 \setmuskip{\medmuskip}{2mu} % by default it is equal to 4 mu
  29 \setmuskip{\thickmuskip}{3.5mu} % by default it is equal to 5 mu
  30
  31 \setlength{\parindent}{0cm}
  32 \setlength{\parskip}{12pt}
  33 %\renewcommand{\baselinestretch}{1.3}
  34 %\setlength{\tabcolsep}{0pt}
  35 %\renewcommand{\arraystretch}{1.0}
  36
  37 \def\argmax{\operatornamewithlimits{argmax}}
  38 \def\argmin{\operatornamewithlimits{argmin}}
  39 \def\expect{\mathds{E}}
  40
  41 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  42
  43 \begin{document}
  44
  45 \today
  46
  47 Information Theory is awesome so here is a TL;DR about Shannon's entropy.
  48
  49 The field is originally about quantifying the amount of
  50 ``information'' contained in a signal and how much can be transmitted
  51 under certain conditions.
  52
  53 What makes it awesome IMO is that it is very intuitive, and like
  54 thermodynamics in Physics, it gives exact bounds about what is possible
  55 or not.
  56
  57 \section{Shannon's Entropy}
  58
  59 Shannon's entropy is the key concept from which everything is defined.
  60
  61 Imagine that you have a distribution of probabilities $p$ on a finite
  62 set of symbols, and that you generate a stream of symbols by sampling
  63 them one after another independently with that distribution.
  64
  65 To transmit that stream, for instance with bits over a communication
  66 line, you can design a coding that takes into account that the symbols
  67 are not all as probable, and decode on the other side.
  68
  69 For instance if $P('\!\!A')=1/2$, $P('\!\!B')=1/4$, and
  70 $P('\!\!C')=1/4$ you would transmit ``0'' for a ``A'' and ``10'' for a
  71 ``B'' and ``11'' for a ``C'', 1.5 bits on average.
  72
  73 If the symbol is always the same, you transmit nothing, if they are
  74 equiprobable you need $\log_2$(nb symbols) etc.
  75
  76 Shannon's Entropy (in base 2) is the minimum number of bits you have
  77 to emit on average per symbol to transmit that stream.
  78
  79 It has a simple analytical form:
  80 %
  81 \[
  82  H(p) = - \sum_k p(k) \log_2 p(k)
  83 \]
  84 %
  85 where by convention $0 \log_2 0 = 0$.
  86
  87 It is often seen as a measure of randomness since the more
  88 deterministic the distribution is, the less you have to emit.
  89
  90 The examples above correspond to "Huffman coding", which reaches the
  91 Entropy bound only for some distributions. A more sophisticated scheme
  92 called "Arithmetic coding" does it always.
  93
  94 From this perspective, many quantities have an intuitive
  95 value. Consider for instance sending pairs of symbols (X, Y).
  96
  97 If these two symbols are independent, you cannot do better than
  98 sending one and the other separately, hence
  99 %
 100 \[
 101 H(X, H) = H(X) + H(Y).
 102 \]
 103
 104 However, imagine that the second symbol is a function of the first
 105 Y=f(X). You just have to send X since Y can be computed from it on the
 106 other side.
 107
 108 Hence in that case
 109 %
 110 \[
 111 H(X, Y) = H(X).
 112 \]
 113
 114 An associated quantity is the mutual information between two random
 115 variables, defined with
 116 %
 117 \[
 118 I(X;Y) = H(X) + H(Y) - H(X,Y),
 119 \]
 120 %
 121 that quantifies the amount of information shared by the two variables.
 122
 123 \section{Conditional Entropy}
 124
 125 Conditional entropy is the average of the entropy of the conditional distribution:
 126 %
 127 \begin{align*}
 128 &H(X \mid Y)\\
 129  &= \sum_y p(Y=y) H(X \mid Y=y)\\
 130        &= \sum_y P(Y=y) \sum_x P(X=x \mid Y=y) \log P(X=x \mid Y=y)
 131 \end{align*}
 132
 133 Intuitively it is the [minimum average] number of bits required to describe X given that Y is known.
 134
 135 So in particular, if X and Y are independent, getting the value of $Y$
 136 does not help at all, so you still have to send all the bits for $X$,
 137 hence
 138 %
 139 \[
 140   H(X \mid Y)=H(X)
 141 \]
 142
 143 if X is a deterministic function of Y then
 144 %
 145 \[
 146   H(X \mid Y)=0.
 147 \]
 148
 149 And if you send the bits for Y and then the bits to describe X given
 150 that Y, you have sent (X, Y). Hence we have the chain rule:
 151 %
 152 \[
 153 H(X, Y) = H(Y) + H(X \mid Y).
 154 \]
 155
 156 And then we get
 157 %
 158 \begin{align*}
 159 I(X;Y) &= H(X) + H(Y) - H(X,Y)\\
 160        &= H(X) + H(Y) - (H(Y) + H(X \mid Y))\\
 161        &= H(X) - H(X \mid Y).
 162 \end{align*}
 163
 164 \section{Kullback-Leibler divergence}
 165
 166 Imagine that you encode your stream thinking it comes from
 167 distribution $q$ while it comes from $p$. You would emit more bits than
 168 the optimal $H(p)$, and that supplement is $D_{KL}(p||q)$ the
 169 Kullback-Leibler divergence between $p$ and $q$.
 170
 171 In particular if $p=q$
 172 %
 173 \[
 174  D_{KL}(p\|q)=0,
 175 \]
 176 %
 177 and if there is a symbol $x$ with $q(x)=0$ and $p(x)>0$, you cannot encode it and
 178 %
 179 \[
 180  D_{KL}(p\|q)=+\infty.
 181 \]
 182
 183 Its formal expression is
 184 %
 185 \[
 186 D_{KL}(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)
 187 \]
 188 %
 189 that can be understood as a value called the cross-entropy between $p$ and $q$
 190 %
 191 \[
 192 H(p,q) = -\sum_x p(x) \log q(x)
 193 \]
 194 %
 195 minus the entropy of p
 196 \[
 197 H(p) = -\sum_x p(x) \log p(x).
 198 \]
 199
 200 Notation horror: if $X$ and $Y$ are random variables $H(X, Y)$ is the
 201 entropy of their joint law, and if $p$ and $q$ are distributions,
 202 $H(p,q)$ is the cross-entropy between them.
 203
 204 \end{document}