diff --git a/inftheory.tex b/inftheory.tex
index 29b0557..954fd06 100644
--- a/inftheory.tex
+++ b/inftheory.tex
@@ -1,7 +1,11 @@
 %% -*- mode: latex; mode: reftex; mode: flyspell; coding: utf-8; tex-command: "pdflatex.sh" -*-
-\documentclass[10pt,a4paper,twoside]{article}
-\usepackage[paperheight=18cm,paperwidth=10cm,top=5mm,bottom=20mm,right=5mm,left=5mm]{geometry}
+%% Any copyright is dedicated to the Public Domain.
+%% https://creativecommons.org/publicdomain/zero/1.0/
+%% Written by Francois Fleuret
+
+\documentclass[11pt,a4paper,oneside]{article}
+\usepackage[paperheight=15cm,paperwidth=8cm,top=2mm,bottom=15mm,right=2mm,left=2mm]{geometry}
 %\usepackage[a4paper,top=2.5cm,bottom=2cm,left=2.5cm,right=2.5cm]{geometry}
 \usepackage[utf8]{inputenc}
 \usepackage{amsmath,amssymb,dsfont}
@@ -16,8 +20,8 @@
 \usetikzlibrary{tikzmark}
 \usetikzlibrary{decorations.pathmorphing}
 \usepackage[round]{natbib}
-%\usepackage{cmbright}
-%\usepackage{showframe}
+\usepackage[osf]{libertine}
+\usepackage{microtype}

 \usepackage{mleftright}

@@ -27,96 +31,127 @@
 \setmuskip{\thickmuskip}{3.5mu} % by default it is equal to 5 mu

 \setlength{\parindent}{0cm}
-\setlength{\parskip}{12pt}
+\setlength{\parskip}{1ex}
 %\renewcommand{\baselinestretch}{1.3}
 %\setlength{\tabcolsep}{0pt}
 %\renewcommand{\arraystretch}{1.0}

 \def\argmax{\operatornamewithlimits{argmax}}
 \def\argmin{\operatornamewithlimits{argmin}}
-\def\expect{\mathds{E}}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%% The \todo command
-\newcounter{nbdrafts}
-\setcounter{nbdrafts}{0}
-\makeatletter
-\newcommand{\checknbdrafts}{
-\ifnum \thenbdrafts > 0
-\@latex@warning@no@line{*WARNING* The document contains \thenbdrafts \space draft note(s)}
-\fi}
-\newcommand{\todo}[1]{\addtocounter{nbdrafts}{1}{\color{red} #1}}
-\makeatother
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\def\given{\,\middle\vert\,}
+\def\proba{\operatorname{P}}
+\newcommand{\seq}{{S}}
+\newcommand{\expect}{\mathds{E}}
+\newcommand{\variance}{\mathds{V}}
+\newcommand{\empexpect}{\hat{\mathds{E}}}
+\newcommand{\mutinf}{\mathds{I}}
+\newcommand{\empmutinf}{\hat{\mathds{I}}}
+\newcommand{\entropy}{\mathds{H}}
+\newcommand{\empentropy}{\hat{\mathds{H}}}
+\newcommand{\ganG}{\mathbf{G}}
+\newcommand{\ganD}{\mathbf{D}}
+\newcommand{\ganF}{\mathbf{F}}
+
+\newcommand{\dkl}{\mathds{D}_{\mathsf{KL}}}
+\newcommand{\djs}{\mathds{D}_{\mathsf{JS}}}
+
+\newcommand*{\vertbar}{\rule[-1ex]{0.5pt}{2.5ex}}
+\newcommand*{\horzbar}{\rule[.5ex]{2.5ex}{0.5pt}}
+
+\def\positionalencoding{\operatorname{pos-enc}}
+\def\concat{\operatorname{concat}}
+\def\crossentropy{\LL_{\operatorname{ce}}}

 \begin{document}

+\vspace*{-3ex}
+
+\begin{center}
+{\Large Some bits of Information Theory}
+
+Fran\c cois Fleuret
+
+January 19, 2024
+
+\vspace*{1ex}
+
+\end{center}
+
 Information Theory is awesome, so here is a TL;DR about Shannon's entropy.

-This field is about quantifying the amount of ``information'' contained
-in a signal and how much can be transmitted under certain conditions.
+The field was originally about quantifying the amount of
+``information'' contained in a signal and how much can be transmitted
+under certain conditions.

-What makes it awesome IMO is that it is very intuitive, and like thermodynamics in Physics it give exact bounds about what is possible or not.
+What makes it awesome is that it is very intuitive, and like
+thermodynamics in Physics, it gives exact bounds on what is
+possible or not.

 \section{Shannon's Entropy}

-This is the key concept from which everything is defined.
+Shannon's entropy is the key concept from which everything is defined.

-Imagine that you have a distribution of probabilities p on a finite
-set of symbols and that you generate a stream of symbols by sampling
+Imagine that you have a probability distribution $p$ on a finite
+set of symbols, and that you generate a stream of symbols by sampling
 them one after another independently with that distribution.

 To transmit that stream, for instance with bits over a communication
 line, you can design a coding that takes into account that the
 symbols are not all equally probable, and decode on the other side.

-For instance if $P('\!\!A')=1/2$, $P('\!\!B')=1/4$, and
-$P('\!\!C')=1/4$ you would transmit ``0'' for a ``A'' and ``10'' for a
-``B'' and ``11'' for a ``C'', 1.5 bits on average.
+For instance if $\proba('\!\!A')=1/2$, $\proba('\!\!B')=1/4$, and
+$\proba('\!\!C')=1/4$, you would transmit ``$0$'' for an ``A'', ``$10$'' for a
+``B'', and ``$11$'' for a ``C'', that is 1.5 bits on average.

 If the symbol is always the same, you transmit nothing; if they are
 equiprobable, you need $\log_2$(nb symbols) bits, etc.

 Shannon's Entropy (in base 2) is the minimum number of bits you have
-to emit on average to transmit that stream.
+to emit on average per symbol to transmit that stream.

-It has a simple formula:
+It has a simple analytical form:
 %
 \[
-  H(p) = - \sum_k p(k) \log_2 p(k)
+  \entropy(p) = - \sum_k p(k) \log_2 p(k)
 \]
 %
-where by convention $o \log_2 0 = 0$.
+where by convention $0 \log_2 0 = 0$.

 It is often seen as a measure of randomness since the more
 deterministic the distribution is, the less you have to emit.

-The codings above are "Huffman coding", which reaches the Entropy
-bound only for some distributions. The "Arithmetic coding" does it
-always.
+The examples above correspond to ``Huffman coding'', which reaches the
+Entropy bound only for some distributions. A more sophisticated scheme
+called ``Arithmetic coding'' always does.

 From this perspective, many quantities have an intuitive
-value. Consider for instance sending pairs of symbols (X, Y).
+value. Consider for instance sending pairs of symbols $(X, Y)$.

-If these two symbols are independent, you cannot do better than sending one and the other separately, hence
+If these two symbols are independent, you cannot do better than
+sending one and the other separately, hence
 %
 \[
-H(X, H) = H(X) + H(Y).
+\entropy(X, Y) = \entropy(X) + \entropy(Y).
 \]

-However, imagine that the second symbol is a function of the first Y=f(X). You just have to send X since Y can be computed from it on the other side.
+However, imagine that the second symbol is a function of the first,
+$Y=f(X)$. You just have to send $X$ since $Y$ can be computed from it on
+the other side.

 Hence in that case
 %
 \[
-H(X, Y) = H(X).
+\entropy(X, Y) = \entropy(X).
 \]

 An associated quantity is the mutual information between two random
 variables, defined as
 %
 \[
-I(X;Y) = H(X) + H(Y) - H(X,Y),
+\mutinf(X;Y) = \entropy(X) + \entropy(Y) - \entropy(X,Y),
 \]
 %
 that quantifies the amount of information shared by the two variables.
@@ -125,78 +160,86 @@ that quantifies the amount of information shared by the two variables.
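+
+As a quick sanity check, one can plug the two cases above into this
+definition. If $X$ and $Y$ are independent, then
+$\entropy(X,Y)=\entropy(X)+\entropy(Y)$, hence
+%
+\[
+\mutinf(X;Y) = \entropy(X) + \entropy(Y) - \entropy(X,Y) = 0,
+\]
+%
+and if $Y=f(X)$, then $\entropy(X,Y)=\entropy(X)$, hence
+%
+\[
+\mutinf(X;Y) = \entropy(X) + \entropy(Y) - \entropy(X) = \entropy(Y),
+\]
+%
+that is, the two variables share all the information carried by $Y$.
+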
 Conditional entropy is the average of the entropy of the conditional
 distribution:
 %
-\begin{align*}
-&H(X \mid Y)\\
- &= \sum_y p(Y=y) H(X \mid Y=y)\\
- &= \sum_y P(Y=y) \sum_x P(X=x \mid Y=y) \log P(X=x \mid Y=y)
-\end{align*}
+\begin{equation*}
+\entropy(X \mid Y) = \sum_y \proba(Y=y) \entropy(X \mid Y=y)
+\end{equation*}
+%
+with
+%
+\begin{eqnarray*}
+\entropy(X \mid Y=y) \hspace*{13.5em} \\
+ = - \sum_x \proba(X=x \mid Y=y) \log \proba(X=x \mid Y=y)
+\end{eqnarray*}

-Intuitively it is the [minimum average] number of bits required to describe X given that Y is known.
+Intuitively it is the minimum average number of bits required to describe $X$ given that $Y$ is known.

-So in particular, if X and Y are independent, getting the value of $Y$
+So in particular, if $X$ and $Y$ are independent, getting the value of $Y$
 does not help at all, so you still have to send all the bits for $X$,
 hence
 %
 \[
-  H(X \mid Y)=H(X)
+  \entropy(X \mid Y)=\entropy(X),
 \]
-
-if X is a deterministic function of Y then
+%
+and if $X$ is a deterministic function of $Y$ then
 %
 \[
-  H(X \mid Y)=0.
+  \entropy(X \mid Y)=0.
 \]

-And if you send the bits for Y and then the bits to describe X given
-that Y, you have sent (X, Y). Hence we have the chain rule:
+And if you send the bits for $Y$ and then the bits to describe $X$
+given $Y$, you have sent $(X, Y)$, hence the chain rule:
 %
 \[
-H(X, Y) = H(Y) + H(X \mid Y).
+\entropy(X, Y) = \entropy(Y) + \entropy(X \mid Y).
 \]
-
+%
 And then we get
 %
 \begin{align*}
-I(X;Y) &= H(X) + H(Y) - H(X,Y)\\
-	&= H(X) + H(Y) - (H(Y) + H(X \mid Y))\\
-	&= H(X) - H(X \mid Y).
+\mutinf(X;Y) &= \entropy(X) + \entropy(Y) - \entropy(X,Y)\\
+       &= \entropy(X) + \entropy(Y) - (\entropy(Y) + \entropy(X \mid Y))\\
+       &= \entropy(X) - \entropy(X \mid Y).
 \end{align*}

 \section{Kullback-Leibler divergence}

 Imagine that you encode your stream thinking it comes from
-distribution $q$ while it comes from $p$. You would emit more bits than
-the optimal $H(p)$, and that supplement is $D_{KL}(p||q)$ the
-Kullback-Leibler divergence between $p$ and $q$.
+distribution $q$ while it actually comes from $p$. You would emit more
+bits than the optimal $\entropy(p)$, and that excess is
+$\dkl(p\|q)$, the Kullback-Leibler divergence between $p$ and $q$.

 In particular if $p=q$
 %
 \[
-  D_{KL}(p\|q)=0,
+  \dkl(p\|q)=0,
 \]
 %
 and if there is a symbol $x$ with $q(x)=0$ and $p(x)>0$, you cannot
 encode it and
 %
 \[
-  D_{KL}(p\|q)=+\infty.
+  \dkl(p\|q)=+\infty.
 \]

 Its formal expression is
 %
 \[
-D_{KL}(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)
+\dkl(p\|q) = \sum_x p(x) \log\left(\frac{p(x)}{q(x)}\right)
 \]
 %
 that can be understood as the cross-entropy between $p$ and $q$
 %
 \[
-H(p,q) = -\sum_x p(x) \log q(x)
+\entropy(p,q) = -\sum_x p(x) \log q(x)
 \]
 %
 minus the entropy of $p$
 \[
-H(p) = -\sum_x p(x) \log p(x).
+\entropy(p) = -\sum_x p(x) \log p(x).
 \]

-Notation horror: if $X$ and $Y$ are random variables $H(X, Y)$ is the entropy of their joint law, and if $p$ and $q$ are distributions, $H(p,q)$ is the cross-entropy between them.
+Notation horror: if $X$ and $Y$ are random variables, $\entropy(X, Y)$ is the
+entropy of their joint law, and if $p$ and $q$ are distributions,
+$\entropy(p,q)$ is the cross-entropy between them.
+
 \end{document}
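+
+As a small worked example, take the distribution of the first
+section, $p('\!\!A')=1/2$, $p('\!\!B')=1/4$, $p('\!\!C')=1/4$, and
+encode it with a code built for the uniform distribution
+$q('\!\!A')=q('\!\!B')=q('\!\!C')=1/3$, a $q$ chosen here only for
+illustration. The cross-entropy is
+%
+\[
+\entropy(p,q) = -\sum_x p(x) \log_2 q(x) = \log_2 3 \approx 1.58,
+\]
+%
+while $\entropy(p)=1.5$, hence
+%
+\[
+\dkl(p\|q) = \entropy(p,q) - \entropy(p) = \log_2 3 - 1.5 \approx 0.08
+\]
+%
+bits wasted per symbol on average.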