X-Git-Url: https://fleuret.org/cgi-bin/gitweb/gitweb.cgi?a=blobdiff_plain;f=inftheory.tex;h=898d6f68d5c77b4c717dbb2a0152f8a5feaec307;hb=ccc37e30411aafd68891ce40933207d238f525d8;hp=33ccfe5cfb1e214b2e0e51edda186f0c719ae404;hpb=1e87b0a30c1b32eb50429af1340ea9706e3ccab6;p=tex.git

diff --git a/inftheory.tex b/inftheory.tex
index 33ccfe5..898d6f6 100644
--- a/inftheory.tex
+++ b/inftheory.tex
@@ -1,5 +1,9 @@
 %% -*- mode: latex; mode: reftex; mode: flyspell; coding: utf-8; tex-command: "pdflatex.sh" -*-

+%% Any copyright is dedicated to the Public Domain.
+%% https://creativecommons.org/publicdomain/zero/1.0/
+%% Written by Francois Fleuret
+
 \documentclass[10pt,a4paper,twoside]{article}
 \usepackage[paperheight=18cm,paperwidth=10cm,top=5mm,bottom=20mm,right=5mm,left=5mm]{geometry}
 %\usepackage[a4paper,top=2.5cm,bottom=2cm,left=2.5cm,right=2.5cm]{geometry}
@@ -16,8 +20,6 @@
 \usetikzlibrary{tikzmark}
 \usetikzlibrary{decorations.pathmorphing}
 \usepackage[round]{natbib}
-%\usepackage{cmbright}
-%\usepackage{showframe}

 \usepackage{mleftright}

@@ -36,68 +38,70 @@
 \def\argmin{\operatornamewithlimits{argmin}}
 \def\expect{\mathds{E}}

-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%% The \todo command
-\newcounter{nbdrafts}
-\setcounter{nbdrafts}{0}
-\makeatletter
-\newcommand{\checknbdrafts}{
-\ifnum \thenbdrafts > 0
-\@latex@warning@no@line{*WARNING* The document contains \thenbdrafts \space draft note(s)}
-\fi}
-\newcommand{\todo}[1]{\addtocounter{nbdrafts}{1}{\color{red} #1}}
-\makeatother

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 \begin{document}

 Information Theory is awesome so here is a TL;DR about Shannon's entropy.

-This field is about quantifying the amount ``of information'' contained
-in a signal and how much can be transmitted under certain conditions.
+The field is originally about quantifying the amount of
+``information'' contained in a signal and how much can be transmitted
+under certain conditions.

-What makes it awesome IMO is that it is very intuitive, and like thermodynamics in Physics it give exact bounds about what is possible or not.
+What makes it awesome IMO is that it is very intuitive, and like
+thermodynamics in Physics, it gives exact bounds about what is possible
+or not.

 \section{Shannon's Entropy}

-This is the key concept from which everything is defined.
+Shannon's entropy is the key concept from which everything is defined.

-Imagine that you have a distribution of probabilities p on a finite set of symbols and that you generate a stream of symbols by sampling them one after another independently with that distribution.
+Imagine that you have a distribution of probabilities $p$ on a finite
+set of symbols, and that you generate a stream of symbols by sampling
+them one after another independently with that distribution.

-To transmit that stream, for instance with bits over a communication line, you can design a coding that takes into account that the symbols are not all as probable, and decode on the other side.
+To transmit that stream, for instance with bits over a communication
+line, you can design a coding that takes into account that the symbols
+are not all equally probable, and decode on the other side.

 For instance if $P('\!\!A')=1/2$, $P('\!\!B')=1/4$, and
 $P('\!\!C')=1/4$, you would transmit ``0'' for an ``A'' and ``10'' for a
 ``B'' and ``11'' for a ``C'', 1.5 bits on average.

-If the symbol is always the same, you transmit nothing, if they are equiprobable you need $\log_2$(nb symbols) etc.
+If the symbol is always the same, you transmit nothing; if they are
+equiprobable, you need $\log_2$(nb symbols), etc.

-Shannon's Entropy (in base 2) is the minimum number of bits you have to emit on average to transmit that stream.
+Shannon's Entropy (in base 2) is the minimum number of bits you have
+to emit on average per symbol to transmit that stream.

-It has a simple formula:
+It has a simple analytical form:
 %
 \[
 H(p) = - \sum_k p(k) \log_2 p(k)
 \]
 %
-where by convention $o \log_2 0 = 0$.
+where by convention $0 \log_2 0 = 0$.

-It is often seen as a measure of randomness since the more deterministic the distribution is, the less you have to emit.
+It is often seen as a measure of randomness since the more
+deterministic the distribution is, the less you have to emit.

-The codings above are "Huffman coding", which reaches the Entropy
-bound only for some distributions. The "Arithmetic coding" does it
-always.
+The examples above correspond to ``Huffman coding'', which reaches the
+Entropy bound only for some distributions. A more sophisticated scheme
+called ``Arithmetic coding'' always does.

 From this perspective, many quantities have an intuitive
 value. Consider for instance sending pairs of symbols (X, Y).

-If these two symbols are independent, you cannot do better than sending one and the other separately, hence
+If these two symbols are independent, you cannot do better than
+sending one and the other separately, hence
 %
 \[
 H(X, Y) = H(X) + H(Y).
 \]

-However, imagine that the second symbol is a function of the first Y=f(X). You just have to send X since Y can be computed from it on the other side.
+However, imagine that the second symbol is a function of the first,
+$Y=f(X)$. You just have to send $X$ since $Y$ can be computed from it on
+the other side.

 Hence in that case
 %
@@ -116,7 +120,7 @@ that quantifies the amount of information shared by the two variables.

 \section{Conditional Entropy}

-Okay given the visible interest for the topic, an addendum: Conditional entropy is the average of the entropy of the conditional distribution:
+Conditional entropy is the average of the entropy of the conditional distribution:
 %
 \begin{align*}
 &H(X \mid Y)\\
@@ -126,7 +130,9 @@ Okay given the visible interest for the topic, an addendum: Conditional entropy

 Intuitively it is the [minimum average] number of bits required to
 describe X given that Y is known.

-So in particular, if X and Y are independent
+So in particular, if $X$ and $Y$ are independent, getting the value of $Y$
+does not help at all, so you still have to send all the bits for $X$,
+hence
 %
 \[
 H(X \mid Y)=H(X)
@@ -135,10 +141,11 @@ So in particular, if X and Y are independent
 if X is a deterministic function of Y then
 %
 \[
-  H(X \mid Y)=0
+  H(X \mid Y)=0.
 \]

-And since if you send the bits for Y and then the bits to describe X given that X is known you have sent (X, Y), we have the chain rule:
+And if you send the bits for Y and then the bits to describe X given
+that Y is known, you have sent (X, Y). Hence we have the chain rule:
 %
 \[
 H(X, Y) = H(Y) + H(X \mid Y).
@@ -188,5 +195,8 @@ minus the entropy of p
 H(p) = -\sum_x p(x) \log p(x).
 \]

-Notation horror: if $X$ and $Y$ are random variables $H(X, Y)$ is the entropy of their joint law, and if $p$ and $q$ are distributions, $H(p,q)$ is the cross-entropy between them.
+Notation horror: if $X$ and $Y$ are random variables, $H(X, Y)$ is the
+entropy of their joint law, and if $p$ and $q$ are distributions,
+$H(p,q)$ is the cross-entropy between them.
+
 \end{document}
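
As a quick sanity check of the 1.5-bit claim in the A/B/C example above, here is a worked instance of the entropy formula introduced by the patch; this is an editorial illustration, not part of the patch or the repository:
%
\[
H(p) = -\tfrac{1}{2}\log_2\tfrac{1}{2}
       -\tfrac{1}{4}\log_2\tfrac{1}{4}
       -\tfrac{1}{4}\log_2\tfrac{1}{4}
     = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2}
     = 1.5 \text{ bits},
\]
%
which equals the average length of the ``0''/``10''/``11'' code, so Huffman coding reaches the entropy bound for this particular distribution.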
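
In the same spirit, the chain rule stated near the end of the patch follows in a few lines from the definitions it quotes (entropy of the joint law and conditional entropy as the average entropy of the conditional distribution); again, an illustrative derivation added here for the reader, not text from the repository:
%
\begin{align*}
H(X, Y) &= -\sum_{x, y} p(x, y) \log_2 p(x, y)\\
        &= -\sum_{x, y} p(x, y) \log_2 p(y) - \sum_{x, y} p(x, y) \log_2 p(x \mid y)\\
        &= H(Y) + H(X \mid Y),
\end{align*}
%
using $p(x, y) = p(y)\,p(x \mid y)$ in the second line, and recognizing $\sum_y p(Y=y) H(X \mid Y=y)$ in the last term.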