Photos of (left to right) Solomon Kullback, Richard A. Leibler and Lucien Le Cam. Sources: NSA Cryptologic Hall of Honor (1, 2) and MacTutor.

We saw the last time how to find lower bounds using the total variation divergence.  Even so, conditions with the Kullback-Leiber divergence are easier to verify than the total divergence and ratio likehood ones. However, there are some examples where it is preferred other distances like the Hellinger’s. See [Hoffmann, 1999].

Illustration of the Kullback–Leibler (KL) divergence for two normal Gaussian distributions. Note the typical asymmetry for the KL divergence is clearly visible.

Let us start with some definitions

Definition 1 The Kullback-Leiber (KL) divergence between $latex {P}&fg=000000$ and $latex {Q}&fg=000000$ is defined by

$latex \displaystyle K(P_{1},P_{0})=\begin{cases} {\displaystyle \int\log\frac{dP_{1}}{dP_{0}}dP_{1}} & \text{if }P_{1}\ll P_{0},\\ {\displaystyle +\infty} & \text{otherwise.} \end{cases} &fg=000000$

Note 2 The functions $latex {V(\cdot,\cdot)}&fg=000000$ and $latex {K(\cdot,\cdot)}&fg=000000$ are particular cases of the Csizsár $latex {f}&fg=000000$-divergence (see [Csiszár, 1967] ) defined for $latex {P_{1}\ll P_{0}}&fg=000000$ in the following way:

$latex \displaystyle {\displaystyle D(P_{1},P_{0})=\int f\left(\frac{dP_{1}}{dP_{0}}\right)dP_{0}}. &fg=000000$

Indeed, $latex {V(\cdot,\cdot)}&fg=000000$ and $latex {K(\cdot,\cdot)}&fg=000000$ correspond to $latex {f(x)=\vert x-1\vert/2}&fg=000000$ and $latex {f(x)=x\log x}&fg=000000$. Among other $latex {f}&fg=000000$-divergences, the most famous is when $latex {f(x)=(x-1)^{2}}&fg=000000$ and we call it the $latex {\chi^{2}}&fg=000000$ divergence

$latex \displaystyle \chi^{2}(P_{1},P_{0})=\begin{cases} {\displaystyle \int\left(\frac{dP_{1}}{dP_{0}}-1\right)dP_{0}} & \text{if }P_{1}\ll P_{0},\\ {\displaystyle +\infty} & \text{otherwise.} \end{cases} &fg=000000$

Note 3 It is possible to show that $latex {V(\cdot,\cdot)}&fg=000000$ is in fact a distance. On the other side, $latex {K(\cdot,\cdot)}&fg=000000$ and $latex {\chi^{2}(\cdot,\cdot)}&fg=000000$ are not distances because they are not symmetric.

Now, the following lemma relates total variation and the KL divergence.

Lemma 4 (Le Cam’s inequalities, 1973)

$latex \displaystyle {\displaystyle 1-V(P_{1},P_{0})\geq\frac{1}{2}\left(\int\sqrt{dP_{0}dP_{1}}\right)^{2}\geq\frac{1}{2}\exp(-K(P_{1},P_{0})).} &fg=000000$

Proof: For the left side we have

$latex \displaystyle \begin{array}{rl} & {\displaystyle \left(\int\sqrt{p_{0}p_{1}}\right)^{2}}\\ & ={\displaystyle \left(\int\sqrt{\min(p_{0},p_{1})}\sqrt{\max(p_{0},p_{1})}\right)^{2}}\\ & \leq{\displaystyle \int\min(p_{0},p_{1})\max(p_{0},p_{1})}[1] \\ & ={\displaystyle \int\min(p_{0},p_{1})\left(2-\int\min(p_{0},p_{1})\right)}[2]\\ & \leq{\displaystyle 2\int\min(p_{0},p_{1})}\\ & ={\displaystyle 2(1-V(P_{0},P_{1})).} \end{array} &fg=000000$

To the right inequality we get

$latex \displaystyle \begin{array}{rl} & {\displaystyle {\displaystyle \left(\int\sqrt{p_{0}p_{1}}\right)}^{2}}\\ & ={\displaystyle \exp\left(2\log{\displaystyle \left(\int\sqrt{p_{0}p_{1}}\right)}\right)}\\ & ={\displaystyle \exp\left(2\log{\displaystyle \left(\int\sqrt{\frac{p_{0}}{p_{1}}}p_{1}\right)}\right)}\\ & \geq{\displaystyle \exp\left(2\int\log{\displaystyle \left(\sqrt{\frac{p_{0}}{p_{1}}}p_{1}\right)}\right)[3]}\\ & ={\displaystyle \exp\left(-\int\log{\displaystyle \left(\sqrt{\frac{p_{1}}{p_{0}}}p_{1}\right)}\right)}\\ & =\exp(-K(P_{1},P_{0})) \end{array} &fg=000000$

 $latex \Box&fg=000000$

Analogously to the main theorem in the last post, we can bound $latex {p_{e,1}}&fg=000000$ with the KL divergence.

Theorem 5 Let $latex {P_{0},P_{1}}&fg=000000$ be two probability measures in $latex {(\mathcal{X},\mathcal{A}).}&fg=000000$ If $latex {K(P_{1},P_{0})\leq\alpha<\infty}&fg=000000$, then

$latex \displaystyle p_{e,1}\geq\frac{1}{4}\exp(-\alpha). &fg=000000$

Proof: The proof is clear combining the results of the last theorem and (4). $latex \Box&fg=000000$

The next time, we will use these results in a concrete example.

[1] By Cauchy-Schwartz inequality
[2] Because $latex {\displaystyle{\int\min(p_0,p_1)+\int \max(p_0,p_1)}=\int p_0 + \int p_1 =2}&fg=000000$.
[3] By Jensen’s inequality.

[Csiszár, 1967] Csiszár, I. a. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar., 2:299–318.<\p>

[Hoffmann, 1999] Hoffmann, M. (1999). Adaptive estimation in diffusion processes. Stochastic Processes and Their Applications, 79(1):135–163.<\p>



Leave a Reply