Definitions and Basic Facts


Convexity and Jensen's inequality

A large part of information theory consists in finding bounds on certain performance measures. The analytical idea behind a bound is to replace a complicated expression with something simpler but not exactly equal, known to be either greater or smaller than the thing it replaces. This gives rise to simpler statements (and hence some insight), but usually at the expense of precision. Knowing when to use a bound to get a useful result generally requires a fair amount of mathematical maturity and experience.

One of the more important inequalities we will use throughout information theory is Jensen's inequality. Before introducing it, you need to know about convex and concave functions.

\begin{definition}
A function $f(x)$ is said to be {\bf convex} over an interval $(a,b)$ if for every $x_1, x_2 \in (a,b)$ and every $0 \leq \lambda \leq 1$,
\begin{displaymath}
f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2).
\end{displaymath}
The function is {\bf strictly convex} if equality holds only if
$\lambda=0$ or $\lambda=1$.
\end{definition}

To understand the definition, recall that $\lambda x_1 + (1-\lambda) x_2$ is simply a line segment connecting $x_1$ and $x_2$ (in the $x$ direction) and $\lambda f(x_1) + (1-\lambda) f(x_2)$ is a line segment connecting $f(x_1)$ and $f(x_2)$. Pictorially, the function is convex if it lies below the straight line segment connecting $f(x_1)$ and $f(x_2)$, for any two points $x_1, x_2$ in the interval.
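As a quick numerical sanity check of the definition, here is a small added sketch in Python with NumPy; the choice of $f(x)=x^2$ and the random test points are arbitrary, made up purely for illustration.

\begin{verbatim}
# Check the convexity definition for f(x) = x^2 at many random points:
# f(lambda*x1 + (1-lambda)*x2) <= lambda*f(x1) + (1-lambda)*f(x2).
import numpy as np

def f(x):
    return x ** 2

rng = np.random.default_rng(0)
x1 = rng.uniform(-10.0, 10.0, size=1000)
x2 = rng.uniform(-10.0, 10.0, size=1000)
lam = rng.uniform(0.0, 1.0, size=1000)

chord_point = f(lam * x1 + (1 - lam) * x2)       # f evaluated on the chord's x
chord_value = lam * f(x1) + (1 - lam) * f(x2)    # the chord itself
print(bool(np.all(chord_point <= chord_value + 1e-12)))   # True: x^2 is convex
\end{verbatim}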

\begin{definition}
A function $f$\ is {\bf concave} if $-f$\ is convex.
\end{definition}

You will need to keep reminding me of which is which, since when I learned this, the nomenclature was "convex $\cup$ '' and "convex $\cap$ ''.


\begin{example}
\begin{description}
\item[Convex:] $x^2$, $e^x$, $\vert x\vert$, $x \log x$.
\item[Concave:] $\log x$, $\sqrt{x}$.
\end{description}
\end{example}

One reason why we are interested in convex functions is that over the interval of convexity there is only one minimum (any local minimum is also the global minimum). This can strengthen many of the results we might want.


\begin{theorem}
If $f$\ has a second derivative which is non-negative (positive)
everywhere, then $f$\ is convex (strictly convex).
\end{theorem}
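For instance, as a quick check of the theorem against the examples above (a worked line added here, assuming natural logarithms so the derivatives are simple; the sign conclusion is the same for any base): for $f(x) = x\ln x$ on $x > 0$,
\begin{displaymath}
f'(x) = \ln x + 1, \qquad f''(x) = \frac{1}{x} > 0,
\end{displaymath}
so $x\log x$ is strictly convex there, while $f(x) = \sqrt{x}$ has $f''(x) = -\tfrac{1}{4}x^{-3/2} < 0$ on $x > 0$, so $\sqrt{x}$ is concave.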

\begin{proof}
The Taylor-series expansion of $f$ about the point $x_0$ is
\begin{displaymath}
f(x) = f(x_0) + f'(x_0)(x-x_0) + \frac{f''(x^*)}{2}(x-x_0)^2
\end{displaymath}
for some $x^*$ between $x_0$ and $x$. By hypothesis the last term is non-negative, so
\begin{displaymath}
f(x) \geq f(x_0) + f'(x_0)(x-x_0).
\end{displaymath}
Let $x_0 = \lambda x_1 + (1-\lambda) x_2$. Taking $x = x_1$ gives
\begin{displaymath}
f(x_1) \geq f(x_0) + f'(x_0)(1-\lambda)(x_1 - x_2),
\end{displaymath}
and taking $x = x_2$ gives
\begin{displaymath}
f(x_2) \geq f(x_0) + f'(x_0)\lambda(x_2 - x_1).
\end{displaymath}
Multiply the first inequality by $\lambda$, the second by $1-\lambda$, and add
together to get the convexity result.
\end{proof}

We now introduce Jensen's inequality.

\begin{theorem}
If $f$ is a convex function and $X$ is a r.v. then
\begin{displaymath}
Ef(X) \geq f(EX).
\end{displaymath}
If $f$ is concave then
\begin{displaymath}
Ef(X) \leq f(EX).
\end{displaymath}
\end{theorem}

The theorem allows us (more or less) to pull a function outside of a summation in some circumstances.
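As an illustration, here is an added numerical sketch; the pmf and the convex function $f(x)=x^2$ are made up for the example.

\begin{verbatim}
# Jensen's inequality for a discrete r.v. X and the convex f(x) = x^2:
# E[f(X)] >= f(E[X]).
import numpy as np

x = np.array([1.0, 2.0, 5.0, 9.0])    # values taken on by X
p = np.array([0.1, 0.4, 0.3, 0.2])    # their probabilities (sum to 1)

Ef_X = np.sum(p * x ** 2)             # E[f(X)]
f_EX = np.sum(p * x) ** 2             # f(E[X])
print(Ef_X, f_EX, Ef_X >= f_EX)       # about 25.4, 17.64, True
\end{verbatim}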

\begin{proof}
The proof is by induction. When $X$ takes on two values the
inequality is
\begin{displaymath}
p_1 f(x_1) + p_2 f(x_2) \geq f(p_1 x_1 + p_2 x_2),
\end{displaymath}
which is just the definition of convexity. Suppose the theorem is true for distributions with $k-1$ mass points. Writing $p_i' = p_i/(1-p_k)$ for $i=1,\ldots,k-1$,
\begin{displaymath}
\begin{aligned}
\sum_{i=1}^k p_i f(x_i) &= p_k f(x_k) + (1-p_k)\sum_{i=1}^{k-1} p_i' f(x_i) \\
&\geq p_k f(x_k) + (1-p_k) f\left( \sum_{i=1}^{k-1} p_i' x_i\right) \\
&\geq f\left( p_k x_k + (1-p_k) \sum_{i=1}^{k-1} p_i' x_i\right) \\
&= f\left( \sum_{i=1}^k p_i x_i\right)
\end{aligned}
\end{displaymath}
\end{proof}

There is another inequality that got considerable use (in many of the same ways as Jensen's inequality) way back in the dark ages when I took information theory. I may refer to it simply as the information inequality.

\begin{theorem}$\log x \leq x-1$, with equality if and only if $x=1$.
\end{theorem}

This can also be generalized by taking the tangent line at different points along the function.
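Concretely (an added worked line, reading "the line" as the tangent line and using the natural logarithm, for which the theorem holds in the form stated): since $\log x$ is concave, it lies below its tangent at any point $x_0 > 0$,
\begin{displaymath}
\log x \leq \log x_0 + \frac{x - x_0}{x_0},
\end{displaymath}
with equality if and only if $x = x_0$; the choice $x_0 = 1$ recovers the inequality of the theorem.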

With these simple inequalities we can now prove some facts about some of the information measures we defined so far.

\begin{theorem}
$D(p\Vert q) \geq 0$, with equality if and only if $p(x)=q(x)$\ for all $x$.
\end{theorem}

\begin{proof}
\begin{displaymath}
\begin{aligned}
-D(p\Vert q) = -\sum_x p(x)\log\frac{p(x)}{q(x)} &= \sum_x p(x)\log\frac{q(x)}{p(x)} \\
&\leq \log \sum_x p(x)\frac{q(x)}{p(x)} \qquad \text{(Jensen's inequality)} \\
&= \log \sum_x q(x) = \log 1 = 0.
\end{aligned}
\end{displaymath}
Since $\log$ is a strictly concave function, we have equality if and
only if $q(x)/p(x)=1$ for all $x$.
\end{proof}

\begin{proof}
Here is another proof using the information inequality:
\begin{displaymath}
\begin{aligned}
-D(p\Vert q) = \sum_x p(x)\log\frac{q(x)}{p(x)} &\leq \sum_x p(x)\left(\frac{q(x)}{p(x)} - 1\right) \qquad (\log x \leq x-1) \\
&=\sum_x \left(q(x)-p(x)\right) = 0.
\end{aligned}
\end{displaymath}
\end{proof}

\begin{corollary}
Mutual information is non-negative:
\begin{displaymath}
I(X;Y) \geq 0,
\end{displaymath}
with equality if and only if $X$ and $Y$ are independent.
\end{corollary}

This follows from the theorem, since $I(X;Y) = D(p(x,y)\Vert p(x)p(y))$.

Let $\Xc$ be the set of values that the random variable $X$ takes on and let $\vert\Xc\vert$ denote the number of elements in the set. For discrete random variables, the uniform distribution over the range $\Xc$ has the maximum entropy.

\begin{theorem}
$H(X) \leq \log \vert\Xc\vert$, with equality iff $X$\ has a uniform distribution.
\end{theorem}

\begin{proof}
Let $u(x) = \frac{1}{\vert\Xc\vert}$ be the uniform distribution over $\Xc$, and let $p(x)$ be the probability mass function of $X$. Then
\begin{displaymath}
0 \leq D(p\Vert u) = \sum_x p(x)\log\frac{p(x)}{u(x)} = \log \vert\Xc\vert - H(X).
\end{displaymath}
\end{proof}

Note how easily this optimizing value drops in our lap by means of an inequality. There is an important principle of engineering design here: if you can show that some performance criterion is upper-bounded by some function, and then show how to achieve that upper bound, you have an optimum design. No calculus required!
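Numerically, the bound and its equality case look like this (an added sketch; the four-symbol pmf is made up):

\begin{verbatim}
# H(X) <= log|X|, with equality for the uniform distribution (bits).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # convention: 0 log 0 = 0
    return float(-np.sum(p * np.log2(p)))

p = np.array([0.7, 0.1, 0.1, 0.1])      # a non-uniform pmf on 4 symbols
u = np.full(4, 0.25)                    # the uniform pmf on the same symbols
print(entropy(p), entropy(u), np.log2(4))   # about 1.36 < 2.0 = 2.0
\end{verbatim}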

The more we know, the less uncertainty there is:

\begin{theorem}
Conditioning reduces entropy:
\begin{displaymath}
H(X\vert Y) \leq H(X),
\end{displaymath}
with equality iff $X$ and $Y$ are independent.
\end{theorem}

\begin{proof}
$0 \leq I(X;Y) = H(X) - H(X\vert Y)$.
\end{proof}
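Here is a small numerical check (an added sketch; the joint pmf is made up), using the identity $H(X\vert Y) = H(X,Y) - H(Y)$:

\begin{verbatim}
# H(X|Y) <= H(X) for a small joint pmf p(x, y), in bits.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_xy = np.array([[0.30, 0.10],          # rows index x, columns index y
                 [0.05, 0.55]])
H_X = entropy(p_xy.sum(axis=1))
H_Y = entropy(p_xy.sum(axis=0))
H_X_given_Y = entropy(p_xy) - H_Y       # H(X|Y) = H(X,Y) - H(Y)
print(H_X_given_Y, H_X, H_X_given_Y <= H_X)   # about 0.61 <= 0.97, True
\end{verbatim}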

\begin{theorem}
\begin{displaymath}
H(X_1,X_2,\ldots,X_n) \leq \sum_{i=1}^n H(X_i),
\end{displaymath}
with equality if and only if the $X_i$ are independent.
\end{theorem}

\begin{proof}
By the chain rule for entropy,
\begin{displaymath}
\begin{aligned}
H(X_1,X_2,\ldots,X_n) &= \sum_{i=1}^n H(X_i\vert X_{i-1},\ldots,X_1) \\
&\leq \sum_{i=1}^n H(X_i) \qquad\text{(conditioning reduces entropy)}
\end{aligned}
\end{displaymath}
\end{proof}

Copyright 2008, by the Contributing Authors. admin. (2006, May 17). Definitions and Basic Facts. Retrieved January 07, 2011, from Free Online Course Materials — USU OpenCourseWare Web site: http://ocw.usu.edu/Electrical_and_Computer_Engineering/Information_Theory/lecture2_5.htm. This work is licensed under a Creative Commons License.