# Definitions and Basic Facts


## Convexity and Jensen's inequality

A large part of information theory consists in finding *bounds* on certain performance measures. The analytical idea behind a bound is to replace a complicated expression with something simpler that is not exactly equal to it, but is known to be either greater or smaller than the thing it replaces. This gives rise to simpler statements (and hence some insight), but usually at the expense of precision. Knowing when to use a bound to get a useful result generally requires a fair amount of mathematical maturity and experience.

One of the more important inequalities we will use throughout information theory is Jensen's inequality. Before introducing it, you need to know about convex and concave functions.
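Stated in the usual way: a function *f*(*x*) is **convex** over an interval $(a, b)$ if for every $x_1, x_2 \in (a, b)$ and every $0 \le \lambda \le 1$,

$$
f(\lambda x_1 + (1-\lambda) x_2) \;\le\; \lambda f(x_1) + (1-\lambda) f(x_2).
$$

The function is **strictly convex** if equality holds only when $\lambda = 0$ or $\lambda = 1$, and *f* is **concave** if $-f$ is convex.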

To understand the definition, recall that $\lambda x_1 + (1-\lambda) x_2$ is simply a line segment connecting $x_1$ and $x_2$ (in the $x$ direction) and $\lambda f(x_1) + (1-\lambda) f(x_2)$ is a line segment connecting $f(x_1)$ and $f(x_2)$. Pictorially, the function is convex if the *function lies below the straight line segment connecting two points*, for any two points in the interval.
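As a quick numerical sketch of this picture, take a convex function, say $f(x) = x^2$ (an arbitrary illustrative choice), and check that it stays on or below the chord between two points:

```python
import numpy as np

# Convexity check: f(x) = x^2 is convex, so along the segment between
# x1 and x2 the function should lie on or below the chord joining
# (x1, f(x1)) and (x2, f(x2)).
f = lambda x: x**2
x1, x2 = -1.0, 3.0

for lam in np.linspace(0.0, 1.0, 11):
    x = lam * x1 + (1 - lam) * x2            # point along the segment in x
    chord = lam * f(x1) + (1 - lam) * f(x2)  # corresponding point on the chord
    assert f(x) <= chord + 1e-12             # function lies below the chord
print("f(x) = x**2 lies below the chord at all sampled points")
```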

You will need to keep reminding me of which is which, since when I learned this, the nomenclature was "convex ∪" (cup-shaped, what we now call convex) and "convex ∩" (cap-shaped, what we now call concave).

One reason why we are interested in convex functions is that it is known that *over the interval of convexity there is only one minimum*. This can strengthen many of the results we might want to prove.

We now introduce **Jensen's inequality**.
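In its standard form: if $f$ is a convex function and $X$ is a random variable, then

$$
E[f(X)] \;\ge\; f(E[X]),
$$

with equality, when $f$ is strictly convex, only if $X$ is constant (i.e., $X = E[X]$ with probability 1). For a concave function the inequality is reversed.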

The theorem allows us (more or less) to pull a function outside of a summation in some circumstances.
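As a small numerical sketch of this (with an arbitrary convex $f$ and an arbitrary distribution, chosen only for illustration), the sum $\sum_i p_i f(x_i)$ is never smaller than $f\bigl(\sum_i p_i x_i\bigr)$:

```python
import numpy as np

# Jensen's inequality for a discrete random variable X:
#   E[f(X)] = sum_i p_i f(x_i)  >=  f(E[X]) = f(sum_i p_i x_i)
# whenever f is convex. Here f = exp (convex) and p is an arbitrary pmf.
f = np.exp
p = np.array([0.1, 0.2, 0.3, 0.4])   # probabilities (sum to 1)
x = np.array([-1.0, 0.5, 2.0, 3.0])  # values taken by X

lhs = np.sum(p * f(x))   # E[f(X)]: f applied inside the summation
rhs = f(np.sum(p * x))   # f(E[X]): f pulled outside the summation
print(f"E[f(X)] = {lhs:.4f} >= f(E[X]) = {rhs:.4f}: {lhs >= rhs}")
```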

There is another inequality that got considerable use (in many of the same ways as Jensen's inequality) way back in the dark ages when I took information theory. I may refer to it simply as the **information inequality**.
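Assuming the inequality meant here is the familiar bound on the logarithm by its tangent line at $x = 1$ (the "IT inequality" in some older texts), it reads

$$
\ln x \;\le\; x - 1, \qquad \text{with equality if and only if } x = 1.
$$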

This can also be generalized by taking the tangent line at different points along the function.
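For example, using the tangent to $\ln x$ at an arbitrary point $a > 0$ instead of at $x = 1$ gives

$$
\ln x \;\le\; \ln a + \frac{x - a}{a}, \qquad \text{with equality if and only if } x = a,
$$

which reduces to the previous inequality when $a = 1$.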

With these simple inequalities we can now prove some facts about some of the information measures we defined so far.

Let $\mathcal{X}$ be the set of values that the random variable $X$ takes on and let $|\mathcal{X}|$ denote the number of elements in the set. For discrete random variables, *the uniform distribution over the range $\mathcal{X}$ has the maximum entropy*.
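Written out, the claim is that $H(X) \le \log |\mathcal{X}|$, with equality if and only if $X$ is uniform over $\mathcal{X}$. A sketch of the usual argument, applying Jensen's inequality to the concave function $\log$:

$$
H(X) \;=\; E\!\left[\log \frac{1}{p(X)}\right]
\;\le\; \log E\!\left[\frac{1}{p(X)}\right]
\;=\; \log \!\sum_{x:\,p(x)>0}\! p(x)\,\frac{1}{p(x)}
\;\le\; \log |\mathcal{X}|.
$$

The first inequality is Jensen's inequality (reversed relative to the convex case because $\log$ is concave), and equality throughout requires $p(x) = 1/|\mathcal{X}|$ for every $x \in \mathcal{X}$.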

Note how easily this optimizing value drops in our lap by means of an inequality. There is an important principle of engineering design here: if you can show that some performance criterion is upper-bounded by some function, and can then show how to achieve that upper bound, you have an optimum design. No calculus required!

The more we know, the less uncertainty there is:
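That is, conditioning can never increase entropy: for any two random variables $X$ and $Y$,

$$
H(X \mid Y) \;\le\; H(X),
$$

with equality if and only if $X$ and $Y$ are independent.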