# Definitions and Basic Facts

Entropy Function :: Joint Entropy :: Relative Entropy :: Multivariable :: Convexity

## Relative entropy and mutual information

Suppose there is a r.v. with true distribution *p*. Then (as we will see) we could represent that r.v. with a code that has average length *H*(*p*). However, due to incomplete information we do not know *p*; instead we assume that the distribution of the r.v. is *q*. Then (as we will see) the code would need more bits to represent the r.v. The difference in the number of bits is denoted as *D*(*p*‖*q*). The quantity *D*(*p*‖*q*) comes up often enough that it has a name: it is known as the **relative entropy**.

Note that relative entropy is not symmetric: in the definition *D*(*p*‖*q*) = ∑ *p*(*x*) log( *p*(*x*)/*q*(*x*) ), the *q* (the second argument) appears only in the denominator.
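As a concrete sketch of the definition above (the function name and the choice of bits, i.e. log base 2, are my own), evaluating relative entropy in both argument orders shows the asymmetry directly:

```python
import math

def relative_entropy(p, q):
    """D(p||q) = sum_x p(x) * log2(p(x)/q(x)), in bits.

    Terms with p(x) == 0 contribute 0 (convention: 0 * log 0 = 0);
    if q(x) == 0 while p(x) > 0, D(p||q) is infinite.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue            # 0 * log 0 = 0 by convention
        if qx == 0:
            return math.inf     # p puts mass where q puts none
        total += px * math.log2(px / qx)
    return total

p = [0.5, 0.5]
q = [0.75, 0.25]
print(relative_entropy(p, q))   # D(p||q)
print(relative_entropy(q, p))   # D(q||p) -- a different value: D is not symmetric
```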

Another important concept is that of *mutual information*: how much information does one random variable tell us about another? In fact, this is perhaps the central idea in much of information theory. When we look at the output of a channel, we see the outcomes of a r.v. What we want to know is what went into the channel -- we want to know what was sent, and the only thing we have is what came out. The channel coding theorem (which is one of the high points we are trying to reach in the class) is basically a statement about mutual information.

Note that when *X* and *Y* are independent, *p*(*x*, *y*) = *p*(*x*)*p*(*y*) (definition of independence), so *I*(*X*; *Y*) = 0. This makes sense: if they are independent random variables then *Y* can tell us nothing about *X*.
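This can be checked with a short sketch (names here are illustrative) that computes mutual information from the formula *I*(*X*; *Y*) = ∑ *p*(*x*, *y*) log[ *p*(*x*, *y*) / (*p*(*x*)*p*(*y*)) ]:

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits, from a joint pmf given as a dict {(x, y): p(x, y)}."""
    # Marginals p(x) and p(y)
    px, py = {}, {}
    for (x, y), pxy in joint.items():
        px[x] = px.get(x, 0.0) + pxy
        py[y] = py.get(y, 0.0) + pxy
    # I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
    return sum(pxy * math.log2(pxy / (px[x] * py[y]))
               for (x, y), pxy in joint.items() if pxy > 0)

# Independent X and Y: p(x, y) = p(x) p(y) everywhere, so every log term is 0.
independent = {(x, y): 0.5 * 0.5 for x in (0, 1) for y in (0, 1)}
print(mutual_information(independent))   # 0.0

# Fully dependent: Y = X, so observing Y removes all uncertainty about X.
dependent = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(dependent))     # 1.0 (bit)
```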

An important interpretation of mutual information comes from the following.

Interpretation: The information that *Y* tells us about *X* is the reduction in uncertainty about *X* due to the knowledge of *Y*: *I*(*X*; *Y*) = *H*(*X*) - *H*(*X*|*Y*).

Observe that by symmetry

*I*(*X*; *Y*) = *H*(*Y*) - *H*(*Y*|*X*) = *I*(*Y*; *X*).

That is, *Y* tells as much about *X* as *X* tells about *Y*. Using *H*(*X*, *Y*) = *H*(*X*) + *H*(*Y*|*X*) we get

*I*(*X*; *Y*) = *H*(*X*) + *H*(*Y*) - *H*(*X*, *Y*).
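The identity can also be verified numerically; in this sketch the joint distribution is an arbitrary illustrative choice, and both sides are computed independently:

```python
import math

def H(dist):
    """Shannon entropy in bits of a pmf given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# A joint distribution for binary X and Y (chosen arbitrarily for illustration).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals p(x) and p(y)
px = [sum(p for (x, _), p in joint.items() if x == xv) for xv in (0, 1)]
py = [sum(p for (_, y), p in joint.items() if y == yv) for yv in (0, 1)]

# I(X;Y) computed directly from its definition...
I_direct = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
# ...agrees with H(X) + H(Y) - H(X, Y).
I_identity = H(px) + H(py) - H(joint.values())
print(I_direct, I_identity)
```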

The information that *X* tells about *Y* is the uncertainty in *X* plus the uncertainty about *Y* minus the uncertainty in both *X* and *Y*. We can summarize a bunch of statements about entropy as follows: