# Introduction to Information Theory


## The fundamental concept

One of the key (and initially counter-intuitive) concepts in
information theory is that information is conveyed by
*randomness*. This is information as defined in a mathematical
sense, which is not identical to the everyday human notion. For example,
it is possible to measure the amount of information in a page of
typewritten text. Due to the structure of the English language, the
amount of information conveyed by each letter in a word is
*substantially* less than the 7-bit ASCII representation used (it is
usually somewhere over 2 bits/letter). There would be more
information conveyed (in the mathematical sense) if the letters were
completely random, instead of structured into words.

On the other hand, it is not too difficult to make the connection
between randomness and information. Consider the tossing of a coin:
if you know the outcome of the coin toss before it is tossed, then
learning the outcome does not give you any more information. If you
have a biased coin that is heads 90% of the time, then you gain very
little information when you learn it is heads. On the other hand, you
gain a fair amount of information when it comes up tails; the
information is thus related somewhat to the degree of "surprise" at
finding out the answer. Q: what weighting of the coin gives the
maximum amount of information *on the average*?
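As a numerical check, here is a sketch in Python using the surprise measure $-\log_2 p$ that is defined formally later in this section; the function names and the scan over biases are my own illustrative choices:

```python
import math

def surprise_bits(p):
    """Information conveyed by an outcome of probability p, in bits: -log2 p."""
    return -math.log2(p)

def avg_info(p_heads):
    """Average information per toss of a coin with bias p_heads."""
    return (p_heads * surprise_bits(p_heads)
            + (1 - p_heads) * surprise_bits(1 - p_heads))

# The 90%-heads coin: heads conveys little information, tails a lot.
print(surprise_bits(0.9))  # ~0.152 bits
print(surprise_bits(0.1))  # ~3.32 bits

# Scanning over biases suggests an answer to the question above.
best = max((p / 100 for p in range(1, 100)), key=avg_info)
```

Running the scan shows the average information is largest for the fair coin, which gives 1 bit per toss.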

Another very important concept that we will say more about later is
that of **typical sequences**. In a sequence of bits of length *n*,
there are some sequences which are (in a sense to be made precise
later) typical. For example, for a sequence of coin-tossing outcomes
for a fair coin, such as HHTHHTHTT, we would expect the number of
heads and tails to be approximately equal (since the coin is fair).
For a biased coin, we would expect the proportion of heads to match
the bias. Sequences that do not follow this trend, such as
HHHHHHHHH, are thus *atypical*. A good part of information theory
is devoted to capturing this concept of typicality as precisely as possible and
using it to conclude how many bits are needed to represent sequences
of data. The basic idea is to try to use bits to represent only the
typical sequences, since the others don't come up very often. (Of
course, when they do come up, you don't want to just throw them away.)
This concept of typical sequences is what the **asymptotic
equipartition property** is all about, which is the topic of Chapter
3.
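A small simulation illustrates the idea for a fair coin: almost every sampled length-*n* sequence has a proportion of heads close to 1/2. This is only a sketch; the sequence length, trial count, and 5% tolerance are arbitrary illustrative choices:

```python
import random

random.seed(0)
n, trials = 1000, 2000
typical = 0
for _ in range(trials):
    # Toss a fair coin n times and count the heads.
    heads = sum(random.random() < 0.5 for _ in range(n))
    # Call the sequence "typical" if its head proportion is within 5% of 1/2.
    if abs(heads / n - 0.5) < 0.05:
        typical += 1
print(typical / trials)  # nearly all sampled sequences are typical
```

Even though atypical sequences such as all-heads are individually just as probable as any particular typical sequence, collectively they are a vanishing fraction of the samples.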

Suppose we have a discrete random variable *X*, and *x* is some
particular outcome that occurs with probability *p*(*x*). Then we
assign to the event *x*, as a measure of the information it conveys, the
uncertainty measure

$$I(x) = \log \frac{1}{p(x)} = -\log p(x).$$

The base of the logarithm determines the units of information. If
base 2 ($\log_2$) is used, then the units are in *bits*. If base $e$
(natural log, $\ln$) is used, then the units are in *nats*. While nats
are not as familiar to engineers, they sometimes make the computations
slightly easier. Q: how do you convert from bits to nats?
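Since $\log_2 x = \ln x / \ln 2$, one bit equals $\ln 2 \approx 0.693$ nats. A minimal sketch of the conversion (the function names are my own):

```python
import math

def bits_to_nats(bits):
    # 1 bit = ln(2) nats, because log2(x) * ln(2) = ln(x).
    return bits * math.log(2)

def nats_to_bits(nats):
    return nats / math.log(2)

print(bits_to_nats(1.0))  # ~0.693 nats per bit
```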

What is more commonly useful is the *average* uncertainty provided
by a random variable *X* taking values in a space $\mathcal{X}$:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$

**The entropy of an r.v. is a measure of the uncertainty of the
random variable.** It is a measure of the amount of information
required on the average to describe the random variable.
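A minimal sketch of the entropy computation; the distribution below is an assumed example (a loaded four-sided die), not one from the text:

```python
import math

def entropy_bits(pmf):
    """H(X) = -sum over x of p(x) * log2 p(x), for a discrete pmf given as a dict."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# Assumed example: a loaded four-sided die.
pmf = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}
print(entropy_bits(pmf))  # 1.75 bits
```

Note that the dyadic probabilities make the answer exact: the outcomes contribute 1, 2, 3, and 3 bits of surprise, weighted by their probabilities.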

Notation: We shall use the operator *E* to denote **expectation**.
If $X \sim p(x)$ (read as: *X* is distributed according to *p*(*x*)),
then for some function *g*(*X*) of the random variable,

$$E\,g(X) = \sum_{x \in \mathcal{X}} g(x)\,p(x)$$

(also known as the law of the unconscious statistician). Recall that
$EX = \sum_x x\,p(x)$, $EX^2 = \sum_x x^2\,p(x)$, etc. Then for
$g(X) = \log \frac{1}{p(X)}$,

$$E\,g(X) = E \log \frac{1}{p(X)} = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} = H(X).$$
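The expectation operator, and the fact that entropy is the expected surprise $E \log \frac{1}{p(X)}$, can be sketched as follows (the two-point distribution is an assumed example):

```python
import math

def expectation(pmf, g):
    # E g(X) = sum over x of g(x) * p(x): the law of the unconscious statistician.
    return sum(g(x) * p for x, p in pmf.items())

pmf = {0: 0.25, 1: 0.75}  # assumed example: a biased bit

mean = expectation(pmf, lambda x: x)                         # EX
second_moment = expectation(pmf, lambda x: x * x)            # EX^2
entropy = expectation(pmf, lambda x: math.log2(1 / pmf[x]))  # E log2 1/p(X) = H(X)
```

Here `entropy` agrees with the direct formula $-\sum_x p(x) \log_2 p(x)$, since both are the same weighted sum.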