Introduction to Information Theory
The fundamental concept
One of the key (and initially counter-intuitive) concepts in information theory is that information is conveyed by randomness. This is information as defined in a mathematical sense, which is not identical to the everyday sense of the word. For example, it is possible to measure the amount of information in a page of typewritten text. Due to the structure of the English language, the amount of information conveyed by each letter in a word is substantially less than the 7 bits used by its ASCII representation. (It is usually somewhere over 2 bits/letter.) There would be more information conveyed (in the mathematical sense) if the letters were completely random, instead of structured into words.
It is nonetheless not too difficult to make the connection between randomness and information. Consider the tossing of a coin: if you know the outcome of the coin toss before it is tossed, then learning the outcome does not give you any more information. If you have a biased coin that is heads 90% of the time, then you gain very little information when you learn it is heads. On the other hand, you gain a fair amount of information when it comes up tails; the information is thus related to the degree of "surprise" at finding out the answer. Q: What weighting of the coin gives the maximum amount of information on the average?
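The coin example can be explored numerically. The sketch below assumes the usual measure of surprise, log2(1/p) bits for an outcome of probability p, and averages it over heads and tails. The function names are illustrative and do not come from these notes.

```python
import math

def surprise_bits(p):
    """Surprise, in bits, of an outcome that occurs with probability p."""
    return math.log2(1.0 / p)

def average_surprise_bits(p_heads):
    """Average surprise of one toss of a coin with P(heads) = p_heads."""
    total = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0.0:  # an outcome that never occurs contributes nothing
            total += p * surprise_bits(p)
    return total

# Biased coin from the text: heads 90% of the time.
print(f"heads: {surprise_bits(0.9):.3f} bits")
print(f"tails: {surprise_bits(0.1):.3f} bits")

# Scan a few weightings to see how the average changes.
for p_heads in (0.5, 0.7, 0.9, 1.0):
    print(f"P(heads) = {p_heads}: {average_surprise_bits(p_heads):.3f} bits on average")
```

Scanning p_heads more finely is one way to investigate the question posed above.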
Another very important concept that we will say more about later is
that of typical sequences. In a sequence of bits of length n,
there are some sequences which are (in a sense to be made precise
later) typical. For example, for a sequence of coin-tossing outcomes
for a fair coin, such as HHTHHTHTT, we would expect the number of
heads and tails to be approximately equal (since the coin is fair).
For a biased coin, we would expect the proportion of heads to go
with the bias. Sequences that do not follow this trend, such as
HHHHHHHHH, are thus atypical. A good part of information theory
is capturing this concept of typicality as precisely as possible and
using it to conclude how many bits are needed to represent sequences
of data. The basic idea is to try to use bits to represent only the
typical sequences, since the others don't come up very often. (Of
course, when they do come up, you don't want to just throw them away.)
This concept of typical sequences is what the asymptotic
equipartition property is all about, which is the topic of Chapter
3.
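A rough numerical illustration of why only the typical sequences need bits: the sketch below treats a sequence as "typical" when its fraction of heads is within eps of the coin's bias. This is only a stand-in for the precise definition promised later, and the tolerance eps and the function names are assumptions made for the example.

```python
from math import comb, log2

def near_bias(k, n, p_heads, eps):
    """True if k heads out of n tosses is within eps of the bias."""
    return abs(k / n - p_heads) <= eps

def prob_typical(n, p_heads, eps):
    """Probability that n independent tosses give a 'typical' sequence."""
    return sum(
        comb(n, k) * p_heads**k * (1.0 - p_heads)**(n - k)
        for k in range(n + 1)
        if near_bias(k, n, p_heads, eps)
    )

def count_typical(n, p_heads, eps):
    """Number of length-n sequences that are 'typical' in this sense."""
    return sum(comb(n, k) for k in range(n + 1) if near_bias(k, n, p_heads, eps))

# Coin that is heads 90% of the time, tolerance of 5 percentage points.
for n in (10, 100, 1000):
    prob = prob_typical(n, 0.9, 0.05)
    count = count_typical(n, 0.9, 0.05)
    print(f"n={n}: probability {prob:.3f}, "
          f"about {log2(count):.0f} bits to index them (vs {n} for all sequences)")
```

As n grows, the typical set carries almost all of the probability while remaining a small fraction of the 2^n possible sequences.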
Suppose we have a discrete random variable X, and x is some
particular outcome that occurs with probability p(x). Then we
assign to that event x, as the information that it conveys, the
uncertainty measure
$$\log \frac{1}{p(x)} = -\log p(x).$$
The base of the logarithm determines the units of information. If
$\log_2$ is used, then the units are in bits. If $\log_e$
(natural log) is used, then the units are in nats. While nats
are not as familiar to engineers, they sometimes make the computations
slightly easier. Q: How do you convert from bits to nats?
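The effect of the base can be checked with a short sketch. It assumes the uncertainty measure of an outcome with probability p is log(1/p); the helper name `information` and the example probability are illustrative only.

```python
import math

def information(p, base=2.0):
    """Uncertainty measure log(1/p) of an outcome with probability p.

    base=2 gives bits, base=math.e gives nats.
    """
    return math.log(1.0 / p, base)

p = 0.25  # example probability, chosen arbitrarily
bits = information(p, base=2.0)
nats = information(p, base=math.e)
print(f"{bits:.4f} bits, {nats:.4f} nats")

# Change of base: log_e(y) = log_2(y) * log_e(2), so the two values
# differ only by the constant factor ln 2.
print(math.isclose(nats, bits * math.log(2.0)))
```

The constant factor does not depend on p, so changing the base rescales every information quantity in the same way.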
What is more commonly useful is the average uncertainty provided
by a random variable X taking values in a space $\mathcal{X}$.
The entropy of an r.v.,
$$H(X) = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} = -\sum_{x \in \mathcal{X}} p(x) \log p(x),$$
is a measure of the uncertainty of the
random variable. It is a measure of the amount of information
required on the average to describe the random variable.
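A minimal sketch of the entropy computation, assuming the standard definition of entropy as the probability-weighted average of log2(1/p(x)). Representing the distribution as a dictionary from outcomes to probabilities is a choice made for this example.

```python
import math

def entropy_bits(pmf):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    if not math.isclose(sum(pmf.values()), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Outcomes with probability 0 never occur and contribute nothing.
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0.0)

fair_coin = {"H": 0.5, "T": 0.5}
four_way = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
skewed = {"a": 0.5, "b": 0.25, "c": 0.25}

print(entropy_bits(fair_coin))  # 1.0
print(entropy_bits(four_way))   # 2.0
print(entropy_bits(skewed))     # 1.5
```

The skewed distribution needs fewer bits on average than the uniform one over four outcomes, which matches the idea that less randomness means less information.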
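Notation: We shall use the operator E to denote expectation.
If $X \sim p(x)$
(read as: X is distributed according to p(x)),
then for some function of the random variable g(X),
$$E\, g(X) = \sum_{x} g(x)\, p(x)$$
(also known as the law of the unconscious statistician). Recall $EX$, $EX^2$, etc. Then for
$g(X) = \log \frac{1}{p(X)}$, the entropy of X, written H(X), is the expectation
$$H(X) = E \log \frac{1}{p(X)} = \sum_{x} p(x) \log \frac{1}{p(x)}.$$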
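The expectation of a function of a random variable can be sketched in a few lines. The fair die below is an example chosen for illustration, and the helper name `expectation` is an assumption rather than notation from these notes.

```python
import math

def expectation(pmf, g=lambda x: x):
    """E g(X) = sum over x of g(x) p(x), for a pmf given as {x: p(x)}."""
    return sum(g(x) * p for x, p in pmf.items())

# Fair six-sided die.
die = {face: 1.0 / 6.0 for face in range(1, 7)}

mean = expectation(die)                            # E X
second_moment = expectation(die, lambda x: x**2)   # E X^2
variance = second_moment - mean**2

print(round(mean, 4), round(second_moment, 4), round(variance, 4))

# Taking g(x) = log2(1/p(x)) gives the average uncertainty in bits.
avg_uncertainty = expectation(die, lambda x: math.log2(1.0 / die[x]))
print(round(avg_uncertainty, 4))
```

Note that g is applied to each outcome and weighted by p(x); the distribution of g(X) itself is never needed.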







