# Arithmetic Coding


## Probability models

The performance of an arithmetic coder depends on having a good model for the source probabilities: the better the model predicts the source, the better the code can be expected to perform. In principle, any probabilistic model can be used. We mention here some concepts that are useful in developing one.

Suppose, as before, we deal with the case of independent events. We have outcomes `a` and `b`, with probabilities $p_a$ and $p_b$. Let *l* be the number of outcomes (number of coin tosses). The value of $p_a$ could be anywhere in the range [0,1], and we may not have any predisposition toward one value. We model this ambivalence by saying that

$$P(p_a) = 1, \qquad p_a \in [0,1].$$

That is, it is uniformly distributed. This is a *prior probability*. If we had some predisposition about $p_a$, this could be incorporated into the prior model (using something like a beta distribution, for example). The whole point of Bayesian estimation (which is what we find we are talking about here) is to merge our prior inclinations with the observations. This is a problem of inference, which we can state this way: given a sequence of *F* bits, of which $F_a$ are `a`s and $F_b$ are `b`s, infer $p_a$. The inference is accomplished by the posterior ("after"): the probability of $p_a$ *after* a measurement $s$ is made. We write

$$P(p_a \mid s, F) = \frac{P(s \mid p_a, F)\, P(p_a)}{P(s \mid F)}.$$

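Now why this? This is Bayes' rule, and we can write down the conditional probability in the numerator directly:

$$P(s \mid p_a, F) = p_a^{F_a} (1 - p_a)^{F_b}.$$

Because the outcomes are independent, the probability of a particular sequence is the product of the probabilities of its outcomes: each `a` contributes a factor $p_a$ and each `b` a factor $1 - p_a$. As we have seen elsewhere, the conditional probability that is easy to write down is usually the one running in the opposite direction from the one we actually want. Integrating the numerator over the uniform prior, we also find

$$P(s \mid F) = \int_0^1 p_a^{F_a} (1 - p_a)^{F_b}\, dp_a = \frac{F_a!\, F_b!}{(F_a + F_b + 1)!}.$$
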
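As a sanity check on the normalizing integral $\int_0^1 p_a^{F_a}(1-p_a)^{F_b}\,dp_a = F_a!\,F_b!/(F_a+F_b+1)!$ under a uniform prior, the following minimal sketch compares the closed form against a simple midpoint-rule approximation. The function names `evidence_closed_form` and `evidence_numeric` are illustrative choices, not part of any coder described in these notes.

```python
from math import factorial


def evidence_closed_form(f_a: int, f_b: int) -> float:
    """Closed form F_a! F_b! / (F_a + F_b + 1)! (uniform prior assumed)."""
    return factorial(f_a) * factorial(f_b) / factorial(f_a + f_b + 1)


def evidence_numeric(f_a: int, f_b: int, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of p^F_a (1-p)^F_b on [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += p ** f_a * (1.0 - p) ** f_b
    return total * h


if __name__ == "__main__":
    # Two a's and one b: the two values should agree closely.
    print(evidence_closed_form(2, 1), evidence_numeric(2, 1))
```

The numerical integral is only there to make the identity tangible; in practice the factorial form is all that is needed.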
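So we could infer $p_a$ as the most probable value (the maximizer) of the posterior. For example, for $F_a = 2$ and $F_b = 1$ we find $P(p_a \mid s, F) \propto p_a^2 (1 - p_a)$, with maximum at $p_a = 2/3$. Or we could infer based on the mean, which is 3/5. We also want to be able to make predictions. Given a sequence $s$ of length *F* as evidence, we find the prediction of drawing an `a` as

$$P(\texttt{a} \mid s, F) = \int_0^1 P(\texttt{a} \mid p_a)\, P(p_a \mid s, F)\, dp_a.$$
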
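Note that in this case we are using the entire posterior probability, so we incorporate all of our uncertainty about $p_a$. We also have $P(\texttt{a} \mid p_a) = p_a$ (by its definition), so our predictor is

$$P(\texttt{a} \mid s, F) = \int_0^1 p_a\, \frac{p_a^{F_a} (1 - p_a)^{F_b}}{P(s \mid F)}\, dp_a = \frac{F_a + 1}{F_a + F_b + 2}.$$
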
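The predictor $(F_a + 1)/(F_a + F_b + 2)$ can also be read as a ratio of two normalizing integrals: the probability of the sequence extended by one more `a`, divided by the probability of the sequence itself. The sketch below, which assumes a uniform prior and uses exact rational arithmetic, checks this for two `a`s and one `b`. The helper names `evidence`, `predict_a`, and `posterior_mode` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from math import factorial


def evidence(f_a: int, f_b: int) -> Fraction:
    """P(s | F) = F_a! F_b! / (F_a + F_b + 1)! under a uniform prior."""
    return Fraction(factorial(f_a) * factorial(f_b), factorial(f_a + f_b + 1))


def predict_a(f_a: int, f_b: int) -> Fraction:
    """P(a | s, F) as the ratio P(s followed by a) / P(s)."""
    return evidence(f_a + 1, f_b) / evidence(f_a, f_b)


def posterior_mode(f_a: int, f_b: int) -> Fraction:
    """Maximizer of p^F_a (1-p)^F_b on [0, 1]; needs at least one observation."""
    return Fraction(f_a, f_a + f_b)


if __name__ == "__main__":
    f_a, f_b = 2, 1
    print("mode:", posterior_mode(f_a, f_b))
    print("predictive (posterior mean):", predict_a(f_a, f_b))
    print("closed form:", Fraction(f_a + 1, f_a + f_b + 2))
```

With no observations at all (`predict_a(0, 0)`), the ratio reduces to 1/2, which is what a uniform prior should give.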
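This update rule is known as Laplace's rule, and is the rule that was used in the coder above. We could write this as

$$P_L(\texttt{a} \mid s, F) = \frac{F_a + 1}{\sum_{a'} (F_{a'} + 1)},$$

where the sum runs over all symbols $a'$ of the alphabet.
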
Another model, known as the Dirichlet model, is more "responsive":

$$P_D(\texttt{a} \mid s, F) = \frac{F_a + \alpha}{\sum_{a'} (F_{a'} + \alpha)}.$$

Typically, $\alpha$ is small, like 0.01.
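To see what "responsive" means in practice, the following sketch implements both rules as a single count-based predictor of the form $(F_a + \alpha)/\sum_{a'}(F_{a'} + \alpha)$, with $\alpha = 1$ giving Laplace's rule. It also sums the ideal code length, $-\log_2$ of each predicted probability, over a sequence as the counts are updated. The names `predict` and `adaptive_code_length` are assumptions made for this example and are not taken from the coder discussed earlier.

```python
from collections import Counter
from math import log2


def predict(counts: Counter, symbol: str, alphabet: str, alpha: float = 1.0) -> float:
    """(F_symbol + alpha) / sum over the alphabet of (F_a' + alpha).

    alpha = 1 is Laplace's rule; a small alpha is the Dirichlet model.
    """
    total = sum(counts[a] + alpha for a in alphabet)
    return (counts[symbol] + alpha) / total


def adaptive_code_length(sequence: str, alphabet: str, alpha: float = 1.0) -> float:
    """Ideal code length in bits when each symbol is predicted from the counts so far."""
    counts: Counter = Counter()
    bits = 0.0
    for x in sequence:
        bits += -log2(predict(counts, x, alphabet, alpha))
        counts[x] += 1  # update the model only after coding the symbol
    return bits


if __name__ == "__main__":
    seen = Counter("aba")
    print("Laplace   P(a | aba):", predict(seen, "a", "ab", alpha=1.0))
    print("Dirichlet P(a | aba):", predict(seen, "a", "ab", alpha=0.01))
    for alpha in (1.0, 0.01):
        print(alpha, adaptive_code_length("aaaaaaaa", "ab", alpha))
```

With a small $\alpha$ the prediction moves toward the observed frequencies after only a few symbols, so a strongly skewed sequence costs fewer bits; the price is a larger penalty the first time a previously unseen symbol appears.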

This is not the only possible rule, and it does not necessarily take into account relationships that might exist between dependent variables.