# Data Compression


## Arithmetic coding

Go over the idea of arithmetic coding. Start with *p*₀ = 0.75, *p*₁ = 0.25 and explain the procedure for encoding and decoding. Then take the case of *p*₀ = 0.7 and *p*₁ = 0.3. Generalize to multiple symbols.
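The interval-narrowing procedure can be sketched as follows (a minimal Python sketch; exact rationals via `fractions` sidestep the finite-precision renormalization a practical coder needs, and the function names are mine):

```python
from fractions import Fraction

def ac_encode(bits, p0):
    """Narrow [low, high) once per symbol: a 0 takes the lower
    fraction p0 of the current interval, a 1 takes the rest."""
    low, high = Fraction(0), Fraction(1)
    for b in bits:
        split = low + (high - low) * p0
        if b == 0:
            high = split
        else:
            low = split
    return low, high  # any tag in [low, high) identifies the sequence

def ac_decode(tag, p0, n):
    """Replay the same narrowing, choosing at each step the
    subinterval that contains the tag."""
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(n):
        split = low + (high - low) * p0
        if tag < split:
            out.append(0)
            high = split
        else:
            out.append(1)
            low = split
    return out

# With p0 = 3/4, encoding 01100 and decoding the interval midpoint
# recovers the original sequence.
p0 = Fraction(3, 4)
low, high = ac_encode([0, 1, 1, 0, 0], p0)
assert ac_decode((low + high) / 2, p0, 5) == [0, 1, 1, 0, 0]
```

Note that the final interval width equals the product of the symbol probabilities, which is what ties the code length to the sequence's probability.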

## Lempel-Ziv Coding

Lempel-Ziv coding (and its variants) is a very common data compression scheme. It is based upon the source building up a dictionary of previously-seen strings, and transmitting only the "innovations'' while creating new strings.

For the first example, we start off with an empty dictionary, and assume that we know (somehow!) that the dictionary will not contain more than 8 entries. Suppose we are given the 13-bit string 1011010100010.

The source stream is parsed until the shortest string is found that has not been seen before. Since this is the shortest such string, all of its prefixes must already be in the dictionary. The string can therefore be coded by sending the dictionary index of its prefix together with the new bit, and the string itself is then added to the dictionary.

To illustrate: 1 has not been seen before, so we send the index of its prefix, the null string (000), followed by the bit 1, and add the string 1 to the dictionary. Then 0 has not been seen before, so we send the index of its prefix (000) and the bit 0, and add the string 0 to the dictionary. The string 11 has prefix 1, so we send its index (001) and the bit 1. Proceeding this way, the dictionary looks like this:
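This greedy parsing rule can be sketched in a few lines of Python (a sketch, not a production coder: it assumes the 13-bit input 1011010100010 that fills the 8-entry table below, and the function name is mine):

```python
def lz78_encode(bits):
    """Greedy LZ78 parse: extend the current phrase while it stays in
    the table; at the first novel phrase, emit (prefix index, new bit)
    and register the phrase."""
    table = {"": 0}  # index 0 (binary 000) is the null string
    out = []
    phrase = ""
    for b in bits:
        if phrase + b in table:
            phrase += b
        else:
            out.append((table[phrase], b))
            table[phrase + b] = len(table)
            phrase = ""
    return out, table  # a trailing partial phrase is silently dropped

pairs, table = lz78_encode("1011010100010")
# Each pair costs 3 index bits + 1 data bit, so 7 pairs = 28 bits out.
```

Writing each index in 3 bits reproduces the encoded sequence discussed below.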

| index | contents |
|-------|----------|
| 000   | null     |
| 001   | 1        |
| 010   | 0        |
| 011   | 11       |
| 100   | 01       |
| 101   | 010      |
| 110   | 00       |
| 111   | 10       |

The encoded sequence is (000, 1)(000, 0)(001, 1)(010, 1)(100, 0)(010, 0)(001, 0).

Observe that we haven't compressed the data: 13 bits went in, and 28 bits came out. But with a longer stream of data (and a bigger dictionary), compression could result. In fact, the point that made fame and fortune for Lempel and Ziv is that they proved that asymptotically, the output rate approaches the entropy rate for a stationary source. (We could spend a year working through their proof.) The bottom line is the following theorem: for a stationary, ergodic source, the rate of the Lempel-Ziv code asymptotically approaches the entropy rate of the source.

An obvious improvement on this basic explanatory example is to use fewer bits at the beginning to send the index information, because the table is known not to need that many bits yet.
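To make the saving concrete, here is a small sketch of the count for the 7-phrase example above (the accounting scheme, emitting the *k*-th phrase with ⌈log₂ *k*⌉ index bits, is one common choice, not the only one):

```python
from math import ceil, log2

def index_width(k):
    # Width needed to address the k table entries present when the
    # k-th phrase is emitted (0 bits while only the null string exists).
    return ceil(log2(k)) if k > 1 else 0

# Fixed 3-bit indices: 7 phrases * (3 + 1) bits = 28 bits.
fixed = 7 * 4

# Growing indices: 0+1+2+2+3+3+3 = 14 index bits, plus 7 data bits = 21.
adaptive = sum(index_width(k) + 1 for k in range(1, 8))
```

So even on this tiny example the adaptive index width trims the output from 28 bits to 21.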

There are several implementation issues to refine the performance, including:

- Compression performance vs. table size
- Dictionary initialization: start empty, or start with the alphabet
- Adaptive index sizing
- Dictionary organization and search strategies (speed vs. storage)
- Adaptation: discard sequences that are rarely used

Here is another example on a three-symbol alphabet {*a*, *b*, *c*}. In this case, the dictionary is assumed to be initialized to contain the alphabet of the source. The sequence is

*a b a b c b a b a b a a a a a*...

The output is

(1, *b*)(4, *c*)(2, *a*)(6, *b*)(1, *a*)(8, *a*)...

and the dictionary is

| index | contents |
|-------|----------|
| 1     | a        |
| 2     | b        |
| 3     | c        |
| 4     | ab       |
| 5     | abc      |
| 6     | ba       |
| 7     | bab      |
| 8     | aa       |
| 9     | aaa      |
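The same greedy parse, with the dictionary seeded with the alphabet, reproduces the output pairs and table above (a Python sketch; the function name is mine):

```python
def lz78_encode_seeded(symbols, alphabet):
    """LZ78 parse over an arbitrary alphabet, with the single symbols
    pre-loaded at indices 1..|alphabet| as in the table above."""
    table = {s: i + 1 for i, s in enumerate(alphabet)}
    out = []
    phrase = ""
    for s in symbols:
        if phrase + s in table:
            phrase += s
        else:
            out.append((table[phrase], s))  # (index of prefix, new symbol)
            table[phrase + s] = len(table) + 1
            phrase = ""
    return out, table

pairs, table = lz78_encode_seeded("ababcbababaaaaa", "abc")
# pairs == [(1,'b'), (4,'c'), (2,'a'), (6,'b'), (1,'a'), (8,'a')]
```

Because every single symbol is already in the table, the null-string index is never needed, which is why this variant numbers entries from 1.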