Text compression. Huffman's code.

Fixed-length encoding.

As we all know, computers can operate only with numbers, or more exactly, with zeros and ones. Because of this, when we need to store a text, we actually store some numbers that correspond to characters. For example, if the only text we will store is english text typed in lower case, then we need only 27 character, which can be enumerated from 0 to 26:
Character:  
Code:
Binary:
Thus, the following sentence
can be represented as the following sequence of zeros and ones:
(Spaces are used only to separate numbers.) As you can see it requires exactly bits because the string has characters in it and each character requires exactly 5 bits to store.

Variable length encoding.

To reduce the number of bits required for holding the complete string, we can use variable-length encoding. We will assign shorter codes like (0, 10, 11, etc) to the most commonly used character. We need to assign the codes in such a way that none of the codeword for one character constitutes the beginning of the codeword for another character. For example, if we have codeword 10, then none of codewords for other characters starts with 10. Such codes are called prefix-codes.

To do so we need to know how often each character appears in the string. In other words, we need to compute the frequency table:
Character:
Count:
Code:
Using the new codes the same sentence can be written as:

We can count that this version of the binary representation of the text takes only bits. That is we compressed the text by %.

Every prefix code can be represented as by a binary tree whose leaves are the characters that are to be encoded. The binary tree corresponding to the codes above are shown on the picture

To find out the code of a character we start from the root node and moving along the edges to the leaf with the desired character combining the zeros and ones on the edges together. The advantage of a prefix code is that we need not look ahead when parsing the file. This is readily apparent from the tree representation of the code. To parse, we start at the first bit of the left of the file and at the root of the tree. We sequence through the bits, and go left or right down the tree depending on whether a zero or one is encountered. When we reach a leaf, we obtain the character at that leaf; then we return to the root and repeat the procedure starting with the next bit in sequence.

Huffman code.

The prefix code we created above is not the most efficient code for these sequence of characters. Huffman developed a greedy algorithm that produces an optimal binary character code by constructing a binary tree corresponding to an optimal code. A code produced by the algorithm is called a Huffman code.

The Huffman code uses a priority queue. In a priority queue, the elements with the highest priority is always removed next. In our case, the element with the highest priority is the character with the lowest frequency in the file. A priority queue can be easily implemented as a heap.

To implement the Huffman code, we first create a special structure type elements of which will serve as nodes of the prefix tree:

struct Node {
  char  symbol;     // a distinct character in the file
  int   frequency;  // the frequency of that character in the file
  Node *left;       // pointer to the left subtree
  Node *right;      // pointer to the right subtree
};
Initially we put all elements of the type in the priority queue. We have as many elements in the queue as many distinct characters appear in the file. Remember that the highest priority have the elements with the lowest frequency. Then the algorithm goes as follows: The demo shows how this method works being applied to the file under consideration.

Huffman method produces the following binary code tree