As we all know, computers can operate only with numbers, or more exactly, with zeros and ones. Because of this, when we need
to store a text, we actually store some numbers that correspond to characters. For example, if the only text we will store
is english text typed in lower case, then we need only 27 character, which can be enumerated from 0 to 26:
Character:
Code:
Binary:
Thus, the following sentence
can be represented as the following sequence of zeros and ones:
(Spaces are used only to separate numbers.) As you can see it requires exactly
bits because the string has characters
in it and each character requires exactly 5 bits to store.
Variable length encoding.
To reduce the number of bits required for holding the complete string, we can use variable-length encoding.
We will assign shorter codes like (0, 10, 11, etc) to the most commonly used character. We need to assign the codes in such
a way that none of the codeword for one character constitutes the beginning of the codeword for another character.
For example, if we have codeword 10, then none of codewords for other characters starts with 10.
Such codes are called prefix-codes.
To do so we need to know how
often each character appears in the string. In other words, we need to compute the frequency table:
Character:
Count:
Code:
Using the new codes the same sentence can be written as:
We can count that this version of the binary representation of the text takes only
bits. That is we compressed the text by
%.
Every prefix code can be represented as by a binary tree whose leaves are the characters that are to be encoded. The binary
tree corresponding to the codes above are shown on the picture
To find out the code of a character we start from the root node and moving along the edges to the leaf with the desired
character combining the zeros and ones on the edges together. The advantage of a prefix code is that we need not look ahead
when parsing the file. This is readily apparent from the tree representation of the code. To parse, we start at the first
bit of the left of the file and at the root of the tree. We sequence through the bits, and go left or right down the tree
depending on whether a zero or one is encountered. When we reach a leaf, we obtain the character at that leaf; then we
return to the root and repeat the procedure starting with the next bit in sequence.
Huffman code.
The prefix code we created above is not the most efficient code for these sequence of characters. Huffman developed a
greedy algorithm that produces an optimal binary character code by constructing a binary tree corresponding to an
optimal code. A code produced by the algorithm is called a Huffman code.
The Huffman code uses a priority queue. In a priority queue, the elements with the highest priority is always
removed next. In our case, the element with the highest priority is the character with the lowest frequency in the file.
A priority queue can be easily implemented as a heap.
To implement the Huffman code, we first create a special structure type elements of which will serve as nodes of the
prefix tree:
struct Node {
char symbol; // a distinct character in the fileint frequency; // the frequency of that character in the file
Node *left; // pointer to the left subtree
Node *right; // pointer to the right subtree
};
Initially we put all elements of the type in the priority queue. We have as many elements in the queue as many distinct
characters appear in the file. Remember that the highest priority have the elements with the lowest frequency.
Then the algorithm goes as follows:
it removes two elements from the top of the queue
it creates another element of the Node type with no symbol stored in it;
the frequency of the new element is equal to the sum of the frequencies of the retrieved elements;
the left subtree of the new element is set to be equal to the first of the elements (the one with the
higher priority);
and the right subtree is now pointing to the other element
then we put the new element in the priority queue
we will repeat these steps until there is only one element left in the queue.
The demo shows how this method works being applied to the file under consideration.
Huffman method produces the following binary code tree