Dictionary and hash function

Dictionary

A symbol table or a dictionary is a data structure of items with keys that supports two basic operations: insert a new item, and return an item with the given key.
We will construct a simple dictionary to compute a word frequency in a text file. Our program should read a text file word by word and for each word perform the following actions:
  1. check if this word in the dictionary
  2. if the word is in the dictionary, then get the word counter, increment it, and put it back into the dictionary
  3. if the word is not in the dictionary, then insert the counter with value one and the given word as the key into the dictionary.
To develop the program, we first need to implement the Dictionary class:
class Dictionary {
   public:
      Dictionary();
      ~Dictionary();
      insert(string key, int value);
      int& operator[] (string key);
   private:
      ...
};

The inner structure of the class may vary depending on the particular needs, tools, or designer preferences. We will mention only few possibilities:

The complete implementations of the first two options are provided in the ordered_dictionary and unordered_dictionary files. The third options will be discussed later.

Hash function

In this section we will try to improve the previously created dictionary using hash. Algorithms that use hashing consists of two parts. The first step is to compute a hash function that transforms the key into a table address.
A hash function is a function that takes a key and returns a corresponding numeric value from a particular range.
Ideally, different keys would map into different addresses, but often two or more different keys may hash into the same table address. Thus, the second part of hashing is a collision-resolution precess that deals with such keys.

A simple example of a hash function would be like this. Assume we need to put numbers between 1000 and 1099 into an array of 100 elements, then we can use the hash function that takes a number to place and returns this number mod 100. This would be an ideal hash function in this case because it's easy to compute and it provides for each number a separate cell in the result array.

Unfortunately, this is not always a case. For example, if we are talking about an array of all possible strings, then we do not know array of what size we need to keep to store all of them. We may maintain an array of 26 elements and then our hash function will just return code based on the first letter of the word (0 for 'a', 1 for 'b', etc). In this case, we will definitely end up putting more than one word in the same cell. In other words, we need to deal with a collision. To deal with it, we can, for example, create not an array of words, but an array of word-lists and add each word to the corresponding list.

This is what all of the real dictionaries do, but this is not the best way to split words into groups because of a various reasons.

Another way to divide words into groups is to use a hash function that returns the length of each word as its value. Being used against the same set of words it results in the following picture

This function doesn't provide the best partitioning either.

Both of these functions are not very good also because they map a word into a number between 0 and M, where M is small (either a number of letters in the alphabet or the length of the longest word). What should we do if we would like to map a word into an interval from 0 to 1024?

Thus, our good hash function should be able to map a word into a number from interval 0 - M for pretty much any M, and at the same time provide equal distribution of the words. That is, we do not want to map half of the words into index 0 and the other half into 1 left all other M-2 elements unused.

Some hash functions for strings

Since a hash function is nothing but a map that takes a word and produces an integer number from a specified range, we can consider several ways of creating such functions. All of these ways, though, will treat word letters as 1-byte integers (we will consider ASCII code of a character instead of the character itself). For example, we can take first four letters of a word and treat them as an integer. This can be done using union construct:
union HashKey {
   unsigned int  int_value;
   char letter[4];
};

unsigned int hash(string  s)
{
   HashKey key;
   for(int i=0;i<4;i++)
      key.letter[i] = i < s.length() ? s[i] : 0;
   return key.int_value % M;
}
To map the int_value into a number from [0, M-1] range, we now need to compute the reminder of the division of the int_value by M. For example, this function maps the word test to 9 and word algorithm to 95. The disadvantage of the function is that it considers only the first 4 letters of a word; that is, all the words starting from the same group of letters will end up with the same hash value. For example, the word testing is also mapped to 9.

In our next hash function will take into consideration all letters of a word and also will mix their ASCII representations a little bit. To explain the idea of mixing up, we can look at the previous hash function and realize that it actually was received by doing this:

int key = 0;
for(i=0;i<4;i++)
   key = ((key<<8) + s[i]) % M;
return key % M;
To mix the letter codes we can shift them not by 8, but by 7 positions (which is equivalent to multiplying by 128) or even multiply by 127, which mix them even more. Thus our next function will look
unsigned int hash(string  s)
{
   unsigned int key = 0;
   for(int i=0;i<s.length();i++)
      key = (key * 127 + s[i]) % M;
   return key % M;
}

To mix the codes even better we can change the factor we multiply by (127) every iteration of the algorithm. The following code is called "Universal hash function". For details see Algorithms in C++ by Robert Sedgewick.

unsigned int hash(string  s)
{
   unsigned int key = 0,
                a = 31415,
                b = 27183;
   for(int i=0;i<s.length();i++){
      key = (key * a + s[i]) % M;
      a = (a * b) % (M-1);
   }
   return key % M;
}

References: