Hash Functions

Figure 11.4: The multiplication method of hashing. The w-bit representation of the key k is multiplied by the w-bit value s = A · 2^w. The p highest-order bits of the lower w-bit half of the product form the desired hash value h(k).

Although this method works with any value of the constant A, it works better with some values than with others. The optimal choice depends on the characteristics of the data being hashed. Knuth [185] suggests that

A ≈ (√5 − 1)/2 = 0.6180339887...

is likely to work reasonably well.

As an example, suppose we have k = 123456, p = 14, m = 2^14 = 16384, and w = 32. Adapting Knuth's suggestion, we choose A to be the fraction of the form s/2^32 that is closest to (√5 − 1)/2, so that A = 2654435769/2^32. Then k · s = 327706022297664 = (76300 · 2^32) + 17612864, and so r1 = 76300 and r0 = 17612864. The 14 most significant bits of r0 yield the value h(k) = 67.
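The computation in this example can be sketched in a few lines of Python (the function name is ours; the default s = 2654435769 is the value A · 2^32 from the example):

```python
def multiply_hash(k, s=2654435769, w=32, p=14):
    # Keep the lower w-bit half of k * s, then take its p
    # highest-order bits: ((k * s) mod 2^w) >> (w - p).
    r0 = (k * s) % (1 << w)
    return r0 >> (w - p)

print(multiply_hash(123456))  # prints 67, as in the example
```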

Universal hashing

If a malicious adversary chooses the keys to be hashed by some fixed hash function, then he can choose n keys that all hash to the same slot, yielding an average retrieval time of Θ(n). Any fixed hash function is vulnerable to such terrible worst-case behavior; the only effective way to improve the situation is to choose the hash function randomly in a way that is independent of the keys that are actually going to be stored. This approach, called universal hashing, can yield provably good performance on average, no matter what keys are chosen by the adversary.

The main idea behind universal hashing is to select the hash function at random from a carefully designed class of functions at the beginning of execution. As in the case of quicksort, randomization guarantees that no single input will always evoke worst-case behavior. Because of the randomization, the algorithm can behave differently on each execution, even for the same input, guaranteeing good average-case performance for any input. Returning to the example of a compiler's symbol table, we find that the programmer's choice of identifiers cannot now cause consistently poor hashing performance. Poor performance occurs only when the compiler chooses a random hash function that causes the set of identifiers to hash poorly, but the probability of this situation occurring is small and is the same for any set of identifiers of the same size.

Let ℋ be a finite collection of hash functions that map a given universe U of keys into the range {0, 1, ..., m − 1}. Such a collection is said to be universal if for each pair of distinct keys k, l ∈ U, the number of hash functions h ∈ ℋ for which h(k) = h(l) is at most |ℋ|/m. In other words, with a hash function randomly chosen from ℋ, the chance of a collision between distinct keys k and l is no more than the chance 1/m of a collision if h(k) and h(l) were randomly and independently chosen from the set {0, 1, ..., m − 1}.
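For a finite family over a small universe, the defining condition can be checked by brute force. The sketch below (function names are ours) tests it against the family of linear functions ((ak + b) mod p) mod m that this section constructs later:

```python
def is_universal(family, universe, m):
    # For every pair of distinct keys k, l, count the functions
    # with h(k) == h(l) and compare against the bound |family|/m.
    bound = len(family) / m
    return all(
        sum(h(k) == h(l) for h in family) <= bound
        for i, k in enumerate(universe)
        for l in universe[i + 1:]
    )

p, m = 17, 6
family = [
    (lambda k, a=a, b=b: ((a * k + b) % p) % m)
    for a in range(1, p)   # a ranges over {1, ..., p-1}
    for b in range(p)      # b ranges over {0, ..., p-1}
]
print(is_universal(family, list(range(p)), m))  # prints True
```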

The following theorem shows that a universal class of hash functions gives good average-case behavior. Recall that n_i denotes the length of list T[i].

Theorem: Suppose that a hash function h is chosen from a universal collection of hash functions and is used to hash n keys into a table T of size m, using chaining to resolve collisions. If key k is not in the table, then the expected length E[n_{h(k)}] of the list that key k hashes to is at most α. If key k is in the table, then the expected length E[n_{h(k)}] of the list containing key k is at most 1 + α.

Designing a universal class of hash functions

It is quite easy to design a universal class of hash functions, as a little number theory will help us prove. 

We begin by choosing a prime number p large enough so that every possible key k is in the range 0 to p − 1, inclusive. Let Z_p denote the set {0, 1, ..., p − 1}, and let Z*_p denote the set {1, 2, ..., p − 1}. Because we assume that the size of the universe of keys is greater than the number of slots in the hash table, we have p > m.

We now define the hash function h_{a,b} for any a ∈ Z*_p and any b ∈ Z_p using a linear transformation followed by reductions modulo p and then modulo m:

h_{a,b}(k) = ((ak + b) mod p) mod m.

For example, with p = 17 and m = 6, we have h_{3,4}(8) = 5. The family of all such hash functions is

ℋ_{p,m} = {h_{a,b} : a ∈ Z*_p and b ∈ Z_p}.
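The numerical example is a one-line check:

```python
p, m = 17, 6
a, b, k = 3, 4, 8
print(((a * k + b) % p) % m)  # prints 5, i.e. h_{3,4}(8) = 5
```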

Each hash function h_{a,b} maps Z_p to Z_m. This class of hash functions has the nice property that the size m of the output range is arbitrary, not necessarily prime, a feature which we shall use in Section 11.5. Since there are p − 1 choices for a and p choices for b, the collection ℋ_{p,m} contains p(p − 1) hash functions.
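Selecting a hash function at random from ℋ_{p,m} then amounts to picking a and b uniformly; a minimal sketch (make_hash is our name, not from the text):

```python
import random

def make_hash(p, m):
    # Pick a uniformly from Z*_p = {1, ..., p-1} and
    # b uniformly from Z_p = {0, ..., p-1}; return h_{a,b}.
    a = random.randrange(1, p)
    b = random.randrange(p)
    return lambda k: ((a * k + b) % p) % m

h = make_hash(17, 6)
print(0 <= h(8) < 6)  # prints True: every value lands in {0, ..., m-1}
```

A program would call make_hash once at the start of execution and use the returned function for every table operation, so that the adversary's choice of keys cannot depend on which h_{a,b} was drawn.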