-
Direct Addressing is a simple variant for implementing dictionaries. Here, each position in the array is associated with a key.
- This is relatively easy to implement.
- If the set of keys is large, this is impractical to implement considering memory constraints.
- If the set of keys actually used is really small, this will result in significantly wasted space.
-
A hash table implements a look-up table or dictionary such that searching is
instead of the typical . - Hash tables are effective when the number of keys stored is small relative to the total number of possible keys.
- Hash tables require the use of a hash function
which maps a key to a slot in the hash table. We assume that the hash table size is significantly smaller than the number of possible values. - Hash functions must be robust against collisions where two keys map to the same hashed values.
- The load factor
of a hash table with slots storing elements is defined as It is the average number of keys that map to the same slot.
- The load factor
Collision Resolution
-
Chaining means placing all elements that hash to the same slot into the same linked list.
- Insertion takes
time - Deletion (assuming doubly linked lists) take
time - The search time depends on the load factor.
-
In the worst case, it is
if all keys map to the same slot (i.e., same search time as a linked list) -
(CLRS 11.1, CLRS 11.2) In a hash table where collisions are resolved by chaining, a search, whether successful or not, takes average case time
under the assumption of simple uniform hashing. The
comes from the fact that we still need to compute the hash.
-
- Insertion takes
-
In an open addressing scheme, all elements occupy the hash table itself. In such a scheme,
. -
Open addressing necessitates probing to find a slot where we put the key. This requires us to extend the hashing function to include the probing number as input. We call this new hash function the auxiliary hash function
For every
, the probe sequence is a permutation of the identity sequence. This way every slot is eventually considered as a candidate. -
Inserting with a probing scheme means considering each slot in the probe sequence for the key until an empty slot is found.
-
Probing methods tend to not give uniform hashing out of the box
-
(CLRS 11.6, CLRS 11.8) Given an open address hash table with uniform hashing, the expected number of probes
is given by Where we assume that each key is equally likely to be searched for.
- (CLRS 117) Inserting an element into an open address hash table requires at most
probes on average, assuming uniform hashing
- (CLRS 117) Inserting an element into an open address hash table requires at most
-
Probing
-
Linear Probing means that our auxiliary hash function is given by
That is, we simply probe all slots in order.
- We use
probe sequences at most. - This is simple to implement.
- However, it suffers from primary clustering where we have long runs of unoccupied slots and long runs tend to become longer.
- We use
-
Quadratic Probing means using an auxiliary hash function given by
Where
are constant. - We use
probe sequences at most. - A slight improvement to linear probing that also reduces primary clustering.
- However, it suffers from secondary clustering wherein long runs are formed indirectly as a result of two keys having the same initial probe positions.
- We use
-
Double Hashing means using an auxiliary hash function defined by two other hash functions
must be relatively prime to the hash table size to search the entire hash table. - We use
probe sequences with a prime or power of m.
Hash Functions
-
Simple Uniform Hashing assumes that all keys are equally likely to hash to any of the
slots independent of any other element. - Ideally a hash function should match simple uniform hashing
-
In practice, we can often employ heuristic techniques to create a hash function that performs well.
-
Hash function should be entropic. There should be no apparent patterns
-
The division method performs hashing as follows
- Certain values of
should be avoided (I.e., numbers that are powers of numbers ) - Often a good choice is a prime not close to an exact power of
.
- Certain values of
-
The multiplication method operates with constant
. and performs the following Where
denotes the fractional part of . - The value of
is not critical (often we can choose it to be a power of . - The value of
is critical. Onne suggestion from Knuth is
- The value of
-
Universal Hashing means choosing a hash function randomly independent of the keys actually stored.
-
At the beginning of execution, we choose a hash function from a set
of hash functions mapping universe . is universal if each pair of distinct keys , the number of hash functions for which is at most . - (CLRS 11.3) Suppose a hash function
is chosen randomly from a universal collection of hash functions and has been used to hash keys into a table of size , using chaining to resolve collisions. If key is not in the table, then the expected length of the list that key hashes to has the following bounds
- (CLRS 11.4) Using universal hashing and collision resolution by chaining in an initially empty table with
slots, it takes expected time to handle any sequence of insert, search, and delete operations containing insert operations
- (CLRS 11.3) Suppose a hash function
-
Universal hashing allows us to avoid the downside of other fixed hashing functions — wherein an adversary can choose keys that map to the same slot.
-
A universal class of hash functions can be designed as follows.
Let
be a prime large enough so that . Denote and . The hash function
for and using the following linear transformation reduced modulo and . The family of all such functions is denoted
. (CLRS 11.5) says that
is universal
-
-
We say that a hashing technique is perfect if
memory accesses are required to perform a search in the worst case. -
This can be achieved using two layers of hash table, each using universal hashing (we denote the universal class as
) . The second table should use a hash function that guarantees no collisions. - Ideally, the second table should also have a size that is the square of the first table
- In practice, if the first table is already too big, we can settle for two level hashing where we hash entries in each slot.
-
(CLRS 11.9) If we store
keys in a hash table of size using a hash function , then the probability of collisions is less than . 1 -
(CLRS 11.10) Suppose we store
keys in a hash table of size using hash function . We have that Where
is the number of keys hashing to slot . - (CLRS 11.11) If we store
keys in a hash table of size using and set the size of each secondary hash table to , then the expected amount of storage required for all secondary hash tables in a perfect hashing scheme is less than . - (CLRS 11.11) Following (CLRS 11.11), the probability that the total storage used for secondary hash tables exceeds
is less than . We can simply test a few randomly chosen hash functions to find one that uses a reasonable amount of storage.
- (CLRS 11.11) If we store
-
Links
- CLRS - Ch. 11
Footnotes
-
This follows from Markov’s Inequality. ↩