• Direct Addressing is a simple variant for implementing dictionaries. Here, each position in the array is associated with a key.

    • This is relatively easy to implement.
    • If the set of keys is large, this is impractical to implement considering memory constraints.
    • If the set of keys actually used is really small, this will result in significantly wasted space.
  • A hash table implements a look-up table or dictionary such that searching is instead of the typical .

    • Hash tables are effective when the number of keys stored is small relative to the total number of possible keys.
    • Hash tables require the use of a hash function which maps a key to a slot in the hash table. We assume that the hash table size is significantly smaller than the number of possible values.
    • Hash functions must be robust against collisions where two keys map to the same hashed values.
      • The load factor of a hash table with slots storing elements is defined as
        It is the average number of keys that map to the same slot.

Collision Resolution

  • Chaining means placing all elements that hash to the same slot into the same linked list.

    • Insertion takes time
    • Deletion (assuming doubly linked lists) take time
    • The search time depends on the load factor.
      • In the worst case, it is if all keys map to the same slot (i.e., same search time as a linked list)

      • (CLRS 11.1, CLRS 11.2) In a hash table where collisions are resolved by chaining, a search, whether successful or not, takes average case time under the assumption of simple uniform hashing.

        The comes from the fact that we still need to compute the hash.

  • In an open addressing scheme, all elements occupy the hash table itself. In such a scheme, .

    • Open addressing necessitates probing to find a slot where we put the key. This requires us to extend the hashing function to include the probing number as input. We call this new hash function the auxiliary hash function

      For every , the probe sequence is a permutation of the identity sequence. This way every slot is eventually considered as a candidate.

    • Inserting with a probing scheme means considering each slot in the probe sequence for the key until an empty slot is found.

    • Probing methods tend to not give uniform hashing out of the box

    • (CLRS 11.6, CLRS 11.8) Given an open address hash table with uniform hashing, the expected number of probes is given by

      Where we assume that each key is equally likely to be searched for.

      • (CLRS 117) Inserting an element into an open address hash table requires at most probes on average, assuming uniform hashing

Probing

  • Linear Probing means that our auxiliary hash function is given by

    That is, we simply probe all slots in order.

    • We use probe sequences at most.
    • This is simple to implement.
    • However, it suffers from primary clustering where we have long runs of unoccupied slots and long runs tend to become longer.
  • Quadratic Probing means using an auxiliary hash function given by

    Where are constant.

    • We use probe sequences at most.
    • A slight improvement to linear probing that also reduces primary clustering.
    • However, it suffers from secondary clustering wherein long runs are formed indirectly as a result of two keys having the same initial probe positions.
  • Double Hashing means using an auxiliary hash function defined by two other hash functions

    • must be relatively prime to the hash table size to search the entire hash table.
    • We use probe sequences with a prime or power of m.

Hash Functions

  • Simple Uniform Hashing assumes that all keys are equally likely to hash to any of the slots independent of any other element.

    • Ideally a hash function should match simple uniform hashing
  • In practice, we can often employ heuristic techniques to create a hash function that performs well.

  • Hash function should be entropic. There should be no apparent patterns

  • The division method performs hashing as follows

    • Certain values of should be avoided (I.e., numbers that are powers of numbers )
    • Often a good choice is a prime not close to an exact power of .
  • The multiplication method operates with constant . and performs the following

    Where denotes the fractional part of .

    • The value of is not critical (often we can choose it to be a power of .
    • The value of is critical. Onne suggestion from Knuth is
  • Universal Hashing means choosing a hash function randomly independent of the keys actually stored.

    • At the beginning of execution, we choose a hash function from a set of hash functions mapping universe .

      is universal if each pair of distinct keys , the number of hash functions for which is at most .

      • (CLRS 11.3) Suppose a hash function is chosen randomly from a universal collection of hash functions and has been used to hash keys into a table of size , using chaining to resolve collisions. If key is not in the table, then the expected length of the list that key hashes to has the following bounds
      • (CLRS 11.4) Using universal hashing and collision resolution by chaining in an initially empty table with slots, it takes expected time to handle any sequence of insert, search, and delete operations containing insert operations
    • Universal hashing allows us to avoid the downside of other fixed hashing functions — wherein an adversary can choose keys that map to the same slot.

    • A universal class of hash functions can be designed as follows.

      Let be a prime large enough so that . Denote and .

      The hash function for and using the following linear transformation reduced modulo and .

      The family of all such functions is denoted .

      (CLRS 11.5) says that is universal

  • We say that a hashing technique is perfect if memory accesses are required to perform a search in the worst case.

    • This can be achieved using two layers of hash table, each using universal hashing (we denote the universal class as ) . The second table should use a hash function that guarantees no collisions.

      • Ideally, the second table should also have a size that is the square of the first table
      • In practice, if the first table is already too big, we can settle for two level hashing where we hash entries in each slot.
    • (CLRS 11.9) If we store keys in a hash table of size using a hash function , then the probability of collisions is less than . 1

    • (CLRS 11.10) Suppose we store keys in a hash table of size using hash function . We have that

      Where is the number of keys hashing to slot .

      • (CLRS 11.11) If we store keys in a hash table of size using and set the size of each secondary hash table to , then the expected amount of storage required for all secondary hash tables in a perfect hashing scheme is less than .
      • (CLRS 11.11) Following (CLRS 11.11), the probability that the total storage used for secondary hash tables exceeds is less than . We can simply test a few randomly chosen hash functions to find one that uses a reasonable amount of storage.

Links

Footnotes

  1. This follows from Markov’s Inequality.