|This is a stub or unfinished. Contribute by editing me.|
A hash table is a dictionary-like data structure that provides lookup time. You associate keys with values (e.g. a dictionary associates words with definitions). Often a hash table is implemented using a hash function and an ordinary array. Collisions (when multiple keys map to the same array index) must be dealt with somehow or the hash table is severely crippled.
Typically a standard hash table library (see UsingHashMap for details) will be adequate.
There are two completely separate algorithms involved in a hash table -- the "hash function" and the "collision resolution algorithm". Any hash function can be mix-and-matched with any collision resolution algorithm.
Most hash functions attempt to minimize the number of collisions, by attempting to spread all the expected keys uniformly over the array. In a few rare cases where all possible values are known ahead of time, it is possible to design a "perfect hash function" that maps each value to a different key, which eliminates all collisions.
An incredibly simple hash function could be the length of the key. This is a terrible hash function for (hopefully) obvious reasons.
If the key is given as a C-like string of length L, we can define the key value as such:
key + key + ... + key[L-2] + key[L-1]
This is a common but poor hash function. For example, "foo", "oof", and "fnp" all hash to the same index.
Good hash functions
- Julienne Walker has a list of good hash functions.
non-uniform hash functions
Has anyone done research on non-uniform hash functions? Non-uniform hash functions that deliberately make several related values hash to the same key?
- In a spelling checker using a Bloom filter, it's OK if a hash function allows thousands of non-words (such as strings that begin with "qqtt") to "collide" to a single key, as long as no correctly-spelled word collides with that same key.
- When searching a table of information using a person's name as the "value", it may be useful to have common mis-spellings of that name all hash to the same key.
- Since certain patterns in Hashlife occur overwhelmingly more often than other patterns, it may be useful to have a hash function that makes each of the most-frequent patterns hash to a key with no collisions, even if it causes less-frequent patterns to hash to a key with an average of 2 or 3 collisions, and some extremely rare patterns to a key with dozens of collisions.
- Most "hash functions" attempt to simulate a "random oracle", to have "well-mixed ... avalanche". However, the hash function used in Python dictionaries has horrible avalanche behavior, but it gives better results on typical hash tables than a perfect random oracle (and therefore better than even cryptographically secure hashes)..
If we follow the wiki: RulesOfOptimization, we will need to do some testing before and after we implement some new "improved" hash table. Perhaps you will find the test data collected by Peter Kankowski to be useful as "fair and unbiased" test data to use to compare your spiffy new hash table with the default hash table built into your language.