In these tests, we compare different algorithms for accessing keyed data; the key is an std::string, and the data is a size_t. The main purpose of the tests was to compare different hashing algorithms; for comparison, I also used an std::map and a sorted std::vector<std::pair<std::string, size_t> >, with std::lower_bound as the search function.
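By way of illustration, the lookup in the sorted vector might look something like the following (a sketch only; the names Entry, Table and find are mine, and not necessarily the code actually measured):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    typedef std::pair<std::string, size_t> Entry;
    typedef std::vector<Entry> Table;

    // Compares an entry against a bare key, so that std::lower_bound
    // can search on the key alone.
    struct KeyLess
    {
        bool operator()(Entry const& entry, std::string const& key) const
        {
            return entry.first < key;
        }
    };

    // Hypothetical sketch: looks up key in a table sorted on the key;
    // returns a pointer to the associated data, or null if absent.
    size_t const* find(Table const& table, std::string const& key)
    {
        Table::const_iterator it
            = std::lower_bound(table.begin(), table.end(), key, KeyLess());
        return it != table.end() && it->first == key ? &it->second : 0;
    }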
In every case, the table was filled before running any tests; the tests consisted of a series of loops, each accessing every element in the table once. The test is very limited, in that it includes neither insertions nor deletions, nor any accesses of nonexistent elements.
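Very roughly, the measurement looks something like this (a hedged sketch with invented names; it assumes a table type with operator[], such as std::map, and that every key is present -- the real harness also has to handle the other table types):

    #include <cstddef>
    #include <ctime>
    #include <string>
    #include <vector>

    // Rough sketch of the measurement loop: each pass accesses every
    // key once, and the total time is divided by the number of accesses.
    template <typename Table>
    double timePerAccess(Table& table,
                         std::vector<std::string> const& keys,
                         int repeat)
    {
        std::clock_t start = std::clock();
        for (int pass = 0; pass != repeat; ++pass) {
            for (std::size_t i = 0; i != keys.size(); ++i) {
                volatile std::size_t dummy = table[keys[i]];  // one access
                (void)dummy;
            }
        }
        std::clock_t end = std::clock();
        return (end - start) * 1.0e6 / CLOCKS_PER_SEC
            / (double(repeat) * keys.size());   // microseconds per access
    }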
The only thing measured is time. Logically, some indication of the distribution would be interesting as well, and the hash_map interface has the necessary hooks to obtain the information, but I couldn't be bothered. In the end, the whole point of the algorithm is to speed up accesses. A perfect distribution is useless if the function takes five minutes. On the other hand, of course, "return 0;" is faster than any of the algorithms used here, but with a distribution which is totally disastrous. The quality of a hashing algorithm cannot be measured by raw speed alone, nor by distribution alone, but by the correct balance between the two, as reflected by the average access times. Thus, that (and only that) is what I measure.
For the most part, the hashing algorithms tested are variants of what I call linear congruent hashing (because of its similarity to linear congruential random number generators). Basically, for each character, you multiply the preceding hash value by some constant, and then fold in the character by either addition or xor, all operations being done modulo some number (typically 2^n, where n is the number of bits in a word, since this reduction is free on most hardware).
Obviously, the choice of the multiplier is critical: if the multiplier is 256, then only the last 4 characters have any effect on a 32 bit hash code. In general, the multiplier must have no common factors with either the 2^n modulo used during calculation or the final modulo due to the size of the hash table, applied when the hash is actually used. (In http://www.isthe.com/chongo/tech/comp/fnv/, Noll states that "The theory behind which primes make good FNV_prime's is beyond the scope of this web page." I'll admit that I haven't the slightest knowledge of the theory, other than what I just stated about not having common factors with any of the moduli.) The most generally recognized version is FNV. In the past (before I knew of FNV), I developed the idea of using Mersenne primes -- the advantage of a Mersenne prime is that if the machine has slow multiplication (often the case back then), multiplication by a Mersenne prime can be implemented by means of a single shift and a subtraction.
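For illustration, the 32 bit FNV hash over an std::string looks roughly like this (the offset basis and prime are the published FNV constants; whether the tests used FNV-1, shown here, or the FNV-1a variant, which xors before multiplying, I leave open):

    #include <string>

    // 32 bit FNV-1: multiply by the FNV prime, then xor in each character.
    // Assumes size_t arithmetic wraps modulo 2^32 (i.e. a 32 bit size_t).
    size_t fnvHash(std::string const& key)
    {
        size_t hash = 2166136261u;                    // FNV offset basis
        std::string::const_iterator it = key.begin();
        while (it != key.end()) {
            hash = (hash * 16777619u) ^ static_cast<unsigned char>(*it);
            ++it;
        }
        return hash;
    }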
Here, I've run tests using the 32 bit FNV, and my Mersenne prime algorithm with p = 5, 7 and 11. (Strictly speaking, p = 11 doesn't result in a Mersenne prime, because the result, 2047, is not prime, but the product 23 * 89. In the past, however, I've not found this to be a problem. However...) The case of p = 5 corresponds to the hashing algorithm in java.lang.String, although the absence of unsigned integral types in Java results in a slight difference from what I've implemented here.
I also tried two "optimizations" of these algorithms. The normal form uses iterators to iterate over the string; in a variant of the FNV, I use char const*, obtained via data() and data() + size(), rather than the iterators (which are classes in the g++ implementation). And in one of the Mersenne prime algorithms (p = 7), I use a second implementation which explicitly uses the shift and the subtraction.
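A sketch of the Mersenne prime variant with p = 7 (multiplier 2^7 - 1 = 127) follows, once in the straightforward form, and once combining both "optimizations" for illustration: iterating over char const* from data()/data() + size(), and replacing the multiplication by an explicit shift and subtraction. (In the tests themselves, the pointer variant was applied to FNV and the shift/subtract variant to Mersenne prime 7; the function names here are mine.)

    #include <string>

    // Mersenne prime hash, p = 7: hash = 127 * hash + character.
    size_t mersenneHash(std::string const& key)
    {
        size_t hash = 0;
        for (std::string::const_iterator it = key.begin(); it != key.end(); ++it) {
            hash = 127 * hash + static_cast<unsigned char>(*it);
        }
        return hash;
    }

    // The same hash, iterating over char const* and using the identity
    // 127 * h == (h << 7) - h instead of the multiplication.
    size_t mersenneHashShift(std::string const& key)
    {
        size_t hash = 0;
        char const* p = key.data();
        char const* end = p + key.size();
        while (p != end) {
            hash = (hash << 7) - hash + static_cast<unsigned char>(*p++);
        }
        return hash;
    }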
Finally, I also tried two completely different algorithms: Pearson, which was described in "Fast Hashing of Variable-Length Text Strings", by Peter K. Pearson, CACM, June, 1990, and Jenkins, which is described at http://burtleburtle.net/bob/hash/doobs.html, and recommended in one or two web articles on hash tables. Both are significantly more complicated (more lines of code) than the linear congruent algorithms. I also tried an optimized version of Jenkins, using word accesses instead of accessing bytes and shifting them. This is very implementation dependent -- as written, size_t must be a 32 bit type, but it could easily be adapted to most implementations. Note that these hash codes use NO multiplication; on machines with slow multiplication, that could make a difference (although the Mersenne prime algorithms can be written to use no multiplication as well).
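For reference, the heart of the Pearson hash is a single table lookup per character; a minimal sketch of the 8 bit version follows (the table must contain some permutation of 0...255 -- here it is simply shuffled at startup, which is an assumption of mine, not the table used in the tests; the paper describes how to combine several passes to obtain a wider result). The Jenkins hash is considerably longer, and I won't reproduce it here; see the URL above.

    #include <algorithm>
    #include <string>
    #include <vector>

    // Builds some permutation of 0..255. Any permutation gives a valid
    // Pearson hash, although the choice affects the distribution.
    std::vector<unsigned char> makePearsonTable()
    {
        std::vector<unsigned char> table(256);
        for (int i = 0; i != 256; ++i) {
            table[i] = static_cast<unsigned char>(i);
        }
        std::random_shuffle(table.begin(), table.end());
        return table;
    }

    // 8 bit Pearson hash: one table lookup per character.
    unsigned char pearsonHash(std::string const& key,
                              std::vector<unsigned char> const& table)
    {
        unsigned char hash = 0;
        for (std::string::const_iterator it = key.begin(); it != key.end(); ++it) {
            hash = table[hash ^ static_cast<unsigned char>(*it)];
        }
        return hash;
    }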
I have not bothered to try the known bad algorithms, like additive hashing. Another algorithm that is missing is CRC; in this case, only because I don't have an implementation handy. There's an implementation in Boost, however, so all that is missing is the time to install it. It is definitely worth considering -- the distribution should be close to perfect; the only question is whether modern techniques have reduced the execution time enough.
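With Boost installed, a CRC based hash would only be a few lines; something like the following, using boost::crc_32_type (untested here, and not part of the measurements below):

    #include <boost/crc.hpp>
    #include <string>

    // CRC-32 of the string contents, used as the hash code.
    size_t crcHash(std::string const& key)
    {
        boost::crc_32_type crc;
        crc.process_bytes(key.data(), key.size());
        return crc.checksum();
    }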
I ran the tests over the following data sets:
- 75URLs -
  This is a small set of URLs; most begin with "http://www.", and all begin with "http://".
  elements: 75, longest: 147, shortest: 16, average: 40.1, repeat: 100000
- 1275URLs -
  This is a larger set of URLs, drawn from various files in my browser's cache. It is much more homogeneous than the above; it includes a number of relative URLs (thus, without the "http:..." part), and a number of URLs which differ only in a few characters near the end (something which I think is highly unfavorable to std::map and std::lower_bound).
  elements: 1275, longest: 159, shortest: 1, average: 40.7, repeat: 6000
- ProgSyms -
  The set of all symbols and numeric literals in my source code here. Compared to the previous data sets, the set is a lot larger, and the individual elements are a lot shorter (although some outliers are fairly long).
  elements: 15047, longest: 52, shortest: 1, average: 7.2, repeat: 500
- FrWords -
  This data set was generated from the ispell dictionaries for French.
  elements: 94883, longest: 25, shortest: 1, average: 9.6, repeat: 80
- Constructed -
  This data set is simply all of the words from x000000 to x999999 (see the sketch below).
  elements: 1000000, longest: 7, shortest: 7, average: 7.0, repeat: 8
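The Constructed set can be generated with something along these lines (a sketch, not the script actually used):

    #include <cstdio>
    #include <string>
    #include <vector>

    // Generates the strings "x000000" through "x999999".
    std::vector<std::string> makeConstructed()
    {
        std::vector<std::string> result;
        result.reserve(1000000);
        char buffer[8];
        for (int i = 0; i != 1000000; ++i) {
            std::sprintf(buffer, "x%06d", i);
            result.push_back(buffer);
        }
        return result;
    }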
I get the following results on my Sun Sparc Ultra (times in microseconds per access):
                                75URLs    1275URLs    ProgSyms     FrWords  Constructed
  sorted vector                   1.72        3.26        4.38        6.82        10.79
  std::map                        1.66        3.27        4.22        7.59        11.84
  FNVHash                         2.34        2.44        1.58        2.18         2.15
    (using data/size)             2.33        2.44        1.58        2.38         2.39
  Java hash                       1.71        1.81        1.39        2.22         2.23
  Mersenne prime 7                1.72        1.81        1.41        2.23         2.25
    (using shift/sub)             1.72        1.82        1.42        2.24         2.27
  Mersenne prime 11               1.71        1.83        1.55        2.24        12.12
  Pearson                         3.54        3.68        2.00        2.86         2.85
  Jenkins                         3.00        3.12        1.97        2.77         2.79
  Jenkins (word access)           2.83        2.95        1.91        2.68         2.73
The first thing to notice is that the "optimizations" aren't necessarily optimizations at all. Using pointers instead of iterators in the FNV is actually slightly slower, and using shift and subtraction instead of multiplication in the Mersenne prime algorithm makes no difference whatsoever. (A quick look at the generated code showed that the compiler actually generated a shift and subtract in both cases; evidently the compiler does this optimization automatically anyway. At least on older Sparcs, multiplication is definitely slow.) The change to word accesses does speed up the Jenkins algorithm, but not to the point where it can compete with the linear congruent algorithms.
A second important point is that std::map is quite sufficient for smaller data sets. How small, of course, depends on your application, but you need a good hashing algorithm to beat it for anything up to around 1000 entries. (This is interesting to me, because I often have maps with fewer than 100 entries.)
The more complicated algorithms actually give lower performance. I don't know whether this is due to their not having as good a distribution, or their taking more time for the same distribution. But as I said at the start, I really don't care. The end results are slower, and that is all that matters.
There are two points about which I am still curious.
First, why is there an inversion in performance between FNV and the Mersenne prime algorithms for the last two tests, compared to the first three? Why would the Mersenne prime algorithms result in a poorer distribution when used on French words or the artificially constructed strings than when used on URLs or program symbols? (When I first ran these tests, I didn't have the constructed data, and I wondered if there wasn't an anti-French bias somewhere :-). Or, more likely, whether there wasn't a problem in my handling of negative character values -- accented characters in the local code sets, which occur only in the list of French words.)
Second, why is the performance of Mersenne prime 11 so extremely bad for the artificially constructed strings? Is this due to the fact that the multiplier isn't really a prime? But if so, why is it a problem only with this one data set? Or is it a general characteristic of the Mersenne prime algorithms, or of linear congruent hash codes in general, to have degenerate cases, and was it only by pure chance that I hit the degenerate case only for Mersenne prime 11 in my data sets? I wish I knew. (More generally, why do all of the hash codes do best with the program symbols? Something to do with the implementation of g++'s hash table, perhaps, such that by chance ProgSyms happens to have an optimal number of elements? Or something to do with the distribution of the data?)