In these tests, we compare different algorithms for accessing keyed data; the key is an std::string, and the data is a size_t. The main purpose of the tests was to compare different hashing algorithms; for comparison, I also used an std::map and a sorted std::vector<std::pair<std::string, size_t> >, with std::lower_bound as the search function.
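By way of illustration, the lookup in the sorted vector might look something like the following (a sketch only; the names Entry, Table and find are mine, and not necessarily the code actually measured):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    typedef std::pair<std::string, size_t> Entry;
    typedef std::vector<Entry> Table;

    // Compares an entry against a bare key, so that std::lower_bound
    // can search on the key alone.
    struct KeyLess
    {
        bool operator()(Entry const& entry, std::string const& key) const
        {
            return entry.first < key;
        }
    };

    // Hypothetical sketch: looks up key in a table sorted on the key;
    // returns a pointer to the associated data, or null if absent.
    size_t const* find(Table const& table, std::string const& key)
    {
        Table::const_iterator it
            = std::lower_bound(table.begin(), table.end(), key, KeyLess());
        return it != table.end() && it->first == key ? &it->second : 0;
    }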
In every case, the table was filled before running any tests; the tests consisted of a series of loops, each accessing every element in the table once. The test is very limited, in that it includes neither insertions nor deletions, nor any accesses of nonexistent elements.
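Very roughly, the measurement looks something like this (a hedged sketch with invented names; it assumes a table type with operator[], such as std::map, and that every key is present -- the real harness also has to handle the other table types):

    #include <cstddef>
    #include <ctime>
    #include <string>
    #include <vector>

    // Rough sketch of the measurement loop: each pass accesses every
    // key once, and the total time is divided by the number of accesses.
    template <typename Table>
    double timePerAccess(Table& table,
                         std::vector<std::string> const& keys,
                         int repeat)
    {
        std::clock_t start = std::clock();
        for (int pass = 0; pass != repeat; ++pass) {
            for (std::size_t i = 0; i != keys.size(); ++i) {
                volatile std::size_t dummy = table[keys[i]];  // one access
                (void)dummy;
            }
        }
        std::clock_t end = std::clock();
        return (end - start) * 1.0e6 / CLOCKS_PER_SEC
            / (double(repeat) * keys.size());   // microseconds per access
    }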
The only thing measured is time. Logically, some indication of the distribution would be interesting as well, and the hash_map interface has the necessary hooks to obtain the information, but I couldn't be bothered. In the end, the whole point of the algorithm is to speed up accesses. A perfect distribution is useless if the function takes five minutes. On the other hand, of course, "return 0;" is faster than any of the algorithms used here, but with a distribution which is totally disastrous. The quality of a hashing algorithm cannot be measured by raw speed alone, nor by distribution alone, but by the correct balance between the two, as reflected by the average access times. Thus, that (and only that) is what I measure.
For the most part, the hashing algorithms tested are variants of what I call linear congruent hashing (because of its similarity to linear congruential random number generators). Basically, for each character, you multiply the preceding hash value by some constant, and then fold in the character by either addition or xor, all operations being done modulo some number (typically 2^n, where n is the number of bits in a word, since this reduction is free on most hardware).
Obviously, the choice of the multiplier is critical: if the multiplier is 256, then only the last 4 characters have any effect on a 32 bit hash code. In general, the multiplier must have no common factors with either the 2^n modulo used during calculation or the final modulo due to the size of the hash table, applied when the hash is actually used. (In http://www.isthe.com/chongo/tech/comp/fnv/, Noll states that "The theory behind which primes make good FNV_prime's is beyond the scope of this web page." I'll admit that I haven't the slightest knowledge of the theory, other than what I just stated about not having common factors with any of the moduli.) The most generally recognized version is FNV. In the past (before I knew of FNV), I developed the idea of using Mersenne primes -- the advantage of a Mersenne prime is that if the machine has slow multiplication (often the case back then), multiplication by a Mersenne prime can be implemented by means of a single shift and a subtraction.
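For illustration, the 32 bit FNV hash over an std::string looks roughly like this (the offset basis and prime are the published FNV constants; whether the tests used FNV-1, shown here, or the FNV-1a variant, which xors before multiplying, I leave open):

    #include <string>

    // 32 bit FNV-1: multiply by the FNV prime, then xor in each character.
    // Assumes size_t arithmetic wraps modulo 2^32 (i.e. a 32 bit size_t).
    size_t fnvHash(std::string const& key)
    {
        size_t hash = 2166136261u;                    // FNV offset basis
        std::string::const_iterator it = key.begin();
        while (it != key.end()) {
            hash = (hash * 16777619u) ^ static_cast<unsigned char>(*it);
            ++it;
        }
        return hash;
    }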
Here, I've run tests using the 32 bit FNV, and my Mersenne prime algorithm with p = 5, 7 and 11. (Strictly speaking, p = 11 doesn't result in a Mersenne prime, because the result, 2047, is not prime, but the product 23 * 89. In the past, however, I've not found this to be a problem. However...) The case of p = 5 corresponds to the hashing algorithm in java.lang.String, although the absence of unsigned integral types in Java results in a slight difference from what I've implemented here.
I also tried two "optimizations" of these algorithms. The normal form uses iterators to iterate over the string; in a variant of the FNV, I use char const*, obtained via data() and data() + size(), rather than the iterators (which are classes in the g++ implementation). And in one of the Mersenne prime algorithms (p = 7), I use a second implementation which explicitly uses the shift and the subtraction.
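A sketch of the Mersenne prime variant with p = 7 (multiplier 2^7 - 1 = 127) follows, once in the straightforward form, and once combining both "optimizations" for illustration: iterating over char const* from data()/data() + size(), and replacing the multiplication by an explicit shift and subtraction. (In the tests themselves, the pointer variant was applied to FNV and the shift/subtract variant to Mersenne prime 7; the function names here are mine.)

    #include <string>

    // Mersenne prime hash, p = 7: hash = 127 * hash + character.
    size_t mersenneHash(std::string const& key)
    {
        size_t hash = 0;
        for (std::string::const_iterator it = key.begin(); it != key.end(); ++it) {
            hash = 127 * hash + static_cast<unsigned char>(*it);
        }
        return hash;
    }

    // The same hash, iterating over char const* and using the identity
    // 127 * h == (h << 7) - h instead of the multiplication.
    size_t mersenneHashShift(std::string const& key)
    {
        size_t hash = 0;
        char const* p = key.data();
        char const* end = p + key.size();
        while (p != end) {
            hash = (hash << 7) - hash + static_cast<unsigned char>(*p++);
        }
        return hash;
    }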
Finally, I also tried two completely different algorithms: Pearson, which was described in "Fast Hashing of Variable-Length Text Strings", by Peter K. Pearson, CACM, June, 1990, and Jenkins, which is described at http://burtleburtle.net/bob/hash/doobs.html, and recommended in one or two web articles on hash tables. Both are significantly more complicated (more lines of code) than the linear congruent algorithms. I also tried an optimized version of Jenkins, using word accesses instead of accessing bytes and shifting them. This is very implementation dependent -- as written, size_t must be a 32 bit type, but it could easily be adapted to most implementations. Note that these hash codes use NO multiplication; on machines with slow multiplication, that could make a difference (although the Mersenne prime algorithms can be written to use no multiplication as well).
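For reference, the heart of the Pearson hash is a single table lookup per character; a minimal sketch of the 8 bit version follows (the table must contain some permutation of 0...255 -- here it is simply shuffled at startup, which is an assumption of mine, not the table used in the tests; the paper describes how to combine several passes to obtain a wider result). The Jenkins hash is considerably longer, and I won't reproduce it here; see the URL above.

    #include <algorithm>
    #include <string>
    #include <vector>

    // Builds some permutation of 0..255. Any permutation gives a valid
    // Pearson hash, although the choice affects the distribution.
    std::vector<unsigned char> makePearsonTable()
    {
        std::vector<unsigned char> table(256);
        for (int i = 0; i != 256; ++i) {
            table[i] = static_cast<unsigned char>(i);
        }
        std::random_shuffle(table.begin(), table.end());
        return table;
    }

    // 8 bit Pearson hash: one table lookup per character.
    unsigned char pearsonHash(std::string const& key,
                              std::vector<unsigned char> const& table)
    {
        unsigned char hash = 0;
        for (std::string::const_iterator it = key.begin(); it != key.end(); ++it) {
            hash = table[hash ^ static_cast<unsigned char>(*it)];
        }
        return hash;
    }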
I have not bothered to try the known bad algorithms, like additive hashing. Another algorithm that is missing is CRC; in this case, only because I don't have an implementation handy. There's an implementation in Boost, however, so all that is missing is the time to install it. It is definitely worth considering -- the distribution should be close to perfect; the only question is whether modern techniques have reduced the execution time enough.
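With Boost installed, a CRC based hash would only be a few lines; something like the following, using boost::crc_32_type (untested here, and not part of the measurements below):

    #include <boost/crc.hpp>
    #include <string>

    // CRC-32 of the string contents, used as the hash code.
    size_t crcHash(std::string const& key)
    {
        boost::crc_32_type crc;
        crc.process_bytes(key.data(), key.size());
        return crc.checksum();
    }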
I ran the tests over the following data sets:
- 75URLs -
  This is a small set of URLs; most begin with "http://www.", and all begin with "http://".
  elements: 75, longest: 147, shortest: 16, average: 40.1, repeat: 100000
- 1275URLs -
  This is a larger set of URLs, drawn from various files in my browser's cache. It is much more homogeneous than the above; it includes a number of relative URLs (thus, without the "http:..." part), and a number of URLs which differ only in a few characters near the end (something which I think is highly unfavorable to std::map and std::lower_bound).
  elements: 1275, longest: 159, shortest: 1, average: 40.7, repeat: 6000
- ProgSyms -
  The set of all symbols and numeric literals in my source code here. Compared to the previous data sets, the set is a lot larger, and the individual elements are a lot shorter (although some outliers are fairly long).
  elements: 15047, longest: 52, shortest: 1, average: 7.2, repeat: 500
- FrWords -
  This data set was generated from the ispell dictionaries for French.
  elements: 94883, longest: 25, shortest: 1, average: 9.6, repeat: 80
- Constructed -
  This data set is simply all of the words from x000000 to x999999 (see the sketch below).
  elements: 1000000, longest: 7, shortest: 7, average: 7.0, repeat: 8
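The Constructed set can be generated with something along these lines (a sketch, not the script actually used):

    #include <cstdio>
    #include <string>
    #include <vector>

    // Generates the strings "x000000" through "x999999".
    std::vector<std::string> makeConstructed()
    {
        std::vector<std::string> result;
        result.reserve(1000000);
        char buffer[8];
        for (int i = 0; i != 1000000; ++i) {
            std::sprintf(buffer, "x%06d", i);
            result.push_back(buffer);
        }
        return result;
    }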
I get the following results on my Sun Sparc Ultra (times in microseconds per access):
                                75URLs    1275URLs    ProgSyms     FrWords  Constructed
  sorted vector                   1.72        3.26        4.38        6.82        10.79
  std::map                        1.66        3.27        4.22        7.59        11.84
  FNVHash                         2.34        2.44        1.58        2.18         2.15
    (using data/size)             2.33        2.44        1.58        2.38         2.39
  Java hash                       1.71        1.81        1.39        2.22         2.23
  Mersenne prime 7                1.72        1.81        1.41        2.23         2.25
    (using shift/sub)             1.72        1.82        1.42        2.24         2.27
  Mersenne prime 11               1.71        1.83        1.55        2.24        12.12
  Pearson                         3.54        3.68        2.00        2.86         2.85
  Jenkins                         3.00        3.12        1.97        2.77         2.79
  Jenkins (word access)           2.83        2.95        1.91        2.68         2.73
The first thing to notice is that the "optimizations" aren't necessarily optimizations at all. Using pointers instead of iterators in the FNV is actually slightly slower, and using shift and subtraction instead of multiplication in the Mersenne prime algorithm makes no difference whatsoever. (A quick look at the generated code showed that the compiler actually generated a shift and subtract in both cases; evidently the compiler does this optimization automatically anyway. At least on older Sparcs, multiplication is definitely slow.) The change to word accesses does speed up the Jenkins algorithm, but not to the point where it can compete with the linear congruent algorithms.
A second important point is that std::map is quite sufficient for smaller data sets. How small, of course, depends on your application, but you need a good hashing algorithm to beat it for anything up to around 1000 entries. (This is interesting to me, because I often have maps with fewer than 100 entries.)
The more complicated algorithms actually give lower performance. I don't know whether this is due to their not having as good a distribution, or their taking more time for the same distribution. But as I said at the start, I really don't care. The end results are slower, and that is all that matters.
There are two points about which I am still curious.
First, why is there an inversion in performance between FNV and the Mersenne prime algorithms for the last two tests, compared to the first three? Why would the Mersenne prime algorithms result in a poorer distribution when used on French words or the artificially constructed strings than when used on URLs or program symbols? (When I first ran these tests, I didn't have the constructed data, and I wondered if there wasn't an anti-French bias somewhere :-). Or, more likely, whether there wasn't a problem in my handling of negative character values -- accented characters in the local code sets, which occur only in the list of French words.)
Second, why is the performance of Mersenne prime 11 so extremely bad for the artificially constructed strings? Is this due to the fact that the multiplier isn't really a prime? But if so, why is it a problem only with this one data set? Or is it a general characteristic of the Mersenne prime algorithms, or of linear congruent hash codes in general, to have degenerate cases, and was it only by pure chance that I hit the degenerate case only for Mersenne prime 11 in my data sets? I wish I knew. (More generally, why do all of the hash codes do best with the program symbols? Something to do with the implementation of g++'s hash table, perhaps, such that by chance ProgSyms happens to have an optimal number of elements? Or something to do with the distribution of the data?)