When writing LuaHashMap, the biggest question was always, "How is the performance?". The dogma is that it should be slow since it's going through an interpreter ("scripting language"). But wisdom demands benchmarks before making unsubstantiated claims.
Here's the teaser / spoiler (smaller times are better):
Random Access Existing Key/Value Pairs: Execution Time (strings)
For those of you who want to know more, read on.
Hash Table Shootout revisited
After searching to establish some baselines for a comparison, I found Nick Welch at incise.org had done benchmarks for other C and C++ libraries. Just for Fun, Welch included Python and Ruby. Quoted from the page:
Python's dict really surprised me with its performance. I initially included it on a lark, but it gives the other implementations a run for their money speed-wise, and doesn't use as much memory as I feared it would. It does still use a lot of RAM, though. So while you can feel pretty good about its performance within the context of writing Python code, it certainly doesn't make sense to bring the Python baggage into your C or C++ program just to use the Python dictionary.
Well, this exactly what I'm doing with Lua (but it's not on a lark). We know hash implementations for many interpreted languages need to be fast for their own performance sake (Lua, Python, Perl, etc), so we should expect them to be fast. The question is how much overhead is there in using/communicating with them from C? The author felt the "Python baggage" was too much to bring in. But how would Lua fare with its small profile and design to be embedded and fast?
So this seemed like a good basis to start a new round of benchmarks. But why stop at Lua? So let's do a whole suite of interpreted languages that would otherwise be dismissed. So I'm incorportating Lua, Python, Ruby, Perl, and Tcl. Nick Welch's Git repo was named "Hash Table Shootout" so I'm calling this Hash Table Shootout 2: Rise of the Interpreter Machines.
But there were problems...
There were some problems (in my opinion) with the original benchmark.
• There is a serious bug/oversight in original the STL unordered_map benchmark. They were using const char* as the string type for their hash keys. This is wrong because the implementation will be hashing on the pointer address instead of the contents of the string. If there are two different const char pointers (unique addresses), but both point to strings with the same string contents, they will be treated as different strings instead of the same string. There should be different performance characteristics for hashing on the contents of the string, so the results need to be invalidated.
• The original benchmark missed the most important operation of a hash table to benchmark: simple retrieving. Typically, the most common operation is not inserting or removing, but just retrieving a value, and in fact many hash table implementations intentionally sacrifice the performance of insertion and deletion operations in order to achieve the most optimal lookups. The original benchmark failed to do that which is a serious oversight.
• The original benchmark did not specify clear rules for memory ownership regarding strings. In the original benchmarks, all strings were created using dynamic memory during the insertion phase of the benchmark. But some libraries, such as Ruby and Python will make another copy of the string while other libraries such as the std::unordered_map<const char*, int> implementation was merely keeping a pointer. In my opinion, this was an unfair comparison for both the speed and memory tests and doesn't paint a clear picture of the actual hash table performance.
My attempt to correct these oversights
Using Nick Welch's original code base as a starting point, I decided to continue where he left off and address these oversights. But here are some additional notes you should know about these tests before you see the results.
• All these tests are C programs that use their public C-API for each language to communicate to the interpreter to access their hash table. Don't confuse this for the typical language shootout.
• As a common baseline to the original Hash Table Benchmarks from incise.org, I include the STL unordered_map test as the baseline. However, my compiler is clang/llvm, not gcc, so these may be different implementations and the compilers may be optimizing differently. But the hope is that their performance is close enough to represent a common baseline allowing you to extrapolate performance to Nick Welch's original benchmark (albiet that I consider the unordered_map<const char*, int> invalid).
• The premise of this benchmark (and LuaHashMap) is needing to interoperate from C (not C++). In my section of the universe, portable and interoperable code have much tougher requirements than what most other developers face. Crossing between languages is common for me (hence why I might gravitate towards doing a benchmark like this) and the common denominator that everything can talk to is C. This has implications such as everything needing to convert string representations to/from const char*. Typical examples of this are converting between std::string in C++ and NSString in Objective-C. You can even find this within C itself such as Microsoft's MFC CString, Apple's Core Foundation CFString, and GTK's GString.
• To rectify the broken unordered_map benchmark, I add another benchmark test which uses std::string as the key. I don't want to provide a custom function for const char* because then whatever function I implement becomes the focal point and I am mainly testing out of the box behaviors to give a common baseline of comparison.
• The new memory ownership model now demands all libraries make its own copy of the string and own/manage their own memory. This is what all the interpreters are doing, and now using std::string has the additional benefit of making C++ behave just like this and it doesn't get a free pass any more. (The legacy unordered_map<const char*, int> reference benchmark is exempted from this requirement.)
• Original memory is not deleted after the string copies. Maybe I should have, but didn't want more variables to think about.
• I don't know the C-APIs for all these languages and tried my best to get them through the benchmarks, but in a few cases there were problems. So my benchmark code may need to be scrutinized.
• To my great disappointment, I had to remove the Ruby string benchmarks and the Ruby insertion/deletion benchmarks because they kept crashing for me. I presume I am doing something wrong with the API but cannot figure it out.
• The Python test might get a slight freebie for insert because PyInt_FromLong was not being called for every value. I don't know what kind of overhead this has.
• LuaHashMap is a little more built up than the other tests because LuaHashMap has some parameter safety checking that the other programs may not be doing. This may give it a slight disadvantage in the benchmark tests
• I introduced new benchmarks for getting and modifying (existing) values to address the issue of the original benchmarks not exercising what hash tables are designed explicitly for.
• My fork of the benchmark code can be found at https://github.com/ewmailing/hash-table-shootout. The repository is an absolute mess and all of the builds and runs were orchestrated via manual command line invocation.
• All the benchmarks were run on a 21.5-inch iMac, Mid 2010 (3.2 GHz Intel Core i3) with 16GB 1333 MHz DDR3 RAM. The operating system was OS X 10.8.2 with Xcode 4.5.2 (4G2008a) and the command line tools installed. I ran as a normal user (I forgot to 'sudo nice' as Welch's README recommended). But all applications were shutdown except for Terminal.app. Only stock Mac daemon processes were running.
• All benchmarks were compiled with clang using -O3 and for 64-bit. All libraries except for Lua came preinstalled as standard components with Mac OS X. I compiled Lua with -O3.
• In the initial reference test, I benchmark both Lua 5.1.5 and Lua 5.2.1, but then drop 5.1.5 for simplicitly since their results are so close.
• I reduced down the original 40 million entries to 30 million because I did preliminary work on a 8GB machine and certain libraries (e.g. Python) were running out of memory and hitting swap. I could have probably bumped it, but it took me multiple days to run these tests so I decided I needed my computer more than I needed the remaining entries. (I also reduced the step increment size too.)
• I set the "best out of" value in the bench.py to 3
Insertion/Deletion reference benchmarks
The first suite of benchmarks is to redo the original benchmarks Nick Welch did to try to establish some kind of reference so results can be some what compared to Nick Welch's other benchmarked libraries.
The first up in this suite is the integer benchmarks which test sequential inserts, random inserts, and deletes.
Here are some observations and analysis of the results:
• Lua destroys everybody on the sequential insert benchmark. For those who aren't aware, Lua only has one data structure, the table, which is used as both a hash table and as an array. To accomodate this, Lua has an array component/optimization under the hood to make array usage fast. Most likely, the sequential insert test is triggering the array optimization.
• Lua also wins on the deletion benchmark. This is mostly because Lua doesn't compact its tables and rehash entries when items get removed.
• Perl does the worst among all these tests. This is most likely because Perl's C API converts everything to strings. We'll see in the string results, its absolute performance is the same.
• Tcl is known for being a language where everything is a string. Its performance tends to be on the bottom end in this benchmark, but still much better than Perl. And in absolute terms, it is still very competative with the other libraries. This makes me wonder what Tcl is really doing. (This will be a common theme throughout the rest of this shootout.)
• Ignoring Perl since it is probably converting everything to strings, Python seems to fare the worst in memory consumption, but won the performance benchmark in the random insert benchmark.
• Notice that Lua has a stair step climb in memory consumption. Lua is rounding the number of preallocaed buckets up in steady increments. (I think it is documented as powers of two.)
• The STL unordered_map doesn't come out a clear winner as dogma may have expected, but its performance across the board is steady/reliable/solid.
Note that this test we now have two STL unordered_map results. One is the const char* version for our original base line reference (which is technically invalid), and the other is the std::string based version. Immediately, you can see the difference in speed and memory usage when correcting the problems described above.
• Perl's results are the same as the integer results which helps solidify the idea that Perl's C API is converting everything to strings.
• Tcl's performance seem to improve over the integer test and seems to beat all the implementations in both performance and memory. I'm surprised by this and I don't know how to explain it. One suspicion I had was maybe Tcl isn't doing a full string copy and is just shuffling pointers around. But the memory results suggest something is getting creating and the amount of memory seems credible for a string copy. I also did some side experiements trying to delete the original string and crashing the program, but I was unsuccessful at getting it to crash which further makes me believe Tcl is indeed making its own copy as I expected. (To my frustration, I could not find explicit Tcl documentation on what its behavior is supposed to be or how things work.) Maybe somebody can explain this. I'm baffled.
• Python likes to use the most memory.
• Lua and STL's deletion performance is very similar. This suggests that like Lua, clang's library implementation may also not be compacting/rehashing entries.
• Lua 5.1.5 and Lua 5.2.1 are almost identical in performance, but Lua 5.2.1 does manage to break out just slightly ahead of Lua 5.1.5 in the insertion string tests. It seems that Lua 5.2.1 manages to usually edge out 5.1.5 on even the closer results. So great job goes to Lua team for improving things!
• Lua's insertion performance is the worst among the group. Notice the stair-step like increases in time in the insertion benchmarks. This suggests that running out of buckets and growing a hash table in Lua is an expensive operation. But this also gets to the point about the original benchmark being flawed. Insertion is not necessarily supposed to be a fast operation. And in fact, many libraries (including Lua) let you specify the number elements you think you will need for the hash table so you can avoid the expensive resize operations later.
Insertion/Deletion with pre-sizing hints
As stated in the last bullet point above, the insertion benchmarks are flawed because it is generally the wrong thing to look at when examining a hash table. This is not to say insertion times are unimportant. And they are important enough that some libraries provide API hints to pre-size a hash table to avoid growing costs. Lua provides such an API and this is worth examining by redoing the the benchmark with a pre-size hint.
• Immediately by providing a pre-size hint, Lua's insertion times dropped dramatically and it is now neck-and-neck with the STL unordered_map and in the ballpark with everybody else.
• Notice the stair-step growth in insertion time for Lua is now gone and it is mostly just a flat line.
Access (Get) benchmarks
So now we can move on to the most important benchmarks, the access/retrieving/getting benchmarks. For these benchmarks, for simplicitly, I remove Lua 5.1.5 and the const char* version of the STL unordered_map. Ruby also finally makes a partial re-appearence for the integer benchmarks. Unfortunately, I was unable to get it to not crash in the string benchmarks. I also remove sequental tests. All accesses are randomized.
For a hash table, it is possible to get different performance behaviors depending if the key exists in the table or it doesn't. So this benchmark covers both scenarios.
The benchmark is setup as follows:
• For testing N keys, I create N*2 keys, where half the keys will be valid keys that are inserted into the hash table, and the other half will provide keys that are guaranteed to not be in the table.
• The range of keys is from 0 to INT_MAX. The ranges are further divided up into 4 equal parts to help mix things up a little more to try to avoid penalizing hashing functions that may do better or worse at lower or higher number ranges. The first range is from 0 to INT_MAX/4, the second is from INT_MAX/4+1,INT_MAX/2, and so forth. The first and third range make up the valid keys and the second and forth make up the invalid keys.
• Each key that is generated is created by a random number generator. This allows for the possibility of duplicate keys (like the original benchmarks).
• String tests are done by converting the numbers to strings (like the original benchmarks).
• All valid keys are kept in an array. All invalid keys are kept in another array.
• Timing starts after the hash table is created and all keys are inserted.
• The benchmark does N accesses to see if the key/value pair exists. For each value in N, a random number generator is called to create array index value to fetch from the valid or invalid array.
• Ruby managed to have the smallest memory footprint.
• Python did significantly better at its invalid key/value pair look ups compared to its valid look ups.
• unordered_map, Lua, and Tcl seem to be very competitive with one another.
• Again, Tcl shows amazing performance.
• Perl shows the most dramatic speed up with invalid look ups over valid look ups.
• Everybody but Lua does faster invalid look ups than their own respective valid look ups. Lua does better with its valid lookups. Lua's memory profile spikes too in the invalid key benchmark. I believe the reason has to do with Lua's string internalization process. In Lua, all strings are internalized (i.e. there is only one unique instance of a string in memory). For invalid keys, this suggests Lua may be internalizing the temporary string key to find it in its hash table which results in both performance and memory overhead. This is an interesting aspect of string internalization, but as we'll see shortly, there is more to this story.
Access (Get) with string length optimization
In C, strings are NULL terminated and they don't contain any string length information. To find out the string length, you must count the number of characters until you hit the NULL terminator. When converting string representations, many libraries provide a size parameter to allow you to optimize out the need for strlen() if you already know the length.
In this suite, we actually have 3 candidates that provide a string length parameter, STL std::string, Lua, and Perl. I was curious how much providing this parameter would make a difference so I ran another set of benchmarks with this optimization.
• I'm shocked by these results. It turns out that the string length optimization was slower, albeit just slightly and it is barely noticeable. But the results seemed pretty consistent.
• I can't really explain why this is slower. My best guess is that my code is triggering some kind of cache miss resulting in the slower performance. But I don't really know how to prove that.
• This also suggests strlen() is very fast. I thought I would have thought calling it 30 million times would start making a difference, but it doesn't. I kind of wonder what the compiler and standard C library writers have done to optimize this.
• This demonstrates why we need to benchmark
Access (Get) with Lua string internalization optimization
Previously, we noted that Lua paid a price for its string internalization in the case of looking up invalid keys. Well, there are two sides to every coin. What we haven't seen yet is the up-side of the coin, which is what this next benchmark is about.
When you insert a string into Lua via the C-API, Lua/LuaHashMap returns you the pointer to the internalized string. If we use this const char* pointer instead of the other one which we first created our string entry with, Lua has a chance to use this to its performance benefit. So with a very simple change to the benchmark program, we can start using the internalized pointer instead and see what the up-side benefit is for valid key accesses.
• This graph overlays the original Lua access results with the string internalization optimization. Also included is an optimization that includes both the string length and internalization optimization.
• It is immediately clear we get a noticible performance boost by using the internalized string.
• The string length optimization was a tiny bit worse like before
Access (Get) Final Composite
So if we overlay the internalized results back on the original graph, we get this final graph:
Access (Set) benchmarks
Finally, for completeness, the other common operation on a hash table is to modify a value of a key/value pair that is already in the hash table.
In this benchmark, I enabled both the string length and string internalization options to expecting to see the optimal results. (I wrote this and batched up all the jobs before I saw the results so I didn't realize all the length optimizations were detrimental.)
• The graphs look pretty much the same as their Access/Get counterparts. In all cases, updating a value seems to be slightly slower, but not by much, which is reasonable/expected.
• The only new stand out seemed to be Python's memory usage spiked up for the string test.
These are various things I still don't have complete answers to.
• What is Tcl doing that makes it do so well in these benchmarks?
• How does Lua/LuaHashMap fair with pointers (light user data)? My initial spot check showed performance more like strings than integers with surprised me.
• How does LuaJIT2 compare? Unlike canocial Lua which is pure ANSI C for maximum portability, LuaJIT2 is written in assembly language for maximum performance. Its trace compiler will probably be of no benefit to statically compiled C code like this, but it would be interesting to know if LuaJIT's assembly implementation yields additional performance gains. I did an initial run with beta 10 about 6 months ago, but it crashed for me as I hit around 1GB of RAM. I never had time to pursue it further and figure out what the problem was.
• How does Apple's Core Foundation's CFDictionary compare? Core Foundation is supposedly a portable C library that theoretically works on non-Apple platforms (with lots of caveats). I wrote the implementation, but it only takes pointers and you must provide a hash function for non-Core Foundation types. Using the example one for integers in Apple's own documentation, my non-comprehenisve spot check showed its integer performance beat std::tr1::unordered_map. But for the string test, the proper string representation is CFString. This conversion overhead seemed to be fairly expensive and put it near the bottom of performance.
Did Tcl win? In any case, these benchmarks showed that these interpreter implementations have very good hash implementations and are competative with our reference benchmark of the STL unordered_map. Particularly in the case of Tcl and Lua, they were extremely competative and often were within 5%-10% of unordered_map when they weren't beating it. Impediance mismatch and memory management of strings seems to be the dominating factor of performance and memory which is an issue for all involved in this benchmark. In all cases, there are things that can be done to optimize for these issues, but at the very least, this is continued ammunition to dispell the dogma that interpreted languages are slow.
As for LuaHashMap, my original goals were pretty modest. I just wanted a simple, easy to use, hash table for pure C that could handle numbers, strings, and arbitrary types (pointers). My performance requirements were not very demanding. I only expected to use small tables most of the time that would be on the order of tens, hundreds, or thousands of entries and merely wanted something algorithmically correct.
These benchmarks really exceeded my expectations and really makes LuaHashMap compelling as a general purpose solution for hash tables even if you aren't using Lua already. In both speed and memory, Lua held its own.
Its optimizations for both numbers and strings give it nice benefits over some other C libraries that treat everything as pointers and give up performance. And another interesting aspect is that you can still mix different types in LuaHashMap in the same hash table.
But I am very impressed by Tcl's showing. I'm still a little suspcious that there is a problem with my code. But assuming everything is kosher, I still might not be switching backends just yet. Lua is about 100KB for the core and an additional 100KB for the standard library (which is not needed by LuaHashMap). The Tcl dynamic library shipped on my system is about 2MB. But Tcl isn't a built in component on all systems, so comparing 200KB vs 2MB is a tougher call. Still, I encourage anybody who wants to write a LuaHashMap like wrapper for another interpreter to please do so. I would love to have the option.