building a simple hash map

$$ \gdef \bigO #1{ \mathcal{O} \left( #1 \right) } $$

Hash Maps are like magic: you insert some (many, even) key-value pairs and, whenever you want, you can query the map with $\bigO{1}$ time complexity. It is stupidly useful!¹ In fact, it's so useful that dynamic languages like Lua or JavaScript are essentially just a big pile of specialized hash maps.

But how does that even work? How is it able to search through so many elements in such a small time? And what better way to understand than to build one?

# The "hash" in the "map"

Let's, first of all, try to define what we're trying to build. We want a data structure that is able to take a kv-pair and, somehow, just by looking at the key, know where to find it within some internal storage.

What we want is some kind of function $h(k)$ that can map any key we throw at it to a number, which we can then use to find the kv-pair in the storage. This is what we call the hash function, and the value it produces for a given key is its hash value (often shortened to hash).

An example of applying the hash function $h : \text{Strings} \to \text{Integers}$

The hash value of a key is like a unique identifier for it - except it isn't unique. Different keys might have the same hash, no matter how similar or different they are. When that happens, we call it a hash collision.

We won't be discussing how to make a hash function, as our hash map will be generic over it. So understanding its purpose is enough for us.

Well, cool. Now we have some sort of magical number unique-but-not-really to our key. How can we use it to find where the kv-pair is?

The idea is simple: we'll use the hash as an index into a buffer which stores the kv-pairs by computing the hash modulo the buffer length. There are just two problems:

  • How do we deal with hash collisions?
  • How big should that buffer be?

We'll answer both these questions with a single idea: buckets.

# Buckets

In order to solve hash collisions, our buffer can't² just be a kv-pair buffer. Instead, we'll have it store groups of kv-pairs which ended up at the same index. Usually, these groups are called buckets. Therefore, our buffer is actually a buffer of buckets.

A diagram of the layout of the buffer. Each index contains a bucket, and each bucket contains multiple kv-pairs.

Then, whenever we want to find a kv-pair, we compute the hash of the key to find its bucket, and perform a linear search through the bucket to find the pair (if it is present).
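To make the layout concrete, here's a tiny standalone sketch - the keys, values, and which bucket each pair lands in are made up for illustration:

```rust
fn main() {
    // A buffer of buckets: each index holds the kv-pairs whose hash,
    // modulo the buffer length, landed there.
    let buckets: Vec<Vec<(&str, i32)>> = vec![
        vec![("pi", 3)],          // bucket 0
        vec![],                   // bucket 1
        vec![("a", 1), ("b", 2)], // bucket 2: two keys collided here
        vec![],                   // bucket 3
    ];

    // Finding "b": pretend its hash modulo 4 is 2, then linearly
    // search bucket 2 for the matching key.
    let bucket = 2;
    let found = buckets[bucket].iter().find(|(k, _)| *k == "b");
    assert_eq!(found, Some(&("b", 2)));
}
```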

Alright, hash collisions solved. How about the buffer size?

Remember our goal of $\bigO{1}$ time complexity: in order to achieve that, bucket size must be kept under a fixed limit, as we'll be performing a linear search for kv-pairs in them.

Assuming the hash function is of good quality and its hash values are somewhat uniformly distributed (i.e. for any given key, all hash values are equally likely to be its hash), we can expect buckets to have around the same length.

With this, we can take a simple strategy: if we want to have around $N$ kv-pairs per bucket, we just need to ensure we have at most $N * (\text{bucket count})$ elements in our map. If we exceed that, we'll just double the bucket count.
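That growth rule can be sketched in isolation - the counts below are illustrative, with $N = 8$ (the limit we'll pick later) and a starting buffer of 4 buckets:

```rust
fn main() {
    // Target: at most N kv-pairs per bucket on average.
    const N: usize = 8;

    let mut bucket_count = 4;
    let mut len = 0;

    // Insert 100 pairs, doubling the bucket count whenever the
    // average bucket would exceed N elements.
    for _ in 0..100 {
        len += 1;
        if len > N * bucket_count {
            bucket_count *= 2;
        }
    }

    // We doubled twice along the way: 4 -> 8 -> 16 buckets.
    assert_eq!(bucket_count, 16);
    assert!(len <= N * bucket_count);
}
```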

# Implementation

Let's start by defining our Map struct:

```rust
use std::hash::RandomState;

pub struct Map<K, V, S = RandomState> {
    build_hasher: S,
    buckets: Vec<Vec<(K, V)>>,
    len: usize,
}
```

It has three generic parameters: K, the key type; V, the value type; and S, a type implementing the BuildHasher trait (we don't add the trait bound in the definition, but we'll add it to the relevant impl blocks).

The fields are pretty self-explanatory, except for build_hasher, which is an instance of S. What even is the BuildHasher trait, anyway?

The BuildHasher trait is implemented by types which are able to produce instances of a type implementing the Hasher trait. The Hasher trait, on the other hand, is implemented by types which are able to hash an arbitrary stream of bytes, which in practice means they implement some hash function.

This is necessary because every Hasher has state, and we want to hash keys independently of one another, so we need to be able to create a fresh Hasher whenever we want to hash something.

We also set RandomState as the default type for the S parameter, as it's the standard library's default BuildHasher, producing DefaultHashers.

While we're talking about hash traits, let me talk about Hash as well: types which implement Hash know how to write their data to a Hasher in order to be hashed. Pretty simple.
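The interplay of the three traits can be sketched with the standard library's types (note that RandomState's re-export at std::hash requires a recent Rust):

```rust
use std::hash::{BuildHasher, Hash, Hasher, RandomState};

fn main() {
    let build_hasher = RandomState::new();

    // BuildHasher gives us a fresh Hasher...
    let mut hasher = build_hasher.build_hasher();

    // ...Hash writes the key's bytes into it...
    "some key".hash(&mut hasher);

    // ...and Hasher::finish produces the hash value.
    let h1 = hasher.finish();

    // The same BuildHasher hashes equal keys to equal values.
    let mut hasher = build_hasher.build_hasher();
    "some key".hash(&mut hasher);
    let h2 = hasher.finish();

    assert_eq!(h1, h2);
}
```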

# Borrowed keys

Let's implement some constructors:

```rust
impl<K, V, S> Map<K, V, S> {
    /// Creates a new [`Map`] with the given [`Hasher`] builder.
    pub fn with_hasher(build_hasher: S) -> Self {
        Self {
            build_hasher,
            buckets: Vec::new(),
            len: 0,
        }
    }
}

// Note: This is only implemented for Map<K, V, RandomState>!
impl<K, V> Map<K, V> {
    /// Creates a new [`Map`] with the default [`BuildHasher`].
    pub fn new() -> Self {
        Self::with_hasher(Default::default())
    }
}
```

These constructors are pretty simple - nothing too fancy going on. Let's start implementing some of the more interesting functionality of the map:

```rust
// `Hasher` must also be in scope for the `finish` call below.
use std::hash::{BuildHasher, Hash, Hasher};
use std::borrow::Borrow;

// From now on, assume all method definitions to be within this impl block
impl<K, V, S> Map<K, V, S>
where
    K: Hash + Eq,
    S: BuildHasher,
{
    /// Given a key, returns the index of the bucket it belongs to.
    fn bucket_for<Q>(&self, key: &Q) -> usize
    where
        K: Borrow<Q>,
        Q: Hash + ?Sized,
    {
        let mut hasher = self.build_hasher.build_hasher();
        key.hash(&mut hasher);

        (hasher.finish() as usize) % self.buckets.len()
    }
}
```

The bucket_for method has an unexpected type parameter Q - what is that for?

We want to be able to query the map with borrowed representations of K as well - not just owned ones. Imagine if, in order to query a Map<String, _>, you had to allocate a String even if you already had a &str. That wouldn't be ideal - and that's what we're avoiding here.

The Borrow<T> trait essentially says that its implementor can be borrowed as a T and, per its docs, establishes a contract that the Eq, Ord and Hash traits behave identically on both types. This allows us to hash the Q we received and treat its hash as the hash of the K that borrows to it.
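We can check that contract directly - a standalone sketch using std's RandomState and the BuildHasher::hash_one convenience method:

```rust
use std::hash::{BuildHasher, RandomState};

fn main() {
    let build_hasher = RandomState::new();
    let owned: String = String::from("hello");

    // Borrow<str>'s contract guarantees that a String and the str it
    // borrows to hash identically - this is what lets bucket_for treat
    // the hash of a Q as the hash of the matching K.
    assert_eq!(
        build_hasher.hash_one(&owned),
        build_hasher.hash_one("hello"),
    );
}
```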

# Capacity growth

```rust
/// Grows the map by doubling its capacity
fn grow(&mut self) {
    let old_bucket_count = self.buckets.len();
    let new_bucket_count = (self.buckets.len() * 2).max(4);
    self.buckets.resize_with(new_bucket_count, Vec::new);

    // Rehash and move pairs around
    for check_bucket in 0..old_bucket_count {
        let mut index = 0;
        while index < self.buckets[check_bucket].len() {
            let correct_bucket = self.bucket_for(&self.buckets[check_bucket][index].0);
            if correct_bucket != check_bucket {
                let pair = self.buckets[check_bucket].swap_remove(index);
                self.buckets[correct_bucket].push(pair);
            } else {
                index += 1;
            }
        }
    }
}
```

grow is the method we'll be calling whenever the map gets too full.

It starts by computing the new size as (self.buckets.len() * 2).max(4) which in practice means "double the length, and make it at least four"³, then resizing the buckets buffer to that size.

Then, since the length of the buffer changed, we need to rehash the kv-pairs already present: their hashes remain the same, but the bucket index produced by the modulo has likely changed.

The way we do it is a little convoluted, but not very hard to grasp:

  • Iterate over the buckets that were already present before growing
  • For each bucket, iterate over the elements, rehash their keys and then move them if their bucket has changed

The iteration over the elements is done through a while loop whose index is only incremented when the element is already in its correct bucket. If it isn't, we remove it, and swap_remove moves a different element into the current index - so we keep the index the same in order to check that element next.
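The reason we keep the index the same is swap_remove's behavior, which we can check in isolation:

```rust
fn main() {
    let mut bucket = vec!["a", "b", "c", "d"];

    // swap_remove takes the element out and moves the *last* element
    // into its slot - so after removing index 1, "d" now sits there,
    // and the rehashing loop must re-check the same index.
    let removed = bucket.swap_remove(1);
    assert_eq!(removed, "b");
    assert_eq!(bucket, ["a", "d", "c"]);
}
```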

# Insertion and Querying

```rust
// Could be any number but 8 seems like a good compromise
const BUCKET_SIZE: usize = 8;

/// Inserts a key-value pair into the map, returning the old value if
/// the key was already present.
pub fn insert(&mut self, key: K, value: V) -> Option<V> {
    // Special case for when the map is empty (no buckets)
    if self.buckets.is_empty() {
        self.grow();

        let bucket = self.bucket_for(&key);
        self.buckets[bucket].push((key, value));
        self.len += 1;

        return None;
    }

    let bucket = self.bucket_for(&key);
    let pair = self.buckets[bucket].iter_mut().find(|(k, _)| *k == key);
    match pair {
        Some((_, v)) => Some(std::mem::replace(v, value)),
        None => {
            self.buckets[bucket].push((key, value));
            self.len += 1;

            // Check if we have more than BUCKET_SIZE elements per bucket
            // (on average) and then grow if that's the case
            if Self::BUCKET_SIZE * self.buckets.len() < self.len {
                self.grow();
            }

            None
        }
    }
}
```

Inserting into the map is pretty straightforward: compute the bucket for the key, search for the key in the bucket and then replace the value of the pair if it is present or insert the new pair otherwise. We also handle the special case where the buckets buffer is empty.
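The insert contract is easy to see in action - shown here with the standard library's HashMap, which has the same return semantics:

```rust
use std::collections::HashMap;

fn main() {
    let mut map = HashMap::new();

    // Inserting a fresh key returns None...
    assert_eq!(map.insert("answer", 41), None);

    // ...and re-inserting the same key replaces the value
    // and hands the old one back.
    assert_eq!(map.insert("answer", 42), Some(41));
    assert_eq!(map.get("answer"), Some(&42));
}
```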

```rust
/// Returns the value associated with the given key.
pub fn get<Q>(&self, key: &Q) -> Option<&V>
where
    K: Borrow<Q>,
    Q: Hash + Eq + ?Sized,
{
    // Guard against an empty map: bucket_for would otherwise
    // perform a modulo by zero.
    if self.buckets.is_empty() {
        return None;
    }

    let bucket = self.bucket_for(key);
    self.buckets[bucket]
        .iter()
        .find_map(|pair| (key == pair.0.borrow()).then_some(&pair.1))
}
```

Querying the map is even more straightforward! We just compute the bucket for the key, search for the key in the bucket and return its associated value if it is present.

Notice that, just like bucket_for, get has a generic parameter Q which we use instead of K for all the same reasons I listed before.
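Here's what Q buys us in practice - again shown with std's HashMap, which uses the same Borrow-based signature:

```rust
use std::collections::HashMap;

fn main() {
    // The keys are owned Strings...
    let mut map = HashMap::new();
    map.insert(String::from("pi"), 3);

    // ...but thanks to the Q parameter (and String: Borrow<str>),
    // we can query with a plain &str - no allocation needed.
    let key: &str = "pi";
    assert_eq!(map.get(key), Some(&3));
}
```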

And with that, you have a simple hash map! Sure, it's missing important methods like remove, but I'll leave them as an exercise for the reader.

It's important to note that this is not the only way to build a hash map. There are way too many ways to build one, actually. This is just the one I chose to implement in this post!

# Testing

Does it really work? Let's test it!

First, we'll implement Debug in order to see what the map contains:

```rust
use std::fmt::Debug;

impl<K, V, S> Debug for Map<K, V, S>
where
    K: Debug,
    V: Debug,
{
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let mut d = f.debug_map();
        for pair in self.buckets.iter().flat_map(|b| b.iter()) {
            d.entry(&pair.0, &pair.1);
        }

        d.finish()
    }
}
```

Then, we'll write a very simple test:

```rust
#[test]
fn map_test() {
    let mut map = Map::new();
    map.insert("pi", 31415);
    map.insert("answer", 42);
    map.insert("nine", 4);

    // get returns Option<&V>, so the patterns match on references
    assert!(matches!(map.get("pi"), Some(&31415)));
    assert!(matches!(map.get("answer"), Some(&42)));
    assert!(matches!(map.get("nine"), Some(&4)));

    assert!(!matches!(map.get("pi"), Some(&1)));
    assert!(!matches!(map.get("answer"), None));
    assert!(!matches!(map.get("nine"), Some(&31415)));

    assert!(map.get("foo").is_none());
    assert!(map.get("bar").is_none());
    assert!(map.get("baz").is_none());

    dbg!(&map);
}
```

And...

```
running 1 test
test map_test ... ok

---- map_test stdout ----
[src/lib.rs:156:5] &map = {
    "pi": 31415,
    "answer": 42,
    "nine": 4,
}
```

Cool! It actually works!

# Should you use this?

No. Use the standard library's hash map. Or some other actually good one. The one we just built is, honestly, pretty bad. It could be improved in many ways, even without changing its architecture much. But it's easy to understand, so there's that!

The simplest optimization you could implement is replacing the type of the buckets Vec<(K, V)> with SmallVec<[(K, V); 2 * Self::BUCKET_SIZE]>. This will make our map more cache-friendly, as we remove the added indirection of the Vec in the majority of cases (SmallVec will allocate if the inline storage isn't enough).

I hope you learned something new, though! Have fun hashing stuff.

  1. Note, however, that constant time complexity does not mean "fast!!!!!" - it just means the time it takes does not grow with input size (in this case, the map size). A linear search on an array is still faster for small enough (which is very relative) collections.

  2. I lied - it can. But that's not the approach I'll be following in this post. You can read more about it by searching for the "Open Addressing Strategy" for collision resolution.

  3. Although the max method makes it look like we're establishing a maximum value for the expression, that's not the case! It's evaluating to the maximum between the two values, i.e. a.max(b) is the same as max(a, b). Docs.