building a simple hash map
hash maps are useful - let's build one!
2025-07-31 · data structures · hash · rust
Hash Maps are like magic: you insert some (many, even) key-value pairs and, whenever you want, you can query the map with $\bigO{1}$ time complexity. It is stupidly useful![^1] In fact, it's so useful that dynamic languages like Lua or JavaScript are essentially just a big pile of specialized hash maps.
But how does that even work? How is it able to search through so many elements in so little time? And what better way to understand than to build one?
#The "hash" in the "map"
Let's, first of all, try to define what we're trying to build. We want a data structure that is able to take a kv-pair and, somehow, just by looking at the key, know where to find it within some internal storage.
What we want is some kind of function $h(k)$ that can map any key we throw at it to a number, which we can then use to find the kv-pair in the storage. This is what we call the hash function, and the value it produces for a given key is its hash value (often shortened to hash).
An example of applying the hash function $h : \text{Strings} \to \text{Integers}$
The hash value of a key is like a unique identifier for it - except it isn't unique. Different keys might have the same hash, no matter how similar or different they are. When that happens, we call it a hash collision.
We won't be discussing how to make a hash function, as our hash map will be generic over it. So understanding its purpose is enough for us.
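As a quick illustration (using the standard library's `DefaultHasher` rather than anything we'll build ourselves), hashing the same key always produces the same number:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn main() {
    // Hash the same key twice with two fresh hashers: the hash function
    // is deterministic, so both produce the same number.
    let mut hasher = DefaultHasher::new();
    "answer".hash(&mut hasher);
    let first = hasher.finish();

    let mut hasher = DefaultHasher::new();
    "answer".hash(&mut hasher);
    let second = hasher.finish();

    assert_eq!(first, second);
    println!("h(\"answer\") = {first}");
}
```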
Well, cool. Now we have some sort of magical number unique-but-not-really to our key. How can we use it to find where the kv-pair is?
The idea is simple: we'll use the hash as an index into a buffer which stores the kv-pairs, by computing the hash modulo the buffer length. There are just two problems:
- How do we deal with hash collisions?
- How big should that buffer be?
We'll answer both these questions with a single idea: buckets.
#Buckets
In order to solve hash collisions, our buffer can't[^2] just be a kv-pair buffer. Instead, we'll have it store groups of kv-pairs which ended up at the same index. Usually, these groups are called buckets. Therefore, our buffer is actually a buffer of buckets.
A diagram of the layout of the buffer. Each index contains a bucket, and each bucket contains multiple kv-pairs.
Then, whenever we want to find a kv-pair, we compute the hash of the key to find its bucket, and perform a linear search through the bucket to find the pair (if it is present).
Alright, hash collisions solved. How about the buffer size?
Remember our goal of $\bigO{1}$ time complexity: to achieve it, bucket sizes must be kept under a fixed limit, since we'll be performing a linear search for kv-pairs within each bucket.
Assuming the hash function is of good quality and its hash values are somewhat uniformly distributed (i.e. for any given key, all hash values have an equal probability of being its hash), we can expect the buckets to have around the same length.
With this, we can take a simple strategy: if we want to have around $N$ kv-pairs per bucket, we just need to ensure we have at most $N * (\text{bucket count})$ elements in our map. If we exceed that, we'll just double the bucket count.
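Here's a rough sketch of that growth rule. The names and numbers are purely illustrative (a hypothetical `should_grow` helper, not part of the map we'll build):

```rust
// Illustrative only: our target of N kv-pairs per bucket.
const MAX_PAIRS_PER_BUCKET: usize = 8;

// Hypothetical helper: should we grow, given the current pair count
// and bucket count?
fn should_grow(len: usize, bucket_count: usize) -> bool {
    len > MAX_PAIRS_PER_BUCKET * bucket_count
}

fn main() {
    // With 4 buckets we tolerate up to 8 * 4 = 32 pairs on average...
    assert!(!should_grow(32, 4));
    // ...and the 33rd pair is what triggers doubling to 8 buckets.
    assert!(should_grow(33, 4));
}
```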
#Implementation
Let's start by defining our `Map` struct:
```rust
use std::hash::RandomState;

pub struct Map<K, V, S = RandomState> {
    build_hasher: S,
    buckets: Vec<Vec<(K, V)>>,
    len: usize,
}
```
It has three generic parameters: `K`, the key type, `V`, the value type, and `S`, which is a type implementing the `BuildHasher` trait (we don't add the trait bound in the definition, but we'll add it to the relevant `impl` blocks).

The fields are pretty self-explanatory, except for `build_hasher`, which is an instance of `S`. What even is the `BuildHasher` trait, anyway?
#Hash-related traits
The `BuildHasher` trait is implemented by types which are able to produce instances of a type implementing the `Hasher` trait. The `Hasher` trait, on the other hand, is implemented by types which are able to hash an arbitrary stream of bytes, which in practice means they implement some hash function.

This is necessary because every `Hasher` has state, and we want to hash keys independently of one another, so we need to be able to create `Hasher`s whenever we want to hash something.

We also set `RandomState` as the default type for the `S` parameter, as it's the standard library's default `BuildHasher`, producing `DefaultHasher`s.
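Here's a small sketch of that relationship, using the same `RandomState` we just set as the default:

```rust
use std::hash::{BuildHasher, Hash, Hasher, RandomState};

fn main() {
    // One BuildHasher (seeded once, at random)...
    let build_hasher = RandomState::new();

    // ...hands out a fresh, stateful Hasher every time we need to hash a key.
    let mut hasher = build_hasher.build_hasher();
    "answer".hash(&mut hasher);
    let first = hasher.finish();

    let mut hasher = build_hasher.build_hasher();
    "answer".hash(&mut hasher);
    let second = hasher.finish();

    // Hashers produced by the same BuildHasher agree on the same key.
    assert_eq!(first, second);
}
```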
While we're talking about hash traits, let me talk about `Hash` as well: types which implement `Hash` know how to write their data to a `Hasher` in order to be hashed. Pretty simple.
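For example, a type with a couple of fields (a made-up `UserId` here) just feeds each of them to the `Hasher`, which is roughly what `#[derive(Hash)]` generates for you:

```rust
use std::hash::{Hash, Hasher};

// Hypothetical type, just to show what implementing Hash looks like.
struct UserId {
    namespace: u32,
    id: u64,
}

// In practice you'd write `#[derive(Hash)]`; this is roughly what it expands to.
impl Hash for UserId {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Write each field into the hasher's state.
        self.namespace.hash(state);
        self.id.hash(state);
    }
}
```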
#Borrowed keys
Let's implement some constructors:
```rust
impl<K, V, S> Map<K, V, S> {
    /// Creates a new [`Map`] with the given [`Hasher`] builder.
    pub fn with_hasher(build_hasher: S) -> Self {
        Self {
            build_hasher,
            buckets: Vec::new(),
            len: 0,
        }
    }
}

// Note: this is only implemented for Map<K, V, RandomState>!
impl<K, V> Map<K, V> {
    /// Creates a new [`Map`] with the default [`BuildHasher`].
    pub fn new() -> Self {
        Self::with_hasher(Default::default())
    }
}
```
These constructors are pretty simple - nothing too fancy going on. Let's start implementing some of the more interesting functionality of the map:
```rust
// `Hasher` needs to be in scope so we can call `finish()` below
use std::hash::{BuildHasher, Hash, Hasher};
use std::borrow::Borrow;

// From now on, assume all method definitions are within this impl block
impl<K, V, S> Map<K, V, S>
where
    K: Hash + Eq,
    S: BuildHasher,
{
    /// Given a key, returns the index of the bucket it belongs to.
    fn bucket_for<Q>(&self, key: &Q) -> usize
    where
        K: Borrow<Q>,
        Q: Hash + ?Sized,
    {
        let mut hasher = self.build_hasher.build_hasher();
        key.hash(&mut hasher);

        (hasher.finish() as usize) % self.buckets.len()
    }
}
```
The `bucket_for` method has an unexpected type parameter `Q` - what is that for?

We want to be able to query the map with borrowed representations of `K` as well - not just owned ones. Imagine if, in order to query a `Map<String, _>`, you had to allocate a `String` even if you already had a `&str`. That wouldn't be ideal - and that's what we're avoiding here.
The `Borrow<T>` trait essentially says that its implementor can be borrowed as a `T` and, per its docs, establishes a contract that the `Eq`, `Ord` and `Hash` traits behave identically on both types. This allows us to hash the `Q` we received and treat its hash as the hash of any `K` which would borrow to that value.
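A small sketch of what that contract buys us, using `String: Borrow<str>` (the `hash_of` helper is just for this demonstration):

```rust
use std::borrow::Borrow;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Helper for this demonstration only: hash any value with a DefaultHasher.
fn hash_of<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let owned: String = String::from("answer");
    let borrowed: &str = "answer";

    // String: Borrow<str>, so the owned value can be viewed as a &str...
    let _: &str = owned.borrow();

    // ...and the Borrow contract guarantees both forms hash (and compare)
    // identically, so either one lands in the same bucket.
    assert_eq!(hash_of(&owned), hash_of(borrowed));
}
```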
#Capacity growth
```rust
/// Grows the map by doubling its capacity
fn grow(&mut self) {
    let old_bucket_count = self.buckets.len();
    let new_bucket_count = (self.buckets.len() * 2).max(4);
    self.buckets.resize_with(new_bucket_count, Vec::new);

    // Rehash and move pairs around
    for check_bucket in 0..old_bucket_count {
        let mut index = 0;
        while index < self.buckets[check_bucket].len() {
            let correct_bucket = self.bucket_for(&self.buckets[check_bucket][index].0);
            if correct_bucket != check_bucket {
                let pair = self.buckets[check_bucket].swap_remove(index);
                self.buckets[correct_bucket].push(pair);
            } else {
                index += 1;
            }
        }
    }
}
```
`grow` is the method we'll be calling whenever the map gets too full.

It starts by computing the new size as `(self.buckets.len() * 2).max(4)`, which in practice means "double the length, and make it at least four"[^3], then resizing the buckets buffer to that size.
Then, since the length of the buffer changed, we need to rehash the kv-pairs already present, as their buckets have likely changed (their hash stays the same, but the bucket index produced by the modulo may not).
The way we do it is a little convoluted, but not very hard to grasp:
- Iterate over the buckets that were already present before growing
- For each bucket, iterate over the elements, rehash their keys and then move them if their bucket has changed
The iteration over the elements is done through a while loop whose index is only incremented when the element is already in its correct bucket: if it isn't, we remove it with `swap_remove`, which places a different element at the current index, so the index has to stay put for that element to be checked next.
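If that sounds convoluted, here's the same pattern in isolation, on a plain `Vec` of numbers instead of buckets:

```rust
fn main() {
    // Move every odd number out of `source` into `moved`,
    // without skipping any element.
    let mut source = vec![1, 2, 3, 4, 5, 6];
    let mut moved = Vec::new();

    let mut index = 0;
    while index < source.len() {
        if source[index] % 2 != 0 {
            // swap_remove puts the *last* element at `index`,
            // so we must re-check the same index on the next pass.
            moved.push(source.swap_remove(index));
        } else {
            index += 1;
        }
    }

    assert_eq!(source, vec![6, 2, 4]);
    assert!(moved.iter().all(|n| n % 2 != 0));
}
```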
#Insertion and Querying
```rust
// Could be any number but 8 seems like a good compromise
const BUCKET_SIZE: usize = 8;

/// Inserts a key-value pair into the map, returning the old value if
/// the key was already present.
pub fn insert(&mut self, key: K, value: V) -> Option<V> {
    // Special case for when the map is empty (no buckets)
    if self.buckets.is_empty() {
        self.grow();

        let bucket = self.bucket_for(&key);
        self.buckets[bucket].push((key, value));
        self.len += 1;

        return None;
    }

    let bucket = self.bucket_for(&key);
    let pair = self.buckets[bucket].iter_mut().find(|(k, _)| *k == key);
    match pair {
        Some((_, v)) => Some(std::mem::replace(v, value)),
        None => {
            self.buckets[bucket].push((key, value));
            self.len += 1;

            // Check if we have more than BUCKET_SIZE elements per bucket
            // (on average) and then grow if that's the case
            if Self::BUCKET_SIZE * self.buckets.len() < self.len {
                self.grow();
            }

            None
        }
    }
}
```
Inserting into the map is pretty straightforward: compute the bucket for the key, search for the key in the bucket and then replace the value of the pair if it is present or insert the new pair otherwise. We also handle the special case where the buckets buffer is empty.
```rust
/// Returns the value associated with the given key.
pub fn get<Q>(&self, key: &Q) -> Option<&V>
where
    K: Borrow<Q>,
    Q: Hash + Eq + ?Sized,
{
    // An empty map has no buckets, and the modulo in `bucket_for`
    // would divide by zero, so bail out early.
    if self.buckets.is_empty() {
        return None;
    }

    let bucket = self.bucket_for(key);
    self.buckets[bucket]
        .iter()
        .find_map(|pair| (key == pair.0.borrow()).then_some(&pair.1))
}
```
Querying the map is even more straightforward! We just compute the bucket for the key, search for the key in the bucket and return its associated value if it is present (plus an early return for the empty-map case, so the modulo in `bucket_for` never divides by zero).

Notice that, just like `bucket_for`, `get` has a generic parameter `Q` which we use instead of `K` for all the same reasons I listed before.
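Here's what that buys us in practice - a quick usage sketch, assuming the `Map` we've been building is in scope:

```rust
fn main() {
    let mut map: Map<String, u32> = Map::new();
    map.insert(String::from("answer"), 42);

    // No need to allocate a String just to look a key up:
    // a &str works because String: Borrow<str>.
    assert_eq!(map.get("answer"), Some(&42));
    assert_eq!(map.get("missing"), None);
}
```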
And with that, you have a simple hash map! Sure, it's missing important methods like `remove`, but I'll leave them as an exercise for the reader.
It's important to note that this is not the only way to build a hash map. There are way too many ways to build one, actually. This is just the one I chose to implement in this post!
#Testing
Does it really work? Let's test it!
First, we'll implement `Debug` in order to see what the map contains:
```rust
use std::fmt::Debug;

impl<K, V, S> Debug for Map<K, V, S>
where
    K: Debug,
    V: Debug,
{
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let mut d = f.debug_map();
        for pair in self.buckets.iter().flat_map(|b| b.iter()) {
            d.entry(&pair.0, &pair.1);
        }

        d.finish()
    }
}
```
Then, we'll write a very simple test:
```rust
#[test]
fn map_test() {
    let mut map = Map::new();
    map.insert("pi", 31415);
    map.insert("answer", 42);
    map.insert("nine", 4);

    assert!(matches!(map.get("pi"), Some(31415)));
    assert!(matches!(map.get("answer"), Some(42)));
    assert!(matches!(map.get("nine"), Some(4)));

    assert!(!matches!(map.get("pi"), Some(1)));
    assert!(!matches!(map.get("answer"), None));
    assert!(!matches!(map.get("nine"), Some(31415)));

    assert!(map.get("foo").is_none());
    assert!(map.get("bar").is_none());
    assert!(map.get("baz").is_none());

    dbg!(&map);
}
```
And...
```
running 1 test
test map_test ... ok

---- map_test stdout ----
[src/lib.rs:156:5] &map = {
    "pi": 31415,
    "answer": 42,
    "nine": 4,
}
```
Cool! It actually works!
#Should you use this?
No. Use the standard library's hash map. Or some other actually good one. The one we just built is, honestly, pretty bad. It could be improved in many ways, even without changing its architecture much. But it's easy to understand, so there's that!
The simplest optimization you could implement is replacing the type of the buckets, `Vec<(K, V)>`, with `SmallVec<[(K, V); 2 * Self::BUCKET_SIZE]>`. This will make our map more cache-friendly, as we remove the added indirection of the `Vec` in the majority of cases (`SmallVec` will only allocate if the inline storage isn't enough).
I hope you learned something new, though! Have fun hashing stuff.

[^1]: Note, however, that constant time complexity does not mean "fast!!!!!" - it just means the time it takes does not grow with input size (in this case, the map size). A linear search on an array is still faster for small enough (which is very relative) collections.

[^2]: I lied - it can. But that's not the approach I'll be following in this post. You can read more about it by searching for the "open addressing" strategy for collision resolution.

[^3]: Although the `max` method makes it look like we're establishing a maximum value for the expression, that's not the case! It evaluates to the maximum of the two values, i.e. `a.max(b)` is the same as `max(a, b)`. See the docs.