Hash Tables

A conceptual explanation.


What's a Hash Table?

A hash table is a data structure that is very efficient for storing, inserting, and retrieving data. Unlike an array, a hash table stores its data as key/value pairs, at least at a conceptual level.

Conceptually, to store a value in a hash table, you need a key that serves as a pointer to it. For example, say you want to store a user's information, which includes their name, age, and gender, in a hash table. We could have a structure like the one below:

// user data
"name"   => "Alfred"
"age"    => "22"
"gender" => "male"

From the above, we've mapped specific data about a user by pointing each value to a key (I call this an identity). We'll look at this data structure in more depth in later sections.
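To make this concrete: in Python, for example, the built-in `dict` is a hash table, and the same user record looks like this (variable names here are my own):

```python
# A Python dict is a hash table under the hood:
# each value is stored and retrieved via its key.
user = {
    "name": "Alfred",
    "age": "22",
    "gender": "male",
}

print(user["name"])  # look a value up by its key
```

The key acts as the identity; you never ask "what's at position 2?", you ask "what's stored under `name`?".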

The underlying structure of a hash table

Hash tables are typically implemented as a dynamic array of linked lists: each array index is a bucket, and items that land at the same index form a linked list there.
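A minimal sketch of that layout in Python (the class and field names are my own, not from any particular library):

```python
class Node:
    """One entry in a bucket's linked list."""
    def __init__(self, key, value, next=None):
        self.key = key
        self.value = value
        self.next = next  # the next node that hashed to the same index

class HashTable:
    """An array of buckets; each bucket is the head of a linked list (or None)."""
    def __init__(self, capacity=8):
        self.buckets = [None] * capacity  # the underlying (resizable) array
```

Each slot of `buckets` starts empty; as items are inserted, a slot either holds a single node or the head of a chain of nodes.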

The Big O Notation

For inserting, retrieving, and deleting an item, a hash table offers an average time complexity of O(1). This can degrade to O(n) in certain edge cases, which I'll explain in a bit.

Inserting an item into a hash table.

Remember, we said hash tables have a key => value pair structure, and this key could be a string or a number. In some programming languages, it can be more than just a string or a number.

Let's say we want to create a hash table for a car. Assuming we want to store the model of my Ferrari, what happens under the hood is this:

  • We have a key/value pair like "model" => "Ferrari".

  • Then the key, "model" (a string), is passed through a hashing function to produce an integer. This integer represents the index where the value will be stored in the underlying array. Now you see where the name "hash" comes from.

A hash function takes an input and performs mathematical operations on it to produce an output. There are various hash function algorithms, designed by really smart people, and they are very powerful. In our case, say our hash function converts each character of the string to its ASCII value and sums them up.

For instance, the ASCII values of the characters in our key "model" are 109, 111, 100, 101, and 108, which sum to 529. If the underlying array has length 10, then 529 modulo 10 gives 9, so the value "Ferrari" would be stored at index 9. Now there's an edge case: it's possible for two different keys to produce the same index after hashing. This is called a collision.
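Here's a toy version of that ASCII-sum hash in Python. Real-world hash functions are far more sophisticated; this only illustrates the string-to-index idea:

```python
def ascii_sum_hash(key: str, array_length: int) -> int:
    """Sum the ASCII codes of the key, then wrap into the array's bounds."""
    total = sum(ord(ch) for ch in key)  # "model" -> 109+111+100+101+108 = 529
    return total % array_length         # 529 % 10 = 9

print(ascii_sum_hash("model", 10))  # prints 9
```

Notice how easy it is to collide: any other key whose ASCII sum ends in the same digit lands at the same index.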

In such a case, this is where the linked list comes in: at each index in the array, we keep a linked list of all the values that hashed to that index.

In real life, hash functions are designed to spread keys out and make collisions rare, though they can never be ruled out entirely. Note that the hashing function also takes the size of the underlying array into account (typically via a modulo), so the index it produces never falls outside the array.
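A minimal sketch of collision handling by chaining, self-contained in Python. For brevity I use a Python list per bucket instead of a hand-rolled linked list; the class and method names are my own:

```python
class ChainedHashTable:
    """Each bucket holds (key, value) pairs; colliding keys share a bucket."""
    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        # Toy ASCII-sum hash, wrapped to the array's length.
        return sum(ord(ch) for ch in key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # key already present: overwrite it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))    # new key: append to this bucket's chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:               # walk the chain until the key matches
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.insert("model", "Ferrari")
print(table.get("model"))  # prints Ferrari
```

Even when two keys share a bucket, `get` still finds the right value by walking the chain and comparing keys.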

The resizing case

Let's say we have an underlying array of length 3 holding just 3 key => value pairs, and after hashing there are no collisions. Retrieving is then O(1), because at each index we have at most one node in the linked list.

Now assume we have 100 keys to store in that same array of length 3. They would still fit after hashing, but we'd end up with linked lists of roughly 33 nodes at indexes 0, 1, and 2. Walking such a long linked list makes retrieving, inserting, and deleting O(n), a linear-time operation, which is not very efficient.

To fix this, hash tables are smart enough to resize the array and eliminate a long linked list. There are various ways to implement resizing in hash tables but let's look at a simple implementation of resizing.

Let's say your array has a length of 10. As you insert key => value pairs and it fills up to around 70% or 80% (7 or 8 entries), your hash table is smart enough to know it has to resize itself. It might create a new array of double the size, re-hash all the previous keys, and fill them into the newly created array, leaving no deep linked lists and plenty of free space. The same thing can happen when deleting items: the table may shrink its array, but only enough that it still leaves no room for a giant linked list.
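The grow-on-insert side of that resizing can be sketched like this. The 70% threshold and all names are assumptions for illustration; real implementations tune these numbers carefully:

```python
class ResizingHashTable:
    """Doubles its bucket array when the load factor passes a threshold."""
    LOAD_FACTOR = 0.7  # resize once the table is ~70% full (assumed threshold)

    def __init__(self, capacity=4):
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0  # how many key => value pairs are stored

    def _index(self, key, capacity):
        # Toy ASCII-sum hash, wrapped to the given capacity.
        return sum(ord(ch) for ch in key) % capacity

    def insert(self, key, value):
        if self.count / len(self.buckets) >= self.LOAD_FACTOR:
            self._resize()
        self.buckets[self._index(key, len(self.buckets))].append((key, value))
        self.count += 1

    def _resize(self):
        # Double the array, then re-hash every existing key into it:
        # the modulo changed, so every index may change too.
        new_buckets = [[] for _ in range(len(self.buckets) * 2)]
        for bucket in self.buckets:
            for key, value in bucket:
                new_buckets[self._index(key, len(new_buckets))].append((key, value))
        self.buckets = new_buckets
```

The key point is the re-hash in `_resize`: because the index depends on the array length, every stored key must be hashed again into the bigger array.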