Guide to the Hash Data Structure

No course on data structures would be complete without discussing the topic of hashes. The reason for this is because hashes are used constantly in the software development world.

Hash in Ruby Development

Before we dive into hashes, I want to give a warning. If you are coming from a language, such as Ruby, you may believe that you're already familiar with hashes because of the Hash data type. Language such as Ruby have a key/value pair data structure called hashes.

However you will quickly discover that computer science based hashes are a bit more complex than the Ruby data type. I'll give you a quick example of the key difference. In Ruby we can create a Hash that contains a list of baseball players and their respective positions like this:

players = { "Altuve": "2nd Base", "Bregman": "3rd Base" }

This part of Ruby's implementation of a hash works properly. However what happens when we attempt to add an item to a key?

players[:Altuve] = "DH"

If you run this code and look at the values of the players hash you'll see that it contains the following value:

{:Altuve=>"DH", :Bregman=>"3rd Base"}

By default, if you present a Ruby based hash with another value for a key that already exists it will simply replace the old value with the new one. This works fine for Ruby purposes, however computer science Hashes require more advanced behavior. And that's what we're going to cover in this guide.

Hash Table

Before we can walk through the behavior of a Hash it's important to understand its structure. My favorite real world analogy for a Hash is a numbered shoe cubby.

large

As you can see in this image we have a shoe cubby and each of the slots has a unique number. If we start populating the cubby with shoes we don't have to keep track of the shoe names, we can simply reference their ID based on the cubby number.

large

In this example if we call 13 it will return the blue running shoes.

large

Pretty straightforward, right? At a high level, a Hash operates with a key/value based system. This is a powerful tool for computer scientists since it allows for rapid lookups of data. But as you may have noticed, so far this is pretty much identical to how an Array works. So what's the difference between a Hash and an Array? There are a number of key differences and we'll start by looking at the most important one first.

Hash Function

When it comes to adding items to an array, the algorithm is pretty dead simple. You simply add an item to the end of the array or to a specific index value. However, adding items to a Hash are quite a bit different.

The whole point of using a hash is to organize data in a structured format that allows for near instant search speeds. The only way to accomplish this type of behavior is to create what's called a hash function that will decide what key should be used for a specific data set. This same hash function is what we'll use for finding elements in the collection when performing a search.

Folding Method

One of the most popular (and straightforward) hash functions is the folding method. The goal of the folding method is to pick out which slot an element should be added to. You can think of this as being the strategy you'd use for deciding on which cubby you'd place a pair of shoes into. This is a critical component for working with Hashes because it will be the logic that decides how elements are organized and also how you find items.

Let's take a practical look at how we can create a hash function using the folding method approach. I'll use the Ruby language for this implementation and we'll build a hash function that organizes the data for social security numbers.

Folding Method Code Implementation

def hash_function ssn, slots
  arr = ssn.scan(/.{1,2}/)
  total = arr.map(&:to_i).inject(&:+)
  p total % slots
end

If this code looks a little foreign, don't worry I'm going to walk through each line. Believe it or not, this is actually a pretty straightforward algorithm (especially compared with other hash functions).

This method starts off by taking a social security number and the total number of slots. You can think of the number of slots as the total number of shoe cubbies that are available.

Next our algorithm uses the scan method, which breaks the social security number into smaller chunks and converts it to an array. So if we have a social security number like 234782231 this process will convert it to look like this:

["23", "47", "82", "23", "1"]

The argument we're passing to the scan method is /.{1,2}/. This is a regular expression that will return the values in pairs (that's where the 2 comes in), except for the last element, which will be by itself (which is where the 1 comes in).

Moving onto the next line, our hash function utilizes the map method to convert each of the array elements to integers. And from there it uses the inject method to sum up the values. If you perform these actions on our array: ["23", "47", "82", "23", "1"] it will result in a total of 176.

Now that we have our total we're almost done! The final line of the hash function is total % slots. This line uses the modulus operator to get the remainder of the total divided by the total number of slots. Extending our example, let's imagine that we have 20 available slots that we want our hash function to work with. The final line would process the equation 176 % 20, which results in a remainder of 16. And this 16 value is the slot that we'll slide the social security number 234782231 into.

Phew! If that process is a bit fuzzy I highly recommend working through the code example so that it can solidify in your mind.

If we run the code with these social security numbers and 3 available slots:

ssns = ["234782231", "578454843", "453413441", "895t87434"]

ssns.each do |ssn|
  hash_function ssn, 3
end

You'll see that we get the values of:

So to review, this means that our hash function is saying:

Put "234782231" in slot 2
Put "578454843" in slot 0
Put "453413441" in slot 2
Put "895t87434" in slot 0

Searching Through Hashes

In order to search through a hash you'd simply pass the social security number into the hash function again and it will return the location that the value was placed in. Pretty cool, right? This process is extremely fast since it results in constant time, which pretty much means instantly.

Collisions

But there is trouble in paradise. Did you notice how our social security numbers got placed in duplicate slots? In a perfect world our hash function would sort items into unique slots. However we're not in a perfect world and you'll discover quickly that there will be many times where values get sorted into the same slots. When this occurs in computer science, it's called a collision.

The result of a collision is that when we run the hash function to find a value, our algorithm may find multiple elements in a single slot.

Collision Resolution

But have no fear, the great computer scientists that came before us planned on collisions occurring and there are a number of ways to resolve collisions.

Open Addressing

The first way to resolve a collision is to supply a fallback plan for the hash function. A straightforward approach is called open addressing. Open addressing places a validation in the hash function so that it confirms that it's not assigning a key to a slot that's already taken.

@taken_keys = []

def hash_function ssn, slots
  arr = ssn.scan(/.{1,2}/)
  total = arr.map(&:to_i).inject(&:+)
  key = total % slots

  if @taken_keys.include? key
    hash_function ssn, (slots + 1)
  else
    p key
    @taken_keys << total % slots
  end
end

ssns = ["234782231", "578454843", "453413441", "895t87434"]

ssns.each do |ssn|
  hash_function ssn, 3
end

In this code I've created a variable called @taken_keys that will store an array of keys. Each time the hash function is called it will add an item to the list of keys that are taken. And as you can see, our conditional in the method checks to see if a key has been taken and if it has it calls the hash function recursively and increments the number of slots. If you run this code you'll see that it fixes our collisions:

This solution has a few drawbacks:

You will have a dynamically changing number of slots, which may not work for all algorithm requirements if the data collection is too large for the system memory.
As you may have noticed we can't simply call the hash function to perform a search, which pretty much kills the point of using the Hash data structure. So we would have to create a dedicated find method that would take in the parameters and perform the search functionality.

This was a very naive implementation of open addressing, however it should give you an idea of the high level concept. Open addressing's goal is to place validations inside of hash functions and ensure that new elements are added to open slots.

Chaining

When it comes to resolving collisions in hashes I prefer to utilize the chaining method. Chaining is the process of allowing collisions to exist in a hash structure, and to chain the elements together.

Going back to our shoe cubby illustration. If we allow collisions to occur it would be like allowing multiple pairs of shoes to be placed inside the same cubby, like this:

large

However it's not good enough to simply toss the shoes into the cubby. If we zoom into the cubby we'll see that the shoes are connected. So when we grab the shoes we'll grab the full set of shoes and look at each one until we find the pair we're searching for.

large

In hashes this behavior is typically accomplished by utilizing the linked list data structure. So in this case we can keep our original hash function algorithm. And when we encounter a slot that has multiple elements we can simply iterate through the collection until we find the value we're searching for. This slows down the algorithm a bit, however it's still exponentially faster than searching through an Array in most cases.

Summary

In summary, the Hash data structure is a powerful tool for organizing and searching through data. As with any data structure, the main key to success is to ensure that Hash behavior is a good fit for your program's requirements.

Key Terms for Hashes

Hash - a key/value based data structure.
Hash function - a function that takes in a value and generates an index inside of a Hash structure.
Folding method - a popular hash function that breaks a value into smaller chunks and utilizes the modulus operator to generate the remainder of the sum of the chunks divided by the total number of available slots.
Collisions - the process that occurs when a hash function maps two values to the same slot.
Open addressing - a collision resolution process where a hash function ensures that elements are not mapped to previously taken slots.
Chaining - the process of allowing collisions to occur and grouping elements that have collided into a linked list.