When it comes to learning about data science, there are countless factors that come into play. As data scientists (or developers), it’s vital to have a clear understanding of skills ranging from cleaning data to mathematical algorithms. However, one concept that tends to be overlooked is causation vs correlation.
How many times have you seen “studies” that claim to have discovered that some type of behavior is the cause of some result? Take going to the gym as an example. You may hear about research that says that people who go to the gym perform better at work than individuals who do not work out.
This seems like a rational argument and many people take this research to mean that:
"If I go to the gym, then I will perform better at work."
However, this is faulty logic. The mistake with this type of thinking is that it assumes that one action causes another action to occur. In our example, yes, going to the gym has a number of positive benefits, such as improved concentration, sharper memory, faster learning, perseverance, and lowered stress, just to name a few. However, there are a number of ways to attain these benefits that aren’t connected with going to the gym at all.
This is a classic case of causation vs correlation. In this case we’re making the mistake of assuming that going to the gym causes improved performance at work, when in reality going to the gym is simply correlated with improved work performance.
When working through the causation vs correlation issue, I think that it helps to break down the concepts to their basic definitions.
Causation can be defined as the act of making something happen. For example, if I properly study a new topic in machine learning, I will learn a new machine learning concept. The act of properly studying caused me to learn the new concept.
Correlation, on the other hand, can be defined as a mutual relationship between items. Going back to our gym example, we can say that there is a correlation between going to the gym and performing well at work. The key to understanding how correlation works is that it is centered around the association between two concepts.
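To make the idea of an association concrete, here is a minimal Python sketch that computes the Pearson correlation coefficient, one common way to measure how strongly two variables move together. The `pearson` helper and the gym and performance numbers are made up purely for illustration; they are not from any real study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: weekly gym visits and an invented performance score
gym_visits = [0, 1, 2, 3, 4, 5]
performance = [52, 55, 61, 60, 68, 70]

r = pearson(gym_visits, performance)
print(f"correlation: {r:.2f}")
```

A value close to 1 means the two lists tend to rise together, but the number by itself says nothing about why they do.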
In breaking down how correlation works I like to think of it visually.
In looking at this chart you can see that I’ve broken down the characteristics of individuals who go to the gym regularly. And I’ve placed these next to the characteristics of high performers.
As a side note, this list is by no means comprehensive. I simply picked out a few common traits that I’ve seen in my experience.
If you look at the chart you’ll see that the characteristics for both types of individuals are the same. However, this does not mean that going to the gym causes this type of behavior. It simply illustrates that there is a correlation, or a mutual relationship, between the two.
From a data science perspective, when you see correlations like this, it should pique your interest, but you shouldn’t jump to the conclusion that one type of behavior causes another.
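One way to see why a correlation alone shouldn’t settle the question is to simulate a hidden factor. In the sketch below, a hypothetical `discipline` trait drives both gym attendance and work performance, while the gym itself has no direct effect on performance by construction. All names, coefficients, and the sample size are assumptions chosen for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the simulation is repeatable

people = []
for _ in range(5000):
    discipline = random.random()  # hidden trait between 0 and 1
    # More disciplined people are more likely to go to the gym
    goes_to_gym = 1 if random.random() < discipline else 0
    # Performance depends on discipline only; the gym has NO direct effect here
    performance = 50 + 40 * discipline + random.gauss(0, 5)
    people.append((discipline, goes_to_gym, performance))

# Correlation between gym attendance and performance across everyone
overall = pearson([p[1] for p in people], [p[2] for p in people])

# Same correlation, but only among people with nearly the same discipline
band = [p for p in people if 0.45 <= p[0] < 0.55]
within = pearson([p[1] for p in band], [p[2] for p in band])

print(f"overall correlation:     {overall:.2f}")
print(f"within-band correlation: {within:.2f}")
```

Across the whole population, gym attendance and performance should come out clearly correlated, yet once the hidden trait is held roughly constant the relationship should mostly disappear. The gym never caused anything in this simulation; both variables simply share a cause.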
So what is the right approach? The key is to not let correlations bias you when building your machine learning algorithm.
Now imagine that you’ve been asked to build a machine learning program. The goal of the algorithm is to discover how individuals can improve their performance at work.
Instead of looking at correlations such as:
Executives who go to the gym regularly are more prone to succeed at work.
Focus on the actual causes of high performance. We’ve already listed a number of them, such as improved concentration, sharper memory, faster learning, perseverance, and lowered stress.
Now that you’ve focused on the actual causes of high performance at work, your algorithm can take a different perspective. Instead of looking at a single association, such as going to the gym, it can look at ALL of the various ways that individuals can improve in these key causal areas.
In taking this approach your machine learning algorithm won’t be limited by bias. Instead it will take a more comprehensive perspective on how individuals can perform at work.
For example, it may show that in addition to going to the gym, individuals could alternatively:
Do you see how this approach doesn’t care about the correlations? Instead it focuses on the root causes of the data and THEN it comes up with a list of ways to improve that aren’t limited by tunnel vision.
On a final note, please don’t misunderstand the intention of this guide. There are a large number of key benefits to going to the gym regularly. I personally work out daily and I encourage the students that I teach to do the same.
The reason I chose going to the gym as a case study is that it’s a topic that I struggle with. I can get into the mindset that gym attendance is required for me to do well at work. However, the truth is that there are a number of ways that job performance can be positively affected that don’t involve the gym at all!
Mark Twain, Charles Dickens, and Einstein all took long walks each day. I know incredibly talented developers who practice determination while playing video games for hours each day. The fact of the matter is that everyone has a different path for achieving peak performance. And it’s important, as data scientists, that we don’t take the naive approach of thinking that correlation is the same as causation.
I've been a software engineer for the past decade and have traveled the world building applications and training individuals on a wide variety of topics.