How to Develop a Big Data Analysis Application in Ruby
In this lesson, we are going to work on a big data classification algorithm based on the decision tree algorithm that you can use in a real-world scenario.

Let's say you're working for a company and want to find your ideal market. You also want to know who is buying your product or service. To answer these questions, you can leverage the large amount of data you have collected on past customers.

As in the previous lesson, let's require the rubygems and decisiontree libraries and set up some attributes. In this example, our attributes will be demographic data such as age, education, income, and marital status. We also have some training data that the decision tree will learn from.

require 'rubygems'
require 'decisiontree'

attributes = ['Age', 'Education', 'Income', 'Marital Status']
training = [
  ['36-55', 'Masters', 'High', 'Single', 1],
  ['18-35', 'High School', 'Low', 'Single', 0],
  ['36-55', 'Masters', 'High', 'Single', 1],
  ['18-35', 'PhD', 'Low', 'Married', 1],
  ['< 18', 'High School', 'Low', 'Single', 1],
  ['55+', 'High School', 'High', 'Married', 0],
  ['55+', 'High School', 'High', 'Married', 1],
  ['55+', 'High School', 'High', 'Married', 1],
  ['55+', 'High School', 'High', 'Married', 1],
  ['< 18', 'Masters', 'Low', 'Single', 0],
]

In the training data, the last value in each row denotes whether that person is a customer: 1 means customer and 0 means not a customer. The above data is just an example; in the real world you'll have thousands of such data points. In general, the more representative data you have, the better your decision tree will be. If you have millions of data points, you can drill down with more precision to get the information you want.
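Before training, it can help to check how the labels are distributed. The following is a minimal sketch in plain Ruby (no gem required) that counts customers and non-customers in the sample rows above and computes the share of customers for each value of an attribute. The `label_counts` and `customer_rate_by` helpers are hypothetical names introduced for illustration.

```ruby
# Sample rows from the lesson: four attributes followed by the label
# (1 = customer, 0 = not a customer).
attributes = ['Age', 'Education', 'Income', 'Marital Status']
training = [
  ['36-55', 'Masters', 'High', 'Single', 1],
  ['18-35', 'High School', 'Low', 'Single', 0],
  ['36-55', 'Masters', 'High', 'Single', 1],
  ['18-35', 'PhD', 'Low', 'Married', 1],
  ['< 18', 'High School', 'Low', 'Single', 1],
  ['55+', 'High School', 'High', 'Married', 0],
  ['55+', 'High School', 'High', 'Married', 1],
  ['55+', 'High School', 'High', 'Married', 1],
  ['55+', 'High School', 'High', 'Married', 1],
  ['< 18', 'Masters', 'Low', 'Single', 0],
]

# Hypothetical helper: how many rows carry each label.
def label_counts(rows)
  rows.map(&:last).tally
end

# Hypothetical helper: share of customers for each value of one attribute.
def customer_rate_by(rows, column_index)
  rows.group_by { |row| row[column_index] }
      .transform_values { |group| group.sum(&:last).fdiv(group.size) }
end

puts "Label counts: #{label_counts(training)}"
attributes.each_with_index do |name, index|
  puts "#{name}: #{customer_rate_by(training, index)}"
end
```

A heavily skewed label count, or an attribute value that appears in only one or two rows, is a hint that the tree's predictions for that segment rest on very little evidence.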

Typically, you can get all this type of data through the company's CRM software.
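As a rough illustration, the sketch below turns raw CRM-style records into the row format used above. The field names (`:age`, `:education`, `:income`, `:marital_status`, `:purchased`), the two sample records, and the `age_bracket` and `to_training_row` helpers are all assumptions made for this example; a real CRM export will have its own schema.

```ruby
# Hypothetical helper: map a numeric age onto the brackets used in the lesson.
# The lesson's labels '36-55' and '55+' overlap at 55; this sketch assumes
# that 55 belongs to '36-55'.
def age_bracket(age)
  if age < 18
    '< 18'
  elsif age <= 35
    '18-35'
  elsif age <= 55
    '36-55'
  else
    '55+'
  end
end

# Hypothetical helper: convert one CRM record (a hash) into a training row,
# with the customer label (1 or 0) as the last element.
def to_training_row(record)
  [
    age_bracket(record[:age]),
    record[:education],
    record[:income],
    record[:marital_status],
    record[:purchased] ? 1 : 0
  ]
end

# Made-up records standing in for a CRM export.
crm_records = [
  { age: 42, education: 'Masters', income: 'High',
    marital_status: 'Single', purchased: true },
  { age: 23, education: 'High School', income: 'Low',
    marital_status: 'Single', purchased: false }
]

training_rows = crm_records.map { |record| to_training_row(record) }
training_rows.each { |row| puts row.inspect }
```

Bucketing numeric fields such as age into a small set of labels keeps the attributes categorical, which matches the way the sample data is written.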

Next, let's instantiate our decision tree.

dec_tree = DecisionTree::ID3Tree.new(attributes, training, 1, :discrete)
dec_tree.train

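In this code, we pass in the attributes and the training data. The third argument, 1, is the default value, which the tree falls back on when it cannot reach a decision for a record, and :discrete tells the tree to treat attribute values as categories rather than numbers. We also have to train the decision tree by calling the train method.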
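To give a feel for what training does, ID3 chooses the attribute to split on by comparing information gain, the reduction in label entropy after grouping rows by an attribute. The following sketch computes that quantity in plain Ruby. It illustrates the general idea only and is not the decisiontree gem's internal code; the `entropy` and `information_gain` helpers are hypothetical names.

```ruby
# Shannon entropy (in bits) of a list of labels.
def entropy(labels)
  total = labels.size.to_f
  labels.tally.values.sum(0.0) do |count|
    share = count / total
    share * Math.log2(1 / share)
  end
end

# Information gain of splitting the rows on one attribute column:
# entropy of all labels minus the size-weighted entropy of each group.
def information_gain(rows, column_index)
  labels = rows.map(&:last)
  groups = rows.group_by { |row| row[column_index] }.values
  remainder = groups.sum(0.0) do |group|
    group.size.fdiv(rows.size) * entropy(group.map(&:last))
  end
  entropy(labels) - remainder
end

attributes = ['Age', 'Education', 'Income', 'Marital Status']
training = [
  ['36-55', 'Masters', 'High', 'Single', 1],
  ['18-35', 'High School', 'Low', 'Single', 0],
  ['36-55', 'Masters', 'High', 'Single', 1],
  ['18-35', 'PhD', 'Low', 'Married', 1],
  ['< 18', 'High School', 'Low', 'Single', 1],
  ['55+', 'High School', 'High', 'Married', 0],
  ['55+', 'High School', 'High', 'Married', 1],
  ['55+', 'High School', 'High', 'Married', 1],
  ['55+', 'High School', 'High', 'Married', 1],
  ['< 18', 'Masters', 'Low', 'Single', 0],
]

# Print the gain for each attribute; a higher value means the attribute
# separates customers from non-customers more cleanly.
attributes.each_with_index do |name, index|
  puts format('%-15s %.3f', name, information_gain(training, index))
end
```

The attribute with the highest gain is the natural candidate for the first split, and the same comparison is repeated on each resulting subset of rows.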

Now we can test some values to see how our system works:

test = ['< 18', 'High School', 'Low', 'Single']
decision = dec_tree.predict(test)
puts "Predicted : #{decision}"

The output is:

Predicted : 1

The tree predicts that this person is a customer. The attributes we tested match one of the records in the training data, which is why the decision tree was able to make this prediction.

Next, let's change the income level of test data like this:

test = ['< 18', 'High School', 'High', 'Single']

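If you run the code again, the decision tree still says this person should be a customer. That agrees with the closest training record, an under-18, high-school-educated, single customer who differs only in income.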

Let's change the test data a little bit again.

test = ['18-35', 'High School', 'Low', 'Married']

The output says this person should not be a customer. If you're a company, then you may not want to spend a lot of money on marketing to this person because they do not seem to be interested in what you're offering (at least on a historical basis).

This is how you implement a big data analysis module in Ruby to allow you to make some advanced decisions.