Machine Learning Series Day 7 (Decision Tree Classifier)

I promise it’s not just another “ML Article.”

--

Terminology:

Concept:

Intuitively, you can think of a Decision Tree Classifier as being similar to driving to a location for the first time without a GPS (including the one on your phone). You’ll quickly realize that there are a lot of different roads/exits you can take. You don’t know the route, so you ask locals for directions (hard to believe this is how people used to travel). As you ask around, you narrow down the possible roads/exits you can take.

At a nearby gas station, you ask an employee for help. He suggests taking the 496 expressway.

You now know which route to take. But your friend asked another local and got a different suggestion: take the 312 expressway.

Which expressway should you take?

Well, you don’t know how to quantify the opinions of the locals, so you cannot weight anyone’s opinion more heavily than another’s. A voting system should help: the higher the number of votes, the more likely a decision is correct (I know, not always). As you gather more opinions, the vote counts give you confidence in one choice (in this case, an expressway) over the other.

However, an equal distribution of votes does not help you.

Why?

Using the example from above, if four people think you should take the 496 expressway while the other four think you should take the 312 expressway, then that “decision rule” is of no use. An even split does not build confidence in either expressway.

However, the following scenario should be more desirable:

  • Seven people were in favor of the 496 expressway.
  • Two people were in support of the 312 expressway.
  • You should be pretty confident in the 496 expressway.

Details:

  • Supervised/Unsupervised: Supervised
  • Regression/Classification: Classification

Decision Trees capture the thought process explained above. Moreover, every decision split is similar to a tree branch (see the image below).

Visual:

A decision tree starts at the top. We keep splitting until our branches can effectively classify the outcome (should you take expressway 312 or 496?). Keep in mind that you do not want to generate questions that are too granular either. If you do, you’ll end up with categories containing only a few observations, and performance on other datasets won’t be as accurate.

An analogy with sports can illustrate my point. You’ve probably heard commentators blurt out a statistic like the one below:

  • “So and so has the highest batting average for first basemen during the weekends against right-hand pitchers who are under 25.”

If your segmentation is too granular, it will not help you evaluate new data points, because the segmentation (like the one in the baseball analogy) does not generalize to more comprehensive datasets.

But how do we create the “splits”?

Excellent question. Two methods frequently used are Information Gain / Entropy and Gini Impurity.

Mathematics/Statistics:

Method 1 (Information Gain / Entropy) — ID3 Algorithm:

The algorithm is considered a “greedy algorithm” because it optimizes for the best split at each step rather than planning ahead. Hence, it settles for local optima without guaranteeing the global optimum.

Entropy: A measurement of uncertainty.

Information Gain: The difference in entropy before and after a split. The goal is to find the split with the largest information gain and thus the lowest remaining entropy.

Objective/Meaning: The ID3 algorithm aims to maximize the information gain. Meaning, we would like to ‘know’ more about a potential outcome after a decision than before the decision. Think about it. It is optimal to seek features that can distinguish between the possible outcomes.
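Written out in a standard form, the information gain of a candidate split compares the parent node’s entropy H with the entropy of its child nodes, each weighted by its share of the samples (n_k samples out of n fall into child k):

$$ IG = H(\text{parent}) - \sum_{k} \frac{n_k}{n} \, H(\text{child}_k) $$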

Method 2 (Gini Impurity):

Gini Impurity: The probability of a random sample being classified incorrectly if we randomly pick a label according to the distribution in a branch (the higher this probability, the worse our model).

Objective/Meaning: The aim is to minimize the Gini Impurity. Intuitively, you are building splits whose branches are each dominated by a single class, so that a new sample landing in a branch can be predicted correctly with high probability.

Deep Dive: Mathematics/Statistics

Method 1: Entropy & Information Gain

Entropy is the measure of uncertainty.

Entropy in Decision Trees is a concept borrowed from entropy in communication (information theory). I believe a robust comprehension of entropy will help us feel comfortable with Decision Trees.

Let me paint a scenario. Your old-fashioned relative has invited you to a bridal shower with limited seating, so he needs a hand-written reply ASAP. If FedEx charges you by the number of characters, you should write a concise letter that conveys your decision (whether or not you’ll be attending the bridal shower).

Some people might write, “I will be going to the bridal shower.”

Others might write, “Yes.”

Intuitively, you understand the latter is a more efficient communication statement. You’re communicating the response without any fluff.

The goal is to calculate the minimum number of units (bits) required for our communication. We don’t want to send unnecessary information.

Claude Shannon, one of the greatest minds of the 20th century, was fascinated by the issue explained above: how can information be transferred in the most efficient way? He was interested in a method for quantifying how much uncertainty a message resolves.
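His answer is the entropy formula. In its standard form, for a set of possible outcomes with probabilities p_i, it reads:

$$ H = -\sum_{i} p_i \log_2 p_i $$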

Why does the formula above work? I recommend you read the article, as it shows how the entropy formula was derived.

A simplified summary is the following: Shannon worked with bits, a numerical representation of yes and no as 0s and 1s. In effect, his objective was to determine the minimum number of bits required to send a message. Entropy combines bits with the likelihood of each outcome to reduce the cost of communication.

The lower the entropy, the fewer bits you need to convey the message. If you are using more bits than necessary to convey a message, then each bit carries less information.

Visual Explanation

While Shannon used entropy for communication, machine learning has incorporated entropy to evaluate efficient splits in our Decision Trees.

Consider the graph of entropy against the probability of an outcome. The lowest entropy is 0, at the tail ends where the probability of an outcome is either 0 or 1. Entropy is greatest when the probability of an outcome is 0.5.

This makes sense. If our data is split in half, then predicting an outcome is difficult. However, if the data is homogeneous, then the probability of a data point belonging to a given class will be either 0 or 1, and thus the entropy will be 0 (and the split gives us the most information).
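For a single binary outcome with probability p, the curve described above is:

$$ H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p), \qquad H(0.5) = 1, \quad H(0) = H(1) = 0 $$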

It’s beautiful when mathematics can be applied in unforeseen ways. Created as a way to count bits, entropy has also been influential in producing useful ML models: Decision Trees, and the ensembles built on them such as Random Forests and XGBoost.

Shannon understood that bits were useful because they offer a binary representation of choices. In communication, this helped researchers see that the most uncertain messages carry the most entropy.

Entropy (uncertainty) moves inversely with information: the more uncertainty you remove, the more information you gain. I’m assuming you are a well-educated individual, so you would likely agree that learning is most efficient when it comes from ideas that aren’t common knowledge.

Problem Solving Explanation

You bought a new car. For some odd reason, you want to keep it a secret. Your father wants to guess the vehicle you purchased, and thus, you give him four potential cars.

Potential Cars: car_a, car_b, car_c, and car_d.

Implementing a Decision Tree with the ID3 algorithm is trivial when dealing with two potential outcomes (“Did you buy car_a?”). If the answer is no, then you can be sure the person bought car_b. Notice that for two possible outcomes, we only need to ask one question. However, as the number of outcomes increases, the number of required questions increases.

We could be smart when asking questions. Instead of asking about each outcome one at a time (i.e., “Is it car_a?”), we could narrow the remaining possibilities based on previous questions/answers (for example, “Is it car_a or car_b?” rules out half the options with a single question).

The intuition is easy to see. Interestingly, a little math simplifies the problem using the equation below.

The benefit of using the logarithm with a base of 2 is that it mathematically describes binary choices.

  • Decisions are binary; hence, the 2 in the log formula.

If all possibilities have equal probabilities, and we have four potential outcomes, the formula can be written as:
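The number of binary questions needed for four equally likely outcomes (a standard identity, written out here):

$$ \log_2 4 = 2 \ \text{bits} $$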

  • The equation above states that to distinguish among four potential outcomes, we need at least two binary choices. It assumes that the possible outcomes are equally likely.

Generalizing The Model

To generalize the model to N possible outcomes, we use the following steps (combined into a single formula after the list):

  • Let P represent 1/N (the probability of each outcome when all N outcomes are equally likely).
  • Because the base-2 logarithm of a fraction is negative, we negate each term so the result is positive.
  • We weight each outcome’s term by its probability (represented by P).
  • We sum these terms across all outcomes to compute the expected number of bits.
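Putting these steps together gives the entropy formula in its standard form; with equal probabilities (P = 1/N for every outcome), the sum reduces to log2 N:

$$ H = -\sum_{i=1}^{N} P_i \log_2 P_i \;=\; -N \cdot \frac{1}{N} \log_2 \frac{1}{N} \;=\; \log_2 N $$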

Real-World Example:

We are responsible for predicting if a person is a sports fan or not. Based on our data, 11 voted “Sports Fan,” and nine voted “Not a Sports Fan.”

One discrete characteristic was a boolean response for “Was the person born in a big city?”

When we segment our data by “Was the person born in a big city?”, we learned that 13 voted “True” while 7 people voted “False.”

Within each segment, we then counted how many people were sports fans. We found the following:

  • If “True”: 10 were “Sports Fan” & 3 were “Not a Sports Fan.”
  • If “False”: 3 were “Sports Fan” & 4 were “Not a Sports Fan.”

Entropy Process:

STEP 1 — Find the entropy of the parent node:

STEP 2 — Find the entropy of the child node:

  • We’ll use “Was the person born in a big city?” to understand how well it can predict if a person will be a sports fan or not.

Formula (Born in a Big City = True):

Formula (Born in a Big City = False):

STEP 3 — Calculate the child node entropy with the weighted average:

STEP 4 — Information Gain:

To calculate Information Gain, we compute the entropy of the Parent Node minus the weighted entropy of the Child Nodes.

  • 0.94 − 0.88 = 0.06

And you’re done! Of course, the algorithm would compute the entropy for every possible segmentation, and whichever segmentation provides the largest Information Gain becomes the split that is taken. The computation is sketched in code below.
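As a minimal sketch in Python of the steps above, using the branch counts from the example (the function and variable names are my own, and the decimals you get may round slightly differently from the figures quoted):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Split on "Was the person born in a big city?"
true_branch = [10, 3]     # Sports Fan / Not a Sports Fan when True
false_branch = [3, 4]     # Sports Fan / Not a Sports Fan when False
parent = [10 + 3, 3 + 4]  # class counts pooled across both branches

n = sum(parent)
weighted_children = (
    sum(true_branch) / n * entropy(true_branch)
    + sum(false_branch) / n * entropy(false_branch)
)
information_gain = entropy(parent) - weighted_children
print(entropy(parent), weighted_children, information_gain)
```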

Method 2 (Gini Impurity/Gini Index & Gini Gain):
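In its standard form, the Gini impurity of a node can be written as:

$$ G = \sum_{i=1}^{k} p_i \,(1 - p_i) \;=\; 1 - \sum_{i=1}^{k} p_i^{2} $$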

Here, “k” (the letter above the capital sigma) is the number of class labels, and “p” represents the relative frequency of each class.

You’re about to purchase a new car from one of 3 car dealers. You do not want to buy a “lemon” for a vehicle. Your research has brought some comfort. However, you can never be 100% sure you will make the correct decision. Based on your research, you believe car_dealer_A has an 80% probability of providing a quality car, while car_dealer_B and car_dealer_C each have a 10% chance.

  • The Gini impurity measures how often your prediction about the purchase would be wrong if you guessed according to those probabilities.
  • Using the car example, would you prefer that all three car dealerships had a ⅓ probability of providing a quality car, or the distribution described above (8/10, 1/10, 1/10)?
  • You should choose the latter, because you can confidently predict that you will get a quality car if you go with car_dealer_A. However, you won’t be able to make a confident prediction (regardless of the features) if all the dealers have an equal probability.

If we were only dealing with two classes of the dependent variable, the formula and intuition simplify as follows:

  • Gini = 2 · PA · (1 − PA), where PA is the probability of class A.
  • If the split between both categories is even, the “impurity” is at its highest (an ineffective split).
  • The lower the Gini Impurity, the better the split, because the probabilities heavily favor one class.
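Plugging in the extremes makes the bullet points above concrete:

$$ G(0.5) = 2 \cdot 0.5 \cdot 0.5 = 0.5 \ \ (\text{worst case}), \qquad G(0) = G(1) = 0 \ \ (\text{pure node}) $$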

In practice, we find the Gini Impurity for each possible split, driven by the class probabilities observed in the data. Think about it: if a branch heavily favors one outcome, we can confidently make a prediction, which shows up as a low Gini Impurity.

The goal is to aim for a low Gini Impurity. When the children’s Gini Impurity is the smallest, the Gini Gain is the highest! A high Gini Gain indicates a good split!

Real-Life Example:

  • Using the same example as the one provided for Entropy.

Process:

STEP 1 — Find the Gini Impurity of the parent node:

STEP 2 — Find the Gini Impurity of the child nodes:

  • We’ll use “Was the person born in a big city?” to further split whether a person was a sports fan or not.

Formula (Born in a Big City = True):

Formula (Born in a Big City = False):

STEP 3 — Calculate the child node Gini Impurity with the weighted average:

STEP 4 — Calculate Gini Gain:

To calculate Gini Gain, we compute the Gini Impurity of the Parent Node minus the weighted Gini Impurity of the Child Nodes.

  • 0.45 − 0.42 = 0.03
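The same calculation, sketched in Python with the counts from the example (again my own illustration, not code from the article):

```python
def gini(counts):
    """Gini impurity of a list of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Split on "Was the person born in a big city?"
true_branch = [10, 3]     # Sports Fan / Not a Sports Fan when True
false_branch = [3, 4]     # Sports Fan / Not a Sports Fan when False
parent = [10 + 3, 3 + 4]  # class counts pooled across both branches

n = sum(parent)
weighted_children = (
    sum(true_branch) / n * gini(true_branch)
    + sum(false_branch) / n * gini(false_branch)
)
gini_gain = gini(parent) - weighted_children
```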

We must continue this process for all possible splits. The split that maximizes the Gini Gain (i.e., minimizes the weighted child impurity) will be selected.

To conclude, the objective is for the Gini Impurity to be as close to 0 as possible. The closer to 0, the more lopsided the class distribution within each node, and the more confident you can be in the separation. A low child Gini Impurity then maximizes the Gini Gain!

After we make the first split, we look for further splits within each resulting subgroup, using only the data points that fall into that subgroup. We no longer use the entire data set.

Final Thoughts:

There are settings of a Decision Tree that can help improve performance. For example, we could limit the total number of branches (splits) the decision tree is allowed to make, which helps keep our model from overfitting. For more on this, I suggest you read this Medium article (code in Python).
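As a hedged, minimal sketch (not the code from the linked article), scikit-learn’s DecisionTreeClassifier exposes both ideas covered here: the criterion parameter switches between Gini impurity and entropy, and max_depth caps how far the tree branches to help avoid overfitting. The iris dataset is used purely as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Any labeled dataset works; iris is just a convenient built-in example.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion can be "gini" or "entropy"; max_depth limits the number of splits.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```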
