Naive Bayes Python Implementation and Understanding


Naive Bayes is a Machine Learning classifier based on the Bayes Theorem of conditional probability. In this article, we will first understand conditional probability (the Bayes Theorem) and then see how it translates into the Naive Bayes classifier. We will work through the mathematics behind the classifier and finally code it in Python.

Bayes Theorem


Named after the statistician Thomas Bayes, this theorem is also known as the theorem of conditional probability. It allows us to calculate the probability of a particular event GIVEN a set of prior conditions. For example, the probability that it will rain tomorrow GIVEN that it rained yesterday.

The formula for calculating conditional probability is shown below.
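
P(A | B) = P(A ∩ B) / P(B)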

The term on the left-hand side is read as ‘the probability of event A occurring given that event B has occurred’. The term on the right-hand side is the probability of both events occurring together divided by the probability of event B occurring. The formula is quite straightforward. I will not be delving into its derivation or intuition, as this article is not about the Bayes Theorem but rather the Naive Bayes classifier.


The Naive Bayes Classifier

Just like any other classifier, Naive Bayes classifies data into discrete labels: we have a set of input features and a corresponding output class for each data point. A Naive Bayes classifier calculates the probability of each class using the following formula.
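
P(y_1 | x_1, x_2, x_3) = [ P(x_1 | y_1) · P(x_2 | y_1) · P(x_3 | y_1) · P(y_1) ] / [ P(x_1) · P(x_2) · P(x_3) ]

The “naive” independence assumption is what lets us write both the numerator and the denominator as simple products over the individual input features.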

The left-hand side is the probability that the output is y_1 given that our inputs were {x_1, x_2, x_3}. Now suppose our problem has a total of two classes, i.e. {y_1, y_2}. We use the above formula twice: first to calculate the probability of y_1 and then the probability of y_2. Whichever has the higher probability becomes our predicted class.

This is how Naive Bayes is used for classification.


Naive Bayes in Python

Let's start coding it in Python.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

These are the libraries we need for working with the data. If you do not know your way around Pandas, you might want to read up on it first.

Now let’s load our dataset. I have used the Heart Disease prediction dataset, which can be found on Kaggle.

data = pd.read_csv('heart-disease-data/heart.csv') #Read the dataset
data.head()
Head of the dataset

For a Naive Bayes classifier, we need discrete variables, since our probability estimates come from counting occurrences and that does not work for continuous variables. So we need to drop the continuous columns here, such as cholesterol (chol) and trestbps.

data.drop(["age", "trestbps", "chol", "thalach", "oldpeak", "slope"],axis = 1 ,inplace=True) #drop irrelevant columnsdata.head()
dropped columns
X = data[data.keys()[:-1]]
y = data[data.keys()[-1]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42) #test-train split
data_train = pd.concat([X_train, y_train], axis=1) #concat back
data_test = pd.concat([X_test, y_test], axis=1)

Now we need to code the helper functions that will calculate all the necessary probabilities.

According to the formula, we will need the probability of occurrence of every input feature, the probability of each output class, and the conditional probability of each input given each class label.

First, we will calculate the probability for each input variable.

#Calculating probabilities for each input column independently
def get_probabilities_for_inputs(n, column_name, data_frame):

    temp = data_frame[column_name] #isolate targeted column
    temp = temp.value_counts() #get counts of occurrences of each value of the input variable

    return temp / n #return probability of occurrence by dividing by the total no. of data points

Next, we will calculate conditional probabilities for the input given an output class.

#Calculating conditional probabilities P(input | output) for one input column
def get_conditional_probabilities(data_frame, n, target, given):

    focused_data = data_frame[[target, given]] #isolate target column and focused input column
    targets_unique = data_frame[target].unique() #list of unique outputs in the data

    #count every (input value, output value) pair and convert the counts to joint probabilities
    groups = focused_data.groupby(by=[given, target]).size().reset_index()
    groups[0] = groups[0] / n

    #divide each joint probability by the class probability to get P(input | output)
    for target_value in targets_unique:
        current_target_probability = len(focused_data[focused_data[target] == target_value]) / n
        groups[0] = np.where(groups[target] == target_value, groups[0].div(current_target_probability), groups[0])

    return groups

Next, we will write down our ‘fit’ function that will calculate and return all the necessary probabilities which we will then use for making classifications.

def calculate_probabilities(data):
    #splitting the data into inputs and output
    x = data[data.keys()[:-1]]
    y = data[data.keys()[-1]]
    target = y.name

    #get length of dataframe
    n = len(data)

    #get probabilities for each individual input and for the output
    f_in = lambda lst: get_probabilities_for_inputs(n, lst, x)
    input_probabilities = list(map(f_in, x.keys()))

    output_probabilities = get_probabilities_for_inputs(n, target, y.to_frame())

    #get conditional probabilities for every input against every output
    f1 = lambda lst: get_conditional_probabilities(data, n, target, lst)
    conditional_probabilities = list(map(f1, data.keys()[:-1]))

    return input_probabilities, output_probabilities, conditional_probabilities

Now that the helper functions are in place, we need a function that gives us the output class label by applying the Naive Bayes formula we wrote above.

def naive_bayes_calculator(target_values, input_values, in_prob, out_prob, cond_prob):

    target_values.sort() #sort the target values to ensure ascending order
    classes = [] #initialise empty probabilities list

    for target_value in target_values:
        num = 1 #initialise numerator
        den = 1 #initialise denominator

        #calculate the denominator according to the formula
        for i, x in enumerate(input_values):
            den *= in_prob[i][x]

        #calculate the numerator according to the formula
        for i, x_1 in enumerate(input_values):
            temp_df = cond_prob[i]
            num *= temp_df[(temp_df.iloc[:,0] == x_1) & (temp_df.iloc[:,1] == target_value)][0].values[0]
        num *= out_prob[target_value]

        final_probability = num / den #final conditional probability value
        classes.append(final_probability) #append probability for the current class to the list

    return (classes.index(max(classes)), classes)

Now that we have all our functions out of the way, we can run them and store the results.

in_prob, out_prob, cond_prob = calculate_probabilities(data_train)#use training data for the initial calculations

The three variables now hold all the necessary probabilities: the probabilities of all the inputs, the probabilities of the output classes, and the conditional probabilities.
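
If you want to check what was computed, you can simply print the stored tables, for example:

print(out_prob) #class probabilities P(y)
print(in_prob[0]) #P(x) table for the first input column
print(cond_prob[0]) #P(x | y) table for the first input column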

Let’s test the calculations that we have made.

#testing with dummy data
naive_bayes_calculator([1,0], [1,1,0,2,1,3,3],in_prob,out_prob,cond_prob)
Outputs

We have our class prediction and the probabilities for each class inside a tuple.

Now it's time to test on our ‘test data’.

The following function takes a set of inputs and returns the predicted class against each in a list.

def naive_bayes_predictor(test_data, outputs, in_prob, out_prob, cond_prob):

    final_predictions = [] #initialise empty list to store test predictions

    for row in test_data:
        #get prediction for current data point
        predicted_class, probabilities = naive_bayes_calculator(outputs, row, in_prob, out_prob, cond_prob)
        #append to list
        final_predictions.append(predicted_class)

    return final_predictions

Now calculate accuracy.

test_data_as_list = X_test.values.tolist()
unique_targets = y_test.unique().tolist()
predicted_y = naive_bayes_predictor(test_data_as_list, unique_targets, in_prob, out_prob, cond_prob)
print("Accuracy:", (np.count_nonzero(y_test == predicted_y) / len(y_test)) * 100)
Our test accuracy

An accuracy of 77.4% is certainly not a bad number, considering that we dropped several potentially informative columns and that the naive assumption ignores correlations between the input variables.
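
If you want a quick sanity check against a library implementation, scikit-learn ships a Naive Bayes variant for discrete features, CategoricalNB. It applies Laplace smoothing by default, so its exact numbers may differ slightly from ours; a minimal sketch using the same train/test split:

from sklearn.naive_bayes import CategoricalNB

clf = CategoricalNB() #categorical Naive Bayes with default Laplace smoothing
clf.fit(X_train, y_train) #fit on the same training split
print("sklearn accuracy:", clf.score(X_test, y_test) * 100) #accuracy on the same test split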

Conclusion

Naive Bayes is a very simple classifier, provided that you understand basic probability and the concept of inputs and outputs in Machine Learning. The algorithm does have certain shortcomings, such as ignoring dependencies between the input variables. It is very simple to build and gives good results if your data meets its requirements.

