Titanic Survival Dataset Part 2/2: Logistic Regression

Predicting whether a passenger survived.

--

Welcome back! In my previous post I wrote an EDA (Exploratory Data Analysis) of the Titanic Survival dataset. Check it out if you haven’t already.

Anyway, in this article I would like to focus on how to create a machine learning model that can predict whether a Titanic passenger survived based on their attributes, i.e. gender, title, age and many more.

Before going any further, I also want you to know that the project I do here is inspired by this article: https://towardsdatascience.com/kaggle-titanic-machine-learning-model-top-7-fa4523b7c40. I implement several feature engineering techniques explained in that article, with some modifications for the sake of simplicity. Now let’s do this :)

Note: full code available at the end of this article.

As always, the very first thing I do is importing all required modules and loading the dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
df = pd.read_csv('train.csv')
The first 5 rows of the Titanic dataset.

Feature engineering 1: SibSp & Parch

Now let’s start the feature engineering with the SibSp and Parch columns. According to the dataset details (which you can access from this link), the two columns represent the number of siblings/spouses and the number of parents/children aboard the Titanic respectively. The idea here is to create a new column called FamilySize, whose value is taken from the two columns I mentioned earlier, plus one for the passenger themselves. This is based on the assumption that a larger family may have a greater chance of survival, as its members can stay together better than those who travel alone. Below is the code to do that.

df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df.head()

And here’s what our new data frame df looks like. We can see that the FamilySize column appears as expected.

Adding FamilySize column.

Feature engineering 2: Embarked

According to the EDA explained in my previous article, there are 2 missing values in the Embarked column. Since that’s not a significant number, we are just going to drop those rows:

df = df.dropna(subset=['Embarked'])

The subset argument indicates that the code will drop rows with NaN values in the Embarked column only.
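If you want to double-check, df['Embarked'].isnull().sum() should now return 0, and df.shape should report two fewer rows than before.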

Next, if we take the unique values of this column, we will find that there are 3 possible values, namely C, Q and S (which stand for Cherbourg, Queenstown and Southampton). Here I decided to convert this column’s values into a one-hot representation, since the scikit-learn estimators we are going to use require numerical input. To do that, we can use the get_dummies() function that comes with the Pandas module.

embarked_one_hot = pd.get_dummies(df['Embarked'], prefix='Embarked')
df = pd.concat([df, embarked_one_hot], axis=1)
df.head()

The first line of the code above stores the one-hot-encoded values in the embarked_one_hot variable. Then, that variable is concatenated with our original data frame df.


One-hot representation of Embarked column is concatenated to the original data frame.

Feature engineering 3: Cabin

In my previous post (the EDA of this Titanic dataset), I found that the Cabin column contains plenty of missing values. Thus, I decided to fill them with U, which stands for “Unknown”. This can simply be achieved using the fillna() method.

df['Cabin'] = df['Cabin'].fillna('U')

Next, I also found that the values of that column are a letter followed by several numbers (also explained in the previous post). What I want to do now is to extract those initial letters. My approach here is to use a lambda function like this:

df['Cabin'] = df['Cabin'].apply(lambda x: x[0])
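An equivalent, slightly shorter pandas idiom for this would be df['Cabin'].str[0], which also takes the first character of every value.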

Now all values of the Cabin column have been reduced to a single letter. The next step is to convert the values of this column into one-hot format. To do that, I will use the exact same method as we used for the Embarked column.

cabin_one_hot = pd.get_dummies(df['Cabin'], prefix='Cabin')
df = pd.concat([df, cabin_one_hot], axis=1)
df.columns
New columns have been added to our data frame df.

Feature engineering 4: Name

You might be thinking at first that we don’t even need to take the Name column into account, as it only holds the name of a person. Theoretically, a name will never affect a person’s chance of survival. And yes, I do agree with that. However, if we pay closer attention to its contents, we are going to find something interesting: the title.

We are going to take all these titles.

Those titles may be a good feature for deciding whether a person survived or not. Therefore, we are going to extract these titles using a get_title() function that we declare ourselves.

def get_title(x):
    return x.split(',')[1].split('.')[0].strip()
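For example, get_title('Braund, Mr. Owen Harris') returns 'Mr': the function takes the part after the comma, keeps everything before the dot, and strips the surrounding whitespace.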

Now that the function has been declared, we can just apply it to the Name column and store the result in a new column, Title.

df['Title'] = df['Name'].apply(get_title)

If you want, you can also check the unique values stored in the Title column using the df['Title'].unique() command. The output is going to look something like this:

array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
'Jonkheer'], dtype=object)

Similar to the Cabin column, we are going to convert the values of Title into a one-hot representation, because up to this stage its values are still categorical. Below is my approach to do so.

title_one_hot = pd.get_dummies(df['Title'], prefix='Title')
df = pd.concat([df, title_one_hot], axis=1)

Feature engineering 5: Sex

Well, I guess there’s not much to say here. We know that there are only two values in the Sex column, namely female and male, which is also categorical data. Therefore, we can simply use the pd.get_dummies() function again to convert the values of this column into one-hot format.

sex_one_hot = pd.get_dummies(df['Sex'], prefix='Sex')
df = pd.concat([df, sex_one_hot], axis=1)
We can clearly see here that two new sex columns have been successfully created.
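A quick note on this: Sex_female and Sex_male are perfectly redundant, since one is always 1 minus the other. If you want, passing drop_first=True to pd.get_dummies() keeps only one of them; I keep both here for simplicity, and the (regularized) logistic regression we use later handles the redundancy just fine.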

Feature engineering 6: Age

If you ask me, this Age feature engineering is the trickiest part, at least for me. According to my previous article on the EDA of this Titanic dataset, the ages of 177 out of 889 passengers are missing. Therefore, we need to fill these in with a number. However, in this case we will not just directly fill those NaNs with the median or mean of all existing ages. Instead, I want to group the passengers by their Title first, compute the median age of each title group, and then use these medians to fill the missing values. Here’s the first thing to do:

age_median = df.groupby('Title')['Age'].median()
age_median

After running the code above, we are going to obtain the median of each Title.

Title
Capt 70.0
Col 58.0
Don 40.0
Dr 46.5
Jonkheer 38.0
Lady 48.0
Major 48.5
Master 3.5
Miss 21.0
Mlle 24.0
Mme 24.0
Mr 30.0
Mrs 35.0
Ms 28.0
Rev 46.5
Sir 49.0
the Countess 33.0
Name: Age, dtype: float64

Next, we need to create a function fill_age() which accepts a single row as its parameter. The x parameter basically represents one row of our data frame.

def fill_age(x):
    for index, age in zip(age_median.index, age_median.values):
        if x['Title'] == index:
            return age

Now it’s time to apply this fill_age() function. However, we need to be careful, since we only want to replace the missing ages, not all the values in the Age column. Therefore, I define a lambda function inside the apply() method. What the lambda function does is apply fill_age() only when the corresponding age is missing; otherwise, if the age value already exists, we just keep it. Below is how to do it:

df['Age'] = df.apply(lambda x: fill_age(x) if np.isnan(x['Age']) else x['Age'], axis=1)
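As a side note, an equivalent and more concise way to do the same fill, assuming the same age_median series as above, is to map each Title to its median age and use the result to fill only the missing values:

df['Age'] = df['Age'].fillna(df['Title'].map(age_median))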

Now if we try to run df.isnull().sum(), we will see that our data frame df no longer contains missing values. But remember that some of our columns are still of categorical (object) type. We can check this by running df.dtypes.

PassengerId          int64
Survived             int64
Pclass               int64
Name                 object
Sex                  object
Age                  float64
SibSp                int64
Parch                int64
Ticket               object
Fare                 float64
Cabin                object
Embarked             object
FamilySize           int64
Embarked_C           uint8
Embarked_Q           uint8
Embarked_S           uint8
Cabin_A              uint8
Cabin_B              uint8
Cabin_C              uint8
Cabin_D              uint8
Cabin_E              uint8
Cabin_F              uint8
Cabin_G              uint8
Cabin_T              uint8
Cabin_U              uint8
Title                object
Title_Capt           uint8
Title_Col            uint8
Title_Don            uint8
Title_Dr             uint8
Title_Jonkheer       uint8
Title_Lady           uint8
Title_Major          uint8
Title_Master         uint8
Title_Miss           uint8
Title_Mlle           uint8
Title_Mme            uint8
Title_Mr             uint8
Title_Mrs            uint8
Title_Ms             uint8
Title_Rev            uint8
Title_Sir            uint8
Title_the Countess   uint8
Sex_female           uint8
Sex_male             uint8
dtype: object

Hence, we need to drop all the columns that still contain categorical data, along with PassengerId (which is just an identifier with no predictive value), using the drop() method.

df = df.drop(['PassengerId', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Title'], axis=1)

Now, the very last step in the feature engineering part is to normalize all values. In this project I decided to use min-max (linear) scaling for simplicity.

df = (df-df.min())/(df.max()-df.min())
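Note that this one-liner also scales the Survived column, but since its values are already 0 and 1, they stay unchanged. If you prefer to scale only the features, an equivalent sketch using scikit-learn's MinMaxScaler (on the same df) would be:

from sklearn.preprocessing import MinMaxScaler

feature_cols = df.columns.drop('Survived')  # every column except the label
df[feature_cols] = MinMaxScaler().fit_transform(df[feature_cols])

Strictly speaking, a scaler like this should be fit on the training split only to avoid leaking test-set statistics, but I follow the simpler approach here to keep the article short.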

That’s basically all of the feature engineering part. Now we’re getting to the main part: model training!

Machine learning: logistic regression

But wait! Before training the model, we are going to define the X and y variables for this problem. Since the purpose of this project is to find out whether a passenger survived, we can simply set the values in the Survived column to be the ground truth (a.k.a. the label, or y). Meanwhile, all other columns are going to be our features (X). Below is my approach to do that.

y = df['Survived'].values
X = df.iloc[:,1:].values
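Note that X = df.iloc[:,1:] works because, after the drops above, Survived is the first remaining column. A slightly more explicit alternative that does not depend on column order would be X = df.drop('Survived', axis=1).values.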

Again, there’s another thing we need to do: split the data into train and test sets, which can simply be done using the train_test_split() function from the scikit-learn module. In this case, I decided to use 20% of the data as the test set.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21, test_size=0.2)
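If you want, you can also pass stratify=y to train_test_split() so that both splits keep roughly the same proportion of survivors. I leave it out here for simplicity, but it is a common refinement.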

Now that we have both train and test data, we can start to define a logistic regression model. The reason why I choose this classifier is that we are dealing with a categorical target (either survived or not). Linear regression is obviously not going to work in this case since it can only predict continuous values. But why not the others, like decision tree, random forest, SVM, or others? Simply because I found that the final accuracy of those algorithms is just worse than what I obtain using logistic regression.

Well, I won’t explain the math behind the logistic regression algorithm itself, since I am not sure I can do it well here. But for those who want to learn more about it in detail, I recommend reading this article.

Anyway, I am going to jump directly to the code. Now what we need to do is to initialize a LogisticRegression() object, which I put in clf variable.

clf = LogisticRegression()

As the classifier has been initialized, we can start to train the model on our X_train and y_train pair. This can simply be done using the fit() method. The process should not take long since our dataset is relatively small.

clf.fit(X_train, y_train)
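As a side note, here is a minimal sketch of what the trained model actually computes: for each passenger it estimates the probability of survival by passing a weighted sum of the features through the sigmoid function, and predict() simply thresholds that probability at 0.5.

train_probs = clf.predict_proba(X_train)[:, 1]  # estimated P(Survived = 1)
manual_preds = (train_probs > 0.5).astype(int)  # essentially what clf.predict(X_train) returns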

Now that the clf model has been trained, we can try to print out the accuracy scores like this:

print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))
Accuracy score on train and test data.

And then I found that the model gets an accuracy of about 84% on both train and test data. According to this result, we can say that this logistic regression classifier is not overfitting, even though the accuracy itself could probably still be improved with some other techniques.

Model evaluation

In this model evaluation part, we are going to see more clearly what the prediction distribution looks like. Here I would like to display 2 confusion matrices: the first one displays the predictions on the train data and the second one shows the predictions on the test data.

Let’s start by creating the first one. To do that, we need to predict on our train data itself and store the predictions in the train_preds variable.

train_preds = clf.predict(X_train)

Next, I can simply use the confusion_matrix() function to construct a confusion matrix. Remember that the first argument should be the actual labels, followed by the predictions as the second argument.

cm = confusion_matrix(y_train, train_preds)

Now that the cm array has been created, we can display it using the heatmap() function from the Seaborn module.

plt.figure(figsize=(6,6))
plt.title('Confusion matrix on train data')
sns.heatmap(cm, annot=True, fmt='d', cmap=plt.cm.Greens, cbar=False)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
Confusion matrix on train data.

What we actually see in the figure above is how the data is predicted. For example, here we have 62 passengers who survived but are predicted as not survived. We also find 51 passengers who did not survive yet are predicted as survived.

By doing the same thing, we can also display the confusion matrix constructed from the predictions on the test data (except that here I replace plt.cm.Greens with plt.cm.Reds).
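For completeness, here is what that looks like (the exact same steps as before, just applied to the test set):

test_preds = clf.predict(X_test)
cm_test = confusion_matrix(y_test, test_preds)
plt.figure(figsize=(6,6))
plt.title('Confusion matrix on test data')
sns.heatmap(cm_test, annot=True, fmt='d', cmap=plt.cm.Reds, cbar=False)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()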

Confusion matrix on test data.

And that’s all! I’m pretty sure the 84% accuracy that I obtain cannot be considered the best possible result. So I hope you are able to find a technique which can improve the model accuracy, for example by applying more advanced feature engineering or using other machine learning algorithms.
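If you want a quick starting point for such experiments, a simple sketch could be to compare a few classifiers with 5-fold cross-validation on the same X and y (the model choices and parameters below are just examples):

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Mean cross-validated accuracy for each candidate model.
for name, model in [('LogisticRegression', LogisticRegression()),
                    ('RandomForest', RandomForestClassifier(random_state=21)),
                    ('SVC', SVC())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())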

Thanks for reading! Feel free to leave a comment if you find any mistake in this article!

By the way, here’s the code :)

References

Kaggle Titanic: Machine Learning model (Top 7%) by Sanjay. M. https://towardsdatascience.com/kaggle-titanic-machine-learning-model-top-7-fa4523b7c40

Logistic Regression — Detailed Overview by Saishruthi Swaminathan https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc

