Titanic Survival Dataset Part 1/2: Exploratory Data Analysis

Finding out important (and probably interesting) information in the dataset.

--

Hi everyone, Ardi here! In this article I wanna do Exploratory Data Analysis (EDA) on the Titanic dataset. So far, I’ve been working on several projects, most of which are related to classification on unstructured data (i.e. image classification). Today, instead of doing a similar project, I wanna try working with structured data, which I think is more closely related to the field of data science in general. Here I decided to use the Titanic dataset. The main goal of working with this data is to predict whether a passenger survived based on the attributes they have. The dataset itself can be downloaded here. It should not take long as it only consists of a few tiny csv files.

Now, after the download finishes, we can start writing some code. As usual, I will begin with some imports. By the way, I use a combination of Matplotlib and Seaborn simply because I’m already familiar with Matplotlib’s API while I like Seaborn’s figure styles better.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

Next, we will load and display the training data. In this EDA I decided not to take the test set into account because it does not contain the survival status of the passengers.

df = pd.read_csv('train.csv')
df.head()
The first 5 rows of passenger data.

Data shape, Data types and NaN values

Now that the data has been loaded, I wanna find out the size of this data frame using df.shape. The result indicates that our train.csv contains 891 rows (each representing a passenger) and 12 columns (the attributes of each passenger).
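
The command itself is just a one-liner:

df.shape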

(891, 12)

The data types of each column can also be shown by accessing the dtypes attribute of df (just by running df.dtypes).

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

We can see here that there are int64, float64 and object columns. The first two simply mean integers and floats respectively, while object is essentially just a string. In the feature engineering chapter we are going to convert all these strings into numbers, as machine learning algorithms basically only work with numerical data.
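
Just to give a rough idea of what that conversion might look like, here is a minimal sketch using a manual mapping and pd.factorize(). This is not necessarily the exact approach I will take in the feature engineering chapter:

# Quick sketch of turning string columns into numbers
# (the actual feature engineering chapter may handle this differently)
sex_encoded = df['Sex'].map({'male': 0, 'female': 1})              # manual mapping
embarked_encoded, embarked_labels = pd.factorize(df['Embarked'])   # automatic integer codes (NaN becomes -1)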

Next, I wanna check whether our data frame df contains NaN (Not a Number) values, which can be done like this:

df.isnull().sum()

The code above displays the following output, where we can see the number of missing values in each column. Those missing values will certainly cause problems, and we will fix them in the next chapter.

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
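
As a quick preview of how these could be handled (just a sketch of common options, not necessarily what I will do in the next chapter): Age can be filled with its median, Embarked with its most frequent value, and Cabin, being mostly empty, might simply be dropped or reduced to its initial letter as we do later in this article.

# Sketch of common ways to handle the missing values (the next chapter may differ)
df_filled = df.copy()
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].median())                  # fill ages with the median
df_filled['Embarked'] = df_filled['Embarked'].fillna(df_filled['Embarked'].mode()[0])  # most frequent port
df_filled = df_filled.drop(columns=['Cabin'])                                           # 687 of 891 values are missing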

Number of survived vs not survived passengers

Before I go any further, I wanna show you the details of this Titanic dataset.

Dataset details. Source https://www.kaggle.com/c/titanic/data.

The table above shows that the values of the Survived column are either 0 or 1, where 0 means the passenger did not survive while 1 means they did. Now, in order to find out how many passengers fall into each group, we are going to employ the groupby() method like this:

survived_count = df.groupby('Survived')['Survived'].count()
survived_count

Here’s how to read it: “Group the data frame by the values in the Survived column, and count the number of occurrences in each group.”

In this case, since Survived only has 2 possible values (either 0 or 1), the code above produces two groups. If we print out the survived_count variable, it will produce the following output:

Survived
0    549
1    342
Name: Survived, dtype: int64
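
As a side note, value_counts() gives an equivalent result in a single call, in case you prefer it over groupby() here:

# Equivalent one-liner to the groupby/count approach above
df['Survived'].value_counts()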

Based on the output above, we can see that there are 549 people who did not survive and 342 who did. To make things look better, I wanna display these numbers in the form of a graph. Here I will use the bar() function from the Matplotlib module. The function is pretty easy to understand: the two arguments that we need to pass are just the index and the corresponding values.

plt.figure(figsize=(4,5))
plt.bar(survived_count.index, survived_count.values)
plt.title('Grouped by survival')
plt.xticks([0,1],['Not survived', 'Survived'])
for i, value in enumerate(survived_count.values):
    plt.text(i, value-70, str(value), fontsize=12, color='white',
             horizontalalignment='center', verticalalignment='center')
plt.show()

And here is the output:

Number of survived and not survived passengers.

Now I will also do a similar thing in order to find out the number of survivors based on their gender. Notice that here I use sum() instead of count() because we are only interested in counting the passengers who survived, which are represented by the number 1. So it’s kinda like adding up the 1s in each group.

survived_sex = df.groupby('Sex')['Survived'].sum()

plt.figure(figsize=(4,5))
plt.bar(survived_sex.index, survived_sex.values)
plt.title('Survived female and male')
for i, value in enumerate(survived_sex.values):
    plt.text(i, value-20, str(value), fontsize=12, color='white',
             horizontalalignment='center', verticalalignment='center')
plt.show()
Number of survived females and males.

Well, I think the graph above is pretty straightforward to understand :)
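
One related trick: because Survived only contains 0s and 1s, taking mean() instead of sum() directly gives the survival rate per gender, which can be more informative than the raw counts. A quick sketch:

# Survival rate (fraction of 1s) per gender
survival_rate_sex = df.groupby('Sex')['Survived'].mean()
survival_rate_sex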

Ticket class, gender and embarkation distribution

Next, I wanna find out the distribution of ticket classes, an attribute stored in the Pclass column. The way to do it is pretty similar to what I did earlier.

pclass_count = df.groupby('Pclass')['Pclass'].count()

There are now 3 values stored in the pclass_count variable, each of which represents the number of tickets in that class. However, instead of a bar graph, here I prefer to display it in the form of a pie chart using the pie() function.

plt.figure(figsize=(7,7))
plt.title('Grouped by pclass')
plt.pie(pclass_count.values, labels=['Class 1', 'Class 2', 'Class 3'],
        autopct='%1.1f%%', textprops={'fontsize':13})
plt.show()
Ticket class distribution shown in percent.

Furthermore, we can also display the gender and embarkation distributions as pie charts using the exact same method.
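
The code is essentially the Pclass pie chart above with a different groupby column; a sketch like the following should reproduce the two distributions captioned below, although the exact styling may differ a bit from the figures:

# Same pie-chart recipe as before, grouped by Sex and by Embarked
for column in ['Sex', 'Embarked']:
    counts = df.groupby(column)[column].count()
    plt.figure(figsize=(7,7))
    plt.title('Grouped by ' + column.lower())
    plt.pie(counts.values, labels=counts.index,
            autopct='%1.1f%%', textprops={'fontsize':13})
    plt.show()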

Gender distribution shown in percent.
Embarkation distribution shown in percent.

Age distribution

Another thing that I wanna find out is the age distribution. Before I go further, remember that our Age column contains 177 missing values out of 891 rows in total. Therefore, we need to get rid of those NaNs first. Here’s my approach:

ages = df[df['Age'].notnull()]['Age'].values

What I am actually doing in the code above is just retrieving all non-NaN age values and storing the result in the ages Numpy array. Next, I will use the histogram() function from the Numpy module. Notice that here I pass two arguments to the function: the ages array and a list of bins.

ages_hist = np.histogram(ages, bins=[0,10,20,30,40,50,60,70,80,90])
ages_hist

After running the code above, we should get the following output:

(array([ 62, 102, 220, 167,  89,  48,  19,   6,   1], dtype=int64),
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90]))

It’s important to know that the output of the np.histogram() function above is a tuple with 2 elements, where the first one holds the number of data points in each bin while the second one contains the bin edges. To make things clearer in the figure, I will also define labels in ages_hist_labels.

ages_hist_labels = ['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90']

And finally we can show the histogram like this:

plt.figure(figsize=(7,7))
plt.title('Age distribution')
plt.bar(ages_hist_labels, ages_hist[0])
plt.xlabel('Age')
plt.ylabel('No of passenger')
for i, bin in zip(ages_hist[0], range(9)):
    plt.text(bin, i+3, str(int(i)), fontsize=12,
             horizontalalignment='center', verticalalignment='center')
plt.show()
Age distribution.
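
For what it’s worth, Matplotlib can also do the binning and the drawing in a single call with plt.hist(); this is just an alternative to the np.histogram() + plt.bar() combination above and produces an equivalent figure, only without the count labels:

# Alternative: let plt.hist() do the binning and drawing in one call
plt.figure(figsize=(7,7))
plt.title('Age distribution')
plt.hist(ages, bins=[0,10,20,30,40,50,60,70,80,90], edgecolor='white')
plt.xlabel('Age')
plt.ylabel('No of passenger')
plt.show()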

Cabin distribution

If we pay attention to our Cabin column, we can see that all non-NaN values always start with a capital letter which is then followed by some numbers. This can be checked using the df['Cabin'].unique()[:10] command. Here I only return the first 10 unique values for simplicity.
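
In code, that check looks like this:

df['Cabin'].unique()[:10]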

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
'C23 C25 C27', 'B78'], dtype=object)

I got a feeling that these initial letters might contain something important, so I decided to keep them and leave out the numbers. In order to do that, we need to create a function called take_initial().

def take_initial(x):
    return x[0]

The function above is pretty straightforward. The argument x essentially represents the string in each row, of which we return only the initial character. Before applying the function to all rows in the Cabin column, we need to drop all NaN values first and store the result in a cabins object like this:

cabins = df['Cabin'].dropna()

Now that the null values have been removed, we can apply the take_initial() function and directly update the contents of cabins:

cabins = cabins.apply(take_initial)
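
By the way, the dropna-plus-first-letter step can also be written in one line with pandas’ string accessor; this is just an equivalent alternative to take_initial(), not what the rest of the article uses:

# Equivalent one-liner using the .str accessor instead of apply()
cabins_alt = df['Cabin'].dropna().str[0]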

Next we will use the value_counts() method to find out the number of occurrences of each letter. I will also directly store the result in the cabins_count object.

cabins_count = cabins.value_counts()
cabins_count

After running the code above we are going to see the following output.

C    59
B    47
D    33
E    32
A    15
F    13
G     4
T     1
Name: Cabin, dtype: int64

Finally, to make things look better, I will use plt.bar() again to display it in the form of a bar chart.

plt.title('Cabin distribution')
plt.bar(cabins_count.index, cabins_count.values)
plt.show()
Cabin distribution.

Fare distribution

The Fare attribute might also play an important role in predicting whether a passenger survived. Unlike the previous figures, instead of using a bar or pie chart here, I will create a boxplot. Fortunately, it’s extremely simple to do, as it can be drawn just by using the plt.boxplot() function.

plt.figure(figsize=(13,1))
plt.title('Fare distribution')
plt.boxplot(df['Fare'], vert=False)
plt.show()
Fare distribution.

Here we see that the distribution is skewed to the right (a.k.a. positively skewed), since the longer tail is located on the right side of the box, while most of the data points are concentrated in a range of roughly 8 to 31 currency units (the interquartile range shown below). Additionally, outliers in the sample are represented by the circles. We can also see the details of the fare distribution using the df['Fare'].describe() command, whose output is going to look something like this:

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64
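
If you want to quantify that right skew instead of eyeballing the boxplot, pandas can compute the skewness directly, and a log transform is a common way to tame such a distribution. This is just a quick sketch of options, not something applied elsewhere in this article:

# Quantify the right skew and preview a log-transformed fare
print(df['Fare'].skew())           # clearly positive, confirming the right skew
fare_log = np.log1p(df['Fare'])    # log(1 + fare), safe for the zero fares
fare_log.describe()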

That’s pretty much it for the EDA of the Titanic dataset. In the next chapter I am going to do some feature engineering on this data frame. See you there!

By the way, here’s the code and the link to the next chapter :)
