News Topic Classification using LSTM

How LSTM (Long Short-Term Memory) cells learn to categorize texts.

--

Hi everyone, Ardi here. Welcome back to my page! In my last post I explained a simple Exploratory Data Analysis (EDA) and survival prediction on the Titanic dataset. Check it out now if you haven’t already!

Instead of working with a structured dataset again, this time I decided to deal with unstructured data: text, i.e. NLP (Natural Language Processing). To be more precise, in this project I want to create a neural network architecture that is able to classify news topics based on their content. The main idea here is to employ LSTM cells, thanks to their ability to recognize patterns in sequential data (e.g. signals, texts, etc.).

The dataset that I use for this project is the 20 Newsgroups dataset, which is taken from here. If you open the webpage, there are several download links, and I decided to take the one highlighted in yellow.

Where I download the dataset. Source: http://qwone.com/~jason/20Newsgroups/.

It should not take long to download the file since the dataset is only around 14 MB. After extracting the compressed file, we should get the following folders:

The dataset is grouped by its labels.

We can see here that there are 20 different classes available in the dataset. However, in this project I only want to use 4 of them for the sake of simplicity (the folders that I highlighted in green).

In total, there are 3732 files in the 4 classes that we use: each of the first 3 (computer graphics, motorcycles and medical science) consists of approximately 970 samples, while the last one (politics) contains around 700 texts.
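If you want to double-check these numbers yourself, a quick way to do so (assuming the archive has been extracted into a 20news-18828 folder, the same path used later in this article) is to count the files in each of the four folders:

import os

# Count the samples per class by listing the files in each folder
for folder in ['comp.graphics', 'rec.motorcycles', 'sci.med', 'talk.politics.misc']:
    path = os.path.join('20news-18828', folder)
    print(folder, ':', len(os.listdir(path)), 'files')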


I share the final code of this project at the end of this article in Python format (.py). I recommend running it in a Jupyter Notebook so that you know exactly what is done at each line. Also, before we go any further, it’s important to ensure that the following modules are installed in your Python environment:

import os
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.layers.embeddings import Embedding

If no error appears after running the imports above, we should be ready to go!

By the way, I divided this article into several chapters:

  1. Text preprocessing
  2. Label preprocessing
  3. Model training
  4. Model evaluation

Text preprocessing

The next thing to do after importing all the modules is to load the dataset. I am going to create a function called read_files() to make things tidier. This function is pretty simple: the input is just a path to the text files, while the output is a list in which each element holds the content of one file.

# Don't forget to include a slash at the end of the path
def read_files(path):
    file_contents = list()
    filenames = os.listdir(path)

    for i in range(len(filenames)):
        # Note: depending on your platform, you may need to pass an encoding
        # (or errors='ignore') to open(), since some files may contain non-UTF-8 bytes
        with open(path+filenames[i]) as f:
            file_contents.append(f.read())

    return file_contents

Now we will directly use the function above to read all texts in our dataset like this:

class_0 = read_files('20news-18828/comp.graphics/')
class_1 = read_files('20news-18828/rec.motorcycles/')
class_2 = read_files('20news-18828/sci.med/')
class_3 = read_files('20news-18828/talk.politics.misc/')

Keep in mind that the variables class_0, class_1, class_2 and class_3 represent the computer graphics, motorcycles, medical science and politics labels respectively. I will also create a labels list which stores each of those class names.

labels = ['comp.graphics', 'rec.motorcycles', 'sci.med', 'talk.politics.misc']

If you want to check whether our dataset has been loaded successfully, we can print it out. Here I print the first article in class_0 by running print(class_0[0]). The output is going to look something like this:

The content of class_0[0] (the first computer graphics article).

Since all texts are still stored in different arrays (class_0, class_1, class_2, class_3), we are going to put them all into a single array to simplify the next steps. Here I use the np.append() function several times to store everything in the all_texts array.

all_texts = np.append(class_0, class_1)
all_texts = np.append(all_texts, class_2)
all_texts = np.append(all_texts, class_3)

Now if we run len(all_texts), the output will be 3732, which is the total number of files that we are going to work with in this project. However, it’s important to keep in mind that all these 3732 files are still messy. Thus we need to clean them up using a clean() function that I define manually. I also create a stop_words set which contains all English stop words, which I will use to eliminate the stop words in our dataset.

stop_words = set(stopwords.words('english'))

def clean(text):
    # Lowering letters
    text = text.lower()
    # Removing html tags
    text = re.sub('<[^>]*>', '', text)
    # Removing emails
    text = re.sub(r'\S*@\S*\s?', '', text)
    # Removing urls
    text = re.sub(r'https?://\S+', '', text)
    # Removing numbers and punctuation (keeping letters only)
    text = re.sub('[^a-zA-Z]', ' ', text)

    # Removing stop words
    word_tokens = word_tokenize(text)
    filtered_sentence = []
    for word_token in word_tokens:
        if word_token not in stop_words:
            filtered_sentence.append(word_token)

    # Joining words back into a single string
    text = ' '.join(filtered_sentence)
    return text

There are several steps done in the code above. First, the function accepts a raw text as its input. Next, all characters in this raw text are converted to lowercase. This step is necessary because otherwise the exact same word would be interpreted as two different words just because one occurrence is lowercase while the other is uppercase. HTML tags, email addresses, URLs and numbers are also removed, since those tokens are just not very informative. By the way, I use the re (regular expression) module for these removals.


Afterwards, the text is tokenized using the word_tokenize() function taken from the NLTK module. What’s essentially done in this tokenization step is that each word in the text is put into an array, while all spaces and escape characters (i.e. \t, \n, etc.) are dropped. Every token that appears in the stop_words set is then discarded, before the remaining words are joined back together and the cleaned text is returned.
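As a quick sanity check of the clean() function, you can feed it a toy sentence and verify that the email address, URL, number and stop words all disappear (the exact output may differ slightly depending on your NLTK stop word list):

# Toy example (not part of the dataset), just to see clean() in action
sample = 'Contact me at someone@example.com or visit https://example.com for 3D graphics!'
print(clean(sample))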

Now that the clean() function has been created, we are going to clean all texts and store the result directly in the all_cleaned_texts array.

all_cleaned_texts = np.array([clean(text) for text in all_texts])

You may try to run print(all_cleaned_texts[0]) if you want to see what the cleaned version of our data looks like.

This is the cleaned version of the text in the previous figure.

That’s not all of the text preprocessing though! We still need to encode all those cleaned texts into numbers, because machine learning/deep learning algorithms can only work with numerical data. In order to do that, we are going to employ the Tokenizer() object taken from the Keras module and use it to create a word-to-number mapping.

tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_cleaned_texts)

We can check what the mapping looks like by taking the word_index attribute of the tokenizer object like this:

tokenizer.word_index
Output of tokenizer.word_index

It essentially tells us that the word subject is encoded as 1, writes as 2, and so on. Also, if we print out the length of this word_index, the result is 35362, which is the number of unique words in our dataset. At this stage, we haven’t actually converted the texts into sequences of numbers yet. To do that, we still need to run the code below. By the way, I directly convert the result into a Numpy array to make future computation easier.

all_encoded_texts = tokenizer.texts_to_sequences(all_cleaned_texts)
all_encoded_texts = np.array(all_encoded_texts)
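If the tokenizer’s behaviour still feels a bit abstract, here is a tiny illustration on a made-up corpus (toy data, not part of the project) showing how fit_on_texts() builds the word-to-number mapping and how texts_to_sequences() applies it:

# Toy corpus to illustrate the Tokenizer behaviour
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(['the bike is fast', 'the bike is red'])

print(toy_tokenizer.word_index)
# something like: {'the': 1, 'bike': 2, 'is': 3, 'fast': 4, 'red': 5}

print(toy_tokenizer.texts_to_sequences(['the red bike']))
# [[1, 5, 2]]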

Now all the texts have been converted into sequences of numbers, which are stored in the all_encoded_texts array. Here’s how the encoded version of our first news article looks.

This is the encoded version of our first text.

But again, the text preprocessing step isn’t finished yet! Notice that the files all have different lengths. We can check this either by opening the files manually or with a simple piece of code like this:

for i in range(5):
    print('Length of file', i, ':', len(all_encoded_texts[i]))
Length of the first 5 files in our dataset.

In order for machine learning algorithms to work, we need to modify those files such that they all have the exact same length. My approach here is to employ the pad_sequences() function taken from the Keras module. In this project I decided to limit each data sample to 500 words: zero padding is added to the beginning of files that contain fewer than 500 words, while longer files are truncated from the beginning (the default behaviour of pad_sequences).

all_encoded_texts = sequence.pad_sequences(all_encoded_texts, maxlen=500)

Now if we try to print the shape of all_encoded_texts, we are going to obtain the following output:

(3732, 500)

It essentially tells us that we now have 3732 data samples, each of which is 500 words long. Finally, that’s all of the text preprocessing! This all_encoded_texts array is going to be our features (X data).
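In case the padding behaviour is still unclear, here is a tiny example (toy sequences, not part of our data) showing how pad_sequences() zero-pads short sequences at the front and, by default, truncates longer ones from the front as well:

# Toy sequences to illustrate pre-padding and pre-truncation
toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(sequence.pad_sequences(toy, maxlen=5))
# [[0 0 1 2 3]
#  [5 6 7 8 9]]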

Label preprocessing

We already got our X data in the previous stage. Now in this part, we need to define our ground truth, also known as labels. The idea here is to create another array which has the exact same length as our X data, but which stores the class of each sample. For the sake of simplicity, we are just going to encode the class names manually, where computer graphics, motorcycles, medical science and politics will be encoded as 0, 1, 2 and 3 respectively.

labels_0 = np.array([0] * len(class_0))
labels_1 = np.array([1] * len(class_1))
labels_2 = np.array([2] * len(class_2))
labels_3 = np.array([3] * len(class_3))

Also, we will concatenate all those separate label arrays into a single array using np.append(), exactly the same as what we did to concatenate the actual texts.

all_labels = np.append(labels_0, labels_1)
all_labels = np.append(all_labels, labels_2)
all_labels = np.append(all_labels, labels_3)

If we check the shape of this all_labels variable, we are going to obtain the value 3732, which is exactly the same as the number of X samples.

Furthermore, we still need to convert those labels into one-hot representation. This is done because that’s simply what a neural network expects in a multi-class classification task. For those who are not yet familiar with one-hot encoding, it essentially looks like this:

How one-hot representation looks like.

In order to convert our labels into such a form, we can simply use the OneHotEncoder() object. Notice in the code below that I use np.newaxis to create a new axis before fitting the one-hot encoder; this is done because OneHotEncoder() expects a 2D array.

all_labels = all_labels[:, np.newaxis]

one_hot_encoder = OneHotEncoder(sparse=False)
all_labels = one_hot_encoder.fit_transform(all_labels)

Now if you print all_labels, the result is gonna look something like the following figure.

Labels which are already converted into one-hot representation.
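As an optional sanity check, the encoder can also map the one-hot rows back to the original integer labels, something we will rely on again later when decoding the model’s predictions:

# Optional check: the first samples all belong to class 0 (comp.graphics),
# so the one-hot row should be [1, 0, 0, 0] and the recovered labels all 0
print(all_labels[0])
print(one_hot_encoder.inverse_transform(all_labels[:3]))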

That’s pretty much it for the label preprocessing stage! Keep in mind that the all_labels array is going to be our y data. We are going to use both X and y in the next stage, of course :)

Model training

There are several things to do in the model training stage. First, we need to split the data into train/test sets. This is pretty important for finding out whether our model is overfitting during the training process. Luckily, Scikit-Learn provides a train_test_split() function for this. In this case I decided to use 20% of the samples in the dataset as test data, while the rest is used for training. Additionally, it’s good to know that when we use train_test_split() we no longer need to shuffle the data ourselves, since the function does it automatically.

X_train, X_test, y_train, y_test = train_test_split(all_encoded_texts, all_labels, test_size=0.2, random_state=11)
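A quick check of the resulting shapes never hurts; with a 20% test split of 3732 samples we expect something like this:

# Roughly 80/20 split of the 3732 samples
print(X_train.shape, X_test.shape)   # (2985, 500) (747, 500)
print(y_train.shape, y_test.shape)   # (2985, 4) (747, 4)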

Now, let’s begin constructing the neural network by defining a Sequential() model, which is then followed by adding 3 layers.

model = Sequential()

model.add(Embedding(input_dim=35362, output_dim=32, input_length=500))
model.add(LSTM(100))
model.add(Dense(4, activation='sigmoid'))

The first layer that we put in the neural network model is an embedding layer. Its arguments are the vocabulary size (input_dim), the vector size (output_dim) and the input length, respectively. In this case, we take the vocabulary size from the length of tokenizer.word_index, which shows that we have 35362 unique words in our dictionary (strictly speaking, since Keras reserves index 0 for padding and the word indices start from 1, an input_dim of 35363 would be the safer choice). The vector size itself is free to choose; here I decided to represent each word in 32 dimensions, which is essentially the main purpose of using an embedding layer. The last argument is pretty straightforward: it’s the number of words in each text sample. Here’s an article if you want to read more about embedding layers.

The second layer consists of 100 LSTM cells. This type of neuron is commonly used to perform classification on sequential data, because an LSTM cell does not treat every data point (in this case, a word) as an uncorrelated sample. Instead, the inputs at previous time steps are also taken into account when updating the cell state and computing the next output value. Well, that’s LSTM in a nutshell. If you are interested in the underlying math, I suggest reading this article.

Lastly, we connect this LSTM layer to a fully-connected layer containing 4 neurons, each of which represents a single class.
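To make the data flow a bit more concrete, here is roughly how the tensor shapes change as a batch of texts passes through these three layers (you can confirm the details with model.summary()):

# Rough shape flow for a batch of texts:
#   input          : (batch_size, 500)       integer word indices
#   Embedding      : (batch_size, 500, 32)   each index mapped to a 32-d vector
#   LSTM(100)      : (batch_size, 100)       only the last hidden state is returned
#   Dense(4)       : (batch_size, 4)         one score per class
model.summary()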

Now, before the training process begins, we need to compile the model like this:

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In this case, I use the categorical cross-entropy loss function, whose value is going to be minimized using the Adam optimizer. This loss function is chosen because the classification task has more than 2 classes. As for the optimizer, I choose Adam since in many cases it simply works better than other optimizers. Below is what the summary of our model looks like:

Summary of our neural network model.

Now that the neural network has been compiled, we can start the training process. Notice that I store the learning history in the history variable.

history = model.fit(X_train, y_train, epochs=12, batch_size=64, validation_data=(X_test, y_test))

Here is how my training progress goes:

Train on 2985 samples, validate on 747 samples
Epoch 1/12
2985/2985 [==============================] - 75s 25ms/step - loss: 1.3544 - accuracy: 0.3652 - val_loss: 1.2647 - val_accuracy: 0.6466
.
.
.
Epoch 5/12
2985/2985 [==============================] - 48s 16ms/step - loss: 0.4278 - accuracy: 0.9196 - val_loss: 0.4589 - val_accuracy: 0.8768
.
.
.
Epoch 9/12
2985/2985 [==============================] - 49s 16ms/step - loss: 0.1058 - accuracy: 0.9759 - val_loss: 0.1859 - val_accuracy: 0.9438
.
.
.
Epoch 12/12
2985/2985 [==============================] - 49s 16ms/step - loss: 0.0253 - accuracy: 0.9966 - val_loss: 0.1499 - val_accuracy: 0.9625

For the sake of simplicity, I deleted several epochs from the log above. But don’t worry, you can still see the whole training history (both accuracy and loss values) using the following code.

plt.figure(figsize=(9,7))
plt.title('Accuracy score')
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['accuracy', 'val_accuracy'])
plt.show()
plt.figure(figsize=(9,7))
plt.title('Loss value')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['loss', 'val_loss'])
plt.show()
Accuracy score improvement.
Loss value decrease.

According to the 2 graphs above, we can see that the performance of our neural network classifier is pretty good. The model achieves 99.7% accuracy on the training data and 96.3% on the test data. In fact, I tried to increase the number of epochs to see whether the performance could still be improved, but the accuracy on both sets just fluctuated between roughly 90% and 97%. So I decided to restart the training and stop at epoch 12.
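By the way, if you prefer not to pick the stopping epoch by hand like I did, one option (not used in this project) is Keras’ EarlyStopping callback. A minimal sketch, assuming a reasonably recent Keras version, could look like this:

from keras.callbacks import EarlyStopping

# Stop once val_loss has not improved for 3 consecutive epochs and
# roll back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=3,
                           restore_best_weights=True)

history = model.fit(X_train, y_train, epochs=50, batch_size=64,
                    validation_data=(X_test, y_test),
                    callbacks=[early_stop])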

Model evaluation

Now we are going to go deeper into the model evaluation. To me, looking at the accuracy and loss graphs alone is not enough. I prefer to construct a confusion matrix and see which classes make the neural net classifier confused.

To do that, let’s begin by predicting our X_test data. The first line of the code below returns something like a probability value for each class. Keep in mind that it’s not strictly a probability, since our output layer uses a sigmoid activation function, not a softmax. But still, the idea is to take the highest value as the predicted class, which is done in the second line of the code below.

predictions = model.predict(X_test)
predictions = one_hot_encoder.inverse_transform(predictions)
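As a side note, inverse_transform here effectively picks the column with the highest score for each row, so an equivalent route would be to take the argmax of the raw scores directly:

# Equivalent way to get the predicted class indices (0..3, matching the
# order of the labels list) straight from the sigmoid outputs
raw_scores = model.predict(X_test)
predicted_classes = np.argmax(raw_scores, axis=1)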

The next thing to do is to convert our labels back from the one-hot format. My approach here is to use the np.argmax() function and store the result in the y_test_evaluate array.

y_test_evaluate = np.argmax(y_test, axis=1)

Now that the values of predictions and ground truth are comparable, we can directly use the confusion_matrix() function taken from the Sklearn module. Remember that the arguments of this confusion matrix are the actual values and the predicted values, respectively.

cm = confusion_matrix(y_test_evaluate, predictions)

Finally, we can draw the confusion matrix by passing the cm variable to the heatmap() function. We can see from the result below that, strangely, most of the errors are in the politics texts: 21 politics-related news articles are misclassified as medical science texts.

plt.figure(figsize=(8,8))
plt.title('Confusion matrix on test data')
sns.heatmap(cm, annot=True, fmt='d', xticklabels=labels, yticklabels=labels,
            cmap=plt.cm.Blues, cbar=False, annot_kws={'size':14})
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
Confusion matrix on test data.

Now, what if we get a new string and we want to find out which class this text belongs to? There are several steps required. First, let me create a string like this:

string = 'I just purchased a new motorcycle, I feel like it is a lot better than cars'

We know that this text is actually pretty clean already. However, notice that it still contains several stop words which may not be important. Hence, we will pass this text to the clean() function that we defined earlier in this article.

cleaned_string = clean(string)
How our string looks like after being cleaned.

The next thing to do is to encode this text into numbers and store the result in the encoded_string variable.

encoded_string = tokenizer.texts_to_sequences([cleaned_string])
How our string looks like after being encoded into numbers.

Now, remember that the encoded text above only consists of 8 words. We cannot directly feed this into our neural network, since it takes exactly 500 words to work. Thus, we are going to add zeros in front of this text so that there are 500 encoded values in total. We can simply do this using the pad_sequences() function with the maxlen argument set to 500.

encoded_string = sequence.pad_sequences(encoded_string, maxlen=500)

I won’t display a screenshot of encoded_string here because it takes up plenty of space. Just print it out yourself and you’ll see what it looks like.

Now that the string has exactly the same size as our trained model’s input, we can run the prediction and store the probability-like values in the string_predict variable. Lastly, we take the argmax of string_predict to find out the prediction made by our neural network model on our own data.

string_predict = model.predict(encoded_string)
np.argmax(string_predict)
The prediction of our neural network on new data.

We can see here that the prediction is right! The string that we used for this test is highly related to motorcycles, and our neural net predicts it correctly.
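If you want a human-readable class name instead of an index, you can simply look the index up in the labels list we defined at the beginning:

# Map the predicted index back to its class name
# (should print 'rec.motorcycles' for this particular string)
print(labels[np.argmax(string_predict)])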

That’s pretty much it for today’s article. Let me know your opinion regarding this project in the comments section below. Thanks for reading and see you!

Note: here’s the code used for this project:

References

Illustrated Guide to LSTM’s and GRU’s: A step by step explanation by Michael Phi https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

What is an Embedding Layer? by Georgios Drakos https://gdcoder.com/what-is-an-embedding-layer/#:~:text=The%20first%20argument%20(8)%20is,that%20we%20used%20for%20padding.


A machine learning, deep learning, computer vision, and NLP enthusiast. Doctoral student of Computer Science, Universitas Gadjah Mada, Indonesia.