1-Dimensional Convolution Layer for an NLP Task

Is this tweet real? Let’s find out using a CNN approach.

Is there anyone here who still thinks that CNNs can only be used to classify image data? I used to think so until I realized that the Conv1D layer exists in the Keras library. In the field of computer vision we commonly deal with image classification problems, which are usually solved using the Conv2D layer. Theoretically speaking, Conv2D works by applying kernels which stride along a 2-dimensional space: in the case of an image, the filters of this 2-dimensional convolution layer shift along both its height and width. Conv1D, on the other hand, only moves along a single axis, thus it completely makes sense to apply this kind of convolution layer to sequential data like text or signals.
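
To make that concrete, here’s a tiny standalone sketch (the layer and toy sizes are arbitrary; it only demonstrates how the output shape follows from sliding along one axis):

import numpy as np
from keras.models import Sequential
from keras.layers import Conv1D

# 1 sample, 10 time steps, 8 channels. A kernel of size 3 slides along the
# 10 steps only, so with the default 'valid' padding we get 10 - 3 + 1 = 8
# output positions, one per filter.
toy = Sequential([Conv1D(filters=4, kernel_size=3, input_shape=(10, 8))])
print(toy.predict(np.zeros((1, 10, 8))).shape)  # (1, 8, 4)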

To conduct the project, I decided to use this Kaggle competition dataset. The dataset itself consists of 2 classes which show whether a tweet describes a real disaster or not. Technically speaking, positive (real disaster) tweets are labeled 1 while negative (non-disaster) tweets are labeled 0. Such encoding is very important since, essentially, neural networks can only be trained on numerical data. Now, before actually starting to code, I wanna give the outline of this article.

  1. Exploratory data analysis (EDA)
  2. Feature engineering
  3. Constructing CNN
  4. Training & evaluation

As always, the full code is available at the end of this article.

EDA 1 — Creating a word cloud

Let’s start with importing all required libraries.

import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.models import Model
from keras.layers import Input, Dense, Conv1D, MaxPooling1D, Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.callbacks import EarlyStopping

Well, that’s a lot, lol. But believe me, we will make use of those imported functions later. And you know, it’s much easier to import these things than to implement them ourselves, for sure, so why not :)

Anyway, the first piece of EDA is actually not that important, yet it’s kinda fancy: creating a word cloud. The idea here is that I wanna display the most common words appearing in the entire tweet dataset, where the largest printed word is the one with the highest number of occurrences.

Since the dataset has not been loaded yet, we will do that with the read_csv() function and store the result in the data frame df. Notice in the code below that I only use train.csv, because the data in test.csv are completely unlabeled, hence we cannot measure the performance of our model by evaluating on that second csv file.

path = '/kaggle/input/nlp-getting-started/'
df = pd.read_csv(path+'train.csv')
df.head()
What the dataset looks like.

By the way, I am doing this project in a Kaggle notebook since its GPU performance is much better than what I’ve got in my laptop.

Next, we are going to create a function to clean the text data. The cleaning process itself is not very complicated. What I basically do here is convert all words to lowercase and remove unnecessary tokens like email addresses, Twitter usernames (mentions), urls, numbers and extra white spaces with the help of the Regex (regular expression) module. In case you still don’t get the idea of regular expressions, here’s a good post to learn from.

def clean(text):
    # Lowercase all letters
    text = text.lower()

    # Remove emails & twitter usernames (any token containing '@')
    text = re.sub(r'\S*@\S*', '', text)

    # Remove urls (\S* matches all following non-whitespace chars)
    text = re.sub(r'http\S*', '', text)

    # Remove numbers and punctuation (keep letters only)
    text = re.sub(r'[^a-zA-Z]', ' ', text)

    # Collapse all whitespace by tokenizing and re-joining with single spaces
    word_tokens = word_tokenize(text)

    return ' '.join(word_tokens)

Now that the clean() function has been defined, we can use it to actually clean all the texts with the code below. Here all the tweets are also concatenated and stored in the all_text variable.

all_text = df['text'].values
all_text = clean(' '.join(all_text))

Subsequently, we are going to create a WordCloud object. I think the parameters I pass in the object initialization below are pretty straightforward to understand.

wordcloud = WordCloud(width=2560, height=1440,
                      background_color='black',
                      min_font_size=10)
word_cloud = wordcloud.generate(all_text)

Finally, we can display the word cloud using the imshow() function like this:

plt.figure(figsize=(16,9))
plt.imshow(word_cloud)
plt.xticks([])
plt.yticks([])
plt.grid(False)
plt.show()
What the word cloud looks like.

That’s it! According to the image above, we can say that the words new and amp are the two most common in the dataset (amp most likely comes from HTML-escaped ampersands, &amp;, in the raw tweets). Note that by default all stop words are simply dropped by the WordCloud object. That’s the reason why the most common words are not something like “the”, “and”, “so”, etc.
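
If you wanna double-check the cloud, a quick frequency count tells a similar story. Here I reuse STOPWORDS, the default stop word list the WordCloud object filters with (the exact counts may differ slightly from the cloud’s internal processing):

from collections import Counter
from wordcloud import STOPWORDS

counts = Counter(w for w in all_text.split() if w not in STOPWORDS)
print(counts.most_common(5))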

Well, that was just for fun. Now let’s do the real data analysis :)

EDA 2 — Class distribution

In general, machine learning algorithms train better on a balanced dataset. In the case of the binary classification we encounter in this task, we would like to find out whether the numbers of positive and negative samples are approximately equal. In order to do so, we are going to take the count of both classes simply by applying the value_counts() method to a column of our data frame and directly store the result in the target_distribution variable.

target_distribution = df['target'].value_counts()

This target_distribution is basically a Pandas Series (essentially a single labeled column of values). Now what I wanna do is display the values stored in the series in the form of a pie chart using the plt.pie() function. Notice that by working with a Pandas Series like this, we can just take its index to be used as the class labels written on the chart.

plt.figure(figsize=(8,8))
plt.title('Class distribution')
plt.pie(target_distribution, labels=target_distribution.index,
        autopct='%1.1f%%', textprops={'fontsize':13})
plt.show()
Class distribution shown as a pie chart.

According to the figure above, we can say that the class distribution is slightly imbalanced: there are fewer positive (real disaster) samples than negative ones. In this project I will just keep all of the data for training (and testing) the model, since I think such a slight imbalance is not going to affect our neural network model too much.
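
If the imbalance were more severe, one common remedy (just a sketch here, not something used in this article) would be to pass class weights to Keras so that mistakes on the rarer class cost more during training:

# Hypothetical remedy for a stronger imbalance: weights inversely
# proportional to class frequency.
n_neg, n_pos = target_distribution[0], target_distribution[1]
class_weights = {0: (n_neg + n_pos) / (2 * n_neg),
                 1: (n_neg + n_pos) / (2 * n_pos)}
# Later: model.fit(..., class_weight=class_weights)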

EDA 3 — Top hashtags

Next up, I wanna find all hashtags that appear in the entire dataset and count them. Furthermore, the top hashtags among positive and among negative tweets will also be displayed. In order to do that, we need to define a function called find_hashtags() which will be applied to the text column of the data frame. The return value of this function is an array of words starting with the “#” symbol. The findall() function from the Regex module works well for that. It’s worth noting that the regular expression below basically matches any word starting with “#” and followed by one or more characters except whitespace.

def find_hashtags(x):
    hashtags = re.findall(r'#\S+', x)
    return np.array(hashtags)

Now that the function for finding hashtags has been defined, we can just apply it to the text column and directly store the array of hashtags in a new column.

df['hashtags'] = df['text'].apply(find_hashtags)
A new column is generated to store all hashtags appearing in the text column.

The idea of displaying the most mentioned hashtags is getting clearer now that all of them are placed in a single column. What I wanna do afterwards is create an empty list and put every single element of the hashtags column into that list. Then, since value aggregation is easier to do with a Pandas data frame, we’ll convert the all_hashtags list into a new data frame with a single column.

all_hashtags = list()
for hashtags in df['hashtags'].values:
    for i in range(len(hashtags)):
        all_hashtags.append(hashtags[i])
all_hashtags = pd.DataFrame(all_hashtags)

Subsequently, what I wanna do is count the occurrences of all unique hashtags and sort them such that the most common one appears at the bottom of the series. In addition, I will only keep the 15 most commonly used hashtags.

all_hashtags = all_hashtags.groupby(0)[0].count().sort_values(ascending=True)[-15:]
The values of the all_hashtags series.

To make things even clearer, we are going to display these data in the form of a bar chart. The code below should work well for that. It’s important to know that the essential part of displaying the graph is only the barh() function.

plt.figure(figsize=(10,7))
plt.title('Top 15 hashtags in all tweets')
plt.barh(all_hashtags.index, all_hashtags.values)
plt.ylabel('Hashtags')
plt.xlabel('Occurrences')
plt.show()
Most common hashtags in the dataset.

The figure above shows the most common hashtags in the tweet dataset. Now, if we wanna display only the ones coming from positive tweets, we can just reuse the code above and change the loop condition such that the appended words are only taken from the positive tweets. Here are the details.

positive_hashtags = list()
for hashtags in df[df['target']==1]['hashtags'].values:
    for i in range(len(hashtags)):
        positive_hashtags.append(hashtags[i])

Notice the df[df['target']==1] part of the code above. Here, instead of taking every single row of the hashtags column, we create a condition such that the accessed rows are only those with a target of 1 (positive). The rest of the code is basically the same as the one used to produce the previous figure. For the sake of simplicity, I will also not show the code to display the top negative hashtags since it is almost exactly the same.

positive_hashtags = pd.DataFrame(positive_hashtags)
positive_hashtags = positive_hashtags.groupby(0)[0].count().sort_values(ascending=True)[-15:]
plt.figure(figsize=(10,7))
plt.title('Top 15 hashtags in positive (real disaster) tweets')
plt.barh(positive_hashtags.index, positive_hashtags.values)
plt.ylabel('Hashtags')
plt.xlabel('Occurrences')
plt.show()
Most common hashtags in positive tweets.
Most common hashtags in negative tweets.

EDA 4 — Tweet length distribution

The last exploratory data analysis we want to do here is finding out the length of the tweets. To do so, it’s necessary to create a new function that returns the number of words in a given raw string. The implementation is as simple as the following code.

def calculate_length(x):
    return len(x.split())

Subsequently, we apply the function to the text column and store the result in another new column, length.

df['length'] = df['text'].apply(calculate_length)

Next, we will separate these values in two so that the positive and negative tweet lengths can be compared easily. The variables positive_lengths and negative_lengths now contain the word counts in the form of Pandas Series.

positive_lengths = df[df['target']==1]['length']
negative_lengths = df[df['target']==0]['length']

Now the data distribution will be displayed using the distplot() function from the Seaborn library. Notice that the most essential part of producing the figure below is just the two distplot() calls.

plt.figure(figsize=(15,5))
plt.title('Tweet length distribution')
sns.distplot(negative_lengths, kde=False)
sns.distplot(positive_lengths, kde=False)

plt.legend(['negatives', 'positives'])
plt.xlabel('no of words')
plt.ylabel('no of tweets')
plt.grid(False)
plt.show()
Length distribution shown using distplot() function.

Well, I think there is not much to explain about the figure above. What I can interpret is that most of the tweets, both positive and negative, are around 12 to 17 words long. Furthermore, it’s important to know that the longest tweet we have here is roughly 30 words (well, it’s probably something like 33). We will use this maximum tweet length in the next chapter.
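
By the way, instead of eyeballing the histogram, we can also get the exact maximum from the length column we just created:

# The article pads all tweets to 33 elements below, so expect a value
# around there.
print(df['length'].max())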

Feature engineering

The feature engineering for this task is actually not going to be complicated. This is because all we need to do is take the tweets stored in the text column and perform a so-called word-to-number mapping. But wait! There are several steps we need to take prior to the encoding operation.

First, we know that the text in our data frame is still messy, so we need to clean it up. If you scroll to the top of this article, you’ll see that we have already applied the clean() function, but only for the sake of creating the word cloud. So here we are going to apply the same function again, but this time we will actually update the content of the text column with its cleaned version. It can be achieved simply by running the code below.

df['text'] = df['text'].apply(clean)

Now that everything in the text column has been updated, we will put the entire column’s values into the X array. Also, all labels are going to be stored in the y array just to make things look more straightforward.

X = df['text']
y = df['target']

Afterwards, we will split the data into train and test sets. Here I will use the train_test_split() function from the Scikit-Learn module. The argument of 0.2 passed as test_size indicates that I will be using 20% of the samples in the dataset for testing, in order to find out whether or not the model is overfitting.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=24)

In addition, this train-test split is done prior to the encoding process because we want the encoding to be based only on the train data: in any machine learning problem, test data should be treated as something we have never seen before.

Now it’s time to actually turn all words into integers. Things get extremely easy with the help of the Tokenizer object from the Keras module. Notice that the fit_on_texts() call below creates the word-to-number mapping based only on our train tweets. Furthermore, I also store the vocabulary size (number of unique words) in the vocab_size variable. Well, this is going to be useful later when we construct the neural network.

Note: the indices in Keras’ word_index attribute start from 1, because index 0 is reserved for padding. That’s basically why I put +1 at the end of the code, so that the embedding layer covers every possible index.

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
vocab_size = len(tokenizer.word_index)+1
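
We can also take a quick peek at the learned mapping; more frequent words get smaller indices, and the exact values depend on the train split:

print(list(tokenizer.word_index.items())[:3])
print('vocab size:', vocab_size)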

After it has been fitted to the train data, we will convert both the train and test tweets into numbers using the texts_to_sequences() method.

encoded_X_train = tokenizer.texts_to_sequences(X_train)
encoded_X_test = tokenizer.texts_to_sequences(X_test)

Finally, all the tweets have now been converted into lists of integers. We can print the first sentence in the encoded_X_test list just to check whether things went as expected.
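
For instance, a quick print does the job:

print(encoded_X_test[0])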

We can see here that all the words have been converted into numbers.

Remember the maximum tweet length we obtained during the EDA chapter? At that point we found that the longest sentence consists of around 33 words. Since a neural network like ours can only work with a fixed input size, we will now zero-pad the sequences such that all encoded tweets have exactly 33 elements. Luckily, Keras provides the pad_sequences() function, which can easily be employed to do the work.

encoded_X_train = sequence.pad_sequences(encoded_X_train, maxlen=33)
encoded_X_test = sequence.pad_sequences(encoded_X_test, maxlen=33)

Now we can try to print out the first tweet stored in encoded_X_test again to see the difference.

This is what the first tweet in the test data looks like after being processed. The length of this array is exactly 33.

That’s pretty much all of the feature engineering. In fact, we don’t need to do anything with the labels because their values are already exactly what we need (either 0 or 1).

Constructing CNN

So far, we have got the features into the right shape and the labels correctly encoded. Therefore, we can start to actually construct the neural network. Additionally, here we are going to use the functional style instead of the sequential style; check this article out if you wanna find out more about the two styles of building neural networks. Anyway, below is how I construct the entire CNN-based neural net.

input1 = Input(shape=(33,))
x = Embedding(input_dim=vocab_size, output_dim=32, input_length=33)(input1)
x = Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')(x)
x = MaxPooling1D(pool_size=2)(x)
x = Dropout(0.5)(x)
x = Flatten()(x)
x = Dense(30, activation='sigmoid')(x)
x = Dropout(0.5)(x)
x = Dense(5, activation='sigmoid')(x)
x = Dropout(0.5)(x)
output1 = Dense(1, activation='sigmoid')(x)

model = Model(input1, output1)

Here we start with an input layer which accepts a single-dimensional array of 33 elements. As you might have guessed, this input shape represents the length of all processed tweets.

Subsequently, this input layer is connected to an embedding layer which takes 3 arguments, namely input_dim, output_dim and input_length. We can see here that the vocab_size we defined in the feature engineering chapter is used as the value of the input_dim argument. What essentially happens at this stage is that every single unique word is placed into a 32-dimensional space, such that similar words in the dataset end up close to each other in that space. In other words, a single word will be represented by 32 features. Here’s an illustration of how the output of the embedding layer looks. Note that the values used there are not the actual numbers.

What the output of the embedding layer looks like.
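
We can also verify the shape on our actual model by wrapping the embedding layer’s output in a temporary sub-model (this assumes model.layers[0] is the input layer, so the embedding sits at index 1):

# One dummy 33-word tweet in, one 33x32 matrix out. The weights are still
# random at this point, so only the shape is meaningful.
emb_out = Model(input1, model.layers[1].output)
print(emb_out.predict(np.zeros((1, 33))).shape)  # (1, 33, 32)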

Afterwards, we connect this embedding layer to a Conv1D layer. In fact, the main idea of this layer is exactly the same as the Conv2D we commonly use in image classification tasks. What makes Conv1D different is just that its filters move along a single axis instead of two. Below is another illustration which shows how the filter (highlighted in blue) of a 1-dimensional convolution layer strides. In addition, I’ll employ 32 different filters for this case (it’s probably overkill, though).

How a filter in a Conv1D layer strides.

Next up, this convolution layer is connected to a max-pooling layer. This is basically done to reduce the dimensionality of the data so that the training process takes less time. In fact, it’s not always necessary to employ this layer right after a convolution, but I decided to go with it anyway. After that, we flatten the resulting matrices using a Flatten layer before eventually connecting to several Dense layers. Also, I decided to put several Dropout layers between them to keep the model from overfitting too fast. Finally, the last layer to connect here is a Dense layer consisting of a single neuron with a sigmoid activation function.
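
A quick way to confirm how the tensor shape changes at each of these stages is to print the model summary. The expected output shapes, which follow from the layer parameters above, are noted in the comments:

model.summary()
# Embedding                 -> (None, 33, 32)
# Conv1D (padding='same')   -> (None, 33, 32)
# MaxPooling1D (pool 2)     -> (None, 16, 32)
# Flatten                   -> (None, 512)
# Dense layers              -> (None, 30) -> (None, 5) -> (None, 1)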

There might be a question regarding the output layer:

Commonly the number of neurons should be exactly the same as the number of available classes in the dataset. But here, we got 2 classes, so why do we only need 1 neuron? (Me, when I started to learn deep learning)

Then the answer is:

Because in a binary classification task we commonly have the labels 0 and 1. At the same time, we also know that the output value of the sigmoid activation function ranges between 0 and 1. Thus, by rounding this output value, we can get either 0 or 1. And therefore, a single neuron is enough. (Me, several moments after asking the question to myself)
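
As a tiny numeric illustration (these probabilities are made up):

probs = np.array([0.08, 0.73, 0.51])  # hypothetical sigmoid outputs
print(np.round(probs).astype(int))    # [0 1 1]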

Now, after the neural network has been constructed, we can compile it using the binary cross-entropy loss function along with the Adam optimizer prior to the training process. Furthermore, I also initialize an EarlyStopping object, which terminates the training process once the model starts to overfit, no matter how many epochs we defined in the first place.

model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['acc'])
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

Training & evaluation

Finally it’s time to train! Just run the code below to do so.

history = model.fit(encoded_X_train, y_train, epochs=100,
                    validation_data=(encoded_X_test, y_test),
                    callbacks=[es])
How the training process goes.

We can see here that the final accuracy on the train data is 82.4% while the accuracy on the test data is 78.2%. Well, the model is overfitting a bit, but I think it’s not so bad. If you pay attention to the code above, you will see that I set epochs to 100, yet the training stops at the fifth epoch thanks to the EarlyStopping object we set up earlier.
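
Since the class distribution is slightly imbalanced, it’s also worth going beyond plain accuracy; this finally puts the f1_score and confusion_matrix imports to use. Here 0.5 is the usual decision threshold for a sigmoid output:

y_pred = (model.predict(encoded_X_test) > 0.5).astype(int).flatten()
print('F1 score:', f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))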

Anyway, we can plot both the accuracy improvement and the loss decrease using the code below; the results are displayed in the next figures.

plt.figure(figsize=(8,6))
plt.title('Accuracy')
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])

plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(['acc', 'val_acc'])
plt.show()
plt.figure(figsize=(8,6))
plt.title('Loss')
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(['loss', 'val_loss'])
plt.show()
Accuracy towards train and test data.
Loss value towards train and test data.

In conclusion, well, I got no idea what to conclude here, lol. I just wanna say that it’s also possible to perform text categorization using a CNN approach, or more specifically the Conv1D layer. In fact, the final accuracy of 78.2% that we obtained on the test data is probably not the best possible performance. I think the problem is caused not by the neural network architecture but rather by the relatively simple feature engineering. In case you wanna try to make some improvements, I suggest taking into account several other features like the number of capital letters, the number of symbols, etc., as sketched below.
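
As a hypothetical starting point (not something done in this article), such counts would have to be computed on the raw tweets, since our clean() function lowercases everything and strips the symbols away:

# Reload the uncleaned tweets; the row order matches df since no rows
# were dropped or reordered.
raw = pd.read_csv(path+'train.csv')['text']
df['n_capitals'] = raw.apply(lambda t: sum(c.isupper() for c in t))
df['n_symbols'] = raw.apply(lambda t: sum(not c.isalnum() and not c.isspace() for c in t))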

That’s all of this project. Feel free to comment if you have questions or if you find any mistake in this article. See you!

Here’s the entire code used in this project.

References

Regex tutorial — A quick cheatsheet by examples by Jonny Fox. https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

Understanding 1D and 3D Convolution Neural Network | Keras by Shiva Verma. https://towardsdatascience.com/understanding-1d-and-3d-convolution-neural-network-keras-9d8f76e29610
