An Implementation of Semi-Supervised Learning

A self-training algorithm to perform sentiment analysis on the IMDB review dataset.


Machine learning can be divided into several categories, the most popular being supervised and unsupervised learning. Both are very commonly used in the field of data science. Supervised learning algorithms are used when all samples in a dataset are labeled, while unsupervised algorithms are employed to handle datasets without any labels at all.

On the other hand, what if we only have partially labeled data? For example, suppose we have a dataset of 10,000 samples but only 1,500 of them are labeled, while the rest are entirely unlabeled. In such cases, we can utilize what is called a semi-supervised learning method. In this article, we are going to dig into the code of one of the simplest semi-supervised algorithms, namely self-training.

Semi-supervised learning is applicable when we only have partially labeled data.

The self-training algorithm itself works like this (a toy sketch follows the list):

  1. Train the classifier with the existing labeled dataset.
  2. Predict a portion of the unlabeled samples using the trained classifier.
  3. Add the predictions with high confidence scores to the training set.
  4. Repeat all steps above.
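
To make this concrete before we touch the IMDB data, here is a minimal, self-contained toy sketch of the loop with a logistic regression classifier. Everything in it (the synthetic data, the chunking, the 0.95/0.05 thresholds) is purely illustrative and is not the code we build later in this article.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 1,000 samples, but we pretend only the first 100 are labeled.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 20))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

X_lab, y_lab = X_toy[:100], y_toy[:100]   # small labeled pool
X_unlab = X_toy[100:]                     # treated as unlabeled

clf = LogisticRegression()
for chunk in np.array_split(X_unlab, 3):
    clf.fit(X_lab, y_lab)                         # 1. train on the labeled pool
    probs = clf.predict_proba(chunk)[:, 1]        # 2. predict a chunk of unlabeled data
    keep = (probs > 0.95) | (probs < 0.05)        # 3. keep only confident predictions
    X_lab = np.concatenate([X_lab, chunk[keep]])  # 4. add the pseudo-labeled samples, repeat
    y_lab = np.concatenate([y_lab, (probs[keep] > 0.5).astype(int)])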

The dataset used in this project is the IMDB movie review dataset, which can easily be downloaded through the Keras API. The objective is pretty straightforward: we need to classify whether a text contains a positive or a negative review. In other words, this is just a standard sentiment analysis problem. The dataset is already split into train and test sets, each containing 25,000 unique review texts.

Now, before we get into the code, let me first show you the outline of this article.

  1. Loading data & preprocessing
  2. Data splitting
  3. Model training
  4. Model evaluation

As usual, the entire code is available at the end of this article.

Loading data & preprocessing

Machine learning always relies on data availability, hence loading the data should be done at the initial stage. Run the code below to import the required modules and load the dataset.

Importing modules and loading dataset.

The data loading above might look a bit unusual. The last two variables carry the _test suffix, yet the first two are not declared as _train. I do this because the names X_train and y_train will be used for another purpose later. It is also important to know that the num_words parameter in the code above indicates that we are only going to keep the 8,000 most common words.
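
To give a picture of this step, here is a minimal sketch assuming the variable names X, y, X_test, and y_test (the first two intentionally left without the _train suffix, as explained above):

from tensorflow.keras.datasets import imdb

# Load the IMDB reviews, already encoded as integer sequences,
# keeping only the 8,000 most frequent words.
(X, y), (X_test, y_test) = imdb.load_data(num_words=8000)

print(len(X), len(X_test))    # 25000 25000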

By default, all texts stored in both X arrays are already integer-encoded. Below is what the first movie review looks like. It is worth noting that a smaller number indicates a higher word frequency in the entire dataset.

All texts here are already converted into integers.

Now if we check the length of each review text, we will see that they all have a different number of words. This is going to be our first problem, since essentially any neural network expects a fixed input shape.

The length of the first 3 movie reviews.
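
A quick way to see this, assuming the X array loaded earlier:

# Each review is a plain list of word indices, so the lengths differ.
print([len(review) for review in X[:3]])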

To solve this problem, I employ the pad_sequences() function from Keras, which limits the text lengths so that they all have a fixed number of words. Here I decided to find the average word count per movie review and use that value as the limit. In fact, the average length is 239 words, but I went with 250 words instead since the value looks nicer, and it is completely fine to do so.

Applying the pad_sequences() function so that every review has exactly 250 words.
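
A sketch of this padding step, assuming the names X_pad and X_test_pad used below and a fixed length of 250 words:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Truncate long reviews and zero-pad short ones so every sequence
# contains exactly 250 integers.
X_pad = pad_sequences(X, maxlen=250)
X_test_pad = pad_sequences(X_test, maxlen=250)

print(X_pad.shape, X_test_pad.shape)    # (25000, 250) (25000, 250)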

After running the code above, we can check the shape of X_pad and X_test_pad. The number shown in the first axis in the figure below represents the number of samples, while the second one denotes the word count.

Now all texts have exactly the same length.

Data splitting

We know that all data in this IMDB dataset are in fact perfectly labeled. However, since we are going to simulate a semi-supervised learning algorithm, we will assume that we only know the labels of a small part of the data. The green block in the illustration below represents the portion of labeled samples, whereas the red blocks are assumed to be the unlabeled data in the training set.

The main idea of the self-training algorithm is to train a neural network classifier using only the 5,000 samples taken from the green block and then predict all samples in fold 1. The newly predicted data with high confidence scores are then combined with the previous training data to re-train the model. The process is repeated until the last fold, before we eventually predict the test data.

Data distribution in the dataset. Red blocks denote unlabeled data in the training set.

In order to split the data this way, we are going to employ the KFold object from Scikit-learn. KFold is commonly used to perform cross-validation, yet in this case we utilize it for a different purpose. The last two lines in the code below highlight the idea that the 0th fold is the only data with known labels. By the way, you may open up the documentation of KFold in case you're still not familiar with it.

Split the data into 5 folds.
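
A sketch of this splitting step, assuming the X_fold, y_fold, X_train, and y_train names referenced throughout the rest of the article:

from sklearn.model_selection import KFold

# Split the 25,000 padded training reviews into 5 folds of 5,000 samples each.
kfold = KFold(n_splits=5)

X_fold, y_fold = [], []
for _, fold_idx in kfold.split(X_pad):
    X_fold.append(X_pad[fold_idx])
    y_fold.append(y[fold_idx])

# Pretend that fold 0 is the only portion whose labels we actually know.
X_train = X_fold[0]
y_train = y_fold[0]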

Model training

The training is going to be done using a simple LSTM-based neural network. Essentially, I use this type of neural net because it generally works very well for sequential data. Here I decided to create the architecture inside a create_model() function. I prefer to do it like this because it's a lot simpler to call the same function than to declare the exact same model multiple times.

The function to build neural network classifier.
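
Here is a sketch of what such a create_model() function might look like. The exact layer sizes are my own illustrative choices; only the LSTM-based structure, the sigmoid output, and the accuracy metric follow the article.

import tensorflow as tf

def create_model(vocab_size=8000, seq_len=250):
    # Embedding -> LSTM -> single sigmoid unit for binary sentiment.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 32, input_length=seq_len),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['acc'])
    return model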

Just to recap, so far we have the data folds stored in X_fold[0] to X_fold[4]. We also have a pair of X_train and y_train, which is basically taken from the X_fold[0] array, and we pretend that it's the only labeled data we know.

Now let's do the first training by initializing the model and then directly applying the fit() method to it. Remember that our initial training data consists of 5,000 samples. Here I would like to use the last 1,000 of them for validation, just to check whether the model suffers from overfitting. I also decided to go with only 2 epochs, since that turned out to be just the right value for this case.

Model training with data from fold 0, followed by predicting data on fold 1.
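
A sketch of this first round, training on the first 4,000 samples of fold 0, validating on the last 1,000, and then predicting fold 1 (the preds name is mine):

model = create_model()

# Hold out the last 1,000 labeled samples purely for validation.
model.fit(X_train[:4000], y_train[:4000],
          validation_data=(X_train[4000:], y_train[4000:]),
          epochs=2)

# Predict fold 1 with the "semi-trained" model; the outputs are
# sigmoid probabilities between 0 and 1.
preds = model.predict(X_fold[1])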

After running the code above, we should get the following progress bar. We can see here that the model is pretty good, as it achieves 82.7% accuracy on the validation data, and more importantly, it's not overfitting.

Epoch 1/2
125/125 [==============================] - 2s 17ms/step - loss: 0.6734 - acc: 0.5918 - val_loss: 0.5636 - val_acc: 0.7400
Epoch 2/2
125/125 [==============================] - 2s 14ms/step - loss: 0.4379 - acc: 0.8307 - val_loss: 0.4079 - val_acc: 0.8270

Now what's next? In fact, up to this step we already have a "semi-trained" model. The reason I call it that is because it's already trained, but only on a small scale. If we think of this model as a steak, then it is kind of like "medium-well". In order to make it "well done", we need to use all the data folds for training. Therefore, the next thing to do is to predict the next data fold (X_fold[1]) using this "medium-well" model and then use these predictions as additional labeled data.

However, we should not append all of the predictions to the next training data. Instead, we will perform some filtering where predictions with low confidence scores are simply dropped, since there is a possibility that these predictions are incorrect. Such a labeling method is commonly called pseudo-labeling.

The function below is used to filter out data according to a confidence score threshold. Note that here we use a sigmoid activation function, so the output value must be somewhere between 0 and 1. By default, the decision boundary used in common cases is 0.5. This means that all outputs larger than 0.5 are mapped to 1 (positive) while the others are mapped to 0 (negative). However, in this case I want to use thresholds of 0.95 and 0.05. These values essentially say that any positive prediction with a score below 0.95 will be dropped, while any negative prediction with a score above 0.05 is going to be discarded as well.

A function to drop samples with low confidence scores.
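
A sketch of such a filtering function, applied to fold 1 and its predictions preds from the previous step. The function name filter_confident() and the X_new / y_new outputs are assumptions for illustration.

import numpy as np

def filter_confident(X_unlabeled, probs, upper=0.95, lower=0.05):
    # Keep only samples whose predicted probability is very close to 1
    # (confident positive) or very close to 0 (confident negative).
    probs = probs.ravel()
    keep = (probs > upper) | (probs < lower)
    X_kept = X_unlabeled[keep]
    y_kept = (probs[keep] > 0.5).astype(int)   # pseudo-labels
    return X_kept, y_kept

X_new, y_new = filter_confident(X_fold[1], preds)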

Here's a graph of the sigmoid activation function that we use in the very last layer of our neural network. The outputs colored in green (>0.95 and <0.05) represent the samples that we are going to use for the next training process.

Sigmoid activation function.

After running the code above, we can print out the shape of X_new to find out the number of remaining samples. Remember that initially each fold contains 5,000 samples, but here we only get 1,406. This output indicates that 3,594 texts were predicted with relatively low confidence.

We got 1406 new samples out of 5000 for the next training.

Since we want to use the data in X_new for the next training, we need to concatenate it with our existing X_train array. Here I'm going to use a join_shuffle() function to do so. All the function basically does is append the new data and then shuffle everything.

A function to concatenate X_train with X_new (along with their labels).
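
A sketch of what such a join_shuffle() function may look like:

import numpy as np

def join_shuffle(X_old, y_old, X_new, y_new):
    # Append the pseudo-labeled samples to the existing training data,
    # then shuffle so the new samples are mixed in.
    X_all = np.concatenate([X_old, X_new])
    y_all = np.concatenate([y_old, y_new])
    idx = np.random.permutation(len(X_all))
    return X_all[idx], y_all[idx]

X_train, y_train = join_shuffle(X_train, y_train, X_new, y_new)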

Finally, our X_train and y_train have been updated! Hence we can start the second training process. The steps are exactly the same as what we have done earlier. The difference is that here we will run the prediction on the data in fold 2. Below is the entire process.

Predicting data in fold 2.
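
A sketch of this second round, reusing the helper functions from the sketches above:

# Re-train a fresh model on the enlarged (true + pseudo-labeled) training set.
model = create_model()
model.fit(X_train[:-1000], y_train[:-1000],
          validation_data=(X_train[-1000:], y_train[-1000:]),
          epochs=2)

# Predict fold 2, keep only the confident predictions, and grow the
# training set once more.
preds = model.predict(X_fold[2])
X_new, y_new = filter_confident(X_fold[2], preds)
X_train, y_train = join_shuffle(X_train, y_train, X_new, y_new)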

Model evaluation

Up to this point, we have done the training process twice: the first training was used to predict fold 1 based on a model trained on fold 0, while the second training was done using data from folds 0 and 1 and was then used to predict the data in fold 2. For the sake of simplicity, I will just directly use the data coming from folds 0, 1, and 2 to train a new classifier model and perform the prediction on our test data. Below is the code to do so.

Train a model with data from fold 0, 1, and 2, then validate on the entire test data.
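
A sketch of this final step, training a fresh model on the accumulated data from folds 0 through 2 and validating directly on the 25,000 padded test reviews:

model = create_model()
model.fit(X_train, y_train,
          validation_data=(X_test_pad, y_test),
          epochs=2)
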
Epoch 1/2
167/167 [==============================] - 31s 187ms/step - loss: 0.6497 - acc: 0.6080 - val_loss: 0.5879 - val_acc: 0.7118
Epoch 2/2
167/167 [==============================] - 31s 185ms/step - loss: 0.4044 - acc: 0.8388 - val_loss: 0.4335 - val_acc: 0.8115

We can say that this model is pretty good, as we get a final accuracy of around 81% on the 25,000 test samples, even though we only used the first 3 data folds of the training set.

This might in fact not be the optimal model, since we haven't used the entire training data to actually train the neural net. Therefore, you may keep running the process above until all the training data is used if you want to obtain a better result, although I don't guarantee that the accuracy will improve drastically.

Alright, that's the end of today's article. I've put all the code used above below, so you can just copy-paste it into your code editor if you want. Thanks for reading!

The full code of this semi-supervised learning implementation.


A machine learning, deep learning, computer vision, and NLP enthusiast. Doctoral student of Computer Science, Universitas Gadjah Mada, Indonesia.