My solution to achieve top 1% in a novel Data Science NLP Competition

--

This is a documentation of my first Kaggle competition! First, let me introduce myself: I hold a Bachelor’s degree in Electronics Engineering with a major in Telecommunications. I started my journey into AI after entering corporate work as a developer, took Udacity courses on machine learning and AI, and developed prototype AI-based computer vision solutions for my company.

Most frequent words appearing in comments labelled as severe toxic (from a fellow Kaggler’s public post)

After all that, I had a strong urge to find good AI problems I could build solutions for, while building up more knowledge in this niche field. That’s when I found Kaggle, and I must say the challenges and problems its platform provides are remarkably broad in scope and suitable for practically any field of data science and AI. For context, Kaggle is a platform that hosts AI/machine learning competitions: the problems are real problems that real companies want solved, so they collaborate with Kaggle to host a competition around them, and of course a competition comes with substantial prize money (well, it’s huge for me).

So, back to the title. In short, Kaggle hosted a competition with the Conversation AI team, a research initiative founded by Google Jigsaw (a part of Alphabet) to work on tools that help improve online conversation, with a focus on the study of negative online behaviours. Specifically, they challenged competitors to build AI models for multi-label toxic comment classification: classifying whether a text comment is rude, disrespectful, hateful and so on. The most frequent words appearing in severe toxic comments are displayed above (warning: they really are toxic!). The competition is graded by the ROC-AUC metric, for which the scikit-learn library has a convenient function (the highest score wins!). 4,551 teams participated, and I achieved top 1% on both the public and private leaderboards (I dropped 23 places on the private leaderboard as shown, but luckily it’s still top 1%!).

Let’s get down and dirty

Loading the CSV-formatted training dataset with Python, a Jupyter notebook and pandas, it looks like this.
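For reference, here is a minimal loading sketch, assuming the standard competition file names (train.csv, test.csv) and their usual columns (an id, the raw comment text, and six binary toxicity labels):

```python
import pandas as pd

# Load the competition CSVs (file names assumed from the standard Kaggle download)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Inspect the shape and the first few rows: id, comment_text, and six label columns
print(train.shape)
print(train.head())
```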

To process the text comments, we first have to convert the words into representations that machine learning and deep learning models can consume. The three main representation methods I used are word-level TF-IDF, raw character-level label encoding, and low-dimensional word vector representations. Let me show them one by one.

TF-IDF — an algorithm that weighs a word in a piece of content and assigns it an importance based on how many times it appears in the document. More importantly, it checks how relevant the word is across the whole dataset (in this context, the combination of train and test comments), which is referred to as the corpus. The name stands for Term Frequency - Inverse Document Frequency. Full description here. The scikit-learn library has a convenient function to do just this >> Link
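As a rough sketch of that step (reusing the train/test DataFrames from the loading snippet above; the exact parameters varied between my models and are illustrative here):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit word-level TF-IDF on the combined train + test comments (the "corpus")
all_text = pd.concat([train["comment_text"], test["comment_text"]])

word_vectorizer = TfidfVectorizer(
    analyzer="word",
    ngram_range=(1, 1),   # unigrams; (1, 2) would also add bigrams
    max_features=50000,   # cap the vocabulary size; tune as needed
    sublinear_tf=True,
)
word_vectorizer.fit(all_text)

X_train_tfidf = word_vectorizer.transform(train["comment_text"])
X_test_tfidf = word_vectorizer.transform(test["comment_text"])
```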

Raw Character-level Label Encoding — encoding every character and symbol according to a specific ordering of all possible characters. The order itself doesn’t matter, but it needs to be consistent across the train and test datasets. For example, “a” maps to 0, “c” maps to 2, and so on. Below is the encoding order I used.

“abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:’\”/\\|_@#$%^&*~`+-=<>()[]{}\n”
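A minimal sketch of this character-level encoding, assuming lowercased input and a sentinel id of -1 for characters outside the alphabet and for padding:

```python
# Map every character in the chosen alphabet to a consistent integer id
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}\n"
char_to_id = {ch: idx for idx, ch in enumerate(alphabet)}

def encode_chars(text, max_len=500, pad_id=-1):
    """Encode a comment character by character; unseen characters and padding share a sentinel id."""
    ids = [char_to_id.get(ch, pad_id) for ch in text.lower()[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))  # pad to a fixed length for batching

print(encode_chars("Abc!", max_len=10))  # [0, 1, 2, 40, -1, -1, -1, -1, -1, -1]
```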

Word Vector Representations — this method represents each word with a low-dimensional vector (e.g. 100 dimensions), where the dimensions are usually latent. I used GloVe vectors and Facebook’s FastText vectors interchangeably; both come pre-trained with different numbers of dimensions (200, 300) on different datasets, including Common Crawl, Wikipedia and Twitter. Initially I thought these were word2vec models, but they are not; this link helps clear up the confusion. GloVe learns by constructing a co-occurrence matrix (words x contexts) that counts how frequently a word appears in a context; since this matrix is gigantic, it is factorized to obtain a lower-dimensional representation. FastText is used because it treats character n-grams as its smallest unit, so it generates better word embeddings (arrays of numbers with a predefined dimension) for rare words, or even words never seen during training, because their character n-gram vectors are shared with other words. As for word2vec, the main idea is to train a model on the context of each word so that similar words end up with similar numerical representations; because of time limitations and the good performance of the FastText and GloVe embeddings, I did not try it.
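To make this concrete, here is a rough sketch of how such pre-trained vectors are typically loaded into a matrix that an embedding layer can consume. The file names and the word_index (which comes from the tokenizer described later) are assumptions, not my exact script:

```python
import numpy as np

EMBED_DIM = 300  # e.g. glove.840B.300d.txt or crawl-300d-2M.vec (file names assumed)

def load_embeddings(path):
    """Parse a GloVe/FastText text file into a {word: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().rsplit(" ", EMBED_DIM)
            if len(parts) == EMBED_DIM + 1:  # skip header lines / malformed rows
                embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return embeddings

def build_embedding_matrix(word_index, embeddings, max_words=100000):
    """One row per tokenizer word id; words without a pre-trained vector stay at zero."""
    num_rows = min(max_words, len(word_index)) + 1
    matrix = np.zeros((num_rows, EMBED_DIM), dtype="float32")
    for word, i in word_index.items():
        if i < num_rows and word in embeddings:
            matrix[i] = embeddings[word]
    return matrix
```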

Preprocessing

So the input to my models is just text represented as numbers; what preprocessing can I do? Because this competition contains actual online comments from the real world, the text is not necessarily in the best form of English in terms of grammar and typos, and it can contain weird symbols. Heck, it may not even be in English. Exploratory Data Analysis (EDA) before building your preprocessing methods matters a lot in such a competition. For example, one preprocessing step could be to translate the foreign-language comments back into English, but to do that we first need to know they even exist in the data, and which languages they are (Japanese, Russian, etc.). EDA goes a long way. Conveniently, there are many examples of EDA shared by competitors in every single competition (why do people share during a competition? Because you get credit on Kaggle for comments and for shared code, called kernels).

I’ll share the preprocessing I did for the final submission. Based on EDA, I found that a few comments really are not English, so I used a method shared by a fellow Kaggler: first translate all the text into another language (say, Russian) with the Google Translate API via the textblob library, then translate everything back to English. This does two things: it augments the data, and it feeds the models cleaner text rather than words that the GloVe and FastText vectors would not recognize. The rest of the preprocessing steps are listed below (some of them copied from generous kernels):

  • Replaced positive tokens such as smiley faces and comments like “yay” with good
  • Replaced sad tokens like :’) with sad, and hahahaha with good
  • Replaced words containing special symbols with representations that don’t have them, e.g. “I’m” to “I am” and “can’t” to “can not”, since the GloVe and FastText vectors can’t recognize words with special symbols
  • Removed URL links and IP addresses that don’t add information (unless people send sarcastic messages like “www.waitlonglong.com”; PS: I did that before)
  • Replaced newline tokens, typically “\n”, with spaces (some editors add these automatically)
  • Converted & to “and”, @ to “at”, and numbers to their English counterparts (a minimal cleaning sketch covering a few of these steps follows this list)
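Here is that sketch: a small regex-based cleaner illustrating a few of the steps above (the contraction list is deliberately short and the whole function is a simplified stand-in, not my exact preprocessing script):

```python
import re

# A few of the contraction mappings described above (not an exhaustive list)
CONTRACTIONS = {"i'm": "i am", "can't": "can not", "won't": "will not", "don't": "do not"}

def clean_comment(text):
    text = text.lower()
    text = re.sub(r"\n", " ", text)                          # newline tokens -> spaces
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)       # drop URLs
    text = re.sub(r"\d{1,3}(\.\d{1,3}){3}", " ", text)       # drop IP addresses
    text = text.replace("&", " and ").replace("@", " at ")   # symbols to words
    for pattern, repl in CONTRACTIONS.items():
        text = text.replace(pattern, repl)
    return re.sub(r"\s+", " ", text).strip()                 # collapse extra whitespace

print(clean_comment("I'm at 192.168.0.1 & I can't stop!\nwww.example.com"))
# -> "i am at and i can not stop!"
```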

One last step that can also be considered preprocessing is tokenization. Tokenization means chopping a sequence of sentences into separate parts, and how you split can be customized. I used Keras’s built-in tokenizer to split on words and encode each word based on the train dataset, or on train + test (used interchangeably); it is essentially the character-level encoding described above, but at word level, and the library also returns the array of encoded sequences. I also padded each sequence to a fixed length, since the deep learning models expect fixed-length inputs. For example, if the fixed length is 250 and a sequence is 190 tokens long, 60 zeros (or negative ones) are appended at the end. It looks something like this:

“4728283 489329249924880 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1”
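A minimal sketch of this tokenize-and-pad step with Keras (vocabulary size and padding value are illustrative; the post mentions varying both, and the train/test DataFrames come from the loading snippet above):

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 100000   # maximum vocabulary size (varied between models)
MAX_LEN = 250        # fixed sequence length, as in the example above

# Build the word index on train + test comments, then encode and pad each comment
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(list(train["comment_text"]) + list(test["comment_text"]))

X_train = pad_sequences(tokenizer.texts_to_sequences(train["comment_text"]),
                        maxlen=MAX_LEN, padding="post", value=0)
X_test = pad_sequences(tokenizer.texts_to_sequences(test["comment_text"]),
                       maxlen=MAX_LEN, padding="post", value=0)
```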

That sums up the representation methods I used! Note that all of the methods described above contributed in some way to my final model; more about that in a bit ;)

Time for the REAL Modelling

In my final submission I used ensemble learning methods to stack, bag and boost multiple models, around 40 in total, all contributing to one final model.

I guess that explains the terms used somewhat interchangeably above. Basically, an ensemble is a method that combines different “weak” models into a final model that performs better than the single best model. Ensembling is based on the notion that rather than hiring one good plumber who can cover 80% of the job, you hire three plumbers with diverse skill sets (features, or in this competition, predictions), each covering 30% of the job, and end up with 90%. Makes sense? An in-depth description of ensembling can be found HERE.

In short, gather as many diverse models as possible to win the game. Below is a summary of the routine I followed to build the models.

  • Make sure each and every model is built with k-fold cross-validation. I previously thought k-fold CV was just for finding the best hyperparameters before retraining a final model on the full train dataset, but for deep learning I used it differently: train k models, validate each on its own fold, predict the test set with each of the k models, and average those predictions geometrically or arithmetically. Boom, you have final predictions that are much less prone to overfitting. This is essentially bagging, in other words bootstrap aggregating (see the sketch after this list).
  • Retrain the models with different configurations: different data (augmented, differently preprocessed, or not preprocessed at all), different model architectures, different pre-trained FastText and GloVe embeddings with different dimensions, different maximum sequence lengths, and different maximum vocabulary sizes for the tokenizer.
  • Store the validation predictions for each fold so that stacking is possible later. (I only started doing this midway through the competition, and not for all datasets, since predictions from the augmented dataset can only be stacked with other predictions from that same augmented dataset.)
  • For stacking, I combined handcrafted features from the original unprocessed train data with the OOF meta-features. In summary: text features normalized between 0 and 1, the number of exclamation marks in a comment, the number of uppercase letters, the ratio of capitals to total comment length, and the number of special symbols.
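Here is the promised sketch of the k-fold bagging routine. It is an illustration rather than my exact code; build_model is a hypothetical factory function returning a fresh Keras-style model, and X, y, X_test are assumed to be NumPy arrays:

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_bag(build_model, X, y, X_test, n_splits=10, epochs=3):
    """Train one model per fold, keep out-of-fold predictions for later stacking,
    and average the per-fold test predictions (the bagging step)."""
    oof = np.zeros(y.shape)
    test_preds = []
    for train_idx, valid_idx in KFold(n_splits=n_splits, shuffle=True, random_state=42).split(X):
        model = build_model()  # fresh model for every fold
        model.fit(X[train_idx], y[train_idx],
                  validation_data=(X[valid_idx], y[valid_idx]), epochs=epochs)
        oof[valid_idx] = model.predict(X[valid_idx])
        test_preds.append(model.predict(X_test))
    # Arithmetic mean here; a geometric mean is the other option mentioned above
    return oof, np.mean(test_preds, axis=0)
```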

I’ll now list the types of models I used (roughly):

  • DPCNN
  • SCNN
  • Bi-LSTM with Concatenation of Max Pooling, Average Pooling and Attention Networks
  • Bi-GRU with Concatenation of Max Pooling, Average Pooling and Attention Networks
  • Bi-GRU with Concatenation of Max Pooling, Average Pooling
  • Bi-LSTM with Concatenation of Max Pooling, Average Pooling
  • Factorization Machines with Follow-The-Regularized-Leader (used wordbatch implementation)
  • Naive Bayes features with Logistic Regression (often called NB-SVM, though it is not actually an SVM; I’m not sure why people call it that)
  • Bi-GRU to Capsule Network(Tried implementations in mxnet and keras)
  • Bi-LSTM to Capsule Network
  • Bi-GRU-CNN
  • Bi-LSTM-CNN
  • XGBoost
  • Light Gradient Boosting Machine (LGBM)
  • Stack Logistic Regression
  • Stack XGB
  • Stack LGBM
  • Double-channel Bi-GRU with trainable GloVe embeddings on one side and non-trainable FastText embeddings on the other, feeding into a 2D CNN
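To give a flavour of the recurrent models in this list, here is a minimal Keras sketch of a Bi-GRU with concatenated max and average pooling. The hyperparameters (GRU size, dropout rate) are illustrative, not my exact settings:

```python
from keras.layers import (Input, Embedding, SpatialDropout1D, Bidirectional, GRU,
                          GlobalMaxPooling1D, GlobalAveragePooling1D, concatenate, Dense)
from keras.models import Model

def build_bigru_pool(embedding_matrix, max_len=250):
    """Bi-GRU whose per-timestep outputs are max- and average-pooled, then concatenated."""
    inp = Input(shape=(max_len,))
    x = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                  weights=[embedding_matrix], trainable=False)(inp)
    x = SpatialDropout1D(0.2)(x)
    x = Bidirectional(GRU(128, return_sequences=True))(x)
    x = concatenate([GlobalMaxPooling1D()(x), GlobalAveragePooling1D()(x)])
    out = Dense(6, activation="sigmoid")(x)  # six toxicity labels
    model = Model(inputs=inp, outputs=out)
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
```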

Most of my implementations are coded in Keras with TensorFlow as the backend. In addition, thanks to a generous share from a fellow Kaggler, I found a very interesting way to ensemble models by iterating over combinations of different models, specifically hill climbing with replacement as described in this paper. The exact ensemble architecture I used is shown below.

Essentially, blending is just a glorified way of saying weighted averaging. Sadly, the weights I used for the final blend were determined only by probing the public leaderboard score; the better way to set them without overfitting would be to base them on OOF validation performance, which is also why I added the hill-climbing ensemble to the blend. I lost track of the types of models I built early in the competition (it was my first try, and I messed up naming the prediction files), so I will do systematic ensembling in my next competition! Since out-of-fold stacking can only be done with the same data and the same folds (different preprocessing is fine) to prevent data leakage, separate stacking runs were needed. Some of the individual models in the final blend also appear among the models in the 28-model stacker; most are in the blend because I hadn’t planned to stack in the first place and therefore never produced out-of-fold (OOF) meta-features for them to train on (but I still wanted to use them, so: blend!).
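For completeness, here is a rough sketch of hill climbing with replacement over OOF predictions. It is my interpretation of the technique, not the exact shared script; oof_preds is assumed to be a list of [n_samples, 6] prediction arrays and y_true the matching label matrix:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hill_climb_blend(oof_preds, y_true, n_iters=200):
    """Greedy ensemble selection with replacement: repeatedly add the model whose
    inclusion most improves the mean column-wise ROC-AUC of the running average."""
    chosen = []
    best_blend, best_score = None, 0.0
    for _ in range(n_iters):
        best_candidate = None
        for i, p in enumerate(oof_preds):
            blend = np.mean([oof_preds[j] for j in chosen] + [p], axis=0)
            score = np.mean([roc_auc_score(y_true[:, c], blend[:, c])
                             for c in range(y_true.shape[1])])
            if score > best_score:
                best_score, best_candidate, best_blend = score, i, blend
        if best_candidate is None:   # no model improves the blend any further
            break
        chosen.append(best_candidate)  # models may be picked more than once
    return chosen, best_blend, best_score
```

The final blending weight of each model is simply how often it was picked divided by the total number of picks, and because the score is measured on OOF predictions, this avoids tuning weights against the public leaderboard.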

Last but not least, here is a plot of my public LB performance gain over my 138 submissions throughout the competition. It shows a fairly steady improvement over the two months!

Hurray to my first Silver Medal! Cheers to more Kaggle competitions!

Oh, and the link to my code base can be found HERE.
