Only Numpy: Implementing “ADDING GRADIENT NOISE IMPROVES LEARNING FOR VERY DEEP NETWORKS” from Google Brain with interactive code

--

Today I am going to implement the gradient noise paper "Adding Gradient Noise Improves Learning for Very Deep Networks" by Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, and their colleagues at Google Brain. For the link to the original paper, please click here.

Main Objective

Additive gradient descent Image from Original Paper

So the main contribution of the paper is that, rather than using normal gradient descent, we add Gaussian noise with a mean of 0 and a certain standard deviation to every gradient. How and where do we calculate this standard deviation? It is shown below.

Additive gradient descent Image from Original Paper

So the η value is selected from a set of three numbers {0.01, 0.3, 1.0}, γ is fixed at 0.55, and the variable t stands for the training time step.

Additive gradient descent Image from Original Paper

Recap in One Image from Delip

Image from Delip Rao

So as a recap, rather than performing normal gradient descent (1), we are going to use additive gradient descent (2), deriving the standard deviation from equation (3).
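To make equations (2) and (3) concrete, here is a minimal NumPy sketch of the noisy gradient. The η value of 0.01 is an assumed choice from the paper's small set of options; γ = 0.55 is the value the paper fixes.

```python
import numpy as np

eta = 0.01     # assumed choice of eta (the paper picks it from a small set)
gamma = 0.55   # fixed value from the paper

def noisy_gradient(grad, t):
    # Equation (3): sigma_t^2 = eta / (1 + t)^gamma
    sigma = np.sqrt(eta / (1.0 + t) ** gamma)
    # Equation (2): g_t <- g_t + N(0, sigma_t^2)
    return grad + np.random.normal(0.0, sigma, size=grad.shape)
```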

Training Data / Hyper Parameter for Both Normal and Additive Gradient Descent

Our Training Data

This is a simple classification task for a standard 4-layer neural network, nothing special. The training data above is generated with Sklearn (so yeah, technically it is not only NumPy, but LOL, we only use Sklearn to create the training data, I promise!).

Creating Training Data

One other very important thing is to make the weights of each model exactly the same: as seen below, for a fair comparison we declare one set of weights and copy it to both the normal gradient descent network and the additive network.
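A minimal sketch of that setup, assuming make_moons as the Sklearn generator and placeholder layer sizes (the post's actual generator and dimensions may differ):

```python
import numpy as np
from sklearn.datasets import make_moons  # assumed generator; the post only says "Sklearn"

np.random.seed(1)

# Toy two-class data standing in for the post's training set
X, y = make_moons(n_samples=200, noise=0.1)
y = y.reshape(-1, 1)

# Declare one set of weights...
w1 = np.random.randn(2, 10)
w2 = np.random.randn(10, 15)
w3 = np.random.randn(15, 8)
w4 = np.random.randn(8, 1)

# ...and copy it, so both networks start from identical parameters
w1_normal, w2_normal, w3_normal, w4_normal = w1.copy(), w2.copy(), w3.copy(), w4.copy()
w1_noise, w2_noise, w3_noise, w4_noise = w1.copy(), w2.copy(), w3.copy(), w4.copy()
```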

Forward Feed for Both Normal and Additive Gradient Descent

Forward Feed For Normal Gradient Descent
Forward Feed For Additive Gradient Descent

Both the normal and the additive gradient descent models have 4 layers, and the layers alternate between the tanh() and logistic sigmoid() activation functions. Also, please note that we are using the L2 norm cost function.
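A small sketch of that forward feed and cost (the exact ordering of the activations here is an assumption, not necessarily the one in the post's code):

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, w1, w2, w3, w4):
    """Four layers, alternating tanh() and logistic sigmoid() activations."""
    l1 = tanh(X @ w1)
    l2 = sigmoid(l1 @ w2)
    l3 = tanh(l2 @ w3)
    l4 = sigmoid(l3 @ w4)
    return l1, l2, l3, l4

def l2_cost(pred, y):
    """L2 norm cost between the prediction and the labels."""
    return np.sum(0.5 * (pred - y) ** 2)
```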

Back Propagation for Normal Gradient Descent

As seen above, this is standard back propagation with a preset learning rate, nothing special.
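For reference, here is a sketch of the standard update for the last layer only, using placeholder shapes and an assumed learning rate; the earlier layers follow the same chain-rule pattern.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder data standing in for the real activations and labels
np.random.seed(0)
l3 = np.random.randn(200, 8)            # activations feeding the last layer
y = np.random.randint(0, 2, (200, 1))   # labels
w4 = np.random.randn(8, 1)
learning_rate = 0.1                     # assumed preset learning rate

layer_4_in = l3 @ w4
l4 = sigmoid(layer_4_in)

grad_4_part_1 = l4 - y                                           # dCost/dPrediction (L2 cost)
grad_4_part_2 = sigmoid(layer_4_in) * (1 - sigmoid(layer_4_in))  # dPrediction/dLayerInput
grad_w4 = l3.T @ (grad_4_part_1 * grad_4_part_2)                 # chain rule down to w4

w4 -= learning_rate * grad_w4           # plain gradient descent step
```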

Back Propagation for Additive Gradient Descent

As seen above, within the code there is a section called “Calculate The additive Noise”. That is where we calculate the additive noise with respect to each time step → iter, using the preset η value.
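The only difference from the normal update is that annealed Gaussian noise is added to the gradient before the step. A minimal sketch, again with assumed η and learning rate values and a stand-in gradient (the variable iter mirrors the time step name used in the post):

```python
import numpy as np

eta = 0.01            # assumed preset eta value
gamma = 0.55
learning_rate = 0.1   # assumed, same as for the normal model

np.random.seed(0)
w4 = np.random.randn(8, 1)
grad_w4 = np.random.randn(8, 1)   # stand-in for the gradient from back propagation

for iter in range(100):
    # "Calculate The additive Noise" with respect to the time step -> iter
    sigma = np.sqrt(eta / (1 + iter) ** gamma)
    additive_noise = np.random.normal(0.0, sigma, size=grad_w4.shape)

    # Additive gradient descent: step along the noisy gradient
    w4 -= learning_rate * (grad_w4 + additive_noise)
```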

Experiment Results

The results were absolutely amazing. As seen above, the two networks are trained with exactly the same learning rate and the same amount of training time (number of epochs). However, in most cases we can observe that the final error of the additive gradient descent model is far lower (though not when the η value is chosen as 0.3). Don’t believe it? See for yourself by playing with the interactive code linked below.

Interactive Code

Please follow this link to access the code.

Update 2/15: I just realized that the code is not running. Don’t worry! I will post the updated code in Google Colab!

Final Words

It’s amazing how one line of code can change the learning process of a model. I wonder what effect this noise would have when paired with different optimization methods. Maybe I’ll run experiments on them in the future…

If any errors are found, please email me at jae.duk.seo@gmail.com.

Meanwhile, follow me on my Twitter here, and visit my website or my YouTube channel for more content. I also derived back propagation for a simple RNN here, if you are interested.

References

  1. Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., & Martens, J. (2015). Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807.
  2. Rao, D. (2016, June 5). Make your Stochastic Gradient Descent more Stochastic. Retrieved January 18, 2018, from http://deliprao.com/archives/153
  3. Hakky St. (2017, April 14). An overview of gradient descent optimization algorithms. LinkedIn SlideShare. www.slideshare.net/ssuser77b8c6/an-overview-of-gradient-descent-optimization-algorithms
