Calculus isn’t boring — Tensorflow Part2


#fightwithdepression

Writing from Antarctica: we just moved house, and my room is named 'Antarctica.' Robots may be booking air tickets to travel to Antarctica 200 years from now. I started this blog to give a small introduction to the use of calculus in machine learning. The subject is huge, so please don't expect me to cover every area; I'm trying to build an intuition for how math plays a great role in it. If you don't understand everything on the first read, that's okay. I didn't understand it on my first attempt either, and researchers build this intuition over 4-5 years.

One of the best courses I followed on Coursera about the math behind machine learning:

By the way, if you have not read my first post, please read:


Picture Courtesy: deeplearning.ai

We will begin with logistic regression, which is an example of binary classification. Here's an example of a binary classification problem: you have thousands of colored pictures, and you need to classify them into cat images and non-cat images. If the image is a cat, the picture is labeled 1; otherwise it is labeled 0. When a picture is saved on a computer, three matrices (red, green and blue) are actually stored. Suppose we use a 64*64 picture of a cat, meaning the picture is 64 pixels by 64 pixels. We then have three 64*64 matrices corresponding to red, green and blue. Our purpose is to unroll these matrices into a single feature vector, so we create a single matrix whose dimension is ((64*64*3), 1). Let me elaborate with the picture below:

Picture Courtesy: deeplearning.ai

So in the n*1 shaped matrix, the first picture has been unrolled. If we have 10000 sample pictures, then we create an input matrix whose column count is m, i.e. 10000. Now we have an input matrix whose row count is n (for the current example, 64*64*3 = 12288) and whose column count is m, the number of samples, i.e. 10000. A minimal NumPy sketch of this unrolling is shown below.
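Here is a small sketch of how the unrolling could be done in NumPy. The array names and the use of random pixel values are purely illustrative; the point is the reshape from (m, 64, 64, 3) images into an input matrix of shape (n, m) = (12288, m).

```python
import numpy as np

# Illustrative only: m sample RGB images of shape (64, 64, 3), pixel values in [0, 255]
m = 10
images = np.random.randint(0, 256, size=(m, 64, 64, 3))

# Unroll each image into a column vector of length 64*64*3 = 12288,
# giving an input matrix X of shape (n, m) = (12288, m)
X = images.reshape(m, -1).T
print(X.shape)        # (12288, 10)

# It is common to scale the pixel values to [0, 1] before training
X = X / 255.0
```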

m training examples: {(X1, Y1), (X2, Y2), (X3, Y3), ..., (Xm, Ym)}

Here m and n are positive integers. There will be a single output, 1 (cat) or 0 (non-cat), for every input image. If there are m input image samples, then we will have m outputs, so the shape of the output matrix is (1, m):

Y = [Y1, Y2, Y3, Y4, …, Ym]

We will go over logistic regression, a learning algorithm that is frequently used when you have two types of outputs, mainly for binary classification problems. Given an input feature vector X, perhaps corresponding to an image that you want to recognize as either a cat picture or not a cat picture, you want an algorithm that can output a prediction, which we will call Y hat, your estimate of Y. More formally, you want Y hat to be the probability that Y is equal to one given the input features X. In other words, if X is a picture, you want Y hat to tell you: what is the chance that this is a cat picture?

So X is an n_x-dimensional vector. The parameters of logistic regression are W, which is also an n_x-dimensional vector, together with b, which is just a real number. Given an input X and the parameters W and b, how do we generate the output Y hat? Well, we could start with a linear function:

ŷ = wᵀx + b (read: w transpose x, plus b)

Y hat should really be between zero and one, and it is difficult to enforce that, because wᵀx + b can be much bigger than one, or it can even be negative, which does not make sense for a probability. That is why we use an activation function; here we will use the sigmoid function to generate y hat.

This is what the sigmoid function looks like.

So long story short.

Given x, ŷ = P(y = 1 | x), where 0 ≤ ŷ ≤ 1.

The parameters used in Logistic regression are:

The input feature vector: x ∈ ℝ^(n_x), where n_x is the number of features.

The training label: y ∈ {0, 1}.

The weights: w ∈ ℝ^(n_x), where n_x is the number of features.

The bias (threshold): b ∈ ℝ.

The output: ŷ = σ(wᵀx + b).

Sigmoid function: s = σ(wᵀx + b) = σ(z) = 1 / (1 + e^(−z)).

wᵀx + b is a linear function, like (ax + b), but since we are looking for a probability constrained to [0, 1], the sigmoid function is used. The function is bounded between [0, 1], as shown in the graph above. Some observations from the graph:

a) If z is a large positive number, then σ(z) ≈ 1.

b) If z is a large negative number, then σ(z) ≈ 0.

c) If z = 0, then σ(z) = 0.5.
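As a quick numerical check of these three observations, here is a minimal sketch using NumPy (the function name and test values are just for illustration):

```python
import numpy as np

def sigmoid(z):
    """Evaluate the sigmoid 1 / (1 + e^(-z)) elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([100.0, -100.0, 0.0])))
# large positive z -> ~1.0, large negative z -> ~0.0, z = 0 -> 0.5
```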

Now we want to define the loss function. The loss function measures the discrepancy between the predicted output ŷ(i) and the actual output y(i); it computes the error for a single training example. For logistic regression the usual choice is the cross-entropy loss: L(ŷ, y) = −(y log(ŷ) + (1 − y) log(1 − ŷ)).

Cost function: the cost function is the average of the loss function over the entire training set, J(w, b) = (1/m) Σ L(ŷ(i), y(i)). We need to minimize the cost function to find w and b.
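A minimal NumPy sketch of this cost, assuming the predictions and labels are stored as (1, m) arrays as described above (the function name is illustrative):

```python
import numpy as np

def compute_cost(Y_hat, Y):
    """Cross-entropy cost averaged over m examples.

    Y_hat, Y have shape (1, m); Y contains 0/1 labels.
    """
    m = Y.shape[1]
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return np.squeeze(np.sum(losses) / m)
```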

Now let's discuss how to use the gradient descent algorithm to train, i.e. to find the w and b that minimize the cost function J, which is a function of the parameters w and b. If we build a 3-dimensional graph where the two horizontal axes are w and b and the vertical axis is the cost J, the global optimum is the point where the cost function is minimal.

Picture Courtesy: deeplearning.ai

From the left-hand picture, we can see that J is a convex function; it is shaped like a bowl. To find good values of the parameters w and b, we can start from zero, the first red dot. Random initialization also works, but for logistic regression people don't usually do it: because the function is convex, no matter where you initialize w and b, you will converge to the same point or roughly nearby points. Gradient descent starts from that single point and then takes a step in the steepest downhill direction, and after enough iterations it approaches the global optimum. The logic is straightforward: first we update w, and then we follow the same procedure for b.

Picture Courtesy: deeplearning.ai

We calculate the derivative of J with respect to w, so dJ(w) here means dJ(w)/dw. After obtaining dJ(w), we multiply it by the learning rate (alpha) and subtract the result from the current value of w. We repeat the process until we reach the minimum. Now let's put this logic into the algorithm.
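Here is one way the update rule w := w − α dJ/dw (and the same for b) could be written for logistic regression. This is a sketch under the shapes defined earlier, X of shape (n_x, m) and Y of shape (1, m); the function names and defaults are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, alpha=0.01, num_iterations=1000):
    """Plain (full-batch) gradient descent for logistic regression."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):
        Y_hat = sigmoid(np.dot(w.T, X) + b)   # forward pass, shape (1, m)
        dZ = Y_hat - Y                        # derivative of the cost w.r.t. z
        dw = np.dot(X, dZ.T) / m              # dJ/dw, shape (n_x, 1)
        db = np.sum(dZ) / m                   # dJ/db, a scalar
        w = w - alpha * dw                    # w := w - alpha * dJ/dw
        b = b - alpha * db                    # b := b - alpha * dJ/db
    return w, b
```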

In a real-life production environment, we usually use SGD, or Stochastic Gradient Descent, in place of plain Gradient Descent. In GD we compute the gradient over the whole input batch, but when the input size approaches a billion examples, computing the full gradient on every iteration becomes very expensive. That is where SGD comes in: as the name suggests, it randomly chooses a single input (or a small mini-batch) from the dataset for each update. For more, see the link below:
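For contrast with the full-batch loop above, a single stochastic step might look like the following sketch, where one example is sampled at random per update (names are illustrative):

```python
import numpy as np

def sgd_step(w, b, X, Y, alpha=0.01):
    """One stochastic update using a single randomly chosen example."""
    m = X.shape[1]
    i = np.random.randint(m)                  # pick one example at random
    x_i = X[:, i:i + 1]                       # shape (n_x, 1)
    y_i = Y[:, i:i + 1]                       # shape (1, 1)
    y_hat = 1.0 / (1.0 + np.exp(-(np.dot(w.T, x_i) + b)))
    dz = y_hat - y_i                          # gradient signal from this one example
    w = w - alpha * x_i * dz
    b = b - alpha * dz.item()
    return w, b
```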

Activation function: when you implement a deep neural network, one of the main design decisions is which activation function to use. We will discuss mainly five activation functions.

Picture Courtesy: deeplearning.ai

a) Sigmoid function: the mathematical form is a = 1 / (1 + e^(−z)), so the sigmoid function always stays between zero and one. The graph is shown on the left. The sigmoid function has been widely used in machine learning intro materials, especially for logistic regression and some basic neural network implementations. However, you should know that the sigmoid function is not your only choice of activation function, and it does have drawbacks.

Picture Courtesy: deeplearning.ai

b) Tanh function: an activation function that almost always performs better than the sigmoid function is the tanh, or hyperbolic tangent, function. Mathematically it is a shifted and rescaled version of the sigmoid, and its range is -1 to 1. When you do binary classification, the output layer's value should be between 0 and 1, so you can use sigmoid there; otherwise, tanh is usually the better activation function. One of the main drawbacks of the sigmoid and tanh functions is that when z is very large or very small, the derivative becomes very small, so the gradient descent steps also become very small. When calculating w and b, we then need many more iterations to reach the optimal point.

Picture Courtesy: deeplearning.ai

c) ReLU function: one of the most straightforward non-linear activation functions is the ReLU, or Rectified Linear Unit. The main advantage of the ReLU is that its derivative is always 1 for positive values, which makes reaching the optimum more efficient. However, even such a simple solution is not perfect. From Andrej Karpathy's CS231n course:

Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

Picture Courtesy: deeplearning.ai

d) Leaky ReLU function: this activation function is not used as widely. Leaky ReLU is one attempt to fix the "dying ReLU" problem: for the ReLU, the derivative is zero whenever z is negative, and Leaky ReLU covers that part by giving negative inputs a small slope. The formula of the activation function is a = max(0.01z, z).

e) Softmax function: this is the activation function I personally like and use most often. The softmax function squashes the output of each unit to be between 0 and 1, just like a sigmoid function, but it also divides each output so that the total sum of the outputs is equal to 1 (check it in the figure on the left). The output of the softmax function is therefore a categorical probability distribution; it tells you the probability that each of the classes is true. The mathematical equation of the function is softmax(z)_i = e^(z_i) / Σ_j e^(z_j).
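The five activations discussed above can be written as small NumPy helpers; this is a sketch of one possible implementation, not the course's exact code (the stability shift inside softmax is a common extra step):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))           # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                          # squashes values into (-1, 1)

def relu(z):
    return np.maximum(0, z)                    # 0 for z < 0, z otherwise

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)            # a = max(0.01z, z)

def softmax(z):
    """Column-wise softmax: each column sums to 1 (a probability distribution)."""
    e = np.exp(z - np.max(z, axis=0, keepdims=True))   # shift for numerical stability
    return e / np.sum(e, axis=0, keepdims=True)
```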

Now the question is: why does a linear equation need a non-linear activation function?

Because the activation function is what introduces non-linearity. If you build an eight- or nine-layer network without activation functions, the whole stack collapses into a single linear transformation, so the model cannot express anything more than a linear function. Real-world problems are not always as simple as binary classification and will not follow a simple linear equation. For more, please follow the link.
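A tiny numeric check of that collapse, with arbitrary shapes chosen only for illustration: two linear layers with no activation in between are exactly one linear layer.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(4, 1)                         # an arbitrary input
W1, b1 = np.random.randn(5, 4), np.random.randn(5, 1)
W2, b2 = np.random.randn(3, 5), np.random.randn(3, 1)

# Two "linear layers" with no activation in between...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...equal one linear layer with W = W2 W1 and b = W2 b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_layers, one_layer))         # True
```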

Now it's time for coding. We will solve the binary classification problem without TensorFlow; it will help us work through the basic machine learning math in a programmatic way.

To build the neural network, we need several helper functions. First we write the math helper functions.
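Since backpropagation will need the derivatives of the activations, one way to write these math helpers is to pair each activation with a backward version. This is a sketch of what such helpers might look like, not the notebook's exact code:

```python
import numpy as np

def sigmoid(Z):
    A = 1.0 / (1.0 + np.exp(-Z))
    return A, Z                               # cache Z for the backward pass

def relu(Z):
    A = np.maximum(0, Z)
    return A, Z

def sigmoid_backward(dA, Z):
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1 - s)                   # dL/dZ = dL/dA * sigma'(Z)

def relu_backward(dA, Z):
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0                            # derivative is 0 where Z <= 0, 1 elsewhere
    return dZ
```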

We intend to build a 2-layer neural network for the binary classification problem above. We need to implement the following steps.

a) We will initialize the parameters for a two-layer network.

b) We will implement the forward propagation module (shown in purple in the figure below).

c) We will compute the loss.

d) We will implement the backward propagation module (shown in red in the figure below).

e) We will finally update the parameters.

Picture Courtesy: deeplearning.ai

Initialization:

The goal is to initialize the parameters for the 2-layer network. The model structure will be: LINEAR -> RELU -> LINEAR -> SIGMOID.

Don't initialize the weights with zeros; if you do, the hidden units stay symmetrical and all learn the same thing. After initializing the weights randomly, we multiply them by 0.01. If the initial weights are large (say 100), then z = wx + b will be large, and if we use sigmoid or tanh as the activation function the units start out saturated, so we need many more gradient descent iterations to reach the optimal value.
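A sketch of two-layer initialization following this advice (small random weights, zero biases); the layer sizes in the example call are illustrative:

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    """Small random weights break symmetry; the 0.01 factor keeps z small
    so sigmoid/tanh do not start out saturated."""
    np.random.seed(1)
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

# e.g. 12288 input features, 7 hidden units, 1 output unit
parameters = initialize_parameters(12288, 7, 1)
```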

Forward Propagation Module:

We have initialized the parameters; now it's time to write the helper functions for forward propagation. The linear forward module (vectorized over all the examples) computes the following equation:

Z[l] = W[l] A[l−1] + b[l], where A[0] = X and l is the index of the layer.
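A minimal sketch of the linear forward step and the linear-then-activation step; the caches are kept because backpropagation will need them. Function names follow the structure described in this post and are illustrative:

```python
import numpy as np

def linear_forward(A_prev, W, b):
    """Z[l] = W[l] A[l-1] + b[l]; the cache is kept for backpropagation."""
    Z = np.dot(W, A_prev) + b
    return Z, (A_prev, W, b)

def linear_activation_forward(A_prev, W, b, activation):
    """One LINEAR -> ACTIVATION step, with activation 'sigmoid' or 'relu'."""
    Z, linear_cache = linear_forward(A_prev, W, b)
    if activation == "sigmoid":
        A = 1.0 / (1.0 + np.exp(-Z))
    elif activation == "relu":
        A = np.maximum(0, Z)
    return A, (linear_cache, Z)
```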

Cost Function

After implementing forward propagation, we need to measure the cost, using the same cross-entropy cost function defined earlier.

Backward propagation module:

Just like with forward propagation, we will implement helper functions for backpropagation. Remember that backpropagation is used to calculate the gradient of the loss function with respect to the parameters.

Picture Courtesy: deeplearning.ai
Picture Courtesy: deeplearning.ai
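A hedged sketch of the linear backward step, matching the caches produced by the forward helpers above; dZ[l] itself comes from the activation backward functions (sigmoid_backward, relu_backward) sketched earlier:

```python
import numpy as np

def linear_backward(dZ, cache):
    """Given dZ[l] = dL/dZ[l], return gradients for W[l], b[l] and A[l-1]."""
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m                 # dL/dW[l]
    db = np.sum(dZ, axis=1, keepdims=True) / m    # dL/db[l]
    dA_prev = np.dot(W.T, dZ)                     # gradient passed to the previous layer
    return dA_prev, dW, db
```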

Update Parameters:

In this section we will update the parameters of the model, using gradient descent:

W[l]=W[l]−α dW[l]

b[l]=b[l]−α db[l]

where α is the learning rate. After computing the updated parameters, store them in the parameters dictionary.
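One way the update could be applied to every layer, assuming the parameters and gradients dictionaries use the W1/b1/dW1/db1 naming used above:

```python
def update_parameters(parameters, grads, learning_rate):
    """W[l] := W[l] - alpha * dW[l] and b[l] := b[l] - alpha * db[l] for every layer l."""
    L = len(parameters) // 2                      # number of layers
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * grads["db" + str(l)]
    return parameters
```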

We now have all the expected helper functions. First we need to download the training and testing dataset files; please download them from the link below.

This is the code for the 2-layer neural network. The accuracy on the test data is 72%; it can be increased to around 80% if you use a 4-layer network. Your homework is to implement the 4-layer network; all the helper functions are available.
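As a rough outline (not the notebook's actual code), the helpers sketched throughout this post could be wired together into the LINEAR -> RELU -> LINEAR -> SIGMOID model like this; it assumes initialize_parameters, linear_activation_forward, compute_cost, linear_backward, sigmoid_backward, relu_backward and update_parameters as defined in the earlier sketches:

```python
import numpy as np

def two_layer_model(X, Y, layer_dims, learning_rate=0.0075, num_iterations=2500):
    """Train a LINEAR -> RELU -> LINEAR -> SIGMOID network with gradient descent."""
    n_x, n_h, n_y = layer_dims
    parameters = initialize_parameters(n_x, n_h, n_y)
    for i in range(num_iterations):
        # Forward pass: LINEAR -> RELU -> LINEAR -> SIGMOID
        A1, cache1 = linear_activation_forward(X, parameters["W1"], parameters["b1"], "relu")
        A2, cache2 = linear_activation_forward(A1, parameters["W2"], parameters["b2"], "sigmoid")
        cost = compute_cost(A2, Y)

        # Backward pass: SIGMOID -> LINEAR -> RELU -> LINEAR
        dA2 = -(np.divide(Y, A2) - np.divide(1 - Y, 1 - A2))
        dZ2 = sigmoid_backward(dA2, cache2[1])
        dA1, dW2, db2 = linear_backward(dZ2, cache2[0])
        dZ1 = relu_backward(dA1, cache1[1])
        _, dW1, db1 = linear_backward(dZ1, cache1[0])

        grads = {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
        parameters = update_parameters(parameters, grads, learning_rate)
        if i % 100 == 0:
            print(f"Cost after iteration {i}: {cost:.4f}")
    return parameters
```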

I have added the Jupyter notebook file to the Google Drive dataset folder for further help.

The next part of the series is pure coding; no more theory for the next two posts. If you enjoyed this post, I'd be very grateful if you'd help it spread by sharing it on Twitter or LinkedIn.

And lastly, please share your comments and thoughts below. I'll be happy to respond.

Gracias!! Thank you!!!

Follow me: Twitter: https://twitter.com/raahul_rahl

LinkedIn: https://www.linkedin.com/in/raahuldutta/



Life is like Stochastic Gradient Descent: a little momentum can always help. @Optum R&I, ex-Oracle R&D. 3 patents in ML & IoT.