Introduction to Optimization and Gradient Descent Algorithm [Part-2].

Gradient descent is the most common method for optimization.

--

Photo by Jose Escobar on Unsplash

This is the second part of the series Optimization, In this blog post, we’ll continue to discuss the Rod Balancing problem and try to solve using Gradient descent method. In Part-1 we understood what is an optimization and tried to solve the same problem using Exhaustive Search(a gradient-free method). If you haven’t read that here is the link.

Gradient-Based Algorithms are usually much faster than gradient-free methods, the whole objective here is to always try to improve over time i.e., somehow try to make the next step which results in a better solution than previous one. The Gradient Descent Algorithm is one of the most well-known gradients based algorithm. The Decision Variables used here are continuous ones since it gives more accurate gradients(slope) at any given point on the curve.

So, to solve our problem with gradient descent we’ll reframe our Objective function to a minimization problem. To do this we’ll make an assumption and define our cost function(it is also sometimes known as loss function or error function). We will assume that the best solution would be the one which can balance the rod for at least 10 seconds (let’s state this assumption as ‘y’). The cost function at its base is the function which returns the difference of the actual output and desired output. For our problem, the cost function would become:

the Cost function for Rod Balancing problem

Note: we squared the difference to avoid negative values or you can just take absolute value, either will work.

Jobs in Big Data

Now for every test result [f(x)] i.e., the time in seconds, the rod stayed on the finger, we can calculate our cost [C(x)]. So our Objective function would now change to minimizing C(x) instead of maximizing f(x), we can state the modified Objective function as,

Modified Objective function.

Since Objective function is changed now, our curve also gets inverse, i.e., on the y-axis instead of time we plot cost and will try to minimize it.

maximization to a minimization problem

Any Gradient descent based algorithm follows 3 step procedure:
1. Search direction.
2. Step size
3. Convergence check.

Once we know the error, we have to find the direction of where we should move our finger on the rod for a better solution. The direction is decided by taking the derivative of the cost function with respect to the decision variable(s). This simply means calculating slope(‘dC/dx’ ) on the curve for a specific value of the decision variable, this slope is known as the gradient. The greater the slope, the further we are from the minima(i.e., the lowest point on the curve).

For Gradient descent we apply a simple rule,

“If the slope is negative, we increase decision variable(s) and if the slope is positive, we decrease decision variable(s) with some value.”

Once we know the direction of where we want to take our variables for the next step we update them, The above rule can be easily given in mathematical term as,

Update rule

But using this update rule may overshoot the value, resulting in skipping the minima and jump to the other side of the curve. So, instead of reaching to the centre of the rod the variable may jump and go to the other corner and may introduce greater error. To avoid this we decide the step size by multiplying a very small value(usually 0.001 or 0.0001) to the gradient, this value prevents overshooting as we are not taking a very huge step. This is known as the learning rate(α), unlocks the key principle where Gradient descent shines,

“Big steps when away, small steps when closer.”

What above statement says is when the slope is greater(i.e., when the steepness is high) the variables will update with larger values and when the slope starts getting smaller(i.e., the steepness is low, reaching to the bottom) the variables will update with a very small value, This is the behaviour what we actually follow in the real world while solving this kind of problems. So our update rule now changes to,

the modified Update rule

We perform this operation of updating variable for a certain number of epochs(one complete pass to the training examples) until it converges.
Below is the video of me trying to solve the rod balancing problem using the gradient descent method, you‘ll notice how fast, compare to exhaustive search we discussed in part-1, we get the point ‘x’ on the rod where it balances perfectly.

Top 4 Most Popular Ai Articles:

1. AI for CFD: Intro (part 1)

2. Using Artificial Intelligence to detect COVID-19

3. Real vs Fake Tweet Detection using a BERT Transformer Model in few lines of code

4. Machine Learning System Design

solving Rod balancing problem by Gradient descent method

How the value of x converges to the minima, can be shown by the graph below.

‘w’ is nothing but ‘x’ in our rod balancing problem

There are different variants of Gradient descent, the one we looked at in this post is Stochastic Gradient Descent(SGD). SGD performs the update operation on variables after each example/test, just like we did in our problem. There are also other variants such as :

Batch or Mini-batch Gradient Descent: In this, the update operation is performed after each iteration of the training examples, iteration may consist of ‘m’ training examples. In batch gradient descent m is the total training examples whereas in mini-batch gradient descent total training examples are divided into batch-size(usually 32 or 64) and this batch-size will become m training examples for the update operation. An epoch will be considered complete only when all the batches have completed their update operation. For this, the most common cost function we use is Mean Squared Error(MSE), it is almost same as what we have used in our problem. The things which differ are, we add all individual errors for ‘m’ examples and divide by ‘2m’. It is given as,

Mean Squared Error

There are also different Gradients Descent Algorithms, few important ones are listed below. They do provide better performance but underlying concepts remains the same.

  1. Momentum
  2. Adagrad
  3. RMSprop
  4. Adam

Refer to this article, it explains the performance difference and how they vary from each other.

While selecting the optimization algorithm it is always advised to select based on their popularity. But, this is not always true, look at the image below and decide what suits best for your problem.

Congratulations!!! If you have made this far you must be having a comprehensive understanding of Gradient Descent Algorithm by now. Let me know if you found this 2 part series helpful by leaving response or giving a clap. I will cover more Machine Learning foundation topics in future, to get notified do follow me here.

Don’t forget to give us your 👏 !

--

--