Linear Regression from Scratch: Machine Learning Approach

The “Hello world!” of machine learning from scratch.

Published in Becoming Human: Artificial Intelligence Magazine · 7 min read · Sep 22, 2020

Hi! Ardi here. In my previous post I wrote about the statistical approach to linear regression, which uses only a handful of formulas to fit a straight line that estimates the dependent variable y from the training data x. Click the link below if you want to read that article.

Linear Regression from Scratch: Statistical Approach

Let’s do this without optimizer. Just a bunch of statistical formulas.

medium.com

Today I want to do the same thing using a machine learning approach. The statistical approach needs no optimization algorithm, because the coefficients come straight from formulas. The machine learning approach, on the other hand, relies on one. Here I use gradient descent, one of the simplest optimizers, to minimize the MSE (Mean Squared Error). The dataset is exactly the same one I used in the previous post, and it can be downloaded from here.

Note: I put the full code at the end of this post.

Let’s start by loading the required modules:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

As you can see, we do not import scikit-learn, since we are going to do all the calculations from scratch.

Regression line

Before we go deeper into the machine learning part, it’s important to know that the regression line is just a linear function, hence the name linear regression. The equation can be written like this:

$$\hat{y} = mx + b$$

Linear regression equation.

Here x represents all the samples in the dataset. Notice that I write y_hat (instead of just y), since the line gives predicted values, not the actual targets. The main objective of linear regression is to find the values of m and b, which are the slope and the y-intercept respectively. In the statistical approach, we can apply a formula directly to compute those unknowns. Here in machine learning, however, we start by assigning an initial (typically random) value to both variables and then improve m and b step by step with the help of an error/loss function and an optimization algorithm. The idea is to let the optimizer reduce the error gradually.
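As a small illustration of the line equation, here is a minimal sketch of a prediction helper. The function name `predict` and the demo values are my own choices for illustration; they are not part of the original project.

```python
import numpy as np

def predict(x, m, b):
    """Evaluate the regression line y_hat = m*x + b for every sample in x."""
    return m * np.asarray(x, dtype=float) + b

# Hypothetical values, chosen only for illustration
x_demo = np.array([1.0, 2.0, 3.0])
y_hat_demo = predict(x_demo, m=2.0, b=0.5)
print(y_hat_demo)  # [2.5 4.5 6.5]
```

Because NumPy broadcasts the scalars m and b over the whole array, one expression predicts all samples at once.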

Loss function: MSE (Mean Squared Error)

Before minimizing the error, we first have to know what our error function looks like. Here I use what is called the MSE:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

The function above is pretty simple. Here y, y_hat and n represent the actual y, the predicted y and the number of samples in our dataset, and i denotes the i-th sample. Since the prediction y_hat comes from our regression line, we can substitute it with the linear function:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - (mx_i + b)\right)^2$$

Our problem is that we do not yet have the values of m and b that minimize this error. So in the next step we use the gradient descent algorithm to update m and b gradually.
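To make the dependence of the loss on m and b concrete, here is a minimal sketch that evaluates the MSE for two candidate lines. The helper name `mse` and the three-point toy data are assumptions for illustration, not the article's dataset.

```python
import numpy as np

def mse(y, y_hat):
    """Mean of the squared differences between actual and predicted values."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

# Toy data (an assumption): points lying exactly on y = 2x + 1
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([3.0, 5.0, 7.0])

loss_zero = mse(y_demo, 0 * x_demo + 0)  # line with m = 0, b = 0: (9 + 25 + 49) / 3
loss_true = mse(y_demo, 2 * x_demo + 1)  # line with m = 2, b = 1: 0
print(loss_zero, loss_true)
```

The flat line at zero gives a large error, while the line that passes through every point gives zero. Gradient descent is the procedure that moves m and b from the first situation toward the second.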

Gradient descent algorithm

There are several steps that we need to do to run this algorithm:

First: initialize m and b. I mentioned earlier that these two variables start from random numbers. To keep things simple, however, I assign 0 to both as the initial value.

m = 0
b = 0

If we plot our line with m=0 and b=0, we get a flat line at y = 0:

# x and y are assumed to hold the hours and score columns of the dataset
x_line = np.linspace(0,10,100)
y_line = m*x_line + b

plt.figure(figsize=(8,6))
plt.title('Data distribution')
plt.scatter(x, y, s=10)
plt.plot(x_line,y_line, c='r')
plt.xlabel('hours')
plt.ylabel('score')
plt.show()

Second: define the learning rate L and the number of epochs. The learning rate scales the size of each update step, so it controls how fast gradient descent reduces the error from one epoch (iteration) to the next. It is generally a small number; here I set it to 0.001. Keep in mind that a small L slows down training (we might need more epochs), while a large L may make the updates overshoot, so that the algorithm fails to reach the minimum error.

L = 0.001
epochs = 100
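The trade-off described above can be seen on a tiny example. The sketch below runs the same gradient descent loop that this post builds in the following steps, with three different learning rates. The function name `final_loss`, the toy data, and the learning rates 0.05 and 0.5 are assumptions chosen for illustration; only 0.001 comes from this post.

```python
import numpy as np

def final_loss(x, y, lr, epochs):
    """Run plain gradient descent from m = b = 0 and return the final MSE."""
    m, b = 0.0, 0.0
    n = float(x.shape[0])
    for _ in range(epochs):
        residual = y - (m * x + b)
        dm = (-2 / n) * np.sum(x * residual)
        db = (-2 / n) * np.sum(residual)
        m -= lr * dm
        b -= lr * db
    return np.mean((y - (m * x + b)) ** 2)

# Toy data (an assumption, not the article's dataset): y = 2x + 1
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = 2 * x_toy + 1

initial_loss = np.mean(y_toy ** 2)  # loss of the starting line m = b = 0
loss_small = final_loss(x_toy, y_toy, lr=0.001, epochs=50)
loss_medium = final_loss(x_toy, y_toy, lr=0.05, epochs=50)
loss_large = final_loss(x_toy, y_toy, lr=0.5, epochs=50)

print('initial  :', initial_loss)
print('L = 0.001:', loss_small)   # lower than the initial loss, but still far from 0
print('L = 0.05 :', loss_medium)  # much closer to 0 after the same number of epochs
print('L = 0.5  :', loss_large)   # larger than the initial loss: the updates overshoot
```

Which learning rates are safe depends on the scale of x, so these particular thresholds apply only to this toy data.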

Third: calculate the partial derivatives of our loss function with respect to m and b. Here I store those derivatives in dm and db.

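$$\frac{\partial \mathrm{MSE}}{\partial m} = -\frac{2}{n}\sum_{i=1}^{n} x_i\left(y_i - \hat{y}_i\right)$$

Derivative of MSE with respect to m.

$$\frac{\partial \mathrm{MSE}}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)$$

Derivative of MSE with respect to b.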
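One way to convince yourself that the two derivatives are right is to compare them with a numerical estimate. The sketch below does that with central finite differences. The function names, the toy data and the test point (m = 0.5, b = -1) are assumptions for illustration.

```python
import numpy as np

def mse_loss(m, b, x, y):
    """MSE of the line y_hat = m*x + b on the data (x, y)."""
    return np.mean((y - (m * x + b)) ** 2)

def mse_gradients(m, b, x, y):
    """Analytic partial derivatives of the MSE with respect to m and b."""
    n = float(x.shape[0])
    residual = y - (m * x + b)
    dm = (-2 / n) * np.sum(x * residual)
    db = (-2 / n) * np.sum(residual)
    return dm, db

# Toy data and an arbitrary point (m0, b0), both chosen for illustration
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])
m0, b0 = 0.5, -1.0
eps = 1e-6

dm, db = mse_gradients(m0, b0, x_demo, y_demo)

# Central differences: nudge one parameter at a time and measure the slope
dm_numeric = (mse_loss(m0 + eps, b0, x_demo, y_demo)
              - mse_loss(m0 - eps, b0, x_demo, y_demo)) / (2 * eps)
db_numeric = (mse_loss(m0, b0 + eps, x_demo, y_demo)
              - mse_loss(m0, b0 - eps, x_demo, y_demo)) / (2 * eps)

print('analytic :', dm, db)
print('numeric  :', dm_numeric, db_numeric)
```

The two rows should agree up to small rounding differences. Both derivatives are negative at this point, which means increasing m and b would reduce the error.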

Fourth: update m and b using both derivatives and the learning rate: m becomes m - L*dm and b becomes b - L*db. Note that the third and fourth steps are repeated in every epoch.
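To see what one pass through the third and fourth steps does, here is a single update worked out on three toy points. The data are an assumption for illustration; the learning rate 0.001 and the starting values m = b = 0 match the ones used in this post.

```python
import numpy as np

# Toy data (an assumption): points lying exactly on y = 2x + 1
x_toy = np.array([1.0, 2.0, 3.0])
y_toy = np.array([3.0, 5.0, 7.0])

m, b, L = 0.0, 0.0, 0.001
n = float(x_toy.shape[0])

# Third step: derivatives at the current m and b
y_hat = m * x_toy + b                            # all zeros at the start
dm = (-2 / n) * np.sum(x_toy * (y_toy - y_hat))  # -2/3 * 34, about -22.667
db = (-2 / n) * np.sum(y_toy - y_hat)            # -2/3 * 15 = -10

# Fourth step: move against the gradient, scaled by the learning rate
m_new = m - L * dm  # about 0.022667
b_new = b - L * db  # 0.01
print(m_new, b_new)
```

Both values move a small step upward, toward the m = 2 and b = 1 that generated the points. Repeating this many times is the whole training loop.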

Implementation

Now that we have the idea of how gradient descent works, we can implement it in code. All the code below follows the mathematical notation defined earlier.

# The number of samples in the dataset
# (plain float() is used because np.float was removed in NumPy 1.24)
n = float(x.shape[0])

# An empty list to store the error in each epoch
losses = []

for i in range(epochs):
    yhat = m*x + b
    
    # Keeping track of the error decrease
    mse = (1/n) * np.sum((y - yhat)**2)
    losses.append(mse)
    
    # Derivatives
    dm = (-2/n) * np.sum(x * (y - yhat))
    db = (-2/n) * np.sum(y - yhat)

    # Values update
    m = m - L*dm
    b = b - L*db

After the training process is done, we can print out the new values of m and b. We can see that those values have been updated by the gradient descent algorithm.
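A quick sanity check is to compare the trained m and b with the closed-form least-squares fit, which is what the statistical approach computes. The sketch below does this on synthetic data, since the dataset itself is not reproduced here. The data, the seed, and the settings L = 0.01 and 5000 epochs are all assumptions, picked so that gradient descent has enough steps to converge on this particular data.

```python
import numpy as np

# Synthetic data (an assumption): a noisy line y = 2x + 1
rng = np.random.default_rng(0)
x_syn = np.linspace(0, 10, 50)
y_syn = 2.0 * x_syn + 1.0 + rng.normal(0.0, 0.5, size=x_syn.shape)

# Same gradient descent loop as above, with more steps than in the post
m, b = 0.0, 0.0
L, epochs = 0.01, 5000
n = float(x_syn.shape[0])
for _ in range(epochs):
    residual = y_syn - (m * x_syn + b)
    m -= L * (-2 / n) * np.sum(x_syn * residual)
    b -= L * (-2 / n) * np.sum(residual)

# Closed-form least-squares line; np.polyfit returns [slope, intercept]
m_ls, b_ls = np.polyfit(x_syn, y_syn, 1)

print('gradient descent :', m, b)
print('least squares    :', m_ls, b_ls)
```

With these settings the two pairs should match closely. With fewer epochs or a smaller learning rate the intercept in particular can still be some way from the least-squares value, because it converges more slowly than the slope.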

If we display the regression line with these updated values, we should get the following output:

x_line = np.linspace(0,10,100)
y_line = m*x_line + b

plt.figure(figsize=(8,6))
plt.title('Data distribution')
plt.plot(x_line, y_line, c='r')
plt.scatter(x, y, s=10)
plt.xlabel('hours')
plt.ylabel('score')
plt.show()

We can see that our algorithm works well: it now produces a line that approximates the data points in our dataset. In other words, this regression line gives a much smaller error than our initial line with m = b = 0. Here’s the code if you want to see how the error decreases as training goes on.

plt.title('Loss values')
plt.plot(losses)
plt.ylabel('loss')
plt.xlabel('epoch')
plt.show()
print('Initial loss\t:', losses[0])
print('Final loss\t:', losses[-1])