Step-by-Step Guide to Regression Analysis

We all know that applied statistics is closely related to machine learning, yet we often find ourselves writing code and pulling commands from ML libraries at random without knowing why. If it works, it works, right? I disagree: I think it's crucial that we truly understand these core concepts before we dive into this domain. My search for a proper lesson led me to this excellent book on one such topic: regression analysis. It's a prevalent topic in the ML sphere and fairly easy to grasp. I learned a great deal from the author and would like to share some of that knowledge in the simplest possible way.

What is Regression Analysis?

Regression analysis is a statistical method for establishing a relationship between predictor variables and a response variable. To demonstrate with an example, let's take Y as the response variable and X₁, X₂, X₃ as predictor variables. A regression model defines the relationship between these variables. For instance, say we want to predict the sale price of an apartment (Y). It is affected by many factors, such as location (X₁), size (X₂) and tax (X₃). We want to define and refine a model, or regression equation, that represents the relationship among these variables and measures how ‘Y’ changes with each unit of change in the ‘X’ variables.

One might ask how this differs from correlation. Simply put, correlation measures the strength and direction of the association between two variables, without treating either one as dependent on the other. The more you practice, the more likely you are to master your art. Is that a positive correlation? Absolutely: there is a positive dependence between the two variables.

Regression, by contrast, determines a functional relationship that describes how the dependent variable (Y) changes as the independent variables (X) change. This allows us to estimate or predict future values. An example of linear regression is as follows.

Linear Regression: X and Y increase proportionally

In the graph above, Y increases linearly and proportionally as X increases. Mathematically, this can be written as Y = mX + c, where m is the slope of the line and c is the y-intercept.
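To make the contrast between correlation and regression concrete, here is a minimal Python sketch. The numbers (hours practised against a skill score) are made up for illustration, and the helper names pearson_r and least_squares_slope are my own. The idea it tries to show: rescaling Y leaves the correlation unchanged, because correlation only measures the strength of the association, while the regression slope scales with the units, because it describes how much Y moves per unit of X.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def least_squares_slope(xs, ys):
    """Slope of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x

# Made-up data: hours practised vs. a skill score.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 71]
score_x10 = [10 * s for s in score]  # same scores in different units

r_original = pearson_r(hours, score)
r_rescaled = pearson_r(hours, score_x10)
m_original = least_squares_slope(hours, score)
m_rescaled = least_squares_slope(hours, score_x10)

print(f"correlation: {r_original:.3f} vs {r_rescaled:.3f}")
print(f"slope:       {m_original:.3f} vs {m_rescaled:.3f}")
```

The correlation tells us how tightly the points follow a line; only the regression gives us an equation we can plug a new X into.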

Steps in Regression Analysis

Regression Analysis is an analytical process whose end goal is to understand the inter-relationships in the data and find as much useful information as possible. According to the book, there are a number of steps which are loosely detailed below.

1. Problem definition

The very first step is, of course, to define the problem we are trying to solve. It may be a business question that needs answering, or simply a prediction we want to make from some set of data. At this stage we must identify the target variable and the attributes we presume affect it. These presumptions are analysed later to judge their credibility.

For the sake of our discussion, let's take the Titanic dataset as an example. It contains data on about 900 passengers. The problem we must solve is predicting which passengers likely survived the tragedy, given their data.

A look at the Titanic Dataset

So now we know that ‘Survival’ is the response variable. But of the 10 attributes given for each passenger, how do we determine which of these predictor variables affect the result? That's where data analysis comes in.

2. Analyse Data

Graphs and charts

The key is to have visual representations of our data so we can better understand the ‘inter-relationships’ of the variables. Accordingly, the book I referred to earlier highly recommends using visual tools to make the EDA (Exploratory Data Analysis) process easier.

For the aforementioned dataset, we could try answering a number of questions that might give us a better understanding of the problem at hand.

What's the survival rate of passengers from each class?

How about the survival rate based on gender?

How about the correlation of all the attributes?

Heatmap showing correlation

Finding correlations is an important step, as it lets us roughly pick the attributes that are related to the response variable. We are most likely to pick the attributes that show a strong correlation with the target variable, whether positive or negative.
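The questions above can be sketched in a few lines of plain Python. Note that the ten records below are hypothetical toy data that I made up, not rows from the real Titanic dataset, and the field names (pclass, sex, survived) are simply my choice for this sketch. It computes the survival rate per group and the correlation between passenger class and survival.

```python
import math
from collections import defaultdict

# Hypothetical toy records, NOT the real Titanic data.
passengers = [
    {"pclass": 1, "sex": "female", "survived": 1},
    {"pclass": 1, "sex": "male", "survived": 1},
    {"pclass": 1, "sex": "male", "survived": 0},
    {"pclass": 2, "sex": "female", "survived": 1},
    {"pclass": 2, "sex": "male", "survived": 0},
    {"pclass": 3, "sex": "female", "survived": 1},
    {"pclass": 3, "sex": "female", "survived": 0},
    {"pclass": 3, "sex": "male", "survived": 0},
    {"pclass": 3, "sex": "male", "survived": 0},
    {"pclass": 3, "sex": "male", "survived": 0},
]

def survival_rate_by(records, key):
    """Return {group value: fraction of that group with survived == 1}."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for row in records:
        totals[row[key]] += 1
        survivors[row[key]] += row["survived"]
    return {group: survivors[group] / totals[group] for group in totals}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rate_by_class = survival_rate_by(passengers, "pclass")
rate_by_sex = survival_rate_by(passengers, "sex")
class_survival_corr = pearson_r(
    [p["pclass"] for p in passengers],
    [p["survived"] for p in passengers],
)

print("survival rate by class:", rate_by_class)
print("survival rate by sex:  ", rate_by_sex)
print("corr(pclass, survived):", round(class_survival_corr, 3))
```

In this toy data the correlation between class number and survival comes out negative, which still signals a relationship: a higher class number goes with a lower chance of survival.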

From this section we can deduce that plotting graphs is vital for the next step, which is choosing a model. Graphs used before model fitting include histograms, boxplots, stem-and-leaf displays, scatter plots and so on.

3. Model Selection

Based on the data, we pick a suitable model or regression equation. You may be familiar with many such models, like Linear Regression, Support Vector Machine, Random Forest etc. The task in this step is to pick the one we assume will best express the relationships in our data. This assumption can later be accepted or refuted based on analysis after fitting the model.

4. Model Fitting

For simplicity's sake, let's consider linear regression: Y = mX + c. We have the data and we have a model. At this stage we train the model on the given dataset, but what about the parameters of this equation?

We must estimate these parameters when fitting the model, and they can be optimised with many algorithms. Perhaps this is where terms like ‘Gradient Descent’ or ‘Adam optimiser’ ring a bell. The purpose of an optimiser is simply to update the parameter values in every training iteration so as to minimise the loss, or error. This is the part where our model learns to correct itself and arrive at a best-fitting solution that is likely to have high accuracy.
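Here is a bare-bones sketch of what such an optimiser does, using plain gradient descent (not Adam) on the mean squared error of Y = mX + c. The data, the learning rate and the number of epochs are arbitrary choices of mine that happen to work for this toy example; real problems usually need some tuning.

```python
def gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = m*x + c by minimising mean squared error with gradient descent."""
    m, c = 0.0, 0.0  # start from an arbitrary guess
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every data point under the current m and c.
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. m and c.
        grad_m = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (2 / n) * sum(errors)
        # Step against the gradient to reduce the error.
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c

# Toy data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

m, c = gradient_descent(xs, ys)
print(f"m = {m:.4f}, c = {c:.4f}")
```

Each pass nudges m and c a little in the direction that lowers the error, which is all that "the model learns" means here.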

For a simple model like linear regression, we can use the least squares method to estimate the parameters ‘m’ (slope) and ‘c’ (y-intercept) and obtain the best-fit line, the one that lies as close as possible to the data points overall. The least squares method minimises the sum of the squared errors between the observed and predicted values. It works best when the data contain no outliers, since squaring gives large errors a disproportionate influence on the fit.
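For a single predictor, the least squares estimates have a simple closed form: m is the sum of (x − mean of x)(y − mean of y) divided by the sum of (x − mean of x)², and c is (mean of y) − m × (mean of x). The sketch below applies it to made-up data, once as-is and once with a single outlier, to show how strongly one bad point can pull the line. The function name least_squares is my own.

```python
def least_squares(xs, ys):
    """Closed-form least-squares estimates (m, c) for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Toy data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m_clean, c_clean = least_squares(xs, ys)
print(m_clean, c_clean)  # 2.0 1.0

# Same data, but the last point is replaced by an outlier (31 instead of 11).
ys_outlier = [3, 5, 7, 9, 31]
m_out, c_out = least_squares(xs, ys_outlier)
print(m_out, c_out)  # 6.0 -7.0
```

One outlier out of five points triples the slope here, which is why it pays to look for outliers before fitting.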

5. Model evaluation

The final step is model evaluation: measuring and critiquing how well the model fits the data points. We run the model on the test data and check how accurately it predicts the output values. There are a number of measures for this, as discussed below:

i) We can find the RMSE (root mean squared error) between the actual Y values and the predicted Y values. There are other variations of it that can be explored.

Formula for RMSE

ii) We can calculate the R-squared value, which measures goodness of fit as the proportion of the variance in Y that the model explains. It typically ranges from 0 to 1, where the ideal value is 1.

Formula to find R-squared value

iii) We can perform cross-validation to assess which of a few chosen models performs best for our given problem.

iv) We can test the statistical significance of the parameters. This involves stating a null hypothesis, an alternative hypothesis and an alpha level (the probability of wrongly rejecting a true null hypothesis that we are willing to accept). An example is the chi-squared test, which tests whether there is any relationship between two categorical variables.

Formula for Chi-Square Statistical Test
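As a rough sketch of the chi-squared test of independence, the statistic is the sum over all cells of (observed − expected)² / expected, where the expected count of a cell is (row total × column total) / grand total. The 2×2 table below uses hypothetical counts that I made up, and the sketch applies no continuity correction. The critical value 3.841 is the standard chi-squared cutoff for 1 degree of freedom at alpha = 0.05.

```python
def chi_squared_statistic(observed):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count we would expect in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows = sex (female, male), columns = (survived, died).
table = [
    [30, 10],
    [10, 30],
]

statistic = chi_squared_statistic(table)
CRITICAL_VALUE = 3.841  # chi-squared cutoff for 1 degree of freedom, alpha = 0.05
reject_null = statistic > CRITICAL_VALUE

print(statistic)    # 20.0
print(reject_null)  # True
```

Because 20.0 is well above 3.841, we would reject the null hypothesis that the two variables are unrelated in this made-up table.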

There are many other methods, some more complex than others, but these are usually a good place to start. Based on this analysis, the model is updated and perfected, after which it can be used for its intended purpose.
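Measures (i) to (iii) can be sketched together in plain Python. RMSE is the square root of the mean squared difference between actual and predicted values, and R-squared is 1 − (sum of squared residuals) / (total sum of squares around the mean). The data and the two candidate models (a least-squares line and a baseline that always predicts the training mean) are my own made-up example, and the folds are taken by interleaving indices rather than shuffling, which is a simplification.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return lambda x: m * x + c

def fit_mean(xs, ys):
    """Baseline that ignores x and always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def cross_validated_rmse(fit, xs, ys, k=5):
    """Average held-out RMSE over k folds (fold f holds out indices i with i % k == f)."""
    scores = []
    for fold in range(k):
        pairs = list(enumerate(zip(xs, ys)))
        train = [xy for i, xy in pairs if i % k != fold]
        test = [xy for i, xy in pairs if i % k == fold]
        predict = fit([x for x, _ in train], [y for _, y in train])
        scores.append(rmse([y for _, y in test], [predict(x) for x, _ in test]))
    return sum(scores) / k

# Made-up data: roughly y = 2x + 1 with a small alternating wiggle.
xs = list(range(10))
ys = [2 * x + 1 + (0.5 if x % 2 == 0 else -0.5) for x in xs]

line = fit_line(xs, ys)
predictions = [line(x) for x in xs]
fit_rmse = rmse(ys, predictions)
fit_r2 = r_squared(ys, predictions)

cv_line = cross_validated_rmse(fit_line, xs, ys)
cv_mean = cross_validated_rmse(fit_mean, xs, ys)

print(f"RMSE = {fit_rmse:.3f}, R-squared = {fit_r2:.3f}")
print(f"cross-validated RMSE: line = {cv_line:.3f}, mean baseline = {cv_mean:.3f}")
```

The model with the lower cross-validated RMSE is the one that generalises better on this data, which is the comparison step (iii) is after.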

Conclusion

We’ve made it to the end of the article, finally! If you’ve stuck around until now, I hope I was able to explain the key concepts in Regression Analysis. Feel free to ask any questions about the topic; discussions and suggestions are highly appreciated.

Moreover, here is the link to the book I was referring to: Regression Analysis by Example. The code for the Titanic disaster survival prediction is available on my GitHub. Have a great day!
