A Short Machine Learning Explanation — in terms of Linear Algebra, Probability and Calculus
Linear Algebra—
Tensor
In some cases we will need an array with more than two axes.
In the general case, an array of numbers arranged on a regular grid with a
variable number of axes is known as a tensor.
Strictly speaking, tensors and multidimensional arrays are different kinds of object: a tensor is a type of function, while a multidimensional array is a data structure suitable for representing a tensor in a particular coordinate system.
Scalar
A scalar is just a single number, in contrast to most of the other
objects studied in linear algebra, which are usually arrays of multiple numbers.
In tensor terms, a tensor that contains only one number is called a scalar (or scalar tensor, 0-dimensional tensor, or 0D tensor).
Vector
A vector is an array of numbers or a 1D Tensor.
Note: a 5D vector is different from a 5D tensor. A 5D vector is a 1D tensor with 5 elements. The dimensionality of a vector refers to the number of entries along its single axis, whereas the dimensionality of a tensor refers to its number of axes.
Matrix
A matrix is a 2D array of numbers, or a 2D tensor. A matrix has two axes (often referred to as rows and columns). You can visually interpret a matrix as a rectangular grid of numbers.
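The distinctions above can be made concrete with NumPy, where the number of axes of an array is exposed as `ndim` (a minimal sketch; the specific values are arbitrary examples):

```python
import numpy as np

# A scalar: a 0D tensor (zero axes)
scalar = np.array(7)

# A vector: a 1D tensor. This one is a "5-dimensional vector"
# (5 entries along its single axis), but still only a 1D tensor.
vector = np.array([1, 2, 3, 4, 5])

# A matrix: a 2D tensor with two axes (rows and columns)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# A 3D tensor: an array with three axes
tensor3d = np.zeros((2, 3, 4))

print(scalar.ndim, vector.ndim, matrix.ndim, tensor3d.ndim)
```

Note how the 5-element vector reports `ndim == 1`: vector dimensionality (entry count) and tensor dimensionality (axis count) are different things.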
Transformation
A transformation converts an input from one set (the domain) to an output in the same or another set (the range).
A transformation is nothing but a function. In linear algebra, transformations are functions that operate on vectors.
For example:-
Consider the function f(x)= x+1.
We can think of f as a function (a relation, or transformation) that maps x to x + 1. Functions can also be viewed as transformations that transform inputs into outputs.
A linear transformation is a transformation T for which the following holds for all vectors u, v and any scalar c:
T(u + v) = T(u) + T(v)
T(cu) = cT(u)
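Any matrix defines a linear transformation via matrix-vector multiplication, so both linearity properties can be checked numerically (a minimal sketch; the matrix A and the test vectors here are arbitrary examples):

```python
import numpy as np

# An arbitrary example matrix; T(x) = A @ x is a linear transformation
A = np.array([[2.0, 0.0],
              [1.0, 3.0]])

def T(x):
    return A @ x

u = np.array([1.0, -2.0])
v = np.array([0.5, 4.0])
c = 3.0

# Additivity: T(u + v) == T(u) + T(v)
assert np.allclose(T(u + v), T(u) + T(v))

# Homogeneity: T(c * u) == c * T(u)
assert np.allclose(T(c * u), c * T(u))
```

By contrast, the earlier example f(x) = x + 1 is not linear: f(2x) = 2x + 1, which is not 2·f(x) = 2x + 2.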
A machine-learning model transforms its input data into meaningful outputs, a process that is “learned” from exposure to known examples of inputs and outputs. Therefore, the central problem in machine learning and deep learning is to meaningfully transform data: in other words, to learn useful representations of the input data at hand — representations that get us closer to the expected output.
Probability —
Probabilistic Supervised Learning
Most supervised learning algorithms are based on estimating a probability distribution p( y | x). We can do this simply by using maximum
likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).
The maximum likelihood estimator can be generalized to the case where our goal is to estimate a conditional probability p(y | x; θ) in order to predict y given x. This is actually the most common situation, because it forms the basis for most supervised learning.
Linear Regression as Maximum Likelihood
Mean Squared Error -
MSE = (1/m) Σᵢ ||ŷ⁽ⁱ⁾ − y⁽ⁱ⁾||²
Assuming the model outputs the mean of a Gaussian, p(y | x) = N(y; ŷ(x; w), σ²), the conditional log-likelihood is given by -
Σᵢ log p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ) = −m log σ − (m/2) log(2π) − Σᵢ ||ŷ⁽ⁱ⁾ − y⁽ⁱ⁾||² / (2σ²)
Maximizing the log-likelihood with respect to w yields the same estimate of the parameters w as does minimizing the mean squared error.
The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure.
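This equivalence can be observed numerically: the least-squares weight for a linear model both minimizes the MSE and maximizes the Gaussian log-likelihood. A minimal sketch on synthetic data (the data-generating slope 2.0 and noise scale 0.5 are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = 2x + Gaussian noise
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

def mse(w):
    return np.mean((X[:, 0] * w - y) ** 2)

def gaussian_log_likelihood(w, sigma=0.5):
    # Sum of log N(y_i; w * x_i, sigma^2) over the dataset
    resid = y - X[:, 0] * w
    m = len(y)
    return (-m * np.log(sigma)
            - m / 2 * np.log(2 * np.pi)
            - np.sum(resid ** 2) / (2 * sigma ** 2))

# The MSE-minimizing weight, via least squares
w_hat = np.linalg.lstsq(X, y, rcond=None)[0][0]

# The same weight also maximizes the log-likelihood:
for w_other in (w_hat - 0.5, w_hat + 0.5):
    assert mse(w_hat) < mse(w_other)
    assert gaussian_log_likelihood(w_hat) > gaussian_log_likelihood(w_other)
```

The two criteria take different values at the optimum, but the optimum itself is at the same w.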
Calculus —
Derivative
The derivative is the instantaneous rate of change, or the slope of the tangent to a curve.
We can approximate the slope of a curve at a point by taking Δy/Δx. As both deltas become infinitesimally small, we get a more accurate value for the slope. In calculus we denote this as dy/dx,
which means an infinitesimally small change in "y" over an infinitesimally small change in "x" — also called the derivative.
When we take the derivative of a function f(x), we get another function f'(x). To get the slope of the tangent at a particular point, substitute that point into the function f'(x).
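The Δy/Δx idea translates directly into code: shrink the deltas and the ratio approaches the true derivative. A minimal sketch using f(x) = x², whose derivative is f'(x) = 2x:

```python
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    # Central difference: (f(x + h) - f(x - h)) / (2h) approximates
    # the slope of the tangent at x as h becomes very small
    return (f(x + h) - f(x - h)) / (2 * h)

# Analytically, f'(x) = 2x, so the slope at x = 3 is 6
approx = numerical_derivative(f, 3.0)
assert abs(approx - 6.0) < 1e-6
```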
Gradient Descent
Suppose we have a function y = f (x).
The derivative of this function is denoted as f’(x) . The derivative f’(x) gives the slope of f(x) at the point x. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output.
The derivative is therefore useful for minimizing a function because it tells
us how to change x in order to make a small improvement in y.
The gradient points directly uphill, and the negative gradient points directly downhill. We can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.
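Gradient descent in one dimension is just the update x ← x − η·f'(x), where η is a small learning rate. A minimal sketch minimizing f(x) = (x − 4)², whose minimum is at x = 4 (the learning rate 0.1 and 100 iterations are arbitrary choices for illustration):

```python
def grad_f(x):
    # f(x) = (x - 4)**2, so f'(x) = 2 * (x - 4); minimum at x = 4
    return 2 * (x - 4)

x = 0.0                # initial guess
learning_rate = 0.1
for _ in range(100):
    # Step in the direction of the negative gradient (downhill)
    x = x - learning_rate * grad_f(x)

assert abs(x - 4.0) < 1e-3
```

Each step moves x a little way downhill; repeated steps converge toward the minimizer.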
Chain Rule
If a variable z depends on the variable y, which itself depends on the variable x, so that y and z are therefore dependent variables, then z depends on x as well.
i.e., if z = f(y) and y = g(x), then the chain rule states:
dz/dx = (dz/dy) · (dy/dx)
The chain rule is used to find the gradients of the weights in a neural network. Applying the chain rule in a neural network to compute these gradients is termed backpropagation.
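A tiny worked example makes the forward/backward split of backpropagation concrete. Take z = y² and y = 3x + 1 (arbitrary toy functions for illustration): the forward pass computes y and z, and the backward pass multiplies local derivatives per the chain rule:

```python
# z = f(y) = y**2 and y = g(x) = 3x + 1,
# so by the chain rule dz/dx = (dz/dy) * (dy/dx) = (2y) * 3

def forward(x):
    y = 3 * x + 1
    z = y ** 2
    return y, z

def backward(y):
    dz_dy = 2 * y   # local derivative of z = y**2 with respect to y
    dy_dx = 3       # local derivative of y = 3x + 1 with respect to x
    return dz_dy * dy_dx

x = 2.0
y, z = forward(x)       # y = 7.0, z = 49.0
dz_dx = backward(y)     # 2 * 7 * 3 = 42.0
```

Backpropagation in a real network is this same pattern applied layer by layer: each layer's local derivative is multiplied into the gradient flowing back from the layer above.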