Calculus for Machine Learning

Derivatives - Single Variable

Derivatives are an important concept in machine learning: they are the main tool used for optimization, that is, for minimizing the error of a model. For example, a machine learning model that forecasts house prices is optimized by adjusting it to produce the best possible predictions.

Instantaneous Rate of Change

A derivative is the instantaneous rate of change of a function. Distance can be represented as a function of time, and its rate of change is called velocity. The measure of how fast the distance is changing with respect to time is the instantaneous rate of change, and it is the slope of the tangent line. In general, the instantaneous rate of change measures how fast the relation between two variables is changing at any given point.

Slopes, minima and maxima

Minima and maxima are the points of the function curve where the slope of the tangent line is 0, i.e. where the derivative is 0.

Derivatives Notations

$$ \text{function : } y = f(x) $$ $$ \text{Derivative of } f \text{ is expressed as :} $$ $$ \text{Lagrange's Notation : } f'(x) $$ $$ \text{Leibniz's Notation : } \frac{dy}{dx} = \frac{d}{dx} f(x) $$

Derivative of a Constant

The derivative of a constant is always 0, since the function value does not change: $$ \frac{\Delta y}{\Delta x} = \frac{c - c}{x_1 - x_0} = 0 $$

Derivative of a Line

The equation of a line is $ f(x) = ax + b $. The derivative of a line is the slope of the line: $ \Large{\frac{\Delta y}{\Delta x} = \frac{rise}{run} = a} $

$ \Large{\frac{\Delta{y}}{\Delta{x}} = \frac{a(x + \Delta{x}) + b - (ax + b)}{\Delta{x}}} $

$ \Large{= \frac{a\,\Delta{x}}{\Delta{x}} = a} $

Derivative of Quadratic

Quadratics : $ y = f(x) = x^2 $

Slope : $ \Large{\frac{\Delta f}{\Delta x} = \frac{f(x + \Delta{x}) - f(x)}{\Delta{x}}} $

$$ \frac{\Delta f}{\Delta x} = \frac{(x + \Delta{x})^{2} - x^{2}}{\Delta{x}} = 2x + \Delta{x} $$ $$ \rightarrow \ 2x \ \ \text{as} \ \Delta{x} \ \rightarrow \ 0 $$ $$ f'(x) = 2x $$
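To see this limit numerically, here is a minimal Python sketch; the point x = 3 and the shrinking step sizes are illustrative choices, not part of the notes:

```python
# Forward difference quotient of f(x) = x^2 at x = 3; the exact derivative is 2x = 6.
def f(x):
    return x ** 2

x = 3.0
for dx in (1.0, 0.1, 0.01, 0.001):
    slope = (f(x + dx) - f(x)) / dx   # (f(x + dx) - f(x)) / dx
    print(f"dx = {dx:<6} slope = {slope}")
# The printed slopes approach 6 as dx shrinks, matching f'(x) = 2x.
```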

Derivative of Cubic Functions

Cubic : $ y = f(x) = x^3 $

Slope : $ \Large{\frac{\Delta f}{\Delta x} = \frac{(x + \Delta{x})^{3} - x^{3}}{\Delta{x}}} $

$ = 3x^{2} + 3x\,\Delta{x} + \Delta{x}^{2} \ \rightarrow \ 3x^{2} \ \text{as} \ \Delta{x} \rightarrow 0 \text{, so } f'(x) = 3x^{2} $

Derivative of Hyperbola Functions

Hyperbola : $ y = f(x) = x^{-1} = \frac{1}{x} $

$ f'(x) = -x^{-2} $

Derivative of Power Functions

$ \large{f(x) = x^{2}} $ $ \large{f'(x) = 2x^{1}} $
$ \large{f(x) = x^{3}} $ $ \large{f'(x) = 3x^{2}} $
$ \large{f(x) = x^{-1}} $ $ \large{f'(x) = (-1)x^{-2}} $
$ \large{f(x) = x^{n}} $ $ \large{f'(x) = \frac{d}{dx} f(x) = nx^{(n - 1)}} $
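A quick symbolic check of the power rule, as a small sketch assuming SymPy is available:

```python
import sympy as sp

x = sp.symbols('x')

# Power rule d/dx x^n = n * x^(n-1), checked for the exponents listed above.
for exponent in (2, 3, -1):
    print(exponent, sp.diff(x ** exponent, x))
# Prints 2*x, 3*x**2 and -1/x**2, matching the table.
```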

Derivatives of Trigonometric Function

$$ \large{y = f(x) = \sin(x)} $$ $$ \large{f'(x) = \cos(x)} $$ $$ \large{y = f(x) = \cos(x)} $$ $$ \large{f'(x) = -\sin(x)} $$

Derivatives of Exponential Function

$$ \large{y = f(x) = e^x} $$ $$ \large{f'(x) = e^x} $$

Derivatives of Logarithmic Function

The logarithmic function $ f^{-1}(y) = \log(y) $ is the inverse of the exponential function $ f(x) = e^{x} $, because $ e^{\log(x)} = x $ and $ \log(e^{y}) = y $. In other words, $ \log(x) $ is the inverse of $ e^{x} $.

$ \large{\frac{d}{dy} f^{-1}(y) = \frac{1}{f'(f^{-1}(y))}} $

$ \large{\frac{d}{dy} \log(y) = \frac{1}{e^{\log(y)}} = \frac{1}{y}} $
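A rough numerical check of the trigonometric, exponential and logarithmic rules above, using a central difference quotient (the evaluation point x = 0.7 is an arbitrary illustrative choice):

```python
import math

def numeric_derivative(f, x, h=1e-6):
    # Central difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.7
print(numeric_derivative(math.sin, x), math.cos(x))  # d/dx sin(x) = cos(x)
print(numeric_derivative(math.exp, x), math.exp(x))  # d/dx e^x  = e^x
print(numeric_derivative(math.log, x), 1 / x)        # d/dx log(x) = 1/x
```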

Differentiable Function

A function is called differentiable at a point if the derivative exists at that point. For a function to be differentiable on an interval, the derivative has to exist at every point in the interval.

Some functions do not satisfy this property; they are called non-differentiable functions. For example, the absolute value function $ f(x) = \begin{cases} \ x & \ if \ \ x\geq 0 \\ -x & \ if \ \ x < 0 \end{cases} $ is not differentiable at the origin; a point with a sharp edge or corner like this is called a cusp. A piecewise function with a jump discontinuity is also non-differentiable. A function with a vertical tangent is non-differentiable at that point, because a vertical tangent does not have a well-defined slope.
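A small sketch of why the absolute value function fails at the cusp: the two one-sided difference quotients at x = 0 disagree (the step size is an illustrative choice):

```python
# One-sided difference quotients of f(x) = |x| at x = 0.
def f(x):
    return abs(x)

h = 1e-6
slope_from_right = (f(0 + h) - f(0)) / h   # approaches  1
slope_from_left = (f(0) - f(0 - h)) / h    # approaches -1
print(slope_from_right, slope_from_left)
# The two one-sided slopes never agree, so |x| has no derivative at x = 0.
```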

Properties of Derivatives

Multiplication by a scalar: if $ f = cg $, then $$ f' = cg' $$

Product Rule: if $ f(t) = g(t)h(t) $, then $$ f'(t) = g'(t)h(t) + g(t)h'(t) $$

Chain Rule:

$$ \frac{d}{dt} f(g(h(t))) = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dt} = f'(g(h(t))) \cdot g'(h(t)) \cdot h'(t) $$ If temperature changes with respect to height, $ \frac{dT}{dh} $, and height changes with respect to time, $ \frac{dh}{dt} $, then temperature changes with respect to time as $$ \frac{dT}{dt} = \frac{dT}{dh} \cdot \frac{dh}{dt} $$
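A numerical sanity check of the product and chain rules; the concrete functions g(t) = t^2, h(t) = sin(t) and the linear temperature/height relations are illustrative assumptions, not from the notes:

```python
import math

eps = 1e-6

def d(f, x):
    # Central difference approximation of f'(x).
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Product rule: f(t) = g(t) * h(t) with g(t) = t^2 and h(t) = sin(t).
g = lambda t: t ** 2
h = lambda t: math.sin(t)
t = 1.3
print(d(lambda u: g(u) * h(u), t), d(g, t) * h(t) + g(t) * d(h, t))  # both sides agree

# Chain rule: temperature as a function of height, height as a function of time.
T = lambda height: 20.0 - 0.0065 * height   # dT/dh = -0.0065
height = lambda time: 100.0 * time          # dh/dt = 100
print(d(lambda time: T(height(time)), 2.0), -0.0065 * 100.0)         # dT/dt = dT/dh * dh/dt
```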

Optimization

Optimization of Squared Loss

The cost function (quadratic) is $ (x - a)^{2} + (x - b)^{2} $. In order to optimize the cost function we need to find the point where the slope is 0, that is

$ \frac{d}{dx}[(x - a)^{2} + (x - b)^{2}] = 0 $

By using the chain rule we get

$ 2(x - a) + 2(x - b) = 0 $

$ 2x - a - b = 0 $

$ 2x = a + b $

$ x = \large{\frac{a + b}{2}} $

Hence the optimal solution is the midpoint of $a$ and $b$, i.e. their average.
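A minimal sketch checking this result; the values a = 2 and b = 6 are illustrative assumptions, and a brute-force grid search is compared to the closed-form midpoint:

```python
# Squared loss (x - a)^2 + (x - b)^2 with illustrative values a = 2, b = 6.
a, b = 2.0, 6.0
cost = lambda x: (x - a) ** 2 + (x - b) ** 2

x_star = (a + b) / 2                                          # optimum from the derivation above
grid_min = min((i / 100 for i in range(0, 1001)), key=cost)   # brute-force check on [0, 10]
print(x_star, grid_min)                                       # both are 4.0
```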

Generalized Squared Loss

Minimize $ (x - a_1)^{2} + (x - a_2)^{2} + ... + (x - a_n)^{2} $

Solution : $ \ x = \Large{\frac{a_1 + a_2 + a_3 + ... + a_n}{n}} $
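A quick sketch confirming that the minimizer is the mean; the sample points are an illustrative assumption:

```python
# Generalized squared loss for some illustrative points a_1, ..., a_n.
points = [1.0, 4.0, 6.0, 9.0]
loss = lambda x: sum((x - a) ** 2 for a in points)

mean = sum(points) / len(points)                                 # closed-form solution
grid_min = min((i / 1000 for i in range(0, 10001)), key=loss)    # brute-force check on [0, 10]
print(mean, grid_min)                                            # both are 5.0
```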

Optimization of log Loss

$$ \log{(g(p))} = \log{((p)^{7}(1 - p)^{3})} = \log{((p)^{7})} + \log{((1 - p)^{3})} $$ $$ = 7 \ \log{p} + 3 \ \log{(1 - p)} = G(p) $$ In machine learning the logarithm is used to turn a product, which might be a very tiny number, into a sum that is easier to work with.
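A small sketch maximizing G(p) by grid search; the exponents 7 and 3 come from the example above, and the grid is an illustrative choice:

```python
import math

# G(p) = log(p^7 (1 - p)^3) = 7 log(p) + 3 log(1 - p)
G = lambda p: 7 * math.log(p) + 3 * math.log(1 - p)

# Setting G'(p) = 7/p - 3/(1 - p) to zero gives p = 0.7; a grid search agrees.
best_p = max((i / 1000 for i in range(1, 1000)), key=G)
print(best_p)   # 0.7
```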

Multivariable Calculus

Introduction to Tangent Planes
For multivariable functions, such as functions of two variables, the concept of a tangent plane is used in place of a tangent line. Optimizing a function of two variables can get complicated, and we use several tools to optimize such functions. Gradient Descent is a widely used optimization method.
Definition of the derivative of a multivariable function
Partial Derivative with respect to x: $ f_x(a, b) = \lim_{h \to 0} \frac{f(a+h, b) - f(a, b)}{h} $
Partial Derivative with respect to y: $ f_y(a, b) = \lim_{h \to 0} \frac{f(a, b+h) - f(a, b)}{h} $
The tangent plane in two dimensions can be found by cutting the surface with planes and computing the tangent lines on these slices. We can find the tangent plane of the function $ f(x, y) = x^{2} + y^{2} $ in the following way.

We can fix the value $ y = 4 $: $ \ f(x, 4) = x^{2} + 4^{2} $

$ \frac{d}{dx}(f(x,4)) = 2x $

We can fix the value $ x = 2 $: $ \ f(2, y) = 2^{2} + y^{2} $

$ \frac{d}{dy}(f(2,y)) = 2y $

The tangent plane contains two tangent lines.
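A numerical sketch of the two slices, differentiating f(x, 4) at x = 2 and f(2, y) at y = 4 (the small step size is an illustrative choice):

```python
# f(x, y) = x^2 + y^2, sliced at y = 4 and at x = 2 as in the text.
f = lambda x, y: x ** 2 + y ** 2
eps = 1e-6

# Slope of the slice f(x, 4) at x = 2, i.e. the partial derivative w.r.t. x: expect 2x = 4.
print((f(2 + eps, 4) - f(2 - eps, 4)) / (2 * eps))

# Slope of the slice f(2, y) at y = 4, i.e. the partial derivative w.r.t. y: expect 2y = 8.
print((f(2, 4 + eps) - f(2, 4 - eps)) / (2 * eps))
```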
Partial Derivatives
While taking the derivative of a multivariable function we hold all but one variable constant and differentiate as if the function were a function of that single variable. In the case of two variables, we first treat y as constant and take the derivative of the function with respect to x, and in the second step treat x as constant and take the derivative with respect to y. $$ f(x, y) \ \rightarrow \ f_x = \frac{\partial f}{\partial x} \ \ \text{and} \ \ f_y = \frac{\partial f}{\partial y} $$
Steps of finding the partial derivative of f with respect to x.
Step 1: Treat all the other variables except x as constant.
Step 2: Differentiate the function using the normal rule of differentiation.

Steps of finding the partial derivative of f with respect to y
Step 1: Treat all the other variables except y as constant.
Step 2: Differentiate the function using the normal rule of differentiation.
$ f(x, y) = x^{2} + y^{2} \\ \frac{\partial f}{\partial x} = 2x \\ \frac{\partial f}{\partial y} = 2y $
Partial Derivatives : More Examples

$ f(x, y) = 3x^{3}y^{3} \\ \frac{\partial f}{\partial x} = 9x^{2}y^{3} \\ \frac{\partial f}{\partial y} = 9x^{3}y^{2} $
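A symbolic check of this example, as a small sketch assuming SymPy is available:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = 3 * x ** 3 * y ** 3

print(sp.diff(f, x))   # 9*x**2*y**3
print(sp.diff(f, y))   # 9*x**3*y**2
```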
Gradients
The gradient organizes the partial derivatives of a function into a vector. The number of entries in the vector is simply the number of variables in the function.

$ f(x, y) = x^{2} + y^{2} \\ \text{Gradient :} \ \nabla{f} = \large{\begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}} = \large{\begin{bmatrix} 2x \\ 2y \end{bmatrix}} $

The gradient at the point (2, 3) is $ \nabla{f} = \begin{bmatrix} 2 \cdot 2 \\ 2 \cdot 3 \end{bmatrix} = \begin{bmatrix} 4 \\ 6 \end{bmatrix} $
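A small sketch assembling the gradient as a vector of partial derivatives and evaluating it at the point (2, 3):

```python
# Gradient of f(x, y) = x^2 + y^2: [df/dx, df/dy] = [2x, 2y].
def gradient(x, y):
    return [2 * x, 2 * y]

print(gradient(2, 3))   # [4, 6]
```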
Gradients and maxima/minima
Gradients are useful when we want to minimize or maximize a function. The minimum of a function of two variables occurs where both slopes of the tangent lines, given by the partial derivatives, are 0.
Optimization using Gradient Descent
Gradient Descent is a powerful method for minimizing or maximizing functions, especially in higher dimensions. To optimize a function we start at some point on the function and take small steps along the slope until we reach a local minimum.
new point = old point - slope: $ x_1 = x_0 - f'(x_0) $, or with a step size, $ x_1 = x_0 - \alpha f'(x_0) $. Here $ \alpha $ is called the learning rate; it is the size of each step taken towards the minimum in the iterative Gradient Descent procedure.
Gradient Descent Algorithm
Given a function f(x). Objective: minimize f(x), i.e. find the minimum point of f(x).
Step - 1 :
Define a learning rate $ \alpha $ and choose a starting point $ x_0 $.
Step - 2 :
$ x_k = x_{k - 1} - \alpha f'(x_{k - 1}) $
Step - 3 :
Iterate Step 2 until $ x_k $ is close enough to the true minimum.
Gradient Descent:
$ f(x) = e^{x} - \log{(x)} \ \ \ f'(x) = e^{x} - \frac{1}{x} \\ \text{Start :} \ x_0 = 0.05 \ \ \ \ \ \text{Rate :} \ \alpha = 0.005 $

$ \text{Find :} \ f'(0.05) = -18.9 \\ \text{Move by} \ -0.005 \cdot f'(0.05) : \ x \rightarrow 0.1447 $

$ \text{Find :} \ f'(0.1447) = -5.7552 \\ \text{Move by} \ -0.005 \cdot f'(0.1447) : \ x \rightarrow 0.1735 $
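A minimal sketch of this iteration in Python, using the same starting point and learning rate as the example:

```python
import math

# Gradient descent on f(x) = e^x - log(x); f'(x) = e^x - 1/x.
f_prime = lambda x: math.exp(x) - 1 / x

x = 0.05        # starting point
alpha = 0.005   # learning rate

for step in range(1000):
    x = x - alpha * f_prime(x)
    if step < 2:
        print(step, round(x, 4))   # the first two iterates: 0.1447 and 0.1735

print(x)        # settles near the true minimizer x ~ 0.567, where e^x = 1/x
```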
Newton's Method : An alternate to Gradient Descent
The primary objective of Newton's method is to find the zeros of a function. The method can also be adapted to optimization.

Newton's method (finding a zero of $ f(x) $):
1) Start with some $ x_0 $
2) Update: $ \large{x_{k+1} = x_{k} - \frac{f(x_{k})}{f'(x_{k})}} $

Newton's method for optimization (goal: minimize $ g(x) $, i.e. find the zeros of $ g'(x) $):
1) Start with some $ x_0 $
2) Update: $ \large{x_{k+1} = x_{k} - \frac{g'(x_{k})}{g''(x_{k})}} $
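A minimal sketch of the optimization variant, applied to the same g(x) = e^x - log(x) used in the gradient descent example; note how few iterations it needs:

```python
import math

# Newton's method for optimization on g(x) = e^x - log(x):
# g'(x) = e^x - 1/x, g''(x) = e^x + 1/x^2; iterate x <- x - g'(x) / g''(x).
g_prime = lambda x: math.exp(x) - 1 / x
g_double_prime = lambda x: math.exp(x) + 1 / x ** 2

x = 0.05
for _ in range(10):
    x = x - g_prime(x) / g_double_prime(x)

print(x)   # ~ 0.567, the same minimizer, reached in far fewer steps than gradient descent
```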