Calculus for Machine Learning
Derivatives are a core concept in machine learning: they are used for optimization, that is, to minimize the error of a model. For example, a model forecasting house prices is trained by minimizing the error between its predictions and the observed prices.
A derivative is the instantaneous rate of change of a function. Distance can be represented as a function of time, and its rate of change is called velocity. The measure of how fast the distance is changing with respect to time at a given moment is the instantaneous rate of change, and it equals the slope of the tangent line at that point. In general, the instantaneous rate of change measures how fast the relation between two variables is changing at any point.
Minima and maxima are the points of the function curve where the slope of the tangent line is 0, i.e., where the derivative is 0.
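As a quick numerical sketch (the function $ f(x) = (x - 3)^2 $ and the step size are chosen here purely for illustration), the slope of the tangent line can be approximated with a finite difference, and a minimum shows up where that slope crosses 0:

```python
# Approximate the instantaneous rate of change of f(x) = (x - 3)**2
# with a central finite difference; the minimum is where the slope is ~0.
def f(x):
    return (x - 3) ** 2

def slope(f, x, h=1e-6):
    # central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [1.0, 2.0, 3.0, 4.0]:
    print(f"x = {x}: slope ~ {slope(f, x):.4f}")
# the slope is negative before x = 3, positive after,
# and ~0 at the minimum x = 3
```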
$$ \text{function : } y = f(x) $$ $$ \text{Derivative of } f \text{ is expressed as:} $$ $$ \text{Lagrange's notation : } f'(x) $$ $$ \text{Leibniz's notation : } \frac{dy}{dx} = \frac{d}{dx} f(x) $$
The derivative of a constant is always 0, since the function does not change: $$ \frac{\Delta y}{\Delta x} = \frac{c - c}{x_1 - x_0} = 0 $$
The equation of a line is : $ f(x) = ax + b $
The derivative of a line is the slope of the line $ \Large{\frac{\Delta y}{\Delta x} = \frac{rise}{run} = a} $
$ \Large{\frac{\Delta{y}}{\Delta{x}} = \frac{a(x + \Delta{x}) + b - (ax + b)}{\Delta{x}}} $
$ \Large{\frac{\Delta{y}}{\Delta{x}} = \frac{a\Delta{x}}{\Delta{x}} = a} $
Quadratics : $ y = f(x) = x^2 $
Slope : $ \Large{\frac{\Delta f}{\Delta x} = \frac{f(x + \Delta{x}) - f(x)}{\Delta{x}}} $
$$ \frac{\Delta f}{\Delta x} = \frac{(x + \Delta{x})^{2} - x^{2}}{\Delta{x}} = \frac{2x\Delta{x} + \Delta{x}^{2}}{\Delta{x}} = 2x + \Delta{x} $$ $$ \frac{df}{dx} = 2x \ \ \text{as} \ \Delta{x} \ \rightarrow \ 0 $$ $$ f'(x) = 2x $$
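A small numerical check of this limit (the point $ x = 2 $ is an arbitrary choice): the difference quotient approaches $ 2x $ as $ \Delta{x} \rightarrow 0 $.

```python
# For f(x) = x**2 at x = 2, the difference quotient approaches f'(2) = 4
# as the step dx shrinks toward 0.
x = 2.0
for dx in [1.0, 0.1, 0.01, 0.001]:
    quotient = ((x + dx) ** 2 - x ** 2) / dx
    print(f"dx = {dx}: quotient = {quotient}")  # tends to 2*x = 4
```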
Cubic : $ y = f(x) = x^3 $
Slope : $ \Large{\frac{\Delta f}{\Delta x} = \frac{(x + \Delta{x})^{3} - x^{3}}{\Delta{x}}} $
$ = 3x^{2} + 3x\Delta{x} + \Delta{x}^{2} \ \rightarrow \ 3x^{2} \ \text{as} \ \Delta{x} \rightarrow 0 $, so $ f'(x) = 3x^{2} $
Hyperbola : $ y = f(x) = x^{-1} = \frac{1}{x} $
$ f'(x) = -x^{-2} $
$ \large{f(x) = x^{2}} $
$ \large{f'(x) = 2x^{1}} $
$ \large{f(x) = x^{3}} $
$ \large{f'(x) = 3x^{2}} $
$ \large{f(x) = x^{-1}} $
$ \large{f'(x) = (-1)x^{-2}} $
$ \large{f(x) = x^{n}} $
$ \large{f'(x) = \frac{d}{dx} f(x) = nx^{(n - 1)}} $
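The power rule can be verified symbolically, for instance with SymPy (assuming it is installed); the exponents below are the ones used in the examples above:

```python
import sympy as sp

x = sp.symbols('x')
# Verify the power rule f'(x) = n * x**(n - 1) for a few exponents,
# including the negative one from the hyperbola example.
for n in [2, 3, -1]:
    print(f"d/dx x**{n} =", sp.diff(x ** n, x))
# d/dx x**2 = 2*x
# d/dx x**3 = 3*x**2
# d/dx x**-1 = -1/x**2
```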
$$ \large{y = f(x) = \sin(x)} $$ $$ \large{f'(x) = \cos(x)} $$ $$ \large{y = f(x) = \cos(x)} $$ $$ \large{f'(x) = -\sin(x)} $$
$$ \large{y = f(x) = e^x} $$ $$ \large{f'(x) = e^x} $$
The logarithmic function $ f^{-1}(y) = \log(y) $ is the inverse of the exponential function $ f(x) = e^{x} $ because
$ e^{\log(x)} = x $ and $ \log(e^{y}) = y $. Hence $ \log(x) $ is the inverse of $ e^{x} $.
$ \large{\frac{d}{dy} f^{-1}(y) = \frac{1}{f'(f^{-1}(y))}} $
$ \large{\frac{d}{dy} \log(y) = \frac{1}{e^{log(y)}} = \frac{1}{y}} $
$ \large{\frac{d}{dy} \log(y) = \frac{1}{y}} $
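A quick numerical sanity check of this result (the sample points are arbitrary): the finite-difference derivative of $ \log(y) $ matches $ \frac{1}{y} $.

```python
import math

# Numerically check d/dy log(y) = 1 / y, which follows from the
# inverse-function rule with f(x) = e**x.
def num_deriv(f, y, h=1e-6):
    return (f(y + h) - f(y - h)) / (2 * h)

for y in [0.5, 1.0, 2.0, 10.0]:
    print(y, num_deriv(math.log, y), 1 / y)  # the last two columns agree
```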
A function is differentiable at a point if the derivative exists at that point. For a function to be differentiable on an interval, the derivative has to exist at every point in the interval.
A few functions do not satisfy this property; they are called non-differentiable functions. For example, the absolute value function $ f(x) = \begin{cases} \ x & \ \text{if} \ \ x \geq 0 \\ -x & \ \text{if} \ \ x < 0 \end{cases} $ is not differentiable at the origin; a point with a sharp edge or corner like this is called a cusp. A piecewise function with a jump discontinuity is also non-differentiable at the jump. A function with a vertical tangent is non-differentiable there, because a vertical tangent does not have a well-defined slope.
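The cusp of the absolute value function can be seen numerically: the one-sided difference quotients at the origin disagree, so no single tangent slope exists. A minimal sketch:

```python
# For f(x) = |x| at x = 0, the slopes from the right and from the left
# differ, so the derivative at 0 does not exist.
def f(x):
    return abs(x)

h = 1e-6
right = (f(0 + h) - f(0)) / h   # slope from the right: +1
left = (f(0) - f(0 - h)) / h    # slope from the left: -1
print(right, left)              # 1.0 -1.0 -> no well-defined derivative
```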
Multiplication by a scalar
$$ f = cg \ \Rightarrow \ f' = cg' $$
Product Rule
$$ f(t) = g(t)h(t) \ \Rightarrow \ f'(t) = g'(t)h(t) + g(t)h'(t) $$
Chain Rule
$$ \frac{d}{dt} f(g(h(t))) = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dt} $$
$$ = f'(g(h(t))) \cdot g'(h(t)) \cdot h'(t) $$
If temperature changes w.r.t. height, $ \frac{dT}{dh} $, and height changes w.r.t. time, $ \frac{dh}{dt} $, then
temperature changes w.r.t. time as $ \frac{dT}{dt} $:
$$ \frac{dT}{dt} = \frac{dT}{dh}.\frac{dh}{dt} $$
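This can be checked numerically; the functions for height and temperature below are made up for illustration (a linear climb and a standard-atmosphere-style lapse of 0.0065 degrees per meter):

```python
# A sketch of the chain rule dT/dt = dT/dh * dh/dt, using hypothetical
# functions: temperature falls with height, height grows with time.
def height(t):            # h(t): height at time t (hypothetical)
    return 100.0 * t

def temp(h):              # T(h): temperature at height h (hypothetical)
    return 20.0 - 0.0065 * h

def num_deriv(f, x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

t = 3.0
dT_dh = num_deriv(temp, height(t))
dh_dt = num_deriv(height, t)
dT_dt = num_deriv(lambda t: temp(height(t)), t)
print(dT_dh * dh_dt, dT_dt)  # both ~ -0.65, as the chain rule predicts
```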
The cost function (quadratic) : $ (x - a)^{2} + (x - b)^{2} $. In order to optimize the cost function
we need to find the point where the slope is 0, that is
$ \frac{d}{dx}[(x - a)^{2} + (x - b)^{2}] = 0 $
By using the chain rule we get
$ 2(x - a) + 2(x - b) = 0 $
$ (x - a) + (x - b) = 0 \ \Rightarrow \ 2x - a - b = 0 $
$ 2x = a + b $
$ x = \large{\frac{a + b}{2}} $
Hence the optimal solution is the midpoint between $ a $ and $ b $.
Minimize $ (x - a_1)^{2} + (x - a_2)^{2} + ... + (x - a_n)^{2} $
Solution : $ \ x = \Large{\frac{a_1 + a_2 + a_3 + ... + a_n}{n}} $
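As a sanity check that the mean minimizes this cost, a coarse grid search (with made-up data values) agrees with the calculus result:

```python
# The calculus result above says the minimizer of
# (x - a1)^2 + ... + (x - an)^2 is the mean of the a_i.
a = [2.0, 4.0, 6.0, 9.0]

def cost(x):
    return sum((x - ai) ** 2 for ai in a)

xs = [i / 1000 for i in range(0, 12001)]   # grid over [0, 12]
best = min(xs, key=cost)
print(best, sum(a) / len(a))               # 5.25 and 5.25
```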
$$ \log(g(p)) = \log(p^{7}(1 - p)^{3}) = \log(p^{7}) + \log((1 - p)^{3}) $$ $$ = 7 \log(p) + 3 \log(1 - p) = G(p) $$ In machine learning, logarithms are used to turn a product into a sum; a product of many probabilities can be a very tiny number that is hard to represent in floating point.
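A small demonstration of why this matters numerically (the probabilities below are hypothetical): multiplying many small probabilities underflows to 0 in floating point, while summing their logs does not.

```python
import math

# A product of many probabilities underflows to 0.0 in floating point,
# but the sum of logs stays perfectly representable.
probs = [1e-4] * 100                      # hypothetical tiny probabilities

product = 1.0
for p in probs:
    product *= p
print(product)                            # 0.0 due to underflow

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                            # ~ -921.03, no underflow
```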
Newton's method
Goal: find a zero of $ f(x) $
1) Start with some $ x_0 $
2) Update: $ \large{x_{k+1} = x_{k} - \frac{f(x_{k})}{f'(x_{k})}} $
Newton's method for optimization
Goal: minimize $ g(x) \rightarrow $ find zeros of $ g'(x) $
1) Start with some $ x_0 $
2) Update: $ \large{x_{k+1} = x_{k} - \frac{g'(x_{k})}{g''(x_{k})}} $
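A minimal sketch of both updates (the function names and example problems are chosen here for illustration): for optimization, Newton's method is applied to $ g'(x) $, so the update divides $ g' $ by $ g'' $.

```python
def newton_root(f, df, x, steps=10):
    # find a zero of f by repeatedly following the tangent line
    for _ in range(steps):
        x = x - f(x) / df(x)
    return x

def newton_minimize(dg, d2g, x, steps=10):
    # minimize g by finding a zero of its derivative g'
    for _ in range(steps):
        x = x - dg(x) / d2g(x)
    return x

# Example: f(x) = x**2 - 2 has a root at sqrt(2);
# g(x) = (x - 3)**2 has its minimum at x = 3.
print(newton_root(lambda x: x**2 - 2, lambda x: 2*x, x=1.0))       # ~ 1.41421
print(newton_minimize(lambda x: 2*(x - 3), lambda x: 2.0, x=0.0))  # 3.0
```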