Derivatives and Optimization
Cheatsheet Content
Machine Learning Motivation

Predicting house prices based on the number of bedrooms is a classic machine learning regression problem.

| Number of Bedrooms | Price of House ($) |
| --- | --- |
| 1 | 150,000 |
| 2 | 250,000 |
| 3 | 350,000 |
| 5 | 600,000 |
| 6 | 650,000 |
| 7 | 750,000 |
| 8 | 800,000 |
| 9 | ?? |
| 10 | 1,050,000 |

Machine learning models are trained to find patterns in data to make predictions. For example, if we plot house prices vs. bedrooms, a model can learn a trend to predict the price for 9 bedrooms.

Classification Problem: Sentiment Analysis

Machine learning can also classify data, like determining the mood of a sentence:

"Aack aack aack!" -> Negative mood
"Beep beep!" -> Positive mood
"Aack beep aack!" -> Negative mood
"Aack beep beep beep!" -> Positive mood

A model learns from labeled data to predict the sentiment of new text.

Math Concepts in ML Model Training

Gradients, derivatives, optimization, gradient descent, loss and cost functions, linear regression, classification, neural networks.

Introduction to Derivatives

Derivatives help us understand rates of change, like velocity. Consider a car's distance traveled ($x$) over time ($t$):

| t (seconds) | x (meters) |
| --- | --- |
| 0 | 0 |
| 5 | 36 |
| 10 | 122 |
| 15 | 202 |
| 20 | 265 |
| ... | ... |
| 60 | 1000 |

If the distance covered in equal time intervals varies (e.g., 80 m from $t=10$ s to $t=15$ s, but only 63 m from $t=15$ s to $t=20$ s), the speed is not constant.

Calculating Average Velocity (Slope)

Average velocity is the slope of the distance-time graph:

$\text{slope} = \frac{\text{rise}}{\text{run}} = \frac{\Delta x}{\Delta t}$

For $t=10$ s to $t=15$ s:

$\text{slope} = \frac{x(15) - x(10)}{15\,s - 10\,s} = \frac{202\,m - 122\,m}{5\,s} = \frac{80\,m}{5\,s} = 16 \text{ m/s}$

Estimating Instantaneous Velocity

To find the velocity at a specific moment (e.g., $t=12.5$ s), we need smaller $\Delta t$ intervals:

| t (s) | x (m) |
| --- | --- |
| 10 | 122 |
| 11 | 138 |
| 12 | 155 |
| 13 | 170 |
| 14 | 186 |
| 15 | 202 |

For $t=12$ s to $t=13$ s (around $t=12.5$ s):

$\text{slope} = \frac{x(13) - x(12)}{13\,s - 12\,s} = \frac{170\,m - 155\,m}{1\,s} = 15 \text{ m/s}$

The derivative $dx/dt$ represents the instantaneous velocity as $\Delta t \to 0$.
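The slope calculations above can be sketched in Python (a minimal illustration; the data comes from the tables in the text, and the helper function name is ours):

```python
# Distance-time samples from the finer table around t = 12.5 s.
t = [10, 11, 12, 13, 14, 15]          # time in seconds
x = [122, 138, 155, 170, 186, 202]    # distance in meters

def average_velocity(t0, t1, t_vals, x_vals):
    """Slope of the secant line (rise/run) between two sampled times."""
    i0, i1 = t_vals.index(t0), t_vals.index(t1)
    return (x_vals[i1] - x_vals[i0]) / (t_vals[i1] - t_vals[i0])

print(average_velocity(10, 15, t, x))  # coarse interval: 16.0 m/s
print(average_velocity(12, 13, t, x))  # finer interval around t = 12.5 s: 15.0 m/s
```

Shrinking the interval around the time of interest is exactly the limiting process that defines the derivative.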
Derivatives and Tangents

The derivative at a point is the slope of the tangent line to the curve at that point:

$\frac{\Delta x}{\Delta t} \to \frac{dx}{dt}$ as $\Delta t \to 0$

Slopes, Maxima, and Minima

At a maximum or minimum point of a smooth function, the slope of the tangent line (and thus the derivative) is zero. Example: if $x(t)$ stops changing, $dx/dt = 0$.

| t (s) | x (m) |
| --- | --- |
| 19 | 265 |
| 20 | 265 |

$\text{slope} = \frac{265\,m - 265\,m}{20\,s - 19\,s} = \frac{0\,m}{1\,s} = 0 \text{ m/s}$

This indicates the car is momentarily stopped or changing direction. Local maxima and minima occur where the derivative is zero.

Derivative Notation

General slope: $\frac{\Delta y}{\Delta x}$. Derivative (instantaneous slope): $\frac{dy}{dx}$. If $y = f(x)$, the derivative is also denoted $f'(x)$ (Lagrange's notation) or $\frac{d}{dx}f(x)$ (Leibniz's notation).

Common Derivatives: Lines

Derivative of a Constant Function: if $f(x) = c$ (a horizontal line), the slope is always zero, so $f'(x) = 0$.

Derivative of a Linear Function: if $f(x) = ax + b$, the slope is constant and equal to $a$, so $f'(x) = a$. Derived using limits:

$\frac{\Delta y}{\Delta x} = \frac{(a(x+\Delta x) + b) - (ax + b)}{\Delta x} = \frac{ax + a\Delta x + b - ax - b}{\Delta x} = \frac{a\Delta x}{\Delta x} = a$

As $\Delta x \to 0$, $f'(x) = a$.

Common Derivatives: Quadratics

Derivative of Quadratic Functions: if $f(x) = x^2$, using limits:

$\frac{\Delta f}{\Delta x} = \frac{(x+\Delta x)^2 - x^2}{\Delta x} = \frac{x^2 + 2x\Delta x + (\Delta x)^2 - x^2}{\Delta x} = \frac{2x\Delta x + (\Delta x)^2}{\Delta x} = 2x + \Delta x$

As $\Delta x \to 0$, $f'(x) = 2x$. At $x=1$, $f'(1) = 2(1) = 2$.
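The limit derivation for $f(x) = x^2$ can be checked numerically: the difference quotient equals $2x + \Delta x$ exactly, so it approaches $2x$ as the step shrinks (a minimal sketch; the function name is ours):

```python
# Difference quotient (f(x + dx) - f(x)) / dx from the limit definition.
def difference_quotient(f, x, dx):
    return (f(x + dx) - f(x)) / dx

f = lambda v: v ** 2
for dx in (1.0, 0.1, 0.001):
    # For x**2 the quotient is exactly 2x + dx, so at x = 1 it tends to 2.
    print(dx, difference_quotient(f, 1.0, dx))
```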
Common Derivatives: Higher Degree Polynomials

Derivative of Cubic Functions: if $f(x) = x^3$, using limits:

$\frac{\Delta f}{\Delta x} = \frac{(x+\Delta x)^3 - x^3}{\Delta x} = \frac{x^3 + 3x^2\Delta x + 3x(\Delta x)^2 + (\Delta x)^3 - x^3}{\Delta x} = 3x^2 + 3x\Delta x + (\Delta x)^2$

As $\Delta x \to 0$, $f'(x) = 3x^2$. At $x=0.5$, $f'(0.5) = 3(0.5)^2 = 0.75$.

Common Derivatives: Other Power Functions

Derivative of $f(x) = x^{-1} = \frac{1}{x}$, using limits:

$\frac{\Delta f}{\Delta x} = \frac{\frac{1}{x+\Delta x} - \frac{1}{x}}{\Delta x} = \frac{x - (x+\Delta x)}{x(x+\Delta x)\Delta x} = \frac{-\Delta x}{x(x+\Delta x)\Delta x} = -\frac{1}{x(x+\Delta x)}$

As $\Delta x \to 0$, $f'(x) = -\frac{1}{x^2} = -x^{-2}$. At $x=1$, $f'(1) = -1$.

Derivative of Power Functions: General Rule

For $f(x) = x^n$, the derivative is $f'(x) = nx^{n-1}$.

$f(x) = x^2 \Rightarrow f'(x) = 2x^1$
$f(x) = x^3 \Rightarrow f'(x) = 3x^2$
$f(x) = x^{-1} \Rightarrow f'(x) = -1x^{-2}$

Inverse Functions and their Derivatives

If $g(x)$ is the inverse of $f(x)$, then $g(f(x)) = x$. We denote $g(x)$ as $f^{-1}(x)$. Example: $f(x) = x^2$ (for $x > 0$) has inverse $g(y) = \sqrt{y}$.

The derivative of an inverse function $g(y)$ is:

$g'(y) = \frac{1}{f'(x)}$ where $y = f(x)$

This can also be written as $g'(y) = \frac{1}{f'(f^{-1}(y))}$. For $f(x) = x^2$, $f'(x) = 2x$, so $g'(y) = \frac{1}{2x} = \frac{1}{2\sqrt{y}}$.

Derivatives of Trigonometric Functions

If $f(x) = \sin(x)$, then $f'(x) = \cos(x)$. If $f(x) = \cos(x)$, then $f'(x) = -\sin(x)$.

Meaning of the Exponential (e)

$e \approx 2.71828182\dots$ is defined as the limit of $(1 + \frac{1}{n})^n$ as $n \to \infty$. This arises in compound interest: if you earn 100% interest per year, compounding more frequently yields more money, approaching $e$ times your initial investment.
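The general power rule $f'(x) = nx^{n-1}$ can be spot-checked with a central difference approximation (a minimal sketch; the evaluation point $x = 1.5$ is an arbitrary choice of ours):

```python
# Compare a numerical derivative of x**n against the power rule n * x**(n-1).
def numeric_derivative(f, x, dx=1e-6):
    # Central difference: more accurate than a one-sided quotient.
    return (f(x + dx) - f(x - dx)) / (2 * dx)

x = 1.5
for n in (2, 3, -1):
    approx = numeric_derivative(lambda v: v ** n, x)
    exact = n * x ** (n - 1)
    print(n, approx, exact)  # the two columns agree closely
```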
Compounding once: $(1+1)^1 = 2$
Compounding twice (50% every 6 months): $(1+\frac{1}{2})^2 = 2.25$
Compounding three times (33.3% every 4 months): $(1+\frac{1}{3})^3 \approx 2.37$
Compounding infinitely often: $\lim_{n \to \infty} (1 + \frac{1}{n})^n = e$

Derivative of $e^x$

A unique property of $e^x$ is that its derivative is itself: $f(x) = e^x \Rightarrow f'(x) = e^x$. This means the slope of $e^x$ at any point $x$ equals the value of $e^x$ at that point. More generally, for $f(x) = a^x$ the derivative is $f'(x) = \ln(a)\,a^x$.

Derivative of $\ln(x)$

The natural logarithm, $\ln(x)$, is the inverse of $e^x$: if $f(x) = e^x$, then $f^{-1}(y) = \ln(y)$. Using the inverse function derivative rule $g'(y) = \frac{1}{f'(f^{-1}(y))}$ with $f'(x) = e^x$:

$\frac{d}{dy}\ln(y) = \frac{1}{e^{\ln(y)}} = \frac{1}{y}$

Therefore, if $f(x) = \ln(x)$, then $f'(x) = \frac{1}{x}$.

Existence of the Derivative

Differentiable Functions: a function is differentiable at a point if its derivative exists at that point. Geometrically, this means there is a unique tangent line at that point. A function is differentiable over an interval if it is differentiable at every point in the interval.

Non-Differentiable Functions: a function is not differentiable at points where:

Corners or Cusps: the slope changes abruptly, leading to different left and right derivatives (e.g., $f(x) = |x|$ at $x=0$).
Jump Discontinuities: the function has a sudden break or jump (e.g., piecewise functions with gaps).
Vertical Tangents: the tangent line is vertical, meaning the slope is infinite (e.g., $f(x) = x^{1/3}$ at $x=0$).

Properties of the Derivative: Multiplication by Scalars

If $f(x) = c \cdot g(x)$, where $c$ is a constant, then $f'(x) = c \cdot g'(x)$: multiplying a function by a scalar $c$ also multiplies its slope/derivative by $c$.
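The compounding sequence and the self-derivative property of $e^x$ can both be checked numerically (a minimal sketch using only the standard library):

```python
import math

# (1 + 1/n)**n grows toward e as compounding becomes more frequent.
for n in (1, 2, 3, 12, 365, 1_000_000):
    print(n, (1 + 1 / n) ** n)

# The slope of e**x matches its value: check at x = 1 with a small step.
dx = 1e-6
slope = (math.exp(1 + dx) - math.exp(1 - dx)) / (2 * dx)
print(slope, math.e)  # both close to 2.71828...
```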
Properties of the Derivative: The Sum Rule

If $f(x) = g(x) + h(x)$, then $f'(x) = g'(x) + h'(x)$: the derivative of a sum of functions is the sum of their derivatives.

Example: if a boat moves at velocity $v_B$ and a child walks on it at $v_C$ in the same direction, the child's total velocity relative to the earth is $v_{Total} = v_B + v_C$. If $v_B = \frac{dx_B}{dt}$ and $v_C = \frac{dx_C}{dt}$, then $v_{Total} = \frac{dx_B}{dt} + \frac{dx_C}{dt}$.

Properties of the Derivative: The Product Rule

If $f(t) = g(t)h(t)$, then $f'(t) = g'(t)h(t) + g(t)h'(t)$. This can be visualized by considering the change in area of a rectangle with sides $g(t)$ and $h(t)$ over time.

Properties of the Derivative: The Chain Rule

If $f(t) = g(h(t))$, then $f'(t) = g'(h(t)) \cdot h'(t)$. In Leibniz notation:

$\frac{df}{dt} = \frac{dg}{dh} \cdot \frac{dh}{dt}$

Example: if temperature ($T$) changes with height ($h$), and height changes with time ($t$), then the rate of change of temperature with respect to time is $\frac{dT}{dt} = \frac{dT}{dh} \cdot \frac{dh}{dt}$.

Introduction to Optimization

Optimization involves finding the inputs that produce the minimum or maximum output of a function. In machine learning, this often means finding model parameters that minimize a "cost" or "loss" function. The key idea is that at a local minimum or maximum, the derivative of the function is zero. A function can have multiple local minima/maxima; finding the global minimum/maximum requires checking all such points and boundary conditions.

Optimization of Squared Loss: The Powerline Problem

Goal: minimize the total cost of connecting houses to power lines. The cost is often modeled as the sum of squared distances to the power lines.

One Powerline Problem

If a house is at position $x$ and the single power line is at position $a$, the cost is $C(x) = (x-a)^2$.
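The temperature/height chain rule example can be made concrete with a quick numerical check. The particular functions below are illustrative assumptions of ours, not from the text:

```python
# Chain rule sketch: T depends on height h, and h depends on time t.
def h(t):              # assumed: height climbs linearly with time
    return 2.0 * t

def T(height):         # assumed: temperature drops linearly with height
    return 30.0 - 0.5 * height

def numeric_derivative(f, x, dx=1e-6):
    return (f(x + dx) - f(x - dx)) / (2 * dx)

t0 = 4.0
dT_dh = numeric_derivative(T, h(t0))            # -0.5 (deg per meter)
dh_dt = numeric_derivative(h, t0)               #  2.0 (meters per second)
dT_dt = numeric_derivative(lambda t: T(h(t)), t0)
print(dT_dh * dh_dt, dT_dt)  # the chain rule says these agree: both -1.0
```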
To minimize $C(x)$, we set the derivative to zero:

$C'(x) = \frac{d}{dx}(x-a)^2 = 2(x-a) = 0 \Rightarrow x = a$

The house should be placed at the same position as the power line to minimize cost.

Two Powerline Problem

Power lines at positions $a$ and $b$; place the house at $x$ to minimize the sum of squared distances to $a$ and $b$. Cost function: $C(x) = (x-a)^2 + (x-b)^2$. To minimize $C(x)$:

$C'(x) = 2(x-a) + 2(x-b) = 0$
$2x - a - b = 0 \Rightarrow x = \frac{a+b}{2}$

The optimal position is the mean of $a$ and $b$.

Three Powerline Problem

Power lines at positions $a, b, c$; place the house at $x$ to minimize the sum of squared distances. Cost function: $C(x) = (x-a)^2 + (x-b)^2 + (x-c)^2$. To minimize $C(x)$:

$C'(x) = 2(x-a) + 2(x-b) + 2(x-c) = 0$
$3x - a - b - c = 0 \Rightarrow x = \frac{a+b+c}{3}$

The optimal position is the mean of $a, b, c$.

General Squared Loss

To minimize $\sum_{i=1}^n (x-a_i)^2$, the solution is $x = \frac{1}{n}\sum_{i=1}^n a_i$, the mean of all $a_i$.

Optimization of Log-Loss

Log-loss (or cross-entropy loss) is commonly used in classification problems. Consider a coin toss experiment: 7 heads (H) and 3 tails (T) in 10 tosses. We want to find the probability $p$ of getting a head that maximizes the likelihood of observing this sequence.
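The "mean minimizes squared loss" result can be verified with a brute-force grid search (a minimal sketch; the powerline positions are made-up example values):

```python
# Sum of squared distances from x to a set of powerline positions.
positions = [1.0, 4.0, 7.0]  # hypothetical a, b, c

def cost(x, pts):
    return sum((x - a) ** 2 for a in pts)

mean = sum(positions) / len(positions)  # calculus says this is optimal: 4.0

# Scan a fine grid over [0, 10]; no candidate beats the mean.
candidates = [i / 100 for i in range(0, 1001)]
best = min(candidates, key=lambda x: cost(x, positions))
print(mean, best, cost(mean, positions))
```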
Likelihood function: $g(p) = p^7 (1-p)^3$

Maximizing $g(p)$ using Derivatives (Product Rule)

Using the product rule $(uv)' = u'v + uv'$ with $u = p^7$ and $v = (1-p)^3$:

$u' = 7p^6$
$v' = 3(1-p)^2 \cdot (-1) = -3(1-p)^2$ (using the chain rule)

$g'(p) = 7p^6(1-p)^3 - 3p^7(1-p)^2 = p^6(1-p)^2[7(1-p) - 3p] = p^6(1-p)^2(7 - 10p)$

Setting $g'(p) = 0$ for maxima gives $p = 0$, $p = 1$, or $7 - 10p = 0 \Rightarrow p = 0.7$. The maximum likelihood occurs at $p = 0.7$.

Maximizing $g(p)$ using Logarithms (Log-Loss)

Instead of maximizing $g(p)$, we can maximize the log-likelihood $\ln(g(p))$, which is often easier. Because $\ln$ is a monotonically increasing function, its maximum occurs at the same $p$ as that of $g(p)$. Using log properties:

$G(p) = \ln(g(p)) = \ln(p^7 (1-p)^3) = 7\ln(p) + 3\ln(1-p)$

Differentiating $G(p)$ with respect to $p$ (using the chain rule for $\ln(1-p)$):

$G'(p) = \frac{7}{p} - \frac{3}{1-p}$

Setting $G'(p) = 0$ for maxima:

$\frac{7}{p} = \frac{3}{1-p} \Rightarrow 7(1-p) = 3p \Rightarrow 7 = 10p \Rightarrow p = 0.7$

This confirms the same result as before, but with simpler differentiation. The negative of the log-likelihood, $-G(p)$, is called the log-loss.

Why the Logarithm in ML?

Simplifies Products to Sums: derivatives of products are complex (product rule), while derivatives of sums are simple (sum rule). Logarithms convert products into sums, simplifying differentiation. This is especially useful when dealing with many terms, as in complex likelihood functions.

Handles Tiny Numbers: products of many probabilities can become extremely small, leading to numerical underflow. Taking the logarithm converts these tiny products into sums of negative numbers, which are easier for computers to handle accurately.
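The log-likelihood maximum derived above can be confirmed with a grid search over $p$ (a minimal sketch; the function name is ours):

```python
import math

# Log-likelihood for 7 heads and 3 tails: G(p) = 7*ln(p) + 3*ln(1-p).
def log_likelihood(p, heads=7, tails=3):
    return heads * math.log(p) + tails * math.log(1 - p)

# Scan the open interval (0, 1); endpoints are excluded since ln(0) diverges.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=log_likelihood)
print(best)  # 0.7, matching the calculus result
```

Because $\ln$ is monotonic, maximizing the log-likelihood picks out the same $p$ that maximizes the raw likelihood $p^7(1-p)^3$.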