# Optimization Fundamentals
### Critical Points & Gradient

#### Definition of Critical Points

Critical points are points in the domain of a function where its gradient is zero or undefined. These points are candidates for local maxima, local minima, or saddle points.

To find the critical points of a function $f(x_1, x_2, \dots, x_n)$, solve the system of equations in which every partial derivative equals zero:

$$\nabla f(\mathbf{x}) = \mathbf{0} \implies \frac{\partial f}{\partial x_i} = 0 \quad \text{for all } i=1, \dots, n$$

**Steps to find critical points:**

1. Compute the partial derivatives of $f$ with respect to each variable.
2. Set each partial derivative equal to zero.
3. Solve the resulting system of equations for $x_1, x_2, \dots, x_n$.
4. Check for points where the gradient is undefined, if applicable (e.g., points where particular partial derivatives do not exist).

#### Gradient Definition

- **Definition:** The gradient of a scalar-valued multivariable function is the vector of its first-order partial derivatives. It points in the direction of the greatest rate of increase of the function.
- **Formula:** For a function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient $\nabla f(\mathbf{x})$ is:
$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
- **Geometric Interpretation:** At any point, the gradient vector is perpendicular to the level set (contour line or surface) of the function through that point. Its magnitude indicates the steepness of the function.
- **Directional Derivative:** The directional derivative of $f$ in the direction of a unit vector $\mathbf{u}$ is $D_{\mathbf{u}}f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u}$.

### Hessian Matrix

#### Definition & General Formula

The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function.
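The critical-point recipe above can be sanity-checked numerically. Below is a minimal NumPy sketch using the hypothetical function $f(x,y) = x^3 - 3x + y^2$ (not from the text): its gradient vanishes at $(1, 0)$ and $(-1, 0)$, and a central-difference approximation agrees with the hand-derived gradient.

```python
import numpy as np

# Hypothetical example: f(x, y) = x^3 - 3x + y^2
def f(p):
    x, y = p
    return x**3 - 3*x + y**2

def grad_f(p):
    """Analytic gradient: (3x^2 - 3, 2y)."""
    x, y = p
    return np.array([3*x**2 - 3, 2*y])

def numerical_grad(f, p, h=1e-6):
    """Central-difference approximation of the gradient."""
    g = np.zeros_like(p, dtype=float)
    for i in range(len(p)):
        e = np.zeros_like(p, dtype=float)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

# Solving grad f = 0 by hand: 3x^2 - 3 = 0 and 2y = 0 give (1, 0) and (-1, 0).
for p in [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]:
    assert np.allclose(grad_f(p), 0.0)                        # critical point
    assert np.allclose(grad_f(p), numerical_grad(f, p), atol=1e-5)
```

Finite-difference checks like `numerical_grad` are a common way to validate a hand-derived gradient before trusting it inside an optimizer.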
The Hessian describes the local curvature of a function. For a function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix $H(f)(\mathbf{x})$ (or $\nabla^2 f(\mathbf{x})$) is:

$$H(f)(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

**Note:** If the second partial derivatives are continuous in a neighborhood of a point, then the mixed partial derivatives are equal ($\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$), making the Hessian a symmetric matrix (Clairaut's theorem, also known as Schwarz's theorem). This symmetry is crucial for many optimization properties.

#### Second Derivative Test for Critical Points (Multivariate)

For a critical point $\mathbf{x}^*$ where $\nabla f(\mathbf{x}^*) = \mathbf{0}$:

1. **Compute the Hessian matrix** $H(f)(\mathbf{x}^*)$ at the critical point.
2. **Evaluate the definiteness of $H(f)(\mathbf{x}^*)$:**
   - If $H(f)(\mathbf{x}^*)$ is **positive definite** (all eigenvalues are positive), then $f$ has a **local minimum** at $\mathbf{x}^*$. A common test for positive definiteness is that all leading principal minors are positive ($D_1 > 0, D_2 > 0, \dots, D_n > 0$). For a $2 \times 2$ matrix: $D_1 = f_{xx} > 0$ and $D_2 = \det(H) = f_{xx}f_{yy} - f_{xy}^2 > 0$.
   - If $H(f)(\mathbf{x}^*)$ is **negative definite** (all eigenvalues are negative), then $f$ has a **local maximum** at $\mathbf{x}^*$. For a $2 \times 2$ matrix: $D_1 = f_{xx} < 0$ and $D_2 = \det(H) > 0$.
   - If $H(f)(\mathbf{x}^*)$ is **indefinite** (has both positive and negative eigenvalues), then $f$ has a **saddle point** at $\mathbf{x}^*$. For a $2 \times 2$ matrix: $\det(H) < 0$.
   - If $\det(H) = 0$, the test is inconclusive.

### Eigenvalues & Eigenvectors

#### Definitions

- **Eigenvectors:** For a given linear transformation (represented by a matrix $A$), an eigenvector $\mathbf{v}$ is a non-zero vector that changes at most by a scalar factor when that linear transformation is applied to it. In the context of the Hessian, eigenvectors indicate the principal directions of curvature.
- **Eigenvalues:** The scalar factor $\lambda$ by which an eigenvector is scaled. In the context of the Hessian, eigenvalues quantify the magnitude and "direction" (concave up or concave down) of the curvature along their corresponding eigenvectors.

For a square matrix $A$ (like the Hessian $H$), eigenvalues ($\lambda$) and eigenvectors ($\mathbf{v}$) satisfy:

$$A\mathbf{v} = \lambda\mathbf{v}$$

This can be rewritten as:

$$(A - \lambda I)\mathbf{v} = \mathbf{0}$$

where $I$ is the identity matrix. To find eigenvalues, solve the characteristic equation:

$$\det(A - \lambda I) = 0$$

Once the eigenvalues are found, substitute each $\lambda$ back into $(A - \lambda I)\mathbf{v} = \mathbf{0}$ to find the corresponding eigenvectors.

#### Matrix Definiteness Based on Eigenvalues

For a symmetric matrix (like the Hessian):

- **Positive Definite (PD):** All eigenvalues $\lambda_i > 0$.
- **Positive Semi-Definite (PSD):** All eigenvalues $\lambda_i \ge 0$.
- **Negative Definite (ND):** All eigenvalues $\lambda_i < 0$.
- **Negative Semi-Definite (NSD):** All eigenvalues $\lambda_i \le 0$.
- **Indefinite:** Some eigenvalues are positive and some are negative.

### Convexity

#### Convex Sets

- **Definition:** A set $S \subseteq \mathbb{R}^n$ is convex if for any two points $\mathbf{x}_1, \mathbf{x}_2 \in S$, the line segment connecting them is entirely contained in $S$. That is, for all $\theta \in [0, 1]$, the point $\theta \mathbf{x}_1 + (1-\theta)\mathbf{x}_2 \in S$.
- **Examples of Convex Sets:** Hyperplanes, half-spaces, balls (Euclidean, $L_1$, $L_\infty$), cones, polyhedra, intersections of convex sets.
- **Non-examples:** Star-shaped sets, sets with holes or indentations.
- **Properties:**
  - The intersection of any number of convex sets is convex.
  - The sum of convex sets is convex.
  - The image of a convex set under an affine map is convex.
  - The inverse image of a convex set under an affine map is convex.

#### Convex Functions

- **Definition:** A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if its domain is a convex set and for all $\mathbf{x}, \mathbf{y}$ in its domain and all $\theta \in [0, 1]$:
$$f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \le \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y})$$
Geometrically, the line segment connecting any two points on the graph of the function lies on or above the graph.
- **Strictly Convex:** The inequality is strict for $\theta \in (0, 1)$ and $\mathbf{x} \ne \mathbf{y}$.
- **Concave Function:** A function $f$ is concave if $-f$ is convex:
$$f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \ge \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y})$$
- **Properties:**
  - The sum of convex functions is convex.
  - A non-negative weighted sum of convex functions is convex.
  - The composition of a convex function with an affine mapping is convex.
  - The pointwise maximum of a set of convex functions is convex.

#### Proving Convexity of Functions

1. **Definition (Jensen's Inequality):** Directly use the definition: $f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \le \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y})$. This is often used for simple functions or when other methods are difficult.
2. **First-Order Condition:** A differentiable function $f$ is convex if and only if its domain is convex and for all $\mathbf{x}, \mathbf{y}$ in its domain:
$$f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x})$$
Geometrically, the tangent hyperplane (linear approximation) at any point lies on or below the graph of the function.
3. **Second-Order Condition (Hessian):** A twice-differentiable function $f$ with a convex domain is convex if and only if its Hessian matrix $\nabla^2 f(\mathbf{x})$ is **positive semi-definite (PSD)** for all $\mathbf{x}$ in its domain.
   - If $\nabla^2 f(\mathbf{x})$ is **positive definite (PD)** for all $\mathbf{x}$, then $f$ is strictly convex.
   - If $\nabla^2 f(\mathbf{x})$ is **negative semi-definite (NSD)** for all $\mathbf{x}$, then $f$ is concave.
   - If $\nabla^2 f(\mathbf{x})$ is **negative definite (ND)** for all $\mathbf{x}$, then $f$ is strictly concave.
4. **Operations that preserve convexity:** Sums, non-negative scaling, pointwise maximum, composition with affine maps.

### Convex Optimization Problem

#### Definition

A convex optimization problem has the form:

$$\begin{array}{ll} \text{minimize} & f_0(\mathbf{x}) \\ \text{subject to} & f_i(\mathbf{x}) \le 0, \quad i=1, \dots, m \\ & h_j(\mathbf{x}) = 0, \quad j=1, \dots, p \end{array}$$

where:

- The objective function $f_0(\mathbf{x})$ is a **convex function**.
- The inequality constraint functions $f_i(\mathbf{x})$ are **convex functions**.
- The equality constraint functions $h_j(\mathbf{x})$ are **affine** (i.e., $h_j(\mathbf{x}) = \mathbf{a}_j^T \mathbf{x} - b_j$).

**Key Properties:**

- The feasible set of a convex optimization problem is always convex: it is the intersection of sublevel sets of convex functions (which are convex) and affine sets (which are convex).
- Any local minimum of a convex optimization problem is also a global minimum.
- For a strictly convex objective, if a minimum exists, it is unique.

### L-Smoothness & Strong Convexity

#### L-Smoothness (Lipschitz Continuous Gradient)

A function $f$ is $L$-smooth if its gradient $\nabla f$ is Lipschitz continuous with constant $L$; intuitively, the slope of the function does not change too rapidly.
- **Definition:** A differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth if there exists a constant $L > 0$ such that for all $\mathbf{x}, \mathbf{y} \in \text{dom}(f)$:
$$||\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})||_2 \le L||\mathbf{x} - \mathbf{y}||_2$$
- **Second-Order Condition:** If $f$ is twice differentiable, then $f$ is $L$-smooth if and only if the eigenvalues of its Hessian are bounded: all eigenvalues of $\nabla^2 f(\mathbf{x})$ lie in $[-L, L]$ for all $\mathbf{x}$ in the domain. For convex $f$ this reduces to
$$\lambda_{\max}(\nabla^2 f(\mathbf{x})) \le L \quad \text{for all } \mathbf{x} \in \text{dom}(f),$$
where $\lambda_{\max}(A)$ denotes the largest eigenvalue of matrix $A$.
- **Importance:** $L$-smoothness implies that the function has no arbitrarily sharp "peaks" or "valleys," which matters for the convergence of gradient-based optimization algorithms. It guarantees a quadratic upper bound on the function:
$$f(\mathbf{y}) \le f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x}) + \frac{L}{2} ||\mathbf{y} - \mathbf{x}||_2^2$$

#### Strong Convexity

A function is strongly convex if it is "uniformly" convex: it curves upward at least by a fixed amount everywhere. Strong convexity ensures a unique minimum and faster convergence rates for optimization algorithms.
- **Definition:** A function $f: \mathbb{R}^n \to \mathbb{R}$ is $\mu$-strongly convex if there exists a constant $\mu > 0$ such that for all $\mathbf{x}, \mathbf{y} \in \text{dom}(f)$ and all $\theta \in [0, 1]$:
$$f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \le \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y}) - \frac{\mu}{2}\theta(1-\theta)||\mathbf{x}-\mathbf{y}||_2^2$$
- **First-Order Condition:** A differentiable $f$ is $\mu$-strongly convex if and only if its domain is convex and for all $\mathbf{x}, \mathbf{y}$ in its domain:
$$f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x}) + \frac{\mu}{2} ||\mathbf{y} - \mathbf{x}||_2^2$$
- **Second-Order Condition:** If $f$ is twice differentiable, then $f$ is $\mu$-strongly convex if and only if its Hessian satisfies:
$$\nabla^2 f(\mathbf{x}) \succeq \mu I \quad \text{for all } \mathbf{x} \in \text{dom}(f)$$
Equivalently, all eigenvalues of $\nabla^2 f(\mathbf{x})$ are at least $\mu$:
$$\lambda_{\min}(\nabla^2 f(\mathbf{x})) \ge \mu \quad \text{for all } \mathbf{x} \in \text{dom}(f)$$
where $\lambda_{\min}(A)$ denotes the smallest eigenvalue of matrix $A$.

#### Condition Number

The condition number $\kappa$ of a function (or its Hessian) is the ratio of its $L$-smoothness constant to its strong convexity constant:

$$\kappa = \frac{L}{\mu}$$

- A small condition number (close to 1) indicates a well-conditioned problem: the function resembles a "round" bowl, and optimization algorithms tend to converge quickly.
- A large condition number indicates an ill-conditioned problem: the function resembles a long, narrow valley (its level sets are highly elongated ellipsoids), and optimization algorithms can struggle, exhibiting zig-zagging behavior.

### Gradient Descent

#### Definition

Gradient Descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function.
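For a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T Q \mathbf{x}$ with symmetric $Q$, the Hessian is $Q$ everywhere, so the three constants above reduce to eigenvalues: $L = \lambda_{\max}(Q)$, $\mu = \lambda_{\min}(Q)$, and $\kappa = L/\mu$. A quick NumPy check (the matrix $Q$ below is a made-up example):

```python
import numpy as np

# Hypothetical example: f(x) = 0.5 * x^T Q x, Hessian = Q everywhere.
Q = np.array([[10.0, 2.0],
              [2.0, 1.0]])

eigs = np.linalg.eigvalsh(Q)   # eigenvalues of a symmetric matrix, ascending
mu, L = eigs[0], eigs[-1]      # strong convexity and smoothness constants
kappa = L / mu                 # condition number

assert mu > 0                  # Q is positive definite -> f is strongly convex
assert L >= mu and kappa >= 1.0
```

Replacing `Q` with a near-singular matrix drives $\mu \to 0$ and $\kappa \to \infty$, which is exactly the ill-conditioned "narrow valley" regime described above.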
Gradient Descent works by repeatedly taking steps proportional to the negative of the gradient of the function at the current point.

#### Gradient Descent Update Rule

The core of the algorithm is the update rule, which moves the current estimate $\mathbf{x}_k$ to a new estimate $\mathbf{x}_{k+1}$:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - t_k \nabla f(\mathbf{x}_k)$$

where:

- $\mathbf{x}_k$: Current parameter vector at iteration $k$.
- $\nabla f(\mathbf{x}_k)$: Gradient of the objective function $f$ at $\mathbf{x}_k$.
- $t_k$: Learning rate (or step size) at iteration $k$. This determines the size of the step taken in the direction of the negative gradient.

**Choosing the Learning Rate ($t_k$):**

- **Constant Learning Rate:** $t_k = t$ for all $k$. Simple to implement. If $f$ is $L$-smooth, a common choice that guarantees convergence is $t \le 1/L$.

#### Convergence of Gradient Descent

- **General Convex Functions ($L$-smooth):** With an appropriate step size (e.g., constant $t \le 1/L$), Gradient Descent converges to a global minimum. The convergence rate is typically $O(1/k)$: the error decreases inversely with the number of iterations.
- **Strongly Convex Functions ($L$-smooth and $\mu$-strongly convex):** Gradient Descent exhibits **linear (geometric) convergence**: the error decreases exponentially with the number of iterations. The rate is $O((1 - \mu/L)^k) = O((1 - 1/\kappa)^k)$, where $\kappa = L/\mu$ is the condition number. A smaller condition number gives faster convergence.
- **Non-Convex Functions:** Gradient Descent may converge to a local minimum, a saddle point, or fail to converge. The final point depends heavily on the initialization and the specific landscape of the function.

#### Variants of Gradient Descent

- **Stochastic Gradient Descent (SGD):** When dealing with large datasets, calculating the full gradient can be computationally expensive.
SGD approximates the gradient using a single randomly chosen data point or a small mini-batch:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - t_k \nabla f_i(\mathbf{x}_k)$$
(for a randomly chosen data point $i$)
  - **Pros:** Faster for large datasets; can escape shallow local minima in non-convex settings.
  - **Cons:** Higher variance in updates; often requires a decaying learning rate.
- **Mini-Batch Gradient Descent:** A compromise between full-batch GD and SGD, using a small batch of data points to estimate the gradient. Balances computational efficiency and update stability.
- **Momentum:** Incorporates a "momentum" term from previous updates to accelerate convergence and dampen oscillations, especially in ill-conditioned problems:
$$\mathbf{v}_{k+1} = \beta \mathbf{v}_k + (1-\beta) \nabla f(\mathbf{x}_k)$$
$$\mathbf{x}_{k+1} = \mathbf{x}_k - t_k \mathbf{v}_{k+1}$$
where $\beta \in [0,1)$ is the momentum coefficient.
- **Adaptive Learning Rate Methods (e.g., Adam, RMSprop, Adagrad):** These methods dynamically adjust the learning rate for each parameter based on past gradients. They are widely used in deep learning for their robustness and fast convergence.

### Taylor Approximations

#### Taylor's Theorem

Taylor's theorem provides a way to approximate a function by a polynomial whose coefficients depend on the function's derivatives at a single point. This polynomial is called the Taylor polynomial.

**General Form (Univariate):** If $f$ is $k$ times differentiable at a point $a$, then for any $x$ in an open interval containing $a$:

$$f(x) = P_{k,a}(x) + R_k(x)$$

where $P_{k,a}(x)$ is the Taylor polynomial of degree $k$ and $R_k(x)$ is the remainder term:

$$P_{k,a}(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \dots + \frac{f^{(k)}(a)}{k!}(x-a)^k = \sum_{d=0}^{k} \frac{f^{(d)}(a)}{d!}(x-a)^d$$

The remainder term $R_k(x)$ (Lagrange form) is:

$$R_k(x) = \frac{f^{(k+1)}(c)}{(k+1)!}(x-a)^{k+1}$$

for some $c$ between $a$ and $x$.
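As a concrete check of the Lagrange remainder, take $f(x) = e^x$ at $a = 0$ with $k = 2$: since $f^{(3)}(c) = e^c \le e^x$ for $0 \le c \le x$, the error of $P_2$ is bounded by $e^x x^3 / 3!$. A short Python verification (the test point $x = 0.5$ is an arbitrary choice):

```python
import math

# Taylor check for f(x) = exp(x) at a = 0 with k = 2:
# P_2(x) = 1 + x + x^2/2, and for x > 0 the Lagrange remainder satisfies
# |R_2(x)| = e^c * x^3 / 3! <= e^x * x^3 / 6  (since 0 <= c <= x).
x = 0.5
p2 = 1 + x + x**2 / 2
actual_error = abs(math.exp(x) - p2)
lagrange_bound = math.exp(x) * x**3 / 6

assert actual_error <= lagrange_bound
```

Shrinking $x$ shrinks the bound as $x^3$, which is why quadratic Taylor models are so accurate close to the expansion point.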
The remainder term quantifies the error of the approximation.

**General Form (Multivariate):** For a function $f: \mathbb{R}^n \to \mathbb{R}$ that is $k$ times continuously differentiable at $\mathbf{a}$:

$$f(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x}-\mathbf{a}) + \frac{1}{2!}(\mathbf{x}-\mathbf{a})^T \nabla^2 f(\mathbf{a})(\mathbf{x}-\mathbf{a}) + \dots + R_k(\mathbf{x})$$

where the higher-order terms involve tensors of derivatives. For optimization, the first- and second-order approximations are the most relevant.

#### First- and Second-Order Taylor Approximations

These approximations are fundamental in optimization for understanding local function behavior and designing algorithms.

- **First-order Taylor approximation (univariate):** Linear approximation
$$f(x) \approx P_{1,a}(x) = f(a) + f'(a)(x-a)$$
This approximates the function by its tangent line at $a$.
- **Second-order Taylor approximation (univariate):** Quadratic approximation
$$f(x) \approx P_{2,a}(x) = f(a) + f'(a)(x-a) + \frac{1}{2}f''(a)(x-a)^2$$
This approximates the function by a parabola, capturing its curvature.
- **First-order Taylor approximation (multivariate):** Linear approximation
$$f(\mathbf{x}) \approx P_{1,\mathbf{a}}(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x}-\mathbf{a})$$
This approximates the function by its tangent hyperplane at $\mathbf{a}$.
- **Second-order Taylor approximation (multivariate):** Quadratic approximation
$$f(\mathbf{x}) \approx P_{2,\mathbf{a}}(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x}-\mathbf{a}) + \frac{1}{2}(\mathbf{x}-\mathbf{a})^T \nabla^2 f(\mathbf{a})(\mathbf{x}-\mathbf{a})$$
This approximates the function by a quadratic form, capturing its local curvature, and is the basis for Newton's method and the second-derivative test.

#### Applications in Optimization

- **Gradient Descent:** The first-order Taylor approximation is the basis for Gradient Descent.
By minimizing the first-order approximation plus a quadratic penalty on the step size, one arrives at the gradient descent update:
$$f(\mathbf{x}) \approx f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T(\mathbf{x} - \mathbf{x}_k)$$
To decrease this approximation, move in the direction opposite to the gradient.
- **Newton's Method:** Uses the second-order Taylor approximation to find the minimum. Setting the gradient of the quadratic approximation to zero yields the Newton update rule:
$$\nabla f(\mathbf{x}_k) + \nabla^2 f(\mathbf{x}_k)(\mathbf{x} - \mathbf{x}_k) = \mathbf{0}$$
$$\mathbf{x}_{k+1} = \mathbf{x}_k - (\nabla^2 f(\mathbf{x}_k))^{-1} \nabla f(\mathbf{x}_k)$$
Newton's method converges much faster than gradient descent (quadratic convergence) near a minimum if the function is well behaved, but requires computing and inverting the Hessian.
- **Local Analysis:** Taylor approximations are used to analyze the behavior of a function near a critical point (e.g., the second derivative test uses the quadratic approximation).
- **Error Analysis:** The remainder term bounds the error of these approximations.
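The two updates can be compared on a small quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T Q \mathbf{x} - \mathbf{b}^T \mathbf{x}$, whose unique minimizer solves $Q\mathbf{x} = \mathbf{b}$. Gradient descent with $t = 1/L$ needs many iterations, while Newton's method is exact in one step here because the quadratic model *is* the function. ($Q$ and $\mathbf{b}$ below are made-up example data.)

```python
import numpy as np

# Hypothetical quadratic: f(x) = 0.5 * x^T Q x - b^T x, gradient Q x - b,
# Hessian Q, minimizer x* solving Q x = b.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

grad = lambda x: Q @ x - b
x_star = np.linalg.solve(Q, b)           # exact minimizer for reference

# Gradient descent with constant step t = 1/L, L = lambda_max(Q).
L = np.linalg.eigvalsh(Q)[-1]
x_gd = np.zeros(2)
for _ in range(200):
    x_gd = x_gd - (1.0 / L) * grad(x_gd)
assert np.allclose(x_gd, x_star, atol=1e-6)

# Newton's method: x_{k+1} = x_k - H^{-1} grad(x_k); one step suffices here.
x_newton = np.zeros(2)
x_newton = x_newton - np.linalg.solve(Q, grad(x_newton))
assert np.allclose(x_newton, x_star)
```

Note that the code solves the linear system $Q\mathbf{d} = \nabla f$ rather than forming $Q^{-1}$ explicitly; this is the standard, numerically safer way to apply the Newton update.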