# Optimization Fundamentals
### Critical Points & Gradient

#### Definition of Critical Points

Critical points are points in the domain of a function where its gradient is zero or undefined. These points are candidates for local maxima, local minima, or saddle points.

To find the critical points of a function $f(x_1, x_2, \dots, x_n)$, solve the system of equations in which every partial derivative equals zero:

$$\nabla f(\mathbf{x}) = \mathbf{0} \implies \frac{\partial f}{\partial x_i} = 0 \quad \text{for all } i=1, \dots, n$$

**Steps to find critical points:**

1. Compute the partial derivatives of $f$ with respect to each variable.
2. Set each partial derivative equal to zero.
3. Solve the resulting system of equations for $x_1, x_2, \dots, x_n$.
4. Check for points where the gradient is undefined, if applicable (e.g., points where particular partial derivatives do not exist).

#### Gradient Definition

- **Definition:** The gradient of a scalar-valued multivariable function is the vector of its first-order partial derivatives. It points in the direction of the greatest rate of increase of the function.
- **Formula:** For a function $f: \mathbb{R}^n \to \mathbb{R}$, the gradient $\nabla f(\mathbf{x})$ is:
$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$$
- **Geometric Interpretation:** At any point, the gradient vector is perpendicular to the level set (contour line or surface) of the function through that point. Its magnitude indicates the steepness of the function.
- **Directional Derivative:** The directional derivative of $f$ in the direction of a unit vector $\mathbf{u}$ is $D_{\mathbf{u}}f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u}$.

### Hessian Matrix

#### Definition & General Formula

The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function.
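The critical-point recipe above can be sanity-checked numerically. Below is a minimal NumPy sketch using the hypothetical function $f(x,y) = x^3 - 3x + y^2$ (not from the text): its gradient vanishes at $(1, 0)$ and $(-1, 0)$, and a central-difference approximation agrees with the hand-derived gradient.

```python
import numpy as np

# Hypothetical example: f(x, y) = x^3 - 3x + y^2
def f(p):
    x, y = p
    return x**3 - 3*x + y**2

def grad_f(p):
    """Analytic gradient: (3x^2 - 3, 2y)."""
    x, y = p
    return np.array([3*x**2 - 3, 2*y])

def numerical_grad(f, p, h=1e-6):
    """Central-difference approximation of the gradient."""
    g = np.zeros_like(p, dtype=float)
    for i in range(len(p)):
        e = np.zeros_like(p, dtype=float)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

# Solving grad f = 0 by hand: 3x^2 - 3 = 0 and 2y = 0 give (1, 0) and (-1, 0).
for p in [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]:
    assert np.allclose(grad_f(p), 0.0)                        # critical point
    assert np.allclose(grad_f(p), numerical_grad(f, p), atol=1e-5)
```

Finite-difference checks like `numerical_grad` are a common way to validate a hand-derived gradient before trusting it inside an optimizer.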
The Hessian describes the local curvature of a function. For a function $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian matrix $H(f)(\mathbf{x})$ (or $\nabla^2 f(\mathbf{x})$) is:

$$H(f)(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

**Note:** If the second partial derivatives are continuous in a neighborhood of a point, then the mixed partial derivatives are equal ($\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$), making the Hessian a symmetric matrix (Clairaut's theorem, also known as Schwarz's theorem). This symmetry is crucial for many optimization properties.

#### Second Derivative Test for Critical Points (Multivariate)

For a critical point $\mathbf{x}^*$ where $\nabla f(\mathbf{x}^*) = \mathbf{0}$:

1. **Compute the Hessian matrix** $H(f)(\mathbf{x}^*)$ at the critical point.
2. **Evaluate the definiteness of $H(f)(\mathbf{x}^*)$:**
   - If $H(f)(\mathbf{x}^*)$ is **positive definite** (all eigenvalues are positive), then $f$ has a **local minimum** at $\mathbf{x}^*$. A common test for positive definiteness is that all leading principal minors are positive ($D_1 > 0, D_2 > 0, \dots, D_n > 0$). For a $2 \times 2$ matrix: $D_1 = f_{xx} > 0$ and $D_2 = \det(H) = f_{xx}f_{yy} - f_{xy}^2 > 0$.
   - If $H(f)(\mathbf{x}^*)$ is **negative definite** (all eigenvalues are negative), then $f$ has a **local maximum** at $\mathbf{x}^*$. For a $2 \times 2$ matrix: $D_1 = f_{xx} < 0$ and $D_2 = \det(H) > 0$.
   - If $H(f)(\mathbf{x}^*)$ is **indefinite** (has both positive and negative eigenvalues), then $f$ has a **saddle point** at $\mathbf{x}^*$. For a $2 \times 2$ matrix: $\det(H) < 0$.
   - If $\det(H) = 0$, the test is inconclusive.

### Eigenvalues & Eigenvectors

#### Definitions

- **Eigenvectors:** For a given linear transformation (represented by a matrix $A$), an eigenvector $\mathbf{v}$ is a non-zero vector that changes at most by a scalar factor when that linear transformation is applied to it. In the context of the Hessian, eigenvectors indicate the principal directions of curvature.
- **Eigenvalues:** The scalar factor $\lambda$ by which an eigenvector is scaled. In the context of the Hessian, eigenvalues quantify the magnitude and "direction" (concave up or concave down) of the curvature along their corresponding eigenvectors.

For a square matrix $A$ (like the Hessian $H$), eigenvalues ($\lambda$) and eigenvectors ($\mathbf{v}$) satisfy:

$$A\mathbf{v} = \lambda\mathbf{v}$$

This can be rewritten as:

$$(A - \lambda I)\mathbf{v} = \mathbf{0}$$

where $I$ is the identity matrix. To find eigenvalues, solve the characteristic equation:

$$\det(A - \lambda I) = 0$$

Once the eigenvalues are found, substitute each $\lambda$ back into $(A - \lambda I)\mathbf{v} = \mathbf{0}$ to find the corresponding eigenvectors.

#### Matrix Definiteness Based on Eigenvalues

For a symmetric matrix (like the Hessian):

- **Positive Definite (PD):** All eigenvalues $\lambda_i > 0$.
- **Positive Semi-Definite (PSD):** All eigenvalues $\lambda_i \ge 0$.
- **Negative Definite (ND):** All eigenvalues $\lambda_i < 0$.
- **Negative Semi-Definite (NSD):** All eigenvalues $\lambda_i \le 0$.
- **Indefinite:** Some eigenvalues are positive and some are negative.

### Convexity

#### Convex Sets

- **Definition:** A set $S \subseteq \mathbb{R}^n$ is convex if for any two points $\mathbf{x}_1, \mathbf{x}_2 \in S$, the line segment connecting them is entirely contained in $S$. That is, for all $\theta \in [0, 1]$, the point $\theta \mathbf{x}_1 + (1-\theta)\mathbf{x}_2 \in S$.
- **Examples of Convex Sets:** Hyperplanes, half-spaces, balls (Euclidean, $L_1$, $L_\infty$), cones, polyhedra, intersections of convex sets.
- **Non-examples:** Star-shaped sets, sets with holes or indentations.
- **Properties:**
  - The intersection of any number of convex sets is convex.
  - The sum of convex sets is convex.
  - The image of a convex set under an affine map is convex.
  - The inverse image of a convex set under an affine map is convex.

#### Convex Functions

- **Definition:** A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if its domain is a convex set and for all $\mathbf{x}, \mathbf{y}$ in its domain and all $\theta \in [0, 1]$:
$$f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \le \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y})$$
Geometrically, the line segment connecting any two points on the graph of the function lies on or above the graph.
- **Strictly Convex:** The inequality is strict for $\theta \in (0, 1)$ and $\mathbf{x} \ne \mathbf{y}$.
- **Concave Function:** A function $f$ is concave if $-f$ is convex:
$$f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \ge \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y})$$
- **Properties:**
  - The sum of convex functions is convex.
  - A non-negative weighted sum of convex functions is convex.
  - The composition of a convex function with an affine mapping is convex.
  - The pointwise maximum of a set of convex functions is convex.

#### Proving Convexity of Functions

1. **Definition (Jensen's Inequality):** Directly use the definition: $f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \le \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y})$. This is often used for simple functions or when other methods are difficult.
2. **First-Order Condition:** A differentiable function $f$ is convex if and only if its domain is convex and for all $\mathbf{x}, \mathbf{y}$ in its domain:
$$f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x})$$
Geometrically, the tangent hyperplane (linear approximation) at any point lies on or below the graph of the function.
3. **Second-Order Condition (Hessian):** A twice-differentiable function $f$ with a convex domain is convex if and only if its Hessian matrix $\nabla^2 f(\mathbf{x})$ is **positive semi-definite (PSD)** for all $\mathbf{x}$ in its domain.
   - If $\nabla^2 f(\mathbf{x})$ is **positive definite (PD)** for all $\mathbf{x}$, then $f$ is strictly convex.
   - If $\nabla^2 f(\mathbf{x})$ is **negative semi-definite (NSD)** for all $\mathbf{x}$, then $f$ is concave.
   - If $\nabla^2 f(\mathbf{x})$ is **negative definite (ND)** for all $\mathbf{x}$, then $f$ is strictly concave.
4. **Operations that preserve convexity:** Sums, non-negative scaling, pointwise maximum, composition with affine maps.

### Convex Optimization Problem

#### Definition

A convex optimization problem has the form:

$$\begin{array}{ll} \text{minimize} & f_0(\mathbf{x}) \\ \text{subject to} & f_i(\mathbf{x}) \le 0, \quad i=1, \dots, m \\ & h_j(\mathbf{x}) = 0, \quad j=1, \dots, p \end{array}$$

where:

- The objective function $f_0(\mathbf{x})$ is a **convex function**.
- The inequality constraint functions $f_i(\mathbf{x})$ are **convex functions**.
- The equality constraint functions $h_j(\mathbf{x})$ are **affine** (i.e., $h_j(\mathbf{x}) = \mathbf{a}_j^T \mathbf{x} - b_j$).

**Key Properties:**

- The feasible set of a convex optimization problem is always convex: it is the intersection of sublevel sets of convex functions (which are convex) and affine sets (which are convex).
- Any local minimum of a convex optimization problem is also a global minimum.
- For a strictly convex objective, if a minimum exists, it is unique.

### L-Smoothness & Strong Convexity

#### L-Smoothness (Lipschitz Continuous Gradient)

A function $f$ is $L$-smooth if its gradient $\nabla f$ is Lipschitz continuous with constant $L$; intuitively, the slope of the function does not change too rapidly.
- **Definition:** A differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth if there exists a constant $L > 0$ such that for all $\mathbf{x}, \mathbf{y} \in \text{dom}(f)$:
$$||\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})||_2 \le L||\mathbf{x} - \mathbf{y}||_2$$
- **Second-Order Condition:** If $f$ is twice differentiable, then $f$ is $L$-smooth if and only if the eigenvalues of its Hessian are bounded: all eigenvalues of $\nabla^2 f(\mathbf{x})$ lie in $[-L, L]$ for all $\mathbf{x}$ in the domain. For convex $f$ this reduces to
$$\lambda_{\max}(\nabla^2 f(\mathbf{x})) \le L \quad \text{for all } \mathbf{x} \in \text{dom}(f),$$
where $\lambda_{\max}(A)$ denotes the largest eigenvalue of matrix $A$.
- **Importance:** $L$-smoothness implies that the function has no arbitrarily sharp "peaks" or "valleys," which matters for the convergence of gradient-based optimization algorithms. It guarantees a quadratic upper bound on the function:
$$f(\mathbf{y}) \le f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x}) + \frac{L}{2} ||\mathbf{y} - \mathbf{x}||_2^2$$

#### Strong Convexity

A function is strongly convex if it is "uniformly" convex: it curves upward at least by a fixed amount everywhere. Strong convexity ensures a unique minimum and faster convergence rates for optimization algorithms.
- **Definition:** A function $f: \mathbb{R}^n \to \mathbb{R}$ is $\mu$-strongly convex if there exists a constant $\mu > 0$ such that for all $\mathbf{x}, \mathbf{y} \in \text{dom}(f)$ and all $\theta \in [0, 1]$:
$$f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \le \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y}) - \frac{\mu}{2}\theta(1-\theta)||\mathbf{x}-\mathbf{y}||_2^2$$
- **First-Order Condition:** A differentiable $f$ is $\mu$-strongly convex if and only if its domain is convex and for all $\mathbf{x}, \mathbf{y}$ in its domain:
$$f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x}) + \frac{\mu}{2} ||\mathbf{y} - \mathbf{x}||_2^2$$
- **Second-Order Condition:** If $f$ is twice differentiable, then $f$ is $\mu$-strongly convex if and only if its Hessian satisfies:
$$\nabla^2 f(\mathbf{x}) \succeq \mu I \quad \text{for all } \mathbf{x} \in \text{dom}(f)$$
Equivalently, all eigenvalues of $\nabla^2 f(\mathbf{x})$ are at least $\mu$:
$$\lambda_{\min}(\nabla^2 f(\mathbf{x})) \ge \mu \quad \text{for all } \mathbf{x} \in \text{dom}(f)$$
where $\lambda_{\min}(A)$ denotes the smallest eigenvalue of matrix $A$.

#### Condition Number

The condition number $\kappa$ of a function (or its Hessian) is the ratio of its $L$-smoothness constant to its strong convexity constant:

$$\kappa = \frac{L}{\mu}$$

- A small condition number (close to 1) indicates a well-conditioned problem: the function resembles a "round" bowl, and optimization algorithms tend to converge quickly.
- A large condition number indicates an ill-conditioned problem: the function resembles a long, narrow valley (its level sets are highly elongated ellipsoids), and optimization algorithms can struggle, exhibiting zig-zagging behavior.

### Gradient Descent

#### Definition

Gradient Descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function.
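For a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T Q \mathbf{x}$ with symmetric $Q$, the Hessian is $Q$ everywhere, so the three constants above reduce to eigenvalues: $L = \lambda_{\max}(Q)$, $\mu = \lambda_{\min}(Q)$, and $\kappa = L/\mu$. A quick NumPy check (the matrix $Q$ below is a made-up example):

```python
import numpy as np

# Hypothetical example: f(x) = 0.5 * x^T Q x, Hessian = Q everywhere.
Q = np.array([[10.0, 2.0],
              [2.0, 1.0]])

eigs = np.linalg.eigvalsh(Q)   # eigenvalues of a symmetric matrix, ascending
mu, L = eigs[0], eigs[-1]      # strong convexity and smoothness constants
kappa = L / mu                 # condition number

assert mu > 0                  # Q is positive definite -> f is strongly convex
assert L >= mu and kappa >= 1.0
```

Replacing `Q` with a near-singular matrix drives $\mu \to 0$ and $\kappa \to \infty$, which is exactly the ill-conditioned "narrow valley" regime described above.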
Gradient Descent works by repeatedly taking steps proportional to the negative of the gradient of the function at the current point.

#### Gradient Descent Update Rule

The core of the algorithm is the update rule, which moves the current estimate $\mathbf{x}_k$ to a new estimate $\mathbf{x}_{k+1}$:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - t_k \nabla f(\mathbf{x}_k)$$

where:

- $\mathbf{x}_k$: Current parameter vector at iteration $k$.
- $\nabla f(\mathbf{x}_k)$: Gradient of the objective function $f$ at $\mathbf{x}_k$.
- $t_k$: Learning rate (or step size) at iteration $k$. This determines the size of the step taken in the direction of the negative gradient.

**Choosing the Learning Rate ($t_k$):**

- **Constant Learning Rate:** $t_k = t$ for all $k$. Simple to implement. If $f$ is $L$-smooth, a common choice that guarantees convergence is $t \le 1/L$.

#### Convergence of Gradient Descent

- **General Convex Functions ($L$-smooth):** With an appropriate step size (e.g., constant $t \le 1/L$), Gradient Descent converges to a global minimum. The convergence rate is typically $O(1/k)$: the error decreases inversely with the number of iterations.
- **Strongly Convex Functions ($L$-smooth and $\mu$-strongly convex):** Gradient Descent exhibits **linear (geometric) convergence**: the error decreases exponentially with the number of iterations. The rate is $O((1 - \mu/L)^k) = O((1 - 1/\kappa)^k)$, where $\kappa = L/\mu$ is the condition number. A smaller condition number gives faster convergence.
- **Non-Convex Functions:** Gradient Descent may converge to a local minimum, a saddle point, or fail to converge. The final point depends heavily on the initialization and the specific landscape of the function.

#### Variants of Gradient Descent

- **Stochastic Gradient Descent (SGD):** When dealing with large datasets, calculating the full gradient can be computationally expensive.
SGD approximates the gradient using a single randomly chosen data point or a small mini-batch:
$$\mathbf{x}_{k+1} = \mathbf{x}_k - t_k \nabla f_i(\mathbf{x}_k)$$
(for a randomly chosen data point $i$)
  - **Pros:** Faster for large datasets; can escape shallow local minima in non-convex settings.
  - **Cons:** Higher variance in updates; often requires a decaying learning rate.
- **Mini-Batch Gradient Descent:** A compromise between full-batch GD and SGD, using a small batch of data points to estimate the gradient. Balances computational efficiency and update stability.
- **Momentum:** Incorporates a "momentum" term from previous updates to accelerate convergence and dampen oscillations, especially in ill-conditioned problems:
$$\mathbf{v}_{k+1} = \beta \mathbf{v}_k + (1-\beta) \nabla f(\mathbf{x}_k)$$
$$\mathbf{x}_{k+1} = \mathbf{x}_k - t_k \mathbf{v}_{k+1}$$
where $\beta \in [0,1)$ is the momentum coefficient.
- **Adaptive Learning Rate Methods (e.g., Adam, RMSprop, Adagrad):** These methods dynamically adjust the learning rate for each parameter based on past gradients. They are widely used in deep learning for their robustness and fast convergence.

### Taylor Approximations

#### Taylor's Theorem

Taylor's theorem provides a way to approximate a function by a polynomial whose coefficients depend on the function's derivatives at a single point. This polynomial is called the Taylor polynomial.

**General Form (Univariate):** If $f$ is $k$ times differentiable at a point $a$, then for any $x$ in an open interval containing $a$:

$$f(x) = P_{k,a}(x) + R_k(x)$$

where $P_{k,a}(x)$ is the Taylor polynomial of degree $k$ and $R_k(x)$ is the remainder term:

$$P_{k,a}(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \dots + \frac{f^{(k)}(a)}{k!}(x-a)^k = \sum_{d=0}^{k} \frac{f^{(d)}(a)}{d!}(x-a)^d$$

The remainder term $R_k(x)$ (Lagrange form) is:

$$R_k(x) = \frac{f^{(k+1)}(c)}{(k+1)!}(x-a)^{k+1}$$

for some $c$ between $a$ and $x$.
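As a concrete check of the Lagrange remainder, take $f(x) = e^x$ at $a = 0$ with $k = 2$: since $f^{(3)}(c) = e^c \le e^x$ for $0 \le c \le x$, the error of $P_2$ is bounded by $e^x x^3 / 3!$. A short Python verification (the test point $x = 0.5$ is an arbitrary choice):

```python
import math

# Taylor check for f(x) = exp(x) at a = 0 with k = 2:
# P_2(x) = 1 + x + x^2/2, and for x > 0 the Lagrange remainder satisfies
# |R_2(x)| = e^c * x^3 / 3! <= e^x * x^3 / 6  (since 0 <= c <= x).
x = 0.5
p2 = 1 + x + x**2 / 2
actual_error = abs(math.exp(x) - p2)
lagrange_bound = math.exp(x) * x**3 / 6

assert actual_error <= lagrange_bound
```

Shrinking $x$ shrinks the bound as $x^3$, which is why quadratic Taylor models are so accurate close to the expansion point.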
The remainder term quantifies the error of the approximation.

**General Form (Multivariate):** For a function $f: \mathbb{R}^n \to \mathbb{R}$ that is $k$ times continuously differentiable at $\mathbf{a}$:

$$f(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x}-\mathbf{a}) + \frac{1}{2!}(\mathbf{x}-\mathbf{a})^T \nabla^2 f(\mathbf{a})(\mathbf{x}-\mathbf{a}) + \dots + R_k(\mathbf{x})$$

where the higher-order terms involve tensors of derivatives. For optimization, the first- and second-order approximations are the most relevant.

#### First- and Second-Order Taylor Approximations

These approximations are fundamental in optimization for understanding local function behavior and designing algorithms.

- **First-order Taylor approximation (univariate):** Linear approximation
$$f(x) \approx P_{1,a}(x) = f(a) + f'(a)(x-a)$$
This approximates the function by its tangent line at $a$.
- **Second-order Taylor approximation (univariate):** Quadratic approximation
$$f(x) \approx P_{2,a}(x) = f(a) + f'(a)(x-a) + \frac{1}{2}f''(a)(x-a)^2$$
This approximates the function by a parabola, capturing its curvature.
- **First-order Taylor approximation (multivariate):** Linear approximation
$$f(\mathbf{x}) \approx P_{1,\mathbf{a}}(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x}-\mathbf{a})$$
This approximates the function by its tangent hyperplane at $\mathbf{a}$.
- **Second-order Taylor approximation (multivariate):** Quadratic approximation
$$f(\mathbf{x}) \approx P_{2,\mathbf{a}}(\mathbf{x}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^T(\mathbf{x}-\mathbf{a}) + \frac{1}{2}(\mathbf{x}-\mathbf{a})^T \nabla^2 f(\mathbf{a})(\mathbf{x}-\mathbf{a})$$
This approximates the function by a quadratic form, capturing its local curvature, and is the basis for Newton's method and the second-derivative test.

#### Applications in Optimization

- **Gradient Descent:** The first-order Taylor approximation is the basis for Gradient Descent.
By minimizing the first-order approximation plus a quadratic penalty on the step size, one arrives at the gradient descent update:
$$f(\mathbf{x}) \approx f(\mathbf{x}_k) + \nabla f(\mathbf{x}_k)^T(\mathbf{x} - \mathbf{x}_k)$$
To decrease this approximation, move in the direction opposite to the gradient.
- **Newton's Method:** Uses the second-order Taylor approximation to find the minimum. Setting the gradient of the quadratic approximation to zero yields the Newton update rule:
$$\nabla f(\mathbf{x}_k) + \nabla^2 f(\mathbf{x}_k)(\mathbf{x} - \mathbf{x}_k) = \mathbf{0}$$
$$\mathbf{x}_{k+1} = \mathbf{x}_k - (\nabla^2 f(\mathbf{x}_k))^{-1} \nabla f(\mathbf{x}_k)$$
Newton's method converges much faster than gradient descent (quadratic convergence) near a minimum if the function is well behaved, but requires computing and inverting the Hessian.
- **Local Analysis:** Taylor approximations are used to analyze the behavior of a function near a critical point (e.g., the second derivative test uses the quadratic approximation).
- **Error Analysis:** The remainder term bounds the error of these approximations.
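The two updates can be compared on a small quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T Q \mathbf{x} - \mathbf{b}^T \mathbf{x}$, whose unique minimizer solves $Q\mathbf{x} = \mathbf{b}$. Gradient descent with $t = 1/L$ needs many iterations, while Newton's method is exact in one step here because the quadratic model *is* the function. ($Q$ and $\mathbf{b}$ below are made-up example data.)

```python
import numpy as np

# Hypothetical quadratic: f(x) = 0.5 * x^T Q x - b^T x, gradient Q x - b,
# Hessian Q, minimizer x* solving Q x = b.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

grad = lambda x: Q @ x - b
x_star = np.linalg.solve(Q, b)           # exact minimizer for reference

# Gradient descent with constant step t = 1/L, L = lambda_max(Q).
L = np.linalg.eigvalsh(Q)[-1]
x_gd = np.zeros(2)
for _ in range(200):
    x_gd = x_gd - (1.0 / L) * grad(x_gd)
assert np.allclose(x_gd, x_star, atol=1e-6)

# Newton's method: x_{k+1} = x_k - H^{-1} grad(x_k); one step suffices here.
x_newton = np.zeros(2)
x_newton = x_newton - np.linalg.solve(Q, grad(x_newton))
assert np.allclose(x_newton, x_star)
```

Note that the code solves the linear system $Q\mathbf{d} = \nabla f$ rather than forming $Q^{-1}$ explicitly; this is the standard, numerically safer way to apply the Newton update.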