1. Modeling Text with Distances (Vector Space Model)

Transforming discrete linguistic units into a continuous geometric space (vectorization) to quantify semantic relationships.

Vector Space Model (VSM)
Documents are represented as points in a high-dimensional vector space. Given a vocabulary $V$ of size $M$, a document $d$ is mapped to a vector $x \in \mathbb{R}^M$.
- Bag-of-Words (BoW): each component $x_i$ is the frequency of the $i$-th term.
- TF-IDF weighting accounts for term importance:
  - Term Frequency (TF): $tf_{i,j} = \frac{N_{i,j}}{\sum_k N_{k,j}}$
  - Inverse Document Frequency (IDF): $idf_i = \log \frac{N}{|\{d_j \in D : t_i \in d_j\}|}$
  - Weight: $w_{i,j} = tf_{i,j} \cdot idf_i$
  Here $N_{i,j}$ is the number of occurrences of term $i$ in document $j$, and $N$ is the total number of documents in the corpus $D$.

Geometric Distance Metrics
- Euclidean Distance ($L_2$ norm): $d_2(x, y) = \sqrt{\sum_{i=1}^M (x_i - y_i)^2}$. Sensitive to document length: longer documents have higher magnitudes.
- Cosine Similarity: measures the angle between vectors and is length-invariant.
  $\cos(\theta) = \frac{x \cdot y}{||x|| \cdot ||y||} = \frac{\sum_{i=1}^M x_i y_i}{\sqrt{\sum_{i=1}^M x_i^2} \sqrt{\sum_{i=1}^M y_i^2}}$
- Cosine Distance: $d_{cos}(x, y) = 1 - \cos(\theta)$

2. Distributional and Information-Theoretic Distances

Used when documents are modeled as probability distributions (e.g., topic distributions from Latent Dirichlet Allocation).
- Kullback-Leibler (KL) Divergence: measures how $P$ diverges from $Q$.
  $D_{KL}(P||Q) = \sum_{i=1}^M P(i) \log \frac{P(i)}{Q(i)}$
  Non-symmetric: $D_{KL}(P||Q) \ne D_{KL}(Q||P)$.
- Jensen-Shannon Divergence (JSD): a symmetrized version of the KL divergence.
  $JSD(P||Q) = \frac{1}{2} D_{KL}(P||R) + \frac{1}{2} D_{KL}(Q||R)$, where $R = \frac{1}{2}(P+Q)$ is the mixture distribution.

3. Semantic Distances in Low-Dimensional Embeddings

Word embeddings (Word2Vec, GloVe) map words into a dense space $\mathbb{R}^n$ ($n \ll M$) in which semantic similarity correlates with geometric proximity.
- Word Mover's Distance (WMD): treats documents as "piles of earth" and computes the minimum cumulative cost of transforming the word distribution of document A into that of document B.
  Optimization problem: $\min_{T \ge 0} \sum_{i,j=1}^M T_{i,j} \cdot c(i,j)$
  subject to $\sum_{j=1}^M T_{i,j} = d_{A,i}$ and $\sum_{i=1}^M T_{i,j} = d_{B,j}$,
  where $c(i,j) = ||v_i - v_j||_2$ is the Euclidean distance between the corresponding word embeddings.

4. Metric Properties

For a function $d(x, y)$ to be a valid distance metric, it must satisfy:
- Non-negativity: $d(x, y) \ge 0$
- Identity of indiscernibles: $d(x, y) = 0 \iff x = y$
- Symmetry: $d(x, y) = d(y, x)$
- Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$
Cosine distance does not strictly satisfy the triangle inequality, but its square root and the angular distance $\arccos(\text{similarity})$ do.

5. Distances Applied to Sets: Jaccard and Edit Distances

Measuring similarity and dissimilarity between sets (Jaccard) and sequences (edit distance).

Jaccard Similarity and Distance
Measures the similarity of two finite sets by comparing the size of their intersection to the size of their union.
- Jaccard Similarity: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, with $0 \le J(A, B) \le 1$; by convention $J(A, B) = 1$ when both $A$ and $B$ are empty.
- Jaccard Distance: $d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$
A true metric satisfying all four axioms. Useful for comparing vocabularies (bag-of-words); a worked example appears in the sketch below.
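To make the vector-space and set-based measures above concrete, here is a minimal Python sketch (the two toy documents and the helper function names are illustrative, not taken from the notes): it builds bag-of-words vectors over a shared vocabulary and compares the documents with cosine similarity/distance (Section 1) and Jaccard distance (Section 5).

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term-count vectors (Section 1)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard_distance(a, b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two token sets (Section 5)."""
    if not a and not b:
        return 0.0  # J(∅, ∅) = 1 by convention, so the distance is 0
    return 1.0 - len(a & b) / len(a | b)

doc_a = "the cat sat on the mat".split()
doc_b = "the dog sat on the log".split()

# Bag-of-words vectors over the shared vocabulary.
vocab = sorted(set(doc_a) | set(doc_b))
vec_a = [doc_a.count(term) for term in vocab]
vec_b = [doc_b.count(term) for term in vocab]

print("cosine similarity:", round(cosine_similarity(vec_a, vec_b), 3))
print("cosine distance:  ", round(1 - cosine_similarity(vec_a, vec_b), 3))
print("jaccard distance: ", round(jaccard_distance(set(doc_a), set(doc_b)), 3))
```

Note that cosine similarity operates on the count vectors, while Jaccard ignores frequencies and compares only vocabulary membership.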
Edit Distance (Levenshtein Distance)
Measures the minimum number of single-character edits (insertions, deletions, substitutions) needed to change one string into another.
Formal recurrence relation: let $lev(i, j)$ be the distance between the prefixes $s_1[1..i]$ and $s_2[1..j]$. If $\min(i, j) = 0$, then $lev(i, j) = \max(i, j)$. Otherwise,
$lev(i, j) = \min \begin{cases} lev(i-1, j) + 1 \\ lev(i, j-1) + 1 \\ lev(i-1, j-1) + \mathbb{I}(s_1[i] \ne s_2[j]) \end{cases}$
where $\mathbb{I}(s_1[i] \ne s_2[j])$ is $1$ if the characters differ and $0$ otherwise.
Computational complexity: $O(m \cdot n)$ for strings of length $m$ and $n$. Crucial when relative order is significant (e.g., DNA sequence alignment, spell-checking); a dynamic-programming implementation appears in the sketch further below.

Comparison

| Feature | Jaccard Distance | Edit (Levenshtein) Distance |
| --- | --- | --- |
| Data type | Unordered sets | Ordered sequences / strings |
| Operations | Set union and intersection | Insert, delete, substitute |
| Range | $[0, 1]$ | $[0, \max(m, n)]$ |
| Key use case | Document similarity, ecology | Spell check, bioinformatics |

For high-dimensional set data, Jaccard distance is approximated using MinHash.

6. Norms in Vector Spaces

A norm measures the "length" or "magnitude" of vectors, turning algebraic structure into geometric structure.

Formal Definition of a Normed Vector Space
A norm $|| \cdot || : V \to [0, \infty)$ on a vector space $V$ over a field $\mathbb{F}$ (typically $\mathbb{R}$ or $\mathbb{C}$) satisfies, for all $u, v \in V$ and $\alpha \in \mathbb{F}$:
- Non-negativity and definiteness: $||v|| \ge 0$, and $||v|| = 0 \iff v = 0$.
- Absolute homogeneity: $||\alpha v|| = |\alpha| \, ||v||$.
- Triangle inequality: $||u + v|| \le ||u|| + ||v||$.
A vector space equipped with a norm is a normed vector space.

Family of $L_p$ Norms
For a vector $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, the $L_p$ norm (Minkowski norm) is
$||x||_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}$
The condition $p \ge 1$ ensures that the triangle inequality holds.

Euclidean Norm ($L_2$)
The most intuitive measure of distance, corresponding to physical length ($p = 2$):
$||x||_2 = \sqrt{\sum_{i=1}^n x_i^2}$
Induced by the standard inner product: $||x||_2 = \sqrt{\langle x, x \rangle}$.
Satisfies the parallelogram law: $2||u||_2^2 + 2||v||_2^2 = ||u+v||_2^2 + ||u-v||_2^2$.

Alternative Metric Choices ($L_1$ and $L_\infty$)
- Manhattan norm ($L_1$): sum of absolute values, $||x||_1 = \sum_{i=1}^n |x_i|$. Used in compressed sensing and LASSO regression to promote sparsity.
- Chebyshev norm ($L_\infty$): maximum absolute component, $||x||_\infty = \lim_{p \to \infty} ||x||_p = \max_i |x_i|$. Measures "worst-case" deviation; common in control theory.

Induced Metrics and Topology
Every norm $|| \cdot ||$ induces a metric $d(x, y) = ||x - y||$. In finite-dimensional spaces all norms are equivalent, meaning they induce the same topology: $C_1 ||x||_a \le ||x||_b \le C_2 ||x||_a$ for some constants $C_1, C_2 > 0$. In infinite-dimensional spaces this equivalence may fail, making the choice of norm critical.

7. Explanation of Distance Types

Manhattan Distance ($L_1$ Norm)
The taxicab metric: the sum of absolute differences of coordinates, suited to pathfinding on grid-like layouts.
$d_1(x, y) = ||x - y||_1 = \sum_{i=1}^n |x_i - y_i|$
Always greater than or equal to the Euclidean distance, and sensitive to rotation of the coordinate system. Often preferred in high dimensions because it gives better contrast between near and distant points.

Hamming Distance
For categorical data or strings of equal length; counts the positions at which the symbols differ.
$d_H(x, y) = \sum_{i=1}^n \mathbb{I}(x_i \ne y_i)$, where $\mathbb{I}(x_i \ne y_i)$ is $1$ if $x_i \ne y_i$ and $0$ otherwise.
Applications: error detection/correction, genetic divergence.

Mahalanobis Distance
Measures the distance between a point and a distribution, accounting for variance and covariance:
$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$
where $\mu$ is the mean and $\Sigma$ is the covariance matrix. If $\Sigma$ is the identity matrix, it reduces to the Euclidean distance. Scale-invariant, accounts for correlations between variables, and defines ellipsoidal contours of equal distance.
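The following is a rough Python sketch of two of the distances defined above: the Levenshtein recurrence from Section 5, filled in as a dynamic-programming table, and the Mahalanobis distance just described (NumPy is assumed for the covariance algebra; the sample strings and data are invented for illustration).

```python
import numpy as np

def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions (Section 5).

    Fills the (m+1) x (n+1) table lev(i, j) row by row in O(m * n) time,
    keeping only the previous row.
    """
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))          # lev(0, j) = j: base case min(i, j) = 0
    for i in range(1, m + 1):
        curr = [i] + [0] * n           # lev(i, 0) = i
        for j in range(1, n + 1):
            substitution = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,                 # deletion
                          curr[j - 1] + 1,             # insertion
                          prev[j - 1] + substitution)  # substitution
        prev = curr
    return prev[n]

def mahalanobis(x, mu, sigma):
    """Distance of point x from a distribution with mean mu and covariance sigma."""
    diff = np.asarray(x) - np.asarray(mu)
    return float(np.sqrt(diff @ np.linalg.inv(sigma) @ diff))

print(levenshtein("kitten", "sitting"))  # 3 edits

# Toy 2-D data with positively correlated features.
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.5], [1.5, 2.0]], size=500)
mu, sigma = data.mean(axis=0), np.cov(data, rowvar=False)
print(mahalanobis([2.0, 2.0], mu, sigma))   # smaller: follows the correlation
print(mahalanobis([2.0, -2.0], mu, sigma))  # larger: goes against the correlation
```

The two Mahalanobis calls illustrate the point above: the points $(2, 2)$ and $(2, -2)$ have the same Euclidean distance from the mean, but the one that follows the correlation structure of the data is considerably "closer" to the distribution.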
Cosine and Angular Distances
Focus on orientation rather than magnitude.
- Cosine similarity: $\cos(\theta) = \frac{x \cdot y}{||x|| \cdot ||y||}$
- Cosine distance: $d_{cos}(x, y) = 1 - \cos(\theta)$ (not a proper metric, since it can violate the triangle inequality).
- Angular distance: $d_\theta(x, y) = \frac{\theta}{\pi} = \frac{1}{\pi} \arccos\left(\frac{x \cdot y}{||x|| \cdot ||y||}\right)$ (a proper metric, normalized to $[0, 1]$).
Widely used in NLP for word embeddings and document vectors; effective in high-dimensional spaces affected by the "curse of dimensionality".

Context: the choice of metric depends on the topology of the data (e.g., Manhattan for grid-like layouts, Hamming for symbolic data, Mahalanobis for probabilistic data, cosine/angular for orientation-based similarity). The sketch below illustrates the difference between the cosine and angular variants.
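As a closing illustration, here is a small Python sketch (the three vectors are made up for the example) contrasting cosine distance with the normalized angular distance defined above: the same triple of vectors exhibits the triangle-inequality failure of cosine distance, while the angular distance satisfies it.

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def cosine_distance(x, y):
    return 1.0 - cosine_similarity(x, y)

def angular_distance(x, y):
    """theta / pi, normalized to [0, 1]; a proper metric."""
    # Clamp to [-1, 1] to guard against floating-point drift before arccos.
    c = max(-1.0, min(1.0, cosine_similarity(x, y)))
    return math.acos(c) / math.pi

a = (1.0, 0.0, 0.0)
b = (1.0, 1.0, 1.0)
c = (0.0, 1.0, 0.0)

# Cosine distance violates the triangle inequality for this triple ...
print(cosine_distance(a, c) <= cosine_distance(a, b) + cosine_distance(b, c))     # False (1.0 > 0.845)
# ... while the normalized angular distance satisfies it.
print(angular_distance(a, c) <= angular_distance(a, b) + angular_distance(b, c))  # True (0.5 <= 0.608)
```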