1. Introduction to Machine Learning

Definition: Field of study that gives computers the ability to learn without being explicitly programmed.

Learning Types:
- Supervised Learning: Learning from labeled data (e.g., classification, regression).
- Unsupervised Learning: Learning from unlabeled data (e.g., clustering, dimensionality reduction).
- Reinforcement Learning: Learning through trial and error with rewards/penalties.

Applications: Image recognition, natural language processing, recommendation systems, medical diagnosis.

Key Challenges: Data quality, overfitting/underfitting, computational cost, interpretability.

2. Supervised Learning: Regression

2.1 Linear Regression

Model: $y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \epsilon$

Goal: Find $\beta$ values that minimize the cost function.

Cost Function (Mean Squared Error - MSE): $J(\beta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\beta(x^{(i)}) - y^{(i)})^2$

Gradient Descent: Iteratively adjusts $\beta$ to minimize $J(\beta)$:
$\beta_j := \beta_j - \alpha \frac{\partial}{\partial \beta_j} J(\beta)$, where $\alpha$ is the learning rate.

Normal Equation: $\beta = (X^T X)^{-1} X^T y$ (closed-form solution).

2.2 Polynomial Regression

Models non-linear relationships by adding polynomial features: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k$
Can fit complex curves, but prone to overfitting if the degree $k$ is too high.

2.3 Regularization

Techniques to prevent overfitting by penalizing large coefficients.
- Lasso Regression (L1): Adds an $|\beta|$ penalty. Can lead to sparse models (some $\beta_j$ become zero).
  $J(\beta) = \text{MSE}(\beta) + \lambda \sum_{j=1}^{n} |\beta_j|$
- Ridge Regression (L2): Adds a $\beta^2$ penalty. Shrinks coefficients towards zero.
  $J(\beta) = \text{MSE}(\beta) + \lambda \sum_{j=1}^{n} \beta_j^2$
- $\lambda$: regularization parameter.

3. Supervised Learning: Classification

3.1 Logistic Regression

Used for binary classification. Outputs the probability $P(y=1|x)$.
- Sigmoid Function: $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$
- Decision Boundary: $h_\theta(x) \ge 0.5 \implies y=1$, else $y=0$.
- Cost Function (Log Loss/Cross-Entropy): $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)}))]$
- Optimized using Gradient Descent.

3.2 Support Vector Machines (SVM)

Finds the optimal hyperplane that maximally separates classes.
- Margin: Distance between the hyperplane and the closest data points (support vectors).
- Hard Margin: Assumes linearly separable data; no misclassifications allowed.
- Soft Margin: Allows some misclassifications (controlled by the C parameter) for non-linearly separable data.
- Kernel Trick: Maps data to a higher-dimensional space to make it linearly separable.
- Common Kernels: Linear, Polynomial, Radial Basis Function (RBF/Gaussian).

3.3 K-Nearest Neighbors (KNN)

Non-parametric, instance-based learning algorithm.
- Classification: A new data point is classified by a majority vote of its $k$ nearest neighbors.
- Regression: Output is the average of the $k$ nearest neighbors' values.
- Distance Metrics: Euclidean, Manhattan.
- Pros: Simple, no training phase.
- Cons: Computationally expensive for large datasets, sensitive to feature scaling.

4. Unsupervised Learning

4.1 Clustering: K-Means

Partitions $n$ observations into $k$ clusters.

Algorithm (see the sketch after this section):
1. Initialize $k$ centroids randomly.
2. Assign each data point to the closest centroid.
3. Recalculate centroids as the mean of all points in their cluster.
4. Repeat steps 2-3 until convergence (centroids no longer move significantly).

Distance Metric: Typically Euclidean distance.
Elbow Method: Used to determine the optimal $k$ by plotting WCSS (Within-Cluster Sum of Squares) against $k$.
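A minimal NumPy sketch of the algorithm above, assuming the data sits in a 2-D array `X`; the function name, random initialization from the data points, and the tolerance-based stopping test are illustrative choices, not a reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Sketch of K-Means: initialize, assign, update, repeat until centroids settle."""
    rng = np.random.default_rng(seed)
    # 1. Initialize k centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recalculate each centroid as the mean of its assigned points
        #    (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer move significantly.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Illustrative usage: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```

The WCSS used by the elbow method can then be computed as `((X - centroids[labels]) ** 2).sum()` and plotted against several values of $k$.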
4.2 Hierarchical Clustering

Builds a hierarchy of clusters.
- Agglomerative (Bottom-Up): Start with each point as a single cluster. Repeatedly merge the two closest clusters until only one cluster remains.
- Divisive (Top-Down): Start with all points in one cluster. Recursively split clusters until each point is a single cluster.
- Dendrogram: Tree-like diagram showing the hierarchy of clusters.
- Linkage Criteria: Single, Complete, Average, Ward.

4.3 Dimensionality Reduction: PCA

Principal Component Analysis (PCA): Transforms data into a new coordinate system where the axes (principal components) are orthogonal and capture maximum variance.

Steps:
1. Standardize the data.
2. Compute the covariance matrix.
3. Compute eigenvectors and eigenvalues of the covariance matrix.
4. Sort eigenvectors by decreasing eigenvalues.
5. Select $k$ eigenvectors to form a projection matrix.
6. Transform the original data using the projection matrix.

Goal: Reduce the feature space while retaining most of the information.

5. Model Evaluation and Selection

5.1 Metrics for Regression

- Mean Squared Error (MSE): $\frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2$
- Root Mean Squared Error (RMSE): $\sqrt{\text{MSE}}$ (interpretable in the original units)
- Mean Absolute Error (MAE): $\frac{1}{m} \sum_{i=1}^{m} |y^{(i)} - \hat{y}^{(i)}|$
- $R^2$ Score (Coefficient of Determination): $1 - \frac{\sum (y^{(i)} - \hat{y}^{(i)})^2}{\sum (y^{(i)} - \bar{y})^2}$ (proportion of variance explained)

5.2 Metrics for Classification

Confusion Matrix:

                     Predicted Positive     Predicted Negative
  Actual Positive    True Positive (TP)     False Negative (FN)
  Actual Negative    False Positive (FP)    True Negative (TN)

- Accuracy: $\frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$ (overall correctness)
- Precision: $\frac{\text{TP}}{\text{TP} + \text{FP}}$ (proportion of positive identifications that were actually correct)
- Recall (Sensitivity): $\frac{\text{TP}}{\text{TP} + \text{FN}}$ (proportion of actual positives identified correctly)
- F1-Score: $2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ (harmonic mean of precision and recall)
- ROC Curve & AUC: Plot of True Positive Rate vs. False Positive Rate at various threshold settings. AUC (Area Under the Curve) measures overall model performance.

5.3 Model Selection

Bias-Variance Trade-off:
- Bias: Error from erroneous assumptions in the learning algorithm (underfitting).
- Variance: Error from sensitivity to small fluctuations in the training set (overfitting).

Cross-Validation:
- Hold-out: Split data into training and testing sets.
- K-Fold CV: Data split into $k$ folds. Model trained $k$ times, each time using a different fold as the test set.
- Leave-One-Out CV (LOOCV): $k = n$ (the number of data points).

Hyperparameter Tuning (see the sketch below):
- Grid Search: Exhaustively searches through a specified subset of hyperparameters.
- Random Search: Randomly samples hyperparameters from a given distribution.
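A brief sketch of k-fold cross-validation combined with grid search. It assumes scikit-learn is available (the notes above do not prescribe a library); the synthetic dataset, the logistic-regression estimator, the `C` grid, and the F1 scoring choice are all illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic binary-classification data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 5-fold cross-validation: each fold serves once as the test set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Grid search exhaustively tries every value in the grid and scores each
# candidate by its mean F1 across the folds.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=cv, scoring="f1")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Random search follows the same pattern (scikit-learn's `RandomizedSearchCV`) but samples a fixed number of candidates from parameter distributions instead of enumerating the full grid.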
6. Ensemble Methods

Combine multiple learning algorithms to obtain better predictive performance.
- Bagging (Bootstrap Aggregating): Trains multiple models independently on different bootstrap samples of the training data.
- Random Forest: Ensemble of decision trees trained via bagging. Introduces randomness in feature selection.
- Boosting: Sequentially builds models, where each new model attempts to correct the errors of the previous ones.
- AdaBoost (Adaptive Boosting): Weights misclassified samples more heavily in subsequent iterations.
- Gradient Boosting: Builds models by minimizing a loss function using gradient descent.
- XGBoost, LightGBM, CatBoost: Optimized gradient boosting frameworks.
- Stacking: Trains a meta-learner to combine predictions from multiple base models.

7. Decision Trees

Tree-like model where each internal node represents a "test" on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label or a value.
- ID3, C4.5, CART: Common algorithms for building decision trees.
- Splitting Criteria (see the sketch below):
  - Gini Impurity: For classification, measures how often a randomly chosen element would be incorrectly labeled. $G = 1 - \sum_{k=1}^{C} p_k^2$
  - Entropy: Measures impurity or disorder. $H = -\sum_{k=1}^{C} p_k \log_2(p_k)$
  - Information Gain: Reduction in entropy after splitting on an attribute.
- Pruning: Techniques to reduce the size of decision trees to prevent overfitting.
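A small worked sketch of the splitting criteria above using plain NumPy; the label arrays and the single candidate split are made-up examples.

```python
import numpy as np

def gini(labels):
    """Gini impurity: G = 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: H = -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Reduction in entropy after splitting parent into left/right children."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

# Made-up binary labels and one candidate split point.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:5], parent[5:]          # split after the 5th sample
print(gini(parent))                           # 0.5
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # entropy reduction for this split
```

The same weighted-children pattern works with Gini in place of entropy; CART-style trees typically use Gini, while ID3 and C4.5 use entropy-based gain.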