# Machine Learning Q&A Bank

## Cheatsheet Content
### Introduction to Machine Learning

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables systems to learn from data, identify patterns, and make decisions with minimal human intervention. It builds models from sample data (training data) to make predictions or decisions without being explicitly programmed for every task.

**Key Concepts:**

* **Learning from Data:** ML algorithms analyze large datasets to find underlying structures and relationships.
* **Pattern Recognition:** Identifying recurring trends or features in data.
* **Prediction/Decision Making:** Using learned patterns to forecast future outcomes or classify new data.
* **Adaptation:** ML models can improve their performance over time as more data becomes available.

### Types of Machine Learning

Machine Learning can be broadly categorized into three main types based on the nature of the learning signal or feedback available to a learning system:

#### Supervised Learning

* **Definition:** Algorithms learn from labeled data, meaning each training example comes with a correct output or "label." The goal is to learn a mapping function from input to output.
* **Task:** Prediction (regression) or Classification.
* **Examples:** Image classification, spam detection, predicting house prices.
* **Algorithms:** Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Naive Bayes.

#### Unsupervised Learning

* **Definition:** Algorithms learn from unlabeled data, meaning there are no pre-defined output labels. The goal is to discover hidden patterns, structures, or relationships within the data.
* **Task:** Clustering, Dimensionality Reduction, Association.
* **Examples:** Customer segmentation, anomaly detection, topic modeling.
* **Algorithms:** K-Means Clustering, Hierarchical Clustering, Principal Component Analysis (PCA), Apriori.
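The practical contrast between the two paradigms can be sketched in a few lines of plain Python. This is a toy illustration, not a real library: the data and helper functions are hypothetical, showing only that a supervised learner consumes (feature, label) pairs while an unsupervised pass must group the same features without labels.

```python
# Toy illustration (hypothetical data): supervised learning uses labels,
# unsupervised learning must discover structure on its own.

# Labeled 1-D points: feature -> class
labeled = [(1.0, "low"), (1.2, "low"), (8.0, "high"), (8.5, "high")]

# Supervised: learn one centroid per class from (x, y) pairs, then classify
# new points by the nearest class centroid.
def train_supervised(data):
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}  # class -> centroid

def predict(centroids, x):
    return min(centroids, key=lambda y: abs(x - centroids[y]))

# Unsupervised: only the raw x values are available; group them around two
# centroids (a single K-Means-style assignment pass, for brevity).
def group_unsupervised(xs, c0, c1):
    return [0 if abs(x - c0) < abs(x - c1) else 1 for x in xs]

model = train_supervised(labeled)
print(predict(model, 7.9))                                  # -> high
print(group_unsupervised([1.0, 1.2, 8.0, 8.5], 1.0, 8.0))   # -> [0, 0, 1, 1]
```

Note that the unsupervised grouping recovers the same two clusters, but it can only name them 0 and 1 — without labels there is nothing to attach the names "low" and "high" to.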
#### Reinforcement Learning

* **Definition:** An agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. It learns through trial and error, receiving rewards for good actions and penalties for bad ones.
* **Task:** Sequential decision-making, optimal control.
* **Examples:** Game playing (AlphaGo), robotics, autonomous driving.
* **Algorithms:** Q-learning, SARSA, Deep Q-Networks.

### Applications of Machine Learning

Machine Learning has revolutionized various industries and aspects of daily life. Some prominent applications include:

* **Image Recognition:** Facial recognition, object detection, medical image analysis.
* **Natural Language Processing (NLP):** Spam filtering, sentiment analysis, machine translation, chatbots.
* **Recommendation Systems:** Suggesting products (e-commerce), movies (Netflix), music (Spotify) based on user preferences.
* **Healthcare:** Disease diagnosis, drug discovery, personalized medicine.
* **Finance:** Fraud detection, algorithmic trading, credit scoring.
* **Autonomous Vehicles:** Self-driving cars, drone navigation.
* **Speech Recognition:** Voice assistants (Siri, Alexa), transcription services.
* **Cybersecurity:** Intrusion detection, malware analysis.
* **Robotics:** Robot control, path planning.

### Section A — Very Short Answer (1–2 marks)

#### Q1. Define Machine Learning.

**Answer:** Machine Learning is a field of artificial intelligence that allows systems to learn from data, identify patterns, and make decisions or predictions without being explicitly programmed.

#### Q2. What is the main difference between Supervised and Unsupervised Learning?

**Answer:** Supervised learning uses labeled data for training, while unsupervised learning works with unlabeled data to find hidden patterns or structures.

#### Q3. Give two examples of Supervised Learning algorithms.
**Answer:** Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Naive Bayes. (Any two are acceptable.)

#### Q4. Give two examples of Unsupervised Learning algorithms.

**Answer:** K-Means Clustering, Hierarchical Clustering, Principal Component Analysis (PCA), Apriori Algorithm. (Any two are acceptable.)

#### Q5. What is the primary goal of Classification in Machine Learning?

**Answer:** The primary goal of classification is to predict a categorical label or class for new input data based on learned patterns from labeled training data.

#### Q6. What is the primary goal of Regression in Machine Learning?

**Answer:** The primary goal of regression is to predict a continuous numerical value for new input data based on relationships learned from training data.

#### Q7. What is a 'feature' in Machine Learning?

**Answer:** A feature is an individual measurable property or characteristic of a phenomenon being observed. It is an input variable used in a model for prediction.

#### Q8. What is a 'label' in Supervised Learning?

**Answer:** A label is the output or target variable that we are trying to predict or classify in supervised learning. It's the "correct answer" for a given input.

#### Q9. What is the 'hypothesis function' in Linear Regression?

**Answer:** The hypothesis function in Linear Regression is a linear equation, typically $h_\theta(x) = \theta_0 + \theta_1 x$, which the model uses to predict the output based on input features.

#### Q10. What is the 'cost function' in Machine Learning?

**Answer:** A cost function (or loss function) measures the error between the predicted output of a model and the actual output. The goal of training is to minimize this cost function.

#### Q11. What is the 'hyperplane' in Support Vector Machines (SVM)?

**Answer:** In SVM, a hyperplane is a decision boundary that separates data points of different classes in the feature space.
In a 2D space, it's a line; in 3D, it's a plane.

#### Q12. What is the 'margin' in SVM?

**Answer:** The margin in SVM is the distance between the hyperplane and the nearest data point from either class (known as support vectors). SVM aims to maximize this margin.

#### Q13. What is 'Entropy' in the context of Decision Trees?

**Answer:** Entropy is a measure of impurity or disorder in a set of data. In Decision Trees, it quantifies the uncertainty or randomness in a collection of samples.

#### Q14. What is 'Information Gain' in Decision Trees?

**Answer:** Information Gain is the reduction in entropy achieved by splitting a dataset on a particular feature. Decision Trees choose features that provide the highest information gain.

#### Q15. Define 'K' in K-Nearest Neighbors (KNN).

**Answer:** 'K' in KNN refers to the number of nearest data points (neighbors) that are considered to classify a new data point or predict its value. It's a user-defined parameter.

#### Q16. What is the purpose of 'clustering' in Unsupervised Learning?

**Answer:** Clustering aims to group a set of unlabeled data points into subsets (clusters) such that data points within the same cluster are more similar to each other than to those in other clusters.

#### Q17. What is a 'dendrogram' in Hierarchical Clustering?

**Answer:** A dendrogram is a tree-like diagram that records the sequence of merges or splits and the similarity/distance levels at which these operations occur in hierarchical clustering.

#### Q18. What is 'Association Rule Mining'?

**Answer:** Association Rule Mining is a technique used to discover interesting relationships or associations among a large set of data items, often expressed as "if-then" rules.

#### Q19. Define 'Support' in the Apriori Algorithm.

**Answer:** Support for an itemset is the proportion of transactions in the dataset that contain the itemset. It indicates how frequently an itemset appears in the dataset.

#### Q20. Define 'Confidence' in the Apriori Algorithm.
**Answer:** Confidence for a rule $X \rightarrow Y$ is the conditional probability $P(Y|X)$, i.e., the proportion of transactions containing $X$ that also contain $Y$. It indicates the reliability of the rule.

### Section B — Short Answer (5 marks)

#### Q1. Explain the concept of the 'Cost Function' and its role in training a Machine Learning model.

**Answer:** A **cost function**, also known as a loss function, quantifies the error between the predicted output of a machine learning model and the actual target values. Its primary role is to provide a metric that indicates how well the model is performing. During the training process, the algorithm iteratively adjusts the model's parameters (e.g., weights and biases in linear regression) with the aim of **minimizing this cost function**. By reducing the cost, the model learns to make more accurate predictions and to generalize better to unseen data. For instance, in linear regression, the Mean Squared Error (MSE) is a common cost function, penalizing larger prediction errors more heavily.

#### Q2. Briefly explain the working principle of Logistic Regression.

**Answer:** Logistic Regression is a supervised learning algorithm used for **binary classification** tasks. Despite its name, it is a classification algorithm, not a regression one. It works by using a **sigmoid (logistic) function** to map the linear combination of input features to a probability value between 0 and 1. This probability represents the likelihood of the input belonging to a particular class. A threshold (e.g., 0.5) is then applied to classify the input into one of two categories. The model learns the optimal weights and bias by maximizing the likelihood of observing the training data, typically using an optimization algorithm like gradient descent.

#### Q3. Describe the main idea behind the K-Nearest Neighbors (KNN) algorithm.
**Answer:** K-Nearest Neighbors (KNN) is a simple, non-parametric, lazy learning algorithm used for both classification and regression. The core idea is that **similar things are close to each other**. To classify a new data point, KNN looks at its 'K' nearest neighbors (data points) in the feature space. The class of the new point is then determined by a **majority vote** of its K neighbors (for classification) or by the average of their values (for regression). The 'distance' between points is typically calculated using metrics like Euclidean distance. KNN is 'lazy' because it doesn't build a model during training; it simply stores the training data and performs computations only at prediction time.

#### Q4. Explain the concept of 'Support Vectors' and their importance in SVM.

**Answer:** In Support Vector Machines (SVM), **Support Vectors** are the data points that lie closest to the decision boundary (hyperplane) and are crucial for defining the hyperplane itself. These are the points that are most difficult to classify. The SVM algorithm's objective is to find a hyperplane that maximizes the **margin** (the distance between the hyperplane and the nearest data point from either class). Only the support vectors influence the position and orientation of this optimal hyperplane. All other data points, which lie further from the margin, have no impact on the decision boundary. This makes SVM robust and less sensitive to outliers, as only a subset of the training data (the support vectors) determines the model.

#### Q5. How does a Decision Tree make predictions? Explain with an example.

**Answer:** A Decision Tree makes predictions by creating a tree-like model of decisions based on the features of the input data. It starts at the **root node** and traverses down the tree by evaluating conditions (splits) on specific features at each internal node.
Each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label (for classification) or a numerical value (for regression). For example, to predict if an email is spam, the tree might first ask "Does it contain the word 'free'?" If yes, it moves to a branch that might ask "Is the sender unknown?" It continues until it reaches a leaf node, which provides the final prediction (e.g., 'Spam' or 'Not Spam').

#### Q6. What is the role of 'prior probability' and 'likelihood' in the Naive Bayes Classifier?

**Answer:** In the Naive Bayes Classifier, **prior probability** ($P(C)$) refers to the initial probability of a class occurring, regardless of any features. It is simply the frequency of that class in the training dataset. For example, if 70% of emails are spam, $P(\text{Spam}) = 0.7$. **Likelihood** ($P(X|C)$) refers to the probability of observing a particular feature ($X$) given that the class ($C$) is true. For example, $P(\text{'free' | Spam})$ is the probability of the word 'free' appearing in a spam email. Naive Bayes combines these using Bayes' theorem ($P(C|X) \propto P(X|C) \cdot P(C)$) to calculate the posterior probability, which is the probability of a class given the observed features.

#### Q7. Explain the concept of 'Grouping unlabeled data' in Unsupervised Learning.

**Answer:** Grouping unlabeled data is the fundamental task of **unsupervised learning**. Unlike supervised learning, where data comes with predefined labels, unsupervised learning algorithms are presented with data that has no target output. The goal is to discover inherent structures, patterns, or relationships within this raw data. **Clustering** is a prime example of grouping unlabeled data, where algorithms partition data points into distinct groups (clusters) based on their similarity. Data points within a cluster are more similar to each other than to those in other clusters.
This grouping can reveal hidden segments, identify anomalies, or simplify complex datasets, providing valuable insights without prior knowledge of categories.

#### Q8. Describe the iterative process of K-Means Clustering.

**Answer:** K-Means Clustering is an iterative algorithm that aims to partition 'n' data points into 'K' clusters. The process typically involves these steps:

1. **Initialization:** Randomly select K data points from the dataset to serve as initial cluster centroids.
2. **Assignment Step:** Each data point is assigned to the nearest centroid. The distance is usually calculated using Euclidean distance.
3. **Update Step:** The centroids are re-calculated as the mean (average) of all data points assigned to that cluster.
4. **Convergence:** Steps 2 and 3 are repeated until the cluster assignments no longer change, the centroids stabilize (i.e., their positions do not significantly shift between iterations), or a maximum number of iterations is reached.

This iterative refinement minimizes the within-cluster sum of squares.

#### Q9. What is a 'dendrogram' and how is it used in Hierarchical Clustering?

**Answer:** A **dendrogram** is a tree-like diagram that visually represents the hierarchical relationships between clusters in hierarchical clustering. Each leaf node in the dendrogram represents a single data point, and as you move up the tree, data points or clusters are merged (agglomerative) or split (divisive). The height of the merge (or split) point on the dendrogram indicates the distance or dissimilarity between the merged clusters. By cutting the dendrogram horizontally at a certain distance threshold, one can obtain the desired number of clusters. It provides a comprehensive view of the clustering structure at different levels of granularity, allowing for flexible cluster formation.

#### Q10. Explain the core idea behind the Apriori Algorithm for Association Rule Mining.
**Answer:** The Apriori Algorithm is a classic algorithm for finding **frequent itemsets** and deriving **association rules** from transactional databases (e.g., market basket analysis). Its core idea is the **Apriori principle**: if an itemset is frequent, then all of its subsets must also be frequent; conversely, if an itemset is infrequent, then all of its supersets must also be infrequent. This principle allows the algorithm to efficiently prune the search space. It iteratively generates candidate itemsets of increasing size and uses the minimum support threshold to filter out infrequent ones, drastically reducing the number of itemsets to evaluate. Once frequent itemsets are found, association rules are generated from them based on a minimum confidence threshold.

#### Q11. Differentiate between Supervised and Unsupervised Learning with respect to their applications.

**Answer:**

| Feature | Supervised Learning | Unsupervised Learning |
| :--- | :--- | :--- |
| **Data Type** | Labeled data (input-output pairs) | Unlabeled data (only input features) |
| **Goal** | Predict output (classification/regression) | Find hidden patterns/structures (clustering/association) |
| **Feedback** | Direct feedback (correct labels) | No direct feedback |
| **Common Tasks** | Spam detection, image classification, stock price prediction, medical diagnosis | Customer segmentation, anomaly detection, market basket analysis, dimensionality reduction |
| **Key Algorithms** | Linear Regression, Logistic Regression, SVM, Decision Trees, Naive Bayes | K-Means, Hierarchical Clustering, Apriori, PCA |
| **Output** | Specific predictions or categories | Groups, clusters, or underlying relationships |

#### Q12. What are the advantages of using a Random Forest algorithm over a single Decision Tree?
**Answer:** Random Forest is an ensemble learning method that builds multiple decision trees and combines their outputs, offering several advantages over a single Decision Tree:

1. **Reduced Overfitting:** Single decision trees are prone to overfitting, especially deep ones. Random Forests mitigate this by averaging the predictions of many trees, each trained on a random subset of data and features.
2. **Improved Accuracy:** The aggregation of multiple diverse trees generally leads to higher predictive accuracy than any individual tree.
3. **Handles Missing Values:** It can handle missing values without explicit imputation.
4. **Feature Importance:** It provides a good indication of feature importance, showing which features contribute most to the predictions.
5. **Robustness to Noise:** It is more robust to noise and outliers in the data.
6. **Reduced Variance:** The randomness introduced in building individual trees (bagging and feature randomness) helps reduce the model's variance.

### Section C — Long Answer (10 marks)

#### Q1. Elaborate on Linear Regression, including its mathematical derivation, cost function, and how it is optimized.

**Answer:** **Linear Regression** is a fundamental supervised learning algorithm used for predicting a continuous target variable based on one or more independent predictor variables. It assumes a linear relationship between the input features ($X$) and the output variable ($Y$).

**Mathematical Model:**

For a single feature, the linear regression model is represented as:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

where:

* $h_\theta(x)$ is the predicted output (hypothesis).
* $x$ is the input feature.
* $\theta_0$ is the y-intercept (bias term).
* $\theta_1$ is the slope of the line (weight for the feature).

For multiple features, the model extends to:

$$h_\theta(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

In vector form, by adding a dummy feature $x_0 = 1$ to $\mathbf{x}$:

$$h_\theta(\mathbf{x}) = \mathbf{\theta}^T \mathbf{x}$$

where $\mathbf{\theta}$ is the vector of parameters ($\theta_0, \theta_1, ..., \theta_n$) and $\mathbf{x}$ is the vector of features ($1, x_1, ..., x_n$).

**Cost Function (Mean Squared Error - MSE):**

To find the best-fitting line, we need a metric that quantifies the difference between the predicted values and the actual values. This is the **cost function**. For linear regression, the most common cost function is the Mean Squared Error (MSE), which calculates the average of the squared differences between predicted and actual values. Squaring the errors ensures that positive and negative errors don't cancel out and penalizes larger errors more heavily. Given a training set of $m$ examples $(x^{(i)}, y^{(i)})$, the cost function $J(\theta)$ is defined as:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

The $\frac{1}{2}$ factor is for mathematical convenience during differentiation, as it cancels the 2 from the squared term.

**Optimization (Gradient Descent):**

The goal of training is to find the values of $\theta_0, \theta_1, ..., \theta_n$ that minimize the cost function $J(\theta)$. **Gradient Descent** is an iterative optimization algorithm used for this purpose.

**Working Principle of Gradient Descent:**

1. **Initialization:** Start with some initial values for $\theta_0, \theta_1, ...$ (often zeros or random values).
2. **Iteration:** Repeatedly update the parameters $\theta_j$ by moving in the direction opposite to the gradient of the cost function. The gradient points towards the steepest ascent, so moving in the opposite direction leads to the steepest descent towards the minimum.
The update rule for each parameter $\theta_j$ is:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

where:

* $\alpha$ is the **learning rate**, a positive constant that determines the size of the step taken in each iteration. A small $\alpha$ leads to slow convergence, while a large $\alpha$ might overshoot the minimum.
* $\frac{\partial}{\partial \theta_j} J(\theta)$ is the partial derivative of the cost function with respect to $\theta_j$.

**Derivation of Gradient for Linear Regression:**

For $j=0$ (i.e., for $\theta_0$):

$$\frac{\partial}{\partial \theta_0} J(\theta) = \frac{\partial}{\partial \theta_0} \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

$$= \frac{1}{2m} \sum_{i=1}^{m} 2(h_\theta(x^{(i)}) - y^{(i)}) \frac{\partial}{\partial \theta_0} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})$$

$$= \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot 1$$

For $j=1$ (i.e., for $\theta_1$):

$$\frac{\partial}{\partial \theta_1} J(\theta) = \frac{\partial}{\partial \theta_1} \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$$

$$= \frac{1}{2m} \sum_{i=1}^{m} 2(h_\theta(x^{(i)}) - y^{(i)}) \frac{\partial}{\partial \theta_1} (\theta_0 + \theta_1 x^{(i)} - y^{(i)})$$

$$= \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$$

Combining these, the update rules for Gradient Descent are (repeat until convergence):

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}$$

(These updates are performed simultaneously for all $\theta_j$.)

**Diagram:** A typical gradient descent visualization shows a cost function surface (e.g., a bowl shape for convex functions). The algorithm starts at an arbitrary point and takes steps down the steepest slope until it reaches the global minimum, where the cost function is minimized.
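The update rules derived above translate directly into code. The following is a minimal sketch in plain Python (hypothetical data generated from $y = 2x + 1$, so the parameters should converge near $\theta_0 = 1$, $\theta_1 = 2$):

```python
# Minimal sketch of the gradient-descent updates for single-feature
# linear regression, following the derivation above.

def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # initialization (zeros)
    for _ in range(iterations):
        # h(x) - y for every training example
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                              # (1/m) * sum(h-y)
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m   # (1/m) * sum((h-y)*x)
        # simultaneous update of both parameters
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Hypothetical data sampled from y = 2x + 1 (no noise):
t0, t1 = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
print(round(t0, 2), round(t1, 2))  # -> 1.0 2.0
```

Note that both gradients are computed from the same `errors` before either parameter changes, which implements the "simultaneous update" requirement; updating `theta0` first and then recomputing errors for `theta1` would be a subtly different algorithm.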
*Figure: Gradient Descent finding the minimum of a cost function.*

By iteratively adjusting the parameters $\theta$ based on the gradient of the cost function, Linear Regression finds the optimal line that best fits the training data, allowing it to make accurate predictions on new, unseen data.

#### Q2. Explain the Naive Bayes Classifier in detail, including its underlying assumption and how it is used for classification.

**Answer:** The **Naive Bayes Classifier** is a probabilistic supervised learning algorithm based on **Bayes' Theorem** with a strong (naive) assumption of independence among features. It is widely used for classification tasks, especially in text classification (e.g., spam detection), due to its simplicity and efficiency.

**Bayes' Theorem:**

At its core, Naive Bayes uses Bayes' Theorem, which states the conditional probability of an event:

$$P(A|B) = \frac{P(B|A) P(A)}{P(B)}$$

In the context of classification, we want to find the probability of a class ($C$) given some observed features ($\mathbf{X} = x_1, x_2, ..., x_n$):

$$P(C|\mathbf{X}) = \frac{P(\mathbf{X}|C) P(C)}{P(\mathbf{X})}$$

Where:

* $P(C|\mathbf{X})$: **Posterior Probability** - the probability of class $C$ given the features $\mathbf{X}$. This is what we want to calculate to make a prediction.
* $P(\mathbf{X}|C)$: **Likelihood** - the probability of observing features $\mathbf{X}$ given that the class is $C$.
* $P(C)$: **Prior Probability** - the probability of class $C$ occurring independently of the features.
* $P(\mathbf{X})$: **Evidence** - the probability of observing the features $\mathbf{X}$ (a normalizing constant, often ignored since it is the same for all classes).

**The "Naive" Assumption:**

The "naive" part comes from the assumption that all features are **conditionally independent** of each other given the class. This means that the presence or absence of one feature does not affect the presence or absence of any other feature, given the class.
So, $P(\mathbf{X}|C) = P(x_1, x_2, ..., x_n|C)$ can be simplified as:

$$P(x_1, x_2, ..., x_n|C) = P(x_1|C) \cdot P(x_2|C) \cdot ... \cdot P(x_n|C)$$

This assumption, while often not true in real-world data, greatly simplifies the computation and often leads to surprisingly good results.

**Classification Process:**

To classify a new data point with features $\mathbf{X}_{new} = (x_1, x_2, ..., x_n)$, the Naive Bayes Classifier calculates the posterior probability for each possible class $C_k$:

$$P(C_k|\mathbf{X}_{new}) \propto P(C_k) \prod_{j=1}^{n} P(x_j|C_k)$$

The classifier then assigns the data point to the class that has the highest posterior probability:

$$C_{predict} = \arg\max_{C_k} \left[ P(C_k) \prod_{j=1}^{n} P(x_j|C_k) \right]$$

**Steps for Training and Prediction:**

1. **Calculate Prior Probabilities ($P(C_k)$):** For each class, count its occurrences in the training data and divide by the total number of training examples.
   $$P(C_k) = \frac{\text{Number of training examples in class } C_k}{\text{Total number of training examples}}$$
2. **Calculate Likelihoods ($P(x_j|C_k)$):** For each feature $x_j$ and each class $C_k$, count how often $x_j$ appears when the class is $C_k$, and divide by the total count of class $C_k$.
   * **For categorical features:**
     $$P(x_j=\text{value}|C_k) = \frac{\text{Count of } x_j=\text{value} \text{ in class } C_k}{\text{Count of class } C_k}$$
   * **For continuous features (e.g., Gaussian Naive Bayes):** Assume features follow a Gaussian distribution and calculate the mean ($\mu$) and standard deviation ($\sigma$) for each feature within each class, then use the probability density function (PDF) of the Gaussian distribution:
     $$P(x_j|C_k) = \frac{1}{\sqrt{2\pi\sigma_{jk}^2}} \exp\left(-\frac{(x_j - \mu_{jk})^2}{2\sigma_{jk}^2}\right)$$
3. **Prediction:** For a new, unseen data point $\mathbf{X}_{new}$:
   * For each class $C_k$, calculate $P(C_k) \prod_{j=1}^{n} P(x_j|C_k)$.
   * The class with the highest resulting value is the predicted class.

**Example (Spam/Non-Spam Email Classification):**

Suppose we want to classify an email as spam or not spam based on words like 'free' and 'money'.

* **Training:** We count occurrences of 'free' and 'money' in spam and non-spam emails.
  * $P(\text{Spam})$ = (Number of Spam emails) / (Total emails)
  * $P(\text{Non-Spam})$ = (Number of Non-Spam emails) / (Total emails)
  * $P(\text{'free' | Spam})$ = (Number of Spam emails with 'free') / (Number of Spam emails)
  * $P(\text{'money' | Spam})$ = (Number of Spam emails with 'money') / (Number of Spam emails)
  * Similarly for $P(\text{'free' | Non-Spam})$ and $P(\text{'money' | Non-Spam})$.
* **Prediction:** For a new email containing 'free' and 'money':
  * Calculate $P(\text{Spam | 'free', 'money'}) \propto P(\text{Spam}) \cdot P(\text{'free' | Spam}) \cdot P(\text{'money' | Spam})$
  * Calculate $P(\text{Non-Spam | 'free', 'money'}) \propto P(\text{Non-Spam}) \cdot P(\text{'free' | Non-Spam}) \cdot P(\text{'money' | Non-Spam})$
  * The email is classified as the class with the higher posterior probability.

**Laplace Smoothing (for zero probabilities):**

A common issue arises when a feature value seen in the test data was not present in the training data for a particular class. This would lead to a likelihood of zero, making the entire posterior probability zero for that class. Laplace smoothing (or add-1 smoothing) addresses this by adding a small constant (typically 1) to all counts, preventing zero probabilities.

**Advantages:**

* Simple and fast to implement.
* Performs well with large datasets.
* Handles high-dimensional data effectively.
* Requires a small amount of training data to estimate the necessary parameters.

**Disadvantages:**

* The strong independence assumption often doesn't hold true in real-world data, which can affect performance.
* Can be a poor estimator of probabilities (though often good for classification).
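The spam example can be sketched end to end in plain Python. The training corpus and vocabulary below are hypothetical (made up for illustration), and each feature is treated as "word present in email" with add-1 smoothing:

```python
# Sketch of Naive Bayes with Laplace (add-1) smoothing on the spam example.
# Hypothetical training corpus: (set of words in the email, label).
emails = [
    ({"free", "money"}, "spam"), ({"free"}, "spam"), ({"money"}, "spam"),
    ({"meeting"}, "ham"), ({"meeting", "money"}, "ham"),
]
labels = [y for _, y in emails]
classes = set(labels)

def posterior_score(words, c):
    n_c = labels.count(c)
    score = n_c / len(emails)                  # prior P(C)
    for w in words:
        count = sum(1 for ws, y in emails if y == c and w in ws)
        # Add-1 smoothing: a word never seen with class c still gets a
        # small nonzero likelihood instead of zeroing out the product.
        score *= (count + 1) / (n_c + 2)       # smoothed P(w|C)
    return score

def classify(words):
    return max(classes, key=lambda c: posterior_score(words, c))

print(classify({"free", "money"}))  # -> spam
print(classify({"meeting"}))        # -> ham
```

For instance, for `{"free", "money"}` the spam score is $0.6 \cdot 0.6 \cdot 0.6 = 0.216$ versus $0.4 \cdot 0.25 \cdot 0.5 = 0.05$ for ham, so the email is classified as spam. A production implementation would sum log-probabilities instead of multiplying, to avoid floating-point underflow on long documents.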
**Diagram:** A conceptual diagram would show features (e.g., circles) being passed into Bayes' theorem to calculate probabilities for different classes.

*Figure: Conceptual diagram of Naive Bayes Classifier.*

In summary, Naive Bayes is a powerful and efficient classification algorithm that leverages probability theory and a simplifying independence assumption to make predictions based on observed features.

#### Q3. Discuss Decision Trees, explaining how they are built using Entropy and Information Gain. Include a diagram.

**Answer:** **Decision Trees** are non-parametric supervised learning algorithms that can be used for both classification and regression tasks. They model decisions as a tree-like structure, where internal nodes represent tests on attribute values, branches represent the outcomes of those tests, and leaf nodes represent class labels (for classification) or numerical values (for regression).

**Structure of a Decision Tree:**

* **Root Node:** The topmost node, representing the entire dataset, which is then split into two or more homogeneous sets.
* **Internal Node:** Represents a test on an attribute (feature). Each branch stemming from an internal node corresponds to a possible outcome of the test.
* **Leaf Node (Terminal Node):** Represents a class label (decision) or a numerical prediction. It does not split further.
* **Splitting:** The process of partitioning data into two or more subsets based on an attribute test.
* **Pruning:** The process of removing sub-nodes of a decision node to prevent overfitting.

**How Decision Trees are Built (ID3 Algorithm - using Entropy and Information Gain):**

The core idea behind building a decision tree is to select the "best" attribute at each step to split the data, such that the resulting subsets are as "pure" as possible. **Entropy** and **Information Gain** are the key metrics used to achieve this.

1. **Entropy:**
   * **Definition:** Entropy is a measure of impurity, disorder, or uncertainty in a set of data.
In the context of decision trees, it quantifies the randomness or mixed-upness of class labels within a dataset.
   * **Formula:** For a dataset $S$ with $c$ classes, the entropy is calculated as:
     $$Entropy(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)$$
     where $p_i$ is the proportion of examples in $S$ that belong to class $i$.
   * **Interpretation:**
     * If all examples in $S$ belong to the same class (pure set), $Entropy(S) = 0$.
     * If examples are equally divided among classes (maximum impurity), $Entropy(S) = 1$ (for binary classification).
2. **Information Gain:**
   * **Definition:** Information Gain measures the reduction in entropy (or increase in purity) achieved by splitting a dataset on a particular attribute. It quantifies how much "information" a feature provides about the class label.
   * **Formula:** For a dataset $S$ and an attribute $A$, the Information Gain is calculated as:
     $$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$
     where:
     * $Values(A)$ is the set of all possible values for attribute $A$.
     * $S_v$ is the subset of $S$ for which attribute $A$ has value $v$.
     * $|S_v|$ is the number of examples in $S_v$.
     * $|S|$ is the total number of examples in $S$.
   * **Interpretation:** The attribute that results in the highest Information Gain is chosen as the splitting criterion at a node, because it best separates the data into more homogeneous subsets.

**Decision Tree Building Algorithm (ID3):**

1. **Start:** The entire training dataset is considered the root node.
2. **Calculate Entropy:** Compute the entropy of the current dataset (node) with respect to the target class.
3. **Calculate Information Gain:** For each attribute, calculate the information gain from splitting the dataset on that attribute.
4. **Select Best Attribute:** Choose the attribute that yields the highest information gain. This attribute becomes the splitting criterion for the current node.
5. **Split Data:** Divide the dataset into subsets based on the values of the chosen attribute.
6. **Recursive Call:** Recursively repeat steps 2-5 for each subset (child node) until:
   * All examples in a subset belong to the same class (entropy is zero; it becomes a leaf node).
   * There are no remaining attributes to split on.
   * The number of examples in a subset falls below a predefined threshold.
7. **Assign Leaf Node:** When a stopping condition is met, the node becomes a leaf node, and its class label is assigned based on the majority class of the examples within that subset.

**Example Walkthrough:**

Imagine a dataset predicting if a person will play tennis based on 'Outlook', 'Temperature', 'Humidity', and 'Windy'.

* **Step 1: Calculate the initial Entropy of 'Play Tennis' (target class).**
  * Suppose there are 9 'Yes' and 5 'No' in a total of 14 examples.
  * $P(\text{Yes}) = 9/14$, $P(\text{No}) = 5/14$.
  * $Entropy(S) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) \approx 0.940$
* **Step 2: Calculate the Information Gain for each attribute.**
  * **For 'Outlook'** (values: 'Sunny', 'Overcast', 'Rain'):
    * Calculate $Entropy(S_{Sunny})$, $Entropy(S_{Overcast})$, $Entropy(S_{Rain})$ for the 'Play Tennis' class within those subsets.
    * $Gain(S, \text{Outlook}) = Entropy(S) - [(|S_{Sunny}|/|S|)Entropy(S_{Sunny}) + (|S_{Overcast}|/|S|)Entropy(S_{Overcast}) + (|S_{Rain}|/|S|)Entropy(S_{Rain})]$
  * Repeat for 'Temperature', 'Humidity', and 'Windy'.
* **Step 3: Select the attribute with the highest Information Gain.** This becomes the root node. Suppose 'Outlook' has the highest gain.
* **Step 4: Split the dataset** based on 'Outlook' values. Now, for each branch (e.g., 'Sunny'), repeat the process to find the next best split, and so on, until leaf nodes are formed.

**Diagram of a Decision Tree:**

*Figure: A simple Decision Tree for predicting 'Will Play Tennis'.*

Decision trees are intuitive and easy to interpret. However, they can be prone to overfitting, especially with complex trees.
Techniques like pruning and ensemble methods (e.g., Random Forests) are used to address this. #### Q4. Explain the Support Vector Machine (SVM) algorithm, emphasizing the concepts of hyperplane, margin, and support vectors. Include a diagram. **Answer:** **Support Vector Machine (SVM)** is a powerful supervised learning algorithm primarily used for **classification** (though it can also be used for regression). Its main goal is to find an optimal decision boundary, called a **hyperplane**, that best separates different classes in the feature space. **1. Hyperplane:** * **Definition:** In a 2-dimensional space, a hyperplane is simply a line; in a 3-dimensional space, it is a plane. In general, for data in an $n$-dimensional feature space, a hyperplane is an $(n-1)$-dimensional subspace that divides the space into two half-spaces. * **Role:** The hyperplane acts as the decision boundary. All data points falling on one side of the hyperplane are classified into one class, and points on the other side are classified into another class. * **Equation:** For a linear SVM, the equation of the hyperplane is typically given by: $$\mathbf{w}^T \mathbf{x} + b = 0$$ where: * $\mathbf{w}$ is the weight vector (normal to the hyperplane). * $\mathbf{x}$ is the input feature vector. * $b$ is the bias term (intercept). **2. Margin:** * **Definition:** The margin in SVM is the distance between the separating hyperplane and the closest data points from either class. These closest data points are called **support vectors**. * **Goal of SVM:** Unlike other linear classifiers that might find any separating line, SVM aims to find the hyperplane that maximizes this margin. A larger margin implies better generalization capability and robustness to unseen data, as it provides a "cushion" between the classes.
* **Mathematical Representation:** The two parallel hyperplanes defining the margin are: $$\mathbf{w}^T \mathbf{x} + b = 1$$ $$\mathbf{w}^T \mathbf{x} + b = -1$$ The distance between these two hyperplanes is $\frac{2}{||\mathbf{w}||}$, and SVM's objective is to maximize this quantity, which is equivalent to minimizing $||\mathbf{w}||^2$. **3. Support Vectors:** * **Definition:** Support vectors are the data points from the training set that lie on the margin-defining hyperplanes (i.e., closest to the decision boundary). They are the critical elements that determine the position and orientation of the optimal hyperplane. * **Importance:** Only the support vectors are relevant in defining the decision boundary. If all other non-support vector data points were removed from the training set, the optimal hyperplane would remain unchanged. This makes the learned model compact and memory efficient, since only a subset of the training points determines the decision boundary. **Working Principle and Optimization:** The SVM algorithm works by formulating an optimization problem: * **Objective:** Minimize $||\mathbf{w}||^2$ (or $\frac{1}{2}||\mathbf{w}||^2$) * **Constraints:** For all training points $(\mathbf{x}_i, y_i)$: * If $y_i = 1$, then $\mathbf{w}^T \mathbf{x}_i + b \ge 1$ * If $y_i = -1$, then $\mathbf{w}^T \mathbf{x}_i + b \le -1$ These constraints can be combined into a single inequality: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1$. This is a convex optimization problem that can be solved using quadratic programming techniques to find the optimal $\mathbf{w}$ and $b$. **Dealing with Non-linearly Separable Data (Soft Margin SVM and Kernel Trick):** * **Soft Margin SVM:** In real-world scenarios, data is often not perfectly linearly separable (some overlap between classes). Soft Margin SVM introduces a **slack variable** ($\xi_i$) for each data point, allowing some misclassifications or points to fall within the margin.
The objective then becomes to minimize $||\mathbf{w}||^2 + C \sum \xi_i$, where $C$ is a regularization parameter that balances maximizing the margin and minimizing misclassifications. * **Kernel Trick:** For highly non-linear data, SVM uses the **Kernel Trick**. This involves mapping the original input features into a higher-dimensional feature space where the data might become linearly separable. Instead of explicitly calculating the coordinates in this higher-dimensional space, a **kernel function** (e.g., Polynomial, Radial Basis Function (RBF) / Gaussian) calculates the dot product between the mapped feature vectors. This allows SVM to find non-linear decision boundaries in the original feature space efficiently. **Diagram:** *Figure: SVM with optimal hyperplane, margin, and support vectors.* In the diagram: * The solid line is the **optimal hyperplane**. * The dashed lines represent the **margin boundaries**. * The circled data points are the **support vectors**. **Advantages:** * Effective in high-dimensional spaces. * Effective in cases where the number of dimensions is greater than the number of samples. * Uses a subset of training points (support vectors) in the decision function, making it memory efficient. * Versatile with different kernel functions for various decision boundary shapes. **Disadvantages:** * Can be slow to train on large datasets. * Choosing an appropriate kernel function and regularization parameter ($C$) can be challenging. * Less direct probability estimates compared to algorithms like Naive Bayes. In essence, SVM is a powerful and elegant algorithm that seeks to find the best possible separating boundary by maximizing the margin, making it a robust choice for many classification problems. #### Q5. Explain the K-Means Clustering algorithm. Describe its steps and discuss its advantages and disadvantages. 
**Answer:** **K-Means Clustering** is a popular and straightforward unsupervised learning algorithm used for partitioning a dataset into a predefined number of distinct, non-overlapping clusters. The goal is to group data points such that points within the same cluster are more similar to each other than to those in other clusters. The "K" in K-Means refers to the number of clusters we want to form, which must be specified beforehand. **Algorithm Steps (Iterative Process):** 1. **Initialization:** * **Choose K:** Decide on the number of clusters, $K$. This is often determined by domain knowledge or methods like the Elbow Method. * **Random Centroid Placement:** Randomly select $K$ data points from the dataset to act as the initial cluster **centroids** (or means). These centroids represent the center of each cluster. 2. **Assignment Step (E-step - Expectation):** * For each data point in the dataset, calculate its distance to all $K$ centroids. * Assign each data point to the cluster whose centroid is the closest (i.e., minimizes the distance). Common distance metrics include Euclidean distance. * This step effectively partitions the dataset into $K$ preliminary clusters. 3. **Update Step (M-step - Maximization):** * After all data points have been assigned, recalculate the position of each cluster centroid. The new centroid for each cluster is the **mean** (average) of all data points currently assigned to that cluster. * This moves the centroids to the true center of their respective assigned data points. 4. **Convergence Check:** * Repeat steps 2 and 3 iteratively until a stopping criterion is met. Common stopping criteria include: * Cluster assignments no longer change (data points remain in the same clusters). * The positions of the centroids no longer change significantly (they stabilize). * A maximum number of iterations has been reached. * The decrease in the within-cluster sum of squares (WCSS) falls below a threshold. 
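The iterative procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it initializes centroids deterministically from the first $K$ points for reproducibility (real implementations typically use random restarts or k-means++ seeding), and it does not handle the empty-cluster edge case.

```python
import numpy as np

def kmeans(X, k, max_iters=100):
    # 1. Initialization: take the first k points as centroids (deterministic
    #    for illustration; real systems use random or k-means++ initialization).
    centroids = X[:k].copy()
    for _ in range(max_iters):
        # 2. Assignment step: Euclidean distance of every point to every
        #    centroid, then assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update step: move each centroid to the mean of its assigned points
        #    (assumes no cluster ends up empty, which holds for this example).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Convergence check: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of points; K=2 should recover them.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
```

Even though the initial centroids both start near the first group, the assignment and update steps pull one centroid toward each group within a couple of iterations, illustrating why the algorithm converges to a (local) optimum of the WCSS objective.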
**Objective Function:** K-Means aims to minimize the **Within-Cluster Sum of Squares (WCSS)**, also known as inertia. WCSS is the sum of the squared distances between each data point and its assigned cluster centroid. $$WCSS = \sum_{j=1}^{K} \sum_{\mathbf{x} \in \text{Cluster}_j} ||\mathbf{x} - \mathbf{c}_j||^2$$ where $\mathbf{c}_j$ is the centroid of cluster $j$. **Diagram:** A visual representation of K-Means iterations: *Figure: Animation showing the iterative process of K-Means clustering.* (1) Initial random centroids. (2) Points assigned to nearest centroids. (3) Centroids moved to mean of assigned points. (4) Repeat until convergence. **Advantages of K-Means Clustering:** 1. **Simplicity:** Relatively easy to understand and implement. 2. **Efficiency:** Computationally efficient for large datasets, especially when clusters are globular and well-separated. It scales linearly with the number of data points. 3. **Scalability:** Can handle a large number of data points and features. 4. **Versatility:** Applicable to various domains, including customer segmentation, image compression, document clustering. 5. **Guaranteed Convergence:** The algorithm is guaranteed to converge to a local optimum. **Disadvantages of K-Means Clustering:** 1. **Requires Pre-defining K:** The number of clusters $K$ must be specified in advance, which is often difficult without prior knowledge. 2. **Sensitive to Initial Centroids:** The final clustering result can be highly dependent on the initial random placement of centroids. Different initializations can lead to different local optima. (To mitigate this, K-Means is often run multiple times with different initializations). 3. **Sensitive to Outliers:** Outliers can significantly affect cluster centroids, pulling them away from the true cluster center and distorting the clustering. 4. **Assumes Spherical Clusters:** K-Means works best with clusters that are roughly spherical and have similar sizes and densities. 
It struggles with clusters of arbitrary shapes (e.g., elongated, intertwined). 5. **Not Suitable for Non-linear Boundaries:** It cannot effectively identify clusters with complex, non-linear boundaries. 6. **Requires Numerical Data:** Typically works with numerical data and struggles with categorical features without proper encoding. Despite its limitations, K-Means remains a powerful and widely used clustering algorithm due to its simplicity and efficiency, especially as a first-pass clustering method. #### Q6. Describe Hierarchical Clustering, explaining both agglomerative and divisive approaches. How is a dendrogram used to interpret the results? **Answer:** **Hierarchical Clustering** is an unsupervised learning method that builds a hierarchy of clusters. It does not require specifying the number of clusters beforehand. The results are typically visualized as a **dendrogram**, which is a tree-like diagram. There are two main types: **1. Agglomerative Hierarchical Clustering (Bottom-Up):** * This is the more common approach. It starts with each data point as its own individual cluster. * It then iteratively merges the closest pairs of clusters until all data points are in a single, large cluster (or a stopping condition is met). **Steps for Agglomerative Clustering:** 1. **Initialization:** Each data point is considered a single cluster. If there are 'N' data points, there are 'N' clusters. 2. **Calculate Proximity Matrix:** Compute the similarity or distance between all pairs of clusters. Common distance metrics include Euclidean, Manhattan, and Cosine distance. 3. **Merge Closest Clusters:** Identify the two closest clusters and merge them into a new, larger cluster. 4. **Update Proximity Matrix:** Recompute the distances from the new cluster to all other clusters. 
This step requires a **linkage criterion** to define the distance between two clusters: * **Single Linkage (Min):** Distance between two clusters is the minimum distance between any two points in the different clusters. (Prone to "chaining") * **Complete Linkage (Max):** Distance between two clusters is the maximum distance between any two points in the different clusters. (Tends to produce compact, spherical clusters) * **Average Linkage:** Distance between two clusters is the average distance between all pairs of points in the different clusters. * **Ward's Method:** Minimizes the total within-cluster variance. It merges clusters that result in the smallest increase in the sum of squared distances within clusters. 5. **Repeat:** Steps 3 and 4 are repeated until only one cluster (containing all data points) remains. **2. Divisive Hierarchical Clustering (Top-Down):** * This approach starts with all data points in a single, large cluster. * It then iteratively splits the largest or most diverse cluster into smaller clusters until each data point is in its own individual cluster (or a stopping condition is met). * Divisive clustering is computationally more complex than agglomerative clustering, as it requires a method to decide which split is optimal at each step. **Dendrogram and its Interpretation:** A **dendrogram** is a graphical representation of the hierarchy of clusters formed by hierarchical clustering. It is a powerful tool for visualizing the clustering process and determining the optimal number of clusters. **Structure of a Dendrogram:** * **X-axis:** Represents the individual data points (or the clusters formed from them). * **Y-axis:** Represents the dissimilarity (distance) level at which clusters are merged (agglomerative) or split (divisive). * **Branches:** Connect the merged or split clusters. The longer the vertical line of a merge, the greater the distance between the clusters being joined. 
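Assuming SciPy is available, the agglomerative procedure and the dendrogram "cut" can be sketched as follows: `linkage` performs the bottom-up merging under a chosen linkage criterion, and `fcluster` with `criterion="distance"` is the programmatic equivalent of drawing a horizontal cut-off line at a given height. The point coordinates and the cut height are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight groups of points: agglomerative clustering merges within each
# group first (short dendrogram branches) and joins the two groups last
# (one tall branch).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Bottom-up merging with Ward's method (one of the linkage criteria above);
# Z encodes the full merge history, i.e. the dendrogram.
Z = linkage(X, method="ward")
# scipy.cluster.hierarchy.dendrogram(Z) would plot the tree itself.

# "Cutting" the dendrogram: a horizontal line at height t=2.0 lies above all
# within-group merges but below the final merge, yielding two flat clusters.
labels = fcluster(Z, t=2.0, criterion="distance")
```

Raising or lowering `t` changes the number of clusters returned, exactly as moving the horizontal cut line up or down the dendrogram would.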
**How to Interpret a Dendrogram to Determine Clusters:** To determine the number of clusters from a dendrogram: 1. **Draw a horizontal line (cut-off threshold)** across the dendrogram at a specific height (distance level) on the y-axis. 2. **Count the number of vertical lines (clusters)** that the horizontal line intersects. This number represents the number of clusters at that chosen dissimilarity level. 3. By varying the height of the cut, you can obtain different numbers of clusters. Taller vertical lines indicate more distinct clusters that are merged at higher dissimilarity, suggesting they are less similar to each other. **Example Diagram (Agglomerative Clustering Dendrogram):** *Figure: A dendrogram showing hierarchical clustering.* If we cut the dendrogram at a dissimilarity of, say, 1.5, we would get 3 clusters. If we cut it at 0.5, we would get 5 clusters. **Advantages of Hierarchical Clustering:** * **No need to pre-specify K:** The number of clusters is not required as input and can be determined by cutting the dendrogram. * **Reveals cluster hierarchy:** Provides a full hierarchy of clusters, which can be very informative for understanding the data structure at different levels of granularity. * **Visual interpretation:** Dendrograms offer an intuitive way to visualize and interpret the clustering results. * **Handles arbitrary shapes:** Can discover clusters of arbitrary shapes, unlike K-Means. **Disadvantages of Hierarchical Clustering:** * **Computational Intensity:** Can be computationally expensive for large datasets ($O(N^3)$ or $O(N^2 \log N)$ depending on linkage), especially agglomerative. Divisive is even more complex. * **Irreversible Decisions:** Once a merge or split is made, it cannot be undone. This can lead to suboptimal clusters if an early decision was poor. 
* **Sensitivity to Noise/Outliers:** Single linkage can be very sensitive to noise and outliers, leading to "chaining" where clusters are forced together due to single close points. * **Difficulty with large datasets:** Dendrograms can become very complex and difficult to read for a large number of data points. Hierarchical clustering is particularly useful when the underlying structure of the data is truly hierarchical or when exploring different granularities of clustering is important. #### Q7. Explain the Apriori Algorithm for Association Rule Mining, including the concepts of support, confidence, and how candidate itemsets are generated. **Answer:** The **Apriori Algorithm** is a classic and influential algorithm for **Association Rule Mining**. It is used to discover frequent itemsets and derive strong association rules from large transactional databases, such as market basket analysis (e.g., "customers who buy bread also buy milk"). **Core Concepts:** 1. **Itemset:** A collection of one or more items (e.g., {Bread, Milk}). 2. **Frequent Itemset:** An itemset whose occurrence in the database is greater than or equal to a user-defined minimum **support** threshold. 3. **Association Rule:** An implication of the form $X \rightarrow Y$, where $X$ and $Y$ are itemsets, and $X \cap Y = \emptyset$. It suggests that if $X$ occurs, then $Y$ is likely to occur. **Key Metrics for Association Rules:** * **Support:** * **Definition:** The **support** of an itemset $X$ is the proportion of transactions in the database that contain $X$. It measures the frequency of an itemset. * **Formula:** $Support(X) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}}$ * **Role:** Used to identify frequent itemsets. A minimum support threshold (`min_sup`) is set; only itemsets meeting this threshold are considered frequent. 
* **Confidence:** * **Definition:** The **confidence** of an association rule $X \rightarrow Y$ is the conditional probability that $Y$ appears in a transaction given that $X$ has already appeared. It measures the reliability of the rule. * **Formula:** $Confidence(X \rightarrow Y) = \frac{Support(X \cup Y)}{Support(X)}$ * **Role:** Used to filter strong association rules from frequent itemsets. A minimum confidence threshold (`min_conf`) is set; only rules meeting this threshold are considered strong. **The Apriori Principle (Anti-monotone Property):** The efficiency of the Apriori algorithm comes from the **Apriori principle**: * "If an itemset is frequent, then all of its subsets must also be frequent." * Conversely, and more importantly for pruning: "If an itemset is infrequent, then all of its supersets must also be infrequent." This principle allows the algorithm to prune (eliminate) many candidate itemsets, significantly reducing the search space. **Algorithm Steps:** The Apriori algorithm uses an iterative, level-wise search approach, where $k$-itemsets are used to explore $(k+1)$-itemsets. 1. **Generate Frequent 1-Itemsets ($L_1$):** * Scan the database once to count the occurrences of each individual item (1-itemset). * Filter out items whose support is less than `min_sup`. The remaining items form the set of frequent 1-itemsets, $L_1$. 2. **Iterative Generation of Candidate $k$-Itemsets ($C_k$) and Frequent $k$-Itemsets ($L_k$):** * **Join Step (Candidate Generation):** Use $L_{k-1}$ (frequent $(k-1)$-itemsets) to generate candidate $k$-itemsets ($C_k$). This is typically done by joining $L_{k-1}$ with itself. For example, if $\{A, B\}$ and $\{A, C\}$ are frequent 2-itemsets, they can generate candidate $\{A, B, C\}$. * **Prune Step:** Apply the Apriori principle. For each candidate $k$-itemset in $C_k$, check if all of its $(k-1)$-subsets are present in $L_{k-1}$. 
If any $(k-1)$-subset is not frequent, then the candidate $k$-itemset cannot be frequent either, and it is pruned (removed from $C_k$). This step drastically reduces the number of candidates.
* **Count Support:** Scan the database again to count the actual support for each remaining candidate $k$-itemset in $C_k$.
* **Filter Frequent Itemsets:** Filter out candidates whose support is less than `min_sup`. The remaining itemsets form the set of frequent $k$-itemsets, $L_k$.

3. **Repeat:** Repeat step 2 until no more frequent itemsets can be generated (i.e., $L_k$ is empty).

4. **Generate Association Rules:**
* Once all frequent itemsets are found, association rules are generated from them.
* For every frequent itemset $L$, and for every non-empty subset $A$ of $L$:
  * Form the rule $A \rightarrow (L - A)$.
  * Calculate the confidence of the rule: $Confidence(A \rightarrow L - A) = \frac{Support(L)}{Support(A)}$.
  * If $Confidence(A \rightarrow L - A) \ge \text{min\_conf}$, then the rule is considered strong and is output.

**Example Trace:** Assume `min_sup = 2` (out of 4 transactions) and `min_conf = 70%`.

**Database:**

| TID | Items |
| --- | ---------- |
| 1 | A, B, C |
| 2 | A, C, D |
| 3 | B, C, E |
| 4 | A, B, C, E |

**1. Find Frequent 1-Itemsets ($L_1$):**

| Itemset | Count | Support | Frequent? |
| :------ | :---- | :------- | :-------- |
| {A} | 3 | 3/4=0.75 | Yes |
| {B} | 3 | 3/4=0.75 | Yes |
| {C} | 4 | 4/4=1.00 | Yes |
| {D} | 1 | 1/4=0.25 | No |
| {E} | 2 | 2/4=0.50 | Yes |

$L_1 = \{\{A\}, \{B\}, \{C\}, \{E\}\}$

**2. Generate Frequent 2-Itemsets ($L_2$):**
* **Candidate 2-Itemsets ($C_2$) from $L_1$ (Join & Prune):**
  * Possible pairs: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
  * All these pairs have frequent 1-itemset subsets, so no pruning at this stage.
* **Count Support for $C_2$:**

| Itemset | Count | Support | Frequent? |
| :------ | :---- | :------ | :-------- |
| {A,B} | 2 | 0.50 | Yes |
| {A,C} | 3 | 0.75 | Yes |
| {A,E} | 1 | 0.25 | No |
| {B,C} | 3 | 0.75 | Yes |
| {B,E} | 2 | 0.50 | Yes |
| {C,E} | 2 | 0.50 | Yes |

$L_2 = \{\{A,B\}, \{A,C\}, \{B,C\}, \{B,E\}, \{C,E\}\}$

**3. Generate Frequent 3-Itemsets ($L_3$):**
* **Candidate 3-Itemsets ($C_3$) from $L_2$:**
  * Join $L_2$ with itself. Potential candidates:
    * From {A,B} and {A,C} $\rightarrow$ {A,B,C}
    * From {A,C} and {C,E} $\rightarrow$ {A,C,E}
    * From {B,C} and {B,E} $\rightarrow$ {B,C,E}
* **Prune $C_3$:**
  * For {A,B,C}: Subsets are {A,B}, {A,C}, {B,C}. All are in $L_2$. Keep.
  * For {A,C,E}: Subsets are {A,C}, {A,E}, {C,E}. {A,E} is NOT in $L_2$. Prune {A,C,E}.
  * For {B,C,E}: Subsets are {B,C}, {B,E}, {C,E}. All are in $L_2$. Keep.

$C_3 = \{\{A,B,C\}, \{B,C,E\}\}$
* **Count Support for $C_3$:**

| Itemset | Count | Support | Frequent? |
| :------- | :---- | :------ | :-------- |
| {A,B,C} | 2 | 0.50 | Yes |
| {B,C,E} | 2 | 0.50 | Yes |

$L_3 = \{\{A,B,C\}, \{B,C,E\}\}$

**4. Generate Frequent 4-Itemsets ($L_4$):**
* **Candidate 4-Itemsets ($C_4$) from $L_3$:**
  * The only possible join of {A,B,C} and {B,C,E} would be {A,B,C,E}.
* **Prune $C_4$:**
  * For {A,B,C,E}: Subsets are {A,B,C}, {A,B,E}, {A,C,E}, {B,C,E}.
  * {A,B,E} is not in $L_3$, and {A,C,E} is not in $L_3$. Prune {A,B,C,E}.

$C_4 = \emptyset$. Algorithm terminates.

**5.
Generate Association Rules from $L_1, L_2, L_3$ (using min_conf = 70%):** * **From {A,B,C}:** * $\{A,B\} \rightarrow \{C\}$: $Conf = Support(\{A,B,C\}) / Support(\{A,B\}) = 0.50 / 0.50 = 1.00$ (Strong) * $\{A,C\} \rightarrow \{B\}$: $Conf = Support(\{A,B,C\}) / Support(\{A,C\}) = 0.50 / 0.75 = 0.66$ (Not Strong) * $\{B,C\} \rightarrow \{A\}$: $Conf = Support(\{A,B,C\}) / Support(\{B,C\}) = 0.50 / 0.75 = 0.66$ (Not Strong) * $\{A\} \rightarrow \{B,C\}$: $Conf = Support(\{A,B,C\}) / Support(\{A\}) = 0.50 / 0.75 = 0.66$ (Not Strong) * $\{B\} \rightarrow \{A,C\}$: $Conf = Support(\{A,B,C\}) / Support(\{B\}) = 0.50 / 0.75 = 0.66$ (Not Strong) * $\{C\} \rightarrow \{A,B\}$: $Conf = Support(\{A,B,C\}) / Support(\{C\}) = 0.50 / 1.00 = 0.50$ (Not Strong) * **From {B,C,E}:** * $\{B,C\} \rightarrow \{E\}$: $Conf = Support(\{B,C,E\}) / Support(\{B,C\}) = 0.50 / 0.75 = 0.66$ (Not Strong) * $\{B,E\} \rightarrow \{C\}$: $Conf = Support(\{B,C,E\}) / Support(\{B,E\}) = 0.50 / 0.50 = 1.00$ (Strong) * $\{C,E\} \rightarrow \{B\}$: $Conf = Support(\{B,C,E\}) / Support(\{C,E\}) = 0.50 / 0.50 = 1.00$ (Strong) * ... (other rules also generated and checked) **Strong Rules Found:** * $\{A,B\} \rightarrow \{C\}$ (Confidence = 1.00) * $\{B,E\} \rightarrow \{C\}$ (Confidence = 1.00) * $\{C,E\} \rightarrow \{B\}$ (Confidence = 1.00) **Advantages:** * **Simplicity and Interpretability:** Relatively easy to understand and implement. * **Guaranteed Completeness:** Guarantees finding all frequent itemsets and strong association rules satisfying the min_sup and min_conf thresholds. * **Well-understood:** Foundation for many other association rule mining algorithms. **Disadvantages:** * **Computational Cost:** Can be very slow and memory-intensive for very large databases or when `min_sup` is low, as it requires multiple passes over the database. * **Candidate Generation Overhead:** Generating a large number of candidate itemsets can be a bottleneck. 
* **Difficulty with Sparse Data:** Performance can degrade with sparse datasets.

Despite its limitations, Apriori remains a foundational algorithm for understanding and implementing association rule mining, particularly for datasets that are not excessively large.

#### Q8. Compare and contrast Supervised and Unsupervised Learning. Give relevant examples for each.

**Answer:** **Supervised Learning** and **Unsupervised Learning** are the two most fundamental categories in Machine Learning, differing primarily in the nature of the data they learn from and the types of problems they are designed to solve.

**Comparison Table:**

| Feature | Supervised Learning | Unsupervised Learning |
| :------ | :------------------ | :-------------------- |
| **Data Type** | **Labeled Data:** Each training example includes input features and a corresponding "correct" output label/target. | **Unlabeled Data:** Only input features are available; there are no target labels. |
| **Learning Process** | Learns a mapping function from inputs to outputs by finding patterns and relationships between features and labels. | Discovers hidden patterns, structures, relationships, or groupings within the data without explicit guidance. |
| **Goal** | To predict the output for new, unseen input data. | To understand the underlying structure or distribution of the data. |
| **Tasks** | **Classification:** Predicting a categorical label (e.g., spam/not-spam). **Regression:** Predicting a continuous numerical value (e.g., house price). | **Clustering:** Grouping similar data points together (e.g., customer segmentation). **Association:** Discovering relationships between items (e.g., market basket analysis). **Dimensionality Reduction:** Reducing the number of features while retaining important information. |
| **Feedback Mechanism** | Direct feedback is available during training (model's predictions are compared to actual labels, and errors are used to adjust the model). | No direct feedback; evaluation is often subjective or based on intrinsic measures (e.g., cluster compactness). |
| **Complexity** | Generally more straightforward to evaluate performance (accuracy, MSE). | Evaluation can be more challenging as there's no "ground truth" to compare against. |
| **Applications** | Medical diagnosis, spam detection, image recognition, stock price prediction, sentiment analysis, credit scoring. | Customer segmentation, anomaly detection, topic modeling, recommender systems (partially), gene sequencing. |
| **Typical Algorithms** | Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Naive Bayes, Random Forest. | K-Means Clustering, Hierarchical Clustering, Apriori Algorithm, Gaussian Mixture Models (GMM), Principal Component Analysis (PCA). |
| **Output** | Specific predictions (e.g., "Yes", "No", "250,000 USD"). | Data groupings, rules, or reduced data representations. |

**Key Contrasts:**

1. **Presence of Labels:** This is the most critical distinction. Supervised learning relies on a teacher (labels) to guide the learning process, while unsupervised learning explores data autonomously.
2. **Purpose:** Supervised learning is for making predictions based on known outcomes, whereas unsupervised learning is for discovering insights and structures in data where outcomes are unknown.
3. **Data Preparation:** Supervised learning requires significant effort in data labeling, which can be expensive and time-consuming. Unsupervised learning can work with raw, unlabeled data, making it useful for exploratory data analysis.
4. **Evaluation:** Supervised models are easier to evaluate quantitatively (e.g., accuracy, precision, recall, R-squared) because there are known correct answers.
Unsupervised models often require qualitative evaluation or internal metrics (e.g., silhouette score for clustering) as there's no ground truth. **Examples:** **Supervised Learning Examples:** * **Email Spam Detection:** Training a model with emails labeled as "spam" or "not spam" to predict if a new email is spam. (Classification) * **Predicting House Prices:** Training a model with historical data of house features (size, location, number of rooms) and their corresponding sale prices to predict the price of a new house. (Regression) * **Image Classification:** Training a model with images of cats and dogs, each labeled with its animal type, to identify animals in new images. (Classification) * **Medical Diagnosis:** Training a model on patient symptoms and their diagnosed diseases to predict the likelihood of a disease for a new patient. (Classification) **Unsupervised Learning Examples:** * **Customer Segmentation:** Grouping customers into different segments (e.g., "high-value," "frequent buyers," "new customers") based on their purchasing behavior without prior knowledge of these segments. (Clustering) * **Market Basket Analysis:** Discovering that people who buy "diapers" often also buy "baby wipes" from transactional data, to suggest cross-selling opportunities. (Association Rule Mining) * **Anomaly Detection:** Identifying unusual patterns in network traffic data that might indicate a cyber-attack, without prior labels of "normal" or "attack" traffic. (Clustering/Outlier Detection) * **Document Clustering:** Grouping news articles into topics (e.g., "sports," "politics," "technology") based on their content without pre-defined categories. (Clustering) In summary, the choice between supervised and unsupervised learning depends entirely on the problem at hand and the availability of labeled data. Both are crucial for different types of machine learning challenges. ### Section D — Numerical / Problem Solving #### Q1. 
Linear Regression Calculation

**Problem:** Given the following data points, find the regression line $y = \theta_0 + \theta_1x$ using the formulas for $\theta_0$ and $\theta_1$.

| X | Y |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 5 |
| 4 | 4 |
| 5 | 5 |

**Given Data:** $x = [1, 2, 3, 4, 5]$, $y = [2, 4, 5, 4, 5]$. Number of data points, $m = 5$.

**Formulas Used:** The coefficients $\theta_1$ and $\theta_0$ for simple linear regression can be calculated using the following formulas:
$$\theta_1 = \frac{m \sum(xy) - \sum x \sum y}{m \sum(x^2) - (\sum x)^2}$$
$$\theta_0 = \bar{y} - \theta_1 \bar{x}$$
where:
* $\sum x$: Sum of all $x$ values
* $\sum y$: Sum of all $y$ values
* $\sum (xy)$: Sum of the product of $x$ and $y$ for each data point
* $\sum (x^2)$: Sum of the square of each $x$ value
* $\bar{x}$: Mean of $x$ values
* $\bar{y}$: Mean of $y$ values

**Step-by-step Calculation:**

1. **Calculate sums:**
* $\sum x = 1 + 2 + 3 + 4 + 5 = 15$
* $\sum y = 2 + 4 + 5 + 4 + 5 = 20$
* $x^2$ values: $1^2=1, 2^2=4, 3^2=9, 4^2=16, 5^2=25$, so $\sum (x^2) = 1 + 4 + 9 + 16 + 25 = 55$
* $xy$ values: $1 \cdot 2=2, 2 \cdot 4=8, 3 \cdot 5=15, 4 \cdot 4=16, 5 \cdot 5=25$, so $\sum (xy) = 2 + 8 + 15 + 16 + 25 = 66$

2. **Calculate means:**
* $\bar{x} = \frac{\sum x}{m} = \frac{15}{5} = 3$
* $\bar{y} = \frac{\sum y}{m} = \frac{20}{5} = 4$

3. **Calculate $\theta_1$:**
$$ \theta_1 = \frac{m \sum(xy) - \sum x \sum y}{m \sum(x^2) - (\sum x)^2} = \frac{5 \cdot 66 - 15 \cdot 20}{5 \cdot 55 - (15)^2} = \frac{330 - 300}{275 - 225} = \frac{30}{50} = 0.6 $$

4. **Calculate $\theta_0$:**
$$ \theta_0 = \bar{y} - \theta_1 \bar{x} = 4 - 0.6 \cdot 3 = 4 - 1.8 = 2.2 $$

**Final Result:** The regression line equation is:
$$ y = 2.2 + 0.6x $$

#### Q2.
Naive Bayes Probability Calculation **Problem:** A dataset for predicting whether a person plays tennis (`Play`) based on `Outlook`, `Temperature`, `Humidity`, and `Windy` features is given. Calculate the probability that a person will `Play=Yes` given `Outlook=Sunny`, `Temperature=Cool`, `Humidity=High`, `Windy=True` using Naive Bayes. **Given Data (Training Data):** | Outlook | Temperature | Humidity | Windy | Play | | :------- | :---------- | :------- | :---- | :--- | | Sunny | Hot | High | False | No | | Sunny | Hot | High | True | No | | Overcast | Hot | High | False | Yes | | Rain | Mild | High | False | Yes | | Rain | Cool | Normal | False | Yes | | Rain | Cool | Normal | True | No | | Overcast | Cool | Normal | True | Yes | | Sunny | Mild | High | False | No | | Sunny | Cool | Normal | False | Yes | | Rain | Mild | Normal | False | Yes | | Sunny | Mild | Normal | True | Yes | | Overcast | Mild | High | True | Yes | | Overcast | Hot | Normal | False | Yes | | Rain | Mild | High | True | No | **Query:** Predict `Play=Yes` or `Play=No` for a new day with features: `Outlook = Sunny`, `Temperature = Cool`, `Humidity = High`, `Windy = True` **Formula Used (Naive Bayes Classifier):** $$P(C|\mathbf{X}) \propto P(C) \prod_{i=1}^{n} P(x_i|C)$$ We need to calculate $P(\text{Play=Yes } | \text{ Sunny, Cool, High, True})$ and $P(\text{Play=No } | \text{ Sunny, Cool, High, True})$ and compare them. **Step-by-step Calculation:** 1. **Calculate Prior Probabilities for `Play`:** * Total instances = 14 * `Play=Yes` instances = 9 * `Play=No` instances = 5 * $P(\text{Play=Yes}) = 9/14 \approx 0.643$ * $P(\text{Play=No}) = 5/14 \approx 0.357$ 2. 
**Calculate Likelihoods for each feature given `Play=Yes`:** * **For `Play=Yes` (9 instances):** * `Outlook=Sunny`: 2 Yes (out of 9) $\rightarrow P(\text{Sunny | Yes}) = 2/9 \approx 0.222$ * `Temperature=Cool`: 3 Yes (out of 9) $\rightarrow P(\text{Cool | Yes}) = 3/9 \approx 0.333$ * `Humidity=High`: 3 Yes (out of 9) $\rightarrow P(\text{High | Yes}) = 3/9 \approx 0.333$ * `Windy=True`: 3 Yes (out of 9) $\rightarrow P(\text{True | Yes}) = 3/9 \approx 0.333$ 3. **Calculate Likelihoods for each feature given `Play=No`:** * **For `Play=No` (5 instances):** * `Outlook=Sunny`: 3 No (out of 5) $\rightarrow P(\text{Sunny | No}) = 3/5 = 0.600$ * `Temperature=Cool`: 1 No (out of 5) $\rightarrow P(\text{Cool | No}) = 1/5 = 0.200$ * `Humidity=High`: 4 No (out of 5) $\rightarrow P(\text{High | No}) = 4/5 = 0.800$ * `Windy=True`: 3 No (out of 5) $\rightarrow P(\text{True | No}) = 3/5 = 0.600$ 4. **Calculate Posterior Probability (Numerator) for `Play=Yes`:** $P(\text{Yes}) \cdot P(\text{Sunny | Yes}) \cdot P(\text{Cool | Yes}) \cdot P(\text{High | Yes}) \cdot P(\text{True | Yes})$ $= (9/14) \cdot (2/9) \cdot (3/9) \cdot (3/9) \cdot (3/9)$ $= 0.643 \cdot 0.222 \cdot 0.333 \cdot 0.333 \cdot 0.333$ $= 0.643 \cdot 0.00819 \approx 0.005265$ 5. **Calculate Posterior Probability (Numerator) for `Play=No`:** $P(\text{No}) \cdot P(\text{Sunny | No}) \cdot P(\text{Cool | No}) \cdot P(\text{High | No}) \cdot P(\text{True | No})$ $= (5/14) \cdot (3/5) \cdot (1/5) \cdot (4/5) \cdot (3/5)$ $= 0.357 \cdot 0.600 \cdot 0.200 \cdot 0.800 \cdot 0.600$ $= 0.357 \cdot 0.0576 \approx 0.020563$ **Final Result:** Comparing the two values: * $P(\text{Play=Yes } | \text{ Sunny, Cool, High, True}) \propto 0.005265$ * $P(\text{Play=No } | \text{ Sunny, Cool, High, True}) \propto 0.020563$ Since $0.020563 > 0.005265$, the Naive Bayes Classifier predicts that the person will **`Play=No`** for the given conditions. #### Q3. 
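The comparison can be verified with a small Python sketch. Using `fractions.Fraction` keeps the arithmetic exact, so the scores differ slightly from the rounded decimals in the worked steps (the fractions below are the class counts tallied from the training table):

```python
from fractions import Fraction as F

# Unnormalized Naive Bayes scores for the query (Sunny, Cool, High, True):
# prior * product of per-feature likelihoods, per class.
score_yes = F(9, 14) * F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9)
score_no  = F(5, 14) * F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5)

prediction = "Yes" if score_yes > score_no else "No"
print(float(score_yes), float(score_no), prediction)  # score_no wins -> "No"
```

Normalizing the two scores (dividing each by their sum) would give proper posterior probabilities, but for classification only the comparison matters.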
Entropy & Information Gain Calculation **Problem:** Consider a dataset with 14 instances, where 9 instances belong to class 'Yes' and 5 instances belong to class 'No'. We want to decide the root node for a Decision Tree based on the 'Outlook' feature, which has three possible values: 'Sunny', 'Overcast', and 'Rain'. **Given Data:** | Outlook | Play (Target Class) | Count | | :------- | :------------------ | :---- | | Sunny | No | 3 | | Sunny | Yes | 2 | | Overcast | Yes | 4 | | Rain | Yes | 3 | | Rain | No | 2 | **Summary of 'Play' counts:** * Total instances ($|S|$) = 14 * `Play=Yes` count = 9 * `Play=No` count = 5 **Summary of 'Outlook' splits:** * **Outlook = Sunny:** 5 instances (3 No, 2 Yes) * **Outlook = Overcast:** 4 instances (0 No, 4 Yes) * **Outlook = Rain:** 5 instances (2 No, 3 Yes) **Formulas Used:** * **Entropy:** $Entropy(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)$ * **Information Gain:** $Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$ **Step-by-step Calculation:** 1. **Calculate Entropy of the entire dataset ($Entropy(S)$):** * $p_{\text{Yes}} = 9/14$ * $p_{\text{No}} = 5/14$ * $Entropy(S) = -(9/14) \log_2(9/14) - (5/14) \log_2(5/14)$ * $Entropy(S) = -(9/14)(-0.6374) - (5/14)(-1.4854)$ * $Entropy(S) = 0.4098 + 0.5305 = 0.9403$ (values carried to 4 decimal places) 2. 
**Calculate Entropy for each value of 'Outlook':** * **For `Outlook = Sunny` ($S_{Sunny}$):** (3 No, 2 Yes, total 5 instances) * $p_{\text{No}} = 3/5$, $p_{\text{Yes}} = 2/5$ * $Entropy(S_{Sunny}) = -(3/5) \log_2(3/5) - (2/5) \log_2(2/5)$ * $Entropy(S_{Sunny}) = -0.6 \log_2(0.6) - 0.4 \log_2(0.4)$ * $Entropy(S_{Sunny}) = -0.6(-0.737) - 0.4(-1.322)$ * $Entropy(S_{Sunny}) = 0.442 + 0.529 = 0.971$ * **For `Outlook = Overcast` ($S_{Overcast}$):** (0 No, 4 Yes, total 4 instances) * $p_{\text{No}} = 0/4 = 0$, $p_{\text{Yes}} = 4/4 = 1$ * $Entropy(S_{Overcast}) = -(0) \log_2(0) - (1) \log_2(1) = 0$ (A pure set has 0 entropy) * **For `Outlook = Rain` ($S_{Rain}$):** (2 No, 3 Yes, total 5 instances) * $p_{\text{No}} = 2/5$, $p_{\text{Yes}} = 3/5$ * $Entropy(S_{Rain}) = -(2/5) \log_2(2/5) - (3/5) \log_2(3/5)$ * $Entropy(S_{Rain}) = -0.4 \log_2(0.4) - 0.6 \log_2(0.6)$ * $Entropy(S_{Rain}) = -0.4(-1.322) - 0.6(-0.737)$ * $Entropy(S_{Rain}) = 0.529 + 0.442 = 0.971$ 3. **Calculate Information Gain for 'Outlook' ($Gain(S, \text{Outlook})$):** $$ Gain(S, \text{Outlook}) = Entropy(S) - \left[ \frac{5}{14} Entropy(S_{Sunny}) + \frac{4}{14} Entropy(S_{Overcast}) + \frac{5}{14} Entropy(S_{Rain}) \right] $$ $$ Gain(S, \text{Outlook}) = 0.9403 - \left[ \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971) \right] $$ $$ Gain(S, \text{Outlook}) = 0.9403 - \left[ 0.3468 + 0 + 0.3468 \right] $$ $$ Gain(S, \text{Outlook}) = 0.9403 - 0.6936 $$ $$ Gain(S, \text{Outlook}) = 0.2467 \approx 0.247 $$ **Final Result:** The Entropy of the original dataset is **0.940**. The Information Gain for the 'Outlook' attribute is **0.247**. If this were the highest Information Gain among all features, 'Outlook' would be chosen as the root node for the Decision Tree. #### Q4. 
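A short Python sketch (standard library only; the helper name `entropy` is my own) reproduces this calculation at full precision, so small differences from hand-rounded intermediate values disappear:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

H_S = entropy([9, 5])                  # whole dataset: 9 Yes, 5 No

# Class counts [Yes, No] within each Outlook branch: Sunny, Overcast, Rain.
branches = [[2, 3], [4, 0], [3, 2]]
total = 14
gain = H_S - sum(sum(b) / total * entropy(b) for b in branches)

print(round(H_S, 3), round(gain, 3))   # 0.94 0.247
```

To pick the root node, the same `gain` computation would be repeated for every candidate feature and the largest value chosen.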
KNN Distance Classification **Problem:** Given a dataset of existing data points with two features (X1, X2) and their corresponding classes, classify a new data point using the K-Nearest Neighbors (KNN) algorithm with K=3. **Given Data:** | Data Point | X1 | X2 | Class | | :--------- | :- | :- | :---- | | P1 | 1 | 1 | A | | P2 | 2 | 1 | A | | P3 | 2 | 2 | A | | P4 | 3 | 2 | B | | P5 | 3 | 3 | B | | P6 | 1 | 2 | A | (P6's class, left ambiguous in the original problem statement, is taken as A.) **New Data Point to Classify:** $P_{new} = (X1=2.5, X2=2.5)$ **K value:** $K=3$ **Formula Used (Euclidean Distance):** The Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is: $$ D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $$ **Step-by-step Calculation:** 1. **Calculate Euclidean Distance from $P_{new}(2.5, 2.5)$ to each existing data point:** * **To P1 (1, 1):** $D(P_{new}, P1) = \sqrt{(2.5 - 1)^2 + (2.5 - 1)^2} = \sqrt{(1.5)^2 + (1.5)^2} = \sqrt{2.25 + 2.25} = \sqrt{4.5} \approx 2.12$ * **To P2 (2, 1):** $D(P_{new}, P2) = \sqrt{(2.5 - 2)^2 + (2.5 - 1)^2} = \sqrt{(0.5)^2 + (1.5)^2} = \sqrt{0.25 + 2.25} = \sqrt{2.5} \approx 1.58$ * **To P3 (2, 2):** $D(P_{new}, P3) = \sqrt{(2.5 - 2)^2 + (2.5 - 2)^2} = \sqrt{(0.5)^2 + (0.5)^2} = \sqrt{0.25 + 0.25} = \sqrt{0.5} \approx 0.71$ * **To P4 (3, 2):** $D(P_{new}, P4) = \sqrt{(2.5 - 3)^2 + (2.5 - 2)^2} = \sqrt{(-0.5)^2 + (0.5)^2} = \sqrt{0.25 + 0.25} = \sqrt{0.5} \approx 0.71$ * **To P5 (3, 3):** $D(P_{new}, P5) = \sqrt{(2.5 - 3)^2 + (2.5 - 3)^2} = \sqrt{(-0.5)^2 + (-0.5)^2} = \sqrt{0.25 + 0.25} = \sqrt{0.5} \approx 0.71$ * **To P6 (1, 2):** $D(P_{new}, P6) = \sqrt{(2.5 - 1)^2 + (2.5 - 2)^2} = \sqrt{(1.5)^2 + (0.5)^2} = \sqrt{2.25 + 0.25} = \sqrt{2.5} \approx 1.58$ 2. 
**Sort distances in ascending order and identify K=3 nearest neighbors:** | Data Point | Distance (approx) | Class | | :--------- | :---------------- | :---- | | P3 | 0.71 | A | | P4 | 0.71 | B | | P5 | 0.71 | B | | P2 | 1.58 | A | | P6 | 1.58 | A | | P1 | 2.12 | A | The 3 nearest neighbors are P3, P4, P5 (all have approx distance 0.71). 3. **Determine the class by majority vote among the K=3 neighbors:** * Neighbors: P3 (Class A), P4 (Class B), P5 (Class B) * Votes for Class A: 1 * Votes for Class B: 2 **Final Result:** Since Class B has the majority vote (2 out of 3), the new data point $P_{new}(2.5, 2.5)$ is classified as **Class B**. #### Q5. K-Means Clustering Iterations **Problem:** Perform K-Means clustering for K=2 on the following 6 data points. Use P1 and P4 as initial centroids. Show the steps for the first two iterations. **Given Data Points:** P1(1, 1), P2(1.5, 2), P3(3, 4), P4(5, 7), P5(3.5, 5), P6(4.5, 5) **K value:** K=2 **Initial Centroids:** C1 = P1(1, 1) C2 = P4(5, 7) **Formula Used (Euclidean Distance):** $$ D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $$ --- **Iteration 1:** **Step 1: Assignment Step (Assign each point to the nearest centroid)** * **Distances from C1(1, 1):** * P1(1,1): $D(P1, C1) = \sqrt{(1-1)^2 + (1-1)^2} = 0$ * P2(1.5,2): $D(P2, C1) = \sqrt{(1.5-1)^2 + (2-1)^2} = \sqrt{0.5^2 + 1^2} = \sqrt{0.25 + 1} = \sqrt{1.25} \approx 1.118$ * P3(3,4): $D(P3, C1) = \sqrt{(3-1)^2 + (4-1)^2} = \sqrt{2^2 + 3^2} = \sqrt{4 + 9} = \sqrt{13} \approx 3.606$ * P4(5,7): $D(P4, C1) = \sqrt{(5-1)^2 + (7-1)^2} = \sqrt{4^2 + 6^2} = \sqrt{16 + 36} = \sqrt{52} \approx 7.211$ * P5(3.5,5): $D(P5, C1) = \sqrt{(3.5-1)^2 + (5-1)^2} = \sqrt{2.5^2 + 4^2} = \sqrt{6.25 + 16} = \sqrt{22.25} \approx 4.717$ * P6(4.5,5): $D(P6, C1) = \sqrt{(4.5-1)^2 + (5-1)^2} = \sqrt{3.5^2 + 4^2} = \sqrt{12.25 + 16} = \sqrt{28.25} \approx 5.315$ * **Distances from C2(5, 7):** * P1(1,1): $D(P1, C2) = \sqrt{(1-5)^2 + (1-7)^2} = \sqrt{(-4)^2 + (-6)^2} = \sqrt{16 + 36} = \sqrt{52} \approx 
7.211$ * P2(1.5,2): $D(P2, C2) = \sqrt{(1.5-5)^2 + (2-7)^2} = \sqrt{(-3.5)^2 + (-5)^2} = \sqrt{12.25 + 25} = \sqrt{37.25} \approx 6.103$ * P3(3,4): $D(P3, C2) = \sqrt{(3-5)^2 + (4-7)^2} = \sqrt{(-2)^2 + (-3)^2} = \sqrt{4 + 9} = \sqrt{13} \approx 3.606$ * P4(5,7): $D(P4, C2) = \sqrt{(5-5)^2 + (7-7)^2} = 0$ * P5(3.5,5): $D(P5, C2) = \sqrt{(3.5-5)^2 + (5-7)^2} = \sqrt{(-1.5)^2 + (-2)^2} = \sqrt{2.25 + 4} = \sqrt{6.25} = 2.5$ * P6(4.5,5): $D(P6, C2) = \sqrt{(4.5-5)^2 + (5-7)^2} = \sqrt{(-0.5)^2 + (-2)^2} = \sqrt{0.25 + 4} = \sqrt{4.25} \approx 2.062$ * **Assignment (compare distances to C1 and C2):** * P1: $D(P1, C1)=0 < D(P1, C2)\approx 7.211 \Rightarrow$ Cluster 1 * P2: $D(P2, C1)\approx 1.118 < D(P2, C2)\approx 6.103 \Rightarrow$ Cluster 1 * P3: $D(P3, C1)\approx 3.606 = D(P3, C2)\approx 3.606$ (a tie; assigned to Cluster 1, the first centroid) $\Rightarrow$ Cluster 1 * P4: $D(P4, C1)\approx 7.211 > D(P4, C2)=0 \Rightarrow$ Cluster 2 * P5: $D(P5, C1)\approx 4.717 > D(P5, C2)=2.5 \Rightarrow$ Cluster 2 * P6: $D(P6, C1)\approx 5.315 > D(P6, C2)\approx 2.062 \Rightarrow$ Cluster 2 * **Current Clusters:** * Cluster 1: {P1(1,1), P2(1.5,2), P3(3,4)} * Cluster 2: {P4(5,7), P5(3.5,5), P6(4.5,5)} **Step 2: Update Step (Recalculate Centroids)** * **New C1:** Mean of {P1, P2, P3} * $X_{C1} = (1 + 1.5 + 3) / 3 = 5.5 / 3 \approx 1.83$ * $Y_{C1} = (1 + 2 + 4) / 3 = 7 / 3 \approx 2.33$ * New C1 = (1.83, 2.33) * **New C2:** Mean of {P4, P5, P6} * $X_{C2} = (5 + 3.5 + 4.5) / 3 = 13 / 3 \approx 4.33$ * $Y_{C2} = (7 + 5 + 5) / 3 = 17 / 3 \approx 5.67$ * New C2 = (4.33, 5.67) --- **Iteration 2:** **Step 1: Assignment Step (Assign each point to the nearest new centroid)** * New Centroids: C1(1.83, 2.33), C2(4.33, 5.67) * **Distances from C1(1.83, 2.33):** * P1(1,1): $D(P1, C1) = \sqrt{(1-1.83)^2 + (1-2.33)^2} = \sqrt{(-0.83)^2 + (-1.33)^2} = \sqrt{0.689 + 1.769} = \sqrt{2.458} \approx 1.568$ * P2(1.5,2): $D(P2, C1) = \sqrt{(1.5-1.83)^2 + (2-2.33)^2} = \sqrt{(-0.33)^2 + (-0.33)^2} = \sqrt{0.109 + 0.109} = \sqrt{0.218} \approx 0.467$ * P3(3,4): $D(P3, C1) = \sqrt{(3-1.83)^2 + (4-2.33)^2} = \sqrt{(1.17)^2 + (1.67)^2} = \sqrt{1.369 + 2.789} = \sqrt{4.158} \approx 2.039$ * P4(5,7): $D(P4, C1) = \sqrt{(5-1.83)^2 + (7-2.33)^2} = \sqrt{(3.17)^2 + (4.67)^2} = \sqrt{10.049 + 
21.809} = \sqrt{31.858} \approx 5.644$ * P5(3.5,5): $D(P5, C1) = \sqrt{(3.5-1.83)^2 + (5-2.33)^2} = \sqrt{(1.67)^2 + (2.67)^2} = \sqrt{2.789 + 7.129} = \sqrt{9.918} \approx 3.149$ * P6(4.5,5): $D(P6, C1) = \sqrt{(4.5-1.83)^2 + (5-2.33)^2} = \sqrt{(2.67)^2 + (2.67)^2} = \sqrt{7.129 + 7.129} = \sqrt{14.258} \approx 3.776$ * **Distances from C2(4.33, 5.67):** * P1(1,1): $D(P1, C2) = \sqrt{(1-4.33)^2 + (1-5.67)^2} = \sqrt{(-3.33)^2 + (-4.67)^2} = \sqrt{11.089 + 21.809} = \sqrt{32.898} \approx 5.736$ * P2(1.5,2): $D(P2, C2) = \sqrt{(1.5-4.33)^2 + (2-5.67)^2} = \sqrt{(-2.83)^2 + (-3.67)^2} = \sqrt{8.009 + 13.469} = \sqrt{21.478} \approx 4.634$ * P3(3,4): $D(P3, C2) = \sqrt{(3-4.33)^2 + (4-5.67)^2} = \sqrt{(-1.33)^2 + (-1.67)^2} = \sqrt{1.769 + 2.789} = \sqrt{4.558} \approx 2.135$ * P4(5,7): $D(P4, C2) = \sqrt{(5-4.33)^2 + (7-5.67)^2} = \sqrt{(0.67)^2 + (1.33)^2} = \sqrt{0.449 + 1.769} = \sqrt{2.218} \approx 1.490$ * P5(3.5,5): $D(P5, C2) = \sqrt{(3.5-4.33)^2 + (5-5.67)^2} = \sqrt{(-0.83)^2 + (-0.67)^2} = \sqrt{0.689 + 0.449} = \sqrt{1.138} \approx 1.067$ * P6(4.5,5): $D(P6, C2) = \sqrt{(4.5-4.33)^2 + (5-5.67)^2} = \sqrt{(0.17)^2 + (-0.67)^2} = \sqrt{0.029 + 0.449} = \sqrt{0.478} \approx 0.691$ * **Assignment (compare distances to C1 and C2):** * P1: $D(P1, C1)\approx 1.568 < D(P1, C2)\approx 5.736 \Rightarrow$ Cluster 1 * P2: $D(P2, C1)\approx 0.467 < D(P2, C2)\approx 4.634 \Rightarrow$ Cluster 1 * P3: $D(P3, C1)\approx 2.039 < D(P3, C2)\approx 2.135 \Rightarrow$ Cluster 1 * P4: $D(P4, C1)\approx 5.644 > D(P4, C2)\approx 1.490 \Rightarrow$ Cluster 2 * P5: $D(P5, C1)\approx 3.149 > D(P5, C2)\approx 1.067 \Rightarrow$ Cluster 2 * P6: $D(P6, C1)\approx 3.776 > D(P6, C2)\approx 0.691 \Rightarrow$ Cluster 2 * **Current Clusters:** * Cluster 1: {P1(1,1), P2(1.5,2), P3(3,4)} * Cluster 2: {P4(5,7), P5(3.5,5), P6(4.5,5)} **Step 2: Update Step (Recalculate Centroids)** * **New C1:** Mean of {P1, P2, P3} * $X_{C1} = (1 + 1.5 + 3) / 3 = 5.5 / 3 \approx 1.83$ * $Y_{C1} = (1 + 2 + 4) / 3 = 7 / 3 \approx 2.33$ * New C1 = (1.83, 2.33) * **New C2:** Mean of {P4, P5, P6} * $X_{C2} = (5 + 3.5 + 4.5) / 3 = 13 / 3 \approx 4.33$ * $Y_{C2} = (7 + 5 + 5) / 3 = 17 / 3 \approx 5.67$ * New C2 = (4.33, 5.67) **Final Result (after 
2 iterations):** The cluster assignments and centroids did not change from Iteration 1 to Iteration 2. This indicates that the algorithm has converged. * **Cluster 1:** {P1(1,1), P2(1.5,2), P3(3,4)} with Centroid C1(1.83, 2.33) * **Cluster 2:** {P4(5,7), P5(3.5,5), P6(4.5,5)} with Centroid C2(4.33, 5.67) #### Q6. Apriori Support & Confidence **Problem:** Given the following transaction database and minimum support (min_sup) of 50% and minimum confidence (min_conf) of 70%, find all frequent itemsets and strong association rules using the Apriori algorithm. **Given Data (Transaction Database):** | Transaction ID | Items Bought | | :------------- | :------------------------- | | T100 | {Milk, Onion, Nutmeg, KidneyBeans, Eggs, Yogurt} | | T200 | {Dill, Onion, Nutmeg, KidneyBeans, Eggs, Yogurt} | | T300 | {Milk, Apple, KidneyBeans, Eggs} | | T400 | {Milk, Unicorn, KidneyBeans, Yogurt} | | T500 | {Corn, Onion, KidneyBeans, IceCream, Eggs} | **min_sup = 50%** **min_conf = 70%** **Formulas Used:** * $Support(X) = \frac{\text{Number of transactions containing } X}{\text{Total number of transactions}}$ * $Confidence(X \rightarrow Y) = \frac{Support(X \cup Y)}{Support(X)}$ **Step-by-step Calculation:** **Total Number of Transactions = 5** **Minimum Support Count:** $5 \times 50\% = 2.5$. So, an itemset must appear in at least 3 transactions to be frequent. **1. Generate Frequent 1-Itemsets ($L_1$):** * Count support for each individual item: * Milk: 3 * Onion: 3 * Nutmeg: 2 * KidneyBeans: 5 * Eggs: 4 * Yogurt: 3 * Dill: 1 * Apple: 1 * Unicorn: 1 * Corn: 1 * IceCream: 1 * Filter items with support count