## Chapter 1: Introduction to Data Science

### Data Basics (1 Mark Questions)
- **Zip Files:** Compress files to reduce size and store multiple files together.
- **Volume Characteristic:** Large amount of data generated from various sources.
- **Semi-Structured Data Examples:** JSON, XML files.
- **XML Files:** Semi-structured text format for storing and sharing data.
- **Primary Data:** Original data collected first-hand for a specific purpose.
- **Data Science:** Collecting and analyzing data to find useful patterns.
- **Data Source:** Origin/place from where data is collected.
- **Data Science in Healthcare:** Disease prediction.
- **Structured Data:** Organized in rows and columns (like tables).
- **Raster vs. Vector Images:** Raster uses pixels; vector uses shapes/lines.
- **GZip File:** Compressed file format to reduce file size.
- **Social Media Data Example:** Facebook comments.
- **Semi-structured Data:** Data with some structure but not in table format.
- **Why Learn Data Science:** Better decision-making and career opportunities.

### Data Science Concepts (2-4 Mark Questions)
- **Applications of Data Science:**
  - Healthcare: disease prediction
  - Banking: fraud detection
  - Business: sales forecasting
  - E-commerce: recommendation systems
- **Ways Data Is Stored in Files:**
  - Text format (.txt): plain-text storage
  - Spreadsheet (.csv): tabular row-column format
- **Tools in the Data Scientist's Toolbox:**
  - Python: data cleaning & analysis
  - TensorFlow: machine learning model building
- **3V's of Data Science:**
  - Volume: huge data amount
  - Velocity: fast data generation speed
  - Variety: different types of data
- **Data Science Application in Healthcare:**
  - Disease prediction from medical data
  - Treatment recommendation based on history
- **Phases of the Data Science Lifecycle:**
  - Data collection
  - Data cleaning
  - Analysis & modeling
  - Evaluation & deployment
- **Importance of the Modeling Phase:**
  - Creates a prediction/decision-making model
  - Helps estimate future trends and patterns
- **Significance of the Deployment Phase:**
  - Makes the model usable in real life
  - Runs predictions for users/business
- **Key Activities in the Evaluation Phase:**
  - Checking model performance and accuracy
  - Correcting errors if needed
- **Tools/Software in Data Science:** Python, R.
- **Quantitative vs. Qualitative Data:**
  - Quantitative: numeric values
  - Qualitative: descriptive/categorical values
- **Discrete vs. Continuous Data:**
  - Discrete: countable values
  - Continuous: infinite numeric range
- **Text Data:** Data in written format (e.g., emails, reviews).
- **Dense Numerical Array:**
  - Array where almost every value is numeric
  - Very few empty or missing values
- **Problems with Unstructured Data:**
  - Hard to search and process automatically
  - Needs extra tools to structure it
- **Semi-structured Data as Intermediate:**
  - Partially organized data
  - Does not follow a fixed table format
- **Text Data Format:** Stores data as plain text (e.g., a .txt file).
- **Exploratory Data Analysis (EDA):**
  - First study of data before modeling
  - Uses graphs and statistics to understand patterns
  - Helps detect missing values, outliers, and trends
- **Types of Data:**
  - Quantitative: numeric (height)
  - Qualitative: descriptive (color)
  - Structured: table format (Excel)
  - Unstructured: audio/video/images
  - Semi-structured: JSON/XML

## Chapter 2: Statistics in Data Science

### Statistical Definitions (1 Mark Questions)
- **Quartile:** Divides data into four equal parts.
- **Mean, Median, Mode:**
  - Mean = average
  - Median = middle value
  - Mode = most repeated value
- **Range:** Highest value – lowest value.
- **Variance & Standard Deviation:**
  - Variance: measures spread
  - SD: square root of variance
- **Interquartile Range (IQR):** $IQR = Q3 - Q1$.
- **Variance:** Shows how far values are from the mean.
- **Standard Deviation:** Shows how much data values vary around the mean.
- **Hypothesis Testing:** Checks whether a claim about data is true or false.
- **Multiple Hypothesis Testing:** Testing many hypotheses on the same dataset.
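The definitions above (mean, median, mode, range, variance, SD, IQR) can be checked with Python's built-in `statistics` module; a minimal sketch, with an invented data list:

```python
# Illustrative check of the descriptive-statistics definitions above,
# using Python's statistics module; the data list is made up.
import statistics

data = [4, 8, 6, 5, 3, 8, 9, 7, 8, 10]

mean = statistics.mean(data)          # (sum of values) / N
median = statistics.median(data)      # middle value after sorting
mode = statistics.mode(data)          # most repeated value
data_range = max(data) - min(data)    # highest - lowest
variance = statistics.variance(data)  # sample variance, divides by N-1
sd = statistics.stdev(data)           # square root of the variance

# Quartiles and IQR; method="inclusive" matches the common textbook split.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
```

Note that `statistics.variance` uses the sample formula (denominator N−1); `statistics.pvariance` gives the population version.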
- **Type I & Type II Error:**
  - Type I: rejecting a true null hypothesis
  - Type II: accepting a false null hypothesis

### Statistical Concepts (2-4 Mark Questions)
- **Inferential Statistics:**
  - Drawing conclusions from sample data
  - Used to predict results for the whole population
- **Null & Alternative Hypothesis:**
  - Null hypothesis (H0): no effect or no difference
  - Alternative hypothesis (H1): there is an effect or difference
- **Statistical Data Analysis:**
  - Collecting and examining data
  - Helps find patterns and make decisions
- **Role of Statistics in Data Science:**
  - Helps summarize and understand data
  - Supports prediction and decision-making
- **Dissimilarity Matrix vs. Data Matrix:**
  - Dissimilarity matrix: stores distances between objects
  - Data matrix: stores values/features of objects
- **Euclidean Distance:**
  - Straight-line distance between two points
  - Square root of the sum of squared differences
- **Manhattan Distance:**
  - Sum of absolute differences
  - Also called L1 distance
- **Minkowski Distance:**
  - General form of the distance formula
  - Special cases: $p=1 \rightarrow$ Manhattan, $p=2 \rightarrow$ Euclidean
- **Proximity Measure for Nominal Attributes:** Simple matching coefficient.
- **Measure for Ordinal Attributes:** Rank correlation (uses order/position).
- **Outlier:**
  - Value very different from the others
  - Indicates an unusual or rare condition
- **Types of Outliers:**
  - Global outlier: far from all other values (e.g., 999 in a height list)
  - Contextual outlier: abnormal only in a specific context (e.g., $20^\circ C$ in summer)
- **Hypothesis Testing (Note):**
  - Procedure to check a claim about a population
  - Uses sample data to accept or reject the hypothesis
- **Measures of Central Tendency:** Mean, Median, Mode.
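The three distance measures defined in this section reduce to one formula; a minimal sketch, with invented sample points:

```python
# The Minkowski distance generalizes Euclidean and Manhattan distance;
# the points x and y below are invented for illustration.

def minkowski(x, y, p):
    """Lp distance: (sum of |x_i - y_i|^p) ** (1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)   # p=2: straight-line distance

def manhattan(x, y):
    return minkowski(x, y, 1)   # p=1: sum of absolute differences (L1)

x, y = (1, 2, 3), (4, 6, 3)     # differs by 3 and 4 in two coordinates
```

For `x` and `y` above the Euclidean distance is the 3-4-5 right triangle, while Manhattan adds the legs directly.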
- **Data Cube Aggregation:**
  - Summarizing data across multiple dimensions
  - Used in OLAP for fast analysis
  - Converts detailed data into a higher-level grouped form
- **Outlier Detection Methods:**
  - Identify abnormal values in a dataset
  - Methods: Z-score, IQR, visualization
  - Help improve model accuracy
- **Measures of Data Dispersion:**
  - Range: max – min
  - Variance: average squared deviation from the mean
  - Standard deviation: square root of variance
- **Feature Extraction:**
  - Selecting meaningful features from raw data
  - Reduces complexity and improves model performance
  - Converts original data into informative attributes
- **Outlier Explanation:** A global outlier is extremely different from the rest (e.g., an income of ₹10 crore in a middle-class dataset). It needs removal or separate handling.

## Chapter 3: Data Processing

### Data Processing Definitions (1 Mark Questions)
- **Types of Attributes:** Nominal, Ordinal, Binary, Numeric.
- **Data Object:** A single record or entity in a dataset.
- **Methods of Feature Selection:** Filter, Wrapper, Embedded.
- **Missing Values:** Values not recorded in the dataset.
- **Nominal Attribute:** Attribute with categories but no order.
- **Data Transformation:** Converting data into a suitable format.
- **One-Hot Encoding:** Converting categories into binary columns.
- **Data Quality:** Accuracy and reliability of data.
- **Outlier:** A data value far different from the others.
- **Interquartile Range:** $IQR = Q3 - Q1$.
- **Data Cleaning:** Removing errors and inconsistencies from data.
- **Primary Data:** First-hand data collected directly.
- **Attribute:** A property/feature of a data object.
- **Binary Attribute:** Attribute with two values (yes/no).
- **Ordinal Attribute Example:** Education level (High/Medium/Low).
- **Numeric Attribute:** Attribute with number values.
- **Discrete Data:** Countable numeric values.
- **Missing Data:** Data not recorded or unavailable.
- **Data Normalization:** Scaling values to a common range.
- **Data Discretization:** Converting numeric data into bins/categories.
- **Noisy Data:** Data with errors or random values; reduces accuracy.
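The two outlier-detection methods named above (IQR and Z-score) can be sketched in a few lines; the height list, including the 999 global outlier, is invented:

```python
# IQR-based and Z-score-based outlier detection, matching the methods
# listed above; the heights list is an invented example.
import statistics

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Flag values whose |z-score| = |(x - mean) / SD| exceeds the threshold."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs((v - mu) / sd) > threshold]

heights = [150, 155, 160, 158, 162, 157, 999]   # 999 is a global outlier
```

On this tiny list the IQR rule flags 999 immediately, while the Z-score rule with the usual threshold of 3 misses it, because the outlier itself inflates the SD; this masking effect is one reason the IQR method is preferred for small samples.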
- **Data Cube:** Multidimensional representation of data, used in OLAP.
- **Purpose of Data Processing:** Improve data quality; make data ready for analysis.

### Data Processing Concepts (2-4 Mark Questions)
- **Causes of Noisy Data:**
  - Random errors
  - Faulty sensors, manual typing mistakes
- **Methods for Missing Values:**
  - Replace with mean/median
  - Remove rows with too many missing values
- **Data Objects & Attribute Types:**
  - Data object = a single record
  - Attribute types: Nominal (gender), Numeric (age)
- **Discrete vs. Continuous Attributes:**
  - Discrete: countable values (number of students)
  - Continuous: infinite numeric values (height)
- **Duplicate Data:** The same record repeated (e.g., the same customer stored twice).
- **One-Hot Encoding:** Converts categories into binary columns (e.g., Red/Blue $\rightarrow [1,0], [0,1]$).
- **Label Encoding:** Assigns numeric labels to categories (e.g., Red=1, Blue=2).
- **Nominal vs. Ordinal Attributes:**
  - Nominal: categories without order
  - Ordinal: categories with ranking
- **Common Causes of Missing Values:**
  - Human input errors
  - Device/sensor failure
- **Duplicate Entries Example:** The same row stored more than once (e.g., two identical student records).
- **Binarization:** Converting a numeric value into binary form (e.g., Marks $>50 = 1$, else $0$).
- **Data Standardization:**
  - Brings variables to the same scale
  - Improves model performance
- **Problems from Irregular Formatting:**
  - Wrong column grouping
  - Incorrect data reading by software
- **Data Reduction:** Reducing data size (e.g., sampling 10 rows from 100).
- **Data Transformation Techniques:**
  - Normalization: scales values between 0 and 1
  - Standardization: converts data to mean 0, SD 1
  - Encoding: converts categories to numbers
- **Data Attributes & Types:**
  - Data attribute = property of an object
  - Nominal: color
  - Ordinal: rating (low/medium/high)
  - Numeric: age
- **Data Quality Factors:**
  - Accuracy and reliability of the dataset
  - Affected by missing values, duplicates, and noise
- **One-Hot vs. Label Encoding:**
  - One-hot: binary columns (Red $\rightarrow [1,0]$, Blue $\rightarrow [0,1]$)
  - Label: numeric labels (Red=1, Blue=2)
- **Importance of Data Quality:**
  - Better decision-making
  - Improves model accuracy
  - Reduces prediction error
- **Types & Handling of Missing Values:**
  - Types: MCAR, MAR, MNAR
  - Handling: imputation or row removal
- **Need for Data Transformation:**
  - Makes data suitable for analysis
  - Improves model accuracy
  - Solves scaling and format issues
- **Problems from Outliers:**
  - Skew the mean and variance
  - Reduce model accuracy
  - Cause wrong predictions
- **Steps to Handle Missing Values:**
  - Detect missing values
  - Replace using mean/median or delete rows
  - Validate the updated dataset
- **Incompatible Datetime Issues:** Dates stored in different formats cause mismatches (e.g., 12/06/2024 vs 06-12-2024).
- **Formatting Issues:** Extra whitespace, inconsistent delimiters, and invalid characters cause wrong splitting and broken columns, and can prevent files from being read.
- **Data Discretization & Benefits:**
  - Converts numeric data into bins/groups
  - Reduces complexity
  - Improves pattern recognition
- **Data Transformation Strategies:** Normalization, encoding, scaling.
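The encoding and mean-imputation ideas above can be sketched in plain Python (real pipelines usually use pandas or scikit-learn); the color list and helper names are invented, and the one-hot column order here is alphabetical rather than the Red-first ordering of the example:

```python
# Plain-Python sketch of label encoding, one-hot encoding, and mean
# imputation; categories are ordered alphabetically for determinism.

def label_encode(values):
    """Assign an integer code to each category (e.g. Blue=0, Red=1)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Turn each category into a binary row with one column per category."""
    categories = sorted(set(values))
    return [[1 if v == cat else 0 for cat in categories] for v in values]

def impute_mean(values):
    """Replace None (missing) entries with the mean of the present values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

colors = ["Red", "Blue", "Red"]
codes, mapping = label_encode(colors)   # [1, 0, 1] with Blue=0, Red=1
onehot = one_hot_encode(colors)         # one binary column per color
filled = impute_mean([10, None, 20])    # missing value becomes 15.0
```

Label encoding implies an ordering (Red=1 > Blue=0), which is why one-hot encoding is preferred for nominal attributes.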
- **Data Reduction & Data Cube Aggregation:**
  - Data reduction = reducing dataset size
  - Data cube aggregation = summarizing data across dimensions
- **Types of Attributes (detailed):**
  - Nominal: color
  - Binary: male/female
  - Ordinal: high/medium/low
  - Numeric: age/salary
- **Techniques for Cleaning Noisy Data:**
  - Binning
  - Regression
  - Outlier removal
  - Smoothing
- **Data Transformation Techniques (detailed):**
  - Rescaling: adjusting the range
  - Normalization: scale 0-1
  - Standardization: mean 0, SD 1
  - Label encoding: assign numbers
  - One-hot encoding: binary columns
- **Data Reduction Types:**
  - Dimensionality reduction: reduce attributes
  - Numerosity reduction: reduce records
  - Compression: store in a compact format
- **Missing Data Causes & Handling:**
  - Causes: sensor failure, manual errors
  - Handling: imputation, deletion, interpolation
- **Importance of Data Preprocessing:**
  - Removes noise and inconsistencies
  - Enhances model training and accuracy
  - Helps better decision-making

## Chapter 4: Data Visualization

### Visualization Basics (1 Mark Questions)
- **Data Transformation:** Changing data into a suitable format.
- **Tools for Geospatial Data:** QGIS, ArcGIS.
- **Python Libraries for Data Analysis:** NumPy, Pandas.
- **Bubble Plot Use:** Shows three variables using bubble size.
- **Data Visualization:** Presenting data using charts or graphs.
- **Tag Cloud:** Visual display of word frequency.
- **Visual Encoding:** Representing information using size, color, or shape.
- **Exploratory Data Analysis (EDA):** Initial study of data using graphs and statistics.
- **Histogram Use:** Shows the distribution of numeric data.
- **Scatter Plot:** Shows the relationship between two numeric variables.
- **Pie Chart:** Circular chart showing percentage share.
- **Boxplot:** Chart showing spread, median, and outliers.
- **Heat Map:** Graph showing values with colors.
- **Word Cloud:** Plot showing frequently used words.
- **Bar Chart:** Plot that compares categories using bars.

### Visualization Concepts (2-4 Mark Questions)
- **Word Cloud:** Shows the most repeated words visually; larger words = higher frequency.
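The word-cloud idea (larger word = higher frequency) reduces to counting word occurrences; a minimal sketch with `collections.Counter` on an invented review snippet:

```python
# Word frequencies drive word-cloud sizing; the review text is invented.
from collections import Counter
import re

text = "Great phone, great battery. Battery life is great."
words = re.findall(r"[a-z]+", text.lower())   # lowercase and tokenize
freq = Counter(words)                          # word -> count
top = freq.most_common(2)                      # most frequent words first
```

A word-cloud library would then draw each word at a size proportional to its count, so "great" here would render largest.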
- **Data Visualization & Libraries:** Representing data using graphs; Matplotlib, Seaborn.
- **Purpose of Data Visualization:**
  - Understand patterns and trends
  - Supports better decision-making
- **Bar Chart vs. Histogram:**
  - Bar chart: compares categories
  - Histogram: shows data distribution
- **Visual Encoding in Data Visualization:** Representing information using color, position, and size; helps highlight data patterns.
- **Popular Python Visualization Libraries:** Matplotlib, Seaborn.
- **Uses of Line Charts:**
  - Show trends over time
  - Compare values across periods
- **Bubble Plot:** Scatter plot in which bubble size represents a third variable; used to show multi-variable relationships.
- **Dendrogram:** Tree-like diagram used in clustering; shows the grouping/merging of data points.
- **Treemap:** Displays hierarchical data using nested rectangles.
- **Pie Chart vs. Donut Chart:**
  - Pie: fully filled circle
  - Donut: circle with a hole in the center
- **Geospatial Data:** Data linked to a location on Earth (e.g., GPS coordinates).
- **EDA Objectives:** Initial analysis of data; detect patterns and outliers.
- **Data Visualization Tools:** Tableau, Power BI, Matplotlib, Plotly.
- **Visualizing Geospatial Data:**
  - Use latitude-longitude based maps
  - Tools: QGIS, ArcGIS, Folium
  - Shows population, roads, climate, etc.
- **Venn Diagram:** Shows common and unique items between groups using overlapping circles (e.g., students liking Maths vs. Science).
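The three regions of a two-circle Venn diagram correspond directly to set operations; a minimal sketch with invented student names:

```python
# The overlap and unique regions of a Venn diagram, computed with
# Python sets; the student names are invented for illustration.
maths = {"Asha", "Ravi", "Meera", "John"}
science = {"Ravi", "Meera", "Kiran"}

both = maths & science          # overlap region of the two circles
only_maths = maths - science    # region unique to the Maths circle
only_science = science - maths  # region unique to the Science circle
```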
- **Role of EDA & Steps:**
  - Understand data before modeling
  - Steps: summary statistics, visualization, outlier detection
  - Examples: histogram, correlation plot
- **Data Visualization Libraries & Features:**
  - Matplotlib: basic plotting
  - Seaborn: statistical plotting
  - Plotly: interactive graphs
- **Chart Types:**
  - Histogram: data distribution
  - Bar chart: category comparison
  - Pie chart: percentage share
  - Scatter plot: relation between two variables
- **Advanced Plots:**
  - Heat map: data values with colors
  - Bubble plot: scatter with bubble size
  - Boxplot: spread and outliers
  - Dendrogram: hierarchical clustering structure
- **Importance of Visual Encoding:**
  - Helps highlight important information
  - Channels: color, position, shape, size
- **Comparing Charts:**
  - Pie: whole circle divided into shares
  - Donut: circle with a hole
  - Treemap: nested rectangles
  - Area: filled line chart for trends
- **3D Scatter Plots:**
  - Show the relation between three variables
  - Useful for multivariable comparison
  - Example: population vs. income vs. education
- **Word Cloud Creation & Applications:**
  - Count word frequency and display it visually
  - Bigger size = more frequent
  - Used for text analytics and reviews
- **Specialized Visualization Tools:**
  - Heat maps: large numeric table analysis
  - Treemaps: hierarchy representation
  - Network graphs: relationship discovery
  - 3D plots: multi-dimensional visualization
- **Importance of Data Visualization in EDA:**
  - Makes data easier to understand
  - Shows patterns and trends
  - Examples: histogram, scatter plot
- **Components of Visual Encoding:**
  - Position: shows the location of values
  - Color: highlights differences
  - Size: shows magnitude/importance
- **Boxplots: Use & Interpretation:**
  - Show median, quartiles, and spread
  - Detect outliers
  - Help compare multiple groups
- **Heatmap Purpose & Example:** Shows data values using colors (e.g., a correlation heatmap).
- **Scatter Plot Use:** Shows the relationship between two variables; used to detect correlation and trends.
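The correlation a scatter plot shows visually can be quantified with Pearson's r; a minimal plain-Python sketch, with invented hours/marks data:

```python
# Pearson correlation coefficient computed from its raw-sums form:
# r = [N*ΣXY - ΣX*ΣY] / sqrt([N*ΣX² - (ΣX)²][N*ΣY² - (ΣY)²])
import math

def pearson_r(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

hours = [1, 2, 3, 4, 5]
marks = [2, 4, 6, 8, 10]   # exactly marks = 2*hours, a perfect linear trend
```

A perfectly linear upward scatter gives r = 1, a perfectly linear downward one gives r = −1, and values near 0 indicate no linear trend.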
- **Area Plots, Line Charts, Bar Charts Compared:**
  - Area: shows total change over time
  - Line: tracks a trend over time
  - Bar: compares categories
- **Dendrograms in Hierarchical Clustering:** Tree-like chart showing the merging of clusters; used to visualize hierarchical grouping.
- **Bubble Charts Enhancing Scatter Plots:** Add a third variable via bubble size; show more information in one graph.
- **Word Clouds in Text Data Analysis:** Display frequent words visually; useful for analyzing tweets, reviews, and feedback.
- **Geospatial Visualizations:**
  - Choropleth map: region color map
  - Heat map: location-based intensity
  - Pin map: marking specific points

## Formula Reference

### Statistics Formulas
- Mean: $(\sum X) / N$
- Median: middle value after sorting
- Mode: most repeated value
- Range: Max – Min
- Variance (sample): $\sum(X - \bar{X})^2 / (N - 1)$
- Variance (population): $\sum(X - \mu)^2 / N$
- Standard Deviation: $\sqrt{\text{Variance}}$
- Coefficient of Variation: $(\text{SD} / \text{Mean}) \times 100$
- Quartiles: Q1 = 25%, Q2 = 50%, Q3 = 75%
- Interquartile Range (IQR): $Q3 - Q1$
- Z-Score: $(X - \mu) / \text{SD}$

### Grouped Data Formulas
- Mean: $\sum(f \times x) / \sum f$
- Variance: $[\sum(f \times x^2) / \sum f] - (\text{Mean})^2$
- Midpoint (class mark): $(\text{Upper} + \text{Lower}) / 2$

### Probability & Distributions
- Probability: favourable outcomes / total outcomes
- Addition Rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- Multiplication Rule (independent events): $P(A \cap B) = P(A) \times P(B)$
- Conditional Probability: $P(A|B) = P(A \cap B) / P(B)$
- Bayes' Theorem: $P(A|B) = [P(B|A)P(A)] / P(B)$

### Correlation & Regression
- Correlation (r): $r = [N\sum XY - (\sum X)(\sum Y)] / \sqrt{(N\sum X^2 - (\sum X)^2)(N\sum Y^2 - (\sum Y)^2)}$
- Slope (b) of the regression line: $b = [N\sum XY - (\sum X)(\sum Y)] / [N\sum X^2 - (\sum X)^2]$
- Regression line: $Y = a + bX$, where $a = \bar{Y} - b\bar{X}$
- Coefficient of Determination: $R^2 = (\text{Correlation})^2$

### Hypothesis Testing
- Z-Test: $(\bar{X} - \mu) / (\sigma / \sqrt{n})$
- T-Test: $(\bar{X} - \mu) / (s / \sqrt{n})$
- Standard Error (SE): $\text{SD} / \sqrt{n}$
- Confidence Interval: $\bar{X} \pm Z \times (\text{SD}/\sqrt{n})$
- Errors: Type I = rejecting a true H0; Type II = accepting a false H0

### Distance Formulas
- Euclidean Distance: $\sqrt{\sum(x_i - y_i)^2}$
- Manhattan Distance: $\sum|x_i - y_i|$
- Minkowski Distance: $(\sum|x_i - y_i|^p)^{1/p}$; special cases: $p=1 \rightarrow$ Manhattan, $p=2 \rightarrow$ Euclidean

### Classification Metrics (Confusion Matrix)
- TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative
- Accuracy: $(\text{TP} + \text{TN}) / (\text{TP} + \text{TN} + \text{FP} + \text{FN})$
- Precision: $\text{TP} / (\text{TP} + \text{FP})$
- Recall (Sensitivity): $\text{TP} / (\text{TP} + \text{FN})$
- Specificity: $\text{TN} / (\text{TN} + \text{FP})$
- F1-Score: $2 \times (\text{Precision} \times \text{Recall}) / (\text{Precision} + \text{Recall})$
- Error Rate: $(\text{FP} + \text{FN}) / \text{Total}$

### Clustering
- K-Means objective: minimize $\sum (\text{distance from point to its cluster center})^2$
- New centroid: mean of all points in the cluster

### Normalization / Standardization
- Min-Max Normalization: $(X - \text{Min}) / (\text{Max} - \text{Min})$
- Z-Score Standardization: $(X - \text{Mean}) / \text{SD}$

### Sampling
- Sampling Error: difference between the sample mean and the population mean
- Sample Mean: $\sum X / n$
- Sample Proportion: number of successes / sample size
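The classification-metric formulas can be checked directly from confusion-matrix counts; a minimal sketch, with invented counts:

```python
# Confusion-matrix metrics computed from example counts; the counts
# tp/tn/fp/fn themselves are invented for illustration.
tp, tn, fp, fn = 40, 30, 10, 20
total = tp + tn + fp + fn

accuracy = (tp + tn) / total            # fraction of correct predictions
precision = tp / (tp + fp)              # of predicted positives, how many real
recall = tp / (tp + fn)                 # of real positives, how many found
specificity = tn / (tn + fp)            # of real negatives, how many found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
error_rate = (fp + fn) / total          # 1 - accuracy
```

The F1-score is the harmonic mean of precision and recall, so it stays low unless both are high, which is why it is preferred over accuracy on imbalanced data.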