# Data & Business Analytics Cheatsheet
# PAGE 1

### Descriptive Statistics & ABC/Pareto Analysis
- **Categorical Variates:**
  - **Nominal:** No ordering (e.g., male/female).
  - **Ordinal:** Can be ordered (e.g., low/medium/high).
  - Summarize with counts/percentages. Visualize with bar/pie charts.
- **ABC Analysis (Data Segmentation):** Categorizes subjects into three groups based on cumulative contribution.
  - **A-category:** Top categories accounting for ~80% of data.
  - **B-category:** Next ~15% of data.
  - **C-category:** Bottom ~5% of data.
- **Pareto Charting:** Visual representation combining a bar chart (individual values) and a line graph (cumulative percentages) to separate the "vital few" from the "trivial many."

#### Micro-Questions (Descriptive Stats)
1. **Q:** Products A, B, C have sales of $1000, $500, and $200 respectively. What is the cumulative percentage for product A? **A:** $1000 / (1000+500+200) = 1000/1700 \approx 58.82\%$

### Linear Regression (LM)
- **Model:** $y = \beta_0 + \beta_1x_1 + ... + \beta_px_p + \epsilon$
  - $y$: Dependent variable (continuous).
  - $x_i$: Independent variables.
  - $\beta_i$: Coefficients.
  - $\epsilon$: Error term (assumed normally distributed with mean 0).
- **Objective:** Choose a function $f$ to minimize the Sum of Squared Errors (SSE).
- **Residuals (errors):** $\epsilon_i = y_i - \hat{y}_i$
- **SSE:** $\sum_{i=1}^n (y_i - \hat{y}_i)^2$ (smaller is better).
- **Total Sum of Squares (SST):** $\sum_{i=1}^n (y_i - \bar{y})^2$
- **Coefficient of Determination (R²):** Proportion of the variance in $y$ explained by the model.
  - $R^2 = 1 - \frac{SSE}{SST}$ (closer to 1 is better).
- **Adjusted R² (R²adj):** Penalizes adding more independent variables.
  - $R^2_{adj} = R^2 - \frac{p}{n-p-1}(1-R^2)$ (closer to 1 is better).
- **Interpretation of `summary(lm)` in R:**
  - **Coefficients:** Estimates for $\beta_i$.
  - **Std. Error:** Standard error of the estimates.
  - **t value:** t-statistic for the hypothesis test $\beta_i = 0$.
  - **Pr(>|t|):** p-value.
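The SSE, SST, R², and adjusted-R² formulas above can be checked by hand; a minimal Python sketch (the data and `p = 1` predictor count are made up for illustration):

```python
# Toy data: actual y and fitted y-hat from a model with p = 1 predictor
y     = [3.0, 5.0, 7.0, 9.0]
y_hat = [3.2, 4.8, 7.1, 8.9]
n, p = len(y), 1

y_bar = sum(y) / n
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares

r2 = 1 - sse / sst                                     # R^2 = 1 - SSE/SST
r2_adj = r2 - (p / (n - p - 1)) * (1 - r2)             # adjusted R^2, same form as above

print(round(sse, 3), round(sst, 3), round(r2, 4), round(r2_adj, 4))
```

Note that `r2_adj` is always at most `r2`; the gap grows as more predictors are added for a fixed sample size.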
- Small p-value (< 0.05) indicates the predictor is statistically significant (reject $\beta_i = 0$).

#### Micro-Questions (Linear Regression)
1. **Q:** `summary(lm)` reports Pr(>|t|) for $x_2$ as 0.001. Is $x_2$ a significant predictor at $\alpha = 0.05$? **A:** Yes, because $0.001 < 0.05$.

### Logistic Regression (GLM Binomial)
- **Purpose:** Predict the probability ($p$) of a binary outcome (0 or 1).
- **Problem with Linear Regression for Probability:** Predicted values can fall outside [0, 1].
- **Logistic Function (Sigmoid):** $p = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + ... + \beta_mx_m)}}$
  - Output is always between 0 and 1.
- **Odds:** Ratio of probability of success to probability of failure.
  - $\text{Odds} = \frac{p}{1-p}$
  - $p = \frac{\text{Odds}}{1+\text{Odds}}$
- **Logit Transformation:** Linearizes the model.
  - $\text{Logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1x_1 + ... + \beta_mx_m$
- **Odds Ratio (OR):** $e^{\beta_i}$ is the multiplicative change in the odds for a one-unit increase in $x_i$, holding other variables constant.
  - If $e^{\beta_i} > 1$, odds increase. If $e^{\beta_i} < 1$, odds decrease.
- **z value and Pr(>|z|):** Used instead of the t-value for significance.
- **Deviance Residuals:** Used to assess model fit.
- **AIC (Akaike Information Criterion):** $AIC = 2k - 2\ln(L)$ (where $k$ is the number of parameters, $L$ is the maximized likelihood).
  - Lower AIC is better when comparing models on the same dataset. Penalizes more parameters and rewards better fit.
- **Categorical Variables (Factors) in R:** Use `factor()` to convert categorical variables.
- **Grouped Binomial Data:** When $y$ is a proportion or count of successes out of $n$ trials.
  - `glm(cbind(successes, failures) ~ predictors, family = binomial, data = mydata)`
  - OR `glm(proportion ~ predictors, weights = total_trials, family = binomial, data = mydata)`

#### R Template (Logistic Regression)
```R
# Fit logistic model for binary outcome (y, x1, x2 are placeholder names)
model_binary <- glm(y ~ x1 + x2, family = binomial, data = mydata)
summary(model_binary)    # coefficients, z values, Pr(>|z|), AIC
exp(coef(model_binary))  # odds ratios
```

### Time Series Forecasting
- **Time Series:** Sequence of values $x_1, x_2, ..., x_N$.
- **Goal:** Predict future values based on past values or correlations.
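Two of the simplest forecasting rules, the moving average and exponential smoothing, can be sketched in a few lines of Python (the series, window $s$, and smoothing parameter $\alpha$ are arbitrary illustrative choices):

```python
# Toy series; window s and smoothing parameter alpha are illustrative choices
x = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
s, alpha = 3, 0.3

# Moving average: next forecast = mean of the last s observations
ma_forecast = sum(x[-s:]) / s          # (13 + 12 + 14) / 3 = 13.0

# Exponential smoothing: x-hat_{n+1} = alpha*x_n + (1-alpha)*x-hat_n,
# seeded here with the first observation as the initial forecast
f = x[0]
for obs in x:
    f = alpha * obs + (1 - alpha) * f  # f is now the forecast for the next period

print(ma_forecast, round(f, 3))
```

The moving average weights the last $s$ points equally; exponential smoothing weights all past points with geometrically decaying weights controlled by $\alpha$.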
- **Forecast Accuracy Metrics:**
  - **WAPE (Weighted Absolute Percent Error):** $WAPE = 100 \times \frac{\sum_{i=1}^{N} |x_i - \hat{x}_i|}{\sum_{i=1}^{N} x_i}$
  - **WAPA (Weighted Absolute Percent Accuracy):** $WAPA = 100 - WAPE$.
- **Moving Average (MA):** Forecast is the average of the last $s$ observations.
  - $\hat{x}_{n+1} = \frac{1}{s} \sum_{i=0}^{s-1}x_{n-i}$
- **Weighted Moving Average:** Assigns different weights ($w_i$) to past observations, $\sum w_i = 1$.
  - $\hat{x}_{n+1} = \sum_{i=0}^{s-1} w_i x_{n-i}$. Weights can be optimized (e.g., to minimize SSE).
- **Exponential Smoothing (Holt 1):** Forecast is a weighted average of the most recent observation and the previous forecast.
  - $\hat{x}_{n+1} = \alpha x_n + (1-\alpha)\hat{x}_n$
  - $\alpha$: Smoothing parameter ($0 < \alpha < 1$); larger $\alpha$ puts more weight on the most recent observation.

# PAGE 2

### SQL & Database Design/Joins
- **RDBMS:** Stores data in tables and defines relationships between them.
- **Table Components:**
  - **Fields:** Fixed number of columns, each with a data type (e.g., INTEGER, TEXT).
  - **Records:** Unlimited number of rows.
- **Keys:**
  - **Primary Key (PK):** Uniquely identifies each record in a table. Must be unique and non-NULL.
  - **Foreign Key (FK):** A field in one table that refers to the Primary Key of another table, establishing a link.
- **Junction (Join) Table:** Resolves many-to-many relationships between two tables by holding both tables' PKs as FKs (usually forming a composite PK).
- **Database Schema:** Map of tables, fields, and relationships.
- **SQL Query Clause Order:**
  - `SELECT [DISTINCT] column_list`
  - `FROM table_name`
  - `[JOIN other_table ON join_condition]`
  - `[WHERE filter_condition]`
  - `[GROUP BY column_list]`
  - `[HAVING group_filter_condition]`
  - `[ORDER BY column_list [ASC|DESC]]`
  - `[LIMIT number]`
- **Aggregate Functions:** `SUM()`, `COUNT()`, `MAX()`, `MIN()`, `AVG()`.
  - `COUNT(*)`: Counts all rows, including those with NULLs.
  - `COUNT(column)`: Counts non-NULL values in a column.
  - `COUNT(DISTINCT column)`: Counts unique non-NULL values.
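The three `COUNT` variants can be checked with Python's built-in `sqlite3` module (the table and its values are made up for illustration):

```python
import sqlite3

# In-memory database with a nullable column to contrast the COUNT variants
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, City TEXT)")
con.executemany("INSERT INTO Orders (City) VALUES (?)",
                [("Berlin",), ("Berlin",), ("Paris",), (None,)])

row = con.execute("""
    SELECT COUNT(*),              -- all rows, NULLs included
           COUNT(City),           -- non-NULL cities only
           COUNT(DISTINCT City)   -- unique non-NULL cities
    FROM Orders
""").fetchone()
print(row)  # (4, 3, 2)
```

Four rows total, one NULL city, two distinct city names, hence the three different counts from the same table.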
- **JOIN Types:**
  - **INNER JOIN:** Returns records with matching values in both tables. Unmatched records are NOT returned.
  - **LEFT JOIN:** Returns all records from the left table, plus matched records from the right table. Unmatched right-table columns are NULL.
  - **RIGHT JOIN:** Returns all records from the right table, plus matched records from the left table. Unmatched left-table columns are NULL. (Not universally supported; SQLite added it only in version 3.39.)
  - **FULL OUTER JOIN:** Returns all records when there is a match in either the left or right table. (Not universally supported.)
- **Common Traps:**
  - **LEFT JOIN + WHERE filter trap:** A `WHERE` condition on a right-table column in a `LEFT JOIN` silently converts it into an `INNER JOIN`, because the NULLs of unmatched rows fail the filter. Put the condition in the `ON` clause if you want to keep unmatched left-table rows.
  - **`HAVING` vs `WHERE`:** `WHERE` filters individual rows *before* grouping; `HAVING` filters groups of rows *after* aggregation.

#### SQL Templates
1. **Select specific columns:** `SELECT ProductName, UnitPrice FROM Products;`
2. **Filter rows:** `SELECT * FROM Orders WHERE Quantity > 10;`
3. **Group by and aggregate:** `SELECT City, COUNT(OrderID) FROM Customers GROUP BY City;`
4. **Group by and filter groups:** `SELECT City, COUNT(OrderID) FROM Customers GROUP BY City HAVING COUNT(OrderID) > 5;`
5. **Order results:** `SELECT ProductName FROM Products ORDER BY UnitPrice DESC;`
6. **Rename column/table:** `SELECT SUM(Price) AS TotalSales FROM Products AS P;`
7. **INNER JOIN:** `SELECT * FROM A INNER JOIN B ON A.ID = B.ID;`
8. **LEFT JOIN:** `SELECT * FROM A LEFT JOIN B ON A.ID = B.ID;`
9. **Junction Table Example (Many-to-Many):** `SELECT C.CustomerName, P.ProductName FROM Customers C JOIN Orders O ON C.CustomerID = O.CustomerID JOIN OrderDetails OD ON O.OrderID = OD.OrderID JOIN Products P ON OD.ProductID = P.ProductID;`
10. **Subquery (example):** `SELECT ProductName FROM Products WHERE CategoryID IN (SELECT CategoryID FROM Categories WHERE CategoryName = 'Beverages');`

#### Micro-Questions (SQL & DB Design)
1. **Q:** What is the primary purpose of a junction table in database design? **A:** To resolve many-to-many relationships between two tables by creating two one-to-many relationships.
2. **Q:** Write an SQL query to get the `ProductName` and `UnitPrice` for products with a `CategoryID` of 5. **A:** `SELECT ProductName, UnitPrice FROM Products WHERE CategoryID = 5;`
3. **Q:** Explain the difference between `WHERE` and `HAVING`. **A:** `WHERE` filters individual rows *before* grouping, while `HAVING` filters groups of rows *after* aggregation.
4. **Q:** If a query uses an `INNER JOIN` and the joined table has no matching record for a row in the first table, will that row appear in the result? **A:** No, `INNER JOIN` returns only matching records from both tables.
5. **Q:** Write an SQL query to find the total sales for each product category from the 'Products' and 'Categories' tables, only for categories with total sales over $1000. **A:** `SELECT C.CategoryName, SUM(P.Price) AS TotalSales FROM Products P INNER JOIN Categories C ON P.CategoryID = C.CategoryID GROUP BY C.CategoryName HAVING SUM(P.Price) > 1000;` (Some dialects also accept the alias `TotalSales` in `HAVING`, but repeating the aggregate is portable.)

### Machine Learning
- **Learning Problem:** Find a function $f$ such that $Y = f(X) + \epsilon$.
  - **Prediction:** Estimate $Y$ for new $X$ using $\hat{Y} = \hat{f}(X)$.
  - **Inference:** Understand relationships between $Y$ and $X$.
- **Supervised Learning:** Learns from labeled data (input $X$ and output $Y$ are known).
  - **Regression:** Predicts continuous $Y$ (e.g., house price).
  - **Classification:** Predicts categorical $Y$ (e.g., spam/not spam).
- **Unsupervised Learning:** Learns from unlabeled data (only $X$ is known).
  - **Clustering:** Groups similar data points (e.g., customer segmentation).
  - **Dimensionality Reduction:** Reduces the number of features.
- **Train/Test Split:** Divide data into a training set (build the model) and a test set (evaluate performance on unseen data).
- **Overfitting:** Model performs well on training data but poorly on test data.
- **Bias-Variance Trade-off:**
  - **High Bias:** Model is too simple; underfits.
  - **High Variance:** Model is too complex; overfits.

#### Classification Metrics (Confusion Matrix)

| | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |

- **Accuracy:** $(TP + TN) / (TP + TN + FP + FN)$
- **Precision:** $TP / (TP + FP)$
- **Recall (Sensitivity):** $TP / (TP + FN)$
- **F1-Score:** $2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

#### Decision Trees
- **Purpose:** Predictive models using a tree-like graph of decisions, mainly for categorical responses.
- **Recursive Greedy Algorithm:**
  1. Start with an empty tree.
  2. Select the feature to split on (e.g., minimize misclassification error, Gini impurity, or entropy).
  3. Stop if all data in a node belongs to one category or all features have been split.
  4. Otherwise, continue splitting.
- **Splitting Continuous Features:** Choose the threshold with the lowest misclassification error.

##### R Template (Decision Trees)
```R
# Install and load rpart package
install.packages("rpart")
library(rpart)

# Fit classification tree (response and data names are placeholders)
tree.model <- rpart(y ~ ., data = train, method = "class")
```
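Threshold selection for a continuous feature, as described above, can be sketched in Python: try the midpoints between consecutive observations and keep the one with the lowest misclassification error (the tiny dataset is made up for illustration):

```python
# Toy 1-D classification data: one continuous feature, two classes
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0,   0,   0,   1,   1,   1]

def misclassification(threshold):
    """Error rate when each side of the split predicts its majority class."""
    left  = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    errors = 0
    for side in (left, right):
        if side:
            majority = max(set(side), key=side.count)
            errors += sum(1 for y in side if y != majority)
    return errors / len(xs)

# Candidate thresholds: midpoints between consecutive sorted observations
candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
best = min(candidates, key=misclassification)
print(best)  # 3.5 separates the two classes perfectly (error 0)
```

The same greedy scan is what a tree learner repeats at every node, optionally scoring splits with Gini impurity or entropy instead of raw misclassification error.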