Question 1: Chi-Square Goodness-of-Fit Test

Problem Description
A food service manager wants to know whether the stated distribution of where people consume takeout food (Home 53%, Car 19%, Work 14%, Other 14%) is still valid. A survey of 300 individuals yielded the following results:

Place  | Home | Car | Work | Other
Number | 142  | 57  | 51   | 50

Test at $\alpha = 0.01$.

Hypotheses:
$H_0$: The distribution is as stated ($p_{Home}=0.53, p_{Car}=0.19, p_{Work}=0.14, p_{Other}=0.14$).
$H_1$: The distribution is different from what is stated.

Observed Frequencies: $O = [142, 57, 51, 50]$

Expected Frequencies (Total $N = 300$):
$E_{Home} = 300 \times 0.53 = 159$
$E_{Car} = 300 \times 0.19 = 57$
$E_{Work} = 300 \times 0.14 = 42$
$E_{Other} = 300 \times 0.14 = 42$
$E = [159, 57, 42, 42]$

Chi-Square Test Statistic:
$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
$\chi^2 = \frac{(142-159)^2}{159} + \frac{(57-57)^2}{57} + \frac{(51-42)^2}{42} + \frac{(50-42)^2}{42}$
$\chi^2 = \frac{(-17)^2}{159} + 0 + \frac{9^2}{42} + \frac{8^2}{42}$
$\chi^2 = \frac{289}{159} + 0 + \frac{81}{42} + \frac{64}{42} \approx 1.8176 + 0 + 1.9286 + 1.5238 \approx 5.27$

Degrees of Freedom: $df = k - 1 = 4 - 1 = 3$

Critical Value: For $\alpha = 0.01$ and $df = 3$, $\chi^2_{critical} \approx 11.345$

Decision: Since $\chi^2_{calculated} (5.27) < \chi^2_{critical} (11.345)$, we fail to reject $H_0$.

R Code:
observed <- c(142, 57, 51, 50)
stated_probs <- c(0.53, 0.19, 0.14, 0.14)
chisq.test(observed, p = stated_probs)

Conclusion: At $\alpha = 0.01$, there is not enough evidence to conclude that the distribution of takeout food consumption places has changed.

Targeting Advertisements: The original distribution still holds, with Home being the most common place (53%). A fast-food restaurant should primarily target advertisements towards people consuming food at home.

Question 2: Chi-Square Test of Independence

Problem Description
Investigate whether a relationship exists between the month and year of tornado occurrences, based on the following data:

Month    | 2015 | 2014 | 2013 | 2012
January  | 26   | 41   | 87   | 97
February | 2    | 41   | 46   | 63
March    | 13   | 25   | 18   | 225

Test at $\alpha = 0.05$.

Hypotheses:
$H_0$: There is no relationship between the month and year of tornado occurrences (they are independent).
$H_1$: There is a relationship between the month and year of tornado occurrences (they are dependent).

Observed Frequencies (Contingency Table):
$O = \begin{pmatrix} 26 & 41 & 87 & 97 \\ 2 & 41 & 46 & 63 \\ 13 & 25 & 18 & 225 \end{pmatrix}$

Row Sums: $R_1=251, R_2=152, R_3=281$. Column Sums: $C_1=41, C_2=107, C_3=151, C_4=385$. Grand Total: $N=684$.

Expected Frequencies: $E_{ij} = \frac{R_i \times C_j}{N}$
$E_{11} = \frac{251 \times 41}{684} \approx 15.05$
$E_{12} = \frac{251 \times 107}{684} \approx 39.27$
... (the remaining $E_{ij}$ are computed the same way; see the sketch at the end of this question)

Chi-Square Test Statistic: $\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \approx 137.08$ for the table above.

Degrees of Freedom: $df = (rows - 1)(cols - 1) = (3-1)(4-1) = 2 \times 3 = 6$

Critical Value: For $\alpha = 0.05$ and $df = 6$, $\chi^2_{critical} \approx 12.592$

R Code:
tornado_data <- matrix(c(26, 41, 87, 97,
                          2, 41, 46, 63,
                         13, 25, 18, 225),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(c("January", "February", "March"),
                                       c("2015", "2014", "2013", "2012")))
chisq.test(tornado_data)

Decision: Since $\chi^2_{calculated} (\approx 137.08) > \chi^2_{critical} (12.592)$ and the p-value is very small, we reject $H_0$.

Conclusion: At $\alpha = 0.05$, there is sufficient evidence to conclude that a significant relationship exists between the month and year in which tornadoes occurred.
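As referenced above, a minimal sketch of the abbreviated expected-frequency step, reusing the tornado_data matrix from the R code:

# Expected frequencies E_ij = (row total x column total) / grand total
row_totals <- rowSums(tornado_data)
col_totals <- colSums(tornado_data)
N <- sum(tornado_data)
expected <- outer(row_totals, col_totals) / N
round(expected, 2)
# The same matrix is returned by chisq.test(tornado_data)$expected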
Question 3: Titanic Dataset Data Cleaning (R Script)

Dataset Download Link
https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv

1. Familiarize with the dataset

Load and Initial Examination:
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(VIM)    # For missing data visualization

# Load the dataset
titanic_df <- read.csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
                       stringsAsFactors = FALSE)
str(titanic_df)
head(titanic_df)

Identify Data Types for Conversion:
- Survived: needs to be a factor (0/1).
- Pclass: needs to be a factor (1/2/3).
- Sex: needs to be a factor.
- Embarked: needs to be a factor.
- Age, Fare, SibSp, Parch: should be numeric/integer.

Descriptive Statistics:
# Survival rates
mean(titanic_df$Survived)    # Overall survival rate
titanic_df %>% group_by(Sex) %>% summarise(SurvivalRate = mean(Survived))
titanic_df %>% group_by(Pclass) %>% summarise(SurvivalRate = mean(Survived))
# Age and Fare distributions
summary(titanic_df$Age)
summary(titanic_df$Fare)
# Frequency tables for categorical variables
table(titanic_df$Sex)
table(titanic_df$Pclass)
table(titanic_df$Embarked)
# Check PassengerId for uniqueness
length(unique(titanic_df$PassengerId)) == nrow(titanic_df)
# Examine sample Name and Ticket entries
head(titanic_df$Name)
head(titanic_df$Ticket)
# Initial visualizations
hist(titanic_df$Age, main = "Distribution of Age", xlab = "Age")
hist(titanic_df$Fare, main = "Distribution of Fare", xlab = "Fare")

2. Check for structural errors and inconsistencies

Duplicate PassengerIds and Names:
sum(duplicated(titanic_df$PassengerId))    # Should be 0
sum(duplicated(titanic_df$Name))           # Duplicate names are not necessarily errors, but good to know

Validate Value Ranges:
# Survived: 0 or 1
unique(titanic_df$Survived)
# Pclass: 1, 2, or 3
unique(titanic_df$Pclass)
# Age: 0-80, non-negative
sum(titanic_df$Age < 0 | titanic_df$Age > 80, na.rm = TRUE)    # Check for extreme ages
# SibSp, Parch, Fare: non-negative
sum(titanic_df$SibSp < 0)
sum(titanic_df$Parch < 0)
sum(titanic_df$Fare < 0, na.rm = TRUE)

Logical Inconsistencies:
- Passengers with Age = 0 or Fare = 0: investigate whether these are valid values or errors.
- Unusual family combinations (e.g., very large SibSp or Parch without a corresponding Fare).

Outlier Detection (Age and Fare):
boxplot(titanic_df$Age, main = "Boxplot of Age")
boxplot(titanic_df$Fare, main = "Boxplot of Fare")
# IQR method for Age
Q1_age <- quantile(titanic_df$Age, 0.25, na.rm = TRUE)
Q3_age <- quantile(titanic_df$Age, 0.75, na.rm = TRUE)
IQR_age <- Q3_age - Q1_age
outliers_age <- titanic_df$Age[!is.na(titanic_df$Age) &
                               (titanic_df$Age < (Q1_age - 1.5 * IQR_age) |
                                titanic_df$Age > (Q3_age + 1.5 * IQR_age))]
length(outliers_age)
# IQR method for Fare: same steps with Q1_fare, Q3_fare, IQR_fare and outliers_fare

Name and Ticket Field Patterns:
- Examine patterns in Name (e.g., "Lastname, Title. Firstname").
- Examine patterns in Ticket (e.g., prefixes, numeric strings).

3. Check and handle data type errors

Type Conversions:
# Convert Survived, Pclass and Sex to factors; trim whitespace in the text fields
titanic_df$Survived <- factor(titanic_df$Survived, levels = c(0, 1), labels = c("No", "Yes"))
titanic_df$Pclass   <- factor(titanic_df$Pclass, levels = c(3, 2, 1),
                              labels = c("3rd", "2nd", "1st"), ordered = TRUE)
titanic_df$Sex      <- factor(titanic_df$Sex)
titanic_df$Name     <- trimws(titanic_df$Name)
titanic_df$Ticket   <- trimws(titanic_df$Ticket)
titanic_df$Cabin    <- trimws(titanic_df$Cabin)
# Embarked is converted to a factor after its missing values are handled in step 4

4. Develop and implement a missing data strategy

Calculate Percentage of Missing Values:
# In this file, missing Cabin/Embarked values are stored as empty strings; treat them as NA first
titanic_df$Cabin[titanic_df$Cabin == ""]       <- NA
titanic_df$Embarked[titanic_df$Embarked == ""] <- NA
missing_pct <- colSums(is.na(titanic_df)) / nrow(titanic_df) * 100
missing_pct[missing_pct > 0]    # Expected: Age (~20%), Cabin (~77%), Embarked (2 cases)

Missing Data Visualization:
aggr(titanic_df, numbers = TRUE, prop = FALSE, cex.axis = .7, main = "Missing Data Patterns")

Imputation Strategies:

Age (~20% missing):
Choice: Predictive imputation using the title extracted from Name. This approach leverages existing information (the title), which is highly correlated with age, and provides more accurate estimates than simple median imputation. It is also less complex than multiple imputation for a single variable.
Justification: Titles like 'Master' (young boys), 'Miss' (unmarried females, often younger), 'Mr' (adult males), and 'Mrs' (married females) provide strong cues for age estimation, leading to more realistic imputed values.
# Extract Title from Name (format "Lastname, Title. Firstname")
titanic_df$Title <- gsub("(.*, )|(\\..*)", "", titanic_df$Name)
table(titanic_df$Title)
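A minimal sketch of the title-based imputation itself, using the Age_Imputed and Age_Imputed_Flag column names referred to later in this answer:

# Impute missing ages with the median age of each title group
titanic_df <- titanic_df %>%
  group_by(Title) %>%
  mutate(Age_Imputed      = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),
         Age_Imputed_Flag = is.na(Age)) %>%    # TRUE where a value was imputed
  ungroup()
summary(titanic_df$Age_Imputed)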
Embarked (2 missing):
Choice: Mode imputation. Given only 2 missing values, imputing with the most frequent port ('S' for Southampton) is a simple and robust approach that minimally impacts the overall distribution.
Justification: The small number of missing values means complex imputation methods are overkill; the mode is a reasonable proxy.
# Find the mode of Embarked and impute
mode_embarked <- names(which.max(table(titanic_df$Embarked)))    # "S"
titanic_df$Embarked[is.na(titanic_df$Embarked)] <- mode_embarked
titanic_df$Embarked <- factor(titanic_df$Embarked)

Cabin (~77% missing):
Choice: Do not impute. Instead, create a binary indicator and extract the deck letter. Because of the excessive missingness, imputing 77% of the values would introduce significant noise or bias; a binary indicator (Has_Cabin) is more informative.
Justification: The presence or absence of cabin information is itself a strong predictor of survival, and extracting the deck letter can capture additional patterns.
titanic_df$Has_Cabin <- ifelse(is.na(titanic_df$Cabin), 0, 1)
titanic_df$CabinDeck <- ifelse(titanic_df$Has_Cabin == 1, substr(titanic_df$Cabin, 1, 1), "Unknown")

Before/After Comparison:
# Missing value percentages before imputation were stored in missing_pct (step 4)
missing_before <- missing_pct
missing_after  <- colSums(is.na(titanic_df)) / nrow(titanic_df) * 100
data.frame(missing_before, missing_after)

5. Create derived variables and enhance the dataset

Family Variables:
titanic_df$FamilySize     <- titanic_df$SibSp + titanic_df$Parch + 1
titanic_df$IsAlone        <- as.integer(titanic_df$FamilySize == 1)
titanic_df$FamilyCategory <- cut(titanic_df$FamilySize, breaks = c(0, 1, 3, 5, Inf),
                                 labels = c("Alone", "Small", "Medium", "Large"))   # cut-points illustrative

Title Extraction and Grouping:
# Title extraction was already done in the Age imputation step
titanic_df$Title_Grouped <- ifelse(titanic_df$Title %in% c("Mr", "Mrs", "Miss", "Master"),
                                   titanic_df$Title, "Rare")   # finer groups (e.g. "Officer/Professional") can be split out

Age Categories:
titanic_df$AgeGroup <- cut(titanic_df$Age_Imputed, breaks = c(0, 12, 19, 64, Inf),
                           labels = c("Child", "Teen", "Adult", "Senior"))   # cut-points illustrative
titanic_df$IsChild  <- as.integer(titanic_df$Age_Imputed < 13)               # threshold illustrative

Fare Variables:
titanic_df$FarePerPerson <- titanic_df$Fare / titanic_df$FamilySize
titanic_df$FareBracket   <- cut(titanic_df$Fare,
                                breaks = quantile(titanic_df$Fare, probs = seq(0, 1, 0.25), na.rm = TRUE),
                                labels = c("Low", "Medium-Low", "Medium-High", "High"),
                                include.lowest = TRUE)

Cabin Features:
# CabinDeck and Has_Cabin were already created in the imputation step

6. Validate data quality and document the cleaning process

Validation Checks:
# Verify factor levels
lapply(titanic_df[, c("Survived", "Pclass", "Sex", "Embarked", "FamilyCategory", "IsAlone",
                      "Title_Grouped", "AgeGroup", "IsChild", "Has_Cabin", "CabinDeck", "FareBracket")],
       levels)
# Age and Fare non-negative
sum(titanic_df$Age_Imputed < 0, na.rm = TRUE)
sum(titanic_df$Fare < 0, na.rm = TRUE)

Compare Distributions (Age and Fare):
# Already done for Age in step 4; a sketch of the before/after comparison follows below.
# A similar comparison can be made for Fare if any imputation is applied to it.

Survival Rates Verification:
# Correlation matrix for numeric variables
cor(titanic_df[, c("Age_Imputed", "Fare", "SibSp", "Parch", "FamilySize", "FarePerPerson")],
    use = "pairwise.complete.obs")
# Cross-tabulations for expected patterns
table(titanic_df$Survived, titanic_df$Pclass)
table(titanic_df$Survived, titanic_df$Sex)
table(titanic_df$Survived, titanic_df$AgeGroup)
table(titanic_df$Survived, titanic_df$FamilyCategory)
# Expected: higher survival in 1st class, females, children, and small/medium families.
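As referenced above, a minimal sketch of the before/after Age comparison (the original Age column is untouched because the imputed values live in Age_Imputed):

par(mfrow = c(1, 2))
hist(titanic_df$Age, main = "Age (original, NAs excluded)", xlab = "Age", col = "lightblue")
hist(titanic_df$Age_Imputed, main = "Age (after imputation)", xlab = "Age", col = "lightgreen")
par(mfrow = c(1, 1))
summary(titanic_df$Age)
summary(titanic_df$Age_Imputed)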
Generate a comprehensive data quality report:
This would be a separate document (PDF/HTML) summarizing the following points.

Executive Summary:
- Dataset dimensions: 891 rows, 12 original variables, plus 13 derived variables.
- Variables created: Title, Age_Imputed, Age_Imputed_Flag, Has_Cabin, CabinDeck, FamilySize, FamilyCategory, IsAlone, Title_Grouped, AgeGroup, IsChild, FarePerPerson, FareBracket.
- Records modified: Age (imputed 177 values), Embarked (imputed 2 values), Cabin (converted to indicators).
- Completeness Score: Improved data completeness for Age and Embarked. Cabin still has high "missingness" but is now handled semantically.

Issues Identified Table:
Variable | Issue Type | Count | Resolution
Age | Missing values | 177 (19.87%) | Imputed using median age by Title.
Cabin | Missing values | 687 (77.10%) | Converted to Has_Cabin (binary) and CabinDeck (first letter); values not imputed.
Embarked | Missing values | 2 (0.22%) | Imputed with the mode ("S", Southampton).
Survived | Incorrect type (int) | 891 | Converted to factor ("No", "Yes").
Pclass | Incorrect type (int) | 891 | Converted to ordered factor ("3rd", "2nd", "1st").
Sex, Embarked | Incorrect type (char) | 891 | Converted to factor.
Name, Ticket, Cabin | Whitespace | (variable) | Trimmed whitespace.

Data Transformations Applied:
- Detailed description of all type conversions (factor, ordered factor, numeric, integer).
- Justifications for the imputation choices for Age and Embarked.
- Explanation of how Cabin missingness was handled.

Derived Variables Documentation:
- FamilySize: SibSp + Parch + 1; total family members.
- FamilyCategory: FamilySize categorized into "Alone", "Small", "Medium", "Large".
- IsAlone: Binary indicator for FamilySize == 1.
- Title: Extracted from Name.
- Title_Grouped: Title consolidated into broader categories (e.g., "Mr", "Mrs", "Miss", "Master", "Officer/Professional", "Rare").
- AgeGroup: Age_Imputed categorized into "Child", "Teen", "Adult", "Senior".
- IsChild: Binary indicator for Age_Imputed below the child cut-off (e.g., under 13).
- FarePerPerson: Fare / FamilySize; fare adjusted for family size.
- FareBracket: Fare categorized into quartiles ("Low", "Medium-Low", "Medium-High", "High").
- Has_Cabin: Binary indicator for the presence of Cabin information.
- CabinDeck: First letter of Cabin (A-G, T, or "Unknown").

Before/After Comparison Tables and Visualizations:
- Missing value summary table.
- Histograms of original vs. imputed Age.
- Summary statistics comparison for Age and Fare.

Limitations Discussion:
- Assumptions: Median imputation by title is assumed appropriate for Age; mode imputation for Embarked is assumed acceptable given the low count.
- Potential Biases: The imputation methods might slightly reduce variance or introduce bias if these assumptions are violated.
- Variables with Remaining Concerns: The Ticket field remains complex and was not extensively cleaned for pattern extraction, though whitespace was trimmed. Fare outliers were identified but not specifically treated (e.g., capping), as they may represent legitimate extreme values.

Deliverables
The R script titanic_data_cleaning.R would contain all the code snippets above in logical order:

# titanic_data_cleaning.R

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(VIM)

# --- 1. Familiarize yourself with the dataset ---
titanic_df <- read.csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
                       stringsAsFactors = FALSE)
# str(titanic_df); summary(titanic_df)
# titanic_df %>% group_by(Sex) %>% summarise(SurvivalRate = mean(Survived))
# summary(titanic_df$Age)
# table(titanic_df$Embarked)
# Initial visualizations
# hist(titanic_df$Age, main = "Original Age Distribution", xlab = "Age")
# hist(titanic_df$Fare, main = "Original Fare Distribution", xlab = "Fare")

# --- 2. Check for structural errors and inconsistencies ---
# sum(duplicated(titanic_df$PassengerId))
# sum(titanic_df$Age < 0 | titanic_df$Age > 80, na.rm = TRUE)

# --- 3. Check and handle data type errors ---
# (factor conversions and whitespace trimming as shown in step 3 above)

# --- 4. Develop and implement a missing data strategy ---
titanic_df$Cabin[titanic_df$Cabin == ""]       <- NA
titanic_df$Embarked[titanic_df$Embarked == ""] <- NA
missing_pct <- colSums(is.na(titanic_df)) / nrow(titanic_df) * 100
print(missing_pct[missing_pct > 0])
# aggr(titanic_df, numbers = TRUE, prop = FALSE, cex.axis = .7,
#      main = "Missing Data Patterns Before Imputation")

# Age imputation using Title
titanic_df$Title <- gsub("(.*, )|(\\..*)", "", titanic_df$Name)
titanic_df <- titanic_df %>%
  group_by(Title) %>%
  mutate(Age_Imputed = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) %>%
  ungroup()

# Embarked mode imputation; Cabin indicators (Has_Cabin, CabinDeck) as in step 4 above
mode_embarked <- names(which.max(table(titanic_df$Embarked)))
titanic_df$Embarked[is.na(titanic_df$Embarked)] <- mode_embarked

print(colSums(is.na(titanic_df))[colSums(is.na(titanic_df)) > 0])   # Should only show Cabin NAs
# hist(titanic_df$Age_Imputed, main = "Imputed Age Distribution", xlab = "Age", col = "lightgreen")

# --- 5. Create derived variables and enhance the dataset ---
titanic_df$FamilySize <- titanic_df$SibSp + titanic_df$Parch + 1
# (remaining derived variables and the section 6 validation checks follow the snippets above)
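As a final step (an assumption, not stated in the brief), the cleaned data could be written out for reuse; the file name here is illustrative:

write.csv(titanic_df, "titanic_clean.csv", row.names = FALSE)
# The data quality report itself would be produced separately (e.g., an R Markdown document
# rendered to PDF/HTML) rather than from this script.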
Question 4: Mobile Banking App User Analysis

Table 4.1: Variables and Descriptions
Variable Name | Description | Type | Scale
Session_Duration | Average session length in minutes | Predictor | Metric
Transactions_Per_Week | Average number of transactions processed per week | Predictor | Metric
Features_Used | Number of distinct app features used by the user (e.g., transfers, payments, budgeting) | Predictor | Metric
App_Rating | User-provided satisfaction rating on a 1-10 scale | Predictor | Metric
Security_Alerts | Number of security alerts acknowledged by the user | Predictor | Metric
Subscription_Likelihood | Continuous score (0-100) indicating the likelihood of subscribing to a premium version of the app | Outcome | Metric
User_Tier | The user's assigned tier based on overall activity: 'Bronze', 'Silver', or 'Gold' | Outcome | Categorical
Churn_Risk | Binary classification of whether the user is at high risk of churning (Yes/No) | Outcome | Categorical

a) Multiple Regression Analysis Assessment

Definition of Errors and Power:

Type I Error ($\alpha$): Incorrectly rejecting a true null hypothesis; here, concluding that a predictor (e.g., Session_Duration) has a significant effect on Subscription_Likelihood when in reality it does not.
Practical Consequence: The firm might invest resources in optimizing a feature (e.g., increasing session duration) based on a spurious correlation, leading to wasted effort and potentially suboptimal strategies.

Type II Error ($\beta$): Failing to reject a false null hypothesis; concluding that a predictor has no significant effect when in reality it does.
Practical Consequence: The firm might overlook a genuinely effective predictor or relationship, missing opportunities to improve Subscription_Likelihood. For example, it might ignore a feature that truly drives subscriptions.

Statistical Power ($1 - \beta$): The probability of correctly rejecting a false null hypothesis, i.e., the ability of the test to detect an effect that truly exists. Here, it is the probability of correctly identifying a significant predictor of Subscription_Likelihood.
Practical Consequence: High power is desirable because it increases the chances of discovering true relationships, enabling the firm to make informed, data-driven decisions for product development and marketing.

Adequacy of Sample Size ($n = 250$):

Predictor-to-Observation Ratio: With 5 predictors, $n = 250$ gives a ratio of $250/5 = 50$. A common rule of thumb is at least 10-20 observations per predictor, so a ratio of 50 is very good, suggesting enough data to estimate the regression coefficients reliably without overfitting.

Power Requirements: Given $R^2 = 0.20$, $\alpha = 0.05$, power $= 0.80$, and 5 predictors, a power analysis (e.g., using G*Power or R's pwr package) gives the required sample size precisely. The effect size is $f^2 = R^2 / (1 - R^2) = 0.20 / 0.80 = 0.25$, which lies between Cohen's medium (0.15) and large (0.35) benchmarks. For 5 predictors, $\alpha = 0.05$, and power $= 0.80$, the required sample size is roughly 50-55 (and only about 90 even for a medium effect of $f^2 = 0.15$). Therefore, $n = 250$ is more than adequate for the stated effect size and power.
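A minimal sketch of that power check with the pwr package (assuming it is installed):

library(pwr)
# u = numerator df (number of predictors); pwr.f2.test solves for v, the error df.
# The required total sample size is then n = v + u + 1.
pwr.f2.test(u = 5, f2 = 0.25, sig.level = 0.05, power = 0.80)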
Implications and Recommendations:

Implications:
- The sample size of 250 is robust for this analysis.
- The study is well powered to detect the hypothesized effect size ($R^2 = 0.20$), so the risk of a Type II error is low.
- The favorable predictor-to-observation ratio ensures stable coefficient estimates.

Recommendations:
- Proceed with the multiple regression analysis as planned, confident in the statistical power and sample size.
- While statistically adequate, ensure the sample is representative of the target user base to avoid external validity issues.
- Beyond statistical significance, focus on the practical significance (effect size) of each predictor; a coefficient can be statistically significant yet have a negligible practical impact.
- Consider exploring non-linear relationships or interactions between predictors if initial linear models do not fully capture the variance.

b) Multivariate Statistical Techniques

Objective 1: Identify distinct, naturally occurring user profiles by grouping users based on their patterns across the five user metrics.
Specific Technique: Cluster Analysis (e.g., K-means, hierarchical clustering)
Method Type: Interdependence method
Justification:
- Research Question: Aims to find inherent groupings or structures within the data without a predefined dependent variable.
- Variable Scales: All five user metrics (Session_Duration, Transactions_Per_Week, Features_Used, App_Rating, Security_Alerts) are metric (interval/ratio) variables, which are suitable inputs for distance-based clustering algorithms.

Objective 2: Develop a robust classification model that can predict a user's Churn_Risk (Yes/No) based on the five user metrics.
Specific Technique: Logistic Regression (or other classification algorithms such as support vector machines or random forests)
Method Type: Dependence method
Justification:
- Research Question: Predicts a categorical outcome (Churn_Risk: Yes/No) from a set of predictor variables.
- Variable Scales: The outcome variable is categorical (binary) and the predictor variables are metric, which is appropriate for logistic regression.

Objective 3: Understand which of the five user metrics are most effective at discriminating among the three User_Tier levels ('Bronze', 'Silver', 'Gold').
Specific Technique: Discriminant Analysis (specifically, multiple discriminant analysis)
Method Type: Dependence method
Justification:
- Research Question: Aims to identify which linear combinations of the predictor variables best differentiate between three or more predefined categorical groups (User_Tier).
- Variable Scales: The outcome variable is categorical (nominal with three levels) and the predictors are metric, the standard input for discriminant analysis.
A combined sketch of these three techniques follows below.
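A minimal combined sketch, assuming a data frame app_users that contains the five metric columns from Table 4.1 plus the two categorical outcomes (the data frame name is an assumption):

library(MASS)    # for lda()

metrics <- c("Session_Duration", "Transactions_Per_Week", "Features_Used",
             "App_Rating", "Security_Alerts")

# Objective 1: cluster analysis on the standardized metrics (k = 3 chosen only for illustration)
set.seed(123)
user_clusters <- kmeans(scale(app_users[, metrics]), centers = 3, nstart = 25)

# Objective 2: logistic regression for Churn_Risk (assumed coded as a Yes/No factor)
churn_model <- glm(Churn_Risk ~ Session_Duration + Transactions_Per_Week +
                     Features_Used + App_Rating + Security_Alerts,
                   data = app_users, family = binomial)

# Objective 3: multiple discriminant analysis for the three User_Tier groups
tier_lda <- lda(User_Tier ~ Session_Duration + Transactions_Per_Week +
                  Features_Used + App_Rating + Security_Alerts,
                data = app_users)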
c) Critique Analytical Proposals

Proposal 1: MANOVA to determine whether the mean scores of the five user metrics differ between the two Churn_Risk groups (Yes/No).

Critique: MANOVA is a dependence technique that analyzes differences in multiple dependent variables simultaneously across groups; it is used when the independent variable is categorical and the dependent variables are metric. In this proposal, Churn_Risk (Yes/No) is the independent variable and the five user metrics are the dependent variables, so this is a valid application of MANOVA. The stated goal is to "determine if the mean scores... differ"; MANOVA can do this, and it is often followed by univariate ANOVAs to pinpoint which specific metrics differ.

Is MANOVA most suitable? Yes, it is suitable for this scenario: it controls for Type I error inflation when testing multiple dependent variables and takes the correlations between them into account. It directly addresses whether the profile of metrics differs between the churn groups.

Justification for MANOVA:
- Dependence method: Churn_Risk (categorical IV) affects the 5 user metrics (metric DVs).
- Multiple DVs: It examines differences across all 5 metrics simultaneously.
- Controls for DV correlation: It accounts for the intercorrelations among the user metrics, which separate univariate ANOVAs would ignore, potentially leading to misleading conclusions.

More appropriate statistical technique (if any): If the primary goal were simply to predict Churn_Risk from these metrics, logistic regression (as identified in Objective 2) would be the more direct approach to modeling the probability of churn, and logistic regression or discriminant analysis would be better for classifying new users into churn/no-churn. For understanding whether the mean profiles of the metrics differ between churners and non-churners, however, MANOVA is a strong choice. Given the phrasing "determine if the mean scores... differ", MANOVA is appropriate.

Proposal 2: Multiple regression to predict the categorical User_Tier ('Bronze', 'Silver', 'Gold') from the five metric predictor variables.

Fundamental Flaw: Standard multiple linear regression requires a continuous, metric dependent variable. User_Tier is a categorical variable with three levels. Using linear regression for a nominal categorical outcome violates its assumptions and can lead to meaningless predictions (e.g., a predicted tier of 1.5, or predictions outside the 1-3 range). It also fails to model the relationship between the predictors and the probability of being in a specific category.

Correct Dependence Technique:
Specific Technique: Multinomial Logistic Regression (or ordinal logistic regression if the tiers are treated as meaningfully ordered; 'Bronze', 'Silver', 'Gold' could reasonably be treated either way). Multiple discriminant analysis is another strong candidate, as discussed in Objective 3.
Justification:
- Research Question: Predicts a categorical outcome variable with more than two levels.
- Variable Scales: The outcome variable (User_Tier) is categorical (nominal/ordinal) and the predictors are metric.
- Multinomial logistic regression models the probability of belonging to each category of the dependent variable relative to a reference category, ensuring the predictions are probabilities that sum to one and are interpretable for a categorical outcome.
- Multiple discriminant analysis identifies linear combinations of the predictors that best separate the groups and can be used for classification.

Question 5: One-Way ANOVA and Pairwise t-test (R)

Dataset Download Link
https://raw.githubusercontent.com/guru99-edu/R-Programming/master/poisons.csv

Variables:
- Time: Survival time of the guinea pig (Metric)
- poison: Type of poison used (Factor: 1, 2, 3)
- treat: Type of treatment used (Factor)

Analysis Steps:
# Step 1: Load the data
library(dplyr)
poisons_df <- read.csv("https://raw.githubusercontent.com/guru99-edu/R-Programming/master/poisons.csv")
# Column names follow the description above (Time, poison, treat); adjust the case if the file uses lowercase names
poisons_df$poison <- factor(poisons_df$poison)
poisons_df$treat  <- factor(poisons_df$treat)

# Step 2: Summary statistics by poison type
poisons_df %>%
  group_by(poison) %>%
  summarise(count = n(),
            mean_time = mean(Time),
            sd_time = sd(Time))

# Step 3: Plot a box plot
boxplot(Time ~ poison, data = poisons_df,
        xlab = "Poison Type", ylab = "Survival Time",
        main = "Survival Time by Poison Type")

# Step 4: Compute the one-way ANOVA test (effect of poison on Time)
anova_poison <- aov(Time ~ poison, data = poisons_df)
summary(anova_poison)
# If the p-value for poison is below 0.05, at least one poison type differs in mean survival time.
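The question also asks for pairwise comparisons; a minimal sketch of the follow-up tests, reusing poisons_df and anova_poison from above:

# Pairwise t-tests between poison types, adjusting p-values for multiple testing
pairwise.t.test(poisons_df$Time, poisons_df$poison, p.adjust.method = "bonferroni")

# Tukey HSD on the fitted ANOVA as an alternative post-hoc comparison
TukeyHSD(anova_poison)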
Question 6: Principal Component Analysis (PCA)

Dataset: Decathlon data (decathlon.csv)
- The 10 events (columns 1-10) are continuous variables.
- Columns 11-12 are rank and points.
- The last column is categorical (athletic meeting).

Analysis Steps:
# Load data (assuming decathlon.csv is in the working directory)
decathlon_df <- read.csv("decathlon.csv")

# a. Suggest the number of principal components.
# Run PCA on the 10 event variables (columns 1-10), standardized.
pca_result <- prcomp(decathlon_df[, 1:10], center = TRUE, scale. = TRUE)

# Kaiser criterion: retain components with eigenvalue > 1
eigenvalues <- pca_result$sdev^2
sum(eigenvalues > 1)
screeplot(pca_result, type = "lines", main = "Scree Plot")

# Proportion of variance explained
summary(pca_result)   # look at 'Cumulative Proportion' and 'Standard deviation' (sqrt of the eigenvalue)
# Rule of thumb: retain enough components to explain sufficient cumulative variance (e.g., >70-80%).
# From summary(pca_result), identify the number of components that explain sufficient variance.

# b. Run the principal component analysis for variables 1 to 10 using the suggestion in (a).
# This is already done with `pca_result` above; inspect the retained components, e.g. their loadings:
pca_result$rotation[, 1:4]   # first few components (use the number suggested in part a)

Question 7: Exploratory Factor Analysis (EFA)

Dataset: Cars data (cars.csv)
14 variables (Price, Safety, Exterior looks, Space and comfort, Technology, After sales service, Resale value, Fuel type, Fuel efficiency, Color, Maintenance, Test drive, Product reviews, Testimonials) on a 5-point Likert scale.

Analysis Steps:
# Load data (assuming cars.csv is in the working directory); the psych package is used for the EFA
library(psych)
cars_df <- read.csv("cars.csv")

# a. Suggest the number of factors.
R_matrix <- cor(cars_df, use = "pairwise.complete.obs")
eigen_values <- eigen(R_matrix)$values       # eigenvalues of the correlation matrix
sum(eigen_values > 1)                        # Kaiser criterion (eigenvalue > 1)
plot(eigen_values, type = "b", main = "Scree Plot of Eigenvalues")   # visual inspection
fa.parallel(cars_df, fa = "fa")              # parallel analysis
# Based on parallel analysis, the Kaiser criterion, and the scree plot, decide on the number of factors.
# Let's assume, for example, that 3 factors are suggested.

# b. Find the correlation matrix.
round(R_matrix, 2)

# c. Use orthogonal rotation to rotate the factors; produce and interpret the factor loading matrix.
efa_varimax <- fa(cars_df, nfactors = 3, rotate = "varimax", fm = "ml")
print(efa_varimax$loadings, cutoff = 0.3)
# Variables loading highly (> 0.3 or 0.4) on a specific factor are strongly associated with that factor.
# Orthogonal rotation assumes the factors are uncorrelated. Look for simple structure: each variable
# loads highly on one factor and low on the others. Interpret the factors from the variables that load
# onto them (e.g., Factor 1 might be 'Performance', Factor 2 'Aesthetics').

# d. Use oblique rotation to rotate the factors. What can you conclude? Produce its factor loading matrix.
# Perform EFA with oblique rotation (oblimin is common; requires the GPArotation package)
efa_oblique <- fa(cars_df, nfactors = 3, rotate = "oblimin", fm = "ml")
print(efa_oblique$loadings, cutoff = 0.3)
# Oblique rotation allows the factors to correlate (see efa_oblique$Phi); compare the loading pattern
# with the varimax solution to judge whether correlated factors describe the data better.

Question 8: Multiple Regression Model

Dataset: Urban transportation data (transport_dataset.csv)
Predict Commute Travel Time (minutes) based on multiple factors.

Analysis Steps:
# Load data (assuming transport_dataset.csv is in the working directory)
transport_df <- read.csv("transport_dataset.csv")

# Fit the multiple regression model (the outcome column name Travel_Time is assumed here;
# replace it with the actual commute-time column in the file)
model <- lm(Travel_Time ~ ., data = transport_df)
summary(model)

# Check which slopes are significant: see the 'Pr(>|t|)' column for each predictor.
# Slopes with p-value < 0.05 are statistically significant at the 5% level.
# Influential observations can be screened with Cook's distance (common cut-offs: > 1, or > 4/n).

# f. Check the unusual observations. What can you conclude?
# Unusual observations include outliers, high leverage points, and influential points.
# High leverage points: `hatvalues(model)`
# Influential points: `cooks.distance(model)`, or DFFITS/DFBETAS via `influence.measures(model)`,
#   whose summary flags potentially influential observations.
# `plot(model, which = 5)` (Residuals vs Leverage) helps visualize these.
# Conclusion: identify the specific observations that are unusual and comment on their potential impact
# on the model (e.g., whether removing them significantly changes the coefficients).

# g. Obtain the correlation matrix (only for numeric variables). Paste the matrix into your report.
# Describe the relationship between the dependent variable and each of the predictors.
# Select numeric variables (assuming all are numeric except for any factors)
numeric_vars <- transport_df[, sapply(transport_df, is.numeric)]
round(cor(numeric_vars, use = "pairwise.complete.obs"), 2)
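A minimal sketch of actually running the diagnostics listed above on the fitted model, plus a multicollinearity check with car::vif (an addition, not part of the original steps); it assumes the model object and transport_df from the code above:

# Standard lm diagnostic plots: residuals vs fitted, QQ plot, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))

# Multicollinearity check (the car package must be installed)
library(car)
vif(model)    # values well above 5-10 suggest problematic multicollinearity

# Influence diagnostics
cooks_d <- cooks.distance(model)
which(cooks_d > 4 / nrow(transport_df))    # observations exceeding the 4/n rule of thumb
summary(influence.measures(model))         # flags potentially influential observations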
Question 9: Logistic Regression Model

Dataset: Credit data from the ISLR package
Predict Married status (binary: Yes/No) based on the other demographic and financial information.

Analysis Steps:
# Install and load the required packages
# install.packages('ISLR'); install.packages('pROC')
library(ISLR)
library(dplyr)   # For data manipulation
library(pROC)    # For AUC

# Load the Credit dataset
credit_data <- Credit
credit_data$ID <- NULL    # drop the identifier column

# Train/test split (a 70/30 split is assumed here)
set.seed(123)
train_idx  <- sample(seq_len(nrow(credit_data)), size = 0.7 * nrow(credit_data))
train_data <- credit_data[train_idx, ]
test_data  <- credit_data[-train_idx, ]

# Fit the full logistic regression model for Married
model_all <- glm(Married ~ ., data = train_data, family = binomial)
summary(model_all)
# Check which predictors are significant: see the 'Pr(>|z|)' column for each predictor.
# Variables with p-value > 0.05 are generally considered not statistically significant at the 5% level.
# Iteratively remove the non-significant variables (the simplest approach for this exercise) and refit.
# For demonstration, suppose Gender and Ethnicity turn out to be non-significant; the actual choice must
# be based on the p-values from summary(model_all).

# Refit with only the retained predictors (formula shown as an example)
model_reduced <- glm(Married ~ . - Gender - Ethnicity, data = train_data, family = binomial)
summary(model_reduced)

# Predict on the test set and classify with a 0.5 threshold
test_predictions       <- predict(model_reduced, newdata = test_data, type = "response")
test_predictions_class <- ifelse(test_predictions > 0.5, "Yes", "No")
table(Predicted = test_predictions_class, Actual = test_data$Married)

# Evaluate discrimination with the AUC (see the sketch below). As a rough guide, AUC = 0.5 is no better
# than chance, 0.7-0.8 is acceptable, 0.8-0.9 is good, and > 0.9 is excellent.
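As referenced above, a minimal sketch of computing the ROC curve and AUC with the pROC package loaded earlier:

roc_obj <- roc(response = test_data$Married, predictor = test_predictions)
plot(roc_obj, main = "ROC Curve - Reduced Model")
auc(roc_obj)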