### Exam Preparation Plan (May 4th Exam)

#### Overview

This plan focuses on revising concepts from Contact Sessions 1, 2, and 3, practicing problem-solving, and memorizing key formulas. The exam is on May 4th, so effective time management is crucial.

#### Daily Schedule Suggestion (Adapt as needed)

**Day 1: Concept Review - Contact Session 1**

- **Morning (2-3 hours):**
  - Review notes/materials for Contact Session 1.
  - Focus on understanding core definitions and principles.
  - Read through the "Cheatsheet: Contact Session 1" below.
- **Afternoon (2-3 hours):**
  - Work through assigned problems/examples related to Contact Session 1.
  - Identify areas of weakness.
- **Evening (1 hour):**
  - Review formulas for Contact Session 1.
  - Create flashcards for difficult concepts or formulas.

**Day 2: Concept Review - Contact Session 2**

- **Morning (2-3 hours):**
  - Review notes/materials for Contact Session 2.
  - Focus on understanding core definitions and principles.
  - Read through the "Cheatsheet: Contact Session 2" below.
- **Afternoon (2-3 hours):**
  - Work through assigned problems/examples related to Contact Session 2.
  - Identify areas of weakness.
- **Evening (1 hour):**
  - Review formulas for Contact Session 2.
  - Create flashcards for difficult concepts or formulas.

**Day 3: Concept Review - Contact Session 3**

- **Morning (2-3 hours):**
  - Review notes/materials for Contact Session 3.
  - Focus on understanding core definitions and principles.
  - Read through the "Cheatsheet: Contact Session 3" below.
- **Afternoon (2-3 hours):**
  - Work through assigned problems/examples related to Contact Session 3.
  - Identify areas of weakness.
- **Evening (1 hour):**
  - Review formulas for Contact Session 3.
  - Create flashcards for difficult concepts or formulas.

**Day 4: Integrated Practice & Weakness Targeting**

- **Morning (3-4 hours):**
  - Attempt a mock exam or a comprehensive set of mixed problems from all three sessions.
  - Time yourself to simulate exam conditions.
- **Afternoon (2-3 hours):**
  - Review your answers from the morning session.
  - Identify common mistakes and topics that still require attention.
  - Revisit specific sections of your notes or the cheatsheet.
- **Evening (1 hour):**
  - Focused review of formulas and concepts from your weakest areas.

**Day 5: Final Review & Rest (Day before exam)**

- **Morning (2-3 hours):**
  - Quick scan of all cheatsheets and formula lists.
  - Do a few light practice problems, but avoid intense new problem-solving.
  - Focus on confidence-building.
- **Afternoon/Evening:**
  - Relax, eat well, and ensure you get a good night's sleep. Avoid cramming late into the night.

#### Important Tips

- **Active Recall:** Don't just passively read. Test yourself regularly.
- **Spaced Repetition:** Revisit topics multiple times over several days.
- **Problem Solving:** Practice is key. Understand *why* an answer is correct/incorrect.
- **Formula Sheets:** Create and use your own formula sheet during practice.
- **Time Management:** Stick to your schedule but be flexible if a topic needs more time.
- **Google Drive:** Use the provided Google Drive link for all course materials, assignments, and the capstone project proposal (as a reference for applied concepts).

### Cheatsheet: Contact Session 1 - Introduction to Data Science & Statistics

#### Key Concepts

- **Data Science Lifecycle:** Problem definition, data acquisition, data cleaning/preparation, exploratory data analysis (EDA), modeling, evaluation, deployment.
- **Types of Data:**
  - **Quantitative:** Numerical (discrete, continuous).
  - **Qualitative/Categorical:** Non-numerical (nominal, ordinal).
- **Measures of Central Tendency:**
  - **Mean ($\bar{x}$):** Average. Sensitive to outliers.
  - **Median:** Middle value when data is ordered. Robust to outliers.
  - **Mode:** Most frequent value.
- **Measures of Dispersion:**
  - **Range:** Max - Min.
  - **Variance ($\sigma^2$ or $s^2$):** Average of squared differences from the mean.
  - **Standard Deviation ($\sigma$ or $s$):** Square root of variance. Measures spread.
  - **Interquartile Range (IQR):** $Q_3 - Q_1$. Range of the middle 50% of data. Robust to outliers.
- **Data Visualization:** Histograms, Box plots, Scatter plots, Bar charts.
- **Probability Basics:**
  - **Experiment:** Process with uncertain outcomes.
  - **Outcome:** Result of an experiment.
  - **Sample Space ($S$):** Set of all possible outcomes.
  - **Event ($E$):** Subset of the sample space.
  - **Probability ($P(E)$):** Likelihood of an event occurring. $0 \le P(E) \le 1$.
  - **Complement ($E^c$):** Event does not occur. $P(E^c) = 1 - P(E)$.
- **Types of Probability:**
  - **Classical:** Equally likely outcomes. $P(E) = \frac{\text{Number of favorable outcomes}}{\text{Total number of outcomes}}$.
  - **Empirical/Relative Frequency:** Based on observations. $P(E) = \frac{\text{Number of times E occurred}}{\text{Total number of trials}}$.
  - **Subjective:** Based on personal judgment.
- **Rules of Probability:**
  - **Addition Rule:**
    - **Mutually Exclusive:** $P(A \cup B) = P(A) + P(B)$
    - **General:** $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  - **Multiplication Rule:**
    - **Independent Events:** $P(A \cap B) = P(A) \times P(B)$
    - **Dependent Events:** $P(A \cap B) = P(A) \times P(B|A)$
  - **Conditional Probability:** $P(B|A) = \frac{P(A \cap B)}{P(A)}$, where $P(A) > 0$.
  - **Bayes' Theorem:** $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

#### Key Formulas (Contact Session 1)

- **Mean:** $\bar{x} = \frac{\sum x_i}{n}$
- **Sample Variance:** $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
- **Population Variance:** $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
- **Standard Deviation:** $s = \sqrt{s^2}$ or $\sigma = \sqrt{\sigma^2}$
- **Z-score:** $z = \frac{x - \mu}{\sigma}$ (for population) or $z = \frac{x - \bar{x}}{s}$ (for sample)
- **IQR:** $Q_3 - Q_1$
- **General Addition Rule:** $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- **Conditional Probability:** $P(B|A) = \frac{P(A \cap B)}{P(A)}$
- **Bayes' Theorem:** $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

#### Application Examples

- **Descriptive Statistics:** Calculate mean, median, mode, standard deviation for a given dataset of student scores. Interpret the spread of scores.
- **Probability:** Given a deck of cards, calculate the probability of drawing a King or a Heart. Calculate the probability of drawing two Kings without replacement.
- **Conditional Probability:** If 60% of students pass Math and 70% pass English, and 40% pass both, what is the probability a student passed Math given they passed English?

### Cheatsheet: Contact Session 2 - Probability Distributions & Sampling

#### Key Concepts

- **Random Variable:** A variable whose value is a numerical outcome of a random phenomenon.
  - **Discrete Random Variable:** Takes on a finite or countably infinite number of values (e.g., number of heads in coin flips).
  - **Continuous Random Variable:** Takes on any value within a given range (e.g., height, temperature).
- **Probability Distribution:** Describes the probabilities of all possible outcomes for a random variable.
  - **Probability Mass Function (PMF):** For discrete variables, $P(X=x)$.
  - **Probability Density Function (PDF):** For continuous variables, $f(x)$ where $\int_{-\infty}^{\infty} f(x) dx = 1$.
  - **Cumulative Distribution Function (CDF):** $F(x) = P(X \le x)$.
- **Expected Value ($E[X]$):** The long-run average of a random variable.
  - **Discrete:** $E[X] = \sum x \cdot P(X=x)$
  - **Continuous:** $E[X] = \int x \cdot f(x) dx$
- **Variance of a Random Variable ($Var(X)$):** Measures the spread of the distribution.
  - $Var(X) = E[X^2] - (E[X])^2$
- **Common Discrete Distributions:**
  - **Bernoulli Distribution:** Single trial, two outcomes (success/failure). $P(X=1)=p, P(X=0)=1-p$.
  - **Binomial Distribution:** Number of successes in a fixed number ($n$) of independent Bernoulli trials. $X \sim B(n, p)$.
  - **Poisson Distribution:** Number of events in a fixed interval of time or space, given a constant average rate ($\lambda$). $X \sim P(\lambda)$.
- **Common Continuous Distributions:**
  - **Uniform Distribution:** All values within a given interval are equally likely.
  - **Normal Distribution:** Bell-shaped, symmetric, characterized by mean ($\mu$) and standard deviation ($\sigma$). $X \sim N(\mu, \sigma^2)$.
  - **Standard Normal Distribution:** $Z \sim N(0, 1)$, mean 0, std dev 1.
- **Central Limit Theorem (CLT):** For a large sample size (a common rule of thumb is $n \ge 30$), the sampling distribution of the sample mean ($\bar{X}$) is approximately normal, regardless of the shape of the population distribution (provided it has finite variance), with mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$ (the standard error).
- **Sampling:**
  - **Population:** Entire group of interest.
  - **Sample:** Subset of the population.
  - **Sampling Methods:** Simple random sampling, stratified sampling, cluster sampling, systematic sampling.
  - **Sampling Error:** Difference between sample statistic and population parameter.
  - **Bias:** Systematic error in measurement or selection.
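
The expected value and variance definitions above can be checked numerically. The following is a minimal Python sketch, not part of the course materials: it builds the PMF of a binomial random variable from the standard binomial formula and computes $E[X]$ and $Var(X)$ directly from the definitions. The parameters `n = 5` and `p = 0.5` (five flips of a fair coin) and the helper name `binomial_pmf` are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Return P(X = k) for X ~ B(n, p) (hypothetical helper for this sketch)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Illustrative parameters (assumed): 5 flips of a fair coin.
n, p = 5, 0.5

# PMF over the full support 0..n.
pmf = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

# Expected value from the definition: E[X] = sum of x * P(X = x).
expected = sum(k * prob for k, prob in pmf.items())

# Variance via the shortcut: Var(X) = E[X^2] - (E[X])^2.
expected_sq = sum(k**2 * prob for k, prob in pmf.items())
variance = expected_sq - expected**2

print(f"P(X = 3) = {pmf[3]:.4f}")    # 0.3125
print(f"E[X]     = {expected:.4f}")  # 2.5000, matches n * p
print(f"Var(X)   = {variance:.4f}")  # 1.2500, matches n * p * (1 - p)
```

Computing the moments from the PMF rather than from the closed forms is a useful self-check when practicing: the two routes should agree.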

#### Key Formulas (Contact Session 2)

- **Expected Value (Discrete):** $E[X] = \sum_{x} x \cdot P(X=x)$
- **Variance (Discrete):** $Var(X) = \sum_{x} (x - E[X])^2 P(X=x)$ or $E[X^2] - (E[X])^2$
- **Binomial PMF:** $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$
  - $E[X] = np$, $Var(X) = np(1-p)$
- **Poisson PMF:** $P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}$
  - $E[X] = \lambda$, $Var(X) = \lambda$
- **Z-score for Normal Distribution:** $Z = \frac{X - \mu}{\sigma}$
- **Standard Error of the Mean:** $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
- **Z-score for Sample Mean (CLT):** $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

#### Application Examples

- **Binomial Distribution:** Calculate the probability of getting exactly 3 heads in 5 coin flips.
- **Poisson Distribution:** If a call center receives an average of 5 calls per hour, what is the probability of receiving exactly 7 calls in the next hour?
- **Normal Distribution:** Given a population with mean $\mu$ and std dev $\sigma$, find the probability that a randomly selected value falls within a certain range. Use Z-tables.
- **Central Limit Theorem:** If the average height of students is 170cm with a std dev of 10cm, what is the probability that the mean height of a sample of 50 students is greater than 172cm?

### Cheatsheet: Contact Session 3 - Hypothesis Testing & Regression

#### Key Concepts

- **Inferential Statistics:** Using sample data to make inferences about a population.
- **Estimation:**
  - **Point Estimate:** A single value used to estimate a population parameter (e.g., sample mean $\bar{x}$ for population mean $\mu$).
  - **Confidence Interval (CI):** A range of values likely to contain the population parameter with a certain level of confidence (e.g., 95% CI).
    - **Interpretation:** If we were to take many samples and construct a CI for each, approximately X% of these intervals would contain the true population parameter.
- **Hypothesis Testing:** A statistical method used to make decisions about a population based on sample data.
  - **Null Hypothesis ($H_0$):** A statement of no effect or no difference. Assumed true until evidence suggests otherwise.
  - **Alternative Hypothesis ($H_1$ or $H_a$):** A statement that contradicts $H_0$. The claim we are seeking evidence for.
  - **Types of Errors:**
    - **Type I Error:** Rejecting $H_0$ when it is true (False Positive). Its probability, $\alpha$, is the significance level.
    - **Type II Error:** Failing to reject $H_0$ when it is false (False Negative). Its probability is $\beta$.
  - **P-value:** The probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample, *assuming $H_0$ is true*.
    - **Decision Rule:** If P-value $\le \alpha$, reject $H_0$. If P-value $> \alpha$, fail to reject $H_0$.
  - **Test Statistic:** A value calculated from sample data used to test the hypothesis (e.g., Z-statistic, T-statistic).
- **Common Hypothesis Tests:**
  - **Z-test:** For population mean when population standard deviation ($\sigma$) is known, or for large sample sizes ($n \ge 30$).
  - **T-test:** For population mean when population standard deviation ($\sigma$) is unknown and sample size is small ($n < 30$).

### Practice Problems

#### Contact Session 1: Data & Probability

1. **Descriptive Statistics:** Given the dataset: `[15, 22, 18, 25, 30, 10, 20, 28, 12, 18]`.
   * Calculate the mean, median, and mode.
   * Calculate the range, variance, and standard deviation.
   * Identify any outliers using the 1.5*IQR rule.
2. **Probability:** A bag contains 5 red, 3 blue, and 2 green marbles.
   * What is the probability of drawing a red marble?
   * What is the probability of drawing a blue or a green marble?
   * If you draw two marbles without replacement, what is the probability that both are red?
   * What is the probability that the second marble is blue, given the first was red?
3. 
**Bayes' Theorem:** A medical test for a disease has 95% sensitivity (correctly identifies diseased individuals) and 98% specificity (correctly identifies healthy individuals). The prevalence of the disease in the population is 1%. If a person tests positive, what is the probability they actually have the disease?

#### Contact Session 2: Distributions & Sampling

1. **Binomial Distribution:** A biased coin lands on heads with a probability of 0.6. If you flip the coin 8 times:
   * What is the probability of getting exactly 5 heads?
   * What is the probability of getting at least 6 heads?
   * Calculate the expected number of heads and the variance.
2. **Poisson Distribution:** On average, a website receives 12 unique visitors per minute.
   * What is the probability that it receives exactly 10 visitors in the next minute?
   * What is the probability that it receives more than 15 visitors in the next minute?
3. **Normal Distribution:** The scores on a standardized test are normally distributed with a mean of 500 and a standard deviation of 100.
   * What proportion of test-takers score above 650?
   * What is the score below which 25% of test-takers fall?
4. **Central Limit Theorem:** The average weight of a certain type of apple is 150g with a standard deviation of 15g. If you take a random sample of 40 apples:
   * What is the probability that the sample mean weight is between 145g and 155g?
   * What is the probability that the sample mean weight is less than 140g?

#### Contact Session 3: Hypothesis Testing & Regression

1. **Confidence Interval:** A sample of 36 students has an average GPA of 3.2 with a sample standard deviation of 0.5. Construct a 90% confidence interval for the true average GPA of all students.
2. **One-Sample T-test:** A company claims that its new energy drink improves reaction time. The average reaction time for the general population is 0.25 seconds.
A sample of 15 individuals who consumed the drink had an average reaction time of 0.22 seconds with a standard deviation of 0.04 seconds. At $\alpha = 0.05$, test if the drink significantly reduces reaction time.
3. **Chi-Square Test of Independence:** A survey asks 100 people about their gender and their preference for coffee or tea. The results are:
   * Coffee: 30 males, 20 females
   * Tea: 15 males, 35 females
   * Is there a significant association between gender and beverage preference at $\alpha = 0.05$?
4. **Linear Regression:** Given the following data points for advertising spend (X) and sales (Y): `X = [10, 15, 20, 25, 30]` `Y = [50, 60, 70, 85, 90]`
   * Calculate the slope ($b_1$) and y-intercept ($b_0$) for the regression line.
   * Predict sales for an advertising spend of 22.
   * Calculate the Pearson correlation coefficient ($r$).
   * Interpret the $R^2$ value (you don't need to calculate it fully, just explain what it means in this context).
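
To check hand calculations for the linear regression problem above, a short Python sketch can help. It uses the standard textbook least-squares formulas, $b_1 = S_{xy} / S_{xx}$, $b_0 = \bar{y} - b_1 \bar{x}$, and $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$. These are assumed here because the Session 3 formula list is not reproduced in this text. The variable names and the `predict` helper are hypothetical. Work the problem by hand first and use this only to verify.

```python
from math import sqrt

# Data from the linear regression practice problem.
x = [10, 15, 20, 25, 30]  # advertising spend
y = [50, 60, 70, 85, 90]  # sales

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least-squares slope and intercept (standard formulas, assumed).
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Pearson correlation coefficient.
r = s_xy / sqrt(s_xx * s_yy)


def predict(x_new: float) -> float:
    """Predicted sales for a given spend (hypothetical helper)."""
    return b0 + b1 * x_new


print(f"b1 = {b1:.2f}, b0 = {b0:.2f}")        # b1 = 2.10, b0 = 29.00
print(f"predicted sales at 22 = {predict(22):.1f}")  # 75.2
print(f"r = {r:.4f}, R^2 = {r**2:.4f}")       # r = 0.9922, R^2 = 0.9844
```

In simple linear regression $R^2$ equals $r^2$, so here about 98% of the variation in sales is explained by the linear relationship with advertising spend.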