Statistics Cheatsheet
### Chapter 8.1: Sampling Distributions

- **Population:** The entire group of individuals or instances about whom we hope to learn.
- **Sample:** A (presumably representative) subset of a population, examined in hope of learning about the population.
- **Sampling Distribution:** The distribution of values taken by the statistic in all possible samples of the same size from the same population.
- **Central Limit Theorem (CLT):** For a sufficiently large sample size ($n \ge 30$ is a common guideline), the sampling distribution of the sample mean ($\bar{x}$) will be approximately normal, regardless of the shape of the population distribution.
  - Mean of sampling distribution of $\bar{x}$: $\mu_{\bar{x}} = \mu$ (population mean)
  - Standard Deviation of Sampling Distribution of $\bar{x}$ (Standard Error): $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ (where $\sigma$ is population standard deviation)
- **Sampling Distribution of a Proportion:** For large samples, the sampling distribution of the sample proportion ($\hat{p}$) is approximately normal.
  - Mean of sampling distribution of $\hat{p}$: $\mu_{\hat{p}} = p$ (population proportion)
  - Standard Deviation of Sampling Distribution of $\hat{p}$ (Standard Error): $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$
  - Conditions for Normality: $np \ge 10$ and $n(1-p) \ge 10$

### Chapter 9.1-9.3: Confidence Intervals for Proportions and Means

- **Confidence Interval (CI):** An interval estimate of a population parameter. It gives a range of plausible values for the parameter.
- **Confidence Level (C):** The probability that the method used to construct the interval will produce an interval that contains the true population parameter. Common levels: 90%, 95%, 99%.
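
The standard-error formulas and the normality check from Chapter 8.1 can be sketched in a few lines of Python. This is only an illustration: the function names (`se_mean`, `se_proportion`, `large_counts_ok`) and the example numbers are assumptions made up for this sketch, not part of the course material.

```python
import math


def se_mean(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def se_proportion(p: float, n: int) -> float:
    """Standard error of the sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


def large_counts_ok(p: float, n: int) -> bool:
    """Normality condition for p-hat: np >= 10 and n(1-p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10


# Hypothetical values, chosen only for illustration.
print(se_mean(sigma=15, n=36))            # 15 / sqrt(36) = 2.5
print(round(se_proportion(0.3, 50), 4))   # sqrt(0.21 / 50), about 0.0648
print(large_counts_ok(0.3, 50))           # counts are 15 and 35
print(large_counts_ok(0.1, 50))           # np = 5 is below 10
```

Quadrupling $n$ halves either standard error, since $n$ appears under the square root.
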

#### Confidence Intervals for Proportions

- **Formula for a One-Sample Z-Interval for a Population Proportion (p):**

  $$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

  - $\hat{p}$: Sample proportion
  - $z^*$: Critical value from standard normal distribution, depends on confidence level (e.g., 1.96 for 95% CI)
  - $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$: Standard error of the sample proportion
- **Conditions for Inference (Proportions):**
  1. **Random:** Data come from a random sample or randomized experiment.
  2. **10% Condition:** Sample size $n$ is no more than 10% of the population size $N$ ($n \le 0.10N$).
  3. **Large Counts Condition:** Both $n\hat{p} \ge 10$ and $n(1-\hat{p}) \ge 10$.
- **Interpreting a Confidence Interval:** "We are C% confident that the interval from [lower bound] to [upper bound] captures the true proportion of [context]."
- **Interpreting Confidence Level:** "If we were to take many samples of the same size from this population and construct a C% confidence interval for each sample, approximately C% of these intervals would capture the true population proportion."
- **Determining Sample Size (Proportions):** To achieve a desired margin of error (ME) for a proportion:

  $$n = \frac{(z^*)^2 p^*(1-p^*)}{ME^2}$$

  - Use a guessed $p^*$ if available, or $p^*=0.5$ for a conservative estimate (largest sample size).
  - Round $n$ up to the next whole number.

#### Chapter 9.3: Confidence Intervals for Means

- **Formula for a One-Sample T-Interval for a Population Mean ($\mu$):**

  $$\bar{x} \pm t^* \frac{s}{\sqrt{n}}$$

  - $\bar{x}$: Sample mean
  - $t^*$: Critical value from t-distribution with $df = n-1$, depends on confidence level.
  - $s$: Sample standard deviation (use sample standard deviation, NOT population $\sigma$)
  - $\frac{s}{\sqrt{n}}$: Standard error of the sample mean
- **Conditions for Inference (Means):**
  1. **Random:** Data come from a random sample or randomized experiment.
  2. **10% Condition:** Sample size $n$ is no more than 10% of the population size $N$.
  3. **Normal/Large Sample Condition:** The population distribution is normal, OR the sample size is large ($n \ge 30$, Central Limit Theorem applies), OR if $n < 30$, a graph of the sample data (boxplot, histogram) shows no strong skewness or outliers.
- **t-distribution:**
  - Used when the population standard deviation ($\sigma$) is unknown (which is almost always the case).
  - Similar in shape to the standard normal curve, but with more probability in the tails (fatter tails).
  - Its exact shape depends on the **degrees of freedom (df)**, which for a one-sample t-procedure is $df = n-1$.
  - As $df$ increases, the t-distribution approaches the standard normal distribution.
- **Determining Sample Size (Means):** To achieve a desired margin of error (ME) for a mean:

  $$n = \left(\frac{z^* s}{ME}\right)^2$$

  - Note that this formula typically uses $z^*$ as a preliminary estimate since $t^*$ depends on $n$. A pilot sample's $s$ can be used, or a conservative guess for $s$.
  - Round $n$ up to the next whole number.

### Chapter 10: Hypothesis Testing (Significance Tests)

- **Hypothesis Test (Significance Test):** A formal procedure for comparing observed data with a claim (hypothesis) whose truth we want to assess.
- **Null Hypothesis ($H_0$):** A statement of "no difference" or "no effect." It is the claim being tested. (e.g., $H_0: p = 0.5$)
- **Alternative Hypothesis ($H_a$):** A statement we are trying to find evidence for. (e.g., $H_a: p \ne 0.5$, $H_a: p > 0.5$, $H_a: p < 0.5$)

### Chapter 11: Inference for Means

- **When $\sigma$ is known:** Use Z-procedures (rare in practice as $\sigma$ is usually unknown).
  - **One-Sample Z-Interval for $\mu$**: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
  - **One-Sample Z-Test for $\mu$**: $Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
- **When $\sigma$ is unknown (most common scenario):** Use T-procedures!
- **t-distribution:**
  - Similar to normal distribution, but with heavier tails (more spread out).
  - Shape depends on degrees of freedom ($df = n-1$).
  - As $df \to \infty$, the t-distribution approaches the standard normal.
- **Formula for a One-Sample T-Interval for a Population Mean ($\mu$):**

  $$\bar{x} \pm t^* \frac{s}{\sqrt{n}}$$

  - $\bar{x}$: Sample mean
  - $t^*$: Critical value from t-distribution with $df = n-1$, depends on confidence level.
  - $s$: Sample standard deviation
  - $\frac{s}{\sqrt{n}}$: Standard error of the sample mean
- **Conditions for Inference about Means:**
  1. **Random:** Data come from a random sample or randomized experiment.
  2. **10% Condition:** Sample size $n$ is no more than 10% of the population size $N$.
  3. **Normal/Large Sample Condition:** The population distribution is normal, OR the sample size is large ($n \ge 30$, due to CLT), OR if $n < 30$, a graph of the sample data shows no strong skewness or outliers.
- **Formula for a One-Sample T-Test for a Population Mean ($\mu$):**

  $$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$

  - Use $df = n-1$ to find the P-value.
- **Paired T-Procedures:** Used when data come from paired samples (e.g., "before and after" measurements on the same individuals, or two treatments applied to matched pairs).
  - Analyze the *differences* between the paired observations.
  - Treat the differences as a single sample and apply one-sample t-procedures to the differences ($\mu_{diff}$).
  - $H_0: \mu_{diff} = 0$ (no difference).
  - $t = \frac{\bar{x}_{diff} - \mu_{0_{diff}}}{s_{diff}/\sqrt{n}}$ with $df = n-1$, where $n$ is the number of pairs and $\mu_{0_{diff}}$ is the hypothesized mean difference (0 under the $H_0$ above).
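
The one-sample and paired t statistics above can be sketched with Python's standard library. This is a minimal illustration, not a full test procedure: the helper names (`one_sample_t`, `paired_t`) and the before/after measurements are made up for the example, and the sketch stops at the test statistic and degrees of freedom because the standard library has no t-distribution CDF.

```python
import math
import statistics


def one_sample_t(data, mu0):
    """Return (t, df) for H0: mu = mu0, using t = (xbar - mu0) / (s / sqrt(n))."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, n - 1


def paired_t(before, after, mu0_diff=0.0):
    """Paired t: run the one-sample test on the differences (after - before)."""
    diffs = [a - b for a, b in zip(after, before)]
    return one_sample_t(diffs, mu0_diff)


# Hypothetical before/after measurements on the same five individuals.
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 14, 14]

t, df = paired_t(before, after)
print(round(t, 3), df)  # differences are 2, 2, 1, 3, 1; t is about 4.811 with df = 4
```

The P-value would then be read from a t table or statistical software using the returned `df`.
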