Chapter 1: Basic Ideas

Easy Problems

Qualitative vs. Quantitative Variables: Classify each as qualitative or quantitative:
(a) Brand of cell phone: Qualitative
(b) Number of text messages sent per day: Quantitative
(c) Favorite color: Qualitative
(d) Weight of a newborn baby: Quantitative

Discrete vs. Continuous: Classify each quantitative variable as discrete or continuous:
(a) Number of cars in a parking lot: Discrete
(b) Time spent studying for a test: Continuous
(c) Height of a tree: Continuous
(d) Number of heads in 10 coin flips: Discrete

Sampling Method: A teacher wants to know the average height of students in her class. She measures the height of the first 5 students who arrive. What type of sampling is this?
Convenience sampling (biased).

Difficult Problems

Study Design: A researcher is studying the effect of a new fertilizer on plant growth. They select 100 identical plants and randomly assign 50 to receive the new fertilizer and 50 to receive a standard fertilizer. After 4 weeks, they measure the height of all plants.
(a) Is this an observational study or a randomized experiment? Explain.
Randomized experiment. The researcher actively assigned treatments (fertilizers) to the plants randomly, allowing for cause-and-effect conclusions.
(b) Identify the experimental units, treatment, and response variable.
Experimental units: The 100 individual plants.
Treatment: The type of fertilizer (new vs. standard).
Response variable: Plant height.
(c) What is the advantage of using randomization in this study?
Randomization helps to ensure that the two groups (new fertilizer and standard fertilizer) are as similar as possible in all respects except for the fertilizer they receive. This minimizes the influence of lurking variables and allows the researcher to attribute any observed differences in growth directly to the fertilizer.

Bias in Surveys: A local newspaper runs an online poll asking readers, "Should the city increase taxes to fund a new sports stadium?"
85% of respondents voted "No." The newspaper concludes that the majority of citizens oppose the tax increase.
(a) Identify a potential source of bias in this survey.
Voluntary response bias. People with strong opinions (especially negative ones) are more likely to participate in online polls, leading to a sample that is not representative of the general population.
(b) How could the newspaper obtain a more reliable estimate of public opinion?
They should use a random sampling method, such as a simple random sample or a stratified random sample, of registered voters or residents. Because selection is random rather than self-chosen, the sample is far more likely to be representative of the population.

Chapter 2: Graphical Summaries of Data

Easy Problems

Bar Graph vs. Histogram:
(a) Which type of graph is used for categorical data? Bar graph
(b) Which type of graph has bars that touch? Histogram

Data Visualization: A survey asked 20 students for their favorite type of music. The results were: Pop, Rock, Jazz, Pop, Classical, Rock, Pop, Jazz, Hip Hop, Pop, Rock, Classical, Pop, Hip Hop, Jazz, Rock, Pop, Classical, Pop, Rock.
(a) Create a frequency distribution table for this data.

| Music Type | Frequency |
|---|---|
| Pop | 7 |
| Rock | 5 |
| Jazz | 3 |
| Classical | 3 |
| Hip Hop | 2 |
| Total | 20 |

(b) What percentage of students chose Pop music?
$\frac{7}{20} = 35\%$

Difficult Problems

Misleading Graphs: A company wants to show that its sales have increased significantly. They create a bar graph where the y-axis starts at $90,000 instead of 0.
(a) Explain why this graph might be misleading.
Starting the y-axis at $90,000 exaggerates the visual difference between the bars, making a small increase appear much larger than it actually is. This violates the Area Principle.
(b) How could the company create a more accurate representation of their sales data?
The y-axis should start at 0, or a clear break should be indicated if the axis does not start at 0, to accurately represent the relative magnitudes of the sales figures.
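
The frequency table for the music survey earlier in this chapter can also be tallied programmatically. The following is a minimal Python sketch, assuming the 20 responses are typed in exactly as listed in the problem; the variable names (`responses`, `counts`, `total`) are illustrative and not part of the original exercise.

```python
from collections import Counter

# Responses copied from the music survey problem, in the order listed.
responses = [
    "Pop", "Rock", "Jazz", "Pop", "Classical", "Rock", "Pop", "Jazz",
    "Hip Hop", "Pop", "Rock", "Classical", "Pop", "Hip Hop", "Jazz",
    "Rock", "Pop", "Classical", "Pop", "Rock",
]

counts = Counter(responses)   # frequency of each music type
total = len(responses)        # number of students surveyed

# Print each category with its frequency and relative frequency.
for music_type, freq in counts.most_common():
    print(f"{music_type:<10} {freq:>2}  {100 * freq / total:.1f}%")
```

Tallying by machine is a useful check on a hand count: the frequencies must add up to the number of students surveyed, and for this list Pop accounts for 7 of the 20 responses (35%).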
Stem-and-Leaf Plot & Shape: Create a stem-and-leaf plot for the following test scores and describe the shape of the distribution: 78, 85, 92, 70, 65, 88, 95, 72, 81, 60, 90, 75, 83, 68, 79, 87, 80, 74, 91, 84.
Stem-and-Leaf Plot:
6 | 0 5 8
7 | 0 2 4 5 8 9
8 | 0 1 3 4 5 7 8
9 | 0 1 2 5
Shape: The distribution appears to be roughly symmetric, possibly slightly skewed to the left due to the longer tail on the lower end, but generally bell-shaped. There are no obvious outliers.

Chapter 3: Numerical Summaries of Data

Easy Problems

Mean, Median, Mode: Find the mean, median, and mode for the following dataset: 10, 12, 15, 12, 18, 13, 12. (Sorted: 10, 12, 12, 12, 13, 15, 18)
Mean: $(10+12+15+12+18+13+12)/7 = 92/7 \approx 13.14$
Median: 12 (the middle value)
Mode: 12 (appears most frequently)

Range: What is the range of the dataset: 5, 8, 3, 12, 7, 10?
Range = Max - Min = $12 - 3 = 9$.

Z-score Calculation: A data point $x=70$ comes from a distribution with mean $\mu=60$ and standard deviation $\sigma=5$. Calculate its z-score.
$z = \frac{x - \mu}{\sigma} = \frac{70 - 60}{5} = \frac{10}{5} = 2$.

Difficult Problems

Mean vs. Median for Skewed Data: A small company has 10 employees. Their annual salaries are: $30,000, $32,000, $35,000, $35,000, $38,000, $40,000, $42,000, $45,000, $50,000, $200,000.
(a) Calculate the mean and median salary.
Mean: Sum $= 547,000$, $n=10$. Mean $= 547,000/10 = \$54,700$.
Median: (Sorted: $30k, 32k, 35k, 35k, 38k, 40k, 42k, 45k, 50k, 200k$). Median $= (38,000 + 40,000)/2 = \$39,000$.
(b) Which measure of center (mean or median) better represents a typical employee's salary? Explain why.
The median ($39,000) better represents a typical employee's salary. The mean ($54,700) is highly influenced by the single outlier ($200,000 salary), pulling it upwards and making it unrepresentative of most employees' earnings. This distribution is right-skewed.

Standard Deviation Interpretation: Two classes, A and B, took the same test.
Both classes had an average score of 75. Class A had a standard deviation of 5, while Class B had a standard deviation of 15.
(a) Which class had more consistent scores?
Class A. A smaller standard deviation (5) indicates that the scores in Class A were more tightly clustered around the mean, meaning they were more consistent.
(b) If a student scored 80 in each class, which student performed relatively better compared to their classmates?
Calculate z-scores:
Class A: $z_A = \frac{80 - 75}{5} = 1$.
Class B: $z_B = \frac{80 - 75}{15} \approx 0.33$.
The student in Class A performed relatively better. Their score of 80 was 1 standard deviation above the mean, while the student in Class B was only about 0.33 standard deviations above the mean.

Five-Number Summary & Outliers (using IQR rule): For the dataset: 10, 12, 15, 18, 20, 22, 25, 28, 30, 60.
(a) Calculate the five-number summary.
Min = 10
$Q_1$: Position is $(10+1)/4 = 2.75$, so between 2nd and 3rd value. $Q_1 = 12 + 0.75(15-12) = 12 + 2.25 = 14.25$.
Median: Position is $(10+1)/2 = 5.5$, so average of 5th and 6th value. Median $= (20+22)/2 = 21$.
$Q_3$: Position is $3(10+1)/4 = 8.25$, so between 8th and 9th value. $Q_3 = 28 + 0.25(30-28) = 28 + 0.5 = 28.5$.
Max = 60
Five-number summary: (10, 14.25, 21, 28.5, 60).
(b) Identify any outliers using the $1.5 \times IQR$ rule.
$IQR = Q_3 - Q_1 = 28.5 - 14.25 = 14.25$.
Lower fence = $Q_1 - 1.5 \times IQR = 14.25 - 1.5 \times 14.25 = 14.25 - 21.375 = -7.125$.
Upper fence = $Q_3 + 1.5 \times IQR = 28.5 + 1.5 \times 14.25 = 28.5 + 21.375 = 49.875$.
Any value below -7.125 or above 49.875 is an outlier. The value 60 is an outlier.

Chapter 4: Probability

Easy Problems

Basic Probability: A bag contains 5 red, 3 blue, and 2 green marbles. If you pick one marble at random:
(a) What is the probability of picking a red marble?
$P(\text{Red}) = 5/10 = 0.5$.
(b) What is the probability of picking a blue or green marble?
$P(\text{Blue or Green}) = (3+2)/10 = 5/10 = 0.5$.
(c) What is the probability of not picking a red marble?
$P(\text{Not Red}) = 1 - P(\text{Red}) = 1 - 0.5 = 0.5$.

Mutually Exclusive vs. Independent:
(a) Rolling a 3 and rolling a 4 on a single die roll are mutually exclusive events.
(b) Flipping a coin and getting heads, and then flipping it again and getting heads, are independent events.

Conditional Probability (Simple): Given $P(A)=0.4$, $P(B)=0.5$, and $P(A \text{ and } B)=0.2$. Find $P(A|B)$.
$P(A|B) = P(A \text{ and } B) / P(B) = 0.2 / 0.5 = 0.4$.

Difficult Problems

Two-Way Table and Probability: A company surveyed 200 employees about their preferred work environment (office or remote) and their age group.

| Age group | Office | Remote | Total |
|---|---|---|---|
| Under 30 | 30 | 50 | 80 |
| 30-50 | 60 | 40 | 100 |
| Over 50 | 15 | 5 | 20 |
| Total | 105 | 95 | 200 |

(a) What is the probability that a randomly selected employee prefers remote work?
$P(\text{Remote}) = 95/200 = 0.475$.
(b) What is the probability that an employee is under 30 AND prefers office work?
$P(\text{Under 30 and Office}) = 30/200 = 0.15$.
(c) What is the probability that an employee prefers remote work GIVEN they are over 50?
$P(\text{Remote | Over 50}) = 5/20 = 0.25$.
(d) Are preferring remote work and being under 30 independent events? Justify your answer.
$P(\text{Remote}) = 95/200 = 0.475$. $P(\text{Remote | Under 30}) = 50/80 = 0.625$. Since $P(\text{Remote}) \ne P(\text{Remote | Under 30})$, the events are not independent. (Being under 30 increases the probability of preferring remote work.)

Bayes' Theorem: A medical test for a rare disease ($D$) has a 98% accuracy rate (i.e., $P(\text{Positive | D}) = 0.98$ and $P(\text{Negative | No D}) = 0.98$). The disease affects 1% of the population ($P(D) = 0.01$). If a person tests positive, what is the probability they actually have the disease ($P(D \text{ | Positive})$)?
$P(D) = 0.01 \implies P(\text{No D}) = 0.99$.
$P(\text{Positive | D}) = 0.98 \implies P(\text{Negative | D}) = 0.02$.
$P(\text{Negative | No D}) = 0.98 \implies P(\text{Positive | No D}) = 0.02$.
$P(\text{Positive}) = P(\text{Positive | D})P(D) + P(\text{Positive | No D})P(\text{No D}) = (0.98)(0.01) + (0.02)(0.99) = 0.0098 + 0.0198 = 0.0296$.
$P(D \text{ | Positive}) = \frac{P(\text{Positive | D})P(D)}{P(\text{Positive})} = \frac{(0.98)(0.01)}{0.0296} = \frac{0.0098}{0.0296} \approx 0.331$.
(Even with a positive test, the probability of having the disease is only about 33.1%, due to its rarity.)

Chapter 5: Discrete Probability Distributions

Easy Problems

Valid Probability Distribution: Which of the following is a valid probability distribution?
(a) $P(X=0)=0.2, P(X=1)=0.3, P(X=2)=0.4, P(X=3)=0.1$. Valid (all $P(x) \ge 0$, sum = 1).
(b) $P(X=0)=0.3, P(X=1)=0.4, P(X=2)=0.5$. Invalid (sum $> 1$).
(c) $P(X=0)=0.6, P(X=1)=0.3, P(X=2)=-0.1$. Invalid (negative probability).

Expected Value: A game involves rolling a fair six-sided die. If you roll a 6, you win $10. If you roll any other number, you lose $1. What is the expected value of playing this game?
$P(\text{Win}) = 1/6$, $P(\text{Lose}) = 5/6$.
$E(X) = (10)(1/6) + (-1)(5/6) = 10/6 - 5/6 = 5/6 \approx \$0.83$.

Difficult Problems

Binomial Probability: A fair coin is flipped 8 times.
(a) What is the probability of getting exactly 5 heads?
$P(X=5) = \binom{8}{5} (0.5)^5 (0.5)^{8-5} = \binom{8}{5} (0.5)^8 = 56 \times 0.00390625 = 0.21875$.
(b) What is the probability of getting at most 2 heads?
$P(X \le 2) = P(X=0) + P(X=1) + P(X=2)$.
$P(X=0) = \binom{8}{0} (0.5)^8 = 1 \times 0.00390625 = 0.00390625$.
$P(X=1) = \binom{8}{1} (0.5)^8 = 8 \times 0.00390625 = 0.03125$.
$P(X=2) = \binom{8}{2} (0.5)^8 = 28 \times 0.00390625 = 0.109375$.
$P(X \le 2) = 0.00390625 + 0.03125 + 0.109375 = 0.14453125$.

Expected Value & Variance: A discrete random variable $X$ has the following probability distribution:

| $x$ | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| $P(x)$ | 0.1 | 0.3 | 0.4 | 0.2 |

(a) Find the expected value $E(X)$.
$E(X) = (0)(0.1) + (1)(0.3) + (2)(0.4) + (3)(0.2) = 0 + 0.3 + 0.8 + 0.6 = 1.7$.
(b) Find the variance $Var(X)$ and standard deviation $\sigma_X$.
$E(X^2) = (0^2)(0.1) + (1^2)(0.3) + (2^2)(0.4) + (3^2)(0.2) = 0 + 0.3 + 1.6 + 1.8 = 3.7$.
$Var(X) = E(X^2) - [E(X)]^2 = 3.7 - (1.7)^2 = 3.7 - 2.89 = 0.81$.
$\sigma_X = \sqrt{0.81} = 0.9$.

Chapter 6: The Normal Distribution

Easy Problems

Empirical Rule: A population has a normal distribution with mean $\mu=50$ and standard deviation $\sigma=5$.
(a) Approximately what percentage of data falls between 45 and 55? 68% (within 1 standard deviation of the mean).
(b) Approximately what percentage of data falls between 40 and 60? 95% (within 2 standard deviations of the mean).

Z-score to Percentile: A student's score on a test has a z-score of 1.5. Assuming test scores are normally distributed, what percentile is this student in? (Use a Z-table or calculator.)
$P(Z < 1.5) \approx 0.9332$, so the student is at approximately the 93.32nd percentile.

Difficult Problems

Normal Probability (reverse): The heights of adult males are normally distributed with a mean of 69 inches and a standard deviation of 2.5 inches.
(a) What height separates the tallest 10% of men from the rest?
We need to find $x$ such that $P(X > x) = 0.10$, which means $P(X < x) = 0.90$. The z-score with 90% of the area to its left is $z \approx 1.28$, so $x = \mu + z\sigma = 69 + 1.28(2.5) = 72.2$ inches.
(b) What is the probability that a randomly selected man is between 68 and 72 inches tall?
$z_{68} = \frac{68 - 69}{2.5} = -0.4$. $z_{72} = \frac{72 - 69}{2.5} = 1.2$.
$P(68 < X < 72) = P(-0.4 < Z < 1.2) = P(Z < 1.2) - P(Z < -0.4) \approx 0.8849 - 0.3446 = 0.5403$.

Central Limit Theorem Application: The average weight of a certain type of fruit is 200 grams with a standard deviation of 15 grams. A sample of 36 fruits is randomly selected.
(a) What is the probability that the sample mean weight is less than 195 grams?
$\mu_{\bar{x}} = 200$. $\sigma_{\bar{x}} = \frac{15}{\sqrt{36}} = \frac{15}{6} = 2.5$. $z = \frac{195 - 200}{2.5} = \frac{-5}{2.5} = -2$.
$P(\bar{X} < 195) = P(Z < -2) \approx 0.0228$.
(b) What is the probability that the sample mean weight is between 198 and 203 grams?
$z_{198} = \frac{198 - 200}{2.5} = -0.8$.
$z_{203} = \frac{203 - 200}{2.5} = 1.2$.
$P(198 < \bar{X} < 203) = P(-0.8 < Z < 1.2) = P(Z < 1.2) - P(Z < -0.8) \approx 0.8849 - 0.2119 = 0.6730$.

Chapter 7: Confidence Intervals

Easy Problems

Critical Values: Find the critical value ($z_{\alpha/2}$ or $t_{\alpha/2}$) for each:
(a) 90% CI for mean, $\sigma$ known: $z_{0.05} = 1.645$.
(b) 95% CI for mean, $n=20$, $\sigma$ unknown: $df = 19$, $t_{0.025} = 2.093$.
(c) 99% CI for proportion: $z_{0.005} = 2.576$.

Interpreting CI: A 95% confidence interval for the average height of adult women is (63 inches, 65 inches).
(a) Would a 99% confidence interval for the same data be wider or narrower?
Wider (higher confidence requires a larger interval).
(b) Does this interval mean that 95% of adult women have heights between 63 and 65 inches?
No. It means we are 95% confident that the *true average height* of adult women falls within this range.

Difficult Problems

CI for Mean (unknown $\sigma$): A sample of 30 bags of chips has a mean weight of 14.8 ounces and a standard deviation of 0.5 ounces. Construct a 90% confidence interval for the true mean weight of all chip bags.
$\bar{x} = 14.8$, $s = 0.5$, $n = 30$. $df = 29$.
For 90% CI, $t_{\alpha/2} = t_{0.05, 29} = 1.699$.
Standard Error = $s/\sqrt{n} = 0.5/\sqrt{30} \approx 0.5/5.477 \approx 0.0913$.
Margin of Error = $1.699 \times 0.0913 \approx 0.155$.
CI = $14.8 \pm 0.155 = (14.645, 14.955)$.

Minimum Sample Size for Mean: A researcher wants to estimate the mean commute time for employees in a city. They want to be 95% confident that their estimate is within 2 minutes of the true mean. Assume the population standard deviation is 8 minutes. What is the minimum sample size required?
$ME = 2$, $z_{\alpha/2} = 1.96$ (for 95% CI), $\sigma = 8$.
Formula: $n = (\frac{z_{\alpha/2} \times \sigma}{ME})^2 = (\frac{1.96 \times 8}{2})^2 = (1.96 \times 4)^2 = (7.84)^2 \approx 61.47$.
Always round up: $n = 62$.

Chapter 8: Hypothesis Testing

Easy Problems

Null and Alternative Hypotheses: State the null and alternative hypotheses for each scenario:
(a) A company claims its new light bulbs last longer than 1000 hours.
$H_0: \mu = 1000$ hours
$H_1: \mu > 1000$ hours
(b) A politician claims that less than 30% of voters support her opponent.
$H_0: p = 0.30$
$H_1: p < 0.30$

Decision based on P-value: For each p-value and significance level ($\alpha$), state whether you reject or fail to reject $H_0$:
(a) p-value = 0.03, $\alpha = 0.05$: Reject $H_0$ (p-value $\le \alpha$).
(b) p-value = 0.06, $\alpha = 0.05$: Fail to reject $H_0$ (p-value $> \alpha$).
(c) p-value = 0.001, $\alpha = 0.01$: Reject $H_0$ (p-value $\le \alpha$).

Difficult Problems

Hypothesis Test for Proportion: A manufacturer claims that at most 5% of their products are defective. A quality control manager samples 200 products and finds 15 defective ones. Test the manufacturer's claim at the 0.01 significance level.
$H_0: p = 0.05$ (or $p \le 0.05$)
$H_1: p > 0.05$
$\hat{p} = 15/200 = 0.075$. $p_0 = 0.05$. $n = 200$.
Test Statistic: $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{0.075 - 0.05}{\sqrt{\frac{0.05(0.95)}{200}}} = \frac{0.025}{\sqrt{\frac{0.0475}{200}}} = \frac{0.025}{\sqrt{0.0002375}} \approx \frac{0.025}{0.01541} \approx 1.62$.
P-value: For a right-tailed test, $P(Z > 1.62) \approx 1 - 0.9474 = 0.0526$.
Decision: Since p-value ($0.0526$) $> \alpha$ ($0.01$), we fail to reject $H_0$.
Conclusion: There is not sufficient evidence at the 0.01 significance level to conclude that the proportion of defective products is greater than 5%. The manufacturer's claim cannot be rejected.

Hypothesis Test for Mean (unknown $\sigma$): A new teaching method is introduced, and 25 students are tested. Their average score is 82 with a standard deviation of 10. The historical average score for this subject is 78. Does the new method significantly improve scores at the 0.05 significance level?
$H_0: \mu = 78$
$H_1: \mu > 78$
$\bar{x} = 82$, $s = 10$, $n = 25$. $df = 24$.
Test Statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{82 - 78}{10/\sqrt{25}} = \frac{4}{10/5} = \frac{4}{2} = 2$.
P-value: For a right-tailed test, $P(T > 2)$ with $df=24$. Using a t-table, $P(T > 2)$ is between 0.025 and 0.05 (exact value $\approx 0.0285$).
Decision: Since p-value ($0.0285$) $\le \alpha$ ($0.05$), we reject $H_0$.
Conclusion: There is sufficient evidence at the 0.05 significance level to conclude that the new teaching method significantly improves scores.

Chapter 11: Correlation and Regression

Easy Problems

Correlation Direction:
(a) As temperature increases, ice cream sales increase. This is a positive correlation.
(b) As the number of hours spent studying increases, the number of hours spent watching TV decreases. This is a negative correlation.
(c) Shoe size and GPA. This likely has no (or very weak) correlation.

Interpreting Slope and Y-intercept: A regression equation is $\hat{y} = 10 + 2x$, where $\hat{y}$ is the predicted sales and $x$ is advertising spending (in thousands of dollars).
(a) Interpret the slope.
For every additional $1,000 spent on advertising, the predicted sales increase by 2 units.
(b) Interpret the y-intercept.
When advertising spending is $0, the predicted sales are 10 units. (Note: Y-intercept interpretation only makes sense if $x=0$ is within the range of observed data.)

Difficult Problems

Residuals and Best Fit: A researcher fits a regression line to a dataset. They calculate the residuals.
(a) What property must the sum of residuals always have for a least-squares regression line?
The sum of the residuals ($\sum e_i$) for a least-squares regression line is always zero.
(b) If a scatterplot of residuals vs. predicted values shows a clear pattern (e.g., a curve or a fan shape), what does this suggest about the linear model?
A pattern in the residual plot suggests that a linear model is not appropriate for the data. It indicates that there is non-linear structure in the relationship between $X$ and $Y$ that the linear model failed to capture.
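
The zero-sum property of least-squares residuals can be checked numerically. The sketch below is only an illustration: the five data points are hypothetical (made up for this example, not taken from any problem above), and the slope and intercept are computed with the standard least-squares formulas $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{y} - b_1\bar{x}$.

```python
# Hypothetical illustrative data (not from the text).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual = observed y minus predicted y.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# The sum is zero up to floating-point rounding error.
print(f"sum of residuals = {sum(residuals):.2e}")
```

Because the sum is always zero for a least-squares line, it cannot be used to judge fit; that is why the residual plot in part (b) is examined for patterns instead.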
Regression Equation from Summary Statistics: Given: $\bar{x}=10, s_x=2, \bar{y}=50, s_y=8$, and correlation coefficient $r=0.75$. Find the equation of the least-squares regression line.
Slope $b_1 = r \frac{s_y}{s_x} = 0.75 \times \frac{8}{2} = 0.75 \times 4 = 3$.
Y-intercept $b_0 = \bar{y} - b_1 \bar{x} = 50 - 3(10) = 50 - 30 = 20$.
Regression equation: $\hat{y} = 20 + 3x$.

Extrapolation and Causation: A study found a strong positive correlation between the number of hours a student spends on social media per week ($x$) and their average test score ($y$). The regression line was $\hat{y} = 70 + 0.5x$.
(a) A student spends 30 hours a week on social media. Predict their test score.
$\hat{y} = 70 + 0.5(30) = 70 + 15 = 85$.
(b) Would it be appropriate to use this model to predict the score of a student who spends 100 hours a week on social media? Explain.
No, it would likely be inappropriate. Predicting outside the range of the observed $x$ values (extrapolation) can be unreliable. The relationship might not remain linear for extremely high social media usage, and 100 hours is likely far beyond typical observed values.
(c) Can we conclude that spending more time on social media causes higher test scores? Why or why not?
No, correlation does not imply causation. There could be confounding variables or other explanations. For example, highly motivated students might be better at managing their time (including social media) and also perform well on tests, or perhaps students who use social media for academic collaboration might perform better. The regression shows an association, not a causal link.
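
The slope and intercept formulas from the summary-statistics problem above, $b_1 = r \frac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1\bar{x}$, can be wrapped in a small helper. This is a minimal sketch; the function name `line_from_summary` and its parameter names are illustrative choices, not part of the original problem.

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (intercept, slope) of the least-squares line from summary statistics."""
    slope = r * s_y / s_x                # b1 = r * s_y / s_x
    intercept = y_bar - slope * x_bar    # b0 = y_bar - b1 * x_bar
    return intercept, slope


# Values from the "Regression Equation from Summary Statistics" problem.
b0, b1 = line_from_summary(x_bar=10, s_x=2, y_bar=50, s_y=8, r=0.75)
print(f"y_hat = {b0:g} + {b1:g}x")  # prints "y_hat = 20 + 3x"
```

The helper only needs the five summary statistics, which shows that the least-squares line always passes through the point $(\bar{x}, \bar{y})$ and that its slope has the same sign as $r$.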