### Problem 1: Lung Cancer & Coffee Study

This problem involves analyzing stratified data from two studies (smokers and non-smokers) on lung cancer risk among coffee drinkers and non-coffee drinkers. The goal is to calculate a p-value for the combined data using an appropriate statistical test, focusing on the coffee-lung cancer association.

#### 1.1 Data Overview

**Table 1: Stratified data among smokers**

| Group                | Coffee Drinkers | Non-Coffee Drinkers | Total |
|----------------------|-----------------|---------------------|-------|
| Lung Cancer Cases    | 80              | 40                  | 120   |
| No Lung Cancer Cases | 120             | 60                  | 180   |
| Total                | 200             | 100                 | 300   |

**Table 2: Stratified data among non-smokers**

| Group                | Coffee Drinkers | Non-Coffee Drinkers | Total |
|----------------------|-----------------|---------------------|-------|
| Lung Cancer Cases    | 10              | 30                  | 40    |
| No Lung Cancer Cases | 90              | 570                 | 660   |
| Total                | 100             | 600                 | 700   |

### Problem 1 Solution: Mantel-Haenszel Test

To calculate a p-value for the combined (original and adjusted) tables in a stratified analysis that controls for a confounding variable (smoking status), the **Mantel-Haenszel (MH) test** is appropriate. This test combines information from several 2x2 tables.

#### 1.1 Building Separate Tables

First, we need to extract the 2x2 tables for each stratum (smokers and non-smokers).
**Stratum 1: Smokers**

|           | Lung Cancer (Case) | No Lung Cancer (Control) | Total          |
|-----------|--------------------|--------------------------|----------------|
| Coffee    | $a_1 = 80$         | $b_1 = 120$              | $N_{11} = 200$ |
| No Coffee | $c_1 = 40$         | $d_1 = 60$               | $N_{10} = 100$ |
| Total     | $M_{11} = 120$     | $M_{10} = 180$           | $N_1 = 300$    |

**Stratum 2: Non-Smokers**

|           | Lung Cancer (Case) | No Lung Cancer (Control) | Total          |
|-----------|--------------------|--------------------------|----------------|
| Coffee    | $a_2 = 10$         | $b_2 = 90$               | $N_{21} = 100$ |
| No Coffee | $c_2 = 30$         | $d_2 = 570$              | $N_{20} = 600$ |
| Total     | $M_{21} = 40$      | $M_{20} = 660$           | $N_2 = 700$    |

#### 1.2 Mantel-Haenszel Test Calculation

The Mantel-Haenszel test statistic ($X_{MH}^2$) is given by:

$$X_{MH}^2 = \frac{\left( \sum_i a_i - \sum_i E(a_i) \right)^2}{\sum_i Var(a_i)}$$

where, for each stratum $i$:

- $E(a_i) = \frac{(a_i+b_i)(a_i+c_i)}{N_i}$
- $Var(a_i) = \frac{(a_i+b_i)(c_i+d_i)(a_i+c_i)(b_i+d_i)}{N_i^2(N_i-1)}$

Let's calculate these values for each stratum:

**Stratum 1 (Smokers):**

- $a_1 = 80$, $b_1 = 120$, $c_1 = 40$, $d_1 = 60$
- $N_1 = 300$
- $E(a_1) = \frac{(80+120)(80+40)}{300} = \frac{200 \times 120}{300} = 80$
- $Var(a_1) = \frac{(200)(100)(120)(180)}{300^2(300-1)} = \frac{432,000,000}{90,000 \times 299} \approx 16.05$

**Stratum 2 (Non-Smokers):**

- $a_2 = 10$, $b_2 = 90$, $c_2 = 30$, $d_2 = 570$
- $N_2 = 700$
- $E(a_2) = \frac{(10+90)(10+30)}{700} = \frac{100 \times 40}{700} \approx 5.714$
- $Var(a_2) = \frac{(100)(600)(40)(660)}{700^2(700-1)} = \frac{1,584,000,000}{490,000 \times 699} \approx 4.62$

**Summing across strata:**

- $\sum a_i = 80 + 10 = 90$
- $\sum E(a_i) = 80 + 5.714 = 85.714$
- $\sum Var(a_i) = 16.05 + 4.62 = 20.67$

**Mantel-Haenszel Test Statistic:**

$$X_{MH}^2 = \frac{(90 - 85.714)^2}{20.67} = \frac{(4.286)^2}{20.67} = \frac{18.37}{20.67} \approx 0.89$$

#### 1.3 P-value Interpretation

The $X_{MH}^2$ statistic follows a chi-squared distribution with 1 degree of freedom.
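The hand calculation in section 1.2 can be cross-checked with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `mantel_haenszel_chi2` and the `(a, b, c, d)` tuple layout are illustrative choices, no continuity correction is applied, and the p-value uses the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math


def mantel_haenszel_chi2(tables):
    """Uncorrected Mantel-Haenszel statistic for a list of 2x2 tables.

    Each table is (a, b, c, d) with rows = exposure (coffee / no coffee)
    and columns = outcome (case / control), as in section 1.1.
    """
    sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        sum_a += a
        sum_e += row1 * col1 / n                      # E(a_i)
        sum_v += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))  # Var(a_i)
    stat = (sum_a - sum_e) ** 2 / sum_v
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Stratum 1 (smokers) and stratum 2 (non-smokers).
strata = [(80, 120, 40, 60), (10, 90, 30, 570)]
stat, p_value = mantel_haenszel_chi2(strata)
print(f"X2_MH = {stat:.3f}, p = {p_value:.3f}")
```

Working from unrounded intermediate values, the script should agree with the rounded hand calculation to about two decimal places.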
- For $X_{MH}^2 \approx 0.89$ with 1 degree of freedom, the p-value is approximately $P(\chi^2_1 > 0.89) \approx 0.345$.

**Conclusion for Problem 1:** With a p-value of approximately 0.345, well above common significance levels (e.g., 0.05), we **fail to reject the null hypothesis**. This suggests that, after controlling for smoking status, there is no statistically significant association between coffee drinking and lung cancer risk based on this combined data.

#### 1.4 Odds Ratio (Optional but useful for context)

The Mantel-Haenszel odds ratio ($OR_{MH}$) can also be calculated:

$$OR_{MH} = \frac{\sum_i (a_i d_i / N_i)}{\sum_i (b_i c_i / N_i)}$$

Numerator terms ($a_i d_i / N_i$):

- Stratum 1: $(80 \times 60) / 300 = 4800 / 300 = 16$
- Stratum 2: $(10 \times 570) / 700 = 5700 / 700 \approx 8.14$
- Numerator: $16 + 8.14 = 24.14$

Denominator terms ($b_i c_i / N_i$):

- Stratum 1: $(120 \times 40) / 300 = 4800 / 300 = 16$
- Stratum 2: $(90 \times 30) / 700 = 2700 / 700 \approx 3.86$
- Denominator: $16 + 3.86 = 19.86$

$OR_{MH} = 24.14 / 19.86 \approx 1.215$. An odds ratio this close to 1 is consistent with the conclusion of no significant association.

### Problem 2: Stationary Point Process

This problem asks why "moderate patients" tend to dominate in lung cancer case-control studies, especially when considering a stationary point process and interarrival times. The core of the problem lies in understanding the concept of **length-biased sampling**, also known as the **inspection paradox**.

#### 2.1 Key Concepts

- **Stationary Point Process:** A stochastic process whose statistical properties do not change over time. A homogeneous Poisson process is a common example.
- **Interarrival Intervals ($X_i$):** The time between consecutive events in a point process. For a Poisson process, these are exponentially distributed.
- **Random Interarrival Time ($X^*$):** The interval that *contains* a fixed deterministic point $t$.
- **"Moderate Patients":** In the context of case-control studies for diseases, this often refers to cases where the disease duration or exposure is neither very short nor extremely long, but rather "average" or "moderate".

#### 2.2 Explanation of the Hint

The hint guides us to show that $X^*$ (the interval containing a fixed point $t$) is stochastically larger than a typical interarrival time $X_i$. This is the mathematical formulation of the inspection paradox.

**Inspection Paradox:** If you sample intervals from a stationary point process by picking a random point in time and observing which interval it falls into, you are more likely to pick a longer interval. This is because longer intervals occupy more "time" and thus have a higher probability of containing a randomly chosen point.

#### 2.3 Proof of $X^*$ being stochastically larger than $X_i$

Let the interarrival times $X_i$ be i.i.d. random variables with mean $\mu = E[X_i]$. The probability density function (PDF) of $X^*$ is given by:

$$f_{X^*}(x) = \frac{x}{\mu} f_X(x)$$

where $f_X(x)$ is the PDF of a typical interarrival time $X_i$.

**Intuition:** The term $x/\mu$ acts as a weighting factor. It means that the probability of observing an interval of length $x$ (when sampling by a random point) is proportional to its length. Longer intervals are "over-represented" in this sampling scheme.

**Expected Value:**

$$E[X^*] = \int_0^\infty x f_{X^*}(x)\, dx = \int_0^\infty x \frac{x}{\mu} f_X(x)\, dx = \frac{1}{\mu} \int_0^\infty x^2 f_X(x)\, dx = \frac{E[X^2]}{\mu}$$

We know that $Var(X) = E[X^2] - (E[X])^2$, so $E[X^2] = Var(X) + \mu^2$. Therefore,

$$E[X^*] = \frac{Var(X) + \mu^2}{\mu} = \frac{Var(X)}{\mu} + \mu.$$

Since $Var(X) \ge 0$, it follows that $E[X^*] \ge \mu = E[X_i]$. If $Var(X) > 0$ (i.e., the interarrival times are not constant), then $E[X^*] > E[X_i]$.

This shows that the expected length of the interval containing a random point is greater than the expected length of an arbitrary interval.
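The identity $E[X^*] = \mu + Var(X)/\mu$ can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not part of the original solution: it assumes exponential interarrival times with mean 2 (so the identity predicts a mean covering interval of 4), and the function name, horizon, number of inspection points, and seed are all arbitrary choices.

```python
import bisect
import random


def simulate_covering_intervals(mean_gap=2.0, horizon=200_000.0,
                                n_points=20_000, seed=0):
    """Return (all gaps, gaps that cover randomly chosen inspection times)."""
    rng = random.Random(seed)
    # Build the arrival times of a renewal process with exponential gaps.
    arrivals = [0.0]
    while arrivals[-1] < horizon:
        arrivals.append(arrivals[-1] + rng.expovariate(1.0 / mean_gap))
    gaps = [t1 - t0 for t0, t1 in zip(arrivals, arrivals[1:])]

    covering = []
    for _ in range(n_points):
        # Inspect away from the edges so start-up effects are negligible.
        t = rng.uniform(0.1 * horizon, 0.9 * horizon)
        j = bisect.bisect_right(arrivals, t)  # arrivals[j-1] <= t < arrivals[j]
        covering.append(arrivals[j] - arrivals[j - 1])
    return gaps, covering


gaps, covering = simulate_covering_intervals()
mean_gap_hat = sum(gaps) / len(gaps)
mean_cov_hat = sum(covering) / len(covering)
# Empirical tail probabilities at x = 2, to compare P(X > x) with P(X* > x).
tail_gap = sum(g > 2.0 for g in gaps) / len(gaps)
tail_cov = sum(g > 2.0 for g in covering) / len(covering)
print(f"mean typical gap  = {mean_gap_hat:.2f}")
print(f"mean covering gap = {mean_cov_hat:.2f}")
print(f"P(X > 2) ~ {tail_gap:.2f}, P(X* > 2) ~ {tail_cov:.2f}")
```

For exponential gaps the covering interval is on average about twice as long as a typical one, which is the largest effect among the usual textbook cases; a distribution with smaller variance would show a smaller gap.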
The inequality of means is a consequence of stochastic ordering, not its definition: $X^*$ is stochastically larger than $X_i$ when $P(X^* > x) \ge P(X_i > x)$ for every $x$. This stronger statement also follows from the length-biased density. For any $x$ with $P(X > x) > 0$:

$$P(X^* > x) = \int_x^\infty \frac{y}{\mu} f_X(y)\, dy = \frac{E[X \mid X > x]}{\mu}\, P(X > x) \ge P(X > x),$$

because $E[X \mid X > x] \ge E[X] = \mu$.

#### 2.4 Application to Lung Cancer Case-Control Studies

In case-control studies, particularly those involving disease duration or exposure periods, if the "cases" are identified by being present at a specific point in time (e.g., diagnosed and alive at the time of the study), this is analogous to the point $t$ falling within an interval.

- **Disease Duration as "Interval":** If we consider the duration of the disease (from onset to diagnosis/outcome) as an "interval," then patients who are "cases" at the time of the study are effectively sampled in a length-biased way.
- **"Moderate" Patients:** Patients with very short disease durations might be missed because they resolve quickly or die before being recruited. Patients with extremely long durations might be rare. The length-biased sampling means that patients with "moderate" or "average" disease durations (longer than the typical short ones) are more likely to be observed and enrolled in the study. This creates an apparent "dominance" of moderate patients, not because they are inherently more common, but because the sampling method preferentially selects them.

**Conclusion for Problem 2:** The dominance of "moderate patients" in lung cancer case-control studies can be explained by the inspection paradox (or length-biased sampling). When recruitment involves identifying cases existing at a specific time, those with longer disease durations (or exposure periods) are more likely to be "caught" by the sampling process. The mathematical proof shows that the interval containing a random point is stochastically larger, and in particular longer on average, than a typical interval, which leads to an over-representation of "moderate" to longer-duration cases.

### Problem 3: College Admissions & Gender Bias

This problem examines potential gender bias in college admissions, first overall and then stratified by department, using statistical tests.
#### 3.1 Overall Admission Outcomes by Gender

**Table 3: Admission outcomes by gender**

|          | Men        | Women      | Overall Total |
|----------|------------|------------|---------------|
| Admitted | 3714 (44%) | 1512 (35%) | 5226          |
| Rejected | 4728       | 2809       | 7537          |
| Total    | 8442       | 4321       | 12763         |

*Note: The table is slightly reformatted for clarity, combining admitted/rejected rows for Men/Women.*

Let's reconstruct the 2x2 table for overall gender and admission status:

|       | Admitted | Rejected | Total |
|-------|----------|----------|-------|
| Men   | 3714     | 4728     | 8442  |
| Women | 1512     | 2809     | 4321  |
| Total | 5226     | 7537     | 12763 |

#### 3.2 Overall Gender Bias Test (Chi-squared Test)

To check for gender bias overall, we can use a **Chi-squared test of independence**. The null hypothesis ($H_0$) is that admission status is independent of gender. The alternative hypothesis ($H_1$) is that admission status is dependent on gender.
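The test just described can be sketched in plain Python before working through it by hand. This is an illustrative helper, not a library API: the name `chi2_independence_2x2` and the nested-tuple input layout are assumptions, no continuity correction is applied, and the p-value again relies on $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math


def chi2_independence_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table.

    `table` is ((a, b), (c, d)) with rows = gender, columns = outcome.
    Returns (expected counts, statistic, p-value) without continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    # Expected count = row total * column total / grand total.
    expected = [[r * col / n for col in col_totals] for r in row_totals]
    observed = [[a, b], [c, d]]
    stat = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-squared tail, 1 df
    return expected, stat, p_value


# Rows: men, women. Columns: admitted, rejected (from the 2x2 table above).
overall = ((3714, 4728), (1512, 2809))
expected, stat, p_value = chi2_independence_2x2(overall)
for label, row in zip(("Men", "Women"), expected):
    print(label, [round(x, 1) for x in row])
print(f"X2 = {stat:.2f}, p = {p_value:.2e}")
```

Printing the expected counts alongside the statistic makes it easy to compare each intermediate value with a hand calculation.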
**Expected Frequencies:**

$$E_{ij} = \frac{(\text{row total}) \times (\text{column total})}{\text{grand total}}$$

- $E_{Men,Admitted} = \frac{8442 \times 5226}{12763} \approx 3456.7$
- $E_{Men,Rejected} = \frac{8442 \times 7537}{12763} \approx 4985.3$
- $E_{Women,Admitted} = \frac{4321 \times 5226}{12763} \approx 1769.3$
- $E_{Women,Rejected} = \frac{4321 \times 7537}{12763} \approx 2551.7$

**Chi-squared Test Statistic:**

$$X^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

- $X^2 = \frac{(3714 - 3456.7)^2}{3456.7} + \frac{(4728 - 4985.3)^2}{4985.3} + \frac{(1512 - 1769.3)^2}{1769.3} + \frac{(2809 - 2551.7)^2}{2551.7}$
- $X^2 = \frac{257.3^2}{3456.7} + \frac{(-257.3)^2}{4985.3} + \frac{(-257.3)^2}{1769.3} + \frac{257.3^2}{2551.7}$
- $X^2 = \frac{66203.29}{3456.7} + \frac{66203.29}{4985.3} + \frac{66203.29}{1769.3} + \frac{66203.29}{2551.7}$
- $X^2 \approx 19.15 + 13.28 + 37.42 + 25.94 \approx 95.79$

**P-value Interpretation:** For a 2x2 table, the degrees of freedom (df) are $(rows-1)(columns-1) = (2-1)(2-1) = 1$. An $X^2$ value of 95.79 with 1 df yields a very small p-value ($P(\chi^2_1 > 95.79) \ll 0.001$).

**Conclusion for Problem 3a:** The p-value is extremely small, leading us to **reject the null hypothesis** of independence. There is statistically significant evidence of an overall association between gender and admission status. Specifically, the overall admission rate for men (44%) is higher than for women (35%), which, taken at face value, suggests an overall gender bias against women.

### Problem 3b: Stratified Admissions by Department

Now we consider admissions stratified by 6 departments (A-F). This is a classic example of **Simpson's Paradox**, where the overall trend (bias against women) might reverse or disappear when the data are examined in subgroups.

**Table 4: Admissions by gender, stratified by departments**

| Dept  | All App. | All Adm. | Men App. | Men Adm. | Women App. | Women Adm. |
|-------|----------|----------|----------|----------|------------|------------|
| A     | 933      | 64%      | 825      | 62%      | 108        | 82%        |
| B     | 585      | 63%      | 560      | 63%      | 25         | 68%        |
| C     | 918      | 35%      | 325      | 37%      | 593        | 34%        |
| D     | 792      | 34%      | 417      | 33%      | 375        | 35%        |
| E     | 584      | 25%      | 191      | 28%      | 393        | 24%        |
| F     | 714      | 6%       | 373      | 6%       | 341        | 7%         |
| Total | 4526     | 39%      | 2691     | 45%      | 1835       | 30%        |

*Note: The "All Adm." column is the overall admission rate for the department. "Men Adm." and "Women Adm." are the gender-specific admission rates within that department.*

#### 3.2.1 Reconstructing 2x2 Tables for Each Department

For each department $k$, we need a 2x2 table:

|       | Admitted    | Rejected    | Total         |
|-------|-------------|-------------|---------------|
| Men   | $a_k$       | $b_k$       | $N_{k,Men}$   |
| Women | $c_k$       | $d_k$       | $N_{k,Women}$ |
| Total | $M_{k,Adm}$ | $M_{k,Rej}$ | $N_k$         |

Let's calculate $a_k, b_k, c_k, d_k$ for each department. Because the published rates are rounded to whole percentages, the reconstructed counts are approximate.

- $N_{k,Men}$ = Men App.
- $a_k = N_{k,Men} \times (\text{Men Adm. Rate})$
- $b_k = N_{k,Men} - a_k$
- $N_{k,Women}$ = Women App.
- $c_k = N_{k,Women} \times (\text{Women Adm. Rate})$
- $d_k = N_{k,Women} - c_k$

**Department A:**

- Men: App = 825, Adm Rate = 62%. $a_A = 825 \times 0.62 = 511.5 \approx 512$. $b_A = 825 - 512 = 313$.
- Women: App = 108, Adm Rate = 82%. $c_A = 108 \times 0.82 = 88.56 \approx 89$. $d_A = 108 - 89 = 19$.

| Dept A | Admitted | Rejected | Total |
|--------|----------|----------|-------|
| Men    | 512      | 313      | 825   |
| Women  | 89       | 19       | 108   |

**Department B:**

- Men: App = 560, Adm Rate = 63%. $a_B = 560 \times 0.63 = 352.8 \approx 353$. $b_B = 560 - 353 = 207$.
- Women: App = 25, Adm Rate = 68%. $c_B = 25 \times 0.68 = 17$. $d_B = 25 - 17 = 8$.

| Dept B | Admitted | Rejected | Total |
|--------|----------|----------|-------|
| Men    | 353      | 207      | 560   |
| Women  | 17       | 8        | 25    |

**Department C:**

- Men: App = 325, Adm Rate = 37%. $a_C = 325 \times 0.37 = 120.25 \approx 120$. $b_C = 325 - 120 = 205$.
- Women: App = 593, Adm Rate = 34%.
$c_C = 593 \times 0.34 = 201.62 \approx 202$. $d_C = 593 - 202 = 391$.

| Dept C | Admitted | Rejected | Total |
|--------|----------|----------|-------|
| Men    | 120      | 205      | 325   |
| Women  | 202      | 391      | 593   |

**Department D:**

- Men: App = 417, Adm Rate = 33%. $a_D = 417 \times 0.33 = 137.61 \approx 138$. $b_D = 417 - 138 = 279$.
- Women: App = 375, Adm Rate = 35%. $c_D = 375 \times 0.35 = 131.25 \approx 131$. $d_D = 375 - 131 = 244$.

| Dept D | Admitted | Rejected | Total |
|--------|----------|----------|-------|
| Men    | 138      | 279      | 417   |
| Women  | 131      | 244      | 375   |

**Department E:**

- Men: App = 191, Adm Rate = 28%. $a_E = 191 \times 0.28 = 53.48 \approx 53$. $b_E = 191 - 53 = 138$.
- Women: App = 393, Adm Rate = 24%. $c_E = 393 \times 0.24 = 94.32 \approx 94$. $d_E = 393 - 94 = 299$.

| Dept E | Admitted | Rejected | Total |
|--------|----------|----------|-------|
| Men    | 53       | 138      | 191   |
| Women  | 94       | 299      | 393   |

**Department F:**

- Men: App = 373, Adm Rate = 6%. $a_F = 373 \times 0.06 = 22.38 \approx 22$. $b_F = 373 - 22 = 351$.
- Women: App = 341, Adm Rate = 7%. $c_F = 341 \times 0.07 = 23.87 \approx 24$. $d_F = 341 - 24 = 317$.

| Dept F | Admitted | Rejected | Total |
|--------|----------|----------|-------|
| Men    | 22       | 351      | 373   |
| Women  | 24       | 317      | 341   |

#### 3.2.2 Cochran-Mantel-Haenszel (CMH) Test

As in Problem 1, the CMH test is a stratified analysis: it assesses the association between two categorical variables while controlling for a third (stratifying) variable. Here, we're testing the association between gender and admission status, controlling for department.

The CMH test statistic is:

$$X_{CMH}^2 = \frac{\left( \left| \sum_k (a_k - E(a_k)) \right| - 0.5 \right)^2}{\sum_k Var(a_k)}$$

(The 0.5 is a continuity correction, often omitted for large samples.)
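The count reconstruction from section 3.2.1 and the statistic just defined can be sketched together in Python. Everything named here (`TABLE4_PCT`, `reconstruct`, `cmh_statistic`, the `correction` flag) is a hypothetical helper introduced for illustration. Rates are stored as integer percentages and rounded half-up with integer arithmetic, which is an assumption chosen to avoid floating-point surprises such as `825 * 0.62` landing just below 511.5.

```python
import math

# (men applicants, men admit %, women applicants, women admit %) from Table 4.
TABLE4_PCT = {
    "A": (825, 62, 108, 82),
    "B": (560, 63, 25, 68),
    "C": (325, 37, 593, 34),
    "D": (417, 33, 375, 35),
    "E": (191, 28, 393, 24),
    "F": (373, 6, 341, 7),
}


def reconstruct(n_men, pct_men, n_women, pct_women):
    """Rebuild (a, b, c, d) = (men adm, men rej, women adm, women rej)."""
    a = (n_men * pct_men + 50) // 100      # round half up, integer-only
    c = (n_women * pct_women + 50) // 100
    return a, n_men - a, c, n_women - c


def cmh_statistic(tables, correction=False):
    """CMH chi-squared statistic and p-value (1 df) for 2x2 tables."""
    diff = var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        diff += a - (a + b) * (a + c) / n                      # a_k - E(a_k)
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    numerator = abs(diff) - (0.5 if correction else 0.0)
    stat = numerator ** 2 / var
    return stat, math.erfc(math.sqrt(stat / 2.0))


tables = {dept: reconstruct(*row) for dept, row in TABLE4_PCT.items()}
stat_plain, p_plain = cmh_statistic(tables.values())
stat_corr, p_corr = cmh_statistic(tables.values(), correction=True)
print(tables)
print(f"uncorrected: X2 = {stat_plain:.3f}, p = {p_plain:.3f}")
print(f"corrected:   X2 = {stat_corr:.3f}, p = {p_corr:.3f}")
```

Running both variants side by side shows how much the continuity correction matters for a given data set; with totals in the hundreds per stratum the two should be close.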
Let's calculate $E(a_k)$ and $Var(a_k)$ for each department $k$ (using the formulas from Problem 1, where $a_k$ is men admitted):

| Dept | $a_k$ | $N_{k,Men}$ | $N_{k,Women}$ | $M_{k,Adm}$ | $N_k$ | $E(a_k) = \frac{N_{k,Men} \times M_{k,Adm}}{N_k}$ | $Var(a_k) = \frac{N_{k,Men} N_{k,Women} M_{k,Adm} M_{k,Rej}}{N_k^2 (N_k-1)}$ |
|------|-------|-------------|---------------|-------------|-------|------|------|
| A | 512 | 825 | 108 | 601 | 933 | $(825 \times 601)/933 \approx 531.43$ | $(825 \times 108 \times 601 \times 332)/(933^2 \times 932) \approx 21.913$ |
| B | 353 | 560 | 25 | 370 | 585 | $(560 \times 370)/585 \approx 354.19$ | $(560 \times 25 \times 370 \times 215)/(585^2 \times 584) \approx 5.572$ |
| C | 120 | 325 | 593 | 322 | 918 | $(325 \times 322)/918 \approx 114.00$ | $(325 \times 593 \times 322 \times 596)/(918^2 \times 917) \approx 47.861$ |
| D | 138 | 417 | 375 | 269 | 792 | $(417 \times 269)/792 \approx 141.63$ | $(417 \times 375 \times 269 \times 523)/(792^2 \times 791) \approx 44.340$ |
| E | 53 | 191 | 393 | 147 | 584 | $(191 \times 147)/584 \approx 48.08$ | $(191 \times 393 \times 147 \times 437)/(584^2 \times 583) \approx 24.251$ |
| F | 22 | 373 | 341 | 46 | 714 | $(373 \times 46)/714 \approx 24.03$ | $(373 \times 341 \times 46 \times 668)/(714^2 \times 713) \approx 10.753$ |
| **Sum** | **1198** | | | | | **1213.36** | **154.69** |

- $\sum (a_k - E(a_k)) = (512-531.43) + (353-354.19) + (120-114.00) + (138-141.63) + (53-48.08) + (22-24.03)$
- $= -19.43 - 1.19 + 6.00 - 3.63 + 4.92 - 2.03 = -15.36$
- $\sum Var(a_k) = 21.913 + 5.572 + 47.861 + 44.340 + 24.251 + 10.753 \approx 154.69$

CMH Test Statistic (without the continuity correction, for simplicity):

$$X_{CMH}^2 = \frac{(-15.36)^2}{154.69} = \frac{235.93}{154.69} \approx 1.525$$

With the continuity correction the statistic is $(15.36 - 0.5)^2 / 154.69 \approx 1.427$, which leads to the same conclusion.

**Is it significant?** With $X_{CMH}^2 \approx 1.525$ and 1 degree of freedom, the p-value is $P(\chi^2_1 > 1.525) \approx 0.217$.
This p-value is well above the usual 0.05 threshold.

**Conclusion for CMH Test:** We **fail to reject the null hypothesis** of no association between gender and admission status *after controlling for department*. This indicates that, within departments, there is no statistically significant evidence of gender bias.

#### 3.2.3 Any other observations/interpretations about the data?

This is a classic example of **Simpson's Paradox**.

- **Overall:** There appeared to be a significant bias against women (Men's admission rate 44% vs. Women's 35%).
- **Stratified by Department:** When we look at individual departments, women actually have *higher* admission rates in four of the six departments (A, B, D, F) and only slightly lower rates in the other two (C, E).
  - Dept A: Men 62%, Women 82% (Women favored)
  - Dept B: Men 63%, Women 68% (Women favored)
  - Dept C: Men 37%, Women 34% (Men slightly favored)
  - Dept D: Men 33%, Women 35% (Women favored)
  - Dept E: Men 28%, Women 24% (Men slightly favored)
  - Dept F: Men 6%, Women 7% (Women favored)

**Why the Paradox?** The paradox arises because women tend to apply to departments with lower overall admission rates, while men tend to apply to departments with higher overall admission rates.

- **Departments A & B:** High admission rates (64%, 63% overall). Men constitute a large majority of applicants in these departments (825/933 in A, 560/585 in B).
- **Departments C, D, E, F:** Lower admission rates (35%, 34%, 25%, 6% overall). Women constitute a much larger proportion of applicants in these departments (593/918 in C, 375/792 in D, 393/584 in E, 341/714 in F).

Even though women were admitted at a higher or similar rate *within* most departments, their disproportionate application to generally more competitive (lower admission rate) departments led to a lower overall admission rate. The CMH test adjusts for this confounding effect of department preference and shows no significant gender bias once department is accounted for.
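The mechanism behind the paradox can be illustrated numerically from the reconstructed department tables. The sketch below is an illustration, not part of the original solution: it computes the pooled admission rates, the share of each gender's applications going to the two high-admission departments, and a directly standardized rate (women's department-level rates applied to men's application mix). Direct standardization is a technique brought in here purely for illustration, and all variable names are hypothetical.

```python
# (men admitted, men rejected, women admitted, women rejected) per department,
# taken from the reconstructed 2x2 tables in section 3.2.1.
DEPTS = {
    "A": (512, 313, 89, 19),
    "B": (353, 207, 17, 8),
    "C": (120, 205, 202, 391),
    "D": (138, 279, 131, 244),
    "E": (53, 138, 94, 299),
    "F": (22, 351, 24, 317),
}

men_apps = {k: a + b for k, (a, b, c, d) in DEPTS.items()}
women_apps = {k: c + d for k, (a, b, c, d) in DEPTS.items()}
men_total = sum(men_apps.values())
women_total = sum(women_apps.values())

# Pooled (unstratified) admission rates across the six departments.
pooled_men = sum(a for a, _, _, _ in DEPTS.values()) / men_total
pooled_women = sum(c for _, _, c, _ in DEPTS.values()) / women_total

# Share of each gender's applications sent to the high-admission depts A and B.
share_men_ab = (men_apps["A"] + men_apps["B"]) / men_total
share_women_ab = (women_apps["A"] + women_apps["B"]) / women_total

# Direct standardization (illustrative): women's per-department admission
# rates weighted by the men's application mix.
standardized_women = sum(
    men_apps[k] * c / (c + d) for k, (a, b, c, d) in DEPTS.items()
) / men_total

print(f"pooled rate, men:   {pooled_men:.1%}")
print(f"pooled rate, women: {pooled_women:.1%}")
print(f"share applying to A or B: men {share_men_ab:.1%}, women {share_women_ab:.1%}")
print(f"women's rate under men's application mix: {standardized_women:.1%}")
```

The pooled rates reproduce the Total row of Table 4, while the standardized rate shows what the women's overall rate would look like if only the application mix changed and the within-department rates stayed the same.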
#### 3.2.4 Bonus Exercise: CMH and Common Odds Ratio

**Hint: Is CMH a Score test for testing common odds?**

Yes, the Cochran-Mantel-Haenszel test statistic can be viewed as a **score test for testing the null hypothesis that the common odds ratio across all strata is equal to 1**.

- **Common Odds Ratio ($OR_{CMH}$):** In stratified analysis, this is an estimate of the odds ratio that is assumed to be constant across all strata.

  $$OR_{CMH} = \frac{\sum_k (a_k d_k / N_k)}{\sum_k (b_k c_k / N_k)}$$

- If the common odds ratio is 1, it means that the odds of the outcome (admission) are the same for both groups (men and women) within each stratum (department).
- The CMH test essentially checks if this common odds ratio is significantly different from 1.

**Score Test Connection:** A score test is a general method for hypothesis testing that uses the score function (the derivative of the log-likelihood function with respect to the parameter of interest, evaluated under the null hypothesis). For the specific case of testing $OR=1$ in a series of 2x2 tables, the CMH statistic is indeed equivalent to a score test. It's robust and widely used when the odds ratio is assumed to be constant across strata. If the odds ratios vary significantly between strata, other methods like logistic regression with interaction terms might be more appropriate.

**Conclusion for Bonus Exercise:** The CMH test is suitable for testing the hypothesis of no association (i.e., common odds ratio of 1) between two binary variables across multiple strata. It is a score test for this specific hypothesis.
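As a closing illustration, the common odds ratio formula above can be evaluated on the six department tables, alongside the stratum-specific odds ratios that the constant-odds-ratio assumption is about. This is a minimal sketch using the reconstructed counts from section 3.2.1; the names `DEPTS`, `or_by_dept`, and `or_mh` are illustrative, and men are taken as the first row, so a value below 1 means lower odds of admission for men.

```python
# (a, b, c, d) = (men admitted, men rejected, women admitted, women rejected),
# reconstructed in section 3.2.1.
DEPTS = {
    "A": (512, 313, 89, 19),
    "B": (353, 207, 17, 8),
    "C": (120, 205, 202, 391),
    "D": (138, 279, 131, 244),
    "E": (53, 138, 94, 299),
    "F": (22, 351, 24, 317),
}

# Stratum-specific odds ratios: (a * d) / (b * c) within each department.
or_by_dept = {k: (a * d) / (b * c) for k, (a, b, c, d) in DEPTS.items()}

# Mantel-Haenszel common odds ratio: sum(a*d/N) / sum(b*c/N).
numerator = sum(a * d / (a + b + c + d) for a, b, c, d in DEPTS.values())
denominator = sum(b * c / (a + b + c + d) for a, b, c, d in DEPTS.values())
or_mh = numerator / denominator

for dept, value in or_by_dept.items():
    print(f"Dept {dept}: OR = {value:.2f}")
print(f"Mantel-Haenszel common OR = {or_mh:.3f}")
```

Comparing the per-department values with the pooled estimate gives a rough, informal sense of whether a single common odds ratio is a reasonable summary; a formal check of homogeneity would need a dedicated test, which is outside the scope of this sketch.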