Data Processing: Tabulation & Issues

Principles of Tabulation
- Clear separation: Columns separated by lines for readability; thick lines for main classes, thin lines for subdivisions.
- Numbering: Columns may be numbered for ease of reference.
- Comparison: Data to be compared should be placed side by side, with percentages and averages close to the data.
- Approximation: Approximate figures before tabulation to remove unnecessary detail.
- Emphasis: Use different type, spacing, and indentation to mark items of significance.
- Alignment: Align column figures, decimal points, and plus/minus signs properly.
- Avoidance: Avoid abbreviations and ditto marks.
- Miscellaneous: Place miscellaneous and exceptional items in the last row.
- Simplicity: Tables should be logical, clear, accurate, and simple; avoid crowding too much data into one table.
- Totals: Place row totals in the extreme right column and column totals at the bottom.
- Arrangement: Arrange items chronologically, geographically, alphabetically, or by magnitude to facilitate comparison.

Problems in Processing Data

"Don't Know" (DK) Responses
- Significance: A small DK group is insignificant; a large DK group is a major concern.
- Causes: The respondent genuinely does not know (legitimate DK), or the researcher failed to obtain the information (failure of questioning).
- Handling DK responses: Design better questions; build good interviewer rapport; estimate the likely answer from other data; keep DK as a separate category (if legitimate); assume DKs occur at random and distribute them proportionally among the other answers; or exclude them from tabulation without inflating the other responses.

Use of Percentages
- Purpose: Simplify numbers by reducing them to a 0-100 range and facilitate relative comparisons.
- Rules:
  - Do not average two or more percentages unless each is weighted by its group size.
  - Avoid using too large percentages, which can be confusing.
  - Percentages hide the base; keep the base in mind for correct interpretation.
  - A percentage decrease can never exceed 100%; when computing one, always take the higher figure as the base.
  - In two-dimensional tables, work out percentages in the direction of the causal factor.

Elements/Types of Analysis
- Definition: Computation of indices/measures and searching for patterns of relationship; also estimating unknown parameters and testing hypotheses.
- Descriptive analysis: Study of distributions of one variable (unidimensional), two variables (bivariate), or more (multivariate); provides profiles.
- Inferential (statistical) analysis: Generalization from samples to populations; estimation of population parameters; testing of statistical hypotheses.

Statistics in Research
- Role: A tool for designing research, analyzing data, and drawing conclusions; reduces raw data to a form suitable for reading and further analysis.
- Descriptive statistics: Develop indices (summary measures) from raw data.
- Inferential (sampling) statistics: Concern generalization; estimation of population parameters and testing of statistical hypotheses.

Important Statistical Measures
- Measures of central tendency (statistical averages): Arithmetic mean ($\bar{X}$), median (M), and mode (Z) are the most important; the geometric mean and harmonic mean are sometimes used.
- Measures of dispersion: Variance ($\sigma^2$) and standard deviation ($\sigma$) are used most often; the mean deviation and range are also used; the coefficient of standard deviation and the coefficient of variation serve for comparison.
- Measures of asymmetry (skewness): Based on mean and mode, or mean and median; also based on quartiles or moments.
- Measures of relationship: Karl Pearson's coefficient of correlation (for variables); Yule's coefficient of association (for attributes); multiple correlation coefficient, partial correlation coefficient, and regression analysis.
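To make the central-tendency and dispersion measures listed above concrete, here is a minimal Python sketch on a purely illustrative dataset. It uses the standard `statistics` module for the mean, median, mode, variance, and standard deviation (population forms, dividing by N, to match the formulas used in these notes), and computes the coefficient of variation and mean deviation directly from their definitions.

```python
import statistics

# Illustrative data (hypothetical scores)
x = [12, 15, 15, 18, 20, 22, 25, 30]

mean = statistics.mean(x)               # arithmetic mean (X-bar)
median = statistics.median(x)           # median (M)
mode = statistics.mode(x)               # mode (Z)

var = statistics.pvariance(x, mu=mean)  # population variance (sigma^2)
sd = statistics.pstdev(x, mu=mean)      # standard deviation (sigma)

cv = sd / mean * 100                    # coefficient of variation, in %
md = sum(abs(xi - mean) for xi in x) / len(x)  # mean deviation about the mean

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"variance={var:.2f} sd={sd:.2f} CV={cv:.1f}% MD={md:.2f}")
```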
Other Measures

Measures of Dispersion (Detailed)

Mean Deviation (MD)
- Definition: Arithmetic average of the absolute deviations of the items from the mean (or the median).
- Formula: $MD = \frac{\sum |X_i - \text{average}|}{n}$, where the "average" is the mean or median used.
- Coefficient of mean deviation: $MD / \text{average}$.
- Use: Judging variability; makes the study of central tendency more precise.
- Advantage: Takes the values of all items into account.
- Disadvantage: Not amenable to further algebraic treatment (because absolute values ignore signs).

Standard Deviation ($\sigma$)
- The most widely used measure of dispersion, defined as the square root of the average of the squared deviations from the arithmetic mean.
- Formula: $\sigma = \sqrt{\frac{\sum f_i (X_i - \bar{X})^2}{N}}$, where $f_i$ is the frequency and $N = \sum f_i$.
- Coefficient of standard deviation: $\sigma / \bar{X}$ (a relative measure used for comparison).
- Coefficient of variation (CV): $(\sigma / \bar{X}) \times 100$.
- Variance: $\sigma^2$ (used in ANOVA).
- Advantages: A very satisfactory measure of dispersion; amenable to mathematical manipulation (algebraic signs are not ignored, since deviations are squared); less affected by sampling fluctuations; widely used in estimation and hypothesis testing.

Measures of Asymmetry (Skewness)
- Symmetrical distribution: The normal, bell-shaped curve; Mean = Median = Mode; no skewness.
- Asymmetrical (skewed) distribution: The curve is distorted to one side.
  - Positive skewness: Curve distorted to the right, with the tail on the right; Mean > Mode.
  - Negative skewness: Curve distorted to the left, with the tail on the left; Mean < Mode.
- Significance: Studies the formation of the series and indicates the shape of the curve (normal or otherwise).
- Measurement: Based on the difference between the mean and the mode, or the mean and the median.

Kurtosis
- A measure of the "flat-toppedness" or peakedness of a curve.
- Mesokurtic: The normal (bell-shaped) curve.
- Leptokurtic: More peaked than the normal curve.
- Platykurtic: Flatter than the normal curve.
- Significance: Crucial because many statistical methods assume a particular shape of the distribution curve.
- [Figure: frequency curves $f(X)$ against $X$ comparing mesokurtic, leptokurtic, and platykurtic shapes, all with Mean = Median = Mode.]
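As a quick illustration of measuring skewness and kurtosis in practice, here is a small sketch using NumPy and SciPy (a tooling choice assumed for this example, not prescribed by the notes). `scipy.stats.kurtosis` returns excess kurtosis by default, so 0 roughly corresponds to a mesokurtic curve, positive values to leptokurtic, and negative values to platykurtic; the right-skewed sample is purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)  # right-skewed illustrative sample

# Moment-based measures
skewness = stats.skew(x)       # > 0 for a right (positive) skew
excess_kurt = stats.kurtosis(x)  # ~0 mesokurtic, >0 leptokurtic, <0 platykurtic

# Karl Pearson's second skewness coefficient: 3 * (mean - median) / sigma
pearson_sk = 3 * (x.mean() - np.median(x)) / x.std()

print(f"skewness={skewness:.2f}  excess kurtosis={excess_kurt:.2f}  Pearson Sk={pearson_sk:.2f}")
```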
Measures of Relationship
- Univariate population: Data on one variable.
- Bivariate population: Data on two variables (X, Y).
- Multivariate population: Data on more than two variables (X, Y, Z, W, ...).
- Goal: Understand the relationship between the variables.
- Key questions: (i) Is there association/correlation between the variables, and if so, of what degree? (ii) Is there a cause-and-effect relationship, and if so, of what degree and in which direction?
- Techniques: Correlation techniques answer question (i); regression techniques answer question (ii).

(Illustrative Python sketches for the methods described below are collected at the end of this section.)

Bivariate Population Methods
- Correlation: Cross tabulation (nominal data); Charles Spearman's coefficient of correlation (rank correlation, ordinal data); Karl Pearson's coefficient of correlation (simple correlation, interval/ratio data).
- Cause and effect: Simple regression equations.

Multivariate Population Methods
- Correlation: Coefficient of multiple correlation; coefficient of partial correlation.
- Cause and effect: Multiple regression equations.

Cross Tabulation
- Use: Especially useful for nominal data.
- Process: Classify each variable into categories, then cross-classify the two variables.
- Relationships revealed:
  - Symmetrical: The variables vary together, but neither causes the other.
  - Reciprocal: The variables mutually influence or reinforce each other.
  - Asymmetrical: One variable (independent) causes changes in the other (dependent).
- Procedure: Begins with a two-way table; a third factor can be introduced to study conditional relationships.
- Limitation: Not a powerful measure of correlation; for ordinal, interval, or ratio data, other methods are preferable.

Charles Spearman's Coefficient of Rank Correlation ($R$)
- Use: Measures the degree of correlation between two variables for ordinal data (ranks).
- Objective: Determine how similar or dissimilar two sets of rankings are.
- Formula: $R = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the ranks of the $i$-th pair of observations and $n$ is the number of pairs.
- Nature: A non-parametric technique.

Karl Pearson's Coefficient of Correlation ($r$) (Simple Correlation)
- The most widely used method for measuring the relationship between two variables; also known as the product moment correlation coefficient.
- Assumptions: A linear relationship between the variables; the variables are causally related (one independent, one dependent); a large number of independent causes produce a normal distribution.
- Formula: $r = \frac{n \sum XY - (\sum X)(\sum Y)}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}}$
- Value range: $r$ lies between $-1$ and $+1$.
- Interpretation: A positive $r$ indicates positive correlation (the variables change in the same direction); a negative $r$ indicates negative correlation (they change in opposite directions); $r = 0$ means no association; $r = +1$ is perfect positive correlation and $r = -1$ perfect negative correlation (in either case the independent variable accounts for 100% of the variation in the dependent variable); values nearer to $\pm 1$ indicate a high degree of correlation.

Simple Regression Analysis
- Definition: The statistical relationship between two variables, one independent (cause) and one dependent (effect); used to interpret the underlying physical (cause-and-effect) relationship.
- Basic relationship: $Y = a + bX + e$, where $Y$ is the dependent variable, $X$ the independent variable, $a$ the Y-intercept (value of Y when X is 0), $b$ the slope of the regression line (change in Y for a unit change in X), and $e$ the error term.
- Estimated regression equation: $\hat{Y} = a + bX$
- Normal equations for finding $a$ and $b$:
  $\sum Y = na + b \sum X$
  $\sum XY = a \sum X + b \sum X^2$
- Solving the normal equations gives:
  $b = \frac{n \sum XY - (\sum X)(\sum Y)}{n \sum X^2 - (\sum X)^2}$, $a = \bar{Y} - b\bar{X}$

Multiple Correlation and Regression
- Use: When there are two or more independent variables.
- Multiple regression equation (two independent variables): $\hat{Y} = a + b_1X_1 + b_2X_2$, where $Y$ is the dependent variable, $X_1$ and $X_2$ the independent variables, $a$ the intercept, and $b_1$, $b_2$ the partial regression coefficients (the change in Y for a unit change in $X_1$ or $X_2$, holding the other constant).
- Normal equations (two independent variables):
  $\sum Y = na + b_1 \sum X_1 + b_2 \sum X_2$
  $\sum X_1Y = a \sum X_1 + b_1 \sum X_1^2 + b_2 \sum X_1X_2$
  $\sum X_2Y = a \sum X_2 + b_1 \sum X_1X_2 + b_2 \sum X_2^2$

Multicollinearity
- Problem: Occurs when the independent variables ($X_1$, $X_2$) are highly correlated, which makes the regression coefficients ($b_1$, $b_2$) less reliable.
- Prediction of the dependent variable is still possible, but care is needed in selecting independent variables so as to minimize multicollinearity.

Partial Correlation
- Measures the correlation between two variables while controlling for the effect of one or more other variables.
- Formula (e.g., $r_{12.3}$, the correlation between $X_1$ and $X_2$ controlling for $X_3$): $r_{12.3} = \frac{r_{12} - r_{13}r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$

Association in Case of Attributes
- Statistics of attributes: Data collected on the basis of attributes (qualitative characteristics).
- Objective: Determine whether the attributes are associated with each other.
- Example: The association between inoculation and immunity from small-pox.
- Attributes are associated if they appear together in a greater proportion than would be expected by chance.
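The sketches below illustrate, in order, the relationship methods described above; all data, variable names, and category labels are hypothetical. First, cross tabulation of two nominal variables: `pandas.crosstab` builds the two-way table, and normalizing by rows gives percentages worked out in the direction of the presumed causal factor.

```python
import pandas as pd

# Hypothetical nominal data: region (presumed causal factor) vs. product preference
region = pd.Series(["North", "North", "South", "South", "South", "East", "East", "North"])
prefers = pd.Series(["A", "B", "A", "A", "B", "B", "A", "A"])

table = pd.crosstab(region, prefers)                       # two-way frequency table
row_pct = pd.crosstab(region, prefers, normalize="index")  # proportions within each region

print(table)
print((row_pct * 100).round(1))
```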
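Next, Spearman's rank correlation evaluated directly from the formula $R = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$, assuming two judges' rankings of the same items with no ties (the ranks are hypothetical).

```python
# Hypothetical ranks given by two judges to the same 6 items (no ties)
rank_x = [1, 2, 3, 4, 5, 6]
rank_y = [2, 1, 4, 3, 6, 5]

n = len(rank_x)
d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
R = 1 - 6 * d_squared / (n * (n**2 - 1))

print(f"Spearman R = {R:.3f}")  # values near +1 mean the rankings largely agree
```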
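Pearson's $r$ and the simple regression coefficients share the same raw sums, so one sketch covers both: it evaluates the formulas for $r$, $b$, and $a$ given above on a small hypothetical (X, Y) dataset.

```python
from math import sqrt

# Hypothetical paired observations
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)

# Karl Pearson's coefficient of correlation
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2)
)

# Simple regression Y-hat = a + bX (b shares its numerator with r)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
a = sum_y / n - b * sum_x / n

print(f"r = {r:.3f}")
print(f"Y-hat = {a:.2f} + {b:.2f} X")
```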
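For two independent variables, the three normal equations form a linear system in $a$, $b_1$, $b_2$. The sketch below builds that system from hypothetical data and solves it with NumPy, then checks the correlation between $X_1$ and $X_2$ as a rough multicollinearity warning.

```python
import numpy as np

# Hypothetical data: Y modelled on X1 and X2
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([3.1, 3.9, 6.2, 6.8, 9.1, 9.7])

n = len(Y)

# Coefficient matrix and right-hand side of the normal equations
A = np.array([
    [n,        X1.sum(),      X2.sum()],
    [X1.sum(), (X1**2).sum(), (X1*X2).sum()],
    [X2.sum(), (X1*X2).sum(), (X2**2).sum()],
])
rhs = np.array([Y.sum(), (X1*Y).sum(), (X2*Y).sum()])

a, b1, b2 = np.linalg.solve(A, rhs)
print(f"Y-hat = {a:.2f} + {b1:.2f} X1 + {b2:.2f} X2")

# Rough multicollinearity check: a high |r(X1, X2)| makes b1 and b2 unreliable
r_x1x2 = np.corrcoef(X1, X2)[0, 1]
print(f"correlation between X1 and X2 = {r_x1x2:.2f}")
```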
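Partial correlation needs only the three pairwise (zero-order) correlations, so a short function suffices; it implements the formula for $r_{12.3}$ given above, and the numeric values passed in are purely illustrative.

```python
from math import sqrt

def partial_corr(r12: float, r13: float, r23: float) -> float:
    """Correlation between X1 and X2 controlling for X3 (r_12.3)."""
    return (r12 - r13 * r23) / sqrt((1 - r13**2) * (1 - r23**2))

# Illustrative zero-order correlations
print(round(partial_corr(r12=0.70, r13=0.50, r23=0.60), 3))
```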
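Finally, for attributes, one standard measure of association is Yule's coefficient of association Q (listed among the relationship measures earlier). Its formula is not stated in these notes, so the sketch below is an assumption based on the standard definition, Q = (ad - bc)/(ad + bc) for a 2x2 table with cell frequencies a, b, c, d, applied to a hypothetical inoculation/immunity table.

```python
# Hypothetical 2x2 table: inoculation (A) vs. immunity from small-pox (B)
a = 40  # inoculated and immune
b = 10  # inoculated, not immune
c = 20  # not inoculated, immune
d = 30  # not inoculated, not immune

# Yule's Q ranges from -1 to +1; 0 indicates no association
Q = (a * d - b * c) / (a * d + b * c)
print(f"Yule's Q = {Q:.2f}")  # positive => the attributes tend to appear together
```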