Introduction to Categorical Data Analysis July 22, 2004


Categorical data The t-test, ANOVA, and linear regression all assumed an outcome variable that was continuous (normally distributed). Even their non-parametric equivalents assumed an outcome with at least many levels (discrete quantitative or ordinal). We haven't yet discussed the case where the outcome variable is categorical.

Types of Variables: a taxonomy

Categorical:
  binary: 2 categories
  nominal: + more categories
  ordinal: + order matters
Quantitative:
  discrete: + numerical
  continuous: + uninterrupted

(All but continuous variables are discrete random variables.)

Overview of statistical tests Independent variable = predictor; dependent variable = outcome. For example, in the regression BMD = pounds + age + amenorrheic (1/0), BMD is a continuous outcome, pounds and age are continuous predictors, and amenorrheic (1/0) is a binary predictor.

Types of variables to be analyzed

Predictor (independent) variable/s   Outcome (dependent) variable   Statistical procedure or measure of association
Categorical                          Continuous                     ANOVA
Dichotomous                          Continuous                     T-test
Continuous                           Continuous                     Simple linear regression
Multivariate                         Continuous                     Multiple linear regression
Categorical                          Categorical                    Chi-square test
Dichotomous                          Dichotomous                    Odds ratio, Mantel-Haenszel OR, relative risk, difference in proportions
Multivariate                         Dichotomous                    Logistic regression
Categorical                          Time-to-event                  Kaplan-Meier curve / log-rank test
Multivariate                         Time-to-event                  Cox proportional hazards model

Types of variables to be analyzed (course roadmap)

Status                Predictor (independent) variable/s   Outcome (dependent) variable   Statistical procedure or measure of association
Done                  Categorical                          Continuous                     ANOVA
Done                  Dichotomous                          Continuous                     T-test
Done                  Continuous                           Continuous                     Simple linear regression
Done                  Multivariate                         Continuous                     Multiple linear regression
Today and next week   Categorical                          Categorical                    Chi-square test
Today and next week   Dichotomous                          Dichotomous                    Odds ratio, Mantel-Haenszel OR, relative risk, difference in proportions
Today and next week   Multivariate                         Dichotomous                    Logistic regression
Last part of course   Categorical                          Time-to-event                  Kaplan-Meier curve / log-rank test
Last part of course   Multivariate                         Time-to-event                  Cox proportional hazards model

Difference in proportions Example: You poll 50 people from random districts in Florida as they exit the polls on election day 2004. You also poll 50 people from random districts in Massachusetts. 49% of pollees in Florida say that they voted for Kerry, and 53% of pollees in Massachusetts say they voted for Kerry. Is there enough evidence to reject the null hypothesis that the states voted for Kerry in equal proportions?

Null distribution of a difference in proportions Standard error of a proportion: SE(p-hat) = sqrt(p(1-p)/n), which can be estimated by sqrt(p-hat(1-p-hat)/n); the sample proportion is still approximately normally distributed. Standard error of the difference of two independent proportions: SE(p-hat1 - p-hat2) = sqrt(p1(1-p1)/n1 + p2(1-p2)/n2), because the variance of a difference is the sum of the variances (as with a difference in means). Under the null hypothesis, the two proportions are replaced by a common pooled estimate, analogous to the pooled variance in the t-test.

Null distribution of a difference in proportions Under the null hypothesis, the difference of proportions is approximately Normal with mean 0. For our example, the pooled proportion is p-hat = (.49 + .53)/2 = .51, so SE = sqrt(.51 x .49 x (1/50 + 1/50)) = .10, and the null distribution is approximately Normal(0, .10).

Answer to Example We saw a difference of 4% between Florida and Massachusetts. The null distribution predicts chance variation between the two states of about 10%. Z = .04/.10 = 0.4, and P(Z >= 0.4) = .34 > .05. Not enough evidence to reject the null.
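The calculation above can be sketched in a few lines of Python (a minimal stdlib-only illustration working directly with the stated proportions; the function name and its pooled-SE choice under H0 are this sketch's, not the slide's):

```python
import math

def two_prop_ztest(p1, p2, n1, n2):
    """Two-sample z-test for a difference in proportions (pooled SE under H0)."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)        # pooled proportion under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return z, se, p_one_sided

z, se, p = two_prop_ztest(0.49, 0.53, 50, 50)
print(f"z={z:.2f}, SE={se:.2f}, one-sided p={p:.2f}")  # z=0.40, SE=0.10, one-sided p=0.34
```

With a one-sided p of about .34, the conclusion matches the slide: not enough evidence to reject the null.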

Chi-square test for comparing proportions (of a categorical variable) between groups I. Chi-Square Test of Independence When both your predictor and outcome variables are categorical, they may be cross-classified in a contingency table and compared using a chi-square test of independence.   A contingency table with R rows and C columns is an R x C contingency table.

Example Asch, S.E. (1955). Opinions and social pressure. Scientific American, 193, 31-35.

The Experiment A Subject volunteers to participate in a “visual perception study.” Everyone else in the room is actually a conspirator in the study (unbeknownst to the Subject). The “experimenter” reveals a pair of cards…

The Task Cards [Figure: two cards, one showing the standard line and the other showing comparison lines A, B, and C.]

The Experiment Everyone goes around the room and says which comparison line (A, B, or C) is correct; the true Subject always answers last – after hearing all the others’ answers. The first few times, the 7 “conspirators” give the correct answer. Then, they start purposely giving the (obviously) wrong answer. 75% of Subjects tested went along with the group’s consensus at least once.

Further Results In a further experiment, group size (number of conspirators) was varied from 2 to 10. Does the group size alter the proportion of subjects who conform?

The Chi-Square test

Conformed?   Number of group members
             2     4     6     8     10
Yes          20    50    75    60    30
No           80    50    25    40    70

(100 subjects per group size; the "No" count of 50 for group size 4 is inferred from the column total.) Apparently, conformity is less likely with fewer or with more group members than the middle group sizes.

20 + 50 + 75 + 60 + 30 = 235 conformed out of 500 experiments. Overall likelihood of conforming = 235/500 = .47

Expected frequencies if no association between group size and conformity:

Conformed?   Number of group members
             2     4     6     8     10
Yes          47    47    47    47    47
No           53    53    53    53    53

Do observed and expected frequencies differ more than expected due to chance? The chi-square statistic measures this: chi-square = sum over all cells of (observed - expected)^2 / expected.

Chi-Square test Degrees of freedom = (rows - 1) x (columns - 1) = (2 - 1) x (5 - 1) = 4. Rule of thumb: a chi-square statistic much greater than its degrees of freedom indicates statistical significance. Here 85 >> 4.
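As a sketch of the computation (assuming the reconstructed conformity table, with the one missing "No" cell taken to be 50 so each group size has 100 subjects; with these counts the statistic comes out near 80, in the same ballpark as the slide's 85 and still far above df = 4):

```python
# Chi-square test of independence, computed from scratch for the
# conformity table. The "No" cell for group size 4 is an assumed 50,
# so that each column sums to 100 subjects.
observed = [
    [20, 50, 75, 60, 30],   # conformed: yes
    [80, 50, 25, 40, 70],   # conformed: no
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n  # E = (row total)(col total)/N
        chi_sq += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi_sq:.1f} with df = {df}")  # chi-square = 79.5 with df = 4
```

Note how the expected counts of 47 and 53 per column fall out of the row-total-times-column-total formula.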

The Chi-Square distribution A chi-square random variable with df degrees of freedom is the sum of df squared independent standard normal deviates. Its expected value and variance are: E(X) = df; Var(X) = 2(df).

Caveat **When the expected count in any cell is very small (<5), Fisher's exact test is used as an alternative to the chi-square test.

Example of Fisher’s Exact Test

Fisher’s “Tea-tasting experiment” Claim: Fisher’s colleague (call her “Cathy”) claimed that, when drinking tea, she could distinguish whether milk or tea was added to the cup first. To test her claim, Fisher designed an experiment in which she tasted 8 cups of tea (4 cups had milk poured first, 4 had tea poured first). Null hypothesis: Cathy’s guessing abilities are no better than chance. Alternative hypotheses: Right-tail: she guesses right more than expected by chance. Left-tail: she guesses wrong more than expected by chance.

Fisher’s “Tea-tasting experiment” Experimental Results:

                       Milk poured first   Tea poured first
Guessed milk first            3                   1           4
Guessed tea first             1                   3           4
                              4                   4           8

Fisher’s Exact Test Step 1: Identify tables that are as extreme or more extreme than what actually happened. Here she identified 3 out of 4 of the milk-poured-first teas correctly. Is that good luck or real talent? The only way she could have done better is to identify 4 of 4 correctly:

Observed table:                       More extreme table:
               Milk   Tea                            Milk   Tea
Guessed milk     3     1    4         Guessed milk     4     0    4
Guessed tea      1     3    4         Guessed tea      0     4    4
                 4     4                               4     4

Fisher’s Exact Test Step 2: Calculate the probability of each such table, assuming fixed marginals (the hypergeometric distribution). With 4 milk-first cups among 8, and 4 cups guessed as milk-first:

P(observed table) = C(4,3) x C(4,1) / C(8,4) = 16/70 = .229
P(more extreme table) = C(4,4) x C(4,0) / C(8,4) = 1/70 = .014

Step 3: To get the left-tail and right-tail p-values, consider the probability mass function of X, the number of correct identifications of the cups with milk poured first:

P(X=0) = 1/70  = .014
P(X=1) = 16/70 = .229
P(X=2) = 36/70 = .514
P(X=3) = 16/70 = .229
P(X=4) = 1/70  = .014

“Right-hand tail probability”: p = P(X >= 3) = .243. “Left-hand tail probability” (testing the null hypothesis against the alternative that she’s systematically wrong): p = P(X <= 3) = .986.
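A minimal stdlib-only sketch of this calculation (the hypergeometric probabilities for a 2x2 table with all margins fixed at 4; the function name is this sketch's):

```python
from math import comb

def tea_tasting_pmf(n_milk=4, n_tea=4, n_guessed_milk=4):
    """P(X = k correct milk-first identifications): hypergeometric, fixed margins."""
    total = comb(n_milk + n_tea, n_guessed_milk)
    return {k: comb(n_milk, k) * comb(n_tea, n_guessed_milk - k) / total
            for k in range(min(n_milk, n_guessed_milk) + 1)}

pmf = tea_tasting_pmf()
right_tail = pmf[3] + pmf[4]                # P(X >= 3): she guessed 3 correctly
left_tail = sum(pmf[k] for k in range(4))   # P(X <= 3)
print(f"table prob = {pmf[3]:.4f}, right tail = {right_tail:.4f}, left tail = {left_tail:.4f}")
# table prob = 0.2286, right tail = 0.2429, left tail = 0.9857
```

These values match the SAS Fisher's exact output shown below (table probability 0.2286, right-sided 0.2429, left-sided 0.9857).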

SAS code and output for generating Fisher’s Exact statistics for the 2x2 table

               Milk   Tea
Guessed milk     3     1    4
Guessed tea      1     3    4
                 4     4    8

data tea;
input MilkFirst GuessedMilk Freq;
datalines;
1 1 3
1 0 1
0 1 1
0 0 3
;
run;

data tea; *Fix quirky reversal of SAS 2x2 tables;
set tea;
MilkFirst = 1 - MilkFirst;
GuessedMilk = 1 - GuessedMilk;
run;

proc freq data=tea;
tables MilkFirst*GuessedMilk / exact;
weight Freq;
run;

SAS output

Statistics for Table of MilkFirst by GuessedMilk

Statistic                       DF    Value    Prob
----------------------------------------------------
Chi-Square                       1   2.0000   0.1573
Likelihood Ratio Chi-Square      1   2.0930   0.1480
Continuity Adj. Chi-Square       1   0.5000   0.4795
Mantel-Haenszel Chi-Square       1   1.7500   0.1859
Phi Coefficient                      0.5000
Contingency Coefficient              0.4472
Cramer's V                           0.5000

WARNING: 100% of the cells have expected counts less than 5. Chi-Square may not be a valid test.

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)         3
Left-sided Pr <= F          0.9857
Right-sided Pr >= F         0.2429
Table Probability (P)       0.2286
Two-sided Pr <= P           0.4857

Sample Size = 8

Introduction to the 2x2 Table

Introduction to the 2x2 Table

                  Exposure (E)   No Exposure (~E)
Disease (D)            a                b            a+b
No Disease (~D)        c                d            c+d
                      a+c              b+d            N

The marginal probability of disease is P(D) = (a+b)/N; the marginal probability of exposure is P(E) = (a+c)/N.

Cohort Studies [Figure: from the target population, a disease-free cohort is sampled and classified as exposed or not exposed; over TIME, members of each group either develop disease or remain disease-free.]

The Risk Ratio, or Relative Risk (RR)

                  Exposure (E)   No Exposure (~E)
Disease (D)            a                b
No Disease (~D)        c                d
                      a+c              b+d

RR = [a/(a+c)] / [b/(b+d)] = risk to the exposed / risk to the unexposed

Hypothetical Data

                   Congestive Heart Failure   No CHF
High Systolic BP             400               1100     1500
Normal BP                    1500              3000     4500

RR = (400/1500) / (1500/4500) = .27/.33 = 0.8

Case-Control Studies Sample on disease status and ask retrospectively about exposures (appropriate for rare diseases). Marginal probabilities of exposure for cases and controls are valid. Doesn't require knowledge of the absolute risks of disease. For rare diseases, the odds ratio can approximate the relative risk.

Case-Control Studies [Figure: from the target population, diseased cases and disease-free controls are sampled; each is then classified retrospectively as exposed in the past or not exposed.]

The Odds Ratio (OR)

                  Exposure (E)      No Exposure (~E)
Disease (D)       a = P(D & E)      b = P(D & ~E)
No Disease (~D)   c = P(~D & E)     d = P(~D & ~E)

OR = odds of exposure among the diseased / odds of exposure among the non-diseased = ad/bc

The Odds Ratio Via Bayes’ Rule OR = [P(E|D)/P(~E|D)] / [P(E|~D)/P(~E|~D)]. Applying Bayes’ rule to each term, the marginal probabilities cancel, giving OR = [P(D|E)/P(~D|E)] / [P(D|~E)/P(~D|~E)]. When disease is rare, P(~D|E) and P(~D|~E) are both approximately 1 (“the rare disease assumption”), so OR approximates P(D|E)/P(D|~E) = RR.
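A quick numerical illustration of the rare-disease assumption (the risks below are hypothetical, chosen only to show the effect):

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def rr_and_or(risk_exposed, risk_unexposed):
    """Relative risk and odds ratio for two hypothetical risks."""
    rr = risk_exposed / risk_unexposed
    odds_ratio = odds(risk_exposed) / odds(risk_unexposed)
    return rr, odds_ratio

# Rare disease: the OR barely differs from the RR.
print(rr_and_or(0.002, 0.001))   # RR = 2.0, OR = 2.002
# Common disease: the OR exaggerates the RR.
print(rr_and_or(0.50, 0.25))     # RR = 2.0, OR = 3.0
```

The farther the risks are from zero, the more the odds ratio overstates the relative risk.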

Properties of the OR (simulation) [Simulation figure omitted: the sampling distribution of the OR is skewed, which motivates working with the natural log of the OR.]

Properties of the lnOR The natural log of the OR is approximately normally distributed, with standard deviation = sqrt(1/a + 1/b + 1/c + 1/d).

Hypothetical Data

                 Smoker   Non-smoker
Lung Cancer        20         10        30
No lung cancer      6         24        30

Note that the size of the smallest 2x2 cell (here, 6) determines the magnitude of the variance.

Example: Cell phones and brain tumors (cross-sectional data)

                         Brain tumor   No brain tumor
Own a cell phone              5             347          352
Don’t own a cell phone        3              88           91
                              8             435          453

Same data, but use the chi-square test or Fisher’s exact test

                         Brain tumor   No brain tumor
Own a cell phone              5             347          352
Don’t own a cell phone        3              88           91
                              8             435          453

Same data, but use the Odds Ratio

                         Brain tumor   No brain tumor
Own a cell phone              5             347          352
Don’t own a cell phone        3              88           91
                              8             435          453
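A sketch of the odds ratio and a 95% confidence interval for this table, using the lnOR standard deviation formula from the earlier slide (the slide itself shows no numbers; the function name and the specific results below are this sketch's):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a CI from exp(ln(OR) +/- z * sqrt(1/a + 1/b + 1/c + 1/d))."""
    odds_ratio = (a * d) / (b * c)
    se_ln_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lo = math.exp(math.log(odds_ratio) - z * se_ln_or)
    hi = math.exp(math.log(odds_ratio) + z * se_ln_or)
    return odds_ratio, lo, hi

# Cells follow the layout of the OR slide:
# a = 5 (tumor, phone), b = 3 (tumor, no phone),
# c = 347 (no tumor, phone), d = 88 (no tumor, no phone)
or_, lo, hi = odds_ratio_ci(5, 3, 347, 88)
print(f"OR = {or_:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")  # OR = 0.42, 95% CI (0.10, 1.80)
```

The interval includes 1, so this small table gives no clear evidence of association — and note that the smallest cell (3) dominates the variance, as the lnOR slide pointed out.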