Categorical Data

Categorical Data Analysis

Aim: to identify any association between two categorical variables.

Example: 1,073 subjects of both genders were recruited for a study in which the onset of severe chest pain was recorded for each subject.

Variables:
- Onset of severe chest pain (+ve / -ve)
- Gender (male / female)

Chi-square tests

- Commonly denoted as χ².
- Useful in testing for independence between categorical variables (e.g. genetic association between cases / controls).
- Compares the observed counts against what is expected under the null hypothesis.

Assumption: sufficiently large counts in each cell of the cross-tabulation table.

Small Cell Counts

In general, the test requires that:
(a) the smallest expected count is 1 or more, and
(b) at least 80% of the cells have an expected count of 5 or more.

Yates' continuity correction provides a better approximation of the test statistic when the data are dichotomous (2 × 2 table).
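The two rules above can be checked mechanically once the expected counts are known. The following is a minimal sketch, not the slides' own procedure: the helper names are hypothetical, and the 2 × 2 table is reconstructed from the chest pain example (46 males with chest pain is the observed count quoted in the slides; the other three cells are derived from the margins 83, 990, 520 and 553).

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def small_cell_check(table):
    """Return (smallest expected count, share of cells with expected >= 5, rules satisfied?)."""
    cells = [e for row in expected_counts(table) for e in row]
    smallest = min(cells)
    share_ge_5 = sum(e >= 5 for e in cells) / len(cells)
    return smallest, share_ge_5, smallest >= 1 and share_ge_5 >= 0.8


# Rows: male, female. Columns: chest pain +ve, chest pain -ve.
# Cells other than 46 are derived from the margins, not quoted in the slides.
chest_pain = [[46, 474], [37, 516]]
smallest, share_ge_5, ok = small_cell_check(chest_pain)
print(f"smallest expected = {smallest:.2f}, share >= 5 = {share_ge_5:.0%}, ok = {ok}")
```

For this table every expected count is well above 5, so the large-sample approximation is reasonable.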

Goodness-of-fit test

- The null hypothesis specifies a hypothesized distribution for the data.
- Expected frequencies are calculated under the hypothesized distribution.

For example: the number of outbreaks of flu epidemics is charted over the period 1500 to 1931, and the number of outbreaks each year is tabulated. The variable of interest counts the number of outbreaks occurring in each year of that 432-year period. E.g. there were 223 years with no flu outbreaks.

Goodness-of-fit test

Hypotheses:
H0: Data follow a Poisson distribution with mean 0.692
H1: Data do not follow a Poisson distribution with mean 0.692
Note: the mean is obtained from the sample mean.

Sample mean = Σ (number of outbreaks × number of years with that many outbreaks) / 432 = 0.692

Expected frequency for X = 0: 432 × P(X = 0), where X ~ Poisson(0.692)

Test statistic: χ² = Σ (observed – expected)² / expected, with df = (6 – 1).

This yields a p-value of 0.99, so the data provide no evidence against the null hypothesis of a Poisson distribution.
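As a rough illustration of how the expected frequencies are obtained, the sketch below computes 432 × P(X = k) for a Poisson(0.692) variable. The choice of six categories (0, 1, 2, 3, 4, and "5 or more") is an assumption inferred from df = (6 – 1); the transcript does not list the categories or the observed counts beyond the 223 years with no outbreak, so no test statistic is computed here.

```python
import math

N_YEARS = 432   # years 1500 to 1931
MEAN = 0.692    # sample mean quoted in the slides


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_frequencies(n, mean, max_k):
    """Expected counts for X = 0..max_k-1 plus one pooled category X >= max_k."""
    probs = [poisson_pmf(k, mean) for k in range(max_k)]
    probs.append(1.0 - sum(probs))  # pooled upper tail
    return [n * p for p in probs]


def pearson_statistic(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Assumed categories: 0, 1, 2, 3, 4, 5+ outbreaks per year.
expected = expected_frequencies(N_YEARS, MEAN, max_k=5)
for k, e in enumerate(expected):
    label = f"{k}" if k < 5 else "5+"
    print(f"X = {label}: expected {e:.1f} years")
```

The expected number of outbreak-free years comes out at roughly 216, close to the 223 observed, which is in line with the large p-value reported above. Given the full set of observed counts, `pearson_statistic(observed, expected)` would give the test statistic.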

Test of independence

The most common usage of Pearson's chi-square test.

H0: The two categorical variables are independent
H1: The two categorical variables are associated (i.e. not independent)

Under the independence assumption, if outcome A is independent of outcome B, then
P(A and B happen jointly) = P(A happens) x P(B happens)

Calculating expected frequencies

P(Chest pain +) = 83/1073        P(Chest pain -) = 990/1073
P(Males) = 520/1073              P(Females) = 553/1073

Under independence:
P(Males with chest pain +) = 83/1073 x 520/1073 ≈ 0.0375
Expected(Males with chest pain +) = 1073 x P(.) = 83 x 520 / 1073 ≈ 40.22
Observed(Males with chest pain +) = 46

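Test of independence

Expected frequencies are calculated by:
Expected count = (row total × column total) / grand total

Degrees of freedom = (r – 1) × (c – 1), where r and c are the numbers of rows and columns.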
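A minimal sketch of the test of independence for the chest pain example follows. It uses only the standard library; the function names are hypothetical, no continuity correction is applied, and the 2 × 2 cells other than the quoted 46 are derived from the margins. The p-value shortcut is valid for df = 1 only, where the upper tail of the chi-square distribution equals erfc(sqrt(x / 2)).

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic, degrees of freedom and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df, expected


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with exactly 1 df."""
    return math.erfc(math.sqrt(chi2 / 2))


# Rows: male, female. Columns: chest pain +ve, chest pain -ve.
chest_pain = [[46, 474], [37, 516]]
chi2, df, expected = chi_square_independence(chest_pain)
p = p_value_df1(chi2)
print(f"expected males with pain = {expected[0][0]:.2f}")
print(f"chi-square = {chi2:.2f} on df = {df}, p = {p:.3f}")
```

For tables larger than 2 × 2 the same statistic applies, but the p-value has to come from a chi-square distribution with the appropriate df (for example from a statistics package).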

Chi-square test

Looking at the validity of the assumption of sufficiently large sample sizes!

Quantification of the effect

The χ² test identifies whether there is a significant association between the two categorical variables, but it does not quantify the strength and direction of the association. An odds ratio is needed for this.

The odds ratio defines "how many times more likely" it is to be in one category compared to the other.

Example: In the previous example on severe chest pain, the odds of severe chest pain for males are about 1.4 times those for females.

Always know what the outcome/event of interest is, and what the baseline reference is! Otherwise the OR can be interpreted both ways!

Odds ratio and relative risk

               Pos. outcome   Neg. outcome
Exposure (+)        a              b
Exposure (-)        c              d

Calculation of the odds ratio is pretty straightforward:
- Use the product of the leading diagonal divided by the product of the antidiagonal: OR = (a × d) / (b × c).

Relative risk is more tricky though, since it's not symmetric: RR = [a / (a + b)] / [c / (c + d)]. While it's commonly used interchangeably with OR, the interpretation and calculation are very different!
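The difference between the two measures is easy to see numerically. The sketch below uses the standard definitions OR = ad / bc and RR = [a / (a + b)] / [c / (c + d)] on the chest pain table, treating "male" as the exposure; the function names are hypothetical and the cells other than 46 are derived from the margins.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2 x 2 table laid out as [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Risk of a positive outcome in the exposed row divided by that in the unexposed row."""
    return (a / (a + b)) / (c / (c + d))


# Exposure (+) = male, Exposure (-) = female; positive outcome = severe chest pain.
a, b, c, d = 46, 474, 37, 516
or_pain = odds_ratio(a, b, c, d)
rr_pain = relative_risk(a, b, c, d)
print(f"OR = {or_pain:.2f}, RR = {rr_pain:.2f}")

# Swapping the outcome columns (event of interest = no chest pain):
# the odds ratio simply inverts, the relative risk does not.
or_no_pain = odds_ratio(b, a, d, c)
rr_no_pain = relative_risk(b, a, d, c)
print(f"1/OR = {1 / or_pain:.3f} vs OR(no pain) = {or_no_pain:.3f}")
print(f"1/RR = {1 / rr_pain:.3f} vs RR(no pain) = {rr_no_pain:.3f}")
```

This is the sense in which the relative risk is "not symmetric": changing which outcome counts as the event changes RR in a way that is not just a reciprocal.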

Exegesis on epidemiology

Case-Control Study
- Compare affected and unaffected individuals
- Usually retrospective in nature
- Temporal sequence cannot be established (timing for the onset of the disease)
- No information on population incidence of the disease
- Odds ratio is the right metric here!

Cohort Study
- Usually random sampling of subjects within the population
- Prospective, retrospective or both
- Long follow-up; loss to follow-up
- Costly to conduct
- Temporal sequence can be established
- Provides information on population incidence of the disease
- Relative risk is the appropriate metric here!

Confidence interval of odds ratio

- It is not straightforward to obtain confidence intervals for the odds ratio directly (due to the complexity in obtaining its variance).
- It is straightforward to obtain the variance of the logarithm of the odds ratio.
- The odds ratio is always reported together with the p-value (obtained from Pearson's chi-square test) and the corresponding confidence interval.

Case study on smoking and lung cancer

               Ca (+ve)   Ca (-ve)
Smoking (+)     1,301      1,205
Smoking (-)        56        152

Odds and Odds Ratio
Odds Ratio (OR) = (1301/56) / (1205/152) = 2.93
Pearson's chi-square test on df = 1 gives a p-value of essentially 0.
Var[log(OR)] = 1/1301 + 1/1205 + 1/56 + 1/152 ≈ 0.026
95% Confidence interval = exp(log(OR) ± 1.96 × sqrt(Var[log(OR)])) = (2.14, 4.02)
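The interval above can be reproduced with the usual log-scale (Woolf) approximation, in which Var[log(OR)] is the sum of the reciprocals of the four cell counts. This is a sketch under that assumption; the slides do not name the method, but it reproduces the reported (2.14, 4.02). The function name is hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio, Var[log(OR)] and an approximate confidence interval (z = 1.96 for 95%)."""
    odds_ratio = (a * d) / (b * c)
    var_log_or = 1 / a + 1 / b + 1 / c + 1 / d
    half_width = z * math.sqrt(var_log_or)
    lower = math.exp(math.log(odds_ratio) - half_width)
    upper = math.exp(math.log(odds_ratio) + half_width)
    return odds_ratio, var_log_or, (lower, upper)


# Rows: smoking (+), smoking (-). Columns: Ca (+ve), Ca (-ve).
or_hat, var_log_or, (ci_low, ci_high) = odds_ratio_ci(1301, 1205, 56, 152)
print(f"OR = {or_hat:.2f}, Var[log(OR)] = {var_log_or:.4f}")
print(f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Because the interval is built on the log scale and then exponentiated, it is not symmetric around the odds ratio itself.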

Beyond 2 x 2 tables

Nominal or ordinal

For categorical variables with two possible outcomes:
- It does not matter whether the variable is nominal or ordinal.

For categorical variables with more than 2 outcomes:
- It is important to note whether the variable is nominal or ordinal.
- The test to use is very different, and thus the conclusion reached can be very different.

Example: Consider the same dataset on severe chest pain, and suppose we have the smoking status (smoking intensity) of every individual, classified into:
- Non-smoker
- Daily smoker
- Excessive smoker

Chi-square test for trend

With non-smoker as the reference category:
OR smoker = 1.52 (0.88, 2.63)
OR ex-smoker = 2.11 (1.00, 4.51)

Linear-by-linear association

Adopts a correlational approach by calculating the Pearson correlation coefficient between the rows and the columns, allowing for ordinal outcomes in either.
- Recode rows as: yes = 0, no = 1.
- Recode columns as: non-smoker = 0, daily smoker = 1, ex-smoker = 2.

Linear-by-linear association

(Figure annotations: 53 observations; 19 observations.)

With r the Pearson correlation between the recoded rows and columns, consider the test statistic:

T = (N – 1) × r² ~ Chi-square(1)

Here N = 1050, so T = (1050 – 1) × r².
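The statistic T = (N – 1) r² can be computed directly from a cross-tabulation by treating every subject as a (row score, column score) pair. The sketch below is illustrative only: the cell counts are hypothetical, because the transcript does not give the chest pain by smoking table, and the function name is an assumption. The scores follow the recoding described in the slides (yes = 0, no = 1; non-smoker = 0, daily smoker = 1, ex-smoker = 2).

```python
import math


def linear_by_linear(table, row_scores, col_scores):
    """Pearson r between row and column scores, T = (N - 1) * r^2, and its p-value on 1 df."""
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]
    mean_r = sum(s * sum(row) for s, row in zip(row_scores, table)) / n
    mean_c = sum(s * t for s, t in zip(col_scores, col_totals)) / n
    cov = var_r = var_c = 0.0
    for rs, row in zip(row_scores, table):
        for cs, count in zip(col_scores, row):
            cov += count * (rs - mean_r) * (cs - mean_c)
            var_r += count * (rs - mean_r) ** 2
            var_c += count * (cs - mean_c) ** 2
    r = cov / math.sqrt(var_r * var_c)
    t = (n - 1) * r ** 2
    p = math.erfc(math.sqrt(t / 2))  # upper tail of chi-square with 1 df
    return r, t, p


# HYPOTHETICAL counts, for illustration only (not the study data).
# Rows: chest pain yes, no. Columns: non-smoker, daily smoker, ex-smoker.
table = [[20, 15, 10], [300, 150, 55]]
r, t, p = linear_by_linear(table, row_scores=[0, 1], col_scores=[0, 1, 2])
print(f"r = {r:.3f}, T = {t:.2f}, p = {p:.4f}")
```

With the yes = 0 / no = 1 coding, a negative r means that chest pain becomes more frequent as the column score increases.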

Nominal vs. Ordinal

It is important to recognise the kind of variables we have in order to identify the right test!

Procedure for Categorical Data Analysis

1. Summarise the data using cross-tabulation tables, with percentages.
2. Recognise whether any of the variables are ordinal.
3. Perform a chi-square test of independence to test for association between the two categorical variables, or the linear-by-linear test if at least one of the two variables is ordinal.
4. Check the validity of the assumption on the sample size.
5. Quantify any significant association using odds ratios.
6. Always report odds ratios with the corresponding 95% confidence interval.

Categorical Data Analysis in SPSS

Example: Let's consider the lung cancer and smoking example:

1. Establish the relationship between the onset of lung cancer and smoking status. Quantify this relationship if it is statistically significant.

               Ca (+ve)   Ca (-ve)
Smoking (+)     1,301      1,205
Smoking (-)        56        152

Data entry

Slightly counter-intuitively, the event of interest and the outcome of interest should be coded as 0, and the baseline reference outcome/event coded as 1.

Define what 0 and 1 correspond to:

The 0s and 1s are displayed as the labels you specified under “Values”. Percentages are much easier and more meaningful to interpret than absolute numbers!

- Highly significant p-value.
- Odds ratio of getting lung cancer with corresponding 95% CI, with non-smoker as baseline.
- Relative risk of getting lung cancer with corresponding 95% CI, with non-smoker as baseline.

Students should be able to:
- understand the use of a chi-square test for testing independence between two categorical outcomes
- understand the assumptions on sample sizes for the use of a chi-square test
- know how to quantify the association using odds ratio/relative risk, with corresponding 95% confidence intervals
- differentiate between the tests to be used for nominal categorical and ordinal categorical variables
- perform the appropriate analyses in SPSS and RExcel