INTRODUCTION TO CATEGORICAL DATA ANALYSIS

Presentation transcript:

INTRODUCTION TO CATEGORICAL DATA ANALYSIS ODDS RATIO, MEASURE OF ASSOCIATION, TEST OF INDEPENDENCE, LOGISTIC REGRESSION AND POLYTOMOUS LOGISTIC REGRESSION

DEFINITION Categorical data are data whose measurement scale consists of a set of categories.
E.g. marital status: never married, married, divorced, widowed (nominal).
E.g. attitude toward some policy: strongly disapprove, disapprove, approve, strongly approve (ordinal).
SOME VISUALIZATION TECHNIQUES: jittering, mosaic plots, bar plots, etc.
Correlation between ordinal or nominal measurements is usually referred to as association.

MEASURE OF ASSOCIATION  

ODDS RATIO  

ODDS RATIO - EXAMPLE Chinook salmon captured in 1999.
VARIABLES:
- SEX: M or F (nominal)
- MODE OF CAPTURE: Hook&line or Net (nominal)
- RUN: Early run (before July 1) or Late run (after July 1) (ordinal)
- AGE: Interval (continuous variable)
- LENGTH (eye to fork of tail, in mm): Interval (continuous variable)
What are the odds that a captured fish is a female? Consider Success = Female (because they are heavier).

CHINOOK SALMON EXAMPLE: EARLY RUN DATA
SEX     Hook&Line   Net   TOTAL
F       172         165   337
M       119         202   321
TOTAL   291         367   658
For Hook&Line: odds(F) = 172/119 = 1.45
For Net: odds(F) = 165/202 = 0.82
OR = 1.45/0.82 = 1.77
The odds that a captured fish is female are 77% ((1.77-1)=0.77) higher with hook&line compared to net.
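The odds and the odds ratio for the early-run table can be checked with a few lines of Python. This is only a sketch; the dictionary keys and the helper name `odds` are illustrative choices, not part of the slides.

```python
# Early-run Chinook salmon counts from the table above.
females = {"hook_and_line": 172, "net": 165}
males = {"hook_and_line": 119, "net": 202}


def odds(successes, failures):
    """Odds of success = number of successes / number of failures."""
    return successes / failures


# Success = female, as on the slide.
odds_hook = odds(females["hook_and_line"], males["hook_and_line"])
odds_net = odds(females["net"], males["net"])
odds_ratio = odds_hook / odds_net

print(f"odds (hook&line) = {odds_hook:.3f}")
print(f"odds (net)       = {odds_net:.3f}")
print(f"odds ratio       = {odds_ratio:.2f}")
```

The ratio comes out close to 1.77, matching the 77% figure quoted on the slide.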

ODDS RATIO In general Variable 1 Variable 2

INTERPRETATION OF OR What does OR=1 mean? The odds of success are equal under both conditions, e.g. the odds of capturing a female are the same no matter which mode of capture is used.

INTERPRETATION OF OR
OR>1: Odds of success are higher with condition 1.
OR<1: Odds of success are lower with condition 1.

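SHAPE OF OR The range of OR is: 0 ≤ OR < ∞. ln(OR) has a more symmetric distribution than OR (i.e., closer to a normal distribution). OR=1 ⇔ ln(OR)=0.
(1−α)100% Confidence Interval for ln(OR): ln(OR) ± z(α/2) · sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22), where the nij are the four cell counts.
(1−α)100% Confidence Interval for OR: exponentiate the two limits for ln(OR). This interval is non-symmetric around OR.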

CHINOOK SALMON EXAMPLE (Contd.) The odds that a captured fish is female are about 30 to 140% greater with hook&line than with using a net.
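A minimal sketch of how the "30 to 140%" statement can be reproduced, assuming the usual large-sample standard error of ln(OR), sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22), and a 95% level (z = 1.96). The variable names are illustrative.

```python
import math

# Cell counts: rows = sex (F, M), columns = mode of capture (hook&line, net).
n11, n12 = 172, 165
n21, n22 = 119, 202

or_hat = (n11 * n22) / (n12 * n21)
# Assumed large-sample standard error of ln(OR).
se_log_or = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

z = 1.96  # 95% confidence
log_lower = math.log(or_hat) - z * se_log_or
log_upper = math.log(or_hat) + z * se_log_or

# Back-transform to the OR scale; the interval is no longer symmetric.
ci_lower, ci_upper = math.exp(log_lower), math.exp(log_upper)
print(f"OR = {or_hat:.2f}, 95% CI = ({ci_lower:.2f}, {ci_upper:.2f})")
```

Under these assumptions the limits are roughly 1.30 and 2.42, i.e. odds about 30% to 140% higher with hook&line.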

OTHER MEASURES OF ASSOCIATION FOR 2X2 TABLES Relative Risk=

MEASURE OF ASSOCIATION FOR IxJ TABLES Pearson χ² in contingency tables. EXAMPLE: Instrument Failure
Type of Failure   Location of Failure
                  L1    L2   L3   TOTAL
T1                50    16   31   97
T2                61    26   16   103
TOTAL             111   42   47   200

PEARSON χ² IN CONTINGENCY TABLES Question: Are the type of failure and location of failure independent? H0: Type and location are independent. H1: They are not independent. We will compare sample frequencies (i.e. observed values) with the expected frequencies under H0. Remember that if events A and B are independent, then P(A∩B)=P(A)P(B). If type and location are independent, then P(T1 and L1)=P(T1)P(L1)=(97/200)(111/200).

PEARSON χ² IN CONTINGENCY TABLES Cells ~ Multinomial(n, p1, …, p6) ⇒ E(Xi) = n·pi
Expected frequency: E11 = n × Prob. = 200(97/200)(111/200) = 53.84
E12 = 200(97/200)(42/200) = 20.37
E13 = (97*47/200) = 22.8
E21 = (103*111/200) = 57.17
E22 = (103*42/200) = 21.63
E23 = (103*47/200) = 24.2
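The expected frequencies and the Pearson statistic can be sketched in plain Python as follows. One assumption is flagged in the code: the T2/L3 cell is not printed in the transcript and is taken as 16, the value implied by the row and column totals.

```python
# Observed counts: rows = type of failure (T1, T2), columns = location (L1, L2, L3).
observed = [
    [50, 16, 31],
    [61, 26, 16],  # last cell assumed: 103 - 61 - 26 = 47 - 31 = 16
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence, E_ij = (row total i) * (column total j) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over cells of (O - E)^2 / E.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

for exp_row in expected:
    print([round(e, 2) for e in exp_row])
print(f"chi-square = {chi_square:.2f} on {df} df")
```

With 2 degrees of freedom the 5% critical value is 5.99, so a statistic near 8.1 would lead to rejecting independence of type and location.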

CRAMER’S V It adjusts the Pearson χ² with n, I and J. In the previous example,
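The formula for V does not survive in the transcript. The sketch below assumes the standard definition, V = sqrt(χ² / (n · min(I−1, J−1))), and applies it to the instrument-failure table (again taking the unprinted T2/L3 cell as 16, as implied by the margins).

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic for an I x J table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )


def cramers_v(table):
    """Assumed standard definition: sqrt(chi2 / (n * min(I - 1, J - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(pearson_chi_square(table) / (n * k))


failures = [[50, 16, 31], [61, 26, 16]]  # T2/L3 = 16 is inferred from the totals
v = cramers_v(failures)
print(f"Cramer's V = {v:.3f}")
```

V lies between 0 and 1, so a value around 0.2 points to a fairly weak association even though the test of independence rejects.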

CORRELATION BETWEEN ORDINAL VARIABLES Correlation coefficients are used to quantitatively describe the strength and direction of a relationship between two variables. When both variables are at least interval measurements, one may report the Pearson product-moment coefficient of correlation, also known simply as the correlation coefficient and denoted by ‘r’. The Pearson correlation coefficient is only appropriate for describing linear correlation; its appropriateness can be examined through scatter plots. A statistic that measures the correlation between two ‘rank’ measurements is Spearman’s ρ, a nonparametric analog of Pearson’s r. Spearman’s ρ is appropriate for skewed continuous or ordinal measurements. It can also be used to determine the relationship between one continuous and one ordinal variable. Statistical tests are available to test hypotheses on ρ. H0: There is no correlation between the two variables (H0: ρ = 0).
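Spearman’s ρ is Pearson’s r computed on ranks, which the following sketch makes explicit. The data are hypothetical ordinal scores invented for illustration, and the helper names are not from the slides.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed data."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical ordinal ratings from two questionnaire items.
item_a = [1, 2, 3, 4, 5]
item_b = [2, 1, 4, 3, 5]
rho = spearman_rho(item_a, item_b)
print(f"Spearman's rho = {rho:.2f}")
```

Because only ranks enter the calculation, any monotone transformation of either variable leaves ρ unchanged.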

MEASURES OF ASSOCIATION FOR IxJ TABLES FOR TWO ORDINAL VARIABLES  

Why are there multiple measures of association? Statisticians over the years have thought of varying ways of characterizing what a perfect relationship is.
[Two example tables: one with tau-b = 1, gamma = 1 (cell counts 55, 35, 40) and one with tau-b < 1, gamma = 1 (cell counts 55, 10, 25, 3, 7, 30).]
Either of these might be considered a perfect relationship, depending on one’s reasoning about what relationships between variables look like.

I’m so confused!!

Rule of Thumb Gamma tends to overestimate strength but gives an idea of upper boundary. If table is square use tau-b; if rectangular, use tau-c. Pollock (and we agree): τ <.1 is weak; .1<τ<.2 is moderate; .2<τ<.3 moderately strong; .3< τ<1 strong.
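To see why gamma can equal 1 while tau-b stays below 1, the sketch below counts concordant and discordant pairs directly. Both tables are hypothetical layouts chosen for illustration (the cell arrangement of the slide's own tables is not recoverable from the transcript), and the formulas are the standard ones: gamma = (C−D)/(C+D) and tau-b = (C−D)/sqrt((pairs not tied on rows)(pairs not tied on columns)).

```python
import math


def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in an ordered I x J table."""
    rows, cols = len(table), len(table[0])
    c = d = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):
                for m in range(cols):
                    if m > j:
                        c += table[i][j] * table[k][m]
                    elif m < j:
                        d += table[i][j] * table[k][m]
    return c, d


def goodman_kruskal_gamma(table):
    c, d = concordant_discordant(table)
    return (c - d) / (c + d)


def kendall_tau_b(table):
    c, d = concordant_discordant(table)
    n = sum(sum(row) for row in table)
    all_pairs = n * (n - 1) // 2
    tied_rows = sum(t * (t - 1) // 2 for t in (sum(row) for row in table))
    tied_cols = sum(t * (t - 1) // 2 for t in (sum(col) for col in zip(*table)))
    return (c - d) / math.sqrt((all_pairs - tied_rows) * (all_pairs - tied_cols))


# Hypothetical tables: all cases on the diagonal vs. some cases off the diagonal.
diagonal = [[55, 0, 0], [0, 35, 0], [0, 0, 40]]
with_ties = [[55, 10, 0], [0, 25, 7], [0, 0, 30]]

for name, table in [("diagonal", diagonal), ("with ties", with_ties)]:
    print(name, round(goodman_kruskal_gamma(table), 3), round(kendall_tau_b(table), 3))
```

Gamma ignores tied pairs entirely, so it reaches 1 whenever there are no discordant pairs; tau-b penalizes ties, which is one reason gamma tends to look stronger.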

MEASUREMENT OF AGREEMENT FOR IxI TABLES   Judge 2 A B C Judge 1 Prob. of agreement

EXAMPLE (COHEN’S KAPPA or Index of Inter-rater Reliability) Two pathologists examined 118 samples and categorized them into 4 groups. Below is the 4x4 table for their decisions. Pathologist Y 1 2 3 4 TOTAL Pathologist X 22 26 5 7 14 36 38 17 10 28 27 12 69 118

EXAMPLE (Contd.) The difference between the observed agreement and that expected under independence is about 50% of the maximum possible difference.
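A sketch of the kappa calculation, kappa = (Po − Pe)/(1 − Pe), where Po is the observed proportion of agreement and Pe the proportion expected under independence. Important assumption: several cells of the pathologists' table are missing from the transcript, so the matrix below fills them with values chosen to match the printed row and column totals; treat those cells as illustrative, not as the slide's data.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square table of counts (rows = rater 1, cols = rater 2)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_observed = sum(table[i][i] for i in range(len(table))) / n
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


# Rows = Pathologist X, columns = Pathologist Y.
# Cells not visible in the transcript are ASSUMED so that the margins
# (26, 26, 38, 28 by row; 27, 12, 69, 10 by column; 118 overall) are matched.
ratings = [
    [22, 2, 2, 0],
    [5, 7, 14, 0],
    [0, 2, 36, 0],
    [0, 1, 17, 10],
]

kappa = cohens_kappa(ratings)
print(f"kappa = {kappa:.3f}")
```

With this filled-in table kappa is about 0.49, in line with the "about 50%" reading on the slide.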

EVALUATION OF KAPPA If the obtained K is less than .70 -- conclude that the inter-rater reliability is not satisfactory. If the obtained K is greater than .70 -- conclude that the inter-rater reliability is satisfactory. Interpretation of kappa, after Landis and Koch (1977)

PROBABILITY MODELS FOR CATEGORICAL DATA Bernoulli/Binomial Multinomial Poisson …

TEST ON PROPORTIONS AND CONFIDENCE INTERVALS You are already familiar with tests for proportions: CI for Y=0 Pearson χ² or deviance G² test

CONFIDENCE INTERVAL FOR A PROPORTION For large sample sizes, we can use the normal approximation to the binomial (np ≥ 5 and n(1−p) ≥ 5). If np < 5 or n(1−p) < 5, the normal approximation is not realistic.

CONFIDENCE INTERVAL FOR A PROPORTION Consider Y=0 in n trials. Then, p=Y/n=0. Normal-approximation CI: 0 ± z·sqrt(0(1−0)/n) = (0, 0), no matter what n is! But observing 0 successes in 1 trial or in 100 trials is different. Note that np=0<5.
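The degenerate interval is easy to demonstrate. The sketch assumes the usual normal-approximation (Wald) interval, p ± z·sqrt(p(1−p)/n), with z = 1.96; the function name is illustrative.

```python
import math


def wald_interval(y, n, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    p_hat = y / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Zero successes: the interval collapses to a single point for every n.
for n in (1, 5, 100):
    print(n, wald_interval(0, n))

# For comparison, a case where the approximation is reasonable (np = 40).
print(wald_interval(40, 100))
```

The interval has zero width whether n is 1 or 100, which is exactly the problem the slide points out.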

EXACT CONFIDENCE INTERVALS (Collett, 1991, Modelling Binary Data) Lower Limit: Upper Limit:

EXACT CONFIDENCE INTERVALS Going back to the example with Y=0. Let n=5. Y=0 ⇒ v1=0, v2=2(5+1)=12, v3=2, v4=2(5)=10
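The F-based limits are not legible in the transcript, so the sketch below uses a swapped-in but equivalent route: the Clopper-Pearson construction, which finds the limits by solving the binomial tail equations numerically. That the F-distribution formula gives the same limits is stated here as a known equivalence, not something taken from the slides.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_decreasing(func, lo=0.0, hi=1.0, iterations=100):
    """Bisection root finder for a function that decreases in p on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if func(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(y, n, alpha=0.05):
    """Clopper-Pearson limits: each tail probability equals alpha / 2."""
    # Upper limit: P(X <= y | p) = alpha / 2 (the CDF decreases in p).
    upper = 1.0 if y == n else solve_decreasing(
        lambda p: binom_cdf(y, n, p) - alpha / 2
    )
    # Lower limit: P(X >= y | p) = alpha / 2 (this tail increases in p).
    lower = 0.0 if y == 0 else solve_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(y - 1, n, p))
    )
    return lower, upper


lower, upper = exact_interval(0, 5)
print(f"Y=0, n=5:   ({lower:.3f}, {upper:.3f})")
print("Y=0, n=100: ({:.3f}, {:.3f})".format(*exact_interval(0, 100)))
```

Unlike the normal approximation, the exact upper limit shrinks as n grows, so 0 successes in 100 trials is treated as much stronger evidence than 0 in 5.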

LOGISTIC REGRESSION Used to analyze the relationship between a binary outcome Y and a set of explanatory variables. The assumptions of linear models do not hold. Assume Yi ~ Ber(πi). Then E(Yi) = πi = P(Yi=1) and P(Yi=0) = 1−πi. Logistic regression is defined as: ln[πi/(1−πi)] = β0 + β1xi1 + … + βpxip, i.e. the log odds is expressed as a function of the x’s.

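Binary Logistic Regression Logistic Distribution. Transformed, however, the “log odds” are linear. [Figure: two panels, P(Y=1) plotted against x (S-shaped logistic curve) and ln[p/(1-p)] plotted against x (straight line).]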

INTERPRETATION OF PARAMETERS Consider p=1. Let X*=X+1 (i.e., a one-unit increase in X). Then the odds ratio is exp(β1): the odds ratio for a 1-unit change in X. β1: the log-odds ratio for a 1-unit change in X.

MULTIPLE LOGISTIC REGRESSION  

ESTIMATION OF PARAMETERS Yi ~ Ber(πi). The likelihood equations are nonlinear in the β’s. No closed form. Need iterative methods on a computer!
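One common iterative method is Newton-Raphson, sketched below for a single predictor. The slides do not say which algorithm is meant, so this is an illustration only, and the small data set is hypothetical.

```python
import math


def fit_simple_logistic(x, y, iterations=25):
    """Newton-Raphson for ln[pi/(1-pi)] = b0 + b1*x with one predictor."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector: first derivatives of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix X'WX with weights w_i = p_i (1 - p_i).
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information)^-1 * score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data, chosen so the outcomes are not perfectly separated by x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1 = fit_simple_logistic(x, y)

fitted = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
score0 = sum(yi - pi for yi, pi in zip(y, fitted))
score1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, fitted))
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, odds ratio per unit = {math.exp(b1):.3f}")
```

At convergence both score equations are (numerically) zero, which is what "solving the nonlinear equations" means here. Statistical packages typically use a closely related scheme, iteratively reweighted least squares.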

MODEL CHECK Since the errors εi take only two values in logistic regression, the “usual” residuals will not help with model checks. Instead, deviance residuals are used in this case.

MODEL CHECK You can plot dev_i vs. i, which is called an index plot of deviance residuals, to identify outlying residuals. But this plot does not indicate whether these residuals should be treated as outliers. There are also analogues of common methods used for linear regression, such as leverage values and influence diagnostics (DFFITS, Cook’s distance)… NOTE: An alternative for predicting a binary response is discriminant analysis. However, this approach assumes the X’s jointly follow a multivariate normal distribution. So, it is more reasonable when the X’s are continuous. Otherwise, logistic regression should be preferred.

Binary Logistic Regression A researcher is interested in the likelihood of gun ownership in the US, and what would predict that. He uses the 2002 GSS to test the following research hypotheses: Men are more likely to own guns than women The older persons are, the more likely they are to own guns White people are more likely to own guns than those of other races The more educated persons are, the less likely they are to own guns

Binary Logistic Regression Variables are measured as such:
Dependent: Havegun: no gun = 0, own gun(s) = 1
Independent:
- Sex: men = 0, women = 1
- Age: entered as number of years
- White: all other races = 0, white = 1
- Education: entered as number of years
SPSS: Analyze → Regression → Binary Logistic. Enter your variables and, for the output below, under Options, I checked “iteration history”.

Binary Logistic Regression SPSS Output: Some descriptive information first…

Binary Logistic Regression SPSS Output: The maximum likelihood process stops at the third iteration and yields an intercept (-.625) for a model with no predictors. A measure of fit, -2 Log likelihood, is generated. The equation producing this: -2 ∑ (Yi ln[P(Yi)] + (1-Yi) ln[1-P(Yi)]). This is simply the relationship between the observed values for each case in your data and the model’s prediction for each case. The “negative 2” makes this number follow a χ² distribution. In a perfect model, -2 log likelihood would equal 0. Therefore, lower numbers imply better model fit.
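A short sketch of the two quantities on this slide: the probability implied by the intercept-only model and the -2 log likelihood. The intercept -.625 is from the slide; the six outcomes and the "sharper" predictions are hypothetical, invented only to show that better predictions lower -2LL.

```python
import math


def neg2_log_likelihood(y, p):
    """-2 * sum(Yi*ln(Pi) + (1-Yi)*ln(1-Pi)) for outcomes y and predictions p."""
    return -2 * sum(
        yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for yi, pi in zip(y, p)
    )


# Intercept-only (null) model: every case gets the same predicted probability.
intercept = -0.625
p_gun = math.exp(intercept) / (1 + math.exp(intercept))
print(f"P(own gun) = {p_gun:.3f}, P(no gun) = {1 - p_gun:.3f}")

# Hypothetical outcomes for six respondents (1 = owns a gun).
y = [1, 0, 0, 1, 0, 0]
null_fit = neg2_log_likelihood(y, [p_gun] * len(y))
# Hypothetical predictions from a model that separates the cases well.
sharper_fit = neg2_log_likelihood(y, [0.9, 0.1, 0.1, 0.9, 0.1, 0.1])
print(f"-2LL null = {null_fit:.2f}, -2LL sharper = {sharper_fit:.2f}")
```

As predictions approach the observed 0/1 values, each log term approaches ln(1) = 0, which is why a perfect model would have -2LL equal to 0.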

Binary Logistic Regression Originally, the “best guess” for each person in the data set is 0, have no gun! This is the model for the log odds when every other potential variable equals zero (null model). It predicts P(no gun) = 1/(1+e^a) = 1/(1+.535) = .651, like above. Real P(own gun) = .349. If you added each…

Binary Logistic Regression Next are iterations for our full model…

Binary Logistic Regression Goodness-of-fit statistics for the new model come next…
Test of the new model vs. the intercept-only model (the null model), based on the difference of the -2LL of each. The difference has a χ² distribution. Is the new -2LL significantly smaller?
-2LL = -2 ∑ (Yi ln[P(Yi)] + (1-Yi) ln[1-P(Yi)]). The -2LL number is “ungrounded,” but it has a χ² distribution. Smaller is better. In a perfect model, -2 log likelihood would equal 0.
The pseudo-R² measures are attempts to replicate R² using information based on -2 log likelihood (C&S cannot equal 1).
Classification table: assessment of the new model’s predictions.

Binary Logistic Regression Interpreting Coefficients… ln[p/(1-p)] = a + b1X1 + b2X2 + b3X3 + b4X4 [SPSS coefficient table: the rows for X1 to X4 and the constant give b1 to b4 and a; the last column gives e^b.] Which b’s are significant? Being male, getting older, and being white have a positive effect on the likelihood of owning a gun. On the other hand, education does not significantly affect owning a gun.

Binary Logistic Regression ln[p/(1-p)] = a + b1X1 + … + bkXk is the power to which you need to raise e to get p/(1-p). So… p/(1-p) = e^(a + b1X1 + … + bkXk). Plug in values of X to get the odds (= p/(1-p)). The coefficients can be manipulated as follows:
Odds = p/(1-p) = e^(a + b1X1 + b2X2 + b3X3 + b4X4) = e^a (e^b1)^X1 (e^b2)^X2 (e^b3)^X3 (e^b4)^X4
Odds = p/(1-p) = e^(-2.246 - .780X1 + .020X2 + 1.618X3 - .023X4) = e^-2.246 (e^-.780)^X1 (e^.020)^X2 (e^1.618)^X3 (e^-.023)^X4

Binary Logistic Regression The coefficients can be manipulated as follows:
Odds = p/(1-p) = e^(a + b1X1 + b2X2 + b3X3 + b4X4) = e^a (e^b1)^X1 (e^b2)^X2 (e^b3)^X3 (e^b4)^X4
Odds = p/(1-p) = e^(-2.246 - .780X1 + .020X2 + 1.618X3 - .023X4) = e^-2.246 (e^-.780)^X1 (e^.020)^X2 (e^1.618)^X3 (e^-.023)^X4
Each coefficient changes the odds by a multiplicative amount; the amount is e^b. “Every unit increase in X multiplies the odds by e^b.” In the example above, e^b = Exp(B) in the last column.

Binary Logistic Regression Each coefficient changes the odds by a multiplicative amount; the amount is e^b. “Every unit increase in X multiplies the odds by e^b.” In the example above, e^b = Exp(B) in the last column.
For Sex: e^-.780 = .458 … If you subtract 1 from this value, you get the proportional increase (or decrease) in the odds caused by being female, -.542. In percent terms, the odds of owning a gun decrease 54.2% for women.
Age: e^.020 = 1.020 … A year increase in age increases the odds of owning a gun by 2%.
White: e^1.618 = 5.044 … Being white increases the odds of owning a gun by 404%.
Educ: e^-.023 = .977 … Not significant.

Binary Logistic Regression Age: e^.020 = 1.020. A year increase in age increases the odds of owning a gun by 2%. How would 10 years’ increase in age affect the odds? Recall that (e^b)^X is the equation component for a variable. For 10 years, (1.020)^10 = 1.219. The odds jump by 22% for ten years’ increase in age. Note: You’d have to know the current prediction level for the dependent variable to know if this percent change is actually making a big difference or not!
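The Exp(B) values and percentage changes quoted above can be reproduced directly from the coefficients. The dictionary keys are illustrative labels; the numbers are the ones printed on the slides.

```python
import math

# Coefficients from the fitted model (sex: men = 0, women = 1).
coefficients = {"sex": -0.780, "age": 0.020, "white": 1.618, "educ": -0.023}

# e^b is the multiplicative change in the odds per one-unit increase in X.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
percent_change = {name: 100 * (ratio - 1) for name, ratio in odds_ratios.items()}

for name in coefficients:
    print(f"{name:6s} Exp(B) = {odds_ratios[name]:.3f} ({percent_change[name]:+.1f}%)")

# Ten-year age effect: (e^b)^10, computed two ways.
ten_year_rounded = 1.020**10             # using the rounded Exp(B), as on the slide
ten_year_exact = math.exp(10 * 0.020)    # using the coefficient itself
print(f"10 years: {ten_year_rounded:.3f} (rounded) vs {ten_year_exact:.3f} (exact)")
```

The two ten-year figures differ slightly only because the slide raises the rounded value 1.020 to the tenth power; either way the odds rise by roughly 22%.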

Binary Logistic Regression For our problem,
P = e^(-2.246 - .780X1 + .020X2 + 1.618X3 - .023X4) / (1 + e^(-2.246 - .780X1 + .020X2 + 1.618X3 - .023X4))
For a man, 30, Latino, with 12 years of education, what does P equal? Let’s solve for
e^(-2.246 - .780X1 + .020X2 + 1.618X3 - .023X4) = e^(-2.246 - .780(0) + .020(30) + 1.618(0) - .023(12)) = e^(-2.246 - 0 + .6 + 0 - .276) = e^-1.922 = 2.71828^-1.922 = .146
Therefore, P = .146/1.146 = .127
The probability that the 30-year-old Latino man with 12 years of education will own a gun is .127!!! Or you could say there is a 12.7% chance.
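The same prediction as a small function, a sketch using the slide's coefficients; the function and argument names are illustrative.

```python
import math

# Fitted coefficients from the slides (sex: men = 0, women = 1; white: 0/1).
COEF = {"intercept": -2.246, "sex": -0.780, "age": 0.020, "white": 1.618, "educ": -0.023}


def predicted_probability(sex, age, white, educ):
    """P(own gun) = e^(log odds) / (1 + e^(log odds))."""
    log_odds = (
        COEF["intercept"]
        + COEF["sex"] * sex
        + COEF["age"] * age
        + COEF["white"] * white
        + COEF["educ"] * educ
    )
    odds = math.exp(log_odds)
    return odds / (1 + odds)


# A 30-year-old Latino man with 12 years of education.
p = predicted_probability(sex=0, age=30, white=0, educ=12)
print(f"P = {p:.4f}")

# Same profile, but white: only the `white` indicator changes.
p_white = predicted_probability(sex=0, age=30, white=1, educ=12)
print(f"P (white) = {p_white:.4f}")
```

Carried to full precision the first probability is a little under 0.128; the slide's .127 comes from rounding e^-1.922 to .146 before dividing.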

Binary Logistic Regression Inferential statistics are as before: In model fit, if χ2 test is significant, the expanded model (with your variables), improves prediction. This Chi-squared test tells us that as a set, the variables improve classification.

Binary Logistic Regression Inferential statistics are as before: The significance of the coefficients is determined by a Wald test. Wald is χ² with 1 df and equals the square of a two-tailed t (t²), with exactly the same p-value.

Binary Logistic Regression So how would I do hypothesis testing? An example:
Significance test for α-level = .05
Critical χ²(df=1) = 3.84
To find if there is a significant slope in the population: H0: β = 0, Ha: β ≠ 0
Collect data
Calculate Wald, like t (z): t = (b − β0)/s.e. (1.96 * 1.96 = 3.84)
Make a decision about the null hypothesis
Find the P-value
Reject the null for male, age, and white. Fail to reject the null for education: if the education coefficient were 0 in the population, there would be a 24.2% chance of a Wald statistic at least as large as the one observed.
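A sketch of the Wald calculation. The p-value uses the fact that a χ² variable with 1 df is a squared standard normal, so P(χ² > W) = erfc(sqrt(W/2)). The standard error in the second example is hypothetical, since the slides do not print it.

```python
import math


def wald_test(b, se, null_value=0.0):
    """Return the Wald statistic ((b - null) / se)^2 and its chi-square(1) p-value."""
    z = (b - null_value) / se
    wald = z**2
    # Two-tailed normal probability, identical to the chi-square(1) upper tail.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return wald, p_value


CRITICAL_CHI2_DF1 = 3.84  # alpha = .05

# At z = 1.96 the statistic sits right at the critical value.
wald_boundary, p_boundary = wald_test(b=1.96, se=1.0)
print(f"W = {wald_boundary:.2f}, p = {p_boundary:.3f}")

# Education coefficient from the slides with a HYPOTHETICAL standard error.
wald_educ, p_educ = wald_test(b=-0.023, se=0.020)
decision = "reject H0" if wald_educ > CRITICAL_CHI2_DF1 else "fail to reject H0"
print(f"W = {wald_educ:.2f}, p = {p_educ:.3f}: {decision}")
```

Comparing W with 3.84 and comparing the p-value with .05 always lead to the same decision.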

EXTENSIONS OF LOGISTIC REGRESSION  

MULTINOMIAL LOGISTIC REGRESSION There are many ways of constructing polytomous regression. 1. Logistic regression with respect to a baseline category (e.g. last category). For nominal response:

MULTINOMIAL LOGISTIC REGRESSION 2. Adjacent categories logits (for ordinal data):

MULTINOMIAL LOGISTIC REGRESSION
3. Cumulative logits for ordinal variables.
4. Continuation-ratio logits for ordinal variables.
5. Proportional odds model for ordinal variables.
(See Agresti!)
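For option 1, the baseline-category formula is missing from the transcript. The sketch below assumes the standard form, ln(πj/πJ) = αj + βj·x for each non-baseline category j, with the last category J as baseline, and shows how category probabilities follow from it. All coefficient values are hypothetical.

```python
import math


def baseline_category_probs(linear_predictors):
    """Turn the J-1 baseline-category logits into J category probabilities.

    linear_predictors[j] is ln(pi_j / pi_J); the baseline (last) category
    has logit 0 by construction.
    """
    exps = [math.exp(eta) for eta in linear_predictors]
    denominator = 1 + sum(exps)
    return [e / denominator for e in exps] + [1 / denominator]


# Hypothetical three-category response with one predictor x.
alphas = [0.5, -1.0]
betas = [0.3, 0.8]
x = 2.0

etas = [a + b * x for a, b in zip(alphas, betas)]
probs = baseline_category_probs(etas)
print([round(p, 3) for p in probs])
```

Each non-baseline category gets its own intercept and slope, which is why this form suits nominal responses; the ordinal options in the list constrain the logits to respect the category ordering instead.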