Contingency tables and Correspondence analysis


Contingency tables and Correspondence analysis

Pearson's chi-squared test for association
Correspondence analysis
Plots
References
Exercises

Contingency tables

Contingency tables are often used in the social sciences (sociology, education, psychology). They can be considered frequency tables: rows and columns correspond to categorical variables. If a variable is continuous, we can bin it and so convert it to a categorical variable. Categorical variables have discrete values, for example eye colours: light, blue, medium, dark. Contingency tables are also sometimes called incidence matrices.

Example of a contingency table: eye and hair colours of 5387 schoolchildren from Caithness in Scotland, divided by 5 hair colours (columns) and 4 eye colours (rows).

          fair   red  medium  dark  black
blue       326    38     241   110      3
light      688   116     584   188      4
medium     343    84     909   412     26
dark        98    48     403   681     85

The first question is whether there is association between rows and columns. If there is some association, then we want to find structure in the table, i.e. which rows might be related to which columns. The first question is answered by Pearson's chi-squared test; the second is approached by correspondence analysis.
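As an aside (not from the slides), a contingency table like the one above is just a cross-tabulation of raw categorical observations. A minimal sketch in plain Python; the six (eye, hair) observations here are made up for illustration, not the caith counts:

```python
# Building a small contingency (frequency) table from raw categorical
# observations by cross-tabulation; the data are invented for illustration.
from collections import Counter

observations = [("blue", "fair"), ("blue", "fair"), ("light", "red"),
                ("dark", "dark"), ("blue", "medium"), ("dark", "dark")]

eye_levels = ["blue", "light", "medium", "dark"]
hair_levels = ["fair", "red", "medium", "dark", "black"]

counts = Counter(observations)
table = [[counts[(eye, hair)] for hair in hair_levels] for eye in eye_levels]

for eye, row in zip(eye_levels, table):
    print(f"{eye:7s}", row)
```

Each cell holds the number of observations falling into that (row category, column category) pair, which is exactly what the caith table records for 5387 children.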

Pearson chi-squared test

Suppose that we have a data matrix N with I rows and J columns, with elements n_ij. Let us use the following notation: n_i. = sum_j n_ij (row totals), n_.j = sum_i n_ij (column totals) and n = sum_i sum_j n_ij (grand total). Then the Pearson chi-squared statistic for testing the null hypothesis of no row-column association is calculated as

    X^2 = sum_i sum_j (n_ij - e_ij)^2 / e_ij,   where e_ij = n_i. n_.j / n,

with (I-1)(J-1) = IJ - (I+J-1) degrees of freedom. If the value of this statistic is high (relative to the chi-squared distribution with these degrees of freedom), we can say that there is row-column association; if it is low, we can say that there is none. For the above example the chi-squared test carried out in R gives:

---------------------------------------------------------------
        Pearson's Chi-squared test

data:  caith
X-squared = 1240.039, df = 12, p-value < 2.2e-16

This test shows that the null hypothesis should be rejected, i.e. there is strong evidence of row-column association. This result could be expected.
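The slides carry out the test in R; an equivalent sketch in Python, assuming scipy is available, reproduces the same statistic on the caith counts from the earlier slide:

```python
# Pearson's chi-squared test of no row-column association on the caith
# counts, reproducing R's chisq.test result with scipy.
import numpy as np
from scipy.stats import chi2_contingency

# rows: eye colour (blue, light, medium, dark)
# columns: hair colour (fair, red, medium, dark, black)
N = np.array([[326,  38, 241, 110,  3],
              [688, 116, 584, 188,  4],
              [343,  84, 909, 412, 26],
              [ 98,  48, 403, 681, 85]])

chi2, p, df, expected = chi2_contingency(N)
print(round(chi2, 3), df)   # matches R: X-squared = 1240.039, df = 12
```

Note that df = (4-1)*(5-1) = 12, in agreement with the (I-1)(J-1) formula.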

Contingency tables: homogeneity and heterogeneity

t = X^2/n is the coefficient of association called Pearson's mean-square contingency; it is now called the total inertia. The total inertia is a measure of the homogeneity/heterogeneity of the table: if t is large the table is heterogeneous, and if t is small it is homogeneous. Homogeneity means that there is no row-column association. t can also be calculated using

    t = sum_i r_i sum_j (p_ij/r_i - c_j)^2 / c_j,

where p_ij = n_ij/n, r_i = n_i./n and c_j = n_.j/n. The second (inner) summation is a weighted squared distance between the vector of relative frequencies of the ith row (the ith row profile, p_ij/r_i) and the average row profile c; the weights are the inverses of the elements of c. It is known as the chi-squared distance between the ith row profile and the average row profile. The total inertia is then a further weighted sum of the I chi-squared distances, with weights the elements of r. If all row profiles are close to the average row profile then the table is homogeneous; otherwise it is heterogeneous. We can do similar calculations for the column profiles, simply by exchanging the roles of r and c. These distances are similar to Euclidean distances, and techniques used for Euclidean distances can also be used in this case. We will learn techniques for metric scaling in one of the future lectures.
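The two routes to the total inertia can be checked numerically; a sketch in Python (the slides use R), assuming numpy, on the caith counts:

```python
# Total inertia t = X^2/n computed two ways on the caith counts:
# directly from Pearson's X^2, and as the r-weighted sum of chi-squared
# distances between the row profiles and the average row profile c.
import numpy as np

N = np.array([[326,  38, 241, 110,  3],
              [688, 116, 584, 188,  4],
              [343,  84, 909, 412, 26],
              [ 98,  48, 403, 681, 85]], dtype=float)

n = N.sum()
P = N / n                    # correspondence matrix, p_ij
r = P.sum(axis=1)            # row masses r_i
c = P.sum(axis=0)            # column masses c_j (= average row profile)

profiles = P / r[:, None]    # row profiles p_ij / r_i
d2 = ((profiles - c) ** 2 / c).sum(axis=1)   # chi-squared distances
t = (r * d2).sum()           # total inertia, weighted by r

X2 = n * ((P - np.outer(r, c)) ** 2 / np.outer(r, c)).sum()
print(round(t, 5), round(X2 / n, 5))   # the two computations agree
```

For the caith table t is about 0.23, the heterogeneity that the chi-squared test found so significant.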

Contingency table: Correspondence analysis

The usual technique for contingency table analysis is correspondence analysis. Its aim is to find associations between rows and columns. It should of course only be carried out if the chi-squared test shows that there might be row-column association. Let P be the matrix with elements p_ij = n_ij/n. Then r and c can be defined as

    r = P 1,   c = P^T 1,

where 1 is a column vector of ones of the appropriate dimension: when c is calculated its dimension is I, and when r is calculated its dimension is J. Let us further define the diagonal matrices formed by r and c, D_r = diag(r) and D_c = diag(c), and let the matrix E be

    E = D_r^(-1/2) (P - r c^T) D_c^(-1/2),   i.e.   e_ij = (p_ij - r_i c_j) / sqrt(r_i c_j).

We can see that the elements of E have been used above to calculate the chi-squared distances, the total inertia, etc. The elements of E are related to the standardised Pearson residuals, the quantities that contribute to the calculation of the X^2 statistic.
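The link between E and the standardised Pearson residuals can be verified directly; a Python sketch (assuming numpy; the slides work in R) on the caith counts:

```python
# The matrix E = D_r^(-1/2) (P - r c^T) D_c^(-1/2) and its link to the
# standardised Pearson residuals: E equals the residuals divided by
# sqrt(n), so n * sum(E^2) recovers Pearson's X^2.
import numpy as np

N = np.array([[326,  38, 241, 110,  3],
              [688, 116, 584, 188,  4],
              [343,  84, 909, 412, 26],
              [ 98,  48, 403, 681, 85]], dtype=float)

n = N.sum()
P = N / n
r = P.sum(axis=1)
c = P.sum(axis=0)

E = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

expected = np.outer(r, c) * n
residuals = (N - expected) / np.sqrt(expected)   # standardised residuals

print(np.allclose(E, residuals / np.sqrt(n)))    # the stated relation holds
print(round(n * (E ** 2).sum(), 2))              # ~ 1240.04, the X^2 value
```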

Contingency table and SVD

Now we can use the SVD of E to analyse the contingency table. The SVD of E is

    E = U D V^T,

where U and V are the orthogonal matrices containing the eigenvectors of E E^T and E^T E respectively, and D is the diagonal matrix containing the square roots of the non-zero eigenvalues of these matrices. The elements of D are also called the canonical correlations. Row scores are calculated using

    F = D_r^(-1/2) U D,

where the rows of this matrix are the row scores. Column scores are calculated using

    G = D_c^(-1/2) V D.

Pairs of columns of F and G are the elements of the orthogonal decomposition of the residuals, in decreasing order of importance. This approach is called the correspondence analysis of the contingency table. The centroids of F and G (weighted by r and c) are 0, and F and G are related by the transition formulas

    F = D_r^(-1) P G D^(-1),   G = D_c^(-1) P^T F D^(-1).
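The whole decomposition can be sketched numerically; again in Python with numpy (the slides use R's corresp for this), on the caith counts:

```python
# Correspondence analysis of the caith counts via the SVD of E.
# The singular values d are the canonical correlations; d**2 are the
# principal inertias and sum to the total inertia X^2/n.
import numpy as np

N = np.array([[326,  38, 241, 110,  3],
              [688, 116, 584, 188,  4],
              [343,  84, 909, 412, 26],
              [ 98,  48, 403, 681, 85]], dtype=float)

n = N.sum()
P = N / n
r = P.sum(axis=1)
c = P.sum(axis=0)

E = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, d, Vt = np.linalg.svd(E, full_matrices=False)

F = U * d / np.sqrt(r)[:, None]      # row scores,    F = D_r^(-1/2) U D
G = Vt.T * d / np.sqrt(c)[:, None]   # column scores, G = D_c^(-1/2) V D

print(np.round(d[:3], 4))            # leading canonical correlations
print(round((d ** 2).sum(), 5))      # total inertia X^2/n
```

The last singular value is numerically zero (the trivial dimension), so summing d**2 over all dimensions still gives the total inertia, and the r-weighted centroid of F comes out as zero, as stated above.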

Correspondence analysis

The squared elements of D are called the principal inertias; the elements of D themselves are the canonical correlations reported by the package R. A larger value in D means that the corresponding dimension has higher importance. It is usual to use one or two columns of F and G; these are then used for various plots. For pictorial representation, either the rows and columns are plotted in an ordered form, or a biplot is used to find possible associations between rows and columns. Let us use the example caith from the R package MASS and analyse it. As mentioned, there is some association between the rows and the columns (the chi-squared statistic was very large).

Plot of correspondence analysis: Example

This is a pictorial form of the table itself. The positions of the rows and columns correspond to the row and column scores. This picture can already tell us something about the structure of the data.

Biplot for the correspondence analysis

Biplot produced by R: black points are rows and red points are columns. The positions of the points correspond to their scores. Again, from this picture we can deduce some structure in the data.

R commands for contingency tables and correspondence analysis

For correspondence analysis we need the libraries ctest, MASS and mva, which we load first (in modern versions of R the functions from ctest and mva are part of the base stats package, so only MASS needs to be loaded explicitly):

library(mva)
library(MASS)
library(ctest)

To perform the chi-squared test we can use (load the data first):

data(caith)
chisq.test(caith)

If there is some association between rows and columns then we can start using correspondence analysis:

ccaith = corresp(caith, nf=1)

nf is the number of factors we want to find. We can plot the result using the plot command:

plot(ccaith)

If we have only 1 factor then the result will be a pictorial representation of the table; if nf=2 then the result will be the biplot.

References

Krzanowski WJ and Marriott FHC (1994) Multivariate Analysis, Vol. 2. Kendall's Library of Statistics.

Exercises 5

a) Take the data from R. The data set is deaths – monthly death rates from lung diseases in the UK. These data cannot be used directly with the chisq.test and corresp commands; they should first be converted to a data matrix. It can be done using:

data(deaths)
dth = matrix(deaths, ncol=12, byrow=TRUE)

Now try to analyse these data using the correspondence analysis technique.

b) Take the data set accdeaths (accidental deaths in the USA, 1973-1978). These data should also be converted to a data matrix. If you are curious, try drivers and analyse it; this is a data set on drivers' deaths. You might see an interesting feature.