Contingency Table and Correspondence Analysis


1 Contingency Table and Correspondence Analysis
Nishith Kumar, Department of Statistics, BSMRSTU, and Mohammed Nasser, Department of Statistics, RU

2 Overview
Contingency table
Some real-world problems for contingency tables
Pearson chi-squared test
Probabilistic interpretation of matrices
Contingency tables: homogeneity and heterogeneity
Historical background of correspondence analysis
Correspondence analysis (CA)
Correspondence analysis and eigenvalues
Singular value decomposition
Calculation procedure of CA
Interpretation of correspondence analysis
R code and examples
Conclusion

3 Contingency Table In statistics, a contingency table (also called a cross tabulation or cross tab) is a table in matrix format that displays the multivariate frequency distribution of two or more variables. The term contingency table was first used by Karl Pearson in 1904. A contingency table is sometimes also called an incidence matrix. Contingency tables are widely used in the social sciences (for example sociology, education, and psychology). They can be regarded as frequency tables whose rows and columns are the categories of categorical variables. A continuous variable can be included by binning it, which converts it into a categorical one.
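As a minimal sketch of the binning step described above, the following R code bins a continuous variable and cross-tabulates it with a categorical one. The data are simulated and the variable names (age, health, age_group) are hypothetical, chosen only for illustration.

```r
# Hypothetical raw data: one record per respondent (simulated, not from the slides).
set.seed(1)
age    <- sample(16:90, 200, replace = TRUE)              # continuous variable
health <- factor(sample(c("good", "regular", "bad"), 200, replace = TRUE),
                 levels = c("good", "regular", "bad"))    # categorical variable

# Bin the continuous variable into categories, then cross-tabulate.
age_group <- cut(age, breaks = c(15, 34, 54, 74, Inf),
                 labels = c("16-34", "35-54", "55-74", "75+"))
tab <- table(age_group, health)

print(tab)
print(addmargins(tab))   # the same table with row and column totals
```

The object tab is an ordinary contingency table, so it can be passed to functions such as chisq.test().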

4 Real Problem Cross-tabulation of age groups by perceived health status
Age group  Very Good  Good  Regular  Bad  Very Bad
16-24      243        789   167      18   6
25-34      220        809   164      35   6
35-44      147        658   181      41   8
45-54      90         469   236      50   16
55-64      53         414   306      106  30
65-74      44         267   284      98   20
75+        20         136   157      66   17
1. Is there any relation between age group and perceived health status?
2. How can we visualize this type of relationship?
3. How can we find the similarity of the row categories?
4. How can we interpret the distances between the categories of the row and column variables?

5 Real Problem Suppose we have the following contingency table of staff group by smoking behavior
Staff group       none  light  medium  heavy  Total
Senior managers   4     2      3       2      11
Junior managers   4     3      7       4      18
Senior employees  25    10     12      4      51
Junior employees  18    24     33      13     88
Secretaries       10    6      7       2      25
TOTAL             61    45     62      25     193
1. How can we analyze contingency-table data?
2. How can we convert frequency table data into graphical displays?
3. How can we find the similarity of the column categories?
4. How can we find the similarity of the row categories?
5. How can we find the relationship between row and column categories simultaneously?

6 Real Problem Survey of the effects of four different drug types. Patients gave a score to each drug type (excellent, very good, good, fair, poor), which gives a table with rows Drug A, Drug B, Drug C, Drug D and columns excellent, very good, good, fair, poor. The total number of observations is 121.
Is there an association between columns and rows?
If there is some association, how can we find structure in this data table?
Can we order columns and rows by their closeness?
Can we find associations between columns and rows?

7 Pearson chi-squared test
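Suppose that we have a data matrix X that has I rows and J columns, with elements $x_{ij}$. Let us use the following notation:
$n = \sum_{i=1}^{I}\sum_{j=1}^{J} x_{ij}$, $P = X/n$, $r = P\mathbf{1}$, $c = P^{T}\mathbf{1}$, $D_r = \mathrm{diag}(r)$, $D_c = \mathrm{diag}(c)$,
$R = D_r^{-1}P$, $C = D_c^{-1}P^{T}$, $Q = P - rc^{T}$.
Here r and c are the row and column sums of P, R and C are the row and column profiles, respectively, and Q is the difference between P and the product of the row and column sums.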

8 Pearson chi-squared test (Cont.)
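More notation and relations:
$\mathrm{in}(I) = \sum_{i} r_i (R_i - c)^{T} D_c^{-1} (R_i - c)$, $\mathrm{in}(J) = \sum_{j} c_j (C_j - r)^{T} D_r^{-1} (C_j - r)$,
$\mathrm{in}(I) = \mathrm{in}(J) = \sum_{i}\sum_{j} \frac{(p_{ij} - r_i c_j)^2}{r_i c_j} = \chi^2/n$,
where $R_i$ is the ith row profile and $C_j$ is the jth column profile, both written as column vectors. The row and column inertias are a multiple of the chi-squared statistic, which has (I-1)(J-1) degrees of freedom; the multiplier is 1/n. If P is a probability matrix and there is no association between rows and columns, then Q is 0. This is equivalent to saying that rows and columns are independent.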

9 Pearson chi-squared test (Cont.)
For Smoke Data: R Code:
library(ca)
library(MASS)
data("smoke")
ca(smoke)
chisq.test(smoke)
The output of ca(smoke) lists the principal inertias with their percentages (the first axis accounts for 87.76% of the total inertia), the inertias of the rows (SM, JM, SE, JE, SC) and the inertias of the columns (none, light, medium, heavy). As we have seen, chi-squared value = total inertia * grand total, with df = (no. of rows - 1) * (no. of columns - 1) = 12.
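The relation chi-squared = total inertia * grand total can be checked numerically. The sketch below uses base R only and assumes the smoking counts as distributed in the smoke data set of the ca package, typed in by hand here; the names c_mass, E and fit are hypothetical.

```r
# Smoking counts, typed in by hand (assumed to match the `smoke` data of the ca package).
smoke <- matrix(c( 4,  2,  3,  2,
                   4,  3,  7,  4,
                  25, 10, 12,  4,
                  18, 24, 33, 13,
                  10,  6,  7,  2),
                nrow = 5, byrow = TRUE,
                dimnames = list(c("SM", "JM", "SE", "JE", "SC"),
                                c("none", "light", "medium", "heavy")))

n      <- sum(smoke)            # grand total
P      <- smoke / n             # matrix of proportions
r      <- rowSums(P)            # row sums of P
c_mass <- colSums(P)            # column sums of P
E      <- outer(r, c_mass)      # products r_i * c_j (independence model)

total_inertia     <- sum((P - E)^2 / E)
chi2_from_inertia <- n * total_inertia
df                <- (nrow(smoke) - 1) * (ncol(smoke) - 1)

# chisq.test() warns about small expected counts for this table.
fit <- suppressWarnings(chisq.test(smoke))

print(c(inertia = total_inertia, chi2 = chi2_from_inertia, df = df))
print(unname(fit$statistic))    # should agree with chi2_from_inertia
```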

10 Pearson chi-squared test (Cont.)
Drug Data: R Code:
library(ca)
library(MASS)
ca(drug)
chisq.test(drug)
The output of ca(drug) lists the principal inertias with their percentages, the inertias of the rows (Drug A, Drug B, Drug C, Drug D) and the inertias of the columns (excellent, verygood, good, fair, poor). Chi-squared value = total inertia * grand total, with df = (no. of rows - 1) * (no. of columns - 1) = 12 and p-value = 4.53e-06, i.e. there is strong evidence of a row-column association.

11 Pearson chi-squared test (Cont.)
Health Data: R Code:
library(ca)
library(MASS)
ca(health)
chisq.test(health)
The output of ca(health) lists the principal inertias with their percentages (the first axis accounts for 97.25% of the total inertia), the inertias of the rows (the age groups) and the inertias of the columns (VG, GOOD, REG, BAD, VB). Chi-squared value = total inertia * grand total, with df = (no. of rows - 1) * (no. of columns - 1) = 24 and p-value < 2.2e-16, i.e. there is strong evidence of a row-column association.

12 Probabilistic Interpretation of Matrices
If the matrix P is a probability matrix, i.e. each element $p_{ij}$ is the probability that row category i and column category j occur together, then the matrices involved have the following interpretation:
Elements of r are the marginal probabilities of the rows.
Elements of c are the marginal probabilities of the columns.
Elements of Q are the differences between the joint probabilities and the products of the individual probabilities. In some sense this matrix represents the degree of dependence between rows and columns.
Elements of R are the conditional probabilities of the columns when the row is known.
Elements of C are the conditional probabilities of the rows when the column is known.
The total inertia is the overall indicator of the dependence between rows and columns.
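The interpretation above can be illustrated with a small numerical sketch. The 2 x 3 table below is hypothetical (it is not one of the data sets in these slides), and c_marg is used instead of c only to avoid masking the R function c().

```r
# Hypothetical 2 x 3 table of counts, used only for illustration.
X <- matrix(c(20, 30, 10,
              10, 10, 20), nrow = 2, byrow = TRUE,
            dimnames = list(c("row1", "row2"), c("colA", "colB", "colC")))

P      <- X / sum(X)              # joint probabilities p_ij
r      <- rowSums(P)              # marginal probabilities of the rows
c_marg <- colSums(P)              # marginal probabilities of the columns
Q      <- P - outer(r, c_marg)    # joint probability minus product of marginals
R      <- P / r                   # row profiles: P(column | row)
C      <- t(P) / c_marg           # column profiles: P(row | column)

print(round(Q, 4))
print(rowSums(R))                 # each row profile sums to 1
print(rowSums(C))                 # each column profile sums to 1
```

Each row of R and each row of C is a conditional distribution, so it sums to one, and Q is exactly zero only when rows and columns are independent.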

13 Marginal Probability of Drug Data
X (counts):
excellent very good good fair poor Total
Drug A 6 8 10 1 5 30
Drug B 12 3 31
Drug C
Drug D 7 29
Total 19 20 33 22 27 121
Elements of r are the marginal probabilities of the rows (drug types). Elements of c are the marginal probabilities of the columns (patient scores).
Excellent Very Good Good Fair Poor Marginal Probability of Drug type
Drug A 0.0413 0.248
Drug B 0.256
Drug C 0.0826
Drug D 0.0579 0.24
Marginal Probability of Patient Score 0.2231 1

14 Degree of dependencies of rows and columns
Elements of Q are the differences between the joint probabilities and the products of the individual probabilities. In some sense this matrix represents the degree of dependence between rows and columns. Q has rows Drug A, Drug B, Drug C, Drug D and columns excellent, very good, good, fair, poor. See slide no. 19 for R code.

15 Conditional Probabilities and Inertias
3) Elements of R are the conditional probabilities of the columns when the row is known. R has rows Drug A, Drug B, Drug C, Drug D and columns excellent, very good, good, fair, poor.
4) Elements of C are the conditional probabilities of the rows when the column is known. C has rows excellent, very good, good, fair, poor and columns Drug A, Drug B, Drug C, Drug D.
The total inertia is the overall indicator of the dependence between rows and columns. A small inertia indicates little or no row-column association.

16 Similarly, we can find the following measures for the Smoke data and the Health Status data:
Marginal probabilities
Degree of dependence of rows and columns
Conditional probabilities
Inertias

17 Contingency Tables: Homogeneity and Heterogeneity
$t = \mathrm{in}(I) = \mathrm{in}(J) = \chi^2/n$ is the coefficient of association known as Pearson's mean-square contingency. It is the total inertia. The total inertia measures the homogeneity or heterogeneity of the table: a large t indicates heterogeneity and a small t indicates homogeneity. Homogeneity means that there is no row-column association. t can also be calculated using: $t = \sum_{i=1}^{I} r_i \sum_{j=1}^{J} \frac{(p_{ij}/r_i - c_j)^2}{c_j}$

18 Contingency Tables: Homogeneity and Heterogeneity( Cont.)
We can interpret the formula for t in the following way. The second (inner) summation is a weighted squared distance between the vector of relative frequencies of the ith row (i.e. the ith row profile, $p_{ij}/r_i$) and the average row profile c. The weights are the inverses of the elements of c. This is known as the chi-squared distance between the ith row profile and the average row profile. The total inertia is then a weighted sum of these I chi-squared distances, with the elements of r as weights. If all row profiles are close to the average row profile, the table is homogeneous; otherwise the table is heterogeneous. We can do similar calculations for the column profiles simply by exchanging the roles of r and c.

19 Calculations of Inertia to Find Out the Homogeneity or Heterogeneity
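We can calculate t in R with the following code:
library(ca)
library(MASS)
###### Read data: drug is the 4 x 5 table of counts (see slide 29) ######
###### Probability matrix ######
pdrag <- as.matrix(drug) / 121      # 121 is the grand total
c <- colSums(pdrag)                 # column sums (average row profile)
r <- rowSums(pdrag)                 # row sums
Dr <- diag(r)
Dc <- diag(c)
q <- pdrag - r %*% t(c)             # Q = P - r c^T
R <- ginv(Dr) %*% pdrag             # row profiles
C <- ginv(Dc) %*% t(pdrag)          # column profiles
sp <- 0
tsp <- 0
t <- 0
for (i in 1:4) {
  for (j in 1:5) {
    sp[j] <- (pdrag[i, j] / r[i] - c[j])^2 / c[j]   # weighted squared deviation
  }
  tsp[i] <- sum(sp)                 # chi-squared distance of row i from the average profile
  t[i] <- r[i] * tsp[i]             # weighted by the row sum
}
ti <- sum(t)                        # total inertia
The total inertia t for the Drug data is stored in ti.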

20 Historical Background of Correspondence Analysis
Correspondence analysis (CA) was first proposed by Hirschfeld (1935).
Later, CA was developed by Jean-Paul Benzécri (1973).
The CA solution was shown by Greenacre (1984).
It was incorporated in R in 2009.

21 Correspondence Analysis
Correspondence analysis is a statistical technique for analyzing categorical data (Benzecri, 1992) that provides a graphical representation of cross tabulations or contingency tables. Correspondence analysis (CA) can be viewed as a generalized principal component analysis tailored to qualitative data. Although CA was originally created to analyze cross tabulations, it is so versatile that it is used with many other types of numerical data tables. It is formally applicable to any data matrix with nonnegative entries.

22 Objectives of CA The main objective of CA is to transform a data table into two sets of factor scores (one for the rows and one for the columns) that give the best representation of the similarity structure of the rows and columns of the table. Like principal component analysis, correspondence analysis reduces the dimension of a data matrix, so with CA we can visualize the data in two or three dimensions.

23 Correspondence analysis and eigenvalues
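For a given contingency table we calculate the row and column profiles. Now we want to find a vector g such that the row profiles multiplied by g, i.e. Rg, have the highest possible variance. With $D_r$ and $D_c$ denoting the diagonal matrices of the row sums r and the column sums c, this means that we want to maximize $(Rg)^{T} D_r (Rg) - (c^{T}g)^2$, where $c^{T}g = r^{T}Rg$ is the weighted mean of Rg. To make this problem solvable we add a constraint (similar to PCA): the weighted norm of g must be one and its weighted mean must be 0, with the column sums as weights, i.e. $g^{T} D_c g = 1$ and $c^{T} g = 0$. So we have to maximize $g^{T} R^{T} D_r R g$ subject to $g^{T} D_c g = 1$.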

24 Correspondence analysis and eigenvalues (cont.)
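To maximize this function under the constraint we can use the technique of Lagrange multipliers. The Lagrange function is $L(g, \lambda) = g^{T} R^{T} D_r R g - \lambda (g^{T} D_c g - 1)$. Differentiating L with respect to g and setting the derivative equal to zero gives $R^{T} D_r R g = \lambda D_c g$, i.e. $D_c^{-1} R^{T} D_r R g = \lambda g$. Thus the problem reduces to an eigenvalue problem. As a result we obtain the coordinates for the columns (principal coordinates after scaling by the singular values $\sqrt{\lambda}$). Similarly we can find the principal coordinates for the rows. The problem is solved easily and compactly with the singular value decomposition.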
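The eigenvalue problem can be explored numerically. The sketch below is one way to do it in base R, under two assumptions: the smoking counts are typed in by hand as distributed in the smoke data of the ca package, and the problem is solved in a symmetric form obtained with the substitution $g = D_c^{-1/2} v$ (a choice made here for numerical convenience). It uses $R^{T} D_r R = P^{T} D_r^{-1} P$. The names A, M, lambda and g1 are hypothetical.

```r
# Smoking counts, typed in by hand (assumed to match the `smoke` data of the ca package).
smoke <- matrix(c( 4,  2,  3,  2,
                   4,  3,  7,  4,
                  25, 10, 12,  4,
                  18, 24, 33, 13,
                  10,  6,  7,  2), nrow = 5, byrow = TRUE)

P      <- smoke / sum(smoke)
r      <- rowSums(P)
c_mass <- colSums(P)

A           <- t(P) %*% diag(1 / r) %*% P        # R^T D_r R = P^T D_r^{-1} P
Dc_inv_sqrt <- diag(1 / sqrt(c_mass))
M           <- Dc_inv_sqrt %*% A %*% Dc_inv_sqrt # symmetric version of the problem
eig         <- eigen(M, symmetric = TRUE)

# The largest eigenvalue is the trivial one (equal to 1, with g constant);
# the constraint c^T g = 0 excludes it.
lambda <- eig$values[-1]                         # nontrivial eigenvalues
g1     <- Dc_inv_sqrt %*% eig$vectors[, 2]       # leading nontrivial solution g

E             <- outer(r, c_mass)
total_inertia <- sum((P - E)^2 / E)

print(lambda)
print(c(sum_lambda = sum(lambda), total_inertia = total_inertia))
print(c(weighted_mean = sum(c_mass * g1), weighted_norm = sum(c_mass * g1^2)))
```

The nontrivial eigenvalues add up to the total inertia, and g1 satisfies both constraints (weighted mean 0, weighted norm 1).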

25 Singular Value Decomposition
$X = U \Lambda V^{T}$, where
X is a real m×n matrix (n ≤ m);
U is an m×n column-orthonormal matrix containing the eigenvectors of $XX^{T}$;
Λ is an n×n diagonal matrix containing the singular values of X;
$V^{T}$ is an n×n row-orthonormal matrix containing the eigenvectors of $X^{T}X$.
XV = UΛ, and the columns of UΛ indicate the PCs. The left singular vectors show the structure of the observations. The right singular vectors show the structure of the variables.
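The decomposition can be tried out with the base R function svd(). The 4 x 3 matrix below is hypothetical and serves only to check the relations stated above.

```r
# Hypothetical real matrix with m = 4 rows and n = 3 columns (n <= m).
X <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 4,
              2, 2, 2), nrow = 4, byrow = TRUE)

s      <- svd(X)        # s$u is m x n, s$d holds the singular values, s$v is n x n
U      <- s$u
Lambda <- diag(s$d)
V      <- s$v

X_rebuilt <- U %*% Lambda %*% t(V)   # X = U Lambda V^T
scores    <- X %*% V                 # equals U Lambda

print(s$d)
print(max(abs(X - X_rebuilt)))       # reconstruction error, numerically zero
print(max(abs(scores - U %*% Lambda)))
```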

26 Correspondence Analysis Calculation Procedure
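To obtain coordinates using the SVD, the computational algorithm for the coordinates of the row and column profiles with respect to the principal axes is given below.
Let X be the contingency table and $n = \sum_{i}\sum_{j} x_{ij}$ its grand total.
$P = X/n$, row totals $r = P\mathbf{1}$, column totals $c = P^{T}\mathbf{1}$, and diagonal matrices $D_r = \mathrm{diag}(r)$, $D_c = \mathrm{diag}(c)$.
Calculate the matrix of standardized residuals: $S = D_r^{-1/2}(P - rc^{T})D_c^{-1/2}$.
Using the SVD, $S = U \Lambda V^{T}$, where U is an (m×n) column-orthonormal matrix ($U^{T}U = I$) containing the eigenvectors of the symmetric matrix $SS^{T}$, and $V^{T}$ is an (n×n) row-orthonormal matrix ($V^{T}V = I$) containing the eigenvectors of the symmetric matrix $S^{T}S$.
The principal coordinates of the rows: $F = D_r^{-1/2} U \Lambda$. The principal coordinates of the columns: $G = D_c^{-1/2} V \Lambda$.
The standard row and column coordinates are $\Phi = D_r^{-1/2} U$ and $\Gamma = D_c^{-1/2} V$, respectively.
The first few (one or two) columns of F and G are usually taken and plotted simultaneously.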

27 Interpretation of Correspondence analysis
The squares of the elements of Λ are called the principal inertias. The elements of Λ are also related to the canonical correlations reported by R. A larger value of Λ means that the corresponding axis has higher importance. It is usual to use one or two columns of F and G, which are then used for various plots. For pictorial representation, either the columns or the rows are plotted in an ordered form, or biplots are used to find possible associations between rows and columns as well as their order. Correspondence analysis can be considered a dimension reduction technique and can be used together with others (for example PCA). Comparative application of different dimension reduction techniques may give insight into the problem and the structure in the data.

28 Algorithm of Correspondence Analysis
1. Take a contingency table (X) and find the sum of all elements (total sum = n).
2. Divide all elements by the total sum (call the result P).
3. Find the row and column sums (r and c).
4. Calculate the matrix of standardized residuals, $S = D_r^{-1/2}(P - rc^{T})D_c^{-1/2}$.
5. Find the SVD of S.
6. Find the principal row and column coordinates.
7. Take the first few coordinates and plot them.
8. Analyze the results (order and closeness of columns and rows, possible associations between columns and rows).
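The steps of the algorithm can be sketched in base R as follows. This is an illustration under stated assumptions, not the implementation used by the ca package: the smoking counts are typed in by hand as distributed in the smoke data of that package, and the names c_mass, dec, sv, Fr and Gc are hypothetical.

```r
# Step 1: contingency table and grand total (counts typed in by hand).
smoke <- matrix(c( 4,  2,  3,  2,
                   4,  3,  7,  4,
                  25, 10, 12,  4,
                  18, 24, 33, 13,
                  10,  6,  7,  2),
                nrow = 5, byrow = TRUE,
                dimnames = list(c("SM", "JM", "SE", "JE", "SC"),
                                c("none", "light", "medium", "heavy")))
n <- sum(smoke)

P      <- smoke / n                      # step 2
r      <- rowSums(P)                     # step 3: row sums
c_mass <- colSums(P)                     #         column sums

# Step 4: standardized residuals S = D_r^{-1/2} (P - r c^T) D_c^{-1/2}
S <- diag(1 / sqrt(r)) %*% (P - outer(r, c_mass)) %*% diag(1 / sqrt(c_mass))

dec <- svd(S)                            # step 5
k   <- min(dim(smoke)) - 1               # number of nontrivial axes
sv  <- dec$d[1:k]                        # singular values

# Step 6: principal coordinates F = D_r^{-1/2} U Lambda, G = D_c^{-1/2} V Lambda
Fr <- diag(1 / sqrt(r))      %*% dec$u[, 1:k] %*% diag(sv)
Gc <- diag(1 / sqrt(c_mass)) %*% dec$v[, 1:k] %*% diag(sv)
rownames(Fr) <- rownames(smoke)
rownames(Gc) <- colnames(smoke)

inertia <- sv^2                          # principal inertias
print(round(100 * inertia / sum(inertia), 2))   # percentage of inertia per axis

# Step 7: the first two columns are the coordinates one would plot.
print(round(Fr[, 1:2], 4))
print(round(Gc[, 1:2], 4))
```

The printed percentages can be compared with the 87.76% quoted for the first axis of the smoke data, and the two coordinate matrices can be passed to plot() and text() to draw the map.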

29 Correspondence Analysis in Drug data
R code:
library(ca)
# The five counts that follow each drug label are missing from this transcript.
drug <- read.table(text = "
qlt excellent verygood good fair poor
DrugA
DrugB
DrugC
DrugD
", row.names = 1, header = TRUE)
plot(ca(drug), mass = c(TRUE, TRUE))
plot(ca(drug), mass = c(TRUE, TRUE), arrows = c(FALSE, TRUE))
summary(ca(drug))

30 Biplot of Drug data using Correspondence Analysis
[Figure: biplot of the Drug data, with a table of principal inertias (eigenvalues) by dimension: dim, value, %, cum%, and total.]

31 Correspondence analysis in Smoke Data
The summary reports the principal inertias (eigenvalues) by dimension (dim, value, %, cum%) together with a scree plot.
library(ca)
data("smoke")
plot(ca(smoke), mass = c(TRUE, TRUE))
summary(ca(smoke))

32 Biplot using Correspondence analysis
library(ca)
data("smoke")
plot(ca(smoke), mass = c(TRUE, TRUE), arrows = c(FALSE, TRUE))

33 Three Dimensional plot using Correspondence analysis
library(ca)
data("smoke")
plot3d.ca(ca(smoke, nd = 3))

34 Correspondence analysis in Health Data
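library(ca)
# One line per age group (label followed by its five counts) belongs after the header; see the table on slide 4.
health <- read.table(text = "
age VG GOOD REG BAD VB
", row.names = 1, header = TRUE)
plot(ca(health), mass = c(TRUE, TRUE))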

35 Biplot of Health Data Correspondence analysis
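library(ca)
# One line per age group (label followed by its five counts) belongs after the header; see the table on slide 4.
health <- read.table(text = "
age VG GOOD REG BAD VB
", row.names = 1, header = TRUE)
plot(ca(health), mass = c(TRUE, TRUE), arrows = c(FALSE, TRUE))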

36 Conclusion In conclusion, we can say that correspondence analysis can:
Convert frequency table data into graphical displays
Show the similarity of the row categories
Show the similarity of the column categories
Show the relationship between row and column categories simultaneously
Although CA was originally created to analyze cross tabulations, it is so versatile that it is used with many other types of numerical data tables. It is formally applicable to any data matrix with nonnegative entries.

37 Future Studies
Study multiple correspondence analysis.
High-dimensional data analysis using correspondence analysis.
Assess the effect of outliers.
The first CA axis is reliable, but the second and later axes can be quadratic distortions of the first, which produces the "arch effect". So my future study is how to solve this problem.
Application of CA to microarray data to find gene patterns and the similarity of gene structure.
Missing values and outliers are a general problem in microarray data. My target is therefore to propose a robust correspondence analysis method that can handle both outliers and missing values.

38 References
Benzécri, J.-P. (1973). L'Analyse des Données. Volume II. L'Analyse des Correspondances. Paris, France: Dunod.
Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.
Greenacre, M. (2007). Correspondence Analysis in Practice, Second Edition. London: Chapman & Hall/CRC.
Greenacre, M. and Nenadic, O. (2007). "Correspondence Analysis in R, with Two- and Three-dimensional Graphics: The ca Package". Journal of Statistical Software, Vol. 20, Issue 3.
Hirschfeld, H.O. (1935). "A connection between correlation and contingency". Proc. Cambridge Philosophical Society, 31, 520–524.

39 Thank You so Much for Your Patience

