Bioinformatics R for Bioinformatics PART II Kristel Van Steen, PhD, ScD Université de Liege - Institut Montefiore 2008-2009.


Simplified Epistasis Testing

We shall now use logistic regression in R to test for epistatic interactions between locus 3 and another unlinked locus (locus 5). An epistatic interaction means that the combined effect of loci 3 and 5 is greater than the product (on the odds scale) or the sum (on the log odds scale) of the individual effects of locus 3 and locus 5. First remove the data currently attached in memory and read in the new data. These data are the same as the original pedfile, but with an additional column giving the genotype at (unlinked) locus 5:

    detach(casecon)
    newcasecon <- read.table("newcasecondata.txt", header=TRUE)
    attach(newcasecon)

You can look at the data by typing

    fix(newcasecon)
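To make the definition of an epistatic interaction concrete, here is a minimal numeric sketch. The odds ratios are made-up values chosen only for illustration; they are not estimates from the practical's data.

```r
# Hypothetical odds ratios, chosen only for illustration
or3 <- 1.5   # odds ratio for a risk genotype at locus 3 alone
or5 <- 2.0   # odds ratio for a risk genotype at locus 5 alone

# With no epistasis, the joint odds ratio is the product of the two ...
or_joint_expected <- or3 * or5
# ... which corresponds to a sum on the log odds scale
log_or_joint_expected <- log(or3) + log(or5)

# A hypothetical observed joint odds ratio
or_joint_observed <- 4.5

# Interaction term: departure from additivity on the log odds scale
interaction_log_or <- log(or_joint_observed) - log_or_joint_expected

print(or_joint_expected)        # 3
print(exp(interaction_log_or))  # 1.5
```

In this made-up example the observed joint effect is 1.5 times larger than the product of the individual effects, which is the kind of departure the logistic regression test looks for.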

Cordell practical (see statistical genetics class)

Next create appropriate genotype and case variables:

    case <- affected-1
    g3 <- genotype(loc3_1, loc3_2)
    g5 <- genotype(loc5_1, loc5_2)

The individual effects at locus 3 and 5 are now coded by the variables g3 and g5. We can test for association at each locus separately:

    gcontrasts(g3) <- "genotype"
    logit(case ~ g3)
    anova(logit(case ~ g3))
    gcontrasts(g5) <- "genotype"
    logit(case ~ g5)
    anova(logit(case ~ g5))
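The logit() helper used above is not part of base R. As a rough sketch of the same kind of single-locus test, the following uses base R's glm() with a binomial family on simulated data. That logit() behaves like this glm() call is an assumption, and the variables geno and status are hypothetical stand-ins for the practical's g3 and case.

```r
set.seed(1)
n <- 200

# Simulated data (not the practical's): genotype coded by the number of
# copies of allele 2, and a 0/1 case status
geno   <- factor(sample(0:2, n, replace = TRUE))
status <- rbinom(n, size = 1, prob = 0.4)

# Logistic regression with genotype as a 3-level factor (2 df)
fit <- glm(status ~ geno, family = binomial)

# Analysis of deviance with a chi-squared test for the genotype term
tab <- anova(fit, test = "Chisq")
print(tab)
```

Treating genotype as a factor gives a 2 df test that makes no assumption about how risk changes with the number of copies of allele 2.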

In order to investigate epistasis, it is more convenient to create new variables that code numerically for the number of copies of allele 2 in each genotype:

    count3<-allele.count(g3,2)
    count5<-allele.count(g5,2)

Check you understand how variables count3 and count5 relate to g3 and g5 by typing

    g3
    count3
    g5
    count5
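For readers without the genotype functions loaded, the idea behind the allele count can be sketched in base R. The allele vectors a1 and a2 below are hypothetical and stand in for a pair of columns such as loc3_1 and loc3_2; this is an illustration of the coding, not the implementation of allele.count().

```r
# Hypothetical allele columns for five individuals (alleles coded 1 or 2)
a1 <- c(1, 1, 2, 2, 1)
a2 <- c(1, 2, 2, 1, 2)

# Number of copies of allele 2 carried by each individual
count <- (a1 == 2) + (a2 == 2)

# Unordered genotype labels, for comparison with the counts
geno <- paste(pmin(a1, a2), pmax(a1, a2), sep = "/")

print(data.frame(geno, count))
```

Genotypes 1/1, 1/2 and 2/2 map to counts 0, 1 and 2 respectively.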

We then create a variable that codes for the combined effect of locus 3 and 5 as follows:

    combo<-10*count3+count5

Check you understand how the 'combo' variable relates to g3 and g5 by typing

    g3
    g5
    combo

Now we need to code each of these variables as 'factors', which means we simply consider the numeric codes to act as labels for the different categories rather than having numeric meaning:

    fact3<-factor(count3)
    fact5<-factor(count5)
    factcombo<-factor(combo)
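The coding of the combined variable can be checked on a small grid of all possible allele counts. This is a sketch on a constructed table rather than on the practical's data; the data frame grid is introduced here only for illustration.

```r
# All nine combinations of allele counts at the two loci
grid <- expand.grid(count3 = 0:2, count5 = 0:2)

# Tens digit = count at locus 3, units digit = count at locus 5
grid$combo <- 10 * grid$count3 + grid$count5
print(grid)

# As a factor, combo has one level per two-locus genotype combination
factcombo <- factor(grid$combo)
print(levels(factcombo))   # "0" "1" "2" "10" "11" "12" "20" "21" "22"
print(nlevels(factcombo))  # 9
```

Multiplying count3 by 10 keeps the two counts in separate digits, so every two-locus genotype receives its own label.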

Check that the analysis with the 'factors' gives the same results as you found previously with the genotype variables:

    anova(logit(case ~ fact3))
    anova(logit(case ~ fact5))

Now test whether there is significant epistasis by typing

    anova(logit(case ~ fact3 + fact5 + factcombo))
    1-pchisq(9.59,4)

This first fits the individual locus factors, then adds in the extra effect of the model with epistasis included (i.e. a model with 9 estimated parameters corresponding to the 9 genotype combinations), and tests the difference between the models. You should get a chi-squared of 9.59 on 4 df with a p value of about 0.048 (the value returned by 1-pchisq(9.59,4)), i.e. there is marginal evidence of epistasis.

The above test is valid for testing for epistasis between linked or unlinked loci, although it does not allow for haplotype (phase) effects between linked loci. A more powerful test for epistasis between UNLINKED LOCI ONLY is to use 'case-only' analysis and test whether the genotypes at one locus predict those at the other, in the cases alone. This is only valid at unlinked loci, because at linked loci we expect genotypes at one locus to predict those at the other (even in controls) due to linkage disequilibrium.
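The comparison of the main-effects model with the 9-parameter model can also be sketched with base R's glm() on simulated data. The data below are simulated with no genetic effect at all, so the statistic will not match the 9.59 from the practical. Writing the full model as fact3 * fact5 is a choice made for this sketch; it should give the same fit as adding factcombo, since both models have one parameter per genotype combination.

```r
set.seed(2)
n <- 1000

# Simulated allele counts and case status (no true effect of either locus)
count3 <- sample(0:2, n, replace = TRUE)
count5 <- sample(0:2, n, replace = TRUE)
status <- rbinom(n, size = 1, prob = 0.4)
fact3  <- factor(count3)
fact5  <- factor(count5)

# Main-effects model: intercept + 2 + 2 = 5 parameters
fit_main <- glm(status ~ fact3 + fact5, family = binomial)
# Full two-locus model: 9 parameters, one per genotype combination
fit_full <- glm(status ~ fact3 * fact5, family = binomial)

# Likelihood ratio test of the 4 extra interaction parameters
lrt <- anova(fit_main, fit_full, test = "Chisq")
print(lrt)

# p value for the chi-squared statistic quoted in the text
p <- pchisq(9.59, df = 4, lower.tail = FALSE)
print(round(p, 3))  # 0.048
```

The 4 df arise as the difference between the 9 parameters of the full model and the 5 parameters of the main-effects model.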

To do this, we can use a chi squared test to look for correlation (association) between the loci within the case and control groups separately. First we need to set up 2 new vectors of genotypes for loci 3 and 5, using only the cases. To do this, we can take advantage of the fact that the data has been ordered in such a way that cases are the first 384 observations. (Check this by typing case or fix(newcasecon).) So we can create genotype vectors just for the cases using the following commands:

    caseg3<-g3[1:384]
    caseg5<-g5[1:384]

Take a look at the vectors you have created by typing

    caseg3
    caseg5

Now do a chi-squared test on the genotype variables to see if they are correlated with each other:

    table(caseg3,caseg5)
    chisq.test(caseg3,caseg5)
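Selecting the cases by position only works because the file happens to be sorted. A sketch of an alternative is to select them with a logical index on case status, which does not depend on row order. The data below are simulated, and the names status and is_case are hypothetical.

```r
set.seed(3)
n <- 300

# Simulated case status and genotypes at two loci, in arbitrary row order
status <- rbinom(n, size = 1, prob = 0.4)
g3 <- sample(c("1/1", "1/2", "2/2"), n, replace = TRUE)
g5 <- sample(c("1/1", "1/2", "2/2"), n, replace = TRUE)

# Logical index picks out the cases wherever they sit in the data
is_case <- status == 1
caseg3  <- g3[is_case]
caseg5  <- g5[is_case]

# Case-only test: 3 x 3 table of genotypes, chi-squared test on 4 df
tab <- table(caseg3, caseg5)
print(tab)
res <- chisq.test(tab)
print(res)
```

With data sorted as in the practical, indexing by case == 1 and indexing by 1:384 would be expected to select the same individuals.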

You should find much more significant evidence of epistasis (p value 0.002) than you did using logistic regression. This is not surprising as the case-only test of interaction is a more powerful test. However, the case-only test does rely on the assumption that the two genotype variables g3 and g5 are uncorrelated in the general population. Strictly speaking, we cannot test this assumption as we do not have a population-based control sample (our controls are all unaffected). However, if the disease is rare, our controls should be reasonably close to an unselected sample. So we can use them to see if the genotype variables g3 and g5 are uncorrelated in the control population:

    contg3<-g3[385:1056]
    contg5<-g5[385:1056]
    contg3
    contg5
    table(contg3,contg5)
    chisq.test(contg3,contg5)

You should find a non-significant p value (p=0.99). This suggests that the case-only analysis we did is valid, so there is indeed some reasonable (p=0.002) evidence for statistical interaction between these loci.
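The case-only test and the check in controls can be run side by side by splitting the data on case status. The following is a sketch on simulated data in which the loci are independent in both groups; the data frame dat and the helper locus_test are hypothetical names introduced for illustration.

```r
set.seed(4)
n <- 600

# Simulated data: 0 = control, 1 = case, with independent genotypes
dat <- data.frame(
  status = rbinom(n, size = 1, prob = 0.4),
  g3 = sample(c("1/1", "1/2", "2/2"), n, replace = TRUE),
  g5 = sample(c("1/1", "1/2", "2/2"), n, replace = TRUE)
)

# Chi-squared test of independence between the two loci within one group
locus_test <- function(d) chisq.test(table(d$g3, d$g5))$p.value

# split() orders the groups as "0", "1", i.e. controls first, then cases
p_by_group <- sapply(split(dat, dat$status), locus_test)
names(p_by_group) <- c("controls", "cases")
print(p_by_group)
```

In the practical's interpretation, a small p value in cases together with a large p value in controls is what supports an interaction.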

Running the command lines in R to test for epistasis

Resources for microarray analysis:

Review Paper: gene expression analysis (Slonim et al 2002)