Presentation is loading. Please wait.

Presentation is loading. Please wait.

Bioinformatics R for Bioinformatics PART II Kristel Van Steen, PhD, ScD Université de Liege - Institut Montefiore 2008-2009.

Similar presentations


Presentation on theme: "Bioinformatics R for Bioinformatics PART II Kristel Van Steen, PhD, ScD Université de Liege - Institut Montefiore 2008-2009."— Presentation transcript:

1 Bioinformatics R for Bioinformatics PART II Kristel Van Steen, PhD, ScD (kristel.vansteen@ulg.ac.be) Université de Liege - Institut Montefiore 2008-2009

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27 Simplified Epistasis Testing We shall now use logistic regression in R to test for epistatic interactions between locus 3 and another unlinked locus (locus 5). An epistatic interaction means that the combined effect of locus 3 and 5 is greater than the product (on the odds scale) or the sum (on the log odds scale) of the locus 3 and locus 5 individual effects. First get rid of the data in the memory and read in the new data. This data is the same as the original pedfile, but with an additional column giving genotype at (unlinked) locus 5: detach(casecon) newcasecon <- read.table("newcasecondata.txt", header=T) attach(newcasecon) You can look at the data by typing fix(newcasecon) We shall now use logistic regression in R to test for epistatic interactions between locus 3 and another unlinked locus (locus 5). An epistatic interaction means that the combined effect of locus 3 and 5 is greater than the product (on the odds scale) or the sum (on the log odds scale) of the locus 3 and locus 5 individual effects. First get rid of the data in the memory and read in the new data. This data is the same as the original pedfile, but with an additional column giving genotype at (unlinked) locus 5: detach(casecon) newcasecon <- read.table("newcasecondata.txt", header=T) attach(newcasecon) You can look at the data by typing fix(newcasecon)

28 Cordell practical (see statistical genetics class) Next create appropriate genotype and case variables: case <- affected-1 g3 <- genotype(loc3_1, loc3_2) g5 <- genotype(loc5_1, loc5_2) The individual effects at locus 3 and 5 are now coded by the variables g3 and g5. We can test for association at each locus separately: gcontrasts(g3) <- "genotype" logit (case ~ g3) anova(logit (case ~ g3)) gcontrasts(g5) <- "genotype" logit (case ~ g5) anova(logit (case ~ g5)) Next create appropriate genotype and case variables: case <- affected-1 g3 <- genotype(loc3_1, loc3_2) g5 <- genotype(loc5_1, loc5_2) The individual effects at locus 3 and 5 are now coded by the variables g3 and g5. We can test for association at each locus separately: gcontrasts(g3) <- "genotype" logit (case ~ g3) anova(logit (case ~ g3)) gcontrasts(g5) <- "genotype" logit (case ~ g5) anova(logit (case ~ g5))

29 In order to investigate epistasis, it is more convenient to create new variables that code numerically for the number of copies of allele 2 in each genotypes count3<-allele.count(g3,2) count5<-allele.count(g5,2) Check you understand how variables count3 and count5 relate to g3 and g5 by typing g3 count3 g5 count5 In order to investigate epistasis, it is more convenient to create new variables that code numerically for the number of copies of allele 2 in each genotypes count3<-allele.count(g3,2) count5<-allele.count(g5,2) Check you understand how variables count3 and count5 relate to g3 and g5 by typing g3 count3 g5 count5

30 We then create a variable that codes for the combined effect of locus 3 and 5 as follows: combo<-10*count3+count5 Check you understand how the 'combo' variable relates to g3 and g5 by typing g3 g5 combo Now we need to code each of these variables as 'factors' which means we simply consider the numeric codes to act as labels for the different categories rather than having numeric meaning: fact3<-factor(count3) fact5<-factor(count5) factcombo<-factor(combo) We then create a variable that codes for the combined effect of locus 3 and 5 as follows: combo<-10*count3+count5 Check you understand how the 'combo' variable relates to g3 and g5 by typing g3 g5 combo Now we need to code each of these variables as 'factors' which means we simply consider the numeric codes to act as labels for the different categories rather than having numeric meaning: fact3<-factor(count3) fact5<-factor(count5) factcombo<-factor(combo)

31 Check that the analysis with the 'factors' gives the same results as you found previously with the genotype variables: anova(logit (case ~ fact3)) anova(logit (case ~ fact5)) Now test whether there is significant epistasis by typing anova(logit(case ~ fact3 + fact5 + factcombo)) 1-pchisq(9.59,4) This first fits the individual locus factors, and then adds in the extra effect of looking at the model with epistasis included (i.e. a model with 9 estimated parameters corresponding to the 9 genotype combinations), and tests the difference between the models. You should get a chi-squared of 9.59 on 4 df with p value 0.048 i.e. there is marginal evidence of epistasis. The above test is valid for testing for epistasis between linked or unlinked loci, although it does not allow for haplotype (phase) effects between linked loci. A more powerful test for epistasis between UNLINKED LOCI ONLY is to use 'case-only' analysis and test whether the genotypes at one locus predict those at the other, in the cases alone. This is only valid at unlinked loci, because at linked loci we expect genotypes at one locus to predict those at the other (even in controls) due to linkage disequilibrium. Check that the analysis with the 'factors' gives the same results as you found previously with the genotype variables: anova(logit (case ~ fact3)) anova(logit (case ~ fact5)) Now test whether there is significant epistasis by typing anova(logit(case ~ fact3 + fact5 + factcombo)) 1-pchisq(9.59,4) This first fits the individual locus factors, and then adds in the extra effect of looking at the model with epistasis included (i.e. a model with 9 estimated parameters corresponding to the 9 genotype combinations), and tests the difference between the models. You should get a chi-squared of 9.59 on 4 df with p value 0.048 i.e. there is marginal evidence of epistasis. The above test is valid for testing for epistasis between linked or unlinked loci, although it does not allow for haplotype (phase) effects between linked loci. A more powerful test for epistasis between UNLINKED LOCI ONLY is to use 'case-only' analysis and test whether the genotypes at one locus predict those at the other, in the cases alone. This is only valid at unlinked loci, because at linked loci we expect genotypes at one locus to predict those at the other (even in controls) due to linkage disequilibrium.

32 To do this, we can use a chi squared test to look for correlation (association) between the loci within the case and control groups separately. First we need to set up 2 new vectors of genotypes for loci 3 and 5, using only the cases. To do this, we can take advantage of the fact that the data has been ordered in such a way that cases are the first 384 observations. (Check this by typing case or fix(newcasecon) ). So we can create genotype vectors just for the cases using the following commands caseg3<-g3[1:384] caseg5<-g5[1:384] Take a look at the vectors you have created by typing caseg3 caseg5 Now do a chi-squared test on the genotype variables to see if they are correlated with each other: table(caseg3,caseg5) chisq.test(caseg3,caseg5) To do this, we can use a chi squared test to look for correlation (association) between the loci within the case and control groups separately. First we need to set up 2 new vectors of genotypes for loci 3 and 5, using only the cases. To do this, we can take advantage of the fact that the data has been ordered in such a way that cases are the first 384 observations. (Check this by typing case or fix(newcasecon) ). So we can create genotype vectors just for the cases using the following commands caseg3<-g3[1:384] caseg5<-g5[1:384] Take a look at the vectors you have created by typing caseg3 caseg5 Now do a chi-squared test on the genotype variables to see if they are correlated with each other: table(caseg3,caseg5) chisq.test(caseg3,caseg5)

33 You should find much more significant evidence of epistasis (p value 0.0018) than you did using logistic regression. This is not surprising as the case-only test of interaction is a more powerful test. However, the case-only test does rely on the assumption that the two genotype variables g3 and g5 are uncorrelated in the general population. Strictly speaking, we cannot test this assumption as we do not have a population-based control sample (our controls are all unaffected). However, if the disease is rare, our controls should be reasonably close to an unselected sample. So we can use them to see if the genotype variables g3 and g5 are uncorrelated in the control population: contg3<-g3[385:1056] contg5<-g5[385:1056] contg3 contg5 table(contg3,contg5) chisq.test(contg3,contg5) You should find a non-significant p value (p=0.99). This suggests that the case-only analysis we did is valid, so there is indeed some reasonable (p=0.002) evidence for statistical interaction between these loci. You should find much more significant evidence of epistasis (p value 0.0018) than you did using logistic regression. This is not surprising as the case-only test of interaction is a more powerful test. However, the case-only test does rely on the assumption that the two genotype variables g3 and g5 are uncorrelated in the general population. Strictly speaking, we cannot test this assumption as we do not have a population-based control sample (our controls are all unaffected). However, if the disease is rare, our controls should be reasonably close to an unselected sample. So we can use them to see if the genotype variables g3 and g5 are uncorrelated in the control population: contg3<-g3[385:1056] contg5<-g5[385:1056] contg3 contg5 table(contg3,contg5) chisq.test(contg3,contg5) You should find a non-significant p value (p=0.99). This suggests that the case-only analysis we did is valid, so there is indeed some reasonable (p=0.002) evidence for statistical interaction between these loci.

34 Running the command lines in R Running the command lines in R to test for epistasis to test for epistasis

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80 Resources for microarray analysis: http://www.nslij-genetics.org/microarray/

81 Review Paper: Review Paper: gene expression analysis (Slonim et al 2002)

82


Download ppt "Bioinformatics R for Bioinformatics PART II Kristel Van Steen, PhD, ScD Université de Liege - Institut Montefiore 2008-2009."

Similar presentations


Ads by Google