Recursive Partitioning And Its Applications in Genetic Studies Chin-Pei Tsai Assistant Professor Department of Applied Mathematics Providence University.

1 Recursive Partitioning And Its Applications in Genetic Studies Chin-Pei Tsai Assistant Professor Department of Applied Mathematics Providence University

2 OUTLINE
  - Genetic data
  - Example
  - Basic ideas of recursive partitioning
  - Applications in genetic studies
    - linkage analysis
    - association analysis
  - Recursive-partitioning based tools for data analyses

3 Genetic Data: Nuclear Family
  Tree-based Analyses in Genetic Studies
  [Figure: nuclear-family pedigree showing the father, the mother, and their offspring, with affected members marked.]

4 Genetic Data
  [Table: pedigree records with a Genotype column of marker allele pairs; the individual values are not recoverable from this transcript.]

5 Application of Recursive Partitioning in Microarray Data (Zhang et al., PNAS, 2001)
  - Gene expression profiles of 2,000 genes in 22 normal and 40 colon cancer tissues
  - Purpose: to predict the class (normal or cancer) of a new tissue

6 Automatically Selected Tree (by RTREE)
  CT = colon cancer tissues, NT = normal tissues
  Node 1 (CT: 40, NT: 22), split on M26383 > 60
    Node 2 (CT: 0, NT: 14)
    Node 3 (CT: 40, NT: 8), split on R15447 > 290
      Node 4 (CT: 10, NT: 8), split on M28214 > 770
        Node 6 (CT: 10, NT: 1)
        Node 7 (CT: 0, NT: 7)
      Node 5 (CT: 30, NT: 0)

7 [Figure: panels labeled Node 2 and Node 3]

8 [Figure: panels labeled Node 5, Node 7, and Node 6]

9 3-D Representation of Tree

10 Concluding Remarks
  - The three genes, IL-8 (M26383), CANX (R15447) and RAB3B (M28214), were chosen from 2,000 genes.
  - Using three genes can achieve high classification accuracy: the terminal nodes of the selected tree misclassify 1 of the 62 tissues used to grow it.
  - These three genes are related to tumors.

11 Basic Ideas in Classification Trees: Tree Growing
  - Impurity function: entropy
    For a binary outcome y = 0, 1, let p = proportion of (y = 1) in a node.
    Entropy: -p log(p) - (1 - p) log(1 - p), where 0 log(0) = 0
    [Figure: entropy plotted against p from 0 to 1, with p = 1/2 marked]
  - Splitting criterion
    Goodness of split = weighted sum of node impurities
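The entropy impurity on this slide can be written as a few lines of Python. This is a minimal sketch, not code from the presentation: the function name `entropy` is an illustrative choice, and the natural logarithm is assumed because it reproduces the value .6931 (log 2) that the later slides report for a 50/50 node.

```python
import math


def entropy(p: float) -> float:
    """Entropy impurity -p log(p) - (1 - p) log(1 - p) for a binary outcome.

    Uses the natural logarithm (an assumption) and the convention 0 log(0) = 0.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    total = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:  # skip zero proportions: 0 * log(0) is taken to be 0
            total -= q * math.log(q)
    return total


print(round(entropy(0.5), 4))      # 0.6931, the maximum, reached at p = 1/2
print(entropy(0.0), entropy(1.0))  # 0.0 0.0, pure nodes have no impurity
```

Impurity is largest when the two classes are equally mixed and falls to zero for a pure node, which is why a good split is one whose daughter nodes have low entropy.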

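12 Node Impurity
  Counts in each node are (cancer subjects, normal subjects). The root node holds 11 cancer and 10 normal subjects.

  Split by   Left node   Right node   Entropy (left)   Entropy (right)
  Gender     10, 9       1, 1         .6918            .6931
  Race       9, 7        2, 3         .6853            .6730
  Smoked     9, 1        2, 9         .3251            .4741
  Age        7, 7        4, 3         .6931            .6829

  [Figure: the root node (11, 10) split on Gender (Male) into a left node (10, 9) with entropy .6918 and a right node (1, 1) with entropy .6931]

13 Goodness of Split
  Goodness of split: s = p(L) i(L) + p(R) i(R), where p(t) is the weight of node t (its share of the subjects) and i(t) is its entropy.

  Split by   p(L)    p(R)    i(L)    i(R)    s
  Gender     19/21   2/21    .6918   .6931   .6919
  Race       16/21   5/21    .6853   .6730   .6824
  Smoked     10/21   11/21   .3251   .4741   .4031
  Age        14/21   7/21    .6931   .6829   .6897

  No split: .6920
  Smoked gives the smallest weighted impurity, so it is the best of the four candidate splits.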
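The goodness-of-split arithmetic can be checked directly from the node counts on slide 12. The sketch below is illustrative only: the helper names are hypothetical, the counts are read as (first class, second class) pairs for the left and right daughter nodes, and natural logarithms are assumed. Figures are printed to three decimals because the slide appears to work from rounded entropies, so the last digit can differ.

```python
import math


def entropy_from_counts(a: int, b: int) -> float:
    """Entropy of a node holding a subjects of one class and b of the other."""
    n = a + b
    return -sum((c / n) * math.log(c / n) for c in (a, b) if c > 0)


def goodness_of_split(left, right) -> float:
    """s = p(L) i(L) + p(R) i(R): the weighted sum of daughter-node entropies."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return ((n_left / n) * entropy_from_counts(*left)
            + (n_right / n) * entropy_from_counts(*right))


# Candidate splits from slide 12: (left node counts, right node counts).
splits = {
    "Gender": ((10, 9), (1, 1)),
    "Race": ((9, 7), (2, 3)),
    "Smoked": ((9, 1), (2, 9)),
    "Age": ((7, 7), (4, 3)),
}

scores = {name: goodness_of_split(left, right)
          for name, (left, right) in splits.items()}
best = min(scores, key=scores.get)  # lowest weighted impurity wins

print(f"{best} {scores[best]:.3f}")                    # Smoked 0.403
print(f"no split {entropy_from_counts(11, 10):.3f}")  # no split 0.692
```

Because the criterion is a weighted impurity, the split to choose is the one with the smallest s, and a split is only worthwhile when s falls below the impurity of the unsplit node.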

14 Tree Pruning
  - Fisher Exact Test
  - Misclassification cost and rate
  - Cost-complexity and complexity parameter
  - Optimal sub-trees
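The slide only names these pruning concepts. As an illustration, the sketch below applies the standard CART cost-complexity measure, R_alpha(T) = R(T) + alpha * (number of terminal nodes), to the nested subtrees of the colon-cancer tree on slide 6. The following are assumptions made for the example rather than details from the presentation: unit misclassification costs, the resubstitution error as R(T), the subtree labels, and the alpha values.

```python
# Terminal-node (CT, NT) counts for nested subtrees of the tree on slide 6.
# The subtree names are hypothetical labels for this example.
subtrees = {
    "root only": [(40, 22)],
    "one split": [(0, 14), (40, 8)],
    "two splits": [(0, 14), (10, 8), (30, 0)],
    "full tree": [(0, 14), (30, 0), (10, 1), (0, 7)],
}
N = 62  # 40 cancer + 22 normal tissues


def misclassification_rate(leaves) -> float:
    """Each terminal node predicts its majority class; the minority are errors."""
    return sum(min(ct, nt) for ct, nt in leaves) / N


def cost_complexity(leaves, alpha: float) -> float:
    """R_alpha(T) = R(T) + alpha * number of terminal nodes."""
    return misclassification_rate(leaves) + alpha * len(leaves)


def best_subtree(alpha: float) -> str:
    """Subtree with the smallest cost-complexity for a given alpha."""
    return min(subtrees, key=lambda name: cost_complexity(subtrees[name], alpha))


print(best_subtree(0.01))  # full tree
print(best_subtree(0.1))   # one split
print(best_subtree(0.3))   # root only
```

As the complexity parameter alpha grows, each extra terminal node costs more, so the optimal subtree shrinks from the full tree toward the root. In practice alpha is usually chosen by cross-validation or a test sample rather than fixed by hand.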

15 Genetic Data
  [Table: pedigree records with a Genotype column of marker allele pairs; the individual values are not recoverable from this transcript.]

16 Key Idea in Tree-based Analysis
  If a marker locus is close to a disease locus, then individuals from a given family who are phenotypically similar should be genotypically more similar at that marker than expected by chance.
  [Figure: pedigree with individuals 1 to 4, highlighting a sib pair]

17 Tree-based Linkage Analysis
  - Unit of observation: sib pair
  - The response variable y takes three possible values, arbitrarily coded as 0, 1, and 2, depending on whether none, one, or both sibs are affected.
  - Covariate: the expected IBD (identity by descent) sharing at each marker locus
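To make the unit of observation concrete, the sketch below turns one sibship into sib-pair records and codes the response y as the number of affected sibs in each pair. It assumes that every possible pair within a sibship is used, which the slide does not state. The function name and the example sibship are hypothetical.

```python
from itertools import combinations


def sib_pair_responses(affected_by_sib):
    """Map each sib pair to y = number of affected sibs in the pair (0, 1 or 2)."""
    return {
        (a, b): int(affected_by_sib[a]) + int(affected_by_sib[b])
        for a, b in combinations(sorted(affected_by_sib), 2)
    }


# Hypothetical sibship: sib ID -> affected status.
sibship = {3: True, 4: False, 5: True, 6: False}

pairs = sib_pair_responses(sibship)
print(pairs)
# {(3, 4): 1, (3, 5): 2, (3, 6): 1, (4, 5): 1, (4, 6): 0, (5, 6): 1}
```

A sibship of k sibs contributes k(k - 1)/2 pairs under this assumption, so four sibs yield six observations.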

18 Identity by Descent (IBD)
  Genes (or alleles) inherited by relatives from the same ancestor. Two sibs can share at most one IBD gene from the father and at most one from the mother, so they share 0, 1, or 2 genes IBD.

  Example: father's genotype 1/2, mother's genotype 3/4

  Sib 1   Sib 2   IBD
  1/3     2/4     0
  1/3     2/3     1
  1/3     1/3     2
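The slide's example can be reproduced with a short sketch. It assumes the fully informative case shown on the slide: the parents carry four distinct alleles (father 1/2, mother 3/4), so each sib's genotype can be written as (paternal allele, maternal allele) and matching alleles are identical by descent. The function name is illustrative.

```python
from collections import Counter
from itertools import product


def ibd_count(sib1, sib2) -> int:
    """Number of alleles shared IBD by two sibs.

    Each sib is a (paternal allele, maternal allele) pair. Assumes the parents
    carry four distinct alleles, so equal alleles were inherited identically.
    """
    return int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])


print(ibd_count((1, 3), (2, 4)))  # 0
print(ibd_count((1, 3), (2, 3)))  # 1
print(ibd_count((1, 3), (1, 3)))  # 2

# Enumerate the 16 equally likely sib-pair outcomes for these parents.
father, mother = (1, 2), (3, 4)
offspring = list(product(father, mother))  # 4 possible genotypes per sib
dist = Counter(ibd_count(a, b) for a, b in product(offspring, repeat=2))
expected = sum(k * v for k, v in dist.items()) / sum(dist.values())

print(dict(sorted(dist.items())))  # {0: 4, 1: 8, 2: 4}
print(expected)                    # 1.0
```

With no linkage information, a sib pair therefore shares 0, 1, or 2 alleles IBD with probabilities 1/4, 1/2, and 1/4, for an expected sharing of 1. Values above that baseline at a marker are what a linkage analysis looks for among phenotypically similar pairs.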

19 The Gilles de la Tourette Syndrome (GTS) Phenotype data (Joint work with Zhang et al., 2002)
  - Genome scan of the hoarding phenotype collected by the Tourette Syndrome Association International Consortium for Genetics (TSAICG)
  - We used data from 223 individuals in 51 families with 77 sib pairs.
  - Hoarding is a component of obsessive-compulsive disorder.
  - Genotypes are allele sizes from 370 markers on 22 chromosomes.

20 The Gilles de la Tourette Syndrome Phenotype data: Linkage Tree
  [Figure: linkage tree for the 77 sib pairs (root node counts 23, 28, 26). Splits on IBD sharing, with split p-values: D5SMfd154 > 1.9 (P = 0.0011), D4S1652 > 1.16 (P = 0.0078), D5S408 > 0 (P = 0.0034). Overall p-value = 2.63e-6. The remaining node counts are not recoverable from this transcript.]

21 Tree-based Association Study
  - The response variable is affection status.
  - The covariates include gender, the parental phenotypes, race and the variables constructed using the marker information.
  - If a marker has n distinct alleles, then n covariates, each taking a value of 0, 1 or 2 (the number of copies of that allele), are constructed for this marker. For example, if n = 7, the 7 covariates take values (0,0,0,1,0,1,0) for a genotype of 4/6 and (0,0,0,0,0,0,2) for a genotype of 7/7.
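The allele-count coding described on this slide can be sketched as follows. The function name is hypothetical, and alleles are assumed to be labeled 1 through n, as in the slide's example.

```python
def allele_count_covariates(genotype, n_alleles: int):
    """Return n covariates giving the number of copies (0, 1 or 2) of each allele.

    genotype is a pair of allele labels in 1..n_alleles, e.g. (4, 6) for 4/6.
    """
    counts = [0] * n_alleles
    for allele in genotype:
        if not 1 <= allele <= n_alleles:
            raise ValueError(f"allele {allele} is outside 1..{n_alleles}")
        counts[allele - 1] += 1
    return counts


print(allele_count_covariates((4, 6), 7))  # [0, 0, 0, 1, 0, 1, 0]
print(allele_count_covariates((7, 7), 7))  # [0, 0, 0, 0, 0, 0, 2]
```

Each covariate then supports splits of the form "copies of allele k > 0", which is the kind of question a tree can ask at a node.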

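22 The Gilles de la Tourette Syndrome Phenotype data: Association Tree
  [Figure: association tree with splits on the number of copies of alleles D4S403-5, D4S2632-5, D4S2431-10, and D5S816-7. Split labels with split p-values, in transcript order: > 0 (P = 2e-4); > 1, NA (P = 0.016); > 0 (P = 0.0023); and, for D5S816-7, > 0, NA (P = 0.0017). Overall p-value = 1.03e-7. Node counts are not recoverable from this transcript.]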

23 Why Recursive Partitioning?
  - Attempts to discover possibly very complex structure in huge databases
    - genotypes for hundreds of markers
    - expression profiles for thousands of genes
    - all possible predictors (continuous, categorical)
  - No need to transform variables
  - Impervious to outliers
  - Easy to use
  - Easy to interpret

24 Recursive partitioning based tools for data analysis
  - Classification and regression
    - RTREE (http://peace.med.yale.edu)
    - CART
  - Longitudinal data analysis
    - MASAL (http://peace.med.yale.edu)
  - Survival Analysis
    - STREE (http://peace.med.yale.edu)
  - Multivariate Adaptive Regression Splines
    - MASAL (http://peace.med.yale.edu)
    - MARS

25 References: Books
  - L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, 1984, Classification and Regression Trees, Wadsworth, California.
  - H. Zhang and B. Singer, 1999, Recursive Partitioning in the Health Sciences, Springer, New York.
  - T. Hastie, R. Tibshirani and J. Friedman, 2001, The Elements of Statistical Learning, Springer, New York.

26 References: Papers
  - Zhang, Tsai, Yu, and Bonney, 2001, Genetic Epidemiology, 21, Supplement 1, S317-S322.
  - Zhang, Leckman, Pauls, Tsai, Kidd, Campos and The TSAICG, 2002, American Journal of Human Genetics, 70, 896-904.
  - Zhang, Yu, Singer and Xiong, 2001, Proc Natl Acad Sci U S A, 98, 6730-6735.
  - Tsai, Acharyya, Yu and Zhang, 2002, in Recent Research Developments in Human Genetics.

27 Recent Developments
  - Instability of trees (high variance)
    - Bagging: averages many trees to reduce variance (Breiman, 1996)
    - Boosting (Breiman, 1998; Mason et al., 2000; Friedman et al., 1998)
    - Random forest (Breiman, 1999)
  - Lack of smoothness
    - MARS procedure (Zhang & Singer, 1999; Hastie et al., 2001)
  - Difficulty in capturing additive structure
    - MARS procedure
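Bagging can be illustrated with a small pure-Python sketch. It is not the procedure used in any of the cited tools: the "trees" here are one-split stumps on a single predictor, the averaging over trees is done by majority vote (the usual choice for class labels), and the toy data, function names, and seed are all hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-split tree: the rule 'x > c' with the fewest training errors."""
    best = None
    for c in sorted(set(xs)):
        for high_label in (0, 1):  # class predicted when x > c
            errors = sum(
                (high_label if x > c else 1 - high_label) != y
                for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, c, high_label)
    _, c, high_label = best
    return lambda x: high_label if x > c else 1 - high_label


def bagged_predictor(xs, ys, n_trees=25, seed=0):
    """Grow stumps on bootstrap samples and combine them by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n cases with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = Counter(stump(x) for stump in stumps)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical toy data: low values belong to class 0, high values to class 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

predict = bagged_predictor(xs, ys)
print(predict(0.5), predict(9))  # 0 1
```

Each bootstrap sample moves the chosen cut point slightly, which is the instability the slide refers to. Voting over many such stumps smooths out that variation.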

28 Competitive Tree for Colon Data

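30 Competitive Tree
  CT = colon cancer tissues, NT = normal tissues
  Node 1 (CT: 40, NT: 22), three-way split
    Node 2 (CT: 34, NT: 3)
      Node 5 (CT: 0, NT: 3)
      Node 6 (CT: 34, NT: 0)
    Node 3 (CT: 6, NT: 13)
      Node 7 (CT: 0, NT: 13)
      Node 8 (CT: 6, NT: 0)
    Node 4 (CT: 0, NT: 6)
  Splitting genes shown in the figure: R87126, X15183, T62947
  Split labels shown in the figure: (372, 1052], > 1052, > 457, > 28
  [The pairing of genes and split labels with nodes is not recoverable from this transcript.]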

31 3-D Representation of Tree

