Published by Tobias Davis. Modified over 9 years ago.
1
Recursive Partitioning and Its Applications in Genetic Studies
Chin-Pei Tsai, Assistant Professor, Department of Applied Mathematics, Providence University
2
OUTLINE
- Genetic data
- Example
- Basic ideas of recursive partitioning
- Applications in genetic studies
  - linkage analysis
  - association analysis
- Recursive-partitioning-based tools for data analysis
3
Tree-based Analyses in Genetic Studies: Genetic Data
[Figure: pedigree of a nuclear family (father, mother, and children), with affected members marked, shown beside the pedigree data table.]
4
Genetic Data
[Table: pedigree records for the family members, with a Genotype column listing each member's marker alleles.]
5
Application of Recursive Partitioning in Microarray Data (Zhang et al., PNAS, 2001)
Data: gene expression profiles of 2,000 genes in 22 normal and 40 colon cancer tissues.
Purpose: to predict the class (normal or cancer) of a new tissue sample.
6
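Automatically Selected Tree (by RTREE); CT = cancer tissues, NT = normal tissues
Node 1 (CT: 40, NT: 22), split on M26383 (> 60)
  Node 2 (CT: 0, NT: 14)
  Node 3 (CT: 40, NT: 8), split on R15447 (> 290)
    Node 5 (CT: 30, NT: 0)
    Node 4 (CT: 10, NT: 8), split on M28214 (> 770)
      Node 7 (CT: 0, NT: 7)
      Node 6 (CT: 10, NT: 1)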
7
[Figure: panels labeled Node 2 and Node 3.]
8
[Figure: panels labeled Node 5, Node 7, and Node 6.]
9
3-D Representation of Tree
10
Concluding Remarks
- The three genes, IL-8 (M26383), CANX (R15447), and RAB3B (M28214), were chosen from 2,000 genes.
- Using only these three genes achieves high classification accuracy.
- All three genes are related to tumors.
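To make "high classification accuracy" concrete, the sketch below tallies the terminal-node counts of the automatically selected tree from slide 6. It assumes each terminal node is labeled by its majority class, so the figure is a training-set (resubstitution) rate only, not a cross-validated estimate. The function name is illustrative.

```python
# Terminal-node counts (cancer tissues CT, normal tissues NT) read off the
# automatically selected tree on slide 6.
leaves = {
    "Node 2": (0, 14),
    "Node 5": (30, 0),
    "Node 6": (10, 1),
    "Node 7": (0, 7),
}


def resubstitution_accuracy(leaves):
    """Count training samples that match the majority class of their leaf.

    Assumption: every leaf predicts its majority class.
    """
    correct = sum(max(ct, nt) for ct, nt in leaves.values())
    total = sum(ct + nt for ct, nt in leaves.values())
    return correct, total, correct / total


correct, total, accuracy = resubstitution_accuracy(leaves)
print(f"{correct}/{total} correct, accuracy = {accuracy:.3f}")
# prints: 61/62 correct, accuracy = 0.984
```

Only the single normal tissue in Node 6 is misclassified on the training data under this labeling.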
11
Basic Ideas in Classification Trees: Tree Growing
Impurity function: entropy
For a binary outcome y = 0, 1, let p be the proportion of observations with y = 1.
Entropy: i(p) = -p log(p) - (1-p) log(1-p), where 0 log(0) = 0
[Figure: entropy plotted against p on the interval from 0 to 1, with p = 1/2 marked.]
Splitting criterion: goodness of split = weighted sum of the node impurities
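A minimal sketch of the entropy impurity defined above. It assumes the natural logarithm, which matches the values quoted on the following slides (for example .6931 at p = 1/2). The function name is illustrative.

```python
import math


def entropy(p):
    """Entropy impurity -p log(p) - (1-p) log(1-p), with 0 log(0) = 0.

    Assumption: natural logarithm.
    """
    # Terms with zero probability are skipped, which enforces 0 log(0) = 0.
    return sum(-q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  entropy = {entropy(p):.4f}")
```

The impurity is zero for a pure node (p = 0 or p = 1) and largest at p = 1/2.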
12
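Node Impurity
Data: 11 cancer subjects and 10 normal subjects. Counts are given as (cancer, normal).

Split by   Left node   Right node   Entropy left   Entropy right
Gender     10, 9       1, 1         .6918          .6931
Race       9, 7        2, 3         .6853          .6365
Smoked     9, 1        2, 9         .3251          .4741
Age        7, 7        4, 3         .6931          .6829

[Figure: the root node (11, 10) split by Gender (Male) into a left node (10, 9) with entropy .6918 and a right node (1, 1) with entropy .6931.]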
13
Goodness of Split
Goodness of split: s = p(L)i(L) + p(R)i(R), where p(t) is the weight of node t and i(t) is its entropy.

Split by   p(L)    p(R)    i(L)    i(R)    s
Gender     19/21   2/21    .6918   .6931   .6919
Race       16/21   5/21    .6853   .6365   .6737
Smoked     10/21   11/21   .3251   .4741   .4031
Age        14/21   7/21    .6931   .6829   .6897

No split: .6920
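The table can be reproduced with a few lines of arithmetic. This sketch takes the weights and entropies exactly as printed on the slide and applies s = p(L)i(L) + p(R)i(R). Reading s as a weighted impurity, the smallest value marks the most useful split. The variable and function names are illustrative.

```python
from fractions import Fraction

# (p(L), p(R), i(L), i(R)) copied from the slide.
splits = {
    "Gender": (Fraction(19, 21), Fraction(2, 21), 0.6918, 0.6931),
    "Race": (Fraction(16, 21), Fraction(5, 21), 0.6853, 0.6365),
    "Smoked": (Fraction(10, 21), Fraction(11, 21), 0.3251, 0.4741),
    "Age": (Fraction(14, 21), Fraction(7, 21), 0.6931, 0.6829),
}


def goodness_of_split(p_left, p_right, i_left, i_right):
    """Weighted sum of the two daughter-node impurities."""
    return float(p_left) * i_left + float(p_right) * i_right


scores = {name: goodness_of_split(*values) for name, values in splits.items()}
for name, s in scores.items():
    print(f"{name:<7} s = {s:.4f}")

# Assumption: a lower weighted impurity is better.
best = min(scores, key=scores.get)
print("Lowest weighted impurity:", best)
```

Under that reading, Smoked is the only candidate that reduces the impurity much below the no-split value of .6920.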
14
Tree Pruning
- Fisher exact test
- Misclassification cost and rate
- Cost-complexity and complexity parameter
- Optimal sub-trees
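The slide does not say how the Fisher exact test enters the pruning step, so the following is only a sketch of the test itself. It is applied to the 2x2 table of the Smoked split from slide 12 (left node: 9 cancer, 1 normal; right node: 2 cancer, 9 normal). The two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def ways(k):
        # Number of tables with k in the top-left cell and all margins fixed.
        return comb(col1, k) * comb(n - col1, row1 - k)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    observed = ways(a)
    # Integer counts avoid floating-point ties when comparing probabilities.
    extreme = sum(
        ways(k) for k in range(lowest, highest + 1) if ways(k) <= observed
    )
    return extreme / comb(n, row1)


p_value = fisher_exact_two_sided(9, 1, 2, 9)
print(f"Smoked split: two-sided p = {p_value:.4f}")
```

A small p-value would suggest keeping the split; how the threshold is chosen is not specified in the slides.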
15
Genetic Data
[Table: pedigree records for the family members, with a Genotype column listing each member's marker alleles.]
16
Key Idea in Tree-based Analysis
If a marker locus is close to a disease locus, then individuals from the same family who are phenotypically similar are expected to be more similar genotypically at that marker than expected by chance.
[Figure: a pedigree with a sib pair highlighted.]
17
Tree-based Linkage Analysis
Unit of observation: sib pair
Response variable: y takes three possible values depending on whether none, one, or both sibs are affected, arbitrarily coded as 0, 1, and 2.
Covariate: the expected IBD (identity by descent) sharing at each marker locus
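A minimal sketch of how one sib-pair record might be built. The response coding follows the slide. The covariate is assumed here to be the expected number of alleles shared IBD, computed from the probabilities of sharing 0, 1, or 2 alleles at a marker. How those probabilities are estimated from the genotypes is not covered in the slides, so they are taken as given, and both function names are illustrative.

```python
def sib_pair_response(affected_1, affected_2):
    """Code y as the number of affected sibs in the pair: 0, 1, or 2."""
    return int(bool(affected_1)) + int(bool(affected_2))


def expected_ibd(p0, p1, p2):
    """Expected number of alleles shared IBD, given P(IBD = 0, 1, 2).

    Assumption: the three probabilities were estimated elsewhere.
    """
    if abs(p0 + p1 + p2 - 1.0) > 1e-9:
        raise ValueError("IBD probabilities must sum to 1")
    return 1.0 * p1 + 2.0 * p2


# A pair with one affected sib and uninformative marker data.
print(sib_pair_response(True, False))
print(expected_ibd(0.25, 0.5, 0.25))
```

The covariate ranges from 0 to 2, and a sib pair with no marker information sits at the null value of 1.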
18
Identity by Descent (IBD)
Genes (or alleles) are identical by descent if relatives inherited them from the same ancestor.
Two sibs can share at most one gene IBD from the father and at most one from the mother, so they share 0, 1, or 2 genes IBD.

Example: father's genotype 1/2, mother's genotype 3/4
Sib 1: 1/3, Sib 2: 2/4   IBD = 0
Sib 1: 1/3, Sib 2: 2/3   IBD = 1
Sib 1: 1/3, Sib 2: 1/3   IBD = 2
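The three cases in the example can be checked with a short sketch. It assumes the simplest setting, as in the slide: the parents carry four distinct alleles and each sib's genotype is written as (paternal allele, maternal allele), so matching labels imply identity by descent. Real marker data are often less informative and need probabilistic IBD estimates. The function name is illustrative.

```python
def ibd_count(sib1, sib2):
    """Count the alleles two sibs share identical by descent.

    Each sib is a (paternal_allele, maternal_allele) pair.
    Assumption: phase is known and the four parental alleles are distinct.
    """
    shared_paternal = int(sib1[0] == sib2[0])
    shared_maternal = int(sib1[1] == sib2[1])
    return shared_paternal + shared_maternal


# Father 1/2, mother 3/4, as in the slide.
print(ibd_count((1, 3), (2, 4)))  # 0
print(ibd_count((1, 3), (2, 3)))  # 1
print(ibd_count((1, 3), (1, 3)))  # 2
```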
19
The Gilles de la Tourette Syndrome (GTS) Phenotype Data (joint work with Zhang et al., 2002)
- Genome scan of the hoarding phenotype, collected by the Tourette Syndrome Association International Consortium for Genetics (TSAICG).
- Hoarding is a component of obsessive-compulsive disorder.
- We used data from 223 individuals in 51 families with 77 sib pairs.
- Genotypes are allele sizes from 370 markers on 22 chromosomes.
20
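Linkage Tree: The Gilles de la Tourette Syndrome Phenotype Data
Splits are on IBD sharing at a marker and are shown with their split p-values. Overall p-value = 2.63e-6.
Each node lists three counts.
Root (23, 28, 26), split on D5SMfd154 (> 1.9, P = 0.0011)
  (7, 0, 8)
  (16, 28, 18), split on D5S408 (> 0, P = 0.0034)
    (0, 8, 0)
    (16, 20, 18), split on D4S1652 (> 1.16, P = 0.0078)
      (10, 3, 4)
      (6, 17, 14)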
21
Tree-based Association Study
The response variable is affection status.
The covariates include gender, the parental phenotypes, race, and variables constructed from the marker information.
If a marker has n distinct alleles, then n covariates, each taking a value of 0, 1, or 2, are constructed for this marker. For example, if n = 7, the 7 covariates take the values (0,0,0,1,0,1,0) for a genotype of 4/6 and (0,0,0,0,0,0,2) for a genotype of 7/7.
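The marker coding described above can be sketched as follows. The function name is illustrative, and the sketch assumes alleles have already been relabeled 1 through n. The real data record allele sizes, so a relabeling step would come first.

```python
def allele_count_covariates(genotype, n_alleles):
    """Return n covariates giving the number of copies (0, 1, or 2) of each allele.

    Assumption: alleles are labeled 1..n_alleles.
    """
    counts = [0] * n_alleles
    for allele in genotype:
        if not 1 <= allele <= n_alleles:
            raise ValueError(f"allele {allele} is outside 1..{n_alleles}")
        counts[allele - 1] += 1
    return counts


print(allele_count_covariates((4, 6), 7))  # [0, 0, 0, 1, 0, 1, 0]
print(allele_count_covariates((7, 7), 7))  # [0, 0, 0, 0, 0, 0, 2]
```

Each covariate can then be offered to the tree as an ordinary ordinal predictor, which matches splits such as "copies of allele > 0".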
22
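Association Tree: The Gilles de la Tourette Syndrome Phenotype Data
[Figure: association tree with splits on the number of copies of an allele, shown with split p-values. Overall p-value = 1.03e-7.]
Splits shown: D4S403-5 (> 0, P = 2e-4); D4S2632-5 (> 1 or NA, P = 0.016); D4S2431-10 (> 0, P = 0.0023); D5S816-7 (> 0 or NA, P = 0.0017).
Node counts shown: 85/135, 39/29, 46/106, 46/88, 0/18, 46/77, 0/11, 19/54, 27/23.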
23
Why Recursive Partitioning?
- Attempts to discover possibly very complex structure in huge databases
  - genotypes for hundreds of markers
  - expression profiles for thousands of genes
  - all possible predictors (continuous, categorical)
- No need to transform the predictors
- Impervious to outliers
- Easy to use
- Easy to interpret
24
Recursive-partitioning-based tools for data analysis
- Classification and regression: RTREE (http://peace.med.yale.edu), CART
- Longitudinal data analysis: MASAL (http://peace.med.yale.edu)
- Survival analysis: STREE (http://peace.med.yale.edu)
- Multivariate adaptive regression splines: MASAL (http://peace.med.yale.edu), MARS
25
References: Books
- L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, 1984, Classification and Regression Trees, Wadsworth, California.
- H. Zhang and B. Singer, 1999, Recursive Partitioning in the Health Sciences, Springer, New York.
- T. Hastie, R. Tibshirani and J. Friedman, 2001, The Elements of Statistical Learning, Springer, New York.
26
References: Papers
- Zhang, Tsai, Yu and Bonney, 2001, Genetic Epidemiology, 21, Supplement 1, S317-S322.
- Zhang, Leckman, Pauls, Tsai, Kidd, Campos and The TSAICG, 2002, American Journal of Human Genetics, 70, 896-904.
- Zhang, Yu, Singer and Xiong, 2001, Proc Natl Acad Sci U S A, 98, 6730-6735.
- Tsai, Acharyya, Yu and Zhang, 2002, in Recent Research Developments in Human Genetics.
27
Recent Developments
- Instability of trees (high variance)
  - Bagging: averages many trees to reduce variance (Breiman, 1996)
  - Boosting (Breiman, 1998; Mason et al., 2000; Friedman et al., 1998)
  - Random forest (Breiman, 1999)
- Lack of smoothness
  - MARS procedure (Zhang & Singer, 1999; Hastie et al., 2001)
- Difficulty in capturing additive structure
  - MARS procedure
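As a rough illustration of the bagging idea, the sketch below fits one-split trees (stumps) to bootstrap resamples of a toy one-dimensional data set and combines them by majority vote, the classification counterpart of averaging. The data, function names, and parameters are all made up for illustration. This is not the procedure used by RTREE or by any package named above.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-split tree on one feature by minimizing training errors."""
    majority = Counter(ys).most_common(1)[0][0]
    # Baseline: no split, predict the majority class everywhere.
    best = (sum(y != majority for y in ys), None, majority, majority)
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if errors < best[0]:
            best = (errors, cut, left_label, right_label)
    return best[1:]


def predict_stump(stump, x):
    cut, left_label, right_label = stump
    if cut is None or x <= cut:
        return left_label
    return right_label


def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Fit one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Majority vote over the bootstrap stumps."""
    votes = Counter(predict_stump(s, x) for s in stumps)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): class 0 for small x, class 1 for large x.
xs = list(range(10))
ys = [0] * 5 + [1] * 5
stumps = bagged_stumps(xs, ys)
print(predict_bagged(stumps, 0), predict_bagged(stumps, 9))
```

Each resample can move the cut point, and voting across resamples smooths out that instability.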
28
Competitive Tree for Colon Data
30
Competitive Tree
Node 1 (CT: 40, NT: 22)
  Node 2 (CT: 34, NT: 3)
    Node 5 (CT: 0, NT: 3)
    Node 6 (CT: 34, NT: 0)
  Node 3 (CT: 6, NT: 13)
    Node 7 (CT: 0, NT: 13)
    Node 8 (CT: 6, NT: 0)
  Node 4 (CT: 0, NT: 6)
Split variables shown: R87126, X15183, T62947. Cut points shown: (372, 1052], > 1052, > 457, > 28.
31
3-D Representation of Tree