1 Cladistic Clustering of Haplotypes in Association Analysis Jung-Ying Tzeng Aug 27, 2004 Department of Statistics & Bioinformatics Research Center North Carolina State University
2 Simple Disorder vs. Complex Disorder Peltonen and McKusick (2001). Science
3 Complex Disorders Liability genes = genes containing variants increasing disease liability Goal: look for such genes Rely more on epidemiological evidence Association analysis Case-control studies Detect liability genes by searching for association between disease status and genetic variants
4 Genetic Markers Instead of studying the whole DNA sequences, we look at a subset of them---genetic markers SNP: Single Nucleotide Polymorphism Pro: dense; bp Con: binary variants Resolved by considering adjacent SNPs jointly
5 Haplotype-based Association Analysis Haplotype = marker sequence (e.g. TCTC, CACA) Haplotype-based association analysis [Table: rows Hap 1, Hap 2, Hap 3, ..., Hap k; columns Case, Control]
6 Haplotype-based Association Analysis Problem: findings are not replicable Under-powered (Lohmueller et al. 2003; Neale and Sham 2004) Solution: 1. Use large samples (Lohmueller et al. 2003) 2. Reduce the dimension of the parameter space
7 Dimensionality Haplotype distribution within a block Daly et al. (2001) Nature Genetics Method I: Truncating : tag SNPs
8 Evolutionary tree of haplotypes Minimize the haplotype distance within clusters Method II: Clustering (Molitor et al. 2003; Durrant et al. 2004)
9 Method II: Clustering
10 Method II: Clustering
11 Observed Hap ={ 000, 001, 010, 100,110, 101, 011, 111 } Method III: Cladistic Grouping (Templeton 1995) (Seltman et al. 2003) Cladogram
12 Desired Features Include all samples Incorporate both haplotype distance and age (high frequency: ancient; low frequency: young; Crandall & Templeton 1995) Allow for uncertainty in inferring the underlying evolutionary relationship
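13 Proposed Approach: Cladistic Clustering
Possible Hap = { 000, 001, 010, 100, 110, 101, 011, 111 }
Level (2): { 110 }; level (1): { 000, 010, 100, 111 }; level (0): { 001, 011, 101 }
Allocation probabilities on the diagram edges: $p$, $1-p$ and $q_1$, $q_2$, $1-q_1-q_2$
One-step regrouping: $\pi_{(i)}^{*t} = \pi_{(i)}^{t} + \pi_{(i+1)}^{*t} B_{(i+1)}$, with $\pi_{(2)}^{*} = \pi_{(2)}$ at the top level
Overall: $\pi^{*t} = \pi^{t} B = \begin{bmatrix} \pi_{(0)}^{t} & \pi_{(1)}^{t} & \pi_{(2)}^{t} \end{bmatrix} \begin{bmatrix} I \\ B_{(1)} \\ B_{(2)} B_{(1)} \end{bmatrix}$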
14 Issues 1. Determine major nodes (0) 2. Construct the conditional allocating matrix $B_{(i)}$
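15 Conditional Allocating Matrix $B_{(i)}$
Level (2): { 110 }; level (1): { 000, 010, 100, 111 }; level (0): { 001, 011, 101 }
$B_{(2)} = C = (c \;\; c \;\; c \;\; c)$: one entry $c$ per level-(1) haplotype, each in $[0, 1]$ and reflecting the likelihood of a one-step movement
$\pi_{(1)}^{*t} = \pi_{(2)}^{t} B_{(2)} + \pi_{(1)}^{t}$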
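One way to make the one-step idea concrete is sketched below in Python. This is a sketch under stated assumptions, not the construction used in the talk: the frequencies are made up, the helper names `hamming` and `allocation_matrix` are hypothetical, and the rule that each row spreads its mass over one-step neighbours in proportion to their frequencies is an assumed stand-in for the unspecified likelihood of a one-step movement.

```python
def hamming(a, b):
    """Number of marker positions at which two haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def allocation_matrix(upper, lower, freq):
    """Rows: haplotypes at level i; columns: haplotypes at level i-1.

    ASSUMPTION: a row puts weight only on one-step (Hamming distance 1)
    neighbours, in proportion to the neighbours' frequencies.
    """
    rows = []
    for h in upper:
        weights = [freq[g] if hamming(h, g) == 1 else 0.0 for g in lower]
        total = sum(weights)
        rows.append([w / total if total > 0 else 0.0 for w in weights])
    return rows


# Hypothetical frequencies for the eight 3-SNP haplotypes (they sum to 1).
freq = {"001": 0.30, "011": 0.25, "101": 0.20,
        "000": 0.08, "010": 0.06, "100": 0.05, "111": 0.04,
        "110": 0.02}
level0 = ["001", "011", "101"]          # major nodes
level1 = ["000", "010", "100", "111"]
level2 = ["110"]

B1 = allocation_matrix(level1, level0, freq)  # 4 x 3
B2 = allocation_matrix(level2, level1, freq)  # 1 x 4
```

With these level sets, 000, 010 and 100 each have a single one-step neighbour among the major nodes, 111 has two (011 and 101), and 110 has three one-step neighbours at level (1) (010, 100 and 111), which matches the number of edge labels in the diagram.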
16 Conditional Allocating Matrix $B_{(i)}$
[Figure: matrix $B_{(1)}$, with haplotypes 111 and 010 marked]
$\pi^{*t} = \pi_{(0)}^{t} + \pi_{(1)}^{t} B_{(1)} + \pi_{(2)}^{t} B_{(2)} B_{(1)}$
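The regrouping formula can be checked numerically with a small Python sketch. Everything numeric here is hypothetical: the frequency vectors, the values of `p`, `q1`, `q2`, and the assignment of `q1`, `q2`, `1-q1-q2` to haplotypes 010, 100 and 111 are assumptions made only for illustration; `vec_mat` is a helper introduced for this example.

```python
def vec_mat(v, M):
    """Row vector times matrix: returns v^t M as a list."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]


# Hypothetical allocation probabilities (not taken from the slides).
p, q1, q2 = 0.6, 0.4, 0.35

# Hypothetical haplotype frequencies by level (they sum to 1).
pi0 = [0.30, 0.25, 0.20]        # level 0: 001, 011, 101
pi1 = [0.08, 0.06, 0.05, 0.04]  # level 1: 000, 010, 100, 111
pi2 = [0.02]                    # level 2: 110

# B1: rows = level 1 haplotypes, columns = level 0 haplotypes.
B1 = [[1, 0, 0],
      [0, 1, 0],
      [0, 0, 1],
      [0, p, 1 - p]]
# B2: row = 110, columns = level 1 haplotypes (assumed edge assignment).
B2 = [[0, q1, q2, 1 - q1 - q2]]

# One-step regrouping applied twice, from the top level down.
pi1_star = [a + b for a, b in zip(pi1, vec_mat(pi2, B2))]
pi_star = [a + b for a, b in zip(pi0, vec_mat(pi1_star, B1))]

print(pi_star)
```

Because every row of `B1` and `B2` sums to 1, the regrouped vector `pi_star` is still a probability distribution, now over the three major nodes only.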
17 Determine major nodes (0) Information criteria: net information (Shannon’s information content)
18 Net Information and (0)
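For reference, the standard Shannon quantities that an information criterion of this kind builds on are given below. The slides do not spell out the net information criterion itself, so it is not reproduced here; the notation $\pi_h$ for the frequency of haplotype $h$ is an assumption.

```latex
h(x) = -\log_2 p(x), \qquad
H(\pi) = -\sum_{h} \pi_h \log_2 \pi_h
```

Here $h(x)$ is the information content of a single outcome with probability $p(x)$, and $H(\pi)$ is its expectation over the haplotype distribution.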
19 Association Analysis Based on $\pi^{*}$ Coalescent simulation (Hudson 2002): Prevalence = 0.01 Relative risk = 2 Frequencies of liability allele = (0.1, 0.3, 0.5) Location of liability allele = (hot spot, blocky, very blocky) Draw 200 cases and 200 controls Test of homogeneity based on $\pi^{*}_{cs}$ and $\pi^{*}_{cn}$
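A homogeneity test on clustered haplotype counts can be sketched with a plain Pearson chi-square statistic. This is an illustrative stand-in, not necessarily the statistic used in the talk: the counts are hypothetical (200 individuals give 400 haplotypes per group), the function name `pearson_chi2` is made up, and the closed-form p-value used here holds only for 2 degrees of freedom (three clusters).

```python
import math


def pearson_chi2(cases, controls):
    """Pearson chi-square statistic for a 2 x k table of counts.

    Assumes every column has a positive total.
    """
    n_cs, n_cn = sum(cases), sum(controls)
    n = n_cs + n_cn
    stat = 0.0
    for a, b in zip(cases, controls):
        col = a + b
        e_cs = n_cs * col / n   # expected case count under homogeneity
        e_cn = n_cn * col / n   # expected control count under homogeneity
        stat += (a - e_cs) ** 2 / e_cs + (b - e_cn) ** 2 / e_cn
    return stat


# Hypothetical clustered haplotype counts over three major nodes.
cases = [170, 130, 100]
controls = [150, 140, 110]

stat = pearson_chi2(cases, controls)
df = len(cases) - 1
# For df = 2 the chi-square survival function is exactly exp(-x / 2).
p_value = math.exp(-stat / 2) if df == 2 else None

print(stat, df, p_value)
```

Reducing eight haplotypes to three clusters lowers the degrees of freedom from 7 to 2, which is the mechanism by which regrouping is meant to improve test efficiency.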
20 Power and Type I error Gene Pelc Gene IL01RB
21 Summary Provide a mechanism of cladistic clustering via $\pi^{*t} = \pi^{t} B$; Combine the ideas of truncating and clustering; Based on evolutionary relationships without reconstructing the cladogram; Incorporate haplotype frequencies and distance in cluster assignment; One-step conditional regrouping can accommodate multiple-step regrouping: self-repeating, algebraically multiplicative; Reserve (0) based on information criteria; $\pi^{*}$ increases test efficiency; Increased power even for large samples and haplotypes in block regions
22 End of Slides
23 Approach Two stages: Stage I: (Where) Identify the susceptibility regions across the genome (multiple testing problem) Approaches based on haplotype similarity Stage II: (Which) Determine and pinpoint the specific liability variants Study individual effects of groups of haplotypes
24 Strategies of Reducing Degrees of Freedom I. Haplotype Similarity (Van Der Meulen and te Meerman 1997; Bourgain et al.; Tzeng et al. 2003a,b) Search for extra haplotype sharing among cases Pro: 1 degree of freedom Con: does not study individual haplotype effects Usage: good for genome screening
25 Strategies of Reducing Degrees of Freedom II. Haplotype Tagging (Johnson et al. 2001) [Figure: table of haplotypes with frequencies (%), reduced to their tag SNPs] Pro: efficiently captures the major diversity Con: discards rare haplotypes
26 Strategies of Reducing Degrees of Freedom III. Haplotype Clustering (Molitor et al. 2003; Seltman et al. 2001, 2003; Durrant et al. 2004) Similar haplotypes induce similar liability effects Cluster haplotypes and perform analysis based on clusters of haplotypes Pro: incorporates all data Con: may cluster two major haplotypes in the same group
28 Haplotype Grouping Focus on Stage II Combine the pros of haplotype tagging and clustering