Association Mapping of Complex Diseases with Ancestral Recombination Graphs: Models and Efficient Algorithms Yufeng Wu UC Davis RECOMB 2007
Association Mapping of Diseases [Figure: SNP matrix with rows grouped into cases and controls] Diploid: two sequences per individual Problem: Where are the (unobserved) disease mutations? This talk: Genealogy-based approach
Genealogy: Evolutionary History of Genomic Sequences Tells how individuals in a population are related Helps to explain diseases: disease mutations occur on branches and all descendants carry the mutations Problem: How to determine the genealogy for “unrelated” individuals? Not easy with recombination [Figure: genealogy of individuals in the current population; legend: diseased (case), healthy (control), disease mutation]
Recombination One of the principal genetic forces shaping sequence variation within species Two equal-length sequences generate a third, new equal-length sequence in the genealogy [Figure labels: prefix, suffix, breakpoint]
Ancestral Recombination Graph (ARG) [Figure: ARG over sequences S1 = 00, S2 = 01, S3 = 10 and S4, with mutation and recombination events marked] Assumption: At most one mutation per site
Mapping Disease Gene with Inferred Genealogy “...the best information that we could possibly get about association is to know the full coalescent genealogy...” (Zollner and Pritchard, 2005) But we do not know the true ARG! Goal: infer ARGs from sequences for association mapping. This is not easy, and approximations are often used (e.g. Zollner and Pritchard)
The ARG Approaches First practical ARG association mapping method (Minichiello and Durbin, 2006): uses plausible ARGs built heuristically My work: Generate ARGs with a provable property, and work with a well-defined complex disease model. minARGs: most parsimonious ARGs, which use the minimum number of recombinations. Uniform sampling of minARGs: generate one minARG from the space of all minARGs with equal probability. (Sampling is a scheme often used in genealogy-based approaches)
Counting minARGs by Dynamic Programming (This paper) Recursion: with N1 = 124 and N2 = 32, N = 124*1 + 32*2 = 188 It turns out no other row choices contribute to the minARG space Assume only input sequences are generated.
Sampling with the counts N1 = 124 and N2 = 32: 1. Draw a random value, e.g. Rnd = 0.3 2. Select one candidate row with prob = 124/188 = 0.66 and the other with prob = 32*2/188; pick the selected row as the last row to derive 3. Move to the reduced matrix Idea: Use the counts of minARGs to select the order in which sequences are generated. This can easily be extended to weighted sampling, e.g. generate less frequent sequences later.
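The selection step can be sketched in a few lines. This is only an illustration of count-proportional sampling using the numbers on the slide; the names `choose_last_row`, `row_a`, `row_b` and the convention of mapping the random value onto cumulative weights are assumptions, not the TMARG implementation.

```python
def choose_last_row(candidates, rnd):
    """Pick a row with probability proportional to count * multiplicity.

    candidates: list of (label, minarg_count, multiplicity) tuples
                (hypothetical representation of the row choices).
    rnd:        a random value in [0, 1).
    Returns (chosen_label, total_number_of_minargs).
    """
    weights = [count * mult for _, count, mult in candidates]
    total = sum(weights)
    threshold = rnd * total
    running = 0
    for (label, _, _), weight in zip(candidates, weights):
        running += weight
        if threshold < running:
            return label, total
    return candidates[-1][0], total  # guard against rounding at rnd close to 1


# Numbers from the slide: N1 = 124 (weight 1) and N2 = 32 (weight 2).
candidates = [("row_a", 124, 1), ("row_b", 32, 2)]
label, total = choose_last_row(candidates, 0.3)
print(label, total)
```

Repeating this choice on each reduced matrix, with counts supplied by the dynamic program, would draw every minARG with equal probability, since each row is chosen in proportion to the number of minARGs that end with it.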
ARGs Represent a Set of Marginal Trees Clear separation of cases/controls: NOT expected for complex diseases! [Figure: marginal tree; legend: case, control, possible disease mutation]
Realities of Mapping Complex Diseases [Figure: SNP matrix for cases and controls with two marked SNPs] Multiple disease mutations! Incomplete penetrance Diploid: two sequences per individual Trying to find one tree branch which clearly separates cases and controls may not work for complex diseases! Solution: Inference on a well-defined disease model.
Complex Disease Model: How a Disease Affects a Population (Zollner & Pritchard, 2005) Disease mutations: Poisson process Two alleles: wild-type and mutant Probability that a disease mutation occurs on a branch: computed from the mutation rate and the branch length A formal model of the complex disease is needed to assess the significance of a chosen marginal tree for real data.
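For a Poisson process, the probability of at least one event on a branch is 1 - exp(-rate * length). The sketch below applies that standard identity; the function name and the parameter values are hypothetical, and the exact parameterization used by Zollner and Pritchard may differ.

```python
import math


def branch_mutation_prob(rate, length):
    """Probability of at least one disease mutation on a branch.

    Assumes mutations arrive as a Poisson process with the given rate
    per unit of branch length. expm1 keeps precision for tiny rate * length.
    """
    return -math.expm1(-rate * length)


# Hypothetical values: a longer branch is more likely to carry a mutation.
short_branch = branch_mutation_prob(rate=0.01, length=1.0)
long_branch = branch_mutation_prob(rate=0.01, length=10.0)
print(short_branch, long_branch)
```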
Disease Penetrance (Zollner & Pritchard) P_{A,1}: probability that a mutant sequence becomes a case; P_{C,1} = 1 - P_{A,1} P_{A,0}: probability that a wild-type sequence becomes a case; P_{C,0} = 1 - P_{A,0} Example: P_{A,1} = 0.8, P_{C,1} = 0.2, P_{A,0} = 0.1, P_{C,0} = 0.9 [Figure legend: case, control]
Phenotype Likelihood: How Likely Are the Phenotypes Generated on a Marginal Tree? (Zollner and Pritchard) The disease model specifies a probabilistic way of assigning phenotypes for a given tree. But we have many trees: on which tree do the disease mutations occur? Given a tree T and the case/control phenotypes of its leaves, what is the probability of observing these phenotypes on T? High phenotype likelihood: disease mutations may occur in T. Computable in linear time and adopted in this work
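One way to see why linear time suffices is a bottom-up pass that keeps two values per node: the likelihood of the phenotypes below it if the node is wild-type, and if it is mutant. The sketch below is a simplified illustration, not the authors' exact formulation. It assumes a wild-type root, no back mutation, per-branch mutation probabilities supplied by the caller, and the example penetrance values 0.8 and 0.1; the `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Probability of being a case given the allele (0 = wild-type, 1 = mutant).
P_CASE = {0: 0.1, 1: 0.8}


@dataclass
class Node:
    mut_prob: float = 0.0             # mutation probability on the branch above
    phenotype: Optional[str] = None   # "case" or "control" at leaves
    children: List["Node"] = field(default_factory=list)


def leaf_prob(phenotype, allele):
    p_case = P_CASE[allele]
    return p_case if phenotype == "case" else 1.0 - p_case


def likelihoods(node):
    """Return (L0, L1): likelihood of the leaf phenotypes below `node`
    given that `node` carries the wild-type (L0) or mutant (L1) allele."""
    if not node.children:
        return leaf_prob(node.phenotype, 0), leaf_prob(node.phenotype, 1)
    l0, l1 = 1.0, 1.0
    for child in node.children:
        c0, c1 = likelihoods(child)
        # Wild-type parent: the child mutates with probability mut_prob.
        l0 *= (1.0 - child.mut_prob) * c0 + child.mut_prob * c1
        # Mutant parent: the child stays mutant (no back mutation assumed).
        l1 *= c1
    return l0, l1


def phenotype_likelihood(root):
    return likelihoods(root)[0]  # root assumed wild-type


tree = Node(children=[Node(mut_prob=0.5, phenotype="case"),
                      Node(mut_prob=0.0, phenotype="control")])
print(phenotype_likelihood(tree))
```

Each node is visited once and does constant work per child, so the running time is linear in the size of the tree.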
This Paper: Expected Phenotype Likelihood We need to assess the statistical significance of the computed phenotype likelihood. Null model: randomly permute the case/control status of the leaves in the given tree. P-value by permutation tests: computational bottleneck! My result: an O(n^3) algorithm that computes the expected value (and variance) of the phenotype likelihood. Exact, fully deterministic method. But computing the P-value precisely and efficiently remains open.
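For contrast with the exact method, the permutation baseline can be sketched as follows. This shows only the Monte Carlo null model (shuffling case/control labels over the leaves), not the O(n^3) algorithm from the paper. The toy likelihood, the `CARRIER` assignment and all function names are hypothetical stand-ins for a real tree-based phenotype likelihood.

```python
import random
from statistics import mean

P_CASE = {0: 0.1, 1: 0.8}
# Hypothetical: leaves 0 and 1 sit below the candidate mutation branch.
CARRIER = [1, 1, 0, 0, 0, 0]


def toy_likelihood(labels):
    """Likelihood of the labels for one fixed mutation placement."""
    out = 1.0
    for allele, label in zip(CARRIER, labels):
        p_case = P_CASE[allele]
        out *= p_case if label == "case" else 1.0 - p_case
    return out


def permutation_null(labels, statistic, n_perm=1000, seed=0):
    """Estimate the null mean and a p-value by permuting leaf labels."""
    rng = random.Random(seed)
    observed = statistic(labels)
    shuffled = list(labels)
    null_values = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_values.append(statistic(shuffled))
    # Add-one correction keeps the estimated p-value strictly positive.
    hits = sum(value >= observed for value in null_values)
    return observed, mean(null_values), (1 + hits) / (n_perm + 1)


labels = ["case", "case", "control", "control", "control", "control"]
observed, null_mean, p_value = permutation_null(labels, toy_likelihood,
                                                n_perm=2000, seed=1)
print(observed, null_mean, p_value)
```

The cost grows with the number of permutations times the cost of one likelihood evaluation, which is the bottleneck an exact expected-value computation avoids.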
This Paper: Diploid Penetrance Is Hard Diploid (e.g. humans): two sequences per individual Diploid penetrance: P_{A,00}: probability that an individual with two wild-type sequences becomes a case P_{A,01}: probability that an individual with one wild-type and one mutant sequence becomes a case P_{A,11}: … [Figure legend: case, control] Efficient computation of phenotype likelihood: stated but unresolved in Zollner and Pritchard My result: computing phenotype likelihood with diploid penetrance is NP-hard
Simulation Results Comparison: TMARG, LATAG (Zollner and Pritchard), MARGARITA (Minichiello and Durbin). TMARG (my program) and MARGARITA are much faster (20 times or more) than LATAG, which is important for whole-genome scans. [Figure: average mapping error for 50 simulated datasets from Zollner and Pritchard; average over 50 genealogies] Date: January, 2007
Acknowledgement Software available at: I want to thank Dan Gusfield, Dan Brown, Chuck Langley, and Yun S. Song