Association Mapping by Local Genealogies
Thomas Mailund
Bioinformatics Research Center, University of Aarhus

Disease mapping...
--A C A----G---X----T---C---A--
--T G A----G---X----C---C---A--
--A G G----G---X----C---C---A--
--A C A----G---X----T---C---A--
--T C A----G---X----T---C---A--
--T C A----T---X----T---A---A--
--A C A----G---X----T---C---A--
--A C A----G---X----T---C---G--
--T C A----T---X----T---C---A--
--A C A----G---X----T---C---A--
--A C G----T---X----C---A---A--
--A C A----G---X----C---C---G--
Rows are labelled as cases (affected) or controls (unaffected).
• Locate disease locus
• Unlikely to be among our genotyped markers
• Use information from available markers

Indirect signal for causal locus
--T G A----G---X----C---C---A--
--A G G----G---X----C---C---A--
--A C A----G---X----T---C---A--
--T C A----G---X----T---C---A--
--T C A----T---X----T---A---A--
--A C A----G---X----T---C---A--
--A C A----G---X----T---C---G--
--T C A----T---X----T---C---A--
--A C A----G---X----T---C---A--
--A C G----T---X----C---A---A--
--A C A----G---X----C---C---G--

--A C A----G---X----T---C---A--
• The markers are not independent
• Knowing one marker is partial knowledge of others
• This dependency decreases with distance
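One common way to quantify "knowing one marker is partial knowledge of others" is the squared correlation r² between two biallelic markers. The sketch below is illustrative only: the function name `r_squared` and the toy haplotypes are assumptions, not part of the talk or of any tool mentioned in it.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation (r^2) between biallelic markers i and j.

    haplotypes: list of equal-length strings, one phased haplotype each.
    Hypothetical helper for illustration; assumes exactly two alleles per marker.
    """
    n = len(haplotypes)
    # Code each marker as 0/1 relative to the allele carried by the first haplotype.
    a = [int(h[i] != haplotypes[0][i]) for h in haplotypes]
    b = [int(h[j] != haplotypes[0][j]) for h in haplotypes]
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x & y for x, y in zip(a, b)) / n
    d = p_ab - p_a * p_b                      # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0  # monomorphic marker: define r^2 as 0


# Two markers that always co-occur carry full information about each other,
# two markers whose alleles combine freely carry none.
linked = ["AC", "AC", "TG", "TG"]
unlinked = ["AC", "AG", "TC", "TG"]
print(r_squared(linked, 0, 1), r_squared(unlinked, 0, 1))
```

Values near 1 mean one marker nearly determines the other; values near 0 mean it says almost nothing.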

The Ancestral Recombination Graph
Locally, the genealogy of a small genomic region is the Ancestral Recombination Graph (ARG) (Hudson 1990; Griffiths & Marjoram 1996).
[Figure: an ARG, annotated over successive slides with: sampled sequences and the MRCA; recombination and coalescence events; ancestral and non-ancestral material; mutations.]

The Ancestral Recombination Graph
• The unknown ARG, mutation locus and disease status can be explored using statistical sampling methods (Larribe, Lessard and Schork 2002). This is very CPU demanding!
• Sampling only (near-)minimal ARGs improves matters (Lyngsø, Song and Hein 2005; Minichiello and Durbin 2006), but is still CPU demanding.

Local trees For each “point” on the chromosome, the ARG determines a (local) tree:

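Local trees (from Hein et al. 2005)
Moving along the chromosome, a recombination can change the local tree in three ways:
• Type 1: No change
• Type 2: Change in branch lengths
• Type 3: Change in topology

Local trees (from Hein et al. 2005)
[Figure labels: recombination rate; tree measure. The formula defining the tree measure is not preserved.]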

Using the local trees
Tree genealogies:
• Each site a different genealogy
• Nearby genealogies only slightly different
--T G A----G---X----C----C-----A--
--A G G----G---X----C----C-----A--
--A C A----G---X----T----C-----A--
--T C A----G---X----T----C-----A--
--T C A----T---X----T----A-----A--
--A C A----G---X----T----C-----A--
[Figure: one tree per marker, with leaves labelled by the allele each haplotype carries.]
A nearby tree is an imperfect local tree.

Using the local trees
Tree at disease site (Templeton et al. 1987), with leaves labelled H (healthy) or D (diseased):
• “Perfect” setup: HHHHHHHH DDDDD
• Incomplete penetrance: HHHHHHHH DDDHD
• Other disease causes: HDHHHDHH DDDHD

Using the local trees
At the disease site (Templeton et al. 1987):
• A significant clustering of diseased/healthy (HDHHHDHH DDDHD)

Using the local trees
--T G A----G---X----C----C-----A--
--A G G----G---X----C----C-----A--
--A C A----G---X----T----C-----A--
--T C A----G---X----T----C-----A--
--T C A----T---X----T----A-----A--
--A C A----G---X----T----C-----A--
[Figure: one tree per marker, with leaves labelled by the allele each haplotype carries.]
Tree at disease site resembles neighbours.

Using the local trees
Near the disease site (Zöllner and Pritchard 2005; Mailund et al. 2006; Sevon et al. 2006):
• A significant clustering of diseased/healthy (HDHHHDHH DDDHD)

Using the local trees
Approach (Zöllner and Pritchard 2005; Mailund et al. 2006; Sevon et al. 2006):
• Infer trees over regions
• Score the regions wrt their clustering

BLOck aSSOCiation (BLOSSOC)
In the infinite sites model (Mailund et al. 2006):
• Each mutation occurs only once
• Each mutation splits the sample in two
• A consistent tree can efficiently be inferred for a recombination-free region

BLOck aSSOCiation (BLOSSOC)
Use the four-gamete test to find regions, around each locus, that can be explained by a tree (Mailund et al. 2006).
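The four-gamete test says that two biallelic markers can share a tree under the infinite sites model only if fewer than four of the possible allele combinations occur in the sample. A minimal sketch follows, assuming phased 0/1 haplotypes. The names `compatible` and `grow_region`, and the simple grow-left-then-right strategy, are illustrative assumptions and not necessarily how Blossoc selects its regions.

```python
def compatible(haps, i, j):
    """Four-gamete test: True if markers i and j show at most 3 distinct gametes."""
    gametes = {(h[i], h[j]) for h in haps}
    return len(gametes) < 4


def grow_region(haps, locus):
    """Grow an interval [lo, hi] around `locus` while every pair stays compatible.

    Hypothetical strategy: try to add the next marker on the left, then the next
    on the right, and repeat until neither can be added.
    """
    n_markers = len(haps[0])
    lo = hi = locus
    grew = True
    while grew:
        grew = False
        for cand in (lo - 1, hi + 1):
            in_range = 0 <= cand < n_markers
            if in_range and all(compatible(haps, cand, k) for k in range(lo, hi + 1)):
                lo, hi = min(lo, cand), max(hi, cand)
                grew = True
    return lo, hi


# Toy data: markers 0-2 fit one tree, marker 3 conflicts with markers 0 and 2.
haps = ["0000", "1001", "1100", "0011", "0010"]
print(grow_region(haps, 1))
```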

BLOck aSSOCiation (BLOSSOC)
Build a tree for each such region (Mailund et al. 2006).
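For a region that passes the four-gamete test, a tree can be read off directly: each marker defines a split of the sample, and compatible splits are either nested or disjoint. The sketch below roots the tree at the first haplotype and returns, for every variable marker, the marker whose split most closely encloses it. The function name `build_tree`, the rooting choice and the parent-map output are assumptions made for illustration; they are not taken from the Blossoc implementation.

```python
def build_tree(haps):
    """Perfect-phylogeny sketch for pairwise four-gamete-compatible markers.

    Returns {marker: parent_marker_or_None}. A marker's "carriers" are the
    haplotypes that differ from haps[0] at that marker; the parent is the
    marker with the smallest carrier set containing them.
    Assumes the input is compatible; no validation is done.
    """
    root = haps[0]
    carriers = {
        s: frozenset(i for i, h in enumerate(haps) if h[s] != root[s])
        for s in range(len(root))
    }
    # Ignore markers without variation; process the largest splits first.
    sites = sorted((s for s in carriers if carriers[s]),
                   key=lambda s: -len(carriers[s]))
    parent = {}
    for idx, s in enumerate(sites):
        parent[s] = None
        # Earlier sites are at least as large; scan them from smallest to largest.
        for t in reversed(sites[:idx]):
            if carriers[s] <= carriers[t]:
                parent[s] = t
                break
    return parent


haps = ["000", "100", "110", "001", "001"]
print(build_tree(haps))
```

In the toy data, marker 1 sits below marker 0 because its carriers are a subset of marker 0's, while marker 2 defines a separate branch.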

BLOck aSSOCiation (BLOSSOC)
Score the tree, and assign the score to the locus (Mailund et al. 2006).

BLOck aSSOCiation (BLOSSOC)
If there are too many incompatibilities, we just cheat (but try to keep the cheating low in the tree) (Mailund et al. 2006).

BLOck aSSOCiation (BLOSSOC)
The tree construction is more complicated, but still possible and still efficient, for unphased sequence data (Ding et al. 2007).
• This is the Perfect Phylogeny Haplotyping (PPH) problem (Gusfield 2002; Ding et al. 2005)
• The “cheating” still requires local phasing, which is the most time-consuming step
[Table with columns Min markers, Phased, Unphased; values not preserved.]

Scoring trees (Mailund et al. 2006)
[Figure: tree with leaves coloured red for cases and green for controls.]
Are the case chromosomes significantly overrepresented in some sub-trees?

Scoring trees (Mailund et al. 2006)
We can place “mutations” on the tree edges and partition chromosomes into “mutants” and “wild-types”, and assign different risks based on the implied genotypes.
[Likelihood formulas for haploid data, diploid data and the null model are not preserved.]

Scoring trees (Mailund et al. 2006)
Generalizes to more mutations in the obvious way (wild-types, mutant A, mutant B).

Scoring trees (Mailund et al. 2006)
Tree score: a formula (not preserved) combining the likelihood, the number of parameters and a penalty weight.
Depending on the penalty weight we get the Akaike Information Criterion, the Bayesian Information Criterion, the Hannan-Quinn Criterion, ...
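The slide's exact score formula is lost, but the three named criteria share one standard shape: a log-likelihood term penalised by the number of parameters times a weight. The sketch below uses the textbook penalty weights (2 for AIC, ln n for BIC, 2 ln ln n for Hannan-Quinn). The function names and the "higher is better" sign convention are assumptions for illustration, not the talk's definition.

```python
import math


def penalty_weight(criterion, n):
    """Standard per-parameter penalty for a sample of n observations."""
    if criterion == "AIC":
        return 2.0
    if criterion == "BIC":
        return math.log(n)
    if criterion == "HQC":
        return 2.0 * math.log(math.log(n))
    raise ValueError(f"unknown criterion: {criterion}")


def tree_score(log_likelihood, n_params, weight):
    """Penalised score, written so that higher is better (assumed convention)."""
    return 2.0 * log_likelihood - weight * n_params


# Same fit and parameter count, three different penalties (n = 100 chromosomes).
for name in ("AIC", "HQC", "BIC"):
    print(name, tree_score(-10.0, 2, penalty_weight(name, 100)))
```

With a larger penalty weight, adding another mutation to the tree has to improve the likelihood more before the score goes up.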

Scoring trees (Mailund et al. 2006)
For efficiency reasons, we only explore the mutations top down, stopping when the score no longer improves.

Scoring trees (Mailund et al. 2006; Ding et al. 2007)
[Figure: mutants and wild-types. Likelihood formulas for haploid data, diploid data and the null model are not preserved.]

Scoring trees (Mailund et al. 2006; Ding et al. 2007)
Using an uninformative Beta prior, Beta(1,1), we can integrate the risk parameters out (Balding 2006; Waldron et al. 2006).
[Marginal likelihood formulas for haploid data, diploid data and the null model are not preserved.]
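The slide's own formulas are missing, so the following is only a sketch of what integrating out a risk parameter looks like in the simplest case. It assumes that, within one group of chromosomes (for example the mutants), each chromosome is independently a case with probability p. The symbols a (cases in the group) and u (controls in the group) are introduced here for illustration.

```latex
% Likelihood of one group with risk p, a cases and u controls (assumed Bernoulli model):
%   L(p) = p^{a} (1-p)^{u}
% A Beta(1,1) prior is uniform on [0,1], so the marginal likelihood is a Beta integral:
\int_0^1 p^{a}\,(1-p)^{u}\,dp \;=\; B(a+1,\,u+1) \;=\; \frac{a!\,u!}{(a+u+1)!}
```

Under this assumption the marginal likelihood of a whole partition is the product of one such term per group, and the null model is the single term obtained by pooling all chromosomes into one group.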

Scoring trees (Mailund et al. 2006; Ding et al. 2007)
The score is the Bayes factor of the tree likelihood vs the null model likelihood. For the tree, we take the mean score over all edges.
[Formulas for the null model, the tree model and the score are not preserved.]
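A Bayes-factor score of this kind can be sketched for haploid data as follows. The sketch assumes a Bernoulli case/control model per group with a Beta(1,1) prior on each risk, which gives the closed-form marginal a! u! / (a+u+1)! for a group with a cases and u controls. The function names and the input layout (one `(cases, controls)` pair for mutants and one for wild-types per edge) are assumptions for illustration, not the Blossoc code.

```python
from math import exp, lgamma


def log_marginal(cases, controls):
    """log of the integral of p^cases (1-p)^controls over a uniform prior on p."""
    return lgamma(cases + 1) + lgamma(controls + 1) - lgamma(cases + controls + 2)


def edge_bayes_factor(mutants, wild_types):
    """Bayes factor for one edge: separate risks for the two groups vs one shared risk.

    mutants, wild_types: (cases, controls) counts on each side of the edge.
    """
    tree = log_marginal(*mutants) + log_marginal(*wild_types)
    null = log_marginal(mutants[0] + wild_types[0], mutants[1] + wild_types[1])
    return exp(tree - null)


def tree_score(edges):
    """Mean Bayes factor over all edges of the tree."""
    return sum(edge_bayes_factor(m, w) for m, w in edges) / len(edges)


# An edge that separates cases from controls scores above 1,
# an edge that splits them evenly scores below 1.
print(edge_bayes_factor((2, 0), (0, 2)), edge_bayes_factor((1, 1), (1, 1)))
```

Working with `lgamma` on the log scale keeps the computation stable for the sample sizes mentioned elsewhere in the talk, where raw factorials would overflow.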

Scoring trees (Mailund et al. 2006; Ding et al. 2007)
• This generalises to several mutations (more complicated implied genotypes; computationally slower)
• Through Bayes factors we can test for the number of mutations

Scoring trees (Besenbacher et al. 2007)
Generalises to quantitative traits as well, with minor changes to the scoring approach...

Fine mapping example
• cases / 500 controls
• 100 SNPs on 100 Kbp
• 2 mutations at same locus with same risk
• P(case|aa) = 5%; GRR = 2

Fine mapping example...

Localization accuracy 1 causal mutation Max BF / min p-val used as point estimate

Localization accuracy 2 causal mutations Max BF / min p-val used as point estimate

Comparison with Margarita
Margarita is the (near-)minimal ARG method of Minichiello and Durbin.
Data sets:
• 1000 cases / 1000 controls
• 300 markers
Comparisons with both phased and unphased data.
Acknowledgment: experiments done by Yun S. Song.

Comparison with Margarita
[Figures: results for unphased and phased data with min markers 5, and for unphased and phased data with min markers 9.]

Comparison with Margarita
How did we do?
• Generally between single-marker test and Margarita
• Depends heavily on the scoring function
• Quite well on time
[Timing table: Margarita (phased, unphased) against Blossoc on unphased data with -fH, -fB, -fA, -fG, -fP, -fX, for m = 1 to 9; values not preserved.]

Choice of scoring function
Is one scoring function generally better than the rest?
• Unfortunately not
• Simulations show a (small) trend:
  • Small datasets (<1000 individuals): -fP
  • Medium datasets (~1000 individuals): -fH
  • Larger datasets (>1000 individuals): -fA

Comparison with HPM (Toivonen et al. 2000)

Comparison with HapMiner (Li and Jiang 2005)
200 markers, rho=40, 500 cases / 500 controls, P(A)=18-22%
Two penetrance settings:
• P(case|AA)=15%; P(case|Aa)=10%; P(case|aa)=5%
• P(case|AA)=20%; P(case|Aa)=8%; P(case|aa)=5%
Running time:
• Blossoc: ~5 sec per data set
• HapMiner: ~40 min per data set

Comparison with HapCluster (Waldron et al. 2006)

Implementation freely available Homepage: Command line and graphical user interface...

The end Thank you! More at

References
A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping – A.R. Templeton, E. Boerwinkle, and C.F. Sing; Genetics
Gene genealogies and the coalescent process – R.R. Hudson; Oxford Surveys in Evolutionary Biology
Ancestral inference from samples of DNA sequences with recombination – R.C. Griffiths and P. Marjoram; J Comput Biol 3:
Data Mining Applied to Linkage Disequilibrium Mapping – H.T.T. Toivonen, P. Onkamo, K. Vasko, V. Ollikainen, P. Sevon, H. Mannila, M. Herr and J. Kere; Am J. of Human Gen
Gene mapping via the ancestral recombination graph – F. Larribe, S. Lessard, and N.J. Schork; Theor Popul Biol 62:
Haplotyping as Perfect Phylogeny: Conceptual Framework and Efficient Solutions – D. Gusfield; RECOMB
Gene genealogies, variation, and evolution – J. Hein, M.H. Schierup, and C. Wiuf; Oxford University Press 2005
Coalescent-based association mapping and fine mapping of complex trait loci – S. Zöllner and J.K. Pritchard; Genetics 169:
Minimum Recombination Histories by Branch and Bound – R.B. Lyngsø, Y.S. Song and J. Hein; WABI 2005, LNCS , 2005
A linear-time algorithm for the perfect phylogeny haplotyping (PPH) problem – Z. Ding, V. Filkov and D. Gusfield; RECOMB
Haplotype-based linkage disequilibrium mapping via direct data mining – J. Li and T. Jiang; Bioinformatics 21(24)
Fine mapping of disease genes via haplotype clustering – E.R.B. Waldron, J.C. Whittaker, and D.J. Balding; Genet Epidemiol 30:
Whole genome association mapping by incompatibilities and local perfect phylogenies – T. Mailund, S. Besenbacher, and M.H. Schierup; BMC Bioinformatics 7:
TreeDT: Tree pattern mining for gene mapping – P. Sevon, H. Toivonen, V. Ollikainen; IEEE/ACM Transactions on Computational Biology and Bioinformatics
A tutorial on statistical methods for population association studies – D.J. Balding; Nat Rev Genet 7:
Mapping Trait Loci by Use of Inferred Ancestral Recombination Graphs – M. Minichiello and R. Durbin; Am J. of Human Gen 2006
Using unphased perfect phylogenies for efficient whole-genome association mapping – Z. Ding, T. Mailund and Y.S. Song; In preparation 2007