Imputation-based local ancestry inference in admixed populations

Presentation transcript:

Imputation-based local ancestry inference in admixed populations
  Ion Mandoiu
  Computer Science and Engineering Department, University of Connecticut
  Joint work with J. Kennedy and B. Pasaniuc

Outline
  Motivation and problem definition
  Factorial HMM model of genotype data
  Algorithms for genotype imputation and ancestry inference
  Preliminary experimental results
  Summary and ongoing work

Population admixture
  http://www.garlandscience.co.uk/textbooks/0815341857.asp?type=resources

Admixture mapping
  Admixture mapping is a method for localizing disease-causing genetic variants that differ in frequency across populations. It is most advantageous to apply this approach to populations that have descended from a recent mix of two ancestral groups that have been geographically isolated for many tens of thousands of years (e.g., African Americans).
  Patterson et al., AJHG 74:979-1000, 2004

Local ancestry inference problem
  Given:
    Reference haplotypes for ancestral populations P1,…,Pn
    Whole-genome SNP genotype data for an extant individual
  Find:
    Allele ancestries at each locus
  Reference haplotypes
    1110001?0100110010011001111101110111?1111110111000
    11100011010011001001100?100101?10111110111?0111000
    11110010011001101001110010110101011111011110111000
    1110001001000100111110001111011100111?111110111000
    011101100110011011111100101101110111111111?0110000
    11100010010001001111100010110111001111111110110000
    011?001?011001101111110010?10111011111111110110000
    11100110010001001111100011110111001111111110111000
  SNP genotypes
    rs11095710  T  T
    rs11117179  C  T
    rs11800791  G  G
    rs11578310  G  G
    rs1187611   G  G
    rs11804808  C  C
    rs17471518  A  G
    ...
  Inferred local ancestry
    rs11095710  P1  P1
    rs11117179  P1  P1
    rs11800791  P1  P1
    rs11578310  P1  P2
    rs1187611   P1  P2
    rs11804808  P1  P2
    rs17471518  P1  P2
    ...

Previous work
  MANY methods
    Ancestry inference at different granularities, assuming different amounts of information about the genetic makeup of ancestral populations
  Two main classes
    HMM-based: SABER [Tang et al. 06], SWITCH [Sankararaman et al. 08a], HAPAA [Sundquist et al. 08], …
    Window-based: LAMP [Sankararaman et al. 08b], WINPOP [Pasaniuc et al. 09]
  Poor accuracy when ancestral populations are closely related (e.g., Japanese and Chinese)
    Methods based on unlinked SNPs outperform methods that model LD!

Haplotype structure in panmictic populations

HMM model of haplotype frequencies Similar models proposed in [Schwartz 04, Rastas et al. 05, Kennedy et al. 07, Kimmel&Shamir 05, Scheet&Stephens 06,…]

Graphical model representation
  [Diagram nodes: F1, F2, …, Fn; H1, H2, …, Hn]
  Random variables
    Fi = founder haplotype at locus i, between 1 and K
    Hi = observed allele at locus i
  Model training
    Based on haplotypes, using the Baum-Welch algorithm, or
    Based on genotypes, using EM [Rastas et al. 05]
  Given haplotype h, P(H=h|M) can be computed in O(nK^2) time using a forward algorithm, where n = #SNPs and K = #founders
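The forward computation named on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the array layout (`init`, `trans`, `emit`) are assumptions, and alleles are assumed to be coded 0/1.

```python
import numpy as np

def haplotype_likelihood(h, init, trans, emit):
    """P(H = h | M) by the forward algorithm, O(n K^2) time.

    Loci are 0-based here.
    h     : length-n sequence of alleles (0 or 1)
    init  : shape (K,),        init[k]        = P(F_0 = k)
    trans : shape (n-1, K, K), trans[t, j, k] = P(F_{t+1} = k | F_t = j)
    emit  : shape (n, K, 2),   emit[t, k, a]  = P(H_t = a | F_t = k)
    """
    # alpha[k] = P(h_0..h_t, F_t = k); start at the first locus
    alpha = init * emit[0, :, h[0]]
    for t in range(1, len(h)):
        # one K x K matrix-vector product per locus gives the O(n K^2) total
        alpha = (alpha @ trans[t - 1]) * emit[t, :, h[t]]
    return float(alpha.sum())
```

For long haplotypes the raw probabilities underflow, so a practical version would rescale `alpha` at each locus or work in log space.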

Factorial HMM for genotype data in a window with known local ancestry
  [Diagram nodes: F1, F2, …, Fn; H1, H2, …, Hn; F'1, F'2, …, F'n; H'1, H'2, …, H'n; G1, G2, …, Gn]
  The Hierarchical-Factorial HMM we used to describe haplotype sequences is similar to models recently proposed by others. At the core of the model are two regular HMMs representing haplotype frequencies in the populations of origin of the sequenced individual’s parents. Under this model, each haplotype in the population is viewed as a mosaic formed by historical recombination among a set of K founder haplotypes. Increasing the number of founders allows more paths to describe a haplotype sequence, but at the cost of a longer run time. The HMMs are trained with the Baum-Welch algorithm on haplotypes inferred from a panel representing the population of origin of each parent.

HMM-based genotype imputation
  Probability of missing genotype given the typed genotype data: [equation not preserved]
  gi is imputed as: [equation not preserved]
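The formulas on this slide did not survive in the transcript, so the following is only a sketch of one standard way to impute under a factorial HMM, not the authors' implementation. It assumes genotypes are coded as allele counts 0/1/2 with -1 for missing, that each parental chain is given as a hypothetical `(init, trans, emit)` triple, and that a missing genotype is imputed as the value with the highest posterior probability given the typed genotypes.

```python
import numpy as np

MISSING = -1  # assumed code for an untyped genotype

def genotype_emission(e1, e2, g):
    """P(G = g | F = f, F' = f') for every founder pair; shape (K1, K2).

    e1[f, a] and e2[f', b] are the per-chain allele emission probabilities.
    """
    if g == MISSING:
        return np.ones((e1.shape[0], e2.shape[0]))  # marginalize over g
    out = np.zeros((e1.shape[0], e2.shape[0]))
    for a in (0, 1):
        b = g - a  # allele on the second haplotype so that a + b = g
        if b in (0, 1):
            out += np.outer(e1[:, a], e2[:, b])
    return out

def genotype_likelihood(g, m1, m2):
    """P(typed genotypes in g | M) by a forward pass over founder pairs."""
    init1, trans1, emit1 = m1
    init2, trans2, emit2 = m2
    alpha = np.outer(init1, init2) * genotype_emission(emit1[0], emit2[0], g[0])
    for t in range(1, len(g)):
        # the two chains move independently: alpha'[k,l] = sum_{j,m} alpha[j,m] T1[j,k] T2[m,l]
        alpha = trans1[t - 1].T @ alpha @ trans2[t - 1]
        alpha *= genotype_emission(emit1[t], emit2[t], g[t])
    return float(alpha.sum())

def impute_genotype(g, i, m1, m2):
    """Posterior over g_i in {0, 1, 2} given the other typed genotypes, plus its argmax."""
    probs = []
    for x in (0, 1, 2):
        gx = list(g)
        gx[i] = x  # P(g_i = x | rest) is proportional to P(rest, g_i = x | M)
        probs.append(genotype_likelihood(gx, m1, m2))
    post = np.array(probs) / sum(probs)
    return int(post.argmax()), post
```

Recomputing the full likelihood for each candidate value keeps the sketch short; a forward-backward pass would give the same posteriors for all loci at once.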

Forward-backward computation
  [Diagram nodes at locus i: fi, hi, f'i, h'i, gi]

Runtime
  Direct recurrences for computing forward probabilities: [equation not preserved]
  Runtime reduced to O(nK^3) by reusing common terms: [equation not preserved]
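The recurrences themselves are missing from the transcript. The sketch below shows one way the saving can arise, under the assumption that the two founder chains transition independently; the function names are illustrative. Summing over all K^2 predecessor pairs for each of the K^2 current pairs costs O(K^4) per locus, while summing over one chain first and reusing that partial sum costs O(K^3).

```python
import numpy as np

def step_direct(alpha, t1, t2):
    """alpha'[k, l] = sum_{j, m} alpha[j, m] * t1[j, k] * t2[m, l], in O(K^4)."""
    K1, K2 = alpha.shape
    out = np.zeros((K1, K2))
    for k in range(K1):
        for l in range(K2):
            for j in range(K1):
                for m in range(K2):
                    out[k, l] += alpha[j, m] * t1[j, k] * t2[m, l]
    return out

def step_factored(alpha, t1, t2):
    """Same transition step in O(K^3) by reusing the inner sum."""
    # partial[k, m] = sum_j alpha[j, m] * t1[j, k], shared by every l
    partial = t1.T @ alpha
    # out[k, l] = sum_m partial[k, m] * t2[m, l]
    return partial @ t2
```

Both functions return the same array; only the order of summation differs.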

Imputation-based ancestry inference
  View local ancestry inference as a model selection problem
    Each possible local ancestry defines a factorial HMM
    Pick the model that re-imputes SNPs most accurately around the locus of interest
  Fixed-window version: pick the ancestry that maximizes the average posterior probability of the true SNP genotypes within a fixed-size window centered at the locus
  Multi-window version: weighted voting over window sizes between 200 and 3000, with window weights proportional to average posterior probabilities
  Observations, in longer form:
    For individuals from recently admixed populations, the local ancestry of a SNP locus is typically shared with a large number of neighboring loci.
    The accuracy of SNP genotype imputation within such a neighborhood is typically higher when using the factorial HMM corresponding to the correct local ancestry than when using a mis-specified model.
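The two selection rules above can be sketched as follows. This is an illustration under assumptions, not the authors' code: `post` is a hypothetical dictionary mapping each candidate local ancestry to the per-SNP posterior probability that its factorial HMM assigns to the true genotype, and in the multi-window rule each window is assumed to vote for its best ancestry with a weight equal to that ancestry's average posterior.

```python
import numpy as np

def window_mean(scores, i, size):
    """Average of scores in a window of about `size` SNPs centered at locus i."""
    half = size // 2
    lo, hi = max(0, i - half), min(len(scores), i + half + 1)
    return float(np.mean(scores[lo:hi]))

def fixed_window_ancestry(post, i, size):
    """Ancestry whose model best re-imputes the SNPs in one fixed window."""
    return max(post, key=lambda a: window_mean(post[a], i, size))

def multi_window_ancestry(post, i, sizes):
    """Weighted vote over several window sizes (assumed voting rule, see lead-in)."""
    votes = {a: 0.0 for a in post}
    for size in sizes:
        means = {a: window_mean(post[a], i, size) for a in post}
        best = max(means, key=means.get)
        votes[best] += means[best]  # weight = winning average posterior
    return max(votes, key=votes.get)

# Toy data: the first 10 SNPs are re-imputed better under P1/P1, the rest under P1/P2
post = {
    "P1/P1": np.array([0.9] * 10 + [0.5] * 10),
    "P1/P2": np.array([0.5] * 10 + [0.9] * 10),
}
print(fixed_window_ancestry(post, 3, 5))           # P1/P1
print(multi_window_ancestry(post, 16, [3, 5, 7]))  # P1/P2
```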

HMM imputation accuracy
  Missing data rate and accuracy for imputed genotypes at different thresholds (WTCCC 58BC/HapMap CEU)
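The two quantities in this caption can be computed along the following lines. The thresholding rule is an assumption, since the slide does not state it: a genotype is called only when its highest posterior probability reaches the threshold, and is otherwise left missing. The function name and the toy numbers are illustrative.

```python
import numpy as np

def call_with_threshold(post, truth, threshold):
    """Return (missing rate, accuracy among called genotypes).

    post  : shape (m, 3), posterior over genotypes 0/1/2 for m imputed SNPs
    truth : shape (m,), true genotypes
    """
    post = np.asarray(post, dtype=float)
    truth = np.asarray(truth)
    called = post.max(axis=1) >= threshold      # confident enough to call
    missing_rate = 1.0 - called.mean()
    if not called.any():
        return float(missing_rate), float("nan")
    correct = post.argmax(axis=1)[called] == truth[called]
    return float(missing_rate), float(correct.mean())

# Toy posteriors for four imputed SNPs
toy_post = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
toy_truth = [0, 1, 1, 1]
for thr in (0.0, 0.5, 0.85):
    print(thr, call_with_threshold(toy_post, toy_truth, thr))
```

Raising the threshold leaves more genotypes uncalled but tends to raise accuracy among those that are called.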

Window size effect
  N=2,000, g=7, α=0.2, n=38,864, r=10^-8

Number of founders effect
  CEU-JPT
  N=2,000, g=7, α=0.2, n=38,864, r=10^-8

Comparison with other methods
  N=2,000, g=7, α=0.2, n=38,864, r=10^-8
  α×100 is the percentage of individuals coming from one population, and (1-α)×100 is the percentage coming from the second population.

Summary and ongoing work
  Imputation-based local ancestry inference achieves significant improvement over previous methods for admixtures between close ancestral populations
    Code at http://dna.engr.uconn.edu/software/
  Ongoing work
    Evaluating accuracy under more realistic admixture scenarios (multiple ancestral populations/gene flow/drift in ancestral populations)
    Extension to pedigree data
    Exploiting inferred local ancestry for more accurate untyped SNP imputation and phasing of admixed individuals
    Extensions to sequencing data
    Inference of ancestral haplotypes from extant admixed populations

Untyped SNP imputation accuracy in admixed individuals
  N=2,000, g=7, α=0.5, n=38,864, r=10^-8
  α×100 is the percentage of individuals coming from one population, and (1-α)×100 is the percentage coming from the second population.

HMM-based phasing
  [Diagram nodes: F1, F2, …, Fn; H1, H2, …, Hn; F'1, F'2, …, F'n; H'1, H'2, …, H'n; G1, G2, …, Gn]
  Maximum likelihood genotype phasing: given g, find (h1,h2) = argmax_{h1+h2=g} P(h1|M)P(h2|M)
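For intuition, the maximum likelihood phasing objective can be solved by brute force on very small inputs. The sketch below is not the heuristic used in practice: it enumerates every haplotype pair compatible with g, so its cost doubles with each heterozygous site. The `hap_prob` callable stands in for P(h|M), and the frequency table is a made-up toy model.

```python
from itertools import product

def ml_phasing(g, hap_prob):
    """Brute-force argmax of hap_prob(h1) * hap_prob(h2) over pairs with h1 + h2 = g.

    g        : genotypes coded 0/1/2 (count of allele 1 at each SNP)
    hap_prob : callable returning P(h | M) for a haplotype given as a tuple
    """
    het = [i for i, x in enumerate(g) if x == 1]
    base = [x // 2 for x in g]  # homozygous sites are fixed: 0 -> 0, 2 -> 1
    best, best_p = None, -1.0
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b  # the two haplotypes differ at het sites
        p = hap_prob(tuple(h1)) * hap_prob(tuple(h2))
        if p > best_p:
            best, best_p = (tuple(h1), tuple(h2)), p
    return best, best_p

# Toy haplotype frequencies standing in for P(h | M)
freqs = {(0, 0, 1): 0.5, (1, 1, 1): 0.3, (0, 1, 1): 0.1, (1, 0, 1): 0.1}
pair, p = ml_phasing((1, 1, 2), lambda h: freqs.get(h, 0.0))
print(pair, p)
```

In the toy example the genotype (1, 1, 2) has two compatible unordered pairs, and the pair built from the two most frequent haplotypes has the larger product.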

HMM-based phasing
  Bad news: cannot approximate max_{h1+h2=g} P(h1|M)P(h2|M) within a factor of O(n^(1/2-ε)), unless ZPP=NP [KMP08]
  Good news: Viterbi-like heuristics yield phasing accuracy comparable to PHASE in practice [Rastas et al. 05]

Factorial HMM model for sequencing data
  [Diagram nodes: F1, F2, …, Fn; H1, H2, …, Hn; F'1, F'2, …, F'n; H'1, H'2, …, H'n; G1, G2, …, Gn; R1,1, …, R1,c; R2,1, …, R2,c; …; Rn,1, …, Rn,c]

Acknowledgments
  J. Kennedy and B. Pasaniuc
  Work supported in part by NSF awards IIS-0546457 and DBI-0543365.