Scalable Algorithms for Analysis of Genomic Diversity Data
Bogdan Paşaniuc
Department of Computer Science & Engineering, University of Connecticut

Single Nucleotide Polymorphisms
- Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
- High density in the human genome: about $1 \times 10^7$ SNPs out of $3 \times 10^9$ base pairs
- Vast majority are bi-allelic, allowing a 0/1 encoding

… ataggtcc C tatttcgcgc C gtatacacggg A ctata …
… ataggtcc G tatttcgcgc C gtatacacggg T ctata …
… ataggtcc C tatttcgcgc C gtatacacggg T ctata …
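
As an illustration of the 0/1 encoding, the sketch below finds the bi-allelic columns in a set of aligned sequences and encodes each sequence as a 0/1 vector (0 for the major allele, 1 for the minor allele). The function name `encode_snps` and the removal of the spaces around the highlighted bases are assumptions made for this example only.

```python
from collections import Counter

def encode_snps(sequences):
    """Return (snp_positions, haplotypes); 0 = major allele, 1 = minor allele."""
    length = len(sequences[0])
    assert all(len(s) == length for s in sequences), "sequences must be aligned"
    positions, columns = [], []
    for i in range(length):
        counts = Counter(s[i].upper() for s in sequences)
        if len(counts) == 2:  # keep bi-allelic sites only
            major = counts.most_common(1)[0][0]
            positions.append(i)
            columns.append([0 if s[i].upper() == major else 1 for s in sequences])
    # Transpose the per-site columns into one 0/1 vector per sequence.
    haplotypes = [[col[r] for col in columns] for r in range(len(sequences))]
    return positions, haplotypes

# The three example sequences from the slide, with the spaces removed.
seqs = [
    "ataggtccCtatttcgcgcCgtatacacgggActata",
    "ataggtccGtatttcgcgcCgtatacacgggTctata",
    "ataggtccCtatttcgcgcCgtatacacgggTctata",
]
positions, haplotypes = encode_snps(seqs)
```

The middle highlighted base is C in all three sequences, so under this rule only two of the three highlighted columns count as SNPs within this small sample.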

Haplotypes and Genotypes
- Haplotype: description of the SNP alleles on a chromosome
  - 0/1 vector: 0 for the major allele, 1 for the minor allele
- Diploids: two homologous copies of each autosomal chromosome
  - One inherited from the mother and one from the father
- Genotype: description of the alleles on both chromosomes
  - 0/1/2 vector: 0 (1) if both chromosomes carry the major (minor) allele; 2 if the chromosomes carry different alleles
- Two haplotypes per individual genotype
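
The 0/1/2 genotype encoding can be sketched directly from two 0/1 haplotypes. The function name `genotype_from_haplotypes` is hypothetical.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype vector.

    Equal alleles keep their value (0 or 1); different alleles become 2.
    """
    assert len(h1) == len(h2), "haplotypes must cover the same SNPs"
    return [a if a == b else 2 for a, b in zip(h1, h2)]

example = genotype_from_haplotypes([0, 1, 0, 1], [0, 1, 1, 0])
```

Note that the mapping is many-to-one: swapping the alleles at both heterozygous sites of one haplotype pair gives a different pair with the same genotype, which is why phasing is needed.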

Introduction
- Haplotype data → exact DNA sequence → function
- Haplotypes → increased power of association
- Directly determining haplotype data is expensive and time-consuming
- Cost-effective high-throughput technologies exist for determining genotype data
- Need for computational methods that infer haplotypes from genotype data: the genotype phasing problem

Outline
- Background on genomic diversity
- The genotype phasing problem
- Hidden Markov Model of Haplotype Diversity
- Genotype Imputation
- DNA Barcoding
- Conclusions

Genotype Phasing
- For a genotype with $k$ 2's there are $2^{k-1}$ possible pairs of haplotypes explaining it
- [Figure: example genotype g with candidate haplotypes h1, h2, h3, h4]
- Computational approaches to genotype phasing
  - Statistical methods: PHASE, Phamily, PL, GERBIL, …
  - Combinatorial methods: Parsimony, HAP, 2SNP, ENT, …
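
The count of $2^{k-1}$ explaining pairs can be checked by brute-force enumeration. This is a minimal sketch with a hypothetical function name; it fixes the allele of the first haplotype at the first heterozygous site so that $(h, h')$ and $(h', h)$ are not counted twice.

```python
from itertools import product

def enumerate_phasings(g):
    """List all unordered haplotype pairs that explain the 0/1/2 genotype g."""
    het = [i for i, x in enumerate(g) if x == 2]
    if not het:
        return [(tuple(g), tuple(g))]  # homozygous everywhere: a single pair
    phasings = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        # The first heterozygous site is fixed to 0 in h1 to break the symmetry.
        for pos, b in zip(het, (0,) + bits):
            h1[pos], h2[pos] = b, 1 - b
        phasings.append((tuple(h1), tuple(h2)))
    return phasings

pairs = enumerate_phasings([2, 0, 2, 2, 1])  # k = 3 heterozygous sites
```

The exponential growth in $k$ is what makes exhaustive enumeration impractical for long genotypes.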

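Minimum Entropy Genotype Phasing
- Phasing: a function $f$ that assigns to each genotype $g$ a pair of haplotypes $(h, h')$ that explains $g$
- Coverage of $h$ in $f$: the number of times $h$ appears in the image of $f$
- Entropy of a phasing: $\mathrm{Entropy}(f) = -\sum_{h} \frac{\mathrm{cov}(h,f)}{2|G|} \log \frac{\mathrm{cov}(h,f)}{2|G|}$, where $|G|$ is the number of genotypes
- Minimum Entropy Genotype Phasing [Halperin&Karp 04]: given a set of genotypes, find a phasing with minimum entropy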
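
A small sketch of the entropy of a phasing, assuming the usual definition in which each haplotype contributes its coverage divided by twice the number of genotypes. The logarithm base (2 here) and the function name are assumptions; the slide does not state a base.

```python
from collections import Counter
from math import log2

def phasing_entropy(phasing):
    """Entropy of a phasing given as a list of (h, h') pairs, one per genotype."""
    coverage = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)  # every genotype contributes two haplotypes
    return -sum(c / total * log2(c / total) for c in coverage.values())

shared = phasing_entropy([((0, 0), (1, 1)), ((0, 0), (1, 1))])    # 2 distinct haplotypes
distinct = phasing_entropy([((0, 0), (1, 1)), ((0, 1), (1, 0))])  # 4 distinct haplotypes
```

Phasings that reuse the same few haplotypes score lower, which is the preference the objective is meant to encode.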

ENT Algorithm
- Initialization: start with a random phasing
- Iterative improvement step: while there exists a genotype whose re-phasing decreases the entropy, find the genotype that yields the highest decrease in entropy and re-phase it
- The minimum entropy objective is uninformative for long genotypes: each haplotype is compatible with 1 genotype → all haplotypes have coverage 1 → the entropy of every phasing is $-\log\frac{1}{2|G|}$
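
The local search described above can be sketched as follows. This is not the ENT implementation: it enumerates every phasing of each genotype exhaustively, which is only feasible for very short genotypes, and all function names are hypothetical.

```python
import random
from collections import Counter
from itertools import product
from math import log2

def phasings_of(g):
    """All unordered haplotype pairs explaining genotype g (exponential in #2's)."""
    het = [i for i, x in enumerate(g) if x == 2]
    if not het:
        return [(tuple(g), tuple(g))]
    out = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for pos, b in zip(het, (0,) + bits):
            h1[pos], h2[pos] = b, 1 - b
        out.append((tuple(h1), tuple(h2)))
    return out

def entropy(phasing):
    cov = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)
    return -sum(c / total * log2(c / total) for c in cov.values())

def ent_local_search(genotypes, seed=0):
    """Random start, then repeatedly apply the single re-phasing with the largest entropy drop."""
    rng = random.Random(seed)
    options = [phasings_of(g) for g in genotypes]
    current = [rng.choice(opts) for opts in options]
    best = entropy(current)
    while True:
        move = None
        for j, opts in enumerate(options):
            for cand in opts:
                e = entropy(current[:j] + [cand] + current[j + 1:])
                if e < best - 1e-12:  # strictly better than anything seen so far
                    best, move = e, (j, cand)
        if move is None:
            return current, best  # local optimum: no single re-phasing helps
        current[move[0]] = move[1]

phasing, value = ent_local_search([[0, 0], [1, 1], [2, 2]])
```

In this toy input the heterozygous genotype is resolved into the two haplotypes already present in the homozygous individuals, because that choice minimizes the entropy.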

Overlapping Window Approach
- Entropy is computed over short windows of size $l + f$
  - $l$ "locked" SNPs, previously phased
  - $f$ "free" SNPs, currently being phased
- Only phasings consistent with the $l$ locked SNPs are considered
- [Figure: genotypes g1 … gn over SNPs 1, 2, 3, 4, …, split into locked and free regions]
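
One way to lay out such windows is sketched below: each window phases the next `free` SNPs and locks the `locked` SNPs immediately before them. The function name and the exact boundary handling at the two ends of the chromosome are assumptions, not details taken from the slides.

```python
def overlapping_windows(n_snps, locked, free):
    """Return a list of (locked_indices, free_indices) pairs covering SNPs 0..n_snps-1."""
    windows, start = [], 0
    while start < n_snps:
        lock_from = max(0, start - locked)  # fewer locked SNPs at the left end
        locked_idx = list(range(lock_from, start))
        free_idx = list(range(start, min(n_snps, start + free)))
        windows.append((locked_idx, free_idx))
        start += free
    return windows

layout = overlapping_windows(7, locked=2, free=3)
```

Each SNP is free in exactly one window, so every position is phased once while still being constrained by its already phased left neighbors.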

Effect of Window Size

Experimental Setup (1)
- International HapMap Project, Phase II datasets
  - 3.7 million SNP loci
  - 3 populations: CEU and YRI (30 trios each), JPT+CHB (90 unrelated individuals)
  - Reference haplotypes obtained using PHASE
- Accuracy measures
  - Relative Genotype Error (RGE): percentage of missing genotypes inferred differently from the reference method
  - Relative Switching Error (RSE): number of switches needed to convert the inferred haplotype pairs into the reference haplotype pairs
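
The switch count underlying RSE can be sketched for a single individual as follows. The sketch assumes the inferred and reference pairs explain the same genotype, and it returns the raw number of switches; how the slides normalize this count into a relative error is not stated, so no normalization is applied here.

```python
def switch_errors(inferred, reference):
    """Number of switches needed to turn the inferred haplotype pair into the reference pair."""
    (a1, _a2), (r1, r2) = inferred, reference
    het = [i for i in range(len(r1)) if r1[i] != r2[i]]
    # At each heterozygous site, does inferred strand 1 agree with reference strand 1?
    orientation = [a1[i] == r1[i] for i in het]
    # A switch is needed wherever the orientation flips between consecutive het sites.
    return sum(1 for x, y in zip(orientation, orientation[1:]) if x != y)

reference = ((0, 0, 0, 0), (1, 1, 1, 1))
one_switch = switch_errors(((0, 0, 1, 1), (1, 1, 0, 0)), reference)
three_switches = switch_errors(((0, 1, 0, 1), (1, 0, 1, 0)), reference)
```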

Experimental Setup (2)
Compared algorithms:
- ENT
- 2SNP [Brinza&Zelikovsky 05]
- Pure Parsimony Trio Phasing (ILP) [Brinza et al. 05]
- PHASE [Stephens et al. 01]
- HAP [Halperin&Eskin 04]
- fastPHASE [Scheet&Stephens 06]

Results on HapMap Phase II Panels
- Averages over the 22 chromosomes
- Runtime
  - ENT: a few hours
  - PHASE: months of CPU time on a cluster of 238 nodes

Results on the [Orzack et al. 03] Dataset
- [Orzack et al. 03]: 80 unrelated genotypes over 9 SNPs, with haplotypes determined experimentally
- Ranking of the algorithms remains the same
- Slight underestimation of the true error rate

Effect of pedigree information

Founder Haplotypes
- Haplotypes in the current population arose from a small number of founder haplotypes through mutation and recombination events
- [Figure: founder haplotype visualization, obtained using HaploVisual]

HMM Model
- Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]
- Models the ancestral haplotype population
  - Paths with high transition probability → "founder" haplotypes
  - Transitions from one founder to another → recombination events
  - Emissions → mutation events

HMM Training
- Previous works use EM training of the HMM based on unrelated genotype data
- 2-step procedure:
  1. Infer haplotypes using ENT (uses all available pedigree information)
  2. Baum-Welch training on the inferred haplotypes (maximizes the likelihood of the haplotypes)

Maximum Probability Genotype Phasing
- Phase $G$ as the pair $(h_1, h_2) = \arg\max P(h_1)P(h_2)$, taken over the haplotype pairs that explain $G$
- Maximum phasing probability: $P(G) = \max_{(h_1,h_2)} P(h_1)P(h_2)$
- How hard is it to compute the maximum phasing probability in the HMM?
  - Conjectured to be NP-hard [Rastas et al. 07]
- Theorem: $P(G)$ cannot be approximated within $O(n^{1/2-\varepsilon})$ unless ZPP = NP, where $n$ is the number of SNP loci

Complexity of Computing Maximum Phasing Probability
- Reduction from Max Clique
- Transitions: probabilities 1 and 1/2
- Initial transition: 2 deg(v)+1(2) /α
- All haplotypes: probability 1/α

Complexity of Computing Maximum Phasing Probability
- A haplotype $H$ representing a clique of size $k$ will be emitted along $k$ paths → $P(H) = k/\alpha$
- By construction, $H'$ (the complement of $H$) can be emitted along the second block
- $G = 22\ldots22$ → $P(G) = (\max_H P(H))^2$
- The graph has a clique of size $k$ or more iff $P(G) \ge (k/\alpha)^2$
- Maximum probability genotype phasing is NP-hard

Heuristic Decoding Algorithms
- Viterbi decoding
  - Maximum probability of emitting a haplotype pair that explains G along two HMM paths
  - Efficiently computed using Viterbi's algorithm
- Posterior decoding
  - For each SNP, choose the states that are most likely at that locus given the genotype G
  - Find the most likely emissions at each SNP to explain G
  - Efficiently computed using the forward and backward algorithms
- Sampling from the HMM posterior distribution
  - Generate pairs of haplotypes that explain G, conditional on the haplotype distribution represented by the HMM
  - Combine the sample into a single phasing

Greedy Likelihood Decoding
- Uses the forward values computed by the forward algorithm
  - $f_h(i,q)$ = the total probability of emitting the first $i$ alleles of haplotype $h$ and ending up at state $q$ at level $i$
  - $P(h \mid M) = \sum_q f_h(n,q)$
- Constructs $(h, h')$ with alleles $(x, y)$ at SNP $i$ such that the probability of the phasing up to locus $i$, given the already determined phasing for the first $i-1$ loci, is maximized
- 2 variants: left-to-right or right-to-left
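
The forward values used by the decoders can be sketched for a left-to-right HMM with $K$ founder states per locus. The parameter layout (`init[q]`, `trans[i][p][q]` for the step from locus `i` to `i+1`, `emit[i][q][allele]`) and the toy numbers are assumptions chosen for illustration, not the trained model from the talk.

```python
def forward(h, init, trans, emit):
    """f[i][q]: probability of emitting h[0..i] and being in state q at locus i."""
    K = len(init)
    f = [[init[q] * emit[0][q][h[0]] for q in range(K)]]
    for i in range(1, len(h)):
        prev = f[-1]
        f.append([
            emit[i][q][h[i]] * sum(prev[p] * trans[i - 1][p][q] for p in range(K))
            for q in range(K)
        ])
    return f

def haplotype_probability(h, init, trans, emit):
    """P(h | M): sum of the forward values at the last locus."""
    return sum(forward(h, init, trans, emit)[-1])

# Toy model (made-up numbers): K = 2 founders, n = 2 loci.
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.1, 0.9]]]          # rare "recombination" between founders
emit = [[[0.9, 0.1], [0.1, 0.9]]] * 2       # founder 0 mostly emits 0, founder 1 mostly emits 1
p00 = haplotype_probability([0, 0], init, trans, emit)
```

The cost is $O(nK^2)$ per haplotype, since each of the $n$ loci sums over all $K \times K$ state transitions.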

Combined Greedy Likelihood Decoding
- [Figure: left-to-right phasing and right-to-left phasing, each with haplotypes h and h', joined into a combined phasing at SNP i]
- $P(\text{combined phasing at SNP } i) = \sum_q f_h(i,q)\, b_h(i,q) \times \sum_q f_{h'}(i,q)\, b_{h'}(i,q)$
- The SNP $i$ that gives the best improvement is found in $O(Kn)$ time, given the forward and backward values for the 4 haplotypes

Tweaking a Phasing by Local Switching
- New phasing obtained by switching at SNP $i$
- $P(\text{new phasing}) = \sum_q f_h(i,q)\, b_h(i,q) \times \sum_q f_{h'}(i,q)\, b_{h'}(i,q)$
- The SNP $i$ that gives the best improvement is found in $O(Kn)$ time, given the forward and backward values for the 2 haplotypes
- Iterative 1-OPT procedure: while there exists a SNP at which switching improves the likelihood of the phasing, find the SNP that yields the highest increase and switch at it

Experimental Setup
- ADHD dataset
  - Chromosome X genotype data from the Genetic Association Information Network (GAIN) study on Attention Deficit Hyperactivity Disorder (ADHD)
  - 958 parent-child trios from the International Multi-site ADHD Genetics (IMAGE) project
  - Phased the children as unrelated on a 50-SNP window
- [Table: columns Decoding alg., Tweaking; rows Viterbi, Posterior, HMM Sampling, Greedy left to right, Greedy right to left, Greedy combined, Random phasing]
- [Table: columns Method, Tweaking; rows ENT, fastPHASE, PHASE v…, …SNP, BEAGLE r=…, BEAGLE r=…]

Genome-wide Case-Control Association Studies
- Preferred method for finding the genetic basis of human diseases
  1. A large number of markers (SNPs) are typed in cases and controls
  2. Statistical test of association → disease-correlated locus
- Disease-causal SNPs are unlikely to be typed directly
  - Limited coverage of current genotyping platforms
  - Vast number of SNPs present across the human genome

Genotype Imputation
- Imputation of genotypes at un-typed SNP loci
  - Powerful technique for increasing the power of association studies
  - Typed markers in conjunction with catalogs of SNP variation (e.g. HapMap) → predictors for SNPs not present on the array
- Challenge: optimally combining the multi-locus information from the current study with the multi-locus variation from HapMap

HMM-Based Genotype Imputation
1. Integrate the HapMap variation information into the HMM: train the HMM using the haplotypes from the panel related to the studied population (e.g. the CEU panel: Utah residents with ancestry from northern and western Europe)
2. Compute the probabilities of the missing genotypes given the typed genotype data: $g_i$ is imputed as $x = \arg\max_{x \in \{0,1,2\}} P(G_{g_i \leftarrow x} \mid M)$, where $G_{g_i \leftarrow x}$ denotes $G$ with $g_i$ set to $x$

Related Problems
- Missing data recovery: fill in the genotypes left uncalled by the genotyping algorithm
- Genotype error detection and correction: if $g_i$ is present, consider the increase in likelihood obtained by replacing $g_i$ with $x$

Likelihood Computation
- $P(G \mid M)$ = the probability with which $M$ emits any two haplotypes that explain $G$ along any pair of paths
- Computed in $O(nK^3)$ time by a 2-path extension of the forward algorithm followed by a factor-$K$ speed-up [Rastas et al. 07]
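
A 2-path forward recursion can be sketched as below. It advances the two paths one after the other, which is one standard way to reach a cubic cost per locus; whether this matches the cited speed-up exactly is an assumption. The imputation rule in `impute_site` (choose the genotype value that maximizes the likelihood of the whole genotype) is likewise an assumed reading of the slides, and the parameter layout and toy numbers are made up for the example.

```python
# Allele pairs (one allele per path) compatible with each genotype value.
COMPATIBLE = {0: [(0, 0)], 1: [(1, 1)], 2: [(0, 1), (1, 0)]}

def pair_emission(emit_i, q1, q2, g):
    return sum(emit_i[q1][a] * emit_i[q2][b] for a, b in COMPATIBLE[g])

def genotype_likelihood(G, init, trans, emit):
    """P(G | M) summed over all pairs of paths and all explaining haplotype pairs."""
    K = len(init)
    f = [[init[q1] * init[q2] * pair_emission(emit[0], q1, q2, G[0])
          for q2 in range(K)] for q1 in range(K)]
    for i in range(1, len(G)):
        T = trans[i - 1]
        # Step 1: advance the first path only; tmp is indexed [new q1][old p2].
        tmp = [[sum(f[p1][p2] * T[p1][q1] for p1 in range(K)) for p2 in range(K)]
               for q1 in range(K)]
        # Step 2: advance the second path and apply the pair emission.
        f = [[pair_emission(emit[i], q1, q2, G[i])
              * sum(tmp[q1][p2] * T[p2][q2] for p2 in range(K))
              for q2 in range(K)] for q1 in range(K)]
    return sum(map(sum, f))

def impute_site(G, i, init, trans, emit):
    """Pick the value x in {0, 1, 2} at site i that maximizes P(G with g_i = x | M)."""
    scores = {x: genotype_likelihood(G[:i] + [x] + G[i + 1:], init, trans, emit)
              for x in (0, 1, 2)}
    return max(scores, key=scores.get)

# Toy model (made-up numbers): K = 2 founders, n = 2 loci.
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.1, 0.9]]]
emit = [[[0.9, 0.1], [0.1, 0.9]]] * 2
p22 = genotype_likelihood([2, 2], init, trans, emit)
imputed = impute_site([0, 2], 1, init, trans, emit)  # the current value at site 1 is ignored
```

Because ordered haplotype pairs are summed, the likelihoods of all possible genotypes add up to 1, which gives a convenient sanity check for an implementation.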

Experimental Setup
- WTCCC dataset
  - Genotype data of the 1958 birth cohort from the Wellcome Trust Case Control Consortium genome-wide association study
  - 1,444 individuals from this cohort were typed using both the Affymetrix 500k platform and a custom Illumina 15k platform
- Affymetrix data and CEU HapMap haplotypes were used to impute genotypes at the SNP loci present on the Illumina chip but not on the Affymetrix chip
- The actual Illumina genotypes were then used to estimate the imputation accuracy

Results
- [Figure: estimates of the allele 0 frequencies based on imputation vs. Illumina 15k]

Results
- [Figure: accuracy and missing data rate for imputed genotypes at different thresholds; dashed line = missing data rate, solid line = discordance rate]

Effect of Errors and Missing Data
- Added an additional 1% genotyping errors and 1% missing genotypes
- TP rate = correctly flagged errors out of the total errors inserted
- FP rate = incorrectly flagged genotypes out of the total correct genotypes
- Error correction accuracy = correctly recovered genotypes out of the flagged ones
- [Table: column groups Error Detection, Error Correction, Missing Data Recovery, Imputation; measures TP Rate (%), FP Rate (%), Accuracy (%), Error Rate (%); rows EDC+MDR+IMP, MDR+IMP, IMP]
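
The TP and FP rates defined above can be computed from two boolean vectors, one marking the inserted errors and one marking the flagged genotypes. The function name and the toy input are illustrative only and do not reproduce any result from the talk.

```python
def detection_rates(is_error, is_flagged):
    """Return (TP rate, FP rate) for error detection over a list of genotypes."""
    tp = sum(1 for e, f in zip(is_error, is_flagged) if e and f)
    fp = sum(1 for e, f in zip(is_error, is_flagged) if not e and f)
    n_errors = sum(1 for e in is_error if e)
    n_correct = len(is_error) - n_errors
    return tp / n_errors, fp / n_correct

# Six genotypes: two inserted errors, one of them flagged, plus one false alarm.
tp_rate, fp_rate = detection_rates(
    [True, True, False, False, False, False],
    [True, False, True, False, False, False],
)
```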

DNA Barcoding
- Proposed in 2003 by taxonomists as a tool for rapid species identification
- Uses a short DNA region as a "fingerprint" for species
- Region of choice: the cytochrome c oxidase subunit 1 mitochondrial gene ("COI", 648 base pairs long)
- Key assumption: existence of a "barcoding gap", i.e. inter-species variability >> intra-species variability

BOLD: The Barcode of Life Data Systems [Ratnasingham&Hebert 07]
- Currently: 38,539 species, 388,582 barcodes

DNA Barcoding Challenges
- Efficient algorithms for species identification
  - Millions of species
- Meaningful confidence measures
  - The BOLD identification system was shown to have unclear confidence measures [Ekrem et al. 07]
- New species discovery
- Sample size optimization
  - Number of barcodes per species required
  - Barcode length
  - Barcode quality
- Number of regions required

Species Identification Problem
- Problem: given a repository containing barcodes from known species and a new barcode, find its species
- Several methods have been proposed for assigning specimens to species
  - TaxI [Steinke et al. 05], likelihood ratio test [Matz&Nielsen 06], BOLD-IDS [Ratnasingham&Hebert 07], …
- No direct comparisons on standardized benchmarks
- This work:
  - Direct comparison of methods from three main classes: distance-based, tree-based, and statistical model-based
  - Explores the effect of repository size: #barcodes/species, #species

Methods
- Distance-based: Hamming distance, amino acid similarity, convex score similarity, tri-nucleotide frequency distance, combined method
- Tree-based: Exemplar NJ [Meyer&Paulay 05], Profile NJ [Muller et al. 04], phylogenetic traversal
- Statistical model-based: likelihood ratio test [Matz&Nielsen 06], PWMs, inhomogeneous Markov chains
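
As an example of the distance-based class, a nearest-neighbor classifier under Hamming distance can be sketched as follows. It assumes aligned barcodes of equal length and assigns the query to the species of the closest repository barcode; the function names and the toy repository are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two aligned barcodes differ."""
    assert len(a) == len(b), "barcodes must be aligned to the same length"
    return sum(x != y for x, y in zip(a, b))

def min_hd_assign(query, repository):
    """repository: list of (species, barcode); returns the species of the nearest barcode."""
    return min(repository, key=lambda item: hamming(query, item[1]))[0]

repository = [("sp_A", "ACGTAC"), ("sp_A", "ACGTAA"), ("sp_B", "TTGTCC")]
assigned = min_hd_assign("ACGTAT", repository)
```

A linear scan like this costs one distance computation per repository barcode, which is the scalability concern raised for repositories with millions of species.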

Inhomogeneous Markov Chain (IMC)
- Takes into account dependencies between consecutive loci
- [Figure: chain from a start state through the nucleotide states A, C, T, G at locus 1, locus 2, locus 3, locus 4, …]
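
A first-order inhomogeneous Markov chain classifier can be sketched as below: each species gets its own position-specific transition counts, and a query is assigned to the species whose chain gives it the highest log-likelihood. The add-one pseudocount, the restriction to the alphabet ACGT, and all names are assumptions made for this sketch.

```python
from collections import defaultdict
from math import log

ALPHABET = "ACGT"

def train_imc(barcodes, pseudocount=1.0):
    """Estimate start counts and per-locus transition counts from aligned barcodes."""
    n = len(barcodes[0])
    start = {a: pseudocount for a in ALPHABET}
    trans = [{a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
             for _ in range(n - 1)]
    for s in barcodes:
        start[s[0]] += 1
        for i in range(n - 1):
            trans[i][s[i]][s[i + 1]] += 1  # counts are specific to locus i
    return start, trans

def log_likelihood(s, model):
    start, trans = model
    ll = log(start[s[0]] / sum(start.values()))
    for i in range(len(s) - 1):
        row = trans[i][s[i]]
        ll += log(row[s[i + 1]] / sum(row.values()))
    return ll

def imc_assign(query, repository):
    """repository: list of (species, barcode); returns the most likely species."""
    by_species = defaultdict(list)
    for species, barcode in repository:
        by_species[species].append(barcode)
    models = {sp: train_imc(seqs) for sp, seqs in by_species.items()}
    return max(models, key=lambda sp: log_likelihood(query, models[sp]))

repository = [("sp_A", "ACGTAC"), ("sp_A", "ACGTAA"), ("sp_B", "TTGTCC")]
assigned = imc_assign("ACGTAC", repository)
```

Once the per-species models are trained, scoring a query takes time linear in the barcode length for each species, independent of how many barcodes that species has in the repository.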

Comparison of Representative Methods (leave-one-out experiment)

Method   ACG      Birds    Bats Guyana  Fish Australia  Cowries
MIN-HD   98.81%   97.59%   100.00%      99.30%          88.80%
IMC      95.27%   97.23%   100.00%      99.58%          89.83%
Phylo    93.29%   92.33%   98.55%       99.30%          81.00%

Datasets:
- Hesperidia of the ACG 1 [Hajibabaei M. et al., 05]: 4267 barcodes, 561 species
- Birds of North America [Kerr K.C.R. et al., 07]: 2589 barcodes, 656 species
- Bats of Guyana [Clare E.L. et al., 06]: 840 barcodes, 96 species
- Fishes of Australia Container Part [Ward et al., 05]: 754 barcodes, 211 species
- Cowries [Meyer and Paulay, 05]: 2036 barcodes, 263 species

Accuracy vs. Species Size
- [Figure: curves for MIN-HD, IMC, and Phylo]

Accuracy vs. #Species

Conclusions
- Highly scalable method for genotype phasing
  - Several orders of magnitude faster than current methods
  - Phasing accuracy close to that of the best methods
  - Exploits all available pedigree information
- HMM model of haplotype diversity
  - Hardness result for genotype phasing
  - Improved decoding algorithms for phasing
  - Imputation of genotypes at un-typed SNPs
- DNA barcoding
  - Introduced new methods for species identification
  - Comprehensive comparison to existing methods

Acknowledgments
- Prof. Ion Mandoiu
- Profs. Sanguthevar Rajasekaran and Alex Russell
- Sasha Gusev (entropy phasing, DNA barcoding)
- Justin Kennedy (HMM imputation and error detection)
- James Lindsay and Sotiris Kentros (DNA barcoding)

References
Genotype phasing:
- B. Pasaniuc and I.I. Mandoiu. Highly scalable genotype phasing by entropy minimization. In Proc. 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society.
- A. Gusev, I.I. Mandoiu, and B. Pasaniuc. Highly scalable genotype phasing by entropy minimization. IEEE/ACM Trans. on Computational Biology and Bioinformatics 5.
HMM model, genotype imputation and error detection:
- J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. Genotype error detection using hidden Markov models of haplotype diversity. In Proc. 7th Workshop on Algorithms in Bioinformatics (WABI07), LNBI, pp. 73-84.
- J. Kennedy, I.I. Mandoiu, and B. Pasaniuc. GEDI: Genotype Error Detection and Imputation using Hidden Markov Models of Haplotype Diversity (in preparation).
DNA barcoding:
- B. Pasaniuc, S. Kentros, and I.I. Mandoiu. DNA Barcode Data Analysis: Boosting Assignment Accuracy by Combining Distance- and Character-Based Classifiers. The DNA Barcode Data Analysis Initiative (DBDAI): Developing Tools for a New Generation of Biodiversity Data Workshop.
- B. Pasaniuc, S. Kentros, and I.I. Mandoiu. Model-based species identification using DNA barcodes. 39th Symposium on the Interface: Computing Science and Statistics.
- B. Pasaniuc, A. Gusev, S. Kentros, J. Lindsay, and I.I. Mandoiu. A Comparison of Algorithms for Species Identification Based on DNA Barcodes. 2nd International Barcode of Life Conference, Academia Sinica, Taipei, Taiwan, Sept 2007.