ISBRA 2007 Tutorial A: Scalable Algorithms for Genotype and Haplotype Analysis. Ion Mandoiu (University of Connecticut), Alexander Zelikovsky (Georgia State University)

Presentation transcript:

ISBRA 2007 Tutorial A: Scalable Algorithms for Genotype and Haplotype Analysis Ion Mandoiu (University of Connecticut) Alexander Zelikovsky (Georgia State University)

Outline Background on genetic variation Genotype phasing Error detection Disease association search Disease susceptibility prediction

Single Nucleotide Polymorphisms: Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs). High density in the human genome: approximately $1 \times 10^{7}$ SNPs out of a total of $3 \times 10^{9}$ base pairs.
… ataggtcc C tatttcgcgc C gtatacacggg A ctata …
… ataggtcc G tatttcgcgc C gtatacacggg T ctata …
… ataggtcc C tatttcgcgc C gtatacacggg T ctata …

Haplotypes and Genotypes Diploids: two homologous copies of each autosomal chromosome One inherited from mother and one from father Haplotype: description of SNP alleles on a chromosome 0/1 vector: 0 for major allele, 1 for minor Genotype: description of alleles on both chromosomes 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles two haplotypes per individual genotype
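
The 0/1 and 0/1/2 encodings above can be illustrated with a short sketch. The function name `genotype_from_haplotypes` is chosen here for illustration and does not come from the tutorial.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype.

    0 = both chromosomes carry the major allele,
    1 = both carry the minor allele,
    2 = the chromosomes carry different alleles.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    # Equal alleles keep their value (0 or 1); differing alleles become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 0, 1]
h2 = [0, 1, 1, 0]
print(genotype_from_haplotypes(h1, h2))  # [0, 1, 2, 2]
```

Going from haplotypes to a genotype is deterministic; the reverse direction (phasing) is ambiguous whenever the genotype contains more than one 2.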

Why SNPs? Identification and fine mapping of disease-related genes. Methods: linkage analysis, allele-sharing, association studies. Genotype data: large pedigrees, sibling pairs, trios, unrelated individuals.

Challenges in SNP Data Analysis: Latest technologies deliver 1M SNP genotypes per sample, at low cost. Major challenges: efficiency and reproducibility → Need simple methods!

Genotype Phasing

For a genotype with k 2’s there are $2^{k-1}$ possible pairs of haplotypes explaining it. [Figure: a genotype g and candidate haplotypes h1 to h4; the sequences are not preserved in the transcript.] Computational approaches to genotype phasing: Statistical methods: PHASE, Phamily, PL, GERBIL, … Combinatorial methods: Parsimony, HAP, 2SNP, ENT, …
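
The count of explaining pairs can be checked by enumeration. The sketch below uses the 0/1/2 encoding from the earlier slide; `explaining_pairs` is an illustrative name, not a function from any of the tools listed.

```python
from itertools import product


def explaining_pairs(genotype):
    """List the unordered haplotype pairs that explain a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        # No heterozygous site: the only explanation is two identical copies.
        h = tuple(genotype)
        return [(h, h)]
    pairs = []
    # Fix the first heterozygous site to allele 0 on h1 so that
    # (h1, h2) and (h2, h1) are not counted twice.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site] = b
            h2[site] = 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


g = [0, 2, 1, 2]                 # k = 2 heterozygous sites
for pair in explaining_pairs(g):
    print(pair)
print(len(explaining_pairs(g)))  # 2, i.e. 2**(k-1)
```

The $2^{k-1}$ count applies for k ≥ 1; a fully homozygous genotype has exactly one explanation.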

Minimum Entropy Genotype Phasing. Phasing: a function f that assigns to each genotype g a pair of haplotypes (h,h’) that explains g. Coverage of h in f: the number of times h appears in the image of f. Entropy of a phasing: [formula not preserved in the transcript]. Minimum Entropy Genotype Phasing [HalperinKarp 04]: Given a set of genotypes, find a phasing with minimum entropy.
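
The entropy formula itself did not survive in the transcript. The sketch below assumes the usual reading: the Shannon entropy of the empirical haplotype distribution, where a haplotype with coverage c among n genotypes has frequency c/(2n). The base-2 logarithm and the name `phasing_entropy` are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def phasing_entropy(phasing):
    """Entropy of a phasing given as a list of (h, h') pairs, one per genotype.

    Assumed definition: -sum over haplotypes of p * log2(p),
    with p = coverage(h) / (2 * number of genotypes).
    """
    coverage = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)
    return -sum(c / total * log2(c / total) for c in coverage.values())


# Two genotypes explained with the same two haplotypes: low entropy.
shared = [((0, 0), (1, 1)), ((0, 0), (1, 1))]
# Two genotypes explained with four distinct haplotypes: higher entropy.
distinct = [((0, 0), (1, 1)), ((0, 1), (1, 0))]
print(phasing_entropy(shared))    # 1.0
print(phasing_entropy(distinct))  # 2.0
```

Under this reading, phasings that reuse a small set of haplotypes score lower, which is the sense in which minimum entropy favors parsimonious explanations.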

Connection with Likelihood Maximization

Iterative Improvement Algorithm [Gusev et al. 07]. Initialization: start with a random phasing. Iterative improvement step: while there exists a genotype whose re-phasing decreases the entropy, find the genotype that yields the highest decrease in entropy and re-phase it.
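
A minimal sketch of this greedy loop is given below, under stated assumptions: the entropy is taken to be the Shannon entropy of the empirical haplotype frequencies, the whole genotype is phased at once (no windows), and the entropy is recomputed from scratch for each candidate. It is not the ENT implementation; all function names are illustrative.

```python
import random
from collections import Counter
from itertools import product
from math import log2


def explaining_pairs(genotype):
    """Unordered haplotype pairs explaining a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        return [(tuple(genotype), tuple(genotype))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def entropy(phasing):
    """Assumed definition: Shannon entropy of haplotype frequencies."""
    cov = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)
    return -sum(c / total * log2(c / total) for c in cov.values())


def min_entropy_phasing(genotypes, seed=0):
    rng = random.Random(seed)
    options = [explaining_pairs(g) for g in genotypes]
    # Initialization: start with a random phasing.
    phasing = [rng.choice(opts) for opts in options]
    while True:
        current = entropy(phasing)
        best_gain, best_move = 0.0, None
        # Find the single re-phasing with the largest entropy decrease.
        for i, opts in enumerate(options):
            for pair in opts:
                trial = phasing[:i] + [pair] + phasing[i + 1:]
                gain = current - entropy(trial)
                if gain > best_gain + 1e-12:
                    best_gain, best_move = gain, (i, pair)
        if best_move is None:
            return phasing  # local optimum: no re-phasing helps
        i, pair = best_move
        phasing[i] = pair


genotypes = [[0, 0], [1, 1], [2, 2]]
print(min_entropy_phasing(genotypes)[2])  # ((0, 0), (1, 1))
```

In the example, the heterozygous genotype is resolved with the two haplotypes already present in the homozygous individuals, because that choice yields the lower entropy. The loop stops at a local optimum, so the result can depend on the random start.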

Overlapping Window Approach: Entropy is computed over short windows of size l+f: l “locked” SNPs that were previously phased, and f “free” SNPs that are currently being phased. Only phasings consistent with the l locked SNPs are considered. [Figure: genotypes g1 … gn with the locked and free SNP columns of successive windows.]

Effect of Window Size

Time Complexity: n unrelated genotypes over k SNPs. k/f windows. $n \cdot 2^{f}$ candidate haplotype pairs evaluated per window. O(1) time per pair to compute the entropy gain. Empirically, the number of iterations is linear in n, but is reduced to O(log 3 n) by re-explaining multiple genotypes per iteration (batching). Total runtime O(n log 3 n $\cdot\, 2^{f} \cdot$ k/f).

Empirical Runtime

Extension to General Pedigrees: Parent-child relationships can be exploited to infer haplotype phase for a substantial fraction of the SNPs. Related genotypes are phased under the no-recombination assumption. Algorithm modifications: at each step, re-explain an entire family; cache the inheritance pattern given by the first window to speed up computations for subsequent windows; compute entropy based on founder haplotypes only.

Enumerating No-Recombination Phasings for a Pedigree: Gaussian elimination [Jiang et al.]. The [Gusev et al. 07] implementation is based on simple backtracking.

Empirical Evaluation: International HapMap Project, Phase I & II datasets: 3.7 million SNP loci; trio and unrelated genotypes from 4 different populations; reference haplotypes obtained using PHASE. Accuracy measures: Relative Genotype Error (RGE): percentage of missing genotypes inferred differently from the reference method. Relative Switching Error (RSE): number of switches needed to convert inferred haplotype pairs into the reference haplotype pairs.
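
The switch count behind RSE can be sketched as follows for a single individual. The sketch assumes that the inferred and reference pairs explain the same genotype and leaves out the normalization that makes the measure "relative", since the slide does not spell it out. `switch_count` is an illustrative name.

```python
def switch_count(inferred, reference):
    """Number of switches needed to turn the inferred pair into the reference pair.

    Both arguments are (h1, h2) tuples of 0/1 haplotypes that are assumed
    to explain the same genotype.
    """
    (h1, _h2), (r1, r2) = inferred, reference
    # Only heterozygous sites carry phase information.
    het = [i for i in range(len(r1)) if r1[i] != r2[i]]
    # Orientation at each heterozygous site: does h1 follow r1 there?
    orient = [h1[i] == r1[i] for i in het]
    # A switch is needed wherever the orientation flips between
    # consecutive heterozygous sites.
    return sum(1 for a, b in zip(orient, orient[1:]) if a != b)


reference = ((0, 0, 0, 0), (1, 1, 1, 1))
print(switch_count(((0, 0, 1, 1), (1, 1, 0, 0)), reference))  # 1
print(switch_count(((0, 1, 0, 1), (1, 0, 1, 0)), reference))  # 3
```

A pair that matches the reference up to swapping h1 and h2 needs zero switches, so the measure does not depend on the order in which the two haplotypes are reported.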

Empirical Evaluation (cont.) Compared algorithms: ENT [Gusev et al. 07]; 2SNP [Brinza&Zelikovsky 05]; Pure Parsimony Trio Phasing (PPTP) [Brinza et al. 05]; PHASE [Stephens et al. 01]; HAP [Halperin&Eskin 04]; FastPhase [Scheet&Stephens 06].

Results on HapMap Phase II Trio Populations: ENT needs only a few hours on a regular workstation to phase the entire HapMap Phase II dataset, compared to PHASE, which required months of CPU time on two clusters with a total of 238 nodes.

Results on [Orzack et al. 03] Dataset

Complex Pedigree Phasing Exploiting pedigree info significantly improves accuracy!

Application of Phasing: Missing data recovery

Genotype Error Detection

Genotyping Errors: A real problem despite advances in technology and typing algorithms: 1.1% of 20 million dbSNP genotypes typed multiple times are inconsistent [Zaitlen et al. 2005]. Systematic errors (e.g., assay failure) are typically detected by departure from HWE [Hosking et al. 2004]. In pedigrees, some errors are detected as Mendelian Inconsistencies (MIs). Many errors remain undetected: as much as 70% of errors are Mendelian consistent for mother/father/child trios [Gordon et al. 1999].
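
A Mendelian inconsistency check for a trio can be sketched in a few lines using the 0/1/2 genotype encoding from the earlier slide. Missing genotypes are not handled, and the function names are illustrative.

```python
def transmittable(g):
    """Alleles a parent with genotype g (0/1/2 encoding) can pass on."""
    return {0, 1} if g == 2 else {g}


def mendelian_consistent(mother, father, child):
    """True if the child's genotype at one SNP can arise from the parents."""
    return any(
        (a if a == b else 2) == child
        for a in transmittable(mother)
        for b in transmittable(father)
    )


def mendelian_inconsistencies(mother, father, child):
    """Indices of SNPs at which the trio is Mendelian inconsistent."""
    return [
        j
        for j, (m, f, c) in enumerate(zip(mother, father, child))
        if not mendelian_consistent(m, f, c)
    ]


mother = [0, 0, 2, 1]
father = [0, 1, 2, 1]
child = [1, 2, 0, 2]
print(mendelian_inconsistencies(mother, father, child))  # [0, 3]
```

The third SNP shows why many errors stay hidden: when both parents are heterozygous, every child genotype is Mendelian consistent, so an error there cannot be caught by this check.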

Effects of Undetected Genotyping Errors: Even low error levels can have large effects for some study designs (e.g., rare alleles, haplotype-based). Errors as low as 0.1% can increase Type I error rates in the haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp&Becker 04]. 1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01].

Related Work: Improved genotype calling algorithms [Di et al. 05, Rabbee&Speed 06, Nicolae et al. 06]. Explicit modeling in analysis methods [Sieberts et al. 01, Sobel et al. 02, Abecasis et al. 02, Cheng 06]: computationally complex. Separate error detection step [Douglas et al. 00, Abecasis et al. 02, Becker et al. 06]: detected errors can be retyped, imputed, or ignored in downstream analyses.

Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]: For a mother/father/child trio T, compute L(T), the likelihood of the best phasing for the original trio. Then replace the genotype under test with an unknown (?) to obtain the modified trio T’, and compute L(T’), the likelihood of the best phasing for the modified trio. A large change in likelihood suggests a likely error: flag the genotype as an error if L(T’)/L(T) > R, where R is the detection threshold (e.g., $R = 10^{4}$). [Figure: trio with haplotypes h and h’ for mother, father, and child.]

Implementation in FAMHAP [Becker et al. 06]: Window-based algorithm. For each window including the SNP under test, generate the list of the H most frequent haplotypes (default H=50). Find the most likely trio phasings by pruned search over the $H^{4}$ quadruples of frequent haplotypes. Flag the genotype as an error if L(T’)/L(T) > R for at least one window.
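
The "at least one window" rule can be sketched as below. This is not FAMHAP code: it only shows the flagging decision, assuming the per-window likelihoods L(T) and L(T’) have already been computed and are supplied as natural-log values (working in log space is a choice made here to avoid underflow). The name `flagged_in_any_window` and the default R are illustrative.

```python
from math import log


def flagged_in_any_window(window_logliks, R=1e4):
    """Flag a genotype if L(T')/L(T) > R in at least one window.

    window_logliks: list of (log L(T), log L(T')) tuples, one per window
    containing the SNP under test (hypothetical precomputed inputs).
    """
    threshold = log(R)
    # L(T')/L(T) > R is equivalent to log L(T') - log L(T) > log R.
    return any(mod - orig > threshold for orig, mod in window_logliks)


print(flagged_in_any_window([(-50.0, -49.0), (-60.0, -45.0)]))  # True
print(flagged_in_any_window([(-50.0, -49.0)]))                  # False
```

Because a single window is enough to raise the flag, an error at a neighbouring SNP that falls in the same window can also push the ratio over the threshold.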

Limitations of FAMHAP Implementation: Truncating the list of haplotypes to size H may lead to suboptimal phasings and inaccurate L(T) values. False positives are caused by nearby errors (due to the use of multiple short windows). [Kennedy et al.]: HMM model of haplotype diversity → all haplotypes are represented and there is no need for short windows. Alternate likelihood functions → scalable runtime.

HMM Model: Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel&Shamir 05, Scheet&Stephens 06]. Block-free model; paths with high transition probability correspond to “founder” haplotypes. (Figure from Rastas et al. 07)

HMM Training: Previous works use EM training of the HMM based on unrelated genotype data. Two-step algorithm exploiting pedigree information [Kennedy et al. 07]: Step 1: infer haplotypes using a pedigree-aware algorithm based on entropy minimization. Step 2: train the HMM based on the inferred haplotypes, using Baum-Welch.

Complexity of Computing Maximum Phasing Probability: How hard is it to compute the likelihood function of Becker et al.? Theorem [Kennedy et al. 07]: L(T) cannot be approximated within $O(n^{1/4-\varepsilon})$, unless ZPP=NP, where n is the number of SNP loci. For unrelated genotypes, computing the maximum phasing probability is hard to approximate within a factor of $O(n^{1/2-\varepsilon})$. Open: complexity for a fixed number of founder haplotypes.

Complexity of Computing Maximum Phasing Probability Reductions from the clique problem

Alternate Likelihood Functions Viterbi probability (ViterbiProb): the maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio Probability of Viterbi Haplotypes (ViterbiHaps): product of total probabilities of the 4 Viterbi haplotypes Total Trio Probability (TotalProb): total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths
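
The difference between a Viterbi-style likelihood (best single path) and a total probability (sum over all paths) can be seen on one haplotype. The sketch below is a deliberately simplified single-path illustration, not the 4-path trio computation from the slides; the toy HMM with K=2 founder states, its parameters, and all names are hypothetical.

```python
def forward_probability(hap, init, trans, emit):
    """Total probability that the HMM emits `hap`, summed over all paths."""
    K = len(init)
    f = [init[k] * emit[0][k][hap[0]] for k in range(K)]
    for j in range(1, len(hap)):
        f = [
            sum(f[k] * trans[j - 1][k][l] for k in range(K)) * emit[j][l][hap[j]]
            for l in range(K)
        ]
    return sum(f)


def viterbi_probability(hap, init, trans, emit):
    """Probability of the single most likely path emitting `hap`."""
    K = len(init)
    v = [init[k] * emit[0][k][hap[0]] for k in range(K)]
    for j in range(1, len(hap)):
        v = [
            max(v[k] * trans[j - 1][k][l] for k in range(K)) * emit[j][l][hap[j]]
            for l in range(K)
        ]
    return max(v)


# Toy model: 2 loci, 2 founder states per locus (all values hypothetical).
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.1, 0.9]]]        # trans[j][k][l]: locus j -> j+1
emit = [[[0.99, 0.01], [0.01, 0.99]]] * 2  # emit[j][k][allele]
hap = [0, 0]

total = forward_probability(hap, init, trans, emit)
best = viterbi_probability(hap, init, trans, emit)
print(best <= total)  # True: the best path is one term of the total sum
```

The two functions differ only in replacing `sum` with `max` in the recurrence, which is also why both run in the same asymptotic time on this toy model.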

Efficient Computation of Viterbi Probability for Trios: For a fixed trio, Viterbi paths can be found using a 4-path version of Viterbi’s algorithm in time [bound not preserved in the transcript]. A $K^{3}$ speed-up is obtained by factoring common terms: [recurrence not preserved in the transcript], where one term is the maximum probability of emitting SNP genotypes at locus j+1 from the given states, and the other is the transition probability.

Overall Runtimes. Viterbi probability: likelihoods of all 3N modified trios can be computed within time [bound not preserved] using the forward-backward algorithm; overall runtime for M trios: [not preserved]. Probability of Viterbi haplotypes: obtain haplotypes from standard traceback, then compute haplotype probabilities using forward algorithms; overall runtime: [not preserved]. Total trio probability: similar pre-computation speed-up and forward-backward algorithm; overall runtime: [not preserved].

Empirical Evaluation. Real dataset [Becker et al. 2006]: 35 SNP loci on chromosome 16 covering a region of 91kb; 551 trios. Synthetic datasets: 35 SNPs, trios, same missing data pattern as the real dataset; haplotypes assigned to trios based on frequencies inferred from the real dataset; 1% error rate; four error insertion models: random allele, random genotype, heterozygous-to-homozygous, homozygous-to-heterozygous.

Empirical Evaluation (contd.) Two strategies for handling MIs: set the child only to unknown (preserving the parents’ original data), or set all three individuals to unknown prior to error detection → Similar accuracy for both methods. Two testing strategies: test one SNP genotype at a time, or simultaneously test all 3 SNP genotypes at a locus → Simultaneous testing has very poor sensitivity and is not recommended.

Comparison of Alternative Likelihood Functions (1% Random Allele Errors)

Parents vs. Children (1% Random Allele Errors): False positives (FPs) are caused by same-locus errors in parents.

“Combined” Detection Method: Compute 4 likelihood ratios: trio, mother-child duo, father-child duo, and child (unrelated). Flag as error if all ratios are above the detection threshold.
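
The combined rule reduces to a conjunction over the four ratios. The sketch below assumes the four ratios have already been computed and are passed as natural-log values under hypothetical dictionary keys; the function name and the default threshold are illustrative.

```python
from math import log

# Hypothetical keys for the four likelihood ratios named on the slide.
REQUIRED = ("trio", "mother_child", "father_child", "child")


def combined_flag(log_ratios, R=1e4):
    """Flag a genotype only if every one of the four ratios exceeds R.

    log_ratios maps each key in REQUIRED to log(L(T')/L(T)) for that
    family configuration (hypothetical precomputed inputs).
    """
    threshold = log(R)
    return all(log_ratios[key] > threshold for key in REQUIRED)


strong = {"trio": 15.0, "mother_child": 12.0, "father_child": 11.0, "child": 10.0}
mixed = {"trio": 15.0, "mother_child": 12.0, "father_child": 2.0, "child": 10.0}
print(combined_flag(strong))  # True
print(combined_flag(mixed))   # False
```

Requiring agreement across all four configurations trades some sensitivity for fewer false positives, since a signal that appears only when a particular parent is included no longer triggers a flag on its own.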

Comparison with FAMHAP (Children)

Comparison with FAMHAP (Parents)

Sample Size Effect

Acknowledgements Sasha Gusev, Justin Kennedy, Bogdan Pasaniuc NSF funding (Awards and ) Software available at