Combinatorial Algorithms for Maximum Likelihood Tag SNP Selection and Haplotype Inference Ion Mandoiu University of Connecticut CS&E Department.



Outline
- Biological background
- Maximum likelihood tag SNP selection
- Maximum likelihood population haplotyping
- Ongoing and future work

Genomic Variation and SNPs
- Human genome: ≈ 3 × 10^9 base pairs
- Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
  - Single base changes in the genome sequence that occur in a significant proportion (more than 1 percent) of the population
  - Most SNPs are bi-allelic
- Total number of SNPs: ≈ 1 × 10^7
- Difference between any two individuals: ≈ 3 × 10^6 SNPs (≈ 0.1% of the entire genome)

Haplotypes and Genotypes
- Diploid organisms: cells have two homologous sets of chromosomes
- Haplotype: description of SNP alleles on a chromosome
  - 0/1 vector (0 is for the major allele, 1 is for the minor allele)
- Genotype: combined description of SNP alleles on pairs of homologous chromosomes
  - 0/1/2 vector (0 = 0+0, 1 = 1+1, 2 = 0+1 or 1+0)
  - Each genotype with k 2's (k ≥ 1) can be explained by 2^(k-1) pairs of haplotypes
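The 2^(k-1) count can be checked by direct enumeration. The following is a minimal sketch using the 0/1/2 coding from the slide; the function name `explaining_pairs` is hypothetical and not part of the talk.

```python
from itertools import product

def explaining_pairs(genotype):
    """Return all unordered haplotype pairs that explain a 0/1/2 genotype.

    Coding follows the slide: 0 = 0+0, 1 = 1+1, 2 = heterozygous (0+1 or 1+0).
    """
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    # Try every 0/1 assignment of the heterozygous positions to the first
    # haplotype; the second haplotype gets the complementary alleles.
    for bits in product((0, 1), repeat=len(het)):
        h1 = list(genotype)
        h2 = list(genotype)
        for pos, b in zip(het, bits):
            h1[pos] = b
            h2[pos] = 1 - b
        # frozenset makes the pair unordered, so (h1, h2) and (h2, h1) merge.
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs

# A genotype with k = 3 heterozygous positions has 2^(3-1) = 4 explanations.
example = explaining_pairs((2, 0, 2, 1, 2))
```

Each of the 2^k assignments is produced twice (once per ordering of the pair), which is where the factor of one half comes from.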

Computational Challenges
- Limitations of current technologies:
  - High cost per (user-selected) SNP → tag SNP selection problem
  - Find genotypes, not haplotypes → haplotype inference problem
- Effective solutions require combining accurate probabilistic models with scalable combinatorial optimization techniques!


Two-Stage Sampling Methodology
- Pilot study
  - All SNPs of interest are genotyped in a small sample of the population
  - Common haplotypes are inferred using statistical methods
  - A set of tag SNPs is selected
- Population study
  - Tag SNPs are genotyped in the remaining population
  - Statistical methods are used to infer haplotypes over the tag SNPs
  - Haplotypes over the tag SNPs are extrapolated to full haplotypes

[Figure: Flow 1, Haplotype-Extrapolation. Pilot study labels: population sample, genotypes (all SNPs), phasing, sample haplotypes (with frequencies), tag SNP selection, tag SNP set. Population study labels: remaining population, genotypes (tag SNPs), phasing, haplotype pairs (tag SNPs), extrapolation, haplotype pairs (all SNPs).]

[Figure: Flow 2, Genotype-Extrapolation. Pilot study labels: population sample, genotypes (all SNPs), phasing, sample haplotypes (with frequencies), tag SNP selection, tag SNP set. Population study labels: remaining population, genotypes (tag SNPs), extrapolation, phasing, haplotype pairs (all SNPs).]

Previous Works on Tag SNP Selection
- Statistical correlation based methods
  - Poor control over the number of tag SNPs
- [Bafna et al. 03] Informative SNP Set Problem
  - Find set of k SNPs with maximum “informativeness”
- [Sebastiani et al. 03] Best Enumeration of SNP Tags (BEST)
  - Finds minimum number of SNPs that distinguishes all given haplotypes
  - No control over the number of tag SNPs!

Fully Informative Tag SNP Set Selection by Integer Programming
- Given: haplotypes h_1, h_2, …, h_m over n SNPs
- Find: minimum number of tag SNPs
- Such that: every two distinct haplotypes differ in at least one tag SNP
- Integer program formulation: 0/1 variable x_j for every SNP
  - x_j = 1 if SNP j is selected as a tag SNP
  - x_j = 0 otherwise
- Can be solved efficiently using general-purpose solvers such as CPLEX
  - In practice significantly faster than BEST
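The selection problem stated above can be illustrated on a toy instance. The sketch below is not the integer program from the slide and does not call CPLEX; it swaps in plain exhaustive search over SNP subsets, which is only practical for a handful of SNPs. The function name and the example haplotypes are assumptions made for illustration.

```python
from itertools import combinations

def min_fully_informative_tags(haplotypes):
    """Smallest set of SNP indices on which all distinct haplotypes differ.

    Exhaustive search by increasing subset size (exponential in the number
    of SNPs), used here only to illustrate the problem statement.
    """
    n = len(haplotypes[0])
    distinct = sorted(set(haplotypes))
    for size in range(n + 1):
        for tags in combinations(range(n), size):
            # Restrict every haplotype to the candidate tag SNPs.
            projected = {tuple(h[j] for j in tags) for h in distinct}
            # Fully informative: no two distinct haplotypes collapse.
            if len(projected) == len(distinct):
                return tags
    return None  # unreachable: the full SNP set always separates distinct haplotypes

# Hypothetical toy data: 4 haplotypes over 4 SNPs.
haps = [(0, 0, 0, 0), (0, 1, 0, 1), (1, 1, 0, 0), (1, 0, 0, 1)]
tags = min_fully_informative_tags(haps)
```

In integer programming terms, the check inside the loop corresponds to requiring, for every pair of distinct haplotypes, that the x_j summed over the SNPs where the two haplotypes differ is at least 1.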

Extrapolation Approaches
- [Halperin et al. 05]
  - Each SNP genotype predicted individually
  - Only immediate neighbor tag SNPs used in prediction
- [He&Zelikovsky 06]
  - Each SNP genotype predicted individually
  - All tag SNPs used in prediction
- Maximum likelihood
  - Pick the most likely full genotype compatible with the short genotype over tag SNPs
  - Full genotype predicted in a single step
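The maximum likelihood rule in the last item can be sketched as follows. This is an illustration only: it assumes genotype probabilities are derived from haplotype frequencies under random mating (Hardy-Weinberg), which the slide does not state, and all function names and frequencies are hypothetical.

```python
from collections import defaultdict
from itertools import combinations_with_replacement

def genotype_of(h1, h2):
    # Slide coding: 0 = 0+0, 1 = 1+1, 2 = heterozygous.
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

def genotype_distribution(hap_freq):
    """Genotype probabilities from haplotype frequencies.

    Assumption: random mating, so an unordered pair of different
    haplotypes has probability 2*p1*p2 and a homozygous pair p1*p1.
    """
    dist = defaultdict(float)
    items = list(hap_freq.items())
    for (h1, p1), (h2, p2) in combinations_with_replacement(items, 2):
        weight = p1 * p2 if h1 == h2 else 2 * p1 * p2
        dist[genotype_of(h1, h2)] += weight
    return dict(dist)

def ml_extrapolate(tag_genotype, tags, geno_dist):
    """Most likely full genotype that agrees with the observed tag genotype."""
    compatible = {g: p for g, p in geno_dist.items()
                  if tuple(g[j] for j in tags) == tuple(tag_genotype)}
    if not compatible:
        return None  # observed tag genotype is impossible under the model
    return max(compatible, key=compatible.get)

# Hypothetical haplotype frequencies over 3 SNPs; SNP 0 is the only tag.
hap_freq = {(0, 0, 0): 0.5, (1, 1, 0): 0.3, (1, 0, 1): 0.2}
dist = genotype_distribution(hap_freq)
prediction = ml_extrapolate((2,), [0], dist)
```

The whole genotype is predicted in one step, in contrast to the per-SNP predictors listed above.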

Tag Selection for Maximum Likelihood Genotype Extrapolation
- Idea: select K tag SNPs maximizing the correct prediction probability
[Figure: haplotypes h_1, h_2, h_3, …, h_n with two columns marked Tag SNP 1 and Tag SNP 2]
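One way to make "correct prediction probability" concrete, offered as an assumption since the slide's own formula is not preserved: if the predictor always returns the most likely full genotype for each tag pattern, it is correct exactly when the true genotype is that most likely one, so the objective is the sum, over tag patterns, of the largest genotype probability sharing the pattern. The sketch below evaluates this by exhaustive search over K-subsets; names and numbers are hypothetical.

```python
from itertools import combinations

def correct_prediction_prob(dist, tags):
    """Probability that the ML prediction from the tags equals the true vector.

    dist maps full 0/1/2 genotype tuples to probabilities.
    """
    best = {}
    for full, p in dist.items():
        key = tuple(full[j] for j in tags)
        # Keep only the most likely full genotype for each tag pattern.
        best[key] = max(best.get(key, 0.0), p)
    return sum(best.values())

def best_tags(dist, k):
    """K-subset of SNP indices maximizing the objective (exhaustive search)."""
    n = len(next(iter(dist)))
    return max(combinations(range(n), k),
               key=lambda t: correct_prediction_prob(dist, t))

# Hypothetical genotype distribution over 3 SNPs.
geno_dist = {
    (0, 0, 0): 0.25, (2, 2, 0): 0.30, (2, 0, 2): 0.20,
    (1, 1, 0): 0.09, (1, 2, 2): 0.12, (1, 0, 1): 0.04,
}
chosen = best_tags(geno_dist, 1)
```

With these numbers, tagging SNP 0 alone gives 0.25 + 0.30 + 0.12 = 0.67, against 0.64 for SNP 1 and 0.54 for SNP 2.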

Experimental Setup
- Synthetic datasets generated following [Forton et al. 05]
  - 2 populations (European and West African) and 2 genomic regions (IL8 and 5q31)
  - For each of the 4 population/region combinations, we used haplotypes and frequencies inferred in [Forton et al. 05] from the real data to generate 5 datasets containing between 200 and 1000 individuals
  - Fixed block size of 10 SNPs
  - For each dataset we picked 5 random samples of size 50
- Maximum likelihood (ML) flows 1 and 2 were compared to the Multivariate Linear Regression (MLR) algorithm of [He&Zelikovsky 06]
  - Genotype frequencies were estimated either from the haplotype frequencies used to generate the datasets (pop) or from haplotype frequencies inferred from the sample using PHASE (phase)

[Figure: Haplotype Accuracy]

[Figure: Genotype Accuracy]


Population Haplotyping Problem
- Given the set G of genotypes observed in a population of individuals, infer a set H of haplotypes explaining G
- Numerous approaches: entropy minimization, perfect phylogeny, Bayesian networks, pure parsimony, …
- Maximum likelihood approach:
  1. Estimate for each haplotype h its probability p_h in the population under study
  2. Find set H that explains G and has maximum likelihood

Graph Theoretical Reformulation
- Haplotypes → graph vertices
  - Weight of vertex h = -log(p_h)
- Genotypes → edge colors
  - Edge (h, h') with color g iff g can be explained by haplotypes h and h'
- Minimum Weight Multi-Colored Subgraph Problem (MWMCSP): find a min-weight set of vertices that induces at least one edge of each given color
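A tiny instance makes the reformulation concrete. The sketch below builds the colored graph and solves MWMCSP by brute force over vertex subsets; this is not one of the algorithms discussed in the talk, only an illustration that is feasible for a few haplotypes. Function names and probabilities are hypothetical.

```python
from itertools import combinations
from math import log

def genotype_of(h1, h2):
    # Slide coding: 0 = 0+0, 1 = 1+1, 2 = heterozygous.
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

def min_weight_explaining_set(genotypes, hap_prob):
    """Brute-force MWMCSP: min-weight haplotype set explaining every genotype."""
    weight = {h: -log(p) for h, p in hap_prob.items()}
    haps = list(hap_prob)
    # For each color (genotype), collect the edges (haplotype pairs) that
    # carry it; h2 may equal h1, which covers homozygous genotypes.
    edges = {g: [] for g in genotypes}
    for i, h1 in enumerate(haps):
        for h2 in haps[i:]:
            g = genotype_of(h1, h2)
            if g in edges:
                edges[g].append((h1, h2))
    best, best_w = None, float("inf")
    for size in range(1, len(haps) + 1):
        for subset in combinations(haps, size):
            chosen = set(subset)
            # Feasible if every color has an edge with both endpoints chosen.
            feasible = all(
                any(a in chosen and b in chosen for a, b in edges[g])
                for g in genotypes
            )
            if feasible:
                w = sum(weight[h] for h in subset)
                if w < best_w:
                    best, best_w = chosen, w
    return best, best_w

# Hypothetical haplotype probabilities and observed genotypes over 2 SNPs.
hap_prob = {(0, 0): 0.4, (1, 1): 0.3, (0, 1): 0.2, (1, 0): 0.1}
H, w = min_weight_explaining_set([(2, 2), (0, 0)], hap_prob)
```

Since the weights are -log(p_h), minimizing total weight is the same as maximizing the product of the probabilities of the selected haplotypes.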

Approximation Algorithms
- [Lancia et al. 02]
  - Algorithms with approximation factors of […] (for unweighted version) and q, where n is the number of genotypes and q is the maximum number of haplotype pairs compatible with a genotype
- [Huang et al. 05]
  - O(log n) approximation using semidefinite programming, but the big-O constant hides a factor of q
- [Hassin&Segev 05]
  - Greedy algorithm with approximation factor of […]
- [Hajiaghayi et al. 06]
  - LP-rounding algorithm with approximation factor of […]

Integer Program Formulation
- Extends formulation of [Gusfield 03]
- 0/1 variable x_u for every vertex u
  - x_u is set to 1 if u is selected, 0 otherwise
- 0/1 variable y_e for every edge e
  - y_e is set to 1 if e is induced by selected vertices, 0 otherwise
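The objective and constraints of the program are not preserved in the transcript. One plausible way to write an integer program for MWMCSP with these variables, given here as an assumption rather than as the slide's own formulation, uses vertex weights w_u = -log(p_u) and writes color(e) for the genotype explained by edge e:

```latex
\begin{align*}
\min \quad & \sum_{u} w_u \, x_u, \qquad w_u = -\log p_u \\
\text{s.t.} \quad & y_e \le x_u, \quad y_e \le x_v
    && \text{for every edge } e = (u, v) \\
& \sum_{e \,:\, \mathrm{color}(e) = g} y_e \ge 1
    && \text{for every genotype } g \in G \\
& x_u \in \{0, 1\}, \quad y_e \in \{0, 1\}
\end{align*}
```

The first pair of constraints lets y_e equal 1 only when both endpoints of e are selected, and the covering constraint requires at least one induced edge of each color.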


Haplotype Frequency Estimation
- Accurate haplotype frequency estimation becomes key to the overall accuracy of likelihood maximization methods
- Important to capture frequencies of haplotypes that may not appear in the sample: phasing and counting gives poor estimates
- Existing high-quality algorithms, e.g., Haplofreq [Halperin&Hazan 05], do not scale well in running time

HMM-Based Frequency Estimation
- Hidden Markov Models (HMMs) are uniquely suited for modeling haplotype frequencies in a population
- Recently used very successfully in haplotype inference [Rastas et al. 05] and disease association [Kimmel&Shamir 05]
  - Main computational bottleneck: HMM training based on genotype data

HMM-Based Frequency Estimation
- Good compromise in context of two-stage experiments
  - Sample consisting of trios (child, mother, and father)
  - Sample phased using a fast trio-aware phasing method (e.g., entropy phasing [Pasaniuc&M 06])
  - HMM trained on resulting (highly accurate) haplotypes
  - Haplotype frequencies computed efficiently using k-shortest paths algorithm
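To show how a trained HMM assigns a frequency to any haplotype, including ones absent from the sample, here is a minimal sketch. It assumes a left-to-right HMM with K hidden states per SNP and a per-state probability of emitting allele 1, and it uses the standard forward algorithm rather than the k-shortest paths computation mentioned on the slide. The model structure, parameter layout, and numbers are all hypothetical.

```python
from itertools import product

def haplotype_probability(hap, init, trans, emit):
    """Probability of a 0/1 haplotype under a left-to-right HMM (forward algorithm).

    init[k]        : probability of starting in state k at SNP 0
    trans[j][k][l] : probability of moving from state k at SNP j to state l at SNP j+1
    emit[j][k]     : probability that state k at SNP j emits allele 1
    """
    def e(j, k):
        return emit[j][k] if hap[j] == 1 else 1.0 - emit[j][k]

    K = len(init)
    # forward[k] = P(alleles up to the current SNP, current state = k)
    forward = [init[k] * e(0, k) for k in range(K)]
    for j in range(1, len(hap)):
        forward = [
            sum(forward[k] * trans[j - 1][k][l] for k in range(K)) * e(j, l)
            for l in range(K)
        ]
    return sum(forward)

# Hypothetical parameters: 3 SNPs, 2 hidden states per SNP.
init = [0.6, 0.4]
trans = [[[0.9, 0.1], [0.2, 0.8]],
         [[0.7, 0.3], [0.4, 0.6]]]
emit = [[0.1, 0.9], [0.2, 0.7], [0.05, 0.95]]

# The probabilities of all 2^3 haplotypes should sum to 1.
total = sum(haplotype_probability(h, init, trans, emit)
            for h in product((0, 1), repeat=3))
```

Because every haplotype receives a nonzero probability whenever the emission probabilities are strictly between 0 and 1, this kind of model avoids the zero estimates that phasing and counting would give to unseen haplotypes.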

Other Problems
- Identification of genotyping errors by likelihood maximization [Becker et al. 06]
- Pedigree reconstruction and kinship analysis
- Population structure
- Bicriteria tag SNP selection: likelihood maximization and genotyping cost optimization

Acknowledgments
- J. Jun, B. Pasaniuc (UCONN)
- M.T. Hajiaghayi (CMU), K. Jain (Microsoft Research), L.C. Lau (U. Toronto), A. Russell (UCONN), V.V. Vazirani (Georgia Tech)
- Funding from NSF (CAREER Award IIS […]) and UCONN Research Foundation