Yufeng Wu and Dan Gusfield University of California, Davis


Efficient Computation of Minimum Recombination With Genotypes (Not Haplotypes). Yufeng Wu and Dan Gusfield, University of California, Davis. CSB 2006.

Haplotypes/Genotypes. Diploid organisms have two (not necessarily identical) copies of each chromosome. A single copy is a haplotype, a vector over {0, 1}. The mixed description of the two copies is a genotype, a vector over {0, 1, 2}. At each site: if both haplotypes are 0, the genotype is 0; if both haplotypes are 1, the genotype is 1; if one is 0 and the other is 1, the genotype is 2. Key fact: genotypes are easier to collect, but many downstream applications work better with haplotypes.
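The site-by-site rule above can be written as a short Python sketch. The function name `genotype_of` is hypothetical, and the two input haplotypes are the pair shown in the haplotyping example of this talk.

```python
def genotype_of(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotypes into a 0/1/2 genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have equal length")
    # Equal values are kept (0 or 1); differing values become 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

print(genotype_of("011100110", "110100100"))  # prints 212100120
```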

Haplotyping. Figure (sites 1 to 9): phasing the 2s of the genotype 2 1 2 1 0 0 1 2 0 into alternative haplotype pairs, for example 0 1 1 1 0 0 1 1 0 and 1 1 0 1 0 0 1 0 0. Haplotype Inference (HI) Problem: given a set of n genotypes, infer the real n haplotype pairs that form the given genotypes.
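To illustrate why a genotype does not determine its haplotypes, here is a minimal sketch that enumerates every unordered haplotype pair consistent with one genotype. The name `phasings` is an assumption made for this example; a genotype with k sites equal to 2 (k at least 1) has 2^(k-1) such pairs.

```python
from itertools import product

def phasings(genotype: str):
    """Return every unordered haplotype pair consistent with a 0/1/2 genotype."""
    het = [i for i, c in enumerate(genotype) if c == "2"]
    if not het:
        return [(genotype, genotype)]
    pairs = []
    # Fix the first heterozygous site to 0 in the first haplotype so that
    # each unordered pair is produced exactly once.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

for pair in phasings("212100120"):
    print(pair)
```

The genotype from the slide has three 2s, so the loop prints four candidate pairs, one of which is the pair shown in the figure.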

Two-stage Approach. Given a set of genotypes G, we are interested in downstream problems. There are many HI solutions for G. Two-stage approach: first infer the “correct” HI solution from the genotypes, then do the downstream analysis with the inferred haplotypes. Haplotype inference has been extensively studied and is believed to be accurate to a certain extent.

One-stage Approach. What effect does haplotyping inaccuracy have on downstream questions? Our work: use genotype data directly for downstream problems, without fixing a choice of HI solution. Downstream problem considered: the minimum recombination problem.

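Recombination: Single Crossover. Recombination is one of the principal genetic forces shaping variation within species. Two equal-length sequences generate a third sequence of the same length: a prefix of one sequence is joined at the breakpoint to a suffix of the other.
Prefix source: 110001111111001
Suffix source: 000110000001111
Recombinant:   11000|0000001111 (| marks the breakpoint)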
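A single crossover can be sketched in a few lines of Python. The function name `recombine` and the convention that `breakpoint` counts the sites taken from the prefix sequence are assumptions made for this example; the two input sequences are the ones on the slide.

```python
def recombine(prefix_seq: str, suffix_seq: str, breakpoint: int) -> str:
    """Single crossover: the first `breakpoint` sites of prefix_seq
    followed by the remaining sites of suffix_seq."""
    if len(prefix_seq) != len(suffix_seq):
        raise ValueError("sequences must have equal length")
    return prefix_seq[:breakpoint] + suffix_seq[breakpoint:]

print(recombine("110001111111001", "000110000001111", 5))  # prints 110000000001111
```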

Kreitman’s Data (1983).
0000000011000000001101110111100000000000000
0010000000000000001101110111100000000000000
0000000000000000000000000000000000010000101
0000000000000000110000000000000000010011000
0001100010110011110000000000000000001000000
0010000000000001000000000000001010111000010
0010000000000001000000000000011111101000000
1111100010111001000000000000011111101100000
1111111110000101000010001000011111101000000
Question: what is the minimum number of recombinations needed to derive these sequences? Assume at most 1 mutation per site.

Minimizing Recombination. Compute the minimum number of recombinations (Rmin) needed to derive a set of haplotypes, assuming at most 1 mutation per site. The problem is NP-hard in general. Approaches: heuristics, and lower bounds on Rmin.

Lower Bounds on Genotypes. For a particular recombination lower bound method L, what is the range of possible bounds for L over all possible HI solutions? MinL(G): the minimum of L over all HI solutions for G. MaxL(G): the maximum of L over all HI solutions for G. This paper: the HK bound, the connected component (CC) bound and the relaxed haplotype bound. Polynomial-time algorithms for MaxHK and MinCC. A heuristic method for the relaxed haplotype bound.

Lower Bound: Incompatibility. Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes): 00, 01, 10, 11. If sites p, q are incompatible, then a recombination must occur between p and q. Incompatibility Graph (IG): a node for each site, and an edge between each incompatible pair.
Example matrix M (rows a to g, sites 1 to 5):
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1
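The four-gamete test and the resulting incompatibility graph can be sketched as follows. The function names are hypothetical; the input is the example matrix M from the slide, with rows a to g written as strings and sites reported with 1-based numbers.

```python
from itertools import combinations

def incompatible(haps, p, q):
    """Four-gamete test on 0-based columns p and q."""
    return len({h[p] + h[q] for h in haps}) == 4

def incompatibility_graph(haps):
    """Edges of the incompatibility graph as 1-based site pairs."""
    m = len(haps[0])
    return [(p + 1, q + 1) for p, q in combinations(range(m), 2)
            if incompatible(haps, p, q)]

# Rows a..g of the example matrix M.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
print(incompatibility_graph(M))  # prints [(1, 3), (1, 4), (2, 5)]
```

All three edges span overlapping intervals of sites, which is why the HK bound for M is 1.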

HK Bound (1985). Arrange the nodes of the incompatibility graph on a line in the order that the sites appear in the sequence. HK bound = maximum number of non-overlapping edges in the incompatibility graph (IG). Easy to compute for haplotype data. For the example matrix M on sites 1 to 5, the HK lower bound is 1.
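One way to compute the maximum number of non-overlapping edges is the classic greedy interval-scheduling rule: sort edges by right endpoint and keep an edge whenever it starts at or after the end of the last kept edge. This is a sketch under the assumption that two edges sharing only an endpoint, such as (1, 2) and (2, 3), count as non-overlapping, which matches the edge set E(G) = {12, 23, 35} used later in the talk. The name `hk_bound` is hypothetical.

```python
def hk_bound(edges):
    """Maximum number of pairwise non-overlapping edges (p, q), p < q,
    where edges that only share an endpoint do not overlap."""
    count, last_end = 0, float("-inf")
    for p, q in sorted(edges, key=lambda e: e[1]):
        if p >= last_end:          # starts at or after the previous edge ends
            count += 1
            last_end = q
    return count

print(hk_bound([(1, 3), (1, 4), (2, 5)]))  # prints 1
print(hk_bound([(1, 2), (2, 3), (3, 5)]))  # prints 3
```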

IG for HI Solutions. Genotypes G (sites 1 to 5): 01010, 10101, 00202, 22200.
HI1: 01010, 10101, 00000, 00101, 01000, 10100. HK = 1.
HI2: 01010, 10101, 00001, 00100, 00000, 11100. HK = 3.
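For a genotype set this small, the range of the HK bound over all HI solutions can be checked by brute force. The sketch below is not the polynomial-time method of the paper; it simply enumerates every phasing, and all function names are hypothetical.

```python
from itertools import combinations, product

def phasings(g):
    """All unordered haplotype pairs consistent with genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]
    out = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i], h2[i] = b, "10"[int(b)]   # h2 gets the opposite value
        out.append(("".join(h1), "".join(h2)))
    return out

def hk(haps):
    """HK bound: greedy maximum set of non-overlapping incompatible pairs."""
    m = len(haps[0])
    edges = [(p, q) for p, q in combinations(range(m), 2)
             if len({h[p] + h[q] for h in haps}) == 4]
    count, last = 0, -1
    for p, q in sorted(edges, key=lambda e: e[1]):
        if p >= last:
            count, last = count + 1, q
    return count

def hk_range(genotypes):
    """(min, max) of the HK bound over every HI solution."""
    values = [hk([h for pair in choice for h in pair])
              for choice in product(*(phasings(g) for g in genotypes))]
    return min(values), max(values)

print(hk_range(["01010", "10101", "00202", "22200"]))  # prints (1, 3)
```

The extremes match the two HI solutions shown on the slide: HI1 attains the minimum and HI2 the maximum.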

HK Bounds on Genotypes. An efficient algorithm is known for MinHK(G) (Wiuf, 2004). This paper: a polynomial-time algorithm for MaxHK(G).

Maximal Incompatibility Graph MIG(G). There is an edge between sites p and q if there is a phasing of p, q under which p and q are incompatible. Each pair of sites is considered independently. E(G): a maximum-sized set of non-overlapping edges in MIG(G). Example: for G = 01010, 10101, 00202, 22200 on sites 1 to 5, E(G) = {12, 23, 35}.
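The per-pair test for MIG(G) can be sketched as below. The reasoning encoded here is an interpretation of the slide rather than code from the paper: a genotype row that is not 2 at both sites contributes fixed gametes, while a row with 2 at both sites contributes either {00, 11} or {01, 10}, chosen freely. The name `mig_edges` is hypothetical.

```python
from itertools import combinations

ALL = {"00", "01", "10", "11"}

def mig_edges(genotypes):
    """1-based site pairs that some phasing of the pair makes incompatible."""
    m = len(genotypes[0])
    edges = []
    for p, q in combinations(range(m), 2):
        forced, free = set(), 0
        for g in genotypes:
            a, b = g[p], g[q]
            if a == "2" and b == "2":
                free += 1                      # this row may add 00+11 or 01+10
            else:
                forced |= {x + y
                           for x in ("01" if a == "2" else a)
                           for y in ("01" if b == "2" else b)}
        missing = ALL - forced
        if (not missing or free >= 2 or
                (free == 1 and (missing <= {"00", "11"} or missing <= {"01", "10"}))):
            edges.append((p + 1, q + 1))
    return edges

G = ["01010", "10101", "00202", "22200"]
print(mig_edges(G))  # prints [(1, 2), (1, 3), (1, 5), (2, 3), (3, 5)]
```

A maximum set of non-overlapping edges among these is {12, 23, 35}, in agreement with E(G) on the slide.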

MaxHK(G). Claim: MaxHK(G) = |E(G)|. First, MaxHK(G) ≤ |E(G)|, because MIG(G) is a supergraph of IG(H) for any HI solution H. Second, if we can find an HI solution H in which every pair of sites in E(G) is incompatible, then HK(H) ≥ |E(G)|. Together, MaxHK(G) = |E(G)|.

Finding such an H. Phase sites from left to right. Each component of E(G) is a simple path, so each site is constrained by at most one site to its left.

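Phasing G for Incompatibility. No matter how a previous site p is phased, we can always phase the current site q to make p, q incompatible. Example for G = 01010, 10101, 00202, 22200, phasing one site at a time (? marks a 2 that is not yet phased):
After site 1: 01010, 10101, 00?0?, 0??00, 1??00
After site 2: 01010, 10101, 00?0?, 00?00, 11?00
After site 3: 01010, 10101, 0010?, 0000?, 00000, 11100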
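The left-to-right phasing idea can be sketched as follows. This is an illustrative reading of the slides, not the authors' implementation: the function name, the tie-breaking rule and the default orientation for unconstrained sites are all assumptions, and `edges` is assumed to be a set of non-overlapping MIG edges such as E(G).

```python
FLIP = {"0": "1", "1": "0"}

def phase_left_to_right(genotypes, edges):
    """Return one haplotype pair per genotype so that every (p, q) in
    `edges` (1-based, p < q) becomes an incompatible pair of sites."""
    left = {q: p for p, q in edges}          # at most one left partner per site
    pairs = [["", ""] for _ in genotypes]    # haplotypes grown site by site
    for q in range(1, len(genotypes[0]) + 1):
        p = left.get(q)
        forced = set()                       # gametes at (p, q) fixed by the data
        if p is not None:
            for g, (h0, h1) in zip(genotypes, pairs):
                if g[p - 1] == "2" and g[q - 1] == "2":
                    continue
                for c in ("01" if g[q - 1] == "2" else g[q - 1]):
                    forced |= {h0[p - 1] + c, h1[p - 1] + c}
        for g, pair in zip(genotypes, pairs):
            c = g[q - 1]
            if c != "2":
                new = (c, c)
            elif p is not None and g[p - 1] == "2":
                same, cross = {"00", "11"} - forced, {"01", "10"} - forced
                if len(cross) > len(same):   # crossing supplies more missing gametes
                    new = (FLIP[pair[0][p - 1]], FLIP[pair[1][p - 1]])
                    forced |= {"01", "10"}
                else:
                    new = (pair[0][p - 1], pair[1][p - 1])
                    forced |= {"00", "11"}
            else:
                new = ("0", "1")             # unconstrained: arbitrary orientation
            pair[0] += new[0]
            pair[1] += new[1]
    return [tuple(pair) for pair in pairs]

G = ["01010", "10101", "00202", "22200"]
H = phase_left_to_right(G, [(1, 2), (2, 3), (3, 5)])
print(H)
```

On this input the sketch reproduces the solution HI2 from the earlier slide, in which the pairs 12, 23 and 35 are all incompatible.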

Haplotyping With Minimum Number of Recombinations. Compute Rmin(G): haplotyping on a network with the fewest recombinations. The problem is NP-hard. This paper: a branch and bound method that computes the exact Rmin(G) for data with a small number of sites. APOE data: 47 non-trivial genotypes, 9 sites. Our method: 2 minutes, Rmin(G) = 5.

Application: Recombination Hotspot. Recombination hotspots are regions where the recombination rate is much higher than in neighboring regions. Previous study (Bafna and Bansal, 2005): a recombination lower bound computed on inferred haplotypes was used to identify recombination hotspots. Our work: compute the exact Rmin(G) from genotypes over a sliding window of a small number of SNPs to detect recombination hotspots.

MS32 data (Jeffreys et al., 2001). Figure: result from haplotypes (Bafna and Bansal, 2005) and result from the original genotypes (this paper).

Other Applications. Finding the true Rmin from genotypes G. Two-stage approach: run PHASE to get an HI solution H, then compute Rmin(H). One-stage approach: directly compute Rmin(G). Accuracy of haplotype inference on a minimum network. Simulation results: comparable, slightly weaker, and not conclusive.

Summary. Main goal of this paper: develop computational tools for the minimum recombination problem with genotypes. Polynomial-time algorithms for the MaxHK and MinCC problems. Practical heuristics for other problems. Simulation results for several application questions are not conclusive. Our tools facilitate the study of these problems.

Thank You Software: available upon request