Algorithmic Problems Related to Sequences and Phylogenetic Trees

Bhaskar DasGupta
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607-7053
Email: dasgupta@cs.uic.edu
11/22/2018

Outline
- Introduction
- Substructure Comparison Problems
  - Sequences: nonoverlapping local alignment
  - Proteins
- Transformation Based Distances
  - Phylogenetic Trees: why compare? a few distances
  - Genomes: syntenic distance
- Conclusions

Computational Molecular Biology: A Computer Scientist's Participation
- Get to know the computational problems
  - talk to biologists
  - state the computational problems as precisely as possible
- Investigate computational aspects of the problems
  - exact solutions: difficult/easy? time/space efficient solutions?
  - approximate solutions (if an exact solution is hard or not time/space efficient): guaranteed quality of approximation? (tradeoff with space/time?)
  - deterministic vs. randomized algorithms
- Implementation aspects
  - programming cleverness to reduce space/time
  - algorithmic engineering techniques to reduce space/time
- Interaction with the biologists
  - are the solutions biologically meaningful?

A Few Computer Science Terms
When we say ... what we really mean:
- Maximization/minimization problem: a problem in which we maximize/minimize some objective function
- Problem is NP-complete/hard: an exact solution for large problem instances will most likely require too much time
- Polynomial-time solution: solvable in reasonable time on a reasonably fast computer
- Approximation algorithm with approximation ratio r: an approximate solution, computed in reasonable time, whose objective function value is at least 1/r times (for maximization) or at most r times (for minimization) the optimum

Substructure Similarity (or, equivalently, Dissimilarity)
- a matches to a' with similarity 10
- b matches to b' with similarity 15
- c matches to c' with similarity 11
- total similarity 36
Goal: match disjoint substructures to maximize total similarity

A Few Complications
- Many short vs. fewer long substructures
- Measure of similarity between substructures. Examples:
  - rmsd (root-mean-square distance) between 3D substructures
  - edit distance between subsequences
  - syntenic distance between multi-chromosome genomes

Sequences: Non-overlapping local alignment
(Figure: two matched fragment pairs with total similarity 10 + 15 = 25.)

The problem
Input:
- pairs of fragments, one from each sequence (or, equivalently, a set of rectangles)
- the weight of each pair (rectangle) is its similarity measure
Output: a set of pairs (rectangles) such that
- no two rectangles overlap on the x-axis (i.e., matched fragments of the first sequence are disjoint)
- no two rectangles overlap on the y-axis (i.e., matched fragments of the second sequence are disjoint)
- the total similarity of the selected fragment pairs is maximized
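The problem statement above can be made concrete with a small exhaustive solver. This is only a sketch for tiny inputs (it runs in exponential time, as expected for an NP-complete problem), not any of the algorithms cited in the talk. The `Rect` type, the half-open interval convention and the coordinates of the demo instance are assumptions made for illustration; only the weights 15, 2, 1, 10 and the optimum 25 echo the slides.

```python
from itertools import combinations
from typing import NamedTuple


class Rect(NamedTuple):
    """A fragment pair: [x1, x2) on sequence 1 and [y1, y2) on sequence 2."""
    x1: int
    x2: int
    y1: int
    y2: int
    weight: float


def intervals_overlap(a1, a2, b1, b2):
    """True if the half-open intervals [a1, a2) and [b1, b2) share a position."""
    return a1 < b2 and b1 < a2


def conflict(r, s):
    """Two rectangles conflict if their x- or their y-projections overlap."""
    return (intervals_overlap(r.x1, r.x2, s.x1, s.x2)
            or intervals_overlap(r.y1, r.y2, s.y1, s.y2))


def best_selection(rects):
    """Try every subset and keep the heaviest conflict-free one."""
    best_weight, best_subset = 0, ()
    for k in range(1, len(rects) + 1):
        for subset in combinations(rects, k):
            if any(conflict(r, s) for r, s in combinations(subset, 2)):
                continue
            total = sum(r.weight for r in subset)
            if total > best_weight:
                best_weight, best_subset = total, subset
    return best_weight, best_subset


# Hypothetical instance: coordinates are made up for this example.
RECTS = [
    Rect(0, 3, 0, 3, 15),
    Rect(2, 5, 4, 6, 2),
    Rect(4, 6, 2, 5, 1),
    Rect(5, 8, 5, 8, 10),
]

if __name__ == "__main__":
    print(best_selection(RECTS))
```

On this instance only the first and last rectangles are mutually compatible, so the search returns total similarity 25.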

Further assumption
We can preprocess the input data (rectangles or fragment pairs) to ensure that, for any two rectangles,
- the projection of one on the y-axis does not enclose that of the other
- the projection of one on the x-axis does not enclose that of the other
(Figure: a pair of rectangles with an enclosing projection, labelled "not allowed in the input data".)

An illustration
(Figure: input rectangles with weights 15, 2, 1 and 10 drawn over two nucleotide sequences; an optimal solution has total similarity 25.)

Previous results (n = number of rectangles (fragment pairs))
- Bafna, Narayanan and Ravi (WADS'95)
  - NP-complete
  - O(n^2) time approximation algorithm with approximation ratio 3.25
  - converts to the problem of finding a maximum-weight independent set in a 5-clawfree graph
  - gives an approximation algorithm for (d+1)-clawfree graphs with approximation ratio d - 1 + 1/d
- Halldórsson (SODA'95)
  - approximation algorithm with approximation ratio of about 2.5 when all weights are one
  - again uses clawfree graphs
- Berman (SWAT'00)
  - O(n^4) time algorithm with approximation ratio of about 2.5
  - via clawfree graphs again

Our recent results (Berman, DasGupta and Muthukrishnan, SODA'02)
- O(n log n) time approximation algorithm with approximation ratio 3
  - very simple to implement
  - uses a 2-phase approach (or, equivalently, the local-ratio technique)
- Extensions to d dimensions (d > 2)
  - inputs are similarity measures of d fragments, one from each of the d given sequences
  - motivation: multiple sequence comparison problems
  - generalization of our approach above: O(nd log n) time approximation algorithm with approximation ratio 2d-1
  - current best (Bar-Yehuda, Halldórsson, Naor, Shachnai and Shapira, SODA'02): polynomial-time algorithm with approximation ratio 2d; uses repeated linear programming and a continuous version of the local-ratio technique
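To give a feel for what a 2-phase approach can look like, here is a generic stack-based sketch in the local-ratio style. It is not claimed to be the authors' algorithm: the scan order (by right x-endpoint), the quadratic conflict scan and the `Rect` type are assumptions made for illustration, and no running-time or approximation guarantee is claimed for this version.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """A fragment pair: [x1, x2) on sequence 1 and [y1, y2) on sequence 2."""
    x1: int
    x2: int
    y1: int
    y2: int
    weight: float


def conflict(r, s):
    """True if the x-projections or the y-projections of r and s overlap."""
    x_overlap = r.x1 < s.x2 and s.x1 < r.x2
    y_overlap = r.y1 < s.y2 and s.y1 < r.y2
    return x_overlap or y_overlap


def two_phase(rects):
    """Evaluation phase builds a stack; selection phase unwinds it."""
    # Phase 1 (evaluation): scan by right x-endpoint (an assumed order).
    # Each rectangle is charged the residual values of the conflicting
    # rectangles already on the stack and is pushed only if value remains.
    stack = []  # entries are (rectangle, residual value)
    for r in sorted(rects, key=lambda q: q.x2):
        charged = sum(v for s, v in stack if conflict(r, s))
        residual = r.weight - charged
        if residual > 0:
            stack.append((r, residual))

    # Phase 2 (selection): pop in reverse order, keep whatever still fits.
    chosen = []
    while stack:
        r, _ = stack.pop()
        if not any(conflict(r, s) for s in chosen):
            chosen.append(r)
    return chosen


# Hypothetical instance: B conflicts with both A and C; A and C are compatible.
A = Rect(0, 3, 0, 3, 5)
B = Rect(2, 6, 2, 6, 8)
C = Rect(5, 9, 5, 9, 5)
DEMO = [A, B, C]

if __name__ == "__main__":
    picked = two_phase(DEMO)
    print(picked, sum(r.weight for r in picked))
```

In the demo all three rectangles are pushed (with residual values 5, 3 and 2), and the selection phase keeps C and A while discarding B, for a total weight of 10.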

Common substructure between protein structures (work in progress with Jie Liang and Andrew Binkowski)
Comparison of two 4-helix bundles that differ by topological rearrangement: ROP and cytochrome b562.
(Figure: topological cartoons of 1ROP and 256B. Helices are drawn as cylinders and loops as lines. Residue numbers of structurally equivalent segments are indicated on the cylinders. The alignment is non-sequential.)

Motivation: discovering similar substructures from different proteins is essential for recognizing remote evolutionary relationships at the level of protein fragments.
A few interesting points:
- It is not easy to characterize topological structures such as voids, pockets, or tunnels where ligands and other molecules bind.
- Current computational tools do not perform very well on discovering similar substructures. For example:
  (a) protein structures are typically represented by distance matrices or contact maps, which record pairwise inter-distances between selected atoms (typically Cα atoms) on the primary sequences
  (b) finding common substructures becomes matching submatrices of the two contact maps
  (c) heuristic algorithms have been developed and have proven to be useful, but they are time consuming (typically O(n^6)) and cannot be used for more demanding tasks such as identifying spatial functional motifs

Our approach in work in progress
- reduce the problem to various constrained rectangle-packing problems
- use combinatorial methods (such as the local-ratio technique) to design approximation algorithms for these problems
Our final goal
- identification of the most discriminating geometric and chemical features and their combinations for various proteins
- development of a robust method to compute the similarity/dissimilarity of two shape distributions of these features

Transformation based distances
- Objects
- Transformation rules (with costs): 10, 15, 12, 9
- Goal: find the distance between two specified objects
(Figure: one sequence of transformations between the two objects has cost 10 + 15 + 9 = 34, another has cost 10 + 12 = 22; the distance between the two objects is 22.)
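When the set of objects is small and explicit, a transformation-based distance is just a cheapest path in the graph whose edges are the rule applications. The sketch below uses Dijkstra's algorithm for that; the object names A to D and the wiring of the rules are hypothetical, chosen only so that the two routes cost 34 and 22 as in the slide. For phylogenetic trees or genomes the object space is far too large to enumerate this way, which is why the talk turns to hardness results and approximation.

```python
import heapq


def transformation_distance(rules, source, target):
    """Cheapest total cost of rule applications turning source into target.

    rules maps an object to a list of (resulting object, cost) pairs.
    Costs must be non-negative. Returns None if target is unreachable.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, obj = heapq.heappop(heap)
        if obj == target:
            return d
        if d > dist.get(obj, float("inf")):
            continue  # stale heap entry
        for nxt, cost in rules.get(obj, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return None


# Hypothetical rule set: A -> B -> C -> D costs 10 + 15 + 9 = 34,
# while A -> B -> D costs 10 + 12 = 22.
RULES = {
    "A": [("B", 10)],
    "B": [("C", 15), ("D", 12)],
    "C": [("D", 9)],
}

if __name__ == "__main__":
    print(transformation_distance(RULES, "A", "D"))
```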

Distances between phylogenetic trees
- Objects: evolutionary trees (phylogenies) on n nodes
- Transformation rules: how to modify trees locally, consistent with biological applications?

Why compute distances between phylogenies? First motivation
- different methods for inferring phylogenies from the input data: parsimony method, compatibility method, maximum-likelihood method, distance matrix method
- compare the resulting phylogenies for similarity and discrepancy

Why compute distances between phylogenies? Second motivation
To find out information about rare genetic events such as recombination or gene conversion.
(Figure: illustrations of recombination and gene conversion.)

A few distances that we have looked at
- Nearest neighbor interchange (nni) distance
- Linear cost subtree transfer distances
Synopsis of our work on these distances
- proving that exact solution is NP-hard
- providing fast approximate solutions
- investigating fixed-parameter tractability
- some implementation work

Genomic Distance
Syntenic distance between multi-chromosome genomes (Ferretti, Nadeau and Sankoff, 1996)
- treats genomes at a higher level of abstraction
- order of genes in any chromosome is unknown or ignored
- intra-chromosomal events (e.g., reversal, transposition) do not affect chromosomal assignment
- inter-chromosomal events are important
(Figure: a genome drawn as chromosomes, each holding an unordered set of genes numbered 1 to 10.)

Inter-chromosomal events
- Fission
- Fusion
- (Reciprocal) translocation
(Figure: an example of each event on chromosomes with numbered genes.)
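Because gene order is ignored, a genome can be modelled as a set of chromosomes, each of which is a set of genes, and the three events become simple set operations. The sketch below is one such encoding; the function names, the frozenset representation and the example genes are assumptions made for illustration, not the notation of Ferretti, Nadeau and Sankoff.

```python
def fusion(genome, a, b):
    """Merge chromosomes a and b into a single chromosome."""
    assert a in genome and b in genome and a != b
    return (genome - {a, b}) | {a | b}


def fission(genome, chrom, part):
    """Split chrom into `part` and the remaining genes (both non-empty)."""
    assert chrom in genome and part and part < chrom
    return (genome - {chrom}) | {part, chrom - part}


def translocation(genome, a, b, a_part, b_part):
    """Reciprocal translocation: a gives up a_part and b gives up b_part."""
    assert a in genome and b in genome and a != b
    assert a_part <= a and b_part <= b
    new_a = (a - a_part) | b_part
    new_b = (b - b_part) | a_part
    assert new_a and new_b  # neither chromosome may become empty
    return (genome - {a, b}) | {new_a, new_b}


# Hypothetical genome with two chromosomes holding genes 1..5.
X = frozenset({1, 2, 3})
Y = frozenset({4, 5})
GENOME = frozenset({X, Y})

if __name__ == "__main__":
    fused = fusion(GENOME, X, Y)
    print(fused)
    print(fission(fused, X | Y, frozenset({1, 4})))
    print(translocation(GENOME, X, Y, frozenset({3}), frozenset({4})))
```

Each function returns a new genome and leaves its input untouched, which keeps the operations easy to compose when exploring short event sequences by hand.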

Syntenic distance between two genomes
- minimum number of fissions, fusions and translocations necessary to transform one genome into another
Other related problems
- finding the median of 3 genomes for the syntenic distance metric (useful for the problem of inferring phylogenetic trees from synteny data)
Synopsis of our work on these problems
- showing NP-hardness of exact computation
- giving efficient approximation algorithms
- exhibiting fixed-parameter tractability

Other problems
- Genome partitioning with applications to DNA microarray chip design
- Consensus sequence reconstruction problems

THE END