Algorithms for Generalized Comparison of Minisatellites Behshad Behzadi & Jean-Marc Steyaert LIX, Ecole Polytechnique France.

Slides:



Advertisements
Similar presentations
ADBIS 2007 Aggregating Multiple Instances in Relational Database Using Semi-Supervised Genetic Algorithm-based Clustering Technique Rayner Alfred Dimitar.
Advertisements

Indexing DNA Sequences Using q-Grams
Longest Common Subsequence
Embedding the Ulam metric into ℓ 1 (Ενκρεβάτωση του μετρικού χώρου Ulam στον ℓ 1 ) Για το μάθημα “Advanced Data Structures” Αντώνης Αχιλλέως.
1 An Adaptive GA for Multi Objective Flexible Manufacturing Systems A. Younes, H. Ghenniwa, S. Areibi uoguelph.ca.
Reference-based Indexing of Sequence Databases Jayendra Venkateswaran, Deepak Lachwani, Tamer Kahveci, Christopher Jermaine University of Florida-Gainesville.
Fast Algorithms For Hierarchical Range Histogram Constructions
CYK Parser Von Carla und Cornelia Kempa. Overview Top-downBottom-up Non-directional methods Unger ParserCYK Parser.
Yasuhiro Fujiwara (NTT Cyber Space Labs)
. Sequence Alignment I Lecture #2 This class has been edited from Nir Friedman’s lecture which is available at Changes made by.
On the Genetic Evolution of a Perfect Tic-Tac-Toe Strategy
UNIT-III By Mr. M. V. Nikum (B.E.I.T). Programming Language Lexical and Syntactic features of a programming Language are specified by its grammar Language:-
Advanced Algorithm Design and Analysis (Lecture 6) SW5 fall 2004 Simonas Šaltenis E1-215b
1 FireμSat : An Algorithm to Detect Tandem Repeats in DNA.
Efficient Algorithms for Locating Maximum Average Consecutive Substrings Jie Zheng Department of Computer Science UC, Riverside.
JM - 1 Introduction to Bioinformatics: Lecture IV Sequence Similarity and Dynamic Programming Jarek Meller Jarek Meller Division.
Chapter 4 Normal Forms for CFGs Chomsky Normal Form n Defn A CFG G = (V, , P, S) is in chomsky normal form if each rule in G has one of.
Non-coding RNA William Liu CS374: Algorithms in Biology November 23, 2004.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Sparse Normalized Local Alignment Nadav Efraty Gad M. Landau.
CPM '05 Sensitivity Analysis for Ungapped Markov Models of Evolution David Fernández-Baca Department of Computer Science Iowa State University (Joint work.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
Algorithm Design Techniques: Induction Chapter 5 (Except Section 5.6)
COMP305. Part II. Genetic Algorithms. Genetic Algorithms.
COMP305. Part II. Genetic Algorithms. Genetic Algorithms.
Data Structures – LECTURE 10 Huffman coding
Intro to AI Genetic Algorithm Ruth Bergman Fall 2002.
Sequence Alignment II CIS 667 Spring Optimal Alignments So we know how to compute the similarity between two sequences  How do we construct an.
Multiple Sequence alignment Chitta Baral Arizona State University.
Evaluation of the Haplotype Motif Model using the Principle of Minimum Description Srinath Sridhar, Kedar Dhamdhere, Guy E. Blelloch, R. Ravi and Russell.
Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.
Carmine Cerrone, Raffaele Cerulli, Bruce Golden GO IX Sirmione, Italy July
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Class 2: Basic Sequence Alignment
1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann.
. Sequence Alignment I Lecture #2 This class has been edited from Nir Friedman’s lecture. Changes made by Dan Geiger, then Shlomo Moran. Background Readings:
Genetic Programming.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Gene Matching Using JBits Steven A. Guccione Eric Keller.
The CYK Parsing Method Chiyo Hotani Tanya Petrova CL2 Parsing Course 28 November, 2007.
Linear Reduction for Haplotype Inference Alex Zelikovsky joint work with Jingwu He WABI 2004.
Improved Gene Expression Programming to Solve the Inverse Problem for Ordinary Differential Equations Kangshun Li Professor, Ph.D Professor, Ph.D College.
© Negnevitsky, Pearson Education, Lecture 10 Evolutionary Computation: Evolution strategies and genetic programming Evolution strategies Evolution.
1 Generalized Tree Alignment: The Deferred Path Heuristic Stinus Lindgreen
A Simple Algorithm for Stable Minimum Storage Merging Pok-Son Kim Kookmin University, Department of Mathematics, Seoul , Korea Arne Kutzner Seokyeong.
Schemata Theory Chapter 11. A.E. Eiben and J.E. Smith, Introduction to Evolutionary Computing Theory Why Bother with Theory? Might provide performance.
Toward Efficient Flow-Sensitive Induction Variable Analysis and Dependence Testing for Loop Optimization Yixin Shou, Robert A. van Engelen, Johnnie Birch,
. Sequence Alignment. Sequences Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of these sequences.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Genetic Algorithms Michael J. Watts
Sorting by Cuts, Joins and Whole Chromosome Duplications
Alignment, Part I Vasileios Hatzivassiloglou University of Texas at Dallas.
Combinatorial Optimization Problems in Computational Biology Ion Mandoiu CSE Department.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
A New Evolutionary Approach for the Optimal Communication Spanning Tree Problem Sang-Moon Soak Speaker: 洪嘉涓、陳麗徽、李振宇、黃怡靜.
Mining Evolutionary Model MEM Rida E. Moustafa And Edward J. Wegman George Mason University Phone:
CSCI-256 Data Structures & Algorithm Analysis Lecture Note: Some slides by Kevin Wayne. Copyright © 2005 Pearson-Addison Wesley. All rights reserved. 21.
Dipankar Ranjan Baisya, Mir Md. Faysal & M. Sohel Rahman CSE, BUET Dhaka 1000 Degenerate String Reconstruction from Cover Arrays (Extended Abstract) 1.
An Improved Search Algorithm for Optimal Multiple-Sequence Alignment Paper by: Stefan Schroedl Presentation by: Bryan Franklin.
Core String Edits, Alignments, and Dynamic Programming.
 2004 SDU Uniquely Decodable Code 1.Related Notions 2.Determining UDC 3.Kraft Inequality.
Genetic Algorithm. Outline Motivation Genetic algorithms An illustrative example Hypothesis space search.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Intelligent Exploration for Genetic Algorithms Using Self-Organizing.
WABI: Workshop on Algorithms in Bioinformatics
CSE 589 Applied Algorithms Spring 1999
Bioinformatics Algorithms and Data Structures
Double Cut and Join with Insertions and Deletions
Unit Genomic sequencing
Computational Genomics Lecture #3a
Presentation transcript:

Algorithms for Generalized Comparison of Minisatellites Behshad Behzadi & Jean-Marc Steyaert LIX, Ecole Polytechnique France

Outline Biology Evolutionary Model Problem description Previous works Algorithms Results

Biology… Minisatellites consist of tandem arrays of short repeat units found in genome of most higher eukaryotes. High degree of polymorphism at minisatellites has applications from forensic studies to investigation of origin of modern humans.

…Biology… These repeats are called variants. MVR-PCR is designed to find the variants. As an example, MSY1 is the minisatellite on the human Y-chromosomes. There are five different repeats (variants) in MSY1.

Different Repeat Types (Variants) of MSY1

Graphical representations of Minisatellite Maps of 13 males

Summary Biologists are able to compute the minisatellite maps : A sequence in which each of the repeats is replaced by its symbol. Study of evolution of minisatellites is an important problem, in human genetics studies.

Computer Science Model Each variant is a symbol of an alphabet. A minisatellite is a string on this alphabet. We need to compare these strings.

Evolutionary Operations Insertion Deletion Mutation Amplification (p-plication) Contraction (p-contraction)

Examples of operations(1) Insertion of d abbc —  abbdc Deletion of c abbcb —  abbb Mutation of c into d caab —  daab

Examples of operations(2) 4-plication of c abcb — > abccccb 2-contraction of b abbc — > abc No subword replication or contraction

Cost Functions I(x) : insertion of letter x D(x) : deletion of letter x M(x,y) : mutation of x to y A p (x) : p-plication of letter x C p (x) : p-contraction of letter x

Hypotheses All the costs are positive. The cost of duplications (and contractions) is less than all other operations. Distance : M(x,x)=0 Triangle Inequality holds: M(x,y)+M(y,z) ≥ M(x,z)

Transformation of s into t Applying a sequence of operations on s transforming it into t. An example : xyy — >> xbcbxzc xyy — > xy — > xxy — > xxz — > xbxz — > xbbxz — > xbbbxz — > xbcbxz — > xbcbxzc The cost of a transformation is the sum of costs of its operations.

Transformation distance between s and t TD = Minimum cost for a possible transformation of s into t. The transformation which gives this minimum is called optimal transformation.

Previous Works Jobling & al. Bérard & Rivals (RECOMB’02) B.B. & J.M.S. (CPM2003, WABI2004): this work

Optimal Transformation between s and t For any transformation of s into t there are 2 different types for the symbols of s. Generative vs Vanishing letters of s : — create a substring in t (generation) or — disappear (reduction)

Basic lemmas The optimal generation of a non-empty string s from a symbol x can be achieved by a non-decreasing generation. In an optimal transformation of a string s to a string t any contraction operation can be done before any generation.

The schema of the proof Sequence u is eliminated sometime during the process The right-hand side transformation is equivalent and less expensive w.r.t. evolution.

Optimal Transformation generative and vanishing symbols can be transformed in two distinct optimized phases.

The Algorithm Preprocessing ( Substring generation costs) by Dynamic programming Main part (Transformation distance) by Dynamic programming

Substring generation costs G[i, j, x] : minimum generation cost of t[i..j] from symbol x among all generations which do not start by a mutation. T[i,j,x] is the minimum generation cost of substring t[i..j] from symbol x. mc[i,j,p,x] is the minimum generation cost for generating t[i..j] from symbol x among all possible generations starting with a p- plication.

mc[i, j, p, x]

Substring generation costs

Substring Reduction cost S[i,j] is the minimum cost of reduction of the substring s[i..j] into s[i]. S[i,j] is determined in the same way.

Complexity The time complexity is The space complexity is The maximum possible p for a p-plication is noted by

Transformation Distance –TD[i,j] is the transformation distance between s[1..i] and t[1..j].

Complexity The main algorithm complexity is O(n³) in time and O(n²) in space. The total time complexity is The total space complexity is

Further improvements Improving the complexity using the Run Length Encoded string representation. The RLE of aaaabbbbcccabbbbcc is a4b4c3a1b4c2 also written a 4 b 4 c 3 ab 4 c 2 The lengths of the encoded strings with original lengths m and n are denoted by m' and n'.

Generation of Runs There exists an optimal generation of a non- empty string t from a single symbol x in which for every run of size k > 1 in t the k-1 right symbols of the run are generated by duplications of the leftmost symbol of the run.

New configurations in the transformation Generations could split runs into several parts... Similarly for reductions... See on examples different configurations

x x

x j i y x j’j’ i’i’ y

PreProcessing: Generation Costs Compute the generation cost of all substrings of the target string t from any symbol x of the alphabet. Fill a table G t [x,i,j] by recurrence.

Core Algorithm The Transformation Distances TD between s[1..i] and t[1..j] are computed by recurrence according to lemmas derived to the situations Generalized dynamic programming is used again Complexity : O(n' 3 +m' 3 +mn' 2 +nm' 2 +mn)

BS algorithms vs. BR algorithms Complexity improvement O(n 4 ) to O(n 3 ) and more with RLE (O(n 2 ) experimentally) Generalization 1: amplifications and contractions of order > 2 Generalization 2: symbol-dependent cost functions The triangle hypotheses on cost functions are not restrictive and can be released by some preprocessing.

Dataset Provided by Prof. M. Jobling Minisatellite maps of 690 Y chromosomes from worldwide population. The length of the sequences is between 48 and 118. Distances were computed for 690x690 pairs

Running times Minisatellites DataRandom SequencesAlgorithm sec 0.03 sec sec 2.14 sec This Work sec 2.49 sec sec 2.51 sec Behzadi & Steyaert (2003) sec sec sec sec Berard and Rivals (2002) CorePre-Processing CorePre-Processing

Conclusion More efficient algorithm to compute faster the distances and thus the phylogenetic trees. A more general framework which can be used for modelling more complicated biological evolutions.

Thank you