Download presentation
Presentation is loading. Please wait.
Published byClement Shaw Modified over 9 years ago
1
Algorithms for Generalized Comparison of Minisatellites Behshad Behzadi & Jean-Marc Steyaert LIX, Ecole Polytechnique France
2
Outline Biology Evolutionary Model Problem description Previous works Algorithms Results
3
Biology… Minisatellites consist of tandem arrays of short repeat units found in genome of most higher eukaryotes. High degree of polymorphism at minisatellites has applications from forensic studies to investigation of origin of modern humans.
4
…Biology… These repeats are called variants. MVR-PCR is designed to find the variants. As an example, MSY1 is the minisatellite on the human Y-chromosomes. There are five different repeats (variants) in MSY1.
5
Different Repeat Types (Variants) of MSY1
6
Graphical representations of Minisatellite Maps of 13 males
7
Summary Biologists are able to compute the minisatellite maps : A sequence in which each of the repeats is replaced by its symbol. Study of evolution of minisatellites is an important problem, in human genetics studies.
8
Computer Science Model Each variant is a symbol of an alphabet. A minisatellite is a string on this alphabet. We need to compare these strings.
9
Evolutionary Operations Insertion Deletion Mutation Amplification (p-plication) Contraction (p-contraction)
10
Examples of operations(1) Insertion of d abbc — abbdc Deletion of c abbcb — abbb Mutation of c into d caab — daab
11
Examples of operations(2) 4-plication of c abcb — > abccccb 2-contraction of b abbc — > abc No subword replication or contraction
12
Cost Functions I(x) : insertion of letter x D(x) : deletion of letter x M(x,y) : mutation of x to y A p (x) : p-plication of letter x C p (x) : p-contraction of letter x
13
Hypotheses All the costs are positive. The cost of duplications (and contractions) is less than all other operations. Distance : M(x,x)=0 Triangle Inequality holds: M(x,y)+M(y,z) ≥ M(x,z)
14
Transformation of s into t Applying a sequence of operations on s transforming it into t. An example : xyy — >> xbcbxzc xyy — > xy — > xxy — > xxz — > xbxz — > xbbxz — > xbbbxz — > xbcbxz — > xbcbxzc The cost of a transformation is the sum of costs of its operations.
15
Transformation distance between s and t TD = Minimum cost for a possible transformation of s into t. The transformation which gives this minimum is called optimal transformation.
16
Previous Works Jobling & al. Bérard & Rivals (RECOMB’02) B.B. & J.M.S. (CPM2003, WABI2004): this work
17
Optimal Transformation between s and t For any transformation of s into t there are 2 different types for the symbols of s. Generative vs Vanishing letters of s : — create a substring in t (generation) or — disappear (reduction)
18
Basic lemmas The optimal generation of a non-empty string s from a symbol x can be achieved by a non-decreasing generation. In an optimal transformation of a string s to a string t any contraction operation can be done before any generation.
19
The schema of the proof Sequence u is eliminated sometime during the process The right-hand side transformation is equivalent and less expensive w.r.t. evolution.
20
Optimal Transformation generative and vanishing symbols can be transformed in two distinct optimized phases.
21
The Algorithm Preprocessing ( Substring generation costs) by Dynamic programming Main part (Transformation distance) by Dynamic programming
22
Substring generation costs G[i, j, x] : minimum generation cost of t[i..j] from symbol x among all generations which do not start by a mutation. T[i,j,x] is the minimum generation cost of substring t[i..j] from symbol x. mc[i,j,p,x] is the minimum generation cost for generating t[i..j] from symbol x among all possible generations starting with a p- plication.
23
mc[i, j, p, x]
24
Substring generation costs
25
Substring Reduction cost S[i,j] is the minimum cost of reduction of the substring s[i..j] into s[i]. S[i,j] is determined in the same way.
26
Complexity The time complexity is The space complexity is The maximum possible p for a p-plication is noted by
27
Transformation Distance –TD[i,j] is the transformation distance between s[1..i] and t[1..j].
28
Complexity The main algorithm complexity is O(n³) in time and O(n²) in space. The total time complexity is The total space complexity is
29
Further improvements Improving the complexity using the Run Length Encoded string representation. The RLE of aaaabbbbcccabbbbcc is a4b4c3a1b4c2 also written a 4 b 4 c 3 ab 4 c 2 The lengths of the encoded strings with original lengths m and n are denoted by m' and n'.
30
Generation of Runs There exists an optimal generation of a non- empty string t from a single symbol x in which for every run of size k > 1 in t the k-1 right symbols of the run are generated by duplications of the leftmost symbol of the run.
31
New configurations in the transformation Generations could split runs into several parts... Similarly for reductions... See on examples different configurations
32
x x
33
x j i y x j’j’ i’i’ y
35
PreProcessing: Generation Costs Compute the generation cost of all substrings of the target string t from any symbol x of the alphabet. Fill a table G t [x,i,j] by recurrence.
38
Core Algorithm The Transformation Distances TD between s[1..i] and t[1..j] are computed by recurrence according to lemmas derived to the situations Generalized dynamic programming is used again Complexity : O(n' 3 +m' 3 +mn' 2 +nm' 2 +mn)
39
BS algorithms vs. BR algorithms Complexity improvement O(n 4 ) to O(n 3 ) and more with RLE (O(n 2 ) experimentally) Generalization 1: amplifications and contractions of order > 2 Generalization 2: symbol-dependent cost functions The triangle hypotheses on cost functions are not restrictive and can be released by some preprocessing.
40
Dataset Provided by Prof. M. Jobling Minisatellite maps of 690 Y chromosomes from worldwide population. The length of the sequences is between 48 and 118. Distances were computed for 690x690 pairs
41
Running times Minisatellites DataRandom SequencesAlgorithm 32.54 sec 0.03 sec 810.93 sec 2.14 sec This Work 1012.38 sec 2.49 sec 1014.32 sec 2.51 sec Behzadi & Steyaert (2003) 1062.23 sec 16.37 sec 1058.44 sec 15.90 sec Berard and Rivals (2002) CorePre-Processing CorePre-Processing
42
Conclusion More efficient algorithm to compute faster the distances and thus the phylogenetic trees. A more general framework which can be used for modelling more complicated biological evolutions.
43
Thank you
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.