Published by Maximillian Perkins. Modified over 8 years ago.
1
Local Exact Pattern Matching for Non-fixed RNA Structures
Mika Amit, Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Mohl, Christina Schmiedl, Sebastian Will
2
RNA
CPM 2012, Helsinki
An RNA R is an ordered pair (S, B), where:
S is a sequence defined over Σ = {A, C, G, U};
B is a set of base pairs, each of type C-G, G-C, A-U, or U-A.
[Figure: an example RNA with labels for backbone connection, single base, and base pair.]
3
RNA
An RNA R is an ordered pair (S, B), where:
S represents the primary structure of R;
B represents the secondary structure of R.
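The definition of R = (S, B) can be sketched as a small data structure. This is a minimal sketch under assumptions that are not in the slides: positions are 0-based, each base pair is stored as a tuple (i, j) with i < j, and the names `RNA`, `ALLOWED_PAIRS`, and `make_rna` are hypothetical.

```python
from dataclasses import dataclass

# Base-pair types listed on the slide: C-G, G-C, A-U, U-A.
ALLOWED_PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A")}


@dataclass(frozen=True)
class RNA:
    """An RNA R = (S, B): a sequence S and a set B of base pairs (assumed 0-based)."""
    seq: str
    pairs: frozenset

    def __post_init__(self):
        # S must be a string over the alphabet {A, C, G, U}.
        if set(self.seq) - set("ACGU"):
            raise ValueError("S must be a string over {A, C, G, U}")
        for i, j in self.pairs:
            if not (0 <= i < j < len(self.seq)):
                raise ValueError(f"bad base-pair indices: {(i, j)}")
            if (self.seq[i], self.seq[j]) not in ALLOWED_PAIRS:
                raise ValueError(f"bases at {(i, j)} cannot pair")


def make_rna(seq, pairs):
    """Hypothetical convenience constructor that freezes the pair set."""
    return RNA(seq, frozenset(pairs))
```

For example, `make_rna("GCAUGC", [(0, 5), (1, 4)])` describes a hairpin with two stacked base pairs, while `make_rna("GCAUGC", [(2, 5)])` raises `ValueError` because A and C do not form one of the listed pair types.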
4
RNA Representations
Two representations of an RNA: an arc-annotated string and a tree.
5
RNA Secondary Structure
The secondary structure determines the activity and functionality of the RNA.
The secondary structure of RNA is heavily researched.
It is usually better preserved during evolution than the sequence.
6
RNA Structure
Predicting the secondary structure of an RNA molecule is a difficult task.
The structure is sometimes given in a non-fixed form, in which each base pair exists in the RNA with some probability ≤ 1.
7
Nested Structure
In all of these examples, the structure of R is Nested: each base is bonded to at most one other base, and there are no crossing arcs.
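The two conditions of a Nested structure can be checked with a single stack pass over the arc endpoints. This is a minimal sketch, assuming base pairs are given as 0-based tuples (i, j) with i < j; the function name `is_nested` is hypothetical.

```python
def is_nested(pairs):
    """Return True if each base has at most one arc and no two arcs cross.

    Assumes every pair is a tuple (i, j) with i < j.
    """
    endpoints = [p for arc in pairs for p in arc]
    if len(endpoints) != len(set(endpoints)):
        return False  # some base is an endpoint of more than one arc
    right_of = {i: j for i, j in pairs}  # left endpoint -> right endpoint
    stack = []
    for pos in sorted(endpoints):
        if pos in right_of:
            stack.append(right_of[pos])  # an arc opens here
        elif not stack or stack.pop() != pos:
            return False  # this arc closes before a later-opened one: crossing
    return not stack
```

For instance, the arcs (0, 5) and (1, 4) are nested, whereas (0, 3) and (2, 5) cross.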
8
Unlimited Structure
Arc-annotated strings can represent Unlimited structures as well.
[Figure: the arc-annotated string CUACCGAGUCAGUACGACGCAUUAC with an Unlimited structure.]
9
Bounded-Unlimited Structure
Arc-annotated strings can also represent Bounded-Unlimited structures: each base can be connected to a constant number of other bases, and crossing arcs are allowed.
[Figure: the arc-annotated string CUACCGAGUCAGUACGACGCAUUAC with a Bounded-Unlimited structure.]
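The "constant number of other bases" condition amounts to a bound on how many arcs touch any single base. A minimal sketch, assuming arcs are 0-based tuples (i, j) and that the constant is passed in explicitly as `bound`; both function names are hypothetical.

```python
from collections import Counter


def max_arc_degree(pairs):
    """Largest number of arcs incident to any single base (0 if there are no arcs)."""
    degree = Counter()
    for i, j in pairs:
        degree[i] += 1
        degree[j] += 1
    return max(degree.values(), default=0)


def is_bounded_unlimited(pairs, bound):
    """True if no base takes part in more than `bound` arcs; crossings are allowed."""
    return max_arc_degree(pairs) <= bound
```

A Nested structure is the special case with a bound of 1 and the additional requirement that no arcs cross.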
10
RNA Similarity Algorithms
Many algorithms for finding similarity between RNA molecules use tree similarity algorithms.
Tree Edit Distance:
Tai ('79): O(n^6)
Zhang & Shasha ('89): O(n^4)
Klein ('98): O(n^3 log n)
Ma et al. ('99): O(n^3 log n)
Demaine et al. ('07): O(n^3)
11
RNA Similarity Algorithms
Tree Alignment:
Jiang et al. ('95)
Schirmer & Giegerich ('11)
Backofen et al. ('07)
Mohl et al. ('09)
12
RNA Similarity Algorithms
Longest Arc Preserving Common Subsequence:
Evans ('99)
Lin et al. ('02)
Alber et al. ('04)
Jiang et al. ('04)
13
RNA Similarity Algorithms
Similar Subforests:
Jansson & Peng ('11)
14
Exact Pattern Matching Problem
In this work, we search for local common sequence-structure regions (patterns) between two given RNA molecules.
15
Patterns in RNAs
In this work, we search for local common sequence-structure regions (patterns) between two given RNA molecules.
16
Exact Pattern Matching Problem
The problem: find all maximal common sequence-structure regions between two RNAs.
It was solved by Backofen & Siebert in O(n^2) time for fixed Nested x Nested structures.
[Figure: two example RNAs, UCUACUCAGCGUACG and UCAAGUCAGAGAACCCG, with labels for single base match, left endpoint match, and type mismatch.]
17
Exact Pattern Matching Problem
In this work, we solve the problem for non-fixed Nested x Nested structures.
[Figure: the two example RNAs UCUACUCAGCGUACG and UCAAGUCAGAGAACCCG, with an arc breaking marked.]
18
Arc Breaking Operation
We support an arc-breaking operation, in which a base pair can be deleted at no penalty, leaving its two endpoints as single bases.
[Figure: an example RNA with labels for single bases and base pair.]
20
Arc Breaking
We support the operation of arc breaking, in which a base pair can be deleted with no penalty.
[Figure: arc breaking shown on the tree representation.]
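Arc breaking only changes the set of base pairs; the sequence stays the same. A minimal sketch, assuming arcs are 0-based tuples (i, j); the function names `break_arc` and `single_bases` are hypothetical.

```python
def break_arc(pairs, arc):
    """Return the base-pair set without `arc`; both endpoints remain in the sequence."""
    pairs = frozenset(pairs)
    if arc not in pairs:
        raise KeyError(f"{arc} is not a base pair")
    return pairs - {arc}


def single_bases(seq_len, pairs):
    """Positions that are not an endpoint of any base pair."""
    paired = {p for arc in pairs for p in arc}
    return [k for k in range(seq_len) if k not in paired]
```

With a sequence of length 6 and arcs (0, 5) and (1, 4), breaking (1, 4) turns positions 1 and 4 into single bases while (0, 5) is kept.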
21
Arc Breaking
With arc breaking, patterns are now less restrictive.
22
Exact Pattern Matching Algorithms
We describe three algorithms for local exact pattern matching between two RNAs:
A simple O(n^4) algorithm, using ideas from Zhang & Shasha ('89)
An improved O(n^3 log n) algorithm, using ideas from Klein ('98)
An O(n^3) algorithm, using ideas from Demaine, Weimann et al. ('07)
23
Exact Pattern Matching Algorithm
Input: R1 = (S1, B1) and R2 = (S2, B2), with |R1| = n, |R2| = m, n > m
Output: the local exact pattern matching between R1 and R2
24
Exact Pattern Matching Algorithm
We compare each base pair from R1 with each base pair from R2, in increasing order of their sizes.
25
Exact Pattern Matching Algorithm
For each pair of base pairs, one from each RNA, we compute the matching inside the base pairs and its extensions outside them.
26
Matching Inside the Base Pairs
A dynamic programming algorithm, similar to the LCS and edit distance algorithms for strings.
27
Matching Inside the Base Pairs
On each comparison we compute only prefixes of the substrings and select the maximal score over 4 expressions.
Expression 1: match base pairs, if S1(i) == S2(j).
[Figure: prefixes of bp1 and bp2, from position 1 up to positions i and j.]
28
Matching Inside the Base Pairs
Expression 2: match single bases, if S1(i) == S2(j).
29
Matching Inside the Base Pairs
Expressions 3 and 4: delete from R1, or delete from R2.
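The four expressions can be read as an LCS-style recurrence. The following is only a schematic sketch, not the authors' exact formulation: the table D(i, j), the inner score P, and the scores of 1 per matched single base and 2 per matched base pair are assumptions introduced here for illustration.

```latex
\[
D(i,j) = \max
\begin{cases}
D(i'-1,\,j'-1) + P(i',i;\,j',j) + 2
  & \text{if } (i',i)\in B_1,\ (j',j)\in B_2,\ S_1(i')=S_2(j'),\ S_1(i)=S_2(j) \\
D(i-1,\,j-1) + 1
  & \text{if } S_1(i)=S_2(j) \\
D(i-1,\,j)
  & \text{(delete from } R_1\text{)} \\
D(i,\,j-1)
  & \text{(delete from } R_2\text{)}
\end{cases}
\]
```

Here P(i', i; j', j) stands for the score already computed inside the two base pairs, which would be available because base pairs are compared in increasing order of their sizes.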
30
Matching Inside the Base Pairs
On each comparison we compute the maximal match from left to right.
31
Matching Inside the Base Pairs
On each comparison we compute the maximal match from right to left.
32
Matching Inside the Base Pairs
There are two tricky parts here. The first: what happens when a mismatch occurs?
33
Matching Inside the Base Pairs
There are two tricky parts here. The second: what happens when the matchings overlap?
34
Matching Inside the Base Pairs
The solution: on each comparison we compute the best score going both from right to left and from left to right.
35
Time Complexity
We only compare prefixes of the base pairs.
There are O(n^2) prefixes for each RNA.
Each comparison is computed in O(1) time.
The total time is O(n^4).
36
Extending the Match
We compute the maximal pattern extension for all bases in R1 and all bases in R2 in one run.
The time complexity is O(n^2).
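To see why a single run over all position pairs is enough, consider a deliberately simplified analogue that ignores base pairs altogether: a table of longest common extensions to the right on the two sequences alone. This is not the extension step of the talk, only an illustration of how one O(nm) pass can fill in a value for every pair (i, j); the function name `right_extensions` is hypothetical.

```python
def right_extensions(s1, s2):
    """ext[i][j] = length of the longest common run starting at s1[i] and s2[j].

    Sequence-only simplification: base pairs are ignored entirely.
    """
    n, m = len(s1), len(s2)
    ext = [[0] * (m + 1) for _ in range(n + 1)]
    # Fill from the ends backwards so ext[i + 1][j + 1] is ready when needed.
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s1[i] == s2[j]:
                ext[i][j] = ext[i + 1][j + 1] + 1
    return ext
```

Each cell is computed in constant time from one neighbouring cell, so the whole table costs time proportional to n times m. Extensions to the left can be filled symmetrically from the front.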
37
Total Time Complexity
Computing the pattern match inside all base pairs is done in O(n^4).
Computing the pattern match extensions to the right and to the left is done in O(n^2).
The total time complexity is O(n^4) + O(n^2) = O(n^4).
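Written out as a single expression, the count reads as follows. The only assumption added here is that m < n, as stated in the input definition, so both prefix counts are bounded by O(n^2).

```latex
\[
T(n) \;=\;
\underbrace{O(n^2)}_{\text{prefixes in } R_1}
\cdot
\underbrace{O(n^2)}_{\text{prefixes in } R_2}
\cdot
\underbrace{O(1)}_{\text{per comparison}}
\;+\;
\underbrace{O(n^2)}_{\text{extensions}}
\;=\; O(n^4)
\]
```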
38
An O(n^3 log n) Algorithm
We use ideas from Klein's tree edit distance algorithm ('98): we decompose the larger RNA into heavy paths.
The root base pair is marked light, and we continue recursively: select the maximal child base pair and mark it as heavy, then mark the rest of the children as light.
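The marking rule can be sketched on a nested set of base pairs. This is a minimal sketch under assumptions that are not in the slides: arcs are 0-based tuples (i, j) with i < j, the size of a base pair is taken to be its span j - i + 1, every top-level base pair is treated as a root, and the function name `heavy_light` is hypothetical.

```python
def heavy_light(pairs):
    """Label each base pair of a nested structure 'heavy' or 'light'.

    Assumes `pairs` is nested and each pair is a tuple (i, j) with i < j.
    """
    children = {arc: [] for arc in pairs}
    roots, stack = [], []
    # Sorting by left endpoint visits every arc after the arcs enclosing it.
    for arc in sorted(pairs):
        while stack and stack[-1][1] < arc[0]:
            stack.pop()  # that arc closed before this one opens
        (children[stack[-1]] if stack else roots).append(arc)
        stack.append(arc)

    label = {}
    todo = [(root, "light") for root in roots]  # root base pairs are light
    while todo:
        arc, kind = todo.pop()
        label[arc] = kind
        kids = children[arc]
        if kids:
            # Assumed meaning of "maximal child": the child with the largest span.
            heavy = max(kids, key=lambda a: a[1] - a[0] + 1)
            todo.extend((k, "heavy" if k == heavy else "light") for k in kids)
    return label
```

For the arcs (0, 11), (1, 4), (5, 10), and (6, 9), the root (0, 11) is light, its larger child (5, 10) is heavy, the smaller child (1, 4) is light, and (6, 9), the only child of (5, 10), is heavy. Following heavy labels downward from a light base pair traces one heavy path.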
39
Special Substrings
For each base pair we define its special substrings.
The number of special substrings of a base pair is |bp| - |hp| + 1.
Lemma (Sleator & Tarjan '83): there are O(n log n) special substrings in an RNA R of size n.
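The count |bp| - |hp| + 1 is what one gets by shrinking the substring of a base pair one base at a time until only its heavy child remains. The sketch below makes that concrete under loud assumptions: |hp| is read as the size of the heavy child base pair, the base pair is (a, b), its heavy child is (x, y) with a <= x < y <= b, and the left side is stripped before the right. The order used in the actual algorithm may differ, and the name `special_substrings` is hypothetical.

```python
def special_substrings(bp, hp):
    """Enumerate substrings (start, end), inclusive, from bp down to its heavy child hp.

    Assumed order: remove bases from the left until the heavy child's left
    endpoint is reached, then remove bases from the right.
    """
    (a, b), (x, y) = bp, hp
    out = [(start, b) for start in range(a, x + 1)]        # shrink from the left
    out += [(x, end) for end in range(b - 1, y - 1, -1)]   # then from the right
    return out
```

For bp = (0, 11) and hp = (5, 10) this yields 7 substrings, matching |bp| - |hp| + 1 = 12 - 6 + 1. A base pair with no child base pairs is not covered by this sketch.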
40
An O(n^3 log n) Algorithm
We compare all O(n^2) substrings of R2 with the O(n log n) special substrings of R1.
41
An O(n^3 log n) Algorithm
The comparisons are made between the rightmost or the leftmost bases, according to the special substring.
42
An O(n^3 log n) Algorithm
The total number of compared substring pairs is O(n^3 log n), and each comparison is computed in O(1) time, which gives a total running time of O(n^3 log n).
This algorithm also works for Nested x Bounded-Unlimited structures.
43
An O(n^3) Algorithm
Based on the algorithm of Demaine et al. ('07), we decompose both RNAs into heavy paths.
The special substrings are decided at each base-pair comparison: the base pair that has the largest root light base pair is the dominant one.
[Figure: R1 and R2 decomposed into heavy paths, with base pairs labeled 1 to 9 and A to F.]
44
An O(n^3) Algorithm
The number of compared substrings is O(n^3).
This algorithm works for Nested x Nested structures only.
45
More Algorithms
Find the local approximate pattern matching between Nested x Nested structures in O(n^3 k^2) for k allowed mismatches.
Find the local approximate pattern matching between Nested x Bounded-Unlimited structures in O(n^3 k^2 log n) for k allowed mismatches.
Find the most similar sibling substructures between Nested x Nested structures in O(n^3).
46
THANK YOU!