Download presentation
Presentation is loading. Please wait.
1
Pattern Discovery in RNA Secondary Structure Using Affix Trees (when computer scientists meet real molecules) Giulio Pavesi& Giancarlo Mauri Dept. of Computer Science, Systems and Communication University of Milano – Bicocca Milan, Italy
2
CPM 2003 - MoreliaGiulio Pavesi2 Why Is RNA So Interesting? After the completion of various genome projects, the attention of many researchers has shifted from coding to non – coding parts More than 95% of our genome is not coding: what about the rest? Non – coding RNA: RNA that is transcribed from DNA, but does not encode directly for a protein (tRNA, microRNA, etc.)
3
CPM 2003 - MoreliaGiulio Pavesi3 A Motivating Example Post-transcriptional regulation of gene expression
4
CPM 2003 - MoreliaGiulio Pavesi4 The Problem Functionally related RNA sequences present structural similarity, at least in some parts Given two or more RNA molecules, find similar (supposedly functional) structural elements in them Sequence similarity implies structure similarity, but this is not always that true for RNA..... Given two or more RNA sequences of unknown structure, find similar structural elements in them (motifs) Low sequence similarity can anyway correspond to high structure similarity
5
CPM 2003 - MoreliaGiulio Pavesi5 “Know Thine Enemy” RNA secondary structure: list of the base pairs among nucleotides in the sequences, such that: –No nucleotide takes part in more than a single base pair (usually, Watson – Crick pairs and wobble pairs G – T, i.e. canonical base pairs) –Base pairs never cross: if nucleotide i is bound to nucleotide j and k with l, then either i < j < k < l or i < k < l < j
6
CPM 2003 - MoreliaGiulio Pavesi6 RNA Secondary Structure.((..(((.((....)))))...(((.(((...)))...)))))
7
CPM 2003 - MoreliaGiulio Pavesi7 Motifs in RNA Secondary Structure Many functional motifs can be described by secondary structure alone Two types of similarity: –sequence similarity (in unpaired nucleotides, mainly) –structure similarity
8
CPM 2003 - MoreliaGiulio Pavesi8 Data Structures? When dealing with DNA or protein sequences, some significant advantages have been obtained by using suitable text— indexing structures (e.g. suffix trees) RNA secondary structure can be described by a string Is there a “good” structure that will do for RNA sequences, allowing us to consider sequence and structure at the same time?
9
CPM 2003 - MoreliaGiulio Pavesi9 Affix Trees Affix tree for string S = ATATC Suffix and prefix edges Suffix edges spell the substrings of string S Prefix edges (dotted) spell substrings of S -1 (the reverse Built in linear time Takes linear space
10
CPM 2003 - MoreliaGiulio Pavesi10 Affix Trees The affix tree of a string S indexes all the substrings of both S and S -1 Once a substring of S has been located in the tree, we can extend it to the right (by following suffix edges) and to the left (by following prefix edges) Good if we search for patterns in the sequences with some kind of symmetry
11
CPM 2003 - MoreliaGiulio Pavesi11 The Hairpin The basic element of RNA secondary structure is the hairpin (or stem— loop) structure The hairpin is symmetric!!!! (((((...... ))))) AGGTC CAGTCA GATCT
12
CPM 2003 - MoreliaGiulio Pavesi12 First Try Predict the secondary structure of each input sequence Build the affix tree for the folded sequences (in bracket notation) Search exhaustively for patterns describing hairpin structures (possibly with differences) Report those occurring in at least q sequences
13
CPM 2003 - MoreliaGiulio Pavesi13 Searching for Hairpins in Affix Trees For each loop size l: 1.Find l dots in the tree, on suffix edges (hairpin loop) 2.Add a base pair: a)Find a ) on suffix edges b)Find a ( on prefix edges 3.If the result appears in at least q sequences, jump to 2, else return from jump 4.Add internal loops: a)Find a dot on prefix edges: jump to 2; b)Find a dot on suffix edges: jump to 2;
14
CPM 2003 - MoreliaGiulio Pavesi14 Recursive Algorithm 1....(ok) 2 (....)(ok) 2 ((....))(ok) 2 (((....)))(no) 3a.((....))(ok) 2 (.((....)))(ok) On each path, we keep a pointer for the prefix edge, and another for the suffix edge Speed—up: represent the unpaired elements with a single symbol describing type and size, so to compare two symbols instead of two regions
15
CPM 2003 - MoreliaGiulio Pavesi15 Approximate Search We can allow some approximation: –Hairpin loops of different size (range value at step 1) –Internal loops of different size at the same position along the stem –Internal loops or bulges at different positions along the stem –Stems of different size (base pairs) –Any combination of the previous
16
CPM 2003 - MoreliaGiulio Pavesi16 Complexity Given a set of k folded sequences of overall length N : –Construction of the tree: O(N) –Annotation of the tree: O(kN) –Search: O(V(m)kN), where m is the length of the longest pattern found –V(m) depends on the degree of approximation –In practice, the most time consuming part is predicting the structure of the sequences
17
CPM 2003 - MoreliaGiulio Pavesi17 Does It Work? Test: Iron Responsive Element, located in the UTRs of mRNA coding for proteins involved in iron metabolism (e.g. ferritin, transferrin) Does it appear in all the predicted structures? Alas, it does not!!!!!!
18
CPM 2003 - MoreliaGiulio Pavesi18 Why? The “real structure”often does not correspond to the optimal one!!!! The motif “disappears” from the (supposedly) optimal structure
19
CPM 2003 - MoreliaGiulio Pavesi19 One Possible Solution Idea: for each sequence, consider also a number of alternative sub-optimal structures All the possible structures can be enumerated Check whether a motif appears in at least one alternative structure per sequence The affix tree can handle efficiently even hundreds of alternative structures per input sequence Downside: the number of potential secondary structures for a sequence of length n is O(2 n ) If similarity is not stringent, we have too many candidates
20
CPM 2003 - MoreliaGiulio Pavesi20 But..... If the same structure has to appear in a set of sequences, then the same pattern of complementary base pairs has to appear in the sequences (((((...... ))))) AGGTC CAGTCA GATCT GCGAG CAGTCT CTTGC CCCAG CAGTCA CTGGG
21
CPM 2003 - MoreliaGiulio Pavesi21 Idea! Instead of working on folded sequences, build the affix tree for the sequences alone, and find complementary base pairs on the fly The search can be implemented with the same parameters of the folded case
22
CPM 2003 - MoreliaGiulio Pavesi22 Building Hairpins on the Fly By working on unfolded sequences, the theoretical time complexity is higher, since different paths correspond to the same structure In practice it is much faster, since we do not have to run the prediction algorithm on the input sequences We need to “validate” the candidate structures, e.g. according to their energy
23
CPM 2003 - MoreliaGiulio Pavesi23 Post - Processing So far we have considered structure alone More than a single motif occurrence per sequence is often reported, especially if structural constraints are loose Post processing: compare the candidate occurrences by evaluating sequence similarity in unpaired elements Find the group of instances that are more similar at the sequence level
24
CPM 2003 - MoreliaGiulio Pavesi24 Results and Work in Progress The second approach gave better results, in terms of reliability and efficiency Candidate hairpins can be validated according to their energy value (more reliable, in this case!) Good results on “harder” tests Too many input parameters yet Extend to more complex structures
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.