SPIRE Normalized Similarity of RNA Sequences


Presentation on theme: "SPIRE Normalized Similarity of RNA Sequences"— Presentation transcript:

1 SPIRE 2005 - Normalized Similarity of RNA Sequences
Local Alignment of RNA Sequences with Arbitrary Scoring Schemes
Rolf Backofen, Danny Hermelin, Gad M. Landau, Oren Weimann
Speaker note: the talk is about local alignment, but we start with global alignment.

2 RNA sequences C G G C U A A U C A G U C G U A

3 RNA sequences C C G U A G U A C C A C A G U G U G G C G C G G C C A U

4 RNA sequences C C G U A G U A C C A C A G U G U G G C G C G G C C A U

5 Alignment of Strings
S1 = U C A C C G _ A _ G
S2 = U C G C G G U A U G
Global alignment. Legend: white = match, red = indel, yellow = mismatch.
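A minimal sketch of global string alignment by dynamic programming, to make the slide concrete. The scoring values (+1 match, -1 mismatch, -1 indel) and the name `global_align` are assumptions for illustration; the slides do not fix a scoring scheme.

```python
def global_align(s1, s2, match=1, mismatch=-1, indel=-1):
    """Return (score, aligned_s1, aligned_s2) for an optimal global alignment.

    The scoring values are illustrative assumptions, not taken from the talk.
    """
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel
    for j in range(1, m + 1):
        score[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # match / mismatch
                              score[i - 1][j] + indel,     # gap in s2
                              score[i][j - 1] + indel)     # gap in s1
    # Trace back from (n, m) to (0, 0) to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and s1[i - 1] == s2[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + indel:
            a1.append(s1[i - 1]); a2.append("_"); i -= 1
        else:
            a1.append("_"); a2.append(s2[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a1)), "".join(reversed(a2))
```

Calling `global_align("UCACCGAG", "UCGCGGUAUG")` returns a score together with two gapped rows of equal length, in the same layout as S1 and S2 on the slide.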

6 Alignment of RNA sequences
A A G G C C C U G A U A G A C C G U U A U
Legend: red = character indels and arc indels, yellow = character mismatches and arc mismatches, white = character matches and arc matches.
An arc is a single entity, so we either delete an arc or match it to another arc.

7 Alignment of RNA sequences
A A G G C C C U G A U A G A C C G U U U
If we match the blinking arcs, we also need to align the colored segments (the sequences between the arc endpoints) to each other.

8 Alignment of RNA sequences
A A G G C C C U G A U A G A C C G U U U
RNA global alignment via tree edit distance: [SZ 1989], [K 1998], [DMRW 2006], each with a running time stated in terms of the sequence lengths n and m.
Theorem: All these algorithms compute the edit distance between any two arcs, provided we match these arcs.
Speaker note: the theorem says we get the score of aligning any possible pair of colored segments (the segments between arc endpoints).

9 The Alignment graph
Graph labels: R1 = U C A C C G A G, R2 = U C G C G G U A U G
Theorem: There is a one-to-one correspondence between all paths in the alignment graph and all alignments of substrings of R1 and R2.
Speaker note: this is the string alignment graph. We will turn it into an RNA alignment graph in which there is a one-to-one correspondence between HEAVIEST paths and OPTIMAL alignments.

10 The Alignment graph
Graph labels: R1 = U C A C C G A G, R2 = U C G C G G U A U G
Theorem: There is a one-to-one correspondence between all paths in the alignment graph and all alignments of substrings of R1 and R2.
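The correspondence in the theorem can be sketched for the plain string alignment graph as follows. The sketch assumes vertices are pairs (i, j), that a step (1, 1) is a diagonal edge, (1, 0) consumes one character of R1, and (0, 1) consumes one character of R2; the function name `path_to_alignment` and this orientation are illustrative choices, not taken from the slides.

```python
def path_to_alignment(r1, r2, path):
    """Translate a path in the string alignment graph into two gapped rows.

    `path` is a list of (i, j) vertices. A path from (i', j') to (i, j)
    spells an alignment of r1[i':i] with r2[j':j].
    """
    row1, row2 = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):        # diagonal edge: match or mismatch
            row1.append(r1[i1 - 1]); row2.append(r2[j1 - 1])
        elif step == (1, 0):      # character of r1 aligned with a gap
            row1.append(r1[i1 - 1]); row2.append("_")
        elif step == (0, 1):      # character of r2 aligned with a gap
            row1.append("_"); row2.append(r2[j1 - 1])
        else:
            raise ValueError(f"not an edge of the string alignment graph: {step}")
    return "".join(row1), "".join(row2)


# The global alignment from the "Alignment of Strings" slide, written as a path.
slide_path = ([(k, k) for k in range(7)]
              + [(6, 7), (7, 8), (7, 9), (8, 10)])
rows = path_to_alignment("UCACCGAG", "UCGCGGUAUG", slide_path)
```

Because every path uses only these three step types, distinct paths give distinct alignments, and a path that starts away from (0, 0) gives an alignment of substrings.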

11 The Alignment graph
Graph labels: R1 = U C A C C G A G, R2 = U C G C G G U A U G
Speaker note: we now add the arcs.

12 The Alignment graph
Graph labels: R1 = U C A C C G A G, R2 = U C G C G G U A U G
Speaker note: we do not want to match one arc endpoint without the other, so we remove the diagonal edges from every column or row that belongs to an arc endpoint. We will take care of arc matches later.
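A minimal sketch of this pruning step. It assumes arcs are given as pairs of 1-based endpoint positions and that a diagonal edge into cell (i, j) is dropped whenever i is an arc endpoint in R1 or j is an arc endpoint in R2; the function name and the edge labels are hypothetical.

```python
def alignment_graph_edges(n, m, arcs1=(), arcs2=()):
    """List the edges of the alignment graph before shortcut edges are added.

    n, m   -- lengths of the two sequences
    arcs1  -- arcs of the first sequence as (left, right) 1-based positions
    arcs2  -- arcs of the second sequence, same convention
    Each edge is (source_vertex, target_vertex, kind).
    """
    ends1 = {p for arc in arcs1 for p in arc}   # rows holding an arc endpoint
    ends2 = {p for arc in arcs2 for p in arc}   # columns holding an arc endpoint
    edges = []
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                edges.append(((i - 1, j), (i, j), "delete"))
            if j > 0:
                edges.append(((i, j - 1), (i, j), "insert"))
            # Diagonal edges survive only where neither character is an arc endpoint.
            if i > 0 and j > 0 and i not in ends1 and j not in ends2:
                edges.append(((i - 1, j - 1), (i, j), "diagonal"))
    return edges
```

For two sequences of length 3 with a single arc (1, 3) in the first one, only the three diagonals entering row 2 remain.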

13 The Alignment graph
Graph labels: R1 = U C A C C G A G, R2 = U C G C G G U A U G
Theorem: There is a one-to-one correspondence between all paths in the alignment graph and all alignments of substrings of R1 and R2 in which all arcs are deleted.
Speaker note: notice that we split the cost of deleting an arc into the costs of deleting its two endpoints. This handles the case where a path (or alignment) crosses an arc. An open problem is to charge the cost of deleting an arc to only one of its endpoints. Next we add the shortcut edges for matching arcs. Their weights are obtained in a preprocessing step: one run of Klein's or Demaine's algorithm.

14 The Alignment graph
Graph labels: R1 = U C A C C G A G, R2 = U C G C G G U A U G
Speaker note: for every pair of arcs, we add a shortcut edge from the cell that represents the beginning of the two arcs to the cell that represents their end.

15 The Alignment graph
Graph labels: R1 = U C A C C G A G, R2 = U C G C G G U A U G
Theorem: There is a one-to-one correspondence between HEAVIEST paths in the alignment graph and OPTIMAL alignments of substrings of R1 and R2.
Speaker note: clearly the global alignment score is just the cost of the shortcut edge from (0,0) to (n,m). The correspondence covers only OPTIMAL alignments, not all alignments, because the graph has no path for an alignment that matches two arcs but aligns the substrings between the arc endpoints non-optimally.

16 The Local Alignment algorithms
We use the alignment graph to compute the local similarity between two RNA sequences according to two well-known metrics:
Smith-Waterman: the highest-scoring alignment between any pair of substrings of the input RNAs.
Its normalized version.

17 Standard Local Similarity (Smith-Waterman)
Graph labels: R1 = U C A C C G A G, R2 = U C G C G G U A U G
The score is computed via a dynamic program:
Score(i,j) = max { 0, Score(i',j') + weight of the incoming edge from (i',j') }, where the maximum ranges over all incoming edges of (i,j).
Score(i,j) is the score of the best alignment that ends at (i,j).
Speaker note: this is very similar to Smith-Waterman on strings, except that there every vertex has exactly 3 incoming edges, while here some vertices have 2 and some have 3, and one incoming edge (a shortcut edge) can come from far away.
Time complexity: O(mn) plus one run of a global algorithm.
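A minimal sketch of this dynamic program over the alignment graph. Several things here are assumptions for illustration: the scoring values, the single indel cost used for all characters (the talk splits arc-deletion cost over the two endpoints), and the input `arc_weight`, a hypothetical dictionary standing in for the shortcut-edge weights that the preprocessing run of a global algorithm would supply.

```python
def rna_local_score(r1, r2, arcs1=(), arcs2=(), arc_weight=None,
                    match=1, mismatch=-1, indel=-1):
    """Smith-Waterman style score over an alignment graph with shortcut edges.

    arcs1, arcs2 -- arcs as (left, right) 1-based positions
    arc_weight   -- {(arc_of_r1, arc_of_r2): weight}; assumed to be precomputed
    """
    arc_weight = arc_weight or {}
    n, m = len(r1), len(r2)
    ends1 = {p for arc in arcs1 for p in arc}
    ends2 = {p for arc in arcs2 for p in arc}
    # A shortcut edge runs from the cell just before both left endpoints
    # to the cell of both right endpoints.
    shortcuts = {}
    for ((l1, e1), (l2, e2)), w in arc_weight.items():
        shortcuts.setdefault((e1, e2), []).append((l1 - 1, l2 - 1, w))

    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    # Row-major order is a topological order: every edge goes down and/or right.
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = [0]                       # empty alignment ending here
            if i > 0:
                candidates.append(score[i - 1][j] + indel)
            if j > 0:
                candidates.append(score[i][j - 1] + indel)
            if i > 0 and j > 0 and i not in ends1 and j not in ends2:
                sub = match if r1[i - 1] == r2[j - 1] else mismatch
                candidates.append(score[i - 1][j - 1] + sub)
            for pi, pj, w in shortcuts.get((i, j), []):
                candidates.append(score[pi][pj] + w)
            score[i][j] = max(candidates)
            best = max(best, score[i][j])
    return best
```

Each cell inspects a constant number of ordinary edges plus its shortcut edges, which is where the O(mn) term in the time bound comes from once the shortcut weights are known.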

18 Normalized Local Similarity
The weakness of the Smith-Waterman approach is discussed in [AP 2001] (AP = Arslan and Pevzner).
Solution: look for the substrings R1', R2' (with their arcs) that maximize the normalized score, that is, the alignment weight divided by the alignment length, subject to a lower bound given as input.
Speaker note: a single match is always optimal for the ratio alone, so we demand that ED(R1',R2') is greater than some given value.

19 Normalized Local Similarity
Graph labels: R1 = U C A C C G A G, R2 = U C G C G G U A U G
Again, a dynamic program:
Define Length(k,i,j) to be the length of the shortest path that ends at vertex (i,j) and has weight equal to k.
The best k/Length(k,i,j) over all i,j,k is the normalized score.
Speaker note: each k/Length(k,i,j) is a normalized score, and the best among all n^2 m of them is the best normalized score. There are n^2 m entries because k ranges over at most n values for each of the nm vertices.

20 Normalized Local Similarity
Again, a dynamic program:
Define Length(k,i,j) to be the length of the shortest path that ends at vertex (i,j) and has weight equal to k.
For every k,i,j compute
Length(k,i,j) = min { Length(k-w,i',j') + (i-i') + (j-j') }, where the minimum ranges over all incoming edges of (i,j) and w = weight of the incoming edge from (i',j').
Figure: an edge of weight w from (i',j') to (i,j), which extends a path counted in Length(k-w,i',j') and spans i-i' rows and j-j' columns.
Time complexity: the cost of filling the Length table plus one run of a global algorithm.
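A minimal sketch of the Length(k,i,j) recurrence, restricted to plain strings without arcs so that it stays short. It assumes a hypothetical 0/1 scoring (match = 1, mismatch = 0, indel = 0), so that weights are small non-negative integers, and a hypothetical `min_weight` threshold playing the role of the "given value" constraint. With arcs, each shortcut edge would simply contribute one more incoming edge with its precomputed weight.

```python
import math


def normalized_local_score(s1, s2, min_weight=1):
    """Best k / Length(k, i, j) over all cells and all weights k >= min_weight.

    Length(k, i, j) is the length of the shortest path that ends at (i, j)
    and has weight exactly k; an edge from (i', j') to (i, j) has length
    (i - i') + (j - j'). Scoring is an illustrative 0/1 scheme.
    """
    n, m = len(s1), len(s2)
    max_k = min(n, m)                      # at most this many matches
    length = [[[math.inf] * (max_k + 1) for _ in range(m + 1)]
              for _ in range(n + 1)]
    best = 0.0
    for i in range(n + 1):                 # row-major = topological order
        for j in range(m + 1):
            cell = length[i][j]
            cell[0] = 0                    # the empty path has weight 0
            incoming = []
            if i > 0:
                incoming.append((i - 1, j, 0))
            if j > 0:
                incoming.append((i, j - 1, 0))
            if i > 0 and j > 0:
                incoming.append((i - 1, j - 1, 1 if s1[i - 1] == s2[j - 1] else 0))
            for k in range(1, max_k + 1):
                for pi, pj, w in incoming:
                    if k - w < 0:
                        continue
                    cand = length[pi][pj][k - w] + (i - pi) + (j - pj)
                    if cand < cell[k]:
                        cell[k] = cand
                if k >= min_weight and cell[k] < math.inf:
                    best = max(best, k / cell[k])
    return best
```

For example, with "AXB" and "AYB" a single match already reaches the ratio 1/2, while requiring weight at least 2 forces the path through the mismatch and lowers the best ratio to 2/6, which is the effect the threshold is meant to control.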

21 Open Problems
Arc deletion (charging the cost of deleting an arc to only one of its endpoints).
Improving global tree edit distance.
Graph labels: R1 = U C A C C G A G, R2 = U C G C G G U A U G

22 Thank you very much for your attention

