SPIRE 2005 - Normalized Similarity of RNA Sequences Rolf Backofen Danny Hermelin Gad M. Landau Oren Weimann
RNA sequences C G G C U A A U C A G U C G U A
RNA sequences C C G U A G U A C C A C A G U G U G G C G C G G C C A U
LCS of Strings: S1 = C C G U A G U A C C A C A G U G U G G; S2 = G A G C A G C C C U C G G G A A U U G. Global LCS: [Hirschberg 1977]
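As a point of reference for the global LCS of two plain strings, here is a minimal sketch of the textbook quadratic dynamic program. It is not necessarily the algorithm of the cited reference; the function name `lcs_length` is chosen here for illustration only.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of a longest common subsequence of s1 and s2.

    Textbook DP, O(len(s1) * len(s2)) time, keeping only one previous row.
    """
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                # Extend the LCS of the two shorter prefixes by this match.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last character of one of the two prefixes.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# The two strings shown on the slide, with the spaces removed.
S1 = "CCGUAGUACCACAGUGUGG"
S2 = "GAGCAGCCCUCGGGAAUUG"
print(lcs_length(S1, S2))
```

Only the length is returned; recovering an actual subsequence would need the full table or a divide-and-conquer scheme.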
LCS of RNA sequences: [Figure: sequences R1 and R2 with their arcs, showing a left arc match and a right arc match.] Matches are either arc matches or non-arc matches. We obtain the LCS of every pair of arcs (one in R1, one in R2). RNA Global LCS: [Klein 1998]
Global Similarity - LCS: Look for the largest set of matches that is strictly increasing in both rows and columns and obeys the arc restrictions.
Local Similarity – Normalized LCS: Report the most similar pair of substrings according to some scoring scheme. In our case, we look for the substrings R1’, R2’ (with their arcs) that maximize LCS(R1’, R2’) / (|R1’| + |R2’|). This can be viewed as a measure of the density of the matches. A single match on its own is always optimal, so we require a minimum score of M.
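To make the objective concrete, here is a brute-force sketch for plain strings (no arcs). It assumes the quantity being maximized is the LCS of the two substrings divided by the sum of their lengths, subject to the LCS being at least M; the function name and the return format are hypothetical.

```python
from fractions import Fraction


def normalized_lcs_bruteforce(a: str, b: str, min_score: int):
    """Best (value, i, i_end, j, j_end) over all substring pairs a[i:i_end], b[j:j_end].

    value = LCS / (len of substring of a + len of substring of b),
    restricted to pairs whose LCS is at least min_score.
    Returns None if no pair reaches min_score. Strings only, no arcs.
    """
    best = None
    n, m = len(a), len(b)
    for i in range(n):
        for j in range(m):
            rows, cols = n - i, m - j
            # table[x][y] = LCS(a[i:i+x], b[j:j+y])
            table = [[0] * (cols + 1) for _ in range(rows + 1)]
            for x in range(1, rows + 1):
                for y in range(1, cols + 1):
                    if a[i + x - 1] == b[j + y - 1]:
                        table[x][y] = table[x - 1][y - 1] + 1
                    else:
                        table[x][y] = max(table[x - 1][y], table[x][y - 1])
                    score = table[x][y]
                    if score >= min_score:
                        value = Fraction(score, x + y)
                        if best is None or value > best[0]:
                            best = (value, i, i + x, j, j + y)
    return best


# With min_score = 1 a single matching character already reaches 1/2,
# the largest possible value, which is why a threshold M is needed.
print(normalized_lcs_bruteforce("ACGU", "ACGU", 1))
print(normalized_lcs_bruteforce("AXXC", "AC", 2))
```

This enumeration is far too slow for real inputs; it is only meant to pin down what is being optimized.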
Local Similarity in Strings: Local edit distance, O(nm) [Smith Waterman 1981]; Normalized LCS, O(mn log n) [Arslan Pevzner 2001]; Normalized LCS for sparse matrices, O(rL log log n) [Efraty Landau 2004].
Our Result: A novel local similarity metric for comparing RNA sequences, and an algorithm for computing this metric that is as fast as the global algorithm (in contrast to the case of strings).
Definitions: A chain is a sequence of matches that is strictly increasing in both rows and columns and is legal with respect to the arcs. The length of a chain from (i,j) to match (i’,j’) is i’-i+j’-j. A k-chain(i,j) is the shortest chain of k matches starting from (i,j); a chain never really starts at a mismatch, but the definition is needed for every cell (i,j) by the DP. The normalized value of k-chain(i,j) is k divided by its length.
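The definitions above can be illustrated with a few small helpers. This is a sketch only: the function names are hypothetical, matches are given as (row, column) pairs, and neither the actual base equality nor the arc restrictions are checked.

```python
from fractions import Fraction


def is_chain(matches):
    """True if consecutive matches strictly increase in both row and column."""
    return all(r1 < r2 and c1 < c2
               for (r1, c1), (r2, c2) in zip(matches, matches[1:]))


def chain_length(start, matches):
    """Length i' - i + j' - j, from cell start = (i, j) to the last match (i', j')."""
    i, j = start
    i_last, j_last = matches[-1]
    return (i_last - i) + (j_last - j)


def normalized_value(start, matches):
    """Number of matches k divided by the chain length.

    Raises ZeroDivisionError when the length is 0, e.g. a single match
    that sits exactly on the start cell.
    """
    return Fraction(len(matches), chain_length(start, matches))


chain = [(1, 1), (2, 3), (4, 4)]
print(is_chain(chain))                  # True
print(chain_length((0, 0), chain))      # 8
print(normalized_value((0, 0), chain))  # 3/8
```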
General idea: Construct (k+1)-chain(i,j) by concatenating (i,j) to k-chain(i’,j’). For the moment, let us assume there are no arcs. "Best k-chain" means the k-chain for which the value of the new chain (the yellow match plus the chain) is best.
Decomposing k-Chains
Decomposing k-Chains (non-arc match): use the match and continue with the best (k-1)-chain.
Decomposing k-Chains (mismatch): take the best k-chain.
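For the arc-free case, the two decompositions above (match and mismatch) can be sketched as a layered DP filled from bottom right to top left. Several things here are assumptions rather than the authors' formulation: the function names, storing for each cell the smallest i’+j’ at which a k-chain can end, and adding 2 to the chain length when normalizing so that it equals the combined length of the two covered substrings (which also keeps a single match from having length 0).

```python
from fractions import Fraction

INF = float("inf")


def shortest_k_chain_ends(a: str, b: str, max_k: int):
    """ends[k][i][j] = smallest i' + j' over chains of k matches that use
    only rows >= i and columns >= j (INF if there is no such chain).
    Strings only; arcs are ignored in this sketch."""
    n, m = len(a), len(b)
    ends = [None]  # index 0 is unused
    for k in range(1, max_k + 1):
        # One extra row and column of INF acts as the border.
        layer = [[INF] * (m + 1) for _ in range(n + 1)]
        for i in range(n - 1, -1, -1):
            for j in range(m - 1, -1, -1):
                if a[i] == b[j]:
                    # Match: use it, then append the best (k-1)-chain.
                    layer[i][j] = i + j if k == 1 else ends[k - 1][i + 1][j + 1]
                else:
                    # Mismatch: inherit the best k-chain from below or from the right.
                    layer[i][j] = min(layer[i + 1][j], layer[i][j + 1])
        ends.append(layer)
    return ends


def best_normalized_chain(a: str, b: str, min_score: int):
    """Best (value, k, i, j) over all k >= min_score (min_score >= 1) and all
    matching start cells (i, j); None if no chain has min_score matches."""
    max_k = min(len(a), len(b))
    ends = shortest_k_chain_ends(a, b, max_k)
    best = None
    for k in range(min_score, max_k + 1):
        for i in range(len(a)):
            for j in range(len(b)):
                end = ends[k][i][j]
                if a[i] == b[j] and end != INF:
                    # Assumption: +2 turns i'-i+j'-j into the total substring length.
                    value = Fraction(k, end - i - j + 2)
                    if best is None or value > best[0]:
                        best = (value, k, i, j)
    return best


print(best_normalized_chain("AXXC", "AC", 2))
```

For a fixed k and a fixed start cell, the shortest chain is also the one with the highest normalized value, which is why only the minimal end position needs to be stored per cell.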
Decomposing k-Chains (right arc match): Treat it like a mismatch, since this match cannot be used by a chain that starts at it. The chain cannot connect to the same row or column (the column case can be seen in the figure), since the matches there are right arc matches. Take the best k-chain.
Decomposing k-Chains (left arc match I): Option 1: do not use the match; take the best k-chain in the gray area.
Example 2-Chain: an example of option 1.
Decomposing k-Chains (left arc match II): Option 2: use the match; then we must take the whole arc.
Decomposing k-Chains (left arc match II): Option 3 (k > lcs): use the match, which means taking the whole arc, then continue with the best (k-lcs)-chain. Here lcs is the LCS of the two matched arcs.
Decomposing k-Chains (left arc match III): Option 2 (k ≤ lcs): use the match, which means taking the whole arc.
Example 3-Chain: an example of option 2 (use the match and take the whole arc, with k ≤ lcs).
The Algorithm (Given R1,R2): Run Klein’s algorithm to get the LCS of every arc in R1 with every arc in R2. For k=1,2,…,n: construct all k-chains from bottom right to top left using DP, and report the best k-chain. The total running time is as fast as global LCS; the bottleneck is the global LCS computation.
The DP
Thank you very much for your attention