Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis (2) Martin Russell.


1 Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis (2) Martin Russell

2 Slide 2 EE3J2 Data Mining Objectives
• Revise dynamic programming
• Examples

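3 Slide 3 EE3J2 Data Mining Alignment path
[Figure: alignment path through the grid formed by the sequences A C X C C D and A B C D; d(C,X) labels the local distance between the symbols C and X]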

4 Slide 4 EE3J2 Data Mining Accumulated Distance
• The accumulated distance along the path p is the sum of the distances along its length
• Large accumulated distance = poor matches between symbols = poor path
• Small accumulated distance = good matches between symbols = good path
• The path with the smallest accumulated distance is called the optimal path
• The optimal path is computed using Dynamic Programming

5 Slide 5 EE3J2 Data Mining Dynamic Programming
[Figure: alignment grid for the sequences A C X C C D and A B C D]
The accumulated distance to a point is the minimum of the accumulated distances to the possible previous points, plus the local, incremental cost.

6 Slide 6 EE3J2 Data Mining Formally…
[Equation; its annotations are:]
• Accumulated distance up to the point (m,n)
• Deletion penalty
• 'Local' distance between the mth symbol in sequence 1 and the nth symbol in sequence 2
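
Only the annotations of the slide's equation appear here. A recurrence consistent with those annotations, and with the accumulated distance matrix of Example 2 on slide 9, could be written as below. The symbols $P_{\mathrm{DEL}}$ and $P_{\mathrm{INS}}$ are assumed names for the two penalties, and charging the deletion penalty to the step that advances only in sequence 1 is also an assumption (the examples use equal penalties, so they cannot settle it).

```latex
ad(1,1) = d(1,1), \qquad
ad(m,n) = d(m,n) + \min\bigl\{\, ad(m-1,n-1),\;\; ad(m-1,n) + P_{\mathrm{DEL}},\;\; ad(m,n-1) + P_{\mathrm{INS}} \,\bigr\}
```

Here $d(m,n)$ is the local distance between the mth symbol of sequence 1 and the nth symbol of sequence 2, and terms whose indices fall outside the grid are left out of the minimum.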

7 Slide 7 EE3J2 Data Mining Example application: sequence retrieval
Corpus of sequential data: …… AAGDTDTDTDD AABBCBDAAAAAAA BABABABBCCDF GGGGDDGDGDGDGDTDTD DGDGDGDGD AABCDTAABCDTAABCDTAAB CDCDCDTGGG GGAACDTGGGGGAAA …….
'Query' sequence Q: …BBCCDDDGDGDGDCDTCDTTDCCC…
Dynamic Programming distance calculation: calculate ad(S,Q) for each sequence S in the corpus
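
As a rough illustration of the retrieval idea, the sketch below scores every sequence in a small corpus against a query and sorts by accumulated distance. It is not the course's code: the function names, the 0/1 local distance, and the default penalties of 1 are all assumptions made for the example, and both sequences are assumed to be non-empty.

```python
def accumulated_distance(s, q, del_pen=0, ins_pen=0):
    """Accumulated distance ad(s, q) of the optimal path (inputs non-empty)."""
    prev = None  # accumulated distances for the previous symbol of s
    for m, a in enumerate(s):
        cur = []
        for n, b in enumerate(q):
            local = 0 if a == b else 1  # assumed 0/1 local distance
            options = []
            if m > 0 and n > 0:
                options.append(prev[n - 1])            # diagonal step
            if m > 0:
                options.append(prev[n] + del_pen)      # advance in s only
            if n > 0:
                options.append(cur[n - 1] + ins_pen)   # advance in q only
            # The very first cell has no predecessor, so it costs only `local`.
            cur.append(local + (min(options) if options else 0))
        prev = cur
    return prev[-1]


def rank_corpus(corpus, query, del_pen=1, ins_pen=1):
    """Return (distance, sequence) pairs, smallest distance first."""
    scored = [(accumulated_distance(s, query, del_pen, ins_pen), s) for s in corpus]
    return sorted(scored)


if __name__ == "__main__":
    # Toy corpus, not the one on the slide.
    for dist, seq in rank_corpus(["GGGG", "ABCC", "ABCD"], "ABCD"):
        print(dist, seq)
```

Only one row of the accumulated distance matrix is kept at a time, which is enough when the distance itself is needed and the path is not.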

8 Slide 8 EE3J2 Data Mining Example: Edit Distance
S1 = AABCDK, S2 = ABCCDK, DEL = 0, INS = 0
Distance matrix:
      A  B  C  C  D
   A  0  1  1  1  1
   B  1  0  1  1  1
   C  1  1  0  0  1
   D  1  1  1  1  0
Accumulated distance matrix:
      A  B  C  C  D
   A  0  1  2  3  4
   B  1  0  1  2  3
   C  2  1  0  0  1
   D  2  1  1  1  0
Forward path matrix:
      A  B  C  C  D
   A  \  _  _  _  _
   A  |  _  _  _  _
   B  |  \  _  _  _
   C  |  |  \  _  _
   D  |  |  |  |  \
AABCCD

9 Slide 9 EE3J2 Data Mining Example 2: Edit Distance
S1 = AABCDK, S2 = ABCCDK, DEL = 2, INS = 2
Distance matrix:
      A  B  C  C  D
   A  0  1  1  1  1
   B  1  0  1  1  1
   C  1  1  0  0  1
   D  1  1  1  1  0
Accumulated distance matrix:
      A  B  C  C  D
   A  0  3  6  9 12
   A  2  1  4  7 10
   B  5  2  2  5  8
   C  8  5  2  2  5
   D 11  8  5  3  2
Forward path matrix:
      A  B  C  C  D
   A  \  _  _  _  _
   A  |  \  _  _  _
   B  |  \  \  _  _
   C  |  |  \  \  _
   D  |  |  |  \  \
      A  B  C  C  D
   A  \  _  _  _  _
   A  |  \  _  _  _
   B  |  \  \  _  _
   C  |  |  \  \  _
   D  |  |  |  |  \
ABCCD
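
The accumulated distance matrix of Example 2 can be reproduced with a short dynamic programming sketch. This is an illustrative Python version, not the course's edit-dist.c. It assumes the row sequence is AABCD and the column sequence is ABCCD (the labels shown on the matrices), a 0/1 local distance, and that ties are broken in favour of the diagonal step, so its path symbols may differ from the slide's in tied cells.

```python
def edit_distance(s1, s2, del_pen=0, ins_pen=0):
    """Return (accumulated distance matrix, path matrix) for s1 (rows) vs s2 (columns)."""
    rows, cols = len(s1), len(s2)
    ad = [[0] * cols for _ in range(rows)]
    path = [[""] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            local = 0 if s1[m] == s2[n] else 1  # assumed 0/1 local distance
            options = []
            if m > 0 and n > 0:
                options.append((ad[m - 1][n - 1], "\\"))          # diagonal
            if m > 0:
                options.append((ad[m - 1][n] + del_pen, "|"))     # vertical
            if n > 0:
                options.append((ad[m][n - 1] + ins_pen, "_"))     # horizontal
            # min() keeps the first minimal option, so diagonal wins ties.
            best, move = min(options, key=lambda o: o[0]) if options else (0, "\\")
            ad[m][n] = best + local
            path[m][n] = move
    return ad, path


if __name__ == "__main__":
    ad, path = edit_distance("AABCD", "ABCCD", del_pen=2, ins_pen=2)
    for row in ad:
        print(*row)      # accumulated distance matrix, one row per line
    print(ad[-1][-1])    # distance between the two sequences
```

With both penalties set to 2, the bottom-right entry is 2, matching the final cell of the matrix above.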

10 Slide 10 EE3J2 Data Mining edit-dist.c
• New C program on course website
• Computes the edit distance between two sequences
• Prints out:
  – Distance matrix
  – Forward accumulated distance matrix
  – Forward path matrix
  – Optimal path
  – Optimal alignment

11 Slide 11 EE3J2 Data Mining edit-dist.c  Format: edit-dist seq1 seq2  Seq1 and seq2 are the sequences  and optional, default 0

12 Slide 12 EE3J2 Data Mining Matching partial sequences
• In some applications the interest is in whether one sequence matches a subsequence of another sequence
• Example: Bioinformatics
  – Look for examples of a simple DNA sequence within a more complex sequence
  – Infer evolutionary relationship between two organisms

13 Slide 13 EE3J2 Data Mining Partial alignment
• Simple intuitive solution is to allow Dynamic Programming to:
  – Start at any point in the first row
  – End at any point in the final row
• Then proceed as before
• Unfortunately this has limitations…
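
The "start anywhere in the first row, end anywhere in the final row" idea can be sketched as follows. The function name, the 0/1 local distance, the default penalties of 1, and the convention that the short pattern labels the rows are assumptions made for illustration; both inputs are assumed to be non-empty.

```python
def partial_match(pattern, sequence, del_pen=1, ins_pen=1):
    """Align `pattern` (rows) against any stretch of `sequence` (columns).

    Returns (cost, start_column, end_column) of the best-scoring path.
    """
    rows, cols = len(pattern), len(sequence)
    ad = [[0] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            local = 0 if pattern[m] == sequence[n] else 1
            if m == 0:
                ad[m][n] = local  # free start: nothing is accumulated along the first row
                continue
            options = []
            if n > 0:
                options.append((ad[m - 1][n - 1], (m - 1, n - 1)))      # diagonal
            options.append((ad[m - 1][n] + del_pen, (m - 1, n)))        # vertical
            if n > 0:
                options.append((ad[m][n - 1] + ins_pen, (m, n - 1)))    # horizontal
            best, back[m][n] = min(options, key=lambda o: o[0])
            ad[m][n] = best + local
    # Free end: take the best-scoring point in the final row (leftmost on ties).
    end = min(range(cols), key=lambda n: ad[rows - 1][n])
    m, n = rows - 1, end
    while back[m][n] is not None:  # trace back until the first row is reached
        m, n = back[m][n]
    return ad[rows - 1][end], n, end


if __name__ == "__main__":
    cost, start, end = partial_match("ABBC", "XZABCCYZ")
    print(cost, "XZABCCYZ"[start:end + 1])
```

The only changes from the full alignment are the initialisation of the first row and the choice of end point; the recurrence for the remaining rows is unchanged.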

14 Slide 14 EE3J2 Data Mining Finding matching sub-sequences
[Figure: partial alignment grid, with the labels "Start DP from here", "Best scoring end point" and "Lower cost path"]

15 Slide 15 EE3J2 Data Mining Backwards Pass DP
[Figure: two grids, labelled "Forward pass" and "Backward pass"]

16 Slide 16 EE3J2 Data Mining Backwards Pass DP
• Starts in the bottom row and works right-to-left and bottom-to-top
• Otherwise, the calculations of the backward accumulated distance matrix and the backward path matrix are analogous to forward-pass DP
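
A backward pass along these lines might look like the sketch below. It assumes the partial-alignment setting (any point in the bottom row is a free starting point), a 0/1 local distance and penalties of 1; the names `backward_pass`, `bd` and `succ` are hypothetical.

```python
def backward_pass(pattern, sequence, del_pen=1, ins_pen=1):
    """Backward DP: fill the grid right-to-left and bottom-to-top.

    bd[m][n] is the smallest accumulated distance of a path that starts at
    (m, n) and ends somewhere in the bottom row; succ[m][n] is the next
    point on that path (None in the bottom row).
    """
    rows, cols = len(pattern), len(sequence)
    bd = [[0] * cols for _ in range(rows)]
    succ = [[None] * cols for _ in range(rows)]
    for m in range(rows - 1, -1, -1):
        for n in range(cols - 1, -1, -1):
            local = 0 if pattern[m] == sequence[n] else 1
            if m == rows - 1:
                bd[m][n] = local  # free start anywhere in the bottom row
                continue
            options = []
            if n + 1 < cols:
                options.append((bd[m + 1][n + 1], (m + 1, n + 1)))      # diagonal
            options.append((bd[m + 1][n] + del_pen, (m + 1, n)))        # vertical
            if n + 1 < cols:
                options.append((bd[m][n + 1] + ins_pen, (m, n + 1)))    # horizontal
            best, succ[m][n] = min(options, key=lambda o: o[0])
            bd[m][n] = best + local
    return bd, succ


if __name__ == "__main__":
    bd, succ = backward_pass("ABBC", "XZABCCYZ")
    for row in bd:
        print(*row)
```

Apart from the direction of the loops and the use of successors in place of predecessors, the recurrence mirrors the forward pass.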

17 Slide 17 EE3J2 Data Mining Forward-backward DP
• Suppose that we have done a complete forward DP and a complete backward DP
• We will have two path matrices:
  – Forward path matrix
  – Backward path matrix
• From any point in the bottom row we can trace back through the forward path matrix and recover a path ending in the top row
• From any point in the top row we can trace back through the backward path matrix and recover a path ending in the bottom row

18 Slide 18 EE3J2 Data Mining Matching sub-sequences Choose a point in the bottom row. Trace back through the forward path matrix. Identify the start of the path. Then trace back through the backward path matrix. Are the paths the same? If so, then we have a matching subsequence

19 Slide 19 EE3J2 Data Mining Matching subsequences
• If a path occurs as a consequence of tracing back through the forward path matrix and of tracing back through the backward path matrix, then the corresponding section of the horizontal sequence is called a matching subsequence
• The matching subsequences are those which achieve a good match with the vertical pattern
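
The forward-backward check can be sketched as below. This is one possible reading of the procedure, under stated assumptions: a 0/1 local distance, penalties of 1, free starts in the top row (forward) and bottom row (backward), and ties broken in favour of the diagonal step. As a shortcut, the backward pass is obtained by running the forward pass on the reversed sequences and mirroring the indices, which should give the same pointers as a right-to-left, bottom-to-top sweep under these tie-breaking rules. All function names are hypothetical.

```python
def free_start_pass(pattern, sequence, del_pen, ins_pen):
    """Forward DP with a free start in the top row; returns predecessor pointers."""
    rows, cols = len(pattern), len(sequence)
    ad = [[0] * cols for _ in range(rows)]
    prev = [[None] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            local = 0 if pattern[m] == sequence[n] else 1
            if m == 0:
                ad[m][n] = local
                continue
            options = []
            if n > 0:
                options.append((ad[m - 1][n - 1], (m - 1, n - 1)))
            options.append((ad[m - 1][n] + del_pen, (m - 1, n)))
            if n > 0:
                options.append((ad[m][n - 1] + ins_pen, (m, n - 1)))
            best, prev[m][n] = min(options, key=lambda o: o[0])
            ad[m][n] = best + local
    return prev


def trace(pointers, cell):
    """Follow pointers from `cell` until a point with no pointer is reached."""
    path = [cell]
    while pointers[cell[0]][cell[1]] is not None:
        cell = pointers[cell[0]][cell[1]]
        path.append(cell)
    return path


def matching_subsequences(pattern, sequence, del_pen=1, ins_pen=1):
    """Return (start, end) column pairs whose forward and backward paths agree."""
    rows, cols = len(pattern), len(sequence)
    fwd = free_start_pass(pattern, sequence, del_pen, ins_pen)
    rev = free_start_pass(pattern[::-1], sequence[::-1], del_pen, ins_pen)

    def mirror(cell):  # map between original and reversed grid coordinates
        return (rows - 1 - cell[0], cols - 1 - cell[1])

    matches = []
    for end in range(cols):
        fpath = trace(fwd, (rows - 1, end))        # bottom row up to the top row
        start = fpath[-1]
        bpath = [mirror(c) for c in trace(rev, mirror(start))]  # top row down
        if bpath == fpath[::-1]:                   # same path in both directions
            matches.append((start[1], end))
    return matches


if __name__ == "__main__":
    seq = "XZABCCYZ"
    for start, end in matching_subsequences("ABBC", seq):
        print(start, end, seq[start:end + 1])
```

Which stretches are reported depends on the penalties and on how ties are broken, so the output for a given pair of sequences need not coincide with the slide's figure.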

20 Slide 20 EE3J2 Data Mining Matching subsequences
[Figure: the vertical pattern A B B C aligned against the horizontal sequence X Z A B C C Y Z, with the matching subsequence marked]
We say that this subsequence most closely resembles the original sequence ABBC

21 Slide 21 EE3J2 Data Mining Summary
• Revision of Dynamic Programming
• Examples: Edit distance
• Motivation for interest in optimal subsequences
• Forward and backward dynamic programming
• Matching subsequences: subsequences which most closely resemble a given sequence

