Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell
Slide 2 EE3J2 Data Mining Objectives To motivate the need for sequence analysis To introduce the notions of: –Alignment path –Accumulated distance along a path To introduce Dynamic Programming and the principle of optimality To explain the use of dynamic programming to find the distance between two sequences
Slide 3 EE3J2 Data Mining Sequences Sequential data common in real applications: –DNA analysis in bioinformatics and forensic science –Sequences of the letters A, G, C and T –Signature recognition biometrics –Words and text –Spelling and grammar checkers, author verification,.. –Speech, music and audio –Speech/speaker recognition, speech coding and synthesis –Electronic music –Radar signature recognition…
Slide 4 EE3J2 Data Mining Retrieving sequential data Large corpora of sequential data –E.g: DNA databases Individual sequences may not be amenable to human interpretation Need for automated sequential data retrieval Need to ‘mine’ sequential data Fundamental requirement is a measure of distance between two sequences
Slide 5 EE3J2 Data Mining Basic definitions In a typical sequence analysis application we have a basic alphabet A consisting of N symbols Examples: –In conventional text A is the set of letters plus punctuation plus ‘white space’ –In bioinformatics, scientists use {A, G, C, T} to describe DNA sequences
Slide 6 EE3J2 Data Mining Distance between sequences (1) Consider two sequences of symbols from the alphabet A = {A, B, C, D} How similar are the sequences: –S 1 = ABCD –S 2 = ABD Intuitively S 2 is obtained from S 1 by deleting C An alternative explanation is that S 2 is obtained from S 1 by substituting D for C and then deleting D
Slide 7 EE3J2 Data Mining Distance between sequences (2) Alternatively S 2 was obtained from S 1 by deleting ABCD and inserting ABD … Why is the first explanation intuitively ‘correct’ and the last one intuitively ‘wrong’? Because we favour the simplest explanation, the one which involves the minimum number of insertions, deletions and substitutions Or do we?
Slide 8 EE3J2 Data Mining Distance between sequences (3) Consider: –S 1 = AABC –S 2 = SABC –S 3 = PABC –S 4 = ASCB If we know that these sequences were typed then S 2 is closer to S 1 than S 3 is, because A and S are adjacent on a keyboard Similarly S 4 is close to S 2 because letter-swapping (SA → AS etc.) is a common typographical error
Slide 9 EE3J2 Data Mining Alignments The relationship between two sequences can be expressed as an alignment between their elements E.g.: [Figure: alignment of the vertical string ABBC against the horizontal string ABC, illustrating an insertion]
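Slide 10 EE3J2 Data Mining Alignment: deletion and substitution [Figure: vertical string AXC aligned against horizontal string ABC, illustrating a substitution] [Figure: vertical string AC aligned against horizontal string ABC, illustrating a deletion] N.B.: All edits are described relative to the vertical string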
Slide 11 EE3J2 Data Mining General alignment path [Figure: an alignment path between the vertical string ACXCCD and the horizontal string ABCD] Which alignment is best?
Slide 12 EE3J2 Data Mining The Distance Matrix Suppose that d(A,B) denotes the distance between the alphabet symbols A and B Examples: –d(A,B) = 0 if A = B, otherwise d(A,B) = 1 –In typing, d(A,B) might indicate how unlikely it is that A would be mistyped as B –In bioinformatics d(A,B) might indicate how unlikely it is that symbol A in a DNA string is replaced by B
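The two kinds of symbol distance described on this slide can be sketched in Python. This is a minimal illustration, not part of the lecture: the function names, the `ADJACENT_KEYS` table (which lists only two QWERTY neighbour pairs) and the `near_cost` value of 0.5 are all hypothetical choices.

```python
def zero_one_distance(a, b):
    """d(A,B) = 0 if A = B, otherwise 1."""
    return 0.0 if a == b else 1.0


# Hypothetical, deliberately tiny table of keys that sit next to each
# other on a QWERTY keyboard. A real table would cover the whole layout.
ADJACENT_KEYS = {frozenset("AS"), frozenset("SD")}


def typing_distance(a, b, near_cost=0.5):
    """Distance that treats neighbouring keys as likelier mistypings.

    near_cost is an assumed value: smaller than 1 so that a slip onto an
    adjacent key counts as 'closer' than an arbitrary substitution.
    """
    if a == b:
        return 0.0
    if frozenset((a, b)) in ADJACENT_KEYS:
        return near_cost
    return 1.0


print(zero_one_distance("A", "S"), typing_distance("A", "S"))
print(zero_one_distance("A", "P"), typing_distance("A", "P"))
```

Under the 0/1 distance S and P are equally far from A, while the keyboard-aware distance places S closer, matching the typing example given earlier.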
Slide 13 EE3J2 Data Mining Notation Suppose we have an alphabet: $A = \{a_1, \dots, a_N\}$ A distance matrix for A is an N × N matrix $D = [d_{m,n}]$, where $d_{m,n} = d(a_m, a_n)$ is the distance between the m th and n th alphabet symbols
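As a small sketch of this notation, the distance matrix for a four-symbol alphabet can be built from any symbol distance function. The helper name `distance_matrix` and the choice of the 0/1 distance are assumptions made for illustration.

```python
def distance_matrix(alphabet, d):
    """Return the N x N matrix whose (m, n) entry is d(alphabet[m], alphabet[n])."""
    return [[d(a, b) for b in alphabet] for a in alphabet]


def zero_one(a, b):
    """0/1 symbol distance, used here only as an example."""
    return 0 if a == b else 1


alphabet = ["A", "B", "C", "D"]
D = distance_matrix(alphabet, zero_one)

for symbol, row in zip(alphabet, D):
    print(symbol, row)
```

Python lists are 0-indexed, so the entry for the m th and n th symbols is `D[m - 1][n - 1]`.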
Slide 14 EE3J2 Data Mining The Accumulated Distance Consider two sequences: –S 1 = ABCD –S 2 = ACXCCD For any alignment path p between S 1 and S 2 we define the accumulated distance between S 1 and S 2, denoted by AD p (S 1,S 2 ), to be the sum over all the nodes of p of the corresponding distances between elements of S 1 and S 2.
Slide 15 EE3J2 Data Mining Accumulated distance along p [Figure: path p through the grid with ACXCCD on the vertical axis and ABCD on the horizontal axis]
Slide 16 EE3J2 Data Mining Accumulated distance (continued) [Figure: the same grid and path, with steps annotated K DEL, K INS, K INS, K INS]
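The accumulated distance along a given path can be sketched as follows. Several things here are assumptions rather than statements from the slides: a path is represented as a list of 1-based nodes (i, j), with i indexing S 1 and j indexing S 2; a step that advances only i is charged K DEL and a step that advances only j is charged K INS; and `example_path` is just one path with one deletion and three insertions, not necessarily the path drawn in the figure.

```python
def zero_one(a, b):
    """0/1 symbol distance."""
    return 0 if a == b else 1


def accumulated_distance(s1, s2, path, d=zero_one, k_ins=0, k_del=0):
    """Sum d over the nodes of path, plus a penalty per non-diagonal step."""
    total = 0
    prev = None
    for i, j in path:
        total += d(s1[i - 1], s2[j - 1])  # local distance at node (i, j)
        if prev is not None:
            step = (i - prev[0], j - prev[1])
            if step == (1, 0):        # advance in s1 only: assumed deletion
                total += k_del
            elif step == (0, 1):      # advance in s2 only: assumed insertion
                total += k_ins
            elif step != (1, 1):      # diagonal steps carry no penalty
                raise ValueError(f"illegal step {prev} -> {(i, j)}")
        prev = (i, j)
    return total


# Hypothetical path from (1,1) to (4,6): 1 deletion, 3 insertions, 2 diagonals.
example_path = [(1, 1), (2, 1), (3, 2), (3, 3), (3, 4), (3, 5), (4, 6)]

print(accumulated_distance("ABCD", "ACXCCD", example_path))
print(accumulated_distance("ABCD", "ACXCCD", example_path, k_ins=1, k_del=1))
```

With zero penalties only the two mismatched nodes contribute; with unit penalties the four non-diagonal steps each add one more.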
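Slide 17 EE3J2 Data Mining Optimal path The optimal path is the path with minimum accumulated distance Formally the optimal path is $\hat{p}$ where: $\hat{p} = \arg\min_{p} AD_p(S_1, S_2)$ The accumulated distance AD(S 1,S 2 ) between S 1 and S 2 is given by: $AD(S_1, S_2) = AD_{\hat{p}}(S_1, S_2) = \min_{p} AD_p(S_1, S_2)$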
Slide 18 EE3J2 Data Mining Calculating the optimal path Given –the distance matrix D, –the insertion penalty K INS, and –the deletion penalty K DEL * How can we compute the optimal path between two (potentially long) sequences S 1 and S 2 ? In a real application the lengths of S 1 and S 2 might be large * If K DEL and K INS are not defined you should assume that they are zero
Slide 19 EE3J2 Data Mining Dynamic Programming (DP) The optimal path is calculated using Dynamic Programming (DP) DP is based on the principle of optimality Suppose all paths from X to Y must pass through A or B immediately before reaching Y. The optimal path from X to Y is the best of: (i) the best path from X to A plus the cost of going from A to Y (ii) the best path from X to B plus the cost of going from B to Y [Figure: X connected to Y through the intermediate nodes A and B]
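The principle of optimality for this two-predecessor case can be written as a single equation. The symbols are introduced here only for illustration: $C^{*}(X \to Z)$ stands for the cost of the best path from X to Z, and $c(Z, Y)$ for the cost of the final step from Z to Y.

```latex
C^{*}(X \to Y) = \min \bigl\{\, C^{*}(X \to A) + c(A, Y),\;\; C^{*}(X \to B) + c(B, Y) \,\bigr\}
```

The best path to Y therefore never needs anything other than the best paths to its immediate predecessors, which is what lets DP avoid enumerating every path.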
Slide 20 EE3J2 Data Mining DP – step 1 Step 1: draw the trellis of all possible paths [Figure: trellis with ACXCCD on the vertical axis and ABCD on the horizontal axis]
Slide 21 EE3J2 Data Mining DP – forward pass – initialisation [Figure: (forward) path matrix and (forward) accumulated distance matrix, each with ACXCCD on the vertical axis and ABCD on the horizontal axis] ad(1,1) = d(A,A) ad(2,1) = ad(1,1) + d(2,1)
Slide 22 EE3J2 Data Mining ad(i,j) ad(i,j) is the sum of distances along the best (partial) path from (1,1) to (i,j) It is calculated using the principle of optimality In addition, the forward path matrix records the local optimal path at each point (i-1,j) (i,j-1) (i-1,j-1)
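One way to write the recurrence implied by the three predecessors (i-1,j), (i,j-1) and (i-1,j-1) is given below. It is a sketch under assumptions: d(i,j) denotes the distance between the i th symbol of S 1 and the j th symbol of S 2, and the pairing of K DEL with the step from (i-1,j) and K INS with the step from (i,j-1) is an assumed convention rather than something stated on the slide.

```latex
\begin{align*}
ad(1,1) &= d(1,1) \\
ad(i,j) &= d(i,j) + \min \bigl\{\, ad(i-1,j-1),\;\; ad(i-1,j) + K_{\mathrm{DEL}},\;\; ad(i,j-1) + K_{\mathrm{INS}} \,\bigr\}
\end{align*}
```

Predecessors that fall outside the trellis (i = 1 or j = 1) are simply left out of the minimum. The forward path matrix stores, for each (i,j), which of the three terms achieved the minimum.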
Slide 23 EE3J2 Data Mining DP – forward pass – continued [Figure: (forward) path matrix and (forward) accumulated distance matrix, partially filled] ad(1,2) = ad(1,1) + d(A,C)
Slide 24 EE3J2 Data Mining DP – forward pass – continued [Figure: (forward) path matrix and (forward) accumulated distance matrix, further entries filled]
Slide 25 EE3J2 Data Mining DP – forward pass – continued [Figure: (forward) path matrix and (forward) accumulated distance matrix, further entries filled]
Slide 26 EE3J2 Data Mining DP – forward pass – continued [Figure: completed (forward) path matrix and (forward) accumulated distance matrix] Optimal path obtained by back-tracking through the forward path matrix, starting at the bottom right-hand corner
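The forward pass and the back-tracking step can be sketched together in Python. This is a minimal illustration under stated assumptions, not the lecture's own implementation: the function name `dp_align` is hypothetical, both sequences are assumed non-empty, i indexes S 1 and j indexes S 2, a step from (i-1,j) is charged K DEL and a step from (i,j-1) is charged K INS, and ties are broken in favour of the diagonal predecessor.

```python
def zero_one(a, b):
    """0/1 symbol distance."""
    return 0.0 if a == b else 1.0


def dp_align(s1, s2, d=zero_one, k_ins=0.0, k_del=0.0):
    """Return (accumulated distance, optimal path) for non-empty s1, s2.

    The path is a list of 1-based nodes (i, j) from (1, 1) to
    (len(s1), len(s2)).
    """
    n, m = len(s1), len(s2)
    # ad[i][j]: cost of the best partial path from (0, 0) to (i, j), 0-based.
    ad = [[0.0] * m for _ in range(n)]
    # back[i][j]: predecessor of (i, j) on that path (the forward path matrix).
    back = [[None] * m for _ in range(n)]

    for i in range(n):
        for j in range(m):
            local = d(s1[i], s2[j])
            if i == 0 and j == 0:
                ad[i][j] = local
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((ad[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((ad[i - 1][j] + k_del, (i - 1, j)))
            if j > 0:
                candidates.append((ad[i][j - 1] + k_ins, (i, j - 1)))
            # min() keeps the first of equal candidates, so diagonal wins ties.
            best_cost, best_prev = min(candidates, key=lambda c: c[0])
            ad[i][j] = local + best_cost
            back[i][j] = best_prev

    # Back-track from the final node to (0, 0) using the path matrix.
    path = []
    node = (n - 1, m - 1)
    while node is not None:
        path.append((node[0] + 1, node[1] + 1))  # report 1-based nodes
        node = back[node[0]][node[1]]
    path.reverse()
    return ad[n - 1][m - 1], path


distance, path = dp_align("ABCD", "ACXCCD")
print(distance)
print(path)
```

The cost is proportional to the product of the two sequence lengths, since each trellis node is visited once and examines at most three predecessors. Several paths may share the minimum accumulated distance; which one is returned depends on the tie-breaking rule.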
Slide 27 EE3J2 Data Mining Summary Introduction to sequence analysis Definitions of: –distance, –forward accumulated distance, –forward path matrix, –optimal path Dynamic programming How to compute the accumulated distance using DP How to recover the optimal path