
1. Sequence Mining (Data Mining – Sequences, 7/03; H. Liu (ASU) & G. Dong (WSU))
– Sequences and Strings
– Recognition with Strings
– MM & HMM
– Sequence Association Rules

2. Sequences and Strings
A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence.
– Sequences and strings are often used as synonyms.
– String elements (characters, letters, or symbols) are nominal.
– A particularly long string is called a text.
|x| denotes the length of sequence x.
– |AGCTTC| is 6
Any contiguous string that is part of x is called a substring, segment, or factor of x.
– GCT is a factor of AGCTTC
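The definitions above map directly onto built-in Python string operations; as a quick illustration, `len` gives |x| and the `in` operator tests the contiguous-substring (factor) relation:

```python
x = "AGCTTC"

print(len(x))       # |AGCTTC| = 6
print("GCT" in x)   # True: GCT is a factor (contiguous substring) of x
print("GTC" in x)   # False: same symbols, but not contiguous in x
```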

3. Recognition with Strings
String matching
– Given x and text, determine whether x is a factor of text.
Edit distance (for inexact string matching)
– Given two strings x and y, compute the minimum number of basic operations (character insertions, deletions, and exchanges) needed to transform x into y.

4. String Matching
Given |text| >> |x|, with characters taken from an alphabet A
– A can be {0, 1}, {0, 1, 2, …, 9}, {A, G, C, T}, or {A, B, …}
A shift s is an offset needed to align the first character of x with character number s+1 in text.
Find whether there exists a valid shift at which there is a perfect match between the characters in x and the corresponding ones in text.

5. Naïve (Brute-Force) String Matching
Given A, x, text, n = |text|, m = |x|:
  s = 0
  while s ≤ n − m
      if x[1 … m] = text[s+1 … s+m]
          then print "pattern occurs at shift" s
      s = s + 1
Time complexity (worst case): O((n − m + 1)m)
Shifting one character at a time is not necessary.
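The brute-force pseudocode above translates directly into Python (zero-based indexing replaces the slide's 1-based ranges; the function name is ours):

```python
def naive_match(x: str, text: str) -> list[int]:
    """Return every shift s at which pattern x occurs in text."""
    n, m = len(text), len(x)
    shifts = []
    s = 0
    while s <= n - m:             # try every possible alignment
        if text[s:s + m] == x:    # compare m characters at shift s
            shifts.append(s)
        s += 1                    # advance one character at a time
    return shifts

print(naive_match("AG", "AGCTAG"))  # [0, 4]
```

The slice comparison hides the inner character loop, but the cost is the same: up to m character comparisons at each of the n − m + 1 shifts.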

6. Boyer–Moore and KMP
See StringMatching.ppt and do not use the following algorithm.
Given A, x, text, n = |text|, m = |x|
F(x) = last-occurrence function
G(x) = good-suffix function
  s = 0
  while s ≤ n − m
      j = m
      while j > 0 and x[j] = text[s+j]
          j = j − 1
      if j = 0
          then print "pattern occurs at shift" s
               s = s + G(0)
      else s = s + max[G(j), j − F(text[s+j])]
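Since the slide warns against using the pseudocode as given, a simplified but correct variant may help build intuition: Boyer–Moore–Horspool keeps only the bad-character (last-occurrence) rule and drops the good-suffix rule. This sketch is our illustration, not the full Boyer–Moore algorithm from the slide:

```python
def horspool_match(x: str, text: str) -> list[int]:
    """Boyer-Moore-Horspool: right-to-left compare, bad-character shifts only."""
    n, m = len(text), len(x)
    # last[c]: shift distance when the window's last character is c
    last = {c: m - 1 - i for i, c in enumerate(x[:-1])}
    shifts, s = [], 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and x[j] == text[s + j]:   # scan pattern right to left
            j -= 1
        if j < 0:
            shifts.append(s)
            s += 1                               # simple advance after a match
        else:
            s += last.get(text[s + m - 1], m)    # skip by bad-character rule
    return shifts

print(horspool_match("AG", "AGCTAG"))  # [0, 4]
```

Characters absent from the pattern let the window jump m positions at once, which is why these methods beat the one-character-at-a-time naive scan in practice.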

7. Edit Distance
The edit distance (ED) between x and y is the minimum number of fundamental operations required to transform x into y.
Fundamental operations (x = 'excused', y = 'exhausted'):
– Substitutions, e.g. 'c' is replaced by 'h'
– Insertions, e.g. 'a' is inserted into x after 'h'
– Deletions, e.g. a character in x is deleted
ED is one way of measuring the similarity between two strings.

8. Classification Using ED
The nearest-neighbor algorithm can be applied for pattern recognition.
– Training: strings are stored together with their class labels.
– Classification (testing): a test string is compared to each stored string and an ED is computed; the nearest stored string's label is assigned to the test string.
The key is how to calculate ED. An example of calculating ED:
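ED is conventionally computed by dynamic programming (the Levenshtein recurrence), filling a table d where d[i][j] is the distance between the first i characters of x and the first j characters of y. A sketch using the slide's own example pair:

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance: d[i][j] = ED between x[:i] and y[:j]."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                  # delete all i characters of x[:i]
    for j in range(n + 1):
        d[0][j] = j                  # insert all j characters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[m][n]

print(edit_distance("excused", "exhausted"))  # 3
```

The result 3 matches the operations listed on slide 7: substitute 'c' with 'h', insert 'a', and insert 't'.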

9. Hidden Markov Model
Markov model: transitional states
Hidden Markov model: additional visible states
Three problems:
– Evaluation
– Decoding
– Learning

10. Markov Model
The Markov property:
– Given the current state, the transition probability is independent of any previous states.
A simple Markov model:
– State ω(t) at time t
– Sequence of length T: ω^T = {ω(1), ω(2), …, ω(T)}
– Transition probability P(ω_j(t+1) | ω_i(t)) = a_ij
– It is not required that a_ij = a_ji
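Under the Markov property, the probability of a whole state sequence factors into a product of one-step transition probabilities. A minimal sketch with a made-up three-state transition matrix (the matrix values are our illustration, not from the slides):

```python
import numpy as np

# Hypothetical 3-state transition matrix: A[i, j] = a_ij = P(next = j | current = i)
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

def sequence_prob(states: list[int], A: np.ndarray, start: np.ndarray) -> float:
    """P(w(1), ..., w(T)) under the first-order Markov property."""
    p = start[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]      # only the current state matters, not the history
    return float(p)

# P(0 -> 0 -> 1 -> 2) starting deterministically in state 0
print(sequence_prob([0, 0, 1, 2], A, np.array([1.0, 0.0, 0.0])))  # 0.7 * 0.2 * 0.3
```

Note that A is not symmetric, matching the slide's remark that a_ij = a_ji is not required.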

11. Hidden Markov Model
Visible states:
– V^T = {v(1), v(2), …, v(T)}
Emitting a visible state v_k(t):
– P(v_k(t) | ω_j(t)) = b_jk
Only the visible states v_k(t) are accessible; the states ω_i(t) are unobservable.
A Markov model is ergodic if every state has a nonzero probability of occurring given some starting state.

12. Three Key Issues with HMMs
Evaluation
– Given an HMM, complete with transition probabilities a_ij and b_jk, determine the probability that a particular sequence of visible states V^T was generated by that model.
Decoding
– Given an HMM and a set of observations V^T, determine the most likely sequence of hidden states ω^T that led to V^T.
Learning
– Given the number of states and visible states and a set of training observations of visible symbols, determine the probabilities a_ij and b_jk.
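The evaluation problem is solved efficiently by the forward algorithm: α_j(t) accumulates the probability of seeing the first t observations and ending in hidden state j, so P(V^T) is the sum of the final α values. A sketch with a small hypothetical two-state, two-symbol model (the numbers are illustrative, not from the slides):

```python
import numpy as np

def forward(A: np.ndarray, B: np.ndarray, pi: np.ndarray, obs: list[int]) -> float:
    """Forward algorithm: P(observation sequence | HMM).
    A[i, j] = a_ij (transitions), B[j, k] = b_jk (emissions), pi = initial dist."""
    alpha = pi * B[:, obs[0]]              # alpha_j(1) = pi_j * b_j,obs(1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # alpha_j(t+1) = sum_i alpha_i(t) a_ij * b_jo
    return float(alpha.sum())

# Hypothetical model for illustration
A  = np.array([[0.6, 0.4],
               [0.5, 0.5]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
pi = np.array([0.5, 0.5])

print(forward(A, B, pi, [0, 1, 0]))
```

The same trellis with max in place of sum (plus backpointers) gives the Viterbi algorithm for the decoding problem; Baum–Welch addresses learning.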

13. Other Sequential Pattern Mining Problems
Sequence alignment (homology) and sequence assembly (genome sequencing)
Trend analysis
– Trend movement vs. cyclic variations, seasonal variations, and random fluctuations
Sequential pattern mining
– Various kinds of sequences (e.g. weblogs)
– Various methods: from GSP to PrefixSpan
Periodicity analysis
– Full periodicity, partial periodicity, cyclic association rules

14. Periodic Pattern
Full periodic pattern
– ABC ABC ABC
Partial periodic pattern
– ABC ADC ACC ABC
Pattern hierarchy
– ABC ABC ABC DE DE DE DE, ABC ABC ABC DE DE DE DE, ABC ABC ABC DE DE DE DE
– Sequences of transactions: [ABC:3 | DE:4]
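The full-vs-partial distinction above can be made concrete with a small check: a sequence is fully periodic with period p when every symbol equals the symbol one period earlier. The helper name is ours:

```python
def fully_periodic(seq: str, p: int) -> bool:
    """True if seq repeats its first p symbols exactly (full periodicity)."""
    return all(seq[i] == seq[i % p] for i in range(len(seq)))

print(fully_periodic("ABCABCABC", 3))  # True: full periodic pattern
print(fully_periodic("ABCADCACC", 3))  # False: only partially periodic
```

In the partial pattern ABC ADC ACC, only the positions holding 'A' recur every period; partial periodicity mining looks for exactly such per-position regularities rather than whole-period repeats.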

15. Sequence Association Rule Mining
SPADE (Sequential Pattern Discovery using Equivalence classes)
Constrained sequence mining (SPIRIT)

16. Bibliography
R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd Edition. Wiley Interscience, 2001.

17. [Figure: three-state Markov model — states ω1, ω2, ω3 connected by arcs labeled with transition probabilities a11, a12, …, a33]

18. [Figure: the same three states as a hidden Markov model — each hidden state ωj also emits visible symbols v1, v2, v3, v4 with emission probabilities bj1, …, bj4]

19. [Figure: trellis for HMM evaluation — states ω1, …, ωc unrolled over t = 1, 2, 3, …, T−1, T; each α_j(2) is computed from the α_i(1), the transitions a_12, a_22, a_32, …, a_c2, and the emission b_2k]

20. [Figure: worked evaluation example — a four-state trellis (ω0, …, ω3) over t = 0, …, 4 for the visible sequence v3 v1 v3 v2 v0, with the forward probabilities filled in at each node]

21. [Figure: left-to-right HMM for the spoken word "viterbi" — states ω0, …, ω7 for the phoneme sequence /v/ /i/ /t/ /e/ /r/ /b/ /i/ /-/]

22. [Figure: trellis for HMM decoding — the maximizing state ω_max(t) selected at each step t = 1, …, T to trace the most likely hidden-state path]

23. [Figure: worked decoding example — the same trellis values as slide 20, used to recover the most likely state path for v3 v1 v3 v2 v0]

