7. Sequence Mining
Data Mining – Sequences, H. Liu (ASU) & G. Dong (WSU), 7/03

– Sequences and Strings
– Recognition with Strings
– MM & HMM
– Sequence Association Rules

Sequences and Strings
A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence.
– Sequences and strings are often used as synonyms.
– String elements (characters, letters, or symbols) are nominal.
– A particularly long string is called a text.
|x| denotes the length of sequence x.
– |AGCTTC| = 6
Any contiguous string that is part of x is called a substring, segment, or factor of x.
– GCT is a factor of AGCTTC.
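A quick illustration of these definitions in Python, with strings standing in for sequences:

x = "AGCTTC"
print(len(x))        # |x| = 6
print("GCT" in x)    # GCT is a factor (substring) of x -> True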

Recognition with Strings
String matching
– Given x and text, determine whether x is a factor of text.
Edit distance (for inexact string matching)
– Given two strings x and y, compute the minimum number of basic operations (character insertions, deletions, and exchanges) needed to transform x into y.

String Matching
Given |text| >> |x|, with characters taken from an alphabet A.
– A can be {0, 1}, {0, 1, 2, …, 9}, {A, G, C, T}, or {A, B, …}.
A shift s is an offset that aligns the first character of x with character number s+1 of text.
Find whether there exists a valid shift at which every character of x matches the corresponding character of text.

Naïve (Brute-Force) String Matching
Given A, x, text, n = |text|, m = |x|:
  s = 0
  while s ≤ n - m
    if x[1 … m] = text[s+1 … s+m]
      then print "pattern occurs at shift" s
    s = s + 1
Time complexity (worst case): O((n-m+1)m).
Shifting one character at a time is not necessary.
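A minimal Python sketch of the brute-force matcher above (function and variable names are my own):

def naive_match(x, text):
    """Report every shift s at which pattern x occurs in text (brute force)."""
    n, m = len(text), len(x)
    shifts = []
    for s in range(n - m + 1):      # try every possible alignment
        if text[s:s + m] == x:      # compare m characters at shift s
            shifts.append(s)
    return shifts

print(naive_match("GCT", "AGCTTCGCT"))  # -> [1, 6]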

Boyer-Moore and KMP
See StringMatching.ppt; do not use the algorithm below as given.
Given A, x, text, n = |text|, m = |x|, F(x) = last-occurrence function, G(x) = good-suffix function:
  s = 0
  while s ≤ n - m
    j = m
    while j > 0 and x[j] = text[s+j]
      j = j - 1
    if j = 0
      then print "pattern occurs at shift" s
           s = s + G(0)
    else
      s = s + max[G(j), j - F(text[s+j])]
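As an alternative to the flawed version above, here is a standard Knuth-Morris-Pratt matcher as a reference sketch (not the slide's algorithm; names are my own):

def kmp_match(x, text):
    """Knuth-Morris-Pratt: report every shift at which x occurs in text, in O(n + m)."""
    m, n = len(x), len(text)
    # Failure function: f[j] = length of the longest proper prefix of
    # x[:j+1] that is also a suffix of x[:j+1].
    f = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and x[j] != x[k]:
            k = f[k - 1]
        if x[j] == x[k]:
            k += 1
        f[j] = k
    shifts, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != x[k]:
            k = f[k - 1]         # fall back instead of re-scanning text
        if c == x[k]:
            k += 1
        if k == m:               # full match ends at position i
            shifts.append(i - m + 1)
            k = f[k - 1]
    return shifts

print(kmp_match("GCT", "AGCTTCGCT"))  # -> [1, 6]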

Edit Distance
The edit distance (ED) between x and y is the minimum number of fundamental operations required to transform x into y.
Fundamental operations (x = 'excused', y = 'exhausted'):
– Substitutions, e.g. 'c' is replaced by 'h'
– Insertions, e.g. 'a' is inserted into x after 'h'
– Deletions, e.g. a character in x is deleted
ED is one way of measuring the similarity between two strings.
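A minimal dynamic-programming sketch of edit distance with unit costs (a standard Levenshtein-style formulation; names are my own):

def edit_distance(x, y):
    """Minimum number of substitutions, insertions, and deletions
    transforming x into y (unit cost each)."""
    m, n = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all i characters of x[:i]
    for j in range(n + 1):
        D[0][j] = j                      # insert all j characters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + sub)   # substitution or match
    return D[m][n]

print(edit_distance("excused", "exhausted"))  # -> 3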

Classification using ED
The nearest-neighbor algorithm can be applied for pattern recognition:
– Training: strings are stored together with their class labels.
– Classification (testing): the test string is compared to each stored string and an ED is computed; the label of the nearest stored string is assigned to the test string.
The key is how to calculate ED (see the edit-distance sketch above).
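A minimal 1-nearest-neighbor classifier over strings, reusing the edit_distance function sketched above (the training data is invented for illustration):

def nn_classify(test, training):
    """Assign the label of the stored string nearest to `test` under edit distance."""
    best_label, best_dist = None, float("inf")
    for stored, label in training:
        d = edit_distance(stored, test)
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label

# Hypothetical labeled training strings
training = [("AGCT", "class-1"), ("TTGACA", "class-2"), ("AGGT", "class-1")]
print(nn_classify("AGCTT", training))  # -> 'class-1' (ED of 1 to "AGCT")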

Hidden Markov Model
Markov model: transitional states
Hidden Markov model: additional visible states
– Evaluation
– Decoding
– Learning

Markov Model
The Markov property:
– Given the current state, the transition probability is independent of any previous states.
A simple Markov model:
– State ω(t) at time t
– Sequence of length T: ω^T = {ω(1), ω(2), …, ω(T)}
– Transition probability P(ω_j(t+1) | ω_i(t)) = a_ij
– It is not required that a_ij = a_ji
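A minimal sketch of scoring a state sequence under a first-order Markov model (the transition matrix and initial distribution are made-up numbers):

import numpy as np

# Hypothetical 3-state model: a[i, j] = P(state j at t+1 | state i at t)
a = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
pi = np.array([0.5, 0.3, 0.2])    # assumed initial state distribution

def sequence_prob(states):
    """P(omega(1), ..., omega(T)) = pi[omega(1)] * prod_t a[omega(t), omega(t+1)]."""
    p = pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= a[s, s_next]         # Markov property: depends only on the current state
    return p

print(sequence_prob([0, 0, 1, 2]))  # 0.5 * 0.7 * 0.2 * 0.3 = 0.021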

Hidden Markov Model
Visible states:
– V^T = {v(1), v(2), …, v(T)}
Emitting a visible state v_k(t):
– P(v_k(t) | ω_j(t)) = b_jk
Only the visible states v_k(t) are accessible; the states ω_i(t) are unobservable.
A Markov model is ergodic if every state has a nonzero probability of occurring given some starting state.

Three Key Issues with HMMs
Evaluation
– Given an HMM, complete with transition probabilities a_ij and b_jk, determine the probability that a particular sequence of visible states V^T was generated by that model.
Decoding
– Given an HMM and a set of observations V^T, determine the most likely sequence of hidden states ω^T that led to V^T.
Learning
– Given the number of hidden and visible states and a set of training observations of visible symbols, determine the probabilities a_ij and b_jk.
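A minimal sketch of the forward algorithm for the evaluation problem, reusing the hypothetical transition matrix above plus a made-up emission matrix:

import numpy as np

# Same toy model as above, plus hypothetical emissions b[j, k] = P(symbol k | state j)
a = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]])
pi = np.array([0.5, 0.3, 0.2])
b = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]])

def forward(obs):
    """Evaluation: P(observed symbol sequence | model), computed in O(T * c^2)."""
    alpha = pi * b[:, obs[0]]            # alpha_j(1) = pi_j * b_j(v(1))
    for v in obs[1:]:
        alpha = (alpha @ a) * b[:, v]    # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b_j(v(t))
    return alpha.sum()                   # sum over all final hidden states

print(forward([0, 2, 1]))                # probability of observing symbols v0 v2 v1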

Other Sequential Pattern Mining Problems
Sequence alignment (homology) and sequence assembly (genome sequencing)
Trend analysis
– Trend movement vs. cyclic variations, seasonal variations, and random fluctuations
Sequential pattern mining
– Various kinds of sequences (e.g., weblogs)
– Various methods: from GSP to PrefixSpan
Periodicity analysis
– Full periodicity, partial periodicity, cyclic association rules

Periodic Pattern
Full periodic pattern
– ABC ABC ABC
Partial periodic pattern
– ABC ADC ACC ABC (A and C recur at fixed positions within each period; the middle position varies)
Pattern hierarchy
– ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE
– Sequences of transactions: [ABC:3 | DE:4]
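A minimal sketch for testing whether a sequence matches a (partial) periodic pattern, with '*' marking a don't-care position (the pattern syntax and names are my own):

def matches_periodic(seq, pattern):
    """True if every period-length block of seq agrees with pattern,
    where '*' in the pattern matches any symbol."""
    p = len(pattern)
    if len(seq) % p != 0:
        return False
    blocks = [seq[i:i + p] for i in range(0, len(seq), p)]
    return all(pc == "*" or pc == bc
               for block in blocks
               for pc, bc in zip(pattern, block))

print(matches_periodic("ABCABCABC", "ABC"))     # full periodicity -> True
print(matches_periodic("ABCADCACCABC", "A*C"))  # partial periodicity -> True
print(matches_periodic("ABCADCACCABC", "AB*"))  # -> False (second block is ADC)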

Sequence Association Rule Mining
– SPADE (Sequential Pattern Discovery using Equivalence classes)
– Constrained sequence mining (SPIRIT)

Bibliography
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd Edition. Wiley-Interscience.

[Figure: a fully connected three-state Markov model with states ω_1, ω_2, ω_3 and transition probabilities a_11 … a_33.]

[Figure: the same three-state model extended to a hidden Markov model; each hidden state ω_j also emits visible symbols v_1 … v_4 with probabilities b_j1 … b_j4.]

[Figure: the forward-algorithm trellis; α_j(2) is computed for states ω_1 … ω_c from the α_i(1), the transitions a_i2, and the emission b_2k of the observed symbol v_k, across time steps t = 1 … T.]

[Figure: a worked forward-pass example on a four-state model ω_0 … ω_3 with observation sequence v_3 v_1 v_3 v_2 v_0 and numeric step products (0.2 × …, … × 0.5).]

[Figure: a left-to-right HMM for the spoken word "viterbi"; states ω_0 … ω_7 emit the phonemes /v/ /i/ /t/ /e/ /r/ /b/ /i/ /-/.]

[Figure: the decoding trellis; at each time step t = 1 … T the maximum over states ω_0 … ω_c extends the most probable path of hidden states.]

[Figure: a worked decoding example on the four-state model with observation sequence v_3 v_1 v_3 v_2 v_0.]