CSE Data Mining, 2002 Lecture 10
Data Mining - CSE5230
Hidden Markov Models (HMMs)
CSE5230/DMS/2002/10
Lecture Outline
- Time- and space-varying processes
- First-order Markov models
- Hidden Markov models
- Examples: coin toss experiments
- Formal definition
- Use of HMMs for classification
- References
Time- and Space-varying Processes (1)
- The data mining techniques we have discussed so far have focused on the classification, prediction or characterization of single data points, e.g.:
  - Assigning a record to one of a set of classes
    - Decision trees, back-propagation neural networks, Bayesian classifiers, etc.
  - Predicting the value of a field in a record given the values of the other fields
    - Regression, back-propagation neural networks, etc.
  - Finding regions of feature space where data points are densely grouped
    - Clustering, self-organizing maps
Time- and Space-varying Processes (2)
- In the methods we have considered so far, we have assumed that each observed data point is statistically independent of the observation that preceded it, e.g.
  - Classification: the class of data point $x_t$ is not influenced by the class of $x_{t-1}$ (or indeed any other data point)
  - Prediction: the value of a field for a record depends only on the values of the other fields of that record, not on values in any other records.
- Several important real-world data mining problems cannot be modeled in this way.
Time- and Space-varying Processes (3)
- We often encounter sequences of observations, where each observation may depend on the observations which preceded it
- Examples
  - Sequences of phonemes (fundamental sounds) in speech (speech recognition)
  - Sequences of letters or words in text (text categorization, information retrieval, text mining)
  - Sequences of web page accesses (web usage mining)
  - Sequences of bases (CGAT) in DNA (genome projects: human, fruit fly, etc.)
  - Sequences of pen-strokes (handwriting recognition)
- In all these cases, the probability of observing a particular value in the sequence can depend on the values which came before it
Example: web log
- Consider the following extract from a web log:

xxx - - [16/Sep/2002:14:50: ]"GET /courseware/cse5230/ HTTP/1.1"
xxx - - [16/Sep/2002:14:50: ]"GET /courseware/cse5230/html/research_paper.html HTTP/1.1"
xxx - - [16/Sep/2002:14:51: ]"GET /courseware/cse5230/html/tutorials.html HTTP/1.1"
xxx - - [16/Sep/2002:14:51: ]"GET /courseware/cse5230/assets/images/citation.pdf HTTP/1.1"
xxx - - [16/Sep/2002:14:51: ]"GET /courseware/cse5230/assets/images/citation.pdf HTTP/1.1"
xxx - - [16/Sep/2002:14:51: ]"GET /courseware/cse5230/assets/images/clustering.pdf HTTP/1.1"
xxx - - [16/Sep/2002:14:51: ]"GET /courseware/cse5230/assets/images/clustering.pdf HTTP/1.1"
xxx - - [16/Sep/2002:14:51: ]"GET /courseware/cse5230/assets/images/NeuralNetworksTute.pdf HTTP/1.1"
xxx - - [16/Sep/2002:14:51: ]"GET /courseware/cse5230/assets/images/NeuralNetworksTute.pdf HTTP/1.1"
xxx - - [16/Sep/2002:14:52: ]"GET /courseware/cse5230/html/lectures.html HTTP/1.1"
xxx - - [16/Sep/2002:14:52: ]"GET /courseware/cse5230/assets/images/week03.ppt HTTP/1.1"
xxx - - [16/Sep/2002:14:52: ]"GET /courseware/cse5230/assets/images/week06.ppt HTTP/1.1"

- Clearly the URL which is requested depends on the URL which was requested before
  - If the user uses the "Back" button in his/her browser, the requested URL may depend on earlier URLs in the sequence too
- Given a particular observed URL, we can calculate the probabilities of observing each of the possible URLs next.
  - Note that we may even observe the same URL next.
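One simple way to obtain such next-URL probabilities is to count how often each URL follows each other URL. The sketch below is illustrative only: it assumes all twelve requests come from a single visitor, strips the common prefix /courseware/cse5230/ for brevity, and uses a hypothetical helper name, estimate_transitions.

```python
from collections import Counter, defaultdict


def estimate_transitions(urls):
    """Estimate P(next URL | current URL) from consecutive pairs of requests."""
    pair_counts = defaultdict(Counter)
    for current, following in zip(urls, urls[1:]):
        pair_counts[current][following] += 1
    # Normalise each row of counts into a probability distribution.
    return {
        current: {nxt: n / sum(counts.values()) for nxt, n in counts.items()}
        for current, counts in pair_counts.items()
    }


# Requested paths from the log extract, relative to /courseware/cse5230/
requests = [
    "",
    "html/research_paper.html",
    "html/tutorials.html",
    "assets/images/citation.pdf",
    "assets/images/citation.pdf",
    "assets/images/clustering.pdf",
    "assets/images/clustering.pdf",
    "assets/images/NeuralNetworksTute.pdf",
    "assets/images/NeuralNetworksTute.pdf",
    "html/lectures.html",
    "assets/images/week03.ppt",
    "assets/images/week06.ppt",
]

probs = estimate_transitions(requests)
# citation.pdf is followed once by itself and once by clustering.pdf
print(probs["assets/images/citation.pdf"])
```

With so few requests the estimates are very rough; a real web usage mining task would pool many sessions before normalising the counts.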
First-Order Markov Models (1)
- In order to model processes such as these, we make use of the idea of states. At any time t, we consider the system to be in state w(t).
- We can consider a sequence of successive states of length T: $w^T = \{w(1), w(2), \ldots, w(T)\}$
- We will model the production of such a sequence using transition probabilities: $a_{ij} = P(w_j(t+1) \mid w_i(t))$
- This is the probability that the system will be in state $w_j$ at time t+1 given that it was in state $w_i$ at time t
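Under these assumptions, the probability of a whole state sequence is the probability of its first state multiplied by the transition probability of each successive step. The sketch below illustrates this; the states s1 and s2, their probabilities, and the initial-state distribution are hypothetical values chosen for the example (the slide itself does not specify how the first state is chosen).

```python
def sequence_probability(states, initial, transition):
    """Return P(w(1), ..., w(T)) for a first-order Markov model.

    initial[i] is the assumed probability of starting in state i;
    transition[i][j] is a_ij, the probability of moving from i to j.
    """
    prob = initial[states[0]]
    for current, following in zip(states, states[1:]):
        prob *= transition[current][following]
    return prob


# Hypothetical two-state model, for illustration only
initial = {"s1": 0.5, "s2": 0.5}
transition = {
    "s1": {"s1": 0.8, "s2": 0.2},
    "s2": {"s1": 0.4, "s2": 0.6},
}

# 0.5 (start in s1) * 0.8 (s1 -> s1) * 0.2 (s1 -> s2)
p = sequence_probability(["s1", "s1", "s2"], initial, transition)
print(p)
```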
First-Order Markov Models (2)
- A model of states and transition probabilities, such as the one we have just described, is called a Markov model.
- Since we have assumed that the transition probabilities depend only on the previous state, this is a first-order Markov model
  - Higher-order Markov models are possible, but we will not consider them here.
- For example, Markov models for human speech could have states corresponding to phonemes
  - A Markov model for the word "cat" would have states for /k/, /a/, /t/ and a final silent state
Example: Markov model for "cat"
[Figure: state diagram with the states /k/, /a/, /t/ and /silent/]
Hidden Markov Models
- In the preceding example, we have said that the states correspond to phonemes
- In a speech recognition system, however, we don't have access to phonemes: we can only measure properties of the sound produced by a speaker
- In general, our observed data does not correspond directly to a state of the model: the data corresponds to the visible states of the system
  - The visible states are directly accessible for measurement.
- The system can also have internal "hidden" states, which cannot be observed directly
  - For each hidden state, there is a probability of observing each visible state.
- This sort of model is called a Hidden Markov Model (HMM)
Example: coin toss experiments
- Let us imagine a scenario where we are in a room which is divided in two by a curtain.
- We are on one side of the curtain, and on the other is a person who will carry out a procedure using coins resulting in a head (H) or a tail (T).
- When the person has carried out the procedure, they call out the result, H or T, which we record.
- This system will allow us to generate a sequence of Hs and Ts, e.g.

HHTHTHTTHTTTTTHHTHHHHTHHHTTHHHHHHTTT
TTTTTHTHHTHTTTTTHHTHTHHHTHTHHTTTTHHT
TTHHTHHTTTHTHTHTHTHHHTHHTTHT…
Example: single fair coin
- Imagine that the person behind the curtain has a single fair coin (i.e. it has equal probabilities of coming up heads or tails)
- We could model the process producing the sequence of Hs and Ts as a Markov model with two states, and equal transition probabilities:

[Figure: two-state Markov model with states H and T; every transition has probability 0.5]

- Note that here the visible states correspond exactly to the internal states: the model is not hidden
- Note also that states can transition to themselves
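The two-state model for the fair coin can be simulated directly from its transition table. This is a minimal sketch: the function name simulate_chain, the fixed random seed and the choice of a random first state are assumptions made for the example.

```python
import random

# Transition probabilities of the fair-coin Markov model: every entry is 0.5
TRANSITION = {
    "H": {"H": 0.5, "T": 0.5},
    "T": {"H": 0.5, "T": 0.5},
}


def simulate_chain(transition, start, length, rng):
    """Generate `length` states, drawing each next state from the current row."""
    state = start
    visited = [state]
    for _ in range(length - 1):
        candidates = list(transition[state])
        weights = [transition[state][s] for s in candidates]
        state = rng.choices(candidates, weights=weights)[0]
        visited.append(state)
    return "".join(visited)


rng = random.Random(0)  # fixed seed so the run is repeatable
seq = simulate_chain(TRANSITION, rng.choice("HT"), 10000, rng)
print(seq[:40])
print("fraction of H:", seq.count("H") / len(seq))
```

Because each state is visible, the generated string is both the state sequence and the observation sequence.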
Example: a fair and a biased coin
- Now let us imagine a more complicated scenario. The person behind the curtain has two coins, one fair and one biased (for example, P(T) = 0.9)
  1. The person starts by picking a coin at random
  2. The person tosses the coin, and calls out the result (H or T)
  3. If the result was H, the person switches coins
  4. Go back to step 2, and repeat.
- This process generates sequences like:

TTTTTTTTTTTTTTTTTTTTTTTTHHTTTTTTTHHTTTTTTT
TTTTTTTTTTTTTTTHHTTTTTTTTTHTTHTTHHTTTTTHHT
TTTTTTTTTHHTTTTTTTTHTHHHTTTTTTTTTTTTTTHHTT
TTTTTHTHTTTTTTTHHTTTTT…

- Note that this looks quite different from the sequence for the fair coin example.
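The four steps of the procedure can be sketched as a short simulation. The function name toss_two_coins and the fixed seed are assumptions for the example; the biased coin uses P(T) = 0.9 as on the slide, and "at random" in step 1 is read as an equal chance for each coin.

```python
import random

# Probability of heads for each coin; the biased coin has P(T) = 0.9
P_HEADS = {"fair": 0.5, "biased": 0.1}


def toss_two_coins(n_tosses, rng):
    """Simulate the procedure; return the called results and the coins used."""
    coin = rng.choice(["fair", "biased"])  # step 1: pick a coin at random
    results, coins = [], []
    for _ in range(n_tosses):
        # step 2: toss the current coin and call out the result
        result = "H" if rng.random() < P_HEADS[coin] else "T"
        results.append(result)
        coins.append(coin)
        # step 3: switch coins after a head
        if result == "H":
            coin = "biased" if coin == "fair" else "fair"
        # step 4: repeat from step 2
    return "".join(results), coins


rng = random.Random(0)  # fixed seed so the run is repeatable
visible, hidden = toss_two_coins(10000, rng)
print(visible[:60])
print("fraction of T:", visible.count("T") / len(visible))
```

The observer behind the curtain only ever sees `visible`; the list `hidden` records which coin produced each result. In the long run the person holds the biased coin five-sixths of the time, so roughly five tosses in six come up tails.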
Example: a fair and a biased coin
- In this scenario, the visible state no longer corresponds exactly to the hidden state of the system:
  - Visible state: output of H or T
  - Hidden state: which coin was tossed
- We can model this process using an HMM:

[Figure: HMM with hidden states Fair and Biased, each emitting the visible states H and T]
Example: a fair and a biased coin
- We see from the diagram on the preceding slide that we have extended our model
  - The visible states are shown in blue, and the emission probabilities are shown too.
- As well as internal states w(t) and state transition probabilities $a_{ij}$, we have visible states v(t) and emission probabilities $b_{jk}$
  - Note that the $b_{jk}$ do not need to be related to the $a_{ij}$ as they are in the example above.
- A full model such as this is called a Hidden Markov Model
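As an illustration, the parameters of the two-coin model can be written out as two tables. The numbers below are not taken from the diagram, whose values are not reproduced in this text; they are derived from the procedure described earlier, assuming P(T) = 0.9 for the biased coin and a switch of coins exactly when a head is called. The helper is_stochastic is a hypothetical name.

```python
# a_ij: probability of moving from coin i to coin j.
# Derived assumption: the person switches with the probability of a head,
# so the "switch" entry of each row equals that coin's P(H).
A = {
    "fair": {"fair": 0.5, "biased": 0.5},
    "biased": {"fair": 0.1, "biased": 0.9},
}

# b_jk: probability that coin j produces visible state k
B = {
    "fair": {"H": 0.5, "T": 0.5},
    "biased": {"H": 0.1, "T": 0.9},
}


def is_stochastic(table, tol=1e-9):
    """Check that every row is a probability distribution."""
    return all(
        abs(sum(row.values()) - 1.0) < tol and all(p >= 0 for p in row.values())
        for row in table.values()
    )


print(is_stochastic(A), is_stochastic(B))
# The link between the two tables in this particular example:
print(A["fair"]["biased"] == B["fair"]["H"], A["biased"]["fair"] == B["biased"]["H"])
```

In a standard HMM the next state is drawn independently of the emitted symbol, so these tables capture how often the person switches coins but not the exact coupling between calling H and switching.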
HMM: formal definition
- We can now give a more formal definition of a first-order Hidden Markov Model (adapted from [RaJ1986]):
  - There is a finite number of (internal) states, N
  - At each time t, a new state is entered, based upon a transition probability distribution which depends on the state at time t - 1. Self-transitions are allowed
  - After each transition is made, a symbol is output, according to a probability distribution which depends only on the current state. There are thus N such probability distributions.
- Estimating the number of states N, and the transition and emission probabilities, is a complex problem, but solutions do exist.
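The definition translates almost line for line into a generator of sequences. The sketch below is one possible reading: sample_hmm is a hypothetical name, the model values are made up for illustration, and the distribution over the first state (`initial`) is an added assumption, since the definition above does not say how the first state is chosen.

```python
import random


def draw(distribution, rng):
    """Draw one key from a {outcome: probability} mapping."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights)[0]


def sample_hmm(initial, transition, emission, length, rng):
    """Generate hidden states and output symbols from a first-order HMM."""
    states, symbols = [], []
    state = None
    for _ in range(length):
        # Enter a new state: it depends only on the previous state
        # (self-transitions are allowed).
        state = draw(initial if state is None else transition[state], rng)
        # Output a symbol: it depends only on the current state.
        symbols.append(draw(emission[state], rng))
        states.append(state)
    return states, symbols


# Hypothetical model with N = 2 states and two output symbols
initial = {"s1": 0.5, "s2": 0.5}
transition = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.2, "s2": 0.8}}
emission = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.3, "b": 0.7}}

states, symbols = sample_hmm(initial, transition, emission, 20, random.Random(0))
print("".join(symbols))
```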
Use of HMMs
- We have now seen what sorts of processes can be modeled using HMMs, and how an HMM is specified mathematically.
- We now consider how HMMs are actually used.
- Consider the two H and T sequences we saw in the previous examples:
  - How could we decide which coin-toss system was most likely to have produced each sequence?
- To which system would you assign these sequences?

1: TTTHHTTTTTTTTTTTTTHHTTTTTTHHTTTHH
2: THHTTTHHHTTHTHTTHTHHTTHHHTTHTHTHT
3: THHTHTHTHTHHHTTHTTTHHTTHTTTTTHHHT
4: HTTTHTTHTTTTHTTTHHTTHTHTTTTTTTTHT

- We can answer this question using a Bayesian formulation (see last week's lecture)
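For these two particular systems the Bayesian comparison can be sketched directly. The code below assumes equal prior probabilities for the two systems, P(T) = 0.9 for the biased coin, and an equal chance of starting with either coin; the function names are hypothetical. Because the person switches coins exactly when H is called, the coin used for every toss is determined once the starting coin is known, so the likelihood is a sum over just two cases.

```python
P_HEADS = {"fair": 0.5, "biased": 0.1}  # biased coin: P(T) = 0.9


def likelihood_single_fair_coin(seq):
    """P(sequence | single fair coin): every toss has probability 0.5."""
    return 0.5 ** len(seq)


def likelihood_two_coins(seq):
    """P(sequence | fair-and-biased procedure), summed over the starting coin."""
    total = 0.0
    for start in ("fair", "biased"):
        coin, prob = start, 0.5  # each coin is picked first with probability 0.5
        for result in seq:
            p_h = P_HEADS[coin]
            prob *= p_h if result == "H" else 1.0 - p_h
            if result == "H":  # a head means the person switches coins
                coin = "biased" if coin == "fair" else "fair"
        total += prob
    return total


def posterior_two_coins(seq, prior_two_coins=0.5):
    """Bayes' rule: P(two-coin system | sequence)."""
    two = likelihood_two_coins(seq) * prior_two_coins
    one = likelihood_single_fair_coin(seq) * (1.0 - prior_two_coins)
    return two / (two + one)


SEQUENCES = {
    1: "TTTHHTTTTTTTTTTTTTHHTTTTTTHHTTTHH",
    2: "THHTTTHHHTTHTHTTHTHHTTHHHTTHTHTHT",
    3: "THHTHTHTHTHHHTTHTTTHHTTHTTTTTHHHT",
    4: "HTTTHTTHTTTTHTTTHHTTHTHTTTTTTTTHT",
}
for number, seq in SEQUENCES.items():
    print(number, round(posterior_two_coins(seq), 4))
```

With equal priors, sequence 1 comes out strongly in favour of the two-coin system and sequences 2 and 3 in favour of the single fair coin, while sequence 4 is a much closer call.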
Use of HMMs for classification
- HMMs are often used to classify sequences
- To do this, a separate HMM is built and trained (i.e. the parameters are estimated) for each class of sequence in which we are interested
  - e.g. we might have an HMM for each word in a speech recognition system. The hidden states would correspond to phonemes, and the visible states to measured sound features
- For a given observed sequence $v^T$, we estimate the probability that each HMM $M_l$ generated it: $P(M_l \mid v^T) = \frac{P(v^T \mid M_l)\, P(M_l)}{P(v^T)}$
- We assign the sequence to the model with the highest posterior probability.
- The algorithms for calculating these probabilities are beyond the scope of this unit, but can be found in the references.
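For readers who want a preview, the likelihood of a sequence under each model is commonly computed with the forward algorithm, which is described in the references. The sketch below is not part of the unit material: the model tables, the initial-state distributions, the equal priors and all function names are assumptions made for illustration, and the two-coin model is the standard-HMM reading of the coin procedure, in which the next coin is drawn independently of the symbol just emitted.

```python
def forward_likelihood(observations, initial, transition, emission):
    """P(observations | model) by the forward algorithm (non-empty input)."""
    states = list(initial)
    # alpha[j]: probability of the observations so far, ending in state j
    alpha = {j: initial[j] * emission[j][observations[0]] for j in states}
    for symbol in observations[1:]:
        alpha = {
            j: emission[j][symbol] * sum(alpha[i] * transition[i][j] for i in states)
            for j in states
        }
    return sum(alpha.values())


def classify(observations, models, priors):
    """Return the model with the highest posterior, and all posteriors."""
    joint = {
        name: forward_likelihood(observations, *params) * priors[name]
        for name, params in models.items()
    }
    evidence = sum(joint.values())  # P(observations), the normalising term
    posteriors = {name: value / evidence for name, value in joint.items()}
    return max(posteriors, key=posteriors.get), posteriors


# Each model is (initial, transition, emission); all values are assumptions.
MODELS = {
    "fair_coin": (
        {"coin": 1.0},
        {"coin": {"coin": 1.0}},
        {"coin": {"H": 0.5, "T": 0.5}},
    ),
    "two_coins": (
        {"fair": 0.5, "biased": 0.5},
        {"fair": {"fair": 0.5, "biased": 0.5}, "biased": {"fair": 0.1, "biased": 0.9}},
        {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.1, "T": 0.9}},
    ),
}
PRIORS = {"fair_coin": 0.5, "two_coins": 0.5}

label, posteriors = classify("TTTTTTTTTT", MODELS, PRIORS)
print(label, posteriors)
```

The products of probabilities shrink quickly as sequences grow, so a practical implementation would rescale alpha at each step or work with logarithms.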
References
- [DHS2000] Richard O. Duda, Peter E. Hart and David G. Stork, Pattern Classification (2nd Edn), Wiley, New York, NY, 2000, pp
- [RaJ1986] L. R. Rabiner and B. H. Juang, An introduction to hidden Markov models, IEEE Magazine on Acoustics, Speech and Signal Processing, 3, 1, pp. 4-16, January 1986.