Probabilistic Model of Sequences
Bob Durrant
School of Computer Science, University of Birmingham
(Slides: Dr Ata Kabán)


Sequence
What is a sequence?
Example 1 (Alphabet1 = {a,b,c}): a b a c a b a b a c. Here is a sequence of characters, can you see it?
Example 2 (Alphabet2 = {0,1}): a binary sequence.
Example 3 (Alphabet3 = {1,2,3,4,5,6}): roll a six-sided die N times. You get a sequence. Roll it again: you get another sequence.

Probabilistic Model
Model = a system that simulates the sequence under consideration.
Probabilistic model = a model that produces different outcomes with different probabilities.
– It includes uncertainty.
– It can therefore simulate a whole class of sequences and assigns a probability to each individual sequence.
Could you simulate any of the sequences on the previous slide?

Random sequence model
Back to the die example (the die may possibly be loaded):
– Model of a roll: has 6 parameters, p(1), p(2), p(3), p(4), p(5), p(6).
– Here, p(i) is the probability of throwing i.
– To be probabilities, these must be non-negative and must sum to one.
– What is the probability of the sequence [1, 6, 3]? p(1)*p(6)*p(3).
NOTE: in the random sequence model, the individual symbols in a sequence do not depend on each other. This is the simplest sequence model.
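A minimal sketch of this model in code (the fair-die parameters below are an illustration, not part of the slide):

    # Random (i.i.d.) sequence model: P(sequence) = product of per-symbol probabilities.
    # The parameter values here are illustrative, not estimated from any real data.
    die_model = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}  # a fair die

    def sequence_probability(sequence, symbol_probs):
        """Probability of a sequence under the random sequence model."""
        prob = 1.0
        for symbol in sequence:
            prob *= symbol_probs[symbol]
        return prob

    print(sequence_probability([1, 6, 3], die_model))  # p(1)*p(6)*p(3) = (1/6)**3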

Maximum Likelihood parameter estimation
The parameters of a probabilistic model are typically estimated from large sets of trusted examples, called the training set.
Example (t = tail, h = head): [t t t h t h h t]
– Count up the frequencies: t → 5, h → 3.
– Compute probabilities: p(t) = 5/(5+3), p(h) = 3/(5+3).
– These are the Maximum Likelihood (ML) estimates of the parameters of the coin.
– Does it make sense?
– What if you know the coin is fair?
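The same counting recipe as a short code sketch (the helper name ml_estimate is mine):

    from collections import Counter

    def ml_estimate(sequence):
        """Maximum Likelihood estimates: the relative frequency of each symbol."""
        counts = Counter(sequence)
        total = sum(counts.values())
        return {symbol: count / total for symbol, count in counts.items()}

    print(ml_estimate(['t', 't', 't', 'h', 't', 'h', 'h', 't']))
    # {'t': 0.625, 'h': 0.375}, i.e. p(t) = 5/8, p(h) = 3/8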

Overfitting
A fair coin has probabilities p(t) = 0.5, p(h) = 0.5.
If you throw it 3 times and get [t, t, t], then the ML estimates for this sequence are p(t) = 1, p(h) = 0.
Consequently, from this estimate, the probability of e.g. the sequence [h, t, h, t] = ………….
This is an example of what is called overfitting. Overfitting is the greatest enemy of Machine Learning!
Solution 1: Get more data.
Solution 2: Build what you already know into the model. (We will return to this during the module.)
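One common way to build prior knowledge into the model is to add pseudo-counts (Laplace smoothing) before normalising; this is a sketch of that idea, not a method prescribed by the slide:

    from collections import Counter

    def smoothed_estimate(sequence, alphabet, pseudo_count=1):
        """Frequency estimate with pseudo-counts added, so no symbol gets probability zero.
        The pseudo-counts encode prior knowledge (e.g. that both coin faces can occur)."""
        counts = Counter(sequence)
        total = len(sequence) + pseudo_count * len(alphabet)
        return {s: (counts[s] + pseudo_count) / total for s in alphabet}

    print(smoothed_estimate(['t', 't', 't'], alphabet=['h', 't']))  # {'h': 0.2, 't': 0.8}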

Why is it called Maximum Likelihood?
It can be shown that using the frequencies to compute probabilities maximises the total probability of the training sequences given the model (the likelihood). That is, P(Data|Parameters), the probability of observing the (training) data given a set of parameters, is maximised over all possible parameter settings by setting the parameters in this way.
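In symbols, for the coin example (this is the standard definition of the likelihood, written out here rather than taken from the slide):

    % Likelihood of the training sequence under the random sequence model,
    % for a coin with n_t tails and n_h heads in the data:
    L(\theta) = P(\mathrm{Data} \mid \theta)
              = \prod_{i=1}^{N} p(x_i)
              = p(t)^{n_t} \, p(h)^{n_h},
    \qquad p(t) + p(h) = 1.

The ML estimates are the values of p(t) and p(h) that make L(θ) as large as possible; the derivation sketched further below shows that these are exactly the relative frequencies.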

Probabilities
We have two dice, D1 and D2.
The probability of rolling i given die D1 is called P(i|D1). This is a conditional probability.
Pick a die at random with probability P(Dj), j = 1 or 2. The probability of picking die Dj and rolling i is called a joint probability, and is P(i,Dj) = P(Dj)P(i|Dj).
For any events X and Y, P(X,Y) = P(X|Y)P(Y).
If we know P(X,Y), then the so-called marginal probability P(X) can be computed by summing out Y: P(X) = Σ_Y P(X,Y) = Σ_Y P(X|Y)P(Y).
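A small numerical illustration of these three kinds of probability (the die probabilities below are made up; the slide does not specify them):

    # Two dice: D1 is fair, D2 is loaded towards 6 (illustrative numbers only).
    P_die = {"D1": 0.5, "D2": 0.5}                       # P(Dj): which die we pick
    P_roll_given_die = {
        "D1": {i: 1/6 for i in range(1, 7)},             # P(i|D1)
        "D2": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},  # P(i|D2)
    }

    def joint(i, dj):
        """Joint probability P(i, Dj) = P(Dj) * P(i|Dj)."""
        return P_die[dj] * P_roll_given_die[dj][i]

    def marginal(i):
        """Marginal probability P(i) = sum over Dj of P(i, Dj)."""
        return sum(joint(i, dj) for dj in P_die)

    print(joint(6, "D2"))   # 0.5 * 0.5 = 0.25
    print(marginal(6))      # 0.5*(1/6) + 0.5*0.5 ≈ 0.333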

Now, we show that maximising P(Data|Parameters) for the random sequence model leads to the frequency-based computation that we did intuitively.
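A sketch of the standard argument (the slide's own working is not included in this transcript): maximise the log-likelihood subject to the probabilities summing to one, using a Lagrange multiplier.

    % Log-likelihood of a sequence with n_i occurrences of symbol i, parameters p_i:
    \log L(p) = \sum_i n_i \log p_i, \qquad \text{subject to } \sum_i p_i = 1.

    % Lagrangian and its stationary point:
    \Lambda(p, \lambda) = \sum_i n_i \log p_i + \lambda \Big( 1 - \sum_i p_i \Big),
    \qquad
    \frac{\partial \Lambda}{\partial p_i} = \frac{n_i}{p_i} - \lambda = 0
    \;\Rightarrow\; p_i = \frac{n_i}{\lambda}.

    % Enforcing the constraint gives \lambda = \sum_i n_i = N, hence:
    \hat{p}_i = \frac{n_i}{N}.

That is, the maximising parameters are exactly the relative frequencies we computed intuitively.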

Why did we bother? Because in more complicated models we cannot ‘guess’ the result.

Markov Chains
Further examples of sequences:
– Bio-sequences
– Web page request sequences while browsing
These are no longer random sequences; they have a time-structure.
How many parameters would such a model have? (A rough count is sketched below.) We need to make simplifying assumptions to end up with a reasonable number of parameters.
The first-order Markov assumption: each observation depends only on the immediately previous one, not on the longer history.
Markov Chain = a sequence model which makes the Markov assumption.
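As a rough count (a standard back-of-the-envelope calculation; the symbols K and N are mine): for an alphabet of K symbols and sequences of length N,

    % A model that assigns a separate probability to every possible length-N sequence:
    K^N - 1 \ \text{free parameters (exponential in } N\text{)}.

    % A first-order, stationary Markov chain: initial distribution plus transition matrix:
    (K - 1) + K(K - 1) \ \text{free parameters (independent of } N\text{)}.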

Markov Chains
The probability of a Markov sequence factorises as P(x1, x2, …, xN) = P(x1) P(x2|x1) P(x3|x2) … P(xN|xN-1).
The alphabet's symbols are also called states.
Once the parameters are estimated from training data, the Markov chain can be used for prediction.
Amongst others, Markov Chains are successful for web browsing behavior prediction.
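A minimal sketch of this factorisation in code; the initial distribution below is made up, and the transition numbers simply reuse the rain example from a later slide:

    # P(x1, ..., xN) = P(x1) * product over i >= 2 of P(xi | x(i-1))
    initial = {"rain": 0.5, "dry": 0.5}
    transition = {
        "rain": {"rain": 0.8, "dry": 0.2},
        "dry":  {"rain": 0.6, "dry": 0.4},
    }

    def markov_sequence_probability(sequence, initial, transition):
        """Probability of a sequence under a first-order, stationary Markov chain."""
        prob = initial[sequence[0]]
        for prev, curr in zip(sequence, sequence[1:]):
            prob *= transition[prev][curr]
        return prob

    print(markov_sequence_probability(["rain", "rain", "dry"], initial, transition))
    # 0.5 * 0.8 * 0.2 = 0.08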

Markov Chains
A Markov Chain is stationary if it has the same transition probabilities at any time. We assume stationary models here.
The parameters of the model then consist of the transition probability matrix and the initial state probabilities.

ML parameter estimation
We can derive how to compute the parameters of a Markov Chain from data using Maximum Likelihood, as we did for random sequences.
The ML estimate of the transition matrix is again very intuitive: the estimated probability of the transition from state s to state t is the number of observed s-to-t transitions, divided by the total number of observed transitions out of s.
Remember that P(x1, x2, …, xN) = P(x1) P(x2|x1) … P(xN|xN-1), so the likelihood again factorises into terms we can count.
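The counting recipe as a short code sketch (the function name and example sequence are mine, for illustration):

    from collections import defaultdict

    def estimate_transition_matrix(sequence):
        """ML estimate of a stationary Markov chain's transition matrix:
        count observed transitions, then normalise each row so it sums to one."""
        counts = defaultdict(lambda: defaultdict(int))
        for prev, curr in zip(sequence, sequence[1:]):
            counts[prev][curr] += 1
        matrix = {}
        for s, row in counts.items():
            total = sum(row.values())
            matrix[s] = {t: n / total for t, n in row.items()}
        return matrix

    print(estimate_transition_matrix(list("aabab")))
    # transitions a->a, a->b, b->a, a->b  =>  {'a': {'a': 1/3, 'b': 2/3}, 'b': {'a': 1.0}}

This is essentially the computation that the "Simple exercise" slide later asks you to do by hand.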

Simple example
If it is raining today, it will rain tomorrow with probability 0.8 (so the contrary has probability 0.2).
If it is not raining today, it will rain tomorrow with probability 0.6 (so the contrary has probability 0.4).
Build the transition matrix. Be careful which numbers need to sum to one and which don't. Such a matrix is called a stochastic matrix.
Q: It rained all week, including today. What does this model predict for tomorrow? Why? What does it predict for the day after tomorrow? (*Homework)
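The mechanics in code; the matrix just encodes the numbers given on the slide, and the step() helper is mine:

    # Rows: today's state; columns: tomorrow's state. Each ROW must sum to one.
    P = {
        "rain": {"rain": 0.8, "dry": 0.2},
        "dry":  {"rain": 0.6, "dry": 0.4},
    }

    def step(distribution, P):
        """Propagate a distribution over states one time step forward."""
        states = list(P)
        return {t: sum(distribution[s] * P[s][t] for s in states) for t in states}

    today = {"rain": 1.0, "dry": 0.0}   # it rained today
    tomorrow = step(today, P)           # {'rain': 0.8, 'dry': 0.2}
    print(tomorrow)
    # Applying step() once more gives the prediction for the day after tomorrow (the homework).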

Examples of Web Applications
HTTP request prediction:
– Predict the probabilities of the next requests from the same user, based on the history of requests from that client.
Adaptive Web navigation:
– Build a navigation agent which suggests which other links would be of interest to the user, based on the statistics of previous visits.
– The predicted link does not strictly have to be a link present in the Web page currently being viewed.
Tour generation:
– Given the starting URL as input, generate a sequence of states (or URLs) using the Markov chain process.

Building Markov Models from Web Log Files
A Web log file is a collection of records of user requests for documents on a Web site. An example record:
[04/Apr/1999:00:01: ] "GET /studaffairs/ccampus.html HTTP/1.1" " "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
The transition matrix can be seen as a graph:
– Link pair: (r - referrer, u - requested page, w - hyperlink weight)
– Link graph: this is the state diagram of the Markov Chain; a directed weighted graph; a hierarchy from the homepage down to multiple levels.

Link Graph: an example (University of Ulster site)
State diagram:
– Nodes: states
– Weighted arrows: number of transitions
(Zhu et al., 2002)

Experimental Results (Sarukkai, 2000)
Simulations:
– 'Correct link' refers to the actual link chosen at the next step.
– 'Depth of the correct link' is measured by counting the number of links which have a probability greater than or equal to the correct link.
– Over 70% of correct links are in the top 20 scoring states.
– Difficulties: very large state space.

Simple exercise
Build the Markov transition matrix of the following sequence:
[a b b a c a b c b b d e e d e d e d]
State space: {…………….}

Further topics
Hidden Markov Model:
– Does not make the Markov assumption on the observed sequence.
– Instead, it assumes that the observed sequence was generated by another sequence which is unobservable (hidden), and this other sequence is assumed to be Markovian.
– More powerful.
– Estimation is more complicated.
Aggregate Markov model:
– Useful for clustering sub-graphs of a transition graph.

HMM at an intuitive level
Suppose that we know all the parameters of the following HMM, as shown on the state diagram below.
What is the probability of observing the sequence [A, B] if the initial state is S1? The same question if the initial state is chosen randomly with equal probability?
ANSWER:
If the initial state is S1: 0.2*(0.4* *0.7) =
In the second case: 0.5* *0.3*(0.3* *0.8) =
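To make this kind of calculation concrete, here is a brute-force sketch that sums over all hidden state paths. The two-state parameters below are entirely made up, because the slide's state diagram is not reproduced in this transcript:

    from itertools import product

    # Hypothetical two-state HMM (NOT the one on the slide's diagram):
    initial    = {"S1": 0.5, "S2": 0.5}
    transition = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.3, "S2": 0.7}}
    emission   = {"S1": {"A": 0.2, "B": 0.8}, "S2": {"A": 0.9, "B": 0.1}}

    def observation_probability(observations, initial, transition, emission):
        """P(observations) = sum over every hidden path of P(path) * P(observations | path).
        Brute force is fine for short sequences; the forward algorithm does this efficiently."""
        states = list(initial)
        total = 0.0
        for path in product(states, repeat=len(observations)):
            p = initial[path[0]] * emission[path[0]][observations[0]]
            for i in range(1, len(observations)):
                p *= transition[path[i-1]][path[i]] * emission[path[i]][observations[i]]
            total += p
        return total

    print(observation_probability(["A", "B"], initial, transition, emission))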

Conclusions
– Probabilistic Model
– Maximum Likelihood parameter estimation
– Random sequence model
– Markov chain model
– Hidden Markov Model
– Aggregate Markov Model

Any questions?