The Relative Entropy Rate of Two Hidden Markov Processes. Or Zuk, Dept. of Physics of Complex Systems, Weizmann Inst. of Science, Rehovot, Israel.

Presentation transcript:

The Relative Entropy Rate of Two Hidden Markov Processes. Or Zuk, Dept. of Physics of Complex Systems, Weizmann Inst. of Science, Rehovot, Israel.

2 Overview
• Introduction
• Distance Measures and the Relative Entropy Rate
• Results: Generalization from the Entropy Rate
• Future Directions

3 Introduction
Hidden Markov Processes are relevant in many fields:
• Error correction (Markovian source + noise)
• Signal processing, speech recognition
• Experimental physics: telegraph noise, TLS + noise, quantum jumps
• Bioinformatics: biological sequences, gene expression
[Figure: a Markov chain passed through 10% transmission noise yields an HMP; quantum jumps in the resistance R (MOhm) of mesoscopic wires.]

4 HMP - Definitions
Markov Process:
• X – Markov process
• M_λ – transition matrix: m_λ(i,j) = Pr(X_{n+1} = j | X_n = i)
Hidden Markov Process:
• Y – noisy observation of X
• R_λ – noise/emission matrix: r_λ(i,j) = Pr(Y_n = j | X_n = i)
[Figure: graphical model X_n → X_{n+1} via M_λ, with emissions X_n → Y_n and X_{n+1} → Y_{n+1} via R_λ.]
Models are denoted by λ and µ.

5 Example: Binary HMP
[Figure: transition diagram on states {0,1} with probabilities p(0|0), p(0|1), p(1|0), p(1|1), and emission diagram with probabilities q(0|0), q(0|1), q(1|0), q(1|1).]

6 Example: Binary HMP (Cont.)
• A simple, symmetric binary HMP:
  M = ( 1-p  p ; p  1-p ),   R = ( 1-ε  ε ; ε  1-ε )
• All properties of the process depend on two parameters, p and ε. Assume w.l.o.g. p, ε < ½.
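As a concrete illustration (not part of the original slides), here is a minimal Python sketch that samples from this symmetric binary HMP. The function name and the parameter values p = 0.2, ε = 0.1 are my own choices for the example.

```python
import numpy as np

def sample_binary_hmp(p, eps, n, seed=None):
    """Sample n steps of the symmetric binary HMP.

    The hidden chain X flips with probability p; the observation Y
    is X passed through a binary symmetric channel with error eps.
    """
    rng = np.random.default_rng(seed)
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(2)                      # stationary start (uniform by symmetry)
    for t in range(1, n):
        x[t] = x[t - 1] ^ (rng.random() < p)    # flip the hidden state with probability p
    y = x ^ (rng.random(n) < eps)               # corrupt each symbol with probability eps
    return x, y

x, y = sample_binary_hmp(p=0.2, eps=0.1, n=10_000, seed=0)
print("empirical flip rate:", np.mean(x[1:] != x[:-1]))
print("empirical noise rate:", np.mean(x != y))
```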

7 Overview
• Introduction
• Distance Measures and the Relative Entropy Rate
• Results: Generalization from the Entropy Rate
• Future Directions

8 Distance Measures for Two HMPs
• Why is this important?
• Often, one learns an HMP from data. It is important to know how different the learned model is from the true model.
• Sometimes, many HMPs represent different sources (e.g. different authors, different protein families), and we wish to know which sources are similar.
• What distance measure should we use?
• Look at the joint distributions of N consecutive Y symbols, P_λ^(N) and P_µ^(N).

9 Relative Entropy (RE) Rate
• Notation: Y^N = (Y_1, …, Y_N); P_η^(N) is the distribution of Y^N under model η.
• Relative entropy for finite (N-symbol) distributions:
  D_N(λ || µ) = Σ_{y^N} P_λ^(N)(y^N) log [ P_λ^(N)(y^N) / P_µ^(N)(y^N) ]
• Take the limit to get the RE-rate:
  D(λ || µ) = lim_{N→∞} (1/N) D_N(λ || µ)
• Alternative definition, using conditional relative entropy:
  D(λ || µ) = lim_{N→∞} E_λ [ log ( P_λ(Y_N | Y^{N-1}) / P_µ(Y_N | Y^{N-1}) ) ]
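As a numerical sanity check on these definitions (my own sketch, not part of the slides), the finite-N quantity D_N(λ || µ)/N can be computed by brute force: enumerate all length-N observation sequences, evaluate their probabilities under each model with the forward recursion, and sum. Helper names and parameter values below are illustrative only.

```python
import itertools
import numpy as np

def seq_prob(y, M, R, pi):
    """P(y_1..y_N) via the forward recursion: alpha_n(i) = P(y^n, X_n = i)."""
    alpha = pi * R[:, y[0]]
    for s in y[1:]:
        alpha = (alpha @ M) * R[:, s]
    return alpha.sum()

def finite_re_rate(M1, R1, pi1, M2, R2, pi2, N):
    """D_N(lambda || mu) / N, summed over all observation sequences of length N."""
    d = 0.0
    for y in itertools.product(range(R1.shape[1]), repeat=N):
        p = seq_prob(y, M1, R1, pi1)
        q = seq_prob(y, M2, R2, pi2)
        d += p * np.log(p / q)
    return d / N

def binary_symmetric(p, eps):
    M = np.array([[1 - p, p], [p, 1 - p]])
    R = np.array([[1 - eps, eps], [eps, 1 - eps]])
    return M, R, np.array([0.5, 0.5])   # uniform stationary distribution

Ml, Rl, pil = binary_symmetric(0.2, 0.1)    # model lambda
Mm, Rm, pim = binary_symmetric(0.3, 0.15)   # model mu
for N in (2, 4, 6, 8):
    print(N, finite_re_rate(Ml, Rl, pil, Mm, Rm, pim, N))
```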

10 Relative Entropy (RE) Rate
• First proposed for HMPs by [Juang&Rabiner 85].
• Not a norm (not symmetric, no triangle inequality).
• Still, it has several natural interpretations:
  - If one generates data from λ and scores its likelihood under µ, then D(λ || µ) is the average likelihood loss per symbol (compared to the optimal model λ).
  - If one compresses data generated by λ, assuming erroneously that it was generated by µ, then one 'loses' on average D(λ || µ) per symbol.
• For Markov chains, D(λ || µ) is easily given in closed form:
  D(λ || µ) = Σ_i π_λ(i) Σ_j m_λ(i,j) log [ m_λ(i,j) / m_µ(i,j) ],
  where π_λ is the stationary distribution of M_λ.
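The closed-form Markov-chain expression above is straightforward to evaluate; here is a short sketch (mine, with arbitrary example matrices):

```python
import numpy as np

def stationary(M):
    """Stationary distribution of a transition matrix M (rows sum to 1)."""
    vals, vecs = np.linalg.eig(M.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def markov_re_rate(Ml, Mm):
    """D(lambda || mu) for two Markov chains with strictly positive transition matrices."""
    pi = stationary(Ml)
    return float(np.sum(pi[:, None] * Ml * np.log(Ml / Mm)))

Ml = np.array([[0.8, 0.2], [0.2, 0.8]])
Mm = np.array([[0.7, 0.3], [0.3, 0.7]])
print(markov_re_rate(Ml, Mm))
```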

11 Relative Entropy (RE) Rate
• For HMPs, D(λ || µ) is difficult to compute. So far only bounds [Silva&Narayanan] or approximation algorithms [Li et al. 05, Do 03, Mohammad&Tranter 05] are known.
• D(λ || µ) generalizes the Shannon entropy rate, via: H(λ) = log s – D(λ || u), where u is the uniform model and s is the alphabet size of Y.
• The entropy rate H of an HMP is a Lyapunov exponent, which is hard to compute in general [Jacquet et al. 04].
• What is known (for H)? A Lyapunov exponent representation, analyticity, and asymptotic expansions in different regimes.
• Goal: generalize these results and techniques to the RE-rate.

12 Why is calculating D(λ || µ) difficult?
Markov chains:
- All sequences with the same number of flips have the same probability, so among the 2^N sequences there is only a polynomial number of types (distinct probabilities).
HMPs:
- Many Markov-chain realizations {X} contribute to the same observation sequence Y, and different {Y}s have different probabilities: an exponential number of types (probabilities). The method of types does not work here.
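A small experiment (my own, not from the slides) makes the contrast concrete: for the symmetric binary Markov chain the number of distinct sequence probabilities ("types") grows only linearly in N, whereas for the corresponding HMP observations it grows exponentially.

```python
import itertools
import numpy as np

def chain_prob(x, p):
    """P(x^N) for the symmetric binary Markov chain with a uniform start."""
    flips = sum(a != b for a, b in zip(x, x[1:]))
    return 0.5 * p**flips * (1 - p)**(len(x) - 1 - flips)

def hmp_prob(y, p, eps):
    """P(y^N) for the binary symmetric HMP via the forward recursion."""
    M = np.array([[1 - p, p], [p, 1 - p]])
    R = np.array([[1 - eps, eps], [eps, 1 - eps]])
    alpha = 0.5 * R[:, y[0]]
    for s in y[1:]:
        alpha = (alpha @ M) * R[:, s]
    return alpha.sum()

p, eps, N = 0.2, 0.1, 10
seqs = list(itertools.product((0, 1), repeat=N))
chain_types = {round(chain_prob(x, p), 12) for x in seqs}
hmp_types = {round(hmp_prob(y, p, eps), 12) for y in seqs}
print(len(chain_types), "distinct chain probabilities out of", 2**N)  # equals N
print(len(hmp_types), "distinct HMP probabilities out of", 2**N)      # exponentially many
```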

13 Overview
• Introduction
• Distance Measures and the Relative Entropy Rate
• Results: Generalization from the Entropy Rate
• Future Directions

14 RE-Rate and Lyapunov Exponents
• What is a Lyapunov exponent?
• It arises in dynamical systems, control theory, statistical physics etc., and measures the stability of the system.
• Take two (square) matrices A, B. At each step choose at random A (with prob. p) or B (with prob. 1-p), and look at the normalized log-norm of the product: (1/N) log ||ABBBAABAB…BA||. The limit as N → ∞:
  - exists a.s. [Furstenberg&Kesten 60],
  - is called the top Lyapunov exponent,
  - is independent of the matrix norm chosen.
• The HMP entropy rate is given as a Lyapunov exponent [Jacquet et al. 04].
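For intuition, the top Lyapunov exponent in this definition can be estimated by direct simulation (a minimal sketch of mine, not from the slides): apply randomly chosen matrices to a vector, renormalize at every step, and average the logs of the growth factors. This tracks the growth rate of the product applied to a fixed vector, which converges to the same limit almost surely.

```python
import numpy as np

def top_lyapunov(A, B, p, n_steps=200_000, seed=0):
    """Estimate the top Lyapunov exponent of random products of A (prob. p) and B (prob. 1-p)."""
    rng = np.random.default_rng(seed)
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])   # arbitrary unit start vector
    log_growth = 0.0
    for _ in range(n_steps):
        M = A if rng.random() < p else B
        v = M @ v
        norm = np.linalg.norm(v)
        log_growth += np.log(norm)    # accumulate the log of the per-step growth factor
        v /= norm                     # renormalize to keep the product bounded
    return log_growth / n_steps

A = np.array([[0.9, 0.2], [0.1, 0.8]])
B = np.array([[0.5, 0.5], [0.4, 0.6]])
print(top_lyapunov(A, B, p=0.3))
```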

15 RE-Rate and Lyapunov Exponents
• What about the RE-rate?
• It is given as the difference of two Lyapunov exponents:
  - The G's are random matrices, obtained simply from M and R using the forward equations.
  - Different matrices appear in the two Lyapunov exponents, but the probabilities selecting the matrices are the same.
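This structure also suggests a simple Monte Carlo estimator (my own sketch, not the slides' method): since D(λ || µ) is the limit of (1/N)[log P_λ(Y^N) − log P_µ(Y^N)] for Y drawn from λ (under the usual ergodicity and positivity assumptions), one can generate a single long sequence from λ and compute both log-likelihoods with the scaled forward recursion, whose per-step factors play the role of the G matrices above.

```python
import numpy as np

def log_likelihood(y, M, R, pi):
    """log P(y^N) under an HMP via the scaled forward recursion."""
    alpha = pi * R[:, y[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for s in y[1:]:
        alpha = (alpha @ M) * R[:, s]
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll

def binary_symmetric(p, eps):
    M = np.array([[1 - p, p], [p, 1 - p]])
    R = np.array([[1 - eps, eps], [eps, 1 - eps]])
    return M, R, np.array([0.5, 0.5])

def sample_hmp(M, R, pi, n, seed=0):
    """Draw one observation sequence of length n from the HMP (M, R, pi)."""
    rng = np.random.default_rng(seed)
    x = rng.choice(len(pi), p=pi)
    y = np.empty(n, dtype=int)
    for t in range(n):
        y[t] = rng.choice(R.shape[1], p=R[x])   # emit from the current hidden state
        x = rng.choice(M.shape[1], p=M[x])      # then move the hidden chain
    return y

Ml, Rl, pil = binary_symmetric(0.2, 0.1)    # true model lambda
Mm, Rm, pim = binary_symmetric(0.3, 0.15)   # alternative model mu
N = 200_000
y = sample_hmp(Ml, Rl, pil, N)
d_hat = (log_likelihood(y, Ml, Rl, pil) - log_likelihood(y, Mm, Rm, pim)) / N
print("estimated RE-rate D(lambda || mu):", d_hat)
```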

16 Analyticity of the RE-Rate
• Is the RE-rate continuous, 'smooth', or even analytic in the parameters governing the HMPs?
• For Lyapunov exponents: analyticity is known in the matrix entries [Ruelle 79] and in their selection probabilities [Peres 90,91], separately.
• For the HMP entropy rate, analyticity was recently shown by [Han&Marcus 05].

17 Analyticity of the RE-Rate
• Using both results, we are able to show:
  Thm: The RE-rate is analytic in the HMPs' parameters.
• Analyticity is shown only in the interior of the parameter domain (i.e. strictly positive probabilities).
• Behavior on the boundaries is more complicated: sometimes analyticity persists on the boundary (and beyond), and sometimes we encounter singularities. A full characterization is still lacking [Marcus&Han 05].

18 RE-Rate Taylor Series Expansion
• While in general the RE-rate is not known, there are specific parameter values for which it is easily given in closed form (e.g. for Markov chains). Perhaps we can 'expand' around these values and obtain asymptotic results near them.
• A similar approach was used for Lyapunov exponents [Derrida], and for the HMP entropy rate [Jacquet et al. 04, Weissman&Ordentlich 04, Zuk et al. 05], giving first-order asymptotics in various regimes.

19 Different Regimes – Binary Case
• p → 0, p → ½ (ε fixed); ε → 0, ε → ½ (p fixed).
• We concentrate on the 'high-SNR regime' ε → 0 and the 'almost-memoryless regime' p → ½.
[Figure: the (p, ε) parameter square [0, ½] × [0, ½], with the studied regimes marked on its boundary.]
• For high SNR (for both models η = λ, µ), the solution can be given as a power series in ε.

20 RE-Rate Taylor Series Expansion
• In [Zuk, Domany, Kanter & Aizenman 06] we give a procedure for calculating the full Taylor series expansion of the HMP entropy rate in the 'high-SNR' and 'almost-memoryless' regimes.
• Main observation: finite systems give the correct RE-rate up to a given order — the order-k coefficient of D(λ || µ) already agrees with that of the finite-N quantity D_N(λ || µ) once N is large enough relative to k.
• This was discovered using computer experiments (symbolic computation in Maple).
• A stronger result holds for the entropy rate (the orders 'settle' for N ≥ (k+3)/2).
• This does not hold in every regime: for some regimes (e.g. p → 0), even the first order never settles.

21 Proof Outline (with M. Aizenman)
[Figure: chains of hidden states X and observations Y illustrating the lengths needed to determine H(λ) and D(λ || µ); a chain of length (k+3)/2 determines H(p, ε) up to O(ε^k).]
Two main ideas:
A. Distinguish between the noise at different sites: ε_1, ε_2, ε_3, …, ε_j, ….
B. When ε_m = 0, the observation Y_m = X_m, and conditioning back to the past is 'blocked' at site m.

22 Overview
• Introduction
• Distance Measures and the Relative Entropy Rate
• Results: Generalization from the Entropy Rate
• Future Directions

23 RE-Rate Taylor Series Expansion
• First order:
• Higher orders were computed for the binary symmetric case.
• Similar results hold for the 'almost-memoryless' regime.
• The radius of convergence seems larger for the latter expansion, although no rigorous results are known.

24 Future Directions
• Study other regimes (e.g. two 'close' models).
• Behavior of the EM algorithm.
• Generalizations (e.g. different alphabet sizes, the continuous case).
• Physical realizations of HMPs (mesoscopic systems, quantum jumps).
• Domain of analyticity - radius of convergence.

25 Thanks
• Eytan Domany (Weizmann Inst.)
• Ido Kanter (Bar-Ilan Univ.)
• Michael Aizenman (Princeton Univ.)
• Libi Hertzberg (Weizmann Inst.)