LECTURE 23: INFORMATION THEORY REVIEW

LECTURE 23: INFORMATION THEORY REVIEW

Objectives:
    Bayes Rule
    Mutual Information
    Conditional Maximum Likelihood Estimation (CMLE)
    Maximum Mutual Information Estimation (MMIE)
    Minimum Classification Error (MCE)

Resources:
    K.V.: Discriminative Training
    X.H.: Spoken Language Processing
    F.O.: Maximum Entropy Models
    J.B.: Discriminative Training

Entropy and Mutual Information

The entropy of a discrete random variable is defined as: H(X) = -\sum_x p(x) \log p(x).
The joint entropy between two random variables, X and Y, is defined as: H(X,Y) = -\sum_x \sum_y p(x,y) \log p(x,y).
The conditional entropy is defined as: H(Y|X) = -\sum_x \sum_y p(x,y) \log p(y|x).
The mutual information between two random variables, X and Y, is defined as: I(X;Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)p(y)}.
There is an important direct relation between mutual information and entropy: I(X;Y) = H(X) - H(X|Y). By symmetry, it also follows that: I(X;Y) = H(Y) - H(Y|X). Two other important relations: I(X;Y) = H(X) + H(Y) - H(X,Y), and the chain rule H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y).
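A minimal numerical sketch of these definitions (Python/NumPy; the joint distribution below is invented purely for illustration) that computes each quantity and checks the relations stated above:

```python
import numpy as np

# An illustrative joint distribution p(x, y) over two binary random variables
# (rows index X, columns index Y); the numbers are arbitrary but sum to 1.
p_xy = np.array([[0.25, 0.15],
                 [0.10, 0.50]])

def entropy(p):
    """Shannon entropy in bits: H = -sum p log2 p, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)           # marginal p(x)
p_y = p_xy.sum(axis=0)           # marginal p(y)

H_X  = entropy(p_x)              # H(X)
H_Y  = entropy(p_y)              # H(Y)
H_XY = entropy(p_xy.flatten())   # joint entropy H(X,Y)
H_Y_given_X = H_XY - H_X         # chain rule: H(Y|X) = H(X,Y) - H(X)

# Mutual information from the definition ...
I_XY = sum(p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
           for i in range(2) for j in range(2) if p_xy[i, j] > 0)

# ... and verified against the entropy relations on the slide.
assert np.isclose(I_XY, H_Y - H_Y_given_X)
assert np.isclose(I_XY, H_X + H_Y - H_XY)

print(f"H(X)={H_X:.3f}  H(Y)={H_Y:.3f}  H(X,Y)={H_XY:.3f}  I(X;Y)={I_XY:.3f}")
```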

Mutual Information in Pattern Recognition

Given a sequence of observations, X = {x_1, x_2, ..., x_n}, which typically can be viewed as feature vectors, we would like to minimize the error in predicting the corresponding class assignments, Ω = {Ω_1, Ω_2, ..., Ω_m}. A reasonable approach is to minimize the amount of uncertainty about the correct answer. This can be stated in information-theoretic terms as minimization of the conditional entropy: H(Ω|X) = -\sum_{Ω,X} p(Ω,X) \log p(Ω|X). Using our relations from the previous slide, we can write: H(Ω|X) = H(Ω) - I(Ω;X). If our goal is to minimize H(Ω|X), then we can try to minimize H(Ω) or maximize I(Ω;X). The minimization of H(Ω) corresponds to finding prior probabilities that minimize entropy, which amounts to accurate prediction of class labels from prior information alone. (In speech recognition, this is referred to as finding a language model with minimum entropy.) Alternatively, we can estimate the parameters of our model so as to maximize the mutual information -- referred to as maximum mutual information estimation (MMIE).
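A toy sketch of this decomposition (Python/NumPy; the observation bins, their probabilities, and the two candidate posterior tables are all made up, and the model's posteriors are treated as the true conditional distribution just to make the identity H(Ω|X) = H(Ω) - I(Ω;X) concrete). Sharper class posteriors give lower conditional entropy and higher mutual information, which is the intuition behind MMIE:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical setup: three discrete observation "bins" with probabilities
# p(x), and two candidate models assigning class posteriors p(omega|x)
# over two classes.  In this toy example the joint is p(x) * p(omega|x).
p_x = np.array([0.5, 0.3, 0.2])

def cond_entropy_and_mi(posteriors):
    """Return H(Omega|X) and I(Omega;X) for a table of p(omega|x) rows."""
    joint = p_x[:, None] * posteriors                           # p(x, omega)
    p_omega = joint.sum(axis=0)                                  # marginal p(omega)
    H_omega_given_x = entropy(joint.flatten()) - entropy(p_x)    # H(Omega,X) - H(X)
    return H_omega_given_x, entropy(p_omega) - H_omega_given_x   # H(Omega) - H(Omega|X)

vague_model = np.array([[0.60, 0.40], [0.50, 0.50], [0.40, 0.60]])
sharp_model = np.array([[0.95, 0.05], [0.90, 0.10], [0.10, 0.90]])

for name, model in [("vague", vague_model), ("sharp", sharp_model)]:
    H_cond, mi = cond_entropy_and_mi(model)
    print(f"{name}: H(Omega|X) = {H_cond:.3f} bits, I(Omega;X) = {mi:.3f} bits")
# The sharper posteriors yield lower conditional entropy and higher mutual
# information.
```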

Posteriors and Bayes Rule

The decision rule for the minimum error rate classifier selects the class, ω_i, with the maximum posterior probability, P(ω_i|x). Recalling Bayes rule: P(ω_i|x) = \frac{p(x|ω_i) P(ω_i)}{p(x)}. The evidence, p(x), can be expressed as: p(x) = \sum_j p(x|ω_j) P(ω_j). As we have discussed, we can ignore p(x) during classification since it is constant with respect to the maximization (choosing the most probable class). However, during training, the value of p(x) depends on the parameters of all models and varies as a function of x. A conditional maximum likelihood estimator, θ*_CMLE, is defined as: θ*_CMLE = \arg\max_θ p_θ(Ω|X). From the previous slide, we can instead invoke mutual information and maximize I_θ(Ω;X) = \sum_{Ω,X} p(Ω,X) \log \frac{p_θ(X,Ω)}{p_θ(X) p_θ(Ω)}. However, we don't know the joint distribution, p(X,Ω)!
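A brief sketch (Python/SciPy; the Gaussian class-conditional densities, priors, and labeled points are invented for illustration) of Bayes rule posteriors, the minimum error rate decision, and the conditional log-likelihood that a CMLE-style objective would maximize:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem with 1-D Gaussian class-conditional
# densities p(x|omega_i); the means, variances, and priors are made up.
priors = np.array([0.6, 0.4])
means  = np.array([0.0, 2.0])
stds   = np.array([1.0, 1.0])

def posteriors(x):
    """Bayes rule: P(omega_i|x) = p(x|omega_i) P(omega_i) / p(x),
    with the evidence p(x) = sum_j p(x|omega_j) P(omega_j)."""
    likelihoods = norm.pdf(x, loc=means, scale=stds)   # p(x|omega_i)
    joint = likelihoods * priors                       # p(x, omega_i)
    return joint / joint.sum()                         # divide by evidence p(x)

# The minimum error rate decision rule picks the class with maximum posterior.
x = 1.3
post = posteriors(x)
print("P(omega|x) =", post, "-> decide class", int(np.argmax(post)))

# A CMLE-style objective on labeled data (x_k, y_k): the conditional
# log-likelihood sum_k log P_theta(y_k | x_k).  Note that the evidence p(x)
# does NOT cancel here, so it couples the parameters of all class models.
data = [(0.2, 0), (-0.5, 0), (2.3, 1), (1.8, 1)]
cond_loglik = sum(np.log(posteriors(xk)[yk]) for xk, yk in data)
print("conditional log-likelihood =", round(cond_loglik, 3))
```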

Summary

Reviewed basic concepts of entropy and mutual information from information theory. Discussed the use of these concepts in pattern recognition, particularly the goal of minimizing conditional entropy. Demonstrated the relationship between minimization of conditional entropy and maximization of mutual information.