A New Iteration Algorithm for Maximum Mutual Information Classifications on Factor Spaces — Based on a Semantic Information Theory
Chenguang Lu
Email: Survival99@gmail.com, Homepage: http://survivor99.com

Shannon's Mutual Information Formula
The classical information formula: I(xi; yj) = log[P(xi|yj)/P(xi)].
Shannon's channel: source → channel → destination, described by the transition probabilities P(yj|xi).
Shannon's mutual information formula: I(X;Y) = Σi Σj P(xi, yj) log[P(xi|yj)/P(xi)].
It means the average coding length saved for X because of the prediction Y.
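As an illustration, here is a minimal Python sketch (not from the slides) that computes I(X;Y) from an assumed joint distribution P(x, y):

```python
import numpy as np

def mutual_information(P_xy):
    """Shannon mutual information I(X;Y) in bits for a joint distribution P(x, y)."""
    P_x = P_xy.sum(axis=1, keepdims=True)   # marginal P(x)
    P_y = P_xy.sum(axis=0, keepdims=True)   # marginal P(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = P_xy / (P_x * P_y)          # P(x,y)/(P(x)P(y)) = P(x|y)/P(x)
        terms = np.where(P_xy > 0, P_xy * np.log2(ratio), 0.0)
    return terms.sum()

# Toy example: a noisy binary channel
P_xy = np.array([[0.45, 0.05],
                 [0.10, 0.40]])
print(mutual_information(P_xy))  # about 0.40 bits
```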

Maximum Mutual Information (MMI) Classifications on Factor Spaces
[Figures: a one-dimensional factor space Z and a two-dimensional factor space Z, where Z is a laboratory datum or feature vector.]
A model for medical tests, signal detection, watermelon classification, junk email classification, and so on. We need to optimize the partitioning boundaries z'.
We call the Z-space the factor space, as proposed by Peizhuang Wang. The Z-space is also called the feature space; we use "factor space" to emphasize that not every set of attributes can serve as the feature space.

Maximum Mutual Information (MMI) Classifications of Unseen Instances — A Most Difficult Problem Left by Shannon
We can only see Z without seeing X. Shannon used the distortion criterion instead of the MMI criterion. Why?
The problem in optimizing z' is circular: without z' we cannot express the mutual information I(X;Y), and without an expression for I(X;Y) we cannot optimize z'. The partition z' and I(X;Y) are interdependent (see the sketch below).
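To make the interdependence concrete, here is how a given partition of the Z-space into regions Cj (e.g., C1 = {z : z ≥ z'} for a single threshold) induces the Shannon channel and hence I(X;Y); the region notation Cj is ours, not from the slides:

```latex
% The partition {C_j} of the Z-space induces Shannon's channel:
\[
P(y_j \mid x_i) = \int_{C_j} P(z \mid x_i)\,\mathrm{d}z,
\qquad
P(y_j) = \sum_i P(x_i)\,P(y_j \mid x_i),
\]
% and hence the mutual information to be maximized over z':
\[
I(X;Y) = \sum_i \sum_j P(x_i)\,P(y_j \mid x_i)\,
         \log\frac{P(y_j \mid x_i)}{P(y_j)} .
\]
```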

The Popular Methods for MMI Classifications and Estimations
Use parameters to construct the boundaries, write I(X;Y) as a function of the parameters, and then optimize the parameters by gradient descent or Newton's method.
Disadvantages: complicated; slow; convergence is not reliable.

My Story about Looking for z*: Similar to Catching a Cricket
When I tried to optimize z', for any starting z', my Excel file told me: the best dividing point for MMI is the next one! After I used the next one, it still said: the best point is the next one! … Fortunately, z' converged! It is similar to catching a cricket.
Can this method converge in every case? Let us prove the convergence with my semantic information theory.

My Semantic Information Theory Is a Natural Generalization of Shannon's Information Theory
Several semantic information theories: Carnap and Bar-Hillel's, Floridi's, Yixin Zhong's, and mine.
Mine: a Chinese book published in 1993, and an English paper published in 1999: Lu, C., A generalization of Shannon's information theory, Int. J. of General Systems, 1999, 28(6): 453–490.

The New Method: the Channel Matching Algorithm, Based on the Semantic Information Theory
It uses two types of channels:
Shannon's channel: a transition probability matrix, which consists of a set of transition probability functions P(yj|X).
Semantic channel: a truth value matrix, which consists of a set of truth functions T(θj|X).

Bayes' Theorem Can Be Distinguished into Three Types
Bayes' Theorem I, between two logical probabilities, proposed by Bayes: T(B|A) = T(A|B)T(B)/T(A).
Bayes' Theorem II, between two statistical probabilities, used by Shannon: P(xi|yj) = P(yj|xi)P(xi)/P(yj).
Bayes' Theorem III, between a statistical probability and a logical probability, links the likelihood function (the semantic likelihood function) and the truth function (the membership function) via the logical probability.
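The formulas for Bayes' Theorem III appeared as images in the original slides; the following is a reconstruction based on Lu's published formulation, so treat the exact notation as an assumption:

```latex
% Bayes' Theorem III: from the truth (membership) function T(\theta_j\mid x) and the
% source P(x) to the semantic likelihood function P(x\mid\theta_j):
\[
P(x \mid \theta_j) = \frac{P(x)\,T(\theta_j \mid x)}{T(\theta_j)},
\qquad
T(\theta_j) = \sum_i P(x_i)\,T(\theta_j \mid x_i)
\quad\text{(the logical probability).}
\]
```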

Semantic Information Measure
The classical information formula: I(xi; yj) = log[P(xi|yj)/P(xi)].
The semantic information of yj about xi is defined with the log-normalized-likelihood: I(xi; θj) = log[P(xi|θj)/P(xi)] = log[T(θj|xi)/T(θj)].
The smaller the logical probability is, the more information there is. The larger the truth value is, which means the hypothesis can survive the test, the more information there is. A tautology or a contradiction conveys no information. This reflects Popper's falsification thought.
If T(θj|x) ≡ 1, it becomes Carnap and Bar-Hillel's formula I = log[1/T(θj)].

Semantic Mutual Information Formula
Averaging I(xi; θj), we have the semantic mutual information: I(X;θ) = Σi Σj P(xi, yj) log[T(θj|xi)/T(θj)].
If T(θj|x) = exp[-k(x-xj)²], a Gaussian function without the coefficient, then the semantic information becomes a log term minus a squared-error term (see the sketch below). It is easy to find that the maximum semantic information criterion is a special Regularized Least Squares (RLS) criterion, with the cross-entropy as the regularizer.
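A reconstruction of the Gaussian case (the original equations were images; the algebra below follows from the definitions above):

```latex
% With T(\theta_j \mid x) = \exp[-k\,(x - x_j)^2]:
\[
I(x_i;\theta_j) = \log\frac{T(\theta_j \mid x_i)}{T(\theta_j)}
                = \log\frac{1}{T(\theta_j)} - k\,(x_i - x_j)^2,
\]
\[
I(X;\theta) = \sum_j P(y_j)\log\frac{1}{T(\theta_j)}
            - k\sum_i\sum_j P(x_i, y_j)\,(x_i - x_j)^2,
\]
% so maximizing I(X;\theta) trades a weighted squared error off against an
% entropy-like term, i.e., a special RLS criterion as stated above.
```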

R(G) Function: the Matching Function of Shannon's Mutual Information and Semantic Mutual Information
By extending the rate-distortion function R(D), we obtain the function R(G). R(G) is the minimum of R = I(X;Y) for a given G = I(X;θ); G(R) is the maximum of G for a given R.
[Figure: the R(G) curve, with Rmax at the MMI matching point.]
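Written out as formulas (a hedged reading of the verbal definition above; the constraint-set notation over Shannon channels P(Y|X) is our assumption):

```latex
\[
R(G) = \min_{P(Y\mid X):\; I(X;\theta)=G} I(X;Y),
\qquad
G(R) = \max_{P(Y\mid X):\; I(X;Y)=R} I(X;\theta).
\]
```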

Channels' Matching
Matching I: the semantic channel matches Shannon's channel (label learning or training). Obtain the optimized semantic channel T(θj|X), j = 1, 2, …, from Shannon's channel P(yj|X) by T*(θj|X) = P(yj|X)/max[P(yj|X)], or by an equivalent formula.
Matching II: Shannon's channel matches the semantic channel (classification or reasoning), by the classifier (see the sketch below). It encourages us to select a compound label with the least denotation.
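A minimal Python sketch of the two matching steps for a discrete X; the argmax-semantic-information classifier in Matching II is our reading of the slides' "classifier", so treat that part as an assumption:

```python
import numpy as np

def matching_I(P_y_given_x):
    """Matching I: semantic channel from Shannon's channel,
    T*(theta_j|X) = P(y_j|X) / max_x P(y_j|X). Rows: labels y_j, columns: x values."""
    return P_y_given_x / P_y_given_x.max(axis=1, keepdims=True)

def matching_II(T, P_x):
    """Matching II (assumed form): classify each x by the label with the most
    semantic information I(x; theta_j) = log[T(theta_j|x) / T(theta_j)]."""
    T_theta = T @ P_x                      # logical probabilities T(theta_j)
    info = np.log(T + 1e-12) - np.log(T_theta[:, None] + 1e-12)
    return info.argmax(axis=0)             # chosen label index for each x

# Toy example: 2 labels, 3 values of X (made-up numbers)
P_x = np.array([0.5, 0.3, 0.2])
P_y_given_x = np.array([[0.9, 0.4, 0.1],   # P(y_0 | x)
                        [0.1, 0.6, 0.9]])  # P(y_1 | x)
T = matching_I(P_y_given_x)
print(T)
print(matching_II(T, P_x))                 # labels [0, 1, 1]
```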

Channels' Matching (CM) Iteration Algorithm for MMI Classifications of Unseen Instances
Given P(X), P(Z|X), and a starting dividing point z', repeat the two steps:
Matching I: T(θj|X) matches P(yj|X).
Matching II: for the given z', compute the information lines I(X;θj|Z), j = 1, 2, …, and use the classifier to obtain a new z'. If z' is unchanged, end; else go to Matching I.
Convergence is fast: 3–5 iterations are needed. Convergence proof: http://survivor99.com/lcg/CM/CM4tests.pdf
[Figure: information lines I(X;θ0|Z) and I(X;θ1|Z) over Z for a medical test (infected / uninfected, positive / negative), used to optimize z'.]
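For concreteness, here is a runnable sketch of the CM iteration on a toy medical-test example; the source P(X), the Gaussian p(z|x), and the bad starting point z' = 4 are our assumptions, not values from the slides:

```python
import numpy as np

z = np.linspace(-4, 8, 601)                        # discretized Z axis
P_x = np.array([0.8, 0.2])                         # P(x): x0 = uninfected, x1 = infected
P_z_given_x = np.vstack([np.exp(-(z - 0)**2 / 2),  # p(z | x0)
                         np.exp(-(z - 3)**2 / 2)]) # p(z | x1)
P_z_given_x /= P_z_given_x.sum(axis=1, keepdims=True)

labels = (z > 4.0).astype(int)                     # bad initial partition: y1 if z > z' = 4
for it in range(10):
    # Shannon channel induced by the current partition: P(y_j | x_i)
    P_y_given_x = np.vstack([(P_z_given_x * (labels == j)).sum(axis=1) for j in (0, 1)])
    # Matching I: semantic channel T(theta_j | x) = P(y_j | x) / max_x P(y_j | x)
    T = P_y_given_x / np.maximum(P_y_given_x.max(axis=1, keepdims=True), 1e-12)
    T_theta = T @ P_x                              # logical probabilities T(theta_j)
    # Matching II: for each z, pick the label with the larger information line
    P_xz = P_x[:, None] * P_z_given_x              # P(x, z)
    P_x_given_z = P_xz / P_xz.sum(axis=0, keepdims=True)
    info_lines = np.array([(P_x_given_z * (np.log(T[j][:, None] + 1e-12)
                                           - np.log(T_theta[j] + 1e-12))).sum(axis=0)
                           for j in (0, 1)])       # I(X; theta_j | Z=z)
    new_labels = info_lines.argmax(axis=0)
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels

print("iterations:", it + 1, " new threshold z' ≈", z[labels.argmax()])
```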

Using the R(G) Function to Prove the CM Algorithm's Convergence
Iterative steps and convergence reasons:
1) Matching I: for each Shannon channel, there is a matched semantic channel that maximizes the average log-likelihood;
2) Matching II: for the given P(X) and semantic channel, we can find a better Shannon channel;
3) Repeating the two steps obtains the Shannon channel that maximizes the Shannon mutual information and the average log-likelihood.
Each R(G) function serves as a ladder that lets R climb up, after which we find a better semantic channel and hence a better ladder.

An Example Showing the Speed and Reliability
Two iterations can make I(X;Y) reach 99.9% of the MMI.

An Example of the MMI classification with a Bad Initial Partition

Comparison
For MMI classifications on high-dimensional feature spaces, we need to combine the CM algorithm with neural networks.

Summary
The MMI classification is a difficult problem left by Shannon; it can be solved by the semantic information method.
Channel Matching Algorithm:
Matching I: improve the semantic channel by T(θ|X) ∝ P(Y|X), or by an equivalent formula;
Matching II: improve the Shannon channel by the classifier;
Repeat the above two steps until R = Rmax.
End. Thank you for listening! Welcome to exchange ideas.
For more papers about semantic information theory and machine learning, see http://survivor99.com/lcg/cm/recent.html or http://arxiv.org/a/lu_c_3