Semantic Channel and Shannon Channel Mutually Match and Iterate for Tests, Estimations, and Classifications with Maximum Mutual Information and Maximum Likelihood

Semantic Channel and Shannon Channel Mutually Match and Iterate for Tests, Estimations, and Classifications with Maximum Mutual Information and Maximum Likelihood
Chenguang Lu, lcguang@foxmail.com
Homepage: http://survivor99.com/lcg/; http://www.survivor99.com/lcg/english/
This ppt may be downloaded from http://survivor99.com/lcg/CM/CM4MMIandML.ppt

1. The Tasks of Tests and Estimations For tests and estimations with given P(X) and P(Z|X), or with a sample {(x(t), z(t)) | t = 1, 2, …, N}, how do we partition C (i.e., find the dividing boundaries) so that Shannon's mutual information and the average log-likelihood (ALL) are maximized?
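In my own notation (not shown on the slide), writing the classifier as Y = f(Z), with f(Z) = yj when Z falls in Cj, the two objectives can be stated as:

    I(X; Y) = ∑j ∑i P(xi, yj) log [P(xi|yj)/P(xi)]          (Shannon mutual information)
    ALL = (1/N) ∑t log P(x(t)|θj with yj = f(z(t))) ≈ ∑j ∑i P(xi, yj) log P(xi|θj)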

Mutual Information and Average Log-Likelihood Change with z' in Tests As z' moves right, the likelihood provided by y1, P(X|θ1), increases while P(X|θ0) decreases; the Kullback-Leibler information I(X; y1) increases while I(X; y0) decreases. There is a best dividing point z* that maximizes the mutual information and the average log-likelihood.
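For reference, the Kullback-Leibler information referred to here is the standard quantity (this formula is not printed on the slide):

    I(X; yj) = ∑i P(xi|yj) log [P(xi|yj)/P(xi)],   and   I(X; Y) = ∑j P(yj) I(X; yj).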

A Story about Looking for z*: Similar to Catching a Cricket I study semantic information theory and define: Information = log [P(X|θj)/P(X)]. The information criterion is compatible with the likelihood and likelihood-ratio criteria. When I used the information criterion to find the optimal z* in a test, an interesting thing happened. For any starting z', my Excel file told me: the best dividing point is the next one! After I used the next one, it still said: the best point is the next one! … Fortunately, it converged! It is similar to catching a cricket. Do you know this secret? Can this method converge in every case? Let us prove the convergence with my semantic information theory.

5. The Research History
1993: A Generalized Information Theory (《广义信息论》), University of Science and Technology of China Press;
1994: "The Coding Meanings of Generalized Entropy and Generalized Mutual Information" (《广义熵和广义互信息的编码意义》), Journal on Communications (《通信学报》), Vol. 5, No. 6, 37-44;
1997: Entropy Theory of Portfolios and Information Value (《投资组合的熵理论和信息价值》), University of Science and Technology of China Press;
1999: "A generalization of Shannon's information theory" (a short version of the 1993 book), Int. J. of General Systems, 28(6): 453-490.
Recently, I found this theory could be used to improve statistical learning in many respects. See http://www.survivor99.com/lcg/books/GIT/ and http://www.survivor99.com/lcg/Recent.html
Home page: http://survivor99.com/lcg/  Blog: http://blog.sciencenet.cn/?2056

6. An Important Step: Using the Truth Function to Produce the Likelihood Function Use the membership function mAj(X) as the truth function of the hypothesis yj = "X is in Aj": T(θj|X) = mAj(X), where θj = Aj (a fuzzy set) serves as a sub-model. Important step 1: use T(θj|X) and the source P(X) to produce the semantic likelihood function (reconstructed below). GPS example: how do we predict the real position from a GPS reading? Is the car really on the building the pointer indicates? [Figure: a truth function of the form exp[-(X-xj)2/(2d2)], with the logical probability and the most possible position marked.]
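The semantic likelihood formula itself is an image on the slide. A reconstruction consistent with the "longitudinal normalizing constant" mentioned on slide 10 (my rendering):

    P(X|θj) = P(X) T(θj|X) / T(θj),   where T(θj) = ∑i P(xi) T(θj|xi) is the logical probability of yj.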

7. A Semantic Information Measure Compatible with the Thoughts of Shannon, Popper, Fisher, and Zadeh Semantic information is defined with the log normalized likelihood (see the reconstruction below). If T(θj|X) = exp[-|X-xj|2/(2d2)], j = 1, 2, …, n, then the semantic information equals Bar-Hillel and Carnap's semantic information minus the standardized squared deviation. This reflects Popper's thought well: the smaller the logical probability is, the more information there is; the larger the deviation is, the less information there is; a wrong estimation conveys negative information.
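The defining formula is an image on the slide; from "log normalized likelihood" and the Gaussian-like truth function above, it should read (my reconstruction):

    I(xi; θj) = log [T(θj|xi)/T(θj)] = log [P(xi|θj)/P(xi)];
    with T(θj|X) = exp[-|X-xj|2/(2d2)]:   I(X; θj) = log [1/T(θj)] - |X-xj|2/(2d2).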

8. Semantic Kullback-Leibler Information and Semantic Mutual Information Important step 2: average I(xi; θj) over the sampling distribution P(X|yj) to get the semantic Kullback-Leibler information, which has a simple relation to the normalized log-likelihood. Averaging I(X; θj) again over P(yj) gives the semantic mutual information. To maximize I(X; θj) is to maximize the likelihood. (The formulas are reconstructed below.)
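The two formulas are images on the slide; consistent with the definitions above, they should be (my reconstruction):

    I(X; θj) = ∑i P(xi|yj) log [T(θj|xi)/T(θj)]                 (semantic Kullback-Leibler information)
    I(X; Θ) = ∑j P(yj) ∑i P(xi|yj) log [T(θj|xi)/T(θj)]         (semantic mutual information)

Since log [T(θj|xi)/T(θj)] = log [P(xi|θj)/P(xi)], I(X; Θ) equals the average log-likelihood plus a term that does not depend on the model, so maximizing it is equivalent to maximizing the likelihood when P(X) is fixed.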

9. The Channels' Matching Algorithm The Shannon channel consists of the transition probability functions P(yj|X); the semantic channel consists of the truth functions T(θj|X). Both appear in the semantic mutual information formula, so we may fix one and optimize the other alternately to achieve maximum likelihood estimation. (In the figure, yj is fixed while X varies; the sampling distributions correspond to the Shannon channel and the likelihood functions to the semantic channel.)
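Written with the Shannon channel explicit (my notation), the formula that both channels enter is:

    I(X; Θ) = ∑j ∑i P(xi) P(yj|xi) log [T(θj|xi)/T(θj)].

Fixing P(yj|X) and optimizing T(θj|X) is the semantic channel's matching step; fixing T(θj|X) and re-choosing P(yj|X) through the classifier is the Shannon channel's matching step.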

10. The Semantic Channel Matches the Shannon Channel Optimizing the truth function (the semantic channel): when the sample is large enough, the optimized truth function is proportional to the transition probability function; that is, the semantic channel matches the Shannon channel. Here xj* makes P(yj|xj*) the maximum of P(yj|X). If P(yj|X) or P(yj) is hard to obtain, we may instead normalize P(X|yj)/P(X) with the longitudinal normalizing constant (see below). With T*(θj|X), the semantic Bayesian prediction is equivalent to the traditional Bayesian prediction: P*(X|θj) = P(X|yj).
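The two optimization formulas are images on the slide; reconstructions consistent with the text (my rendering):

    T*(θj|X) = P(yj|X) / P(yj|xj*),
    or, when P(yj|X) or P(yj) is hard to obtain,
    T*(θj|X) = [P(X|yj)/P(X)] / max over X of [P(X|yj)/P(X)].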

11. Multi-label Logical Classification and Selective Classification with the ML Criterion The receiver's logical classification is to obtain the membership (truth) functions: one formula is used when the sample is big enough, another when it is not. The sender's selective classification is to select a yj (i.e., to make Bayes' decision). If X is unseen and we can only see the observed condition Z, as in a test or an estimation, we use the corresponding formula with P(X|Z) (reconstructed below).
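A reconstruction of the selective-classification formulas, consistent with the semantic information measure above (my rendering; the "sample not big enough" case is not reconstructed here):

    yj = h(X) = argmax over j of log [T(θj|X)/T(θj)] = argmax over j of I(X; θj);
    with only Z observed:   yj = f(Z) = argmax over j of ∑i P(xi|Z) log [T(θj|xi)/T(θj)].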

12. Two Information Amounts Change with z' y0: test-negative; y1: test-positive. To optimize T(θj|X): T(θ1|x1) = T(θ0|x0) = 1, T(θ1|x0) = b1'* = P(y1|x0)/P(y1|x1), T(θ0|x1) = b0'* = P(y0|x1)/P(y0|x0). To optimize the classifier (j = 1, 2), see the formula reconstructed below.
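The classifier formula is an image on the slide; a reconstruction consistent with slide 11 (my rendering): for each observed z, choose

    yj = f(z) = argmax over j in {0, 1} of ∑i P(xi|z) log [T(θj|xi)/T(θj)],

so the optimal dividing point z* is where the two conditional semantic information amounts are equal.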

13. Using the R(G) Function to Prove Iterative Convergence Shannon's information theory provides the rate-distortion function R(D), where R(D) is the minimum R for a given D. Replacing D with G, we obtain the R(G) function: it describes the minimum Shannon mutual information for a given semantic mutual information G. All R(G) functions are bowl-shaped. (The figure marks the matching point.)
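In symbols (my rendering, following the analogy with rate-distortion theory):

    R(D) = min over P(Y|X) with E[d(X, Y)] ≤ D of I(X; Y)   →   R(G) = min over P(Y|X) with I(X; Θ) = G of I(X; Y).

The parametric solution has the same exponential form as in rate-distortion theory, with the parameter s that reappears on slide 19.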

14. Using the R(G) Function to Prove the CM Algorithm's Convergence for Tests and Estimations Iterative steps and the reasons for convergence: 1) For each Shannon channel, there is a matched semantic channel that maximizes the average log-likelihood; 2) For given P(X) and a fixed semantic channel, we can find a better Shannon channel; 3) Repeating the two steps yields the Shannon channel that maximizes the Shannon mutual information and the average log-likelihood. Each R(G) function serves as a ladder that lets R climb up; we then find a better semantic channel and hence a better ladder. (A code sketch of the two-step iteration for a binary test follows.)
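A minimal Python sketch of the two-step iteration for a binary test. This is my own illustration, not the author's Excel implementation; the function name cm_test and all variable names are mine, Z is assumed to be discretized on a grid, the starting point is assumed to lie inside the grid, and the positive-decision region is assumed to be an upper interval [z*, +inf).

    import numpy as np

    def cm_test(p_x, p_z_given_x, z_grid, z_start):
        """p_x: P(X) over x0 (no disease) and x1 (disease), shape (2,).
        p_z_given_x: P(Z|X) on the grid of z values, shape (2, len(z_grid))."""
        z_star = z_start
        for _ in range(30):
            # Shannon channel induced by the current dividing point:
            # y1 = "test positive" if Z >= z_star, else y0.
            pos = z_grid >= z_star
            p_y_given_x = np.stack([p_z_given_x[:, ~pos].sum(axis=1),
                                    p_z_given_x[:, pos].sum(axis=1)], axis=1)   # rows: x; columns: y0, y1
            # Semantic channel matches the Shannon channel: T*(theta_j|x) = P(y_j|x) / max_x P(y_j|x).
            denom = np.maximum(p_y_given_x.max(axis=0), 1e-12)
            truth = np.maximum(p_y_given_x / denom, 1e-12)
            t_logical = p_x @ truth                                             # logical probabilities T(theta_j)
            # Shannon channel matches the semantic channel: for each z, choose the y_j
            # that maximizes sum_i P(x_i|z) * log [T(theta_j|x_i)/T(theta_j)].
            p_xz = p_x[:, None] * p_z_given_x
            p_x_given_z = p_xz / p_xz.sum(axis=0)
            info = p_x_given_z.T @ np.log(truth / t_logical)                    # shape (len(z_grid), 2)
            positive_region = info[:, 1] >= info[:, 0]
            if not positive_region.any():
                break
            z_new = z_grid[positive_region].min()
            if z_new == z_star:
                break                                                           # converged
            z_star = z_new
        return z_star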

15. An Example of Estimation Showing Reliable Convergence A 3×3 Shannon channel is used to show that the convergence is reliable: even if a pair of bad starting points is used, the algorithm still converges. With good starting points, the number of iterations is 4; with very bad starting points, it is 11. (The figure compares the start and the result after convergence.)

16. The CM Algorithm for Mixture Models The difference from tests and estimations: here we look for the true Shannon channel. The main formula for mixture models relates the Shannon mutual information and the semantic mutual information without using Jensen's inequality (the formulas are images on the slide). Three steps: 1) Left-step-a (for the formula shown on the slide); 2) Left-step-b: update P(Y) by P(yj) = ∑i P(xi)P(yj|xi), using an inner iteration, until H(Y||Y+1) → 0; 3) Right-step: for the guessed Shannon channel, maximize G. The CM vs. the EM: Left-step-a ≈ E-step; Left-step-b + Right-step ≈ M-step. (A code sketch of the three steps follows.)
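A rough Python sketch of how the three steps could be arranged for a one-dimensional Gaussian mixture. This is my own illustration, not the author's implementation; the function names gauss and cm_mixture and the use of a discretized grid are mine, and Left-step-a is written as the E-step-like posterior computation, following the "Left-step-a ≈ E-step" note above.

    import numpy as np

    def gauss(x, mu, sigma):
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

    def cm_mixture(x, p_x, mu, sigma, p_y, n_outer=50, tol=1e-4):
        """x: 1-D grid of X values; p_x: sample distribution P(X) on the grid;
        mu, sigma, p_y: initial guesses for component means, deviations, and weights."""
        mu, sigma, p_y = [np.asarray(v, dtype=float) for v in (mu, sigma, p_y)]
        for _ in range(n_outer):
            # Left-step-a (~ E-step): posterior P(y_j|x_i) from the current components.
            lik = np.array([gauss(x, m, s) for m, s in zip(mu, sigma)])   # P(x_i|theta_j)
            # Left-step-b: inner iteration on P(Y) until H(Y||Y+1) -> 0.
            while True:
                post = p_y[:, None] * lik
                post /= post.sum(axis=0) + 1e-300                         # P(y_j|x_i)
                p_y_new = post @ p_x                                      # P(y_j) = sum_i P(x_i)P(y_j|x_i)
                done = np.abs(p_y_new - p_y).sum() < tol
                p_y = p_y_new
                if done:
                    break
            # Right-step: re-estimate the parameters to maximize G
            # (for Gaussian components this reduces to weighted moments).
            w = post * p_x                                                # joint P(x_i, y_j)
            w_j = w.sum(axis=1)
            mu = (w @ x) / w_j
            sigma = np.sqrt(((x[None, :] - mu[:, None]) ** 2 * w).sum(axis=1) / w_j)
        return mu, sigma, p_y

For instance, x could be np.linspace(-10, 10, 401) and p_x a normalized histogram of the sample on that grid.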

17. Illustrating the Convergence of the CM Algorithm The central idea of the CM algorithm is to find the point where G ≈ R on the two-dimensional R-G plane while also driving R → R*; the EM algorithm neglects R → R*. This amounts to minimizing H(Q||P) = R(G) - G (similar to a min-max method). Two examples are shown: one starting with R < R* (Q < Q*) and one starting with R > R* (Q > Q*); the latter is a counterexample against the EM, in which Q decreases while approaching the target.
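For what it is worth, the gap R - G can be written out directly from the two definitions used above (my derivation, not on the slide):

    R - G = ∑j ∑i P(xi, yj) log [P(xi|yj)/P(xi|θj)],

and when the Shannon channel is derived from the current mixture, P(yj|xi) = P(yj)P(xi|θj)/Pθ(xi) with Pθ(X) = ∑j P(yj)P(X|θj), this reduces to ∑i P(xi) log [P(xi)/Pθ(xi)], the Kullback-Leibler divergence between the sample distribution and the predicted mixture; driving it to zero makes the mixture reproduce P(X).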

18. A Counterexample with R > R* (Q > Q*) against the EM The true, starting, and ending parameters are shown in the table. The number of iterations is 5. Excel demo files can be downloaded from: http://survivor99.com/lcg/cc-iteration.zip

19. Illustrating Fuzzy Classification for Mixture Models After we obtain the optimized P(X|Θ), we need to select Y (to make a decision or classification) according to X. The parameter s in the R(G) function reminds us that we may use the following Shannon channel as the classifying function (j = 1, 2, …, n, reconstructed below); when s → ∞, P(yj|X) becomes 0 or 1.
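The classifying function itself is an image on the slide; a form consistent with the parametric solution of the R(G) function (my reconstruction) is:

    P(yj|X) = P(yj) [P(X|θj)/P(X)]^s / ∑k P(yk) [P(X|θk)/P(X)]^s,   j = 1, 2, …, n.

As s → ∞, all the probability concentrates on the component with the largest P(X|θj), so P(yj|X) becomes 0 or 1.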

20. The Number of Iterations Needed for Convergence For Gaussian mixture models with component number n = 2:
- EM or improved EM: 15-33 iterations for convergence (according to references);
- CM (until H(Q||P) ≤ 0.001): 4-12 iterations; 5 is the most common number.

21. MSI in Comparison with MLE and MAP MSI: Maximum Semantic Information (estimation). The MLE, MAP, and MSI formulas on the slide are images; see the comparison below. MSI has these features: 1) it is compatible with MLE, but also suitable for cases with a variable source P(X); 2) it is compatible with traditional Bayesian predictions; 3) it uses truth functions as predictive models, so the models reflect the communication channels' features (for example, GPS devices and medical tests provide Shannon channels and semantic channels).
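A hedged reconstruction of the comparison (the MLE and MAP lines are the standard definitions; the MSI line follows the semantic information measure above; Dj denotes the sub-sample labeled yj, my notation):

    MLE: θj* = argmax over θ of ∑i P(xi|yj) log P(xi|θ)   (equivalently, argmax of log P(Dj|θ));
    MAP: θj* = argmax over θ of [log P(Dj|θ) + log P(θ)];
    MSI: θj* = argmax over θ of ∑i P(xi|yj) log [T(θ|xi)/T(θ)] = argmax over θ of ∑i P(xi|yj) log [P(xi|θ)/P(xi)].

When P(X) is fixed, the MSI objective differs from the MLE objective only by a constant; when P(X) varies, the MSI objective still measures the information the model provides.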

22. Summary The Channels' Matching (CM) algorithm is a new tool for statistical learning. It can be used to solve the problems of tests, estimations, multi-label logical classifications, and mixture models more conveniently. ——End—— Thank you for listening! Criticism is welcome! 2018 IEEE International Conference on Big Data and Smart Computing, January 15-18, 2018, Shanghai, China. More papers about the author's semantic information theory: http://survivor99.com/lcg/books/GIT/index.htm