
1 Semantic Channel and Shannon Channel Mutually Match and Iterate for Tests, Estimations, and Classifications with Maximum Mutual Information and Maximum Likelihood
Chenguang Lu (鲁晨光)
Homepage: This ppt may be downloaded from

2 1. The Tasks of Tests and Estimations
For tests and estimations with given P(X) and P(Z|X), or with a sample {(x(t), z(t)) | t = 1, 2, …, N}, how do we partition C (i.e., find the boundaries) so that Shannon's mutual information and the average log-likelihood (ALL) are maximized?
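A compact restatement of the objective, as a sketch (the partition classes Cj and the induced channel below are made explicit here; they are only implicit on the slide):

```latex
% A partition {C_j} of C maps the observed Z to the label Y = y_j when Z falls in C_j,
% which induces the Shannon channel
\[ P(y_j|X) = \sum_{z \in C_j} P(z|X). \]
% The task is to choose the partition (the boundaries) that maximizes
\[ I(X;Y) = \sum_j \sum_i P(x_i)\,P(y_j|x_i)\,\log \frac{P(y_j|x_i)}{P(y_j)}, \]
% which, once the semantic channel is matched to it, also maximizes the average log-likelihood.
```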

3 Mutual Information and Average Log-Likelihood Change with z' in Tests
As z' moves right, the likelihood provided by y1, P(X|θ1), increases, while P(X|θ0) decreases; the Kullback-Leibler information I(X; y1) increases, while I(X; y0) decreases. There is a best dividing point z* that maximizes the mutual information and the average log-likelihood.

4 A Story about Looking for z*: Similar to Catching a Cricket
I study semantic information theory and define: information = log[P(X|θj)/P(X)]. This information criterion is compatible with the likelihood and likelihood-ratio criteria. When I first used the information criterion to find the optimal z* in a test, an interesting thing happened. For any starting z', my Excel file told me: the best dividing point is the next one! After I used the next one, it still said: the best point is the next one! … Fortunately, it converges! It is similar to catching a cricket… Do you know this secret? Can this method converge in every case? Let us prove the convergence with my semantic information theory.

5 5. The Research History
1993: 《广义信息论》 (A Generalized Information Theory), University of Science and Technology of China Press (中国科技大学出版社).
1994: 《广义熵和广义互信息的编码意义》 (The Coding Meanings of Generalized Entropy and Generalized Mutual Information), Journal on Communications (通信学报), Vol. 5, No. 6, pp. 37-44.
1997: 《投资组合的熵理论和信息价值》 (Entropy Theory of Portfolio and Information Value), University of Science and Technology of China Press.
1999: A generalization of Shannon's information theory (a short version of the book), Int. J. of General Systems, 28(6), 1999.
Recently, I found that this theory could be used to improve statistical learning in many respects. See my home page and blog.
(Book cover shown on the slide: published in 1993.)

6 6. Important step: Using Truth Function to Produce Likelihood Function
We use the membership function mAj(X) as the truth function of a hypothesis yj = "X is in Aj": T(θj|X) = mAj(X), where θj = Aj (a fuzzy set) serves as a sub-model. Important step 1: use T(θj|X) and the source P(X) to produce the semantic likelihood function.
(GPS illustration on the slide: How do we predict the real position from a GPS reading? Is the car on the building? The logical probability is exp[-(xj - xi)²/(2d²)]; the most possible position is marked.)
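The production step appears only as an image in this transcript; the following is a sketch of the semantic Bayes' formula it refers to, assuming the logical probability T(θj) as the normalizing constant:

```latex
% Logical probability of y_j (the normalizing constant):
\[ T(\theta_j) = \sum_i P(x_i)\,T(\theta_j|x_i). \]
% Semantic Bayes' formula: the truth function and the source produce the
% semantic likelihood function
\[ P(X|\theta_j) = \frac{P(X)\,T(\theta_j|X)}{T(\theta_j)}. \]
```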

7 7. Semantic Information Measure Compatible with Shannon, Popper, Fisher, and Zadeh's Thoughts
We use the log normalized likelihood to define semantic information. If T(θj|X) = exp[-|X - xj|²/(2d²)], j = 1, 2, …, n, then the semantic information equals Bar-Hillel and Carnap's information minus a squared-deviation term. This reflects Popper's thought well: the less the logical probability is, the more information there is; the larger the deviation is, the less information there is; and a wrong estimation conveys negative information.
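The measure itself is shown only as an image; a sketch of it as the log of the normalized likelihood, with the Gaussian truth function above plugged in:

```latex
% Semantic information conveyed by y_j about x_i (log of the normalized likelihood):
\[ I(x_i;\theta_j) = \log \frac{P(x_i|\theta_j)}{P(x_i)} = \log \frac{T(\theta_j|x_i)}{T(\theta_j)}. \]
% With the Gaussian truth function T(theta_j|X) = exp[-|X - x_j|^2/(2d^2)]:
\[ I(x_i;\theta_j) = \log \frac{1}{T(\theta_j)} - \frac{|x_i - x_j|^2}{2d^2}, \]
% i.e., Bar-Hillel and Carnap's information minus a squared-deviation term.
```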

8 8. Semantic Kullback-Leibler Information and Semantic Mutual Information
Important step 2: average I(xi;θj) over the sampling distribution P(X|yj) to get the semantic Kullback-Leibler information I(X;θj), which has a simple relation to the normalized log-likelihood. Averaging I(X;θj) again over P(yj) gives the semantic mutual information. To maximize I(X;θj) is to maximize the likelihood. (The slide annotates the formulas with "sampling distribution" and "likelihood" labels.)
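The averaged formulas are shown only as images; a sketch in the same notation (the sample counts Nij and Nj in the likelihood relation are shorthand introduced here):

```latex
% Semantic Kullback-Leibler information: I(x_i; theta_j) averaged over the
% sampling distribution P(X|y_j):
\[ I(X;\theta_j) = \sum_i P(x_i|y_j)\,\log \frac{P(x_i|\theta_j)}{P(x_i)}. \]
% Relation to the normalized log-likelihood, with N_{ij} the count of x_i given y_j
% and N_j = sum_i N_{ij}:
\[ \frac{1}{N_j}\,\log \prod_i \left[\frac{P(x_i|\theta_j)}{P(x_i)}\right]^{N_{ij}} \approx I(X;\theta_j). \]
% Semantic mutual information: I(X; theta_j) averaged again over P(y_j):
\[ I(X;\Theta) = \sum_j P(y_j) \sum_i P(x_i|y_j)\,\log \frac{P(x_i|\theta_j)}{P(x_i)}. \]
```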

9 9. Channels' Matching Algorithm
The Shannon channel consists of transition probability functions P(yj|X); the semantic channel consists of truth functions T(θj|X); the semantic mutual information formula connects them. We may fix one channel and optimize the other alternately to achieve MLE. (Diagram on the slide: with yj fixed and X varying, the transition probability function of the Shannon channel corresponds to the truth function of the semantic channel, and the sampling distribution corresponds to the likelihood function.)
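A minimal numerical sketch of the two channels for discrete X and Y (an illustration only, not the author's code; the NumPy representation and the helper names are assumptions):

```python
# A minimal sketch of the two channels for discrete X and Y.
# Assumptions: P_x is a length-m vector, P_y_given_x is an m-by-n matrix with
# rows indexed by x_i and columns by y_j; these names are illustrative only.
import numpy as np

def match_semantic_channel(P_y_given_x):
    """Make each truth function proportional to the corresponding transition
    probability function, normalized so its maximum is 1 (the semantic channel
    matches the Shannon channel, as on the next slide)."""
    return P_y_given_x / P_y_given_x.max(axis=0, keepdims=True)

def semantic_mutual_information(P_x, P_y_given_x, T):
    """I(X; Theta) = sum_ij P(x_i) P(y_j|x_i) log[ T(theta_j|x_i) / T(theta_j) ]."""
    T_j = P_x @ T                          # logical probabilities T(theta_j)
    joint = P_x[:, None] * P_y_given_x     # P(x_i, y_j)
    ratio = np.where(T > 0, T / T_j, 1.0)  # zero-truth cells contribute 0 here (a simplification)
    return float(np.sum(joint * np.log(ratio)))
```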

10 10. Semantic Channel Matches Shannon’s Channel
Optimize the truth function and hence the semantic channel. When the sample is large enough, the optimized truth function is proportional to the transition probability function; in other words, the semantic channel matches the Shannon channel. Here xj* is the value that makes P(yj|xj*) the maximum of P(yj|X). If P(yj|X) or P(yj) is hard to obtain, we may instead normalize P(X|yj)/P(X). With T*(θj|X), the semantic Bayesian prediction is equivalent to the traditional Bayesian prediction: P*(X|θj) = P(X|yj). (Slide annotations: longitudinal normalizing constant; semantic channel; Shannon channel.)
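A sketch of the optimized truth function implied by the proportionality statement above, with the alternative form for when P(yj|X) or P(yj) is hard to obtain:

```latex
% Optimized truth function: proportional to the transition probability function,
% normalized by its maximum, which is reached at x_j^*:
\[ T^*(\theta_j|X) = \frac{P(y_j|X)}{P(y_j|x_j^*)} = \frac{P(X|y_j)/P(X)}{P(x_j^*|y_j)/P(x_j^*)}. \]
% With T^*, the semantic Bayesian prediction equals the traditional Bayesian prediction:
\[ P^*(X|\theta_j) = \frac{P(X)\,T^*(\theta_j|X)}{\sum_i P(x_i)\,T^*(\theta_j|x_i)} = P(X|y_j). \]
```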

11 11. Multi-label Logical Classification and Selective Classification with ML Criterion
The receivers' logical classification is to obtain membership functions; the slide gives one formula for when the sample is not big enough and another for when it is big enough. The senders' selective classification is to select a yj (or make Bayes' decision). If X is unseen and we can only see the observed condition Z, as in a test or estimation, then we select yj from P(X|Z) instead; see the sketch after this paragraph.
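A sketch of the selective classification rules under the semantic information criterion (a reconstruction from the surrounding definitions, not copied from the slide):

```latex
% Selective classification when X is seen: choose the label carrying the most
% semantic information.
\[ y_j^* = f(X) = \arg\max_j \log \frac{T(\theta_j|X)}{T(\theta_j)}. \]
% When only an observed condition Z is available (tests, estimations), average
% the same criterion over P(X|Z):
\[ y_j^* = f(Z) = \arg\max_j \sum_i P(x_i|Z)\,\log \frac{T(\theta_j|x_i)}{T(\theta_j)}. \]
```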

12 12. Two Information Amounts Change with z
Here y0 is test-negative and y1 is test-positive. To optimize T(θj|X): T(θ1|x1) = T(θ0|x0) = 1, T(θ1|x0) = b1'* = P(y1|x0)/P(y1|x1), and T(θ0|x1) = b0'* = P(y0|x1)/P(y0|x0). To optimize the classifier, we choose the dividing point z' that maximizes the information for j = 1, 2; a sketch of this iteration follows.
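A runnable sketch of this iteration for a binary test (an illustration only: the Gaussian class densities, the prior P(X) = (0.8, 0.2), the grid, and the helper names are assumptions, not the example on the slide):

```python
# Sketch of the "catching a cricket" iteration for a binary test with a
# 1-D discretized observation Z, known P(X) and P(Z|X).
import numpy as np

z = np.linspace(-4.0, 8.0, 601)                       # discretized observation axis
P_x = np.array([0.8, 0.2])                            # P(x0)=no disease, P(x1)=disease
dens = np.array([np.exp(-(z - 0.0)**2 / 2.0),         # un-normalized density of Z given x0
                 np.exp(-(z - 2.0)**2 / 2.0)])        # and given x1 (assumed Gaussians)
P_z_given_x = dens / dens.sum(axis=1, keepdims=True)

def shannon_channel(k):
    """P(y0|x_i), P(y1|x_i) when 'test-positive' means z >= z[k]."""
    p1 = P_z_given_x[:, k:].sum(axis=1)
    return np.stack([1.0 - p1, p1], axis=1)           # columns: y0, y1

def best_next_point(k):
    """One matching cycle: optimize T from the current channel, then reclassify."""
    Pyx = shannon_channel(k)                          # Shannon channel for current z'
    T = np.array([[1.0, Pyx[0, 1] / Pyx[1, 1]],       # T(theta0|x0)=1, T(theta1|x0)=b1'*
                  [Pyx[1, 0] / Pyx[0, 0], 1.0]])      # T(theta0|x1)=b0'*, T(theta1|x1)=1
    T_j = P_x @ T                                     # logical probabilities T(theta_j)
    P_x_given_z = P_x[:, None] * P_z_given_x          # joint P(x_i, z), then normalize
    P_x_given_z /= P_x_given_z.sum(axis=0, keepdims=True)
    info = P_x_given_z.T @ np.log(T / T_j)            # rows: z values, columns: y0, y1
    positive = info[:, 1] > info[:, 0]                # classify each z by maximum information
    return int(np.argmax(positive))                   # new dividing index (first positive z)

k = 50                                                # any starting dividing point
for _ in range(100):
    k_new = best_next_point(k)
    if k_new == k:                                    # converged: the best point is itself
        break
    k = k_new
print("optimal dividing point z* ≈", z[k])
```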

13 13. Using R(G) Function to Prove Iterative Convergence
Shannon's information theory provides the rate-distortion function R(D), where R(D) means the minimum rate R for a given distortion D. Replacing D with the semantic mutual information G, we obtain the R(G) function, which describes the minimum R for a given G. All R(G) functions are bowl-like; the slide marks the matching point on the curve.
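A sketch of the definition, replacing the distortion constraint of the classical rate-distortion function with a constraint on the semantic mutual information G (the exact constraint form is a reconstruction):

```latex
% Classical rate-distortion function: the minimum mutual information for a
% given average distortion D.
\[ R(D) = \min_{P(Y|X):\ \mathbb{E}[d(X,Y)] \le D} I(X;Y). \]
% R(G): the distortion constraint is replaced by a constraint on the semantic
% mutual information G = I(X;\Theta).
\[ R(G) = \min_{P(Y|X):\ I(X;\Theta) = G} I(X;Y). \]
```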

14 14. Using R(G) Function to Prove CM Algorithm’s Convergence for Tests and Estimations
Iterative steps and the reasons for convergence: 1) for each Shannon channel, there is a matched semantic channel that maximizes the average log-likelihood; 2) for a given P(X) and semantic channel, we can find a better Shannon channel; 3) repeating the two steps yields the Shannon channel that maximizes the Shannon mutual information and the average log-likelihood. Each R(G) function serves as a ladder letting R climb up; we then find a better semantic channel, and hence a better ladder.

15 15. An Example for Estimations Shows the Convergent Reliability
A 3×3 Shannon channel is used to show reliable convergence. Even if a pair of bad starting points is used, the convergence is still reliable: with good starting points, the number of iterations is 4; with very bad starting points, it is 11. (The slide shows the channel at the start and after convergence.)

16 16. The CM Algorithm for Mixture Models. Difference: Looking for the True Shannon Channel
The slide compares the semantic mutual information and the Shannon mutual information, where P(yj) = ∑i P(xi)P(yj|xi), and gives the main formula for mixture models (obtained without Jensen's inequality). Three steps: 1) Left-step-a; 2) Left-step-b, using an inner iteration, driving H(Y||Y+1) → 0; 3) Right-step for the guessed Shannon channel, maximizing G. The CM vs. the EM: Left-step-a ≈ E-step; Left-step-b + Right-step ≈ M-step.
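A rough, runnable sketch of an alternation in the spirit of these three steps, for a 1-D two-component Gaussian mixture (one reading of the steps, not the author's implementation: the data, the grid, and the moment updates standing in for "maximizing G" are all assumptions):

```python
# Rough sketch of a CM-style alternation for a 1-D two-component Gaussian
# mixture on a discretized X axis; illustrative reading of the three steps.
import numpy as np

x = np.linspace(-6, 10, 401)
P_x_true = 0.7 * np.exp(-(x - 0)**2 / 2) + 0.3 * np.exp(-(x - 3)**2 / (2 * 1.5**2))
P_x = P_x_true / P_x_true.sum()                      # sample distribution P(X)

mu, sigma, P_y = np.array([-1.0, 5.0]), np.array([2.0, 2.0]), np.array([0.5, 0.5])

for step in range(100):
    # Component likelihood functions P(X|theta_j) on the grid
    lik = np.exp(-(x[:, None] - mu)**2 / (2 * sigma**2))
    lik /= lik.sum(axis=0, keepdims=True)

    # Corresponds roughly to Left-step-a: guessed Shannon channel P(y_j|x_i)
    # from the current P(y_j) and P(X|theta_j)
    P_y_given_x = P_y * lik
    P_y_given_x /= P_y_given_x.sum(axis=1, keepdims=True)

    # Corresponds roughly to Left-step-b (inner iteration): repeat
    # P(y_j) = sum_i P(x_i) P(y_j|x_i) until Y's distribution stops changing
    for _ in range(100):
        P_y_new = P_x @ P_y_given_x
        P_y_given_x = P_y_new * lik
        P_y_given_x /= P_y_given_x.sum(axis=1, keepdims=True)
        if np.abs(P_y_new - P_y).max() < 1e-9:
            break
        P_y = P_y_new

    # Corresponds roughly to Right-step: refit component parameters from the
    # weighted data (moment updates stand in for "maximizing G" here)
    w = P_x[:, None] * P_y_given_x                   # joint weights P(x_i, y_j)
    w_sum = w.sum(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / w_sum
    sigma = np.sqrt((w * (x[:, None] - mu)**2).sum(axis=0) / w_sum)

print("means:", mu, "sigmas:", sigma, "weights:", P_y)
```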

17 17. Illustrating the Convergence of the CM Algorithm
The central idea of the CM is to find the point where G ≈ R on the two-dimensional R-G plane, while also driving R → R* (the EM algorithm neglects R → R*); this minimizes H(Q||P) = R(G) − G (similar to a min-max method). Two examples: starting with R < R* or Q < Q*, and starting with R > R* or Q > Q*; the latter is a counterexample against the EM, in which Q is decreasing. (The slide marks the target point on the plane.)

18 18. A Counterexample with R>R* or Q>Q* against the EM
The true, starting, and ending parameters are shown on the slide. The number of iterations is 5. Excel demo files can be downloaded from:

19 19. Illustrating Fuzzy Classification for Mixture Models
After we obtain the optimized P(X|Θ), we need to select Y (to make a decision or classification) according to X. The parameter s in the R(G) function reminds us that we may use the following Shannon channel as the classifying function, for j = 1, 2, …, n; when s → ∞, P(yj|X) = 0 or 1. A sketch of the formula follows.
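The classifying function is shown only as an image; a sketch of the form suggested by the parametric solution of the R(G) function (a reconstruction, with the exponent s applied to the likelihood ratio):

```latex
% A soft (fuzzy) classifying channel controlled by the parameter s:
\[ P(y_j|X) = \frac{P(y_j)\,\bigl[P(X|\theta_j)/P(X)\bigr]^{s}}
                   {\sum_k P(y_k)\,\bigl[P(X|\theta_k)/P(X)\bigr]^{s}},
   \qquad j = 1, 2, \ldots, n. \]
% As s -> infinity, P(y_j|X) hardens to 0 or 1: the component with the largest
% likelihood ratio wins.
```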

20 20. The Numbers of Iterations for Convergence
For Gaussian mixture models with component number n = 2:
Algorithm: EM or improved EM — number of iterations for convergence: 15-33 (according to references).
Algorithm: CM, for H(Q||P) ≤ 0.001 — number of iterations for convergence: 4-12, with 5 the most common number.

21 21. MSI in Comparison with MLE and MAP
MSI (estimation) means Maximum Semantic Information estimation; the slide lists the MLE, MAP, and MSI criteria side by side (a sketch follows). MSI has three features: 1) it is compatible with MLE but also suits cases with a variable source P(X); 2) it is compatible with traditional Bayesian predictions; 3) it uses truth functions as predictive models, so the models reflect the features of communication channels (for example, GPS devices and medical tests provide Shannon channels and semantic channels).
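The three criteria are shown only as images; a sketch of the standard MLE and MAP forms next to the MSI criterion defined by the semantic mutual information above:

```latex
% Maximum likelihood estimation (MLE): maximize the likelihood of the sample D.
\[ \theta^{*}_{\mathrm{MLE}} = \arg\max_{\theta} P(D|\theta). \]
% Maximum a posteriori (MAP) estimation: weight the likelihood by a prior over theta.
\[ \theta^{*}_{\mathrm{MAP}} = \arg\max_{\theta} P(D|\theta)\,P(\theta). \]
% Maximum semantic information (MSI) estimation: maximize the semantic mutual
% information, i.e., the likelihood normalized by the (possibly variable) source P(X).
\[ \theta^{*}_{\mathrm{MSI}} = \arg\max_{\theta} \sum_j P(y_j) \sum_i P(x_i|y_j)\,
   \log \frac{P(x_i|\theta_j)}{P(x_i)}. \]
```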

22 22. Summary
The Channels' Matching (CM) algorithm is a new tool for statistical learning. It can be used to resolve the problems of tests, estimations, multi-label logical classifications, and mixture models more conveniently.
——End——
Thank you for listening! Welcome to criticize!
2018 IEEE International Conference on Big Data and Smart Computing, January 15-18, 2018, Shanghai, China
More papers about the author's semantic information theory:

