Midterm Exam: 03/04, in class
Project
- Team work: no more than 2 people per team.
- Define a project of your own; otherwise, I will assign you a “tough” project.
- Important dates:
  - 03/23: project proposal
  - 04/27 and 04/29: presentations
  - 05/02: final report
Project Proposal
- Introduction: describe the research problem.
- Related work: describe the existing approaches and their deficiencies.
- Proposed approaches: describe your approaches and their potential to overcome the shortcomings of existing approaches.
- Plan: the plan for this project (code development, data sets, and evaluation).
- Format: it should look like a research paper. The required format (both Microsoft Word and LaTeX) can be downloaded from www.cse.msu.edu/~cse847/assignments/format.zip
- Warning: any submission that does not follow the format will be given a zero score.
Project Report
- Same format as the proposal.
- Expand the proposal with a detailed description of your algorithm and the evaluation results.
Presentation
- 25-minute presentation
- 5-minute discussion
Introduction to Information Theory Rong Jin
Information
- Information ≠ knowledge.
- Information: a reduction in uncertainty.
- Example: (1) flip a coin; (2) roll a die. The outcome of #2 is more uncertain than that of #1, so observing #2 provides more information than observing #1.
Definition of Information
- Let E be an event that occurs with probability P(E). If we are told that E has occurred, we say we have received I(E) = log_2(1/P(E)) bits of information.
- Example: the result of a fair coin flip carries log_2 2 = 1 bit; the result of a fair die roll carries log_2 6 ≈ 2.585 bits.
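A quick numeric check of the definition above (a minimal Python sketch; the function name self_information is just for illustration):

```python
import math

def self_information(p):
    """Information received when an event of probability p occurs, in bits."""
    return math.log2(1.0 / p)

print(self_information(1 / 2))  # fair coin flip: 1.0 bit
print(self_information(1 / 6))  # fair die roll: ~2.585 bits
```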
Entropy
- A zero-memory information source S emits symbols from an alphabet {s_1, s_2, …, s_k} with probabilities {p_1, p_2, …, p_k}, respectively, where the emitted symbols are statistically independent.
- Entropy is the average amount of information in observing the output of S: H(S) = Σ_i p_i log_2(1/p_i).
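A small Python sketch of the entropy of a zero-memory source (base-2 logs; the symbol probabilities below are made up purely for illustration):

```python
import math

def entropy(probs):
    """Average information per emitted symbol, in bits: H = sum_i p_i * log2(1/p_i)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))            # fair coin source: 1.0 bit
print(entropy([1 / 6] * 6))           # fair die source: ~2.585 bits
print(entropy([0.7, 0.1, 0.1, 0.1]))  # skewed 4-symbol source: below log2(4) = 2 bits
```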
Entropy
1. 0 ≤ H(P) ≤ log_2 k.
2. Entropy measures the uniformness of a distribution P: the further P is from uniform, the lower the entropy.
3. For any other probability distribution {q_1, …, q_k}: H(P) = Σ_i p_i log_2(1/p_i) ≤ Σ_i p_i log_2(1/q_i).
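The properties can be checked numerically; here is a short sketch of property 3 (Gibbs' inequality), with P and Q chosen arbitrarily for illustration:

```python
import math

def cross_entropy(p, q):
    """sum_i p_i * log2(1/q_i); equals H(P) when Q = P."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]
print(cross_entropy(P, P))  # H(P) ~= 1.485 bits
print(cross_entropy(P, Q))  # >= H(P) for any Q, with equality only when Q = P
```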
A Distance Measure Between Distributions
- Kullback-Leibler distance between distributions P and Q: D(P, Q) = Σ_i p_i log_2(p_i / q_i).
- 0 ≤ D(P, Q); the smaller D(P, Q), the more similar Q is to P.
- Non-symmetric: D(P, Q) ≠ D(Q, P).
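A minimal sketch of the KL distance and its asymmetry (the distributions are chosen arbitrarily for illustration):

```python
import math

def kl(p, q):
    """D(P, Q) = sum_i p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]
print(kl(P, P))  # 0.0: a distribution is at distance zero from itself
print(kl(P, Q))  # ~0.32 bits
print(kl(Q, P))  # ~0.28 bits: D(P, Q) != D(Q, P)
```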
Mutual Information
- Indicates the amount of information shared between two random variables: I(X; Y) = Σ_{x,y} p(x, y) log_2 [ p(x, y) / (p(x) p(y)) ].
- Symmetric: I(X; Y) = I(Y; X).
- Zero if and only if X and Y are independent.
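A short sketch computing I(X; Y) from a joint distribution table (the 2x2 joint distributions below are made up for illustration):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) ); joint is a nested list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # independent X, Y: 0.0
print(mutual_information([[0.4, 0.1], [0.1, 0.4]]))      # correlated X, Y: > 0
```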
Maximum Entropy Rong Jin
Motivation
- Consider a translation example: English ‘in’ → French {dans, en, à, au-cours-de, pendant}.
- Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant).
- Case 1: no prior knowledge about the translation.
- Case 2: 30% of the time either dans or en is used.
Maximum Entropy Model: Motivation
- Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used.
- We need a measure of the uniformness of a distribution.
Maximum Entropy Principle (MaxEnt)
- Among all distributions consistent with the constraints, choose the one with maximum entropy.
- For Case 3 this gives approximately: p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1, p(au-cours-de) = 0.2, p(pendant) = 0.2.
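To make the principle concrete, here is a rough numerical sketch that searches for the maximum-entropy distribution under the Case 3 constraints with scipy's SLSQP solver (assuming scipy is available; the result comes out close to the values quoted above):

```python
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "à", "au-cours-de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)           # avoid log(0)
    return float(np.sum(p * np.log(p)))  # minimizing -H(p) maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 0.3
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},  # p(dans) + p(à) = 0.5
]
res = minimize(neg_entropy, x0=np.full(5, 0.2), method="SLSQP",
               bounds=[(0.0, 1.0)] * 5, constraints=constraints)
for w, p in zip(words, res.x):
    print(f"p({w}) = {p:.3f}")
```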
MaxEnt for Classification
- Objective: learn the conditional distribution p(y|x).
- Constraint: appropriate normalization, i.e., Σ_y p(y|x) = 1 for every x.
MaxEnt for Classification
- Constraint: consistency with the data. For every feature function f_j and every class y, the empirical mean of the feature must equal its mean under the model:
  (1/n) Σ_i f_j(x_i) I(y_i = y)  =  (1/n) Σ_i p(y|x_i) f_j(x_i).
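A sketch of the two quantities being matched, computed on a made-up toy data set with a made-up model p(y|x), purely to illustrate the constraint:

```python
import numpy as np

# Toy data (for illustration only): 4 examples, 2 binary features, 2 classes.
F = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 0]], dtype=float)   # F[i, j] = f_j(x_i)
y = np.array([0, 0, 1, 1])            # observed class labels
n, n_feat = F.shape
n_class = 2

# Empirical mean of feature j for class c: (1/n) * sum_i f_j(x_i) * I(y_i = c)
empirical = np.zeros((n_class, n_feat))
for i in range(n):
    empirical[y[i]] += F[i]
empirical /= n

# Model mean of feature j for class c: (1/n) * sum_i p(c|x_i) * f_j(x_i)
p_y_given_x = np.full((n, n_class), 0.5)   # a made-up model that ignores x
model = p_y_given_x.T @ F / n

print(empirical)  # what the data says
print(model)      # what the model says; MaxEnt requires these two to match
```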
MaxEnt for Classification
- No assumption about the form of p(y|x) (non-parametric).
- Only the empirical means of the feature functions are needed.
MaxEnt for Classification: Feature functions
Example of Feature Functions
f_1(x) = I(x ∈ {dans, en}),  f_2(x) = I(x ∈ {dans, à})

  x                  f_1(x)   f_2(x)
  dans                 1        1
  en                   1        0
  au-cours-de          0        0
  à                    0        1
  pendant              0        0
  Empirical average    0.3      0.5
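The indicator features above can be written directly in code; this tiny sketch just reproduces the 0/1 entries of the table (the empirical averages additionally require word frequencies, which the slide does not list):

```python
def f1(x):
    return int(x in {"dans", "en"})

def f2(x):
    return int(x in {"dans", "à"})

for x in ["dans", "en", "au-cours-de", "à", "pendant"]:
    print(f"{x:12s}  f1={f1(x)}  f2={f2(x)}")
```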
Solution to MaxEnt
- The solution is identical to a conditional exponential model: p(y|x) = exp(Σ_j w_{y,j} f_j(x)) / Σ_{y'} exp(Σ_j w_{y',j} f_j(x)).
- Solve for the weights W by maximum likelihood estimation.
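A minimal sketch of the conditional exponential form, with per-class weights w_{y,j} on shared features f_j(x) (the array shapes and values below are made up for illustration):

```python
import numpy as np

def p_y_given_x(W, f):
    """p(y|x) = exp(sum_j W[y, j] * f_j(x)) / Z(x), for all classes y at once."""
    scores = W @ f                # one score per class
    scores -= scores.max()        # subtract the max for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum()      # normalize over classes

W = np.array([[0.5, -0.2],        # weights for class 0
              [-0.1, 0.8]])       # weights for class 1
f = np.array([1.0, 0.0])          # feature vector f(x) for one input x
print(p_y_given_x(W, f))          # a valid distribution over the two classes
```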
Iterative Scaling (IS) Algorithm
- Assumption: all feature functions are non-negative and their sum is a constant, i.e., Σ_j f_j(x) = C for every x.
Iterative Scaling (IS) Algorithm
- Compute the empirical mean of every feature for every class.
- Initialize W.
- Repeat until convergence:
  - Compute p(y|x_i) for each training example (x_i, y_i) using the current W.
  - Compute the model mean of every feature for every class.
  - Update W (a sketch of one common update is given below).
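Below is a rough sketch of one standard variant, generalized iterative scaling (GIS), under the assumption stated earlier (non-negative features whose sum is a constant C). The data and feature values are made up, and the update w += (1/C) * log(empirical / model) is the usual GIS form rather than something taken verbatim from the slides:

```python
import numpy as np

def gis(F, y, n_class, n_iter=200):
    """Generalized iterative scaling for p(y|x) proportional to exp(W[y] . f(x)).

    F: (n, d) array with F[i, j] = f_j(x_i) >= 0 and a constant row sum C.
    y: (n,) array of class labels in {0, ..., n_class - 1}.
    """
    n, d = F.shape
    C = F.sum(axis=1).max()                        # constant feature sum (assumption)
    W = np.zeros((n_class, d))

    # Empirical mean of every feature for every class.
    empirical = np.zeros((n_class, d))
    for i in range(n):
        empirical[y[i]] += F[i]
    empirical /= n

    for _ in range(n_iter):
        scores = F @ W.T                           # (n, n_class) unnormalized log-scores
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)          # p(y|x_i) under the current W

        model = P.T @ F / n                        # model mean of every feature per class
        W += np.log((empirical + 1e-12) / (model + 1e-12)) / C

    return W

# Made-up toy data for illustration: every row of F sums to the same constant C = 2.
F = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])
print(gis(F, y, n_class=2))
```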
Iterative Scaling (IS) Algorithm
- The algorithm guarantees that the likelihood function always increases from one iteration to the next.
Iterative Scaling (IS) Algorithm
- What about features that can take both positive and negative values?
- What if the sum of the features is not a constant?
MaxEnt for Classification