Introduction LING 572 Fei Xia Week 1: 1/4/06
Outline Course overview Mathematical foundation: (Prereq) –Probability theory –Information theory Basic concepts in the classification task
Course overview
General info Course url: –Syllabus (incl. slides, assignments, and papers): updated every week. –Message board –ESubmit Slides: –I will try to put the slides online before class. –“Additional slides” are not required and not covered in class.
Office hour Fei: –Email address: (put "ling572" in the subject line; the 48-hour rule applies) –Office hour: Time: Fri 10-11:20am Location: Padelford A-210G
Lab session Bill McNeil – –Lab session: what time is good for you? Explaining homework and solutions Mallet-related questions Reviewing class material I highly recommend that you attend the lab sessions, especially the first few.
Time for Lab Session Time: –Monday: 10:00am - 12:20pm, or –Tues: 10:30 am - 11:30 am, or –?? Location: ?? Thursday 3-4pm, MGH 271?
Misc Ling572 Mailing list: EPost Mallet developer mailing list:
Prerequisites Ling570 –Some basic algorithms: FSA, HMM, … –NLP tasks: tokenization, POS tagging, … Programming –Java (Mallet is written in Java). If you don’t know Java well, talk to me. Basic concepts in probability and statistics –Ex: random variables, chain rule, Gaussian distribution, … Basic concepts in information theory –Ex: entropy, relative entropy, …
Expectations Reading: –Papers are online: –Reference book: Manning & Schutze (M&S) –Finish reading the papers before class; I will ask you questions.
Grades Assignments (9 parts): 90% –Programming language: Java Class participation: 10% No quizzes, no final exams No “incomplete” unless you can prove your case.
Course objectives Covering basic statistical methods that produce state-of-the-art results Focusing on classification algorithms Touching on unsupervised and semi-supervised algorithms Some material is not easy. We will focus on applications, not theoretical proofs.
Course layout Supervised methods –Classification algorithms: Individual classifiers: –Naïve Bayes –kNN and Rocchio –Decision tree –Decision list: ?? –Maximum Entropy (MaxEnt) Classifier ensemble: –Bagging –Boosting –System combination
Course layout (cont) Supervised algorithms (cont) –Sequence labeling algorithms: Transformation-based learning (TBL) FST, HMM, … Semi-supervised methods –Self-training –Co-training
Course layout (cont) Unsupervised methods –EM algorithm Forward-backward algorithm Inside-outside algorithm …
Questions for each method Modeling: –What is the model? –How does the decomposition work? –What kinds of assumptions are made? –How many types of model parameters? –How many “internal” (or non-model) parameters? –How to handle multi-class problems? –How to handle non-binary features? –…
Questions for each method (cont) Training: how to estimate parameters? Decoding: how to find the “best” solution? Weaknesses and strengths? –Is the algorithm robust (e.g., handling outliers)? scalable? prone to overfitting? efficient in training time? in test time? –How much data is needed? Labeled data Unlabeled data
Relation between 570/571 and 572 570/571 are organized by tasks; 572 is organized by learning methods. 572 focuses on statistical methods.
NLP tasks covered in Ling570 Tokenization Morphological analysis POS tagging Shallow parsing WSD NE tagging
NLP tasks covered in Ling571 Parsing Semantics Discourse Dialogue Natural language generation (NLG) …
An ML method for multiple NLP tasks Tasks (570/571): –Tokenization –POS tagging –Parsing –Reference resolution –… Method (572): –MaxEnt
Multiple methods for one NLP task Task (570/571): POS tagging Method (572): –Decision tree –MaxEnt –Boosting –Bagging –….
Projects: Task 1 Text Classification Task: 20 groups –P1: First look at the Mallet package –P2: Your first tui class Naïve Bayes –P3: Feature selection Decision Tree –P4: Bagging Boosting Individual project
Projects: Task 2 Sequence labeling task: IGT detection –P5: MaxEnt –P6: Beam Search –P7: TBA –P8: Presentation: final class –P9: Final report Group project (?)
Both projects Use Mallet, a Java package Two types of work: –Reading code to understand ML methods –Writing code to solve problems
Feedback on assignments “Misc” section in each assignment –How long did it take to finish the homework? –Which parts were difficult? –…
Mallet overview It is a Java package that includes many –classifiers, –sequence labeling algorithms, –optimization algorithms, –useful data classes, –… You should –read the “Mallet Guides” –attend the Mallet tutorial: next Tuesday 10:30-11:30am, LLC 109 –start on Hw1 I will use Mallet class/method names where possible.
Questions for “course overview”?
Outline Course overview Mathematical foundation –Probability theory –Information theory Basic concepts in the classification task
Probability Theory
Basic concepts Sample space, event, event space Random variable and random vector Conditional probability, joint probability, marginal probability (prior)
Sample space, event, event space Sample space (Ω): a collection of basic outcomes. –Ex: toss a coin twice: {HH, HT, TH, TT} Event: an event is a subset of Ω. –Ex: {HT, TH} Event space (2^Ω): the set of all possible events.
Random variable The outcome of an experiment need not be a number. We often want to represent outcomes as numbers. A random variable X is a function: Ω → R. –Ex: toss a coin twice: X(HH)=0, X(HT)=1, …
Two types of random variables Discrete: X takes on only a countable number of possible values. –Ex: Toss a coin 10 times. X is the number of tails that are noted. Continuous: X takes on an uncountable number of possible values. –Ex: X is the lifetime (in hours) of a light bulb.
Probability function The probability function of a discrete variable X is a function that gives the probability p(x_i) that X equals x_i; that is, p(x_i) = P(X = x_i).
Random vector A random vector is a finite-dimensional vector of random variables: X = [X_1, …, X_n]. P(x) = P(x_1, x_2, …, x_n) = P(X_1 = x_1, …, X_n = x_n) Ex: P(w_1, …, w_n, t_1, …, t_n)
Three types of probability Joint prob: P(x,y) = prob of x and y happening together Conditional prob: P(x|y) = prob of x given a specific value of y Marginal prob: P(x) = prob of x, summed over all possible values of y
Common tricks (I): Marginal prob ← joint prob: P(x) = Σ_y P(x, y)
Common tricks (II): Chain rule: P(x_1, …, x_n) = P(x_1) P(x_2 | x_1) … P(x_n | x_1, …, x_{n-1})
Common tricks (III): Bayes rule: P(y | x) = P(x | y) P(y) / P(x)
Common tricks (IV): Independence assumption: if x and y are assumed independent, P(x, y) = P(x) P(y) and P(x | y) = P(x); more generally, P(x_1, …, x_n) ≈ Π_i P(x_i)
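A small Java sketch (added for illustration, not from the slides) that works through tricks (I)-(IV) on a made-up 2x2 joint distribution; all numbers are invented for the example.

    public class ProbTricks {
        public static void main(String[] args) {
            // Toy joint distribution P(x, y) over x in {0,1}, y in {0,1}.
            double[][] joint = { {0.3, 0.1},    // P(x=0,y=0), P(x=0,y=1)
                                 {0.2, 0.4} };  // P(x=1,y=0), P(x=1,y=1)

            // (I) Marginal from joint: P(x) = sum_y P(x, y)
            double px0 = joint[0][0] + joint[0][1];   // 0.4
            double py0 = joint[0][0] + joint[1][0];   // 0.5

            // (II) Chain rule: P(x, y) = P(x) * P(y | x)
            double py0GivenX0 = joint[0][0] / px0;    // P(y=0 | x=0)
            System.out.println(joint[0][0] + " = " + (px0 * py0GivenX0));

            // (III) Bayes rule: P(x | y) = P(y | x) * P(x) / P(y)
            double px0GivenY0 = py0GivenX0 * px0 / py0;
            System.out.println("P(x=0|y=0) = " + px0GivenY0
                + " (direct: " + (joint[0][0] / py0) + ")");

            // (IV) Independence would mean P(x, y) = P(x) * P(y); it fails here:
            System.out.println("P(x=0,y=0) = " + joint[0][0]
                + " vs P(x=0)P(y=0) = " + (px0 * py0));
        }
    }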
Prior and posterior distribution Prior distribution: P(θ), a distribution over parameter values θ, set before observing any data. Posterior distribution: P(θ | data), our belief about θ after observing the data. Likelihood of the model: P(data | θ) Relation among the three (Bayes rule): P(θ | data) = P(data | θ) P(θ) / P(data)
Two ways of estimating θ Maximum likelihood (ML): θ* = arg max_θ P(data | θ) Maximum a posteriori (MAP): θ* = arg max_θ P(θ | data)
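A toy Java illustration of the two estimators for a coin's P(heads); the data counts and the Beta(alpha, beta) prior are assumptions made here for the example, not part of the slides.

    public class MleVsMap {
        public static void main(String[] args) {
            // Observed data: 8 heads in 10 tosses of a coin with unknown theta = P(heads).
            int heads = 8, tosses = 10;

            // Maximum likelihood: theta* = argmax_theta P(data | theta) = heads / tosses
            double thetaMl = (double) heads / tosses;                       // 0.8

            // MAP with a Beta(alpha, beta) prior on theta:
            // theta* = argmax_theta P(data | theta) P(theta)
            //        = (heads + alpha - 1) / (tosses + alpha + beta - 2)
            double alpha = 2.0, beta = 2.0;   // prior pulls the estimate toward 0.5
            double thetaMap = (heads + alpha - 1) / (tosses + alpha + beta - 2);  // 0.75

            System.out.println("ML estimate:  " + thetaMl);
            System.out.println("MAP estimate: " + thetaMap);
        }
    }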
Information Theory
Information theory It is the use of probability theory to quantify and measure “information”. Basic concepts: –Entropy –Joint entropy and conditional entropy –Cross entropy and relative entropy –Mutual information and perplexity
Entropy Entropy is a measure of the uncertainty associated with a distribution: H(X) = -Σ_x p(x) log_2 p(x) It is the lower bound on the average number of bits it takes to transmit messages drawn from the distribution. An example: –Display the results of horse races. –Goal: minimize the number of bits needed to encode the results.
An example (8 horses) Uniform distribution: p_i = 1/8 for each horse → H = 3 bits. Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64) → H = 2 bits, with an optimal code such as (0, 10, 110, 1110, 111100, 111101, 111110, 111111). The uniform distribution has higher entropy. MaxEnt: make the distribution as “uniform” as possible.
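A short Java sketch (added for illustration) that computes H(X) = -Σ p(x) log_2 p(x) for the two horse-race distributions above; it should print 3.0 bits for the uniform case and 2.0 bits for the skewed case.

    public class EntropyExample {
        // H(X) = - sum_x p(x) * log2 p(x)
        static double entropy(double[] p) {
            double h = 0.0;
            for (double pi : p) {
                if (pi > 0) h -= pi * Math.log(pi) / Math.log(2);
            }
            return h;
        }

        public static void main(String[] args) {
            double[] uniform = {1/8., 1/8., 1/8., 1/8., 1/8., 1/8., 1/8., 1/8.};
            double[] skewed  = {1/2., 1/4., 1/8., 1/16., 1/64., 1/64., 1/64., 1/64.};
            System.out.println("H(uniform) = " + entropy(uniform));  // 3.0 bits
            System.out.println("H(skewed)  = " + entropy(skewed));   // 2.0 bits
        }
    }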
Joint and conditional entropy Joint entropy: H(X, Y) = -Σ_x Σ_y p(x, y) log_2 p(x, y) Conditional entropy: H(Y | X) = -Σ_x Σ_y p(x, y) log_2 p(y | x) = H(X, Y) - H(X)
Cross entropy Entropy: H(p) = -Σ_x p(x) log_2 p(x) Cross entropy: H(p, q) = -Σ_x p(x) log_2 q(x) Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).
Relative entropy Also called Kullback-Leibler divergence: KL(p || q) = Σ_x p(x) log_2 (p(x) / q(x)) = H(p, q) - H(p) Another “distance” measure between prob functions p and q. KL divergence is asymmetric (not a true distance): in general, KL(p || q) ≠ KL(q || p).
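An illustrative Java sketch of cross entropy and KL divergence; the distributions p and q are made up, and the output shows that KL(p||q) and KL(q||p) differ.

    public class KlExample {
        // Cross entropy H(p, q) = - sum_x p(x) * log2 q(x)
        static double crossEntropy(double[] p, double[] q) {
            double h = 0.0;
            for (int i = 0; i < p.length; i++) {
                if (p[i] > 0) h -= p[i] * Math.log(q[i]) / Math.log(2);
            }
            return h;
        }

        // KL(p || q) = sum_x p(x) * log2( p(x) / q(x) ) = H(p, q) - H(p)
        static double kl(double[] p, double[] q) {
            double d = 0.0;
            for (int i = 0; i < p.length; i++) {
                if (p[i] > 0) d += p[i] * Math.log(p[i] / q[i]) / Math.log(2);
            }
            return d;
        }

        public static void main(String[] args) {
            double[] p = {0.5, 0.25, 0.25};   // "true" distribution
            double[] q = {1/3., 1/3., 1/3.};  // our estimate of p
            System.out.println("H(p, q)  = " + crossEntropy(p, q));
            System.out.println("KL(p||q) = " + kl(p, q));
            System.out.println("KL(q||p) = " + kl(q, p));  // different value: KL is asymmetric
        }
    }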
Mutual information It measures how much is in common between X and Y: I(X;Y) = KL(p(x,y) || p(x)p(y)) = Σ_x Σ_y p(x, y) log_2 [ p(x, y) / (p(x) p(y)) ]
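A small Java sketch (illustrative only) that computes I(X;Y) directly as KL(p(x,y) || p(x)p(y)) on a made-up 2x2 joint distribution.

    public class MutualInfoExample {
        public static void main(String[] args) {
            // Toy joint distribution p(x, y); the marginals are computed from it.
            double[][] pxy = { {0.3, 0.1},
                               {0.2, 0.4} };
            double[] px = { pxy[0][0] + pxy[0][1], pxy[1][0] + pxy[1][1] };
            double[] py = { pxy[0][0] + pxy[1][0], pxy[0][1] + pxy[1][1] };

            // I(X;Y) = KL( p(x,y) || p(x)p(y) )
            //        = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x)p(y)) )
            double mi = 0.0;
            for (int x = 0; x < 2; x++)
                for (int y = 0; y < 2; y++)
                    if (pxy[x][y] > 0)
                        mi += pxy[x][y] * Math.log(pxy[x][y] / (px[x] * py[y])) / Math.log(2);

            System.out.println("I(X;Y) = " + mi + " bits");  // 0 iff X and Y are independent
        }
    }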
Perplexity Perplexity is 2^H. Perplexity is the weighted average number of choices a random variable has to make.
Questions for “Mathematical foundation”?
Outline Course overview Mathematical foundation –Probability theory –Information theory Basic concepts in the classification task
Types of ML problems Classification problem Estimation problem Clustering Discovery … A learning method can be applied to one or more types of ML problems. We will focus on the classification problem.
Definition of classification problem Task: –C = {c_1, c_2, …, c_m} is a set of pre-defined classes (a.k.a. labels, categories). –D = {d_1, d_2, …} is a set of inputs to be classified. –A classifier is a function: D × C → {0, 1}. Multi-label vs. single-label –Single-label: each d_i is assigned exactly one class. Multi-class vs. binary classification problem –Binary: |C| = 2.
Conversion to a single-label binary problem Multi-label → single-label –We will focus on the single-label problem. –A classifier D × C → {0, 1} becomes D → C –More general definition: D × C → [0, 1] Multi-class → binary problem –Positive examples vs. negative examples
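The sketch below (illustrative, not Mallet code) shows the two views of a classifier from these slides: the general scoring function D × C → [0, 1], and the single-label reduction D → C obtained by taking the arg max over classes. The Scorer interface and the toy scoring rule are assumptions made for this example.

    import java.util.*;

    public class ClassifierSketch {
        // General form: a classifier scores every (instance, class) pair: D x C -> [0, 1].
        interface Scorer {
            double score(String instance, String label);
        }

        // Single-label form: D -> C, obtained by taking the highest-scoring class.
        static String classify(Scorer f, String instance, List<String> classes) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String c : classes) {
                double s = f.score(instance, c);
                if (s > bestScore) { bestScore = s; best = c; }
            }
            return best;
        }

        public static void main(String[] args) {
            List<String> classes = Arrays.asList("sports", "politics");
            // A made-up scorer: documents mentioning "game" look like sports.
            Scorer f = (doc, label) ->
                label.equals("sports") ? (doc.contains("game") ? 0.9 : 0.2)
                                       : (doc.contains("game") ? 0.1 : 0.8);
            System.out.println(classify(f, "the game last night", classes));  // sports
        }
    }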
Examples of classification problems Text classification Document filtering Language/Author/Speaker id WSD PP attachment Automatic essay grading …
Problems that can be treated as a classification problem Tokenization / Word segmentation POS tagging NE detection NP chunking Parsing Reference resolution …
Labeled vs. unlabeled data Labeled data: –{(x_i, y_i)} is a set of labeled data. –x_i ∈ D: data/input, often represented as a feature vector. –y_i ∈ C: target/label Unlabeled data –{x_i} without y_i.
Instance, training and test data x_i, with or without y_i, is called an instance. Training data: a set of (labeled) instances. Test data: a set of unlabeled instances. In Mallet, the training data is stored in an InstanceList, as is the test data.
Attribute-value table Each row corresponds to an instance. Each column corresponds to a feature. A feature type (a.k.a. a feature template): w_{-1} A feature: w_{-1}=book Binary features vs. non-binary features
Attribute-value table
         f_1     f_2     f_3     …       f_K     Target
  d_1    yes     1       no      …       -1000   c_2
  d_2
  d_3
  …
  d_n
Feature sequence vs. Feature vector Feature sequence: a (featName, featValue) list for features that are present. Feature Vector: a (featName, featValue) list for all the features. Representing data x as a feature vector.
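A minimal Java sketch (illustrative) of the two representations described above: a sparse (featName, featValue) list for the features that are present, and the corresponding full feature vector over a global feature set. The feature names are hypothetical.

    import java.util.*;

    public class FeatureVectorSketch {
        public static void main(String[] args) {
            // Feature sequence: (featName, featValue) pairs for the features that are present.
            Map<String, Double> sparse = new LinkedHashMap<>();
            sparse.put("w-1=book", 1.0);
            sparse.put("suffix=ing", 1.0);

            // Feature vector: one position per feature in the global feature set,
            // with 0 for the features that are absent from this instance.
            List<String> allFeatures =
                Arrays.asList("w-1=book", "w-1=the", "suffix=ing", "suffix=ed");
            double[] dense = new double[allFeatures.size()];
            for (int k = 0; k < allFeatures.size(); k++) {
                dense[k] = sparse.getOrDefault(allFeatures.get(k), 0.0);
            }
            System.out.println(Arrays.toString(dense));  // [1.0, 0.0, 1.0, 0.0]
        }
    }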
Data/Input → a feature vector Example: –Task: text classification –Original x: a document –Feature vector: bag-of-words approach In Mallet, the process is handled by a sequence of pipes: –Tokenization –Lowercasing –Merging the counts –… (see the sketch below)
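A conceptual Java sketch of the bag-of-words conversion (tokenize, lowercase, merge counts). It does not use Mallet's actual pipe classes; it just mirrors the steps listed above.

    import java.util.*;

    public class BagOfWords {
        // Conceptual version of the pipeline: tokenize, lowercase, merge the counts.
        static Map<String, Integer> toFeatureVector(String document) {
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String token : document.toLowerCase().split("\\s+")) {  // tokenize + lowercase
                if (token.isEmpty()) continue;
                counts.merge(token, 1, Integer::sum);                    // merge the counts
            }
            return counts;
        }

        public static void main(String[] args) {
            System.out.println(toFeatureVector("The cat saw the dog"));
            // {the=2, cat=1, saw=1, dog=1}
        }
    }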
Classifier and decision matrix A classifier is a function f: f(x) = {(c_i, score_i)}. It fills out a decision matrix. {(c_i, score_i)} is called a Classification in Mallet.
         d_1     d_2     d_3     …
  c_1    …
  c_2    …
  c_3    …
Trainer (a.k.a. learner) A trainer is a function that takes an InstanceList as input and outputs a classifier. Training stage: –Classifier train(instanceList); Test stage: –Classification classify(instance);
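A hedged sketch of the training/test interface described above. The package and class names (cc.mallet.*, SerialPipes, NaiveBayesTrainer, etc.) follow a recent public Mallet release and may differ from the version used in class; treat it as illustrative, not as the course's required code.

    // Illustrative only; class/package names may differ across Mallet versions.
    import cc.mallet.classify.*;
    import cc.mallet.pipe.*;
    import cc.mallet.types.*;

    public class TrainAndClassify {
        public static void main(String[] args) {
            // One pipe sequence shared by the training and test data.
            Pipe pipe = new SerialPipes(new Pipe[] {
                new Target2Label(),                     // class name -> Label
                new CharSequence2TokenSequence(),       // tokenization
                new TokenSequenceLowercase(),           // lowercasing
                new TokenSequence2FeatureSequence(),    // tokens -> feature indices
                new FeatureSequence2FeatureVector() }); // merge counts into a feature vector

            InstanceList training = new InstanceList(pipe);
            training.addThruPipe(new Instance("the game last night was great", "sports", "d1", null));
            training.addThruPipe(new Instance("the senate passed the bill", "politics", "d2", null));

            // Training stage: Classifier train(instanceList)
            Classifier classifier = new NaiveBayesTrainer().train(training);

            // Test stage: Classification classify(instance)
            InstanceList test = new InstanceList(pipe);
            test.addThruPipe(new Instance("another game tonight", "sports", "d3", null));
            Classification c = classifier.classify(test.get(0));
            System.out.println(c.getLabeling().getBestLabel());
        }
    }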
Important concepts (summary) Instance, InstanceList Labeled data, unlabeled data Training data, test data Feature, feature template Feature vector Attribute-value table Trainer, classifier Training stage, test stage
Steps for solving an NLP task with classifiers Convert the task into a classification problem (optional) Split data into training/test/validation Convert the data into attribute-value table Training Decoding Evaluation
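To make these steps concrete, here is a self-contained Java toy (added for illustration; not one of the course's classifiers): it splits a few made-up labeled documents into training and test sets, builds bag-of-words counts, trains a deliberately simple word-overlap "classifier", decodes the test set, and reports accuracy.

    import java.util.*;

    public class ToyPipeline {
        public static void main(String[] args) {
            // Split data into training and test (toy labeled data: (document, class) pairs).
            String[][] labeled = {
                {"the game last night", "sports"}, {"great game today", "sports"},
                {"the senate passed the bill", "politics"}, {"the bill failed", "politics"},
                {"another game tonight", "sports"}, {"a new bill in the senate", "politics"} };
            List<String[]> training = Arrays.asList(labeled).subList(0, 4);
            List<String[]> test = Arrays.asList(labeled).subList(4, 6);

            // Convert the data into an attribute-value representation (bag-of-words counts)
            // and "train" by accumulating per-class word counts.
            Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
            for (String[] ex : training) {
                Map<String, Integer> counts =
                    wordCounts.computeIfAbsent(ex[1], k -> new HashMap<>());
                for (String w : ex[0].toLowerCase().split("\\s+")) counts.merge(w, 1, Integer::sum);
            }

            // Decoding: pick the class whose training words overlap most with the document.
            // Evaluation: accuracy on the test set.
            int correct = 0;
            for (String[] ex : test) {
                String best = null; int bestScore = -1;
                for (Map.Entry<String, Map<String, Integer>> e : wordCounts.entrySet()) {
                    int score = 0;
                    for (String w : ex[0].toLowerCase().split("\\s+"))
                        score += e.getValue().getOrDefault(w, 0);
                    if (score > bestScore) { bestScore = score; best = e.getKey(); }
                }
                if (best.equals(ex[1])) correct++;
            }
            System.out.println("accuracy = " + (double) correct / test.size());
        }
    }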
Important subtasks (for you) Converting the data into an attribute-value table –Define feature types –Feature selection –Convert an instance into a feature vector Understanding the training/decoding procedures of the various algorithms.
Notation
                  Classification in general     Text categorization
  Input/data      x_i                           d_i
  Target/label    y_i                           c_i
  Features        f_k                           t_k (term)
  …               …                             …
Questions for “Concepts in a classification task”?
Summary Course overview Mathematical foundation (M&S Ch 2) –Probability theory –Information theory Basic concepts in the classification task
Downloading Hw1 Mallet Guide Homework Guide
Coming up Next Tuesday: –Mallet tutorial on 1/8 (Tues): 10:30-11:30am at LLC 109. –Classification algorithm overview and Naïve Bayes: read the paper beforehand. Next Thursday: –kNN and Rocchio: read the other paper Hw1 is due at 11pm on 1/13
Additional slides
An example 570/571: –POS tagging: HMM –Parsing: PCFG –MT: Model 1-4 training 572: –HMM: forward-backward algorithm –PCFG: inside-outside algorithm –MT: EM algorithm All are special cases of the EM algorithm, a method for unsupervised learning.
Proof: Relative entropy is always non-negative Since -log is convex, Jensen's inequality gives: KL(p || q) = Σ_x p(x) log (p(x)/q(x)) = -Σ_x p(x) log (q(x)/p(x)) ≥ -log Σ_x p(x) (q(x)/p(x)) = -log Σ_x q(x) ≥ -log 1 = 0
Entropy of a language The entropy of a language L: H(L) = -lim_{n→∞} (1/n) Σ_{x_1n} p(x_1n) log p(x_1n) If we make certain assumptions that the language is “nice” (stationary and ergodic), then the entropy can be calculated as: H(L) = -lim_{n→∞} (1/n) log p(x_1n)
Cross entropy of a language The cross entropy of a language L with respect to a model m: H(L, m) = -lim_{n→∞} (1/n) Σ_{x_1n} p(x_1n) log m(x_1n) If we make certain assumptions that the language is “nice”, then the cross entropy can be calculated as: H(L, m) = -lim_{n→∞} (1/n) log m(x_1n)
Conditional entropy H(Y | X) = Σ_x p(x) H(Y | X = x) = -Σ_x Σ_y p(x, y) log_2 p(y | x) = H(X, Y) - H(X)