Introduction LING 572 Fei Xia Week 1: 1/4/06

Outline Course overview Mathematical foundation: (Prereq) –Probability theory –Information theory Basic concepts in the classification task

Course overview

General info Course url: –Syllabus (incl. slides, assignments, and papers): updated every week. –Message board –ESubmit Slides: –I will try to put the slides online before class. –“Additional slides” are not required and not covered in class.

Office hour Fei: –Email: Subject line should include “ling572”. The 48-hour rule applies. –Office hour: Time: Fri 10-11:20am Location: Padelford A-210G

Lab session Bill McNeil –Email: –Lab session: what time is good for you? Explaining homework and solutions Mallet-related questions Reviewing class material  I highly recommend that you attend the lab sessions, especially the first few sessions.

Time for Lab Session Time: –Monday: 10:00am - 12:20pm, or –Tues: 10:30 am - 11:30 am, or –?? Location: ?? → Thursday 3-4pm, MGH 271?

Misc Ling572 Mailing list: EPost Mallet developer mailing list:

Prerequisites Ling570 –Some basic algorithms: FSA, HMM, –NLP tasks: tokenization, POS tagging, …. Programming: If you don’t know Java well, talk to me. –Java: Mallet Basic concepts in probability and statistics –Ex: random variables, chain rule, Gaussian distribution, …. Basic concepts in Information Theory: –Ex: entropy, relative entropy, …

Expectations Reading: –Papers are online: –Reference book: Manning & Schutze (MS) –Finish reading papers before class  I will ask you questions.

Grades Assignments (9 parts): 90% –Programming language: Java Class participation: 10% No quizzes, no final exams No “incomplete” unless you can prove your case.

Course objectives Covering basic statistical methods that produce state-of-the-art results Focusing on classification algorithms Touching on unsupervised and semi-supervised algorithms Some material is not easy. We will focus on applications, not theoretical proofs.

Course layout Supervised methods –Classification algorithms: Individual classifiers: –Naïve Bayes –kNN and Rocchio –Decision tree –Decision list: ?? –Maximum Entropy (MaxEnt) Classifier ensemble: –Bagging –Boosting –System combination

Course layout (cont) Supervised algorithms (cont) –Sequence labeling algorithms: Transformation-based learning (TBL) FST, HMM, … Semi-supervised methods –Self-training –Co-training

Course layout (cont) Unsupervised methods –EM algorithm Forward-backward algorithm Inside-outside algorithm …

Questions for each method Modeling: –What is the model? –How does the decomposition work? –What kind of assumption is made? –How many types of model parameters? –How many “internal” (or non-model) parameters? –How to handle the multi-class problem? –How to handle non-binary features? –…

Questions for each method (cont) Training: how to estimate parameters? Decoding: how to find the “best” solution? Weaknesses and strengths? –Is the algorithm robust (e.g., in handling outliers)? scalable? prone to overfitting? efficient in training time? In test time? –How much data is needed? Labeled data Unlabeled data

Relation between 570/571 and 572 570/571 are organized by tasks; 572 is organized by learning methods. 572 focuses on statistical methods.

NLP tasks covered in Ling570 Tokenization Morphological analysis POS tagging Shallow parsing WSD NE tagging

NLP tasks covered in Ling571 Parsing Semantics Discourse Dialogue Natural language generation (NLG) …

An ML method for multiple NLP tasks Task (570/571): –Tokenization –POS tagging –Parsing –Reference resolution –… Method (572): –MaxEnt

Multiple methods for one NLP task Task (570/571): POS tagging Method (572): –Decision tree –MaxEnt –Boosting –Bagging –….

Projects: Task 1 Text Classification Task: 20 groups –P1: First look at the Mallet package –P2: Your first tui class Naïve Bayes –P3: Feature selection Decision Tree –P4: Bagging Boosting Individual project

Projects: Task 2 Sequence labeling task: IGT detection –P5: MaxEnt –P6: Beam Search –P7: TBA –P8: Presentation: final class –P9: Final report Group project (?)

Both projects Use Mallet, a Java package Two types of work: –Reading code to understand ML methods –Writing code to solve problems

Feedback on assignments “Misc” section in each assignment –How long did it take to finish the homework? –Which part was difficult? –…

Mallet overview It is a Java package that includes many –classifiers, –sequence labeling algorithms, –optimization algorithms, –useful data classes, –… You should –read “Mallet Guides” –attend the Mallet tutorial: next Tuesday 10:30-11:30am: LLC109 –start on Hw1 I will use Mallet class/method names if possible.

Questions for “course overview”?

Outline Course overview Mathematical foundation –Probability theory –Information theory Basic concepts in the classification task

Probability Theory

Basic concepts Sample space, event, event space Random variable and random vector Conditional probability, joint probability, marginal probability (prior)

Sample space, event, event space Sample space (Ω): a collection of basic outcomes. –Ex: toss a coin twice: {HH, HT, TH, TT} Event: an event is a subset of Ω. –Ex: {HT, TH} Event space (2^Ω): the set of all possible events.

Random variable The outcome of an experiment need not be a number. We often want to represent outcomes as numbers. A random variable X is a function: Ω → R. –Ex: toss a coin twice: X(HH)=0, X(HT)=1, …

Two types of random variables Discrete: X takes on only a countable number of possible values. –Ex: Toss a coin 10 times. X is the number of tails that are noted. Continuous: X takes on an uncountable number of possible values. –Ex: X is the lifetime (in hours) of a light bulb.

Probability function The probability function of a discrete variable X is a function which gives the probability p(x_i) that X equals x_i: a.k.a. p(x_i) = p(X=x_i).

Random vector A random vector is a finite-dimensional vector of random variables: X = [X_1, …, X_k]. P(x) = P(x_1, x_2, …, x_n) = P(X_1=x_1, …, X_n=x_n) Ex: P(w_1, …, w_n, t_1, …, t_n)

Three types of probability Joint prob: P(x,y)= prob of x and y happening together Conditional prob: P(x|y) = prob of x given a specific value of y Marginal prob: P(x) = prob of x for all possible values of y

Common tricks (I): Marginal prob → joint prob
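The formula for this trick is not in the transcript; the standard identity (sum the joint probability over all values of the other variable) is:
P(x) = Σ_y P(x, y)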

Common tricks (II): Chain rule
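The chain-rule formula itself is missing from the transcript; its standard form is:
P(x_1, x_2, …, x_n) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) … P(x_n | x_1, …, x_{n-1})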

Common tricks (III): Bayes rule
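The formula was likewise lost in the transcript; Bayes rule in its usual form is:
P(x | y) = P(y | x) P(x) / P(y)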

Common tricks (IV): Independence assumption
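The slide's formula is not in the transcript. One standard statement: X and Y are independent iff P(x, y) = P(x) P(y), equivalently P(x | y) = P(x). A typical use in this course's classifiers (e.g., Naïve Bayes) is the conditional-independence approximation
P(x_1, …, x_n | c) ≈ Π_i P(x_i | c)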

Prior and Posterior distribution Prior distribution: P(θ), a distribution over parameter values θ, set prior to observing any data. Posterior distribution: P(θ | data) It represents our belief that θ is true after observing the data. Likelihood of the model θ: P(data | θ) Relation among the three: Bayes Rule: P(θ | data) = P(data | θ) P(θ) / P(data)

Two ways of estimating θ Maximum likelihood (ML): θ* = arg max_θ P(data | θ) Maximum a posteriori (MAP): θ* = arg max_θ P(θ | data)

Information Theory

Information theory It is the use of probability theory to quantify and measure “information”. Basic concepts: –Entropy –Joint entropy and conditional entropy –Cross entropy and relative entropy –Mutual information and perplexity

Entropy Entropy is a measure of the uncertainty associated with a distribution. It is also the lower bound on the number of bits it takes to transmit messages. An example: –Display the results of horse races. –Goal: minimize the number of bits needed to encode the results.
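The defining formula does not appear in this transcript; the standard definition is:
H(X) = - Σ_x p(x) log_2 p(x)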

An example Uniform distribution: p_i = 1/8. Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64) (0, 10, 110, 1110, , , , )  The uniform distribution has higher entropy.  MaxEnt: make the distribution as “uniform” as possible.
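For reference, the two entropies (a standard calculation; under an optimal prefix code the four missing codewords would be the 6-bit ones, e.g. 111100 … 111111, but the exact codewords were lost in the transcript):
H_uniform = - Σ_{i=1..8} (1/8) log_2 (1/8) = 3 bits
H_non-uniform = (1/2)(1) + (1/4)(2) + (1/8)(3) + (1/16)(4) + 4 · (1/64)(6) = 2 bits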

Joint and conditional entropy Joint entropy: Conditional entropy:
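The two definitions are missing from the transcript; the standard ones are:
H(X, Y) = - Σ_x Σ_y p(x, y) log_2 p(x, y)
H(Y | X) = H(X, Y) - H(X)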

Cross Entropy Entropy: Cross Entropy: Cross entropy is a distance measure between p(x) and q(x): p(x) is the true probability; q(x) is our estimate of p(x).
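The formulas themselves are missing here; in standard form:
H(p) = - Σ_x p(x) log_2 p(x)
H(p, q) = - Σ_x p(x) log_2 q(x)
with H(p, q) ≥ H(p), and equality exactly when q = p.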

Relative Entropy Also called Kullback-Leibler divergence: Another “distance” measure between prob functions p and q. KL divergence is asymmetric (not a true distance):
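The missing formula, in its standard form:
D(p || q) = Σ_x p(x) log_2 [ p(x) / q(x) ] = H(p, q) - H(p)
and in general D(p || q) ≠ D(q || p).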

Mutual information It measures how much is in common between X and Y: I(X;Y)=KL(p(x,y)||p(x)p(y))
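Expanded (standard form, not preserved in the transcript):
I(X; Y) = Σ_x Σ_y p(x, y) log_2 [ p(x, y) / (p(x) p(y)) ] = H(X) - H(X | Y) = H(Y) - H(Y | X)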

Perplexity Perplexity is 2^H. Perplexity is the weighted average number of choices a random variable has to make. For example, a uniform distribution over 8 outcomes has H = 3 bits and perplexity 2^3 = 8.

Questions for “Mathematical foundation”?

Outline Course overview Mathematical foundation –Probability theory –Information theory Basic concepts in the classification task

Types of ML problems Classification problem Estimation problem Clustering Discovery …  A learning method can be applied to one or more types of ML problems.  We will focus on the classification problem.

Definition of classification problem Task: –C = {c_1, c_2, …, c_m} is a set of pre-defined classes (a.k.a. labels, categories). –D = {d_1, d_2, …} is a set of inputs to be classified. –A classifier is a function: D × C → {0, 1}. Multi-label vs. single-label –Single-label: for each d_i, only one class is assigned to it. Multi-class vs. binary classification problem –Binary: |C| = 2.

Conversion to single-label binary problem Multi-label → single-label –We will focus on the single-label problem. –A classifier D × C → {0, 1} becomes D → C. –More general definition: D × C → [0, 1] Multi-class → binary problem –Positive examples vs. negative examples

Examples of classification problems Text classification Document filtering Language/Author/Speaker id WSD PP attachment Automatic essay grading …

Problems that can be treated as a classification problem Tokenization / Word segmentation POS tagging NE detection NP chunking Parsing Reference resolution …

Labeled vs. unlabeled data Labeled data: –{(x_i, y_i)} is a set of labeled data. –x_i ∈ D: data/input, often represented as a feature vector. –y_i ∈ C: target/label Unlabeled data –{x_i} without y_i.

Instance, training and test data x_i, with or without y_i, is called an instance. Training data: a set of (labeled) instances. Test data: a set of unlabeled instances. The training data is stored in an InstanceList in Mallet, as is the test data.

Attribute-value table Each row corresponds to an instance. Each column corresponds to a feature. A feature type (a.k.a. a feature template): w_{-1} A feature: w_{-1}=book Binary feature vs. non-binary feature

Attribute-value table
        f_1    f_2    f_3    …    f_K    Target
d_1     yes    1      no     …    0      c_2
d_2
d_3
…
d_n

Feature sequence vs. Feature vector Feature sequence: a (featName, featValue) list for features that are present. Feature Vector: a (featName, featValue) list for all the features. Representing data x as a feature vector.

Data/Input → a feature vector Example: –Task: text classification –Original x: a document –Feature vector: bag-of-words approach In Mallet, the process is handled by a sequence of pipes: –Tokenization –Lowercase –Merging the counts –…
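A minimal sketch of such a pipe sequence, for illustration only: it is not from the course materials, and the class names follow the current cc.mallet package layout, whereas the Mallet release used at the time had different package names (edu.umass.cs.mallet.base.*).

import cc.mallet.pipe.*;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class PipeSketch {
    public static InstanceList buildInstances() {
        // A serial pipe: each stage transforms the output of the previous one.
        Pipe pipe = new SerialPipes(new Pipe[] {
            new Target2Label(),                  // map the class name (e.g., "positive") to a Label
            new CharSequence2TokenSequence(),    // tokenization
            new TokenSequenceLowercase(),        // lowercasing
            new TokenSequence2FeatureSequence(), // tokens -> feature indices
            new FeatureSequence2FeatureVector()  // merge the counts into a bag-of-words vector
        });
        InstanceList instances = new InstanceList(pipe);
        // Instance(data, target, name, source); the raw document is the data field.
        instances.addThruPipe(new Instance("The plot was thin but the acting was fine.", "positive", "doc1", null));
        return instances;
    }
}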

Classifier and decision matrix A classifier is a function f: f(x) = {(c_i, score_i)}. It fills out a decision matrix. {(c_i, score_i)} is called a Classification in Mallet.
        d_1    d_2    d_3    …
c_1     …      …      …
c_2     …      …      …
c_3     …      …      …

Trainer (a.k.a. Learner) A trainer is a function that takes an InstanceList as input and outputs a classifier. Training stage: –Classifier train(instanceList); Test stage: –Classification classify(instance);
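A correspondingly minimal sketch of the two stages, again only illustrative: NaiveBayesTrainer is just one possible trainer, and the package names are those of the current cc.mallet release rather than the version used in the course.

import cc.mallet.classify.Classification;
import cc.mallet.classify.Classifier;
import cc.mallet.classify.NaiveBayesTrainer;
import cc.mallet.types.InstanceList;

public class TrainerSketch {
    public static void run(InstanceList training, InstanceList test) {
        // Training stage: an InstanceList goes in, a Classifier comes out.
        Classifier classifier = new NaiveBayesTrainer().train(training);
        // Test stage: classifying one instance yields a Classification,
        // i.e., the (label, score) pairs for that instance's column of the decision matrix.
        Classification c = classifier.classify(test.get(0));
        System.out.println(c.getLabeling().getBestLabel());
    }
}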

Important concepts (summary) Instance, InstanceList Labeled data, unlabeled data Training data, test data Feature, feature template Feature vector Attribute-value table Trainer, classifier Training stage, test stage

Steps for solving an NLP task with classifiers Convert the task into a classification problem (optional) Split the data into training/test/validation Convert the data into an attribute-value table Training Decoding Evaluation

Important subtasks (for you) Converting the data into an attribute-value table –Define feature types –Feature selection –Convert an instance into a feature vector Understanding the training/decoding procedures of the various algorithms.

Notation
                 Classification in general    Text categorization
Input/data       x_i                          d_i
Target/label     y_i                          c_i
Features         f_k                          t_k (term)
…                …                            …

Questions for “Concepts in a classification task”?

Summary Course overview Mathematical foundation –Probability theory –Information theory (M&S Ch 2) Basic concepts in the classification task

Downloading Hw1 Mallet Guide Homework Guide

Coming up Next Tuesday: –Mallet tutorial on 1/8 (Tues): 10:30-11:30am at LLC 109. –Classification algorithm overview and Naïve Bayes: read the paper beforehand. Next Thursday: –kNN and Rocchio: read the other paper Hw1 is due at 11pm on 1/13

Additional slides

An example 570/571: –POS tagging: HMM –Parsing: PCFG –MT: Model 1-4 training 572: –HMM: forward-backward algorithm –PCFG: inside-outside algorithm –MT: EM algorithm  All are special cases of the EM algorithm, one method of unsupervised learning.

Proof: Relative entropy is always non-negative
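The proof itself is not in the transcript; a standard one-line argument uses Jensen's inequality (log is concave), with the sum over x such that p(x) > 0:
- D(p || q) = Σ_x p(x) log_2 [ q(x) / p(x) ] ≤ log_2 Σ_x p(x) · q(x)/p(x) = log_2 Σ_x q(x) ≤ log_2 1 = 0
hence D(p || q) ≥ 0, with equality iff p = q.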

Entropy of a language The entropy of a language L: If we make certain assumptions that the language is “nice”, then the entropy can be calculated as:
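The formulas referred to here are missing from the transcript; the standard ones (cf. M&S Ch 2) are:
H(L) = lim_{n→∞} (1/n) H(X_1, …, X_n) = lim_{n→∞} - (1/n) Σ_{x_1 … x_n} p(x_1, …, x_n) log_2 p(x_1, …, x_n)
and, for a “nice” (stationary, ergodic) language,
H(L) = lim_{n→∞} - (1/n) log_2 p(x_1, …, x_n)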

Cross entropy of a language The cross entropy of a language L: If we make certain assumptions that the language is “nice”, then the cross entropy can be calculated as:
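Similarly, with m the model used to approximate the true distribution p, the standard formulas (also missing from the transcript) are:
H(L, m) = lim_{n→∞} - (1/n) Σ_{x_1 … x_n} p(x_1, …, x_n) log_2 m(x_1, …, x_n)
and, under the same “nice” assumptions,
H(L, m) = lim_{n→∞} - (1/n) log_2 m(x_1, …, x_n)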

Conditional Entropy
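Only the title survives in the transcript; the standard definition is:
H(Y | X) = Σ_x p(x) H(Y | X = x) = - Σ_x Σ_y p(x, y) log_2 p(y | x), so that H(X, Y) = H(X) + H(Y | X).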