Final review LING 572 Fei Xia 03/07/06

Misc Parts 3 and 4 were due at 6am today. Presentation: email me the slides by 6am on 3/9. Final report: email me by 6am on 3/14. Group meetings: 1:30-4:00pm on 3/16.

Outline Main topics Applying to NLP tasks Tricks

Main topics

Supervised learning –Decision tree –Decision list –TBL –MaxEnt –Boosting Semi-supervised learning –Self-training –Co-training –EM –Co-EM

Main topics (cont) Unsupervised learning –The EM algorithm –The EM algorithm for PM models Forward-backward Inside-outside IBM models for MT Others –Two dynamic models: FSA and HMM –Re-sampling: bootstrap –System combination –Bagging

Main topics (cont) Homework –Hw1: FSA and HMM –Hw2: DT, DL, CNF, DNF, and TBL –Hw3: Boosting Project: –P1: Trigram (learn to use Carmel, relation between HMM and FSA) –P2: TBL –P3: MaxEnt –P4: Bagging, boosting, system combination, SSL

Supervised learning

A classification problem

District   House type      Income   Previous Customer   Outcome
Suburban   Detached        High     No                  Nothing
Suburban   Semi-detached   High     Yes                 Respond
Rural      Semi-detached   Low      No                  Respond
Urban      Detached        Low      Yes                 Nothing
…

Classification and estimation problems Given –x: input attributes –y: the goal –training data: a set of (x, y) pairs Predict y given a new x: –y is a discrete variable → classification problem –y is a continuous variable → estimation problem

Five ML methods Decision tree Decision list TBL Boosting MaxEnt

Decision tree Modeling: tree representation Training: top-down induction, greedy algorithm Decoding: find the path from root to a leaf node, where the tests along the path are satisfied.
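
As a small illustration of the greedy, top-down induction above, here is a minimal sketch (in Python, with hypothetical helper names; not the course code) of choosing the best split by information gain:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y) over a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attr):
    """Reduction in label entropy from splitting on one attribute.
    Each example is a (feature_dict, label) pair."""
    labels = [y for _, y in examples]
    before = entropy(labels)
    after = 0.0
    for value in set(x[attr] for x, _ in examples):
        subset = [y for x, y in examples if x[attr] == value]
        after += (len(subset) / len(examples)) * entropy(subset)
    return before - after

# Greedy top-down induction picks the attribute with the highest gain:
# best = max(attributes, key=lambda a: information_gain(examples, a))
```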

Decision tree (cont) Main algorithms: ID3, C4.5, CART Strengths: –Ability to generate understandable rules –Ability to clearly indicate the best attributes Weaknesses: –Data splitting –Trouble with non-rectangular regions –The instability of top-down induction → bagging

Decision list Modeling: a list of decision rules Training: greedy, iterative algorithm Decoding: find the 1 st rule that applies Each decision is based on a single piece of evidence, in contrast to MaxEnt, boosting, TBL

TBL Modeling: a list of transformations (similar to decision rules) Training: –Greedy, iterative algorithm –The concept of current state Decoding: apply every transformation to the data
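
A minimal sketch of TBL decoding, assuming each learned transformation is simplified to a hypothetical (trigger, old_label, new_label) triple applied over the current labeling:

```python
def tbl_decode(words, initial_labels, transformations):
    """Start from an initial labeling (the 'current state') and apply every
    learned transformation to the data, in the order it was learned."""
    labels = list(initial_labels)
    for trigger, old_label, new_label in transformations:
        for i in range(len(words)):
            # rewrite a position only if its current label matches the
            # transformation's source label and the trigger context holds
            if labels[i] == old_label and trigger(words, labels, i):
                labels[i] = new_label
    return labels
```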

TBL (cont) Strengths: –Minimizing the error rate directly –Ability to handle non-classification problems Dynamic problems: POS tagging Non-classification problems: parsing Weaknesses: –Transformations are hard to interpret, as they interact with one another –Probabilistic TBL: TBL-DT

Boosting [Diagram: the training sample is repeatedly reweighted; the base learner (ML) produces weak classifiers f_1, f_2, …, f_T, which are combined into the final classifier f.]

Boosting (cont) Modeling: combining a set of weak classifiers to produce a powerful committee. Training: learn one classifier at each iteration Decoding: use the weighted majority vote of the weak classifiers
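
A compact AdaBoost-style sketch of the training loop and the weighted vote, assuming binary labels in {-1, +1} and some base_learner function (illustrative, not the Hw3 code):

```python
import math

def adaboost(examples, base_learner, T):
    """examples: list of (x, y) with y in {-1, +1}.
    base_learner(examples, weights) -> weak classifier h, with h(x) in {-1, +1}."""
    n = len(examples)
    weights = [1.0 / n] * n
    committee = []                              # list of (alpha, h) pairs
    for _ in range(T):
        h = base_learner(examples, weights)
        err = sum(w for w, (x, y) in zip(weights, examples) if h(x) != y)
        if err == 0.0 or err >= 0.5:            # stop if the weak learner fails
            break
        alpha = 0.5 * math.log((1 - err) / err)
        committee.append((alpha, h))
        # increase the weight of misclassified examples, then renormalize
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, (x, y) in zip(weights, examples)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return committee

def boost_decode(committee, x):
    """Weighted majority vote of the weak classifiers."""
    score = sum(alpha * h(x) for alpha, h in committee)
    return 1 if score >= 0 else -1
```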

Boosting (cont) Strengths –It comes with a set of theoretical guarantees (e.g., on training error and test error). –It only needs to find weak classifiers. Weaknesses: –It is susceptible to noise. –The actual performance depends on the data and the base learner.

MaxEnt The task: find p* s.t. p* = argmax_{p ∈ P} H(p), where P = {p : E_p[f_j] = E_{p̃}[f_j] for j = 1, …, k}. If p* exists, it has an exponential (log-linear) form.

MaxEnt (cont) If p* exists, then p*(y | x) = exp(Σ_j λ_j f_j(x, y)) / Z(x), where Z(x) = Σ_{y'} exp(Σ_j λ_j f_j(x, y')).
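
A minimal sketch of computing p(y | x) under the exponential form above, assuming a list of feature functions f_j(x, y) and learned weights λ_j (GIS/IIS training itself is omitted):

```python
import math

def maxent_prob(x, y, classes, features, weights):
    """p(y | x) = exp(sum_j lambda_j f_j(x, y)) / Z(x).
    features: list of functions f_j(x, label) returning 0 or 1;
    weights: the corresponding lambda_j values."""
    def score(label):
        return sum(w * f(x, label) for w, f in zip(weights, features))
    z = sum(math.exp(score(label)) for label in classes)   # normalizer Z(x)
    return math.exp(score(y)) / z
```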

MaxEnt (cont) Training: GIS, IIS Feature selection: –Greedy algorithm –Select one (or more) at a time In general, MaxEnt achieves good performance on many NLP tasks.

Common issues Objective function / Quality measure: –DT, DL: e.g., information gain –TBL, Boosting: minimize training errors –MaxEnt: maximize entropy while satisfying constraints

Common issues (cont) Avoiding overfitting –Use development data –Two strategies: stop early, or post-prune

Common issues (cont) Missing attribute values: –Assume a “blank” value –Assign the most common value among all “similar” examples in the training data –(DL, DT): Assign a fraction of the example to each possible class Continuous-valued attributes –Choose thresholds by checking the training data

Common issues (cont) Attributes with different costs –DT: Change the quality measure to include the costs Continuous-valued goal attribute –DT, DL: each “leaf” node is marked with a real value or a linear function –TBL, MaxEnt, Boosting: ??

Comparison of supervised learners

                | DT         | DL                    | TBL                             | Boosting                     | MaxEnt
Probabilistic   | PDT        | PDL                   | TBL-DT                          | Confidence                   | Y
Parametric      | N          | N                     | N                               | N                            | Y
Representation  | Tree       | Ordered list of rules | Ordered list of transformations | List of weighted classifiers | List of weighted features
Each iteration  | Attribute  | Rule                  | Transformation                  | Classifier & weight          | Feature & weight
Data processing | Split data | Split data*           | Change cur_y                    | Reweight (x, y)              | None
Decoding        | Path       | 1st rule              | Sequence of rules               | Calc f(x)                    |

Semi-supervised Learning

Semi-supervised learning Each learning method makes some assumptions about the problem. SSL works when those assumptions are satisfied. SSL could degrade the performance when mistakes reinforce themselves.

SSL (cont) We have covered four methods: self-training, co-training, EM, co-EM

Co-training The original paper: (Blum and Mitchell, 1998) –Two “independent” views: split the features into two sets. –Train a classifier on each view. –Each classifier labels data that can be used to train the other classifier. Extension: –Relax the conditional independence assumptions –Instead of using two views, use two or more classifiers trained on the whole feature set.
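
A schematic co-training loop, assuming two feature views and hypothetical train/predict interfaces (the names are illustrative, not Blum and Mitchell's setup):

```python
def co_train(labeled, unlabeled, view1, view2, train, predict, rounds=10, k=5):
    """labeled: list of (x, y); view1/view2 project x onto one feature set.
    train(data) -> classifier; predict(clf, features) -> (label, confidence)."""
    l1 = [(view1(x), y) for x, y in labeled]
    l2 = [(view2(x), y) for x, y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        c1, c2 = train(l1), train(l2)
        preds1 = [predict(c1, view1(x)) for x in pool]
        preds2 = [predict(c2, view2(x)) for x in pool]
        # each classifier picks the k unlabeled examples it is most confident
        # about; those examples, with the guessed labels, train the other view
        top1 = sorted(range(len(pool)), key=lambda i: -preds1[i][1])[:k]
        top2 = sorted(range(len(pool)), key=lambda i: -preds2[i][1])[:k]
        l2 += [(view2(pool[i]), preds1[i][0]) for i in top1]
        l1 += [(view1(pool[i]), preds2[i][0]) for i in top2]
        used = set(top1) | set(top2)
        pool = [x for i, x in enumerate(pool) if i not in used]
        if not pool:
            break
    return train(l1), train(l2)
```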

Unsupervised learning

EM is a method of estimating parameters in the MLE framework. It finds a sequence of parameters that improve the likelihood of the training data.

The EM algorithm Start with an initial estimate θ^0 Repeat until convergence –E-step: calculate Q(θ | θ^t) = E_{z | x, θ^t}[ log p(x, z | θ) ] –M-step: find θ^{t+1} = argmax_θ Q(θ | θ^t)
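
A small worked instance of these two steps, assuming a 1-D mixture of two Gaussians with equal mixing weights and shared variance, where only the two means are re-estimated:

```python
import math

def em_two_gaussians(data, mu1, mu2, sigma=1.0, iterations=20):
    """EM for the 1-D mixture 0.5*N(mu1, sigma^2) + 0.5*N(mu2, sigma^2),
    re-estimating only the two means."""
    def density(x, mu):
        # the shared constant factor cancels in the E-step ratio
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2)
    for _ in range(iterations):
        # E-step: posterior responsibility of component 1 for each point
        r = [density(x, mu1) / (density(x, mu1) + density(x, mu2)) for x in data]
        # M-step: means become responsibility-weighted averages of the data
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return mu1, mu2

# em_two_gaussians([0.1, -0.2, 0.3, 4.9, 5.2, 5.1], mu1=0.0, mu2=1.0)
```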

The EM algorithm (cont) The optimal solution for the M-step exists for many classes of problems → a number of well-known methods are special cases of EM. The EM algorithm for PM models –Forward-backward algorithm –Inside-outside algorithm –…

Other topics

FSA and HMM Two types of HMMs: –State-emission and arc-emission HMMs –They are equivalent We can convert an HMM into a WFA Modeling: Markov assumption Training: –Supervised: counting –Unsupervised: forward-backward algorithm Decoding: Viterbi algorithm
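
A minimal Viterbi sketch, assuming dense dictionaries for the start, transition, and emission probabilities (not the Carmel/WFA format used in the projects):

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Find the most likely state sequence for an observation sequence.
    start_p[s], trans_p[s][t], and emit_p[s][o] are probabilities."""
    # delta[s] = score of the best path ending in state s at the current position
    delta = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_delta, back = {}, {}
        for t in states:
            prev, score = max(((s, delta[s] * trans_p[s][t]) for s in states),
                              key=lambda pair: pair[1])
            new_delta[t] = score * emit_p[t][obs]
            back[t] = prev
        delta = new_delta
        backpointers.append(back)
    # follow the back pointers from the best final state
    best = max(states, key=lambda s: delta[s])
    path = [best]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    return list(reversed(path)), delta[best]
```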

Bootstrap [Diagram: the original sample is resampled into B bootstrap samples; the learner (ML) is trained on each to produce f_1, f_2, …, f_B, which are combined into f.]

Bootstrap (cont) A method of re-sampling: –One original sample → B bootstrap samples It has a strong mathematical background. It is a method for estimating standard errors, bias, and so on.
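
A minimal sketch of estimating a statistic's standard error by bootstrap re-sampling (illustrative, not tied to the assignments):

```python
import random
import statistics

def bootstrap_standard_error(sample, statistic, B=1000, seed=0):
    """Draw B bootstrap samples (with replacement, same size as the original
    sample) and return the standard deviation of the statistic across them."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(B):
        resample = [rng.choice(sample) for _ in sample]
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

# bootstrap_standard_error([2.1, 3.5, 2.9, 4.0, 3.3], statistics.mean)
```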

System combination [Diagram: learners ML_1, ML_2, …, ML_B produce classifiers f_1, f_2, …, f_B, which are combined into a single classifier f.]

System combination (cont) Hybridization: combine substructures to produce a new one. –Voting –Naïve Bayes Switching: choose one of the f_i(x) –Similarity switching –Naïve Bayes

Bagging = bootstrap + system combination [Diagram: bootstrap samples are fed to the learner (ML) to produce f_1, f_2, …, f_B, which are combined into f.]

Bagging (cont) It is effective for unstable learning methods: –Decision tree –Regression tree –Neural network It does not help stable learning methods: –K-nearest neighbors
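
A bagging sketch combining bootstrap re-sampling with majority-vote system combination, assuming a hypothetical train(data) function that returns a callable classifier:

```python
import random
from collections import Counter

def bagging(train, data, B=25, seed=0):
    """Train B classifiers, one per bootstrap sample of the training data."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(B):
        resample = [rng.choice(data) for _ in data]
        classifiers.append(train(resample))
    return classifiers

def bagging_predict(classifiers, x):
    """System combination by a simple (unweighted) majority vote."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```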

Relations

WFSA and HMM; DL, DT, TBL; EM and EM for PM models

WFSA and HMM To convert an HMM into a WFSA: add a “Start” state and a transition from “Start” to every state in the HMM, and add a “Finish” state and a transition from every state in the HMM to “Finish”. [Diagram: Start → HMM states → Finish]

DT, DL, CNF, DNF, TBL [Diagram: relations among k-CNF, k-DNF, k-DT, k-DL, and k-TBL]

The EM algorithm [Diagram: relations among the generalized EM algorithm, the EM algorithm, and EM for PM models; special cases include Gaussian mixtures, the inside-outside algorithm, the forward-backward algorithm, and the IBM models.]

Solving a NLP problem

Issues Modeling: represent the problem as a formula and decompose the formula into a function of parameters Training: estimate the model parameters Decoding: find the best answer given the parameters Other issues: –Preprocessing –Postprocessing –Evaluation –…

Modeling Generative vs. discriminative models Introducing hidden variables The order of decomposition

Modeling (cont) Approximation / assumptions Final formulae and types of parameters

Modeling (cont) Using classifiers for non-classification problems –POS tagging –Chunking –Parsing

Training Objective functions: –Maximize likelihood: EM –Minimize error rate: TBL –Maximum entropy: MaxEnt –…. Supervised, semi-supervised, unsupervised: –Ex: Maximize likelihood Supervised: simple counting Unsupervised: EM

Training (cont) At each iteration: –Choose one attribute / rule / weight / … at a time, and never change it later: DT, DL, TBL –Update all the parameters at each iteration: EM Choosing “untrained” parameters (e.g., thresholds): use development data. –e.g., the minimal “gain” required to continue iterating

Decoding Dynamic programming: –CYK for PCFG –Viterbi for HMM Dynamic problem: –Decode from left to right –Features only look at the left context –Keep top-N hypotheses at each position
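
A schematic beam-search decoder for sequence labeling, assuming a hypothetical score(words, i, history, tag) function that only looks at the left context:

```python
def beam_search(words, tagset, score, beam_size=5):
    """Decode from left to right, keeping the top-N tag-sequence hypotheses
    at each position. score(words, i, history, tag) returns a log-probability."""
    beam = [([], 0.0)]                       # (tag history, cumulative score)
    for i in range(len(words)):
        candidates = []
        for history, total in beam:
            for tag in tagset:
                candidates.append((history + [tag],
                                   total + score(words, i, history, tag)))
        # prune to the best beam_size hypotheses before moving right
        beam = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
    return beam[0]                           # best (tag sequence, score)
```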

Preprocessing Sentence segmentation Sentence alignment (for MT) Tokenization Morphing POS tagging …

Post-processing System combination Casing (MT) …

Evaluation Use standard training/test data if possible. Choose appropriate evaluation measures: –WSD: for what applications? –Word alignment: F-measure vs. AER. How does it affect MT results? –Parsing: F-measure vs. dependency link accuracy

Tricks

Algebra Probability Optimization Programming

Algebra The order of sums: Σ_i Σ_j a_{ij} = Σ_j Σ_i a_{ij} Pulling out constants: Σ_i c a_i = c Σ_i a_i

Algebra (cont) The order of sums and products: Σ_{x_1} ⋯ Σ_{x_n} Π_i f_i(x_i) = Π_i Σ_{x_i} f_i(x_i) The order of log and product / sum: log Π_i a_i = Σ_i log a_i, but log Σ_i a_i ≠ Σ_i log a_i

Probability Introducing a new random variable: P(x) = Σ_y P(x, y) The order of decomposition: P(x, y) = P(x) P(y | x) = P(y) P(x | y)

More general cases

Probability (cont) Source-channel model: ŷ = argmax_y P(y | x) = argmax_y P(x | y) P(y) Bayes Rule: P(y | x) = P(x | y) P(y) / P(x)

Probability (cont) Normalization: Σ_y P(y | x) = 1 Jensen’s inequality: for a concave function f such as log, f(E[X]) ≥ E[f(X)], e.g., log Σ_i λ_i x_i ≥ Σ_i λ_i log x_i when λ_i ≥ 0 and Σ_i λ_i = 1

Optimization When there is no analytical solution, use an iterative approach. If the optimal solution to g(x) is hard to find, look for the optimal solution to a (tight) lower bound of g(x).
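
One concrete instance of optimizing a lower bound is the EM objective, using Jensen's inequality from the previous slide (q is any distribution over the hidden variable z):

```latex
\log p(x \mid \theta)
  = \log \sum_z q(z)\,\frac{p(x, z \mid \theta)}{q(z)}
  \;\ge\; \sum_z q(z)\,\log \frac{p(x, z \mid \theta)}{q(z)}
```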

Optimization (cont) Using Lagrange multipliers: Constrained problem: maximize f(x) subject to the constraint g(x) = 0 Unconstrained problem: maximize f(x) – λ g(x) Take first derivatives to find the stationary points.
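
A small worked example of this recipe, assuming the single constraint Σ_i p_i = 1: maximizing entropy over a finite distribution yields the uniform distribution.

```latex
\max_p \; -\sum_i p_i \log p_i \;-\; \lambda\Big(\sum_i p_i - 1\Big)
\qquad
\frac{\partial}{\partial p_i}:\; -\log p_i - 1 - \lambda = 0
\;\Rightarrow\; p_i = e^{-1-\lambda}\ \text{(the same for all } i\text{)}
\;\Rightarrow\; p_i = \tfrac{1}{N}
```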

Programming Using/creating a good package: –Tutorial, sample data, well-written code –Multiple levels of code Core ML algorithm: e.g., TBL Wrapper for a task: e.g., POS tagger Wrapper to deal with input, output, etc.

Programming (cont) Good practice: –Write notes and create wrappers (all the commands should be stored in the notes, or even better in a wrapper script) –Use standard directory structures: src/, include/, exec/, bin/, obj/, docs/, sample/, data/, result/ –Give meaningful filenames to important code: e.g., build_trigram_tagger.pl rather than aaa100.exec –Give meaningful function and variable names –Don’t use global variables

Final words We have covered a lot of topics. It takes time to digest them, but at least we understand the basic concepts. The next step: applying them to real applications.