Final review
LING 572
Fei Xia
03/07/06
Misc
– Parts 3 and 4 were due at 6am today.
– Presentation: email me the slides by 6am on 3/9.
– Final report: email me by 6am on 3/14.
– Group meetings: 1:30-4:00pm on 3/16.
Outline
– Main topics
– Applying ML to NLP tasks
– Tricks
Main topics
Supervised learning
– Decision tree
– Decision list
– TBL
– MaxEnt
– Boosting

Semi-supervised learning
– Self-training
– Co-training
– EM
– Co-EM
Main topics (cont)

Unsupervised learning
– The EM algorithm
– The EM algorithm for PM models: forward-backward, inside-outside, IBM models for MT

Others
– Two dynamic models: FSA and HMM
– Re-sampling: bootstrap
– System combination
– Bagging
Main topics (cont)

Homework
– Hw1: FSA and HMM
– Hw2: DT, DL, CNF, DNF, and TBL
– Hw3: Boosting

Project
– P1: Trigram (learn to use Carmel; relation between HMM and FSA)
– P2: TBL
– P3: MaxEnt
– P4: Bagging, boosting, system combination, SSL
Supervised learning
A classification problem

District   House type     Income  Previous customer  Outcome
Suburban   Detached       High    No                 Nothing
Suburban   Semi-detached  High    Yes                Respond
Rural      Semi-detached  Low     No                 Respond
Urban      Detached       Low     Yes                Nothing
…
Classification and estimation problems

Given
– x: input attributes
– y: the goal
– training data: a set of (x, y) pairs

Predict y given a new x:
– y is a discrete variable → classification problem
– y is a continuous variable → estimation problem
Five ML methods
– Decision tree
– Decision list
– TBL
– Boosting
– MaxEnt
Decision tree
– Modeling: tree representation
– Training: top-down induction, greedy algorithm
– Decoding: find the path from the root to a leaf node where the tests along the path are satisfied
Decision tree (cont)
Main algorithms: ID3, C4.5, CART
Strengths:
– Ability to generate understandable rules
– Ability to clearly indicate the best attributes
Weaknesses:
– Data splitting
– Trouble with non-rectangular regions
– The instability of top-down induction → bagging
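As an illustration of the greedy induction step, here is a minimal sketch of picking the best attribute by information gain (the dictionary encoding and toy examples are hypothetical, not the course code):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y) over the empirical label distribution."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attr):
    """Gain(Y; A) = H(Y) - sum_v p(A=v) * H(Y | A=v)."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

# Toy data in the style of the customer table above.
data = [({"district": "Suburban", "income": "High"}, "Nothing"),
        ({"district": "Suburban", "income": "High"}, "Respond"),
        ({"district": "Rural", "income": "Low"}, "Respond"),
        ({"district": "Urban", "income": "Low"}, "Nothing")]

# The attribute that greedy top-down induction would split on first.
print(max(["district", "income"], key=lambda a: information_gain(data, a)))
```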
Decision list
– Modeling: a list of decision rules
– Training: greedy, iterative algorithm
– Decoding: find the 1st rule that applies
– Each decision is based on a single piece of evidence, in contrast to MaxEnt, boosting, and TBL
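Decoding a decision list is just a scan for the first matching rule; a minimal sketch (the rule encoding is hypothetical):

```python
def dl_decode(rules, default, x):
    """Return the class of the first rule whose single test matches x."""
    for (attr, value), label in rules:    # rules are in learned order
        if x.get(attr) == value:          # one piece of evidence per rule
            return label
    return default                        # no rule fired

rules = [(("income", "Low"), "Respond"),
         (("district", "Suburban"), "Nothing")]
print(dl_decode(rules, "Respond", {"district": "Suburban", "income": "High"}))
# -> "Nothing": the first rule does not match, the second does.
```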
TBL
– Modeling: a list of transformations (similar to decision rules)
– Training:
  – Greedy, iterative algorithm
  – The concept of the current state
– Decoding: apply every transformation to the data
TBL (cont)
Strengths:
– Minimizes the error rate directly
– Ability to handle dynamic and non-classification problems
  – Dynamic problem: POS tagging
  – Non-classification problem: parsing
Weaknesses:
– Transformations are hard to interpret, as they interact with one another
– Not probabilistic → probabilistic extension: TBL-DT
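A toy sketch of the greedy training loop for a tagging-style TBL, assuming transformations of the form "change tag a to b when the previous tag is c" (a made-up rule template, much smaller than Brill's):

```python
from itertools import product

def apply_rule(tags, rule):
    """Apply 'change frm to to when the previous tag is prev', left to right."""
    frm, to, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == frm and tags[i - 1] == prev:
            out[i] = to
    return out

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def tbl_train(current, gold, tagset, max_rules=5):
    """Greedily pick the transformation that most reduces training errors."""
    learned = []
    for _ in range(max_rules):
        candidates = [r for r in product(tagset, repeat=3) if r[0] != r[1]]
        best = min(candidates, key=lambda r: errors(apply_rule(current, r), gold))
        if errors(apply_rule(current, best), gold) >= errors(current, gold):
            break                              # no transformation helps anymore
        current = apply_rule(current, best)    # update the current state
        learned.append(best)
    return learned, current

gold = ["D", "N", "V", "D", "N"]
initial = ["D", "N", "N", "D", "N"]   # e.g., each word's most frequent tag
print(tbl_train(initial, gold, {"D", "N", "V"}))
```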
Boosting
[Diagram: the training sample is reweighted at each iteration; weak classifiers f_1, f_2, …, f_T are learned by the ML method and combined into the final classifier f]
Boosting (cont)
– Modeling: combine a set of weak classifiers to produce a powerful committee
– Training: learn one weak classifier at each iteration
– Decoding: use the weighted majority vote of the weak classifiers
Boosting (cont)
Strengths:
– It comes with a set of theoretical guarantees (e.g., on training error and test error).
– It only needs to find weak classifiers.
Weaknesses:
– It is susceptible to noise.
– The actual performance depends on the data and the base learner.
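A minimal AdaBoost-style sketch of the train/decode loop (the threshold stumps and toy data are hypothetical):

```python
import math

def train_adaboost(examples, stumps, rounds=10):
    """AdaBoost sketch: examples are (x, y) pairs with y in {-1, +1};
    stumps is a pool of candidate weak classifiers h(x) -> {-1, +1}."""
    n = len(examples)
    weights = [1.0 / n] * n
    committee = []                                    # list of (alpha, h)
    for _ in range(rounds):
        def weighted_error(h):
            return sum(w for w, (x, y) in zip(weights, examples) if h(x) != y)
        h = min(stumps, key=weighted_error)           # best weak classifier
        err = weighted_error(h)
        if err == 0.0:                                # perfect on weighted data:
            committee.append((1.0, h))                # keep it and stop
            break
        if err >= 0.5:                                # no usable weak classifier
            break
        alpha = 0.5 * math.log((1 - err) / err)       # its vote weight
        committee.append((alpha, h))
        # Reweight: misclassified examples get more weight, then renormalize.
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, (x, y) in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return committee

def boost_decode(committee, x):
    """Weighted majority vote of the weak classifiers."""
    return 1 if sum(alpha * h(x) for alpha, h in committee) >= 0 else -1

data = [(0, -1), (1, -1), (2, 1), (3, 1)]
stumps = [lambda x, t=t: 1 if x > t else -1 for t in range(4)]
model = train_adaboost(data, stumps)
print([boost_decode(model, x) for x, _ in data])      # [-1, -1, 1, 1]
```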
MaxEnt
The task: find p* s.t.
  $p^* = \arg\max_{p \in P} H(p)$
where
  $P = \{p \mid E_p f_j = E_{\tilde p} f_j,\ j = 1, \dots, k\}$
If p* exists, it has the form of an exponential (log-linear) model.
MaxEnt (cont)
If p* exists, then
  $p^*(x) = \frac{1}{Z} \prod_j \alpha_j^{f_j(x)}$
where
  $Z = \sum_x \prod_j \alpha_j^{f_j(x)}$
MaxEnt (cont)
Training: GIS, IIS
Feature selection:
– Greedy algorithm
– Select one (or more) feature at a time
In general, MaxEnt achieves good performance on many NLP tasks.
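A sketch of GIS for a conditional MaxEnt model, under the usual assumption that feature values sum to a constant C for every (x, y) (a correction feature is added to enforce this); the parity task and indicator features are made up for illustration:

```python
import math

def prob_dist(lam, features, classes, x):
    """p(y|x) = exp(sum_j lambda_j f_j(x, y)) / Z(x): the exponential form."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
              for y in classes}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def gis_train(data, base_features, classes, iterations=200):
    """Generalized Iterative Scaling sketch for conditional MaxEnt."""
    C = max(sum(f(x, y) for f in base_features)
            for x, _ in data for y in classes)
    features = base_features + [
        lambda x, y: C - sum(f(x, y) for f in base_features)]
    lam = [0.0] * len(features)
    n = len(data)
    # Empirical expectations E_p~[f_j] from the training data.
    emp = [sum(f(x, y) for x, y in data) / n for f in features]
    for _ in range(iterations):
        # Model expectations E_p[f_j] under the current parameters.
        model = [0.0] * len(features)
        for x, _ in data:
            p = prob_dist(lam, features, classes, x)
            for j, f in enumerate(features):
                model[j] += sum(p[y] * f(x, y) for y in classes) / n
        # GIS update: lambda_j += (1/C) log(E_p~[f_j] / E_p[f_j]).
        lam = [l + math.log(e / m) / C if e > 0 and m > 0 else l
               for l, e, m in zip(lam, emp, model)]
    return lam, features

# Hypothetical toy task: predict parity labels with two indicator features.
data = [(2, "even"), (4, "even"), (3, "odd"), (5, "odd")]
feats = [lambda x, y: 1 if x % 2 == 0 and y == "even" else 0,
         lambda x, y: 1 if x % 2 == 1 and y == "odd" else 0]
lam, features = gis_train(data, feats, ["even", "odd"])
print(prob_dist(lam, features, ["even", "odd"], 6))   # p("even" | 6) near 1
```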
Common issues
Objective function / quality measure:
– DT, DL: e.g., information gain
– TBL, Boosting: minimize training errors
– MaxEnt: maximize entropy while satisfying the constraints
Common issues (cont)
Avoiding overfitting
– Use development data
– Two strategies:
  – stop early
  – post-pruning
Common issues (cont)
Missing attribute values:
– Assume a "blank" value
– Assign the most common value among all "similar" examples in the training data
– (DL, DT): Assign a fraction of the example to each possible class.

Continuous-valued attributes
– Choose thresholds by checking the training data
Common issues (cont)
Attributes with different costs
– DT: Change the quality measure to include the costs

Continuous-valued goal attribute
– DT, DL: each "leaf" node is marked with a real value or a linear function
– TBL, MaxEnt, Boosting: ??
Comparison of supervised learners

                 DT          DL               TBL                  Boosting              MaxEnt
Probabilistic    PDT         PDL              TBL-DT               Confidence            Y
Parametric       N           N                N                    N                     Y
Representation   Tree        Ordered list     Ordered list of      List of weighted      List of weighted
                             of rules         transformations      classifiers           features
Each iteration   Attribute   Rule             Transformation       Classifier & weight   Feature & weight
Data processing  Split data  Split data*      Change cur_y         Reweight (x,y)        None
Decoding         Path        1st rule         Sequence of rules    Calc f(x)             Calc f(x)
Semi-supervised Learning
Semi-supervised learning
– Each learning method makes some assumptions about the problem.
– SSL works when those assumptions are satisfied.
– SSL could degrade performance when mistakes reinforce themselves.
SSL (cont)
We have covered four methods: self-training, co-training, EM, and co-EM.
Co-training
The original paper: (Blum and Mitchell, 1998)
– Two "independent" views: split the features into two sets.
– Train a classifier on each view.
– Each classifier labels data that can be used to train the other classifier.

Extensions:
– Relax the conditional independence assumptions.
– Instead of using two views, use two or more classifiers trained on the whole feature set.
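A runnable skeleton of the co-training loop (a sketch, not the exact Blum and Mitchell algorithm): each example has two numeric views, and a toy nearest-class-mean learner stands in for the real base classifier.

```python
from collections import defaultdict

class Prototype:
    """Toy learner: classify a numeric view by the nearest class mean."""
    def __init__(self, pairs):
        sums = defaultdict(lambda: [0.0, 0])
        for v, y in pairs:
            sums[y][0] += v
            sums[y][1] += 1
        self.means = {y: s / n for y, (s, n) in sums.items()}
    def predict(self, v):
        dists = sorted((abs(v - m), y) for y, m in self.means.items())
        label = dists[0][1]
        margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 1.0
        return label, margin                 # margin serves as confidence

def co_train(labeled, unlabeled, rounds=5, grow=1):
    """Each round, train one classifier per view on the labeled data; each
    classifier then labels the unlabeled examples it is most confident on,
    and those examples become training data for the other classifier."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        h = [Prototype([(x[i], y) for x, y in labeled]) for i in (0, 1)]
        for i in (0, 1):
            pool.sort(key=lambda x: h[i].predict(x[i])[1], reverse=True)
            for x in pool[:grow]:
                labeled.append((x, h[i].predict(x[i])[0]))
            pool = pool[grow:]
    return labeled

seed = [((0.0, 0.1), "A"), ((1.0, 0.9), "B")]
unlab = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 1.0)]
print(co_train(seed, unlab))
```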
Unsupervised learning
EM is a method for estimating parameters in the MLE framework. It produces a sequence of parameter estimates that successively improve the likelihood of the training data.
The EM algorithm
Start with an initial estimate $\theta^0$.
Repeat until convergence:
– E-step: calculate $Q(\theta \mid \theta^t) = \sum_z p(z \mid x, \theta^t) \log p(x, z \mid \theta)$
– M-step: find $\theta^{t+1} = \arg\max_\theta Q(\theta \mid \theta^t)$
The EM algorithm (cont)
– The optimal solution for the M-step exists for many classes of problems.
– A number of well-known methods are special cases of EM.
– The EM algorithm for PM models:
  – Forward-backward algorithm
  – Inside-outside algorithm
  – …
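A concrete sketch of the E-step/M-step loop on the classic two-coin mixture (the trial counts and starting point are made up): which coin produced each trial is the hidden variable.

```python
def em_two_coins(counts, theta, iterations=50):
    """EM sketch for a mixture of two coins: each trial is (heads, tails)
    from one of two coins, but which coin was used is hidden.
    theta = (p_pick_coin1, p_head_coin1, p_head_coin2)."""
    pi, p1, p2 = theta
    for _ in range(iterations):
        # E-step: posterior probability that each trial came from coin 1.
        post = []
        for h, t in counts:
            l1 = pi * (p1 ** h) * ((1 - p1) ** t)
            l2 = (1 - pi) * (p2 ** h) * ((1 - p2) ** t)
            post.append(l1 / (l1 + l2))
        # M-step: re-estimate the parameters from expected counts.
        pi = sum(post) / len(counts)
        p1 = sum(g * h for g, (h, t) in zip(post, counts)) / \
             sum(g * (h + t) for g, (h, t) in zip(post, counts))
        p2 = sum((1 - g) * h for g, (h, t) in zip(post, counts)) / \
             sum((1 - g) * (h + t) for g, (h, t) in zip(post, counts))
    return pi, p1, p2

trials = [(9, 1), (8, 2), (2, 8), (1, 9), (9, 1)]   # (heads, tails) per trial
print(em_two_coins(trials, (0.5, 0.6, 0.4)))        # p1 ends high, p2 ends low
```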
Other topics
FSA and HMM
Two types of HMMs:
– State-emission and arc-emission HMMs
– They are equivalent.
We can convert an HMM into a WFSA.
– Modeling: Markov assumption
– Training:
  – Supervised: counting
  – Unsupervised: forward-backward algorithm
– Decoding: Viterbi algorithm
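A minimal Viterbi decoding sketch for a state-emission HMM (the toy POS-style states and probabilities below are hypothetical):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Find the best state sequence for an observation sequence.
    Probabilities are dicts: start_p[s], trans_p[s][s'], emit_p[s][o]."""
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (best[t - 1][r] * trans_p[r][s] * emit_p[s][obs[t]], r)
                for r in states)
            best[t][s] = prob
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["N", "V"]
start = {"N": 0.7, "V": 0.3}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"flies": 0.4, "like": 0.1, "time": 0.5},
        "V": {"flies": 0.5, "like": 0.4, "time": 0.1}}
print(viterbi(["time", "flies", "like"], states, start, trans, emit))
```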
Bootstrap
[Diagram: one original sample is resampled into B bootstrap samples; the ML method produces f_1, f_2, …, f_B, which are combined into f]
Bootstrap (cont)
– A method of re-sampling: one original sample → B bootstrap samples
– It has a strong mathematical background.
– It is a method for estimating standard errors, bias, and so on.
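For instance, estimating the standard error of a statistic by resampling with replacement; a minimal sketch (the sample values are made up):

```python
import random

def bootstrap_stderr(sample, statistic, B=1000, seed=0):
    """Draw B bootstrap samples (with replacement) from the one original
    sample and report the standard deviation of the statistic across them."""
    rng = random.Random(seed)
    n = len(sample)
    stats = []
    for _ in range(B):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(resample))
    mean = sum(stats) / B
    return (sum((s - mean) ** 2 for s in stats) / (B - 1)) ** 0.5

data = [2.1, 3.4, 2.9, 4.0, 3.3, 2.5]
print(bootstrap_stderr(data, lambda xs: sum(xs) / len(xs)))  # SE of the mean
```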
System combination
[Diagram: B learners ML_1, ML_2, …, ML_B produce classifiers f_1, f_2, …, f_B, which are combined into f]
System combination (cont)
Hybridization: combine substructures to produce a new output.
– Voting
– Naïve Bayes

Switching: choose one of the f_i(x).
– Similarity switching
– Naïve Bayes
Bagging
bootstrap + system combination
[Diagram: B bootstrap samples; the ML method learns f_1, f_2, …, f_B, which are combined into f]
Bagging (cont)
It is effective for unstable learning methods:
– Decision tree
– Regression tree
– Neural network

It does not help stable learning methods:
– k-nearest neighbors
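Putting the two pieces together, a minimal bagging sketch: bootstrap resampling plus majority-vote combination, with a deliberately unstable one-threshold learner standing in for a real base learner (everything below is hypothetical toy code):

```python
import random
from collections import Counter

def bag(train, data, B=25, seed=0):
    """Train B classifiers on bootstrap samples of the data, then combine
    them by majority vote (bootstrap + system combination)."""
    rng = random.Random(seed)
    n = len(data)
    models = [train([data[rng.randrange(n)] for _ in range(n)])
              for _ in range(B)]
    def combined(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return combined

def train_stump(sample):
    """Toy unstable learner: pick the threshold with the fewest errors."""
    def errs(t):
        return sum((x >= t) != (y == "high") for x, y in sample)
    best = min({x for x, _ in sample}, key=errs)
    return lambda x: "high" if x >= best else "low"

data = [(1, "low"), (2, "low"), (3, "high"), (4, "high"), (5, "high")]
model = bag(train_stump, data)
print([model(x) for x in (1, 2, 4, 5)])   # majority-vote predictions
```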
Relations
Relations
– WFSA and HMM
– DL, DT, TBL
– EM, EM for PM
WFSA and HMM
To convert an HMM into a WFSA:
– Add a "Start" state and a transition from "Start" to any state in the HMM.
– Add a "Finish" state and a transition from any state in the HMM to "Finish".
[Diagram: an HMM with added Start and Finish states]
DT, CNF, DNF, DL, TBL
[Diagram: expressiveness relations among k-CNF, k-DNF, k-DT, k-DL, and k-TBL]
The EM algorithm
[Diagram: the generalized EM contains the EM algorithm; special cases include Gaussian mixtures and the EM algorithm for PM models, which covers forward-backward, inside-outside, and the IBM models]
Solving an NLP problem
Issues
– Modeling: represent the problem as a formula and decompose the formula into a function of parameters
– Training: estimate the model parameters
– Decoding: find the best answer given the parameters
– Other issues:
  – Preprocessing
  – Postprocessing
  – Evaluation
  – …
Modeling
– Generative vs. discriminative models
– Introducing hidden variables
– The order of decomposition
Modeling (cont)
– Approximations / assumptions
– Final formulae and types of parameters
Modeling (cont)
Using classifiers for non-classification problems:
– POS tagging
– Chunking
– Parsing
Training
Objective functions:
– Maximize likelihood: EM
– Minimize error rate: TBL
– Maximum entropy: MaxEnt
– …

Supervised, semi-supervised, unsupervised:
– Ex: maximize likelihood
  – Supervised: simple counting
  – Unsupervised: EM
Training (cont)
At each iteration:
– Choose one attribute / rule / weight / … at a time, and never change it later: DT, DL, TBL
– Update all the parameters at each iteration: EM

Choosing "untrained" parameters (e.g., thresholds): use development data.
– Ex: the minimal "gain" required to continue iterating
Decoding
Dynamic programming:
– CYK for PCFG
– Viterbi for HMM

Dynamic problems:
– Decode from left to right
– Features only look at the left context
– Keep the top-N hypotheses at each position (see the beam-search sketch below)
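A minimal left-to-right beam search sketch; the scoring function and its feature set are hypothetical, chosen so that features only consult the left context:

```python
def beam_decode(words, tags, score, beam_size=3):
    """Keep the top-N tag sequences at each position;
    score(prev_tag, tag, word) returns a (log-probability-like) value."""
    beams = [([], 0.0)]                              # (partial sequence, score)
    for w in words:
        extended = [(seq + [t], sc + score(seq[-1] if seq else None, t, w))
                    for seq, sc in beams for t in tags]
        extended.sort(key=lambda h: h[1], reverse=True)
        beams = extended[:beam_size]                 # prune to top-N hypotheses
    return beams[0][0]

def score(prev, tag, word):
    """Made-up scoring: a tag-bigram bonus plus a lexical match bonus."""
    bigram = 0.5 if (prev, tag) in {("D", "N"), ("N", "V"), (None, "D")} else 0.0
    lexical = 1.0 if (word, tag) in {("the", "D"), ("dog", "N"),
                                     ("barks", "V")} else 0.0
    return bigram + lexical

print(beam_decode(["the", "dog", "barks"], ["D", "N", "V"], score))
# -> ['D', 'N', 'V']
```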
Preprocessing
– Sentence segmentation
– Sentence alignment (for MT)
– Tokenization
– Morphing
– POS tagging
– …
Post-processing
– System combination
– Casing (MT)
– …
Evaluation
– Use standard training/test data if possible.
– Choose appropriate evaluation measures:
  – WSD: for what applications?
  – Word alignment: F-measure vs. AER. How does it affect the MT result?
  – Parsing: F-measure vs. dependency link accuracy
Tricks
– Algebra
– Probability
– Optimization
– Programming
Algebra
The order of sums: $\sum_i \sum_j a_{ij} = \sum_j \sum_i a_{ij}$
Pulling out constants: $\sum_i c \, a_i = c \sum_i a_i$
Algebra (cont)
The order of sums and products: $\sum_{x_1} \cdots \sum_{x_n} \prod_{i=1}^n f_i(x_i) = \prod_{i=1}^n \sum_{x_i} f_i(x_i)$
The order of log and product / sum: $\log \prod_i a_i = \sum_i \log a_i$
Probability
Introducing a new random variable: $P(x) = \sum_y P(x, y)$
The order of decomposition: $P(x, y) = P(x) P(y \mid x) = P(y) P(x \mid y)$
More general cases
$P(x \mid c) = \sum_y P(x, y \mid c)$
$P(x_1, \dots, x_n) = \prod_{i=1}^n P(x_i \mid x_1, \dots, x_{i-1})$
Probability (cont)
Source-channel model: $\hat{y} = \arg\max_y P(y \mid x) = \arg\max_y P(y) \, P(x \mid y)$
Bayes Rule: $P(y \mid x) = \frac{P(y) \, P(x \mid y)}{P(x)}$
Probability (cont)
Normalization: $\sum_x P(x) = 1$, e.g., $P(x) = \frac{f(x)}{\sum_{x'} f(x')}$
Jensen's inequality: $\log \sum_i p_i x_i \ge \sum_i p_i \log x_i$
Optimization
– When there is no analytical solution, use an iterative approach.
– If the optimal solution to g(x) is hard to find, look for the optimal solution to a (tight) lower bound of g(x).
Optimization (cont)
Using Lagrange multipliers:
– Constrained problem: maximize f(x) subject to the constraint g(x) = 0
– Unconstrained problem: maximize f(x) - λ g(x)
– Take the first derivatives to find the stationary points.
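As a worked instance (a standard derivation, not taken from the slides): maximizing entropy under only the normalization constraint yields the uniform distribution.

```latex
\text{Maximize } H(p) = -\sum_i p_i \log p_i
\quad \text{subject to} \quad g(p) = \sum_i p_i - 1 = 0.

\text{Lagrangian: } \Lambda(p, \lambda)
  = -\sum_i p_i \log p_i - \lambda \Big( \sum_i p_i - 1 \Big)

\frac{\partial \Lambda}{\partial p_i} = -\log p_i - 1 - \lambda = 0
\quad \Rightarrow \quad p_i = e^{-1-\lambda} \text{ (the same for every } i)

\text{Imposing } \sum_i p_i = 1 \text{ over } n \text{ outcomes gives } p_i = 1/n.
```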
Programming
Using/creating a good package:
– Tutorial, sample data, well-written code
– Multiple levels of code:
  – Core ML algorithm: e.g., TBL
  – Wrapper for a task: e.g., POS tagger
  – Wrapper to deal with input, output, etc.
Programming (cont)
Good practice:
– Write notes and create wrappers (all the commands should be stored in the notes, or even better in a wrapper script).
– Use standard directory structures: src/, include/, exec/, bin/, obj/, docs/, sample/, data/, result/
– Give meaningful filenames to important code: e.g., build_trigram_tagger.pl rather than aaa100.exec.
– Give meaningful function and variable names.
– Don't use global variables.
Final words
– We have covered a lot of topics: 5+4+3+4.
– It takes time to digest, but at least we understand the basic concepts.
– The next step: applying them to real applications.