1 Final review LING 572 Fei Xia 03/07/06

2 Misc
Parts 3 and 4 were due at 6am today.
Presentation: email me the slides by 6am on 3/9.
Final report: email me by 6am on 3/14.
Group meetings: 1:30-4:00pm on 3/16.

3 Outline Main topics Applying to NLP tasks Tricks

4 Main topics

5 Supervised learning
– Decision tree
– Decision list
– TBL
– MaxEnt
– Boosting
Semi-supervised learning
– Self-training
– Co-training
– EM
– Co-EM

6 Main topics (cont)
Unsupervised learning
– The EM algorithm
– The EM algorithm for PM models: forward-backward, inside-outside, IBM models for MT
Others
– Two dynamic models: FSA and HMM
– Re-sampling: bootstrap
– System combination
– Bagging

7 Main topics (cont)
Homework
– Hw1: FSA and HMM
– Hw2: DT, DL, CNF, DNF, and TBL
– Hw3: Boosting
Project
– P1: Trigram (learn to use Carmel, relation between HMM and FSA)
– P2: TBL
– P3: MaxEnt
– P4: Bagging, boosting, system combination, SSL

8 Supervised learning

9 A classification problem

| District | House type    | Income | Previous customer | Outcome |
|----------|---------------|--------|-------------------|---------|
| Suburban | Detached      | High   | No                | Nothing |
| Suburban | Semi-detached | High   | Yes               | Respond |
| Rural    | Semi-detached | Low    | No                | Respond |
| Urban    | Detached      | Low    | Yes               | Nothing |
| …        |               |        |                   |         |

10 Classification and estimation problems
Given
– x: input attributes
– y: the goal
– training data: a set of (x, y) pairs
Predict y given a new x:
– y is a discrete variable → classification problem
– y is a continuous variable → estimation problem

11 Five ML methods Decision tree Decision list TBL Boosting MaxEnt

12 Decision tree
Modeling: tree representation
Training: top-down induction, greedy algorithm
Decoding: find the path from the root to a leaf node where the tests along the path are satisfied

13 Decision tree (cont)
Main algorithms: ID3, C4.5, CART
Strengths:
– Ability to generate understandable rules
– Ability to clearly indicate the best attributes
Weaknesses:
– Data splitting
– Trouble with non-rectangular regions
– The instability of top-down induction → bagging
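
To make the decoding step described above concrete, here is a minimal Python sketch (not from the slides): the Node layout and the example tree are illustrative assumptions, with the tree chosen to be consistent with the four customer rows shown earlier.

```python
class Node:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # test attribute at an internal node
        self.children = children or {}  # attribute value -> child Node
        self.label = label              # class label at a leaf

def classify(node, instance):
    # Follow the path whose tests the instance satisfies until a leaf is reached.
    while node.label is None:
        node = node.children[instance[node.attribute]]
    return node.label

# Hypothetical tree consistent with the customer table above.
tree = Node("HouseType", {"Detached": Node(label="Nothing"),
                          "Semi-detached": Node(label="Respond")})
print(classify(tree, {"HouseType": "Semi-detached"}))   # -> Respond
```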

14 Decision list
Modeling: a list of decision rules
Training: greedy, iterative algorithm
Decoding: find the 1st rule that applies
Each decision is based on a single piece of evidence, in contrast to MaxEnt, boosting, and TBL.
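
For contrast with the decision tree, a decision-list decoder just scans the ordered rules and returns the class of the first rule that fires. A toy sketch follows; the rule conditions and the catch-all default are illustrative assumptions, not a list learned from the data.

```python
# Ordered rules: (condition, class); the last rule is a catch-all default.
rules = [
    (lambda x: x.get("HouseType") == "Semi-detached", "Respond"),
    (lambda x: True,                                   "Nothing"),
]

def classify(x):
    for condition, label in rules:
        if condition(x):        # the 1st rule that applies wins
            return label

print(classify({"HouseType": "Detached", "Income": "High"}))   # -> Nothing
```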

15 TBL
Modeling: a list of transformations (similar to decision rules)
Training:
– Greedy, iterative algorithm
– The concept of a current state
Decoding: apply the learned transformations, in order, to the data

16 TBL (cont)
Strengths:
– Minimizes the error rate directly
– Can handle dynamic problems (e.g., POS tagging) and non-classification problems (e.g., parsing)
Weaknesses:
– Transformations are hard to interpret, as they interact with one another
– Not inherently probabilistic (probabilistic TBL: TBL-DT)
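
A rough sketch of the greedy training loop, under the assumption that each candidate transformation is a function t(example, current_label) -> new_label; this interface and the stopping threshold are illustrative, not the course's actual implementation.

```python
def tbl_train(examples, gold, initial, candidates, min_gain=1):
    """Greedy TBL training sketch: keep a current labeling (the "current
    state") and repeatedly apply the transformation with the largest net
    error reduction."""
    def errors(pred):
        return sum(p != g for p, g in zip(pred, gold))
    current, learned = list(initial), []
    while candidates:
        scored = []
        for t in candidates:
            proposed = [t(x, y) for x, y in zip(examples, current)]
            scored.append((errors(current) - errors(proposed), t, proposed))
        gain, best, labels = max(scored, key=lambda s: s[0])
        if gain < min_gain:
            break                       # stop when no transformation helps enough
        current, learned = labels, learned + [best]
    return learned                      # ordered list of transformations
```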

17 Boosting
[Diagram: the training sample is repeatedly reweighted; a weak classifier f_1, f_2, …, f_T is trained on each weighted sample with ML, and the classifiers are combined into the final classifier f.]

18 Boosting (cont)
Modeling: combine a set of weak classifiers to produce a powerful committee
Training: learn one weak classifier at each iteration
Decoding: use the weighted majority vote of the weak classifiers

19 Boosting (cont)
Strengths:
– It comes with theoretical guarantees (e.g., bounds on training and test error).
– It only needs to find weak classifiers.
Weaknesses:
– It is susceptible to noise.
– The actual performance depends on the data and the base learner.
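
As an illustration of "learn one classifier at each iteration" plus the weighted vote, here is a hedged AdaBoost-style sketch. The interface weak_learn(X, y, w), returning a classifier h(x) in {-1, +1}, is an assumption, and the update follows the standard AdaBoost weighting rather than any specific variant from class.

```python
import math

def adaboost(X, y, weak_learn, rounds=10):
    n = len(X)
    w = [1.0 / n] * n                         # uniform initial weights
    ensemble = []                             # list of (alpha, classifier)
    for _ in range(rounds):
        h = weak_learn(X, y, w)
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        err = min(max(err, 1e-10), 1 - 1e-10) # keep the log well-defined
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # up-weight the examples this round got wrong, then renormalize
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    def combined(x):                          # weighted majority vote
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return combined
```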

20 MaxEnt
The task: find p* = argmax_{p ∈ P} H(p), where P is the set of models whose feature expectations E_p[f_j] match the empirical expectations observed in the training data.
If p* exists, it has the form of a log-linear (exponential) model.

21 MaxEnt (cont)
If p* exists, then
  p*(x) = exp( Σ_j λ_j f_j(x) ) / Z,
where Z = Σ_x exp( Σ_j λ_j f_j(x) ) is the normalization constant.

22 MaxEnt (cont)
Training: GIS, IIS
Feature selection:
– Greedy algorithm
– Select one (or more) feature at a time
In general, MaxEnt achieves good performance on many NLP tasks.
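
A small sketch of decoding with a conditional log-linear model of the form sketched above, p(y|x) ∝ exp(Σ_j λ_j f_j(x, y)); the feature functions and weights here are made-up illustrations, not trained values.

```python
import math

def maxent_probs(x, labels, features, weights):
    # Score each label, then normalize by Z(x) to get a distribution.
    scores = {y: math.exp(sum(w * f(x, y) for f, w in zip(features, weights)))
              for y in labels}
    z = sum(scores.values())                  # normalizer Z(x)
    return {y: s / z for y, s in scores.items()}

# Toy example: two binary features over the customer data used earlier.
features = [lambda x, y: 1.0 if x["HouseType"] == "Semi-detached" and y == "Respond" else 0.0,
            lambda x, y: 1.0 if x["Income"] == "High" and y == "Nothing" else 0.0]
weights = [1.2, 0.8]                          # hypothetical weights
print(maxent_probs({"HouseType": "Detached", "Income": "High"},
                   ["Respond", "Nothing"], features, weights))
```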

23 Common issues
Objective function / quality measure:
– DT, DL: e.g., information gain
– TBL, Boosting: minimize training errors
– MaxEnt: maximize entropy while satisfying the constraints

24 Common issues (cont)
Avoiding overfitting:
– Use development data
– Two strategies: stop early, or post-prune

25 Common issues (cont)
Missing attribute values:
– Assume a "blank" value
– Assign the most common value among all "similar" examples in the training data
– (DL, DT): assign a fraction of the example to each possible class
Continuous-valued attributes:
– Choose thresholds by checking the training data

26 Common issues (cont)
Attributes with different costs:
– DT: change the quality measure to include the costs
Continuous-valued goal attribute:
– DT, DL: each "leaf" node is marked with a real value or a linear function
– TBL, MaxEnt, Boosting: ??

27 Comparison of supervised learners

|                 | DT         | DL                    | TBL                             | Boosting                     | MaxEnt                    |
|-----------------|------------|-----------------------|---------------------------------|------------------------------|---------------------------|
| Probabilistic   | PDT        | PDL                   | TBL-DT                          | Confidence                   | Y                         |
| Parametric      | N          | N                     | N                               | N                            | Y                         |
| Representation  | Tree       | Ordered list of rules | Ordered list of transformations | List of weighted classifiers | List of weighted features |
| Each iteration  | Attribute  | Rule                  | Transformation                  | Classifier & weight          | Feature & weight          |
| Data processing | Split data | Split data*           | Change cur_y                    | Reweight (x, y)              | None                      |
| Decoding        | Path       | 1st rule              | Sequence of rules               | Calc f(x)                    | Calc f(x)                 |

28 Semi-supervised Learning

29 Semi-supervised learning Each learning method makes some assumptions about the problem. SSL works when those assumptions are satisfied. SSL could degrade the performance when mistakes reinforce themselves.

30 SSL (cont)
We have covered four methods: self-training, co-training, EM, and co-EM.

31 Co-training
The original paper: (Blum and Mitchell, 1998)
– Two "independent" views: split the features into two sets.
– Train a classifier on each view.
– Each classifier labels data that can be used to train the other classifier.
Extensions:
– Relax the conditional independence assumptions
– Instead of using two views, use two or more classifiers trained on the whole feature set
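
A hedged sketch of the loop described above; the train() function, the classifier's predict/confidence methods, and the choice of k examples per round are all assumed interfaces for illustration.

```python
def cotrain(labeled, unlabeled, view1, view2, train, rounds=10, k=5):
    """Each classifier is trained on its own view and labels the unlabeled
    examples it is most confident about, growing the shared labeled set."""
    L, U = list(labeled), list(unlabeled)
    c1 = c2 = None
    for _ in range(rounds):
        c1 = train([(view1(x), y) for x, y in L])    # classifier on view 1
        c2 = train([(view2(x), y) for x, y in L])    # classifier on view 2
        for clf, view in ((c1, view1), (c2, view2)):
            ranked = sorted(U, key=lambda x: -clf.confidence(view(x)))
            for x in ranked[:k]:
                L.append((x, clf.predict(view(x))))  # newly labeled example
                U.remove(x)
        if not U:
            break
    return c1, c2
```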

32 Unsupervised learning

33 EM is a method for estimating parameters in the MLE framework. It finds a sequence of parameter estimates, each improving the likelihood of the training data.

34 The EM algorithm
Start with an initial estimate θ^0.
Repeat until convergence:
– E-step: calculate Q(θ | θ^t) = E_{z ~ P(z | x, θ^t)} [ log P(x, z | θ) ]
– M-step: find θ^{t+1} = argmax_θ Q(θ | θ^t)

35 The EM algorithm (cont)
The optimal solution for the M-step exists for many classes of problems → a number of well-known methods are special cases of EM.
The EM algorithm for PM models:
– Forward-backward algorithm
– Inside-outside algorithm
– …
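
To ground the E-step/M-step pattern, here is a self-contained toy example (not from the course): EM for a mixture of two biased coins, where the hidden variable is which coin produced each observed count of heads.

```python
def em_two_coins(heads, m, iters=50):
    """EM for a two-component binomial mixture.
    heads[i] = number of heads in m tosses of an unknown coin."""
    p = [0.3, 0.7]                       # initial bias estimates, theta_0
    pi = [0.5, 0.5]                      # mixing weights
    for _ in range(iters):
        # E-step: posterior probability that each observation came from each coin
        post = []
        for h in heads:
            lik = [pi[k] * (p[k] ** h) * ((1 - p[k]) ** (m - h)) for k in range(2)]
            z = sum(lik)
            post.append([l / z for l in lik])
        # M-step: re-estimate parameters from the expected counts
        for k in range(2):
            nk = sum(r[k] for r in post)
            pi[k] = nk / len(heads)
            p[k] = sum(r[k] * h for r, h in zip(post, heads)) / (nk * m)
    return p, pi

print(em_two_coins([9, 8, 1, 2, 8], m=10))
```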

36 Other topics

37 FSA and HMM
Two types of HMMs:
– State-emission and arc-emission HMMs
– They are equivalent
We can convert an HMM into a WFA.
Modeling: Markov assumption
Training:
– Supervised: counting
– Unsupervised: forward-backward algorithm
Decoding: Viterbi algorithm
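
A compact Viterbi sketch for the decoding step mentioned above; the two-state tagset and the probabilities in the usage example are invented purely for illustration.

```python
def viterbi(obs, states, start, trans, emit):
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]   # best score so far
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            best_prev = max(states, key=lambda r: V[t - 1][r] * trans[r][s])
            V[t][s] = V[t - 1][best_prev] * trans[best_prev][s] * emit[s][obs[t]]
            back[t][s] = best_prev
    # follow back-pointers from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy two-state model (hypothetical numbers).
states = ["N", "V"]
start = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"fish": 0.7, "sleep": 0.3}, "V": {"fish": 0.4, "sleep": 0.6}}
print(viterbi(["fish", "sleep"], states, start, trans, emit))   # -> ['N', 'V']
```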

38 Bootstrap
[Diagram: the original sample is resampled into B bootstrap samples; applying ML to each yields f_1, f_2, …, f_B, which are combined into f.]

39 Bootstrap (cont)
A method of re-sampling: one original sample → B bootstrap samples.
It has a strong mathematical background.
It is a method for estimating standard errors, bias, and so on.
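
A minimal illustration of the resampling idea: estimate the standard error of a statistic (here the mean, as an arbitrary example) from B bootstrap samples; the data values are made up.

```python
import random

def bootstrap_se(data, statistic, B=1000):
    stats = []
    for _ in range(B):
        resample = [random.choice(data) for _ in data]   # one bootstrap sample
        stats.append(statistic(resample))
    mean = sum(stats) / B
    # sample standard deviation of the statistic across bootstrap samples
    return (sum((s - mean) ** 2 for s in stats) / (B - 1)) ** 0.5

data = [2.1, 3.4, 1.9, 5.0, 4.2, 2.8]
print(bootstrap_se(data, lambda xs: sum(xs) / len(xs)))
```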

40 System combination
[Diagram: the same data is fed to several learners ML_1, ML_2, …, ML_B, whose outputs f_1, f_2, …, f_B are combined into f.]

41 System combination (cont)
Hybridization: combine substructures to produce a new one
– Voting
– Naïve Bayes
Switching: choose one of the f_i(x)
– Similarity switching
– Naïve Bayes
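
As a minimal illustration of combining system outputs by voting, this sketch takes a simple majority over whole outputs; the hybridization schemes above instead vote over substructures, which is more involved.

```python
from collections import Counter

def vote(outputs):
    """Return the output proposed by the most systems (ties broken arbitrarily)."""
    return Counter(outputs).most_common(1)[0][0]

# Three systems propose a label for the same token; the majority wins.
print(vote(["NP", "NP", "VP"]))   # -> "NP"
```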

42 Bagging
[Diagram: bootstrap + system combination — B bootstrap samples, a classifier f_1, f_2, …, f_B learned from each with ML, combined into f.]

43 Bagging (cont)
It is effective for unstable learning methods:
– Decision tree
– Regression tree
– Neural network
It does not help stable learning methods:
– K-nearest neighbors

44 Relations

45 WFSA and HMM; DL, DT, TBL; EM, EM for PM

46 WFSA and HMM
[Diagram: an HMM with added "Start" and "Finish" states]
– Add a "Start" state and a transition from "Start" to any state in the HMM.
– Add a "Finish" state and a transition from any state in the HMM to "Finish".

47 DL, CNF, DNF, DT, TBL
[Diagram relating k-CNF, k-DNF, k-DT, k-DL, and k-TBL]

48 The EM algorithm
[Diagram: the generalized EM algorithm contains the EM algorithm; special cases of EM include Gaussian mixtures and the EM algorithm for PM models (forward-backward, inside-outside, IBM models)]

49 Solving an NLP problem

50 Issues
Modeling: represent the problem as a formula and decompose the formula into a function of parameters
Training: estimate the model parameters
Decoding: find the best answer given the parameters
Other issues:
– Preprocessing
– Postprocessing
– Evaluation
– …

51 Modeling Generative vs. discriminative models Introducing hidden variables The order of decomposition

52 Modeling (cont) Approximation / assumptions Final formulae and types of parameters

53 Modeling (cont)
Using classifiers for non-classification problems:
– POS tagging
– Chunking
– Parsing

54 Training
Objective functions:
– Maximize likelihood: EM
– Minimize error rate: TBL
– Maximum entropy: MaxEnt
– …
Supervised, semi-supervised, unsupervised:
– Ex: maximizing likelihood (supervised: simple counting; unsupervised: EM)

55 Training (cont)
At each iteration:
– Choose one attribute / rule / weight / … at a time, and never change it later: DT, DL, TBL
– Update all the parameters at each iteration: EM
Choosing "untrained" parameters (e.g., thresholds): use development data.
– Ex: the minimal "gain" required to continue iterating

56 Decoding
Dynamic programming:
– CYK for PCFG
– Viterbi for HMM
Dynamic problems:
– Decode from left to right
– Features only look at the left context
– Keep the top-N hypotheses at each position (sketched below)
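
A sketch of left-to-right beam decoding as described above: extend each hypothesis with every possible label, score it with features that only look at the left context, and keep the top-N hypotheses at each position. The scoring function score(history, word, label) is an assumed interface (e.g., a log probability from a MaxEnt tagger).

```python
def beam_decode(words, labels, score, beam_size=5):
    beam = [([], 0.0)]                           # (label history, total score)
    for w in words:
        candidates = [(hist + [y], s + score(hist, w, y))
                      for hist, s in beam for y in labels]
        candidates.sort(key=lambda c: -c[1])
        beam = candidates[:beam_size]            # keep the top-N hypotheses
    return beam[0][0]                            # best full hypothesis
```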

57 Preprocessing Sentence segmentation Sentence alignment (for MT) Tokenization Morphing POS tagging …

58 Post-processing System combination Casing (MT) …

59 Evaluation
Use standard training/test data if possible.
Choose appropriate evaluation measures:
– WSD: for what applications?
– Word alignment: F-measure vs. AER; how does it affect the MT result?
– Parsing: F-measure vs. dependency link accuracy

60 Tricks

61 Algebra Probability Optimization Programming

62 Algebra
The order of sums: Σ_i Σ_j a_{ij} = Σ_j Σ_i a_{ij}
Pulling out constants: Σ_i c · a_i = c · Σ_i a_i

63 Algebra (cont)
The order of sums and products: Σ_{y_1} … Σ_{y_n} Π_i f(y_i) = Π_i Σ_{y_i} f(y_i)
The order of log and product / sum: log Π_i x_i = Σ_i log x_i

64 Probability
Introducing a new random variable: P(x) = Σ_y P(x, y)
The order of decomposition: P(x, y) = P(x) P(y | x) = P(y) P(x | y)

65 More general cases: P(x | z) = Σ_y P(x, y | z); P(x_1, …, x_n) = Π_i P(x_i | x_1, …, x_{i−1})

66 Probability (cont)
Source-channel model: ŷ = argmax_y P(y | x) = argmax_y P(y) P(x | y)
Bayes rule: P(y | x) = P(x | y) P(y) / P(x)

67 Probability (cont)
Normalization: Σ_y P(y | x) = 1
Jensen's inequality: log Σ_i p_i x_i ≥ Σ_i p_i log x_i (since log is concave and Σ_i p_i = 1)

68 Optimization
When there is no analytical solution, use an iterative approach.
If the optimal solution to g(x) is hard to find, look for the optimal solution to a (tight) lower bound of g(x).

69 Optimization (cont)
Using Lagrange multipliers:
– Constrained problem: maximize f(x) subject to the constraint g(x) = 0
– Unconstrained problem: maximize f(x) − λ g(x)
– Take first derivatives to find the stationary points.
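
A worked example (not from the slides) of the recipe above: maximizing entropy subject only to the normalization constraint, which yields the uniform distribution.

```latex
\begin{align*}
\text{maximize } & H(p) = -\sum_{i=1}^{n} p_i \log p_i
   \quad \text{subject to} \quad g(p) = \sum_{i=1}^{n} p_i - 1 = 0 \\
\Lambda(p, \lambda) &= -\sum_i p_i \log p_i - \lambda \Big( \sum_i p_i - 1 \Big) \\
\frac{\partial \Lambda}{\partial p_i} &= -\log p_i - 1 - \lambda = 0
   \;\Longrightarrow\; p_i = e^{-1-\lambda} \quad (\text{the same for every } i) \\
\sum_i p_i = 1 &\;\Longrightarrow\; p_i = \tfrac{1}{n}
\end{align*}
```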

70 Programming
Using/creating a good package:
– Tutorial, sample data, well-written code
– Multiple levels of code:
  – Core ML algorithm: e.g., TBL
  – Wrapper for a task: e.g., a POS tagger
  – Wrapper to deal with input, output, etc.

71 Programming (cont)
Good practice:
– Write notes and create wrappers (all the commands should be stored in the notes, or better yet in a wrapper script)
– Use standard directory structures: src/, include/, exec/, bin/, obj/, docs/, sample/, data/, result/
– Give meaningful filenames, at least to important code (e.g., build_trigram_tagger.pl, not aaa100.exec)
– Give meaningful function and variable names
– Don't use global variables

72 Final words We have covered a lot of topics: 5+4+3+4 It takes time to digest, but at least we understand the basic concepts. The next step: applying them to real applications.

