1 Course Review #2 and Project Parts 3-6 LING 572 Fei Xia 02/14/06

2 Outline
Supervised learning
–Learning algorithms
–Resampling: bootstrap
–System combination
Semi-supervised learning
Unsupervised learning

3 Supervised Learning

4 Machine learning problems
Input x: a sentence, a set of attributes, …
Input domain X: the set of all possible inputs
Output y: a class, a real number, a tag, a tag sequence, a parse tree, a cluster
Output domain Y: the set of all possible outputs
Training data t: a set of (x, y) pairs → in supervised learning, y is known.

5 Machine learning problems (cont)
Predictor f: a function from X to Y.
Learner: a function from T to F
–T: the set of all possible training data
–F: the set of all possible predictors
Types of ML problems:
–Y is a finite set: classification
–Y is R: regression
–Y is of another type: parsing, clustering, …
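The definitions on this slide can be made concrete with a toy example: a learner is just a function that takes training data t and returns a predictor f from X to Y. The "memorize plus majority fallback" learner below is hypothetical, chosen only to make the types visible, not an algorithm from the course.

```python
# A toy learner: a function from training data to a predictor.
# (Hypothetical illustration of the t -> f typing on the slide.)
from collections import Counter

def learner(t):
    """Memorize the training pairs; fall back to the majority class."""
    memory = dict(t)
    majority = Counter(y for _, y in t).most_common(1)[0][0]
    def predictor(x):
        return memory.get(x, majority)
    return predictor

f = learner([("red", 1), ("blue", 0), ("green", 1)])
print(f("red"))   # a seen instance
print(f("pink"))  # an unseen instance: majority class
```

Here Y = {0, 1}, so this instance is a binary classification problem; with Y = R the same typing would describe regression.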

6 The standard setting for binary classification problems
Input x:
–There is a finite set of attributes: a_1, …, a_n
–x is a vector: x = (x_1, …, x_n)
Output y:
–Binary-class: Y has only two members
–Multi-class: Y has k members

7 Converting to the standard setting
Multi-class → binary (e.g., in boosting):
–Train one classifier: (x, y) → ((x,1), 0), …, ((x,y), 1), …, ((x,k), 0)
–Train k classifiers, one for each class: for class j, (x, y) → (x, (y=j))
Y is not a pre-defined finite set
–Ex: POS tagging, parsing
–Convert y to a sequence of decisions.
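The "train k classifiers, one per class" conversion above amounts to relabeling each example as 1 if y equals the target class and 0 otherwise. A minimal sketch, assuming classes are numbered 0..k-1; `one_vs_rest_data` is a hypothetical helper name:

```python
# One-vs-rest conversion: multi-class (x, y) pairs -> k binary training sets.
# For class j, each example gets label 1 iff y == j.
def one_vs_rest_data(pairs, k):
    return {j: [(x, int(y == j)) for x, y in pairs] for j in range(k)}

data = [("w1", 0), ("w2", 2), ("w3", 1)]
binary = one_vs_rest_data(data, 3)
print(binary[2])  # [('w1', 0), ('w2', 1), ('w3', 0)]
```

Each of the k binary sets can then be handed to any binary learner; at decoding time, the class whose classifier is most confident wins.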

8 Converting to the standard setting (cont)
x is not a vector (x_1, …, x_n)
–Define a set of input attributes: a_1, …, a_n
–Convert x to a vector
–Ex: use boosting for POS tagging

9 Classification algorithms
DT, DL, TBL, Boosting, MaxEnt
Comparison:
–Representation
–Training: iterative approach (feature selection, weight setting, data processing)
–Decoding

10 Representation
DT: a tree
–Each internal node is a test on an input attribute
DL: an ordered list of rules (f_i, v_i)
–Each f_i is a test on one or more attributes.
TBL: an ordered list of transformations (f_i, v_i → v_i')
–Each f_i is a test on one or more attributes.
Boosting: a list of weighted weak classifiers
–Often a classifier tests one or more attributes.
MaxEnt: a list of weighted features
–A feature is a binary function: f(x, y) = 0 or 1

11 Training: “feature” selection
DT: the test with max entropy reduction
–Test: attr == val
DL: the decision rule with max entropy reduction
–Rule: if (attr1=val1 && … && attr_i=val_i) then y=c
TBL: the transformation with max error reduction
–Transformation: if (attr1=val1 && … && attr_i=val_i) then y=c1 → y=c2
–Transformation: if (attr1=val1 && … && attr_i=val_i && y=c1) then y=c2
Boosting: the classifier chosen by the weak learner
–Classifier: if (attr1=val1 && … && attr_i=val_i) then y=c1 else y=not(c1)
MaxEnt: the feature with max increase in the log-likelihood of the training data
–Feature: if (attr1=val1 && … && attr_i=val_i && y=c) then 1 else 0
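The entropy-reduction criterion that DT and DL use above can be sketched in a few lines. This is a generic illustration, not the course's code; `entropy_reduction` and the dict-based attribute representation are assumptions made here.

```python
# Entropy reduction (information gain) of the test attr == val,
# as used by DT/DL feature selection. Each x is a dict of attributes.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def entropy_reduction(pairs, attr, val):
    labels = [y for _, y in pairs]
    yes = [y for x, y in pairs if x[attr] == val]   # test fires
    no = [y for x, y in pairs if x[attr] != val]    # test does not fire
    n = len(labels)
    split = (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)
    return entropy(labels) - split

pairs = [({"a": 1}, "+"), ({"a": 1}, "+"), ({"a": 0}, "-"), ({"a": 0}, "-")]
print(entropy_reduction(pairs, "a", 1))  # 1.0: the test separates the classes
```

DT picks the test with the largest such reduction at each node; DL does the same for whole conjunctive rules.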

12 Training: weight setting
Boosting: weights that minimize an upper bound on the training error.
MaxEnt: weights that maximize the entropy subject to the feature-expectation constraints (equivalently, weights that maximize the log-likelihood of the training data).

13 Training: data processing
DT: split the data
DL: split the data (optional)
TBL: apply transformations to reset cur_y
–Original data: (x, y)
–Used data: ((x, cur_y), y)
Boosting: re-weight the examples (x, y)
MaxEnt: none

14 Decoding for static problems: a single decision
DT: follow the unique path from the root to a leaf node in the decision tree
DL: find the 1st rule that fires
TBL: apply the sequence of rules that fire
Boosting: sum up the weighted decisions of the classifiers
MaxEnt: find the y that maximizes p(y | x)

15 Decoding for dynamic problems: a decision sequence
TBL: can handle dynamic problems directly.
Beam search:
–Decode from left to right.
–A feature should not refer to future decisions.
–Keep the top N hypotheses at each position.
→ Easy to implement for MaxEnt
→ Need to add weights (e.g., probabilities, costs, confidence scores) to DT, DL, TBL, and boosting
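The beam-search recipe above (decode left to right, keep the top N at each position) can be sketched as follows. `score` is a hypothetical stand-in for whatever weight the classifier assigns to one decision (e.g., a log-probability from MaxEnt); the alternation-preferring toy scorer exists only to make the example runnable.

```python
# Left-to-right beam search with top-N pruning.
def beam_search(n_positions, tags, score, beam_size):
    beams = [((), 0.0)]  # (tag history so far, cumulative score)
    for i in range(n_positions):
        # Extend every hypothesis by every possible tag...
        expanded = [(hist + (t,), s + score(hist, i, t))
                    for hist, s in beams for t in tags]
        # ...then keep only the top N at this position.
        expanded.sort(key=lambda hs: hs[1], reverse=True)
        beams = expanded[:beam_size]
    return beams[0][0]

# Toy scorer: reward a tag that differs from the previous one.
score = lambda hist, i, t: 1.0 if (not hist or hist[-1] != t) else 0.0
print(beam_search(4, ["A", "B"], score, beam_size=2))  # ('A', 'B', 'A', 'B')
```

Note how the scorer only looks at `hist`, the decisions already made: this is exactly the "no features on future decisions" restriction from the slide.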

16 Comparison of learners
Probabilistic: DT via SDT; DL via SDL; TBL via TBL-DT; Boosting via confidence scores; MaxEnt: yes
Parametric: DT: N; DL: N; TBL: N; Boosting: N; MaxEnt: Y
Representation: DT: tree; DL: ordered list of rules; TBL: ordered list of transformations; Boosting: list of weighted classifiers; MaxEnt: list of weighted features
Each iteration selects: DT: an attribute; DL: a rule; TBL: a transformation; Boosting: a classifier & weight; MaxEnt: a feature & weight
Data processing: DT: split data; DL: split data (optional); TBL: change cur_y; Boosting: reweight (x, y); MaxEnt: none
Decoding: DT: path; DL: 1st rule; TBL: sequence of rules; Boosting and MaxEnt: calc f(x)

17 Evaluation of learners
Accuracy: F-measure, error rate, …
Cost:
–The types and amount of resources: tools and training data
–The cost of errors
Complexity:
–Computational complexity of the algorithm (training time, decoding time)
–Complexity of the model: # of parameters
Stability
Bias

18 Stability of a learner L
Given two samples t_1 and t_2 from the same distribution D over X×Y, let f_1 = L(t_1) and f_2 = L(t_2). If L is stable, f_1 and f_2 should agree most of the time.

19 Bias
Utgoff (1986):
–Strong/weak bias: one that focuses the learner on a relatively small (resp. large) number of hypotheses.
–Correct/incorrect bias: one that allows (resp. does not allow) the learner to select the target concept.

20 Bias (cont)
Rendell (1986): based on the learner’s behavior
–Exclusive bias: the learner does not consider any of the candidates in a class.
–Preferential bias: the learner prefers one class of concepts over another class.
Others: based on the learner’s design
–Representational bias: certain concepts cannot be considered because they cannot be expressed.
–Procedural bias: Ex: pruning in C4.5 is a procedural bias that results in a preference for smaller DTs.

21 Resampling

22 Bagging
[Diagram: draw B bootstrap samples from the training data by sampling with replacement, apply the ML algorithm to each to get predictors f_1, …, f_B, and combine them into a single predictor f.]
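The sampling step in the bagging diagram can be sketched in a few lines: each bag is a sample drawn with replacement, the same size as the original training set. `bootstrap_samples` is a hypothetical helper name; training one predictor per bag and combining the B predictors into f are left abstract.

```python
# Bootstrap resampling for bagging: B bags, each drawn with replacement
# and the same size as the original data.
import random

def bootstrap_samples(data, n_bags, seed=0):
    rng = random.Random(seed)  # fixed seed, so the sketch is reproducible
    return [[rng.choice(data) for _ in data] for _ in range(n_bags)]

data = [("x1", 0), ("x2", 1), ("x3", 1)]
bags = bootstrap_samples(data, n_bags=10)
# Train one predictor f_b per bag, then combine f_1, ..., f_B
# (e.g., by voting) to obtain the final predictor f.
```

Because sampling is with replacement, a bag typically repeats some examples and omits others, which is what makes the B predictors differ.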

23 System combination

24 [Diagram: the outputs of systems f_1, …, f_B are fed into a combiner that produces the final output f.]
–This can be seen as a special kind of ML problem.
–So we can use any learner.

25 Methods
ML problem:
–Input: the attribute vector (f_1(x), …, f_n(x))
–The goal: f(x)
Strategies:
–Switching: for x, f(x) is equal to some f_i(x)
–Hybridization: create a new value.
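The switching strategy above, with majority voting as the selection rule, might look like the sketch below. `switch_by_vote` is a hypothetical name, and the tie-breaking behavior is an implementation detail, not something the slide specifies.

```python
# Switching combiner via majority vote: the final output f(x) is the
# tag proposed by the most member systems for this x.
from collections import Counter

def switch_by_vote(outputs):
    # On ties, Counter.most_common returns the tag first seen in the
    # input order (i.e., the earlier system wins) -- an arbitrary choice.
    return Counter(outputs).most_common(1)[0][0]

print(switch_by_vote(["NN", "NN", "VB"]))  # 'NN'
```

A trained combiner would instead learn f from (sys1, sys2, sys3, gold) tuples, as described in the project tasks below.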

26 Project Part 3

27 Tasks
Understand the algorithm.
Run the tagger on four sets of training data (1K, 5K, 10K, 40K) and report, for each size: accuracy, training time, and # of features.

28 The MaxEnt core What is the format of the training data? What is the format of the test data? How does GIS work? How does L-BFGS work? What is Gaussian prior smoothing? And how is it calculated? How are events and features represented internally? During the decoding stage, how does the code find the top-N classes for a new instance?
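As background for the GIS question above, here is a hedged sketch of one GIS update for a conditional MaxEnt model: each weight moves by (1/C)·log(empirical expectation / model expectation), where C bounds the number of active features per event. This is a generic illustration of the algorithm, not the tagger's actual code; `gis_step` and the toy binary features in the demo are hypothetical.

```python
# One iteration of Generalized Iterative Scaling (GIS) for p(y | x).
import math

def gis_step(weights, events, features, classes, C):
    """events: list of (x, y); features: list of f(x, y) -> 0/1."""
    n = len(events)
    # Empirical expectation of each feature over the training events.
    emp = [sum(f(x, y) for x, y in events) / n for f in features]
    # Model expectation of each feature under the current weights.
    model = [0.0] * len(features)
    for x, _ in events:
        scores = {c: math.exp(sum(w * f(x, c)
                                  for w, f in zip(weights, features)))
                  for c in classes}
        z = sum(scores.values())
        for j, f in enumerate(features):
            model[j] += sum(scores[c] / z * f(x, c) for c in classes) / n
    # GIS update: move each weight toward matching the empirical count.
    return [w + (1.0 / C) * math.log(emp[j] / model[j])
            for j, w in enumerate(weights)]

weights = gis_step([0.0, 0.0], [("a", 0), ("b", 1)],
                   [lambda x, y: int(x == "a" and y == 0),
                    lambda x, y: int(x == "b" and y == 1)],
                   classes=[0, 1], C=1)
```

Iterating this step converges to the maximum-likelihood weights; L-BFGS reaches the same optimum by gradient-based search instead.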

29 The MaxEnt tagger: features Where are feature templates defined? List the feature templates used by the tagger. If you want to add a new feature template, what do you need to do? Which piece of code do you need to modify? Given the feature templates, how are (instantiated) features selected and filtered?

30 The MaxEnt tagger: trainer
What’s the format of the training sentences?
How does the trainer convert a training sentence into a list of events?
How does the trainer treat rare words? What additional features do rare words produce?
How many files are created by the trainer in each experiment? How are they created, and what is each used for?

31 The MaxEnt tagger: decoder
What’s the format of the test data?
How are unknown words handled by the decoder?
Which function performs the beam search? (Just provide the function name and file name.)

32 Project Part 4

33 Task 1: System combination
Try three methods. The methods can come from existing work (e.g., (Henderson and Brill, 1999)) or be totally new.
At least one of them is trained:
–Create training data:
 Split S into (S1, S2)
 Train each of the three POS taggers on S1
 Tag the instances in S2 → (sys1, sys2, sys3, gold)
–Train the combiner with this training data

34 Task 1 (cont)
Report a table with rows Trigram, TBL, MaxEnt, Comb1, Comb2, Comb3 and columns 1K, 5K, 10K, 40K; each cell holds a/b, where
a: tagging result with the whole training data
b: tagging result with part of the training data

35 Task 2: bagging B=10: use 10 bags Training data: 1K, 5K, and 10K. 40K is optional. One combination method.

36 Task 2 (cont)
Report a table with rows Trigram, TBL, MaxEnt, Comb1 and columns 1K, 5K, 10K, 40K (optional); each cell holds a/b/c, where
a: no bagging
b: one bag
c: 10 bags

37 Task 3: boosting Software: boostexter Main tasks: –Handling unknown words –Format conversion: pay attention to special characters: e.g., “,” in “2,300” –Feature templates –Choosing the number of rounds: N –Train and decode

38 Task 3 (cont)
Report a table with rows for five iteration numbers (Iteration num 1, …, Iteration num 5) and columns 1K, 5K, 10K, 40K (optional); each cell holds a/b, where
a: true tags for neighboring words
b: most frequent tags for neighboring words

39 Task 4: semi-supervised learning
Select one or more taggers.
Choose the SSL method: self-training, co-training, or something else.
Decide on strategies for adding data.
Show the results with and without unlabeled data.
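One round of the self-training option above can be sketched as follows: tag the unlabeled data with the current model and add the most confident predictions to the labeled set. `self_training_round`, the `train` callback, and the confidence threshold are all hypothetical stand-ins for whatever tagger and adding strategy you choose.

```python
# One self-training round: label unlabeled data, keep confident predictions.
def self_training_round(labeled, unlabeled, train, threshold):
    model = train(labeled)          # model(x) -> (tag, confidence)
    added, remaining = [], []
    for x in unlabeled:
        tag, conf = model(x)
        (added if conf >= threshold else remaining).append((x, tag))
    # Confident predictions join the labeled data; the rest stay unlabeled.
    return labeled + added, [x for x, _ in remaining]

# Toy demo: a fake trainer whose model is only confident about "a".
train = lambda lab: (lambda x: (x.upper(), 0.9 if x == "a" else 0.1))
labeled, unlabeled = self_training_round([("z", "Z")], ["a", "b"], train, 0.5)
print(labeled)    # [('z', 'Z'), ('a', 'A')]
print(unlabeled)  # ['b']
```

Co-training follows the same loop but uses two models with different views, each labeling data for the other.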

40 Task 4 (cont)
Report a table with rows "no unlabeled data", "15K unlabeled data", "25K unlabeled data", "35K unlabeled data" and columns 1K labeled data, 5K labeled data; each cell holds a/b, where
a: tagging accuracy
b: the number of sentences added to the labeled data

41 Project Parts 5-6

42 Part 5: Presentation
Presentation: 10 minutes + Q&A
Email me the slides by 6am on 3/9 and bring a copy to class.
Focus:
–Tagging results: tables, figures
–How TBL and MaxEnt work
–Project Part 4

43 Part 6: Final report
Email me the file by 6am on 3/14. It should include:
–The major results and observations from Project Parts 1-5
–Thoughts about ML algorithms
–Thoughts about the course, project, etc.

44 Due dates
6am on 3/7/06: Parts 3-4
–Via ESubmit, submit the code for Part 4 and the reports for Parts 3 and 4.
–Bring a hardcopy of the report to class.
6am on 3/9/06: Part 5
–Email me your presentation slides.
–Bring a hardcopy of your slides to class (4 slides per page).
6am on 3/14/06: Part 6
–Email me the final report.

