1
SVM Anchored Learning and Model Tampering (a case study in Hebrew NP chunking)
Yoav Goldberg and Michael Elhadad Ben Gurion University of the Negev, Israel
2
Lexicalization
Once upon a time: the number of features one could use in a model was quite low.
Enter discriminative models: we can now incorporate millions of features!
Adding lexical information as features helps accuracy in many tasks.
We add lexical information everywhere. Everyone is happy.
9/22/2018 ACL 2007
3
Lexicalization
What specific problems do these lexical features solve?
Are all these features equally important?
4
Continuation of Our Previous Work
Hebrew NP Chunking (ACL 2006)
Define “Hebrew Simple NPs” (the traditional base NP definition does not work for Hebrew) and derive them from a Hebrew Treebank.
SVM-based approach.
Morphological (construct and number) features help identify Simple NP chunks.
Lexical features are crucial for Hebrew.
5
NLP-Machine Learning Workflow
Find a Task
Get a Corpus
Annotate it
Represent it as an ML problem
Decide on features
Decide on a learning algorithm
Encode the features
Learn a model
Evaluate
8
Workflow
Find a Task: (Hebrew) NP Chunking
Get a Corpus, Annotate it: use the Treebank and derive the annotation from it
Represent it as an ML problem: B-I-O tagging
Decide on features
Decide on a learning algorithm: SVM, polynomial kernel
Encode the features: binary feature vector
Learn a model: SVMLight, YAMCHA, …
Inspect the resulting model:
How important are specific features?
What is hard to learn?
Locate corpus errors
Is our task definition consistent?
9
Overview – SVM Learning
Binary supervised classifier.
Input: labeled examples encoded as vectors ($y_i \in \{-1,+1\}$, $x_i \in \mathbb{R}^n$), a kernel function $K$, and the parameter $C$.
Magic (for this talk).
Output: weighted support vectors (a subset of the input vectors).
Decision function: $f(x) = \mathrm{sgn}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$
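A kernel SVM classifies a new vector by summing weighted kernel evaluations against the support vectors. The following is a minimal sketch of that computation, not the SVMLight or YAMCHA implementation. The model layout (a list of `(alpha, y, vector)` triples), the function names, and the kernel parameters `degree` and `coef0` are assumptions made for illustration.

```python
def poly_kernel(u, v, degree=2, coef0=1.0):
    """Polynomial kernel (u . v + coef0) ** degree on dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return (dot + coef0) ** degree


def decision_value(x, support_vectors, bias, kernel=poly_kernel):
    """Compute sum_i alpha_i * y_i * K(x_i, x) + b.

    support_vectors is a list of (alpha_i, y_i, x_i) triples
    (a hypothetical in-memory layout, not a real model file format).
    """
    return sum(alpha * y * kernel(sv, x) for alpha, y, sv in support_vectors) + bias


def classify(x, support_vectors, bias, kernel=poly_kernel):
    """Return +1 or -1 according to the sign of the decision value."""
    return 1 if decision_value(x, support_vectors, bias, kernel) >= 0 else -1


# Toy model with two support vectors (made-up weights).
toy_svs = [(0.5, +1, [1, 0, 1]), (0.5, -1, [0, 1, 0])]
print(classify([1, 0, 1], toy_svs, bias=0.0))
```

Only the support vectors and their weights enter the sum, which is why inspecting them can say something about what the model has learned.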
10
Feature Vectors
( 1 iff w0 is ‘dog’, 1 iff w0 is ‘cat’, …, 1 iff p0 is VB, 1 iff p+2 is NN, … )
Very high-dimensional yet very sparse vectors.
Features whose values are 0 in all the SVs do not (directly) affect the classification.
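Because almost every entry of such a vector is 0, it is convenient to store only the names of the features that are 1. The sketch below shows one way to do this for a token in a window of two positions on each side. The feature-name strings (`w0=dog`, `p+2=NN`) and the function name are illustrative assumptions, not the encoding used by any particular toolkit.

```python
def extract_features(words, tags, i, window=2):
    """Return the set of active (value 1) binary features for token i.

    Each feature is named '<kind><offset>=<value>', where kind is 'w'
    (word) or 'p' (POS tag) and offset is relative to token i.
    Every feature not in the returned set is implicitly 0.
    """
    active = set()
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(words):
            pos = "0" if off == 0 else f"{off:+d}"
            active.add(f"w{pos}={words[j]}")
            active.add(f"p{pos}={tags[j]}")
    return active


words = ["the", "dog", "barks", "at", "cats"]
tags = ["DT", "NN", "VB", "IN", "NN"]
print(sorted(extract_features(words, tags, 2)))
```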
11
Multiclass
An SVM is a binary classifier. To do 3-class classification, 3 classifiers are learned: B/I, B/O, I/O.
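One common way to combine three pairwise classifiers into a single B/I/O decision is majority voting. The slides do not say how the outputs are combined, so the voting rule below is an assumption, and the classifier functions are toy stand-ins for trained SVMs.

```python
from collections import Counter


def pairwise_predict(x, classifiers):
    """Combine pairwise binary classifiers by majority vote.

    classifiers maps (pos_label, neg_label) to a function returning a
    signed score: a score >= 0 is a vote for pos_label, otherwise for
    neg_label. Ties are broken arbitrarily in this sketch.
    """
    votes = Counter()
    for (pos, neg), score in classifiers.items():
        votes[pos if score(x) >= 0 else neg] += 1
    return votes.most_common(1)[0][0]


# Toy scorers: each one just reads one coordinate of x.
toy_classifiers = {
    ("B", "I"): lambda x: x[0],
    ("B", "O"): lambda x: x[1],
    ("I", "O"): lambda x: x[2],
}
print(pairwise_predict((1, 1, -1), toy_classifiers))
```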
12
SVM Model
An SVM model is the collection of support vectors and their corresponding weights.
Many weighted vectors, but what do they mean?
Enter “Model Tampering”.
13
Model Tampering
Artificially force selected features in the Support Vectors to 0.
Evaluate the tampered model on a test set to learn the importance of these features for the classification.
Note: this is not the same as learning without these features.
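The tampering step can be sketched as a pure function over an already trained model. The sketch assumes a hypothetical in-memory model of `(alpha, y, active_features)` triples with sparse binary features, and assumes lexical feature names start with `w`. The weights are left untouched and nothing is retrained, which is what separates tampering from learning without the features.

```python
def tamper(support_vectors, drop):
    """Return a copy of the model with every feature f for which
    drop(f) is true forced to 0 in all support vectors.

    Features are stored as sets of active names, so forcing a feature
    to 0 means removing its name from the set. The alpha weights and
    labels are kept exactly as learned.
    """
    return [
        (alpha, y, {f for f in feats if not drop(f)})
        for alpha, y, feats in support_vectors
    ]


def is_lexical(feature):
    """Assumed naming convention: word features look like 'w0=dog'."""
    return feature.startswith("w")


model = [
    (0.7, +1, {"w0=dog", "p0=NN"}),
    (0.3, -1, {"w-1=the", "p0=VB"}),
]
tampered = tamper(model, is_lexical)
print(tampered)
```

The tampered model is then run on the test set with the usual decision function, and the change in score is read as the importance of the removed features.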
14
Our Datasets
HEBGold: NP chunks corpus derived from HebTreebank, ~5000 sentences, perfect POS tags, chunk definition as in (Goldberg et al., 2006).
HEBErr: same as HEBGold, with ~8% POS tag errors.
ENG: Ramshaw and Marcus’s NP chunks data, ~11,000 sentences, ~4% POS tag errors.
18
Tamperings (1/3)
TopN: for each lexical feature, count the number of SVs in which it is active. Keep only the top N lexical features according to this rank.
Near top performance with only 1000 lexical features.
With perfect POS tags, even 500 is more than enough (some words may hurt us).
For Hebrew, 10 lexical features are REALLY important.
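The TopN ranking can be sketched in a few lines: count, for each lexical feature, the support vectors in which it is active, then zero out every lexical feature outside the top N. The model layout (`(alpha, y, active_features)` triples) and the rule that lexical feature names start with `w` are assumptions for illustration. Ties in the ranking are broken arbitrarily here.

```python
from collections import Counter


def top_n_lexical(support_vectors, n, is_lexical):
    """Return the n lexical features active in the most support vectors."""
    counts = Counter(
        f for _, _, feats in support_vectors for f in feats if is_lexical(f)
    )
    return {f for f, _ in counts.most_common(n)}


def top_n_tamper(support_vectors, n, is_lexical=lambda f: f.startswith("w")):
    """Keep all non-lexical features and only the top-n lexical ones."""
    keep = top_n_lexical(support_vectors, n, is_lexical)
    return [
        (alpha, y, {f for f in feats if not is_lexical(f) or f in keep})
        for alpha, y, feats in support_vectors
    ]


model = [
    (0.4, +1, {"w0=of", "w-1=the"}),
    (0.2, -1, {"w0=of", "w-1=the"}),
    (0.3, +1, {"w0=of", "w-1=big", "p0=IN"}),
    (0.1, -1, {"w0=dog", "p0=NN"}),
]
print(top_n_tamper(model, 2))
```

With `n=0` every lexical feature is removed, which corresponds to a Top0 tampering.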
23
Tamperings (2/3)
NoPOS: remove all lexical features corresponding to a given part of speech.
Prepositions and punctuation are most important.
Closed-class words are more important than open-class words.
Adverbs are hard for the POS tagger.
A Small Challenge
Most lexical features of nouns are unimportant in English as well. However, a few are important: ‘yesterday’, ‘Wednesday’, … Can you tell why?
26
Top10 Lexical Features in Hebrew
Start of Sentence Marker
Comma
Quote
of / של
and / ו
the / ה
in / ב
של/of behaves differently from the other prepositions in Hebrew with respect to chunk boundaries.
Quotes and commas are important. We know they are somewhat inconsistent in the Treebank.
Goldberg et al. normalize punctuation before evaluation.
This work normalizes punctuation before learning, which improves the F-score by ~0.8 (10-fold CV).
31
Tamperings (3/3)
Loc=i: keep only lex features with index i.
Loc≠i: keep only lex features with index ≠ i.
Lexical features at position 0 (current word) are most important.
Top0 tampering (removing all lexical features) yields somewhat better results (90.1).
This is better than with all the features (93.79) (yet learning without it to begin with is worse).
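The position-based tampering and its complement can be sketched with one function that filters lexical features by their window offset. The feature-name pattern (`w-1=the`, `w0=dog`, `w+1=barks`), the model layout, and the `keep_equal` flag are assumptions for illustration.

```python
import re

# Assumed naming convention for lexical features: 'w<offset>=<word>'.
_LEX = re.compile(r"^w([+-]?\d+)=")


def lex_position(feature):
    """Return the window offset of a lexical feature, or None if the
    feature is not lexical."""
    m = _LEX.match(feature)
    return int(m.group(1)) if m else None


def loc_tamper(support_vectors, i, keep_equal=True):
    """Keep lexical features at offset i (keep_equal=True) or at every
    offset except i (keep_equal=False). Non-lexical features are
    always kept, and weights are not changed."""
    def keep(f):
        pos = lex_position(f)
        if pos is None:
            return True
        return (pos == i) == keep_equal

    return [
        (alpha, y, {f for f in feats if keep(f)})
        for alpha, y, feats in support_vectors
    ]


model = [(1.0, +1, {"w0=dog", "w-1=the", "w+1=barks", "p0=NN"})]
print(loc_tamper(model, 0))
```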
32
Intuitively…
The SVM learner uses rare, irrelevant features (e.g., word at location –2 is X and POS at location 2 is Y) to memorize hard cases.
This rote learning helps generalization performance by focusing the learner on the “easy” cases…
…but overfits on the hard events.
33
Anchored Learning
Add a unique feature (ai, the anchor) to each training sample (as many features as there are samples).
The data becomes linearly separable.
Anchors “remove the burden” from “real” features.
Anchors with high weights correspond to the “Hard to Learn” cases.
“Hard to Learn” cases are either corpus errors or genuinely hard (both are interesting).
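Adding anchors is a preprocessing step on the training data, sketched below for samples stored as `(label, active_features)` pairs. The `anchor=<index>` naming, the `hard_cases` helper, and its `threshold` parameter are hypothetical. How anchor weights are read back from the trained model depends on the learner and is taken here as a given dictionary.

```python
def add_anchors(samples):
    """Give the i-th training sample its own unique feature 'anchor=i'.

    samples is a list of (label, set_of_active_features). Even two
    samples with identical features and conflicting labels become
    distinguishable, so the data is linearly separable.
    """
    return [
        (label, feats | {f"anchor={i}"})
        for i, (label, feats) in enumerate(samples)
    ]


def hard_cases(anchor_weights, threshold):
    """Return sample indices whose anchor weight magnitude is at least
    threshold, heaviest first. anchor_weights maps sample index to the
    learned weight of that sample's anchor (assumed to be available)."""
    picked = [i for i, w in anchor_weights.items() if abs(w) >= threshold]
    return sorted(picked, key=lambda i: -abs(anchor_weights[i]))


samples = [(+1, {"w0=dog"}), (-1, {"w0=dog"})]
anchored = add_anchors(samples)
print(anchored)
```

The samples returned by `hard_cases` are the candidates to inspect by hand as possible corpus errors or genuinely difficult instances.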
34
Anchors vs. Previous Work on Corpus Error Detection
Most relevant work: Boosting (Abney et al. 1998): “hard to learn examples in an AdaBoost model are candidate corpus errors”.
AdaBoost models are easy to interpret.
SVM and AdaBoost models are different.
35
Anchors vs. Previous Work on Corpus Error Detection
Nakagawa and Matsumoto (2002):
Support Vectors with high αi values are “exceptional cases”.
Look for similar examples with a different label to extract contrastive pairs.
Our method:
Finds the errors directly.
Has better recall.
Converges in reasonable time even when there are many errors.
Allows learning without “important” features.
36
Anchored Learning Results (1/2)
Identified corpus errors with high precision.
Some of the corpus errors found were actually errors in the process of deriving chunks from the Hebrew TreeBank.
Identified problematic aspects of the NP chunk definition used in previous work, triggering a revision of the definition.
Identified some hard cases (multi-word expressions, adverbial usage, conjunctions).
37
Anchored Learning (2/2)
Current-word lexical features are the most important.
What are the contextual lexical features used for?
40
Anchored Learning (2)
What are the contextual lexical features used for?
Learn 3 models:
Mfull: all lexical features
Mnear: without features w-2/w+2
Mno-cont: with only the w0 lexical feature
Compare the hard cases in the models to find the role of features w-1/w+1 and w-2/w+2.
(Anchors guarantee convergence of the learning process in reasonable time.)
Context window: w-2 w-1 w0 w+1 w+2
Hard cases: Mfull < Mnear < Mno-cont (diagram labels: hard cases solved by Mnear, hard cases solved by Mfull)
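Comparing the hard cases of the three models amounts to set differences over sample identifiers. The sketch below assumes each model's hard cases are already available as a collection of sample ids. Reading "solved by Mnear" as "hard for Mno-cont but not for Mnear" is an interpretation of the slide, and the function and key names are made up for illustration.

```python
def solved_by_context(hard_full, hard_near, hard_no_cont):
    """Split hard cases by the context features that resolve them.

    hard_full:    hard cases of the model with all lexical features
    hard_near:    hard cases of the model without w-2/w+2
    hard_no_cont: hard cases of the model with only w0
    """
    hard_full, hard_near, hard_no_cont = map(
        set, (hard_full, hard_near, hard_no_cont)
    )
    return {
        # Hard with only w0, no longer hard once w-1/w+1 are added.
        "solved_by_w1": hard_no_cont - hard_near,
        # Hard without w-2/w+2, no longer hard once they are added.
        "solved_by_w2": hard_near - hard_full,
        # Still hard even with all lexical features.
        "still_hard": hard_full,
    }


result = solved_by_context(hard_full={1}, hard_near={1, 2}, hard_no_cont={1, 2, 3, 4})
print(result)
```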
41
Qualitative Results
Contextual lexical features contribute mostly to disambiguating:
Conjunctions
Appositions
Attachment of adverbs and adjectives
Some multi-word expressions
46
Quantitative Results
w-1/w+1 solves about 5 times more hard cases than w-2/w+2.
Contextual lexical features are very important for learning back-to-back NPs.
Investigating Hebrew back-to-back NPs:
Back-to-back SimpleNPs in Hebrew are not as common as in English, but are much harder to decide.
Most of the learning of back-to-back NPs achieved by local context is only superficial and will rarely generalize.
Better features are needed for this case.
Challenge Solution: This talk is held in [Prague] [Monday afternoon]
47
To sum it up
Investigating learned models can yield interesting insights about the task at hand:
Importance of features, role of features
Corpus improvement
What’s hard to learn
Better task definition
SVM models are no longer a “black box”.
Some interesting insights about Hebrew.
48
Thank You