Toward Better Understanding

Presentation transcript:

Toward Better Understanding of Hebrew NP Chunks. SVM Anchored Learning and Model Tampering (a case study in Hebrew NP chunking). Yoav Goldberg and Michael Elhadad, Ben Gurion University of the Negev, Israel.

Lexicalization. Once upon a time, the number of features one could use in a model was quite low. Enter discriminative models: we can now incorporate millions of features! Adding lexical information as features helps accuracy in many tasks. We add lexical information everywhere. Everyone is happy. 12/7/2018 ISCOL 2007

Lexicalization. Once upon a time, there was a limit on the number of features one could use in a model. Enter discriminative models: we can now incorporate millions of features! Adding lexical information as features has been shown to help accuracy. We add lexical information everywhere. Everyone is happy. But what specific problems do these lexical features solve? Are all these features equally important?

Continuation of Our Previous Work: Hebrew NP Chunking (ACL 2006). Define “Hebrew Simple NPs” (the traditional base NP definition does not work for Hebrew) and derive them from a Hebrew Treebank. SVM-based approach. Morphological features (construct and number) help identify Simple NP chunks. Lexical features are crucial for Hebrew.

NLP-Machine Learning Workflow: Find a Task → Get a Corpus → Annotate it → Represent it as an ML problem → Decide on features → Decide on a learning algorithm → Encode the features → Learn a model → Evaluate.

NLP-Machine Learning Workflow: Find a Task → Get a Corpus → Annotate it → Represent it as an ML problem → Decide on features → Decide on a learning algorithm → Encode the features → Learn a model → Evaluate. In this work: (Hebrew) NP chunking; use the Treebank and derive the annotation from it; B-I-O tagging; SVM with a polynomial kernel; binary feature vectors; SVMLight, YAMCHA, …

Workflow: Find a Task → Get a Corpus → Annotate it → Represent as an ML problem → Decide on features → Decide on a learning algorithm → Encode features → Learn a model → Inspect the resulting model. In this work: (Hebrew) NP chunking; use the Treebank and derive the annotation from it; B-I-O tagging; SVM with a polynomial kernel; binary feature vectors; SVMLight, YAMCHA, … What inspecting the model tells us: How important are specific features? What is hard to learn? Locate corpus errors. Is our task definition consistent?
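As a rough illustration of the B-I-O representation named in the workflow, the sketch below turns chunk spans into one tag per token. The function name chunks_to_bio, the half-open span convention, and the English example sentence are assumptions made for illustration. The variant shown starts every chunk with B, which may differ from the exact scheme used in the talk.

```python
def chunks_to_bio(num_tokens, chunks):
    """Convert chunk spans into B-I-O tags.

    chunks is a list of (start, end) token spans, end exclusive (assumed convention).
    """
    tags = ["O"] * num_tokens          # tokens outside any chunk
    for start, end in chunks:
        tags[start] = "B"              # first token of a chunk
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens inside the chunk
    return tags


tokens = ["the", "dog", "chased", "a", "big", "cat", "."]
np_chunks = [(0, 2), (3, 6)]           # "the dog" and "a big cat"
bio_tags = chunks_to_bio(len(tokens), np_chunks)
```

Each token then becomes one classification instance whose label is B, I, or O.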

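Overview – SVM Learning. A binary supervised classifier. Input: labeled examples encoded as vectors ($y_i \in \{-1,+1\}$, $x_i \in \mathbb{R}^n$), a kernel function $K$, and the parameter $C$. Learning: magic (for this talk). Output: weighted support vectors (a subset of the input vectors). Decision function (standard form): $f(x) = \mathrm{sign}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$, where the sum runs over the support vectors $x_i$ with weights $\alpha_i$.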

Feature Vectors: ( 1 iff w0 is ‘dog’, 1 iff w0 is ‘cat’, …, 1 iff p0 is VB, 1 iff p+2 is NN, … ). Very high-dimensional yet very sparse vectors. Features whose values are 0 in all the SVs do not (directly) affect the classification.
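One way to picture such sparse binary vectors is to store only the names of the active features. The sketch below does this for a window of two tokens on each side. The feature naming scheme ("w+0=dog", "p+2=NN"), the window size, and the example sentence are illustrative assumptions, not the encoding used by the authors' tools.

```python
def extract_features(words, pos_tags, i, window=2):
    """Return the set of active binary features for token i.

    A feature that is absent from the set has value 0 in the full vector.
    """
    active = set()
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(words):                       # skip positions outside the sentence
            active.add(f"w{offset:+d}={words[j]}")    # lexical feature
            active.add(f"p{offset:+d}={pos_tags[j]}") # POS feature
    return active


words = ["the", "dog", "chased", "a", "cat"]
pos_tags = ["DT", "NN", "VB", "DT", "NN"]
active = extract_features(words, pos_tags, 2)         # features for "chased"
```

For the token "chased" the set holds ten active features out of a vocabulary that could run to many thousands, which is the sparsity the slide refers to.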

Multiclass. An SVM is a binary classifier, so in order to do 3-class classification, 3 classifiers are learned: B/I, B/O, I/O.
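The slide does not say how the three pairwise classifiers are combined. A common choice is majority voting, sketched below under that assumption. The function name pairwise_vote and the toy scores are hypothetical.

```python
from collections import Counter


def pairwise_vote(pairwise_scores):
    """Combine pairwise binary decisions by majority vote (assumed scheme).

    pairwise_scores maps (positive_label, negative_label) to a real-valued
    score; a score above 0 is a vote for positive_label, otherwise for
    negative_label. Ties between labels are not handled specially.
    """
    votes = Counter()
    for (positive, negative), score in pairwise_scores.items():
        votes[positive if score > 0 else negative] += 1
    return votes.most_common(1)[0][0]


# Toy scores: B/I says I, B/O says B, I/O says I.
scores = {("B", "I"): -0.7, ("B", "O"): 0.4, ("I", "O"): 1.2}
predicted = pairwise_vote(scores)
```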

SVM Model. An SVM model is the collection of support vectors and their corresponding weights. Many weighted vectors, but what do they mean? Enter “Model Tampering”.

Model Tampering. Artificially force selected features in the support vectors to 0. Evaluate the tampered model on a test set to learn the importance of these features for the classification. Note: this is not the same as learning without these features.
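A minimal sketch of the idea, assuming a model stored as a bias plus (weight, support vector) pairs, where each weight stands for the signed coefficient of that support vector and each vector is a set of active binary features. The dictionary layout, the degree-2 polynomial kernel, and the toy numbers are assumptions for illustration and do not reflect the SVMLight or YAMCHA model formats.

```python
def poly_kernel(x, z, degree=2):
    # For binary vectors stored as sets, the dot product is the overlap size.
    return (len(x & z) + 1) ** degree


def decide(model, x):
    """Real-valued decision score; its sign is the predicted class."""
    total = model["bias"]
    for weight, sv in model["support_vectors"]:
        total += weight * poly_kernel(sv, x)
    return total


def tamper(model, removed):
    """Force the given features to 0 in every support vector, without retraining."""
    return {
        "bias": model["bias"],
        "support_vectors": [(w, sv - removed) for w, sv in model["support_vectors"]],
    }


model = {
    "bias": -1.0,
    "support_vectors": [
        (0.5, frozenset({"w0=dog", "p0=NN"})),
        (-0.5, frozenset({"w0=run", "p0=VB"})),
    ],
}
x = frozenset({"w0=dog", "p0=NN"})
original_score = decide(model, x)                         # 3.0
tampered_model = tamper(model, {"w0=dog", "w0=run"})      # drop lexical features
tampered_score = decide(tampered_model, x)                # 0.5
```

The weights and bias are left untouched, which is why tampering differs from retraining without the features: the remaining weights were learned in the presence of the removed ones.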

Our Datasets: HEBGold, HEBErr, ENG. HEBGold: NP chunks corpus derived from the HebTreebank, ~5000 sentences, perfect POS tags, chunk definition as in (Goldberg et al., 2006). HEBErr: same as HEBGold, with ~8% POS tag errors. ENG: Ramshaw and Marcus’s NP chunks data, ~11,000 sentences, ~4% POS tag errors.

Tamperings (1/3). TopN: for each lexical feature, count the number of SVs where it is active. Keep only the top N lexical features according to this rank. Near-top performance with only 1000 lexical features. With perfect POS tags, even 500 is more than enough (some words may hurt us). For Hebrew, 10 lexical features are REALLY important.
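The TopN ranking can be sketched as follows, assuming support vectors are stored as sets of active feature names and that lexical features can be recognised by a "w" prefix. Both conventions, and the helper names, are assumptions for illustration.

```python
from collections import Counter


def is_lexical(feature):
    # Assumed naming: word features look like "w0=of", POS features like "p0=IN".
    return feature.startswith("w")


def top_n_lexical(support_vectors, n):
    """Rank lexical features by the number of SVs in which they are active."""
    counts = Counter(f for sv in support_vectors for f in sv if is_lexical(f))
    return {feature for feature, _ in counts.most_common(n)}


def keep_top_n(support_vectors, n):
    """Zero out every lexical feature outside the top n; keep all other features."""
    kept = top_n_lexical(support_vectors, n)
    return [{f for f in sv if not is_lexical(f) or f in kept} for sv in support_vectors]


svs = [
    {"w0=of", "p0=IN"},
    {"w0=of", "p0=IN", "w-1=dog"},
    {"w0=cat", "p0=NN"},
    {"w0=of", "p-1=NN"},
]
tampered_svs = keep_top_n(svs, 1)
```

Features with equal counts are ordered arbitrarily here; a real experiment would need an explicit tie-breaking rule.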

Tamperings (2/3). NoPOS: remove all lexical features corresponding to a given part of speech. Prepositions and punctuation are most important. Closed-class words are more important than open-class words. Adverbs are hard for the POS tagger. ISCOL bonus: the 4 most important Hebrew nouns were %, כלל (“rule”), ש"ח (“NIS”, New Israeli Shekel), and דרך (“way”).

Top10 Lexical Features in Hebrew: start-of-sentence marker, comma, quote, of / של, and / ו, the / ה, in / ב. של/of behaves differently from the other prepositions in Hebrew with respect to chunk boundaries. Quotes and commas are important, and we know they are somewhat inconsistent in the Treebank. Goldberg et al. 2006 → normalize punctuation before evaluation. This work → normalizing punctuation before learning improves the F-score by ~0.8 (10-fold CV).

Tamperings (3/3). Loc=i: keep only lexical features with index i. Loc≠i: keep only lexical features with index ≠ i. Lexical features at position 0 (current word) are most important. Top0 tampering (removing all lexical features) yields somewhat better results (90.1). This is better than with all the features (93.79) (yet learning without it to begin with is worse).

Intuitively… The SVM learner uses rare, irrelevant features (e.g., the word at location –2 is X and the POS at location 2 is Y) to memorize hard cases. This rote learning helps generalization performance by focusing the learner on the “easy” cases… but overfits on the hard events.

Anchored Learning. Add a unique feature (a_i, the anchor) to each training sample (as many anchor features as there are samples). The data becomes linearly separable. Anchors “remove the burden” from the “real” features. Anchors with high weights correspond to the “hard to learn” cases. “Hard to learn” cases are either corpus errors or genuinely hard (both are interesting).
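The anchoring step itself is simple to sketch: every training sample receives one feature that no other sample has. The code below shows this together with a ranking of samples by anchor weight. The feature name "anchor=i", the helper names, and the toy weights are assumptions for illustration. In practice the weights would be read from the trained model, which is not shown here.

```python
def add_anchors(samples):
    """Give every training sample one feature that no other sample shares."""
    return [set(features) | {f"anchor={i}"} for i, features in enumerate(samples)]


def hardest_cases(anchor_weights, k):
    """Return the indices of the k samples whose anchors carry the largest weight."""
    ranked = sorted(anchor_weights, key=lambda i: abs(anchor_weights[i]), reverse=True)
    return ranked[:k]


samples = [
    {"w0=dog", "p0=NN"},
    {"w0=run", "p0=VB"},
    {"w0=dog", "p0=NN"},   # same features as sample 0, possibly a different label
]
anchored = add_anchors(samples)

# Hypothetical anchor weights, made up for this example only.
toy_weights = {0: 0.02, 1: 0.01, 2: 0.95}
hardest = hardest_cases(toy_weights, 1)
```

Samples 0 and 2 are indistinguishable before anchoring and distinct afterwards, which is why anchored data is always separable even when labels conflict.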

Anchors vs. Previous Work on Corpus Error Detection (Come To Prague). Most relevant work: Boosting (Abney et al. 1998): “hard to learn examples in an AdaBoost model are candidate corpus errors”. AdaBoost models are easy to interpret. SVM and AdaBoost models are different.

Anchors vs. Previous Work on Corpus Error Detection (Come To Prague). Nakagawa and Matsumoto (2002): support vectors with high αi values are “exceptional cases”; look for similar examples with a different label to extract contrastive pairs. Our method: finds the errors directly; has better recall; converges in reasonable time even when there are many errors; allows learning without “important” features.

Anchored Learning Results (1/2). Identified corpus errors with high precision. Some of the corpus errors found were actually errors in the process of deriving chunks from the Hebrew TreeBank. Identified problematic aspects of the NP chunk definition used in previous work, triggering a revision of the definition. Identified some hard cases (multi-word expressions, adverbial usage, conjunctions).

ISCOL Bonus: Problems with Definition of NP Chunks. [גוונים חמים] כמו [אדומים], [כתומים] ו [חומים] (“[warm shades] such as [reds], [oranges] and [browns]”). Are these really NP chunks? Where are the nouns?

ISCOL Bonus: Problems with Definition of NP Chunks. את (the accusative marker) was included in the chunks: [את הממשלה, הכנסת, בית המשפט והתקשורת] (“[ACC the government, the Knesset, the court and the media]”).

ISCOL Bonus: Problems with Definition of NP Chunks. Some determiners can be very complex: [ו אולי אף יותר פעמים ] (“[and perhaps even more times]”).

ISCOL Bonus: Problems with Definition of NP Chunks. של was considered unambiguous, but: [נשיא בית הדין] ל [משמעת] של [המשטרה] (“[president of the court] for [discipline] of [the police]”). The ל preposition is also interesting. The construct state (סמיכות) + של + ל/מ poses very hard problems for the definition of simple NPs in Hebrew. There is no time to go over this here, but I will be very happy to talk with you about it afterwards!

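ISCOL Bonus: What’s hard in NP Chunking. The prepositions של and מ. Conjunctions: מערכת ה עבודה ה שכר ו ה איגוד ה מקצועי (“the labor, wage and trade union system”). Some adverbs/adjectives: ה[אבדה] ל[משפחה] גדולה (“the [loss] to the [family] is great”). Multiword expressions (and prepositions): פה אחד (“unanimously”), בכל מקרה (“in any case”), בבת אחת (“all at once”), כך או כך (“either way”), לכל היותר (“at most”)...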

Anchored Learning (2/2). Current-word lexical features are the most important. What are the contextual lexical features used for?

Anchored Learning (2). What are the contextual lexical features used for? Learn 3 models: Mfull, with all lexical features; Mnear, without the w-2/w+2 features; Mno-cont, with only the w0 lexical feature. (Context window: w-2 w-1 w0 w+1 w+2.) Compare the hard cases in the models to find the role of the features w-1/w+1 and w-2/w+2. (Anchors guarantee convergence of the learning process in reasonable time.) Hard cases (H): Mfull < Mnear < Mno-cont. Diagram regions: hard cases solved by Mnear; hard cases solved by Mfull.
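The comparison of hard cases across the three models amounts to set differences, as in the sketch below. The sample IDs are invented for illustration, and the nesting of the three sets is assumed from the slide's diagram.

```python
def solved_by_adding(hard_without, hard_with):
    """Hard cases that stop being hard once the extra features are available."""
    return hard_without - hard_with


# Hypothetical IDs of hard-to-learn training samples under each model.
hard_full = {3, 17}                    # Mfull: all lexical features
hard_near = {3, 17, 42}                # Mnear: no w-2/w+2 features
hard_no_cont = {3, 17, 42, 58, 91}     # Mno-cont: only the w0 feature

solved_by_near = solved_by_adding(hard_no_cont, hard_near)   # role of w-1/w+1
solved_by_full = solved_by_adding(hard_near, hard_full)      # role of w-2/w+2
```

Inspecting the samples in each difference shows which phenomena the nearer and farther context words help with.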

Qualitative Results. Contextual lexical features contribute mostly to disambiguating: conjunctions; appositions; attachment of adverbs and adjectives; some multi-word expressions.

Quantitative Results. w-1/w+1 solves about 5 times more hard cases than w-2/w+2. Contextual lexical features are very important for learning back-to-back NPs. Investigating Hebrew back-to-back NPs: back-to-back SimpleNPs in Hebrew are not as common as in English, but are much harder to decide. Most of the learning of back-to-back NPs achieved by local context is only superficial and will rarely generalize. Better features are needed for this case.

ISCOL Bonus: Back-to-Back NP Examples. נעצרו ב[שכם][20 פעילים] (“[20 activists] were arrested in [Nablus]”). עד עתה מילא [את תפקיד זה][עמוס מר חיים] (“until now [this role] was filled by [Amos Mar-Haim]”). [אישה מבוגרת] ו [שמה] [בלומה] (“[an elderly woman] and [her name] is [Bluma]”).

To sum it up. Investigating learned models can yield interesting insights about the task at hand: importance of features and role of features; corpus improvement; what’s hard to learn; better task definition. SVM models are no longer a “black box”. Some interesting insights about Hebrew.

Thank You