Asma Naseer
Shallow Parsing or Partial Parsing
• First proposed by Steven Abney (1991)
• Breaks text up into small pieces
• Each piece is parsed separately [1]
• Words are not arranged flatly in a sentence but are grouped into smaller parts called phrases
• Example: The girl was playing in the street
• Example (Urdu): اس نے احمد کو کتاب دی ("He/She gave Ahmad a book")
Chunks are non-recursive: a chunk does not contain a phrase of the same category as itself
• Rule: NP → D? AdjP? AdjP? N
• Example: The big red balloon
• [NP [D The] [AdjP [Adj big]] [AdjP [Adj red]] [N balloon]] [1]
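The NP rule above can be tried out as a regular expression over category symbols. This is a minimal sketch, not the parser from [1]: the one-letter codes in `CODES`, the function name `np_chunks`, and the reduction of each AdjP to a single adjective are assumptions made for illustration.

```python
import re

# Hypothetical one-letter codes for the word categories (an assumption).
CODES = {"D": "d", "Adj": "a", "N": "n", "V": "v", "P": "p"}

# NP -> D? AdjP? AdjP? N, with each AdjP reduced to a single adjective.
NP_PATTERN = re.compile(r"d?a?a?n")


def np_chunks(tagged):
    """Return the NP chunks found in a list of (word, category) pairs."""
    # Encode the category sequence as a string so a regex can scan it.
    codes = "".join(CODES[cat] for _, cat in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(codes):
        # One code per word, so match offsets are also word offsets.
        chunks.append([word for word, _ in tagged[match.start():match.end()]])
    return chunks


sentence = [("The", "D"), ("big", "Adj"), ("red", "Adj"),
            ("balloon", "N"), ("burst", "V")]
print(np_chunks(sentence))
```

Because the pattern never refers to NP inside itself, the chunks it finds are non-recursive by construction.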
• Each phrase is dominated by a head h
• Examples: "A man proud of his son." and "A proud man"
• The root of the chunk has h as its s-head (semantic head)
• The head of a noun phrase is usually a noun or pronoun [1]
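IOB (Inside, Outside, Begin)
• Tags: I-NP, O-NP, B-NP, I-VP, O-VP, B-VP
• Example (Urdu): قائد اعظم محمد علی جناح نے قوم سے خطاب کیا ("Quaid-e-Azam Muhammad Ali Jinnah addressed the nation")
• Tagged, in sentence order:
  [قائد اعظم B-NP] [محمد I-NP] [علی I-NP] [جناح I-NP] [نے O-NP] [قوم B-NP] [سے O-NP] [خطاب B-NP] [کیا O-NP]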
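The IOB scheme can be illustrated by converting between bracketed chunks and per-word tags. This is a sketch under stated assumptions: the function names `to_iob` and `from_iob` are hypothetical, outside words get a plain `O` tag (the slide writes `O-NP`), and the chunking of the English example sentence is an assumed analysis, not taken from the source.

```python
def to_iob(chunks):
    """Flatten (label, words) chunks into (word, tag) pairs.

    A label of None marks words that lie outside any chunk.
    """
    tagged = []
    for label, words in chunks:
        for i, word in enumerate(words):
            if label is None:
                tagged.append((word, "O"))
            else:
                # The first word of a chunk gets B-, the rest get I-.
                tagged.append((word, ("B-" if i == 0 else "I-") + label))
    return tagged


def from_iob(tagged):
    """Rebuild the labelled chunks from (word, tag) pairs."""
    chunks = []
    current = None
    for word, tag in tagged:
        if tag.startswith("B-"):
            current = (tag[2:], [word])
            chunks.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(word)
        else:
            # An O tag, or an I- tag that continues nothing, closes the chunk.
            current = None
    return chunks


example = [("NP", ["The", "girl"]), ("VP", ["was", "playing"]),
           (None, ["in"]), ("NP", ["the", "street"])]
tags = to_iob(example)
print(tags)
print(from_iob(tags))
```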
• Rule Based vs. Statistical Chunking [2]
• Use of Support Vector Learning for Chunk Identification [5]
• A Context Sensitive Maximum Likelihood Approach to Chunking [6]
• Chunking with Maximum Entropy Models [7]
• Single-Classifier Memory-Based Phrase Chunking [8]
• Hybrid Text Chunking [9]
• Shallow Parsing as POS Tagging [3]
Two techniques are used
• Regular expression rules
  ○ Shallow parser based on regular expressions
• N-gram statistical tagger (machine based chunking)
  ○ NLTK (Natural Language Toolkit) based on TnT Tagger (Trigrams'n'Tags)
  ○ Basic idea: reuse a POS tagger for chunking
• Regular expression rules
  ○ The regular expressions must be developed manually
• N-gram statistical tagger
  ○ Can be trained on gold-standard chunked data
Focus is on verb and noun phrase chunking
• Noun phrases
  ○ A noun or pronoun is the head
  ○ Also contain determiners (articles, demonstratives, numerals, possessives and quantifiers), adjectives, and complements (adpositional phrases, relative clauses)
• Verb phrases
  ○ A verb is the head
  ○ Often one or two complements
  ○ Any number of adjuncts
Training NLTK on Chunk Data
• Start with an empty rule set
  ○ 1. Define or refine a rule
  ○ 2. Execute the chunker on the training data
  ○ 3. Compare the results with the previous run
• Repeat steps 1, 2 and 3 until performance no longer improves significantly
• Issue: the data contain 211,727 phrases in total; a subset of 1,000 phrases was used
Training TnT on Chunk Data
• Chunking is treated as statistical tagging
• Two steps
  ○ Parameter generation: create model parameters from the training corpus
  ○ Tagging: tag each word with a chunk label
Data Set
• WSJ: Wall Street Journal newspaper, NY
  ○ US
  ○ International business
  ○ Financial news
• Training: sections 15-18
• Testing: section 20
• Both are tagged with POS and IOB
• Special characters are treated as other POS; punctuation is tagged as O
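Results
• Precision: $P = \frac{|\text{reference} \cap \text{test}|}{|\text{test}|}$
• Recall: $R = \frac{|\text{reference} \cap \text{test}|}{|\text{reference}|}$
• F-Measure, with $\alpha = 0.5$: $F_\alpha = \frac{1}{\alpha / P + (1 - \alpha) / R}$
• F-Rate: $F = \frac{2 \cdot P \cdot R}{R + P}$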
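As a worked example of these measures, the sketch below scores a set of proposed chunks against a reference set. The function name `prf` and the representation of a chunk as a `(label, start, end)` tuple are assumptions for illustration; the toy chunk sets are invented and are not data from the experiments.

```python
def prf(reference, test, alpha=0.5):
    """Return (precision, recall, F) for two collections of chunks."""
    reference, test = set(reference), set(test)
    correct = len(reference & test)  # |reference ∩ test|
    p = correct / len(test) if test else 0.0
    r = correct / len(reference) if reference else 0.0
    if p == 0.0 or r == 0.0:
        return p, r, 0.0
    # Weighted harmonic mean; alpha = 0.5 gives 2PR / (P + R).
    f = 1.0 / (alpha / p + (1.0 - alpha) / r)
    return p, r, f


# Toy chunks as (label, start, end) tuples: illustrative values only.
gold = {("NP", 0, 2), ("VP", 2, 4), ("NP", 5, 7)}
proposed = {("NP", 0, 2), ("VP", 2, 3), ("NP", 3, 4), ("NP", 5, 7)}
p, r, f = prf(gold, proposed)
print(round(p, 4), round(r, 4), round(f, 4))
```

With $\alpha = 0.5$ the F-Measure and the F-Rate coincide, which is why a single F column is enough in the results.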
Results
System  Chunk  P        R        F-Measure
NLTK    VP     79.3 %   80.1 %   79.7 %
NLTK    NP     76.5 %   84.4 %   80.3 %
TnT     VP     79.59 %  82.35 %  80.95 %
TnT     NP     78.36 %  76.76 %  77.55 %
SVM (Large Margin Classifiers)
• Introduced by Vapnik (1995)
• Solves a two-class pattern recognition problem
• Good generalization performance
• High accuracy in text categorization without overfitting (Joachims, 1998; Taira and Haruno, 1999)
• Training data: $(x_1, y_1), \ldots, (x_l, y_l)$, with $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$
• $x_i$ is the i-th sample, represented by an n-dimensional vector
• $y_i$ is the class label (positive or negative) of the i-th sample
• In an SVM, positive and negative examples are separated by a hyperplane
• The SVM finds the optimal hyperplane
[Figure: Two possible hyperplanes]
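To make the choice between two separating hyperplanes concrete, the sketch below compares their margins on a toy data set; the optimal hyperplane of an SVM is the one with the largest margin. It only evaluates two given hyperplanes and does not train an SVM. The function names, the data points, and the two hyperplanes are all invented for illustration.

```python
def decision(w, b, x):
    """Signed score w . x + b of sample x for the hyperplane (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Predict +1 or -1 from the side of the hyperplane x falls on."""
    return 1 if decision(w, b, x) >= 0 else -1


def margin(w, b, data):
    """Smallest distance from any sample to the hyperplane.

    The result is positive only if the hyperplane separates the data.
    """
    norm = sum(wi * wi for wi in w) ** 0.5
    return min(y * decision(w, b, x) for x, y in data) / norm


# Toy 2-D samples (x_i, y_i): illustrative values only.
data = [((3, 3), 1), ((4, 3), 1), ((1, 1), -1), ((1, 2), -1)]

margin_a = margin((1, 1), -5, data)  # hyperplane x1 + x2 = 5
margin_b = margin((1, 0), -2, data)  # hyperplane x1 = 2
print(margin_a, margin_b)
```

Both hyperplanes classify every sample correctly, but the second leaves more room on either side and would be preferred.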
• Chunks in the CoNLL-2000 shared task are IOB tagged
• Each chunk type is tagged as either I or B, e.g. I-NP or B-NP
• 22 types of chunks are found in CoNLL-2000
• The chunking problem is the classification of these 22 types
• An SVM is a binary classifier, so it is extended to k classes
• One class vs. all others
• Pairwise classification
  ○ k * (k-1) / 2 classifiers: 22 * 21 / 2 = 231 classifiers
  ○ Majority vote decides the final class
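The pairwise scheme can be sketched as follows: one binary decision per pair of classes, then a majority vote. The `decide` callback is a hypothetical stand-in for a trained binary SVM, and the toy scores are invented; only the counting (k * (k-1) / 2 pairs) and the voting come from the slide.

```python
from collections import Counter
from itertools import combinations


def pairwise_vote(classes, decide, x):
    """Run one binary decision per class pair and return the majority class."""
    votes = Counter(decide(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# 22 chunk classes give 22 * 21 / 2 pairwise classifiers.
n_classifiers = len(list(combinations(range(22), 2)))
print(n_classifiers)


# Hypothetical stand-in for a trained binary SVM: it prefers whichever
# class has the higher toy score in x (an assumption for illustration).
def decide(a, b, x):
    return a if x[a] >= x[b] else b


scores = {"B-NP": 0.2, "I-NP": 0.9, "B-VP": 0.1, "O": 0.4}
winner = pairwise_vote(list(scores), decide, scores)
print(winner)
```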
• The feature vector consists of
  ○ Words: w
  ○ POS tags: t
  ○ Chunk tags: c
• To identify chunk $c_i$ at the i-th word, the features are
  ○ $w_j, t_j$ for $j = i-2, i-1, i, i+1, i+2$
  ○ $c_j$ for $j = i-2, i-1$
• All features are expanded to binary values, either 0 or 1
• The total dimension of the feature vector becomes 92,837
Results
• Training the 231 classifiers took about 1 day on a PC running Linux (Celeron 500 MHz, 512 MB)
• Chunk types: ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP
• Precision = 93.45 %
• Recall = 93.51 %
• $F_{\beta=1}$ = 93.48 %
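The window of features around the i-th word, and its expansion to binary values, might look like the sketch below. The feature string format, the `<PAD>` symbol for positions outside the sentence, and the function names are assumptions; the window sizes follow the slide (words and POS tags at i-2..i+2, chunk tags at i-2 and i-1).

```python
def window_features(words, tags, chunks, i):
    """Collect the features used to decide the chunk tag of word i."""
    feats = []
    for off in range(-2, 3):  # positions i-2 .. i+2
        j = i + off
        inside = 0 <= j < len(words)
        feats.append(f"w[{off:+d}]={words[j] if inside else '<PAD>'}")
        feats.append(f"t[{off:+d}]={tags[j] if inside else '<PAD>'}")
    for off in (-2, -1):  # chunk tags of the two preceding words
        j = i + off
        feats.append(f"c[{off:+d}]={chunks[j] if j >= 0 else '<PAD>'}")
    return feats


def to_binary(feats, index):
    """Expand features into a 0/1 vector with one dimension per feature."""
    vec = [0] * len(index)
    for feat in feats:
        if feat in index:
            vec[index[feat]] = 1
    return vec


words = ["The", "girl", "was", "playing"]
tags = ["DT", "NN", "VBD", "VBG"]
chunks = ["B-NP", "I-NP", "B-VP", "I-VP"]

# One dimension per distinct feature seen anywhere in the (toy) corpus.
seen = {f for i in range(len(words)) for f in window_features(words, tags, chunks, i)}
index = {feat: k for k, feat in enumerate(sorted(seen))}

feats = window_features(words, tags, chunks, 1)
vec = to_binary(feats, index)
print(feats)
print(len(vec), sum(vec))
```

Every position switches on exactly 12 dimensions (5 words, 5 POS tags, 2 chunk tags), so the vectors are very sparse even when the total dimension is large.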
Training
• Based on POS tags
• Construct symmetric n-contexts from the training corpus
  ○ 1-context: the most common chunk label for each tag
  ○ 3-context: the tag together with the tags before and after it, $[t_{-1}, t_0, t_{+1}]$
  ○ 5-context: $[t_{-2}, t_{-1}, t_0, t_{+1}, t_{+2}]$
  ○ 7-context: $[t_{-3}, t_{-2}, t_{-1}, t_0, t_{+1}, t_{+2}, t_{+3}]$
Training
• For each context, find the most frequent label
  ○ CC → [O CC]
  ○ PRP CC RP → [B-NP CC]
• To save storage space, an n-context is added only if it differs from its nearest lower-order context
Testing
• Construct the maximum context for each tag
• Look it up in the database of most likely patterns
• If the largest context is not found, the context is diminished step by step
• The only rule for chunk labeling is to look up $[t_{-3}, t_{-2}, t_{-1}, t_0, t_{+1}, t_{+2}, t_{+3}]$, ..., $[t_0]$ until a context is found
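The training and testing steps above can be sketched as a table lookup with back-off. This is an illustrative reading of the slides, not the implementation from [6]: the padding symbols `<s>` and `</s>`, the function names, the default label for unseen tags, and the toy training sentences are assumptions, and the storage-saving step (dropping an n-context that agrees with its lower-order context) is left out for brevity.

```python
from collections import Counter, defaultdict


def context(tags, i, half):
    """Symmetric context of width 2 * half + 1 centred on position i."""
    padded = ["<s>"] * half + list(tags) + ["</s>"] * half
    return tuple(padded[i:i + 2 * half + 1])


def train(sentences, max_half=3):
    """Map every 1-, 3-, 5- and 7-context to its most frequent chunk label."""
    counts = defaultdict(Counter)
    for tags, labels in sentences:
        for i, label in enumerate(labels):
            for half in range(max_half + 1):
                counts[context(tags, i, half)][label] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def label(tags, table, max_half=3, default="O"):
    """Label each tag using the largest context found in the table."""
    out = []
    for i in range(len(tags)):
        for half in range(max_half, -1, -1):  # 7-, 5-, 3-, then 1-context
            ctx = context(tags, i, half)
            if ctx in table:
                out.append(table[ctx])
                break
        else:
            out.append(default)  # tag never seen in training
    return out


# Toy training data: (POS tags, chunk labels) per sentence.
corpus = [(["DT", "NN", "VBD"], ["B-NP", "I-NP", "B-VP"]),
          (["PRP", "CC", "PRP"], ["B-NP", "O", "B-NP"])]
table = train(corpus)
print(label(["DT", "NN"], table))
```

In the usage line, `DT` is resolved by its 3-context and `NN` only by its 1-context, which shows the step-by-step diminishing of the context.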
Results
• The best results are achieved for the 5-context
• Chunk types: ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP
  ○ Precision = 86.24%
  ○ Recall = 88.25%
  ○ $F_{\beta=1}$ = 87.23%
• Maximum Entropy models are exponential models
• Collect as much information as possible
  ○ Frequencies of events relevant to the process
• The MaxEnt model has the form
  $P(w \mid h) = \frac{1}{Z(h)} \exp\left(\sum_i \lambda_i f_i(h, w)\right)$
  ○ $f_i(h, w)$ is a binary-valued feature function describing an event
  ○ $\lambda_i$ describes how important $f_i$ is
  ○ $Z(h)$ is a normalization factor
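The model form can be evaluated directly once features and weights are given. The sketch below does only that; it does not estimate the weights. The two feature functions, their weights, and the dictionary representation of the history h are invented for illustration.

```python
import math


def maxent_prob(h, outcomes, features, weights):
    """Return P(w | h) for every outcome w under an exponential model."""
    scores = {
        w: math.exp(sum(lam * f(h, w) for f, lam in zip(features, weights)))
        for w in outcomes
    }
    z = sum(scores.values())  # normalization factor Z(h)
    return {w: s / z for w, s in scores.items()}


# Two hypothetical binary features f_i(h, w) with made-up weights.
def f_det_starts_np(h, w):
    return 1 if h["pos"] == "DT" and w == "B-NP" else 0


def f_continue_np(h, w):
    return 1 if h["prev_chunk"] == "B-NP" and w == "I-NP" else 0


features = [f_det_starts_np, f_continue_np]
weights = [1.5, 2.0]

history = {"pos": "DT", "prev_chunk": "O"}
probs = maxent_prob(history, ["B-NP", "I-NP", "O"], features, weights)
print(probs)
```

Only the first feature fires for this history, so `B-NP` receives the largest share of the probability mass while the other two outcomes split the rest equally.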
Attributes Used
• Information in the WSJ corpus
  ○ Current word
  ○ POS tag of the current word
  ○ Surrounding words
  ○ POS tags of the surrounding words
• Context
  ○ Left context: 3 words
  ○ Right context: 2 words
• Additional information
  ○ Chunk tags of the previous 2 words
Results
• Tagging accuracy = (# of correctly tagged words) / (total # of words) = 95.5%
• Recall = (# of correct proposed base NPs) / (number of correct base NPs) = 91.86%
• Precision = (# of correct proposed base NPs) / (number of proposed base NPs) = 92.08%
• $F_{\beta=1} = \frac{(\beta^2 + 1) \cdot \text{Recall} \cdot \text{Precision}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$ = 91.97%
• Context-based lexicon and HMM-based chunker
• Statistics were used for chunking by Church (1988)
  ○ Corpus frequencies were used
  ○ Non-recursive noun phrases were identified
• Skut & Brants (1998) modified Church's approach and used a Viterbi tagger
• Error-driven HMM-based text chunker
  ○ Memory is decreased by keeping only positive lexical entries
• HMM-based text chunker with context-dependent lexicon
  ○ Given $G_1^n = g_1, g_2, \ldots, g_n$
  ○ Find the optimal sequence $T_1^n = t_1, t_2, \ldots, t_n$
  ○ Maximize $\log P(T_1^n \mid G_1^n)$
  $\log P(T_1^n \mid G_1^n) = \log P(T_1^n) + \log \frac{P(T_1^n, G_1^n)}{P(T_1^n) \, P(G_1^n)}$
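The decomposition of the objective follows in two steps from the definition of conditional probability. The derivation below is a worked restatement using the symbols on the slide; the reading of the second term as mutual information is standard terminology rather than something stated on the slide.

```latex
\begin{align*}
\log P(T_1^n \mid G_1^n)
  &= \log \frac{P(T_1^n, G_1^n)}{P(G_1^n)} \\
  &= \log \left( P(T_1^n) \cdot \frac{P(T_1^n, G_1^n)}{P(T_1^n)\, P(G_1^n)} \right) \\
  &= \log P(T_1^n) + \log \frac{P(T_1^n, G_1^n)}{P(T_1^n)\, P(G_1^n)}
\end{align*}
```

The first term depends only on the tag sequence, and the second is the pointwise mutual information between the tag sequence and the input sequence.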
• CoNLL-2000 data: used for testing and training
• Ratnaparkhi's maximum entropy based POS tagger
  ○ No change in internal operation
  ○ Information for training is increased
Shallow Parsing vs. POS Tagging
• Shallow parsing requires more of the surrounding POS/lexical syntactic environment
• Training configurations
  ○ Words: $w_1\ w_2\ w_3$
  ○ POS tags: $t_1\ t_2\ t_3$
  ○ Chunk types: $c_1\ c_2\ c_3$
  ○ Suffixes or prefixes
• The amount of information is gradually increased
  ○ Word $w_1$
  ○ Tag $t_1$
  ○ Word, tag, chunk label ($w_1 t_1 c_1$): the current chunk label is accessed through another model with configurations of words and tags ($w_1 t_1$)
  ○ To deal with sparseness: $t_1, t_2$; $c_1$; $c_2$ (last two letters); $w_1$ (first two letters)
[Four figure-only slides, titled: Word $w_1$; Tag $t_1$; ($w_1 t_1 c_1$); Sparseness Handling]
Overall Results
Configuration        Precision  Recall   F β=1
Word w_1             88.06%     88.71%   88.38%
Tag t_1              88.15%     88.07%   88.11%
(w_1 t_1 c_1)        89.79%     90.70%   90.24%
Sparseness Handling  91.65%     92.23%   91.94%
Error Analysis
• Three groups of errors
• Difficult syntactic constructs
  ○ Punctuation
  ○ Treatment of ditransitive VPs and transitive VPs
  ○ Adjective vs. adverbial phrases
• Mistakes made in training or testing by the annotator
  ○ Noise
  ○ POS errors
  ○ Odd annotation decisions
• Errors peculiar to the approach
  ○ The exponential distribution assigns non-zero probability to all events
  ○ The tagger may assign illegal chunk labels (I-NP while w is not in an NP)
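The last error type, illegal chunk labels, can be detected mechanically after tagging. The sketch below shows one possible post-processing step; it is not described in [3]. It assumes the convention in which every chunk starts with a B- tag, and it repairs a stray I- tag by promoting it to B-. Both the function name `repair_iob` and the repair strategy are assumptions.

```python
def repair_iob(tags):
    """Turn every I-X that does not continue an X chunk into B-X."""
    fixed = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            label = tag[2:]
            # I-X is legal only directly after B-X or I-X.
            if prev not in ("B-" + label, "I-" + label):
                tag = "B-" + label
        fixed.append(tag)
        prev = tag
    return fixed


predicted = ["O", "I-NP", "I-NP", "B-VP", "I-NP"]
print(repair_iob(predicted))
```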
Comments
• PPs are easy to identify
• ADJP and ADVP are hard to identify correctly (more syntactic information is required)
• Performance on NPs can be further improved
• Performance using $w_1$ or $t_1$ alone is almost the same; using both features together enhances performance
[1] Philip Brooks, "A Simple Chunk Parser", May 8, 2003.
[2] Igor Boehm, "Rule Based vs. Statistical Chunking of CoNLL Data Set".
[3] Miles Osborne, "Shallow Parsing as POS Tagging".
[4] Hans van Halteren, "Chunking with WPDV Models".
[5] Taku Kudoh and Yuji Matsumoto, "Use of Support Vector Learning for Chunk Identification", in Proceedings of CoNLL-2000 and LLL-2000, pages 142-144, Portugal, 2000.
[6] Christer Johansson, "A Context Sensitive Maximum Likelihood Approach to Chunking".
[7] Rob Koeling, "Chunking with Maximum Entropy Models".
[8] Jorn Veenstra and Antal van den Bosch, "Single-Classifier Memory-Based Phrase Chunking".
[9] GuoDong Zhou, Jian Su and TongGuan Tey, "Hybrid Text Chunking".
