Semantic Role Labelling Using Chunk Sequences
  Ulrike Baldewein, Katrin Erk, Sebastian Padó (Saarland University, Saarbrücken)
  Detlef Prescher (Amsterdam University, Amsterdam)
1. Representation for Classification
  Usual choice: classify constituents
    Intuition: one argument, one constituent
    Not available in this task
  Classify words? Too fine-grained
  Classify chunks? Data analysis: not always the right level
    34% of arguments: more than one chunk
    13% of arguments: do not respect chunk boundaries
Chunk sequences as classification instances
  Sequences of chunks and chunk parts
  Adaptive level of structure
  "Potential constituents"
  Example: [NP Britain's] [NP manufacturing industry] [VP is transforming] [NP itself] [VP to boost] [NP exports]
    ARG0 = NP_NP
    V = VP[VBG]
    ARG1 = NP
    ARG2 = VP_NP
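A minimal sketch of how candidate chunk sequences could be enumerated from a chunked sentence. It assumes a candidate is any contiguous run of whole chunks up to a maximum length; chunk parts such as VP[VBG] are left out. The function name `chunk_sequences` and the `max_len` parameter are illustrative and not taken from the talk.

```python
from typing import Iterator, List, Tuple

Chunk = Tuple[str, str]  # (chunk tag, chunk text)

def chunk_sequences(chunks: List[Chunk], max_len: int = 3) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, sequence_type) for every contiguous run of chunks.

    start/end are chunk indices (end exclusive); the sequence type joins
    the chunk tags with underscores, e.g. "NP_NP".
    """
    for start in range(len(chunks)):
        last_end = min(start + max_len, len(chunks))
        for end in range(start + 1, last_end + 1):
            seq_type = "_".join(tag for tag, _ in chunks[start:end])
            yield start, end, seq_type

# The example sentence from the slide, as whole chunks only.
sentence = [
    ("NP", "Britain's"), ("NP", "manufacturing industry"),
    ("VP", "is transforming"), ("NP", "itself"),
    ("VP", "to boost"), ("NP", "exports"),
]
candidates = list(chunk_sequences(sentence, max_len=2))
```

With `max_len=2` the candidates include the run NP_NP over the first two chunks (ARG0 on the slide) and VP_NP over the last two (ARG2 on the slide).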
Frequency-based chunk sequence filtering
  Filter 1: Use only sequence types which realise arguments in the training set
    1089 types, Zipf distributed
    NP (23,000), S (5,000), NP_PP_NP_PP_NP_VP_PP_NP_NP (1)
  Filter 2: Use only frequent sequence types (f(s) > 10)
  Examine material between sequence and target: divider sequences
    Also Zipf distributed: empty divider (14,000), NP (10,000)
    Similar to "Path"
  Filter 3: Use only sequences with a frequent divider (f(d) > 10)
  Filter 4: Use only sequences co-occurring frequently with some divider (f(s,d) > 5)
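The four filters can be sketched as frequency counts over training instances. This is an illustration under stated assumptions, not the authors' implementation: all thresholds are read as lower bounds (since each filter keeps "frequent" items), frequencies are counted over argument-realising instances only, and the result is the set of allowed (sequence type, divider type) pairs. The function name, parameter names, and toy counts are invented.

```python
from collections import Counter

def filter_sequences(instances, min_seq=10, min_div=10, min_pair=5):
    """instances: (sequence_type, divider_type, is_argument) triples from training.

    Returns the set of (sequence_type, divider_type) pairs passing all filters.
    """
    # Filter 1: keep only sequences that realise an argument in training.
    arg_pairs = [(s, d) for s, d, is_arg in instances if is_arg]
    seq_freq = Counter(s for s, _ in arg_pairs)
    div_freq = Counter(d for _, d in arg_pairs)
    pair_freq = Counter(arg_pairs)
    return {
        (s, d)
        for (s, d), n in pair_freq.items()
        if seq_freq[s] > min_seq      # Filter 2: frequent sequence type
        and div_freq[d] > min_div     # Filter 3: frequent divider
        and n > min_pair              # Filter 4: frequent co-occurrence
    }

# Toy training data (invented counts); "" stands for the empty divider.
training = (
    [("NP", "", True)] * 20
    + [("NP_PP", "NP", True)] * 3
    + [("S", "NP", True)] * 12
    + [("VP", "", False)] * 50
)
allowed = filter_sequences(training)
```

In this toy data NP_PP is dropped for being rare and VP for never realising an argument.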
Results of filtering
  Leaves 87 sequence types (was 1089)
  43,777 tokens in devel set (about 1 sequence per word)
  8,698 are proper arguments (about 20%)
  Bad news: representation loses 16% of proper arguments
2. Features
  "Shallow features": simple properties
  "Higher-level features": syntactic properties (mostly heuristic)
  "Divider features": shallow and higher-level properties of dividers
EM-based clustering
  Measure fit between objects $y_1$ (pred:arg) and $y_2$ (sequence)
    Example: How well does NP fit give:A1?
  $y_1$ and $y_2$ are generated by a cluster $c$ and are independent given that cluster:
    $p(y_1, y_2) = \sum_c p(c)\, p(y_1 \mid c)\, p(y_2 \mid c)$
  EM derives clusters from training data
  Intention: generalisation within clusters
  Features: e.g. "most likely argument slot for this sequence for this predicate"
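A small sketch of EM for the two-dimensional clustering model above, in plain Python. The number of clusters, the random initialisation, the iteration count, and the toy pairs are all assumptions made for illustration; the talk does not give these details.

```python
import random
from collections import defaultdict

def em_cluster(pairs, n_clusters=2, n_iter=50, seed=0):
    """Fit p(y1, y2) = sum_c p(c) p(y1|c) p(y2|c) to observed (y1, y2) pairs."""
    rng = random.Random(seed)
    y1s = sorted({a for a, _ in pairs})
    y2s = sorted({b for _, b in pairs})

    def random_dist(keys):
        weights = [rng.random() + 0.1 for _ in keys]
        total = sum(weights)
        return {k: w / total for k, w in zip(keys, weights)}

    p_c = [1.0 / n_clusters] * n_clusters
    p_y1 = [random_dist(y1s) for _ in range(n_clusters)]
    p_y2 = [random_dist(y2s) for _ in range(n_clusters)]

    for _ in range(n_iter):
        c_count = [0.0] * n_clusters
        y1_count = [defaultdict(float) for _ in range(n_clusters)]
        y2_count = [defaultdict(float) for _ in range(n_clusters)]
        for a, b in pairs:
            # E-step: posterior p(c | y1, y2) for this observation.
            joint = [p_c[c] * p_y1[c][a] * p_y2[c][b] for c in range(n_clusters)]
            z = sum(joint)
            for c in range(n_clusters):
                r = joint[c] / z
                c_count[c] += r
                y1_count[c][a] += r
                y2_count[c][b] += r
        # M-step: re-estimate parameters from the expected counts.
        total = sum(c_count)
        p_c = [n / total for n in c_count]
        for c in range(n_clusters):
            denom = max(c_count[c], 1e-300)  # guard against an empty cluster
            p_y1[c] = {k: y1_count[c][k] / denom for k in y1s}
            p_y2[c] = {k: y2_count[c][k] / denom for k in y2s}
    return p_c, p_y1, p_y2

def fit(model, y1, y2):
    """Model probability p(y1, y2), usable as a 'fit' feature."""
    p_c, p_y1, p_y2 = model
    return sum(pc * d1[y1] * d2[y2] for pc, d1, d2 in zip(p_c, p_y1, p_y2))

# Toy observations (invented): argument slots paired with sequence types.
pairs = (
    [("give:A1", "NP")] * 8 + [("give:A2", "PP")] * 6
    + [("send:A1", "NP")] * 5 + [("send:A2", "PP")] * 4
    + [("give:A1", "NP_PP")]
)
model = em_cluster(pairs)
```

Calling `fit(model, "give:A1", "NP")` then answers the slide's example question for this toy model; comparing it with `fit(model, "give:A1", "PP")` shows which sequence type the slot prefers.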
3. Procedure
  Filter sequences from training set
  Compute features for sequence tokens and their dividers (training + development + test set)
  Estimate Maximum Entropy model on training set
  Classify sequences from devel / test set
  Recover semantic parses
Two-step classification procedure
  Classifier 1: Argument recognition
    Binary decision about argumenthood
    All argument classes conflated into ARG
  Classifier 2: Argument labelling
    Consider only sequences assigned ARG by step 1
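The control flow of the two steps can be sketched as follows. The two classifiers are passed in as plain functions standing in for trained Maximum Entropy models; the 0.5 decision threshold, the feature names, and the toy probabilities are assumptions for illustration. For brevity the sketch returns only the single best label per sequence rather than a full distribution.

```python
def two_step_classify(candidates, p_arg, p_label, threshold=0.5):
    """Label each candidate sequence in two steps.

    p_arg(features)   -> probability that the sequence is an argument (ARG)
    p_label(features) -> dict mapping argument label to probability
    """
    labels = []
    for feats in candidates:
        # Step 1: binary argument recognition.
        if p_arg(feats) < threshold:
            labels.append("NOLABEL")
            continue
        # Step 2: argument labelling, only for sequences recognised as ARG.
        dist = p_label(feats)
        labels.append(max(dist, key=dist.get))
    return labels

# Hypothetical stand-ins for the two trained classifiers.
def toy_p_arg(feats):
    return 0.9 if feats["seq"] == "NP" else 0.2

def toy_p_label(feats):
    if feats["before_target"]:
        return {"A0": 0.7, "A1": 0.3}
    return {"A0": 0.2, "A1": 0.8}

toy_candidates = [
    {"seq": "NP", "before_target": True},
    {"seq": "VP", "before_target": False},
    {"seq": "NP", "before_target": False},
]
predicted = two_step_classify(toy_candidates, toy_p_arg, toy_p_label)
```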
4. Classification result: Sequence chart
  Sentence: The man with the beard sleeps
  Chart cells (label probabilities per candidate sequence; chart layout not preserved):
    A0 (70%), A1 (20%)
    NOLABEL (70%), AM-MOD (25%)
    A0 (60%), NOLABEL (40%)
    A0 (65%), A1 (25%)
  Need to find optimal "semantic parse" of argument labels
Semantic parse recovery
  Find most probable semantic parse $p = (l_1, l_2, \ldots)$
  Step 1: Beam search
    Simple probability model with independence assumption:
    $P_{bs}(l_1, l_2, \ldots) = \prod_i P_c(l_i)$
  Step 2: Reestimation
    Global considerations, e.g. [A0 A0]
    Use counts from training set:
    $P(l_1, l_2, \ldots) = P_{bs}(l_1, l_2, \ldots) \cdot P_{tr}(l_1, l_2, \ldots)$
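A sketch of the two recovery steps under several assumptions that the talk does not spell out: labelled sequences in a parse must not overlap, $P_{tr}$ is the unsmoothed relative frequency of the parse's argument-label sequence in training, and every candidate offers NOLABEL as an option. The spans, probabilities, and training counts are invented toy values.

```python
def beam_search(candidates, beam_size=5):
    """candidates: (start, end, {label: prob}) tuples, sorted by start.

    Returns hypotheses (P_bs, labels, end_of_last_labelled_span), best first.
    P_bs is the product of the per-sequence classifier probabilities.
    """
    beam = [(1.0, (), 0)]
    for start, end, dist in candidates:
        expanded = []
        for prob, labels, last_end in beam:
            for label, p in dist.items():
                if label == "NOLABEL":
                    expanded.append((prob * p, labels + (label,), last_end))
                elif start >= last_end:  # labelled spans must not overlap
                    expanded.append((prob * p, labels + (label,), end))
        beam = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam_size]
    return beam

def reestimate(beam, train_counts):
    """Rescore each hypothesis with P = P_bs * P_tr and return the best one."""
    total = sum(train_counts.values())
    rescored = []
    for prob, labels, _ in beam:
        args = tuple(l for l in labels if l != "NOLABEL")
        p_tr = train_counts.get(args, 0) / total
        rescored.append((prob * p_tr, labels))
    return max(rescored, key=lambda h: h[0])

# Toy candidates over "The man with the beard sleeps" (token spans, invented).
chart = [
    (0, 2, {"A0": 0.60, "NOLABEL": 0.40}),               # The man
    (0, 5, {"A0": 0.70, "A1": 0.20, "NOLABEL": 0.10}),   # The man with the beard
    (3, 5, {"A0": 0.65, "A1": 0.25, "NOLABEL": 0.10}),   # the beard
]
# Invented training counts of argument-label sequences; [A0 A0] never occurs.
counts = {("A0",): 50, ("A0", "A1"): 30, ("A1",): 15, (): 5}

beam = beam_search(chart)
best = reestimate(beam, counts)
```

In this toy case the beam's top hypothesis labels two separate sequences as A0, and reestimation rejects it because [A0 A0] has zero count in the training data, preferring a single A0 over the whole subject.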
5. Results (Development Set)
  Table (values not preserved): columns Precision, Recall, F-score; rows Upper Bound, Step 1 (ARG only), Final
  Upper Bound: given by lost chunk sequences
  But filtering is necessary
    With sequence frequency filtering only (filters 1 and 2):
    Good news: 9% of arguments are lost (16% with all four filters)
    Bad news: 127,000 sequences (44,000 with all four filters)
    Argument recognition much more difficult
    F-score with same features only 0.38
Results
  Two steps have different profiles
    Arg identification: shallow and divider features important
    Arg labelling: shallow and higher-level features important
  Clustering features unsuccessful: increase precision at cost of recall
    Feature "most probable label for sequence"
    Successful in SENSEVAL-3 model
  Largest problem is recall
  Table (values not preserved): columns Precision, Recall, F-score; rows Upper Bound, Step 1 (ARG only), Final
What I talked about... and more
  Chunk sequences for SRL
    Adaptive representation with "higher-level" features
    Recall problem (filtering loses proper arguments)
    EM-based features promising, but currently not helpful
  Since submission
    Maxent vs. memory-based learner: virtually same result
  Left to do
    Detailed error analysis
    More intelligent filtering
    Better features