Extracting LTAGs from Treebanks Fei Xia 04/26/07
Q1: How does grammar extraction work?
Two types of elementary tree in LTAG
Initial tree: (S NP (VP (V draft) NP))
Auxiliary tree: (VP (ADVP (ADV still)) VP*)
Arguments and adjuncts are in different types of elementary trees
Adjoining operation: an auxiliary tree with root Y and foot node Y* is adjoined at a node labeled Y
They still draft policies
The treebank tree
Step 1: Distinguish head/argument/adjunct
Step 2: Insert additional nodes
Before: the treebank tree has a single VP node (nodes: S, NP, PRP, ADVP, RB, VP, VBP, NP, NNS)
After: a second VP node is inserted: (S (NP (PRP they)) (VP (ADVP (RB still)) (VP (VBP draft) (NP (NNS policies)))))
Step 3: Build elementary trees (#1, #2, #3, #4)
Extracted grammar
#1: (NP (PRP they))
#2: (VP (ADVP (RB still)) VP*)
#3: (NP (NNS policies))
#4: (S NP (VP (VBP draft) NP))
Q2: What info was missing in the source treebank?
Head/argument/adjunct distinction
–Use function tags and heuristics
Raising verbs (e.g., seem, appear) vs. other verbs
–He seems to be late
–He wants to be late
Need a list of raising verbs in that language
Features, feature equations (e.g., agreement), …
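A rough sketch of how function tags and a verb list might stand in for the missing information. Everything here is an illustrative assumption: the split of Penn Treebank style function tags into argument and adjunct sets, the fallback rule, and the names `classify` and `is_raising` are not taken from the slides. Only the two raising verbs come from the example above.

```python
# Assumed tag sets for illustration; a real extraction tool would tune these
# per treebank.
ARGUMENT_TAGS = {"SBJ", "PRD", "CLR", "DTV", "LGS"}
ADJUNCT_TAGS = {"TMP", "LOC", "MNR", "PRP", "ADV", "DIR", "EXT"}

# Language-specific list that has to be supplied by hand.
RAISING_VERBS = {"seem", "appear"}

def classify(label, is_head=False):
    """Classify a sibling of the head as "arg" or "adj" from its label."""
    if is_head:
        return "head"
    category, *tags = label.split("-")
    if any(t in ARGUMENT_TAGS for t in tags):
        return "arg"
    if any(t in ADJUNCT_TAGS for t in tags):
        return "adj"
    # Fallback heuristic (assumption): untagged NPs are arguments,
    # everything else is an adjunct.
    return "arg" if category == "NP" else "adj"

def is_raising(lemma):
    """Look up a verb lemma in the hand-built raising-verb list."""
    return lemma in RAISING_VERBS

print(classify("NP-SBJ"), classify("PP-TMP"), classify("ADVP"))
print(is_raising("seem"), is_raising("want"))
```

The tag sets and the fallback are exactly the kind of heuristic that recovers some, but not all, of the missing distinctions.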
Q3: What methodological lessons can be drawn?
The algorithm for extracting LTAGs from treebanks is straightforward.
Some missing information can be “recovered” with heuristics; other information cannot.
The extracted LTAGs are not as rich as the ones built by hand.
Nevertheless, the grammars have been shown to be useful for parsing, SuperTagging, etc.
Q4: What are the advantages of a PS or DS treebank?
The original extraction algorithm assumes the input is a PS treebank.
But it can be easily extended to handle a DS treebank as input.
–Extract tree segments from DS
–Run the DS → PS algorithm on the segments to get elementary trees
Q5: Building a treebank for a formalism or building a general treebank?
I prefer the latter because
–A general treebank can be used for different formalisms.
–Different grammars under the same formalism can be extracted.
–Annotating a general treebank is often easier.