April 26, 2007, Workshop on Treebanking, NAACL-HLT 2007, Rochester

Treebanks: Layering the Annotation
Jan Hajič
Institute of Formal and Applied Linguistics, School of Computer Science
Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic

Layering the PDT
(5) stand-off layers:
  Deep structure (t): syntax & semantics; dependency & non-dependency links
  Surface structure (a): dependency, function
  Morphology (m): lemma, tag (detailed)
  Word (token) (w)
  Audio/auto transcript (z), the z-layer
“PML” scheme (XML-based)
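
To make the stand-off idea concrete, here is a minimal Python sketch, not the actual PML format. It assumes each layer stores only its own nodes and points to the layer below by ID. The class `Node`, the function `resolve_tokens`, all IDs, and the tags and labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str   # unique ID, in the spirit of an XML ID
    layer: str     # "w", "m", "a" or "t"
    label: str     # token, lemma/tag, function, or deep lemma
    refs: list = field(default_factory=list)  # IDs of nodes one layer down


# Hypothetical toy annotation of the two tokens "has slept".
NODES = [
    Node("w1", "w", "has"),
    Node("w2", "w", "slept"),
    Node("m1", "m", "have/VBZ", ["w1"]),
    Node("m2", "m", "sleep/VBN", ["w2"]),
    Node("a1", "a", "AuxV", ["m1"]),
    Node("a2", "a", "Pred", ["m2"]),
    # One deep node covers both surface nodes (auxiliary + main verb).
    Node("t1", "t", "sleep", ["a1", "a2"]),
]
INDEX = {n.node_id: n for n in NODES}


def resolve_tokens(node_id, index=INDEX):
    """Follow stand-off references down to the w-layer and return the tokens."""
    node = index[node_id]
    if node.layer == "w":
        return [node.label]
    tokens = []
    for ref in node.refs:
        tokens.extend(resolve_tokens(ref, index))
    return tokens


print(resolve_tokens("t1"))
```

Because each layer only holds references, a lower layer can be corrected or re-tokenized without rewriting the layers above it, as long as the IDs stay stable.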

The Links
Within the t-layer
  Co-reference links
    Pronoun to antecedent (future: full coreference chains)
    Complement to 2nd governor, etc.
  Lexicon links
    Verbs, nouns, adjectives, adverbs to dictionary entry
    Word sense disambiguated, valency/frame-based
t-layer to a-layer
  Which a-node the t-node “comes from”
  No restrictions (crossing, many-to-many, …)
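
The three link types can be sketched as plain ID references. The example below is an illustrative assumption, not PDT data: the sentence "Mary said that she would leave", the field names `a_refs`, `coref` and `frame`, and the frame identifiers are all invented. It shows a co-reference link, lexicon links, and crossing t-to-a links (the a-node for "she" lies between the a-nodes that "leave" refers to).

```python
from collections import defaultdict

# Surface (a-layer) nodes, in word order. Hypothetical IDs.
A_NODES = {"a1": "Mary", "a2": "said", "a3": "that",
           "a4": "she", "a5": "would", "a6": "leave"}

# Deep (t-layer) nodes with their links. All field names are assumptions.
T_NODES = {
    "t1": {"lemma": "Mary", "a_refs": ["a1"]},
    "t2": {"lemma": "say", "a_refs": ["a2"], "frame": "v-say-1"},
    "t3": {"lemma": "she", "a_refs": ["a4"], "coref": "t1"},
    # Conjunction and auxiliary are folded into the verb's deep node.
    "t4": {"lemma": "leave", "a_refs": ["a3", "a5", "a6"], "frame": "v-leave-2"},
}


def antecedent(t_id, t_nodes=T_NODES):
    """Return the lemma of the co-reference antecedent, or None."""
    target = t_nodes[t_id].get("coref")
    return t_nodes[target]["lemma"] if target else None


def a_to_t(t_nodes=T_NODES):
    """Invert the t-to-a links; lists allow several t-nodes per a-node."""
    inverse = defaultdict(list)
    for t_id, node in t_nodes.items():
        for a_id in node["a_refs"]:
            inverse[a_id].append(t_id)
    return dict(inverse)


print(antecedent("t3"))
print(a_to_t()["a5"])
```

Storing the links as lists on both sides is what keeps many-to-many and crossing references unproblematic: nothing in the structure assumes that links preserve word order.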

The Questions I
Did choices made in the underlying annotation influence “upper” layer choices?
  Minimal or no influence, thanks to the stand-off annotation style and to many-to-many references/links being allowed (XML IDs)
Added annotation (over surface syntax):
  Node order (information structure), deep dependencies, 30+ node labels (time, modalities, semantic POS, number, pronoun classes, …), co-reference, valency dictionary (~ “frame files”) links (word sense annotation), “empty” nodes (args), …

The Questions II
Hard to circumvent syntactic choices?
  Not really… (again, thanks to XML stand-off)
  Only 1 label at the surface syntactic level (function)
  Dependency(-only): no problem (no need to refer to phrases; all are represented by subtrees)
…but there will be a problem with the t-layer
  When referring from some “higher” (“logic”) layer: (probably) need to refer to labels (attributes)
  Solution: add IDs to attributes (should be easy, in fact: XML ID…)
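
One way to picture the proposed solution is to give each attribute its own ID so that a higher layer can point at an attribute value rather than at a whole node. The sketch below uses Python's standard `xml.etree.ElementTree`. The element names (`tlayer`, `node`, `attr`, `logic`, `fact`), the `about` attribute, and the ID scheme are hypothetical and are not taken from the PML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical t-layer fragment: every attribute carries its own ID.
T_LAYER = """
<tlayer>
  <node id="t1">
    <attr id="t1.functor" name="functor">PRED</attr>
    <attr id="t1.tense" name="tense">past</attr>
  </node>
</tlayer>
""".strip()

# Hypothetical higher ("logic") layer that refers to one attribute by ID.
LOGIC_LAYER = """
<logic>
  <fact id="l1" about="t1.tense"/>
</logic>
""".strip()


def build_id_index(root):
    """Map every 'id' value in the tree to its element."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}


t_index = build_id_index(ET.fromstring(T_LAYER))
fact = ET.fromstring(LOGIC_LAYER).find("fact")

# Resolve the stand-off reference from the higher layer to the attribute.
target = t_index[fact.get("about")]
print(target.get("name"), target.text)
```

Note that `ElementTree` does not validate ID uniqueness; a real setup would rely on a schema-aware parser or an explicit duplicate check for that.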

The Questions III
Desirable characteristics … for adding layers
  Stand-off annotation
  Proper IDs for in-layer and between-layer reference
    In advance, if possible, but usually can be added later
  Quality control!!
    Easier with layers: cross-layer constraints
    Invisible to annotators -> catch random errors
    Links (between-layer type) can be pre-annotated
PS vs. dependency: impact on additional annotation
  Not observed
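
Cross-layer constraints of this kind can be checked mechanically. The sketch below assumes two simple constraints that are illustrative rather than the actual PDT checks: every between-layer link must point to an existing ID, and every node on a given layer must link to something below it. The functions `find_dangling` and `find_unlinked` and all IDs are hypothetical.

```python
# Hypothetical layer inventories: layer name -> set of node IDs.
LAYERS = {
    "w": {"w1", "w2"},
    "m": {"m1", "m2", "m3"},
    "a": {"a1", "a2"},
}

# Between-layer links as (source ID, target ID), pointing one layer down.
LINKS = [
    ("m1", "w1"),
    ("m2", "w2"),
    ("a1", "m1"),
    ("a2", "m9"),  # deliberate error: m9 does not exist
]


def find_dangling(layers, links):
    """Return links whose target ID is not defined on any layer."""
    known = set().union(*layers.values())
    return [(src, dst) for src, dst in links if dst not in known]


def find_unlinked(layers, links, layer):
    """Return nodes of `layer` that have no outgoing link at all."""
    linked = {src for src, _ in links}
    return sorted(layers[layer] - linked)


print(find_dangling(LAYERS, LINKS))
print(find_unlinked(LAYERS, LINKS, "m"))
```

Checks like these run outside the annotation tool, so annotators never see them, which is what makes them useful for catching random slips.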