June 6, 2007, 3rd PIRE Meeting
Tectogrammatical Representation of English in Prague Czech-English Dependency Treebank
Lucie Mladová
Silvie Cinková, Kristýna Čermáková, Anja Nedoluzhko, Jana Šindlerová, Josef Toman, Zdeněk Žabokrtský
Outline:
● Functional Generative Description
● Parallel Treebanks
● PCEDT 2.0 – Project Report
    tectogrammatical level of annotation
    valency treatment
    annotation manual for English
    interannotator agreement
Functional Generative Description
● Basic approach for the Prague treebanks: a dependency-based, stratificational description of the language
● From structure to function (meaning), in 3 layers of annotation:
    morphological
    analytical (= surface syntax)
    tectogrammatical (= "deep" syntax, semantics)
Functional Generative Description
● Since 1995: Prague Dependency Treebank (PDT) -> Czech data (1.0 released by LDC in 2001, 2.0 by LDC in 2006)
● The idea of a parallel corpus: English data, Czech data – translated: Prague Czech-English Dependency Treebank (PCEDT) (1.0 released by LDC in 2004)
The Idea of a Parallel, Syntactically Annotated Corpus
● Build an English corpus in the same formalism as PDT (data resource: Wall Street Journal section of Penn Treebank)
● Translate it into Czech
● Manually annotate both parts of the corpus
● Train tectogrammar-based machine translation
Phrasal vs. Dependency Tree
Mr. Payson, an art dealer and collector, sold Vincent van Gogh's "Irises" at a Sotheby's auction in November 1987 to Australian businessman Alan Bond.
Dependency Trees:
a-layer = surface syntax
t-layer = underlying syntax, semantics
It may have been painted instead by a Rubens associate.
Dependency Trees:
a-layer = surface syntax
t-layer = underlying syntax, semantics
It may have been painted instead of Rubens by a Rubens associate.
Tectogrammatical Representation (t-tree)
Contains:
● syntactic dependency and coordination: edges
● semantic relations: tectogrammatical functors
    verb arguments (inner participants)
      ● semantic ACT, PAT
      ● syntactic ADDR, ORIG, EFF
    free modifications (e.g. TWHEN, LOC, DIR, MANN, CAUS, CPR, ACMP...)
    other: rhematizers, idiomatic expressions, foreign phrases...
● valency of the verbs: valency lexicon EngValLex
Tectogrammatical Representation (t-tree)
Contains:
● links to the lower layers
● grammatical (and textual) coreference
● topic-focus articulation
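The ingredients of a t-tree listed on these slides (edges, functors, links to the lower layers, coreference) can be pictured as a small data structure. The following is only a minimal sketch, not the PDT/PCEDT file format: the class `TNode`, its field names, and the helper `functors` are hypothetical, the `PRED` label for the root is an assumption, and the functor assignments for the example sentence are illustrative rather than taken from the treebank.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TNode:
    """Hypothetical t-layer node: one content word plus its functor."""
    t_lemma: str                                  # tectogrammatical lemma
    functor: str                                  # e.g. ACT, PAT, TWHEN
    a_links: List[str] = field(default_factory=list)    # surface tokens on the a-layer
    children: List["TNode"] = field(default_factory=list)  # dependency edges
    coref: Optional["TNode"] = None               # coreference antecedent, if any

    def add(self, child: "TNode") -> "TNode":
        """Attach a dependent node and return it."""
        self.children.append(child)
        return child


def functors(node: TNode) -> List[Tuple[str, str]]:
    """Return (t_lemma, functor) pairs in depth-first order."""
    pairs = [(node.t_lemma, node.functor)]
    for child in node.children:
        pairs.extend(functors(child))
    return pairs


# "It may have been painted ... by a Rubens associate." (illustrative only)
root = TNode("paint", "PRED", a_links=["may", "have", "been", "painted"])
root.add(TNode("it", "PAT", a_links=["It"]))
root.add(TNode("associate", "ACT", a_links=["by", "a", "associate"]))

print(functors(root))
```

Note how auxiliaries and the preposition "by" do not get nodes of their own in this sketch; they are only reachable through `a_links`, which mirrors the idea that function words live on the a-layer while the t-layer keeps content words.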
Building the PCEDT 2.0, the Current Annotation of the English Data
work with the corpus data
● input: WSJ texts (PTB), approx. [...] sentences (1.2 million words), automatically converted into PDT-like shape – a-layer
● automatic t-layer processing
● manual annotation running (approx. [...] trees annotated)
● meanwhile – Czech section annotation of the t-layer launched
additional work
● conversion of the PropBank lexicon into EngValLex (verbs only)
● tools adjustment (TrEd, unified macros for both CZ and ENG annotation)
● interannotator-agreement measuring
● first version of the annotation manual, now being revised
● training of new annotators
EngValLex
● adaptation of PropBank into the format of PDT-Vallex (valency lexicon for Czech)
● manual correction
● continuous checking during the annotation
● current version contains only verbs
future work on EngValLex:
● defining surface realizations – morphosyntactic characteristics of the semantic roles
● valency of nouns and adjectives
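To make the idea of a valency lexicon entry concrete, here is a minimal sketch of a verb frame and a check that could support "continuous checking during the annotation". It does not reproduce the EngValLex or PDT-Vallex format: the classes `Slot` and `Frame`, the method `missing`, and the frame given for "sell" are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Slot:
    """One valency slot: a functor and whether it is obligatory."""
    functor: str
    obligatory: bool


@dataclass
class Frame:
    """Hypothetical valency frame for one verb sense."""
    lemma: str
    slots: List[Slot]

    def missing(self, present: Set[str]) -> List[str]:
        """List obligatory functors that are absent from an annotated verb node."""
        return [s.functor for s in self.slots
                if s.obligatory and s.functor not in present]


# Illustrative frame only, not the actual lexicon entry for "sell".
SELL = Frame("sell", [Slot("ACT", True), Slot("PAT", True), Slot("ADDR", False)])

# Functors found among the children of an annotated "sell" node.
print(SELL.missing({"ACT"}))                 # obligatory PAT is absent
print(SELL.missing({"ACT", "PAT", "ADDR"}))  # nothing missing
```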
Annotation Manual
= "Annotation of English on the tectogrammatical level: Reference book"
● based on the abbreviated version of the annotation manual for PDT (Czech)
● chapters specific to English data annotation added
● first rough version 1.0.1: April 2007
● revision in progress
● extensions planned (concurrently with the annotation)
Interannotator Agreement
● monthly control of the annotation consistency
    approx. 30 trees
● measured:
    structure: agreement in parent node
    functors
● further analysis:
    list of unpaired nodes
    statistics for diverging functors
    elimination of detected annotation divergences at annotator meetings
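The measurements named on this slide can be sketched as simple proportions over paired nodes. The slides do not specify the exact formula used in the project, so the following is an assumption: nodes are paired by a shared identifier, each annotation maps a node id to a `(parent_id, functor)` tuple, and agreement is the share of paired nodes on which both annotators coincide. The function name `agreement` and the sample data are hypothetical.

```python
from collections import Counter


def agreement(ann_a, ann_b):
    """Compare two annotations given as {node_id: (parent_id, functor)}."""
    paired = sorted(set(ann_a) & set(ann_b))    # nodes both annotators created
    unpaired = sorted(set(ann_a) ^ set(ann_b))  # nodes only one annotator created
    same_parent = sum(ann_a[n][0] == ann_b[n][0] for n in paired)
    same_functor = sum(ann_a[n][1] == ann_b[n][1] for n in paired)
    # Count each (functor_a, functor_b) disagreement for later discussion.
    diverging = Counter((ann_a[n][1], ann_b[n][1])
                        for n in paired if ann_a[n][1] != ann_b[n][1])
    total = len(paired)
    return {
        "structure": same_parent / total if total else 0.0,
        "functor": same_functor / total if total else 0.0,
        "unpaired": unpaired,
        "diverging": diverging,
    }


# Made-up sample annotations of one tree by two annotators.
a = {"n1": (None, "PRED"), "n2": ("n1", "ACT"),
     "n3": ("n1", "PAT"), "n4": ("n1", "TWHEN")}
b = {"n1": (None, "PRED"), "n2": ("n1", "ACT"),
     "n3": ("n2", "PAT"), "n4": ("n1", "LOC"), "n5": ("n1", "MANN")}

result = agreement(a, b)
print(result["structure"], result["functor"])
print(result["unpaired"], dict(result["diverging"]))
```

The `diverging` counter corresponds to the "statistics for diverging functors": frequent pairs such as a temporal functor confused with a locative one would be natural topics for the annotator meetings.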
Average Interannotator Agreement
Future goals
● annotation expansion
    500 trees/annotator/month
    increasing (or at least keeping) the interannotator agreement
    training of new annotators
● EngValLex precision
● annotation manual precision and expansion
Acknowledgements
The work on the PCEDT project is supported by the grants PIRE ČR ME838 and GA405/06/0589.
Thank you for your attention!