A Unified Database of Dependency Treebanks: Integrating, Quantifying & Evaluating Dependency Data. Olga Pustylnikov, Alexander Mehler (Bielefeld University)
SFB 673 Motivation: Exploring similarities among languages by means of syntactic treebanks. We collected a database covering 11 languages. The treebanks were developed separately by different research projects, so quantitative investigations across them require unification.
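Motivation [Figure: the sentence "John loves Mary" encoded in different treebank formats: tabular rows (1 John n 2 / 2 loves v 0 / 3 Mary n 2), an XML fragment (<W DOM="2" ID="3" Mary), and a bracketed form (loves v ((John n) (Mary n))). Figure labels: corpus, structure, annotation.]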
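The tabular rows and the bracketed form on this slide carry the same tree. A minimal Python sketch of that conversion, assuming whitespace-separated columns in the order ID, form, POS, head ID, with head 0 marking the root; the function names `parse_rows` and `to_brackets` are illustrative and not part of any treebank tool:

```python
from collections import defaultdict

ROWS = """\
1 John n 2
2 loves v 0
3 Mary n 2
"""

def parse_rows(text):
    """Parse 'id form pos head' rows into a token table and a head -> dependents map."""
    tokens = {}
    children = defaultdict(list)
    for line in text.strip().splitlines():
        tok_id, form, pos, head = line.split()
        tokens[int(tok_id)] = (form, pos)
        children[int(head)].append(int(tok_id))
    return tokens, children

def to_brackets(tok_id, tokens, children):
    """Render the subtree rooted at tok_id in bracketed notation."""
    form, pos = tokens[tok_id]
    deps = children.get(tok_id, [])
    if not deps:
        return f"({form} {pos})"
    inner = " ".join(to_brackets(d, tokens, children) for d in deps)
    return f"({form} {pos} ({inner}))"

tokens, children = parse_rows(ROWS)
root = children[0][0]  # the token whose head is 0
print(to_brackets(root, tokens, children))  # (loves v ((John n) (Mary n)))
```

The intermediate head-to-dependents map is already a small graph, which is the kind of representation the unified format builds on.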
Motivation: Demands on the unified format of treebanks
(+) generic: able to represent as many treebanks as possible
(+) extensible to new treebanks
(+) complete: preserving all corpus-specific information
(+) transferable to other kinds of corpora
(–) complex: exhibiting minimal complexity
-> graph representations
Motivation: The Graph eXchange Language (GXL; Holt et al., 2006) is an XML-based graph model that represents corpora as graphs. GXL can be applied to any kind of corpus (see e.g. Mehler and Gleim (2005), Ferrer i Cancho et al. (2007), Pustylnikov and Mehler (2008)). [Diagram labels: XML, GXL, eGXL; wiki, multimodal data, treebanks; tools.]
Agenda: 1. eGXL 2. Data 3. Complexity Evaluation 4. Application 5. Conclusion
eGXL: … level data model [Diagram: a Sentences graph and a Types graph, connected by IDREF]
The eGXL Types-graph: The Types-graph contains treebank-specific attributes (e.g. POS, morphological attributes, etc.), represented as nodes. Each instance of an attribute is given a unique identifier … [Figure labels: a unique identifier; the value of the attribute]
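One way to picture the Types-graph is as a lookup table in which each attribute instance is stored once and referred to by its identifier. The sketch below assumes that an "instance" is an attribute together with its value; the `TypesGraph` class and the ID scheme (`t1`, `t2`, and so on) are hypothetical, since the actual eGXL identifiers are not shown on the slide:

```python
class TypesGraph:
    """Stores each (attribute, value) pair once and hands out a unique ID for it."""

    def __init__(self):
        self._ids = {}   # (attribute, value) -> node ID
        self.nodes = {}  # node ID -> (attribute, value)

    def node_id(self, attribute, value):
        """Return the ID for this attribute instance, creating the node if needed."""
        key = (attribute, value)
        if key not in self._ids:
            new_id = f"t{len(self._ids) + 1}"  # hypothetical ID scheme
            self._ids[key] = new_id
            self.nodes[new_id] = key
        return self._ids[key]

types = TypesGraph()
john_pos = types.node_id("pos", "n")
loves_pos = types.node_id("pos", "v")
mary_pos = types.node_id("pos", "n")  # reuses the node created for "John"

print(john_pos, loves_pos, mary_pos)  # t1 t2 t1
print(types.nodes)
```

Tokens then only need to carry the returned ID, which keeps treebank-specific tag sets out of the sentence structure itself.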
The eGXL Sentences-graph
node: each token of a treebank
id: a unique identifier
form: word form
pos: an IDREF to the POS-node of the Types-graph
rel: a (syntactic) relation
relend: a relation anchor
in: from (e.g. a head verb)
out: to (e.g. a dependent argument)
[Example dependency tree over the Swedish tokens: vill, Detta, bestämt, jag, bemöta]
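The element names on this slide (node, id, form, pos, rel, relend, in, out) can be assembled into a small XML fragment. The sketch below uses Python's `xml.etree.ElementTree`. How eGXL actually nests these elements is not shown on the slide, so the layout here is an assumption for illustration: `form` and `pos` as attributes of `node`, `relend` children carrying `target` and `direction`, and a `label` attribute on `rel`. The IDs and the relation labels `subj` and `obj` are made up.

```python
import xml.etree.ElementTree as ET

def build_sentence(tokens, relations):
    """tokens: list of (id, form, pos_ref); relations: list of (head_id, dep_id, label)."""
    graph = ET.Element("graph", id="sentence1")
    for tok_id, form, pos_ref in tokens:
        # One node per token; pos points to a node of the Types-graph.
        ET.SubElement(graph, "node", id=tok_id, form=form, pos=pos_ref)
    for head_id, dep_id, label in relations:
        rel = ET.SubElement(graph, "rel", label=label)
        # "in" anchors the relation at the element it starts from (the head),
        # "out" at the element it leads to (the dependent), as in the slide's list.
        ET.SubElement(rel, "relend", target=head_id, direction="in")
        ET.SubElement(rel, "relend", target=dep_id, direction="out")
    return graph

tokens = [("w1", "John", "t1"), ("w2", "loves", "t2"), ("w3", "Mary", "t1")]
relations = [("w2", "w1", "subj"), ("w2", "w3", "obj")]

graph = build_sentence(tokens, relations)
print(ET.tostring(graph, encoding="unicode"))
```

Each relation is a first-class element with two anchors rather than an attribute on the dependent, which is what lets the same structure hold relations from differently organised source treebanks.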
eGXL
Dependency Treebanks: 7 different formats
Input vs. Output Formats: examples from Dutch, Swedish, and Italian treebanks
Unification is possible … due to the separation of the core from the secondary parts … [Figure labels: diversity, commonality]
The TreebankWiki
Complexity of eGXL: Logical Scaling Factor (LSF): the number of logical elements (e.g. XML elements) required to represent a treebank unit (e.g. a word form, POS, etc.). [Chart: LSF for node and rel, eGXL vs. other formats]
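Read literally, the LSF of a unit is a count of the logical elements spent on it. A rough sketch of such a count for XML encodings follows. It assumes that every XML element counts as one logical element and that the factor is averaged over all units of a given tag; both are assumptions for illustration and not necessarily the procedure behind the comparison on this slide. The function name `lsf` and the two sample encodings are invented.

```python
import xml.etree.ElementTree as ET

def lsf(xml_text, unit_tag):
    """Average number of XML elements per unit (the unit element plus its descendants)."""
    root = ET.fromstring(xml_text)
    units = list(root.iter(unit_tag))
    if not units:
        raise ValueError(f"no <{unit_tag}> elements found")
    total = sum(len(list(unit.iter())) for unit in units)
    return total / len(units)

# Invented encodings of the same two tokens.
compact = '<s><node id="w1" form="John" pos="t1"/><node id="w2" form="loves" pos="t2"/></s>'
verbose = ('<s><node id="w1"><form>John</form><pos>n</pos></node>'
           '<node id="w2"><form>loves</form><pos>v</pos></node></s>')

print(lsf(compact, "node"))  # 1.0
print(lsf(verbose, "node"))  # 3.0
```

A lower value means fewer elements per unit, so the attribute-based encoding scores 1.0 against 3.0 for the nested one.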
DTDB
Conclusions: a database covering 11 languages; eGXL, a generic XML graph model adapted to syntactic treebanks; use of the treebanks within a single application (Ariadne). Thank you for your attention!