Published by Loreen Rice. Modified over 9 years ago.
1
A Unified Database of Dependency Treebanks: Integrating, Quantifying & Evaluating Dependency Data
Olga Pustylnikov, Alexander Mehler
Bielefeld University
2
SFB 673
Motivation
Exploring similarities among languages by means of syntactic treebanks.
We collected a database covering 11 languages.
The treebanks were developed separately by different research projects.
Quantitative investigations on these treebanks -> the need for unification.
3
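Motivation
The same sentence, "John loves Mary", as encoded by different treebank formats:
Tabular:
1 John n 2
2 loves v 0
3 Mary n 2
XML attributes (fragment): <W DOM="2" ID="3" for the token Mary
Bracketed: (loves v ((John n) (Mary n)))
[Diagram labels: corpus, structure, annotation]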
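The equivalence of the tabular and bracketed encodings on this slide can be illustrated with a minimal sketch. It assumes the four columns are token id, word form, POS tag, and head id, with 0 marking the root; the function names `parse_rows` and `to_bracketed` are hypothetical and not part of any treebank toolkit.

```python
from collections import defaultdict

# Tabular rows as shown on the slide.
# Assumed column order: token id, word form, POS tag, head id (0 = root).
ROWS = """\
1 John n 2
2 loves v 0
3 Mary n 2
"""

def parse_rows(text):
    """Return {token_id: (form, pos, head_id)} from whitespace-separated rows."""
    tokens = {}
    for line in text.strip().splitlines():
        tok_id, form, pos, head = line.split()
        tokens[int(tok_id)] = (form, pos, int(head))
    return tokens

def to_bracketed(tokens):
    """Render the dependency tree in the bracketed style of the slide."""
    children = defaultdict(list)
    for tok_id, (_, _, head) in sorted(tokens.items()):
        children[head].append(tok_id)

    def render(tok_id):
        form, pos, _ = tokens[tok_id]
        kids = children.get(tok_id, [])
        if not kids:
            return f"({form} {pos})"
        inner = " ".join(render(k) for k in kids)
        return f"({form} {pos} ({inner}))"

    # Token ids whose head is 0 are the roots of the sentence.
    return " ".join(render(root) for root in children[0])

bracketed = to_bracketed(parse_rows(ROWS))
print(bracketed)
```

Both encodings carry the same head-dependent structure; only the surface syntax differs, which is the point the slide makes about the need for unification.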
4
Motivation
Demands on the unified format of treebanks:
(+) generic: allowing to represent as many treebanks as possible
(+) extensible to new treebanks
(+) complete: preserving all corpus specific information
(+) transferable to other kinds of corpora
(–) complex: exhibiting the minimal complexity
-> graph representations
5
Motivation
GXL (Holt et al., 2006)
The Graph eXchange Language (GXL) is a graph model representing corpora in terms of graphs.
GXL can be applied to any kind of corpus (see e.g. Mehler and Gleim (2005), Ferrer i Cancho et al. (2007), Pustylnikov and Mehler (2008)).
[Diagram labels: XML, GXL, Wiki, Multimodal Data, Treebanks, Tools, eGXL]
6
Agenda: 1. eGXL, 2. Data, 3. Complexity Evaluation, 4. Application, 5. Conclusion
7
eGXL: 2-level data model
[Diagram: a Sentences graph and a Types graph, linked by IDREF]
9
The eGXL Types-graph
The Types-graph contains treebank specific attributes (e.g. POS, morphological attributes etc.) -> nodes.
Each instance of an attribute is given a unique identifier.
[Diagram annotations on each node: a unique identifier; the value of the attribute]
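A minimal sketch of the idea behind the Types-graph: each distinct attribute value becomes one node with a unique identifier. The `node`, `attr`, and `string` elements follow general GXL conventions; the identifier scheme (`pos_1`, `pos_2`), the graph id, and the function name `build_types_graph` are assumptions made for illustration, since the slides do not show the actual eGXL markup.

```python
import xml.etree.ElementTree as ET

def build_types_graph(values_by_attribute):
    """Create one node per distinct attribute value, each with a unique id.

    The id scheme and graph layout are illustrative assumptions, not the
    eGXL schema itself.
    """
    graph = ET.Element("graph", id="Types")
    ids = {}  # (attribute name, value) -> node id
    for attribute, values in values_by_attribute.items():
        for value in values:
            key = (attribute, value)
            if key in ids:
                continue  # this value already has a node
            node_id = f"{attribute}_{len(ids) + 1}"
            ids[key] = node_id
            node = ET.SubElement(graph, "node", id=node_id)
            attr = ET.SubElement(node, "attr", name=attribute)
            ET.SubElement(attr, "string").text = value
    return graph, ids

# POS tags of "John loves Mary"; the repeated tag "n" yields a single node.
types_graph, type_ids = build_types_graph({"pos": ["n", "v", "n"]})
print(ET.tostring(types_graph, encoding="unicode"))
```

Storing each value once and referring to it by identifier keeps treebank specific tag sets out of the sentence level, which is what lets the sentence level stay the same across treebanks.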
10
The eGXL Sentences-graph
[Diagram: example tokens vill, Detta, bestämt, jag, bemöta]
Diagram annotations: each token of a treebank; word form; an IDREF to the POS-node of the Types-graph; a (syntactic) relation from (e.g. a head verb) to (e.g. a dependent argument).
11
The eGXL Sentences-graph
node: each token of a treebank
id: a unique identifier
form: word form
pos: an IDREF to the POS-node of the Types-graph
rel: a (syntactic) relation
relend: a relation anchor
in: from (e.g. a head verb)
out: to (e.g. a dependent argument)
[Diagram: example tokens vill, Detta, bestämt, jag, bemöta]
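The element and attribute names listed above can be put together in a minimal sketch of a sentence graph. The names `node`, `id`, `form`, `pos`, `rel`, `relend`, and the `in`/`out` directions come from the slide; whether `form` and `pos` are XML attributes or nested elements, the id scheme, and the POS-node ids `pos_1` and `pos_2` are assumptions, as is the helper name `build_sentence_graph`.

```python
import xml.etree.ElementTree as ET

def build_sentence_graph(sent_id, tokens, pos_ids):
    """Build one sentence graph.

    tokens: list of (token_id, form, pos_tag, head_id); head_id 0 marks the root.
    pos_ids: maps a POS tag to the id of its node in the Types-graph (assumed ids).
    """
    graph = ET.Element("graph", id=sent_id)
    # One node per token, with its word form and an IDREF to the POS node.
    for tok_id, form, pos, _ in tokens:
        ET.SubElement(graph, "node", id=f"{sent_id}_w{tok_id}",
                      form=form, pos=pos_ids[pos])
    # One relation per non-root token, anchored by two relends.
    for tok_id, _, _, head in tokens:
        if head == 0:
            continue
        rel = ET.SubElement(graph, "rel")
        # "in": the relation starts at the head (e.g. a head verb).
        ET.SubElement(rel, "relend", target=f"{sent_id}_w{head}", direction="in")
        # "out": the relation ends at the dependent (e.g. an argument).
        ET.SubElement(rel, "relend", target=f"{sent_id}_w{tok_id}", direction="out")
    return graph

tokens = [(1, "John", "n", 2), (2, "loves", "v", 0), (3, "Mary", "n", 2)]
sentence = build_sentence_graph("s1", tokens, {"n": "pos_1", "v": "pos_2"})
print(ET.tostring(sentence, encoding="unicode"))
```

The sketch reuses the "John loves Mary" example from the Motivation slide rather than the Swedish tokens in the diagram, whose dependency structure is not recoverable from the transcript.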
12
eGXL
13
Agenda: 1. eGXL, 2. Data, 3. Complexity Evaluation, 4. Application, 5. Conclusion
14
11 dependency treebanks in 7 different formats
15
Input vs. Output Formats
Examples from the Dutch, Swedish, and Italian treebanks
16
Unification is possible due to the separation of the core from the secondary parts.
[Diagram labels: diversity, commonality]
17
The TreebankWiki: http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/
18
Agenda: 1. eGXL, 2. Data, 3. Complexity Evaluation, 4. Application, 5. Conclusion
19
Complexity of eGXL
Logical Scaling Factor (LSF): the number of logical elements (e.g. XML elements) required to represent a treebank unit (e.g. a word form, a POS tag).
[Table residue: LSF for node and rel units, eGXL vs. other formats; values not preserved]
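One possible reading of the LSF definition can be sketched as follows: for a given unit type, count the XML elements in each unit's subtree (the unit element included) and average over all units of that type. This is an assumption about how the measure is computed, not the authors' confirmed procedure; the function name `logical_scaling_factor` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def logical_scaling_factor(root, unit_tag):
    """Average number of XML elements used per unit of type unit_tag.

    Each unit's own element counts as one logical element, plus every
    element nested inside it. This is one reading of the LSF definition.
    """
    sizes = [sum(1 for _ in unit.iter()) for unit in root.iter(unit_tag)]
    if not sizes:
        raise ValueError(f"no <{unit_tag}> elements found")
    return sum(sizes) / len(sizes)

# Hypothetical sentence graph: three token nodes and two relations.
SAMPLE = """
<graph id="s1">
  <node id="w1"/><node id="w2"/><node id="w3"/>
  <rel><relend target="w2"/><relend target="w1"/></rel>
  <rel><relend target="w2"/><relend target="w3"/></rel>
</graph>
"""

root = ET.fromstring(SAMPLE)
lsf_node = logical_scaling_factor(root, "node")
lsf_rel = logical_scaling_factor(root, "rel")
print(lsf_node, lsf_rel)
```

Under this reading, a lower value means a format needs fewer logical elements per unit, so comparing the values for the same treebank in two formats compares their representational overhead.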
20
Agenda: 1. eGXL, 2. Data, 3. Complexity Evaluation, 4. Application, 5. Conclusion
21
DTDB
22
Agenda: 1. eGXL, 2. Data, 3. Complexity Evaluation, 4. Application, 5. Conclusion
23
Conclusions
A database covering 11 languages.
eGXL: a generic XML graph model adapted to syntactic treebanks.
Use of treebanks within a single application (Ariadne).
Contact: olga.pustylnikov@uni-bielefeld.de, alexander.mehler@uni-bielefeld.de, ruediger.gleim@uni-bielefeld.de
Thank you for your attention!