Published by Moses Newton. Modified over 9 years ago.
1
Syntactically annotated corpora of Estonian
Heli Uibo
Institute of Computer Science, University of Tartu
Heli.Uibo@ut.ee
2
Outline
Who?
Why?
Three initiatives:
– CG-corpus
– Sofie Parallel Treebank
– Arborest
What next?
3
Who are we?
Kaili Müürisep, PhD
Tiina Puolakainen, PhD
Mare Koit, PhD
Tiit Roosmaa, PhD
Kadri Muischnek, M.A.
Heli Uibo, M.Sc.
Andriela Rääbis, M.A.
Heili Orav, M.A.
Kaarel Kaljurand, M.Sc.
+ students of computational linguistics (experienced in shallow syntactic annotation of texts)
4
Why do we need syntactically annotated corpora?
To evaluate language technology software (tools for information retrieval and extraction, automatic summarization, machine translation).
To build a new, up-to-date description of Estonian syntax that takes real language usage into account.
5
Three syntactically annotated corpora for Estonian
1. Constraint Grammar (CG) Corpus
Size: 200 000 running words (ca 15 000 sentences)
– 184 000 words of Estonian original fiction
– 10 000 words of newspaper texts
– 6 000 words of legal texts
Shallow annotation using Constraint Grammar: a syntactic function is determined for every word-form.
6
Three syntactically annotated corpora for Estonian (2)
Two small-scale experimental treebanks:
2. Sofie Parallel Treebank: a Penn-style phrase structure treebank of 50 sentences
3. Arborest: a VISL-style hybrid treebank of 2500 sentences (first 149 sentences manually revised)
7
Constraint Grammar Corpus
Built to train and test ESTCG, the Constraint Grammar shallow syntactic parser.
Currently the precision of ESTCG is 76.4-79.2% and its recall is 95.5-96.9%.
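The slide does not say how precision and recall are computed. A minimal sketch, assuming the definition commonly used for Constraint Grammar parsers, where the parser may leave several candidate tags on a word: recall is the share of words that keep the correct tag, and precision is the share of correct tags among all tags left in the output. The function name `evaluate` and the toy data are illustrative, not taken from ESTCG.

```python
def evaluate(parser_output, gold):
    """Precision and recall for a parser that may leave words ambiguous.

    parser_output: list of sets, the tags left on each word by the parser
    gold:          list of strings, the single correct tag for each word
    (Assumed definition; the slides do not spell out the measure.)
    """
    if len(parser_output) != len(gold):
        raise ValueError("parser output and gold standard differ in length")
    total_tags = sum(len(tags) for tags in parser_output)
    correct = sum(1 for tags, g in zip(parser_output, gold) if g in tags)
    precision = correct / total_tags   # correct tags / all tags in the output
    recall = correct / len(gold)       # words that kept their correct tag
    return precision, recall


# Toy data: five words, two of them left with two candidate tags.
parser_output = [{"@SUBJ"}, {"@+FMV"}, {"@NN>", "@OBJ"}, {"@AN>"}, {"@PRD", "@SUBJ"}]
gold = ["@SUBJ", "@+FMV", "@NN>", "@AN>", "@PRD"]
precision, recall = evaluate(parser_output, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Under this definition, remaining ambiguity lowers precision while leaving recall untouched, which matches the pattern of high recall and lower precision reported for ESTCG.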
8
ESTCG: Syntactic tags
@SUBJ – subject
@OBJ – object
@PRD – predicative
@ADVL – adverbial
@+FMV, @-FMV, @+FCV, @-FCV – parts of the predicate
@AN> @<AN – adjective as attribute
@NN> @<NN – noun as attribute, apposition
@AD> @<AD – adverb as attribute
@Q> @<Q – complements of a quantifier
@P> @<P – complements of an adposition
...
9
CG-corpus: example
Mitmekesisus mitme_kesi=sus+0 //_S_ com sg nom #cap // **CLB @SUBJ
on ole+0 //_V_ main indic pres ps3 sg ps af #FinV #Intr // @+FMV
elu elu+0 //_S_ com sg gen // @NN>
vaieldamatu vaieldamatu+0 //_A_ pos sg nom // @AN>
omapära oma_pära+0 //_S_ com sg nom // @PRD
$.. //_Z_ Fst //
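The annotation in the example can be split into fields mechanically. A minimal sketch, assuming one token per line with the layout "word-form lemma+ending //_POS_ features // syntactic tags"; this layout is inferred from the example alone, and the function `parse_cg_line` and the field names are hypothetical.

```python
def parse_cg_line(line):
    """Split one annotated token into word-form, lemma, POS, features, syntax.

    Assumes the three '//'-separated fields seen in the example; the real
    corpus format may carry more variation than this sketch handles.
    """
    parts = [p.strip() for p in line.split("//")]
    head = parts[0].split()
    morph = parts[1].split() if len(parts) > 1 else []
    rest = parts[2].split() if len(parts) > 2 else []
    return {
        "word_form": head[0],
        "lemma": head[1] if len(head) > 1 else None,
        "pos": morph[0].strip("_") if morph else None,
        "features": morph[1:],
        # Only '@'-prefixed items are syntactic function tags; markers
        # such as **CLB are left out here.
        "syntax": [t for t in rest if t.startswith("@")],
    }


SAMPLE = """\
Mitmekesisus mitme_kesi=sus+0 //_S_ com sg nom #cap // **CLB @SUBJ
on ole+0 //_V_ main indic pres ps3 sg ps af #FinV #Intr // @+FMV
elu elu+0 //_S_ com sg gen // @NN>
vaieldamatu vaieldamatu+0 //_A_ pos sg nom // @AN>
omapära oma_pära+0 //_S_ com sg nom // @PRD"""

tokens = [parse_cg_line(line) for line in SAMPLE.splitlines()]
for tok in tokens:
    print(tok["word_form"], tok["pos"], tok["syntax"])
```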
10
CG-corpus: the process of extending the corpus
1) Input: morphologically hand-annotated text
2) Automatic syntactic analysis (ESTCG parser)
3) Hand-correcting: two linguists in parallel (annotation manual + GUI-based annotation tool)
4) Automatic comparison
5) Discussion of problematic cases
6) Creation of final version
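Step 4 above, the automatic comparison of the two linguists' versions, could look roughly like the following. This is a sketch under the assumption that both versions share the same tokenization and carry one syntactic tag per word; the function `compare_annotations` and the disagreement shown are invented for illustration and do not describe the project's actual tool.

```python
def compare_annotations(version_a, version_b):
    """List the positions where two annotators chose different tags.

    Each version is a list of (word_form, tag) pairs over the same text.
    Returns tuples (index, word_form, tag_a, tag_b) for every disagreement.
    """
    if len(version_a) != len(version_b):
        raise ValueError("versions differ in length; align them first")
    disagreements = []
    for i, ((word_a, tag_a), (word_b, tag_b)) in enumerate(zip(version_a, version_b)):
        if word_a != word_b:
            raise ValueError(f"token mismatch at position {i}: {word_a!r} vs {word_b!r}")
        if tag_a != tag_b:
            disagreements.append((i, word_a, tag_a, tag_b))
    return disagreements


annotator_1 = [("Mitmekesisus", "@SUBJ"), ("on", "@+FMV"), ("elu", "@NN>"),
               ("vaieldamatu", "@AN>"), ("omapära", "@PRD")]
# Hypothetical second version with one differing decision.
annotator_2 = [("Mitmekesisus", "@SUBJ"), ("on", "@+FMV"), ("elu", "@NN>"),
               ("vaieldamatu", "@AN>"), ("omapära", "@SUBJ")]

disagreements = compare_annotations(annotator_1, annotator_2)
agreement = 1 - len(disagreements) / len(annotator_1)
for index, word, tag_a, tag_b in disagreements:
    print(f"{index}: {word}  {tag_a} vs {tag_b}")
```

The resulting list is what would feed step 5, the discussion of problematic cases.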
11
Sofie Parallel Treebank
The Sofie Parallel Treebank is being developed within the Nordic Treebank Network, which is funded by the NorFA language technology programme and joins 15 academic institutions from Sweden, Norway, Denmark, Finland, Estonia and Iceland.
Material: the 1st chapter of Jostein Gaarder's novel "Sophie's World".
Currently, the parallel treebank includes Swedish, German, Norwegian, Estonian and two versions of Danish, with 50-100 sentences for each language.
12
Sofie Parallel Treebank (cont'd)
The syntactic structure represented in the trees of different languages is not uniform:
– Danish: Discontinuous Grammar dependency treebank and VISL-style phrase structure treebank
– Swedish: dependency treebank
– German: NEGRA-style treebank
– Norwegian: phrase structure treebank
– Estonian: Penn-style phrase structure treebank
The representation format of trees is TIGER XML.
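For readers unfamiliar with TIGER XML, the sketch below reads a tiny fragment and prints it as a bracketed tree. The element and attribute names (`s`, `graph`, `t`, `nt`, `edge`, `idref`, `cat`) follow the TIGER XML format; the sentence, its analysis and the node ids are a hypothetical toy, not taken from the Sofie treebank.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy fragment in TIGER XML style (not from the treebank).
TIGER_SAMPLE = """
<corpus id="toy">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Sofie" pos="NNP"/>
          <t id="s1_2" word="opened" pos="VBD"/>
          <t id="s1_3" word="the" pos="DT"/>
          <t id="s1_4" word="envelope" pos="NN"/>
        </terminals>
        <nonterminals>
          <nt id="s1_503" cat="NP">
            <edge label="--" idref="s1_1"/>
          </nt>
          <nt id="s1_502" cat="NP">
            <edge label="--" idref="s1_3"/>
            <edge label="--" idref="s1_4"/>
          </nt>
          <nt id="s1_501" cat="VP">
            <edge label="--" idref="s1_2"/>
            <edge label="--" idref="s1_502"/>
          </nt>
          <nt id="s1_500" cat="S">
            <edge label="--" idref="s1_503"/>
            <edge label="--" idref="s1_501"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>
"""


def bracket(graph):
    """Render one <graph> element as a Penn-style bracketed string."""
    terminals = {t.get("id"): t for t in graph.iter("t")}
    nonterminals = {nt.get("id"): nt for nt in graph.iter("nt")}

    def render(node_id):
        if node_id in terminals:
            t = terminals[node_id]
            return f"({t.get('pos')} {t.get('word')})"
        nt = nonterminals[node_id]
        # Children are emitted in edge order; this toy lists edges in
        # surface order, which real files do not necessarily guarantee.
        children = " ".join(render(e.get("idref")) for e in nt.findall("edge"))
        return f"({nt.get('cat')} {children})"

    return render(graph.get("root"))


root = ET.fromstring(TIGER_SAMPLE)
trees = [bracket(g) for g in root.iter("graph")]
print(trees[0])
```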
13
Estonian part of Sofie treebank: how did we do it?
Trees drawn on paper by K. Muischnek and H. Nigol.
"Electronic" trees drawn by H. Uibo and K. Kaljurand with the ANNOTATE tool, using the Penn Treebank tagset.
Database of trees exported from ANNOTATE in NEGRA format.
TigerRegistry and TigerSearch used to convert the trees into TIGER XML.
Website of Sofie Parallel Treebank: http://omilia.uio.no/sofie
14
Sample trees from Sofie treebank
Her begynte den dype skogen. (Norwegian: "Here the deep forest began.")
15
Straks Sofie hadde lukket porten bak seg, åpnet hun konvolutten. (Norwegian: "As soon as Sofie had closed the gate behind her, she opened the envelope.")
16
Sofie Parallel Treebank – example from web-interface Sophie's father was the captain of a big oil tanker, and was away for most of the year.
17
Arborest
Joint work with Dr. Eckhard Bick, University of Southern Denmark
VISL-style experimental treebank
Annotated for both function (S = subject, P = predicate, O = object, A = adverbial, STA = statement, QUE = question, etc.) and form (np, vp, pp, advp, adjp, fcl = finite clause, par = paratagma, etc.)
18
Arborest (cont'd)
Automatically generated from a sample of the CG-corpus (2500 sentences) with CG→PSG rules
149 sentences revised
1/3 of sentences correct
CG→PSG rules are being improved
Webpage: http://corp.hum.sdu.dk/arborest.html
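The actual CG→PSG rules are not shown in the slides. To give a feel for the idea only, the toy sketch below groups words tagged as premodifiers (tags ending in ">", such as @NN> and @AN>) with the next head to their right and relabels the head's function with the letters listed for Arborest (S, P, O, A). Everything here, including the function `group_chunks`, the mapping table and the fallback label for @PRD, is an assumption made for illustration and is far simpler than the real rule set.

```python
# Assumed mapping from ESTCG tags to the function labels named on the
# Arborest slide; tags without an entry keep their own name minus '@'.
FUNCTION_LABELS = {"@SUBJ": "S", "@OBJ": "O", "@ADVL": "A", "@+FMV": "P"}


def group_chunks(tokens):
    """Group premodifiers with the following head into flat chunks.

    tokens: list of (word_form, cg_tag) pairs for one sentence.
    Returns a list of (function_label, [word_forms]) chunks.
    """
    chunks = []
    pending = []  # premodifiers waiting for their head
    for word, tag in tokens:
        if tag.endswith(">"):
            pending.append(word)
            continue
        label = FUNCTION_LABELS.get(tag, tag.lstrip("@"))
        chunks.append((label, pending + [word]))
        pending = []
    if pending:
        # Premodifiers with no head to the right: flag them for review.
        chunks.append(("?", pending))
    return chunks


sentence = [("Mitmekesisus", "@SUBJ"), ("on", "@+FMV"), ("elu", "@NN>"),
            ("vaieldamatu", "@AN>"), ("omapära", "@PRD")]
chunks = group_chunks(sentence)
for label, words in chunks:
    print(label, " ".join(words))
```

Postmodifier tags (@<NN, @<AN and so on), clause structure and form labels such as np or fcl are not handled, which hints at why generated trees still need manual revision.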
19
Arborest – sample tree
20
What next?
To enlarge all three syntactically annotated corpora.
To improve the CG-to-PSG rules so that an Estonian treebank can be built semi-automatically with less manual work.
To create another, syntactic-semantic dependency treebank for Estonian, which will be semi-automatically generated from one of the existing experimental phrase structure treebanks.
→ How much semantic information can be derived from the syntactic dependency structure?