Syntactically annotated corpora of Estonian Heli Uibo Institute of Computer Science University of Tartu



Outline
Who?
Why?
Three initiatives:
– CG-corpus
– Sofie Parallel Treebank
– Arborest
What next?

Who are we?
Kaili Müürisep, PhD
Tiina Puolakainen, PhD
Mare Koit, PhD
Tiit Roosmaa, PhD
Kadri Muischnek, M.A.
Heli Uibo, M.Sc.
Andriela Rääbis, M.A.
Heili Orav, M.A.
Kaarel Kaljurand, M.Sc.
+ students of computational linguistics (experienced in shallow syntactic annotation of texts)

Why do we need syntactically annotated corpora?
– To evaluate language technology software (tools for information retrieval and extraction, automatic summarization, machine translation)
– To build a new, up-to-date description of Estonian syntax that takes real language usage into account

Three syntactically annotated corpora for Estonian
1. Constraint Grammar (CG) Corpus
– size – running words ≈ ca sentences
– words of Estonian original fiction
– words of newspaper texts
– words of legal texts
– shallow annotation, using Constraint Grammar: a syntactic function is determined for every word-form

Three syntactically annotated corpora for Estonian (2)
Two small-scale experimental treebanks:
2. Sofie Parallel Treebank – a Penn-style phrase structure treebank of 50 sentences
3. Arborest – a VISL-style hybrid treebank of 2500 sentences (first 149 sentences manually revised)

Constraint Grammar Corpus
– Built to train and test ESTCG, the Constraint Grammar shallow syntactic parser
– Currently the precision of ESTCG is 76.4–79.2% and its recall is 95.5–96.9%.
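The slide does not say how precision and recall were computed. A common convention for Constraint Grammar parsers, which may leave more than one tag on a word, is sketched below: recall is the share of words that still carry the correct tag, and precision is the share of emitted tags that are correct. The function name, the tag labels and the toy data are illustrative assumptions, not ESTCG output.

```python
def precision_recall(gold, predicted):
    """Score ambiguous tagger output against a gold standard.

    gold      -- list with one correct tag per word
    predicted -- list with one set of remaining tags per word
    Assumed convention (not stated on the slide):
      recall    = words whose tag set contains the gold tag / all words
      precision = words whose tag set contains the gold tag / all tags emitted
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(1 for g, tags in zip(gold, predicted) if g in tags)
    emitted = sum(len(tags) for tags in predicted)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / emitted if emitted else 0.0
    return precision, recall


# Toy data with made-up tags: word 1 stays ambiguous, word 3 is wrong.
gold_tags = ["@SUBJ", "@FMV", "@OBJ"]
parser_tags = [{"@SUBJ", "@OBJ"}, {"@FMV"}, {"@ADVL"}]
precision, recall = precision_recall(gold_tags, parser_tags)
```

Under this convention a parser that leaves ambiguity in place keeps recall high at the cost of precision, which matches the pattern of the figures quoted above.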

ESTCG: Syntactic – – – – parts of – adjective – noun as – adverb – complements – complements of adposition...

CG-corpus: example
Mitmekesisus
    mitme_kesi=sus+0 //_S_ com sg nom #cap //
on
    ole+0 //_V_ main indic pres ps3 sg ps af #FinV #Intr
elu
    elu+0 //_S_ com sg gen
vaieldamatu
    vaieldamatu+0 //_A_ pos sg nom
omapära
    oma_pära+0 //_S_ com sg nom
$.
    . //_Z_ Fst //
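A minimal sketch of reading annotation of this kind, assuming a layout with each word form on its own line followed by an indented analysis line of the shape `lemma+ending //_POS_ features`. The layout, the regular expression and the field names are assumptions made for illustration; the corpus's actual file format may differ, and punctuation lines without a `+ending` part are simply skipped here.

```python
import re

# Assumed layout: word form on one line, indented analysis on the next.
SAMPLE = """\
Mitmekesisus
    mitme_kesi=sus+0 //_S_ com sg nom #cap //
on
    ole+0 //_V_ main indic pres ps3 sg ps af #FinV #Intr
elu
    elu+0 //_S_ com sg gen
"""

ANALYSIS = re.compile(
    r"^(?P<lemma>\S+?)\+(?P<ending>\S+)\s+//_(?P<pos>\w)_\s*(?P<rest>.*)$"
)


def parse_cg(text):
    """Return one dict per analysed word form."""
    tokens = []
    form = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            match = ANALYSIS.match(line.strip())
            if match is None or form is None:
                continue  # skip lines that do not fit the assumed shape
            features = match.group("rest").replace("//", " ").split()
            tokens.append({
                "form": form,
                "lemma": match.group("lemma"),
                "ending": match.group("ending"),
                "pos": match.group("pos"),
                "features": features,
            })
        else:
            form = line.strip()
    return tokens


tokens = parse_cg(SAMPLE)
```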

CG-corpus: the process of extending the corpus
1) Input: morphologically hand-annotated text
2) Automatic syntactic analysis (ESTCG parser)
3) Hand-correcting – two linguists in parallel (annotation manual + GUI-based annotation tool)
4) Automatic comparison
5) Discussion of problematic cases
6) Creation of final version
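The automatic comparison in step 4 can be pictured as follows. This is only a sketch under the assumption that each linguist's corrected output is a list of (word form, syntactic tag) pairs over the same text; the function name and the tags are made up for illustration and do not describe the project's actual tool.

```python
def compare_annotations(first, second):
    """List the positions where two annotators chose different tags.

    Both arguments are lists of (word_form, tag) pairs over the same text.
    Returns (disagreements, agreement_rate).
    """
    if len(first) != len(second):
        raise ValueError("annotations cover different numbers of words")
    disagreements = []
    for index, ((form_a, tag_a), (form_b, tag_b)) in enumerate(zip(first, second)):
        if form_a != form_b:
            raise ValueError(f"word forms differ at position {index}")
        if tag_a != tag_b:
            disagreements.append((index, form_a, tag_a, tag_b))
    agreement = 1 - len(disagreements) / len(first) if first else 1.0
    return disagreements, agreement


# Illustrative tags only; they are not taken from the corpus.
annotator_a = [("Mitmekesisus", "@SUBJ"), ("on", "@FMV"),
               ("elu", "@ADVL"), ("omapära", "@PRD")]
annotator_b = [("Mitmekesisus", "@SUBJ"), ("on", "@FMV"),
               ("elu", "@ADVL"), ("omapära", "@SUBJ")]
disagreements, agreement = compare_annotations(annotator_a, annotator_b)
```

The list of disagreements is what would feed step 5, the discussion of problematic cases.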

Sofie Parallel Treebank
The Sofie Parallel Treebank is being developed within the Nordic Treebank Network, which is funded by the NorFA language technology programme and brings together 15 academic institutions from Sweden, Norway, Denmark, Finland, Estonia and Iceland.
Material – the 1st chapter of Jostein Gaarder's novel "Sophie's World".
Currently, the parallel treebank includes Swedish, German, Norwegian, Estonian and two versions of Danish, sentences from each language.

Sofie Parallel Treebank (cont'd)
The syntactic structure represented in the trees of different languages is not uniform:
– Danish: Discontinuous Grammar dependency treebank and VISL-style phrase structure treebank
– Swedish: dependency treebank
– German: NEGRA-style treebank
– Norwegian: phrase structure treebank
– Estonian: Penn-style phrase structure treebank
The representation format of trees is TIGER XML.
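To give an idea of what a TIGER XML sentence looks like, here is a small sketch that reads one sentence and prints it as a Penn-style bracketing. The element and attribute names (`s`, `graph`, `terminals`, `t`, `nonterminals`, `nt`, `edge`, `idref`, `root`) follow the TIGER XML format; the sentence, its labels and the function name are invented for illustration and are not taken from the treebank.

```python
import xml.etree.ElementTree as ET

# Invented toy sentence; not an actual tree from the Sofie treebank.
TIGER_SAMPLE = """
<s id="s1">
  <graph root="s1_500">
    <terminals>
      <t id="s1_1" word="Sofie" pos="NNP"/>
      <t id="s1_2" word="sleeps" pos="VBZ"/>
    </terminals>
    <nonterminals>
      <nt id="s1_501" cat="NP">
        <edge label="--" idref="s1_1"/>
      </nt>
      <nt id="s1_502" cat="VP">
        <edge label="--" idref="s1_2"/>
      </nt>
      <nt id="s1_500" cat="S">
        <edge label="--" idref="s1_501"/>
        <edge label="--" idref="s1_502"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def to_brackets(xml_text):
    """Render one TIGER XML sentence as a bracketed phrase structure tree."""
    sentence = ET.fromstring(xml_text)
    graph = sentence.find("graph")
    terminals = {t.get("id"): t for t in graph.iter("t")}
    nonterminals = {nt.get("id"): nt for nt in graph.iter("nt")}

    def render(node_id):
        if node_id in terminals:
            t = terminals[node_id]
            return f"({t.get('pos')} {t.get('word')})"
        nt = nonterminals[node_id]
        children = " ".join(render(e.get("idref")) for e in nt.findall("edge"))
        return f"({nt.get('cat')} {children})"

    return render(graph.get("root"))


bracketed = to_brackets(TIGER_SAMPLE)
```

Because nodes refer to each other by `idref` rather than by nesting, the same format can hold the phrase structure, NEGRA-style and dependency analyses listed above.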

Estonian part of Sofie treebank: how did we do it?
– Trees drawn on paper by K. Muischnek and H. Nigol.
– “Electronic” trees drawn with the ANNOTATE tool, using the Penn Treebank tagset, by H. Uibo and K. Kaljurand.
– Database of trees exported from ANNOTATE in NEGRA format.
– TigerRegistry and TigerSearch used to convert into TIGER XML.
Website of Sofie Parallel Treebank:

Sample trees from Sofie treebank Her begynte den dype skogen.

Straks Sofie hadde lukket porten bak seg, åpnet hun konvolutten.

Sofie Parallel Treebank – example from web-interface Sophie's father was the captain of a big oil tanker, and was away for most of the year.

Arborest
– Joint work with Dr. Eckhard Bick, University of Southern Denmark
– VISL-style experimental treebank
– Annotated for both function (S = subject, P = predicate, O = object, A = adverbial, STA = statement, QUE = question, etc.) and form (np, vp, pp, advp, adjp, fcl = finite clause, par = paratagma, etc.)
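The pairing of function and form on every node can be sketched as a small data structure. The `function:form` notation and the `=` depth markers are loosely modelled on VISL's vertical tree notation; the class, the toy sentence and the labels `H`, `n` and `v` are assumptions made for illustration and are not taken from Arborest.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    function: str                 # e.g. STA, S, P
    form: str                     # e.g. fcl, np
    word: Optional[str] = None    # set on leaves only
    children: List["Node"] = field(default_factory=list)


def render(node, depth=0):
    """Return the tree as lines of 'function:form', one '=' per depth level."""
    label = "=" * depth + f"{node.function}:{node.form}"
    if node.word is not None:
        label += "\t" + node.word
    lines = [label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Toy sentence with assumed labels (H = head, n = noun, v = verb).
tree = Node("STA", "fcl", children=[
    Node("S", "np", children=[Node("H", "n", word="Sofie")]),
    Node("P", "v", word="sleeps"),
])
lines = render(tree)
```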

Arborest (cont'd)
– Automatically generated from a sample of the CG-corpus (2500 sentences) with CG→PSG rules
– 149 sentences revised
– 1/3 of sentences correct
– CG→PSG rules are being improved
Webpage

Arborest – sample tree

What next?
– To enlarge all three syntactically annotated corpora.
– To improve the CG-to-PSG rules, so that an Estonian treebank can be built easily and semi-automatically.
– To create another, syntactic-semantic dependency treebank for Estonian, which will be semi-automatically generated from one of the existing experimental phrase structure treebanks.
→ How much semantic information can be derived from the syntactic dependency structure?