Language Resources and Tools for the Creation of a Bulgarian Treebank Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff.


Language Resources and Tools for the Creation of a Bulgarian Treebank
Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff
BulTreeBank Project, LML, Bulgarian Academy of Sciences (www.bultreebank.org)
Workshop on Balkan Language Resources and Tools, November 2003, Thessaloniki, Greece

Plan of the talk
- Preliminary Notes
- BulTreeBank Language Resources and Tools
- The integration architecture of the resources and tools
- Conclusion and Future work

Financial Support
- BulTreeBank is a joint project between the Seminar für Sprachwissenschaft, Eberhard-Karls-Universität, Tübingen, Germany, and the Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Sofia, Bulgaria
- The project is funded by the Volkswagen-Stiftung, Germany

Expected Results
- A set of Bulgarian sentences marked up with detailed syntactic information
- A core set of sentences designated inside the treebank
- A linguistically interpreted text archive for Bulgarian
- A reliable partial grammar for automatic parsing of phrases in Bulgarian
- Software modules for compiling, manipulating and exploring the language resources

Preliminary notes (1)
We rely on two prerequisites during the process of our treebank creation:
- integration of the pre-processing components
- an adequate annotation scheme

Preliminary notes (2)
Integration is performed with the help of the following techniques:
- Looking-forward strategy
  - Adaptive mechanism
  - Additive mechanism
- Looking-backward strategy
- Creation of a gold standard

Language Resources
- Text archive
- Morphological dictionary
- Gazetteers
- Valence dictionary
- Semantic dictionary
- Treebank

The BulTreeBank Text Archive
- A collection of linguistically interpreted texts from different genres (target size: 100 million words)
- About 72 million running words are converted into XML documents, marked up in conformance with the TEI guidelines
- 10 million running words are morphologically analyzed
- Over words are morphosyntactically disambiguated by hand

The morphological dictionary
- Published as a book: Popov, Simov and Vidinska, 1998
- It covers the grammatical information of about lexemes ( word forms) and serves as a basis for the morphological analyzer
- The problem of unknown words: open classes (names, abbreviations) and derivational models (diminutives etc.)

The Gazetteers
- Gazetteers of names consisting of words: Bulgarian and foreign person names, locations from the whole world, organizations, and others
- Gazetteers of the most frequent abbreviations, consisting of 1500 acronyms and graphical abbreviations
- Gazetteers of the 300 most frequent introductory expressions and parentheticals. This is considered to be a step towards a basic list of collocations

The Valence Dictionary
- It consists of 1000 verbs and their valence frames
- The frames of the most frequent verbs are compared to the corpus data and revised where necessary (new frames added, old ones deleted or made more fine-grained)
- The semantic restrictions over the arguments are extracted and matched against the SIMPLE ontology (see the Semantic Dictionary)

Lexical Entry of the Valence Dictionary
- Verb, its transitivity and aspect
- Meaning
  I. Frame (the arguments that the verb requires): S(ubject) + P(redicate) + O2(indirect object) | C(lause)
  II. Morphology of the verb's arguments: S(ubject)=N,PerPron
  III. Semantics of the arguments: S(ubject) is a person
  IV. Examples of the verb's usage
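The entry layout above can be pictured as a small record type. The following is only a minimal sketch under stated assumptions: the class names, field names, and the `allows` helper are hypothetical, the placeholder verb and aspect value are invented for illustration, and reading "O2 | C" as two alternatives for the third slot is an interpretation of the slide, not a documented format of the BulTreeBank dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ValenceFrame:
    # Each slot lists its alternatives, e.g. [["S"], ["P"], ["O2", "C"]]
    # for "S + P + O2 | C" (assumed reading of the "|" on the slide).
    slots: List[List[str]]
    # Allowed morphological realisations per argument, e.g. {"S": ["N", "PerPron"]}.
    morphology: Dict[str, List[str]] = field(default_factory=dict)
    # Semantic restriction per argument, e.g. {"S": "person"}.
    semantics: Dict[str, str] = field(default_factory=dict)
    # Usage examples for the frame.
    examples: List[str] = field(default_factory=list)


@dataclass
class ValenceEntry:
    verb: str
    transitive: bool
    aspect: str  # hypothetical values such as "perfective" / "imperfective"
    meaning: str
    frames: List[ValenceFrame] = field(default_factory=list)


def allows(frame: ValenceFrame, argument: str, category: str) -> bool:
    """True if the frame lists `category` as a realisation of `argument`."""
    return category in frame.morphology.get(argument, [])


# Placeholder entry: "VERB" stands in for a real lexical item.
frame = ValenceFrame(
    slots=[["S"], ["P"], ["O2", "C"]],
    morphology={"S": ["N", "PerPron"]},
    semantics={"S": "person"},
)
entry = ValenceEntry(verb="VERB", transitive=True, aspect="perfective",
                     meaning="(placeholder)", frames=[frame])
```

Keeping morphology and semantics as per-argument maps makes it straightforward to compare a frame against corpus evidence one argument at a time.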

The Semantic Dictionary
- Classification of the most frequent nouns (3 000 nouns) with respect to the ontological hierarchy of SIMPLE, without specifying the synonymic relations between them
- The proper names from the gazetteers are also mapped to the ontological hierarchy of SIMPLE

The Treebank
- Core set of sentences (1 500 sentences): extracted mainly from Bulgarian grammars and processed manually, hence of the highest quality
- Treebank (6 000 sentences): extracted mainly from the corpus and pre-processed automatically before being treated manually

Core set of sentences: Example of a Pragmatic Adjunct

A Corpus Sentence: an example of dependents realisation

The Tools
- Morphological analyzer
- Disambiguator(s)
- Partial grammars
  - sentence splitter
  - named-entity recognition module
  - chunkers

Morphological Analyzer
- Assigns all possible analyses to the tokens
- Implemented in the CLaRK System as a regular grammar
- Works together with the ‘token classification’ strategy and with the gazetteers
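The first point, assigning every possible analysis to each token, can be illustrated with a toy lookup. This is a sketch only: the actual analyzer is a regular grammar inside the CLaRK System, whereas the dictionary lookup, the `LEXICON` name, and the `analyze` function here are hypothetical stand-ins. The word forms and tags are the ones shown in the talk's own disambiguation example.

```python
from typing import Dict, List, Tuple

# Toy lexicon: word form -> every analysis listed for it.
LEXICON: Dict[str, List[str]] = {
    "човек": ["Ncmsi"],
    "с": ["R"],
    "опит": ["Ncmsi", "Vppt+cv--smi"],
    "и": ["C"],
    "богато": ["Ansi", "D"],
    "минало": ["Ansi", "Ncnsi", "Vppt+caosni"],
}


def analyze(tokens: List[str]) -> List[Tuple[str, List[str]]]:
    """Pair each token with all its analyses; unknown tokens get an empty list."""
    return [(tok, list(LEXICON.get(tok.lower(), []))) for tok in tokens]


analyses = analyze(["Човек", "с", "опит", "xyz"])
```

Tokens that come back with an empty list correspond to the unknown-word problem, which is where token classification and the gazetteers would take over.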

Disambiguator(s)
- Rule-based disambiguator: a preliminary version of a rule-based morpho-syntactic disambiguator, encoded as a set of constraints within the CLaRK System, with 80 % coverage
- Neural-network-based disambiguator (Simov and Osenova 2001). Its accuracy is % for part-of-speech and % for complete morpho-syntactic disambiguation

After the MorphoSyntactic Analysis and Disambiguation
Човек с опит и богато минало (‘A man with experience and a rich past’)

Token    All analyses              Selected
Човек    Ncmsi                     Ncmsi
с        R                         R
опит     Ncmsi;Vppt+cv--smi        Ncmsi
и        C                         C
богато   Ansi;D                    Ansi
минало   Ansi;Ncnsi;Vppt+caosni    Ncnsi
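The example above reads as triples of token, candidate analyses separated by semicolons, and the analysis kept after disambiguation. A minimal sketch of reading that layout follows; the function names are hypothetical, and the flat whitespace-separated string is simply the slide text, not a documented CLaRK output format.

```python
from typing import List, Tuple

SLIDE_TEXT = (
    "Човек Ncmsi Ncmsi с R R опит Ncmsi;Vppt+cv--smi Ncmsi "
    "и C C богато Ansi;D Ansi минало Ansi;Ncnsi;Vppt+caosni Ncnsi"
)

Row = Tuple[str, List[str], str]


def parse_rows(text: str) -> List[Row]:
    """Split the text into (token, candidate analyses, selected analysis) rows."""
    fields = text.split()
    if len(fields) % 3 != 0:
        raise ValueError("expected groups of three fields per token")
    return [
        (fields[i], fields[i + 1].split(";"), fields[i + 2])
        for i in range(0, len(fields), 3)
    ]


def ambiguous_tokens(rows: List[Row]) -> List[str]:
    """Tokens that received more than one analysis before disambiguation."""
    return [tok for tok, cands, _ in rows if len(cands) > 1]


def selection_is_consistent(rows: List[Row]) -> bool:
    """Check that every selected analysis is one of the candidates."""
    return all(selected in cands for _, cands, selected in rows)


rows = parse_rows(SLIDE_TEXT)
```

A consistency check of this kind is a cheap sanity test for hand-disambiguated data, since a disambiguator should only ever choose among the analyses the analyzer proposed.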

Named-entity recognition
Based on the information from the gazetteers and on RE rules:
- numerical expressions
- names
- abbreviations
- special symbols
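Combining gazetteer lookup with regular-expression rules might look roughly like the sketch below. Everything in it is an assumption made for illustration: the two gazetteer entries (taken from the talk's own example text), the `NUM` and `ABBR` labels, both regular expressions, and the policy of letting gazetteer matches take priority over rule matches. The project's actual rules are CLaRK grammars and are not reproduced here.

```python
import re
from typing import List, Tuple

# Hypothetical gazetteer: surface string -> entity label.
GAZETTEER = {"България": "NEloc", "Димитър Калчев": "NEpers"}

# Hypothetical RE rules for numerical expressions and all-caps acronyms.
RE_RULES = [
    ("NUM", re.compile(r"\d+(?:[.,]\d+)?")),
    ("ABBR", re.compile(r"\b[А-ЯA-Z]{2,}\b")),
]

Span = Tuple[int, int, str, str]  # (start, end, text, label)


def _overlaps(start: int, end: int, spans: List[Span]) -> bool:
    return any(start < e and s < end for s, e, _, _ in spans)


def recognize(text: str) -> List[Span]:
    """Return non-overlapping entity spans, gazetteer matches first."""
    spans: List[Span] = []
    # Longest gazetteer entries first, so multi-word names win.
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for m in re.finditer(re.escape(name), text):
            if not _overlaps(m.start(), m.end(), spans):
                spans.append((m.start(), m.end(), name, GAZETTEER[name]))
    for label, pattern in RE_RULES:
        for m in pattern.finditer(text):
            if not _overlaps(m.start(), m.end(), spans):
                spans.append((m.start(), m.end(), m.group(), label))
    return sorted(spans)


found = [(t, lab) for _, _, t, lab in recognize("Бъдеще за България Димитър Калчев 2003")]
```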

After the application of Gazetteers Бъдеще Ncnsi за R България Ncfsi Димитър Калчев

Chunkers: General Assumptions
- Deal with non-recursive constituents
- Rely on a clear-indicator strategy
- Delay the attachment decisions
- Ignore semantic information
- Aim at accuracy, not coverage

Chunkers
- NP chunker
  - after-preposition NPs
  - “sure” non-recursive NPs
- VP chunker
  - Analytical wordforms
  - “Da” constructions
  - Verb clitics
- PP chunker, AP chunker, Clausal chunker

After the application of some Chunk Grammars
- Common NP chunks
  - [един човек] от [града] (‘one man from town-the’)
- Name NP chunks: NEpers, NEloc etc.
  - [Министерство на културата] (‘Ministry of Culture’)
- Complex NP chunks
  - [нашето [Министерство на културата]] (‘our Ministry of Culture’)
- Analytical verb forms
  - [да [му я даде]] (‘to him her give-3p, sg’, i.e. ‘to give it to him’)
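The common NP chunk in the first example can be reproduced with a very small non-recursive chunker. The sketch below is an illustration under assumptions, not the project's chunk grammar: the coarse tags `M` (numeral), `A` (adjective), `N` (noun) and `R` (preposition) are simplified stand-ins for the full morphosyntactic tags, and the single rule "optional modifiers followed by a noun" is hypothetical.

```python
from typing import List, Tuple


def np_chunks(tagged: List[Tuple[str, str]]) -> str:
    """Bracket maximal runs of modifiers (M, A) that end in a noun (N)."""
    out: List[str] = []
    i = 0
    while i < len(tagged):
        j = i
        # Skip over any pre-nominal modifiers.
        while j < len(tagged) and tagged[j][1] in ("M", "A"):
            j += 1
        if j < len(tagged) and tagged[j][1] == "N":
            # Modifiers plus noun head form one non-recursive chunk.
            out.append("[" + " ".join(w for w, _ in tagged[i:j + 1]) + "]")
            i = j + 1
        else:
            # No noun head follows: leave the token outside any chunk.
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)


chunked = np_chunks([("един", "M"), ("човек", "N"), ("от", "R"), ("града", "N")])
```

The preposition is left outside the brackets, which mirrors the slide's policy of delaying attachment decisions rather than building a PP at this stage.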

Integration of the resources and tools
- The order of application
- Mutual dependence
- Quantitative and qualitative expansion
- The principle of cascadedness
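The order of application and the cascade principle can be sketched as a chain of stages, each consuming the previous stage's output. This is a generic illustration only: the `run_cascade` helper, the dictionary-based document, and the two toy stages are hypothetical and far simpler than the CLaRK-based tools described in the talk.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def run_cascade(doc: Document, stages: List[Stage]) -> Document:
    """Apply the stages left to right, feeding each result to the next stage."""
    return reduce(lambda current, stage: stage(current), stages, doc)


def split_sentences(doc: Document) -> Document:
    # Toy sentence splitter: break on full stops only.
    out = dict(doc)
    out["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return out


def tokenize(doc: Document) -> Document:
    # Toy tokenizer: depends on the sentence splitter having run first.
    out = dict(doc)
    out["tokens"] = [s.split() for s in doc["sentences"]]
    return out


result = run_cascade({"text": "Един човек. От града."}, [split_sentences, tokenize])
```

Running the stages in the opposite order fails with a `KeyError`, because the tokenizer relies on the `sentences` layer; that is a small-scale picture of the mutual dependence between the tools.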

Conclusion
- We described a set of basic language resources that are necessary for the creation of a Bulgarian treebank
- We outlined our tasks in the context of a ‘less-processed’ language (variety and flexibility of LRs and tools)
- We showed that the creation of one type of resource (in our case, the treebank) can trigger the successful creation of other types of resources

Future tasks
- to use the LRs and tools as separate modules for applications such as information retrieval and information extraction
- to extend the basic language resources into a more elaborate set, richer in information and relations
- to continue testing and validating the resources
- to invest more in their evaluation