The Language Model in Bulgarian Treebank (BulTreeBank) Petya Osenova (Sofia) 26.03.2007, Prague.

Slides:



Advertisements
Similar presentations
Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 10: Natural Language Processing and IR. Syntax and structural disambiguation.
Advertisements

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
PRACTICE CLASS #10 (#11) /30 Complex Sentence PRACTICE CLASS #10 (#11) /30.
CILC2011 A framework for structured knowledge extraction and representation from natural language via deep sentence analysis Stefania Costantini Niva Florio.
 Christel Kemke 2007/08 COMP 4060 Natural Language Processing Feature Structures and Unification.
Language Resources and Tools for the Creation of a Bulgarian Treebank Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff.
Chapter 4 Syntax.
Dr. Abdullah S. Al-Dobaian1 Ch. 2: Phrase Structure Syntactic Structure (basic concepts) Syntactic Structure (basic concepts)  A tree diagram marks constituents.
The SALSA experience: semantic role annotation Katrin Erk University of Texas at Austin.
Language Data Resources Treebanks. A treebank is a … database of syntactic trees corpus annotated with morphological and syntactic information segmented,
Statistical NLP: Lecture 3
The Leaf Projection Path View of Parse Trees: Exploring String Kernels for HPSG Parse Selection Kristina Toutanova, Penka Markova, Christopher Manning.
MORPHOLOGY - morphemes are the building blocks that make up words.
Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.
Chapter 20: Natural Language Generation Presented by: Anastasia Gorbunova LING538: Computational Linguistics, Fall 2006 Speech and Language Processing.
NLP and Speech Course Review. Morphological Analyzer Lexicon Part-of-Speech (POS) Tagging Grammar Rules Parser thethe – determiner Det NP → Det.
Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra.
DS-to-PS conversion Fei Xia University of Washington July 29,
A System for A Semi-Automatic Ontology Annotation Kiril Simov, Petya Osenova, Alexander Simov, Anelia Tincheva, Borislav Kirilov BulTreeBank Group LML,
Almen sproglig viden og metode (General linguistics) IntroductiontotheStudyofSyntax DN´ CLM, engelsk tt NP NPP PNP DN´ NPP P NP Ø.
Linguistic Theory Lecture 2 Phrase Structure. What was there before structure? Classical studies: Classical studies: –Languages such as Latin Rich morphology.
Building the Valency Lexicon of Arabic Verbs Viktor Bielický Otakar Smrž LREC 2008, Marrakech, Morocco.
Chapter 4 Syntax Part II.
Prof. Erik Lu. MORPHOLOGY GRAMMAR MORPHOLOGY MORPHEMES BOUND FREE WORDS LEXICAL GRAMMATICAL NOUNS VERBS ADJECTIVES (ADVERBS) PRONOUNS ARTICLES ADVERBS.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Lecture 12: 22/6/1435 Natural language processing Lecturer/ Kawther Abas 363CS – Artificial Intelligence.
Learning a token classification from a large corpus (A case study in abbreviations) Petya Osenova & Kiril Simov BulTreeBank Project (
THE BIG PICTURE Basic Assumptions Linguistics is the empirical science that studies language (or linguistic behavior) Linguistics proposes theories (models)
SYNTAX Lecture -1 SMRITI SINGH.
From E-Content to E-Learning in Computational Linguistics Localisation of Teaching materials for less processed languages Kiril Simov *, Petya Osenova.
The Descriptive Grammar as a (Meta)Database Jeff Good University of Pittsburgh and Max Planck Institute for Evolutionary Anthropology.
A Cascaded Finite-State Parser for German Michael Schiehlen Institut für Maschinelle Sprachverarbeitung Universität Stuttgart
ENGLISH SYNTAX Introduction to Transformational Grammar.
1 Grammar Extraction and Refinement from an HPSG Corpus Kiril Simov BulTreeBank Project ( Linguistic Modeling Laboratory, Bulgarian.
Ideas for 100K Word Data Set for Human and Machine Learning Lori Levin Alon Lavie Jaime Carbonell Language Technologies Institute Carnegie Mellon University.
Linguistic Essentials
CSA2050 Introduction to Computational Linguistics Lecture 1 Overview.
CPE 480 Natural Language Processing Lecture 4: Syntax Adapted from Owen Rambow’s slides for CSc Fall 2006.
CSA2050 Introduction to Computational Linguistics Lecture 1 What is Computational Linguistics?
Rules, Movement, Ambiguity
CSA2050 Introduction to Computational Linguistics Parsing I.
Lecture 1 Lec. Maha Alwasidi. Branches of Linguistics There are two main branches: Theoretical linguistics and applied linguistics Theoretical linguistics.
Natural Language Processing Chapter 2 : Morphology.
Unit 8 Syntax. Syntax Syntax deals with rules for combining words into sentences, as well as with relationship between elements in one sentence Basic.
Syntactic Annotation of Slovene Corpora (SDT, JOS) Nina Ledinek ISJ ZRC SAZU
CPSC 422, Lecture 27Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27 Nov, 16, 2015.
Leonid Iomdin Institute for Information Transmission Problems, Russian Academy of Sciences
SYNTAX.
◦ Process of describing the structure of phrases and sentences Chapter 8 - Phrases and sentences: grammar1.
MENTAL GRAMMAR Language and mind. First half of 20 th cent. – What the main goal of linguistics should be? Behaviorism – Bloomfield: goal of linguistics.
Lec. 10.  In this section we explain which constituents of a sentence are minimally required, and why. We first provide an informal discussion and then.
SYNTAX.
King Faisal University جامعة الملك فيصل Deanship of E-Learning and Distance Education عمادة التعلم الإلكتروني والتعليم عن بعد [ ] 1 King Faisal University.
10/31/00 1 Introduction to Cognitive Science Linguistics Component Topic: Formal Grammars: Generating and Parsing Lecturer: Dr Bodomo.
Netgraph – a Tool for Searching in the Prague Dependency Treebank 2.0 Defence of the Doctoral Thesis, Prague, September 3 rd, 2008 Author: Mgr. Jiří Mírovský.
Beginning Syntax Linda Thomas
An Introduction to the Government and Binding Theory
Statistical NLP: Lecture 3
Kenneth Baclawski et. al. PSB /11/7 Sa-Im Shin
Natural Language Processing (NLP)
Improving a Pipeline Architecture for Shallow Discourse Parsing
BBI 3212 ENGLISH SYNTAX AND MORPHOLOGY
Introduction to Computational Linguistics
Linguistic Essentials
Natural Language Processing (NLP)
Linguistic aspects of interlanguage
Structure of a Lexicon Debasri Chakrabarti 13-May-19.
Artificial Intelligence 2004 Speech & Natural Language Processing
Natural Language Processing (NLP)
Presentation transcript:

The Language Model in Bulgarian Treebank (BulTreeBank) Petya Osenova (Sofia) , Prague

Outline of the Talk General overview Synopsis of Annotation principles The levels of linguistic knowledge representation From HPSG to dependency Handling some linguistic phenomena Outlook and Conclusions

General overview: statistics HPSG-based format: –Sentences: –Tokens: Dependency format: –Sentences: –Tokens:

General overview: encoded linguistic information Constituent structure (NP, VP, AP etc.) with crossing branches Dependency information (head-complement, head- adjunct, head-subject relations) Functional information (clauses, pragmatic elements, discontinuous elements) Encoded ellipsis and coreference relations Conversion to dependency structures exists (note that coreference and ellipsis are not presented in this format)

General overview: sources Bulgarian grammars (1750 sentences) Randomly selected sentences (2027 sentences) Whole articles/texts from newspapers and literature ( sentences)

Sentences from grammars: ambiguous interpretations

Sentence from random layer

Annotation Principles Theoretical aspects (HPSG language model) Implementational aspects (XML presentation)

HPSG-based Language Model: overview Linguistic objects Sort hierarchy (linguistic ontology) –Represents the main types of linguistic objects and their characteristics Grammar (theory) –HPSG Universal and Bulgarian Specific Principles –Bulgarian Lexicon

HPSG-based Language Model: Principles Head Feature Principle Valence Principle Adjunct Principle Semantic Principle

HPSG-based Language Model: hierarchy headed-phrase head-complement head-subject head-adjunct head-sem-adjunct head-pragmatic-adjunct head-filler non-headed-phrase

HPSG-based Language Model: constituency vs. dependancy HPSG separates the linear order from the constituent structure Each constituent structure reflects the dependency between its immediate constituents The realization of the dependants follows the sequence: complements > subject > adjuncts

Implementation XML additive trees Graphical counterparts

The Levels of Linguistic Knowledge Representation Wordforms Morphological information Lexical elements ( N, V, Prep ) Syntactic elements (PP) Named Entities Dependency reflection (VPA(djunct), NPC(omplement) Functional elements (Disc(ontinous), Pragmatic) Relations – coreference, discontinuity, ellipsis

Example tree

Lexical elements Simple: one orthographic word Complex: verbal complex (head with clitics, da-construction), numerals, subordinators, prepositions Foculizers: project lexical element, when combined with a lexical element

Examples of complex lexical signs

Syntactic elements Traditional domains: –NP –VP –AP –AdvP –PP

Functional elements Clauses – CL, CLDA, CLCHE, CLZADA, CLR, CLQ Sentence – S Markers of the coordination - Conj, ConjArg Markers of the elipsis – V-Elip, N-Elip Markers of discontinuity – DiscA, DiscE, DiscM Pragmatic elements - Pragmatic Non-immediate dominance - nid

Dependency reflection Explicit NPA, NPC, VPC, VPA, VPS, PP, APA, APC, APA, AdvPA, AdvPC Non-explicit CoordP, PP

The Dependency Representation No ellipsis presented Non-projective trees Two graphical views

The original tree

The dependency-arrow presentation

Dependency-leaf tree

Handling some linguistic phenomena Word order Coordination Ellipsis Coreference Pragmatic expressions and foculizers

Word order Two possibilities: permutations of elements that do not cause crossing branches effect discontinuity that violates the order of dependant realization

Permutations Петя посещава Прага Посещава Прага Петя Прага посещава Петя Петя Прага посещава

Discontinuities: higher dependant realized between the head and a lower dependant

Discontinuities: outer realization of an inner element (the head is lower) (extraction)

Discontinuities: the elements of two constituent structures are mixed (rare)

Coordination The decisive factors: the grammatical role of the conjuncts, not their syntactic label the valence of the conjuncts the order of dependants realization

Coordination: the realization of dependants matters

Coordination: one conjunct

Coordination: grammatical role matters

Coordination: valence of the conjuncts matters

Ellipsis Repaired within the sentence V-Elip, N-Elip, Prep-Elip, PP-Elip Attributes: equal, variant, negation Depending on broader context VD-Elip, ND-Elip, PPD-Elip Attributes: world knowledge, discourse, exists (only for VD-Elip)

Ellipsis within the sentence: attribute=equal

Ellipsis within the sentence: attribute=variant

Ellipsis within the sentence: attribute=negation

Ellipsis within broader context: attribute=discourse

Ellipsis within broader context: attribute=exists

Ellipsis within broader context: attribute=world knowledge

Pragmatic expressions Vocatives Modal adverbials Parenthetical elements Foculizers

Vocatives

Parenthetical elements

Foculizer

Coreference Types: -equality -member of a set -subset of a set Linguistic parameters: -pro-dropness -secondary predication -binding

Coreference: member of, pro- dropness, control

Next steps in the treebank development Towards a Semantic Lexicon connected to an Ontology (SIMPLE project) Phase 2 consists of two main tasks: –shallow semantic annotation, i.e. explication of semantic parameters of the head and its dependants –extension of co-reference relations, i.e. adding more types of relations to the existing ones (part of, bridging ones)

Conclusions Bulgarian Treebank exists in two formats now: HPSG-based and Dependancy-based It will be expanded in two directions: –Quantity (more sentences) –Semantic annotation

Accessability Link: Free use of: CLaRK system Dependancy format of Bulgarian Treebank Morphologically processed corpus of Bulgarian Facilities: Technical Reports Documentation