A Link Grammar for an Agglutinative Language

Ozlem Istek & Ilyas Cicekli
Bilkent University, Turkey

Outline
- Link Grammar Formalism
- Some Distinctive Features of Turkish Syntax
- The System Architecture of the Turkish Parser and Our Adapted Link Grammar Formalism
- Method for Handling the Syntactic Roles of Words with Derivations
- Evaluation
- Concluding Remarks
RANLP-2007

Link Grammar
- Link grammar is a formal grammatical system developed by Sleator and Temperley.
- The syntax of a language is defined by a grammar that includes the words and their linking requirements.
- The grammar is defined in a dictionary file, and each linking requirement of a word is expressed in terms of connectors.
- A given sentence is accepted by the system if the linking requirements of all the words are satisfied (satisfaction), the links connect all the words of the sentence (connectivity), none of the links between the words cross each other (planarity), and there is at most one link between any pair of words (exclusion).
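The structural conditions on a linkage can be checked mechanically. The sketch below is a minimal illustration and not part of the original system: links are assumed to be pairs of 0-based word positions, the function name `check_linkage` is hypothetical, and connectivity is read as all words being joined into a single linked structure. Whether each word's own connector formula is satisfied is not checked here.

```python
from itertools import combinations


def check_linkage(n_words, links):
    """Check planarity, exclusion and connectivity of a set of links.

    `links` holds (i, j) pairs of 0-based word positions (an assumed encoding).
    """
    pairs = [tuple(sorted(link)) for link in links]

    # Planarity: two links cross when one starts strictly inside the other
    # and ends strictly outside it.
    def crosses(a, b):
        (i, j), (k, l) = a, b
        return i < k < j < l or k < i < l < j

    planar = not any(crosses(a, b) for a, b in combinations(pairs, 2))

    # Exclusion: at most one link between any pair of words.
    exclusive = len(set(pairs)) == len(pairs)

    # Connectivity: every word is reachable from word 0 through the links.
    neighbours = {w: set() for w in range(n_words)}
    for i, j in pairs:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen = set()
    stack = [0] if n_words else []
    while stack:
        w = stack.pop()
        if w not in seen:
            seen.add(w)
            stack.extend(neighbours[w] - seen)
    connected = len(seen) == n_words

    return {"planarity": planar, "exclusion": exclusive, "connectivity": connected}
```

For a three-word sentence whose first and second words both link to the third, `check_linkage(3, [(0, 2), (1, 2)])` reports all three conditions as satisfied.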

Link Grammar – Example
The linking requirements of three Turkish words:
    yedi      : O- & S-;   (ate)
    kadın     : S+;        (the woman)
    portakalı : O+;        (the orange)
A linkage for a sentence containing these three words:
      +-----------S-----------+
      |             +----O----+
      |             |         |
    Kadın       portakalı    yedi
    The woman   the orange   ate
(The woman ate the orange)
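To make the matching of `+` and `-` connectors concrete, the following sketch links the three example words. It is a simplified greedy matcher under stated assumptions, not the link grammar parsing algorithm: formulas are plain conjunctions written as lists, disjunctions and planarity checks are omitted, and the names `LEXICON` and `link_words` are hypothetical. Left-pointing connectors are matched from nearest to farthest word in the order they are listed.

```python
# Connector lists taken from the three example entries; "+" points right, "-" points left.
LEXICON = {
    "yedi": ["O-", "S-"],
    "kadın": ["S+"],
    "portakalı": ["O+"],
}


def link_words(words, lexicon):
    """Return a list of (left_index, right_index, label) links, or None on failure."""
    pending = []  # pending[i]: unused right-pointing connector labels of word i
    links = []
    for j, word in enumerate(words):
        connectors = lexicon[word]
        limit = j  # each further left connector must reach a strictly farther word
        for conn in connectors:
            if not conn.endswith("-"):
                continue
            label = conn[:-1]
            for i in range(limit - 1, -1, -1):
                if label in pending[i]:
                    pending[i].remove(label)
                    links.append((i, j, label))
                    limit = i
                    break
            else:
                return None  # no partner found for this left connector
        pending.append([c[:-1] for c in connectors if c.endswith("+")])
    # Every right-pointing connector must have been consumed.
    if any(pending):
        return None
    return links
```

Calling `link_words(["kadın", "portakalı", "yedi"], LEXICON)` yields one O link from the object to the verb and one S link from the subject to the verb, matching the linkage drawn on the slide.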

Turkish Syntax
- The basic word order is SOV, but the order of constituents may change according to the discourse context.
- Turkish is head-final: modifiers precede the modified item.
  - An adjective (modifier) precedes the head noun (modified item) in a noun phrase.
  - In the basic word order of the sentence, the subject and the object (modifiers) precede the verb (modified item).
- Although the head-final property can be violated at the level of the major constituents (S, O, V) of a sentence, it is preserved at sub-clause levels and in smaller syntactic structures.
    kırmızı  şapkalı   kız
    red      with hat  girl
    (the girl with the red hat)

Turkish Syntax (cont.)
- Turkish is agglutinative.
- Words can take many derivational suffixes, and each derived form can take its own inflectional suffixes.
- Inflectional suffixes have important grammatical roles.
- There is a significant amount of interaction between syntax and morphotactics.
    uygarlaştı    (He got civilized.)
    uygar-laş-tı
    uygar+Noun+A3sg+Pnon+Nom^DB+Verb+Become+Pos+Past+A3sg

Motivation for New Formalism
- In the standard link grammar formalism, linking requirements are defined for individual words.
- When we consider all possible derivations and inflections of Turkish words, the number of possible word forms is huge.
- Words in the same category behave similarly at the syntactic level.
- We therefore define linking requirements for classes of words and their inflections, and treat derivations as separate words.

System Architecture of Turkish Parser
Input Sentence
  → Morphological Analysis
  → Stripping Lexical Parts
  → Separating Derivation Boundaries
  → Create Sentence List
  → Parse Sentences with Link Grammar (using the Linking Requirements for Turkish Word Classes and Derivations)
  → All possible linkages

System Architecture (cont.)
Morphological Analysis: All the words in the input sentence are analyzed by a fully functional Turkish morphological analyzer.
    oku → oku+Verb+Pos+Past+A2sg (read)
    uygarlaşmak (to get civilized) → uygar+Noun+A3sg+Pnon+Nom^DB+Verb+Become+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom
Stripping Lexical Parts: The lexical parts of the words are removed for all types of words except conjunctions, because the Turkish link grammar is designed for classes of word types and their feature structures.
    oku+Verb+Pos+Past+A2sg → Verb+Pos+Past+A2sg
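The stripping step can be sketched as a small string operation. This is an illustration under assumptions rather than the authors' code: the function name `strip_lexical` is hypothetical, the root is assumed to be everything before the first `+`, and conjunctions are assumed to carry the tag `Conj` (the slides do not show the analyzer's tag for conjunctions).

```python
def strip_lexical(analysis: str) -> str:
    """Drop the root from a morphological analysis, keeping conjunctions intact."""
    root, _, features = analysis.partition("+")
    # The first tag after the root is taken to be the part of speech.
    first_tag = features.split("+", 1)[0].split("^", 1)[0]
    if first_tag == "Conj":  # assumed tag name for conjunctions
        return analysis
    return features
```

Applied to the example above, `strip_lexical("oku+Verb+Pos+Past+A2sg")` gives `Verb+Pos+Past+A2sg`, while an analysis tagged `Conj` is returned unchanged.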

System Architecture (cont.)
Separating Derivation Boundaries: Words are split at derivational boundaries, and the part of speech tag of each derived form is marked to indicate its position in the word. Each token starts with a part of speech tag together with a position mark and continues with its inflectional feature structures.
    Noun+A3sg+P1pl+Loc ^DB+Adj+Rel ^DB+Noun+Zero+A3sg+Pnon+Gen
    → NounRoot+A3sg+P1pl+Loc   AdjDB   NounDBEnd+A3sg+Pnon+Gen
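A possible implementation of this tokenization is sketched below. It is inferred from the single example on the slide and is only an assumption about the general rule: the helper name `split_derivations` is hypothetical, and the tag that names the derivation type (such as `Rel` or `Zero`) is assumed to be dropped from every derived segment, as it is in the example.

```python
def split_derivations(features: str) -> list[str]:
    """Split a stripped analysis at ^DB and mark each token's position."""
    segments = [s.strip().lstrip("+") for s in features.split("^DB")]
    if len(segments) == 1:
        return segments  # underived word: plain part of speech tag is kept
    tokens = []
    last = len(segments) - 1
    for idx, segment in enumerate(segments):
        parts = segment.split("+")
        pos = parts[0]
        if idx == 0:
            mark, inflections = "Root", parts[1:]
        else:
            mark = "DBEnd" if idx == last else "DB"
            # parts[1] is the derivation type tag, which the example omits.
            inflections = parts[2:]
        tokens.append("+".join([pos + mark] + inflections))
    return tokens
```

For the example analysis, the function returns `NounRoot+A3sg+P1pl+Loc`, `AdjDB` and `NounDBEnd+A3sg+Pnon+Gen`, in that order.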

System Architecture (cont.)
Parsing Sentences: Each representation of the sentence is fed into the parser and parsed with respect to the designed Turkish link grammar.
- The Turkish link grammar contains linking requirements for:
  - each part of speech tag, and
  - each part of speech tag followed by one of the strings “Root”, “DB”, or “DBEnd”.
- The linking requirement for a token depends on:
  - the part of speech tag of the token, and
  - the inflectional suffixes in that token.

Turkish Link Grammar
Linking requirements are defined for a part of speech tag together with its inflectional suffixes.
    Noun+A3sg+Pnon+Nom : linking requirements for nouns with +A3sg+Pnon+Nom inflections
    Noun+A3sg+Pnon+Acc : linking requirements for nouns with +A3sg+Pnon+Acc inflections
    Verb+Pos+Past+A1sg : linking requirements for verbs with +Pos+Past+A1sg inflections
    Verb+Pos+Past+A2sg : linking requirements for verbs with +Pos+Past+A2sg inflections

Linking Requirements for Derivations
- To preserve the syntactic roles played by the intermediate derived forms of a word, these forms are treated as separate words in the grammar.
- To indicate that they are intermediate derivations of the same word, all of them are linked with the special “DB” (derivational boundary) connector.
    Noun+A3sg+P1pl+Loc ^DB+Adj+Rel ^DB+Noun+Zero+A3sg+Pnon+Gen

              +-------DB-------+-------DB-------+
              |                |                |
    NounRoot+A3sg+P1pl+Loc   AdjDB   NounDBEnd+A3sg+Pnon+Gen
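The DB chain can be expressed as links between adjacent tokens of the same word. The sketch below assumes each word has already been split into its position-marked tokens. The function name `db_links` and the `(left, right, label)` link encoding are hypothetical choices for illustration.

```python
def db_links(token_groups):
    """Flatten per-word token lists and link adjacent tokens of one word with DB.

    Returns (flat_tokens, links), where each link is (left_index, right_index, "DB")
    over positions in the flattened token sequence.
    """
    flat, links = [], []
    for group in token_groups:
        start = len(flat)
        flat.extend(group)
        # Consecutive tokens of the same word are chained by DB connectors.
        for k in range(start, start + len(group) - 1):
            links.append((k, k + 1, "DB"))
    return flat, links
```

A word with three tokens therefore contributes two DB links, while an underived word contributes none, so DB links never join tokens of different words.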

Linking Requirements for Derivations (cont.)
A derived word consists of a root word, intermediate derived forms, and a last derived form.
- Root Word: contributes only the left linking requirements of that word, and is connected to the right with a DB connector.
- Intermediate Derived Forms: also contribute only the left linking requirements of that word, and are connected to the left and to the right with DB connectors.
- Last Derived Form: contributes both the left and the right linking requirements of that word, and is connected to the left with a DB connector.

Linking Requirements for Derivations (cont.)
For each part of speech tag, we need three more linking requirements for the three positions in derived words (root, intermediate, and last).
Example:
    Noun      Inflections : LeftLinkingRs & RightLinkingRs
    NounRoot  Inflections : LeftLinkingRs & DB+
    NounDB    Inflections : LeftLinkingRs & DB- & DB+
    NounDBEnd Inflections : LeftLinkingRs & RightLinkingRs & DB-
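The four entries per part of speech follow one template, which the sketch below generates as plain strings. The function name `position_entries` is hypothetical, and the placeholders `LeftLinkingRs` and `RightLinkingRs` stand in for real connector formulas, which are not given here. The root entry carries `DB+` because a root links rightward to the next derived form, while the last derived form carries `DB-` because it links leftward.

```python
def position_entries(pos: str, left: str, right: str) -> dict[str, str]:
    """Build the plain, root, intermediate and last entries for one part of speech."""
    return {
        pos: f"{left} & {right}",                    # underived word
        pos + "Root": f"{left} & DB+",               # links right to the next form
        pos + "DB": f"{left} & DB- & DB+",           # links to both neighbours
        pos + "DBEnd": f"{left} & {right} & DB-",    # links left to the previous form
    }
```

Calling `position_entries("Noun", "LeftLinkingRs", "RightLinkingRs")` reproduces the four example entries, keyed by `Noun`, `NounRoot`, `NounDB` and `NounDBEnd`.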

Evaluation
- We tested the developed Turkish parser with a set of 250 sentences.
- The average number of words per sentence is 5.19.
- The average number of parses per sentence is 7.49.
- For 84.31% of the sentences, the result set contains the correct parse.
- The average rank of the correct parse in the result set is 1.78.
- For 62.39% of the sentences, the first parse is the correct parse.
- For 80.94% of the sentences, one of the first three parses is correct.

Conclusions
- A Turkish grammar is developed in the link grammar formalism.
- The developed Turkish link grammar is not a lexical grammar. We used morphological feature structures and word classes.
- We preserved the syntactic roles of the intermediate derived forms of words by separating derived words at their derivational boundaries and treating each intermediate form as a distinct word.
- Our linking requirements are defined for morphological categories.
- Our current system does not use a POS tagger, and its addition will improve the performance in terms of both time and precision.