HISPAL A Constraint Grammar Parser for Spanish Eckhard Bick University of Southern Denmark

HISPAL VISL/SDU
Introduction
➢ HISPAL is a morphological tagger and syntactic parser for free, running Spanish text
➢ exploits a cross-language unified descriptive system for grammatical categories (VISL)
➢ low-intensity project at the ISK, University of Southern Denmark, since 2001 (1999)
➢ continuous feedback from applications in teaching tools and corpus annotation
HISPAL relies on Constraint Grammar technology....

... which is not a new idea: Other CG systems
● Pure CG systems (high cost: large lexica, full morphological analysis, hand-written rules):
– English: ENGCG (Karlsson et al. 1995)
– Portuguese: PALAVRAS (Bick 1996)
– Norwegian: Oslo-Bergen-Tagger (Hagen, Johannessen, Nøklestad 2000)
– Danish: DanGram (Bick 2003)
● Hybrid systems (various cost-saving techniques):
– Relaxation Labelling (Padró 1996)
– µ-TBL (Lager 1999): machine learning, rule templates and rule ordering
– FrAG (Bick 2004): correction CG for probabilistic tagger

... but is done in novel ways: Cost-saving techniques for a non-hybrid CG
● lexicon-free morphological analyzer
● lexicon-bootstrapping from corpora
● grammar porting (Portuguese -> Spanish)
● corpus-based grammar tuning

Format 1: Dependency trees
Fields: word form, [lemma], extra tags, PoS, morphology, syntactic function, dependency link
$¿ #1->0
Cuáles [cuál] DET MF #2->3
son [ser] V PR 3P IND #3->0
los [el] DET M #4->5
motivos [motivo] N M #5->3
que [que] SPEC MF #6->7
han [haber] V PR 3P #7->5
hecho [hacer] V PCP M #8->7
resurgir [resurgir] V #9->8
este [este] DET M #10->11
debate [debate] N M #11->9
$? #12->0
What are the motives that have made this debate resurface?
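The dependency links in this format are easy to read programmatically. The following is a minimal Python sketch, assuming one token per line with a trailing `#n->m` field (0 marks the root); the regular expression and the function names are illustrative and not part of HISPAL.

```python
import re

# Matches a trailing dependency link such as "#2->3" (token 2, head 3).
LINK = re.compile(r"#(\d+)->(\d+)\s*$")

def read_links(lines):
    """Return {token_id: head_id} for every line that ends in '#n->m'."""
    links = {}
    for line in lines:
        match = LINK.search(line)
        if match:
            links[int(match.group(1))] = int(match.group(2))
    return links

def children(links, head):
    """Return the token ids that attach directly to `head`."""
    return sorted(tok for tok, h in links.items() if h == head)

sample = [
    "Cuáles [cuál] DET MF #2->3",
    "son [ser] V PR 3P IND #3->0",
    "los [el] DET M #4->5",
]
links = read_links(sample)
```

Here `children(links, 3)` lists the dependents of "son", and a head of 0 identifies a root token.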

Format 2: Constituent trees
SOURCE: Running text
1. ¿ Cuáles son los motivos que han hecho resurgir este debate
A1
QUE:fcl
¿
=Cs:pron-int("cuál" DET MF P) Cuáles
=P:v-fin("ser" PR 3P IND VFIN) son
=S:np
==DN:pron-dem("el" DET M P) los
==H:n("motivo" M P) motivos
==DN:fcl
===S:spec("que" MF SP) que
===P:vp
====Vaux:v-fin("haber" PR 3P IND) han
====Vm:v-pcp("hacer" M S) hecho
===Od:fcl
====P:v-inf("resurgir") resurgir
====S:np
=====DN:pron-dem("este" DET M S) este
=====H:n("debate" M S) debate
?

Anatomy of the HISPAL parser

The morphological analyzer
● full-form lexicon only for about 220 closed-class words
● use of affix-classes with or without stem conditions
– '-aremos' -> '...ar' (lemma) V FUT 1P IND (verb, future, 1st person plural, indicative)
● but hypothetical stems are also suggested:
– [compraremo] ADJ M P
– [compraremo] N M P
– [comprar] V FUT 1P IND
– [comprarer] V PR/PS 1P IND
– [comprarar] V PR 1P SUBJ
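The overgeneration shown for "compraremos" can be reproduced with a small affix table. This is only a sketch: the `AFFIXES` entries are reverse-engineered from the five readings listed above, and the real HISPAL affix file (with its stem conditions) is not shown in the slides.

```python
# Hypothetical affix classes: (word ending, lemma ending, tags).
# Only the '-aremos' -> '-ar' class is stated explicitly in the slides;
# the other entries are inferred from the listed candidate readings.
AFFIXES = [
    ("aremos", "ar", "V FUT 1P IND"),
    ("emos", "er", "V PR/PS 1P IND"),
    ("emos", "ar", "V PR 1P SUBJ"),
    ("s", "", "N M P"),
    ("s", "", "ADJ M P"),
]

def analyze(word):
    """Return every (hypothetical lemma, tags) reading licensed by an ending."""
    readings = []
    for ending, lemma_ending, tags in AFFIXES:
        if word.endswith(ending) and len(word) > len(ending):
            stem = word[: len(word) - len(ending)]
            readings.append((stem + lemma_ending, tags))
    return readings

candidates = analyze("compraremos")
```

Without a root lexicon, all five readings survive this step; choosing among them is left to the weighting heuristics and to contextual disambiguation.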

Weighting morphological candidates
● if one or more suggested readings have lexicon support for their roots, other readings are discarded
● longer endings are preferred to shorter ones: -anes -> án, -enes -> én rather than simple plural '-s' (with a root '...ane' or '...ene')
● recognizably analytical readings with lexicon support are preferred to heuristic ones: -idad, -itud, -ista, super-, for instance decir -> antedecir, bendecir, contradecir, desdecir, entredecir, interdecir, maldecir, predecir, redecir..., allowing also for productive derivation
● even without root-lexicon support, recognized affixes may allow better prediction of word class
● as a last resort, all inflectionally possible forms are passed on to the contextual disambiguation, in effect making the CG grammar part of the heuristic component of the morphological analyzer
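A rough Python sketch of how such a filter might combine lexicon support with the longest-ending preference. The order in which the criteria interact is an assumption, and `filter_readings`, the tuple layout and the toy lexicon are all hypothetical.

```python
def filter_readings(readings, lexicon):
    """readings: list of (lemma, tags, ending) tuples; lexicon: set of known roots.

    Assumed order: lexicon-supported readings win; among those, the longest
    ending is preferred; with no support at all, everything is passed on.
    """
    supported = [r for r in readings if r[0] in lexicon]
    if not supported:
        return readings  # last resort: leave the choice to the CG rules
    longest = max(len(r[2]) for r in supported)
    return [r for r in supported if len(r[2]) == longest]

readings = [
    ("capitane", "N M P", "s"),    # simple plural '-s' on a root '...ane'
    ("capitán", "N M P", "anes"),  # longer ending: -anes -> án
]
with_lexicon = filter_readings(readings, {"capitán"})
without_lexicon = filter_readings(readings, set())
```

With the root in the lexicon only the "capitán" reading survives; with an empty lexicon both readings are handed on to contextual disambiguation.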

The lexicon
1. Original version created by bootstrapping (2001)
● a hand-built closed-class lexicon for Spanish (pronouns, prepositions, conjunctions...)
● a Spanish affix file used together with the Portuguese morpho-chunker and dummy-roots
● a list of safe open-class word candidates, extracted from corpora using e.g. article-noun sequences and unambiguous verbal inflexions
● overgenerating, heuristic output from the Spanish morphological analyzer, using dummy-roots ('xxxa' N F, 'xxxo' ADJ, 'xxxar' V) to recognize open-class word candidates and their inflexion (nouns, verbs, adjectives...)
2. The combined seeding system was then run on a large body of text, adding morphological and PoS tags as well as lemma cohorts for each word
3. Disambiguation by running the Portuguese CG module

The lexicon 2
4. Use the surviving readings to generate new entries for the HISPAL lexicon
5. Reiterate the process, using new data and more and more “Spanish” versions of the CG rule set
6. Finally, a large portion of lexeme strings with more than one PoS entry were manually checked against published dictionaries, among them all cases of gender ambiguity for nouns (o guarda - a guarda).
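One of the seeding steps, collecting safe open-class candidates from article-noun sequences, might look roughly like this in Python. The article table, the frequency threshold and the function name are assumptions for illustration; a real extractor would need further filters (for example for adjectives following the article).

```python
from collections import Counter

# Hypothetical seed: definite articles with the gender they suggest.
ARTICLES = {"el": "M", "la": "F", "los": "M", "las": "F"}

def noun_candidates(tokens, closed_class, min_count=2):
    """Collect (word, gender) pairs seen at least `min_count` times after an article."""
    counts = Counter()
    for article, word in zip(tokens, tokens[1:]):
        if article in ARTICLES and word not in closed_class:
            counts[(word, ARTICLES[article])] += 1
    return {pair for pair, n in counts.items() if n >= min_count}

tokens = "la casa es grande y la casa es nueva".split()
seeds = noun_candidates(tokens, closed_class={"es", "y"})
```

Candidates gathered this way would then be tagged, disambiguated by the CG module and fed back into the lexicon in the next iteration.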

[Diagram: bootstrapping cycle. Labels: raw corpora, multitagger, lexicon, multitagged text, Constraint Grammar, unambiguously tagged text]

Lemma distribution in the HISPAL lexicon

Secondary lexicon information
● Valency
– once the parser produced more reliable output, annotated corpora were used to extract verb valency frames, such as 'transitive verb' and 'reflexive verb'
– manually added valency potential for auxiliaries and some support verbs (ser, estar, dejar)
– systematic valency based on suffixes (e.g. -izar)
– however, most valency frames are incomplete, and most nouns and adjectives lack them altogether
● verbs: with 7530 valency patterns

● Semantic prototypes
– conceptually, some 160 semantic prototype categories are used for nouns, e.g. = bird, = container, etc., in analogy with the PALAVRAS system
– however, only a few types have been systematically implemented, notably those that can be derived systematically (e.g. '-ista' -> +HUM)
● nouns with semantic tags (10% well-checked)
● 1175 common person and place names
– corpus experiments to extract, for instance, the +TOP feature from corpora based on preposition dependency
– possible use of existing resources (EuroWordNet, SIMPLE)

Coverage of the HISPAL lexicon and morphological analyzer

Constraint Grammar parsing
● rule-based, reductionist, focus on disambiguation
● rules add, remove or select morphological, syntactic or other readings
● rules use context conditions of arbitrary distance and complexity (i.e. other words and tags in the sentence)
● rules are applied in a deterministic and sequential way, so removed information can't be recovered
– rules in batches, safe rules first
– last remaining reading can't be removed
– robust method that will assign readings even to very unconventional language input
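The reductionist core, including the guarantee that the last reading survives, can be sketched in a few lines of Python. This is a simplification with hypothetical function names: context conditions are left out, and a cohort is modelled as a plain list of tag sets.

```python
def remove(cohort, tag):
    """REMOVE: drop readings carrying `tag`, but never the last remaining one."""
    kept = [reading for reading in cohort if tag not in reading]
    return kept if kept else cohort

def select(cohort, tag):
    """SELECT: keep only readings carrying `tag`, if there are any."""
    kept = [reading for reading in cohort if tag in reading]
    return kept if kept else cohort

# One ambiguous word: a noun reading and a finite-verb reading.
cohort = [{"N", "F", "S"}, {"V", "PR", "3S", "VFIN"}]
after_remove = remove(cohort, "VFIN")
after_second = remove(after_remove, "N")  # would empty the cohort, so nothing happens
```

Because every rule can only shrink a cohort down to a single reading, even very unconventional input still receives an analysis.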

Some simple rule examples
● REMOVE VFIN IF (*-1C VFIN BARRIER CLB OR KC)
exploits the uniqueness principle: 1 finite verb per clause
● TARGET (PROP) IF (NOT -1 PRP)
syntactic potential of proper nouns
● SELECT IF (*-1 >>> OR KS BARRIER NON-PRE-N/ADV) (*1 VFIN BARRIER NON-ATTR)
clause-initial np's, followed by a finite verb, are likely to be subjects
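To make the first rule concrete, here is a hand-coded Python sketch of `REMOVE VFIN IF (*-1C VFIN BARRIER CLB OR KC)`. It is not a CG interpreter: the data layout (a sentence as a list of cohorts, each a list of tag sets) and the helper names are assumptions, and the tags in the example are chosen for illustration.

```python
BARRIER = {"CLB", "KC"}  # clause boundary or coordinating conjunction

def all_have(cohort, tag):
    """Careful (C) test: every remaining reading carries the tag."""
    return all(tag in reading for reading in cohort)

def any_has(cohort, tags):
    """True if some reading carries one of the given tags."""
    return any(tags & reading for reading in cohort)

def apply_uniqueness(sentence):
    """Remove VFIN readings when an unambiguous finite verb precedes in the clause."""
    for i, cohort in enumerate(sentence):
        for j in range(i - 1, -1, -1):           # *-1: scan leftwards
            if all_have(sentence[j], "VFIN"):    # unambiguous finite verb found
                kept = [r for r in cohort if "VFIN" not in r]
                if kept:                         # never remove the last reading
                    sentence[i] = kept
                break
            if any_has(sentence[j], BARRIER):    # BARRIER stops the scan
                break
    return sentence

# "Juan come pasta": 'pasta' is ambiguous between noun and finite verb.
same_clause = apply_uniqueness([[{"PROP"}], [{"V", "VFIN"}], [{"N"}, {"V", "VFIN"}]])
# With a coordinator in between, the barrier blocks the rule.
blocked = apply_uniqueness([[{"V", "VFIN"}], [{"KC"}], [{"N"}, {"V", "VFIN"}]])
```

In the first sentence the verb reading of "pasta" is removed because "come" is already an unambiguous finite verb; in the second, the coordinating conjunction may start a new clause, so both readings are kept.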

Bootstrapping a Constraint Grammar
● Mature CGs consist of thousands of (manual!) rules, which will cost several man-years if a full grammar is built from scratch
● A possible alternative: importing a CG from a related language: Portuguese -> Spanish
– also suggested for Catalan -> Spanish (unevaluated?):
● Why is CG porting possible at all?
– Unlike rewriting rules in a PSG, Constraint Grammar doesn't strive to describe a language in a complete and positive way
– Rather, rules focus on what is NOT possible (annotation through disambiguation)
– Therefore, superfluous rules won't hurt, and heuristic (Portuguese) rules can function as a harmless backup in the presence of newer, non-heuristic Spanish rules
– With compatible PoS tag conventions, the grammar will work at once, for free running input, and can be changed incrementally

Some specific changes
● Token- and lexeme-references in sets and rules were translated, i.e.
– structural words like prepositions or conjunctions: quando -> cuando (when), e -> y (and), ou -> o (or)
– semantically inspired lists (months, days of the week, units)
● Specific Spanish rules were added early in the rules file to cover phenomena like the use of the preposition a with (especially human) direct objects.
● Error-producing rules were traced and changed, replaced or deleted. Often, rules could be “repaired” by adding further context conditions, or by restricting the target set.
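The translation of token and lexeme references could be partly automated. A minimal Python sketch, assuming that references appear as quoted strings (`"lemma"` or `"<wordform>"`) in the grammar file; the mapping holds only the three pairs named above, and the set name `KC-WORDS` is made up for the example.

```python
import re

# Portuguese -> Spanish structural words mentioned in the slides.
PT_TO_ES = {"quando": "cuando", "e": "y", "ou": "o"}

# A quoted reference, optionally wrapped in <...> for word forms.
REFERENCE = re.compile(r'("<?)([^"<>]+)(>?")')

def port_line(line):
    """Replace quoted Portuguese references with their Spanish counterparts."""
    def swap(match):
        word = match.group(2)
        return match.group(1) + PT_TO_ES.get(word, word) + match.group(3)
    return REFERENCE.sub(swap, line)

ported_set = port_line('LIST KC-WORDS = "e" "ou" ;')
ported_form = port_line('("<quando>")')
```

Tag names and unquoted material are left untouched, so references without a known equivalent still have to be checked by hand.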

Problems
● changes may appear unsatisfyingly piecemeal to a linguist
● it is difficult to tell whether a given error (from a corpus run) is Spanish-specific, or whether the original Portuguese rule was already at fault, so for now there is no back-porting trade-off between the two grammars
● also, because of the complex, reductionist rule intervention, it is dangerous to re-port the (faster-growing) Portuguese grammar once both systems have undergone individual changes
● differences in ambiguity classes, e.g.
– muito -> mucho/muy (1 : many)
– (many : 1)

Current grammar size
● 1418 morphological disambiguation rules
● 1249 mapping rules
● 1862 syntactic disambiguation rules

Performance of the HISPAL parser: global values
● test run on a manually revised gold corpus (2567 words, 3025 tokens), taken from the interview corpus
(1) Almost no morphological errors were found for correct PoS, implying little in-class ambiguity. This may be due in part to the fairly distinctive inflexional morphology of Spanish, but can also be explained by the use of underspecified tags for systematically ambiguous morphology (e.g. gender in '-ista' nouns: M/F).

Performance of the HISPAL parser: specific syntactic functions

Comparison to other systems
● results were similar to those reported for other languages (English: Karlsson et al. 1995; Norwegian: Hagen et al. 2000; Danish, Portuguese and French: Bick 2003 & 2004)
● no data published for other Spanish, CG-comparable systems
– Connexor's Machinese [
– Freeling [Atserias et al & 2006]
● syntactic accuracy (95-96%) compares favourably to the syntactic edge label accuracy of the best performing system in the CoNLL-X shared task on machine-learned dependency parsing (90.4% on the Cast3LB treebank)
– up-side: the CoNLL systems got hand-corrected PoS for free
– down-side: the HISPAL gold standard corpus was built by manually revising parser output, thus introducing a possible bias in favour of the parser in ambiguous cases

DeepDict
● syntactically analyzed corpus (dependency links and functions)
– Spanish Wikipedia (Nov. 2005): ca. 22 M words
– Spanish section of Europarl: ca. 27 M words
● lemmatization, “normalization” (passives, numbers, names)
● extraction of mother-daughter relations, depgrams, not ngrams (cf. Adam Kilgarriff's Sketch Engine)
– N A + ADJ (gravemente enfermo)
– V (ganar ktp, V + PRP (creer en)
● co-occurrence measure: p(AB) / (p(A) * p(B))
● graphical interface
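The co-occurrence measure can be illustrated with a short Python sketch. It reads the measure as p(AB) / (p(A) · p(B)) and estimates all probabilities as relative frequencies over the extracted (mother, daughter) pairs; that estimation scheme, the function name and the toy data are assumptions, not DeepDict's actual implementation.

```python
from collections import Counter

def cooccurrence(pairs):
    """Score each (mother, daughter) pair by p(AB) / (p(A) * p(B))."""
    total = len(pairs)
    pair_n = Counter(pairs)
    mother_n = Counter(mother for mother, _ in pairs)
    daughter_n = Counter(daughter for _, daughter in pairs)
    return {
        (m, d): (n / total) / ((mother_n[m] / total) * (daughter_n[d] / total))
        for (m, d), n in pair_n.items()
    }

# Toy depgrams: verb + preposition/complementizer dependents.
pairs = [("creer", "en"), ("creer", "en"), ("pensar", "en"), ("creer", "que")]
scores = cooccurrence(pairs)
```

Scores above 1 mean that a pair occurs more often than its parts would predict independently; on a corpus of realistic size a logarithm and a frequency threshold would typically be applied as well.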


Outlook
● more corpus work, currently focussing on:
– Europarl (27.2 M words)
– Wikipedia (22.3 M words)
– News texts (2 M words)
● lexicon completion:
– complete manual revision by a native speaker
– corpus-based completion of valency patterns
– (manual?) completion of semantic prototype ontology
● exchange format for comparison with other Spanish annotation projects
● more stringent evaluation

HISPAL VISL/SDU Tools and documentation: Teaching games: Corpora: DeepDict: Contact: