HISPAL: A Constraint Grammar Parser for Spanish
Eckhard Bick
University of Southern Denmark

HISPAL VISL/SDU

Introduction
➢ HISPAL is a morphological tagger and syntactic parser for free, running Spanish text
➢ exploits a cross-language unified descriptive system for grammatical categories (VISL)
➢ low-intensity project at the ISK, University of Southern Denmark, since 2001 (1999)
➢ continuous applicational feedback regarding teaching tools and corpus annotation
HISPAL relies on Constraint Grammar technology...

... which is not a new idea: Other CG systems
● Pure CG systems (high cost: large lexica, full morphological analysis, hand-written rules):
– English: ENGCG (Karlsson et al. 1995)
– Portuguese: PALAVRAS (Bick 1996)
– Norwegian: Oslo-Bergen-Tagger (Hagen, Johannessen, Nøklestad 2000)
– Danish: DanGram (Bick 2003)
● Hybrid systems (various cost-saving techniques):
– Relaxation Labelling (Padró 1996)
– µ-TBL (Lager 1999): machine learning, rule templates and rule ordering
– FrAG (Bick 2004): correction CG for probabilistic tagger

... but is done in novel ways: Cost-saving techniques for a non-hybrid CG
● lexicon-free morphological analyzer
● lexicon-bootstrapping from corpora
● grammar porting (Portuguese -> Spanish)
● corpus-based grammar tuning

Format 1: Dependency trees
Columns: word form, lemma, extra, PoS, morphology, syntactic function, dependency link

Word form   Lemma        PoS and morphology   Dependency link
$¿                                            #1->0
Cuáles      [cuál]       DET MF               #2->3
son         [ser]        V PR 3P IND          #3->0
los         [el]         DET M                #4->5
motivos     [motivo]     N M                  #5->3
que         [que]        SPEC MF              #6->7
han         [haber]      V PR 3P              #7->5
hecho       [hacer]      V PCP M              #8->7
resurgir    [resurgir]   V                    #9->8
este        [este]       DET M                #10->11
debate      [debate]     N M                  #11->9
$?                                            #12->0

What are the motives that have made this debate resurface?

Format 2: Constituent trees
SOURCE: Running text
1. ¿ Cuáles son los motivos que han hecho resurgir este debate
A1
QUE:fcl
¿
=Cs:pron-int("cuál" DET MF P)   Cuáles
=P:v-fin("ser" PR 3P IND VFIN)   son
=S:np
==DN:pron-dem("el" DET M P)   los
==H:n("motivo" M P)   motivos
==DN:fcl
===S:spec("que" MF SP)   que
===P:vp
====Vaux:v-fin("haber" PR 3P IND)   han
====Vm:v-pcp("hacer" M S)   hecho
===Od:fcl
====P:v-inf("resurgir")   resurgir
====S:np
=====DN:pron-dem("este" DET M S)   este
=====H:n("debate" M S)   debate
?

Anatomy of the HISPAL parser

The morphological analyzer
● full-form lexicon only for about 220 closed-class words
● use of affix classes with or without stem conditions
– '-aremos' -> '...ar' (lemma) V FUT 1P IND (verb, future, 1st person plural, indicative)
● but hypothetical stems are also suggested:
– [compraremo] ADJ M P
– [compraremo] N M P
– [comprar] V FUT 1P IND
– [comprarer] V PR/PS 1P IND
– [comprarar] V PR 1P SUBJ

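The affix-class idea can be illustrated with a minimal Python sketch. The table below is a hypothetical toy fragment, not the HISPAL affix file: it contains only the endings needed to reproduce the five readings listed on the slide for 'compraremos', and it ignores stem conditions entirely.

```python
from typing import List, NamedTuple


class Reading(NamedTuple):
    lemma: str
    tags: str


# Hypothetical affix classes: (word ending, lemma ending, tags).
# Only the entries needed for the slide's example are included.
AFFIX_CLASSES = [
    ("aremos", "ar", "V FUT 1P IND"),
    ("emos", "er", "V PR/PS 1P IND"),
    ("emos", "ar", "V PR 1P SUBJ"),
    ("s", "", "N M P"),
    ("s", "", "ADJ M P"),
]


def analyze(word: str) -> List[Reading]:
    """Suggest every reading whose ending matches, without consulting a lexicon."""
    readings = []
    for ending, lemma_ending, tags in AFFIX_CLASSES:
        # Require a non-empty stem in front of the ending.
        if word.endswith(ending) and len(word) > len(ending):
            stem = word[: -len(ending)]
            readings.append(Reading(stem + lemma_ending, tags))
    return readings


for reading in analyze("compraremos"):
    print(f"[{reading.lemma}] {reading.tags}")
```

Because no root lexicon is consulted, the sketch overgenerates in the same way as the slide shows: the plausible [comprar] is suggested alongside hypothetical stems such as [comprarer].
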
Weighting morphological candidates
● if one or more suggested readings have lexicon support for their roots, other readings are discarded
● longer endings are preferred to shorter ones: -anes -> án, -enes -> én rather than simple plural '-s' (with a root '...ane' or '...ene')
● recognizably analytical readings with lexicon support are preferred to heuristic ones: -idad, -itud, -ista, super-; for instance decir -> antedecir, bendecir, contradecir, desdecir, entredecir, interdecir, maldecir, predecir, redecir..., allowing also for productive derivation
● even without root-lexicon support, recognized affixes may allow better prediction of word class
● as a last resort, all inflectionally possible forms are passed on to the contextual disambiguation, in effect making the CG grammar part of the heuristic component of the morphological analyzer

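The first two preferences can be sketched as a simple filter. This is an illustration under assumptions: the `Candidate` type, the function name and the order in which the two preferences are combined are hypothetical, and the slide does not say how HISPAL actually ranks competing criteria.

```python
from typing import List, NamedTuple, Set


class Candidate(NamedTuple):
    lemma: str
    tags: str
    ending: str  # the inflexional ending that licensed this reading


def filter_candidates(cands: List[Candidate], lexicon: Set[str]) -> List[Candidate]:
    """Keep lexicon-supported readings if any; otherwise prefer the longest ending."""
    if not cands:
        return []
    supported = [c for c in cands if c.lemma in lexicon]
    if supported:
        # Readings with a known root win; all others are discarded.
        return supported
    # No lexicon support at all: keep the readings with the longest ending.
    # Ties all survive and are left to contextual disambiguation.
    longest = max(len(c.ending) for c in cands)
    return [c for c in cands if len(c.ending) == longest]


candidates = [
    Candidate("capitán", "N M P", "anes"),
    Candidate("capitane", "N M P", "s"),
]
print(filter_candidates(candidates, lexicon=set()))
```

With an empty lexicon the '-anes' reading is preferred over the simple plural '-s'; if the lexicon happened to contain one of the lemmas, that reading alone would survive.
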
The lexicon
1. Original version created by bootstrapping (2001)
● a hand-built closed-class lexicon for Spanish (pronouns, prepositions, conjunctions...)
● a Spanish affix file used together with the Portuguese morpho-chunker and dummy-roots
● a list of safe open-class word candidates, extracted from corpora using e.g. article-noun sequences and unambiguous verbal inflexions
● overgenerating, heuristic output from the Spanish morphological analyzer, using dummy-roots ('xxxa' N F, 'xxxo' ADJ, 'xxxar' V) to recognize open-class word candidates and their inflexion (nouns, verbs, adjectives...)
2. The combined seeding system was then run on a large body of text, adding morphological and PoS tags as well as lemma cohorts for each word
3. Disambiguation by running the Portuguese CG module

The lexicon 2
4. Use the surviving readings to generate new entries for the HISPAL lexicon
5. Reiterate the process, using new data and more and more “Spanish” versions of the CG rule set
6. Finally, a large portion of lexeme strings with more than one PoS entry were manually checked against published dictionaries, among them all cases of gender ambiguity for nouns (o guarda - a guarda).

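One ingredient of step 1, harvesting "safe" noun candidates from article-noun sequences, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the article list, the frequency threshold `min_count`, and the decision to drop words seen with both genders. It also ignores known pitfalls such as adjectives following the article or feminine nouns taking 'el'.

```python
import re
from collections import Counter, defaultdict
from typing import Dict

# Hypothetical seed: singular articles and the noun tag they suggest.
ARTICLE_TAGS = {"el": "N M", "un": "N M", "la": "N F", "una": "N F"}


def noun_candidates(text: str, min_count: int = 2) -> Dict[str, str]:
    """Collect words that follow an article often enough and with one gender only."""
    tokens = re.findall(r"\w+", text.lower())
    seen = defaultdict(Counter)  # word -> Counter of suggested tags
    for prev, word in zip(tokens, tokens[1:]):
        if prev in ARTICLE_TAGS and word not in ARTICLE_TAGS:
            seen[word][ARTICLE_TAGS[prev]] += 1
    candidates = {}
    for word, tags in seen.items():
        # "Safe" here means: a single gender, seen at least min_count times.
        if len(tags) == 1:
            tag, count = next(iter(tags.items()))
            if count >= min_count:
                candidates[word] = tag
    return candidates


sample = "El debate sigue. La casa es grande. El debate y la casa. El guarda y la guarda."
print(noun_candidates(sample))
```

Words that occur with both genders, like 'guarda' in the sample, are held back by this sketch, which mirrors the slide's point that gender-ambiguous nouns were left for manual checking.
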
[Figure: lexicon bootstrapping cycle. Components: raw corpora, multitagger, multitagged text, Constraint Grammar, unambiguously tagged text, lexicon]

Lemma distribution in the HISPAL lexicon

Secondary lexicon information
● Valency
– once the parser produced more reliable output, annotated corpora were used to extract verb valency frames, such as 'transitive verb' or 'reflexive verb'
– manually added valency potential for auxiliaries and some support verbs (ser, estar, dejar)
– systematic valency based on suffixes (e.g. -izar)
– however, most valency frames are incomplete, and most nouns and adjectives lack them altogether
● verbs: with 7530 valency patterns

● Semantic prototypes
– conceptually, some 160 semantic prototype categories are used for nouns (e.g. bird, container), in analogy with the PALAVRAS system
– however, only a few types have been implemented so far, notably systematic ones (e.g. '-ista' -> +HUM)
● nouns with semantic tags (10% well-checked)
● 1175 common person and place names
– corpus experiments to extract, for instance, the +TOP feature from corpora based on preposition dependency
– possible use of existing resources (EuroWordNet, SIMPLE)

Coverage of the HISPAL lexicon and morphological analyzer

Constraint Grammar parsing
● rule-based, reductionist, focus on disambiguation
● rules add, remove or select morphological, syntactic or other readings
● rules use context conditions of arbitrary distance and complexity (i.e. other words and tags in the sentence)
● rules are applied in a deterministic and sequential way, so removed information can't be recovered
– rules in batches, safe rules first
– last remaining reading can't be removed
– robust method that will assign readings even to very unconventional language input

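The reductionist mechanism can be sketched in a few lines of Python. This is a toy model, not the VISL CG engine: readings are plain tag sets, the helper names are hypothetical, and the two rules applied at the end are invented for the example rather than taken from the HISPAL grammar.

```python
from typing import Callable, FrozenSet, List

Reading = FrozenSet[str]      # one reading = a set of tags
Cohort = List[Reading]        # all readings still alive for one token
Sentence = List[Cohort]
Condition = Callable[[Sentence, int], bool]


def has(sentence: Sentence, i: int, offset: int, tag: str) -> bool:
    """True if the token at i+offset has at least one reading with the tag."""
    j = i + offset
    return 0 <= j < len(sentence) and any(tag in r for r in sentence[j])


def has_careful(sentence: Sentence, i: int, offset: int, tag: str) -> bool:
    """Careful variant: every remaining reading at i+offset must carry the tag."""
    j = i + offset
    return 0 <= j < len(sentence) and all(tag in r for r in sentence[j])


def select(sentence: Sentence, tag: str, condition: Condition) -> None:
    """Keep only readings with the tag, where the context condition holds."""
    for i, cohort in enumerate(sentence):
        if len(cohort) > 1 and condition(sentence, i):
            kept = [r for r in cohort if tag in r]
            if kept:
                cohort[:] = kept


def remove(sentence: Sentence, tag: str, condition: Condition) -> None:
    """Discard readings with the tag, but never the last remaining reading."""
    for i, cohort in enumerate(sentence):
        if len(cohort) > 1 and condition(sentence, i):
            kept = [r for r in cohort if tag not in r]
            if kept:
                cohort[:] = kept


# "la casa": article or pronoun, followed by noun or finite verb.
sentence = [
    [frozenset({"DET", "F", "S"}), frozenset({"PERS", "F", "3S", "ACC"})],
    [frozenset({"N", "F", "S"}), frozenset({"V", "PR", "3S", "IND"})],
]
# Rules run sequentially; the second one relies on the first having applied.
select(sentence, "DET", lambda s, i: has(s, i, 1, "N"))
remove(sentence, "V", lambda s, i: has_careful(s, i, -1, "DET"))
print(sentence)
```

The second rule only fires because the first has already made the preceding token unambiguous, which is why rule order and "safe rules first" matter in this design.
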
Some simple rule examples
● REMOVE VFIN IF (*-1C VFIN BARRIER CLB OR KC)
exploits the uniqueness principle: 1 finite verb per clause
● TARGET (PROP) IF (NOT -1 PRP)
syntactic potential of proper nouns
● SELECT IF (*-1 >>> OR KS BARRIER NON-PRE-N/ADV) (*1 VFIN BARRIER NON-ATTR)
clause-initial np's, followed by a finite verb, are likely to be subjects

Bootstrapping a Constraint Grammar
● Mature CGs consist of thousands of (manual!) rules, which cost several man-years if a full grammar is built from scratch
● A possible alternative: importing a CG from a related language: Portuguese -> Spanish
– also suggested for Catalan -> Spanish (unevaluated?):
● Why is CG porting possible at all?
– Unlike rewriting rules in a PSG, Constraint Grammar doesn't strive to describe a language in a complete and positive way
– Rather, rules focus on what is NOT possible (annotation through disambiguation)
– Therefore, superfluous rules won't hurt, and heuristic (Portuguese) rules can function as a harmless backup in the presence of newer, non-heuristic Spanish rules
– With compatible PoS tag conventions, the grammar will work at once, for free running input, and can be changed incrementally

Some specific changes
● Token- and lexeme-references in sets and rules were translated, i.e.
– structural words like prepositions or conjunctions: quando -> cuando (when), e -> y (and), ou -> o (or)
– semantically inspired lists (months, days of the week, units)
● Specific Spanish rules were added early in the rules file to cover phenomena like the use of the preposition a with (especially human) direct objects.
● Error-producing rules were traced and changed, replaced or deleted. Often, rules could be “repaired” by adding further context conditions, or by restricting the target set.

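Translating lexeme references can be pictured as a dictionary substitution over the grammar file. The sketch below is an assumption about how such a step could be automated, not a description of how the porting was actually done: the mapping holds only the three pairs named above, the function name is hypothetical, and the sample line merely imitates a CG set definition with invented contents.

```python
import re
from typing import Dict

# Portuguese -> Spanish structural words mentioned on the slide.
PT_TO_ES = {"quando": "cuando", "e": "y", "ou": "o"}


def port_lexeme_refs(rule_text: str, mapping: Dict[str, str] = PT_TO_ES) -> str:
    """Replace quoted lexeme references; unknown lexemes are left untouched."""

    def swap(match: "re.Match[str]") -> str:
        lexeme = match.group(1)
        return '"' + mapping.get(lexeme, lexeme) + '"'

    # Match "lexeme" but not "<wordform>" references or multi-word strings.
    return re.sub(r'"([^"<>\s]+)"', swap, rule_text)


print(port_lexeme_refs('LIST KS = "quando" "que" ;'))
```

Anything not covered by the mapping passes through unchanged, so the ported grammar stays runnable while the remaining references are translated incrementally.
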
Problems
● changes may appear unsatisfyingly piecemeal to a linguist
● it is difficult to tell whether a given error (from a corpus run) is Spanish-specific, or whether the original Portuguese rule was already at fault, so for now there is no back-porting trade-off between the 2 grammars
● also, because of the complex, reductionist rule interaction, it is dangerous to re-port the (faster-growing) Portuguese grammar once both systems have undergone individual changes
● differences in ambiguity classes, e.g.
– muito -> mucho/muy (1 : many)
– (many : 1)

Current grammar size
● 1418 morphological disambiguation rules
● 1249 mapping rules
● 1862 syntactic disambiguation rules

Performance of the HISPAL parser: global values
● test run on a manually revised gold corpus (2567 words, 3025 tokens), taken from the interview corpus
(1) Almost no morphological errors were found for correct PoS, implying little in-class ambiguity. This may be due in part to the fairly distinctive inflexional morphology of Spanish, but can also be explained by the use of underspecified tags for systematically ambiguous morphology (e.g. gender in '-ista' nouns: M/F).

Performance of the HISPAL parser: specific syntactic functions

Comparison to other systems
● results were similar to those reported for other languages (English, Karlsson et al. 1995; Norwegian, Hagen et al. 2000; Danish, Portuguese and French, Bick 2003 & 2004)
● no data published for other Spanish, CG-comparable systems
– Connexor's Machinese
– Freeling [Atserias et al & 2006]
● syntactic accuracy (95-96%) compares favourably to the syntactic edge label accuracy of the best performing system in the CoNLL-X shared task on machine-learned dependency parsing (90.4% on the Cast3LB treebank)
– up-side: the CoNLL systems got hand-corrected PoS for free
– down-side: the HISPAL gold standard corpus was built by manually revising parser output, thus introducing a possible bias in favour of the parser in ambiguous cases

DeepDict
● syntactically analyzed corpus (dependency links and functions)
– Spanish Wikipedia (Nov. 2005): ca. 22 M words
– Spanish section of Europarl: ca. 27 M words
● lemmatization, “normalization” (passives, numbers, names)
● extraction of mother-daughter relations, depgrams, not ngrams (cf. Adam Kilgarriff's Sketch Engine)
– N A + ADJ (gravemente enfermo)
– V (ganar ktp, V + PRP (creer en)
● co-occurrence measure: p(AB) / (p(A) * p(B))
● graphical interface

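The co-occurrence measure divides the observed probability of a mother-daughter pair by the probability expected if the two lemmas were independent. A minimal sketch follows, assuming that all three probabilities are estimated as relative frequencies over the extracted pairs; the slides do not specify the estimation, and the function name and sample pairs are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (mother lemma, daughter lemma)


def association_scores(pairs: List[Pair]) -> Dict[Pair, float]:
    """Score each depgram as p(AB) / (p(A) * p(B)) using relative frequencies."""
    total = len(pairs)
    pair_counts = Counter(pairs)
    mother_counts = Counter(mother for mother, _ in pairs)
    daughter_counts = Counter(daughter for _, daughter in pairs)
    scores = {}
    for (mother, daughter), n in pair_counts.items():
        p_ab = n / total
        p_a = mother_counts[mother] / total
        p_b = daughter_counts[daughter] / total
        scores[(mother, daughter)] = p_ab / (p_a * p_b)
    return scores


pairs = [("creer", "en"), ("creer", "en"), ("creer", "que"), ("pensar", "en")]
for pair, score in association_scores(pairs).items():
    print(pair, round(score, 3))
```

Scores above 1 mark pairs that co-occur more often than chance would predict, scores below 1 the opposite; on realistic corpus sizes a frequency cut-off is usually advisable, since rare pairs get inflated scores.
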
Outlook
● more corpus work, currently focussing on:
– Europarl (27.2 M words)
– Wikipedia (22.3 M words)
– news texts (2 M words)
● lexicon completion:
– complete manual revision by a native speaker
– corpus-based completion of valency patterns
– (manual?) completion of semantic prototype ontology
● exchange format for comparison with other Spanish annotation projects
● more stringent evaluation

Tools and documentation:
Teaching games:
Corpora:
DeepDict:
Contact: