HISPAL A Constraint Grammar Parser for Spanish Eckhard Bick University of Southern Denmark

1 HISPAL A Constraint Grammar Parser for Spanish Eckhard Bick University of Southern Denmark eckhard.bick@mail.dk

2 HISPAL 28.9.2006 --- VISL/SDU Introduction
➢ HISPAL is a morphological tagger and syntactic parser for free, running Spanish text
➢ exploits a cross-language unified descriptive system for grammatical categories (VISL)
➢ low-intensity project at the ISK, University of Southern Denmark, since 2001 (1999)
➢ continuous feedback from its applications in teaching tools and corpus annotation
HISPAL relies on Constraint Grammar technology....

3 which is not a new idea... Other CG systems
● Pure CG systems (high cost: large lexica, full morphological analysis, hand-written rules):
– English: ENGCG (Karlsson et al. 1995)
– Portuguese: PALAVRAS (Bick 1996)
– Norwegian: Oslo-Bergen-Tagger (Hagen, Johannessen, Nøklestad 2000)
– Danish: DanGram (Bick 2003)
● Hybrid systems (various cost-saving techniques):
– Relaxation Labelling (Padró 1996)
– µ-TBL (Lager 1999): machine learning, rule templates and rule ordering
– FrAG (Bick 2004): correction CG for a probabilistic tagger

4 ... but is done in novel ways: Cost-saving techniques for a non-hybrid CG
● lexicon-free morphological analyzer
● lexicon bootstrapping from corpora
● grammar porting (Portuguese -> Spanish)
● corpus-based grammar tuning

5 Format 1: Dependency trees
Columns: word form, [lemma], extra tags, PoS, morphology, @syntactic function, #dependency link
$¿ #1->0
Cuáles [cuál] DET MF P @SC> #2->3
son [ser] V PR 3P IND VFIN @FS-QUE #3->0
los [el] DET M P @>N #4->5
motivos [motivo] N M P @ #5->3
que [que] SPEC MF SP @SUBJ> #6->7
han [haber] V PR 3P IND @FS-N< #7->5
hecho [hacer] V PCP M S @ICL-AUX< #8->7
resurgir [resurgir] V INF @ICL- #9->8
este [este] DET M S @>N #10->11
debate [debate] N M S @ #11->9
$ #12->0
What are the motives that have made this debate resurface?
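A minimal Python sketch of how one line in the dependency notation above could be read into a record. The regular expression, the `Token` type and the function name `parse_line` are illustrative assumptions, not part of HISPAL; the sketch only covers the fields visible on the slide (word form, bracketed lemma, tags, @function, #n->m link).

```python
import re
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    form: str
    lemma: Optional[str]
    tags: List[str]
    function: Optional[str]
    index: int
    head: int


# Hypothetical pattern for: form [lemma] TAG TAG ... @FUNCTION #index->head
LINE = re.compile(
    r"^(?P<form>\S+)"
    r"(?:\s+\[(?P<lemma>[^\]]+)\])?"      # optional lemma in brackets
    r"(?P<tags>(?:\s+[^@#\s]\S*)*)"       # PoS and morphology tags
    r"(?:\s+(?P<func>@\S+))?"             # optional syntactic function
    r"\s+#(?P<idx>\d+)->(?P<head>\d+)$"   # dependency link
)


def parse_line(line: str) -> Token:
    m = LINE.match(line.strip())
    if m is None:
        raise ValueError(f"not a dependency line: {line!r}")
    return Token(
        form=m["form"],
        lemma=m["lemma"],
        tags=m["tags"].split(),
        function=m["func"],
        index=int(m["idx"]),
        head=int(m["head"]),
    )


tok = parse_line("Cuáles [cuál] DET MF P @SC> #2->3")
print(tok.form, tok.lemma, tok.function, tok.index, tok.head)
```

A head value of 0 marks a token attached to the root, as for "son" in the example sentence.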

6 Format 2: Constituent trees
SOURCE: Running text
1. ¿ Cuáles son los motivos que han hecho resurgir este debate
A1
QUE:fcl
¿
=Cs:pron-int("cuál" DET MF P) Cuáles
=P:v-fin("ser" PR 3P IND VFIN) son
=S:np
==DN:pron-dem("el" DET M P) los
==H:n("motivo" M P) motivos
==DN:fcl
===S:spec("que" MF SP) que
===P:vp
====Vaux:v-fin("haber" PR 3P IND) han
====Vm:v-pcp("hacer" M S) hecho
===Od:fcl
====P:v-inf("resurgir" ) resurgir
====S:np
=====DN:pron-dem("este" DET M S) este
=====H:n("debate" M S) debate
?

7 Anatomy of the HISPAL parser

8 The morphological analyzer
● full-form lexicon only for about 220 closed-class words
● use of affix classes with or without stem conditions
– '-aremos' -> '...ar' (lemma) V FUT 1P IND (verb, future, 1st person plural, indicative)
● but hypothetical stems are also suggested:
– [compraremo] ADJ M P
– [compraremo] N M P
– [comprar] V FUT 1P IND
– [comprarer] V PR/PS 1P IND
– [comprarar] V PR 1P SUBJ
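The overgeneration shown for "compraremos" can be sketched in a few lines of Python. The affix table below is a toy assumption containing only the endings needed to reproduce the five readings on the slide; HISPAL's actual affix classes and stem conditions are not given in the presentation.

```python
# Toy affix table (assumption): (word ending, lemma ending, tags)
AFFIX_RULES = [
    ("os", "o", "ADJ M P"),
    ("os", "o", "N M P"),
    ("aremos", "ar", "V FUT 1P IND"),
    ("emos", "er", "V PR/PS 1P IND"),
    ("emos", "ar", "V PR 1P SUBJ"),
]


def analyze(word):
    """Return every (hypothetical lemma, tags) pair licensed by an ending."""
    readings = []
    for ending, lemma_ending, tags in AFFIX_RULES:
        if word.endswith(ending) and len(word) > len(ending):
            stem = word[: -len(ending)]
            readings.append((stem + lemma_ending, tags))
    return readings


for lemma, tags in analyze("compraremos"):
    print(f"[{lemma}] {tags}")
```

Without a root lexicon, nothing in this step distinguishes the real lemma "comprar" from the hypothetical ones; that choice is left to later weighting and disambiguation.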

9 weighting morphological candidates
● if one or more suggested readings have lexicon support for their roots, other readings are discarded
● longer endings are preferred to shorter ones: -anes -> án, -enes -> én rather than simple plural '-s' (with a root '....ane' or '...ene')
● recognizably analytical readings with lexicon support are preferred to heuristic ones: -idad, -itud, -ista, super-, for instance decir -> antedecir, bendecir, contradecir, desdecir, entredecir, interdecir, maldecir, predecir, redecir..., allowing also for productive derivation
● even without root-lexicon support, recognized affixes may allow better prediction of word class
● as a last resort, all inflectionally possible forms are passed on to the contextual disambiguation, in effect making the CG grammar part of the heuristic part of the morphological analyzer
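A rough Python sketch of the first, second and last heuristics above. The function name, the tuple layout and the decision to apply the length preference only among lexicon-supported readings are assumptions made for illustration; the slides do not specify how the heuristics are ordered or combined.

```python
def rank_readings(readings, lexicon):
    """readings: list of (lemma, tags, matched_ending) tuples.

    1. If any lemma has lexicon support, discard unsupported readings.
    2. Among the supported ones, keep those with the longest ending.
    3. With no support at all, pass everything on (last resort).
    """
    supported = [r for r in readings if r[0] in lexicon]
    if not supported:
        return list(readings)
    longest = max(len(ending) for _, _, ending in supported)
    return [r for r in supported if len(r[2]) == longest]


candidates = [
    ("compraremo", "N M P", "s"),
    ("comprar", "V FUT 1P IND", "aremos"),
    ("comprarer", "V PR/PS 1P IND", "emos"),
]
print(rank_readings(candidates, {"comprar"}))
print(rank_readings(candidates, set()))
```

In the second call nothing is filtered, which mirrors the last bullet: the ambiguity is handed to the contextual CG rules.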

10 The lexicon
1. Original version created by bootstrapping (2001)
● a hand-built closed-class lexicon for Spanish (pronouns, prepositions, conjunctions...)
● a Spanish affix file used together with the Portuguese morpho-chunker and dummy roots
● a list of safe open-class word candidates, extracted from corpora using e.g. article-noun sequences and unambiguous verbal inflexions
● overgenerating, heuristic output from the Spanish morphological analyzer, using dummy roots ('xxxa' N F, 'xxxo' ADJ, 'xxxar' V) to recognize open-class word candidates and their inflexion (nouns, verbs, adjectives...)
2. The combined seeding system was then run on a large body of text, adding morphological and PoS tags as well as lemma cohorts for each word
3. Disambiguation by running the Portuguese CG module

11 The lexicon 2
4. Use the surviving readings to generate new entries for the HISPAL lexicon
5. Reiterate the process, using new data and more and more “Spanish” versions of the CG rule set
6. Finally, a large portion of lexeme strings with more than one PoS entry were manually checked against published dictionaries, among them all cases of gender ambiguity for nouns (o guarda - a guarda).
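The "safe candidate" idea from step 1 can be illustrated with a small Python sketch that harvests noun candidates from article-noun sequences in raw text. The article table, the set of closed-class follower words and the frequency threshold are all assumptions chosen for the example; the slides do not describe the actual extraction patterns, and sentence boundaries are ignored here for brevity.

```python
import re
from collections import Counter

# Assumed seed data: definite articles with the noun reading they suggest
ARTICLES = {"el": "N M S", "la": "N F S", "los": "N M P", "las": "N F P"}
# Assumed closed-class words that make the middle word a safer noun candidate
FOLLOWERS = {"de", "que", "es", "en", "y"}


def noun_candidates(text, min_count=2):
    """Count article + X + closed-class-word triples and keep frequent X."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for art, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        if art in ARTICLES and nxt in FOLLOWERS:
            counts[(word, ARTICLES[art])] += 1
    return {key: n for key, n in counts.items() if n >= min_count}


sample = "El debate de hoy. El debate es largo. La casa de Ana."
print(noun_candidates(sample))
```

Candidates found this way would still pass through the disambiguation and manual checking steps listed above before becoming lexicon entries.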

12 [Figure: lexicon bootstrapping cycle. Labels: raw corpora, multitagger, multitagged text, Constraint Grammar, unambiguously tagged text, lexicon]

13 Lemma distribution in the HISPAL lexicon

14 Secondary lexicon information
● Valency
– Once the parser produced more reliable output, annotated corpora were used to extract verb valency frames, such as 'transitive verb' or 'reflexive verb'
– manually added valency potential for auxiliaries and some support verbs (ser, estar, dejar)
– systematic valency based on suffixes (e.g. -izar)
– however, most valency frames are incomplete, and most nouns and adjectives lack them altogether
● verbs: 4300 with 7530 valency patterns

15 ● Semantic prototypes
– conceptually, some 160 semantic prototype categories are used for nouns, e.g. = bird, = container, etc., in analogy with the PALAVRAS system
– however, only a few types have been systematically implemented, notably systematic ones (e.g. '-ista' -> +HUM)
● 22039 nouns with semantic tags (10% well-checked)
● 1175 common person and place names (, )
– corpus experiments to extract, for instance, the +TOP feature from corpora based on preposition dependency
– possible use of existing resources (EuroWordNet, Simple)

16 Coverage of the HISPAL lexicon and morphological analyzer

17 Constraint Grammar parsing
● rule based, reductionist, focus on disambiguation
● rules add, remove or select morphological, syntactic or other readings
● rules use context conditions of arbitrary distance and complexity (i.e. other words and tags in the sentence)
● rules are applied in a deterministic and sequential way, so removed information can't be recovered
– rules in batches, safe rules first
– last remaining reading can't be removed
– robust method that will assign readings even to very unconventional language input
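The reductionist behaviour described above, including the safeguard that the last reading survives, can be sketched in Python as follows. This is a simplified illustration, not the CG compiler's implementation: readings are modelled as plain tag sets, and the context conditions that real rules carry are left out.

```python
def remove(readings, tag):
    """REMOVE: drop readings carrying `tag`, but never empty the cohort."""
    kept = [r for r in readings if tag not in r]
    return kept if kept else readings


def select(readings, tag):
    """SELECT: keep only readings carrying `tag`, if there are any."""
    kept = [r for r in readings if tag in r]
    return kept if kept else readings


# "debate" is ambiguous between a noun and a finite verb form
cohort = [{"N", "M", "S"}, {"V", "PR", "3S", "SUBJ", "VFIN"}]

cohort = remove(cohort, "VFIN")   # verb reading is discarded for good
print(cohort)
cohort = remove(cohort, "N")      # no effect: last reading is protected
print(cohort)
```

Because rules run sequentially and discarded readings never return, ordering matters, which is why safe rules are applied in earlier batches than heuristic ones.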

18 some simple rule examples
● REMOVE VFIN IF (*-1C VFIN BARRIER CLB OR KC)
exploits the uniqueness principle: 1 finite verb per clause
● MAP (@SUBJ> @<SUBJ @<SC) TARGET (PROP) IF (NOT -1 PRP)
syntactic potential of proper nouns
● SELECT (@SUBJ>) IF (*-1 >>> OR KS BARRIER NON-PRE-N/ADV) (*1 VFIN BARRIER NON-ATTR)
clause-initial np's, followed by a finite verb, are likely to be subjects
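To make the first rule concrete, here is a hedged Python sketch of its context test: scan leftwards for a cohort that is unambiguously VFIN (the "C" in *-1C), giving up when a clause boundary (CLB) or coordinator (KC) is met. The data layout and function names are assumptions, and details such as whether the target or the barrier is tested first on the same cohort are simplified relative to a real CG engine.

```python
def all_have(cohort, tag):
    """Careful match: every reading of the cohort carries the tag."""
    return all(tag in reading for reading in cohort)


def any_has(cohort, tags):
    """True if some reading carries at least one of the tags."""
    return any(tags & reading for reading in cohort)


def safe_vfin_to_the_left(sentence, i):
    for j in range(i - 1, -1, -1):
        if all_have(sentence[j], "VFIN"):
            return True
        if any_has(sentence[j], {"CLB", "KC"}):
            return False          # barrier reached before a finite verb
    return False


def remove_vfin(sentence):
    """Sketch of: REMOVE VFIN IF (*-1C VFIN BARRIER CLB OR KC)."""
    for i, cohort in enumerate(sentence):
        kept = [r for r in cohort if "VFIN" not in r]
        if kept and len(kept) < len(cohort) and safe_vfin_to_the_left(sentence, i):
            sentence[i] = kept
    return sentence


# son / este / debate, with "debate" ambiguous between N and VFIN
sent = [[{"V", "VFIN"}], [{"DET"}], [{"N"}, {"V", "VFIN"}]]
print(remove_vfin(sent)[2])
```

With a coordinator between "son" and "debate", the scan would stop at the barrier and the ambiguity would be left for other rules.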

19 Bootstrapping a Constraint Grammar
● Mature CGs consist of thousands of (manual!) rules, costing several man-years if a full grammar is built from scratch
● A possible alternative: importing a CG from a related language: Portuguese -> Spanish
– also suggested for Catalan -> Spanish (unevaluated?): http://prado.uab.es/English/corpus.html
● Why is CG porting possible at all?
– Unlike rewriting rules in a PSG, Constraint Grammar doesn't strive to describe a language in a complete and positive way
– Rather, rules focus on what is NOT possible (annotation through disambiguation)
– Therefore, superfluous rules won't hurt, and heuristic (Portuguese) rules can function as a harmless backup in the presence of newer, non-heuristic Spanish rules
– With compatible PoS tag conventions, the grammar will work at once, for free running input, and can be changed incrementally

20 Some specific changes
● Token and lexeme references in sets and rules were translated, i.e.
– structural words like prepositions or conjunctions: quando -> cuando (when), e -> y (and), ou -> o (or)
– semantically inspired lists (months, days of the week, units)
● Specific Spanish rules were added early in the rules file to cover phenomena like the use of the preposition a with (especially human) direct objects.
● Error-producing rules were traced and changed, replaced or deleted. Often, rules could be “repaired” by adding further context conditions, or by restricting the target set
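The translation of token references can be pictured with a short Python sketch. The mapping holds only the three word pairs named on the slide; the representation of a set as a list of strings with quoted word forms, and the function name `port_set`, are assumptions for illustration and do not reflect the actual grammar file format or porting tools.

```python
# Word pairs taken from the slide; everything else here is illustrative
PT_TO_ES = {"quando": "cuando", "e": "y", "ou": "o"}


def port_set(members, mapping=PT_TO_ES):
    """Translate quoted word forms in a set; leave tags untouched.

    Returns the ported members plus the word forms that had no
    translation and therefore need manual attention.
    """
    ported, unmapped = [], []
    for member in members:
        is_word = len(member) >= 2 and member[0] == '"' and member[-1] == '"'
        if not is_word:
            ported.append(member)            # tag such as KC or PRP
            continue
        word = member[1:-1]
        if word in mapping:
            ported.append(f'"{mapping[word]}"')
        else:
            ported.append(member)
            unmapped.append(word)
    return ported, unmapped


print(port_set(['"quando"', '"e"', "KC", '"porque"']))
```

Returning the untranslated items separately reflects the piecemeal nature of the work: anything not covered by the mapping still has to be checked by hand.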

21 Problems
● changes may appear unsatisfyingly piecemeal to a linguist
● it is difficult to tell whether a given error (from a corpus run) is Spanish-specific, or whether the original Portuguese rule was already at fault, so for now there is no back-porting trade-off between the two grammars
● Also, because of the complex, reductionist rule interaction, it is dangerous to re-port the (faster-growing) Portuguese grammar once both systems have undergone individual changes
● differences in ambiguity classes, e.g.
– muito -> mucho/muy (1 : many)
– lhe @DAT -> le @DAT/ACC (many : 1)

22 Current grammar size
● 1418 morphological disambiguation rules
● 1249 mapping rules
● 1862 syntactic disambiguation rules

23 Performance of the HISPAL parser: global values
● test run on a manually revised gold corpus (2567 words, 3025 tokens), taken from the interview corpus
(1) Almost no morphological errors were found for correct PoS, implying little in-class ambiguity. This may be due in part to the fairly distinctive inflexional morphology of Spanish, but can also be explained by the use of underspecified tags for systematically ambiguous morphology (e.g. gender in '-ista' nouns: M/F).

24 Performance of the HISPAL parser: specific syntactic functions

25 Comparison to other systems
● results were similar to those reported for other languages (English, Karlsson et al. 1995; Norwegian, Hagen et al. 2000; Danish, Portuguese and French, Bick 2003 & 2004)
● No data published for other Spanish, CG-comparable systems
– Connexor's Machinese [www.connexor.com/demo/syntax/]
– FreeLing [Atserias et al. 1998 & 2006]
● Syntactic accuracy (95-96%) compares favourably to the syntactic edge label accuracy of the best performing system in the CoNLL-X shared task on machine-learned dependency parsing (90.4% on the Cast3LB treebank)
– up-side: The CoNLL systems got hand-corrected PoS for free
– down-side: The HISPAL gold standard corpus was built by manually revising parser output, thus introducing a possible bias in favour of the parser in ambiguous cases

26 DeepDict www.gramtrans.com
● syntactically analyzed corpus (dependency links and functions)
– Spanish Wikipedia (Nov. 2005): ca. 22 M words
– Spanish section of Europarl: ca. 27 M words
● lemmatization, “normalization” (passives, numbers, names)
● extraction of mother-daughter relations, depgrams, not ngrams (cf. Adam Kilgarriff's Sketch Engine)
– N + @N A + ADJ (gravemente enfermo)
– V + @ACC (ganar terreno), @SUBJ ktp, V + PRP (creer en)
● co-occurrence measure: p(AB) / (p(A) * p(B))
● graphical interface
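As a small worked example of the co-occurrence measure, the Python sketch below estimates p(AB), p(A) and p(B) as relative frequencies over a list of head-dependent pairs and returns their ratio, reading the slide's formula as p(AB) divided by the product p(A) · p(B). The function name and the toy pairs are assumptions; DeepDict's actual counting, smoothing and thresholds are not described in the presentation.

```python
from collections import Counter


def cooccurrence_scores(pairs):
    """Score each (head, dependent) pair as p(AB) / (p(A) * p(B)).

    Probabilities are plain relative frequencies over the given list.
    """
    pairs = list(pairs)
    total = len(pairs)
    pair_counts = Counter(pairs)
    head_counts = Counter(head for head, _ in pairs)
    dep_counts = Counter(dep for _, dep in pairs)
    scores = {}
    for (head, dep), n in pair_counts.items():
        p_ab = n / total
        p_a = head_counts[head] / total
        p_b = dep_counts[dep] / total
        scores[(head, dep)] = p_ab / (p_a * p_b)
    return scores


toy = [("ganar", "terreno"), ("ganar", "terreno"),
       ("ganar", "dinero"), ("perder", "dinero")]
for pair, score in cooccurrence_scores(toy).items():
    print(pair, round(score, 3))
```

A score above 1 means the two lemmas occur together more often than independence would predict; on such a tiny sample the values are of course only illustrative.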

32 Outlook
● more corpus work, currently focussing on:
– Europarl (27.2 M words)
– Wikipedia (22.3 M words)
– News texts (2 M words)
● lexicon completion:
– complete manual revision by a native speaker
– corpus-based completion of valency patterns
– (manual?) completion of semantic prototype ontology
● exchange format for comparison with other Spanish annotation projects
● more stringent evaluation

33 Tools and documentation: http://beta.visl.sdu.dk
Teaching games: http://visl.sdu.dk
Corpora: http://corp.hum.sdu
DeepDict: http://gramtrans.com
Contact: eckhard.bick@mail.dk
