Spanish FrameNet Project Autonomous University of Barcelona Marc Ortega.


1 Spanish FrameNet Project Autonomous University of Barcelona Marc Ortega

2 Spanish FrameNet Project
• Spanish FrameNet (SFN) is a research project sponsored by the Department of Education of Spain (Grant No. TSI2005-01200) from December 2005 to December 2006.
• A new grant proposal has been submitted to the Spanish Department of Education for the period 2007-2009.
• SFN is developed at the Autonomous University of Barcelona (Spain) and the International Computer Science Institute (Berkeley, CA) in cooperation with the FrameNet Project.
• PI: Carlos Subirats; System Analyst: Marc Ortega; 2 linguists

3 SFN Goals
• The Spanish FrameNet Project is creating an online lexical resource for Spanish, based on frame semantics and supported by corpus evidence.
• SFN will be available to the public by July 2007.
• SFN will contain at least about 1,000 lexical items (verbs, predicative nouns, adjectives, adverbs, prepositions, and entities) representative of a wide range of semantic domains.
• The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses.

4 Frame Semantics
• Spanish FrameNet (SFN) uses FrameNet (FN) frames, modifying them where needed to fit Spanish.
• Some SFN frames are the same as the English FN frames (with Spanish examples).
• Some SFN frames keep the English FN name but differ from the original (slightly different definition, different FEs, or different core sets).
• To adapt FN to Spanish, we defined some new frames (new frames use the same FN format), and some FN frames are not used, like: Cause_to_halt, Change_emotional_state, Collapse, Inventing, Motion_backwards, Motion_interruption, Motion_manner, Motion_medium, Motion_up_downwards, Return, Social_interaction, Think_up

5 Current Project Status
• Frames defined: 92
• Lexical units: 624
  - Annotated: 413
  - Subcorporated: 130
  - Created but without subcorporation: 23

6 Spanish FrameNet Corpus and Tools
• Spanish FrameNet is using a 350-million-word corpus.
  - It includes both European and New World Spanish (40% and 60%).
  - The SFN Corpus has been developed by the SFN research team, since there are no (large) public-domain Spanish corpora available.
• The SFN Corpus is lemmatized and tagged with a set of in-house tools.
• FNDesktop
• Web Reports
• Sato Tool

7 The SFN tagging and chunking system
• The SFN Corpus is tagged and lemmatized by using:
  - An electronic dictionary of Spanish of 600,000 forms, which is expanded from a dictionary of 93,000 lemmas:
    • 66,000 single-word lexical units, like unir (unite), inmoralidad (immorality), allí (there), etc.;
    • 26,000 multi-word lexical units (MWLU), like muerte cerebral (brain death), etc., which are automatically expanded into 55,000 inflected MWLU forms.
  - A corpus tagger that converts plain text to deterministic finite state automata (DFSA)
  - 2,000 finite state transducers (FST) for multi-word verbs
  - Transducers for heads of verbal phrases (compound verbal tenses)
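The expansion of a multi-word lemma into its inflected forms can be illustrated with a minimal Python sketch. The pluralization rule below is a deliberately naive assumption (it ignores accent shifts and irregular forms) and the function names are hypothetical; it is not the SFN expansion procedure.

```python
def pluralize(word):
    """Naive Spanish plural: a simplifying assumption, not a full morphology."""
    if word[-1] in "aeiou":
        return word + "s"
    if word.endswith("z"):
        return word[:-1] + "ces"
    return word + "es"

def expand_mwlu(lemma):
    """Return the singular and plural forms of a noun + adjective MWLU."""
    words = lemma.split()
    plural = " ".join(pluralize(w) for w in words)
    return [lemma, plural]

print(expand_mwlu("muerte cerebral"))
```

Here every word of the unit is inflected together, which suits noun + adjective units such as muerte cerebral; units with an invariable part would need per-word inflection rules.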

8 The SFN tagging and chunking system
• The POS tagging process produces two corpus formats:
  - Automata Corpus
  - IMS-CWB (Institut für Maschinelle Sprachverarbeitung - Corpus Workbench)

9 Automata Corpus
• Lexical tagging (part-of-speech, lemma)
• Word ambiguities are represented in deterministic finite state automata (DFSAs) as different possible transitions between two consecutive states
• Allows efficient word disambiguation
• Allows extended lexical tagging using automata transduction
  - Compound verbal forms tagging
  - Multi-word verb recognition
• Very high processing rates
• Human access is almost impossible
[Figure: DFSA of the sentence Al habérselo propuesto a tiempo]
[Figure: FST for compound verb form tagging]
[Figure: Transduced DFSA of the sentence Al habérselo propuesto a tiempo]
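The idea of storing ambiguity as parallel transitions between consecutive states can be sketched in Python as follows. The analyses and tag names are illustrative assumptions, not the SFN tag set, and a plain list of alternatives stands in for a real automaton.

```python
from itertools import product

# Each entry is one token; its list holds the competing (lemma, tag)
# transitions between state i and state i + 1. All analyses are hypothetical.
LATTICE = [
    ("Al", [("al", "PREP+DET")]),
    ("habérselo", [("haber", "V:INF+CLITICS")]),
    ("propuesto", [("proponer", "V:PART"), ("propuesto", "ADJ")]),
    ("a", [("a", "PREP"), ("a", "N")]),
    ("tiempo", [("tiempo", "N")]),
]

def count_readings(lattice):
    """Number of distinct paths from the first state to the last."""
    total = 1
    for _form, analyses in lattice:
        total *= len(analyses)
    return total

def readings(lattice):
    """Enumerate every path as a list of (lemma, tag) pairs."""
    return [list(path) for path in product(*(a for _form, a in lattice))]

def prune(lattice, position, keep_tag):
    """Disambiguate one token by keeping only transitions tagged keep_tag."""
    pruned = list(lattice)
    form, analyses = pruned[position]
    pruned[position] = (form, [a for a in analyses if a[1] == keep_tag])
    return pruned

print(count_readings(LATTICE))
print(count_readings(prune(LATTICE, 2, "V:PART")))
```

Disambiguation then amounts to deleting transitions: removing one alternative at a single position discards every path through it without enumerating the paths.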

10 CWB Corpus
• Lexical tagging (part-of-speech, lemma)
• Text DFSAs are disambiguated and converted to XML format
• Unambiguous corpus
• Allows human access to corpus contents
• Allows human corpus search
• Corpus contents are encoded and indexed for efficient corpus search
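As a rough illustration of the unambiguous format, the sketch below writes disambiguated tokens in the verticalized layout commonly used as CWB input: one token per line with tab-separated attributes, and XML-style tags marking sentence and text boundaries. The attribute order (word, POS, lemma), the tag values, and the function name are assumptions, not the SFN conversion tool.

```python
def to_vertical(sentences):
    """Render sentences of (word, pos, lemma) triples as verticalized text."""
    lines = ["<text>"]
    for sentence in sentences:
        lines.append("<s>")
        for word, pos, lemma in sentence:
            # One token per line; attributes are separated by tabs.
            lines.append(f"{word}\t{pos}\t{lemma}")
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines)

# Hypothetical, already disambiguated analysis of a short phrase.
sentence = [("a", "PREP", "a"), ("tiempo", "N", "tiempo")]
print(to_vertical([sentence]))
```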

11 Multi-word verb recognition
• Inflectional morphological properties are kept
• The adverb siempre is detected between the core verb and the idiom
[Figure: DFSA of the sentence Le hacían siempre el vacío en la empresa before the transduction]
[Figure: Subsequential FST that detects the multi-word verb hacer el vacío]
[Figure: Output DFSA of the sentence after the intersection and transduction]
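The recognition step can be approximated with a short Python sketch that scans a lemmatized token sequence for the verb, skips any intervening adverbs, and then matches the fixed part of the idiom. It is a plain linear scan under assumed tag names, not the subsequential FST used by SFN.

```python
def find_multiword_verb(tokens, verb_lemma, idiom, gap_tag="ADV"):
    """Locate verb_lemma followed by idiom, allowing adverbs in between.

    tokens is a list of (form, lemma, tag) triples; tag names are assumptions.
    Returns the matched positions, or None if the multi-word verb is absent.
    """
    for i, (_form, lemma, tag) in enumerate(tokens):
        if lemma != verb_lemma or not tag.startswith("V"):
            continue
        j = i + 1
        adverbs = []
        while j < len(tokens) and tokens[j][2] == gap_tag:
            adverbs.append(j)
            j += 1
        forms = [form.lower() for form, _lemma, _tag in tokens[j:j + len(idiom)]]
        if forms == idiom:
            return {"verb": i, "adverbs": adverbs,
                    "idiom": list(range(j, j + len(idiom)))}
    return None

SENTENCE = [
    ("Le", "le", "PRON"), ("hacían", "hacer", "V:IMPF:3PL"),
    ("siempre", "siempre", "ADV"), ("el", "el", "DET"),
    ("vacío", "vacío", "N"), ("en", "en", "PREP"),
    ("la", "el", "DET"), ("empresa", "empresa", "N"),
]

match = find_multiword_verb(SENTENCE, "hacer", ["el", "vacío"])
print(match)
```

Because the match records the verb's position rather than replacing the token, the verb's inflectional tag stays available when the unit is relabelled as hacer el vacío.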

12 Subcorporation Process
• The internal tools GramCreator and XQS are used to create subcorporation grammars
Solicitud grammar example (the syntactic structure N-de-GN-de is detected):
# Request: solicitud # N-de-GN-de # * = 4 { ( + * ) ( + ( ( + ) ( + + ) )) + ( ( + )) ) }

13 Subcorporation Process
• Each grammar (regular expression) is converted to a finite state transducer (FST)
• The LU's subcorpus is transduced with the set of grammar FSTs to produce a set of subcorpora
• The transduction process allows very high processing rates (100 transductions per second)
• The subcorporation set is converted to XML and imported into FNDesktop
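Selecting a subcorpus with a grammar can be sketched in Python by encoding each sentence as a string of lemma/TAG pairs and matching a regular expression against it. Python's backtracking `re` engine stands in for a compiled FST here, and the tag names and the pattern for the N-de-GN-de structure are assumptions rather than the GramCreator grammar.

```python
import re

# Hypothetical pattern: the noun solicitud, then "de", then a simple noun
# phrase (optional determiner, adjectives, noun, adjectives), then "de".
N_DE_GN_DE = re.compile(
    r"(?<!\S)solicitud/N de/PREP "
    r"(?:\S+/DET )?(?:\S+/ADJ )*\S+/N (?:\S+/ADJ )*"
    r"de/PREP(?!\S)"
)

def encode(sentence):
    """Flatten (lemma, tag) pairs into a space-separated lemma/TAG string."""
    return " ".join(f"{lemma}/{tag}" for lemma, tag in sentence)

def select_subcorpus(sentences, pattern):
    """Keep only the sentences whose encoding matches the grammar."""
    return [s for s in sentences if pattern.search(encode(s))]

CORPUS = [
    # "la solicitud de ayuda de los vecinos"
    [("el", "DET"), ("solicitud", "N"), ("de", "PREP"), ("ayuda", "N"),
     ("de", "PREP"), ("el", "DET"), ("vecino", "N")],
    # "la solicitud llegó tarde"
    [("el", "DET"), ("solicitud", "N"), ("llegar", "V"), ("tarde", "ADV")],
]

print(len(select_subcorpus(CORPUS, N_DE_GN_DE)))
```

Running one pattern per syntactic structure over the same sentences yields one subcorpus per grammar, which mirrors the set of subcorpora described above.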

14 Subcorporation Process N-de-GN-de structure detection

15 Annotation Tool
• SFN uses the FN annotation tool (FNDesktop) to add semantic annotation to the LU subcorporation sets
• The FNClassifier has been adapted to Spanish: the classifier has new rules adapted to the Spanish tags and to Spanish local syntactic contexts

16 Annotation search tools (Web Reports)

17 Annotation search tools (Sato Tool)

