Published by Loreen Clarke. Modified over 9 years ago.
1
SVETLA KOEVA, SVETLOZARA LESEVA, BORISLAV RIZOV
2
The project: Automatic information extraction based on semantic relations (RILA, a bilateral co-operation programme)
Objectives: Reliable (exhaustive and precise) multilingual lexical resources for a variety of purposes, such as machine translation, information extraction and information retrieval.
3
Prerequisites for carrying out such a task:
Large-coverage linguistic resources, such as comprehensive multilingual and monolingual dictionaries (designed according to certain criteria and stored in a format that ensures accessibility and manageability).
Ancillary resources (especially for disambiguation and recognition).
An appropriate system for storing and managing multilingual linguistic data and for implementing task-related procedures.
4
Methodology
Systematization and unification of the existing INTEX resources, and their conversion to the established NooJ format.
Expansion and enhancement of the resources, aiming at higher precision and recall.
Creation of new resources using the experience, resources and tools developed along the first two lines.
5
Conversion of the lexical resources in DELA format to the .nod format:
Conversion of the BGD (Bulgarian Grammar Dictionary) [1] automata underlying the DELAF dictionaries to the .flx automata description.
Creation of automata for the existing dictionaries of compounds, since they have been stored in DELACF format.
[1] Koeva, S. Grammar Dictionary of Bulgarian. Description of the concept of organization of the linguistic data. Bulgarian Language 6, pp. 49-58.
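As an illustration of what the conversion of inflected-form entries involves, here is a minimal Python sketch. It is not the project's converter. It assumes DELAF-style lines of the shape `form,lemma.CATEGORY:codes` and a NooJ-style target of the shape `form,lemma,CATEGORY+feature+...`; the function name `delaf_to_nooj`, the one-letter-per-feature reading of the codes, and the sample entries are assumptions made for the example. The actual BGD codes may be structured differently.

```python
import re

# Assumed DELAF-style shape: form,lemma.CATEGORY[+Trait...][:codes[:codes...]]
# Escaped commas or dots inside forms are not handled in this sketch.
DELAF_LINE = re.compile(r"^(?P<form>[^,]+),(?P<lemma>[^.]*)\.(?P<info>.+)$")


def delaf_to_nooj(line):
    """Convert one DELAF-style line into NooJ-style lines, one per code group."""
    match = DELAF_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a DELAF-style entry: {line!r}")
    form = match.group("form")
    # An empty lemma field means the lemma is identical to the form.
    lemma = match.group("lemma") or form
    category, *code_groups = match.group("info").split(":")
    if not code_groups:
        return [f"{form},{lemma},{category}"]
    # Assumption: each character of a code group is one inflectional feature.
    return [
        f"{form},{lemma},{category}+{'+'.join(group)}"
        for group in code_groups
    ]


if __name__ == "__main__":
    for entry in ["книги,книга.N:p", "cousin,.N+Hum:ms"]:
        print(delaf_to_nooj(entry))
```

Emitting one output line per code group keeps each analysis separate, so ambiguous forms stay visible as multiple entries rather than being merged.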
6
Conversion of the INTEX graphs into the NooJ format:
Preprocessing graphs: compound conjunction graphs, abbreviation and elision graphs (with possible treatment in a dictionary), etc.
Recognition graphs developed for tasks involving the automatic treatment of syntactic phenomena.
7
Expanding the compound word dictionaries with new entries in a systematic way (covering large and diverse areas of the lexicon's inventory of compounds).
Establishing the resources to be used:
The available specialised on-line dictionaries.
The lexical-semantic database, the Bulgarian WordNet.
Developing automata for the inflection types in the established format.
8
Specifics:
Restricted paradigms for certain types of compounds (esp. domain-specific terms): pluralia tantum, singularia tantum, count forms, plural endings.
Invariable forms, or forms that are not established in the Bulgarian language, especially ones introduced as transcriptions of mainly English terms (hedge, swap, bear market, bull market, etc.).
9
Compound extraction from the above-mentioned resources (enhanced in a complementary fashion):
Extraction of thematic compound dictionaries of terms, named entities and other compound lexemes (using the semantic relations encoded in the database and exploiting inheritance).
Employing NooJ as the environment for compound extraction: processing the obtained material with the already designed dictionaries and encoding the appropriate candidates among the unrecognized tokens.
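The idea of extracting a thematic compound dictionary through inherited semantic relations can be sketched as follows. This is only an illustration, not the project's tooling: the `SYNSETS` table is hypothetical toy data, not the Bulgarian WordNet, and the function name `compounds_under` is an assumption. Starting from a domain root, the sketch follows hyponymy links and keeps every multiword literal it meets.

```python
from collections import deque

# Hypothetical toy data: synset id -> (literals, ids of direct hyponyms).
SYNSETS = {
    "market": (["market"], ["bear_market", "bull_market"]),
    "bear_market": (["bear market"], []),
    "bull_market": (["bull market"], ["long_bull_market"]),
    "long_bull_market": (["long bull market"], []),
}


def compounds_under(root, synsets):
    """Collect multiword literals from `root` and all synsets below it."""
    seen = {root}
    queue = deque([root])
    found = []
    while queue:
        literals, hyponyms = synsets[queue.popleft()]
        # A literal containing a space is treated as a compound candidate.
        found.extend(lit for lit in literals if " " in lit)
        for hyponym in hyponyms:
            if hyponym not in seen:  # guard against revisiting shared nodes
                seen.add(hyponym)
                queue.append(hyponym)
    return found


if __name__ == "__main__":
    print(compounds_under("market", SYNSETS))
```

Because membership in the domain is inherited from the root, a compound deep in the hierarchy is collected without being tagged individually.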
11
Dictionary generation enhancement
Exploring large databases and identifying the inflection types of different head words using the existing automata:
Using chiefly the Bulgarian WordNet, where the head words of compounds are marked unambiguously.
Using simple syntactic grammars (identifying NPs) to spot head words in the available domain-specific dictionaries of concepts and terms (which are more comprehensive in their coverage of inflection types).
13
Recognition enhancement
Development of morphological grammars covering certain classes of words not currently present in any dictionary, provided the source words are in the dictionary:
Personal feminine nouns: приятел (friend) - приятелка (female friend).
Diminutive nouns: детенце (a small child), кученце (a small dog), etc.
Verbal nouns, etc.
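The principle behind such morphological grammars (recognise an unlisted word only if stripping a derivational suffix yields a listed source word) can be approximated in a few lines of Python. This is a rough sketch, not a NooJ grammar: the tiny `LEXICON`, the tag strings, and the two suffix rules are assumptions chosen to cover only the examples on this slide.

```python
# Hypothetical mini-lexicon: lemma -> category and gender tag.
LEXICON = {"приятел": "N+m", "дете": "N+n", "куче": "N+n"}

# Each rule: (suffix to strip, tag required on the source word, tag of the derived word).
RULES = [
    ("ка", "N+m", "N+f+Fem"),   # feminine personal noun: приятел -> приятелка
    ("нце", "N+n", "N+n+Dim"),  # diminutive: дете -> детенце, куче -> кученце
]


def analyse(token, lexicon=LEXICON, rules=RULES):
    """Return (source lemma, tag) analyses for a token; empty list if unknown."""
    if token in lexicon:
        return [(token, lexicon[token])]
    analyses = []
    for suffix, source_tag, derived_tag in rules:
        if not token.endswith(suffix):
            continue
        base = token[: -len(suffix)]
        # Accept the derivation only if the source word is in the dictionary
        # and belongs to the class the rule applies to.
        if lexicon.get(base) == source_tag:
            analyses.append((base, derived_tag))
    return analyses


if __name__ == "__main__":
    for word in ["приятелка", "детенце", "кученце", "масичка"]:
        print(word, analyse(word))
```

Requiring the source word to be listed is what keeps such rules from over-generating: a token that merely ends in a matching suffix stays unrecognised.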
15
Present-day and future directions:
Information retrieval, machine translation, etc.
Facilitating linguistic tasks by supplying the prerequisites (large resources as input data) for the exploration of linguistic phenomena and the validation of linguistic hypotheses on language material.
Education (facilitating the acquisition of knowledge and skills in NLP).