Integrating Semantic Dictionaries for English, French and Bulgarian into the NooJ System for the Purposes of Information Retrieval
Svetla Koeva, Max Silberztein
8th INTEX / NooJ Workshop, 30 May, 2005
Main research goals
To provide a sufficient methodology for implementing natural language semantic relations in the NooJ system:
– to create specialized semantic dictionaries for English, French and Bulgarian based on WordNet semantic relations;
– to provide a complete formalization of the inflection of the simple and compound words included in the WordNet structure.
History
The integration of semantic relations into the INTEX system was first proposed at the sixth INTEX workshop. The idea was later developed in the joint RILA research project "Information retrieval based on semantic relations" between:
– LASELDI, Université de Franche-Comté;
– Department of Computational Linguistics, IBL, Bulgarian Academy of Sciences.
Language resources
– Bulgarian grammatical dictionary (BGD) – over lemmas and word forms;
– English WordNet 2.0 – synonymous sets;
– Bulgarian WordNet (BalkaNet project) – synonymous sets;
– French WordNet (EuroWordNet project) – synonymous sets;
– English dictionary – over lemmas (not inflected);
– French dictionary – extracted with INTEX.
Implementation tasks
– To transform the format of the BGD into the NooJ standard;
– To create semantic dictionaries for Bulgarian and English;
– To associate lemmas from the Bulgarian semantic dictionaries with the corresponding inflection types;
– To add missing lemmas and inflection types to the BGD, if any;
– To create extensive dictionaries and corresponding inflection types for compounds.
BGD – Information structure design
– Category information – 6 classes: Noun, Verb, Adjective, Pronoun, Numeral, Others (Adverb, Preposition, Conjunction, Particle, Interjection);
– Paradigmatic information – Personal, Transitive, Perfective, Common, …;
– Grammatical information – Inflection, Conjugation, Sound alternations, ….
BGD – Grammatical subclasses
– Nouns – 22 subclasses with respect to their Type (Common, Proper, Singularia tantum, Pluralia tantum) and Gender;
– Verbs – 32 subclasses with respect to Transitivity, Perfectiveness, and Personality;
– Adjectives – 2 subclasses;
– Pronouns – 26 subclasses with respect to their Type and Possessor;
– Numerals – 6 subclasses.
BGD – Grammatical types
– Noun – Number, Definiteness, Counting form, Case, Optional forms – 266 types;
– Verb – Person, Number, Tense, Mood, Voice, Participles, Gender, Definiteness – 257 types;
– Adjective – Gender, Number, Definiteness – 30 types;
– Pronoun – Gender, Person, Number, Definiteness, Case, Clitic, Possessing – 28 types;
– Numeral – Gender, Number, Definiteness, Approximate form, Male form – 20 types.
BGD – Dictionary format
Dictionary entries:
а, ЧА, 0
абсол`ютен, ПРИ, 7
`август, С+М, 10
авиокомп`ания, С+Ж, 1
австр`ийски, ПРИ, 3
автоб`ус, С+М, 11
автомат`ичен, ПРИ, 7
адрес`ирам, Г+Н+Т, 4
агит`ирам, Г+Н+Т, 4
Grammatical type ПРИ, 7:
sm0, Ok, ''
smh, Ok, '2RCия'
sml, Ok, '2RCият'
sf0, Ok, '2RCа'
sfd, Ok, '2RCата'
sn0, Ok, '2RCо'
snd, Ok, '2RCото'
p0, Ok, '2RCи'
pd, Ok, '2RCите'
Transforming BGD
A Perl script handles:
– the dictionary;
– the grammatical types;
– the transliteration of labels.
NooJ dictionary
абсол`ютен, ПРИ, 7 → абсолютен,A+FLX=A-7
`август, С+М, 10 → август,N+M+FLX=N_M-10
авиокомп`ания, С+Ж, 1 → авиокомпания,N+F+FLX=N_F-1
австр`ийски, ПРИ, 3 → австрийски,A+FLX=A-3
автоб`ус, С+М, 11 → автобус,N+M+FLX=N_M-11
автомат`ичен, ПРИ, 7 → автоматичен,A+FLX=A-7
адрес`ирам, Г+Н+Т, 4 → адресирам,V+IT+FLX=V_IT-4
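The entry conversion shown on this slide can be sketched in a few lines. This is a minimal illustration in Python, not the authors' Perl script: the function name `bgd_to_nooj` is hypothetical, and the label table contains only the four labels whose transliteration is visible in the examples.

```python
# Label transliterations taken only from the slide's examples; the full
# BGD label set is not given in the source, so this table is incomplete.
LABELS = {"ПРИ": "A", "С+М": "N+M", "С+Ж": "N+F", "Г+Н+Т": "V+IT"}


def bgd_to_nooj(entry: str) -> str:
    """Turn 'lemma, LABEL, type' into 'lemma,CATEGORY+FLX=PARADIGM'."""
    lemma, label, type_number = (field.strip() for field in entry.split(","))
    # Drop the stress mark (`) and any stray spaces inside the lemma.
    lemma = lemma.replace("`", "").replace(" ", "")
    category = LABELS[label]
    # Paradigm name: category with '+' replaced by '_', then '-' and the type.
    paradigm = category.replace("+", "_") + "-" + type_number
    return f"{lemma},{category}+FLX={paradigm}"


if __name__ == "__main__":
    for line in ["`август, С+М, 10", "адрес`ирам,Г+Н+Т,4"]:
        print(bgd_to_nooj(line))
```

An unknown label raises `KeyError`, which makes gaps in the label table visible instead of silently producing a malformed entry.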
NooJ formal descriptions
sm0, Ok, '' → A-7 = /sm0 +
smh, Ok, '2RCия' → ия /smh +
sml, Ok, '2RCият' → ият /sml +
sf0, Ok, '2RCа' → а /sf0 +
sfd, Ok, '2RCата' → ата /sfd +
sn0, Ok, '2RCо' → о /sn0 +
snd, Ok, '2RCото' → ото /snd +
p0, Ok, '2RCи' → и /p0 +
pd, Ok, '2RCите' → ите /pd;
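To make the BGD rules above concrete, the sketch below applies one rule to a lemma. The reading of `2RC` is an assumption, since the slides do not define it: here `kRC` is taken to mean "remove the k-th character counted from the right end", after which the remaining string is appended. Under that reading the rules of type ПРИ, 7 give the expected Bulgarian forms (абсолютен, абсолютния, абсолютна, ...). The function name `apply_bgd_rule` is hypothetical.

```python
import re


def apply_bgd_rule(lemma: str, rule: str) -> str:
    """Apply a BGD rule such as '2RCия' or '' to a lemma.

    ASSUMPTION (not stated in the source): 'kRC' deletes the k-th
    character from the right; the rest of the rule is a suffix to append.
    """
    match = re.fullmatch(r"(?:(\d+)RC)?(.*)", rule)
    position, suffix = match.group(1), match.group(2)
    if position is not None:
        k = int(position)
        if not 1 <= k <= len(lemma):
            raise ValueError(f"cannot remove character {k} from {lemma!r}")
        cut = len(lemma) - k  # index of the character to delete
        lemma = lemma[:cut] + lemma[cut + 1:]
    return lemma + suffix


if __name__ == "__main__":
    for rule in ["", "2RCия", "2RCият", "2RCа", "2RCите"]:
        print(apply_bgd_rule("абсолютен", rule))
```

The NooJ descriptions on the slide express the same paradigm in NooJ's own notation; this sketch only illustrates the source side of the conversion.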
WordNet semantic relations
ILR               POS/POS        EW2.0  BulNet
HYPERONYMY        N/N V/V
NEAR ANTONYMY     N/N A/A V/V
PART MERONYMY     N/N
MEMBER MERONYMY   N/N
PORTION MERONYMY  N/N
SUBEVENT          V/V
CAUSES            V/V
SIMILAR TO        A/A V/V
VERB GROUP        V/V
ALSO SEE          A/A V/V
Other relations
ILR               POS/POS          EW2.0  BulNet
BE IN STATE       A/N
BG DERIVATIVE     N/V
DERIVED           A/N
PARTICIPLE        A/V              40156
REGION DOMAIN     N/N V/N A/N B/N
USAGE DOMAIN      N/N V/N A/N B/N  98322
CATEGORY DOMAIN   N/N V/N A/N B/N
Selected relations
– Synonymy (reflexive, symmetric, and transitive relation of equivalence);
– Hypernymy (inverse, asymmetric, and transitive relation between synonym sets);
– Meronymy (inverse, asymmetric, and transitive relation between synonym sets): Part meronymy; Member meronymy; Portion meronymy.
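Because hypernymy is transitive, a dictionary that stores only direct hypernyms still implies the full chain of more general synsets. The sketch below collects that chain; the function name and the toy data are illustrative assumptions, not taken from WordNet or from the authors' dictionaries.

```python
def all_hypernyms(synset, direct_hypernyms):
    """Return every synset reachable through the direct-hypernym links."""
    seen = set()
    stack = [synset]
    while stack:
        current = stack.pop()
        for parent in direct_hypernyms.get(current, ()):
            if parent not in seen:  # also protects against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    # Toy hierarchy, invented for illustration only.
    toy = {"factory": ["plant"], "plant": ["building complex"]}
    print(sorted(all_hypernyms("factory", toy)))
```

Since the relation is asymmetric, a synset never appears among its own hypernyms in well-formed data; the `seen` set simply keeps the traversal finite if the data contains an error.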
Selected relations
– Similar to (symmetric relation between similar adjectival synsets);
– Verb group (symmetric relation between semantically related verb synsets);
– Also see (symmetric relation between synsets, verbs or adjectives, that are close in meaning);
– Category domain (asymmetric extralinguistic relation between synsets denoting a concept and the sphere of knowledge it belongs to).
DELAF semantic dictionaries
These dictionaries consist of pairs of literals defined for the corresponding semantic relation:
– car,automobile.N
– auto,automobile.N
All possible combinations between literals in the given synsets are listed:
– car,automobile.N
– cars,automobile.N
– auto,automobile.N
– autos,automobile.N
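Listing all combinations between the literals of a synset can be sketched as follows. This is an illustration under assumptions: the function name `delaf_pairs` is hypothetical, the inflected forms are supplied by hand, and a literal is not paired with itself (the slide does not say whether such self-pairs are listed).

```python
from itertools import permutations


def delaf_pairs(forms_by_literal, pos):
    """List 'form,literal.POS' lines for every ordered pair of literals.

    forms_by_literal maps each literal of one synset to its word forms.
    """
    lines = []
    for source, target in permutations(forms_by_literal, 2):
        for form in forms_by_literal[source]:
            lines.append(f"{form},{target}.{pos}")
    return lines


if __name__ == "__main__":
    synset = {
        "car": ["car", "cars"],
        "auto": ["auto", "autos"],
        "automobile": ["automobile", "automobiles"],
    }
    for line in delaf_pairs(synset, "N"):
        print(line)
```

The number of lines grows with the square of the synset size times the number of forms per literal, which suggests why a representation that attaches inflection to the entry itself, as in the NooJ dictionaries, is more compact.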
NooJ Semantic dictionaries
Synonymy relation
'a plant consisting of buildings with facilities for manufacturing'
фабрика,N+FLX=ENG n
предприятие,N+FLX=ENG n
factory,N+FLX=ENG n
mill,N+FLX=ENG n
manufacturing plant,N+FLX=ENG n
manufactory,N+FLX=ENG n
NooJ Semantic dictionaries
Hypernymy relation
'the organized action of making of goods and services for sale'
производство,N+FLX=ENG n
промишленост,N+FLX=ENG n
индустрия,N+FLX=ENG n
production,N+FLX=ENG n
industry,N+FLX=ENG n
manufacture,N+FLX=ENG n
Inflecting wordnet... otstranqwam (to remove) … ГНТ remove something concrete, as by lifting, pushing, taking off, etc. or remove something abstract...
NooJ Semantic descriptions
'the organized action of making of goods and services for sale'
ENG n = /Hs0 + то/Hsd + а /Hp0 + ата /Hpd + мишленост /Ss0 + мишлеността /Ssd + мишлености /Sp0 + мишленостите /Spd + индустрия/Ss0 + индустрията/Ssd + индустрии/Sp0 + индустриите/Spd;
ENG n = /Hs + industry/Ss + industries/Sp0 + manufacture/Ss + manufactures/Sp;
After the nice solutions
Lemmas which are not included in the BGD:
– Classification of lemmas into existing inflection types;
– Formal description of new inflection types;
– Literals in Latin;
– Validating WordNet.
Semantic ambiguity: literals with two inflectional descriptions in the BGD.
Compound words:
– Formal description of inflection types;
– Classification of compounds.
NooJ Compound semantic descriptions
ENG n = /Ss0 + та/Ssd + и (и/p0 +ите/pd) + завод ен/Ss0 + завод ния/Ssh + завод ният/Ssl + заводи ни/Sа0 + заводи ните/Sа0 + рафинерия/Ss0 + рафинерия та/Ssd + рафинерии и/Sp0 + рафинерии ите/Spd;
Applications of the Semantic Dictionaries
– Information retrieval by means of semantic equivalence, with synonymy dictionaries;
– Information retrieval by means of semantic specification, with hyperonymy and meronymy dictionaries;
– Information retrieval by means of similarity;
– Information retrieval by means of thematic domain affiliation;
– Validation of the WordNet structure for completeness and consistency.
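Retrieval by semantic equivalence can be illustrated with a small lookup: every literal is indexed by the synset it belongs to, and a text token matches the query when the two share a synset. This is a sketch independent of NooJ, which performs the lookup through its own dictionaries; the function names and the synset key "factory-synset" are hypothetical, and the literals are taken from the synonymy example in these slides.

```python
def build_index(synsets):
    """Map each lower-cased literal to the set of synset keys it belongs to."""
    index = {}
    for synset_id, literals in synsets.items():
        for literal in literals:
            index.setdefault(literal.lower(), set()).add(synset_id)
    return index


def synonym_matches(query, text, index):
    """Return the tokens of `text` that share a synset with `query`."""
    wanted = index.get(query.lower(), set())
    return [
        token
        for token in text.lower().split()
        if index.get(token, set()) & wanted
    ]


if __name__ == "__main__":
    # Hypothetical synset key; literals come from the synonymy slide.
    synsets = {"factory-synset": ["фабрика", "предприятие", "factory", "mill", "manufactory"]}
    index = build_index(synsets)
    print(synonym_matches("factory", "The old mill reopened", index))
    print(synonym_matches("factory", "Новата фабрика работи", index))
```

Whitespace tokenization cannot match multi-word literals such as "manufacturing plant" or inflected forms; those are exactly the cases that the inflectional and compound descriptions in the NooJ dictionaries are meant to cover.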
Future directions
Extensions and enhancements of the semantic dictionaries by means of:
– Extension of the dictionaries' coverage;
– Addition of other semantic relations;
– Inclusion of additional information in the entries.
Integration of multilingual semantic extraction with NooJ using the Inter-Lingual-Index relation.