CSE Department, I.I.T. Bombay
Automatic Lexicon Generation through WordNet
by Nitin Verma and Pushpak Bhattacharyya
Jan 21, 2004
Introduction
• A lexicon is the heart of any natural language processing system.
• It is difficult to construct, requiring an enormous amount of time and manpower.
• Document-specific dictionary generation:
  – Given a document D and a word W therein, which sense S of W should be picked for that document?
  – Can one construct a document-specific dictionary in which a single sense of each word is stored?
Introduction: UW Dictionary
• An important machine-readable lexical resource used by the enconverter and deconverter software.
[Figure: Natural Language → Enconverter → UNL, with the UW Dictionary and the Analysis Rules as inputs to the Enconverter]
Introduction (UW dictionary)
• Format of dictionary entries:
  [crane] “crane (icl>bird)” (N, ANIMT, FAUNA, BIRD);
  Here [crane] is the headword (HW), “crane (icl>bird)” is the UW with its restriction (icl>bird), and (N, ANIMT, FAUNA, BIRD) are the attributes (both syntactic and semantic).
  – Semantic attributes (derived from the ontology).
  – Syntactic attributes (POS, person, number, tense).
  – Used for the firing of appropriate analysis rules.
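The entry format on this slide can be read mechanically. The following is a minimal Python sketch, assuming every entry has the three-part shape of the crane example (headword in square brackets, quoted UW, parenthesised attribute list, closing semicolon). The names `UWEntry`, `ENTRY_RE` and `parse_entry` are illustrative and are not taken from the original system.

```python
import re
from typing import List, NamedTuple


class UWEntry(NamedTuple):
    headword: str          # HW, e.g. "crane"
    uw: str                # full UW, e.g. "crane (icl>bird)"
    restriction: str       # e.g. "icl>bird" ("" if the UW has none)
    attributes: List[str]  # syntactic and semantic attributes


# Accepts straight or curly double quotes around the UW.
ENTRY_RE = re.compile(
    r'\[(?P<hw>[^\]]+)\]\s*'
    r'["\u201c](?P<uw>[^"\u201d]+)["\u201d]\s*'
    r'\((?P<attrs>[^)]*)\)\s*;'
)


def parse_entry(line: str) -> UWEntry:
    """Split one dictionary line into headword, UW, restriction, attributes."""
    match = ENTRY_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a UW dictionary entry: {line!r}")
    uw = match.group("uw").strip()
    # The restriction is whatever sits inside the parentheses of the UW.
    inner = re.search(r'\(([^)]*)\)', uw)
    restriction = inner.group(1).strip() if inner else ""
    attributes = [a.strip() for a in match.group("attrs").split(",") if a.strip()]
    return UWEntry(match.group("hw").strip(), uw, restriction, attributes)


entry = parse_entry('[crane] "crane (icl>bird)" (N, ANIMT, FAUNA, BIRD);')
print(entry.headword, entry.restriction, entry.attributes)
```

Keeping the restriction as a separate field makes it easy to check later that a generated UW and its semantic attributes agree with each other.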
Introduction: Ontology*
• Animate (ANIMT)
  – Flora (FLORA)
    ◦ Shrubs (ANIMT, FLORA, SHRB), e.g. jasmine
    ◦ Aquatic plants (ANIMT, FLORA, AQTC), e.g. lotus
    ◦ …
  – Fauna (FAUNA)
    ◦ Mammals (MML)
    ◦ Reptiles (ANIMT, FAUNA, RPTL), e.g. lizard
    ◦ Birds (ANIMT, FAUNA, BIRD)
    ◦ Fish (ANIMT, FAUNA, FISH)
    ◦ Insects (ANIMT, FAUNA, INSCT), e.g. butterfly
    ◦ …
* Dictionary group, CFILT, IIT Bombay.
English-UW dictionary generation
English-UW dictionary generation
• Resources used: English WordNet, a WSD* system (soft word sense disambiguation method), the UNL KB and an inferencer.
• Knowledge-based approach.
* G. Ramakrishnan and P. Bhattacharyya. Soft Word Sense Disambiguation, GWN 2004.
English-UW dictionary generation: Method
• Stage 1 – [Figure: Input Document (Word1 word) → WSD* → POS and sense tagged document (Word1:N:1 Word2:N:)]
• Stage 2 –
English-UW dictionary generation (Method)
[Figure: the Tagged Document (Word1:pos1:sense1, Word2:pos2:sense) is the input to the Inference Engine, which consults the KB (WordNet, the UNL KB and the Database of rules) and outputs the UW Dictionary together with an Explanation]
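The tagged document that links the two stages is a sequence of `word:pos:sense` tokens. A minimal Python sketch of reading such tokens follows. It assumes that the sense field is an integer sense number, as in the crane:N:4 example used later in the slides; `TaggedWord`, `parse_tagged` and `parse_document` are hypothetical names.

```python
from typing import List, NamedTuple


class TaggedWord(NamedTuple):
    word: str
    pos: str    # part-of-speech tag, e.g. "N"
    sense: int  # sense number assigned by the WSD step (assumed to be an integer)


def parse_tagged(token: str) -> TaggedWord:
    """Parse one 'word:pos:sense' token; raises ValueError if it is malformed."""
    # rsplit keeps any ':' inside the word itself out of the pos/sense fields.
    word, pos, sense = token.rsplit(":", 2)
    return TaggedWord(word, pos, int(sense))


def parse_document(text: str) -> List[TaggedWord]:
    """Parse a whitespace-separated POS and sense tagged document."""
    return [parse_tagged(tok) for tok in text.split()]


print(parse_document("crane:N:4"))
```

A token with a missing field is rejected rather than silently defaulted, so that an incomplete tag never reaches the inference step.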
UW generation for nouns
[Figure: the Tagged Document (crane:N:4, Word2:pos2:sense) is the input to the Inference Engine, which consults the KB (WordNet and the UNL KB)]
1. The entry crane:N:4 is taken from the tagged document.
2. A query to collect semantic information.
3. Result: crane → bird → fauna, animal → organism
4. A query to collect relevant rules.
5. Result:
   depth | word | relation | restriction
   6     | bird | icl      | animal
   5     |      | icl      | living thing
   4     |      | null     |
6. Generated UW: crane(icl>bird)
7. Explanation
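The crane walkthrough can be sketched in a few lines of Python. The selection criterion used here, picking the hypernym that occurs in the UNL KB at the greatest depth, is an assumption read off the example (bird at depth 6 wins); the slides do not state the rule explicitly. `TOY_UNL_KB` and `noun_uw` are hypothetical names, and the KB contents are toy data, not the real UNL KB.

```python
from typing import Dict, List

# Toy stand-in for the UNL KB: word -> depth in the UW hierarchy (illustrative values).
TOY_UNL_KB: Dict[str, int] = {"bird": 6, "animal": 5, "living thing": 4}


def noun_uw(word: str, hypernym_chain: List[str], unl_kb: Dict[str, int]) -> str:
    """Build a UW for a noun from its hypernym chain.

    Assumption: the restriction is the hypernym found in the KB at the
    greatest depth, i.e. the most specific concept the KB knows about.
    """
    candidates = [h for h in hypernym_chain if h in unl_kb]
    if not candidates:
        # Assumed fallback: no restriction can be derived, so return the bare word.
        return word
    best = max(candidates, key=lambda h: unl_kb[h])
    return f"{word}(icl>{best})"


# Hypernym chain for crane:N:4 as shown on the slide.
chain = ["bird", "fauna", "animal", "organism"]
print(noun_uw("crane", chain, TOY_UNL_KB))
```

Choosing the most specific hypernym keeps the restriction informative: icl>bird distinguishes this sense of crane from others, whereas icl>living thing would not.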
UW generation for verbs
Input word
1. Is {hypernyms(word)} ∩ {‘be’, ‘continue’, etc.} ≠ ∅?
   – true: (icl>be), e.g. exist(icl>be)
   – false: go to test 2
2. Is {hypernyms(nominal word)} ∩ {‘phenomenon’, ‘natural event’, etc.} ≠ ∅?
   – true: (icl>occur), e.g. rain(icl>occur)
   – false: (icl>do), e.g. make(icl>do)
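The verb decision procedure can be written as two set-intersection tests. The Python sketch below assumes that a non-empty intersection selects the "true" branch, which is the reading consistent with the examples exist, rain and make. The two marker sets hold only the members named on the slide (the "etc." is not filled in), and the hypernym lists in the usage lines are toy inputs, not actual WordNet output.

```python
from typing import Iterable

# Only the members named on the slide; the real sets are larger ("etc.").
BE_LIKE = frozenset({"be", "continue"})
OCCUR_LIKE = frozenset({"phenomenon", "natural event"})


def verb_restriction(verb_hypernyms: Iterable[str],
                     nominal_hypernyms: Iterable[str]) -> str:
    """Choose icl>be, icl>occur or icl>do for a verb.

    verb_hypernyms:    hypernyms of the verb sense itself.
    nominal_hypernyms: hypernyms of the corresponding nominal word.
    """
    if BE_LIKE & set(verb_hypernyms):
        return "icl>be"
    if OCCUR_LIKE & set(nominal_hypernyms):
        return "icl>occur"
    return "icl>do"


def verb_uw(verb: str, verb_hypernyms: Iterable[str],
            nominal_hypernyms: Iterable[str]) -> str:
    return f"{verb}({verb_restriction(verb_hypernyms, nominal_hypernyms)})"


# Toy hypernym lists chosen to exercise each branch.
print(verb_uw("exist", ["be"], []))
print(verb_uw("rain", ["fall"], ["natural event"]))
print(verb_uw("make", ["create"], []))
```

The order of the tests matters: the icl>be test runs first, and icl>do is the default when neither marker set is hit.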
UW generation for adjectives
Input word
1. Is the UW present in the UNL KB?
   – Yes: pick the UW, e.g. broad(aoj>thing)
   – No: go to test 2
2. Is IS_DEFINED (is_a_value_of relation) true for the input word?
   – Yes: (aoj>thing), e.g. good(aoj>thing)
   – No: (mod>thing), e.g. green(mod>thing)
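The adjective flowchart translates directly into a lookup followed by one test. In this Python sketch the UNL KB is represented by a toy dictionary and the IS_DEFINED check by a toy set of adjectives that have an is_a_value_of relation; both are hypothetical stand-ins for the real resources, populated only with the examples from the slide.

```python
from typing import Callable, Dict

# Toy stand-ins (illustrative only): adjectives already in the UNL KB,
# and adjectives for which an is_a_value_of relation is defined.
TOY_UNL_KB: Dict[str, str] = {"broad": "broad(aoj>thing)"}
TOY_VALUE_ADJECTIVES = {"good"}


def adjective_uw(adj: str,
                 unl_kb: Dict[str, str],
                 is_defined: Callable[[str], bool]) -> str:
    """Follow the flowchart: KB lookup first, then the is_a_value_of test."""
    if adj in unl_kb:
        return unl_kb[adj]          # UW already present: pick it
    if is_defined(adj):
        return f"{adj}(aoj>thing)"  # is_a_value_of relation is defined
    return f"{adj}(mod>thing)"      # default


def has_value_relation(adj: str) -> bool:
    return adj in TOY_VALUE_ADJECTIVES


for word in ("broad", "good", "green"):
    print(adjective_uw(word, TOY_UNL_KB, has_value_relation))
```

Passing the IS_DEFINED check in as a function keeps the decision logic independent of how the is_a_value_of relation is actually looked up.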
Semantic attribute generation
[Figure: the Tagged Document (crane:N:4, Word2:pos2:sense) is the input to the Inference Engine, which consults the KB (WordNet and the Database of rules)]
1. The entry crane:N:4 is taken from the tagged document.
2. A query to collect semantic information.
3. Result: crane → bird → fauna, animal → organism
4. A query to collect relevant rules.
5. Result:
   IF hypernym=‘organism’ THEN generate ‘ANIMT’ ELSE generate ‘INANI’;
   IF hypernym=‘fauna’ THEN generate ‘FAUNA’;
   IF hypernym=‘bird’ THEN generate ‘BIRD’;
6. Generated attributes: (N, ANIMT, FAUNA, BIRD)
Semantic attribute generation: Database of rules
• Number of such rules: 4344

Table 1. Rules for nouns (96)
   HYPERNYM  | ATTRIBUTE
   organism  | ANIMT
   flora     | FLORA
   fauna     | FAUNA
   bird      | BIRD

Table 2. Rules for verbs (405)
   HYPERNYM    | ATTRIBUTE
   change      | VOA, CHNG
   communicate | VOA, COMM
   move        | VOA, MOTN
   complete    | VOA, CMPLT

Table 3.1. Rules for adjectives (29)
   IS_A_VALUE_OF | ATTRIBUTE
   weight        | DES, WT
   strength      | DES, STRNGTH
   qual          | DES, QUAL

Table 3.2. Rules for adjectives (3258)
   SYNONYMY OR ANTONYMY | ATTRIBUTE
   bright               | DES, APPR
   deep                 | DES, DPTH
   shallow              | DES, DPTH

Table 4. Rules for adverbs (556)
   SYNONYMY    | ATTRIBUTE
   backward    | DRCTN
   always      | FREQ
   frequent    | FREQ
   beautifully | MAN
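Applying the noun rules to a hypernym chain can be sketched as a table lookup. The Python below uses only the four noun rules listed in Table 1 and the ANIMT/INANI rule with its ELSE branch from the crane example; the function name `noun_attributes`, and the choice to emit attributes from the most general hypernym to the most specific, are assumptions made so that the output order matches (N, ANIMT, FAUNA, BIRD).

```python
from typing import Dict, List

# The four sample noun rules from Table 1: hypernym -> attribute.
NOUN_RULES: Dict[str, str] = {
    "organism": "ANIMT",
    "flora": "FLORA",
    "fauna": "FAUNA",
    "bird": "BIRD",
}


def noun_attributes(hypernym_chain: List[str],
                    rules: Dict[str, str] = NOUN_RULES) -> List[str]:
    """Derive the attribute list of a noun from its hypernym chain.

    hypernym_chain is ordered from most specific to most general,
    e.g. ["bird", "fauna", "animal", "organism"].
    """
    attributes = ["N"]  # syntactic attribute: part of speech
    # Rule with an ELSE branch: animate if 'organism' is a hypernym.
    attributes.append("ANIMT" if "organism" in hypernym_chain else "INANI")
    # Remaining rules, walked from general to specific (assumed ordering).
    for hypernym in reversed(hypernym_chain):
        attribute = rules.get(hypernym)
        if attribute is not None and attribute not in attributes:
            attributes.append(attribute)
    return attributes


print(noun_attributes(["bird", "fauna", "animal", "organism"]))
```

Hypernyms without a rule, such as animal in this chain, are simply skipped, so the rule base can stay much smaller than WordNet itself.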
Experiments and results
Precision = (No. of correct entries in the dictionary) / (Total no. of entries in the dictionary)
• Precision for nouns: 93.9%
• Precision for verbs: 84.4%
• Precision for adjectives: 90.06%
• Precision for adverbs: 86%
[Figures: one chart per part of speech, x-axis: Document No]
Implementation details
• Subtasks identified:
  – A MySQL database is used for storing the rules and the UNL KB.
    ◦ 7540 entries in the UNL KB.
    ◦ 4344 entries in the rule base.
  – Inference engine in C++.
  – Web interface of the DDG in CGI and PHP.
  – Other utilities, such as the UNL KB organizer, the rule entry interface and the WSD integrator, are implemented in Perl.
  – LOC: 4761
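Storing the rules in a relational database turns the "query to collect relevant rules" into a single SELECT. The sketch below illustrates the idea with Python's built-in `sqlite3` module standing in for MySQL; the table name `noun_rules`, its columns and the helper names are hypothetical, since the slides do not show the actual schema.

```python
import sqlite3
from typing import Dict, List


def build_rule_db() -> sqlite3.Connection:
    """Create an in-memory rule base holding the four sample noun rules."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE noun_rules (hypernym TEXT PRIMARY KEY, attribute TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO noun_rules (hypernym, attribute) VALUES (?, ?)",
        [("organism", "ANIMT"), ("flora", "FLORA"),
         ("fauna", "FAUNA"), ("bird", "BIRD")],
    )
    conn.commit()
    return conn


def rules_for(conn: sqlite3.Connection, hypernyms: List[str]) -> Dict[str, str]:
    """Fetch the rules whose hypernym occurs in the given chain."""
    if not hypernyms:
        return {}
    # One '?' placeholder per hypernym keeps the query parameterised.
    placeholders = ",".join("?" for _ in hypernyms)
    rows = conn.execute(
        f"SELECT hypernym, attribute FROM noun_rules WHERE hypernym IN ({placeholders})",
        hypernyms,
    ).fetchall()
    return dict(rows)


conn = build_rule_db()
print(rules_for(conn, ["bird", "fauna", "animal", "organism"]))
```

Only the rules relevant to the current word are loaded, which matters when the rule base has thousands of entries.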
Demo
Hindi-UW dictionary generation: Method
1. The WordNet API is used to obtain all possible parts of speech and all possible senses for every word.
2. The Hindi WN is queried (using the Hindi WN API) to obtain the semantic attributes.
3. The Hindi UW dictionary database is queried (on the basis of the input word and its POS) to obtain an appropriate UW.
4. The irrelevant entries are disabled and the incorrect ones are corrected manually by the lexicographer.
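The four steps amount to generating every candidate entry automatically and leaving the final selection to the lexicographer. A minimal Python sketch of that flow follows. The three dictionaries stand in for the WordNet API, the Hindi WN API and the Hindi UW dictionary database, whose real interfaces are not shown in the slides; the romanised headword, attributes and UWs in them are made-up toy data.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Candidate:
    headword: str
    pos: str
    sense: int
    attributes: List[str]
    uw: str
    enabled: bool = True  # the lexicographer may switch this off in step 4


# Toy stand-ins for the three resources (illustrative data only).
TOY_SENSES: Dict[str, List[Tuple[str, int]]] = {"sona": [("N", 1), ("V", 1)]}
TOY_ATTRIBUTES: Dict[Tuple[str, str, int], List[str]] = {
    ("sona", "N", 1): ["N", "INANI"],
    ("sona", "V", 1): ["V"],
}
TOY_UW_DB: Dict[Tuple[str, str], str] = {
    ("sona", "N"): "gold(icl>metal)",
    ("sona", "V"): "sleep(icl>do)",
}


def generate_candidates(word: str) -> List[Candidate]:
    """Steps 1 to 3: one candidate per (POS, sense) of the word."""
    candidates = []
    for pos, sense in TOY_SENSES.get(word, []):                 # step 1
        attrs = TOY_ATTRIBUTES.get((word, pos, sense), [pos])   # step 2
        uw = TOY_UW_DB.get((word, pos), "")                     # step 3
        candidates.append(Candidate(word, pos, sense, attrs, uw))
    return candidates


def review(candidates: List[Candidate],
           irrelevant: Set[Tuple[str, int]]) -> List[Candidate]:
    """Step 4: disable the (POS, sense) pairs the lexicographer rejects."""
    for cand in candidates:
        if (cand.pos, cand.sense) in irrelevant:
            cand.enabled = False
    return [cand for cand in candidates if cand.enabled]


kept = review(generate_candidates("sona"), irrelevant={("V", 1)})
print([cand.uw for cand in kept])
```

Rejected candidates are flagged rather than deleted, mirroring the slide's wording that irrelevant entries are disabled.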
Demo
Conclusion and future work
• The burden of lexicography has been reduced considerably.
• The system is being routinely used in our work on machine translation in a tri-language setting (English, Hindi and Marathi).
• Future work will be directed towards implementing a part-of-speech tagger and a word sense disambiguator for Hindi and Marathi.