WordNet-Based Collaborative Weighting for Ranking Web Pages
Hyoungil Kim, Juntae Kim (Dongguk University, Seoul, Korea)
Kyeonah Yu (Duksung Women's University, Seoul, Korea)

Dept. of Computer Engineering, Dongguk University

Agenda
1. Introduction
2. Background
   – Next Generation Search Engines
   – WordNet
3. The Proposed Method
   – Sense Determination
   – Query Expansion
   – Sense-specific Collaborative Weighting
4. Experiments
5. Conclusion

Introduction
Extracting information from the Web with a general-purpose search engine is hard.
– Query words suffer from word sense ambiguity.
– The number of search results is very large.
– The keyword-based method cannot discriminate important pages.
Suggestion
– Disambiguate the query word using WordNet.
– Rank pages using sense-specific collaborative weighting.

Next Generation Search Engines
Issue
– The problems of the keyword-based method
New weighting and ranking schemes
– Static reference information (hyperlink structure)
    Shows global authority
– Dynamic reference information (user response)
    Shows global popularity

WordNet
Development
– Begun in 1985 at Princeton by a cognitive science team and psycholinguists
Contents
– A vocabulary database
– Classifies the English vocabulary according to the meaning of each word
    95,600 different words
    70,100 different meanings
    Words with the same meaning form a synset

WordNet
Relationships between words
– Synonym / Antonym
    Similar / opposite meaning between words
    Ex) rise = ascend (synonyms); rise vs. fall (antonyms)
– Hyponym / Hypernym
    The hierarchical relationship of meanings
    Ex) maple => tree => plant
– Meronym / Holonym
    The part-whole (inclusive) relationship
    Ex) leaf is a part of tree

Synset hierarchy of WordNet
The synsets in WordNet are organized hierarchically according to their hypernym relationships.
Example
{Cattle, Cows, Oxen} => {Bovine} => …. => {Mammal} => {Vertebrate, Craniate} => {Chordate} => {Animal}
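
Walking the hypernym links upward from a synset reaches its top-level category, which is what the later weighting step needs. A minimal sketch, assuming a toy hypernym table built from the example on this slide; the names `HYPERNYM`, `hypernym_chain`, and `top_level_category` are hypothetical, and the levels elided on the slide are simply skipped here:

```python
# Toy hypernym links taken from the slide's example. The levels the slide
# elides between {Bovine} and {Mammal} are NOT filled in; this table jumps
# straight from one to the other (an assumption made for illustration).
HYPERNYM = {
    "cattle": "bovine",
    "bovine": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
}


def hypernym_chain(synset, hypernym=HYPERNYM):
    """Return the path from a synset up to the ancestor that has no hypernym."""
    chain = [synset]
    seen = {synset}
    while chain[-1] in hypernym:
        parent = hypernym[chain[-1]]
        if parent in seen:  # guard against accidental cycles in the table
            break
        chain.append(parent)
        seen.add(parent)
    return chain


def top_level_category(synset, hypernym=HYPERNYM):
    """The last element of the chain plays the role of the top-level category."""
    return hypernym_chain(synset, hypernym)[-1]


print(hypernym_chain("cattle"))
print(top_level_category("cattle"))
```

A real system would read these links from the WordNet database rather than from a hand-written dictionary.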

3 senses of java

Sense 1
Java -- (an island in Indonesia S of Borneo; one of the world's most densely populated regions)
   => island -- (a land mass (smaller than a continent) that is surrounded by water...
      => land, dry land, earth, ground, solid ground, terra firma -- (the solid part...
         => object, physical object -- (a physical (tangible and visible) entity;...
            => entity, something -- (anything having existence (living or nonliving))

Sense 2
coffee, java -- (a beverage consisting of an infusion of ground coffee beans;...
   => beverage, drink, drinkable, potable -- (any liquid suitable for drinking:...
      => food, nutrient -- (any substance that can be metabolized by an organism...
         => substance, matter -- (that which has mass and occupies space;...
            => object, physical object -- (a physical (tangible and visible)...
               => entity, something -- (anything having existence (living or nonliving))

Sense 3
Java -- (a simple platform-independent object-oriented programming language...
   => object-oriented programming language, object-oriented programing language...
      => programming language, programing language -- ((computer science) a language...
         => artificial language -- (a language that is deliberately created for...
            => language, linguistic communication -- (a systematic means of...
               => communication -- (something that is communicated between...
                  => social relation -- (a relation between living organisms;...
                     => relation -- (an abstraction belonging to or...
                        => abstraction -- (a general concept formed by...

Sense Determination
To determine the sense of the query
– Use the synset hierarchy of WordNet
State of the sense of the query
– The query word is either ambiguous or unambiguous.
Strategy
– Provide a user interface through which the user can select one of the synsets.
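
The decision on this slide can be sketched as follows: look up the candidate synsets, and ask the user only when more than one exists. This is an illustration under stated assumptions, not the presented system. The `SENSES` table is a hypothetical stand-in for a WordNet lookup (abridged from the three senses of "java"), and the `choose` callback stands in for the selection interface:

```python
# Hypothetical sense inventory, abridged from the WordNet entries for "java".
SENSES = {
    "java": [
        {"synset": ("Java",),
         "gloss": "an island in Indonesia S of Borneo"},
        {"synset": ("coffee", "java"),
         "gloss": "a beverage consisting of an infusion of ground coffee beans"},
        {"synset": ("Java",),
         "gloss": "a simple platform-independent object-oriented programming language"},
    ],
}


def candidate_senses(word, senses=SENSES):
    """All synsets recorded for the query word (empty list if unknown)."""
    return senses.get(word.lower(), [])


def is_ambiguous(word, senses=SENSES):
    """A query word is ambiguous when it belongs to more than one synset."""
    return len(candidate_senses(word, senses)) > 1


def determine_sense(word, choose, senses=SENSES):
    """Return the single sense, or let `choose` (the user) pick an index."""
    options = candidate_senses(word, senses)
    if not options:
        return None
    if len(options) == 1:
        return options[0]
    return options[choose(options)]


# Simulate a user who picks the second entry (the coffee sense).
picked = determine_sense("Java", choose=lambda options: 1)
print(picked["synset"])
```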

[Figure: screenshot with labeled regions: Hypernym, Synonym, Annotation, The search query]

Query Expansion
Expand the query by using:
– the synonyms, hypernyms, or annotation of the selected sense
– Words from each part are extracted and added to the query (OR)
If the user selected sense 2 of “Java”:
– Using the synonym: {Java} => {Java, coffee}
– Using the hypernym: {Java} => {Java, beverage, drink}
– Using the annotation: {Java} => {Java, beverage, infusion, coffee, bean}
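
The three expansions above can be reproduced with a short sketch. It assumes the selected sense has already been reduced to three word lists; `SENSE_2_OF_JAVA` and `expand_query` are hypothetical names, and joining terms with the literal string `" OR "` is an assumption about the search engine's query syntax:

```python
# Words extracted from sense 2 of "Java", copied from the slide's example.
SENSE_2_OF_JAVA = {
    "synonym": ["coffee"],
    "hypernym": ["beverage", "drink"],
    "annotation": ["beverage", "infusion", "coffee", "bean"],
}


def expand_terms(word, sense, source):
    """Original query word followed by the words from one part of the sense."""
    terms = [word]
    seen = {word.lower()}
    for term in sense[source]:
        if term.lower() not in seen:  # never add the same word twice
            terms.append(term)
            seen.add(term.lower())
    return terms


def expand_query(word, sense, source):
    """Join the expanded terms with OR (query syntax is an assumption)."""
    return " OR ".join(expand_terms(word, sense, source))


for source in ("synonym", "hypernym", "annotation"):
    print(source, "->", expand_query("Java", SENSE_2_OF_JAVA, source))
```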

Sense-Specific Collaborative Weighting
Weighting of Web pages
– The 26 top-level categories of the noun hierarchy are used to store 26 sense-specific weights for each Web page

Web page   Count for {Food}   Count for {Location}   Count for {Comm.}   Total count
URL
URL
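
One way to picture the table above is a per-page dictionary of click counts keyed by top-level category, with the total derived by summing. The sketch below is a guess at such a structure, not the authors' implementation; the class name `ClickWeights`, its methods, and the example URLs are all hypothetical:

```python
from collections import defaultdict


class ClickWeights:
    """Per-page click counts, kept separately for each top-level category."""

    def __init__(self):
        # counts[url][category] -> number of recorded clicks
        self.counts = defaultdict(lambda: defaultdict(int))

    def record_click(self, url, category):
        """Record that a user searching with this sense clicked the page."""
        self.counts[url][category] += 1

    def sense_weight(self, url, category):
        """Sense-specific weight: clicks made under one category only."""
        return self.counts.get(url, {}).get(category, 0)

    def total_weight(self, url):
        """Total click count across all categories."""
        return sum(self.counts.get(url, {}).values())

    def rank(self, urls, category=None):
        """Order pages by sense-specific weight, or by total if no category."""
        if category is None:
            return sorted(urls, key=self.total_weight, reverse=True)
        return sorted(urls, key=lambda u: self.sense_weight(u, category),
                      reverse=True)


weights = ClickWeights()
for _ in range(3):
    weights.record_click("http://a.example/coffee", "Food")
for _ in range(5):
    weights.record_click("http://b.example/jdk", "Communication")
weights.record_click("http://b.example/jdk", "Food")

pages = ["http://b.example/jdk", "http://a.example/coffee"]
print(weights.rank(pages, "Food"))  # sense-specific ordering
print(weights.rank(pages))          # total-count ordering
```

In this toy data the two orderings differ: the page that is popular overall is not the one users preferred under the {Food} sense, which is the case the sense-specific weights are meant to capture.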

[Figure: The Experimental System]

Experiments
Data set
– For each query word, 200 Web pages were collected from AltaVista.
To obtain the collaborative weighting
– Participants: 200 Computer Engineering undergraduate students
Evaluation
– Compare the number of relevant pages in the top 30
– Experimental system vs. AltaVista
– Total click count weighting vs. sense-specific weighting
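
The evaluation measure, the number of relevant pages among the top 30, can be computed as sketched below. The function name `relevant_in_top_k` and the toy ranking are hypothetical; how relevance was judged in the experiment is not stated on the slide, so the relevant set is simply taken as given:

```python
def relevant_in_top_k(ranked_urls, relevant_urls, k=30):
    """Count how many of the first k ranked pages are judged relevant."""
    relevant = set(relevant_urls)
    return sum(1 for url in ranked_urls[:k] if url in relevant)


# Toy ranking of 40 pages in which every fourth page is relevant.
ranking = [f"http://example.org/page{i}" for i in range(40)]
relevant = {f"http://example.org/page{i}" for i in range(0, 40, 4)}

hits = relevant_in_top_k(ranking, relevant)
print(hits, "relevant pages in the top 30")
print(f"accuracy: {hits / 30:.1%}")
```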

The query words used for the experiments

Word        Synset Si                 Top-level category Ck
Java        {coffee, java}            {Food}
            {Java}                    {Location}
            {Java}                    {Communication}
Character   {character, role,…}       {Action}
            {character, symbol,…}     {Communication}
Custom      {custom, import,…}        {Possession}
            {custom, tradition,…}     {Cognition}
Horse       {horse, heroin,…}         {Artifact}
            {horse, equus…}           {Animal}

Test results of 9 queries
Number of important pages among top-30

Query                          Experimental System    AltaVista
Word        Meaning            SW/SWT                 MW       SW
Java        Island             9        5             4        0
            Coffee             13       7             8        0
            Language
Character   Role               12       9             6        1
            Symbol
Custom      Trade              8        5             4        0
            Tradition          12       7             6        4
Horse       Drug               11       8             6        0
            Animal
Average accuracy               49.6%    39.3%         34.4%    23.7%
Improvement to AltaVista MW    +15.2%   +4.9%         -        -
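
The improvement row appears to be the difference, in percentage points, between each experimental-system average and the AltaVista MW average. A quick check of that reading, using only the averages reported in the table (the helper name `improvement` is hypothetical):

```python
def improvement(system_pct, baseline_pct):
    """Difference in percentage points, rounded to one decimal place."""
    return round(system_pct - baseline_pct, 1)


ALTAVISTA_MW = 34.4  # average accuracy reported for AltaVista MW

# The two experimental-system averages reported in the table.
for system_avg in (49.6, 39.3):
    print(f"{system_avg}% vs {ALTAVISTA_MW}%: "
          f"{improvement(system_avg, ALTAVISTA_MW):+.1f} points")
```

Both differences match the reported +15.2% and +4.9%, so the improvements are absolute differences rather than relative gains.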

Conclusion
An interface that uses WordNet to resolve the ambiguity of the search query is presented.
Sense-specific collaborative evaluation is proposed for ranking Web pages.
Web search performance improves: average top-30 accuracy reaches 49.6%, against 34.4% for AltaVista (MW).