Clustering Semantically Enhanced Web Search Results
Anantha Bangalore, MSD, Vienna, VA
Arun Sood, Professor and Chair, CS Dept
Noorullah Moghul, CS PhD student
George Mason University, Fairfax, VA
9 September 2004
Overview
- DAIRS: Distributed Agents for Information Retrieval Systems
- Software agents applied to image, geospatial, and text processing
- Tested within a medical context
- Results of initial testing are provided
- Applicable in many domains
- Scope for discussion
9 September 2004 © 2004 by Arun Sood
DAIRS - Problem Statement
- Data volume is exploding
- Data-rich, information-poor environment
- Many search systems (e.g. Google) provide high recall but low precision
- Increased precision (relevance) saves user time and enables a broader search of candidate URLs
Our Approach
- Assumption: Google and other search engines provide acceptable recall
- DAIRS extracts a robust and relevant result set
- Use an ontology to describe the user context
- Ontological filtering
- Clustering of the concepts
Subset of UMLS Semantic Net
Advantage of Our Agent Approach
- Easily compose solution methodologies using lightweight agents: 200 agents in our system
- Works in a distributed environment
- Agents are mobile
- A load-balancing agent assigns agents in the background
- Exploits parallelism
- Imports functionality from third-party software without importing the application
Interface to Compose Solutions
EXPERIMENT
- Google searches: {cold, strain, fluid, adjustment, fat, condition, etc.}
- Selected the top 100 URLs in each search
- Classified the URLs using DAIRS (UMLS as the ontology filter and Cluto as the clustering software)
- Compared the DAIRS results with a manual classification of the URLs
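The comparison step in the experiment can be sketched as a set comparison between the manually labeled URLs and the URLs DAIRS selected. This is a minimal illustration, not the evaluation code used in the study; the function name `tally` is hypothetical, and the three URLs are taken from the example classification slides purely to show one case of each outcome.

```python
def tally(manual_relevant, dairs_selected):
    """Count agreements and disagreements between two sets of URLs."""
    correct = manual_relevant & dairs_selected       # relevant and selected
    undetected = manual_relevant - dairs_selected    # relevant but missed
    false_alarms = dairs_selected - manual_relevant  # selected but not relevant
    return {
        "correct": len(correct),
        "undetected": len(undetected),
        "false_alarms": len(false_alarms),
    }

# Illustrative inputs: one URL of each kind from the example slides.
manual = {"http://www.coldcure.com/", "http://www.commoncold.org/"}
dairs = {"http://www.coldcure.com/", "http://www.coldasice.com/"}

result = tally(manual, dairs)
print(result)
```

The three counts correspond to the "Correct Classification", "Undetected", and "False Alarms" columns used in the results summary.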
Words with Multiple Senses (Cold)
- NLM has identified 50 frequently occurring words with multiple senses
- Cold disease, cold temperature, cold war, cold fusion, cold springs, cold calls, etc.
- Goal: find URLs dealing with cold in a medical context (e.g. common cold)
- The ontology filter (UMLS Metathesaurus) helps de-emphasize non-medical URLs
- Clustering separates medical URLs from other URLs
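The filter-then-cluster idea can be sketched in a few lines. Everything here is an assumption made for illustration: `MEDICAL_TERMS` is a tiny hypothetical stand-in for the UMLS Metathesaurus, the page texts are invented, and the single-pass "leader" clustering is a deliberately simple substitute for Cluto, not a description of how Cluto or DAIRS works.

```python
import math
from collections import Counter

# Hypothetical stand-in for an ontology vocabulary (not real UMLS content).
MEDICAL_TERMS = {"virus", "symptoms", "fever", "cough", "infection",
                 "treatment", "sneezing"}

def concept_vector(text):
    """Ontology filter: keep only tokens that are known medical terms."""
    tokens = (t.strip(".,;:!?()").lower() for t in text.split())
    return Counter(t for t in tokens if t in MEDICAL_TERMS)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def leader_cluster(vectors, threshold=0.3):
    """Single-pass clustering; pages with no medical terms get label -1."""
    leaders, labels = [], []
    for vec in vectors:
        if not vec:
            labels.append(-1)
            continue
        for idx, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(vec)
            labels.append(len(leaders) - 1)
    return labels

# Invented page texts, one per sense of "cold".
pages = [
    "Cold symptoms include cough, sneezing and fever.",
    "A virus infection causes fever and cough.",
    "Cold calls are a sales technique.",
    "Winter wear for cold weather.",
]
labels = leader_cluster([concept_vector(p) for p in pages])
print(labels)
```

In this toy run the two medical pages share a cluster, while the sales and winter-wear pages contain no ontology terms and are set aside. A page with little text would also produce an empty vector, which is consistent with the undetected URLs reported for image-heavy sites.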
EXAMPLE URL CLASSIFICATION: Common Cold
Common Cold URLs classified correctly
- http://www.coldcure.com/
- http://www.nlm.nih.gov/medlineplus/commoncold.html
- http://www.cdc.gov/flu/protect/sick.htm
- http://lib-sh.lsumc.edu/fammed/pted/cold.html
- http://www.healthscout.com/template.asp?page=cold&ap=1
Undetected URLs
- http://www.commoncold.org/ (contains images and links to other websites, little text)
- http://www.commoncold.co.uk/ (contains very little textual content)
- http://myheala.com/ (contains images and very little text)
- http://www.coldeeze.com/ (mostly image content)
URL CLASSIFICATION EXAMPLE – 2
Cold URLs: False Alarms
- http://www.theatlantic.com/unbound/jazz/sundgaar.htm (news article describing a story set in a cold place)
- http://www.coldasice.com/ (winter wear)
- http://www.inc.com/guides/sales/20677.html (cold calls, i.e. sales calls)
- http://www.cold-me.net/ (music website)
SUMMARY OF THE RESULTS: IR Measures
Columns: CONCEPT | Google Hits Analyzed | Correct Classification | Undetected | False Alarms (Google) | False Alarms (DAIRS)
Common cold (Disease): 100 5 4 91
Cold Temperature: 9 86
Strain (Muscle): 15 6 79
Strain (Bacterial): 42 14 44 8
Fluid (Substance): 2 96
Fluid (Behavior): 36 55 12
As in the "cold" example, most of the misses occur because these sites contain little text: mostly images and pointers to other web pages.
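A worked example may help connect these counts to the usual IR measures. It uses the Strain (Bacterial) row and assumes the four values read as 42 correct, 14 undetected, 44 Google false alarms, and 8 DAIRS false alarms; that reading is an interpretation of the flattened table, although it is consistent with 42 + 14 + 44 = 100 hits analyzed. Recall here is measured only against the relevant URLs among Google's top 100, not against the whole web.

```python
def precision(correct, false_alarms):
    """Fraction of selected URLs that are relevant."""
    return correct / (correct + false_alarms)

def recall(correct, undetected):
    """Fraction of relevant URLs that were selected."""
    return correct / (correct + undetected)

# Strain (Bacterial) row, under the column reading assumed in the lead-in.
analyzed, correct, undetected, google_fa, dairs_fa = 100, 42, 14, 44, 8

# Google returns all 100 URLs, of which correct + undetected are relevant.
google_precision = (correct + undetected) / analyzed
dairs_precision = precision(correct, dairs_fa)
dairs_recall = recall(correct, undetected)

print(f"Google precision: {google_precision:.2f}")  # 0.56
print(f"DAIRS precision:  {dairs_precision:.2f}")   # 0.84
print(f"DAIRS recall:     {dairs_recall:.2f}")      # 0.75
```

Under this reading, precision rises from 0.56 to 0.84 while a quarter of the relevant URLs are lost, which matches the trade-off of fewer false alarms at some cost in detections.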
Location of hits: Usability Measures
CONCEPT | Google Hits Analyzed | Correct Classification | Undetected
Common cold (Disease) | 100 | 44, 45, 49, 62, 83 | 19, 50, 51, 99
Cold Temperature | | 43, 53, 58, 74, 96 | 1, 9, 17, 31, 37, 54, 64, 84, 85
Strain (Muscle) | | 1, 17, 18, 22, 27, 33, 42, 43, 54, 67, 79, 80, 84, 92, 95 | 7, 11, 20, 38, 53, 85
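One way to read these numbers as a usability measure is sketched below for the Common cold (Disease) row. It assumes the listed values are positions in Google's ranked list of 100 results, which the slide title suggests but does not state; the helper names are hypothetical.

```python
# Common cold (Disease) row; values assumed to be Google result positions.
correct = [44, 45, 49, 62, 83]
undetected = [19, 50, 51, 99]

def hits_in_top(positions, k):
    """How many of the given positions fall within the first k results."""
    return sum(1 for p in positions if p <= k)

def scan_depth(positions):
    """How far down the list a reader must go to reach all given positions."""
    return max(positions)

relevant = sorted(correct + undetected)
first_relevant = relevant[0]
top20 = hits_in_top(relevant, 20)
depth_for_correct = scan_depth(correct)

print(first_relevant)     # 19
print(top20)              # 1
print(depth_for_correct)  # 83
```

Under that assumption, a reader working through Google's list in order sees the first medical URL at position 19 and must scan 83 results to reach the five URLs that DAIRS groups together directly.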
Building a Robust DAIRS
- The previous study shows that some sites were not properly classified because they contained little text
- Next steps
- Build an agent to extract links to the next level of URLs
- Build an agent to parse the text of the next-level URLs and include it in the search results
- Build an agent to OCR the images and extract text
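The first of these next steps, extracting links to the next level of URLs, could look roughly like the sketch below. It uses only Python's standard `html.parser` and `urllib.parse` modules and is not the DAIRS agent itself; the class name `LinkExtractor` and the sample HTML snippet are invented for illustration, and fetching the pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Invented sample markup; not the real content of any site named in the slides.
sample = ('<a href="/symptoms.html">Symptoms</a> '
          '<a href="http://www.nlm.nih.gov/medlineplus/commoncold.html">NLM</a>')
links = extract_links(sample, "http://www.commoncold.org/")
print(links)
```

The text of each linked page could then be passed through the same ontology filter as the top-level page, which is the purpose of the second next step.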
DAIRS vs. Search Engines
- DAIRS complements search engines by fine-tuning target-specific searches
- DAIRS permits creating user-based filters using ontologies
- DAIRS facilitates the creation of user-guided, technology-specific dictionaries
- Our project on DAIRS for nanotechnology will build a mega-dictionary, which will be parsed into components of interest to clients
Commercial Applicability of DAIRS
- Example: monitoring developments in nanotechnology
- The dynamic issues of a growing field make it an ideal place to use a DAIRS approach to manage information
Date | Google URLs | Google News (30 days)
9/6 | 1.59 M | 1150
6/6 | | 712
4/29 | 1.42 M | 1390
3/22 | 1.3 M | 970
Review – Key issues
- Ontologies can be used to focus the search results
- Significant reduction in false alarms, with some loss in detections
- Discussed strategies for improving DAIRS
- DAIRS complements search engines
- Broad applicability
Questions?
- Can DAIRS be used for composition of web services?
- How to build an ontology?
- Is it possible to build a good enough representation?
- Single ontology or linked ontologies?
- Build a single ontology for an organization?
- How difficult is it to build an agent?
- What is under the hood?
- Why is agent mobility important?