Clustering Semantically Enhanced Web Search Results


Clustering Semantically Enhanced Web Search Results
Anantha Bangalore, MSD, Vienna, VA
Arun Sood, Professor and Chair, CS Dept
Noorullah Moghul, CS PhD student
George Mason University, Fairfax, VA
9 September 2004

Overview
- DAIRS: Distributed Agents for Information Retrieval Systems
- Software agents
- Applied to image, geospatial and text processing
- Tested within medical context
- Results of initial testing are provided
- Applicable in many domains
- Scope for discussion
9 September 2004 © 2004 by Arun Sood

DAIRS - Problem Statement
- Data volume is exploding
- Data-rich, information-poor environment
- Many search systems provide high recall but low precision (e.g. Google)
- Increased precision (relevance)
- Saves user time
- Enables a broader search of candidate URLs

Our Approach
- Assumption: Google (and other) search engines provide acceptable recall
- DAIRS extracts a robust and relevant result set
- Use an ontology to describe the user context
- Ontological filtering
- Clustering of the concepts
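
The filtering idea above can be illustrated with a minimal sketch. It is not the DAIRS implementation: the tiny `ONTOLOGY` dictionary, the scoring rule, the threshold and the sample pages are all hypothetical stand-ins (the slides use UMLS as the ontology and Cluto for clustering). The sketch only shows how keeping ontology terms can separate a medical page from a non-medical one that shares the word "cold".

```python
import re

# Hypothetical mini-ontology (term -> semantic type). Illustrative only;
# the slides use UMLS, which is far larger.
ONTOLOGY = {
    "cold": "Disease or Syndrome",
    "virus": "Virus",
    "fever": "Sign or Symptom",
    "cough": "Sign or Symptom",
    "infection": "Pathologic Function",
}


def ontology_filter(text):
    """Return the ontology terms found in the text, in order of appearance."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t in ONTOLOGY]


def concept_score(text):
    """Fraction of distinct ontology terms that occur in the text."""
    found = set(ontology_filter(text))
    return len(found) / len(ONTOLOGY)


# Hypothetical page texts standing in for fetched search results.
pages = {
    "medical": "A cold is a virus infection; symptoms include cough and fever.",
    "sales": "Cold calls are unsolicited sales calls to prospects.",
}
scores = {name: concept_score(body) for name, body in pages.items()}

# A fixed threshold stands in for the clustering step (an assumption,
# not what a clustering package would do).
THRESHOLD = 0.5
in_context = sorted(name for name, s in scores.items() if s >= THRESHOLD)
```

Here the medical page matches every ontology term while the sales page matches only "cold", so only the former passes the threshold.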

Subset of UMLS Semantic Net

Advantage of Our Agent Approach
- Easily compose solution methodologies using lightweight agents: 200 agents in our system
- Works in a distributed environment
- Agents are mobile
- Load balance agent assigns agents in the background
- Exploits parallelism
- Import functionality from 3rd party software, without importing the application

Interface to Compose Solutions

EXPERIMENT
- Google Search: {cold, strain, fluid, adjustment, fat, condition, etc.}
- Selected top 100 URLs in each search
- Classified the URLs using DAIRS (UMLS as ontology filter and Cluto as clustering software)
- Compared DAIRS results with the URL classification done manually
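
The comparison step of the experiment can be sketched as a set comparison between the URLs a classifier marks as relevant and the URLs marked relevant by hand. This is an illustrative sketch only; the function name and the sample identifiers `u1` to `u6` are hypothetical, and the slides do not describe how the comparison was actually carried out.

```python
def compare_with_manual(classifier_relevant, manual_relevant):
    """Count correct classifications, undetected URLs and false alarms.

    classifier_relevant: URLs the automatic classifier placed in the target concept.
    manual_relevant: URLs a human judged to belong to the target concept.
    """
    predicted, truth = set(classifier_relevant), set(manual_relevant)
    return {
        "correct": len(predicted & truth),       # found by both
        "undetected": len(truth - predicted),    # relevant but missed
        "false_alarms": len(predicted - truth),  # flagged but not relevant
    }


# Hypothetical identifiers standing in for URLs from a top-100 result list.
manual = ["u1", "u2", "u3", "u4"]
automatic = ["u1", "u2", "u5"]
counts = compare_with_manual(automatic, manual)
```

With these sample inputs, two URLs are correct, two are undetected and one is a false alarm.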

Words with Multiple Senses (Cold)
- NLM has identified 50 words with multiple senses that occur frequently
- Cold disease, cold temperature, cold war, cold fusion, cold springs, cold calls, etc.
- Find URLs dealing with cold in a medical context (e.g. common cold)
- Ontology filter (UMLS Metathesaurus) helps de-emphasize non-medical URLs
- Clustering leads to separation of medical-related URLs from other URLs

EXAMPLE URL CLASSIFICATION: Common Cold
Common Cold URLs classified correctly
- http://www.coldcure.com/
- http://www.nlm.nih.gov/medlineplus/commoncold.html
- http://www.cdc.gov/flu/protect/sick.htm
- http://lib-sh.lsumc.edu/fammed/pted/cold.html
- http://www.healthscout.com/template.asp?page=cold&ap=1
Undetected URLs
- http://www.commoncold.org/ (contains images and links to other websites, little text)
- http://www.commoncold.co.uk/ (contains very little textual content)
- http://myheala.com/ (contains images and very little text)
- http://www.coldeeze.com/ (mostly image content)

URL CLASSIFICATION EXAMPLE 2: Cold URLs, False Alarms
- http://www.theatlantic.com/unbound/jazz/sundgaar.htm (news article describing a story set in a cold place)
- http://www.coldasice.com/ (winter wear)
- http://www.inc.com/guides/sales/20677.html (cold calls, i.e. sales calls)
- http://www.cold-me.net/ (music website)

SUMMARY OF THE RESULTS: IR Measures
Columns: CONCEPT | Google Hits Analyzed | Correct Classification | Undetected | False Alarms (Google) | False Alarms (DAIRS)
Common cold (Disease): 100, 5, 4, 91
Cold Temperature: 9, 86
Strain (Muscle): 15, 6, 79
Strain (Bacterial): 42, 14, 44, 8
Fluid (Substance): 2, 96
Fluid (Behavior): 36, 55, 12
As in the “cold” example, most of the misses are due to limited text at these sites: mostly images and pointers to other web pages.
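
As a worked example of the IR measures, the sketch below computes precision and recall for one row. It assumes the Strain (Bacterial) row reads as 42 correct classifications, 14 undetected URLs, 44 false alarms for Google and 8 false alarms for DAIRS, out of 100 hits analyzed; that reading of the flattened table is an assumption, as are the function names.

```python
def precision(correct, false_alarms):
    """Share of returned URLs that are relevant."""
    return correct / (correct + false_alarms)


def recall(correct, undetected):
    """Share of relevant URLs that were returned."""
    return correct / (correct + undetected)


# Assumed reading of the Strain (Bacterial) row.
correct, undetected = 42, 14
google_false_alarms, dairs_false_alarms = 44, 8

# DAIRS returns the 42 correct URLs plus its 8 false alarms.
dairs_precision = precision(correct, dairs_false_alarms)
dairs_recall = recall(correct, undetected)

# The unfiltered Google list contains all relevant URLs (correct + undetected)
# plus the URLs counted as Google false alarms.
relevant = correct + undetected
google_precision = precision(relevant, google_false_alarms)
```

Under that reading, precision rises from 0.56 for the unfiltered list to 0.84 after filtering, while recall falls to 0.75. This matches the deck's summary of fewer false alarms with some loss in detections.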

Location of hits: Usability Measures
CONCEPT | Google Hits Analyzed | Correct Classification | Undetected
Common cold (Disease) | 100 | 44, 45, 49, 62, 83 | 19, 50, 51, 99
Cold Temperature | | 43, 53, 58, 74, 96 | 1, 9, 17, 31, 37, 54, 64, 84, 85
Strain (Muscle) | | 1, 17, 18, 22, 27, 33, 42, 43, 54, 67, 79, 80, 84, 92, 95 | 7, 11, 20, 38, 53, 85

Building a Robust DAIRS
The previous study shows that some sites were not properly classified because they contained too little text.
Next steps
- Build an agent to extract links to the next level of URLs
- Build an agent to parse the next level of URL text and include it in the search results
- Build an agent to OCR the images and extract text
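
The first next step, extracting links to the next level of URLs, could look roughly like the sketch below. It uses only Python's standard `html.parser` and `urllib.parse` modules; the class name, the helper function and the sample HTML are hypothetical, and the slides do not say how the planned DAIRS agent would be built.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the next-level URLs linked from an HTML page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical image-heavy page with little text of its own.
sample = (
    '<p><img src="banner.gif">'
    '<a href="/symptoms.html">Symptoms</a> '
    '<a href="http://example.org/treatment">Treatment</a></p>'
)
links = extract_links(sample, "http://example.com/index.html")
```

The text of each linked page could then be fetched and passed through the same filtering as a top-level result, which is the second next step listed above.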

DAIRS vs. Search Engines
- DAIRS complements search engines to fine-tune target-specific searches
- DAIRS permits creating user-based filters using ontologies
- DAIRS facilitates the creation of user-guided, technology-specific dictionaries
- Our project on DAIRS for nanotechnology will build a mega-dictionary, which will be parsed into components of interest to clients

Commercial Applicability of DAIRS
- For example, monitoring developments in nanotechnology
- The dynamic issues associated with a growing field make it an ideal place to use a DAIRS approach to manage information
Columns: Date | Google URLs | Google News (30 days)
9/6: 1.59 M, 1150
6/6: 712
4/29: 1.42 M, 1390
3/22: 1.3 M, 970

Review: Key issues
- Ontologies can be used to focus the search results
- Significant reduction in false alarms, with some loss in detections
- Discussed strategies for improving DAIRS
- DAIRS complements search engines
- Broad applicability

Questions?
- Can DAIRS be used for composition of web services?
- How to build an ontology? Is it possible to build a good enough representation?
- Single ontology or linked ontologies? Build a single ontology for an organization?
- How difficult is it to build an agent? What is under the hood?
- Why is agent mobility important?