Mining Anchor Text for Query Refinement

Mining Anchor Text for Query Refinement, by Reiner Kraft and Jason Zien (IBM Almaden Research Center). Mark Strohmaier.

Problem Motivation: 23% of search queries are single-term. Expanding the query can lead to more accurate searches. Previous studies indicate that anchor text is statistically similar to search queries. Can this similarity be exploited to improve search queries?

What is anchor text? <a href="this is the website">This is the anchor text</a>. Destination pages can have multiple links pointing to them. Collections of anchor text can give a view of the destination page. Naïve approach: find links whose anchor text is similar to the query, then return the links' destination pages to the user.
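The naïve approach on this slide can be sketched roughly as follows. This is only an illustration built on Python's standard `html.parser`; the names `AnchorCollector`, `build_anchor_index`, and `naive_search` are hypothetical, and "similar to the query" is assumed here to mean that the anchor text contains every query term.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # list of (href, anchor_text)
        self._href = None    # href of the <a> currently open, if any
        self._chunks = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace inside the anchor text.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.pairs.append((self._href, text))
            self._href = None


def build_anchor_index(pages):
    """Map each destination URL to the anchor texts that point at it."""
    index = defaultdict(list)
    for html in pages:
        parser = AnchorCollector()
        parser.feed(html)
        parser.close()
        for href, text in parser.pairs:
            index[href].append(text)
    return index


def naive_search(index, query):
    """Assumed similarity test: some anchor text contains every query term."""
    terms = query.lower().split()
    hits = []
    for href, texts in index.items():
        if any(all(t in text.lower().split() for t in terms) for text in texts):
            hits.append(href)
    return sorted(hits)
```

Because one destination can collect anchor text from many pages, the index maps each URL to a list rather than a single string.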

Problems with the naïve approach: High term frequency is not directly related to page quality. Repeated terms may lead to unnatural queries. IDF is not necessarily relevant. Anchor text may appear multiple times.

Methods of Query Refinement: Weight the number of occurrences. Weight based on the type of anchor text. Number of terms in the anchor text: fewer terms are better. Number of characters in the anchor text: more concise queries are better.

Benefits of the Anchor Text: There is much less anchor text than document text. Pages can have many incoming links. Refined anchor text can capture a degree of site popularity.

Mining Anchor Text: The initial web crawl covered 33 million links on the IBM intranet. Additionally, roughly 350,000 queries were analyzed. Both anchor text and queries showed a similar relationship between length and number of occurrences.
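One way to compare anchor text and queries by length is to compute, for each collection, the fraction of strings having a given number of terms. The sketch below assumes whitespace tokenization; the function name `length_distribution` is hypothetical and is not taken from the presentation.

```python
from collections import Counter


def length_distribution(strings):
    """Return {term_count: fraction of strings with that many terms}."""
    counts = Counter(len(s.split()) for s in strings if s.strip())
    total = sum(counts.values())
    return {n: c / total for n, c in sorted(counts.items())}
```

Running it once over the anchor texts and once over the query log gives two distributions that can be compared side by side.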

Pre-processing Summaries: Query refinement is sensitive to the number of terms. Too few may not lead to much improvement. Too many may lead to overspecialization. Best results were for MAXCOUNT = 3.

Studies Performed: Three different approaches were compared. Anchor: ranked anchor text refinement. Doc.SW: ranked pages based on the most frequently occurring 2- and 3-term phrases. DOC: similar to Doc.SW, but not counting stop words.
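The document-based baselines rely on counting frequent 2- and 3-term phrases. A minimal sketch of that counting step is shown below; the stop-word list is an illustrative placeholder, the name `top_phrases` is hypothetical, and "not counting stop words" is interpreted here as removing stop words before phrases are formed, which is only one possible reading of the slide.

```python
from collections import Counter

# Illustrative stop-word list (an assumption, not from the presentation).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on"}


def top_phrases(text, sizes=(2, 3), drop_stop_words=False, k=5):
    """Return the k most frequent phrases of the given term lengths."""
    words = text.lower().split()
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    counts = Counter()
    for n in sizes:
        # Slide a window of n terms across the document.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(k)
```

Calling it with `drop_stop_words=False` and `drop_stop_words=True` gives the two variants of the phrase count under the stated interpretation.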

Ranking Anchor Texts: The results are ranked based on WCOUNT score, number of terms in the anchor summary, and number of characters in the anchor summary.
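A ranking over these three criteria could look like the sketch below. Several details are assumptions rather than facts from the slides: the criteria are applied in the listed order as tie-breakers, WCOUNT is modelled as a sum of per-occurrence weights, and the anchor types `"external"` and `"internal"` with their weights are purely hypothetical.

```python
from collections import defaultdict


def rank_anchor_summaries(occurrences, type_weights):
    """Rank anchor texts by weighted count, then term count, then length.

    occurrences: iterable of (anchor_text, anchor_type) pairs.
    type_weights: weight per anchor type; unknown types count as 1.0.
    """
    wcount = defaultdict(float)
    for text, kind in occurrences:
        wcount[text] += type_weights.get(kind, 1.0)
    # Higher weighted count first; fewer terms and fewer characters break ties.
    return sorted(wcount, key=lambda t: (-wcount[t], len(t.split()), len(t)))
```

Sorting on a tuple key keeps the preference for shorter, more concise refinements explicit without a hand-tuned combined score.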

Comparison of Methods: The second comparison tested 22 different queries. QUERYLOG processes and dynamically updates user queries based on previous ones, in a manner similar to ANCHOR.

Conclusions: Using anchor text leads to better results than performing similar methods on document collections. A similar approach can be used to refine user search queries as well.

Future Directions: Broadening search queries. Lexical analysis, rather than straight textual analysis. Pre- and post-anchor text.