Mining Anchor Text for Query Refinement
Reiner Kraft and Jason Zien, IBM Almaden Research Center
Mark Strohmaier
Problem Motivation
- 23% of search queries are single-term
- Expanding the query can lead to more accurate searches
- Previous studies indicate that anchor text is statistically similar to search queries
- Can this similarity be exploited to improve search queries?
What is anchor text?
- <a href="this is the website"> This is the anchor text </a>
- Destination pages can have multiple links pointing to them
- Collections of anchor text can give a view of the destination page
- Naïve approach:
  - Find links whose anchor text is similar to the query
  - Return the links' destination pages to the user
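The naïve approach above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `AnchorCollector` class, the `naive_match` helper, the example URLs, and the choice of "every query term appears in the anchor text" as the similarity test are all assumptions made for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the anchor texts that point at each destination URL."""

    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)  # href -> list of anchor texts
        self._href = None                 # href of the <a> currently open
        self._parts = []                  # text fragments inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Normalize whitespace before storing the anchor text.
            text = " ".join("".join(self._parts).split())
            if text:
                self.anchors[self._href].append(text)
            self._href = None


def naive_match(anchors, query):
    """Return destinations with an anchor text containing every query term.

    Whole-term containment is an assumed stand-in for "similar to the query".
    """
    terms = query.lower().split()
    return sorted(
        href
        for href, texts in anchors.items()
        if any(all(t in text.lower().split() for t in terms) for text in texts)
    )


collector = AnchorCollector()
collector.feed(
    '<a href="http://example.com/hr">Human Resources home</a> '
    '<a href="http://example.com/hr">HR benefits</a> '
    '<a href="http://example.com/lab">Research lab</a>'
)
matches = naive_match(collector.anchors, "benefits")
print(matches)
```

Grouping anchor texts by destination is what gives the "view of the destination page" mentioned above: one page accumulates a small collection of short descriptions written by other authors.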
Problems with naïve approach
- High term frequency is not directly related to page quality
- Repeated terms may lead to unnatural queries
- IDF is not necessarily relevant
- Anchor text may appear multiple times
Methods of Query Refinement
- Weighting the number of occurrences
  - Weight based on the type of anchor text
- Number of terms in the anchor text
  - Fewer terms are better
- Number of characters in the anchor text
  - More concise queries are better
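A weighted occurrence count might look like the sketch below. The slides do not say which anchor text types exist or what weights they receive, so the `cross_site` / `same_site` categories and the values in `TYPE_WEIGHTS` are purely hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical link types and weights; the slides give no actual values.
TYPE_WEIGHTS = {"cross_site": 1.0, "same_site": 0.5}


def weighted_counts(occurrences):
    """Sum a per-type weight for every occurrence of an anchor text.

    occurrences: iterable of (anchor_text, link_type) pairs.
    Anchor texts are compared case-insensitively (an assumption).
    """
    totals = defaultdict(float)
    for text, link_type in occurrences:
        totals[text.lower()] += TYPE_WEIGHTS.get(link_type, 0.0)
    return dict(totals)


occurrences = [
    ("HR benefits", "cross_site"),
    ("HR benefits", "same_site"),
    ("hr benefits", "cross_site"),
    ("Benefits", "same_site"),
]
scores = weighted_counts(occurrences)
print(scores)
```

With these placeholder weights, an anchor text repeated many times within one site counts for less than the same text used by links from several different sites.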
Benefits of the Anchor Text
- There is much less anchor text than document text
- Pages can have many incoming links
- Refined anchor text can capture a degree of site popularity
Mining Anchor Text
- Initial web crawl covered 33 million links on the IBM intranet
- Additionally, roughly 350,000 queries were analyzed
- Both categories showed a similar relationship between length and number of occurrences
Pre-processing Summaries
- Query refinement is sensitive to the number of terms
  - Too few may not lead to much improvement
  - Too many may lead to overspecialization
- Best results were for MAXCOUNT = 3
Studies Performed
- Three different approaches were compared:
  - ANCHOR: ranked anchor text refinement
  - Doc.SW: ranked pages based on the most frequently occurring 2- and 3-term phrases
  - DOC: similar to Doc.SW, but not counting stop words
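The phrase counting behind the two document-based baselines can be illustrated roughly as follows. This is a sketch under assumptions: the `phrase_counts` function, the whitespace tokenizer, and the tiny `STOP_WORDS` set are invented for the example, and "not counting stop words" is read here as removing stop words before phrases are formed.

```python
from collections import Counter

# Tiny illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}


def phrase_counts(text, skip_stop_words=False):
    """Count every 2-term and 3-term phrase in a text.

    skip_stop_words=False approximates Doc.SW (stop words kept);
    skip_stop_words=True approximates DOC (stop words dropped first).
    """
    tokens = text.lower().split()
    if skip_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter()
    for n in (2, 3):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


text = "the quality of service and the quality of data"
doc_sw = phrase_counts(text)
doc = phrase_counts(text, skip_stop_words=True)
print(doc_sw.most_common(3))
print(doc.most_common(3))
```

With stop words kept, the most frequent phrases are fragments such as "the quality of"; with stop words dropped, the candidates are made only of content terms, such as "quality service".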
Ranking Anchor Texts
- The results are ranked based on:
  - WCOUNT score
  - Number of terms in the anchor summary
  - Number of characters in the anchor summary
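One way to combine the three ranking criteria is a single multi-key sort, sketched below. The slides list the criteria but not how they are combined, so the order of precedence (WCOUNT first, then term count, then character count) and the preference for shorter summaries as tie-breakers are assumptions; `rank_summaries` and the sample data are hypothetical.

```python
def rank_summaries(summaries):
    """Rank (anchor_summary, wcount) pairs.

    Assumed ordering: higher WCOUNT first, then fewer terms,
    then fewer characters.
    """
    return sorted(
        summaries,
        key=lambda s: (-s[1], len(s[0].split()), len(s[0])),
    )


candidates = [
    ("human resources benefits", 4.0),
    ("hr benefits", 4.0),
    ("benefits", 2.5),
    ("hr plans", 4.0),
]
ranked = rank_summaries(candidates)
for text, wcount in ranked:
    print(f"{wcount:4.1f}  {text}")
```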
Comparison of Methods
- Second comparison tested 22 different queries
- QUERYLOG processes and dynamically updates user queries based on previous ones, in a manner similar to ANCHOR
Conclusions
- Using anchor text leads to better results than performing similar methods on document collections
- A similar approach can be used to refine user search queries as well
Future Directions
- Broadening search queries
- Lexical analysis, rather than straight textual matching
- Pre- and post-anchor text