Mining Query Subtopics from Search Log Data Hu et al., SIGIR’12 Presented by Baichuan Li
Outline Problem Two observations Approach Results Applications Conclusion
Problem Intention != Query Many queries are ambiguous or multifaceted
Problem
Problem Xbox Manchester Homepage? Online game? Where to buy? Manchester news? Manchester tourism? Manchester weather? Manchester united?
Solution Identifying the major subtopics of a query from the search log data Personalized search Query suggestion Search result presentation Clustering Re-ranking Diversification
Observations from The Logs One subtopic per search (OSS) If a user clicks multiple URLs after submitting a query, then the clicked URLs tend to represent the same subtopic
Observations from The Logs Subtopic clarification by additional keyword (SCAK) Users often add additional keywords (in most cases, one additional keyword) to a query to expand the query in order to clarify its subtopic The URLs clicked after searching both with the original and the expanded queries tend to represent the same subtopic The key word tends to be indicative of the subtopic E.g. harry shum microsoft
One Subtopic per Search Use it as the rule for subtopic identification 10,000 groups of multiclicks of individual queries The rule is accurate when all the URLs within the multi-clicks are about the same sense or face
Accuracy V.S. Frequency
Accuracy v.s. Click Position
Distribution The queries with higher frequencies in search log data are more likely to have multi-clicks
Subtopic Clarification by Additional Keyword The keywords ‘microsoft’ and ‘jr’ can be used to represent the two groups (subtopics) respectively
Query Types ‘Q’: the query is a single phrase ‘Q + W’: ‘Q’ + a keyword ‘W + Q’: a keyword + ‘Q’ ‘Others’
Subtopic Overlap and URL Overlap Randomly selected 500 pairs of queries with the forms ‘Q’ and ‘Q + W’ If subtopics of an expanded query are contained in subtopics of the original query, then there is ‘subtopic overlap’ Check whether two queries share identical clicked URLs (‘URL overlap’)
Distribution The more popular (frequent) a query is, the more likely the rule is applicable
Mining Subtopics
Preprocessing
Clustering Similarity function The similarity function between URLs ui and uj : S1: based on the OSS phenomenon where and denote the vectors of multi-clicks of ui and uj respectively S2: based on the SCAK phenomenon S3: based on string similarities
SCAK Similarity (S2) where and denote the vectors of keywords associated with ui and uj respectively
Data TREC Data: TREC search result diversification track 3 in 2009 DataSetA and DataSetB: queries and URLs randomly sampled from the logs of the commercial search engine
Results
Application: Search Result Reranking The user is first asked which subtopic she is interested in, with the subtopics shown at the top of the results page When the user selects a subtopic, the URLs belonging to the subtopic will be moved to the top (re-ranked) The relative order between URLs inside and outside of the subtopic will be kept
Example
Evaluation Data Results Collect search log data of 20, 000 randomly selected searches Each query has at least two subtopics mined by the method Results The average position of last clicked URLs is 3.41 Assume the cost for the user to check the subtopics and click one of them is 1.0 The average position of last clicked URLs belonging to the same subtopics is 1.80
Conclusion Study the problem of query subtopic mining Discovered two phenomena of user search behavior one subtopic per search Subtopic clarification by additional keyword Design a novel similarity function Applications on search result reranking (and search result clustering) Problem Can only be employed when there is enough search log data