Query Suggestion
Naama Kraus
Slides are based on the papers:
Baeza-Yates, Hurtado, Mendoza, Improving search engines by query clustering
Boldi, Bonchi, Castillo, Donato, Vigna, The Query Flow Graph: Model and Applications
The Problem
User queries are an imperfect description of their information needs
Examples:
– Ambiguous queries: jaguar
– General queries: haifa
– Terminology differences (synonyms) between user and corpus: stars - planets
Query Suggestions
Assist the user in phrasing her information need
Example: jaguar
– Jaguar car
– Jaguar xf
– Jaguar animal
– Jaguar cat
Example: Google Related Searches
Query suggestion algorithms
Query suggestions are extracted from the query log
– Methods that use other data sources, such as a corpus, are not covered today
Topic (cluster) based: identify groups of similar queries
Sequence based: mine and analyze the query log for likely query sequences
Improving Search Engines by Query Clustering - Baeza-Yates et al.
Algorithm outline
Offline:
– Represent queries as term-weighted vectors
– Cluster queries
– Rank queries in each cluster
Online:
– Given the user’s query q
– Find the cluster C containing q
– Suggest the top k queries in cluster C, based on their rank and similarity to q
Query Model
Given a query q
Let U be the set of URLs clicked for q (over all users and sessions)
– This information is extracted from the query log
q’s term-weighted vector has a nonzero entry for every term that appears in some URL in U
Terms are weighted according to
– Term frequency and URL popularity
– Formula in next slide …
Query Model (2)
Pop(u,q): the number of clicks on u for the query q
Note: the paper proposes a refinement of Pop(u,q) that is not biased by the search engine’s ranking
Query similarity is computed by some measure, e.g. cosine similarity
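A minimal sketch of the query model, for illustration only. It assumes that the weight of a term is the sum, over the clicked URLs u, of Pop(u,q) times the term's frequency in u divided by the largest term frequency in u. That exact form is an assumption consistent with "term frequency and URL popularity", not a formula taken from these slides. The names query_vector, cosine, clicks and url_terms are hypothetical.

```python
import math
from collections import Counter, defaultdict


def query_vector(clicks, url_terms):
    """Build a term-weighted vector for one query.

    clicks:    {url: number of clicks on url for this query}, i.e. Pop(u,q)
    url_terms: {url: list of terms in the document at that url}
    The weighting below is an assumed form, see the text above.
    """
    vec = defaultdict(float)
    for url, pop in clicks.items():
        tf = Counter(url_terms.get(url, []))
        if not tf:
            continue  # no known terms for this URL
        max_tf = max(tf.values())
        for term, count in tf.items():
            vec[term] += pop * count / max_tf
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two queries that lead to clicks on documents with overlapping vocabulary get a similarity above 0 even when the query strings share no words, which is the point of describing queries by their clicked URLs.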
Query Support
The fraction of the documents returned by the query that captured the attention of users (clicked documents)
Denotes how ‘good’ a query is
– A ‘global score’
Queries within a cluster are ranked according to their similarity to q as well as their support
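A rough sketch of support and of ranking inside a cluster. The slide does not say how similarity and support are combined, so the product used here is purely an assumption. The names support, suggest, similarity_to_q and support_of are hypothetical.

```python
def support(num_returned, num_clicked):
    """Fraction of the returned documents that were clicked."""
    if num_returned == 0:
        return 0.0
    return num_clicked / num_returned


def suggest(q, cluster, similarity_to_q, support_of, k=5):
    """Return up to k queries from q's cluster, best first.

    similarity_to_q: {query: similarity to q}
    support_of:      {query: support}
    Assumption: the combined score is similarity * support.
    """
    candidates = [c for c in cluster if c != q]
    candidates.sort(
        key=lambda c: similarity_to_q[c] * support_of[c],
        reverse=True,
    )
    return candidates[:k]
```

With a product, a query needs both a high similarity to q and a high support to rank well; a weighted sum would be an equally plausible choice.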
Query Flow Graph - Boldi et al.
Main idea:
Aggregate the (massive) raw data in the query log
– Many queries of many users
Model user query behavior
Use graph techniques, such as random walks, to infer query relatedness
Query Flow Graph Model
G=(V, E, w) is a directed graph where:
V: the set of nodes, one for each distinct query in Q
– Queries are extracted from the query log
E: a set of directed edges
– Two queries q, q’ are connected by an edge (q,q’) if q’ follows q in at least one session
QFG Illustration
[Figure: example graph over nodes q0, q1, q2, q3, q4, q5. Nodes are queries; edges connect queries. Example queries: apple ipod, apple store]
Weighting Function
w : E -> (0..1] is a weighting function that assigns a weight to every edge (q,q’)
For each edge (q,q’), the weight is the probability that q’ follows q in the same session
– Extracted from the observed query log sessions
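A small sketch of how such a graph could be built from sessions. It assumes that "follows" means immediately follows, and that the weight of (q,q’) is the relative frequency of q’ among the queries observed right after q. Both are simplifying assumptions; the slides only say that the probability is extracted from the observed sessions. The name build_qfg is hypothetical.

```python
from collections import defaultdict


def build_qfg(sessions):
    """Build a weighted query flow graph.

    sessions: list of sessions, each a list of query strings in order.
    Returns {q: {q_next: weight}} where the weights out of q sum to 1.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for q, q_next in zip(session, session[1:]):
            if q != q_next:  # assumption: ignore immediate repetitions
                counts[q][q_next] += 1

    graph = {}
    for q, followers in counts.items():
        total = sum(followers.values())
        graph[q] = {q_next: c / total for q_next, c in followers.items()}
    return graph
```

Every stored edge was seen at least once, so each weight falls in (0, 1], matching the range of w.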
Illustration
[Figure: graph over query nodes q0, q1, q2, q4; the remaining labels are not recoverable]
Random walk on the QFG
A random surfer executes a random walk on the graph as follows:
– Start at some node
– Move along an edge with probability d
  Choose the edge according to its probability (weight)
– Or teleport to a random node with probability 1-d
  Choose the node uniformly
The stationary distribution
– The probability of being at node q in the limit, as the number of steps goes to infinity
Random walk score vector: query absolute scores
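The walk above can be approximated by power iteration. This is only a sketch: the default d = 0.85, the fixed number of iterations, and the uniform handling of nodes without outgoing edges are assumptions, and the outgoing weights of every node are assumed to sum to 1. The name random_walk_scores is hypothetical.

```python
def random_walk_scores(graph, d=0.85, iterations=100):
    """Approximate the stationary distribution of the random walk.

    graph: {q: {q_next: weight}}, outgoing weights of each q sum to 1.
    Returns {q: absolute score}; the scores sum to 1.
    """
    nodes = set(graph)
    for followers in graph.values():
        nodes.update(followers)
    nodes = sorted(nodes)
    n = len(nodes)

    scores = {q: 1.0 / n for q in nodes}
    for _ in range(iterations):
        # Teleport: with probability 1-d, land on a uniformly chosen node.
        nxt = {q: (1.0 - d) / n for q in nodes}
        for q in nodes:
            out = graph.get(q, {})
            if out:
                # Follow an edge with probability d, chosen by its weight.
                for q_next, w in out.items():
                    nxt[q_next] += d * scores[q] * w
            else:
                # Assumption: a node without out-edges jumps uniformly.
                for q_next in nodes:
                    nxt[q_next] += d * scores[q] / n
        scores = nxt
    return scores
```

A node that many walks pass through ends up with a high absolute score regardless of where the walk started.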
Random Walk Relative to a Node
Random walk with restart to a single node:
– Start at node q
– Instead of teleporting to any node, always teleport to q
The score of node q’ for this random walk measures the relatedness of q’ to q
– The probability of being at q’ in the limit, for a walk that starts and restarts at q
– Can normalize a node’s relative score by its absolute score; somewhat similar to tf×idf
– This avoids suggesting highly popular queries that are unrelated to q
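A sketch of the random walk with restart, again by power iteration. The default d, the iteration count, and sending a walker stuck on a node without out-edges back to q are assumptions. The names relative_scores and top_k_related are hypothetical, and the optional absolute argument stands for a precomputed absolute score per node.

```python
def relative_scores(graph, start, d=0.85, iterations=100):
    """Scores of all nodes for a walk that always teleports to `start`.

    graph: {q: {q_next: weight}}, outgoing weights of each q sum to 1.
    """
    nodes = set(graph) | {start}
    for followers in graph.values():
        nodes.update(followers)
    nodes = sorted(nodes)

    scores = {q: 0.0 for q in nodes}
    scores[start] = 1.0
    for _ in range(iterations):
        nxt = {q: 0.0 for q in nodes}
        nxt[start] += 1.0 - d  # every teleport goes back to start
        for q in nodes:
            out = graph.get(q, {})
            if out:
                for q_next, w in out.items():
                    nxt[q_next] += d * scores[q] * w
            else:
                # Assumption: a node without out-edges returns to start.
                nxt[start] += d * scores[q]
        scores = nxt
    return scores


def top_k_related(graph, start, k=3, absolute=None):
    """Rank the other nodes by relative score, optionally normalized
    by an absolute score to demote globally popular queries."""
    rel = relative_scores(graph, start)
    if absolute is not None:
        rel = {q: s / absolute[q] for q, s in rel.items()}
    ranked = sorted((q for q in rel if q != start), key=rel.get, reverse=True)
    return ranked[:k]
```

Queries that cannot be reached from q keep a relative score of 0, so they never outrank reachable ones.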
The Full Picture
Off-line stage
– For each node q in the graph
  Compute the stationary distribution vector of the random walk relative to q
  – A random walk score relative to q
  Store suggestions for q, alternatives:
  – the top k scored nodes
  – nodes having a score above some threshold
On-line stage
– User submits query q
– Suggest the queries stored for q
  The queries most related to q
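The two stages can be sketched as a precomputed table plus a lookup. Here score_fn is a hypothetical callable that returns the relative scores {q’: score} for a query q; how it is computed is left open. Both storage alternatives, top k and threshold, are supported, and the function names are hypothetical as well.

```python
def build_suggestion_table(queries, score_fn, k=None, threshold=None):
    """Off-line stage: store the suggestions for every query.

    score_fn(q) must return {q_other: relative score} for query q.
    Keep nodes scoring above `threshold` and/or the top `k` nodes.
    """
    table = {}
    for q in queries:
        scores = {q2: s for q2, s in score_fn(q).items() if q2 != q}
        ranked = sorted(scores, key=scores.get, reverse=True)
        if threshold is not None:
            ranked = [q2 for q2 in ranked if scores[q2] > threshold]
        if k is not None:
            ranked = ranked[:k]
        table[q] = ranked
    return table


def suggest_online(table, q):
    """On-line stage: return the stored suggestions for q, if any."""
    return table.get(q, [])
```

The on-line stage is then a single dictionary lookup, which is why the expensive random walks are pushed to the off-line stage.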