Published by Ashley Day. Modified over 9 years ago.
MINING RELATED QUERIES FROM SEARCH ENGINE QUERY LOGS
Xiaodong Shi and Christopher C. Yang

Definitions:
Query Record: A query record represents the submission of one query from a user to the search engine at a certain time.
Query Transaction: A query transaction is a search process 1) whose search interest focuses on the same topic or strongly related topics, 2) that falls within a bounded and consecutive period, and 3) that is issued by the same user. It is represented as a series of query records in temporal order.
User Session: A user session contains the history of all query records that belong to the same user in a given period. It can also be represented as a series of query records in temporal order.

Dynamic Sliding Window Segmentation Algorithm. The complexity of this algorithm is O(n). In our experiments we empirically set α, β, γ, and θ to 5 minutes, 24 hours, 60 minutes, and 0.4, respectively.

Overview:
Web search engines have become the most popular way to find information relevant to a topic on the web. However, search engine users often have difficulty organizing and expressing their information needs as simple queries. Finding related queries can help with giving search query suggestions, query expansion, and indexing/caching optimization. We propose to segment user query sessions into query transactions, in which queries are considered related, and then to find statistically associated queries using a modified association rule mining model.

Levenshtein Distance Similarity:
Search engine users often reformulate their input queries by adding, deleting, or changing some words of the original query string. Hence we use Levenshtein distance, a special type of edit distance, to measure the degree of matching between query strings. It defines a set of edit operations, such as insertion or deletion of a word, together with a cost for each operation.
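The word-level edit operations described here can be illustrated with a minimal sketch. It assumes a unit cost for each word insertion, deletion, or substitution and whitespace tokenization; the function name `word_levenshtein` is hypothetical, and none of these choices is confirmed as the authors' exact configuration.

```python
def word_levenshtein(a: str, b: str) -> int:
    """Cheapest number of word-level edits turning query `a` into query `b`.

    Assumption: every insertion, deletion, or substitution of one word
    costs 1, and words are separated by whitespace.
    """
    x, y = a.split(), b.split()
    # prev[j] = cost of turning the first i-1 words of x into the first j words of y
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]  # turning i words into zero words takes i deletions
        for j, wy in enumerate(y, start=1):
            substitution = 0 if wx == wy else 1
            curr.append(min(
                prev[j] + 1,                # delete wx
                curr[j - 1] + 1,            # insert wy
                prev[j - 1] + substitution  # keep or substitute
            ))
        prev = curr
    return prev[-1]


print(word_levenshtein("adobe photoshop", "photoshop"))  # 1
```

Only two rows of the dynamic-programming table are kept at a time, so memory grows with the length of the shorter dimension rather than with the full table.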
The distance between two query strings is then defined as the sum of the costs in the cheapest chain of edit operations that transforms one query string into the other. The Levenshtein Distance Similarity between two query strings is:

$$\mathrm{LDS}(q_i, q_k) = 1 - \frac{\mathrm{LD}(q_i, q_k)}{\max\big(wn(q_i),\, wn(q_k)\big)}$$

where wn(·) is the number of words (or characters in Chinese) in a query. Example: the Levenshtein Distance between “adobe photoshop” and “photoshop” is 1 and the longer query has 2 words, so their Levenshtein Distance Similarity is 1 − 1/2 = 0.5.

Experiments:
The temporal correlation model proposed by Chien & Immorlica is selected as the baseline. Our proposed technique is decomposed into two models, each tested separately against rival models:
Dynamic Sliding Window Segmentation Algorithm (DSW SA).
Association Rule Mining Model with Levenshtein Distance Similarity (ARM_LDS).

Figure: The precision rates of our experiment results at different levels of selected top K queries.

Segmentation Algorithm:
Our model is based on the traditional association rule mining model. The quality of segmenting user sessions into query transactions is critical for mining association rules of related queries. A dynamic sliding window segmentation algorithm is proposed, which adopts three time interval constraints:
the maximum interval allowed between adjacent query records in the same query transaction (α);
the maximum period during which the user is allowed to be inactive (β);
the maximum length of the time window that a query transaction is allowed to span (γ);
with α ≤ γ ≤ β. It also sets a lower bound θ on the Levenshtein distance similarity between adjacent queries to determine the borders of query transactions.

Mining Related Queries (continued):
Assuming the input query is q_i, we calculate the support factor q_i ⇒ q_k | s and the confidence factor q_i ⇒ q_k | c of every hypothesized association rule q_i ⇒ q_k (q_k ∈ Q, i ≠ k). We first set a threshold min_support on the support factors to filter out weak association rules.
Next we rank the remaining association rules by their confidence factors. Finally we select the top K rules and extract the related queries.

Figure: A sample of how to segment a user session into query transactions. The procedure resembles a decision tree with four decision factors: α, β, γ, and θ.

Mining Related Queries:
Our model is a modified-confidence version of traditional association rule mining. Given the set of queries Q = {q_1, q_2, …, q_n}, an association rule is redefined as an implication q_i ⇒ q_k, where q_i ∈ Q, q_k ∈ Q and i ≠ k. Mining related queries then reduces to finding statistically strong associations between the input query q_i and any other query q_k:
Support: q_i ⇒ q_k has a support factor of s if s% of the transactions in T contain both q_i and q_k, notated as q_i ⇒ q_k | s.
Raw Confidence: the raw confidence factor of q_i ⇒ q_k is rc if rc% of the transactions in T′ contain q_k, where T′ is the set of all transactions in T that contain q_i; it is notated as q_i ⇒ q_k | rc.
Confidence: the raw confidence factor is combined with the Levenshtein distance similarity between q_i and q_k to obtain the confidence factor c, notated as q_i ⇒ q_k | c.

Figure: A sample showing how our proposed technique (ARM_LDS) promotes highly related queries in the ranking list without penalizing other related queries. The numbers in brackets are the confidence factors (or Levenshtein Distance Similarities for LDS).
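The support and raw confidence definitions, together with the filter-then-rank procedure, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `related_queries` and its parameters are hypothetical, transactions are treated as sets of query strings, and factors are returned as fractions rather than percentages. The sketch ranks by raw confidence only, because the exact way the poster combines raw confidence with the Levenshtein distance similarity is not reproduced in this text.

```python
from collections import Counter


def related_queries(transactions, query, min_support=0.0, top_k=5):
    """Return up to top_k tuples (other_query, support, raw_confidence).

    Assumptions (not from the source): fractions instead of percentages,
    ties broken alphabetically, ranking by raw confidence only.
    """
    sets = [set(t) for t in transactions]             # T
    containing = [s for s in sets if query in s]      # T': transactions with q_i
    if not containing:
        return []
    # Count how many transactions in T' also contain each other query q_k.
    co_counts = Counter(q for s in containing for q in s if q != query)
    rules = []
    for other, count in co_counts.items():
        support = count / len(sets)                   # share of T with both queries
        raw_conf = count / len(containing)            # share of T' with q_k
        if support >= min_support:                    # drop weak rules first
            rules.append((other, support, raw_conf))
    rules.sort(key=lambda r: (-r[2], r[0]))           # then rank by confidence
    return rules[:top_k]


sample = [
    ["photoshop", "adobe photoshop"],
    ["photoshop", "adobe photoshop", "gimp"],
    ["photoshop", "adobe photoshop"],
    ["photoshop", "lightroom"],
    ["weather"],
]
for rule in related_queries(sample, "photoshop", min_support=0.4):
    print(rule)
```

Filtering on support before ranking mirrors the order of the steps in the text: rare co-occurrences are discarded first, so a query pair seen in a single transaction cannot top the list on confidence alone.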