MINING RELATED QUERIES FROM SEARCH ENGINE QUERY LOGS
Xiaodong Shi and Christopher C. Yang

Definitions:
Query Record: A query record represents the submission of one query from a user to the search engine at a certain time.
Query Transaction: A query transaction is a search process 1) whose search interest focuses on the same topic or strongly related topics, 2) that takes place in a bounded and consecutive period, and 3) that is issued by the same user. It is represented as a series of query records in temporal order.
User Session: A user session contains the history of all query records that belong to the same user in a given period. It can also be represented as a series of query records in temporal order.

[Figure: Dynamic Sliding Window Segmentation Algorithm.] The complexity of this algorithm is O(n). We empirically set the values of α, β, γ, and θ to 5 minutes, 24 hours, 60 minutes, and 0.4, respectively, in our experiments.

Overview: Web search engines have become the most popular solution for finding information relevant to a topic on the web. However, search engine users often have difficulty organizing and expressing their information needs as simple queries. Finding related queries can help with: giving search query suggestions; query expansion; indexing/caching optimization. We propose to segment user query sessions into query transactions, in which queries are considered related, and then to find statistically associated queries using a modified association rule mining model.

Levenshtein Distance Similarity: Search engine users often reformulate their input queries by adding, deleting, or changing some words of the original query string. Hence we use Levenshtein distance, a special type of edit distance, to measure the degree of matching between query strings. It defines a set of edit operations, such as insertion or deletion of a word, together with a cost for each operation.
The distance between two query strings is then defined as the sum of the costs in the cheapest chain of edit operations transforming one query string into the other. The Levenshtein Distance Similarity between two query strings is: [formula shown on the slide] where wn(.) is the number of words (or characters in Chinese) in a query. Example: the Levenshtein Distance between “adobe photoshop” and “photoshop” is 1 and their Levenshtein Distance Similarity is 0.5.

Experiments: The temporal correlation model, proposed by Chien & Immorlica, is selected as the baseline. Our proposed technique is decomposed into two models, each tested separately against rival models: 1) the Dynamic Sliding Window Segmentation Algorithm (DSW_SA); 2) the Association Rule Mining Model with Levenshtein Distance Similarity (ARM_LDS).

[Figure: The precision rates of our experiment results at different levels of selected top K queries.]

Segmentation Algorithm: Our model is based on the traditional association rule mining model. The quality of segmenting user sessions into query transactions is critical for mining association rules of related queries. A dynamic sliding window segmentation algorithm is proposed, which adopts three time interval constraints: 1) the maximum interval length allowed between adjacent query records in the same query transaction (α); 2) the maximum interval length of the period during which the user is allowed to be inactive (β); 3) the maximum length of the time window that the query transaction is allowed to span (γ), with α ≤ γ ≤ β. It also sets a lower bound, θ, on the Levenshtein distance similarity between adjacent queries to determine the borders of query transactions.

Mining Related Queries (continued): Assuming the input query is q_i, we calculate the support factor q_i ⇒ q_k | s and the confidence factor q_i ⇒ q_k | c of every hypothesized association rule q_i ⇒ q_k (q_k ∈ Q, i ≠ k). We first set a threshold min_support on the support factors to filter out weak association rules.
Next we rank the list of association rules according to their confidence factors. Finally we select the top K rules and extract the related queries.

[Figure: A sample of how to segment a user session into query transactions.] The segmentation algorithm resembles a decision tree with four decision factors: α, β, γ, and θ.

Mining Related Queries: Our model is a modified-confidence version of the traditional approach to mining association rules in data mining. Given the set of queries Q = {q_1, q_2, …, q_n}, the association rule is redefined as an implication q_i ⇒ q_k, where q_i ∈ Q, q_k ∈ Q, and i ≠ k. Mining related queries is simplified to finding the statistically strong associations between the input query q_i and any other query q_k:
Support: q_i ⇒ q_k has a support factor of s if s% of the transactions in T contain both {q_i} and {q_k}, notated as q_i ⇒ q_k | s.
Raw Confidence: the raw confidence factor of q_i ⇒ q_k is rc if rc% of the transactions in T’ contain {q_k}, where T’ is the set of all transactions in T that contain {q_i}; it is notated as q_i ⇒ q_k | rc.
Confidence: the raw confidence factor is combined with the Levenshtein distance similarity between q_i and q_k to get the confidence factor: [formula shown on the slide]

[Figure: A sample showing how our proposed technique (ARM_LDS) promotes the highly related queries in the ranking list without penalizing other related queries. The numbers in brackets indicate the confidence factors (or Levenshtein Distance Similarities for LDS).]
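
The word-level Levenshtein distance described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: queries are split on whitespace (so character-level handling of Chinese is not covered), every insertion, deletion, or substitution of a word costs 1, and the similarity is normalized as 1 - LD / max(wn(q1), wn(q2)). That normalization is an assumption chosen because it reproduces the worked example (distance 1, similarity 0.5). The function names are hypothetical.

```python
def word_levenshtein(q1: str, q2: str) -> int:
    """Edit distance between two queries, counted in whole words.

    Assumption: unit cost for inserting, deleting, or replacing a word.
    """
    a, b = q1.split(), q2.split()
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            substitution = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,                    # delete word_a
                            curr[j - 1] + 1,                # insert word_b
                            prev[j - 1] + substitution))    # keep or replace
        prev = curr
    return prev[-1]


def levenshtein_similarity(q1: str, q2: str) -> float:
    """Assumed normalization: 1 - LD / max(wn(q1), wn(q2))."""
    longest = max(len(q1.split()), len(q2.split()))
    if longest == 0:
        return 1.0  # two empty queries are treated as identical
    return 1.0 - word_levenshtein(q1, q2) / longest


print(word_levenshtein("adobe photoshop", "photoshop"))        # 1
print(levenshtein_similarity("adobe photoshop", "photoshop"))  # 0.5
```

Working at the word level means a reformulation such as dropping "adobe" counts as a single edit, regardless of how many characters the word has.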
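
The segmentation step can be illustrated with a single pass over a user session. The exact decision order is given in the slide's figure, which is not reproduced in this transcript, so the rule below is only one plausible reading of the four factors: a record starts a new transaction if the gap to the previous record exceeds β or the transaction would span more than γ; it joins the current transaction if the gap is at most α; otherwise it joins only if its similarity to the previous query is at least θ. The names `segment_session` and `word_overlap` are hypothetical, and `word_overlap` is a simple stand-in for the Levenshtein distance similarity.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

QueryRecord = Tuple[datetime, str]  # (submission time, query string)


def word_overlap(q1: str, q2: str) -> float:
    """Stand-in similarity (shared words / all words), not the slide's measure."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b) if a | b else 1.0


def segment_session(
    records: List[QueryRecord],
    similarity: Callable[[str, str], float] = word_overlap,
    alpha: timedelta = timedelta(minutes=5),
    beta: timedelta = timedelta(hours=24),
    gamma: timedelta = timedelta(minutes=60),
    theta: float = 0.4,
) -> List[List[QueryRecord]]:
    """Split a time-ordered user session into query transactions in O(n)."""
    transactions: List[List[QueryRecord]] = []
    current: List[QueryRecord] = []
    for record in records:
        if not current:
            current = [record]
            continue
        gap = record[0] - current[-1][0]   # time since the previous record
        span = record[0] - current[0][0]   # window the transaction would cover
        if gap > beta or span > gamma:
            same = False                   # user inactive or window too long
        elif gap <= alpha:
            same = True                    # records are close in time
        else:
            same = similarity(current[-1][1], record[1]) >= theta
        if same:
            current.append(record)
        else:
            transactions.append(current)
            current = [record]
    if current:
        transactions.append(current)
    return transactions
```

Each record is examined once, which is consistent with the O(n) complexity stated for the algorithm. The defaults mirror the empirical settings of 5 minutes, 24 hours, 60 minutes, and 0.4.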
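
The mining procedure (compute support, filter by min_support, rank, take the top K) can be sketched as follows. This is a minimal illustration, assuming each query transaction is reduced to a set of query strings and that support and raw confidence are expressed as fractions rather than percentages. Because the formula that combines raw confidence with the Levenshtein distance similarity is not reproduced in this transcript, the sketch ranks by raw confidence only; the function name and parameter defaults are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def related_queries(
    transactions: Iterable[Iterable[str]],
    q_i: str,
    min_support: float = 0.01,
    top_k: int = 5,
) -> List[Tuple[str, float, float]]:
    """Return up to top_k tuples (q_k, support, raw_confidence) for q_i => q_k."""
    all_transactions = [set(t) for t in transactions]            # T
    containing_qi = [t for t in all_transactions if q_i in t]    # T'
    if not containing_qi:
        return []

    # Number of transactions in which q_k occurs together with q_i.
    together = Counter(q for t in containing_qi for q in t if q != q_i)

    rules = []
    for q_k, count in together.items():
        support = count / len(all_transactions)   # share of T with both queries
        raw_confidence = count / len(containing_qi)  # share of T' with q_k
        if support >= min_support:                # drop weak association rules
            rules.append((q_k, support, raw_confidence))

    # Rank by raw confidence (highest first); ties are broken alphabetically.
    rules.sort(key=lambda rule: (-rule[2], rule[0]))
    return rules[:top_k]


T = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"a", "b"}]
print(related_queries(T, "a", min_support=0.3))
# [('b', 0.6, 0.75), ('c', 0.4, 0.5)]
```

To follow the modified-confidence model, the sort key would use the combined confidence factor in place of the raw confidence.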