Search Query Log Analysis Kristina Lerman




Similar presentations
Automatic Timeline Generation from News Articles Josh Taylor and Jessica Jenkins.

Chapter 5: Introduction to Information Retrieval
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
UCLA : GSE&IS : Department of Information Studies : JF : 276lec1.ppt : 5/2/2015 : 1 INFS INFORMATION RETRIEVAL SYSTEMS Week.
Problem Semi supervised sarcasm identification using SASI
WSCD INTRODUCTION  Query suggestion has often been described as the process of making a user query resemble more closely the documents it is expected.
Searchable Web sites Recommendation Date : 2012/2/20 Source : WSDM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh Jia-ling 1.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
1 Learning User Interaction Models for Predicting Web Search Result Preferences Eugene Agichtein Eric Brill Susan Dumais Robert Ragno Microsoft Research.
Context-aware Query Suggestion by Mining Click-through and Session Data Authors: H. Cao et.al KDD 08 Presented by Shize Su 1.
Chapter 12: Web Usage Mining - An introduction
1 Learning to Detect Objects in Images via a Sparse, Part-Based Representation S. Agarwal, A. Awan and D. Roth IEEE Transactions on Pattern Analysis and.
Building an Intelligent Web: Theory and Practice Pawan Lingras Saint Mary’s University Rajendra Akerkar American University of Armenia and SIBER, India.
LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002.
Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang National Central University
Presented by Zeehasham Rasheed
J. Chen, O. R. Zaiane and R. Goebel An Unsupervised Approach to Cluster Web Search Results based on Word Sense Communities.
CONTENT-BASED BOOK RECOMMENDING USING LEARNING FOR TEXT CATEGORIZATION TRIVIKRAM BHAT UNIVERSITY OF TEXAS AT ARLINGTON DATA MINING CSE6362 BASED ON PAPER.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Finding Advertising Keywords on Web Pages Scott Wen-tau Yih, Joshua Goodman Microsoft Research; Vitor R. Carvalho Carnegie Mellon University.
Query Log Analysis Naama Kraus Slides are based on the papers: Andrei Broder, A taxonomy of web search Ricardo Baeza-Yates, Graphs from Search Engine Queries.
Databases & Data Warehouses Chapter 3 Database Processing.
1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.
FALL 2012 DSCI5240 Graduate Presentation By Xxxxxxx.
Slide Image Retrieval: A Preliminary Study Guo Min Liew and Min-Yen Kan National University of Singapore Web IR / NLP Group (WING)
From Devices to People: Attribution of Search Activity in Multi-User Settings Ryen White, Ahmed Hassan, Adish Singla, Eric Horvitz Microsoft Research,
Page 1 WEB MINING by NINI P SURESH PROJECT CO-ORDINATOR Kavitha Murugeshan.
Automatically Identifying Localizable Queries Center for E-Business Technology Seoul National University Seoul, Korea Nam, Kwang-hyun Intelligent Database.
Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007.
Improved search for Socially Annotated Data Authors: Nikos Sarkas, Gautam Das, Nick Koudas Presented by: Amanda Cohen Mostafavi.
Machine Learning in Spoken Language Processing Lecture 21 Spoken Language Processing Prof. Andrew Rosenberg.
Knowledge Discovery and Data Mining Evgueni Smirnov.
Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
Understanding and Predicting Personal Navigation Date : 2012/4/16 Source : WSDM 11 Speaker : Chiu, I- Chih Advisor : Dr. Koh Jia-ling 1.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
The Internet 8th Edition Tutorial 4 Searching the Web.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.
Chapter 12: Web Usage Mining - An introduction Chapter written by Bamshad Mobasher Many slides are from a tutorial given by B. Berendt, B. Mobasher, M.
Collaborative Information Retrieval - Collaborative Filtering systems - Recommender systems - Information Filtering Why do we need CIR? - IR system augmentation.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
4 1 SEARCHING THE WEB Using Search Engines and Directories Effectively New Perspectives on THE INTERNET.
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
CSKGOI'08 Commonsense Knowledge and Goal Oriented Interfaces.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
Authors: Marius Pasca and Benjamin Van Durme Presented by Bonan Min Weakly-Supervised Acquisition of Open- Domain Classes and Class Attributes from Web.
Retroactive Answering of Search Queries Beverly Yang Glen Jeh.
Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.
Web Mining Issues Size Size –>350 million pages –Grows at about 1 million pages a day Diverse types of data Diverse types of data.
Search Engine using Web Mining COMS E Web Enhanced Information Mgmt Prof. Gail Kaiser Presented By: Rupal Shah (UNI: rrs2146)
Named Entity Recognition in Query Jiafeng Guo 1, Gu Xu 2, Xueqi Cheng 1,Hang Li 2 1 Institute of Computing Technology, CAS, China 2 Microsoft Research.
Introduction to Data Mining by Yen-Hsien Lee Department of Information Management College of Management National Sun Yat-Sen University March 4, 2003.
Post-Ranking query suggestion by diversifying search Chao Wang.
Improved Video Categorization from Text Metadata and User Comments ACM SIGIR 2011:Research and development in Information Retrieval - Katja Filippova -
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Predicting the Location and Time of Mobile Phone Users by Using Sequential Pattern Mining Techniques Mert Özer, Ilkcan Keles, Ismail Hakki Toroslu, Pinar.
Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Predicting Short-Term Interests Using Activity-Based Search Context CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
© Prentice Hall1 DATA MINING Web Mining Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides.
Web Intelligence and Intelligent Agent Technology 2008.
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Detecting Online Commercial Intention (OCI)
Ryen White, Ahmed Hassan, Adish Singla, Eric Horvitz
Presentation transcript:

Search Query Log Analysis Kristina Lerman

What can we learn from web search queries?
Characteristics
– Length has steadily grown over the years: 1990s: < 2 terms; 2001: 2.4 terms; 2014: long search queries, e.g., “where is the nearest coffee shop”
– Heavy-tailed distribution of term frequency
– Billions of queries
User intentions
– Aggregate query words with the results of search to learn users’ needs, wants, and goals
– Create a database of commonsense knowledge (cf. Cyc)
Does the data exist?
– AOL search query log
– Google Trends

2006 AOL search query log dataset
– ~20M web queries, ~650K users
– 3-month period: March 1 – May 31, 2006
Data format
– AnonID – an anonymous user ID number
– Query – the query issued by the user
– QueryTime – time the query was submitted
– ItemRank – rank of the item clicked in the results
– ClickURL – the domain of the clicked item
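The released log is plain tab-separated text with exactly these five fields per record. A minimal loading sketch, assuming the shards are saved locally under names like user-ct-test-collection-01.txt (the file name and the header-skipping detail are assumptions of this sketch):

```python
import csv
from collections import namedtuple

Record = namedtuple("Record", ["anon_id", "query", "query_time", "item_rank", "click_url"])

def read_aol_log(path):
    """Yield one Record per line of a tab-separated AOL query-log file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0] == "AnonID":      # skip blank lines and the header row, if present
                continue
            row += [""] * (5 - len(row))           # rows without a click lack ItemRank/ClickURL
            yield Record(*row[:5])

# Example use (file name is an assumption about how the shards were saved locally):
# records = list(read_aol_log("user-ct-test-collection-01.txt"))
# print(len({r.anon_id for r in records}), "users,", len(records), "query events")
```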

Timeline
– 8/4/06: Announcement to SIG-IRList from AOL
– 8/6/06: TechCrunch slams AOL over privacy
– 8/7/06: Dataset removed
– 8/9/06: NYTimes identifies a user – Thelma Arnold, 62, from Lilburn, Georgia
– 8/21/06: AOL CTO Maureen Govern resigns; the AOL researcher and supervisor are fired

Weakly-Supervised Discovery of Named Entities Using Web Search Queries
Marius Pasca (Google)
CIKM 2007: Conference on Information and Knowledge Management, Lisbon, Portugal

Weakly-Supervised Discovery of Named Entities Using Web Search (2007)
Goal: Automatically extract knowledge (entities) from texts created by many people
– Discover new instances of classes: Red Alert is a video game; Lilburn is a town; Lorazepam is a drug
For what purpose?
– Cataloging human knowledge
– Understanding searching users (“# in Lilburn takes Lorazepam, plays Red Alert”)

Intuition: templates in queries
– “side effects of xanax pills”, “side effects of birth control pills”, “side effects of lipitor pills”, …
– Prefix: “side effects of”; Postfix: “pills”
But templates are difficult to specify
– Cf. extraction patterns in web information retrieval

“Weakly”-supervised approach
Guided by a small set of known seed instances – the input is a target class and some examples:
– Drug: {phentermine, viagra, vicodin, vioxx, xanax}
– City: {london, paris, san francisco, tokyo, toronto}
– Food: {chicken, fish, milk, tomatoes, wheat}
Identify the patterns the seed instances occur in
Learn many more new instances automatically – use the patterns to find more instances

Step 1: Identify query templates
– Identify all queries that contain each known class instance, e.g., vioxx
– Extract the left and right context: “long term vioxx use”
– Prefix: “long term”; Postfix: “use”; Infix: “vioxx”
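A minimal sketch of this step, assuming lowercased query strings and single- or multi-word seed instances; the function name and the simple substring matching are assumptions of this sketch, not the paper’s implementation:

```python
from collections import Counter

def extract_templates(queries, seeds):
    """Collect (prefix, postfix) pairs surrounding each seed instance in the queries."""
    templates = Counter()
    for q in queries:
        q = " " + q.strip().lower() + " "          # pad so seeds match on word boundaries
        for seed in seeds:
            needle = " " + seed + " "
            pos = q.find(needle)
            if pos != -1:
                prefix = q[:pos].strip()                   # e.g. "long term"
                postfix = q[pos + len(needle):].strip()    # e.g. "use"
                templates[(prefix, postfix)] += 1
    return templates

# templates = extract_templates(queries, {"phentermine", "viagra", "vicodin", "vioxx", "xanax"})
```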

Step 2: Generate candidate instances
– Go over the query log again
– Identify all queries that match a template
– Collect the query infixes as candidate instances: {low blood pressure, xanax, lamictal, generic birth control, lipitor, vicodin, beta blockers, …}
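Continuing the sketch above, candidate generation can reuse the extracted (prefix, postfix) pairs as anchored regular expressions; the helper name and the decision to skip templates with an empty prefix or postfix are assumptions of this sketch:

```python
import re
from collections import Counter, defaultdict

def find_candidates(queries, templates):
    """Return candidate -> Counter of (prefix, postfix) templates whose slot it filled."""
    compiled = {}
    for prefix, postfix in templates:
        if not prefix or not postfix:       # skip degenerate templates in this sketch
            continue
        pattern = re.compile(rf"^{re.escape(prefix)}\s+(.+)\s+{re.escape(postfix)}$")
        compiled[(prefix, postfix)] = pattern

    candidates = defaultdict(Counter)
    for q in queries:
        q = q.strip().lower()
        for template, pattern in compiled.items():
            m = pattern.match(q)
            if m:
                candidates[m.group(1)][template] += 1   # the infix is the candidate instance
    return candidates

# candidates = find_candidates(queries, extract_templates(queries, drug_seeds))
```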

Step 3: Compile search signatures
– Each candidate is represented as a vector
– Each template is a dimension
– Weighted by frequency in queries
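Because the matcher above already records how often each candidate fills each template, a search signature is just that count vector laid out over a fixed template ordering. Normalizing to a probability vector is an assumption of this sketch (it is convenient for the Jensen-Shannon comparison later):

```python
def signature(template_counts, all_templates):
    """Turn a candidate's template counts into a probability vector over all templates."""
    total = sum(template_counts.get(t, 0) for t in all_templates)
    if total == 0:
        return [0.0] * len(all_templates)
    return [template_counts.get(t, 0) / total for t in all_templates]

# all_templates = sorted(t for t in extract_templates(queries, drug_seeds) if all(t))
# sig = signature(candidates["lipitor"], all_templates)   # one vector per candidate
```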

Step 4: Reference signatures
– Vectors for the example class instances are combined
– The result is a prototype search signature for the class

Example

Step 5: Compute signature similarity
– Vector similarity between the reference signature and each candidate signature, using the Jensen-Shannon similarity function
– Output is a rank-ordered list, e.g., Drug: {viagra, phentermine, ambien, adderall, vicodin, hydrocodone, xanax, vioxx, oxycontin, cialis, valium, lexapro, ritalin, zoloft, percocet, …}
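A sketch covering Steps 4 and 5 together: the reference signature here is a simple average of the seed vectors, and candidates are ranked by Jensen-Shannon divergence from it (smaller divergence = more similar). The averaging choice is an assumption of this sketch; the paper describes its own combination scheme:

```python
import math

def reference_signature(seed_signatures):
    """Average the seed instances' signatures into one prototype vector (Step 4)."""
    n = len(seed_signatures)
    return [sum(vals) / n for vals in zip(*seed_signatures)]

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors (smaller = more similar)."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0 and y > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# seed_sigs = [signature(candidates[s], all_templates) for s in drug_seeds if s in candidates]
# ref = reference_signature(seed_sigs)
# ranked = sorted(candidates, key=lambda c: js_divergence(signature(candidates[c], all_templates), ref))
```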

Evaluation

Repeatability
– Need an enormous database of search query logs – probably best done at Google or Microsoft
– What can be done with small query databases?
– What types of social media text could this method be applied to?

Classifying the User Intent of Web Queries Using K-means Clustering
Ashish Kathuria, Bernard J. Jansen, Carolyn Hafernik, and Amanda Spink

Problem introduction
– The WWW plays a vital role in many people’s daily lives
– Nearly 70 percent of searchers use a search engine
– Search engines receive hundreds of millions of queries per day and return billions of results per week in response to these queries
– Smart users: novel and increasingly assorted ways of searching!

Understanding intent behind searching
– Can help to improve search engine performance via page ranking, result clustering, advertising, and presentation of results

Approach
– Automatically classify a large set of queries from a web search engine log as informational, navigational, or transactional
– Encode the characteristics of informational, navigational, and transactional queries identified in prior work to develop an automatic classifier using k-means clustering
– Use data-mining techniques to more accurately and automatically classify queries by user intent
– Overcome limitations of previous research: small datasets, limited methodology

Classification of Queries (figure omitted; image source not given)

Research methodology
Dataset: transaction log from Dogpile. Each record has fields such as user identification, cookie, time of day, query terms, and source.
Step 1: Creating sessions and removing duplicates
– The time of day, user identification, cookie, and query fields were used to locate the initial query of a session and then recreate the series of actions in the session
– The searches were collapsed by user identification, cookie, and query to eliminate duplicate results and null queries
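A rough sketch of the session-building step, assuming each record is a dict with user_id, cookie, time (a datetime), and query keys; the field names and the 30-minute inactivity cutoff are illustrative assumptions, not details given on the slide:

```python
from datetime import timedelta
from itertools import groupby

def build_sessions(records, gap=timedelta(minutes=30)):
    """Group each user's time-ordered queries into sessions, dropping nulls and duplicates."""
    records = sorted(records, key=lambda r: (r["user_id"], r["cookie"], r["time"]))
    sessions = []
    for _, user_recs in groupby(records, key=lambda r: (r["user_id"], r["cookie"])):
        current, prev = [], None
        for r in user_recs:
            if prev is not None and r["time"] - prev["time"] > gap:
                if current:
                    sessions.append(current)          # inactivity gap closes the session
                current = []
            if r["query"] and (not current or r["query"] != current[-1]["query"]):
                current.append(r)                     # skip null queries and immediate duplicates
            prev = r
        if current:
            sessions.append(current)
    return sessions
```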

Research methodology
Step 2: Generating additional attributes
– Three additional attributes were calculated for each record: query length, query reformulation, and result page
Step 3: Assignment of terms
1. Navigational: queries containing company/business/organization/people names, or portions of URLs or even complete URLs
2. Transactional: identified via key terms related to transactional domains such as entertainment and e-commerce
3. Informational: queries that use natural-language terms; longer sessions than for navigational or transactional searching
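A toy version of the Step 3 rules; the term lists, the URL test, and the function itself are illustrative placeholders, not the rules used in the paper:

```python
import re

TRANSACTIONAL_TERMS = {"buy", "download", "price", "tickets", "lyrics", "games"}  # illustrative only
QUESTION_WORDS = {"how", "what", "why", "where", "when", "who"}                    # illustrative only

def assign_intent(query, known_org_names):
    """Very rough rule-based intent label in the spirit of Step 3 (not the paper's exact rules)."""
    terms = query.lower().split()
    if re.search(r"\b\w+\.(com|org|net|edu)\b", query) or any(t in known_org_names for t in terms):
        return "navigational"     # URLs or company/organization/people names
    if any(t in TRANSACTIONAL_TERMS for t in terms):
        return "transactional"    # key terms from domains like entertainment and e-commerce
    if terms and (terms[0] in QUESTION_WORDS or len(terms) > 2):
        return "informational"    # natural-language, longer queries
    return "informational"        # default bucket in this sketch
```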

Research methodology
Step 4: Converting textual data to numerical data
Step 5: Converting strings to vectors

K-means clustering (clusters: navigational, informational, transactional)
The resulting data set had four attributes that could be used for classification: query length, source, query reformulation rate, and the user-intent weight of the query.
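A scikit-learn sketch of the clustering step, standing in for whatever implementation the authors used; the feature values and the numeric encoding of the source attribute are invented purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per query: [query_length, source_code, reformulation_rate, intent_weight].
X = np.array([
    [1.0, 0, 0.0, 0.9],
    [5.0, 1, 0.1, 0.2],
    [3.0, 2, 0.4, 0.6],
    [2.0, 0, 0.0, 0.8],
    [6.0, 1, 0.2, 0.1],
    [4.0, 2, 0.5, 0.5],
])

X_scaled = StandardScaler().fit_transform(X)   # put the four attributes on comparable scales
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
# Each of the three clusters is then mapped to navigational / informational / transactional
# by inspecting its centroid's attribute profile.
print(labels)
```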

Results
– Performed on various datasets, the approach achieved 94% accuracy
– Overall, about 76 percent of the queries were classified as informational, about 12 percent as transactional, and 12 percent as navigational

Results
– Navigational queries: low rates of reformulation; typically sessions of just one query
– Informational queries: low occurrences of query reformulation, probably indicating relatively easy informational needs, such as fact finding
– Transactional queries: shorter queries

Discussion of approach
Limitations:
– Is the Dogpile user population representative of web search engine users in general?
– What if a prototype has multiple user intents associated with it?
– Is relying solely on transaction logs sufficient?
Future scope:
– Investigate subcategories of intent
– A laboratory study on how searchers express their underlying intent
– Develop algorithmic approaches for more in-depth analysis of individual queries
The approach has a high success rate, uses a large data set of queries, and does not depend on external content, making it implementable in real time.

Summary
– Identifying the user intent of web queries is very useful for web search engines because it allows them to provide more relevant results to searchers and more precisely targeted sponsored links
– Classifying queries helps focus search:
Informational queries: provide relevant information and ads
Navigational queries: provide links straight to the requested web page
Transactional queries: also surface commercial links relevant to a future purchase
– Using k-means as an automatic clustering and classification technique yielded positive results and opened effective ways to improve the performance of web search engines

-Neha Mundada

Acquiring Explicit Goals from Search Query Logs
Understanding human goals is necessary to
– Recognize the goals of actions
– Create a plan; e.g., ‘plan a trip to Vienna’ has subgoals ‘contact travel agent’, ‘book hotel’, ‘buy concert tickets’, etc.
Automatically acquire human goals from search query logs
– Acquire and organize commonsense knowledge

Research overview
Research question:
– Can search query logs be used to overcome the problem of acquiring knowledge about human goals, and if so, how?
Following an exploratory research style, we intend to show that query logs:
– contain a small but interesting number of user goals
– allow those goals to be separated out by automatic methods
Results:
– Knowledge about the automatic acquisition of goals from search query logs
– Knowledge about the nature of goals extracted from search query logs

Results of human subject study
– 4 independent raters labeled 3000 queries
– Examples: “bug killing devices”, “mothers working from home”, “how to lose weight”
– The classes appear to be separable

Experimental setup
– AOL search query log: ~20 million search queries recorded between March 1 and May 31, 2006 (ethical issues noted earlier)
– Pre-processing steps to reduce noise left 5 million queries
– Labeled queries from the human subject study were used as training examples (controversial queries were omitted)

Classification approach
Part-of-speech tagging
– A maximum entropy tagger converts a sequence of words into a sequence of POS tags
– Example: query “buy a car” → buy/VB a/DT car/NN
– Set of words: {buy, car}
– Part-of-speech trigrams: $ VB DT NN $ → {$ VB DT, VB DT NN, DT NN $}
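A sketch of this feature extraction using NLTK’s default tagger as a stand-in for the maximum-entropy tagger; the tiny stopword list is an assumption used to reproduce the slide’s {buy, car} word set:

```python
import nltk  # may require: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

STOPWORDS = {"a", "an", "the", "of", "to"}   # tiny illustrative stopword list

def query_features(query):
    """Word features plus padded POS trigrams, following the slide's 'buy a car' example."""
    tokens = nltk.word_tokenize(query.lower())
    tags = [tag for _, tag in nltk.pos_tag(tokens)]        # e.g. ['VB', 'DT', 'NN']
    padded = ["$"] + tags + ["$"]
    trigrams = {" ".join(padded[i:i + 3]) for i in range(len(padded) - 2)}
    words = set(tokens) - STOPWORDS                         # e.g. {'buy', 'car'}
    return words | trigrams

# With the slide's example: query_features("buy a car") ≈ {'buy', 'car', '$ VB DT', 'VB DT NN', 'DT NN $'}
```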

Classification approach (2)
Linear support vector machine [Dumais98]
– Robust and effective for text classification
– Weka machine learning toolkit
Performance:
– 10 trials of 3-fold cross-validation
– Precision, recall, and F1-measure for the class “queries containing goals”: Precision = 0.77, Recall = 0.63, F1-Measure = 0.69
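The authors used Weka; the sketch below uses scikit-learn’s LinearSVC as a stand-in, with a bag-of-words vectorizer in place of the word + POS-trigram features and toy labels just to make it runnable (the scores it prints are meaningless):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data: 1 = query contains a goal, 0 = it does not (real labels came from the study).
queries = ["how to lose weight", "bug killing devices", "cheap flights paris", "facebook login"]
y = np.array([1, 1, 0, 0])

model = make_pipeline(
    CountVectorizer(),   # stand-in for the word + POS-trigram features described above
    LinearSVC(),
)
# The paper reports 10 trials of 3-fold cross-validation; cv=2 here only because the toy set is tiny.
scores = cross_validate(model, queries, y, cv=2, scoring=("precision", "recall", "f1"))
print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```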

N-fold cross-validation
– Problem: limited amount of labeled data
– Solution: N-fold cross-validation
– Divide the data into N equal segments (folds)
– Training data: N−1 folds; testing data: the remaining fold
– Repeat for the remaining test folds and average the results
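The mechanics of the procedure, as a minimal sketch (the helper name and the interleaved fold assignment are choices of this sketch):

```python
def k_fold_indices(n_items, n_folds=3):
    """Split item indices into n_folds roughly equal test folds; the rest is training data."""
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j in range(n_folds) if j != k for i in folds[j]]
        yield train, test

# for train_idx, test_idx in k_fold_indices(len(labeled_queries), n_folds=3):
#     ...train the classifier on train_idx, evaluate on test_idx, then average the scores
```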

Goals are diverse
– The rank-frequency plot of goals is heavy-tailed
– A few goals are shared by many users
– The majority of goals are shared by only a few users
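A small sketch for checking this, assuming extracted_goals is a list of the goal strings the classifier accepted (the variable name is an assumption):

```python
from collections import Counter

def rank_frequency(goals):
    """Return (rank, frequency) pairs for extracted goal strings, most frequent goal first."""
    counts = Counter(goals)
    return [(rank, freq) for rank, (_, freq) in enumerate(counts.most_common(), start=1)]

# On a heavy-tailed distribution, frequency falls off steeply with rank:
# a few goals at the head are shared by many users, while most goals appear only a handful of times.
# points = rank_frequency(extracted_goals)
```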

Most frequent goals

Most frequent goals with “get”, “make”, “change” and “be”

Summary
– Web search queries are an abundant, but very sparse and very noisy, source of data about the needs, desires, and intentions of people
– Clever methods can learn from these diverse data: named entities, goals
– Can these methods be used in social media?