Unsupervised Extraction of Template Structure in Web Search Queries. WWW 2012, Session: Search. Qingxia Liu.

Presentation transcript:

Contents: Motivation; Definition of the Problem; Generative Model; Alternative Models; Experiments

Motivation
Web search problem: determine users' intent.
Keyword search: a query is a small set of keywords.
Challenges:
  Scalability issues.
  Instrumentation issues: user clicks.
  Brevity and ambiguity, e.g. "jaguar": the car or the animal?
  Sparsity issues, e.g. tail queries such as "jaguar xj12 95 6.01 engine mount".
Fact: users issue various queries with the same search intent.
Usage of intent information:
  to provide relevant results;
  to detect whether a query has a commercial intent;
  to select a useful set of advertisements;
  to learn from the user's interaction with the search engine.

Motivation
Goal: extract the hidden structure behind the observed search queries in a domain, with no manual intervention. E.g. "jaguar xj12 95 6.01 engine mount" follows the pattern <Brand, Model, Year, Part>.
Similar works:
  analyze queries to obtain segmentations;
  extract named entities from queries;
  use detected templates for improving query recommendations.
All these approaches either (1) require direct supervision of the tasks, such as manually labeled seed data, or use ancillary information such as web search click-through data, both of which might be expensive or difficult to obtain; or (2) assume that the attribute set as well as the associated vocabularies are given as inputs in the form of database relations or entity hierarchies.
We seek to solve these issues by enriching search queries with information about the hidden structure underlying them.

Model

Single-attribute template model
[Diagram: templates template1 to template4, attributes a to d, and the set of words W1, W2, W3, W4, ... that the attributes generate.]

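Multi-attribute template model
[Diagram: template1 = (attr a, attr b, attr c), template2 = (attr b, attr c), template3 = (attr a, attr c); each attribute generates a set of words W1, W2, W3, W4, ...]
E.g. attributes: Brand, Year, Parts; words of the Brand attribute: Honda, Toyota, Ford, ...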

Properties
I: Constrain the number of distinct templates that can be formed in the model.
II: Each attribute in a template generates at least one word.
III: Each attribute has a specific word distribution as well as a distinct tendency for the number of words it contributes to a query.

Problem Definition
Given a set of queries, extract the underlying schema (templates, attributes, and their vocabularies) and learn the parameters of the generative process in a completely unsupervised manner, while respecting the properties mentioned above.

Parameter generating process
Candidate pool: T templates t_1, t_2, t_3, ..., t_T, each with an attribute vector θ_t.
  θ_t ~ Multinomial(μ), μ ~ Dirichlet(α)
Each query q (q_1, q_2, q_3, q_4, ...) is assigned a template t_q from the pool.
  t_q ~ Multinomial(γ), γ ~ Dirichlet(σ)
One way to think of this is that the candidate pool denotes the set of template configurations which are appropriate for the domain.

Generative Model
For a query q with template t_q and attribute vector θ[t_q] = (attr1, attr2, attr3, attr4, ...):
  Number of words contributed by attribute a: n_a ~ Poisson(η_a), η_a ~ Gamma(g1, g2)
  Attribute assignments of the word positions: z_q = (z[q,1], z[q,2], z[q,3], ...)
  The i-th word: w[q,i] ~ Multinomial(φ_z[q,i]), φ_a ~ Dirichlet(β)
Query q is the resulting word sequence w1, w2, w3, ...
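The generative story above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the attribute names, vocabularies, candidate templates, and hyperparameter values are hypothetical; the templates are hard-coded instead of being drawn from their prior; words are emitted in template order; Gamma(g1, g2) is read as shape g1 and scale g2; and `1 + Poisson(η_a)` is one assumed way to make every attribute contribute at least one word.

```python
import math
import random

def sample_dirichlet(rng, alphas):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_poisson(rng, rate):
    """Draw from Poisson(rate) with Knuth's method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_categorical(rng, probs):
    """Return an index drawn according to the probability vector probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical domain: attribute vocabularies and a candidate pool of templates.
VOCAB = {
    "brand": ["honda", "toyota", "ford"],
    "year": ["1995", "2004", "2010"],
    "part": ["engine", "mount", "brake"],
}
TEMPLATES = [("brand", "year", "part"), ("brand", "part"), ("year", "part")]

def generate_queries(num_queries, seed=0, sigma=1.0, beta=0.5, g1=2.0, g2=0.5):
    rng = random.Random(seed)
    # gamma ~ Dirichlet(sigma): how often each template in the pool is used.
    gamma = sample_dirichlet(rng, [sigma] * len(TEMPLATES))
    # phi_a ~ Dirichlet(beta): word distribution of attribute a.
    phi = {a: sample_dirichlet(rng, [beta] * len(ws)) for a, ws in VOCAB.items()}
    # eta_a ~ Gamma(g1, g2), assumed here to mean shape g1 and scale g2.
    eta = {a: rng.gammavariate(g1, g2) for a in VOCAB}
    queries = []
    for _ in range(num_queries):
        t_q = sample_categorical(rng, gamma)        # t_q ~ Multinomial(gamma)
        words, attrs = [], []
        for a in TEMPLATES[t_q]:
            n_a = 1 + sample_poisson(rng, eta[a])   # assumed: at least one word
            for _ in range(n_a):
                words.append(VOCAB[a][sample_categorical(rng, phi[a])])
                attrs.append(a)                     # z[q,i]: attribute of word i
        queries.append((t_q, words, attrs))
    return queries
```

Calling `generate_queries(5)` returns five `(template index, words, attributes)` triples; learning has to recover the last two components from the words alone.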

Generative Model
[Diagram: each query q (q1, q2, ..., q4, ...) with its words w_q, template assignment t_q, and attribute assignment a_q, linked to the candidate pool of templates t1, t2, t3, ..., tT.]

Model Learning
Gibbs sampling, using Bayes' theorem.
Denominator: handled by Gibbs sampling.
Numerator: computed from the preceding formulas.
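As a small illustration of why only the numerator has to be computed explicitly: for a discrete variable, the Bayes denominator is the sum of the numerators over all candidate values, so a Gibbs step can draw directly from unnormalized weights. The sketch below is generic and hypothetical (the function name and the toy weights are not from the slides).

```python
import random

def gibbs_draw(unnormalized, rng):
    """Draw index k with probability unnormalized[k] / sum(unnormalized)."""
    total = sum(unnormalized)
    if total <= 0:
        raise ValueError("at least one candidate must have positive weight")
    threshold = rng.random() * total
    running = 0.0
    for k, weight in enumerate(unnormalized):
        running += weight
        if threshold < running:
            return k
    # Guard against floating-point round-off: fall back to the last
    # candidate that has positive weight.
    return max(k for k, w in enumerate(unnormalized) if w > 0)

# Toy numerators for three candidate templates of one query.
rng = random.Random(0)
new_template = gibbs_draw([0.2, 0.0, 0.6], rng)
```

A candidate with weight zero is never selected, and the remaining candidates are chosen in proportion to their weights.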

Overview
1. Random initialization.
2. Iterate over the queries and the template set, using the derived conditionals to update the vectors.
3. Compute the likelihood p(φ, μ, γ | data).
Ends with: query -> template, word -> attribute.

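Alternative approaches: Latent Dirichlet Allocation (LDA)
Correspondence between LDA and the query setting:
  a document -> a query;
  topics -> attributes;
  a topic's multinomial distribution over the vocabulary -> the distribution φ;
  a word of the document -> the i-th word of the query.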

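Alternative approaches: Spherical k-Means
n_u,v: the number of times words w_u and w_v occur together in queries.
Each word w_u is represented by its co-occurrence behavior, the vector (n_u,1, n_u,2, n_u,3, n_u,4, ...).
Words are clustered by co-occurrence behavior; clusters of words -> attributes.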
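A minimal sketch of spherical k-means on co-occurrence vectors, assuming each word is represented by its row of counts n_u,v. Rows are scaled to unit length, points are assigned to the centroid with the largest cosine similarity, and centroids are the re-normalized sums of their members. The four words, the counts, and the choice of initial centroids are hypothetical toy values, not data from the experiments.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spherical_kmeans(vectors, initial_centroids, iterations=20):
    """Cluster unit-normalized vectors by cosine similarity."""
    points = [normalize(v) for v in vectors]
    centroids = [normalize(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the centroid with the largest cosine.
        labels = [max(range(len(centroids)), key=lambda k: dot(p, centroids[k]))
                  for p in points]
        # Update step: a centroid becomes the normalized sum of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = normalize([sum(col) for col in zip(*members)])
    return labels

WORDS = ["honda", "toyota", "engine", "brake"]
COOC = [
    [0, 0, 5, 4],  # honda co-occurs with part words, not with other brands
    [0, 0, 4, 6],  # toyota
    [5, 4, 0, 0],  # engine co-occurs with brand words
    [4, 6, 0, 0],  # brake
]
labels = spherical_kmeans(COOC, initial_centroids=[COOC[0], COOC[2]])
clusters = {w: label for w, label in zip(WORDS, labels)}
```

In this toy input the two brand words share one cluster and the two part words share the other, because words of the same attribute co-occur with the same neighbors rather than with each other.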

Experiments
Queries: 100 million Yahoo search queries; 43,793, 83,387, and 15,050 queries in the respective domains.
Ground truth: manually extracted.

Experiments
Domains: Automobile, Travel, Movies.
Correctly placed: a word is correctly placed if the learnt attribute and the ground truth attribute it belongs to are mapped to each other.
PRECISION(N) is the fraction of words in the first N learnt attributes (in the algorithm's ordering) that are correctly placed.
CORRECTRECALL(N) is the fraction of words in ground truth attributes mapped to the first N learnt attributes that are correctly placed.
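One possible reading of the two measures, sketched in Python. The slides do not say how learnt attributes are mapped to ground truth attributes, so the mapping is taken as a given input here; the function names and the toy data are hypothetical.

```python
def correctly_placed(learnt, truth, mapping, n):
    """Count words in the first n learnt attributes whose ground truth
    attribute is the one their learnt attribute is mapped to."""
    count = 0
    for i in range(min(n, len(learnt))):
        target = mapping.get(i)
        if target is None:
            continue  # unmapped learnt attributes place no word correctly
        count += sum(1 for w in learnt[i] if truth.get(w) == target)
    return count

def precision_at(learnt, truth, mapping, n):
    """Correctly placed words / all words in the first n learnt attributes."""
    total = sum(len(learnt[i]) for i in range(min(n, len(learnt))))
    return correctly_placed(learnt, truth, mapping, n) / total if total else 0.0

def correct_recall_at(learnt, truth, mapping, n):
    """Correctly placed words / all words of the ground truth attributes
    that the first n learnt attributes are mapped to."""
    targets = {mapping[i] for i in range(min(n, len(learnt))) if i in mapping}
    total = sum(1 for attr in truth.values() if attr in targets)
    return correctly_placed(learnt, truth, mapping, n) / total if total else 0.0

# Toy example: learnt attributes in the algorithm's ordering.
learnt = [{"honda", "toyota", "engine"}, {"brake", "mount"}]
truth = {"honda": "brand", "toyota": "brand", "ford": "brand",
         "engine": "part", "brake": "part", "mount": "part"}
mapping = {0: "brand", 1: "part"}

p1 = precision_at(learnt, truth, mapping, 1)        # 2 of 3 words correct
r2 = correct_recall_at(learnt, truth, mapping, 2)   # 4 of 6 truth words found
```

Under this reading, a misplaced word such as "engine" lowers precision, while a ground truth word that no learnt attribute contains, such as "ford", lowers recall.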

Experiments
Parameters:
  number of iterations;
  number of attributes;
  β, which acts as a prior on the attribute-word multinomial distributions;
  g1 and g2, which are used to generate the Poisson distributions for each attribute (attribute -> number of words).
Results for different values of β.

Case Study
Application: CTR on sponsored search advertisements, i.e. a query's tendency to attract ad-clicks from users, used for inferring its advertisability.
fq1 = ∑i>1qi

Thanks for listening ~