Automatically Identifying Localizable Queries
Michael J. Welch, Junghoo Cho (University of California, Los Angeles), SIGIR 2008
Presented by Nam, Kwang-hyun
Intelligent Database Systems Lab, Center for E-Business Technology
School of Computer Science & Engineering, Seoul National University, Seoul, Korea

Contents
- Introduction
- Motivation
- Our Approach
  - Identify candidate localizable queries
  - Select a set of relevant features
  - Train and evaluate supervised classifier performance
- Evaluation
  - Individual Classifiers
  - Ensemble Classifiers
- Conclusion and Future Work
- Discussion

Introduction
- Typical queries
  - Insufficient to fully specify a user's information need
- Localizable queries
  - Some queries are location sensitive
    - "italian restaurant" -> "[city] italian restaurant"
    - "courthouse" -> "[county] courthouse"
    - "drivers license" -> "[state] drivers license"
  - They are submitted by a user with the goal of finding information or services relevant to the user's current location
- Our task
  - Identify the queries which contain locations as contextual modifiers

Motivation
- Why automatically localize?
  - Reduce burden on the user: no special "local" or "mobile" site needed
  - Improve search result relevance: not all information is relevant to every user
  - Increase clickthrough rate
  - Improve local sponsored content matching

Motivation
- A significant fraction of queries are localizable
  - Roughly 30%, but users only explicitly localize them about half of the time
  - So roughly 16% of queries would benefit from automatic localization
- Users agree on which queries are localizable
  - Queries for goods and services, e.g. "food supplies", "home health care providers"
  - But "calories coffee" and "eye chart" are not

Our Approach
- Identify candidate localizable queries
- Select a set of relevant features
- Train and evaluate supervised classifier performance

Identifying Base Queries
- Queries are short and unformatted
- Use string matching
  - Compare against locations of interest, using U.S. Census Bureau data
  - Extract the base query, where the matched portion of text is tagged with the detected location type (state, county, or city)
  - To ensure accuracy, false positives are filtered out later by the classifier
- Simple, yet effective (a minimal sketch follows below)
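A minimal sketch of this tagging step, with a toy single-token gazetteer standing in for the Census Bureau data; the gazetteer, function name, and tag format are illustrative assumptions, not the paper's implementation:

```python
from itertools import combinations

# Toy gazetteer standing in for the U.S. Census Bureau location data.
LOCATIONS = {"malibu": "city", "california": "state", "cook": "county"}

def extract_base_queries(query):
    """Tag location terms in a query and return (base_query, tags) pairs.

    Every non-empty subset of the matched locations is removed from the
    query to form a candidate base query, tagged with what was removed.
    """
    tokens = query.lower().split()
    matches = [(i, t, LOCATIONS[t]) for i, t in enumerate(tokens) if t in LOCATIONS]
    results = []
    for r in range(1, len(matches) + 1):
        for subset in combinations(matches, r):
            removed = {i for i, _, _ in subset}
            base = " ".join(t for i, t in enumerate(tokens) if i not in removed)
            tags = [f"{kind}:{term}" for _, term, kind in subset]
            results.append((base, tags))
    return results

for base, tags in extract_base_queries("public libraries in malibu california"):
    print(base, "->", tags)
# public libraries in california -> ['city:malibu']
# public libraries in malibu -> ['state:california']
# public libraries in -> ['city:malibu', 'state:california']
```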

Example: Identifying Base Queries
- Original query: "public libraries in malibu california"
- Removing the matched locations yields:
  - "public libraries in california" (tag: city:malibu)
  - "public libraries in malibu" (tag: state:california)
  - "public libraries in" (tags: city:malibu, state:california)

Example: Identifying Base Queries
- Three distinct base queries
  - Remove stop words and group by base
  - Allows us to compute aggregate statistics

  Base                         Tag
  public libraries california  city:malibu
  public libraries malibu      state:california
  public libraries             city:malibu, state:california

Our Approach
- Identify candidate localizable queries
- Select a set of relevant features
- Train and evaluate supervised classifier performance

Distinguishing Features
- Hypothesis: localizable queries should
  - Be explicitly localized by some users
  - Occur several times, from different users
  - Occur with several different locations, each with about equal probability

Localization Ratio
- Users vote for the localizability of query q_i by contextualizing it with a location l
- The localization ratio is the fraction of instances of q_i that carry a location tag:

  r_i = Q_i(L) / Q_i,  with r_i ∈ [0,1]

  where r_i is the localization ratio for q_i, Q_i is the count of all instances of q_i, and Q_i(L) is the count of all query instances tagged with some location l ∈ L
- Drawbacks
  - Susceptible to small sample sizes
  - Unable to identify false positives resulting from incorrectly tagged locations
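The ratio itself is a one-line computation; a sketch over an assumed (query, tags) log format:

```python
def localization_ratio(instances):
    """instances: list of (query, tags) pairs, where tags is a possibly
    empty list of location tags. Returns Q_i(L) / Q_i."""
    tagged = sum(1 for _, tags in instances if tags)
    return tagged / len(instances)

log = [("italian restaurant", ["city:malibu"]),
       ("italian restaurant", []),
       ("italian restaurant", ["state:california"]),
       ("italian restaurant", [])]
print(localization_ratio(log))  # 0.5
```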

Location Distribution
- Informally: given an instance of any localized query q_l with base q_b, the probability that q_l contains location l is approximately equal across all locations that occur with q_b
- To estimate the distribution, we calculate several measures of the per-location occurrence counts: mean, median, min, max, and standard deviation
- Notation: q_l is a localized query, q_b its base query, and L(q_b) the set of location tags observed with q_b
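A sketch of computing these five measures for one base query with Python's statistics module; the input mapping format is an assumption:

```python
import statistics

def location_distribution_features(counts):
    """counts: mapping from location tag to occurrence count for one
    base query. Returns the five summary measures used as features."""
    values = list(counts.values())
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

# A heavily skewed distribution (one dominant location) suggests the
# "location" is part of a fixed phrase, not a contextual modifier.
print(location_distribution_features(
    {"state:kentucky": 163, "state:louisiana": 4,
     "city:chester": 6, "city:rice": 2}))
```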

Location Distribution
- The "fried chicken" problem: a single dominant location (here state:kentucky, from "kentucky fried chicken") signals a fixed phrase rather than a genuinely localizable query

  Tag                    Count
  city:chester           6
  city:colorado springs  1
  city:cook              1
  city:crown             1
  city:lousiana          4
  city:louisville        2
  city:rice              2
  city:waxahachie        1
  state:kentucky         163
  state:louisiana        4
  state:maryland         2

Clickthrough Rates
- Assumption: a greater clickthrough rate is indicative of higher user satisfaction
  - T. Joachims et al., "Accurately interpreting clickthrough data as implicit feedback", SIGIR '05
- Calculated clickthrough rates for both the base query and its localized forms, using a binary clickthrough function
- The clickthrough rate for localized instances was 17% higher than for non-localized instances
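A sketch of the binary clickthrough function over an assumed log format, where each instance records only whether it received at least one click:

```python
def binary_clickthrough_rate(clicked):
    """clicked: list of booleans, True if that query instance received
    at least one click (the binary clickthrough function)."""
    return sum(clicked) / len(clicked)

base_clicks = [True, False, True, False]       # base-form instances
localized_clicks = [True, True, False, True]   # localized instances
print(binary_clickthrough_rate(base_clicks))       # 0.5
print(binary_clickthrough_rate(localized_clicks))  # 0.75
```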

Our Approach
- Identify candidate localizable queries
- Select a set of relevant features
- Train and evaluate supervised classifier performance

Classifier Training Data
- Selected a random sample of 200 base queries generated by the tagging step
- Filtered out base queries where
  - n_L <= 1 (only one distinct location modifier)
  - u_q = 1 (only issued by a single user)
  - q = 0 (the base form was never issued to the search engine)
- From the remaining 102 queries
  - 48 positive (localizable) examples
  - 54 negative (non-localizable) examples
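A sketch of these three filters, with illustrative field names for the per-base-query statistics (the names are assumptions):

```python
def keep_for_training(stats):
    """stats: dict with n_locations (distinct location modifiers),
    n_users (distinct issuing users), and base_count (times the base
    form itself was issued). Mirrors the three filters above."""
    return (stats["n_locations"] > 1
            and stats["n_users"] > 1
            and stats["base_count"] > 0)

sample = [{"n_locations": 5, "n_users": 12, "base_count": 40},
          {"n_locations": 1, "n_users": 3, "base_count": 7}]
print([keep_for_training(s) for s in sample])  # [True, False]
```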

Evaluation Setup
- Evaluated supervised classifiers on precision and recall using 10-fold cross validation
  - Precision: accuracy of queries classified as localizable
  - Recall: percent of localizable queries identified
- Focused attention on positive-class precision
  - False positives are more harmful than false negatives
  - Recall scores account for manual filtering
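A sketch of this protocol with scikit-learn, on synthetic stand-in data; the real feature matrix would hold the localization ratio, distribution measures, and clickthrough rates described above:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 102 labeled base queries and their features.
rng = np.random.default_rng(0)
X = rng.random((102, 8))
y = rng.integers(0, 2, 102)

# 10-fold cross validation, scoring positive-class precision and recall.
scores = cross_validate(DecisionTreeClassifier(), X, y, cv=10,
                        scoring=["precision", "recall"])
print(scores["test_precision"].mean(), scores["test_recall"].mean())
```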

Individual Classifiers
- Naïve Bayes
  - The Gaussian assumption doesn't hold for all features, so a kernel-based naïve Bayes classifier is used
- Decision Trees
  - Emphasized localization ratio, location distribution measures, and clickthrough rates

  Classifier                                   Precision  Recall
  Naïve Bayes                                  64%        43%
  Decision Tree (Information Gain)             67%        57%
  Decision Tree (Normalized Information Gain)  64%        56%
  Decision Tree (Gini Coefficient)             68%        51%

Individual Classifiers
- SVM (Support Vector Machine)
  - A family of supervised learning methods used for classification and regression
  - An improvement over NB and DT, but opaque
- Neural Network
  - The best individual classifier, but also opaque

  Classifier      Precision  Recall
  SVM             75%        62%
  Neural Network  85%        52%

Ensemble Classifiers
- Observation: false positive classifications didn't fully overlap across the individual classifiers
- Combined DT, SVM, and NN using a majority voting scheme (see the sketch below)

  Classifier  Precision  Recall
  Combined    94%        46%
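A sketch of hard majority voting over the three classifier types with scikit-learn's VotingClassifier; the hyperparameters and synthetic data are assumptions, not the paper's trained models:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((102, 8))
y = rng.integers(0, 2, 102)

# Hard voting: each classifier casts one vote; the majority label wins.
ensemble = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier()),
                ("svm", SVC()),
                ("nn", MLPClassifier(max_iter=2000))],
    voting="hard",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```

When the members' false positives don't fully overlap, a majority vote tends to raise precision at the cost of recall, consistent with the 94%/46% row above.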

Conclusion
- A method for classifying queries as localizable
  - Scalable, language-independent tagging
  - Determined useful features for classification
  - Demonstrated that simple components can make a highly accurate system
- Exploited variation among classifiers by applying majority voting

Future Work
- Optimize feature computation for real-time use
  - Many features fit into the MapReduce framework
- Investigate using dynamic features
  - Updating classifier models
  - Explicit feedback loops
- Generalize the definition of "location"
  - Landmarks, relative locations, GPS
- Integration with the search system

Discussion
- Pros
  - An interesting problem, useful for web search
  - Good performance
- Cons
  - The paper lacks some content needed to fully understand it
    - One of the equations is omitted
    - Some terms are not explained
  - No explanation of why "localizable" is treated as the positive class
  - False positives remain a concern