Improving Classification Accuracy Using Automatically Extracted Training Data. Ariel Fuxman, A. Kannan, A. Goldberg, R. Agrawal, P. Tsaparas, J. Shafer. Search Labs, Microsoft Research.




Similar presentations
CWS: A Comparative Web Search System. Jian-Tao Sun, Xuanhui Wang, Dou Shen, Hua-Jun Zeng, Zheng Chen. Microsoft Research Asia, University of Illinois at.

Struggling or Exploring? Disambiguating Long Search Sessions
Temporal Query Log Profiling to Improve Web Search Ranking Alexander Kotov (UIUC) Pranam Kolari, Yi Chang (Yahoo!) Lei Duan (Microsoft)
Random Forest Predrag Radenković 3237/10
Evaluation. Rong Jin. Evaluation is key to building effective and efficient search engines, usually carried out in controlled experiments.
WSCD INTRODUCTION  Query suggestion has often been described as the process of making a user query resemble more closely the documents it is expected.
Searchable Web sites Recommendation. Date: 2012/2/20. Source: WSDM'11. Speaker: I-Chih Chiu. Advisor: Dr. Koh Jia-ling.
Automatic Discovery and Classification of search interface to the Hidden Web. Dean Lee and Richard Sia, Dec 2nd 2003.
Web queries classification. Nguyen Viet Bang, WING group meeting, June 9th 2006.
1 Automatic Identification of User Goals in Web Search Uichin Lee, Zhenyu Liu, Junghoo Cho Computer Science Department, UCLA {uclee, vicliu,
University of Kansas Department of Electrical Engineering and Computer Science Dr. Susan Gauch April 2005 I T T C Dr. Susan Gauch Personalized Search Based.
Web Archive Information Retrieval Miguel Costa, Daniel Gomes (speaker) Portuguese Web Archive.
Distributed Representations of Sentences and Documents
Cohort Modeling for Enhanced Personalized Search. Jinyun Yan, Wei Chu, Ryen White. Rutgers University, Microsoft Bing, Microsoft Research.
Finding Advertising Keywords on Web Pages. Scott Wen-tau Yih, Joshua Goodman, Microsoft Research; Vitor R. Carvalho, Carnegie Mellon University.
Adapting Deep RankNet for Personalized Search
Promote your website and get top listed in search engines Section E2 Andreas Livadiotis.
Information Re-Retrieval Repeat Queries in Yahoo’s Logs Jaime Teevan (MSR), Eytan Adar (UW), Rosie Jones and Mike Potts (Yahoo) Presented by Hugo Zaragoza.
Lucent Technologies – Proprietary Use pursuant to company instruction Learning Sequential Models for Detecting Anomalous Protocol Usage (work in progress)
Introduction The large amount of traffic nowadays in Internet comes from social video streams. Internet Service Providers can significantly enhance local.
From Devices to People: Attribution of Search Activity in Multi-User Settings Ryen White, Ahmed Hassan, Adish Singla, Eric Horvitz Microsoft Research,
Language Identification of Search Engine Queries Hakan Ceylan Yookyung Kim Department of Computer Science Yahoo! Inc. University of North Texas 2821 Mission.
Active Learning for Class Imbalance Problem
Learning with Positive and Unlabeled Examples using Weighted Logistic Regression Wee Sun Lee National University of Singapore Bing Liu University of Illinois,
Date: 2012/10/18. Author: Makoto P. Kato, Tetsuya Sakai, Katsumi Tanaka. Source: World Wide Web conference (WWW '12). Advisor: Jia-ling, Koh. Speaker: Jiun.
Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007.
Understanding and Predicting Graded Search Satisfaction Tang Yuk Yu 1.
Accessing the Deep Web Bin He IBM Almaden Research Center in San Jose, CA Mitesh Patel Microsoft Corporation Zhen Zhang computer science at the University.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Improving Web Spam Classification using Rank-time Features. September 25, 2008. TaeSeob Yun, KAIST DATABASE & MULTIMEDIA LAB.
CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.
 Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University.
Hao Wu Nov Outline Introduction Related Work Experiment Methods Results Conclusions & Next Steps.
Understanding and Predicting Personal Navigation Date : 2012/4/16 Source : WSDM 11 Speaker : Chiu, I- Chih Advisor : Dr. Koh Jia-ling 1.
Presenter: Lung-Hao Lee ( 李龍豪 ) January 7, 309.
Improving Cloaking Detection Using Search Query Popularity and Monetizability Kumar Chellapilla and David M Chickering Live Labs, Microsoft.
Confidence-Aware Graph Regularization with Heterogeneous Pairwise Features. Yuan Fang, University of Illinois at Urbana-Champaign; Bo-June (Paul) Hsu, Microsoft.
Greedy is not Enough: An Efficient Batch Mode Active Learning Algorithm. Chen, Yi-wen (陳憶文), Graduate Institute of Computer Science & Information Engineering.
Analysis of Topic Dynamics in Web Search Xuehua Shen (University of Illinois) Susan Dumais (Microsoft Research) Eric Horvitz (Microsoft Research) WWW 2005.
By Gianluca Stringhini, Christopher Kruegel and Giovanni Vigna Presented By Awrad Mohammed Ali 1.
A Statistical Comparison of Tag and Query Logs Mark J. Carman, Robert Gwadera, Fabio Crestani, and Mark Baillie SIGIR 2009 June 4, 2010 Hyunwoo Kim.
Qi Guo Emory University Ryen White, Susan Dumais, Jue Wang, Blake Anderson Microsoft Presented by Tetsuya Sakai, Microsoft Research.
Improving Search Results Quality by Customizing Summary Lengths. Michael Kaisser, Marti Hearst and John B. Lowe. University of Edinburgh, UC Berkeley,
Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification John Blitzer, Mark Dredze and Fernando Pereira University.
CoCQA : Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation Baoli Li, Yandong Liu, and Eugene Agichtein.
Kevin C. Chang. About the collaboration -- Cazoodle 2 Coming next week: Vacation Rental Search.
A Classification-based Approach to Question Answering in Discussion Boards. Liangjie Hong, Brian D. Davison, Lehigh University (SIGIR '09). Speaker: Cho,
Who Uses Web Search for What? And How? Contribution: combine behavioral observation and demographic features of users; provide important insight.
Date: 2012/11/29 Author: Chen Wang, Keping Bi, Yunhua Hu, Hang Li, Guihong Cao Source: WSDM’12 Advisor: Jia-ling, Koh Speaker: Shun-Chen, Cheng.
Context-Aware Query Classification Huanhuan Cao, Derek Hao Hu, Dou Shen, Daxin Jiang, Jian-Tao Sun, Enhong Chen, Qiang Yang Microsoft Research Asia SIGIR.
NTU & MSRA Ming-Feng Tsai
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Date: 2013/9/25 Author: Mikhail Ageev, Dmitry Lagun, Eugene Agichtein Source: SIGIR’13 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Improving Search Result.
A Framework for Detection and Measurement of Phishing Attacks Reporter: Li, Fong Ruei National Taiwan University of Science and Technology 2/25/2016 Slide.
Predicting Short-Term Interests Using Activity-Based Search Context CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.
Web Intelligence and Intelligent Agent Technology 2008.
Coached Active Learning for Interactive Video Search Xiao-Yong Wei, Zhen-Qun Yang Machine Intelligence Laboratory College of Computer Science Sichuan University,
Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon -Smit Shilu.
Ariel Fuxman, Panayiotis Tsaparas, Kannan Achan, Rakesh Agrawal (2008) - Akanksha Saxena 1.
Opinion spam and Analysis. Software Engineering Lab, 최효린 (Choi Hyo-rin).
DOWeR Detecting Outliers in Web Service Requests Master’s Presentation of Christian Blass.
Clustering Web Queries. John S. Whissell, Charles L.A. Clarke, Azin Ashkan. CIKM '09. Speaker: Hsin-Lan, Wang. Date: 2010/08/31.
Guillaume-Alexandre Bilodeau
Detecting Online Commercial Intention (OCI)
Distributed Representation of Words, Sentences and Paragraphs
CIKM Competition 2014 Second Place Solution
Mining Query Subtopics from Search Log Data
Struggling and Success in Web Search
Ryen White, Ahmed Hassan, Adish Singla, Eric Horvitz
Date: 2012/11/15 Author: Jin Young Kim, Kevyn Collins-Thompson,
Presentation transcript:

Improving Classification Accuracy Using Automatically Extracted Training Data. Ariel Fuxman, A. Kannan, A. Goldberg, R. Agrawal, P. Tsaparas, J. Shafer. Search Labs, Microsoft Research – Silicon Valley, Mountain View, CA.

Web as a Source of Training Data. For classification tasks, large amounts of training data can significantly improve accuracy. How do we create large training sets? Conventional methods that rely on human labelers are expensive and do not scale. Thesis: the Web can be used to automatically create labeled data.

In this talk: validate the thesis on a task of practical importance, retail intent identification in Web search; present desirable properties of sources of labeled data; and show how to extract labeled data from those sources.

Importance of Retail Intent Queries. [Chart: share of searches (% of total search queries) vs. share of paid clicks (% of queries leading to paid clicks). Source: "Just Behave: A Look at Searcher Behavior – Total U.S. Market", comScore, Feb 2009.]

Application of Retail Intent: provide an enhanced user experience around Commerce Search.

Retail intent identification. Definition: a query posed to a search engine has retail intent if most users who type the query have the intent to buy a tangible product.
Examples:
Queries with retail intent: Zune 80 gb; Buy ipod; Digital camera lenses.
Queries without retail intent: Medical insurance; Free ringtones; Digital camera history.

Data Sources for Retail Intent. Sources: Web sites of retailers (e.g., Amazon, Walmart, Buy.com). Training data: queries typed directly into the search box of a retailer, extracted from toolbar logs. [Figure: URL in toolbar log.]
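To make the extraction step concrete, here is a minimal sketch of pulling queries out of retailer search URLs found in a toolbar log. It assumes log entries are URL strings and that each retailer exposes the typed query through a known URL parameter; the parameter names and log format below are illustrative assumptions, not details from the paper.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical mapping from retailer domain to the URL parameter that carries
# the search-box query; the parameter names are assumptions for illustration.
RETAILER_QUERY_PARAMS = {
    "www.amazon.com": "field-keywords",
    "www.walmart.com": "query",
    "www.buy.com": "qu",
}

def extract_query(url):
    """Return (retailer domain, query) for a retailer search URL, else None."""
    parsed = urlparse(url)
    param = RETAILER_QUERY_PARAMS.get(parsed.netloc.lower())
    if param is None:
        return None
    values = parse_qs(parsed.query).get(param)
    if not values:
        return None
    query = values[0].strip().lower()
    return (parsed.netloc.lower(), query) if query else None

def extract_training_queries(toolbar_log_lines):
    """Scan toolbar-log URLs and collect the queries typed on retailer sites."""
    hits = (extract_query(line.strip()) for line in toolbar_log_lines)
    return [hit for hit in hits if hit is not None]
```

Each extracted query becomes a positive training example tagged with the retailer it came from; queries extracted from a non-commercial source such as Wikipedia would be handled analogously as negative examples.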

Desirable Properties of Web Data Sources. Popularity: sources should yield large amounts of data. Orthogonality: sources should provide training data about different regions of the training space. Separation: sources should provide either positive or negative examples of the target class, but not both.

Popularity: sources should yield large amounts of data. For retail intent identification, Web site traffic is a proxy for popularity, and more traffic means more queries. We therefore choose retailer Web sites based on a publicly available traffic report (Hitwise).

Orthogonality: sources should provide training data about different regions of the training space. For retail intent identification, positive examples come from the top sites in "Departmental Stores" and "Classified Ads" (Amazon and Craigslist), and negative examples from the top site in "Reference" (Wikipedia).

Separation: training examples must unambiguously reflect the intended meaning of most users. Example: there is a book called "World War I", but the intent of that query is mostly non-commercial. Separation can be enforced by removing groups of confusable queries from the sources.

Method to Enforce Separation: create "groups" of positive queries, compare the word-frequency distribution of each group against the negative class using Jensen-Shannon divergence, and remove groups with low divergence.
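A minimal sketch of this filtering step, assuming query groups and the negative class are represented as lists of query strings; the divergence threshold is an assumed value for illustration, not one reported in the paper.

```python
import math
from collections import Counter

def js_divergence(p_counts, q_counts):
    """Jensen-Shannon divergence (base 2) between two word-count distributions."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) or 1
    q_total = sum(q_counts.values()) or 1
    js = 0.0
    for w in vocab:
        p = p_counts.get(w, 0) / p_total
        q = q_counts.get(w, 0) / q_total
        m = 0.5 * (p + q)
        if p > 0:
            js += 0.5 * p * math.log2(p / m)
        if q > 0:
            js += 0.5 * q * math.log2(q / m)
    return js

def keep_separated_groups(groups, negative_queries, threshold=0.5):
    """Drop positive-query groups whose word distribution is too close to the
    negative class (low JS divergence). The threshold is an assumed value."""
    neg_counts = Counter(w for q in negative_queries for w in q.split())
    kept = {}
    for name, queries in groups.items():
        grp_counts = Counter(w for q in queries for w in q.split())
        if js_divergence(grp_counts, neg_counts) >= threshold:
            kept[name] = queries
    return kept
```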

Groups for Retail Intent: extracting groups from the toolbar log. [Figure: URL in toolbar log.]

Enforcing the separation property: JS divergence of the Amazon and Craigslist groups with respect to Wikipedia. See the paper for experimental validation.

Experiments. Setup: built multiple classifiers using manual and automatically extracted labels in the training sets. Classification method: logistic regression, using unigrams and bigrams as features. Test set: 5K queries randomly sampled from a query log and labeled using Mechanical Turk.
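A sketch of this classification setup using scikit-learn; the library choice, hyperparameters, and the toy training queries (taken from the example slide earlier) are assumptions beyond what the slides specify, which is only logistic regression over unigram and bigram features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_retail_intent_classifier():
    """Logistic regression over unigram and bigram query features."""
    return Pipeline([
        ("ngrams", CountVectorizer(ngram_range=(1, 2), lowercase=True)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

# Illustrative usage: positives would come from retailer-site queries,
# negatives from Wikipedia queries; the tiny set below is just for demonstration.
train_queries = ["zune 80 gb", "buy ipod", "digital camera lenses",
                 "medical insurance", "free ringtones", "digital camera history"]
train_labels = [1, 1, 1, 0, 0, 0]

model = build_retail_intent_classifier()
model.fit(train_queries, train_labels)
print(model.predict(["cheap laptop", "history of photography"]))
```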

Automatic vs. Manual: the accuracy of the classifier trained on extracted labels is on par with that of the classifier trained on manual labels.

Combining Manual and Automatically Extracted Labels: results are only marginally different from using automatically extracted labels alone.

Using Unlabeled Data: performance of the automatic-labels classifier is still on par with classifiers that start with manual labels and exploit unlabeled data using self-training.
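For reference, a minimal self-training loop of the kind used in this baseline: train on the labeled seed, add confidently predicted unlabeled queries to the training set, and repeat. The confidence threshold and iteration cap are assumed values, not parameters from the paper.

```python
import numpy as np

def self_train(model, labeled_texts, labels, unlabeled_texts,
               confidence=0.9, max_iter=5):
    """Iteratively add confidently predicted unlabeled queries to the training set."""
    texts, y = list(labeled_texts), list(labels)
    pool = list(unlabeled_texts)
    for _ in range(max_iter):
        model.fit(texts, y)
        if not pool:
            break
        probs = model.predict_proba(pool)
        confident = np.max(probs, axis=1) >= confidence
        if not confident.any():
            break
        preds = np.argmax(probs, axis=1)
        texts += [t for t, c in zip(pool, confident) if c]
        y += [model.classes_[p] for p, c in zip(preds, confident) if c]
        pool = [t for t, c in zip(pool, confident) if not c]
    return model
```

The `model` argument can be any scikit-learn-style classifier with `fit`, `predict_proba`, and `classes_`, such as the pipeline sketched above.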

Conclusions. By carefully choosing the data sources, we can extract valuable training data. Using large amounts of automatically extracted training data, we obtain classifiers that are on par with those trained on manual labels. As future work, we would like to apply this experience to other classification tasks.