Collecting High Quality Overlapping Labels at Low Cost
Grace Hui Yang, Language Technologies Institute, Carnegie Mellon University
Anton Mityagin, Krysta Svore, Sergey Markov, Microsoft Bing / Microsoft Research

Roadmap
- Introduction
- How to Use Overlapping Labels
- Selective Overlapping Labeling
- Experiments
- Conclusion and Discussion

Introduction
Web search / learning to rank:
- Web documents (urls) are represented by feature vectors;
- A ranker learns a model from the training data and computes a rank order of the urls for each query.
The goal:
- Retrieve relevant documents,
- i.e., achieve high retrieval accuracy measured by NDCG, MAP, etc.

Factors Affecting Retrieval Accuracy
Amount of training examples:
- The more training examples, the better the accuracy.
Quality of training labels:
- The higher the quality of the labels, the better the accuracy.

Figure cited from [Sheng et al.], KDD'08. Usually a large set of training examples is used; however...

Solution: Improving the Quality of Labels
Label quality depends on:
- Expertise of the labelers
- The number of labelers
The more expert the labelers, and the more labelers, the higher the label quality. Cost!!!

The Current Approaches
A lot of (cheap) non-expert labels for a sample:
- Labelers from Amazon Mechanical Turk
- Weakness: labels are often unreliable
Just one label from an expert for a sample:
- The single labeling scheme
- Widely used in supervised learning
- Weakness: personal bias

We Propose a New Labeling Scheme
High quality labels:
- e.g., labels that yield high retrieval accuracy
- Labels are overlapping labels from experts
At low cost:
- Only request additional labels when they are needed

Roadmap
- Introduction
- How to Use Overlapping Labels
- Selective Overlapping Labeling
- Experiments
- Conclusion and Discussion

Labels
A label indicates the relevance of a url to a query:
Perfect, Excellent, Good, Fair, and Bad.

How to Use Overlapping Labels
How to aggregate overlapping labels?
- Majority, median, mean, something else?
Change the weights of the labels?
- Perfect x3, Excellent x2, Good x2, Bad x0.5?
Use overlapping labels only on selected samples?
How much overlap?
- 2x, 3x, 5x, 100x?

Aggregating Overlapping Labels (n training samples, k labelers)
K-Overlap (use all k labels):
- When k = 1 this is the single labeling scheme: training cost n, labeling cost 1.
- In general: training cost kn, labeling cost k.
Majority vote:
- Training cost n, labeling cost k.
Highest label:
- Sort the k labels from most relevant to least relevant (P/E/G/F/B) and pick the label at the top of the sorted list.
- Training cost n, labeling cost k.
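To make the three aggregation rules concrete, here is a minimal Python sketch (not the authors' code). It assumes each query/url pair comes with a list of grades drawn from {Perfect, Excellent, Good, Fair, Bad}; the tie-breaking rule in the majority vote (toward the more relevant grade) is an illustrative choice, not something the slides specify.

# Three aggregation rules over overlapping labels for one query/url pair.
from collections import Counter

GRADE_ORDER = ["Bad", "Fair", "Good", "Excellent", "Perfect"]  # least to most relevant

def k_overlap(labels):
    """K-Overlap: keep every label; each one becomes its own training copy."""
    return list(labels)

def majority_vote(labels):
    """Majority vote: most frequent grade; ties broken toward higher relevance (assumption)."""
    counts = Counter(labels)
    return max(counts, key=lambda g: (counts[g], GRADE_ORDER.index(g)))

def highest_label(labels):
    """Highest label: the most relevant grade among the k labels."""
    return max(labels, key=GRADE_ORDER.index)

labels = ["Good", "Fair", "Good"]
print(k_overlap(labels), majority_vote(labels), highest_label(labels))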

Weighting the Labels
Assign different weights to labels:
- Samples labeled P/E/G are assigned weight w1;
- Samples labeled F/B are assigned weight w2;
- w1 = θ·w2, θ > 1.
Intuition: "Perfect" probably deserves more weight than other labels:
- "Perfect" labels are rare in the training data
- Web search emphasizes precision
Training cost = n, labeling cost = 1.
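A minimal sketch of this weighting scheme, assuming per-sample weights are passed to the learner; θ = 3 mirrors the If-good-x3 setting described later in the deck, but both θ and w2 are free parameters here, not values fixed by the slides.

# Assign weight w1 = theta * w2 to Good+ samples, w2 to Fair/Bad samples.
GOOD_PLUS = {"Perfect", "Excellent", "Good"}

def sample_weight(label, theta=3.0, w2=1.0):
    return theta * w2 if label in GOOD_PLUS else w2

print([sample_weight(l) for l in ["Perfect", "Bad", "Good", "Fair"]])  # [3.0, 1.0, 3.0, 1.0]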

Selecting the Samples That Get Overlapping Labels
Collect overlapping labels only when they are needed for a sample: this is the proposed scheme.

Roadmap
- Introduction
- How to Use Overlapping Labels
- Selective Overlapping Labeling
- Experiments
- Conclusion and Discussion

Collect Overlapping Labels When Good+
Intuition:
- People are difficult to satisfy: they seldom say "this url is good" and often say "this url is bad"
- It is even harder for people to agree that a url is good
So:
- If someone thinks a url is good, it is worthwhile to verify with others' opinions
- If someone thinks a url is bad, we trust that judgment

If-good-k
If the first label is P/E/G, get another k−1 overlapping labels; otherwise, keep the first label and go to the next query/url.
Example (if-good-3):
- Excellent, Good, Fair
- Bad
- Good, Good, Perfect
- Fair
- …
Training cost = labeling cost; both depend on r, the Good+:Fair− ratio among the first labels.
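A minimal sketch of if-good-k as described above; request_label is a hypothetical callback standing in for asking one more judge to grade the query/url pair, not an API from the paper.

def if_good_k(query, url, k, request_label):
    GOOD_PLUS = {"Perfect", "Excellent", "Good"}
    labels = [request_label(query, url)]
    if labels[0] in GOOD_PLUS:
        # First judge says Good+: collect k-1 more labels to verify.
        labels += [request_label(query, url) for _ in range(k - 1)]
    # Otherwise the first judge says Fair/Bad: trust that single label and move on.
    return labels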

Good-till-bad
If the first label is P/E/G, get another label; if this second label is P/E/G, continue to collect one more label; stop as soon as a label is F/B.
Example (good-till-bad):
- Excellent, Good, Fair
- Bad
- Good, Good, Perfect, Excellent, Good, Bad
- Fair
- …
Training cost = labeling cost. Note that k can be large.
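A matching sketch of good-till-bad, again with a hypothetical request_label callback; the max_labels cap is an added safeguard (the experiments later use k = 11).

def good_till_bad(query, url, request_label, max_labels=11):
    GOOD_PLUS = {"Perfect", "Excellent", "Good"}
    labels = [request_label(query, url)]
    # Keep collecting labels while judges keep saying Good+; stop at the first Fair/Bad.
    while labels[-1] in GOOD_PLUS and len(labels) < max_labels:
        labels.append(request_label(query, url))
    return labels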

Roadmap
- Introduction
- How to Use Overlapping Labels
- Selective Overlapping Labeling
- Experiments
- Conclusion and Discussion

Datasets
The Clean label set:
- 2,093 queries; 39,267 query/url pairs
- 11 labels for each query/url pair
- 120 judges in total
- Two feature sets: Clean07 and Clean08
The Clean+ label set:
- 1,000 queries; 49,785 query/url pairs
- Created to evaluate if-good-k (k ≤ 3)
- 17,800 additional labels

Evaluation Metrics
NDCG for a given query at truncation level L:
NDCG@L = Z_L · Σ_{i=1}^{L} (2^{l_i} − 1) / log₂(1 + i)
where l_i is the relevance label at position i, L is the truncation level, and Z_L normalizes the score so that a perfect ranking gets 1. NDCG averaged over all queries is also reported.
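For reference, a small sketch of NDCG@L with the usual 2^label − 1 gain and log₂(1 + position) discount; the grade-to-integer mapping (Bad = 0 … Perfect = 4) is a common convention and an assumption here, not something the slides specify.

import math

GRADE_VALUE = {"Perfect": 4, "Excellent": 3, "Good": 2, "Fair": 1, "Bad": 0}

def dcg_at(gains, L):
    # Position i+1 gets discount log2((i+1) + 1) = log2(i + 2).
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:L]))

def ndcg_at(ranked_grades, L):
    gains = [GRADE_VALUE[g] for g in ranked_grades]
    idcg = dcg_at(sorted(gains, reverse=True), L)
    return dcg_at(gains, L) / idcg if idcg > 0 else 0.0

print(ndcg_at(["Good", "Perfect", "Bad", "Fair"], 3))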

Evaluation
Results are averaged over 5–10 runs for each experimental setting.
Two rankers:
- LambdaRank [Burges et al., NIPS'06]
- LambdaMART [Wu et al., MSR-TR ]
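The deck's rankers are Microsoft-internal LambdaRank/LambdaMART implementations. As a stand-in, here is a hedged sketch of training a LambdaMART-style ranker with LightGBM's LGBMRanker on toy data (feature matrix, graded integer labels, query group sizes); it is not the authors' setup.

import numpy as np
import lightgbm as lgb

X = np.random.rand(8, 5)                 # 8 query/url pairs, 5 features each
y = np.array([2, 0, 1, 3, 0, 4, 1, 2])   # graded relevance labels (Bad=0 ... Perfect=4)
group = [4, 4]                           # two queries with 4 urls each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=group)
scores = ranker.predict(X[:4])           # score the first query's urls
print(np.argsort(-scores))               # rank order by descending score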

9 Experimental Settings
- Baseline: the single labeling scheme.
- 3-overlap: 3 overlapping labels; train on all of them.
- 11-overlap: 11 overlapping labels; train on all of them.
- Mv3: majority vote of 3 labels.
- Mv11: majority vote of 11 labels.
- If-good-3: if a label is Good+, get another 2 labels; otherwise, keep this label.
- If-good-x3: assign Good+ labels three times the weight.
- Highest-3: the highest label among 3 labels.
- Good-till-bad: k = 11.

Retrieval Accuracy on Clean08, LambdaRank
Gain on Clean08 (LambdaRank): 0.46 point
ifgood3       45.03%   45.37%   45.99%*   47.53%   50.53%
highest3      44.87%   45.17%   45.97%*   47.48%   50.43%
11-overlap    44.93%   45.10%   45.96%*   47.57%   50.58%
mv11          —        45.20%   45.89%    47.56%   50.58%
ifgoodx3      44.73%   45.18%   45.80%    47.40%   50.13%
3-overlap     44.77%   45.27%   45.78%    47.54%   50.50%
mv3           44.83%   45.11%   45.66%    47.09%   49.83%
goodtillbad   44.88%   44.87%   45.58%    47.05%   49.86%
baseline      44.72%   44.98%   45.53%    46.93%   49.69%

Retrieval Accuracy on Clean08, LambdaMART
Gain on Clean08 (LambdaMART): 1.48 point
ifgood3      44.63%   45.08%   45.93%*   47.65%   50.37%
11-overlap   44.70%   45.13%   45.91%*   47.59%   50.35%
mv11         —        44.86%   45.48%    47.02%   49.97%
highest3     44.46%   44.81%   45.42%    47.16%   50.09%
ifgoodx3     43.78%   44.14%   44.80%    46.42%   49.26%
3-overlap    43.52%   44.23%   44.77%    46.49%   49.44%
baseline     43.48%   43.89%   44.45%    46.11%   49.12%
mv3          42.96%   43.25%   44.01%    45.56%   48.30%

Retrieval Accuracy on Clean+, LambdaRank
Gain on Clean+: 0.37 point
ifgood2    50.53%   49.03%   48.57%**   48.56%   50.02%
ifgood3    50.33%   48.84%   48.41%     48.48%   49.89%
baseline   50.32%   48.72%   48.20%     48.31%   49.65%
ifgoodx3   50.04%   48.51%   48.16%     48.18%   49.61%

Costs of Overlapping Labeling (Clean07)
Columns: Experiment, Labeling Cost, Training Cost, Fair−:Good+ ratio.
Experiments compared: Baseline, 3-overlap, mv3, mv11, If-good-3, If-good-x3, Highest-3, Good-till-bad, 11-overlap.

Discussion
Why does if-good-2/3 work?
- A more balanced training dataset?
- More positive training samples?
- No! (simple weighting does not perform well, so that cannot be the reason)

Discussion
Why does if-good-2/3 work?
- It better captures when a judgment is worth reconfirming
- It yields higher quality labels

Discussion
Why does it need only 1 or 2 additional labels?
- Too many opinions from different labelers may introduce too much noise and too high a variance.

Conclusions
- If-good-k is statistically better than single labeling, and statistically better than the other methods in most cases.
- Only 1 or 2 additional labels are needed for each selected sample.
- If-good-2/3 is cheap in labeling cost: ~1.4.
What doesn't work:
- Majority vote
- Simply changing the weights of labels

Thanks and Questions?

Retrieval Accuracy on Clean07, LambdaRank
Gain on Clean07 (LambdaRank): 0.95 point
ifgood3       46.23%   47.80%   49.55%**   51.81%   55.38%
mv11          —        47.76%   49.30%     51.60%   55.09%
goodtillbad   45.72%   47.80%   49.22%     51.73%   55.23%
highest3      45.75%   47.67%   49.16%     51.49%   55.01%
3-overlap     45.52%   47.48%   49.00%     51.51%   54.90%
ifgoodx3      45.25%   47.28%   48.98%     51.26%   54.82%
mv3           45.07%   47.28%   48.87%     51.36%   54.93%
11-overlap    45.25%   47.24%   48.69%     51.11%   54.58%
baseline      45.18%   47.06%   48.60%     51.02%   54.51%

Retrieval Accuracy on Clean07, LambdaMART
Gain on Clean07 (LambdaMART): 1.92 point
ifgood3      44.63%   45.08%   45.93%*   47.65%   50.37%
3-overlap    44.70%   45.13%   45.91%*   47.59%   50.35%
11-overlap   44.31%   44.86%   45.48%    47.02%   49.97%
mv11         —        44.81%   45.42%    47.16%   50.09%
ifgoodx3     43.78%   44.14%   44.80%    46.42%   49.26%
highest3     43.52%   44.23%   44.77%    46.49%   49.44%
mv3          43.48%   43.89%   44.45%    46.11%   49.12%
baseline     42.96%   43.25%   44.01%    45.56%   48.30%