Human-powered Sorts and Joins

At a high level
Yet another paper on crowd algorithms
– Probably the second to be published (so keep that in mind when you think about kinks in the paper)
If the previous paper(s) can be viewed as “theoretical”, this paper definitely falls on the “practical” side.
– Lots of “practical” advice and algorithms
– Testing on real crowds
Hard to point at “one” algorithm here because there are multiple problems, optimizations, and ideas.

Three Key Components
– Their system, Qurk
– Sorting
– Joins

Qurk
A declarative “workflow” system encapsulating human predicates as UDFs
– User-defined functions
– Commonly also used by relational databases to capture operations outside relational algebra
– Typically external API calls
We’ll see other comparable systems later on.

Why such a system?

– Takes away repeated code and redundancy
– No need for manual optimization
– Less cumbersome to specify

Query model: SQL

Qurk filter: inappropriate content
photos(id PRIMARY KEY, picture IMAGE)
Query = SELECT * FROM photos WHERE isSmiling(photos.picture);
isSmiling is the UDF. First paper to represent crowd calls as UDF invocations!

UDFs as Tasks
Instead of writing code for UDFs, they can be described at a high level using Tasks.
Tasks = high-level templates for commonly occurring crowd operations and/or algorithms: Filter, Generate, Sort, Join.

TASK isSmiling(picture) TYPE Filter:
Prompt: “Is the cat above smiling?”, picture
Combiner: MajorityVote
(Other task TYPEs: Generate, Sort, Join, Group, …)
Note: here, a task is an interface description for a crowd operation PER ITEM, coupled with an accuracy combiner PER ITEM. In CrowdScreen, we had accuracy OVERALL and we expected the system to guarantee it.
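As a concrete illustration, here is a minimal sketch (not Qurk's implementation) of what a per-item Filter task with a MajorityVote combiner boils down to; ask_crowd is a hypothetical function that posts one yes/no HIT and returns a single worker's answer.

```python
# Sketch of a per-item Filter task with a MajorityVote combiner.
# ask_crowd(prompt, item) is hypothetical: it posts one yes/no HIT and
# returns True/False from a single worker.

def is_smiling(picture, ask_crowd, replicas=5):
    """Ask `replicas` workers about one item and combine with a majority vote."""
    answers = [ask_crowd("Is the cat above smiling?", picture) for _ in range(replicas)]
    return sum(answers) > replicas / 2

def crowd_filter(photos, ask_crowd):
    """Apply the per-item predicate to every tuple, as a Filter task would."""
    return [p for p in photos if is_smiling(p["picture"], ask_crowd)]
```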

QualityAdjust
Yet another primitive they leverage from prior work
– Uses the EM (Expectation Maximization) algorithm
– Repeated iterations until convergence
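A minimal sketch of what an EM-based QualityAdjust might look like for binary answers: a simplified one-accuracy-per-worker model, not Qurk's actual implementation. `votes` is an assumed input format.

```python
# EM-style quality adjustment for binary answers (sketch).
# `votes` maps item -> list of (worker, answer) pairs, answers in {0, 1}.
from collections import defaultdict

def quality_adjust(votes, iters=20):
    # Initialize each item's probability of a "1" label from the raw vote fraction.
    p = {item: sum(a for _, a in ans) / len(ans) for item, ans in votes.items()}
    acc = {}
    for _ in range(iters):
        # M-step: re-estimate each worker's accuracy from the current item estimates.
        agree, total = defaultdict(float), defaultdict(int)
        for item, ans in votes.items():
            for w, a in ans:
                agree[w] += p[item] if a == 1 else (1 - p[item])
                total[w] += 1
        # Clamp accuracies to avoid zero likelihoods in degenerate cases.
        acc = {w: min(0.99, max(0.01, agree[w] / total[w])) for w in total}
        # E-step: re-estimate each item's label probability using worker accuracies.
        for item, ans in votes.items():
            like1 = like0 = 1.0
            for w, a in ans:
                like1 *= acc[w] if a == 1 else (1 - acc[w])
                like0 *= (1 - acc[w]) if a == 1 else acc[w]
            p[item] = like1 / (like1 + like0)
    return p, acc
```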

(Slide shows two example HITs side by side, each asking “Is the cat above smiling?” with Yes/No options.)

Template: Generative
Goal: labels, text passages, phone numbers, open-ended answers (e.g., enumeration)

At its heart…
Generate/Filter:
– sequence of questions (one per tuple)
– “procedure” for solving each question
– per-question cost
– per-question procedure
Sort/Join is different…

Sorts/Joins: somewhat confusing…
This is no longer a task PER ITEM; you’re sorting a group of items!
Why specify accuracy (i.e., a combiner function) for FILTER but not for RANK?
What guarantees will you get? How much are you spending?

Joins: the “possibly” clause
Is this confusing? Akin to “hints” for the optimizer.

(Qurk architecture diagram: user queries and input data enter the system; a Query Optimizer and Executor operate over the DB; a Task Manager, Task Cache, Statistics Manager, and HIT Compiler turn internal HIT tasks into compiled HITs posted to MTurk; HIT results and saved results flow back as results to the user.)

Some drawbacks…
Qurk (somewhat) sweeps accuracy and latency under the rug, in favor of cost.
– Qurk may be better placed than the user to reason about accuracy
– Should we always use majority vote per question, or should we spread fewer instances across many questions (e.g., in ranking)?
Even for cost, it is not clear how to specify it in a query, and how the system should use it across operators…

Three Key Components
– Their system, Qurk
– Sorting
– Joins

Sort: a super important problem!

Interfaces
– Comparison (“more dangerous?”)
– Rating (“how dangerous?”)
The first paper to clearly articulate the use of multiple interfaces to get similar data!

Batching: several comparisons (“<” questions) per HIT. Novel idea!

Problems with batching…
In some cases, the same effect as batching can be achieved by simply reducing the cost per task.
– Is this true? How far can we go with this?
– Is it exploitative?

What are other issues with batching?

Correlated answers? Fatigue?
Why is batching still done?
– Instructions provided once: saves time/cost
– Forces all workers to attempt all questions (e.g., in a survey)

Measuring Quality of Results
Kendall’s tau rank correlation. Range: [-1, 1].
(Slide example: a b c d vs. its reverse d c b a, and a b c d vs. itself, which has tau = 1.)
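A minimal sketch of Kendall's tau for two rankings of the same items (no ties), matching the slide's examples.

```python
# Kendall's tau between two rankings of the same items (no tie handling).
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    pos_a = {x: i for i, x in enumerate(rank_a)}
    pos_b = {x: i for i, x in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant if both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau("abcd", "abcd"))  # 1.0  (identical rankings)
print(kendall_tau("abcd", "dcba"))  # -1.0 (reversed ranking)
```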

Completely Comparison-Based
– Tau = 1 (completely accurate)
– O(#items²)
– Q: Do we really need O(#items²)? The paper argues that cycles may be present and hence quicksort-like algorithms will not work. But we can certainly repeat each question multiple times! C·n log n may still be < n².

Completely Comparison-Based: Tau = 1 (completely accurate), O(#items²)
Completely Rating-Based: Tau ≈ 0.8 (accurate), O(#items)
– Q: If I scale up the number of ratings per item, can I approach the quality of comparison-based tasks?
– Interesting experiment!

Hybrid Schemes
First, gather a bunch of ratings; order based on average ratings.
Then, use comparisons, in one of three flavors:
– Random: pick S items, compare
– Confidence-based: pick the most confusing “window”, compare that first, repeat
– Sliding-window: for all windows, compare
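A minimal sketch of the ratings-then-comparisons hybrid in its sliding-window flavor; rate and compare stand in for hypothetical crowd calls, and the details (window step, replication count) are assumptions rather than the paper's exact procedure.

```python
# Hybrid sort sketch: order by mean rating, then locally re-sort each window
# using pairwise comparison HITs.
# rate(item) returns one worker's numeric score; compare(a, b) returns True if
# the crowd says a < b.
from functools import cmp_to_key
from statistics import mean

def hybrid_sort(items, rate, compare, ratings_per_item=5, window=5):
    # Step 1: O(n) rating HITs; order items by their mean rating.
    order = sorted(items,
                   key=lambda x: mean(rate(x) for _ in range(ratings_per_item)))
    # Step 2: slide a window over the rating-based order and re-sort each
    # window with comparisons.
    crowd_cmp = cmp_to_key(lambda a, b: -1 if compare(a, b) else 1)
    for start in range(len(order) - window + 1):
        order[start:start + window] = sorted(order[start:start + window],
                                             key=crowd_cmp)
    return order
```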

Results: Sliding Window > Confidence > Random. Weird results for window = 6 and 5…

Can you think of other Hybrid Schemes?

1: Divide ratings up into 10 overlapping buckets, compare all pairs in each bucket
2: Start with the current sort and compare pairs of items, and keep comparing pairs
3: Use variance to determine windows; e.g., an item is compared to all other items that its score +/- variance overlaps with.
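A rough sketch of idea 3 above, using the standard deviation of each item's ratings as its spread; the `ratings` input format and overlap rule are assumptions for illustration.

```python
# Use rating spread to decide which pairs are ambiguous enough to compare.
# `ratings` maps item -> list of numeric crowd ratings (assumed already gathered).
from itertools import combinations
from statistics import mean, pstdev

def pairs_to_compare(ratings):
    interval = {item: (mean(r) - pstdev(r), mean(r) + pstdev(r))
                for item, r in ratings.items()}
    pairs = []
    for a, b in combinations(ratings, 2):
        lo_a, hi_a = interval[a]
        lo_b, hi_b = interval[b]
        if lo_a <= hi_b and lo_b <= hi_a:   # score +/- spread overlaps
            pairs.append((a, b))            # ambiguous pair: ask the crowd
    return pairs
```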

Fail fast on a bug or an ambiguous task?
Fleiss’ Kappa (inter-rater agreement). Range: roughly [0, 1] in practice (it can dip below 0 when agreement is worse than chance).
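A minimal sketch of Fleiss' kappa over a batch of tasks, given per-item answer counts (every item assumed to have the same number of answers).

```python
# Fleiss' kappa sketch.
# counts[i][j] = number of workers who gave item i answer category j.

def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])                      # answers per item
    n_cats = len(counts[0])
    # Per-item observed agreement P_i and overall category proportions p_j.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]
    p_bar = sum(p_i) / n_items                     # mean observed agreement
    p_e = sum(p * p for p in p_j)                  # expected agreement by chance
    return (p_bar - p_e) / (1 - p_e)

# Example: 3 yes/no questions, 5 answers each; near-unanimous answers -> ~0.72.
print(round(fleiss_kappa([[5, 0], [4, 1], [0, 5]]), 2))
```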

Ambiguity (slide shows rating criteria from less to more ambiguous: adult size, dangerousness, likelihood to be on Saturn).

Sort summary
– 2-10x cost reduction
– Exploit humans’ ability to batch (but how does this affect price?)
– Quality signal: tau
– Fail-fast signal: kappa
– Hybrid algorithms balance accuracy and price

Join: human-powered entity resolution (e.g., International Business Machines == IBM)

Matching celebrities

Simple join O(nm)

Naïve batching join O(nm/b)

Smart join: O(nm/b²). 4-10x reduction in cost. Errors??
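A back-of-the-envelope sketch of the HIT counts behind the O(nm), O(nm/b), and O(nm/b²) figures; the 30x30, batch-10, and 3x3 numbers mirror the experiment sizes mentioned later in the slides.

```python
# HIT counts for the three join strategies (counts only; interfaces omitted).
import math

def simple_join_hits(n, m):
    return n * m                       # one pair per HIT

def naive_batch_hits(n, m, b):
    return math.ceil(n * m / b)        # b pairs listed in one HIT

def smart_batch_hits(n, m, b):
    # Each HIT shows a b x b grid of items from the two tables and asks the
    # worker to mark all matching pairs, covering b*b pairs at once.
    return math.ceil(n / b) * math.ceil(m / b)

print(simple_join_hits(30, 30),        # 900
      naive_batch_hits(30, 30, 10),    # 90
      smart_batch_hits(30, 30, 3))     # 100
```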

Can you think of better join algorithms?

Exploit Transitivity!
Intuition: if A joins with X, A does not join with Y, and B joins with X, do we need to compare B and Y?
How much does skipping comparisons save us?
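A minimal sketch of how transitivity could be exploited with a union-find structure plus a set of implied non-matches; crowd_match is a hypothetical crowd call, and this illustrates the idea rather than the paper's algorithm.

```python
# Skip join questions whose answers are already implied by transitivity.
# crowd_match(a, b) is a hypothetical crowd call returning True/False.

class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:                     # path halving
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def transitive_join(left, right, crowd_match):
    uf, non_match, asked = UnionFind(), set(), 0
    for a in left:
        for b in right:
            ra, rb = uf.find(a), uf.find(b)
            if ra == rb:
                continue                               # match already implied
            if (ra, rb) in non_match or (rb, ra) in non_match:
                continue                               # non-match already implied
            asked += 1
            if crowd_match(a, b):
                uf.union(a, b)
            else:
                non_match.add((ra, rb))
    pairs = [(a, b) for a in left for b in right if uf.find(a) == uf.find(b)]
    return pairs, asked                                # asked <= len(left)*len(right)
```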

Join heuristics: cheap per-item features (gender, hair color, skin color) used to filter candidate pairs. 50-66% reduction in cost.
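A minimal sketch of the feature-filtering heuristic: gather cheap per-item features first, then only ask the expensive match question for pairs that agree on every feature; get_features and crowd_match are hypothetical crowd calls.

```python
# Feature filtering before the pairwise match HITs.
# get_features(item, features) returns a dict of feature -> value for one item;
# crowd_match(a, b) returns True/False for the expensive pairwise question.

def feature_filtered_join(left, right, get_features, crowd_match,
                          features=("gender", "hair color", "skin color")):
    # O(n + m) cheap feature HITs.
    feats = {item: get_features(item, features) for item in left + right}
    # Only pairs that agree on every feature go to the expensive match HITs.
    candidate_pairs = [(a, b) for a in left for b in right
                       if all(feats[a][f] == feats[b][f] for f in features)]
    matches = [(a, b) for a, b in candidate_pairs if crowd_match(a, b)]
    return matches, len(candidate_pairs)   # how many pairs we actually asked about
```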

Could go wrong! If there are cases where:
– the feature is ambiguous
– feature equality does not imply join/not-join
– selectivity is not helpful

Q: What other techniques could help us?
Still O(n) to get these feature values.
Machine learning?
– How do we apply it?
– Maybe use the input as labeled data, and learn on other features?
– Maybe use the crowd to provide features?

Could features help in sorting?

E.g., if pictures taken outside are always better than pictures taken inside, can you use that as a “feature”?
– Can it be better than ratings?

Join summary
– 10-25x cost reduction
– Exploit humans’ ability to batch
– Feature extraction/learning to reduce work

Summary
– System + workflow model: Qurk
– Sort: 2-10x cost reduction
– Join: 10-25x cost reduction

Exposition-Wise… What could the paper have done better?
– Maybe get rid of Qurk altogether?
– More rigorous experiments?
– Other ideas?

Other Open Issues
– Hard to extrapolate accuracy, latency, and cost from a few experiments of 100 items each
– Cannot reason about batch size independent of cost per HIT
– Not clear how batching affects quality
– Not clear how results generalize to other scenarios

Questions/Comments?

Discussion Questions
Q: How can you help requesters reduce costs?
– Use optimized strategies
– Batching + reduced price
– Improve instructions & interfaces
– Training and elimination
– Only allow good workers to work on tasks
– Use machine learning

Discussion Questions
Q: What are the different ways crowd algorithms like this can be used in conjunction with machine learning?
– Input/training data
– Active learning
– ML feeds crowds

Discussion Questions
Q: How would you go about gauging human error rates on a batch of filtering tasks that you’ve never seen before?
– You could have the “requester” create “gold standard” questions, but that is hard: people learn, it is costly, and it doesn’t capture all issues.
– You could try to use “majority” rule, but what about difficulty? What about expertise?

30x30 Join: 4-10x reduction in cost

Majority Vote vs. Quality Adjust
– Simple Join: .933 vs. .967
– Naive Batch 10: .6 vs. .867
– Smart Batch 3x3: .5 vs. .867

30x30 Join

Common questions in crowdsourcing integration? $/worker? # workers? worker quality? correct answer? design patterns? workflow design? latency?