Answering Queries using Humans, Algorithms & Databases Aditya Parameswaran Stanford University (Joint work with Alkis Polyzotis, UC Santa Cruz) 1/11/11.


Why Crowdsource?
Many tasks are done better by humans
 ◦ Understanding speech, images, and language
Many people are online and willing to work
Several commercial marketplaces
 ◦ Amazon's Mechanical Turk, oDesk, LiveOps, …
Several programming libraries
 ◦ TurKit, HPROC, …
Example tasks: Label/Tag, Identify, Describe, Compare, Sort, Rank

Example
Select the top-k images for a restaurant from a user-submitted image DB:
 ◦ Must display food served in the restaurant, OR
 ◦ Must display the restaurant name
 ◦ Not dark
 ◦ Not copyrighted
Can use image processing algorithms for some cases
Can look up a database containing metadata
Need to ask humans

Example: Current Solution
The programmer does all the work, implementing calls to:
 ◦ Crowd libraries, hand-coding:
    which tasks to run
    on which items
    in what order
    for what price
 ◦ Algorithms (since crowd latency may be high), specifying for which tasks, on which items, and in what order
 ◦ Relational data
The programmer must also write code to:
 ◦ Integrate the obtained information
 ◦ Deal with inconsistencies and incorrect answers from the crowd

Our Vision
[Diagram: a Query, posed over Data, Humans, and Algorithms, flows through a Declarative Query Processing Engine to produce a Result.]
Nothing out-of-the-ordinary for DB people!
 ◦ The application only provides the UI to ask questions of humans
 ◦ The remainder is handled "under the covers" by the query optimizer
 ◦ Application development becomes much simpler

Outline for the Talk
1. Example Declarative Queries
2. Need for Redesign
3. Research Challenges + Initial Ideas
   1. Query Semantics
   2. Physical Query Processing
   3. Handling Uncertainty
   4. Other Important Issues

Example Query
Find all jpg travel pictures: either large pictures of a clean beach, or pictures of a clean and safe city.

travel(I) := rJpeg(I), hClean(I), hBeach(I), aLarge(I)
travel(I) := rJpeg(I), hClean(I), haCity(I,C), rSafe(C)

Four types of predicates:
 ◦ r – relational predicates
 ◦ a – algorithmic predicates
 ◦ h – human predicates
    The UI for the question is provided by the application, e.g., "Is this an image of a clean location?"
 ◦ ha – mixed (human/algorithmic) predicates
    Can ask humans or can use algorithms
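The rule above can be sketched in code. This is a minimal, hypothetical illustration of how a rule mixing relational, algorithmic, and human predicates might be evaluated; the predicate bodies are stand-ins (a real system would consult metadata, run image classifiers, and post crowd tasks), and all names are invented for the example.

```python
def r_jpeg(img):                 # relational predicate: metadata lookup
    return img["format"] == "jpg"

def a_large(img):                # algorithmic predicate: cheap local check
    return img["width"] * img["height"] >= 1_000_000

def h_clean(img, crowd):         # human predicate: ask the crowd via a UI
    return crowd(f"Is {img['id']} an image of a clean location?")

def h_beach(img, crowd):
    return crowd(f"Does {img['id']} show a beach?")

def travel(images, crowd):
    """travel(I) :- rJpeg(I), hClean(I), hBeach(I), aLarge(I)."""
    results = []
    for img in images:
        # Evaluate cheap relational/algorithmic predicates first;
        # ask humans (the expensive part) only for images that pass.
        if r_jpeg(img) and a_large(img):
            if h_clean(img, crowd) and h_beach(img, crowd):
                results.append(img["id"])
    return results

images = [
    {"id": "i1", "format": "jpg", "width": 2000, "height": 1500},
    {"id": "i2", "format": "png", "width": 2000, "height": 1500},
    {"id": "i3", "format": "jpg", "width": 100, "height": 100},
]
always_yes = lambda q: True      # trivially agreeable "crowd" for the demo
print(travel(images, always_yes))  # -> ['i1']
```

Even this toy version shows the key optimization lever: predicate ordering decides how many (costly, slow) human questions get asked.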

Other Examples
Find all images of people at the scene of the crime who also have a criminal record:
 ◦ suspects(N, I) := rCriminal(N, P), rScene(I), haSim(I, P)
 ◦ rCriminal: database of known criminals, with images
 ◦ haSim: evaluates the presence of P in I, e.g., via the UI question "Do these images contain the same person?"
The results may then be used for an aggregation: find the best image for every criminal:
 ◦ topImg(N, hBest( )) := suspects(N, I)
 ◦ hBest: the top image

Need for a New Architecture
Tradeoffs between:
 ◦ Performance (time),
 ◦ Monetary cost, and
 ◦ Uncertainty in the query result
Unknowns:
 ◦ Selectivities
 ◦ Latency, for both algorithms and humans
 ◦ Uncertainty in answers
This combination of "evils" has never appeared before! (Individually, the pieces resemble uncertain databases, user-defined functions, and adaptive query optimization.)
Plus, some other aspects that will become clear later.

Semantics of Query Model
We want "correct answers"
 ◦ What is a correct answer? The notion is not clear:
    Correlations and inconsistencies
    Mistakes and lack of knowledge
 ◦ We use a threshold on confidence to define correctness (later)
Three semantics:
 1. Find all correct answers, minimizing cost and time
 2. Find k correct answers, minimizing cost and time
 3. Find as many correct answers as possible, minimizing time, for a fixed cost
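The three semantics can be sketched as one driver loop. This is a hypothetical simplification (one unit of cost per question, items probed in a fixed order); `ask` stands in for issuing a crowd or algorithm question that returns whether the item is a correct answer.

```python
def run_query(items, ask, k=None, budget=None):
    """k=None, budget=None -> semantics 1: all correct answers.
       k=<int>             -> semantics 2: stop after k correct answers.
       budget=<int>        -> semantics 3: as many as possible for a fixed cost."""
    answers, cost = [], 0
    for item in items:
        if budget is not None and cost >= budget:
            break                      # fixed-cost semantics: out of money
        cost += 1                      # one unit of cost per question asked
        if ask(item):
            answers.append(item)
            if k is not None and len(answers) == k:
                break                  # k-answers semantics: enough found
    return answers, cost

is_even = lambda x: x % 2 == 0         # stand-in "question"
print(run_query(range(10), is_even))            # -> ([0, 2, 4, 6, 8], 10)
print(run_query(range(10), is_even, k=2))       # -> ([0, 2], 3)
print(run_query(range(10), is_even, budget=4))  # -> ([0, 2], 4)
```

The three calls make the cost/completeness tradeoff concrete: the same question stream, stopped under different rules.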

Query Processing without Uncertainty
Two-criteria optimization, e.g., cost & time
Selectivities are not known
 ◦ Adaptive query optimization
Latency is not known
 ◦ Asynchronous execution
It is critical to reason about the information gain of a question

Asking the Right Questions
Prefer questions affecting more tuples "downstream"
If all correct answers are needed:
 ◦ Prefer selective questions
If k correct answers are needed:
 ◦ Prefer non-selective questions leading to answers
For complex tasks:
 ◦ Need to subdivide into questions that maximize information gain
    e.g., Classify, Cluster, Categorize
 ◦ We studied this problem carefully for graph search ("Human-assisted graph search: It's okay to ask questions", VLDB '11)
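The selective/non-selective preference above can be sketched with a simple benefit score. This is a hypothetical heuristic, not the actual optimizer: real systems would use information gain and estimate selectivities adaptively; the candidate tuples and numbers here are invented.

```python
def pick_next_question(candidates, find_all=True):
    """candidates: list of (question, est_selectivity, downstream_tuples).
    For 'find all answers', selective questions (low pass rate) prune the most
    downstream work; for 'find k answers', non-selective questions reach
    answers fastest."""
    def score(c):
        _, selectivity, downstream = c
        benefit = (1 - selectivity) if find_all else selectivity
        return benefit * downstream     # weight by tuples affected downstream
    return max(candidates, key=score)[0]

candidates = [
    ("hClean(i7)?", 0.9, 3),   # most tuples pass; feeds 3 downstream joins
    ("hBeach(i7)?", 0.2, 3),   # very selective: prunes 80% of tuples
    ("haCity(i9)?", 0.5, 1),
]
print(pick_next_question(candidates, find_all=True))   # -> 'hBeach(i7)?'
print(pick_next_question(candidates, find_all=False))  # -> 'hClean(i7)?'
```

Flipping the goal flips the preferred question, which matches the slide's intuition about the two semantics.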

Query Processing without Uncertainty
Intra- and inter-stage optimization
[Diagram: a pipeline of stages 1, 2, 3, …, n. Each stage performs computations on relational data, issues a set of asynchronous questions to the crowd and to algorithms, and collects results from previous stages. Prefer algorithmic and relational questions.]
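The staged pipeline can be sketched with a thread pool standing in for asynchronous crowd/algorithm questions. This is a hypothetical skeleton: real crowd answers arrive over minutes or hours, so an actual engine would interleave stages rather than block on each one; all functions and data here are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def run_stage(items, relational_filter, crowd_question):
    """One stage: cheap relational filtering first, then issue the remaining
    questions asynchronously and collect the answers."""
    survivors = [i for i in items if relational_filter(i)]   # cheap, local
    with ThreadPoolExecutor() as pool:                       # async questions
        answers = list(pool.map(crowd_question, survivors))
    return [i for i, ok in zip(survivors, answers) if ok]

items = list(range(8))
# Stage 1: relational predicate keeps evens; "crowd" keeps values under 6.
stage1 = run_stage(items, lambda x: x % 2 == 0, lambda x: x < 6)
# Stage 2: no relational filter; "crowd" keeps positive values.
stage2 = run_stage(stage1, lambda x: True, lambda x: x > 0)
print(stage2)  # -> [2, 4]
```

Filtering on relational data before issuing questions is exactly the "prefer algorithmic and relational questions" rule: it shrinks the set of items that incur crowd cost and latency.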

Uncertainty
Sources:
 ◦ Intra-predicate correlation (e.g., the answers to two images I for hClean)
 ◦ Inter-predicate correlation (e.g., a YES answer to I for hBeach implies NO for haCity)
 ◦ Subjective views, random mistakes
 ◦ Lack of knowledge
We only want correct answers (confidence > τ)
Standard techniques are insufficient!

How Do We Compute Confidence?
Scheme 1: Majority voting
 ◦ Each question is attempted by c humans
 ◦ The majority answer is taken as the correct answer
Scheme 2: Homogeneous worker population
 ◦ Per question, each worker is drawn IID from a distribution
 ◦ No cross-correlations
 ◦ Infer the distribution based on the answers of workers
Scheme 3: Item Response Theory
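The first two schemes can be sketched directly. This is a simplified, hypothetical version: the worker accuracy `p` is assumed known and shared by all workers (the slide's homogeneous-population model would actually infer it from the answers), and a uniform prior is used.

```python
def majority_vote(answers):
    """Scheme 1: answers is a list of True/False votes from c workers."""
    return answers.count(True) > len(answers) / 2

def posterior_yes(answers, p=0.8):
    """Scheme 2 (simplified): P(truth = YES | answers), with each worker
    answering correctly with probability p, independently, prior 0.5."""
    like_yes = like_no = 1.0
    for a in answers:
        like_yes *= p if a else (1 - p)        # likelihood if truth is YES
        like_no  *= (1 - p) if a else p        # likelihood if truth is NO
    return like_yes / (like_yes + like_no)     # Bayes with uniform prior

votes = [True, True, False]
print(majority_vote(votes))             # -> True
print(round(posterior_yes(votes), 2))   # -> 0.8
```

Note the difference in output type: majority voting gives only an answer, while the probabilistic model gives a confidence that can be compared against the threshold τ.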

Other Important Aspects
Pricing
 ◦ Must price tasks so that they complete
 ◦ "Important", "harder" tasks are priced higher
Spam
 ◦ Test questions or a gold standard
 ◦ Reputation systems for workers
Choosing UI questions for predicates

From Tasks to UI Questions
The choice of the UI affects:
 ◦ The number of UI questions (and thus the cost)
 ◦ The overall uncertainty of the answer
 ◦ Latency
For example, for suspects(N, I) := rCriminal(N, P), rScene(I), haSim(I, P), one UI asks "Match similar images from the two sets" (I1, I2, I3 against J1, J2, J3); another asks "Are the two images alike?" (I1 against J1).
Complex tasks: Sort, Top-k, Max
 ◦ e.g., for topImg(N, hBest( )) := suspects(N, I), a "Sort" UI over I1, I2, I3 versus a pairwise "Compare" UI over I1, I2
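A complex task like Max (the hBest aggregate) can be decomposed into simple pairwise UI questions. This is a hypothetical sketch: `compare` stands in for the UI question "Which of these two images is better?", each call costs one crowd task, and crowd answers are assumed error-free here (a real system would add redundancy for uncertainty).

```python
def crowd_max(items, compare):
    """Tournament max: n - 1 pairwise questions for n items."""
    best, cost = items[0], 0
    for item in items[1:]:
        cost += 1                     # one UI question per comparison
        if compare(item, best):       # True if `item` beats the current best
            best = item
    return best, cost

images = ["i1", "i2", "i3", "i4"]
quality = {"i1": 2, "i2": 5, "i3": 1, "i4": 4}    # hidden ground truth
compare = lambda a, b: quality[a] > quality[b]    # simulated crowd answer
print(crowd_max(images, compare))  # -> ('i2', 3)
```

This makes the slide's tradeoff concrete: a single "sort these n images" UI is one question but high-effort and high-uncertainty, while pairwise comparisons are simple but need n - 1 (or more, with redundancy) questions.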

Conclusions
Using human computation within the database:
 ◦ An important and challenging new research area for DB people
 ◦ Requires a careful redesign of the DBMS
 ◦ More parameters/tradeoffs that we need to keep track of
This is the vision of the sCOOP project (System for COmputing with and Optimizing People)
Look out for our paper at VLDB in Seattle!
 ◦ "Human-assisted Graph Search: It's okay to ask questions"
    By: A. Parameswaran, A. Das Sarma, H. Garcia-Molina, N. Polyzotis, and J. Widom