Data-Centric Human Computation Jennifer Widom Stanford University.

Presentation transcript:

Data-Centric Human Computation Jennifer Widom Stanford University

Slide 2: Mega-Trends in C.S. Research
2000's: Cloud
2010's: Crowd

Slide 3: Human Computation
Augmenting computation with the use of human abilities to solve (sub)problems that are difficult for computers:
- Object/image comparisons
- Information extraction
- Data gathering
- Relevance judgments
- Many more…
=> “Crowdsourcing”

Slide 4: Crowdsourcing Research
Marketplaces: Marketplace #1, Marketplace #2, …, Marketplace #n
Interfaces; Incentives; Trust, reputation; Spam; Pricing
Platforms
Algorithms
- Basic ops: Compare, Filter
- Complex ops: Sort, Cluster, Clean
Data Gathering / Query Answering
- Get data
- Verify

Slide 5: New Considerations & Tradeoffs
- Latency: How long can I wait?
- Cost: How much am I willing to spend?
- Uncertainty: What is my desired quality?

Slide 6: Live Experiment - Human Filter
Are there more than 40 dots?

Slide 7: Computing the Answer
Yes or No? With what confidence?
Should I ask more questions?
- More cost (-)
- Higher latency (-)
- Higher accuracy (+)
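One way to put a number on "with what confidence?" is a Bayesian update over the Yes/No answers collected so far. The sketch below is illustrative only and is not taken from the talk: it assumes workers answer independently, that the false-positive rate `fp`, false-negative rate `fn`, and prior `selectivity` are known, and the function name `posterior_pass` is made up for this example.

```python
def posterior_pass(yes, no, selectivity, fp, fn):
    """Probability that the item truly passes the filter, given the votes.

    Assumptions (not from the slides): independent workers with known
    false-positive rate `fp` and false-negative rate `fn`, and a prior
    probability `selectivity` that an item passes.
    """
    # Likelihood of the observed votes if the item truly passes / fails.
    like_pass = (1 - fn) ** yes * fn ** no
    like_fail = fp ** yes * (1 - fp) ** no
    numerator = selectivity * like_pass
    denominator = numerator + (1 - selectivity) * like_fail
    if denominator == 0:
        raise ValueError("votes are impossible under the given error rates")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical numbers: 3 Yes and 1 No from workers who err 20% of the time.
    print(posterior_pass(yes=3, no=1, selectivity=0.5, fp=0.2, fn=0.2))
```

If the posterior is still close to 0.5, asking another question buys accuracy at the price of more cost and latency, which is the tradeoff listed on the slide.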

Slide 8: Live Experiment - Filter #2
Are more than half of the dots blue?

Slide 9: Live Experiment - Two Filters
Are there more than 40 dots and are more than half of the dots blue?

Slide 10: Computing the Answer
Ask questions separately or together?
- Together => lower cost, lower latency, lower accuracy?
- If separately, in sequence or in parallel?
- If in sequence, in what order?
Different filters may have…
- different cost
- different latency
- different accuracy
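The "in what order?" question can be made concrete with a small cost model. The sketch below is a simplification and not the method from the talk: it assumes the filters are independent, ignores human error and latency, and charges the second filter only for items that pass the first. The filter names, costs, and pass rates are invented for illustration.

```python
from itertools import permutations


def expected_sequential_cost(order):
    """Expected cost per item when filters run in sequence with short-circuiting.

    `order` is a sequence of (name, cost, pass_rate) tuples. A later filter is
    only asked about items that passed every earlier filter.
    """
    total, reach = 0.0, 1.0
    for _name, cost, pass_rate in order:
        total += reach * cost   # pay for this filter on items that got this far
        reach *= pass_rate      # fraction of items that continue to the next one
    return total


def cheapest_order(filters):
    """Try every ordering and return the one with the lowest expected cost."""
    return min(permutations(filters), key=expected_sequential_cost)


if __name__ == "__main__":
    # Hypothetical filters: (name, cost per question, fraction of items passing).
    filters = [("more than 40 dots", 1.0, 0.9), ("mostly blue", 1.0, 0.2)]
    best = cheapest_order(filters)
    print([name for name, _, _ in best], expected_sequential_cost(best))
```

Under these assumptions the more selective filter goes first, because it removes most items before the second question has to be paid for. Asking in parallel would cost the sum of both filters for every item but would finish sooner.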

Slide 11: Crowd Algorithms
Design fundamental algorithms involving human computations:
- Filter a large set (human predicate)
- Sort or find top-k from a large set (human comparison)
Questions to answer:
- Which questions do I ask of humans?
- Do I ask sequentially or in parallel?
- How much redundancy in questions?
- How do I combine answers?
- When do I stop?
Tradeoffs: Latency, Cost, Uncertainty
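As a toy illustration of the design questions above (which pairs to ask about, how much redundancy, how to combine answers), here is a single-elimination tournament that finds a maximum using a pairwise comparison asked several times and combined by majority vote. It is a generic sketch, not one of the published algorithms referenced in the talk. The `ask` callable stands in for a human worker and is an assumption of this example.

```python
def majority_compare(a, b, ask, votes=3):
    """Ask `votes` workers whether a beats b; return the majority winner."""
    wins_a = sum(1 for _ in range(votes) if ask(a, b))
    return a if 2 * wins_a > votes else b


def tournament_max(items, ask, votes=3):
    """Return (winner, number_of_questions) from a single-elimination tournament.

    Pairs within a round are independent, so they could be posted to the crowd
    in parallel; rounds have to run in sequence.
    """
    current = list(items)
    if not current:
        raise ValueError("items must not be empty")
    questions = 0
    while len(current) > 1:
        winners = []
        for i in range(0, len(current) - 1, 2):
            winners.append(majority_compare(current[i], current[i + 1], ask, votes))
            questions += votes
        if len(current) % 2 == 1:
            winners.append(current[-1])  # odd item out gets a bye this round
        current = winners
    return current[0], questions


if __name__ == "__main__":
    # A perfectly reliable "worker" for demonstration purposes.
    print(tournament_max([3, 9, 4, 7, 1], ask=lambda a, b: a > b))
```

Raising `votes` adds redundancy, and therefore cost, in exchange for robustness to individual mistakes.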

Slide 12: Algorithms We’ve Looked At
- Filtering [SIGMOD 2012]
- Graph search [VLDB 2011]
- Find-Max [SIGMOD 2012]
- Entity Resolution (= Deduplication)

Slide 13: Sample Results: Filtering
Given:
- Large set S of items
- Filter F over items of S
- Selectivity of F on S
- Human false-positive rate
- Human false-negative rate
Find strategy for asking F that:
- Asks no more than m questions per item
- Guarantees overall expected error less than e
- Minimizes overall expected cost c (# questions)
[Figure labels: Yes, No]
Exhaustive search; Pruned search; Probabilistic strategies
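To make the problem statement concrete, the sketch below evaluates one simple candidate strategy against the three quantities on the slide: at most m questions per item, expected error, and expected cost. The strategy (majority vote with early stopping) and the function name are choices made for this example, not the strategies found in the referenced work. Workers are assumed independent, with known error rates.

```python
def evaluate_majority_strategy(m, selectivity, fp, fn):
    """Expected (cost, error) per item for majority vote with early stopping.

    The strategy asks up to `m` questions (m odd) and stops as soon as either
    Yes or No has a strict majority of m. Assumes independent workers with
    false-positive rate `fp` and false-negative rate `fn`.
    """
    if m < 1 or m % 2 == 0:
        raise ValueError("m must be a positive odd number")
    need = m // 2 + 1
    expected_cost = 0.0
    expected_error = 0.0
    for truly_passes, prior in ((True, selectivity), (False, 1 - selectivity)):
        p_yes = (1 - fn) if truly_passes else fp
        # reach maps (yes_count, no_count) to the probability of being there.
        reach = {(0, 0): prior}
        while reach:
            frontier = {}
            for (yes, no), prob in reach.items():
                if yes == need or no == need:
                    expected_cost += prob * (yes + no)
                    decided_pass = yes == need
                    if decided_pass != truly_passes:
                        expected_error += prob
                    continue
                frontier[(yes + 1, no)] = frontier.get((yes + 1, no), 0.0) + prob * p_yes
                frontier[(yes, no + 1)] = frontier.get((yes, no + 1), 0.0) + prob * (1 - p_yes)
            reach = frontier
    return expected_cost, expected_error


if __name__ == "__main__":
    # Hypothetical parameters; compare a few question budgets.
    for m in (1, 3, 5):
        print(m, evaluate_majority_strategy(m, selectivity=0.3, fp=0.1, fn=0.2))
```

A search over strategies would then keep only those whose expected error is below e and pick the one with the lowest expected cost.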

Slide 14: Human-Powered Query Answering
Country  Capital   Language
Peru     Lima      Spanish
Peru     Lima      Quechua
Brazil   Brasilia  Portuguese
…        …         …
Query to the DBMS: Find the capitals of five Spanish-speaking countries
Possible questions for humans:
- Give me a Spanish-speaking country
- What language do they speak in country X?
- What is the capital of country X?
- Give me a country
- Give me a capital
- Is this country-capital-language triple correct?
- Give me the capitals of five Spanish-speaking countries

Slide 15: Human-Powered Query Answering
Country  Capital   Language
Peru     Lima      Spanish
Peru     Lima      Quechua
Brazil   Brasilia  Portuguese
…        …         …
Query to the DBMS: Find the capitals of five Spanish-speaking countries
Inconsistencies:
- What if some humans say Brazil is Spanish-speaking and others say Portuguese?
- What if some humans answer “Chile” and others “Chili”?
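One simple way to settle conflicting answers such as Spanish versus Portuguese is a majority vote that declines to answer on a tie. This is only an illustrative policy, not necessarily how the system in the talk resolves conflicts, and the function name is hypothetical. It does not handle spelling variants like “Chile” and “Chili”, which would need some form of approximate matching.

```python
from collections import Counter


def resolve_by_majority(answers):
    """Return the most frequent answer, or None if there is no clear winner."""
    if not answers:
        return None
    ranked = Counter(answers).most_common(2)
    top_value, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie: more answers are needed before deciding
    return top_value


if __name__ == "__main__":
    # Hypothetical worker answers about the language spoken in Brazil.
    print(resolve_by_majority(["Portuguese", "Spanish", "Portuguese"]))
```

Returning None on a tie gives the caller a natural signal to fetch more answers.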

Slide 16: Key Elements of Our Approach
- Exploit relational model and SQL query language
- Configurable fetch rules for obtaining data from humans
- Configurable resolution rules for resolving inconsistencies in fetched data
- Traditional approach to query optimization
  => But with many new twists and challenges!
Tradeoffs: Latency, Cost, Uncertainty

Slide 17: Deco: Declarative Crowdsourcing
DBMS, configured by the Schema Designer (DBA):
1) Schema for conceptual relations
   Restaurant(name, cuisine, rating)
2) Fetch rules
   name => cuisine
   cuisine, rating => name
3) Resolution rules
   name: dedup()
   rating: average()
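The two resolution rules named on the slide can be pictured as plain functions applied to the raw values fetched from humans. The sketch below is a stand-in, not Deco's implementation: `dedup` here only merges values that differ in case or spacing, and the rule table is a hypothetical way of wiring rules to attributes.

```python
from statistics import mean


def dedup(values):
    """Collapse values that differ only in letter case or whitespace.

    A real dedup() rule would likely need fuzzier matching; this normalization
    is an assumption made to keep the example small.
    """
    seen = {}
    for value in values:
        cleaned = " ".join(value.split())
        seen.setdefault(cleaned.casefold(), cleaned)  # keep first spelling seen
    return list(seen.values())


def average(values):
    """Resolve several human-supplied ratings into a single number."""
    return mean(values)


# Hypothetical mapping from attribute to resolution rule, mirroring the slide.
RESOLUTION_RULES = {"name": dedup, "rating": average}

if __name__ == "__main__":
    print(RESOLUTION_RULES["name"](["Thai Garden", "thai  garden", "Bangkok Bistro"]))
    print(RESOLUTION_RULES["rating"]([4, 5, 3]))
```

Keeping the rules as configurable functions matches the slide's point that the schema designer chooses how inconsistencies are resolved per attribute.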

Slide 18: Deco: Declarative Crowdsourcing
DBMS, queried by a User or Application
Declarative queries:
   select name
   from Restaurant
   where cuisine = ‘Thai’ and rating >= 3
   atleast 5
Query semantics: relational result over “some valid instance”
Valid instance: Fetch + Resolve + Join
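The "Fetch + Resolve + Join" recipe can be sketched end to end on a handful of made-up raw answers. Everything below is hypothetical: the data, the helper names, and the choice of majority vote for cuisine (the slides only name rules for name and rating). It is meant to show the shape of the computation, not Deco's actual data model or engine.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw answers, as if already fetched from the crowd.
RAW_NAMES = ["Thai Garden", "Thai Garden", "Bangkok Bistro", "Luigi's"]
RAW_CUISINE = [("Thai Garden", "Thai"), ("Thai Garden", "Thai"),
               ("Bangkok Bistro", "Thai"), ("Luigi's", "Italian")]
RAW_RATING = [("Thai Garden", 4), ("Thai Garden", 5),
              ("Bangkok Bistro", 2), ("Bangkok Bistro", 3), ("Luigi's", 4)]


def most_common(values):
    """Assumed resolution rule for cuisine: take the most frequent answer."""
    return Counter(values).most_common(1)[0][0]


def resolve(pairs, rule):
    """Group raw (name, value) pairs by name and apply a resolution rule."""
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name, []).append(value)
    return {name: rule(values) for name, values in grouped.items()}


def valid_instance(names, cuisine_pairs, rating_pairs):
    """Resolve each attribute, then join on name to build Restaurant tuples."""
    unique_names = list(dict.fromkeys(names))  # exact-match dedup, keeps order
    cuisine = resolve(cuisine_pairs, most_common)
    rating = resolve(rating_pairs, mean)
    return [(n, cuisine[n], rating[n])
            for n in unique_names if n in cuisine and n in rating]


def thai_query(instance, min_rating=3, at_least=5):
    """Evaluate the slide's query; also report whether `atleast` is satisfied."""
    names = [n for n, c, r in instance if c == "Thai" and r >= min_rating]
    return names, len(names) >= at_least


if __name__ == "__main__":
    instance = valid_instance(RAW_NAMES, RAW_CUISINE, RAW_RATING)
    print(thai_query(instance))
```

With this small sample the `atleast 5` requirement is not met, which is the situation where a query processor would have to issue more fetches.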

Slide 19: Deco: Declarative Crowdsourcing
DBMS Query Processor: generate query execution plan that orchestrates and optimizes fetches and resolutions to produce answer
Different possible objectives:
- N tuples, minimize cost (fetches)
- F fetches, maximize tuples
- T time, minimize/maximize ??
Tradeoffs: Latency, Cost, Uncertainty

Slide 20: Deco: Declarative Crowdsourcing
[Figure: DBMS with Query Processor]

Slide 21: Crowdsourcing Research
Marketplaces: Marketplace #1, Marketplace #2, …, Marketplace #n
Platforms
Algorithms: Humans As Data Processors
Data Gathering / Query Answering: Humans As Data Providers

Data-Centric Human Computation Joint work with: Hector Garcia-Molina, Neoklis Polyzotis, Aditya Parameswaran, Hyunjung Park