CrowdDB

History Lesson
- First crowd-powered database
- At that time, the state of the art was TurKit, a programming library for the crowd
- Two other crowd-powered databases appeared at around the same time:
  - Deco (Stanford, UC Santa Cruz)
  - Qurk (MIT)
- Necessarily incomplete, preliminary

Motivation of CrowdDB
- Two reasons why present DB systems won’t do:
  - Closed-world assumption: get human help for finding new data
  - Very literal in processing data:
      SELECT marketcap FROM company WHERE name = "IBM"
- Get the best of both worlds:
  - Human power for processing and getting data
  - Traditional systems for heavy lifting and data manipulation

Issues in building CrowdDB
- Performance and variability: humans are slow, costly, variable, inaccurate
- Task design and ambiguity: challenging to get people to do what you want
- Affinity / learning: workers develop relationships with requesters, and skills
- Open world: possibly unbounded answers

At a High Level
- Modifications to SQL: CrowdSQL
- Automatic UI generation
- Automatic interaction with the marketplace
- Storing data for future use

Modifications to SQL
- Special keyword: CROWD, used in two ways
- First: crowdsourced columns
    CREATE TABLE Department (
      university STRING,
      name STRING,
      url CROWD STRING,
      phone STRING,
      PRIMARY KEY (university, name) );
- A CROWD attribute cannot be part of the PK

Modifications to SQL
- Second: crowdsourced tables
    CREATE CROWD TABLE Profs (
      name STRING PRIMARY KEY,
      email STRING UNIQUE,
      university STRING,
      department STRING,
      FOREIGN KEY (university, department) REF Department (university, name) );
- Still need a PK

How do we designate incomplete data?
- Special keyword: CNULL
- Constraint: we want CNULL values to be filled in before query results are returned
    CREATE TABLE Department (
      university STRING,
      name STRING,
      url CROWD STRING,
      phone STRING,
      PRIMARY KEY (university, name) );

    SELECT url FROM Department WHERE name = "math"
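
The CNULL constraint can be illustrated with a minimal Python sketch. It is not CrowdDB's implementation: the in-memory table, the `ask_crowd` stub (which stands in for posting a task to the marketplace), and all sample values are assumptions made for illustration. The point is only that a CNULL in a selected tuple is resolved, and stored, before the query returns.

```python
# CNULL is modeled as a unique sentinel, distinct from None (ordinary NULL).
CNULL = object()

crowd_calls = []  # records every simulated crowd request


def ask_crowd(university, name):
    """Hypothetical stand-in for a crowd task; returns a canned answer."""
    crowd_calls.append((university, name))
    canned = {("Example U", "math"): "http://math.example.edu"}
    return canned.get((university, name))


department = [
    {"university": "Example U", "name": "math", "url": CNULL, "phone": "555-0100"},
    {"university": "Example U", "name": "cs", "url": "http://cs.example.edu", "phone": "555-0101"},
]


def select_url(table, name):
    """Emulates: SELECT url FROM Department WHERE name = <name>."""
    results = []
    for row in table:
        if row["name"] != name:
            continue  # only tuples that match the predicate reach the crowd
        if row["url"] is CNULL:
            # Fill the value in and store it, so later queries do not pay again.
            row["url"] = ask_crowd(row["university"], row["name"])
        results.append(row["url"])
    return results


first = select_url(department, "math")
second = select_url(department, "math")  # answered from stored data
```

In this sketch the second query triggers no crowd request, which mirrors the "storing data for future use" point from the high-level overview.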

Comparisons
- CROWDEQUAL
    SELECT name FROM Professor WHERE department ~= "CS"
- CROWDORDER
    SELECT p FROM Picture WHERE subject = "chair"
    ORDER BY CROWDORDER (p, "Which picture visualizes better %chair")
- Similar to Qurk's FILTER and SORT predicates, but hides away even more details from the user
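
One way to picture a crowd-based ORDER BY is a sort whose comparator is a majority vote over pairwise worker answers. The Python sketch below assumes that model; the function names, the simulated workers, and the hidden `quality` scores are all hypothetical, and the deck does not say that CrowdDB orders tuples exactly this way.

```python
from functools import cmp_to_key


def crowd_compare(a, b, ask_worker, assignments=5):
    """Return -1 if a majority of workers prefers a over b, else 1."""
    votes_for_a = sum(1 for w in range(assignments) if ask_worker(w, a, b) == a)
    return -1 if votes_for_a * 2 > assignments else 1


def crowd_order(items, ask_worker, assignments=5):
    """Sort items best-first using majority-voted pairwise comparisons."""
    comparator = lambda a, b: crowd_compare(a, b, ask_worker, assignments)
    return sorted(items, key=cmp_to_key(comparator))


# Simulated crowd: a hidden quality score per picture (illustrative values).
quality = {"p1": 0.2, "p2": 0.9, "p3": 0.5}


def simulated_worker(worker_id, a, b):
    """Worker 0 always picks the worse picture; the others answer correctly."""
    better, worse = (a, b) if quality[a] >= quality[b] else (b, a)
    return worse if worker_id == 0 else better


ranking = crowd_order(["p1", "p2", "p3"], simulated_worker)
```

With one consistently wrong worker out of five, the majority still recovers the ordering implied by the hidden scores. Real worker answers need not be transitive, so a sort built this way is only a heuristic.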

UI Generation
- Instantiated at run-time for every tuple
- Can also be edited
- Can you think of other UIs? (Remember previous papers.)

Multi-relational UI Generation
- Interesting trick to deal with dependencies
- If referencing a non-crowdsourced table: drop-down box + suggest function. Why?
- If referencing a crowdsourced table, two options:
  - Normalized interface with a suggest function
  - Denormalized interface getting dependent data

Query Processing
- Given these SQL extensions, there are a handful of new operators:
  - CROWDPROBE: collects missing information in CROWD columns or tables
  - CROWDJOIN: the inner table is probed in a crowdsourced fashion using the other table
  - CROWDCOMPARE: used for CROWDEQUAL and CROWDORDER

Let’s dig deeper…
- Quality management: majority vote with 5 answers across all those whose PKs match
- Issue? CROWDPROBE on a CROWDTABLE?
  - Prof (name, email, univ, dep), with (n, u) as the PK
  - Say I want to find 3 professors from univ = X, dep = Y; how long would it take? Best case? Worst case?
  - Workers may focus on different answers
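
The quality-management rule above can be sketched in Python as follows. This is a sketch under stated assumptions, not CrowdDB's code: the helper names and the submission format (one dict per worker answer) are hypothetical. It groups answers by primary key and accepts a value only when at least five answers exist and a strict majority agrees.

```python
from collections import Counter, defaultdict


def majority_vote(answers, min_answers=5):
    """Return the majority value, or None if there is no decision yet."""
    if len(answers) < min_answers:
        return None  # not enough answers for this key
    value, count = Counter(answers).most_common(1)[0]
    return value if count * 2 > len(answers) else None


def aggregate_by_key(submissions, key_fields, value_field, min_answers=5):
    """Group worker submissions by primary key, then vote within each group."""
    groups = defaultdict(list)
    for submission in submissions:
        key = tuple(submission[field] for field in key_fields)
        groups[key].append(submission[value_field])
    return {key: majority_vote(values, min_answers) for key, values in groups.items()}


# Illustrative submissions: five for one professor, two for another.
submissions = (
    [{"name": "A", "univ": "X", "email": "a@x.example"}] * 3
    + [{"name": "A", "univ": "X", "email": "a@y.example"}] * 2
    + [{"name": "B", "univ": "X", "email": "b@x.example"}] * 2
)
decided = aggregate_by_key(submissions, ("name", "univ"), "email")
```

The second key stays undecided, which is the issue raised on the slide: when workers are free to choose which professors to enter, their answers spread over many keys and each key collects its five matching answers slowly.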

Let’s dig deeper… CROWDTABLE
- Any other issues?
- What if workers refer to the same professor in slightly different ways, e.g. Jiawei Han vs. J. Han?
- Spelling mistakes
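
To make the name-variant problem concrete, here is a crude Python matching heuristic. It is purely illustrative and is not something the deck attributes to CrowdDB: the `name_key` and `likely_same` helpers and the 0.85 threshold are assumptions. Abbreviated first names are handled by comparing last name plus first initial, and spelling mistakes by a string-similarity ratio from the standard library.

```python
from difflib import SequenceMatcher


def name_key(name):
    """Reduce a name to (last name, first initial), ignoring case and dots."""
    parts = name.replace(".", " ").lower().split()
    return (parts[-1], parts[0][0])


def likely_same(a, b, threshold=0.85):
    """Heuristic check whether two worker-entered names denote one person."""
    if name_key(a) == name_key(b):
        return True  # catches abbreviations such as "J. Han"
    # Catches small spelling mistakes; the threshold is an arbitrary choice.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


abbreviated = likely_same("Jiawei Han", "J. Han")
misspelled = likely_same("Jiawei Han", "Jiawei Hann")
unrelated = likely_same("Jiawei Han", "Michael Jordan")
```

A rule like this also produces false matches (two different people who share a last name and first initial), which is part of why keys entered by the crowd are hard to reconcile automatically.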

Let’s dig deeper… CROWDJOIN
- The outer table is used to probe the inner crowdsourced table, asking for values of missing predicates
- E.g., a join between Dept and Prof
- Are there similar issues here?
- What if workers can’t find the URL of a specific Prof?

Crowd Operators: Tasks
- Create HITs and HIT groups
- Collect results from AMT
- Quality control via majority vote
- Right now: simple rule-based optimizer
  - Batch size, # assignments, price fixed
  - Predicate push down, join ordering, delete optimization
- Future: cost-based optimizer
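
With batch size, number of assignments, and price all fixed, the cost of crowdsourcing a set of tuples follows from simple arithmetic. The Python sketch below shows that calculation; the `plan_hits` helper and the sample numbers are assumptions for illustration, and marketplace fees are ignored.

```python
import math


def plan_hits(num_tuples, batch_size, assignments, reward_cents):
    """Compute HIT count, total assignments, and total reward in cents."""
    hits = math.ceil(num_tuples / batch_size)  # tuples are batched into HITs
    total_assignments = hits * assignments     # each HIT is answered several times
    return {
        "hits": hits,
        "assignments": total_assignments,
        "cost_cents": total_assignments * reward_cents,
    }


# 23 tuples, 5 tuples per HIT, 5 assignments per HIT, 2 cents per assignment.
plan = plan_hits(num_tuples=23, batch_size=5, assignments=5, reward_cents=2)
```

A cost-based optimizer, listed on the slide as future work, would choose these three parameters per query instead of fixing them.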

Query Processing Example

Results on benchmarks: HIT size vs. responsiveness
- Tradeoff between HITs completed per unit time and % completion of HITs

Reward vs. Responsiveness
- % of HITs fully completed
- % of HITs with >1 done

Completion across workers
- Skewed distribution
- No variation in error rate between high-frequency workers and others

Complex Queries
- Entity resolution on company names
  - Matching one company name with 100 others for four separate runs
  - Majority vote gives the correct result
- Ordering photos in terms of relevance
  - Majority vote matches expert ranking

Complex Queries
- Joining professors and departments
    SELECT p.name, p.email, d.name, d.phone
    FROM Professor p, Department d
    WHERE p.department = d.name
      AND p.university = d.university
      AND p.name = "[name of a professor]"
- Method 1: professor details are collected first, then department details
- Method 2: professor and department details are collected together via a denormalized interface
- Method 2 is cheaper, but Method 1 outperforms Method 2 in accuracy: the instructions for the denormalized interface were unclear

Other observations
- Relationship with workers is long-term
  - Keep workers happy
  - Implement less stringent approval mechanisms
- Good interface design and instructions matter
  - Simple choices like “none of the above” improve quality dramatically

History Lesson
- Even now (4 years later), there is no real complete, fully-functional crowd-powered database. Why?
- No one understands the crowds (EVEN NOW)
  - We were all naïve in thinking that we could treat crowds as just another data source
- People don’t seem to want to use crowds within databases
  - Crowdsourcing is a one-off task
  - Crowds have very different characteristics than other data

Still…
- The ideas are very powerful and applicable everywhere you want data to be extracted
- Very common use-case of crowds

Semantics
- Semantics = an understanding of what the query does
- Regular SQL has very understandable semantics: starting from a given state, you know exactly what state you will be in once you execute a statement
- Does CrowdSQL have understandable semantics? How would you improve it?
- Overall, very hard. But at the least:
  - A specification of budget?
  - A specification that cost/latency is minimized?

Optimization Techniques
- Beyond the ones presented in the paper, what other “database style” optimization techniques can you think of?
- Predicate pushdown: e.g., if you only care about tuples in CA, instantiate interfaces with CA filled in
- Reorder tables such that more “complete” tables are filled first
- Reorder predicates such that more “complete” predicates are checked first
    SELECT * FROM Professor WHERE Dept = "math" AND Email LIKE "%berkeley%"
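
The predicate-reordering idea can be sketched in Python. The sketch is an interpretation, not something taken from the paper: "completeness" is assumed to mean the fraction of non-CNULL values in a column, and the helper names and sample rows are hypothetical. Checking the most complete column first lets known values reject tuples before any crowd task is issued for them.

```python
CNULL = object()  # sentinel for a value that still has to be crowdsourced


def completeness(table, column):
    """Fraction of rows whose value in `column` is already known."""
    return sum(row[column] is not CNULL for row in table) / len(table)


def order_predicates(table, predicates):
    """Order (column, test) pairs so the most complete column comes first."""
    return sorted(predicates, key=lambda p: completeness(table, p[0]), reverse=True)


def tuples_needing_crowd(table, predicates):
    """Count tuples that hit a CNULL before a known value rejects them."""
    needed = 0
    for row in table:
        for column, test in predicates:
            value = row[column]
            if value is CNULL:
                needed += 1  # a crowd task is required to go on
                break
            if not test(value):
                break  # rejected using stored data, no crowd task
    return needed


professor = [
    {"dept": "math", "email": CNULL},
    {"dept": "cs", "email": CNULL},
    {"dept": "cs", "email": CNULL},
    {"dept": "math", "email": "x@berkeley.example"},
]
email_first = [("email", lambda v: "berkeley" in v), ("dept", lambda v: v == "math")]
reordered = order_predicates(professor, email_first)

naive_cost = tuples_needing_crowd(professor, email_first)
better_cost = tuples_needing_crowd(professor, reordered)
```

On this toy table, checking the fully known `dept` column first reduces the number of tuples that need a crowd task from three to one.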

Recording Data
- CrowdDB only records either CNULL or the final outcome. Why might this be a bad idea?
- Needs and aggregation schemes change
  - An application may require more accuracy
  - We may find that people are more error-prone than we expected
- Data may get stale
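
The alternative implied by this critique is to keep the raw worker answers and derive the final value on demand. The Python sketch below assumes that design; the storage layout, the `record` and `aggregate` helpers, and the agreement thresholds are hypothetical and are not part of CrowdDB.

```python
from collections import Counter

raw_answers = {}  # (table, primary key, column) -> list of (worker_id, value)


def record(cell, worker_id, value):
    """Store one raw worker answer instead of only the final outcome."""
    raw_answers.setdefault(cell, []).append((worker_id, value))


def aggregate(cell, min_agreement=0.5):
    """Derive a value from the raw answers; None if agreement is too low."""
    values = [value for _, value in raw_answers.get(cell, [])]
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count / len(values) > min_agreement else None


cell = ("Department", ("Example U", "math"), "url")
for worker_id, answer in enumerate(["A", "A", "B", "A", "B"]):
    record(cell, worker_id, answer)

lenient = aggregate(cell)                     # simple majority
strict = aggregate(cell, min_agreement=0.75)  # a stricter application
```

Because the individual answers are kept, an application that later needs higher accuracy can re-aggregate with a stricter rule, or request more answers, without discarding what was already paid for. Staleness would additionally require a timestamp per answer, which this sketch omits.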

Joins between crowdsourced relations
- CrowdDB forbids joins between two crowdsourced tables. Is there a case where we may want that?
- Sure: people in a department, and courses taught in the department

FDs?
- CrowdDB assumes a primary key per table. What if there are other functional dependencies? Can we do better?
- Example: Company, City, State
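
One reading of the Company, City, State example is that City determines State, so a state learned for one company can be reused for every other company in the same city without another crowd task. The Python sketch below assumes that dependency (City -> State); the helper name and the sample rows are hypothetical.

```python
CNULL = object()  # sentinel for a value that still has to be crowdsourced


def propagate_fd(table, lhs, rhs):
    """Fill CNULL values of `rhs` using the dependency lhs -> rhs.

    Returns the number of cells filled without asking the crowd.
    """
    known = {}
    for row in table:
        if row[lhs] is not CNULL and row[rhs] is not CNULL:
            known[row[lhs]] = row[rhs]
    filled = 0
    for row in table:
        if row[rhs] is CNULL and row[lhs] in known:
            row[rhs] = known[row[lhs]]
            filled += 1
    return filled


companies = [
    {"company": "A", "city": "Palo Alto", "state": "CA"},
    {"company": "B", "city": "Palo Alto", "state": CNULL},
    {"company": "C", "city": "Austin", "state": CNULL},
]
saved_tasks = propagate_fd(companies, "city", "state")
```

Only the state for company C still needs a crowd task here. The same dependency could also serve as a consistency check on crowd answers, since two different states reported for one city signal an error.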

Other things… What other issues did you identify in the paper?