David Karger Sewoong Oh Devavrat Shah MIT + UIUC.


o A patient is asked: rate your pain on a scale of 1-10
o Medical student gets answer: 5
o Intern gets answer: 8
o Fellow gets answer: 4.5
o Doctor gets answer: 6
o So what is the “right” amount of pain?
o Crowd-sourcing
  o Pain of patient = task
  o Answer of patient = completion of task by a worker

o Goal: reliably estimate the tasks at minimal cost
o Key operational questions:
  o Task assignment
  o Inferring the “answers”

o N tasks
  o Denote by $t_1, t_2, \ldots, t_N$ the “true” values, each in $\{1, \ldots, K\}$
o M workers
  o Denote by $w_1, w_2, \ldots, w_M$; each worker has a “confusion” matrix
  o Worker j: confusion matrix $P^j = [P^j_{kl}]$
  o Worker j's answer is $l$ for a task with value $k$ with probability $P^j_{kl}$
o Binary symmetric case
  o K = 2: tasks take value +1 or -1
  o Correct answer with probability $p_j$
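The binary symmetric case above can be simulated in a few lines. This is a minimal sketch, not code from the talk: the function name `simulate_answers`, the dictionary layout for the answers, and the toy task and worker values are all assumptions made for illustration.

```python
import random

def simulate_answers(true_values, reliabilities, assignments, seed=0):
    """Draw answers A[(i, j)] under the binary symmetric model.

    true_values:   list of +1/-1, one per task (t_i)
    reliabilities: list of p_j in [0, 1], one per worker
    assignments:   iterable of (task i, worker j) pairs
    """
    rng = random.Random(seed)
    answers = {}
    for i, j in assignments:
        # Worker j reports the true value with probability p_j,
        # and the flipped value otherwise.
        correct = rng.random() < reliabilities[j]
        answers[(i, j)] = true_values[i] if correct else -true_values[i]
    return answers

# Hypothetical toy instance: a perfect worker and an always-wrong worker.
tasks = [+1, -1, +1]
workers = [1.0, 0.0]
edges = [(i, j) for i in range(3) for j in range(2)]
A = simulate_answers(tasks, workers, edges)
```

With these extreme reliabilities the draw is deterministic: worker 0 always reproduces the true value and worker 1 always flips it.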

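[Figure: bipartite assignment graph with tasks $t_1, \ldots, t_N$ on one side, workers $w_1, \ldots, w_M$ on the other, and each edge $(i, j)$ labeled with the answer $A_{ij}$]
o Binary tasks: $t_i \in \{+1, -1\}$
o Worker reliability: $p_j = \Pr[A_{ij} = t_i]$
o Necessary assumption: we know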

o Goal: given N tasks
  o Obtain each answer correctly with probability at least $1 - \varepsilon$
o What is the minimal number of questions (edges) needed?
o How to assign them, and how to infer the task values?

o Task assignment graph
  o Random regular graph
  o Or, regular graph with large girth
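One common way to draw a random regular bipartite assignment is the configuration model: give every task l "stubs" and every worker r stubs, then match them through a random shuffle. The sketch below assumes that construction; the slides do not say how the graph is sampled, and the function name and parameters are illustrative only. Note that this simple version can produce repeated (task, worker) pairs, which a practical implementation might resample.

```python
import random

def random_regular_assignment(n_tasks, n_workers, l, seed=0):
    """Configuration-model sketch: each task gets l edges, each worker r edges."""
    if (n_tasks * l) % n_workers != 0:
        raise ValueError("n_tasks * l must be divisible by n_workers")
    r = n_tasks * l // n_workers
    rng = random.Random(seed)
    # One stub per edge endpoint on each side of the bipartite graph.
    task_stubs = [i for i in range(n_tasks) for _ in range(l)]
    worker_stubs = [j for j in range(n_workers) for _ in range(r)]
    rng.shuffle(worker_stubs)
    # Pair stubs in order; repeated (task, worker) pairs are possible and kept.
    return list(zip(task_stubs, worker_stubs))

edges = random_regular_assignment(n_tasks=6, n_workers=6, l=3)
```

Every task ends up with exactly l answers requested and every worker with exactly r tasks, which is the regularity the slide asks for.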

o Majority:
o Oracle:
o Our Approach:
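The two baselines can be sketched under standard assumptions. Majority voting takes the sign of the sum of answers per task; an oracle that knows each $p_j$ can weight each answer by its log-likelihood ratio $\log(p_j / (1 - p_j))$, which is the maximum-likelihood rule for the binary symmetric model. The function names, the answer dictionary layout, and the tie-breaking rule are assumptions for illustration, and the oracle sketch requires $0 < p_j < 1$.

```python
import math

def sign(x):
    return 1 if x >= 0 else -1   # ties broken toward +1 (arbitrary choice)

def majority_vote(answers, n_tasks):
    """Unweighted vote: sign of the sum of answers on each task."""
    totals = [0.0] * n_tasks
    for (i, _j), a in answers.items():
        totals[i] += a
    return [sign(s) for s in totals]

def oracle_vote(answers, n_tasks, reliabilities):
    """Weighted vote using known reliabilities p_j (must lie strictly in (0, 1))."""
    totals = [0.0] * n_tasks
    for (i, j), a in answers.items():
        p = reliabilities[j]
        totals[i] += math.log(p / (1.0 - p)) * a
    return [sign(s) for s in totals]

# Hypothetical example: one reliable worker outvoted by two unreliable ones.
answers = {(0, 0): +1, (0, 1): -1, (0, 2): -1}
reliabilities = [0.9, 0.4, 0.4]
maj = majority_vote(answers, 1)
orc = oracle_vote(answers, 1, reliabilities)
```

In this toy case the majority follows the two unreliable workers, while the oracle sides with the reliable one because workers with $p_j < 1/2$ receive negative weight.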

o Iteratively learn
o Message-passing
  o O(# edges) operations
o Approximation of Maximum Likelihood
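A message-passing scheme of this kind can be sketched as follows. The update rules here are an assumption, since the slides do not spell them out: task-to-worker messages sum the answer-weighted messages from the task's other workers, worker-to-task messages do the same over the worker's other tasks, and the final estimate is the sign of the weighted vote. The function name, the `init_std` parameter, and the Gaussian initialisation around 1 are illustrative choices. The sketch assumes each (task, worker) pair appears at most once and omits message normalisation, so values grow with the iteration count.

```python
import random

def iterative_inference(answers, n_tasks, n_workers, n_iters=10, init_std=1.0, seed=0):
    rng = random.Random(seed)
    edges = list(answers)
    # Worker-to-task messages y[(i, j)], initialised around 1.
    y = {e: rng.gauss(1.0, init_std) for e in edges}
    by_task = [[] for _ in range(n_tasks)]
    by_worker = [[] for _ in range(n_workers)]
    for i, j in edges:
        by_task[i].append(j)
        by_worker[j].append(i)
    for _ in range(n_iters):
        # Task-to-worker message: sum over the task's *other* workers.
        task_sum = [sum(answers[(i, j)] * y[(i, j)] for j in by_task[i])
                    for i in range(n_tasks)]
        x = {(i, j): task_sum[i] - answers[(i, j)] * y[(i, j)] for i, j in edges}
        # Worker-to-task message: sum over the worker's *other* tasks.
        worker_sum = [sum(answers[(i, j)] * x[(i, j)] for i in by_worker[j])
                      for j in range(n_workers)]
        y = {(i, j): worker_sum[j] - answers[(i, j)] * x[(i, j)] for i, j in edges}
    estimates = []
    for i in range(n_tasks):
        s = sum(answers[(i, j)] * y[(i, j)] for j in by_task[i])
        estimates.append(1 if s >= 0 else -1)
    return estimates

# Hypothetical example: two perfect workers and one always-wrong worker.
truth = [1, -1, 1]
worker_sign = [1, 1, -1]
answers = {(i, j): truth[i] * worker_sign[j] for i in range(3) for j in range(3)}
estimates = iterative_inference(answers, 3, 3, n_iters=10, init_std=0.0)
```

Each iteration touches every edge a constant number of times, which matches the O(# edges) cost per iteration noted above. In the toy example the always-wrong worker ends up with a negative message, so its flipped answers count in favour of the true value.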

o Theorem (Karger-Oh-Shah).
  o Let n tasks be assigned to n workers as per an $(l, l)$ random regular graph
  o Let $ql > \sqrt{2}$, where $q$ is the crowd quality
  o Then, for all n large enough (i.e. $n = \Omega(l^{O(\log(1/q))} e^{lq})$), after $O(\log(1/q))$ iterations of the algorithm

o To achieve target $P_{\text{error}} \le \varepsilon$, we need
  o Per-task budget $l = \Theta\left(\frac{1}{q} \log \frac{1}{\varepsilon}\right)$
  o And this is minimax optimal
o Under majority voting (with any graph choice)
  o Per-task budget required is $l = \Omega\left(\frac{1}{q^2} \log \frac{1}{\varepsilon}\right)$
o No significant gain from knowing side-information (golden questions, reputation, …)
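To get a feel for the gap between the two scalings, the sketch below evaluates both expressions with the hidden constants set to 1. That choice is an assumption made purely for illustration, so only the ratio between the two budgets is meaningful, not the absolute numbers; the function names and the sample values of q and ε are also hypothetical.

```python
import math

def budget_iterative(q, eps):
    """(1/q) * log(1/eps), with the Theta-constant assumed to be 1."""
    return (1.0 / q) * math.log(1.0 / eps)

def budget_majority(q, eps):
    """(1/q^2) * log(1/eps), with the Omega-constant assumed to be 1."""
    return (1.0 / q ** 2) * math.log(1.0 / eps)

q, eps = 0.1, 0.01
ratio = budget_majority(q, eps) / budget_iterative(q, eps)
```

The ratio is 1/q regardless of ε, so for a low-quality crowd (small q) majority voting needs proportionally more answers per task.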

Theorem (Karger-Oh-Shah). Given any adaptive algorithm, let $\Delta$ be the average number of workers required per task to achieve the desired $P_{\text{error}} \le \varepsilon$. Then there exists $\{p_j\}$ with quality $q$ so that the gain through adaptivity is limited.

Theorem (Karger-Oh-Shah). To achieve reliability $1 - \varepsilon$, per-task redundancy scales as $\frac{K}{q}\left(\log \frac{1}{\varepsilon} + \log K\right)$.
o Through reducing the K-ary problem to K binary problems (and dealing with a few asymmetries)

o Learning similarities
o Recommendations
o Searching, …

o Crowd-sourcing
  o Regular graph + message passing
  o Useful for designing surveys/taking polls
o Algorithmically
  o Iterative algorithm is like power-iteration
o Beyond stand-alone tasks
  o Learning global structure, e.g. ranking