Corleone: Hands-Off Crowdsourcing for Entity Matching
Chaitanya Gokhale, University of Wisconsin-Madison
Joint work with AnHai Doan, Sanjib Das, Jeffrey Naughton, Ram Rampalli, Jude Shavlik, and Jerry Zhu

Entity Matching
– Has been studied extensively for decades
– No satisfactory solution as yet
– Recent work has considered crowdsourcing
[Slide shows the two tables to be matched: a Walmart table and an Amazon table, each with id, name, brand, and price columns, containing products such as the HP Biscotti 17.3" G72 Laptop and Transcend JetFlash USB drives.]

Recent Crowdsourced EM Work
– Verifying predicted matches
  e.g., [Demartini et al. WWW'12; Wang et al. VLDB'12, SIGMOD'13]
– Finding the best questions to ask the crowd
  to minimize the number of such questions
  e.g., [Whang et al. VLDB'13]
– Finding the best UI to pose questions
  display 1 question per page, or 10, or …? display record pairs or clusters?
  e.g., [Marcus et al. VLDB'11; Whang et al. TR'12]

Recent Crowdsourced EM Work
Example: verifying predicted matches
– sample blocking rule: if prices differ by at least $50, then do not match
Shows that crowdsourced EM is highly promising
But suffers from a major limitation
– crowdsources only parts of the workflow
– needs a developer to execute the remaining parts
[Slide shows a workflow: tables A and B go through Blocking to produce candidate pairs (a,d), (b,e), (c,d), (c,e); Matching predicts (a,d) Y, (b,e) N, (c,d) Y, (c,e) Y; the crowd then verifies, confirming (a,d) Y and (c,e) Y.]

Need for Developer Poses Serious Problems
Does not scale to EM at enterprises
– enterprises often have tens to hundreds of EM problems
– can't afford so many developers
Example: matching products at WalmartLabs
– hundreds of major product categories
– to obtain high accuracy, must match each category separately
– so have hundreds of EM problems, one per category
[Slide shows product-category trees for walmart.com and Walmart brick-and-mortar stores: electronics (TVs, …), clothes (shirts, pants, …), books (romance, science, …), etc.]

Need for Developer Poses Serious Problems
Cannot handle crowdsourcing for the masses
– the masses can't be developers, and can't use crowdsourcing startups either
E.g., a journalist wants to match two long lists of political donors
– can't use current EM solutions, because he or she can't act as a developer
– can pay up to $500
– can't ask a crowdsourcing startup to help: $500 is too little for them to engage a developer
– same problem for domain scientists, small-business workers, end users, data enthusiasts, …

Our Solution: Hands-Off Crowdsourcing
Crowdsources the entire workflow of a task, requiring no developers
Given a problem P supplied by user U, a crowdsourced solution to P is hands-off iff
– it uses no developers, only the crowd
– user U does no or little initial setup work, and that work requires no special skills
Example: to match two tables A and B, user U supplies
– the two tables
– a short textual instruction to the crowd on what it means to match
– two negative and two positive examples to illustrate the instruction

Hands-Off Crowdsourcing (HOC)
A next logical direction for EM research
– from no- to partial- to complete crowdsourcing
Can scale up EM at enterprises
Can open up crowdsourcing for the masses
E.g., the journalist who wants to match two lists of donors
– uploads the two lists to an HOC website
– specifies a budget of $500 on a credit card
– the HOC website uses the crowd to execute the EM workflow and returns the matches to the journalist
Very little work so far on crowdsourcing for the masses
– even though that's where crowdsourcing can make a lot of impact

Our Solution: Corleone, an HOC System for EM
[Architecture slide: the user supplies tables A and B, instructions to the crowd, and four examples. A Blocker produces candidate tuple pairs; a Matcher produces predicted matches; an Accuracy Estimator produces accuracy estimates (P, R); a Difficult Pairs' Locator feeds hard pairs back to the Matcher. All components interact with a crowd of workers, e.g., on Amazon Mechanical Turk. The output is the predicted matches with accuracy estimates.]

Blocking
|A x B| is often very large (e.g., 10B pairs or more)
– the developer writes rules to remove obviously non-matched pairs
– a critical step in EM
Sample blocking rules:
– trigram(a.title, b.title) < 0.2  [for matching Citations]
– overlap(a.brand, b.brand) = 0 AND cosine(a.title, b.title) ≤ 0.1 AND (a.price/b.price ≥ 3 OR b.price/a.price ≥ 3 OR isNULL(a.price, b.price))  [for matching Products]
How do we get the crowd to do this?
– ordinary workers can't write machine-readable rules
– if they write rules in English, we can't convert them into machine-readable form
Crowdsourced EM so far asks people only to label examples
– no work has asked people to write machine-readable rules
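A blocking rule like the ones above is just a predicate over a record pair that says "drop this pair, it obviously does not match." A minimal sketch in Python, using a hand-rolled Jaccard-over-trigrams similarity as a stand-in for the trigram function named on the slide (the record fields and thresholds here are illustrative, not Corleone's actual ones):

```python
def trigrams(s):
    """Set of character 3-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_sim(a, b):
    """Jaccard similarity over character trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def citation_blocking_rule(a, b):
    """Drop the pair (predict 'no match') when the titles share few trigrams."""
    return trigram_sim(a["title"], b["title"]) < 0.2

# A blocker keeps only the pairs that no blocking rule drops.
A = [{"id": 1, "title": "Efficient Clustering of High-Dimensional Data Sets"}]
B = [{"id": 7, "title": "Efficient clustering of high dimensional data sets"},
     {"id": 8, "title": "Relational Algebra and QBE"}]
candidates = [(a["id"], b["id"]) for a in A for b in B
              if not citation_blocking_rule(a, b)]
```

Run over the toy tables, only the near-identical title pair survives into the candidate set; the unrelated pair is removed before any matcher or crowd worker ever sees it.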

Our Key Idea
– Ask people to label examples, as before
– Use the labels to generate many machine-readable rules, via machine learning (specifically a random forest)
– Ask the crowd to evaluate, select, and apply the best rules
This has proven highly promising
– e.g., reduced the number of tuple pairs from 168M to 38.2K at a cost of $7.20, and from 56M to 173.4K at a cost of $22
– with no developer involved
– in some cases did much better than using a developer (bigger reduction, higher accuracy)

Blocking in Corleone
– Decide if blocking is necessary: if |A x B| < τ, skip blocking and return A x B; otherwise do blocking
– Take a sample S from A x B
– Train a random forest F on S (to match tuple pairs), using active learning, where the crowd labels pairs
[Slide shows the active-learning loop: starting from the four user-supplied examples (2 positive, 2 negative), train a random forest F on the sample S; while the stopping criterion is not satisfied, select the q "most informative" unlabeled examples, have the crowd label them on Amazon Mechanical Turk, and retrain F.]
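One common way to choose the "most informative" unlabeled examples in such a loop is committee disagreement: send the crowd the pairs on which the forest's trees split most evenly. A minimal sketch, assuming entropy of the tree votes as the disagreement measure (the toy "trees" below are hand-written price predicates, not learned models):

```python
import math

def vote_entropy(votes):
    """Disagreement of a committee: entropy of the yes/no vote split."""
    p = sum(votes) / len(votes)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_most_informative(pairs, trees, q):
    """Return the q unlabeled pairs the ensemble disagrees on most."""
    scored = [(vote_entropy([tree(p) for tree in trees]), i, p)
              for i, p in enumerate(pairs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest entropy first, stable
    return [p for _, _, p in scored[:q]]

# Toy ensemble: each "tree" votes match/no-match on a (price_a, price_b) pair.
trees = [lambda p: abs(p[0] - p[1]) < 50,
         lambda p: abs(p[0] - p[1]) < 10,
         lambda p: abs(p[0] - p[1]) < 30,
         lambda p: abs(p[0] - p[1]) < 40]
pairs = [(100, 105), (100, 500), (100, 135)]
picked = select_most_informative(pairs, trees, q=1)
```

Here the trees all agree on the first two pairs (clearly match, clearly non-match), so only the borderline third pair is worth a crowd question; that is what keeps the labeling budget small.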

Blocking in Corleone
Extract candidate rules from the random forest F: each root-to-leaf path ending in a "No" leaf is a candidate blocking rule.
Example random forest F for matching books (one tree tests isbn_match then #pages_match; another tests title_match, publisher_match, then year_match), and the candidate rules extracted from it:
– (isbn_match = N) → No
– (isbn_match = Y) and (#pages_match = N) → No
– (title_match = N) → No
– (title_match = Y) and (publisher_match = N) → No
– (title_match = Y) and (publisher_match = Y) and (year_match = N) → No
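The extraction step above can be sketched as a walk over each tree that collects every root-to-leaf path ending in a "No" leaf. The nested-dict tree encoding below is an assumed toy representation for illustration, not Corleone's actual data structure:

```python
def extract_no_rules(tree, path=()):
    """Collect the condition lists of all paths that end in a 'No' leaf."""
    if not isinstance(tree, dict):            # leaf: "Yes" or "No"
        return [list(path)] if tree == "No" else []
    attr = tree["attr"]
    rules = []
    for branch in ("Y", "N"):
        cond = f"({attr} = {branch})"
        rules += extract_no_rules(tree[branch], path + (cond,))
    return rules

# The two book-matching trees from the slide, as nested dicts.
tree1 = {"attr": "isbn_match",
         "Y": {"attr": "#pages_match", "Y": "Yes", "N": "No"},
         "N": "No"}
tree2 = {"attr": "title_match",
         "Y": {"attr": "publisher_match",
               "Y": {"attr": "year_match", "Y": "Yes", "N": "No"},
               "N": "No"},
         "N": "No"}
rules = extract_no_rules(tree1) + extract_no_rules(tree2)
```

Walking both trees yields exactly the five candidate rules listed on the slide; each rule is a conjunction of the branch conditions along one "No" path.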

Blocking in Corleone
Evaluate the precision of the extracted candidate rules
– for each rule R, apply R to predict "match / no match" on the sample S
– ask the crowd to evaluate R's predictions
– compute the precision of R
Select the most precise rules as blocking rules
Apply the blocking rules to A and B using Hadoop, to obtain a smaller set of candidate pairs to be matched
Multiple difficult optimization problems arise in blocking
– to minimize crowd effort and scale up to very large tables A and B
– see the paper
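The precision evaluation and rule selection can be sketched as follows. This is a minimal illustration that assumes the crowd's answers have already been aggregated into a single gold label per pair; the threshold and record fields are made up for the example:

```python
def rule_precision(rule, sample, crowd_labels):
    """Fraction of the pairs a rule drops that the crowd also calls non-matches.

    rule(pair) -> True means the rule predicts 'no match' (drop the pair);
    crowd_labels[pair_id] is True when the crowd says the pair matches.
    """
    dropped = [pid for pid, pair in sample.items() if rule(pair)]
    if not dropped:
        return 0.0
    correct = sum(1 for pid in dropped if not crowd_labels[pid])
    return correct / len(dropped)

def select_blocking_rules(rules, sample, crowd_labels, min_precision=0.95):
    """Keep only the candidate rules precise enough to use for blocking."""
    return [r for r in rules
            if rule_precision(r, sample, crowd_labels) >= min_precision]

sample = {1: {"price_a": 10, "price_b": 400},
          2: {"price_a": 10, "price_b": 12},
          3: {"price_a": 99, "price_b": 300}}
crowd_labels = {1: False, 2: True, 3: False}          # True = "matches"
price_rule = lambda p: p["price_b"] / p["price_a"] >= 3  # drop far-apart prices
blocking_rules = select_blocking_rules([price_rule], sample, crowd_labels)
```

A high precision means the rule almost never throws away a true match, so it is safe to run over the full Cartesian product; low-precision rules are discarded.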

The Rest of Corleone
[Architecture slide again: after the Blocker produces the candidate tuple pairs from tables A and B, the Matcher, Accuracy Estimator, and Difficult Pairs' Locator iterate with the crowd of workers (e.g., on Amazon Mechanical Turk) to produce the predicted matches and their accuracy estimates.]

Empirical Evaluation
Mechanical Turk settings
– Turker qualifications: at least 100 HITs completed with ≥ 95% approval rate
– Payment: 1-2 cents per question
– Repeated three times on each data set, each run in a different week
[Slide shows a dataset-statistics table for Restaurants, Citations, and Products: sizes of Table A and Table B, |A x B|, number of matches |M|, number of attributes, and number of features.]

Performance Comparison
Two traditional solutions: Baseline 1 and Baseline 2
– a developer performs blocking
– supervised learning matches the candidate set
Baseline 1: labels the same number of pairs as Corleone
Baseline 2: labels 20% of the candidate set
– e.g., for Products, Corleone labels 3,205 pairs, while Baseline 2 labels 20% of the 173.4K-pair candidate set
Also compared against results from published work

Performance Comparison
[Slide shows a results table: for each dataset (Restaurants, Citations, Products), precision, recall, F1, and cost for Corleone, and P/R/F1 for Baseline 1, Baseline 2, and published works ([1,2] for Restaurants, [2,3,4] for Citations; no published result available for Products).]
[1] CrowdER: Crowdsourcing entity resolution. Wang et al., VLDB'12.
[2] Frameworks for entity matching: A comparison. Kopcke et al., Data Knowl. Eng. (2010).
[3] Evaluation of entity resolution approaches on real-world match problems. Kopcke et al., PVLDB'10.
[4] Active sampling for entity matching. Bellare et al., SIGKDD'12.

Blocking

Datasets     Cartesian product   Candidate set          Recall (%)   Total cost   Time
Restaurants  176.4K              176.4K (no blocking)   100          $0           –
Citations    168 million         38.2K                  99           $7.20        … hours
Products     56 million          173.4K                 92           $22          … hours

Comparison against blocking by a developer
– Citations: 100% recall with 202.5K candidate pairs
– Products: 90% recall with 180.2K candidate pairs
See the paper for more experiments
– on blocking, the matcher, the accuracy estimator, the difficult pairs' locator, etc.
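The recall and reduction figures in the table follow directly from the candidate-set sizes. A quick sketch of how they are computed (the match counts below are made-up placeholders, only the ratios matter):

```python
def blocking_stats(cartesian, candidates, matches_total, matches_kept):
    """Reduction ratio and recall of a blocking step.

    reduction: fraction of the Cartesian product removed by blocking.
    recall:    fraction of the true matches surviving into the candidate set.
    """
    reduction = 1 - candidates / cartesian
    recall = matches_kept / matches_total
    return reduction, recall

# Citations: 168M pairs cut to 38.2K, while keeping 99% of the true matches.
reduction, recall = blocking_stats(168_000_000, 38_200, 1000, 990)
```

The tension made explicit here is the one the experiments measure: blocking must remove nearly all of the Cartesian product (reduction close to 1) while losing almost none of the true matches (recall close to 1).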

Conclusion
Current crowdsourced EM often requires a developer
The need for a developer poses serious problems
– does not scale to EM at enterprises
– cannot handle crowdsourcing for the masses
Proposed hands-off crowdsourcing (HOC)
– crowdsource the entire workflow, with no developer
Developed Corleone, the first HOC system for EM
– competitive with or outperforms current solutions
– no developer effort, relatively little money
– being transitioned into production at WalmartLabs
Future directions
– scaling up to very large data sets
– HOC for other tasks, e.g., joins in crowdsourced RDBMSs, information extraction