
1 Corleone: Hands-Off Crowdsourcing for Entity Matching @WalmartLabs
Chaitanya Gokhale, University of Wisconsin-Madison
Joint work with AnHai Doan, Sanjib Das, Jeffrey Naughton, Ram Rampalli, Jude Shavlik, and Jerry Zhu

2 Entity Matching
Has been studied extensively for decades
No satisfactory solution as yet
Recent work has considered crowdsourcing

Walmart:
id | name                            | brand     | price
1  | HP Biscotti G72 17.3” Laptop .. | HP        | 395.0
2  | Transcend 16 GB JetFlash 500    | Transcend | 17.5
.. | ...                             | ...       | ...

Amazon:
id | name                            | brand     | price
1  | Transcend JetFlash 700          | Transcend | 30.0
2  | HP Biscotti 17.3” G72 Laptop .. | HP        | 388.0
.. | ...                             | ...       | ...

3 Recent Crowdsourced EM Work
Verifying predicted matches
– e.g., [Demartini et al. WWW’12, Wang et al. VLDB’12, SIGMOD’13]
Finding best questions to ask crowd
– to minimize number of such questions
– e.g., [Whang et al. VLDB’13]
Finding best UI to pose questions
– display 1 question per page, or 10, or …?
– display record pairs or clusters?
– e.g., [Marcus et al. VLDB’11, Whang et al. TR’12]

4 Recent Crowdsourced EM Work
Example: verifying predicted matches
– sample blocking rule: if prices differ by at least $50 → do not match
Shows that crowdsourced EM is highly promising
But suffers from a major limitation
– crowdsources only parts of workflow
– needs a developer to execute the remaining parts
[Figure: workflow over tables A = {a, b, c} and B = {d, e}: Blocking yields candidate pairs (a,d), (b,e), (c,d), (c,e); Matching predicts (a,d) Y, (b,e) N, (c,d) Y, (c,e) Y; the crowdsourced Verifying step confirms (a,d) Y and (c,e) Y]
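As an illustration only (not from the talk), a minimal Python sketch of what the sample blocking rule above does to the toy product records from slide 2:

# Hypothetical product records; "price" is in dollars.
table_A = [{"id": 1, "name": "HP Biscotti G72 17.3\" Laptop", "price": 395.0},
           {"id": 2, "name": "Transcend 16 GB JetFlash 500", "price": 17.5}]
table_B = [{"id": 1, "name": "Transcend JetFlash 700", "price": 30.0},
           {"id": 2, "name": "HP Biscotti 17.3\" G72 Laptop", "price": 388.0}]

def survives_blocking(a, b):
    """Keep a pair only if its prices do NOT differ by $50 or more."""
    return abs(a["price"] - b["price"]) < 50

candidate_pairs = [(a, b) for a in table_A for b in table_B
                   if survives_blocking(a, b)]
# Of the 4 pairs in A x B, only 2 survive the rule and go on to the matcher.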

5 Need for Developer Poses Serious Problems
Does not scale to EM at enterprises
– enterprises often have tens to hundreds of EM problems
– can’t afford so many developers
Example: matching products at WalmartLabs
– hundreds of major product categories
– to obtain high accuracy, must match each category separately
– so have hundreds of EM problems, one per category
[Figure: product category trees for walmart.com and Walmart Stores (brick & mortar), e.g., electronics (TVs, …), clothes (shirts, pants, …), books (romance, science, …)]

6 Need for Developer Poses Serious Problems
Cannot handle crowdsourcing for the masses
– masses can’t be developers, can’t use crowdsourcing startups either
E.g., a journalist wants to match two long lists of political donors
– can’t use current EM solutions, because can’t act as a developer
– can pay up to $500
– can’t ask a crowdsourcing startup to help: $500 is too little for them to engage a developer
– same problem for domain scientists, small business workers, end users, data enthusiasts, …

7 Our Solution: Hands-Off Crowdsourcing
Crowdsources the entire workflow of a task
– requiring no developers
Given a problem P supplied by user U, a crowdsourced solution to P is hands-off iff
– it uses no developers, only the crowd
– user U does no or little initial setup work, requiring no special skills
Example: to match two tables A and B, user U supplies
– the two tables
– a short textual instruction to the crowd on what it means to match
– two negative & two positive examples to illustrate the instruction

8 Hands-Off Crowdsourcing (HOC)
A next logical direction for EM research
– from no- to partial- to complete crowdsourcing
Can scale up EM at enterprises
Can open up crowdsourcing for the masses
E.g., a journalist wants to match two lists of donors
– uploads the two lists to an HOC website
– specifies a budget of $500 on a credit card
– the HOC website uses the crowd to execute the EM workflow, returns matches to the journalist
Very little work so far on crowdsourcing for the masses
– even though that’s where crowdsourcing can make a lot of impact

9 Our Solution: Corleone, an HOC System for EM
[Figure: Corleone architecture. The user supplies tables A and B, instructions to the crowd, and four examples; the Blocker produces candidate tuple pairs; the Matcher predicts matches; the Accuracy Estimator outputs predicted matches with accuracy estimates (P, R); the Difficult Pairs’ Locator finds hard pairs; all components interact with a crowd of workers (e.g., on Amazon Mechanical Turk)]

10 Blocking
|A x B| is often very large (e.g., 10B pairs or more)
– developer writes rules to remove obviously non-matched pairs
– critical step in EM
How do we get the crowd to do this?
– ordinary workers can’t write machine-readable rules
– if they write rules in English, we can’t convert them into machine-readable form
Crowdsourced EM so far asks people to label examples
– no work has asked people to write machine-readable rules
Sample blocking rules:
– trigram(a.title, b.title) < 0.2 [for matching Citations]
– overlap(a.brand, b.brand) = 0 AND cosine(a.title, b.title) ≤ 0.1 AND (a.price/b.price ≥ 3 OR b.price/a.price ≥ 3 OR isNULL(a.price, b.price)) [for matching Products]
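To make the rules above concrete, a hedged Python sketch of how such rules can be expressed as machine-readable predicates. The similarity helpers here (trigram Jaccard, token overlap, token-set cosine) are simplified stand-ins, not Corleone’s actual feature functions, and the grouping of the price clause follows one plausible reading of the slide:

def trigram_sim(s, t):
    # Jaccard similarity over character 3-grams (a simple stand-in)
    A = {s[i:i+3] for i in range(len(s) - 2)}
    B = {t[i:i+3] for i in range(len(t) - 2)}
    return len(A & B) / len(A | B) if A | B else 0.0

def overlap(s, t):
    # number of shared word tokens
    return len(set(s.lower().split()) & set(t.lower().split()))

def cosine_sim(s, t):
    # cosine similarity over word-token sets (a simple stand-in)
    A, B = set(s.lower().split()), set(t.lower().split())
    return len(A & B) / ((len(A) * len(B)) ** 0.5) if A and B else 0.0

def citation_blocking_rule(a, b):
    # [for Citations] drop the pair (predict "no match") if titles share few trigrams
    return trigram_sim(a["title"], b["title"]) < 0.2

def product_blocking_rule(a, b):
    # [for Products] drop the pair if brands do not overlap, titles are dissimilar,
    # and prices are wildly different or missing
    price_off = (a["price"] is None or b["price"] is None or
                 a["price"] / b["price"] >= 3 or b["price"] / a["price"] >= 3)
    return (overlap(a["brand"], b["brand"]) == 0 and
            cosine_sim(a["title"], b["title"]) <= 0.1 and
            price_off)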

11 Our Key Idea
Ask people to label examples, as before
Use them to generate many machine-readable rules
– using machine learning, specifically a random forest
Ask the crowd to evaluate, select, and apply the best rules
This has proven highly promising
– e.g., reduced # of tuple pairs from 168M to 38.2K at a cost of $7.20, and from 56M to 173.4K at a cost of $22
– with no developer involved
– in some cases did much better than using a developer (bigger reduction, higher accuracy)

12 Blocking in Corleone
Decide if blocking is necessary
– if |A x B| < τ, no blocking, return A x B; otherwise do blocking
Take sample S from A x B
Train a random forest F on S (to match tuple pairs)
– using active learning, where the crowd labels pairs
[Figure: the active learning loop. Start from the four examples supplied by the user (2 positive, 2 negative); train a random forest F; while the stopping criterion is not satisfied, select the q “most informative” unlabeled examples from sample S, label them using the crowd on Amazon’s Mechanical Turk, and retrain F]
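Not Corleone’s actual code, but a minimal sketch of this loop using scikit-learn’s RandomForestClassifier. Here featurize (pair → similarity feature vector), crowd_label (Mechanical Turk labels, 1 = match), and the fixed round count standing in for the real stopping criterion are all assumptions:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_matcher(sample_pairs, seed_pairs, seed_labels, q=20, max_rounds=10):
    # seed_pairs / seed_labels: the user's 2 positive + 2 negative examples.
    labeled_X = [featurize(p) for p in seed_pairs]    # featurize: assumed helper
    labeled_y = list(seed_labels)
    unlabeled = list(sample_pairs)
    forest = RandomForestClassifier(n_estimators=10)
    for _ in range(max_rounds):                       # stand-in stopping criterion
        forest.fit(labeled_X, labeled_y)
        X_u = np.array([featurize(p) for p in unlabeled])
        # "Most informative" here = pairs whose predicted match probability is
        # closest to 0.5, i.e., where the trees disagree the most.
        proba = forest.predict_proba(X_u)[:, 1]
        picked = np.argsort(np.abs(proba - 0.5))[:q]
        new_pairs = [unlabeled[i] for i in picked]
        labeled_X += [featurize(p) for p in new_pairs]
        labeled_y += crowd_label(new_pairs)           # crowd_label: assumed helper
        picked_set = set(picked.tolist())
        unlabeled = [p for i, p in enumerate(unlabeled) if i not in picked_set]
    return forest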

13 Blocking in Corleone
Extract candidate rules from random forest F
[Figure: example random forest F for matching books. One tree splits on isbn_match, then #pages_match; another splits on title_match, publisher_match, then year_match; leaves predict Yes (match) or No (no match)]
Extracted candidate rules (root-to-leaf paths that end in a “No” leaf):
– (isbn_match = N) → No
– (isbn_match = Y) and (#pages_match = N) → No
– (title_match = N) → No
– (title_match = Y) and (publisher_match = N) → No
– (title_match = Y) and (publisher_match = Y) and (year_match = N) → No
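A hedged sketch of this extraction step, assuming the forest was trained with scikit-learn (as in the previous sketch) and that class 0 encodes “no match”; a candidate rule is any root-to-leaf path ending in a “no match” leaf. With 0/1 pair features, a condition like “isbn_match <= 0.5” reads as “isbn_match = N”:

import numpy as np

def extract_candidate_rules(forest, feature_names):
    rules = []
    for tree in forest.estimators_:
        t = tree.tree_
        def walk(node, conditions):
            if t.children_left[node] == -1:                    # leaf node
                if np.argmax(t.value[node][0]) == 0:           # majority class 0 = "no match" (assumed)
                    rules.append(" and ".join(conditions) or "always")
                return
            name, thr = feature_names[t.feature[node]], t.threshold[node]
            walk(t.children_left[node],  conditions + [f"{name} <= {thr:.2f}"])
            walk(t.children_right[node], conditions + [f"{name} > {thr:.2f}"])
        walk(0, [])
    return rules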

14 Blocking in Corleone
Evaluate the precision of extracted candidate rules
– for each rule R, apply R to predict “match / no match” on sample S
– ask the crowd to evaluate R’s predictions
– compute precision for R
Select the most precise rules as “blocking rules”
Apply blocking rules to A and B using Hadoop, to obtain a smaller set of candidate pairs to be matched
Multiple difficult optimization problems in blocking
– to minimize crowd effort & scale up to very large tables A and B
– see paper
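A minimal sketch of the rule-evaluation step, under stated assumptions: each rule is a predicate over a pair that returns True when it predicts “no match”, crowd_verdict is a hypothetical stand-in for asking Mechanical Turk whether a pair really is a non-match, and the precision threshold is illustrative:

def select_blocking_rules(rules, sample_S, crowd_verdict, min_precision=0.95):
    selected = []
    for rule in rules:
        predicted_no = [p for p in sample_S if rule(p)]
        if not predicted_no:
            continue                                   # rule never fires on S
        correct = sum(1 for p in predicted_no if crowd_verdict(p) == "no match")
        precision = correct / len(predicted_no)
        if precision >= min_precision:                 # keep only very safe rules
            selected.append((rule, precision))
    # The most precise rules become the blocking rules applied to all of A x B
    # (in Corleone this final application is done with Hadoop).
    return sorted(selected, key=lambda pair: pair[1], reverse=True)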

15 The Rest of Corleone
[Figure: the Corleone architecture from slide 9, with the components that follow the Blocker: the Matcher takes the candidate tuple pairs, the Accuracy Estimator produces predicted matches with accuracy estimates, and the Difficult Pairs’ Locator finds hard pairs, all driven by the crowd of workers (e.g., on Amazon Mechanical Turk)]

16 Empirical Evaluation
Mechanical Turk settings
– Turker qualifications: at least 100 HITs completed with ≥ 95% approval rate
– Payment: 1-2 cents per question
Repeated three times on each data set, each run in a different week

Dataset     | Table A | Table B | |A x B|  | |M|   | # attributes | # features
Restaurants | 533     | 331     | 176,423  | 112   | 4            | 12
Citations   | 2,616   | 64,263  | 168.1 M  | 5,347 | 4            | 7
Products    | 2,554   | 21,537  | 55 M     | 1,154 | 9            | 23

17 Performance Comparison
Two traditional solutions: Baseline 1 and Baseline 2
– developer performs blocking
– supervised learning to match the candidate set
Baseline 1: labels the same # of pairs as Corleone
Baseline 2: labels 20% of the candidate set
– for Products, Corleone labels 3,205 pairs; Baseline 2 labels 36,076
Also compare with results from published work

18 Performance Comparison

Dataset     | Corleone P / R / F1 (cost)   | Baseline 1 P / R / F1 | Baseline 2 P / R / F1 | Published F1
Restaurants | 97.0 / 96.1 / 96.5 ($9.20)   | 10.0 / 6.1 / 7.6      | 99.2 / 93.8 / 96.4    | 92-97% [1,2]
Citations   | 89.9 / 94.3 / 92.1 ($69.50)  | 90.4 / 84.3 / 87.1    | 93.0 / 91.1 / 92.0    | 88-92% [2,3,4]
Products    | 91.5 / 87.4 / 89.3 ($256.80) | 92.9 / 26.6 / 40.5    | 95.0 / 54.8 / 69.5    | Not available

[1] CrowdER: crowdsourcing entity resolution. Wang et al., VLDB’12.
[2] Frameworks for entity matching: A comparison. Kopcke et al., Data Knowl. Eng. (2010).
[3] Evaluation of entity resolution approaches on real-world match problems. Kopcke et al., PVLDB’10.
[4] Active sampling for entity matching. Bellare et al., SIGKDD’12.

19 Blocking

Dataset     | Cartesian product | Candidate set         | Recall (%) | Total cost | Time
Restaurants | 176.4K            | 176.4K (no blocking)  | 100        | $0         | –
Citations   | 168 million       | 38.2K                 | 99         | $7.20      | 6.2 hours
Products    | 56 million        | 173.4K                | 92         | $22.00     | 2.7 hours

Comparison against blocking by a developer
– Citations: 100% recall with 202.5K candidate pairs
– Products: 90% recall with 180.2K candidate pairs
See paper for more experiments
– on blocking, matcher, accuracy estimator, difficult pairs’ locator, etc.

20 Conclusion
Current crowdsourced EM often requires a developer
Need for developer poses serious problems
– does not scale to EM at enterprises
– cannot handle crowdsourcing for the masses
Proposed hands-off crowdsourcing (HOC)
– crowdsource the entire workflow, no developer
Developed Corleone, the first HOC system for EM
– competitive with or outperforms current solutions
– no developer effort, relatively little money
– being transitioned into production at WalmartLabs
Future directions
– scaling up to very large data sets
– HOC for other tasks, e.g., joins in crowdsourced RDBMSs, IE

