CrowdER - Crowdsourcing Entity Resolution

Agenda Our agenda for today is to answer these questions: WHY CrowdER? WHAT is CrowdER? HOW is CrowdER implemented?

What is Entity Resolution? Also known as entity reconciliation or duplicate detection. Central to data integration and data cleaning, where data comes from multiple sources. Problem: two records that are not exactly identical may refer to the same real-world entity. Solution: Entity Resolution, the task of finding different records that refer to the same entity.

Product Data Set: these records look similar!

Entity Resolution Techniques Machine based techniques: Similarity Based Learning Based

Similarity-Based Techniques These require a similarity function and a threshold. The similarity function takes a pair of records as input and outputs a similarity value. If the similarity value > threshold, the two records are taken to refer to the same entity.

Example: Jaccard similarity Apply Jaccard similarity between product names, and take the threshold value to be 0.5. Jaccard similarity over two sets is defined as the size of the set intersection divided by the size of the set union. Since the Jaccard similarity between r1 and r2 satisfies J(r1, r2) > 0.5, r1 and r2 refer to the same entity.
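The Jaccard computation above can be sketched in a few lines; the two product-name strings below are illustrative stand-ins, not the slide's actual records.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two strings over their word sets."""
    s, t = set(a.lower().split()), set(b.lower().split())
    return len(s & t) / len(s | t)

# Hypothetical product names: 4 shared tokens out of 7 distinct tokens.
r1 = "iPad Two 16GB WiFi White"
r2 = "iPad 2nd generation 16GB WiFi White"
print(jaccard(r1, r2))  # 4/7 ~= 0.571, above the 0.5 threshold
```

With a 0.5 threshold, this pair would be declared a match.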

Disadvantages: Computing the similarity for every pair of records is expensive, and the results may not be accurate.

Learning-Based Techniques: These model entity resolution as a classification problem. Here we can choose Jaccard similarity on the product name, so each pair of records is represented as a feature vector with a single dimension. Train the classifier: the training set consists of positive and negative feature vectors indicating matching and non-matching pairs.
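As a hedged illustration of this idea (not the paper's actual classifier), a one-dimensional feature vector can be handled by the simplest possible learner, a decision stump that picks the similarity threshold maximizing training accuracy. The training data below is synthetic.

```python
def fit_stump(features, labels):
    """Pick the similarity threshold that maximizes training accuracy."""
    best_t, best_acc = 0.0, -1
    for t in sorted(set(features)):
        # predict "match" whenever the feature reaches the threshold t
        acc = sum((f >= t) == y for f, y in zip(features, labels))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

features = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # Jaccard similarity of each pair
labels   = [1,   1,   1,   0,   0,   0]    # 1 = matching pair (synthetic)
threshold = fit_stump(features, labels)
print(threshold)  # 0.7 separates this training set perfectly
```

In practice a real learning-based system would use richer features and a proper classifier, but the pipeline shape is the same.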

Disadvantages: A good training set is needed to train the classifier, and even these state-of-the-art techniques have difficulty identifying duplicate products from textual descriptions in many domains.

Crowdsourcing? It is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, especially an online community, rather than from traditional employees. Popular crowdsourcing platforms: Amazon Mechanical Turk (AMT) and CrowdFlower. They support crowdsourced execution of “microtasks” or “Human Intelligence Tasks” (HITs), where people do simple jobs requiring little or no domain expertise.

Entity Resolution using Crowdsourcing: ER is expressed as a query in a crowd-enabled query processing system such as CrowdDB. A simple CrowdSQL query: SELECT p.id, q.id FROM product p, product q WHERE p.product_name ~= q.product_name Here ~= is an operator that asks the crowd to decide whether or not p.product_name and q.product_name refer to the same product.

Naïve Approach: Create a HIT for each pair of records and, for each HIT, ask people in the crowd to decide whether or not the two records refer to the same entity. If the table has n records: Number of HITs = O(n²)

Batching Strategies: First Approach: Place multiple record pairs into a single HIT. Workers must check each pair in the HIT individually. Suppose a pair-based HIT consists of k pairs of records: Number of HITs = O(n²/k)

Batching Strategies: Second Approach: Place multiple records into a single HIT and ask workers to find all matches among all of the records. Suppose such a HIT consists of k records: Number of HITs = O(n²/k²)

Disadvantages of Batching: Though batching provides significant cost benefits with only minimal negative impact on accuracy, it suffers from a scalability problem. Example: number of records = 10,000, k = 20, cost = $0.01 per HIT

Approach   Number of HITs   Cost
First      5,000,000        $50,000
Second     250,000          $2,500
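The table's arithmetic can be reproduced directly; this sketch uses the slide's n²/k and n²/k² HIT counts (rather than the exact pair count n(n−1)/2):

```python
def batching_hits(n, k):
    """HIT counts for the two batching strategies (n records, batch size k)."""
    pair_hits = n * n // k           # first approach: k pairs per HIT
    cluster_hits = n * n // (k * k)  # second approach: k records per HIT
    return pair_hits, cluster_hits

first, second = batching_hits(10_000, 20)
print(first, second)  # 5000000 250000
# At $0.01 per HIT that is roughly $50,000 vs. $2,500.
```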

Can we do better?

HOW ??

Hybrid Human-Machine Approach A spectrum of options: machine-based techniques only, crowd-based techniques only, or a hybrid human-machine approach.

Contributions: Propose a hybrid human-machine system for entity resolution. Formulate the cluster-based HIT generation problem and give a two-tiered heuristic solution. Compare pair-based and cluster-based HIT generation. Compare this approach with other techniques by experimenting on sample data.

CrowdER – Hybrid Human-Machine Approach CrowdER uses machine-based techniques to discard pairs of records that look very dissimilar, and only asks the crowd to verify the remaining pairs, which need human insight. This combines the efficiency of machine-based approaches with the answer quality that can be obtained from people.

Hybrid Human-Machine Workflow HIT generation is the key component of this workflow

HIT Generation Given a set of pairs of records, they must be combined into HITs for crowdsourcing. Two types: Pair-Based HITs, Cluster-Based HITs

Pair-based HIT Generation: Each HIT consists of multiple pairs of records to be compared, batched into a single HIT. Workers just have to verify whether each pair refers to the same entity or not. Suppose a pair-based HIT can contain at most k pairs. Given a set of pairs P, we need to generate ⌈|P|/k⌉ pair-based HITs.

Example Total records = 9, Total pairs = 36 Pairs to be verified = 10, k=2 Number of HITs = 5
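Pair-based HIT generation is just chunking the candidate pairs; a minimal sketch, using a hypothetical 10-pair candidate set consistent with the example above (the paper's actual Figure 2 pairs are not reproduced here):

```python
def pair_based_hits(pairs, k):
    """Batch the candidate pairs into HITs of at most k pairs each."""
    return [pairs[i:i + k] for i in range(0, len(pairs), k)]

# Hypothetical candidate pairs among 9 records (10 pairs survive pruning).
pairs = [(f"r{i}", f"r{j}") for i, j in
         [(1,2),(1,7),(2,7),(2,3),(3,4),(3,5),(4,5),(5,6),(4,7),(8,9)]]
hits = pair_based_hits(pairs, 2)
print(len(hits))  # 5 HITs, matching ceil(10/2)
```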

Amazon Mechanical Turk Example:

Cluster-based HIT Generation: A HIT consists of a group of individual records rather than pairs, and workers are asked to find all duplicate records in the group. It allows a pair of records to be matched if and only if both records are in the HIT. As payment is made per HIT, we must minimize the number of HITs. A bound should be placed on the number of records in a HIT to avoid higher latencies and lower-quality answers.

Example

Goal Goal: Given a set of pairs of records, P, and a cluster-size threshold, k, the cluster-based HIT generation problem is to generate the minimum number of cluster-based HITs, H1, H2, …, Hh, that satisfy two requirements: 1. |Hℓ| ≤ k for any ℓ ∈ [1, h], where |Hℓ| denotes the number of records in Hℓ 2. for any (ri, rj) ∈ P, there exists Hℓ (ℓ ∈ [1, h]) such that ri ∈ Hℓ and rj ∈ Hℓ

Example Consider the ten pairs of records in Figure 2(a), with cluster-size threshold k = 4. Suppose we generate three cluster-based HITs: H1 = {r1, r2, r3, r7}, H2 = {r3, r4, r5, r6} and H3 = {r4, r7, r8, r9}. Both conditions are satisfied. Number of HITs = 3
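The two requirements are easy to check mechanically. A small checker, run on the example's three HITs and on a hypothetical 10-pair candidate set consistent with them (the paper's actual Figure 2 pairs are not reproduced):

```python
def valid_generation(hits, pairs, k):
    """Check the two requirements of the cluster-based HIT generation goal."""
    size_ok = all(len(h) <= k for h in hits)                     # |Hl| <= k
    cover_ok = all(any(ri in h and rj in h for h in hits)        # every pair
                   for ri, rj in pairs)                          # co-located
    return size_ok and cover_ok

hits = [{"r1","r2","r3","r7"}, {"r3","r4","r5","r6"}, {"r4","r7","r8","r9"}]
pairs = [("r1","r2"),("r1","r7"),("r2","r7"),("r2","r3"),("r3","r4"),
         ("r3","r5"),("r4","r5"),("r5","r6"),("r4","r7"),("r8","r9")]
print(valid_generation(hits, pairs, 4))  # True
```

Dropping H3 would leave (r8, r9) uncovered, so the checker would return False.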

Cluster-based HIT generation problem is NP-Hard

Two-Tiered Approach First build a graph on the set of pairs. The graph consists of many connected components. We classify the connected components into two types based on the cluster-size threshold k: 1. Large Connected Components (LCCs): more than k vertices 2. Small Connected Components (SCCs): no more than k vertices

Two-Tiered Approach Since LCCs have more than k vertices, they have to be partitioned into SCCs. While partitioning, create SCCs that are highly connected, so as to increase the number of edges covered by each component. Then, by using a packing method, we can add multiple SCCs to a single HIT to reduce the number of HITs.

Example The graph consists of two connected components. Suppose k = 4. The top one is an LCC and the bottom one is an SCC. Consider the top component containing seven vertices {r1, r2, r3, r4, r5, r6, r7}. Since it is an LCC, it must be partitioned. Assume it is partitioned into three SCCs, {r1, r2, r3, r7}, {r3, r4, r5, r6} and {r4, r7}, which cover all of its edges. Then the graph becomes {r1, r2, r3, r7}, {r3, r4, r5, r6}, {r4, r7}, {r8, r9}.

Algorithm

LCC Partitioning (Top Tier) Input: an LCC. Output: SCCs formed by partitioning the LCC. Approach: greedy algorithm. Process: the algorithm iteratively generates the SCC with the highest connectivity, and iterates until the generated SCCs cover all edges in the LCC. Indegree: number of edges between a vertex r (not in the scc) and vertices in the scc. Outdegree: number of edges between r (not in the scc) and vertices not in the scc.

Procedure Initialize an scc with the vertex having the maximum degree in the LCC. The algorithm repeatedly adds to the scc the new vertex that maximizes the connectivity of the scc, i.e. has maximum indegree. If there is a tie on maximum indegree, the algorithm selects the vertex with minimum outdegree. Add this vertex to the scc and update the indegree and outdegree of all vertices w.r.t. the new scc. The algorithm stops when the size of the scc equals k, or when no remaining vertex connects with the scc.
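The procedure above can be sketched as follows. This is a simplified rendering, not the paper's exact pseudocode, and the 7-vertex edge set at the bottom is hypothetical (Figure 2's actual edges are not reproduced here).

```python
def grow_scc(edges, k):
    """Greedily grow one SCC of at most k vertices over the given edges."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    scc = {max(degree, key=degree.get)}  # seed with a max-degree vertex
    while len(scc) < k:
        best, best_key = None, None
        for r in degree:
            if r in scc:
                continue
            # indegree: edges between r and the current scc
            indeg = sum(1 for u, v in edges
                        if (u == r and v in scc) or (v == r and u in scc))
            if indeg == 0:
                continue
            # outdegree: remaining edges from r to vertices outside the scc
            outdeg = degree[r] - indeg
            key = (-indeg, outdeg)  # max indegree, tie-break on min outdegree
            if best_key is None or key < best_key:
                best, best_key = r, key
        if best is None:  # no remaining vertex connects with the scc
            break
        scc.add(best)
    return scc

def partition_lcc(edges, k):
    """Top tier: grow SCCs until every edge of the LCC is covered."""
    uncovered, sccs = set(edges), []
    while uncovered:
        scc = grow_scc(sorted(uncovered), k)
        sccs.append(scc)
        uncovered = {e for e in uncovered if not (e[0] in scc and e[1] in scc)}
    return sccs

# Hypothetical LCC on {r1,...,r7}; k = 4 as in the example.
edges = [("r1","r2"), ("r1","r7"), ("r2","r7"), ("r2","r3"), ("r3","r4"),
         ("r3","r5"), ("r4","r5"), ("r5","r6"), ("r4","r7")]
sccs = partition_lcc(edges, 4)
```

The exact SCCs produced depend on the edge set and tie-breaking, but every output component has at most k vertices and the components jointly cover all edges.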

Algorithm

Example

SCC Packing (Bottom Tier) Input: SCCs. Output: pack them into the minimum number of HITs such that the size of each HIT is no larger than k. As this is an NP-hard problem, we formulate it as an integer linear program. The integer linear program can be solved using efficient column generation and branch-and-bound.

Procedure Let p = [a1, a2, …, ak] denote a pattern of a CB HIT, where aj (j ∈ [1, k]) is the number of SCCs in the HIT that contain j vertices. Since a cluster-based HIT can contain at most k vertices, p is a feasible pattern only if ∑_{j=1}^{k} j · aj ≤ k holds. For example, suppose k = 4; p1 = [0, 0, 0, 1] is a feasible pattern since 1·0 + 2·0 + 3·0 + 4·1 = 4 ≤ k holds. We collect all feasible patterns into a set A = {p1, p2, …, pm}, where pi = [ai1, ai2, …, aik] (i ∈ [1, m]).

Continued… When packing a set of SCCs into cluster-based HITs, each HIT must correspond to a pattern in A. Let xi denote the number of cluster-based HITs whose pattern is pi (i ∈ [1, m]), and let cj denote the number of SCCs that contain j vertices. The problem is now to minimize the total number of HITs. We can formulate the packing problem as the following integer linear program: minimize ∑_{i=1}^{m} xi subject to ∑_{i=1}^{m} aij · xi ≥ cj for all j ∈ [1, k], with xi ≥ 0 and integer.

Example Given a set of SCCs, {r3, r4, r5, r6}, {r1, r2, r3, r7}, {r4, r7} and {r8, r9}, we have c1 = 0, c2 = 2, c3 = 0 and c4 = 2. To pack them into cluster-based HITs (k = 4), we first generate the feasible patterns, i.e. A = {p1 = [0, 0, 0, 1], p2 = [0, 2, 0, 0], p3 = [0, 1, 0, 0]}.

Continued… Now decide on the number of CB-HITs corresponding to each feasible pattern (x1, x2, x3). One possible solution: x1 = 2, x2 = 0, x3 = 2. It needs ∑_{i=1}^{3} xi = 4 CB-HITs, and it satisfies the constraints ∑_{i=1}^{3} aij · xi ≥ cj for all j ∈ [1, 4]. But this solution is not optimal, since another solution, x1 = 2, x2 = 1, x3 = 0, results in only 3 CB-HITs.
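The paper solves this packing step exactly as an integer linear program via column generation and branch-and-bound. As a lightweight stand-in that illustrates the bin-packing nature of the problem, a first-fit-decreasing heuristic happens to recover the optimal 3-HIT packing on this example:

```python
def pack_sccs(scc_sizes, k):
    """First-fit-decreasing: pack SCC sizes into bins of capacity k."""
    bins = []  # each bin is one cluster-based HIT; total size must stay <= k
    for size in sorted(scc_sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= k:
                b.append(size)
                break
        else:
            bins.append([size])  # no existing HIT fits: open a new one
    return bins

# SCC sizes from the example: two of size 4, two of size 2; k = 4.
print(pack_sccs([4, 4, 2, 2], 4))  # [[4], [4], [2, 2]] -> 3 HITs
```

First-fit-decreasing is only a heuristic in general; the ILP guarantees the minimum number of HITs.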

PB-HITs vs. CB-HITs In PB-HITs, the number of comparisons is the number of pairs batched into the HIT. For a CB-HIT with n records: say the CB HIT has m distinct entities e1, e2, …, em, where ei is the set of records that refer to the i-th entity. Total comparisons = ∑_{i=1}^{m} (n − 1 − ∑_{j=1}^{i−1} |ej|) So a CB HIT requires fewer comparisons if it contains more matches.

Example Consider the CB HIT with 4 records {r1, r2, r3, r7}: e1 = {r1, r2, r7}, e2 = {r3}. Number of comparisons = (4 − 1 − 0) + (4 − 1 − 3) = 3
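The comparison-count formula above can be evaluated directly from n and the entity sizes |e1|, …, |em|:

```python
def cb_comparisons(n, entity_sizes):
    """Sum over entities of (n - 1 - records already resolved)."""
    total, seen = 0, 0
    for size in entity_sizes:
        total += n - 1 - seen  # compare one representative against the rest
        seen += size           # records of resolved entities need no rechecking
    return total

print(cb_comparisons(4, [3, 1]))  # 3, as in the {r1, r2, r3, r7} example
```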

Experimental Results Datasets: 1. Restaurant: 858 records. Total pairs = 367,653 (106 refer to the same entity). 2. Product: 1081 records from Abt and 1092 from Buy. Total pairs = 1,180,452 (1097 pairs refer to the same entity).

Terms Precision: percentage of correctly identified matching pairs out of all pairs identified as matches. Recall: percentage of correctly identified matching pairs out of all matching pairs in the dataset.
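The two metrics can be expressed directly over sets of pairs; a small illustrative helper (not from the paper), with made-up predicted and actual match sets:

```python
def precision_recall(predicted, actual):
    """Precision and recall over sets of matching pairs."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)  # correctly identified matching pairs
    return tp / len(predicted), tp / len(actual)

# Hypothetical result: one true match found, one false positive, one miss.
p, r = precision_recall({("r1","r2"), ("r3","r4")},
                        {("r1","r2"), ("r5","r6")})
print(p, r)  # 0.5 0.5
```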

Results - Machine Based (Simjoin)

Amazon Mechanical Turk We use AMT to evaluate our hybrid human-machine workflow. Cost per HIT: $0.02 to the worker and $0.005 to AMT for publishing. Techniques to improve result quality: Replication: each HIT is replicated into 3 assignments, each given to a different worker. Qualification test: workers must pass it to be eligible to do our HITs.

Results – Cluster Based HIT

Results – Entity Resolution Techniques

Conclusion Proposed a hybrid human-machine approach for crowdsourcing entity resolution. Proposed a two-tiered approach for generating cluster-based HITs. There is still a lot of work to be done on issues such as scaling to much larger datasets and techniques that take privacy into consideration (confidential data).

QUESTIONS?