Learning-based Entity Resolution with MapReduce
Lars Kolb, Hanna Köpcke, Andreas Thor, Erhard Rahm
Database Group Leipzig, http://dbs.uni-leipzig.de
Glasgow, CloudDB 2011

2 / 16 Entity Resolution
- Identification of semantically equivalent entities
  - Within one data source or between different sources
  - To merge them, compare them, improve data quality, etc.
- Duplicates due to
  - Order of authors
  - Extraction errors
  - Different titles
  - Typos
  - …

3 / 16 Entity Resolution (2)
- A lot of research work
  - Pairwise entity comparison
  - Application of multiple similarity measures on several attributes
  - Combination of similarity values into a match decision for each entity pair
- Hard to configure the combination of similarity values manually
- Study of real-world match systems/problems [VLDB'10]
  - Effective matching is difficult: F-measure < 75% for product data
  - Matching is expensive: scalability issues for O(n^2)
- Learning-based approaches automate the combination of similarity values but come with poor efficiency

[VLDB'10] Koepcke, Thor, Rahm: Evaluation of entity resolution approaches on real-world match problems. VLDB 2010
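To illustrate what a manually configured combination of similarity values looks like, here is a minimal Python sketch. The two similarity measures, the weights, the threshold, and the third toy record are all assumptions made for this example and do not come from the talk.

```python
from difflib import SequenceMatcher
from itertools import combinations

def title_sim(a, b):
    """Character-based title similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def author_sim(a, b):
    """Jaccard overlap of author sets, so the order of authors is ignored."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def is_match(e1, e2, w_title=0.6, w_author=0.4, threshold=0.8):
    """Hand-tuned weighted sum; weights and threshold are guesses (assumption)."""
    score = (w_title * title_sim(e1["title"], e2["title"])
             + w_author * author_sim(e1["authors"], e2["authors"]))
    return score >= threshold

entities = [
    {"id": 1, "title": "Learning-based Entity Resolution with MapReduce",
     "authors": ["Kolb", "Koepcke", "Thor", "Rahm"]},
    # Same publication with a different author order and title spelling.
    {"id": 2, "title": "Learning based entity resolution with Map Reduce",
     "authors": ["Rahm", "Thor", "Kolb", "Koepcke"]},
    # Made-up toy record that should not match.
    {"id": 3, "title": "A Survey of Graph Databases", "authors": ["Miller"]},
]

# Pairwise comparison: n*(n-1)/2 pairs within one source, hence O(n^2).
pairs = list(combinations(entities, 2))
matches = [(a["id"], b["id"]) for a, b in pairs if is_match(a, b)]
print(len(pairs), matches)
```

Every additional measure adds another weight to tune by hand, which is the configuration burden that learning-based approaches aim to remove.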

4 / 16 Learning-based Entity Resolution
- Based on training data, entity pairs are classified as match/non-match
- Pairwise similarity values serve as features for classification

Phase 1: Training
- Input: training data ⊆ R × S, i.e. entity pairs (attributes A_1 … A_u of R, A_1 … A_v of S) labeled match = true/false
- Similarity computation yields the training similarities:

  sim_1  …  sim_k  match
  0.8    …  0.7    true
  0.4    …  0.6    false

- Classifier training yields the classifier

Phase 2: Application
- Input: R and S
- Similarity computation yields one row per entity pair:

  id_R  id_S  sim_1  …  sim_k  match
  ...   ...   0.5    …  0.6    ?
  ...   ...   0.8    …  0.9    ?

- Classifier application yields the match result (id_R, id_S)

Observations
- The training phase takes < 5% of the runtime
- Similarity computation accounts for 95% of the application phase
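The two phases can be sketched in a few lines of Python. This is a simplified illustration, not the setup used in the talk: the classifier here is a one-level decision tree (a "stump") written by hand, k = 2 is assumed, and the similarity rows are the example values shown on the slide.

```python
def train_stump(training):
    """Phase 1: pick the (feature, threshold) split with the fewest training errors."""
    best = None
    k = len(training[0][0])
    for f in range(k):
        values = sorted({sims[f] for sims, _ in training})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            errors = sum((sims[f] >= t) != label for sims, label in training)
            if best is None or errors < best[0]:
                best = (errors, f, t)
    if best is None:
        raise ValueError("no feature has two distinct values in the training data")
    return best[1], best[2]

def classify(model, sims):
    """Phase 2: apply the learned split to one similarity vector."""
    f, t = model
    return sims[f] >= t

# Training similarities (sim_1, sim_2) with labels, taken from the slide example.
training = [((0.8, 0.7), True), ((0.4, 0.6), False)]
model = train_stump(training)

# Unlabeled pairs from the application phase, also from the slide example.
candidates = [(0.5, 0.6), (0.8, 0.9)]
decisions = [classify(model, sims) for sims in candidates]
print(model, decisions)
```

Training touches only the labeled pairs, while the application phase must compute similarities for every pair of R × S, which is consistent with the runtime observations above.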

5 / 16 Outline
- Motivation
- MapReduce
- Strategies for similarity computation and classifier application on the Cartesian product of two data sources with MapReduce
  - Solely in the map phase ("Broadcast Join") → MapSide
  - Even distribution of entity pairs across reduce tasks → ReduceSplit
- Experimental Results
- Conclusions & Future Work

6 / 16 MapReduce
- Programming model for distributed computation in cluster environments
- UDF map: applied to each input entity, outputs key-value pairs
- UDF part: applied to the key of each map output pair → assigns each pair to a reduce task, part(key) ∈ [0, r-1]
- UDF group: applied to the key to group key-value pairs
- UDF reduce: invoked for each group

[Figure: input data is processed by m=3 map tasks (map 0, map 1, map 2); part(key) routes the map output pairs to r=3 reduce tasks (reduce 0, reduce 1, reduce 2)]
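The interplay of the four user-defined functions can be sketched as a single-process simulation in Python. This is an illustration of the programming model only, not how Hadoop is implemented; `run_mapreduce` and the word-count job are hypothetical names and examples chosen for this sketch.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, part_fn, group_fn, reduce_fn, r):
    """Simulate map -> part -> group -> reduce with r reduce tasks."""
    reduce_inputs = [[] for _ in range(r)]
    for entity in inputs:                       # map is applied to each input entity
        for key, value in map_fn(entity):
            # part(key) in [0, r-1] selects the reduce task for this pair
            reduce_inputs[part_fn(key, r)].append((key, value))
    output = []
    for pairs in reduce_inputs:                 # one iteration per reduce task
        groups = defaultdict(list)
        for key, value in sorted(pairs):        # pairs are processed sorted by key
            groups[group_fn(key)].append(value)
        for group_key, values in groups.items():
            output.extend(reduce_fn(group_key, values))  # reduce is invoked per group
    return output

# Toy job (word count), chosen only for illustration.
def map_fn(line):
    return [(word, 1) for word in line.split()]

def part_fn(key, r):
    return sum(map(ord, key)) % r               # deterministic stand-in for a hash

def group_fn(key):
    return key

def reduce_fn(word, counts):
    return [(word, sum(counts))]

counts = run_mapreduce(["a b a", "b c"], map_fn, part_fn, group_fn, reduce_fn, r=3)
print(sorted(counts))
```

Keeping part and group as separate functions matters later: ReduceSplit partitions and groups on only a prefix of a composite key.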

7 / 16 Distributed Evaluation of the Cartesian Product
- Pairwise entity comparison requires distribution of entity pairs to computing tasks/nodes
- Match decision for each pair (e_R, e_S) of R × S:
  classifier.classify(sim_1(e_R, e_S), sim_2(e_R, e_S), …, sim_k(e_R, e_S)) = "match"
- Splitting both sources
  - Split R in x blocks (x=2)
  - Split S in y blocks (y=2)
  - Replicate each R-block y times
  - Replicate each S-block x times
  → x*y "match tasks"
- Splitting S only
  - Split S in x blocks (x=2)
  - Replicate R x times
  → x "match tasks"
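Both decompositions can be checked on a toy input. The following Python sketch assumes round-robin splitting and uses hypothetical helper names; it shows that each scheme covers every pair of R × S exactly once, with x*y and x match tasks respectively.

```python
from itertools import product

def split(entities, n):
    """Split a list into n blocks (round-robin; the splitting rule is an assumption)."""
    return [entities[i::n] for i in range(n)]

def tasks_split_both(R, S, x, y):
    """Every R-block meets every S-block: x*y match tasks."""
    return [(rb, sb) for rb in split(R, x) for sb in split(S, y)]

def tasks_split_s_only(R, S, x):
    """R is replicated to each of the x blocks of S: x match tasks."""
    return [(R, sb) for sb in split(S, x)]

def pairs_of(tasks):
    """All entity pairs compared by a list of match tasks."""
    return [(r, s) for rb, sb in tasks for r in rb for s in sb]

R = ["a", "b"]
S = ["c", "d", "e", "f"]

both = tasks_split_both(R, S, x=2, y=2)
s_only = tasks_split_s_only(R, S, x=2)
cartesian = sorted(product(R, S))

print(len(both), len(s_only))
print(sorted(pairs_of(both)) == cartesian, sorted(pairs_of(s_only)) == cartesian)
```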

8 / 16 MapSide (m=3)
- Map tasks buffer R in memory at initialization time
- Each map task operates on a partition of the larger data source S
- map(entity): match the currently processed entity of S with all buffered entities of R

Example with R = {a, b} buffered by every map task:
  map 0: S partition {c, d} → pairs a-c, b-c, a-d, b-d
  map 1: S partition {e, f} → pairs a-e, b-e, a-f, b-f
  map 2: S partition {g, h} → pairs a-g, b-g, a-h, b-h
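A minimal Python sketch of the MapSide idea, reproducing the example above. The class name `MapSideTask` and the always-true `compare_all` predicate are hypothetical; in a real job the predicate would be the similarity computation plus classifier application.

```python
class MapSideTask:
    """One map task: buffers all of R once, then matches each S entity against it."""

    def __init__(self, r_entities, is_match):
        self.buffer = list(r_entities)      # R is loaded at initialization time
        self.is_match = is_match

    def map(self, s_entity):
        """Compare the current S entity with every buffered R entity."""
        return [f"{r}-{s_entity}" for r in self.buffer if self.is_match(r, s_entity)]

R = ["a", "b"]
S_PARTITIONS = [["c", "d"], ["e", "f"], ["g", "h"]]   # one partition per map task

def compare_all(r, s):
    """Placeholder predicate that keeps every pair, to list all comparisons."""
    return True

pairs_per_task = []
for partition in S_PARTITIONS:
    task = MapSideTask(R, compare_all)
    pairs_per_task.append([p for s in partition for p in task.map(s)])

for index, pairs in enumerate(pairs_per_task):
    print(f"map {index}:", ", ".join(pairs))
```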

9 / 16 ReduceSplit
- R is split in x blocks, S is split in y blocks
- All x blocks of R are compared with all y blocks of S
- Implementation
  - Composite map output keys
  - Grouping by i.j → invocation of reduce per group
  - Entities of R appear before entities of S in the list of entities
  - Reduce tasks buffer entities of R and match each entity of S with the buffer

Input           Assigned block index (random)   Outputted key-value pairs
Entity e of R   i ∈ [0, x-1]                    y pairs (i.j.R, e) for j ∈ [0, y-1]
Entity e of S   j ∈ [0, y-1]                    x pairs (i.j.S, e) for i ∈ [0, x-1]

Partitioning function: part(i.j.source) = (i + j*x) mod r

Example reduce task assignment of part for x=2, y=3, r=3:

       j=0  j=1  j=2
  i=0   0    2    1
  i=1   1    0    2

[Figure: blocks R_0, R_1, …, R_{x-1} and S_0, S_1, …, S_{y-1}. An entity e of block R_1 is emitted as (1.0.R, e), (1.1.R, e), …, (1.y-1.R, e); an entity e of block S_{y-1} is emitted as (0.y-1.S, e), (1.y-1.S, e), …, (x-1.y-1.S, e)]
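The key generation and the partitioning function can be written down directly. The Python sketch below uses hypothetical function names and represents the composite key i.j.source as a tuple; it recomputes the example assignment for x=2, y=3, r=3.

```python
from collections import Counter

def part(i, j, x, r):
    """Partitioning function from the slide: (i + j*x) mod r."""
    return (i + j * x) % r

def map_r(entity, i, y):
    """An R entity with block index i is replicated to all y S-blocks."""
    return [((i, j, "R"), entity) for j in range(y)]

def map_s(entity, j, x):
    """An S entity with block index j is replicated to all x R-blocks."""
    return [((i, j, "S"), entity) for i in range(x)]

x, y, r = 2, 3, 3

# Rows are i in [0, x-1], columns are j in [0, y-1].
table = [[part(i, j, x, r) for j in range(y)] for i in range(x)]
for i, row in enumerate(table):
    print(f"i={i}:", row)

# Number of (i, j) block pairs handled by each reduce task.
load = Counter(task for row in table for task in row)
print(dict(load))
```

For this configuration each of the three reduce tasks receives two of the six block pairs, which is the even distribution the strategy aims for.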

10 / 16 ReduceSplit (m=3, r=3, x=2, y=3)

Map phase: key = IndexR.IndexS.Source; the three map tasks read R = {a, b}, S = {c, d, e} and S = {f, g, h}

  Map output for R       Map output for S (c, d, e)   Map output for S (f, g, h)
  Key     Value          Key     Value                Key     Value
  0.0.R   a              0.0.S   c                    0.0.S   f
  0.1.R   a              1.0.S   c                    1.0.S   f
  0.2.R   a              0.1.S   d                    0.1.S   g
  1.0.R   b              1.1.S   d                    1.1.S   g
  1.1.R   b              0.2.S   e                    0.2.S   h
  1.2.R   b              1.2.S   e                    1.2.S   h

Partitioning by (IndexR + IndexS*x) mod r

Reduce phase: group by IndexR.IndexS

  reduce 0               reduce 1               reduce 2
  Key     Value          Key     Value          Key     Value
  0.0.R   a              0.2.R   a              0.1.R   a
  0.0.S   c, f           0.2.S   e, h           0.1.S   d, g
  1.1.R   b              1.0.R   b              1.2.R   b
  1.1.S   d, g           1.0.S   c, f           1.2.S   e, h

  Pairs a-c, a-f,        Pairs a-e, a-h,        Pairs a-d, a-g,
        b-d, b-g               b-c, b-f               b-e, b-h
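The worked example can be replayed end to end in Python. In this sketch the block indexes are fixed to the values shown on the slide (the strategy itself assigns them randomly), the composite key is modeled as a tuple, and all names are hypothetical.

```python
from collections import defaultdict

X, Y, REDUCERS = 2, 3, 3

# Block indexes as on the slide (assigned randomly in the actual strategy).
R_BLOCKS = {"a": 0, "b": 1}
S_BLOCKS = {"c": 0, "d": 1, "e": 2, "f": 0, "g": 1, "h": 2}

def map_phase():
    """Emit y pairs per R entity and x pairs per S entity, keyed by (i, j, source)."""
    out = []
    for entity, i in R_BLOCKS.items():
        out += [((i, j, "R"), entity) for j in range(Y)]
    for entity, j in S_BLOCKS.items():
        out += [((i, j, "S"), entity) for i in range(X)]
    return out

def reduce_split():
    """Partition by (i + j*x) mod r, group by (i, j), match S against buffered R."""
    tasks = defaultdict(list)
    for key, entity in map_phase():
        i, j, _ = key
        tasks[(i + j * X) % REDUCERS].append((key, entity))
    result = {}
    for task, pairs in sorted(tasks.items()):
        pairs.sort()                  # "R" sorts before "S" within each (i, j) group
        matched, buffer, current = [], [], None
        for (i, j, source), entity in pairs:
            if (i, j) != current:     # a new group starts: reset the R buffer
                current, buffer = (i, j), []
            if source == "R":
                buffer.append(entity)
            else:
                matched += [f"{r}-{entity}" for r in buffer]
        result[task] = matched
    return result

result = reduce_split()
for task, matched in result.items():
    print(f"reduce {task}:", ", ".join(matched))
```

Each reduce task ends up with four of the twelve pairs of R × S, and it only ever buffers the R entities of one group at a time.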

11 / 16 MapSide vs. ReduceSplit
- MapSide requires that R fits entirely in the main memory available per map task (multiple map tasks per node!)
  - In return, it needs no data redistribution, sorting, grouping, or reduce task scheduling
- With ReduceSplit, only |R|/x entities need to be buffered
  - At the expense of data replication (|R|*y + |S|*x map output pairs)
  - Careful choice of x, y is crucial for performance
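The trade-off can be made concrete with the two formulas above. The Python sketch below uses a hypothetical helper and plugs in the dataset sizes from the evaluation (2,600 and 64,000); the (x, y) combinations are arbitrary choices for illustration, and the buffer size is rounded up.

```python
import math

def reduce_split_cost(size_r, size_s, x, y):
    """Buffer size and replication volume for ReduceSplit with x R-blocks, y S-blocks."""
    return {
        "buffered_r_entities": math.ceil(size_r / x),   # |R|/x per reduce group
        "map_output_pairs": size_r * y + size_s * x,    # |R|*y + |S|*x
        "block_pairs": x * y,
    }

SIZE_R, SIZE_S = 2_600, 64_000

costs = {(x, y): reduce_split_cost(SIZE_R, SIZE_S, x, y)
         for x, y in [(1, 6), (2, 3), (6, 1)]}

for (x, y), cost in costs.items():
    print(f"x={x}, y={y}:", cost)
```

All three settings produce six block pairs, but splitting the smaller source R more finely (larger x) replicates the larger source S more often, so the map output grows quickly.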

12 / 16 Experimental Results: Match Quality
- Bibliographic datasets: DBLP (2,600) vs. Google Scholar (64,000)
- Up to six matchers
- Two classifiers from WEKA: Decision Tree and Support Vector Machine
- Employing multiple matchers increases overall match quality (F-measure)
  - Especially true if additional matchers operate on different attributes

13 / 16 Experimental Results: Time Distribution
- Evaluation of the runtime using MapSide
  - Same match problem
  - 10 Amazon EC2 High-CPU Medium instances (each with two virtual cores)
- Generally, multiple matchers increase match quality
  - At the expense of runtime
- Similarity computation consumes between 88% and 97% of the overall runtime, depending on the number of matchers

14 / 16 Experimental Results: Scalability
- MapSide with n = 1…50 dual-core VMs
- Almost linear speedup for up to 10 nodes
- Still good speedup values for more nodes (e.g. ≈40 for n=50)
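For reference, speedup and parallel efficiency relate as in the short Python sketch below. The definitions are the standard ones rather than something stated on the slide, and only the speedup of about 40 at n=50 is taken from the talk.

```python
def speedup(runtime_one_node, runtime_n_nodes):
    """Speedup S(n) = T(1) / T(n)."""
    return runtime_one_node / runtime_n_nodes

def efficiency(speedup_value, n):
    """Parallel efficiency E(n) = S(n) / n; 1.0 would be perfectly linear speedup."""
    return speedup_value / n

# Reported value: a speedup of roughly 40 on n=50 nodes.
eff_50 = efficiency(40, 50)
print(eff_50)
```

An efficiency of about 0.8 at 50 nodes means each node still contributes roughly 80% of what it would under ideal linear scaling.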

15 / 16 Conclusions
- Two different strategies for evaluation of the Cartesian product of two input sources
  - MapSide: similarity computation solely during the map phase
  - ReduceSplit: distribution of the Cartesian product evaluation evenly across all reduce tasks
- Evaluation of the proposed approaches
- Future work
  - Incorporate blocking strategies
  - Analysis of the learned model to avoid application of all matchers

16 / 16 Thank you for your attention
