
Parallel Sorted Neighborhood Blocking with MapReduce
Lars Kolb, Andreas Thor, Erhard Rahm
Database Group Leipzig, http://dbs.uni-leipzig.de
Kaiserslautern, BTW 2011

2 / 13 Entity Resolution
- Detection of entities in one or more sources that refer to the same real-world object

3 / 13 Entity Resolution (2)
- Runtime-intensive task → O(n²) entity comparisons
- Blocking: semantic grouping of similar entities into blocks
  - Based on blocking keys derived from the entities' attributes
  - Restricts entity comparisons to entities from the same block
- Parallelization
  - MapReduce
  - Exploitation of cloud infrastructures

4 / 13 Sorted Neighborhood - Running Example (w=3)
Key generation + sort by key
- Determine the blocking key for each entity and sort the entities by blocking key
- Entities (S): a b c d e f g h i
- Sorted by blocking key (K S): 1a 1d 2b 2e 2f 2h 3c 3g 3i
Sliding window
- Move a window of fixed size w over the sorted records and compare all entities within the window
- All entities within a distance of w-1 are compared
- Pairs per window position: a-d, a-b, d-b | d-e, b-e | b-f, e-f | e-h, f-h | f-c, h-c | h-g, c-g | c-i, g-i
Complexity: O(n²) → O(n) + O(n*log n) + O(n*w)
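The running example can be reproduced with a short sequential sketch. The function name `sorted_neighborhood` and the `KEYS` dictionary are illustrative choices rather than names from the slides. The sketch assumes a stable sort, so entities with equal blocking keys keep their input order.

```python
from collections import deque

def sorted_neighborhood(entities, blocking_key, w):
    """Pair every entity with the w-1 entities that precede it in key order."""
    ordered = sorted(entities, key=blocking_key)  # stable: ties keep input order
    window = deque(maxlen=w - 1)                  # the last w-1 entities seen
    pairs = []
    for entity in ordered:
        pairs.extend((earlier, entity) for earlier in window)
        window.append(entity)                     # oldest entity drops out automatically
    return pairs

# Blocking keys of the running example (entity -> key).
KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}

pairs = sorted_neighborhood("abcdefghi", KEYS.get, w=3)
print(", ".join(f"{x}-{y}" for x, y in pairs))
```

The loop performs at most n*(w-1) comparisons, which is the O(n*w) term in the complexity line.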

5 / 13 Outline
- Motivation
- Sorted Neighborhood and SN with MapReduce
  - Challenge 1: Sorted Reduce Partitions → SRP
  - Challenge 2: Comparison of Boundary Entities → JobSN/RepSN
- Experimental Results
- Conclusions & Future Work

6 / 13 MapReduce
- Computation expressed by two UDFs (user-defined functions)
  - Contain sequential code
  - Executed in parallel among multiple nodes
  - map: (key_in, value_in) → list(key_tmp, value_tmp)
  - reduce: (key_tmp, list(value_tmp)) → list(key_out, value_out)
- Computation relies on data partitioning and redistribution
  - Number of map tasks m and reduce tasks r
  - Each task is executed by some idle node in the cluster
  - UDF part partitions the map output and distributes it to the r reduce tasks
  - Sorting of key-value pairs
  - Grouping of key-value pairs by key and invocation of reduce for each group
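The data flow described on this slide can be imitated in a few lines of single-process Python. This is only a toy model for following the steps, not Hadoop's API: `run_mapreduce`, `map_fn`, `part_fn` and `reduce_fn` are hypothetical names, and the tasks run one after another instead of in parallel.

```python
from itertools import groupby

def run_mapreduce(splits, map_fn, part_fn, reduce_fn, r):
    """Toy model: one map task per split, r reduce tasks, all run sequentially."""
    reduce_inputs = [[] for _ in range(r)]
    # Map phase: call map for every input record, route each output pair with part.
    for split in splits:
        for key_in, value_in in split:
            for key_tmp, value_tmp in map_fn(key_in, value_in):
                reduce_inputs[part_fn(key_tmp, r)].append((key_tmp, value_tmp))
    # Reduce phase: sort by key, group by key, call reduce once per group.
    outputs = []
    for pairs in reduce_inputs:
        pairs.sort(key=lambda kv: kv[0])
        task_output = []
        for key_tmp, group in groupby(pairs, key=lambda kv: kv[0]):
            task_output.extend(reduce_fn(key_tmp, [value for _, value in group]))
        outputs.append(task_output)
    return outputs

# Illustration with the blocking keys of the running example (m=3, r=2).
KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}
splits = [[(None, entity) for entity in chunk] for chunk in ("abc", "def", "ghi")]

def map_fn(_key_in, entity):
    return [(KEYS[entity], entity)]          # emit (blocking key, entity)

def part_fn(key_tmp, r):
    return key_tmp % r                       # "key modulo r"

def reduce_fn(key_tmp, entities):
    return [(key_tmp, "".join(entities))]    # one block per blocking key

blocks = run_mapreduce(splits, map_fn, part_fn, reduce_fn, r=2)
print(blocks)
```

Reduce tasks are indexed from 0 here, so the task holding key 2 comes first in the returned list.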

7 / 13 Entity Resolution with MapReduce (m=3, r=2)
[Diagram: the input S = a b c d e f g h i is split across map 1 (a b c), map 2 (d e f) and map 3 (g h i). The map step (blocking) emits the key/entity pairs 1a 2b 3c | 1d 2e 2f | 3g 2h 3i. Partitioning by "key modulo r" sends 1a 1d 3c 3g 3i to reduce 1 and 2b 2e 2f 2h to reduce 2. The reduce step (matching) outputs M = a-d, c-i and M = b-f, e-h, which are merged into the output a-d, c-i, b-f, e-h.]
Map phase
- Input data is partitioned into m partitions
- Each partition is processed by one map task that calls map for each input record ("blocking")
- UDF part partitions the map output and distributes it to the r reduce tasks
Reduce phase
- Sorting of key-value pairs by key
- Grouping of key-value pairs by key
- Invocation of reduce for each group ("matching")
Challenge 1: SN requires a totally sorted list of entities
- All entities assigned to reduce task R_i have a smaller blocking key than all entities assigned to reduce task R_i+1
- "Sorted reduce partitions" (SRP)
- Must be ensured by part → range partitioning
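Challenge 1 can be checked mechanically on the running example. The sketch below compares "key modulo r" with a range partitioner and tests the sorted-reduce-partitions property. All function names are illustrative, and the single upper bound `[2]` is an assumption chosen to match the two key ranges of the example.

```python
from bisect import bisect_left

KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}

def part_modulo(key, r):
    return key % r

def part_range(key, upper_bounds):
    # Keys <= upper_bounds[0] go to partition 0, keys <= upper_bounds[1] to 1, ...
    return bisect_left(upper_bounds, key)

def assign(part_fn):
    """Return the blocking keys held by each non-empty partition, in partition order."""
    partitions = {}
    for key in KEYS.values():
        partitions.setdefault(part_fn(key), []).append(key)
    return [partitions[index] for index in sorted(partitions)]

def is_srp(partitions):
    """True if every key in partition i is smaller than every key in partition i+1."""
    return all(max(left) < min(right)
               for left, right in zip(partitions, partitions[1:]))

modulo_parts = assign(lambda key: part_modulo(key, 2))
range_parts = assign(lambda key: part_range(key, [2]))
print(is_srp(modulo_parts), is_srp(range_parts))
```

With modulo partitioning, keys 1 and 3 share a reduce task while key 2 sits in the other one, so concatenating the reduce inputs does not give a sorted list. The range partitioner keeps the global order.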

8 / 13 Sorted Neighborhood with MapReduce - SRP
[Diagram: key generation + partition prefix. map 1 turns 1a 2b 3c into 1.1a 1.2b 2.3c, map 2 turns 1d 2e 2f into 1.1d 1.2e 1.2f, map 3 turns 3g 2h 3i into 2.3g 1.2h 2.3i. Partitioning by partition prefix sends 1.1a 1.1d 1.2b 1.2e 1.2f 1.2h to reduce 1 and 2.3c 2.3g 2.3i to reduce 2. Sliding window (+ matching): reduce 1 outputs B = a-d a-b d-b d-e b-e b-f e-f e-h f-h, reduce 2 outputs B = c-g c-i g-i. Not covered: f-c? h-c? h-g?]
- map outputs a composite key: partitionPrefix.blockKey
- partitionPrefix(k) = 1 if k<=2, otherwise 2 (range partitioning)
- part(partitionPrefix.blockKey) = partitionPrefix
- Key-value pairs are sorted and grouped by composed key
reduce:
  forEach(entity ∈ list(value_tmp))
    match(buffer, entity); //match all buffered entities with entity
    buffer.append(entity);
    if(buffer.size()==w)
      buffer.removeFirst();
Challenge 2: Boundary entities
- Comparison of entities that are assigned to different reduce tasks
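A runnable sketch of the SRP variant follows. It mirrors the slide's reduce pseudocode, with a buffer that never holds more than w-1 entities after each step, and models each reduce task as one pass over its sorted input. The driver `srp` and the set comparison at the end are illustrative additions.

```python
from collections import deque

KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}

def partition_prefix(block_key):
    return 1 if block_key <= 2 else 2   # range partitioning from the slide

def srp(keys, w, r=2):
    # map: emit the composite key (partitionPrefix, blockKey) for each entity
    map_output = [((partition_prefix(k), k), entity) for entity, k in keys.items()]
    # part: route each pair to the reduce task named by its partition prefix
    tasks = {prefix: [] for prefix in range(1, r + 1)}
    for composite_key, entity in map_output:
        tasks[composite_key[0]].append((composite_key, entity))
    result = {}
    for prefix, records in tasks.items():
        records.sort(key=lambda kv: kv[0])      # sort by composite key
        buffer, pairs = deque(), []
        for _, entity in records:
            pairs.extend(f"{b}-{entity}" for b in buffer)  # match(buffer, entity)
            buffer.append(entity)
            if len(buffer) == w:
                buffer.popleft()                # keep only the last w-1 entities
        result[prefix] = pairs
    return result

found = srp(KEYS, w=3)
expected = {"a-d", "a-b", "d-b", "d-e", "b-e", "b-f", "e-f", "e-h",
            "f-h", "f-c", "h-c", "h-g", "c-g", "c-i", "g-i"}
missing = expected - set(found[1]) - set(found[2])
print(found, sorted(missing))
```

The three missing pairs are exactly the boundary comparisons named in Challenge 2.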

9 / 13 Sorted Neighborhood with MapReduce - JobSN
SN realization using two consecutive jobs
- Job1: SRP + additional output of boundary entities
  - Keys of the additionally output entities are prefixed with an additional boundary component
- Job2: SN for boundary entities
  - part(boundary.partitionIndex.blockKey) = boundary % r
  - Sort and group by composed key
[Diagram: Job1, sliding window (+ matching) + boundary prefix: reduce 1 (1.1a 1.1d 1.2b 1.2e 1.2f 1.2h) outputs B = a-d ... f-h and the boundary entities 1.2f 1.2h; reduce 2 (2.3c 2.3g 2.3i) outputs B = c-g c-i g-i and the boundary entities 2.3c 2.3g. Job2: map 1 and map 2 (identity) pass on 1.1.2f 1.1.2h and 1.2.3c 1.2.3g; after partitioning by boundary prefix, reduce 1 applies the sliding window (+ matching) and outputs B = f-c h-c h-g.]
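The two-job scheme can be sketched as follows. Several details are assumptions read off the diagram rather than stated on the slide: Job1 emits the last w-1 entities of every partition except the final one and the first w-1 entities of every partition except the first, Job2 only pairs entities that come from different partitions (otherwise f-h and c-g would be reported twice), and the window is reset between boundaries. The sketch also assumes every partition holds at least w-1 entities. Function names are illustrative.

```python
from collections import defaultdict, deque

def job1_boundary_output(partitions, w):
    """partitions[i] is the key-sorted input of reduce task i+1 as (block_key, entity)."""
    n = w - 1                                   # boundary entities per side
    records = []
    for index, partition in enumerate(partitions, start=1):
        if n == 0:
            break                               # a window of size 1 spans no boundary
        if index < len(partitions):             # tail faces boundary number `index`
            records += [((index, index, k), e) for k, e in partition[-n:]]
        if index > 1:                           # head faces boundary number `index - 1`
            records += [((index - 1, index, k), e) for k, e in partition[:n]]
    return records                              # key = (boundary, partitionIndex, blockKey)

def job2(boundary_records, w, r):
    tasks = defaultdict(list)
    for key, entity in boundary_records:        # map is the identity
        tasks[key[0] % r].append((key, entity)) # part = boundary % r
    pairs = []
    for task in sorted(tasks):
        buffer, current = deque(), None
        for (boundary, part, _), entity in sorted(tasks[task], key=lambda kv: kv[0]):
            if boundary != current:             # new boundary: start a fresh window
                buffer.clear()
                current = boundary
            pairs += [f"{b}-{entity}" for b_part, b in buffer if b_part != part]
            buffer.append((part, entity))
            if len(buffer) == w:
                buffer.popleft()
    return pairs

partitions = [[(1, "a"), (1, "d"), (2, "b"), (2, "e"), (2, "f"), (2, "h")],
              [(3, "c"), (3, "g"), (3, "i")]]
records = job1_boundary_output(partitions, w=3)
boundary_pairs = job2(records, w=3, r=1)
print(records, boundary_pairs)
```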

10 / 13 Sorted Neighborhood with MapReduce - RepSN
[Diagram: key generation + partition prefix + boundary prefix. map 1 emits 1.1.1a 1.1.2b 2.2.3c plus the replicas 2.1.1a 2.1.2b; map 2 emits 1.1.1d 1.1.2e 1.1.2f plus the replicas 2.1.2e 2.1.2f; map 3 emits 2.2.3g 1.1.2h 2.2.3i plus the replica 2.1.2h. Partitioning by boundary prefix sends 1.1.1a 1.1.1d 1.1.2b 1.1.2e 1.1.2f 1.1.2h to reduce 1 and 2.1.1a 2.1.2b 2.1.2e 2.1.2f 2.1.2h 2.2.3c 2.2.3g 2.2.3i to reduce 2. Sliding window (+ matching): reduce 1 outputs B = a-d a-b d-b d-e b-e b-f e-f e-h f-h, reduce 2 outputs B = f-c h-c h-g c-g c-i g-i.]
SN realization using data replication
- Reduce task i>1 needs the last w-1 entities of the previous partition in front of its input
- Potential boundary entities are replicated by the map tasks (two key-value pairs)
- The replica of an entity that is assigned to reduce task R_i is assigned to R_i+1
Implementation
- Map key is prefixed with a boundary component (like JobSN)
- boundary = partitionPrefix+1 for replicated entities (boundary = partitionPrefix otherwise)
- part(boundary.partitionPrefix.blockKey) = boundary
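A single-job sketch of the replication idea is given below. Two details are inferred from the diagram and should be read as assumptions: each map task replicates only its w-1 entities with the largest blocking keys per partition (d is not replicated by map 2), and the reduce step emits a pair only when the current entity is an original, so replicas are never matched against each other. Function names are illustrative.

```python
from collections import defaultdict, deque

KEYS = {"a": 1, "b": 2, "c": 3, "d": 1, "e": 2, "f": 2, "g": 3, "h": 2, "i": 3}

def partition_prefix(block_key):
    return 1 if block_key <= 2 else 2

def rep_sn_map(split, w, r):
    """Emit (boundary, partitionPrefix, blockKey) keys, plus replicas for the next task."""
    output, by_prefix = [], defaultdict(list)
    for entity in split:
        key = KEYS[entity]
        prefix = partition_prefix(key)
        output.append(((prefix, prefix, key), entity))      # original entity
        by_prefix[prefix].append((key, entity))
    for prefix, items in by_prefix.items():
        if prefix < r and w > 1:                            # last partition has no successor
            items.sort(key=lambda ke: ke[0])
            output += [((prefix + 1, prefix, key), entity)  # replica: boundary = prefix + 1
                       for key, entity in items[-(w - 1):]]
    return output

def rep_sn_reduce(records, w):
    buffer, pairs = deque(), []
    for (boundary, prefix, _), entity in sorted(records, key=lambda kv: kv[0]):
        if boundary == prefix:                              # skip pairs that end in a replica
            pairs.extend(f"{b}-{entity}" for b in buffer)
        buffer.append(entity)
        if len(buffer) == w:
            buffer.popleft()
    return pairs

def rep_sn(splits, w, r):
    tasks = defaultdict(list)
    for split in splits:
        for key, entity in rep_sn_map(split, w, r):
            tasks[key[0]].append((key, entity))             # part = boundary
    return {task: rep_sn_reduce(tasks[task], w) for task in sorted(tasks)}

result = rep_sn(["abc", "def", "ghi"], w=3, r=2)
print(result)
```

Because replicas sort in front of the originals of a reduce task, the buffer holds the last w-1 entities of the previous partition when the first original arrives.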

11 / 13 Experimental Results
- 1.4m publication records, blocking by title.substring(2), w=1000
- 4 dual-core nodes, Hadoop 0.20.2
- Runtime reduction: 9h to 1.5h → relative speedup of almost 6
- Runtimes of the implementations differ only slightly
  - JobSN is faster for a small degree of parallelism
  - RepSN completes faster beginning with m=r=4

12 / 13 Conclusions
- Application of the MapReduce programming model for parallel execution of typical entity resolution workflows
- Realization of Sorted Neighborhood Blocking with MapReduce
  - Sorted reduce partitions
    - Range partitioning
  - Boundary entities
    - JobSN: generation of boundary correspondences by an additional job
    - RepSN: SN realization within a single job using data replication in the map phase
- Evaluation of the proposed approaches
Future work
- Load balancing mechanisms for handling skewed (blocking key) data
- Multi-pass blocking within a single job

13 / 13 Thank you for your attention
