Data Partitioning for Distributed Entity Matching
Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Groß, Hanna Köpcke, Erhard Rahm
Database Group Leipzig, http://dbs.uni-leipzig.de
Singapore, 13th September 2010

2 / 20 Entity Matching
- Detection of entities in one or more sources that refer to the same real-world object
- Entity comparisons
  - Comparisons based on string similarity
  - Two sources: m² comparisons
  - Combination of several matchers
  - Aggregation of individual results
- Runtime
  - Execution times of up to several hours for a single attribute matcher
  - Worse for machine learning approaches
- Memory requirements
  - Source and intermediate results do not fit in memory
  - Chunk-wise processing

3 / 20 How to speed up entity matching?
- Blocking
  - Group similar entities within blocks
  - Restrict entity matching to entities from the same block
  - Supported by entity matching frameworks
- Parallelization
  - Split match computation into sub-tasks
  - Execute them in parallel on multiple multi-core nodes
  - Currently utilized by only a few frameworks

4 / 20 Contributions
- Generic data partitioning strategies for parallel matching
  - Size-based partitioning for evaluating the Cartesian product of input entities
  - Blocking-based matching
  - Applicable to arbitrary matchers
  - Load balancing regarding available resources and match strategy
- Service-based infrastructure for parallel entity matching
- Evaluation of the strategies for different types of matchers and datasets

5 / 20 Outline
- Motivation
- Overview
- Partitioning Strategies
  - Size-based Partitioning
  - Blocking-based Partitioning
- Match Infrastructure
- Evaluation
- Conclusion & Future Work

6 / 20 Overview
- Input: set of entities, described via attributes
- Partitioning strategy
  - Partition input data
  - Match task generation
- Parallel execution of a match strategy
  - One or several matchers
  - Combination of individual match results (manually, training-based)
  - Treated as black box
[Figure: workflow from the input sources (optionally combined into an integrated source via instance data integration) through the partitioning strategies and match task generation to parallel matching and result aggregation. Size-based partitioning is tuned by the max. partition size; blocking uses a source-specific blocking key and partition tuning with min./max. partition size. Match task generation produces a task list (MT1, MT2, ...) that is processed by the matchers M1, M2, ..., Mt; the match result is the union of the partial results.]

7 / 20 Size-based partitioning
- Applicable to the Cartesian product of input entities
- Split n input entities into p partitions of fairly equal size m
  - p = n/m (rounded up if m does not divide n)
  - Range partitioning, Round Robin
- A match task compares two of these partitions
  - Match partitions P_i and P_j if i ≤ j
  - p + p(p-1)/2 match tasks
- Promises good load balancing and scalability to many nodes
[Figure: the input set is split into partitions P_1, P_2, ..., P_(p-1), P_p; a triangular matrix over P_1 ... P_p marks one match task for every pair (P_i, P_j) with i ≤ j.]
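The partitioning and task generation on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `size_based_partitions` and `match_tasks` are hypothetical, range partitioning over a list stands in for the real data access, and the partition count is taken as ceil(n/m).

```python
import math
from itertools import combinations_with_replacement


def size_based_partitions(entities, max_size):
    """Range-partition entities into p = ceil(n / max_size) parts of near-equal size."""
    n = len(entities)
    if n == 0:
        return []
    p = math.ceil(n / max_size)
    base, extra = divmod(n, p)  # the first `extra` partitions get one more entity
    partitions, start = [], 0
    for i in range(p):
        size = base + (1 if i < extra else 0)
        partitions.append(entities[start:start + size])
        start += size
    return partitions


def match_tasks(p):
    """One task per partition pair (i, j) with i <= j: p + p(p-1)/2 tasks."""
    return list(combinations_with_replacement(range(p), 2))


# Example with the sizes used in the evaluation slides (20,000 offers, m = 500).
entities = list(range(20000))
parts = size_based_partitions(entities, 500)
tasks = match_tasks(len(parts))
print(len(parts), len(tasks))  # 40 partitions, 820 match tasks
```

Each task only names two partition indices, so the task list stays small even when the partitions themselves are large.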

8 / 20 Size-based partitioning: suitable partition size m?

9 / 20 Blocking-based partitioning
- Blocking: logical clustering of possibly matching entities
  - Blocks of largely varying size
- Entities with missing attribute values
  - Assigned to a dedicated misc block
  - Have to be compared with entities of all blocks
- Simple approach: one match task per block
  - Poor load balancing and/or high communication overhead
  - Large blocks dominate execution time and consume much memory
  - Small blocks slow down parallel matching due to high communication overhead compared to the time for matching
- Partition tuning to split or aggregate blocks
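A minimal sketch of blocking with a dedicated misc block, assuming entities are dictionaries and the blocking key is a single attribute. The names `build_blocks` and `MISC` and the sample offers are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

MISC = "misc"  # hypothetical name for the block of entities without a blocking key


def build_blocks(entities, blocking_key):
    """Group entities by blocking key; missing or empty values go to the misc block."""
    blocks = defaultdict(list)
    for entity in entities:
        value = entity.get(blocking_key)
        blocks[value if value else MISC].append(entity)
    return dict(blocks)


offers = [
    {"id": 1, "type": "DVD-RW"},
    {"id": 2, "type": "Blu-ray"},
    {"id": 3, "type": None},      # missing value
    {"id": 4, "type": "DVD-RW"},
    {"id": 5},                    # attribute absent
]
blocks = build_blocks(offers, "type")
for name, members in blocks.items():
    print(name, [e["id"] for e in members])
```

The block sizes produced this way follow the data distribution, which is why the slides go on to tune them.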

10 / 20 Blocking-based partitioning: partition tuning
- Large blocks for which matching would consume too much memory are split into equally-sized partitions
  - Max. partition size m chosen according to memory requirement estimation
- All sub-partitions of a split block have to be matched with each other
[Figure: blocking by type with max. partition size = 700. Drives & Storage (3,250) consists of the blocks 3½ (1,300), 2½ (600), Blu-ray (60), HD-DVD (40), CD-RW (50), DVD-RW (600) and Misc (600). The block 3½ is split into 3½_1 (700) and 3½_2 (600).]
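The split step can be sketched on block sizes alone. The sketch below fills sub-partitions up to the max. partition size, which reproduces the 700 + 600 split shown in the slide's example; a balanced split (650 + 650) would be an equally valid reading of "equally-sized". The function name `split_block` is hypothetical.

```python
def split_block(name, size, max_size):
    """Split a block of `size` entities into sub-partitions of at most `max_size`."""
    if size <= max_size:
        return [(name, size)]
    parts, remaining, index = [], size, 1
    while remaining > 0:
        chunk = min(max_size, remaining)
        parts.append((f"{name}_{index}", chunk))
        remaining -= chunk
        index += 1
    return parts


# Block sizes from the slide's example (blocking by type, max. partition size 700).
block_sizes = {"3½": 1300, "2½": 600, "Blu-ray": 60, "HD-DVD": 40,
               "CD-RW": 50, "DVD-RW": 600, "misc": 600}
after_split = [part for name, size in block_sizes.items()
               for part in split_block(name, size, 700)]
print(after_split)
```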

11 / 20 Blocking-based partitioning: partition tuning
- Small blocks with sizes below the min. partition size are aggregated into larger ones
  - Fewer partitions → fewer match tasks → reduced communication and scheduling overhead
- Aggregation introduces unnecessary comparisons and may lead to false positives
[Figure: blocking by type with min. partition size = 70 and max. partition size = 700. In Drives & Storage (3,250), the blocks Blu-ray (60), HD-DVD (40) and CD-RW (50) lie below the min. partition size and are aggregated step by step, first into a block of 100 and then into a block of 150 entities. The blocks 3½_1 (700), 3½_2 (600), 2½ (600), DVD-RW (600) and misc (600) stay unchanged.]
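One possible way to aggregate small blocks is a greedy merge of the two smallest blocks until no block is below the min. partition size. This is an assumed strategy for illustration, not necessarily the one used by the authors; the name `aggregate_small_blocks` is hypothetical, the misc block is left untouched by assumption, and the sketch does not check the max. partition size after merging.

```python
import heapq


def aggregate_small_blocks(block_sizes, min_size, misc="misc"):
    """Repeatedly merge the two smallest non-misc blocks while one is below min_size."""
    heap = [(size, name) for name, size in block_sizes.items() if name != misc]
    heapq.heapify(heap)
    while len(heap) > 1 and heap[0][0] < min_size:
        size_a, name_a = heapq.heappop(heap)
        size_b, name_b = heapq.heappop(heap)
        heapq.heappush(heap, (size_a + size_b, f"{name_a}+{name_b}"))
    result = {name: size for size, name in heap}
    if misc in block_sizes:
        result[misc] = block_sizes[misc]  # misc is kept as is (assumption)
    return result


# Block sizes from the slide's example after splitting (min. partition size 70).
sizes_after_split = {"3½_1": 700, "3½_2": 600, "2½": 600, "Blu-ray": 60,
                     "HD-DVD": 40, "CD-RW": 50, "DVD-RW": 600, "misc": 600}
tuned_sizes = aggregate_small_blocks(sizes_after_split, 70)
print(tuned_sizes)
```

On the slide's numbers this leaves a single aggregated block of 150 entities next to the five larger partitions.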

12 / 20 Blocking-based partitioning: match task generation
- One match task per normal (non-misc) block that has not been split
- Blocks that have been split into k sub-partitions result in k + k(k-1)/2 match tasks
- The misc block (or its sub-partitions) has to be matched with all blocks (sub-partitions)
[Figure: match task matrix for the Drives & Storage example, built up step by step. Rows and columns are the partitions 3½_1, 3½_2, 2½, the aggregated Blu-ray/HD-DVD/CD-RW block, DVD-RW and misc. Marked match tasks: 3½_1 with itself; 3½_2 with 3½_1 and with itself; 2½ with itself; the aggregated block with itself; DVD-RW with itself; misc with every partition including itself.]
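The three rules above can be sketched as a small task generator. The function name `generate_match_tasks` and the input layout (block name mapped to its list of sub-partition names) are assumptions made for this illustration.

```python
from itertools import combinations_with_replacement


def generate_match_tasks(blocks, misc_parts):
    """blocks: block name -> list of sub-partition names (one entry if not split).
    misc_parts: sub-partition names of the misc block."""
    tasks = []
    # Unsplit block: 1 task; block split into k parts: k + k(k-1)/2 tasks.
    for parts in blocks.values():
        tasks.extend(combinations_with_replacement(parts, 2))
    # Every misc sub-partition is matched with every normal (sub-)partition ...
    normal_parts = [part for parts in blocks.values() for part in parts]
    for misc_part in misc_parts:
        tasks.extend((part, misc_part) for part in normal_parts)
    # ... and the misc sub-partitions are matched with each other.
    tasks.extend(combinations_with_replacement(misc_parts, 2))
    return tasks


# Partitions of the slide's example after partition tuning.
example_blocks = {
    "3½": ["3½_1", "3½_2"],
    "2½": ["2½"],
    "Blu-ray+HD-DVD+CD-RW": ["Blu-ray+HD-DVD+CD-RW"],
    "DVD-RW": ["DVD-RW"],
}
example_tasks = generate_match_tasks(example_blocks, ["misc"])
print(len(example_tasks))  # 12
```

The 12 tasks break down into 3 for the split 3½ block, 1 each for the three unsplit blocks, and 6 involving misc.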

13 / 20 Match infrastructure: size-based partitioning example
[Figure: a Data Service holds the source data and attribute histograms (entities e_1, e_2, ..., e_(k-1), e_k). A Workflow Service describes equally sized partitions p_1, p_2, ..., p_n logically and builds the match task list t_1, t_2, ..., t_n, t_(n+1), ..., t_(2n-1), ... from pairs of these partitions. Match Service 1 and Match Service 2 each run on a node with CPU 1 and CPU 2.]

14 / 20 Infrastructure: size-based partitioning example
[Figure: the Workflow Service distributes the tasks of the match task list to the CPUs (CPU 1, CPU 2) of Match Service 1 and Match Service 2, which fetch data from the Data Service. The partial match results are then unified.]
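The flow in this example (task list, parallel workers, union of partial results) can be imitated on one machine. The sketch below is only an analogy under stated assumptions: a local thread pool stands in for the match services, the names `run_match_task` and `parallel_match` are hypothetical, and the toy matcher is not one of the matchers from the talk. In CPython, threads will not speed up CPU-bound matching; the point here is the control flow.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations_with_replacement


def run_match_task(task, partitions, matcher, threshold):
    """Compare the two partitions named by the task; return matching id pairs."""
    i, j = task
    result = set()
    for a_idx, a in enumerate(partitions[i]):
        for b_idx, b in enumerate(partitions[j]):
            if i == j and b_idx <= a_idx:
                continue  # within one partition, compare each unordered pair once
            if matcher(a, b) >= threshold:
                result.add((a["id"], b["id"]))
    return result


def parallel_match(partitions, matcher, threshold, workers=4):
    """Build the task list, run the tasks in parallel, unify the partial results."""
    tasks = list(combinations_with_replacement(range(len(partitions)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(
            lambda t: run_match_task(t, partitions, matcher, threshold), tasks)
        return set().union(*partial)


def toy_matcher(a, b):
    """Toy similarity: 1.0 for equal names ignoring case, else 0.0."""
    return 1.0 if a["name"].lower() == b["name"].lower() else 0.0


sample_partitions = [
    [{"id": 1, "name": "Sony DVD-RW"}, {"id": 2, "name": "LG Blu-ray"}],
    [{"id": 3, "name": "sony dvd-rw"}, {"id": 4, "name": "lg blu-ray"}],
]
matches = parallel_match(sample_partitions, toy_matcher, threshold=1.0)
print(sorted(matches))
```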

15 / 20 Evaluation
- Datasets
  - 114,000 electronic product offers
  - Small subset of 20,000 offers
- Two matchers
  - WAM: Levenshtein, Trigram (weighted average)
  - LRM: Jaccard, Trigram, Cosine (machine learning)
- Computing environment
  - Up to 16 cores (4 nodes with 4x2.66GHz and 4GB RAM)
  - 3GB RAM heap size
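To make the WAM idea concrete, here is a rough sketch of a weighted average of a Levenshtein similarity and a trigram similarity. The slides do not give the weights, the normalization or the trigram measure, so the equal weights, the length-normalized Levenshtein similarity and the Dice coefficient over character trigrams are all assumptions; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic program (row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def levenshtein_sim(a, b):
    """Edit distance normalized by the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_sim(a, b):
    """Dice coefficient over character trigram sets (assumed measure)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0 if a == b else 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def wam(a, b, w_lev=0.5, w_tri=0.5):
    """Weighted average of both similarities; the weights are placeholders."""
    return (w_lev * levenshtein_sim(a, b) + w_tri * trigram_sim(a, b)) / (w_lev + w_tri)


print(wam("Sony DVD-RW drive", "Sony DVDRW drive"))
```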

16 / 20 Evaluation: influence of the max. partition size
- 20,000 electronic product offers
- Cartesian product
- Single node, 4 match threads
[Figure annotation: LRM: m_max = 500]

17 / 20 Evaluation: parallel matching on multiple nodes
- 20,000 electronic product offers
- Cartesian product (m_max = 1000/500)
- 4 nodes, 4 match threads per node

18 / 20 Evaluation: parallel matching on multiple nodes
- 114,000 electronic product offers
- Blocking (m_max = 1000/500, m_min = 200/100)
- 4 nodes, 4 match threads per node

19 / 20 Conclusions & future work
- Two generic data partitioning strategies for parallel matching with any matchers
  - Size-based partitioning
  - Blocking-based partitioning
  - Partition tuning to achieve evenly loaded nodes
- Evaluation on newly developed service-based infrastructure
- Future work
  - Adapt approaches to cloud architectures
  - Parallelize blocking
  - Investigate optimizations within match strategies

20 / 20 Thank you for your attention
Please also note our contribution to the VLDB experiments and analysis track:
Evaluation of entity resolution approaches on real-world match problems
Hanna Köpcke, Andreas Thor, Erhard Rahm
Date: 14 September 2010, Tuesday
Time: 17:30 hours
Room: Swallow
