Data Partitioning for Distributed Entity Matching. Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Groß, Hanna Köpcke, Erhard Rahm, Database Group.

Similar presentations
Introduction to Grid Application On-Boarding Nick Werstiuk

MapReduce Programming. Dr. G. Sudha Sadasivam. MapReduce: sort/merge-based distributed processing; best for batch-oriented processing; sort/merge is the primitive.
Don't Match Twice: Redundancy-free Similarity Computation with MapReduce. Lars Kolb, Andreas Thor, Erhard Rahm, Database Group Leipzig University.
Analysis of : Operator Scheduling in a Data Stream Manager CS561 – Advanced Database Systems By Eric Bloom.
MINJAE HWANG THAWAN KOOBURAT CS758 CLASS PROJECT FALL 2009 Extending Task-based Programming Model beyond Shared-memory Systems.
Piccolo: Building fast distributed programs with partitioned tables Russell Power Jinyang Li New York University.
SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting.
LIBRA: Lightweight Data Skew Mitigation in MapReduce
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
Resource Management. A resource can be logical, such as a shared file, or physical, such as a CPU (a node of the distributed system). One of the functions.
LOAD BALANCING IN A CENTRALIZED DISTRIBUTED SYSTEM BY ANILA JAGANNATHAM ELENA HARRIS.
Development of Parallel Simulator for Wireless WCDMA Network Hong Zhang Communication lab of HUT.
Optimizing Similarity Computations for Ontology Matching - Experiences from GOMMA Michael Hartung, Lars Kolb, Anika Groß, Erhard Rahm Database Research.
PARALLELIZING LARGE-SCALE DATA- PROCESSING APPLICATIONS WITH DATA SKEW: A CASE STUDY IN PRODUCT-OFFER MATCHING Ekaterina Gonina UC Berkeley Anitha Kannan,
Optimus: A Dynamic Rewriting Framework for Data-Parallel Execution Plans Qifa Ke, Michael Isard, Yuan Yu Microsoft Research Silicon Valley EuroSys 2013.
OpenFOAM on a GPU-based Heterogeneous Cluster
Parallel Computation in Biological Sequence Analysis Xue Wu CMSC 838 Presentation.
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
04/18/2005. Flux: An Adaptive Partitioning Operator for Continuous Query Systems. M.A. Shah, J.M. Hellerstein, S. Chandrasekaran, M.J. Franklin, UC.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) How MapReduce Works (in Hadoop) Shivnath Babu.
Parallel Sorted Neighborhood Blocking with MapReduce. Lars Kolb, Andreas Thor, Erhard Rahm, Database Group Leipzig. Kaiserslautern,
A Prototypical Self-Optimizing Package for Parallel Implementation of Fast Signal Transforms Kang Chen and Jeremy Johnson Department of Mathematics and.
An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.
Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Background: MapReduce and FREERIDE Co-clustering on FREERIDE Experimental.
Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal.
SecureMR: A Service Integrity Assurance Framework for MapReduce Author: Wei Wei, Juan Du, Ting Yu, Xiaohui Gu Source: Annual Computer Security Applications.
Introduction to Hadoop and HDFS
An Autonomic Framework in Cloud Environment Jiedan Zhu Advisor: Prof. Gagan Agrawal.
ROBUST RESOURCE ALLOCATION OF DAGS IN A HETEROGENEOUS MULTI-CORE SYSTEM Luis Diego Briceño, Jay Smith, H. J. Siegel, Anthony A. Maciejewski, Paul Maxwell,
A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Tekin Bicer Gagan Agrawal 1.
HPDC 2014 Supporting Correlation Analysis on Scientific Datasets in Parallel and Distributed Settings Yu Su*, Gagan Agrawal*, Jonathan Woodring # Ayan.
A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Graduate Student Department Of CSE 1.
Designing and Deploying a Scalable EPM Solution Ken Toole Platform Test Manager MS Project Microsoft.
Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.
MapReduce How to painlessly process terabytes of data.
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!), Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
Energy Efficiency for MapReduce Workloads: An In-depth Study. Boliang Feng, Renmin University of China, Dec 19.
Conclusions and Future Considerations: Parallel processing of raster functions was 3-22 times faster than ArcGIS depending on file size. Also, processing.
SALSASALSASALSASALSA Design Pattern for Scientific Applications in DryadLINQ CTP DataCloud-SC11 Hui Li Yang Ruan, Yuduo Zhou Judy Qiu, Geoffrey Fox.
Data Placement and Task Scheduling in the Cloud, Online and Offline. Zhao Qing, Tianjin University of Science and Technology.
MATRIX MULTIPLY WITH DRYAD B649 Course Project Introduction.
Autonomic scheduling of tasks from data parallel patterns to CPU/GPU core mixes Published in: High Performance Computing and Simulation (HPCS), 2013 International.
MC²: Map Concurrency Characterization for MapReduce on the Cloud. Mohammad Hammoud and Majd Sakr.
Job scheduling algorithm based on Berger model in cloud environment Advances in Engineering Software (2011) Baomin Xu,Chunyan Zhao,Enzhao Hua,Bin Hu 2013/1/251.
Computer Science and Engineering Parallelizing Defect Detection and Categorization Using FREERIDE Leonid Glimcher P. 1 ipdps’05 Scaling and Parallelizing.
Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International.
Learning-based Entity Resolution with MapReduce. Lars Kolb, Hanna Köpcke, Andreas Thor, Erhard Rahm, Database Group Leipzig.
Data-Intensive Computing: From Clouds to GPUs Gagan Agrawal December 3,
Enabling Self-management of Component-based High-performance Scientific Applications Hua (Maria) Liu and Manish Parashar The Applied Software Systems Laboratory.
ISchool, Cloud Computing Class Talk, Oct 6 th Computing Pairwise Document Similarity in Large Collections: A MapReduce Perspective Tamer Elsayed,
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
Massive Semantic Web data compression with MapReduce Jacopo Urbani, Jason Maassen, Henri Bal Vrije Universiteit, Amsterdam HPDC ( High Performance Distributed.
A Memory-hierarchy Conscious and Self-tunable Sorting Library To appear in 2004 International Symposium on Code Generation and Optimization (CGO ’ 04)
1 Adaptive Parallelism for Web Search Myeongjae Jeon Rice University In collaboration with Yuxiong He (MSR), Sameh Elnikety (MSR), Alan L. Cox (Rice),
ApproxHadoop Bringing Approximations to MapReduce Frameworks
PDAC-10 Middleware Solutions for Data- Intensive (Scientific) Computing on Clouds Gagan Agrawal Ohio State University (Joint Work with Tekin Bicer, David.
Efficient Load Balancing Algorithm for Cloud Computing Network Che-Lun Hung 1, Hsiao-hsi Wang 2 and Yu-Chen Hu 2 1 Dept. of Computer Science & Communication.
An In-memory Framework for Extended MapReduce. 2011 Third IEEE International Conference on Cloud Computing Technology and Science.
Ohio State University Department of Computer Science and Engineering Servicing Range Queries on Multidimensional Datasets with Partial Replicas Li Weng,
Accelerating K-Means Clustering with Parallel Implementations and GPU Computing Janki Bhimani Miriam Leeser Ningfang Mi
Fast Data Analysis with Integrated Statistical Metadata in Scientific Datasets By Yong Chen (with Jialin Liu) Data-Intensive Scalable Computing Laboratory.
Indicate Research Pilots An e-Infrastructure enabled semantic search service Technical Conference Catania 20/04/2012 NTUA Kostas Pardalis 1.
A Dynamic Scheduling Framework for Emerging Heterogeneous Systems
Optimizing Parallel Algorithms for All Pairs Similarity Search
Conception of parallel algorithms
A Black-Box Approach to Query Cardinality Estimation
Accelerating MapReduce on a Coupled CPU-GPU Architecture
A Distributed Bucket Elimination Algorithm
Communication and Memory Efficient Parallel Decision Tree Construction
Support for ”interactive batch”
Presentation transcript:

Data Partitioning for Distributed Entity Matching. Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Groß, Hanna Köpcke, Erhard Rahm, Database Group Leipzig. Singapore, 13th September 2010

Slide 2/20: Entity Matching
- Detection of entities in one or more sources that refer to the same real-world object
- Entity comparisons based on string similarity; with two sources, the Cartesian product yields a quadratic number of comparisons
- Combination of several matchers; aggregation of the individual results
- Runtime: execution times of up to several hours for a single attribute matcher, and worse for machine-learning approaches
- Memory requirements: sources and intermediate results do not fit in memory, so chunk-wise processing is needed
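The string-similarity comparisons and their quadratic cost can be illustrated with a minimal trigram matcher. This is a sketch, not the paper's implementation; the padding characters, the Dice formula, and the 0.7 threshold are illustrative assumptions.

```python
def trigrams(s):
    """Return the set of character 3-grams of a padded, lower-cased string."""
    s = "##" + s.lower() + "##"
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_sim(a, b):
    """Dice similarity over trigram sets, a common string matcher."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def match(source_a, source_b, threshold=0.7):
    """Naive entity matching: evaluates all |A| x |B| pairs, which is
    exactly the quadratic cost that motivates partitioning."""
    return [(a, b) for a in source_a for b in source_b
            if trigram_sim(a, b) >= threshold]
```

On two sources of sizes n and m this performs n·m similarity computations, which is why runtimes grow to hours for larger inputs.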

Slide 3/20: How to Speed Up Entity Matching?
- Blocking: group similar entities within blocks and restrict entity matching to entities from the same block; supported by entity matching frameworks
- Parallelization: split the match computation into sub-tasks and execute them in parallel on multiple multi-core nodes; currently utilized by only a few frameworks

Slide 4/20: Contributions
- Generic data partitioning strategies for parallel matching
  - Size-based partitioning for evaluating the Cartesian product of the input entities
  - Blocking-based matching
  - Applicable to arbitrary matchers
  - Load balancing with respect to the available resources and the match strategy
- Service-based infrastructure for parallel entity matching
- Evaluation of the strategies for different types of matchers and datasets

Slide 5/20: Outline
- Motivation
- Overview
- Partitioning Strategies: Size-based Partitioning, Blocking-based Partitioning
- Match Infrastructure
- Evaluation
- Conclusion & Future Work

Slide 6/20: Overview
- Input: a set of entities, described via attributes
- Partitioning strategy: partition the input data, then generate match tasks
- Parallel execution of a match strategy
  - One or several matchers
  - Combination of the individual match results (manually or training-based)
  - Treated as a black box
[Diagram: one or more input sources are integrated; the integrated source is partitioned either size-based (with a max. partition size) or via blocking (with a source-specific blocking key, partition tuning, and min./max. partition sizes); match task generation produces a task list MT1, MT2, ...; matchers M1 ... Mt run in parallel and their partial results are unified into the final match result]

Slide 7/20: Size-based Partitioning
- Applicable to the Cartesian product of the input entities
- Split the n input entities into p = ⌈n/m⌉ partitions of fairly equal size m (range partitioning or round robin)
- A match task compares two of these partitions: partitions P_i and P_j are matched if i ≤ j
- This yields p + p(p-1)/2 match tasks
- Promises good load balancing and scalability to many nodes
[Table: the upper-triangular matrix of partition pairs (P_i, P_j), i ≤ j, to be matched]
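The size-based scheme can be sketched in a few lines of Python. This is a minimal range-partitioning sketch; the function names are illustrative, not from the paper.

```python
def size_based_partitions(entities, m):
    """Range-partition the input into chunks of at most m entities each."""
    return [entities[i:i + m] for i in range(0, len(entities), m)]

def size_based_match_tasks(p):
    """Enumerate all partition pairs (i, j) with i <= j: p self-match
    tasks plus p*(p-1)/2 cross tasks, i.e. p + p*(p-1)/2 in total."""
    return [(i, j) for i in range(p) for j in range(i, p)]
```

Because every task compares two partitions of (almost) the same size m, the tasks have near-uniform cost, which is what makes the load balancing work.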

Slide 8/20: Size-based Partitioning: What Is a Suitable Partition Size m?
[Chart only]

Slide 9/20: Blocking-based Partitioning
- Blocking: logical clustering of possibly matching entities; blocks are of largely varying size
- Entities with missing attribute values are assigned to a dedicated misc block and have to be compared with the entities of all blocks
- Simple approach (one match task per block): poor load balancing and/or high communication overhead
  - Large blocks dominate the execution time and consume much memory
  - Small blocks slow down parallel matching, since the communication overhead is high compared to the matching time
- Partition tuning to split or aggregate blocks

Slide 10/20: Blocking-based Partitioning: Partition Tuning (Splitting)
- Large blocks whose matching would consume too much memory are split into equally sized partitions
- The max. partition size m is chosen according to a memory-requirement estimation
- All sub-partitions of a split block have to be matched with each other
[Example: blocking by type with max. partition size 700; Drives & Storage (3,250) with blocks 3½ (1,300), 2½ (600), Blu-ray (60), HD-DVD (40), CD-RW (50), DVD-RW (600), Misc (600); the 3½ block is split into 3½₁ (700) and 3½₂ (600)]
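The splitting step can be sketched as follows. The division into equally sized sub-partitions is my reading of the slide (the slide's own example splits 1,300 into 700 + 600 rather than 650 + 650); the function name is illustrative.

```python
import math

def split_block(block, m_max):
    """Split a block exceeding m_max into k equally sized sub-partitions,
    where k is the smallest count that keeps every part within m_max."""
    if len(block) <= m_max:
        return [block]
    k = math.ceil(len(block) / m_max)       # number of sub-partitions
    size = math.ceil(len(block) / k)        # equal target size per part
    return [block[i:i + size] for i in range(0, len(block), size)]
```

With m_max = 700, a block of 1,300 entities is split into two sub-partitions of 650 each, both within the memory budget.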

Slide 11/20: Blocking-based Partitioning: Partition Tuning (Aggregation)
- Small blocks with sizes below the min. partition size are aggregated into larger ones
- Fewer partitions mean fewer match tasks and thus reduced communication and scheduling overhead
- Aggregation introduces unnecessary comparisons and may lead to false positives
[Example: min. partition size 70, max. partition size 700; the small blocks Blu-ray (60), HD-DVD (40), and CD-RW (50) are aggregated into larger partitions (shown as HD-DVD (100) and CD-RW (150))]
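The aggregation step can be sketched as a greedy packing of the small blocks. The merge order is my own simplification, since the slide does not specify one; the names are illustrative.

```python
def aggregate_small_blocks(blocks, m_min, m_max):
    """Merge blocks smaller than m_min into combined partitions of at
    most m_max entities; blocks of size >= m_min are kept as they are."""
    kept = [b for b in blocks if len(b) >= m_min]
    merged, current = [], []
    for b in sorted((b for b in blocks if len(b) < m_min), key=len):
        if current and len(current) + len(b) > m_max:
            merged.append(current)          # close the current partition
            current = []
        current = current + b               # add the small block to it
    if current:
        merged.append(current)
    return kept + merged
```

Note the trade-off the slide states: entities from different blocks now share a partition, so the matcher performs comparisons blocking had already ruled out.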

Slide 12/20: Blocking-based Partitioning: Match Task Generation
- One match task per normal (non-misc) block that has not been split
- A block that has been split into k sub-partitions results in k + k(k-1)/2 match tasks
- The misc block (or its sub-partitions) has to be matched with all blocks (sub-partitions)
[Example matrices over the Drives & Storage blocks 3½ (3½₁, 3½₂), 2½, Blu-ray, HD-DVD, CD-RW, DVD-RW, and misc: one task per block, all pairs among the 3½ sub-partitions, and misc paired with every block]
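These task-generation rules can be sketched as follows. The dict layout and the "misc" key are my encoding of the slide's example, not an API from the paper.

```python
from itertools import combinations

def generate_match_tasks(blocks, misc_key="misc"):
    """blocks maps a blocking key to its list of (sub-)partition ids.
    Emits one self-match task per partition, every pair among the
    sub-partitions of a split block, and misc against all other blocks."""
    tasks = []
    for key, parts in blocks.items():
        tasks += [(p, p) for p in parts]        # k self-match tasks
        tasks += list(combinations(parts, 2))   # k*(k-1)/2 sub-partition pairs
    for mp in blocks.get(misc_key, []):
        for key, parts in blocks.items():
            if key != misc_key:
                tasks += [(mp, p) for p in parts]   # misc vs. everything else
    return tasks
```

A block split into k = 2 sub-partitions thus contributes 2 + 1 = 3 tasks, matching the k + k(k-1)/2 formula.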

Slide 13/20: Match Infrastructure: Size-based Partitioning Example
- A data service holds the source data (entities e₁ ... e_k) and attribute histograms; the equally sized partitions p₁ ... p_n are described only logically
- The workflow service builds the match task list t₁, t₂, ... over all partition pairs
- Match services (each with several CPUs) execute the tasks
[Diagram: data service and workflow service feeding match services 1 and 2, each with CPU 1 and CPU 2]

Slide 14/20: Infrastructure: Size-based Partitioning Example (continued)
- The match tasks from the task list are assigned to the CPUs of the match services and executed in parallel
- The partial match results are unified into the final match result
[Diagram: match services 1 and 2 processing the tasks on CPU 1 and CPU 2, followed by a union of the partial results]

Slide 15/20: Evaluation
- Datasets: 114,000 electronic product offers; a small subset of 20,000 offers
- Two matchers
  - WAM: Levenshtein, Trigram (weighted average)
  - LRM: Jaccard, Trigram, Cosine (machine learning)
- Computing environment: up to 16 cores (4 nodes with 4 x 2.66 GHz and 4 GB RAM each), 3 GB RAM heap size

Slide 16/20: Evaluation: Influence of the Max. Partition Size
- 20,000 electronic product offers, Cartesian product
- Single node, 4 match threads
[Chart; annotation: LRM, m_max = 500]

Slide 17/20: Evaluation: Parallel Matching on Multiple Nodes
- 20,000 electronic product offers, Cartesian product (m_max = 1000/500)
- 4 nodes, 4 match threads per node
[Chart]

Slide 18/20: Evaluation: Parallel Matching on Multiple Nodes
- 114,000 electronic product offers
- Blocking (m_max = 1000/500, m_min = 200/100)
- 4 nodes, 4 match threads per node
[Chart]

Slide 19/20: Conclusions & Future Work
- Two generic data partitioning strategies for parallel matching with arbitrary matchers: size-based partitioning and blocking-based partitioning
- Partition tuning to achieve evenly loaded nodes
- Evaluation on a newly developed service-based infrastructure
- Future work: adapt the approaches to cloud architectures, parallelize blocking, and investigate optimizations within match strategies

Slide 20/20: Thank You for Your Attention
Please also note our contribution to the VLDB experiments and analysis track:
"Evaluation of entity resolution approaches on real-world match problems", Hanna Köpcke, Andreas Thor, Erhard Rahm
Date: Tuesday, 14 September 2010; Time: 17:30; Room: Swallow