Learning-based Entity Resolution with MapReduce. Lars Kolb, Hanna Köpcke, Andreas Thor, Erhard Rahm, Database Group Leipzig

Similar presentations
Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)

Don't Match Twice: Redundancy-free Similarity Computation with MapReduce Lars Kolb, Andreas Thor, Erhard Rahm Database Group Leipzig University.
SkewReduce YongChul Kwon Magdalena Balazinska, Bill Howe, Jerome Rolia* University of Washington, *HP Labs Skew-Resistant Parallel Processing of Feature-Extracting.
LIBRA: Lightweight Data Skew Mitigation in MapReduce
Locality-Aware Dynamic VM Reconfiguration on MapReduce Clouds Jongse Park, Daewoo Lee, Bokyeong Kim, Jaehyuk Huh, Seungryoul Maeng.
PARALLELIZING LARGE-SCALE DATA- PROCESSING APPLICATIONS WITH DATA SKEW: A CASE STUDY IN PRODUCT-OFFER MATCHING Ekaterina Gonina UC Berkeley Anitha Kannan,
Homework 2 In the docs folder of your Berkeley DB, have a careful look at documentation on how to configure BDB in main memory. In the docs folder of your.
Google’s Map Reduce. Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Standard architecture emerging: – Cluster of commodity.
MapReduce Simplified Data Processing on Large Clusters Google, Inc. Presented by Prasad Raghavendra.
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
EECS 584, Fall: MapReduce: Simplified Data Processing on Large Clusters Yunxing Dai, Huan Feng.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Parallel Sorted Neighborhood Blocking with MapReduce Lars Kolb, Andreas Thor, Erhard Rahm Database Group Leipzig Kaiserslautern,
Experiences Teaching MapReduce in the Clouds Ari Rabkin, Charles Reiss, Randy Katz, David Patterson University of California, Berkeley 1.
MapReduce : Simplified Data Processing on Large Clusters Hongwei Wang & Sihuizi Jin & Yajing Zhang
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal.
MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
MAP REDUCE : SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS Presented by: Simarpreet Gill.
MapReduce How to painlessly process terabytes of data.
Data Partitioning for Distributed Entity Matching Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Groß, Hanna Köpcke, Erhard Rahm Database Group.
Shared Memory Parallelization of Decision Tree Construction Using a General Middleware Ruoming Jin Gagan Agrawal Department of Computer and Information.
Relational Operator Evaluation. Overview Index Nested Loops Join If there is an index on the join column of one relation (say S), can make it the inner.
Cache-Conscious Performance Optimization for Similarity Search Maha Alabduljalil, Xun Tang, Tao Yang Department of Computer Science University of California.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION BY PETER CHRISTEN PRESENTED BY JOSEPH PARK Data Matching.
MC 2 : Map Concurrency Characterization for MapReduce on the Cloud Mohammad Hammoud and Majd Sakr 1.
Computer Science and Engineering Parallelizing Defect Detection and Categorization Using FREERIDE Leonid Glimcher P. 1 ipdps’05 Scaling and Parallelizing.
Database Applications (15-415) Part II- Hadoop Lecture 26, April 21, 2015 Mohammad Hammoud.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
Record Linkage in a Distributed Environment
CS4432: Database Systems II Query Processing- Part 3 1.
MapReduce Algorithm Design Based on Jimmy Lin’s slides
MapReduce and Data Management Based on slides from Jimmy Lin’s lecture slides (licensed.
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
Patch Based Prediction Techniques University of Houston By: Paul AMALAMAN From: UH-DMML Lab Director: Dr. Eick.
CS4432: Database Systems II Query Processing- Part 2.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original.
IBM Research ® © 2007 IBM Corporation Introduction to Map-Reduce and Join Processing.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
MapReduce Joins Shalish.V.J. A Refresher on Joins A join is an operation that combines records from two or more data sets based on a field or set of fields,
Joe Bradish Parallel Neural Networks. Background  Deep Neural Networks (DNNs) have become one of the leading technologies in artificial intelligence.
CS4432: Database Systems II Query Processing- Part 1 1.
PAIR project progress report Yi-Ting Chou Shui-Lung Chuang Xuanhui Wang.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA, VAIBHAV JAISWAL
Implementation of Classifier Tool in Twister Magesh khanna Vadivelu Shivaraman Janakiraman.
Accelerating K-Means Clustering with Parallel Implementations and GPU Computing Janki Bhimani Miriam Leeser Ningfang Mi
Computer Science and Engineering Parallelizing Feature Mining Using FREERIDE Leonid Glimcher P. 1 ipdps’04 Scaling and Parallelizing a Scientific Feature.
”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.
Optimizing Parallel Algorithms for All Pairs Similarity Search
Parallel Databases.
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Database Applications (15-415) Hadoop Lecture 26, April 19, 2016
On Spatial Joins in MapReduce
Cse 344 May 2nd – Map/reduce.
February 26th – Map/Reduce
Cse 344 May 4th – Map/Reduce.
MOMA - A Mapping-based Object Matching System
5/7/2019 Map Reduce Map reduce.
MapReduce: Simplified Data Processing on Large Clusters
Distributed Systems and Concurrency: Map Reduce
Presentation transcript:

Learning-based Entity Resolution with MapReduce
Lars Kolb, Hanna Köpcke, Andreas Thor, Erhard Rahm
Database Group Leipzig
Glasgow, CloudDB 2011

2 / 16 Entity Resolution
- Identification of semantically equivalent entities, within one data source or between different sources
- Needed to merge entities, compare them, improve data quality, etc.
- Duplicates arise due to: order of authors, extraction errors, different titles, typos, …

3 / 16 Entity Resolution (2)
- A lot of research work exists
- Pairwise entity comparison: multiple similarity measures are applied on several attributes, and the similarity values are combined into a match decision for each entity pair
- This combination of similarity values is hard to configure manually
- Study of real-world match systems/problems [VLDB’10]: effective matching is difficult (F-measure < 75% for product data), and matching is expensive (scalability issues for O(n²) comparisons)
- Learning-based approaches automate the combination of similarity values but come with poor efficiency
[VLDB’10] Köpcke, Thor, Rahm: Evaluation of entity resolution approaches on real-world match problems. VLDB 2010

4 / 16 Learning-based Entity Resolution
- Based on training data (a labeled subset of R × S), entity pairs are classified as match/non-match
- Pairwise similarity values serve as features for classification
- Phase 1 (Training): compute the similarities sim_1 … sim_k for each labeled training pair and train a classifier on the resulting feature vectors, e.g. (0.8, …, 0.7) → true, (0.4, …, 0.6) → false
- Phase 2 (Application): compute sim_1 … sim_k for the pairs of R × S and apply the classifier, yielding the match result as pairs (id_R, id_S)
- Observations: the training phase takes < 5% of the overall runtime; similarity computation accounts for 95% of the application phase
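As a concrete sketch of the two phases (not taken from the paper: it uses scikit-learn's decision tree as a stand-in for the WEKA classifiers mentioned later, and the records, matchers, and training labels are invented):

```python
from itertools import product
from sklearn.tree import DecisionTreeClassifier  # stand-in for a WEKA classifier

def trigrams(s):
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def jaccard(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Each matcher yields one similarity feature (sim_1 ... sim_k)
matchers = [
    lambda r, s: jaccard(r["title"], s["title"]),
    lambda r, s: 1.0 if r["venue"] == s["venue"] else 0.0,
]

def features(r, s):
    return [m(r, s) for m in matchers]

# Phase 1 (Training): labeled pairs -> feature vectors -> classifier
train_pairs = [  # hypothetical training data
    ({"title": "MapReduce", "venue": "OSDI"}, {"title": "Map Reduce", "venue": "OSDI"}, True),
    ({"title": "MapReduce", "venue": "OSDI"}, {"title": "Bigtable", "venue": "OSDI"}, False),
]
clf = DecisionTreeClassifier().fit(
    [features(r, s) for r, s, _ in train_pairs],
    [label for _, _, label in train_pairs],
)

# Phase 2 (Application): classify every pair of the Cartesian product R x S
R = [{"id": 1, "title": "MapReduce", "venue": "OSDI"}]
S = [{"id": 7, "title": "Map-Reduce", "venue": "OSDI"}]
matches = [(r["id"], s["id"]) for r, s in product(R, S)
           if clf.predict([features(r, s)])[0]]
print(matches)  # [(1, 7)]
```

Phase 2 is the expensive part: features(r, s) is evaluated for all |R|·|S| pairs, which is what the MapReduce strategies below parallelize.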

5 / 16 Outline
- Motivation
- MapReduce
- Strategies for similarity computation and classifier application on the Cartesian product of two data sources with MapReduce: solely in the map phase (“Broadcast Join”) → MapSide; even distribution of entity pairs across reduce tasks → ReduceSplit
- Experimental results
- Conclusions & future work

6 / 16 MapReduce
- Programming model for distributed computation in cluster environments
- UDF map is applied to each input entity and outputs key-value pairs
- UDF part is applied to the key of each map output pair and assigns it to one of the reduce tasks: part(key) → [0, r-1]
- UDF group is applied to the key to group the key-value pairs
- UDF reduce is invoked once for each group
- Input data flows through m map tasks (e.g., m=3), is partitioned by part(key), and is processed by r reduce tasks (e.g., r=3)
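To make the four UDFs concrete, here is a minimal single-process sketch (illustrative only, not the Hadoop implementation the paper builds on) that wires map, part, group, and reduce together the way a MapReduce runtime would invoke them:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_udf, part_udf, group_udf, reduce_udf, r):
    # Map phase: each input entity yields key-value pairs
    pairs = [kv for entity in inputs for kv in map_udf(entity)]
    # Partitioning: part(key) assigns each pair to one of r reduce tasks
    tasks = defaultdict(list)
    for key, value in pairs:
        tasks[part_udf(key, r)].append((key, value))
    # Each reduce task groups its pairs by group(key) and calls reduce per group
    output = []
    for task in range(r):
        groups = defaultdict(list)
        for key, value in sorted(tasks[task]):
            groups[group_udf(key)].append((key, value))
        for gkey, kvs in groups.items():
            output.extend(reduce_udf(gkey, kvs))
    return output

# Toy word count to exercise the skeleton
result = run_mapreduce(
    inputs=["a b a", "b c"],
    map_udf=lambda line: [(w, 1) for w in line.split()],
    part_udf=lambda key, r: hash(key) % r,
    group_udf=lambda key: key,
    reduce_udf=lambda key, kvs: [(key, sum(v for _, v in kvs))],
    r=3,
)
print(sorted(result))  # [('a', 2), ('b', 2), ('c', 1)]
```

The point of separating part from group, which ReduceSplit exploits below, is that several groups can be routed to the same reduce task while still being reduced independently.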

7 / 16 Distributed Evaluation of the Cartesian Product
- Pairwise entity comparison requires distributing the entity pairs to computing tasks/nodes
- A pair (e_R, e_S) of R × S matches if classifier.classify(sim_1(e_R, e_S), sim_2(e_R, e_S), …, sim_k(e_R, e_S)) = “match”
- Strategy 1: split S into x blocks (e.g., x=2), replicate R x times → x “match tasks”
- Strategy 2: split R into x blocks and S into y blocks (e.g., x=y=2), replicate each R-block y times and each S-block x times → x·y “match tasks”
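A small sketch of the second strategy's task layout (data and split function invented for illustration): every R-block is paired with every S-block, so each R-block is replicated to y tasks and each S-block to x tasks.

```python
from itertools import product

def split(data, n):
    # Round-robin split of a list into n blocks
    return [data[i::n] for i in range(n)]

R, S = ["a", "b"], ["c", "d", "e", "f"]
x, y = 2, 2
r_blocks, s_blocks = split(R, x), split(S, y)

# x*y match tasks, each evaluating the Cartesian product of one block pair
for (i, rb), (j, sb) in product(enumerate(r_blocks), enumerate(s_blocks)):
    print(f"match task ({i},{j}): {list(product(rb, sb))}")
```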

8 / 16 MapSide (m=3)
- Map tasks buffer R in memory at initialization time
- Each map task operates on a partition of the larger data source S
- map(entity): match the currently processed entity of S with all buffered entities of R
- Example with R = {a, b} and m=3: map_0 reads the S-partition {c, d} and produces the pairs a-c, b-c, a-d, b-d; map_1 reads {e, f} and produces a-e, b-e, a-f, b-f; map_2 reads {g, h} and produces a-g, b-g, a-h, b-h
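A minimal sketch of one MapSide task, assuming some broadcast mechanism hands every map task a complete copy of R (in Hadoop this role is typically played by the distributed cache); the match predicate below is a placeholder that accepts every pair so all comparisons are visible:

```python
def mapside_task(r_copy, s_partition, is_match):
    # R is buffered in memory once, at map task initialization time
    results = []
    for e_s in s_partition:      # map() is invoked per S entity
        for e_r in r_copy:       # compare against every buffered R entity
            if is_match(e_r, e_s):
                results.append((e_r, e_s))
    return results

R = ["a", "b"]
s_partitions = [["c", "d"], ["e", "f"], ["g", "h"]]  # one per map task (m=3)

all_pairs = [p for part in s_partitions
             for p in mapside_task(R, part, lambda r, s: True)]
print(all_pairs)  # [('a','c'), ('b','c'), ('a','d'), ('b','d'), ...]
```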

9 / 16 ReduceSplit
- R is split into x blocks, S is split into y blocks; all x blocks of R are compared with all y blocks of S
- Implementation: composite map output keys of the form i.j.source; grouping by i.j → one reduce invocation per group; entities of R appear before entities of S in the group’s list of entities; reduce tasks buffer the entities of R and match each entity of S against the buffer
- Key generation and partitioning (see the sketch below):
  Entity e of R: random block index i in [0, x-1]; outputs y pairs (i.j.R, e) for j in [0, y-1]
  Entity e of S: random block index j in [0, y-1]; outputs x pairs (i.j.S, e) for i in [0, x-1]
  Partitioning function: part(i.j.source) = (i + j·x) mod r
- Example reduce task assignment of part for x=2, y=3, r=3: block pair (i, j) is assigned to reduce task (i + 2j) mod 3
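A self-contained sketch of ReduceSplit (illustrative, not the paper's Hadoop code): composite keys (i, j, source), the partition function (i + j·x) mod r, grouping by (i, j), and a reduce that buffers R entities before matching S entities against them.

```python
import random
from collections import defaultdict

def reducesplit(R, S, x, y, r, is_match, seed=0):
    rng = random.Random(seed)
    # Map phase: replicate each entity under composite keys (i, j, source)
    pairs = []
    for e in R:
        i = rng.randrange(x)                 # random block index for R
        pairs += [((i, j, "R"), e) for j in range(y)]
    for e in S:
        j = rng.randrange(y)                 # random block index for S
        pairs += [((i, j, "S"), e) for i in range(x)]

    # Partitioning: block pair (i, j) goes to reduce task (i + j*x) mod r
    tasks = defaultdict(list)
    for (i, j, src), e in pairs:
        tasks[(i + j * x) % r].append(((i, j, src), e))

    matches = []
    for task, kvs in tasks.items():
        # Group by (i, j); sorting puts "R" before "S" within each group
        groups = defaultdict(list)
        for (i, j, src), e in sorted(kvs):
            groups[(i, j)].append((src, e))
        for entities in groups.values():
            buffer = [e for src, e in entities if src == "R"]  # R arrives first
            for src, e_s in entities:
                if src == "S":
                    matches += [(e_r, e_s) for e_r in buffer if is_match(e_r, e_s)]
    return matches

print(reducesplit(["a", "b"], ["c", "d", "e", "f", "g", "h"],
                  x=2, y=3, r=3, is_match=lambda a, b: True))
```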

10 / 16 ReduceSplit example (m=3, r=3, x=2, y=3)
- R = {a, b} with block indexes a → 0, b → 1; S = {c, d, e, f, g, h} with block indexes c, f → 0, d, g → 1, e, h → 2
- Map output keys have the form IndexR.IndexS.Source: e.g., f yields (0.0.S, f) and (1.0.S, f); a yields (0.0.R, a), (0.1.R, a), (0.2.R, a)
- Partitioning by (IndexR + IndexS·x) mod r and grouping by IndexR.IndexS yields:
  reduce_0: groups 0.0 and 1.1 → pairs a-c, a-f, b-d, b-g
  reduce_1: groups 1.0 and 0.2 → pairs b-c, b-f, a-e, a-h
  reduce_2: groups 0.1 and 1.2 → pairs a-d, a-g, b-e, b-h
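Substituting the slide's fixed block assignment for the random choice in the reducesplit sketch above reproduces this distribution exactly; the snippet below hard-codes those indexes to make the assignment checkable:

```python
from collections import defaultdict

x, y, r = 2, 3, 3
r_blocks = {"a": 0, "b": 1}                                 # slide's R indexes
s_blocks = {"c": 0, "f": 0, "d": 1, "g": 1, "e": 2, "h": 2}  # slide's S indexes

tasks = defaultdict(list)
for e, i in r_blocks.items():
    for j in range(y):
        tasks[(i + j * x) % r].append(((i, j, "R"), e))
for e, j in s_blocks.items():
    for i in range(x):
        tasks[(i + j * x) % r].append(((i, j, "S"), e))

for task in sorted(tasks):
    groups = defaultdict(list)
    for (i, j, src), e in sorted(tasks[task]):
        groups[(i, j)].append((src, e))
    for (i, j), ents in sorted(groups.items()):
        buf = [e for s, e in ents if s == "R"]
        pairs = [(a, b) for s, b in ents if s == "S" for a in buf]
        print(f"reduce_{task} group {i}.{j}: {pairs}")
```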

11 / 16 MapSide vs. ReduceSplit
- MapSide requires that R entirely fits into the main memory available per map task (there are multiple map tasks per node!); in return there is no data redistribution, sorting, grouping, or reduce task scheduling
- With ReduceSplit, only |R|/x entities need to be buffered, at the expense of data replication (|R|·y + |S|·x map output pairs)
- A careful choice of x and y is crucial for performance
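The trade-off can be made concrete with the two quantities from the slide; a toy calculation (dataset sizes invented) of map output volume versus buffered R entities for a few (x, y) choices:

```python
def reducesplit_cost(n_r, n_s, x, y):
    # Replicated map output pairs vs. R entities buffered per reduce group
    return {"map_output_pairs": n_r * y + n_s * x,
            "buffered_R_entities": n_r // x}

for x, y in [(1, 1), (2, 3), (10, 10)]:
    print((x, y), reducesplit_cost(n_r=10_000, n_s=1_000_000, x=x, y=y))
```

Larger x shrinks the buffer but inflates the replicated S volume, which is why x and y must be tuned together.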

12 / 16 Experimental Results: Match Quality
- Bibliographic datasets: DBLP (2,600 entities) vs. Google Scholar (64,000 entities)
- Up to six matchers
- Two classifiers: decision tree and support vector machine from WEKA
- Employing multiple matchers increases overall match quality (F-measure), especially if the additional matchers operate on different attributes

13 / 16 Experimental Results: Time Distribution
- Evaluation of the runtime using MapSide on the same match problem
- 10 Amazon EC2 High-CPU Medium instances (each with two virtual cores)
- Multiple matchers generally increase match quality, at the expense of runtime
- Similarity computation consumes between 88% and 97% of the overall runtime, depending on the number of matchers

14 / 16 Experimental Results: Scalability
- MapSide with n = 1…50 dual-core VMs
- Almost linear speedup for up to 10 nodes
- Still good speedup values for more nodes (e.g., ≈40 for n = 50)

15 / 16 Conclusions
- Two different strategies for the evaluation of the Cartesian product of two input sources: MapSide (similarity computation solely during the map phase) and ReduceSplit (even distribution of the Cartesian product evaluation across all reduce tasks)
- Evaluation of the proposed approaches
- Future work: incorporate blocking strategies; analyze the learned model to avoid application of all matchers

16 / 16 Thank you for your attention