Speeding Up Batch Alignment of Large Ontologies Using MapReduce Uthayasanker Thayasivam and Prashant Doshi Dept. of Computer Science University of Georgia.

Introduction
- Ontology: formalizes the knowledge of a domain by defining concepts and the properties that relate them

Introduction: Ontology Alignment

Problem Definition: Ontology Alignment
The ontology alignment problem: find a set of correspondences between two ontologies, O1 and O2.

Ontology Alignment Challenges
- Improving the Alignment Quality
  - Structural & lexical disparity
- Improving the Alignment Efficiency
  - Quickly producing a quality alignment
- Improving the Scalability
[Figure: two plots, efficiency/quality against ontology sizes and efficiency/quality against resources]

Space of Alignments
- Alignment between the two ontologies as a |V1| × |V2| match matrix, with rows x1, x2, ..., x|V1| and columns y1, y2, ..., y|V2|:

\[
\begin{pmatrix}
m_{11} & m_{12} & \cdots & m_{1|V_2|} \\
m_{21} & m_{22} & \cdots & m_{2|V_2|} \\
\vdots & \vdots & \ddots & \vdots \\
m_{|V_1|1} & m_{|V_1|2} & \cdots & m_{|V_1||V_2|}
\end{pmatrix}
\]

- Alignment space size (per alignment type): many-to-many, one-to-many, one-to-one
- Evaluating an alignment: Cartesian product of entities
- Bipartite graph
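To make the size of this space concrete, here is a minimal Python sketch. It assumes the many-to-many case, where every cell of the match matrix is independently 0 or 1, so the count is 2^(|V1|·|V2|). The function names and the two toy entity lists are illustrative assumptions, not taken from the slides.

```python
from itertools import product


def candidate_pairs(v1, v2):
    """Cartesian product of the two entity sets: every possible correspondence."""
    return list(product(v1, v2))


def count_many_to_many(v1, v2):
    """Each candidate pair is either in the alignment or not."""
    return 2 ** (len(v1) * len(v2))


def enumerate_alignments(v1, v2):
    """Brute-force enumeration of every binary match matrix (tiny inputs only)."""
    pairs = candidate_pairs(v1, v2)
    for bits in product((0, 1), repeat=len(pairs)):
        yield {pair for pair, bit in zip(pairs, bits) if bit}


# Hypothetical toy entity sets.
v1 = ["x1", "x2"]
v2 = ["y1", "y2"]
all_alignments = list(enumerate_alignments(v1, v2))
```

Even two entities per ontology already give 16 candidate alignments, which is why exhaustive search is not an option for large ontologies.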

Large Ontology Matching
- Reduction of alignment space
  - Early pruning of dissimilar element pairs
    - aflood (Hanif and Masaki '09)
- Partition based matching
  - Falcon-AO (Jian et al. '05)
- Parallel matching
  - MapPSO (Bock and Hettenhausen '10)
  - VDoc+ (Zhang '12)
[Figure: O1 and O2 partitioned into parts P11, P12, P13 and P21, P22, P23; 4 blocks]

Batch Alignment of Large Ontologies
- Scalability is challenging
  - OAEI Very Large Biomedical Ontology Track
  - 8 out of 21 tools completed
- Ontology repositories (e.g., NCBO at Stanford)
  - Batch alignment of ontologies
  - New ontologies posted
  - Ontologies get updated
Approach allows any alignment algorithm to be utilized on a MapReduce architecture

Contributions: Batch Alignment of Large Ontologies
- General & novel approach to speed up batch alignment of large ontologies using MapReduce
  - No impact on alignment quality for some algorithms
  - Benefits ontology repositories

MapReduce Framework
[Figure: MapReduce data flow of Key -> Value records from input to output]
- Key identifies a subproblem
[Figure: ontologies O1 and O2 split into parts labelled O11, O21, O31, O12, O22]

Mapper & Reducer Algorithms
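The flow can be pictured with a minimal, sequential Python sketch. It is an assumption-laden illustration, not the authors' mapper and reducer. It assumes the mapper emits one record per subproblem under a key that identifies that subproblem. It also assumes the reducer runs the chosen alignment algorithm on whatever arrives under its key. `toy_matcher`, `run_job` and the batch data are hypothetical names and values.

```python
from collections import defaultdict


def toy_matcher(part1, part2):
    """Hypothetical stand-in for any alignment algorithm: match equal names."""
    return {(a, b) for a in part1 for b in part2 if a.lower() == b.lower()}


def mapper(pair_id, subproblems):
    """Emit one (key, value) record per subproblem; the key identifies it."""
    for index, (part1, part2) in enumerate(subproblems):
        yield (pair_id, index), (part1, part2)


def reducer(key, values, matcher=toy_matcher):
    """Align every subproblem that arrived under this key."""
    alignment = set()
    for part1, part2 in values:
        alignment |= matcher(part1, part2)
    return key, alignment


def run_job(batch):
    """Sequential simulation of map, shuffle (group by key) and reduce."""
    shuffled = defaultdict(list)
    for pair_id, subproblems in batch.items():
        for key, value in mapper(pair_id, subproblems):
            shuffled[key].append(value)
    partial = [reducer(key, values) for key, values in sorted(shuffled.items())]
    # Merge the subproblem alignments back into one alignment per ontology pair.
    merged = defaultdict(set)
    for (pair_id, _), alignment in partial:
        merged[pair_id] |= alignment
    return dict(merged)


# Hypothetical batch: each ontology pair is already split into subproblems.
batch = {
    "pairA": [(["Paper", "Author"], ["paper", "Reviewer"]), (["Chair"], ["chair"])],
    "pairB": [(["Mouse"], ["Human"])],
}
result = run_job(batch)
```

On a real Hadoop cluster the shuffle and the reducer calls would be distributed across nodes. The sketch only shows how keys keep subproblems separate and how the partial alignments are merged afterwards.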

Identifying Alignment Subproblems
- Approach: Hamdi et al.
  - Identify anchors: entity pairs with identical names or labels
  - Cluster concepts around the anchors
    - Using structural neighborhood
Entities from one cluster are predominantly in correspondence with entities in one other cluster
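The two steps can be sketched in Python as follows. This is a simplification under stated assumptions, not Hamdi et al.'s actual procedure. It assumes anchors are pairs with case-insensitively identical names, and that "structural neighborhood" is approximated by a breadth-first search that assigns each concept to its closest anchor. The adjacency dictionaries `o1` and `o2` are hypothetical.

```python
from collections import deque


def find_anchors(names1, names2):
    """Anchors: entity pairs whose names are identical (ignoring case)."""
    by_name = {name.lower(): name for name in names2}
    return [(n, by_name[n.lower()]) for n in names1 if n.lower() in by_name]


def cluster_around(anchors, neighbours):
    """Multi-source BFS: each concept joins the cluster of the closest anchor."""
    cluster_of = {anchor: anchor for anchor in anchors}
    queue = deque(anchors)
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in cluster_of:
                cluster_of[nxt] = cluster_of[node]
                queue.append(nxt)
    return cluster_of


# Hypothetical ontologies as undirected adjacency lists (concept -> neighbours).
o1 = {"Paper": ["Abstract", "Author"], "Abstract": ["Paper"], "Author": ["Paper"],
      "Review": ["Score"], "Score": ["Review"]}
o2 = {"paper": ["Title"], "Title": ["paper"],
      "review": ["Rating"], "Rating": ["review"]}

anchors = find_anchors(o1, o2)
clusters1 = cluster_around([a for a, _ in anchors], o1)
clusters2 = cluster_around([b for _, b in anchors], o2)
```

Each anchor pair then defines one subproblem: the cluster around `Paper` in the first ontology is aligned only against the cluster around `paper` in the second.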

Merging Subproblem Alignments

Performance Evaluation
- Datasets
  - Conference track from OAEI (120 pairs)
  - Large ontologies from OAEI (SNOMED, NCI, ...; 5 pairs)
  - New biomedical ontology testbed (50 pairs from NCBO)
- Algorithms
  - Falcon-AO, Optima+, LogMap, YAM++
- Compare F-measure & runtime
  - Default setup on a single node
  - MapReduce setup using Hadoop (12 nodes each with 24 2GB & 2GHz Intel Xeon processors)
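For reference, the quality measures compared here can be computed as in the following Python sketch. It uses the standard definitions of precision, recall and F-measure (their harmonic mean) over sets of correspondences. The `found` and `reference` alignments are made-up examples, not data from the evaluation.

```python
def precision_recall_f(found, reference):
    """Precision, recall and F-measure of an alignment against a reference."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical alignments: two of the four found correspondences are correct.
found = {("a", "1"), ("b", "2"), ("c", "9"), ("d", "8")}
reference = {("a", "1"), ("b", "2"), ("c", "3"), ("e", "5")}
p, r, f = precision_recall_f(found, reference)
```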

Results – 3 Datasets
Speedup by algorithm (columns: Conference, Large OAEI, Biomedical)
- Falcon: 2155
- LogMap: 9165
- Optima:
- Yam:
[Figure: charts for the Conference, Large OAEI and Biomedical datasets]

Results – Large OAEI ontologies
- Conference Track
  - No partitioning
  - No change in output
- Other Datasets
  - LogMap & YAM++: Tradeoff is in the alignment quality
  - Falcon-AO & Optima+: No change in output
[Table: P, R and F per ontology pair (mouse, human; STW, TheSoz; fma, nci; fma, snomed; snomed, nci) for MapRed./Def. Falcon-AO, MapReduce LogMap, MapRed./Def. Optima+, MapReduce YAM++, Default LogMap and Default YAM++]

Speedup with # of nodes in the Hadoop cluster
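Speedup is read here in the usual sense: runtime of the default single-node setup divided by runtime of the MapReduce setup. A minimal Python sketch follows. The timings are invented placeholders, not measurements from the talk, and `efficiency` (speedup per node) is an added illustrative measure.

```python
def speedup(default_seconds, mapreduce_seconds):
    """How many times faster the MapReduce run is than the default run."""
    return default_seconds / mapreduce_seconds


def efficiency(default_seconds, mapreduce_seconds, nodes):
    """Speedup per node; 1.0 would be perfectly linear scaling."""
    return speedup(default_seconds, mapreduce_seconds) / nodes


# Hypothetical runtimes in seconds, keyed by number of Hadoop nodes.
baseline = 1200.0
timings = {1: 1200.0, 4: 400.0, 12: 150.0}
table = {
    nodes: (speedup(baseline, t), efficiency(baseline, t, nodes))
    for nodes, t in timings.items()
}
```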

Discussion
- First inter-matcher parallelization approach
  - Especially using MapReduce
- Exhibits significant speedup for batch alignment
  - Some algorithms may see a small reduction in alignment quality due to the partitioning
- Significant speedup for single ontology pair
  - Falcon-AO, Optima+ & YAM++
- Any alignment algorithm can fit in our framework

Thank you
Questions?

Parallel Alignment of Large Ontologies on a Computing Cluster
- Current divide-and-conquer approaches
  - Heavily rely on structure
  - Size-based partitioning techniques are not effective
- Current parallel matching algorithms
  - Parallelize the process within the algorithms
  - Do not support a multi-node cluster architecture