MSSG: A Framework for Massive-Scale Semantic Graphs Timothy D. R. Hartley, Umit Catalyurek, Fusun Ozguner, Andy Yoo, Scott Kohn, Keith Henderson Dept.

Slides:



Advertisements
Similar presentations
Large Scale Computing Systems
Advertisements

The IEEE International Conference on Big Data 2013 Arash Fard M. Usman Nisar Lakshmish Ramaswamy John A. Miller Matthew Saltz Computer Science Department.
Seunghwa Kang David A. Bader Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System.
Running Large Graph Algorithms – Evaluation of Current State-of-the-Art Andy Yoo Lawrence Livermore National Laboratory – Google Tech Talk Feb Summarized.
An Effective GPU Implementation of Breadth-First Search Lijuan Luo, Martin Wong and Wen-mei Hwu Department of Electrical and Computer Engineering, UIUC.
Distributed Breadth-First Search with 2-D Partitioning Edmond Chow, Keith Henderson, Andy Yoo Lawrence Livermore National Laboratory LLNL Technical report.
APACHE GIRAPH ON YARN Chuan Lei and Mohammad Islam.
OpenFOAM on a GPU-based Heterogeneous Cluster
SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA; SAN DIEGO IEEE Symposium of Massive Storage Systems, May 3-5, 2010 Data-Intensive Solutions.
The Virtual Microscope Umit V. Catalyurek Department of Biomedical Informatics Division of Data Intensive and Grid Computing.
Abstract Shortest distance query is a fundamental operation in large-scale networks. Many existing methods in the literature take a landmark embedding.
Distribution Statement A. Approved for public release; distribution is unlimited. Test and Evaluation/Science and Technology Program Rapid Data Analyzer.
Copyright © 2012 Cleversafe, Inc. All rights reserved. 1 Combining the Power of Hadoop with Object-Based Dispersed Storage.
Distributed Data Stores – Facebook Presented by Ben Gooding University of Arkansas – April 21, 2015.
Managing Large RDF Graphs (Infinite Graph) Vaibhav Khadilkar Department of Computer Science, The University of Texas at Dallas FEARLESS engineering.
Annotating Search Results from Web Databases. Abstract An increasing number of databases have become web accessible through HTML form-based search interfaces.
Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.
Performance Evaluation of Hybrid MPI/OpenMP Implementation of a Lattice Boltzmann Application on Multicore Systems Department of Computer Science and Engineering,
Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma
How to Resolve Bottlenecks and Optimize your Virtual Environment Chris Chesley, Sr. Systems Engineer
11 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray) Abdullah Gharaibeh, Lauro Costa, Elizeu.
Keyword Search on External Memory Data Graphs Bhavana Bharat Dalvi, Meghana Kshirsagar, S. Sudarshan PVLDB 2008 Reported by: Yiqi Lu.
The Center for Autonomic Computing is supported by the National Science Foundation under Grant No NSF CAC Seminannual Meeting, October 5 & 6,
Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Improving Network I/O Virtualization for Cloud Computing.
W HAT IS H ADOOP ? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity.
Path Computation in External Memory In this work we focus on undirected, unweighted graphs with small diameter. This fits well for real world graph data.
Ohio State University Department of Computer Science and Engineering 1 Cyberinfrastructure for Coastal Forecasting and Change Analysis Gagan Agrawal Hakan.
The Red Storm High Performance Computer March 19, 2008 Sue Kelly Sandia National Laboratories Abstract: Sandia National.
Amy Apon, Pawel Wolinski, Dennis Reed Greg Amerson, Prathima Gorjala University of Arkansas Commercial Applications of High Performance Computing Massive.
CISC Machine Learning for Solving Systems Problems Presented by: Alparslan SARI Dept of Computer & Information Sciences University of Delaware
RecBench: Benchmarks for Evaluating Performance of Recommender System Architectures Justin Levandoski Michael D. Ekstrand Michael J. Ludwig Ahmed Eldawy.
Computer Science and Engineering Predicting Performance for Grid-Based P. 1 IPDPS’07 A Performance Prediction Framework.
CLASS Information Management Presented at NOAATECH Conference 2006 Presented by Pat Schafer (CLASS-WV Development Lead)
Making Watson Fast Daniel Brown HON111. Need for Watson to be fast to play Jeopardy successfully – All computations have to be done in a few seconds –
An Efficient Linear Time Triple Patterning Solver Haitong Tian Hongbo Zhang Zigang Xiao Martin D.F. Wong ASP-DAC’15.
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce Leonidas Akritidis Panayiotis Bozanis Department of Computer & Communication.
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY HPCDB Satisfying Data-Intensive Queries Using GPU Clusters November.
Building a Distributed Full-Text Index for the Web by Sergey Melnik, Sriram Raghavan, Beverly Yang and Hector Garcia-Molina from Stanford University Presented.
Scalable Keyword Search on Large RDF Data. Abstract Keyword search is a useful tool for exploring large RDF datasets. Existing techniques either rely.
Computational Research in the Battelle Center for Mathmatical medicine.
Big traffic data processing framework for intelligent monitoring and recording systems 學生 : 賴弘偉 教授 : 許毅然 作者 : Yingjie Xia a, JinlongChen a,b,n, XindaiLu.
A N I N - MEMORY F RAMEWORK FOR E XTENDED M AP R EDUCE 2011 Third IEEE International Conference on Coud Computing Technology and Science.
A New Algorithm for Inferring User Search Goals with Feedback Sessions.
Ohio State University Department of Computer Science and Engineering Servicing Range Queries on Multidimensional Datasets with Partial Replicas Li Weng,
LLNL-PRES This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Parallel IO for Cluster Computing Tran, Van Hoai.
REX: RECURSIVE, DELTA-BASED DATA-CENTRIC COMPUTATION Yavuz MESTER Svilen R. Mihaylov, Zachary G. Ives, Sudipto Guha University of Pennsylvania.
Developing GRID Applications GRACE Project
Servicing Seismic and Oil Reservoir Simulation Data through Grid Data Services Sivaramakrishnan Narayanan, Tahsin Kurc, Umit Catalyurek and Joel Saltz.
Spatial Approximate String Search. Abstract This work deals with the approximate string search in large spatial databases. Specifically, we investigate.
Development of Open API for Micro-scale
Big Data is a Big Deal!.
Sushant Ahuja, Cassio Cristovao, Sameep Mohta
Pagerank and Betweenness centrality on Big Taxi Trajectory Graph
Triple Stores.
FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs Scribed by Vinh Ha.
Parallel Density-based Hybrid Clustering
Parallel Data Laboratory, Carnegie Mellon University
Verilog to Routing CAD Tool Optimization
Tools for Processing Big Data Jinan Al Aridhee and Christian Bach
Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform
Overview of big data tools
Graph Colouring as a Challenge Problem for Dynamic Graph Processing on Distributed Systems Scott Sallinen, Keita Iwabuchi, Suraj Poudel, Maya Gokhale,
Energy-Efficient Storage Systems
Big Data Analytics: Exploring Graphs with Optimized SQL Queries
Triple Stores.
Accelerating Regular Path Queries using FPGA
L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher
Presentation transcript:

MSSG: A Framework for Massive-Scale Semantic Graphs Timothy D. R. Hartley, Umit Catalyurek, Fusun Ozguner, Andy Yoo, Scott Kohn, Keith Henderson Dept. of Electrical and Computer Engineering, The Ohio State University Dept. of Biomedical Informatics, The Ohio State University Center for Applied Scientific Computing, Lawrence Livermore National Laboratory Presented by K. Sheldon April 2011

Abstract This paper reports the findings from the development of a middleware framework to access and analyze massive scale graphs. The target is scale free semantic graphs with O(10 12 ) vertices and edges. Since massive scale graphs do not permit storing the entire graph in memory, a new graph database, grDB, is proposed to efficiently store and retrieve large scale graphs using out of core storage. This framework scales well and the experimental results show that the grDB outperforms other databases such as BerkeleyDB and MySQL. Objectives: Design and implement a flexible, easy-to-use API and associated middleware platform for analyzing massive-scale semantic graphs Presented by K. Sheldon April 2011

Overview Design of architecture MSSG (Massive Scale Semantic Graphs) Experimental results Conclusions Presented by K. Sheldon April2011

Scale Free Graphs Follow power law Many vertices with low degree A few hubs have very high degree Presented by K. Sheldon April 2011

MSSG Architecture

grDB Architecture

Experiment 24 nodes - dual 2.4GHz AMD Opteron GB RAM per node 500 GB local disks Graphs Pubmed-S: 3,751,921 vertices and 27,841,781 edges Pubmed-L: 26,676,177 vertices and 519,630,678 edges Synthesized: 100 Million vertices and 2 Billion edges Measurement Search time (s) Aggregate Edges processed Presented by K. Sheldon April 2011

Custom Software (MSSG program) Stream graph clustering engine built at LLNL to solve the graph search problem. It can run on any cluster. Data is stored on disk. Lessons learned: – Well designed custom software scale to billion edge graphs performs well relatively inexpensive compared to other options. Cost : long software development time lack of generality across other graph algorithm domains. Again the communication and disk writes are impediments. Presented by K. Sheldon April 2011

Ingestion and Search Performance (PubMed-S)

Ingestion and Search Performance (PubMed-L)

Search Performance of grDB for 100M graph

Conclusions A custom disk based graph database, grDB combined with a parallel out of core BFS algorithm Predictions: for a One trillion edge graph Ingestion with grDB in roughly 77 hours average search in 10s of minutes I/O efficient structures are necessary with more experimental testing on performance. Investigate other external memory libraries such as TPIE (transparent parallel I/O environment). This provides efficient access to multiple disks attached to a single system Investigate optimization techniques for the Ingestion, GraphDB and Query portions of the framework. Presented by K. Sheldon April 2011