GRAPHALYTICS A Big Data Benchmark for Graph-Processing Platforms

Slides:



Advertisements
Similar presentations
2000 Making DADS distributed a Nordunet2 project Jochen Hollmann Chalmers University of Technology.
Advertisements

Kensington Oracle Edition: Open Discovery Workflow Meets Oracle 10g Professor Yike Guo.
Oracle Labs Graph Analytics Research Hassan Chafi Sr. Research Manager Oracle Labs Graph-TA 2/21/2014.
Project Goals and Status Peter Boncz (VU Amsterdam) Munich April 22+23, 2013.
Analysis and Modeling of Social Networks Foudalis Ilias.
Distributed Graph Analytics Imranul Hoque CS525 Spring 2013.
APACHE GIRAPH ON YARN Chuan Lei and Mohammad Islam.
Advance Analytics Capabilities
® IBM India Research Lab © 2006 IBM Corporation Challenges in Building a Strategic Information Integration Infrastructure Mukesh Mohania IBM India Research.
Hiperspace Lab University of Delaware Antony, Sara, Mike, Ben, Dave, Sreedevi, Emily, and Lori.
Sampling from Large Graphs. Motivation Our purpose is to analyze and model social networks –An online social network graph is composed of millions of.
A Lightweight Infrastructure for Graph Analytics Donald Nguyen Andrew Lenharth and Keshav Pingali The University of Texas at Austin.
LDBC & The Social Network Benchmark Peter Boncz Database Architectures CWI Special chair “Large-Scale Data VU event.cwi.nl/lsde2015.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
LDBC-Benchmarking Graph-Processing Platforms: A Vision Benchmarking Graph-Processing Platforms: A Vision (A SPEC Research Group Process) Delft University.
11 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray) Abdullah Gharaibeh, Lauro Costa, Elizeu.
Storage in Big Data Systems
X-Stream: Edge-Centric Graph Processing using Streaming Partitions
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
Hotspot Detection in a Service Oriented Architecture Pranay Anchuri,
CSE 486/586 CSE 486/586 Distributed Systems Graph Processing Steve Ko Computer Sciences and Engineering University at Buffalo.
Linked Data Benchmark Council 2-year status report LDBC Linked Data Benchmark Council 2-year status report Peter Boncz.
M. Wang, T. Xiao, J. Li, J. Zhang, C. Hong, & Z. Zhang (2014)
Advanced Software Engineering PROJECT November 2015.
Students: Aiman Md Uslim, Jin Bai, Sam Yellin, Laolu Peters Professors: Dr. Yung-Hsiang Lu CAM 2 Continuous Analysis of Many CAMeras The Problem Currently.
What we know or see What’s actually there Wikipedia : In information technology, big data is a collection of data sets so large and complex that it.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
Making the Case for Business Intelligence
Performance Assurance for Large Scale Big Data Systems
Connected Infrastructure
VisIt Project Overview
Univa Grid Engine Makes Work Management Automatic and Efficient, Accelerates Deployment of Cloud Services with Power of Microsoft Azure MICROSOFT AZURE.
Big Data is a Big Deal!.
SAS users meeting in Halifax
Introduction to Distributed Platforms
Hadoop and Analytics at CERN IT
Pagerank and Betweenness centrality on Big Taxi Trajectory Graph
Open Source distributed document DB for an enterprise
Spark Presentation.
Connected Infrastructure
Empirical analysis of Chinese airport network as a complex weighted network Methodology Section Presented by Di Li.
Hadoop Clusters Tess Fulkerson.
Extraction, aggregation and classification at Web Scale
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
Supporting Fault-Tolerance in Streaming Grid Applications
Dieudo Mulamba November 2017
Introduction to Spark.
Anne Pratoomtong ECE734, Spring2002
Towards Effective Partition Management for Large Graphs
Cheng-Ming Huang, Wen-Hung Liao Department of Computer Science
Mayank Bhatt, Jayasi Mehar
Apache Spark & Complex Network
Accelerate Your Self-Service Data Analytics
Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform
HPML Conference, Lyon, Sept 2018
Overview of big data tools
Spark and Scala.
Technical Capabilities
The Performance of Big Data Workloads in Cloud Datacenters
Analyzing Two Participation Strategies in an Undergraduate Course Community Francisco Gutierrez Gustavo Zurita
Tahsin Reza Matei Ripeanu Nicolas Tripoul
Big Data Analytics: Exploring Graphs with Optimized SQL Queries
Resource Allocation for Distributed Streaming Applications
Assoc. Prof. Marc FRÎNCU, PhD. Habil.
Analyzing Massive Graphs - ParT I
Big-Data Analytics with Azure HDInsight
Phoenix: A Substrate for Resilient Distributed Graph Analytics
L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher
Gurbinder Gill Roshan Dathathri Loc Hoang Keshav Pingali
Presentation transcript:

GRAPHALYTICS A Big Data Benchmark for Graph-Processing Platforms http://bl.ocks.org/mbostock/4062045 https://github.com/tudelft-atlarge/graphalytics/ GRAPHALYTICS was made possible by a generous contribution from Oracle. Alexandru Iosup, Tim Hegeman, Yong Guo, Ana Lucia Varbanescu, Wing Lung Ngai, Stijn Heldens, Mihai Capotã, Arnau Prat, Peter Boncz, Hassan Chafi Graph image Graphalytics: Benchmarking Graph-Processing Platforms LDBC TUC Meeting UPC Barcelona, March 2016 1

Graphs at the Core of Our Society: The LinkedIn ExampleData Deluge Nov 2015 Apr 2014 400 400 Sources: Vincenzo Cosenza, The State of LinkedIn, http://vincos.it/the-state-of-linkedin/ via Christopher Penn, http://www.shiftcomm.com/2014/02/state-linkedin-social-media-dark-horse/

LinkedIn Is Not Unique: Data Deluge company/day: 100+ posts, 1,000+ comments IBM 280k employee- -users, 2.6M followers 270M MAU 200+ avg followers >54B edges 1.2B MAU 0.8B DAU 200+ avg followers >240B edges Twitter https://about.twitter.com/company Beevolve 2012 estimates avg # of followers at 208 Facebook - http://thenextweb.com/facebook/2014/01/29/facebook-passes-1-23-billion-monthly-active-users-945-million-mobile-users-757-million-daily-users/

Graph Processing @large A Graph Processing Platform Distribution to processing platform ETL Active Storage (filtering, compression, replication, caching) Algorithm Interactive processing not considered in this presentation. Streaming not considered in this presentation.

Graph Processing @large A Graph Processing Platform Distribution to processing platform Ideally, N cores/disks  Nx faster Ideally, N cores/disks  Nx faster ETL Active Storage (filtering, compression, replication, caching) Algorithm Interactive processing not considered in this presentation. Streaming not considered in this presentation.

Graph Processing @large Data-intesive workload 10x graph size  100x—1,000x slower A Graph Processing Platform Distribution to processing platform Ideally, N cores/disks  Nx faster Ideally, N cores/disks  Nx faster ETL Active Storage (filtering, compression, replication, caching) Algorithm Compute-intesive workload different/more complex analysis  ?x slower Dataset-dependent workload unfriendly graphs  ??x slower Interactive processing not considered in this presentation. Streaming not considered in this presentation.

Graph-Processing Platforms Platform: the combined hardware, software, and programming system that is being used to complete a graph processing task 2 Which to choose? What to tune? Trinity

Graphalytics, in a nutshell An LDBC benchmark* Advanced benchmarking harness Diverse real and synthetic datasets Many classes of algorithms Granula for manual choke-point analysis Modern software engineering practices Supports many platforms Enables comparison of community-driven and industrial systems http://graphalytics.ewi.tudelft.nl https://github.com/tudelft-atlarge/graphalytics/

Benchmarking Harness Iosup et al. LDBC Graphalytics: A Benchmark for Large Scale Graph Analysis on Parallel and Distributed Platform (submitted).

Graphalytics = Representative Classes of Algorithms and Datasets 2-stage selection process of algorithms and datasets Class Examples % Graph Statistics Diameter, Local Clust. Coeff., PageRank 20 Graph Traversal BFS, SSSP, DFS 50 Connected Comp. Reachability, BiCC, Weakly CC 10 Community Detection Clustering, Nearest Neighbor, Community Detection w Label Propagation 5 Other Sampling, Partitioning <15 + weighted graphs: Single-Source Shortest Paths (~35%) : 1. Literature survey of + 2. Expert selection 2009–2013, 120+ articles in 10 top conferences: SIGMOD, VLDB, HPDC, Guo et al. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis, IPDPS’14.

Graphalytics = Distributed Graph Generation w DATAGEN 11 Rich set of configurations More diverse degree distribution than Graph500 Realistic clustering coefficient and assortativity Graphalytics Person Generation Edge Generation Activity Generation “Knows” graph serialization Activity serialization Level of Detail

Graph Processing System Graphalytics = Portable Perf. Analysis w Granula Graph Processing System Logging Patch Performance Analyzer Granula Archive Granula Performance Model Modeling Archiving logs rules Archiver Sharing Monitoring Minimal code invasion + automated data collection at runtime + portable archive (+ web UI)  portable bottleneck analysis

Graphalytics = Diverse Set of Automated Experiments Category Experiment Algo. Data Nodes/ Threads Metrics Baseline Dataset variety BFS,PR All 1 Run, norm. Algorithm variety R4(S), D300(L) Runtime Scalability Vertical vs. horiz. BFS, PR D300(L), D1000(XL) 1—16/1—32 Runtime, S Weak vs. strong G22(S)—G26(XL) Robustness Stress test BFS SLA met Variability D300(L), D1000(L) 1/16 CV Self-Test Time to run/part -- Datagen 1—16

Implementation status G=validated, on GitHub V=validation stage MapReduce2 Giraph GraphX PowerGraph Graph Lab Neo4j PGX.D Graph Mat OpenG TOTEM Map Graph Medusa LCC G -- BFS V WCC CDLP P’Rank SSSP Let me show you a demo run of our system, and maybe even some actual results. https://github.com/tudelft-atlarge/graphalytics/

Implementation status G=validated, on GitHub V=validation stage MapReduce2 Giraph GraphX PowerGraph Graph Lab Neo4j PGX.D Graph Mat OpenG TOTEM Map Graph Medusa LCC G -- BFS V WCC CDLP P’Rank SSSP Let me show you a demo run of our system, and maybe even some actual results. Benchmarking and tuning performed by vendors

Graphalytics Capabilities: An Example Graphalytics enables deep comparison of many systems at once, through diverse experiments and metrics Diverse datasets Your system here! Diverse metrics Diverse algorithms

Processing time (s) + Edges[+Vertices]/s Which system is the best? It depends… Algorithm + Dataset + Metric OK, but … why is this system better for this workload for this metric?

Granula Visualizer Portable choke-point analysis for everyone!

Graphalytics = Modern Software Engineering Process Graphalytics code reviews Internal release to LDBC partners (first, Feb 2015; last, Feb 2016) Public release, announced first through LDBC (Apr 2015) First full benchmark specification, LDBC criteria (Q1 2016) Jenkins continuous integration server SonarQube software quality analyzer https://github.com/tudelft-atlarge/graphalytics/

Graphalytics, in the future github.com/tudelft-atlarge/graphalytics/ An LDBC benchmark* Advanced benchmarking harness Diverse real and synthetic datasets Many classes of algorithms Granula for manual choke-point analysis Modern software engineering practices Supports many platforms Enables comparison of community-driven and industrial systems + more data generation + deeper performance metrics + choke-point analysis

PELGA – Performance Engineering for Large-scale Graph Analytics, workshop with EuroPar 2016