Big Graph Processing on Cloud Jeffrey Xu Yu ( 于旭 ) The Chinese University of Hong Kong


Big Graphs/Networks

Graph Systems
- There are many graph systems in the literature.

Graph Computing on Cloud
- Workload Balancing
- Auto Approximation

Vertex-Centric Computing on BSP
- Distributed vertex-centric computing
- BSP (Bulk Synchronous Parallel)
  - Concurrent computing
  - Communication
  - Barrier synchronization
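The three BSP phases can be illustrated with a minimal single-process sketch. It is not the API of any particular graph system: the function name `bsp_sssp`, the adjacency format, and the choice of single source shortest path as the vertex program are assumptions made for illustration only.

```python
from collections import defaultdict
import math


def bsp_sssp(edges, source):
    """Toy BSP loop running a vertex-centric shortest-path program.

    edges: dict mapping every vertex to a list of (neighbor, weight) pairs.
    Returns (distances, number of active vertices in each superstep).
    """
    dist = {v: math.inf for v in edges}
    inbox = {source: [0.0]}        # messages visible in the current superstep
    active_per_step = []
    while inbox:                   # halt when no messages are in flight
        outbox = defaultdict(list)
        active_per_step.append(len(inbox))
        # Concurrent computing: each vertex with messages runs independently.
        for v, msgs in inbox.items():
            best = min(msgs)
            if best < dist[v]:
                dist[v] = best
                # Communication: send candidate distances along out-edges.
                for u, w in edges[v]:
                    outbox[u].append(best + w)
        # Barrier synchronization: messages are delivered only in the next superstep.
        inbox = dict(outbox)
    return dist, active_per_step
```

On a small graph such as `{"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": [("d", 1)], "d": []}` with source `"a"`, the number of active vertices changes from superstep to superstep, which is the behaviour that the workload balancing slides are concerned with.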

Workload Balancing
- Computing
  - Determined by the slowest worker
  - Workload balancing
- Communication
  - The volume matters
  - Cross-edges
- Computing + Communication
  - Balanced partitioning

Workload Balancing
- Balanced k-way graph partitioning
  - Size-balanced partitions
  - The minimum possible number of cross-edges
- It solves our problem if the graph is static
  - By static, we mean the vertices are always active during the computation
- However, in graph analytics, vertices may toggle between active and inactive.
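The two objectives of balanced k-way partitioning can be measured for any given assignment. The sketch below only evaluates a partition; it does not compute one. The function name `partition_quality` and the imbalance ratio (largest partition size divided by the ideal size n/k) are illustrative assumptions, not definitions taken from the slides.

```python
def partition_quality(edge_list, part, k):
    """Evaluate a k-way partition.

    edge_list: iterable of (u, v) pairs.
    part: dict mapping each vertex to a partition id in range(k).
    Returns (partition sizes, number of cross-edges, size imbalance ratio).
    """
    sizes = [0] * k
    for p in part.values():
        sizes[p] += 1
    # A cross-edge has its endpoints in different partitions,
    # so it costs network communication in every superstep that uses it.
    cross = sum(1 for u, v in edge_list if part[u] != part[v])
    ideal = len(part) / k
    imbalance = max(sizes) / ideal  # 1.0 means perfectly size-balanced
    return sizes, cross, imbalance
```

For a 4-cycle `[(0, 1), (1, 2), (2, 3), (3, 0)]`, placing adjacent vertices together cuts two edges, while alternating the vertices across two partitions cuts all four, although both assignments are equally size-balanced.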

Dynamic Workload Balancing
- Computing
  - Determined by the slowest worker
  - Workload balancing
- Communication
  - The volume matters
  - Cross-edges
- Dynamic workload balancing
  - Respond to vertices’ active/inactive status

Any General Approach?
- We do not know anything about what graph algorithms will be used.
- We do not know anything about the graphs themselves.
- We cannot request that graphs be ‘well’ partitioned on the cloud.
- We cannot assume how graphs are initially partitioned on the cloud.
- The approach must react to workload imbalance in a timely manner, and rebalancing itself cannot take long.

An Example

Representative Graph Algorithms
- PageRank
- Semi-clustering
- Graph Coloring
- Single Source Shortest Path
- Breadth First Search
- Random Walk
- Maximal Matching
- Minimum Spanning Tree
- Maximal Independent Sets

Category 1: Always Active
- The three algorithms
  - PageRank
  - Semi-clustering
  - Graph Coloring
- The vertices are always active
- Ideal case for static partitioning
  - Perfectly balanced, as expected

Category 2: Traversal
- The three algorithms
  - Single Source Shortest Path
  - Breadth First Search
  - Random Walk
- Significantly imbalanced
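Why traversal algorithms become imbalanced under a static partition can be seen by counting the active (frontier) vertices per partition in each superstep of a breadth first search. This is a toy sketch under assumed names (`bfs_active_by_partition`, `adj`, `part`); it does not reproduce the measurements behind the slide.

```python
def bfs_active_by_partition(adj, source, part, k):
    """Count active vertices per partition in each BFS superstep.

    adj: dict mapping each vertex to a list of neighbors.
    part: dict mapping each vertex to a partition id in range(k).
    Returns one list of k counts per superstep.
    """
    visited = {source}
    frontier = [source]
    history = []
    while frontier:
        counts = [0] * k
        for v in frontier:
            counts[part[v]] += 1   # only frontier vertices do work
        history.append(counts)
        next_frontier = []
        for v in frontier:
            for u in adj[v]:
                if u not in visited:
                    visited.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return history
```

On a path 0-1-2-3 split into partitions {0, 1} and {2, 3}, the partition is perfectly size-balanced, yet in every superstep all active vertices sit in a single partition while the other one idles.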

Category 3: Multi-Phases
- The three algorithms
  - Maximal Matching
  - Minimum Spanning Tree
  - Maximal Independent Sets
- Somewhat balanced

Predictable?
- For Category 1, the algorithms have a stable working window.
- For Category 2, predictability cannot be ensured; however, most large-scale graphs have the low-diameter property.
  - SSSP has a reasonable hit-rate between supersteps.
- For Category 3, the hit-rate between two successive phases is very high, due to the algorithm design.

Our Approach [Shang et al. ICDE’13]

Some Basic Ideas

Compare with Random Partitioning

Graph Computing on Cloud
- The factors
  - Memory consumption, communication cost, CPU cost, and the number of rounds
- The classes
  - MapReduce Class (MRC) by Karloff et al. in SODA’10
  - Minimal MapReduce Class (MMC) by Tao et al. in SIGMOD’13
  - Scalable Graph Processing (SGC) on MapReduce by Qin et al. in SIGMOD’14
  - Balanced Practical Pregel Algorithms (BPPA) on BSP by Yan et al. in VLDB’14

Auto-Approximate Graph Computing [Shang et al. VLDB’15]
- Big data and bigger data
  - Google: 2+EB
  - Twitter: hit 8PB
  - Yahoo: 400PB
  - Facebook: 300PB
- Big data needs to get answers fast
- More data beats a cleverer algorithm
  - “A few useful things to know about machine learning” by P. Domingos in CACM

Why Auto-Approximate?
- Working in a distributed environment is hard
- Designing a new algorithm is hard
- A new distributed approximation algorithm?
  - Hard + hard
- The target is a fast answer!
- But it is impossible to know the meaning of programs.

Auto-Approximate Graph Computing
- To modify the vertex-centric programs (UDF)
[Figure: Traditional Computing vs. Approximation Computing]

The Errors
[Figure labels: init value, default UDF, approx. UDF, final results, error term]

The Errors
- The error comes from two sides
  - The “bad” input
    - Error inherited from previous iterations
  - Wrong calculation
    - Error from the new approx. UDF
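The two sources of error can be written as a one-step decomposition. The notation is an assumption introduced here, not taken from the slides: $f$ is the default UDF, $\tilde{f}$ the approximate UDF, $x_t$ the exact value after $t$ iterations, and $\tilde{x}_t$ the approximate one.

```latex
\begin{aligned}
\tilde{x}_{t+1} - x_{t+1}
  &= \tilde{f}(\tilde{x}_t) - f(x_t) \\
  &= \underbrace{f(\tilde{x}_t) - f(x_t)}_{\text{inherited: ``bad'' input}}
   + \underbrace{\tilde{f}(\tilde{x}_t) - f(\tilde{x}_t)}_{\text{new: approx. UDF}}
\end{aligned}
```

The first term vanishes when the input carries no error from previous iterations, and the second vanishes when the approximate UDF agrees with the default UDF on the current input.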

Approximation
- No single approach can approximate all problems, as a consequence of Rice’s theorem.
  - Any nontrivial property of the language recognized by a Turing machine is undecidable.
- Approximation
  - Continuous functions
  - Discrete functions
- The notions of continuity from mathematical analysis are relevant and interesting even for software (Chaudhuri et al. in CACM)
  - Shortest paths, minimum spanning trees

An Example
- Sampling as an example
  - Find chances of sampling
  - Synthesize code
  - Correct the answer by regression
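A generic way to sample inside a vertex program is to process each incoming message with probability p and rescale the result by 1/p. The sketch below shows only that step; it is a standard unbiased estimator chosen for illustration and is not claimed to be the synthesized code or the regression-based correction described in the talk. The name `sampled_sum` and its parameters are hypothetical.

```python
import random


def sampled_sum(messages, p, rng):
    """Estimate sum(messages) by keeping each message with probability p.

    Dividing by p makes the estimate unbiased: each message contributes
    m * p * (1 / p) = m in expectation.
    Returns (estimate, number of messages actually processed).
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    kept = [m for m in messages if rng.random() < p]
    return sum(kept) / p, len(kept)
```

With `p = 1.0` every message is kept and the estimate equals the exact sum; smaller values of `p` process fewer messages at the cost of a larger variance, which is the error-time tradeoff in miniature.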

Error-Time Tradeoff

The Sampling Strategies

Graph Algorithms

Real Datasets

PR over twitter-mp (10 iterations)

The Eight Graph Algorithms

Time/Error Prediction

Some Remarks
- There are many reported graph systems in the literature.
- New ideas need to be explored further to deal with big graphs.