Parallel Subgraph Listing in a Large-Scale Graph. Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, Ning Xu. School of EECS, Peking University.

Presentation transcript:

Parallel Subgraph Listing in a Large-Scale Graph
Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, Ning Xu
School of EECS, Peking University; Hong Kong University of Science and Technology

Outline
o Subgraph listing operation
o Related work
o PSgL framework
o Evaluation
o Conclusion

Introduction
Motivation
o Motif Detection in Bioinformatics
o Cascades Counting in RN
o Triangle Counting in SN

Problem Definition
Subgraph Listing Operation
o Input: a pattern graph and a data graph (both undirected)
o Output: all occurrences of the pattern graph in the data graph
Goal of our work
o Efficiently list subgraphs in a large-scale graph
[Figure: pattern graph and data graph]
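
The input and output of the listing operation can be illustrated with a brute-force sketch. It is not the PSgL algorithm, only a statement of the problem in code. It assumes that an occurrence is a (not necessarily induced) subgraph of the data graph matching the pattern, identified by its set of data edges; the function name `list_occurrences` and the edge-list input format are choices made here for illustration.

```python
from itertools import permutations

def list_occurrences(pattern_edges, data_edges):
    """Return every occurrence of the pattern in the data graph.

    Both graphs are undirected and given as lists of vertex pairs.
    Each occurrence is reported once, as a frozenset of data edges.
    """
    pattern_vertices = sorted({v for e in pattern_edges for v in e})
    data_vertices = sorted({v for e in data_edges for v in e})
    data_edge_set = {frozenset(e) for e in data_edges}

    occurrences = set()
    # Try every injective assignment of pattern vertices to data vertices.
    for image in permutations(data_vertices, len(pattern_vertices)):
        mapping = dict(zip(pattern_vertices, image))
        mapped = {frozenset((mapping[a], mapping[b])) for a, b in pattern_edges}
        # Keep the assignment only if every pattern edge lands on a data edge.
        if mapped <= data_edge_set:
            occurrences.add(frozenset(mapped))
    return occurrences

triangle = [("a", "b"), ("b", "c"), ("a", "c")]
data = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(len(list_occurrences(triangle, data)))
```

The enumeration is exponential in the pattern size and touches every vertex tuple, which is why a large-scale data graph calls for a traversal-based, parallel approach.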

Related Work
Centralized algorithms
o Enumerate one by one [Chiba ’85, Wernicke ’06, Grochow ’07]
Streaming algorithms
o Only counting, and results are inaccurate [Buriol ’06, Bordino ’08, Zhao ’10]
MapReduce-based parallel algorithms
o Decompose pattern graph + explicit join operation [Afrati ’13]
o Fixed exploration plan + implicit join operation [Plantenga ’13]
Other efficient algorithms for specific pattern graphs
o Triangle [Suri ’11, Chu ’11, Hu ’13]

Drawbacks in existing parallel solutions
o MapReduce is not well suited to graph processing.
o The join operation is expensive.
o They do not balance the data distribution (data graph, intermediate results).
The novel PSgL framework lists subgraphs via graph traversal on a native graph stored in memory.

Contributions
o We propose an efficient parallel subgraph listing framework, PSgL.
o We introduce a cost model for subgraph listing in PSgL.
o We propose a simple but effective workload-aware distribution strategy, which helps PSgL achieve good workload balance.
o We design three independent mechanisms to reduce the size of intermediate results.

Preliminaries
Partial subgraph instance
o Examples: {?,?,?,?}, {2,3,4,5}, {1,5,6,?}

Independence Property

PSgL: Parallel Subgraph Listing Framework

PSgL: Vertex program

Algorithm of Expanding a Gpsi - II
Main logic
o Change one GRAY vertex into BLACK;
o Validate the expanding vertex's GRAY neighbors;
o Turn the expanding vertex's WHITE neighbors GRAY.
Two observations
o In each expansion, at least one pattern vertex is processed.
o All GRAY vertices are valid candidates for the next expansion.
[Figure: example of expanding a vertex]
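
The three steps of the main logic can be sketched on a single machine as follows. This is an interpretation, not the authors' vertex program: it assumes WHITE means "pattern vertex not yet mapped", GRAY means "mapped to a data vertex but not yet expanded", and BLACK means "expanded", and it ignores distribution across workers. The names `expand` and `list_subgraphs`, and the adjacency-dict representation, are hypothetical.

```python
from itertools import product

WHITE, GRAY, BLACK = "WHITE", "GRAY", "BLACK"

def expand(pattern, data, mapping, color, vp):
    """Expand GRAY pattern vertex vp; return the new (mapping, color) pairs."""
    vd = mapping[vp]  # data vertex that vp is mapped to
    gray_nbrs = [u for u in pattern[vp] if color[u] == GRAY]
    white_nbrs = [u for u in pattern[vp] if color[u] == WHITE]
    # Validate GRAY neighbors: the matching data edge must exist.
    if any(mapping[u] not in data[vd] for u in gray_nbrs):
        return []
    # WHITE neighbors take unused data neighbors of vd and become GRAY.
    candidates = sorted(data[vd] - set(mapping.values()))
    results = []
    for combo in product(candidates, repeat=len(white_nbrs)):
        if len(set(combo)) != len(combo):
            continue  # two pattern vertices cannot share a data vertex
        new_map, new_col = dict(mapping), dict(color)
        new_col[vp] = BLACK  # the expanding vertex turns BLACK
        for u, d in zip(white_nbrs, combo):
            new_map[u], new_col[u] = d, GRAY
        results.append((new_map, new_col))
    return results

def list_subgraphs(pattern, data, start):
    """List all mappings of a connected pattern, starting from pattern vertex start."""
    frontier = [({start: d}, {**{v: WHITE for v in pattern}, start: GRAY})
                for d in data]
    found = []
    while frontier:
        mapping, color = frontier.pop()
        grays = [v for v in pattern if color[v] == GRAY]
        if not grays:  # every pattern vertex is BLACK: a complete instance
            found.append(mapping)
            continue
        # Any GRAY vertex is a valid candidate; this sketch takes the first.
        frontier.extend(expand(pattern, data, mapping, color, grays[0]))
    return found

pattern = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}        # triangle
data = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(len(list_subgraphs(pattern, data, start=0)))
```

Without any ordering constraint on the pattern vertices, this sketch reports one mapping per automorphism of the pattern, so the single triangle in the example graph appears several times.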

PSgL: Analysis
Efficiency of PSgL
[Formula; annotated terms: total cost, # of iterations, # of workers, # of Gpsi processed by worker k, cost of processing a Gpsi]
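
The annotations on this slide (total cost, number of iterations, number of workers, number of Gpsi processed by worker k, cost of processing a Gpsi) suggest a cost of the following shape. This is one reading of those labels, assuming synchronous iterations in which each iteration lasts as long as its slowest worker. The symbols S, K, N_{s,k}, and c are names chosen here, not taken from the slides.

```latex
% S: number of iterations, K: number of workers,
% N_{s,k}: number of G_psi processed by worker k in iteration s,
% c(.): cost of processing one G_psi
T_{\text{total}} \;=\; \sum_{s=1}^{S} \; \max_{1 \le k \le K} \; \sum_{i=1}^{N_{s,k}} c\bigl(G_{psi}^{(s,k,i)}\bigr)
```

Under this reading, the total cost drops either by reducing the number of Gpsi generated or by evening out the per-worker sums inside the max, which matches the two optimization themes of the later slides.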

Optimization
Workload balance - I

Workload balance - II
Workload-aware distribution strategy
o A general greedy-based heuristic rule.
α | Description | Drawbacks
1 | | local optimum
0 | | imbalance
0.5 (*) | Making a trade-off between local optimum and imbalance | -
All three strategies have the same worst-case bound, K*|OPT|, but in practice α = 0.5 performs best.
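
The role of α in a greedy distribution rule can be illustrated with a toy sketch. The scoring function below is hypothetical and is not the formula from the slides; it is chosen only so that α = 0 looks at the cost of the instance alone and α = 1 looks at the resulting load of the worker. The names `choose_worker` and `distribute` are likewise assumptions.

```python
def choose_worker(loads, costs, alpha):
    """Pick a worker for one partial instance.

    loads: current load per worker (list indexed by worker id)
    costs: {worker id: positive cost of expanding the instance there}
    alpha: 0 considers only the instance cost, 1 only the resulting load
    """
    def score(k):
        # Hypothetical blend of resulting load and instance cost.
        return (loads[k] + costs[k]) ** alpha * costs[k] ** (1 - alpha)
    return min(sorted(costs), key=score)

def distribute(instances, num_workers, alpha):
    """Greedily assign instances one by one; return the final loads."""
    loads = [0.0] * num_workers
    for costs in instances:
        k = choose_worker(loads, costs, alpha)
        loads[k] += costs[k]
    return loads

# Four identical instances: cheap on worker 0, twice as expensive on worker 1.
instances = [{0: 1.0, 1: 2.0}] * 4
print(distribute(instances, 2, alpha=0))  # every instance goes to worker 0
print(distribute(instances, 2, alpha=1))  # load is spread over both workers
```

In this toy run, α = 0 piles all work on the cheapest worker, while α = 1 accepts a more expensive placement to keep the maximum load lower.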

Comparison among various approaches
[Figure panels: Random, Roulette]

Partial subgraph instance reduction - I
Pattern graph automorphism breaking
o Use DFS to find the equivalent vertex groups
o Assign a partial order to each equivalent vertex group
Initial pattern vertex selection
o Introduce a cost model
o General pattern graph: enumerate all possible selections based on the cost model
o Cycle and clique: the vertex with the lowest rank is the best one
[Figure: Automorphism Breaking (< < <); Cost Model; Best Initial Pattern Vertex; Initial Pattern Vertex Selection based on cost model]
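
The effect of automorphism breaking can be seen on the triangle, where all three pattern vertices are equivalent. The sketch below is a brute-force illustration, not the DFS-based procedure from the slide: it imposes the order u < v < w on the data vertex ids assigned to the three equivalent pattern vertices. The function name `triangle_matches` and its flag are hypothetical.

```python
from itertools import permutations

def triangle_matches(adj, break_automorphism):
    """Return triangle matches (u, v, w) in an undirected graph.

    adj maps each vertex to the set of its neighbors.
    With break_automorphism=True, only matches with u < v < w are kept,
    so each triangle is reported exactly once.
    """
    matches = []
    for u, v, w in permutations(sorted(adj), 3):
        if break_automorphism and not (u < v < w):
            continue  # pruned by the order on equivalent vertices
        if v in adj[u] and w in adj[v] and u in adj[w]:
            matches.append((u, v, w))
    return matches

# Complete graph on four vertices: it contains 4 distinct triangles.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
print(len(triangle_matches(k4, break_automorphism=False)))
print(len(triangle_matches(k4, break_automorphism=True)))
```

Without the order, each triangle is found once per automorphism of the pattern (six times for a triangle); with it, the duplicates are never generated, which shrinks the intermediate results.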

Partial subgraph instance reduction - II
Data Graph | PG | Gpsi # w/ index | Gpsi # w/o index | Pruning Ratio
LiveJournal | PG1 (v1) | 2.86 x | x | %
LiveJournal | PG4 (v1) | 9.93 x 10^9 | OOM | unknown
UsPatent | PG5 (v1) | 2.26 x | x | %
UsPatent | PG5 (v3; v4) | 7.38 x | x | %
[Figure: pattern graphs PG1, PG4, PG5]

Evaluation - Comparing to MR solutions
o Afrati and SGIA-MR are the state-of-the-art MapReduce solutions.
o Ratios exceeding 100 times are not visualized.
[Figure labels: PSgL: 4302s, Afrati: 7291s]

Evaluation - Comparing to GraphLab
Data Graph | Pattern Graph | Afrati | PowerGraph | PSgL
Twitter | [figure] | 432min | 2min | 12.5min
Wikipedia | [figure] | 871s | 36s | 125s
WikiTalk | [figure] | 4402s | 48s | 318s
WikiTalk | [figure] | 13743s | 100s | 494s
WikiTalk | [figure] | 13743s | OOM* | 494s
WikiTalk | [figure] | 1785s | 127s | 38s
LiveJournal | [figure] | 2749s | OOM | 1330s
* using a different traversal order.

Conclusion
o Subgraph listing is a fundamental operation for massive graph analysis.
o We propose an efficient parallel subgraph listing framework, PSgL.
  o Various distribution strategies
  o Cost model
  o Light-weight global edge index
o The workload-aware distribution strategy can be extended to other balance problems.
o A new execution engine is required for larger pattern graphs.

Thanks!

Backup Expr. – Scalability of PSgL
[Figure: Performance vs. Worker Number]

Backup Expr. – Initial pattern vertex selection
[Figure: Influences of the Initial Pattern Vertex on Various Data Graphs; panels: LiveJournal, Random graph]