Towards Effective Partition Management for Large Graphs


Towards Effective Partition Management for Large Graphs Shengqi Yang, Xifeng Yan, Bo Zong and Arijit Khan Computer Science, UC Santa Barbara {sqyang, xyan, bzong, arijitkhan}@cs.ucsb.edu Project Homepage http://grafia.cs.ucsb.edu/sedge 11/12/2018

Motivation - Challenges New requirements for data processing: ubiquitous demands on graph data management (information networks, social networks, biological systems, communication networks) and efficient data processing in extremely large graphs. Google: 1 trillion indexed pages; Facebook: >500 million active users; de Bruijn: 500 million vertices. The data management community is facing more and more challenges from new requirements for unstructured data processing. Large-scale, highly interconnected networks pervade both our society and the information world around us. The graphs of interest are often massive, with millions, even billions of vertices. For example, the Facebook online social network includes more than 500 million users, and DNA short-read assembly can easily generate a de Bruijn graph with billions of vertices.

Motivation – Existing solutions Memory-resident solution: runs on a single server; difficult or impossible to accommodate the content of an extremely large graph; low concurrency. Simple distributed solution (e.g., Hadoop): runs on a commodity cluster; high concurrency and enough memory space; some successful applications; yet not ideal, since graph algorithms usually have poor locality and little work per vertex.

Pregel from Google (2010) Graph partitioning and distribution: Pregel divides a graph into partitions and distributes them to multiple machines for parallel processing of graph queries.

Is Pregel suitable? How Pregel works (BSP model): distribution model: graph partitioning; computational model: a user program runs on each partition (vertex); communication model: message passing. Why not Pregel? Limitations: messages are typically sent along the edge between two vertices, but a message may be sent between any two vertices if they know each other.
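To make the BSP model concrete, here is a minimal sketch (not Google's actual API) of a Pregel-style superstep loop: in each superstep every active vertex consumes its incoming messages, updates its value, and sends messages along its out-edges; computation stops when no messages remain. The max-value propagation task is an illustrative choice, not from the talk.

```python
def pregel_max(graph, values):
    """Illustrative BSP sketch: propagate each value to all reachable vertices.
    graph: vertex -> list of out-neighbors; values: vertex -> number."""
    # superstep 0: every vertex sends its value to its out-neighbors
    inbox = {v: [] for v in graph}
    for v in graph:
        for w in graph[v]:
            inbox[w].append(values[v])
    while any(inbox.values()):                  # run until no messages remain
        next_inbox = {v: [] for v in graph}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:  # value improved: stay active
                values[v] = max(msgs)
                for w in graph[v]:              # messages travel along out-edges
                    next_inbox[w].append(values[v])
        inbox = next_inbox                      # synchronization barrier
    return values
```

Note how all communication happens at superstep boundaries; this is the barrier-synchronized message passing that makes cross-partition messages the dominant cost.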

Pregel: Limitations Unbalanced workload; inter-machine communication. Pregel can serve queries that are uniformly distributed in the graph. However, it is not good at dealing with an unbalanced query workload: queries that are concentrated in one part of the graph.

Our Solution: Sedge SEDGE: a Self-Evolving Distributed Graph Processing Environment. It solves the problems facing Pregel, workload balancing (via replication) and communication reduction, through novel partitioning techniques and a two-level partition architecture that supports the newly generated partitions. We want to achieve two goals. On one hand, to minimize query response time, our solution has to eliminate inter-machine communication as much as possible. On the other hand, to maximize query throughput, our system needs to utilize each worker to the best of its capability.

Outline Partitioning techniques (complementary partitioning, on-demand partitioning, two-level partition management); system architecture; experiments; conclusions.

Techniques: Complementary partitioning Complementary partitioning: repartition the graph with a region constraint; the resulting sets of partitions run independently. Complementary partitioning repartitions a graph such that the original cross-partition regions become internal ones. In the new partition set, queries on an original cross-partition edge e will be served within the same partition. Therefore, the new partition set can handle graph queries that have trouble in the original partition set. If there is room to hold both S1 and S2 in the cluster, then for a query Q visiting the shaded area R in S1, the system shall route it to S2 to eliminate communication cost. Meanwhile, the new partition set can also share the workload with the original partition set. This complementary partitioning idea can be applied multiple times to generate a series of partition sets. We call each partition set a "primary partition set." Each primary partition set is self-complete, so a Pregel instance can run on it independently.
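The routing idea above can be sketched as follows. This is a hypothetical illustration (the partition maps S1/S2 and the function name are made up, not Sedge's actual interface): given several complementary partition sets, a query on an edge is sent to the first set in which both endpoints fall in the same partition, so it incurs no inter-machine communication.

```python
# Illustrative sketch of routing across complementary partition sets.
def route_query(partition_sets, u, v):
    """partition_sets: list of maps vertex -> partition id.
    Return the index of a set that serves edge (u, v) internally,
    or 0 (the original set) if every set would cross partitions."""
    for i, pmap in enumerate(partition_sets):
        if pmap[u] == pmap[v]:
            return i
    return 0

# Two complementary partitionings of vertices 0..3 (hypothetical example):
S1 = {0: 'A', 1: 'A', 2: 'B', 3: 'B'}   # cut edge (1, 2)
S2 = {0: 'C', 1: 'D', 2: 'D', 3: 'C'}   # edge (1, 2) is now internal
```

Here a query on edge (1, 2) crosses partitions in S1 but is internal in D under S2, so the router picks S2 and the communication cost vanishes.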

An Iterative Solution Iteratively repartition the graph. Pros: effective communication reduction. Cons: space limitation.

Cross-partition query profiling Blocks: coarse-granularity units that trace the path of cross-partition queries, e.g., C1, C2 and C3. Advantages: query generalization; a query can be profiled with fewer features.
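A minimal sketch of block-level profiling, assuming a precomputed vertex-to-block map (the data structures here are illustrative, not Sedge's internals): instead of recording every vertex a cross-partition query visits, the profiler records only the sequence of distinct blocks it passes through, so many similar queries collapse onto the same small feature set.

```python
def profile_query(block_of, visited_vertices):
    """block_of: vertex -> block id; visited_vertices: the query's path.
    Returns the block-level trace with consecutive repeats collapsed."""
    trace = []
    for v in visited_vertices:
        b = block_of[v]
        if not trace or trace[-1] != b:   # collapse consecutive hits
            trace.append(b)
    return trace
```

A five-vertex path that stays inside three blocks is thus summarized by just three features, which is what makes query generalization cheap.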

Techniques: On-demand partitioning To deal with hotspots that cross multiple partitions, we propose a dynamic partitioning technique. Envelope: a sequence of blocks that covers a cross-partition query. Envelope collection: put the maximum number of envelopes into a new partition w.r.t. the space constraint.
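Envelope collection can be sketched as a greedy packing under a space budget. This is a hedged sketch of the idea only (the block sizes, budget, and selection order are assumptions, not the paper's exact algorithm): shared blocks are counted once, so envelopes that overlap are cheap to add together.

```python
def collect_envelopes(envelopes, block_size, budget):
    """Greedy sketch: pack envelopes (sets of block ids) into one new
    partition, charging each block only once, until the budget is spent.
    Returns the indices of the packed envelopes."""
    packed, blocks, used = [], set(), 0
    # heuristic order: try small envelopes first
    order = sorted(range(len(envelopes)),
                   key=lambda i: sum(block_size[b] for b in envelopes[i]))
    for i in order:
        extra = sum(block_size[b] for b in envelopes[i] - blocks)  # new blocks only
        if used + extra <= budget:
            packed.append(i)
            blocks |= envelopes[i]
            used += extra
    return packed
```

Because the marginal cost of an envelope shrinks as it overlaps already-packed blocks, grouping similar envelopes first (the next slide's Min-Hash step) directly increases how many fit.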

A Similarity-Based Greedy Algorithm The algorithm combines similar envelopes that share many common color-blocks. 2-step greedy algorithm: 1) similarity search (nearest-neighbor search) via Locality Sensitive Hashing (LSH) with Min-Hash, in O(n); 2) envelope combining: cluster the envelopes in the same bucket produced by Min-Hash, then combine the clusters with the highest benefit. We use Min-Hash as a probabilistic clustering method that assigns a pair of envelopes to the same bucket with a probability proportional to the similarity between them. It is straightforward to consider these buckets as clusters and combine the envelopes within the same bucket together.
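The Min-Hash bucketing step can be illustrated in a few lines. This is a generic Min-Hash sketch under assumed parameters (four hash functions, CRC32-salted hashing), not the paper's exact implementation: two envelopes receive the same full signature with probability that grows with their Jaccard similarity, and one pass over all envelopes gives the O(n) bucketing.

```python
import zlib

def minhash_buckets(envelopes, num_hashes=4):
    """envelopes: list of sets of block ids (strings).
    Returns lists of envelope indices that share a Min-Hash signature."""
    buckets = {}
    for idx, env in enumerate(envelopes):
        # one min-hash value per salted hash function; identical sets always
        # collide, dissimilar sets almost never do
        sig = tuple(min(zlib.crc32(f"{s}:{b}".encode()) for b in env)
                    for s in range(num_hashes))
        buckets.setdefault(sig, []).append(idx)
    return list(buckets.values())
```

Each bucket then serves as a candidate cluster whose envelopes are combined in step 2.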

Two-level partition architecture Primary partitions, e.g., A, B, C and D: they are inter-connected through two-way links. Secondary partitions, e.g., B' and E: they are connected to primary partitions through one-way links.
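A rough sketch of how a query could be placed under the two levels (the lookup order and data structures are my illustrative assumptions, not the system's documented routing logic): secondary partitions, i.e., replicas like B' and on-demand envelope partitions like E, are consulted first so they absorb hot queries, with the owning primary partition as the fallback.

```python
def locate(primaries, secondaries, vertex):
    """primaries: name -> vertex set (a disjoint cover of the graph);
    secondaries: name -> vertex set (replica / envelope partitions).
    Returns the name of a partition that can serve the vertex."""
    for name, vs in secondaries.items():
        if vertex in vs:
            return name           # one-way link: secondary answers alone
    for name, vs in primaries.items():
        if vertex in vs:
            return name           # fallback: the owning primary partition
    raise KeyError(vertex)
```

The one-way link matters here: a secondary partition can serve its queries without being consulted by the primaries, so adding or removing secondaries never disturbs a running primary partition set.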

Sedge: System Architecture The offline part runs at the initial stage. It partitions the input data/graph in a distributed manner and places the partitions on the workers. It creates multiple partition sets so that each set runs a Pregel instance independently. During runtime, the online part collects statistical information from the workers and actively generates and removes partitions to accommodate the changing workload. Therefore, the set of online techniques built into Sedge must be very efficient. Our implementation is focused on partition management.

System Evaluation: RDF graphs SP2Bench employs the DBLP library as its simulation basis: 100M tuples (11.24GB). Query templates (5 of 12): Q2, Q4, Q6, Q7, Q8. E.g., Q4: given a journal, select all distinct pairs of article author names for authors that have published in the journal. Cluster environment: 31 computing nodes; one serves as the master and the rest as workers.

Effect of complementary partitioning Experiment setting: partition configurations CP1~CP5; workload: 10^4 random queries. Result: significant cross-partition query reduction; cross-partition queries vanish for Q2, Q4 and Q6. The figure shows the effect of the approach. Note that the Y-axis is plotted in log scale. The missing bars for CP4 and CP5 of Q2, CP5 of Q4 and CP5 of Q6 correspond to the value 0, i.e., the cross-partition queries vanish. The experiment demonstrates that by adding more complementary partition sets, the number of cross-partition queries can be dramatically reduced.

Evolving query workload Experiment setting: partition configurations CP1*5, CP5 and CP4+DP; workload: an evolving stream of 10^4 random queries. Result: CP1*5 vs. CP5 shows the effect of complementary partitioning; CP5 vs. CP4+DP shows the effect of on-demand partitioning.

System Evaluation: real graphs Datasets: Web graph (30M vertices, 956M edges); Twitter (15M users, 57M edges); Bio graph (50M vertices, 51M edges); synthetic graph (0.5 billion vertices, 2 billion edges). Query workload: neighbor search, random walk, random walk with restart.

Scalability Cross-partition queries vs. average response time. Experiment setting: partition configuration CP1+DP; workload: 10^4 random queries with 0%, 25%, 75% and 100% cross-partition queries. Result: effective for cross-partition queries; effective for hard-to-partition graphs, i.e., the Twitter and Bio graphs.

Conclusions Partitioning techniques: complementary partitioning, on-demand partitioning, and two-level partition management. Future work: efficient distributed RDF data storage; distributed query processing.

Thank You!