Distributed Graph Simulation: Impossibility and Possibility

Yinghui Wu (Washington State University), Wenfei Fan (University of Edinburgh), Xin Wang (Southwest Jiaotong University), Dong Deng (Tsinghua University)
Finding potential customers

"Find me Youtube users who like beer ads, connected with a community of those who like worldcup videos, soccer fans, and beer lovers."

Pattern node types over a distributed social network:
◦ Youtube users (YB): interest = "beer ads"
◦ Youtube users (YF): interest = "2014 FIFA worldcup"
◦ Sports (SP): interest = "soccer"
◦ Food (F): interest = "beer"

[Figure: the pattern and a distributed social network with matches yb1–yb3, yf1–yf3, sp1–sp3, f1–f4 spread over fragments]
Searching distributed graphs

Real-life graphs are distributed, either computationally or naturally:
◦ geo-distributed data centers
◦ decentralized social networks
◦ distributed knowledge bases: entity and personal information

Distributed graph querying:
◦ given a pattern Q and a graph G fragmented into F = (F1, …, Fn), with fragment Fi residing at site Si, compute the answer Q(G)
◦ applications: social analysis, multi-source knowledge management
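To make the setting concrete, here is a minimal Python sketch of one way to represent such a fragmentation; the class and field names (Fragment, cross_edges, and so on) are illustrative assumptions, not notation from the paper:

    # A hypothetical representation of a fragmentation F = (F1, ..., Fn),
    # where fragment Fi lives at site Si. Each fragment stores its local
    # adjacency lists plus the edges that cross into other fragments.
    from dataclasses import dataclass, field

    @dataclass
    class Fragment:
        site: str                                        # the site Si hosting this fragment
        label: dict = field(default_factory=dict)        # node -> label (e.g., "soccer")
        edges: dict = field(default_factory=dict)        # node -> list of local successors
        cross_edges: dict = field(default_factory=dict)  # node -> [(remote_site, remote_node)]

    # The requirement: Q(G) must equal the match set over G as a whole,
    # even though no single site ever sees more than its own fragment.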
Distributed Querying Methods

Graph exploration / message passing:
◦ master node and slave nodes (Trinity (Microsoft), Pregel (Google))
◦ predefined graph partition and query execution plan
◦ vertex-centric / local scheduling: GraphLab (CMU)

Ideally, we want a distributed algorithm whose
◦ response time decreases with more sites and is independent of the size of the entire data graph
◦ data shipment cost is decided by the query size and the number of sites only

[Figure: master/slave architecture — the master ships the query plan to the slave nodes (fragments) and collects intermediate results; in general the cost is unbounded]
Distributed graph simulation

Graph simulation:
◦ a graph G matches a pattern P if there exists a matching relation S
◦ for each pair (u, v) in S, v is a node match of u
◦ for each edge (u, u') in P, there exists an edge (v, v') in G such that (u', v') is in S

Distributed graph simulation:
◦ distributed data graph with in-nodes (local nodes reached by cross-fragment edges) and virtual nodes (local placeholders for neighbors stored at other sites)
◦ given a distributed data graph G and a query Q, find the match set Q(G) induced by S
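For reference, a compact sketch of (centralized) graph simulation as the standard fixpoint computation; the dict-based graph encoding below is an assumption for illustration, not the paper's code:

    def graph_simulation(pattern, p_label, graph, g_label):
        # sim maps each pattern node u to the set of data nodes that may
        # simulate it; start from all label-matching candidates.
        sim = {u: {v for v in graph if g_label[v] == p_label[u]}
               for u in pattern}
        changed = True
        while changed:                       # iterate to the greatest fixpoint
            changed = False
            for u, children in pattern.items():
                for v in list(sim[u]):
                    # v survives only if every pattern edge (u, u') can be
                    # matched by some graph edge (v, v') with (u', v') in S.
                    if any(not (set(graph[v]) & sim[u2]) for u2 in children):
                        sim[u].remove(v)
                        changed = True
        return sim                           # S = {(u, v) : v in sim[u]}

    # Example: a pattern edge YB -> SP against a two-node data graph.
    pattern = {'YB': ['SP'], 'SP': []}
    p_label = {'YB': 'beer ads', 'SP': 'soccer'}
    graph   = {'yb1': ['sp1'], 'sp1': []}
    g_label = {'yb1': 'beer ads', 'sp1': 'soccer'}
    print(graph_simulation(pattern, p_label, graph, g_label))
    # {'YB': {'yb1'}, 'SP': {'sp1'}}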
Undoable: Parallel Scalability

A distributed graph simulation algorithm A is parallel scalable in
◦ response time if its running time is bounded by a polynomial in |Q| and |Fm| (Fm is the largest fragment)
◦ data shipment if it ships at most a polynomial amount of data in |Q| and |F|

Impossibility theorem: There exists no algorithm for distributed graph simulation that is parallel scalable in either response time or data shipment, even for Boolean pattern queries.
◦ intuition of the proof: simulation lacks data locality
◦ holds for computational models where each site makes local decisions
◦ holds for vertex-centric processing systems (Pregel, GraphLab, etc.)
Doable: Partition Boundedness

A distributed graph simulation algorithm A is partition bounded in
◦ response time if its running time is bounded by a polynomial in |Q|, |Fm| (the largest fragment), and |Vf| (or |Ef|), the number of virtual nodes (edges)
◦ data shipment if it ships at most a polynomial amount of data in |Q| and |Ef| (or |Vf|)

Possibility theorem: Distributed graph simulation has an algorithm that is partition bounded in both response time and data shipment.

Positive results:
◦ runs in O(|Vf| |Vq| (|Vq| + |Vm|)(|Eq| + |Em|)) time
◦ ships at most O(|Ef| |Vq|) data
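Spelled out (a transcription of the slide's bounds, with V_q, E_q the query's nodes and edges, V_m, E_m those of the largest fragment F_m, and V_f, E_f the virtual nodes and edges):

    \text{response time: } O\bigl(|V_f|\,|V_q|\,(|V_q|+|V_m|)(|E_q|+|E_m|)\bigr),
    \qquad
    \text{data shipment: } O\bigl(|E_f|\,|V_q|\bigr).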
Distributed pattern matching: framework

A mixed strategy: partial evaluation + message passing
◦ local evaluation at each site to generate partial results
◦ asynchronous message passing to route partial results among fragments
Partition bounded algorithm

Step 1: partial evaluation at each fragment
◦ introduces Boolean variables to record whether a candidate is a match
◦ keeps track of unevaluated in-nodes and virtual nodes

Step 2: each site refines its partial answers upon receiving new messages (in parallel and asynchronously; sketched below)
◦ ships partial answers to other sites
◦ incremental update optimization

Step 3: the coordinator collects the partial answers and returns their union as Q(G)

[Figure: fragments f1, f2, f4 exchanging partial answers for yb1, sp1, sp3, yf1, yf2]
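A hedged Python sketch of Steps 1 and 2, reusing the hypothetical Fragment representation from above; the 'unknown' sentinel, the message format, and all helper names are assumptions, not the paper's code:

    def partial_eval(frag, pattern, p_label):
        # Step 1: local evaluation. Matches decidable inside the fragment
        # get True/False; matches that depend on a cross-fragment neighbor
        # become Boolean variables (the sentinel 'unknown').
        partial = {}
        for u in pattern:
            for v in frag.label:
                if frag.label[v] != p_label[u]:
                    partial[(u, v)] = False
                elif frag.cross_edges.get(v):
                    partial[(u, v)] = 'unknown'  # Boolean variable X[u, v]
                else:
                    partial[(u, v)] = True       # provisional; refined by a local fixpoint
        return partial

    def on_message(partial, msg):
        # Step 2: refine partial answers as messages arrive, asynchronously.
        # Only the variables that actually changed are shipped onward
        # (the incremental update optimization).
        changed = {}
        for (u, v), value in msg.items():
            if partial.get((u, v)) == 'unknown':
                partial[(u, v)] = value
                changed[(u, v)] = value
        return changed   # forward these to the sites still waiting on them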
Parallel scalable algorithms: DAG patterns

Step 1: partial evaluation at each fragment
Step 2: each site sends messages following the topological ranks of the query nodes (sketched below)
◦ waits until all Boolean variables for the nodes at the same rank have been collected
◦ sends messages in a single batch to reduce the number of messages
Step 3: the coordinator collects the partial answers and returns their union as Q(G)

[Figure: a DAG pattern over nodes YB1, YB2, YB3, YF, SP, F]
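A minimal sketch of the batching idea for DAG patterns: sinks of the query get rank 0, and each site resolves and ships all variables of one rank in a single message. Here resolve and send_batch stand in for the per-site match logic and the network layer (both assumed placeholders):

    def topological_ranks(pattern):
        # rank(u) = 0 for sinks; otherwise 1 + the maximum rank of u's children.
        rank = {}
        def visit(u):
            if u not in rank:
                rank[u] = 1 + max((visit(c) for c in pattern[u]), default=-1)
            return rank[u]
        for u in pattern:
            visit(u)
        return rank

    def send_by_rank(pattern, resolve, send_batch):
        rank = topological_ranks(pattern)
        for r in sorted(set(rank.values())):    # process ranks bottom-up
            batch = {u: resolve(u) for u in pattern if rank[u] == r}
            send_batch(r, batch)                # one message per rank, not per node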
A big picture

Partial evaluation:
◦ provable bounds on response time and network traffic
◦ but redundant local computation

Message passing:
◦ unbounded data shipment; provable bounds on response time are hard to obtain

Local evaluation can be optimized with carefully designed routing/scheduling.
Experimental evaluation

Datasets:
◦ real-life graphs: Yahoo (18 million nodes and edges), Citation (4.4 million nodes and edges)
◦ synthetic graphs

Algorithms:
◦ partition bounded algorithm dGPM
◦ parallel scalable algorithm dGPMd for DAG patterns
◦ the above algorithms without optimizations (incremental update)
◦ centralized graph simulation
◦ baseline: disHHK [S. Ma, WWW '12]
Efficiency of distributed graph simulation

[Figures: response time and data shipment of the algorithms]
Conclusion

Take away:
◦ it is impossible to find distributed simulation algorithms that are parallel scalable in response time or data shipment
◦ we provide algorithms that are partition bounded: response time and data shipment are not a function of the size of the entire data graph
◦ these algorithms scale well with big graphs

Future work:
◦ parallel scalability for other queries, e.g., subgraph isomorphism
◦ combining partial evaluation and message passing, and comparing with MapReduce and GraphLab
◦ combining distributed processing with optimizations: compression, view-based evaluation, and top-k query evaluation