Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform

Presentation transcript:

Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform. Jialiang Zhang and Jing Li, University of Wisconsin-Madison.

Large-scale graph traversal. Graph traversal is one of the key kernels for big-data applications: social media analysis, cyber security, IT infrastructure monitoring, and more. Large-scale graph traversal is challenging: lack of locality, low parallelism, and high synchronization cost. (Images from http://www.darpa.mil/attachments/HIVE_Proposers_Day_PM_Briefing.pdf)

Summary of this work. Workload analysis: (algorithm) gain a deeper understanding of the state-of-the-art algorithm; (dataset) large graphs typically have a non-uniform degree distribution. Optimization: (algorithm) associate vertex indices with access frequency, leveraging degree information for better locality; (hardware) efficient data placement policy and adjacency list compression. Evaluation: experiments on an FPGA-HMC platform.

Level-synchronized BFS. Level-synchronized BFS is the most widely used parallel graph traversal algorithm. Start from one source vertex. Construct a BFS tree by checking the neighbors of the source vertices. Unvisited neighbors become the source vertices of the next level. (Figure: level-synchronized BFS on an example graph with vertices 1-7, Levels 1-4.)
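The procedure above can be sketched in a few lines of sequential Python. This is only an illustration, not the authors' implementation: the function name level_synchronized_bfs and the 7-vertex example graph are assumptions made up for this sketch, and a real parallel version would process each level concurrently with a barrier between levels.

```python
def level_synchronized_bfs(adj, source):
    """Return the BFS levels as a list of vertex lists, starting at `source`."""
    visited = {source}
    current = [source]
    levels = []
    while current:
        levels.append(current)
        next_level = []
        for u in current:              # every source vertex of this level
            for v in adj[u]:           # check each neighbor
                if v not in visited:   # unvisited neighbors seed the next level
                    visited.add(v)
                    next_level.append(v)
        current = next_level           # implicit barrier between levels
    return levels


# Hypothetical undirected example graph (adjacency lists), not taken from the slides.
example = {1: [2, 3], 2: [1, 4, 5], 3: [1, 4, 5],
           4: [2, 3, 6], 5: [2, 3, 7], 6: [4], 7: [5]}
levels = level_synchronized_bfs(example, 1)
```

On this example graph the traversal finishes in four levels.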

Hybrid graph traversal. Hybrid graph traversal is the state-of-the-art level-synchronized BFS algorithm on CPU [1] for scale-free graphs. In addition to the conventional top-down method, it uses the bottom-up method. (Figure: example graph with vertices 1-7; legend: visited vertices in previous level, visited vertices in current level, unvisited vertices.)

Top-down method. The traditional implementation of level-synchronized BFS is the top-down method: 1. Scan the vertices in the current level. 2. Check the status of all their neighbors. 3. Add unvisited neighbors to the next level. Problems: 1. Redundant edge checks when the current level is large. 2. Conflicting updates, when several vertices in the current level reach the same unvisited neighbor. (Figure: top-down expansion from vertices 1, 2, 3 to vertices 4 and 5 on the example graph.)
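To make the two problems concrete, here is a minimal sketch of a single top-down step that also counts edge checks. The function name top_down_step, the counters, and the example graph are assumptions for illustration only. An edge check is counted as redundant when the neighbor is already visited or was already claimed by another vertex of the current level, which is the situation that causes conflicting updates in a parallel implementation.

```python
def top_down_step(adj, frontier, visited):
    """Expand one level top-down; return (next_frontier, edge_checks, redundant)."""
    next_frontier = set()
    edge_checks = 0
    redundant = 0
    for u in frontier:                 # 1. scan vertices in the current level
        for v in adj[u]:               # 2. check the status of all neighbors
            edge_checks += 1
            if v in visited or v in next_frontier:
                redundant += 1         # check that does not grow the BFS tree
            else:
                next_frontier.add(v)   # 3. add unvisited neighbor to next level
    visited |= next_frontier
    return next_frontier, edge_checks, redundant


# Hypothetical example graph, not taken from the slides.
example = {1: [2, 3], 2: [1, 4, 5], 3: [1, 4, 5],
           4: [2, 3, 6], 5: [2, 3, 7], 6: [4], 7: [5]}
visited = {1, 2, 3}
nxt, checks, wasted = top_down_step(example, [2, 3], visited)
```

Here four of the six edge checks are redundant: two look at the already-visited vertex 1, and two rediscover vertices 4 and 5.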

Bottom-up method. To address these problems, the bottom-up method [1] was proposed: 1. Scan the unvisited vertices. 2. Check whether at least one neighbor is in the current level. 3. If so, add the scanned vertex to the next level. Efficient when the current level is large: early termination reduces redundant checks, and there are no conflicts. Inefficient when the number of unvisited vertices is small. (Figure: bottom-up scan of unvisited vertices 4, 5, 6, 7 on the example graph.) [1] Scott Beamer et al. Direction-optimizing breadth-first search. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12).
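A single bottom-up step might look like the sketch below. The name bottom_up_step and the example graph are illustrative assumptions. Each unvisited vertex writes only its own state, so there are no conflicting updates, and the inner loop stops at the first neighbor found in the current level.

```python
def bottom_up_step(adj, frontier, visited):
    """Expand one level bottom-up; return (next_frontier, edge_checks)."""
    next_frontier = set()
    edge_checks = 0
    for v in sorted(adj):              # 1. scan the unvisited vertices
        if v in visited:
            continue
        for u in adj[v]:               # 2. look for a neighbor in the current level
            edge_checks += 1
            if u in frontier:
                next_frontier.add(v)   # 3. the scanned vertex joins the next level
                break                  # early termination: skip remaining neighbors
    visited |= next_frontier
    return next_frontier, edge_checks


# Hypothetical example graph, not taken from the slides.
example = {1: [2, 3], 2: [1, 4, 5], 3: [1, 4, 5],
           4: [2, 3, 6], 5: [2, 3, 7], 6: [4], 7: [5]}
visited = {1, 2, 3}
nxt, checks = bottom_up_step(example, {2, 3}, visited)
```

On this graph the step needs four edge checks, compared with six for a top-down expansion of the same level.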

Hybrid graph traversal. Hybrid graph traversal is the state-of-the-art level-synchronized BFS algorithm on CPU [1] for scale-free graphs. It switches between top-down and bottom-up based on heuristics: the size of the current level and the number of unvisited vertices. Its objective is to reduce redundant edge checks, that is, edge checks that do not expand the BFS tree: checking the status of already-visited vertices in the top-down method, and scanning all vertices to find unvisited ones in the bottom-up method.
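The switching idea can be sketched as follows. The slides only say that the heuristic uses the size of the current level and the number of unvisited vertices; the concrete rule used here (go bottom-up when alpha times the frontier size exceeds the number of unvisited vertices), the parameter alpha, and the name hybrid_bfs are assumptions for illustration, not the rule from [1] or from this work.

```python
def hybrid_bfs(adj, source, alpha=3):
    """Return ({vertex: level}, [direction used at each level])."""
    visited = {source}
    frontier = {source}
    level = {source: 0}
    directions = []
    depth = 0
    while frontier and len(visited) < len(adj):
        depth += 1
        unvisited = len(adj) - len(visited)
        # Assumed heuristic: a large frontier relative to the unvisited set
        # favors bottom-up; otherwise stay top-down.
        use_bottom_up = len(frontier) * alpha > unvisited
        directions.append("bottom-up" if use_bottom_up else "top-down")
        next_frontier = set()
        if use_bottom_up:
            for v in adj:
                if v not in visited and any(u in frontier for u in adj[v]):
                    next_frontier.add(v)
        else:
            for u in frontier:
                for v in adj[u]:
                    if v not in visited:
                        next_frontier.add(v)
        visited |= next_frontier
        for v in next_frontier:
            level[v] = depth
        frontier = next_frontier
    return level, directions


# Hypothetical example graph, not taken from the slides.
example = {1: [2, 3], 2: [1, 4, 5], 3: [1, 4, 5],
           4: [2, 3, 6], 5: [2, 3, 7], 6: [4], 7: [5]}
level, directions = hybrid_bfs(example, 1, alpha=3)
```

With alpha=3 the first level is expanded top-down and the remaining levels bottom-up; both directions produce the same BFS levels, only the number of edge checks differs.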

Degree distribution. Dataset: Kronecker graph, a widely used graph dataset for modeling social networks, used in the Graph500 challenge; 1 million vertices, 16 million edges. Only a few "hot vertices" contribute most of the edge connections: popular people in a social network, hub nodes in a data network, controller nodes in an infrastructure network. These hot vertices tend to have high access frequency. 256 vertices (0.02%) contribute 97% of the edge connections. (Figure: cumulative distribution of vertex degree.)
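A cumulative degree share of this kind can be computed as in the sketch below. The function name top_k_degree_share and the degree list are made up for illustration; the numbers are synthetic and are not the Kronecker graph statistics reported above.

```python
def top_k_degree_share(degrees, k):
    """Fraction of all edge endpoints that belong to the k highest-degree vertices."""
    ordered = sorted(degrees, reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# Synthetic, skewed degree list (assumption, for illustration only).
degrees = [50, 30, 10, 4, 2, 2, 1, 1]
share = top_k_degree_share(degrees, 2)
```

In this toy list the two highest-degree vertices account for 80 of the 100 edge endpoints.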

Degree-aware vertex index sorting. Sort the vertices by degree offline, assigning smaller indices to vertices with higher degree. The objective is to associate access frequency with the indices through degree information. (Figures: example graph before and after sorting; access patterns without and with sorting.)
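The offline relabeling can be sketched as below. The name relabel_by_degree, the tie-breaking by original index, the 0-based indices, and the example graph are assumptions for illustration; the slides do not specify these details.

```python
def relabel_by_degree(adj):
    """Give smaller indices to higher-degree vertices; return (new_adj, old_to_new)."""
    # Sort by descending degree; ties broken by the original index (assumption).
    order = sorted(adj, key=lambda v: (-len(adj[v]), v))
    new_id = {old: new for new, old in enumerate(order)}
    relabeled = {new_id[v]: [new_id[u] for u in adj[v]] for v in adj}
    return relabeled, new_id


# Hypothetical example graph: vertex 3 is the hub.
example = {0: [3], 1: [3, 4], 2: [3], 3: [0, 1, 2, 4], 4: [1, 3]}
relabeled, new_id = relabel_by_degree(example)
```

After relabeling, the hub (old index 3) becomes vertex 0, so the most frequently accessed vertices occupy the smallest indices.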

Degree-aware adjacency list reordering. We reorder each adjacency list so that neighbor scanning in the bottom-up method terminates earlier, since high-degree vertices tend to be visited earlier. The reordered lists can be compressed with a simple coding scheme. (Figure: vertex indices, pointers to adjacency lists, and adjacency lists before and after reordering.)
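One way to read the reordering, assuming the vertices have already been relabeled so that smaller indices mean higher degree, is to sort every adjacency list in ascending index order. The sketch below shows this reading and how it shortens a bottom-up neighbor scan. The function names and the example data are assumptions for illustration.

```python
def reorder_adjacency(adj):
    """Sort each adjacency list so that low-index (high-degree) neighbors come first."""
    return {v: sorted(neighbors) for v, neighbors in adj.items()}


def checks_until_hit(neighbors, frontier):
    """Number of neighbors a bottom-up scan inspects before it can stop."""
    for count, u in enumerate(neighbors, start=1):
        if u in frontier:
            return count
    return len(neighbors)


# Hypothetical adjacency list of vertex 5; vertex 0 is a hub in the current level.
adj = {5: [9, 7, 0, 3]}
frontier = {0}
before = checks_until_hit(adj[5], frontier)
after = checks_until_hit(reorder_adjacency(adj)[5], frontier)
```

In this example the scan stops after one check instead of three, because the hub now sits at the front of the list.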

Hardware architecture. Two bitmaps: (non-)visited vertices and vertices in the current level. Graph topology (off-chip, in HMC): vertices list and adjacency list. Top-down pipeline: vertices in the current level -> vertices list -> neighbor list -> visited vertices. Bottom-up pipeline: non-visited vertices -> vertices list -> neighbor list -> vertices in the current level.

Hardware optimizations. We focus on the bottom-up method, which contributes most of the runtime. Non-visited vertices bitmap (cheap): streaming read; each bit is accessed once per level. Vertices list and neighbor list (cheap): accessed once during BFS. Vertices in the current level (expensive): random access; multiple accesses per level.

Degree-aware data placement. We create three different memories to track the vertices in the current level: a 1-bit addressable register file, a 32-bit addressable BRAM, and a 256-bit addressable HMC. We allocate the memory based on the indices (access frequency). Most memory accesses are conflict-free, and most HMC accesses can be absorbed.
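The index-based allocation can be illustrated with a small address-mapping sketch. The tier capacities (reg_bits, bram_bits), the function name locate, and the returned tuple layout are all assumptions; the slides give only the access granularity of each memory, not the sizes or the exact mapping.

```python
def locate(index, reg_bits=256, bram_bits=1 << 20):
    """Map a degree-sorted vertex index to (memory tier, word address, bit offset)."""
    if index < reg_bits:
        # Hottest vertices: one flag per register, individually addressable.
        return ("register", index, 0)
    index -= reg_bits
    if index < bram_bits:
        # Warm vertices: flags packed into 32-bit BRAM words.
        return ("bram", index // 32, index % 32)
    index -= bram_bits
    # Cold vertices: flags packed into 256-bit HMC words.
    return ("hmc", index // 256, index % 256)


hot = locate(0)
warm = locate(256 + 33)
cold = locate(256 + (1 << 20) + 300)
```

Because the indices are sorted by degree, the small and fast tiers hold the frontier flags that are read most often, which is how most accesses can avoid the HMC.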

Degree-aware adjacency list compression. The major portion of the HMC traffic is reading the neighbor list, which is streamed and already sorted by access frequency. To reduce the HMC traffic, we propose using Exp-Golomb coding to compress the adjacency list. Exp-Golomb coding is a variable-length prefix code that assigns shorter codes to smaller indices. It has lower complexity than Huffman coding. (Figure labels: 1 bit, 9 bit.)
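For reference, a minimal order-0 Exp-Golomb encoder and decoder are sketched below, using bit strings for readability. The slides do not state which order of the code is used, so order 0 is an assumption, as are the function names and the sample adjacency list.

```python
def exp_golomb_encode(n):
    """Order-0 Exp-Golomb code of a non-negative integer, as a bit string."""
    binary = bin(n + 1)[2:]                      # binary form of n + 1
    return "0" * (len(binary) - 1) + binary      # zero prefix gives the length


def exp_golomb_decode_stream(bits):
    """Decode a concatenation of order-0 Exp-Golomb codes."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                    # count the leading zeros
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2) - 1)
        i += zeros + 1
    return values


# Hypothetical sorted adjacency list with small (high-degree) indices first.
neighbors = [0, 3, 7, 9]
stream = "".join(exp_golomb_encode(v) for v in neighbors)
decoded = exp_golomb_decode_stream(stream)
```

With this code, index 0 takes 1 bit and index 15 takes 9 bits, so the low indices that dominate the adjacency lists after degree sorting are the cheapest to store. Because it is a prefix code, the stream can be decoded without separators.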

Experimental setup. Platform: Pico Computing AC510. FPGA: Xilinx UltraScale KCU060. HMC bandwidth: 30 GB/s in each direction. HMC capacity: 4 GB. (Figure: board photo with FPGA and HMC labeled.)

Dataset. We evaluate on 5 scale-free graphs (small diameter), namely 2 real-world social networks and 3 Graph500 graphs, and on 2 grid-like graphs (large diameter).

Overall performance. Achieves a 229x average throughput gain compared to a top-down-only implementation. Achieves a 9x throughput gain compared to non-optimized hybrid BFS on FPGA. Achieves an energy efficiency of 1630 MTEPS/W, which is 2x higher than the #1 entry in the Green Graph 500.

Performance projection. Assuming we can use the full bandwidth of the HMC (240 GB/s), the projected design achieves a 9x average throughput gain compared to our GPU (GTX1060 w/ 240 GB/s GDDR5).

Conclusion. The degree distribution of scale-free graphs is a useful property for accelerating graph traversal. The proposed method associates indices with access frequency through vertex sorting for better locality. Leveraging the improved locality, the proposed method further co-optimizes the software and hardware of the hybrid graph traversal. The experimental results demonstrate the effectiveness of the proposed method.

Thanks! We especially thank Micron for the donation of the development tool and hardware.