Advanced Science and Technology Letters Vol.30 (ICCA 2013), pp. 117-121
http://dx.doi.org/10.14257/astl.2013.30.25

BPP: Large Graph Storage for Efficient Disk-Based Processing

Kamran Najeebullah, Kifayat Ullah Khan, Muhammad Waqas Nawaz, Young-Koo Lee
Department of Computer Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do, Korea
{kamran, Kifayat, waqas}@dke.khu.ac.kr, yklee@khu.ac.kr

Abstract. Processing very large graphs such as social networks and biological and chemical compounds is a challenging task. Distributed graph processing systems handle billion-scale graphs efficiently, but incur the overheads of partitioning and distributing the graph over a cluster of nodes, along with cluster management and fault tolerance. GraphChi was recently proposed to overcome these problems and significantly outperformed representative distributed processing frameworks. Still, we observe that GraphChi suffers serious performance degradation due to 1) a high number of non-sequential I/Os for processing every chunk of the graph, and 2) a lack of true parallelism in processing the graph. In this paper we propose a simple yet powerful engine, BiShard Parallel Processor (BPP), to efficiently process billion-scale graphs on a single PC. We extend the storage structure proposed by GraphChi and introduce a new processing model called BiShard Parallel (BP). BP enables full CPU parallelism and significantly reduces the number of non-sequential I/Os required to process every chunk of the graph. Our experiments on real large graphs show that our solution significantly outperforms GraphChi.

Keywords: Graph processing; Big data; Parallel processing; BiShard Parallel

1 Introduction

Graph processing has been a popular research area in the last decade, and much research has targeted the most common graph algorithms, such as shortest path, variations of clustering, and PageRank. Algorithms like connected components and minimum cut also have vital applications of their own. Graphs such as social networks and biological and chemical compounds are difficult to process because of their massive size, and with the growing size of graph datasets, processing them has become ever more challenging.

GraphChi [1] processes very large graphs on a single PC using an asynchronous computation model. It introduced a novel processing model, Parallel Sliding Windows (PSW). PSW performs computation in execution intervals. An execution interval consists of three steps: 1) load a subgraph of the input graph into memory; 2) process all the vertices of the subgraph in parallel, modifying the values associated with the vertices and their incident edges; and 3) write the updates back to disk. PSW passes information between vertices along out-edges. GraphChi significantly outperformed distributed processing systems on a per-node basis.

We observe that PSW inherits two serious bottlenecks. First, in every execution interval, PSW performs a number of non-sequential reads proportional to the number of subgraphs; for a very large graph, this number is significantly high in every interval. Second, for edges with both endpoints inside the same interval, PSW must avoid race conditions between the two endpoints. Because GraphChi maintains only one copy of each edge, both endpoints may access it at the same time. To avoid races, PSW marks such edges as critical and processes their endpoint vertices sequentially. If the graph is dense and all edges have both endpoints in memory, all vertices are processed sequentially, without any parallelism.

We propose a disk-based graph processing engine that follows the asynchronous model of computation and efficiently processes billion-scale graphs on a single PC. We introduce a new processing model, BiShard Parallel (BP), based on the vertex-centric approach [2]. BP divides the graph into several subgraphs and manages the in-edges and out-edges of every subgraph separately, which allows a subgraph to be loaded with only two reads. This storage structure keeps two copies of every edge (one in each direction), so every vertex has its own copy of its edges, which ensures fully parallel processing. Our contributions are twofold: 1) a new storage structure that significantly reduces the number of I/Os, and 2) a new processing model that ensures full CPU parallelism.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the core idea of the paper in detail. Section 4 discusses the experimental settings, results, and a comparison with the state of the art. Finally, Section 5 summarizes and concludes the paper.

2 Related Work

GraphChi extended the work of Bender et al. [3] and Chen et al. [4] and proposed a mechanism that stores and processes billion-scale graphs on a single consumer PC. It implements a technique for disk-based graph processing called Parallel Sliding Windows (PSW), which exploits sequential I/O and parallel computation using a vertex-centric approach. Their results show that GraphChi outperforms representative disk-based distributed systems on a per-node basis. However, the experimental results also reveal bottlenecks in parallel graph processing and in the number of disk reads per execution interval.

Recently, another disk-based graph processing framework, TurboGraph [5], was proposed. TurboGraph is also designed to process very large graphs on a consumer-level machine, in this case one equipped with a flash drive. It implements a technique called Pin-and-Slide that fully exploits the parallelism of the flash disk and the multi-core CPU. Their results show that TurboGraph outperforms GraphChi by an order of magnitude. However, their solution relies on specific properties of flash drives that are not available on rotational disks.
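Both PSW and our BP model express algorithms through a vertex-centric update function [1, 2]. To fix notation for the sketches in the rest of the paper, here is a minimal illustration of that abstraction; the function and the [src, dst, value] edge layout are our own illustrative choices, not any system's actual API.

```python
# Minimal sketch of the vertex-centric abstraction of Pregel [2] and
# GraphChi [1] that BP builds on. Names and layout are illustrative.
# An edge is a mutable list [src, dst, value] so that an update
# function can modify the value it carries.
from typing import List

Edge = List  # [src: int, dst: int, value: float]

def update(vid: int, in_edges: List[Edge], out_edges: List[Edge]) -> float:
    """One vertex-centric step: read values off the in-edges, compute a
    new vertex value, and write it onto the out-edges for the neighbors
    to read in a later step."""
    total = sum(e[2] for e in in_edges)
    share = total / len(out_edges) if out_edges else 0.0
    for e in out_edges:
        e[2] = share
    return total
```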

3 BiShard Parallel Processor

This section describes our proposed solution, BiShard Parallel Processor (BPP). BPP is an asynchronous disk-based framework for processing very large graphs on a single PC. We introduce a new processing model, BiShard Parallel (BP). BP extends the storage structure proposed by GraphChi to reduce non-sequential I/Os and to eliminate race conditions when vertices in memory access shared edges. Vertices are allowed to modify their associated values and the values associated with their out-edges. BP processes the graph one chunk at a time. Processing of every interval consists of three steps: 1) load a chunk of the graph into memory; 2) perform computation on the chunk and modify the associated values of the vertices and out-edges; and 3) write the updated values back to disk. (A sketch of a full interval appears at the end of this section.)

3.1 Loading the Subgraph

We divide the graph G into P intervals. Every interval p consists of a subset of the vertices V. We associate two shards, in-shard(p) and out-shard(p), with every interval p: in-shard(p) contains all the in-edges of the vertices in interval p, while out-shard(p) contains all their out-edges. In both shards, edges are stored in order of their source vertex. The size and number of intervals are chosen such that any interval p can be loaded into memory. To process the subgraph of an interval, we need to read all its vertices and their edges from disk: we load the in-edges from in-shard(p) and the out-edges from out-shard(p). Thus we perform only two non-sequential disk reads per interval, irrespective of the graph size and the total number of shards P.

3.2 Parallel Vertex Updates

Graph algorithms are executed on the graph by defining an update function. After all the vertices of an interval, along with their in- and out-edges, are loaded into memory, we run the update function on all the vertices in parallel. Since every vertex has its own copy of its edges, there are no race conditions on edge access, and we achieve true CPU parallelism.

3.3 Writing Back to Disk

To keep the asynchronous processing model intact, we must write the updated values back to disk, as the updates need to be available to subsequent processing steps. We distribute the updated out-edges and write them to the corresponding in-shards. Because updated edges occur in sequential chunks in every shard, we only need to keep track of the offset in each shard at which the updates must be written (see the routing sketch before Section 4).
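To make the three-step interval concrete, here is a minimal sketch of one BPP execution interval under our own assumptions about the on-disk layout: interval p is stored as two pickle files, in_shard_{p}.pkl and out_shard_{p}.pkl, each mapping a vertex id to an edge list in the [src, dst, value] layout above. The file names and serialization are illustrative, not the actual BPP format.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor

def load_shard(path):
    """One sequential read pulls a whole shard into memory."""
    with open(path, "rb") as f:
        return pickle.load(f)  # assumed format: {vid: [[src, dst, value], ...]}

def process_interval(p, vertex_ids, update):
    # Step 1: load the interval's subgraph -- exactly two non-sequential
    # seeks, one per shard file, regardless of the number of intervals P.
    in_shard = load_shard(f"in_shard_{p}.pkl")
    out_shard = load_shard(f"out_shard_{p}.pkl")

    # Step 2: run the update function on all vertices in parallel. Each
    # vertex owns private copies of its edges (an intra-interval edge
    # appears once in in_shard and once in out_shard), so no two updates
    # share an edge object and no locking is required.
    with ThreadPoolExecutor() as pool:
        list(pool.map(
            lambda vid: update(vid, in_shard.get(vid, []), out_shard.get(vid, [])),
            vertex_ids))

    # Step 3: persist the modified out-edges so that later intervals
    # observe them (asynchronous computation); routing the updates into
    # the in-shards is sketched separately below.
    with open(f"out_shard_{p}.pkl", "wb") as f:
        pickle.dump(out_shard, f)
```

A real engine would use native threads over a shared in-memory region rather than a Python thread pool; the point of the sketch is only that Step 2 needs no locks because no two updates touch the same edge object.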

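The write-back of Section 3.3 routes every updated out-edge into the in-shard of the interval that owns its target vertex. Below is a sketch of that routing under the same assumed layout as above; the in-place overwrite at a tracked byte offset that a real engine would perform is approximated here by a side file per shard.

```python
import pickle

def owner_interval(vid, interval_bounds):
    """Return the index of the interval whose half-open vertex range
    [lo, hi) contains vid."""
    for p, (lo, hi) in enumerate(interval_bounds):
        if lo <= vid < hi:
            return p
    raise ValueError(f"vertex {vid} falls outside every interval")

def write_back(updated_out_edges, interval_bounds):
    # Group the updated edges by the interval that owns their target
    # vertex: each group belongs in that interval's in-shard.
    buckets = {}
    for edge in updated_out_edges:            # edge = [src, dst, value]
        p = owner_interval(edge[1], interval_bounds)
        buckets.setdefault(p, []).append(edge)

    # Because shards are sorted by source vertex, one source interval's
    # edges form a single contiguous chunk in each in-shard; a real
    # engine would seek to the remembered offset of that chunk and
    # overwrite it in place. A side file per shard stands in for that.
    for p, edges in buckets.items():
        with open(f"in_shard_{p}.updates.pkl", "wb") as f:
            pickle.dump(edges, f)
```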
4 Experimental Evaluation

We now describe our experimental settings, the details of the experiments, and a comparison with the state of the art.

4.1 Experimental Setup

Experiments were performed on a commodity PC with a 3.3 GHz Intel Core i5-3550 CPU, 4 GB of main memory, and a 500 GB 7200 RPM disk drive, running Microsoft Windows 7 64-bit with default settings. We disabled file caching to obtain meaningful comparisons across small and large input files.

4.2 Experimental Results and Comparisons

We implemented the PageRank algorithm in the same way as GraphChi (a sketch of the update function follows below). We conducted our experiments on the wiki-Vote [6] dataset, which has more than seven thousand vertices and more than one hundred thousand edges. We varied the number of shards to evaluate the effect of different numbers of inter-interval edges and non-sequential I/Os. Our results showed that BPP significantly outperforms GraphChi with both small and large numbers of shards. We noticed that the performance margin is larger when the number of inter-interval edges is large, and that the margin shrinks as the number of shards increases.

Fig. 1. Comparison of the execution times of GraphChi and BPP while running the PageRank algorithm.
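The vertex-centric PageRank update we reimplemented can be sketched as follows, matching the update-function signature and the assumed [src, dst, value] edge layout of the earlier sketches; the damping constant and structure follow the standard PageRank formulation rather than BPP's actual source.

```python
DAMPING = 0.85  # the customary PageRank damping factor

def pagerank_update(vid, in_edges, out_edges):
    """Vertex-centric PageRank step in the style GraphChi describes [1]:
    in-edges carry the neighbors' rank contributions, and the new rank
    is split evenly across the out-edges for the neighbors to read."""
    rank = (1.0 - DAMPING) + DAMPING * sum(e[2] for e in in_edges)
    share = rank / len(out_edges) if out_edges else 0.0
    for e in out_edges:
        e[2] = share
    return rank
```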

5 Conclusion

GraphChi is a single-PC, disk-based graph processing engine. It avoids the problems observed in distributed processing systems but suffers from serious performance issues of its own. In this work we extended the storage structure proposed by GraphChi and proposed a new processing model called BiShard Parallel (BP). We showed by theoretical analysis that our solution reduces the number of non-sequential seeks and I/Os incurred by GraphChi to almost half, which also makes it a better choice for processing graphs on SSDs. We also eliminated the race conditions between vertices accessing a common edge, which prevented GraphChi from processing the graph vertices fully in parallel.

Acknowledgements. This research was supported by the MSIP (Ministry of Science, ICT & Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2013-H0301-13-4006) supervised by the NIPA (National IT Industry Promotion Agency).

References

1. Kyrola, A., Blelloch, G., Guestrin, C.: GraphChi: Large-scale graph computation on just a PC. In: OSDI, pp. 31-46 (2012)
2. Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: SIGMOD, ACM (2010)
3. Bender, M., Brodal, G., Fagerberg, R., Jacob, R., Vicari, E.: Optimal sparse matrix dense vector multiplication in the I/O-model. Theory of Computing Systems 47(4), pp. 934-962 (2010)
4. Chen, Y., Gan, Q., Suel, T.: I/O-efficient techniques for computing PageRank. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM), pp. 549-557, McLean, USA (2002)
5. Han, W.-S., Lee, S., Park, K., Lee, J.-H., Kim, M.-S., Kim, J., Yu, H.: TurboGraph: A fast parallel graph engine handling billion-scale graphs in a single PC. In: Proceedings of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ACM (2013)
6. Leskovec, J., Huttenlocher, D., Kleinberg, J.: Predicting positive and negative links in online social networks. In: WWW (2010)