FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs Scribed by Vinh Ha.

Slides:



Advertisements
Similar presentations
January 11, Csci 2111: Data and File Structures Week1, Lecture 1 Introduction to the Design and Specification of File Structures.
Advertisements

Chapter 11: File System Implementation
Authors: David G. Andersen et al. Offense: Chang Seok Bae Yi Yang.
MSSG: A Framework for Massive-Scale Semantic Graphs Timothy D. R. Hartley, Umit Catalyurek, Fusun Ozguner, Andy Yoo, Scott Kohn, Keith Henderson Dept.
Using Secondary Storage Effectively In most studies of algorithms, one assumes the "RAM model“: –The data is in main memory, –Access to any item of data.
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Introduction to Database Systems 1 The Storage Hierarchy and Magnetic Disks Storage Technology: Topic 1.
MODULE 9: SCALING THE ENVIRONMENT. Agenda CP storage in a production environment – Understanding IO by Tier Designing for multiple CPs Storage sizing.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
Database Services for Physics at CERN with Oracle 10g RAC HEPiX - April 4th 2006, Rome Luca Canali, CERN.
Computers in the real world Objectives Understand what is meant by memory Difference between RAM and ROM Look at how memory affects the performance of.
X-Stream: Edge-Centric Graph Processing using Streaming Partitions
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
Hadoop Hardware Infrastructure considerations ©2013 OpalSoft Big Data.
Graph reordering/partitioning with redundancy. Motivation 1. distributed graph processing – Use redundancy to reduce the costly communication – Reordering.
Implementing Parallel Graph Algorithms: Graph coloring Created by: Avdeev Alex, Blakey Paul.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
Unit 2 - Hardware Data Storage. How is data stored? ● Data is stored as a combination of 0’s and 1’s (bits). We will learn more about this later! ● Bits.
12/18/20151 Operating Systems Design (CS 423) Elsa L Gunter 2112 SC, UIUC Based on slides by Roy Campbell, Sam.
Grid Technology CERN IT Department CH-1211 Geneva 23 Switzerland t DBCF GT Upcoming Features and Roadmap Ricardo Rocha ( on behalf of the.
Storage Systems CSE 598d, Spring 2007 OS Support for DB Management DB File System April 3, 2007 Mark Johnson.
Parallel IO for Cluster Computing Tran, Van Hoai.
MSBIC Hadoop Series Implementing MapReduce Jobs Bryan Smith
Presenter: Yue Zhu, Linghan Zhang A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: a Case Study by PowerPoint.
29/04/2008ALICE-FAIR Computing Meeting1 Resulting Figures of Performance Tests on I/O Intensive ALICE Analysis Jobs.
Lecture 17 Raid. Device Protocol Variants Status checks: polling vs. interrupts Data: PIO vs. DMA Control: special instructions vs. memory-mapped I/O.
Input and Output Optimization in Linux for Appropriate Resource Allocation and Management James Avery King.
About ProLion CEO, Robert Graf Headquarter in Austria
COS 518: Advanced Computer Systems Lecture 8 Michael Freedman
GridOS: Operating System Services for Grid Architectures
Memory Management (2).
About Hadoop Hadoop was one of the first popular open source big data technologies. It is a scalable fault-tolerant system for processing large datasets.
SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data - Aditi Thuse.
Sathya Ronak Alisha Zach Devin Josh
Random access memory Sequential circuits all depend upon the presence of memory. A flip-flop can store one bit of information. A register can store a single.
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
Disks and RAID.
External Sorting Chapter 13
Parallel Programming By J. H. Wang May 2, 2017.
Random access memory Sequential circuits all depend upon the presence of memory. A flip-flop can store one bit of information. A register can store a single.
File System Structure How do I organize a disk into a file system?
Operating Systems (CS 340 D)
Database Management Systems (CS 564)
Understanding System Characteristics of Online Erasure Coding on Scalable, Distributed and Large-Scale SSD Array Systems Sungjoon Koh, Jie Zhang, Miryeong.
Oracle Storage Performance Studies
Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind
Chapter 15 QUERY EXECUTION.
MPSearch Multi-Path Search for Tree-based Indexes to Exploit Internal Parallelism of Flash SSDs.
Repairing Write Performance on Flash Devices
ASM-based storage to scale out the Database Services for Physics
湖南大学-信息科学与工程学院-计算机与科学系
Linchuan Chen, Peng Jiang and Gagan Agrawal
-A File System for Lots of Tiny Files
ICOM 6005 – Database Management Systems Design
COS 518: Advanced Computer Systems Lecture 8 Michael Freedman
Parallel I/O System for Massively Parallel Processors
Putting things in order
External Sorting Chapter 13
University of Wisconsin-Madison
Random access memory Sequential circuits all depend upon the presence of memory. A flip-flop can store one bit of information. A register can store a single.
Indexing 4/11/2019.
제 7 장 Cosequential Processing and the Sorting of Large Files
About ProLion CEO, Robert Graf Headquarter in Austria
B-Trees.
B-Trees.
COS 518: Advanced Computer Systems Lecture 9 Michael Freedman
External Sorting Chapter 13
Virtual Memory 1 1.
Finding the maximum matching of a bipartite graph
Presentation transcript:

FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs Scribed by Vinh Ha

Motivation Solution Graph analysis has lots of random reads and writes Analysing large graph requires big clusters traditionally so that the aggregate memory exceeds the graph size. Graph processing has novel applications everywhere. Solution Build up a system that tries to combine SSDs and RAM. Implement a graph processing engine on top of SSDs file system designed for high parallelism and high IOPS. They tried to store the vertices’ state in memory and edge lists in SSD storage. General and flexible programming interface focusing on vertices.

Pros: The idea of storing vertices and edge lists in different places is good at least for applications which don’t write to the edge lists. The system can store billions of nodes and process them without utilizing too much memory. The performance of the system is comparable to most in-memory graph processing system, for example, Galois, a prevalent graph processing engine. It can conservatively merge the I/O requests to increase I/O throughput. Sequential I/O is important since SSDs performs way better in sequential I/O rather than random I/O Flash Graph’s API seems easy to use and understand. It minimizes the amount of data to be stored in memory.

Cons The author has this assumption that SSDs has become a commodity. But the author doesn’t really show extra resources regarding the prices of SSDs compared to other type of memory like RAM or regular storage. Another assumption that author has is that modification of edge doesn’t happen in graph processing application. SSDs wearing out will still be a problem. Flash graph’s vertex scheduler is constrained by the I/O ordering. What if the graph is really dense -> Many edges.

Discussion Questions The performance of this system in dense graph where there are lots of edges in the storage? The author mentioned that SSDs has become a commodity. Has it really become a commodity or in the future? Should we stick to the typical disk storage if we need high write IO combining with the RAM? Since the SSDs has a higher performance in sequential IO, is is the best for graph processing given that the author has stated that graph processing has high random IOs?

Thanks!