FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs Scribed by Vinh Ha
Motivation Solution Graph analysis has lots of random reads and writes Analysing large graph requires big clusters traditionally so that the aggregate memory exceeds the graph size. Graph processing has novel applications everywhere. Solution Build up a system that tries to combine SSDs and RAM. Implement a graph processing engine on top of SSDs file system designed for high parallelism and high IOPS. They tried to store the vertices’ state in memory and edge lists in SSD storage. General and flexible programming interface focusing on vertices.
Pros: The idea of storing vertices and edge lists in different places is good at least for applications which don’t write to the edge lists. The system can store billions of nodes and process them without utilizing too much memory. The performance of the system is comparable to most in-memory graph processing system, for example, Galois, a prevalent graph processing engine. It can conservatively merge the I/O requests to increase I/O throughput. Sequential I/O is important since SSDs performs way better in sequential I/O rather than random I/O Flash Graph’s API seems easy to use and understand. It minimizes the amount of data to be stored in memory.
Cons The author has this assumption that SSDs has become a commodity. But the author doesn’t really show extra resources regarding the prices of SSDs compared to other type of memory like RAM or regular storage. Another assumption that author has is that modification of edge doesn’t happen in graph processing application. SSDs wearing out will still be a problem. Flash graph’s vertex scheduler is constrained by the I/O ordering. What if the graph is really dense -> Many edges.
Discussion Questions The performance of this system in dense graph where there are lots of edges in the storage? The author mentioned that SSDs has become a commodity. Has it really become a commodity or in the future? Should we stick to the typical disk storage if we need high write IO combining with the RAM? Since the SSDs has a higher performance in sequential IO, is is the best for graph processing given that the author has stated that graph processing has high random IOs?
Thanks!