Introduction to Large-Scale Graph Computation


1 Introduction to Large-Scale Graph Computation
GraphLab and GraphChi. Aapo Kyrola, Feb 27, 2013

2 Acknowledgments Many slides (the pretty ones) are from Joey Gonzalez's lecture (2012). Many people were involved in the research: Haijie Gu, Danny Bickson, Arthur Gretton, Yucheng Low, Joey Gonzalez, Carlos Guestrin, Alex Smola, Joe Hellerstein, David O'Hallaron, Guy Blelloch.

3 Contents Introduction to big graphs; properties of real-world graphs;
why MapReduce is not good for big graphs -> specialized systems; the vertex-centric programming model; GraphLab (distributed computation); GraphChi (disk-based).

4 Basic vocabulary Graph (network); vertex (node);
edge (link), in-edge, out-edge; sparse graph / matrix. Terms: if e is an edge from A to B, then e is an out-edge of A and an in-edge of B.

5 Introduction to Big Graphs

6 What is a "Big" Graph? The definition changes rapidly:
GraphLab paper 2009: biggest graph 200M edges. GraphLab & GraphChi papers 2012: biggest graph 6.7B edges. Twitter: many times bigger. It also depends on the computation: matrix factorization (collaborative filtering) or belief propagation is much more expensive than PageRank.

7 What is a "Big" Graph? Big graphs are always extremely sparse.
Biggest graphs available to researchers: Altavista: 6.7B edges, 1.4B vertices; Twitter 2010: 1.5B edges, 68M vertices; Common Crawl (2012): 5 billion web pages. But the industry has even bigger ones: Facebook (Oct 2012): 144B friendships, 1B users; Twitter (2011): 15B follower-edges. When reading about graph processing systems, be critical of the problem sizes: are they really big? Shun and Blelloch (2013, PPoPP) use a single machine (256 GB RAM) for in-memory computation on the same graphs as the GraphLab/GraphChi papers.

8 Examples of Big Graphs Twitter – what kind of graphs? The follow graph,
the engagement graph, the list-members graph, the topic-authority graph (consumers -> producers). The follow graph is the obvious graph in Twitter, but we can extract many other graphs as well. For example, the engagement graph may carry more information than the follow graph: following is easy, but engaging requires more action and will be seen by your followers (you do not want to spam by retweeting everything).

9 Example of Big Graphs Facebook: extended social graph.
The FB friend graph: how does it differ from Twitter's graph? This graph is huge. Slide from Facebook Engineering's presentation.

10 Other Big Networks WWW, academic citations, Internet traffic, phone calls.

11 What can we compute from social networks / web graphs?
Influence ranking: PageRank, TunkRank, SALSA, HITS. Analysis: triangle counting (clustering coefficient), community detection, information propagation, graph radii, ... Recommendations: who-to-follow, who-to-follow for topic T, similarities. Search enhancements: Facebook's Graph Search. But actually: what to compute is a hard question by itself!

12 Sparse Matrices How do we represent sparse matrices as graphs?
User x item/product matrices: explicit feedback (ratings) or implicit feedback (seen or not seen); typically very sparse. [Table: example user x movie ratings matrix with columns Argo, Plan 9 From the Outer Space, Titanic, ..., The Hobbit and one sparse row of ratings per user.]
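As a rough illustration of the mapping above, here is a minimal C++ sketch (toy data; this is not the GraphLab/GraphChi API): each observed rating becomes one edge of a bipartite user-item graph, so only the non-zero entries of the matrix are stored.
// Minimal sketch: a sparse user-item rating matrix stored as a bipartite
// edge list, i.e. one graph edge per observed rating. Toy data only.
#include <cstdio>
#include <vector>
struct Rating { int user; int item; float value; };   // one non-zero entry = one edge
int main() {
    // Hypothetical data: users 0..2, items 0..3, only observed ratings stored.
    std::vector<Rating> ratings = {
        {0, 1, 3.0f}, {0, 3, 5.0f}, {1, 0, 4.0f}, {2, 2, 1.0f}
    };
    int num_users = 3;
    // Adjacency (out-edges of user vertices): users point to the items they rated.
    std::vector<std::vector<std::pair<int, float>>> user_edges(num_users);
    for (const Rating& r : ratings)
        user_edges[r.user].push_back({r.item, r.value});
    for (int u = 0; u < num_users; u++) {
        std::printf("user %d:", u);
        for (auto& e : user_edges[u]) std::printf(" item %d (%.1f)", e.first, e.second);
        std::printf("\n");
    }
    return 0;
}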

13 Product – Item bipartite graph
[Diagram: a bipartite graph connecting users to the movies they rated (Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita), with ratings such as 2-5 on the edges. Slide adapted from Joey Gonzalez.]

14 What can we compute from user-item graphs?
Collaborative filtering (recommendations): recommend products that users with similar tastes have recommended. Methods include similarity / distance metrics, matrix factorization, and random-walk based methods; lots of algorithms are available. See Danny Bickson's CF toolkit for GraphChi.

15 Probabilistic Graphical Models
Each vertex represents a random variable; edges between vertices represent dependencies, modelled with conditional probabilities (Bayes networks, Markov random fields, conditional random fields). Goal: given evidence (observed variables), compute the likelihood of the unobserved variables. Exact inference is generally intractable, so we need to use approximations.

16 [Diagram: a graphical model connecting Shopper 1 and Shopper 2 through shared interests such as Cooking and Cameras.]
Here we have two shoppers. We would like to recommend things for them to buy based on their interests. However, we may not have enough information to make informed recommendations by examining their individual histories in isolation. We can use the rich probabilistic structure to improve the recommendations for individual people.

17 Image Denoising [Figure panels: Synthetic Noisy Image, Graphical Model, Few Updates.]
Adapted from Joey's slides.

18 Still more examples
CompBio: protein-protein interaction networks, activator/deactivator gene networks, DNA assembly graphs. Text modelling: word-document graphs. Knowledge bases: the NELL project at CMU. Planar graphs: road networks. Implicit graphs: k-NN graphs. I hope you now get the idea that there are lots of interesting graphs out there. Some of them are bigger than others, but the type of computation also varies; thus the definition of "big" is not well-defined.

19 Resources Stanford SNAP datasets; ClueWeb (CMU); Univ. of Milan's repository.

20 Properties of Real-World Graphs
Twitter network visualization by Akshay Java, 2009.

21 Natural Graphs [Image: partial map of the Internet based on data from January 15, 2005; from Wikimedia Commons.]

22 Natural Graphs Grids and other planar graphs are "easy":
it is easy to find separators. The fundamental properties of natural graphs make them computationally challenging.

23 Power-Law Degree of a vertex = number of adjacent edges (for directed graphs: in-degree and out-degree).

24 Power-Law = Scale-Free
Fraction of vertices having k neighbors: P(k) ~ k^(-alpha). Generative models: rich-get-richer (preferential attachment), the copy model, Kronecker graphs (Leskovec, Faloutsos, et al.). Other phenomena with power-law characteristics: wealth / income of individuals, sizes of cities.
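As a small illustration of the definition above, here is a toy C++ sketch (made-up edge list) that computes the empirical degree distribution P(k); on a real natural graph, plotting log P(k) against log k would give roughly a straight line with slope -alpha.
// Minimal sketch: empirical degree distribution P(k) = fraction of vertices
// with k neighbors, computed from an edge list. Toy data, not a real graph.
#include <cstdio>
#include <map>
#include <vector>
int main() {
    int n = 6;                                    // hypothetical vertex count
    std::vector<std::pair<int,int>> edges = {     // hypothetical edges
        {0,1},{0,2},{0,3},{0,4},{1,2},{2,3},{4,5}
    };
    std::vector<int> degree(n, 0);
    for (auto& e : edges) { degree[e.first]++; degree[e.second]++; }  // total degree
    std::map<int,int> count;                      // k -> number of vertices with degree k
    for (int d : degree) count[d]++;
    for (auto& kv : count)
        std::printf("P(%d) = %.2f\n", kv.first, (double)kv.second / n);
    return 0;
}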

25 Natural Graphs -> Power Law
Top 1% of vertices is adjacent to 53% of the edges! [Log-log plot of the Altavista web graph (1.4B vertices, 6.7B edges): "power law" with slope α ≈ 2.] Slide from Joey Gonzalez.

26 Properties of Natural Graphs
Great talk by M. Mahoney: "Extracting insight from large networks: implications of small-scale and large-scale structure". Small diameter: the expected distance between two nodes in Facebook was 4.74 (2011). Nice local structure, but no global structure. From Michael Mahoney's (Stanford) presentation.

27 Graph Compression Local structure helps compression:
Blelloch et al. (2003) compress a web graph to 3-4 bits / link; the WebGraph framework from Univ. of Milano achieves ~10 bits / edge on social graphs (2009). Basic idea: order the vertices so that topologically close vertices have IDs close to each other, then use difference encoding.
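A minimal C++ sketch of the difference-encoding idea (toy neighbor list; the actual WebGraph codes are more elaborate, with reference lists and interval codes): store the gaps between consecutive sorted neighbor IDs using a simple variable-length byte code.
// Minimal sketch of gap encoding for one sorted adjacency list with a varint
// byte code. Core idea only; not the WebGraph framework's actual format.
#include <cstdio>
#include <cstdint>
#include <vector>
static void write_varint(uint32_t x, std::vector<uint8_t>& out) {
    while (x >= 128) { out.push_back((uint8_t)(x & 127) | 128); x >>= 7; }
    out.push_back((uint8_t)x);
}
int main() {
    // Neighbor IDs of one vertex, sorted. If the vertex ordering keeps
    // topologically close vertices near each other, the gaps are small.
    std::vector<uint32_t> neighbors = {1000, 1002, 1003, 1010, 1024};
    std::vector<uint8_t> encoded;
    uint32_t prev = 0;
    for (uint32_t v : neighbors) {           // store gaps, not absolute IDs
        write_varint(v - prev, encoded);
        prev = v;
    }
    std::printf("%zu neighbors encoded in %zu bytes (vs. %zu bytes raw)\n",
                neighbors.size(), encoded.size(), neighbors.size() * sizeof(uint32_t));
    return 0;
}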

28 Computational Challenge
Natural graphs are very hard to partition: it is hard to distribute the computation to many nodes in a balanced way so that the number of edges crossing partitions is minimized. Why? Think about the star-shaped neighborhoods of high-degree vertices. Graph partitioning algorithms such as METIS and spectral clustering are not feasible on very large graphs. Vertex-cuts work better than edge-cuts (more on this later with GraphLab).

29 Large-Scale Graph Computation Systems
Why MapReduce is not enough.

30 Parallel Graph Computation
Distributed computation and/or multicore parallelism; the distinction is sometimes confusing. We will talk mostly about distributed computation. Are classic graph algorithms parallelizable? What about distributed? Depth-first search? Breadth-first search? Priority-queue based traversals (Dijkstra's and Prim's algorithms)? BFS is actually intrinsically parallel because it can be expressed as computation on frontiers.

31 MapReduce for Graphs Graph computation is almost always iterative.
MapReduce ends up shipping the whole graph over the network on each iteration (map -> reduce -> map -> reduce -> ...), because mappers and reducers are stateless.

32 Iterative Computation is Difficult
The system is not optimized for iteration: [diagram: on every iteration, the data is re-read from and re-written to disk across the CPUs, paying a startup penalty and a disk penalty each time].

33 MapReduce and Partitioning
Map-Reduce splits the keys randomly between mappers/reducers, but on natural graphs, high-degree vertices (keys) may have a million times more edges than the average: an extremely uneven distribution. The time of an iteration = the time of the slowest job.

34 Curse of the Slow Job
[Diagram: iterations separated by barriers; at each barrier, every CPU waits for the slowest one before the next iteration can start.]

35 Map-Reduce is Bulk-Synchronous Parallel
Bulk-Synchronous Parallel = BSP (Valiant, 1980s). Each iteration sees only the values of the previous iteration; in the linear-systems literature, these are Jacobi iterations. Pros: simple to program, maximum parallelism, simple fault tolerance. Cons: slower convergence, and iteration time = time taken by the slowest node.

36 Asynchronous Computation
An alternative to BSP; in linear systems, Gauss-Seidel iterations. When computing the value for item X, we can observe the most recently computed values of its neighbors. This is often relaxed: we see the most recent values available on a certain node. Consistency issues: we must prevent parallel threads from overwriting or corrupting values (race conditions).
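A toy C++ sketch (made-up 2x2 diagonally dominant system) contrasting the two styles: the Jacobi update reads only the previous iteration's values, while the Gauss-Seidel update reads the freshest values and typically converges in fewer iterations.
// Toy sketch: Jacobi (BSP-style) vs. Gauss-Seidel (asynchronous-style) on
// the system 4x - y = 3, -x + 4y = 3, whose solution is x = y = 1.
#include <cstdio>
#include <cmath>
int main() {
    double xj = 0, yj = 0;           // Jacobi state
    double xg = 0, yg = 0;           // Gauss-Seidel state
    for (int it = 1; it <= 10; it++) {
        // Jacobi: both updates use values from the previous iteration only.
        double xj_new = (3.0 + yj) / 4.0;
        double yj_new = (3.0 + xj) / 4.0;
        xj = xj_new; yj = yj_new;
        // Gauss-Seidel: the second update already sees the freshly updated x.
        xg = (3.0 + yg) / 4.0;
        yg = (3.0 + xg) / 4.0;
        std::printf("iter %2d  Jacobi err %.2e   Gauss-Seidel err %.2e\n",
                    it, std::fabs(xj - 1.0) + std::fabs(yj - 1.0),
                        std::fabs(xg - 1.0) + std::fabs(yg - 1.0));
    }
    return 0;
}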

37 MapReduce's (Hadoop's) poor performance on huge graphs has motivated the development of specialized graph-computation systems.

38 Specialized Graph Computation Systems (Distributed)
Common to all: graph partitions are resident in memory on the computation nodes, which avoids shipping the graph over and over. Pregel (Google, 2010): "think like a vertex", messaging model, BSP; open-source implementations include Giraph, Hama, and Stanford GPS. GraphLab (2010, 2012) [CMU]: asynchronous (also BSP); version 2.1 ("PowerGraph") uses vertex-partitioning -> extremely good performance on natural graphs. Plus others. But do you need a distributed framework at all?

39 Vertex-Centric Programming
"Think like a vertex"

40 Vertex-Centric Programming
"Think like a vertex" (Google, 2010). Historically, similar ideas appeared earlier in systolic computation, data-flow systems, the Connection Machine, and others. Basic idea: each vertex computes its own value individually [in parallel]; the program state = the vertex (and edge) values. In Pregel, vertices send messages to each other; in GraphLab/GraphChi, a vertex reads its neighbors' and edge values and modifies edge values (which can be used to simulate messaging). Iterative fixed-point computations are typical: iterate until the state does not change (much).

41 Computational Model (GraphLab and GraphChi)
Graph G = (V, E) with directed edges e = (source, destination); each edge and vertex is associated with a value (of a user-defined type); vertex and edge values can be modified (GraphChi also supports structure modification). Terms: an edge e from A to B is an out-edge of A and an in-edge of B. Let's now discuss the computational setting of this work, starting with the basic computational model.

42 Vertex Update Function
[Diagram: a vertex and its neighboring vertices and edges, each holding data.] MyFunc(vertex) { // modify neighborhood }

43 Parallel Computation Bulk-synchronous: all vertices update in parallel (note: this needs 2x memory – why?). Asynchronous: the basic idea is that if two vertices are not connected, they can be updated in parallel (stronger consistency also considers two-hop connections). GraphLab supports different consistency models, allowing the user to specify the level of "protection" (locking). Efficient locking is complicated in distributed computation (it is hidden from the user) – why?

44 Scheduling Often, some parts of the graph require more iterations to converge than others (remember the power-law structure); it is wasteful to update all vertices an equal number of times.

45 The Scheduler
The scheduler determines the order in which vertices are updated. [Diagram: a scheduler queue of vertices feeding updates to CPU 1 and CPU 2.] The process repeats until the scheduler is empty.

46 Types of Schedulers (GraphLab)
Round-robin; selective scheduling (skipping): round-robin, but jumping over un-scheduled vertices; FIFO; priority scheduling (approximations are used in distributed computation, where each node has its own priority queue; rarely used in practice – why?).

47 Example: PageRank Express PageRank in words in the vertex-centric model.
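A minimal in-memory C++ sketch of PageRank in the vertex-centric style (the tiny graph is made up, and this is not the actual GraphLab/GraphChi API): each vertex sums rank/out-degree over its in-neighbors and recomputes its own rank, in synchronous sweeps.
// Minimal vertex-centric PageRank sketch on a hypothetical 4-vertex graph.
#include <cstdio>
#include <vector>
int main() {
    int n = 4;
    // in_edges[v] = list of vertices that link to v (hypothetical graph).
    std::vector<std::vector<int>> in_edges = {{1, 2}, {2, 3}, {0}, {0, 2}};
    std::vector<int> out_degree(n, 0);
    for (int v = 0; v < n; v++)
        for (int src : in_edges[v]) out_degree[src]++;
    const double beta = 0.15;                        // teleport probability
    std::vector<double> rank(n, 1.0);
    for (int iter = 0; iter < 20; iter++) {          // BSP-style sweeps
        std::vector<double> next(n);
        for (int v = 0; v < n; v++) {                // the "vertex update function"
            double sum = 0;
            for (int src : in_edges[v]) sum += rank[src] / out_degree[src];
            next[v] = beta + (1.0 - beta) * sum;
        }
        rank = next;
    }
    for (int v = 0; v < n; v++) std::printf("R[%d] = %.3f\n", v, rank[v]);
    return 0;
}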

48 Example: Connected Components
[Diagram: seven vertices labeled 1, 2, 5, 3, 7, 4, 6.] First iteration: each vertex chooses label = its own ID.

49 Example: Connected Components
[Diagram: labels after one update: 1, 1, 5, 1, 5, 2, 6.] Update: my label = minimum of my own and my neighbors' labels.

50 Example: Connected Components
How many iterations are needed for convergence (in the synchronous model)? What about in the asynchronous model? [Diagram: converged labels 1, 1, 5, 1, 5, 1, 5.] Component ID = leader ID (the smallest ID in the component).
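A minimal C++ sketch of this label-propagation algorithm (made-up edge list, synchronous sweeps; not the actual framework API): every vertex starts with its own ID and repeatedly takes the minimum label in its neighborhood until nothing changes.
// Connected components by minimum-label propagation on a toy graph with
// two components: {0,1,2,3} and {4,5,6}.
#include <algorithm>
#include <cstdio>
#include <vector>
int main() {
    int n = 7;
    std::vector<std::pair<int,int>> edges = {{0,1},{1,2},{2,3},{4,5},{5,6}};
    std::vector<int> label(n);
    for (int v = 0; v < n; v++) label[v] = v;        // first iteration: label = own ID
    bool changed = true;
    int sweeps = 0;
    while (changed) {
        changed = false;
        std::vector<int> next = label;
        for (auto& e : edges) {                      // take the minimum over each edge
            int m = std::min(label[e.first], label[e.second]);
            if (m < next[e.first])  { next[e.first]  = m; changed = true; }
            if (m < next[e.second]) { next[e.second] = m; changed = true; }
        }
        label = next;
        sweeps++;
    }
    std::printf("finished after %d sweeps (the last sweep made no changes)\n", sweeps);
    for (int v = 0; v < n; v++) std::printf("vertex %d -> component %d\n", v, label[v]);
    return 0;
}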

51 Matrix Factorization Alternating Least Squares (ALS)
[Diagram: the Netflix ratings matrix (users x movies) approximated as the product of user factors U and movie factors M; iterate, alternately solving for the user vectors u_i and the movie vectors m_j.]
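A heavily simplified C++ sketch of the alternating idea, restricted to rank-1 factors so that each least-squares update has a closed form (real ALS uses rank-k factors, regularization, and a small linear solve per user/movie vertex; the ratings here are made up).
// Toy rank-1 alternating least squares on a tiny sparse ratings list.
#include <cstdio>
#include <vector>
struct Rating { int user; int item; double value; };
int main() {
    int num_users = 3, num_items = 3;
    std::vector<Rating> ratings = {            // hypothetical sparse ratings
        {0,0,5}, {0,1,3}, {1,1,4}, {1,2,1}, {2,0,4}, {2,2,2}
    };
    std::vector<double> u(num_users, 1.0), m(num_items, 1.0);
    for (int iter = 0; iter < 30; iter++) {
        // Fix m, solve for each u[i]: u[i] = sum(r*m) / sum(m*m) over i's ratings.
        for (int i = 0; i < num_users; i++) {
            double num = 0, den = 1e-9;
            for (auto& r : ratings) if (r.user == i) { num += r.value * m[r.item]; den += m[r.item] * m[r.item]; }
            u[i] = num / den;
        }
        // Fix u, solve for each m[j] symmetrically.
        for (int j = 0; j < num_items; j++) {
            double num = 0, den = 1e-9;
            for (auto& r : ratings) if (r.item == j) { num += r.value * u[r.user]; den += u[r.user] * u[r.user]; }
            m[j] = num / den;
        }
    }
    double err = 0;
    for (auto& r : ratings) { double d = r.value - u[r.user] * m[r.item]; err += d * d; }
    std::printf("squared training error after rank-1 ALS: %.3f\n", err);
    return 0;
}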

52 GraphLab (v2.1 = PowerGraph)

53 GraphLab 2 Open-source! http://graphlab.org
Joseph Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, Carlos Guestrin (2012). "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs." Proceedings of Operating Systems Design and Implementation (OSDI). The GraphLab project was started in 2009 by prof. Carlos Guestrin's team: multicore version at UAI 2009, distributed version at VLDB 2012, PowerGraph at OSDI 2012. The motivation came from machine learning: many ML problems can be represented naturally as graph problems. The scope has since expanded to network analysis etc. All GraphLab slides are from Joey Gonzalez.

54 Problem: High Degree Vertices Limit Parallelism
A high-degree vertex touches a large fraction of the graph (GraphLab 1); its edge information is too large for a single machine; it produces many messages (Pregel); and vertex updates are sequential. Asynchronous consistency requires heavy locking (GraphLab 1), while synchronous consistency is prone to stragglers (Pregel).

55 Distribute a single vertex-update: Vertex Partitioning
Move computation to the data and parallelize high-degree vertices. Vertex partitioning: a simple online approach that effectively partitions large power-law graphs.

56 Factorized Vertex Updates
Split the update into three phases: Gather (data-parallel over the edges: a parallel sum of per-edge contributions into an accumulator Δ), Apply (locally apply the accumulated Δ to the vertex), and Scatter (data-parallel over the edges: update the neighbors).

57 PageRank in GraphLab2 PageRankProgram(i): Gather(j -> i): return wji * R[j]; sum(a, b): return a + b; Apply(i, Σ): R[i] = β + (1 – β) * Σ; Scatter(i -> j): if R[i] changed then activate(j). However, in some cases this can seem rather inefficient.
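A sequential in-memory C++ sketch of the gather-apply-scatter program above (made-up graph; not the distributed PowerGraph API): scatter re-activates out-neighbors only when the vertex's rank changed by more than a tolerance, which also illustrates selective scheduling.
// Gather-apply-scatter PageRank with an activation queue, on a toy graph.
#include <cmath>
#include <cstdio>
#include <queue>
#include <vector>
int main() {
    int n = 4;
    // Hypothetical graph: out_edges[v] = destinations; weight w_ji = 1/out_degree(j).
    std::vector<std::vector<int>> out_edges = {{1, 3}, {2}, {0, 1, 3}, {1}};
    std::vector<std::vector<int>> in_edges(n);
    for (int v = 0; v < n; v++) for (int d : out_edges[v]) in_edges[d].push_back(v);
    const double beta = 0.15, tol = 1e-6;
    std::vector<double> R(n, 1.0);
    std::vector<bool> active(n, true);
    std::queue<int> schedule;
    for (int v = 0; v < n; v++) schedule.push(v);
    while (!schedule.empty()) {
        int i = schedule.front(); schedule.pop(); active[i] = false;
        double sum = 0;                                   // Gather + sum over in-edges
        for (int j : in_edges[i]) sum += R[j] / out_edges[j].size();
        double new_rank = beta + (1.0 - beta) * sum;      // Apply
        double change = std::fabs(new_rank - R[i]);
        R[i] = new_rank;
        if (change > tol)                                 // Scatter: re-activate out-neighbors
            for (int j : out_edges[i])
                if (!active[j]) { active[j] = true; schedule.push(j); }
    }
    for (int v = 0; v < n; v++) std::printf("R[%d] = %.4f\n", v, R[v]);
    return 0;
}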

58 Distributed Execution of a GraphLab2 Vertex-Program
[Diagram: a vertex replicated across machines 1-4; each machine gathers a partial sum Σ1..Σ4, the partial sums are combined, Apply updates the vertex value, and Scatter then runs on each machine.]

59 Minimizing Communication in GraphLab2
Communication is linear in the number of machines each vertex spans; a vertex-cut minimizes the number of machines each vertex spans. Percolation theory suggests that power-law graphs have good vertex cuts [Albert et al. 2000].

60 Constructing Vertex-Cuts
Goal: parallel graph partitioning on ingress. GraphLab 2 provides three simple approaches: random edge placement (edges are placed randomly by each machine; good theoretical guarantees), greedy edge placement with coordination (edges are placed using a shared objective; better theoretical guarantees), and oblivious-greedy edge placement (edges are placed using a local objective).

61 Random Vertex-Cuts Randomly assign edges to machines.
[Diagram: a balanced cut in which one vertex spans 3 machines, another spans 2 machines, and another spans only 1 machine.]

62 Greedy Vertex-Cuts Place edges on machines which already have the vertices in that edge.
[Diagram: edges such as (A,B), (B,C), (D,A), (E,B) placed across Machine 1 and Machine 2.] The greedy algorithm is no worse than random placement in expectation.
63 Greedy Vertex-Cuts Derandomization: minimize the expected number of machines spanned by each vertex. Coordinated: maintain a shared placement history (DHT); slower but higher quality. Oblivious: operate only on the local placement history; faster but lower quality. The greedy algorithm is no worse than random placement in expectation.
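A simplified C++ sketch of greedy edge placement (made-up edge list; the scoring and tie-breaking here are a simplification, not the exact PowerGraph heuristic): prefer a machine that already holds both endpoints, then one that holds either endpoint, then the least-loaded machine.
// Greedy edge placement for a vertex-cut on a toy graph.
#include <cstdio>
#include <set>
#include <vector>
int main() {
    int num_machines = 3, n = 6;
    std::vector<std::pair<int,int>> edges = {   // hypothetical edge list
        {0,1},{0,2},{0,3},{1,2},{2,3},{3,4},{0,4},{4,5}
    };
    // placed[v] = set of machines already holding a replica of vertex v.
    std::vector<std::set<int>> placed(n);
    std::vector<int> load(num_machines, 0);
    for (auto& e : edges) {
        auto& A = placed[e.first];
        auto& B = placed[e.second];
        int best = 0, best_score = -1;
        for (int m = 0; m < num_machines; m++) {
            // Score: 2 if both endpoints already on m, 1 if one, 0 otherwise.
            int score = (A.count(m) ? 1 : 0) + (B.count(m) ? 1 : 0);
            if (score > best_score || (score == best_score && load[m] < load[best])) {
                best = m; best_score = score;
            }
        }
        A.insert(best); B.insert(best); load[best]++;
    }
    // Replication factor = average number of machines spanned per vertex.
    double spans = 0; int used = 0;
    for (auto& s : placed) if (!s.empty()) { spans += s.size(); used++; }
    std::printf("average replication factor: %.2f\n", spans / used);
    for (int m = 0; m < num_machines; m++) std::printf("machine %d: %d edges\n", m, load[m]);
    return 0;
}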

64 Partitioning Performance
Twitter graph: 41M vertices, 1.4B edges. [Plots: partition cost and construction time for random, oblivious, and coordinated placement.] Oblivious placement balances partition quality and partitioning time.

65 Beyond Random Vertex Cuts!

66 Triangle Counting in Twitter Graph
40M users, 1.2B edges; total: 34.8 billion triangles. Hadoop: 1536 machines, 423 minutes [1]. GraphLab 2: 64 machines (1024 cores), 1.5 minutes. [1] S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW '11: Proceedings of the 20th International Conference on World Wide Web, 2011.

67 LDA Performance All English language Wikipedia
2.6M documents, 8.3M words, 500M tokens. State-of-the-art LDA sampler (Alex Smola, 100 machines): 150 million tokens per second. GraphLab sampler (64 cc2.8xlarge EC2 nodes): 100 million tokens per second, using only 200 lines of code and 4 human hours.

68 PageRank 40M webpages, 1.4 billion links (100 iterations): Hadoop 5.5 hrs [Kang et al. '11], Twister (in-memory MapReduce) 1 hr [Ekanayake et al. '10], GraphLab 8 min.
"Comparable numbers are hard to come by as everyone uses different datasets, but we try to equalize as much as possible, giving the competition an advantage when in doubt." Scaling to 100 iterations so costs are in dollars and not cents. Numbers: Hadoop ran on a Kronecker graph of 1.1B edges on 50 M45 machines; from the available numbers, each machine is approximately 8 cores and 6 GB RAM, roughly a c1.xlarge instance, putting the total cost at $33 per hour (U Kang, Brendan Meeder, Christos Faloutsos: "Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation"). Twister ran on the ClueWeb dataset of 50M pages and 1.4B edges, using 64 nodes of 4 cores and 16 GB RAM each, roughly m2.xlarge to m2.2xlarge instances, putting the total cost at $28-$56 per hour (Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox: "Twister: A Runtime for Iterative MapReduce," MAPREDUCE'10 at HPDC 2010). GraphLab ran on 64 cc1.4xlarge instances at $83.2 per hour.

69 GraphLab Toolkits GraphLab easily incorporates external toolkits
[Diagram: the GraphLab Version 2.1 API (C++) runs on Linux cluster services (Amazon AWS) over MPI/TCP-IP, PThreads, and Hadoop/HDFS, and hosts toolkits for graph analytics, graphical models, computer vision, clustering, topic modeling, and collaborative filtering.] GraphLab easily incorporates external toolkits and automatically detects and builds them.

70 GraphLab Future Directions
Making programming easier: gather-apply-scatter is difficult for some problems -> one often wants repeated gather-applies; higher-level operators such as edge- and vertex-map/reduces. Integration with graph storage. Development continues in prof. Carlos Guestrin's team at the Univ. of Washington (previously CMU).

71 GraphChi – large-scale graph computation on just a PC

72 Background A spin-off of the GraphLab project; also presented at OSDI '12.
GraphLab[rador]'s small friend, GraphChi[huahua]. C++ and Java/Scala versions are available at graphchi.org.

73 GraphChi: Going small with GraphLab
Solve huge problems on small or embedded devices? Key: Exploit non-volatile memory (starting with SSDs and HDs)

74 Could we compute Big Graphs on a single machine?
Disk-based graph computation. I just said that the size of the data is not really the problem; it is the computation. Maybe we can also do the computation on one disk? But let's first ask why we even want this: can't we just use the cloud? Credit card in, solutions out. Facebook's graph: 144B edges ~ 1 terabyte.

75 Distributed State is Hard to Program
Writing distributed applications remains cumbersome. Let me first take the point of view of the folks who develop algorithms on our systems. Unfortunately, it is still hard and cumbersome to write distributed algorithms. Even if you have a nice abstraction, you still need to understand what is happening. I think this motivation is quite clear to everyone in this room, so let me give just one example: debugging. Some people write bugs. Finding out what crashed a cluster can be really difficult: you need to analyze the logs of many nodes and understand all the middleware. Compare this to running the same big problems on your own machine: then you just use your favorite IDE and its debugger. It is a huge difference in productivity. [Figure: cluster crash vs. crash in your IDE.]

76 Efficient Scaling
Businesses need to compute hundreds of distinct tasks on the same graph; example: personalized recommendations. [Diagram: parallelize each task (one complex task across the whole cluster, expensive to scale) vs. parallelize across tasks (one simple task per machine, 2x machines = 2x throughput).] Another, perhaps a bit surprising, motivation comes from thinking about scalability at large scale. The industry wants to compute many tasks on the same graph: for example, to compute personalized recommendations, the same task is computed for people in different countries, different interest groups, etc. Currently you need a cluster just to compute one single task, and to compute tasks faster you grow the cluster. This work allows a different approach: since one machine can handle one big task, you can dedicate one machine per task. Why does this make sense? Clusters are complex and expensive to scale, while in this model the nodes do not talk to each other and you can double the throughput by doubling the machines. There are other motivations as well, such as reducing costs and energy, but let's move on.

77 Other Benefits
Costs: easier management, simpler hardware. Energy consumption: full utilization of a single computer. Embedded systems and mobile devices: a basic flash drive can fit a huge graph.

78 Research Goal Compute on graphs with billions of edges, in a reasonable time, on a single PC. Reasonable = close to numbers previously reported for distributed systems in the literature. Now we can state the goal of this research, the research problem we set for ourselves when we started this project. The goal has some vagueness in it, so let me briefly explain. By reasonable time I mean that if papers have reported numbers for other systems (presumably the authors were happy with the performance) and our system can do the same in the same ballpark, it is likely reasonable, given the lower costs here. As the consumer PC we used a Mac Mini: not the cheapest computer there is, especially with the SSD, but still quite a small package. We have since also run GraphChi on cheaper hardware, on a hard drive instead of an SSD, and can say it provides good performance on the lower end as well; but for this work we used this computer. One outcome of this research is a single-computer comparison point for computing on very large graphs: researchers who develop large-scale graph computation platforms can compare their performance to this and analyze the relative gain achieved by distributing the computation. Before, there has not really been an understanding of whether many proposed distributed frameworks are efficient or not. Experiment PC: Mac Mini (2012).

79 Random Access Problem
Consider a symmetrized adjacency file with values: for each vertex, a list of its in-neighbors and out-neighbors with associated values (e.g. vertex 5: in-neighbors 3: 2.3, 19: 1.3, 49: 0.65, ...; out-neighbors 781: 2.3, 881: 4.2, ...; vertex 19: in-neighbors 3: 1.4, 9: 12.1, ...; out-neighbors 5: 1.3, 28: 2.2, ...). When we update vertex 5 and change the value of its in-edge from vertex 19, vertex 19 holds the same edge in its out-edge list, so we must synchronize the value there: a random write. Alternatively, each vertex stores only its out-neighbors directly, and in-neighbors are stored as file-index pointers to their primary storage in the neighbor's out-edge list; then loading vertex 5 requires a random read to fetch the in-edge value. A random read is much better than a random write, but in our experiments, even on an SSD, it is way too slow; one additional reason is the overhead of a system call. Perhaps direct access to the SSD would help, but we came up with a simpler solution that works even on a rotational hard drive, so we abandoned this approach. For sufficient performance, millions of random accesses per second would be needed; even for an SSD, this is too much.

80 Possible Solutions
1. Use the SSD as a memory extension? [SSDAlloc, NSDI'11] Too many small objects; millions of accesses per second would be needed. 2. Compress the graph structure to fit into RAM? [WebGraph framework] The associated values do not compress well, and are mutated. 3. Cluster the graph and handle each cluster separately in RAM? Expensive; the number of inter-cluster edges is big. 4. Caching of hot nodes? Unpredictable performance. These are potential remedies to consider: compressing the graph and caching hot nodes work in many cases, for example for graph traversals where you walk the graph in a random manner, and there is other work on that. But for our computational model, where we actually need to modify the values of the edges, they are not sufficient given the constraints. Of course, if you have a terabyte of memory, you do not need any of this.

81 Our Solution: Parallel Sliding Windows (PSW)
We finally move to our main contribution. We call it Parallel Sliding Windows, for a reason that will become apparent soon. One reviewer commented that it should be called Parallel Tumbling Windows, but as I had already committed to this term, and replace-alls are dangerous, I stuck with it.

82 Parallel Sliding Windows: Phases
PSW processes the graph one sub-graph at a time; in one iteration, the whole graph is processed, and typically the next iteration is then started. The three phases are: 1. Load, 2. Compute, 3. Write. The basic approach is that PSW loads one sub-graph of the graph at a time, executes the update functions for it, and saves the modifications back to disk. We will show soon how the sub-graphs are defined, and how we do this with almost no random access. We usually use this for iterative computation: we process the whole graph in sequence to finish a full iteration, and then move to the next one.

83 PSW: Shards and Intervals
How are the sub-graphs defined? The vertices are numbered from 1 to n and split into P intervals, each associated with a shard on disk; a sub-graph = an interval of vertices. [Diagram: the vertex range 1..n split into interval(1), interval(2), ..., interval(P), with corresponding shard(1), shard(2), ..., shard(P).]

84 PSW: Layout
Shard: the in-edges for an interval of vertices, sorted by source ID (e.g. shard 1 holds the in-edges for vertices 1..100, sorted by source_id). [Diagram: four vertex intervals and shards 1-4.] Shards are small enough to fit in memory, and shard sizes are balanced.
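A simplified in-memory C++ illustration of the shard layout only (made-up edges; no actual disk I/O or GraphChi file format): edges go to the shard of their destination's interval, and each shard is sorted by source ID, so the out-edges of any interval form a contiguous block inside every other shard.
// Build P shards from an edge list: shard by destination interval, sort by source.
#include <algorithm>
#include <cstdio>
#include <vector>
struct Edge { int src; int dst; float value; };
int main() {
    int n = 8, P = 2;                        // 8 vertices, 2 intervals/shards
    int interval_size = n / P;               // interval 0: vertices 0..3, interval 1: 4..7
    std::vector<Edge> edges = {              // hypothetical edge list
        {0,5,1.f},{1,2,1.f},{4,1,1.f},{6,3,1.f},{2,7,1.f},{5,6,1.f},{3,4,1.f}
    };
    std::vector<std::vector<Edge>> shard(P);
    for (const Edge& e : edges)
        shard[e.dst / interval_size].push_back(e);      // shard of the destination interval
    for (auto& s : shard)                               // sort each shard by source ID
        std::sort(s.begin(), s.end(), [](const Edge& a, const Edge& b){ return a.src < b.src; });
    for (int p = 0; p < P; p++) {
        std::printf("shard %d (in-edges of vertices %d..%d):\n",
                    p, p * interval_size, (p + 1) * interval_size - 1);
        for (const Edge& e : shard[p]) std::printf("  %d -> %d\n", e.src, e.dst);
    }
    return 0;
}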

85 PSW: Loading Sub-graph
Load the sub-graph for one interval of vertices: load all of its in-edges (its own shard) into memory. What about the out-edges? They are arranged in sequence in the other shards.

86 PSW: Loading Sub-graph
Load the sub-graph for the interval of vertices: all in-edges are loaded into memory, and the out-edge blocks are in memory as well (one contiguous block per other shard).

87 PSW Load-Phase
Only P large reads for each interval; P^2 reads on one full pass.

88 PSW: Execute Updates The update function is executed on the interval's vertices; the edges have pointers to the loaded data blocks, and changes take effect immediately -> asynchronous. When we have the sub-graph in memory, i.e. for all vertices in the sub-graph we have all their in- and out-edges, we can execute the update functions. An important thing to understand: as we loaded the edges from disk, these large blocks are stored in memory, and when we create the graph objects (vertices and edges), each edge object has a pointer into the block loaded from disk. So if two vertices share an edge, they immediately observe a change made by the other one, since their edge pointers point to the same address. Deterministic scheduling prevents races between neighboring vertices.

89 PSW: Commit to Disk In the write phase, the blocks are written back to disk; the next load phase sees the preceding writes -> asynchronous. In total: P^2 reads and writes per full pass over the graph -> performs well on both SSD and hard drive.

90 Is GraphChi Fast Enough?
Comparisons to existing systems.

91 Experiment Setting Mac Mini (Apple Inc.): 8 GB RAM,
256 GB SSD, 1 TB hard drive, Intel Core i5, 2.5 GHz. Experiment graphs (the same graphs are typically used for benchmarking distributed graph processing systems):
Graph         Vertices   Edges   P (shards)   Preprocessing
live-journal  4.8M       69M     3            0.5 min
netflix       0.5M       99M     20           1 min
twitter-2010  42M        1.5B    -            2 min
uk            106M       3.7B    40           31 min
uk-union      133M       5.4B    50           33 min
yahoo-web     1.4B       6.6B    -            37 min

92 Comparison to Existing Systems
See the paper for more comparisons. [Charts: PageRank, WebGraph Belief Propagation (U Kang et al.), Matrix Factorization (Alternating Least Squares), Triangle Counting.] On a Mac Mini, GraphChi can solve problems as big as existing large-scale systems, with comparable performance. Unfortunately the literature is abundant with PageRank experiments but not much more; PageRank is really not that interesting, and quite simple solutions work, but nevertheless we get some idea. Pegasus is a Hadoop-based graph mining system that has been used to implement a wide range of algorithms; the best comparable result we got was for the machine-learning algorithm belief propagation, where a Mac Mini can roughly match a 100-node Pegasus cluster. This also highlights the inefficiency of MapReduce; that said, the Hadoop ecosystem is pretty solid, and people choose it for its simplicity. Matrix factorization has been one of the core GraphLab applications, and here our performance is pretty good compared to GraphLab running on a slightly older 8-core server. Last, triangle counting, a heavy-duty social network analysis algorithm: a paper a couple of years ago introduced a Hadoop algorithm for counting triangles, and this comparison is a bit stunning. I remind you that these results are prior to PowerGraph; at OSDI the map changed totally. However, we are confident in saying that GraphChi is fast enough for many purposes, and indeed it can solve problems as big as the other systems have been shown to execute; it is limited by disk space. Notes: the comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk; GraphChi computes asynchronously, while all but GraphLab compute synchronously.

93 PowerGraph Comparison
PowerGraph / GraphLab 2 (OSDI '12) outperforms previous systems by a wide margin on natural graphs. With 64 more machines and 512 more CPUs: PageRank 40x faster than GraphChi, triangle counting 30x faster than GraphChi. PowerGraph really resets the speed comparisons. However, the point about ease of use remains, and GraphChi likely provides sufficient performance for most people; if you need peak performance and have the resources, PowerGraph is the answer. GraphChi still has a role as a development platform for PowerGraph. GraphChi has state-of-the-art performance per CPU.

94 Evaluation: Evolving Graphs
Streaming graph experiment.

95 Streaming Graph Experiment
On the Mac Mini: we streamed edges in random order from the twitter-2010 graph (1.5B edges), with a maximum rate of 100K or 200K edges/sec (a very high rate), while simultaneously running PageRank. Data layout: edges were streamed from the hard drive, shards were stored on the SSD. Hard to evaluate: there are no comparison points.

96 Ingest Rate When the graph grows, shard recreations become more expensive.

97 Streaming: Computational Throughput
Throughput varies strongly due to shard rewrites and asymmetric computation.

98 Summary Introduced large graphs and the challenges they pose;
discussed why specialized graph-computation platforms are needed; the vertex-centric computation model; GraphLab; GraphChi.

99 Thank You! Follow me on Twitter: @kyrpov http://graphchi.org

