1
Chronos: A Graph Engine for Temporal Graph Analysis
Wentao Han1,3, Youshan Miao2,3, Kaiwei Li1,3, Ming Wu3, Fan Yang3, Lidong Zhou3, Vijayan Prabhakaran3, Wenguang Chen1, Enhong Chen2 Tsinghua University1 University of Science and Technology of China2 Microsoft Research3
Good morning everyone, I am Youshan. Glad to be here introducing our work, Chronos. It is a graph engine for temporal graph analysis.
**************************
OTHER POINTS
LABS fully uses the cache line (Grace may lose some, as a neighbor cannot fill the cache line).
"LABS is a scheduling method for all algorithms, not one single improved algorithm."
Reason why LP > SP, and the cross-partition edge ratio number.
Q: Partition count? Load balancing? Snapshot-parallelism seems not to scale well; why? (L3 cache sharing.) Define inter-core communication. How does today's integrated memory controller work? Relationship graph vs. activity graph? What if there is not enough memory to hold the whole graph?
Q (past): Is incremental computation worth sharing in detail? Not really; it is hard to explain clearly in the talk.
2
Temporal Graphs Real-world graphs evolve – temporal graphs
Temporal graph properties bring more insights
(Figure: a social graph evolving over the years 2012, 2013, 2014)
TRANS: Let's first… KEY: temporal graph brings more insight
Let's first look at temporal graphs. If you take a look at graphs in the real world, you'll notice that many of them evolve over time. Consider a social graph: new connections are built every day. The properties of such temporal graphs can bring more insights than those of static ones. For example, here we have three users' ranking values from a social graph over the years. If we only look at the values of the year 2014, we will conclude that they are the same.
**************************
3
Temporal Graphs Real-world graphs evolve – temporal graphs
Temporal graph properties bring more insights
(Figure: the same social graph with per-user ranking trends over 2012-2014)
But with the data of the past two years, such temporal ranks can tell us their differences, e.g. they have different trends.
**************************
Temporal ranks can tell their differences
4
Temporal Graph Analysis
Computing properties on a series of graph snapshots
(Figure: graph snapshots taken at t0, t1, t2 across 2012-2014; static graph analysis on each snapshot yields graph properties)
TRANS: To extract such temporal graph properties… KEY: temporal graph analysis: on a series of snapshots
To extract such temporal graph properties, we look at multiple points of time. For each point of time, we take a snapshot that represents the state of the graph at that time. Then we perform static graph analysis on each graph snapshot to extract properties at that time. By combining all the results, we form a temporal graph property. So, temporal graph analysis is about computing properties on a series of graph snapshots.
**************************
5
Temporal Graph Analysis
Existing graph engines: targeting static graph analysis. A possible solution: computing snapshot by snapshot
(Figure: the 2012-2014 snapshots assigned to Task 1, Task 2, Task 3)
TRANS: How to execute temporal graph analysis? KEY: snapshot-by-snapshot seems to work
As there are a lot of existing graph engines targeting static graph analysis, one straightforward way to perform temporal graph analysis is to use an existing graph engine and compute snapshot by snapshot. Here we treat the static graph analysis on each snapshot as a single task, and take the tasks one by one.
**************************
6
Performance Issues
Cost_Total = Cost_snap × Num_snap
TRANS: perf is issue KEY: to optimize
If we take this route, performance could be an issue. The cost of temporal graph analysis will be much more than that of static graph analysis, as it equals the cost per snapshot multiplied by the number of snapshots. In order to improve the performance of temporal graph analysis, which involves multiple snapshots,
7
Revisit: Static Graph Analysis
Propagation based graph computation model Vertex Data Array Data Propagation v1 v3 v2 v5 Local computation Edge Array scan
TRANS: first revisit the graph computation on one snapshot KEY: how propagation works
we will start the investigation from the analysis on one snapshot. Here we use a propagation-based computation model to explain how static graph analysis works on one snapshot and where the performance issue is. We use a vertex data array and an edge array to express the graph. During the analysis, we perform local computation on each vertex, then scan the edge array to get the edges connected to that vertex, and perform data propagation through the edges, just as the green arrow shows.
**************************
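To make the model concrete, here is a minimal C++ sketch of the push-style propagation loop described above; the names (Edge, vertexData) and the 0.5 update rule are illustrative assumptions, not Chronos code.

```cpp
#include <cstdint>
#include <vector>

// Illustrative push-style propagation over one snapshot.
struct Edge { uint32_t src, dst; };

void propagateOnce(std::vector<double>& vertexData,       // one value per vertex
                   const std::vector<Edge>& edgeArray) {  // scanned sequentially
    for (const Edge& e : edgeArray) {
        // The edge array scan is sequential, but vertexData[e.dst] is a
        // random access: if the destination is not cached, this is where
        // the expensive cache miss happens.
        vertexData[e.dst] += 0.5 * vertexData[e.src];     // placeholder update rule
    }
}
```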
8
Revisit: Static Graph Analysis
Propagation based graph computation model Vertex Data Array Cache Miss Data Propagation v1 v3 v2 v5 Local computation Edge Array scan
KEY: cache miss could happen, common in graph computation
Sometimes, the vertex data at the propagation destination may not be in the CPU cache. This leads to an expensive cache miss, which is very common in graph computation: because a graph is a highly complicated and irregular structure, propagating through the edges becomes a random access.
**************************
9
Revisit: Static Graph Analysis
In parallel: Partition graph & computations among CPU cores Vertex Data Array Inter-core Communication Cross-partition edge v1 v3 v2 v5 Core 0 Core 1 Edge Array scan
TRANS: besides cache misses, other things affect performance as well KEY: propagation through an inter-partition edge causes expensive inter-core communication
In parallel settings, besides cache misses, other factors impact performance as well. To analyze one graph snapshot in parallel, we partition the graph and computation among CPU cores. This leaves some edges with their endpoints in different partitions, such as the edge v1 to v3. Propagating through such an edge causes an expensive inter-core communication.
**************************
10
Temporal Graph Analysis: Snapshot by Snapshot
Computation on multiple graph snapshots means multiplied cost Vertex Data Arrays Snapshot 1 N snapshots N cache misses N inter-core comm. Snapshot 2 Snapshot 3
TRANS: as there are many expensive cache misses etc. on one snapshot KEY: multiplied cache misses etc. if snapshot by snapshot
As temporal graph analysis involves multiple snapshots, the numbers of cache misses and inter-core communications are also multiplied, leading to multiplied cost. So we look for a better way to reduce those numbers and thus improve the performance.
**************************
11
Observations Real-world graphs often evolve gradually (similar snapshots)
(Figure: three similar snapshots of a graph on vertices v1-v5, with small per-snapshot changes marked by primes)
TRANS: looking for a better way to improve KEY: evolving gradually, similar snapshots
We observe that real-world graphs often evolve gradually. Though there are new edges added and old edges deleted from time to time, a great part of the graph stays quite stable, making the snapshots we take from the graph quite similar to each other.
**************************
12
Observations Similar propagations across snapshots
(Figure: snapshots 1-3 with similar propagations from vertex 1 marked in the same color)
TRANS: when performing the same algorithm KEY: similar propagations
When performing the same algorithm on similar snapshots, it is very likely that similar propagations happen across those snapshots. Here we use the same color to indicate the similar propagations from vertex 1. Look at the propagations in red: their sources stand for the same vertex in different snapshots, and so do their targets.
**************************
13
Group propagations by source & target, not by snapshot
Idea Group propagations by source & target, not by snapshot
(Figure: propagations re-ordered from per-snapshot steps into per-edge batches, Step 1 through Step 4 covering propagations 1->2, 1->3, 1->4, 1->5)
TRANS: so the idea is KEY: scattered -> scheduled together -> batch -> amortize cost
So the idea is to group the propagations by their source & target, instead of by snapshot. In the snapshot-by-snapshot method, the propagations are organized by snapshot; similar propagations are scattered across different snapshots, so they are executed separately. Here we break the boundaries between snapshots and re-order the propagations according to their source & target, so that similar propagations can be scheduled together. During execution, we perform multiple similar propagations in a batch, so we can amortize costs such as the cost of a cache miss and the cost of an inter-core communication.
**************************
14
Chronos: Data Layout Vertex Data Arrays (snapshot-by-snapshot) Snapshot 1 Place together data for the same vertex across multiple snapshots Snapshot 2 Snapshot 3 Vertex Data Array (Chronos)
TRANS: with this idea, we designed Chronos KEYS: use a different data layout -> group across snapshots -> time locality
With this idea, we designed Chronos. Chronos uses a specially designed data layout: instead of placing the vertex data snapshot by snapshot, Chronos places together the data for the same vertex across multiple snapshots. Here vertex 1's data v1, v1' and v1'', respectively from snapshots 1, 2 and 3, are co-located. The data for vertex 1 across multiple snapshots can be accessed linearly, and easily fits in a cache line. We call such locality time-locality.
**************************
(with time-locality) Snapshots 1, 2, 3 fit in a cache line
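As a hedged sketch of the layout contrast (the indexing functions are assumptions for exposition, not the actual Chronos structures):

```cpp
#include <cstddef>

// Snapshot-by-snapshot layout: all vertices of snapshot s are contiguous.
inline std::size_t idxBySnapshot(std::size_t v, std::size_t s, std::size_t numVertices) {
    return s * numVertices + v;
}

// Chronos layout: all snapshots of vertex v are contiguous, so v1, v1',
// v1'' sit next to each other and typically share a cache line.
inline std::size_t idxByVertex(std::size_t v, std::size_t s, std::size_t numSnapshots) {
    return v * numSnapshots + s;
}
```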
15
Chronos: Propagation Scheduling
Locality Aware Batch Scheduling (LABS): Batching propagations across snapshots vertex 1 -> vertex 2 across snapshots vertex 1 -> vertex 3 across snapshots Edge Array scan Vertex Data Array
TRANS: to maximize the benefit of locality, scheduling matches the data layout KEY: propagations grouped across snapshots
To maximize the time-dimension locality, Chronos uses a special propagation scheduling that matches the data layout, called Locality Aware Batch Scheduling, or LABS for short. Its principle is to batch the propagations across multiple snapshots. During the graph computation, we schedule the propagations through the same edge across multiple snapshots. Here we first propagate from vertex 1 to vertex 2 across multiple snapshots, and then perform the propagation from vertex 1 to vertex 3 across multiple snapshots.
**************************
LABS will propagate through the same edge but for different snapshots … in a batch. (fits in a cache line)
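A minimal sketch of the LABS schedule under the vertex-major layout above (names and the update rule are illustrative assumptions): the snapshot loop becomes the innermost one, so all N propagations through an edge run back to back over contiguous data.

```cpp
#include <cstdint>
#include <vector>

struct Edge { uint32_t src, dst; };

// LABS sketch: batch the propagations through each edge across all N
// batched snapshots before moving on to the next edge.
void labsPropagate(std::vector<double>& vertexData,      // vertex-major layout: v * N + s
                   const std::vector<Edge>& edgeArray,
                   std::size_t N) {                      // batch size (number of snapshots)
    for (const Edge& e : edgeArray) {
        const double* src = &vertexData[e.src * N];
        double* dst = &vertexData[e.dst * N];
        for (std::size_t s = 0; s < N; ++s)              // N propagations, contiguous data
            dst[s] += 0.5 * src[s];                      // placeholder update rule
    }
}
```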
16
Chronos: Propagation Scheduling
Locality Aware Batch Scheduling (LABS): Batching propagations across snapshots N propagations, 1 cache miss Edge Array scan Vertex Data Array
TRANS: cache misses still happen KEY: reduced: 1 cache miss for N propagations
When propagating to a vertex's data for the first time, a cache miss still happens. Here we get a cache miss when propagating from v1 to v2. However, after the cache miss, vertex 2's data for the other snapshots, including v2' and v2'', is loaded into the cache as well, because it is in the same cache line. When performing the following propagations from v1' to v2' and v1'' to v2'', we get cache hits instead of cache misses, because the data is already in cache. So here we pay the cost of one cache miss and finish multiple propagations.
**************************
Cache Hit fit in a cache line
17
Chronos: Propagation Scheduling
Locality Aware Batch Scheduling (LABS): Batching propagations across snapshots N propagations, 1 inter-core comm. Edge Array scan Vertex Data Array
TRANS: inter-core comm. still happens KEY: reduced: 1 inter-core comm. for N propagations
In parallel settings, when propagating through a cross-partition edge, inter-core communication still happens. But during one communication, we can access the vertex's data for multiple snapshots in a batch. So here one inter-core communication finishes multiple cross-partition propagations.
**************************
Inter-core Communication access in a batch
18
LABS: The Key of Chronos
A graph layout: place together vertex/edge data across snapshots. A scheduling mechanism: batch propagations across snapshots. Efficient: reduced cache misses / inter-core communication.
That's LABS, the key design of the Chronos engine. It has a graph layout that places together vertex/edge data across snapshots, and a scheduling mechanism that batches propagations across snapshots. It improves the performance of temporal graph analysis by reducing the number of cache misses and the number of inter-core communications.
**************************
19
Experimental Evaluation
Large temporal graphs. Various graph algorithms: PageRank, weakly-connected components (WCC), single-source shortest path (SSSP), maximal independent set (MIS), sparse matrix-vector multiplication (SpMV).

Graph    # of Vertices   # of Edge Events   Time Span   Source
Wiki     1.9 M           40.0 M             6 years     Wikipedia graph from KONECT
Twitter  7.5 M           61.6 M             3 months    Provided by Twitter
Weibo    27.7 M          4.9 B              3 years     Crawled from Sina Weibo
Web      133.6 M         7.2 B              12 months   Web graph from DELIS

Settings: CPU 2.4GHz 16-core, RAM 128GB, disk 1TB SSD.
TRANS: to verify effectiveness KEY: various real graphs, various algorithms
To verify the effectiveness of Chronos, we ran a series of experiments. We used 4 real-world temporal graphs of different sizes; their edge-event counts range from millions to billions. And we used a series of typical graph algorithms, such as PageRank, weakly connected components and single-source shortest path.
**************************
20
Chronos: Single-Thread Effectiveness
5~9x speedup
TRANS: KEY: LABS improves the performance
Let's start from the single-thread case to investigate the effectiveness of Chronos. Here the x-axis is the batch size, i.e. how many snapshots are batched by LABS. The y-axis stands for the speedup of Chronos; batch size = 1 is the baseline, which is computing snapshot by snapshot. It shows that LABS brings a 5 to 9 times speedup with a batch size of 32.
**************************
Baseline: snapshot by snapshot
21
Chronos: Single-Thread Effectiveness
92% 70% 95%
TRANS: behind the perf improvement is the cache miss reduction KEY: improved by reducing cache misses
Behind the speedup is the cache miss reduction. According to the profiling, LABS reduces cache misses; with a batch size of 32, 70~95% of cache misses are avoided.
**************************
LLC: last-level cache miss count. DTLB: data translation look-aside buffer miss count.
Reduced cache misses
22
Chronos: Multi-Core Performance
10x speedup
TRANS: After the single-thread case, let's look at the performance under parallel settings KEY: LABS-Parallelism outperforms
Here is the figure for multi-core settings. The x-axis stands for the number of cores and the y-axis stands for the scalability. We can see that, with 16 cores, Chronos is more than 10x faster.
**************************
Scalable: LABS remains effective. More than 10x faster
23
Chronos: Multi-Core Performance
98% 98% 98%
TRANS: inter-core comm. reduction is the reason KEY: LABS significantly reduces inter-core comm.
The profiling numbers show how many inter-core communications are actually reduced; the reduction ratio is about 98% across different numbers of cores. Snapshot-parallelism has no inter-core communication at all, but LABS-parallelism eliminates most of it. This also explains why LABS-parallelism produces better performance.
**************************
Reduced inter-core comm.
24
More in Paper: Graph computation modes All benefit from LABS Push Mode
Pull Mode Stream Mode
There is more about LABS that I will introduce briefly here; you may refer to the paper for details. In practice, the propagation-based graph computation model can be realized in different modes. In this presentation we used the push mode, in which each vertex pushes its data to its neighbor vertices. The pull mode is a lock-free design, in which a vertex actively pulls data from its neighbors. A graph system called X-Stream from SOSP 2013 introduced a graph engine in stream mode, which performs the propagation with an edge stream. Although they look quite different, they all benefit from the LABS mechanism.
**************************
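For intuition, a hedged sketch of how push and pull differ structurally (the edge grouping, data layout and update rule are assumptions); in both modes, LABS keeps the same inner loop over batched snapshots:

```cpp
#include <cstdint>
#include <vector>

struct Edge { uint32_t src, dst; };

// Push mode: scan edges grouped by source; each vertex writes into its
// out-neighbors' data, so parallel push needs synchronization on dst.
void pushMode(std::vector<double>& data, const std::vector<Edge>& outEdges, std::size_t N) {
    for (const Edge& e : outEdges)
        for (std::size_t s = 0; s < N; ++s)              // LABS snapshot batch
            data[e.dst * N + s] += 0.5 * data[e.src * N + s];
}

// Pull mode: each vertex reads from its in-neighbors and writes only its
// own slots; no other thread writes them, which is why pull is lock-free.
void pullMode(std::vector<double>& data,
              const std::vector<std::vector<uint32_t>>& inNeighbors, std::size_t N) {
    for (uint32_t v = 0; v < inNeighbors.size(); ++v)
        for (uint32_t u : inNeighbors[v])
            for (std::size_t s = 0; s < N; ++s)          // LABS snapshot batch
                data[v * N + s] += 0.5 * data[u * N + s];
}
```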
25
More in Paper: Incremental graph computation Can be enhanced with LABS
Leveraging the previous snapshot's result. Computing only the changed part. Can be enhanced with LABS.
Incremental graph computation is a class of algorithms that, to get the result of a snapshot, leverage the result of the previous snapshot and compute only the changed part of the graph. It can also be enhanced with LABS.
**************************
26
Conclusion Temporal graph analysis Chronos
Temporal graph analysis: an emerging class of applications. Chronos: supports analysis of temporal graphs efficiently; a joint design of data layout and scheduling; leverages the temporal similarity of graphs; exploits data locality, especially in the time dimension. To conclude …
27
Thank You! Questions? Tsinghua University University of Science and Technology of China Microsoft Research
28
BACKUP Experiment Environment Details
Real Graphs: Similarities over Time. Batch Size Discussion. LABS Locking. LABS with Incremental Computation. LABS on Cluster. Related Work.
Prepared questions:
Loading time? The computation time of PageRank on the Weibo graph for the first 10 iterations is 774 sec, while the load time of the extracted graph is nearly 100 sec. Since the size of the extracted graph is 37GB, the I/O speed is more than 300MB/s.
Load balancing?
Inter-core comm.: how did we get that number?
Different graph computing models (push/pull/stream).
Snapshot-parallelism.
Lock contention.
Cluster test.
Already in main part: In-Memory Data Structure for LABS
29
Experiment Setup: CPU 2.4GHz Intel Xeon E5-2665, 16-core; RAM 128GB; Disk 1TB SSD (RAID 0 with 372GB¹ ×3); Network InfiniBand (DDR, 40Gb/s); Cluster size 4.
1. SSD model: TOSHIBA MK4001GRZB
30
Temporal Distributions of Graphs
Edges increase gradually
31
On-disk Temporal Graph
Snapshot Groups. A snapshot group reduces memory accesses when scanning the edge array. Ci: checkpoint of vi (edges without time information). aij: j-th activity of vi (edge changes, e.g., <addE, (v0, v3, w), t2>)
32
LABS: In-memory Design
Vertex Data Array Logically equals to: indicates which snapshots the edge exists in
TRANS: Now we know what LABS is, but how is it designed in memory? KEYS: the in-memory data structure is just as illustrated before, except that edges are kept in a compact way
Let's take a look at how LABS is implemented in memory. The vertex data array has an index, so you can get the vertex you need with the vertex ID. The data for different snapshots is sequentially placed, so you can easily find the actual data based on the snapshot ID. Similarly, there is an index for the edge array too, with which you can locate the edges belonging to a vertex with the vertex's ID. An edge in the array is a temporal edge, representing an edge that may exist in different snapshots. It consists of a data field indicating its endpoints and a bitmap indicating which snapshots it exists in. Logically, the temporal edge in the green dashed box equals a series of edges from different snapshots, as shown here. This reduces memory accesses when scanning the edge array.
**************************
Edge Array
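A hedged sketch of the temporal-edge record and the bitmap walk described above (field names are assumptions; __builtin_ctzll stands in for the _BitScanForward64 intrinsic mentioned in the backup slides):

```cpp
#include <cstdint>
#include <vector>

// One compact record per edge, plus a bitmap of the snapshots it exists in.
struct TemporalEdge {
    uint32_t src, dst;
    uint64_t snapshots;   // bit s set => the edge exists in snapshot s
};

// Propagate only for the snapshots where the edge exists, by walking the
// set bits (GCC/Clang builtin; MSVC would use _BitScanForward64 instead).
void propagateTemporalEdge(std::vector<double>& data, const TemporalEdge& e, std::size_t N) {
    for (uint64_t bits = e.snapshots; bits != 0; bits &= bits - 1) {
        std::size_t s = static_cast<std::size_t>(__builtin_ctzll(bits));  // lowest set bit
        data[e.dst * N + s] += 0.5 * data[e.src * N + s];                 // placeholder rule
    }
}
```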
33
Temporal Graph Re-construction
User input time points: 0, 10, 20
Scan the graph activity log [Type, Endpoints, Time]: addE, v0->v1, 0; addE, v0->v2, 15; addE, v0->v3, 6; delE, v0->v3, 8
Temporal edges [Endpoints, BitSet]: v0->v1, 111; v0->v2, 001
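A hedged sketch of this reconstruction (the event and map types are assumptions): replay the log in time order and, at each requested time point, record which edges are alive.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Event { bool isAdd; uint32_t src, dst; int time; };
using EdgeKey = std::pair<uint32_t, uint32_t>;

// Returns, for each edge, a bitset with bit i set if the edge exists at
// timePoints[i] (timePoints must be non-empty and sorted ascending).
std::map<EdgeKey, uint64_t> buildTemporalEdges(std::vector<Event> log,
                                               const std::vector<int>& timePoints) {
    std::sort(log.begin(), log.end(),
              [](const Event& a, const Event& b) { return a.time < b.time; });
    std::map<EdgeKey, bool> alive;
    std::map<EdgeKey, uint64_t> bitsets;
    std::size_t next = 0;                                // next time point to snapshot
    auto snapshotUpTo = [&](int t) {                     // emit time points before t
        while (next < timePoints.size() && timePoints[next] < t) {
            for (const auto& [edge, on] : alive)
                if (on) bitsets[edge] |= 1ull << next;
            ++next;
        }
    };
    for (const Event& ev : log) {
        snapshotUpTo(ev.time);                           // events at time t count for snapshot t
        alive[{ev.src, ev.dst}] = ev.isAdd;
    }
    snapshotUpTo(timePoints.back() + 1);                 // remaining time points
    return bitsets;
}
```

On the log above this yields v0->v1 present at all three time points and v0->v2 only at t=20, matching the bitsets on the slide; v0->v3 is added at t=6 and deleted at t=8, so it appears in no snapshot and gets no temporal edge.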
34
Chronos System Overview
On-Disk Temporal Graph: contains all the graph evolving activities. User inputs multiple time points. Scan activities (log). Reconstruct graph snapshots. In-Memory Temporal Graph: contains only the snapshots of interest. Temporal Properties.
TRANS: after the key design, we look at the system overview to better understand Chronos KEYS: reconstruct and then analyze
We keep the temporal graph on disk; it contains all the graph evolving activities as a log. When the user inputs some time points, the system scans the activities as needed to reconstruct the graph snapshots at the given time points. Then we have the in-memory temporal graph, which contains only the snapshots of interest. After that, we perform the temporal graph analysis and get the temporal graph properties.
35
Greater Batch Size of LABS
Pros: possible to further reduce cache misses / inter-core communication. Cons: bit-width limit of the instruction _BitScanForward64; less snapshot similarity within a batch; no more cache misses / inter-core communication to reduce; false sharing with locking.
36
Compute Snapshot by Snapshot (another way)
Snapshot-Parallelism Vertex Data Array Snapshot 1 Core 0 Snapshot 2 Core 1 3 cache misses 3 inter-core comm.
TRANS: Actually there is another way to parallelize KEY: cannot reduce cache misses, but can be inter-core-comm. free
If we compute snapshot by snapshot, there is actually another way to parallelize: we can directly assign each snapshot to one CPU core without partitioning the graph. This avoids inter-core communication, as there is no need to coordinate between cores during the computation.
**************************
Snapshot 3 Core 2 Cache Miss Inter-core communication
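A hedged sketch of snapshot-parallelism (thread handling and names are assumptions, and for simplicity all snapshots share one edge array): each core runs the whole edge scan on its own snapshot's data, so no coordination is needed.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

struct Edge { uint32_t src, dst; };

// One thread per snapshot; no partitioning, no inter-core communication,
// but each thread still pays its own cache misses on its snapshot's data.
void snapshotParallel(std::vector<std::vector<double>>& perSnapshotData,
                      const std::vector<Edge>& edgeArray) {
    std::vector<std::thread> workers;
    for (auto& data : perSnapshotData)
        workers.emplace_back([&data, &edgeArray] {
            for (const Edge& e : edgeArray)
                data[e.dst] += 0.5 * data[e.src];   // placeholder update rule
        });
    for (auto& t : workers) t.join();
}
```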
37
Parallelization -- Summary
Good partitioning: num. of intra-partition edges > num. of inter-partition edges

                            Partition-Parallelism   Snapshot-Parallelism   LABS-Parallelism
Cache Misses                More                    More                   Less
Inter-core Communications   More                    No                     ?

TRANS: let's compare the three ways of performing temporal graph analysis KEY: LABS-parallelism is the best
Let's review and compare the three ways of performing temporal graph analysis in parallel. Partition-parallelism computes one snapshot after another; for each snapshot, it partitions the snapshot and computes the partitions in parallel. Snapshot-parallelism computes different snapshots in parallel, with a single thread per snapshot. LABS-parallelism partitions across multiple snapshots, batching the partitions for the same part of the graph from different snapshots, and computes each batched partition with the LABS mechanism. As snapshot-parallelism has no coordination during computation, it has no inter-core communication; the LABS mechanism reduces the number of inter-core communications, so LABS-parallelism has fewer than partition-parallelism. As partition-parallelism and snapshot-parallelism do not apply LABS, they have more cache misses than LABS-parallelism. It is obvious that partition-parallelism is not a good choice. Between snapshot-parallelism and LABS-parallelism it is harder to tell which is better, but it becomes clearer if you consider that a good partitioning has more edges within partitions than across partitions. <XXX>
Q: how about hybrid parallelization? A: it can be converted into these two choices.
Partition-Parallelism: computing partitions of the same snapshot in parallel. Snapshot-Parallelism: computing snapshots in parallel. LABS-Parallelism: computing LABS-batched partitions in parallel.
38
LABS Performance on Multi-Core
TRANS: After the single-thread case, let's look at the performance under parallel settings KEY: LABS-Parallelism outperforms
The multi-core experiment compares the performance of the different parallelization methods. We can see from the figure that LABS-parallelism, the green dashed line, outperforms the others. And with 16 cores, it brings another 10 times speedup <XXX> compared to the single-thread case.
**************************
Scalable: LABS remains effective. Baseline: single core. LABS-Parallelism outperforms
39
LABS Performance on Cluster
A small cluster with 4 machines. Benefits less than in the single-machine test: the benefit of LABS is hidden by the high overhead of the network. Up to 10x speedup. Consider that the bandwidth & latency of the network are much worse than those of memory.
40
Reduced Lock Contentions
LABS amortizes the lock cost across snapshots. PageRank on the Wiki graph. 96% 96% 95% 96% (temporal batching of 32) Reduced the time of locking by more than 95%
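A hedged sketch of the amortization (lock granularity and names are assumptions): one lock acquisition on the destination vertex covers all N snapshot updates, instead of one acquisition per snapshot.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

struct Edge { uint32_t src, dst; };

// Locked push with LABS batching: the per-destination lock is taken once
// per edge, then all N snapshot updates run under it.
void lockedLabsPropagate(std::vector<double>& data,             // v * N + s layout
                         std::vector<std::mutex>& vertexLocks,  // one per vertex
                         const std::vector<Edge>& edges, std::size_t N) {
    for (const Edge& e : edges) {
        std::lock_guard<std::mutex> guard(vertexLocks[e.dst]);  // 1 lock for N updates
        for (std::size_t s = 0; s < N; ++s)
            data[e.dst * N + s] += 0.5 * data[e.src * N + s];   // placeholder rule
    }
}
```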
41
LABS with Incremental Computation
Incremental Computing: traditional incremental computing vs. incremental computing with LABS
(Figure: snapshots 1, 2, 3; apply LABS with BatchSize = 3)
Incremental with LABS. Pros: time-dimension locality. Cons: duplicated computation, e.g., the changes from snapshot 0 to snapshot 1 are computed three times.
42
Gain of Incremental LABS
Wiki graph Baseline: Traditional Incremental
43
Related work Existing Graph Engines – static graph engines
Pregel (SIGMOD'10), PowerGraph (OSDI'12), GraphLab (VLDB'12), Grace (ATC'12), X-Stream (SOSP'13), …
Active studies on changes and new concepts in evolving graphs: densification law and shrinking diameters (KDD'05); PageRank (CIKM'07); Facebook user activities (EuroSys'09); centrality in evolving graphs (MLG'10); retweeting after N friends' retweets (WWW'11); rumor detection (SOMA'10); …
References:
[KDD'05] Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations
[CIKM'07] Link Analysis using Time Series of Web Graphs
[EuroSys'09] User Interactions in Social Networks and their Implications
[MLG'10] Centrality Metric for Dynamic Networks
[WWW'11] Differences in the Mechanics of Information Diffusion Across Topics: Idioms, Political Hashtags, and Complex Contagion on Twitter
[SOMA'10] Twitter Under Crisis: Can we trust what we RT?
[WWW'05] Incremental Page Rank Computation on Evolving Graphs