Yinghui Wu, SIGMOD 2012
Query Preserving Graph Compression
  Wenfei Fan (1,2), Jianzhong Li (2), Xin Wang (1), Yinghui Wu (1,3)
  (1) University of Edinburgh; (2) Harbin Institute of Technology; (3) University of California, Santa Barbara
Querying Real-life Graphs
  Real-life graphs as “Big Data”.
  Complexities of several common graph queries:
    NP-complete for subgraph isomorphism
    quadratic time for simulation queries
    cubic time for bounded simulation queries
    O(|V|+|E|) for reachability queries
  Indexing techniques:
    Index        Query time       Index time              Index size
    TC           O(1)             O(|V||E|)               O(|V|^2)
    GRIPP        O(|E|-|V|)       O(|V|+|E|)
    Tree Cover   O(log|V|)        O(|V||E|)               O(|V|^2)
    2-Hop        O(|E|^(1/2))     O(|V|^3 |TC|)           O(|V||E|^(1/2))
    3-Hop        O(log|V| + k)    O(k |V|^2 |Con(G)|)     O(|V|k)
  Querying real-life graphs is prohibitively expensive, and the cost is theoretically hard to reduce!
Graph compression techniques
  General graph compression:
    encoding via node ordering
    dependent on extrinsic information
    lossless compression
  Query-friendly compression (e.g., for neighborhood queries):
    constructs compact data structures
    requires decompression or revision of the evaluation algorithms
  Compression for a query class?
Querying a recommendation network
  [Figure: a recommendation network G with nodes MSA1, BSA1, MSA2, BSA2, …, FA1 to FA4 and C1, C2, C3, …, Ck; a pattern query Qp over BSA, FA and C; and a compressed graph with nodes MSAr, BSAr, FAr, FA'r, Cr and C'r.]
  Directly querying a compressed graph, which preserves only the information relevant to the queries.
Outline
  Query preserving graph compression: compress graphs while preserving query results
  Reachability preserving compression
  Graph pattern preserving compression
  Incremental query preserving compression
  Experimental study
  Conclusion
Query-preserving compression
  Compression relative to a class of queries of the users' choice.
  A query preserving graph compression is a triple <R, F, P>, where
    R is a compression function,
    F: Lq -> Lq is a query rewriting function, where Lq denotes a class of graph queries (the rewritten query stays in the same class), and
    P is a post-processing function,
  such that for any graph G, Gr = R(G) satisfies:
    for all Q ∈ Lq, Q(G) = P(Q'(Gr)), where Q' = F(Q), and
    any query evaluation algorithm for Q can be directly used to compute Q'(Gr), without decompressing Gr.
  Indexing and optimization techniques can be directly applied to Gr.
  The compression is lossy, and Gr is not necessarily a subgraph of G; the goal is to query Gr directly without decompression, rather than to restore the original graph.
Query-preserving compression
  [Diagram: R (compression) maps G to Gr; query rewriting maps Q to Q'; Q' is evaluated on Gr by direct querying, giving Q'(Gr); P (post-processing) maps Q'(Gr) to Q(G).]
  The compression is generic: G is compressed once for all queries in the class.
A tale of two queries…
  Reachability preserving compression (QR on G becomes QR' on Gr = R(G)):
    QR: reachability queries
    R: reduces G by 95% on average, in O(|V||E|) time
    F: in O(1) time
    P: not needed
  Graph pattern preserving compression (QP on G becomes QP' on Gr = R(G), followed by P):
    QP: graph pattern queries
    R: reduces G by 57% on average, in O(|E| log|V|) time
    F: identity mapping
    P: linear time
Reachability preserving compression
  There is a query preserving compression <R, F, P> for reachability queries in which
    R is in quadratic time,
    F is in constant time, and
    no post-processing P is required.
  Reachability equivalence relation:
    a reachability relation Re is such that a node pair (u,v) ∈ Re iff u and v have the same set of ancestors and the same set of descendants in G;
    for any graph G there is a unique maximum Re, i.e., the reachability equivalence relation of G.
Reachability preserving compression
  A reachability preserving compression for G:
    R maps each node v in G to its reachability equivalence class [v] in Gr, and each edge to an edge between two equivalence classes (if necessary);
    F maps each node in QR to its equivalence class in Gr.
  Nodes in Gr denote equivalence classes.
  Correctness:
    |Gr| ≤ |G|;
    for any query QR(v,w) over G, v can reach w in G iff R(v) can reach R(w) in Gr.
  Reduction: 95% on average for reachability queries.
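The construction on this slide can be illustrated with a small sketch. It is a naive illustration under stated assumptions, not the authors' algorithm: "v can reach w" is taken to mean a path of length at least 1, ancestors and descendants are computed by plain graph search, and the step that drops redundant edges from Gr ("if necessary") is omitted. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict


def adjacency(edges):
    """Build a successor map from an iterable of (u, v) edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
    return adj


def strict_reach_sets(adj, nodes):
    """For each node, the set of nodes reachable by a path of length >= 1."""
    reach = {}
    for s in nodes:
        seen, stack = set(), list(adj[s])
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n])
        reach[s] = seen
    return reach


def compress_reachability(nodes, edges):
    """Return (cls, gr_edges): the class id of each node and the edges of Gr."""
    desc = strict_reach_sets(adjacency(edges), nodes)
    anc = strict_reach_sets(adjacency((v, u) for u, v in edges), nodes)
    class_ids, cls = {}, {}
    for v in nodes:
        # Nodes with identical (ancestors, descendants) share a class.
        signature = (frozenset(anc[v]), frozenset(desc[v]))
        cls[v] = class_ids.setdefault(signature, len(class_ids))
    gr_edges = {(cls[u], cls[v]) for u, v in edges}
    return cls, gr_edges


def reaches(edges, s, t):
    """True iff there is a path of length >= 1 from s to t."""
    return t in strict_reach_sets(adjacency(edges), [s])[s]


# Hypothetical toy graph: b1 and b2 are interchangeable, d and e form a cycle.
nodes = ["a", "b1", "b2", "c", "d", "e"]
edges = [("a", "b1"), ("a", "b2"), ("b1", "c"), ("b2", "c"),
         ("c", "d"), ("d", "e"), ("e", "d")]
cls, gr_edges = compress_reachability(nodes, edges)

# F maps a query (v, w) to (cls[v], cls[w]); no post-processing is needed.
preserved = all(
    reaches(edges, v, w) == reaches(gr_edges, cls[v], cls[w])
    for v in nodes for w in nodes
)
```

On this toy input the six nodes collapse into four classes, and `preserved` compares every reachability query on G against the rewritten query on Gr.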
Reachability preserving compression: algorithm and example
  1. Compute Re and its induced partition
  2. Construct a node for each node set in the partition
  3. Construct Gr
  Total time: O(|V||E|)
  [Figure: a reachability query QR on the recommendation network (nodes MSA1, BSA1, MSA2, BSA2, …, FA1 to FA4, C1 to C4, …, Ck) and its compressed graph, in which FA1, FA3 and FA4 are grouped together.]
Graph Pattern Preserving Compression
  There is a graph pattern preserving compression <R, F, P> in which, for any graph G(V,E,L),
    R is in O(|E| log|V|) time,
    F is the identity mapping, and
    P is in linear time in the size of the query answer.
  Bisimulation relation: a binary relation B over V of G such that for each node pair (u,v) ∈ B,
    L(u) = L(v),
    for each edge (u,u') ∈ E, there exists an edge (v,v') ∈ E such that (u',v') ∈ B, and
    for each edge (v,v') ∈ E, there exists an edge (u,u') ∈ E such that (u',v') ∈ B.
  Bisimulation equivalence relation Rb: the unique maximum bisimulation relation; it is an equivalence relation.
  [Figure: example graphs G1 and G2 over nodes A1 to A5, B1 to B5, C1 to C4, D1 and D2.]
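The three conditions of the bisimulation definition can be checked mechanically. The following is a minimal sketch that tests whether a given set of node pairs is a bisimulation; the function name `is_bisimulation` and the toy graphs are hypothetical and only serve as illustration.

```python
from collections import defaultdict


def is_bisimulation(relation, edges, label):
    """Check the three bisimulation conditions for every pair in `relation`."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    for u, v in relation:
        # Condition 1: related nodes carry the same label.
        if label[u] != label[v]:
            return False
        # Condition 2: every edge (u, u2) is matched by some edge (v, v2).
        if not all(any((u2, v2) in relation for v2 in succ[v]) for u2 in succ[u]):
            return False
        # Condition 3: every edge (v, v2) is matched by some edge (u, u2).
        if not all(any((u2, v2) in relation for u2 in succ[u]) for v2 in succ[v]):
            return False
    return True


# Hypothetical toy graph: two parallel A -> B edges.
label_ok = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
edges_ok = [("A1", "B1"), ("A2", "B2")]
candidate = {("A1", "A2"), ("B1", "B2")}
ok = is_bisimulation(candidate, edges_ok, label_ok)

# Adding an edge B1 -> C1 breaks the pair (B1, B2): B2 has no matching edge.
label_bad = dict(label_ok, C1="C")
edges_bad = edges_ok + [("B1", "C1")]
bad = is_bisimulation(candidate, edges_bad, label_bad)
```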
Compressing graphs via bisimulation
  The pattern preserving compression <R, F, P>:
    R(G) = Gr, where each node in Gr represents the equivalence class [v] of a node v in G, and there is an edge ([u],[v]) in Gr if (u,v) is an edge in G;
    F(Qp) = Qp, i.e., the identity mapping;
    P: for each (vp,[v]) ∈ Qp(Gr) and each v' ∈ [v], (vp,v') ∈ Qp(G).
  P makes use of the inverse of R: nodes of Gr in Qp(Gr) are expanded to the nodes in their equivalence classes.
  Correctness: for any pattern query Qp, Qp(G) = P(Qp(Gr)).
  Reduction: 57% on average for graph pattern matching.
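The post-processing function P amounts to expanding each matched class into its member nodes. A minimal sketch follows, assuming the match result on Gr is given as a set of (pattern node, class) pairs and that the class membership recorded by R is available as a dictionary; the names `post_process`, `members` and the sample data are hypothetical.

```python
def post_process(matches_on_gr, members):
    """Expand (pattern node, class) pairs into (pattern node, graph node) pairs."""
    return {(vp, v) for vp, c in matches_on_gr for v in members[c]}


# Hypothetical class membership recorded during compression.
members = {"[FA]": ["FA1", "FA2"], "[C]": ["C1", "C2", "C3"]}
# Hypothetical result of evaluating the pattern on the compressed graph.
matches_on_gr = {("FA", "[FA]"), ("C", "[C]")}

matches_on_g = post_process(matches_on_gr, members)
```

Each output pair is produced exactly once, so the work is proportional to the size of the expanded answer, in line with the linear-time bound stated for P.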
Graph Pattern Preserving Compression: algorithm
  1. Compute the bisimulation equivalence relation Rb and its induced partition P: initialize P, then refine it w.r.t. Rb until a fixpoint is reached
  2. Construct Gr
  Total time: O(|E| log|V|)
  [Figure: the recommendation network G, the pattern query Qp and the compressed graph from the earlier example, together with a graph over nodes A1, B1, A2, …, B2, B3, Ak, …, Bk, Ak+1.]
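The two steps can be sketched with a naive partition refinement. This is an illustration only: it repeatedly splits blocks by the set of successor blocks until nothing changes, which is simpler but slower than the O(|E| log|V|) bound quoted on the slide. The function name `bisimulation_partition` and the toy graph are hypothetical.

```python
from collections import defaultdict


def bisimulation_partition(nodes, edges, label):
    """Return a dict mapping each node to the id of its bisimulation class."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    # Initial partition: nodes with the same label share a block.
    block = {v: label[v] for v in nodes}
    while True:
        ids, refined = {}, {}
        for v in nodes:
            # A node's signature: its current block plus its successors' blocks.
            signature = (block[v], frozenset(block[w] for w in succ[v]))
            refined[v] = ids.setdefault(signature, len(ids))
        # Refinement only ever splits blocks, so an unchanged block count
        # means the partition is stable (fixpoint reached).
        if len(set(refined.values())) == len(set(block.values())):
            return refined
        block = refined


# Hypothetical toy graph: A1 and A2 both point only to B nodes, A3 to nothing.
g_nodes = ["A1", "A2", "A3", "B1", "B2", "B3"]
g_edges = [("A1", "B1"), ("A2", "B2"), ("A2", "B3")]
g_label = {v: v[0] for v in g_nodes}

block_of = bisimulation_partition(g_nodes, g_edges, g_label)

# Step 2: construct Gr from the partition.
gr_nodes = set(block_of.values())
gr_edges = {(block_of[u], block_of[v]) for u, v in g_edges}
gr_label = {block_of[v]: g_label[v] for v in g_nodes}
```

Here A1 and A2 end up in one class, A3 in another (it has no outgoing edge), and all B nodes in a third, so Gr has three nodes and a single edge.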
Incremental Graph Compression
  Real-life data are changing and evolving, e.g., 5%/week in Web graphs.
  Incremental graph compression: given updates ∆G to G, compute changes ∆Gr to Gr such that Gr ⊕ ∆Gr = R(G ⊕ ∆G), i.e., update Gr without recompressing G ⊕ ∆G.
  Complexity measurement?
    Affected area AFF: the changes in the input (∆G) and in the output (∆Gr), |AFF| = |∆Gr| + |∆G|.
    Bounded and unbounded problems: is the cost expressible as a function f(|AFF|)?
  [Diagram: R maps G to Gr; applying ∆G to G and ∆Gr to Gr gives Gr ⊕ ∆Gr = R(G ⊕ ∆G).]
  The graph is compressed once and incrementally maintained.
Incremental Reachability Preserving Compression
  Incremental reachability preserving compression (RCM) is unbounded even for unit updates, i.e., a single edge insertion or deletion (by reduction from the single-source reachability problem).
  RCM is solvable in O(|AFF||Gr|) time, without decompressing Gr:
    1. Update the topological ranking and initialize AFF
    2. Iteratively split/merge nodes and update Gr
  [Figure: a graph G over FA1, FA2, C1 and C2, its compressed graph Gr, and the updated compressed graphs Gr' and Gr''.]
Incremental Graph Pattern Preserving Compression
  Incremental pattern preserving compression (PCM) is unbounded even for unit updates.
  PCM is solvable in O(|AFF|^2 + |Gr|) time, without the need to access the original graph G:
    1. Update the node ranking and initialize AFF
    2. Iteratively split/merge nodes in Gr and update AFF
  [Figure: the recommendation network G (nodes MSA1, MSA2, BSA1, BSA2, …, FA1 to FA4, C1 to C4) and its compressed graph, with the affected area highlighted.]
  Incremental compression without recomputation.
Experimental Evaluation
  Experimental setting:
    Real-life datasets: Facebook, Amazon, YouTube, wikiVote, wikiTalk, socEpinions; NotreDame, P2P, Internet; citHepTh, Citation
    Synthetic data, with randomly generated updates
    Pattern generator, controlled by the number of nodes, edges, predicates and bounds on edges
  Algorithms:
    Problem                                Batch           Incremental
    Reachability preserving compression    Compression_R   IncRCM
    Transitive compression                 AHO
    Pattern preserving compression         Compression_B   IncPCM
    Query evaluation                       BFS, BiBFS; Match   IncBMatch
  Measured: compression ratio, memory reduction, query time, and incremental maintenance.
Experimental Results I: compression ratio
  Reachability preserving compression: compression ratio of 5% on average; reduces SCC graphs by 81% on average; performs best on social networks, due to their high connectivity.
  Graph pattern preserving compression: compression ratio of 43% on average; performs best on Internet.
Experimental Results I: compression ratio
  [Figures: reachability preserving compression ratio w.r.t. edge increments; pattern preserving compression ratio w.r.t. edge increments.]
Experimental Results I: compression ratio
  [Figure: memory cost, with 2-hop as the index.]
  Reduction: 92% of the memory of G on average.
Experimental Results II: query evaluation
  [Figures: query time for reachability preserving compression; query time for pattern preserving compression.]
  Reduction: 70% of the querying time over G on average.
Experimental Results III: incremental compression
  [Figures: incremental reachability preserving compression w.r.t. edge insertions; incremental graph pattern preserving compression w.r.t. batch updates.]
  The compressed graphs can be efficiently maintained, for changes of up to 22%.
Conclusion
  Query preserving graph compression: directly query compressed graphs without decompression
    Reachability preserving compression
    Graph pattern preserving compression
  Incremental query preserving compression: incrementally update compressed graphs without decompression
  Future work:
    Query-preserving compression for other queries
    Testing the compression techniques over more real-life datasets
    Optimizations for incremental compression techniques
    Extending the techniques to distributed graph querying
  Query preserving compression: a promising approach to coping with Big Data
Thank you!
  Query preserving graph compression
Subgraph isomorphism and Graph Simulation
  Node label equivalence: identical label matching
  Edge-to-edge function/relation
  Capable enough?
  [Figure: a pattern P over nodes labeled A, B, D and E, matched against a graph G with nodes v1 and v2.]