Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han, Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center University of Illinois at Chicago
Outline Motivation Framework Efficient Computation Experiments Conclusion
Online Analytical Processing Jim Gray, 1997 OLAP as a powerful analytical tool
The Usefulness of OLAP Multi-dimensional Different perspectives Multi-level Different granularities Can we offer roll-up/drill-down and slice/dice on graph data? Traditional OLAP cannot handle this, because it ignores links among data objects
The Prevalence of Graphs Chemical compounds, computer vision objects, circuits, XML Especially various information networks Biological networks Bibliographic networks Social networks World Wide Web (WWW)
Applications WWW >= 3 billion nodes, >= 50 billion arcs Facebook >= 100 million active users Combining topological structures and node/edge attributes Great challenge to view and analyze them We propose Graph OLAP to tackle this issue
Scenario #1 A bibliographic network The collaboration patterns among researchers for SIGMOD 2004
Scenario #2
Outline Motivation Framework Data Model Two types of Graph OLAP Dimension, Measure and OLAP operations Efficient Computation Experiments Conclusion
Data Model We have a collection of network snapshots $\mathcal{G} = \{\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_N\}$. Each snapshot $\mathcal{G}_i = (I_{1,i}, I_{2,i}, \ldots, I_{k,i}; G_i)$. $I_{1,i}, I_{2,i}, \ldots, I_{k,i}$ are $k$ informational attributes describing the snapshot as a whole. $G_i = (V_i, E_i)$ is an attributed graph, with attributes attached to its nodes $V_i$ and edges $E_i$. Since $G_1, G_2, \ldots, G_N$ only represent different observations of a network, $V_1, V_2, \ldots, V_N$ actually correspond to the same set of objects
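A minimal sketch of how this data model could be held in memory. The class name `Snapshot`, its field names, and the toy node names and weights are illustrative assumptions, not part of the original framework.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """One network snapshot: Info-Dim values plus an attributed graph."""
    info: dict                                       # informational attributes of the whole snapshot
    node_attrs: dict = field(default_factory=dict)   # node -> attribute dict
    edge_attrs: dict = field(default_factory=dict)   # frozenset({u, v}) -> attribute dict

    def add_edge(self, u, v, **attrs):
        # Register both endpoints, then store the undirected edge.
        for n in (u, v):
            self.node_attrs.setdefault(n, {})
        self.edge_attrs[frozenset((u, v))] = attrs


# Toy data (hypothetical): two observations of the same set of objects.
s1 = Snapshot(info={"venue": "SIGMOD", "year": 2004})
s1.add_edge("alice", "bob", weight=2)
s2 = Snapshot(info={"venue": "VLDB", "year": 2004})
s2.add_edge("alice", "bob", weight=1)
snapshots = [s1, s2]
```

Keeping the informational attributes outside the graph mirrors the split between the snapshot-level attributes and the attributed graph itself.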
Two Types of OLAP Informational OLAP (abbr. I-OLAP) Topological OLAP (abbr. T-OLAP)
Informational OLAP Dimensions come from informational attributes attached at the whole snapshot level, so-called Info-Dims e.g., scenario #1
I-OLAP Characteristics Overlay multiple pieces of information Do not change the objects whose interactions are being looked at In the underlying snapshots, each node is a researcher In the summarized view, each node is still a researcher
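A minimal sketch of an I-OLAP roll-up, assuming each snapshot is a pair of Info-Dim values and an edge-weight dictionary, and that overlaying means summing edge weights (as with collaboration frequency). The function name `i_rollup`, the tuple layout, and the toy data are hypothetical.

```python
from collections import Counter


def i_rollup(snapshots, keep):
    """Overlay the edge weights of every snapshot whose Info-Dims satisfy `keep`."""
    overlay = Counter()
    for info, edges in snapshots:
        if keep(info):
            overlay.update(edges)   # Counter.update adds weights key by key
    return dict(overlay)


# Toy data (hypothetical): edges are keyed by a sorted pair of researchers.
snapshots = [
    ({"venue": "SIGMOD", "year": 2003}, {("alice", "bob"): 1}),
    ({"venue": "SIGMOD", "year": 2004}, {("alice", "bob"): 2, ("bob", "carol"): 1}),
    ({"venue": "VLDB", "year": 2004}, {("alice", "carol"): 3}),
]

# The cell [sigmod, all-years]: nodes are still individual researchers.
sigmod_all_years = i_rollup(snapshots, lambda info: info["venue"] == "SIGMOD")
```

The `keep` predicate doubles as a slice/dice on Info-Dims, since it selects the subset of qualifying snapshots before they are overlaid.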
Topological OLAP Dimensions come from the node/edge attributes inside individual networks, so-called Topo-Dims e.g., scenario #2
T-OLAP Characteristics Zoom in/Zoom out Network topology changed: “generalized” nodes and “generalized” edges In the underlying network, each node is a researcher In the summarized view, each node becomes an institute that comprises multiple researchers
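A minimal sketch of a T-OLAP roll-up, assuming a Topo-Dim that maps each researcher to an institute and assuming that edges between merged nodes are summed. The function name `t_rollup`, the choice to drop edges inside one institute, and the toy data are hypothetical.

```python
from collections import Counter


def t_rollup(edges, group_of):
    """Merge nodes into generalized nodes and sum the edges between groups."""
    merged = Counter()
    for (u, v), weight in edges.items():
        gu, gv = group_of[u], group_of[v]
        if gu != gv:                      # assumption: intra-group edges are dropped
            merged[tuple(sorted((gu, gv)))] += weight
    return dict(merged)


# Toy data (hypothetical): four researchers in two institutes.
edges = {
    ("alice", "bob"): 2,
    ("alice", "carol"): 1,
    ("bob", "dave"): 4,
    ("carol", "dave"): 5,
}
institute = {"alice": "inst_A", "bob": "inst_A", "carol": "inst_B", "dave": "inst_B"}

by_institute = t_rollup(edges, institute)
```

Unlike the I-OLAP case, the node set itself changes here: the summarized view has one generalized node per institute.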
Measures in Graph OLAP Measure is an aggregated graph I-aggregated graph T-aggregated graph Other measures like node count, average degree, etc. can be treated as derived Graph plays a dual role Data source Aggregate measure
Generality of the Framework Measures could be complex e.g., maximum flow, shortest path, centrality Combine I-OLAP and T-OLAP into a hybrid case
Graph OLAP Operations
Roll-up
  Graph I-OLAP: Overlay multiple snapshots to form a higher-level summary via I-aggregated graph
  Graph T-OLAP: Shrink the topology and obtain a T-aggregated graph that represents a compressed view, whose topological elements (i.e., nodes and/or edges) have been merged and replaced by corresponding higher-level ones
Drill-down
  Graph I-OLAP: Return to the set of lower-level snapshots from the higher-level overlaid (aggregated) graph
  Graph T-OLAP: A reverse operation of roll-up
Slice/dice
  Graph I-OLAP: Select a subset of qualifying snapshots based on Info-Dims
  Graph T-OLAP: Select a subgraph of the network based on Topo-Dims
Outline Motivation Framework Efficient Computation Measure classification Optimizations Constraint pushing Experiments Conclusion
Two Categories of Strategies Top-down Generalized cells later How to combine and leverage intermediate results? Bottom-up Generalized cells first How to early-stop?
Measure Classification How to combine and leverage intermediate results? Distributive The computation of high-level cells can be directly built on low-level cells Algebraic Not distributive, but can be easily derived from several distributive measures Holistic Neither distributive nor algebraic
Examples Distributive: collaboration frequency Use distributiveness to drive computation up the cuboid lattice Algebraic: maximum flow Will prove later Semi-distributive Holistic: centrality Need to go down to the raw data and start from scratch
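A small sketch of what distributivity buys for collaboration frequency: a high-level cell built only from already computed lower-level cells matches the one built from the raw snapshots. The helper `overlay`, the cell keys, and the toy weights are hypothetical.

```python
from collections import Counter


def overlay(graphs):
    """Sum edge weights over a collection of edge-weight dictionaries."""
    total = Counter()
    for g in graphs:
        total.update(g)
    return total


# Toy data (hypothetical): base cells keyed by (venue, year).
raw = {
    ("SIGMOD", 2003): {("alice", "bob"): 1},
    ("SIGMOD", 2004): {("alice", "bob"): 2, ("bob", "carol"): 1},
    ("VLDB", 2003): {("alice", "carol"): 1},
    ("VLDB", 2004): {("alice", "bob"): 1},
}

# One level up: [venue, all-years] cells.
by_venue = {
    venue: overlay(g for (v, _), g in raw.items() if v == venue)
    for venue in ("SIGMOD", "VLDB")
}

# Two levels up: [DB, all-years], built from the venue cells only.
db_from_cells = overlay(by_venue.values())

# The same cell computed from scratch, for comparison.
db_from_raw = overlay(raw.values())
```

A holistic measure such as centrality offers no such shortcut, so its high-level cells would have to be recomputed from the raw snapshots.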
Optimizations Special measures may have special properties that can help optimize the calculations We discuss two of them here, with regard to I-OLAP Localization Attenuation
Localization During computation, only a neighborhood of the networks needs to be consulted e.g., the collaboration frequency of “R. Agrawal” and “R. Srikant” for [sigmod, all-years] only depends on their collaboration frequencies in each SIGMOD conference Perfect (i.e., 0-neighborhood) localization k-neighborhood is less ideal, but still useful e.g., # of common friends shared by “R. Agrawal” and “R. Srikant”
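A minimal sketch of 1-neighborhood localization for the common-friends example, assuming that overlaying snapshots takes the union of their edges. Only the adjacency sets of the two queried nodes are read; the rest of each network is never touched. The function name and the toy adjacency data are hypothetical.

```python
def common_friends(snapshots, u, v):
    """Count common neighbours of u and v in the overlay of all snapshots.

    Each snapshot maps a node to the set of its neighbours. Only the
    entries for u and v are consulted.
    """
    neighbours_u, neighbours_v = set(), set()
    for adjacency in snapshots:
        neighbours_u |= adjacency.get(u, set())
        neighbours_v |= adjacency.get(v, set())
    return len((neighbours_u & neighbours_v) - {u, v})


# Toy data (hypothetical): a and b share no friend in either snapshot alone.
snap1 = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
snap2 = {"a": {"d"}, "b": {"c"}, "c": {"b"}, "d": {"a"}, "e": {"f"}, "f": {"e"}}

shared = common_friends([snap1, snap2], "a", "b")
```

In this toy case the count for the overlay cannot be obtained by adding the per-snapshot counts, yet it still depends only on a small neighborhood, which is what makes localization useful.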
Attenuation Consider the transporting capability (i.e., maximum flow) from source S to destination T Multiple transportation networks, each one operated by a separate company With regard to I-OLAP, each network is a “snapshot”, and overlaying more than one snapshot means sharing link capacities among companies
Attenuation Data graph C Node: cities Edge: capacity of a link Measure graph F Node: cities Edge: when maximum flow is transmitted, the quantity that passes through a link
Attenuation Maximum flow is algebraic: $F$ can be derived from $C$ (just run the maximum flow algorithm), and the capacity graph $C$ is obviously distributive. Lemma: Let $F$ be a flow in $C$ and let $C_F$ be its residual graph, where residual means that $C_F = C - F$. Then $F'$ is a maximum flow in $C_F$ if and only if $F + F'$ is a maximum flow in $C$
Attenuation Consider two snapshots that are overlaid, with maximum flows $F_1$, $F_2$ already calculated from $C_1$, $C_2$. Without attenuation: compute the overall maximum flow $F$ from $C_1 + C_2$. With attenuation: take $F_1 + F_2$ as the basis, compute the residual maximum flow $F'$ from $(C_1 - F_1) + (C_2 - F_2)$, and augment it onto $F_1 + F_2$. Thus, our input attenuates from $C_1 + C_2$ to $(C_1 + C_2) - (F_1 + F_2)$, which substantially reduces the computational effort
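A minimal sketch of the attenuation idea on two toy snapshots. It uses a plain Edmonds-Karp routine as a stand-in for whatever maximum flow algorithm is actually used, and stores flows as net (skew-symmetric) values so that $C - F$ automatically contains the reverse residual capacities. The function names, node names, and capacities are hypothetical.

```python
from collections import defaultdict, deque


def max_flow(cap, s, t):
    """Edmonds-Karp. `cap` maps (u, v) -> capacity. Returns (value, net flow)."""
    res = defaultdict(int)            # residual capacities
    adj = defaultdict(set)
    for (u, v), c in cap.items():
        res[(u, v)] += c
        adj[u].add(v)
        adj[v].add(u)
    flow = defaultdict(int)           # net flow: flow[(u, v)] == -flow[(v, u)]
    value = 0
    while True:
        parent = {s: None}            # BFS for a shortest augmenting path
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value, dict(flow)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[e] for e in path)
        for u, v in path:
            res[(u, v)] -= bottleneck
            res[(v, u)] += bottleneck
            flow[(u, v)] += bottleneck
            flow[(v, u)] -= bottleneck
        value += bottleneck


def combine(a, b, sign=1):
    """Return a + sign * b, edge by edge."""
    out = defaultdict(int)
    for key, x in a.items():
        out[key] += x
    for key, x in b.items():
        out[key] += sign * x
    return dict(out)


# Toy data (hypothetical): two companies operating on the same cities.
C1 = {("S", "A"): 4, ("A", "T"): 2, ("A", "B"): 3}
C2 = {("S", "B"): 1, ("B", "T"): 5, ("A", "T"): 1}

v1, F1 = max_flow(C1, "S", "T")       # already available per snapshot
v2, F2 = max_flow(C2, "S", "T")

# With attenuation: only the residual (C1 - F1) + (C2 - F2) is searched.
residual = combine(combine(C1, F1, -1), combine(C2, F2, -1))
v_res, F_res = max_flow(residual, "S", "T")
v_attenuated = v1 + v2 + v_res

# Without attenuation: start from scratch on C1 + C2.
v_direct, _ = max_flow(combine(C1, C2), "S", "T")
```

In this toy case both routes give the same maximum flow value, but the attenuated run only has to find the flow that the two companies could not route on their own.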
Constraint Pushing Iceberg graph cube Partial materialization Satisfying some interestingness requirement Push the constraints Anti-monotone e.g., maximum flow $|f| \ge \delta_{|f|}$ Monotone e.g., diameter $d \ge \delta_d$
Outline Motivation Framework Efficient Computation Experiments Conclusion
OLAP a Bibliographic Network We get the coauthorship data from DBLP Measure Information Centrality Two Info-Dims Area Database (DB): PODS/SIGMOD/VLDB/ICDE/EDBT Data Mining (DM): ICDM/SDM/KDD/PKDD Information Retrieval (IR): SIGIR/WWW/CIKM Time
OLAP a Bibliographic Network
Efficiency A test that computes maximum flow as the measure Synthetically generate flow networks Details in the paper, with each “snapshot” representing an individual player in the transportation industry Like the Multi-Way method, calculate low-level cells before merging them into high-level ones One takes advantage of the attenuation heuristic The other does not
Efficiency
Outline Motivation Framework Efficient Computation Experiments Conclusion
We propose a Graph OLAP framework to perform multi-dimensional, multi-level analysis on network data Measure is an aggregated graph Informational/Topological dimensions lead to I-OLAP, T-OLAP
Conclusion Mainly focusing on I-OLAP, we discuss how a graph cube can be efficiently computed and materialized Measure classification: distributive, algebraic, holistic Optimizations: localization, attenuation Constraint pushing
Future Work Technical issues for T-OLAP Selective drilling and discovery-driven InfoNet-OLAP