Graph X-Ray: Fast Best-Effort Pattern Matching in Large Attributed Graphs Hanghang Tong, Brian Gallagher, Christos Faloutsos, Tina Eliassi-Rad 8/13/2007 KDD 2007, San Jose
Input: Query Graph, Attributed Data Graph Output: Matching Subgraph
Terminology: ``Conform’’ First, we say the subgraph H_t conforms to the query graph H_q if it has all the desired job titles and the connections between them. Matching Subgraph conforms to Query Graph
Terminology: ``Interception’’ We allow indirect connections by introducing some extra nodes. For example, the connection between nodes 12 and 4 is indirect. We refer to this phenomenon as interception and to the extra nodes, e.g. node 13, as intermediate nodes. All remaining nodes, e.g. nodes 11, 12, 4 and 7, are matching nodes. Matching Subgraph Query Graph Path 12-13-4 is an Interception
Terminology: ``Instantiate’’ Matching Subgraph H_t Query Graph H_q Whenever we have a matching subgraph H_t, we say H_t instantiates the query graph H_q, and the matching nodes in H_t instantiate the nodes in the query graph. For example, we say node 11 in H_t instantiates the SEC node in the query graph, and so on. Node 11 instantiates SEC node H_t instantiates H_q
Roadmap Introduction (Problem Definition, Motivations) How to: Graph X-Ray Experimental Results Conclusion
Motivation: Why Not SQL? Case 1: Exact match does not exist Q: How to find an approximate answer? Case 2: Too many exact matches Q: How to rank them?
Motivation: Why Not SQL? (Cont.) Case 3: Exact match might not be the best answer ``Find CEO who has heavy contact with Accountant’’ Q: How to find the right answer? Exact match: 1 direct connection Inexact match: many indirect connections
Motivation: Efficiency Why Not Subgraph Isomorphism? Polynomial for fixed # of pattern query Q1: How to scale up linearly? Q2: … and with a small slope?
Wish List Effectiveness: both exact & inexact match; ranking among multiple results; ``best’’ answer (proximity-based) Efficiency: scale linearly; scale with small slope G-Ray meets all!
Roadmap Introduction (Problem Definition, Motivations) How to: Graph X-Ray Experimental Results Conclusion
Preliminary: Center-Piece Subgraph (CePS) [Tong+] Original Graph Black: query nodes CePS is meta opt. in G-Ray!
Preliminary: Augmented Graph Data nodes 1,…13 Attribute nodes a Aug. Graph is crucial for computation!
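To make the augmented graph concrete, here is a minimal sketch of one way to build it. It assumes that each attribute value becomes one extra node and that every data node is linked to its attribute node with unit weight; the function name `augment`, the unit weights and the toy input are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def augment(adj, node_attr):
    """Return (augmented adjacency matrix, attribute -> node index).

    adj       : n x n symmetric adjacency matrix of the data graph
    node_attr : list of length n, node_attr[i] is the attribute of data node i
    Assumption: one attribute node per distinct attribute value, linked
    to each data node that carries it with weight 1.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    attrs = sorted(set(node_attr))
    # Attribute nodes are appended after the n data nodes.
    attr_index = {a: n + k for k, a in enumerate(attrs)}
    m = n + len(attrs)
    aug = np.zeros((m, m))
    aug[:n, :n] = adj                      # keep the original edges
    for i, a in enumerate(node_attr):
        j = attr_index[a]
        aug[i, j] = aug[j, i] = 1.0        # data node <-> attribute node
    return aug, attr_index

# Toy data graph (hypothetical): a path 0 - 1 - 2 with job titles.
toy_adj = [[0, 1, 0],
           [1, 0, 1],
           [0, 1, 0]]
toy_attr = ["SEC", "CEO", "SEC"]
aug, attr_index = augment(toy_adj, toy_attr)
```

With attribute nodes in the graph, "how close is data node i to attribute a" becomes an ordinary node-to-node proximity question.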
G-Ray: quick overview (for loop-query) Step 1: SF Step 2: NE Step 3: BR Step 4: NE Step 5: BR Step 6: NE Step 7: BR Step 8: BR SF: Seed-Finder NE: Neighborhood-Expander BR: Bridge
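The step sequence above can be reproduced with a small scheduling sketch. It assumes that, after the seed is found, query edges are visited one at a time starting from already-instantiated nodes: a new query node triggers NE followed by BR, while an edge between two instantiated nodes needs only BR. The function name `gray_schedule`, the edge order and the exact shape of the loop query are assumptions for illustration.

```python
def gray_schedule(query_edges, seed):
    """List the SF / NE / BR steps for a connected query graph."""
    instantiated = {seed}
    steps = [("SF", seed)]                 # Seed-Finder picks the first node
    remaining = list(query_edges)
    while remaining:
        # Take the first edge that touches an already-instantiated node.
        for i, (u, v) in enumerate(remaining):
            if u in instantiated or v in instantiated:
                break
        else:
            raise ValueError("query graph is not connected")
        u, v = remaining.pop(i)
        if u not in instantiated:
            u, v = v, u                    # make u the instantiated endpoint
        if v not in instantiated:
            steps.append(("NE", v))        # Neighborhood-Expander for new node
            instantiated.add(v)
        steps.append(("BR", (u, v)))       # Bridge connects the two nodes
    return steps

# Hypothetical four-node loop query over the job titles used in the talk.
loop_edges = [("SEC", "CEO"), ("CEO", "Accountant"),
              ("Accountant", "Manager"), ("Manager", "SEC")]
steps = gray_schedule(loop_edges, seed="SEC")
```

For this loop the sketch yields SF, NE, BR, NE, BR, NE, BR, BR, the same eight steps listed on the slide; the last edge closes the loop and therefore needs only a Bridge.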
Seed-Finder (SF) Q: How to instantiate the SEC node? A: Node `11’ is close to some unknown data nodes for `CEO’, `Accountant’ and `Manager’
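One way to read the Seed-Finder answer is as a scoring rule over candidate data nodes. The sketch below assumes that closeness to "some unknown data nodes" of an attribute is measured by the proximity from that attribute's node in the augmented graph, and that the per-attribute scores are multiplied. The function name `find_seed`, the product rule and all numbers are hypothetical.

```python
from math import prod

def find_seed(prox_from_attr, candidates):
    """Pick the candidate that is close to all the other query attributes.

    prox_from_attr : {attribute: {data node: proximity from the attribute node}}
    candidates     : data nodes that carry the seed's own attribute
    """
    scores = {
        c: prod(prox[c] for prox in prox_from_attr.values())
        for c in candidates
    }
    best = max(scores, key=scores.get)
    return best, scores

# Made-up proximity values for two candidate SEC nodes, 11 and 5.
toy_prox = {
    "CEO":        {11: 0.20, 5: 0.10},
    "Accountant": {11: 0.30, 5: 0.40},
    "Manager":    {11: 0.10, 5: 0.05},
}
seed, seed_scores = find_seed(toy_prox, candidates=[11, 5])
```

The product rewards candidates that are reasonably close to every other attribute, rather than very close to just one of them.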
Neighborhood-Expander (NE) Q: How to instantiate the CEO node? A: [Figure: Step 1 to Step 6]
Bridge (BR) Step 6: NE Step 7: BR Q: ? A: Prim-like Alg. To maximize Should block nodes 11 and 7 Connection subgraph, or one single path?
Roadmap Introduction (Problem Definition, Motivation) How to: Graph X-Ray Experimental Results Conclusion Now, let’s see some experimental results.
Experimental Results Datasets: DBLP Node: author (315k) Edge: co-authorship (1,800k) Attribute: conference & year (13k), e.g. KDD-2001, SIGMOD… We use DBLP to construct an attributed graph, where the nodes are authors and the attribute is conference and year. The edges are constructed from co-authorship relationships.
Effectiveness: star-query Here is a star-query: we want to find a star-shaped group of co-authors, with one author coming from each of PODS, IAT and ISBMS. We see Dr. Philip S. Yu in the center, with the remaining matching authors being well-known domain experts in each conference. Query Result
Effectiveness: line-query And here is a line query: we want to find authors from 4 different conferences who cooperate in a line fashion. Result
Effectiveness: loop-query And this is a loop query. Result
Efficiency [Plot: response time vs. # of edges] Scale linearly Small slope: 3-5 seconds at ~2 M edges
Roadmap Introduction (Problem Definition, Motivation) How to: Graph X-Ray Experimental Results Conclusion
Conclusion Graph X-Ray (G-Ray) Best-effort pattern matching in large attributed graphs Scale linearly with small slope More details in Poster Session: Monday (tonight), board number 8
G-Ray X-Ray www.cs.cmu.edu/~htong Thank you!
Backup-slides
Proximity on Graph [Figure: 12-node example graph] a.k.a. relevance, closeness Multi-faceted Punish long path Edge weight How to: random walk with restart Now, I will introduce some key concepts behind G-Ray. Once we have these key concepts, the alg. itself is quite straightforward. So, the first one is the proximity on the graph. How can we measure the proximity, or in other words, the relevance, the closeness, between two nodes on the graph? Without going into the details, I want to claim that random walk with restart is a good solution for this problem.
Random walk with restart [Figure: 12-node example graph and its ranking vector, one score per node from Node 1 to Node 12] Nearby nodes, higher scores More red, more relevant
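Random walk with restart can be sketched in a few lines. The version below solves the steady-state equation r = (1 - c) W r + c e directly, where W is the column-normalized adjacency matrix, e is the indicator vector of the query node and c is the restart probability. The function name `rwr`, the value c = 0.15 and the toy path graph are assumptions for illustration, and a dense solve like this is only meant for small graphs.

```python
import numpy as np

def rwr(adj, query, restart=0.15):
    """Ranking vector of random walk with restart from node `query`."""
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    if np.any(col_sums == 0):
        raise ValueError("every node needs at least one edge")
    w = adj / col_sums                     # w[i, j] = P(step from j to i)
    n = adj.shape[0]
    e = np.zeros(n)
    e[query] = 1.0                         # the walker restarts at the query node
    # Solve (I - (1 - c) W) r = c e for the steady-state scores.
    return np.linalg.solve(np.eye(n) - (1.0 - restart) * w, restart * e)

# Toy path graph 0 - 1 - 2 - 3 (hypothetical), query node 0.
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
scores = rwr(path, query=0)
```

The scores sum to one, and on this path they fall off from node 1 to node 3, which is the "nearby nodes, higher scores" behavior. Longer paths are punished because each extra step multiplies in another factor of (1 - c).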
How to rank the results Our goodness function: measure the proximity between any two matching nodes that are required to be connected (two-way), then multiply these proximities together. In G-Ray, we approximately optimize this goodness function. If we have multiple matching subgraphs, we can rank them according to this goodness function.
How to rank the results Goodness = Prox (12, 4) x Prox (4, 12) x Prox (7, 4) x Prox (4, 7) x Prox (11, 7) x Prox (7, 11) x Prox (12, 11) x Prox (11, 12), where 4, 7, 11 and 12 are the matching nodes
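The goodness function translates directly into code. The sketch below multiplies the two-way proximities over every required connection and uses the result to rank candidate matches. The function names and the proximity values are made up for illustration; in practice Prox would come from a proximity measure such as random walk with restart.

```python
from math import prod

def goodness(prox, required_edges):
    """Product of Prox(u, v) x Prox(v, u) over all required connections."""
    return prod(prox[u][v] * prox[v][u] for u, v in required_edges)

def rank_matches(prox, matches):
    """Sort candidate matches from highest to lowest goodness."""
    return sorted(matches, key=lambda edges: goodness(prox, edges), reverse=True)

# Hypothetical proximities: 0.5 between the matching nodes of the slide,
# 0.1 for any pair that involves the extra node 13.
nodes = [4, 7, 11, 12, 13]
prox = {u: {v: (0.1 if 13 in (u, v) else 0.5) for v in nodes if v != u}
        for u in nodes}

match_a = [(12, 4), (7, 4), (11, 7), (12, 11)]    # connections from the slide
match_b = [(12, 13), (7, 13), (11, 7), (12, 11)]  # made-up alternative match
ranked = rank_matches(prox, [match_b, match_a])
```

Because the score is a product, a single weak connection pulls the whole match down, so `match_a` ranks above `match_b` here.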