Neighborhood Based Fast Graph Search in Large Networks
Arijit Khan, Nan Li, Xifeng Yan, Ziyu Guan, Computer Science, UC Santa Barbara, {arijitkhan, nanli, xyan, ziyuguan}@cs.ucsb.edu
Supriyo Chakraborty, UC Los Angeles, supriyo@ee.ucla.edu
Shu Tao, IBM TJ Watson, shutao@us.ibm.com

Motivation (RDF Query)
- Which actors have appeared in both a “John Waters” movie and a “Steven Spielberg” movie?
- Writing a SPARQL query requires knowing how the entities are connected in the graph data.
ER Diagram: entities Director, Movie, Actor; attributes Name, Title; relationships direct, act.
SPARQL Query:
SELECT ?actorName WHERE {
  ?actor ?actorName.
  ?director1 “S. Spielberg”.
  ?director1Movie ?actor; ?director1.
  ?director2 “J. Waters”.
  ?director2Movie ?actor; ?director2.
}

RDF Query
Query Graph: J. Waters, S. Spielberg, and an unknown node "?".
Matching Subgraph: J. Waters, S. Spielberg, Darren E. Burrows, Amistad, Cry-Baby.
- How the entities are connected is less important than how closely they are connected.
ER Diagram: entities Director, Movie, Actor; attributes Name, Title; relationships direct, act.

Approximate Graph Matching
- Find the athlete who is from ‘Romania’ and won ‘gold’ in ‘3000m’ and ‘bronze’ in ‘1500m’ in the ‘1984’ Olympics.
Query Graph: ?, Romania, Gold, 3000m, Bronze, 1500m, 1984.
Matching Subgraph: Maricica Puica, Romania, Gold, 3000m, Bronze, 1500m, 1984.
- Graph Edit Distance: 7
- # Missing Edges: 4
- Maximum Common Subgraph Size: 3
- Still a close approximate match of the query graph.

Graph Alignment
- Align the nodes of two graphs based on their attributes.
- Name Disambiguation and Database Schema Matching.
Figure: Graph Alignment (LinkedIn, Twitter).

Roadmap
- Problem Formulation
- Search Algorithm
- Indexing
- Query Optimization
- Experimental Results
- Conclusion

Problem Formulation: Difficulties with the # of Edge Mismatch or Graph Edit Distance
Figure: query graph Q with labels a, b, c; target graph G with labels a, b, c containing the embeddings f1 and f2.
- # Missing Edges: 1 (for both f1 and f2)
- Graph Edit Distance: 2 (for f1), 1 (for f2)
- f1 is a better match than f2 considering the proximity of the labels.
- Graph Edit Distance and # of Missing Edges are not scalable for large graphs.

Problem Formulation: Problem with Shape Preserving Approx. Query Matching
- Approximate query matching techniques that preserve the shape of the query graph might not be appropriate.
- If two labels are close in the query graph, they should also be close in the matching subgraph.

A Good Subgraph Matching Algorithm Should Have …
- If the query graph Q is subgraph isomorphic to the target graph G, then the cost of matching Q in G must be 0.
- The farther apart the labels are in G compared to Q, the higher the cost of matching.
Problem with Random Walk Based Methods
- Random walk based models (e.g., Personalized PageRank) do not satisfy these requirements.
Random Walk Probabilities:
                  G      Q
Green → Yellow    0.75   0.67
Green → Blue      0.25   0.33

Information Propagation Model
- Convert the label distribution in the neighborhood of each node u into a multi-dimensional vector R(u) = { }.
Example of Neighborhood Vectorization (h = 2, α = 0.5):
- R_Q(v1) = { }, R_Q(v2) = { }
- R_f1(u1) = { }, R_f1(u2) = { }
- R_f2(u1) = { }, R_f2(u'2) = { }
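The vectorization step can be illustrated with a minimal Python sketch. The exact propagation formula is not preserved in this transcript, so the sketch assumes that every node v at distance d (1 ≤ d ≤ h) from u adds α^d to the strength of each of v's labels in R(u). The function name `neighborhood_vectors`, the optional `keep` argument (which restricts the nodes allowed to contribute labels, as when vectors are computed with respect to an embedding f), and the toy graph are all hypothetical.

```python
from collections import deque

def neighborhood_vectors(adj, labels, h, alpha, keep=None):
    """Return {node: {label: strength}} from BFS distances up to h hops.

    Assumed rule (not taken verbatim from the slides): a node v at
    distance d, 1 <= d <= h, adds alpha**d to every label it carries.
    If `keep` is given, only nodes in `keep` contribute their labels.
    """
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        vec = {}
        while queue:
            node = queue.popleft()
            if dist[node] == h:
                continue  # do not expand beyond h hops
            for nxt in adj[node]:
                if nxt in dist:
                    continue  # already reached by a shortest path
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
                if keep is None or nxt in keep:
                    for label in labels[nxt]:
                        vec[label] = vec.get(label, 0.0) + alpha ** dist[nxt]
        result[source] = vec
    return result

# Hypothetical toy graph: u1(a)-u2(b) and u1(a)-x(c)-u2p(b)
adj = {"u1": ["u2", "x"], "u2": ["u1"], "x": ["u1", "u2p"], "u2p": ["x"]}
labels = {"u1": {"a"}, "u2": {"b"}, "x": {"c"}, "u2p": {"b"}}

full = neighborhood_vectors(adj, labels, h=2, alpha=0.5)
restricted = neighborhood_vectors(adj, labels, h=2, alpha=0.5, keep={"u1", "u2p"})
print(full["u1"])        # label b is seen at distance 1 and at distance 2
print(restricted["u1"])  # only u2p (distance 2) contributes
```

Distances are always measured in the whole graph; `keep` only decides whose labels are counted, which is one plausible reading of why R_f1(u1) and R_f2(u1) differ for the same node u1.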

Problem Definition
- Neighborhood Based Cost Function: the positive difference between the neighborhood vectors.
Example (h = 2, α = 0.5):
- R_Q(v1) = { }, R_Q(v2) = { }
- R_f1(u1) = { }, R_f1(u2) = { }
- R_f2(u1) = { }, R_f2(u'2) = { }
- C_N(f1) = 0
- C_N(f2) = (0.5 - 0.25) + (0.5 - 0.25) = 0.5
- Neighborhood Based Top-k Similarity Search: Given a target graph G and a query graph Q, find the top-k embeddings with respect to the cost C_N.
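The cost arithmetic above can be reproduced with a short Python sketch. It assumes the cost sums, over every query node v and every label l in R_Q(v), the amount by which the query strength exceeds the strength at the matched node f(v). The vector entries below are not from the slides (they did not survive in this transcript); they are hypothetical values, with assumed labels a and b, chosen only to be consistent with the printed sums. The function names are also hypothetical.

```python
def positive_difference(x, y):
    """Return x - y when the query strength x exceeds y, otherwise 0."""
    return x - y if x > y else 0.0

def neighborhood_cost(rq, rf, f):
    """Sum positive differences between R_Q(v) and R_f(f(v)) over all v."""
    total = 0.0
    for v, u in f.items():
        for label, strength in rq[v].items():
            total += positive_difference(strength, rf[u].get(label, 0.0))
    return total

# Hypothetical vectors consistent with C_N(f1) = 0 and C_N(f2) = 0.5
rq = {"v1": {"b": 0.5}, "v2": {"a": 0.5}}
rf1 = {"u1": {"b": 0.5}, "u2": {"a": 0.5}}     # labels one hop apart
rf2 = {"u1": {"b": 0.25}, "u2p": {"a": 0.25}}  # labels two hops apart

cost_f1 = neighborhood_cost(rq, rf1, {"v1": "u1", "v2": "u2"})
cost_f2 = neighborhood_cost(rq, rf2, {"v1": "u1", "v2": "u2p"})
print(cost_f1, cost_f2)
```

Because only positive differences count, extra labels or stronger labels around a target node never increase the cost; only labels that are missing or farther away than in the query do.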

Cost Function Properties
- For an exact embedding f_e, C_N(f_e) = 0.
- The Neighborhood Based Cost Function can have false positives.
- Given a graph G and a query graph Q, if each of their nodes has a distinct label, then for any inexact embedding f, C_N(f) > 0 for all h > 0, α > 0.
Figure: False Positive, C_N(f) = 0, for h = 1.
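A false positive of this kind can be reproduced on a small hypothetical example (not the one in the slide's figure). The sketch below assumes the same α^d propagation rule and positive-difference cost as elsewhere in these slides; the helper names and both graphs are illustrative. The query is a 6-cycle labeled a-b-c-a-b-c, and the target is two disjoint triangles labeled a-b-c, so no exact embedding exists, yet the cost is 0 for h = 1 and becomes positive for h = 2.

```python
from collections import deque

def vectors(adj, labels, h, alpha):
    """Assumed rule: a node at distance d <= h adds alpha**d per label."""
    out = {}
    for src in adj:
        dist, queue, vec = {src: 0}, deque([src]), {}
        while queue:
            x = queue.popleft()
            if dist[x] == h:
                continue
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
                    for label in labels[y]:
                        vec[label] = vec.get(label, 0.0) + alpha ** dist[y]
        out[src] = vec
    return out

def cost(rq, rg, f):
    """Sum of positive differences between query and target strengths."""
    return sum(max(0.0, s - rg[f[v]].get(label, 0.0))
               for v in f for label, s in rq[v].items())

# Query: 6-cycle qa1-qb1-qc1-qa2-qb2-qc2-qa1 (second character is the label)
q_nodes = ["qa1", "qb1", "qc1", "qa2", "qb2", "qc2"]
q_adj = {q_nodes[i]: [q_nodes[i - 1], q_nodes[(i + 1) % 6]] for i in range(6)}
q_labels = {n: {n[1]} for n in q_nodes}

# Target: two disjoint triangles (ga1, gb1, gc1) and (ga2, gb2, gc2)
g_adj = {}
for k in ("1", "2"):
    tri = ["ga" + k, "gb" + k, "gc" + k]
    for n in tri:
        g_adj[n] = [m for m in tri if m != n]
g_labels = {n: {n[1]} for n in g_adj}

f = {q: "g" + q[1:] for q in q_nodes}  # label-preserving, but not an isomorphism

cost_h1 = cost(vectors(q_adj, q_labels, 1, 0.5), vectors(g_adj, g_labels, 1, 0.5), f)
cost_h2 = cost(vectors(q_adj, q_labels, 2, 0.5), vectors(g_adj, g_labels, 2, 0.5), f)
print(cost_h1, cost_h2)
```

With h = 1 every node in both graphs sees exactly one neighbor of each other label, so the vectors agree; with h = 2 the cycle sees additional labels at distance 2 that the triangles cannot supply.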

Cost Function Properties
- Neighborhood Based Top-k Similarity Search is NP-hard.
- Given two graphs Q and G with the same number of nodes, it can be determined in polynomial time whether G itself is an embedding f of Q with C_N(f) = 0.

Search Algorithm
- Step 1: Match a node u of the target graph G with some node v of the query graph Q if L(v) ⊆ L(u) and cost(u, v) does not exceed a predefined cost threshold ε.
- Step 2: Discard the labels of the unmatched nodes in the target graph.
- Step 3: Propagate the labels only among the matched nodes from the previous step. Repeat steps 1 and 2 until no node can be discarded further.
Example (h = 1, α = 0.5, ε = 0; query graph Q with nodes v1 to v4, target graph G with nodes u1 to u6):
- 1st Round: cost(u1, v1) = 0, cost(u5, v1) = 0, cost(u2, v3) = 0.5, ...
  match(v1) = {u1, u5}, match(v2) = {u3}, match(v3) = {u6}, match(v4) = {u4}
- 2nd Round: cost(u1, v1) = 0.5, cost(u5, v1) = 0, ...
  match(v1) = {u5}, match(v2) = {u3}, match(v3) = {u6}, match(v4) = {u4}
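The three steps can be sketched in Python as an iterative pruning loop. This is a minimal illustration under stated assumptions, not the authors' implementation: label strengths are assumed to be sums of α^d, cost(u, v) is assumed to be the sum of positive differences between R_Q(v) and R_G(u), and a node matches when its cost is at most ε (as in the example, where ε = 0 and cost 0 matches). All function names and the toy graphs are hypothetical.

```python
from collections import deque

def vectors(adj, labels, h, alpha, keep):
    """Neighborhood vectors; only nodes in `keep` contribute labels."""
    out = {}
    for src in adj:
        dist, queue, vec = {src: 0}, deque([src]), {}
        while queue:
            x = queue.popleft()
            if dist[x] == h:
                continue
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
                    if y in keep:
                        for label in labels[y]:
                            vec[label] = vec.get(label, 0.0) + alpha ** dist[y]
        out[src] = vec
    return out

def node_cost(rq_v, rg_u):
    """Positive difference between one query vector and one target vector."""
    return sum(max(0.0, s - rg_u.get(label, 0.0)) for label, s in rq_v.items())

def iterative_match(q_adj, q_labels, g_adj, g_labels, h, alpha, eps):
    """Return (final match sets, per-round history of match sets)."""
    rq = vectors(q_adj, q_labels, h, alpha, keep=set(q_adj))
    keep, match, history = set(g_adj), None, []
    while True:
        # Steps 2 and 3: labels propagate only from nodes still matched.
        rg = vectors(g_adj, g_labels, h, alpha, keep)
        # Step 1: label containment plus the cost threshold.
        new_match = {
            v: {u for u in keep
                if q_labels[v] <= g_labels[u] and node_cost(rq[v], rg[u]) <= eps}
            for v in q_adj
        }
        history.append(new_match)
        if new_match == match:
            return match, history  # nothing more can be discarded
        match = new_match
        keep = set().union(*new_match.values())

# Hypothetical query: path v1(a)-v2(b)-v3(c)
q_adj = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2"]}
q_labels = {"v1": {"a"}, "v2": {"b"}, "v3": {"c"}}
# Hypothetical target: path u1(a)-u2(b)-u3(c) plus a separate edge u4(a)-u5(b)
g_adj = {"u1": ["u2"], "u2": ["u1", "u3"], "u3": ["u2"], "u4": ["u5"], "u5": ["u4"]}
g_labels = {"u1": {"a"}, "u2": {"b"}, "u3": {"c"}, "u4": {"a"}, "u5": {"b"}}

final, history = iterative_match(q_adj, q_labels, g_adj, g_labels, h=1, alpha=0.5, eps=0.0)
print(history[0]["v1"], final["v1"])
```

In this toy run u4 survives the first round because its neighbor u5 carries label b, but u5 itself is unmatched; once u5's label is discarded, u4 no longer sees b and is pruned in the second round, which parallels how u1 drops out of match(v1) in the example above.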

Indexing
- Index the neighborhood vectors for the first round of matching.
- Two Types of Indexing:
  - Label Based (Hashing of Node Labels)
  - Neighborhood Based
Figure (h = 2, α = 0.5, ε = 0): query graph Q (nodes v1 to v4) and target graph G (nodes u1 to u6, labels a, b, c); neighborhood vectors R_Q(v1) and R_G(u1) to R_G(u6); index structure holding, for each label, a list of nodes sorted by label strength and searched with the Threshold Algorithm; a candidate with cost = 0 is kept, a candidate with cost = 0.25 > ε is pruned.
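A per-label sorted index of this kind can be sketched in Python as follows. Note that the lookup here is a simpler prefix scan over each sorted list followed by an exact cost check, not the Threshold Algorithm named on the slide. It relies on the observation that, if the total positive-difference cost is at most ε, then each label's individual shortfall is also at most ε. The function names and the vector values are hypothetical.

```python
def build_index(vectors):
    """Map each label to a list of (strength, node), strongest first."""
    index = {}
    for node, vec in vectors.items():
        for label, strength in vec.items():
            index.setdefault(label, []).append((strength, node))
    for postings in index.values():
        postings.sort(key=lambda p: (-p[0], p[1]))
    return index

def cost(query_vec, target_vec):
    """Sum of positive differences between query and target strengths."""
    return sum(max(0.0, s - target_vec.get(label, 0.0))
               for label, s in query_vec.items())

def candidates(index, vectors, query_vec, eps):
    """Nodes whose cost against query_vec is at most eps."""
    survivors = None
    for label, need in query_vec.items():
        if need <= eps:
            continue  # this label may be absent without exceeding eps
        hits = set()
        for strength, node in index.get(label, []):
            if need - strength > eps:
                break  # list is sorted, so the rest fall short as well
            hits.add(node)
        survivors = hits if survivors is None else survivors & hits
    if survivors is None:
        survivors = set(vectors)
    # Final exact check, since several small shortfalls can add up.
    return {u for u in survivors if cost(query_vec, vectors[u]) <= eps}

# Hypothetical target vectors and query vector
g_vectors = {
    "u1": {"a": 1.0, "b": 0.5},
    "u2": {"a": 0.5, "b": 0.75},
    "u3": {"b": 0.25},
}
index = build_index(g_vectors)
query_vec = {"a": 0.75, "b": 0.5}

strict = candidates(index, g_vectors, query_vec, eps=0.0)
relaxed = candidates(index, g_vectors, query_vec, eps=0.25)
print(strict, relaxed)
```

Sorting each list by descending strength lets the scan stop at the first node that falls short, which is the property that makes early termination possible in threshold-style algorithms as well.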

Dynamic Update
- Insertion or deletion of nodes or edges incurs local changes in the neighborhood vectors of only a few nodes.
- The index structure consists of sorted lists of nodes, ordered by the label association values in their neighborhood vectors.
- The index can be implemented using a priority queue, which makes local updates easy to perform.

Query Optimization
- Non-discriminative labels increase the number of node matches in the initial rounds of the search algorithm.
- Eliminate non-discriminative labels initially; add them in the final stage of the search algorithm.
- Labels with a heavy-head distribution are more discriminative than those with a heavy-tail distribution.
Figure: A_u(l) plotted against |u| for a Heavy Head (Discriminative) Distribution and a Heavy Tail (Non-Discriminative) Distribution, with regions marked Pruned and Not Pruned.

Experimental Results
Data Sets:
             # of Nodes   # of Edges   # of Labels   Avg. # of Labels/Node
FreeBase     172,015      579,869      159,514       1
Intrusion    200,858      703,020      1,000         25
DBLP         684,911      7,764,604    683,927       1
WebGraph     10M          213M         10,000        1
Efficiency:
                             FreeBase    Intrusion   DBLP         WebGraph
2-hop Indexing (Off-line)    280.0 sec   227.0 sec   1733.0 sec   5,125.0 sec
Top-1 Search* (On-line)      0.06 sec    1.6 sec     0.02 sec     0.11 sec
*Query graph is a subgraph of the target graph; # of nodes in Query Graph = 50

Robustness Results
- Error Ratio: # of incorrectly identified nodes of the target graph in all top-1 matches divided by the # of nodes in all the query graphs in a query set.
- Noise Ratio: # of edges added divided by the total number of nodes in the query graphs.
Figure: Robustness Results (FreeBase); Diameter 2 ≡ 100 nodes, Diameter 3 ≡ 150 nodes, Diameter 4 ≡ 200 nodes.
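The two ratios can be computed as in the following Python sketch. It assumes each query graph comes with a known ground-truth mapping from query nodes to target nodes (plausible when queries are extracted from the target graph and then perturbed, though the slides do not spell out the procedure). The function names and sample data are hypothetical.

```python
def error_ratio(results):
    """results: list of (truth, found) mappings, one pair per query graph.

    Counts query nodes whose top-1 match differs from the ground truth,
    divided by the total number of query nodes in the query set.
    """
    wrong = sum(1 for truth, found in results
                for v, u in truth.items() if found.get(v) != u)
    total = sum(len(truth) for truth, _ in results)
    return wrong / total

def noise_ratio(edges_added, query_sizes):
    """Edges added as noise divided by the total number of query nodes."""
    return edges_added / sum(query_sizes)

# Hypothetical query set with two query graphs of two nodes each
results = [
    ({"v1": "u1", "v2": "u2"}, {"v1": "u1", "v2": "u9"}),  # one wrong node
    ({"v3": "u3", "v4": "u4"}, {"v3": "u3", "v4": "u4"}),  # all correct
]
err = error_ratio(results)
noise = noise_ratio(edges_added=25, query_sizes=[50, 50])
print(err, noise)
```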

Convergence Results
- Noise Ratio: # of edges added divided by the total number of nodes in the query graphs.
Figure: Convergence Results (DBLP); Diameter 2 ≡ 100 nodes, Diameter 3 ≡ 150 nodes, Diameter 4 ≡ 200 nodes.

Scalability Results
Figure: Scalability Results (WebGraph).
- Query graph is a subgraph of the target graph.
- # of nodes in Query Graph = 50
- Indexing is performed for h = 2 hops.

Conclusion
- A new measure of graph similarity based on neighborhood structure.
- An Information Propagation Model that converts a large graph into multi-dimensional vectors.
- An efficient and scalable search algorithm based on iterative pruning, using the neighborhood vectors.
- Efficient indexing and query optimization techniques.
- How to match the labels when they are not exactly the same in the two graphs?
