Neighborhood Based Fast Graph Search in Large Networks. Arijit Khan, Nan Li, Xifeng Yan, Ziyu Guan, Computer Science, UC Santa Barbara {arijitkhan,

Presentation transcript:

Neighborhood Based Fast Graph Search in Large Networks
Arijit Khan, Nan Li, Xifeng Yan, Ziyu Guan, Computer Science, UC Santa Barbara {arijitkhan, nanli, xyan,
Supriyo Chakraborty, UC Los Angeles
Shu Tao, IBM TJ Watson

Motivation (RDF Query)
- Which actors have appeared in both a “John Waters” movie and a “Steven Spielberg” movie?
- Writing a SPARQL query requires knowing how the entities are connected in the graph data.
SPARQL query: SELECT ?actorName WHERE { ?actor ?actorName. ?director1 “S. Spielberg”. ?director1Movie ?actor; ?director1. ?director2 “J. Waters”. ?director2Movie ?actor; ?director2. }
[Figure: ER diagram with Director, Movie, and Actor entities; labels Name, Title, direct, act]

RDF Query
- How the entities are connected is less important than how closely they are connected.
[Figure: query graph connecting an unknown node "?" to J. Waters and S. Spielberg; matching subgraph with J. Waters, S. Spielberg, Darren E. Burrows, Amistad, and Cry-Baby; ER diagram with Director, Movie, and Actor entities]

Approximate Graph Matching
- Find the athlete who is from ‘Romania’ and won ‘gold’ in ‘3000m’ and ‘bronze’ in ‘1500m’ in the ‘1984’ Olympics.
- Graph Edit Distance: 7
- # Missing Edges: 4
- Maximum Common Subgraph Size: 3
- Still a close approximate match of the query graph!
[Figure: query graph with labels Romania, Gold, Bronze, 1500m and an unknown athlete node "?"; matching subgraph in which the athlete is Maricica Puica]

Graph Alignment
- Align the nodes of two graphs based on their attributes.
- Name Disambiguation and Database Schema Matching.
[Figure: graph alignment between a LinkedIn network and a Twitter network]

Roadmap
- Problem Formulation
- Search Algorithm
- Indexing
- Query Optimization
- Experimental Results
- Conclusion

Problem Formulation: Difficulties with the # of Edge Mismatch or Graph Edit Distance
- # Missing Edges: 1 (both for f1 and f2)
- Graph Edit Distance: 2 (for f1), 1 (for f2)
- f1 is a better match than f2 considering the proximity of the labels.
- Graph edit distance and # of missing edges are not scalable for large graphs.
[Figure: query graph Q and target graph G with node labels a, b, c, showing the two embeddings f1 and f2]

Problem Formulation: Problem with Shape-Preserving Approximate Query Matching
- Approximate query matching techniques that preserve the shape of the query graph might not be appropriate.
- If two labels are close in the query graph, they should also be close in the matching subgraph.

A Good Subgraph Matching Algorithm Should Have ...
- If the query graph Q is subgraph isomorphic to the target graph G, then the cost of matching Q in G must be 0.
- The farther apart the labels are in G compared to Q, the higher the cost of matching.
Problem with Random Walk Based Methods
- Random walk based models (e.g., Personalized PageRank) do not satisfy these requirements.
[Figure: random walk probabilities Green → Yellow and Green → Blue in G and Q, with embedding f]

Information Propagation Model
- Convert the label distribution in the neighborhood of each node u into a multi-dimensional vector R(u) = { }.
Example of Neighborhood Vectorization
- h = 2, α = 0.5
- R_Q(v1) = { }, R_Q(v2) = { }
- R_f1(u1) = { }, R_f1(u2) = { }
- R_f2(u1) = { }, R_f2(u'2) = { }
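
A minimal Python sketch of neighborhood vectorization. The transcript does not preserve the propagation formula, so this sketch assumes that a label on a node at shortest-path distance i (1 ≤ i ≤ h) from u adds α^i to u's strength for that label. The function name `neighborhood_vector`, the adjacency-dict graph format, and the three-node example graph are illustrative and not taken from the slides.

```python
from collections import defaultdict, deque

def neighborhood_vector(adj, labels, u, h=2, alpha=0.5):
    """Return {label: strength} for the h-hop neighborhood of u.

    Assumption: a label at shortest-path distance i (1 <= i <= h)
    contributes alpha**i; the node's own labels are not included.
    """
    dist = {u: 0}
    queue = deque([u])
    vec = defaultdict(float)
    while queue:
        x = queue.popleft()
        if dist[x] == h:          # do not expand beyond h hops
            continue
        for y in adj[x]:
            if y not in dist:     # first visit gives the shortest distance
                dist[y] = dist[x] + 1
                queue.append(y)
                for lab in labels[y]:
                    vec[lab] += alpha ** dist[y]
    return dict(vec)

# Hypothetical path graph: n1 - n2 - n3
adj = {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2"]}
labels = {"n1": {"a"}, "n2": {"b"}, "n3": {"c"}}
print(neighborhood_vector(adj, labels, "n1"))  # {'b': 0.5, 'c': 0.25}
```

Under this assumption a closer label always carries more weight than a farther one, which is the proximity property the earlier slides ask for.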

Problem Definition: Neighborhood Based Cost Function
- Neighborhood Based Cost Function: positive difference between the neighborhood vectors.
- h = 2, α = 0.5
- R_Q(v1) = { }, R_Q(v2) = { }
- R_f1(u1) = { }, R_f1(u2) = { }
- R_f2(u1) = { }, R_f2(u'2) = { }
- C_N(f1) = 0
- C_N(f2) = ( ) + ( ) = 0.5
- Neighborhood Based Top-k Similarity Search: Given a target graph G and a query graph Q, find the top-k embeddings with respect to cost C_N.
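
A short Python sketch of a positive-difference cost. It assumes that the cost of mapping a query node to a target node is the sum, over the labels in the query node's vector, of the amount by which the query strength exceeds the target strength, and that C_N(f) adds these node costs over all query nodes. The names `node_cost` and `embedding_cost` and the vector values are hypothetical; they are not the values from the slide's example.

```python
def node_cost(r_query, r_target):
    """Sum of positive differences: how much of the query node's
    neighborhood strength is missing around the target node."""
    return sum(max(0.0, s - r_target.get(lab, 0.0))
               for lab, s in r_query.items())

def embedding_cost(f, R_Q, R_f):
    """C_N(f) as assumed here: total node cost over the query nodes,
    where f maps query node -> target node."""
    return sum(node_cost(R_Q[v], R_f[u]) for v, u in f.items())

# Hypothetical neighborhood vectors (not the slide's values)
R_Q = {"v1": {"b": 0.5, "c": 0.25}, "v2": {"a": 0.5, "c": 0.5}}
R_f = {"u1": {"b": 0.5, "c": 0.25}, "u2": {"a": 0.5, "c": 0.25}}
f = {"v1": "u1", "v2": "u2"}
print(embedding_cost(f, R_Q, R_f))  # 0.25
```

Only shortfalls are counted: extra labels or extra strength around the target node do not raise the cost, so a query that embeds exactly gets cost 0.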

Cost Function Properties
- For an exact embedding f_e, C_N(f_e) = 0.
- The neighborhood based cost function can have false positives.
- Given a graph G and a query graph Q, if each of their nodes has a distinct label, then for any inexact embedding f, C_N(f) > 0 for all h > 0, α > 0.
[Figure: a false positive with C_N(f) = 0 for h = 1]

Cost Function Properties
- Neighborhood Based Top-k Similarity Search is NP-hard.
- Given two graphs Q and G with the same number of nodes, it can be determined in polynomial time whether G itself is an embedding f of Q with C_N(f) = 0.

Search Algorithm
- Step 1: Match a node u of target graph G with some node v of query graph Q if L(v) ⊆ L(u) and cost(u, v) does not exceed a predefined cost threshold ε.
- Step 2: Discard the labels of the unmatched nodes in the target graph.
- Step 3: Propagate the labels only among the matched nodes from the previous step. Repeat steps 1 and 2 until no node can be discarded further.
Example (h = 1, α = 0.5, ε = 0; query nodes v1 to v4, target nodes u1 to u6, embedding f):
- 1st round: cost(u1, v1) = 0, cost(u5, v1) = 0, cost(u2, v3) = 0.5, ...; match(v1) = {u1, u5}, match(v2) = {u3}, match(v3) = {u6}, match(v4) = {u4}
- 2nd round: cost(u1, v1) = 0.5, cost(u5, v1) = 0, ...; match(v1) = {u5}, match(v2) = {u3}, match(v3) = {u6}, match(v4) = {u4}
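
The three steps can be sketched as an iterative pruning loop in Python. This is an illustration under stated assumptions, not the authors' implementation: labels at distance i are assumed to contribute α^i, a node is kept when its cost is at most ε (as the worked example with ε = 0 suggests), and the graphs, the function names `vectors`, `cost`, and `prune`, and the node names are hypothetical. Only the pruning phase is shown; enumerating the final top-k embeddings from the surviving candidates is not.

```python
from collections import deque

def vectors(adj, labels, alive, h, alpha):
    """Neighborhood vectors, propagating labels only through `alive` nodes.
    Assumption: a label at distance i contributes alpha**i."""
    out = {}
    for u in alive:
        dist, queue, vec = {u: 0}, deque([u]), {}
        while queue:
            x = queue.popleft()
            if dist[x] == h:
                continue
            for y in adj[x]:
                if y in alive and y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
                    for lab in labels[y]:
                        vec[lab] = vec.get(lab, 0.0) + alpha ** dist[y]
        out[u] = vec
    return out

def cost(rq, rg):
    """Positive difference between a query vector and a target vector."""
    return sum(max(0.0, s - rg.get(l, 0.0)) for l, s in rq.items())

def prune(q_adj, q_labels, g_adj, g_labels, h=1, alpha=0.5, eps=0.0):
    RQ = vectors(q_adj, q_labels, set(q_adj), h, alpha)
    alive = set(g_adj)
    while True:
        # Step 3 (and the initial round): propagate among surviving nodes.
        RG = vectors(g_adj, g_labels, alive, h, alpha)
        # Step 1: keep u for v if labels are contained and cost <= eps.
        match = {v: {u for u in alive
                     if q_labels[v] <= g_labels[u]
                     and cost(RQ[v], RG[u]) <= eps}
                 for v in q_adj}
        # Step 2: drop every target node that matched no query node.
        survivors = set().union(*match.values())
        if survivors == alive:
            return match
        alive = survivors

# Hypothetical query: v1(a) - v2(b) - v3(c)
q_adj = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2"]}
q_labels = {"v1": {"a"}, "v2": {"b"}, "v3": {"c"}}
# Hypothetical target: u1(a) - u2(b) - u3(c), plus a decoy u4(a) - u5(b)
g_adj = {"u1": ["u2"], "u2": ["u1", "u3"], "u3": ["u2"],
         "u4": ["u5"], "u5": ["u4"]}
g_labels = {"u1": {"a"}, "u2": {"b"}, "u3": {"c"},
            "u4": {"a"}, "u5": {"b"}}
result = prune(q_adj, q_labels, g_adj, g_labels)
print(result)
```

In this example u5 is discarded in the first round because it has no neighbor labeled c; u4 then loses its only labeled neighbor and is discarded in the second round, mirroring how u1 drops out in the slide's example.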

Indexing
- Index the neighborhood vectors for the first round of matching.
- Two types of indexing:
  - Label based (hashing of node labels)
  - Neighborhood based
[Figure: example with h = 2, α = 0.5, ε = 0 showing query graph Q (nodes v1 to v4), target graph G (nodes u1 to u6), the neighborhood vectors R_Q(v1) and R_G(u1) to R_G(u6), and an index structure of sorted node lists for labels a and b scanned with the Threshold Algorithm; one candidate has cost = 0 and another is rejected with cost = 0.25 > ε]
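
A rough Python sketch of a neighborhood-based index built from per-label sorted lists. It does not implement the Threshold Algorithm named on the slide; it substitutes a simpler filter-and-verify lookup that relies on the same sorted lists. The observation used is that, with a positive-difference cost, a node can only have cost ≤ ε if its strength for every query label is at least the query strength minus ε. The names `build_index` and `candidates` and all vector values are hypothetical.

```python
from bisect import bisect_left

def build_index(R_G):
    """label -> (ascending strengths, nodes in the same order)."""
    by_label = {}
    for u, vec in R_G.items():
        for lab, s in vec.items():
            by_label.setdefault(lab, []).append((s, u))
    index = {}
    for lab, pairs in by_label.items():
        pairs.sort()
        index[lab] = ([s for s, _ in pairs], [u for _, u in pairs])
    return index

def candidates(index, R_G, rq, eps=0.0):
    """Nodes whose positive-difference cost against rq is <= eps."""
    cand = None
    for lab, s in rq.items():
        if s - eps <= 0:
            continue  # even nodes lacking this label may still qualify
        strengths, nodes = index.get(lab, ([], []))
        start = bisect_left(strengths, s - eps)   # first strength >= s - eps
        hits = set(nodes[start:])
        cand = hits if cand is None else cand & hits
    if cand is None:          # no label could filter: check every node
        cand = set(R_G)
    # Verify the exact cost on the filtered candidates only.
    return {u for u in cand
            if sum(max(0.0, s - R_G[u].get(l, 0.0))
                   for l, s in rq.items()) <= eps}

# Hypothetical target vectors and query vector
R_G = {"u1": {"a": 0.5, "b": 0.75},
       "u2": {"a": 1.0, "b": 0.5},
       "u3": {"a": 0.25}}
rq = {"a": 0.5, "b": 0.75}
index = build_index(R_G)
print(candidates(index, R_G, rq))             # {'u1'}
print(sorted(candidates(index, R_G, rq, eps=0.25)))  # ['u1', 'u2']
```

Keeping the lists sorted lets the lookup skip every node that falls short on some label without computing its full cost.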

Dynamic Update
- Insertion or deletion of nodes or edges incurs local changes in the neighborhood vectors of only a few nodes.
- The index structure consists of sorted lists of nodes based on the label association values in their neighborhood vectors.
- The index can be implemented using a priority queue, which makes local updates easy.
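
To illustrate why updates stay local, the following Python sketch lists the nodes whose h-hop neighborhood vector could change when an edge (x, y) is inserted or deleted. It assumes vectors depend only on nodes within h hops, so any affected node must lie within h − 1 hops of x or y; the result is a safe superset, and only those nodes would need their index entries repositioned. The function name `affected_nodes` and the example graph are hypothetical.

```python
from collections import deque

def affected_nodes(adj, x, y, h=2):
    """Nodes within h - 1 hops of x or y (requires h >= 1).

    For a deletion, pass the graph as it was before the edge is removed.
    """
    dist = {x: 0, y: 0}
    queue = deque([x, y])
    while queue:
        n = queue.popleft()
        if dist[n] >= h - 1:      # stop expanding at radius h - 1
            continue
        for m in adj[n]:
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return set(dist)

# Hypothetical path graph n1 - n2 - n3 - n4 - n5 - n6
adj = {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2", "n4"],
       "n4": ["n3", "n5"], "n5": ["n4", "n6"], "n6": ["n5"]}
print(sorted(affected_nodes(adj, "n3", "n4", h=2)))  # ['n2', 'n3', 'n4', 'n5']
```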

Query Optimization
- Non-discriminative labels increase the number of node matches in the initial rounds of the search algorithm.
- Eliminate non-discriminative labels initially; add them in the final stage of the search algorithm.
- Labels with a heavy-head distribution are more discriminative than those with a heavy-tail distribution.
[Figure: A_u(l) plotted against |u| for a heavy-head (discriminative) distribution and a heavy-tail (non-discriminative) distribution, with pruned and not-pruned regions marked]

Experimental Results
Data sets:
Dataset     # of Nodes   # of Edges   # of Labels   Avg. # of Labels/Node
FreeBase    172,015      579,869      159,514       1
Intrusion   200,858      703,020      1,000         25
DBLP        684,911      7,764,604    683,927       1
WebGraph    10M          213M         10,000        1
Efficiency:
                             FreeBase   Intrusion   DBLP       WebGraph
2-hop Indexing (Off-line)    sec        227.0 sec   sec        5,125.0 sec
Top-1 Search* (On-line)      0.06 sec   1.6 sec     0.02 sec   0.11 sec
*Query graph is a subgraph of the target graph; # of nodes in query graph = 50

Robustness Results (FreeBase)
- Error Ratio: # of incorrectly identified nodes of the target graph in all top-1 matches divided by the # of nodes in all the query graphs in a query set.
- Noise Ratio: # of edges added divided by the total number of nodes in the query graphs.
- Query sizes: Diameter 2 ≡ 100 nodes, Diameter 3 ≡ 150 nodes, Diameter 4 ≡ 200 nodes
[Figure: robustness results on FreeBase]
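
The two ratios can be written down as a small Python sketch. It assumes a node counts as incorrectly identified when the top-1 match maps a query node to a target node other than the ground-truth one; the function names and the sample data are hypothetical and do not come from the experiments.

```python
def error_ratio(results):
    """results: list of (truth, found) dicts, each mapping
    query node -> target node for one query graph."""
    wrong = sum(1 for truth, found in results
                for v, u in truth.items() if found.get(v) != u)
    total = sum(len(truth) for truth, _ in results)
    return wrong / total

def noise_ratio(edges_added, query_node_count):
    """Edges added to the query graphs per query node."""
    return edges_added / query_node_count

# Hypothetical query set of two query graphs with two nodes each
results = [
    ({"v1": "u1", "v2": "u2"}, {"v1": "u1", "v2": "u9"}),  # one wrong node
    ({"v3": "u3", "v4": "u4"}, {"v3": "u3", "v4": "u4"}),  # all correct
]
print(error_ratio(results))   # 0.25
print(noise_ratio(20, 100))   # 0.2
```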

Convergence Results (DBLP)
- Noise Ratio: # of edges added divided by the total number of nodes in the query graphs.
- Query sizes: Diameter 2 ≡ 100 nodes, Diameter 3 ≡ 150 nodes, Diameter 4 ≡ 200 nodes
[Figure: convergence results on DBLP]

Scalability Results (WebGraph)
- Query graph is a subgraph of the target graph.
- # of nodes in query graph = 50
- Indexing is performed for h = 2 hops.
[Figure: scalability results on WebGraph]

Conclusion
- New measure of graph similarity based on neighborhood structure.
- Information propagation model to convert a large graph into multi-dimensional vectors.
- Efficient and scalable search algorithm based on iterative pruning using the neighborhood vectors.
- Efficient indexing and query optimization techniques.
- How to match the labels when they are not exactly the same in the two graphs?