Graph Search with Indexing

Presentation transcript:

Graph Search with Indexing Hakan Kardes

Research Papers Covered in this Talk
  X. Yan and J. Han, "gSpan: Graph-Based Substructure Pattern Mining", ICDM'02
  X. Yan and J. Han, "CloseGraph: Mining Closed Frequent Graph Patterns", KDD'03
  X. Yan, P. S. Yu, and J. Han, "Graph Indexing: A Frequent Structure-based Approach", SIGMOD'04 (also in TODS'05; Google Scholar: ranked #1 out of 63,300 entries on "Graph Indexing")
  X. Yan, P. S. Yu, and J. Han, "Substructure Similarity Search in Graph Databases", SIGMOD'05 (also in TODS'06)
  F. Zhu, X. Yan, J. Han, and P. S. Yu, "gPrune: A Constraint Pushing Framework for Graph Pattern Mining", PAKDD'07 (Best Student Paper Award)
  C. Chen, X. Yan, P. S. Yu, J. Han, D. Zhang, and X. Gu, "Towards Graph Containment Search and Indexing", VLDB'07, Vienna, Austria, Sept. 2007
October 14, 2018

Graph, Graph, Everywhere
  [Figures: aspirin; yeast protein interaction network; an Internet Web; co-author network]

Why Graph Mining and Searching?
  Graphs are ubiquitous
    Chemical compounds (cheminformatics)
    Protein structures, biological pathways/networks (bioinformatics)
    Program control flow, traffic flow, and workflow analysis
    XML databases, Web, and social network analysis
  Graph is a general model
    Trees, lattices, sequences, and items are degenerate graphs
  Diversity of graphs
    Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)
  Complexity of algorithms: high complexity of current algorithms!

Outline
  Graph Search: Querying Graph Databases
  Graph indexing methods
    Path based indexing
    gIndex
  Experiments
  Conclusion

Graph Search: Querying Graph Databases
  Given a graph database and a query graph, find all graphs containing this query graph
  [Figure: a query graph and a graph database]

Scalability Issue
  Sequential scan of the database incurs two costs
    Disk I/O
    Subgraph isomorphism testing
  An indexing mechanism is needed
    DayLight: Daylight.com (commercial)
    GraphGrep: Dennis Shasha, et al. PODS'02
    Grace: Srinath Srinivasa, et al. ICDE'03
  [Figure: query graph and sample database]
  Notes: Two issues have to be solved: disk I/O for false positives and subgraph isomorphism testing for false positives. Therefore, an indexing mechanism is needed. Once a large graph database is formed, how can we do database management, and further data warehousing and data mining? The list above is only a partial list of previous research.
  1. The DayLight system is a commercial product targeted at chemical information systems. The testimonial quoted at the end of these notes is copied from www.daylight.com.
  2. GraphGrep uses a path-based approach to graph indexing. Its implementation is publicly available.
  3. Grace develops a hierarchical vector space method. It tries to extract features from the graphs in a hierarchical way.
  DayLight System: "At Infinity Pharmaceuticals we use the Daylight toolkits for in-house development and have implemented DayCart as a central component of our chemical registration system. This system has been fully integrated within all aspects of our drug discovery processes and software applications. We have successfully "pushed" the registration process to the chemists; they can quickly and easily register compounds within the context of their electronic notebooks. This has provided a significant enhancement to both the efficiency of data capture and the quality of the data within the chemistry database(s). DayCart integrates well into our overall application and data environment providing a flexible chemical data model which facilitates a data architecture that is tailored to Infinity's processes and scientific approach while at the same time allows speed and scalability. The benefit delivered to Infinity is measured in the strong capability to rapidly and efficiently develop intellectual property and access corporate knowledge effectively."

Framework
  Two steps in processing graph queries
  Step 1. Index Construction
    Enumerate structures in the graph database, build an inverted index between structures and graphs
  Step 2. Query Processing
    Enumerate structures in the query graph
    Calculate the candidate graphs containing these structures
    Prune the false positive answers by performing subgraph isomorphism tests
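The two steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the talk: a graph is a hypothetical `(labels, edges)` pair, the indexed "structures" are just vertex labels and labeled edges, and `contains` is a naive backtracking subgraph isomorphism test.

```python
def adjacency(graph):
    labels, edges = graph
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def features(graph):
    """Toy structures (an assumption): vertex labels and label pairs of edges."""
    labels, edges = graph
    feats = set(labels.values())
    for u, v in edges:
        feats.add("-".join(sorted((labels[u], labels[v]))))
    return feats

def build_index(database):
    """Step 1: inverted index from each structure to the ids of graphs containing it."""
    index = {}
    for gid, graph in database.items():
        for f in features(graph):
            index.setdefault(f, set()).add(gid)
    return index

def contains(graph, query):
    """Naive backtracking subgraph isomorphism test (exponential in the worst case)."""
    g_labels, q_labels = graph[0], query[0]
    g_adj, q_adj = adjacency(graph), adjacency(query)
    q_vertices = list(q_labels)

    def extend(mapping):
        if len(mapping) == len(q_vertices):
            return True
        qv = q_vertices[len(mapping)]
        for gv in g_labels:
            if gv in mapping.values() or g_labels[gv] != q_labels[qv]:
                continue
            # every already-mapped neighbour of qv must land on a neighbour of gv
            if all(mapping[n] in g_adj[gv] for n in q_adj[qv] if n in mapping):
                mapping[qv] = gv
                if extend(mapping):
                    return True
                del mapping[qv]
        return False

    return extend({})

def search(database, index, query):
    """Step 2: intersect inverted lists, then verify the surviving candidates."""
    candidates = set(database)
    for f in features(query):
        candidates &= index.get(f, set())
    answers = {gid for gid in candidates if contains(database[gid], query)}
    return candidates, answers

database = {
    "a": ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),          # C-C-O chain
    "b": ({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2), (0, 2)]),  # carbon triangle
    "c": ({0: "C", 1: "N"}, [(0, 1)]),                          # C-N
}
triangle = ({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2), (2, 0)])
index = build_index(database)
candidates, answers = search(database, index, triangle)
```

Here the index prunes graph "c" without touching it, while graph "a" survives filtering as a false positive and is only rejected by the isomorphism test.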

Indexing Strategy
  If graph G contains query graph Q, G should contain any substructure of Q
  [Figure: query graph (Q), graph (G), and a substructure]
  Remarks
    Index substructures of a query graph to prune graphs that do not contain these substructures
  Notes: Our work, like all the previous work, follows this indexing strategy.

Cost Analysis
  Query response time is determined by
    Graph index access time
    Size of candidate answer set, |Cq|
    Disk I/O time
    Isomorphism testing time
  Remark: make |Cq| as small as possible
  Notes: T_io is the time of fetching a graph from disk, which may result in one disk block I/O. I did not show it in the paper. If the graph dataset cannot be held in main memory, we also need to count the I/O time. Our goal is to minimize |Cq|.
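The formula on this slide did not survive in the transcript. Judging only from the four labels that did survive, one plausible reading (an assumption, not a quotation; the names T_query, T_index and T_iso_test are introduced here for illustration) is:

```latex
T_{\text{query}} \approx T_{\text{index}} + |C_q| \times \left( T_{io} + T_{\text{iso\_test}} \right)
```

Under this reading the index access cost is paid once per query, while disk I/O and isomorphism testing are paid once per candidate graph, which is why shrinking |Cq| is the main lever.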


Path-Based Approach
  [Figure: sample database with graphs (a), (b), (c)]
  Paths
    0-length: C, O, N, S
    1-length: C-C, C-O, C-N, C-S, N-N, S-O
    2-length: C-C-C, C-O-C, C-N-C, ...
    3-length: ...
  Build an inverted index between paths and graphs
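A rough sketch of path enumeration and the resulting inverted index follows. It assumes a hypothetical `(labels, edges)` graph encoding and treats a path and its reverse as the same feature; it is not GraphGrep's actual implementation, and the two toy graphs are not the ones in the slide's figure.

```python
def label_paths(graph, max_len):
    """Label sequences of all simple paths with at most max_len edges."""
    labels, edges = graph
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    found = set()

    def walk(path):
        seq = [labels[v] for v in path]
        # a path and its reverse are one feature; keep the smaller spelling
        found.add("-".join(min(seq, seq[::-1])))
        if len(path) - 1 < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for v in labels:
        walk([v])
    return found

def build_path_index(database, max_len):
    """Inverted index: path label string -> ids of graphs containing that path."""
    index = {}
    for gid, graph in database.items():
        for p in label_paths(graph, max_len):
            index.setdefault(p, set()).add(gid)
    return index

database = {
    "a": ({0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)]),  # C-O-C
    "b": ({0: "C", 1: "N"}, [(0, 1)]),                  # C-N
}
path_index = build_path_index(database, max_len=2)
```

The number of enumerated paths grows quickly with `max_len`, which is one reason a maximum path length is fixed in practice.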

Problems of Path-Based Approach
  [Figure: sample database with graphs (a), (b), (c), and a query graph]
  Only graph (c) contains this query graph. However, if we only index paths (C, C-C, C-C-C, C-C-C-C), we cannot prune graphs (a) and (b).
  Notes: Why use paths at all? Because paths are very simple and easy to manipulate.
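The slide's figure is lost, so the following sketch uses a made-up pair of graphs to show the same effect: a six-carbon ring used as a query and a four-carbon chain in the database. With paths of up to three edges, every path feature of the ring also occurs in the chain, so a path-only index cannot prune the chain even though it clearly cannot contain a ring.

```python
def path_features(labels, edges, max_len):
    """Canonical label strings of simple paths with at most max_len edges."""
    adj = {v: [] for v in labels}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    feats = set()
    stack = [[v] for v in labels]
    while stack:
        path = stack.pop()
        seq = [labels[v] for v in path]
        feats.add("-".join(min(seq, seq[::-1])))
        if len(path) <= max_len:  # path has len(path) - 1 edges, so one more is allowed
            stack.extend(path + [n] for n in adj[path[-1]] if n not in path)
    return feats

# Hypothetical query: a ring of six carbon atoms.
ring_labels = {i: "C" for i in range(6)}
ring_edges = [(i, (i + 1) % 6) for i in range(6)]

# Hypothetical database graph: an open chain of four carbon atoms.
chain_labels = {i: "C" for i in range(4)}
chain_edges = [(0, 1), (1, 2), (2, 3)]

ring_feats = path_features(ring_labels, ring_edges, max_len=3)
chain_feats = path_features(chain_labels, chain_edges, max_len=3)

# The chain passes the path filter although it has fewer edges than the ring.
survives_filter = ring_feats <= chain_feats
```

Structure-based features such as the ring itself would separate the two graphs, which is the motivation for the next approach in the talk.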

gIndex: Indexing Graphs by Data Mining
  Another methodology for graph indexing:
    Identify frequent structures in the database; frequent structures are subgraphs that appear quite often in the graph database
    Prune redundant frequent structures to maintain a small set of discriminative structures
    Create an inverted index between discriminative frequent structures and graphs in the database
  Notes: We mine frequent graphs because there are far fewer of them than there are subgraphs in the graph database, although this number is still too large for indexing.

IDEAS: Indexing with Two Constraints
  structure (>10^6)
  frequent (~10^5)
  discriminative (~10^3)

Why Discriminative Subgraphs?
  [Figure: sample database with graphs (a), (b), (c)]
  All graphs contain structures: C, C-C, C-C-C
  Why bother indexing these redundant frequent structures?
    Only index structures that provide more information than existing structures

Discriminative Structures
  Pinpoint the most useful frequent structures
    Given a set of structures f1, f2, …, fn and a new structure x, we measure the extra indexing power provided by x, P(x | f1, f2, …, fn), where each fi is contained in x
    When P is small enough, x is a discriminative structure and should be included in the index
  Index discriminative frequent structures only
    Reduce the index size by an order of magnitude
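One way to read P(x | f1, …, fn) is as the fraction of graphs containing all of f1, …, fn that also contain x. The sketch below estimates it from inverted lists; the function name, the 0.5 cutoff, and the toy inverted lists are assumptions made for illustration.

```python
def indexing_power(x_graphs, sub_graph_lists, all_graphs):
    """Estimate P(x | f1..fn) as |D_x| / |D_f1 ∩ ... ∩ D_fn|.

    x_graphs: ids of graphs containing x
    sub_graph_lists: one id set per already-indexed substructure fi of x
    """
    reachable = set(all_graphs)
    for ids in sub_graph_lists:
        reachable &= ids
    if not reachable:
        return 1.0  # nothing left to prune, so x adds no indexing power
    return len(x_graphs) / len(reachable)

all_graphs = set(range(1, 9))
inverted = {
    "C": set(all_graphs),
    "C-C": set(all_graphs),
    "C-C-C": set(all_graphs),  # present in every graph, like its substructures
    "ring-6": {3},             # hypothetical rare structure
}
subs = [inverted["C"], inverted["C-C"]]

p_chain = indexing_power(inverted["C-C-C"], subs, all_graphs)
p_ring = indexing_power(inverted["ring-6"], subs, all_graphs)

MAX_P = 0.5  # hypothetical cutoff: index x only when P is at or below this
selected = [name for name, p in (("C-C-C", p_chain), ("ring-6", p_ring)) if p <= MAX_P]
```

In this toy data C-C-C gets P = 1.0 and is skipped as redundant, while the ring gets P = 0.125 and is kept.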

Why Frequent Structures?
  We cannot index (or even search) all substructures
  Large structures will likely be indexed well by their substructures
  Size-increasing support threshold
  [Figure: minimum support threshold (support) plotted against structure size]
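A size-increasing support threshold can be sketched as a non-decreasing function of structure size. The piecewise-linear shape and every parameter value below are assumptions chosen for illustration; the slide only shows that the threshold rises with size.

```python
import math

def min_support(size, free_size=2, max_size=10, max_support=81):
    """Hypothetical threshold: 1 for tiny structures, then linear growth up to max_support."""
    if size <= free_size:
        return 1
    fraction = (size - free_size) / (max_size - free_size)
    return math.ceil(1 + fraction * (max_support - 1))

# (size in edges, support) for three made-up patterns
patterns = {"small": (2, 1), "medium": (6, 30), "large": (10, 90)}

# A pattern counts as frequent only if its support meets the threshold for its size.
frequent = [name for name, (size, support) in patterns.items()
            if support >= min_support(size)]
```

Small structures are kept even at low support, so every query still finds some indexed substructure, while large structures must be common to earn a place in the index.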



Experimental Setting
  The AIDS antiviral screen compound dataset from NCI/NIH, containing 43,905 chemical compounds
  Query graphs are randomly extracted from the dataset
  GraphGrep: maximum length (edges) of paths is set at 10
  gIndex: maximum size (edges) of structures is set at 10
  Notes: The fingerprint set size is set as large as possible until the program takes 1 GB of memory; here it is 10K. Other parameter settings are provided in the paper.

Experiments: Index Size
  [Figure: number of features vs. database size]
  Remarks: The figure shows the raw feature sets of the path-based approach and the structure-based approach. Of course, we can compress the paths into an index of arbitrary size, at some loss of performance. In our experimental setting we do not compress the path-based index, in order to illustrate the best performance it can achieve. It is also possible to apply discriminative filtering to paths; we have not done experiments on it. The number of frequent structures or discriminative frequent structures does not change with the size of the database.

Experiments: Answer Set Size
  [Figure: number of candidates vs. query size]
  Notes:
  1. The figure shows the performance of queries whose support is less than 50; that is, the number of graphs containing the query graph is less than 50.
  2. Query size is the number of edges in a query.
  3. Answer set size is the number of candidate graphs that may contain the query graph.
  4. The actual match is the lower bound that any algorithm can achieve; that is, it is the number of graphs that really contain the query graph.

Experiments: Incremental Maintenance
  Frequent structures are stable to database updating
  Index can be built based on a small portion of a graph database, but be used for the whole database
  Notes:
  1. The fact that the number of frequent structures does not change with the size of the database inspires us to design an algorithm that constructs the index incrementally.
  2. The black line shows the performance of indexing from scratch, for example from a database with 2,000 graphs, 4,000 graphs, 6,000 graphs, 8,000 graphs and 10,000 graphs.
  3. The red line shows the performance of incremental indexing: we first build an index for 2,000 graphs, then add another 2,000 graphs to the database and update the index of the original discriminative frequent graphs, then add another 2,000, and so on.
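The incremental scheme can be sketched as follows: the feature set chosen on the initial sample stays fixed, and each new graph is only appended to the inverted lists of the features it contains. The `add_graphs` helper and the toy representation of a graph as a set of feature strings are assumptions for illustration; a real index would run a containment test on the graph itself.

```python
def add_graphs(index, new_graphs, contains):
    """Update the inverted lists of already-selected features; no re-mining of features."""
    for gid, graph in new_graphs.items():
        for feature, ids in index.items():
            if contains(graph, feature):
                ids.add(gid)
    return index

# Toy stand-in: each graph is represented by the set of structures it contains.
def toy_contains(graph, feature):
    return feature in graph

# Index built from an initial sample of the database.
index = {"C-C": {"g1"}, "C-N": {"g2"}}

# A later batch of graphs arrives.
batch = {"g3": {"C-C", "C-O"}, "g4": {"C-N", "C-C"}}
add_graphs(index, batch, toy_contains)
```

Note that "C-O" never becomes a feature here: the update only extends existing lists, which relies on the observation above that the set of frequent structures is stable as the database grows.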

Conclusions
  Graph indexing has wide applications
  Daylight and GraphGrep: path indexing
  gIndex: graph indexing
    Frequent and discriminative subgraphs are high-quality indexing features
  There are more indexing methods, but they are similar to gIndex
  Grafil: similarity (subgraph) search in graph databases
    Graph indexing and feature-based approximate matching
  cIndex: containment graph indexing
    A contrast feature-based indexing model

QUESTIONS