Correlation Search in Graph Databases Yiping Ke James Cheng Wilfred Ng Presented By Phani Yarlagadda.

Slides:

Advertisements

Similar presentations

Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.

Advertisements

Vincent S. Tseng, Cheng-Wei Wu, Bai-En Shie, and Philip S. Yu SIG KDD 2010 UP-Growth: An Efficient Algorithm for High Utility Itemset Mining 2010/8/25.

Ranking Outliers Using Symmetric Neighborhood Relationship Wen Jin, Anthony K.H. Tung, Jiawei Han, and Wei Wang Advances in Knowledge Discovery and Data.

Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.

Data Mining (Apriori Algorithm)DCS 802, Spring DCS 802 Data Mining Apriori Algorithm Spring of 2002 Prof. Sung-Hyuk Cha School of Computer Science.

Size-estimation framework with applications to transitive closure and reachability Presented by Maxim Kalaev Edith Cohen AT&T Bell Labs 1996.

Mining Compressed Frequent- Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign.

Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.

gSpan: Graph-based substructure pattern mining

Analysis of Algorithms CS Data Structures Section 2.6.

Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.

COLLABORATIVE FILTERING Mustafa Cavdar Neslihan Bulut.

1 TCAM Razor: A Systematic Approach Towards Minimizing Packet Classifiers in TCAMs Department of Computer Science and Information Engineering National.

1 Efficient Subgraph Search over Large Uncertain Graphs Ye Yuan 1, Guoren Wang 1, Haixun Wang 2, Lei Chen 3 1. Northeastern University, China 2. Microsoft.

1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.

Hierarchical Constraint Satisfaction in Spatial Database Dimitris Papadias, Panos Kalnis And Nikos Mamoulis.

Euripides G.M. PetrakisIR'2001 Oulu, Sept Indexing Images with Multiple Regions Euripides G.M. Petrakis Dept.

Mining Graphs with Constrains on Symmetry and Diameter Natalia Vanetik Deutsche Telecom Laboratories at Ben-Gurion University IWGD10 workshop July 14th,

1 Efficiently Mining Frequent Trees in a Forest Mohammed J. Zaki.

Subgraph Containment Search Dayu Yuan The Pennsylvania State University 1© Dayu Yuan9/7/2015.

NUITS: A Novel User Interface for Efficient Keyword Search over Databases The integration of DB and IR provides users with a wide range of high quality.

Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM.

Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.

1 On Querying Historical Evolving Graph Sequences Chenghui Ren $, Eric Lo *, Ben Kao $, Xinjie Zhu $, Reynold Cheng $ $ The University of Hong Kong $ {chren,

Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.

Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )

Querying Structured Text in an XML Database By Xuemei Luo.

Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.

Mining High Utility Itemset in Big Data

Approximate XML Joins Huang-Chun Yu Li Xu. Introduction XML is widely used to integrate data from different sources. Perform join operation for XML documents:

Graph Indexing: A Frequent Structure- based Approach Alicia Cosenza November 26 th, 2007.

Shape-based Similarity Query for Trajectory of Mobile Object NTT Communication Science Laboratories, NTT Corporation, JAPAN. Yutaka Yanagisawa Jun-ichi.

Computer Science and Engineering Efficiently Monitoring Top-k Pairs over Sliding Windows Presented By: Zhitao Shen 1 Joint work with Muhammad Aamir Cheema.

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.

Mehdi Kargar Aijun An York University, Toronto, Canada Keyword Search in Graphs: Finding r-cliques.

Xiangnan Kong,Philip S. Yu Multi-Label Feature Selection for Graph Classification Department of Computer Science University of Illinois at Chicago.

Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Yan-Cheng Lin.

Templated Search over Relational Databases Date: 2015/01/15 Author: Anastasios Zouzias, Michail Vlachos, Vagelis Hristidis Source: ACM CIKM’14 Advisor:

Challenges in Mining Large Image Datasets Jelena Tešić, B.S. Manjunath University of California, Santa Barbara

1 Efficient Algorithms for Incremental Update of Frequent Sequences Minghua ZHANG Dec. 7, 2001.

Gao Cong, Long Wang, Chin-Yew Lin, Young-In Song, Yueheng Sun SIGIR’08 Speaker: Yi-Ling Tai Date: 2009/02/09 Finding Question-Answer Pairs from Online.

MINING COLOSSAL FREQUENT PATTERNS BY CORE PATTERN FUSION FEIDA ZHU, XIFENG YAN, JIAWEI HAN, PHILIP S. YU, HONG CHENG ICDE07 Advisor: Koh JiaLing Speaker:

2015/11/271 Efficiently Clustering Transactional data with Weighted Coverage Density M. Hua Yan, Keke Chen, and Ling Liu Proceedings of the 15 th International.

Mining Document Collections to Facilitate Accurate Approximate Entity Matching Presented By Harshda Vabale.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor ： Dr. Hsu Graduate ： Yu Cheng Chen Author: Chung-hung.

Euripides G.M. PetrakisIR'2001 Oulu, Sept Indexing Images with Multiple Regions Euripides G.M. Petrakis Dept. of Electronic.

Yen-Ting Yu Iris Hui-Ru Jiang Yumin Zhang Charles Chiang DRC-Based Hotspot Detection Considering Edge Tolerance and Incomplete Specification ICCAD’14.

Mining Graph Patterns Efficiently via Randomized Summaries Chen Chen, Cindy X. Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng Yan, Jiawei Han VLDB’09.

1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.

Database Management Systems, R. Ramakrishnan 1 Algorithms for clustering large datasets in arbitrary metric spaces.

Discriminative Frequent Pattern Analysis for Effective Classification By Hong Cheng, Xifeng Yan, Jiawei Han, Chih- Wei Hsu Presented by Mary Biddle.

CONTEXTUAL SEARCH AND NAME DISAMBIGUATION IN USING GRAPHS EINAT MINKOV, WILLIAM W. COHEN, ANDREW Y. NG SIGIR’06 Date: 2008/7/17 Advisor: Dr. Koh,

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Advisor-Advisee Relationships from Research Publication.

Graph Indexing From managing and mining graph data.

Using category-Based Adherence to Cluster Market-Basket Data Author : Ching-Huang Yun, Kun-Ta Chuang, Ming-Syan Chen Graduate : Chien-Ming Hsiao.

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

Indexing and Mining Free Trees Yun Chi, Yirong Yang, Richard R. Muntz Department of Computer Science University of California, Los Angeles, CA {

Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.

Searching for Pattern Rules Guichong Li and Howard J. Hamilton Int'l Conf on Data Mining (ICDM),2006 IEEE Advisor ： Jia-Ling Koh Speaker ： Tsui-Feng Yen.

Ning Jin, Wei Wang ICDE 2011 LTS: Discriminative Subgraph Mining by Learning from Search History.

Gspan: Graph-based Substructure Pattern Mining

Queensland University of Technology

Outline Introduction State-of-the-art solutions

G10 Anuj Karpatne Vijay Borra

A paper on Join Synopses for Approximate Query Answering

Probabilistic Data Management

On Efficient Graph Substructure Selection

An Efficient Partition Based Method for Exact Set Similarity Joins

Efficient Aggregation over Objects with Extent

Presentation transcript:

Correlation Search in Graph Databases Yiping Ke James Cheng Wilfred Ng Presented By Phani Yarlagadda

Outline Motivation Challenges Problem Definition Solution Performance Evaluation Related Works

Motivation Graph Databases and their importance Correlation mining from graph databases Structural similarity and statistical similarity

Challenges Candidate key High complexity graph operations Vast search space

Problem Definition Pearson’s Correlation Coefficient Popularly used correlation measure Definition Given two graphs g1 and g2, the Pearson’s Correlation Coefficient of g1 and g2, denoted as φ(g1, g2), is defined as follows When supp(g1) or supp(g2) is equal to 0 or 1, φ(g1, g2) is defined to be 0.The range of φ(g1, g2) falls within [−1, 1] In this paper we are concerned about positively correlated graphs only

Problem Definition Correlated Graphs Two graphs g1 and g2 are correlated if and only if φ(g1, g2) ≥ θ, where θ (0 < θ ≤ 1) is a user-specified minimum correlation threshold.

Problem Definition Correlated Graph Search Given a graph database D, a correlation query graph q and a minimum correlation threshold θ, the problem of Correlated Graph Search (CGS) is to find the set of all graphs that are correlated with q. The answer set of the CGS problem is defined as Aq = {(g,Dg) : φ(q, g) ≥ θ}.

Solution-Candidate Set Generation Mine the set of frequent graphs (FG’s) from D using the thresholds Drawbacks 1.All existing FG mining algorithms generate graphs with higher support before those with lower support. 2.Not efficient and scalable,especially when D is large or the lower bound is low.

Solution-Candidate Set Generation Mine the set of FG’s using the threshold Advantages 1.Efficient candidate generation. 2.Significant reduction in search space.

Solution-Framework The framework of the solution consists of the following four steps. 1.Obtain the projected database Dq of q. 2.Mine the set of candidate graphs C from Dq, using lower(q,g)/supp(q) as the minimum support threshold. 3.Refine C by three heuristic rules. 4.For each candidate graph g C, a)Obtain Dg. b)Add (g,Dg) to Aq if φ(q, g) ≥ θ.

Solution-Heuristic Rules Heuristic Rule 1 Given a graph g, if g C and g q, then g base(Aq) Identifies graphs that are guaranteed to be answers

Solution-Heuristic Rules Heuristic Rule 2 Given two graphs g1 and g2, where g1 g2 and supp(g1, q) = supp(g2, q), if g1 base(Aq), then g2 base(Aq) Helps in reduction of the search space so that the unrewarding query costs for false positives.

Solution-Heuristic Rules Heuristic Rule 3 Given two graphs g1 and g2, where g1 g2, if supp(g2, q) < f(supp(g1)), then g2 base(Aq) Helps in reduction of the search space so that the unrewarding query costs for false positives.

Solution-Algorithm Input: A graph database D, a query graph q, and a correlation threshold θ. Output: The answer set Aq. 1.Obtain Dq; 2.Mine FGs from Dq using lower(q,g) supp(q) as the minimum support threshold and add the FGs to C; 3.for each graph g C in size-descending order do 4.if (g q) 5.Add (g,Dg) to Aq; 6.else 7.Obtain Dg; 8.if (φ(q, g) ≥ θ) 9.Add (g,Dg) to Aq; 10.else 11.H2 ← {g’ C : g g, supp(g’;Dq) = supp(g;Dq)}; 12.C ← C−H2; 13.H3 ← {g’ C : g g, supp(g’;Dq) < f(supp(g))/supp(q) }; 14.C ← C−H3;

Solution-Example Consider the graph database below

Solution-Example Query q Candidate set

Performance Evaluation The dataset contains the compound structures of cancer and AIDS data from NCI open database compunds. The dataset contains about 249k graphs. On average each graph in dataset has 21 nodes and 23 edges. The number of distinct labels for nodes and edges is 88. We randomly generate four sets of queries, F1, F2, F3 and F4 each of which contain 100 queries. The support ranges for the queries in F1 to F4 are [0.02,0.05],(0.05,0.07],(0.07,0.1] and (0.1,1.0]

Performance Evaluation Effect of candidate generation

Performance Evaluation Effect of

Performance Evaluation Effect of Heuristic Rules

Performance Evaluation Effect of Graph Size

Related Works Raymond proposes an efficient algorithm MCES for similarity search. Williams proposes an indexing technique that adopts graph decomposition method for similarity search. Zhang and Feigenbaum adopted φ correlation coefficient to measure the correlated pairs in transaction databases.