Frequent Neighborhood Patterns: Mining Algorithms and Applications

Frequent Neighborhood Patterns: Mining Algorithms and Applications
Jialong Han
Doctoral thesis work, supervised by Prof. Ji-Rong Wen

Outline
Background
Frequent Neighborhood Patterns: Definitions, Mining Algorithm
Applications
  Knowledge Discovery in Graphs
  Within-Network Classification
  Reverse Top-k Queries
Conclusions
2018/11/20

Graphs
Pictured examples (numbers refer to the picture sources in the references): Molecule Structure Databases (1), Social Networks (2), Web Graphs (3), Academic Networks (4), Knowledge Bases (5)

Graph Databases: Two Settings [KK05]
Graph-transaction setting
  Example: molecule structure databases
  Core concept: transactions
  The properties of a transaction depend on its structure.
  Frequent subgraph mining; applications
Single-graph setting
  Examples: social networks, web graphs, academic networks, knowledge bases, …
  Core concept: nodes (persons, web pages, papers, general entities, …)

Frequent Patterns for Nodes (in the Single-Graph Setting)?
The properties of a node depend on its surrounding structure.
  Academic networks: an author citing his own paper
  Social networks: a person with a son and a daughter
  Within a molecule structure: a carbon atom appearing on a cycle of length 6
Problems to be answered in this thesis
  Is there a class of frequent patterns characterizing the common surrounding structures of many nodes?
  If yes, can these frequent patterns support any node-related applications?
“Mining Frequent Neighborhood Patterns in a Large Labeled Graph”, CIKM’13

Problem Formulation
A neighborhood pattern $P$ is a tuple $\langle G, v_p \rangle$, where $G$ is a connected graph and $v_p \in V(G)$ is the pivot of $P$.
Given a database $D$, the nodes that $P$ matches are the nodes of $D$ that reside in a surrounding structure like that of $v_p$ in $G$.
Support of $P$: the number of nodes that $P$ matches.
$P$ is a frequent neighborhood pattern if its support exceeds $\tau$.
The mining problem: given $D$ and $\tau$, find all frequent neighborhood patterns.
Figure (omitted): a single-graph database and an example neighborhood pattern (NP) with its pivot marked. NP: authors once citing their own papers.
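The definitions above can be illustrated with a small brute-force sketch. It is not the thesis implementation: the `LabeledGraph` class, the function names, and the toy data are all hypothetical, and the sketch assumes that "matching" means a label-preserving mapping of pattern nodes to data nodes that need not be injective, which the slides do not specify.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str, str]  # (source node, edge label, target node)


class LabeledGraph:
    """Directed graph with optional node labels and labeled edges (hypothetical helper)."""

    def __init__(self, labels: Dict[str, Optional[str]], edges: List[Edge]):
        self.labels = labels
        self.edges: Set[Edge] = set(edges)


def matches(pattern: LabeledGraph, pivot: str, data: LabeledGraph, node: str) -> bool:
    """True if some mapping of pattern nodes to data nodes sends the pivot to `node`.

    Assumption: the mapping need not be injective.
    """
    order = [pivot] + [u for u in pattern.labels if u != pivot]
    assign: Dict[str, str] = {}

    def consistent() -> bool:
        # Every pattern edge whose endpoints are both assigned must exist in the data.
        return all(
            (assign[a], lab, assign[b]) in data.edges
            for a, lab, b in pattern.edges
            if a in assign and b in assign
        )

    def extend(i: int) -> bool:
        if i == len(order):
            return True
        u = order[i]
        candidates = [node] if u == pivot else list(data.labels)
        for c in candidates:
            want = pattern.labels[u]
            if want is not None and want != data.labels[c]:
                continue
            assign[u] = c
            if consistent() and extend(i + 1):
                return True
            del assign[u]
        return False

    return extend(0)


def support(pattern: LabeledGraph, pivot: str, data: LabeledGraph) -> int:
    """Number of data nodes that the pattern matches."""
    return sum(matches(pattern, pivot, data, v) for v in data.labels)


# Made-up academic network: a1 wrote p1 and p2, and p1 cites p2 (a self-citation).
data = LabeledGraph(
    labels={"a1": "author", "a2": "author", "p1": "paper", "p2": "paper", "p3": "paper"},
    edges=[("a1", "writes", "p1"), ("a1", "writes", "p2"), ("p1", "cites", "p2"),
           ("a2", "writes", "p3"), ("p3", "cites", "p1")],
)
# Pattern "authors once citing their own papers", pivot A.
pattern = LabeledGraph(
    labels={"A": "author", "X": "paper", "Y": "paper"},
    edges=[("A", "writes", "X"), ("A", "writes", "Y"), ("X", "cites", "Y")],
)
self_citing_support = support(pattern, "A", data)
```

With a threshold $\tau$, the pattern would be reported as frequent only if `self_citing_support` exceeds $\tau$. The exhaustive search is exponential in the pattern size and is meant only to make the definition concrete.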

Mining Algorithm
The Apriori Framework
  Apriori property (anti-monotonicity): the support of $P$ does not exceed that of its sub-patterns.
  This enables an Apriori mining framework [AIS93]: Join-Verify.
FNM (Frequent Neighborhood Mining)
  Initialize $F_1$; $i \leftarrow 2$;
  While $F_{i-1} \neq \emptyset$ Do
    $C_i \leftarrow Join(F_{i-1})$;
    $F_i \leftarrow Verify(C_i)$;
    $i \leftarrow i + 1$;
  End While
  Return $\bigcup_i F_i$;
Challenge: non-trivial Building Blocks (BBs)
  BBs: patterns that CANNOT be obtained by joining smaller ones
  Traditional frequent pattern mining: BBs = all size-1 patterns
  However, in FNM, BBs appear at level 2 and above.
  What do BBs look like in general cases?
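The Join-Verify loop can be sketched generically. The example below is an illustration under stated assumptions: `level_wise_mining` is a hypothetical name, and the `join` and `verify` functions shown are the classic itemset versions, used as a stand-in because the slides do not spell out FNM's own Join and Verify. In the itemset case every building block has size 1, which is exactly what fails to hold for FNM.

```python
from typing import Callable, FrozenSet, List, Set

Itemset = FrozenSet[str]


def level_wise_mining(
    f1: Set[Itemset],
    join: Callable[[Set[Itemset]], Set[Itemset]],
    verify: Callable[[Set[Itemset]], Set[Itemset]],
) -> Set[Itemset]:
    """Apriori-style loop: join frequent patterns of one level, verify the candidates."""
    levels: List[Set[Itemset]] = [f1]
    while levels[-1]:
        candidates = join(levels[-1])
        levels.append(verify(candidates))
    return set().union(*levels)


# Made-up transactions and threshold for the itemset stand-in.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"),
                frozenset("bc"), frozenset("abc")]
tau = 2


def support(items: Itemset) -> int:
    return sum(items <= t for t in transactions)


def join(prev: Set[Itemset]) -> Set[Itemset]:
    # Two patterns of size i-1 that differ in one item give a candidate of size i.
    return {x | y for x in prev for y in prev if len(x | y) == len(x) + 1}


def verify(candidates: Set[Itemset]) -> Set[Itemset]:
    # Keep candidates whose support exceeds tau.
    return {c for c in candidates if support(c) > tau}


f1 = verify({frozenset(item) for t in transactions for item in t})
frequent = level_wise_mining(f1, join, verify)
```

Here the three single items and the three pairs are frequent, while `{a, b, c}` has support 2 and is rejected, so the loop stops at level 3.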

Building Block Theorem of FNM
Call $P$ a path pattern if it is a path with the pivot at one end and at most one vertex label, which, if present, appears at the other end.
Theorem: $P$ is a BB iff it is a path pattern.
Figure (omitted): search spaces of frequent subgraph mining and of FNM, from level 0 (φ) and level 1 upward, marking BBs, non-BBs, and path patterns (labels: "Search Space", "Extend").
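The path-pattern definition translates into a short structural check. This is a sketch under assumptions: the function name and the pattern encoding are hypothetical, edge direction is ignored when deciding whether the pattern is a path, and patterns without any edge are rejected because the slide does not say how a single-node pattern is treated.

```python
from typing import Dict, List, Optional, Set, Tuple


def is_path_pattern(
    labels: Dict[str, Optional[str]],
    edges: List[Tuple[str, str, str]],
    pivot: str,
) -> bool:
    """Check: a path, pivot at one end, at most one vertex label, located at the other end."""
    nodes = list(labels)
    # A path on n nodes has exactly n - 1 edges (assumption: at least one edge).
    if len(nodes) < 2 or len(edges) != len(nodes) - 1:
        return False
    neighbors: Dict[str, Set[str]] = {u: set() for u in nodes}
    for a, _, b in edges:
        if a == b:
            return False
        neighbors[a].add(b)
        neighbors[b].add(a)
    if len(neighbors[pivot]) != 1:
        return False  # the pivot must be an end of the path
    # Walk from the pivot; any branching means the pattern is not a path.
    walk = [pivot]
    prev, cur = None, pivot
    while True:
        nxt = [n for n in neighbors[cur] if n != prev]
        if not nxt:
            break
        if len(nxt) > 1:
            return False
        prev, cur = cur, nxt[0]
        walk.append(cur)
    if len(walk) != len(nodes):
        return False  # some node is not on the path
    # Only the far end may carry a vertex label.
    return all(labels[u] is None for u in walk[:-1])


path_ok = is_path_pattern(
    {"v": None, "x": None, "y": "paper"},
    [("v", "writes", "x"), ("x", "cites", "y")],
    "v",
)
labeled_pivot = is_path_pattern(
    {"v": "author", "x": None, "y": "paper"},
    [("v", "writes", "x"), ("x", "cites", "y")],
    "v",
)
triangle = is_path_pattern(
    {"v": None, "x": None, "y": None},
    [("v", "writes", "x"), ("v", "writes", "y"), ("x", "cites", "y")],
    "v",
)
```

By the theorem, a pattern passing this check cannot be obtained by joining smaller patterns, so a miner has to generate such patterns by extension instead.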

Application 1: Knowledge Discovery in Single-Graphs
Frequent neighborhood patterns have easy-to-interpret semantics and help discover hidden knowledge in single-graphs.

Application 2: Within-Network Classification
Task: molecule structure completion [DK09]
  Input: a single-graph database, with $V_U \subseteq V$ unlabeled
  Output: labels of the nodes in $V_U$
Neighborhood patterns as node features
  Mine frequent neighborhood patterns $\{P_j\}$, $j = 1 \dots m$, on $V_K = V - V_U$;
  Vectorize every $v \in V$ as $\boldsymbol{x}_v = (x_1, x_2, \dots, x_j, \dots, x_m)$, where $x_j = 1$ if $P_j$ matches $v$ and $x_j = 0$ otherwise;
  Train a model $M$ using $\{(\boldsymbol{x}_v, y_v) \mid v \in V_K\}$, and (iteratively) classify $\boldsymbol{x}_v$ for $v \in V_U$ with $M$.
“Within-Network Classification Using Radius-Constrained Neighborhood Patterns”, CIKM’14
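The feature construction can be sketched in a few lines. Everything named here is hypothetical: the match sets stand in for the output of pattern mining, and a nearest-neighbor rule under Hamming distance stands in for the model $M$, which the slides do not specify.

```python
from typing import List, Sequence, Set, Tuple


def vectorize(node: str, match_sets: Sequence[Set[str]]) -> List[int]:
    """x_j = 1 if pattern P_j matches the node, else 0."""
    return [1 if node in matched else 0 for matched in match_sets]


def nearest_label(x: List[int], training: List[Tuple[List[int], str]]) -> str:
    """Stand-in for model M: label of the closest training vector (Hamming distance)."""
    def hamming(y: List[int]) -> int:
        return sum(a != b for a, b in zip(x, y))
    return min(training, key=lambda pair: hamming(pair[0]))[1]


# Made-up toy data: the nodes matched by each mined pattern P_1, P_2.
match_sets = [{"a", "b", "d"}, {"c"}]
known_labels = {"a": "C", "b": "C", "c": "O"}  # V_K with labels y_v
unlabeled = ["d"]                              # V_U

training = [(vectorize(v, match_sets), y) for v, y in known_labels.items()]
predicted = {v: nearest_label(vectorize(v, match_sets), training) for v in unlabeled}
```

In the iterative variant mentioned on the slide, newly predicted labels would be fed back before re-mining or re-classifying; the sketch performs a single pass only.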

Preliminary Results and Problems
Label ratio = 50%; baseline: RL-RW-Deg [DK09]
              RL-RW-Deg   r=1     r≤2      r≤3      r≤4
  #Feature    -           906.2   4804.1   7370.7   7978.6
  Time (s)    79.6        3.1     18.3     28.8     31.4
  F1 (only four of the five values survive; column alignment unclear): 0.804, 0.824, 0.834, 0.836
Outperforms the baseline by 11.7% in terms of F1.
Problem: are all features useful?
  Definition: the radius of $P$ is $r_P = \max_{v \in V(P)} d(v_p, v)$.
  Larger radius, less (conditional) contribution.
Figure (omitted): an example pattern with $r = 2$.
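The radius definition amounts to a breadth-first search from the pivot. The sketch below assumes that $d$ is the hop distance with edge direction ignored, which the slide does not state, and the function name is hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def pattern_radius(edges: List[Tuple[str, str]], pivot: str) -> int:
    """r_P = max over pattern nodes v of d(pivot, v), for a connected pattern."""
    neighbors: Dict[str, Set[str]] = {pivot: set()}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    dist = {pivot: 0}
    queue = deque([pivot])
    while queue:
        u = queue.popleft()
        for w in neighbors[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


# Pivot v with a two-hop branch (v - x - y) and a one-hop branch (v - z).
radius = pattern_radius([("v", "x"), ("x", "y"), ("v", "z")], "v")
```

Keeping only the mined patterns whose radius is at most some bound is a filter applied after mining, so patterns beyond the bound are still generated first.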

Markov Assumption for WNC [MP07]
Distant structures (nodes/edges) have small impact on the classification of $v$.
  An efficiency-effectiveness tradeoff
  A pattern $P$ with a large radius describes such distant structures, so the Markov assumption applies to it.
Can we do FNM without generating any $P$ with $r_P > r_{max}$?
  FNM cannot control $r_P$ directly.
  Late filtration with $r_{max}$: wasted computation.
  Early filtration with $r_{max}$ (e.g., from the path-pattern-generating stage): BBs are missed again!

BB Theorem of Radius-Constrained FNM (r-FNM)
After introducing $r_{max}$, some non-path patterns become BBs.
Theorem: under radius constraints, $P$ is a BB iff it is a path pattern or a zipper pattern.
FNM with radius constraints: r-FNM = FNM + zipper pattern handling
Figure (omitted): zipper patterns for $r_{max} = 3$.

Advantages of r-FNM
Saves feature extraction time when the Markov assumption is applied
  The $r_{max}$ ~ K problem
Provides more choices on the efficiency-effectiveness tradeoff

Application 3: Reverse Top-k Queries
Knowledge bases
  A single-graph database
  Access interface: structural query languages
  Hard for ordinary users to formulate queries
Can we find the query using representative partial answers?
Example question (6): Which chess player was born and died in the same place?
  SELECT ?uri WHERE { ?uri :type :ChessPlayer . ?uri :birthPlace ?place . ?uri :deathPlace ?place }
  Complete answers: M. Botvinnik, P. Morphy, …
  Representative partial answers: M. Botvinnik → ?
Figure (omitted) labels: “Representative”, “Persons born in Europe”.
“Discovering Neighborhood Pattern Queries by Sample Answers in Knowledge Base”, ICDE’16
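To make the example query concrete, the sketch below evaluates the same graph pattern over an in-memory set of triples. The entity and place names are made up for illustration and are not DBpedia data, and the function name is hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def same_birth_and_death_place(triples: Set[Triple], cls: str) -> List[str]:
    """Entities of class `cls` whose birthPlace and deathPlace share a value."""
    typed = {s for s, p, o in triples if p == "type" and o == cls}
    born = {(s, o) for s, p, o in triples if p == "birthPlace"}
    died = {(s, o) for s, p, o in triples if p == "deathPlace"}
    # A (subject, place) pair present in both sets binds ?place consistently.
    return sorted({s for s, _ in born & died if s in typed})


# Made-up knowledge base.
triples = {
    ("player_1", "type", "ChessPlayer"),
    ("player_1", "birthPlace", "city_A"),
    ("player_1", "deathPlace", "city_A"),
    ("player_2", "type", "ChessPlayer"),
    ("player_2", "birthPlace", "city_A"),
    ("player_2", "deathPlace", "city_B"),
    ("person_3", "type", "Politician"),
    ("person_3", "birthPlace", "city_B"),
    ("person_3", "deathPlace", "city_B"),
}
answers = same_birth_and_death_place(triples, "ChessPlayer")
```

The reverse problem discussed on the slide runs the other way: it starts from a few answers such as `player_1` and searches for queries that return them.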

Reverse Top-k Neighborhood Pattern Queries
Natural language questions -> node queries -> neighborhood pattern queries
  Example: SELECT ?uri WHERE { ?uri :type :ChessPlayer . ?uri :birthPlace ?place . ?uri :deathPlace ?place } = a neighborhood pattern query (figure omitted)
Problem statement: given a database $D$ and an order $\prec$ on $V(D)$, for input nodes $I \subseteq V(D)$, find all neighborhood pattern queries $q$ such that $D(q) \supseteq I$ and, when $D(q)$ is ranked by $\prec$, the nodes in $I$ all appear in the top-$k$ results.
  Example input: $I = \{\text{M. Botvinnik}\}$
A filter-refine approach
  Reduce the filter sub-problem to FNM
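The condition in the problem statement can be checked directly once the full answer set of a query is known. The sketch below is a naive baseline, not the method from the slides: the function name is hypothetical, the order is encoded as a made-up rank dictionary (0 is the best rank), and the answer set is assumed to be already computed.

```python
from typing import Dict, Set


def meets_top_k(answers: Set[str], rank: Dict[str, int], inputs: Set[str], k: int) -> bool:
    """True if the answers contain all input nodes and all of them rank in the top k."""
    if not inputs <= answers:
        return False
    top_k = sorted(answers, key=lambda v: rank[v])[:k]
    return inputs <= set(top_k)


# Made-up order over five nodes and a made-up answer set of one candidate query.
rank = {"n1": 0, "n2": 1, "n3": 2, "n4": 3, "n5": 4}
answers = {"n2", "n3", "n5"}

ok_k2 = meets_top_k(answers, rank, {"n3"}, 2)  # the top 2 answers are n2 and n3
ok_k1 = meets_top_k(answers, rank, {"n3"}, 1)  # the top answer is n2 only
```

Running this for every candidate query requires evaluating each query completely, which is the cost a filter-refine scheme tries to avoid.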

Refine Stage: Observations and Optimizations
To verify $q$, $D(q)$ need not be completely evaluated -> Indicator answers.
  $IA(q) = \{ v \mid v \in D(q) \wedge v \prec \inf I \wedge v \notin I \}$
  Only nodes in $IA(q)$ affect the top-$k$ condition.
  $q$ meets the top-$k$ condition iff $|IA(q)| \le k - |I|$.
The sets $IA(q)$ of different queries $q$ overlap with each other -> Shared evaluation.
  If $q_1$ is a sub-query of $q_2$, then $IA(q_1) \supseteq IA(q_2)$.
Even $IA(q)$ need not be completely obtained to reject $q$ -> Partial evaluation.
  Only a lower bound on $|IA(q)|$ is needed.
  The number of “match” checks can be reduced.
Figure (omitted): the order $\prec$ over B. Obama, V. Putin, M. Botvinnik, G. Kasparov, P. Morphy, E. Lasker; queries “Persons born in Europe” and “Chess players dying in his birth place”; annotations “Rank: 4” and “Rank: 1”.
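The indicator-answer test and the partial-evaluation idea can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, $\inf I$ is read as the worst-ranked node of $I$ (the reading under which $|IA(q)| \le k - |I|$ is equivalent to the top-$k$ condition), the filter stage is assumed to have established $I \subseteq D(q)$, and `matches` stands in for the expensive per-node match check.

```python
from typing import Callable, List, Set


def indicator_answers(order: List[str], inputs: Set[str],
                      matches: Callable[[str], bool]) -> Set[str]:
    """IA(q): answers outside I that rank ahead of the worst-ranked input node."""
    worst = max(order.index(v) for v in inputs)
    return {v for v in order[:worst] if v not in inputs and matches(v)}


def passes_refine(order: List[str], inputs: Set[str], k: int,
                  matches: Callable[[str], bool]) -> bool:
    """Partial evaluation: reject as soon as a lower bound on |IA(q)| exceeds k - |I|."""
    budget = k - len(inputs)
    if budget < 0:
        return False
    worst = max(order.index(v) for v in inputs)
    found = 0
    for v in order[:worst]:
        if v in inputs:
            continue
        if matches(v):
            found += 1
            if found > budget:
                return False  # remaining nodes need no match check
    return True


# Made-up order (best rank first) and made-up answer set of a candidate query.
order = ["n1", "n2", "n3", "n4", "n5", "n6"]
answer_set = {"n1", "n2", "n4", "n5"}
checked: List[str] = []


def in_answers(v: str) -> bool:
    checked.append(v)  # records which nodes required a match check
    return v in answer_set


accepted_k2 = passes_refine(order, {"n5"}, 2, in_answers)
checks_for_k2 = list(checked)
```

With $k = 2$ the query is rejected after two match checks, although four nodes rank ahead of `n5`. The shared evaluation across sub-queries mentioned above is not modeled here.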

Experiments
DBpedia 3.9 knowledge base; 52 questions from the QALD-4-Task-1 dataset, divided into 5 groups w.r.t. the shape of their ground-truth queries.
Efficiency evaluation
  Three optimizations, each with a speedup of up to one to two orders of magnitude.
Effectiveness evaluation
  Two examples are enough to narrow down the set of returned queries.

Related Work
Frequent subgraph mining
  Graph-transaction setting [IWM00, KK04, YH02]
  Single-graph setting [KK05, VGS02, FB07, BN08]
Within-network classification
  Homophily-based [MP03]
  Neighborhood-structure-based [DK09, NGK13]
Reverse queries
  Reverse engineering SQL queries [TCP09, ZEPS13, SCC+14]
  Reverse nearest neighbor queries [KM00], reverse top-k queries [VDKN10], reverse skyline queries [DS07]

Conclusions
We proposed a new class of node patterns in the single-graph setting: Frequent Neighborhood Patterns.
  Algorithmic challenge: non-trivial building blocks
We discussed three applications of frequent neighborhood patterns.
  Knowledge discovery, within-network classification, and reverse top-k queries
Future work: other node-centric applications in single-graph databases
Summary table (applications: frequent pattern discovery, classification, reverse queries, indexing)
  Graph-transaction setting, subgraph patterns, designed for transactions: frequent pattern discovery [IWM00, KK04, YH02]; classification [DKK03]; indexing [YYH04]
  Single-graph setting, subgraph patterns, designed for subgraphs: frequent pattern discovery [KK05, VGS02, FB07, BN08]
  Single-graph setting, neighborhood patterns, designed for nodes: √ (this thesis); indexing: future work

Thank you! Q&A

References
1 Picture is from http://icep.wikispaces.com/2D+chemical+database+searching+systems
2 Picture is from http://7.mshcdn.com/wp-content/uploads/2012/09/social-graph-640.jpeg
3 Picture is from http://www.analiticaweb.es/wp-content/uploads/2009/09/google.page.rank.explained.jpg
4 Picture is from http://pages.cs.wisc.edu/~lixiujun/samples/social/dblp
5 Picture is from http://resources.mpi-inf.mpg.de/yago-naga/yago/img/yago-graph.png
6 Picture is from http://upload.chinaz.com/upimg/allimg/091020/1718320.gif
[AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In SIGMOD Conference, pages 207–216, 1993.
[BN08] Björn Bringmann and Siegfried Nijssen. What is frequent in a single graph? In PAKDD, pages 858–863, 2008.
[DK09] Christian Desrosiers and George Karypis. Within-network classification using local structure similarity. In ECML/PKDD (1), pages 260–275, 2009.
[DKK03] Mukund Deshpande, Michihiro Kuramochi, and George Karypis. Frequent sub-structure-based approaches for classifying chemical compounds. In ICDM, pages 35–42, 2003.

References (cont.)
[DS07] Evangelos Dellis and Bernhard Seeger. Efficient computation of reverse skyline queries. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 291–302. VLDB Endowment, 2007.
[FB07] Mathias Fiedler and Christian Borgelt. Subgraph support in a single large graph. In Data Mining Workshops, 2007. ICDM Workshops 2007, pages 399–404. IEEE, 2007.
[IWM00] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In PKDD, pages 13–23, 2000.
[KK04] Michihiro Kuramochi and George Karypis. An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1038–1051, 2004.
[KK05] Michihiro Kuramochi and George Karypis. Finding frequent patterns in a large sparse graph. Data Min. Knowl. Discov., 11(3):243–271, 2005.
[KM00] Flip Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In ACM SIGMOD Record, volume 29, pages 201–212. ACM, 2000.
[MP03] Sofus A. Macskassy and Foster Provost. A simple relational classifier. In Proc. of the 2nd Workshop on Multi-Relational Data Mining (MRDM) at KDD, pages 64–76, 2003.
[MP07] Sofus A. Macskassy and Foster J. Provost. Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8:935–983, 2007.

References (cont.)
[NGK13] Marion Neumann, Roman Garnett, and Kristian Kersting. Coinciding walk kernels: Parallel absorbing random walks for learning with graphs and few labels. In Asian Conference on Machine Learning, pages 357–372, 2013.
[SCC+14] Yanyan Shen, Kaushik Chakrabarti, Surajit Chaudhuri, Bolin Ding, and Lev Novik. Discovering queries based on example tuples. In SIGMOD, 2014.
[TCP09] Quoc Trung Tran, Chee-Yong Chan, and Srinivasan Parthasarathy. Query by output. In SIGMOD, 2009.
[VDKN10] Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, and Kjetil Nørvåg. Reverse top-k queries. In ICDE, 2010.
[VGS02] Natalia Vanetik, Ehud Gudes, and Solomon Eyal Shimony. Computing frequent graph patterns from semistructured data. In ICDM, pages 458–465, 2002.
[YH02] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In ICDM, pages 721–724, 2002.
[YYH04] Xifeng Yan, Philip S. Yu, and Jiawei Han. Graph indexing: A frequent structure-based approach. In SIGMOD Conference, pages 335–346, 2004.
[ZEPS13] Meihui Zhang, Hazem Elmeleegy, Cecilia M. Procopiuc, and Divesh Srivastava. Reverse engineering complex join queries. In SIGMOD, 2013.