Frequent Neighborhood Patterns: Mining Algorithms and Applications

Jialong Han
Doctoral thesis work, supervised by Prof. Ji-Rong Wen

Outline
- Background
- Frequent Neighborhood Patterns: Definitions
- Mining Algorithm
- Applications
  - Knowledge Discovery in Graphs
  - Within-Network Classification
  - Reverse Top-k Queries
- Conclusions

2018/11/20

Graphs
- Molecule structure databases [1]
- Social networks [2]
- Web graphs [3]
- Academic networks [4]
- Knowledge bases [5]

Graph Databases: Two Settings [KK05]
- Graph-transaction setting
  - Core concept: transactions
  - Example: molecule structure databases
  - Properties of a transaction depend on its structure.
  - Frequent subgraph mining
    - Applications
- Single-graph setting
  - Examples: social networks, web graphs, academic networks, knowledge bases, …
  - Core concept: nodes
  - Persons, web pages, papers, general entities, …

Frequent Patterns for Nodes (in the Single-Graph Setting)?
- Properties of a node depend on its surrounding structure.
  - Academic networks: an author citing his own paper
  - Social networks: a person with a son and a daughter
  - Within a molecule structure: a carbon atom appearing on a cycle of length 6
- Problems to be answered in this thesis
  - Is there a class of frequent patterns characterizing the common surrounding structures of many nodes?
  - If yes, can these frequent patterns support any node-related applications?

“Mining Frequent Neighborhood Patterns in a Large Labeled Graph”, CIKM’13

Problem Formulation
- A neighborhood pattern $P$ is a tuple $\langle G, v_p \rangle$, where $G$ is a connected graph and $v_p \in V(G)$ is the pivot of $P$.
- Given a database $D$, the nodes that $P$ matches are the nodes residing in a surrounding structure like that of $v_p$ in $G$.
- Support of $P$: the number of nodes that $P$ matches.
- $P$ is a frequent neighborhood pattern if its support exceeds $\tau$.
- The mining problem: given $D$ and $\tau$, find all frequent neighborhood patterns.

[Figure: a single-graph database and a neighborhood pattern (NP) with its pivot marked. Example NP: authors who once cited their own papers.]
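The definition of support above can be illustrated with a small brute-force matcher. This is only a sketch under stated assumptions: the slides do not spell out the matching semantics, so the code assumes a label-preserving, direction-preserving mapping that need not be injective. The `LabeledGraph` class, the `matches` and `matched_nodes` functions, and the toy academic network are all hypothetical names introduced for illustration.

```python
from typing import Dict, Hashable, Iterable, Optional, Set, Tuple

Edge = Tuple[Hashable, str, Hashable]  # (source, edge_label, target)


class LabeledGraph:
    """Directed graph with optional node labels and labeled edges (hypothetical helper)."""

    def __init__(self, labels: Dict[Hashable, Optional[str]], edges: Iterable[Edge]):
        self.labels = dict(labels)          # node -> label, or None for "any label"
        self.edges: Set[Edge] = set(edges)


def matches(pattern: LabeledGraph, pivot, data: LabeledGraph, node) -> bool:
    """True if some mapping of `pattern` into `data` sends `pivot` to `node`.

    Assumption: the mapping preserves node labels, edge labels, and edge
    directions, but is not required to be injective.
    """
    order = [pivot] + [p for p in pattern.labels if p != pivot]
    assign = {}

    def consistent(p) -> bool:
        want = pattern.labels[p]
        if want is not None and want != data.labels[assign[p]]:
            return False
        # Every pattern edge whose endpoints are both mapped must exist in the data.
        return all(
            (assign[a], lab, assign[b]) in data.edges
            for a, lab, b in pattern.edges
            if a in assign and b in assign
        )

    def extend(i: int) -> bool:
        if i == len(order):
            return True
        # The pivot is pinned to `node`; other pattern nodes try every data node.
        candidates = [node] if i == 0 else list(data.labels)
        for d in candidates:
            assign[order[i]] = d
            if consistent(order[i]) and extend(i + 1):
                return True
            del assign[order[i]]
        return False

    return extend(0)


def matched_nodes(pattern: LabeledGraph, pivot, data: LabeledGraph) -> set:
    """Nodes of `data` that the pattern matches; the support is the size of this set."""
    return {v for v in data.labels if matches(pattern, pivot, data, v)}


# Toy academic network: a1 wrote p1 and p2, and p1 cites p2 (a self-citation).
data = LabeledGraph(
    {"a1": "author", "a2": "author", "p1": "paper", "p2": "paper", "p3": "paper"},
    [("a1", "writes", "p1"), ("a1", "writes", "p2"), ("p1", "cites", "p2"),
     ("a2", "writes", "p3"), ("p3", "cites", "p1")],
)
# Pattern: an author (pivot "v") who wrote x and y, where x cites y.
self_citation = LabeledGraph(
    {"v": "author", "x": "paper", "y": "paper"},
    [("v", "writes", "x"), ("v", "writes", "y"), ("x", "cites", "y")],
)
matched = matched_nodes(self_citation, "v", data)
support = len(matched)
```

The search is exponential in the pattern size and is meant only to make the definition concrete, not to reflect how FNM verifies candidates.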

Mining Algorithm: FNM (Frequent Neighborhood Mining)
- The Apriori framework
  - Apriori property (anti-monotonicity): the support of $P$ does not exceed that of any of its sub-patterns.
  - This enables an Apriori mining framework [AIS93]: Join-Verify.

    Initialize $F_1$; $i \leftarrow 2$;
    While $F_{i-1} \neq \emptyset$ Do
        $C_i \leftarrow Join(F_{i-1})$;
        $F_i \leftarrow Verify(C_i)$;
        $i \leftarrow i + 1$;
    End While
    Return $\bigcup_i F_i$;

- Challenge: non-trivial building blocks (BBs)
  - BBs: patterns that CANNOT be obtained by joining smaller ones
  - Traditional frequent pattern mining: BBs = all size-1 patterns
  - However, in FNM, BBs appear at level 2 and above.
  - What do BBs look like in general cases?
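The Join-Verify loop above is easiest to see in the classical itemset setting of [AIS93], where the building blocks are simply the size-1 patterns. The sketch below is that textbook version, not FNM itself; the function name `apriori`, the inclusive threshold `min_support`, and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return all itemsets contained in at least `min_support` transactions."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: the building blocks are the frequent single items.
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = set(level)
    size = 2
    while level:
        # Join: merge pairs of frequent (size-1)-itemsets into size-`size` candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Anti-monotonicity: drop candidates that have an infrequent sub-itemset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
        # Verify: count the support of the surviving candidates.
        level = {c for c in candidates if support(c) >= min_support}
        frequent |= level
        size += 1
    return frequent


toy_transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_itemsets = apriori(toy_transactions, min_support=3)
```

In this toy run every single item and every pair is frequent, while the triple is rejected at the verify step. In FNM the same loop applies, but the join alone cannot reach every pattern, which is why the building blocks need separate treatment.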

Building Block Theorem of FNM
- Call $P$ a path pattern if it is a path with the pivot on one end, and it contains at most one vertex label, which (if present) appears on the other end.
- Theorem: $P$ is a BB iff it is a path pattern.

[Figure: search spaces of frequent subgraph mining and FNM, organized by level (Level 0: φ, Level 1, …), distinguishing BBs from non-BBs. In FNM the BBs are path patterns, obtained by "Extend".]
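The path-pattern definition can be turned into a small structural check. This is a sketch of one reading of the definition: it ignores edge directions when deciding whether the pattern is a path, requires at least one edge, and represents an unlabeled vertex with `None`. The function name `is_path_pattern` and the example patterns are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Optional, Tuple


def is_path_pattern(
    labels: Dict[Hashable, Optional[str]],
    edges: Iterable[Tuple[Hashable, str, Hashable]],
    pivot: Hashable,
) -> bool:
    """Check: a simple path, pivot at one end, at most one vertex label, at the far end."""
    nodes = list(labels)
    edges = list(edges)
    # A simple path on n >= 2 vertices has exactly n - 1 edges.
    if len(nodes) < 2 or len(edges) != len(nodes) - 1:
        return False
    neighbors = {v: set() for v in nodes}
    for u, _edge_label, v in edges:
        if u == v:  # self-loops never occur on a path
            return False
        neighbors[u].add(v)
        neighbors[v].add(u)
    # The pivot must be an endpoint, and no vertex may branch.
    if len(neighbors[pivot]) != 1 or any(len(n) > 2 for n in neighbors.values()):
        return False
    # Walk from the pivot to the other end, counting the vertices visited.
    prev, cur, seen = None, pivot, 1
    while True:
        onward = [w for w in neighbors[cur] if w != prev]
        if not onward:
            break
        prev, cur = cur, onward[0]
        seen += 1
    if seen != len(nodes):  # some vertex is not on the path
        return False
    labeled = [v for v in nodes if labels[v] is not None]
    return labeled in ([], [cur])


# Pivot v -writes-> x -cites-> y, with the only vertex label on the far end.
path_ok = is_path_pattern(
    {"v": None, "x": None, "y": "paper"},
    [("v", "writes", "x"), ("x", "cites", "y")],
    "v",
)
# The self-citation triangle is not a path.
triangle = is_path_pattern(
    {"v": None, "x": None, "y": None},
    [("v", "writes", "x"), ("v", "writes", "y"), ("x", "cites", "y")],
    "v",
)
# A label on the pivot end violates the definition.
labeled_pivot = is_path_pattern({"v": "author", "x": None}, [("v", "writes", "x")], "v")
```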

Application 1: Knowledge Discovery in Single-Graphs
Frequent neighborhood patterns have easy-to-interpret semantics and help discover hidden knowledge in single graphs.

Application 2: Within-Network Classification
- Task: molecule structure completion [DK09]
  - Input: a single-graph database in which the nodes $V_U \subseteq V$ are unlabeled
  - Output: labels of the nodes in $V_U$
- Neighborhood patterns as node features
  - Mine frequent neighborhood patterns $\{P_j\}$, $j = 1 \ldots m$, on $V_K = V - V_U$;
  - Vectorize every $v \in V$ as $\boldsymbol{x}_v = (x_1, x_2, \ldots, x_j, \ldots, x_m)$, where $x_j = 1$ if $P_j$ matches $v$ and $x_j = 0$ otherwise;
  - Train a model $M$ using $\{(\boldsymbol{x}_v, y_v) \mid v \in V_K\}$, and (iteratively) classify $\boldsymbol{x}_v$ for $v \in V_U$ with $M$.

“Within-Network Classification Using Radius-Constrained Neighborhood Patterns”, CIKM’14
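The vectorize-then-classify steps above can be sketched in a few lines. Several things here are assumptions: the patterns are hand-written radius-1 predicates rather than mined ones, the slides do not name the model $M$ so a 1-nearest-neighbor classifier under Hamming distance stands in for it, and the toy molecule and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Pattern = Callable[[str], bool]  # True if the pattern matches the given node


def neighbor_pattern(neighbors: Dict[str, List[str]], known: Dict[str, str], label: str) -> Pattern:
    """Radius-1 pattern: the pivot has a neighbor carrying `label`."""
    return lambda v: any(known.get(w) == label for w in neighbors[v])


def vectorize(nodes, patterns: List[Pattern]) -> Dict[str, List[int]]:
    """x_v[j] = 1 if pattern j matches v, else 0."""
    return {v: [1 if p(v) else 0 for p in patterns] for v in nodes}


def predict_1nn(x: List[int], train: List[Tuple[List[int], str]]) -> str:
    """Stand-in for model M: label of the training vector closest in Hamming distance."""
    def hamming(a, b):
        return sum(ai != bi for ai, bi in zip(a, b))
    return min(train, key=lambda pair: hamming(pair[0], x))[1]


# Toy molecule graph; node "x" is the unlabeled atom (V_U = {"x"}).
neighbors = {
    "c1": ["o1", "h1"], "o1": ["c1"], "h1": ["c1"],
    "x": ["o2", "h2"], "o2": ["x"], "h2": ["x"],
}
known = {"c1": "C", "o1": "O", "h1": "H", "o2": "O", "h2": "H"}  # labels on V_K

patterns = [neighbor_pattern(neighbors, known, lab) for lab in ("O", "H", "C")]
features = vectorize(neighbors, patterns)
train = [(features[v], y) for v, y in known.items()]
predicted = predict_1nn(features["x"], train)
```

Here the unlabeled atom has an O neighbor and an H neighbor, exactly like `c1`, so the stand-in classifier assigns it the label of `c1`. The iterative variant mentioned on the slide would add the predicted label to `known` and recompute the features.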

Preliminary Results and Problems
Label ratio = 50%. Baseline: RL-RW-Deg [DK09].

Columns: RL-RW-Deg | $r = 1$ | $r \le 2$ | $r \le 3$ | $r \le 4$
#Feature: - | 906.2 | 4804.1 | 7370.7 | 7978.6
F1: 0.804 | 0.824 | 0.834 | 0.836
Time (s): 79.6 | 3.1 | 18.3 | 28.8 | 31.4

- Outperforms the baseline by 11.7% in terms of F1.
- Problem: are all features useful?
  - Definition: the radius of $P$ is $r_P = \max_{v \in V(P)} d(v_p, v)$.
  - The larger the radius, the smaller the (conditional) contribution.

[Figure: example pattern with $r = 2$.]
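The radius $r_P$ is the largest distance from the pivot to any pattern vertex, which a breadth-first search computes directly. A minimal sketch, assuming distances ignore edge directions and labels; the function name `pattern_radius` and the example edge lists are hypothetical.

```python
from collections import deque
from typing import Hashable, Iterable, Tuple


def pattern_radius(edges: Iterable[Tuple[Hashable, Hashable]], pivot: Hashable) -> int:
    """Radius r_P = max over pattern vertices v of d(pivot, v), via BFS."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    distance = {pivot: 0}
    queue = deque([pivot])
    while queue:
        u = queue.popleft()
        for w in adjacency.get(u, ()):
            if w not in distance:
                distance[w] = distance[u] + 1
                queue.append(w)
    # The pattern is connected by definition, so every vertex has been reached.
    return max(distance.values())


# Self-citation pattern: both papers are adjacent to the pivot author.
r_triangle = pattern_radius([("v", "x"), ("v", "y"), ("x", "y")], "v")
# A two-edge path starting at the pivot.
r_path = pattern_radius([("v", "x"), ("x", "y")], "v")
```

A late-filtration scheme would call such a function on each mined pattern and discard those with radius above $r_{max}$.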

Markov Assumption for WNC [MP07]
- Distant structures (nodes/edges) have small impacts on the classification of $v$.
- An efficiency-effectiveness tradeoff
  - A pattern $P$ with a large radius falls under the Markov assumption.
  - Can we do FNM without generating any $P$ with $r_P > r_{max}$?
- FNM cannot control $r_P$ directly.
  - Late filtration with $r_{max}$: wasted computation.
  - Early filtration with $r_{max}$ (e.g., from the path-pattern-generating stage): BBs are missed again!

BB Theorem of Radius-Constrained FNM (r-FNM)
- After introducing $r_{max}$, some non-path patterns become BBs.
- Theorem: under radius constraints, $P$ is a BB iff it is a path pattern or a zipper pattern.
- FNM with radius constraints: r-FNM = FNM + zipper pattern handling

[Figure: zipper patterns for $r_{max} = 3$.]

Advantages of r-FNM
- Saves feature extraction time when the Markov assumption is applied
  - The $r_{max}$ ~ K problem
- Provides more choices on the efficiency-effectiveness tradeoff

Application 3: Reverse Top-k Queries
- Knowledge bases
  - A single-graph database
  - Access interface: structural query languages
  - Hard for ordinary users to formulate queries
- Can we find the query using representative partial answers?

Example question [6]: Which chess player was born and died in the same place?

    SELECT ?uri WHERE {
      ?uri :type :ChessPlayer .
      ?uri :birthPlace ?place .
      ?uri :deathPlace ?place
    }

- Complete answers: M. Botvinnik, P. Morphy, …
- Representative partial answers: M. Botvinnik

[Figure: recovering the query from the partial answers; labels include “Representative” and “Persons born in Europe”.]

“Discovering Neighborhood Pattern Queries by Sample Answers in Knowledge Base”, ICDE’16

Reverse Top-k Neighborhood Pattern Queries
- Natural language questions -> node queries -> neighborhood pattern queries

    SELECT ?uri WHERE {
      ?uri :type :ChessPlayer .
      ?uri :birthPlace ?place .
      ?uri :deathPlace ?place
    }

  [Figure: the neighborhood pattern equivalent to this query.]

- Problem statement: given a database $D$ and an order $\prec$ on $V(D)$, for input nodes $I \subseteq V(D)$, find all neighborhood pattern queries $q$ such that $D(q) \supseteq I$ and, when $D(q)$ is ranked, the nodes in $I$ all appear in the top-k results.
- A filter-refine approach
  - Reduce the filter sub-problem to FNM
- Example input: $I = \{\text{M. Botvinnik}\}$

Refine Stage: Observations and Optimizations
- To verify $q$, $D(q)$ need not be completely evaluated -> indicator answers.
  - $IA(q) = \{ v \mid v \in D(q) \wedge v \prec \inf I \wedge v \notin I \}$
  - Only nodes in $IA(q)$ affect the top-k condition.
  - $q$ meets the top-k condition iff $|IA(q)| \le k - |I|$.
- The $IA(q)$ of different $q$ overlap with each other -> shared evaluation.
  - If $q_1$ is a sub-query of $q_2$, then $IA(q_1) \supseteq IA(q_2)$.
- Even $IA(q)$ need not be completely obtained to reject $q$ -> partial evaluation.
  - Only a lower bound of $|IA(q)|$ is needed.
  - The number of “match” checks can be reduced.

[Figure: nodes under the order ≺ (B. Obama, V. Putin, M. Botvinnik, G. Kasparov, P. Morphy, E. Lasker) and two queries, “Persons born in Europe” and “Chess players dying in their birth place”, annotated “Rank: 4” and “Rank: 1”.]
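The indicator-answer test and the partial-evaluation idea can be sketched together. The code assumes that $\inf I$ denotes the lowest-ranked input node, so that $IA(q)$ collects the non-input answers ranked ahead of it; this reading is what makes $|IA(q)| \le k - |I|$ equivalent to all of $I$ appearing in the top-k. The function `meets_top_k`, the abstract node names, and the two toy queries are hypothetical, and $I \subseteq D(q)$ is taken for granted as guaranteed by the filter stage.

```python
from typing import Callable, List, Set


def meets_top_k(ranking: List[str], matches: Callable[[str], bool], inputs: Set[str], k: int) -> bool:
    """Check the top-k condition |IA(q)| <= k - |I| with early rejection.

    `ranking` lists all nodes of D, best-ranked first, and `matches(v)` tells
    whether v is in D(q). Assumes every input node matches q.
    """
    budget = k - len(inputs)
    if budget < 0:
        return False
    # Position of the lowest-ranked input node (our reading of inf I).
    worst = max(ranking.index(v) for v in inputs)
    indicator_count = 0
    for v in ranking[:worst]:
        if v in inputs:
            continue
        if matches(v):
            indicator_count += 1
            # Partial evaluation: a lower bound above the budget already rejects q.
            if indicator_count > budget:
                return False
    return True


ranking = ["a", "b", "c", "d", "e"]   # best-ranked first
inputs = {"c"}
broad_query = {"a", "b", "c", "e"}    # answers of a hypothetical general query
narrow_query = {"c", "d"}             # answers of a hypothetical specific query

broad_k1 = meets_top_k(ranking, broad_query.__contains__, inputs, k=1)
narrow_k1 = meets_top_k(ranking, narrow_query.__contains__, inputs, k=1)
broad_k3 = meets_top_k(ranking, broad_query.__contains__, inputs, k=3)
```

Nodes ranked after the lowest-ranked input are never checked, and the loop stops as soon as the count exceeds the budget, which is how the number of match checks gets reduced.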

Experiments
- DBpedia 3.9 knowledge base; 52 questions from the QALD-4-Task-1 dataset, divided into 5 groups w.r.t. the shape of their ground-truth query.
- Efficiency evaluation
  - Three optimizations, each giving a speedup of up to 1 to 2 orders of magnitude.
- Effectiveness evaluation
  - Two examples are enough to narrow down the set of returned queries.

Related Work
- Frequent subgraph mining
  - Graph-transaction setting [IWM00, KK04, YH02]
  - Single-graph setting [KK05, VGS02, FB07, BN08]
- Within-network classification
  - Homophily-based [MP03]
  - Neighborhood-structure-based [DK09, NGK13]
- Reverse queries
  - Reverse engineering SQL queries [TCP09, ZEPS13, SCC+14]
  - Reverse nearest neighbor queries [KM00], reverse top-k queries [VDKN10], reverse skyline queries [DS07]

Conclusions
- We proposed a new class of node patterns in the single-graph setting: frequent neighborhood patterns.
  - Algorithmic challenge: non-trivial building blocks
- We discussed three applications of frequent neighborhood patterns.
  - Knowledge discovery, within-network classification, and reverse top-k queries
- Future work: other node-centric applications in single-graph databases

Summary table (setting / patterns / designed for; application columns: frequent pattern discovery, classification, reverse queries, indexing)
- Graph-transaction / subgraph patterns / transactions: frequent pattern discovery [IWM00, KK04, YH02]; classification [DKK03]; indexing [YYH04]
- Single-graph / subgraph patterns / subgraphs: frequent pattern discovery [KK05, VGS02, FB07, BN08]
- Single-graph / neighborhood patterns / nodes: √; future work

Thank you! Q&A

References
[1] Picture is from
[2] Picture is from
[3] Picture is from
[4] Picture is from
[5] Picture is from
[6] Picture is from
[AIS93] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In SIGMOD Conference, pages 207–216, 1993.
[BN08] Björn Bringmann and Siegfried Nijssen. What is frequent in a single graph? In PAKDD, pages 858–863.
[DK09] Christian Desrosiers and George Karypis. Within-network classification using local structure similarity. In ECML/PKDD (1), pages 260–275, 2009.
[DKK03] Mukund Deshpande, Michihiro Kuramochi, and George Karypis. Frequent sub-structure-based approaches for classifying chemical compounds. In ICDM, pages 35–42, 2003.

References (cont.)
[DS07] Evangelos Dellis and Bernhard Seeger. Efficient computation of reverse skyline queries. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 291–302. VLDB Endowment.
[FB07] Mathias Fiedler and Christian Borgelt. Subgraph support in a single large graph. In Data Mining Workshops, ICDM Workshops 2007, pages 399–404. IEEE.
[IWM00] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In PKDD, pages 13–23.
[KK04] Michihiro Kuramochi and George Karypis. An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1038–1051.
[KK05] Michihiro Kuramochi and George Karypis. Finding frequent patterns in a large sparse graph. Data Min. Knowl. Discov., 11(3):243–271.
[KM00] Flip Korn and S. Muthukrishnan. Influence sets based on reverse nearest neighbor queries. In ACM SIGMOD Record, volume 29, pages 201–212. ACM.
[MP03] Sofus A. Macskassy and Foster Provost. A simple relational classifier. In Proc. of the 2nd Workshop on Multi-Relational Data Mining (MRDM) at KDD, pages 64–76.
[MP07] Sofus A. Macskassy and Foster J. Provost. Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research, 8:935–983, 2007.

References (cont.)
[NGK13] Marion Neumann, Roman Garnett, and Kristian Kersting. Coinciding walk kernels: Parallel absorbing random walks for learning with graphs and few labels. In Asian Conference on Machine Learning, pages 357–372.
[SCC+14] Yanyan Shen, Kaushik Chakrabarti, Surajit Chaudhuri, Bolin Ding, and Lev Novik. Discovering queries based on example tuples. In SIGMOD.
[TCP09] Quoc Trung Tran, Chee-Yong Chan, and Srinivasan Parthasarathy. Query by output. In SIGMOD.
[VDKN10] Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, and Kjetil Norvag. Reverse top-k queries. In ICDE.
[VGS02] Natalia Vanetik, Ehud Gudes, and Solomon Eyal Shimony. Computing frequent graph patterns from semistructured data. In ICDM, pages 458–465.
[YH02] Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In ICDM, pages 721–724.
[YYH04] Xifeng Yan, Philip S. Yu, and Jiawei Han. Graph indexing: A frequent structure-based approach. In SIGMOD Conference, pages 335–346.
[ZEPS13] Meihui Zhang, Hazem Elmeleegy, Cecilia M. Procopiuc, and Divesh Srivastava. Reverse engineering complex join queries. In SIGMOD, 2013.
