Mohammad Hasan, Mohammed Zaki RPI, Troy, NY

Consider the following problem from Medical Informatics: tissue images (healthy / diseased / damaged) → cell graphs → discriminatory subgraphs → classifier.

Mining Task
- Dataset: 30 graphs
- Average vertex count: 2154
- Average edge count:
- Support: 40%
- Result: no result (using gSpan, Gaston) after a week of running on a 2 GHz dual-core PC with 4 GB RAM running Linux

Limitations of Existing Subgraph Mining Algorithms
- Work only for small graphs
  - The most popular datasets in graph mining are chemical graphs
  - Chemical graphs are mostly trees
  - In the DTP dataset (the most popular dataset), the average vertex count is 43 and the average edge count is 45
- Perform a complete enumeration
  - For a large input graph, the output set is neither enumerable nor usable
- Follow a fixed enumeration order
  - A partial run does not efficiently generate the interesting subgraphs

Goal: avoid complete enumeration; instead, sample a set of interesting subgraphs from the output set

Why is sampling a solution?
- Observation 1: Mining is only an exploratory step; mined patterns are generally used in a subsequent KD task
  - Not all frequent patterns are equally important for the desired task at hand
  - A large output set leads to the information-overload problem
- Observation 2: Traditional mining algorithms explore the output space in a fixed enumeration order
  - Good for generating non-duplicate candidate patterns
  - But subsequent patterns in that order are very similar, so complete enumeration is generally unnecessary
- Sampling can change the enumeration order so that interesting and non-redundant subgraphs are sampled with a higher chance

Output Space
- Traditionally, the frequent subgraphs for a given support threshold
- Can also be augmented with other constraints, to find good patterns for the desired KD task

(Figure: an input space and the corresponding output space for FPM with support = 2)

Sampling from the Output Space
- Return a random pattern from the output set
- The random pattern is obtained by sampling from a desired distribution
- Define an interestingness function f : F → R+; f(p) returns the score of pattern p
- The desired sampling distribution is proportional to the interestingness score
  - If the output space has only 3 patterns with scores 2, 3, 4, sampling should be performed from the {2/9, 3/9, 4/9} distribution
- Efficiency consideration: enumerate as few auxiliary patterns as possible
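When the output space can be fully materialized (which the setting above precludes, but which makes the target concrete), score-proportional sampling is direct. A minimal Python sketch with invented pattern names and the slide's scores 2, 3, 4:

```python
import random

# Toy illustration (not the paper's algorithm): if the output space could be
# enumerated, sampling in proportion to interestingness is immediate.
scores = {"p1": 2, "p2": 3, "p3": 4}                # f(p) for three patterns
total = sum(scores.values())
target = {p: s / total for p, s in scores.items()}  # {2/9, 3/9, 4/9}

def draw(rng):
    """Draw one pattern from the score-proportional distribution."""
    return rng.choices(list(scores), weights=list(scores.values()), k=1)[0]

rng = random.Random(7)
sample = [draw(rng) for _ in range(9000)]
```

The empirical frequencies approach {2/9, 3/9, 4/9} as the number of draws grows.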

How to Choose f?
- Depends on the needs of the application
- For exploratory data analysis (EDA), every frequent pattern can have a uniform score
- For top-k pattern mining, support values can be used as scores (support-biased sampling)
- For the subgraph summarization task, only maximal graph patterns have a uniform non-zero score
- For graph classification, discriminatory subgraphs should have high scores

Challenges
- The output space cannot be instantiated
- Complete statistics about the output space are not known
- The target distribution is not known entirely

(Figure: the output space of graph mining, patterns g1 ... g5 with scores s1 ... sn.) We want to sample pattern gi with probability proportional to its score: P(gi) = si / Σj sj

MCMC Sampling
- Solution approach: perform a random walk in the output space
- Represent the output space as a transition graph to allow local transitions
- Edges of the transition graph are chosen based on structural similarity: in the partial order graph (POG), every pattern is connected to its sub-patterns (with one less edge) and all its super-patterns (with one more edge)
- Make sure that the random walk is ergodic

(Figure: the POG as the transition graph)

Algorithm
1. Define the transition graph (for instance, the POG)
2. Define the interestingness function that selects the desired sampling distribution
3. Perform a random walk on the transition graph:
   - Compute the neighborhood locally
   - Compute the transition probability, utilizing the interestingness score (this is what makes the method generic)
4. Return the currently visited pattern after k iterations
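The steps above can be sketched generically. Everything here is illustrative: `pog` is a toy transition graph standing in for the POG, and `f` an invented interestingness function; a real implementation would compute the neighborhood by pattern extension and support counting rather than read it from a stored adjacency list.

```python
import random

# Toy POG: adjacency lists over four "patterns".
pog = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}
f = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 4.0}  # target pi(p) proportional to f(p)

def mh_walk(start, steps, rng):
    """Random walk with a Metropolis-Hastings correction; uniform proposal
    over the locally computed neighborhood."""
    u = start
    for _ in range(steps):
        v = rng.choice(pog[u])                  # propose a neighbor uniformly
        # acceptance: min(1, f(v) q(v,u) / (f(u) q(u,v)))
        # with uniform proposal q(u,v) = 1/deg(u)
        a = min(1.0, (f[v] * len(pog[u])) / (f[u] * len(pog[v])))
        if rng.random() < a:
            u = v                               # accept the move
    return u                                    # pattern visited after k steps

rng = random.Random(0)
samples = [mh_walk("a", 50, rng) for _ in range(2000)]
```

Pattern "d" carries 4/7 of the target mass, so it dominates the visit counts.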

Local Computation of Output Space
- From the current pattern g0, enumerate its super-patterns and sub-patterns
- Patterns that are not part of the output space are discarded during local neighborhood computation
- The transition probabilities out of g0 sum to 1

(Figure: g0 with sub- and super-patterns g1 ... g5 and transition probabilities p01 ... p05, p00, Σ = 1)
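A hedged sketch of the local neighborhood computation, treating a pattern as a set of edges; real subgraph patterns would also need connectivity and isomorphism checks, and `F` here is a hand-made stand-in for the mined output space:

```python
# Toy output space: four "frequent" edge-set patterns.
F = {
    frozenset({"ab"}), frozenset({"bc"}),
    frozenset({"ab", "bc"}), frozenset({"ab", "bc", "cd"}),
}
ALL_EDGES = {"ab", "bc", "cd", "de"}

def neighbors(pattern):
    """Sub-patterns (one edge removed) and super-patterns (one edge added)
    of `pattern`, keeping only those inside the output space F."""
    subs = {pattern - {e} for e in pattern}
    supers = {pattern | {e} for e in ALL_EDGES - pattern}
    return {p for p in subs | supers if p in F and p != pattern}

n = neighbors(frozenset({"ab", "bc"}))
```

Here the super-pattern extended with "de" is discarded because it lies outside F.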

Compute P to Achieve the Target Distribution
- If π is the stationary distribution and P is the transition matrix, then in equilibrium we have πP = π
- The main task is to choose P so that the desired stationary distribution is achieved
- In fact, we compute only one row of P at a time (local computation)
- We want π(gi) ∝ si, the score of pattern gi

Use the Metropolis-Hastings (MH) Algorithm
1. Fix an arbitrary proposal distribution q beforehand
2. Find a neighbor j (to move to) using the proposal distribution
3. Compute the acceptance probability and accept the move with this probability
4. If accepted, move to j; otherwise, stay at the current pattern; go to step 2

(Figure: current pattern with proposal probabilities q01 ... q05, q00; neighbor 3 is selected)

Uniform Sampling of Frequent Patterns
- Target distribution: 1/n, 1/n, ..., 1/n
- How to achieve it? Use a uniform proposal distribution
- The acceptance probability for a move from u to v is min(1, du / dv), where dx is the degree of vertex x

Uniform Sampling, Transition Probability Matrix

(Figure: example transition probability matrix P for uniform sampling on a small POG)
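The degree-corrected walk can be checked numerically on a toy POG (not the slide's example): building P with a uniform proposal and acceptance min(1, du/dv), the uniform vector satisfies πP = π.

```python
# Toy POG as adjacency lists; degrees differ so the correction matters.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
n = len(adj)

P = [[0.0] * n for _ in range(n)]
for u, nbrs in adj.items():
    for v in nbrs:
        # propose v with prob 1/deg(u), accept with prob min(1, deg(u)/deg(v))
        P[u][v] = (1 / len(nbrs)) * min(1.0, len(adj[u]) / len(adj[v]))
    P[u][u] = 1.0 - sum(P[u])          # rejected proposals stay at u

pi = [1 / n] * n                        # uniform target distribution
pi_next = [sum(pi[u] * P[u][v] for u in range(n)) for v in range(n)]
```

Note P[u][v] = min(1/du, 1/dv) is symmetric, which is exactly why the uniform distribution is stationary.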

Discriminatory Subgraph Sampling
- Database graphs are labeled (e.g., +1 / −1)
- Mined subgraphs may be used as features for supervised classification, or in a graph kernel

(Figure: labeled graphs G1, G2, G3 fed into subgraph mining, producing a feature matrix over subgraphs g1, g2, g3, ... with embedding counts or binary entries)

Sampling in Proportion to a Discriminatory Score (f)
- Interestingness score (feature quality): entropy, or delta score = abs(positive support − negative support)
- Direct mining is difficult: score values (entropy, delta score) are neither monotone nor anti-monotone, so a pattern P and its extension C can have unrelated scores Score(P) and Score(C)
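A toy illustration of why the delta score defeats apriori-style pruning; all the support numbers below are invented:

```python
# Delta score as defined on the slide: |positive support - negative support|.
def delta(pos_sup, neg_sup):
    return abs(pos_sup - neg_sup)

# A parent pattern P and two extensions (children). Supports can only shrink
# under extension, but the delta score can move either way:
parent = delta(pos_sup=10, neg_sup=10)   # score 0: useless as a feature
child1 = delta(pos_sup=9, neg_sup=2)     # score 7: extension RAISED the score
child2 = delta(pos_sup=4, neg_sup=4)     # score 0: extension kept it low
```

Because a low-scoring parent can have a high-scoring child (and vice versa), neither monotone nor anti-monotone pruning applies.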

Discriminatory Subgraph Sampling
- Use the Metropolis-Hastings algorithm
- Choose a neighbor uniformly as the proposal distribution
- Compute the acceptance probability from the delta score: the ratio of the delta scores of j and i, times the ratio of the degrees of i and j
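As a sketch, the acceptance probability just described (delta-score ratio times degree ratio, the latter correcting for the uniform neighbor proposal) in code; the function name and the example numbers are illustrative:

```python
# Acceptance probability for a proposed move from pattern i to pattern j
# under a uniform-over-neighbors proposal: min(1, (f_j/f_i) * (deg_i/deg_j)).
def acceptance(delta_i, delta_j, deg_i, deg_j):
    return min(1.0, (delta_j / delta_i) * (deg_i / deg_j))

# Moving toward a much more discriminatory pattern is always accepted:
a_up = acceptance(delta_i=2.0, delta_j=6.0, deg_i=4, deg_j=3)
# Moving toward a weaker pattern is accepted only some of the time:
a_down = acceptance(delta_i=6.0, delta_j=2.0, deg_i=3, deg_j=4)
```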

Datasets

Name          # of Graphs           Avg. Vertex Count   Avg. Edge Count
DTP                                 43                  45
Chess
Mutagenicity  2401 (+), 1936 (−)    17                  18
PPI
Cell-Graphs   30                    2154

Result Evaluation Metrics
- Sampling quality
  - Our sampling distribution vs. the target sampling distribution
  - Median and standard deviation of visit counts
  - How the sampling converges (convergence rate), measured by the variation distance between the sampling and target distributions
- Scalability test: experiments on large datasets
- Quality of the sampled patterns
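Variation distance is taken here in its common form, half the L1 distance between two distributions; a small sketch (the pattern names and probabilities are invented):

```python
# Total variation distance between two discrete distributions given as dicts.
def variation_distance(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

empirical = {"g1": 0.30, "g2": 0.45, "g3": 0.25}   # observed visit frequencies
target    = {"g1": 1/3,  "g2": 1/3,  "g3": 1/3}    # uniform sampling target
vd = variation_distance(empirical, target)
```

A value of 0 means the sampler has exactly reproduced the target; convergence plots track how this falls with the number of iterations.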

Uniform Sampling Results
- Experiment setup: run the sampling algorithm for a sufficient number of iterations and observe the visit-count distribution
- For a dataset with n frequent patterns, we perform 200·n iterations

(Table: max count, min count, median, and standard deviation of visit counts for uniform sampling, vs. the median and standard deviation of ideal sampling, on the DTP chemical dataset)

Sampling Quality
- Depends on the choice of proposal distribution
- If the vertices of the POG have similar degree values, sampling quality is good
- The earlier dataset has patterns with widely varying degree values
- For the clique dataset, sampling quality is almost perfect

(Table: visit-count statistics on the Chess itemset dataset, 100·n iterations: max count, min count, median, and standard deviation for uniform sampling vs. ideal sampling)

Discriminatory Sampling Results (Mutagenicity dataset)

(Figures: distribution of the delta score among all frequent patterns; relation between sampling rate and delta score)

Discriminatory Sampling Results (cont.)

(Table: for each sample, its delta score, rank, and the percentage of the POG explored)

Discriminatory Sampling Results (Cell Graphs)
- 30 graphs in total, min-sup = 6
- No complete graph mining algorithm could process this dataset in a week of running (on a 2 GHz machine with 4 GB of RAM)

Summary

Existing algorithms:
- Depth-first or breadth-first walk on the subgraph space
- Rightmost extension
- Complete algorithm

Output space sampling:
- Random walk on the subgraph space
- Arbitrary extension
- Sampling algorithm

Properties of output space sampling:
- Quality: sampling quality guarantee
- Scalability: visits only a small part of the search space
- Non-redundancy: finds very dissimilar patterns by virtue of randomness
- Genericity: in terms of pattern type and sampling objective

Future Work and Discussion
- It is important to choose the proposal distribution wisely to get better sampling
- For large graphs, support counting is still a bottleneck
  - How to avoid the isomorphism checking entirely?
  - How to effectively parallelize the support counting?
- How to make the random walk converge faster?
  - The POG generally has a small spectral gap, so convergence is slow; this makes the algorithm costly (more steps to find good samples)

Acceptance Probability Computation
- For a proposed move from pattern i to pattern j, the acceptance probability combines the desired distribution (via the interestingness values) with the proposal distribution: α(i, j) = min(1, (f(j) · q(j, i)) / (f(i) · q(i, j)))

Support-Biased Sampling
- We want to sample graph gi with probability proportional to its support si
- What proposal distribution should we choose? From the current pattern u, split the proposal mass between up-neighbors (super-patterns) and down-neighbors (sub-patterns) with a parameter α, where α = 1 if N_up(u) = ∅ and α = 0 if N_down(u) = ∅

Example of Support-Biased Sampling

(Figure: patterns u and v with s(u) = 2; here α = 1/3, q(u, v) = 1/2, and q(v, u) = 1/(3 × 3) = 1/9)

Sampling Convergence

(Figure: convergence of the sampling distribution)

Support-Biased Sampling
- A scatter plot of visit count vs. support shows a positive correlation

Specific Sampling Examples and Their Uses
- Uniform sampling of frequent patterns: to explore the frequent patterns, to set a proper value of minimum support, and to make an approximate count
- Support-biased sampling: to find top-k patterns in terms of support value
- Discriminatory subgraph sampling: to find subgraphs that are good features for classification