Slide 1
Mohammad Hasan, Mohammed Zaki (RPI, Troy, NY)
Slide 4: A Motivating Problem
Consider the following problem from medical informatics: classify tissue images as Healthy, Diseased, or Damaged.
Pipeline: Tissue Images -> Cell Graphs -> Discriminatory Subgraphs -> Classifier
11/11/2015
Slide 5: Mining Task
Dataset: 30 graphs; average vertex count 2154; average edge count 36945
Support threshold: 40%
Result: no result (using gSpan and Gaston) after a week of running on a 2 GHz dual-core PC with 4 GB of RAM running Linux
Slide 6: Limitations of Existing Subgraph Mining Algorithms
They work only for small graphs: the most popular datasets in graph mining are chemical graphs, which are mostly trees; in the DTP dataset (the most popular one) the average vertex count is 43 and the average edge count is 45.
They perform a complete enumeration: for a large input graph, the output set is neither enumerable nor usable.
They follow a fixed enumeration order: a partial run does not efficiently generate the interesting subgraphs.
Goal: avoid complete enumeration by sampling a set of interesting subgraphs from the output set.
Slide 7: Why Sample Instead of Enumerate?
Observation 1: mining is only an exploratory step; mined patterns are generally used in a subsequent KD task. Not all frequent patterns are equally important for the task at hand, and a large output set leads to an information-overload problem.
Observation 2: traditional mining algorithms explore the output space in a fixed enumeration order. This is good for generating non-duplicate candidate patterns, but consecutive patterns in that order are very similar.
Complete enumeration is therefore generally unnecessary: sampling can change the enumeration order so that interesting and non-redundant subgraphs are found with higher probability.
Slide 9: Output Space
Traditionally: the frequent subgraphs for a given support threshold.
The space can also be augmented with other constraints to find good patterns for the desired KD task.
(Figure: an input space and the corresponding output space for FPM with support = 2.)
Slide 10: Sampling from the Output Space
Return a random pattern from the output set, obtained by sampling from a desired distribution.
Define an interestingness function f : F -> R+; f(p) returns the score of pattern p.
The desired sampling distribution is proportional to the interestingness score: if the output space has only 3 patterns with scores 2, 3, and 4, sampling should follow the distribution {2/9, 3/9, 4/9}.
Efficiency consideration: enumerate as few auxiliary patterns as possible.
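If the output space could be enumerated, score-proportional sampling would be direct. A toy sketch of the {2/9, 3/9, 4/9} example (the pattern names are illustrative, not from the slides; the point of the later slides is that frequent subgraphs cannot actually be enumerated, which is what motivates MCMC):

```python
import random

rng = random.Random(0)
# Toy output space: three patterns with interestingness scores 2, 3, 4,
# so the target distribution is {2/9, 3/9, 4/9}.
patterns = ["p1", "p2", "p3"]
scores = [2, 3, 4]

# With an enumerable output space, score-proportional sampling is direct.
draws = rng.choices(patterns, weights=scores, k=90_000)
freq = {p: draws.count(p) / len(draws) for p in patterns}
# Empirical frequencies approach 2/9, 3/9, 4/9.
```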
Slide 11: How to Choose f?
It depends on application needs:
For exploratory data analysis (EDA), every frequent pattern can have a uniform score.
For top-k pattern mining, support values can be used as scores (support-biased sampling).
For subgraph summarization, only maximal graph patterns have a uniform non-zero score.
For graph classification, discriminatory subgraphs should have high scores.
Slide 13: Challenges
The output space cannot be instantiated; complete statistics about the output space are not known, so the target distribution is not known entirely.
(Figure: output space of graph mining, patterns g1, g2, ..., gn with scores s1, s2, ..., sn; we want to sample each pattern with probability proportional to its score.)
Slide 14: MCMC Sampling
Solution approach: perform a random walk in the output space.
Represent the output space as a transition graph that allows local transitions; edges of the transition graph are chosen based on structural similarity.
In the POG (partial order graph), every pattern is connected to its sub-patterns (with one less edge) and all its super-patterns (with one more edge).
Make sure that the random walk is ergodic.
(Figure: the POG used as the transition graph.)
Slide 15: Algorithm
1. Define the transition graph (for instance, the POG).
2. Define the interestingness function that determines the desired sampling distribution.
3. Perform a random walk on the transition graph: compute the neighborhood locally, then compute the transition probability. Using the interestingness score here is what makes the method generic.
4. Return the currently visited pattern after k iterations.
Slide 16: Local Computation of the Output Space
From the current pattern g0, compute its super-patterns and sub-patterns locally; patterns that are not part of the output space are discarded during local neighborhood computation.
(Figure: g0 with neighbors g1, ..., g5 and transition probabilities p01, ..., p05 plus a self-loop p00, summing to 1.)
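A minimal sketch of local neighborhood computation under simplifying assumptions: patterns are modeled as frozensets of edge labels (itemset-like, sidestepping subgraph isomorphism and connectivity, which a real implementation must handle), and `is_frequent` is a hypothetical stand-in for the support check:

```python
def local_neighborhood(pattern, candidate_edges, is_frequent):
    """Compute the POG neighbors of `pattern` locally.
    Sub-patterns: remove one edge; super-patterns: add one edge.
    Neighbors outside the output space (infrequent ones) are discarded."""
    subs = [pattern - {e} for e in pattern]
    supers = [pattern | {e} for e in candidate_edges if e not in pattern]
    return [n for n in subs + supers if is_frequent(n)]

# Toy example: edge labels as strings, "frequent" iff at most 2 edges.
p = frozenset({"ab", "bc"})
universe = ["ab", "bc", "cd"]
nbrs = local_neighborhood(p, universe, lambda s: len(s) <= 2)
# Only the two sub-patterns survive; the 3-edge super-pattern is discarded.
```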
Slide 17: Computing P to Achieve the Target Distribution
If pi is the stationary distribution and P is the transition matrix, then in equilibrium pi P = pi.
The main task is to choose P so that the desired stationary distribution is achieved. In fact, we compute only one row of P at a time (local computation).
(Figure: patterns with scores s1, s2, ..., sn; we want pi proportional to the scores.)
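The pi P = pi condition can be checked concretely by building a Metropolis-style transition matrix for the three-pattern toy space with scores 2, 3, 4 (uniform proposal over the other states; this construction is a standard MH sketch for illustration, not code from the paper):

```python
def metropolis_matrix(scores):
    """Build a Metropolis transition matrix P over a complete transition
    graph so that the stationary distribution is proportional to `scores`.
    Proposal: uniform over the other n-1 states."""
    n = len(scores)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i][j] = (1 / (n - 1)) * min(1.0, scores[j] / scores[i])
        P[i][i] = 1.0 - sum(P[i])   # self-loop absorbs rejected moves
    return P

scores = [2, 3, 4]
P = metropolis_matrix(scores)
pi = [s / sum(scores) for s in scores]   # target: 2/9, 3/9, 4/9
# One application of P leaves pi unchanged: pi P = pi.
piP = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
```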
Slide 18: The Metropolis-Hastings (MH) Algorithm
1. Fix an arbitrary proposal distribution q beforehand.
2. Propose a neighbor j (to move to) using that distribution.
3. Compute the acceptance probability and accept the move with that probability.
4. If accepted, move to j; otherwise, go back to step 2.
(Figure: current state 0 with neighbors 1-5 and proposal probabilities q00, q01, ..., q05; neighbor 3 is selected.)
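The four steps above can be sketched as a random walk over a toy transition graph. The path-shaped space, pattern names, and scores below are invented for illustration (real patterns are subgraphs, not labels); the acceptance ratio is the standard MH form with target proportional to the score:

```python
import random

def mh_walk(neighbors, score, start, steps, seed=7):
    """Metropolis-Hastings walk on a transition graph.
    Proposal: uniform over `neighbors[cur]`. Visit frequencies converge
    to the score-proportional target distribution."""
    rng = random.Random(seed)
    cur = start
    visits = {p: 0 for p in neighbors}
    for _ in range(steps):
        j = rng.choice(neighbors[cur])
        q_ij = 1 / len(neighbors[cur])       # proposal prob. cur -> j
        q_ji = 1 / len(neighbors[j])         # proposal prob. j -> cur
        alpha = min(1.0, (score(j) * q_ji) / (score(cur) * q_ij))
        if rng.random() < alpha:             # accept with prob. alpha
            cur = j
        visits[cur] += 1
    return visits

# Path-shaped toy space a - b - c with scores 1, 2, 3 (target 1/6, 2/6, 3/6).
nbrs = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
f = {"a": 1, "b": 2, "c": 3}
visits = mh_walk(nbrs, f.get, "a", 300_000)
freqs = {p: n / 300_000 for p, n in visits.items()}
```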
Slide 19: Uniform Sampling of Frequent Patterns
Target distribution: (1/n, 1/n, ..., 1/n).
How to achieve it: use a uniform proposal distribution over the neighbors. The acceptance probability for a move from pattern i to pattern j is then min(1, d_i / d_j), where d_x is the degree of vertex x in the POG.
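A tiny helper for the degree-ratio acceptance rule on this slide; the degree values used below are hypothetical:

```python
def uniform_acceptance(d_i, d_j):
    """Acceptance probability for uniform sampling with a uniform
    proposal over POG neighbors: min(1, d_i / d_j)."""
    return min(1.0, d_i / d_j)

# Moving from a hub (degree 5) to a leaf (degree 1) is always accepted;
# the reverse move is accepted only 1/5 of the time, which cancels the
# proposal's bias toward high-degree patterns.
a_down = uniform_acceptance(5, 1)   # hub -> leaf
a_up = uniform_acceptance(1, 5)     # leaf -> hub
```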
Slide 20: Uniform Sampling, Transition Probability Matrix
(Figure: a worked example of the transition probability matrix P for uniform sampling on a small POG.)
Slide 21: Discriminatory Subgraph Sampling
Database graphs are labeled (e.g. G1, G2, G3 with labels +1 / -1). Mined subgraphs may be used as features for supervised classification, or in a graph kernel.
(Figure: subgraph mining turns the database into a feature table, with embedding counts or binary indicators of subgraphs g1, g2, g3, ... in each database graph.)
Slide 22: Sampling in Proportion to a Discriminatory Score f
Interestingness score (feature quality):
entropy
delta score = |positive support - negative support|
Direct mining is difficult: these score values (entropy, delta score) are neither monotone nor anti-monotone, so a child pattern C of P can score higher or lower than P.
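The delta score itself is straightforward to compute; a small sketch with made-up support values:

```python
def delta_score(pos_support, neg_support):
    """Delta score of a subgraph feature: |positive support - negative
    support|. High values mark subgraphs that separate the classes well."""
    return abs(pos_support - neg_support)

# Made-up supports. Note the score is not (anti-)monotone: extending a
# pattern can raise or lower it, so it cannot be pruned like support.
s1 = delta_score(40, 13)   # a discriminative pattern
s2 = delta_score(10, 10)   # a useless pattern, equally frequent in both
```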
Slide 23: Discriminatory Subgraph Sampling
Use the Metropolis-Hastings algorithm.
Proposal distribution: choose a neighbor uniformly.
Acceptance probability: computed from the delta scores of j and i and the ratio of the degrees of i and j.
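With a uniform proposal over neighbors and a delta-score-proportional target, the MH acceptance works out to min(1, (delta_j / delta_i) * (d_i / d_j)); this closed form is my reading of the slide, and the values below are hypothetical:

```python
def disc_acceptance(delta_i, delta_j, d_i, d_j):
    """MH acceptance for delta-score-biased sampling with a uniform
    proposal over POG neighbors: the delta-score ratio of j and i times
    the degree ratio of i and j, capped at 1."""
    return min(1.0, (delta_j / delta_i) * (d_i / d_j))

# Moving toward a more discriminatory neighbor is favored, damped by the
# degree correction: (700/400) * (3/6) = 0.875.
a1 = disc_acceptance(400, 700, 3, 6)
a2 = disc_acceptance(100, 900, 2, 2)   # large score gain: always accepted
```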
Slide 25: Datasets

Name           # of Graphs          Avg. Vertex Count   Avg. Edge Count
DTP            1084                 43                  45
Chess          3196                 10.25               n/a
Mutagenicity   2401 (+), 1936 (-)   17                  18
PPI            3                    2154                81607
Cell-Graphs    30                   2184                36945
Slide 26: Result Evaluation Metrics
Sampling quality: our sampling distribution vs. the target sampling distribution; median and standard deviation of the visit counts; convergence rate (variation distance).
Scalability test: experiments on large datasets.
Quality of the sampled patterns.
Slide 27: Uniform Sampling Results
Experiment setup: run the sampling algorithm for a sufficient number of iterations and observe the visit-count distribution. For a dataset with n frequent patterns, we perform 200 * n iterations.
Result on the DTP chemical dataset:

Uniform sampling:  max count 338, min count 32, median 209, std 59.02
Ideal sampling:    median 200, std 14.11
Slide 28: Sampling Quality
Quality depends on the choice of proposal distribution: if the vertices of the POG have similar degree values, sampling is good. The earlier dataset has patterns with widely varying degree values; for the clique dataset, sampling quality is almost perfect.
Result on the Chess (itemset) dataset (100 * n iterations):

Uniform sampling:  max count 156, min count 61, median 100, std 13.64
Ideal sampling:    median 100, std 10
Slide 29: Discriminatory Sampling Results (Mutagenicity Dataset)
(Figures: distribution of the delta score among all frequent patterns; relation between sampling rate and delta score.)
Slide 30: Discriminatory Sampling Results (cont.)

Sample No   Delta Score   Rank   % of POG Explored
1           404           132    5.7
2           644           21     11.0
3           707           10     10.8
4           725           4      8.9
5           280           595    2.8
6           725           4      8.9
7           627           27     3.3
8           709           9      7.7
9           721           5      9.1
10          725           4      8.9
Slide 31: Discriminatory Sampling Results (Cell Graphs)
Total graphs: 30; min-sup = 6. No complete graph mining algorithm could finish on this dataset within a week of running (on a 2 GHz machine with 4 GB of RAM).
Slide 32: Summary
Output space sampling vs. existing algorithms:
Random walk on the subgraph space (vs. depth-first or breadth-first walk).
Arbitrary extension (vs. rightmost extension).
Sampling algorithm (vs. complete algorithm).
Properties of output space sampling:
Quality: a sampling-quality guarantee.
Scalability: visits only a small part of the search space.
Non-redundancy: finds very dissimilar patterns by virtue of randomness.
Genericity: in terms of both pattern type and sampling objective.
Slide 33: Future Work and Discussion
It is important to choose the proposal distribution wisely to get better sampling.
For large graphs, support counting is still a bottleneck: how can the isomorphism checking be scrapped entirely, and how can support counting be effectively parallelized?
How can the random walk be made to converge faster? The POG generally has a small spectral gap, so convergence is slow; this makes the algorithm costly (more steps are needed to find good samples).
Slide 36: Acceptance Probability Computation
The acceptance probability combines the desired distribution (proportional to the interestingness value f) with the proposal distribution q:
alpha(i -> j) = min(1, (f(j) * q(j, i)) / (f(i) * q(i, j)))
Slide 37: Support-Biased Sampling
Target: sample each pattern gi with probability proportional to its support si.
What proposal distribution to choose? Split the proposal between the up-neighbors N_up(u) (super-patterns) and down-neighbors N_down(u) (sub-patterns) of the current pattern u, with a mixing parameter alpha: alpha = 1 if N_up(u) is empty, and alpha = 0 if N_down(u) is empty.
Slide 38: Example of Support-Biased Sampling
(Figure: current pattern u and proposed pattern v, with supports s(u) = 2 and s(v) = 3.)
Proposal probabilities: q(u, v) = 1/2, q(v, u) = 1/(3 x 3) = 1/9.
Acceptance: alpha = min(1, (s(v) * q(v, u)) / (s(u) * q(u, v))) = min(1, (3 x 1/9) / (2 x 1/2)) = 1/3.
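The numbers in this example can be checked against the standard MH acceptance formula (the min(1, ...) form is the usual MH ratio for a support-proportional target; the slide gives the values, not the formula):

```python
# Values from the slide's example: current pattern u, proposed pattern v.
s_u, s_v = 2, 3                # supports of u and v
q_uv, q_vu = 1 / 2, 1 / 9      # proposal probabilities u -> v and v -> u

# Standard MH acceptance for a support-proportional target:
alpha = min(1.0, (s_v * q_vu) / (s_u * q_uv))
# (3 * 1/9) / (2 * 1/2) = (1/3) / 1 = 1/3, matching the slide.
```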
Slide 39: Sampling Convergence
(Figure: convergence plot.)
Slide 40: Support-Biased Sampling Results
A scatter plot of visit count against support shows a positive correlation (correlation: 0.76).
Slide 41: Specific Sampling Examples and Their Uses
Uniform sampling of frequent patterns: to explore the frequent patterns, to set a proper minimum-support value, and to perform approximate counting.
Support-biased sampling: to find the top-k patterns by support value.
Discriminatory subgraph sampling: to find subgraphs that are good features for classification.