
1 Semi-Supervised Feature Selection for Graph Classification. Xiangnan Kong, Philip S. Yu. Department of Computer Science, University of Illinois at Chicago. KDD 2010

2 Conventional data mining and machine learning approaches assume that data are represented as feature vectors, e.g. (x1, x2, …, xd) → y. In real applications, however, data are often not directly represented as feature vectors but as graphs with complex structures, e.g. G(V, E, l) → y. Examples: program flows, XML documents, chemical compounds.

3 Example: the drug activity prediction problem. Given a set of chemical compounds (training graphs) labeled with their activities (+/-), predict the activities of the testing molecules (?). The same setting applies to program flows and XML documents.

4 The standard pipeline: graph objects → feature vectors → classifiers. A set of subgraph patterns (g1, g2, g3, …) is mined, and each graph (G1, G2, …) is represented as a binary feature vector whose entries indicate which subgraph patterns it contains; a conventional classifier is then trained on these vectors. Key question: how to mine a set of subgraph patterns in order to effectively perform graph classification?
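To make the featurization step concrete, here is a minimal sketch (not code from the paper) that builds binary subgraph-indicator vectors with networkx; the toy graphs, the "label" node attribute, and the helper names are assumptions for the example.

```python
import networkx as nx
from networkx.algorithms import isomorphism

def contains_subgraph(graph, pattern):
    """True if `pattern` occurs as a node-label-preserving (induced) subgraph of `graph`."""
    matcher = isomorphism.GraphMatcher(
        graph, pattern,
        node_match=isomorphism.categorical_node_match("label", None))
    return matcher.subgraph_is_isomorphic()

def to_feature_vector(graph, patterns):
    """Binary vector: entry k is 1 iff subgraph pattern g_k is contained in the graph."""
    return [1 if contains_subgraph(graph, p) else 0 for p in patterns]

# Toy example: a hypothetical molecule (C-C-N chain) and a C-N bond pattern.
G1 = nx.Graph()
G1.add_nodes_from([(0, {"label": "C"}), (1, {"label": "C"}), (2, {"label": "N"})])
G1.add_edges_from([(0, 1), (1, 2)])

g1 = nx.Graph()
g1.add_nodes_from([(0, {"label": "C"}), (1, {"label": "N"})])
g1.add_edge(0, 1)

print(to_feature_vector(G1, [g1]))   # -> [1]
```

The pattern set itself is what needs to be mined; once the binary vectors exist, any off-the-shelf classifier can be trained on them.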

5 Mining discriminative subgraphs from labeled graphs involves two components: 1. Evaluation (effectiveness): is a subgraph feature relevant to graph classification? 2. Search space pruning (efficiency): how to avoid enumerating all subgraph features?

6 Supervised settings require a large set of labeled training graphs. However, labeling a graph is hard!

7 Limitations of supervised methods: 1. Evaluation (effective?): requires a large amount of label information. 2. Search space pruning (efficient?): pruning performance relies on a large amount of label information.

8 Proposed semi-supervised setting: mine useful subgraph patterns for classification using both labeled and unlabeled graphs, guided by an evaluation criterion over the subgraph patterns.

9 Two questions. Evaluation: how to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective) Search space pruning: how to prune the subgraph search space using both labeled and unlabeled graphs? (efficient)

10 Three desired properties of the selected subgraph features. Cannot-Link: graphs in different classes should be far away from each other. Must-Link: graphs in the same class should be close to each other. Separability: unlabeled graphs should be able to be separated from each other.

11 The evaluation function combines the three properties: Cannot-Link (graphs in different classes should be far away), Must-Link (graphs in the same class should be close), and Separability (unlabeled graphs are able to be separated from each other). The evaluation function itself is sketched after slide 12.

12 The gSemi score. The gSemi score of the k-th subgraph feature measures how well that feature satisfies the three properties; in matrix form, the score of a feature set is the sum over all selected features. Intuitively, a subgraph that separates the classes (the N-H fragment on the slide) is good, while one that appears everywhere (the plain carbon chain) is bad.
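The slide shows the score only as a figure, so the LaTeX below is a hedged reconstruction of its general shape from the three properties on slides 10-11 (with α weighting the cannot-link term and β the must-link term, as in slide 20). The pairwise-weight matrix M and the exact normalization are assumptions of this sketch, not the paper's stated formula.

```latex
% Hedged sketch of the evaluation criterion over a selected feature set S
% (general shape only; the paper's exact normalization may differ):
%   - push apart cannot-link pairs C (different labels), weight alpha
%   - pull together must-link pairs M (same label), weight beta
%   - keep unlabeled pairs U separable
\[
  J(S) = \alpha \sum_{(i,j)\in\mathcal{C}} \|\mathbf{x}_i^{S}-\mathbf{x}_j^{S}\|^2
       - \beta  \sum_{(i,j)\in\mathcal{M}} \|\mathbf{x}_i^{S}-\mathbf{x}_j^{S}\|^2
       + \sum_{(i,j)\in\mathcal{U}} \|\mathbf{x}_i^{S}-\mathbf{x}_j^{S}\|^2
\]
% Since each term is a weighted sum of squared distances over pairs of graphs,
% J(S) decomposes over the selected subgraph features. With
% f_{g_k} = (I(g_k \subseteq G_1), \dots, I(g_k \subseteq G_n))^T the indicator
% vector of feature g_k over the n graphs, and M an n-by-n matrix collecting
% the pairwise weights (a graph-Laplacian-like construction), we get
\[
  J(S) = \sum_{g_k \in S} \mathbf{f}_{g_k}^{\top} M\, \mathbf{f}_{g_k},
  \qquad
  q(g_k) = \mathbf{f}_{g_k}^{\top} M\, \mathbf{f}_{g_k}
  \quad \text{(the gSemi score of the $k$-th feature).}
\]
```

Writing the criterion per feature is what makes it possible to score each candidate subgraph independently during the search in slides 16-17.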

13 Results with #labeled graphs = 30, 50, and 70 on the MCF-3, NCI-H23, OVCAR-8, PTC-MM, and PTC-FM datasets.

14 Accuracy (higher is better) versus the number of selected features (#labeled graphs = 30, MCF-3), comparing gSemi (semi-supervised), Information Gain (supervised), and Frequent (unsupervised). Our approach performed best on the NCI and PTC datasets.

15 Recap of the two questions: how to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective) How to prune the subgraph search space using both labeled and unlabeled graphs? (efficient)

16 The pattern search tree organizes candidate subgraphs by size (root; 0 edges; 1 edge; 2 edges; …). gSpan [Yan et al., ICDM'02] is an efficient algorithm to enumerate all frequent subgraph patterns (frequency ≥ min_support) in this tree, skipping branches that are not frequent. But there are too many frequent subgraph patterns; we want only the most useful one(s), i.e. the best node(s) in this tree. How can we find the best node(s) without searching all the nodes? Branch and bound to prune the search space (see the sketch after the next slide).

17 Branch-and-bound pruning on the pattern search tree: keep the best subgraph (and its gSemi score) found so far; at the current node, compute its current score and an upper bound on the gSemi score of any pattern in its sub-tree. If the best score so far ≥ the upper bound, the entire sub-tree can be pruned.
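The search on slides 16-17 can be summarized in a short sketch. This is not the paper's implementation: `children`, `support`, `score`, and `upper_bound` are hypothetical callables standing in for gSpan's rightmost-path extension, a pattern's frequency, the gSemi score, and the paper's upper bound on the gSemi score within a sub-tree.

```python
from typing import Callable, Iterable, Optional, Tuple

def branch_and_bound_search(
    root,                                        # empty pattern (tree root)
    children: Callable[[object], Iterable],      # one-edge extensions of a pattern
    support: Callable[[object], int],            # frequency of a pattern in the graph set
    score: Callable[[object], float],            # gSemi score of a pattern
    upper_bound: Callable[[object], float],      # bound on scores within the pattern's sub-tree
    min_support: int,
) -> Tuple[Optional[object], float]:
    """Return the best-scoring frequent pattern found in the pattern search tree."""
    best_pattern, best_score = None, float("-inf")

    def dfs(pattern):
        nonlocal best_pattern, best_score
        for child in children(pattern):
            # gSpan-style anti-monotone pruning: an infrequent pattern has no
            # frequent super-patterns, so its whole sub-tree is skipped.
            if support(child) < min_support:
                continue
            s = score(child)
            if s > best_score:
                best_pattern, best_score = child, s
            # Branch-and-bound pruning (slide 17): if even the most optimistic
            # score in this sub-tree cannot beat the best score so far, prune.
            if best_score >= upper_bound(child):
                continue
            dfs(child)

    dfs(root)
    return best_pattern, best_score
```

Selecting k features instead of one follows the same scheme: keep the k best patterns and prune against the current k-th best score. Pruning is safe as long as the upper bound truly dominates every score in the sub-tree.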

18 Running time in seconds (lower is better) on the MCF-3 dataset: with gSemi pruning versus without gSemi pruning.

19 Number of subgraphs explored (lower is better) on the MCF-3 dataset: with gSemi pruning versus without gSemi pruning.

20 Parameter study: accuracy (higher is better) on the MCF-3 dataset (#labeled = 50, #features = 20) as α (the weight on cannot-link constraints) and β (the weight on must-link constraints) vary. The best setting is α = 1, β = 0.1; the plot also marks settings where the semi-supervised method is close to the supervised and unsupervised cases.

21 Conclusion: semi-supervised feature selection for graph classification. Subgraph features are evaluated using both labeled and unlabeled graphs (effective), and branch-and-bound pruning of the search space also uses labeled and unlabeled graphs (efficient). Thank you!

