Semi-Supervised Feature Selection for Graph Classification
Xiangnan Kong, Philip S. Yu
Department of Computer Science, University of Illinois at Chicago
KDD 2010



2 Conventional data mining and machine learning approaches assume data are represented as feature vectors, e.g. (x1, x2, …, xd) → y. In real applications, data are often not directly represented as feature vectors but as graphs with complex structures, e.g. G = (V, E, l) → y. Examples: program flows, XML documents, chemical compounds.

3 Drug activity prediction problem: given a set of chemical compounds (training graphs) labeled with activities, predict the activities of testing molecules. The same setting applies to program flows and XML documents.

4 From graph objects to feature vectors to classifiers: mine a set of subgraph patterns g1, g2, g3, …, represent each graph Gi as a binary feature vector xi indicating which subgraph patterns it contains, then train a standard classifier on these vectors. Key question: how to mine a set of subgraph patterns in order to effectively perform graph classification?
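The vectorization step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: graphs are reduced to sets of labeled edges and pattern occurrence is a plain subset test, whereas real systems use subgraph isomorphism; all molecules and pattern names below are hypothetical.

```python
# Sketch: turning graphs into binary feature vectors over subgraph patterns.
# A graph is modeled as a set of labeled edges; a pattern "occurs" in a graph
# when all of its edges appear there (a simplification of subgraph isomorphism).

def edge(u, v, label):
    """Undirected labeled edge, order-independent."""
    return (frozenset((u, v)), label)

def contains(graph_edges, pattern_edges):
    return pattern_edges <= graph_edges

def to_feature_vector(graph_edges, patterns):
    """Binary vector: x[k] = 1 iff the k-th pattern occurs in the graph."""
    return [1 if contains(graph_edges, p) else 0 for p in patterns]

# Toy example with hypothetical molecules
g1 = {edge("C1", "C2", "C-C"), edge("C2", "N1", "C-N"), edge("N1", "H1", "N-H")}
g2 = {edge("C1", "C2", "C-C"), edge("C2", "O1", "C=O")}

patterns = [
    frozenset({edge("C1", "C2", "C-C")}),                            # g_1: a C-C bond
    frozenset({edge("C2", "N1", "C-N"), edge("N1", "H1", "N-H")}),   # g_2: a C-N-H fragment
    frozenset({edge("C2", "O1", "C=O")}),                            # g_3: a C=O bond
]

print(to_feature_vector(g1, patterns))  # [1, 1, 0]
print(to_feature_vector(g2, patterns))  # [1, 0, 1]
```

The resulting rows (one vector per graph) are exactly what a standard classifier consumes.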

5 Mining discriminative subgraphs from labeled graphs has two components:
1. Evaluation (effective): is a subgraph feature relevant to graph classification?
2. Search space pruning (efficient): how to avoid enumerating all subgraph features?

6 Supervised settings require a large set of labeled training graphs. However, labeling a graph is hard!

7 Supervised methods:
1. Evaluation (effective?): requires a large amount of label information.
2. Search space pruning (efficient?): pruning performance relies on a large amount of label information.

8 Our setting: mine useful subgraph patterns using both labeled and unlabeled graphs, scoring candidate subgraph patterns p with an evaluation criterion F(p) for classification.

9 Two problems:
Evaluation: how to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective)
Search space pruning: how to prune the subgraph search space using both labeled and unlabeled graphs? (efficient)

10 Three criteria on the feature space:
Cannot-link: graphs in different classes should be far away.
Must-link: graphs in the same class should be close.
Separability: unlabeled graphs should be separable from each other.

?? CCannot-Link Graphs in different classes should be far away MMust-Link Graphs in the same class should be close SSeparability Unlabeled graphs are able to be separated from each other Evaluation Function:

12 gSemi score: the score of the k-th subgraph feature measures how well it satisfies the three criteria; in matrix form, the score of a feature set is the sum over all selected features. Example from the slide: an N-H fragment scores well (good), while a plain carbon chain scores poorly (bad).
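One way to sketch the "matrix form" remark: assume each per-feature score is a quadratic form f_kᵀ Q f_k over that feature's 0/1 indicator vector across the graphs, with Q encoding the constraints, so the set score is the sum over selected features. The concrete Q of the paper is not reproduced here; the matrix below is made up.

```python
# Sketch of a per-feature score as a quadratic form f_k^T Q f_k, where f_k is
# the 0/1 indicator vector of the k-th subgraph over all graphs and Q is a
# (hypothetical) constraint matrix. Only the "sum over all selected features"
# structure from the slide is shown.

def quad_form(f, Q):
    n = len(f)
    return sum(f[i] * Q[i][j] * f[j] for i in range(n) for j in range(n))

def feature_set_score(indicator_vectors, Q):
    """Matrix form: total score = sum of per-feature quadratic forms."""
    return sum(quad_form(f, Q) for f in indicator_vectors)

# Toy constraint matrix over 3 graphs (hypothetical values)
Q = [[ 2, -1, -1],
     [-1,  2, -1],
     [-1, -1,  2]]
f1 = [1, 1, 0]   # subgraph g_1 appears in graphs 0 and 1
f2 = [1, 0, 1]   # subgraph g_2 appears in graphs 0 and 2
print(feature_set_score([f1, f2], Q))  # → 4
```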

13 Experiments on the MCF-3, NCI-H23, OVCAR-8, PTC-MM, and PTC-FM datasets, with the number of labeled graphs set to 30, 50, and 70.

14 Accuracy (higher is better) vs. number of selected features (#labeled graphs = 30, MCF-3 dataset), comparing gSemi (semi-supervised) with information gain (supervised) and frequent patterns (unsupervised). Our approach performed best on the NCI and PTC datasets.

15 Recap of the two problems: how to evaluate a set of subgraph features with both labeled and unlabeled graphs (effective), and how to prune the subgraph search space using both labeled and unlabeled graphs (efficient).

16 Pattern search tree: gSpan [Yan et al., ICDM'02] is an efficient algorithm that enumerates all frequent subgraph patterns (frequency ≥ min_support), growing patterns edge by edge (0 edges, 1 edge, 2 edges, …) and cutting off branches that are not frequent. But there are too many frequent subgraph patterns; we want only the most useful one(s), i.e., the best node(s) in this tree. How can we find the best node(s) without searching all the nodes? Branch and bound to prune the search space.
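A toy, subset-based stand-in for this kind of enumeration with frequency pruning (real gSpan grows connected patterns via canonical DFS codes; the edge names here are hypothetical):

```python
# Sketch: grow patterns one edge at a time and stop extending any pattern whose
# support drops below min_support -- an infrequent pattern cannot have a
# frequent extension, so its whole branch of the search tree is skipped.

def support(pattern, graphs):
    return sum(1 for g in graphs if pattern <= g)

def enumerate_frequent(graphs, min_support):
    all_edges = set().union(*graphs)
    frequent = []
    def grow(pattern, candidates):
        frequent.append(frozenset(pattern))
        cand = list(candidates)
        for i, e in enumerate(cand):
            child = pattern | {e}
            if support(child, graphs) >= min_support:   # prune infrequent branch
                grow(child, cand[i + 1:])               # avoid duplicate orderings
    grow(frozenset(), [e for e in all_edges if support({e}, graphs) >= min_support])
    return frequent

graphs = [frozenset({"ab", "bc"}), frozenset({"ab", "bd"}),
          frozenset({"ab", "bc", "bd"})]
pats = enumerate_frequent(graphs, min_support=2)
print(sorted(sorted(p) for p in pats))
# [[], ['ab'], ['ab', 'bc'], ['ab', 'bd'], ['bc'], ['bd']]
```

Note how {'bc', 'bd'} and the full triple are never visited: their parents already fail the support test.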

17 Branch-and-bound pruning: keep the best subgraph (and its gSemi score) found so far. At the current node of the pattern search tree, compute the current gSemi score and an upper bound on the gSemi score of every pattern in its sub-tree. If the best score so far ≥ this upper bound, we can prune the entire sub-tree.
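The pruning rule can be sketched generically. The tree, scores, and upper bounds below are toy stand-ins for the gSemi machinery; only the "best score ≥ upper bound, so prune the sub-tree" logic matches the slide.

```python
# Sketch of branch-and-bound search over a pattern tree. upper_bound(node) must
# bound the score of everything in that node's sub-tree for pruning to be safe.

def search(node, children, score, upper_bound, best=None):
    """Return (best_score, best_node), pruning sub-trees that cannot improve."""
    if best is None:
        best = (float("-inf"), None)
    s = score(node)
    if s > best[0]:
        best = (s, node)
    for child in children(node):
        if best[0] >= upper_bound(child):   # best score so far >= bound: prune
            continue
        best = search(child, children, score, upper_bound, best)
    return best

# Toy tree and numbers, chosen so pruning actually fires
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": [], "a1": [], "a2": []}
scores = {"root": 0, "a": 5, "a1": 7, "a2": 3, "b": 2}
bounds = {"root": 10, "a": 8, "a1": 7, "a2": 4, "b": 3}

best_score, best_node = search("root", lambda n: tree[n],
                               scores.__getitem__, bounds.__getitem__)
print(best_score, best_node)  # → 7 a1
```

Here the sub-trees at "a2" and "b" are skipped entirely because the best score already found (7) meets or exceeds their upper bounds.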

18 Running time in seconds (lower is better) on the MCF-3 dataset: with gSemi pruning vs. without gSemi pruning.

19 Number of subgraphs explored (lower is better) on the MCF-3 dataset: gSemi pruning vs. without gSemi pruning.

20 Parameter study: accuracy (higher is better) on the MCF-3 dataset (#labels = 50, #features = 20) as α (weight of the cannot-link constraints) and β (weight of the must-link constraints) vary; the best setting is α = 1, β = 0.1. For some settings the semi-supervised score comes close to the supervised and unsupervised baselines.

21 Conclusion: semi-supervised feature selection for graph classification.
Evaluation: subgraph features are evaluated using both labeled and unlabeled graphs (effective).
Branch-and-bound: the search space is pruned using both labeled and unlabeled graphs (efficient).
Thank you!