
Presentation transcript:

• Data mining has emerged as a critical tool for knowledge discovery in large data sets. It has been extensively used to analyze business, financial, and textual data sets.
• The success of these techniques has renewed interest in applying them to various scientific and engineering fields:
  - Astronomy
  - Life sciences
  - Ecosystem modeling
  - Structural mechanics
  - …

• Most existing data mining algorithms assume that the data is represented via:
  - Transactions (sets of items)
  - Sequences of items or events
  - Multi-dimensional vectors
  - Time series
• Scientific datasets with structures, layers, hierarchy, geometry, and arbitrary relations cannot be accurately modeled using such frameworks, e.g., numerical simulations, 3D protein structures, chemical compounds.
• Need algorithms that operate on scientific datasets in their native representation.

• There are two basic choices:
  - Treat each dataset/application differently and develop custom representations/algorithms.
  - Employ a new way of modeling such datasets and develop algorithms that span different applications.
• What should be the properties of this general modeling framework?
  - Abstract compared with the original raw data.
  - Yet powerful enough to capture the important characteristics.
• Labeled directed/undirected topological/geometric graphs and hypergraphs

Graphs are suitable for capturing arbitrary relations between the various elements.
  - Element → Vertex
  - Element's attributes → Vertex label
  - Relation between two elements → Edge
  - Type of relation → Edge label
  - Relation between a set of elements → Hyperedge
  - Data instance → Graph instance
Graphs provide enormous flexibility for modeling the underlying data, as they allow the modeler to decide what the elements should be and which types of relations to model.
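The element-to-vertex mapping can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: the class name `LabeledGraph`, its methods, and the water-molecule example are all assumptions chosen for the sketch, and it covers undirected graphs without hyperedges.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """One data instance: vertices are elements, edges are relations (hypothetical sketch)."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> element's attributes
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> type of relation

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        # An edge is a relation between two elements that already exist.
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be added before the edge")
        self.edge_labels[frozenset((u, v))] = label

    def neighbors(self, v):
        # Every vertex that shares an edge with v.
        return sorted(w for e in self.edge_labels if v in e for w in e if w != v)


# A water molecule as a graph instance: atoms are vertices, bonds are edges.
water = LabeledGraph()
water.add_vertex(0, "O")
water.add_vertex(1, "H")
water.add_vertex(2, "H")
water.add_edge(0, 1, "single")
water.add_edge(0, 2, "single")
```

The modeler's choices show up directly here: what counts as a vertex (an atom) and which relations become edges (bonds) are decisions made when the graph is built, not properties of the raw data.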

PDB 1MWP: N-terminal domain of the amyloid precursor protein (Alzheimer's disease amyloid A4 protein precursor)

• Develop algorithms to mine and analyze graph data sets:
  - Finding patterns in these graphs
  - Finding groups of similar graphs (clustering)
  - Building predictive models for the graphs (classification)

• Structural motif discovery
• High-throughput screening
• Protein fold recognition
• VLSI reverse engineering
• A lot more …
Beyond scientific applications:
• Semantic web
• Mining relational profiles
• Behavioral modeling
• Intrusion detection
• Citation analysis
• …

Approach #1: Frequent Subgraph Mining
• Find all subgraphs g within a set of graph transactions G such that the support of g in G is at least t, where t is the minimum support.
• Focus on pruning and fast, code-based graph matching.
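To make the support condition concrete, here is a minimal Python sketch. It assumes that support means the fraction of graph transactions containing the pattern, and it uses a brute-force labeled subgraph test that is only practical for tiny graphs. The helper names `make_graph`, `contains`, and `support` are hypothetical, and this is not the pruned, code-based matching used by the algorithms in this presentation.

```python
from itertools import permutations


def make_graph(vertex_labels, edges):
    """vertex_labels: {id: label}; edges: iterable of (u, v, label), undirected."""
    return dict(vertex_labels), {frozenset((u, v)): lab for u, v, lab in edges}


def contains(graph, pattern):
    """True if pattern occurs in graph as a labeled (not necessarily induced) subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(g_labels[mapping[v]] != p_labels[v] for v in p_nodes):
            continue
        if all(g_edges.get(frozenset(mapping[x] for x in e)) == lab
               for e, lab in p_edges.items()):
            return True
    return False


def support(transactions, pattern):
    """Fraction of graph transactions that contain the pattern (assumed definition)."""
    return sum(contains(g, pattern) for g in transactions) / len(transactions)


g1 = make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1, "s"), (1, 2, "s")])
g2 = make_graph({0: "C", 1: "O"}, [(0, 1, "s")])
g3 = make_graph({0: "C", 1: "C"}, [(0, 1, "d")])
c_o = make_graph({0: "C", 1: "O"}, [(0, 1, "s")])

t = 0.5  # minimum support
is_frequent = support([g1, g2, g3], c_o) >= t
```

In this toy database the C–O pattern occurs in g1 and g2 but not in g3, so it is frequent at t = 0.5. The factorial cost of `contains` is exactly why real miners invest in pruning and canonical codes.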

• Approach #1: Algorithms
  - Apriori-based Graph Mining (AGM): Inokuchi, Washio & Motoda (Osaka U., Japan)
  - Frequent Sub-Graph discovery (FSG): Kuramochi & Karypis (U. Minnesota)
  - Graph-based Substructure pattern mining (gSpan): Yan & Han (UIUC)
  - Fast Frequent Subgraph Mining (FFSM), Spanning tree based maximal graph mining (Spin): Huan, Wang & Prins (UNC Chapel Hill)
  - Graph, Sequences and Tree extraction (Gaston): Kazius & Nijssen (U. Leiden, Netherlands)

• A pattern is a relation between the object's elements that recurs repeatedly.
  - Common structures in a family of chemical compounds or proteins.
  - Similar arrangements of vortices at different "instances" of turbulent fluid flows.
  - …
• There are two general ways to formally define a pattern in the context of graphs:
  - Arbitrary subgraphs (connected or not)
  - Induced subgraphs (connected or not)
• Frequent pattern discovery translates to frequent subgraph discovery…
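The difference between the two pattern definitions can be shown with a short Python sketch. It assumes fixed vertex identities (no isomorphism search) and unlabeled, undirected edges stored as frozensets. The function names are hypothetical.

```python
def induced_edges(host_edges, vertices):
    """All edges of the host graph whose endpoints both lie in `vertices`."""
    vs = set(vertices)
    return {e for e in host_edges if e <= vs}


def is_subgraph(host_edges, sub_vertices, sub_edges):
    # Arbitrary subgraph: any subset of the edges among the chosen vertices.
    return set(sub_edges) <= induced_edges(host_edges, sub_vertices)


def is_induced_subgraph(host_edges, sub_vertices, sub_edges):
    # Induced subgraph: must keep every host edge among the chosen vertices.
    return set(sub_edges) == induced_edges(host_edges, sub_vertices)


triangle = {frozenset((0, 1)), frozenset((1, 2)), frozenset((0, 2))}
path = {frozenset((0, 1)), frozenset((1, 2))}
```

The path 0-1-2 is an arbitrary subgraph of the triangle but not an induced one, because the induced subgraph on {0, 1, 2} must also contain the edge 0-2.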

• Candidate generation
• Candidate pruning
• Frequency counting
• Key to FSG's computational efficiency
Simple operations become complicated and expensive when dealing with graphs…
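One reason these operations get expensive is that deciding whether two candidates are the same graph is an isomorphism problem. A minimal Python sketch of a brute-force canonical label makes the cost visible. It takes the smallest encoding over all vertex orderings, which is an assumption made for illustration and not FSG's actual labeling scheme. The function name `canonical_label` is hypothetical, and edge labels are assumed to be non-empty strings.

```python
from itertools import permutations


def canonical_label(vertex_labels, edges):
    """vertex_labels: list of labels indexed by vertex; edges: {(u, v): label}, undirected.

    Returns the smallest (vertex labels, upper-triangular adjacency) encoding
    over all n! vertex orderings, so isomorphic graphs get identical labels.
    """
    n = len(vertex_labels)
    sym = {}
    for (u, v), lab in edges.items():
        sym[(u, v)] = sym[(v, u)] = lab
    best = None
    for order in permutations(range(n)):
        vpart = tuple(vertex_labels[i] for i in order)
        # "" marks a missing edge; real edge labels are assumed non-empty.
        epart = tuple(sym.get((order[i], order[j]), "")
                      for i in range(n) for j in range(i + 1, n))
        code = (vpart, epart)
        if best is None or code < best:
            best = code
    return best


# The same B-A-B path under two different vertex numberings.
label_1 = canonical_label(["B", "A", "B"], {(0, 1): "x", (1, 2): "x"})
label_2 = canonical_label(["A", "B", "B"], {(0, 1): "x", (0, 2): "x"})
# A different graph: the path A-B-B.
label_3 = canonical_label(["A", "B", "B"], {(0, 1): "x", (1, 2): "x"})
```

Two candidates are duplicates exactly when their canonical labels match, so `label_1 == label_2` while `label_3` differs. The loop over n! orderings is what practical miners try to avoid through vertex invariants and smarter code construction.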

Multiple candidates for the same core!

Multiple cores between two (k-1)-subgraphs

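[Figure: example graph with vertices v0, v1, v2, v3 labeled B and v4, v5 labeled A, shown with two candidate label strings; the label strings were not preserved in the transcript.]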

Graph databases feed a four-step classification pipeline:
1. Discover frequent sub-graphs
2. Select discriminating features
3. Transform graphs into feature representation
4. Learn a classification model
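Steps 2 and 3 of this pipeline can be sketched in Python under a strong simplification: patterns are restricted to single labeled edges so that containment is a set lookup, not a subgraph isomorphism test. The function names, the `min_gap` selection rule, and the toy molecules are all assumptions for illustration. The resulting binary vectors could be passed to any standard vector classifier for step 4.

```python
def edge_features(graph):
    """Single-edge patterns (label_a, edge_label, label_b) present in the graph."""
    labels, edges = graph
    feats = set()
    for (u, v), lab in edges.items():
        a, b = sorted((labels[u], labels[v]))  # undirected: order endpoint labels
        feats.add((a, lab, b))
    return feats


def to_vector(graph, patterns):
    """Step 3: binary feature vector, one entry per pattern."""
    present = edge_features(graph)
    return [1 if p in present else 0 for p in patterns]


def presence_rate(graphs, pattern):
    return sum(pattern in edge_features(g) for g in graphs) / len(graphs)


def select_discriminating(pos, neg, patterns, min_gap):
    """Step 2 (assumed rule): keep patterns whose presence rates differ by >= min_gap."""
    return [p for p in patterns
            if abs(presence_rate(pos, p) - presence_rate(neg, p)) >= min_gap]


# Toy data: two "positive" and two "negative" molecules.
pos = [({0: "C", 1: "O"}, {(0, 1): "s"}),
       ({0: "C", 1: "O", 2: "C"}, {(0, 1): "s", (0, 2): "s"})]
neg = [({0: "C", 1: "C"}, {(0, 1): "s"}),
       ({0: "C", 1: "N"}, {(0, 1): "s"})]
patterns = [("C", "s", "C"), ("C", "s", "O"), ("C", "s", "N")]

selected = select_discriminating(pos, neg, patterns, min_gap=0.75)
vector = to_vector(pos[1], patterns)
```

Here only the C–O edge separates the two classes strongly enough to be selected. With general subgraph patterns, the set lookup in `to_vector` would be replaced by a subgraph containment test.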

• Approach #2: Find the subgraph S within a set of one or more graphs G that maximally compresses G, where (G|S) is G compressed by S, i.e., instances of S in G replaced by a single vertex.
• Focus on efficient subgraph generation and heuristic search.
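The compression step can be illustrated with a small Python sketch. The transcript does not preserve the scoring formula, so the measure used here, size(G) / (size(S) + size(G|S)) with size counted as vertices plus edges, is an assumption for illustration only. The instances of S are supplied by hand instead of being discovered, and all function names are hypothetical.

```python
def size(graph):
    """Assumed size measure: number of vertices plus number of edges."""
    labels, edges = graph
    return len(labels) + len(edges)


def compress(graph, instances, new_label="S"):
    """Build (G|S): replace each non-overlapping instance (a set of vertex ids) by one vertex."""
    labels, edges = graph
    owner = {}
    new_labels = dict(labels)
    for i, inst in enumerate(instances):
        sid = ("S", i)  # id of the vertex that stands in for this instance
        for v in inst:
            owner[v] = sid
            del new_labels[v]
        new_labels[sid] = new_label
    new_edges = {}
    for e, lab in edges.items():
        u, v = tuple(e)
        cu, cv = owner.get(u, u), owner.get(v, v)
        if cu != cv:  # edges internal to an instance disappear
            new_edges[frozenset((cu, cv))] = lab  # parallel edges collapse into one
    return new_labels, new_edges


def undirected(*pairs):
    return {frozenset(p): "e" for p in pairs}


# G is the chain A-B-C-A-B; S is the single edge A-B, which occurs twice.
G = ({0: "A", 1: "B", 2: "C", 3: "A", 4: "B"},
     undirected((0, 1), (1, 2), (2, 3), (3, 4)))
S = ({0: "A", 1: "B"}, undirected((0, 1)))

G_given_S = compress(G, [{0, 1}, {3, 4}])
value = size(G) / (size(S) + size(G_given_S))
```

For this chain, G has size 9, S has size 3, and (G|S) has size 5, giving a value of 1.125. A heuristic search would compare such values across candidate subgraphs and keep the best one.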

THE BASIC IDEA BEHIND THE GBI

PAIRWISE CHUNKING

• Graphs provide a powerful mechanism to represent relational and physical datasets.
• They can be used as a quick prototyping tool to test whether data-mining techniques can help a particular application area and problem.
• Their benefits can be realized if there exists an extensive set of efficient and scalable algorithms to mine them…

• Takashi Matsuda, Hiroshi Motoda, Takashi Washio, "Graph-based induction and its applications," Advanced Engineering Informatics, Volume 16, Issue 2, April 2002, Pages
• Michihiro Kuramochi, George Karypis, "Frequent Subgraph Discovery," First IEEE International Conference on Data Mining (ICDM'01), pp. 313, 2001.