Efficient Mining of Graph-Based Data. Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook, University of Texas at Arlington.

Presentation transcript:

Workshop 1: Efficient Mining of Graph-Based Data. Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook. University of Texas at Arlington, Department of Computer Science and Engineering.

Workshop 2: Motivation. Structural/relational data; ease of graph representation.

Workshop 3: Graph-Based Discovery. [Figure, three panels: Input Database (objects T1-T4, B1-B4, R1, C1); Substructure S1 in graph form (two object vertices with shapes triangle and square, joined by an "on" edge); Compressed Database (R1 and C1 remain).]

Workshop 4: Algorithm. 1. Create a substructure for each unique vertex label. Substructures: triangle (4), square (4), circle (1), rectangle (1). [Figure: input graph of triangle, square, circle and rectangle vertices connected by "on" edges.]

Workshop 5: Algorithm. 2. Expand the best substructure by an edge, or by an edge plus a neighboring vertex. [Figure: candidate two-vertex substructures formed from triangle, square, rectangle and circle vertices joined by "on" edges, shown against the input graph.]

Workshop 6: Algorithm. 3. Keep only the best beam-width substructures on the queue. 4. Terminate when the queue is empty or the number of discovered substructures >= limit. 5. Compress the graph and repeat to generate a hierarchical description. Note: the search is polynomially constrained.
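Steps 2 to 4 of the algorithm can be sketched as a generic beam search. This is a minimal illustration, not the Subdue implementation: the names `beam_discover`, `expand` and `score` are hypothetical, candidates are assumed to be hashable, and a higher score is assumed to be better.

```python
def beam_discover(seeds, expand, score, beam_width, limit):
    """Beam search over candidate substructures (sketch; higher score is better).

    seeds      -- initial candidates, e.g. one per unique vertex label (step 1)
    expand     -- hypothetical callback returning a candidate's one-edge extensions
    score      -- hypothetical callback rating a candidate
    beam_width -- number of candidates kept on the queue
    limit      -- maximum number of substructures to discover
    """
    queue = sorted(seeds, key=score, reverse=True)[:beam_width]
    seen = set(queue)
    discovered = []
    # Step 4: stop when the queue is empty or the limit is reached.
    while queue and len(discovered) < limit:
        best = queue.pop(0)              # step 2: take the best candidate
        discovered.append(best)
        for child in expand(best):       # grow it by an edge (or edge + vertex)
            if child not in seen:
                seen.add(child)
                queue.append(child)
        # Step 3: keep only the best beam_width candidates on the queue.
        queue = sorted(queue, key=score, reverse=True)[:beam_width]
    return sorted(discovered, key=score, reverse=True)
```

In this sketch strings stand in for substructures, so `expand` can simply append a character; a real system would extend graph instances and would also handle step 5 (compressing the graph and repeating).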

Workshop 7: Evaluation Metric. Substructures are evaluated on their ability to compress the input graph. Compression is measured using minimum description length (DL). The best substructure S in graph G minimizes DL(S) + DL(G|S).
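The quantity DL(S) + DL(G|S) can be illustrated with a deliberately simplified encoding. The bit counts below are an assumption made for this sketch, not the encoding used by Subdue: each vertex costs one label, each edge costs two vertex indices plus one label, and every instance of S collapses into a single new vertex in the compressed graph. The function names are hypothetical.

```python
import math


def description_length(n_vertices, n_edges, n_labels):
    """Approximate bits needed to encode a labeled graph (simplified assumption)."""
    if n_vertices == 0:
        return 0.0
    label_bits = math.log2(max(n_labels, 2))     # bits to name one label
    vertex_bits = math.log2(max(n_vertices, 2))  # bits to name one vertex
    return n_vertices * label_bits + n_edges * (2 * vertex_bits + label_bits)


def total_description_length(g_vertices, g_edges, s_vertices, s_edges,
                             instances, n_labels):
    """DL(S) + DL(G|S) for non-overlapping instances of S in G (sketch)."""
    # Each instance shrinks to one vertex carrying a new label for S.
    c_vertices = g_vertices - instances * (s_vertices - 1)
    c_edges = g_edges - instances * s_edges
    dl_s = description_length(s_vertices, s_edges, n_labels)
    dl_g_given_s = description_length(c_vertices, c_edges, n_labels + 1)
    return dl_s + dl_g_given_s
```

With hypothetical counts (a graph of 10 vertices and 9 edges over 4 labels, and a two-vertex, one-edge substructure), four instances give a smaller total than one instance, which is the behavior the metric rewards.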

Workshop 8: Examples

Workshop 9: Inexact Graph Match. Some variations may occur between instances, and we want to abstract over minor differences. Difference = cost of transforming one graph into a graph isomorphic to the other. Match if cost/size < threshold.
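The match rule cost/size < threshold can be sketched for very small graphs by trying every vertex mapping. Several choices here are assumptions for illustration only: every edit has unit cost, a relabeled edge counts as one deletion plus one insertion, "size" is taken as the vertex-plus-edge count of the larger graph, and the exhaustive search is only practical for a handful of vertices. A graph is a pair (vertex labels, set of (source, target, label) edges).

```python
from itertools import permutations


def transform_cost(g1, g2):
    """Minimum unit-cost edits turning g1 into a graph isomorphic to g2 (brute force)."""
    labels1, edges1 = g1
    labels2, edges2 = g2
    n = max(len(labels1), len(labels2))
    # Pad the smaller graph with placeholder vertices (None) so sizes agree.
    a = list(labels1) + [None] * (n - len(labels1))
    b = list(labels2) + [None] * (n - len(labels2))
    best = None
    for perm in permutations(range(n)):
        # Vertex edits: substituted, inserted or deleted labels.
        cost = sum(1 for i in range(n) if a[i] != b[perm[i]])
        # Edge edits: edges present in one graph but not the other.
        mapped = {(perm[u], perm[v], lab) for (u, v, lab) in edges1}
        cost += len(mapped ^ set(edges2))
        if best is None or cost < best:
            best = cost
    return best


def inexact_match(g1, g2, threshold):
    """True when cost/size < threshold, with size taken from the larger graph."""
    size = max(len(g1[0]) + len(g1[1]), len(g2[0]) + len(g2[1]))
    if size == 0:
        return True
    return transform_cost(g1, g2) / size < threshold
```

For example, a triangle-on-square graph and a triangle-on-circle graph differ by one vertex label, so under these assumptions they match at a threshold of 0.5 but not at 0.2.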

Workshop 10: Parallel/Distributed Discovery. Divide the graph into P partitions using Metis and distribute them to P processors. Each processor performs serial Subdue on its local partition. Broadcast the best substructures and evaluate them on the other processors. The master processor stores the best global substructures. Result: close to linear speedup.
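The discover, broadcast, evaluate and collect protocol can be imitated sequentially in a few lines. This sketch makes strong simplifications that should not be read as the actual system: the partitions are given rather than computed by Metis, local discovery is replaced by plain frequency counting over hashable patterns, and the "processors" run one after another in a single process. The function names are hypothetical.

```python
from collections import Counter


def local_best(partition, k):
    """Stand-in for serial discovery: the k most frequent patterns in a partition."""
    return [pattern for pattern, _ in Counter(partition).most_common(k)]


def distributed_discover(partitions, k=1):
    """Sequential simulation of the partition / broadcast / evaluate protocol."""
    # Each "processor" proposes its best local patterns.
    candidates = set()
    for part in partitions:
        candidates.update(local_best(part, k))
    # Broadcast: every candidate is evaluated on every partition.
    totals = {c: sum(part.count(c) for part in partitions) for c in candidates}
    # Master: rank candidates by their global value (ties broken by name).
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
```

The point of the second pass is that a pattern that wins locally may be weak globally, so candidates are re-scored on all partitions before the master ranks them.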

Workshop 11: Graph-Based Concept Learning. One graph stores the positive examples and one graph stores the negative examples. Find a substructure that compresses the positive graph but not the negative graph. Evaluation is based on the count (PosEgsNotCovered) + (NegEgsCovered). Multiple iterations implement a set-covering approach.
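The set-covering loop can be sketched with feature sets standing in for graphs. The assumptions are labeled here because the slide preserves only the sum (PosEgsNotCovered) + (NegEgsCovered): this sketch treats that sum as an error to minimize, represents each example as a set of features, and says a candidate "covers" an example when it is a subset of it. The names `error` and `set_cover` are hypothetical.

```python
def error(candidate, positives, negatives):
    """(PosEgsNotCovered) + (NegEgsCovered) for one candidate (lower is better)."""
    pos_not_covered = sum(1 for e in positives if not candidate <= e)
    neg_covered = sum(1 for e in negatives if candidate <= e)
    return pos_not_covered + neg_covered


def set_cover(candidates, positives, negatives):
    """Greedy set covering: pick a candidate, drop the positives it covers, repeat."""
    remaining = list(positives)
    theory = []
    while remaining:
        best = min(candidates, key=lambda c: error(c, remaining, negatives))
        if not any(best <= e for e in remaining):
            break  # no candidate covers anything that is left
        theory.append(best)
        remaining = [e for e in remaining if not best <= e]
    return theory
```

Each iteration learns one substructure; the union of the selected candidates plays the role of the learned concept.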

Workshop 12: Concept-Learning Example. [Figure: example graph with object vertices, "shape" edges to triangle and square, and an "on" edge.]

Workshop 13: Concept-Learning Results. Chess endgames (19,257 examples): Black King is (+) or is not (-) in check. 99.8% FOIL, 99.21% Subdue.

Workshop 14: More Concept-Learning Results. Tic-Tac-Toe endgames (958 examples), where + is a win for X: 100% Subdue, 92.35% FOIL. Bach chorales, musical sequences (20 sequences): 100% Subdue, 85.71% FOIL.

Workshop 15: Graph-Based Clustering. Iterate Subdue until the graph is reduced to a single vertex. Each cluster (substructure) is inserted into a classification lattice. [Figure: lattice descending from a Root node.]

Workshop 16: Clustering Example: Animals.

Name       Body Cover      Heart Chamber   Body Temp.    Fertilization
mammal     hair            four            regulated     internal
bird       feathers        four            regulated     internal
reptile    cornified-skin  imperfect-four  unregulated   internal
amphibian  moist-skin      three           unregulated   external
fish       scales          two             unregulated   external

[Figure: graph representation of one row: an "animal" vertex with edges Name -> mammal, BodyCover -> hair, HeartChamber -> four, BodyTemp -> regulated, Fertilization -> internal.]

Workshop 17: Graph-Based Clustering Results.

Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three

Workshop 18: Cobweb Results. Comparison of Subdue and Cobweb results: the Subdue lattice produced better generalization, resulting in fewer clusters at higher levels. The Subdue lattice also identifies the overlap between (reptile) and (amphibian/fish). [Figure: Cobweb hierarchy with root "animals"; children amphibian/fish, mammal/bird and reptile; leaves mammal, bird, fish and amphibian.]

Workshop 19: Clustering Example: DNA

Workshop 20: Graph-Based Clustering Results. Coverage: 61%, 68%, 71%. [Figure: lattice rooted at "DNA" whose clusters are drawn as chemical structure fragments, including a phosphate group (O=P-OH) and carbon-nitrogen and carbon-carbon chains.]

Workshop 21: Evaluation of Clusterings. Traditional evaluation is not applicable to hierarchical domains, because it does not make sense to compare clusters in different subtrees. It is also not applicable to relational clusterings.

Workshop 22: Properties of Good Clusterings. Small number of clusters: large coverage -> good generality. Big cluster descriptions: more features -> more inferential power. Minimal or no overlap between clusters: more distinct clusters -> better defined concepts.

Workshop 23: New Evaluation Heuristic for Hierarchical Clusterings. Clustering rooted at C with c children H_i, each having |H_i| instances H_{i,k}. distance() is measured by inexact graph match. Animals: Subdue CQ = 2.6, Cobweb CQ = 1.7.

Workshop 24: Graph-Based Data Mining: Application Domains. Biochemical domains: protein data, DNA data, toxicology (cancer) data. Spatial-temporal domains: earthquake data, Aircraft Safety and Reporting System. Telecommunications data. Program source code. Web topology. [Figure: web graph of web_page vertices joined by hyperlink edges, starting from "home".]

Workshop 25: Theoretical Analysis. Galois lattice [Liquiere et al.]; conceptual graphs [Sowa et al.]; PAC analysis [Jappy et al.].

Workshop 26: Graph-based Data Mining. Pattern (substructure) discovery; hierarchical discovery; distributed discovery; concept learning; clustering. Compression heuristic based on minimum description length.

Workshop 27: Future Work. Concept learning: theoretical analysis, comparison to ILP systems. Clustering: classification lattice, evaluation metric for hierarchical relational conceptual clustering. Probabilistic substructures. Domains: WWW, source code.

Workshop 28: Subdue Source Code and Data