Graph-based Learning and Discovery Diane J. Cook University of Texas at Arlington

Data Mining
“The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [Frawley et al., 92]
- Increasing ability to generate data
- Increasing ability to store data

KDD Process

Approaches to Data Mining
- Pattern extraction
- Prediction / classification
- Clustering
[Figure: decision tree that tests Income and Debt to predict Loan / No Loan]

Substructure Discovery
- Most data mining algorithms deal with linear attribute-value data
- Need to represent and learn relationships between attributes

Subdue
- Discovers repetitive substructure patterns in graph databases
- Pattern extraction, classification, clustering
- Serial and parallel / distributed versions
- Applied to CAD circuits, telecom, DNA, and more

Graph Representation
- Input is a labeled graph
- A substructure is a connected subgraph
- An instance of a substructure is a subgraph that is isomorphic to the substructure definition
[Figure: input database (R1, C1, T1-T4, S1-S4), substructure S1 in graph form (objects with shape labels triangle and square joined by an "on" edge), and the compressed database]

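MDL Principle
- Best theory minimizes the description length of the data
- Evaluate a substructure based on its ability to compress the description length (DL) of the graph
- Description length = DL(S) + DL(G|S), where S is the substructure and G|S is the graph compressed using S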
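The MDL evaluation can be sketched in a few lines. This is only an illustration: the bit-counting scheme in `description_length` is a deliberately crude encoding chosen for this example, not the encoding Subdue uses, and the graph sizes in the usage note are made-up numbers.

```python
import math

def description_length(n_vertices, n_edges, n_vertex_labels, n_edge_labels):
    """Rough bit count for a labeled graph (illustrative encoding only)."""
    if n_vertices == 0:
        return 0.0
    # One label per vertex; at least one bit per label.
    vertex_bits = n_vertices * math.log2(max(n_vertex_labels, 2))
    # Each edge names two endpoints and carries one label.
    bits_per_edge = 2 * math.log2(max(n_vertices, 2)) + math.log2(max(n_edge_labels, 2))
    return vertex_bits + n_edges * bits_per_edge

def total_description_length(substructure, compressed_graph):
    """DL(S) + DL(G|S); each argument is a 4-tuple of graph statistics."""
    return description_length(*substructure) + description_length(*compressed_graph)

# Hypothetical sizes: (vertices, edges, vertex labels, edge labels).
original = (10, 11, 4, 2)
substructure = (2, 1, 2, 1)
compressed = (6, 7, 3, 2)
saves_bits = total_description_length(substructure, compressed) < description_length(*original)
```

With these assumed sizes, `saves_bits` is true: describing the substructure once plus the compressed graph takes fewer bits than describing the original graph, which is the condition under which a substructure counts as useful.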

Algorithm
1. Create substructure for each unique vertex label
[Figure: example graph with circle, rectangle, triangle, and square vertices joined by "on" and "left" edges]
Substructures: triangle (4), square (4), circle (1), rectangle (1)

Algorithm
2. Expand best substructure by an edge or edge+neighboring vertex
[Figure: the example graph and the substructures produced by one expansion step, built from triangle, square, circle, and rectangle vertices joined by "on" and "left" edges]

Algorithm
3. Keep only best substructures on queue (specified by beam width)
4. Terminate when queue is empty or #discovered substructures >= limit
5. Compress graph and repeat to generate hierarchical description
Note: polynomially constrained [IEEE Exp96]
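The queue handling in steps 2 to 4 is a beam search. The sketch below shows only that control loop; `expand` and `score` are hypothetical callables supplied by the caller (in Subdue they would correspond to extending a substructure by an edge and evaluating its compression), and the compression pass of step 5 is left out.

```python
def beam_search(initial, expand, score, beam_width=4, limit=100):
    """Generic beam search over candidates.

    initial    -- starting candidates (step 1 would supply one per vertex label)
    expand     -- function: candidate -> list of larger candidates (step 2)
    score      -- function: candidate -> number, higher is better
    beam_width -- how many candidates survive on the queue (step 3)
    limit      -- stop once this many candidates have been examined (step 4)
    """
    queue = sorted(initial, key=score, reverse=True)[:beam_width]
    discovered = []
    while queue and len(discovered) < limit:
        best = queue.pop(0)            # take the best candidate on the queue
        discovered.append(best)
        queue.extend(expand(best))     # add its expansions
        queue = sorted(queue, key=score, reverse=True)[:beam_width]
    return sorted(discovered, key=score, reverse=True)

# Toy stand-in for substructures: strings that grow by one letter, scored by
# how many 'a' characters they contain.
def grow(s):
    return [s + "a", s + "b"] if len(s) < 3 else []

ranked = beam_search(["a", "b"], grow, lambda s: s.count("a"), beam_width=2, limit=10)
```

A narrow beam keeps the search cheap but can discard a candidate whose expansions would have scored well, which is the usual trade-off behind a beam-width parameter.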

Examples [Jair94]

Inexact Graph Match [JIIS95]
- Some variations may occur between instances
- Want to abstract over minor differences
- Difference = cost of transforming one graph to make it isomorphic to another
- Match if cost/size < threshold
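The acceptance test on the last bullet is a single comparison. In the sketch below, `size` is assumed to be a count of the vertices and edges of the graph being matched; the slide does not define it, so treat that reading as an assumption. Computing the transformation cost itself is not shown.

```python
def is_inexact_match(cost, size, threshold):
    """True when the transformation cost is small relative to graph size."""
    if size <= 0:
        raise ValueError("size must be positive")
    return cost / size < threshold

def max_transformation_cost(size, threshold):
    """Cost budget implied by the threshold for a graph of the given size."""
    return threshold * size

# Hypothetical graph of size 21 with threshold 0.2.
budget = max_transformation_cost(21, 0.2)
accepted = is_inexact_match(4, 21, 0.2)   # 4/21 is below 0.2
rejected = is_inexact_match(5, 21, 0.2)   # 5/21 is above 0.2
```

The numbers are chosen to line up with the web-search slides later in the deck, which quote a threshold of 0.2 as allowing 4.2 transformations; that figure is consistent with a size of 21 under this reading.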

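Inexact Graph Match
[Figure: a graph with vertices 1 and 2 matched against a graph with vertices 3, 4, and 5; vertex labels A and B, edge labels a and b]
Search tree of vertex mappings and their costs ("none" means the vertex is mapped to no vertex):
- (1,3) cost 1: (2,4) 7, (2,5) 6, (2,none) 10
- (1,4) cost 0: (2,3) 3, (2,5) 6, (2,none) 9
- (1,5) cost 1: (2,3) 7, (2,4) 7, (2,none) 10
- (1,none) cost 1: (2,3) 9, (2,4) 10, (2,5) 9, (2,none) 11
Least-cost match is {(1,4), (2,3)}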

Background Knowledge [IEEE TKDE96]
- Some substructures not relevant
- Background knowledge can bias search
- Two types
  - Model knowledge
  - Graph match rules

Parallel/distributed Subdue [JPDC00]
- Scalability issues
- Three approaches
  - Dynamic partitioning
  - Functional parallel
  - Static partitioning

Dynamic Partitioning
- Processor i stores ith vertex label
- Each processor operates as in serial Subdue
- Avoid replication by expanding to higher vertices
[Figure: example graph with vertices v1-v5 and edges e1-e4]

Dynamic Partitioning
- Partitions are logical
- Excessive processor idling and load balancing
- Results very poor

Functional Parallel
- Master processor controls search queue
- Slaves evaluate and expand substructures
- Synchronization after each step

Functional Parallel Results
- ART database: 1,000 vertices and 2,000 edges
- CAD database: 8,441 vertices and 19,206 edges

Static Partitioning
- Divide graph into P partitions, distribute to P processors
- Each processor performs serial Subdue on local partition
- Broadcast best substructures, evaluate on other processors
- Master processor stores best global substructures
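The four steps can be mimicked in a single process to show the data flow. This sketch makes strong simplifications, all of them assumptions of the example: a "substructure" is just one labeled edge pattern, local discovery is a frequency count instead of serial Subdue, partitions are formed round-robin, and the processors are simulated by a list.

```python
from collections import Counter

def static_partition_discovery(edge_patterns, n_partitions, top_k=2):
    """Simulate static partitioning on a list of (source, edge, target) labels."""
    # 1. Divide the data into P partitions (round-robin here).
    partitions = [edge_patterns[i::n_partitions] for i in range(n_partitions)]
    # 2. Each "processor" discovers patterns in its own partition.
    local = [Counter(part) for part in partitions]
    # 3. Every processor broadcasts its best local patterns...
    candidates = {pattern for counts in local
                  for pattern, _ in counts.most_common(top_k)}
    #    ...and each candidate is evaluated on every partition.
    totals = {pattern: sum(counts[pattern] for counts in local)
              for pattern in candidates}
    # 4. The master keeps the globally best pattern and its total count.
    return max(totals.items(), key=lambda item: item[1])

edges = ([("triangle", "on", "square")] * 4
         + [("circle", "left", "square"), ("rectangle", "on", "square")])
best_pattern, best_count = static_partition_discovery(edges, n_partitions=3)
```

Because this toy works on pre-extracted edge patterns, nothing is lost at partition boundaries. With a real graph, edges that cross partitions are cut, so some instances of a substructure become invisible to every processor.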

Static Partitioning Results
- Close to linear speedup
- Continue until #processors > #vertices

Speedup Comparison

Issues
- Partitioning the graph loses information
- Metis graph partitioning system
- Quality of resulting substructures?
- Recapture by overlap, multiple partitions
- Evaluating more substructures globally

Compression Results

Recapture Lost Information
- Allow overlap between partitions
- Run twice with two partitions, keep the maximum of the results

Recapture Lost Information

AutoClass
- Linear representation
- Fit possible probabilistic models to data
- Satellite data, DNA data, Landsat data

SUBDUE/AutoClass Combined
[Figure: data flow in which Subdue extracts structural patterns from the data, these are combined with the linear features (combination of linear data or addition of linear features), and AutoClass produces classes]

Example: color squares
- AutoClass Rep - tuple for each line (x1, y1, x2, y2, angle, length, color)
- Add structure (neighboring edge information)
- Subdue Rep - each line is node in graph, edges between connecting lines
- Attributes from nodes

Results
- AutoClass (12 classes)
  Class 0 (20): Color=green, LineNo=Line1=Line2=98 +/- 10
  Class 1 (20): Color=red, LineNo=Line1=Line2=99 +/- 10
  …
  Class 11 (3): Line2=1 +/-13, Color=green
- Subdue (top substructure)

Combined Results
- Combine 4 entries for each square into one
- 30 tuples (one for each square)
- Discover
  Class 0 (10): Color1=red, Color2=red, Color3=green, Color4=green
  Class 1 (10): Color1=green, Color2=green, Color3=blue, Color4=blue
  Class 2 (10): Color1=blue, Color2=blue, Color3=red, Color4=red

More Results

Supervised SUBDUE [IEEE IS00]
- One graph stores positive examples
- One graph stores negative examples
- Find substructure that compresses positive graph but not negative graph
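The idea of preferring a substructure that fits the positive graph but not the negative graph can be illustrated with a simpler stand-in measure. The sketch below swaps compression for coverage: it counts positive examples that contain a pattern and negative examples that do not. That scoring rule, and the representation of each example as a set of labeled edges, are assumptions of this example and not the measure Subdue applies.

```python
def coverage_score(pattern, positives, negatives):
    """Fraction of examples handled correctly by a single edge pattern."""
    positives_covered = sum(1 for example in positives if pattern in example)
    negatives_avoided = sum(1 for example in negatives if pattern not in example)
    return (positives_covered + negatives_avoided) / (len(positives) + len(negatives))

def best_pattern(positives, negatives):
    """Pick the candidate pattern (drawn from the positives) with the top score."""
    candidates = sorted(set().union(*positives))  # sorted for a stable tie-break
    return max(candidates, key=lambda p: coverage_score(p, positives, negatives))

# Hypothetical examples, each a set of (source, edge, target) labels.
positives = [
    {("triangle", "on", "square"), ("square", "on", "circle")},
    {("triangle", "on", "square")},
]
negatives = [
    {("square", "on", "circle")},
    {("circle", "on", "square")},
]
learned = best_pattern(positives, negatives)
```

Here the triangle-on-square pattern appears in every positive example and in no negative one, so it scores 1.0 and is selected over square-on-circle, which also occurs in a negative example.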

Example
[Figure: learned substructure in which objects with shape labels triangle and square are joined by an "on" edge]

Results
- Chess endgames (19,257 examples), BK is (+) or is not (-) in check
- 99.8% FOIL, 99.77% C4.5, 99.21% Subdue

More Results
- Tic Tac Toe endgames
  - + is win for X (958 examples)
  - 100% Subdue, 92.35% FOIL, 96.03% C4.5
- Bach chorales
  - Musical sequences (20 sequences)
  - 100% Subdue, 85.71% FOIL, 82.00% C4.5

Clustering Using SUBDUE
- Iterate Subdue until single vertex
- Each cluster (substructure) inserted into a classification lattice
- Early results similar to COBWEB [Fisher87]
[Figure: classification lattice descending from a Root node]

Discovery Application Domains
- Biochemical domains
  - Protein data [PSB99, IDA99]
  - Human Genome DNA data
  - Toxicology (cancer) data
- Spatial-temporal domains
  - Earthquake data
  - Aircraft Safety and Reporting System
- Telecommunications data
- Program source code

Structured Web Search [AAAI-AIWS00]
- Existing search engines use linear feature match
- Subdue searches based on structure
- Incorporation of WordNet allows for inexact feature match through synset path length
- Technique
  - Breadth-first search through domain to generate graph
  - Nodes represent pages / documents
  - Edges represent hyperlinks
  - Additional nodes used to represent document keywords
  - Pose query as graph
  - Search for query match within domain graph
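The graph construction can be sketched without any network access. In the example below the "domain" is a hypothetical in-memory dictionary (`site`) mapping a page name to its outgoing links and keywords; a real crawler would fetch and parse pages instead. The edge labels `_hyperlink_` and `_word_` follow the labels used in the deck's query example, and the one structural query shown (pages that link to a page containing a word) is answered with plain set operations, not with Subdue's graph matcher.

```python
from collections import deque

def build_web_graph(site, start):
    """Breadth-first traversal that emits labeled edges for pages and keywords."""
    edges, seen, queue = [], {start}, deque([start])
    while queue:
        url = queue.popleft()
        page = site[url]
        for word in page["words"]:            # keyword nodes hang off the page
            edges.append((url, "_word_", word))
        for target in page["links"]:
            if target not in site:            # stay inside the domain
                continue
            edges.append((url, "_hyperlink_", target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return edges

def pages_linking_to_word(edges, word):
    """Pages with a hyperlink to some page that contains the given word."""
    holders = {src for src, label, dst in edges if label == "_word_" and dst == word}
    return sorted({src for src, label, dst in edges
                   if label == "_hyperlink_" and dst in holders})

# Hypothetical three-page domain.
site = {
    "home":     {"links": ["research", "teaching"], "words": ["welcome"]},
    "research": {"links": ["teaching"], "words": ["subdue"]},
    "teaching": {"links": [], "words": ["robotics"]},
}
graph_edges = build_web_graph(site, "home")
hits = pages_linking_to_word(graph_edges, "subdue")
```

Representing keywords as extra nodes keeps everything in one labeled graph, so a query about link structure and a query about page content can be posed in the same form.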

Sample Search
[Figure: sample page graph linking an Instructor page to Teaching, Research, and Publication pages, each with the keyword Robotics; publication links are http and Postscript | PDF]

Query: Find all pages which link to a page containing term ‘subdue’

Query graph (page -hyperlink-> page -word-> subdue):
    /* Vertex ID Label */
    s
    v 1 _page_
    v 2 _page_
    v 3 subdue
    /* Edge Vertex 1 Vertex 2 Label */
    d 1 2 _hyperlink_
    d 2 3 _word_

Matching subgraph:
    Subgraph vertices:
    1 _page_ URL:
    7 _page_ URL:
    8 Subdue
    [1->7] hyperlink
    [7->8] word
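The query text on this slide has a regular shape: `v` lines declare a vertex id and label, `d` lines declare a directed edge with a label, and `/* ... */` lines are comments. A minimal reader for just those tokens might look like the sketch below. It handles only what appears on the slide; treating the lone `s` line as a marker to skip is an assumption, and the full Subdue file format may include directives not covered here.

```python
def parse_graph(text):
    """Parse 'v <id> <label>' and 'd <from> <to> <label>' lines."""
    vertices, edges = {}, []
    for raw in text.splitlines():
        tokens = raw.split()
        # Skip blanks, comment lines, and the 's' marker line.
        if not tokens or tokens[0].startswith("/*") or tokens[0] == "s":
            continue
        if tokens[0] == "v" and len(tokens) == 3:
            vertices[int(tokens[1])] = tokens[2]
        elif tokens[0] == "d" and len(tokens) == 4:
            edges.append((int(tokens[1]), int(tokens[2]), tokens[3]))
        else:
            raise ValueError(f"unrecognized line: {raw!r}")
    return vertices, edges

QUERY = """\
/* Vertex ID Label */
s
v 1 _page_
v 2 _page_
v 3 subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 _hyperlink_
d 2 3 _word_
"""
query_vertices, query_edges = parse_graph(QUERY)
```

Parsing the slide's query this way gives three vertices and two directed edges, which is the two-hop pattern (page, hyperlink, page, word, subdue) that the search then looks for in the domain graph.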

Search for Presentation Pages
- Subdue: 22 instances
- AltaVista
  - Query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif.”
  - 12 instances
[Figure: query graph of pages joined by hyperlink edges]

Search for Reference Pages
- Search for page with at least 35 in-links
- 5 pages in www-cse
- AltaVista cannot perform this type of search
[Figure: query graph of a page with many incoming hyperlink edges]

Search for pages on ‘jobs in computer science’
- Inexact match: allow one level of synonyms
- Subdue found 33 matches
- Words include employment, work, job, problem, task
- AltaVista found 2 matches
[Figure: query graph of a page with word edges to jobs, computer, and science]

Search for ‘authority’ hub and authority pages
- Subdue found 3 hub (and 3 authority) pages
- AltaVista cannot perform this type of search
- Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
- Subdue found 13 matches
[Figure: query graph of hub pages with hyperlink edges to authority pages, each with a word edge to "algorithms"]

Subdue Learning from Web Data
- Distinguish professors’ and students’ web pages
  - Learned concept (professors have “box” in address field)
- Distinguish online stores and professors’ web pages
  - Learned concept (stores have more levels in graph)
[Figure: learned substructure of a page with a word edge to "box"]

To Learn More cygnus.uta.edu/subdue