1
Graph-based Learning and Discovery
Diane J. Cook
University of Texas at Arlington
cook@cse.uta.edu
http://www-cse.uta.edu/~cook
2
Data Mining
“The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [Frawley et al., 92]
Increasing ability to generate data
Increasing ability to store data
3
KDD Process
4
Approaches to Data Mining
- Pattern extraction
- Prediction / classification
- Clustering
[Figure: examples of the approaches; surviving labels include Debt, Income, Loan, No Loan, YES/NO, and the income ranges <50, 50-100, >100]
5
Substructure Discovery
- Most data mining algorithms deal with linear attribute-value data
- Need to represent and learn relationships between attributes
6
Subdue
- Discovers repetitive substructure patterns in graph databases
- Pattern extraction, classification, clustering
- Serial and parallel / distributed versions
- Applied to CAD circuits, telecom, DNA, and more
- http://cygnus.uta.edu/subdue
7
Graph Representation
- Input is a labeled graph
- A substructure is a connected subgraph
- An instance of a substructure is a subgraph that is isomorphic to the substructure definition
[Figure: input database (R1, C1, T1 S1, T2 S2, T3 S3, T4 S4), substructure S1 in graph form (labels object, triangle, square, shape, on), and compressed database (R1, C1)]
8
MDL Principle
- Best theory minimizes the description length of the data
- Evaluate a substructure based on its ability to compress the description length (DL) of the graph
- Description length = DL(S) + DL(G|S)
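The compression criterion can be illustrated with a small calculation. This is a minimal sketch that assumes a simplified size measure (vertex count plus edge count) as a stand-in for description length in bits; the function names and the example numbers are hypothetical and are not Subdue's actual encoding.

```python
def size(num_vertices: int, num_edges: int) -> int:
    """Simplified stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges


def compressed_size(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Size of G after each non-overlapping instance of S becomes one vertex.

    Each instance removes s_vertices vertices and s_edges internal edges
    and adds back a single placeholder vertex.
    """
    vertices = g_vertices - instances * s_vertices + instances
    edges = g_edges - instances * s_edges
    return size(vertices, edges)


def compression_value(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Ratio size(G) / (size(S) + size(G|S)); larger means better compression."""
    total = size(s_vertices, s_edges) + compressed_size(
        g_vertices, g_edges, s_vertices, s_edges, instances
    )
    return size(g_vertices, g_edges) / total


# Hypothetical graph: 10 vertices, 11 edges; S has 2 vertices, 1 edge, 4 instances.
example_value = compression_value(10, 11, 2, 1, 4)
```

Under these assumed numbers, a substructure is worth keeping when the ratio exceeds 1, since describing S once plus the compressed graph is then smaller than the original graph.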
9
Algorithm
1. Create a substructure for each unique vertex label
Substructures: triangle (4), square (4), circle (1), rectangle (1)
[Figure: example graph with one circle, one rectangle, and four triangle/square pairs, joined by edges labeled left and on]
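Step 1 amounts to counting vertex labels. A minimal sketch, assuming the example graph's vertices are given as a plain list of labels (the list and function name are illustrative):

```python
from collections import Counter

# Vertex labels of the example graph: four triangle/square pairs,
# one circle, and one rectangle.
vertex_labels = ["triangle", "square"] * 4 + ["circle", "rectangle"]


def initial_substructures(labels):
    """One single-vertex substructure per unique label, with instance counts."""
    return Counter(labels)


counts = initial_substructures(vertex_labels)
```

The resulting counts agree with the list on the slide: triangle (4), square (4), circle (1), rectangle (1).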
10
Algorithm
2. Expand the best substructure by an edge or an edge plus neighboring vertex
[Figure: the same example graph, with the expanded substructures: triangle and square (on), circle and square (left), rectangle and square (on), rectangle and triangle (on)]
11
Algorithm
3. Keep only the best substructures on the queue (number specified by beam width)
4. Terminate when the queue is empty or #discovered substructures >= limit
5. Compress the graph and repeat to generate a hierarchical description
Note: polynomially constrained [IEEE Exp96]
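The queue handling in steps 2 to 4 can be sketched as a generic beam search. This is an illustrative skeleton only, not Subdue's code: `expand` and `value` are hypothetical callables supplied by the caller, and the toy string domain at the end exists purely to exercise the loop.

```python
from typing import Callable, Hashable, Iterable, List


def beam_discover(
    initial: Iterable[Hashable],
    expand: Callable[[Hashable], Iterable[Hashable]],
    value: Callable[[Hashable], float],
    beam_width: int,
    limit: int,
) -> List[Hashable]:
    """Return the discovered candidates, best first."""
    queue = sorted(set(initial), key=value, reverse=True)[:beam_width]
    seen = set(queue)
    discovered = []
    # Step 4: stop when the queue is empty or enough candidates were discovered.
    while queue and len(discovered) < limit:
        parent = queue.pop(0)
        discovered.append(parent)
        # Step 2: expand the best candidate, skipping ones already generated.
        for child in expand(parent):
            if child not in seen:
                seen.add(child)
                queue.append(child)
        # Step 3: keep only the best beam_width candidates on the queue.
        queue = sorted(queue, key=value, reverse=True)[:beam_width]
    return sorted(discovered, key=value, reverse=True)


# Toy domain (an assumption for illustration): candidates are strings,
# expansion appends a letter, and the value is the number of "a" characters.
def toy_expand(s):
    return [s + "a", s + "b"] if len(s) < 3 else []


def toy_value(s):
    return s.count("a")


best = beam_discover([""], toy_expand, toy_value, beam_width=2, limit=4)
```

In a graph setting, the candidates would be substructures, `expand` would add an edge or an edge plus vertex, and `value` would be a compression score.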
12
Examples [Jair94]
13
Inexact Graph Match [JIIS95]
- Some variations may occur between instances
- Want to abstract over minor differences
- Difference = cost of transforming one graph to make it isomorphic to another
- Match if cost/size < threshold
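The acceptance test is a single comparison. A minimal sketch, assuming the transformation cost has already been computed and that `size` is whatever graph size the matcher normalizes by (the slide does not say which graph's size is used; the function name is hypothetical):

```python
def is_inexact_match(transformation_cost: float, size: float, threshold: float) -> bool:
    """Accept the match when the size-normalized cost is below the threshold."""
    if size <= 0:
        raise ValueError("size must be positive")
    return transformation_cost / size < threshold


# With an assumed size of 20 and threshold 0.2, a cost of 3 is accepted
# (3 / 20 = 0.15) and a cost of 5 is rejected (5 / 20 = 0.25).
accepted = is_inexact_match(3, 20, 0.2)
rejected = is_inexact_match(5, 20, 0.2)
```

A threshold of 0 permits only exact matches, because any nonzero cost fails the strict comparison.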
14
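Inexact Graph Match
[Figure: two labeled graphs, one with vertices 1 and 2 and one with vertices 3, 4, and 5 (vertex labels A and B, edge labels a and b), above the search tree of vertex mappings]
Search tree (mapping, cost):
  (1,3) 1: (2,4) 7, (2,5) 6, (2, ) 10
  (1,4) 0: (2,3) 3, (2,5) 6, (2, ) 9
  (1,5) 1: (2,3) 7, (2,4) 7, (2, ) 10
  (1, ) 1: (2,3) 9, (2,4) 10, (2,5) 9, (2, ) 11
Least-cost match is {(1,4), (2,3)}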
15
Background Knowledge [IEEE TKDE96]
- Some substructures not relevant
- Background knowledge can bias search
- Two types
  - Model knowledge
  - Graph match rules
17
Parallel/distributed Subdue [JPDC00]
- Scalability issues
- Three approaches
  - Dynamic partitioning
  - Functional parallel
  - Static partitioning
18
Dynamic Partitioning
- Processor i stores ith vertex label
- Each processor operates as in serial Subdue
- Avoid replication by expanding to higher vertices
[Figure: example graph with vertices v1-v5 and edges e1-e4]
19
Dynamic Partitioning
- Partitions are logical
- Excessive processor idling and load balancing
- Results very poor
20
Functional Parallel
- Master processor controls search queue
- Slaves evaluate and expand substructures
- Synchronization after each step
21
Functional Parallel Results
- ART database: 1,000 vertices and 2,000 edges
- CAD database: 8,441 vertices and 19,206 edges
22
Static Partitioning
- Divide graph into P partitions, distribute to P processors
- Each processor performs serial Subdue on local partition
- Broadcast best substructures, evaluate on other processors
- Master processor stores best global substructures
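The four steps can be simulated sequentially on one machine. This is a rough sketch under strong simplifying assumptions: the graph is a list of labeled edges, a "substructure" is just a single `(source label, edge label, target label)` pattern, partitions are assigned round-robin, and instance counts stand in for the real evaluation measure. All names are hypothetical.

```python
from collections import Counter


def partition(edges, p):
    """Divide the edge list into p partitions (round-robin, for illustration)."""
    return [edges[i::p] for i in range(p)]


def local_best(part, k=1):
    """Each 'processor' reports its k most frequent local patterns."""
    return [pattern for pattern, _ in Counter(part).most_common(k)]


def global_best(edges, p, k=1):
    parts = partition(edges, p)
    # Broadcast: gather every partition's best local patterns.
    candidates = set()
    for part in parts:
        candidates.update(local_best(part, k))
    # Each partition evaluates all broadcast candidates; the master sums the counts.
    totals = Counter()
    for part in parts:
        counts = Counter(part)
        for candidate in candidates:
            totals[candidate] += counts[candidate]
    return totals.most_common()


edges = [("triangle", "on", "square")] * 4 + [("circle", "left", "rectangle")]
ranking = global_best(edges, p=2)
```

A real implementation would run the local discovery in parallel and exchange substructures by message passing; the loop structure here only mirrors the order of the steps.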
23
Static Partitioning Results
- Close to linear speedup
- Continue until #processors > #vertices
24
Speedup Comparison
25
Issues
- Partitioning the graph loses information
- Metis graph partitioning system
- Quality of resulting substructures?
- Recapture by overlap, multiple partitions
- Evaluating more substructures globally
26
Compression Results
27
Recapture Lost Information
- Allow overlap between partitions
- Run twice with two partitions, max results
28
Recapture Lost Information
29
AutoClass
- Linear representation
- Fit possible probabilistic models to data
- Satellite data, DNA data, Landsat data
30
SUBDUE/AutoClass Combined
[Figure: data flow combining Subdue and AutoClass; labels: Data, structural features, structural patterns, linear features, Classes, "Combination of linear data or addition of linear features"]
31
Example - 30 2-color squares
- AutoClass Rep - tuple for each line (x1, y1, x2, y2, angle, length, color)
- Add structure (neighboring edge information)
- Subdue Rep - each line is node in graph, edges between connecting lines
- Attributes from nodes
32
Results
- AutoClass (12 classes)
  Class 0 (20): Color=green, LineNo=Line1=Line2=98 +/- 10
  Class 1 (20): Color=red, LineNo=Line1=Line2=99 +/- 10
  …
  Class 11 (3): Line2=1 +/-13, Color=green
- Subdue (top substructure)
33
Combined Results
- Combine 4 entries for each square into one
- 30 tuples (one for each square)
- Discover
  Class 0 (10): Color1=red, Color2=red, Color3=green, Color4=green
  Class 1 (10): Color1=green, Color2=green, Color3=blue, Color4=blue
  Class 2 (10): Color1=blue, Color2=blue, Color3=red, Color4=red
34
More Results
35
Supervised SUBDUE [IEEE IS00]
- One graph stores positive examples
- One graph stores negative examples
- Find substructure that compresses positive graph but not negative graph
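One way to make "compresses the positive graph but not the negative graph" concrete is to score a substructure by the difference in savings between the two graphs. This is only an assumed scoring for illustration, not the measure from the cited paper: savings are counted in vertices plus edges removed when each instance collapses to a single vertex, and both function names are hypothetical.

```python
def savings(s_vertices: int, s_edges: int, instances: int) -> int:
    """Units saved when each instance of S collapses to one vertex."""
    return instances * (s_vertices - 1 + s_edges)


def supervised_value(s_vertices, s_edges, pos_instances, neg_instances):
    """Reward compression of the positive graph, penalize it in the negative graph."""
    return savings(s_vertices, s_edges, pos_instances) - savings(
        s_vertices, s_edges, neg_instances
    )


# A 2-vertex, 1-edge substructure found 4 times in the positive graph only.
discriminating = supervised_value(2, 1, pos_instances=4, neg_instances=0)
# The same substructure found equally often in both graphs.
uninformative = supervised_value(2, 1, pos_instances=4, neg_instances=4)
```

Under this scoring, a substructure that is equally common in both graphs scores zero, however well it compresses each one.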
36
Example
[Figure: example graphs; surviving labels: object, on, triangle, square, shape]
37
Results
- Chess endgames (19,257 examples), BK is (+) or is not (-) in check
- 99.8% FOIL, 99.77% C4.5, 99.21% Subdue
38
More Results
- Tic Tac Toe endgames
  - + is win for X (958 examples)
  - 100% Subdue, 92.35% FOIL, 96.03% C4.5
- Bach chorales
  - Musical sequences (20 sequences)
  - 100% Subdue, 85.71% FOIL, 82.00% C4.5
39
Clustering Using SUBDUE
- Iterate Subdue until single vertex
- Each cluster (substructure) inserted into a classification lattice
- Early results similar to COBWEB [Fisher87]
[Figure: classification lattice with Root at the top]
40
Discovery Application Domains
- Biochemical domains
  - Protein data [PSB99, IDA99]
  - Human Genome DNA data
  - Toxicology (cancer) data
- Spatial-temporal domains
  - Earthquake data
  - Aircraft Safety and Reporting System
- Telecommunications data
- Program source code
41
Structured Web Search [AAAI-AIWS00]
- Existing search engines use linear feature match
- Subdue searches based on structure
- Incorporation of WordNet allows for inexact feature match through synset path length
- Technique
  - Breadth-first search through domain to generate graph
  - Nodes represent pages / documents
  - Edges represent hyperlinks
  - Additional nodes used to represent document keywords
  - Pose query as graph
  - Search for query match within domain graph
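The graph-generation part of the technique can be sketched over an in-memory site instead of live HTTP requests. Everything here is an assumption for illustration: `site` is a hypothetical dictionary mapping a URL to its outgoing links and keywords, and the `_page_`, `_hyperlink_`, and `_word_` labels are borrowed from the query example later in the deck.

```python
from collections import deque


def build_domain_graph(site, start):
    """Breadth-first walk from `start`; returns (vertices, edges, page_id)."""
    vertices = {}  # vertex id -> label
    edges = []     # (source id, target id, edge label)
    page_id = {}   # url -> vertex id

    def add_vertex(label):
        vertex = len(vertices) + 1
        vertices[vertex] = label
        return vertex

    page_id[start] = add_vertex("_page_")
    queue = deque([start])
    while queue:
        url = queue.popleft()
        links, words = site.get(url, ([], []))
        # One extra vertex per keyword, attached to its page.
        for word in words:
            edges.append((page_id[url], add_vertex(word), "_word_"))
        # One edge per hyperlink; unseen targets become new page vertices.
        for target in links:
            if target not in page_id:
                page_id[target] = add_vertex("_page_")
                queue.append(target)
            edges.append((page_id[url], page_id[target], "_hyperlink_"))
    return vertices, edges, page_id


# Hypothetical two-page site.
site = {
    "/": (["/projects.html"], []),
    "/projects.html": (["/"], ["subdue"]),
}
vertices, edges, page_id = build_domain_graph(site, "/")
```

A crawler for a real domain would replace the dictionary lookup with page fetching and link extraction, and would restrict the walk to URLs inside the domain.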
42
Sample Search
[Figure: sample search graph; surviving labels: Instructor, Teaching, Robotics, Research, Publication, http, Postscript | PDF]
43
Query: Find all pages which link to a page containing term ‘subdue’
Query graph:
  /* Vertex ID Label */
  s
  v 1 _page_
  v 2 _page_
  v 3 subdue
  /* Edge Vertex 1 Vertex 2 Label */
  d 1 2 _hyperlink_
  d 2 3 _word_
Subgraph vertices:
  1 _page_ URL: http://cygnus.uta.edu
  7 _page_ URL: http://cygnus.uta.edu/projects.html
  8 Subdue
  [1->7] hyperlink
  [7->8] word
[Figure: query graph with labels page, hyperlink, page, word, subdue]
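To show how such a query could be evaluated, here is a minimal sketch that parses the `v` and `d` lines of the query and finds exact matches in a small domain graph by backtracking. It is not Subdue's matcher: the parser handles only the two line types shown on the slide, labels are compared exactly (including case), and the domain graph is a hypothetical reconstruction with one extra page added.

```python
def parse_graph(text):
    """Read 'v <id> <label>' and 'd <src> <dst> <label>' lines; ignore the rest."""
    vertices, edges = {}, []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices[int(parts[1])] = parts[2]
        elif parts[0] == "d":
            edges.append((int(parts[1]), int(parts[2]), parts[3]))
    return vertices, edges


def find_matches(query, domain):
    """Return every mapping {query vertex: domain vertex} preserving labels and edges."""
    q_vertices, q_edges = query
    d_vertices, d_edges = domain
    d_edge_set = set(d_edges)
    order = sorted(q_vertices)
    results = []

    def consistent(mapping):
        # Every query edge whose endpoints are both mapped must exist in the domain.
        return all(
            (mapping[a], mapping[b], label) in d_edge_set
            for a, b, label in q_edges
            if a in mapping and b in mapping
        )

    def extend(mapping):
        if len(mapping) == len(order):
            results.append(dict(mapping))
            return
        q = order[len(mapping)]
        for d, label in d_vertices.items():
            if label == q_vertices[q] and d not in mapping.values():
                mapping[q] = d
                if consistent(mapping):
                    extend(mapping)
                del mapping[q]

    extend({})
    return results


QUERY = """
v 1 _page_
v 2 _page_
v 3 subdue
d 1 2 _hyperlink_
d 2 3 _word_
"""

# Hypothetical domain graph: page 1 links to pages 7 and 2; only page 7 has the word.
DOMAIN = """
v 1 _page_
v 7 _page_
v 8 subdue
v 2 _page_
d 1 7 _hyperlink_
d 7 8 _word_
d 1 2 _hyperlink_
"""

matches = find_matches(parse_graph(QUERY), parse_graph(DOMAIN))
```

The single match maps query vertices 1, 2, 3 to domain vertices 1, 7, 8, which is the same vertex set as the result listed on the slide. Backtracking of this kind is exponential in the worst case, so it is suitable only for small query graphs.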
44
Search for Presentation Pages
- Subdue: 22 instances
- AltaVista: 12 instances
  Query: “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif.”
[Figure: query graph with labels page and hyperlink]
45
Search for Reference Pages
- Search for a page with at least 35 in-links
- 5 pages in www-cse
- AltaVista cannot perform this type of search
[Figure: query graph with labels page, hyperlink, …]
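In graph terms this query is an in-degree test. A minimal sketch, assuming hyperlinks are available as `(source, target)` pairs; the function name and the tiny example are illustrative, and a threshold of 3 stands in for the slide's 35:

```python
from collections import Counter


def pages_with_min_inlinks(hyperlinks, min_inlinks):
    """Pages that are the target of at least `min_inlinks` distinct hyperlinks."""
    indegree = Counter(target for _source, target in set(hyperlinks))
    return sorted(page for page, n in indegree.items() if n >= min_inlinks)


hyperlinks = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("a", "b"), ("a", "hub")]
reference_pages = pages_with_min_inlinks(hyperlinks, 3)
```

Duplicate links between the same pair of pages are counted once here, which is a design choice of this sketch and not something the slide specifies.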
46
Search for pages on ‘jobs in computer science’
- Inexact match: allow one level of synonyms
- Subdue found 33 matches
- Words include employment, work, job, problem, task
- AltaVista found 2 matches
[Figure: query graph with labels page, word, jobs, computer, science]
47
Search for ‘authority’ hub and authority pages
- Subdue found 3 hub (and 3 authority) pages
- AltaVista cannot perform this type of search
- Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
- Subdue found 13 matches
[Figure: query graph with HUBS and AUTHORITIES; labels page, hyperlink, word, algorithms]
48
Subdue Learning from Web Data
- Distinguish professors’ and students’ web pages
  - Learned concept (professors have “box” in address field)
- Distinguish online stores and professors’ web pages
  - Learned concept (stores have more levels in graph)
[Figure: learned substructures; labels page, word, box]
49
To Learn More
cygnus.uta.edu/subdue
cook@cse.uta.edu
http://www-cse.uta.edu/~cook