Published by Irma Pope. Modified over 9 years ago.
1
CSE@UTASRL Workshop
Efficient Mining of Graph-Based Data
Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook
University of Texas at Arlington
Department of Computer Science and Engineering
http://cygnus.uta.edu/subdue
2
Motivation
Structural/relational data
Ease of graph representation
3
Graph-Based Discovery
[Figure panels: Input Database (objects R1, C1, T1-T4, B1-B4); Substructure S1 (graph form; labels: object, triangle, square, shape, on); Compressed Database (R1, C1)]
4
Algorithm
1. Create substructure for each unique vertex label
Substructures: triangle (4), square (4), circle (1), rectangle (1)
[Figure: input graph with a circle, a rectangle, and four triangle/square pairs joined by "on" edges]
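Step 1 can be illustrated with a short Python sketch. The vertex list is a hypothetical encoding of the example graph: the ids (T1, B1, ...) are borrowed from the Graph-Based Discovery figure, but the (id, label) format and the function name are assumptions for illustration, not Subdue's input format or API. The sketch counts instances per unique label, which gives the counts listed on the slide.

```python
from collections import Counter

# Hypothetical (id, label) encoding of the example graph's vertices.
vertices = [
    ("T1", "triangle"), ("T2", "triangle"), ("T3", "triangle"), ("T4", "triangle"),
    ("B1", "square"), ("B2", "square"), ("B3", "square"), ("B4", "square"),
    ("C1", "circle"), ("R1", "rectangle"),
]


def initial_substructures(vertices):
    """Return one single-vertex candidate per unique label, with its instance count."""
    return Counter(label for _, label in vertices)


candidates = initial_substructures(vertices)
for label, count in candidates.most_common():
    print(f"{label} ({count})")
```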
5
Algorithm
2. Expand best substructure by an edge or edge+neighboring vertex
[Figure: candidate substructures formed by "on" edges between triangle and square, rectangle and square, rectangle and triangle, and rectangle and circle, shown alongside the input graph]
6
Algorithm
3. Keep only best beam-width substructures on queue
4. Terminate when queue is empty or #discovered substructures >= limit
5. Compress graph and repeat to generate hierarchical description
Note: polynomially constrained
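The queue handling in steps 2 to 4 can be sketched as a generic beam search. This is a minimal illustration under loud assumptions, not Subdue's implementation: the function name beam_discover and its parameters are hypothetical, a string stands in for a path-shaped graph, a candidate is a substring, expansion adds one neighboring character, and the score is a crude compression proxy (instances times symbols saved per instance).

```python
def beam_discover(initial, expand, value, beam_width, limit):
    """Generic beam search: expand the best candidate, keep the best beam_width."""
    queue = sorted(initial, key=value, reverse=True)[:beam_width]
    discovered = []
    # Step 4: stop when the queue is empty or enough candidates were discovered.
    while queue and len(discovered) < limit:
        best = queue.pop(0)
        discovered.append(best)
        queue.extend(expand(best))  # step 2: grow the best candidate
        # Step 3: keep only the best beam_width candidates on the queue.
        queue = sorted(queue, key=value, reverse=True)[:beam_width]
    return sorted(discovered, key=value, reverse=True)


# Toy stand-in for a graph: a string, read as a path of labeled vertices.
text = "abcabcabd"


def expand(s):
    """Extend every occurrence of s by the character that follows it."""
    out = set()
    start = text.find(s)
    while start != -1:
        end = start + len(s)
        if end < len(text):
            out.add(text[start:end + 1])
        start = text.find(s, start + 1)
    return sorted(out)


def value(s):
    """Toy score: non-overlapping instances times symbols saved per instance."""
    return text.count(s) * (len(s) - 1)


result = beam_discover(sorted(set(text)), expand, value, beam_width=2, limit=4)
print(result[0])  # best candidate under this toy score
```

Step 5 (compress and repeat) would replace each instance of the best candidate with a single new symbol and run the search again on the shortened input.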
7
Evaluation Metric
Substructures evaluated based on ability to compress input graph
Compression measured using minimum description length (DL)
Best substructure S in graph G minimizes: DL(S) + DL(G|S)
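To make the quantity DL(S) + DL(G|S) concrete, here is a rough Python sketch. It does not compute a real description length in bits; as a labeled assumption it uses graph size (vertices plus edges) as a stand-in, assumes instances do not overlap, and assumes every edge of S lies inside its instances. The toy graph (10 vertices, 4 "on" edges) is likewise a hypothetical reading of the shapes example.

```python
def size(num_vertices, num_edges):
    """Size proxy standing in for description length (an assumption)."""
    return num_vertices + num_edges


def compressed_total(g_vertices, g_edges, s_vertices, s_edges, instances):
    """Proxy for DL(S) + DL(G|S): each instance of S collapses to one vertex."""
    dl_s = size(s_vertices, s_edges)
    remaining_vertices = g_vertices - instances * (s_vertices - 1)
    remaining_edges = g_edges - instances * s_edges
    return dl_s + size(remaining_vertices, remaining_edges)


# Toy graph: 4 triangles, 4 squares, 1 circle, 1 rectangle, 4 "on" edges.
uncompressed = size(10, 4)
triangle_on_square = compressed_total(10, 4, s_vertices=2, s_edges=1, instances=4)
triangle_only = compressed_total(10, 4, s_vertices=1, s_edges=0, instances=4)

print(uncompressed, triangle_on_square, triangle_only)
```

Under this proxy the two-vertex substructure lowers the total while a single-vertex substructure does not, which is why larger, frequent substructures score better.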
8
Examples
9
Inexact Graph Match
Some variations may occur between instances
Want to abstract over minor differences
Difference = cost of transforming one graph to isomorphism of another
Match if cost/size < threshold
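The threshold test can be sketched in Python as follows. Several things here are assumptions made for illustration: the vertex mapping is fixed by shared ids (a real inexact match searches over mappings), every edit has unit cost, and "size" is taken as vertices plus edges of the larger graph. All function names are hypothetical.

```python
def transform_cost(g1, g2):
    """Unit-cost edits between two graphs under a fixed, id-based vertex mapping."""
    labels1, edges1 = g1
    labels2, edges2 = g2
    ids = set(labels1) | set(labels2)
    # One edit per vertex whose label differs or that exists in only one graph.
    vertex_cost = sum(1 for i in ids if labels1.get(i) != labels2.get(i))
    # One edit per edge present in exactly one of the graphs.
    edge_cost = len(edges1 ^ edges2)
    return vertex_cost + edge_cost


def graph_size(g):
    labels, edges = g
    return len(labels) + len(edges)


def inexact_match(g1, g2, threshold):
    """Match if cost/size < threshold; assumes at least one graph is non-empty."""
    size = max(graph_size(g1), graph_size(g2))
    return transform_cost(g1, g2) / size < threshold


g1 = ({1: "triangle", 2: "square"}, {(1, "on", 2)})
g2 = ({1: "triangle", 2: "rectangle"}, {(1, "on", 2)})

print(inexact_match(g1, g2, threshold=0.5))
print(inexact_match(g1, g2, threshold=0.2))
```

The two graphs differ by one vertex label, so the cost is 1 against a size of 3; they match under a loose threshold and fail under a strict one.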
10
Parallel/Distributed Discovery
Divide graph into P partitions using Metis, distribute to P processors
Each processor performs serial Subdue on local partition
Broadcast best substructures, evaluate on other processors
Master processor stores best global substructures
Close to linear speedup
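The propose, broadcast, and evaluate pattern can be simulated sequentially in Python. This sketch makes strong simplifying assumptions: partitions are given by hand rather than computed by Metis, a "substructure" is a single labeled edge, local discovery is a frequency count rather than serial Subdue, and no real processes or message passing are involved. All names are hypothetical.

```python
from collections import Counter


def local_best(partition, k):
    """Stand-in for serial discovery: the k most frequent edges in one partition."""
    return [edge for edge, _ in Counter(partition).most_common(k)]


def evaluate(candidate, partition):
    """Value of a broadcast candidate on another partition: its instance count."""
    return partition.count(candidate)


def distributed_discover(partitions, k=1):
    # Each "processor" proposes its local best candidates.
    proposals = set()
    for part in partitions:
        proposals.update(local_best(part, k))
    # Every proposal is evaluated on every partition; the "master" sums the values.
    totals = {c: sum(evaluate(c, part) for part in partitions) for c in proposals}
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


T = ("triangle", "on", "square")
R = ("rectangle", "on", "square")
C = ("circle", "on", "rectangle")
partitions = [[T, T, C], [T, R, R]]

ranking = distributed_discover(partitions, k=1)
print(ranking[0])
```

The second partition prefers R locally, but T wins once values from both partitions are combined, which is the reason for evaluating broadcast candidates globally.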
11
Graph-Based Concept Learning
One graph stores positive examples
One graph stores negative examples
Find substructure that compresses positive graph but not negative graph
(PosEgsNotCovered) + (NegEgsCovered)
Multiple iterations implement a set-covering approach
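The error count and the set-covering loop can be sketched in Python. As an explicit simplification, examples are feature sets rather than graphs, a candidate "covers" an example when it is a subset of it (standing in for a subgraph test), and the function names are hypothetical. Each round picks the candidate with the lowest (PosEgsNotCovered) + (NegEgsCovered) and removes the positive examples it covers.

```python
def covers(candidate, example):
    """Stand-in for a subgraph test: the candidate's features all appear in the example."""
    return candidate <= example


def error(candidate, positives, negatives):
    pos_not_covered = sum(1 for e in positives if not covers(candidate, e))
    neg_covered = sum(1 for e in negatives if covers(candidate, e))
    return pos_not_covered + neg_covered


def set_cover(candidates, positives, negatives):
    """Repeat: pick the lowest-error candidate, then drop the positives it covers."""
    remaining = list(positives)
    chosen = []
    while remaining:
        best = min(candidates, key=lambda c: error(c, remaining, negatives))
        if not any(covers(best, e) for e in remaining):
            break  # no candidate covers anything that is left
        chosen.append(best)
        remaining = [e for e in remaining if not covers(best, e)]
    return chosen


positives = [frozenset("ab"), frozenset("ac"), frozenset("d")]
negatives = [frozenset("b"), frozenset("c")]
candidates = [frozenset("a"), frozenset("b"), frozenset("d")]

chosen = set_cover(candidates, positives, negatives)
print([sorted(c) for c in chosen])
```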
12
Concept-Learning Example
[Figure: example graph with labels object, on, triangle, square, shape]
13
Concept-Learning Results
Chess endgames (19,257 examples)
Black King is (+) or is not (-) in check
99.8% FOIL, 99.21% Subdue
14
More Concept-Learning Results
Tic-Tac-Toe endgames
  + is win for X (958 examples)
  100% Subdue, 92.35% FOIL
Bach chorales
  Musical sequences (20 sequences)
  100% Subdue, 85.71% FOIL
15
Graph-Based Clustering
Iterate Subdue until single vertex
Each cluster (substructure) inserted into a classification lattice
[Figure: classification lattice with a node labeled Root]
16
Clustering Example: Animals

Name       Body Cover      Heart Chamber   Body Temp.    Fertilization
mammal     hair            four            regulated     internal
bird       feathers        four            regulated     internal
reptile    cornified-skin  imperfect-four  unregulated   internal
amphibian  moist-skin      three           unregulated   external
fish       scales          two             unregulated   external

[Figure: graph for the mammal row: an "animal" vertex with edges Name to mammal, BodyCover to hair, HeartChamber to four, BodyTemp to regulated, Fertilization to internal]
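The table-to-graph conversion shown in the figure can be sketched in Python: each row becomes a star graph with an "animal" vertex at the center and one labeled edge per attribute. The (vertices, edges) representation and the function name row_to_graph are assumptions for illustration, not Subdue's input format.

```python
ATTRIBUTES = ["Name", "BodyCover", "HeartChamber", "BodyTemp", "Fertilization"]

ROWS = [
    ("mammal", "hair", "four", "regulated", "internal"),
    ("bird", "feathers", "four", "regulated", "internal"),
    ("reptile", "cornified-skin", "imperfect-four", "unregulated", "internal"),
    ("amphibian", "moist-skin", "three", "unregulated", "external"),
    ("fish", "scales", "two", "unregulated", "external"),
]


def row_to_graph(row):
    """Build a star graph: vertex 0 is "animal", vertices 1..n hold the attribute values."""
    vertices = ["animal"] + list(row)
    # Each edge is (source index, edge label, target index).
    edges = [(0, attr, i + 1) for i, attr in enumerate(ATTRIBUTES)]
    return vertices, edges


graphs = [row_to_graph(row) for row in ROWS]
mammal_vertices, mammal_edges = graphs[0]
for src, label, dst in mammal_edges:
    print(f"{mammal_vertices[src]} --{label}--> {mammal_vertices[dst]}")
```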
17
Graph-Based Clustering Results
Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three
18
Cobweb Results
Comparison of Subdue and Cobweb results
Subdue lattice produced better generalization, resulting in fewer clusters at higher levels
Subdue lattice identifies overlap between (reptile) and (amphibian/fish)
[Figure: Cobweb hierarchy with nodes animals; amphibian/fish, mammal/bird, reptile; mammal, bird, fish, amphibian]
19
Clustering Example: DNA
20
Graph-Based Clustering Results
[Figure: DNA classification lattice; discovered substructures are molecular fragments (an O=P-OH group and chains of C, N and O atoms), annotated with coverage 61%, 68%, 71%]
21
Evaluation of Clusterings
Traditional evaluation:
  Not applicable to hierarchical domains
  Does not make sense to compare clusters in different subtrees
  Not applicable to relational clusterings
22
Properties of Good Clusterings
Small number of clusters
  Large coverage → good generality
Big cluster descriptions
  More features → more inferential power
Minimal or no overlap between clusters
  More distinct clusters → better defined concepts
23
New Evaluation Heuristic for Hierarchical Clusterings
Clustering rooted at C with c children H_i, each having |H_i| instances H_{i,k}
distance() measured by inexact graph match
Animals: Subdue CQ = 2.6, Cobweb CQ = 1.7
24
Graph-Based Data Mining: Application Domains
Biochemical domains
  Protein data
  DNA data
  Toxicology (cancer) data
Spatial-temporal domains
  Earthquake data
  Aircraft Safety and Reporting System
Telecommunications data
Program source code
Web topology
[Figure: web graph fragment with labels web_page, hyperlink, home, …]
25
Theoretical Analysis
Galois lattice [Lequiere et al.]
Conceptual graphs [Sowa et al.]
PAC analysis [Jappy et al.]
26
Graph-based Data Mining
Pattern (substructure) discovery
Hierarchical discovery
Distributed discovery
Concept learning
Clustering
Compression heuristic based on minimum description length
27
Future Work
Concept learning
  Theoretical analysis
  Comparison to ILP systems
Clustering
  Classification lattice
  Hierarchical relational conceptual clustering evaluation metric
Probabilistic substructures
Domains: WWW, source code
28
Subdue Source Code and Data
http://cygnus.uta.edu/subdue