Graph-Based Data Mining and Applications
István Jónyer, Department of Computer Science, Oklahoma State University

Slide 2: Outline
- Graph-Based Data Mining
  - Applications
- Graph Grammar Induction
  - Applications
- Conclusions

Slide 3: Graph-Based Data Mining
- Graphs are the most expressive data structures in computer science
- Intuitively represent complex domains
- Without repetition of data
- Simple building blocks:
  - Labeled vertices
  - Labeled directed/undirected edges

Slide 4: Subdue
- Finds frequently occurring subgraphs
- Returns the best ones according to the minimum description length (MDL) heuristic
- Features:
  - Discovery, clustering, and concept learning
  - Inexact graph matching
  - Parallel/distributed discovery
  - Background knowledge

Slide 5: Graph Representation
- Input is a directed graph with labeled vertices and edges
- A substructure is a connected subgraph
- An instance of a substructure is a subgraph of the input graph that is isomorphic to the substructure
- The input graph is compressed by replacing each instance with a single vertex representing the substructure
[Figure: three panels, "Input Database" (vertices R1, C1, T1-T4, S1-S4), "Substructure S1 (graph form)" (object, triangle, square, "on" and "shape" labels), and "Compressed Database" (R1, C1, and the compressed instances)]

Slide 6: Subdue Discovery Algorithm
1. Create a substructure for each unique vertex label
Substructures: triangle (4), square (4), circle (1), rectangle (1)
[Figure: input graph of triangle, square, circle, and rectangle vertices joined by "on" edges]

Slide 7: Subdue Discovery Algorithm
2. Expand the best substructures by an edge, or by an edge plus a neighboring vertex
[Figure: candidate substructures after one expansion step, built from triangle, square, circle, and rectangle vertices joined by "on" edges]

Slide 8: Subdue Discovery Algorithm
3. Evaluate substructures using MDL
4. Keep only the best beam-width substructures on the queue
5. Terminate when the queue is empty or the number of discovered substructures reaches the limit
6. Compress the graph and repeat to generate a hierarchical description
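
The first steps of the discovery loop can be sketched on a toy graph. This is a minimal illustration, not Subdue's implementation: the vertex/edge encoding and all names are assumptions, only the four triangle-on-square edges are modeled, and candidates are ranked by raw frequency as a stand-in for the MDL evaluation.

```python
from collections import Counter

# Hypothetical encoding: vertex id -> label, and (source, edge label, target) triples.
vertices = {
    1: "triangle", 2: "square",
    3: "triangle", 4: "square",
    5: "triangle", 6: "square",
    7: "triangle", 8: "square",
    9: "circle", 10: "rectangle",
}
# Only the triangle-on-square edges are modeled; other edges are omitted.
edges = [(1, "on", 2), (3, "on", 4), (5, "on", 6), (7, "on", 8)]

# Step 1: one candidate substructure per unique vertex label.
label_counts = Counter(vertices.values())

# Step 2: expand by one edge plus neighboring vertex; count each resulting pattern.
expansions = Counter((vertices[s], label, vertices[t]) for s, label, t in edges)

# Step 4: keep only the best beam_width candidates (ranked by frequency here,
# which is an assumption standing in for the MDL score of step 3).
beam_width = 2
queue = [pattern for pattern, _ in expansions.most_common(beam_width)]

print(dict(label_counts))
print(queue)
```

A real run would repeat the expansion on the queued patterns and would need subgraph matching to find instances, which this sketch leaves out.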

Slide 9: Graph Compression and MDL
- Minimum Description Length (MDL) principle: the best theory minimizes the description length of the theory plus the data given the theory
- The best substructure is the one with the shortest combined description length of the substructure definition, DL(S), and the compressed graph, DL(G|S)
- DL(G,S) = DL(S) + DL(G|S)
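
To make the formula concrete, here is a small worked example. It uses a crude proxy for description length (number of vertices plus number of edges) rather than a bit-level encoding, and the toy graph sizes are assumptions chosen to match the triangle-on-square example: 10 vertices and 4 edges before compression.

```python
def size(num_vertices, num_edges):
    """Crude stand-in for description length: vertices plus edges (an assumption)."""
    return num_vertices + num_edges


def total_description_length(dl_s, dl_g_given_s):
    """DL(G,S) = DL(S) + DL(G|S)."""
    return dl_s + dl_g_given_s


# Toy input graph: 4 triangle-on-square pairs plus a circle and a rectangle.
dl_g = size(10, 4)

# Substructure S: one triangle, one square, one "on" edge.
dl_s = size(2, 1)

# Compressed graph: each of the 4 instances becomes one vertex; circle and
# rectangle remain; no edges are left in this toy case.
dl_g_given_s = size(6, 0)

dl_total = total_description_length(dl_s, dl_g_given_s)
compression = dl_g / dl_total  # values above 1 mean S shortens the description

print(dl_total, round(compression, 2))
```

Under this proxy the substructure pays for itself: describing S once plus the compressed graph is shorter than the original graph.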

Slide 10: Biochemistry Application: Clustering of a DNA Sequence

Slide 11: Biochemistry Application: Clustering of a DNA Sequence
[Figure: chemical substructures discovered in the DNA molecule, with coverage values of 61%, 68%, and 71%]

Slide 12: DNA Sequence
- Four bases constitute a four-letter alphabet that cells use to store genetic information.
- Molecular biologists can break up a DNA molecule and determine its base sequence, which can be stored as a character string in a computer:
  TTCAGCCGATATCCTGGTCAGATTCTCT
  AAGTCGGCTATAGGACCAGTCTAAGAGA

Slide 13: Backbone Representation
- "Base" vertices allow "don't-care" positions.
- Accounting for overlapping substructures is also possible.
[Figure: chain of five "base" vertices joined by "next" edges, each with a "name" edge to its letter: A, A, C, T, G]
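
A minimal sketch of how a base string could be turned into this backbone form. The function name, vertex identifiers, and the dict/list encoding are illustrative assumptions; only the "base", "next", and "name" labels come from the slide.

```python
def backbone_graph(sequence):
    """Build a backbone graph: one 'base' vertex per position, chained by
    'next' edges, each pointing to its letter through a 'name' edge."""
    vertices = {}  # vertex id -> label
    edges = []     # (source id, edge label, target id)
    for i, letter in enumerate(sequence):
        base_id = f"base{i}"
        name_id = f"name{i}"
        vertices[base_id] = "base"
        vertices[name_id] = letter
        edges.append((base_id, "name", name_id))
        if i > 0:
            edges.append((f"base{i - 1}", "next", base_id))
    return vertices, edges


vertices, edges = backbone_graph("AACTG")
next_edges = [e for e in edges if e[1] == "next"]
name_edges = [e for e in edges if e[1] == "name"]
print(len(vertices), len(next_edges), len(name_edges))
```

Because the letter hangs off the "base" vertex instead of labeling it, a pattern can include a position without fixing its letter, which is what makes "don't-care" positions possible.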

Slide 14: Represent Web as Graph
- Breadth-first search of domain to generate graph
- Nodes represent pages / documents
- Edges represent hyperlinks
- Additional nodes represent document keywords
[Figure: example web graph with "page" vertices joined by "hyperlink" edges, and "word" edges to keywords such as university, texas, learning, group, projects, subdue, robotics, parallel, work, planning]

Slide 15: Query
Find all pages which link to a page containing the term 'Subdue'
Subgraph vertices:
  1 page URL: http://cygnus.uta.edu
  7 page URL: http://cygnus.uta.edu/projects.html
  8 Subdue
  [1->7] hyperlink
  [7->8] word
[Figure: query graph, page -hyperlink-> page -word-> Subdue]
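
The query can be sketched as two passes over the edge list of the matched subgraph. The data encoding and the function name are assumptions for illustration; the vertex numbers and edge labels are the ones shown on the slide.

```python
# Vertex id -> label, and (source, edge label, target) triples from the slide.
vertices = {1: "page", 7: "page", 8: "Subdue"}
edges = [(1, "hyperlink", 7), (7, "word", 8)]


def pages_linking_to_term(vertices, edges, term):
    """Return ids of pages with a hyperlink to a page that has a 'word' edge to term."""
    pages_with_term = {
        s for s, label, t in edges
        if label == "word" and vertices[s] == "page" and vertices[t] == term
    }
    return sorted(
        s for s, label, t in edges
        if label == "hyperlink" and vertices[s] == "page" and t in pages_with_term
    )


result = pages_linking_to_term(vertices, edges, "Subdue")
print(result)
```

On the slide's data this returns page 1, the page whose hyperlink leads to page 7, which contains the word.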

Slide 17: Direction of Research
- One direction of graph-based data mining research is towards efficient algorithms
  - AGM
  - FSG
  - gSpan
- We address another need
  - Increasing the expressive power of graph-based theories
  - We develop a Graph Grammar Induction algorithm

Slide 18: Related Work: Graph-Based Systems
- AGM: A. Inokuchi, T. Washio and H. Motoda, An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data, Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, 2000.
- FSG: M. Kuramochi and G. Karypis, An Efficient Algorithm for Discovering Frequent Subgraphs, Technical Report 02-026, Department of Computer Science, University of Minnesota, 2002.
- gSpan: X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining, Proceedings of the International Conference on Data Mining (ICDM), 2002.
- Apriori-based, association rule discovery
- Find all frequent subgraphs with minimum support
- Emphasis is on efficiency

Slide 19: Graph Grammars
- Graph grammar production: S → P
- S is a non-terminal, single vertex
  - Hence the grammar is context-free
- P is any graph containing terminals and/or non-terminals

Slide 20: Graph Grammars: Recursion
- Recursive production: S → P S | P
- P linked to S via a single edge
- Algorithm exponential in linking edges
[Figure: example recursive production over terminal vertices a, b, c]

Slide 21: Graph Grammars: Variables
- Variable production:
  - S → P1 | P2 | … | Pn (discrete)
  - S → [Pmin … Pmax] (continuous)
- P restricted to a single terminal vertex
[Figure: example productions in which S1 contains terminals a, b and the variable S2, and S2 ranges over c, d, f]

Slide 22: Graph Grammars: Relationships
- Relationships
  - Between continuous values (=, <=)
  - Between discrete values (=)
[Figure: "Air Crash" example with vertices air speed, visibility, lighting (on), and landing gear (out); variables D1, D2, C1; the value 220; and "=" and "<=" relationship edges]

Slide 23: Example Graph Grammar
[Figure: example graph grammar with a recursive production for S1 over terminals a, b and the variable S2, and a variable production in which S2 ranges over c, d, f]

Slide 24: Discovering Recursion
- For each discovered substructure:
  - Check for subsets of instances P connected by the same single edge e
  - If found, form production S → P S | P, where P is connected to S by edge e
- Abstraction: each matching chain is compressed to a single vertex labeled S
- Algorithm is exponential in the number of edges considered between instances
- Note: two edges needed for S → a S b
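
The chain check can be illustrated with a simplified sketch. It treats each instance of P as a single numbered node and assumes every instance has at most one outgoing link with the chosen edge label, so it sidesteps the subset search that makes the full procedure exponential. The function name and data layout are hypothetical.

```python
def find_chains(num_instances, links, edge_label):
    """Find maximal chains of instances connected by edges labeled edge_label.

    links holds (source instance, edge label, target instance) triples.
    Simplifying assumption: at most one outgoing link per instance and label.
    """
    successor = {a: b for a, label, b in links if label == edge_label}
    has_predecessor = set(successor.values())
    chains = []
    for start in range(num_instances):
        # A chain starts at an instance that links forward but is not linked to.
        if start in has_predecessor or start not in successor:
            continue
        chain = [start]
        while chain[-1] in successor and successor[chain[-1]] not in chain:
            chain.append(successor[chain[-1]])
        chains.append(chain)
    return chains


links = [(0, "next", 1), (1, "next", 2), (2, "next", 3)]
chains = find_chains(4, links, "next")
print(chains)

# Each chain found would be abstracted to one vertex labeled S,
# described by the production S -> P S | P.
```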

Slide 25: Recursion Example
[Figure: input graph containing four instances of a-b extended by c, d, f, and f respectively, four instances of a substructure over x, q, z, y, and vertices r and k]

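Slide 26: Recursion Example
- Recursive production:
- Input graph parsed by production:
[Figure: recursive production for S1 over the substructure with vertices x, q, z, y, and the input graph after parsing, in which S1 vertices appear alongside the a-b-c, a-b-d, a-b-f, a-b-f substructures and vertices r and k]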

Slide 27: Discovering Variables
- After extending a substructure's instances by a single edge,
  - Collect instances extended by the same edge (same direction and label), but possibly to vertices with differing labels Li
  - Form a production of the form
    - Discrete/categorical variable: S → L1 | L2 | … | Ln
    - Continuous variable: S → [Pmin … Pmax]

Slide 28: Variable Example
[Figure: input graph containing four instances of a-b extended by c, d, f, and f respectively, four instances of a substructure over x, q, z, y, and vertices r and k]

Slide 29: Variable Example
- Already identified substructure: a-b
[Figure: the same input graph with the a-b instances highlighted]

Slide 30: Variable Example
- Resulting production rules
- Input graph parsed by productions
[Figure: productions in which S2 contains terminals a, b and the variable S3, and S3 ranges over c, d, f; the parsed input graph shows S1 and S2 vertices together with r and k]

Slide 31: Variable Example
- Extending a-b results in:
  - a-b-c (1 instance)
  - a-b-d (1 instance)
  - a-b-f (2 instances)
- Collecting all values from the same edge results in a-b-{c,d,f} (4 instances)
- Eliminating the least frequent value results in a-b-{d,f} (3 instances)
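
The collection step can be sketched with a counter over the labels reached by the shared edge. The list encoding and variable names are assumptions; the labels and counts are those of the example.

```python
from collections import Counter

# Labels of the vertices reached when each a-b instance is extended
# by the same edge (one entry per instance).
extension_labels = ["c", "d", "f", "f"]

counts = Counter(extension_labels)

# Discrete variable production S -> L1 | L2 | ... | Ln over the collected labels.
values = sorted(counts)
production = "S -> " + " | ".join(values)
covered_instances = sum(counts.values())

print(production)
print(covered_instances)
```

The pruning step is left out on purpose: c and d are tied at one instance each, and the slide does not state the tie-break that leads to {d,f}.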

Slide 32: Discovering Relationships
- Properties of relationships
  - Established between data points (variables)
  - At least one end of a relationship must be a vertex (otherwise the relationship is trivial)
  - Represented by logical edges
- Types
  - Equal (both discrete and continuous)
  - Less-than-or-equal (continuous only)

Slide 33: Sample Relationship
[Figure: "Air Crash" example with vertices air speed, visibility, lighting (on), and landing gear (out); variables D3, D4, C1, C2; and "=" and "<=" relationship edges]

Slide 34: Concept Learning
- Graph grammar induction is extended to concept learning
- Input is a positive and a negative graph
- Grammar is to describe the positive graph while not describing the negative graph
  - I.e., infer a model for the positive input only
  - Whatever does not fit the model is classified as negative
- Learning is impacted by swapping positive and negative inputs

Slide 36: Experiments
- Sequitur (Nevill-Manning and Witten, 1997)
  - Infers compositional hierarchies (i.e., memorizes input in a hierarchy)
  - Works on strings or sequences

Slide 37: SubdueGL vs. Sequitur
- Input 1: abcabdabcabd
- Learned graph grammar:
  S1 → a b S2 S1 | a b S2
  S2 → c | d
- Sequitur's output:
  S → 1 1
  1 → 2 c 2 d
  2 → a b
[Figure: the input string as a graph, and the learned productions S1 and S2 in graph form]
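
Both descriptions of the input can be checked with a short sketch. The recognizer below is a string-level reading of the learned grammar (the actual productions are over graphs), and the rule table is a transcription of Sequitur's output; function names are illustrative.

```python
def matches_s1(text):
    """Recognize S1 -> a b S2 S1 | a b S2, with S2 -> c | d, on a plain string."""
    if len(text) < 3 or text[:2] != "ab" or text[2] not in "cd":
        return False
    rest = text[3:]
    # Either the string ends here (a b S2) or the remainder is again an S1.
    return rest == "" or matches_s1(rest)


def expand(symbol, rules):
    """Expand a Sequitur-style rule table back into the string it encodes."""
    if symbol not in rules:
        return symbol  # terminal
    return "".join(expand(part, rules) for part in rules[symbol])


sequitur_rules = {"S": ["1", "1"], "1": ["2", "c", "2", "d"], "2": ["a", "b"]}

input_string = "abcabdabcabd"
print(matches_s1(input_string))
print(expand("S", sequitur_rules) == input_string)
```

The difference in expressiveness shows up on unseen input: the learned grammar also accepts strings such as "abd" or "abcabc", whereas Sequitur's rules expand to exactly the one memorized string.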

Slide 38: Biochemistry
- Protein primary and secondary structure
- Primary structure is the sequence of amino acids: VAL LEU SER GLU GLY TRP GLN …
- Secondary structure is the sequence of helices: h_1_19 h_1_8 h_1_18 …
- Using hemoglobin and myoglobin
- Converted to graph:
[Figure: graph form of the sequence, a chain of amino-acid vertices such as VAL, LEU, SER, GLU, GLY, TRP, GLN]

Slide 39: Protein Primary Structure
S → S2 – S3 – S4 – S5 – S6 – S7 – S8 – S9 – S10 – S11 – S12 – S13 – S14 – S15 – S16 – S17 – GLU – S
S2 → VAL | SER | HIS | LYS
S3 → LEU | GLY | HIS | PHE | PRO
S4 → GLY | GLN | ALA | ASP | THR
S5 → ALA | ASP | ARG | THR | ASN
S6 → LEU | LYS | ILE | PHE
…
S20 → S21 – S22 – S23 – S24 – S25 – S26 – S27 – S28 – S29 – LEU
S21 → VAL | GLY | ARG | S
S22 → LEU | LYS | PRO | S
S23 → LEU | SER | ASP
S24 → LEU | GLU | ALA | ASP | ILE

Slide 40: Protein Secondary Structure
Hemoglobin:
S → S2 – S3 – h_1_6 – S4 – h_1_19 – h_1_8 – h_1_18 – S5
S2 → h_1_14 | h_1_15
S3 → h_1_14 | h_1_15
S4 → h_1_6 | h_1_1
S5 → h_1_20 | h_1_23
Myoglobin:
S → h_1_15 – h_1_15 – h_1_6 – h_1_6 – h_1_19 – S2 – h_1_18 – S3
S2 → h_1_9 | h_1_8
S3 → h_1_25 | h_1_23
Common sequence: h_1_15 – h_1_15 – h_1_6 – h_1_6 – h_1_19 – h_1_8 – h_1_18 – h_1_23
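
The slide does not say how the common sequence was obtained, but it is consistent with intersecting the two grammars position by position. The sketch below checks that reading; representing each position as a set of allowed helix labels is an assumption made for illustration.

```python
# Each position lists the helix labels the grammar allows there.
hemoglobin = [
    {"h_1_14", "h_1_15"}, {"h_1_14", "h_1_15"}, {"h_1_6"}, {"h_1_6", "h_1_1"},
    {"h_1_19"}, {"h_1_8"}, {"h_1_18"}, {"h_1_20", "h_1_23"},
]
myoglobin = [
    {"h_1_15"}, {"h_1_15"}, {"h_1_6"}, {"h_1_6"},
    {"h_1_19"}, {"h_1_9", "h_1_8"}, {"h_1_18"}, {"h_1_25", "h_1_23"},
]

# Intersect the allowed labels at each position.
shared = [sorted(h & m) for h, m in zip(hemoglobin, myoglobin)]

# Every position has exactly one shared label here, so a single sequence results.
assert all(len(options) == 1 for options in shared)
common_sequence = [options[0] for options in shared]

print(" - ".join(common_sequence))
```

The result is a sequence both grammars can generate, which matches the slide's later remark that the common sequence is theoretical rather than an observed one.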

Slide 41: Common Ancestry?
- Common ancestry between hemoglobin and myoglobin has long been hypothesized
- Common sequence can be further proof
- Common sequence is only theoretical, not an actual sequence
- "Myoglobin-like proteins found; candidates for common ancestry": Hou, Larsen, Boudko, Riley, Karatan, Zimmer, Ordal and Alam. 2000. Myoglobin-like aerotaxis transducers in Archaea and Bacteria. Nature 403, 540-544.

Slide 42: Counter-Terrorism Domains
- Part of EELD project
  - Contract Killing
  - Gang Wars
  - Industry Takeover
- Using simplified CK domain
- Goal: identify the sequence of events leading up to murder-for-hire

Slide 43: Contract Killing
- Part of Contract Killing domain
- Multiple sequences of events
[Figure: chain of Event vertices joined by nextEvent edges. Each event has an InformationSource and a ReportOnSituation with a ContainsInformation edge: a Meeting (Sender, Receiver: Person 1, Person 2), a PhoneCall (Sender, Receiver: Person 2, Person 3), and a Murder (Perpetrator: Killerski, Victim: Victimski)]

Slide 44: Grammar of CK Domain
[Figure: graph grammar learned for the CK domain. Labels shown include Event, nextEvent, InformationSource, ReportOnSituation, ContainsInformation, Sender, Receiver, Murder, Perpetrator, Victim, Killerski, Victimski; non-terminals S, S2, S3, S4, S5, S6; and variable values Meeting, PhoneCall, E-Mail, Person 1, Person 2, Person 3, Killerski]

Slide 45: Questions and Answers
Can show more comparative experiments if we have time…

Slide 46: Non-Structural Experiments
- Comparison to many approaches reported in the literature
  - Statistical, neural, machine learning, DTL, …
  - Using the Wisconsin Breast Cancer domain
- Comparison to ILP and DTL systems
  - Progol, FOIL, C4.5, Subdue
  - Using Vote, Diabetes, and Credit domains
- Experiments use 10-fold cross validation

Slide 47: Related Work: ILP Systems
- Inductive logic programming (ILP)
  - Combines inductive methods with FOPC
- Rules
  - Fact(a,b) ← Dec(a,c), Fact(c,d), Mult(a,d,b)
  - Play(a,b,c,false) ← b<=70.
- Example systems:
  - FOIL (Cameron-Jones, R. M., & Quinlan, J. R. 1994. Efficient Top-down Induction of Logic Programs. SIGART Bulletin, Vol. 5, 1:33-42.)
  - Progol (Muggleton, S. 1995. Inverse Entailment and Progol. New Generation Computing, Volume 13, 245-86.)

Slide 48: Wisconsin Breast Cancer Domain
- Properties:
  - Continuous attributes only
  - 9 attributes normalized between 1 and 10
- Class attributes
  - Malignant cases
  - Benign cases
- Concept learning task

Slide 49: Comparison Using WBC Domain
Rank | Algorithm | Accuracy | Accuracy reported by
1 | Probit | 97.2% | Wen-Hua et al, 2002
2 | RLP | 97.07% | Brodley & Utgoff, 1995
3 | Gaussian Process | 97.0% | Wen-Hua et al, 2002
4 | Feature Minimization | 96.92% | Brodley & Utgoff, 1995
5 | GSBE | 96.77% | Brodley & Utgoff, 1995
6 | SVM | 96.7% | Wen-Hua et al, 2002
7 | Multi-surface separation (3 planes) | 95.9% | Wolberg & Mangasarian, 1990
8 | SubdueGL | 95.67% | Authors
9 | FOIL | 94.8% | Authors
10 | C4.5 | 94.4% | Liu & Setiono, 1996
11 | Subdue | 94.37% | Authors
12 | Neural Networks | 93.3–95.61% | Taha & Ghosh, 1997
13 | 1-nearest neighbor | 93.7% | Zhang, 1992
14 | Multi-surface separation (1 plane) | 93.5% | Wolberg & Mangasarian, 1990
15 | Linear Discriminant | 92.9% | Wen-Hua et al, 2002
16 | Logistic regression | 92.3% | Wen-Hua et al, 2002

Slide 50: Comparison to ILP and DTL
- Discrete and mixed attribute types
- Vote
  - 16 discrete-valued attributes (y, n, u)
- Diabetes (Pima Indians)
  - 7 continuous-valued attributes
- Credit (German)
  - 13 discrete-valued attributes
  - 7 continuous-valued attributes
- Concept learning (all have 2 classes)

Slide 51: Comparison to ILP Systems
System | Vote | Diabetes | Credit
FOIL | 93.02% | 70.66% | 68.60%
Progol | 94.19% | 63.68% | 63.20%
Subdue | 89.07% | 61.71% | 70.50%
SubdueGL | 94.23% | 70.94% | 71.30%

Slide 52: Comparison to DTL
System | Vote | Diabetes | Credit
C4.5 | 94.48% | 74.62% | 70.90%
Subdue | 89.07% | 61.71% | 70.50%
SubdueGL | 94.23% | 70.94% | 71.30%
