Frequent Structure Mining Robert Howe University of Vermont Spring 2014.

Original Authors This presentation is based on the paper: Zaki MJ (2002). Efficiently mining frequent trees in a forest. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. The author's original presentation was used to make this one, which I further adapted from Ahmed R. Nabhan's modifications.

Outline Graph Mining Overview Mining Complex Structures - Introduction Motivation and Contributions of the Author Problem Definition and Case Examples Main Ingredients for Efficient Pattern Extraction Experimental Results Conclusions

Why Graph Mining? Graphs are convenient structures that can represent many complex relationships. We are drowning in graph data: Social Networks Biological Networks World Wide Web

Facebook Data (Source: Wolfram|Alpha Facebook Report): friend-network clusters labeled High School, UVM, and BU.

Facebook Data (Source: Wolfram|Alpha Facebook Report)

Biological Data (Source: KEGG Pathways Database)

Some Graph Mining Problems Pattern Discovery Graph Clustering Graph Classification and Label Propagation Structure and Dynamics of Evolving Graphs

Graph Mining Framework Mining graph patterns is a fundamental problem in data mining. Graph Data → Mine Exponential Pattern Space → Select Relevant Patterns → Exploratory Tasks: Clustering, Classification, Structure Indices

Basic Concepts Graph – A graph G is a 3-tuple G = (V, E, L) where: V is the finite set of nodes. E ⊆ V × V is the set of edges. L is a labeling function for edges and nodes. Subgraph – A graph G1 = (V1, E1, L1) is a subgraph of G2 = (V2, E2, L2) iff: V1 ⊆ V2. E1 ⊆ E2. L1(v) = L2(v) for all v ∈ V1. L1(e) = L2(e) for all e ∈ E1.
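The subgraph conditions above can be checked directly. A minimal sketch, assuming a graph is represented as a (nodes, edges, labels) triple with edges as a set of node pairs and labels as a dict over both nodes and edges (this representation is mine, not the paper's):

```python
def is_subgraph(g1, g2):
    """Check V1 ⊆ V2, E1 ⊆ E2, and label agreement on shared nodes/edges."""
    v1, e1, l1 = g1
    v2, e2, l2 = g2
    if not v1 <= v2 or not e1 <= e2:
        return False
    if any(l1[v] != l2[v] for v in v1):
        return False
    return all(l1[e] == l2[e] for e in e1)

# Example: the single edge A-B inside the path A-B-C.
g2 = ({1, 2, 3}, {(1, 2), (2, 3)},
      {1: "A", 2: "B", 3: "C", (1, 2): "x", (2, 3): "y"})
g1 = ({1, 2}, {(1, 2)}, {1: "A", 2: "B", (1, 2): "x"})
print(is_subgraph(g1, g2))  # True
```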

Basic Concepts Graph Isomorphism – “A bijection between the vertex sets of G1 and G2 such that any two vertices u and v which are adjacent in G1 are also adjacent in G2.” (Wikipedia) Subgraph Isomorphism is even harder (NP-Complete!)

Basic Concepts Graph Isomorphism – Let G1 = (V1, E1, L1) and G2 = (V2, E2, L2). A graph isomorphism is a bijective function f: V1 → V2 satisfying: L1(u) = L2(f(u)) for all u ∈ V1. For each edge e1 = (u, v) ∈ E1, there exists e2 = (f(u), f(v)) ∈ E2 such that L1(e1) = L2(e2). For each edge e2 = (u, v) ∈ E2, there exists e1 = (f⁻¹(u), f⁻¹(v)) ∈ E1 such that L1(e1) = L2(e2).

Discovering Subgraphs TreeMiner and gSpan both employ subgraph or substructure pattern mining. Graph or subgraph isomorphism can be used as an equivalence relation between two structures. There is an exponential number of subgraph patterns inside a larger graph (there are 2^n node subsets alone, before even considering edges). Finding frequent subgraphs (or subtrees) tends to be useful in data mining.

Outline Graph Mining Overview Mining Complex Structures - Introduction Motivation and Contributions of the Author Problem Definition and Case Examples Main Ingredients for Efficient Pattern Extraction Experimental Results Conclusions

Mining Complex Structures Frequent structure mining tasks: Item sets – Transactional, unordered data. Sequences – Temporal/positional data, text, biological sequences. Tree Patterns – Semi-structured data, web mining, bioinformatics, etc. Graph Patterns – Bioinformatics, web data. “Frequent” is a broad term: Maximal or closed patterns in dense data. Correlation and other statistical metrics. Interesting, rare, non-redundant patterns.

Anti-Monotonicity (Source: SIGMOD ’08) A monotonic function is a consistently increasing or decreasing function*. The author refers to a monotonically decreasing function as anti-monotonic. The frequency of a supergraph cannot be greater than the frequency of a subgraph (similar to Apriori); in the source figure, the frequency curve is always decreasing. *Very informal definition.

Outline Graph Mining Overview Mining Complex Structures - Introduction Motivation and Contributions of the Author Problem Definition and Case Examples Main Ingredients for Efficient Pattern Extraction Experimental Results Conclusions

Tree Mining – Motivation Capture intricate (subspace) patterns. Can be used (as features) to build global models (classification, clustering, etc.). Ideally suited for categorical, high-dimensional, complex, and massive data. Interesting Applications: Semi-structured data – Mine structure and content. Web usage mining – Log mining (user sessions as trees). Bioinformatics – RNA secondary structures, phylogenetic trees. (Source: University of Washington)

Classification Example Subgraph patterns can be used as features for classification. (The figure plots a feature histogram: amount vs. # of sides.) Off-the-shelf classifiers (like neural networks or genetic algorithms) can be trained using these vectors. Feature selection is very useful too. “Hexagons are a commonly occurring subgraph in organic compounds.”
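As a toy illustration of the feature-vector idea, one can count pattern occurrences per graph and feed the counts to any off-the-shelf classifier. Here a "pattern" is reduced to a single node label, the simplest possible case; the names and data are hypothetical:

```python
def to_feature_vector(graph_labels, patterns):
    """One feature per pattern: its occurrence count in the graph."""
    return [sum(1 for label in graph_labels if label == p) for p in patterns]

# Node labels of a benzene-like fragment: six carbons and one oxygen.
fragment = ["C", "C", "C", "C", "C", "C", "O"]
print(to_feature_vector(fragment, ["C", "O", "N"]))  # [6, 1, 0]
```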

Contributions Systematic subtree enumeration. Extensions for mining unlabeled or unordered subtrees or sub-forests. Optimizations: Representing trees as strings. Scope-lists for subtree occurrences.

Outline Graph Mining Overview Mining Complex Structures - Introduction Motivation and Contributions of the Author Problem Definition and Case Examples Main Ingredients for Efficient Pattern Extraction Experimental Results Conclusions

How does searching for patterns work? Start with graphs of small sizes. Extend k-size graphs by one node to generate (k + 1)-size candidate patterns. Use a scoring function to evaluate each candidate. A popular scoring function is one that defines the minimum support: only graphs with frequency greater than minisup are kept.
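The loop above is the generic level-wise (Apriori-style) search. A hedged sketch, with `extend` and `support` left as structure-specific plug-ins; the itemset instantiation below is only for illustration, not part of TreeMiner:

```python
def level_wise_mine(db, seeds, extend, support, minisup):
    """Grow size-k patterns into (k+1)-candidates; keep those meeting minisup."""
    frequent = []
    frontier = [p for p in seeds if support(p, db) >= minisup]
    while frontier:
        frequent.extend(frontier)
        candidates = {c for p in frontier for c in extend(p)}
        frontier = [c for c in candidates if support(c, db) >= minisup]
    return frequent

# Toy instantiation with itemsets instead of graphs.
db = [{"a", "b"}, {"a", "b", "c"}, {"a"}]
items = {"a", "b", "c"}
support = lambda p, d: sum(1 for t in d if p <= t)
extend = lambda p: [p | {i} for i in items - p]
seeds = [frozenset({i}) for i in items]
result = level_wise_mine(db, seeds, extend, support, 2)
print(sorted(sorted(p) for p in result))  # [['a'], ['a', 'b'], ['b']]
```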

How does searching for patterns work? “The generation of size k + 1 subgraph candidates from size k frequent subgraphs is more complicated and more costly than that of itemsets” – Yan and Han (2002), on gSpan. Where do we add a new edge? It is possible to add a new edge to a pattern and then find that it doesn’t exist in the database. The main story of this presentation is good candidate generation strategies.

TreeMiner TreeMiner uses a technique for numbering tree nodes based on DFS. This numbering is used to encode trees as vectors. Subtrees sharing a common prefix (e.g. the first k numbers in their vectors) form an equivalence class. Generate new candidate (k + 1)-subtrees from equivalence classes of k-subtrees (as in Apriori).

TreeMiner This is important because candidate subtrees are generated only once! (Remember the subgraph isomorphism problem that makes it likely to generate the same pattern over and over.)

Definitions Tree – An undirected graph where there is exactly one path between any two vertices. Rooted Tree – A tree with a special node called the root. (Figures: an unrooted tree, which has no root node, and a rooted tree, which has one.)

Definitions Ordered Tree – The ordering of a node’s children matters. Example: XML documents. Exercise – Prove that ordered trees must be rooted. (Figure: two trees over v1, v2, v3 whose children appear in different orders are not equal.)

Definitions Labeled Tree – Nodes have labels. Rooted trees also have some special terminology: Parent – The node one edge closer to the root. Ancestor – A node n edges closer to the root, for any n. Siblings – Two nodes with the same parent. In Prolog-style notation: ancestor(X,Y) :- parent(X,Y). ancestor(X,Y) :- parent(Z,Y), ancestor(X,Z). sibling(X,Y) :- parent(Z,X), parent(Z,Y).

Definitions Embedded Siblings – Two nodes sharing a common ancestor. Numbering – The node’s position in a traversal (normally DFS) of the tree. A node has a number ni and a label L(ni). Scope – The scope of a node nl is [l, r], where nr is the rightmost leaf under nl (again, DFS numbering).
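The DFS numbering and scopes can be computed in one traversal. A sketch; the tree-as-adjacency-dict representation and function names are my assumptions, and the example tree is shaped to match the scope values that appear later in the slides:

```python
def number_and_scope(children, root):
    """Assign DFS numbers, then scope [l, r] = own number .. rightmost
    descendant's number, in a single depth-first pass."""
    numbers, scopes, counter = {}, {}, [0]

    def dfs(node):
        numbers[node] = counter[0]
        counter[0] += 1
        for child in children.get(node, []):
            dfs(child)
        scopes[node] = (numbers[node], counter[0] - 1)

    dfs(root)
    return numbers, scopes

# A 9-node tree consistent with the slides' example (root 0).
children = {0: [1, 6, 7], 1: [2, 3], 3: [4, 5], 7: [8]}
numbers, scopes = number_and_scope(children, 0)
print(scopes[0], scopes[1], scopes[7])  # (0, 8) (1, 5) (7, 8)
```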

Definitions Embedded Subtrees – S = (Ns, Bs) is an embedded subtree of T = (N, B) if and only if the following conditions are met: Ns ⊆ N (the nodes have to be a subset). b = (nx, ny) ∈ Bs iff nx is an ancestor of ny in T. For each subset of nodes Ns there is one embedded subtree or subforest. (Figure: nodes v1, v4, v5 of a 9-node tree form an embedded subtree; colors show corresponding nodes.)

Definitions Match Label – The node numbers (DFS numbers) in T of the nodes in S with matching labels. A match label uniquely identifies a subtree. This is useful because a labeling function will not necessarily be injective: several nodes may share a label. Example: {v1, v4, v5}, or {1, 4, 5}. (Figure: colors show the corresponding nodes of the subtree.)

Definitions Subforest – A disconnected pattern generated in the same way as an embedded subtree. (Figure: nodes v1, v4, v7, v8 form a subforest; colors show corresponding nodes.)

Problem Definition Given a database (forest) D of trees, find all frequent embedded subtrees. Frequent – Occurring a minimum number of times (using the user-defined minisup). Support(S) – The number of trees in D that contain at least one occurrence of S. Weighted-Support(S) – The number of occurrences of S across all trees in D.

Exercise Generate an embedded subtree or subforest for the set of nodes Ns = {v1, v2, v5}. Is this an embedded subtree or subforest, and why? Assume a labeling function L(x) = x. Answer: This is an embedded subtree because all of the nodes are connected. (*Cough* Exam Question *Cough*)

Outline Graph Mining Overview Mining Complex Structures - Introduction Motivation and Contributions of the Author Problem Definition and Case Examples Main Ingredients for Efficient Pattern Extraction Experimental Results Conclusions

Main Ingredients Pattern Representation – Trees as strings. Candidate Generation – No duplicates. Pattern Counting – Scope-based lists (TreeMiner) or pattern-based matching (PatternMatcher).

String Representation With N nodes, M branches, and a max fanout of F: An adjacency matrix takes N(F + 1) space. An adjacency list requires 4N – 2 space. A (node, child, sibling) tree requires 3N space. The string representation requires only 2N – 1 space.

String Representation The string representation lists node labels in DFS order, with a backtrack operator (written –1) emitted each time the traversal returns to a parent.
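A sketch of this encoding: emit labels in DFS order and a -1 backtrack symbol after each child subtree. This matches Zaki's scheme in spirit, but the helper names and tree representation are mine:

```python
def encode(labels, children, root):
    """DFS label string with -1 as the backtrack operator."""
    out = [labels[root]]
    for child in children.get(root, []):
        out += encode(labels, children, child)
        out.append(-1)  # backtrack to the parent after each subtree
    return out

# A 4-node tree: 2N - 1 = 7 symbols, as the space analysis predicts
# (N labels plus N - 1 backtracks, one per edge).
labels = {0: 1, 1: 2, 2: 3, 3: 4}
children = {0: [1, 3], 1: [2]}
print(encode(labels, children, 0))  # [1, 2, 3, -1, -1, 4, -1]
```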

Candidate Generation Equivalence Classes – Two subtrees are in the same equivalence class iff they share a common prefix string P up to the (k – 1)-th node. This gives us simple equivalence testing of a fixed-size array. Fast and parallel – can even be run on a GPU. Caveat – The order of the tree matters.

Candidate Generation Generate new candidate (k + 1)-subtrees from equivalence classes of k-subtrees. Consider each pair of elements in a class, including self-extensions. Up to two new candidates result from each pair of joined elements. All possible candidate subtrees are enumerated, and each subtree is generated only once!

Candidate Generation Each class is represented in memory by a prefix string and a set of ordered pairs indicating nodes that exist in that class. A class is extended by applying a join operator ⊗ on all ordered pairs in the class.

Candidate Generation Example – The equivalence class with prefix string (1, 2) contains two elements: (3, v1) and (4, v0). The element notation can be confusing: the first item is a label and the second item is a DFS node position.

Candidate Generation Theorem 1. Define a join operator ⊗ on two elements as (x, i) ⊗ (y, j). Then apply one of the following cases: (1) If i = j and P is not empty, add (y, j) and (y, j + 1) to class [Px]. If P is empty, only add (y, j + 1) to [Px]. (2) If i > j, add (y, j) to [Px]. (3) If i < j, no new candidate is possible.
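The three cases translate directly into code. A sketch, with elements as (label, attach-position) pairs; the function name and the `prefix_empty` flag are my conventions, not the paper's:

```python
def join(elem_x, elem_y, prefix_empty=False):
    """Return candidate elements for class [Px] from (x, i) ⊗ (y, j)."""
    (x, i), (y, j) = elem_x, elem_y
    if i == j:
        # Case 1: add (y, j) and (y, j + 1); only (y, j + 1) when P is empty.
        return [(y, j + 1)] if prefix_empty else [(y, j), (y, j + 1)]
    if i > j:
        return [(y, j)]   # Case 2
    return []             # Case 3: i < j, no candidate possible

# The worked example from the slides: P = (1, 2) with (3, v1) and (4, v0).
print(join((3, 1), (3, 1)))  # [(3, 1), (3, 2)]
print(join((3, 1), (4, 0)))  # [(4, 0)]
```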

Candidate Generation Consider the prefix class from the previous example: P = (1, 2), which contains two elements, (3, v1) and (4, v0). 1. Join (3, v1) ⊗ (3, v1) – Case (1) applies, producing (3, v1) and (3, v2) for the new class [P3] = (1, 2, 3). 2. Join (3, v1) ⊗ (4, v0) – Case (2) applies. (Don’t worry, there’s an illustration on the next slide.)

Candidate Generation (Illustration) A class with prefix (1, 2) contains a node with label 3, written (3, v1): a node labeled 3 attached at position 1 in the DFS order of nodes. Joining (3, v1) ⊗ (3, v1) gives: Prefix = (1, 2, 3), new elements = (3, v2), (3, v1).

Candidate Generation (Illustration) Joining (3, v1) ⊗ (4, v0) additionally gives: Prefix = (1, 2, 3), new elements = (3, v2), (3, v1), (4, v0).

The Algorithm
TreeMiner(D, minisup):
  F1 = { frequent 1-subtrees }
  F2 = { classes [P]1 of frequent 2-subtrees }
  for all [P]1 ∈ F2 do Enumerate-Frequent-Subtrees([P]1)

Enumerate-Frequent-Subtrees([P]):
  for each element (x, i) ∈ [P] do
    [Px] = ∅
    for each element (y, j) ∈ [P] do
      R = { (x, i) ⊗ (y, j) }
      L(R) = L(x) ∩⊗ L(y)   (scope-list join)
      if any R ∈ R is frequent then [Px] = [Px] ∪ {R}
    Enumerate-Frequent-Subtrees([Px])

ScopeList Join Recall that the scope of a node is the interval between its own DFS number and the DFS number of its rightmost descendant. This can be used to calculate support. (Figure: a 9-node tree with scopes [0, 8], [1, 5], [2, 2], [3, 5], [4, 4], [5, 5], [6, 6], [7, 8], [8, 8].)

ScopeList Join ScopeLists are used to calculate support. Let x and y be nodes with scopes sx = [lx, ux] and sy = [ly, uy]. Then sx contains sy iff lx ≤ ly and ux ≥ uy. A scope list represents the entire forest.

ScopeList Join A ScopeList is a list of (t, m, s) 3-tuples, where: t is the tree ID. m is the match label of the (k – 1)-length prefix of xk. s is the scope of the last item, xk. Scope lists allow constant-time tests of whether y is a descendant or embedded sibling of x.
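The scope tests behind the join are simple interval comparisons: containment means y is a descendant of x, and x's scope ending before y's starts means y lies to x's right (an embedded sibling under any common ancestor). A sketch with scopes as (l, u) pairs; the function names are mine:

```python
def contains(sx, sy):
    """sx contains sy: y is a descendant of x."""
    return sx[0] <= sy[0] and sx[1] >= sy[1]

def strictly_before(sx, sy):
    """sx ends before sy starts: y is to the right of x."""
    return sx[1] < sy[0]

# Scopes from the example tree: node 1 has scope [1, 5], node 4 has [4, 4].
print(contains((1, 5), (4, 4)))         # True:  4 is a descendant of 1
print(strictly_before((1, 5), (7, 8)))  # True:  7 is to the right of 1
```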

Outline Graph Mining Overview Mining Complex Structures - Introduction Motivation and Contributions of the Author Problem Definition and Case Examples Main Ingredients for Efficient Pattern Extraction Experimental Results Conclusions

Experimental Results Machine: 500MHz Pentium II, 512MB memory, 9GB disk, Red Hat Linux 6.0. Synthetic Data: Web browsing. Parameters: N = #labels, M = #nodes, F = max fanout, D = max depth, T = #trees. Generation: Create a master website tree W. For each node in W, generate between 0 and F children. Assign probabilities (summing to 1) of following each child or backtracking. Recursively continue until depth D is reached. Then generate a database of T subtrees of W: start at the root and, at each node, draw a random number in (0, 1) to decide which child to follow or whether to backtrack. Default parameters: N = 100, M = 10,000, D = 10, F = 10, T = 100,000. Three datasets: D10 (all default values), F5 (F = 5), T1M (T = 10^6). Real Data: CSLOGS – one month of web log files at RPI CS. Unique pages accessed serve as labels. Obtained 59,691 user browsing trees, with an average string length of 23.3 per tree.

Distribution of Frequent Trees (Sparse vs. Dense) Take-Home Point: Many large, frequent trees can be discovered. F5: max fanout = 5. T1M: 10^6 trees.

Experiments (Sparse) Take-Home Point: Both algorithms are able to cope with relatively short patterns in sparse data.

Experiments (Dense) Sparse (artificial dataset) vs. Dense (real-world dataset). Take-Home Point: With long patterns at low support (length = 20), the level-wise approach suffers. The authors use the artificial dataset to justify TreeMiner as 20 times faster than PatternMatcher.

Outline Graph Mining Overview Mining Complex Structures - Introduction Motivation and Contributions of the Author Problem Definition and Case Examples Main Ingredients for Efficient Pattern Extraction Experimental Results Conclusions

Conclusions TreeMiner: A novel tree mining approach. Non-duplicate candidate generation. Scope-list joins for frequency computation. A framework for tree-mining tasks: Frequent subtrees in a forest of rooted, labeled, ordered trees. Frequent subtrees in a single tree. There are extensions for unlabeled and unordered trees.

Caveats Frequent does not always mean significant! Exhaustive enumeration is a problem even though candidate generation in TreeMiner is efficient. Low minisup values increase true positives at the cost of increasing false positives. State-of-the-art graph miners exploit the structure of the search space (e.g. the LEAP search algorithm) to extract only significant structures. Candidate structures can be generated by tree miners and evaluated by some other means.

Question One Generate an embedded subtree or subforest for the set of nodes Ns = {v1, v2, v5}. Is this an embedded subtree or subforest, and why? Assume a labeling function L(x) = x. Answer: This is an embedded subtree because all of the nodes are connected.

Question Two Why is the frequency of subgraphs a good function to evaluate candidate patterns? How could it be better? Answer: The frequency of subgraphs is a monotonically decreasing function, meaning supergraphs are never more frequent than their subgraphs. This is a desirable property: combined with a minimum support threshold, it prunes the search space as subgraph patterns get bigger. However, frequency does not always imply significance, so another metric must be used to evaluate the candidates generated by a graph miner for significance.

Question Three How is a string representation of a tree useful in graph mining? What requirements does it place on the graph? Answer: A string representation of a tree is useful because string comparisons are worst-case O(n) and can be easily optimized. However, it requires that the tree be rooted and ordered, because otherwise the string comparison operator would not be valid.