1 Efficiently Mining Frequent Trees in a Forest Mohammed J. Zaki.



2 Frequent Structure Mining (FSM) Extracts patterns (associations, sequences, frequent trees, graphs, etc.) from massive databases. Typical applications: bioinformatics, web mining, and mining semi-structured documents.

3 Tree Mining Problem Goal: efficiently enumerate all frequent subtrees in a forest (a database D of trees) according to a given minimum support (minsup). The support of a subtree S is the number of trees in D that contain at least one occurrence of S. A subtree S is frequent if its support is greater than or equal to a user-specified minsup value.

4 Rooted, Ordered & Labeled Tree A tree is an acyclic connected graph. Rooted: there exists one vertex, the root, that is distinguished from the others. Ordered: the children of each node in a rooted tree are ordered. Labeled: each node is associated with a label. Every tree in the paper is a rooted, ordered, and labeled tree.

5 Definition of Subtrees We denote a tree as T = {N, B}, where N is a set of labeled nodes and B is a set of branches. We say that a tree S = {Ns, Bs} is an embedded subtree of T = {N, B} if: 1. Ns is a subset of N. 2. A branch (x, y) appears in Bs iff the two vertices lie on the same path from the root to a leaf in T, i.e., x is an ancestor of y in T. Hence, embedded subtrees allow not only direct parent-child branches but also ancestor-descendant branches. A disconnected pattern is a sub-forest of T.

6 Examples of Subtrees [Figure: a tree T, an embedded subtree S of T, and a disconnected pattern that is not a subtree but a sub-forest of T.]

7 Node Numbers and Labels Each node has a well-defined number i, given by its position in a depth-first (pre-order) traversal of the tree. The label of each node is taken from a set of labels L = {0, 1, …, m-1} and represents the value of the node.

8 Scope of a Node The scope of a node n_i is given as [i, r]: the lower bound is the position i of the node itself, and the upper bound is the position r of its right-most leaf. Assume two nodes x and y have scopes S_x = [i_x, r_x] and S_y = [i_y, r_y]. S_x is strictly less than S_y (S_x < S_y) iff r_x < i_y, i.e., S_x occurs before S_y without overlapping it. It means that y is an embedded sibling of x. S_x contains S_y iff i_x ≤ i_y and r_x ≥ r_y. For two distinct nodes, it means that y is a descendant of x. [Figure: an example tree whose nodes 0 to 7 are annotated with the scopes [0,7], [1,4], [2,3], [3,3], [4,4], [5,7], [6,7], [7,7].]
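
The scope definition can be tried out with a short Python sketch. It is only an illustration, not code from the paper: the nested-tuple tree representation, the function names, and the labels "a" to "h" are assumptions; only the tree shape is chosen so that it reproduces the eight scopes listed on the slide.

```python
def scopes(tree):
    """Return [(label, (i, r)), ...] listed in depth-first (pre-order) order.

    `tree` is a nested pair (label, [child, child, ...]).
    """
    out = []

    def visit(node):
        label, children = node
        i = len(out)            # depth-first number of this node
        out.append(None)        # reserve the slot, fill it in after the children
        for child in children:
            visit(child)
        r = len(out) - 1        # number of the right-most leaf below this node
        out[i] = (label, (i, r))

    visit(tree)
    return out


def strictly_less(sx, sy):
    """S_x < S_y: x's subtree ends before y starts (y is an embedded sibling)."""
    return sx[1] < sy[0]


def contains(sx, sy):
    """S_x contains S_y: for distinct nodes, y is a descendant of x."""
    return sx[0] <= sy[0] and sx[1] >= sy[1]


# Hypothetical labels; the shape gives the scopes [0,7], [1,4], ..., [7,7].
tree = ("a", [("b", [("c", [("d", [])]), ("e", [])]),
              ("f", [("g", [("h", [])])])])
s = dict(scopes(tree))
print(s["b"], s["f"], strictly_less(s["b"], s["f"]))
print(contains(s["b"], s["d"]), contains(s["b"], s["g"]))
```

Because both tests only compare interval bounds, sibling and descendant checks take constant time once the scopes are known.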

9 Representing Trees as Strings The string encoding: –1 –1 1 –1 –1 4 3 –1 2 –1 -1 To create the string encoding of a tree, denoted t, we perform a depth-first search that starts and ends at the root, adding each node's label x to t when the node is first visited. Whenever we backtrack from a child to its parent, we add a special symbol –1 to the string.
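
A minimal Python sketch of this encoding and its inverse follows. The nested-tuple representation and the names `encode` and `decode` are assumptions made for illustration; the example tree is T0 from the later database example, whose encoding 1 2 –1 3 4 –1 –1 is given in the slides.

```python
def encode(tree):
    """Depth-first string encoding: labels in visit order, -1 on each backtrack."""
    label, children = tree
    out = [label]
    for child in children:
        out.extend(encode(child))
        out.append(-1)          # backtrack from the child to this node
    return out


def decode(tokens):
    """Rebuild the nested (label, [children]) tree from its encoding."""
    root, stack = None, []
    for token in tokens:
        if token == -1:
            stack.pop()         # move back up to the parent
        else:
            node = (token, [])
            if stack:
                stack[-1][1].append(node)
            else:
                root = node
            stack.append(node)
    return root


t0 = (1, [(2, []), (3, [(4, [])])])
print(encode(t0))
```

No –1 is emitted for the root itself, since the search ends there and never backtracks above it.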

10 Equivalence Classes Two k-subtrees X and Y are in the same prefix equivalence class iff they share a common prefix up to the (k-1)th node. Prefix string: –1 3. The following three subtrees are in the same prefix equivalence class: –1 3 –1 –1 x –1, element (x, 0); –1 3 –1 x –1 –1, element (x, 1); –1 3 x –1 –1 –1, element (x, 3). Element list: each element is a pair (label, position of the prefix node to which x is attached), here (x, 0); (x, 1); (x, 3). A valid element x may be attached only to nodes that lie on the path from the root to the right-most leaf of the prefix. Attaching x to any other node does not give a valid element.
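
The right-most-path rule can be sketched in Python as follows. This is an illustration under stated assumptions: the prefix is a list of tokens with no trailing -1, the function names are made up, and the labels 5, 6, 7, 8, 9 are hypothetical. The prefix shape is chosen so that the valid positions are 0, 1, and 3, as in the element list on the slide.

```python
def rightmost_path(prefix):
    """Positions of the nodes on the path from the root to the right-most leaf.

    Assumes `prefix` is a string encoding without trailing -1 symbols.
    """
    stack, pos = [], -1
    for token in prefix:
        if token == -1:
            stack.pop()
        else:
            pos += 1
            stack.append(pos)
    return stack


def attach(prefix, x, at):
    """Encoding of the subtree obtained by adding label x as last child of node `at`."""
    path = rightmost_path(prefix)
    if at not in path:
        raise ValueError(f"node {at} is not on the right-most path {path}")
    depth = path.index(at)
    ups = len(path) - 1 - depth          # backtracks from the right-most leaf to `at`
    return prefix + [-1] * ups + [x] + [-1] * (depth + 1)


prefix = [5, 6, 7, -1, 8]                # hypothetical labels
print(rightmost_path(prefix))
for at in rightmost_path(prefix):
    print(at, attach(prefix, 9, at))
```

Attaching at position 2 raises an error, because node 2 is not on the right-most path.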

11 Candidate Generation Goal: given an equivalence class of k-subtrees, obtain the candidate (k+1)-subtrees. Main idea: consider each pair of elements in the class for extension, including self-extension. Theorem: Assume elements are kept sorted by node label as the primary key and position as the secondary key. Let P be a prefix class, and let (x, i) and (y, j) denote any two elements in the class. Px denotes the class representing the extensions of element (x, i). Define (y, j) join (x, i) as follows. Case I (i = j): 1) if P ≠ ∅, add (y, j) and (y, j+1) to Px; 2) if P = ∅, add (y, j) to Px. Case II (i > j): add (y, j) to Px. Case III (i < j): no new element is possible in this case. Note: the theorem as stated has a mistake; the example on the next slide shows a case in which adding (y, j+1) gives a wrong tree.

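12 Candidate Generation Example Prefix: 1 2. Element list: (3, 1); (4, 0).
Class with prefix 1 2 3 (extensions of (3, 1)): (3, 1) join (3, 1) gives (3, 1) and (3, 2); (4, 0) join (3, 1) gives (4, 0).
Class with prefix 1 2 –1 4 (extensions of (4, 0)): (4, 0) join (4, 0) gives (4, 0) and (4, 2).
If we instead add (y, j+1), i.e., (4, 1), the new node is attached to node 1, which does not lie on the right-most path of the prefix 1 2 –1 4, so the resulting tree is wrong.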
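
The join can be sketched in Python as below. This is not the theorem as printed. Following what the example on this slide suggests, the second element of Case I is attached at the position of x in Px (called `n_x` here, equal to the number of nodes in P) rather than at j+1. The function names are assumptions, and the sketch covers only a non-empty prefix.

```python
def join(prefix_size, x_elem, y_elem):
    """Elements that (y, j) join (x, i) contributes to the class Px.

    `prefix_size` is the number of nodes in the shared prefix P (must be >= 1).
    """
    if prefix_size < 1:
        raise ValueError("this sketch only handles a non-empty prefix")
    (x, i), (y, j) = x_elem, y_elem
    n_x = prefix_size            # depth-first number that x receives in Px
    if i == j:                   # Case I: y as sibling of x, or as child of x
        return [(y, j), (y, n_x)]
    if i > j:                    # Case II: y stays attached to node j
        return [(y, j)]
    return []                    # Case III: no candidate


def extend_class(prefix_size, elements):
    """Map each element (x, i) to the element list of its new class Px."""
    return {xe: [e for ye in elements for e in join(prefix_size, xe, ye)]
            for xe in elements}


# Prefix 1 2 (two nodes) with element list (3, 1); (4, 0), as on the slide.
for x_elem, new_elems in extend_class(2, [(3, 1), (4, 0)]).items():
    print(x_elem, "->", new_elems)
```

For i = j = 1 the two rules agree, since j+1 = 2 is also the position of x; they differ for (4, 0) join (4, 0), where this sketch yields (4, 2).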

13 TreeMiner Algorithm
TreeMiner(D, minsup), where D is the database of trees (the forest):
  F1 = { frequent 1-subtrees }
  F2 = { classes [P]1 of frequent 2-subtrees }
  For all [P]1, do Enumerate-Frequent-Subtrees([P]1)
Enumerate-Frequent-Subtrees([P]):
  For each element (x, i) ∈ [P] do
    For each element (y, j) ∈ [P] do
      (y, j) join (x, i) gives at most two new candidate subtrees
      For each candidate subtree, do the scope-list join
      If it is frequent, add the subtree to the list of frequent subtrees
  Repeat on the new classes until all frequent subtrees have been enumerated.
Notation: P is a prefix class. [P]1 means the prefix size is 1, i.e., there is only one node in the prefix. Px refers to the new prefix tree formed by adding (x, i) to P. Fk is the set of all frequent subtrees of size k.

14 An Example of the TreeMiner Algorithm [Figure: database D of 3 trees, T0, T1, and T2.]
D in horizontal format, as (tree id, string encoding) pairs:
(T0, 1 2 –1 3 4 –1 –1)
(T1, 2 1 2 –1 4 –1 –1 2 –1 3 –1)
(T2, 1 3 2 –1 –1 5 1 2 –1 3 4 –1 –1 –1 –1)
D in vertical format, as (tree id, scope) pairs, one column per node label:
Label 1     Label 2     Label 3     Label 4     Label 5
0, [0,3]    0, [1,1]    0, [2,3]    0, [3,3]    2, [3,7]
1, [1,3]    1, [0,5]    1, [5,5]    1, [3,3]
2, [0,7]    1, [2,2]    2, [1,2]    2, [7,7]
2, [4,7]    1, [4,4]    2, [6,7]
            2, [2,2]
            2, [5,5]
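
The conversion from horizontal to vertical format can be sketched in Python as follows. The function name and data layout are assumptions. The encoding of T0 is taken from the slide as is; the encodings of T1 and T2 are written out in full here on the assumption that they are the trees described by the scope pairs in the vertical format, where the label of each column (1 to 5) is inferred from T0 and from the label set of the example.

```python
from collections import defaultdict


def scope_lists(db):
    """Build {label: [(tree_id, (i, r)), ...]} from {tree_id: string encoding}."""
    vertical = defaultdict(list)
    for tid, tokens in db.items():
        stack, pos = [], -1              # open nodes as (position, label)
        for token in tokens:
            if token == -1:
                i, label = stack.pop()   # node is complete: its scope ends at pos
                vertical[label].append((tid, (i, pos)))
            else:
                pos += 1
                stack.append((pos, token))
        while stack:                     # the root is never closed by a -1
            i, label = stack.pop()
            vertical[label].append((tid, (i, pos)))
    return {label: sorted(entries) for label, entries in sorted(vertical.items())}


db = {
    0: [1, 2, -1, 3, 4, -1, -1],
    1: [2, 1, 2, -1, 4, -1, -1, 2, -1, 3, -1],                  # assumed encoding
    2: [1, 3, 2, -1, -1, 5, 1, 2, -1, 3, 4, -1, -1, -1, -1],    # assumed encoding
}
for label, entries in scope_lists(db).items():
    print(label, entries)
```

Each label's list is sorted by tree id and then by scope, which is convenient when two lists are later joined tree by tree.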

15 Scope-List Joins Example: minsup = 100%
Step 1: Calculate F1. Prefix = {}, element list: (1,-1), (2,-1), (3,-1), (4,-1). Infrequent element: (5,-1). Each scope-list entry is (tree id, node scope); for example, 0, [0,3] means tree id 0 and node scope [0,3].
Label 1     Label 2     Label 3     Label 4
0, [0,3]    0, [1,1]    0, [2,3]    0, [3,3]
1, [1,3]    1, [0,5]    1, [5,5]    1, [3,3]
2, [0,7]    1, [2,2]    2, [1,2]    2, [7,7]
2, [4,7]    1, [4,4]    2, [6,7]
            2, [2,2]
            2, [5,5]
Step 2: Calculate F2. Suppose prefix = {1}, element list: (2,0), (4,0). Infrequent elements: (1,0), (3,0). Each entry is (tree id, node number (position) of the match of prefix {1}, scope of the element node); for example, 0, 0, [1,1].
Element (2,0)    Element (4,0)
0, 0, [1,1]      0, 0, [3,3]
1, 1, [2,2]      1, 1, [3,3]
2, 0, [2,2]      2, 0, [7,7]
2, 0, [5,5]      2, 4, [7,7]
2, 4, [5,5]
Step 3: Calculate F3. Suppose prefix = {1,2}, element list: (4,0). Infrequent elements: (2,0), (2,1), (4,1). Each entry is (tree id, node numbers (positions) of the match of prefix {1 2}, scope of the element node); for example, 0, 01, [3,3].
Element (4,0)
0, 01, [3,3]
1, 12, [3,3]
2, 02, [7,7]
2, 05, [7,7]
2, 45, [7,7]
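
The two joins used in this example can be sketched in Python as follows. The names `in_scope_join`, `out_scope_join`, and `support`, and the tuple layout (tree id, positions of the matched prefix nodes, scope of the element node), are assumptions for illustration. The input lists are the F1 scope-lists of this example, with the column labels 1 to 4 taken as an assumption.

```python
def in_scope_join(lx, ly):
    """New last node y is a descendant of the last node x of the pattern."""
    return sorted((tx, mx + (sx[0],), sy)
                  for tx, mx, sx in lx for ty, my, sy in ly
                  if tx == ty and mx == my and sx[0] < sy[0] and sy[1] <= sx[1])


def out_scope_join(lx, ly):
    """New last node y is an embedded sibling that comes after x."""
    return sorted((tx, mx + (sx[0],), sy)
                  for tx, mx, sx in lx for ty, my, sy in ly
                  if tx == ty and mx == my and sx[1] < sy[0])


def support(scope_list):
    """Number of distinct trees that contain at least one occurrence."""
    return len({tid for tid, _, _ in scope_list})


def f1(pairs):
    """F1 entries have an empty prefix match."""
    return [(tid, (), scope) for tid, scope in pairs]


L = {
    1: f1([(0, (0, 3)), (1, (1, 3)), (2, (0, 7)), (2, (4, 7))]),
    2: f1([(0, (1, 1)), (1, (0, 5)), (1, (2, 2)), (1, (4, 4)), (2, (2, 2)), (2, (5, 5))]),
    3: f1([(0, (2, 3)), (1, (5, 5)), (2, (1, 2)), (2, (6, 7))]),
    4: f1([(0, (3, 3)), (1, (3, 3)), (2, (7, 7))]),
}

l12 = in_scope_join(L[1], L[2])          # subtree 1 2
l14 = in_scope_join(L[1], L[4])          # subtree 1 4
l124 = out_scope_join(l12, l14)          # subtree 1 2 -1 4
print(support(l12), support(l14), support(l124))
print(support(in_scope_join(L[1], L[3])), support(in_scope_join(L[1], L[1])))
```

With three trees and minsup = 100%, a candidate is frequent only if its support is 3, which holds for 1 2, 1 4, and 1 2 –1 4 but not for the elements (3,0) and (1,0).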

16 Conclusion The paper introduces the notion of mining embedded subtrees in a forest (a database of trees). Candidate subtree generation is systematic: no subtree is generated more than once (but the theorem has a mistake). A string encoding of trees is used to store the dataset efficiently. Each node's scope is used to build scope-lists. A new algorithm, TreeMiner, is introduced.