TreeFinder: a first step towards XML data mining. Advisor: Dr. Hsu. Graduate: Keng-Wei Chang. Authors: Alexandre Termier, Marie-Christine Rousset, Michèle Sebag.



Outline: Motivation, Objective, Introduction, Example, Formal background, Overview of the TreeFinder system, Experimental results, Conclusions, Personal opinion.

Motivation: The goal is to construct a tree-based mediated schema for integrating multiple heterogeneous sources of XML data. To that end, the authors consider the problem of finding frequent trees in a collection of tree-structured data that models XML data.

Objective: The TreeFinder algorithm aims to find trees whose exact or perturbed copies are frequent in a collection of labelled trees.

1. Introduction. The authors present a method that automatically extracts, from a collection of labelled trees, a set of frequent trees that occur as common (exact or approximate) trees embedded in a sufficient number of trees of the collection.

1. Introduction. The method has two steps: (i) a clustering of the input trees; (ii) a characterization of each cluster by a set of frequent trees. The important point is that the method does not look for an exact embedding but for trees that may be approximately embedded in several input trees.

2. Example-Illustration

The tree in Fig. 2(a) is (exactly or approximately) included in several different trees of the input collection, because they have the tree in Fig. 2(b) in common.

3. Formal background. Definition 1 (Labelled trees): A labelled tree is a pair (t, label) where (i) t is a finite tree whose nodes are in N, and (ii) label is a labelling function that assigns a label to each node in t.

3. Formal background. Definition 2 (Inclusion by tree subsumption): Let t and t' be two labelled trees. We say that t is included in t' according to tree subsumption if there exists a mapping f from the nodes of t into the set of nodes of t' such that f preserves the labels and the ancestor relation: ∀u in t, label(u) = label(f(u)), and ∀u, v in t, anc(u, v) ⇒ anc(f(u), f(v)).
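
As a rough illustration of Definition 2, the sketch below tests inclusion by brute force. It assumes the condition reads "anc(u, v) implies anc(f(u), f(v))" and that a tree is stored as a pair of dictionaries (a parent map and a label map). That representation and the names `ancestor_pairs` and `is_included` are hypothetical and do not come from the slides. The search is exponential and only meant for tiny examples.

```python
from itertools import product

def ancestor_pairs(parent):
    """All (u, v) such that u is a proper ancestor of v."""
    pairs = set()
    for v in parent:
        u = parent[v]
        while u is not None:      # climb from v to the root
            pairs.add((u, v))
            u = parent[u]
    return pairs

def is_included(small, big):
    """Brute-force test of inclusion by tree subsumption (exponential)."""
    (parent1, label1), (parent2, label2) = small, big
    nodes1 = list(parent1)
    # A node may only be mapped to a node carrying the same label.
    candidates = [[w for w in parent2 if label2[w] == label1[u]] for u in nodes1]
    anc1, anc2 = ancestor_pairs(parent1), ancestor_pairs(parent2)
    for images in product(*candidates):
        f = dict(zip(nodes1, images))
        # Every ancestor pair of the small tree must map to an ancestor pair.
        if all((f[u], f[v]) in anc2 for u, v in anc1):
            return True
    return False

# Hypothetical trees: each is (parent map, label map).
t = ({0: None, 1: 0}, {0: "a", 1: "d"})                       # a -> d
t_prime = ({0: None, 1: 0, 2: 1}, {0: "a", 1: "c", 2: "d"})   # a -> c -> d
t_other = ({0: None, 1: 0}, {0: "d", 1: "a"})                 # d -> a

print(is_included(t, t_prime))  # a is an ancestor of d in t_prime
print(is_included(t, t_other))  # here the ancestor relation is reversed
```

The first call shows the approximate flavour of the definition: `a -> d` is not a subtree of `a -> c -> d`, yet it is included because only the ancestor relation has to be preserved.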

3. Formal background. Definition 3 (Relational description of labelled trees): Let t be a labelled tree. Rel(t) is the conjunction of all atoms ab(u, v) such that u and v are nodes in t, label(u) = a, label(v) = b, and u is the parent node of v. Rel*(t) is the conjunction of all atoms a*b(u, v) such that u and v are nodes in t, label(u) = a, label(v) = b, and u is an ancestor node of v.
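
A minimal sketch of the two relational descriptions, assuming a tree is given as a parent map plus a label map and that an atom is stored as a tuple (a, b, u, v). The function names `rel` and `rel_star` and the sample tree are illustrative assumptions, not part of the original slides.

```python
def rel(parent, label):
    """Parent-child description: atoms (a, b, u, v) with u the parent of v."""
    return {(label[u], label[v], u, v)
            for v, u in parent.items() if u is not None}

def rel_star(parent, label):
    """Ancestor description: atoms (a, b, u, v) with u an ancestor of v."""
    atoms = set()
    for v in parent:
        u = parent[v]
        while u is not None:      # walk up from v to the root
            atoms.add((label[u], label[v], u, v))
            u = parent[u]
    return atoms

# Hypothetical tree: Dealer (0) -> UsedCar (1) -> Model (2)
parent = {0: None, 1: 0, 2: 1}
label = {0: "Dealer", 1: "UsedCar", 2: "Model"}

print(sorted(rel(parent, label)))
print(sorted(rel_star(parent, label)))
```

On this tree the ancestor description contains every parent-child atom plus one extra atom linking Dealer to Model, which is what lets later steps ignore intermediate nodes.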

3. Formal background

Definition 4 (Maximal common tree): Let t_1, …, t_n be labelled trees. We say that t is a maximal common tree of t_1, …, t_n iff: (i) ∀i in [1…n], t is included in t_i; (ii) t is maximal for the previous property.

3. Formal background. Definition 5 (Frequent tree): Let T be a set of labelled trees, and let t be a labelled tree. Let ε be a real number in [0, 1]. We say that t is an ε-frequent tree of T iff: (i) there exist l trees {t_i1, …, t_il} in T such that t is a maximal common tree of {t_i1, …, t_il}; (ii) l is greater than or equal to ε × |T|.

4. Overview of the TreeFinder system. Let T = {t_1, …, t_n} be a set of labelled trees, and let ε be a frequency threshold. The TreeFinder method for discovering ε-frequent trees in T is a two-step algorithm.

4.1 Clustering. Input: an abstraction of the input set T, in which each tree is viewed as a transaction made of all the items l*m such that l is the label of an ancestor of a node labelled m in that tree. Ex.: the transaction of one input tree is {Dealer*UsedCar, Dealer*NewCar, UsedCar*Model, UsedCar*Year, UsedCar*Price, NewCar*Model, NewCar*Price, Dealer*Model, Dealer*Year, Dealer*Price}.
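
The abstraction step can be sketched as follows. The tree below is a hypothetical one chosen to be consistent with the item list in the example (a Dealer with a UsedCar and a NewCar child). The slides do not show the actual tree, and the parent-map representation and the name `tree_to_transaction` are assumptions.

```python
def tree_to_transaction(parent, label):
    """Set of items 'l*m' where l labels an ancestor of a node labelled m."""
    items = set()
    for v in parent:
        u = parent[v]
        while u is not None:      # every ancestor of v contributes one item
            items.add(f"{label[u]}*{label[v]}")
            u = parent[u]
    return items

# Hypothetical tree consistent with the Dealer example.
parent = {0: None, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0, 6: 5, 7: 5}
label = {0: "Dealer",
         1: "UsedCar", 2: "Model", 3: "Year", 4: "Price",
         5: "NewCar", 6: "Model", 7: "Price"}

transaction = tree_to_transaction(parent, label)
print(sorted(transaction))
```

Node identities are dropped on purpose: two nodes with the same label under the same ancestor label collapse into one item, so the transaction is a coarser view of the tree than its relational description.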

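4.1 Clustering. Clustering method: the Apriori algorithm. Output: the support sets of the largest frequent item sets. Ex.: two such item sets are {UsedCar*Model, UsedCar*Year, UsedCar*Price} and {Book*Price, Book*Title, Book*Author}; the support set of an item set is the set of input trees whose transactions contain it, and each support set forms one cluster.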
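
To make the clustering output concrete, here is a simplified level-wise search in the spirit of Apriori. It omits the usual subset-pruning optimisation and is not the authors' implementation. The names `frequent_itemsets` and `largest`, the `min_count` parameter, and the four toy transactions are all assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Map each frequent item set to its support set (transaction indices)."""
    items = {i for t in transactions for i in t}
    level = {}
    for i in items:
        support = frozenset(k for k, t in enumerate(transactions) if i in t)
        if len(support) >= min_count:
            level[frozenset([i])] = support
    frequent = {}
    while level:
        frequent.update(level)
        next_level = {}
        for a, b in combinations(list(level), 2):
            candidate = a | b
            # Join only item sets that differ by exactly one item.
            if len(candidate) == len(a) + 1 and candidate not in next_level:
                support = level[a] & level[b]
                if len(support) >= min_count:
                    next_level[candidate] = support
        level = next_level
    return frequent

def largest(frequent):
    """Keep the frequent item sets not contained in a bigger frequent one."""
    return {s: sup for s, sup in frequent.items()
            if not any(s < other for other in frequent)}

# Toy transactions (hypothetical): two car trees and two book trees.
transactions = [
    {"UsedCar*Model", "UsedCar*Year", "UsedCar*Price", "Dealer*Model"},
    {"UsedCar*Model", "UsedCar*Year", "UsedCar*Price"},
    {"Book*Price", "Book*Title", "Book*Author"},
    {"Book*Price", "Book*Title", "Book*Author"},
]
clusters = largest(frequent_itemsets(transactions, min_count=2))
for itemset, support in clusters.items():
    print(sorted(itemset), sorted(support))
```

Each support set printed at the end is one cluster of input trees, which is what the second step takes as input.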

4.2 Computation of maximal common trees. Input: the output of the previous clustering step. For each cluster, this step computes the maximal trees that are common to all the trees of the cluster. Method: compute the least general generalization (LGG) of the relational descriptions of the trees in the cluster, LGG(Rel*(t_i1), …, Rel*(t_il)). Ex.: the LGG of one cluster is a conjunction of a*b, a*c, a*d and c*d atoms.
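
The sketch below shows a Plotkin-style LGG for two conjunctions of binary atoms, under the assumption that an atom is a tuple (predicate, u, v) and that the variable generalizing a pair of nodes is represented by the pair itself. It returns the unreduced LGG and is only a sketch of the general technique, not TreeFinder's actual procedure. The two small trees are hypothetical.

```python
def lgg(atoms1, atoms2):
    """Unreduced LGG of two conjunctions of atoms (pred, u, v).

    Atoms with the same predicate are paired; the pair (x1, x2) stands for
    the variable that generalizes node x1 of the first tree and node x2 of
    the second, so the same pair always yields the same variable.
    """
    return {(p, (u1, u2), (v1, v2))
            for (p, u1, v1) in atoms1
            for (q, u2, v2) in atoms2
            if p == q}

# Ancestor descriptions of two hypothetical trees.
# Tree 1: a(1) has children b(2) and c(3); c(3) has child d(4).
rel_t1 = {("a*b", 1, 2), ("a*c", 1, 3), ("a*d", 1, 4), ("c*d", 3, 4)}
# Tree 2: a(10) has children c(11), b(12), e(14); c(11) has child d(13).
rel_t2 = {("a*c", 10, 11), ("a*b", 10, 12), ("a*d", 10, 13),
          ("c*d", 11, 13), ("a*e", 10, 14)}

common = lgg(rel_t1, rel_t2)
for atom in sorted(common):
    print(atom)
```

The atom over a*e has no counterpart in the first tree and disappears, while the shared variable (1, 10) ties the a*b, a*c and a*d atoms to one common root. When a label occurs several times in a tree, this construction produces cross-product atoms, so a reduction step would be needed before the result can be read back as trees.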

4.2 Computation of maximal common trees

4.3 Missed frequent trees

5. Experimental results. Two goals: (i) investigate the computational cost of TreeFinder; (ii) estimate the percentage of frequent trees missed by TreeFinder.

5.1 Experimental setting

5.2 Empirical study of computational cost

5.3 Empirical study of TreeFinder loss factor. The loss factor is zero when the target frequent trees do not overlap. It is defined as Loss(·, ·) = 1 - TF(·, ·) / AT(·, ·), where TF(·, ·) denotes the (median) number of frequent trees actually produced by TreeFinder over 10 independent datasets, and AT(·, ·) is likewise the (median) lower bound on the number of frequent trees actually present in those datasets.
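
A small worked example of the loss factor, assuming TF and AT are each taken as a median over the 10 datasets. The counts below are made up purely to show the arithmetic and are not results from the presentation; `loss_factor` is a hypothetical helper name.

```python
from statistics import median

def loss_factor(found_counts, bound_counts):
    """Loss = 1 - TF / AT, with TF and AT taken as medians over datasets."""
    tf = median(found_counts)   # frequent trees produced by the miner
    at = median(bound_counts)   # lower bound on actual frequent trees
    return 1 - tf / at

# Made-up counts for 10 datasets (illustration only).
found = [14, 15, 15, 16, 15, 15, 14, 16, 15, 15]
bound = [20] * 10

print(loss_factor(found, bound))   # fraction of target trees missed
print(loss_factor(bound, bound))   # nothing missed
```

With these numbers the median of `found` is 15, so a quarter of the expected trees are missed; a loss of zero corresponds to the non-overlapping case described above.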

5.3 Empirical study of TreeFinder loss factor

6. Conclusions. TreeFinder achieves more flexible and robust tree mining and can detect trees that could not be discovered using strict subtree inclusion. Its main limitation is that it is an approximate miner: it can miss some frequent trees.

Personal opinion: (i) improve the computational cost of the LGG step; (ii) address the overlap problem.