
Applying Pruning Techniques to Single-Class Emerging Substring Mining
Speaker: Sarah Chan
Supervisor: Dr. B. C. M. Kao
M.Phil. Probation Talk
CSIS DB Seminar, Aug 30, 2002

Presentation Outline
• Introduction
• The single-class ES mining problem
• Data structure: merged suffix tree
• Algorithms: baseline, s-pruning, g-pruning, ℓ-pruning
• Performance evaluation
• Conclusions

Introduction
• Emerging Substrings (ESs)
  - A new type of KDD pattern
  - Substrings whose supports (or frequencies) increase significantly from one class to another (measured by a growth rate)
  - Motivation: Emerging Patterns (EPs) by Dong and Li
• Jumping Emerging Substrings (JESs), a specialization of ESs
  - Substrings which can be found in one class but not in the others

Introduction
• Emerging Substrings (ESs)
  - Usefulness
    - Capture sharp contrasts between datasets, or trends over time
    - Provide knowledge for building sequence classifiers
  - Applications (virtually endless)
    - Language identification, purchase behavior analysis, financial data analysis, bioinformatics, melody track selection, web-log mining, content-based processing systems, …

Introduction
• Mining ESs: brute-force approach
  - Enumerate all possible substrings in the database, find their support counts in each class, and check growth rates
  - But a huge sequence database contains millions of sequences (GenBank had 15 million sequences in 2001), and
  - The number of substrings in a sequence grows quadratically with sequence length (a typical human genome has 3 billion characters)
  - Too many candidates: expensive in time (O(|D|² n³)) and memory
  - Other shortcomings: repeated substrings, common substrings, … (please refer to [seminar020201])
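To make the cost concrete, here is a hypothetical Python sketch (not from the talk) of the brute-force enumeration; a length-n sequence contributes n(n+1)/2 substrings, so even the candidate set grows quadratically per sequence:

```python
def all_substrings(seq):
    """Every distinct substring of seq -- up to n*(n+1)/2 of them for length n."""
    return {seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}

def brute_force_counts(dataset):
    """Map each substring to the number of sequences in the dataset containing it."""
    counts = {}
    for seq in dataset:
        for s in all_substrings(seq):  # a set, so a sequence counts once per substring
            counts[s] = counts.get(s, 0) + 1
    return counts

print(len(all_substrings("abcd")))           # 10 distinct substrings of a 4-symbol sequence
print(brute_force_counts(["abcd", "bd"])["b"])  # 2: both sequences contain b
```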

Introduction
• Mining ESs: an Apriori-like approach
  - E.g. if both abcd and bcde are frequent in D, generate candidate abcde
  - Find frequent substrings and check growth rates
  - Still requires many database scans
  - A candidate may not be contained in any sequence in D
  - The Apriori property does not hold for ESs: abcde can be an ES even if neither abcd nor bcde is
• We need algorithms which are more efficient and which allow us to filter out ES candidates
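The last point can be checked on a contrived pair of datasets (hypothetical, not from the talk): with ρg = 3, abcde qualifies as an ES while neither abcd nor bcde does, so no Apriori-style pruning on ES-hood is possible:

```python
def supp(dataset, s):
    """Fraction of sequences in dataset that contain substring s."""
    return sum(s in seq for seq in dataset) / len(dataset)

def growth_rate(d1, d2, s):
    """Growth rate from d1 to d2, with the 0/infinity conventions used in the talk."""
    s1, s2 = supp(d1, s), supp(d2, s)
    if s1 == 0:
        return float("inf") if s2 > 0 else 0.0
    return s2 / s1

D2 = ["abcde", "abcde"]   # target class: abcde appears everywhere
D1 = ["abcdx", "xbcde"]   # opponent class: only the proper pieces appear
rho_g = 3
for s in ["abcd", "bcde", "abcde"]:
    print(s, growth_rate(D1, D2, s) >= rho_g)
# abcd and bcde have growth rate 2.0 (< 3); abcde has growth rate inf (>= 3)
```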

Introduction
• Mining ESs: our approach, a suffix tree-based framework
  - A compact way of storing all substrings, with support counters maintained
  - Deals with suffixes (not substrings) of sequences
  - Does not consider substrings that do not exist in the database
  - Time complexity: O(lg(|Σ|) · |D| · n²)
  - Techniques for pruning ES candidates can be easily applied

Basic Definitions
• Sequence: an ordered list of symbols over an alphabet Σ
• Class
  - In a sequence database, each sequence σi has a class label Ci ∈ C, the set of all class labels
  - A sequence σ that does not belong to class Ck belongs to Ck′
• Dataset
  - If database D is associated with m class labels, we can partition D into m datasets such that all sequences in dataset Di have class label Ci
  - D = Dk ∪ Dk′

Basic Definitions
• Count and support of string s in dataset D
  - count_D(s) = the number of sequences in D that contain s
  - supp_D(s) = count_D(s) / |D|
• Growth rate of string s from D1 to D2
  - growthRate_D1→D2(s) = supp_D2(s) / supp_D1(s)
  - growth rate = 0 if supp_D1(s) = supp_D2(s) = 0
  - growth rate = ∞ if supp_D1(s) = 0 and supp_D2(s) > 0

ES and JES
• Emerging Substring (ES)
  - Given a support threshold ρs and a growth rate threshold ρg, a string s is an ES from Dk′ to Dk (or an ES of Ck) if both conditions hold:
    - support condition: supp_Dk(s) ≥ ρs
    - growth rate condition: growthRate_Dk′→Dk(s) ≥ ρg
• Jumping Emerging Substring (JES)
  - An ES with infinite growth rate
  - JES of Ck: supp_Dk′(s) = 0 and supp_Dk(s) > 0

ES and JES
• Example
  Class C1: abcd, bd, a, c
  Class C2: abd, bc, cd, b
  With ρg = 1.5:
  - ESs from D2 to D1: a, abc, bcd, abcd
  - ESs from D1 to D2: b, abd

ES and JES (example, continued)
  - E.g. growthRate_D1→D2(b) = (3/4) / (2/4) = 1.5

ES and JES (example, continued)
  - The JESs, underlined on the original slide, are abc, bcd, abcd (of C1) and abd (of C2)
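The example can be verified mechanically. This brute-force sketch (hypothetical Python, not the talk's suffix tree algorithm) enumerates the substrings of the target class and applies both conditions, assuming a nominal support threshold ρs = 0.25 alongside the slide's ρg = 1.5:

```python
def supp(dataset, s):
    """Fraction of sequences in dataset that contain substring s."""
    return sum(s in seq for seq in dataset) / len(dataset)

def emerging_substrings(d_from, d_to, rho_s=0.25, rho_g=1.5):
    """All ESs from d_from to d_to: frequent in d_to and growth rate >= rho_g."""
    candidates = {seq[i:j] for seq in d_to
                  for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}
    result = set()
    for s in candidates:
        s_from, s_to = supp(d_from, s), supp(d_to, s)
        rate = float("inf") if s_from == 0 else s_to / s_from
        if s_to >= rho_s and rate >= rho_g:
            result.add(s)
    return result

D1 = ["abcd", "bd", "a", "c"]   # class C1
D2 = ["abd", "bc", "cd", "b"]   # class C2
print(sorted(emerging_substrings(D2, D1)))  # ['a', 'abc', 'abcd', 'bcd']
print(sorted(emerging_substrings(D1, D2)))  # ['abd', 'b']
```

The output matches the slide: a, abc, bcd, abcd are ESs of C1, and b, abd are ESs of C2.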

The ES Mining Problem
• The ES mining problem: given a database D, the set C of all class labels, a support threshold ρs and a growth rate threshold ρg, discover the set of all ESs of each class Cj ∈ C
• The single-class ES mining problem: a target class Ck is specified, and the goal is to discover the set of all ESs of Ck
  - Ck′: the opponent class

Merged Suffix Tree
• Suffix tree: represents all the substrings of a length-n sequence in O(n) space
• Merged suffix tree
  - Represents all the substrings of all sequences in a dataset Dk in O(|Dk| · n) space
  - Each node has a support counter for each dataset
  - Each node is associated with one substring and related to one or more substrings
  - Each edge is denoted by an index range [i_start, i_end)
    - E.g. if σ = abcd, then σ[1, 3) = ab

Merged Suffix Tree
• Example: (c1, c2) = (count in Ck, count in Ck′)
  Class Ck: abcd, bd, a, c
  Class Ck′: abd, bc, cd, b
  (figure: merged suffix tree over these sequences, with a node A highlighted)

Merged Suffix Tree (example, continued)
  - count_Dk(a) = 2, count_Dk′(a) = 1

Merged Suffix Tree (example, continued)
  - Node Y is associated with abcd (by concatenating edge labels) and related to abc and abcd (all share Y's counters)
  - An implicit node Z is associated with abc

Algorithms
• The baseline algorithm, consisting of 3 phases
• Three pruning techniques
  - Support threshold pruning (s-pruning algorithm)
  - Growth rate threshold pruning (g-pruning algorithm)
  - Length threshold pruning (ℓ-pruning algorithm)

Baseline Algorithm
1. Construction Phase (C-Phase)
  - A merged tree MT is built from all the sequences of the target class Ck: each suffix sj of each sequence is matched against the substrings in the tree
    - Update the c1 counter of the substrings contained in sj (but a sequence must not contribute twice to the same counter)
    - Make implicit nodes explicit when necessary
    - When a mismatch occurs, add a new edge and a new leaf to represent the unmatched part of sj

Baseline Algorithm
1. Construction Phase (C-Phase)
  - Example (figure: a merged tree being built from class Ck sequences; the animation shows the update of a c1 counter, the explicitization of an implicit node, and the update of edges)

Baseline Algorithm
1. Construction Phase (C-Phase)
  - Example (figure: inserting a suffix containing abe; the mismatch triggers the addition of a new edge and a new leaf node)

Baseline Algorithm
2. Update Phase (U-Phase)
  - MT is updated with all the sequences of the opponent class Ck′
    - Only update the c2 counters of substrings already present in the tree; never introduce a substring that is present only in Dk′
    - Only internal nodes may be added (no new leaf nodes)
  - Resultant tree: MT′

Baseline Algorithm
3. eXtraction Phase (X-Phase)
  - All ESs of Ck are extracted by a pre-order traversal of MT′
    - At each node X, check its counter values against ρs and ρg to determine whether its related substrings satisfy both the support and growth rate conditions
    - If the related substrings of a node X cannot fulfill the support condition, the whole subtree rooted at X can be skipped
• Baseline algorithm: C-U-X phases
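A minimal sketch of the three phases, assuming a plain suffix trie with per-class counters instead of the talk's merged suffix tree (so no index-range edges, implicit nodes or path compression), might look like:

```python
class Node:
    def __init__(self):
        self.children = {}
        self.c1 = self.c2 = 0    # support counts in Dk and Dk'
        self.last_seen = None    # guard: one contribution per sequence per counter

def add_suffixes(root, dataset, counter):
    """Insert every suffix of every sequence, bumping c1 or c2 once per sequence."""
    for seq_id, seq in enumerate(dataset):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                if counter == "c1":                      # C-Phase: grow the tree
                    node = node.children.setdefault(ch, Node())
                elif ch in node.children:                # U-Phase: follow only
                    node = node.children[ch]
                else:
                    break                                # never add Dk'-only substrings
                if node.last_seen != (counter, seq_id):
                    node.last_seen = (counter, seq_id)
                    setattr(node, counter, getattr(node, counter) + 1)

def extract(node, n1, n2, rho_s, rho_g, prefix="", out=None):
    """X-Phase: pre-order traversal; skip subtrees failing the support condition."""
    if out is None:
        out = []
    for ch, child in node.children.items():
        s = prefix + ch
        supp1 = child.c1 / n1
        if supp1 < rho_s:
            continue             # support is anti-monotone along trie paths: prune
        rate = float("inf") if child.c2 == 0 else supp1 / (child.c2 / n2)
        if rate >= rho_g:
            out.append(s)
        extract(child, n1, n2, rho_s, rho_g, s, out)
    return out

Dk  = ["abcd", "bd", "a", "c"]   # target class
Dk_ = ["abd", "bc", "cd", "b"]   # opponent class
root = Node()
add_suffixes(root, Dk, "c1")     # C-Phase
add_suffixes(root, Dk_, "c2")    # U-Phase
print(sorted(extract(root, len(Dk), len(Dk_), 0.25, 1.5)))
# ['a', 'abc', 'abcd', 'bcd']
```

This reproduces the earlier example's ESs of C1 with ρs = 0.25 and ρg = 1.5 (both thresholds assumed for illustration).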

s-Pruning Algorithm
• Observations
  - The c2 counter of a substring σ in MT is updated in the U-Phase if σ is contained in some sequence in Dk′
  - But if σ is infrequent with respect to Dk, it is not qualified to be an ES of Ck, and its descendant nodes will not even be visited in the X-Phase
• Pruning idea: prune infrequent substrings from MT right after the C-Phase

s -Pruning Algorithm   s -Pruning Phase (P s -Phase) With the use of  s, all substrings being infrequent in D k are pruned by a pre-order traversal on MT Resultant tree: MT s (input to the U-Phase)  s-pruning algorithm: C-P s -U-X phases

g-Pruning Algorithm
• Observations
  - As sequences in Dk′ are added to MT, the c2 counters of some nodes grow
    - The support of these nodes' related substrings in Dk′ is monotonically increasing
    - So the ratio of the support of these substrings in Dk to that in Dk′ is monotonically decreasing
  - At some point this ratio may drop below ρg; when that happens, these substrings have lost their candidature for being ESs of Ck

g-Pruning Algorithm
• Pruning idea: prune substrings from MT as soon as they are found to fail the growth rate requirement
• ρg-Update Phase (Ug-Phase)
  - When the support count of a substring in Dk′ increases, check whether it can still satisfy the growth rate condition; if not, prune the substring by path compression or node deletion
  - Supported by the [i_start, i_q, i_end) representation of edges
• g-pruning algorithm: C-Ug-X phases
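The same flat-map simplification can sketch the Ug-Phase idea: because c2 only grows, the support ratio only shrinks, so a candidate can be discarded permanently the first time it fails the growth rate test (the real algorithm does this on the tree via path compression or node deletion):

```python
def g_prune_update(c1, n1, n2, opponent, rho_g):
    """Ug-Phase sketch: stream opponent sequences, bump c2 counts, and delete a
    candidate the moment its growth rate falls below rho_g; since c2 only grows,
    the ratio supp_Dk / supp_Dk' only shrinks, so early pruning is safe."""
    c2 = {s: 0 for s in c1}
    for seq in opponent:
        for s in list(c2):           # copy keys so we can delete while iterating
            if s in seq:
                c2[s] += 1
                if (c1[s] / n1) / (c2[s] / n2) < rho_g:
                    del c2[s]        # lost ES candidature for good
    return set(c2)

Dk  = ["abcd", "bd", "a", "c"]
Dk_ = ["abd", "bc", "cd", "b"]
c1 = {"a": 2, "b": 2, "abc": 1}      # a few sample candidates with their Dk counts
survivors = g_prune_update(c1, len(Dk), len(Dk_), Dk_, rho_g=1.5)
print(sorted(survivors))  # ['a', 'abc'] -- b is deleted after its second c2 hit
```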

ℓ-Pruning Algorithm
• Observations
  - Longer substrings often have lower support than shorter ones, so they are less likely to fulfill the support condition for ESs
  - It is not desirable to append these longer substrings to the tree in the C-Phase only to prune them in the Ps-Phase (in the s-pruning algorithm)
• Pruning idea: limit the length of the substrings added to MT in the tree construction phase

ℓ-Pruning Algorithm
• ρℓ-Construction Phase (Cℓ-Phase)
  - Only match min(|sj|, ρℓ) symbols of each suffix sj against the tree (ignore the remainder), so a smaller MT is built
  - Unlike the previous two pruning approaches, this may result in ES loss
• ℓ-pruning algorithm: Cℓ-U-X phases
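On the flat-map simplification, the Cℓ-Phase amounts to never recording substrings longer than the length threshold; the sketch below (hypothetical, not the talk's implementation) also shows how ES loss can arise:

```python
def build_counts_l(dataset, ell):
    """C_l-Phase sketch: count only substrings of length <= ell; equivalent to
    matching just the first min(|suffix|, ell) symbols of each suffix."""
    counts = {}
    for seq in dataset:
        subs = {seq[i:i + k] for i in range(len(seq))
                for k in range(1, min(ell, len(seq) - i) + 1)}
        for s in subs:
            counts[s] = counts.get(s, 0) + 1
    return counts

Dk = ["abcd", "bd", "a", "c"]
full  = build_counts_l(Dk, ell=4)    # no pruning: all 11 distinct substrings
small = build_counts_l(Dk, ell=2)    # only the 8 substrings of length <= 2
print(len(full), len(small))
# 11 8 -- the length-3+ substrings (including ESs such as abc, bcd, abcd) are lost
```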

Summary of Phases
• Baseline: C-U-X
• s-pruning: C-Ps-U-X (earlier use of ρs)
• g-pruning: C-Ug-X (earlier use of ρg)
• ℓ-pruning: Cℓ-U-X (addition of ρℓ)
• The pruning techniques can also be used in combination

Performance Evaluation
• Dataset: CI3 (music features in MIDI tracks)

  Class      | No. of sequences | Avg./max. sequence length | No. of distinct symbols
  melody     | 843 (11%)        | 331.0 / 1085              | …
  non-melody | 6742 (89%)       | 274.9 / …                 | …

• Goal: to extract ESs of the target class melody (opponent class: non-melody)
• Assumptions: all sequences are pre-stored in memory (appended to a vector, with the starting and ending positions of each sequence recorded)

Number of ESs Mined
(table: for support thresholds ρs from 0.25% up, the minimum number of occurrences, the numbers of non-jumping ESs at ρg = 2 and ρg = 5, and the number of JESs at ρg = ∞; the numeric entries were lost in transcription)

Take a Look at the Tree Size
• When ρs = 0.50%, ρg = 2:

  Algorithm  | |MT|    | |MTs|           | |MT′|
  baseline   | 416,151 | —               | 542,094
  s-pruning  | 416,151 | 22,582 (−94.6%) | 22,961 (−95.8%)
  g-pruning  | 416,151 | —               | 510,764 (−5.8%)
  sg-pruning | 416,151 | 22,582 (−94.6%) | 18,413 (−96.6%)

Baseline Algorithm [C-U-X]
• Performance: the same for all ρs and ρg
• Time: about 35 s

s-Pruning Algorithm [C-Ps-U-X]
• Faster than the baseline algorithm by 25–45%
• But the reduction in time is smaller than the reduction in tree size
• Performance: improves as ρs increases; the same for all ρg

g-Pruning Algorithm [C-Ug-X]
• When ρg = ∞, faster than the baseline algorithm by 2–5%
• When ρg = 2 or 5, slower than the baseline algorithm by 1–4%
• Performance: improves as ρg increases; the same for all ρs

sg-Pruning Algorithm [C-Ps-Ug-X]
• Faster than the baseline, s-pruning and g-pruning algorithms in all cases
• Faster than the baseline algorithm by 31–54% (ρg = 2 or 5) and 47–81% (ρg = ∞)
• Performance: improves as ρs and ρg increase

Target Class: Melody (ρg = 2)
• Performance of the algorithms, fastest first: sg-pruning > s-pruning > baseline > g-pruning

What If Target Class = Non-Melody? (ρg = 2)
• Performance of the algorithms, fastest first: s-pruning > sg-pruning > baseline > g-pruning

What If Target Class = Non-Melody?
• sg-pruning performs worse than s-pruning
  - Due to the overhead in node creation (g-pruning requires one more index per edge)
• Not much overall performance gain with s-pruning (just 3–5%) or sg-pruning (1–3%)
  - Bottleneck: the formation of MT (over 93% of the time is spent in the C-Phase)
  - The pruning techniques themselves are still very effective: 42–80% (s-pruning) and 54–85% (sg-pruning) of the U-Phase time is saved

ℓ-Pruning Algorithm – % Loss of ESs
• Except when ρs = 0.25%, non-jumping ESs are lost only when ρℓ < 20 (JESs only when ρℓ < 15)
• (figure: % ES loss against ρℓ for various ρs and ρg; avg. sequence length = 331, max. sequence length = 1085)

ℓ-Pruning Algorithm – % Time Saved
• The time saved becomes significant when ρℓ < 100
• For ρs ≥ 0.50%, over 30% of the time can be saved without any ES loss
• (figure: % time saved against ρℓ for various ρs and ρg; avg. sequence length = 331, max. sequence length = 1085)

To Be Explored...
• ℓs-pruning
• ℓg-pruning
• ℓsg-pruning

Conclusions
• ESs of a class are substrings which occur more frequently in that class than in the other classes.
• ESs are useful features, as they capture the distinguishing characteristics of data classes.
• We have proposed a suffix tree-based framework for mining ESs.

Conclusions
• Three basic techniques for pruning ES candidates have been described, and most of them have proven effective.
• Future work: to study whether pruning techniques can be efficiently applied to suffix tree merging algorithms or to other ES mining models.

Applying Pruning Techniques to Single-Class Emerging Substring Mining
- The End -