1
Applying Pruning Techniques to Single-Class Emerging Substring Mining. Speaker: Sarah Chan. Supervisor: Dr. B. C. M. Kao. M.Phil. Probation Talk, CSIS DB Seminar, Aug 30, 2002
2
Presentation Outline: Introduction; the single-class ES mining problem; data structure: merged suffix tree; algorithms: baseline, s-pruning, g-pruning, l-pruning; performance evaluation; conclusions
3
Introduction: Emerging Substrings (ESs). A new type of KDD pattern: substrings whose supports (or frequencies) increase significantly from one class to another, as measured by a growth rate. Motivation: Emerging Patterns (EPs) by Dong and Li. Jumping Emerging Substrings (JESs) are a specialization of ESs: substrings which can be found in only one class and not in the others.
4
Introduction: Emerging Substrings (ESs). Usefulness: they capture sharp contrasts between datasets, or trends over time, and provide knowledge for building sequence classifiers. Applications (virtually endless): language identification, purchase behavior analysis, financial data analysis, bioinformatics, melody track selection, web-log mining, content-based e-mail processing systems, ...
5
Introduction: Mining ESs. Brute-force approach: enumerate all possible substrings in the database, find their support counts in each class, and check the growth rate. But a huge sequence database contains millions of sequences (GenBank had 15 million sequences in 2001), and the number of substrings in a sequence grows quadratically with sequence length (a typical human genome has 3 billion characters). Too many candidates; expensive in terms of time, O(|D|^2 n^3), and memory. Other shortcomings: repeated substrings, common substrings, ... (please refer to [seminar020201]).
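To make the cost concrete, here is a minimal sketch of the brute-force approach for a single target class, assuming the database is held in memory as plain Python strings with one list per class; the names (brute_force_es, rho_s, rho_g) are illustrative and not from the talk.

```python
def brute_force_es(target_seqs, opponent_seqs, rho_s, rho_g):
    """Naive ES mining for one target class: enumerate every substring of
    every target-class sequence (O(n^2) candidates per sequence), then scan
    the whole database once per candidate."""
    candidates = set()
    for seq in target_seqs:
        for i in range(len(seq)):
            for j in range(i + 1, len(seq) + 1):
                candidates.add(seq[i:j])

    def supp(s, seqs):
        return sum(s in seq for seq in seqs) / len(seqs)

    result = []
    for s in candidates:
        s_target = supp(s, target_seqs)
        s_opp = supp(s, opponent_seqs)
        # growth rate is infinite when s_opp == 0 (and s_target > 0 by construction)
        if s_target >= rho_s and (s_opp == 0 or s_target / s_opp >= rho_g):
            result.append(s)
    return result
```

Even this single-class version touches every sequence for every candidate, which is exactly what the suffix tree-based framework introduced later avoids.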
6
Introduction: Mining ESs. An Apriori-like approach: e.g. if both abcd and bcde are frequent in D, generate candidate abcde; find frequent substrings and check the growth rate. This still requires many database scans, a candidate may not be contained in any sequence in D, and the Apriori property does not hold for ESs: abcde can be an ES even if neither abcd nor bcde is. We need algorithms which are more efficient and which allow us to filter out ES candidates.
7
Introduction: Mining ESs. Our approach: a suffix tree-based framework. It stores all substrings compactly, with support counters maintained; it deals with suffixes (not all substrings) of sequences and does not consider substrings that do not exist in the database. Time complexity: O(lg(|Σ|) |D| n^2), where Σ is the alphabet. Techniques for pruning ES candidates can be easily applied.
8
Basic Definitions. Sequence: an ordered set of symbols over an alphabet. Class: in a sequence database, each sequence i has a class label C_i; C = the set of all class labels; a sequence that does not belong to C_k belongs to C_k'. Dataset: if database D is associated with m class labels, we can partition D into m datasets such that all sequences in dataset D_i have class label C_i; D_k' = D \ D_k is the opponent dataset of D_k.
9
Basic Definitions. Count and support of string s in dataset D: count_D(s) = no. of sequences in D that contain s; supp_D(s) = count_D(s) / |D|. Growth rate of string s from D1 to D2: growthRate_D1→D2(s) = supp_D2(s) / supp_D1(s); the growth rate is 0 if supp_D1(s) = supp_D2(s) = 0, and ∞ if supp_D1(s) = 0 and supp_D2(s) > 0.
10
ES and JES. Emerging Substring (ES): given a support threshold s and a growth rate threshold g, a string x is an ES from D_k' to D_k (i.e. x is an ES of C_k) if both conditions hold: support condition: supp_Dk(x) ≥ s; growth rate condition: growthRate_Dk'→Dk(x) ≥ g. Jumping Emerging Substring (JES): an ES with infinite growth rate; x is a JES of C_k if supp_Dk'(x) = 0 and supp_Dk(x) > 0.
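These definitions translate directly into code. The sketch below is self-contained and uses the example datasets from the next slide; rho_s is just an illustrative name for the support threshold (the talk calls it s), and the value 0.25 is chosen only so the checks run.

```python
def supp(s, dataset):
    """Fraction of sequences in dataset that contain the substring s."""
    return sum(s in seq for seq in dataset) / len(dataset)

def growth_rate(s, d_from, d_to):
    """growthRate from d_from to d_to, with the 0 and infinity corner cases."""
    s_from, s_to = supp(s, d_from), supp(s, d_to)
    if s_from == 0:
        return 0.0 if s_to == 0 else float("inf")
    return s_to / s_from

def is_es(s, d_opp, d_target, rho_s, g):
    """ES of the target class: support condition and growth rate condition."""
    return supp(s, d_target) >= rho_s and growth_rate(s, d_opp, d_target) >= g

def is_jes(s, d_opp, d_target):
    """JES of the target class: absent from the opponent, present in the target."""
    return supp(s, d_opp) == 0 and supp(s, d_target) > 0

# Example datasets from the next slide.
D1 = ["abcd", "bd", "a", "c"]   # class C1
D2 = ["abd", "bc", "cd", "b"]   # class C2
assert growth_rate("b", D1, D2) == 1.5          # b is an ES of C2 when g = 1.5
assert is_es("abc", D2, D1, rho_s=0.25, g=1.5)  # abc is a (jumping) ES of C1
assert is_jes("abd", D1, D2)                    # abd occurs only in class C2
```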
11
ES and JES Example.
Class C1: abcd, bd, a, c
Class C2: abd, bc, cd, b
With g = 1.5: ESs from D2 to D1: a, abc, bcd, abcd. ESs from D1 to D2: b, abd.
12
ES and JES Example.
Class C1: abcd, bd, a, c
Class C2: abd, bc, cd, b
With g = 1.5: ESs from D2 to D1: a, abc, bcd, abcd. ESs from D1 to D2: b, abd.
growthRate_D1→D2(b) = (3/4) / (2/4) = 1.5
13
ES and JES Example.
Class C1: abcd, bd, a, c
Class C2: abd, bc, cd, b
With g = 1.5: ESs from D2 to D1: a, abc, bcd, abcd. ESs from D1 to D2: b, abd.
JESs: abc, bcd, abcd and abd (a and b are non-jumping ESs).
14
The ES Mining Problem: given a database D, the set C of all class labels, a support threshold s and a growth rate threshold g, discover the set of all ESs for each class C_j ∈ C. The single-class ES mining problem: a target class C_k is specified, and our goal is to discover the set of all ESs of C_k. C_k' denotes the opponent class.
15
Merged Suffix Tree. Suffix tree: represents all the substrings of a length-n sequence in O(n) space. Merged suffix tree: represents all the substrings of all sequences in a dataset D_k in O(|D_k| n) space. Each node has a support counter for each dataset. Each node is associated with one substring and related to one or more substrings. Each edge is denoted by an index range [i_start, i_end); e.g. if the indexed string is abcd, then [1, 3) = ab.
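The sketches in the following slides use the simplified node type below: an uncompressed trie over suffixes rather than a true merged suffix tree, so there are no [i_start, i_end) edge ranges and each node stands for exactly one substring. It keeps only the bookkeeping that matters here: one support counter per dataset, plus a marker to avoid counting the same sequence twice. All names are illustrative.

```python
class Node:
    """Simplified merged-tree node: one counter per dataset plus children.
    (The real structure compresses edges into index ranges and lets one
    node be related to several substrings.)"""
    def __init__(self):
        self.children = {}     # symbol -> Node
        self.c1 = 0            # no. of target-class sequences (D_k) containing this substring
        self.c2 = 0            # no. of opponent-class sequences (D_k') containing it
        self.last_seen = None  # tag of the last sequence that touched this node
```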
16
Merged Suffix Tree Example. (c1, c2) = (count in C_k, count in C_k').
Class C_k: abcd, bd, a, c
Class C_k': abd, bc, cd, b
[Tree diagram omitted.]
17
Merged Suffix Tree Example. (c1, c2) = (count in C_k, count in C_k').
Class C_k: abcd, bd, a, c
Class C_k': abd, bc, cd, b
count_Dk(a) = 2, count_Dk'(a) = 1
[Tree diagram omitted.]
18
Merged Suffix Tree Example. Node Y is associated with abcd (the concatenation of the edge labels on its path) and related to abc and abcd (all sharing Y's counters). An implicit node Z is associated with abc. [Tree diagram omitted; same example datasets as above.]
19
Algorithms. The baseline algorithm consists of 3 phases. Three pruning techniques: support threshold pruning (s-pruning algorithm), growth rate threshold pruning (g-pruning algorithm), and length threshold pruning (l-pruning algorithm).
20
Baseline Algorithm. 1. Construction Phase (C-Phase): a merged tree MT is built from all the sequences of the target class C_k; each suffix s_j of each sequence is matched against the substrings in the tree. The c1 counter is updated for substrings contained in s_j (but a sequence should not contribute twice to the same counter). Implicit nodes are explicitized when necessary. When a mismatch occurs, a new edge and a new leaf are added to represent the unmatched part of s_j.
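A sketch of the C-Phase on the simplified trie defined earlier (single-symbol edges, so there is no edge splitting or explicitization here); the one-increment-per-sequence rule is enforced with the last_seen tag. It reuses the Node class from the merged suffix tree sketch above.

```python
def c_phase(target_seqs):
    """Construction Phase: insert every suffix of every sequence of the
    target class C_k, updating c1 at most once per sequence per node."""
    root = Node()
    for i, seq in enumerate(target_seqs):
        for start in range(len(seq)):          # suffix s_j = seq[start:]
            node = root
            for symbol in seq[start:]:
                node = node.children.setdefault(symbol, Node())  # new edge/leaf on mismatch
                if node.last_seen != ("Dk", i):                  # count each sequence once
                    node.last_seen = ("Dk", i)
                    node.c1 += 1
    return root
```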
21
Baseline Algorithm. 1. Construction Phase (C-Phase): Example. [Tree diagrams omitted: they show the merged tree for a small class-C_k example being built, illustrating the update of the c1 counter, the explicitization of an implicit node, and the update of edges.]
22
Baseline Algorithm. 1. Construction Phase (C-Phase): Example, continued. [Tree diagram omitted: it illustrates the addition of a new edge and a new leaf node when a mismatch occurs.]
23
Baseline Algorithm. 2. Update Phase (U-Phase): MT is updated with all the sequences of the opponent class C_k'. Only the c2 counters of substrings already present in the tree are updated; no substring that appears only in D_k' is introduced. Only internal nodes may be added (no new leaf nodes). Resultant tree: MT'.
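The corresponding U-Phase sketch, again on the simplified trie: opponent-class suffixes are matched against existing paths only, so nothing new is ever added (in the real compressed tree, implicit nodes may still be made explicit, which is why the slide allows new internal nodes).

```python
def u_phase(root, opponent_seqs):
    """Update Phase: walk existing paths only, updating c2; substrings
    that occur only in D_k' are never introduced into the tree."""
    for j, seq in enumerate(opponent_seqs):
        for start in range(len(seq)):
            node = root
            for symbol in seq[start:]:
                node = node.children.get(symbol)
                if node is None:
                    break                      # substring absent from D_k: ignore the rest
                if node.last_seen != ("Dk2", j):
                    node.last_seen = ("Dk2", j)
                    node.c2 += 1
    return root                                # this is MT'
```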
24
Baseline Algorithm. 3. eXtraction Phase (X-Phase): all ESs of C_k are extracted by a pre-order traversal of MT'. At each node X, its counters are checked against s and g to determine whether its related substrings satisfy both the support and growth rate conditions. If the related substrings of a node X cannot fulfill the support condition, the subtree rooted at X can be ignored. Baseline algorithm: C-U-X phases.
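And an X-Phase sketch: a pre-order traversal that reports a node's substring when both conditions hold, and skips a whole subtree as soon as the support condition fails, since support can only drop further down the tree.

```python
def x_phase(root, n_k, n_k2, s_thresh, g_thresh):
    """eXtraction Phase: collect all ESs of C_k from the updated tree."""
    results = []

    def visit(node, substring):
        supp_k = node.c1 / n_k
        if supp_k < s_thresh:
            return                             # ignore the subtree rooted here
        supp_k2 = node.c2 / n_k2
        if supp_k2 == 0 or supp_k / supp_k2 >= g_thresh:
            results.append(substring)          # an ES (a JES when supp_k2 == 0)
        for symbol, child in node.children.items():
            visit(child, substring + symbol)

    for symbol, child in root.children.items():
        visit(child, symbol)
    return results
```

Chaining the three sketches, x_phase(u_phase(c_phase(D_k), D_k'), |D_k|, |D_k'|, s, g), corresponds to the C-U-X baseline.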
25
s-Pruning Algorithm. Observations: the c2 counter of a substring in MT is updated in the U-Phase only if the substring is contained in some sequence in D_k'. If a substring is infrequent with respect to D_k, it cannot qualify as an ES of C_k, and its descendant nodes will not even be visited in the X-Phase. Pruning idea: prune infrequent substrings from MT after the C-Phase.
26
s-Pruning Algorithm. s-Pruning Phase (P_s-Phase): using s, all substrings infrequent in D_k are pruned from MT by a pre-order traversal. Resultant tree: MT_s (input to the U-Phase). s-pruning algorithm: C-P_s-U-X phases.
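A P_s-Phase sketch on the same simplified tree: because support is anti-monotone under substring extension, an infrequent node's whole subtree can be cut before the U-Phase ever sees it.

```python
def ps_phase(root, n_k, s_thresh):
    """s-Pruning Phase: drop every subtree whose root substring is
    infrequent in D_k; the result (MT_s) is fed to the U-Phase."""
    min_count = s_thresh * n_k                 # supp >= s  <=>  c1 >= s * |D_k|
    def prune(node):
        node.children = {sym: child for sym, child in node.children.items()
                         if child.c1 >= min_count}
        for child in node.children.values():
            prune(child)
    prune(root)
    return root
```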
27
g-Pruning Algorithm. Observations: as sequences in D_k' are added to MT, the c2 counters of some nodes grow; the support of these nodes' related substrings in D_k' is monotonically increasing, so the ratio of their support in D_k to that in D_k' is monotonically decreasing. At some point this ratio may fall below g; when that happens, these substrings have lost their candidature for being ESs of C_k.
28
g-Pruning Algorithm. Pruning idea: prune substrings from MT as soon as they are found to fail the growth rate requirement. g-Update Phase (U_g-Phase): whenever the support count of a substring in D_k' increases, check whether it can still satisfy the growth rate condition; if not, prune the substring by path compression or node deletion (supported by the [i_start, i_q, i_end) representation of edges). g-pruning algorithm: C-U_g-X phases.
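A simplified stand-in for the U_g-Phase: the real algorithm physically removes a disqualified substring by node deletion or path compression (hence the extra i_q index per edge); on the uncompressed trie used in these sketches it is enough to remember which nodes have permanently lost their candidature, since c1 is fixed after the C-Phase and c2 can only grow.

```python
def ug_phase(root, opponent_seqs, n_k, n_k2, g_thresh):
    """g-Update Phase: update c2 as in the U-Phase, and disqualify a node
    as soon as its growth rate can no longer reach g_thresh."""
    disqualified = set()
    for j, seq in enumerate(opponent_seqs):
        for start in range(len(seq)):
            node = root
            for symbol in seq[start:]:
                node = node.children.get(symbol)
                if node is None:
                    break
                if node.last_seen != ("Dk2", j):
                    node.last_seen = ("Dk2", j)
                    node.c2 += 1
                    if (node.c1 / n_k) / (node.c2 / n_k2) < g_thresh:
                        disqualified.add(id(node))   # can never satisfy the growth rate condition
    return root, disqualified
```

The X-Phase then skips reporting at disqualified nodes but still visits their children, because failing the growth rate condition says nothing about longer substrings.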
29
l-Pruning Algorithm. Observations: longer substrings often have lower support than shorter ones, so they are less likely to fulfill the support condition for ESs. It is undesirable to append these longer substrings to the tree in the C-Phase only to prune them in the P_s-Phase (in the s-pruning algorithm). Pruning idea: limit the length of the substrings added to MT in the tree construction phase.
30
l-Pruning Algorithm. l-Construction Phase (C_l-Phase): only the first min(|s_j|, l) symbols of each suffix are matched against the tree (the remainder is ignored), so a smaller MT is built. Unlike the previous two pruning approaches, this may result in ES loss. l-pruning algorithm: C_l-U-X phases.
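A C_l-Phase sketch: identical to the C-Phase sketch above except that each suffix is truncated to its first min(|s_j|, l) symbols, so substrings longer than l never enter the tree (max_len plays the role of the length threshold l).

```python
def cl_phase(target_seqs, max_len):
    """l-Construction Phase: build the tree from length-limited suffixes."""
    root = Node()
    for i, seq in enumerate(target_seqs):
        for start in range(len(seq)):
            node = root
            for symbol in seq[start:start + max_len]:   # only min(|s_j|, l) symbols
                node = node.children.setdefault(symbol, Node())
                if node.last_seen != ("Dk", i):
                    node.last_seen = ("Dk", i)
                    node.c1 += 1
    return root
```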
31
Summary of Phases. Baseline: C-U-X. s-pruning: C-P_s-U-X (earlier use of s). g-pruning: C-U_g-X (earlier use of g). l-pruning: C_l-U-X (addition of l). The pruning techniques can also be used in combination (e.g. sg-pruning).
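Putting the earlier sketches together, the phase chains can be wired as follows; the data and thresholds are purely illustrative.

```python
D_k  = ["abcd", "bd", "a", "c"]    # target class (example data)
D_k2 = ["abd", "bc", "cd", "b"]    # opponent class
n_k, n_k2, s_thr, g_thr, max_len = len(D_k), len(D_k2), 0.25, 1.5, 3

# Baseline: C-U-X
baseline_ess = x_phase(u_phase(c_phase(D_k), D_k2), n_k, n_k2, s_thr, g_thr)

# s-pruning: C-Ps-U-X
s_pruned_ess = x_phase(u_phase(ps_phase(c_phase(D_k), n_k, s_thr), D_k2),
                       n_k, n_k2, s_thr, g_thr)

# g-pruning: C-Ug-X (the extraction step would also consult the disqualified set)
g_tree, disq = ug_phase(c_phase(D_k), D_k2, n_k, n_k2, g_thr)

# l-pruning: Cl-U-X
l_pruned_ess = x_phase(u_phase(cl_phase(D_k, max_len), D_k2), n_k, n_k2, s_thr, g_thr)
```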
32
Performance Evaluation. Dataset: CI3 (music features in MIDI tracks).
Class | No. of sequences | Avg./max. sequence length | No. of distinct symbols
melody | 843 (11%) | 331.0 / 1085 | 29
non-melody | 6742 (89%) | 274.9 / 2891 | 61
Goal: to extract ESs from the target class melody (opponent class: non-melody). Assumptions: all sequences are pre-stored in memory (appended to a vector, with the starting and ending positions of each sequence recorded).
33
Number of ESs Mined. [Table: for support thresholds s = 0.25%, 0.50%, 1.00% and 2.00% (minimum 3, 5, 9 and 17 occurrences respectively), the number of non-jumping ESs mined at g = 2 and at g = 5, and the number of JESs.]
34
Take a look at the tree size (when s = 0.50%, g = 2):
Algorithm | |MT| | |MT_s| | |MT'|
baseline | 416,151 | - | 542,094
s-pruning | 416,151 | 22,582 (-94.6%) | 22,961 (-95.8%)
g-pruning | 416,151 | - | 510,764 (-5.8%)
sg-pruning | 416,151 | 22,582 (-94.6%) | 18,413 (-96.6%)
35
Baseline Algorithm [C-U-X]. Performance is the same for all s and g. Time: about 35s.
36
s-Pruning Algorithm [C-P_s-U-X]. Faster than the baseline algorithm by 25-45%, but the reduction in time is smaller than the reduction in tree size. Performance improves with increasing s; same for all g.
37
g-Pruning Algorithm [C-U_g-X]. When g = ∞, faster than the baseline algorithm by 2-5%. When g = 2 or 5, slower than the baseline algorithm by 1-4%. Performance improves with increasing g; same for all s.
38
sg-Pruning Algorithm [C-P_s-U_g-X]. Faster than the baseline, s-pruning and g-pruning algorithms in all cases. Faster than the baseline algorithm by 31-54% (g = 2 or 5) and by 47-81% (g = ∞). Performance improves with increasing s and g.
39
Target Class: Melody (g = 2). Performance of the algorithms (fastest first): sg-pruning > s-pruning > baseline > g-pruning.
40
What If Target Class = Non-Melody? (g = 2). Performance of the algorithms (fastest first): s-pruning > sg-pruning > baseline > g-pruning.
41
What If Target Class = Non-Melody? sg-pruning performs worse than s-pruning due to the overhead in node creation (g-pruning requires one more index for each edge). There is not much overall performance gain with s-pruning (just 3-5%) or sg-pruning (1-3%); the bottleneck is the formation of MT (over 93% of the time is spent in the C-Phase). In fact, these pruning techniques are very effective, since much time is saved in the U-Phase: 42-80% (for s-pruning) and 54-85% (for sg-pruning).
42
l-Pruning Algorithm – % Loss of ESs. Except when s = 0.25%, non-jumping ESs are lost only when l < 20 (l < 15 in the case of JESs). [Chart omitted: % loss of ESs against the length threshold l for various (s, g) settings; avg. sequence length = 331, max. sequence length = 1085.]
43
l-Pruning Algorithm – % Time Saved. The time saved becomes noticeable when l < 100. For s ≥ 0.50%, over 30% of the time can be saved without any ES loss. [Chart omitted: % time saved against the length threshold l for various (s, g) settings; avg. sequence length = 331, max. sequence length = 1085.]
44
To be Explored... ls-pruning, lg-pruning and lsg-pruning (combining the length threshold l with the other pruning techniques).
45
Conclusions. ESs of a class are substrings which occur more frequently in that class than in other classes. ESs are useful features, as they capture distinguishing characteristics of data classes. We have proposed a suffix tree-based framework for mining ESs.
46
Conclusions. Three basic techniques for pruning ES candidates have been described, and most of them have been shown to be effective. Future work: to study whether the pruning techniques can be efficiently applied to suffix tree merging algorithms or to other ES mining models.
47
Applying Pruning Techniques to Single-Class Emerging Substring Mining - The End -