1
Advanced Pattern Mining 02
COP 6726: New Directions in Database Systems Advanced Pattern Mining 02
2
Formal Definition of a Subsequence
A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, …, an ⊆ b_in.
The support of a subsequence w is defined as the fraction of data sequences that contain w.
A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).
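The containment test and the support definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `contains` and `support` and the toy database `db` are assumptions made for the example, and each sequence element is modeled as a Python set.

```python
def contains(data_seq, pattern):
    """Return True if `pattern` is contained in `data_seq`.

    Both arguments are lists of sets (one set per sequence element).
    """
    pos = 0
    for element in pattern:
        # Greedily advance to the earliest element of data_seq that is a superset.
        while pos < len(data_seq) and not element <= data_seq[pos]:
            pos += 1
        if pos == len(data_seq):
            return False
        pos += 1  # the next pattern element must match strictly later
    return True


def support(database, pattern):
    """Fraction of data sequences in `database` that contain `pattern`."""
    return sum(contains(s, pattern) for s in database) / len(database)


# Hypothetical toy database, not taken from the slides.
db = [
    [{1, 2}, {3}, {4, 5}],
    [{1}, {2, 3}],
    [{2}, {3, 4}, {5}],
]
pattern = [{2}, {3}]
print(support(db, pattern))  # 2 of the 3 sequences contain <{2} {3}>
```

Matching each pattern element at the earliest possible position is enough here, because an earlier match never rules out a later one.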
3
Sequential Pattern Mining: Definition
Given: a database of sequences and a user-specified minimum support threshold, minsup.
Task: Find all subsequences with support ≥ minsup.
4
Sub-sequence Mining
5
Sequence Mining
Let ∑ denote an alphabet, defined as a finite set of characters or symbols. A sequence (or a string) is defined as an ordered list of symbols, i.e., s = s1s2…sk, where si ∈ ∑ is the symbol at position i.
Let s = s1s2…sn and r = r1r2…rm be two sequences over ∑. r is a subsequence of s, denoted r ⊆ s, if there exists a one-to-one mapping ϕ : [1, m] → [1, n] such that r[i] = s[ϕ(i)] and, for any two positions i, j in r, i < j ⇒ ϕ(i) < ϕ(j).
r is a consecutive subsequence (or substring) of s if r[1 : m] = s[j : j+m−1], with 1 ≤ j ≤ n−m+1.
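The two definitions differ only in whether gaps are allowed, which a short Python sketch makes concrete. The function names are assumptions chosen for this example; the strings come from the running example used in the slides.

```python
def is_subsequence(r, s):
    """True if r can be obtained from s by deleting symbols (gaps allowed)."""
    it = iter(s)
    # `ch in it` consumes the iterator up to the first match, so the
    # matched positions are strictly increasing, as the mapping requires.
    return all(ch in it for ch in r)


def is_substring(r, s):
    """True if r occurs in s as a consecutive block (no gaps)."""
    m, n = len(r), len(s)
    return any(s[j:j + m] == r for j in range(n - m + 1))


print(is_subsequence("GAAG", "TGACAG"))  # True: G, A, A, G at positions 2, 3, 5, 6
print(is_substring("GAAG", "TGACAG"))    # False: the C breaks the block
```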
6
Sequence Mining
Given sequence data, find the frequent subsequences that satisfy the minimum support constraint.
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
Minimum Support = 3
Frequent subsequences: A, AA, AG, AAG, G, GA, GG, GAA, GAG, GAAG, T
7
Generalized Sequence Pattern (GSP) Mining
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
Minimum Support = 3
Symbol  Support
A       3
C       2
G       3
T       3
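A level-wise search in the spirit of GSP can be sketched as follows. This is a simplified illustration under stated assumptions: candidates are generated by appending one frequent symbol to each frequent pattern rather than by GSP's join of two frequent (k-1)-sequences, support is an absolute count as in the example, and the names `levelwise_mine` and `support` are hypothetical.

```python
def is_subsequence(r, s):
    """True if r is a (gapped) subsequence of s."""
    it = iter(s)
    return all(ch in it for ch in r)


def support(db, r):
    """Number of sequences in db that contain r as a subsequence."""
    return sum(is_subsequence(r, s) for s in db)


def levelwise_mine(db, minsup):
    """Return {pattern: support} for all frequent subsequences."""
    alphabet = sorted(set("".join(db)))
    frequent_symbols = [c for c in alphabet if support(db, c) >= minsup]
    frequent = {}
    level = list(frequent_symbols)
    while level:
        for r in level:
            frequent[r] = support(db, r)
        # Simplified candidate generation: extend by one frequent symbol.
        candidates = [r + c for r in level for c in frequent_symbols]
        # Keep only candidates that meet the minimum support.
        level = [r for r in candidates if support(db, r) >= minsup]
    return frequent


db = ["CAGAAGT", "TGACAG", "GAAGT"]
result = levelwise_mine(db, minsup=3)
print(sorted(result, key=lambda r: (len(r), r)))
# ['A', 'G', 'T', 'AA', 'AG', 'GA', 'GG', 'AAG', 'GAA', 'GAG', 'GAAG']
```

Because C has support 2, it is dropped after the first pass and never appears in a longer candidate.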
8
Spade: Vertical Sequence Mining
For each symbol c ∈ ∑, we keep a set of tuples of the form <i, pos(c)>, where pos(c) is the set of positions at which c occurs in the sequence si. For each pattern, SPADE maintains the list of positions of the occurrences of its last symbol. In this example, A occurs in s1 at positions 2, 4, and 5.
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
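The vertical layout described above can be sketched with plain dictionaries. This is a minimal illustration, assuming 1-based positions as on the slide; `build_poslists` and `extend` are hypothetical names, and `extend` handles only the simple case of appending a single symbol to a pattern, not SPADE's general join of two patterns that share a prefix.

```python
from collections import defaultdict


def build_poslists(db):
    """Map each symbol to {sequence id: sorted list of 1-based positions}."""
    poslists = defaultdict(dict)
    for sid, seq in db.items():
        for pos, sym in enumerate(seq, start=1):
            poslists[sym].setdefault(sid, []).append(pos)
    return poslists


def extend(pattern_pos, symbol_pos):
    """Position list of the pattern extended by one symbol.

    pattern_pos holds, per sequence, the positions where the pattern's
    last symbol can end; the new symbol must occur after the earliest one.
    """
    joined = {}
    for sid, positions in pattern_pos.items():
        first = positions[0]  # lists are sorted, so this is the minimum
        later = [q for q in symbol_pos.get(sid, []) if q > first]
        if later:
            joined[sid] = later
    return joined


db = {"s1": "CAGAAGT", "s2": "TGACAG", "s3": "GAAGT"}
pl = build_poslists(db)
print(pl["A"])   # {'s1': [2, 4, 5], 's2': [3, 5], 's3': [2, 3]}
ag = extend(pl["A"], pl["G"])
print(ag)        # {'s1': [3, 6], 's2': [6], 's3': [4]}
print(len(ag))   # support of AG = number of sequences with a non-empty list
```

The support of a pattern is simply the number of sequence ids left in its position list, so no rescan of the original sequences is needed.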
9
Spade: Vertical Sequence Mining
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
10
Projection-Based Sequence Mining (PrefixSpan)
The projected database with respect to α, denoted Dα, is obtained by finding the first occurrence (say, at position p) of α in si. Next, the suffix of si starting at position p+1 is extracted. After that, any infrequent symbols are removed from the suffix.
Minimum Support = 3
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
Symbol  Support
A       3
C       2
G       3
T       3
Projection with α = G, i.e., DG:
Id  Sequence
s1  AAGT
s2  AAG
s3  AAGT
Given s1 = CAGAAGT, the projection of s1 with respect to G is AAGT. In this example, DG is {s1: AAGT, s2: AAG, s3: AAGT}.
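The projection step can be sketched as below. This is a minimal illustration rather than the full PrefixSpan recursion. It assumes that "infrequent" is judged against the database being projected (so C, with support 2, is dropped), which reproduces the DG shown in the example; `symbol_support` and `project` are hypothetical names.

```python
def symbol_support(db):
    """Number of sequences in which each symbol occurs."""
    counts = {}
    for seq in db.values():
        for c in set(seq):
            counts[c] = counts.get(c, 0) + 1
    return counts


def project(db, symbol, minsup):
    """Project db onto `symbol`: suffix after its first occurrence,
    with symbols that are infrequent in db removed (assumption above)."""
    frequent = {c for c, n in symbol_support(db).items() if n >= minsup}
    projected = {}
    for sid, seq in db.items():
        p = seq.find(symbol)  # 0-based index of the first occurrence
        if p == -1:
            continue          # the sequence does not contain the symbol
        suffix = seq[p + 1:]
        projected[sid] = "".join(c for c in suffix if c in frequent)
    return projected


db = {"s1": "CAGAAGT", "s2": "TGACAG", "s3": "GAAGT"}
print(project(db, "G", minsup=3))
# {'s1': 'AAGT', 's2': 'AAG', 's3': 'AAGT'}
```

Applying `symbol_support` to the projected database then gives the supports of the two-symbol patterns that start with G, which is how the recursion would continue.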
11
PrefixSpan
12
PrefixSpan
13
Consecutive Subsequence (or substring) Mining
14
Substring Mining via Suffix Trees
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
Label format: (id number, frequency occurrence)
15
Substring Mining via Suffix Trees
Id  Sequence
s1  CAGAAGT
16
Substring Mining via Suffix Trees
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
n = support (i.e., frequency)
If minsup = 3, is GAA frequent? Is CAA frequent?
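The two questions can be checked with a small sketch. It uses an uncompressed suffix trie, which is a simplification of a suffix tree (no path compression, quadratic space), and stores at each node the set of sequence ids whose suffixes pass through it. The names `build_suffix_trie` and `substring_support` are assumptions for the example.

```python
def build_suffix_trie(db):
    """Insert every suffix of every sequence; each node records which
    sequences contain the substring spelled out by the path to it."""
    root = {"children": {}, "ids": set()}
    for sid, seq in db.items():
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "ids": set()}
                )
                node["ids"].add(sid)
    return root


def substring_support(trie, r):
    """Number of sequences that contain r as a consecutive substring."""
    node = trie
    for ch in r:
        node = node["children"].get(ch)
        if node is None:
            return 0
    return len(node["ids"])


db = {"s1": "CAGAAGT", "s2": "TGACAG", "s3": "GAAGT"}
trie = build_suffix_trie(db)
print(substring_support(trie, "GAA"))  # 2 (s1 and s3 only)
print(substring_support(trie, "CAA"))  # 0
print(substring_support(trie, "AG"))   # 3
```

Under substring semantics with minsup = 3, neither GAA nor CAA is frequent: s2 contains G, A, A only with a gap, which counts for subsequences but not for substrings.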
17
Subgraph Pattern Mining
18
Frequent Subgraph Mining
Extend association rule mining to finding frequent subgraphs.
Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.
19
Graph Definitions
20
Computing the support of a subgraph.
21
Frequent Subgraph Mining
Given a set of graphs G and a support threshold, minsup, the goal of frequent subgraph mining is to find all subgraphs g such that s(g) ≥ minsup, where s(g) is the support of g.
22
Brute-force method
23
Apriori-like method
Transform each graph into a transaction-like format so that existing algorithms such as Apriori can be applied. During candidate generation, a pair of frequent (k-1)-subgraphs is merged to form a candidate k-subgraph.
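The transformation step can be sketched for the first level only, where each labeled edge is treated as an item and each graph as a transaction. This is a minimal illustration under stated assumptions: graphs are undirected, an edge is identified by its two vertex labels and its edge label, and the toy graphs are hypothetical. Merging frequent (k-1)-subgraphs into candidates and checking subgraph isomorphism are not covered.

```python
from collections import Counter


def graph_to_transaction(edges):
    """Turn a list of (vertex_label, vertex_label, edge_label) triples
    into a set of items; vertex labels are ordered so that an undirected
    edge yields the same item in either direction."""
    return {(min(u, v), max(u, v), e) for u, v, e in edges}


def frequent_edges(graphs, minsup):
    """Items (single edges) that occur in at least minsup graphs."""
    counts = Counter()
    for edges in graphs:
        # Each graph contributes at most once per item.
        counts.update(graph_to_transaction(edges))
    return {item: n for item, n in counts.items() if n >= minsup}


# Hypothetical toy graphs, not taken from the slides.
graphs = [
    [("a", "b", "p"), ("b", "c", "q")],
    [("b", "a", "p"), ("c", "d", "r")],
    [("a", "b", "p"), ("c", "b", "q")],
]
print(frequent_edges(graphs, minsup=2))
# {('a', 'b', 'p'): 3, ('b', 'c', 'q'): 2}
```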
24
Take Home Message
Sequence Mining
Consecutive Sequence Mining