Advanced Pattern Mining 02


1 Advanced Pattern Mining 02
COP 6726: New Directions in Database Systems Advanced Pattern Mining 02

2 Formal Definition of a Subsequence
A sequence <a1 a2 … an> is contained in another sequence <b1 b2 … bm> (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, …, an ⊆ b_in.
The support of a subsequence w is defined as the fraction of data sequences that contain w.
A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).
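The containment test and the support computation can be sketched as follows. This is a minimal illustration, assuming each sequence is a list of Python sets (one set per element); the function names and the toy database are hypothetical and not from the slides.

```python
def is_contained(a, b):
    """Return True if sequence a is contained in sequence b.

    Both are lists of sets. We look for indices i1 < i2 < ... < in
    with a[k] a subset of b[ik], matching greedily from the left.
    """
    j = 0
    for element in a:
        # Advance to the earliest remaining element of b that covers `element`.
        while j < len(b) and not element <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next match must use a strictly larger index
    return True


def support(w, database):
    """Fraction of data sequences that contain w."""
    return sum(is_contained(w, s) for s in database) / len(database)


# Hypothetical toy database of three data sequences.
database = [
    [{1, 2}, {3}, {4, 5}],
    [{1}, {3, 4}],
    [{2}, {5}],
]
w = [{1}, {4}]
w_support = support(w, database)  # w is contained in the first two sequences
```

With minsup = 0.5, w would count as a sequential pattern in this toy database, since two of the three sequences contain it.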

3 Sequential Pattern Mining: Definition
Given: a database of sequences and a user-specified minimum support threshold, minsup.
Task: find all subsequences with support ≥ minsup.

4 Sub-sequence Mining

5 Sequence Mining
Let ∑ denote an alphabet, defined as a finite set of characters or symbols. A sequence (or a string) is defined as an ordered list of symbols, i.e., s = s1s2…sk, where si ∈ ∑ is the symbol at position i.
Let s = s1s2…sn and r = r1r2…rm be two sequences over ∑. r is a subsequence of s, denoted r ⊆ s, if there exists a one-to-one mapping ϕ : [1, m] → [1, n] such that r[i] = s[ϕ(i)] and, for any two positions i, j in r, i < j ⟹ ϕ(i) < ϕ(j).
r is a consecutive subsequence (or substring) of s if r[1 : m] = s[j : j+m−1], with 1 ≤ j ≤ n−m+1.
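The difference between a subsequence and a substring can be checked with a few lines of code. This is a sketch assuming sequences are plain Python strings; the function names are hypothetical. The first function returns one valid order-preserving mapping ϕ as 1-based positions.

```python
def subsequence_mapping(r, s):
    """Return positions [phi(1), ..., phi(m)] (1-based) embedding r in s,
    or None if r is not a subsequence of s. Uses the leftmost match."""
    phi = []
    start = 0
    for symbol in r:
        pos = s.find(symbol, start)
        if pos == -1:
            return None
        phi.append(pos + 1)  # convert to 1-based position
        start = pos + 1      # keeps phi strictly increasing
    return phi


def is_substring(r, s):
    """True if r occurs in s as a consecutive subsequence."""
    return r in s


mapping = subsequence_mapping("GAAG", "TGACAG")  # gaps are allowed
consecutive = is_substring("GAAG", "TGACAG")     # gaps are not allowed
```

Here GAAG is a subsequence of TGACAG (the C is skipped) but not a substring of it.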

6 Sequence Mining
Given a sequence database, find the frequent subsequences that satisfy the minimum support constraint.
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
Minimum Support = 3
Frequent subsequences: A, AA, AG, AAG, G, GA, GG, GAA, GAG, GAAG, T

7 Generalized Sequence Pattern (GSP) Mining
Minimum Support = 3
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
Symbol  Support
A       3
C       2
G       3
T       3
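A level-wise search in the spirit of GSP can be sketched on this database as follows. This is an illustrative simplification for string sequences, not the full algorithm: candidates of length k+1 are formed by joining two frequent length-k sequences that overlap in k−1 symbols, candidates with an infrequent subsequence are pruned, and the rest are counted against the database. Function names are hypothetical.

```python
def is_subsequence(r, s):
    """True if r can be obtained from s by deleting symbols."""
    it = iter(s)
    return all(symbol in it for symbol in r)


def count_support(r, database):
    """Number of sequences in the database that contain r."""
    return sum(is_subsequence(r, s) for s in database)


def gsp(database, minsup):
    """Return {pattern: support} for all frequent subsequences."""
    alphabet = sorted(set("".join(database)))
    frequent = {c: count_support(c, database) for c in alphabet}
    frequent = {r: n for r, n in frequent.items() if n >= minsup}
    result = dict(frequent)
    while frequent:
        # Join step: a and b overlap when a without its first symbol
        # equals b without its last symbol.
        candidates = {a + b[-1] for a in frequent for b in frequent
                      if a[1:] == b[:-1]}
        # Prune step: every subsequence with one symbol deleted must be frequent.
        candidates = {c for c in candidates
                      if all(c[:i] + c[i + 1:] in frequent for i in range(len(c)))}
        counted = {c: count_support(c, database) for c in candidates}
        frequent = {c: n for c, n in counted.items() if n >= minsup}
        result.update(frequent)
    return result


database = ["CAGAAGT", "TGACAG", "GAAGT"]
patterns = gsp(database, minsup=3)
```

On the three example sequences with minimum support 3, the search stops after length 4, where GAAG is the only frequent pattern.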

8 Spade: Vertical Sequence Mining
For each symbol c ∈ ∑, we keep a set of tuples of the form <i, pos(c)>, where pos(c) is the set of positions at which c occurs in the sequence si.
For a longer pattern, Spade maintains the list of positions for the occurrences of its last symbol.
In this example, A occurs in s1 at positions 2, 4, and 5.
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
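The vertical representation and the way it is extended can be sketched as follows. This is a simplified illustration, assuming string sequences and 1-based positions; `build_poslists` and `temporal_join` are hypothetical names, and the join shown here only covers extending a pattern by one symbol.

```python
from collections import defaultdict


def build_poslists(database):
    """Map each symbol to {sequence id: [positions of the symbol]}."""
    poslists = defaultdict(dict)
    for i, s in enumerate(database, start=1):
        for p, c in enumerate(s, start=1):
            poslists[c].setdefault(i, []).append(p)
    return poslists


def temporal_join(prefix_pos, symbol_pos):
    """Position lists for the pattern 'prefix followed by symbol'.

    For each sequence, keep the positions of the new last symbol that
    lie after the earliest end position of the prefix.
    """
    joined = {}
    for i, positions in prefix_pos.items():
        if i in symbol_pos:
            first = min(positions)
            later = [p for p in symbol_pos[i] if p > first]
            if later:
                joined[i] = later
    return joined


database = ["CAGAAGT", "TGACAG", "GAAGT"]
poslists = build_poslists(database)
ag = temporal_join(poslists["A"], poslists["G"])  # pattern AG
ag_support = len(ag)  # number of sequences with a non-empty position list
```

The support of a pattern is simply the number of sequence ids left in its joined list, so no further pass over the original database is needed.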

9 Spade: Vertical Sequence Mining
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT

10 Projection-Based Sequence Mining (PrefixSpan)
The projected database with respect to ⍺, denoted D⍺, is obtained as follows. For each sequence si, find the first occurrence of ⍺ in si, say at position p. Next, extract the suffix of si starting at position p+1. After that, remove any infrequent symbols from the suffix.
Minimum Support = 3; ⍺ = G, i.e., the projection DG.
Database:
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
Symbol supports:
Symbol  Support
A       3
C       2
G       3
T       3
Projected database DG:
Id  Sequence
s1  AAGT
s2  AAG
s3  AAGT
Given s1 = CAGAAGT, the projection of s1 with respect to G is AAGT: the first G is at position 3, and the suffix starting at position 4 is AAGT. In this example, DG is {s1: AAGT, s2: AAG, s3: AAGT}.
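The projection step and the recursion built on it can be sketched as follows. This is a minimal version for string sequences; the function names are hypothetical, and infrequent symbols are removed at the start of each recursive call rather than after each individual projection.

```python
from collections import Counter


def remove_infrequent(database, minsup):
    """Drop symbols contained in fewer than minsup sequences."""
    counts = Counter(c for s in database for c in set(s))
    keep = {c: n for c, n in counts.items() if n >= minsup}
    cleaned = ["".join(c for c in s if c in keep) for s in database]
    return cleaned, keep


def project(database, symbol):
    """Suffix after the first occurrence of symbol, for each sequence
    that contains it."""
    projected = []
    for s in database:
        p = s.find(symbol)
        if p != -1:
            projected.append(s[p + 1:])
    return projected


def prefixspan(database, minsup, prefix=""):
    """Return {pattern: support} by recursively growing the prefix."""
    cleaned, frequent = remove_infrequent(database, minsup)
    patterns = {}
    for c in sorted(frequent):
        patterns[prefix + c] = frequent[c]
        patterns.update(prefixspan(project(cleaned, c), minsup, prefix + c))
    return patterns


database = ["CAGAAGT", "TGACAG", "GAAGT"]
cleaned, _ = remove_infrequent(database, minsup=3)  # C is removed
d_g = project(cleaned, "G")                         # projected database for G
patterns = prefixspan(database, minsup=3)
```

Each recursive call only scans the suffixes that can still extend the current prefix, which is why the projected databases shrink quickly as the prefix grows.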

11 PrefixSpan

12 PrefixSpan

13 Consecutive Subsequence (or substring) Mining

14 Substring Mining via Suffix Trees
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
(id number, frequency occurrence)

15 Substring Mining via Suffix Trees
Id  Sequence
s1  CAGAAGT

16 Substring Mining via Suffix Trees
Id  Sequence
s1  CAGAAGT
s2  TGACAG
s3  GAAGT
N = support (i.e., frequency)
If minsup = 3, is GAA frequent? Is CAA frequent?
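The support lookup can be tried out with a small sketch. Note that this builds an uncompressed suffix trie rather than a compact suffix tree, which keeps the code short at the cost of quadratic space; each node stores the ids of the sequences whose suffixes pass through it. Function names and the dictionary layout are hypothetical.

```python
def new_node():
    return {"children": {}, "ids": set()}


def build_suffix_trie(database):
    """Insert every suffix of every sequence; record sequence ids per node."""
    root = new_node()
    for i, s in enumerate(database, start=1):
        for start in range(len(s)):
            node = root
            for c in s[start:]:
                node = node["children"].setdefault(c, new_node())
                node["ids"].add(i)
    return root


def substring_support(root, r):
    """Number of sequences that contain r as a substring."""
    node = root
    for c in r:
        node = node["children"].get(c)
        if node is None:
            return 0
    return len(node["ids"])


def frequent_substrings(node, minsup, prefix=""):
    """Collect {substring: support} for all frequent substrings."""
    found = {}
    for c, child in sorted(node["children"].items()):
        if len(child["ids"]) >= minsup:
            found[prefix + c] = len(child["ids"])
            found.update(frequent_substrings(child, minsup, prefix + c))
    return found


database = ["CAGAAGT", "TGACAG", "GAAGT"]
trie = build_suffix_trie(database)
gaa = substring_support(trie, "GAA")
caa = substring_support(trie, "CAA")
frequent = frequent_substrings(trie, minsup=3)
```

In this sketch GAA occurs as a substring only in s1 and s3, and CAA in no sequence, so neither reaches a support of 3, even though GAA is frequent as a (non-consecutive) subsequence.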

17 Subgraph Pattern Mining

18 Frequent Subgraph Mining
Extend association rule mining to finding frequent subgraphs.
Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.

19 Graph Definitions

20 Computing the support of a subgraph.

21 Frequent Subgraph Mining
Given a set of graphs G and a support threshold, minsup, the goal of frequent subgraph mining is to find all subgraphs g such that s(g) ≥ minsup.
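Computing s(g) for a single candidate can be sketched with a brute-force containment test. The assumptions here are not from the slides: graphs are undirected with labeled vertices, support is taken as the fraction of graphs that contain g (mirroring the sequence definition), and the three toy graphs are hypothetical. The test tries every injective vertex mapping, so it is only practical for very small graphs.

```python
from itertools import permutations


def edge(u, v):
    """Undirected edge between vertices u and v."""
    return frozenset((u, v))


def contains(graph, pattern):
    """True if pattern maps into graph with matching labels and edges."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_vertices = list(p_labels)
    for image in permutations(g_labels, len(p_vertices)):
        phi = dict(zip(p_vertices, image))
        labels_ok = all(p_labels[v] == g_labels[phi[v]] for v in p_vertices)
        edges_ok = all(frozenset(phi[v] for v in e) in g_edges for e in p_edges)
        if labels_ok and edges_ok:
            return True
    return False


def support(pattern, graphs):
    """Fraction of graphs that contain the pattern."""
    return sum(contains(g, pattern) for g in graphs) / len(graphs)


# Hypothetical graphs: (vertex -> label, set of edges).
graphs = [
    ({1: "a", 2: "b", 3: "c"}, {edge(1, 2), edge(2, 3)}),
    ({1: "a", 2: "b"}, {edge(1, 2)}),
    ({1: "a", 2: "c"}, {edge(1, 2)}),
]
pattern = ({"x": "a", "y": "b"}, {edge("x", "y")})
s_g = support(pattern, graphs)
```

The a-b edge appears in the first two toy graphs, so the pattern would be frequent for any minsup up to 2/3.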

22 Brute-force method

23 Apriori-like method
Transform each graph into a transaction-like format so that existing algorithms such as Apriori can be applied.
During candidate generation, a pair of frequent (k-1)-subgraphs is merged to form a candidate k-subgraph.
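One way to picture the transformation is to turn every edge into an item and then run an itemset miner. The sketch below assumes each edge is encoded as (vertex label, vertex label, edge label) and that this combination is unique within a graph; the encoding, the function names, and the toy graphs are hypothetical. A frequent itemset found this way is a set of edges and is not guaranteed to form a connected subgraph.

```python
def to_transaction(graph):
    """Encode each edge as an item (label_u, label_v, edge_label)."""
    labels, edges = graph
    items = set()
    for u, v, edge_label in edges:
        a, b = sorted((labels[u], labels[v]))  # undirected: order the labels
        items.add((a, b, edge_label))
    return frozenset(items)


def apriori(transactions, minsup):
    """Return {itemset: support count} for all frequent itemsets."""
    level = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(current)
        k += 1
        # Merge pairs of frequent (k-1)-itemsets into candidate k-itemsets.
        level = {a | b for a in current for b in current if len(a | b) == k}
    return frequent


# Hypothetical graphs: (vertex -> label, set of (u, v, edge label)).
graphs = [
    ({1: "a", 2: "b", 3: "c"}, {(1, 2, "p"), (2, 3, "q")}),
    ({1: "a", 2: "b", 3: "c"}, {(1, 2, "p"), (2, 3, "q"), (1, 3, "r")}),
    ({1: "a", 2: "b"}, {(1, 2, "p")}),
]
transactions = [to_transaction(g) for g in graphs]
frequent = apriori(transactions, minsup=2)
```

With minsup = 2, the a-b edge, the b-c edge, and the pair of both come out as frequent, while the a-c edge appears in only one graph.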

24 Take Home Message
Sequence Mining
Consecutive Sequence Mining

