Sequential Pattern Mining: From Market Basket Analysis to DNA Mining

Sequential Pattern Mining: From Market Basket Analysis to DNA Mining
Jiawei Han and Jian Pei School of Computing Science Simon Fraser University British Columbia, Canada September 16, 2018 PAKDD01: Sequential Pattern Mining

PAKDD01: Sequential Pattern Mining
Tutorial Outline Time-series data mining From frequent pattern mining to sequential pattern mining Sequential pattern mining methods Frequent or sequential pattern mining applications: 1. Web usage mining Frequent or sequential pattern mining applications: 2. DNA mining September 16, 2018 PAKDD01: Sequential Pattern Mining

Part I. Time-Series Data Mining
From frequent pattern mining to sequential pattern mining Sequential pattern mining methods Web usage mining DNA mining September 16, 2018 PAKDD01: Sequential Pattern Mining

Analysis of Time-Series Data
Time-series database A time-series database consists of sequences of values or events changing with time, often data is recorded at regular intervals Typical analysis: trend, periodicity, similarity, etc. Different from (but can be treated as a special case of) sequence databases Applications Financial: stock price, inflation Medical & Biomedical: disease treatment Meteorological: precipitation September 16, 2018 PAKDD01: Sequential Pattern Mining

Time-Series Plot September 16, 2018 PAKDD01: Sequential Pattern Mining

Trend Analysis A time series can be illustrated as a time-series graph which describes a point moving with the passage of time Categories of time-series movements Long-term or trend movements (trend curve) Cyclic movements or variations, e.g., business cycles Seasonal movements or seasonal variations i.e, almost identical patterns that a time series appears to follow during corresponding months of successive years. Irregular or random movements September 16, 2018 PAKDD01: Sequential Pattern Mining

Estimation of Trend Curve
The freehand method Fit the curve by looking at the graph Costly and barely reliable for large-scaled data mining The least-square method Find the curve minimizing the sum of the squares of the deviation of points on the curve from the corresponding data points The moving-average method Eliminate cyclic, seasonal and irregular patterns Loss of end data Sensitive to outliers September 16, 2018 PAKDD01: Sequential Pattern Mining

Discovery of Trend in Time-Series (1)
Estimation of seasonal variations Seasonal index Set of numbers showing the relative values of a variable during the months of the year E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are seasonal index numbers for these months Deseasonalized data Data adjusted for seasonal variations E.g., divide the original monthly data by the seasonal index numbers for the corresponding months September 16, 2018 PAKDD01: Sequential Pattern Mining

Discovery of Trend in Time-Series (2)
Estimation of cyclic variations If (approximate) periodicity of cycles occurs, cyclic index can be constructed in much the same manner as seasonal indexes Estimation of irregular variations By adjusting the data for trend, seasonal and cyclic variations With the systematic analysis of the trend, cyclic, seasonal, and irregular components, it is possible to make long- or short-term predictions with reasonable quality September 16, 2018 PAKDD01: Sequential Pattern Mining

Similarity Search in Time-Series Analysis
Similarity search finds data sequences that differ only slightly from the given query sequence Normal database query finds exact match Two categories of similarity queries Whole matching: find a sequence that is similar to the query sequence Subsequence matching: find all pairs of similar sequences Typical Applications Financial market Market basket data analysis Scientific databases Medical diagnosis September 16, 2018 PAKDD01: Sequential Pattern Mining

Similar Time Series Analysis
VanEck International Fund Fidelity Selective Precious Metal and Mineral Fund Two similar mutual funds in the different fund group September 16, 2018 PAKDD01: Sequential Pattern Mining

Data Transformation Many techniques for signal analysis require the data to be in the frequency domain Usually data-independent transformations are used The transformation matrix is determined a priori E.g., discrete Fourier transform (DFT), discrete wavelet transform (DWT) The distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain DFT does a good job of concentrating energy in the first few coefficients If we keep only first a few coefficients in DFT, we can compute the lower bounds of the actual distance September 16, 2018 PAKDD01: Sequential Pattern Mining

Multidimensional Indexing
Constructed for efficient accessing using the first few Fourier coefficients Use the index can to retrieve the sequences that are at most a certain small distance away from the query sequence Perform postprocessing by computing the actual distance between sequences in the time domain and discard any false matches September 16, 2018 PAKDD01: Sequential Pattern Mining

Similar Time Series Analysis: Divide-and-Conquer
Dividing trails into sub-trails and MBRs (Minimum Bounding Rectangles) September 16, 2018 PAKDD01: Sequential Pattern Mining

Subsequence Matching Break each sequence into a set of pieces of window with length w Extract the features of the subsequence inside the window Map each sequence to a “trail” in the feature space Divide the trail of each sequence into “subtrails” and represent each of them with minimum bounding rectangle Use a multipiece assembly algorithm to search for longer sequence matches September 16, 2018 PAKDD01: Sequential Pattern Mining

Steps for Similarity Search
Atomic matching Find all pairs of gap-free windows of a small length that are similar Window stitching Stitch similar windows to form pairs of large similar subsequences allowing gaps between atomic matches Subsequence Ordering Linearly order the subsequence matches to determine whether enough similar pieces exist September 16, 2018 PAKDD01: Sequential Pattern Mining

Steps for Similar Time-Series Analysis
September 16, 2018 PAKDD01: Sequential Pattern Mining

Enhanced Similarity Search Methods
Two subsequences are similar if one lies within an envelope of  width around the other, ignoring outliers Allow for gaps within a sequence or differences in offsets or amplitudes Normalize sequences with amplitude scaling and offset translation Two sequences are similar if they have enough non-overlapping time-ordered pairs of similar subsequences Parameters specified by a user or expert: sliding window size, width of an envelope for similarity, maximum gap, and matching fraction September 16, 2018 PAKDD01: Sequential Pattern Mining

Query Languages for Time Sequences
Time-sequence query language Should be able to specify sophisticated queries like Find all of the sequences that are similar to some sequence in class A, but not similar to any sequence in class B Should be able to support various kinds of queries: range queries, all-pair queries, and nearest neighbor queries Shape definition language Allows users to define and query the overall shape of time sequences Uses human readable series of sequence transitions or macros Ignores the specific details E.g., the pattern up, Up, UP can be used to describe increasing degrees of rising slopes Macros: spike, valley, etc. September 16, 2018 PAKDD01: Sequential Pattern Mining

Sequential Pattern Mining
Mining of frequently occurring patterns related to time or other sequences Collection of ordered events within an interval Sequential pattern mining usually concentrate on symbolic patterns E.g., renting “Star Wars”, then “Empire Strikes Back”, then “Return of the Jedi” in that order Applications Targeted marketing Customer retention Weather prediction September 16, 2018 PAKDD01: Sequential Pattern Mining

Mining Sequential Patterns: An Example
Customer shopping sequence Frequent Itemsets Sequential patterns with support > {(C), (H)} {(C), (DG)} September 16, 2018 PAKDD01: Sequential Pattern Mining

Sequential Pattern Mining: Cases and Parameters (1)
Duration of a time sequence T Sequential pattern mining can be confined to the data within a specified duration E.g., subsequence corresponding to the year of 1999 E.g., partitioned sequences, such as every year, or every week after stock crashes, or every two weeks before and after a volcano eruption Event folding window w If w = T, time-insensitive frequent patterns are found If w = 0 (no event sequence folding), sequential patterns are found where each event occurs at a distinct time instant If 0 < w < T, sequences occurring within the same period w are folded in the analysis (e.g., every week) September 16, 2018 PAKDD01: Sequential Pattern Mining

Sequential Pattern Mining: Cases and Parameters (2)
Interval gap, gap, between the events in the discovered pattern gap = 0: No interval gap is allowed, i.e., only strictly consecutive sequences are found E.g., frequent patterns occurring in consecutive weeks min_int  gap  max_int: Find patterns that are separated by at least min_int but at most max_int E.g., if a person rents movie A, it is likely she will rent movie B after one week but within 30 days” (7  gap  30) gap = c  0: Find patterns carrying an exact interval E.g., every time when Dow Jones drops more than 5%, what will happen two days later?” (gap = 2) September 16, 2018 PAKDD01: Sequential Pattern Mining

Sequential Pattern Mining Methods: A General Picture
Sequential pattern mining methods are strongly influenced by frequent pattern (i.e., association rule) mining methods Apriori to GSP (Generalized Sequential Patterns) FPgrowth to FreeSpan and PrefixSpan Methods will be systematically introduced in the next two parts September 16, 2018 PAKDD01: Sequential Pattern Mining

Episodes Get IT job Buy a car Lease a car Buy a house Establish a company or and Episodes: user-specified constraints in the form of pattern templates Serial episodes: A  B Parallel episodes: A & B Regular expressions: (A | B) C* (D  E) Arbitrary DAG Vertices: events; Edges: sequence of events Sequential patterns can also be specified as frequent sub-trees or sub-graphs September 16, 2018 PAKDD01: Sequential Pattern Mining

Mining Episodes Given: an input sequence and a set of episodes Move a time window across the input sequence Find all patterns that occur in some user-specified percentage of windows September 16, 2018 PAKDD01: Sequential Pattern Mining

Periodicity and Partial Periodicity
Periodicity is everywhere: tides, seasons, daily power consumption, etc. Full periodicity: Every point in time contributes (precisely or approximately) to the periodicity Partial periodicity: A more general notion Only some fragments contribute to the periodicity Jim reads NY Times 7:00-7:30 am every week day Cyclic association rules: Associations which form cycles Methods Full periodicity: FFT, other statistical analysis methods Partial and cyclic periodicity: Variations of Apriori-like mining methods September 16, 2018 PAKDD01: Sequential Pattern Mining

Periodicity Analysis Mine recurring patterns in time-related databases Full Periodicity Similarity Cyclic and Partial September 16, 2018 PAKDD01: Sequential Pattern Mining

Mining Partial Periodic Patterns
Full periodicity search algorithms do not work Due to mixture of periodic and non-periodic data Cyclic association mining method (Ozden, et al.’98) Most real life cycles are imperfect cycles A max-subpattern hit set method for mining partial periodicity (Han, Dong & Yin, ICDE99) A more efficient pattern-growth method can be developed based on FP-growth (see Part II) September 16, 2018 PAKDD01: Sequential Pattern Mining

Part II. From Frequent Pattern Mining to Sequential Pattern Mining
Time-series data mining From frequent pattern mining to sequential pattern mining Sequential pattern mining methods Web usage mining DNA mining September 16, 2018 PAKDD01: Sequential Pattern Mining

What Is Frequent Pattern Mining?
What is a frequent pattern? Pattern (set of items, sequence, etc.) that occurs together frequently in a database [AIS93] Frequent pattern mining: Finding regularities in data What products were often purchased together? — beer and diapers?! What are the subsequent purchases after buying a PC? What kind of DNAs are sensitive to this new drug? Can we automatically classify Web documents? September 16, 2018 PAKDD01: Sequential Pattern Mining

Why Is Frequent Pattern Mining an Essential Task in Data Mining?
Foundation for several essential data mining tasks Association, correlation, causality Sequential patterns, partial periodicity, temporal or spatial associations Associative classification, cluster analysis Broad applications basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, DNA sequence analysis, etc. September 16, 2018 PAKDD01: Sequential Pattern Mining

From Frequent Patterns to Association Rules: Support and Confidence
Customer buys both Find all the rules X & Y  Z with min confidence and support support, s, probability that a transaction contains {X, Y, Z} confidence, c, conditional probability that a transaction having {X, Y} also contains Z. Customer buys diaper Customer buys beer Let mini_support = 50%, min_conf = 50%: A  C (50%, 66.6%) C  A (50%, 100%) September 16, 2018 PAKDD01: Sequential Pattern Mining

Previous Work: A Candidate Generation-and-test Approach
Apriori —A level-wise, candidate set generation and test approach (Agrawal & Srikant 1994, Mannila, et al. 1994) Any subset of a frequent itemset must be frequent if {beer, diaper, nuts} is frequent, {beer, diaper} must be. Apriori pruning principle: If there is any itemset which is infrequent, its superset will not be generated! Method: generate length (k+1) candidate itemsets from length k frequent itemsets, and test the candidates against DB The performance studies show its efficiency and scalability September 16, 2018 PAKDD01: Sequential Pattern Mining

Subsequent Studies on Apriori
Aproiri: a breakthrough in scalable data mining hundreds of following-up studies! Performance improvements partitioning (’95), sampling (’96), dynamic itemset counting (’97), SQL implementation of Apriori (’98), constraint pushing (’98), incremental, parallel and distributed algorithms, etc. Mining various kinds of rules or regularities multi-level, quantitative association rules, correlation and causality, ratio rules, sequential patterns, emerging patterns, temporal associations, partial periodicity classification, clustering, iceberg cubes, etc. September 16, 2018 PAKDD01: Sequential Pattern Mining

Is Apriori Efficient Enough? —Performance Bottlenecks!!
Strength: Candidate generation-and-test Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets Use database scan and pattern matching to test (i.e., collect counts for the candidate itemsets) Bottleneck: Candidate generation-and-test, too!! Generation may lead to huge candidate sets 104 frequent 1-itemset will generate 107 candidate 2-itemsets To discover a frequent pattern of length 100, e.g., {a1, a2, …, a100}, we need to generate 2100  1030 candidates. Test involves multiple scans of the entire database Needs (n +1 ) scans, n is the length of the longest pattern September 16, 2018 PAKDD01: Sequential Pattern Mining

Can We Do Better? — Mining frequent patterns without candidate generation “Divide-and-conquer”: A long-lasting winner Apriori: decompose the task according to pattern length Partitioning (Navathe’95): decompose the database Our approach: decompose both the task and DB according to the frequent patterns obtained so far Database projection and compression Project the database based on its frequent patterns Compress a database into a compact, Frequent-Pattern tree (FP-tree) condensed, but complete for frequent pattern mining no candidate generation: test projected database only! September 16, 2018 PAKDD01: Sequential Pattern Mining

Construction of FP-tree from a Transaction Database
TID Items bought (ordered) frequent items 100 {f, a, c, d, g, i, m, p} {f, c, a, m, p} 200 {a, b, c, f, l, m, o} {f, c, a, b, m} 300 {b, f, h, j, o, w} {f, b} 400 {b, c, k, s, p} {c, b, p} 500 {a, f, c, e, l, p, m, n} {f, c, a, m, p} min_support = 0.5 {} f:4 c:1 b:1 p:1 c:3 a:3 m:2 p:2 m:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 Scan DB once, find frequent 1-itemset (single item pattern) Order frequent items in frequency descending order Scan DB again, construct FP-tree September 16, 2018 PAKDD01: Sequential Pattern Mining

Benefits of the FP-tree Structure
Completeness Never breaks a long pattern of any transaction Preserves complete information for frequent pattern mining Compactness Reducing irrelevant info—infrequent items are gone Items in frequency descension order: the more frequently occurring, the more likely to be shared Never be larger than the original database (not count node-links and the count field) For Connect-4 DB, compression ratio could be over 100 September 16, 2018 PAKDD01: Sequential Pattern Mining

Mining Frequent Patterns with FP-trees
Idea: Frequent pattern growth Recursively grow frequent patterns by pattern and database partition Method For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree Repeat the process on each newly created conditional FP-tree Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern September 16, 2018 PAKDD01: Sequential Pattern Mining

From FP-tree to Conditional Pattern-Base
Starting at the frequent item header table in the FP-tree Traverse the FP-tree by following the link of each frequent item p Accumulate all of transformed prefix paths of item p to form p’s conditional pattern base {} f:4 c:1 b:1 p:1 c:3 a:3 m:2 p:2 m:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 Conditional pattern bases item cond. pattern base c f:3 a fc:3 b fca:1, f:1, c:1 m fca:2, fcab:1 p fcam:2, cb:1 September 16, 2018 PAKDD01: Sequential Pattern Mining

Transformed Prefix Paths
Derive the transformed prefix paths of item p For each item p in the tree, collect p’s prefix path with count = p’s frequency Why only prefix path? Why this count? Complete? {} f:4 c:1 b:1 p:1 c:3 a:3 m:2 p:2 m:1 Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 Conditional pattern bases item cond. pattern base c f:3 a fc:3 b fca:1, f:1, c:1 m fca:2, fcab:1 p fcam:2, cb:1 September 16, 2018 PAKDD01: Sequential Pattern Mining

Correctness and Completeness of Conditional Pattern-Bases
Completeness (node-link property) For any frequent item p, all the possible frequent patterns that contain p can be obtained by following p's node-links, starting from p’'s head in the FP-tree header Correctness (transformed prefix path property) To calculate the frequent patterns for a node p in a path P, only the prefix sub-path of p in P need to be accumulated, and its frequency count should carry the same count as node p. September 16, 2018 PAKDD01: Sequential Pattern Mining

From Conditional Pattern-Bases to Conditional FP-trees
For each pattern-base Accumulate the count for each item in the base Construct the FP-tree for the frequent items of the pattern base m-conditional pattern base: fca:2, fcab:1 {} Header Table Item frequency head f 4 c 4 a 3 b 3 m 3 p 3 f:4 c:1 All frequent patterns relate to m m, fm, cm, am, fcm, fam, cam, fcam {} f:3 c:3 a:3 m-conditional FP-tree c:3 b:1 b:1   a:3 p:1 m:2 b:1 p:2 m:1 September 16, 2018 PAKDD01: Sequential Pattern Mining

Conditional Pattern-Bases and Conditional FP-trees of Single Items
{(fcam:2), (cb:1)} {(c:3)}|p m {(fca:2), (fcab:1)} {(f:3, c:3, a:3)}|m b {(fca:1), (f:1), (c:1)} Empty a {(fc:3)} {(f:3, c:3)}|a c {(f:3)} {(f:3)}|c f September 16, 2018 PAKDD01: Sequential Pattern Mining

Recursion: Mining Each Conditional FP-tree Until …
{} f:3 c:3 am-conditional FP-tree Cond. pattern base of “am”: (fc:3) {} f:3 c:3 a:3 m-conditional FP-tree {} Cond. pattern base of “cm”: (f:3) f:3 cm-conditional FP-tree {} Cond. pattern base of “cam”: (f:3) f:3 cam-conditional FP-tree September 16, 2018 PAKDD01: Sequential Pattern Mining

A Special Case: Single FP-tree Path
Suppose a (conditional) FP-tree T has a single path P The complete set of frequent patterns of T can be generated by enumeration of all the combinations of the sub-paths of P {} All frequent patterns concerning m m, fm, cm, am, fcm, fam, cam, fcam f:3  c:3 a:3 m-conditional FP-tree September 16, 2018 PAKDD01: Sequential Pattern Mining

A More General (Special) Case: Single Prefix Path in FP-tree
Suppose a (conditional) FP-tree T has a shared single prefix-path P Mining can be decomposed into two parts Reduction of the single prefix path into one node Concatenation of the mining results of the two parts a2:n2 a3:n3 a1:n1 {} b1:m1 C1:k1 C2:k2 C3:k3 b1:m1 C1:k1 C2:k2 C3:k3 r1 a2:n2 a3:n3 a1:n1 {} r1 = +  September 16, 2018 PAKDD01: Sequential Pattern Mining

I/O-Bound FP-Growth: Scaling FP-Growth by DB Projection
FP-tree cannot fit in memory?—DB projection first partition a database into a set of projected DBs then construct and mine FP-tree for each projected DB parallel projection vs. partition projection techniques Alternative methods Construction of a disk-resident FP-tree Materialization and incremental update of an FP-tree September 16, 2018 PAKDD01: Sequential Pattern Mining

Partition-Based Projection
Parallel projection needs a lot of disk space Partition projection saves it Tran. DB fcamp fcabm fb cbp p-proj DB fcam cb m-proj DB fcab fca b-proj DB f … a-proj DB fc c-proj DB am-proj DB cm-proj DB September 16, 2018 PAKDD01: Sequential Pattern Mining

Principles of Frequent Pattern Growth
Pattern growth theorem Let  be a frequent itemset in DB, B be 's conditional pattern base, and  be an itemset in B. Then    is a frequent itemset in DB iff  is frequent in B. Ex. “abcdef ” is a frequent pattern, if and only if “abcd ” is a frequent pattern, and “ef ” is frequent in abcd’s conditional pattern-base (i.e., the set of transactions containing “abcd ”) September 16, 2018 PAKDD01: Sequential Pattern Mining

FP-Growth vs. Apriori: Scalability With the Support Threshold
Data set T25I20D10K September 16, 2018 PAKDD01: Sequential Pattern Mining

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold
Data set T25I20D100K September 16, 2018 PAKDD01: Sequential Pattern Mining

Why Is FP-Growth the Winner?
Divide-and-conquer: decompose both the mining task and DB according to the frequent patterns obtained so far leads to focused search of smaller databases Other factors no candidate generation, no candidate test compressed database: FP-tree structure no repeated scan of entire database basic ops—counting and FP-tree building, not pattern search and matching September 16, 2018 PAKDD01: Sequential Pattern Mining

Implications of the Methodology
Mining closed frequent itemsets and max-patterns CLOSET (DMKD’00) Mining sequential patterns FreeSpan (KDD’00), PrefixSpan (ICDE’01) Constraint-based mining of frequent patterns Convertible constraints (KDD’00, ICDE’01) Computing iceberg data cubes with complex measures H-tree and H-cubing algorithm (SIGMOD’01) Fault-tolerant frequent patterns and sequential patterns Work in progress September 16, 2018 PAKDD01: Sequential Pattern Mining

Part III. Sequential Pattern Mining Methods
Time-series data mining From frequent pattern mining to sequential pattern mining Sequential pattern mining methods Web usage mining DNA mining September 16, 2018 PAKDD01: Sequential Pattern Mining

Why Sequential Pattern Mining?
Sequential pattern mining: Finding time-related frequent patterns (frequent subsequences) Most data and applications are time-related Customer shopping patterns, telephone calling patterns E.g., first buy computer, then CD-ROMS, software, within 3 mos. Natural disasters (e.g., earthquake, hurricane) Disease and treatment Stock market fluctuation Weblog click stream analysis DNA sequence analysis September 16, 2018 PAKDD01: Sequential Pattern Mining

Sequential Pattern Mining
Given a set of sequences, find the complete set of frequent subsequences A sequence : < (ef) (ab) (df) c b > A sequence database Elements items within an element are listed alphabetically SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> Given support threshold min_sup =2, <(ab)c> is a sequential pattern September 16, 2018 PAKDD01: Sequential Pattern Mining

Sequential Pattern: Basics
A sequence : <(bd) c b (ac)> Elements A sequence database <a(bd)bcb(ade)> 50 <(be)(ce)d> 40 <(ah)(bf)abf> 30 <(bf)(ce)b(fg)> 20 <(bd)cb(ac)> 10 Sequence Seq. ID <ad(ae)> is a subsequence of <a(bd)bcb(ade)> Given support threshold min_sup =2, <(bd)cb> is a sequential pattern September 16, 2018 PAKDD01: Sequential Pattern Mining

Apriori Property If a sequence S is not frequent  every super-sequence of S is not frequent E.g, <hb> is infrequent so do <hab>, <(ah)b> <a(bd)bcb(ade)> 50 <(be)(ce)d> 40 <(ah)(bf)abf> 30 <(bf)(ce)b(fg)> 20 <(bd)cb(ac)> 10 Sequence Seq. ID Given support threshold min_sup =2 September 16, 2018 PAKDD01: Sequential Pattern Mining

Apriori Property: Another Form
If S is a sequential pattern  every non-empty subsequence of S is also a sequential pattern <(bd)cba> is a sequential pattern  so are <acb>, <dba>, ..., etc. <a(bd)bcb(ade)> 50 <(be)(ce)d> 40 <(ah)(bf)abf> 30 <(bf)(ce)b(fg)> 20 <(bd)cb(ac)> 10 Sequence Seq. ID min_sup =2 September 16, 2018 PAKDD01: Sequential Pattern Mining

Apriori-like Sequential Pattern Mining Methods
Proposed by Agrawal and Srikant, ICDE’95 & EDBT’96 GSP (Generalized Sequential Pattern) algorithm Outline of the method Level-by-level do Generate candidate sequences Scan database to collect support counts Use Apriori property to prune candidates Only generate candidates satisfying Apriori property Advantages Candidate pruning, scalable September 16, 2018 PAKDD01: Sequential Pattern Mining

Find Length-1 Sequential Patterns
Cand Sup <a> 3 5 <c> 4 <d> <e> <f> 2 <g> 1 <h> Candidates: all singleton sequence <a>, , <c>, <d>, <e>, <f>, <g>, <h> Scan database once, count support for candidates <a(bd)bcb(ade)> 50 <(be)(ce)d> 40 <(ah)(bf)abf> 30 <(bf)(ce)b(fg)> 20 <(bd)cb(ac)> 10 Sequence Seq. ID min_sup =2 September 16, 2018 PAKDD01: Sequential Pattern Mining

Generate Length-2 Candidates
 <c> <d> <e> <f> <aa> <ab> <ac> <ad> <ae> <af> <ba> <bb> <bc> <bd> <be> <bf> <ca> <cb> <cc> <cd> <ce> <cf> <da> <db> <dc> <dd> <de> <df> <ea> <eb> <ec> <ed> <ee> <ef> <fa> <fb> <fc> <fd> <fe> <ff> 51 length-2 Candidates <a> <c> <d> <e> <f> <(ab)> <(ac)> <(ad)> <(ae)> <(af)> <(bc)> <(bd)> <(be)> <(bf)> <(cd)> <(ce)> <(cf)> <(de)> <(df)> <(ef)> Without Apriori property, 8*8+8*7/2=92 candidates Apriori prunes 44.57% candidates September 16, 2018 PAKDD01: Sequential Pattern Mining

Scan database once, collect support counts for length-2 candidates 19 length-2 candidates pass support threshold They are length-2 sequential patterns September 16, 2018 PAKDD01: Sequential Pattern Mining

Generate Length-3 Candidates
Self-join length-2 sequential patterns Based on Apriori property Examples <ab>, <aa> and <ba> are all length-2 sequential patterns  <aba> is a length-3 candidate <(bd)>, <bb> and <db> are all length-2 sequential patterns  <(bd)b> is a length-3 candidate 46 candidates are generated September 16, 2018 PAKDD01: Sequential Pattern Mining

Scan database once more, collect support counts for candidates 19 out of 46 candidates pass support threshold September 16, 2018 PAKDD01: Sequential Pattern Mining

Candidate Generation Candidate generation: Length-(k+1) candidates are generated from length-k sequential patterns based on the Apriori property Join phase: <xS> and <Sy> generate <xSy> <abcd> and <bcde> generate <abcde> <(abc)de> and <(bc)d(ef)> generate <(abc)d(ef)> Prune phase: delete candidates having infrequent length-k subsequence September 16, 2018 PAKDD01: Sequential Pattern Mining

Counting Candidates Scan database, collect support counts for candidates Use a hash-tree to store candidates Reduce the number of candidates need to be checked <{a, b}*> Hashing functions <?{ab}*> <?{c, d}*> <aca> <bde> … <aab> <aba> … September 16, 2018 PAKDD01: Sequential Pattern Mining

The GSP Mining Process Cand. cannot pass sup. threshold
<a> <c> <d> <e> <f> <g> <h> <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)> <abb> <aab> <aba> <baa> <bab> … <abba> <(bd)bc> … <(bd)cba> 1st scan: 8 cand. 6 length-1 seq. pat. 2nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all 3rd scan: 46 cand. 19 length-3 seq. pat. 20 cand. not in DB at all 4th scan: 8 cand. 6 length-4 seq. pat. 5th scan: 1 cand. 1 length-5 seq. pat. Cand. cannot pass sup. threshold Cand. not in DB at all <a(bd)bcb(ade)> 50 <(be)(ce)d> 40 <(ah)(bf)abf> 30 <(bf)(ce)b(fg)> 20 <(bd)cb(ac)> 10 Sequence Seq. ID min_sup =2 September 16, 2018 PAKDD01: Sequential Pattern Mining

The GSP Algorithm Take sequences in form of <x> as length-1 candidates Scan database once, find F1, the set of length-1 sequential patterns Let k=1; while Fk is not empty do Form Ck+1, the set of length-(k+1) candidates from Fk; If Ck+1 is not empty, scan database once, find Fk+1, the set of length-(k+1) sequential patterns Let k=k+1; September 16, 2018 PAKDD01: Sequential Pattern Mining

Bottlenecks of Apriori–Like Methods
A huge set of candidates could be generated 1,000 frequent length-1 sequences generate length-2 candidates! Many scans of database in mining Encounter difficulty when mining long sequential patterns Exponential number of short candidates A length-100 sequential pattern needs candidate sequences! September 16, 2018 PAKDD01: Sequential Pattern Mining

Prefix and Postfix (Projection)
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)> Given sequence <a(abc)(ac)d(cf)> Prefix Postfix /Projection <a> <(abc)(ac)d(cf)> <aa> <(_bc)(ac)d(cf)> <ab> <(_c)(ac)d(cf)> September 16, 2018 PAKDD01: Sequential Pattern Mining

Mine Sequential Patterns by Prefix Projections
Step 1: find length-1 sequential patterns <a>, , <c>, <d>, <e>, <f> Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets: The ones having prefix <a>; The ones having prefix ; … The ones having prefix <f> SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> September 16, 2018 PAKDD01: Sequential Pattern Mining

Find Seq. Patterns with Prefix <a>
Only need to consider projections w.r.t. <a> <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc> Find all the length-2 seq. pat. Having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af> Further partition into 6 subsets Having prefix <aa>; … Having prefix <af> SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> September 16, 2018 PAKDD01: Sequential Pattern Mining

Completeness of PrefixSpan
SDB SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> Length-1 sequential patterns <a>, , <c>, <d>, <e>, <f> Having prefix <c>, …, <f> Having prefix <a> Having prefix <a>-projected database <(abc)(ac)d(cf)> <(_d)c(bc)(ae)> <(_b)(df)cb> <(_f)cbc> -projected database … Length-2 sequential patterns <aa>, <ab>, <(ab)>, <ac>, <ad>, <af> … … Having prefix <aa> Having prefix <af> <aa>-proj. db … <af>-proj. db September 16, 2018 PAKDD01: Sequential Pattern Mining

Efficiency of PrefixSpan
No candidate sequence needs to be generated Projected databases keep shrinking Major cost of PrefixSpan: constructing projected databases Can be improved by bi-level projections September 16, 2018 PAKDD01: Sequential Pattern Mining

Pair-wise Checking Using S-matrix
SDB SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> Length-1 sequential patterns <a>, , <c>, <d>, <e>, <f> <aa> happens twice <(ac)> happens twice a 2 b (4, 2, 2) 1 c (4, 2, 1) (3, 3, 2) 3 d (2, 1, 1) (2, 2, 0) (1, 3, 0) e (1, 2, 1) (1, 2, 0) (1, 1, 0) f (1, 1, 1) (2, 0, 1) S-matrix <ac> happens 4 times <ca> happens twice All length-2 sequential patterns are found in S-matrix September 16, 2018 PAKDD01: Sequential Pattern Mining

Scaling-up by Bi-level Projection
Partition search space based on length-2 sequential patterns Only form projected databases and pursue recursive mining over bi-level projected databases September 16, 2018 PAKDD01: Sequential Pattern Mining

Mining <ab>-projected Database
SDB Length-1 sequential patterns <a>, , <c>, <d>, <e>, <f> SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc> a 2 b (4, 2, 2) 1 c (4, 2, 1) (3, 3, 2) 3 d (2, 1, 1) (2, 2, 0) (1, 3, 0) e (1, 2, 1) (1, 2, 0) (1, 1, 0) f (1, 1, 1) (2, 0, 1) S-matrix <ab>-projected database <(_c)(ac(cf)> <(_c)a> <c> No hope to form (_ac), so no need to count it. Lead to pattern <a(bc)a> S-matrix a c (1, 0, 1) 1 (_c) (, 2, ) (, 1, )  Local length-1 sequential patterns: <a>, <c>, <(_c)> September 16, 2018 PAKDD01: Sequential Pattern Mining

Benefits of Bi-level Projection
More patterns are found in each shoot Much less projections In the example, there are 53 patterns. 53 level-by-level projections 22 bi-level projections September 16, 2018 PAKDD01: Sequential Pattern Mining

3-way Apriori Checking Using Apriori heuristic to prune items in projected databases Absorb goodness of Apriori-like algorithms <acd> cannot be a pattern! Exclude d from <ac>-projected database a 2 b (4, 2, 2) 1 c (4, 2, 1) (3, 3, 2) 3 d (2, 1, 1) (2, 2, 0) (1, 3, 0) e (1, 2, 1) (1, 2, 0) (1, 1, 0) f (1, 1, 1) (2, 0, 1) September 16, 2018 PAKDD01: Sequential Pattern Mining

Speed-up by Pseudo-projection
Major cost of PrefixSpan: projection Postfixes of sequences often appear repeatedly in recursive projected databases When the (projected) database fit in memory, use pointers to form projections Pointer to the sequence Offset of the postfix s=<a(abc)(ac)d(cf)> <a> s|<a>: ( , 2) <(abc)(ac)d(cf)> <ab> s|<ab>: ( , 4) <(_c)(ac)d(cf)> September 16, 2018 PAKDD01: Sequential Pattern Mining

Pseudo-Projection vs. Physical Projection
Pseudo-projection avoids physically copying postfixes Efficient when database fits in main memory Not efficient when database cannot fit in main memory Disk-based random accessing is very costly Suggested Approach: Integration of physical and pseudo-projection Swapping to pseudo-projection when the data set fits in memory September 16, 2018 PAKDD01: Sequential Pattern Mining

Seeing is Believing: Experiments and Performance Analysis
Comparing PrefixSpan with GSP and FreeSpan in large databases GSP (IBM Almaden, Srikant & Agrawal EDBT’96) FreeSpan (J. Han J. Pei, B. Mortazavi-Asi, Q. Chen, U. Dayal, M.C. Hsu, KDD’00) Prefix-Span-1 (single-level projection) Prefix-Scan-2 (bi-level projection) Comparing effects of pseudo-projection Comparing I/O cost and scalability September 16, 2018 PAKDD01: Sequential Pattern Mining

PrefixSpan Is Faster Than GSP and FreeSpan

Effect of Pseudo-Projection

I/O Cost: When It Cannot Fit in Memory

Scalability (When DB Is Large)

Major Features of PrefixSpan
Both PrefixSpan and FreeSpan are pattern-growth methods Searches are more focused and thus efficient Prefix-projected pattern growth (PrefixSpan) is more elegant than frequent pattern-guided projection (FreeSpan) Apriori heuristic is integrated into bi-level projection PrefixSpan Pseudo-projection substantially enhances the performance of the memory-based processing September 16, 2018 PAKDD01: Sequential Pattern Mining

Some Other Interesting Methods
Spade (Zaki, 98) Use vertical list format, for each sequential pattern, maintain a list of sequence-ids containing the pattern Find support via list intersections FreeSpan (Han, et al. KDD’00) Database-projection based on the currently mined frequent itemsets (corrupt sequences to itemsets in projections) September 16, 2018 PAKDD01: Sequential Pattern Mining

Constraint-Based Sequential Pattern Mining
Taxonomies, sliding windows and time gap constraints Anti-monotone and monotone constraints Regular expression constraints September 16, 2018 PAKDD01: Sequential Pattern Mining

Taxonomies September 16, 2018 PAKDD01: Sequential Pattern Mining

Sliding Windows Allowing a set of transactions to contain an element of a sequence Difference in transaction-times between transactions less than user-specified window-size September 16, 2018 PAKDD01: Sequential Pattern Mining

Time Gap Constraints Restrict the time gap between sets of transactions min_gap max_gap September 16, 2018 PAKDD01: Sequential Pattern Mining

Constraint Pushing in GSP
Push sliding windows and time constraints into support counting Push taxonomies Extend sequences, include ancestors Pre-compute ancestors Do not count patterns with an element containing an item and its ancestor September 16, 2018 PAKDD01: Sequential Pattern Mining

Anti-monotone Constraint
If sequence S violates constraint C, then so does every super-sequence of S  C is anti-monotone Examples: Sequence contain less than 10 elements The highest price in the sequence is less than $1,000 September 16, 2018 PAKDD01: Sequential Pattern Mining

Mining With Anti-monotone Constraints
Anti-monotone constraints can be enforced into PrefixSpan Only grow patterns (prefixes) satisfying the constraint September 16, 2018 PAKDD01: Sequential Pattern Mining

Monotone Constraints If sequence S satisfies constraint C, then so does every non-empty subsequence of S Examples Sequence have more than 10 elements Sequence have an item with price $1,000 or higher September 16, 2018 PAKDD01: Sequential Pattern Mining

Enforcing Monotone Constraints
Monotone constraints can be pushed into PrefixSpan Constraint-checking is waived once the prefix satisfies the monotone constraint September 16, 2018 PAKDD01: Sequential Pattern Mining

Regular Expression Constraints
Constraints in the form of an automaton Deterministic finite automaton for regular expression 1*(22|234|44) 1 2 2 3 4 a b c d 4 September 16, 2018 PAKDD01: Sequential Pattern Mining

SPIRIT SPIRIT (Rastogi, et al.’99) “Mining sequential patterns with regular expression constraints” Sequential pattern mining with regular expression constraints Similar in structure to GSP Relax constraint by inducing a weaker one Use the relaxed constraint in the candidate generation and pruning phases of each pass September 16, 2018 PAKDD01: Sequential Pattern Mining

Part IV. Applications of Frequent and Sequential Pattern Mining: Web Usage Mining Time-series data mining From frequent pattern mining to sequential pattern mining Sequential pattern mining methods Web usage mining DNA mining September 16, 2018 PAKDD01: Sequential Pattern Mining

Why Discussing Web Usage Mining and DNA Mining?
The success of data mining should be demonstrated on “typical data sets” Nice features of Web usage data and DNA data Large volumes of data, widely available Controlled and reliable data collection The ability to evaluate results, convincing returns of investments Ease of integration with further studies Suitable for frequent/sequential pattern mining September 16, 2018 PAKDD01: Sequential Pattern Mining

A Classification of Web Mining
Web content mining Web document classification Web authoritative page discovery Multi-level Web warehouse Web usage mining Weblog data analysis Web structure mining Web clustering and Web sub-community discovery Web linkage analysis September 16, 2018 PAKDD01: Sequential Pattern Mining

Web Data Web data can be collected at Server side, client-side, proxy servers, or obtained from an organization’s database Web data has the following types Web content (real data) Structure (intra-page structures and inter-page links) Usage: URL requested, IP addresses, page references, date and time User profile: registration data and user profile information September 16, 2018 PAKDD01: Sequential Pattern Mining

Web Usage Mining Using data mining techniques to discover usage patterns from Web data Applications Target potential customers for e-commerce Enhance the quality and delivery of Internet information services to the end user Improve Web server system performance Identify potential prime advertisement locations September 16, 2018 PAKDD01: Sequential Pattern Mining

Three Phases of Web Usage Mining
Preprocessing Converting the usage, contents, and structure info contained in available data sources into data abstractions for pattern discovery Pattern discovery Applies pattern discovery methods and algorithms developed from statistics, data mining, machine learning, and pattern recognition Pattern analysis Using OLAP, querying or visualization methods to analyze discovered patterns September 16, 2018 PAKDD01: Sequential Pattern Mining

Pattern Discovery Techniques in Web Usage Mining (1)
Statistical analysis Descriptive statistical analyses (frequency, mean, etc.) on page views, viewing time, length of navigation path, etc. Periodic report: most frequently accessed pages, average view time of a page, unauthorized entry points, etc. Improving system performance, enhancing security, support marketing decisions Data warehouse and OLAP techniques Multidimensional analysis, data cube techniques, drilling/rolling Association rule mining Frequently co-occurring patterns/paths, correlations Marketing, Web page pre-fetching, Web restructuring September 16, 2018 PAKDD01: Sequential Pattern Mining

Pattern Discovery Techniques in Web Usage Mining (2)
Clustering Clustering users (with similar browsing patterns) for market segmentation or personalized Web contents Clustering pages (with related contents) for Internet search engines and Web assistance providers Classification Find groups of users belonging to a category Sequential pattern mining Find intersession patterns for trend analysis, change point detection, or similarity analysis Dependency modeling Model the browsing behavior of users for prediction Use hidden Markov models, Bayesian Belief networks, etc. September 16, 2018 PAKDD01: Sequential Pattern Mining

Benefits of Understanding Web User Behavior
Improving web page organization Increasing web server performance Identifying potential prime advertising locations Targeting customers for electronic commerce Generating user profiles for customizing a site September 16, 2018 PAKDD01: Sequential Pattern Mining

Example Web Site September 16, 2018 PAKDD01: Sequential Pattern Mining

Existing Web Log Analysis Tools
Most of the existing products collect statistical information on page hits only Summary report of hits and bytes transferred List of top requested URLs List of top referrers List of most common browsers Hits per hour/day/week/month reports Hits per internet domain Error report Directory tree report Limited in comprehensiveness and depth of analysis September 16, 2018 PAKDD01: Sequential Pattern Mining

Data Mining Techniques
Frequently accessed groups of web pages Similar to association rule mining Sequential pattern mining Not necessarily consecutive pages Frequently traversed paths Consecutive pages Path similarity Clustering Plan mining Sequences of events leading to an action Multi-dimensional analysis using OLAP techniques September 16, 2018 PAKDD01: Sequential Pattern Mining

Web Sever Log File IP / user auth / time stamp / method / URL / HTTP version / return code / bytes transferred / referrer page URL / agent Peo-ill-21.ix.netcom.com - - [24/Feb/1997:00:00: ] “GET /images/nudge.gif HTTP/1.0” “ “Mozilla/2.0 (compatible; MSIE 3.01; Windows NT)” Proxy-level logs Client-level logs Server-level logs September 16, 2018 PAKDD01: Sequential Pattern Mining

W3C WCA Web Usage Abstractions
User A person using a client application to interact and retrieve resources from the server Web page Collection of resources identified by a single URI Page view The rendered web page in a specific client application September 16, 2018 PAKDD01: Sequential Pattern Mining

W3C WCA Web Usage Abstractions (cnt’d)
Click-stream A sequential series of page views by a user User session Set of user clicks across one or more servers Server session / visit Set of user clicks as recorded by a single server Episode Subset of related user clicks in user session September 16, 2018 PAKDD01: Sequential Pattern Mining

Design of Web Log Miner Web log is filtered to generate a relational database A data cube is generated form database OLAP is used to drill-down and roll-up in the cube Data mining techniques are used to extract interesting knowledge Data Cube Sliced and diced cube Database 2 Data Selection 3 OLAP Web log files 1 Data Cleaning & Transformation 4 Data Mining September 16, 2018 PAKDD01: Sequential Pattern Mining

Data Cleaning IP address, User, Timestamp, Method, File+Parameters, Status, Size Web Log Generic Cleaning and Transformation Machine, Internet domain, Internet sub-domain, User, Date-Time, Method, Dir, Sub-Dir, File, File-Type, File-Sub-Type, Parameters, Status, Size URL completion, page view, user, session and episode identification, path completion Site structure Database September 16, 2018 PAKDD01: Sequential Pattern Mining

Shortcomings of Analyzing Server Log Files
Caching: necessary for the web, but … No log entry for corresponding hit Client-level browser cache Proxy servers and fire-wall cache Could be added during path completion Difficulties with identifying a user Generally no user ID Proxy: many users / IP address Many IP addresses / user Different browsers per client site September 16, 2018 PAKDD01: Sequential Pattern Mining

Mining Frequent Sequences from Web Logs
Find out frequent n-sequences, or n-grams and their occurrence counts in all sessions. (n-sequence or n-gram is a sequence with a length of n) A sequence is considered frequent when its count exceeds a threshold  September 16, 2018 PAKDD01: Sequential Pattern Mining

Generate N-gram Prediction Rules
For each frequent N-gram, S1 S2 …Sk, we can generate a rule: S1 S2 …Sk-1  Sk ( conf ) where conf = P(Sk |S1 S2 …Sk-1 ) = Count(S1 S2 …Sk ) / Count(S1 S2 …Sk-1 ) From the EOPT, we can generate a rule for each embedded object Oi in Sk, S1 S2 …Sk-1  Oi ( confi ) where confi = P(Oi |S1 S2 …Sk-1 ) = conf * Pi September 16, 2018 PAKDD01: Sequential Pattern Mining

Prediction Algorithm Given a user’s access history, the algorithm matches the longest pattern to the LHS of the rules first. If no k-length pattern is matched, patterns of length k-1, k-2, … , 1 are applied accordingly. Our prediction contains a set of predicted objects with a corresponding probability: BCDA F (0.70) O1 (0.70) O2 (0.60) September 16, 2018 PAKDD01: Sequential Pattern Mining

Evaluation Metrics Recall: R = Np / N Np:The number of predicted requests N : The total number of requests How many requests can be predicted? Precision: P = P+/ (P+ + P-) P+:The number of correct predictions P-:The number of not correct predictions How many predictions are correct? September 16, 2018 PAKDD01: Sequential Pattern Mining

Part V. Applications of Frequent and Sequential Pattern Mining: DNA Mining Time-series data mining From frequent pattern mining to sequential pattern mining Sequential pattern mining methods Web usage mining DNA mining September 16, 2018 PAKDD01: Sequential Pattern Mining

Genome & DNA Genome: the complete set of instructions for making an organism Genome consists of threads of DNA and associated proteins Genome is organized into chromosomes September 16, 2018 PAKDD01: Sequential Pattern Mining

Gene A segment of a DNA molecule A specific sequence of nucleotide basis The human genome has ~100k genes Average size: 3k bp Largest known continuous DNA sequence: 350k base September 16, 2018 PAKDD01: Sequential Pattern Mining

Human Genome Project Create detailed map of each human chromosome Divide chromosomes into smaller fragments Characterize fragments and map them to corresponding chromosome locations Determine the base sequence of ordered DNA fragments Mapping and sequencing Extract useful information from maps and sequences September 16, 2018 PAKDD01: Sequential Pattern Mining

Genome Map Order of genes or other markers and the spacing between them Bioinformatics provides techniques for Representation: different scales and levels of abstraction of genome maps Visualization and use September 16, 2018 PAKDD01: Sequential Pattern Mining

Bioinformatics Resources
Integrated portals: Entrez Database search tools: BLAST, PDB Database integration Sequence analysis tools Sequence translation Format conversion Pattern identification Multiple alignment Gene finding September 16, 2018 PAKDD01: Sequential Pattern Mining

Interesting Topics Integration of heterogeneous databases Automation of DBs cleaning and organization DNA and protein sequence and structure Alignment, analysis, pattern & association discovery Patterns of networks of gene interacting gene products Modeling of biological processes Protein structure and function prediction Designing samll molecules to inhibit or augment biological function, or modified macromolecules for medical use Understanding the influence of genetic factors on diseases September 16, 2018 PAKDD01: Sequential Pattern Mining

Approaches Statistical methods Neural networks Hidden Markov models Support vectors Bayesian belief networks Rule induction Genetic algorithms Fuzzy and rough sets Case-based reasoning September 16, 2018 PAKDD01: Sequential Pattern Mining

Data Mining in DNA Hypothesis generation and testing Find underlying principles Domain organization and evolution Identify relevant and effective representation Clustering, pattern mining, similarity-based retrieval Sequence analysis Gene expression analysis September 16, 2018 PAKDD01: Sequential Pattern Mining

Patterns in DNA Simple strings Rigid gaps Flexible gaps Unrestricted gaps Mismatches September 16, 2018 PAKDD01: Sequential Pattern Mining

Pattern Discovery in DNA
Find all patterns appear in k input sequences Find all patterns repeat k times in a sequence September 16, 2018 PAKDD01: Sequential Pattern Mining

Methods Bottom up/enumeration/combinatorial Apriori-like Start with a trivial pattern (single character) Extend it in all possible ways Check if the extension has sufficient support If so, repeat to extend more patterns If no, backtrack September 16, 2018 PAKDD01: Sequential Pattern Mining

Fault-tolerant Mining
Insertion Deletion Mutation Use alignment to determine pattern similarity September 16, 2018 PAKDD01: Sequential Pattern Mining

Finding Tandem Repeats: An Example
Tandem repeats: a stretch of DNA (approximately) repeat contiguous in DNA …TCGGCGGCGGA… Tandem repeats occur frequently in genomic sequences Hard to detect Useful in many aspects: diseases, gene regulation, … September 16, 2018 PAKDD01: Sequential Pattern Mining

Methods Computing alignment matrices Need excessive running time Data compression-like methods Provide a measure of statistical significance Hard to find long patterns September 16, 2018 PAKDD01: Sequential Pattern Mining

A Sequence Analysis Method
Benson, 1999 Two stages Detection Analysis September 16, 2018 PAKDD01: Sequential Pattern Mining

Detection Probes: very short (length 1 or 2) sequences Length 1 probes: A, C, G, T Length 2 probes: AA, AC, …, TT. 16 probes Sliding window Find repeat of probes in sliding window CG, GG, GC has enough repeat count …TCGGCGGCGGA… September 16, 2018 PAKDD01: Sequential Pattern Mining

Analysis Align potential candidates Tandem repeats: patterns passing alignment score threshold Statistical models are used to handle faults …TCGGCGGCGGA… September 16, 2018 PAKDD01: Sequential Pattern Mining

Advantages Fast Can handle reasonably long patterns Sequential pattern mining techniques may work September 16, 2018 PAKDD01: Sequential Pattern Mining

Challenges Large databases Very long sequences Missing and noisy data Complex relationships ... September 16, 2018 PAKDD01: Sequential Pattern Mining

Conclusions Time-series data mining From frequent pattern mining to sequential pattern mining Sequential pattern mining methods Web usage mining DNA mining September 16, 2018 PAKDD01: Sequential Pattern Mining

Conclusions Time-series or sequential pattern mining have been playing an important role in data mining A lot of interesting methods have been developed in this field A lot of more active research work need to be done to make mining highly effective and efficient September 16, 2018 PAKDD01: Sequential Pattern Mining

References (1) R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proc. 4th Int. Conf. Foundations of Data Organization and Algorithms, Chicago, Oct R. Agrawal, K.-I. Lin, H.S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. VLDB'95, Zurich, Switzerland, Sept R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. VLDB'95, Zurich, Switzerland, Sept R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, Taipei, Taiwan, Mar P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, 1998. A. Baxevanis and B. F. F. Ouellette. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley & Sons, 1998. S. Benninga and B. Czaczkes. Financial Modeling. MIT Press, 1997. G. Benson. Tandem repeats finder: a program to analyze DNA sequences. In Nucleic Acides Research, 1999. M. J. A. Berry and G. Linoff. Mastering Data Mining: The Art and Science of Customer Relationship Management. John Wiley & Sons, 1999. September 16, 2018 PAKDD01: Sequential Pattern Mining

References (2) C. Bettini, X. Sean Wang, and S. Jajodia. Mining temporal relationships with multiple granularities in time sequences. Data Engineering Bulletin, 21:32-38, 1998. S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. VLDB'98, New York, NY, Aug C. Chatfield. The Analysis of Time Series: An Introduction, 3rd ed. Chapman and Hall, 1984. M. Goebel and L. Gruenwald. A survey of data mining and knowledge discovery software tools. SIGKDD Explorations, 1:20-33, 1999. D. Gusfield. Algorithms on Strings, Trees and Sequences, Computer Science and Computation Biology. Cambridge University Press, New York, 1997. J. Han et al. DNA-Miner: A system prototype for mining DNA sequences, SIGMOD’01, Santa Barbara, 2001. J. M. Kleinberg, C. Papadimitriou, and P. Raghavan. A microeconomic view of data mining. Data Mining and Knowledge Discovery, 2: , 1998. September 16, 2018 PAKDD01: Sequential Pattern Mining

References (3) C. Faloutsos and K.-I. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. SIGMOD'95, San Jose, CA, May 1995. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.). Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996. A. Floratos, I. Jurisica and I. Rigoutsos. Knowledge discovery in biological domains. Tutorial in KDD’00. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. SIGMOD'94, Minneapolis, Minnesota, May 1994. D. Gusfield. Algorithms on strings, trees and sequences. Cambridge, 1997. J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, Sydney, Australia, Apr September 16, 2018 PAKDD01: Sequential Pattern Mining

References (4) J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. Freespan: Frequent pattern-projected sequential pattern mining. KDD'00, Boston, MA, Aug H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. DMKD'98, Seattle, WA, June 1998. H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1: , 1997. B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, Orlando, FL, Feb D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. SIGMOD'97, Tucson, Arizona, May 1997. R. H. Shumway. Applied Statistical Time Series Analysis. Prentice Hall, 1988. J. Srivastava, R. Cooley, M. Deshpande, and P. N. Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1:12-23, 2000. September 16, 2018 PAKDD01: Sequential Pattern Mining

References (5) M. S. Waterman. Introduction to Computational Biology: Maps, Sequences, and Genomes (Interdisciplinary Statistics). CRC Press, 1995. J. Wang et al. (ed.) Pattern discovery in biomolecular data. Oxford University Press, 1999. B.-K. Yi, H. V. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time warping. ICDE'98, Orlando, FL, Feb C. T. Yu and W. Meng. Principles of Database Query Processing for Advanced Applications. Morgan Kaufmann, 1997. B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. ICDE'00, San Diego, CA, Feb C. Zaniolo, S. Ceri, C. Faloutsos, R. T. Snodgrass, C. S. Subrahmanian, and R. Zicari. Advanced Database Systems. Morgan Kaufmann, 1997. M. J. Zaki, N. Lesh, and M. Ogihara. PLANMINE: Sequence mining for plan failures. KDD'98, New York, NY, Aug O. R. Za"iane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. ADL'98, Santa Barbara, CA, Apr September 16, 2018 PAKDD01: Sequential Pattern Mining

Thank you !!! September 16, 2018 PAKDD01: Sequential Pattern Mining

Sequential Pattern Mining: From Market Basket Analysis to DNA Mining

Similar presentations

Presentation on theme: "Sequential Pattern Mining: From Market Basket Analysis to DNA Mining"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Sequential Pattern Mining: From Market Basket Analysis to DNA Mining

Similar presentations

Presentation on theme: "Sequential Pattern Mining: From Market Basket Analysis to DNA Mining"— Presentation transcript:

Similar presentations

About project

Feedback