
1 Mining Sequential Patterns Authors: Rakesh Agrawal and Ramakrishnan Srikant. Presenter: Jeremy Dalmer.

2 Introduction: What is a “sequential pattern”? Answers to final exam questions.

3 What is a “sequential pattern”? Requires a set of attributes that decides each tuple’s class. Call this the class set. Example: class set = {customer-id}. Tuples are sorted into classes.

4 Requires a set of attributes used for ordering tuples. Call this the order set. Example: order set = {transaction-time}. Tuples within each class are sorted according to an order defined over the order set’s codomain.

5 Specifying a value for each attribute in (class set ∪ order set) must specify at most one tuple; that is, (class set ∪ order set) forms a primary key. Support and confidence now measure classes, not tuples.

6 Example: order set = {transaction-time}, class set = {customer-id}.

7 Classes: {Joe, Sarah}

8 Ordering within classes according to order set:
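The grouping-and-ordering step is easy to sketch. The tables on the original slides were images, so the rows below are a reconstruction: the customer names come from slide 7 and the items from slide 17, but the exact transactions are an assumption, and the helper name group_and_order is mine.

```python
from collections import defaultdict

# Reconstructed transaction table: (customer_id, transaction_time, items).
# The exact rows are assumed; only the customer and item names appear in the slides.
transactions = [
    ("Sarah", 2, {"knife", "Band-Aids"}),
    ("Joe",   1, {"knife", "beer"}),
    ("Sarah", 1, {"knife", "beer"}),
    ("Joe",   2, {"Band-Aids"}),
]

def group_and_order(rows, class_of, order_of):
    """Group tuples by the class set, sort each class by the order set,
    and keep each class's transactions (item sets) in that order."""
    classes = defaultdict(list)
    for row in rows:
        classes[class_of(row)].append(row)
    return {
        cls: [items for _, _, items in sorted(members, key=order_of)]
        for cls, members in classes.items()
    }

sequences = group_and_order(
    transactions,
    class_of=lambda r: r[0],   # class set = {customer-id}
    order_of=lambda r: r[1],   # order set = {transaction-time}
)
# Joe   -> [{knife, beer}, {Band-Aids}]
# Sarah -> [{knife, beer}, {knife, Band-Aids}]
```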

9 A large sequence (support = 100%) is <{knife, beer}, {Band-Aids}> (intuitive). <{knife}>, <{beer}>, <{knife, beer}>, <{Band-Aids}>, <{knife}, {Band-Aids}>, and <{beer}, {Band-Aids}> are also large sequences.
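Support here is measured over classes, not tuples: a customer contributes to a sequence’s support if the sequence’s itemsets appear, in order, as subsets of some of that customer’s transactions. A minimal sketch, reusing the sequences dictionary built above (the helper names contains and support are mine):

```python
def contains(customer_seq, pattern):
    """True if pattern is contained in customer_seq: each itemset of pattern
    is a subset of some transaction, and the matches occur in order."""
    i = 0
    for transaction in customer_seq:
        if i < len(pattern) and pattern[i] <= transaction:
            i += 1
    return i == len(pattern)

def support(sequences, pattern):
    """Fraction of classes (customers) whose sequence contains the pattern."""
    hits = sum(contains(seq, pattern) for seq in sequences.values())
    return hits / len(sequences)

support(sequences, [{"knife", "beer"}, {"Band-Aids"}])   # 1.0 with the data above
```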

10 Example: order set = {year, month}, class set = {}.

11 Classes: {}

12 Ordering within classes (class!) according to order set:

13 Intuition suggests the large sequence <{goldfish}, {lobster}, {monkey}>, but this is not considered any “larger” than <{goldfish}>, <{lobster}>, or <{monkey}>, because there is only one class.

14 One more point about the previous example. Having recorded <{goldfish}, {lobster}, {monkey}> as a large sequence, why record its subsequences? <{goldfish}, {lobster}> and <{lobster}, {monkey}>, though large sequences, are not informative. Hence the notion of a “maximal sequence”.

15 Answers to final exam questions. Root of each algorithm: (1) Group tuples into classes and order them. (2) Find all large itemsets. (3) For each tuple, drop everything except a record of the large itemsets contained in that tuple. (4) Find all large sequences (of large itemsets). (5) Discard large sequences that are not maximal.
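Phase (2) can be sketched with a brute-force itemset miner; support is again counted over customers, and a customer supports an itemset if any single transaction of theirs contains it. This is a stand-in, not the Apriori procedure the paper actually reuses, and the function name is mine:

```python
from itertools import combinations

def find_large_itemsets(sequences, min_sup):
    """Large itemsets, with support counted over classes (customers).
    Brute-force sketch: fine for slide-sized data, not for a real database."""
    n = len(sequences)
    all_items = {item for seq in sequences.values() for txn in seq for item in txn}
    large = []
    for k in range(1, len(all_items) + 1):
        found_at_k = False
        for candidate in combinations(sorted(all_items), k):
            cand = frozenset(candidate)
            hits = sum(any(cand <= txn for txn in seq) for seq in sequences.values())
            if hits / n >= min_sup:
                large.append(cand)
                found_at_k = True
        if not found_at_k:      # no large k-itemset means no large (k+1)-itemset
            break
    return large

# With the Joe/Sarah data and min_sup = 1.0 this finds exactly
# {knife}, {beer}, {Band-Aids}, and {knife, beer}, as on slide 17.
```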

16 Consider a previous example.

17 Large itemsets (min-sup = 100%): {knife}, {beer}, {knife, beer}, {Band-Aids}. Map {knife} to 1, {beer} to 2, {knife, beer} to 3, and {Band-Aids} to 4. The customers’ transformed sequences become ((1 2 3) (1 4)) and ((1 2 3) (4)).
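A sketch of the transformation phase, assuming the large itemsets are supplied as frozensets in the slide’s order so that the IDs come out as 1 through 4 (the helper name transform is mine):

```python
def transform(sequences, large_itemsets):
    """Replace each transaction by the IDs of the large itemsets it contains;
    transactions containing no large itemset are dropped."""
    ids = {itemset: i + 1 for i, itemset in enumerate(large_itemsets)}
    transformed = {}
    for cls, seq in sequences.items():
        new_seq = []
        for txn in seq:
            contained = tuple(sorted(ids[s] for s in large_itemsets if s <= txn))
            if contained:
                new_seq.append(contained)
        transformed[cls] = new_seq
    return transformed

ordered = [frozenset({"knife"}), frozenset({"beer"}),
           frozenset({"knife", "beer"}), frozenset({"Band-Aids"})]
transform(sequences, ordered)
# Joe   -> [(1, 2, 3), (4,)]   and   Sarah -> [(1, 2, 3), (1, 4)],
# i.e. ((1 2 3) (4)) and ((1 2 3) (1 4)) as on the slide.
```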

18 Large sequences (each with 100% support): ((1)), ((2)), ((3)), ((4)), ((1) (4)), ((2) (4)), and ((3) (4)). But since ((3) (4)) contains all the others as subsequences, only ((3) (4)) is a maximal large sequence.
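Maximality rests on the paper’s containment test: a sequence is contained in another if each of its itemsets is a subset of a corresponding itemset, in order, of the other sequence. A sketch over the original itemsets (helper names are mine):

```python
def is_subsequence(small, big):
    """True if small is contained in big: each itemset of small is a subset
    of some itemset of big, with the matches in order."""
    i = 0
    for itemset in big:
        if i < len(small) and small[i] <= itemset:
            i += 1
    return i == len(small)

def maximal_only(large_sequences):
    """Keep only large sequences not contained in any other large sequence."""
    return [s for s in large_sequences
            if not any(s != t and is_subsequence(s, t) for t in large_sequences)]

knife, beer, aids = frozenset({"knife"}), frozenset({"beer"}), frozenset({"Band-Aids"})
large = [(knife,), (beer,), (knife | beer,), (aids,),
         (knife, aids), (beer, aids), (knife | beer, aids)]
maximal_only(large)   # only (knife | beer, aids), i.e. ((3) (4))
```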

19 Potentially large vs. definitely large (candidate sequences vs. large sequences). Potentially large: no counting needed, but there are many. Definitely large: counting needed, but there are few. The algorithms are similar to Apriori, but work with sequences of large itemsets instead of large sets of items.

20 AprioriAll – counts every large sequence, including those that are not maximal. AprioriSome – generates every candidate sequence, but skips counting some large sequences (Forward Phase); it then discards candidates that are not maximal and counts the remaining large sequences (Backward Phase).
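A simplified sketch of AprioriAll’s level-wise loop over the transformed sequences from the earlier sketches. Candidate generation here is a naive self-join; the paper’s apriori-generate additionally prunes candidates that have a non-large (k-1)-subsequence, and AprioriSome uses the same machinery but counts only selected lengths in its forward phase.

```python
from itertools import product

def apriori_all(transformed, litemset_ids, min_sup):
    """transformed: {customer: [tuple of large-itemset IDs per transaction, ...]}
    litemset_ids: the IDs assigned in the transformation phase."""
    n = len(transformed)

    def supports(cust_seq, cand):
        # The customer supports cand if its IDs occur, in order,
        # inside successive transformed transactions.
        i = 0
        for txn in cust_seq:
            if i < len(cand) and cand[i] in txn:
                i += 1
        return i == len(cand)

    def count(cands):
        return [c for c in cands
                if sum(supports(seq, c) for seq in transformed.values()) / n >= min_sup]

    current = count([(lid,) for lid in litemset_ids])     # large 1-sequences
    all_large = list(current)
    while current:
        # Join: extend each large (k-1)-sequence with the last element of another
        # large (k-1)-sequence that shares its first k-2 elements.
        candidates = {p + (q[-1],) for p, q in product(current, current)
                      if p[:-1] == q[:-1]}
        current = count(candidates)
        all_large += current
    return all_large

# apriori_all(transform(sequences, ordered), [1, 2, 3, 4], 1.0) yields the seven
# large sequences of slide 18; a final maximality pass leaves only ((3) (4)).
```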

21 AprioriAll scans the database more, taking more time. AprioriSome keeps more potentially large sequences in memory, degenerating to AprioriAll when requests for memory fail.

22 “There were two types of algorithms presented to find sequential patterns, CountSome and CountAll. What was the main difference between the two algorithms?”

23 CountAll (AprioriAll) is careful with respect to minimum support, careless with respect to maximality. CountSome (AprioriSome) is careful with respect to maximality, careless with respect to minimum support.

24 “What was the greatest hardware concern regarding the algorithms contained in the paper?”

25 Main memory capacity. When there is little main memory, or many potentially large sequences, the benefits of AprioriSome vanish.

26 “How did the two best sequence mining algorithms (AprioriAll and AprioriSome) perform compared with each other? Take into consideration memory, speed, and usefulness of the data.”

27 Memory: In terms of main memory usage, AprioriAll is better. In terms of secondary storage access, AprioriSome is better.

28 Speed: With sufficient memory, the gap between AprioriAll and AprioriSome widens as minimum support decreases (AprioriSome is faster), because more non-maximal large sequences are generated, and AprioriSome avoids counting many of them.

29 Usefulness of the data: For the problem of finding maximal large sequences, the answer is “precisely the same.” However, AprioriAll finds all large sequences, while AprioriSome discards some large sequences that are not maximal. AprioriAll therefore produces more “useful” data. “The user may want to know the ratio of the number of people who bought the first k + 1 items in a sequence to the number of people who bought the first k items.”
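That ratio is straightforward to compute from supports: it is the support of the first k + 1 elements of the sequence divided by the support of the first k. A tiny sketch reusing the support() helper from the earlier sketch (the function name prefix_ratio is mine):

```python
def prefix_ratio(sequences, pattern, k):
    """Number of customers supporting the first k+1 itemsets of pattern,
    divided by the number supporting the first k (the common total-customer
    denominator cancels, so the support fractions can be used directly)."""
    return support(sequences, pattern[:k + 1]) / support(sequences, pattern[:k])
```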

