Institut für Scientific Computing - Universität Wien, P. Brezany
Data Mining Methods: Mining Association Rules and Sequential Patterns
KDD (Knowledge Discovery in Databases) Process
[Figure: the KDD process. Stages: Clean, Collect, Summarize; Data Preparation; Data Mining; Verification & Evaluation. Data: Operational Databases; Data Warehouse; Training Data; Model, Patterns.]
Mining Association Rules
Association rule mining finds interesting association or correlation relationships among a large set of data items. This can help in many business decision-making processes: store layout, catalog design, and customer segmentation based on buying patterns. Another important field is medical applications.
Market basket analysis is a typical example of association rule mining.
- How can we find association rules from large amounts of data?
- Which association rules are the most interesting?
- How can we help or guide the mining procedure?
Informal Introduction
Given a set of database transactions, where each transaction is a set of items, an association rule is an expression $X \Rightarrow Y$, where X and Y are sets of items (literals).
The intuitive meaning of the rule: transactions in the database which contain the items in X tend to also contain the items in Y.
Example: 98% of customers who purchase tires and auto accessories also buy some automotive services; here 98% is called the confidence of the rule. The support of the rule is the percentage of transactions that contain both X and Y.
The problem of mining association rules is to find all rules that satisfy a user-specified minimum support and minimum confidence.
Basic Concepts
Let $J = \{i_1, i_2, \ldots, i_m\}$ be a set of items. Typically, the items are identifiers of individual articles or products (e.g., bar codes).
Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that $T \subseteq J$.
Let A be a set of items: a transaction T is said to contain A if and only if $A \subseteq T$.
An association rule is an implication of the form $A \Rightarrow B$, where $A \subset J$, $B \subset J$, and $A \cap B = \emptyset$.
The rule $A \Rightarrow B$ holds in the transaction set D with support s, where s is the percentage of transactions in D that contain $A \cup B$ (i.e., both A and B). This is the probability $P(A \cup B)$.
Basic Concepts (Cont.)
The rule $A \Rightarrow B$ has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B, i.e., the conditional probability $P(B \mid A)$.
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset.
The occurrence frequency of an itemset is the number of transactions that contain the itemset.
Basic Concepts - Example

Transaction  Purchased items
1            bread, coffee, milk, cake
2            coffee, milk, cake
3            bread, butter, coffee, milk
4            milk, cake
5            bread, cake
6            bread

X = {coffee, milk}, R = {coffee, cake, milk}
Support of X = 3 of 6 = 50%
Support of R = 2 of 6 = 33%
The support of "milk, coffee" $\Rightarrow$ "cake" equals the support of R = 33%.
The confidence of "milk, coffee" $\Rightarrow$ "cake" = 2 of 3 = 67% [= support(R) / support(X)].
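The numbers in this example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the function names `support_count`, `support`, and `confidence` are chosen here for illustration.

```python
# The six transactions from the example table, one set of items each.
transactions = [
    {"bread", "coffee", "milk", "cake"},
    {"coffee", "milk", "cake"},
    {"bread", "butter", "coffee", "milk"},
    {"milk", "cake"},
    {"bread", "cake"},
    {"bread"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support_count(A union B) / support_count(A) for the rule A => B."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)


X = {"coffee", "milk"}
R = {"coffee", "cake", "milk"}
print(support(X, transactions))               # 3 of 6
print(support(R, transactions))               # 2 of 6
print(confidence(X, {"cake"}, transactions))  # 2 of 3
```

Representing each transaction as a Python `set` makes the "T contains A" test a plain subset comparison (`itemset <= t`).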
Basic Concepts (Cont.)
An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the product of min_sup and the total number of transactions in D.
The number of transactions required for the itemset to satisfy minimum support is therefore referred to as the minimum support count.
If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by $L_k$.
Association rule mining is a two-step process:
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.
Association Rule Classification
Based on the types of values handled in the rule:
- If a rule concerns associations between the presence or absence of items, it is a Boolean association rule. For example:
  computer $\Rightarrow$ financial_management_software [support = 2%, confidence = 60%]
- If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. For example:
  age(X, "30..39") $\land$ income(X, "42K..48K") $\Rightarrow$ buys(X, "high resolution TV")
  Note that the quantitative attributes, age and income, have been discretized.
Association Rule Classification (Cont.)
Based on the dimensions of data involved in the rule:
- If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. For example:
  buys(X, "computer") $\Rightarrow$ buys(X, "financial management software")
  The above rule refers to only one dimension, buys.
- If a rule references two or more dimensions, such as buys, time_of_transaction, and customer_category, then it is a multidimensional association rule. The second rule on the previous slide is a 3-dimensional association rule since it involves 3 dimensions: age, income, and buys.
Association Rule Classification (Cont.)
Based on the levels of abstraction involved in the rule set:
Suppose that a set of mined association rules includes:
  age(X, "30..39") $\Rightarrow$ buys(X, "laptop computer")
  age(X, "30..39") $\Rightarrow$ buys(X, "computer")
In the above rules, the items bought are referenced at different levels of abstraction. (E.g., "computer" is a higher-level abstraction of "laptop computer".) Such rules are called multilevel association rules.
Single-level association rules refer to one abstraction level only.
Mining Single-Dimensional Boolean Association Rules from Transactional Databases
This is the simplest form of association rules (used in market basket analysis).
We present Apriori, a basic algorithm for finding frequent itemsets. The name reflects that the algorithm uses prior knowledge of frequent itemset properties (explained later).
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-itemsets.
First, the set of frequent 1-itemsets, $L_1$, is found. $L_1$ is used to find $L_2$, the set of frequent 2-itemsets, which is used to find $L_3$, and so on, until no more frequent k-itemsets can be found.
The finding of each $L_k$ requires one full scan of the database.
The Apriori property is used to reduce the search space.
The Apriori Property
All nonempty subsets of a frequent itemset must also be frequent.
If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, that is, $P(I) < \text{min\_sup}$.
If an item A is added to the itemset I, then the resulting itemset (i.e., $I \cup A$) cannot occur more frequently than I. Therefore, $I \cup A$ is not frequent either, that is, $P(I \cup A) < \text{min\_sup}$.
How is the Apriori property used in the algorithm? To understand this, let us look at how $L_{k-1}$ is used to find $L_k$. A two-step process is followed, consisting of join and prune actions. These steps are explained on the next slides.
The Apriori Algorithm – the Join Step
To find $L_k$, a set of candidate k-itemsets is generated by joining $L_{k-1}$ with itself. This set of candidates is denoted by $C_k$.
Let $l_1$ and $l_2$ be itemsets in $L_{k-1}$. The notation $l_i[j]$ refers to the jth item in $l_i$ (e.g., $l_1[k-2]$ refers to the second to last item in $l_1$).
Apriori assumes that items within a transaction or itemset are sorted in lexicographic order.
The join $L_{k-1} \bowtie L_{k-1}$ is performed, where members of $L_{k-1}$ are joinable if their first (k-2) items are in common. That is, members $l_1$ and $l_2$ of $L_{k-1}$ are joined if
$$(l_1[1] = l_2[1]) \land (l_1[2] = l_2[2]) \land \ldots \land (l_1[k-2] = l_2[k-2]) \land (l_1[k-1] < l_2[k-1]).$$
The condition $l_1[k-1] < l_2[k-1]$ simply ensures that no duplicates are generated.
The resulting itemset is $l_1[1]\, l_1[2] \ldots l_1[k-1]\, l_2[k-1]$.
The Apriori Algorithm – the Join Step (2)
Illustration by an example:
  $p \in L_{k-1}$ = (1 2 3)
  $q \in L_{k-1}$ = (1 2 4)
  Join result in $C_k$ = (1 2 3 4)
Each frequent (k-1)-itemset p is extended by the last item of every frequent (k-1)-itemset q that has the same first k-2 items as p.
The Apriori Algorithm – the Prune Step
$C_k$ is a superset of $L_k$, that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in $C_k$.
A scan of the database to determine the count of each candidate in $C_k$ would result in the determination of $L_k$. $C_k$ can be huge, and so this could involve heavy computation.
To reduce the size of $C_k$, the Apriori property is used as follows. Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in $L_{k-1}$, then the candidate cannot be frequent either and so can be removed from $C_k$.
The above subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
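The join and prune steps can be sketched together as one candidate-generation function. This is an illustrative sketch rather than the original implementation: the name `apriori_gen` is an assumption, itemsets are stored as sorted tuples, and a plain Python `set` stands in for the hash tree mentioned above.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Build the candidate k-itemsets C_k from the frequent (k-1)-itemsets."""
    # Keep every itemset as a sorted tuple so that positions are comparable.
    prev = sorted(tuple(sorted(itemset)) for itemset in prev_frequent)
    prev_set = set(prev)  # simple stand-in for the hash tree
    candidates = []
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            # Join step: same first k-2 items, and l1's last item precedes l2's.
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                candidate = l1 + (l2[k - 2],)
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(sub in prev_set
                       for sub in combinations(candidate, k - 1)):
                    candidates.append(candidate)
    return candidates


L3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4}]
print(apriori_gen(L3, 4))      # [(1, 2, 3, 4)]
print(apriori_gen(L3[:2], 4))  # [] because (1, 3, 4) and (2, 3, 4) are missing
```

In the second call the join still produces (1 2 3 4), but the prune step removes it because two of its 3-item subsets are not frequent.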
The Apriori Algorithm - Example
Let's look at a concrete example of Apriori, based on the AllElectronics transaction database D, shown below. There are nine transactions in this database, i.e., |D| = 9. We use the next figure to illustrate the finding of frequent itemsets in D.

TID   List of item_IDs
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3
Generation of $C_k$ and $L_k$ (min. support count = 2)

$C_1$: scan D for the count of each candidate.
Itemset  Sup. count
{I1}     6
{I2}     7
{I3}     6
{I4}     2
{I5}     2

$L_1$: compare each candidate support count with the minimum support count.
Itemset  Sup. count
{I1}     6
{I2}     7
{I3}     6
{I4}     2
{I5}     2

$C_2$: generate candidates from $L_1$.
{I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5}

$C_2$: scan D for the count of each candidate.
Itemset  Sup. count
{I1,I2}  4
{I1,I3}  4
{I1,I4}  1
{I1,I5}  2
{I2,I3}  4
{I2,I4}  2
{I2,I5}  2
{I3,I4}  0
{I3,I5}  1
{I4,I5}  0

$L_2$: compare each candidate support count with the minimum support count.
Itemset  Sup. count
{I1,I2}  4
{I1,I3}  4
{I1,I5}  2
{I2,I3}  4
{I2,I4}  2
{I2,I5}  2
Generation of $C_k$ and $L_k$ (min. support count = 2)

$C_3$: generate candidates from $L_2$.
{I1,I2,I3}, {I1,I2,I5}

$C_3$: scan D for the count of each candidate.
Itemset     Sup. count
{I1,I2,I3}  2
{I1,I2,I5}  2

$L_3$: compare each candidate support count with the minimum support count.
Itemset     Sup. count
{I1,I2,I3}  2
{I1,I2,I5}  2
Algorithm Application Description
1. In the 1st iteration, each item is a member of $C_1$. The algorithm simply scans all the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2 (i.e., min_sup = 2/9 = 22%). $L_1$ can then be determined.
3. $C_2 = L_1 \bowtie L_1$.
4. The transactions in D are scanned and the support count of each candidate itemset in $C_2$ is accumulated, as shown in the middle table of the second row in the last figure.
5. The set of frequent 2-itemsets, $L_2$, is then determined, consisting of those candidate 2-itemsets in $C_2$ having minimum support.
Algorithm Application Description (2)
6. The generation of $C_3 = L_2 \bowtie L_2$ is detailed on the next slide. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from $C_3$.
7. The transactions in D are scanned in order to determine $L_3$, consisting of those candidate 3-itemsets in $C_3$ having minimum support.
8. $C_4 = L_3 \bowtie L_3$; after the pruning, $C_4 = \emptyset$.
Example: Generation of $C_3$ from $L_2$
1. Join:
   $C_3 = L_2 \bowtie L_2$
   = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}} $\bowtie$ {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}}
   = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}.
2. Prune using the Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
   - The 2-item subsets of {I1,I2,I3} are {I1,I2}, {I1,I3}, {I2,I3}, and they are all members of $L_2$. Therefore, keep {I1,I2,I3} in $C_3$.
   - The 2-item subsets of {I1,I2,I5} are {I1,I2}, {I1,I5}, {I2,I5}, and they are all members of $L_2$. Therefore, keep {I1,I2,I5} in $C_3$.
   - Using the same analysis, remove the other 3-itemsets from $C_3$: each has a 2-item subset that is not in $L_2$ ({I3,I5} for {I1,I3,I5} and {I2,I3,I5}, {I3,I4} for {I2,I3,I4}, and {I4,I5} for {I2,I4,I5}).
   Therefore, $C_3$ = {{I1,I2,I3}, {I1,I2,I5}} after pruning.
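The whole level-wise search on the AllElectronics database can be reproduced with a compact sketch. It is only an illustration under simplifying assumptions: the function name `apriori` is hypothetical, candidates are counted by a direct subset test instead of a hash tree, and the minimum support is given as a count.

```python
from itertools import combinations

# The nine AllElectronics transactions T100 ... T900.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]


def apriori(transactions, min_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    # First scan: count single items (C_1), then keep the frequent ones (L_1).
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_count}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        prev = sorted(frequent)
        prev_set = set(prev)
        candidates = []
        for i, l1 in enumerate(prev):
            for l2 in prev[i + 1:]:
                if l1[:-1] == l2[:-1]:                       # join step
                    cand = l1 + (l2[-1],)
                    if all(s in prev_set                     # prune step
                           for s in combinations(cand, k - 1)):
                        candidates.append(cand)
        # One scan of the database per level to count the candidates in C_k.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent


result = apriori(D, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

The printed counts can be compared against the $L_1$, $L_2$, and $L_3$ tables; the loop stops at k = 4 because the only joined candidate is pruned.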
Generating Association Rules from Frequent Items
We generate strong association rules, i.e., rules that satisfy both minimum support and minimum confidence.
$$\text{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\text{support\_count}(A \cup B)}{\text{support\_count}(A)}$$
where $\text{support\_count}(A \cup B)$ is the number of transactions containing the itemsets $A \cup B$, and $\text{support\_count}(A)$ is the number of transactions containing the itemset A.
Generating Association Rules from Frequent Items (Cont.)
Based on the equation on the previous slide, association rules can be generated as follows:
- For each frequent itemset l, generate all nonempty proper subsets of l.
- For every such subset s of l, output the rule "$s \Rightarrow (l - s)$" if
$$\frac{\text{support\_count}(l)}{\text{support\_count}(s)} \geq \text{min\_conf},$$
  where min_conf is the minimum confidence threshold.
Generating Association Rules - Example
Suppose that the transactional data for AllElectronics contain the frequent itemset l = {I1,I2,I5}. The resulting rules are:
  I1 $\land$ I2 $\Rightarrow$ I5, confidence = 2/4 = 50%
  I1 $\land$ I5 $\Rightarrow$ I2, confidence = 2/2 = 100%
  I2 $\land$ I5 $\Rightarrow$ I1, confidence = 2/2 = 100%
  I1 $\Rightarrow$ I2 $\land$ I5, confidence = 2/6 = 33%
  I2 $\Rightarrow$ I1 $\land$ I5, confidence = 2/7 = 29%
  I5 $\Rightarrow$ I1 $\land$ I2, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, since these are the only ones generated that are strong.
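The rule-generation procedure can be sketched as follows. The support counts are copied from the AllElectronics tables; the function name `generate_rules` and the tuple layout of the result are assumptions made for this illustration.

```python
from itertools import combinations

# Support counts of {I1,I2,I5} and of all its nonempty proper subsets.
support_count = {
    frozenset(["I1"]): 6,
    frozenset(["I2"]): 7,
    frozenset(["I5"]): 2,
    frozenset(["I1", "I2"]): 4,
    frozenset(["I1", "I5"]): 2,
    frozenset(["I2", "I5"]): 2,
    frozenset(["I1", "I2", "I5"]): 2,
}


def generate_rules(itemset, support_count, min_conf):
    """Return (antecedent, consequent, confidence) for every strong rule."""
    itemset = frozenset(itemset)
    rules = []
    # Sizes 1 .. len-1 enumerate exactly the nonempty proper subsets s.
    for size in range(1, len(itemset)):
        for subset in combinations(sorted(itemset), size):
            s = frozenset(subset)
            conf = support_count[itemset] / support_count[s]
            if conf >= min_conf:
                rules.append((sorted(s), sorted(itemset - s), conf))
    return rules


for antecedent, consequent, conf in generate_rules(
        {"I1", "I2", "I5"}, support_count, min_conf=0.7):
    print(antecedent, "=>", consequent, f"{conf:.0%}")
```

Because every subset of a frequent itemset is itself frequent, its support count is already known from the frequent-itemset phase, so no further database scan is needed here.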
Multilevel (Generalized) Association Rules
For many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to sparsity of data in multidimensional space.
Strong associations discovered at high concept levels may represent common sense knowledge. However, what may represent common sense to one user may seem novel to another.
Therefore, data mining systems should provide capabilities to mine association rules at multiple levels of abstraction and traverse easily among different abstraction spaces.
Multilevel (Generalized) Association Rules - Example
Suppose we are given the following task-relevant set of transactional data for sales at the computer department of an AllElectronics branch, showing the items purchased for each transaction TID.

Table Transactions
TID  Items purchased
T1   IBM desktop computer, Sony b/w printer
T2   Microsoft educational software, Microsoft financial software
T3   Logitech mouse computer accessory, Ergoway wrist pad accessory
T4   IBM desktop computer, Microsoft financial software
T5   IBM desktop computer ..
A Concept Hierarchy for our Example
[Figure: concept hierarchy tree, from Level 0 (root) to Level 3 (leaves).]
Level 0: all
Level 1: computer, software, printer, computer accessory
Level 2: desktop, laptop (computer); educational, financial (software); color, b/w (printer); wrist pad, mouse (computer accessory)
Level 3: brands, including IBM, Microsoft, HP, Ergoway, Logitech, Sony
Example (Cont.)
The items in Table Transactions are at the lowest level of the concept hierarchy. It is difficult to find interesting purchase patterns in data at such a raw or primitive level.
If, e.g., "IBM desktop computer" or "Sony b/w printer" each occurs in a very small fraction of the transactions, then it may be difficult to find strong associations involving such items. In other words, it is unlikely that the itemset "{IBM desktop computer, Sony b/w printer}" will satisfy minimum support.
Itemsets containing generalized items, such as "{IBM desktop computer, b/w printer}" and "{computer, printer}", are more likely to have minimum support.
Rules generated from association rule mining with concept hierarchies are called multiple-level, multilevel, or generalized association rules.
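One common way to mine at several abstraction levels is to add every ancestor of an item to the transaction before counting. The sketch below assumes this "extended transaction" approach, which the slides do not prescribe; the `parent` map covers only the items in rows T1 to T4 of Table Transactions (T5 is left out because it is truncated in the source), and the level-2 names such as "desktop computer" are assumed spellings.

```python
# Child -> parent links for the items that occur in T1..T4 (assumed spellings).
parent = {
    "IBM desktop computer": "desktop computer",
    "desktop computer": "computer",
    "Sony b/w printer": "b/w printer",
    "b/w printer": "printer",
    "Microsoft educational software": "educational software",
    "educational software": "software",
    "Microsoft financial software": "financial software",
    "financial software": "software",
    "Logitech mouse computer accessory": "mouse",
    "mouse": "computer accessory",
    "Ergoway wrist pad accessory": "wrist pad",
    "wrist pad": "computer accessory",
}

transactions = [
    {"IBM desktop computer", "Sony b/w printer"},                          # T1
    {"Microsoft educational software", "Microsoft financial software"},    # T2
    {"Logitech mouse computer accessory", "Ergoway wrist pad accessory"},  # T3
    {"IBM desktop computer", "Microsoft financial software"},              # T4
]


def extend(transaction):
    """Return the transaction plus all ancestors of its items."""
    extended = set(transaction)
    for item in transaction:
        while item in parent:
            item = parent[item]
            extended.add(item)
    return extended


def support_count(itemset, transactions):
    """Count the extended transactions that contain `itemset`."""
    return sum(1 for t in transactions if itemset <= extend(t))


print(support_count({"Microsoft educational software"}, transactions))
print(support_count({"software"}, transactions))
print(support_count({"computer", "software"}, transactions))
```

A generalized item is contained in every transaction that contains one of its descendants, so its support can never be lower than that of the specific item.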
Parallel Formulation of Association Rules
Need:
– Huge transaction datasets (10s of TB)
– Large number of candidates
Data distribution:
– Partition the transaction database, or
– Partition the candidates, or
– Both
Parallel Association Rules: Count Distribution (CD)
- Each processor has the complete candidate hash tree.
- Each processor updates its hash tree with local data.
- Each processor participates in a global reduction to get the global counts of the candidates in the hash tree.
- Multiple database scans per iteration are required if the hash tree is too big for memory.
CD: Illustration
[Figure: processors P0, P1, and P2 each hold N/p transactions and a full copy of the candidate counts for {1,2}, {1,3}, {2,3}, {3,4}, {5,8}; the local counts are combined by a global reduction of counts.]
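Count Distribution can be simulated sequentially to show the data flow. This is only a sketch under assumptions: the "processors" are loop iterations rather than real processes, a `Counter` replaces the candidate hash tree, and the AllElectronics transactions from the Apriori example are reused as input.

```python
from collections import Counter
from itertools import combinations

D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
# Every processor holds the complete candidate set C_2.
C2 = [frozenset(c) for c in combinations(["I1", "I2", "I3", "I4", "I5"], 2)]


def local_counts(partition, candidates):
    """Counts of all candidates over one processor's local transactions."""
    counts = Counter({c: 0 for c in candidates})
    for t in partition:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts


def count_distribution(transactions, candidates, p):
    """Simulate CD with p processors holding about N/p transactions each."""
    partitions = [transactions[i::p] for i in range(p)]
    partial = [local_counts(part, candidates) for part in partitions]
    # Global reduction: element-wise sum of the local count tables.
    total = Counter()
    for counts in partial:
        total.update(counts)
    return total


global_counts = count_distribution(D, C2, p=3)
print(global_counts[frozenset({"I1", "I2"})])  # same as the sequential count
```

Only the count tables are exchanged, never the transactions, which is why the communication volume depends on the number of candidates and not on the database size.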
Parallel Association Rules: Data Distribution (DD)
- Candidate set is partitioned among the processors.
- Once local data has been partitioned, it is broadcast to all other processors.
- High communication cost due to data movement.
- Redundant work due to multiple traversals of the hash trees.
DD: Illustration
[Figure: processors P0, P1, and P2 each hold N/p local transactions and a disjoint part of the candidate counts ({1,2}, {1,3} / {2,3}, {3,4} / {5,8}); each processor also receives remote data from the others through an all-to-all broadcast.]
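Data Distribution can be simulated in the same sequential style. Again this is a sketch under assumptions: loop iterations stand in for processors, list concatenation stands in for the broadcast of remote data, and the function name `data_distribution` is hypothetical.

```python
from itertools import combinations

D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
C2 = [frozenset(c) for c in combinations(["I1", "I2", "I3", "I4", "I5"], 2)]


def data_distribution(transactions, candidates, p):
    """Simulate DD: disjoint candidate subsets, every processor sees all data."""
    data_parts = [transactions[i::p] for i in range(p)]
    cand_parts = [candidates[i::p] for i in range(p)]
    counts = {}
    for proc in range(p):
        # Local data first, then the remote data received from the others.
        remote = [t for q in range(p) if q != proc for t in data_parts[q]]
        all_data = data_parts[proc] + remote
        # Each processor counts only its own share of the candidates.
        for c in cand_parts[proc]:
            counts[c] = sum(1 for t in all_data if c <= t)
    return counts


dd_counts = data_distribution(D, C2, p=3)
print(dd_counts[frozenset({"I1", "I2"})])
```

Compared with Count Distribution, each processor stores fewer candidates but has to process the entire database, which is where the communication cost noted above comes from.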
Predictive Model Markup Language – PMML and Visualization
Predictive Model Markup Language - PMML
A markup language (XML) to describe data mining models.
PMML describes:
– the inputs to data mining models
– the transformations used to prepare data prior to data mining
– the parameters which define the models themselves
PMML 2.1 – Association Rules (1): 1. Model attributes (1) …
PMML 2.1 – Association Rules (2): 1. Model attributes (2)
PMML 2.1 – Association Rules (3): 2. Items
PMML 2.1 – Association Rules (4): 3. ItemSets
PMML 2.1 – Association Rules (5): 4. AssociationRules
PMML example model for AssociationRules (1)
PMML example model for AssociationRules (2)
PMML example model for AssociationRules (3)
PMML example model for AssociationRules (4)
<AssociationRule support="0.9" confidence="0.85" antecedent="4" consequent="3" />
<AssociationRule support="0.9" confidence="0.75" antecedent="1" consequent="6" />
<AssociationRule support="0.9" confidence="0.70" antecedent="6" consequent="1" />
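Rule elements like these can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is an assumption-laden illustration: the three rules are wrapped in a bare `<AssociationModel>` root without namespace or the other model parts (which are not reproduced in this transcript), and the function name `strong_rules` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Minimal wrapper around the three rules shown above; a real PMML file
# contains further elements (header, items, itemsets) that are omitted here.
fragment = """
<AssociationModel>
  <AssociationRule support="0.9" confidence="0.85" antecedent="4" consequent="3" />
  <AssociationRule support="0.9" confidence="0.75" antecedent="1" consequent="6" />
  <AssociationRule support="0.9" confidence="0.70" antecedent="6" consequent="1" />
</AssociationModel>
"""


def strong_rules(xml_text, min_conf):
    """Return (antecedent id, consequent id, support, confidence) tuples."""
    root = ET.fromstring(xml_text)
    rules = []
    for rule in root.iter("AssociationRule"):
        conf = float(rule.get("confidence"))
        if conf >= min_conf:
            rules.append((rule.get("antecedent"), rule.get("consequent"),
                          float(rule.get("support")), conf))
    return rules


for rule in strong_rules(fragment, min_conf=0.75):
    print(rule)
```

The `antecedent` and `consequent` attributes hold identifiers of itemsets, so a complete reader would also resolve them against the itemset definitions of the model.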
Visualization of Association Rules (1)
1. Table Format

Antecedent        Consequent        Support  Confidence
PC, Monitor       Printer           90%      85%
PC                Printer, Monitor  90%      75%
Printer, Monitor  PC                80%      70%
Visualization of Association Rules (2)
2. Directed Graph
[Figure: directed graph over the items PC, Monitor, and Printer.]
Visualization of Association Rules (3)
3. 3-D Visualisation
Mining Sequential Patterns (Mining Sequential Associations)
Mining Sequential Patterns
Discovering sequential patterns is a relatively new data mining problem.
The input data is a set of sequences, called data-sequences. Each data-sequence is a list of transactions, where each transaction is a set of items. Typically, there is a transaction time associated with each transaction.
A sequential pattern also consists of a list of sets of items.
The problem is to find all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data-sequences that contain the pattern.
Application Examples
Book club
Each data sequence may correspond to all book selections of a customer, and each transaction corresponds to the books selected by the customer in one order. A sequential pattern may be "5% of customers bought 'Foundation', then 'Foundation and Empire', and then 'Second Foundation'". The data sequence corresponding to a customer who bought some other books in between these books still contains this sequential pattern.
Medical domain
A data sequence may correspond to the symptoms or diseases of a patient, with a transaction corresponding to the symptoms exhibited or diseases diagnosed during a visit to the doctor. The patterns discovered could be used in disease research to help identify symptoms or diseases that precede certain diseases.
Discovering Sequential Associations
Given: a set of objects with associated event occurrences.
[Figure: one timeline of events per object.]
Problem Statement
We are given a database D of customer transactions. Each transaction consists of the following fields: customer-id, transaction-time, and the items purchased in the transaction.
No customer has more than one transaction with the same transaction time.
We do not consider quantities of items bought in a transaction: each item is a binary variable representing whether an item was bought or not.
A sequence is an ordered list of itemsets. We denote an itemset i by $(i_1 i_2 \ldots i_m)$, where $i_j$ is an item. We denote a sequence s by $\langle s_1 s_2 \ldots s_n \rangle$, where $s_j$ is an itemset.
A sequence $\langle a_1 a_2 \ldots a_n \rangle$ is contained in another sequence $\langle b_1 b_2 \ldots b_m \rangle$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.
Problem Statement (2)
For example, $\langle (3)\ (4\ 5)\ (8) \rangle$ is contained in $\langle (7)\ (3\ 8)\ (9)\ (4\ 5\ 6)\ (8) \rangle$, since $(3) \subseteq (3\ 8)$, $(4\ 5) \subseteq (4\ 5\ 6)$, and $(8) \subseteq (8)$. However, the sequence $\langle (3)\ (5) \rangle$ is not contained in $\langle (3\ 5) \rangle$ (and vice versa). The former represents items 3 and 5 being bought one after the other, while the latter represents items 3 and 5 being bought together.
In a set of sequences, a sequence s is maximal if s is not contained in any other sequence.
Customer sequence: the list of itemsets of a customer's transactions $T_1, T_2, \ldots, T_n$, ordered by increasing transaction time:
$$\langle \text{itemset}(T_1)\ \text{itemset}(T_2) \ldots \text{itemset}(T_n) \rangle$$
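The containment test can be sketched as a greedy left-to-right scan. This is an illustrative sketch, not code from the slides: the function name `is_contained` is an assumption, and a sequence is represented as a Python list of sets.

```python
def is_contained(a, b):
    """True if sequence `a` is contained in sequence `b`.

    Both sequences are lists of itemsets (Python sets). Each itemset of `a`
    is matched to the earliest still-unused itemset of `b` that is a
    superset of it, which preserves the required order i_1 < i_2 < ... < i_n.
    """
    j = 0
    for itemset in a:
        # Skip itemsets of b that do not contain the current itemset of a.
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # b[j] is used up; later itemsets must match further right
    return True


print(is_contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))
print(is_contained([{3}, {5}], [{3, 5}]))
print(is_contained([{3, 5}], [{3}, {5}]))
```

Matching each itemset as early as possible is sufficient here: if any valid assignment of positions exists, the earliest one leaves at least as much of `b` available for the remaining itemsets.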
Problem Statement (3)
A customer supports a sequence s if s is contained in the customer sequence for this customer.
The support for a sequence is defined as the fraction of total customers who support this sequence.
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such sequence represents a sequential pattern.
We call a sequence satisfying the minimum support constraint a large sequence.
See the next example.
Example

Database sorted by customer Id and transaction time:
Customer Id  Transaction Time  Items Bought
1            June 25 '00       30
1            June 30 '00       90
2            June 10 '00       10, 20
2            June 15 '00       30
2            June 20 '00       40, 60, 70
3            June 25 '00       30, 50, 70
4            June 25 '00       30
4            June 30 '00       40, 70
4            July 25 '00       90
5            June 12 '00       90

Customer-sequence version of the database:
Customer Id  Customer Sequence
1            $\langle (30)\ (90) \rangle$
2            $\langle (10\ 20)\ (30)\ (40\ 60\ 70) \rangle$
3            $\langle (30\ 50\ 70) \rangle$
4            $\langle (30)\ (40\ 70)\ (90) \rangle$
5            $\langle (90) \rangle$
Example (2)
With minimum support set to 25%, i.e., a minimum support of 2 customers, two sequences, $\langle (30)\ (90) \rangle$ and $\langle (30)\ (40\ 70) \rangle$, are maximal among those satisfying the support constraint, and are the desired sequential patterns.
$\langle (30)\ (90) \rangle$ is supported by customers 1 and 4. Customer 4 buys items (40 70) in between items 30 and 90, but supports the pattern since we are looking for patterns that are not necessarily contiguous.
$\langle (30)\ (40\ 70) \rangle$ is supported by customers 2 and 4. Customer 2 buys 60 along with 40 and 70, but supports this pattern since (40 70) is a subset of (40 60 70).
E.g., the sequence $\langle (10\ 20)\ (30) \rangle$ does not have minimum support; it is only supported by customer 2.
The sequences $\langle (30) \rangle$, $\langle (40) \rangle$, $\langle (70) \rangle$, $\langle (90) \rangle$, $\langle (30)\ (40) \rangle$, $\langle (30)\ (70) \rangle$, and $\langle (40\ 70) \rangle$ have minimum support, but they are not maximal; therefore, they are not in the answer.
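The example can be replayed in code: build the customer sequences from the transaction rows, then list the customers that support a pattern. This is a sketch under assumptions: the names `customer_sequences` and `supporters` are hypothetical, the two-digit year '00 is read as 2000, and the greedy containment check is an illustrative choice.

```python
from collections import defaultdict
from datetime import date

# (customer id, transaction time, items bought); '00 is assumed to mean 2000.
rows = [
    (1, date(2000, 6, 25), {30}), (1, date(2000, 6, 30), {90}),
    (2, date(2000, 6, 10), {10, 20}), (2, date(2000, 6, 15), {30}),
    (2, date(2000, 6, 20), {40, 60, 70}),
    (3, date(2000, 6, 25), {30, 50, 70}),
    (4, date(2000, 6, 25), {30}), (4, date(2000, 6, 30), {40, 70}),
    (4, date(2000, 7, 25), {90}),
    (5, date(2000, 6, 12), {90}),
]


def customer_sequences(rows):
    """Group itemsets per customer, ordered by increasing transaction time."""
    seqs = defaultdict(list)
    for cid, when, items in sorted(rows, key=lambda r: (r[0], r[1])):
        seqs[cid].append(items)
    return dict(seqs)


def is_contained(a, b):
    """Greedy check that sequence a is contained in sequence b."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True


def supporters(pattern, seqs):
    """Ids of the customers whose customer sequence contains `pattern`."""
    return sorted(cid for cid, seq in seqs.items() if is_contained(pattern, seq))


seqs = customer_sequences(rows)
for pattern in ([{30}, {90}], [{30}, {40, 70}], [{10, 20}, {30}]):
    who = supporters(pattern, seqs)
    print(pattern, who, len(who) / len(seqs))
```

A pattern is large when its support fraction reaches the 25% threshold; finding the maximal ones would additionally require discarding every large sequence that is contained in another large sequence.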