Institut für Scientific Computing - Universität Wien, P. Brezany
Datamining Methods: Mining Association Rules and Sequential Patterns

KDD (Knowledge Discovery in Databases) Process
Figure: the KDD pipeline. Operational databases are cleaned, collected, and summarized into a data warehouse (data preparation); data mining turns the training data into models and patterns, which are then verified and evaluated.

Mining Association Rules
Association rule mining finds interesting association or correlation relationships among a large set of data items. This can help in many business decision-making processes: store layout, catalog design, and customer segmentation based on buying patterns. Another important field: medical applications. Market basket analysis is a typical example of association rule mining. How can we find association rules in large amounts of data? Which association rules are the most interesting? How can we help or guide the mining procedures?

Informal Introduction
Given a set of database transactions, where each transaction is a set of items, an association rule is an expression X ⇒ Y, where X and Y are sets of items (literals). The intuitive meaning of the rule: transactions in the database which contain the items in X tend to also contain the items in Y. Example: 98% of customers who purchase tires and auto accessories also buy some automotive services; here 98% is called the confidence of the rule. The support of the rule is the percentage of transactions that contain both X and Y. The problem of mining association rules is to find all rules that satisfy a user-specified minimum support and minimum confidence.

Basic Concepts
Let J = {i1, i2, ..., im} be a set of items. Typically, the items are identifiers of individual articles (products), e.g., bar codes. Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items such that T ⊆ J. Let A be a set of items: a transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A ⇒ B, where A ⊂ J, B ⊂ J, and A ∩ B = ∅. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., both A and B). This is the probability P(A ∪ B).

Basic Concepts (Cont.)
The rule A ⇒ B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B, i.e., the conditional probability P(B|A). Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset.

Basic Concepts - Example

transaction  purchased items
1            bread, coffee, milk, cake
2            coffee, milk, cake
3            bread, butter, coffee, milk
4            milk, cake
5            bread, cake
6            bread

X = {coffee, milk}; R = {coffee, cake, milk}
support of X = 3 of 6 = 50%
support of R = 2 of 6 = 33%
Support of "milk, coffee" ⇒ "cake" equals the support of R = 33%.
Confidence of "milk, coffee" ⇒ "cake" = 2 of 3 = 67% [= support(R)/support(X)]
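
The arithmetic above can be reproduced mechanically. The following is a minimal Python sketch (not part of the original slides; all names are illustrative) that computes the support and confidence values for the six example transactions:

transactions = [
    {"bread", "coffee", "milk", "cake"},
    {"coffee", "milk", "cake"},
    {"bread", "butter", "coffee", "milk"},
    {"milk", "cake"},
    {"bread", "cake"},
    {"bread"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(antecedent ∪ consequent) / support(antecedent), i.e. P(consequent | antecedent).
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

X = {"coffee", "milk"}
R = {"coffee", "cake", "milk"}
print(support(X, transactions))               # 0.5   (3 of 6)
print(support(R, transactions))               # 0.33  (2 of 6)
print(confidence(X, {"cake"}, transactions))  # 0.67  (2 of 3)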

Basic Concepts (Cont.)
An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is therefore referred to as the minimum support count. If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk. Association rule mining is a two-step process:
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.

Association Rule Classification
Based on the types of values handled in the rule: If a rule concerns associations between the presence or absence of items, it is a Boolean association rule. For example: computer ⇒ financial_management_software [support = 2%, confidence = 60%]. If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. For example: age(X, "30..39") and income(X, "42K..48K") ⇒ buys(X, "high resolution TV"). Note that the quantitative attributes, age and income, have been discretized.

Association Rule Classification (Cont.)
Based on the dimensions of data involved in the rule: If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. For example: buys(X, "computer") ⇒ buys(X, "financial management software"). The above rule refers to only one dimension, buys. If a rule references two or more dimensions, such as buys, time_of_transaction, and customer_category, then it is a multidimensional association rule. The second rule on the previous slide is a 3-dimensional association rule, since it involves three dimensions: age, income, and buys.

Association Rule Classification (Cont.)
Based on the levels of abstraction involved in the rule set: Suppose that a set of association rules mined includes:
age(X, "30..39") ⇒ buys(X, "laptop computer")
age(X, "30..39") ⇒ buys(X, "computer")
In the above rules, the items bought are referenced at different levels of abstraction. (E.g., "computer" is a higher-level abstraction of "laptop computer".) Such rules are called multilevel association rules. Single-level association rules refer to one abstraction level only.

Mining Single-Dimensional Boolean Association Rules from Transactional Databases
This is the simplest form of association rules (used in market basket analysis). We present Apriori, a basic algorithm for finding frequent itemsets. Its name reflects the fact that it uses prior knowledge of frequent itemset properties (explained later). Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets, L1, is found. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database. The Apriori property is used to reduce the search space.
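
As a hedged Python sketch of this level-wise loop (illustrative names, not the book's code; the candidate-generation helper apriori_gen, which performs the join and prune steps, is sketched after the prune-step slide below):

def apriori(transactions, min_sup_count):
    # Level 1: one full scan counts every individual item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup_count}
    frequent = {s: counts[s] for s in Lk}
    k = 2
    while Lk:
        Ck = apriori_gen(Lk, k)      # join + prune; see the later sketch
        counts = {c: 0 for c in Ck}
        for t in transactions:       # one full database scan per level
            for c in Ck:
                if c <= t:           # candidate contained in the transaction
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_sup_count}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent                  # every frequent itemset with its support count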

The Apriori Property
All nonempty subsets of a frequent itemset must also be frequent. If an itemset I does not satisfy the minimum support threshold min_sup, then I is not frequent, that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either, that is, P(I ∪ A) < min_sup. How is the Apriori property used in the algorithm? To understand this, let us look at how Lk-1 is used to find Lk. A two-step process is followed, consisting of join and prune actions. These steps are explained on the next slides.

The Apriori Algorithm – the Join Step
To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted by Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li (e.g., l1[k-2] refers to the second-to-last item in l1). Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The join Lk-1 ⋈ Lk-1 is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]). The condition (l1[k-1] < l2[k-1]) simply ensures that no duplicates are generated. The resulting itemset is l1[1] l1[2] ... l1[k-1] l2[k-1].

The Apriori Algorithm – the Join Step (2)
Illustration by an example:
p ∈ Lk-1 = (1 2 3)
q ∈ Lk-1 = (1 2 4)
Join: result ∈ Ck = (1 2 3 4)
Each frequent k-itemset p is always extended by the last item of all frequent itemsets q which have the same first k-1 items as p.

The Apriori Algorithm – the Prune Step
Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk. Ck can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows. Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either, and so it can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
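
Both steps fit in a few lines of Python. The following is an illustrative sketch (a plain nested loop over sorted tuples rather than the hash tree mentioned above):

from itertools import combinations

def apriori_gen(L_prev, k):
    # Join and prune: candidate k-itemsets from the frequent (k-1)-itemsets.
    prev = sorted(tuple(sorted(s)) for s in L_prev)  # lexicographic order
    prev_set = set(prev)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            l1, l2 = prev[i], prev[j]
            # Join step: equal first k-2 items; because the list is sorted and
            # j > i, the condition l1[k-1] < l2[k-1] then holds automatically.
            if l1[:-1] == l2[:-1]:
                c = l1 + (l2[-1],)
                # Prune step: every (k-1)-subset of c must itself be frequent.
                # (c is sorted, so each subset comes out as a sorted tuple.)
                if all(sub in prev_set for sub in combinations(c, k - 1)):
                    candidates.add(frozenset(c))
    return candidates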

The Apriori Algorithm - Example
Let's look at a concrete example of Apriori, based on the AllElectronics transaction database D, shown below. There are nine transactions in this database, i.e., |D| = 9. We use the next figure to illustrate the finding of frequent itemsets in D.

TID   List of item_IDs
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

Generation of Ck and Lk (minimum support count = 2)

C1 (scan D for the count of each candidate):
{I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2
L1 (compare candidate support counts with the minimum support count):
{I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2

C2 (generate candidates from L1, then scan D for their counts):
{I1,I2}: 4, {I1,I3}: 4, {I1,I4}: 1, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2, {I3,I4}: 0, {I3,I5}: 1, {I4,I5}: 0
L2 (compare with the minimum support count):
{I1,I2}: 4, {I1,I3}: 4, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2

Generation of Ck and Lk (minimum support count = 2), continued

C3 (generate candidates from L2, then scan D for their counts):
{I1,I2,I3}: 2, {I1,I2,I5}: 2
L3 (compare with the minimum support count):
{I1,I2,I3}: 2, {I1,I2,I5}: 2

Algorithm Application Description
1. In the first iteration, each item is a member of C1. The algorithm simply scans all the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum support count is 2 (min_sup = 2/9 ≈ 22%). L1 can then be determined.
3. C2 = L1 ⋈ L1.
4. The transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated, as shown in the middle table of the second row in the last figure.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

Algorithm Application Description (2)
6. The generation of C3 = L2 ⋈ L2 is detailed in the next figure. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the last four candidates cannot possibly be frequent. We therefore remove them from C3.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. C4 = L3 ⋈ L3; after the pruning, C4 = Ø.

Example: Generation of C3 from L2
1. Join: C3 = L2 ⋈ L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}} ⋈ {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}} = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}.
2. Prune using the Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
- The 2-item subsets of {I1,I2,I3} are {I1,I2}, {I1,I3}, {I2,I3}, and they are all members of L2. Therefore, keep {I1,I2,I3} in C3.
- The 2-item subsets of {I1,I2,I5} are {I1,I2}, {I1,I5}, {I2,I5}, and they are all members of L2. Therefore, keep {I1,I2,I5} in C3.
- Using the same analysis, remove the other four 3-itemsets from C3.
Therefore, C3 = {{I1,I2,I3}, {I1,I2,I5}} after pruning.
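
Feeding this L2 to the apriori_gen sketch given after the prune-step slide reproduces the result (illustrative usage):

L2 = [{"I1","I2"}, {"I1","I3"}, {"I1","I5"}, {"I2","I3"}, {"I2","I4"}, {"I2","I5"}]
print(sorted(sorted(c) for c in apriori_gen(L2, 3)))
# [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]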

Generating Association Rules from Frequent Itemsets
We generate strong association rules: they satisfy both minimum support and minimum confidence.

confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A.

Generating Association Rules from Frequent Itemsets (Cont.)
Based on the equation on the previous slide, association rules can be generated as follows:
- For each frequent itemset l, generate all nonempty subsets of l.
- For every nonempty subset s of l, output the rule "s ⇒ (l - s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
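
A minimal Python sketch of this procedure (illustrative names; it assumes a dict support_count mapping each frequent itemset, as a frozenset, to its count, e.g. the result of the earlier apriori sketch; by the Apriori property every nonempty subset of a frequent itemset is present in that dict):

from itertools import combinations

def generate_rules(support_count, min_conf):
    rules = []
    for l in support_count:
        if len(l) < 2:
            continue
        # Every nonempty proper subset s of l yields a candidate rule s => (l - s).
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = support_count[l] / support_count[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

With min_conf = 0.7 and the AllElectronics counts, the itemset {I1,I2,I5} yields exactly the three strong rules listed on the next slide.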

Generating Association Rules - Example
Suppose that the transactional data for AllElectronics contain the frequent itemset l = {I1,I2,I5}. The resulting rules are:
I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and last rules above are output, since these are the only ones generated that are strong.

Multilevel (Generalized) Association Rules
For many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to the sparsity of data in multidimensional space. Strong associations discovered at high concept levels may represent common-sense knowledge. However, what may represent common sense to one user may seem novel to another. Therefore, data mining systems should provide capabilities to mine association rules at multiple levels of abstraction and to traverse easily among different abstraction spaces.

Multilevel (Generalized) Association Rules - Example
Suppose we are given the following task-relevant set of transactional data for sales at the computer department of an AllElectronics branch, showing the items purchased for each transaction TID.

TID  Items purchased
T1   IBM desktop computer, Sony b/w printer
T2   Microsoft educational software, Microsoft financial software
T3   Logitech mouse computer accessory, Ergoway wrist pad accessory
T4   IBM desktop computer, Microsoft financial software
T5   IBM desktop computer
...

Table: Transactions

A Concept Hierarchy for our Example
Figure: a four-level concept hierarchy.
Level 0: all.
Level 1: computer, software, printer, computer accessory.
Level 2: desktop, laptop (computer); educational, financial (software); color, b/w (printer); wrist pad, mouse (computer accessory).
Level 3: vendor-specific items, e.g., IBM computers, Microsoft software, HP and Sony printers, Ergoway wrist pads, Logitech mice.

Example (Cont.)
The items in Table Transactions are at the lowest level of the concept hierarchy. It is difficult to find interesting purchase patterns at such a raw or primitive level of data. If, e.g., "IBM desktop computer" or "Sony b/w printer" each occurs in a very small fraction of the transactions, then it may be difficult to find strong associations involving such items. In other words, it is unlikely that the itemset "{IBM desktop computer, Sony b/w printer}" will satisfy minimum support. Itemsets containing generalized items, such as "{IBM desktop computer, b/w printer}" and "{computer, printer}", are more likely to have minimum support. Rules generated from association rule mining with concept hierarchies are called multiple-level, multilevel, or generalized association rules.

Parallel Formulation of Association Rules
Need:
– Huge transaction datasets (tens of TB)
– Large number of candidates
Data distribution:
– Partition the transaction database, or
– Partition the candidates, or
– Both

Parallel Association Rules: Count Distribution (CD)
Each processor has the complete candidate hash tree. Each processor updates its hash tree with local data. Each processor participates in a global reduction to get the global counts of the candidates in the hash tree. Multiple database scans per iteration are required if the hash tree is too big for memory.
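
A hedged sketch of one CD counting round, assuming mpi4py and NumPy are available (the hash tree is replaced here by a plain list; all processors must agree on the candidate order so the count vectors line up):

import numpy as np
from mpi4py import MPI

def cd_count(candidates, local_transactions, min_sup_count, comm=MPI.COMM_WORLD):
    # Each processor counts ALL candidates over its own N/p transactions.
    counts = np.zeros(len(candidates), dtype=np.int64)
    for t in local_transactions:
        for i, c in enumerate(candidates):
            if c <= t:               # candidate itemset contained in transaction
                counts[i] += 1
    # Global reduction: afterwards every processor holds the global counts.
    comm.Allreduce(MPI.IN_PLACE, counts, op=MPI.SUM)
    return [c for i, c in enumerate(candidates) if counts[i] >= min_sup_count]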

CD: Illustration
Figure: three processors P0, P1, P2, each holding the full candidate set (e.g., {1,2}, {1,3}, {2,3}, {3,4}, {5,8}) with locally accumulated counts over its N/p share of the transactions; a global reduction of the counts follows.

Parallel Association Rules: Data Distribution (DD)
The candidate set is partitioned among the processors. Once local data has been partitioned, it is broadcast to all other processors. High communication cost due to data movement. Redundant work due to multiple traversals of the hash trees.

DD: Illustration
Figure: the candidates are partitioned across the processors P0, P1, P2, each with counts for its own share (e.g., {1,2}, {1,3} on P0; {2,3}, {3,4} on P1; {5,8} on P2); each processor counts its candidates against its local N/p transactions and the remote data received through an all-to-all broadcast.

Predictive Model Markup Language – PMML and Visualization

Predictive Model Markup Language - PMML
A markup language (XML) to describe data mining models. PMML describes:
– the inputs to data mining models
– the transformations used to prepare data for data mining
– the parameters which define the models themselves

PMML 2.1 – Association Rules (1)
1. Model attributes (1) …

PMML 2.1 – Association Rules (2)
1. Model attributes (2)

PMML 2.1 – Association Rules (3)
2. Items

PMML 2.1 – Association Rules (4)
3. ItemSets

PMML 2.1 – Association Rules (5)
4. AssociationRules

PMML example model for AssociationRules (1)

PMML example model for AssociationRules (2)

PMML example model for AssociationRules (3)

PMML example model for AssociationRules (4)
<AssociationRule support="0.9" confidence="0.85" antecedent="4" consequent="3" />
<AssociationRule support="0.9" confidence="0.75" antecedent="1" consequent="6" />
<AssociationRule support="0.9" confidence="0.70" antecedent="6" consequent="1" />

Visualization of Association Rules (1)
1. Table Format

Antecedent        Consequent        Support  Confidence
PC, Monitor       Printer           90%      85%
PC                Printer, Monitor  90%      75%
Printer, Monitor  PC                80%      70%

Visualization of Association Rules (2)
2. Directed Graph
Figure: the three rules from the table drawn as a directed graph over the nodes PC, Monitor, and Printer.

Visualization of Association Rules (3)
3. 3-D Visualisation

Mining Sequential Patterns (Mining Sequential Associations)

Mining Sequential Patterns
Discovering sequential patterns is a relatively new data mining problem. The input data is a set of sequences, called data-sequences. Each data-sequence is a list of transactions, where each transaction is a set of items. Typically, there is a transaction time associated with each transaction. A sequential pattern also consists of a list of sets of items. The problem is to find all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data-sequences that contain the pattern.

Application Examples
Book club: Each data-sequence may correspond to all book selections of a customer, and each transaction to the books selected by the customer in one order. A sequential pattern may be "5% of customers bought 'Foundation', then 'Foundation and Empire', and then 'Second Foundation'". A data-sequence of a customer who bought some other books in between these books still contains this sequential pattern.
Medical domain: A data-sequence may correspond to the symptoms or diseases of a patient, with a transaction corresponding to the symptoms exhibited or diseases diagnosed during one visit to the doctor. The patterns discovered could be used in disease research to help identify symptoms that precede certain diseases.

Discovering Sequential Associations
Given: a set of objects with associated event occurrences.
Figure: a timeline of events for each object.

Problem Statement
We are given a database D of customer transactions. Each transaction consists of the following fields: customer-id, transaction-time, and the items purchased in the transaction. No customer has more than one transaction with the same transaction time. We do not consider the quantities of items bought in a transaction: each item is a binary variable representing whether the item was bought or not. A sequence is an ordered list of itemsets. We denote an itemset i by (i1 i2 ... im), where ij is an item. We denote a sequence s by ⟨s1 s2 ... sn⟩, where sj is an itemset. A sequence ⟨a1 a2 ... an⟩ is contained in another sequence ⟨b1 b2 ... bm⟩ if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.

Problem Statement (2)
For example, the sequence ⟨(3) (4 5) (8)⟩ is contained in ⟨(7) (3 8) (9) (4 5 6) (8)⟩, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6), and (8) ⊆ (8). However, the sequence ⟨(3) (5)⟩ is not contained in ⟨(3 5)⟩ (and vice versa). The former represents items 3 and 5 being bought one after the other, while the latter represents items 3 and 5 being bought together. In a set of sequences, a sequence s is maximal if s is not contained in any other sequence. Customer sequence: the list of the customer's transactions, ordered by increasing transaction time, with each transaction represented by its itemset.
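
The containment test translates directly into code. A minimal Python sketch (illustrative names), checked against the two examples above:

def is_contained(s, t):
    # True if sequence s = <a1 ... an> is contained in sequence t = <b1 ... bm>,
    # i.e. there are indices i1 < ... < in with each aj a subset of b(ij).
    pos = 0
    for a in s:
        # Greedily take the earliest remaining element of t that covers a.
        while pos < len(t) and not set(a) <= set(t[pos]):
            pos += 1
        if pos == len(t):
            return False
        pos += 1
    return True

print(is_contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(is_contained([{3}, {5}], [{3, 5}]))                                    # False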

Problem Statement (3)
A customer supports a sequence s if s is contained in the customer sequence for this customer. The support for a sequence is defined as the fraction of total customers who support this sequence. Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such sequence represents a sequential pattern. We call a sequence satisfying the minimum support constraint a large sequence. See the next example.
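
Together with is_contained from the previous sketch, these definitions become (illustrative code; a real miner such as AprioriAll would enumerate candidate sequences far more cleverly):

def seq_support(pattern, customer_sequences):
    # Fraction of customers whose sequence contains the pattern.
    hits = sum(1 for cs in customer_sequences if is_contained(pattern, cs))
    return hits / len(customer_sequences)

def maximal_only(patterns):
    # Keep a pattern only if it is contained in no other pattern of the set.
    return [p for p in patterns
            if not any(p is not q and is_contained(p, q) for q in patterns)]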

Example

Database sorted by customer id and transaction time:

Customer Id  Transaction Time  Items Bought
1            June 25 '00       30
1            June 30 '00       90
2            June 10 '00       10, 20
2            June 15 '00       30
2            June 20 '00       40, 60, 70
3            June 25 '00       30, 50, 70
4            June 25 '00       30
4            June 30 '00       40, 70
4            July 25 '00       90
5            June 12 '00       90

Customer-sequence version of the database:

Customer Id  Customer Sequence
1            ⟨(30) (90)⟩
2            ⟨(10 20) (30) (40 60 70)⟩
3            ⟨(30 50 70)⟩
4            ⟨(30) (40 70) (90)⟩
5            ⟨(90)⟩

Example (2)
With the minimum support set to 25%, i.e., a minimum support of 2 customers, two sequences, ⟨(30) (90)⟩ and ⟨(30) (40 70)⟩, are maximal among those satisfying the support constraint, and are the desired sequential patterns. ⟨(30) (90)⟩ is supported by customers 1 and 4. Customer 4 buys items (40 70) in between items 30 and 90, but supports the pattern, since we are looking for patterns that are not necessarily contiguous. ⟨(30) (40 70)⟩ is supported by customers 2 and 4. Customer 2 buys 60 along with 40 and 70, but supports this pattern, since (40 70) is a subset of (40 60 70). E.g., the sequence ⟨(10 20) (30)⟩ does not have minimum support; it is supported only by customer 2. The sequences ⟨(30)⟩, ⟨(40)⟩, ⟨(70)⟩, ⟨(90)⟩, ⟨(30) (40)⟩, ⟨(30) (70)⟩, and ⟨(40 70)⟩ have minimum support, but they are not maximal; therefore, they are not in the answer.