Association Rule Mining and Its Applications
Park Jong-Soo, Department of Computer Science, Sungshin Women's University



Contents
- Data Mining in the KDD Process
- Definition of Association Rules
- Mining Association Rules in Transaction Databases
- Algorithms Apriori and DHP
- Generalized Association Rules
- Cyclic Association Rules and Negative Associations
- Interestingness Measurement
- Sequential Patterns and Path Traversal Patterns
- Research Directions and Reference Homepages

Overview of the steps constituting the KDD process
Data → [Selection] → Target Data → [Preprocessing] → Preprocessed Data → [Transformation] → Transformed Data → [Data Mining] → Patterns → [Interpretation/Evaluation] → Knowledge

Types of Data-Mining Problems
- Prediction
  - Classification
  - Regression
  - Time Series
- Knowledge Discovery
  - Deviation Detection
  - Database Segmentation
  - Clustering
  - Association Rules
  - Summarization
  - Visualization
  - Text mining

Association Rule
- Example: the statement that 90% of transactions that purchase bread and butter also purchase milk.
  - [Bread], [Butter] ⇒ [Milk] (12.5%, 90%)
  - Antecedent: [Bread], [Butter]. Consequent: [Milk].
  - 90%: confidence factor of the rule (not 100%).
  - 12.5%: support for the rule, the fraction of transactions in the database that contain all items of the rule.
- Typical queries:
  - Find all rules that have "Diet Coke" as consequent.
  - Find all rules that have "bagels" in the antecedent.
  - Find the "best" k rules that have "bagels" in the consequent.

Definition of Association Rules
- I: a set of literals called items.
- T: a transaction, a set of items such that T ⊆ I.
- An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
- Notation: X ⇒ Y [support, confidence]

Mining Association Rules in Transaction Databases
- Applications: pattern association, market analysis, etc.
- Given: data of transactions, where each transaction has a list of items purchased.
- Find all association rules: the presence of one set of items implies the presence of another set of items, e.g., people who purchased hammers also purchased nails.
- Measurement of rule strength:
  - Confidence: X & Y ⇒ Z has 90% confidence if 90% of customers who bought X and Y also bought Z.
  - Support: useful rules (for business decisions) should have some minimum transaction support.
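The two strength measures above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names and the toy basket data are assumptions made for the example.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """support(X ∪ Y) / support(X); assumes X occurs in at least one transaction."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return support(xy, transactions) / support(x, transactions)


# Hypothetical toy data, chosen only to make the arithmetic easy to follow.
baskets = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk"}),
    frozenset({"hammer", "nails"}),
]

rule_support = support({"bread", "butter", "milk"}, baskets)
rule_confidence = confidence({"bread", "butter"}, {"milk"}, baskets)
print(rule_support, rule_confidence)
```

In this toy data one of four baskets contains bread, butter, and milk, and two contain bread and butter, so the rule [Bread], [Butter] ⇒ [Milk] has support 1/4 and confidence 1/2.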

Two Steps for Association Rules
- Step 1: Determining "large itemsets" above minimum support
  - Find all combinations of items that have transaction support above minimum support.
  - Research has focused on this phase.
- Step 2: Generating rules

    for each large itemset L do
        for each nonempty proper subset c of L do
            if (support(L) / support(L - c) ≥ minimum confidence) then
                output the rule (L - c) ⇒ c,
                    with confidence = support(L) / support(L - c)
                    and support = support(L);
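A rough Python rendering of the rule-generation step might look as follows. It is a sketch under stated assumptions: the `support_count` dictionary is hypothetical input that must contain every large itemset (and hence every subset of one), and the counts are those of the four-transaction example database used elsewhere in this deck.

```python
from itertools import combinations


def generate_rules(support_count, min_conf):
    """Return (antecedent, consequent, support, confidence) tuples.

    support_count maps each large itemset (a frozenset) to its support count.
    """
    rules = []
    for itemset, sup in support_count.items():
        if len(itemset) < 2:
            continue  # a rule needs a nonempty antecedent and consequent
        for r in range(1, len(itemset)):
            for c in combinations(sorted(itemset), r):
                consequent = frozenset(c)
                antecedent = itemset - consequent
                conf = sup / support_count[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, consequent, sup, conf))
    return rules


# Support counts of the large itemsets for D = {ACD, BCE, ABCE, BE}, minsup = 2.
support_count = {
    frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("AC"): 2, frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}

for ante, cons, sup, conf in generate_rules(support_count, min_conf=1.0):
    print(sorted(ante), "=>", sorted(cons), "support", sup, "confidence", conf)
```

With a minimum confidence of 100%, only the rules whose antecedent has the same support count as the whole itemset survive.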

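[Figure: the mining loop. Candidate Itemsets → Scan Database → Large Itemsets → Association Rules.]
- Candidate itemsets: how to generate candidate itemsets (Apriori method: join step + prune step).
- Scan database: focus on data structures to speed up scanning the database (hash tree, trie, hash table, etc.); candidates are filtered by minimum support.
- Association rules: derived from the large itemsets using minimum confidence.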

Example run (minimum support = 2)

Database D
    TID   Items
    100   A C D
    200   B C E
    300   A B C E
    400   B E

Scan D → C1
    Itemset   Sup.
    {A}       2
    {B}       3
    {C}       3
    {D}       1
    {E}       3

L1
    Itemset   Sup.
    {A}       2
    {B}       3
    {C}       3
    {E}       3

C2 (generated from L1): {A B}, {A C}, {A E}, {B C}, {B E}, {C E}

Scan D → C2
    Itemset   Sup.
    {A B}     1
    {A C}     2
    {A E}     1
    {B C}     2
    {B E}     3
    {C E}     2

L2
    Itemset   Sup.
    {A C}     2
    {B C}     2
    {B E}     3
    {C E}     2

C3 (generated from L2): {B C E}

Scan D → C3
    Itemset   Sup.
    {B C E}   2

L3
    Itemset   Sup.
    {B C E}   2

Algorithms for Mining Association Rules
- AIS (Agrawal et al., ACM SIGMOD, May '93)
- SETM (Swami et al., IBM Tech. Rep., Oct '93)
- Apriori (Agrawal et al., VLDB, Sept '94)
- OCD (Mannila et al., AAAI Workshop on KDD, July '94)
- DHP (Park et al., ACM SIGMOD, May '95)
- PARTITION (Savasere et al., VLDB, Sept '95)
- Mining Generalized Association Rules (Srikant et al., VLDB, Sept '95)
- Sampling Approach (Toivonen, VLDB, Sept '96)
- DIC (dynamic itemset counting, Brin et al., ACM SIGMOD, May '97)
- Cyclic Association Rules (Özden et al., IEEE ICDE, Feb '98)
- Negative Associations (Savasere et al., IEEE ICDE, Feb '98)

Algorithm Apriori
- L_k: set of large k-itemsets
- C_k: set of candidate k-itemsets
- Steps: C_1 → L_1 → C_2 → L_2, ..., C_k → L_k
- Input: transaction file. Output: large itemsets.

    L_1 = {large 1-itemsets};
    for (k = 2; L_{k-1} ≠ ∅; k++) do begin
        C_k = apriori-gen(L_{k-1});
        forall transactions t ∈ D do begin
            C_t = subset(C_k, t);
            forall candidates c ∈ C_t do
                c.count++;
        end
        L_k = {c ∈ C_k | c.count ≥ minsup}
    end
    Answer = ∪_k L_k;
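The level-wise loop above can be sketched in Python as follows. This is an illustrative simplification rather than the published implementation: candidates are built by taking unions of pairs of large (k-1)-itemsets and then pruning, and the `subset(C_k, t)` lookup is replaced by a plain scan over all candidates instead of a hash tree. The function name and data layout are assumptions.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support count} for every large itemset."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items to obtain L_1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: n for s, n in counts.items() if n >= minsup}
    answer = dict(large)

    k = 2
    while large:
        prev = set(large)
        # Simplified candidate generation: unions of size k, then prune any
        # candidate that has a (k-1)-subset outside L_{k-1}.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Scan the database and count each candidate (linear scan, no hash tree).
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(large)
        k += 1
    return answer


database = ["ACD", "BCE", "ABCE", "BE"]  # TIDs 100, 200, 300, 400
large_itemsets = apriori(database, minsup=2)
for itemset, count in sorted(large_itemsets.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On the four-transaction example database with minimum support 2, this yields four large 1-itemsets, four large 2-itemsets, and the single large 3-itemset {B C E}.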

Apriori-gen(L_{k-1})
- Join step

    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2},
          p.item_{k-1} < q.item_{k-1};

- Prune step

    forall itemsets c ∈ C_k do
        forall (k-1)-subsets s of c do
            if (s ∉ L_{k-1}) then
                delete c from C_k;

Example: Generation of Candidate Itemsets
- Example: generating C_4 from L_3.
  1. Join step: given L_3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}, the candidate 4-itemsets are {{1, 2, 3, 4}, {1, 3, 4, 5}}.
  2. Prune step:
     - 3-subsets of {1, 2, 3, 4} = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4}}. All are in L_3, so the candidate is kept.
     - 3-subsets of {1, 3, 4, 5} = {{1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {3, 4, 5}}. Since {1, 4, 5} ∉ L_3 and {3, 4, 5} ∉ L_3, {1, 3, 4, 5} is pruned.
- Result: C_4 = {{1, 2, 3, 4}}
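The join and prune steps can be checked against this example with a short Python sketch. The representation is an assumption made for illustration: each itemset is a sorted tuple, so that "first k-2 items equal, last item smaller" can be tested directly.

```python
from itertools import combinations


def join_step(large_prev):
    """Join L_{k-1} with itself on the first k-2 items (itemsets are sorted tuples)."""
    joined = set()
    for p in large_prev:
        for q in large_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    return joined


def prune_step(candidates, large_prev):
    """Drop every candidate that has a (k-1)-subset missing from L_{k-1}."""
    large_prev = set(large_prev)
    return {
        c for c in candidates
        if all(s in large_prev for s in combinations(c, len(c) - 1))
    }


def apriori_gen(large_prev):
    return prune_step(join_step(large_prev), large_prev)


L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(sorted(join_step(L3)))    # candidates after the join step
print(sorted(apriori_gen(L3)))  # candidates after pruning
```

Because `combinations` preserves the order of its input, every generated subset is itself a sorted tuple and can be looked up in L_{k-1} without re-sorting.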

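Data Structure for C_k
- A hash tree is built for the candidate set at each level.
- Example: the hash tree for C_2 = {{A, B}, {A, C}, {A, T}, {B, C}, {B, D}, {C, D}}.
[Figure: hash tree for C_2. Level 1 is the interior node; Level 2 holds the leaf nodes, which store the candidates {A, B}, {A, C}, {A, T}, {B, C}, {B, D}, {C, D}.]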

DHP: example of generating the hash table H_2 and C_2 (the candidate 2-itemsets)

DHP: example of L_2 and D_3 (minimum support s = 2)

Counting support in a hash tree
    C_2      Count   L_2
    {A C}    2       {A C}
    {B C}    2       {B C}
    {B E}    3       {B E}
    {C E}    2       {C E}

Trimming the database
    TID   Items      Candidates of C_2 in the transaction   Action
    100   A C D      {A C}                                   Discard
    200   B C E      {B C} {B E} {C E}                       Keep {B C E}
    300   A B C E    {A C} {B C} {B E} {C E}                 Keep {B C E}
    400   B E        {B E}                                   Discard

D_3 consists of the two kept transactions, each reduced to {B C E}.
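The hash-based filtering that DHP uses to shrink C_2 can be sketched as below. The hash function, the item ordering, and the table size of 7 are assumptions chosen for illustration; the slides do not preserve them. The sketch covers only the bucket-count filter, not the transaction trimming shown in the table above.

```python
from itertools import combinations

ORDER = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}  # assumed item ordering
NUM_BUCKETS = 7                                    # assumed hash table size


def bucket(x, y):
    """Assumed hash function for a 2-itemset {x, y} with x ordered before y."""
    return (ORDER[x] * 10 + ORDER[y]) % NUM_BUCKETS


def dhp_candidate_pairs(transactions, minsup):
    """One pass: count items and hash every 2-subset, then filter pairs."""
    item_count = {}
    bucket_count = [0] * NUM_BUCKETS
    for t in transactions:
        items = sorted(t, key=ORDER.get)
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for x, y in combinations(items, 2):
            bucket_count[bucket(x, y)] += 1

    large_items = sorted(
        (i for i, n in item_count.items() if n >= minsup), key=ORDER.get
    )
    # A pair is a candidate only if both items are large AND its bucket is heavy.
    return {
        (x, y) for x, y in combinations(large_items, 2)
        if bucket_count[bucket(x, y)] >= minsup
    }


database = ["ACD", "BCE", "ABCE", "BE"]
c2 = dhp_candidate_pairs(database, minsup=2)
print(sorted(c2))
```

Under these assumptions the pairs {A B} and {A E} fall into buckets with a count of 1 and are dropped before the second scan, leaving the four candidates listed in the C_2 column above. Plain Apriori would have generated all six pairs from L_1.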

Generalized Association Rules
- Finding associations between items at any level of the taxonomy.
- Rules:
  - People who buy clothes tend to buy shoes.
  - People who buy outerwear tend to buy shoes. (o)
  - People who buy jackets tend to buy shoes.
- Taxonomy:

    Clothes
        Outerwear
            Jackets
            Ski Pants
        Shirts
    Footwear
        Shoes
        Hiking Boots

Problem Statement
- Let I = {i_1, i_2, ..., i_m} be a set of literals, D a set of transactions, and T a taxonomy given as a DAG (directed acyclic graph). A generalized association rule is X ⇒ Y [confidence, support], where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X. (X and Y may contain items from any level of the taxonomy T.)
- Steps:
  1. Find all sets of items whose support is greater than minimum support.
  2. Generate association rules whose confidence is greater than minimum confidence.
  3. Prune all uninteresting rules from this set using the R-interesting measure.
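One straightforward way to make items at every taxonomy level visible to step 1 is to add each item's ancestors to its transaction before counting. The sketch below illustrates that idea only; the `PARENT` table encodes the clothes/footwear taxonomy as a tree with a single parent per item, which is a simplifying assumption (the problem statement allows a general DAG), and the function names are hypothetical.

```python
# Child -> parent edges of the example taxonomy (assumed single parent per item).
PARENT = {
    "Jackets": "Outerwear",
    "Ski Pants": "Outerwear",
    "Outerwear": "Clothes",
    "Shirts": "Clothes",
    "Shoes": "Footwear",
    "Hiking Boots": "Footwear",
}


def ancestors(item):
    """All ancestors of `item`, walking up the taxonomy to the root."""
    found = set()
    while item in PARENT:
        item = PARENT[item]
        found.add(item)
    return found


def extend(transaction):
    """Transaction plus the ancestors of each of its items."""
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return frozenset(extended)


extended = extend({"Jackets", "Shoes"})
print(sorted(extended))
```

After extension, an ordinary frequent-itemset pass counts {Outerwear, Shoes} and {Clothes, Shoes} alongside {Jackets, Shoes}. Rules whose consequent contains an ancestor of an antecedent item still have to be filtered out afterwards, as the problem statement requires.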

Interestingness of Generalized Rules
- Using a new interest measure, R-interesting, 40% to 60% of the rules can be pruned as "redundant" rules.
- Example:
  - Assumptions: the taxonomy says Skim milk is-a Milk; Milk ⇒ Cereal holds with 8% support and 70% confidence; skim milk sales are 1/4 of milk sales.
  - For Skim milk ⇒ Cereal:
    - Expectation: 2% support, 70% confidence.
    - Actual support and confidence: about 2% support, 70% confidence.
    - Conclusion: the rule is redundant and uninteresting.
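The arithmetic behind the skim-milk example can be written out as a small sketch. The function names are illustrative, and the threshold R = 1.1 is a hypothetical value: the slides do not state which R was used. The check assumes a rule counts as R-interesting when its support or its confidence is at least R times the value expected from its ancestor rule.

```python
def expected_support(ancestor_rule_support, share_of_ancestor):
    """Support expected for the specialized rule, given the ancestor rule's
    support and the specialized item's share of the ancestor item's sales."""
    return ancestor_rule_support * share_of_ancestor


def is_r_interesting(actual_sup, actual_conf, expected_sup, expected_conf, r):
    """Assumed test: interesting if support or confidence beats expectation by R."""
    return actual_sup >= r * expected_sup or actual_conf >= r * expected_conf


# Milk => Cereal: 8% support, 70% confidence; skim milk is 1/4 of milk sales.
exp_sup = expected_support(0.08, 0.25)
exp_conf = 0.70  # specializing the antecedent leaves expected confidence unchanged

interesting = is_r_interesting(
    actual_sup=0.02, actual_conf=0.70,
    expected_sup=exp_sup, expected_conf=exp_conf,
    r=1.1,  # hypothetical threshold
)
print(exp_sup, interesting)
```

The expected support is 8% × 1/4 = 2%, which matches the observed value, so the specialized rule adds nothing beyond Milk ⇒ Cereal and is pruned.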

Cyclic Association Rules
- Example: beer and chips are sold together primarily between 6PM and 9PM.
- Association rules can also display regular hourly, daily, weekly, etc., variation that has the appearance of cycles.
- An association rule X ⇒ Y holds in time unit t_i
  - if the support of X ⇒ Y in D[i] exceeds MinSup, and
  - the confidence of X ⇒ Y in D[i] exceeds MinConf.
- The rule has a cycle c = (l, o), with length l and offset o, if it holds in every l-th time unit starting with time unit t_o.
- "coffee ⇒ doughnuts" has a cycle (24, 7) if the unit of time is an hour and "coffee ⇒ doughnuts" holds during the interval 7AM-8AM every day (i.e., every 24 hours).
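Checking whether a rule has a given cycle reduces to inspecting every l-th time unit. The following is a minimal sketch, assuming the per-unit outcome (does the rule meet MinSup and MinConf in D[i]?) has already been computed into a list of booleans; the function name and the three-day toy data are hypothetical.

```python
def has_cycle(holds, length, offset):
    """True if the rule holds in every `length`-th time unit starting at `offset`.

    holds[i] is True when the rule meets MinSup and MinConf in time unit t_i.
    """
    units = range(offset, len(holds), length)
    return len(units) > 0 and all(holds[i] for i in units)


# Three days of hourly units; the rule holds only during 7AM-8AM each day.
holds = [hour % 24 == 7 for hour in range(72)]

print(has_cycle(holds, length=24, offset=7))  # the (24, 7) cycle
print(has_cycle(holds, length=12, offset=7))  # would also require 7PM-8PM
```

A full miner would enumerate candidate (l, o) pairs up to some maximum length rather than test a single one.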

Negative Association Rules
- Example rule: "60% of the customers who buy potato chips do not buy bottled water."
- Negative rule: X ⇏ Y such that
  - (a) support(X) and support(Y) are greater than the minimum support MinSup; and
  - (b) the rule interest measure is greater than MinRI.
- The interest measure RI of a negative association rule X ⇏ Y is defined in terms of expected support, where E[support(X)] is the expected support of an itemset X.

Incremental Updating, Parallel and Distributed Algorithms
- An incremental evaluation technique for mining association rules in databases (Kim Eui-Kyung et al., Korea Information Science Society, Fall '95 conference proceedings).
- Fast updating algorithm, FUP (Cheung et al., IEEE ICDE, '96).
  - Partitioned derivation and incremental updating.
- PDM (Park et al., ACM CIKM, '95):
  - Uses a hashing technique (DHP-like) to identify candidate k-itemsets from the local databases.
- Count Distribution (Agrawal & Shafer, IEEE TKDE, Vol 8, No 6, '96):
  - An extension of the Apriori algorithm.
  - May require a lot of messages in count exchange.
- FDM (Cheung et al., IEEE TKDE, Vol 8, No 6, '96).
  - Observation: if an itemset X is globally large, there exists a partition D_i such that X and all its subsets are locally large at D_i.
  - The candidate sets are those that are also local candidates in some component database, plus some message-passing optimizations.

When is Market Basket Analysis useful?
- The following three rules are examples of real rules generated from real data:
  - On Thursdays, grocery store consumers often purchase diapers and beer together.
    - Useful rule: high-quality, actionable information.
  - Customers who purchase maintenance agreements are very likely to purchase large appliances.
    - Trivial rule.
  - When a new hardware store opens, one of the most commonly sold items is toilet rings.
    - Inexplicable rule.

Interestingness Measurement for Association Rules (I)
- Two popular measurements: support and confidence.
  - The longer the itemset, the lower its support.
- Use taxonomy information for pruning redundant rules.
  - A rule is "redundant" if its support and confidence are close to their expected values based on an ancestor of the rule.
  - Example: "milk ⇒ cereal" vs. "skim milk ⇒ cereal".
  - More effective than pruning based on statistical significance.
- Interestingness of patterns
  - If a pattern contradicts the set of hard beliefs of the user, then this pattern is always interesting to the user.
  - The more a pattern "affects" the belief system, the more interesting it is.

Interestingness Measurement (II)
- Improvement (Interest)
  - How much better a rule is at predicting the result than just assuming the result in the first place.
  - Measures co-occurrence rather than implication.
  - Symmetric.
- Conviction
  - How far "condition and ¬result" deviates from independence.

Range of measurement
- Improvement
  - Improvement = 1: the condition and result items are completely independent.
  - Improvement < 1: worse rule.
  - Improvement > 1: better rule.
- Conviction
  - Conviction = 1: the condition and result items are completely unrelated.
  - Conviction > 1: better rule.
  - Conviction = ∞: completely related rule.
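The ranges above can be reproduced numerically. The formulas on the original slides were not preserved in this transcript, so the sketch below assumes the commonly used definitions: improvement = support(X ∪ Y) / (support(X) · support(Y)) and conviction = support(X) · (1 − support(Y)) / (support(X) − support(X ∪ Y)). The function names are illustrative.

```python
import math


def improvement(sup_x, sup_y, sup_xy):
    """Assumed definition: P(X, Y) / (P(X) * P(Y)). Symmetric in X and Y."""
    return sup_xy / (sup_x * sup_y)


def conviction(sup_x, sup_y, sup_xy):
    """Assumed definition: P(X) * P(not Y) / P(X, not Y)."""
    sup_x_not_y = sup_x - sup_xy  # transactions with X but without Y
    if sup_x_not_y <= 0:
        return math.inf           # the rule never fails: completely related
    return sup_x * (1 - sup_y) / sup_x_not_y


# Independent items: P(X) = P(Y) = 0.5 and P(X, Y) = 0.25.
print(improvement(0.5, 0.5, 0.25), conviction(0.5, 0.5, 0.25))

# Y appears in every transaction that contains X.
print(improvement(0.5, 0.5, 0.5), conviction(0.5, 0.5, 0.5))
```

Both measures equal 1 for independent items. When the result always accompanies the condition, conviction becomes infinite while improvement stays finite, which is one reason the two are reported together.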

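Sequential Patterns
- Examples of such patterns:
  - Customers typically rent "Star Wars", then "Empire Strikes Back", and then "Return of the Jedi".
    - Note that these rentals need not be consecutive.
  - Course registration: Tourism and Leisure (semester 1) → The Capital Region and Housing Problems (semester 2) → The Stock Market (semester 3)
  - Stock price movement patterns: Samsung Electronics stock rises → LG Electronics stock rises → Bohae Brewery stock rises
  - Purchase patterns: suit → dress shirt → black dress shoes → ?
  - Patterns in the order of disease onset in medical diagnosis
  - Treatment and medication patterns in patient care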

Mining Sequential Patterns
- An itemset is a non-empty set of items.
- A sequence is an ordered list of itemsets.
[Table: customer sequences by customer ID, and the sequential patterns with support > 25%.]

The Algorithm for Sequential Patterns (Agrawal and Srikant, ICDE 1995)
- Sort Phase
  - Major key: customer-id; minor key: transaction-time.
- Litemset Phase
  - litemset = an itemset with minimum support.
- Transformation Phase
  - A customer sequence is represented by a list of sets of litemsets.
- Sequence Phase (an application of the Apriori algorithm)
  - Candidate sequences ==> large sequences.
- Maximal Phase
  - A sequence s is maximal if s is not contained in any other sequence.
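Both the sequence phase and the maximal phase rely on testing whether one sequence is contained in another. Below is a minimal sketch of that containment test and of sequence support, where support is the fraction of customers whose sequence contains the pattern. The customer sequences are hypothetical toy data, since the table from the slides did not survive, and the function names are illustrative.

```python
def contains(sequence, pattern):
    """True if each itemset of `pattern` is a subset of a distinct itemset of
    `sequence`, with the matched itemsets appearing in the same order."""
    pos = 0
    for itemset in pattern:
        # Greedily advance to the first remaining itemset that covers this one.
        while pos < len(sequence) and not set(itemset) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def sequence_support(pattern, customer_sequences):
    """Fraction of customers whose sequence contains `pattern`."""
    hits = sum(1 for seq in customer_sequences if contains(seq, pattern))
    return hits / len(customer_sequences)


# Hypothetical customer sequences (one per customer, ordered by transaction time).
customers = [
    [(30,), (90,)],
    [(10, 20), (30,), (40, 60, 70)],
    [(30, 50, 70)],
    [(30,), (40, 70), (90,)],
    [(90,)],
]

print(sequence_support([(30,), (90,)], customers))
print(sequence_support([(30,), (40, 70)], customers))
```

In this toy data each of the two patterns is contained in two of the five customer sequences, so both clear a 25% minimum support. Matching is greedy, which is sufficient for a containment test because taking the earliest possible match never rules out a later one.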

Mining Path Traversal Patterns
- Understanding user access patterns in a distributed information-providing environment such as WWW, Hitel, etc.
  - helps improve the system design
  - leads to better marketing decisions
- Capturing user access patterns
  - mining path traversal patterns
  - capturing user traveling behavior
  - improving the quality of such services

[Figure: traversal tree over the nodes A, B, C, D, E, G, H, W, O, U, V.]
- Maximal forward references: {ABCD, ABEGH, ABEGW, AOU, AOV}
- Traversal patterns:
  1. Find large reference sequences.
  2. Find maximal reference sequences.
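Extracting maximal forward references from a raw click path can be sketched as follows. The idea is to treat a revisit of a node already on the current path as a backward move that ends a forward run. The input path below is an assumption: the slide shows only the tree and the resulting references, not the traversal log, so the path was chosen to reproduce that result. The function name is hypothetical.

```python
def maximal_forward_references(path):
    """Split a traversal path into its maximal forward references."""
    current = []           # nodes on the current forward path
    references = []
    moving_forward = False
    for node in path:
        if node in current:
            # Backward move: emit the forward path we just finished, if any,
            # then back up to the revisited node.
            if moving_forward:
                references.append("".join(current))
            current = current[: current.index(node) + 1]
            moving_forward = False
        else:
            current.append(node)
            moving_forward = True
    if moving_forward:
        references.append("".join(current))
    return references


# Assumed traversal log over the nodes of the tree in the figure.
path = list("ABCDCBEGHGWAOUOV")
refs = maximal_forward_references(path)
print(refs)
```

Each emitted string is one maximal forward reference. Counting how often consecutive subsequences occur across many such references yields the large reference sequences of step 1.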

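Research Directions
- Association rule mining
  - Research on the sampling approach, parallel methods, distributed algorithms, etc.
  - Data structures that manage candidate itemsets efficiently and make scanning effective.
  - Measuring the interestingness or importance of rules.
  - Concrete ways to apply association rules in practice.
- Other patterns
  - Problems concerning the definition and application of patterns.
  - Similarity search.
  - Path traversal patterns on the WWW, etc.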

Some Data Mining Systems and Homepages
- Quest (IBM Almaden: Agrawal, et al.):
  - Large DB-oriented association, classification, sequential patterns, similar sequences, etc.
- DBMiner (SFU: Han, et al.):
  - Interactive, multi-level characterization, classification, association, and prediction.
- KDD (GTE: Piatetsky-Shapiro, et al.):
  - Multi-strategy, strong rules, statistical approaches, etc.
  - KD Mine
- Other homepages for data mining
  - Rakesh Agrawal
  - Usama Fayyad
  - Heikki Mannila
  - Jiawei Han
  - Data Mining and Knowledge Discovery Journal and its Editorial Board