Published by Reynard Copeland. Modified over 8 years ago.
1
Association Rule Mining and Its Applications
Jong Soo Park (박종수), Department of Computer Science, Sungshin Women's University
jpark@cs.sungshin.ac.kr
2
Contents
- Data Mining in the KDD Process
- Definition of Association Rules
- Mining Association Rules in Transaction Databases
- Algorithms Apriori and DHP
- Generalized Association Rules
- Cyclic Association Rules and Negative Associations
- Interestingness Measurement
- Sequential Patterns and Path Traversal Patterns
- Research Directions and Reference Homepages
3
Overview of the steps constituting the KDD process
Data -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Patterns -> (Interpretation/Evaluation) -> Knowledge
4
Types of Data-Mining Problems
- Prediction
  - Classification
  - Regression
  - Time Series
- Knowledge Discovery
  - Deviation Detection
  - Database Segmentation
  - Clustering
  - Association Rules
  - Summarization
  - Visualization
  - Text mining
5
Association Rule
- Example: the statement that 90% of transactions that purchase bread and butter also purchase milk.
    [Bread], [Butter] => [Milk] (12.5%, 90%)
  - Antecedent: [Bread], [Butter]. Consequent: [Milk].
  - 90%: confidence factor of the rule (not 100%)
  - 12.5%: support for the rule, the fraction of transactions in the database that contain all items of the rule
- Typical queries:
  - Find all rules that have "Diet Coke" as consequent.
  - Find all rules that have "bagels" in the antecedent.
  - Find the "best" k rules that have "bagels" in the consequent.
6
Definition of Association Rules
- I: a set of literals called items. T: a transaction, a set of items such that T ⊆ I.
- An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
- Notation: X => Y [support, confidence]
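The definition can be made concrete in a few lines of Python. This is only a minimal sketch: the function names `support` and `confidence` and the four toy transactions are illustrative assumptions, not taken from the slides.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X) for the rule X => Y."""
    x, y = frozenset(x), frozenset(y)
    if x & y:
        raise ValueError("X and Y must be disjoint")
    return support(x | y, transactions) / support(x, transactions)


# Hypothetical toy data (not from the slides).
transactions = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk"}),
    frozenset({"beer"}),
]

rule_support = support({"bread", "butter", "milk"}, transactions)
rule_confidence = confidence({"bread", "butter"}, {"milk"}, transactions)
```

Here the rule [bread], [butter] => [milk] is supported by one of four transactions, and one of the two transactions containing bread and butter also contains milk.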
7
Mining Association Rules in Transaction Databases
- Applications: pattern association, market analysis, etc.
- Given: data of transactions, where each transaction has a list of items purchased.
- Find all association rules: the presence of one set of items implies the presence of another set of items.
  - e.g., people who purchased hammers also purchased nails.
- Measurement of rule strength
  - Confidence: X & Y => Z has 90% confidence if 90% of customers who bought X and Y also bought Z.
  - Support: useful rules (for business decisions) should have some minimum transaction support.
8
Two Steps for Association Rules
- Determining "large itemsets" above minimum support
  - Find all combinations of items that have transaction support above minimum support.
  - Research has focused on this phase.
- Generating rules
    for each large itemset L do
      for each non-empty proper subset c of L do
        if (support(L) / support(L - c) >= minimum confidence) then
          output the rule (L - c) => c,
            with confidence = support(L) / support(L - c)
            and support = support(L);
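The rule-generation step can be sketched in Python as follows. The function name `generate_rules` and the support counts are assumptions chosen for illustration. The sketch also assumes the support of every subset of a large itemset is already available, which holds because every subset of a large itemset is itself large.

```python
from itertools import combinations


def generate_rules(supports, min_confidence):
    """Return (antecedent, consequent, support, confidence) for every rule
    (L - c) => c whose confidence reaches `min_confidence`.

    `supports` maps each large itemset (a frozenset) to its support and
    must also contain every non-empty subset of each large itemset.
    """
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for c in combinations(sorted(itemset), size):
                consequent = frozenset(c)
                antecedent = itemset - consequent
                conf = sup / supports[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, consequent, sup, conf))
    return rules


# Illustrative support counts (assumed, for a four-transaction database).
supports = {
    frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}
rules = generate_rules(supports, min_confidence=1.0)
```

With a minimum confidence of 100%, only the rules B => E, E => B, BC => E, and CE => B survive.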
9
- Flow: Candidate Itemsets -> (Scan Database, minimum support) -> Large Itemsets -> (minimum confidence) -> Association Rules
- How to generate candidate itemsets: Apriori method (join step + prune step)
- Focus on data structures to speed up scanning the database: hash tree, trie, hash table, etc.
10
Example (minimum support = 2)

Database D
  TID  Items
  100  A C D
  200  B C E
  300  A B C E
  400  B E

Scan D -> C1
  Itemset  Sup.
  {A}      2
  {B}      3
  {C}      3
  {D}      1
  {E}      3

L1
  Itemset  Sup.
  {A}      2
  {B}      3
  {C}      3
  {E}      3

C2 (generated from L1): {A B}, {A C}, {A E}, {B C}, {B E}, {C E}

Scan D -> C2
  Itemset  Sup.
  {A B}    1
  {A C}    2
  {A E}    1
  {B C}    2
  {B E}    3
  {C E}    2

L2
  Itemset  Sup.
  {A C}    2
  {B C}    2
  {B E}    3
  {C E}    2

C3 (generated from L2): {B C E}

Scan D -> C3
  Itemset  Sup.
  {B C E}  2

L3
  Itemset  Sup.
  {B C E}  2
11
Algorithms for Mining Association Rules
- AIS (Agrawal et al., ACM SIGMOD, May '93)
- SETM (Swami et al., IBM Tech. Rep., Oct '93)
- Apriori (Agrawal et al., VLDB, Sept '94)
- OCD (Mannila et al., AAAI Workshop on KDD, July '94)
- DHP (Park et al., ACM SIGMOD, May '95)
- PARTITION (Savasere et al., VLDB, Sept '95)
- Mining Generalized Association Rules (Srikant et al., VLDB, Sept '95)
- Sampling Approach (Toivonen, VLDB, Sept '96)
- DIC (dynamic itemset counting, Brin et al., ACM SIGMOD, May '97)
- Cyclic Association Rules (Özden et al., IEEE ICDE, Feb '98)
- Negative Associations (Savasere et al., IEEE ICDE, Feb '98)
12
Algorithm Apriori
- L_k: set of large k-itemsets
- C_k: set of candidate k-itemsets
- Steps: C_1 -> L_1 -> C_2 -> L_2 -> ... -> C_k -> L_k
- Input: transaction file. Output: large itemsets.

    L_1 = {large 1-itemsets};
    for (k = 2; L_{k-1} != ∅; k++) do begin
        C_k = apriori-gen(L_{k-1});
        forall transactions t ∈ D do begin
            C_t = subset(C_k, t);
            forall candidates c ∈ C_t do
                c.count++;
        end
        L_k = {c ∈ C_k | c.count >= minsup}
    end
    Answer = ∪_k L_k;
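The pseudocode translates fairly directly into Python. The sketch below makes some simplifying assumptions: `minsup` is an absolute count, itemsets are stored as sorted tuples, and `subset(C_k, t)` is replaced by a naive scan over all candidates rather than a hash tree. It is run on the four-transaction database of the example slide (minimum support = 2).

```python
from itertools import combinations


def apriori_gen(prev_large, k):
    """Join L_{k-1} with itself, then prune every candidate that has a
    (k-1)-subset which is not large."""
    prev = sorted(tuple(sorted(s)) for s in prev_large)
    prev_set = set(prev)
    candidates = set()
    for p in prev:
        for q in prev:
            # Join: same first k-2 items, last item of p < last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every (k-1)-subset must be in L_{k-1}.
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, minsup):
    """Return {k: {itemset: count}} for all large itemsets."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {1: {c: n for c, n in counts.items() if n >= minsup}}
    k = 2
    while large[k - 1]:
        candidates = apriori_gen(large[k - 1].keys(), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:  # naive stand-in for subset(C_k, t)
                if set(c) <= t:
                    counts[c] += 1
        large[k] = {c: n for c, n in counts.items() if n >= minsup}
        k += 1
    del large[k - 1]  # the last level computed is always empty
    return large


D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(D, minsup=2)
```

On this database the sketch reproduces L1, L2, and L3 of the worked example, ending with the single large 3-itemset {B C E}.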
13
Apriori-gen(L_{k-1})
- Join step
    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2},
          p.item_{k-1} < q.item_{k-1};
- Prune step
    forall itemsets c ∈ C_k do
        forall (k-1)-subsets s of c do
            if (s ∉ L_{k-1}) then
                delete c from C_k;
14
Ex: Generation of Candidate Itemsets
- Example: generating C_4 from L_3.
  1. Join step: with L_3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}, the candidate 4-itemsets are {{1, 2, 3, 4}, {1, 3, 4, 5}}.
  2. Prune step:
     - 3-subsets of {1, 2, 3, 4} = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4}}
     - 3-subsets of {1, 3, 4, 5} = {{1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {3, 4, 5}}
     - Since {1, 4, 5} and {3, 4, 5} are not in L_3, {1, 3, 4, 5} is pruned.
  Result: C_4 = {{1, 2, 3, 4}}
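The same example can be checked with a short Python sketch that keeps the join and prune steps separate. The helper names `join_step` and `prune_step` are assumptions made for this illustration.

```python
from itertools import combinations


def join_step(prev_large):
    """Join L_{k-1} with itself on the first k-2 items."""
    prev = sorted(tuple(sorted(s)) for s in prev_large)
    return [p + (q[-1],) for p in prev for q in prev
            if p[:-1] == q[:-1] and p[-1] < q[-1]]


def prune_step(candidates, prev_large):
    """Drop candidates with any (k-1)-subset missing from L_{k-1}."""
    prev_set = {tuple(sorted(s)) for s in prev_large}
    return [c for c in candidates
            if all(s in prev_set for s in combinations(c, len(c) - 1))]


L3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
joined = join_step(L3)        # candidates before pruning
C4 = prune_step(joined, L3)   # candidates after pruning
```

The join produces (1, 2, 3, 4) and (1, 3, 4, 5), and pruning removes the second because (1, 4, 5) is not in L_3.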
15
Data Structure for C_k
- A hash tree is built for the candidate set at each level.
- Example: hash tree for C_2 = {{A,B}, {A,C}, {A,T}, {B,C}, {B,D}, {C,D}}
  - Level 1: interior nodes, branching on A, B, C
  - Level 2: leaf nodes holding {A,B}, {A,C}, {A,T}, {B,C}, {B,D}, {C,D}
16
Example of generating the hash table H_2 and the candidate 2-itemsets C_2 (DHP)
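The idea behind DHP's first pass is that while 1-itemsets are counted, every 2-subset of each transaction is hashed into a bucket, and a pair of large items becomes a candidate only if its bucket count reaches the minimum support. A minimal sketch follows. The hash function, the bucket count of 7, and the item ordering are assumptions chosen for illustration, and the data is the four-transaction database of the Apriori example.

```python
from itertools import combinations


def dhp_candidate_2_itemsets(transactions, minsup, num_buckets, order):
    """Return (bucket counts of H_2, filtered candidate 2-itemsets C_2)."""
    item_counts = {}
    buckets = [0] * num_buckets

    def h(x, y):
        # Assumed hash function, for illustration only.
        return (order[x] * 10 + order[y]) % num_buckets

    # Pass 1: count single items and hash every 2-subset of each transaction.
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for x, y in combinations(sorted(t), 2):
            buckets[h(x, y)] += 1

    large_1 = sorted(i for i, n in item_counts.items() if n >= minsup)
    # A pair of large items is a candidate only if its bucket count
    # reaches minsup; other pairs cannot be large.
    c2 = [(x, y) for x, y in combinations(large_1, 2)
          if buckets[h(x, y)] >= minsup]
    return buckets, c2


D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
buckets, C2 = dhp_candidate_2_itemsets(D, minsup=2, num_buckets=7, order=order)
```

Under these assumptions the hash filter already removes {A B} and {A E}, so C_2 shrinks from the six pairs Apriori would count to four.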
17
Example of L_2 and D_3 (DHP), minimum support s = 2

Database D
  TID  Items
  100  A C D
  200  B C E
  300  A B C E
  400  B E

Counting support in a hash tree
  C_2     Count  L_2
  {A C}   2      {A C}
  {B C}   2      {B C}
  {B E}   3      {B E}
  {C E}   2      {C E}

Reducing the database
  TID  Large 2-itemsets in the transaction  Action
  100  {A C}                                Discard
  200  {B C} {B E} {C E}                    Keep {B C E}
  300  {A C} {B C} {B E} {C E}              Keep {B C E}
  400  {B E}                                Discard

D_3 = {<200; B C E>, <300; B C E>}
18
Generalized Association Rules
- Finding associations between items at any level of the taxonomy.
- Rules:
  - People who buy clothes tend to buy shoes. (x)
  - People who buy outerwear tend to buy shoes. (o)
  - People who buy jackets tend to buy shoes. (x)
- Taxonomy:
  - Clothes: Outerwear (Jackets, Ski Pants), Shirts
  - Footwear: Shoes, Hiking Boots
19
Problem Statement
- Let I = {i_1, i_2, ..., i_m} be a set of literals, D a set of transactions, and T a set of taxonomies, represented as a DAG (directed acyclic graph). A generalized association rule is X => Y [confidence, support], where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X. (X, Y: any level of taxonomy T)
- Steps
  1. Find all sets of items whose support is greater than minimum support.
  2. Generate association rules whose confidence is greater than minimum confidence.
  3. Prune all uninteresting rules from this set with respect to the R-interesting measure.
20
Interestingness of Generalized Rules
- Using the new interest measure, R-interesting: prunes 40% to 60% of the rules as "redundant" rules.
- Example:
  - Assumptions: taxonomy: Skim milk is-a Milk; Milk => Cereal (8% support, 70% confidence); sales of skim milk = 1/4 of the sales of milk.
  - For Skim milk => Cereal:
    - Expectation: 2% support, 70% confidence
    - Actual support and confidence: about 2% support, 70% confidence ==> redundant and uninteresting!
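The arithmetic of the example can be sketched in Python. The function names and the threshold R = 1.1 are hypothetical. The sketch covers only the case shown here, where the antecedent is specialized and the consequent is unchanged, so the expected confidence equals the ancestor rule's confidence.

```python
def expected_from_ancestor(anc_support, anc_confidence, share):
    """Expected (support, confidence) of a rule whose antecedent item is a
    descendant accounting for `share` of its ancestor's sales."""
    return anc_support * share, anc_confidence


def is_r_interesting(actual, expected, r):
    """True if the actual value is at least r times the expected value."""
    return actual >= r * expected


# Milk => Cereal has 8% support and 70% confidence; skim milk is 1/4 of milk.
exp_sup, exp_conf = expected_from_ancestor(0.08, 0.70, share=0.25)

# Actual values for Skim milk => Cereal are about 2% and 70%.
R = 1.1  # hypothetical threshold
interesting = (is_r_interesting(0.02, exp_sup, R)
               or is_r_interesting(0.70, exp_conf, R))
```

Because neither the actual support nor the actual confidence exceeds its expected value by the factor R, the specialized rule is pruned as redundant.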
21
Cyclic Association Rules
- Beer and chips are sold together primarily between 6PM and 9PM.
- Association rules could also display regular hourly, daily, weekly, etc., variation that has the appearance of cycles.
- An association rule X => Y holds in time unit t_i
  - if the support of X => Y in D[i] exceeds MinSup and
  - the confidence of X => Y in D[i] exceeds MinConf.
  - It has a cycle c = (l, o), with a length l and an offset o.
- "coffee => doughnuts" has a cycle (24, 7)
  - if the unit of time is an hour and "coffee => doughnuts" holds during the interval 7AM-8AM every day (i.e., every 24 hours).
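Once it is known in which time units a rule holds, checking for a cycle (l, o) is straightforward: the rule must hold in every unit whose index is congruent to o modulo l. The following sketch assumes a precomputed boolean list `holds`; the function names and the three days of hourly data are hypothetical.

```python
def has_cycle(holds, length, offset):
    """True if the rule holds in every time unit i with i % length == offset."""
    units = range(offset, len(holds), length)
    return len(units) > 0 and all(holds[i] for i in units)


def find_cycles(holds, max_length):
    """All cycles (l, o) with 1 <= l <= max_length and 0 <= o < l."""
    return [(l, o)
            for l in range(1, max_length + 1)
            for o in range(l)
            if has_cycle(holds, l, o)]


# Hypothetical data: 72 hourly units in which "coffee => doughnuts"
# holds only during 7AM-8AM of each day.
holds = [hour % 24 == 7 for hour in range(72)]
cycles = find_cycles(holds, max_length=24)
```

For this data the only cycle of length at most 24 is (24, 7), matching the example.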
22
Negative Association Rules
- A rule: "60% of the customers who buy potato chips do not buy bottled water."
- Negative rule: X =/=> Y such that
  - (a) support(X) and support(Y) are greater than minimum support MinSup; and
  - (b) the rule interest measure is greater than MinRI.
- The interest measure RI of a negative association rule X =/=> Y is defined in terms of expected support, where E[support(X)] is the expected support of an itemset X.
23
Incremental Updating, Parallel and Distributed Algorithms
- An incremental evaluation technique for mining association rules in databases (Eui-Kyung Kim et al., Korea Information Science Society, Fall '95 Conference Proceedings).
- Fast updating algorithm, FUP (Cheung et al., IEEE ICDE, '96): partitioned derivation and incremental updating.
- PDM (Park et al., ACM CIKM, '95): uses a hashing technique (DHP-like) to identify candidate k-itemsets from the local databases.
- Count Distribution (Agrawal & Shafer, IEEE TKDE, Vol 8, No 6, '96): an extension of the Apriori algorithm. May require a lot of messages in count exchange.
- FDM (Cheung et al., IEEE TKDE, Vol 8, No 6, '96). Observation: if an itemset X is globally large, there exists a partition D_i such that X and all its subsets are locally large at D_i. Candidate sets are those which are also local candidates in some component database, plus some message-passing optimizations.
24
When is Market Basket Analysis useful?
- The following three rules are examples of real rules generated from real data:
  - On Thursdays, grocery store consumers often purchase diapers and beer together. ==> Useful rule: high-quality, actionable information.
  - Customers who purchase maintenance agreements are very likely to purchase large appliances. ==> Trivial rule
  - When a new hardware store opens, one of the most commonly sold items is toilet rings. ==> Inexplicable rule
25
Interestingness Measurement for Association Rules (I)
- Two popular measurements: support and confidence
  - The longer the itemset, the lower the support.
- Use taxonomy information for pruning redundant rules
  - A rule is "redundant" if its support and confidence are close to their expected values based on an ancestor of the rule.
  - Example: "milk => cereal" vs. "skim milk => cereal".
  - More effective than pruning based on statistical significance.
- Interestingness of patterns
  - If a pattern contradicts the set of hard beliefs of the user, then this pattern is always interesting to the user.
  - The more a pattern "affects" the belief system, the more interesting it is.
26
Interestingness Measurement (II)
- Improvement (Interest)
  - How much better a rule is at predicting the result than just assuming the result in the first place.
  - Measures co-occurrence rather than implication.
  - Symmetric.
- Conviction
  - How far "condition and result" deviates from independence.
27
Range of measurement
- Improvement
  - Improvement = 1: the items of the condition and the result are completely independent!
  - Improvement < 1: worse rule!
  - Improvement > 1: better rule!
- Conviction
  - Conviction = 1: the items of the condition and the result are completely unrelated.
  - Conviction > 1: better rule!
  - Conviction = ∞: completely related rule
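The slides do not spell out the formulas, so the sketch below assumes the commonly used definitions: improvement (also called lift or interest) is confidence(X => Y) / support(Y), and conviction is (1 - support(Y)) / (1 - confidence(X => Y)). The function names and sample values are illustrative.

```python
import math


def improvement(sup_x, sup_y, sup_xy):
    """confidence(X => Y) / support(Y); symmetric in X and Y."""
    return sup_xy / (sup_x * sup_y)


def conviction(sup_x, sup_y, sup_xy):
    """(1 - support(Y)) / (1 - confidence(X => Y)).

    Returns infinity when the rule always holds (confidence = 1).
    """
    conf = sup_xy / sup_x
    if conf == 1.0:
        return math.inf
    return (1.0 - sup_y) / (1.0 - conf)


# Independent items: support(X ∪ Y) = support(X) * support(Y).
indep_improvement = improvement(0.5, 0.5, 0.25)
indep_conviction = conviction(0.5, 0.5, 0.25)

# A rule that always holds: every transaction with X also contains Y.
perfect_improvement = improvement(0.5, 0.75, 0.5)
perfect_conviction = conviction(0.5, 0.75, 0.5)
```

Under these definitions both measures equal 1 for independent items, while a rule that never fails has improvement above 1 and infinite conviction, consistent with the ranges listed above.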
28
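Sequential Patterns
- Examples of such a pattern:
  - Customers typically rent "Star Wars", then "Empire Strikes Back", and then "Return of the Jedi".
  - Note that these rentals need not be consecutive.
  - Course registration: Tourism and Leisure (semester 1) -> The Capital Region and Housing Problems (semester 2) -> The Stock Market (semester 3)
  - Stock price movement patterns: Samsung Electronics rises -> LG Electronics rises -> Bohae Brewery rises
  - Purchase patterns: suit -> dress shirt -> black shoes?
  - Patterns in the order of disease onset in medical diagnosis
  - Patterns of treatment and medication in patient care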
29
Mining Sequential Patterns
- An itemset is a non-empty set of items.
- A sequence is an ordered list of itemsets.
- Table: Customer Id (1 to 5) and Customer Sequence
- Result: sequential patterns with support > 25%
30
The Algorithm for Sequential Patterns (Agrawal and Srikant, ICDE 1995)
- Sort Phase
  - major key: customer-id, minor key: transaction-time
- Litemset Phase
  - litemset = an itemset with minimum support
- Transformation Phase
  - A customer sequence is represented by a list of sets of litemsets
- Sequence Phase (an application of the Apriori algorithm)
  - Candidate sequences ==> Large sequences
- Maximal Phase
  - a sequence s is maximal if s is not contained in any other sequence
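Two building blocks of these phases, sequence containment (used for support counting) and the maximal-phase filter, can be sketched in Python. The function names and the five customer sequences are illustrative assumptions; the sketch does not implement the full algorithm, and `maximal` assumes the pattern list has no duplicates.

```python
def contains(customer_seq, pattern):
    """True if each itemset of `pattern` is a subset of a distinct itemset
    of `customer_seq`, in the same order (not necessarily consecutive)."""
    pos = 0
    for itemset in pattern:
        # Greedily take the earliest customer itemset that covers this one.
        while pos < len(customer_seq) and not itemset <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def seq_support(pattern, customers):
    """Fraction of customers whose sequence contains `pattern`."""
    return sum(contains(c, pattern) for c in customers) / len(customers)


def maximal(patterns):
    """Keep only the patterns not contained in any other pattern."""
    return [p for p in patterns
            if not any(p is not q and contains(q, p) for q in patterns)]


# Hypothetical customer sequences (lists of itemsets), one per customer.
customers = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
s1 = seq_support([{30}, {90}], customers)
s2 = seq_support([{30}, {40, 70}], customers)
kept = maximal([[{30}], [{40, 70}], [{30}, {90}], [{30}, {40, 70}]])
```

Both two-element patterns are supported by two of the five customers (40%), and the maximal filter drops the shorter patterns they contain.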
31
Mining Path Traversal Patterns
- Understanding user access patterns in a distributed information-providing environment such as WWW, Hitel, etc.
  - helps improve the system design
  - leads to better marketing decisions
- Capturing user access patterns
  - mining path traversal patterns
  - capturing user traveling behavior
  - improving the quality of such services
32
Traversal patterns
- Figure: a traversal tree over the nodes A, B, C, D, E, G, H, W, O, U, V, with traversal steps numbered 1 to 15.
- Maximal forward references: {ABCD, ABEGH, ABEGW, AOU, AOV}
- Steps:
  1. Find large reference sequences.
  2. Find maximal reference sequences.
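Extracting maximal forward references from a traversal log can be sketched as follows: a forward reference ends whenever the user moves back to a node already on the current path. The function name is hypothetical. The 16-node path is an assumption reconstructed to be consistent with the 15 numbered steps and the result set listed above.

```python
def maximal_forward_references(path):
    """Split a traversal path into its maximal forward references."""
    refs = []
    current = []            # nodes on the current forward path
    moving_forward = True
    for node in path:
        if node in current:
            # Backward move: emit the path only if we were moving forward.
            if moving_forward:
                refs.append("".join(current))
            current = current[:current.index(node) + 1]
            moving_forward = False
        else:
            current.append(node)
            moving_forward = True
    if moving_forward and current:
        refs.append("".join(current))
    return refs


# Assumed traversal path (15 steps between 16 visited nodes).
path = list("ABCDCBEGHGWAOUOV")
refs = maximal_forward_references(path)
```

With this path the sketch yields ABCD, ABEGH, ABEGW, AOU, and AOV, the same set as on the slide.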
33
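Research Directions
- Association rule mining
  - Research on sampling approaches, parallel methods, distributed algorithms, etc.
  - Research on data structures that manage candidate itemsets efficiently and speed up scanning
  - Measuring the interestingness or importance of rules
  - Concrete methods for applying association rules in practice
- Other patterns
  - Research on problems of defining and applying patterns
  - Similarity search
  - Research on path traversal patterns in the WWW, etc.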
34
Some Data Mining Systems and Homepages
- Quest (IBM Almaden: Agrawal, et al.):
  - large DB-oriented association, classification, sequential patterns, similar sequences, etc.
  - "http://www.almaden.ibm.com/cs/quest/"
- DBMiner (SFU: Han, et al.):
  - Interactive, multi-level characterization, classification, association & prediction.
  - "http://db.cs.sfu.ca/DBMiner/"
- KDD (GTE: Piatetsky-Shapiro, et al.):
  - multi-strategy, strong rules, statistical approaches, etc.
  - "http://info.gte.com/~kdd/index.html"
  - KD Mine: "http://info.gte.com/~kdd/index.html"
- Other Homepages for Data Mining
  - Rakesh Agrawal: "http://www.almaden.ibm.com/cs/people/ragrawal/"
  - Usama Fayyad: "http://www.research.microsoft.com/~fayyad/"
  - Heikki Mannila: "http://www.cs.Helsinki.Fl/~mannila/"
  - Jiawei Han: "http://fas.sfu.ca/cs/people/Faculty/Han/"
  - Editorial Board of the Data Mining and Knowledge Discovery Journal: "http://www.research.microsoft.com/research/datamine/"