ReDRIVE: Result-Driven Database Exploration through Recommendations
Date: 2012/07/02
Source: Marina Drosou, Evaggelia Pitoura (CIKM’11)
Speaker: Er-Gang Liu
Advisor: Dr. Jia-ling Koh

Presentation transcript:


Outline: Introduction; The ReDRIVE framework; FaSets; Interesting faSets; Top-k faSets computation; Recommendations; Statistics maintenance; Two-Phase algorithm; Experiment; Conclusion


Introduction - Motivation
A user searches a database (e.g., IMDB) through queries without knowing the exact content of the database.

Introduction - Motivation
Users interact with databases by formulating queries, often without a clear understanding of their information needs.
Example query: "Show me movies directed by F.F. Coppola"
Query result:
Director      Title                Year  Genre
F.F. Coppola  Tetro                2009  Drama
F.F. Coppola  Youth Without Youth  2007  Fantasy
F.F. Coppola  The Godfather        1972  Drama
F.F. Coppola  Rumble Fish          1983  Drama
F.F. Coppola  The Conversation     1974  Thriller
F.F. Coppola  The Outsiders        1983  Drama
F.F. Coppola  Supernova            2000  Thriller
F.F. Coppola  Apocalypse Now       1979  Drama

Introduction - Goal
1. Query: SELECT title, year, genre FROM movies, directors, genres WHERE director = 'F.F. Coppola' AND join(Q)
2. Query result: the eight F.F. Coppola movies listed in the motivation example (Director, Title, Year, Genre).
3. Recommendation: interesting faSets of the result, such as {Drama}, {Drama, 2009}, {Drama, 1983}, {Thriller}, {Thriller, 1974}, {Fantasy}, {Fantasy, 2007}, {Fantasy, 2007, Youth Without Youth}.
4. Exploratory query (for the faSet {Drama, 1983}): SELECT director FROM movies, directors, genres WHERE year = 1983 AND genre = 'Drama' AND join(Q)


FaSets
Facet condition: a condition A_i = a_i on some attribute A_i of Res(Q).
m-faSet: a set of m facet conditions on m different attributes of Res(Q).
For example, on the F.F. Coppola query result, {Genre = Drama} is a 1-faSet and {Year = 1983, Genre = Drama} is a 2-faSet.
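The definition above can be illustrated with a minimal Python sketch. It assumes a faSet is represented as a frozenset of (attribute, value) pairs; the function names `fasets_of_tuple` and `fasets_of_result` are hypothetical and are not taken from the paper.

```python
from itertools import combinations

# Two tuples of the F.F. Coppola query result (illustrative subset).
result = [
    {"Title": "Rumble Fish", "Year": 1983, "Genre": "Drama"},
    {"Title": "The Outsiders", "Year": 1983, "Genre": "Drama"},
]

def fasets_of_tuple(row, m):
    """Return all m-faSets that hold for one tuple.

    Each faSet is a frozenset of (attribute, value) facet conditions.
    Taking combinations of the row's items guarantees that the m
    conditions are on m different attributes.
    """
    return {frozenset(c) for c in combinations(row.items(), m)}

def fasets_of_result(rows, m):
    """Return all m-faSets that appear in at least one tuple of the result."""
    out = set()
    for row in rows:
        out |= fasets_of_tuple(row, m)
    return out

one_fasets = fasets_of_result(result, 1)
two_fasets = fasets_of_result(result, 2)
print(len(one_fasets), len(two_fasets))  # 4 5
```

Both tuples share the 2-faSet {Year = 1983, Genre = Drama}, which is why the set of 2-faSets has five elements rather than six.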

Interestingness score of a faSet
score(f, Q) = p(f | Res(Q)) / p(f | D), where p(f | Res(Q)) is the support of f in Res(Q) and p(f | D) is the support of f in the database D.
Example for Q = "F.F. Coppola": the query result has 8 tuples (5 Drama, 2 Thriller); the database has 10,000 tuples (50 Drama, 5 Thriller).
score(Drama, Q) = (5/8) / (50/10000) = 125
score(Thriller, Q) = (2/8) / (5/10000) = 500
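The two scores on this slide can be reproduced with a short Python sketch. It assumes the score is the ratio of the support in Res(Q) to the support in D, and it uses exact fractions to avoid rounding; the function name `score` and its parameters are illustrative.

```python
from fractions import Fraction

def score(count_in_result, result_size, count_in_db, db_size):
    """Interestingness score: support in Res(Q) divided by support in D."""
    p_res = Fraction(count_in_result, result_size)  # p(f | Res(Q))
    p_db = Fraction(count_in_db, db_size)           # p(f | D)
    return p_res / p_db

# Counts from the slide: 8 result tuples (5 Drama, 2 Thriller),
# 10,000 database tuples (50 Drama, 5 Thriller).
drama = score(5, 8, 50, 10000)
thriller = score(2, 8, 5, 10000)
print(drama, thriller)  # 125 500
```

Thriller scores higher than Drama even though it is less common in the result, because it is much rarer in the database as a whole.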


Top-k faSets computation
The interestingness score of a faSet f is p(f | Res(Q)) / p(f | D).
- p(f | Res(Q)) is computed on-line.
- p(f | D) is too expensive to compute on-line, so it must be estimated: statistics are computed off-line and stored so that p(f | D) can be estimated for any faSet f.
- FaSets that appear frequently in the database D are not expected to be interesting.

Estimating p(f | D)
It is useful to maintain information about the support of rare faSets in D. By analogy with frequent itemset mining, the paper defines:
- Rare faSet (RF): a faSet with frequency under a threshold.
- Closed rare faSet (CRF): a rare faSet with no proper subset with the same frequency.
- Minimal rare faSet (MRF): a rare faSet with no rare subset.
|MRFs| ≤ |CRFs| ≤ |RFs|
MRFs can tell us whether f is rare, but not its frequency. CRFs can tell us its frequency, but there are still too many of them.


Example: minimal rare faSets
Rare faSet (RF): a faSet with frequency under a threshold.
Minimal rare faSet (MRF): a rare faSet with no rare subset.
FaSets and the subsets to check:
- ab: a, b
- acd: ac, ad, cd
- ade: ad, ae, de

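Example: closed rare faSets
Closed rare faSet (CRF): a rare faSet with no proper subset with the same frequency. Counts are given in parentheses.
- abd(1): ab(2), ad(2), bd(2). No subset has the same count, so abd is a closed rare faSet.
- bde(0): bd(1), be(1), de(2). No subset has the same count, so bde is a closed rare faSet.
- bcde(0): bcd(1), bce(1), bde(0), cde(1). The subset bde has the same count, so bcde is not a closed rare faSet.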
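The three definitions (RF, MRF, CRF) can be checked by brute force on a small dataset. The sketch below is illustrative only: the five transactions and the threshold are made up for this example (they are not the data behind the slides), and only non-empty proper subsets are considered.

```python
from itertools import combinations

# Hypothetical toy data; each transaction is a set of items.
transactions = [{"a", "b", "d"}, {"a", "b"}, {"a", "d"}, {"b", "d"}, {"c"}]
THRESHOLD = 3  # assumption: a faSet is rare if its count is below 3

def count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def proper_subsets(itemset):
    """All non-empty proper subsets of itemset."""
    return [frozenset(c)
            for m in range(1, len(itemset))
            for c in combinations(sorted(itemset), m)]

items = sorted(set().union(*transactions))
all_sets = [frozenset(c)
            for m in range(1, len(items) + 1)
            for c in combinations(items, m)]

# RF: frequency under the threshold.
rare = {s for s in all_sets if count(s) < THRESHOLD}
# MRF: rare, and no proper subset is rare.
minimal = {s for s in rare
           if not any(sub in rare for sub in proper_subsets(s))}
# CRF: rare, and no proper subset has the same frequency.
closed = {s for s in rare
          if not any(count(sub) == count(s) for sub in proper_subsets(s))}

print(len(minimal), len(closed), len(rare))  # 4 8 12
```

On this toy data the sizes follow the ordering |MRFs| ≤ |CRFs| ≤ |RFs| stated in the definitions, and abd (count 1, with all 2-item subsets at count 2) is closed, as in the slide's example.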

Statistics
Statistics are maintained in the form of ε-Tolerance Closed Rare FaSets (ε-CRFs).
A faSet f is an ε-CRF for a set of tuples S if and only if:
- it is rare for S, and
- it has no proper rare subset f', |f'| = |f| - 1, such that count(f', S) < (1 + ε) count(f, S), where ε ≥ 0.
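The tolerance condition can be sketched directly from the definition on this slide. In the Python sketch below the tolerance parameter is called `eps`; the transactions, the rarity threshold and the function names are hypothetical, and the empty set is not treated as a subset.

```python
from itertools import combinations

# Hypothetical toy data; each transaction is a set of items.
transactions = [{"a", "b", "d"}, {"a", "b"}, {"a", "d"}, {"b", "d"}, {"c"}]
RARE_BELOW = 3  # assumption: a faSet is rare if its count is below 3

def count(faset):
    """Number of transactions that contain every item of faset."""
    return sum(1 for t in transactions if faset <= t)

def is_rare(faset):
    return count(faset) < RARE_BELOW

def is_eps_crf(faset, eps):
    """Check the tolerance-closed-rare condition as stated on the slide.

    faset must be rare, and no rare subset with exactly one item fewer
    may have a count below (1 + eps) * count(faset).
    """
    faset = frozenset(faset)
    if not is_rare(faset):
        return False
    for sub in combinations(sorted(faset), len(faset) - 1):
        sub = frozenset(sub)
        if not sub:  # a 1-faSet has no non-empty subset to compare with
            continue
        if is_rare(sub) and count(sub) < (1 + eps) * count(faset):
            return False
    return True

print(is_eps_crf("abd", 0.5))  # True:  2 < 1.5 does not hold for ab, ad, bd
print(is_eps_crf("abd", 1.5))  # False: 2 < 2.5 holds for ab
```

A larger tolerance discards more faSets, because a faSet is dropped whenever a rare subset already has a count close enough to its own.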


The Two-Phase Algorithm (1/3)
Maintain all ε-CRFs, where "rare" is defined by the threshold minsupp_r.
First phase:
- X = {all 1-faSets in Res(Q)}
- Y = {ε-CRFs that consist only of 1-faSets in X}
Example (F.F. Coppola query result): X contains the 1-faSets Drama, Fantasy and Thriller. The collection of maintained statistics contains the ε-CRFs Drama (count 50) and Thriller, so Y = {Drama, Thriller}.

The Two-Phase Algorithm (2/3)
Maintain all ε-CRFs, where "rare" is defined by the threshold minsupp_r.
First phase (continued):
- Y = {ε-CRFs that consist only of 1-faSets in X}
- Z = {faSets in Res(Q) that are supersets of some faSet in Y}
- Compute scores for the faSets in Z.
Example: for the result tuples (F.F. Coppola, Tetro, 2009, Drama) and (F.F. Coppola, Supernova, 2000, Thriller) and Y = {Drama, Thriller}, Z contains {2009, Drama}, {Tetro, 2009, Drama}, {2000, Thriller} and {Supernova, 2000, Thriller}.
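The construction of X, Y and Z in the first phase can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: the maintained closed rare faSets are given as a plain set, the Western entry is a made-up example of a stored faSet that does not occur in Res(Q), and "superset" is taken to include the faSet itself.

```python
from itertools import combinations

# The two result tuples shown on the slide (Director omitted).
result = [
    {"Title": "Tetro", "Year": 2009, "Genre": "Drama"},
    {"Title": "Supernova", "Year": 2000, "Genre": "Thriller"},
]

# Hypothetical store of maintained closed rare faSets.
maintained = {
    frozenset({("Genre", "Drama")}),
    frozenset({("Genre", "Thriller")}),
    frozenset({("Genre", "Western")}),  # made-up entry, absent from Res(Q)
}

def all_fasets(row):
    """All faSets (of every size) that hold for one tuple."""
    items = list(row.items())
    return {frozenset(c)
            for m in range(1, len(items) + 1)
            for c in combinations(items, m)}

# X: all facet conditions (1-faSets) that appear in Res(Q).
X = {cond for row in result for cond in row.items()}
# Y: maintained faSets built only from conditions in X.
Y = {f for f in maintained if f <= X}
# Z: faSets in Res(Q) that are supersets of some faSet in Y.
Z = {f for row in result for f in all_fasets(row)
     if any(y <= f for y in Y)}

print(len(Y), len(Z))  # 2 8
```

Only the faSets in Z are scored in this phase, since each of them contains a faSet known to be rare in the database.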

The Two-Phase Algorithm (3/3)
Second phase:
- Let f be a faSet examined in the second phase. It was not reached through the maintained rare faSets in the first phase, which means that p(f | D) > minsupp_r.
- Let s be the k-th highest score in Z. For f to score higher than s, it needs p(f | Res(Q)) > s * p(f | D) > s * minsupp_r.
- Set the threshold minsupp_f = s * minsupp_r and execute a frequent itemset mining algorithm (A-priori) on Res(Q) with this threshold, keeping the faSets with p(f | Res(Q)) > minsupp_f.
- The top-k faSets are selected from the faSets of Z, e.g. {2009, Drama}, {Tetro, 2009, Drama}, {2000, Thriller}, {Supernova, 2000, Thriller}, and the frequent faSets found in this phase.
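The threshold used in the second phase can be illustrated with a small Python sketch. The values of s and minsupp_r below are made up, and a brute-force enumeration stands in for A-priori; it returns the same frequent faSets on a result this small but does not prune candidates the way A-priori does.

```python
from fractions import Fraction
from itertools import combinations

# The F.F. Coppola query result (Director omitted, since it is the query condition).
result = [
    {"Title": "Tetro", "Year": 2009, "Genre": "Drama"},
    {"Title": "Youth Without Youth", "Year": 2007, "Genre": "Fantasy"},
    {"Title": "The Godfather", "Year": 1972, "Genre": "Drama"},
    {"Title": "Rumble Fish", "Year": 1983, "Genre": "Drama"},
    {"Title": "The Conversation", "Year": 1974, "Genre": "Thriller"},
    {"Title": "The Outsiders", "Year": 1983, "Genre": "Drama"},
    {"Title": "Supernova", "Year": 2000, "Genre": "Thriller"},
    {"Title": "Apocalypse Now", "Year": 1979, "Genre": "Drama"},
]

def support_in_result(faset):
    """p(f | Res(Q)): fraction of result tuples that satisfy the faSet."""
    hits = sum(1 for row in result if faset <= set(row.items()))
    return Fraction(hits, len(result))

def frequent_fasets(minsupp_f):
    """Brute-force stand-in for A-priori: faSets with support above minsupp_f."""
    candidates = set()
    for row in result:
        items = list(row.items())
        for m in range(1, len(items) + 1):
            candidates |= {frozenset(c) for c in combinations(items, m)}
    return {f for f in candidates if support_in_result(f) > minsupp_f}

s = 100                        # hypothetical k-th highest score in Z
minsupp_r = Fraction(1, 500)   # hypothetical rarity threshold
minsupp_f = s * minsupp_r      # second-phase threshold: 1/5
found = frequent_fasets(minsupp_f)
print(minsupp_f, len(found))   # 1/5 4
```

With these made-up values, only {Drama}, {Thriller}, {1983} and {1983, Drama} pass the threshold, so they are the only additional candidates that could enter the top-k.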


Experiment - Datasets
Real datasets:
- AUTOS: single-relation, tuples, 41 attributes
- MOVIES: 13 relations, 10,000 to 1,000,000 tuples, 2 to 5 attributes
Synthetic dataset:
- ZIPF: single relation, 1000 tuples, 5 attributes

Experiment - Generation [figure]

Top-k faSets discovery
- Baseline: considers only the frequent faSets in Res(Q).
- TPA: the Two-Phase Algorithm.

Conclusion
- Introduced ReDRIVE, a novel database exploration framework that recommends to users items which may be of interest to them although they are not part of the results of their original query.
- Proposed a frequency estimation method based on ε-CRFs.
- Proposed a Two-Phase Algorithm for locating the top-k most interesting faSets.

δ = 0.04
- "abcd" is the closest δ-TCFI superset of all its subsets that contain the item "a".
- "bcd" is the closest δ-TCFI superset of "bc", "cd" and "c".
- Let Y = abcd; then X1 = {abc, abd, acd}, X2 = {ab, ac, ad} and X3 = {a}.

Estimated frequencies:
- "abc", "abd", "acd": freq(abcd) · ext(abcd, 1) = 100 * 1.03 = 103
- "ab", "ac", "ad": freq(abcd) · ext(abcd, 2) = 107
- "a": freq(abcd) · ext(abcd, 3) = 111