ICDE 2012 Discovering Threshold-based Frequent Closed Itemsets over Probabilistic Data Yongxin Tong 1, Lei Chen 1, Bolin Ding 2 1 Department of Computer Science and Engineering, The Hong Kong University of Science and Technology 2 Department of Computer Science, University of Illinois at Urbana-Champaign.

Similar presentations
Recap: Mining association rules from large datasets

A distributed method for mining association rules
Association Analysis (Data Engineering). Type of attributes in assoc. analysis Association rule mining assumes the input data consists of binary attributes.
Data Mining (Apriori Algorithm)DCS 802, Spring DCS 802 Data Mining Apriori Algorithm Spring of 2002 Prof. Sung-Hyuk Cha School of Computer Science.
Mining Frequent Patterns II: Mining Sequential & Navigational Patterns Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
Mining Compressed Frequent- Pattern Sets Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng Department of Computer Science University of Illinois at Urbana-Champaign.
Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.
Frequent Closed Pattern Search By Row and Feature Enumeration
LOGO Association Rule Lecturer: Dr. Bo Yuan
Association Rule Mining. 2 The Task Two ways of defining the task General –Input: A collection of instances –Output: rules to predict the values of any.
Zeev Dvir – GenMax From: “ Efficiently Mining Frequent Itemsets ” By : Karam Gouda & Mohammed J. Zaki.
1 Department of Information & Computer Education, NTNU SmartMiner: A Depth First Algorithm Guided by Tail Information for Mining Maximal Frequent Itemsets.
Sampling Large Databases for Association Rules ( Toivenon’s Approach, 1996) Farzaneh Mirzazadeh Fall 2007.
1 Efficient Subgraph Search over Large Uncertain Graphs Ye Yuan 1, Guoren Wang 1, Haixun Wang 2, Lei Chen 3 1. Northeastern University, China 2. Microsoft.
Frequent Subgraph Pattern Mining on Uncertain Graph Data
An Approach to Evaluate Data Trustworthiness Based on Data Provenance Department of Computer Science Purdue University.
Association Mining Data Mining Spring Transactional Database Transaction – A row in the database i.e.: {Eggs, Cheese, Milk} Transactional Database.
A Generic Framework for Handling Uncertain Data with Local Correlations Xiang Lian and Lei Chen Department of Computer Science and Engineering The Hong.
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Mining Frequent Itemsets from Uncertain Data Presented by Chun-Kit Chui, Ben Kao, Edward Hung Department of Computer Science, The University of Hong Kong.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
© Vipin Kumar CSci 8980 Fall CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.
Fast Algorithms for Association Rule Mining
ICMLC2007, Aug. 19~22, 2007, Hong Kong 1 Incremental Maintenance of Ontology- Exploiting Association Rules Ming-Cheng Tseng 1, Wen-Yang Lin 2 and Rong.
Mining Frequent Itemsets with Constraints Takeaki Uno Takeaki Uno National Institute of Informatics, JAPAN Nov/2005 FJWCP.
NGDM’02 1 Efficient Data-Reduction Methods for On-line Association Rule Mining H. Bronnimann B. ChenM. Dash, Y. Qiao, P. ScheuermannP. Haas Polytechnic.
Ch5 Mining Frequent Patterns, Associations, and Correlations
STA Lecture 161 STA 291 Lecture 16 Normal distributions: ( mean and SD ) use table or web page. The sampling distribution of and are both (approximately)
VLDB 2012 Mining Frequent Itemsets over Uncertain Databases Yongxin Tong 1, Lei Chen 1, Yurong Cheng 2, Philip S. Yu 3 1 The Hong Kong University of Science.
Sequential PAttern Mining using A Bitmap Representation
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
Trust-Aware Optimal Crowdsourcing With Budget Constraint Xiangyang Liu 1, He He 2, and John S. Baras 1 1 Institute for Systems Research and Department.
Association Rules. CS583, Bing Liu, UIC 2 Association rule mining Proposed by Agrawal et al in Initially used for Market Basket Analysis to find.
Mining High Utility Itemset in Big Data
Parallel Mining Frequent Patterns: A Sampling-based Approach Shengnan Cong.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Outline Introduction – Frequent patterns and the Rare Item Problem – Multiple Minimum Support Framework – Issues with Multiple Minimum Support Framework.
Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.
Association Rule Mining Data Mining and Knowledge Discovery Prof. Carolina Ruiz and Weiyang Lin Department of Computer Science Worcester Polytechnic Institute.
1 FINDING FUZZY SETS FOR QUANTITATIVE ATTRIBUTES FOR MINING OF FUZZY ASSOCIATE RULES By H.N.A. Pham, T.W. Liao, and E. Triantaphyllou Department of Industrial.
Mining Frequent Itemsets from Uncertain Data Presenter : Chun-Kit Chui Chun-Kit Chui [1], Ben Kao [1] and Edward Hung [2] [1] Department of Computer Science.
MINING COLOSSAL FREQUENT PATTERNS BY CORE PATTERN FUSION FEIDA ZHU, XIFENG YAN, JIAWEI HAN, PHILIP S. YU, HONG CHENG ICDE07 Advisor: Koh JiaLing Speaker:
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture node is modified based on Lecture Notes for.
1 AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Hong.
Discriminative Frequent Pattern Analysis for Effective Classification By Hong Cheng, Xifeng Yan, Jiawei Han, Chih- Wei Hsu Presented by Mary Biddle.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Direct mining of discriminative patterns for classifying.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Indexing and Mining Free Trees Yun Chi, Yirong Yang, Richard R. Muntz Department of Computer Science University of California, Los Angeles, CA {
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
Approach to Data Mining from Algorithm and Computation Takeaki Uno, ETH Switzerland, NII Japan Hiroki Arimura, Hokkaido University, Japan.
Frequent Pattern Mining
Probabilistic Data Management
CARPENTER Find Closed Patterns in Long Biological Datasets
Market Basket Many-to-many relationship between different objects
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Mining Frequent Itemsets over Uncertain Databases
Probabilistic Data Management
DIRECT HASHING AND PRUNING (DHP) ALGORITHM
Association Rule Mining
A Parameterised Algorithm for Mining Association Rules
Data Mining Association Analysis: Basic Concepts and Algorithms
Approximate Frequency Counts over Data Streams
Association Analysis: Basic Concepts
Presentation transcript:

ICDE 2012
Discovering Threshold-based Frequent Closed Itemsets over Probabilistic Data
Yongxin Tong 1, Lei Chen 1, Bolin Ding 2
1 Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
2 Department of Computer Science, University of Illinois at Urbana-Champaign

Outline  Why Data Uncertainty Is Ubiquitous  A Motivation Example  Problem Definitions  MPFCI Algorithm  Experiments   Related Work  Conclusion 2

Outline
» Why Data Uncertainty Is Ubiquitous
• A Motivation Example
• Problem Definitions
• MPFCI Algorithm
• Experiments
• Related Work
• Conclusion

Scenario 1: Data Integration
• Unreliability of multiple data sources.
[Figure: near-duplicate documents from several data sources are merged into a single document entity; each source contributes a confidence (e.g., 0.3) that the document is true.]

Scenario 2: Crowdsourcing
[Photo: MTurk workers, by Andrian Chen]
• Requesters outsource many tasks to online crowd workers through web crowdsourcing platforms (e.g., AMT, oDesk, CrowdFlower).
• Different workers might provide different answers!
• How to aggregate the different answers from the crowd?
• How to distinguish which workers are spammers?

Scenario 2: Crowdsourcing (cont'd)
"Where to find the best sea food in Hong Kong? Sai Kung or Hang Kau?"
• The correct-answer ratio measures the uncertainty of different workers!
• Majority voting is widely used in crowdsourcing; in fact, it determines which answer is frequent with min_sup greater than half of the total number of answers.
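A minimal sketch of this correspondence (plain Python; the function name and data are illustrative, not from the talk): an answer wins a majority vote exactly when its count is "frequent" with min_sup greater than half the number of answers.

    from collections import Counter

    def majority_answer(answers):
        """Majority voting: an answer wins iff its count exceeds half of
        all answers, i.e. it is 'frequent' with min_sup > len(answers)/2."""
        answer, count = Counter(answers).most_common(1)[0]
        return answer if count > len(answers) / 2 else None   # None: no majority

    print(majority_answer(["Sai Kung", "Sai Kung", "Hang Kau", "Sai Kung"]))  # Sai Kung
    print(majority_answer(["Sai Kung", "Hang Kau"]))                          # None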

Outline  Why Data Uncertainty Is Ubiquitous  A Motivation Example  Problem Definitions  MPFCI Algorithm  Experiments   Related Work  Conclusion 7

Motivation Example
• In an intelligent traffic system, many sensors are deployed to collect real-time monitoring data in order to analyze traffic jams.

TID | Location | Weather | Time         | Speed | Probability
T1  | HKUST    | Foggy   | 8:30-9:00 AM | …     | …
T2  | HKUST    | Rainy   | 5:30-6:00 PM | …     | …
T3  | HKUST    | Sunny   | 3:30-4:00 PM | …     | …
T4  | HKUST    | Rainy   | 5:30-6:00 PM | …     | …

Motivation Example (cont'd)
• Based on the above data, we analyze the causes of traffic jams from the viewpoint of uncertain frequent pattern mining.
• For example, we find that {Time = 5:30-6:00 PM; Weather = Rainy} is a frequent itemset with high probability.
• Therefore, under the condition {Time = 5:30-6:00 PM; Weather = Rainy}, a traffic jam is very likely.
(Table repeated from the previous slide.)

How to find probabilistic frequent itemsets? Possible World Semantics

TID | Transaction | Prob.
T1  | a b c d     | 0.9
T2  | a b c       | 0.6
T3  | a b c       | 0.7
T4  | a b c d     | 0.9

PW   | Transactions       | Prob.
PW1  | {T1}               | 0.0108
PW2  | {T1, T2}           | 0.0162
PW3  | {T1, T3}           | 0.0252
PW4  | {T1, T4}           | 0.0972
PW5  | {T1, T2, T3}       | 0.0378
PW6  | {T1, T2, T4}       | 0.1458
PW7  | {T1, T3, T4}       | 0.2268
PW8  | {T1, T2, T3, T4}   | 0.3402
PW9  | {T2}               | 0.0018
PW10 | {T2, T3}           | 0.0042
PW11 | {T2, T4}           | 0.0162
PW12 | {T2, T3, T4}       | 0.0378
PW13 | {T3}               | 0.0028
PW14 | {T3, T4}           | 0.0252
PW15 | {T4}               | 0.0108
PW16 | { }                | 0.0012

• If min_sup = 2 and the probability threshold is 0.8:
• Frequent probability of {a,b,c,d}: Pr{sup(abcd) ≥ min_sup} = ΣPr(PWi) = Pr(PW4) + Pr(PW6) + Pr(PW7) + Pr(PW8) = 0.81 > 0.8.
• Itemsets {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}: Freq. Prob. = 0.9726.
• Itemsets {d}, {a,d}, {b,d}, {c,d}, {a,b,d}, {a,c,d}, {b,c,d}, {a,b,c,d}: Freq. Prob. = 0.81.
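The numbers above can be checked mechanically. Below is a brute-force sketch (plain Python, written for this transcript; exponential in the number of transactions, so toy-sized only) that enumerates all 2^n possible worlds of the slide's uncertain database:

    from itertools import chain, combinations

    # Uncertain transaction database from the slide: (items, existence probability)
    UTD = [({"a", "b", "c", "d"}, 0.9),
           ({"a", "b", "c"},      0.6),
           ({"a", "b", "c"},      0.7),
           ({"a", "b", "c", "d"}, 0.9)]

    def frequent_probability(itemset, utd, min_sup):
        """Pr{sup(itemset) >= min_sup}: total probability of the possible
        worlds in which at least min_sup transactions contain the itemset."""
        n = len(utd)
        total = 0.0
        for world in chain.from_iterable(combinations(range(n), k) for k in range(n + 1)):
            prob = 1.0
            for i, (_, p) in enumerate(utd):
                prob *= p if i in world else (1 - p)   # world keeps or drops T_{i+1}
            if sum(1 for i in world if itemset <= utd[i][0]) >= min_sup:
                total += prob
        return total

    print(frequent_probability({"a", "b", "c", "d"}, UTD, 2))  # 0.81
    print(frequent_probability({"a", "b", "c"}, UTD, 2))       # 0.9726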

Motivation Example (cont'd)
• How can we distinguish the 15 itemsets of the two groups on the previous slide?
• Extend the method of mining frequent closed itemsets from certain data to the uncertain environment:
• Mining Probabilistic Frequent Closed Itemsets.

• In deterministic data, an itemset is a frequent closed itemset iff:
– it is frequent, and
– its support is larger than the support of any of its supersets.
• For example, given the deterministic database below and min_sup = 2:

TID | Transaction
T1  | a b c d e
T2  | a b c d
T3  | a b c
T4  | a b c

– {abc}.support = 4 ≥ 2, and 4 exceeds every superset's support (Yes)
– {abcd}.support = 2 (Yes)
– {abcde}.support = 1 (No)

The same possible worlds as before, now with their frequent closed itemsets (FCI):

PW   | Transactions       | Prob.  | FCI
PW1  | {T1}               | 0.0108 | { }
PW2  | {T1, T2}           | 0.0162 | {abc}
PW3  | {T1, T3}           | 0.0252 | {abc}
PW4  | {T1, T4}           | 0.0972 | {abcd}
PW5  | {T1, T2, T3}       | 0.0378 | {abc}
PW6  | {T1, T2, T4}       | 0.1458 | {abc}, {abcd}
PW7  | {T1, T3, T4}       | 0.2268 | {abc}, {abcd}
PW8  | {T1, T2, T3, T4}   | 0.3402 | {abc}, {abcd}
PW9  | {T2}               | 0.0018 | { }
PW10 | {T2, T3}           | 0.0042 | {abc}
PW11 | {T2, T4}           | 0.0162 | {abc}
PW12 | {T2, T3, T4}       | 0.0378 | {abc}
PW13 | {T3}               | 0.0028 | { }
PW14 | {T3, T4}           | 0.0252 | {abc}
PW15 | {T4}               | 0.0108 | { }
PW16 | { }                | 0.0012 | { }
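A small sketch of the deterministic test above (checking one-item extensions suffices for closedness); the database matches the slide's example:

    def support(itemset, db):
        """Number of transactions that contain the itemset."""
        return sum(1 for t in db if itemset <= t)

    def is_frequent_closed(itemset, db, min_sup):
        """Frequent: support >= min_sup. Closed: adding any single further
        item strictly decreases the support."""
        sup = support(itemset, db)
        if sup < min_sup:
            return False
        return all(support(itemset | {e}, db) < sup
                   for e in set().union(*db) - itemset)

    DB = [{"a", "b", "c", "d", "e"}, {"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "b", "c"}]
    print(is_frequent_closed({"a", "b", "c"}, DB, 2))       # True  (support 4)
    print(is_frequent_closed({"a", "b", "c", "d"}, DB, 2))  # True  (support 2)
    print(is_frequent_closed({"a", "b"}, DB, 2))            # False ({a,b,c} also has support 4)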

Outline  Why Data Uncertainty Is Ubiquitous   A Motivation Example Problem Definitions  Problem Definitions  MPFCI Algorithm  Experiments   Related Work  Conclusion 12

Problem Definitions
• Frequent Closed Probability
– Given a minimum support min_sup and an itemset X, X's frequent closed probability, denoted Pr_FC(X), is the sum of the probabilities of the possible worlds in which X is a frequent closed itemset.
• Probabilistic Frequent Closed Itemset
– Given a minimum support min_sup and a probabilistic frequent closed threshold pfct, an itemset X is a probabilistic frequent closed itemset if Pr{X is a frequent closed itemset} = Pr_FC(X) > pfct.

Example of Problem Definitions
• If min_sup = 2 and pfct = 0.8:
• Frequent closed probability of {abc}:
Pr_FC(abc) = Pr(PW2) + Pr(PW3) + Pr(PW5) + Pr(PW6) + Pr(PW7) + Pr(PW8) + Pr(PW10) + Pr(PW11) + Pr(PW12) + Pr(PW14) = 0.8754 > 0.8.
• So, {abc} is a probabilistic frequent closed itemset.
(Possible-world table with frequent closed itemsets as on the motivation slides.)
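For cross-checking, a brute-force sketch (toy-sized, same UTD as before) that sums the probabilities of exactly those worlds in which {a, b, c} is frequent and closed reproduces the 0.8754:

    from itertools import chain, combinations

    UTD = [({"a", "b", "c", "d"}, 0.9),
           ({"a", "b", "c"},      0.6),
           ({"a", "b", "c"},      0.7),
           ({"a", "b", "c", "d"}, 0.9)]

    def frequent_closed_probability(itemset, utd, min_sup):
        """Pr_FC(X): total probability of the possible worlds where X is
        frequent (support >= min_sup) and closed (every one-item extension
        has strictly smaller support)."""
        n = len(utd)
        all_items = set().union(*(t for t, _ in utd))
        total = 0.0
        for world in chain.from_iterable(combinations(range(n), k) for k in range(n + 1)):
            prob = 1.0
            for i, (_, p) in enumerate(utd):
                prob *= p if i in world else (1 - p)
            db = [utd[i][0] for i in world]
            sup = sum(1 for t in db if itemset <= t)
            if sup >= min_sup and all(
                    sum(1 for t in db if (itemset | {e}) <= t) < sup
                    for e in all_items - itemset):
                total += prob
        return total

    print(round(frequent_closed_probability({"a", "b", "c"}, UTD, 2), 4))  # 0.8754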

Complexity Analysis
• However, computing the frequent closed probability of an itemset for a given minimum support over an uncertain transaction database is #P-hard.
• How, then, can this hard quantity be computed in practice?

Computing Strategy
• Strategy 1: Pr_FC(X) = Pr_F(X) - Pr_FNC(X)
• Strategy 2: Pr_FC(X) = Pr_C(X) - Pr_CNF(X)
[Figure: Venn diagram of the relationship among Pr_F(X), Pr_C(X), and Pr_FC(X)]
• In Strategy 1, Pr_F(X) can be computed in O(N log N) time, while Pr_FNC(X) remains #P-hard.
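The paper computes Pr_F(X) in O(N log N) time; as a simpler illustration of why Pr_F(X) is tractable (this dynamic program is a stand-in, not the paper's divide-and-conquer method), an O(N * min_sup) recurrence over transactions:

    def frequent_probability_dp(p_contain, min_sup):
        """Pr{sup(X) >= min_sup} for independent transactions.
        p_contain[i]: probability that transaction i contains X.
        dp[j]: probability that j of the transactions seen so far contain X,
        with j capped at min_sup (an 'at least min_sup' bucket)."""
        dp = [1.0] + [0.0] * min_sup
        for p in p_contain:
            new = [0.0] * (min_sup + 1)
            for j, q in enumerate(dp):
                if q:
                    new[min(j + 1, min_sup)] += q * p   # transaction contains X
                    new[j] += q * (1.0 - p)             # transaction does not
            dp = new
        return dp[min_sup]

    # {a,b,c,d} is contained only in T1 and T4 (probability 0.9 each):
    print(frequent_probability_dp([0.9, 0.0, 0.0, 0.9], 2))   # 0.81
    # {a,b,c} is contained in all four transactions:
    print(frequent_probability_dp([0.9, 0.6, 0.7, 0.9], 2))   # 0.9726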

Computing Strategy (cont'd)
Strategy: Pr_FC(X) = Pr_F(X) - Pr_FNC(X)
• How to compute Pr_FNC(X)?
• Assume there are m other items e_1, e_2, …, e_m besides the items of X in UTD. Then
Pr_FNC(X) = Pr(B_{e_1} ∪ B_{e_2} ∪ … ∪ B_{e_m}),
where B_{e_i} denotes the event that the superset X+e_i always appears together with X at least min_sup times.
• Evaluating this union via the inclusion-exclusion principle takes 2^m - 1 terms: a #P-hard problem.
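The blowup is visible in a direct sketch of the expansion. Here pr_joint(S) is a hypothetical oracle for Pr(∩_{i∈S} B_{e_i}); the point is that inclusion-exclusion touches 2^m - 1 subsets:

    from itertools import combinations

    def union_probability(m, pr_joint):
        """Inclusion-exclusion:
        Pr(B_1 ∪ ... ∪ B_m) = Σ over non-empty S of (-1)^(|S|+1) * Pr(∩_{i in S} B_i).
        The sum ranges over 2^m - 1 subsets, hence the exponential cost."""
        total = 0.0
        for k in range(1, m + 1):
            for S in combinations(range(m), k):
                total += (-1) ** (k + 1) * pr_joint(S)
        return total

    # Toy check with independent events of probability 0.5 each:
    print(union_probability(3, lambda S: 0.5 ** len(S)))   # 0.875 = 1 - 0.5^3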

Outline  Why Data Uncertainty Is Ubiquitous   A Motivation Example  Problem Definitions MPFCI Algorithm  MPFCI Algorithm  Experiments   Related Work  Conclusion 18

Outline  Motivation  Problem Definitions MPFCI Algorithm  MPFCI Algorithm  Experiments   Related Work  Conclusion 19

Algorithm Framework

Procedure MPFCI_Framework {
    Discover all probabilistic frequent single items as the initial candidates.
    For each candidate item/itemset {
        1: Apply the pruning and bounding strategies.
        2: Calculate the frequent closed probability of every itemset that
           cannot be pruned, and add the qualifying itemsets to the result set.
    }
}
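A toy sketch of the depth-first set enumeration behind this framework; is_pruned and pr_fc are placeholders for the pruning strategies and the (approximate) probability computation introduced on the following slides:

    def mpfci_dfs(prefix, candidates, is_pruned, pr_fc, pfct, out):
        """Extend the current prefix one item at a time (in a fixed order);
        pruning an itemset skips the whole subtree rooted at it."""
        for i, e in enumerate(candidates):
            x = prefix | {e}
            if is_pruned(x):
                continue                     # subtree rooted at x is cut
            if pr_fc(x) > pfct:
                out.append(x)                # x is reported as a result
            mpfci_dfs(x, candidates[i + 1:], is_pruned, pr_fc, pfct, out)

    # Toy run: no pruning, fabricated probabilities, items a..c.
    results = []
    mpfci_dfs(set(), ["a", "b", "c"], lambda x: False,
              lambda x: 0.9 if "a" in x else 0.1, 0.8, results)
    print(results)   # every enumerated itemset that contains 'a'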

Pruning Techniques
• Chernoff-Hoeffding bound-based pruning
• Superset pruning
• Subset pruning
• Pruning based on upper and lower bounds of the frequent closed probability

Chernoff-Hoeffding Bound-based Pruning
• Given an itemset X over an uncertain transaction database UTD with n transactions, let μ be X's expected support. For a minimum support threshold min_sup and a probabilistic frequent closed threshold pfct, X can be safely filtered out whenever a Chernoff-Hoeffding upper bound on Pr_F(X) already falls below pfct; in one standard (Hoeffding) form, prune X if min_sup > μ and exp(-2(min_sup - μ)^2 / n) < pfct.

Strategy: Pr_FC(X) = Pr_F(X) - Pr_FNC(X), fast bounding of Pr_F(X)
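A sketch of the prune test in the Hoeffding form stated above (treat the exact constants as an assumption; the paper's bound may differ):

    import math

    def ch_prune(expected_support, n, min_sup, pfct):
        """sup(X) is a sum of n independent 0/1 indicators with mean
        expected_support, so for min_sup > expected_support Hoeffding gives
        Pr{sup(X) >= min_sup} <= exp(-2 (min_sup - expected_support)^2 / n).
        If even this upper bound is below pfct, X cannot qualify."""
        if min_sup <= expected_support:
            return False                     # bound is vacuous in this case
        upper = math.exp(-2.0 * (min_sup - expected_support) ** 2 / n)
        return upper < pfct

    # 1000 transactions, expected support 10, min_sup 100, pfct 0.8:
    print(ch_prune(10.0, 1000, 100, 0.8))    # True: Pr_F(X) is provably tiny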

Superset Pruning
• Given an itemset X and a superset X+e, where e is an item smaller than at least one item in X with respect to a specified order (such as the alphabetic order): if X.count = (X+e).count, then X and all supersets having X as a prefix under that order can be safely pruned.

TID | Transaction | Prob.
T1  | a b c d     | 0.9
T2  | a b c       | 0.6
T3  | a b c       | 0.7
T4  | a b c d     | 0.9

• Example: {b, c} ⊂ {a, b, c} and a precedes b in the order.
• {b, c}.count = {a, b, c}.count, so {b, c} and all supersets with {b, c} as prefix can be safely pruned.

Subset Pruning
• Given an itemset X and its subset X-e, where e is the last item of X according to a specified order (such as the alphabetic order): if X.count = (X-e).count, then
– X-e can be safely pruned;
– besides X and X's supersets, all itemsets that have X-e as a prefix, together with their supersets, can be safely pruned.

TID | Transaction | Prob.
T1  | a b c d     | 0.9
T2  | a b c       | 0.6
T3  | a b c       | 0.7
T4  | a b c d     | 0.9

• Example: {a, b, c} and {a, b, d} share the prefix {a, b}.
• {a, b}.count = {a, b, c}.count, so {a, b}, {a, b, d} and all its supersets can be safely pruned. A sketch of both count checks follows.
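Both pruning conditions reduce to count comparisons over the transactions' item lists. A minimal sketch on the slide's database:

    def count(itemset, utd):
        """X.count in the slides: number of (uncertain) transactions whose
        item list contains X."""
        return sum(1 for items, _ in utd if itemset <= items)

    UTD = [({"a", "b", "c", "d"}, 0.9),
           ({"a", "b", "c"},      0.6),
           ({"a", "b", "c"},      0.7),
           ({"a", "b", "c", "d"}, 0.9)]

    # Superset pruning: {b,c} vs. {a,b,c}, where a precedes b in the order:
    print(count({"b", "c"}, UTD) == count({"a", "b", "c"}, UTD))   # True -> prune {b,c} and its prefix extensions

    # Subset pruning: {a,b} vs. {a,b,c}, where c is the last item of {a,b,c}:
    print(count({"a", "b"}, UTD) == count({"a", "b", "c"}, UTD))   # True -> prune {a,b}, {a,b,d}, ...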

Upper / Lower Bounding the Frequent Closed Probability
• Given an itemset X, an uncertain transaction database UTD, and min_sup, if there are m other items e_1, e_2, …, e_m besides the items in X, the frequent closed probability Pr_FC(X) satisfies:
Pr_F(X) - Σ_{i=1}^{m} Pr(B_{e_i}) ≤ Pr_FC(X) ≤ Pr_F(X) - max_{1≤i≤m} Pr(B_{e_i}),
where B_{e_i} represents the event that the superset X+e_i always appears together with X at least min_sup times. (The bounds follow from max_i Pr(B_{e_i}) ≤ Pr_FNC(X) = Pr(∪_i B_{e_i}) ≤ Σ_i Pr(B_{e_i}).)

Strategy: Pr_FC(X) = Pr_F(X) - Pr_FNC(X), fast bounding of Pr_FNC(X)

Monte-Carlo Sampling Algorithm
• Recall the computing strategy for the frequent closed probability: Pr_FC(X) = Pr_F(X) - Pr_FNC(X).
• The key problem is how to compute Pr_FNC(X) efficiently.
• A Monte-Carlo sampling algorithm calculates the frequent closed probability approximately.
– This kind of sampling is unbiased.
– By the standard Hoeffding bound, O(ln(1/δ)/ε^2) sampled possible worlds suffice for an (ε, δ)-approximation, each sample taking time linear in the database size.
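A sketch of the sampling estimator on the slide's UTD (the sample-size constant is the standard Hoeffding one, not necessarily the paper's):

    import math
    import random

    UTD = [({"a", "b", "c", "d"}, 0.9),
           ({"a", "b", "c"},      0.6),
           ({"a", "b", "c"},      0.7),
           ({"a", "b", "c", "d"}, 0.9)]

    def estimate_pr_fnc(itemset, utd, min_sup, eps=0.01, delta=0.05):
        """Unbiased Monte-Carlo estimate of Pr_FNC(X) = Pr{X frequent but
        not closed}. ceil(ln(2/delta) / (2 eps^2)) sampled worlds give
        |estimate - Pr_FNC(X)| <= eps with probability >= 1 - delta."""
        all_items = set().union(*(t for t, _ in utd))
        n_samples = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
        hits = 0
        for _ in range(n_samples):
            db = [items for items, p in utd if random.random() < p]  # one world
            sup = sum(1 for t in db if itemset <= t)
            if sup >= min_sup and any(
                    sum(1 for t in db if (itemset | {e}) <= t) == sup
                    for e in all_items - itemset):
                hits += 1        # frequent in this world, but not closed
        return hits / n_samples

    # Pr_F(abc) = 0.9726, so Pr_FC(abc) = 0.9726 - Pr_FNC(abc) = 0.8754:
    print(round(estimate_pr_fnc({"a", "b", "c"}, UTD, 2), 3))   # ≈ 0.097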

A Running Example
Input: min_sup = 2, pfct = 0.8

TID | Transaction | Prob.
T1  | a b c d     | 0.9
T2  | a b c e     | 0.6
T3  | a b c       | 0.7
T4  | a b c d     | 0.9

[Figure: the prefix search tree over items a-e; the subtrees eliminated by superset pruning and subset pruning are marked.]

Outline  Why Data Uncertainty Is Ubiquitous   A Motivation Example  Problem Definitions  MPFCI Algorithm Experiments  Experiments   Related Work  Conclusion 28

Experimental Study
• Features of the tested algorithms (CH = Chernoff-Hoeffding bound pruning; Super = superset pruning; Sub = subset pruning; PB = probability-bound pruning):

Algorithm      | CH | Super | Sub | PB | Framework
MPFCI          | √  | √     | √   | √  | DFS
MPFCI-NoCH     |    | √     | √   | √  | DFS
MPFCI-NoBound  | √  | √     | √   |    | DFS
MPFCI-NoSuper  | √  |       | √   | √  | DFS
MPFCI-NoSub    | √  | √     |     | √  | DFS
MPFCI-BFS      | √  |       |     | √  | BFS

• Characteristics of the datasets:

Dataset     | #Transactions | #Items | Avg. length | Max. length
Mushroom    | 8,124         | 119    | 23          | 23
T20I10D30KP | 30,000        | …      | 20          | …

Efficiency Evaluation
[Figures: running time w.r.t. min_sup]

Efficiency Evaluation (cont'd)
[Figures: running time w.r.t. pfct]

Approximation Quality Evaluation
[Figures: approximation quality on the Mushroom dataset, varying epsilon and varying delta]

Outline  Why Data Uncertainty Is Ubiquitous   A Motivation Example  Problem Definitions  MPFCI Algorithm  Experiments  Related Work  Conclusion 33

Related Work
• Expected Support-based Frequent Itemset Mining
– Apriori-based algorithm: UApriori (KDD'09, PAKDD'07, '08)
– Pattern-growth-based algorithms: UH-Mine (KDD'09), UFP-growth (PAKDD'08)
• Probabilistic Frequent Itemset Mining
– Dynamic-programming-based algorithms: DP (KDD'09, '10)
– Divide-and-conquer-based algorithms: DC (KDD'10)
– Approximate probabilistic frequent algorithms:
   Poisson distribution-based approximation (CIKM'10)
   Normal distribution-based approximation (ICDM'10)

Outline  Why Data Uncertainty Is Ubiquitous   A Motivation Example  Problem Definitions  MPFCI Algorithm  Experiments   Related Work Conclusion  Conclusion 35

Conclusion
• We propose the new problem of mining threshold-based probabilistic frequent closed itemsets over an uncertain transaction database.
• We prove that computing the frequent closed probability of an itemset is #P-hard.
• We design an efficient mining algorithm, including several effective probabilistic pruning techniques, to find all probabilistic frequent closed itemsets.
• We show the effectiveness and efficiency of the mining algorithm through extensive experimental results.

Thank you