An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Fabio Vandin
DEI - Università di Padova / CS Dept. - Brown University
Joint work with: A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal
AlgoDEEP, 16/04/10
Data Mining
Discovery of hidden patterns (e.g., correlations, association rules, clusters, anomalies, etc.) from large data sets.
When is a pattern significant?
Open problem: development of rigorous (mathematical/statistical) approaches to assess significance and to discover significant patterns efficiently.
Frequent Itemsets (1)
Dataset D of transactions over a set of items I (D ⊆ 2^I).
Support of an itemset X ∈ 2^I in D = number of transactions that contain X.
Example (from the slide's market-basket table): support({Beer, Diaper}) = 3. Significant?
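For concreteness, here is a minimal Python sketch of the support computation. The five-transaction toy dataset is illustrative (the slide's own table is not reproduced here); it is chosen so that support({Beer, Diaper}) = 3, matching the slide's example.

```python
# Minimal sketch: support of an itemset = number of transactions containing it.
def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    target = set(itemset)
    return sum(1 for t in transactions if target <= t)

D = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]

print(support({"Beer", "Diaper"}, D))  # -> 3
```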
Frequent Itemsets (2)
Original formulation of the problem [Agrawal et al. 93]
- input: dataset D over I, support threshold s
- output: all itemsets of support ≥ s in D (frequent itemsets)
Rationale: significance = high support (≥ s).
Drawbacks:
- Threshold s is hard to fix: too low, possible output explosion and spurious discoveries (false positives); too high, loss of interesting itemsets (false negatives).
- No guarantee of significance of the output itemsets.
Alternative formulations have been proposed to mitigate the above drawbacks: closed itemsets, maximal itemsets, top-K itemsets. A brute-force sketch of the original formulation follows.
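The sketch below only makes the input/output contract concrete; it assumes transactions are given as Python sets, and real miners (Apriori, FP-growth) prune the search space far more aggressively.

```python
# Brute-force sketch of the Agrawal et al. formulation: all itemsets of
# support >= s. Anti-monotonicity is used only to stop early.
from itertools import combinations

def frequent_itemsets(transactions, s):
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            supp = sum(1 for t in transactions if set(cand) <= t)
            if supp >= s:
                frequent[cand] = supp
                found = True
        if not found:  # no frequent k-itemset => no frequent (k+1)-itemset
            break
    return frequent
```

For instance, called on the toy dataset above with s = 3, it returns {Beer, Diaper} among the frequent pairs.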
Significance
Focus on statistical significance: significance w.r.t. a random model.
We address the following questions:
- What support level makes an itemset significantly frequent?
- How to narrow the search down to significant itemsets?
Goal: minimize false discoveries and improve the quality of subsequent analysis.
Related Work
Many works consider the significance of itemsets in isolation. E.g., [Silverstein, Brin, Motwani, 98]: a rigorous statistical framework (with flaws!) based on the χ² test to assess the degree of dependence of the items in an itemset.
Global characteristics of the dataset are taken into account in [Gionis, Mannila, et al., 06]: deviation from a random dataset w.r.t. the number of frequent itemsets, but no rigorous statistical grounding.
Statistical Tests
Standard statistical test:
- null hypothesis H0 (≈ not significant)
- alternative hypothesis H1
H0 is tested against H1 by observing a certain statistic s.
p-value = Prob(obs ≥ s | H0 is true)
Significance level α = probability of rejecting H0 when it is true (false positive), also called the probability of a Type I error.
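As a minimal sketch of this recipe (the helper name and interface are hypothetical, not from the slides): given an observed statistic and a function that returns its upper-tail probability under H0, reject at level α when the p-value is at most α.

```python
# Generic one-sided test skeleton: p-value = Prob(statistic >= observed | H0).
def test_null(observed, null_tail_prob, alpha=0.05):
    """null_tail_prob(x) must return Prob(statistic >= x) under H0."""
    p_value = null_tail_prob(observed)
    return p_value <= alpha, p_value  # (reject H0?, p-value)
```

The later slides instantiate this with Binomial and Poisson tails.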
Random Model
I = set of n items.
D = input dataset of t transactions over I; for i ∈ I, n(i) = support of {i} in D and f_i = n(i)/t = frequency of i in D.
D̂ = random dataset of t transactions over I: item i is included in transaction j with probability f_i, independently of all other events.
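A minimal sketch of sampling from this random model (function name and interface are illustrative): each item enters each of the t transactions independently with its empirical frequency.

```python
# Sample a random dataset D_hat: item i is put into each of t transactions
# independently with probability f_i.
import random

def random_dataset(freqs, t, seed=None):
    """freqs: dict mapping item -> f_i. Returns a list of t transactions (sets)."""
    rng = random.Random(seed)
    return [{i for i, f in freqs.items() if rng.random() < f} for _ in range(t)]
```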
Naïve Approach (1)
For each itemset X = {i_1, i_2, ..., i_k} ⊆ I:
- f_X = f_i1 · f_i2 · ... · f_ik = expected frequency of X in D̂
- null hypothesis H0(X): the support of X in D conforms with D̂, i.e., it is as if drawn from Binomial(t, f_X)
- alternative hypothesis H1(X): the support of X in D does not conform with D̂
Naïve Approach (2)
Statistic of interest: s_X = support of X in D.
Reject H0(X) if: p-value = Prob(B(t, f_X) ≥ s_X) ≤ α.
Significant itemsets = { X ⊆ I : H0(X) is rejected }. A sketch of this test over all k-itemsets follows.
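A sketch of the naive test, using scipy's Binomial survival function for the upper tail; f_X is the product of the individual item frequencies, as defined on the previous slide. The function name is a placeholder.

```python
# Naive approach: flag a k-itemset X as significant when
# Prob(Binomial(t, f_X) >= s_X) <= alpha, with f_X = product of item frequencies.
from itertools import combinations
from math import prod
from scipy.stats import binom

def naive_significant_itemsets(transactions, k, alpha=0.05):
    t = len(transactions)
    items = sorted(set().union(*transactions))
    freq = {i: sum(1 for tr in transactions if i in tr) / t for i in items}
    flagged = {}
    for X in combinations(items, k):
        s_X = sum(1 for tr in transactions if set(X) <= tr)
        f_X = prod(freq[i] for i in X)
        p_value = binom.sf(s_X - 1, t, f_X)  # Prob(B(t, f_X) >= s_X)
        if p_value <= alpha:
            flagged[X] = (s_X, p_value)
    return flagged
```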
Naïve Approach (3)
What's wrong? Take D with t = 1,000,000 transactions over n = 1000 items, each item with frequency 1/1000. A pair {i,j} occurs 7 times: is it statistically significant?
In the random dataset D̂: E[support({i,j})] = 1, and p-value = Prob({i,j} has support ≥ 7) ≃ 0.0001.
{i,j} must be significant!
Naïve Approach (4)
The expected number of pairs with support ≥ 7 in the random dataset is ≃ 50: the existence of some {i,j} with support ≥ 7 is not such a rare event! Returning {i,j} as a significant itemset could be a false discovery.
However, 300 (disjoint) pairs with support ≥ 7 in D is an extremely rare event (prob ≤ 2^-300). The arithmetic for the first two figures is reproduced below.
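The slide's numbers can be checked directly with a Binomial tail from scipy; the slide's ≃ 50 corresponds to the rounded p-value of 10^-4.

```python
# t = 10^6 transactions, n = 1000 items, every item frequency 1/1000.
from math import comb
from scipy.stats import binom

t, n, f = 1_000_000, 1000, 1 / 1000
f_pair = f * f                      # expected frequency of a pair {i, j}
print(t * f_pair)                   # E[support({i, j})] = 1.0
p = binom.sf(6, t, f_pair)          # Prob(support >= 7), about 1e-4
print(p)
print(comb(n, 2) * p)               # expected pairs with support >= 7 (roughly 40-50)
```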
Multi-Hypothesis test (1)
Looking for significant itemsets of size k (k-itemsets) involves testing simultaneously m = C(n, k) null hypotheses: {H0(X)}, for all |X| = k.
How to combine the m tests while minimizing false positives?
Multi-Hypothesis test (2)
V = number of false positives; R = total number of rejected null hypotheses = number of itemsets flagged as significant.
False Discovery Rate (FDR) = E[V/R] (with FDR = 0 when R = 0).
GOAL: maximize R while ensuring FDR ≤ β.
[Benjamini-Yekutieli '01]: reject the hypothesis with the i-th smallest p-value if it is ≤ i·β/m. With m = C(n, k), this does not yield a support threshold for mining. A sketch of the step-up rule follows.
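A sketch of the step-up rule as stated on the slide (the Benjamini-Yekutieli procedure for arbitrary dependence additionally divides the threshold by the harmonic sum Σ_{j=1..m} 1/j; that correction is omitted here to mirror the slide).

```python
# Step-up multiple-testing rule: reject the hypotheses with the i smallest
# p-values, where i is the largest rank such that p_(i) <= i * beta / m.
def step_up_rejections(pvalues, beta):
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by p-value
    largest = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * beta / m:
            largest = rank
    return {order[i] for i in range(largest)}           # rejected hypotheses
```

Note that the rejection set is defined by individual p-values rather than by a single support cutoff, which is the obstacle the slide points out.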
Our Approach
Q(k, s) = observed number of k-itemsets of support ≥ s.
Null hypothesis H0(s): the number of k-itemsets of support ≥ s in D conforms with D̂.
Alternative hypothesis H1(s): the number of k-itemsets of support ≥ s in D does not conform with D̂.
Problem: how to compute the p-value of Q(k, s)?
Main Results (PODS 2009)
Result 1 (Poisson approximation). Let Q̂(k, s) = number of k-itemsets of support ≥ s in D̂. Theorem: there exists s_min such that, for s ≥ s_min, Q̂(k, s) is well approximated by a Poisson distribution.
Result 2. A methodology to establish a support threshold for discovering significant itemsets with small FDR.
Approximation Result (1)
Based on the Chen-Stein method (1975).
Q̂(k, s) = number of k-itemsets of support ≥ s in the random dataset D̂; U ~ Poisson(λ), with λ = E[Q̂(k, s)].
Theorem: for k = O(1), t = poly(n), and for a large range of item distributions and supports s: distance(Q̂(k, s), U) = O(1/n).
Approximation Result (2)
Corollary: there exists s_min such that Q̂(k, s) is well approximated by a Poisson distribution for all s ≥ s_min.
In practice: a Monte Carlo method determines s_min such that, with probability at least 1 - δ, distance(Q̂(k, s), U) ≤ ε for all s ≥ s_min. A simplified Monte Carlo sketch is given below.
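A heavily simplified sketch in that spirit: it only estimates λ(s) = E[Q̂(k, s)] empirically over sampled random datasets; the paper's actual procedure also certifies the Poisson-approximation error with probability ≥ 1 - δ, which is omitted here. Function names are illustrative, and random_dataset repeats the earlier Random Model sketch so the block is self-contained.

```python
# Estimate lambda(s) = E[Q_hat(k, s)] by sampling random datasets D_hat.
import random
from itertools import combinations

def random_dataset(freqs, t, seed=None):
    rng = random.Random(seed)
    return [{i for i, f in freqs.items() if rng.random() < f} for _ in range(t)]

def count_Q(transactions, k, s):
    """Q(k, s): number of k-itemsets with support >= s."""
    items = sorted(set().union(*transactions))
    return sum(
        1
        for cand in combinations(items, k)
        if sum(1 for tr in transactions if set(cand) <= tr) >= s
    )

def estimate_lambda(freqs, t, k, s, trials=100):
    return sum(count_Q(random_dataset(freqs, t, seed=j), k, s)
               for j in range(trials)) / trials
```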
Support threshold for mining significant itemsets (1)
Determine s_min and let h be such that s_min + 2^h is the maximum support of an itemset.
Fix α_1, α_2, ..., α_h such that Σ α_i ≤ α, and β_1, β_2, ..., β_h such that Σ β_i ≤ β.
For i = 1 to h:
- s_i = s_min + 2^i
- Q(k, s_i) = observed number of k-itemsets of support ≥ s_i
- H0(k, s_i): Q(k, s_i) conforms with Poisson(λ_i), where λ_i = E[Q̂(k, s_i)]
- reject H0(k, s_i) if: p-value of Q(k, s_i) < α_i and Q(k, s_i) ≥ λ_i / β_i
A sketch of this procedure is given below.
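A sketch of the loop, assuming a Poisson-tail p-value via scipy and caller-supplied functions observed_Q(k, s) (the count Q(k, s) in the real data D) and expected_lambda(k, s) (an estimate of E[Q̂(k, s)] in D̂, e.g. via the Monte Carlo sketch above). These names are placeholders, not the paper's API.

```python
# Return s* = the smallest tested support s_i = s_min + 2**i whose null
# hypothesis H0(k, s_i) is rejected, following the slide's loop.
from scipy.stats import poisson

def find_support_threshold(k, s_min, h, alphas, betas, observed_Q, expected_lambda):
    """alphas and betas are lists of length h, with sum(alphas) <= alpha
    and sum(betas) <= beta as required by the slide."""
    for i in range(1, h + 1):
        s_i = s_min + 2 ** i
        q = observed_Q(k, s_i)             # Q(k, s_i) in the real dataset
        lam = expected_lambda(k, s_i)      # lambda_i = E[Q_hat(k, s_i)]
        p_value = poisson.sf(q - 1, lam)   # Prob(Poisson(lam) >= q)
        if p_value < alphas[i - 1] and q >= lam / betas[i - 1]:
            return s_i                     # s_i grows with i, so the first
                                           # rejection is the minimum s*
    return None                            # no hypothesis rejected
```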
Support threshold for mining significant itemsets (2)
Theorem. Let s* be the minimum s such that H0(k, s) was rejected. Then:
1. With significance level α, the number of k-itemsets of support ≥ s* is significant.
2. The k-itemsets with support ≥ s* are significant with FDR ≤ β.
Experiments: benchmark datasets
Benchmark datasets from the FIMI repository (http://fimi.cs.helsinki.fi/data/).
[Table of datasets omitted; its columns report the average transaction length, the number of items, and the range of item frequencies for each dataset.]
Experiments: results (1)
Test with α = 0.05, β = 0.05.
[Results table omitted; for each dataset it reports Q(k, s*) = number of k-itemsets of support ≥ s* in D and λ(s*) = expected number of k-itemsets with support ≥ s* in D̂. Noted on the slide: an itemset of size 154 with support ≥ 7.]
Experiments: results (2)
Comparison with the standard application of Benjamini-Yekutieli at FDR ≤ 0.05:
- R = output (standard approach)
- Q(k, s*) = output (our approach)
- r = |Q(k, s*)| / |R|
Conclusions
- Poisson approximation for the number of k-itemsets of support s ≥ s_min in a random dataset.
- A statistically sound method to determine a support threshold for mining significant frequent itemsets with controlled FDR.
Future Work
- Deal with false negatives
- Software package
- Application of the method to other frequent pattern problems
Questions? Thank you!