An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Fabio Vandin, DEI, Università di Padova (AlgoDEEP, 16/04/10)

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

Data Mining Discover hidden patterns, correlations, association rules, etc., in large data sets. When is a discovery interesting, important, significant? We develop a rigorous mathematical/statistical approach.

Frequent Itemsets Dataset D of transactions t_j over a base set of items I; each transaction is a subset of the items (t_j ⊆ I). Support of an itemset X = number of transactions that contain X. Example: support({Beer, Diaper}) = 3

Frequent Itemsets Discover all itemsets with significant support, a fundamental primitive in data mining applications. Example: support({Beer, Diaper}) = 3
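As a concrete illustration, support counting can be sketched in a few lines of Python (the transactions below are hypothetical stand-ins for the slide's Beer/Diaper example):

```python
# Support of an itemset: the number of transactions that contain
# every item in the itemset.
def support(itemset, transactions):
    s = set(itemset)
    return sum(1 for t in transactions if s.issubset(t))

# Illustrative market-basket data (made up for this sketch).
transactions = [
    {"beer", "diaper", "bread"},
    {"beer", "diaper"},
    {"milk", "bread"},
    {"beer", "diaper", "milk"},
]

print(support({"beer", "diaper"}, transactions))  # -> 3
```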

Significance What support level makes an itemset significantly frequent? Minimize false positive and false negative discoveries; improve the "quality" of subsequent analyses. How to narrow the search to focus only on significant itemsets? Reduce the possibly exponential-time search.

Statistical Model Input: D = a dataset of t transactions over a set I of n items (|I| = n). For i ∈ I, let n(i) be the support of {i} in D, and f_i = n(i)/t the frequency of i in D. H_0 model: a random dataset of t transactions over the same n items, where item i is included in transaction j with probability f_i, independently of all other events.
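The H_0 model above is straightforward to sample from. A minimal sketch (the item frequencies below are made up for illustration):

```python
import random

def random_dataset(freqs, t, rng=None):
    """Generate t transactions under H_0: item i is included in each
    transaction independently with probability freqs[i]."""
    rng = rng or random.Random()
    return [{i for i, f in freqs.items() if rng.random() < f}
            for _ in range(t)]

# Hypothetical item frequencies, as would be estimated from a real D.
freqs = {"a": 0.5, "b": 0.1, "c": 0.9}
D = random_dataset(freqs, t=1000, rng=random.Random(0))
print(len(D))  # -> 1000
```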

Statistical Tests H_0: null hypothesis, the support of no itemset is significant with respect to D. H_1: alternative hypothesis, the support of itemsets X_1, X_2, X_3, … is significant; it is unlikely that their support comes from the distribution induced by D. Significance level: α = Prob(rejecting H_0 when it is true).

Naïve Approach Let X = {x_1, x_2, …, x_r} and f_X = ∏_j f_{x_j} = probability that X is contained in a given transaction. s_X = support of X, distributed s_X ∼ B(t, f_X). Reject H_0 if: Prob(B(t, f_X) ≥ s_X) = p-value ≤ α.
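A sketch of the naïve test, computing the exact binomial tail with nothing beyond the standard library (the counts and frequency here are hypothetical):

```python
from math import comb

def binom_sf(s, t, p):
    """P(B(t, p) >= s): upper tail of a Binomial(t, p) distribution."""
    return sum(comb(t, k) * p**k * (1 - p)**(t - k) for k in range(s, t + 1))

# Hypothetical numbers: an itemset X appears in 12 of t = 100 transactions,
# while f_X = 0.05 under H_0.
p_value = binom_sf(12, 100, 0.05)
print(p_value)  # reject H_0 at level alpha if p_value <= alpha
```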

Naïve Approach Variations: R = support / E[support under H_0]; R = support − E[support under H_0]; Z-value = (s − E[s]) / σ[s]; many more…

What's wrong? An example: D has 1,000,000 transactions over 1000 items, and each item has frequency 1/1000. We observe that a pair {i, j} appears 7 times. Is this pair statistically significant? In the random dataset: E[support({i, j})] = 1, and Prob({i, j} has support ≥ 7) ≈ 10^-4 = p-value. The pair must be significant! So what's wrong?

There are 499,500 pairs, each with probability ≈ 10^-4 of appearing in 7 or more transactions of the random dataset. The expected number of pairs with support ≥ 7 is therefore ≈ 50: not such a rare event! Many false positive discoveries (flagging itemsets that are not significant). We need to correct for the multiplicity of hypotheses.
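The arithmetic of this example can be checked directly. Since the support of a fixed pair is approximately Poisson(1)-distributed here, the per-pair tail probability is on the order of 10^-4, and multiplying by the 499,500 candidate pairs gives an expected count of a few dozen, matching the slide's ≈ 50 in order of magnitude:

```python
from math import exp, factorial, comb

def poisson_sf(s, lam):
    """P(Poisson(lam) >= s)."""
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(s))

t, n = 10**6, 1000
f_pair = (1 / n) ** 2     # prob. a fixed pair is in one transaction
lam = t * f_pair          # expected support of a fixed pair = 1

p = poisson_sf(7, lam)    # tail prob. for one fixed pair, on the order of 1e-4
pairs = comb(n, 2)        # 499,500 candidate pairs
print(p, pairs * p)       # expected number of pairs with support >= 7
```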

Multi-Hypothesis Test Testing for significant itemsets of size k involves testing m = (n choose k) null hypotheses simultaneously. H_0(X) = the support of X conforms with the random dataset. s_X = support of X, distributed s_X ∼ B(t, f_X). How do we combine m tests while minimizing false positive and false negative discoveries?

Family Wise Error Rate (FWER) FWER = probability of at least one false positive (flagging a non-significant itemset as significant). Bonferroni method (union bound): test each null hypothesis with significance level α/m. Too conservative: many false negatives, i.e., many significant itemsets are not flagged.

False Discovery Rate (FDR) A less conservative approach. V = number of false positive discoveries; R = total number of rejected null hypotheses = number of itemsets flagged as significant. FDR = E[V/R] (FDR = 0 when R = 0). Test with significance level α: reject the maximum number of null hypotheses such that FDR ≤ α.

Standard Multi-Hypothesis test

Less conservative than the Bonferroni method: the i-th smallest p-value is compared against i·α/m instead of α/m. For m = (n choose k), rejecting a hypothesis still requires a very small individual p-value.
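The step-up rule with thresholds i·α/m is the Benjamini-Hochberg procedure; a minimal sketch (the p-values below are made up):

```python
def benjamini_hochberg(p_values, alpha):
    """Benjamini-Hochberg step-up procedure: reject the hypotheses with
    the k smallest p-values, where k is the largest rank such that
    p_(k) <= k * alpha / m. Returns indices of rejected hypotheses."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])

pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.2, 0.6]
print(benjamini_hochberg(pvals, alpha=0.05))  # -> [0, 1]
```

Note how the third p-value (0.039) survives a step-up check only if a later rank also passes; here it does not, so only the first two hypotheses are rejected, while plain Bonferroni (threshold 0.05/7 ≈ 0.007) would reject only the first.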

Alternative Approach Q(k, s_i) = observed number of itemsets of size k with support ≥ s_i. p-value = the probability of observing Q(k, s_i) such itemsets in the random dataset. Fewer hypotheses. How do we compute the p-value? What is the distribution of the number of itemsets of size k with support ≥ s_i in the random dataset?

Permutation Test Simulations to estimate the probabilities: choose a dataset at random and count. Main problem: with m = (n choose k) hypotheses, the probabilities needed to reject a hypothesis are very small, so a lot of simulations are required to estimate them.

Main Contributions Poisson approximation: let Q_{k,s} = number of itemsets of size k with support ≥ s in the random dataset; for s ≥ s_min, Q_{k,s} is well approximated by a Poisson distribution. Based on the Poisson approximation: a powerful FDR multi-hypothesis test for significant frequent itemsets.

Chen-Stein Method A powerful technique for approximating the sum of dependent Bernoulli variables. For an itemset X of k items, let Z_X = 1 if X has support at least s, else Z_X = 0. Then Q_{k,s} = Σ_X Z_X (sum over all itemsets X of k items). Let U ∼ Poisson(λ) with λ = E[Q_{k,s}]. Neighborhood of dependence: I(X) = {Y : |Y| = k, Y ∩ X ≠ ∅}.

Chen-Stein Method (2) The total variation distance between Q_{k,s} and Poisson(λ) is bounded by 2(b_1 + b_2), where b_1 = Σ_X Σ_{Y ∈ I(X)} E[Z_X] E[Z_Y] and b_2 = Σ_X Σ_{Y ∈ I(X), Y ≠ X} E[Z_X Z_Y]. (The third term of the general Chen-Stein bound vanishes here, since Z_X is independent of every Z_Y with Y outside I(X).)

Approximation Result Q_{k,s} is well approximated by a Poisson distribution for s ≥ s_min.
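A toy Monte-Carlo check of this behavior under the H_0 model: for small parameters (chosen only for speed, not taken from the paper), sample Q_{k,s} repeatedly and compare its empirical mean and variance, which should be close if the Poisson approximation holds:

```python
import random
from itertools import combinations

def sample_Q(n, t, f, k, s, rng):
    """One draw of Q_{k,s}: generate a random dataset under H_0 (every item
    with frequency f) and count itemsets of size k with support >= s."""
    data = [{i for i in range(n) if rng.random() < f} for _ in range(t)]
    count = 0
    for X in combinations(range(n), k):
        if sum(1 for trans in data if set(X) <= trans) >= s:
            count += 1
    return count

rng = random.Random(0)
samples = [sample_Q(n=15, t=100, f=0.1, k=2, s=5, rng=rng) for _ in range(200)]
mean = sum(samples) / len(samples)
var = sum((q - mean) ** 2 for q in samples) / len(samples)
print(mean, var)  # for a Poisson-like Q_{k,s}, mean and variance are close
```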

Monte-Carlo Estimate To determine s_min for a given set of parameters (n, t, f_i): choose m random datasets with the required parameters; for each dataset, extract all itemsets with support at least s′ (≤ s_min); find the minimum s such that Prob(b_1(s) + b_2(s) ≤ ε) ≥ 1 − δ.

New Statistical Test Instead of testing the significance of the support of individual itemsets, we test the significance of the number of itemsets with a given support. The null-hypothesis distribution is specified by the Poisson approximation result. This reduces the number of simultaneous tests and yields a more powerful test: fewer false negatives.

Test I Define α_1, α_2, α_3, … such that Σ_i α_i ≤ α. For i = 0, …, log(s_max − s_min) + 1: s_i = s_min + 2^i. Q(k, s_i) = observed number of itemsets of size k with support ≥ s_i. H_0(k, s_i) = "Q(k, s_i) conforms with Poisson(λ_i)". Reject H_0(k, s_i) if its p-value < α_i.

Test I Let s* be the smallest s such that H_0(k, s) is rejected by Test I. With confidence level α, the number of itemsets with support ≥ s* is significant. However, some itemsets with support ≥ s* could still be false positives.

Test II Define β_1, β_2, β_3, … such that Σ_i β_i ≤ β. Reject H_0(k, s_i) if: p-value < α_i and Q(k, s_i) ≥ λ_i / β_i. Let s* be the minimum s such that H_0(k, s) is rejected. If we flag all itemsets with support ≥ s* as significant, then FDR ≤ β.

Proof sketch V_i = number of false discoveries when H_0(k, s_i) is the first hypothesis rejected. E_i = the event "H_0(k, s_i) is rejected".

Real Datasets Standard benchmarks from the FIMI repository (m = avg. transaction length).

Experimental Results Poisson approximation: being in the Poisson "regime" (s ≥ s_min) is not the same as expecting no itemsets of that support.

Experimental Results Poisson approximation: we are not approximating the (very small!) p-values of individual itemsets treated as hypotheses; we only find the minimum s such that Prob(b_1(s) + b_2(s) ≤ ε) ≥ 1 − δ. This requires fewer simulations, and less time per simulation ("few" itemsets to extract).

Experimental Results Test II: α = 0.05, β = 0.05. R_{k,s*} = number of itemsets of size k with support ≥ s*. Itemset of size 154 with support ≥ 7.

Experimental Results Standard multi-hypothesis test: β = 0.05. R = output size of the standard multi-hypothesis test; R_{k,s*} = output size of Test II.

Collaborators Adam Kirsch (Harvard), Michael Mitzenmacher (Harvard), Andrea Pietracaprina (U. of Padova), Geppino Pucci (U. of Padova), Eli Upfal (Brown U.), Fabio Vandin (U. of Padova).