Published by Mary Waters. Modified over 9 years ago.
1
Detecting Group Differences: Mining Contrast Sets. Author: Stephen D. Bay. Advisor: Dr. Hsu. Graduate: Yan-Cheng Lin.
2
Outline: Motivation; Objective; Research Review; Search for Contrast Sets; Filtering for Summarizing Contrast Sets; Evaluation; Conclusion.
3
Motivation: Learning group differences is a central problem in many domains. Contrasting groups is especially important in social science research.
4
Objective: Automatically detect differences between contrasting groups from observational multivariate data.
5
Research Review: Time series research (multiple observations); traditional statistical methods; rule learners and decision trees, which can miss group differences; association rule mining, which differs in handling multiple groups and in its search criteria.
6
Problem Definition: The itemset concept extends to contrast sets. Definition 1: Let A_1, A_2, ..., A_k be a set of k variables called attributes. Each A_i can take on values from the set {V_i1, V_i2, ..., V_im}. A contrast set is a conjunction of attribute-value pairs defined on groups G_1, G_2, ..., G_n, with no A_i occurring more than once.
7
Defining the Support of a Contrast Set: Definition 2: The support of a contrast set with respect to a group G is the percentage of examples in G where the contrast set is true. The minimum support difference δ is a user-defined threshold.
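Definition 2 can be illustrated with a small sketch. The data layout (each group as a list of dicts), the function names `support` and `is_large`, and the toy records are all assumptions made for illustration. The check compares the largest difference in support between any two groups against δ, which is one reading of the minimum support difference threshold.

```python
def support(rows, contrast_set):
    """Fraction of rows in which every attribute-value pair of the set holds."""
    if not rows:
        return 0.0
    hits = sum(
        all(row.get(attr) == value for attr, value in contrast_set.items())
        for row in rows
    )
    return hits / len(rows)


def is_large(groups, contrast_set, delta):
    """True when the largest between-group support difference is at least delta."""
    supports = [support(rows, contrast_set) for rows in groups.values()]
    return max(supports) - min(supports) >= delta


# Hypothetical toy data: two groups with four examples each.
groups = {
    "G1": [
        {"sex": "male", "occupation": "sales"},
        {"sex": "male", "occupation": "tech"},
        {"sex": "female", "occupation": "sales"},
        {"sex": "male", "occupation": "sales"},
    ],
    "G2": [
        {"sex": "female", "occupation": "sales"},
        {"sex": "male", "occupation": "tech"},
        {"sex": "female", "occupation": "tech"},
        {"sex": "female", "occupation": "tech"},
    ],
}
contrast_set = {"sex": "male", "occupation": "sales"}

s1 = support(groups["G1"], contrast_set)  # 2 of 4 rows match
s2 = support(groups["G2"], contrast_set)  # 0 of 4 rows match
```

Storing the contrast set as a dict keyed by attribute enforces the condition that no attribute occurs more than once.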
8
Search for Contrast Sets: Find the contrast sets that meet our criteria through search: explore all possible contrast sets and return only the sets that meet the criteria. STUCCO (Search and Testing for Understandable Consistent Contrasts) uses breadth-first search and incorporates several efficient mining techniques.
9
Framework: Use set-enumeration trees. Use breadth-first search. In the counting phase, organize nodes into candidate groups.
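A minimal sketch of breadth-first traversal over a set-enumeration tree, assuming a fixed attribute order so that each conjunction is generated exactly once. The function name `enumerate_contrast_sets`, the `max_size` cutoff, and the toy attributes are hypothetical. The counting phase and candidate grouping are not modeled here.

```python
from collections import deque


def enumerate_contrast_sets(attributes, max_size):
    """Yield candidate contrast sets level by level (breadth-first).

    Each node is a tuple of (attribute, value) pairs. A child may only add
    an attribute that comes later in the fixed order, so no attribute
    repeats and no conjunction is produced twice.
    """
    names = list(attributes)
    # Each queue entry: (pairs chosen so far, index of first attribute still allowed)
    queue = deque([((), 0)])
    while queue:
        pairs, start = queue.popleft()
        if pairs:
            yield pairs
        if len(pairs) == max_size:
            continue
        for i in range(start, len(names)):
            for value in attributes[names[i]]:
                queue.append((pairs + ((names[i], value),), i + 1))


# Hypothetical attributes and values.
attributes = {"sex": ["male", "female"], "degree": ["bs", "phd"]}
candidates = list(enumerate_contrast_sets(attributes, max_size=2))
```

With two attributes of two values each, this produces four single-pair sets followed by four two-pair conjunctions.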
10
Finding Significant Contrast Sets: Test the null hypothesis that contrast set support is equal across all groups, using support counts arranged in contingency tables.
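As a sketch of the test on a contingency table, the function below computes the Pearson chi-square statistic and its degrees of freedom in plain Python. The table layout (rows: contrast set true or false; columns: groups), the function name `chi_square`, and the counts are assumptions for illustration. The sketch also assumes every row and column total is nonzero.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if support were equal across groups (independence).
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical counts: rows = contrast set true / false, columns = two groups.
table = [
    [30, 10],
    [70, 90],
]
stat, df = chi_square(table)

# 3.841 is the standard 0.05 critical value for one degree of freedom.
reject_null = df == 1 and stat > 3.841
```

Here the expected counts are 20, 20, 80, and 80, so the statistic is 12.5 with one degree of freedom, and the null hypothesis of equal support would be rejected at the uncorrected 0.05 level.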
11
Controlling Search Error: Data mining tests many hypotheses, so the tests are treated as a family in order to control Type I error. Bonferroni inequality: given any set of events e_1, e_2, ..., e_n, the probability of their union is less than or equal to the sum of the individual probabilities.
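The inequality suggests the standard Bonferroni correction: if each of n tests uses the cutoff α/n, the chance of any false positive in the family is at most α. The sketch below shows only this basic correction, with hypothetical function names and p-values. How the α budget is actually divided across search levels in STUCCO is not shown on the slide and is not modeled here.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance cutoff that keeps the family-wise error at or below alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant(p_values, alpha=0.05):
    """Flag each p-value that passes the Bonferroni-corrected cutoff."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p <= cutoff for p in p_values]


# Hypothetical p-values from three tests; the corrected cutoff is 0.05 / 3.
flags = significant([0.001, 0.02, 0.04], alpha=0.05)
```

Only the first p-value passes, although all three would pass an uncorrected 0.05 cutoff.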
12
Pruning: Prune when contrast sets fail to meet the effect size or statistical significance criteria, or when they lead to uninteresting contrast sets. Effect Size Pruning: prune nodes when the maximum support difference between groups can be bounded below δ. Statistical Significance Pruning: prune when there are too few data or the maximum value of χ² is too small.
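One simple bound for effect size pruning follows directly from the definitions, and is offered here as an assumption rather than as STUCCO's exact rule. Adding a pair to a contrast set can only lower its support in each group, and support is never negative, so no specialization can show a support difference larger than the parent's highest group support. The function name and the numbers are hypothetical.

```python
def prune_by_effect_size(parent_supports, delta):
    """True when no specialization can reach a support difference of delta.

    parent_supports holds the parent's support in each group. A child's
    support in any group lies between 0 and the parent's support there,
    so its largest possible difference is max(parent_supports).
    """
    return max(parent_supports) < delta


# Hypothetical supports for two nodes across two groups, with delta = 0.05.
prune_rare = prune_by_effect_size([0.03, 0.01], delta=0.05)
prune_common = prune_by_effect_size([0.40, 0.02], delta=0.05)
```

The first node is pruned because its support is below δ in every group. The second is kept because a specialization could still differ by up to 0.40.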
13
Interest Based Pruning: Contrast sets are not interesting when a specialization has identical support or when the relation between groups is fixed. Specializations with Identical Support: for example, marital-status=husband and marital-status=husband ^ Sex=male.
14
Fixed Relations: Prune the node, since specializations of the contrast set do not add new information.
15
Relation to Itemset Mining: The minimum support difference criterion implies constraints on support levels in individual groups. Large portions of the search space are eliminated based on: subset infrequency pruning, effect size pruning, superset frequency pruning, and interest based pruning.
16
Filtering for Summarizing Contrast Sets: Past approaches: limit the rules shown by constraining the variables or items; compare discovered rules and show only unexpected results. New methods: an expectation-based statistical approach; identifying and selecting linear trend contrast sets.
17
Statistical Surprise: Show the most general contrast sets first, and show more complicated conjunctions only if they are surprising given the previously shown sets. IPF (Iterative Proportional Fitting) is used to find maximum likelihood estimates.
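A minimal two-dimensional sketch of iterative proportional fitting, assuming a table of ones as the starting point and known row and column totals as the constraints. The function name `ipf`, the iteration limits, and the totals are hypothetical. Fitting expectations for higher-order conjunctions from previously shown contrast sets would need a multi-way version, which is not attempted here.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rescale seed until its row and column sums match the target totals."""
    table = [[float(x) for x in row] for row in seed]
    for _ in range(max_iter):
        # Scale each row to match its target total.
        for i, row in enumerate(table):
            total = sum(row)
            if total > 0:
                table[i] = [x * row_targets[i] / total for x in row]
        # Scale each column to match its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns now match exactly; stop once the rows match as well.
        row_error = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if row_error < tol:
            break
    return table


# Hypothetical marginals: 40 of 200 examples match; groups have 100 examples each.
fitted = ipf([[1, 1], [1, 1]], row_targets=[40, 160], col_targets=[100, 100])
```

Starting from a uniform table, the fit reproduces the counts expected under independence (20, 20, 80, 80). An observed table far from these expected counts is what would count as surprising.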
18
Detecting Linear Trends: Identical to finding change over time. Detect significant contrast sets using the chi-square test. Use regression techniques to find the portion of the χ² statistic
19
Evaluation: Three research points. Low support difference: with few high-support attribute-value pairs, lower bounds can't be taken advantage of. Pruning rules: as δ -> 0, statistical significance pruning is more important. Filtering rules.
20
Conclusion: The STUCCO algorithm combines statistical hypothesis testing with search for mining contrast sets. STUCCO has: pruning rules for efficient mining at low support differences; guaranteed control over false positives; linear trend detection; compact summarization of results.