Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Yan-Cheng Lin.


1 Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Yan-Cheng Lin

2 Outline: Motivation; Objective; Research Review; Search for Contrast Sets; Filtering for Summarizing Contrast Sets; Evaluation; Conclusion

3 Motivation: Learning group differences is a central problem in many domains. Contrasting groups is especially important in social science research.

4 Objective: Automatically detect differences between contrasting groups from observational multivariate data.

5 Research Review: Time series research: multiple observations. Traditional statistical methods. Rule learners and decision trees: miss group differences. Association rule mining: multiple groups and different search criteria.

6 Problem Definition: The itemset concept extends to contrast sets. Definition 1: Let $A_1, A_2, \ldots, A_k$ be a set of $k$ variables called attributes. Each $A_i$ can take on values from the set $\{V_{i1}, V_{i2}, \ldots, V_{im}\}$. A contrast set is a conjunction of attribute-value pairs defined on groups $G_1, G_2, \ldots, G_n$, with no $A_i$ occurring more than once.

7 Support of a Contrast Set: Definition 2: The support of a contrast set with respect to a group G is the percentage of examples in G where the contrast set is true. The minimum support difference δ is a user-defined threshold.
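Definition 2 can be illustrated with a minimal Python sketch. It assumes the usual reading that a contrast set is of interest when its support differs between some pair of groups by at least δ. The records, attribute names, and the value of δ are made up for illustration, and the helper names are hypothetical.

```python
def support(contrast_set, group):
    """Fraction of examples in `group` where every attribute-value pair holds."""
    if not group:
        return 0.0
    hits = sum(
        all(row.get(attr) == value for attr, value in contrast_set.items())
        for row in group
    )
    return hits / len(group)


def max_support_difference(contrast_set, groups):
    """Largest difference in support between any two groups."""
    supports = [support(contrast_set, rows) for rows in groups.values()]
    return max(supports) - min(supports)


# Hypothetical toy data: two groups of four examples each.
groups = {
    "G1": [
        {"sex": "male", "degree": "phd"},
        {"sex": "female", "degree": "phd"},
        {"sex": "male", "degree": "bs"},
        {"sex": "male", "degree": "phd"},
    ],
    "G2": [
        {"sex": "male", "degree": "bs"},
        {"sex": "female", "degree": "bs"},
        {"sex": "female", "degree": "phd"},
        {"sex": "male", "degree": "bs"},
    ],
}

cset = {"degree": "phd"}   # contrast set: degree = phd
delta = 0.25               # assumed user-defined threshold
diff = max_support_difference(cset, groups)
is_large = diff >= delta
```

Here the support is 0.75 in G1 and 0.25 in G2, so the difference of 0.5 exceeds the assumed δ.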

8 Search for Contrast Sets: Find the contrast sets that meet our criteria through search. Explore all possible contrast sets and return only the sets that meet our criteria. STUCCO (Search and Testing for Understandable Consistent Contrasts): a breadth-first search that incorporates several efficient mining techniques.

9 Framework: Use set-enumeration trees. Use breadth-first search. In the counting phase, organize nodes into candidate groups.
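A set-enumeration tree explored breadth-first can be sketched as follows. This is only an illustration of the traversal order, not the STUCCO implementation: it does no counting, testing, or pruning, and it does not model candidate groups. The item list and function names are hypothetical.

```python
from collections import deque


def children(node, items):
    """Extend a node only with items later in the fixed order, so every
    conjunction is generated exactly once. Items whose attribute already
    appears in the node are skipped (no attribute may occur twice)."""
    start = items.index(node[-1]) + 1 if node else 0
    used = {attr for attr, _ in node}
    return [node + (item,) for item in items[start:] if item[0] not in used]


def breadth_first(items, max_depth):
    """Return all candidate contrast sets level by level."""
    queue = deque([()])          # the root is the empty conjunction
    order = []
    while queue:
        node = queue.popleft()
        if node:
            order.append(node)
        if len(node) < max_depth:
            queue.extend(children(node, items))
    return order


# Hypothetical attribute-value pairs in a fixed order.
items = [("sex", "male"), ("sex", "female"), ("degree", "phd")]
candidates = breadth_first(items, max_depth=2)
```

With these three items the traversal yields the three single-pair sets first, then the two valid two-pair conjunctions.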

10 Finding Significant Contrast Sets: Test the null hypothesis that contrast-set support is equal across all groups. Support counts come from contingency tables.
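As a rough sketch of such a test, the code below computes the Pearson chi-square statistic for a contingency table whose rows are "contrast set true / false" and whose columns are groups. The counts are invented. The p-value helper is valid only for one degree of freedom (two groups), and the code assumes every expected cell count is positive.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: rows = contrast set true / false, columns = G1 / G2.
table = [[30, 10],
         [20, 40]]
stat = chi_square(table)
p = p_value_df1(stat)
reject_null = p < 0.05
```

For more than two groups the statistic is computed the same way, but the p-value needs a chi-square distribution with (number of groups - 1) degrees of freedom, which this sketch does not provide.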

11 Controlling Search Error: Data mining tests many hypotheses. For a family of tests, control the Type I error. Bonferroni inequality: given any set of events $e_1, e_2, \ldots, e_n$, the probability of their union is less than or equal to the sum of the individual probabilities.
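The inequality leads to the standard Bonferroni correction: if each of n tests uses the cutoff α/n, the probability of at least one false rejection is at most α. The sketch below shows only this plain correction. How the search allocates α across levels of the tree is not stated on the slide and is not modeled here. The p-values are made up.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error <= alpha."""
    return alpha / n_tests


def significant(p_values, alpha=0.05):
    """Flag which tests survive the Bonferroni-corrected cutoff."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p <= cutoff for p in p_values]


# Hypothetical p-values from four tests.
p_values = [0.001, 0.02, 0.04, 0.3]
flags = significant(p_values)
```

With four tests the per-test cutoff is 0.0125, so only the first p-value remains significant even though three of them are below 0.05.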

12 Pruning: Prune when contrast sets fail to meet the effect size or statistical significance criteria. Prune when they lead to uninteresting contrast sets. Effect size pruning: prune a node when the bound on the maximum support difference between groups is below δ. Statistical significance pruning: prune a node when there are too few data or the maximum value of $\chi^2$ is too small.
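One simple bound that supports effect size pruning can be sketched as follows. It is an assumption for illustration, not necessarily the exact bound used by the author: adding attribute-value pairs can only lower a contrast set's support in each group, so the support difference of any specialization cannot exceed the largest current group support. If that largest support is already below δ, the node can be pruned.

```python
def can_prune_by_effect_size(group_supports, delta):
    """True if no specialization can reach a support difference of delta.

    Every specialization has support between 0 and the parent's support in
    each group, so its largest pairwise difference is at most
    max(group_supports).
    """
    return max(group_supports) < delta


delta = 0.05  # assumed user-defined minimum support difference

rare_everywhere = can_prune_by_effect_size([0.03, 0.01], delta)
common_somewhere = can_prune_by_effect_size([0.30, 0.28], delta)
```

The second node is kept even though its current difference is only 0.02, because a specialization could still have support near 0.30 in one group and near 0 in the other.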

13 Interest-Based Pruning: Contrast sets are not interesting when they have identical support or when the relation between groups is fixed. Specializations with identical support: marital-status=husband versus marital-status=husband ^ Sex=male.
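The identical-support case can be checked by comparing per-group counts of a contrast set and its specialization. The sketch below uses invented records and hypothetical helper names. When the counts match in every group, the added pair contributes nothing new.

```python
def counts_per_group(contrast_set, groups):
    """Number of examples in each group that satisfy the contrast set."""
    return {
        name: sum(
            all(row.get(attr) == value for attr, value in contrast_set.items())
            for row in rows
        )
        for name, rows in groups.items()
    }


# Hypothetical toy data.
groups = {
    "G1": [
        {"marital-status": "husband", "sex": "male"},
        {"marital-status": "husband", "sex": "male"},
        {"marital-status": "wife", "sex": "female"},
    ],
    "G2": [
        {"marital-status": "husband", "sex": "male"},
        {"marital-status": "unmarried", "sex": "male"},
    ],
}

parent = {"marital-status": "husband"}
child = {"marital-status": "husband", "sex": "male"}

parent_counts = counts_per_group(parent, groups)
child_counts = counts_per_group(child, groups)
redundant = parent_counts == child_counts
```

Every husband in the toy data is male, so the specialization has the same counts as its parent and can be dropped.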

14 Fixed Relations: Prune a node when the contrast set's specializations do not add new information.

15 Relation to Itemset Mining: The minimum support difference criterion implies constraints on support levels in individual groups. Large portions of the search space are eliminated based on: subset infrequency pruning (compare effect size pruning) and superset frequency pruning (compare interest-based pruning).

16 Filtering for Summarizing Contrast Sets: Past approaches: limit the rules shown by constraining the variables or items; compare discovered rules and show only unexpected results. New methods: an expectation-based statistical approach; identify and select linear trend contrast sets.

17 Statistical Surprise: Show the most general contrast sets first, then more complicated conjunctions only if they are surprising based on previously shown sets. IPF (Iterative Proportional Fitting) finds maximum likelihood estimates.
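The mechanics of IPF can be sketched on a two-way table: start from a uniform table and alternately rescale rows and columns until both sets of marginal totals match their targets. This shows only the fitting loop on the simplest case. The expected values for higher-order conjunctions involve more marginals than shown here, and the target totals are invented.

```python
def ipf(row_targets, col_targets, iterations=20):
    """Fit a table whose row and column sums match the given targets."""
    table = [[1.0] * len(col_targets) for _ in row_targets]
    for _ in range(iterations):
        # Scale each row to its target total.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            table[i] = [cell * target / row_sum for cell in table[i]]
        # Scale each column to its target total.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_sum
    return table


# Hypothetical marginals: 40 / 60 by row, 50 / 50 by column.
expected = ipf([40.0, 60.0], [50.0, 50.0])
```

For two-way marginals the result is the familiar independence table, here 20, 20, 30, 30. Observed counts that deviate strongly from such fitted values are the ones that would count as surprising.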

18 Detecting Linear Trends: Identical to finding change over time. Detect significant contrast sets by using the chi-square test. Use regression techniques to find the portion of the $\chi^2$ statistic due to a linear trend.

19 Evaluation: Three research points: Low support difference. With few high-support attribute-value pairs, lower bounds cannot take advantage of the pruning rules. As δ → 0, statistical significance pruning is more important. Filtering rules.

20 Conclusion: The STUCCO algorithm combines statistical hypothesis testing with search for mining contrast sets. STUCCO has: pruning rules for efficient mining at low support differences; guaranteed control over false positives; linear trend detection; compact summarization of results.

