Detecting Group Differences: Mining Contrast Sets. Author: Stephen D. Bay. Advisor: Dr. Hsu. Graduate: Yan-Cheng Lin.

Similar presentations
Correlation Search in Graph Databases Yiping Ke James Cheng Wilfred Ng Presented By Phani Yarlagadda.

Mining Multiple-level Association Rules in Large Databases
Frequent Closed Pattern Search By Row and Feature Enumeration
CmpE 104 SOFTWARE STATISTICAL TOOLS & METHODS MEASURING & ESTIMATING SOFTWARE SIZE AND RESOURCE & SCHEDULE ESTIMATING.
Chapter 11 Contingency Table Analysis. Nonparametric Systems Another method of examining the relationship between independent (X) and dependant (Y) variables.
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Integrating Bayesian Networks and Simpson’s Paradox in Data Mining Alex Freitas University of Kent Ken McGarry University of Sunderland.
Statistics II: An Overview of Statistics. Outline for Statistics II Lecture: SPSS Syntax – Some examples. Normal Distribution Curve. Sampling Distribution.
Hypothesis Testing Steps of a Statistical Significance Test. 1. Assumptions Type of data, form of population, method of sampling, sample size.
DATA ANALYSIS I MKT525. Plan of analysis What decision must be made? What are research objectives? What do you have to know to reach those objectives?
Data Mining Association Analysis: Basic Concepts and Algorithms
Evaluating Hypotheses
Data Mining CS 341, Spring 2007 Lecture 4: Data Mining Techniques (I)
Chapter 5 Outline Formal definition of CSP CSP Examples
Aaker, Kumar, Day Seventh Edition Instructor’s Presentation Slides
Today Concepts underlying inferential statistics
Outline Single-factor ANOVA Two-factor ANOVA Three-factor ANOVA
Monté Carlo Simulation MGS 3100 – Chapter 9. Simulation Defined A computer-based model used to run experiments on a real system.  Typically done on a.
Decision Tree Models in Data Mining
Introduction to Directed Data Mining: Decision Trees
Chapter 12 Inferential Statistics Gay, Mills, and Airasian
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets.
Mining Sequential Patterns: Generalizations and Performance Improvements R. Srikant R. Agrawal IBM Almaden Research Center Advisor: Dr. Hsu Presented by:
Basic Data Mining Techniques
Aaker, Kumar, Day Ninth Edition Instructor’s Presentation Slides
Chapter Twelve Data Processing, Fundamental Data Analysis, and the Statistical Testing of Differences Chapter Twelve.
Investment Analysis and Portfolio management Lecture: 24 Course Code: MBF702.
Inductive learning Simplest form: learn a function from examples
1 Verifying and Mining Frequent Patterns from Large Windows ICDE2008 Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo Date: 2008/9/25 Speaker: Li, HueiJyun.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.
Intelligent Database Systems Lab Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Anthony K.H. Tung Hongjun Lu Jiawei Han Ling Feng 國立雲林科技大學 National.
1 Knowledge Discovery Transparencies prepared by Ho Tu Bao [JAIST] ITCS 6162.
Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Chapter 2 Data Mining: A Closer Look Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration.
1 CS 391L: Machine Learning: Experimental Evaluation Raymond J. Mooney University of Texas at Austin.
Expert Systems with Applications 34 (2008) 459–468 Multi-level fuzzy mining with multiple minimum supports Yeong-Chyi Lee, Tzung-Pei Hong, Tien-Chin Wang.
Educational Research Chapter 13 Inferential Statistics Gay, Mills, and Airasian 10 th Edition.
Mining Frequent Itemsets from Uncertain Data Presenter : Chun-Kit Chui Chun-Kit Chui [1], Ben Kao [1] and Edward Hung [2] [1] Department of Computer Science.
Investigation of sub-patterns discovery and its applications By: Xun Lu Supervisor: Jiuyong Li.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Graduate : Yu Cheng Chen Author: Chung-hung.
Leftover Slides from Week Five. Steps in Hypothesis Testing Specify the research hypothesis and corresponding null hypothesis Compute the value of a test.
Data Mining and Decision Support
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Comparing Association Rules and Decision Trees for Disease.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Chapter 12 Chi-Square Tests and Nonparametric Tests.
DECISION TREES Asher Moody, CS 157B. Overview  Definition  Motivation  Algorithms  ID3  Example  Entropy  Information Gain  Applications  Conclusion.
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets Adam Kirsch, Michael Mitzenmacher, Havard University Andrea.
Data Processing, Fundamental Data Analysis, and the Statistical Testing of Differences Chapter Twelve.
Educational Research Inferential Statistics Chapter th Chapter 12- 8th Gay and Airasian.
Overfitting, Bias/Variance tradeoff. 2 Content of the presentation Bias and variance definitions Parameters that influence bias and variance Bias and.
4-1 MGMG 522 : Session #4 Choosing the Independent Variables and a Functional Form (Ch. 6 & 7)
Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.
Mining Statistically Significant Co-location and Segregation Patterns.
Hypothesis Testing.
Data Mining Association Analysis: Basic Concepts and Algorithms
Dr. Amjad El-Shanti MD, PMH,Dr PH University of Palestine 2016
Lecture 6 Comparing Proportions (II)
Data Mining Lecture 11.
Market Basket Analysis and Association Rules
Association Rule Mining
9 Tests of Hypotheses for a Single Sample CHAPTER OUTLINE
Farzaneh Mirzazadeh Fall 2007
Chapter 10 Analyzing the Association Between Categorical Variables
Investigation of sub-patterns discovery and its applications
Analyzing the Association Between Categorical Variables
Statistics II: An Overview of Statistics
Machine Learning: Lecture 6
Machine Learning: UNIT-3 CHAPTER-1
Presentation transcript:

Detecting Group Differences: Mining Contrast Sets
Author: Stephen D. Bay
Advisor: Dr. Hsu
Graduate: Yan-Cheng Lin

Outline
- Motivation
- Objective
- Research Review
- Search for Contrast Sets
- Filtering for Summarizing Contrast Sets
- Evaluation
- Conclusion

Motivation
- Learning group differences is a central problem in many domains
- Contrasting groups is especially important in social science research

Objective
- Automatically detect differences between contrasting groups from observational multivariate data

Research Review
- Time-series research requires multiple observations
- Traditional statistical methods
- Rule learners and decision trees can miss group differences
- Association rule mining: contrast-set mining involves multiple groups and different search criteria

Problem Definition
- The itemset concept extends to the contrast set
- Definition 1: Let A_1, A_2, ..., A_k be a set of k variables called attributes. Each A_i can take on values from the set {V_i1, V_i2, ..., V_im}. A contrast set is a conjunction of attribute-value pairs, defined on groups G_1, G_2, ..., G_n, with no A_i occurring more than once.
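
As a hedged illustration (not from the slides), a contrast set can be represented as a tuple of attribute-value pairs and checked against a record; the attribute names and the record below are hypothetical:

```python
# Hedged sketch: a contrast set as a tuple of (attribute, value) pairs,
# with at most one pair per attribute. Attribute names are hypothetical.
from typing import Dict, Tuple

ContrastSet = Tuple[Tuple[str, str], ...]   # e.g. (("sex", "male"), ("occupation", "sales"))

def holds(cset: ContrastSet, record: Dict[str, str]) -> bool:
    """True if every attribute-value pair in the contrast set matches the record."""
    return all(record.get(attr) == value for attr, value in cset)

record = {"sex": "male", "occupation": "sales", "education": "bachelors"}
print(holds((("sex", "male"), ("occupation", "sales")), record))  # True
```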

Define Support of a Contrast Set
- Definition 2: The support of a contrast set with respect to a group G is the percentage of examples in G for which the contrast set is true.
- The minimum support difference δ is a user-defined threshold.
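
Restated in symbols (a reconstruction from Bay and Pazzani's paper rather than a quote from the slides), the support of a contrast set c in group G_i, together with the two mining criteria, statistically significant and large, is:

```latex
\mathrm{support}(c, G_i) = \frac{\left|\{\, x \in G_i : c \text{ holds for } x \,\}\right|}{|G_i|},
\qquad
\exists\, i,j:\ P(c \mid G_i) \neq P(c \mid G_j),
\qquad
\max_{i,j} \bigl|\,\mathrm{support}(c, G_i) - \mathrm{support}(c, G_j)\,\bigr| \ \ge\ \delta .
```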

Search for Contrast Sets
- Find contrast sets that meet our criteria through search
- Explore all possible contrast sets and return only the sets that meet the criteria
- STUCCO (Search and Testing for Understandable Consistent Contrasts): a breadth-first search that incorporates several efficient mining techniques

Framework
- Uses set-enumeration trees
- Uses breadth-first search
- Counting phase: organizes nodes into candidate groups
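
A minimal sketch (not the paper's code) of how children are generated in a set-enumeration tree so that each conjunction is visited exactly once during the breadth-first search; the attribute domain below is hypothetical:

```python
# Hedged sketch of set-enumeration-tree candidate generation for a breadth-first
# search: children of a node extend it only with values of attributes that come
# strictly later in a fixed attribute order, so each conjunction is enumerated once.
from typing import Dict, List, Tuple

DOMAIN: Dict[str, List[str]] = {
    "sex": ["male", "female"],
    "education": ["bachelors", "masters"],
    "occupation": ["sales", "tech"],
}
ATTR_ORDER = list(DOMAIN)  # fixed total order over attributes

def children(node: Tuple[Tuple[str, str], ...]) -> List[Tuple[Tuple[str, str], ...]]:
    last_idx = ATTR_ORDER.index(node[-1][0]) if node else -1
    kids = []
    for attr in ATTR_ORDER[last_idx + 1:]:
        for value in DOMAIN[attr]:
            kids.append(node + ((attr, value),))
    return kids

# Level 1 candidates are children of the empty node; level 2 extends level 1, etc.
level1 = children(())
level2 = [c for n in level1 for c in children(n)]
```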

Finding Significant Contrast Sets
- Test the null hypothesis that contrast-set support is equal across all groups
- Support counts come from contingency tables
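
A hedged sketch of the significance test: build a 2 × (number of groups) contingency table of counts where the contrast set does and does not hold, then run a chi-square test; the counts are made up for illustration:

```python
# Hedged sketch: testing whether a contrast set's support differs across groups
# with a chi-square test on a 2 x (number of groups) contingency table.
from scipy.stats import chi2_contingency

# Rows: contrast set true / false; columns: groups G1, G2, G3 (hypothetical counts).
table = [[120,  80,  40],   # examples where the contrast set holds
         [380, 420, 460]]   # examples where it does not

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
# Reject the null hypothesis of equal support across groups if p_value < alpha.
```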

Controlling Search Error
- Data mining tests many hypotheses
- Treat them as a family of tests and control the Type I error
- Bonferroni inequality: given any set of events e_1, e_2, ..., e_n, the probability of their union is less than or equal to the sum of the individual probabilities
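
In symbols, the Bonferroni inequality, together with a level-wise allocation of the error budget in the spirit of STUCCO (the exact allocation used in the paper may differ), is:

```latex
P\!\left(\bigcup_{i=1}^{n} e_i\right) \le \sum_{i=1}^{n} P(e_i),
\qquad
\alpha_l = \frac{\alpha}{2^{\,l}\,|C_l|} \ \ \text{for the } |C_l| \text{ candidates tested at search level } l .
```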

Pruning
- Prune when contrast sets fail to meet the effect size or statistical significance criteria
- Prune when nodes can only lead to uninteresting contrast sets
- Effect Size Pruning: prune a node when an upper bound on the maximum support difference between groups falls below δ
- Statistical Significance Pruning: prune when there are too few data or the maximum attainable χ² value is too small
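
A hedged sketch of effect-size pruning using a deliberately loose bound (the paper derives tighter ones): since a specialization's support in each group can only shrink, the largest current group support bounds the support difference any specialization can achieve:

```python
# Hedged sketch of effect-size pruning with a deliberately loose bound.
from typing import List

def can_prune_effect_size(group_supports: List[float], delta: float) -> bool:
    """group_supports[i] = support of the node's contrast set in group i (0..1).
    If even the largest support is below delta, no specialization can reach a
    support difference of delta, so the whole subtree can be pruned."""
    return max(group_supports) < delta

print(can_prune_effect_size([0.03, 0.01, 0.02], delta=0.05))  # True: prune this subtree
print(can_prune_effect_size([0.40, 0.10, 0.05], delta=0.05))  # False: keep searching
```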

Interest-Based Pruning
- Contrast sets are not interesting when specializations have identical support or the relation between groups is fixed
- Specializations with Identical Support: e.g., marital-status=husband vs. marital-status=husband ∧ sex=male
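
A hedged sketch of the identical-support check: if a specialization has the same support as its parent in every group, it adds no information and can be pruned; the support values below are hypothetical:

```python
# Hedged sketch: a specialization with the same support as its parent in every
# group adds no new information (e.g. when marital-status=husband already
# determines sex=male), so it can be pruned.
from typing import List

def identical_support(parent_supports: List[float],
                      child_supports: List[float],
                      tol: float = 1e-12) -> bool:
    return all(abs(p - c) <= tol for p, c in zip(parent_supports, child_supports))

# Parent: marital-status=husband; child: marital-status=husband AND sex=male.
print(identical_support([0.41, 0.38], [0.41, 0.38]))  # True: prune the child
```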

Fixed Relations
- Prune the node when the relation between groups is fixed, as the contrast set's specializations do not add new information

Relation to Itemset Mining
- The minimum support difference criterion implies constraints on the support levels in the individual groups
- Large portions of the search space can be eliminated based on:
  - Subset infrequency pruning (effect size pruning)
  - Superset frequency pruning (interest-based pruning)

Filtering for Summarizing Contrast Sets
- Past approaches: limit the rules shown by constraining the variables or items; compare discovered rules and show only unexpected results
- New methods: an expectation-based statistical approach; identify and select linear-trend contrast sets

Statistical Surprise
- Show the most general contrast sets first; show more complicated conjunctions only if they are surprising given the previously shown sets
- IPF (Iterative Proportional Fitting) finds the maximum likelihood estimates of the expected cell frequencies
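
A minimal IPF sketch (not the paper's implementation): alternately scale the rows and columns of a seed table until its marginals match the targets, giving the expected counts implied by what has already been shown; the targets below are made up:

```python
# Hedged sketch of iterative proportional fitting (IPF) on a two-way table:
# scale rows, then columns, until the fitted table matches the target marginals.
import numpy as np

def ipf(seed: np.ndarray, row_targets: np.ndarray, col_targets: np.ndarray,
        iters: int = 100, tol: float = 1e-9) -> np.ndarray:
    fit = seed.astype(float).copy()
    for _ in range(iters):
        fit *= (row_targets / fit.sum(axis=1))[:, None]   # match row marginals
        fit *= (col_targets / fit.sum(axis=0))[None, :]   # match column marginals
        if np.allclose(fit.sum(axis=1), row_targets, atol=tol):
            break
    return fit

seed = np.ones((2, 3))                     # start from a uniform table
expected = ipf(seed, np.array([200., 800.]), np.array([500., 300., 200.]))
print(expected.round(1))
```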

Detecting Linear Trends
- Identical to finding change over time
- Detect significant contrast sets using the chi-square test
- Use regression techniques to find the portion of the χ² due to a linear trend
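
One standard way to pull a linear-trend component out of the chi-square for an ordered 2 × k table is a Cochran-Armitage style statistic; this sketch is in the spirit of the slide rather than the exact regression-based decomposition used in the paper, and the counts are hypothetical:

```python
# Hedged sketch: a Cochran-Armitage style trend statistic for a 2 x k table,
# i.e. the one-degree-of-freedom portion of the chi-square due to a linear trend.
import numpy as np
from scipy.stats import chi2

def trend_chi2(successes, totals, scores):
    successes, totals, scores = map(np.asarray, (successes, totals, scores))
    n, r = totals.sum(), successes.sum()
    p_bar = r / n
    t = np.sum(scores * (successes - totals * p_bar))
    var = p_bar * (1 - p_bar) * (np.sum(totals * scores**2)
                                 - np.sum(totals * scores)**2 / n)
    stat = t**2 / var                      # 1-df linear-trend component
    return stat, chi2.sf(stat, df=1)

# Ordered groups (e.g. increasing education level), hypothetical counts.
stat, p = trend_chi2(successes=[30, 45, 60, 80], totals=[200, 200, 200, 200],
                     scores=[1, 2, 3, 4])
print(f"trend chi2={stat:.2f}, p={p:.4f}")
```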

Evaluation
- Three research points: mining at low support differences, the pruning rules, and the filtering rules
- With few high-support attribute-value pairs, the lower bounds cannot take full advantage of the pruning rules
- As δ → 0, statistical significance pruning becomes more important

Conclusion
- The STUCCO algorithm combines statistical hypothesis testing with search for mining contrast sets
- STUCCO provides pruning rules for efficient mining at low support differences, guaranteed control over false positives, linear trend detection, and compact summarization of results