Subgroup Discovery: Finding Local Patterns in Data


Subgroup Discovery Finding Local Patterns in Data

Exploratory Data Analysis
 Scan the data without much prior focus
 Find unusual parts of the data
 Analyse attribute dependencies
 interpret this as a ‘rule’: if X=x and Y=y, then Z is unusual
 Complex data: nominal, numeric, relational?

Exploratory Data Analysis
 Classification: model the dependency of the target on the remaining attributes
 problem: sometimes the classifier is a black box, or uses only some of the available dependencies
 for example: in decision trees, some attributes may not appear because of overshadowing
 Exploratory Data Analysis: understanding the effects of all attributes on the target

Interactions between Attributes
 Single-attribute effects are not enough
 The XOR problem is an extreme example: 2 attributes with no individual information gain form a perfect subgroup together
 Apart from A=a, B=b, C=c, …
 also consider A=a ∧ B=b, A=a ∧ C=c, …, B=b ∧ C=c, …, A=a ∧ B=b ∧ C=c, …
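The XOR point can be illustrated with a tiny synthetic dataset (a sketch; the attribute values and counts are made up): each attribute alone leaves the target distribution at 50%, yet the conjunction of two conditions is a pure subgroup.

```python
from itertools import product

# XOR-style data: target = A xor B; 25 copies of each (A, B) combination
data = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2) for _ in range(25)]

def pos_rate(rows):
    """Fraction of positive targets among the given rows."""
    return sum(t for _, _, t in rows) / len(rows)

print(pos_rate(data))                                          # 0.5 overall
print(pos_rate([r for r in data if r[0] == 1]))                # 0.5 for A=1 alone
print(pos_rate([r for r in data if r[0] == 1 and r[1] == 0]))  # 1.0 for A=1 ∧ B=0
```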

Subgroup Discovery Task
“Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute”
 Inductive constraints:
  Minimum support
  (Maximum support)
  Minimum quality (information gain, χ², WRAcc)
  Maximum complexity
  …

Subgroup Discovery: the Binary Target Case

Confusion Matrix
A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target:
 within subgroup, positive
 within subgroup, negative
 outside subgroup, positive
 outside subgroup, negative

Confusion Matrix
 High numbers along the TT–FF diagonal mean a positive correlation between subgroup and target
 High numbers along the TF–FT diagonal mean a negative correlation between subgroup and target
 The target distribution on the database is fixed
 Only two degrees of freedom remain

Quality Measures
A quality measure for subgroups summarizes the interestingness of a subgroup’s confusion matrix into a single number.
WRAcc: weighted relative accuracy
 Balance between coverage and unexpectedness
 WRAcc(S,T) = p(ST) − p(S)p(T)
 between −0.25 and 0.25; 0 means uninteresting
 Example: WRAcc(S,T) = p(ST) − p(S)p(T) = 0.42 − 0.297 = 0.123
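As a sketch, WRAcc can be computed directly from the confusion-matrix counts; the counts below (54 covered, 55 positive, 42 covered-and-positive, out of 100) are assumptions chosen to reproduce the slide’s 0.42 − 0.297 = 0.123 example.

```python
def wracc(n_st: int, n_s: int, n_t: int, n: int) -> float:
    """Weighted relative accuracy: p(ST) - p(S)p(T).

    n_st: examples inside the subgroup with a positive target
    n_s:  examples inside the subgroup
    n_t:  positive examples in the whole database
    n:    database size
    Ranges over [-0.25, 0.25]; 0 means uninteresting.
    """
    return n_st / n - (n_s / n) * (n_t / n)

print(round(wracc(n_st=42, n_s=54, n_t=55, n=100), 3))  # 0.123
```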

Quality Measures
 WRAcc: weighted relative accuracy
 Information gain
 χ²
 Correlation coefficient
 Laplace
 Jaccard
 Specificity
 …

Subgroup Discovery as Search
[Figure: a search lattice starting from the empty condition ‘true’, refined with single conditions A=a₁, A=a₂, B=b₁, B=b₂, C=c₁, …, then conjunctions such as A=a₁ ∧ B=b₁, A=a₁ ∧ B=b₂, A=a₂ ∧ B=b₁, …, down to A=a₁ ∧ B=b₁ ∧ C=c₁; branches are pruned once the minimum support level is reached]
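The lattice traversal can be sketched as an exhaustive level-wise search (a simplified illustration, not any tool’s actual implementation): enumerate conjunctions of attribute=value conditions, keep those above minimum support, and score them with a quality measure.

```python
from itertools import combinations

def subgroup_search(rows, target, quality, min_support=2, max_depth=2):
    """Level-wise search over conjunctions of attribute=value conditions."""
    conds = sorted({(i, v) for row in rows for i, v in enumerate(row)})
    results = []
    for depth in range(1, max_depth + 1):
        for combo in combinations(conds, depth):
            if len({i for i, _ in combo}) < depth:  # one condition per attribute
                continue
            covered = [t for row, t in zip(rows, target)
                       if all(row[i] == v for i, v in combo)]
            if len(covered) >= min_support:         # inductive constraint
                results.append((quality(covered, target), combo))
    return sorted(results, reverse=True)

# WRAcc as the quality measure: p(S) * (p(T|S) - p(T))
wracc = lambda cov, tgt: len(cov) / len(tgt) * (sum(cov) / len(cov) - sum(tgt) / len(tgt))

# XOR data: the good subgroups only emerge at depth 2
rows = [(a, b) for a in (0, 1) for b in (0, 1)] * 4
target = [a ^ b for a, b in rows]
best_quality, best_combo = subgroup_search(rows, target, wracc)[0]
```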

Refinements are (anti-)monotonic
[Figure: within the entire database, subgroup S₁, with S₂ a refinement of S₁ and S₃ a refinement of S₂, shown relative to the target concept]
Refinements are (anti-)monotonic in their support… but not in interestingness: that may go up or down.

SD vs. Separate & Conquer
Subgroup Discovery:
 Produces a collection of subgroups
 Local patterns
 Subgroups may overlap and conflict
 Subgroup: unusual distribution of classes
Separate & Conquer:
 Produces a decision list
 Classifier
 Exclusive coverage of the instance space
 Rules: clear conclusion

Subgroup Discovery and ROC space

ROC Space
ROC = Receiver Operating Characteristics
Each subgroup forms a point in ROC space, in terms of its False Positive Rate and True Positive Rate.
TPR = TP/Pos = TP/(TP+FN) (fraction of positive cases inside the subgroup)
FPR = FP/Neg = FP/(FP+TN) (fraction of negative cases inside the subgroup)
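These two rates map a subgroup’s confusion matrix to a point in ROC space; a minimal sketch (the counts in the example are made up):

```python
def roc_point(tp: int, fn: int, fp: int, tn: int) -> tuple:
    """(FPR, TPR) of a subgroup from its confusion-matrix counts."""
    tpr = tp / (tp + fn)  # fraction of all positives covered by the subgroup
    fpr = fp / (fp + tn)  # fraction of all negatives covered by the subgroup
    return fpr, tpr

print(roc_point(tp=40, fn=10, fp=5, tn=45))  # (0.1, 0.8)
```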

ROC Space Properties
[Figure: labelled points in ROC space — ‘ROC heaven’ (perfect subgroup), ‘ROC hell’ (perfect negative subgroup), random subgroups on the diagonal, the empty subgroup, the entire database, and the minimum support threshold]

Measures in ROC Space
[Figure: isometrics (lines of equal quality) of WRAcc and information gain in ROC space, over the positive and negative regions; source: Flach & Fürnkranz]

Other Measures
 Precision
 Gini index
 Correlation coefficient
 Foil gain

Refinements in ROC Space
Refinements of S reduce both FPR and TPR, so they appear to the left of and below S.
The blue polygon represents the possible refinements of S.
With a convex quality measure f, the value of f over the polygon is bounded by its value at the corners.
If no corner exceeds the minimum quality, or the current best (top-k), prune the search space below S.
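For WRAcc this pruning bound can be sketched as follows. In ROC coordinates WRAcc(S,T) = p(T)(1 − p(T))·(TPR − FPR), so the best refinement inside the rectangle [0, FPR] × [0, TPR] sits at one of its corners (for WRAcc, at FPR = 0). The function names are illustrative:

```python
def wracc_roc(tpr: float, fpr: float, p_pos: float) -> float:
    """WRAcc rewritten in ROC coordinates: p(T)(1 - p(T)) * (TPR - FPR)."""
    return p_pos * (1 - p_pos) * (tpr - fpr)

def optimistic_estimate(tpr: float, fpr: float, p_pos: float) -> float:
    """Bound on the quality of any refinement of a subgroup at (fpr, tpr):
    refinements lie in [0, fpr] x [0, tpr], and a convex measure is
    maximal at a corner of that rectangle."""
    corners = [(0.0, 0.0), (tpr, 0.0), (0.0, fpr), (tpr, fpr)]
    return max(wracc_roc(t, f, p_pos) for t, f in corners)

# A subgroup at FPR=0.2, TPR=0.8 in a balanced database: no refinement
# can score better than keeping all its positives and losing all negatives.
bound = optimistic_estimate(tpr=0.8, fpr=0.2, p_pos=0.5)
```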

Multi-class problems
 Generalising to problems with more than 2 classes is fairly straightforward: combine the per-class values into a quality measure such as χ² or information gain
[Figure: subgroup-by-target contingency table over classes C₁, C₂, C₃; source: Nijssen & Kok]

Subgroup Discovery for Numeric targets

Numeric Subgroup Discovery
 Target is numeric: find subgroups with a significantly higher or lower average value
 Trade-off between the size of the subgroup and its average target value
[Figure: example subgroups with average target values h = 2200, h = 3100 and h = 3600]

Types of SD for Numeric Targets
 Regression subgroup discovery: numeric target has order and scale
 Ordinal subgroup discovery: numeric target has order
 Ranked subgroup discovery: numeric target has order or scale

Vancouver 2010 Winter Olympics
[Figure: medal table with a regression target (medal counts) and an ordinal target (the ranking); partial ranking — objects can share a rank]

Official IOC ranking of countries (medals > 0)

Rank  Country        Medals  Athletes  Continent   Popul. (M)  Language family  Repub.  Polar
 1    USA                              N. America   309        Germanic         y       y
 2    Germany                          Europe        82        Germanic         y       n
 3    Canada                           N. America    34        Germanic         n       y
 4    Norway                           Europe         4.8      Germanic         n       y
 5    Austria                          Europe         8.3      Germanic         y       n
 6    Russ. Fed.                       Asia         142        Slavic           y       y
 7    Korea                            Asia          73        Altaic           y       n
 9    China                            Asia        1338        Sino-Tibetan     y       n
 9    Sweden                           Europe         9.3      Germanic         n       y
 9    France                           Europe        65        Italic           y       n
11    Switzerland                      Europe         7.8      Germanic         y       n
12    Netherlands    8       34        Europe        16.5      Germanic         n       n
13.5  Czech Rep.     6       92        Europe        10.5      Slavic           y       n
13.5  Poland         6       50        Europe        38        Slavic           y       n
16    Italy                            Europe        60        Italic           y       n
16    Japan          5       94        Asia         127        Japonic          n       n
16    Finland        5       95        Europe         5.3      Finno-Ugric      y       y
20    Australia      3       40        Australia     22        Germanic         y       n
20    Belarus        3       49        Europe         9.6      Slavic           y       n
20    Slovakia       3       73        Europe         5.4      Slavic           y       n
20    Croatia        3       18        Europe         4.5      Slavic           y       n
20    Slovenia       3       49        Europe         2        Slavic           y       n
23    Latvia         2       58        Europe         2.2      Slavic           y       n
25    Great Britain  1       52        Europe        61        Germanic         n       n
25    Estonia        1       30        Europe         1.3      Finno-Ugric      y       n
25    Kazakhstan     1       38        Asia          16        Turkic           y       n

Fractional ranks: shared ranks are averaged

Interesting Subgroups
‘polar = yes’
 1. United States
 3. Canada
 4. Norway
 6. Russian Federation
 9. Sweden
 16. Finland
‘language_family = Germanic & athletes ≥ 60’
 1. United States
 2. Germany
 3. Canada
 4. Norway
 5. Austria
 9. Sweden
 11. Switzerland

Intuitions
 Size: larger subgroups are more reliable
 Rank: the majority of objects appear at the top
 Position: the ‘middle’ of the subgroup should differ from the middle of the ranking
 Deviation: objects should have similar ranks
[Figure: positions of the subgroup language_family = Slavic within the ranking]

Intuitions
 Size: larger subgroups are more reliable
 Rank: the majority of objects appear at the top
 Position: the ‘middle’ of the subgroup should differ from the middle of the ranking
 Deviation: objects should have similar ranks
[Figure: positions of the subgroup polar = yes within the ranking]

Intuitions
 Size: larger subgroups are more reliable
 Rank: the majority of objects appear at the top
 Position: the ‘middle’ of the subgroup should differ from the middle of the ranking
 Deviation: objects should have similar ranks
[Figure: positions of the subgroup population ≤ 10M within the ranking]

Intuitions
 Size: larger subgroups are more reliable
 Rank: the majority of objects appear at the top
 Position: the ‘middle’ of the subgroup should differ from the middle of the ranking
 Deviation: objects should have similar ranks
[Figure: positions of the subgroup language_family = Slavic & population ≤ 10M within the ranking]

Quality Measures
 Average
 Mean test
 z-score
 t-statistic
 Median χ² statistic
 AUC of ROC
 Wilcoxon–Mann–Whitney ranks statistic
 Median MAD metric
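As an illustration of one of these, the z-score compares the subgroup’s mean target value against the database mean, scaled by the database’s standard deviation and the subgroup size (a sketch; the data values are made up):

```python
import statistics

def z_score(subgroup_values, database_values):
    """z-score quality for a numeric target:
    (mean(S) - mean(DB)) / (sd(DB) / sqrt(|S|))."""
    mu = statistics.mean(database_values)
    sd = statistics.pstdev(database_values)
    return (statistics.mean(subgroup_values) - mu) / (sd / len(subgroup_values) ** 0.5)

db = list(range(1, 11))      # target values 1..10, mean 5.5
z = z_score([8, 9, 10], db)  # a small subgroup at the top of the range
```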

Meet Cortana
the open source Subgroup Discovery tool

Cortana Features
 Generic Subgroup Discovery algorithm
  quality measure
  search strategy
  inductive constraints
 Flat files: .txt, .arff (DB connection to come)
 Support for complex targets
 41 quality measures
 ROC plots
 Statistical validation

Target Concepts
 ‘Classical’ Subgroup Discovery
  nominal targets (classification)
  numeric targets (regression)
 Exceptional Model Mining (to be discussed in a few slides)
  multiple targets
  regression, correlation
  multi-label classification

Mixed Data
 Data types: binary, nominal, numeric
 Numeric data is treated dynamically (no discretisation as preprocessing)
  all: consider all available thresholds
  bins: discretise the current candidate subgroup
  best: find the best threshold, and search from there
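The ‘all’ and ‘bins’ strategies can be sketched like this (a simplified illustration; the function and parameter names are not Cortana’s):

```python
def candidate_thresholds(values, strategy="all", n_bins=4):
    """Numeric thresholds for the current candidate subgroup.

    'all'  -> every distinct value is a candidate threshold
    'bins' -> roughly equal-height bin boundaries over the sorted values
    """
    ordered = sorted(values)
    if strategy == "all":
        return sorted(set(ordered))
    step = max(1, len(ordered) // n_bins)
    return sorted(set(ordered[step::step]))

vals = [5, 1, 3, 3, 9, 7, 1, 5]
print(candidate_thresholds(vals))                   # [1, 3, 5, 7, 9]
print(candidate_thresholds(vals, strategy="bins"))  # [3, 5, 7]
```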

Statistical Validation
 Determine the distribution of random results
  random subsets
  random conditions
  swap-randomization
 Determine minimum quality
 Significance of individual results
 Validate quality measures: how exceptional?

Open Source
You can:
 Use the Cortana binary: datamining.liacs.nl/cortana.html
 Use and modify the Cortana sources (Java)

Exceptional Model Mining
Subgroup Discovery with multiple target attributes

Mixture of Distributions

 For each datapoint it is unclear whether it belongs to G or its complement G̅
 Intensional description of the exceptional subgroup G?
 Model class unknown
 Model parameters unknown

Solution: extend Subgroup Discovery
 Use other information than X and Y: object descriptions D
 Use Subgroup Discovery to scan sub-populations in terms of D
Subgroup Discovery: find subgroups of the database where the target attribute shows an unusual distribution.

Solution: extend Subgroup Discovery
 Use other information than X and Y: object descriptions D
 Use Subgroup Discovery to scan sub-populations in terms of D
 The model over the subgroup becomes the target of SD
Exceptional Model Mining: find subgroups of the database where the target attributes show an unusual distribution, by means of modeling over the target attributes.

 Define a target concept (X and y)
[Figure: dataset columns split into the object description and the target concept (X and y)]

Exceptional Model Mining
 Define a target concept (X and y)
 Choose a model class C
 Define a quality measure φ over C
[Figure: modeling over the target concept (X and y), alongside the object description]

Exceptional Model Mining
 Define a target concept (X and y)
 Choose a model class C
 Define a quality measure φ over C
 Use Subgroup Discovery to find exceptional subgroups G and associated models M
[Figure: Subgroup Discovery over the object description, modeling over the target concept (X and y)]

Quality Measure
 Specify what makes subgroup G exceptional, based on properties of its model M
 Absolute measure (absolute quality of M)
  Correlation coefficient
  Predictive accuracy
 Difference measure (difference between M and the complement’s model M̅)
  Difference in slope
  Qualitative properties of the classifier
 Reliable results
  Minimum support level
  Statistical significance of G

Correlation Model
 Correlation coefficient: φ_ρ = ρ(G)
 Absolute difference in correlation: φ_abs = |ρ(G) − ρ(G̅)|
 Entropy-weighted absolute difference: φ_ent = H(p)·|ρ(G) − ρ(G̅)|
 Statistical significance of the correlation difference: φ_scd
  compute z-scores from ρ through the Fisher transform
  compute the p-value from the z-score
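φ_scd can be sketched with the standard Fisher transform: convert both correlations to z-space, divide by the pooled standard error, and read off a two-sided p-value from the normal distribution (a sketch; the function names are illustrative, and n_g, n_c are the sizes of G and its complement):

```python
import math

def fisher_z(r: float) -> float:
    """Fisher transform of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def corr_diff_significance(r_g, n_g, r_c, n_c):
    """z-score and two-sided p-value for the difference between the
    correlation in subgroup G (r_g, size n_g) and in its complement
    (r_c, size n_c)."""
    se = math.sqrt(1 / (n_g - 3) + 1 / (n_c - 3))
    z = (fisher_z(r_g) - fisher_z(r_c)) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

z, p = corr_diff_significance(r_g=0.9, n_g=50, r_c=0.3, n_c=200)
```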

Regression Model
 Compare the slope b of y_i = a + b·x_i + e fitted on the subgroup with the slope b̅ of y_i = a̅ + b̅·x_i + e fitted on its complement
 Compute the significance of the slope difference: φ_ssd
[Figure: two fitted regression lines for the subgroup drive = 1 ∧ basement = 0 ∧ #baths ≤ 1 and its complement; the intercept and slope values were not recovered]

Gene Expression Data
[Figure: regression lines for the subgroup 11_band = ‘no deletion’ ∧ survival time ≤ 1919 ∧ XP_ ≤ 57 and its complement; the slope values were not recovered]

Classification Model
 Decision Table Majority classifier
 BDeu measure (predictiveness)
 Hellinger distance (unusual distribution)
[Figure: classifiers on the whole database vs. the subgroup defined by RIF1 and prognosis = ‘unknown’]

General Framework
[Figure: Subgroup Discovery over the object description, modeling over the target concept (X and y)]

General Framework
[Figure: the modeling component instantiated: Regression, Classification, Clustering, Association, Graphical modeling, …]

General Framework
[Figure: the model class refined further, e.g. Classification via Decision Trees, SVMs, …]

General Framework
[Figure: the Subgroup Discovery component instantiated: propositional, multi-relational, …]