Published by Dorothy Johnston. Modified over 9 years ago.
1
Subgroup Discovery Finding Local Patterns in Data
2
Exploratory Data Analysis
Scan the data without much prior focus
Find unusual parts of the data
Analyse attribute dependencies
  interpret this as a ‘rule’: if X=x and Y=y then Z is unusual
  the Subgroup: the part of the data where X=x and Y=y
Complex data: nominal, numeric, relational?
3
Exploratory Data Analysis
Classification: model the dependency of the target on the remaining attributes.
  Problem: sometimes the classifier is a black box, or uses only some of the available dependencies.
  For example: in decision trees, some attributes may not appear because of overshadowing.
Exploratory Data Analysis: understanding the effects of all attributes on the target.
4
Interactions between Attributes
Single-attribute effects are not enough
The XOR problem is an extreme example: 2 attributes with no info gain form a good subgroup
Apart from A=a, B=b, C=c, … consider also
  A=a ∧ B=b, A=a ∧ C=c, …, B=b ∧ C=c, …
  A=a ∧ B=b ∧ C=c, …
  …
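The XOR remark can be checked on a small example. The sketch below builds a hypothetical toy dataset (100 rows, target = A xor B) and computes the information gain of single conditions and of one conjunction. The function names and the data are illustrative assumptions, not part of the slides.

```python
from math import log2

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def info_gain(rows, cond):
    """Information gain of splitting rows into cond / not cond."""
    inside = [t for a, b, t in rows if cond(a, b)]
    outside = [t for a, b, t in rows if not cond(a, b)]
    n = len(rows)
    everything = [t for _, _, t in rows]
    return (entropy(everything)
            - len(inside) / n * entropy(inside)
            - len(outside) / n * entropy(outside))

# Hypothetical toy data: each (A, B) combination occurs 25 times, target = A xor B.
rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1) for _ in range(25)]

gain_a = info_gain(rows, lambda a, b: a == 1)              # single attribute A
gain_b = info_gain(rows, lambda a, b: b == 1)              # single attribute B
gain_ab = info_gain(rows, lambda a, b: a == 1 and b == 0)  # conjunction A=1 and B=0
print(gain_a, gain_b, gain_ab)
```

On this data the single conditions carry no information about the target, while the conjunction isolates a pure group of positives.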
5
Subgroup Discovery Task
“Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attributes”
Inductive constraints:
  Minimum support
  (Maximum support)
  Minimum quality (Information gain, χ², WRAcc)
  Maximum complexity
  …
6
Subgroup Discovery: the Binary Target Case
7
Confusion Matrix
A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target:
  within subgroup, positive
  within subgroup, negative
  outside subgroup, positive
  outside subgroup, negative

                 target
                 T     F
subgroup   T    .42   .13   .55
           F    .12   .33   .45
                .54   .46   1.0
8
Confusion Matrix
High numbers along the TT-FF diagonal mean a positive correlation between subgroup and target
High numbers along the TF-FT diagonal mean a negative correlation between subgroup and target
Target distribution on DB is fixed
Only two degrees of freedom

                 target
                 T     F
subgroup   T    .42   .13   .55
           F    .12   .33   .45
                .54   .46   1.0
9
Quality Measures
A quality measure for subgroups summarizes the interestingness of its confusion matrix into a single number
WRAcc, weighted relative accuracy
  Balance between coverage and unexpectedness
  WRAcc(S,T) = p(ST) − p(S)p(T)
  between −.25 and .25, 0 means uninteresting

                 target
                 T     F
subgroup   T    .42   .13   .55
           F    .12   .33   .45
                .54   .46   1.0

WRAcc(S,T) = p(ST) − p(S)p(T) = .42 − .55 × .54 = .42 − .297 = .123
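The WRAcc arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch: the function name `wracc` and the variable names are illustrative, and the marginals are derived from the four cells of the confusion matrix.

```python
def wracc(p_st, p_s, p_t):
    """Weighted relative accuracy: p(ST) - p(S) * p(T)."""
    return p_st - p_s * p_t

# Cell values from the confusion matrix on the slide.
p_st = 0.42          # within subgroup, positive
p_s = 0.42 + 0.13    # subgroup coverage, .55
p_t = 0.42 + 0.12    # fraction of positives in the database, .54

quality = wracc(p_st, p_s, p_t)
print(round(quality, 3))

# The upper bound .25 is reached by a subgroup that covers exactly the
# positives of a perfectly balanced target.
upper = wracc(0.5, 0.5, 0.5)
```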
10
Quality Measures
WRAcc: Weighted Relative Accuracy
Information gain
χ²
Correlation Coefficient
Laplace
Jaccard
Specificity
…
11
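Subgroup Discovery as Search
Figure: search lattice, refined level by level
  true
  A=a1, A=a2, B=b1, B=b2, C=c1, …
  A=a1 ∧ B=b1, A=a1 ∧ B=b2, A=a2 ∧ B=b1, …
  A=a1 ∧ B=b1 ∧ C=c1
  minimum support level reached
Each candidate is evaluated through its confusion matrix (rows: subgroup, columns: target):
           T     F
     T    .42   .13   .55
     F    .12   .33
          .54         1.0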
12
Refinements are (anti-)monotonic
Figure: regions within the entire database: the target concept, subgroup S1, S2 (refinement of S1), S3 (refinement of S2)
Refinements are (anti-)monotonic in their support…
…but not in interestingness. This may go up or down.
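Because support can only shrink under refinement, a candidate that falls below the minimum support can be dropped together with all its refinements, while quality has to be evaluated for every surviving candidate. The following is a minimal level-wise sketch of that idea with WRAcc as the quality measure. The data, the function names and the search order are illustrative assumptions and not Cortana's actual implementation.

```python
def covers(desc, row):
    """True if the row satisfies every (attribute, value) condition in desc."""
    return all(row[attr] == val for attr, val in desc)

def wracc(rows, desc, target):
    n = len(rows)
    sub = [r for r in rows if covers(desc, r)]
    p_s = len(sub) / n
    p_t = sum(r[target] for r in rows) / n
    p_st = sum(r[target] for r in sub) / n
    return p_st - p_s * p_t

def search(rows, attrs, target, min_support, max_depth):
    conditions = sorted({(a, r[a]) for r in rows for a in attrs})
    results = []
    level = [()]  # the empty description, "true", covers the entire database
    for _ in range(max_depth):
        next_level = []
        for desc in level:
            for cond in conditions:
                if desc and cond <= desc[-1]:
                    continue  # fixed ordering: generate each conjunction once
                if any(cond[0] == a for a, _ in desc):
                    continue  # at most one condition per attribute
                refined = desc + (cond,)
                support = sum(covers(refined, r) for r in rows)
                if support < min_support:
                    continue  # anti-monotonic: refinements cannot regain support
                results.append((wracc(rows, refined, target), support, refined))
                next_level.append(refined)
        level = next_level
    return sorted(results, key=lambda r: r[0], reverse=True)

# Hypothetical toy data: (A, B, target, number of copies).
counts = [("a1", "b1", 1, 4), ("a1", "b2", 0, 3), ("a2", "b1", 0, 3),
          ("a2", "b2", 0, 1), ("a2", "b2", 1, 1)]
rows = [{"A": a, "B": b, "T": t} for a, b, t, k in counts for _ in range(k)]

results = search(rows, ["A", "B"], "T", min_support=3, max_depth=2)
best_quality, best_support, best_desc = results[0]
print(best_desc, best_support, round(best_quality, 4))
```

Note that the best subgroup here is a refinement whose quality is higher than that of either parent, which is why quality cannot be used for pruning in the same way as support.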
13
SD vs. Separate & Conquer

Subgroup Discovery                           Separate & Conquer
Produces collection of subgroups             Produces decision-list
Local Patterns                               Classifier
Subgroups may overlap and conflict           Exclusive coverage of instance space
Subgroup, unusual distribution of classes    Rules, clear conclusion
14
Subgroup Discovery and ROC space
15
ROC Space
Each subgroup forms a point in ROC space, in terms of its False Positive Rate and True Positive Rate.
TPR = TP/Pos = TP/(TP+FN) (fraction of the positive cases that fall in the subgroup)
FPR = FP/Neg = FP/(FP+TN) (fraction of the negative cases that fall in the subgroup)
ROC = Receiver Operating Characteristics
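As a small illustration, the ROC coordinates of a subgroup follow directly from the four cells of its confusion matrix. The counts below assume a hypothetical database of 100 records with the cell frequencies used on the earlier confusion-matrix slides; `roc_point` is an illustrative name.

```python
def roc_point(tp, fp, fn, tn):
    """Return (FPR, TPR) for a subgroup given its confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of all positives covered by the subgroup
    fpr = fp / (fp + tn)  # share of all negatives covered by the subgroup
    return fpr, tpr

# Assumed counts for 100 records: .42, .13, .12, .33 as TP, FP, FN, TN.
fpr, tpr = roc_point(tp=42, fp=13, fn=12, tn=33)
print(round(fpr, 3), round(tpr, 3))
```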
16
ROC Space Properties
Figure labels: ‘ROC heaven’; perfect subgroup; ‘ROC hell’; random subgroup; entire database; empty subgroup; minimum support threshold; perfect negative subgroup
17
Measures in ROC Space
Figure: isometric plots for WRAcc and Information Gain in ROC space; labels: 0, positive, negative (source: Flach & Fürnkranz)
18
Other Measures
Precision
Gini index
Correlation coefficient
Foil gain
19
Refinements in ROC Space
Refinements of S will reduce the FPR and TPR, so will appear to the left and below S.
Blue polygon represents possible refinements of S.
With a convex measure, f is bounded by the measure of the corners.
If corners are not above minimum quality or current best (top k?), prune search space below S.
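A sketch of this pruning rule for WRAcc, which in ROC coordinates equals π(1 − π)(TPR − FPR), with π the fraction of positives in the database. As a simplifying assumption, the region of possible refinements is taken to be the full rectangle to the left of and below S, ignoring the cut made by the minimum support threshold. All function names are illustrative.

```python
def wracc_roc(fpr, tpr, pos_frac):
    """WRAcc expressed in ROC coordinates: pi * (1 - pi) * (TPR - FPR)."""
    return pos_frac * (1 - pos_frac) * (tpr - fpr)

def optimistic_estimate(fpr, tpr, pos_frac):
    """Upper bound on the WRAcc of any refinement of a subgroup at (fpr, tpr).

    WRAcc is linear, hence convex, in ROC space, so its maximum over the
    rectangle [0, fpr] x [0, tpr] is attained at one of the four corners.
    """
    corners = [(0.0, 0.0), (0.0, tpr), (fpr, 0.0), (fpr, tpr)]
    return max(wracc_roc(f, t, pos_frac) for f, t in corners)

def can_prune(fpr, tpr, pos_frac, threshold):
    """True if no refinement can reach the threshold (minimum quality or current k-th best)."""
    return optimistic_estimate(fpr, tpr, pos_frac) < threshold

# Subgroup from the confusion-matrix slides: TPR = .42/.54, FPR = .13/.46, pi = .54.
fpr, tpr, pos_frac = 13 / 46, 42 / 54, 0.54
current = wracc_roc(fpr, tpr, pos_frac)
bound = optimistic_estimate(fpr, tpr, pos_frac)
print(round(current, 3), round(bound, 4))
```

The bound corresponds to the corner (0, TPR): a refinement that keeps every positive of S and drops every negative.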
20
Multi-class problems
Generalising to problems with more than 2 classes is fairly straightforward:

                 target
                 C1    C2    C3
subgroup   T    .27   .06   .22   .55
           F    .03   .19   .23   .45
                .3    .25   .45   1.0

Combine the values into a quality measure: χ², Information gain
source: Nijssen & Kok
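One way to combine the cells of such a table into a single number is Pearson's χ² statistic. The sketch below takes the relative frequencies from the slide; the database size `n = 100` is an assumption made only for illustration, since the slide gives fractions rather than counts.

```python
def chi_squared(table, n):
    """Pearson chi-squared for a table of relative frequencies.

    table: rows are subgroup T/F, columns are classes; entries sum to 1.
    n: number of records in the database (assumed, not given on the slide).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j]  # frequency under independence
            stat += (observed - expected) ** 2 / expected
    return n * stat

table = [[0.27, 0.06, 0.22],   # inside the subgroup: C1, C2, C3
         [0.03, 0.19, 0.23]]   # outside the subgroup
chi2 = chi_squared(table, n=100)
print(round(chi2, 2))

# A subgroup whose class distribution matches the database scores 0.
independent = [[0.15, 0.125, 0.225], [0.15, 0.125, 0.225]]
chi2_independent = chi_squared(independent, n=100)
```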
21
Subgroup Discovery for Numeric targets
22
Numeric Subgroup Discovery
Target is numeric: find subgroups with significantly higher or lower average value
Trade-off between size of subgroup and average target value
Figure labels: h = 2200, h = 3100, h = 3600
23
Types of SD for Numeric Targets
Regression subgroup discovery: numeric target has order and scale
Ordinal subgroup discovery: numeric target has order
Ranked subgroup discovery: numeric target has order or scale
24
Vancouver 2010 Winter Olympics
Partial ranking: objects share a rank
Figure labels: ordinal target, regression target
25
Official IOC ranking of countries (medals > 0)

Rank  Country        Medals  Athletes  Continent   Popul.  Language Family  Repub.  Polar
1     USA            37      214       N. America  309     Germanic         y       y
2     Germany        30      152       Europe      82      Germanic         y       n
3     Canada         26      205       N. America  34      Germanic         n       y
4     Norway         23      100       Europe      4.8     Germanic         n       y
5     Austria        16      79        Europe      8.3     Germanic         y       n
6     Russ. Fed.     15      179       Asia        142     Slavic           y       y
7     Korea          14      46        Asia        73      Altaic           y       n
9     China          11      90        Asia        1338    Sino-Tibetan     y       n
9     Sweden         11      107       Europe      9.3     Germanic         n       y
9     France         11      107       Europe      65      Italic           y       n
11    Switzerland    9       144       Europe      7.8     Germanic         y       n
12    Netherlands    8       34        Europe      16.5    Germanic         n       n
13.5  Czech Rep.     6       92        Europe      10.5    Slavic           y       n
13.5  Poland         6       50        Europe      38      Slavic           y       n
16    Italy          5       110       Europe      60      Italic           y       n
16    Japan          5       94        Asia        127     Japonic          n       n
16    Finland        5       95        Europe      5.3     Finno-Ugric      y       y
20    Australia      3       40        Australia   22      Germanic         y       n
20    Belarus        3       49        Europe      9.6     Slavic           y       n
20    Slovakia       3       73        Europe      5.4     Slavic           y       n
20    Croatia        3       18        Europe      4.5     Slavic           y       n
20    Slovenia       3       49        Europe      2       Slavic           y       n
23    Latvia         2       58        Europe      2.2     Slavic           y       n
25    Great Britain  1       52        Europe      61      Germanic         n       n
25    Estonia        1       30        Europe      1.3     Finno-Ugric      y       n
25    Kazakhstan     1       38        Asia        16      Turkic           y       n

Fractional ranks: shared ranks are averaged
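The fractional ranks in the table can be derived from the medal counts alone: countries with equal counts share the average of the positions they occupy. A minimal sketch, with `fractional_ranks` as an illustrative name:

```python
def fractional_ranks(scores):
    """Rank scores from high to low; tied scores share the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

# Medal counts in the order of the table above.
medals = [37, 30, 26, 23, 16, 15, 14, 11, 11, 11, 9, 8, 6, 6,
          5, 5, 5, 3, 3, 3, 3, 3, 2, 1, 1, 1]
ranks = fractional_ranks(medals)
print(ranks)
```

For example, China, Sweden and France occupy positions 8, 9 and 10 and therefore all receive rank 9.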
26
Interesting Subgroups
‘polar = yes’
  1. United States
  3. Canada
  4. Norway
  6. Russian Federation
  9. Sweden
  16. Finland
‘language_family = Germanic & athletes ≥ 60’
  1. United States
  2. Germany
  3. Canada
  4. Norway
  5. Austria
  9. Sweden
  11. Switzerland
27
Intuitions
Size: larger subgroups are more reliable
Rank: majority of objects appear at the top
Position: ‘middle’ of subgroup should differ from middle of ranking
Deviation: objects should have similar rank
Example (members marked along the ranking): language_family = Slavic
28
Intuitions
Size: larger subgroups are more reliable
Rank: majority of objects appear at the top
Position: ‘middle’ of subgroup should differ from middle of ranking
Deviation: objects should have similar rank
Example (members marked along the ranking): polar = yes
29
Intuitions
Size: larger subgroups are more reliable
Rank: majority of objects appear at the top
Position: ‘middle’ of subgroup should differ from middle of ranking
Deviation: objects should have similar rank
Example (members marked along the ranking): population ≤ 10M
30
Intuitions
Size: larger subgroups are more reliable
Rank: majority of objects appear at the top
Position: ‘middle’ of subgroup should differ from middle of ranking
Deviation: objects should have similar rank
Example (members marked along the ranking): language_family = Slavic & population ≤ 10M
31
Quality Measures
Average
Mean test
z-Score
t-Statistic
Median χ² statistic
AUC of ROC
Wilcoxon-Mann-Whitney Ranks statistic
Median MAD Metric
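To make one of these measures concrete, the sketch below applies a z-score to the ‘polar = yes’ subgroup, using the fractional ranks of the Olympics table as the numeric target. The formula used, sqrt(n)·(subgroup mean − database mean)/database standard deviation, is one common form of the z-score and is an assumption here; the slides do not spell out a definition.

```python
from math import sqrt
from statistics import mean, pstdev

def z_score(subgroup_values, all_values):
    """sqrt(n) * (subgroup mean - overall mean) / overall standard deviation."""
    n = len(subgroup_values)
    return sqrt(n) * (mean(subgroup_values) - mean(all_values)) / pstdev(all_values)

# Fractional ranks of all 26 countries, and of the 'polar = yes' subgroup.
all_ranks = [1, 2, 3, 4, 5, 6, 7, 9, 9, 9, 11, 12, 13.5, 13.5,
             16, 16, 16, 20, 20, 20, 20, 20, 23, 25, 25, 25]
polar_ranks = [1, 3, 4, 6, 9, 16]

z = z_score(polar_ranks, all_ranks)
print(round(z, 2))
```

A negative value indicates that the subgroup sits closer to the top of the ranking than the database as a whole; the factor sqrt(n) rewards larger subgroups, in line with the size intuition.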
32
Meet Cortana the open source Subgroup Discovery tool
33
Cortana Features
Generic Subgroup Discovery algorithm
  quality measure
  search strategy
  inductive constraints
Flat file: .txt, .arff (DB connection to come)
Support for complex targets
41 quality measures
ROC plots
Statistical validation
34
Target Concepts
‘Classical’ Subgroup Discovery
  nominal targets (classification)
  numeric targets (regression)
Exceptional Model Mining (to be discussed in a few slides)
  multiple targets
  regression, correlation
  multi-label classification
35
Mixed Data
Data types
  binary
  nominal
  numeric
Numeric data is treated dynamically (no discretisation as preprocessing)
  all: consider all available thresholds
  bins: discretise the current candidate subgroup
  best: find best threshold, and search from there
36
Statistical Validation
Determine distribution of random results
  random subsets
  random conditions
  swap-randomization
Determine minimum quality
Significance of individual results
Validate quality measures
  how exceptional?
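The ‘random subsets’ idea can be sketched as follows: draw many random subsets of the same size as a candidate subgroup, score each with the quality measure, and use a high quantile of those scores as the minimum quality. The function names, the toy target, the number of trials and the 95% quantile are assumptions for illustration, not Cortana's defaults.

```python
import random

def wracc(targets, members):
    """WRAcc of the subgroup given by a list of record indices."""
    n = len(targets)
    p_s = len(members) / n
    p_t = sum(targets) / n
    p_st = sum(targets[i] for i in members) / n
    return p_st - p_s * p_t

def random_subset_threshold(targets, size, trials=1000, quantile=0.95, seed=0):
    """Quality level that only a (1 - quantile) share of random subsets exceeds."""
    rng = random.Random(seed)
    scores = sorted(
        wracc(targets, rng.sample(range(len(targets)), size))
        for _ in range(trials)
    )
    return scores[min(int(quantile * trials), trials - 1)]

# Hypothetical binary target: 30 positives among 100 records.
targets = [1] * 30 + [0] * 70
threshold = random_subset_threshold(targets, size=30)

# A subgroup covering exactly the positives, for comparison.
perfect = wracc(targets, list(range(30)))
print(round(threshold, 3), round(perfect, 3))
```

A discovered subgroup whose quality exceeds the threshold is unlikely to be a product of chance alone under this baseline; swap-randomization and random conditions follow the same pattern with a different way of generating random results.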
37
Open Source
You can
  Use Cortana binary: datamining.liacs.nl/cortana.html
  Use and modify Cortana sources (Java)
38
Exceptional Model Mining Subgroup Discovery with multiple target attributes
39
Mixture of Distributions
41
For each datapoint it is unclear whether it belongs to G or to its complement Ḡ
Intensional description of exceptional subgroup G?
Model class unknown
Model parameters unknown
42
Solution: extend Subgroup Discovery
Use other information than X and Y: object descriptions D
Use Subgroup Discovery to scan sub-populations in terms of D
Subgroup Discovery: find subgroups of the database where the target attribute shows an unusual distribution.
43
Solution: extend Subgroup Discovery
Use other information than X and Y: object descriptions D
Use Subgroup Discovery to scan sub-populations in terms of D
Model over subgroup becomes target of SD
Exceptional Model Mining: find subgroups of the database where the target attributes show an unusual distribution, by means of modeling over the target attributes.
44
Define a target concept (X and y)
Figure labels: X, y, object description, target concept
45
Exceptional Model Mining
Define a target concept (X and y)
Choose a model class C
Define a quality measure φ over C
Figure labels: X, y, modeling, object description, target concept
46
Exceptional Model Mining
Define a target concept (X and y)
Choose a model class C
Define a quality measure φ over C
Use Subgroup Discovery to find exceptional subgroups G and associated model M
Figure labels: X, y, modeling, Subgroup Discovery, object description, target concept
47
Quality Measure
Specify what defines an exceptional subgroup G based on properties of model M
Absolute measure (absolute quality of M)
  Correlation coefficient
  Predictive accuracy
Difference measure (difference between M, fitted on G, and M̄, fitted on the complement Ḡ)
  Difference in slope
  qualitative properties of classifier
Reliable results
  Minimum support level
  Statistical significance of G
48
Correlation Model
Correlation coefficient: φ_ρ = ρ(G)
Absolute difference in correlation: φ_abs = |ρ(G) − ρ(Ḡ)|
Entropy weighted absolute difference: φ_ent = H(p)·|ρ(G) − ρ(Ḡ)|
Statistical significance of correlation difference: φ_scd
  compute z-score from ρ through Fisher transform
  compute p-value from z-score
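The correlation-based measures can be sketched as below, comparing a subgroup G with its complement. Several details are assumptions because the slide does not spell them out: H(p) is taken to be the entropy of the split between subgroup and complement, the z-score uses the usual standard error sqrt(1/(n−3) + 1/(n̄−3)) for a difference of Fisher-transformed correlations, and the p-value is two-sided. The data are hypothetical.

```python
from math import atanh, erf, log2, sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def split_entropy(n_g, n_c):
    """Entropy (bits) of the subgroup / complement split (assumed meaning of H(p))."""
    total = n_g + n_c
    return -sum(k / total * log2(k / total) for k in (n_g, n_c) if k > 0)

def correlation_measures(g, comp):
    """g and comp are lists of (x, y) pairs; each needs more than 3 points and |rho| < 1."""
    rho_g = pearson(*zip(*g))
    rho_c = pearson(*zip(*comp))
    phi_abs = abs(rho_g - rho_c)
    phi_ent = split_entropy(len(g), len(comp)) * phi_abs
    # Fisher transform, then z-score of the difference between the two correlations.
    z = (atanh(rho_g) - atanh(rho_c)) / sqrt(1 / (len(g) - 3) + 1 / (len(comp) - 3))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal p-value
    return {"rho": rho_g, "abs": phi_abs, "ent": phi_ent, "z": z, "p": p_value}

# Hypothetical data: strong correlation inside the subgroup, almost none outside.
xs = list(range(10))
g = list(zip(xs, [0.5, 0.8, 2.4, 2.9, 4.3, 4.8, 6.5, 6.6, 8.4, 9.1]))
comp = list(zip(xs, [5, 3, 6, 2, 7, 4, 5, 3, 6, 4]))

m = correlation_measures(g, comp)
print({k: round(v, 4) for k, v in m.items()})
```

The entropy weighting shrinks the score of very small or very large subgroups; with two halves of equal size, as here, H(p) is 1 and φ_ent coincides with φ_abs.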
49
Regression Model
Compare slope b of y_i = a + b·x_i + e (fitted on G) with slope b̄ of y_i = ā + b̄·x_i + e (fitted on Ḡ)
Compute significance of slope difference: φ_ssd
Subgroup conditions shown: drive = 1, basement = 0, #baths ≤ 1
Fitted lines shown: y = 41 568 + 3.31·x and y = 30 723 + 8.45·x
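The slope comparison can be illustrated with ordinary least squares on two sets of points. In the sketch below the points are generated exactly on the two lines quoted on the slide, using hypothetical x values; which line belongs to the subgroup and which to the rest of the data is not stated on the slide, so the labels `line_1` and `line_2` are neutral. The significance test behind φ_ssd needs the standard errors of both slopes and is left out.

```python
def fit_line(points):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Hypothetical x values; y values lie exactly on the lines from the slide.
xs = [1000, 2000, 4000, 8000, 12000]
line_1 = [(x, 41568 + 3.31 * x) for x in xs]
line_2 = [(x, 30723 + 8.45 * x) for x in xs]

a1, b1 = fit_line(line_1)
a2, b2 = fit_line(line_2)
slope_difference = abs(b1 - b2)
print(round(b1, 2), round(b2, 2), round(slope_difference, 2))
```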
50
Gene Expression Data
Subgroup conditions shown: 11_band = ‘no deletion’, survival time ≤ 1919, XP_498569.1 ≤ 57
Fitted lines shown: y = 3313 − 1.77·x and y = 360 + 0.40·x
51
Classification Model
Decision Table Majority classifier
BDeu measure (predictiveness)
Hellinger (unusual distribution)
Figure labels: whole database; RIF1 160.45; prognosis = ‘unknown’
52
General Framework X y modeling Subgroup Discovery object description target concept
53
General Framework X y Regression ● Classification ● Clustering Association Graphical modeling ● … Subgroup Discovery object description target concept
54
General Framework X y Regression Classification Clustering Association Graphical modeling … Subgroup Discovery ● Decision Trees SVM … object description target concept
55
General Framework X y Regression Classification Clustering Association Graphical modeling … Subgroup Discovery Decision Trees SVM … propositional ● multi-relational ● … target concept