Presentation is loading. Please wait.

Presentation is loading. Please wait.

Subgroup Discovery Finding Local Patterns in Data.

Similar presentations


Presentation on theme: "Subgroup Discovery Finding Local Patterns in Data."— Presentation transcript:

1 Subgroup Discovery Finding Local Patterns in Data

2 Exploratory Data Analysis  Scan the data without much prior focus  Find unusual parts of the data  Analyse attribute dependencies  interpret this as ‘ rule ’ : if X=x and Y=y then Z is unusual  Complex data: nominal, numeric, relational? the Subgroup

3 Exploratory Data Analysis  Classification: model the dependency of the target on the remaining attributes.  problem: sometimes classifier is a black-box, or uses only some of the available dependencies.  for example: in decision trees, some attributes may not appear because of overshadowing.  Exploratory Data Analysis: understanding the effects of all attributes on the target.

4 Interactions between Attributes  Single-attribute effects are not enough  XOR problem is extreme example: 2 attributes with no info gain form a good subgroup  Apart from A=a, B=b, C=c, …  consider also A=aB=b, A=aC=c, …, B=bC=c, … A=aB=bC=c, … …

5 Subgroup Discovery Task “ Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute s ”  Inductive constraints:  Minimum support  (Maximum support)  Minimum quality (Information gain, X 2, WRAcc)  Maximum complexity  …

6 Subgroup Discovery: the Binary Target Case

7 Confusion Matrix  A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target:  within subgroup, positive  within subgroup, negative  outside subgroup, positive  outside subgroup, negative TF T.42.13.55 F.12.33.541.0 subgroup target

8 Confusion Matrix  High numbers along the TT-FF diagonal means a positive correlation between subgroup and target  High numbers along the TF-FT diagonal means a negative correlation between subgroup and target  Target distribution on DB is fixed  Only two degrees of freedom TF T.42.13.55 F.12.33.45.54.461.0 subgroup target

9 Quality Measures A quality measure for subgroups summarizes the interestingness of its confusion matrix into a single number WRAcc, weighted relative accuracy  Balance between coverage and unexpectedness  WRAcc(S,T) = p(ST) – p(S)p(T)  between −.25 and.25, 0 means uninteresting TF T.42.13.55 F.12.33.541.0 WRAcc(S,T) = p(ST)−p(S)p(T) =.42 −.297 =.123 subgroup target

10 Quality Measures  WRAcc: Weighted Relative Accuracy  Information gain  X 2  Correlation Coefficient  Laplace  Jaccard  Specificity  …

11 Subgroup Discovery as Search true A=a 1 A=a 2 B=b 1 B=b 2 C=c 1 … TF T.42.13.55 F.12.33.541.0 A=a 2 B=b 1 A=a 1 B=b 1 A=a 1 B=b 2 … … … A=a 1 B=b 1 C=c 1 minimum support level reached

12 Refinements are (anti-)monotonic target concept subgroup S 1 S 2 refinement of S 1 S 3 refinement of S 2 Refinements are (anti-) monotonic in their support… …but not in interestingness. This may go up or down. entire database

13 SD vs. Separate & Conquer  Produces collection of subgroups  Local Patterns  Subgroups may overlap and conflict  Subgroup, unusual distribution of classes  Produces decision-list  Classifier  Exclusive coverage of instance space  Rules, clear conclusion Subgroup DiscoverySeparate & Conquer

14 Subgroup Discovery and ROC space

15 ROC Space Each subgroup forms a point in ROC space, in terms of its False Positive Rate, and True Positive Rate. TPR = TP/Pos = TP/TP+FN ( fraction of positive cases in the subgroup ) FPR = FP/Neg = FP/FP+TN ( fraction of negative cases in the subgroup ) ROC = Receiver Operating Characteristics

16 ROC Space Properties ‘ ROC heaven ’ perfect subgroup ‘ ROC hell ’ random subgroup entire database empty subgroup minimum support threshold perfect negative subgroup

17 Measures in ROC Space 0 WRAcc Information Gain positive negative source: Flach & Fürnkranz isometric

18 Other Measures Precision Gini index Correlation coefficient Foil gain

19 Refinements in ROC Space Refinements of S will reduce the FPR and TPR, so will appear to the left and below S. Blue polygon represents possible refinements of S. With a convex measure, f is bounded by measure of corners...... If corners are not above minimum quality or current best (top k?), prune search space below S.

20 Multi-class problems  Generalising to problems with more than 2 classes is fairly staightforward: X2X2 Information gain C1C1 C2C2 C3C3 T.27.06.22.55 F.03.19.23.45.3.25.451.0 combine values to quality measure subgroup target source: Nijssen & Kok

21 Subgroup Discovery for Numeric targets

22 Numeric Subgroup Discovery  Target is numeric: find subgroups with significantly higher or lower average value  Trade-off between size of subgroup and average target value h = 2200 h = 3100 h = 3600

23 Types of SD for Numeric Targets  Regression subgroup discovery  numeric target has order and scale  Ordinal subgroup discovery  numeric target has order  Ranked subgroup discovery  numeric target has order or scale

24 Vancouver 2010 Winter Olympics Partial ranking objects share a rank Partial ranking objects share a rank ordinal target regression target

25 Offical IOC ranking of countries (med > 0) Rank Country Medals Athletes Continent Popul. Language Family Repub.Polar 1 USA 37 214 N. America 309 Germanic y y 2 Germany 30 152 Europe 82 Germanic y n 3 Canada 26 205 N. America 34 Germanic n y 4 Norway 23 100 Europe 4.8 Germanic n y 5 Austria 16 79 Europe 8.3 Germanic y n 6 Russ. Fed.15 179 Asia 142 Slavic y y 7 Korea 14 46 Asia 73 Altaic y n 9 China 11 90 Asia 1338 Sino-Tibetan y n 9 Sweden 11 107 Europe 9.3 Germanic n y 9 France 11 107 Europe 65 Italic y n 11 Switzerland 9 144 Europe 7.8 Germanic y n 12 Netherlands8 34 Europe 16.5 Germanic n n 13.5 Czech Rep.6 92 Europe 10.5 Slavic y n 13.5 Poland 6 50 Europe 38 Slavic y n 16 Italy 5 110 Europe 60 Italic y n 16 Japan 5 94 Asia 127 Japonic n n 16 Finland 5 95 Europe 5.3 Finno-Ugric y y 20 Australia 3 40 Australia 22 Germanic y n 20 Belarus 3 49 Europe 9.6 Slavic y n 20 Slovakia 3 73 Europe 5.4 Slavic y n 20 Croatia 3 18 Europe 4.5 Slavic y n 20 Slovenia 3 49 Europe 2 Slavic y n 23 Latvia 2 58 Europe 2.2 Slavic y n 25 Great Britain 1 52 Europe 61 Germanic n n 25 Estonia 1 30 Europe 1.3 Finno-Ugric y n 25 Kazakhstan1 38 Asia 16 Turkic y n Fractional ranks shared ranks are averaged Fractional ranks shared ranks are averaged

26 Interesting Subgroups ‘ polar = yes ’ 1. United States 3. Canada 4. Norway 6. Russian Federation 9. Sweden 16 Finland ‘ language_family = Germanic & athletes  60 ’ 1. United States 2. Germany 3. Canada 4. Norway 5. Austria 9. Sweden 11. Switzerland

27 Intuitions  Size: larger subgroups are more reliable  Rank: majority of objects appear at the top  Position: ‘ middle ’ of subgroup should differ from middle of ranking  Deviation: objects should have similar rank * * * * * * * * language_family = Slavic

28 Intuitions  Size: larger subgroups are more reliable  Rank: majority of objects appear at the top  Position: ‘ middle ’ of subgroup should differ from middle of ranking  Deviation: objects should have similar rank polar = yes * * * * * *

29 Intuitions  Size: larger subgroups are more reliable  Rank: majority of objects appear at the top  Position: ‘ middle ’ of subgroup should differ from middle of ranking  Deviation: objects should have similar rank population  10M * * * * * * * * * *

30 Intuitions  Size: larger subgroups are more reliable  Rank: majority of objects appear at the top  Position: ‘ middle ’ of subgroup should differ from middle of ranking  Deviation: objects should have similar rank language_family = Slavic & population  10M * * * * *

31 Quality Measures  Average  Mean test  z-Score  t-Statistic  Median X 2 statistic  AUC of ROC  Wilcoxon-Mann-Whitney Ranks statistic  Median MAD Metric

32 Meet Cortana the open source Subgroup Discovery tool

33 Cortana Features  Generic Subgroup Discovery algorithm  quality measure  search strategy  inductive constraints  Flat file,.txt,.arff, (DB connection to come)  Support for complex targets  41 quality measures  ROC plots  Statistical validation

34 Target Concepts  ‘ Classical ’ Subgroup Discovery  nominal targets (classification)  numeric targets (regression)  Exceptional Model Mining (to be discussed in a few slides)  multiple targets  regression, correlation  multi-label classification

35 Mixed Data  Data types  binary  nominal  numeric  Numeric data is treated dynamically (no discretisation as preprocessing)  all: consider all available thresholds  bins: discretise the current candidate subgroup  best: find best threshold, and search from there

36 Statistical Validation  Determine distribution of random results  random subsets  random conditions  swap-randomization  Determine minimum quality  Significance of individual results  Validate quality measures  how exceptional?

37 Open Source  You can  Use Cortana binary datamining.liacs.nl/cortana.html  Use and modify Cortana sources (Java)

38 Exceptional Model Mining Subgroup Discovery with multiple target attributes

39 Mixture of Distributions

40

41  For each datapoint it is unclear whether it belongs to G or G  Intensional description of exceptional subgroup G?  Model class unknown  Model parameters unknown

42 Solution: extend Subgroup Discovery  Use other information than X and Y: object desciptions D  Use Subgroup Discovery to scan sub-populations in terms of D Subgroup Discovery: find subgroups of the database where the target attribute shows an unusual distribution.

43 Solution: extend Subgroup Discovery  Use other information than X and Y: object desciptions D  Use Subgroup Discovery to scan sub-populations in terms of D  Model over subgroup becomes target of SD Subgroup Discovery: find subgroups of the database where the target attributes show an unusual distribution, by means of modeling over the target attributes. Exceptional Model Mining

44  Define a target concept (X and y) X y object description target concept

45 Exceptional Model Mining  Define a target concept (X and y)  Choose a model class C  Define a quality measure φ over C X y modeling object description target concept

46 Exceptional Model Mining  Define a target concept (X and y)  Choose a model class C  Define a quality measure φ over C  Use Subgroup Discovery to find exceptional subgroups G and associated model M X y modeling Subgroup Discovery object description target concept

47 Quality Measure  Specify what defines an exceptional subgroup G based on properties of model M  Absolute measure (absolute quality of M)  Correlation coefficient  Predictive accuracy  Difference measure (difference between M and M)  Difference in slope  qualitative properties of classifier  Reliable results  Minimum support level  Statistical significance of G

48 Correlation Model  Correlation coefficient φ ρ = ρ(G)  Absolute difference in correlation φ abs = |ρ(G) - ρ(G)|  Entropy weighted absolute difference φ ent = H(p)·|ρ(G) - ρ(G)|  Statistical significance of correlation difference φ scd  compute z-score from ρ through Fisher transform  compute p-value from z-score

49 Regression Model  Compare slope b of y i = a + b·x i + e, and y i = a + b·x i + e  Compute significance of slope difference φ ssd y = 41 568 + 3.31·x y = 30 723 + 8.45·x drive = 1  basement = 0  #baths ≤ 1

50 Gene Expression Data 11_band = ‘ no deletion ’  survival time ≤ 1919  XP_498569.1 ≤ 57 y = 3313 - 1.77·xy = 360 + 0.40·x

51 Classification Model  Decision Table Majority classifier  BDeu measure (predictiveness)  Hellinger (unusual distribution) whole database RIF1  160.45 prognosis = ‘ unknown ’

52 General Framework X y modeling Subgroup Discovery object description target concept

53 General Framework X y Regression ● Classification ● Clustering Association Graphical modeling ● … Subgroup Discovery object description target concept

54 General Framework X y Regression Classification Clustering Association Graphical modeling … Subgroup Discovery ● Decision Trees SVM … object description target concept

55 General Framework X y Regression Classification Clustering Association Graphical modeling … Subgroup Discovery Decision Trees SVM … propositional ● multi-relational ● … target concept


Download ppt "Subgroup Discovery Finding Local Patterns in Data."

Similar presentations


Ads by Google