Stefan Rüping Fraunhofer IAIS Ranking Interesting Subgroups.


1 Stefan Rüping Fraunhofer IAIS stefan.rueping@iais.fraunhofer.de Ranking Interesting Subgroups

2-4 Fraunhofer Web Project, kick-off on 17 July 2008: Motivation
1. name_score >= 1 & geoscore >= 1 & housing >= 5 → p = 41.6%
2. Income_score >= 5 & name_score >= 5 & housing >= 5 → p = 36.0%
3. Active_housholds >= 3 & queries_per_household >= 1 & housing >= 5 → p = 43.8%
4. Families == 0 & name_score >= 1 & housing == 0 → p = 28.9%
5. Financial_status == 0 & name_score >= 3 & housing <= 5 → p = 66.1%
- Applying ranking to complex data: subgroup models
- Optimization of data mining models for non-expert users

5 Fraunhofer IAIS Overview
- Introduction to Subgroup Discovery
- Interesting Patterns
- Ranking Subgroups
  - Representation
  - Ranking SVMs
  - Iterative algorithm
- Experiments
- Conclusions

6 Subgroup Discovery
- Input: data X defined by nominal attributes $A_1, \dots, A_d$
- Subgroup language: propositional formula $A_{i_1} = v_{j_1} \wedge A_{i_2} = v_{j_2} \wedge \dots$
- For a subgroup S let
  - $g(S) = \#\{x_i \in S\} / n$ (subgroup size)
  - $p(S) = \#\{x_i \in S \mid y_i = 1\} / \#\{x_i \in S\}$ (class probability)
  - $p_0 = \#\{i \mid y_i = 1\} / n$
  - $q(S) = g(S)^a \, (p(S) - p_0)$ (subgroup quality = significance of pattern)
- Task: find the k subgroups with the highest significance (maximal quality q); a = 0.5 corresponds to a t-test

7-10 Subgroup Discovery: Example

Weather   Advertised   Ice Cream Sales
good      yes          high
good      no           high
good      no           high
good      no           high
bad       no           low
bad       yes          high
bad       no           low
bad       no           low

- S1: Weather = good → sales = high
  $g(S) = 4/8$, $p(S) = 4/4$, $q(S) = (4/8)^{0.5} \, (4/4 - 5/8) \approx 0.265$
- S2: Advertised = yes → sales = high
  $g(S) = 2/8$, $p(S) = 2/2$, $q(S) = (2/8)^{0.5} \, (2/2 - 5/8) \approx 0.188$
- Significance ≠ Interestingness
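The two quality values in the example can be recomputed directly from the table. The following is a minimal sketch, assuming the quality function q(S) = g(S)^a (p(S) - p_0) with a = 0.5 and "high" sales as the positive class; the function and variable names are illustrative and do not come from the slides.

```python
# Ice cream example table: (weather, advertised, sales)
ROWS = [
    ("good", "yes", "high"),
    ("good", "no", "high"),
    ("good", "no", "high"),
    ("good", "no", "high"),
    ("bad", "no", "low"),
    ("bad", "yes", "high"),
    ("bad", "no", "low"),
    ("bad", "no", "low"),
]


def subgroup_quality(rows, attribute_index, value, a=0.5, target="high"):
    """Return q(S) = g(S)^a * (p(S) - p0) for the subgroup attribute == value."""
    n = len(rows)
    covered = [row for row in rows if row[attribute_index] == value]
    if not covered:
        raise ValueError("subgroup covers no examples")
    g = len(covered) / n                                        # relative size
    p = sum(row[2] == target for row in covered) / len(covered)  # class probability in S
    p0 = sum(row[2] == target for row in rows) / n               # overall class probability
    return g ** a * (p - p0)


q_weather = subgroup_quality(ROWS, 0, "good")    # S1: Weather = good
q_advertised = subgroup_quality(ROWS, 1, "yes")  # S2: Advertised = yes
print(round(q_weather, 3), round(q_advertised, 4))
```

S1 scores higher than S2 purely because it covers more examples; both subgroups have p(S) = 1.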

11 Interesting Patterns
What makes a pattern interesting to the user? It depends on prior knowledge, but heuristics exist:
- Attributes: actionability, acquaintedness
- Sub-space: novelty
- Complexity: not too complex, not too simple

12 Overview: Ranking Interesting Subgroups
[Diagram; labels: Data, Subgroup Discovery, Subgroup Representation, Ranking SVM, Task Modification, "S1 > S2"]

13 Subgroup Representation (1/3)
- Subgroups become examples for the ranking learner!
- Notation: $A_i$ = original attribute, $r(S)$ = representation of subgroup S
- Remember the important properties of subgroups: attributes, examples, complexity
- Representing complexity: $r(S)$ includes $g(S)$ and $p(S) - p_0$

14 Subgroup Representation (2/3)
Representing attributes
- For each attribute $A_i$ of the original examples, include in the subgroup representation an attribute [formula not preserved in the transcript]
- Observation: a TF/IDF-like representation performs even better

15 Subgroup Representation (3/3)
Representing examples
- The user may be more interested in a subset of the examples
- Construct a list of known relevant and irrelevant subgroups from user feedback
- For each subgroup S and each known relevant/irrelevant subgroup T, define [formula not preserved in the transcript], the relatedness of S to the known subgroup T

16 Ranking Optimization Problem
- Rationale
  - Subgroup discovery gives quality $q(S) = g(S)^a \, (p(S) - p_0)$
  - The user defines a ranking by pairs "S1 > S2" (S1 is better than S2)
  - Find the true ranking $q^*$ such that $q^*(S_1) > q^*(S_2)$ whenever $S_1 > S_2$
- Assumption (justified by assuming hidden labels of interestingness of examples): [formula not preserved in the transcript]
- Define the linear ranking function $\log q^*(S) = (a, 1, w) \cdot r(S)$
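The linear form can be checked against the original quality function. The derivation below is a sketch under two assumptions that the slides do not state explicitly: the first two components of $r(S)$ are the logarithms $\log g(S)$ and $\log(p(S) - p_0)$ (slide 13 only says that $r(S)$ includes $g(S)$ and $p(S) - p_0$), and $p(S) > p_0$ so that the logarithm is defined.

```latex
\begin{align*}
\log q(S) &= \log\bigl(g(S)^a \, (p(S) - p_0)\bigr) \\
          &= a \log g(S) + 1 \cdot \log\bigl(p(S) - p_0\bigr) \\
          &= (a, 1, 0) \cdot r(S),
  \qquad r(S) = \bigl(\log g(S),\ \log(p(S) - p_0),\ \dots\bigr).
\end{align*}
```

Under these assumptions, setting $w = 0$ recovers the standard subgroup quality, while a nonzero $w$ lets the remaining components of $r(S)$ shift the ranking.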

17-19 Ranking Optimization Problem (2/2)
- Solution similar to ranking SVM
- Optimization problem: [formula not preserved in the transcript]
- Equivalent problem: [formula not preserved in the transcript], where $z = r(S_{i,1}) - r(S_{i,2})$
- Remember: $\log q^*(S) = (a, 1, w) \cdot r(S)$
- Annotations on the optimization problem:
  - Deviation from parameter $a_0$ in subgroup discovery
  - Constant weight for g(S) defines margin
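Since the optimization problem itself is not preserved here, the following is only a generic illustration of the pairwise idea behind ranking SVMs: each preference "S1 > S2" becomes a difference vector z = r(S1) - r(S2), and a linear weight vector is trained so that its product with z exceeds a margin. This sketch uses plain subgradient descent on an L2-regularized hinge loss; it is not the formulation from the slides, which additionally penalizes deviation from a_0 and keeps one weight constant. All names, hyperparameters, and the toy feature vectors are assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def fit_pairwise_ranker(pairs, epochs=200, learning_rate=0.1, reg=0.01):
    """Learn weights w so that dot(w, better) > dot(w, worse) for each pair.

    pairs: list of (better, worse) feature vectors of equal length.
    Minimizes reg/2 * |w|^2 + max(0, 1 - dot(w, z)) per pair by subgradient steps,
    where z = better - worse.
    """
    dim = len(pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            z = [b - c for b, c in zip(better, worse)]
            violated = dot(w, z) < 1.0  # hinge loss is active for this pair
            for k in range(dim):
                grad = reg * w[k] - (z[k] if violated else 0.0)
                w[k] -= learning_rate * grad
    return w


# Toy preferences over hypothetical 2-dimensional subgroup representations.
PAIRS = [
    ([1.0, 0.0], [0.0, 0.0]),
    ([2.0, 1.0], [1.0, 1.0]),
    ([3.0, 0.0], [0.0, 1.0]),
]
weights = fit_pairwise_ranker(PAIRS)
ranked_correctly = all(dot(weights, b) > dot(weights, c) for b, c in PAIRS)
```

The learned weights score every preferred subgroup above its counterpart on this toy input; with real feedback the pairs need not be separable, which is what the slack terms of an SVM formulation handle.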

20 Iterative Procedure
- Why?
  - Google: ~$10^{12}$ web pages
  - Same number of possible subgroups on a 12-dimensional data set with 9 distinct values per attribute
  - Hence all subgroups cannot be computed for single-step ranking
- Approach
  - The optimization problem gives a new estimate of a
  - Transform the weights of subgroup features into weights for the original examples
  - Idea: replace the binary y with a numeric value. An appropriate offset guarantees that the subgroup quality q approximates the optimized q*
- [Diagram; labels: subgroup search, ranking]
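The 10^12 figure can be reproduced with a short count. This sketch assumes that a subgroup description either leaves each attribute unconstrained or fixes it to one of its values, so each of the 12 attributes contributes 9 + 1 choices (the empty description is included). The function names are illustrative.

```python
from itertools import product


def count_conjunctive_subgroups(num_attributes, values_per_attribute):
    """Count conjunctions where each attribute is unconstrained or fixed to one value."""
    return (values_per_attribute + 1) ** num_attributes


def enumerate_conjunctive_subgroups(num_attributes, values_per_attribute):
    """Brute-force enumeration for tiny inputs; None marks an unconstrained attribute."""
    choices = [None] + list(range(values_per_attribute))
    return list(product(choices, repeat=num_attributes))


total = count_conjunctive_subgroups(12, 9)
print(total)  # (9 + 1) ** 12
```

Enumerating and scoring that many candidates in one pass is impractical, which is the motivation for folding the user's feedback back into the search instead.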

21 Experiments
- Simulation on UCI data
  - Replace the true label with the most correlated attribute
  - Use the true label to simulate the user
  - Measure the correspondence of the algorithm's ranking with the subgroups found on the true label
  - Tests the ability of the approach to flexibly adapt to correlated patterns
- Performance measures
  - Area under the curve (AUC): retrieval of the true top 100 subgroups
  - Kendall's τ: internal consistency of the returned ranking
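For readers unfamiliar with the second measure, here is a minimal sketch of Kendall's τ in its simplest form (tau-a, where tied pairs count as neither concordant nor discordant). The slides do not say which variant or implementation was used, so this is an illustration only.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same items."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both score lists must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both lists order items i and j the same way.
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)


tau_same = kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])
tau_reversed = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])
tau_one_swap = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

A value of 1 means the two rankings agree on every pair, -1 means they disagree on every pair, and a single adjacent swap among four items gives (5 - 1) / 6.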

22 Results
- Wilcoxon signed rank test confirms significance
- The 3 data sets with minimal ΔAUC are exactly the ones with minimal correlation between true and proxy label!

Data set        ΔAUC     τ
Diabetes        0.256    0.008
Breast-w        0.759    0.120
Vote            0.664    0.051
Segment         0.596    0.601
Vehicle         0.053    0.500
Heart-c         0.180    0.036
Primary-tumor   0.739    0.532
Hypothyroid     0.729    0.307
Ionosphere      0.227    0.708
Credit-a        0.050    0.241
Credit-g        0.019    0.285
Colic           1.9E-4   0.213
Anneal          0.030    0.329
Soybean         1.9E-4   0.040
Mushroom        0.542    0.320
mean            0.323    0.286

23 Conclusions
- Example of ranking on complex, knowledge-rich data
- The interestingness of subgroup patterns can be significantly increased with an interactive, ranking-based method
- Step toward automating machine learning for end users
- Future work: validation with true users; active learning approach

