© Fraunhofer Institut für intelligente Analyse- und Informationssysteme IAIS Henrik Grosskreutz and Stefan Rüping Fraunhofer IAIS On Subgroup Discovery.

Presentation transcript:

1 On Subgroup Discovery in Numerical Domains
Henrik Grosskreutz and Stefan Rüping, Fraunhofer IAIS

2 Overview
Introduction: Subgroup discovery in numerical domains
  Motivation and definition
  Problems
New algorithm: Based on a new pruning scheme
Empirical evaluation

3 Subgroup Discovery – A Local Pattern Discovery Task
Task: Find descriptions of subgroups of the overall population that are both
  large
  unusual with respect to the label
Subgroup description: Conjunction of attribute-value pairs
Example: Profession=Teacher & Sex=M → Stress=High
Example: Employee dataset (Stress is the class attribute)
  Stress  Profession  Sex
  High    Teacher     M
  High    Baker       F
  High    Scientist   F
  Low     Baker       M
  Low     Teacher     F
  High    Teacher     M

4 The Quality of a Subgroup
Quality functions: n^a (p - p_0)
  n: size of the subgroup extension
  p: target share in the subgroup
  p_0: overall target share
  0 ≤ a ≤ 1: constant
    a = 1: “Piatetsky-Shapiro” / “WRAcc”
    a = 0.5: binomial test
Example: Profession=Teacher & Sex=M
  P-S quality = 2 (1 - 4/6)
  Stress  Profession  Sex
  High    Teacher     M
  High    Baker       F
  High    Scientist   F
  Low     Baker       M
  Low     Teacher     F
  High    Teacher     M
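The worked example on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `quality`, the tuple layout of the rows, and the convention that an empty subgroup gets quality 0 are assumptions made for illustration.

```python
# Employee dataset from the slide: (stress, profession, sex)
ROWS = [
    ("High", "Teacher", "M"),
    ("High", "Baker", "F"),
    ("High", "Scientist", "F"),
    ("Low", "Baker", "M"),
    ("Low", "Teacher", "F"),
    ("High", "Teacher", "M"),
]
COLUMNS = {"Profession": 1, "Sex": 2}  # column index of each descriptive attribute


def quality(rows, description, target="High", a=1.0):
    """Return n^a * (p - p_0) for a conjunction of attribute-value pairs."""
    covered = [
        r for r in rows
        if all(r[COLUMNS[attr]] == value for attr, value in description.items())
    ]
    n = len(covered)
    if n == 0:
        return 0.0  # assumption: an empty subgroup is given quality 0
    p = sum(r[0] == target for r in covered) / n      # target share in subgroup
    p0 = sum(r[0] == target for r in rows) / len(rows)  # overall target share
    return n ** a * (p - p0)


description = {"Profession": "Teacher", "Sex": "M"}
ps = quality(ROWS, description, a=1.0)     # Piatetsky-Shapiro: 2 * (1 - 4/6)
binom = quality(ROWS, description, a=0.5)  # binomial test: sqrt(2) * (1 - 4/6)
print(ps, binom)
```

Lowering `a` reduces the weight of the subgroup size, so the same description scores lower under the binomial-test setting than under Piatetsky-Shapiro here.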

5 Subgroup Discovery in Numerical (or Ordinal) Domains
In many domains, the attributes are not nominal but numeric (or ordinal)
Subgroup descriptions: Conjunctions of attribute-interval pairs
Examples:
  BMI ∈ [26,30] and BP ∈ [160,180]
  BMI ∈ [26,30]
Example: Medical dataset
  Vascular disease  Blood Pressure (systolic)  Body Mass Index
  Yes               160                        30
  No                160                        22
  No                120                        22
  Yes               180                        20
  No                120                        30
  Yes               170                        26
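To make attribute-interval descriptions concrete, the sketch below evaluates the two example descriptions on the medical dataset using the Piatetsky-Shapiro quality n (p - p_0). The helper names `covers` and `ps_quality` and the row layout are illustrative assumptions, not part of the presentation.

```python
# Medical dataset from the slide: (vascular_disease, systolic_bp, bmi)
ROWS = [
    (True, 160, 30),
    (False, 160, 22),
    (False, 120, 22),
    (True, 180, 20),
    (False, 120, 30),
    (True, 170, 26),
]
COLUMNS = {"BP": 1, "BMI": 2}


def covers(row, description):
    """True if the row satisfies every closed-interval constraint."""
    return all(lo <= row[COLUMNS[attr]] <= hi
               for attr, (lo, hi) in description.items())


def ps_quality(rows, description):
    """Piatetsky-Shapiro quality n * (p - p_0) of an interval description."""
    covered = [r for r in rows if covers(r, description)]
    n = len(covered)
    if n == 0:
        return 0.0  # assumption: an empty subgroup is given quality 0
    p = sum(r[0] for r in covered) / n
    p0 = sum(r[0] for r in rows) / len(rows)
    return n * (p - p0)


q_both = ps_quality(ROWS, {"BMI": (26, 30), "BP": (160, 180)})
q_bmi = ps_quality(ROWS, {"BMI": (26, 30)})
print(q_both, q_bmi)
```

On this data the conjunction covers two patients, both with vascular disease, while the BMI constraint alone also picks up one patient without it, so the longer description scores higher.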

6 How to deal with numerical attributes?
Standard approach: (Entropy) Discretization (A)
Replace every numeric attribute by one nominal attribute ranging over non-overlapping intervals
Result: D(X’) = {[1-2], [3-4], [5-6]}
Problems:
  Expected result should include
    X ∈ [1-4] and Y ∈ [1-2]
  But we only obtain:
    X = [1-2] and Y = [1-2]
    X = [3-4]

7 How to deal with numerical attributes (ii)?
Standard approach: Entropy Discretization (B)
Replace every numeric attribute by a set of binominal attributes, i.e. use overlapping intervals
  D(Y’) = {[1-2], [3-4], [5-6], [1-4], [1-6], [3-6]}
  D(X’) = Ø
Problem:
  Expected solution should include:
    X ∈ [1-4] and Y ∈ [1-2]
  But the obtained result will not contain any constraint on X
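The difference between strategies (A) and (B) can be sketched as follows. The code takes the base bins as given (it does not perform entropy discretization) and builds the overlapping set by merging runs of adjacent bins; the function name `overlapping_intervals` and the tuple representation are assumptions for illustration.

```python
def overlapping_intervals(bins):
    """All intervals obtained by merging one or more adjacent base bins."""
    return [
        (bins[i][0], bins[j][1])
        for i in range(len(bins))
        for j in range(i, len(bins))
    ]


# Strategy (A): non-overlapping base bins, written [1-2], [3-4], [5-6] on the slide
BASE_BINS = [(1, 2), (3, 4), (5, 6)]

# Strategy (B): the same bins plus every merged run of neighbours
ALL_INTERVALS = overlapping_intervals(BASE_BINS)

print(ALL_INTERVALS)
print((1, 4) in BASE_BINS)      # [1-4] cannot be expressed with strategy (A)
print((1, 4) in ALL_INTERVALS)  # but it is available with strategy (B)
```

With k base bins this yields k (k + 1) / 2 candidate intervals per attribute, which is why the search space grows quickly once overlapping intervals are allowed.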

8 Discretization Strategies on Benchmark Datasets: Quality of the Best Subgroup
[Figure: two panels, ‘diabetes’ dataset and ‘yeast’ dataset]

9 To find optimal subgroup descriptions, we have to consider attribute-interval constraints over arbitrary intervals.
How can we improve the performance if overlapping intervals are considered?

10 Subgroup Discovery as Search in the Space of Subgroup Descriptions
[Diagram: search tree over subgroup descriptions]
Empty description: Ø
Subgroup descriptions of length 1:
  Constraints based on X: x ∈ [x_1, x_2], x ∈ [x_1, x_3], x ∈ [x_2, x_3], …, x ∈ [x_1, x_n]
  Constraints based on Y: y ∈ [y_1, y_2], …, y ∈ [y_1, y_m]
Subgroup descriptions of length 2:
  Constraints based on Y, in conjunction with constraints on X (below each x node: y ∈ [y_1, y_2], …, y ∈ [y_1, y_m])
Can we make use of properties of the search space to speed up the computation?

11 Standard Approach: DFS with Optimistic Estimate Pruning (Horizontal Pruning)
[Diagram: search tree with the empty description Ø at the root, length-1 descriptions x ∈ [x_1, x_2], x ∈ [x_1, x_3], x ∈ [x_2, x_3], …, x ∈ [x_1, x_n] and y ∈ [y_1, y_2], …, y ∈ [y_1, y_m], and length-2 descriptions below them]
Calculate OE(x ∈ [x_1, x_3])
OE < threshold → Prune branch
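A depth-first search with optimistic estimate pruning might look like the sketch below. It is an illustration under stated assumptions rather than the algorithm from the presentation: it uses the Piatetsky-Shapiro quality (a = 1), the estimate OE = pos · (1 - p_0), which bounds the quality of every refinement because no refinement can cover more target records than its parent, candidate intervals with endpoints at observed values, and the quality of the best subgroup found so far as the pruning threshold. The small medical dataset and all function names are placeholders.

```python
# Medical dataset from an earlier slide: (vascular_disease, systolic_bp, bmi)
ROWS = [
    (True, 160, 30), (False, 160, 22), (False, 120, 22),
    (True, 180, 20), (False, 120, 30), (True, 170, 26),
]


def intervals(values):
    """All closed intervals [lo, hi] with endpoints among the observed values."""
    vs = sorted(set(values))
    return [(lo, hi) for i, lo in enumerate(vs) for hi in vs[i:]]


def search(rows, n_attrs, max_len=2):
    """DFS over interval descriptions; returns the best subgroup and counters."""
    p0 = sum(r[0] for r in rows) / len(rows)
    cands = {a: intervals([r[a] for r in rows]) for a in range(1, n_attrs + 1)}
    best = {"q": 0.0, "desc": ()}  # the empty description has quality 0
    stats = {"visited": 0, "pruned": 0}

    def dfs(desc, covered, next_attr):
        # Only add constraints on later attributes, so that each description
        # is generated once and has at most one interval per attribute.
        for a in range(next_attr, n_attrs + 1):
            for lo, hi in cands[a]:
                sub = [r for r in covered if lo <= r[a] <= hi]
                if not sub:
                    continue
                stats["visited"] += 1
                pos = sum(r[0] for r in sub)
                q = pos - len(sub) * p0  # n * (p - p_0) rewritten
                new_desc = desc + ((a, lo, hi),)
                if q > best["q"]:
                    best["q"], best["desc"] = q, new_desc
                oe = pos * (1 - p0)  # optimistic estimate for all refinements
                if oe <= best["q"]:
                    stats["pruned"] += 1  # no refinement can beat the best
                    continue
                if len(new_desc) < max_len:
                    dfs(new_desc, sub, a + 1)

    dfs((), rows, 1)
    return best, stats


best, stats = search(ROWS, n_attrs=2)
print(best, stats)
```

The pruning here only cuts the branch below a node; it says nothing about sibling intervals such as x ∈ [x_1, x_3], which is the gap the following slides address.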

12 New Approach: “Horizontal Pruning”
[Diagram: search tree with the empty description Ø at the root, length-1 descriptions x ∈ [x_1, x_2], x ∈ [x_1, x_3], x ∈ [x_2, x_3], …, x ∈ [x_1, x_n] and y ∈ [y_1, y_2], …, y ∈ [y_1, y_m], and length-2 descriptions below them]
Calculate m_1 = max. quality of refinements of x ∈ [x_1, x_2]
Calculate m_2 = max. quality of refinements of x ∈ [x_2, x_3]
Use m_1 and m_2 to calculate a tighter estimate OE(x ∈ [x_1, x_3])
OE < threshold → Prune branch

13 Lemma
Let t_l and t_r be two split points. The quality n^a (p - p_0), a ≤ 1, of every refinement of sd ∧ X ∈ [t_l, t_r] is bounded by the sum of the maximum quality over all refinements of sd ∧ X ∈ [t_l, t] and the maximum quality over all refinements of sd ∧ X ∈ [t, t_r], for every t in [t_l, t_r]:
  maxQ(X ∈ [t_l, t_r]) ≤ maxQ(X ∈ [t_l, t]) + maxQ(X ∈ [t, t_r])
This property
  also holds if a depth limit is considered
  also holds if an arbitrary set of candidate split points is considered
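For the special case a = 1 the bound can be checked in a few lines. The derivation below is a sketch under assumptions that are not on the slide: S denotes the extension of some refinement of sd ∧ X ∈ [t_l, t_r], S_1 and S_2 are its parts falling into the left and right interval, the two parts are taken to be disjoint (for instance by treating one interval as half-open at t), and pos(S) is the number of target records in S.

```latex
\begin{align*}
q(S) &= n_S \,(p_S - p_0) = \mathrm{pos}(S) - p_0\, n_S \\
     &= \bigl(\mathrm{pos}(S_1) + \mathrm{pos}(S_2)\bigr) - p_0 \,(n_{S_1} + n_{S_2}) \\
     &= q(S_1) + q(S_2) \\
     &\le \max_{R_1} q(R_1) + \max_{R_2} q(R_2)
\end{align*}
```

Here R_1 ranges over the refinements of sd ∧ X ∈ [t_l, t] and R_2 over the refinements of the right-hand part. The last step holds because S_1 and S_2 are themselves extensions of such refinements, obtained by keeping the same additional constraints. The case a < 1 needs a further argument that this sketch does not cover.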

14 Combining Exact Bounds and Classic Optimistic Estimates
[Diagram: search tree with the empty description Ø at the root, length-1 descriptions x ∈ [x_1, x_2], x ∈ [x_1, x_3], x ∈ [x_2, x_3], …, x ∈ [x_1, x_n] and y ∈ [y_1, y_2], …, y ∈ [y_1, y_m], and length-2 descriptions below them]
Use maxQ below x ∈ [x_1, x_2] and OE(x ∈ [x_2, x_3]) to prune x ∈ [x_1, x_3]

15 A New Algorithm for SD in Numerical Domains
Main idea:
  DFS in the space of subgroup descriptions
  Uses optimistic estimates to initialize the Bound tables
  When a new bound maxQ for subgroups involving X ∈ [t_l, t_r] becomes available:
    Set Bound[l, r] to maxQ
    ∀ l’, r’ with l’ ≤ l, r ≤ r’: Bound[l’, r’] = Bound[l’, r] + maxQ + Bound[l, r’]
Heuristic: Never consider an interval if its subintervals have not yet been considered
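One way to picture the Bound table is the sketch below. It is a hypothetical illustration, not the authors' implementation, and it makes several assumptions that are not on the slide: split points are indexed 0..k, an interval [l, r] is the union of the disjoint pieces between consecutive split points, an empty piece (l = r) contributes 0, and an enclosing interval [l', r'] is decomposed as in the lemma into [l', l], [l, r] and [r, r'], which indexes the outer terms differently from the update rule as written on the slide. It also keeps the smaller of the old and new bound. The optimistic estimates used to fill the table are made-up numbers.

```python
def init_bounds(k, optimistic_estimate):
    """Bound[l][r] for split points 0..k, initialised from optimistic estimates.

    Entries with l == r describe an empty piece and stay at 0.
    """
    bound = [[0.0] * (k + 1) for _ in range(k + 1)]
    for l in range(k + 1):
        for r in range(l + 1, k + 1):
            bound[l][r] = optimistic_estimate(l, r)
    return bound


def record_max_quality(bound, l, r, max_q):
    """Store the exact maximum for [l, r] and tighten every enclosing interval."""
    k = len(bound) - 1
    bound[l][r] = max_q
    for lp in range(l + 1):            # l' <= l
        for rp in range(r, k + 1):     # r <= r'
            # Lemma-style decomposition: left rest + known middle + right rest
            candidate = bound[lp][l] + max_q + bound[r][rp]
            bound[lp][rp] = min(bound[lp][rp], candidate)
    return bound


# Hypothetical optimistic estimates: 2.0 per elementary piece
bound = init_bounds(3, lambda l, r: 2.0 * (r - l))
record_max_quality(bound, 1, 2, 0.5)
print(bound[0][3], bound[0][2], bound[1][3])
```

In this toy run the exact value for the middle piece lowers the bound of the full interval from 6.0 to 4.5, so the full interval can be pruned whenever the threshold already exceeds that value. Visiting narrow intervals before wide ones, as the heuristic on the slide suggests, makes such tightened bounds available early.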

16 Experimental Results
Pruning affects subgroup descriptions of length 1!
[Figure: number of nodes considered when using frequency discretization with 10 bins; panels: transfusion, diabetes]

17 Experimental Results (ii): Speedup
[Figure: speedup at maximum length 2; panels: mammography, transfusion]

18 Summary
Classical subgroup discovery approaches have problems in numeric domains.
New “horizontal” pruning scheme speeds up computation
Open issues:
  Good sets of subgroups (iterated weighted covering?)
  Mixed domains
  Further speedups
  …
Thank you for your attention!

