1
© Fraunhofer Institut für intelligente Analyse- und Informationssysteme IAIS
On Subgroup Discovery in Numerical Domains
Henrik Grosskreutz and Stefan Rüping, Fraunhofer IAIS
2
Overview
Introduction: Subgroup discovery in numerical domains
  Motivation and definition
  Problems
New algorithm: Based on a new pruning scheme
Empirical evaluation
3
Subgroup Discovery – A Local Pattern Discovery Task
Task: Find descriptions of subgroups of the overall population that are both
  Large
  Unusual with respect to the label
Subgroup description: Conjunction of attribute-value pairs
Example: Profession=Teacher & Sex=M → Cost=High
Example: Employee dataset (Stress is the class attribute)
  Stress  Profession  Sex
  High    Teacher     M
  High    Baker       F
  High    Scientist   F
  Low     Baker       M
  Low     Teacher     F
  High    Teacher     M
4
The Quality of a Subgroup
Quality functions: $n^a (p - p_0)$
  $n$: size of the subgroup extension
  $p$: target share in the subgroup
  $p_0$: overall target share
  $0 \le a \le 1$: constant
    $a = 1$: "Piatetsky-Shapiro" / "WRAcc"
    $a = 0.5$: binomial test
Ex.: Profession=Teacher & Sex=M
  P-S quality = 2 (1 - 4/6)
  Stress  Profession  Sex
  High    Teacher     M
  High    Baker       F
  High    Scientist   F
  Low     Baker       M
  Low     Teacher     F
  High    Teacher     M
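The quality function above can be checked against the slide's worked example. The following is a minimal sketch, not the authors' code: the names `EMPLOYEES`, `quality`, and `male_teacher` are made up for illustration, and treating an empty subgroup as quality 0 is an assumption.

```python
# Employee dataset from the slide: (Stress, Profession, Sex).
# Stress is the class attribute, "High" is the target value.
EMPLOYEES = [
    ("High", "Teacher", "M"),
    ("High", "Baker", "F"),
    ("High", "Scientist", "F"),
    ("Low", "Baker", "M"),
    ("Low", "Teacher", "F"),
    ("High", "Teacher", "M"),
]


def quality(rows, covers, target="High", a=1.0):
    """Return n^a * (p - p0) for the rows selected by the predicate `covers`."""
    subgroup = [row for row in rows if covers(row)]
    n = len(subgroup)
    if n == 0:
        return 0.0  # assumption: an empty subgroup gets quality 0
    p = sum(row[0] == target for row in subgroup) / n     # target share in subgroup
    p0 = sum(row[0] == target for row in rows) / len(rows)  # overall target share
    return n ** a * (p - p0)


def male_teacher(row):
    """Subgroup description Profession=Teacher & Sex=M."""
    return row[1] == "Teacher" and row[2] == "M"


# a = 1 gives the Piatetsky-Shapiro quality 2 * (1 - 4/6).
ps_quality = quality(EMPLOYEES, male_teacher, a=1.0)
# a = 0.5 gives the binomial-test variant sqrt(2) * (1 - 4/6).
binomial_quality = quality(EMPLOYEES, male_teacher, a=0.5)
```

The subgroup covers two rows, both with Stress=High, so p = 1, while four of the six rows overall are High, which gives p_0 = 4/6.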
5
Subgroup Discovery in Numerical (or Ordinal) Domains
In many domains, the attributes are not nominal but numeric (or ordinal)
Subgroup descriptions: Conjunctions of attribute-interval pairs
Examples
  BMI ∈ [26,30] and BP ∈ [160,180]
  BMI ∈ [26,30]
Example: Medical dataset
  Vascular disease  Blood Pressure (systolic)  Body Mass Index
  Yes               160                        30
  No                160                        22
  No                120                        22
  Yes               180                        20
  No                120                        30
  Yes               170                        26
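As an illustration of attribute-interval descriptions, the sketch below evaluates the two example descriptions on the medical dataset. It assumes closed intervals and the Piatetsky-Shapiro quality (a = 1); the names `PATIENTS`, `extension`, and `ps_quality` are hypothetical.

```python
# Medical dataset from the slide; "disease" is the target attribute.
PATIENTS = [
    {"disease": True, "BP": 160, "BMI": 30},
    {"disease": False, "BP": 160, "BMI": 22},
    {"disease": False, "BP": 120, "BMI": 22},
    {"disease": True, "BP": 180, "BMI": 20},
    {"disease": False, "BP": 120, "BMI": 30},
    {"disease": True, "BP": 170, "BMI": 26},
]


def extension(rows, description):
    """Rows covered by a conjunction of attribute-interval pairs (closed intervals)."""
    return [
        row for row in rows
        if all(lo <= row[attr] <= hi for attr, (lo, hi) in description.items())
    ]


def ps_quality(rows, description):
    """Piatetsky-Shapiro quality n * (p - p0) of a description."""
    subgroup = extension(rows, description)
    if not subgroup:
        return 0.0  # assumption: an empty subgroup gets quality 0
    p = sum(row["disease"] for row in subgroup) / len(subgroup)
    p0 = sum(row["disease"] for row in rows) / len(rows)
    return len(subgroup) * (p - p0)


bmi_and_bp = {"BMI": (26, 30), "BP": (160, 180)}
bmi_only = {"BMI": (26, 30)}
```

Here `bmi_and_bp` covers two patients, both with vascular disease, while `bmi_only` covers three patients, two of them with the disease.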
6
How to deal with numerical attributes?
Standard approach: (Entropy) Discretization (A)
  Replace every numeric attribute by one nominal attribute ranging over non-overlapping intervals
  Result: D(X') = {[1-2], [3-4], [5-6]}
Problems
  Expected result should include X ∈ [1-4] and Y ∈ [1-2]
  But we only obtain: X = [1-2] and Y = [1-2], X = [3-4]
7
How to deal with numerical attributes (ii)?
Standard approach: Entropy Discretization (B)
  Replace every numeric attribute by a set of binominal attributes, i.e. use overlapping intervals
  D(Y') = {[1-2], [3-4], [5-6], [1-4], [1-6], [3-6]}
  D(X') = Ø
Problem
  Expected solution should include: X ∈ [1-4] and Y ∈ [1-2]
  But the obtained result will not contain any constraint on X
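The overlapping intervals in D(Y') are exactly the unions of contiguous base bins. A minimal sketch of that enumeration, with the hypothetical helper name `overlapping_intervals`:

```python
def overlapping_intervals(base_bins):
    """Return every union of contiguous base bins as a (low, high) pair.

    `base_bins` must be sorted and non-overlapping; k bins yield
    k * (k + 1) / 2 intervals.
    """
    return [
        (base_bins[i][0], base_bins[j][1])
        for i in range(len(base_bins))
        for j in range(i, len(base_bins))
    ]


# The three base bins of the slide's example.
candidates = overlapping_intervals([(1, 2), (3, 4), (5, 6)])
```

The number of candidate intervals grows quadratically in the number of base bins, which is why the cost of considering overlapping intervals matters for the search.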
8
Discretization Strategies on Benchmark Datasets: Quality of the Best Subgroup
[Figure: two panels, 'diabetes' dataset and 'yeast' dataset]
9
To find optimal subgroup descriptions, we have to consider attribute-interval constraints over arbitrary intervals
How can we improve the performance if overlapping intervals are considered?
10
Subgroup Discovery as Search in the Space of Subgroup Descriptions
[Diagram: search tree over subgroup descriptions]
  Empty description: Ø
  Subgroup descriptions of length 1
    Constraints based on X: x ∈ [x_1,x_2], x ∈ [x_1,x_3], x ∈ [x_2,x_3], …, x ∈ [x_1,x_n]
    Constraints based on Y: y ∈ [y_1,y_2], …, y ∈ [y_1,y_m]
  Subgroup descriptions of length 2
    Constraints based on Y (y ∈ [y_1,y_2], …, y ∈ [y_1,y_m]), in conjunction with constraints on X
Can we make use of properties of the search space to speed up the computation?
11
Standard Approach: DFS with Optimistic Estimate Pruning (Horizontal Pruning)
[Diagram: search tree with the empty description Ø, subgroup descriptions of length 1 (x ∈ [x_1,x_2], x ∈ [x_1,x_3], x ∈ [x_2,x_3], …, x ∈ [x_1,x_n]) and, below each of them, subgroup descriptions of length 2 (y ∈ [y_1,y_2], …, y ∈ [y_1,y_m])]
  Calculate OE(x ∈ [x_1,x_3])
  If OE is below the pruning threshold: prune the branch
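A depth-first search with optimistic estimate pruning can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses the medical dataset from the earlier slide, closed intervals over the observed attribute values, the Piatetsky-Shapiro quality (a = 1), the number of positives times (1 - p_0) as the optimistic estimate, and the best quality found so far as the pruning threshold. All names are hypothetical.

```python
from fractions import Fraction

# Medical dataset from the slides: (vascular disease, systolic BP, BMI).
ROWS = [(1, 160, 30), (0, 160, 22), (0, 120, 22),
        (1, 180, 20), (0, 120, 30), (1, 170, 26)]
P0 = Fraction(sum(row[0] for row in ROWS), len(ROWS))  # overall target share


def quality(rows):
    """Piatetsky-Shapiro quality n * (p - p0), written as positives - n * p0."""
    return sum(row[0] for row in rows) - len(rows) * P0


def optimistic_estimate(rows):
    """Upper bound for all refinements: keep only the positive rows."""
    return sum(row[0] for row in rows) * (1 - P0)


def intervals(attr):
    """All closed intervals whose end points are observed values of `attr`."""
    values = sorted({row[attr] for row in ROWS})
    return [(lo, hi) for i, lo in enumerate(values) for hi in values[i:]]


def search(max_len=2, prune=True):
    """Return (best quality, best description, number of nodes evaluated)."""
    best_q, best_desc, visited = Fraction(0), (), 0

    def dfs(desc, rows, first_attr):
        nonlocal best_q, best_desc, visited
        for attr in range(first_attr, 3):  # column 1 is BP, column 2 is BMI
            for lo, hi in intervals(attr):
                sub = [row for row in rows if lo <= row[attr] <= hi]
                new_desc = desc + ((attr, lo, hi),)
                visited += 1
                if quality(sub) > best_q:
                    best_q, best_desc = quality(sub), new_desc
                if len(new_desc) == max_len:
                    continue
                if prune and optimistic_estimate(sub) < best_q:
                    continue  # no refinement can beat the current best
                dfs(new_desc, sub, attr + 1)

    dfs((), ROWS, 1)
    return best_q, best_desc, visited
```

Calling `search(prune=True)` and `search(prune=False)` returns the same best quality; the third component of the result shows how many nodes each variant evaluated.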
12
New Approach: "Horizontal Pruning"
[Diagram: search tree with the empty description Ø, subgroup descriptions of length 1 (x ∈ [x_1,x_2], x ∈ [x_1,x_3], x ∈ [x_2,x_3], …, x ∈ [x_1,x_n]) and, below each of them, subgroup descriptions of length 2 (y ∈ [y_1,y_2], …, y ∈ [y_1,y_m])]
  Calculate m_1 = max. quality of refinements of x ∈ [x_1,x_2]
  Calculate m_2 = max. quality of refinements of x ∈ [x_2,x_3]
  Use m_1 and m_2 to calculate a tighter estimate OE(x ∈ [x_1,x_3])
  If OE is below the pruning threshold: prune the branch
13
Lemma
Let $t_l$ and $t_r$ be two split points. The quality $n^a (p - p_0)$, $(a \le 1)$, of every refinement of $sd \wedge X \in [t_l, t_r]$ is bounded by the sum of the maximum of the qualities of all refinements of $sd \wedge X \in [t_l, t]$ and of $sd \wedge X \in [t, t_r]$, for every $t$ in $[t_l, t_r]$
  [Diagram: X ∈ [t_l,t_r] ≤ x ∈ [t_l,t] + x ∈ [t,t_r]]
This property
  Also holds if a depth limit is considered
  Also holds if an arbitrary set of candidate split points is considered
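The lemma can be checked by brute force on a small example. The sketch below makes several assumptions that are not on the slide: it uses the medical dataset from the earlier slide, a = 1, half-open intervals (lo, hi] so that the two halves of a split do not share records, hypothetical split points, and a refinement space consisting of the BMI constraint alone or combined with one BP interval.

```python
from fractions import Fraction

# Medical dataset from the slides: (vascular disease, systolic BP, BMI).
ROWS = [(1, 160, 30), (0, 160, 22), (0, 120, 22),
        (1, 180, 20), (0, 120, 30), (1, 170, 26)]
P0 = Fraction(sum(row[0] for row in ROWS), len(ROWS))

# Hypothetical split points; an interval (lo, hi] excludes lo and includes hi.
BMI_SPLITS = [19, 20, 22, 26, 30]
BP_SPLITS = [119, 120, 160, 170, 180]


def quality(rows):
    """Piatetsky-Shapiro quality n * (p - p0); an empty subgroup yields 0."""
    return sum(row[0] for row in rows) - len(rows) * P0


def max_refinement_quality(lo, hi):
    """Maximum quality of BMI in (lo, hi], alone or refined by one BP interval."""
    base = [row for row in ROWS if lo < row[2] <= hi]
    best = quality(base)
    for i, bp_lo in enumerate(BP_SPLITS):
        for bp_hi in BP_SPLITS[i + 1:]:
            refined = [row for row in base if bp_lo < row[1] <= bp_hi]
            best = max(best, quality(refined))
    return best


def lemma_holds():
    """Check maxQ(l, r) <= maxQ(l, t) + maxQ(t, r) for all split points l < t < r."""
    s = BMI_SPLITS
    for l in range(len(s)):
        for r in range(l + 2, len(s)):
            whole = max_refinement_quality(s[l], s[r])
            for t in range(l + 1, r):
                parts = (max_refinement_quality(s[l], s[t])
                         + max_refinement_quality(s[t], s[r]))
                if whole > parts:
                    return False
    return True
```

For a = 1 the quality is additive over disjoint parts of a subgroup, so every refinement of the whole interval splits into one refinement of each half, which is why the check succeeds.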
14
Combining Exact Bounds and Classic Optimistic Estimates
[Diagram: search tree with the empty description Ø, subgroup descriptions of length 1 (x ∈ [x_1,x_2], x ∈ [x_1,x_3], x ∈ [x_2,x_3], …, x ∈ [x_1,x_n]) and, below each of them, subgroup descriptions of length 2 (y ∈ [y_1,y_2], …, y ∈ [y_1,y_m])]
  Use maxQ below x ∈ [x_1,x_2] and OE(x ∈ [x_2,x_3]) to prune x ∈ [x_1,x_3]
15
A New Algorithm for SD in Numerical Domains
Main idea: DFS in the space of subgroup descriptions
  Uses optimistic estimates to initialize the Bound tables
  When a new bound maxQ for subgroups involving X ∈ [t_l,t_r] becomes available
    Set Bound[l,r] to maxQ
    ∀ l', r' with l' ≤ l and r ≤ r': Bound[l', r'] = Bound[l', r] + maxQ + Bound[l, r']
Heuristic: Never consider an interval if its subintervals have not yet been considered
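The bound-table update can be sketched as follows. This is one possible reading, not the authors' exact rule: it assumes intervals indexed by split-point positions with an empty interval bounded by 0, applies the lemma to the partition into the parts from l' to l, from l to r, and from r to r', and keeps the smaller of the old and the new bound. Its index pattern therefore differs from the formula as it appears on the slide. The function name `update_bounds` and the example numbers are made up.

```python
def update_bounds(bound, l, r, max_q):
    """Tighten bound[l2][r2] for every interval that contains (l, r).

    `bound[i][j]` (i <= j) is an upper bound on the quality of all
    refinements of the interval between split points i and j, with
    bound[i][i] == 0 for the empty interval.
    """
    last = len(bound) - 1
    bound[l][r] = max_q  # the exact maximum is now known for (l, r)
    for l2 in range(l + 1):
        for r2 in range(r, last + 1):
            # Lemma applied to the three parts (l2, l), (l, r), (r, r2).
            candidate = bound[l2][l] + max_q + bound[r][r2]
            bound[l2][r2] = min(bound[l2][r2], candidate)
    return bound


# Hypothetical table for split points 0..3, initialized from optimistic estimates.
table = [
    [0, 2, 7, 10],
    [0, 0, 5, 8],
    [0, 0, 0, 3],
    [0, 0, 0, 0],
]
update_bounds(table, 1, 2, 1)  # exact maximum 1 found for the interval (1, 2)
```

After the update, the bounds of the three enclosing intervals drop from 7, 8, and 10 to 3, 4, and 6, so more branches fall below the pruning threshold.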
16
Experimental Results
[Figure: two panels, 'transfusion' and 'diabetes']
Number of nodes considered when using frequency discretization with 10 bins
Pruning affects subgroup descriptions of length 1!
17
Experimental Results (ii): Speedup
[Figure: two panels, 'mammography' and 'transfusion']
Speedup @ maximum length 2
18
Summary
Classical subgroup discovery approaches have problems in numeric domains.
New "horizontal" pruning scheme speeds up computation
Open Issues
  Good Sets of Subgroups (Iterated Weighted Covering?)
  Mixed Domains
  Further Speedups
  …
Thank you for your attention!