
On Subgroup Discovery in Numerical Domains
Henrik Grosskreutz and Stefan Rüping
Fraunhofer Institut für intelligente Analyse- und Informationssysteme IAIS

Overview
- Introduction: subgroup discovery in numerical domains
  - Motivation and definition
  - Problems
- New algorithm: based on a new pruning scheme
- Empirical evaluation

Subgroup Discovery: A Local Pattern Discovery Task
Task: find descriptions of subgroups of the overall population that are both
- large, and
- unusual with respect to the class label.
A subgroup description is a conjunction of attribute-value pairs.
Example: Profession = Teacher & Sex = M -> Stress = High

Example: employee dataset (class attribute: Stress)
Stress | Profession | Sex
High   | Teacher    | M
High   | Baker      | F
High   | Scientist  | F
Low    | Baker      | M
Low    | Teacher    | F
High   | Teacher    | M
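As a minimal sketch (the records-as-dictionaries encoding and the `matches` helper are illustrative, not the authors' implementation), the example description can be evaluated on the employee dataset like this:

```python
# Evaluate a conjunctive subgroup description against the employee
# dataset from the slide. Illustrative sketch only.

def matches(description, record):
    """True iff the record satisfies every attribute-value constraint."""
    return all(record[attr] == val for attr, val in description.items())

employees = [
    {"Stress": "High", "Profession": "Teacher",   "Sex": "M"},
    {"Stress": "High", "Profession": "Baker",     "Sex": "F"},
    {"Stress": "High", "Profession": "Scientist", "Sex": "F"},
    {"Stress": "Low",  "Profession": "Baker",     "Sex": "M"},
    {"Stress": "Low",  "Profession": "Teacher",   "Sex": "F"},
    {"Stress": "High", "Profession": "Teacher",   "Sex": "M"},
]

sd = {"Profession": "Teacher", "Sex": "M"}
extension = [r for r in employees if matches(sd, r)]
print(len(extension))                               # size of the subgroup
print(all(r["Stress"] == "High" for r in extension))
```

Both matching records are high-stress, which is what makes this subgroup "unusual" with respect to the class label.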

The Quality of a Subgroup
Quality functions: q(sd) = n^a * (p - p0), where
- n: size of the subgroup extension
- p: target share in the subgroup
- p0: overall target share
- a: a constant with 0 <= a <= 1
  - a = 1: "Piatetsky-Shapiro" / "WRAcc" quality
  - a = 0.5: binomial test quality
Example: for Profession = Teacher & Sex = M on the employee dataset, the P-S quality is 2 * (1 - 4/6) = 2/3.
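The quality function is simple to state in code; this sketch assumes n, p and p0 are already given (in practice they are computed from the subgroup extension):

```python
# Quality q(sd) = n**a * (p - p0) for a subgroup with extension size n,
# target share p, and overall target share p0. Illustrative sketch.
def quality(n, p, p0, a=1.0):
    return n**a * (p - p0)

# Subgroup Profession = Teacher & Sex = M from the slide:
# n = 2, p = 2/2 = 1, overall target share p0 = 4/6.
q_ps = quality(2, 1.0, 4/6, a=1.0)    # Piatetsky-Shapiro / WRAcc-style
q_bin = quality(2, 1.0, 4/6, a=0.5)   # binomial test quality
print(round(q_ps, 4))
```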

Subgroup Discovery in Numerical (or Ordinal) Domains
In many domains, the attributes are not nominal but numeric (or ordinal).
Subgroup descriptions are then conjunctions of attribute-interval pairs.
Examples:
- BMI ∈ [26,30] and BP ∈ [160,180]
- BMI ∈ [26,30]

Example: medical dataset
Vascular disease | Blood Pressure (systolic) | Body Mass Index
Yes | 160 | 30
No  | 160 | 22
No  |     |
Yes | 180 | 20
No  | 120 | 30
Yes | 170 | 26
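A minimal sketch of interval-based descriptions (the attribute names and the inclusive-interval convention are assumptions; the record whose values were lost is omitted):

```python
# A subgroup description over numeric attributes as a conjunction of
# attribute-interval pairs, evaluated on the medical dataset from the
# slide. Illustrative sketch only.
def matches(description, record):
    """True iff every constrained attribute lies inside its interval."""
    return all(lo <= record[attr] <= hi for attr, (lo, hi) in description.items())

patients = [
    {"disease": True,  "bp": 160, "bmi": 30},
    {"disease": False, "bp": 160, "bmi": 22},
    {"disease": True,  "bp": 180, "bmi": 20},
    {"disease": False, "bp": 120, "bmi": 30},
    {"disease": True,  "bp": 170, "bmi": 26},
]

sd = {"bmi": (26, 30), "bp": (160, 180)}
ext = [r for r in patients if matches(sd, r)]
print(len(ext))   # size of the extension
```

Here the description BMI ∈ [26,30] and BP ∈ [160,180] selects only diseased patients.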

How to deal with numerical attributes?
Standard approach (A): (entropy) discretization
Replace every numeric attribute by one nominal attribute ranging over non-overlapping intervals.
Result: D(X') = {[1-2], [3-4], [5-6]}
Problem: the expected result should include
  X ∈ [1-4] and Y ∈ [1-2],
but we only obtain
  X = [1-2] and Y = [1-2], or X = [3-4].
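Approach (A) can be sketched with Python's bisect module; the cut points below are illustrative stand-ins for entropy-derived boundaries:

```python
import bisect

# Approach (A) sketch: map every numeric value to the index of the
# single non-overlapping bin it falls into.
def discretize(values, cut_points):
    return [bisect.bisect_left(cut_points, v) for v in values]

# Values 1..6 with cut points 2.5 and 4.5 reproduce the three bins
# [1-2], [3-4], [5-6] from the slide as bin indices 0, 1, 2.
print(discretize([1, 2, 3, 4, 5, 6], [2.5, 4.5]))   # [0, 0, 1, 1, 2, 2]
```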

How to deal with numerical attributes? (ii)
Standard approach (B): entropy discretization with overlapping intervals
Replace every numeric attribute by a set of binary attributes, i.e. use overlapping intervals.
D(Y') = {[1-2], [3-4], [5-6], [1-4], [1-6], [3-6]}
D(X') = Ø
Problem: the expected solution should include
  X ∈ [1-4] and Y ∈ [1-2],
but the obtained result will not contain any constraint on X.
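Approach (B) then generates one binary attribute per interval, one for every pair of bin edges; a sketch, with half-open edges 1, 3, 5, 7 standing in for the base bins [1-2], [3-4], [5-6]:

```python
from itertools import combinations

# Approach (B) sketch: build one candidate interval for every pair of
# bin edges, i.e. all overlapping unions of consecutive base bins.
def overlapping_intervals(edges):
    return [(lo, hi) for lo, hi in combinations(edges, 2)]

ivs = overlapping_intervals([1, 3, 5, 7])
print(len(ivs))   # 6 intervals, matching the size of D(Y') on the slide
```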

Discretization Strategies on Benchmark Datasets: Quality of the Best Subgroup
(Charts: quality of the best subgroup found on the 'diabetes' and 'yeast' datasets.)

To find optimal subgroup descriptions, we have to consider attribute-interval constraints over arbitrary intervals.
How can we improve the performance if overlapping intervals are considered?

Subgroup Discovery as Search in the Space of Subgroup Descriptions
(Search-tree diagram: the empty description Ø at the root; below it, subgroup descriptions of length 1, e.g. x ∈ [x1,x2], x ∈ [x1,x3], x ∈ [x2,x3], ..., x ∈ [x1,xn] and y ∈ [y1,y2], ..., y ∈ [y1,ym]; below those, subgroup descriptions of length 2, i.e. constraints on Y in conjunction with constraints on X.)
Can we make use of properties of the search space to speed up the computation?

Standard Approach: DFS with Optimistic Estimate Pruning
(Search-tree diagram: for a node such as x ∈ [x1,x3], calculate the optimistic estimate OE(x ∈ [x1,x3]); if the estimate falls below the quality threshold, prune the branch.)
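The standard DFS with optimistic-estimate pruning can be sketched generically; the toy search space below (tuples over a small item list, with an additive quality) is an illustration of the pruning mechanism, not subgroup discovery itself:

```python
# Sketch of DFS with optimistic-estimate pruning: a branch is expanded
# only if its optimistic estimate exceeds the best quality found so far.
items = [4, -1, 2, -3]

def refine(node):
    start = items.index(node[-1]) + 1 if node else 0
    return [node + (x,) for x in items[start:]]

def quality(node):
    return sum(node)

def oe(node):
    # Optimistic estimate: current quality plus all remaining positives.
    start = items.index(node[-1]) + 1 if node else 0
    return sum(node) + sum(x for x in items[start:] if x > 0)

def dfs(node, best):
    if quality(node) > best[1]:
        best = (node, quality(node))
    for child in refine(node):
        if oe(child) > best[1]:        # otherwise: prune the branch
            best = dfs(child, best)
    return best

best = dfs((), ((), 0))
print(best)   # ((4, 2), 6)
```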

New Approach: "Horizontal Pruning"
(Search-tree diagram: calculate m1, the maximum quality of the refinements of x ∈ [x1,x2], and m2, the maximum quality of the refinements of x ∈ [x2,x3]; use m1 and m2 to calculate a tighter estimate OE(x ∈ [x1,x3]); if the estimate falls below the quality threshold, prune the branch.)

Lemma
Let t_l and t_r be two split points. For every t in [t_l, t_r], the quality n^a (p - p0), with a ≤ 1, of every refinement of sd ∧ X ∈ [t_l, t_r] is bounded by the sum of the maxima of the qualities of all refinements of sd ∧ X ∈ [t_l, t] and of sd ∧ X ∈ [t, t_r]:
  maxQ(sd ∧ X ∈ [t_l, t_r]) ≤ maxQ(sd ∧ X ∈ [t_l, t]) + maxQ(sd ∧ X ∈ [t, t_r])
This property
- also holds if a depth limit is considered,
- also holds if an arbitrary set of candidate split points is considered.
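For a = 1 the bound follows from additivity: q = n(p - p0) equals (#positives) - p0 * n, so splitting an interval at any t partitions its records and the two qualities add up exactly. A quick numeric check on synthetic data (an illustration of this additivity, not a proof of the lemma):

```python
# Additivity behind the lemma for a = 1: splitting [tl, tr] at t
# partitions the interval's records, and q(full) = q(left) + q(right).
import random

random.seed(0)
records = [(random.uniform(0, 10), random.random() < 0.4) for _ in range(200)]
p0 = sum(lab for _, lab in records) / len(records)

def q(recs):
    return sum(lab for _, lab in recs) - p0 * len(recs)   # n*(p - p0), a = 1

tl, t, tr = 2.0, 5.0, 8.0
full  = [r for r in records if tl <= r[0] <= tr]
left  = [r for r in full if r[0] <= t]
right = [r for r in full if r[0] > t]
assert abs(q(full) - (q(left) + q(right))) < 1e-9
print("additivity holds")
```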

Combining Exact Bounds and Classic Optimistic Estimates
(Search-tree diagram: use the exact maximum quality maxQ found below x ∈ [x1,x2] together with the optimistic estimate OE(x ∈ [x2,x3]) to prune x ∈ [x1,x3].)

A New Algorithm for SD in Numerical Domains
Main idea: DFS in the space of subgroup descriptions
- Uses optimistic estimates to initialize the bound tables
- When a new bound maxQ for subgroups involving X ∈ [t_l, t_r] becomes available:
  - set Bound[l, r] to maxQ
  - for all l', r' with l' ≤ l and r ≤ r', tighten Bound[l', r'] using Bound[l', l] + maxQ + Bound[r, r']
- Heuristic: never consider an interval if its subintervals have not yet been considered
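One possible reading of the bound-table update, sketched under the assumptions that Bound[(l, r)] upper-bounds the quality of any refinement with X ∈ [t_l, t_r], that degenerate intervals are bounded by 0, and that enclosing intervals are tightened by applying the lemma at both endpoints; the data layout and function names are illustrative, not the paper's algorithm:

```python
# Sketch of a bound table indexed by pairs of split-point indices.
INF = float("inf")

def make_bounds(n_splits):
    """One entry per pair l <= r; degenerate intervals are bounded by 0."""
    return {(l, r): (0.0 if l == r else INF)
            for l in range(n_splits) for r in range(l, n_splits)}

def update(bounds, l, r, maxQ, n_splits):
    bounds[(l, r)] = min(bounds[(l, r)], maxQ)
    # Tighten every enclosing interval [l', r']: by the lemma (applied
    # at t = l and t = r), refinements of X in [l', r'] are bounded by
    # the sum of the bounds for [l', l], [l, r] and [r, r'].
    for lp in range(l + 1):
        for rp in range(r, n_splits):
            combined = bounds[(lp, l)] + bounds[(l, r)] + bounds[(r, rp)]
            bounds[(lp, rp)] = min(bounds[(lp, rp)], combined)

b = make_bounds(4)
update(b, 1, 2, 5.0, 4)
print(b[(1, 2)])          # 5.0
update(b, 0, 1, 2.0, 4)
update(b, 2, 3, 1.0, 4)
print(b[(0, 3)])          # 8.0
```

Once all subintervals of [t_l', t_r'] carry finite bounds, the enclosing interval gets a finite bound as well, which matches the heuristic of visiting subintervals first.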

Experimental Results
(Charts: number of nodes considered on the 'transfusion' and 'diabetes' datasets, using frequency discretization with 10 bins.)
Pruning affects subgroup descriptions of length 1!

Experimental Results (ii): Speedup
(Charts: speedup on the 'mammography' and 'transfusion' datasets, with maximum description length 2.)

Summary
- Classical subgroup discovery approaches have problems in numeric domains.
- The new "horizontal" pruning scheme speeds up the computation.
Open Issues
- Good sets of subgroups (iterated weighted covering?)
- Mixed domains
- Further speedups
- ...
Thank you for your attention!