Subgroup Discovery: Finding Local Patterns in Data

Exploratory Data Analysis
- Classification: model the dependence of the target on the remaining attributes.
  - Problem: sometimes the classifier is a black box, or uses only some of the available dependencies.
  - For example: in decision trees, some attributes may not appear because of overshadowing.
- Exploratory Data Analysis: understanding the effects of all attributes on the target.

Q: How can we use ideas from C4.5 to approach this task?
A: Why not list the info gain of all attributes, and rank according to this?

Interactions between Attributes
- Single-attribute effects are not enough.
- The XOR problem is an extreme example: two attributes with no individual info gain form a good subgroup.
- Apart from A=a, B=b, C=c, …, also consider conjunctions A=a∧B=b, A=a∧C=c, …, B=b∧C=c, …, A=a∧B=b∧C=c, …
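The XOR point can be checked directly. The sketch below builds a small hypothetical XOR dataset (the data, names and counts are illustrative, not from the slides): neither attribute has any information gain on its own, yet a two-condition subgroup is completely pure.

```python
from math import log2

def entropy(pos, neg):
    """Shannon entropy of a binary class distribution."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Hypothetical XOR data: target is positive iff A != B (25 copies per cell).
data = [(a, b, int(a != b)) for a in (0, 1) for b in (0, 1) for _ in range(25)]

def info_gain(attr_index):
    """Information gain of splitting on a single attribute."""
    pos = sum(t for *_, t in data)
    neg = len(data) - pos
    gain = entropy(pos, neg)
    for v in (0, 1):
        subset = [row for row in data if row[attr_index] == v]
        sp = sum(t for *_, t in subset)
        gain -= len(subset) / len(data) * entropy(sp, len(subset) - sp)
    return gain

print(info_gain(0))  # 0.0 -- A alone tells us nothing about the target
print(info_gain(1))  # 0.0 -- neither does B
# The subgroup A=0 AND B=1, however, is completely positive:
subgroup = [t for a, b, t in data if a == 0 and b == 1]
print(sum(subgroup) / len(subgroup))  # 1.0
```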

Subgroup Discovery Task
"Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attribute"
- Inductive constraints:
  - Minimum support
  - (Maximum support)
  - Minimum quality (information gain, χ², WRAcc)
  - Maximum complexity
  - …

Confusion Matrix
A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target:
- within subgroup, positive
- within subgroup, negative
- outside subgroup, positive
- outside subgroup, negative

              target
              T     F
subgroup  T
          F

Confusion Matrix
- High numbers along the TT-FF diagonal mean a positive correlation between subgroup and target.
- High numbers along the TF-FT diagonal mean a negative correlation between subgroup and target.
- The target distribution on the DB is fixed.
- Hence there are only two degrees of freedom.

Quality Measures
A quality measure for subgroups summarizes the interestingness of its confusion matrix into a single number.

WRAcc, weighted relative accuracy:
- Also known as novelty
- Balance between coverage and unexpectedness
- nov(S,T) = p(ST) − p(S)p(T)
- Between −0.25 and 0.25; 0 means uninteresting

Example: nov(S,T) = p(ST) − p(S)p(T) = 0.42 − 0.297 = 0.123
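A minimal sketch of WRAcc computed from confusion-matrix counts. The counts below are hypothetical (the slide only gives the probabilities); they are chosen so that p(ST) = 0.42, p(S) = 0.55 and p(T) = 0.54, reproducing the 0.123 example.

```python
def wracc(tp, fp, fn, tn):
    """Weighted relative accuracy (novelty): p(ST) - p(S)p(T)."""
    n = tp + fp + fn + tn
    p_st = tp / n        # covered by the subgroup and positive
    p_s = (tp + fp) / n  # subgroup coverage
    p_t = (tp + fn) / n  # overall positive rate
    return p_st - p_s * p_t

# Hypothetical counts: p(ST) = .42, p(S) = .55, p(T) = .54
print(round(wracc(42, 13, 12, 33), 3))  # 0.123
```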

Quality Measures
- WRAcc: weighted relative accuracy
- Information gain
- χ²
- Correlation coefficient
- Laplace
- Jaccard
- Specificity

Subgroup Discovery as Search
[Figure: search tree over single conditions A=a1, A=a2, B=b1, B=b2, C=c1, …, refined into conjunctions such as A=a1∧B=b1 and A=a1∧B=b1∧C=c1; each node carries its confusion matrix, and a branch is cut off once the minimum support level is reached.]
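The search can be sketched as a levelwise enumeration of conjunctions, scored by WRAcc. This is an illustrative toy implementation, not the one behind the slides; for brevity it skips unsupported candidates rather than truly pruning their refinements.

```python
from itertools import combinations

def discover(rows, labels, min_support=5, max_depth=2):
    """Levelwise search over conjunctions of attribute=value conditions.
    Support is anti-monotonic, so candidates below min_support are dropped;
    interestingness (here WRAcc) is not, so it cannot prune by itself."""
    n = len(rows)
    p_t = sum(labels) / n
    # candidate single conditions: (attribute index, value)
    conditions = sorted({(i, r[i]) for r in rows for i in range(len(r))})
    found = []
    for depth in range(1, max_depth + 1):
        for conds in combinations(conditions, depth):
            idx = [k for k, r in enumerate(rows)
                   if all(r[i] == v for i, v in conds)]
            if len(idx) < min_support:
                continue  # below minimum support
            p_s = len(idx) / n
            p_st = sum(labels[k] for k in idx) / n
            found.append((p_st - p_s * p_t, conds))  # WRAcc score
    return sorted(found, reverse=True)

# Tiny hypothetical XOR-like dataset on attributes A (index 0) and B (index 1)
rows = [(a, b) for a in (0, 1) for b in (0, 1)] * 10
labels = [int(a != b) for a, b in rows]
best = discover(rows, labels)[0]
print(best)  # a depth-2 conjunction wins, with WRAcc 0.125
```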

Refinements are (anti-)monotonic
[Figure: a target concept and nested subgroups S1, S2 (a refinement of S1) and S3 (a refinement of S2).]
Refinements are (anti-)monotonic in their support…
…but not in interestingness. This may go up or down.

SD vs. Separate & Conquer

Subgroup Discovery:
- Produces a collection of subgroups: local patterns
- Subgroups may overlap and conflict
- Subgroup: an unusual distribution of classes

Separate & Conquer:
- Produces a decision list: a classifier
- Exclusive coverage of the instance space
- Rules with a clear conclusion

Subgroup Discovery and ROC space

ROC Space
ROC = Receiver Operating Characteristics.
Each subgroup forms a point in ROC space, in terms of its False Positive Rate and True Positive Rate:
- TPR = TP/Pos = TP/(TP+FN) (the fraction of all positive cases that fall in the subgroup)
- FPR = FP/Neg = FP/(FP+TN) (the fraction of all negative cases that fall in the subgroup)
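A minimal sketch mapping a subgroup's confusion matrix to its ROC point. The counts are hypothetical, chosen only to illustrate the landmark points.

```python
def roc_point(tp, fp, fn, tn):
    """ROC coordinates of a subgroup: (FPR, TPR)."""
    tpr = tp / (tp + fn)  # fraction of all positives covered by the subgroup
    fpr = fp / (fp + tn)  # fraction of all negatives covered by the subgroup
    return fpr, tpr

print(roc_point(42, 13, 12, 33))  # a decent subgroup: TPR well above FPR
print(roc_point(54, 46, 0, 0))    # entire database -> (1.0, 1.0)
print(roc_point(0, 0, 54, 46))    # empty subgroup  -> (0.0, 0.0)
```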

ROC Space Properties
[Figure: ROC space with its landmarks]
- (FPR 0, TPR 1): 'ROC heaven', the perfect subgroup
- (FPR 1, TPR 0): 'ROC hell', the perfect negative subgroup
- Diagonal: random subgroups
- (1, 1): the entire database; (0, 0): the empty subgroup
- A minimum support threshold cuts off the region near the origin

Measures in ROC Space
[Figure: isometrics of WRAcc and information gain in ROC space, for positive and negative values around 0. Source: Flach & Fürnkranz.]

Other Measures
[Figure: ROC isometrics of precision, Gini index, correlation coefficient and Foil gain.]

Refinements in ROC Space
Refinements of S reduce both the FPR and the TPR, so they appear to the left of and below S. The blue polygon represents the possible refinements of S. With a convex measure, f is bounded by the measure of the polygon's corners. If no corner is above the minimum quality or the current best (top k?), the search space below S can be pruned.
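This pruning idea can be sketched with WRAcc as the convex measure. Refinements of a subgroup with counts (TP, FP) lie in the rectangle [0, TP] × [0, FP], so an optimistic estimate is the maximum over its four corners (a simplified sketch; the slides' polygon can be tighter than this rectangle).

```python
def wracc_from(tp, fp, pos, neg):
    """WRAcc of a subgroup given its (tp, fp) and the dataset totals."""
    n = pos + neg
    return tp / n - ((tp + fp) / n) * (pos / n)

def optimistic_wracc(tp, fp, pos, neg):
    """Upper bound on the WRAcc of any refinement of a subgroup with
    counts (tp, fp): refinements lie in the rectangle [0, tp] x [0, fp],
    and a convex measure is maximised at one of its corners."""
    corners = [(tp, fp), (tp, 0), (0, fp), (0, 0)]
    return max(wracc_from(t, f, pos, neg) for t, f in corners)

# If even the optimistic estimate falls below the current top-k threshold,
# the whole branch below this subgroup can be pruned.
print(optimistic_wracc(40, 20, 54, 46))  # attained at the corner (tp=40, fp=0)
```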

Combining Two Subgroups

Multi-class problems
Generalising to problems with more than 2 classes is fairly straightforward:

              target
              C1    C2    C3
subgroup  T
          F

Combine the cell values into a quality measure such as χ² or information gain.

Numeric Subgroup Discovery
- Target is numeric: find subgroups with a significantly higher or lower average value.
- Trade-off between the size of the subgroup and its average target value.
[Figure: subgroups with average target values h = 2200, h = 3100 and h = 3600.]
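The slides do not fix a formula for this trade-off; one common choice is a z-score-style measure that weights the mean shift by the square root of the subgroup size. A minimal sketch with hypothetical data (the numbers 3100 and 3600 echo the figure's averages):

```python
from math import sqrt
from statistics import mean, stdev

def numeric_quality(subgroup_values, all_values):
    """Size vs. mean-shift trade-off for a numeric target:
    sqrt(|S|) * (mean(S) - mean(DB)) / std(DB).
    One common choice; other quality measures exist."""
    shift = mean(subgroup_values) - mean(all_values)
    return sqrt(len(subgroup_values)) * shift / stdev(all_values)

# Hypothetical data: a small subgroup with a clearly higher average target
population = [2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700]
subgroup = [3100, 3600]
quality = numeric_quality(subgroup, population + subgroup)
print(round(quality, 2))  # clearly positive: higher-than-average subgroup
```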

Quiz 1
Q: Assume you have found a subgroup with a positive WRAcc (or info gain). Can any refinement of this subgroup be negative?
A: Yes.

Quiz 2
Q: Assume both A and B are random subgroups. Can the subgroup A∧B be an interesting subgroup?
A: Yes. Think of the XOR problem: A∧B is either completely positive or completely negative.

Quiz 3
Q: Can the combination of two positive subgroups ever produce a negative subgroup?
A: Yes.