Exploiting the Power of Group Differences to Solve Data Analysis Problems: Classification
Guozhu Dong, PhD, Professor, CSE, guozhu.dong@wright.edu


1 Exploiting the Power of Group Differences to Solve Data Analysis Problems Classification
Guozhu Dong, PhD, Professor CSE

2 Where Are We Now
Introduction and overview
Preliminaries
Emerging patterns: definitions and mining
Using emerging patterns as features and regression terms
Classification using emerging patterns
Clustering and clustering evaluation using emerging patterns
Outlier and intrusion detection using emerging patterns
Ranking attributes for problems with complex multi-attribute interactions using emerging patterns
Pattern aided regression and classification
Interesting applications of emerging patterns

3 Emerging Pattern Based Classification
Preliminaries on classification
CAEP: Classification by Aggregating Power of EPs; also, handling imbalanced classification with score normalization
DeEPs: instance-based classification using EPs
ECP: using CAEP on tiny training datasets for lead compound optimization
Why and how EPs are useful for classification
Note: this part uses EPs as basic conditions or as basic classifiers; later we use EPs as subpopulation handles for pattern aided regression and classification.
Guozhu Dong 2019

4 Classification: The Problem
We want to build a classification model that accurately predicts categorical class labels for data instances.
The classification model is built from a training dataset whose instances have class labels.
Mathematically, a classifier (model) is a function mapping tuples to classes.
There are many ways to specify classifier models, many methods to build them, and many measures to evaluate their performance.
Typical applications:
Credit/loan approval or denial
Medical diagnosis: is a tissue cancerous or benign?
Fraud detection: is a transaction fraudulent?
Web page categorization: which category does a page belong to?

5 Classifier Performance Evaluation Measures
1: Positive, 0: Negative. A classifier's predictions form a confusion matrix:

        PredP   PredN
TrueP   TP: 2   FN: 2   (P: 4)
TrueN   FP: 1   TN: 1   (N: 2)

(The slide's example data table, with attributes A1-A3, true class C, and predicted class PredC, is omitted here.)
Also: AUC of ROC.
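As a quick illustration, the standard measures can be computed directly from the confusion matrix counts on the slide (TP=2, FN=2, FP=1, TN=1). This is a minimal sketch with illustrative names, not code from the tutorial:

```python
# Sketch: evaluation measures from a binary confusion matrix.
def confusion_metrics(tp, fn, fp, tn):
    """Return common classifier evaluation measures as a dict."""
    p, n = tp + fn, fp + tn              # actual positives / negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "recall":    tp / p,             # true positive rate (y-axis of ROC)
        "fpr":       fp / n,             # false positive rate (x-axis of ROC)
    }

m = confusion_metrics(tp=2, fn=2, fp=1, tn=1)
print(m)  # accuracy 0.5, precision 2/3, recall 0.5, fpr 0.5
```

Recall and FPR at varying decision thresholds trace the ROC curve whose area is the AUC mentioned on the slide.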

6 Challenges
High dimensional data
Imbalanced classification
Hard datasets:
Complex interactions
Small disjuncts*
Complicated class boundaries
Many classes
* Small disjuncts are patterns covering a very small number of examples.

7 Why and How EPs Are Useful for Classification
Discriminative patterns are useful:
They can capture complex interactions
They can capture small disjuncts
They can be mined even from tiny training datasets
They are highly predictive (on the data they match)

8 CAEP: Classification by Aggregating Power of EPs
We want a classification method that:
uses a fairly complete set of discriminative patterns
combines the discriminative power of multiple matching discriminative patterns
uses all matching patterns in the pattern set of the model
[Dong+Zhang et al 99]

9 CAEP Description
For an instance T, score(Ci,T) aggregates the strengths of all EPs of class Ci that match T. If we only have JEPs, then score(Ci,T) = sum of the supports of the matching JEPs.

10 CAEP Illustration
We have two classes: P and N.
We have 200 EPs for P, and 150 EPs for N. We want to classify an instance t.
Suppose 3 EPs for P match t, and 2 EPs for N match t.
The patterns matching t have the following characteristics:

Class P: P1 (sup 0.3, GrowthRate 5), P2 (sup 0.1, GrowthRate 10), P3 (sup 0.05, GrowthRate infinite)
Class N: Q1 (sup 0.2, GrowthRate 7), Q2 (sup 0.05, GrowthRate 20)

Score(P,t) = 0.3*5/(5+1) + 0.1*10/(10+1) + 0.05 = 0.39
Score(N,t) = 0.2*7/(7+1) + 0.05*20/(20+1) = 0.22
If normalization is not performed, t is predicted as belonging to P.

11 Strength of a Matching Pattern in the Score
strength(P) = sup(P) * GrowthRate(P) / (GrowthRate(P) + 1)
sup(P): patterns matching more instances have larger impact.
GrowthRate(P) / (GrowthRate(P) + 1): patterns with larger growth rate have larger impact.
GrowthRate(P) / (GrowthRate(P) + 1) ≈ 1 if GrowthRate(P) is very large; it is exactly 1 if P is a JEP.
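The strength formula and the aggregation can be sketched in a few lines; the numbers below reproduce the illustration on slide 10. This is an illustrative sketch, not the authors' implementation:

```python
# Sketch of CAEP scoring: each matching EP contributes
# sup(P) * GR(P) / (GR(P) + 1); a JEP (GR = infinity) contributes sup(P).
import math

def strength(sup, growth_rate):
    """Strength of one matching emerging pattern."""
    if math.isinf(growth_rate):          # jumping emerging pattern
        return sup
    return sup * growth_rate / (growth_rate + 1)

def caep_score(matching_patterns):
    """matching_patterns: (support, growth_rate) pairs of EPs matching t."""
    return sum(strength(s, gr) for s, gr in matching_patterns)

# Slide-10 illustration: 3 EPs of class P match t, 2 EPs of class N match t.
score_P = caep_score([(0.3, 5), (0.1, 10), (0.05, math.inf)])
score_N = caep_score([(0.2, 7), (0.05, 20)])
print(round(score_P, 2), round(score_N, 2))  # 0.39 0.22
```

Without normalization, t would be assigned to the class with the larger raw score, here P.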

12 Normalizing Scores for Imbalanced Data
Data sizes of different classes can be imbalanced, and the numbers of patterns mined for different classes can also be imbalanced.
CAEP handles this problem by normalizing (dividing) score(Ci,·) for each class Ci by a fixed percentile (e.g. the 85th percentile) of the bag of scores {score(Ci,x) | x in Ci}.
Let score'(Ci,t) denote the normalized score. CAEP assigns an instance t to the class Cj whose normalized score is the largest: score'(Cj,t) = max {score'(C1,t), score'(C2,t)}.
If the numbers of EPs for the classes are small, we can instead divide the scores for a class by the sum of sup*GrowthRate over all patterns of the class. [Auer et al 2016]
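The percentile normalization above can be sketched as follows (a toy version with assumed helper names; the percentile convention here is nearest-rank, which may differ in detail from the original implementation):

```python
# Sketch of CAEP's percentile-based score normalization.
import math

def percentile(values, q):
    """Nearest-rank percentile of a bag of scores, q in (0, 100]."""
    s = sorted(values)
    k = max(0, math.ceil(q / 100 * len(s)) - 1)
    return s[k]

def normalized_score(raw_score, training_scores, q=85):
    """Divide a raw class score by the q-th percentile of that class's
    training-instance scores, so classes become comparable."""
    base = percentile(training_scores, q)
    return raw_score / base if base > 0 else raw_score

# Hypothetical bag of score(P, x) values for training instances x in P:
train_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print(normalized_score(0.39, train_scores))  # 0.39 / 0.9
```

The instance is then assigned to the class with the largest normalized score.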

13 Factors Influencing Selection of Emerging Patterns
The number of candidate emerging patterns available for selection can be large. The following factors are important in pattern selection:
Support of patterns (individually)
GrowthRate of patterns (individually)
mds overlap among patterns
GrowthRate similarity and support similarity among patterns with similar mds

14 Most Expressive Jumping Emerging Patterns
Observation: if P and Q are two JEPs for a class Ci such that P ⊂ Q, then:
P is more expressive (it matches more instances: it matches all instances that match Q, and possibly more)
P and Q give the same confidence (signal) for assigning instances t matching them to Ci
So we call the minimal JEPs (in the set containment sense) the most expressive JEPs.
Minimal JEPs were used to build powerful classifiers in several methods. [Li+Dong+Kotagiri 2001]
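Extracting the minimal (most expressive) JEPs by set containment can be sketched as below; the patterns and item names are hypothetical:

```python
# Sketch: keep only the minimal JEPs (no strictly smaller JEP is contained
# in them), since supersets add no confidence and match fewer instances.

def minimal_patterns(jeps):
    """jeps: iterable of item sets; returns the set-containment-minimal ones."""
    jeps = set(map(frozenset, jeps))
    return {p for p in jeps
            if not any(q < p for q in jeps)}   # q < p means proper subset

jeps = [{"a"}, {"a", "b"}, {"b", "c"}, {"b", "c", "d"}]
print(minimal_patterns(jeps))  # keeps {'a'} and {'b', 'c'} only
```

Here {"a", "b"} is discarded because {"a"} is a JEP contained in it, and likewise {"b", "c", "d"} is discarded in favor of {"b", "c"}.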

15 Performance of CAEP
Experiments show that CAEP:
has good predictive performance
is noise tolerant
We will point out additional advantages later.
Note: the no-free-lunch theorem says that no method is best all the time.

16 Comparison with Other Methods
Set of rules used: decision tree algorithms are very greedy in selecting attributes for nodes, so they are very greedy in selecting the set of rules they use; they use a very small subset of all possible rules. The CBA method also uses a very small set of rules [Liu et al 98]. CAEP uses many more discriminative patterns.
How rules are used: both CBA and decision trees use only one rule to classify an instance; they do not let (multiple) rules vote. CAEP combines the discriminative power of all matching patterns.

17 Many Interesting Applications for CAEP
Sequence classification for gene start site prediction
Activity recognition from image data, with applications to senior home monitoring/care
Lead compound optimization in drug candidate selection (chemoinformatics)
...

18 Many New Classification Methods Are CAEP-Like
CMAR [Li+Han+Pei 2001]: combines multiple rules, using the rule with maximum normalized χ2
CPAR [Yin+Han 2003]: combines multiple rules, using the average accuracy of the best k matching rules for each class
Causal associative classification [Yu+Wu et al 2009]: uses a Markov blanket to select patterns and CAEP to score; uses fewer patterns
These methods differ in how they mine patterns, how they select patterns, and how they aggregate the discriminative power of matching patterns.
Many related methods exist; they belong to the families of "associative classifiers" and "rule/pattern based classifiers" [Yu+Wu et al 2009].

19 DeEPs: Instance Based Classification Using EPs [Li+Dong et al 2004]
Training data: C1, C2 (the method generalizes to more than 2 classes).
For each instance t to be classified:
Let Data(Ci,t) be the projection of Ci onto t: remove all attribute values not occurring in t.
Mine JEPs for each class C1, C2, yielding the pattern sets LazyEPs(C1,t) and LazyEPs(C2,t).
Let LazyData(Ci,t) be the subset of Data(Ci,t) matching some JEP in LazyEPs(Ci,t).
Use the sizes of LazyData(C1,t) and LazyData(C2,t), as percentages of |C1| and |C2|, to decide t's class.
The matching-size based scoring also handles the imbalanced class problem.
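The lazy projection step, which shrinks each training instance to the attribute values it shares with the test instance before mining JEPs, can be sketched as follows (toy attribute-value data with illustrative names, not the authors' code):

```python
# Sketch of DeEPs' projection Data(Ci, t): keep only the (attribute, value)
# pairs of each training instance that also occur in the test instance t.

def project(dataset, t):
    """dataset: list of dicts; t: dict. Returns the projected instances."""
    t_items = set(t.items())
    return [dict(set(row.items()) & t_items) for row in dataset]

C1 = [{"color": "red", "size": "big"},
      {"color": "blue", "size": "big"}]
t = {"color": "red", "size": "big"}
projected = project(C1, t)
# first instance is kept whole; second keeps only the shared "size" value
```

JEPs are then mined from the (much smaller) projected datasets, which is what makes the instance-based approach feasible at classification time.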

20 ECP: Using CAEP on Tiny Training Datasets for Lead Compound Optimization
Drug design often requires molecule design and molecule selection.
We need to search for highly potent molecule structures for use as drug candidates.
Researchers use some intuition to guide the search. E.g.: pick a set of likely positive and likely negative examples; use this set as the basis to find the most potent molecules (not necessarily in the picked set).
One approach is to use the training set to build a classifier to predict the potency of other molecules.
Picking a training set is labor intensive, so we want to pick small training sets, e.g. 3 positives vs 3 negatives, 5 vs 5, or 10 vs 10. [Auer+Bajorath 2006, 2008]
ECP: Emerging Chemical Pattern.

21 (Results table omitted; "avg" columns give averages.) BIN: Binary QSAR; DT: Decision Trees; ECP: CAEP using Emerging Chemical Patterns.

22 Simulated Lead Optimization using CAEP
[Auer and Bajorath (2006)] used an iterative procedure for simulated lead optimization, exploiting CAEP's strength with small training data:
During each iteration, they randomly selected a small set of compounds from the current set of test compounds, obtained their potency, and divided them into a high potency class and a low potency class (using their mean potency value as the threshold); k = 3 or 5 examples per class.
This compound set was then used to train the ECP (CAEP) classifier to distinguish higher from lower potency compounds.
The class labels of the remaining test compounds were predicted, assigning each test compound to the high or low potency class.
All compounds predicted to have low potency were then removed from the test set; only compounds classified as highly potent were retained for the next iteration.
The final (enriched) set, after hundreds of iterations, should be highly potent. See the figures on the next slide.
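The iterative enrichment loop above can be sketched as below. Everything here is a hypothetical toy: compounds are plain numbers whose potency is the number itself, and the "classifier" is a simple threshold stand-in for the real ECP/CAEP model trained on chemical descriptors:

```python
# Toy sketch of the iterative simulated-lead-optimization loop.
import random

def enrich(pool, potency, classify_factory, k=3, iterations=25, seed=0):
    """Repeatedly: sample 2k compounds, label them high/low by mean potency,
    train a classifier, and keep only compounds predicted 'high'."""
    rng = random.Random(seed)
    pool = list(pool)
    for _ in range(iterations):
        if len(pool) <= 2 * k:
            break
        sample = rng.sample(pool, 2 * k)                    # small labeled batch
        threshold = sum(potency(c) for c in sample) / len(sample)
        predict = classify_factory(sample, threshold)       # "trained" model
        pool = [c for c in pool
                if c not in sample and predict(c) == "high"]
        if not pool:
            break
    return pool

# Stand-in for ECP: predict "high" when a compound exceeds the batch mean.
def toy_factory(sample, threshold):
    return lambda c: "high" if c >= threshold else "low"

survivors = enrich(range(100), potency=lambda c: c, classify_factory=toy_factory)
```

Each pass discards the compounds predicted to be of low potency, so the surviving pool becomes progressively enriched in high-potency compounds, mirroring the enrichment the study observed over many iterations.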

23 Simulated Lead Optimization using Iterative CAEP
Boxplots for 500 runs of the iterative process.
In pharmacology, potency is a measure of drug activity expressed in terms of the amount needed to produce an effect of a given intensity. We want molecules with small IC50.

