Prototype-based learning and adaptive distances for classification


1 Prototype-based learning and adaptive distances for classification
Michael Biehl
Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen

2 overview
∙ Basic concepts of similarity / distance based classification
  example system: Learning Vector Quantization (LVQ)
  application: Classification of Adrenal Tumors
∙ Distance measures and Relevance Learning
  predefined distances, e.g. divergence based LVQ
  application: Detection of Cassava Mosaic Disease
  adaptive distances, e.g. Matrix Relevance LVQ
  application: Classification of Adrenal Tumors (cont'd)
  extensions: combined distances, relational data
  (excursion: uniqueness and regularization of relevance matrices)

3 Part I: Basic concepts of distance/similarity based classification

4 classification problems
here only: supervised learning / classification, e.g.
- character/digit/speech recognition
- medical diagnoses
- pixel-wise segmentation in image processing
- object recognition / scene analysis
- fault detection in technical systems
- remote sensing
- ...
machine learning approach: extract information from example data, parameterized in a learning system (neural network, LVQ, SVM, ...)
working phase: application to novel data

5 distance based classification
assignment of data (objects, observations, ...) to one or several classes (crisp/soft) (categories, labels), based on comparison with reference data (samples, prototypes) in terms of a distance measure (dis-similarity, metric)
representation of data (a key step!):
- collection of qualitative/quantitative descriptors
- vectors of numerical features
- sequences, graphs, functional data
- relational data, e.g. in terms of pairwise (dis-) similarities

6 K-NN classifier a simple distance-based classifier
store a set of labeled examples
classify a query according to the label of its nearest neighbor (or the majority among its K nearest neighbors)
local decision boundary according to (e.g.) Euclidean distances in feature space; piece-wise linear class borders parameterized by all examples
+ conceptually simple, no training required, one parameter (K)
- expensive storage and computation, sensitivity to "outliers", can result in overly complex decision boundaries
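For concreteness, a minimal numpy sketch of such a K-NN decision (the function name knn_classify, the array layout with examples as rows, and the tie handling are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def knn_classify(query, examples, labels, k=1):
    """K-NN: assign the majority label among the k nearest stored examples."""
    d2 = np.sum((examples - query) ** 2, axis=1)   # squared Euclidean distances to all stored examples
    nearest = np.argsort(d2)[:k]                   # indices of the k closest examples
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[np.argmax(counts)]               # majority vote (ties broken by label order)
```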

7 prototype based classification
a prototype based classifier [Kohonen 1990, 1997]:
represent the data by one or several prototypes per class
classify a query according to the label of the nearest prototype (or alternative schemes)
local decision boundaries according to (e.g.) Euclidean distances in feature space; piece-wise linear class borders parameterized by the prototypes
+ less sensitive to outliers, lower storage needs, little computational effort in the working phase
- training phase required in order to place prototypes; model selection problem: number of prototypes per class, etc.

8 Nearest Prototype Classifier
set of prototypes carrying class labels
nearest prototype classifier (NPC), based on a given dissimilarity/distance measure d(x, w):
- determine the winner w_L with d(x, w_L) ≤ d(x, w_j) for all j
- assign x to the class of w_L
reasonable requirements: d(x, w) ≥ 0, d(w, w) = 0
most prominent example: (squared) Euclidean distance d(x, w) = (x − w)² = Σ_i (x_i − w_i)²
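A corresponding sketch of the NPC rule with squared Euclidean distance (function and variable names are hypothetical, samples and prototypes assumed as numpy rows):

```python
import numpy as np

def npc_classify(x, prototypes, proto_labels):
    """Nearest prototype classifier: return the label of the closest prototype."""
    d = np.sum((prototypes - x) ** 2, axis=1)    # d(x, w_j) for all prototypes w_j (squared Euclidean)
    winner = np.argmin(d)                        # index L of the winning prototype
    return proto_labels[winner]                  # assign x to the class of w_L
```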

9 Learning Vector Quantization
N-dimensional data, feature vectors
∙ identification of prototype vectors from labeled example data
∙ distance based classification (e.g. Euclidean)
heuristic scheme: LVQ1 [Kohonen, 1990, 1997]
• initialize prototype vectors for the different classes
• present a single example
• identify the winner (closest prototype)
• move the winner
  - closer towards the data point (same class)
  - away from the data point (different class)

10 Learning Vector Quantization
N-dimensional data, feature vectors
∙ identification of prototype vectors from labeled example data
∙ distance-based classification [here: Euclidean distances]
∙ tessellation of feature space [piece-wise linear]
∙ aim: discrimination of classes (≠ vector quantization or density estimation)
∙ generalization ability: correct classification of new data

11 LVQ1 iterative training procedure:
randomized initialization of the prototypes, e.g. close to the class-conditional means
sequential presentation of labelled examples
the winner takes it all; LVQ1 update step: w_L ← w_L + η ψ (x − w_L), with ψ = +1 if the classes of w_L and x coincide and ψ = −1 otherwise
learning rate η
many heuristic variants/modifications [Kohonen, 1990, 1997]:
- learning rate schedules η(t) [Darken & Moody, 1992]
- update of more than one prototype per step
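A compact sketch of this LVQ1 training loop with a constant learning rate (the function name lvq1_train and the epoch/seed handling are illustrative choices, not part of the slides; a schedule η(t) would replace the constant eta):

```python
import numpy as np

def lvq1_train(X, y, prototypes, proto_labels, eta=0.01, epochs=30, seed=0):
    """Heuristic LVQ1: the winner is attracted by same-class examples, repelled by others."""
    rng = np.random.default_rng(seed)
    W = np.array(prototypes, dtype=float)                      # initial prototypes, e.g. class-conditional means
    for _ in range(epochs):
        for i in rng.permutation(len(X)):                      # sequential presentation in random order
            d = np.sum((W - X[i]) ** 2, axis=1)                # squared Euclidean distances
            J = np.argmin(d)                                   # the winner takes it all
            psi = 1.0 if proto_labels[J] == y[i] else -1.0     # same class: attract, different class: repel
            W[J] += psi * eta * (X[i] - W[J])                  # LVQ1 update step
    return W
```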

12 LVQ1
LVQ1 update step: w_L ← w_L + η ψ (x − w_L)
LVQ1-like update for a generalized distance d(x, w): w_L ← w_L − η ψ ∂d(x, w_L)/∂w_L
requirement: the update decreases (increases) the distance if the classes coincide (are different)

13 Generalized LVQ
one example of cost function based training: GLVQ [Sato & Yamada, 1995]
minimize E = Σ_μ Φ(e_μ) with e_μ = (d_J − d_K) / (d_J + d_K), defined in terms of the two winning prototypes: d_J is the distance to the closest correct prototype, d_K the distance to the closest incorrect prototype
linear Φ: E favors large margin separation of the classes
sigmoidal Φ (linear for small arguments): E approximates the number of misclassifications
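A sketch of the GLVQ cost per example (the steepness parameter of the sigmoid and the function names are illustrative assumptions):

```python
import numpy as np

def glvq_mu(d_J, d_K):
    """Relative difference e = (d_J - d_K)/(d_J + d_K); d_J: closest correct, d_K: closest wrong prototype."""
    return (d_J - d_K) / (d_J + d_K)

def glvq_phi(e, steepness=5.0):
    """Sigmoidal Phi: roughly linear for small arguments, saturating otherwise,
    so that the sum over examples approximates the number of misclassifications (e > 0)."""
    return 1.0 / (1.0 + np.exp(-steepness * e))
```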

14 GLVQ training
training = optimization of the cost function with respect to the prototype positions; e.g. single example presentation: stochastic sequence of examples, update of the two winning prototypes per step, based on a non-negative, differentiable distance

16 GLVQ training
training = optimization with respect to the prototype positions; single example presentation, stochastic sequence of examples, update of the two winning prototypes per step
based on the Euclidean distance, the update moves the prototypes towards (closest correct) / away from (closest incorrect) the sample, with prefactors derived from the gradient of E

17 prototype/distance based classifiers
+ intuitive interpretation: prototypes defined in feature space
+ natural for multi-class problems
+ flexible, easy to implement
+ frequently applied in a variety of practical problems
- often based on purely heuristic arguments ... or ... cost functions with unclear relation to the classification error
- model/parameter selection (# of prototypes, learning rate, ...)
important issue: which is the 'right' distance measure?
features may
- scale differently
- be of completely different nature
- be highly correlated / dependent
simple Euclidean distance?

18 related schemes
many variants of LVQ
- intuitive schemes: LVQ2.1, LVQ3, OLVQ, ...
- cost function based: RSLVQ (likelihood ratios)
Supervised Neural Gas (NG): many prototypes, rank based update
Supervised Self-Organizing Maps (SOM): neighborhood relations, topology preserving mapping
Radial Basis Function (RBF) Networks: hidden units = centers (prototypes) with Gaussian activation

19 remark: the curse of dimension?
concentration of distances for large N: "distance based methods are bound to fail in high dimensions" ???
LVQ:
- prototypes are not just random data points
- carefully selected low-noise representatives of the data
- distances of a given data point to the prototypes are compared
→ projection to a non-trivial low-dimensional subspace!
see also: [Ghosh et al., 2007; Witoelar et al., 2010] models of LVQ training, analytical treatment in the limit of large N; successful training needs a number of training examples on the order of N

20 Questions?

21 an example problem: classification of adrenal tumors
Petra Schneider, Han Stiekema, Michael Biehl
Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen
Wiebke Arlt, Angela Taylor, Dave J. Smith, Peter Nightingale, P.M. Stewart, C.H.L. Shackleton et al.
School of Medicine, Queen Elizabeth Hospital, University of Birmingham/UK (+ several centers in Europe)
[Arlt et al., J. Clin. Endocrinology & Metabolism, 2011]

22 tumor classification
[figure: adrenal gland; www.ensat.org]
∙ adrenal tumors are common (1-2%) and mostly found incidentally
∙ adrenocortical carcinomas (ACC) account for 2-11% of adrenal incidentalomas (ACA: adrenocortical adenomas)
∙ conventional diagnostic tools (CT, MRI) lack sensitivity and are labor and cost intensive
∙ idea: tumor classification based on the steroid excretion profile

23 tumor classification
- urinary steroid excretion (24 hours)
- 32 potential biomarkers
- biochemistry imposes correlations, grouping of steroids

24 tumor classification
data set: 102 patients with benign ACA, 45 patients with malignant ACC
[figure: color coded excretion values (log scale, relative to healthy controls); rows: steroid marker #, columns: ACA / ACC patient #]

25 tumor classification
Generalized LVQ: training and performance evaluation
∙ data divided into a 90% training and a 10% test set
∙ determine prototypes by stochastic gradient descent: typical profiles (1 per class)
∙ employ the Euclidean distance measure in the 32-dim. feature space
∙ apply the classifier to the test data, evaluate the performance (error rates)
∙ repeat and average over many random splits (a sketch of this protocol follows below)
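A sketch of the split-and-average protocol with placeholder training/prediction functions (train_fn, predict_fn and all names here are hypothetical; the slides do not prescribe an implementation):

```python
import numpy as np

def evaluate_random_splits(X, y, train_fn, predict_fn, n_splits=100, test_frac=0.1, seed=0):
    """Average test error over repeated random 90%/10% splits."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_splits):
        idx = rng.permutation(len(X))
        n_test = int(test_frac * len(X))
        test, train = idx[:n_test], idx[n_test:]
        model = train_fn(X[train], y[train])           # e.g. GLVQ prototype training
        y_hat = predict_fn(model, X[test])             # nearest-prototype decisions on held-out data
        errors.append(np.mean(y_hat != y[test]))
    return float(np.mean(errors))                      # error rate averaged over all random splits
```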

26 tumor classification
[figure: ACA and ACC prototypes, i.e. typical steroid excretion profiles per class]

27 tumor classification
∙ Receiver Operating Characteristic (ROC) [Fawcett, 2000], obtained by introducing a bias into the NPC; the curve plots the true positive rate (sensitivity) against the false positive rate (1-specificity)
- one extreme: all tumors classified as ACA - no false alarms, but no true positives detected
- other extreme: all tumors classified as ACC - all true positives detected, but the maximum number of false alarms
- the diagonal corresponds to random guessing; performance is summarized by the Area under the Curve (AUC)
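A minimal sketch of how such a ROC curve and its AUC can be computed for the biased NPC, using the score s(x) = d(x, w_ACA) − d(x, w_ACC) and sweeping the decision threshold (the label convention 1 = ACC, 0 = ACA and the function name are my own assumptions):

```python
import numpy as np

def roc_from_scores(scores, labels):
    """ROC of a biased NPC: classify as ACC whenever score > threshold; sweep the threshold."""
    order = np.argsort(-scores)                                  # sort by decreasing score
    sorted_labels = labels[order]
    tpr = np.cumsum(sorted_labels == 1) / np.sum(labels == 1)    # sensitivity
    fpr = np.cumsum(sorted_labels == 0) / np.sum(labels == 0)    # 1 - specificity
    tpr = np.concatenate(([0.0], tpr))                           # start at "all classified as ACA"
    fpr = np.concatenate(([0.0], fpr))
    auc = np.trapz(tpr, fpr)                                     # area under the curve
    return fpr, tpr, auc
```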

28 tumor classification: GLVQ performance
ROC characteristics (averaged over random splits of the data set): AUC = 0.87

29 Questions?

30 Part II: distance measures and relevance learning

31 distance measures
fixed (predefined) distance measures:
- select the distance measure according to prior knowledge
- or make a data driven choice in a preprocessing step
- determine prototypes for the given distance
- compare the performance of various measures
example: divergence based LVQ

32 Relevance Matrix LVQ [Schneider et al., 2009]
generalized quadratic distance in LVQ: d_Λ(w, x) = (x − w)^T Λ (x − w) with Λ = Ω^T Ω positive semi-definite
normalization: Σ_i Λ_ii = 1
variants: one global, several local, or class-wise relevance matrices → piecewise quadratic decision boundaries
diagonal matrices: single feature weights [Bojer et al., 2001] [Hammer et al., 2002]
rectangular Ω: discriminative low-dim. representation, e.g. for visualization [Bunte et al., 2012]
possible constraints: rank control, sparsity, ...
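A direct transcription of this adaptive distance and its normalization (function names are illustrative):

```python
import numpy as np

def gmlvq_distance(x, w, Omega):
    """d_Lambda(w, x) = (x - w)^T Lambda (x - w) with Lambda = Omega^T Omega."""
    diff = Omega @ (x - w)            # distance is Euclidean in the linearly transformed space
    return float(diff @ diff)

def normalize_Omega(Omega):
    """Enforce the normalization sum_i Lambda_ii = trace(Omega^T Omega) = 1."""
    return Omega / np.sqrt(np.trace(Omega.T @ Omega))
```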

33 Relevance Matrix LVQ
optimization of prototypes and distance measure
winner-takes-all (WTA) scheme: Matrix-LVQ1

34 Relevance Matrix LVQ
optimization of prototypes and distance measure
cost function based: Generalized Matrix LVQ (GMLVQ), gradient updates of prototypes and Ω derived from the GLVQ cost function E

35 heuristic interpretation
d_Λ(w, x) = [Ω (x − w)]²: standard Euclidean distance for linearly transformed features
Λ_jj = Σ_i Ω_ij² summarizes
- the contribution of the original dimension j
- the relevance of the original features for the classification
the interpretation implicitly assumes that the features have equal order of magnitude, e.g. after a z-score transformation → zero mean, unit variance (averages over the data set)
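A short sketch of the two ingredients mentioned here, relevance read-off and z-score transformation (function names assumed; constant features would need special treatment in practice):

```python
import numpy as np

def feature_relevances(Omega):
    """Diagonal of Lambda = Omega^T Omega: summed contribution of each original feature."""
    return np.diag(Omega.T @ Omega)

def zscore(X):
    """z-score transformation: zero mean and unit variance per feature (averages over the data set)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```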

36 Relevance Matrix LVQ
optimization of prototype positions and distance measure(s) in one training process (≠ pre-processing)
motivation:
- improved performance: weighting of features and pairs of features
- simplified classification schemes: elimination of non-informative, noisy features; discriminative low-dimensional representation
- insight into the data / classification problem: identification of the most discriminative features, incorporation of prior knowledge (e.g. structure of Ω)

37 tumor classification (cont’d)
[Arlt et al., 2011] [Biehl et al., 2012]
Generalized Matrix LVQ, ACC vs. ACA classification
∙ data divided into 90% training and 10% test set (z-score transformed)
∙ determine prototypes: typical profiles (1 per class)
∙ adaptive generalized quadratic distance measure, parameterized by Ω
∙ apply the classifier to the test data, evaluate the performance (error rates, ROC)
∙ repeat and average over many random splits

38 tumor classification: relevance matrix
[figure: diagonal and off-diagonal elements of the relevance matrix Λ]
fraction of runs (random splits) in which a steroid is rated among the 9 most relevant markers; the subset of 9 selected steroids ↔ technical realization (patented, University of Birmingham/UK)

39 tumor classification
[figure: diagonal and off-diagonal elements of Λ; excretion of steroid 19 in ACA vs. ACC]
example of a directly discriminative marker: steroid 19

40 tumor classification
[figure: diagonal and off-diagonal elements of Λ; excretion of steroid 8 in ACA vs. ACC]
non-trivial role: steroid 8 is among the most relevant markers although it is hardly discriminative on its own

41 tumor classification
[figure: joint distribution of markers 8 and 12 in ACA vs. ACC]
two individually weakly discriminative markers (8 and 12) form a highly discriminative combination!

42 tumor classification: ROC characteristics
clear improvement due to adaptive distances:
- Euclidean distance (GLVQ): AUC 0.87
- diagonal relevances (GRLVQ): AUC 0.93
- full relevance matrix (GMLVQ): AUC 0.97

43 tumor classification: observation / theory
eigenvalues of the relevance matrix in the ACA/ACC classification: low rank of the resulting relevance matrix, often a single relevant eigendirection
Stationarity of Matrix Relevance LVQ [M. Biehl, B. Hammer, F.-M. Schleif, T. Villmann, IJCNN 2015, in press]: intrinsic regularization - nominally ~ N×N adaptive parameters in Matrix LVQ reduce to ~ N effective degrees of freedom
the low-dimensional representation facilitates, e.g., the visualization of labeled data sets

44 tumor classification
[figure: visualization of the ACA/ACC data set in the low-dimensional representation]

45 a multi-class example: classification of coffee samples
based on hyperspectral data (256-dim. feature vectors) [U. Seiffert et al., IFF Magdeburg]
[figure: data and prototypes projected on the first and second eigenvector of the relevance matrix]

46 related schemes
Linear Discriminant Analysis (LDA): one prototype per class + a global matrix, but a different objective function!
relevance learning related schemes in supervised learning:
- RBF Networks [Backhaus et al., 2012]
- Neighborhood Component Analysis [Goldberger et al., 2005]
- Large Margin Nearest Neighbor [Weinberger et al., 2006, 2010]
- and many more!
Relevance LVQ variants:
- local, rectangular, structured, restricted ... relevance matrices
- for visualization, functional data, texture recognition, etc.
- relevance learning in Robust Soft LVQ, Supervised NG, etc.
- combination of distances for mixed data ...

47 links
Matlab collection: Relevance and Matrix adaptation in Learning Vector Quantization (GRLVQ, GMLVQ and LiRaM LVQ)
Pre/re-prints etc.
Challenging data sets?

48 Questions?

49 uniqueness / regularization
quadratic distance measure d_Λ(w, x) = (x − w)^T Λ (x − w) with positive semi-definite Λ (pseudo-metric); intrinsic representation by a linear transformation Ω with Λ = Ω^T Ω
uniqueness (i): the matrix square root Ω is not unique*; canonical representation, e.g. the symmetric Ω = Λ^{1/2}
* irrelevant rotations, reflections, symmetries: Ω and RΩ yield the same Λ for any orthogonal R

50 uniqueness of relevance matrix
uniqueness (ii): given the mapping x → Ω x, replacing Ω by Ω̂ = Ω + Δ is possible if a Δ exists with Δ x = 0 for all data points, i.e. the rows of Δ lie in the null-space of the data → identical mapping of all examples and prototypes, same distances and classification scheme w.r.t. the training data; the data matrix is singular if features are highly correlated / interdependent

51 uniqueness of relevance matrix
a simple example: consider two identical, entirely irrelevant features, e.g. x_1 = x_2 with no class information; their contributions to Ω x cancel exactly if the corresponding columns of Ω are opposite (the features are then disregarded in the classification), but a naïve interpretation of the diagonal of Λ suggests high relevance!

52 posterior null-space projection
the training process yields Ω; determine the eigenvectors and eigenvalues of the data matrix
column space projection: Ω̃ = Ω P, where P projects onto the space spanned by the examples - this removes the null-space contributions
Note: Ω̃ minimizes the norm under the condition that the mapping of all examples is reproduced; formal solution via the Moore-Penrose pseudo-inverse of the data matrix
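A sketch of this column space projection via the Moore-Penrose pseudo-inverse, assuming the data matrix X holds the training examples as rows (shape convention and function name are my assumptions):

```python
import numpy as np

def column_space_projection(Omega, X):
    """Project the rows of Omega onto the span of the training examples (null-space correction).
    Omega_tilde = Omega @ pinv(X) @ X leaves Omega x unchanged for every training example x."""
    P = np.linalg.pinv(X) @ X          # orthogonal projector onto the space spanned by the examples
    return Omega @ P
```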

53 posterior regularization
the training process yields Ω; determine the eigenvectors and eigenvalues of Λ = Ω^T Ω
regularization: retain only the eigenspace corresponding to the K largest eigenvalues
- removes also the eigenspace of (small) non-zero eigenvalues
- smoothens the mapping, less data set specific
- potentially improved generalization performance
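A sketch of this eigenvalue truncation of Λ (the clipping of tiny negative eigenvalues is a numerical safeguard I added, not part of the slides):

```python
import numpy as np

def regularize_Omega(Omega, K):
    """Retain only the eigenspace of the K largest eigenvalues of Lambda = Omega^T Omega."""
    vals, vecs = np.linalg.eigh(Omega.T @ Omega)       # eigenvalues in ascending order
    vals = np.clip(vals, 0.0, None)                    # guard against small negative values
    top = vecs[:, -K:] * np.sqrt(vals[-K:])            # eigenvectors scaled by sqrt(eigenvalue)
    return top.T                                       # rank-K Omega_tilde reproducing the truncated Lambda
```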

54 posterior regularization
regularized mapping
- after/during training: retains the original features, flexible K, can include the prototypes
- as pre-processing of the data (PCA-like): mapped feature space, fixed K, prototypes yet unknown (*)
(*) remark: when converged, prototypes are (close to) linear combinations of the feature vectors
here: posterior regularization in classification schemes
- dependence of the generalization performance on the parameter K
- improved interpretability of the mapping / distance measure

55 illustrative example
GMLVQ classification of the (binned) alcohol content from infra-red spectral data: 124 wine samples, 256 wavelengths
training data: P = 30 spectra, 94 test spectra
high correlation of the features (neighboring channels) and P = 30 → an effective dimension ≪ 256 can be expected

56 illustrative example: over-fitting effect
null-space correction leaves P = 30 dimensions; best performance with only 7 dimensions remaining
regularization (beyond the column space projection) potentially enhances generalization and controls over-fitting

57 before and after regularization
regularization
- enhances generalization
- smoothens the relevance profile/matrix
- removes 'false relevances'
- improves the interpretability of Λ

58 links
Matlab collection: Relevance and Matrix adaptation in Learning Vector Quantization (GRLVQ, GMLVQ and LiRaM LVQ)
Pre/re-prints etc.

59 Questions?

