Biomedical applications of prototype-based classifiers and relevance learning
Michael Biehl
Intelligent Systems, Johann Bernoulli Institute for Mathematics and Computing Science
University of Groningen / NL
- Introduction: prototype-based classification, relevance learning
- Generalized Matrix Relevance LVQ
- Illustration: three bio-medical applications

supervised learning
classification / regression / prediction based on labeled example data
generic workflow:
- training: example data → model
- working: apply model to novel data
- validation: estimate working performance, set parameters of model / training, compare different models
obvious performance measures: overall / class-wise accuracy, ROC, Precision Recall, ...
accuracy is not enough: interpretable "white-box" systems
example: prototype-based models, distance-based classifiers

distance-based classifiers
a simple distance-based system: (K) NN classifier
∙ store a set of labeled examples
∙ classify a query according to the label of the Nearest Neighbor (or the majority of K NN)
∙ piece-wise linear decision boundaries according to (e.g.) Euclidean distance from all examples
[figure: query point "?" in N-dim. feature space]
+ conceptually simple
+ no training phase
+ only one parameter (K)
- expensive (storage, computation)
- sensitive to mislabeled data
- overly complex decision boundaries
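The (K) NN rule described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `knn_classify` and the toy data are made up for the example, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_classify(query, examples, labels, k=1):
    """Return the majority label among the k stored examples closest to the query."""
    # Euclidean distance from the query to every stored example
    dists = [math.dist(query, x) for x in examples]
    # indices of the k nearest stored examples
    nearest = sorted(range(len(examples)), key=lambda i: dists[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# hypothetical toy data: two classes in a 2-dim. feature space
examples = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (4.0, 3.0)]
labels = ["A", "A", "B", "B"]
print(knn_classify((0.2, 0.1), examples, labels, k=1))
print(knn_classify((3.5, 3.0), examples, labels, k=3))
```

Every query requires a pass over all stored examples, which is the storage and computation cost listed among the drawbacks.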

prototype-based classification
Learning Vector Quantization [Kohonen, 1990]
∙ represent the data by one or several prototypes per class
∙ classify a query according to the label of the nearest prototype (or alternative schemes)
∙ local decision boundaries acc. to (e.g.) Euclidean distances
[figure: query point "?" in N-dim. feature space]
+ robust, low storage needs, little computational effort
+ parameterization in feature space, interpretability
- model selection: number of prototypes per class, etc.
- requires training: placement of prototypes in feature space

Learning Vector Quantization
N-dimensional data, feature vectors
∙ identification of prototype vectors from labeled example data
∙ distance based classification (e.g. Euclidean)
competitive learning: LVQ1 [Kohonen, 1990]
• initialize prototype vectors for different classes
• present a single example
• identify the winner (closest prototype)
• move the winner
  - closer towards the data (same class)
  - away from the data (different class)
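A single LVQ1 step as listed above might look as follows. This is a minimal sketch: the learning rate `lr`, the function name `lvq1_step`, and the list-based data layout are assumptions for illustration, and Euclidean distance is used to find the winner.

```python
import math

def lvq1_step(x, y, prototypes, proto_labels, lr=0.1):
    """Apply one LVQ1 update in place and return the index of the winner."""
    # winner: the prototype closest to the presented example
    winner = min(range(len(prototypes)),
                 key=lambda j: math.dist(x, prototypes[j]))
    # attract the winner if its class label matches, repel it otherwise
    sign = 1.0 if proto_labels[winner] == y else -1.0
    prototypes[winner] = [w + sign * lr * (xi - w)
                          for w, xi in zip(prototypes[winner], x)]
    return winner

# hypothetical example: one prototype per class in 2 dimensions
prototypes = [[0.0, 0.0], [4.0, 4.0]]
proto_labels = [0, 1]
lvq1_step((1.0, 1.0), 0, prototypes, proto_labels)  # same class: winner moves closer
print(prototypes[0])
```

Training repeats this step over many randomly presented examples, usually with a small or decreasing learning rate.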

Learning Vector Quantization
6/3/2018
N-dimensional data, feature vectors
∙ identification of prototype vectors from labeled example data
∙ distance-based classification [here: Euclidean distances]
∙ tessellation of feature space [piece-wise linear]
∙ aim: discrimination of classes ( ≠ vector quantization or density estimation )
∙ generalization ability: correct classification of new data

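cost function based LVQ
one example: Generalized LVQ (GLVQ) cost function [Sato&Yamada, 1995]
minimize $E = \sum_{i} \Phi(\mu_i)$ with $\mu_i = \frac{d_J(x_i) - d_K(x_i)}{d_J(x_i) + d_K(x_i)}$
two winning prototypes: $w_J$, the closest prototype with the correct label (distance $d_J$), and $w_K$, the closest prototype with a different label (distance $d_K$)
E favors
- small number of misclassifications, e.g. with a sigmoidal $\Phi$
- large margins between classes
- small $d_J$, large $d_K$
- class-typical prototypes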
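The GLVQ cost of Sato and Yamada is commonly written as E = Σ Φ(μ) with μ = (d_J − d_K)/(d_J + d_K), where d_J is the distance to the closest prototype of the correct class and d_K the distance to the closest prototype of any other class. The sketch below evaluates this cost under two assumptions that are not taken from the slides: squared Euclidean distance, and the identity as default Φ. All function names are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (assumed here; other distances are possible)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def glvq_mu(x, y, prototypes, proto_labels):
    """Relative distance difference; negative iff x is classified correctly."""
    d_J = min(sq_dist(x, w) for w, c in zip(prototypes, proto_labels) if c == y)
    d_K = min(sq_dist(x, w) for w, c in zip(prototypes, proto_labels) if c != y)
    return (d_J - d_K) / (d_J + d_K)

def glvq_cost(X, Y, prototypes, proto_labels, phi=lambda m: m):
    """Sum of phi(mu) over all labeled examples."""
    return sum(phi(glvq_mu(x, y, prototypes, proto_labels)) for x, y in zip(X, Y))

# hypothetical example: one prototype per class
prototypes = [[0.0, 0.0], [2.0, 0.0]]
proto_labels = [0, 1]
print(glvq_mu((0.5, 0.0), 0, prototypes, proto_labels))  # negative: correct side
```

Since μ lies in [-1, 1] and is negative exactly for correctly classified examples, a steep sigmoidal Φ makes E approximate the number of misclassifications.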

LVQ distance measures
key question: appropriate distance / (dis-) similarity measure
fixed, pre-defined distance measures:
- (G)LVQ can be formulated for general (differentiable) distances
- examples: Minkowski distances (p≠2), correlation based, statistical divergences; not necessarily metrics!
standard work-flow
- consider several distance measures according to prior knowledge
- compare performances in, e.g., cross-validation
elegant approach: Relevance Learning / adaptive distances
- employ parameterized distance measure
- optimize in the data-driven training process (cost function!)
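As one concrete instance of a fixed, pre-defined alternative to the Euclidean distance, a Minkowski distance of order p can be sketched as below. The function name is illustrative; p = 2 recovers the Euclidean case.

```python
def minkowski_distance(a, b, p=2.0):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)

# the same pair of points under different orders p
print(minkowski_distance((0.0, 0.0), (3.0, 4.0), p=1))  # city-block distance
print(minkowski_distance((0.0, 0.0), (3.0, 4.0), p=2))  # Euclidean distance
```

In the standard work-flow, p would be treated like any other model choice and compared in cross-validation.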

Generalized Matrix Relevance LVQ: GMLVQ
[Schneider, Biehl, Hammer, 2009]
generalized quadratic distance in LVQ: $d_\Lambda(w, x) = (x - w)^\top \Lambda \, (x - w)$ with $\Lambda = \Omega^\top \Omega$
training: adaptation of prototypes and distance measure guided by GLVQ cost function
variants:
- one global, several local, class-wise relevance matrices
- rectangular $\Omega$: low-dim. representation / visualization [Bunte et al., 2012]
- diagonal matrices: single feature weights [Hammer et al., 2002]
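The generalized quadratic distance of GMLVQ is usually parameterized as d(w, x) = (x − w)ᵀ Ωᵀ Ω (x − w), which guarantees non-negative values. A minimal sketch of evaluating it follows; the function name and the nested-list layout of Ω (one list per row) are assumptions made for the example.

```python
def gmlvq_distance(x, w, omega):
    """(x - w)^T Omega^T Omega (x - w), computed as the squared norm of Omega (x - w)."""
    diff = [xi - wi for xi, wi in zip(x, w)]
    # linear projection of the difference vector by Omega
    projected = [sum(o * d for o, d in zip(row, diff)) for row in omega]
    return sum(p * p for p in projected)

identity = [[1.0, 0.0], [0.0, 1.0]]
print(gmlvq_distance((1.0, 2.0), (0.0, 0.0), identity))      # squared Euclidean distance
print(gmlvq_distance((1.0, 2.0), (0.0, 0.0), [[1.0, 1.0]]))  # rectangular Omega: 1-dim. projection
```

With a rectangular Ω of two or three rows, the projected vectors can be plotted directly, which is the low-dimensional visualization variant.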

interpretation after training:
∙ prototypes represent typical class properties or subtypes
∙ Relevance Matrix: $\Lambda_{ij}$ quantifies the contribution of the pair of features (i,j) to the distance
∙ $\Lambda_{ii}$ summarizes the contribution of a single dimension: the relevance of original features in the classifier
Note: interpretation assumes implicitly that features have equal order of magnitude, e.g. after z-score-transformation $x_i \to (x_i - \bar{x}_i) / \sigma_i$ (averages over data set)
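Given a trained matrix Ω, the relevance matrix Λ = ΩᵀΩ and its diagonal can be read off as sketched below. The function names are illustrative and Ω is assumed to be stored as a list of rows.

```python
def relevance_matrix(omega):
    """Lambda = Omega^T Omega for Omega given as a list of rows."""
    n = len(omega[0])
    return [[sum(row[i] * row[j] for row in omega) for j in range(n)]
            for i in range(n)]

def feature_relevances(omega):
    """Diagonal of Lambda: one non-negative relevance per original feature."""
    lam = relevance_matrix(omega)
    return [lam[i][i] for i in range(len(lam))]

omega = [[1.0, 0.0], [1.0, 2.0]]  # hypothetical trained matrix
print(relevance_matrix(omega))
print(feature_relevances(omega))
```

Ranking features by these diagonal entries is only meaningful if all features live on comparable scales, hence the note on the z-score transformation.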

three application examples
I) steroid metabolomics:
- detection of malignancy in adrenocortical tumors based on urinary steroid metabolite excretion
- GMLVQ: ~ 150 samples, 32-dim. feature vectors
II) cytokine expression data:
- diagnosis of (early) rheumatoid arthritis based on synovial tissue samples
- ~ 50 samples represented by 117 cytokine expressions in synovial tissue, PCA+GMLVQ combined
III) gene expression data:
- recurrence risk prediction from tumor samples
- ~ 400 samples, ~20000 dim. feature space
- outlier analysis + GMLVQ on (80) pre-selected genes

Steroid metabolomics: detecting malignancy in adrenocortical tumors
W. Arlt, M. Biehl, A. Taylor, S. Hahner, R. Libé, B. Hughes, P. Schneider, D. Smith, H. Stiekema, N. Krone, E. Porfiri, G. Opocher, J. Bertherat, F. Mantero, B. Allolio, M. Terzolo, P. Nightingale, C. Shackleton, X. Bertagna, M. Fassnacht, P. Stewart
Urine Steroid Metabolomics as a Biomarker Tool for Detecting Malignancy in Patients with Adrenal Tumors
J Clinical Endocrinology & Metabolism 96: (2011)

steroid metabolomics
classification of adrenocortical tumors (adenoma vs. carcinoma) based on steroid hormone excretion profiles
- benign: ACA
- malignant: ACC
features: 32 steroid metabolite excretion values, non-invasive measurement (24 hrs. urine samples)
aim: develop a novel biomarker tool for differential diagnosis
idea: identify characteristic steroid profiles (prototypes)

steroid metabolomics
[Arlt et al., 2011] [Biehl et al., 2012]
Generalized Matrix LVQ, ACC vs. ACA classification
∙ data divided in 90% training, 10% test set (z-score transformed)
∙ determine prototypes: typical profiles (1 per class)
∙ adaptive generalized quadratic distance measure parameterized by the relevance matrix $\Lambda$
∙ apply classifier to test data, evaluate performance (error rates, ROC)
∙ repeat and average over many random splits
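The splitting and z-score steps of this protocol can be sketched as follows. The helper names are hypothetical, and one detail is an assumption not stated on the slide: means and standard deviations are estimated on the training portion only and then applied to the test portion.

```python
import random
import statistics

def random_split(n_samples, test_fraction=0.1, rng=random):
    """Return (train_indices, test_indices) of one randomized split."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_test = max(1, round(test_fraction * n_samples))
    return idx[n_test:], idx[:n_test]

def zscore_fit(rows):
    """Per-feature mean and standard deviation (constant features get std 1)."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, stds

def zscore_apply(rows, means, stds):
    """Transform every feature to zero mean and unit variance w.r.t. the fit."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

# one of many random splits; training and evaluation would go inside a loop
train_idx, test_idx = random_split(150, 0.1, random.Random(0))
print(len(train_idx), len(test_idx))
```

Averaging error rates and ROC curves over many such splits gives the reported performance estimates.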

steroid metabolomics
prototypes: steroid excretion in ACA/ACC
[figure: prototype profiles, panels ACA and ACC]

steroid metabolomics
Relevance matrix
- relevance of single markers
- relevance of pairs of markers
frequency of markers to be among top 9
subset of selected steroids ↔ technical realization (patented, UoB)
using 9 markers only, similar ROC

steroid metabolomics
ROC characteristics: clear improvement due to adaptive distances
90% / 10% randomized splits of the data in training and test set, averages over 1000 runs
[figure: ROC curves, sensitivity vs. (1-specificity)]
AUC: 0.87 (Euclidean), 0.93 (diagonal rel.), 0.97 (full matrix)
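The area under the ROC curve can be computed without drawing the curve: it equals the fraction of (positive, negative) pairs that the classifier score ranks correctly, with ties counted as one half. The sketch below uses this pairwise form; it is an illustration and not the evaluation code behind the reported numbers, and the scores are made up.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores (labels: 1 or 0)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # a pair counts 1 if the positive sample scores higher, 0.5 on a tie
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# hypothetical scores: larger means "more likely malignant"
print(auc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0]))
```

For a prototype classifier, a natural score is the difference between the distances to the closest prototypes of the two classes.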

steroid metabolomics
Relevance matrix: diagonal elements, off-diagonal elements
discriminative, e.g., steroid 19 (THS)
[figure: steroid 19 (THS) in ACA and ACC]

steroid metabolomics
highly discriminative combination of markers!
weakly discriminative markers
[figure: TH-Doc (12) vs. 5a-THA (8)]

adrenocortical tumors
[figure: ROC curves, sensitivity vs. (1-specificity)]
AUC: 0.87 (Euclidean), 0.93 (diagonal rel., GRLVQ), 0.97 (full matrix, GMLVQ)

visualization of the data set
generic property: relevance matrix becomes highly singular
[figure: visualization of ACA and ACC samples]

work in progress
- high-throughput LC/MS assay to replace GC/MS
- on-going prospective study w.r.t. ~ 2000 patients
- monitoring of patients after surgery and/or under medication; aim: recurrence detection / prediction
- identification of tumor subtypes?
- other disorders affecting / related to steroid metabolism

Early diagnosis of Rheumatoid Arthritis
L. Yeo, N. Adlard, M. Biehl, M. Juarez, M. Snow, C.D. Buckley, A. Filer, K. Raza, D. Scheel-Toellner
Expression of chemokines CXCL4 and CXCL7 by synovial macrophages defines an early stage of rheumatoid arthritis
Annals of the Rheumatic Diseases 75: (2016)

rheumatoid arthritis (RA)
[figure: uninflamed control, early inflammation (resolving or early RA), established RA]
ultimate goals:
- understand pathogenesis and mechanism of progression
- cytokine based diagnosis of RA at earliest possible stage?

synovial tissue cytokine expression
synovium: tissue section, mRNA extraction, real-time PCR
panel of 117 cytokines
- cell signaling proteins
- regulate immune response
- produced by, e.g. T-cells, macrophages, lymphocytes, fibroblasts, etc.

GMLVQ analysis
pre-processing:
- log-transformed expression values
- 21 leading principal components explain 95% of the variation
Two two-class problems:
(A) established RA vs. uninflamed controls
(B) early RA vs. resolving inflammation
- 1 prototype per class, global relevance matrix, distance measure: $d_\Lambda(w, x) = (x - w)^\top \Lambda \, (x - w)$
- leave-two-out validation (one from each class)
- evaluation in terms of Receiver Operating Characteristics
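How many leading principal components are needed to explain a given share of the variation follows from the cumulative sum of the PCA eigenvalues. The sketch below assumes the eigenvalues of the covariance matrix are already available; the function name and the example values are hypothetical.

```python
def n_components_for(eigenvalues, target=0.95):
    """Smallest number of leading components whose variance share reaches the target."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += ev
        if cumulative / total >= target:
            return k
    return len(eigenvalues)

# hypothetical spectrum: two dominant directions, two minor ones
print(n_components_for([5.0, 3.0, 1.0, 1.0], target=0.8))
```

Applying GMLVQ in the reduced space keeps the number of adaptive matrix elements manageable when there are few samples and many features.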

Matrix Relevance LVQ
(A) established RA vs. uninflamed control
(B) early RA vs. resolving inflammation
[figures for (A) and (B): leave-one-out ROC, true positive rate vs. false positive rate; diagonal relevances Λii vs. cytokine index i]

protein level studies
CXCL4: chemokine (C-X-C motif) ligand 4
direct study on protein level, staining / imaging of synovial tissue:
- macrophages: predominant source of CXCL4/7 expression
- high levels of CXCL4 and CXCL7 in early RA
- expression on macrophages outside of blood vessels discriminates early RA / resolving cases

relevant cytokines
(A) established RA vs. uninflamed control
(B) early RA vs. resolving inflammation
[figures for (A) and (B): leave-one-out ROC, true positive rate vs. false positive rate; diagonal relevances Λii vs. cytokine index i; labeled cytokine: macrophage stimulating 1]

work in progress
- more samples (difficult...) needed in order to obtain a reliable early diagnosis
- integrated analysis of gene expression and other data from the same / an analogous patient cohort

Predicting Recurrence in Clear Cell Renal Cell Carcinoma
Analysis of TCGA data using Outlier Analysis and GMLVQ
Gargi Mukherjee, Rutgers University, New Jersey
Kevin Raines, Stanford University, California
Srikanth Sastry, JNC, Bengaluru, India
Sebastian Doniach, Stanford University, California
Gyan Bhanot, Rutgers University, New Jersey
Michael Biehl, University of Groningen, The Netherlands
In: Proc. IEEE Congress on Evolutionary Computation CEC 2016

data
clear cell Renal Cell Carcinoma (ccRCC)
publicly available datasets: The Cancer Genome Atlas (TCGA), cancergenome.nih.gov
also hosted at Broad Institute, gdac.broadinstitute.org

data
clear cell renal cell carcinoma, TCGA data @ Broad Institute
mRNA-Seq expression data X, normalized, log-transformed: Y=log(1+X)
- 469 tumor samples in total
- 65 normal samples, 65 matched tumor samples
- 20532 genes
recurrence data: [figure: number of recurrences vs. days after diagnosis]

outlier analysis
randomized split: 380 training samples, 89 test samples
fast forward to machine learning analysis

outlier analysis
380 training samples
per gene: determine mean μ, standard deviation σ of Y
for each gene: identify outlier samples
- Y > μ + σ: "high outlier"
- Y < μ - σ: "low outlier"
restrict the following analysis to genes with ≥ 20 high outlier samples or ≥ 20 low outlier samples
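The per-gene outlier rule and the ≥ 20 filter can be sketched as below. Function names are illustrative, and the use of the population standard deviation is an assumption, since the slide does not say which estimator was used.

```python
import statistics

def outlier_status(values):
    """Per sample: +1 for a high outlier, -1 for a low outlier, 0 otherwise."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    return [1 if v > mu + sigma else -1 if v < mu - sigma else 0 for v in values]

def keep_gene(values, min_outliers=20):
    """Keep a gene if it has enough high-outlier or enough low-outlier samples."""
    status = outlier_status(values)
    return status.count(1) >= min_outliers or status.count(-1) >= min_outliers

# hypothetical log-expression values of one gene over five samples
print(outlier_status([0.0, 0.0, 0.0, 0.0, 10.0]))
```

In the study, `values` would hold Y = log(1+X) of one gene over the 380 training samples, and the filter would run over all 20532 genes.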

outlier analysis
Kaplan-Meier (KM) analysis per gene: test for significant association of outlier status of samples with recurrence
- 1546 "high-outlier genes" with KM log rank p < 0.001
- 1628 "low-outlier genes" passing the corresponding KM log rank test
construct two binary outlier matrices:
- 1546 genes × 380 samples: "1" for high-outlier samples, "0" else
- 1628 genes × 380 samples: "1" for low-outlier samples, "0" else
→ PCA

outlier analysis
PCA reveals four clusters of genes:
- high outlier genes: A (1475 genes), B (71 genes)
- low outlier genes: C (1402 genes), D (226 genes)
genes in large clusters (A,C): outlier status associated with early recurrence
genes in small clusters (B,D): outlier status associated with late recurrence

recurrence risk score
top 20 genes (by KM p-value) from each cluster A,B,C,D: reference set of 80 genes
for each sample:
- determine outlier status w.r.t. the 80 genes (Y > μ + σ or Y < μ - σ)
- add up contributions per gene:
  ∙ a unit contribution if sample is outlier w.r.t. a gene in A or C (early rec.)
  ∙ 0 if sample is not an outlier w.r.t. the gene
  ∙ a unit contribution of opposite sign if sample is outlier w.r.t. a gene in B or D (late rec.)
recurrence risk score: -40 ≤ R ≤ +40
observe: median = 2 over the 380 training samples
crisp classification w.r.t. recurrence risk:
- high risk (early recurrence) if R < 2
- low risk (late recurrence) if R ≥ 2
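The scoring rule can be sketched as below. The sign convention is an assumption: the per-gene values did not survive in the transcript, so the sketch uses -1 for outliers in the early-recurrence clusters (A, C) and +1 for the late-recurrence clusters (B, D), chosen only because it is consistent with the stated rule "high risk if R < 2". Gene names and function names are hypothetical.

```python
EARLY_CLUSTERS = {"A", "C"}  # outlier status associated with early recurrence

def risk_score(outlier_genes, cluster_of_gene, early_contribution=-1):
    """Sum per-gene contributions over the panel genes for which the sample is an outlier.

    early_contribution=-1 is an assumed sign convention, see the text above.
    """
    score = 0
    for gene in outlier_genes:
        if cluster_of_gene[gene] in EARLY_CLUSTERS:
            score += early_contribution
        else:  # clusters B and D: associated with late recurrence
            score -= early_contribution
    return score

def risk_group(score, threshold=2):
    """Crisp classification at the training-set median (2 on the slide)."""
    return "high" if score < threshold else "low"

# hypothetical four-gene panel, one gene per cluster
clusters = {"g1": "A", "g2": "B", "g3": "C", "g4": "D"}
r = risk_score(["g1", "g3"], clusters)
print(r, risk_group(r))
```

With 20 genes per cluster, the score of a sample ranges over the 40 early-associated and 40 late-associated genes of the panel.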

recurrence risk prediction
KM plots with respect to high / low risk groups:
- training set (380 samples): log rank p < 1.e-16
- test set (89 samples): log rank p < 1.e-4
risk score R is predictive of the actual recurrence risk
the 80 selected genes can serve as a prognostic panel

extreme case analysis
outlier analysis yields 4 groups (A,B,C,D) of 20 pre-selected genes associated with late/early recurrence: 80-dim. feature vectors
[figure: number of recurrences over time]
- ≤ 2 years (early)
- otherwise (undefined)
- > 5 years (late or no recurrence)
2 classes:
- class 1, low risk: 107 samples
- class 2, high risk: 109 samples

GMLVQ classifier
one prototype vector per class
[figure: components of the prototypes over gene groups A, B, C, D; low expression | high expression]
adaptive distance for comparison of samples and prototypes
[figure: diagonal elements of Λ over gene groups A, B, C, D]

GMLVQ classifier
ROC of GMLVQ classifier (Leave-One-Out of the 216 extreme samples)
KM plot w.r.t. all 469 samples (L-1-O for 216 samples, plus 253 undefined): log rank p < 1.e-7

diagnostics?
the set of 80 genes is also diagnostic: GMLVQ separates normal from tumor cells (close to) perfectly
PCA of corresponding gene expressions: gradient from normal to high risk
- 65 normal samples
- 105 low risk samples (late rec.)
- 109 high risk samples (early rec.)

most relevant genes (GMLVQ)
from GMLVQ classifier

remarks and open questions
- prospective studies
- 80 genes do not necessarily reflect biological mechanisms: compare, e.g., with known pathways / modules of genes
- GMLVQ suggests an even smaller panel of genes (12?): identify a minimum panel for diagnostics and prognostics
- more direct, multivariate identification of relevant genes by dimension reduction + GMLVQ with back-transform

conclusion
prototype- and distance based systems:
- intuitive, transparent, interpretable
- classification, regression, unsupervised learning, visualization ...
- relevance learning: further insight into data and problem
- suitable for a variety of bio-medical problems
a recent review: M. Biehl, B. Hammer, T. Villmann, Prototype-based models in Machine Learning, Advanced Review in: WIRES Cognitive Science 7(2): (2016)

links
Matlab code: Relevance and Matrix adaptation in Learning Vector Quantization (GRLVQ, GMLVQ and LiRaM LVQ):
A no-nonsense beginners’ tool for GMLVQ: (see also: Tutorial, Thursday 9:30)
Pre- and re-prints etc.:

thanks
Barbara Hammer, Thomas Villmann, Wiebke Arlt, Dagmar Scheel-Toellner, Petra Schneider, Kerstin Bunte, Gyan Bhanot
