1 PCA, Clustering and Classification by Agnieszka S. Juncker
Parts of the slides are adapted from Chris Workman

2 Motivation: Multidimensional data
[Table: raw expression values for ~50 Affymetrix probe sets (rows, e.g. 209619_at, 32541_at, ...) across nine patients, Pat1–Pat9 (columns)]

3 Outline
Dimension reduction
–PCA
–Clustering
Classification
Example: study of childhood leukemia

4 Childhood Leukemia
Cancer in the cells of the immune system
Approx. 35 new cases in Denmark every year
50 years ago – all patients died
Today – approx. 78% are cured
Risk groups
–Standard
–Intermediate
–High
–Very high
–Extra high
Treatment
–Chemotherapy
–Bone marrow transplantation
–Radiation

5 Prognostic Factors
Factor | Good prognosis | Poor prognosis
Immunophenotype | precursor B | T
Age | 1–9 | ≥10
Leukocyte count | Low (<50×10^9/L) | High (>100×10^9/L)
Number of chromosomes | Hyperdiploidy (>50) | Hypodiploidy (<46)
Translocations | t(12;21) | t(9;22), t(1;19)
Treatment response | Good response | Poor response

6 Study of Childhood Leukemia
Diagnostic bone marrow samples from leukemia patients
Platform: Affymetrix Focus Array
–8763 human genes
Immunophenotype
–18 patients with precursor B immunophenotype
–17 patients with T immunophenotype
Outcome 5 years from diagnosis
–11 patients with relapse
–18 patients in complete remission

7 Principal Component Analysis (PCA)
–used for visualization of complex data
–designed to capture as much of the variation in the data as possible

8 Principal components
1st principal component (PC1)
–the direction along which there is the greatest variation
2nd principal component (PC2)
–the direction with the maximum remaining variation, orthogonal to PC1
Principal components in general
–linear combinations of the original variables
–uncorrelated with each other
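A minimal sketch of how such principal components can be computed, assuming the expression data is available as a NumPy array `X` with patients as rows and genes as columns (the array, its dimensions and the function name are illustrative, not the actual study data):

```python
import numpy as np

def first_principal_components(X, n_components=2):
    """Project samples onto the first principal components.

    X: (n_samples, n_features) expression matrix, e.g. patients x genes.
    Returns the sample coordinates along PC1, PC2, ... and the variance fractions.
    """
    # Center each feature (gene) so the PCs describe variation around the mean
    X_centered = X - X.mean(axis=0)
    # SVD of the centered matrix; the rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Coordinates of the samples in PC space (scores)
    scores = U[:, :n_components] * S[:n_components]
    # Fraction of the total variance explained by each component
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Example with random data standing in for the patients x ~8700 genes matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(34, 8763))
scores, explained = first_principal_components(X)
print(scores.shape)   # (34, 2) -> one (PC1, PC2) point per patient
print(explained)      # variance fraction captured by PC1 and PC2
```

`sklearn.decomposition.PCA(n_components=2).fit_transform(X)` gives the same scores (up to sign), which is convenient when plotting the patients in two dimensions as on the following slides.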

9 Principal components

10 PCA - example

11 PCA on all Genes
Leukemia data, precursor B and T
Plot of 34 patients with the dimension reduced from 8763 genes to 2

12 Outcome: PCA on all Genes

13 Principal components - Variance

14 Clustering methods
Hierarchical
–agglomerative (bottom-up), e.g. UPGMA
–divisive (top-down)
Partitioning
–e.g. K-means clustering

15 Hierarchical clustering
Representation of all pairwise distances
Parameters: none (apart from the choice of distance measure)
Results
–all items end up in one large cluster
–hierarchical tree (dendrogram)
Deterministic

16 Hierarchical clustering – UPGMA Algorithm
Assign each item to its own cluster
Join the two nearest clusters
Re-estimate the distances between the new cluster and the remaining clusters
Repeat until all items are joined in a single cluster
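As a sketch of this procedure, SciPy's agglomerative clustering with average linkage corresponds to UPGMA; the toy expression matrix below is purely illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy expression matrix: 6 samples x 4 genes (stand-in values)
X = np.array([
    [1.0, 2.0, 1.5, 0.5],
    [1.1, 2.1, 1.4, 0.6],
    [5.0, 4.8, 5.2, 5.1],
    [4.9, 5.1, 5.0, 4.8],
    [9.0, 8.7, 9.2, 9.1],
    [8.8, 9.1, 9.0, 8.9],
])

# 'average' linkage is UPGMA: the distance between two clusters is the
# mean of all pairwise distances between their members.
Z = linkage(X, method="average", metric="euclidean")

# Z encodes the dendrogram: each row joins two clusters at a given height
print(Z)

# Cut the tree into 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)   # e.g. [1 1 2 2 3 3]

# dendrogram(Z) would draw the tree with matplotlib, if desired
```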

17 Hierarchical clustering

18 Data with clustering order and distances Dendrogram representation

19 Leukemia data - clustering of patients

20 Leukemia data - clustering of genes

21 K-means clustering
Partition data into K clusters
Parameter: the number of clusters (K) must be chosen
Randomized initialization
–may give different clusters each time

22 K-means - Algorithm
Assign each item a class in 1 to K (randomly)
For each class 1 to K
–calculate the centroid (one of the K means)
–calculate the distance from the centroid to each item
Assign each item to the nearest centroid
Repeat until no items are re-assigned (convergence)
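A minimal NumPy sketch of this algorithm; the function name, the iteration cap and the handling of emptied clusters are this example's own choices, not prescribed by the slide:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: assign each item a class in 1..K at random
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iter):
        # Step 2: centroid of each class (the "K means");
        # an emptied cluster is re-seeded with a random item
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Step 3: distance from every item to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: re-assign each item to its nearest centroid
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # convergence: no re-assignments
            break
        labels = new_labels
    return labels, centroids

# Example: three well-separated blobs -> three recovered clusters
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, size=(20, 2)) for c in (0.0, 5.0, 10.0)])
labels, centroids = kmeans(X, k=3)
print(centroids)
```

Because of the random initialization in step 1, different seeds can give different clusterings, which is exactly the non-determinism mentioned on the previous slide.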

23 K-means clustering, K=3

24

25

26 K-means clustering of Leukemia data

27 Comparison of clustering methods
Hierarchical clustering
–distances between all variables
–time-consuming with a large number of genes
–advantageous to cluster on a selected subset of genes
K-means clustering
–faster algorithm
–does not show relations between all variables

28 Distance measures
Euclidean distance
Vector angle distance
Pearson distance
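A sketch of how these three measures could be computed for two expression profiles; here the vector-angle and Pearson distances are taken as one minus the cosine similarity and one minus the Pearson correlation, respectively (one common convention, though others exist):

```python
import numpy as np

def euclidean_distance(x, y):
    return np.linalg.norm(x - y)

def vector_angle_distance(x, y):
    # 1 - cosine of the angle between the two expression profiles
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - cos

def pearson_distance(x, y):
    # 1 - Pearson correlation; insensitive to shifts and scaling of the profiles
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])      # same shape of profile, different magnitude
print(euclidean_distance(x, y))         # large: magnitudes differ
print(vector_angle_distance(x, y))      # ~0: same direction
print(pearson_distance(x, y))           # ~0: perfectly correlated
```

The example shows why the choice of measure matters: two genes with the same expression pattern at different scales are far apart in Euclidean distance but nearly identical under the angle- and correlation-based measures.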

29 Comparison of distance measures

30 Classification
Feature selection
Classification methods
Cross-validation
Training and testing

31 Reduction of input features
Dimension reduction
–PCA
Feature selection (gene selection)
–significant genes: t-test
–selection of a limited number of genes
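A sketch of t-test-based gene selection, assuming an expression matrix `X` (samples × genes) and binary class labels `y`; the data and the helper name below are synthetic and only for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

def select_top_genes(X, y, n_genes=50):
    """Rank genes by a two-sample t-test between the two classes.

    X: (n_samples, n_genes) expression matrix
    y: binary class labels, e.g. 0 = precursor B, 1 = T
    Returns the column indices of the n_genes most significant genes.
    """
    t_stat, p_values = ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.argsort(p_values)[:n_genes]

# Toy data: 20 samples, 1000 genes, with the first 10 genes differentially expressed
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 1000))
y = np.array([0] * 10 + [1] * 10)
X[y == 1, :10] += 3.0

top = select_top_genes(X, y, n_genes=10)
print(sorted(top))   # mostly indices 0..9
```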

32 Microarray Data
Class: precursor B / T / ...
         | Patient1 | Patient2 | Patient3 | Patient4 | ...
Gene1    |   1789.5 |    963.9 |   2079.5 |   3243.9 | ...
Gene2    |     46.4 |     52.0 |     22.3 |     27.1 | ...
Gene3    |    215.6 |    276.4 |    245.1 |    199.1 | ...
Gene4    |    176.9 |    504.6 |    420.5 |    380.4 | ...
Gene5    |   4023.0 |   3768.6 |   4257.8 |   4451.8 | ...
Gene6    |     12.6 |     12.1 |     37.7 |     38.7 | ...
...
Gene8793 |    312.5 |    415.9 |   1045.4 |   1308.0 | ...

33 Outcome: PCA on Selected Genes

34 Outcome Prediction: CC against the Number of Genes

35 Nearest Centroid
Calculate a centroid for each class
Calculate the distance between a test sample and each class centroid
Predict the class of the test sample by the nearest centroid
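A minimal NumPy sketch of these three steps (the function names and the toy two-gene data are illustrative):

```python
import numpy as np

def nearest_centroid_fit(X_train, y_train):
    """One centroid per class: the mean expression profile of its training samples."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(X_test, classes, centroids):
    """Assign each test sample to the class whose centroid is nearest (Euclidean)."""
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Toy usage with two classes, "B" and "T", and two genes
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.2]])
y_train = np.array(["B", "B", "T", "T"])
classes, centroids = nearest_centroid_fit(X_train, y_train)
print(nearest_centroid_predict(np.array([[1.1, 1.0], [5.0, 5.0]]), classes, centroids))
# -> ['B' 'T']
```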

36 K-Nearest Neighbor (KNN)
Based on a distance measure
–for example Euclidean distance
Parameter k = number of nearest neighbors
–k=1
–k=3
–k=...
Prediction by majority vote (odd k avoids ties)
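A sketch of KNN prediction for a single test sample, using Euclidean distance and a majority vote; the helper below is illustrative, not a library routine:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test sample by a majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_test, axis=1)  # Euclidean distance to each training sample
    nearest = np.argsort(dists)[:k]                   # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                 # majority class (odd k avoids ties)

X_train = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.2], [5.1, 4.9]])
y_train = np.array(["B", "B", "T", "T", "T"])
print(knn_predict(X_train, y_train, np.array([4.9, 5.0]), k=3))   # -> 'T'
```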

37 Neural Networks
[Diagram: input layer with genes (Gene2, Gene6, ..., Gene8793), a layer of hidden neurons, and an output layer with the classes B and T]

38 Comparison of Methods
Nearest centroid / KNN | Neural networks
Simple methods | Advanced methods
Based on distance calculation | Involve machine learning
Good for simple problems | Several adjustable parameters
Good for few training samples | Many training samples required (e.g. 50-100)
Distribution of data assumed | Flexible methods

39 Cross-validation
Data: 10 samples
5-fold cross-validation
–training: 4/5 of the data (8 samples)
–testing: 1/5 of the data (2 samples)
–> 5 different models
Leave-one-out cross-validation (LOOCV)
–training: 9/10 of the data (9 samples)
–testing: 1/10 of the data (1 sample)
–> 10 different models
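A sketch of how these splits can be generated with scikit-learn; the toy 10-sample data mirrors the numbers on the slide:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)                  # 10 samples, 2 features (stand-in data)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 5-fold cross-validation: each model trains on 8 samples and tests on 2
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print("train:", len(train_idx), "test:", len(test_idx))

# Leave-one-out: 10 models, each tested on a single held-out sample
n_models = sum(1 for _ in LeaveOneOut().split(X))
print(n_models)   # 10
```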

40 Validation
Definition of
–true and false positives
–true and false negatives
Actual class:    B   B   T   T
Predicted class: B   T   T   B
Outcome:         TP  FN  TN  FP
(with B as the positive class)

41 Accuracy
Definition: (TP + TN) / (TP + TN + FP + FN)
Range: 0 – 100%

42 Matthews correlation coefficient
Definition: MCC = (TP·TN − FN·FP) / √((TN+FN)(TN+FP)(TP+FN)(TP+FP))
Range: (−1) – 1
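A sketch of both measures computed directly from the confusion-matrix counts; the example counts are made up for illustration:

```python
import math

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_cc(tp, tn, fp, fn):
    denom = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    if denom == 0:
        return 0.0               # conventional value when any margin of the table is empty
    return (tp * tn - fn * fp) / denom

# Example: 20 B patients and 15 T patients, with B treated as the positive class
tp, tn, fp, fn = 18, 13, 2, 2
print(accuracy(tp, tn, fp, fn))      # ~0.886
print(matthews_cc(tp, tn, fp, fn))   # ~0.77
```

Unlike accuracy, MCC stays informative when the classes are unbalanced, which is why both are reported for the outcome prediction later in the talk.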

43 Overview of Classification
Expression data
Subdivision of data for cross-validation into training sets and test sets
Feature selection (t-test)
Dimension reduction (PCA)
Training of classifier
–using cross-validation
–choice of method
–choice of optimal parameters
Testing of classifier on an independent test set

44 Important Points
Avoid overfitting
Validate performance
–test on an independent test set
–or by using cross-validation
Include feature selection in the cross-validation
Why?
–to avoid overestimation of performance!
–to obtain a general classifier

45 Study of Childhood Leukemia: Results
Classification of immunophenotype (precursor B and T)
–100% accuracy, both during training and when testing on an independent test set
–simple classification methods applied: K-nearest neighbor and nearest centroid
Classification of outcome (relapse or remission)
–78% accuracy (CC = 0.59)
–simple and advanced classification methods applied

46 Risk classification in the future?
Prognostic factors: immunophenotype, age, leukocyte count, number of chromosomes, translocations, treatment response
Risk group: standard, intermediate, high, very high, extra high
Patient: clinical data, immunophenotyping, morphology, genetic measurements, microarray technology
–> custom designed treatment

47 Summary
Dimension reduction is important to visualize data
–Principal Component Analysis
–Clustering: hierarchical and partitioning (K-means); the distance measure is important
Classification
–reduction of dimension often necessary (t-test, PCA)
–several classification methods available
–validation

48 Coffee break

