PCA, Clustering and Classification by Agnieszka S. Juncker

PCA, Clustering and Classification by Agnieszka S. Juncker. Some of the slides are adapted from Chris Workman.

Motivation: Multidimensional data

Probe ID      Pat1  Pat2  Pat3  Pat4  Pat5  Pat6  Pat7  Pat8  Pat9  Pat10
209619_at     7758  4705  5342  7443  8747  4933  7950  5031  5293  7546
32541_at       280   387   392   238   385   329   337   163   225   288
206398_s_at   1050   835  1268  1723  1377   804  1846  1180   252  1512
219281_at      391   593   298   265   491   517   334   387   285   507
207857_at     1425   977  2027  1184   939   814   658   593   659  1318
211338_at       37    27    28    38    33    16    36    23    31    30
213539_at      124   197   454   116   162   113    97    97   160   149
221497_x_at    120    86   175    99   115    80    83   119    66   113
213958_at      179   225   449   174   185   203   186   185   157   215
210835_s_at    203   144   197   314   250   353   173   285   325   215
209199_s_at    758  1234   833  1449   769  1110   987   638  1133  1326
217979_at      570   563   972   796   869   494   673  1013   665  1568
201015_s_at    533   343   325   270   691   460   563   321   261   331
203332_s_at    649   354   494   554   710   455   748   392   418   505
204670_x_at   5577  3216  5323  4423  5771  3374  4328  3515  2072  3061
208788_at      648   327  1057   746   541   270   361   774   590   679
210784_x_at    142   151   144   173   148   145   131   146   147   119
204319_s_at    298   172   200   298   196   104   144   110   150   341
205049_s_at   3294  1351  2080  2066  3726  1396  2244  2142  1248  1974
202114_at      833   674   733  1298   862   371   886   501   734  1409
213792_s_at    646   375   370   436   738   497   546   406   376   442
203932_at     1977  1016  2436  1856  1917   822  1189  1092   623  2190
203963_at       97    63    77   136    85    74    91    61    66    92
203978_at      315   279   221   260   227   222   232   141   123   319
203753_at     1468  1105   381  1154   980  1419  1253   554  1045   481
204891_s_at     78    71   152    74   127    57    66   153    70   108
209365_s_at    472   519   365   349   756   528   637   828   720   273
209604_s_at    772    74   130   216   108   311    80   235   177   191
211005_at       49    58   129    70    56    77    61    61    75    72
219686_at      694   342   345   502   960   403   535   513   258   386
38521_at       775   604   305   563   542   543   725   587   406   906
217853_at      367   168   107   160   287   264   273   113    89   363
217028_at     4926  2667  3542  5163  4683  3281  4822  3978  2702  3977
201137_s_at   4733  2846  1834  5471  5079  2330  3345  1460  2317  3109
202284_s_at    600  1823  1657  1177   972  2303  1574  1731  1047  2054
201999_s_at    897   959   800   808   297  1014   998   663   491   613
221737_at      265   200   130   245   192   246   227   228   108   394
205456_at       63    64   100    60    82    65    53    73    71    81
201540_at      821  1296  1651   858   613  1144  1549  1462  1813  2112
219371_s_at   1477  2107   837  1534  2407  1104  1688  2956  1233  1313
205297_s_at    418   394   293   778   405   308   447  1005   709   201
208650_s_at   1025   455   685   872   718   884   534   863   219   846
210031_at      288   162   205   155   194   150   185   184   141   206
203675_at      268   388   318   256   413   279   239   246  1098   532
205255_x_at    677   308   679   540   398   447   428   333   197   417
202598_at      176   342   298   174   174   413   352   323   459   311
201022_s_at    251   193   116   106   155   285   221   242   377   217
218205_s_at   1028  1266  2085  1790  1096  2302  1925  1148   787  2700
207820_at       63    43    53    97   102    54    75    48    30    75
202207_at       77   217   241    67   441   318   474    83    72   130

Outline
- Dimension reduction: PCA, clustering
- Classification
- Example: study of childhood leukemia

Childhood Leukemia
- Cancer in the cells of the immune system
- Approx. 35 new cases in Denmark every year
- 50 years ago: all patients died
- Today: approx. 78% are cured
- Risk groups: standard, intermediate, high, very high, extra high
- Treatment: chemotherapy, bone marrow transplantation, radiation

Prognostic Factors

Factor                 Good prognosis       Poor prognosis
Immunophenotype        precursor B          T
Age                    1-9                  ≥10
Leukocyte count        Low (<50×10⁹/L)      High (>100×10⁹/L)
Number of chromosomes  Hyperdiploidy (>50)  Hypodiploidy (<46)
Translocations         t(12;21)             t(9;22), t(1;19)
Treatment response     Good response        Poor response

Study of Childhood Leukemia
- Diagnostic bone marrow samples from leukemia patients
- Platform: Affymetrix Focus Array, 8763 human genes
- Immunophenotype: 18 patients with precursor B immunophenotype, 17 patients with T immunophenotype
- Outcome 5 years from diagnosis: 11 patients with relapse, 18 patients in complete remission

Principal Component Analysis (PCA)
- Used for visualization of complex data
- Developed to capture as much of the variation in the data as possible

Principal components
- First principal component (PC1): the direction along which there is the greatest variation
- Second principal component (PC2): the direction with the maximum variation left in the data, orthogonal to PC1
- In general, principal components are linear combinations of the original variables and are uncorrelated with each other

Principal components

PCA - example

PCA on all Genes
Leukemia data, precursor B and T. Plot of 34 patients; the dimension of 8763 genes is reduced to 2.
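
A minimal sketch of this reduction with scikit-learn (an assumption; the slides do not name a tool). The matrix is random stand-in data shaped like the study, not the real measurements:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 8763))           # stand-in expression matrix: patients x genes

pca = PCA(n_components=2)
scores = pca.fit_transform(X)             # each patient projected onto PC1 and PC2
print(scores.shape)                       # (34, 2) -> the 2-D plot of patients
print(pca.explained_variance_ratio_)      # fraction of variance captured per component
```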

Outcome: PCA on all Genes

Principal components - Variance

Clustering methods
- Hierarchical
  - agglomerative (bottom-up), e.g. UPGMA
  - divisive (top-down)
- Partitioning, e.g. K-means clustering

Hierarchical clustering
- Representation of all pairwise distances
- Parameters: none (only the choice of distance measure)
- Results in one hierarchical tree (dendrogram) that joins everything into one large cluster
- Deterministic

Hierarchical clustering – UPGMA Algorithm
1. Assign each item to its own cluster
2. Join the nearest clusters
3. Re-estimate the distance between clusters
4. Repeat steps 2-3 until all n items are joined (sketched in code below)
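
A sketch of these steps with SciPy on random stand-in data; 'average' linkage is the UPGMA rule:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))                    # stand-in data: 10 samples x 50 genes

d = pdist(X, metric='euclidean')                 # all pairwise distances
Z = linkage(d, method='average')                 # join nearest clusters, re-estimate distances
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
```

Calling scipy.cluster.hierarchy.dendrogram(Z) draws the hierarchical tree from the next slides.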

Hierarchical clustering

Hierarchical clustering
Data with clustering order and distances; dendrogram representation.

Leukemia data - clustering of patients

Leukemia data - clustering of genes

K-means clustering
- Partitions data into K clusters
- Parameter: the number of clusters (K) must be chosen
- Randomized initialization: different clusters each time

K-means - Algorithm
1. Assign each item a class in 1 to K (randomly)
2. For each class 1 to K: calculate the centroid (one of the K means) and the distance from the centroid to each item
3. Assign each item to the nearest centroid
4. Repeat from step 2 until no items are re-assigned (convergence); see the sketch below
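
The steps above transcribed directly into NumPy on random stand-in data (a minimal sketch; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, rng, n_iter=100):
    labels = rng.integers(k, size=len(X))         # step 1: random class in 1..K
    centroids = None
    for _ in range(n_iter):
        centroids = np.array([X[labels == j].mean(axis=0)  # step 2: centroid per class
                              for j in range(k)])
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)         # step 3: nearest centroid
        if np.array_equal(new_labels, labels):    # step 4: stop when nothing is re-assigned
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                      # stand-in data
labels, centroids = kmeans(X, k=3, rng=rng)
```

Because of the randomized initialization, different seeds can give different clusters, as noted on the previous slide.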

K-means clustering, K=3 (successive iterations)

K-means clustering of Leukemia data

K-means clustering of Cell Cycle data

Comparison of clustering methods
- Hierarchical clustering: distances between all variables; time-consuming with a large number of genes; an advantage is to cluster on selected genes
- K-means clustering: faster algorithm; does not show relations between all variables

Distance measures (written out in code below)
- Euclidean distance
- Vector angle distance
- Pearson's distance
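
The three measures written out with NumPy, here as dissimilarities so that 0 means identical profiles:

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def vector_angle(x, y):
    # 1 - cosine of the angle between the expression profiles
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    # 1 - Pearson correlation; like the angle measure, but mean-centred
    return 1 - np.corrcoef(x, y)[0, 1]
```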

Comparison of distance measures

Classification
- Feature selection
- Classification methods
- Cross-validation
- Training and testing

Reduction of input features
- Dimension reduction: PCA
- Feature selection (gene selection): significant genes by t-test; selection of a limited number of genes (a sketch follows below)
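
A sketch of t-test gene selection with SciPy; the data and the cutoff of 100 genes are stand-ins, not the study's actual choices:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 8763))                  # stand-in matrix: 35 patients x 8763 genes
y = np.array([0] * 18 + [1] * 17)                # 18 precursor B vs 17 T

t, p = ttest_ind(X[y == 0], X[y == 1], axis=0)   # one t-test per gene
top = np.argsort(p)[:100]                        # keep the 100 most significant genes
X_reduced = X[:, top]
```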

Microarray Data

Class       precursorB   T          ...
            Patient1     Patient2   Patient3   Patient4
Gene1       1789.5       963.9      2079.5     3243.9
Gene2       46.4         52.0       22.3       27.1
Gene3       215.6        276.4      245.1      199.1
Gene4       176.9        504.6      420.5      380.4
Gene5       4023.0       3768.6     4257.8     4451.8
Gene6       12.6         12.1       37.7       38.7
...
Gene8793    312.5        415.9      1045.4     1308.0

Outcome Prediction: CC against the Number of Genes

Linear discriminant analysis
- Assumptions: data follow a known distribution, e.g. Gaussian; variances and covariances are the same for all classes
- Calculation of class probability
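
A minimal LDA sketch with scikit-learn on random stand-in data; the estimator fits one Gaussian per class with a shared covariance, matching the assumptions above:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 10))        # stand-in training data
y_train = rng.integers(2, size=20)         # stand-in labels (0 = precursor B, 1 = T)
X_test = rng.normal(size=(5, 10))

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print(lda.predict_proba(X_test))           # class probability for each test sample
```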

Nearest Centroid
- Calculate a centroid for each class
- Calculate the distance between a test sample and each class centroid
- Predict the class of the nearest centroid
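
The same idea with scikit-learn's NearestCentroid, using the same kind of stand-in setup as the LDA sketch:

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(20, 10)), rng.integers(2, size=20)
X_test = rng.normal(size=(5, 10))              # stand-in data

nc = NearestCentroid().fit(X_train, y_train)   # computes one centroid per class
print(nc.predict(X_test))                      # class of the nearest centroid
```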

K-Nearest Neighbor (KNN)
- Based on a distance measure, for example Euclidean distance
- Parameter: k = number of nearest neighbors (k = 1, 3, ...)
- Prediction by majority vote; an odd k avoids ties
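
A KNN sketch with scikit-learn on stand-in data; k = 3 is odd, so the majority vote cannot tie:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(20, 10)), rng.integers(2, size=20)
X_test = rng.normal(size=(5, 10))          # stand-in data

knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean').fit(X_train, y_train)
print(knn.predict(X_test))                 # majority vote among the 3 nearest neighbors
```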

Support Vector Machines
- Machine learning method; relatively new and highly theoretical
- Works on non-linearly separable data
- Finds a hyperplane between the two classes by maximizing the distance (margin) between the hyperplane and the closest points
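
A sketch with scikit-learn's SVC on stand-in data; the fit maximizes the margin between the hyperplane and the closest points (the support vectors):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(20, 10)), rng.integers(2, size=20)
X_test = rng.normal(size=(5, 10))                  # stand-in data

svm = SVC(kernel='linear').fit(X_train, y_train)   # e.g. kernel='rbf' for non-linear problems
print(svm.predict(X_test))
```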

Neural Networks
[Network diagram: inputs Gene1, Gene2, Gene3, ...; a layer of hidden neurons; outputs for the classes B and T]

Comparison of Methods

Simple methods: linear discriminant analysis, nearest centroid, KNN
- Based on distance calculation
- Good for simple problems
- Good for few training samples
- Distribution of data assumed

Advanced methods: neural networks, support vector machines
- Involve machine learning
- Several adjustable parameters
- Many training samples required (e.g. 50-100)
- Flexible methods

Cross-validation
Data: 10 samples
- 5-fold cross-validation: training on 4/5 of the data (8 samples), testing on 1/5 of the data (2 samples) -> 5 different models
- Leave-one-out cross-validation (LOOCV): training on 9/10 of the data (9 samples), testing on 1/10 of the data (1 sample) -> 10 different models
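
The two schemes expressed as scikit-learn splitters on 10 stand-in samples:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(-1, 1)           # 10 stand-in samples

for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print(len(train), len(test))           # 8 training, 2 test -> 5 different models

for train, test in LeaveOneOut().split(X):
    print(len(train), len(test))           # 9 training, 1 test -> 10 different models
```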

Validation
Definition of true and false positives, true and false negatives:

Actual class      B    B    T    T
Predicted class   B    T    T    B
Result            TP   FN   TN   FP

Accuracy
Definition: (TP + TN) / (TP + TN + FP + FN)
Range: 0 – 100%

Matthews correlation coefficient
Definition: (TP·TN − FN·FP) / √((TN+FN)(TN+FP)(TP+FN)(TP+FP))
Range: −1 – 1
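
Both measures written out from the definitions above; the counts are hypothetical:

```python
import math

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)           # fraction correct, 0 to 1

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    return (tp * tn - fn * fp) / denom               # undefined if any factor is zero

print(accuracy(tp=8, tn=7, fp=2, fn=3))              # 0.75
print(mcc(tp=8, tn=7, fp=2, fn=3))                   # about 0.50
```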

Overview of Classification
1. Expression data
2. Subdivision of data for cross-validation into training sets and test sets; an independent test set is kept aside
3. Feature selection (t-test) and dimension reduction (PCA)
4. Training of classifier using cross-validation: choice of method, choice of optimal parameters
5. Testing of classifier on the independent test set

Important Points
- Avoid overfitting
- Validate performance: test on an independent test set, or by using cross-validation
- Include feature selection in cross-validation (see the pipeline sketch below)
- Why? To avoid overestimation of performance and to obtain a general classifier
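
One way to keep feature selection inside cross-validation, sketched with a scikit-learn Pipeline on stand-in data (an assumption, not the study's actual code); for two classes the F-test used by SelectKBest ranks genes exactly like a t-test:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 8763))                   # stand-in expression matrix
y = np.array([0] * 18 + [1] * 17)

clf = Pipeline([
    ('select', SelectKBest(f_classif, k=100)),    # re-fitted on each training fold only
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])
print(cross_val_score(clf, X, y, cv=5))           # selecting genes outside this loop
                                                  # would overestimate performance
```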

Study of Childhood Leukemia: Results
- Classification of immunophenotype (precursor B and T): 100% accuracy, both during training and when testing on an independent test set; simple classification methods applied (K-nearest neighbor, nearest centroid)
- Classification of outcome (relapse or remission): 78% accuracy (CC = 0.59); simple and advanced classification methods applied

Risk classification in the future?
- Patient: clinical data, immunophenotyping, morphology, genetic measurements, plus microarray technology
- Prognostic factors: immunophenotype, age, leukocyte count, number of chromosomes, translocations, treatment response
- Risk group: standard, intermediate, high, very high, extra high
- Custom-designed treatment

Summary
- Dimension reduction: important to visualize data; Principal Component Analysis; clustering, either hierarchical or partitioning (K-means), with the distance measure important
- Classification: reduction of dimension often necessary (t-test, PCA); several classification methods available; validation

Coffee break