PCA, Clustering and Classification by Agnieszka S. Juncker. Some of the slides are adapted from Chris Workman.


Motivation: Multidimensional data
[Table: expression values for Affymetrix probe sets (…_at, …_s_at, …_x_at) measured across patients Pat1, Pat2, …; the numeric values were lost in transcription]

Outline
Dimension reduction
– PCA
– Clustering
Classification
Example: study of childhood leukemia

Childhood Leukemia
Cancer in the cells of the immune system
Approx. 35 new cases in Denmark every year
50 years ago – all patients died
Today – approx. 78% are cured
Risk groups
– Standard
– Intermediate
– High
– Very high
– Extra high
Treatment
– Chemotherapy
– Bone marrow transplantation
– Radiation

Prognostic Factors

Factor                  Good prognosis        Poor prognosis
Immunophenotype         precursor B           T
Age                     1–9                   ≥10
Leukocyte count         Low (<50×10⁹/L)       High (>100×10⁹/L)
Number of chromosomes   Hyperdiploidy (>50)   Hypodiploidy (<46)
Translocations          t(12;21)              t(9;22), t(1;19)
Treatment response      Good response         Poor response

Study of Childhood Leukemia
Diagnostic bone marrow samples from leukemia patients
Platform: Affymetrix Focus Array
– 8763 human genes
Immunophenotype
– 18 patients with precursor B immunophenotype
– 17 patients with T immunophenotype
Outcome 5 years from diagnosis
– 11 patients with relapse
– 18 patients in complete remission

Principal Component Analysis (PCA)
– used for visualization of complex data
– developed to capture as much of the variation in the data as possible

Principal components
1st principal component (PC1)
– the direction along which there is the greatest variation
2nd principal component (PC2)
– the direction with the maximum remaining variation in the data, orthogonal to PC1
General properties of principal components
– linear combinations of the original variables
– uncorrelated with each other
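
The idea above can be sketched in Python with NumPy: center the data, eigendecompose the covariance matrix, and project onto the top two eigenvectors. The toy 6-sample × 3-gene matrix is invented for illustration; a real analysis would use the full expression matrix.

```python
import numpy as np

# Hypothetical expression matrix: 6 samples (rows) x 3 genes (columns)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.6],
              [3.1, 3.0, 0.3],
              [2.3, 2.7, 0.5]])

Xc = X - X.mean(axis=0)                  # center each gene at zero
cov = np.cov(Xc, rowvar=False)           # gene-gene covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs[:, :2]             # coordinates of each sample in (PC1, PC2)
```

The eigenvalues give the variance captured by each component, and the eigenvectors are orthogonal, matching the "uncorrelated" property stated above.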

Principal components

PCA - example

PCA on all Genes
Leukemia data, precursor B and T
Plot of 34 patients, dimension of 8973 genes reduced to 2

Outcome: PCA on all Genes

Principal components - Variance

Clustering methods
Hierarchical
– agglomerative (bottom-up), e.g. UPGMA
– divisive (top-down)
Partitioning
– e.g. K-means clustering

Hierarchical clustering
Representation of all pairwise distances
Parameters: none (except the distance measure)
Results:
– all items end up in one large cluster
– hierarchical tree (dendrogram)
Deterministic

Hierarchical clustering – UPGMA Algorithm
Assign each item to its own cluster
Join the nearest two clusters
Re-estimate the distances between the clusters
Repeat until all items are joined in a single cluster
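
The steps above can be sketched as a naive average-linkage (UPGMA) implementation. The `upgma` function and its toy points are illustrative only, not the code used in the study; real data would use a gene-expression distance matrix.

```python
import math

def upgma(points):
    """Naive average-linkage (UPGMA) agglomerative clustering.
    Returns the merge order as a list of (cluster_id_a, cluster_id_b) pairs."""
    # Step 1: assign each item to its own cluster.
    clusters = {i: [i] for i in range(len(points))}

    def dist(a, b):
        # Average pairwise Euclidean distance between two clusters.
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)

    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        # Step 2: join the nearest pair of clusters.
        a, b = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda ab: dist(*ab))
        # Step 3: the merged cluster implicitly re-defines the distances.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b))
        next_id += 1
    return merges

# Two well-separated pairs: the pairs merge first, then the two groups.
merges = upgma([(0, 0), (0, 1), (5, 5), (5, 6)])
```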

Hierarchical clustering

[Figure: data with clustering order and distances, and the corresponding dendrogram representation]

Leukemia data - clustering of patients

Leukemia data - clustering of genes

K-means clustering
Partition the data into K clusters
Parameter: the number of clusters (K) must be chosen
Randomized initialization:
– different clusters may result each time

K-means - Algorithm
Assign each item a class in 1 to K (randomly)
For each class 1 to K
– calculate the centroid (one of the K means)
– calculate the distance from the centroid to each item
Assign each item to the nearest centroid
Repeat until no items are re-assigned (convergence)
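
A minimal sketch of this algorithm in plain Python (toy data and the empty-cluster handling are assumptions of this sketch, not part of the slide):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """K-means as on the slide: random class assignment, then iterate
    centroid update / reassignment until no item changes class."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]        # random initial classes
    cents = []
    for _ in range(max_iter):
        cents = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if not members:                            # empty class: reseed at a random point
                members = [points[rng.randrange(len(points))]]
            cents.append(tuple(sum(x) / len(members) for x in zip(*members)))
        # Assign each item to the nearest centroid (squared Euclidean distance).
        new = [min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, cents[c])))
               for p in points]
        if new == labels:                              # convergence: no re-assignment
            break
        labels = new
    return labels, cents

# Two well-separated groups should end up in two different clusters.
labels, centroids = kmeans([(0.0, 0.0), (0.2, 0.0), (10.0, 10.0), (10.0, 10.2)], k=2)
```

Note the randomized initialization: as the previous slide says, different runs (seeds) can give different clusterings.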

K-means clustering, K=3

K-means clustering of Leukemia data

Comparison of clustering methods
Hierarchical clustering
– distances between all variables
– time-consuming with a large number of genes
– advantage: can cluster on selected genes
K-means clustering
– faster algorithm
– does not show relations between all variables

Distance measures
– Euclidean distance
– Vector angle distance
– Pearson's distance
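
The three measures can be written directly from their definitions (a sketch; here the angle and Pearson measures are expressed as 1 minus the cosine/correlation, a common convention):

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def vector_angle(x, y):
    """1 - cosine of the angle between x and y: ignores vector length."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1 - dot / (math.hypot(*x) * math.hypot(*y))

def pearson(x, y):
    """1 - Pearson correlation: insensitive to both scale and offset."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)
```

The choice matters for clustering genes: two genes with the same expression pattern at different absolute levels are far apart in Euclidean distance but close in Pearson's distance.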

Comparison of distance measures

Classification
– Feature selection
– Classification methods
– Cross-validation
– Training and testing

Reduction of input features
Dimension reduction
– PCA
Feature selection (gene selection)
– significant genes: t-test
– selection of a limited number of genes
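
t-test based gene selection can be sketched as follows: compute a two-sample t statistic per gene between the B and T classes and keep the genes with the largest |t|. The helper names (`t_statistic`, `select_genes`) and the toy data are hypothetical.

```python
import math

def t_statistic(a, b):
    """Two-sample t statistic (pooled, equal-variance form)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / (sp * math.sqrt(1 / na + 1 / nb))

def select_genes(expr, labels, n_top):
    """Rank genes by |t| between the two classes and keep the top n_top.
    expr: gene -> expression values per patient; labels: class per patient."""
    def abs_t(gene):
        a = [v for v, l in zip(expr[gene], labels) if l == 'B']
        b = [v for v, l in zip(expr[gene], labels) if l == 'T']
        return abs(t_statistic(a, b))
    return sorted(expr, key=abs_t, reverse=True)[:n_top]

# Toy example: geneX separates the classes, geneY is noise.
expr = {'geneX': [1.0, 1.1, 0.9, 5.0, 5.1, 4.9],
        'geneY': [1.0, 2.0, 1.5, 1.2, 1.8, 1.6]}
labels = ['B', 'B', 'B', 'T', 'T', 'T']
top = select_genes(expr, labels, 1)
```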

Microarray Data
[Table: expression matrix with genes (Gene 1, Gene 2, …) as rows and patients (Patient 1, Patient 2, …) as columns; each patient column is labeled with its class, precursor B or T; the numeric values were lost in transcription]

Outcome: PCA on Selected Genes

Outcome Prediction: CC against the Number of Genes

Nearest Centroid
Calculate a centroid for each class
Calculate the distance between a test sample and each class centroid
Predict the class by the nearest centroid method
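
The three steps above can be sketched in a few lines (the function and the toy training data are illustrative):

```python
def nearest_centroid(train, labels, sample):
    """Predict the class whose centroid is nearest (Euclidean) to sample."""
    # Step 1: one centroid per class (mean of its training samples).
    cents = {}
    for c in set(labels):
        members = [p for p, l in zip(train, labels) if l == c]
        cents[c] = tuple(sum(x) / len(members) for x in zip(*members))

    # Steps 2-3: distance from sample to each centroid, predict the nearest.
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(cents, key=lambda c: d2(sample, cents[c]))

train = [(0, 0), (1, 0), (10, 10), (11, 10)]   # toy 2-gene profiles
labels = ['B', 'B', 'T', 'T']
```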

K-Nearest Neighbor (KNN)
Based on a distance measure
– for example Euclidean distance
Parameter k = number of nearest neighbors
– k=1
– k=3
– k=...
Prediction by majority vote (odd k avoids ties)
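
A minimal KNN sketch along these lines (function name and toy data are assumptions of this sketch):

```python
from collections import Counter
import math

def knn_predict(train, labels, sample, k=3):
    """Majority vote among the k training samples nearest to sample."""
    nearest = sorted(zip(train, labels),
                     key=lambda pl: math.dist(pl[0], sample))[:k]
    return Counter(l for _, l in nearest).most_common(1)[0][0]

train = [(0, 0), (1, 0), (10, 10), (11, 10)]   # toy 2-gene profiles
labels = ['B', 'B', 'T', 'T']
```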

Neural Networks
[Figure: network with gene expression values (e.g. Gene2, Gene6) as input, a layer of hidden neurons, and output neurons for the classes B and T]

Comparison of Methods

Nearest centroid and KNN:
– simple methods
– based on distance calculation
– good for simple problems
– good for few training samples
– a distribution of the data is assumed

Neural networks:
– advanced methods
– involve machine learning
– several adjustable parameters
– many training samples required (e.g. …)
– flexible methods

Cross-validation
Data: 10 samples
5-fold cross-validation:
– training: 4/5 of the data (8 samples)
– testing: 1/5 of the data (2 samples)
– -> 5 different models
Leave-one-out cross-validation (LOOCV):
– training: 9/10 of the data (9 samples)
– testing: 1/10 of the data (1 sample)
– -> 10 different models
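
LOOCV can be sketched generically: hold out each sample once, train on the rest, and count correct predictions. Here a 1-nearest-neighbor predictor stands in for the classifier; both helper names are hypothetical.

```python
import math

def nn_predict(train, train_labels, sample):
    """1-nearest-neighbor classifier used as the model under validation."""
    i = min(range(len(train)), key=lambda j: math.dist(train[j], sample))
    return train_labels[i]

def loocv_accuracy(data, labels, predict):
    """Leave-one-out cross-validation: train on n-1 samples, test on the
    held-out one, repeated n times; returns the fraction correct."""
    hits = 0
    for i in range(len(data)):
        train = data[:i] + data[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        hits += predict(train, train_labels, data[i]) == labels[i]
    return hits / len(data)

# Toy 1-dimensional data: each held-out sample has a same-class neighbor.
acc = loocv_accuracy([(0.0,), (0.1,), (10.0,), (10.1,)],
                     ['B', 'B', 'T', 'T'], nn_predict)
```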

Validation
Definition of
– true and false positives
– true and false negatives

Actual class:     B    B    T    T
Predicted class:  B    T    T    B
Result:           TP   FN   TN   FP

Accuracy
Definition: (TP + TN) / (TP + TN + FP + FN)
Range: 0 – 100%

Matthews correlation coefficient
Definition: (TP·TN - FN·FP) / √((TN+FN)(TN+FP)(TP+FN)(TP+FP))
Range: (-1) – 1
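
Both performance measures translate directly into code (the zero-denominator guard for MCC is an assumption of this sketch):

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1]; 0 if undefined."""
    num = tp * tn - fn * fp
    den = math.sqrt((tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    return num / den if den else 0.0
```

Unlike accuracy, MCC stays informative when the classes are unbalanced: a classifier that always predicts the majority class gets an MCC near 0.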

Overview of Classification
Expression data
Subdivision of data for cross-validation into training sets and test sets
Feature selection (t-test)
Dimension reduction (PCA)
Training of the classifier:
– using cross-validation
– choice of method
– choice of optimal parameters
Testing of the classifier on an independent test set

Important Points
Avoid overfitting
Validate performance
– test on an independent test set
– or validate by using cross-validation
Include feature selection in the cross-validation
Why?
– to avoid overestimation of performance!
– to make a general classifier

Study of Childhood Leukemia: Results
Classification of immunophenotype (precursor B and T)
– 100% accuracy
  during training
  when testing on an independent test set
– simple classification methods applied
  K-nearest neighbor
  Nearest centroid
Classification of outcome (relapse or remission)
– 78% accuracy (CC = 0.59)
– simple and advanced classification methods applied

Risk classification in the future?
Patient:
– clinical data
– immunophenotyping
– morphology
– genetic measurements
– microarray technology
Prognostic factors:
– immunophenotype
– age
– leukocyte count
– number of chromosomes
– translocations
– treatment response
Risk group:
– Standard
– Intermediate
– High
– Very high
– Extra high
-> Custom designed treatment

Summary
Dimension reduction is important to visualize data
– Principal Component Analysis
– Clustering
  Hierarchical
  Partitioning (K-means)
  (distance measure important)
Classification
– reduction of dimension often necessary (t-test, PCA)
– several classification methods available
– validation

Coffee break