Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical.

Slides:



Advertisements
Similar presentations
COMPUTER AIDED DIAGNOSIS: CLASSIFICATION Prof. Yasser Mostafa Kadah –
Advertisements

Capturing Best Practice for Microarray Gene Expression Data Analysis Gregory Piatetsky-Shapiro Tom Khabaza Sridhar Ramaswamy Presented briefly by Joey.
Learning Algorithm Evaluation
Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data.
Indian Statistical Institute Kolkata
K nearest neighbor and Rocchio algorithm
Classification and risk prediction
Getting the numbers comparable
DNA Microarray Bioinformatics - #27611 Program Normalization exercise (from last week) Dimension reduction theory (PCA/Clustering) Dimension reduction.
Gene ontology & hypergeometric test Simon Rasmussen CBS - DTU.
DNA Microarray Bioinformatics - #27612 Normalization and Statistical Analysis.
Dimension reduction : PCA and Clustering Agnieszka S. Juncker Slides: Christopher Workman and Agnieszka S. Juncker Center for Biological Sequence Analysis.
CS 590M Fall 2001: Security Issues in Data Mining Lecture 3: Classification.
Statistical Analysis of Microarray Data
Dimension reduction : PCA and Clustering by Agnieszka S. Juncker
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
Multidimensional Analysis If you are comparing more than two conditions (for example 10 types of cancer) or if you are looking at a time series (cell cycle.
Classification of Microarray Data. Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical.
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
DIMACS Workshop on Machine Learning Techniques in Bioinformatics 1 Cancer Classification with Data-dependent Kernels Anne Ya Zhang (with Xue-wen.
Generate Affy.dat file Hyb. cRNA Hybridize to Affy arrays Output as Affy.chp file Text Self Organized Maps (SOMs) Functional annotation Pathway assignment.
Introduction to DNA microarrays DTU - January Hanne Jarmer.
Scanning and image analysis Scanning -Dyes -Confocal scanner -CCD scanner Image File Formats Image analysis -Locating the spots -Segmentation -Evaluating.
Analysis of GO annotation at cluster level by H. Bjørn Nielsen Slides from Agnieszka S. Juncker.
Guidelines on Statistical Analysis and Reporting of DNA Microarray Studies of Clinical Outcome Richard Simon, D.Sc. Chief, Biometric Research Branch National.
Model Assessment and Selection Florian Markowetz & Rainer Spang Courses in Practical DNA Microarray Analysis.
JM - 1 Introduction to Bioinformatics: Lecture VIII Classification and Supervised Learning Jarek Meller Jarek Meller Division.
Whole Genome Expression Analysis
Evaluation – next steps
Gene Expression Profiling Illustrated Using BRB-ArrayTools.
Biomarker and Classifier Selection in Diverse Genetic Datasets J AMES L INDSAY 1 E D H EMPHILL 2 C HIH L EE 1 I ON M ANDOIU 1 C RAIG N ELSON 2 U NIVERSITY.
Division of Population Health Sciences Royal College of Surgeons in Ireland Coláiste Ríoga na Máinleá in Éirinn Indices of Performances of CPRs Nicola.
It is only the beginning: Putting microarrays into context Matthias E. Futschik Institute for Theoretical Biology Humboldt-University, Berlin, Germany.
PCA, Clustering and Classification by Agnieszka S. Juncker Part of the slides is adapted from Chris Workman.
Gene Expression Profiling. Good Microarray Studies Have Clear Objectives Class Comparison (gene finding) –Find genes whose expression differs among predetermined.
A Short Overview of Microarrays Tex Thompson Spring 2005.
CZ5225: Modeling and Simulation in Biology Lecture 8: Microarray disease predictor-gene selection by feature selection methods Prof. Chen Yu Zong Tel:
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman modified by Hanne Jarmer.
CSSE463: Image Recognition Day 11 Lab 4 (shape) tomorrow: feel free to start in advance Lab 4 (shape) tomorrow: feel free to start in advance Test Monday.
Analysis of GO annotation at cluster level by Agnieszka S. Juncker.
Lecture 27: Recognition Basics CS4670/5670: Computer Vision Kavita Bala Slides from Andrej Karpathy and Fei-Fei Li
Class 23, 2001 CBCl/AI MIT Bioinformatics Applications and Feature Selection for SVMs S. Mukherjee.
Statistical Analysis of Microarray Data By H. Bjørn Nielsen.
Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics
Covariance matrices for all of the classes are identical, But covariance matrices are arbitrary.
Introduction to Microarrays. The Central Dogma.
Guest lecture: Feature Selection Alan Qi Dec 2, 2004.
Chapter1: Introduction Chapter2: Overview of Supervised Learning
CZ5225: Modeling and Simulation in Biology Lecture 7, Microarray Class Classification by Machine learning Methods Prof. Chen Yu Zong Tel:
CSSE463: Image Recognition Day 11 Due: Due: Written assignment 1 tomorrow, 4:00 pm Written assignment 1 tomorrow, 4:00 pm Start thinking about term project.
LECTURE 05: CLASSIFICATION PT. 1 February 8, 2016 SDS 293 Machine Learning.
ECE 471/571 – Lecture 3 Discriminant Function and Normal Density 08/27/15.
FUZZ-IEEE Kernel Machines and Additive Fuzzy Systems: Classification and Function Approximation Yixin Chen and James Z. Wang The Pennsylvania State.
SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.
Classification of tissues and samples 指導老師:藍清隆 演講者:張許恩、王人禾.
Classifiers!!! BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin.
Predictive Automatic Relevance Determination by Expectation Propagation Y. Qi T.P. Minka R.W. Picard Z. Ghahramani.
MIRA, SVM, k-NN Lirong Xia. MIRA, SVM, k-NN Lirong Xia.
Performance Evaluation 02/15/17
Classifiers!!! BCH339N Systems Biology / Bioinformatics – Spring 2016
Classifiers!!! BCH364C/394P Systems Biology / Bioinformatics
CSSE463: Image Recognition Day 11
Dimension reduction : PCA and Clustering by Agnieszka S. Juncker
PCA, Clustering and Classification by Agnieszka S. Juncker
Analysis of GO annotation at cluster level by Agnieszka S. Juncker
Dimension reduction : PCA and Clustering
CSSE463: Image Recognition Day 11
CSSE463: Image Recognition Day 11
MIRA, SVM, k-NN Lirong Xia. MIRA, SVM, k-NN Lirong Xia.
ECE – Pattern Recognition Lecture 8 – Performance Evaluation
Presentation transcript:

Classification of Microarray Data

Sample Preparation Hybridization Array design Probe design Question Experimental Design Buy Chip/Array Statistical Analysis Fit to Model (time series) Expression Index Calculation Advanced Data Analysis ClusteringPCAClassification Promoter Analysis Meta analysisSurvival analysisRegulatory Network Normalization Image analysis The DNA Array Analysis Pipeline Comparable Gene Expression Data

Outline Example: Childhood leukemia Classification –Method selection –Feature selection –Cross-validation –Training and testing Example continued: Results from a leukemia study

Childhood Leukemia Cancer in the cells of the immune system Approx. 35 new cases in Denmark every year 50 years ago – all patients died Today – approx. 78% are cured Risk groups: –Standard, Intermediate, High, Very high, Extra high Treatment –Chemotherapy –Bone marrow transplantation –Radiation

Prognostic Factors Good prognosisPoor prognosis Immuno-phenotypeprecursor BT Age1-9≥10 Leukocyte countLow (<50*10 9 /L)High (>100*10 9 /L) Number of chromosomesHyperdiploidy (>50) Hypodiploidy (<46) Translocationst(12;21)t(9;22), t(1;19) Treatment responseGood response: Low MRD Poor response: High MRD

Prognostic factors: Immunophenotype Age Leukocyte count Number of chromosomes Translocations Treatment response Risk Classification Today Risk group: Standard Intermediate High Very high Extra High Patient: Clinical data Immuno- phenotype Morphology Genetic measurements Microarray technology

Study of Childhood Leukemia Diagnostic bone marrow samples from leukemia patients Platform: Affymetrix Focus Array –8793 human genes Immunophenotype –18 patients with precursor B immunophenotype –17 patients with T immunophenotype Outcome 5 years from diagnosis –11 patients with relapse –18 patients in complete remission

Reduction of data Dimension reduction –PCA Feature selection (gene selection) –Significant genes: t-test –Selection of a limited number of genes

Microarray Data ClassprecursorBT... Patient1Patient2Patient3Patient4... Gene Gene Gene Gene Gene Gene Gene

Outcome: PCA on all Genes

Outcome Prediction: CC against the Number of Genes

Assumptions: –Data e.g.. Gaussian distributed –Variance and covariance the same for all classes Calculation of class probability Linear discriminant analysis

Calculation of a centroid for each class Calculation of the distance between a test sample and each class centroid Class prediction by the nearest centroid method Nearest Centroid

Based on distance measure –For example Euclidian distance Parameter k = number of nearest neighbors –k=1 –k=3 –k=... –Always an odd number Prediction by majority vote K-Nearest Neighbor

Support Vector Machines Machine learning Relatively new and highly theoretic Works on non-linearly separable data Finding a hyperplane between the two classes by minimizing of the distance between the hyperplane and closest points

Neural Networks Gene1 Gene2 Gene3... B T

Comparison of Methods Linear discriminant analysis Nearest centroid KNN Neural networks Support vector machines Simple method Based on distance calculation Good for simple problems Good for few training samples Distribution of data assumed Advanced methods Involve machine learning Several adjustable parameters Many training samples required (e.g ) Flexible methods

Cross-validation Given data with 100 samples Cross-5-validation: Training: 4/5 of data (80 samples) Testing: 1/5 of data (20 samples) - 5 different models and performance results Leave-one-out cross-validation (LOOCV) Training: 99/100 of data (99 samples) Testing: 1/100 of data (1 sample) different models

Validation Definition of –True and False positives –True and False negatives Actual classBBTT Predicted classBTTB TPFNTNFP

Sensitivity: the ability to detect "true positives" TP TP + FN Specificity: the ability to avoid "false positives" TN TN + FP Positive Predictive Value (PPV) TP TP + FP Sensitivity and Specificity

Accuracy Definition: TP + TN TP + TN + FP + FN Range: [0 … 1]

Definition:TP·TN - FN·FP √(TN+FN)(TN+FP)(TP+FN)(TP+FP) Range: [-1 … 1] Matthews correlation coefficient

Overview of Classification Expression data Subdivision of data for cross-validation into training sets and test sets Feature selection (t-test) Dimension reduction (PCA) Training of classifier: - using cross-validation - choice of method - choice of optimal parameters Testing of classifierIndependent test set

Important Points Avoid over fitting Validate performance –Test on an independent test set –Use cross-validation Include feature selection in cross-validation Why? –To avoid overestimation of performance! –To make a general classifier

Multiclass classification Same prediction methods –Multiclass prediction often implemented –Often bad performance Solution –Division in small binary (2-class) problems

Study of Childhood Leukemia: Results Classification of immunophenotype (precursorB and T) –100% accuracy During the training When testing on an independent test set –Simple classification methods applied K-nearest neighbor Nearest centroid Classification of outcome (relapse or remission) –78% accuracy (CC = 0.59) –Simple and advanced classification methods applied

Summary When do we use classification? Reduction of dimension often necessary –T-test, PCA Methods –K nearest neighbor –Nearest centroid –Support vector machines –Neural networks Validation –Cross-validation –Accuracy and correlation coefficient

Coffee break Next: Exercise

Prognostic factors: Immunophenotype Age Leukocyte count Number of chromosomes Translocations Treatment response Risk classification in the future ? Risk group: Standard Intermediate High Very high Ekstra high Patient: Clincal data Immunopheno- typing Morphology Genetic measurements Microarray technology Customized treatment