Bio277 Lab 2: Clustering and Classification of Microarray Data Jess Mar Department of Biostatistics Quackenbush Lab DFCI

Machine Learning
Machine learning algorithms predict new classes based on patterns discerned from existing data.
Classification algorithms are a form of supervised learning.
Clustering algorithms are a form of unsupervised learning.
Goal of classification: derive a rule (classifier) that assigns a new object (e.g. a patient's microarray profile) to a pre-specified group (e.g. aggressive vs. non-aggressive prostate cancer).
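To make the idea of a "rule" concrete, here is a minimal sketch of one very simple classifier, a nearest-centroid rule, written in base R. It is not the method used later in this lab (that is a support vector machine); the simulated data, the group names "A" and "B", and the helper function classify are all assumptions made for illustration.

```r
# Simulated training data (made up): 20 profiles x 4 features,
# 10 from group A (mean 0) and 10 from group B (mean 3).
set.seed(3)
train <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
               matrix(rnorm(40, mean = 3), ncol = 4))
labels <- factor(rep(c("A", "B"), each = 10))

# "Training": store the mean profile (centroid) of each known group.
centroids <- rbind(A = colMeans(train[labels == "A", ]),
                   B = colMeans(train[labels == "B", ]))

# The rule: assign a new profile to the group with the closest centroid
# (Euclidean distance). 'classify' is a hypothetical helper, not a library function.
classify <- function(profile, centroids) {
  d <- apply(centroids, 1, function(ctr) sqrt(sum((profile - ctr)^2)))
  names(which.min(d))
}

pred.high <- classify(c(3, 3, 3, 3), centroids)
pred.low  <- classify(c(0, 0, 0, 0), centroids)
print(c(pred.high, pred.low))
```

The point of the sketch is only the workflow: learn something from labeled samples, then apply it to a sample whose group is unknown.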

The Golub Data
Golub et al. published gene expression microarray data in a 1999 Science paper entitled "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring". The primary focus of their paper was to demonstrate a class discovery procedure that could assign tumors to either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML).
Bioconductor has this (pre-processed) data packaged up in golubEsets.
> library(golubEsets)
> library(help=golubEsets)

Some Clustering Algorithms for Array Data
Hierarchical Methods: Single, Average, Complete Linkage, plus other variations.
Partitioning Methods: Self-Organising Maps (Kohonen), K-Means Clustering.
Gene shaving (Hastie, Tibshirani et al.)
Model-based clustering …
Plaid models (Lazzeroni & Owen)
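As a small illustration of a partitioning method, the sketch below runs K-means with the base R function kmeans on simulated two-dimensional data. The data and the choice of k = 2 are assumptions for illustration and are not taken from the lab's data set.

```r
# Simulated data (made up): two well-separated groups of 10 points each.
set.seed(1)
grp1 <- matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2)
grp2 <- matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
x <- rbind(grp1, grp2)

# K-means needs the number of clusters up front; nstart = 10 repeats the
# algorithm from 10 random starts and keeps the best solution.
km <- kmeans(x, centers = 2, nstart = 10)

# Cross-tabulate cluster assignments against the simulated group of origin.
origin <- rep(c("grp1", "grp2"), each = 10)
print(table(cluster = km$cluster, origin = origin))
```

Unlike hierarchical clustering, the number of clusters has to be chosen before the algorithm runs, which is one reason clustering is harder than it first looks.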

Cluster Analysis
Hierarchical Methods: (Agglomerative, Divisive) + (Single, Average, Complete) Linkage…
Model-based Methods: Mixed models. Plaid models. Mixture models…
A clustering problem is generally much harder than a classification problem because we don't know the number of classes.
Clustering genes on the basis of experiments or across a time series: elucidate unknown gene function.
Clustering slides on the basis of genes: discover subclasses in tissue samples.

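Hierarchical Clustering
Agglomerative: start with n genes in n clusters and merge until all n genes are in 1 cluster. Divisive: the reverse, starting with n genes in 1 cluster and splitting.
We join (or break) nodes based on the notion of maximum (or minimum) 'similarity'.
Similarity measures: Euclidean distance, (Pearson) correlation.
Source: J-Express Manual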
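The two similarity measures named on this slide can behave very differently. The sketch below computes both in base R on a made-up matrix of three gene profiles; the gene names and values are assumptions chosen so the contrast is easy to see.

```r
# Toy expression matrix (made up): 3 genes (rows) x 4 samples (columns).
expr <- rbind(geneA = c(1, 2, 3, 4),
              geneB = c(2, 4, 6, 8),   # same shape as geneA, twice the magnitude
              geneC = c(4, 3, 2, 1))   # mirror image of geneA

# Euclidean distance between gene profiles (dist() compares rows).
d.euc <- dist(expr, method = "euclidean")

# Correlation-based dissimilarity: 1 - Pearson correlation.
# cor() correlates columns, so transpose to put genes in columns.
d.cor <- as.dist(1 - cor(t(expr), method = "pearson"))

print(round(as.matrix(d.euc), 3))
print(round(as.matrix(d.cor), 3))
```

Under Euclidean distance geneA is closer to geneC than to geneB, while under correlation geneA and geneB are identical (dissimilarity 0) and geneC is as far away as possible (dissimilarity 2). Which one is appropriate depends on whether magnitude or shape of the profile matters.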

Different Ways to Determine Distances Between Clusters
Single linkage, complete linkage, average linkage
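A minimal sketch of how the three linkage rules differ, using the base R function hclust on five made-up points on a line. The sample names and values are assumptions for illustration only.

```r
# Five one-dimensional points (made up): a group near 0 and a group near 8.
x <- matrix(c(0, 1, 3, 7, 8), ncol = 1,
            dimnames = list(paste0("s", 1:5), "value"))
d <- dist(x)

# Same distances, three different rules for the distance between clusters:
hc.single   <- hclust(d, method = "single")    # closest pair of members
hc.complete <- hclust(d, method = "complete")  # farthest pair of members
hc.average  <- hclust(d, method = "average")   # mean over all pairs of members

# Height at which the last two clusters, {s1, s2, s3} and {s4, s5}, are joined.
final.heights <- c(single   = max(hc.single$height),
                   complete = max(hc.complete$height),
                   average  = max(hc.average$height))
print(final.heights)

# Cut the average-linkage tree into two clusters.
groups <- cutree(hc.average, k = 2)
print(groups)
```

Here all three rules produce the same two top-level groups, but they disagree on how far apart those groups are: 4 for single linkage (3 to 7), 8 for complete linkage (0 to 8), and 37/6 for average linkage.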

Implementing Hierarchical Clustering
Agglomerative hierarchical clustering with the function agnes (from the cluster package):
> library(cluster)
> colnames(eset.filt) <- classLabels
> plot(agnes(dist(t(eset.filt), method="euclidean")))

Principal Component Analysis
Multi-dimensional scaling tool. See GC's lectures for a more in-depth treatment.
In our Golub data set, PCA will take the data (~500 genes x 72 samples) and map each sample vector (ALL or AML) from 558 dimensions to 2 dimensions.
> pca.samples <- princomp(eset.filt)
> plot(pca.samples)
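As a self-contained sketch of mapping samples to two dimensions, the example below uses the base R function prcomp on simulated data. It uses prcomp rather than princomp because princomp requires more observations than variables, which does not hold once samples are placed in rows. The simulated matrix, its dimensions, and the group shift are assumptions and do not reproduce the Golub data.

```r
# Simulated expression matrix (made up): 50 genes (rows) x 10 samples (columns).
# Samples 6-10 have 20 genes shifted upward, mimicking a second class.
set.seed(2)
expr <- matrix(rnorm(50 * 10), nrow = 50)
expr[1:20, 6:10] <- expr[1:20, 6:10] + 4
colnames(expr) <- paste0(rep(c("classA", "classB"), each = 5), "_", 1:5)

# prcomp() treats rows as observations, so transpose to put samples in rows.
pca <- prcomp(t(expr))

# Coordinates of each sample on the first two principal components.
scores <- pca$x[, 1:2]
print(round(scores, 2))

# Proportion of variance explained by each component.
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var.explained[1:2], 3))
```

A scatter plot of the two columns of scores (for example plot(scores)) would show the two simulated classes separating along the first component.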

Principal Components

Classification Example: Support Vector Machine
For this example we will use data from Golub et al.: 47 patients with ALL, 25 patients with AML, and 7129 genes from an Affymetrix Hu6800 array, but we'll take a subset for this example.
> library(MLInterfaces); library(golubEsets)
> library(e1071)
> data(golubMerge)
To fit the support vector machine:
> model <- svm(classLabels[1:40]~., data=t(eset.train))

Visualizing the SVM
What predictions were made for the test set?
> predLabels <- predict(model, t(eset.test))
> predLabels
ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML AML AML AML AML AML AML AML AML AML
Levels: ALL AML
How do these stack up to the true classification?
> trueLabels <- classLabels[41:72]
> table(predLabels, trueLabels)
          trueLabels
predLabels ALL AML
       ALL  21   0
       AML   0  11
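The confusion table above can be summarized numerically. The sketch below re-enters the four counts reported on the slide by hand (it does not rerun the SVM) and computes overall accuracy and per-class recall in base R; the variable names are assumptions for illustration.

```r
# Confusion matrix with the counts reported on the slide:
# rows are predicted labels, columns are true labels.
conf <- matrix(c(21,  0,
                  0, 11),
               nrow = 2, byrow = TRUE,
               dimnames = list(predLabels = c("ALL", "AML"),
                               trueLabels = c("ALL", "AML")))

# Overall accuracy: correctly classified samples (the diagonal) over all samples.
accuracy <- sum(diag(conf)) / sum(conf)

# Per-class recall: of the samples truly in a class, the fraction predicted as such.
recall <- diag(conf) / colSums(conf)

print(accuracy)
print(recall)
```

With no off-diagonal counts, every test sample was assigned to its true class, so accuracy and both recalls equal 1 on this test set.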

More Materials, More Labs?
Lecture Topics Covered Since Last Lab: Hypothesis Testing of Differentially Expressed Genes, Gene Set Enrichment, Clustering, Classification, Support Vector Machines
Tutorial: BioConductor Tour