1 Course #412 Analyzing Microarray Data using the mAdb System April 1-2, 2008 1:00 pm - 4:00pm Intended for users of the.

Presentation transcript:

1 Course #412 Analyzing Microarray Data using the mAdb System April 1-2, 2008, 1:00 pm - 4:00 pm. Intended for users of the mAdb system who are familiar with mAdb basics. Focus on analysis of multiple array experiments. Esther Asaki, Yiwen He

2 Agenda 1. mAdb system overview 2. mAdb dataset overview 3. mAdb analysis tools for dataset –Class Discovery: clustering, PCA, MDS –Class Comparison: statistical analysis (t-test, ANOVA, Significance Analysis of Microarrays (SAM)) –Class Prediction: PAM Various hands-on exercises

3 1. mAdb system overview

4 mAdb Data Workflow Upload Data: File Format (GenePix, MAS5, GCOS 1.1, ArraySuite) Quality Control: Project Summary, Summary Statistics, Array images, Graphical Report Prepare Dataset: Dataset Extraction, Normalization, Spot Filtering Analysis/Model: Analysis Tools (Class Discovery, Class Comparison, Class Prediction) Review Annotation: Annotation Tools (Feature Report, Gene Ontology, BioCarta Pathway, KEGG Pathway)

5 2. mAdb dataset overview

6 What is a dataset? mAdb Dataset –Collection of data from multiple experiments –Genes as rows and experiments as columns (sample1, sample2, sample3, sample4, sample5, …) –Gene expression level = (normalized) Log(Red signal / Green signal)
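To make the dataset layout concrete, here is a minimal sketch of a genes-by-samples matrix of log ratios. The gene names and signal values are hypothetical, and base-2 logarithms are an assumption (the slide says only "Log"); normalization is omitted.

```python
import math

def log_ratio(red, green):
    """Log ratio of red to green signal. Base 2 is an assumption; the slide only says 'Log'."""
    return math.log2(red / green)

# Hypothetical raw signals: one row per gene, one column per sample (array).
red = {"geneA": [400.0, 100.0, 200.0], "geneB": [50.0, 800.0, 100.0]}
green = {"geneA": [100.0, 100.0, 400.0], "geneB": [100.0, 100.0, 100.0]}

# Dataset: genes as rows, experiments as columns, values are log ratios.
dataset = {
    gene: [log_ratio(r, g) for r, g in zip(red[gene], green[gene])]
    for gene in red
}

for gene, row in dataset.items():
    print(gene, [round(v, 2) for v in row])
```

A positive value means the red channel is stronger than the green channel for that gene and sample; a negative value means the reverse.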

7 New or Existing Dataset: 1.Create New Dataset 2. Access Existing Dataset

8 Dataset Display Page Analysis Tools Retrieval and Display Options… Dataset History

9 Dataset Display Newly created dataset puts all experiments into a single group Dataset display options dynamic Integrated gene information

10 mAdb Dataset Display Group label Sample name genes

11 Group Examples Technical/Biological replicates Knock-outs and wild types Cancer vs normal samples Time course points Dosage levels

12 Dataset Group Assignment Array Order Designation/Filtering Array Group Assignment/Filtering Filter/Group by Array Properties

13 Dataset group assignment tools

14 Array Order Designation/Filtering Order arrays in dataset Delete/Add back arrays in dataset Subsequent analysis will be ordered by groups first and then ordered within each group Does not group arrays

15 Array Group Assignment/Filtering One click per array for additional group Not convenient for large datasets Cannot order within group

16 Filter/Group by Array Properties Array properties include Name and Short Description Identify consistent pattern

17 Filter/Group by Array Properties Convenient for large datasets Cannot order arrays within group
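Grouping by array properties amounts to matching a consistent pattern in array names or descriptions. The sketch below illustrates the idea with regular expressions; the array names, patterns, and the `group_by_name` helper are hypothetical and do not represent the mAdb interface.

```python
import re
from collections import defaultdict

def group_by_name(array_names, patterns):
    """Assign each array to the first group whose regex matches its name."""
    groups = defaultdict(list)
    for name in array_names:
        for label, pattern in patterns.items():
            if re.search(pattern, name, flags=re.IGNORECASE):
                groups[label].append(name)
                break
        else:
            # No pattern matched this array name.
            groups["unassigned"].append(name)
    return dict(groups)

# Hypothetical array names with a consistent prefix per tumor type.
names = ["EWS-T1", "EWS-T2", "RMS-C4", "NB-C1", "control_7"]
patterns = {"EWS": r"^EWS", "RMS": r"^RMS", "NB": r"^NB"}
grouped = group_by_name(names, patterns)
print(grouped)
```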

18 Group Assignment Group assignment information is carried into relevant analysis Dataset is independent from microarray platforms

19 Examples for using groups Additional Filtering per Group Correlation summary report Average arrays within groups Calculate statistics within groups

20 Filter by Group Properties Ensures each group has sufficient number of non-missing values
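One way to picture this filter: a gene is kept only if every group has at least a minimum number of non-missing values. The following is a sketch under that assumption; the threshold name `min_present` and the sample data are hypothetical.

```python
def passes_group_filter(values, groups, min_present):
    """True if every group has at least min_present non-missing values.

    values: one gene's row, with None marking a missing value.
    groups: the group label of each column, in the same order.
    """
    counts = {}
    for value, label in zip(values, groups):
        counts.setdefault(label, 0)
        if value is not None:
            counts[label] += 1
    return all(n >= min_present for n in counts.values())

groups = ["tumor", "tumor", "tumor", "normal", "normal"]
rows = {
    "geneA": [1.2, None, 0.8, -0.3, -0.1],   # 2 tumor values, 2 normal values
    "geneB": [0.5, 0.4, 0.6, None, None],    # no normal values at all
}
kept = [g for g, row in rows.items() if passes_group_filter(row, groups, min_present=2)]
print(kept)
```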

21 Correlation Summary Report Pairwise correlation between 2 samples in dataset Individual scatter plot available Group pattern for quality control
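A pairwise correlation report can be sketched as the Pearson correlation for every pair of samples. The sample names and values below are hypothetical, and missing values are not handled in this sketch.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical log ratios for four genes in three samples.
samples = {
    "rep1": [1.0, 2.0, 3.0, 4.0],
    "rep2": [1.1, 2.1, 2.9, 4.2],
    "other": [4.0, 3.0, 2.0, 1.0],
}

# One entry per unordered pair of samples.
report = {
    (a, b): pearson(samples[a], samples[b])
    for a, b in combinations(samples, 2)
}
for pair, r in report.items():
    print(pair, round(r, 3))
```

Replicates are expected to correlate highly with each other, so a replicate that correlates poorly with the rest of its group is a candidate quality-control problem.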

22 Visual Bivariate Data Analysis

23 Average Arrays within Groups Averages calculated using log ratios regardless of linear or log display options chosen

24 Calculate statistics within Groups All values calculated using log ratios regardless of linear or log display options chosen
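The point about log ratios matters because averaging in log space differs from averaging linear ratios. A small sketch, assuming base-2 log ratios and hypothetical group labels:

```python
def group_means(log_ratios, groups):
    """Mean log ratio per group, skipping missing (None) values."""
    sums, counts = {}, {}
    for value, label in zip(log_ratios, groups):
        if value is None:
            continue
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

# One gene's row of log2 ratios (base 2 is an assumption).
row = [1.0, 3.0, -1.0, None, -3.0]
groups = ["treated", "treated", "control", "control", "control"]

means_log = group_means(row, groups)
# Convert to linear ratios only for display, after averaging in log space.
means_linear = {label: 2.0 ** value for label, value in means_log.items()}
print(means_log, means_linear)
```

For the treated group the log-space mean is 2.0, which displays as a linear ratio of 4.0; averaging the linear ratios 2 and 8 directly would instead give 5.0.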

25 Dataset I Small Round Blue Cell Tumors (SRBCTs) Khan et al., Nature Medicine (2001) 4 tumor classifications 63 training samples, 25 testing samples, 2308 genes Neural network approach

26 Hands-on Session 1 Lab 1- Lab 4 Read the questions before starting, then answer them in the lab. Use web site: Avoid maximizing web browser to full screen. Total time: 20 minutes

27 3. mAdb dataset analysis tools –Class Discovery: clustering, PCA, MDS –Class Comparison: statistical analysis –Class Prediction: PAM

28 Analysis Overview Class Discovery (Unsupervised): Clustering (Hierarchical, K-means, SOMs); Principal Components Analysis (PCA); Multidimensional Scaling (MDS). Class Comparison (Supervised): paired t-test; t-test with pooled (equal) variance; t-test with separate (unequal) variance; Significance Analysis of Microarrays (SAM); one-way ANOVA; Wilcoxon Rank-Sum (Mann-Whitney U); Wilcoxon Matched-pairs Signed Rank; Kruskal-Wallis. Class Prediction (Supervised): Prediction Analysis for Microarrays (PAM)
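The two unpaired t-test variants differ only in how the standard error and degrees of freedom are computed. The sketch below shows the textbook formulas for one gene; it is not mAdb's implementation, and it stops at the t statistic because the Python standard library has no t distribution for p-values.

```python
import math
from statistics import mean, variance

def t_pooled(a, b):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2

def t_separate(a, b):
    """Welch t statistic for unequal variances; returns (t, approximate df)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical log ratios for one gene in two groups.
group1 = [1.0, 2.0, 3.0]
group2 = [4.0, 5.0, 6.0]
print(t_pooled(group1, group2))
print(t_separate(group1, group2))
```

With equal group sizes and equal variances, as in this example, the two variants agree; they diverge when the group variances or sizes differ.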

29 Class Discovery Example Discover cancer subtypes by gene expression profiles Identify genes which have different expression patterns in different groups Tools: Cluster Analysis, PCA and MDS

30 Class Comparisons Example Find genes that are differentially expressed among cancer groups Find genes up/down regulated by drug treatment Tools: –Group comparison –Statistics Results filtering

31 Class Prediction Example Identify an expression profile which correlates with survival in certain cancers Identify an expression profile which can be used to diagnose different types of lymphomas Tools: Prediction Analysis for Microarrays (PAM)

32 3. mAdb dataset analysis tools –Class Discovery: clustering, PCA, MDS –Class Comparison: statistical analysis –Class Prediction: PAM

33 Class Discovery Dataset with large amount of data Dataset not organized Visualization with Clustering, PCA, MDS

34 Cluster Analysis Organize large microarray dataset into meaningful structures Visualize and extract expression patterns

35 What to Cluster? Genes - identify groups of genes that have correlated expression profiles Samples - put samples into groups with similar overall gene expression profiles

36 Clustering Methods Hierarchical clustering Partitional clustering –K-means –Self-Organizing Maps (SOM)

37 Clustering Cluster Example on Genes Much easier to look at large blocks of similarly expressed genes Dendrogram helps show how ‘closely related’ expression patterns are A. Cholesterol syn. B. Cell cycle C. Immediate-early response D. Signaling E. Tissue remodeling

38 2 Steps –Pick a distance method: Correlation, Euclidean –Pick the linkage method: Average linkage, Complete linkage, Single linkage
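The linkage method defines the distance between two clusters in terms of the distances between their members. A minimal sketch of the three linkages named above, using Euclidean distance and hypothetical points:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def linkage_distance(cluster_a, cluster_b, method="average", dist=euclidean):
    """Distance between two clusters under a given linkage method."""
    pairwise = [dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":      # closest pair of members
        return min(pairwise)
    if method == "complete":    # farthest pair of members
        return max(pairwise)
    if method == "average":     # mean over all member pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

cluster_a = [[0.0, 0.0], [1.0, 0.0]]
cluster_b = [[3.0, 0.0], [5.0, 0.0]]
for method in ("single", "complete", "average"):
    print(method, linkage_distance(cluster_a, cluster_b, method))
```

Here the member-to-member distances are 3, 5, 2, and 4, so single linkage gives 2, complete linkage gives 5, and average linkage gives 3.5.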

39 Correlation Compares shape of expression curves (-1 to 1) Can detect inverse relationships (absolute correlation)

40 Two Flavors of Correlation Correlation (centered, classical Pearson) Correlation (un-centered) –Assumes the mean of the data is 0, penalizes if not –Measures both similarity of shape and the offset from 0
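The difference between the two flavors shows up when two profiles have the same shape but one is offset from 0. A sketch using the standard formulas and hypothetical profiles:

```python
import math

def centered_correlation(x, y):
    """Classical Pearson correlation: subtracts each profile's mean first."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def uncentered_correlation(x, y):
    """Same formula with the means taken to be 0."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-1.0, 0.0, 1.0]
y = [9.0, 10.0, 11.0]   # same shape as x, shifted up by 10

r_centered = centered_correlation(x, y)
r_uncentered = uncentered_correlation(x, y)
print(round(r_centered, 3), round(r_uncentered, 3))
```

The centered value is 1 because the shapes match exactly, while the un-centered value is close to 0 because `y` sits far from 0.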

41 Euclidean Distance

42 Similarity/Distance Metric Summary Centered correlation: shape Un-centered correlation: shape and offset Euclidean: distance
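The metrics can disagree about which profiles are "close". In this sketch with hypothetical profiles, Euclidean distance ranks an inversely related profile as nearer than one with the same shape at twice the magnitude, while correlation ranks them the other way round.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Centered (Pearson) correlation between two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0]
same_shape = [2.0, 4.0, 6.0]   # same shape, larger magnitude
inverse = [3.0, 2.0, 1.0]      # inverse relationship

d_same, d_inverse = euclidean(x, same_shape), euclidean(x, inverse)
r_same, r_inverse = pearson(x, same_shape), pearson(x, inverse)
print(round(d_same, 3), round(d_inverse, 3), r_same, r_inverse)
```

Taking the absolute value of the correlation would treat the inverse profile as equally similar, which is how inverse relationships can be detected.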

43 Hierarchical Clustering Example

44 Tree Cutting 2 clusters? 3 clusters? 4 clusters? [Figure: dendrogram cut at different levels; axis: degrees of dissimilarity]
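Cutting the tree at a level that leaves k branches is equivalent to stopping the agglomerative merging when k clusters remain. The sketch below does that with average linkage and Euclidean distance; it is a naive illustration with hypothetical data, not mAdb's implementation.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(a, b, points):
    """Mean distance over all pairs of members (a and b hold point indices)."""
    total = sum(euclidean(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], points)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical one-dimensional profiles.
points = [[0.0], [0.2], [5.0], [5.1], [12.0]]
print(agglomerate(points, 3))
print(agglomerate(points, 2))
```

Asking for fewer clusters corresponds to cutting the tree at a higher degree of dissimilarity.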

45 Hierarchical Clustering Summary Detection of patterns for both genes and samples Good visualization with tree graphs Dataset size limitations No partition in results; requires tree cutting

46 Partitional clustering : K-means Partition data into K clusters, with number K supplied by user. Produce cluster membership as results.

47 K-means Algorithm Divide observations into K clusters. Use cluster averages (means) to represent clusters. Maximize the inter-cluster distance. Minimize the intra-cluster distance.
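The standard iteration for K-means (often called Lloyd's algorithm) alternates between assigning each observation to its nearest cluster mean and recomputing the means. The sketch below assumes squared Euclidean distance and takes the starting centers as an argument to keep the result deterministic; it is not mAdb's implementation, and the data are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
        for p in points
    ]

def kmeans(points, centers, n_iter=10):
    """Alternate assignment and mean-update steps, up to n_iter times."""
    centers = [list(c) for c in centers]
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[k])   # keep an empty cluster's center
        if new_centers == centers:               # converged
            break
        centers = new_centers
    return assign(points, centers), centers

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centers = kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
print(labels, centers)
```

In practice the starting centers are usually chosen at random, so different seeds can give different partitions.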

48-51 K-means Algorithm [Figures: data points X1 to X21 and four cluster centers k1 to k4]

52 mAdb K-means Options Set number of clusters Set number of iterations Activate random seed Adjust data for analysis Hierarchical clustering within node

53 Data Adjustment Options Adjusts data rows so median/mean will be zero Used only for analysis – not saved in dataset Center genes to compare relative values among genes Not appropriate if clustering arrays Not appropriate if using Euclidean distance/similarity metric
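Centering a row means subtracting that row's median or mean so the adjusted values are relative to the gene's own typical level. A sketch with a hypothetical row, where `None` marks a missing value:

```python
from statistics import mean, median

def center_row(row, how="median"):
    """Subtract the row's median (or mean) from every non-missing value."""
    present = [v for v in row if v is not None]
    center = median(present) if how == "median" else mean(present)
    return [None if v is None else v - center for v in row]

row = [2.0, 3.0, None, 7.0]
median_centered = center_row(row, how="median")
mean_centered = center_row(row, how="mean")
print(median_centered, mean_centered)
```

The function returns a new list and leaves the input row untouched, mirroring the note that the adjustment is used only for analysis and is not saved.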

54 K-means Clustering Example Save as input to TreeView Create new subset of genes Show hierarchical clustering

55 Summary Fast algorithm Partitions features into smaller, manageable groups mAdb allows hierarchical clustering within each K-means cluster Must supply a reasonable value for K No relationship among partitions

56 Self-Organizing Maps (SOM) Partitions data into a 2-dimensional grid of nodes Clusters on the grid have topological relationships 2 numbers for the dimensions of the grid supplied by user
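The following is a minimal, generic SOM sketch showing how a grid of nodes acquires topological relationships: each training step moves the best-matching node and its grid neighbors toward one observation. The learning-rate and radius schedules, the parameter names, and the data are all assumptions, and this is not mAdb's implementation.

```python
import math
import random

def best_node(weights, x):
    """Grid coordinate of the node whose weight vector is closest to x."""
    return min(weights, key=lambda n: sum((a - b) ** 2 for a, b in zip(weights[n], x)))

def train_som(data, nx, ny, n_iter=200, seed=0, lr0=0.5):
    """Train an nx-by-ny SOM; returns {(i, j): weight vector}."""
    rng = random.Random(seed)
    radius0 = max(nx, ny) / 2.0
    # Start each node at a randomly chosen observation.
    weights = {(i, j): list(rng.choice(data)) for i in range(nx) for j in range(ny)}
    for t in range(n_iter):
        frac = 1.0 - t / n_iter              # decays from 1 toward 0
        lr = lr0 * frac                      # learning rate shrinks over time
        radius = max(radius0 * frac, 0.5)    # neighborhood shrinks over time
        x = rng.choice(data)
        bi, bj = best_node(weights, x)
        for (i, j), w in weights.items():
            grid_d2 = (i - bi) ** 2 + (j - bj) ** 2
            h = math.exp(-grid_d2 / (2.0 * radius ** 2))   # neighbor influence
            for d in range(len(w)):
                w[d] += lr * h * (x[d] - w[d])
    return weights

# Hypothetical 2-value profiles forming two loose groups.
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
weights = train_som(data, nx=2, ny=2)
membership = [best_node(weights, x) for x in data]
print(membership)
```

Because neighboring nodes are updated together, nodes that are adjacent on the grid tend to end up with similar weight vectors.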

57 mAdb SOM options Set number of clusters (X, Y) Set number of iterations Hierarchical within SOM clusters Activate Randomized Partition

58 SOM Clustering Example Save as input to TreeView Create new subset of genes Show hierarchical clustering

59 mAdb SOM options Set number of clusters (X, Y) Set number of iterations Activate Randomized Partition Hierarchical within SOM clusters

60 Save as input to TreeView Create new subset of genes Show hierarchical clustering Heat map View

61 Line Plot View Toggle back to Heat Map View

62 SOM Summary Neighboring partitions similar to each other Partitions features into smaller groups mAdb allows hierarchical clustering within each SOM cluster Results may depend on initial partitions

63 Summary of mAdb Clustering Tools (Hierarchical / K-means / SOM) Relationship visualization: Tree structure / Partition membership / Partition 2-D topology Data Size, Performance: Small, Large; Slow, Fast, Middle Cluster Type: Gene/Array, Gene

64 Cluster Analysis Normalization is important Reduce data points by variance Use K-means or SOM to partition dataset Use biological information to interpret results
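Reducing data points by variance usually means keeping only the genes whose expression varies most across arrays. A sketch assuming a simple top-N rule; the gene names, values, and the `n_keep` parameter are hypothetical.

```python
from statistics import variance

def top_variance_genes(dataset, n_keep):
    """Names of the n_keep genes with the largest variance across arrays."""
    ranked = sorted(dataset, key=lambda gene: variance(dataset[gene]), reverse=True)
    return ranked[:n_keep]

dataset = {
    "flat": [0.1, 0.0, -0.1, 0.0],
    "variable": [2.0, -2.0, 1.5, -1.5],
    "moderate": [0.5, -0.5, 0.5, -0.5],
}
selected = top_variance_genes(dataset, n_keep=2)
print(selected)
```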

65 Hands-on Session 2 Lab 5 - lab 6 (Lab 7 optional) Total time: 15 minutes

66 Principal Component Analysis How different samples are from each other Project high-dimensional data into lower dimensions, which captures most of the variance Display data in 2D or 3D plot to reveal the data pattern

67 Principal Component Analysis Hypothesis - there exist unobservable or “hidden” variables (complex traits) which have given rise to the correlation among the observed objects (genes or microarrays or patients) The Principal Components (PC) Model is a straightforward model that seeks to achieve this objective

68 PCA 3D plot Axes represent the first 3 components The first 3 components should explain most of the variance Formation of clusters Relationship of clusters.
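Whether the first few components "explain most of the variance" can be checked from the eigenvalues of the covariance matrix. The sketch below estimates the leading eigenvalues by power iteration with deflation, which is a simple stand-in for a proper eigensolver. The data are hypothetical, and convergence can be slow when eigenvalues are close together.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix; rows are observations, columns are variables."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
        for i in range(m)
    ]

def top_eigenpair(matrix, n_iter=500):
    """Largest eigenvalue and its eigenvector, by power iteration."""
    m = len(matrix)
    v = [1.0 / (i + 1) for i in range(m)]    # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(m)) for i in range(m)]
    return sum(a * b for a, b in zip(v, mv)), v

def explained_variance(rows, n_components):
    """Fraction of total variance captured by each leading component."""
    cov = covariance_matrix(rows)
    m = len(cov)
    total = sum(cov[i][i] for i in range(m))
    fractions = []
    for _ in range(n_components):
        lam, v = top_eigenpair(cov)
        fractions.append(lam / total)
        # Deflation: remove the component just found.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(m)] for i in range(m)]
    return fractions

# Hypothetical data: two strongly correlated variables and one near-constant one.
rows = [[1.0, 1.0, 0.0], [2.0, 2.1, 0.1], [3.0, 2.9, -0.1], [4.0, 4.2, 0.0]]
fractions = explained_variance(rows, n_components=2)
print([round(f, 4) for f in fractions])
```

Plotting these fractions in decreasing order gives a scree plot, and a steep drop after the first few components suggests a 2D or 3D plot will show most of the structure.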

69 Basic Idea of PCA: a data reduction method based on analysis of the correlation pattern(s) that can exist among the observed random variables (i.e., expression values of genes). The structure of the correlation matrix is the major object for PCA. Raw data: n is the number of genes (gene probes); m is the number of arrays (experiments). A correlation matrix is a symmetric matrix of correlation coefficients.

70 The results of PCA are a small set of orthogonal (independent) variables (a grouping of the variables). From a purely mathematical viewpoint, the purpose of PCA is to transform n correlated random variables to an orthogonal set which reproduces the original variance/covariance structure. [Figure: variables x1 and x2 with principal axes y1 and y2; r12 = 0.90] The first principal component y1 can “explain” the major fraction (~90%) of the dispersion of variables x1 and x2 for all of the 10 observed objects.

71 Sample: Small Round Blue Cell Tumors (SRBCTs) 63 arrays representing 4 groups –BL (Burkitt lymphoma, n1=8) –EWS (Ewing, n2=23) –NB (neuroblastoma, n3=12) –RMS (rhabdomyosarcoma, n4=20) There are 2308 features (distinct gene probes)

72 PCA Detailed Plot “Scree” plot 2-D plots

73 PCA 2-D plots First 2 components separate 3 groups well

74 MDS overview (Multidimensional Scaling) An alternative to PCA Non-linear projection methodology Tolerates missing values

75 Summary of PCA and MDS Dimension reduction tools Graphic representation to help explain patterns Quality control for experimental variance

76 Hands-on Session 3 Lab 8 Total time: 15 minutes Next class tomorrow at 1:00 pm