Analysis of Microarray Data Analysis of images Preprocessing of gene expression data Normalization of data –Subtraction of Background Noise –Global/local.

Slides:



Advertisements
Similar presentations
MicroArray Image Analysis Robin Liechti
Advertisements

Basic Gene Expression Data Analysis--Clustering
Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
Pre-processing in DNA microarray experiments Sandrine Dudoit PH 296, Section 33 13/09/2001.
MicroArray Image Analysis
MicroArray Image Analysis Robin Liechti
Microarray technology and analysis of gene expression data Hillevi Lindroos.
Image Quantitation in Microarray Analysis More tomorrow...
TIGR Spotfinder: a tool for microarray image processing
Dimension reduction : PCA and Clustering Agnieszka S. Juncker Slides: Christopher Workman and Agnieszka S. Juncker Center for Biological Sequence Analysis.
Microarray Data Preprocessing and Clustering Analysis
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
Discrimination Methods As Used In Gene Array Analysis.
Clustering (Gene Expression Data) 6.095/ Computational Biology: Genomes, Networks, Evolution LectureOctober 4, 2005.
Cluster Analysis Class web site: Statistics for Microarrays.
Image Analysis Class web site: Statistics for Microarrays.
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Microarrays and Gene Expression Analysis. 2 Gene Expression Data Microarray experiments Applications Data analysis Gene Expression Databases.
Normalization Review and Cluster Analysis Class web site: Statistics for Microarrays.
Microarray analysis 2 Golan Yona. 2) Analysis of co-expression Search for similarly expressed genes experiment1 experiment2 experiment3 ……….. Gene i:
Pitfalls in Cluster Analysis Darlene Goldstein Data Club 20 November 2002.
Scanning and image analysis Scanning -Dyes -Confocal scanner -CCD scanner Image File Formats Image analysis -Locating the spots -Segmentation -Evaluating.
Radial-Basis Function Networks
B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical.
1 Normalization Methods for Two-Color Microarray Data 1/13/2009 Copyright © 2009 Dan Nettleton.
Microarray Gene Expression Data Analysis A.Venkatesh CBBL Functional Genomics Chapter: 07.
Image Quantitation in Microarray Analysis More tomorrow...
Gene expression profiling identifies molecular subtypes of gliomas
CDNA Microarrays Neil Lawrence. Schedule Today: Introduction and Background 18 th AprilIntroduction and Background 25 th AprilcDNA Mircoarrays 2 nd MayNo.
Affymetrix vs. glass slide based arrays
Gene expression & Clustering (Chapter 10)
Analysis and Management of Microarray Data Dr G. P. S. Raghava.
Clustering of DNA Microarray Data Michael Slifker CIS 526.
Introduction to DNA Microarray Technology Steen Knudsen Uma Chandran.
Presented by Tienwei Tsai July, 2005
CDNA Microarrays MB206.
Panu Somervuo, March 19, cDNA microarrays.
Applying statistical tests to microarray data. Introduction to filtering Recall- Filtering is the process of deciding which genes in a microarray experiment.
Time Series Data Analysis - I Yaji Sripada. Dept. of Computing Science, University of Aberdeen2 In this lecture you learn What are Time Series? How to.
Image Classification 영상분류
Microarray - Leukemia vs. normal GeneChip System.
Microarrays and Gene Expression Analysis. 2 Gene Expression Data Microarray experiments Applications Data analysis Gene Expression Databases.
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman modified by Hanne Jarmer.
1 FINAL PROJECT- Key dates –last day to decided on a project * 11-10/1- Presenting a proposed project in small groups A very short presentation (Max.
MACHINE LEARNING 8. Clustering. Motivation Based on E ALPAYDIN 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2  Classification problem:
Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments Presented by Nan Lin 13 October 2002.
More About Clustering Naomi Altman Nov '06. Assessing Clusters Some things we might like to do: 1.Understand the within cluster similarity and between.
Gene Expression Analysis. 2 DNA Microarray First introduced in 1987 A microarray is a tool for analyzing gene expression in genomic scale. The microarray.
Statistical Analysis of DNA Microarray. An Example of HDLSS in Genetics.
Statistics for Differential Expression Naomi Altman Oct. 06.
Application of Class Discovery and Class Prediction Methods to Microarray Data Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics.
Course Work Project Project title “Data Analysis Methods for Microarray Based Gene Expression Analysis” Sushil Kumar Singh (batch ) IBAB, Bangalore.
Microarray hybridization Usually comparative – Ratio between two samples Examples – Tumor vs. normal tissue – Drug treatment vs. no treatment – Embryo.
Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
Microarray analysis Quantitation of Gene Expression Expression Data to Networks BIO520 BioinformaticsJim Lund Reading: Ch 16.
Clustering Instructor: Max Welling ICS 178 Machine Learning & Data Mining.
Analyzing Expression Data: Clustering and Stats Chapter 16.
ANALYSIS OF GENE EXPRESSION DATA. Gene expression data is a high-throughput data type (like DNA and protein sequences) that requires bioinformatic pattern.
Molecular Classification of Cancer Class Discovery and Class Prediction by Gene Expression Monitoring.
Variability & Statistical Analysis of Microarray Data GCAT – Georgetown July 2004 Jo Hardin Pomona College
Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
Semi-Supervised Clustering
Machine Learning Clustering: K-means Supervised Learning
Microarray - Leukemia vs. normal GeneChip System.
K-means and Hierarchical Clustering
Dimension reduction : PCA and Clustering
Text Categorization Berlin Chen 2003 Reference:
Normalization for cDNA Microarray Data
Presentation transcript:

Analysis of Microarray Data Analysis of images Preprocessing of gene expression data Normalization of data –Subtraction of Background Noise –Global/local Normalization –House keeping genes (or same gene) –Expression in ratio (test/references) in log Differential Gene expression –Repeats and calculate significance (t-test) –Significance of fold used statistical method Clustering –Supervised/Unsupervised (Hierarchical, K-means, SOM) Prediction or Supervised Machine Learnning (SVM)

Technical probe (on chip) sample (labelled) pseudo-colour image

Images from scanner Resolution –standard 10  m [currently, max 5  m] –100  m spot on chip = 10 pixels in diameter Image format –TIFF (tagged image file format) 16 bit (65’536 levels of grey) –1cm x 1cm image at 16 bit = 2Mb (uncompressed) –other formats exist e.g.. SCN (used at Stanford University) Separate image for each fluorescent sample –channel 1, channel 2, etc.

Images in analysis software The two 16-bit images (Cy3, Cy5) are compressed into 8-bit images Display fluorescence intensities for both wavelengths using a 24-bit RGB overlay image RGB image : –Blue values (B) are set to 0 –Red values (R) are used for Cy5 intensities –Green values (G) are used for Cy3 intensities Qualitative representation of results

Images : examples Cy3 Cy5 Spot colourSignal strengthGene expression yellowControl = perturbedunchanged redControl < perturbedinduced greenControl > perturbedrepressed Pseudo-colour overlay

Processing of images Addressing or gridding –Assigning coordinates to each of the spots Segmentation –Classification of pixels either as foreground or as background Intensity determination for each spot –Foreground fluorescence intensity pairs (R, G) –Background intensities –Quality measures

Background intensity Spot’s measured intensity includes a contribution of non- specific hybridization and other chemicals on the glass Fluorescence from regions not occupied by DNA should by different from regions occupied by DNA -> one solution is to use local negative controls (spotted DNA that should not hybridize) Different background methods : –Local background –Morphological opening –Constant background –No adjustment

Local background Focusing on small regions surrounding the spot mask. Median of pixel values in this region Most software package implement such an approach ImaGeneSpot, GenePixScanAlyze By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure

Morphological opening –Non-linear filtering, used in Spot –Use a square structuring element with side length at least twice as large as the spot separation distance –Compute local minimum filter, then compute local maximum filter This removes all the spots and generates an image that is an estimate of the background for the entire slide –For individual spots, the background is estimated by sampling this background image at the nominal center of the spot –Lower background estimate and less variable

Constant background Global method which subtracts a constant background for all spots Some evidence that the binding of fluorescent dyes to ‘negative control spots’ is lower than the binding to the glass slide -> More meaningful to estimate background based on a set of negative control spots –If no negative control spots : approximation of the average background = third percentile of all the spot foreground values

No background adjustment Do not consider the background –Probably not accurate, but may be better than some forms of local background determination!

Signal/Noise = log 2 (spot intensity/background intensity ) Histograms

Preprocessing of Gene expression Data Scale transformation –CY3/CY5 –LOG(CY3/CY5) Replicates handling –Inconsistent replicate removal –Replicate merging Missing value handling –Removal of patterns having excess of missing values –Value of missing points Flat pattern filtering Unknown Gene Removing

Preprocessing: Normalization Why? To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples. How do we know it is necessary? By examining self-self hybridizations, where no true differential expression is occurring. We find dye biases which vary with overall spot intensity, location on the array, plate origin, pins, scanning parameters,….

Normalization Techniques Global normalization –Divide channel value by means Control spots –Common spots in both channels –House keeping genes –Ratio of intensity of same gene in two channel is used for correction Iterative linear regression Parametric nonlinear nomalization –log(CY3/CY5) vs log(CY5)) –Fitted log ratio – observed log ratio General Non Linear Normalization –LOESS –curve between log(R/G) vs log(sqrt(R.G))

Pre-processed cDNA Gene Expression Data On p genes for n slides: p is O(10,000), n is O(10-100), but growing, Genes Slides Gene expression level of gene 5 in slide 4 = Log 2 ( Red intensity / Green intensity) slide 1slide 2slide 3slide 4slide 5 … These values are conventionally displayed on a red (>0) yellow (0) green (<0) scale.

Scatterplots: always log, always rotate log 2 R vs log 2 GM=log 2 R/G vs A=log 2 √RG

Classification Task: assign objects to classes (groups) on the basis of measurements made on the objects Unsupervised: classes unknown, want to discover them from the data (cluster analysis) Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

Cluster analysis Used to find groups of objects when not already known “Unsupervised learning” Associated with each object is a set of measurements (the feature vector) Aim is to identify groups of similar objects on the basis of the observed measurements

Example: Tumor Classification Reliable and precise classification essential for successful cancer treatment Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables Uncertainties in diagnosis remain; likely that existing classes are heterogeneous Characterize molecular variations among tumors by monitoring gene expression (microarray) Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)

Nearest Neighbor Classification Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation) k-nearest neighbor rule (Fix and Hodges (1951)) classifies an observation X as follows: –find the k observations in the learning set closest to X –predict the class of X by majority vote, i.e., choose the class that is most common among those k observations. The number of neighbors k can be chosen by cross-validation

Hierarchical Clustering Produce a dendrogram Avoid prespecification of the number of clusters K The tree can be built in two distinct ways: –Bottom-up: agglomerative clustering –Top-down: divisive clustering

Partitioning vs. Hierarchical Partitioning –Advantage: Provides clusters that satisfy some optimality criterion (approximately) –Disadvantages: Need initial K, long computation time Hierarchical –Advantage: Fast computation (agglomerative) –Disadvantages: Rigid, cannot correct later for erroneous decisions made earlier

Issues in Clustering Pre-processing (Image analysis and Normalization) Which genes (variables) are used Which samples are used Which distance measure is used Which algorithm is applied How to decide the number of clusters K

Filtering Genes All genes (i.e. don’t filter any) At least k (or a proportion p) of the samples must have expression values larger than some specified amount, A Genes showing “sufficient” variation –a gap of size A in the central portion of the data –a interquartile range of at least B Filter based on statistical comparison –t-test –ANOVA –Cox model, etc.

‘cluster’ unclustered Average linkage hierarchical clustering, melanoma only