Introduction to microarray

Introduction to microarray Bin Yao byao@med.wayne.edu

Types of microarray: Affymetrix GeneChip (oligo); spotted array (cDNA/oligo).

Affymetrix GeneChip: in-situ synthesis by photolithography and combinatorial chemistry. Each probe set contains 13-21 pairs of 25-mer oligo probes. Each pair consists of a perfect match (PM) probe and a mismatch (MM) probe.

Spotted array: cDNA or oligo probes are printed on glass slides using an arrayer.

Procedures [Diagram: Sample 1 mRNA (Cy3) and Sample 2 mRNA (Cy5) are hybridized to the same array]

[Diagram: scanning components: laser, array, PMT, ADC, image, data]

Image quantification: pixel value. Image: 16-bit grayscale image. Values range from 0 to 65535 ($2^{16}$ values). A signal above 65535 is saturated.

Image segmentation: separate signal, background and contamination. Output data files (spotted array): Signal Mean, Background Mean, Signal Median, Background Median, Signal Stdev, Background Stdev.

Output data files: Affymetrix
.DAT: pixel data
.CEL: intensity information for a given probe on an array
.EXP: experiment information
.CHP: analysis result from a Microarray Suite analysis

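Get gene expression value from probe level data: consolidate the 26 probe values of a probe set (13 PM and 13 MM) into one gene expression value.
MAS (4 and 5): Affymetrix algorithm. Gene expression = weighted average of (PM - MM).
dChip: model-based expression index, $PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$, with invariant set normalization.
RMA: robust multi-array average, normalized $\log(PM_{ij} - BKG) = \mu_i + \alpha_j + \varepsilon_{ij}$, with quantile normalization.
Here i indexes arrays and j indexes probes; $\theta_i$ and $\mu_i$ are the expression values on array i, $\phi_j$ and $\alpha_j$ are probe effects, and $\varepsilon_{ij}$ is the error term.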
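An additive model of the form log intensity = array effect + probe effect + error is commonly fitted with Tukey's median polish. The following is a minimal sketch of that fit, assuming a small table of log intensities with rows as arrays and columns as probes. The function name `median_polish` and the toy numbers are illustrative assumptions, not taken from the slides.

```python
import statistics

def median_polish(table, n_iter=10):
    """Simplified median polish: table[i][j] is the log intensity of probe j on array i."""
    rows, cols = len(table), len(table[0])
    resid = [row[:] for row in table]
    row_eff = [0.0] * rows   # array-level expression estimates
    col_eff = [0.0] * cols   # probe effects
    for _ in range(n_iter):
        # Sweep row medians out of the residuals.
        for i in range(rows):
            m = statistics.median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        # Sweep column medians out of the residuals.
        for j in range(cols):
            m = statistics.median(resid[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                resid[i][j] -= m
    return row_eff, col_eff, resid

# Toy data built as array effect (8 or 10) plus probe effect (-1, 0, 1).
table = [[7.0, 8.0, 9.0],
         [9.0, 10.0, 11.0]]
row_eff, col_eff, resid = median_polish(table)
```

Because medians are used instead of means, a single outlying probe has little influence on the array-level estimates.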

Data analysis: what are the problems for microarray data analysis? Different sources of variance. A large number of genes (high false positives). A small number of replicates (low sensitivity).

Data pre-processing
Background correction: the signal of a spot contains specific binding signal, non-specific binding signal and background signal.
Background estimation: local background, global background and negative control spots.
Data filtering: remove low-signal spots and contaminated spots.
Data transformation: the ratio is not symmetric. A 2-fold decrease is 0.5 and a 2-fold increase is 2, around a no-change value of 1. The log ratio is symmetric: log2(2-fold decrease) = -1 and log2(2-fold increase) = 1. What is multiplicative in ratio is additive in logarithm: log(A/B) = log A - log B.
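A small sketch of the log transformation described above, using only the Python standard library. The intensities are made-up values chosen to give exact 2-fold changes.

```python
import math

def log2_ratio(a, b):
    """Return log2(a / b) for two positive intensities."""
    return math.log2(a / b)

two_fold_up = log2_ratio(200.0, 100.0)    # ratio 2.0
two_fold_down = log2_ratio(100.0, 200.0)  # ratio 0.5

# Multiplicative in ratio, additive in logarithm: log(A/B) = log A - log B.
difference_of_logs = math.log2(200.0) - math.log2(100.0)
```

The two log ratios have the same magnitude and opposite signs, which is what makes up- and down-regulation directly comparable.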

[Figure: fold change distribution compared with log(fold) distribution]

Sources of variance: printing pin; scanning (laser and detector, PMT, focus); hybridization (temperature, time, mixing, etc.); probe labeling; RNA preparation; biological variability.

Normalization: many other effects (systematic errors) besides the treatment effect can also change gene signal values. Normalization removes systematic errors so that gene signals can be compared directly. Numerous normalization methods are available. How to choose? Understand the sources of variation in your data. Understand the assumptions behind each method. Use diagnostic plots.

Normalization methods
Dividing by mean or median: normalized signal = (signal of a spot on an array) / (mean or median intensity of all spots on the array). This can be done for a subset of genes, e.g. excluding genes whose intensity is in the top 10% or bottom 10%, to minimize the effect of outliers or differentially expressed genes.
Subtracting mean: used for log-transformed data.
Z-transformation: normalized signal = (signal of a spot - mean signal of the array) / (standard deviation of the signal on the array).
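The two formulas above can be sketched as follows. This is a minimal illustration on one made-up array of five spot signals. The function names are hypothetical, and the sample standard deviation is an assumption, since the slide does not say which estimator is meant.

```python
import statistics

def divide_by_median(signals):
    """Scale one array so that its median signal becomes 1."""
    med = statistics.median(signals)
    return [s / med for s in signals]

def z_transform(signals):
    """Center one array at 0 and scale it to unit standard deviation."""
    mean = statistics.mean(signals)
    sd = statistics.stdev(signals)  # sample standard deviation (assumption)
    return [(s - mean) / sd for s in signals]

array = [120.0, 80.0, 100.0, 400.0, 60.0]
scaled = divide_by_median(array)
z_scores = z_transform(array)
```

Dividing by the median rather than the mean keeps the one large value (400) from dominating the scale factor.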

Normalization methods Quantile normalization:
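A minimal sketch of quantile normalization as it is commonly formulated: sort each array, average the values at each rank across arrays to get a reference distribution, then give every spot the reference value for its rank. The function name and toy data are assumptions, and ties are not treated specially here.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one list of signals per array."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    reference = [sum(sa[r] for sa in sorted_arrays) / len(arrays) for r in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # spot indices, smallest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

arrays = [[5.0, 2.0, 3.0],
          [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After normalization every array has exactly the same set of values, and only the assignment of values to spots differs.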

Housekeeping gene normalization: normalized signal = (signal of a spot) / (signal of housekeeping gene(s)).
Intensity dependent normalization: use local regression to correct non-linear intensity dependency.
[Figure: before normalization and after normalization]

Which genes are differentially expressed? One of the goals of a microarray experiment is to find lists of genes that are up- or down-regulated between treatments. Fold change: simple, but with low sensitivity and high false positives.

Hypothesis tests: take into consideration both the magnitude of the change and the uncertainty of the measurement.
t-test: two-group comparison. Student's t-test: assumes equal variance and normal distribution. Welch's method: assumes normal distribution; the variances need not be equal.
Wilcoxon and Mann-Whitney: non-parametric, no assumption about the distribution.
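A sketch of the two t statistics for one gene measured in two groups, using the textbook formulas. It returns the statistic and the degrees of freedom only. Turning these into a p-value needs the t distribution, which is left out here. The expression values are invented for illustration.

```python
import math
import statistics

def student_t(x, y):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    pooled = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))
    return t, nx + ny - 2

def welch_t(x, y):
    """Two-sample t statistic without the equal-variance assumption."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x) / nx, statistics.variance(y) / ny
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

control = [7.1, 6.9, 7.3, 7.0]   # log2 expression, made-up values
treated = [8.2, 8.0, 8.6, 7.9]
t_student, df_student = student_t(control, treated)
t_welch, df_welch = welch_t(control, treated)
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ. With unequal sizes and unequal variances they can diverge noticeably.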

Analysis of Variance (ANOVA): compares multiple groups to find which genes are differentially expressed in at least one condition. A post hoc test finds the condition(s) that change gene expression.
Two- or higher-way ANOVA: one-way ANOVA tests only one factor, the treatment effect. In a microarray experiment there is more than one factor. Some of these are factors that we are not interested in but that cannot be avoided.
An ANOVA model for two-color microarrays: Y = A + D + G + A*D + G*T, where A = array effect, D = dye effect, G = gene effect, T = treatment effect, A*D = array-dye interaction, G*T = gene-treatment interaction (usually this is what we are interested in).
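A minimal sketch of the one-way ANOVA F statistic for a single gene measured under several conditions. It covers only the one-factor case, not the two-color model with array and dye effects. The function name and the numbers are illustrative assumptions.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of measurements."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the measurements around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

groups = [[1.0, 2.0, 3.0],   # condition 1
          [2.0, 3.0, 4.0],   # condition 2
          [5.0, 6.0, 7.0]]   # condition 3
f_stat, df_b, df_w = one_way_anova_f(groups)
```

A large F says that at least one condition differs. It does not say which one, which is the job of the post hoc test.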

Multiple testing and p-value adjustment
If the probability of making a false positive when doing a t-test for a single gene is p = 0.05, for 5000 genes you can expect 5000 x 0.05 = 250 false positives. To ensure that the probability of making one mistake over the entire 5000 genes is still 0.05 (family-wise error rate), the p-value for each gene needs to be adjusted.
Bonferroni adjustment: simple but conservative. $p^* = \min\{p \times N, 1\}$, where p is the raw p-value and N is the total number of tests.
Holm or step-down Bonferroni: less conservative.
Westfall and Young's permutation: takes into consideration possible correlations between genes. Slow.
False discovery rate: expected percentage of false positives in the gene list.
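The adjustments above can be sketched in a few lines. Bonferroni follows the formula on the slide and Holm is the usual step-down variant. For the false discovery rate, the Benjamini-Hochberg procedure is used here as an assumption, since the slide does not name a specific FDR method. The raw p-values are invented.

```python
def bonferroni(pvalues):
    """p* = min(p * N, 1) for every test."""
    n = len(pvalues)
    return [min(p * n, 1.0) for p in pvalues]

def holm(pvalues):
    """Step-down Bonferroni: the k-th smallest p is multiplied by (N - k + 1)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min((n - rank) * pvalues[idx], 1.0)
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

def benjamini_hochberg(pvalues):
    """FDR-adjusted p-values (Benjamini-Hochberg step-up procedure)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n - 1, -1, -1):  # from the largest p to the smallest
        idx = order[rank]
        value = min(pvalues[idx] * n / (rank + 1), 1.0)
        running_min = min(running_min, value)
        adjusted[idx] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
p_bonferroni = bonferroni(raw)
p_holm = holm(raw)
p_fdr = benjamini_hochberg(raw)
```

For every gene the adjusted values satisfy FDR <= Holm <= Bonferroni, which mirrors the ordering from least to most conservative.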

Cluster analysis: first used by Tryon (1939) to organize observed data into meaningful structures. Finds genes that have similar expression profiles. Types of cluster analysis: hierarchical clustering and k-means clustering.

Hierarchical cluster: a dendrogram or tree shows the hierarchical relationship. Bottom up (agglomerative): start from individual genes. Measure the distance of all pairs of genes/nodes. Join the two genes/nodes with the shortest distance. Iterate until all genes are joined.
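The bottom-up procedure can be sketched as a simple loop. This version assumes Euclidean distance between expression profiles and nearest-neighbor (single) linkage between nodes. Both are choices made for the example, as are the gene names and profiles.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerate(profiles):
    """profiles: dict of gene name -> expression vector. Returns the merge order."""
    clusters = {name: [name] for name in profiles}
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        best = None
        # Measure the distance of all pairs of nodes and keep the shortest.
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                d = min(euclidean(profiles[a], profiles[b])
                        for a in clusters[names[i]] for b in clusters[names[j]])
                if best is None or d < best[0]:
                    best = (d, names[i], names[j])
        d, left, right = best
        # Join the two closest nodes into a new node.
        clusters[left + "+" + right] = clusters.pop(left) + clusters.pop(right)
        merges.append((left, right, d))
    return merges

profiles = {"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [5.0, 5.0], "g4": [0.0, 3.0]}
merges = agglomerate(profiles)
```

The list of merges, read in order with their distances, is the information a dendrogram draws: g1 and g2 join first, g4 joins them next, and g3 joins last.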

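[Figure: agglomerative clustering of four genes g1-g4. Find the minimum of the pairwise distances {d1…d6} and join g1 and g2 into node g12. Find the minimum of the remaining distances {d1'…d3'} and join g12 and g4 into node g124. Finally join g124 and g3 at distance d1''.]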

K-means cluster: find k clusters that are separated as far as possible. Start from k random clusters and move elements between clusters to minimize the variability within clusters and maximize the variability between clusters. Iterate until convergence or until a specified number of iterations is reached. Some methods have been developed to estimate the number of clusters, e.g. the silhouette plot. However, there is no completely satisfactory method for determining the number of clusters.
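A minimal k-means sketch using the common assign-then-recompute iteration (Lloyd's algorithm), which is one way to realize the "move elements between clusters" step. The initial centers are passed in explicitly so the example is reproducible. In practice they would be chosen at random. Names and data are illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Return (labels, centers) after iterating until assignments stop changing."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                      for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Move every center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its old center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

points = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
labels, centers = kmeans(points, centers=[points[0], points[2]])
```

The result depends on the starting centers, which is one reason k-means is usually run several times with different random starts.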

Distance measurement
Euclidean distance: $\mathrm{distance}(x,y) = \sqrt{\sum_i (x_i - y_i)^2}$

City-block (Manhattan) distance: $\mathrm{distance}(x,y) = \sum_i |x_i - y_i|$, e.g. d(A,B) = a + b + c + d for coordinate differences a, b, c and d. The result is similar to Euclidean distance, but the effect of a single outlier is smaller. Both methods measure geometric distance.
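A short sketch of both geometric distances. The second pair of vectors is a made-up example with one outlying coordinate, used to compare how much of each distance that single coordinate accounts for.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

d_euclid = euclidean_distance([0.0, 0.0], [3.0, 4.0])
d_manhattan = manhattan_distance([0.0, 0.0], [3.0, 4.0])

# One outlying coordinate (9) among otherwise small differences (1, 1, 1).
x = [0.0, 0.0, 0.0, 0.0]
y = [1.0, 1.0, 1.0, 9.0]
outlier_share_euclid = 9.0 ** 2 / euclidean_distance(x, y) ** 2  # share of the squared distance
outlier_share_manhattan = 9.0 / manhattan_distance(x, y)         # share of the distance
```

Because Euclidean distance squares each difference, the outlier accounts for a larger share of it than of the Manhattan distance.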

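Angle distance: Euclidean distance depends on the magnitude of the vectors; angle distance does not. Angle distance measures the angle between two vectors, so moving A and B along their lines from the origin (to A' and B') does not change the angle distance between them, although the Euclidean distance changes from d to d'. $d(x,y) = \theta$, the angle between x and y, with $\cos\theta = \frac{x \cdot y}{\|x\|\,\|y\|}$.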
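A minimal sketch of an angle distance, assuming it is defined as the angle (in radians) between the two expression vectors. The vectors are toy values. The second pair lies on the same two lines through the origin as the first.

```python
import math

def angle_distance(x, y):
    """Angle in radians between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = max(-1.0, min(1.0, dot / (norm_x * norm_y)))  # clamp rounding error
    return math.acos(cos_theta)

a, b = [1.0, 0.0], [1.0, 1.0]
a_moved, b_moved = [3.0, 0.0], [2.0, 2.0]   # same directions, different magnitudes

theta = angle_distance(a, b)
theta_moved = angle_distance(a_moved, b_moved)
```

The angle stays at 45 degrees after the move, while the Euclidean distance between the two points doubles from 1 to 2.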

Pearson correlation: measures how closely two genes change in the same way. $r_{xy}$ is between -1 and 1; $r_{xy} < 0$ means the two genes change in opposite ways. Distance is defined as $1 - |r_{xy}|$.
Spearman correlation: a non-parametric method, similar to Pearson correlation.
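A sketch of the correlation-based distance 1 - |r|, with Pearson correlation computed directly and Spearman correlation computed as Pearson correlation on ranks. Tied values are not averaged in this simplified ranking, and the gene profiles are invented.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1 (ties are not averaged in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = float(rank)
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def correlation_distance(x, y, corr=pearson):
    return 1.0 - abs(corr(x, y))

gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]     # same pattern, larger magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]     # opposite pattern
gene_d = [1.0, 4.0, 9.0, 100.0]   # same order, non-linear
```

With the absolute value in the definition, genes that change in exactly opposite ways (gene_a and gene_c) are as close as genes that change together, which groups co-regulated and anti-regulated genes.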

Linkage: determines the distance between clusters.
Single linkage (nearest neighbor): the distance between two nodes is determined by the distance of the two closest objects (nearest neighbors) in the different nodes.
Complete linkage (furthest neighbor): the distance between nodes is determined by the greatest distance between any two objects ("furthest neighbors") in the different nodes.

Average (centroid): the centroid of a node is the average point in the multidimensional space. It is the center of the node. The distance between two clusters is determined as the distance between centroids.
[Figure: single linkage, average linkage, complete linkage]
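The three linkage rules differ only in how a distance between two groups of points is derived from point-to-point distances. A minimal sketch with Euclidean distance follows. The two toy clusters and function names are assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math

def single_linkage(c1, c2):
    """Distance of the two closest objects in the different nodes."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Greatest distance between any two objects in the different nodes."""
    return max(math.dist(p, q) for p in c1 for q in c2)

def centroid(cluster):
    """Average point of a node."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def centroid_linkage(c1, c2):
    """Distance between the centroids of the two nodes."""
    return math.dist(centroid(c1), centroid(c2))

node_1 = [[0.0, 0.0], [0.0, 2.0]]
node_2 = [[3.0, 0.0], [3.0, 2.0]]

d_single = single_linkage(node_1, node_2)
d_complete = complete_linkage(node_1, node_2)
d_centroid = centroid_linkage(node_1, node_2)
```

Single linkage tends to chain elongated clusters together, while complete linkage favors compact ones. The centroid rule sits in between.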

Self-Organizing Map: the Self-Organizing Map (SOM) was introduced by Teuvo Kohonen in 1982. In this artificial neural network, neurons that form a one- or two-dimensional elastic net lattice are trained with input data. Neurons compete to approximate the density of the data. After the training is over, input data vectors map to n adjacent map neurons.

[Figure: input layer and neurons] Neurons compete for the input pattern. The winner takes all. The winner and its neighbors move toward the input pattern. Neighborhood: which neurons move with the winner. Learning rate: how much the winner moves each time.
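One training step of the competition described above can be sketched for a one-dimensional lattice. This simplified version moves the winner and every neuron within a fixed lattice radius by the same learning rate. Real implementations usually shrink both the radius and the learning rate over time. All names and numbers are illustrative.

```python
def som_train_step(weights, x, learning_rate=0.5, radius=1):
    """Update a 1-D lattice of neuron weight vectors in place; return the winner index."""
    # Winner: the neuron whose weight vector is closest to the input pattern.
    winner = min(range(len(weights)),
                 key=lambda k: sum((w - v) ** 2 for w, v in zip(weights[k], x)))
    for k in range(len(weights)):
        if abs(k - winner) <= radius:  # neighborhood: winner and adjacent neurons
            weights[k] = [w + learning_rate * (v - w) for w, v in zip(weights[k], x)]
    return winner

weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
winner = som_train_step(weights, [0.8, 1.0])
```

After this step neurons 0, 1 and 2 have each moved halfway toward the input, while neuron 3, outside the neighborhood, is unchanged.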

Other methods
Principal component analysis (PCA): reduces the dimensionality of the data matrix by finding new variables. Intended to narrow the number of variables down to only those that are of importance. [Figure: points A and B shown on the original axes x, y and the new axes x', y']
Machine learning: trained with a data set with known classification, then used to predict or classify a new data set.
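A minimal sketch of finding the first "new variable" of PCA: center the data, form the covariance matrix, and find its dominant eigenvector. Power iteration is used here only to stay within the standard library; it is an assumption of this example, not the method of any particular PCA tool. The data are made up.

```python
import math

def first_principal_component(data, n_iter=200):
    """Return (direction, scores) of the first principal component."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix of the variables.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] + [0.5] * (d - 1)  # arbitrary non-zero start vector
    for _ in range(n_iter):
        # Power iteration: repeatedly multiply by the covariance matrix and renormalize.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Project every centered observation onto the new axis.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores

data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
direction, scores = first_principal_component(data)
```

For these strongly correlated variables, one new axis running roughly along the diagonal captures almost all of the variation, so the two columns can be summarized by the single score.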

Biological data mining
Gene Ontology (GO): gene functions are classified into hierarchical structures. The top 3 are molecular function, biological process and cellular component. Tools using GO: Onto-Express, EASE, eGOn, GoSurfer.
Pathway: KEGG, GenMAPP.
Regulatory region analysis: Genomatix, TRANSFAC.
Gene network: Pathway Assist, iHOP.

Microarray standard: MIAME (Minimum Information About a Microarray Experiment) defines data standards, i.e. the information required to interpret and replicate an experiment: experimental design; array design; biological samples; hybridizations; measurements; data normalization and transformation.

MIAME checklist: http://www.mged.org/Workgroups/MIAME/miame_checklist.html
Public databases: ArrayExpress (EBI), GEO (NCBI), CIBEX (DDBJ).
Other microarray databases: BASE, SMD, Oncomine, YMD.