Gene expression. Statistics 246, Week 3, 2002.



Thesis: the analysis of gene expression data is going to be big in 21st century statistics.
Many different technologies, including:
High-density nylon membrane arrays
Serial analysis of gene expression (SAGE)
Short oligonucleotide arrays (Affymetrix)
Long oligo arrays (Agilent)
Fibre optic arrays (Illumina)
cDNA arrays (Brown/Botstein)*

Figure: Total microarray articles indexed in Medline. Axes: year, number of papers (includes a projected value).

Common themes
Parallel approach to collection of very large amounts of data (by biological standards)
Sophisticated instrumentation, requires some understanding
Systematic features of the data are at least as important as the random ones
Often more like industrial process than single investigator lab research
Integration of many data types: clinical, genetic, molecular, ... databases

Biological background: transcription
DNA:  G T A A T C C T C
      | | | | | | | | |
      C A T T A G G A G
mRNA (via RNA polymerase): G U A A U C C

Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein might be better, but is currently harder.

Reverse transcription
Clone cDNA strands, complementary to the mRNA.
mRNA: G U A A U C C U C
      (reverse transcriptase)
cDNA: C A T T A G G A G
cDNA clones shown: C A T T A G G A G; T T A G G A G; C A T T A G G A G
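The base pairing on this slide can be checked with a short sketch. The function name and mapping table below are illustrative, not from the lecture; the complement is written in the same left-to-right order as the slide, ignoring strand orientation.

```python
# Base pairing used when copying mRNA into complementary DNA.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}

def complementary_cdna(mrna):
    """Base-by-base complement of an mRNA string, in the slide's left-to-right order."""
    return "".join(RNA_TO_CDNA[base] for base in mrna)

print(complementary_cdna("GUAAUCCUC"))   # CATTAGGAG
```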

cDNA microarray experiments
mRNA levels compared in many different contexts:
Different tissues, same organism (brain v. liver)
Same tissue, same organism (treatment v. control, tumor v. non-tumor)
Same tissue, different organisms (wild type v. knockout, transgenic, or mutant)
Time course experiments (effect of treatment, development)
Other special designs (e.g. to detect spatial patterns)

cDNA microarrays cDNA clones

cDNA microarrays
Compare the genetic expression in two samples of cells.
PRINT: cDNA from one gene on each spot
SAMPLES: cDNA labelled red/green, e.g. treatment / control, normal / tumor tissue

HYBRIDIZE: Add equal amounts of labelled cDNA samples to microarray.
SCAN: laser, detector

Workflow (flowchart):
Biological question (differentially expressed genes, sample class prediction, etc.)
-> Experimental design
-> Microarray experiment
-> 16-bit TIFF files
-> Image analysis: (Rfg, Rbg), (Gfg, Gbg)
-> Normalization: R, G
-> Estimation, Testing, Clustering, Discrimination
-> Biological verification and interpretation

Some statistical questions
Image analysis: addressing, segmenting, quantifying
Normalisation: within and between slides
Quality: of images, of spots, of (log) ratios
Which genes are (relatively) up/down regulated?
Assigning p-values to tests/confidence to results.

Some statistical questions, ctd
Planning of experiments: design, sample size
Discrimination and allocation of samples
Clustering, classification: of samples, of genes
Selection of genes relevant to any given analysis
Analysis of time course, factorial and other special experiments
... and much more.

Some bioinformatic questions
Connecting spots to databases, e.g. to sequence, structure, and pathway databases
Discovering short sequences regulating sets of genes: direct and inverse methods
Relating expression profiles to structure and function, e.g. protein localisation
Identifying novel biochemical or signalling pathways
... and much more.

Part of the image of one channel, false-coloured on a scale running from white (v. high) and red (high), through yellow and green (medium), to blue (low) and black.

Does one size fit all?

Segmentation: limitation of the fixed circle method
Panels: SRG (seeded region growing); Fixed Circle
Inside the boundary is spot (foreground), outside is not.

Some local backgrounds
We use something different again: a smaller, less variable value.
Figure: single channel, grey scale

Quantification of expression
For each spot on the slide we calculate
Red intensity = Rfg - Rbg
Green intensity = Gfg - Gbg
(fg = foreground, bg = background)
and combine them in the log (base 2) ratio
$\log_2(\text{Red intensity} / \text{Green intensity})$
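The calculation on this slide can be sketched for a single spot as follows. The function name and the toy numbers are illustrative only. Returning None when a background-corrected intensity is not positive is an assumption of this sketch; the slides do not say how such spots are handled.

```python
import math

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """log2(Red intensity / Green intensity) for one spot, or None if undefined."""
    red = r_fg - r_bg        # background-corrected red intensity
    green = g_fg - g_bg      # background-corrected green intensity
    if red <= 0 or green <= 0:
        return None          # assumption: leave the log ratio undefined for such spots
    return math.log2(red / green)

# Toy values: red = 1000, green = 500, so the ratio is 2 and its log2 is 1.
print(log_ratio(1200, 200, 600, 100))   # 1.0
```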

Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.
The data form a matrix with genes as rows and slides as columns (slide 1, slide 2, slide 3, slide 4, slide 5, …).
Gene expression level of gene 5 in slide 4 = $\log_2(\text{Red intensity} / \text{Green intensity})$
These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.

The red/green ratios can be spatially biased.
Top 2.5% of ratios red, bottom 2.5% of ratios green.

The red/green ratios can be intensity-biased
$M = \log_2(R/G) = \log_2 R - \log_2 G$
$A = (\log_2 R + \log_2 G)/2$
Values of M should scatter about zero.

Normalization: how we “fix” the previous problem
The curved line becomes the new zero line.
Legend:
Yellow: GAPDH, tubulin
Light blue: MSP pool / titration
Orange: Schadt-Wong rank invariant set
Red line: lowess smooth
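A minimal sketch of the idea, assuming NumPy is available: plot the log ratio M = log2 R - log2 G against the average log intensity A = (log2 R + log2 G)/2, fit a smooth curve, and subtract it so that the curve becomes the new zero line. The smoother below is a simplified local linear fit with tricube weights and no robustness iterations, so it is a stand-in for lowess, not the implementation used in the lecture. All names, the `frac` value and the simulated data are hypothetical.

```python
import numpy as np

def lowess_fit(a, m, frac=0.4):
    """Tricube-weighted local linear fit of m on a (no robustness iterations)."""
    a = np.asarray(a, dtype=float)
    m = np.asarray(m, dtype=float)
    n = len(a)
    k = min(n, max(3, int(np.ceil(frac * n))))    # points used in each local fit
    X = np.column_stack([np.ones(n), a])
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(a - a[i])
        h = max(np.sort(dist)[k - 1], 1e-12)      # bandwidth: k-th smallest distance
        w = np.clip(1.0 - (dist / h) ** 3, 0.0, None) ** 3   # tricube weights
        sw = np.sqrt(w)
        # Weighted least squares for the local line m ~ b0 + b1 * a.
        beta, *_ = np.linalg.lstsq(X * sw[:, None], m * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * a[i]
    return fitted

def normalise(red, green, frac=0.4):
    """Return (A, M, M_normalised) from background-corrected intensities."""
    log_r, log_g = np.log2(red), np.log2(green)
    m = log_r - log_g
    a = (log_r + log_g) / 2
    return a, m, m - lowess_fit(a, m, frac)

# Simulated spots with an intensity-dependent bias (toy data, not from the study).
rng = np.random.default_rng(0)
a_sim = rng.uniform(6, 14, 200)
m_sim = 0.05 * (a_sim - 10) ** 2 + 0.3 + rng.normal(0, 0.1, 200)
red, green = 2 ** (a_sim + m_sim / 2), 2 ** (a_sim - m_sim / 2)

a, m, m_norm = normalise(red, green)
print(round(float(np.mean(m)), 3), round(float(np.mean(m_norm)), 3))
```

After normalisation the M values should scatter about zero across the whole range of A, which is what the two printed means summarise.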

Normalizing: before (plot of M)

Normalizing: after (plot of normalised M)

From a study of the mouse olfactory system
Figure labels: Olfactory Epithelium; Vomeronasal Organ; Main (Accessory) Olfactory Bulb
From Buck (2000)

Axonal connectivity between the nose and the mouse olfactory bulb
>2M, ~1,800 types
Two principles: “zone-to-zone projection” and “glomerular convergence”
Figure label: Neocortex

Of interest: the hardwiring of the vertebrate olfactory system
Expression of a specific odorant receptor gene by an olfactory neuron.
Targeting and convergence of like axons to specific glomeruli in the olfactory bulb.

The biological question in this case
Are there genes with spatially restricted expression patterns within the olfactory bulb?

Layout of the cDNA Microarrays
Sequence verified mouse cDNAs
19,200 spots in two print groups of 9,600 each
- 4 x 4 grid, each with 25 x 24 spots
- Controls on the first 2 rows of each grid.
Figure labels: pg1, pg2
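The spot counts on this slide are consistent with each other, as a quick check shows. The variable names are illustrative.

```python
grids_per_group = 4 * 4      # 4 x 4 grid
spots_per_grid = 25 * 24     # each grid has 25 x 24 spots
print_groups = 2

spots_per_group = grids_per_group * spots_per_grid
total_spots = print_groups * spots_per_group
print(spots_per_group, total_spots)   # 9600 19200
```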

Design: How We Sliced Up the Bulb
Regions: A (anterior), P (posterior), D (dorsal), V (ventral), M (medial), L (lateral)

Design: Two Ways to Do the Comparisons
Goal: 3-D representation of gene expression
1. Compare all samples to a common reference sample (e.g., whole bulb). Diagram nodes: P, D, M, A, V, L, R.
2. Multiple direct comparisons between different samples (no common reference). Diagram nodes: P, D, M, A, V, L.

An Important Aspect of Our Design
Different ways of estimating the same contrast, e.g. A compared to P:
Direct = A-P
Indirect = (A-M) + (M-P) or (A-D) + (D-P) or -(L-A) - (P-L)
How do we combine these?
Diagram nodes: L, P, V, D, M, A

Analysis using a linear model
Define a matrix X so that $E(M) = X\beta$.
Use least squares estimates for A-L, P-L, D-L, V-L, M-L.
In practice, we use robust regression.
Estimates for other estimable contrasts follow in the usual way.
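A minimal sketch of such a fit for one gene, assuming NumPy. Each hybridization contributes one row of X, with +1 for the sample whose log intensity is added and -1 for the one subtracted; L is taken as the baseline, so the parameters are A-L, P-L, D-L, V-L and M-L. The list of hybridizations and the "true" effects are made up for illustration and are not the design from the study. Ordinary least squares is used here, whereas the slide notes that robust regression is used in practice.

```python
import numpy as np

REGIONS = ["A", "P", "D", "V", "M"]   # L is the baseline, so its effect is fixed at 0

def design_row(sample_red, sample_green):
    """Row of X for a hybridization whose log ratio estimates (red - green)."""
    row = np.zeros(len(REGIONS))
    for name, sign in ((sample_red, 1.0), (sample_green, -1.0)):
        if name != "L":
            row[REGIONS.index(name)] = sign
    return row

# Hypothetical hybridizations, chosen so that every region is connected to the others.
hybs = [("A", "P"), ("A", "M"), ("M", "P"), ("A", "D"), ("D", "P"),
        ("L", "A"), ("P", "L"), ("V", "L"), ("D", "V")]
X = np.array([design_row(r, g) for r, g in hybs])

# Toy effects relative to L (A-L, P-L, D-L, V-L, M-L), used to simulate noise-free log ratios.
true_beta = np.array([1.0, -0.5, 0.3, 0.0, 0.8])
log_ratios = X @ true_beta

# Least squares combines the direct and indirect information on every contrast.
beta_hat, *_ = np.linalg.lstsq(X, log_ratios, rcond=None)
a_minus_p = float(beta_hat[REGIONS.index("A")] - beta_hat[REGIONS.index("P")])
print({r: round(float(b), 3) for r, b in zip(REGIONS, beta_hat)}, round(a_minus_p, 3))
```

Any other estimable contrast, such as A-P above, is a difference of the fitted parameters.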

The Olfactory Bulb Experiments completed so far

Contrasts & Patterns
Because of the connectivity of our experiment, we can estimate all 15 different pairwise comparisons directly and/or indirectly.
For every gene we thus have a pattern based on the 15 pairwise comparisons.
Figure: Gene #15,228

Contrasts & patterns: another way
Instead of estimating pairwise comparisons between each of the six effects, we can come closer to estimating the effects themselves by doing so subject to the standard zero sum constraint (6 parameters, 5 d.f.).
What we estimate for A, say, subject to this constraint, is in reality an estimate of $A - \tfrac{1}{6}(A + P + D + V + M + L)$.
This set of parameter estimates gives results similar to, but better than, the ones we would have obtained had we carried out the experiments with whole-bulb reference tissue.
In effect we have created the whole-bulb reference in silico.
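Both representations can be derived from one set of estimates, as the following sketch shows. The numbers are hypothetical estimates relative to L for a single gene; subtracting the average over the six regions gives the zero-sum effects, and all differences between regions are unchanged.

```python
from itertools import combinations

# Hypothetical estimates relative to L (L itself is 0 by construction).
rel_to_L = {"A": 1.0, "P": -0.5, "D": 0.3, "V": 0.0, "M": 0.8, "L": 0.0}

# All pairwise comparisons among the six regions.
pairwise = {f"{x}-{y}": rel_to_L[x] - rel_to_L[y]
            for x, y in combinations(rel_to_L, 2)}

# Zero-sum effects: each region minus the average over all six regions.
mean_effect = sum(rel_to_L.values()) / len(rel_to_L)
zero_sum = {region: value - mean_effect for region, value in rel_to_L.items()}

print(len(pairwise))   # 15
print({region: round(value, 3) for region, value in zero_sum.items()})
```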

Alternative pattern representation Gene #15,228 once again.

Reconstruction of the Bulb as a Cube: Expression of Gene #15,228
Colour scale: expression level, high to low

Patterns, More Globally
Can we identify genes with interesting patterns of expression across the bulb?
Two approaches:
1. Find the genes whose expression fits specific, predefined patterns.
2. Perform cluster analysis - see what expression patterns emerge.

Clustering procedure
Start with a set of genes exhibiting some minimal level of differential expression across the bulb; here ~650 were chosen from all 15 contrasts.
Carry out hierarchical clustering, building a dendrogram: Mahalanobis distance and Ward agglomeration (minimum variance) were used.
Now consider all clusters of 2 or more genes in the tree. Singles are added separately.
Measure the heterogeneity h of a cluster by calculating the 15 SDs across the cluster of each of the pairwise effects, and taking the largest.
Choose a score s (see plots) and take all maximal disjoint clusters with h < s.
Here we used s = 0.46 and obtained 16 clusters.
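The last two steps can be sketched as follows, assuming the dendrogram has already been built. The tree is represented as nested 2-tuples of gene ids, which is a choice made for this sketch; the lecture's tree came from Mahalanobis distance with Ward agglomeration, which is not reproduced here. The toy profiles use 3 effects per gene instead of 15, and the sample SD is an assumption since the slide does not say which SD is meant.

```python
from statistics import stdev

def leaves(node):
    """Gene ids under a dendrogram node (leaf = str, internal node = 2-tuple)."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)

def heterogeneity(genes, profiles):
    """h = largest per-effect SD across the genes in a cluster."""
    n_effects = len(profiles[genes[0]])
    return max(stdev([profiles[g][j] for g in genes]) for j in range(n_effects))

def maximal_clusters(node, profiles, s):
    """Walk down from the root, keeping the first (largest) clusters with h < s.

    Returns (clusters, singles); singles are genes left outside any kept cluster.
    """
    if isinstance(node, str):
        return [], [node]
    genes = leaves(node)
    if heterogeneity(genes, profiles) < s:
        return [genes], []
    clusters, singles = [], []
    for child in node:
        child_clusters, child_singles = maximal_clusters(child, profiles, s)
        clusters += child_clusters
        singles += child_singles
    return clusters, singles

# Toy profiles and tree (hypothetical).
profiles = {
    "g1": [1.0, 0.0, -1.0], "g2": [1.1, 0.1, -0.9],
    "g3": [-1.0, 0.5, 1.0], "g4": [-0.9, 0.4, 1.2],
    "g5": [0.0, 2.0, 0.0],
}
tree = ((("g1", "g2"), ("g3", "g4")), "g5")

clusters, singles = maximal_clusters(tree, profiles, s=0.46)
print(clusters, singles)   # [['g1', 'g2'], ['g3', 'g4']] ['g5']
```

Because the search stops at the first node whose h falls below s, the clusters returned are disjoint and none is contained in a larger qualifying cluster.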

Plots guiding choice of clusters of genes
Axes: cluster heterogeneity h (max of 15 SDs); number of clusters (patterns); number of genes

Red: genes chosen
Blue: controls
Axis: the 15 pairwise (p/w) effects

The 16 groups systematically arranged (6 point representation)

Validation of Gene #15,228 Expression Pattern by RNA In Situ Hybridization
Panels: gluR (labels CTX, MOB, AOB); #15,228 (labels CTX, AOB, MOB)

Gene 15,228: another in situ view

384 (group 3) D V LM

3-dimensional reconstruction from in-situ data 15,228 5,291 8,

Are the patterns we found real?
Here’s how we attempted to show that the answer is a qualified yes.
Each cluster average (pattern) has a ‘strength’ we can measure by its root-mean-square (RMS).
The n=16 clusters we found have an average RMS of av=0.3. Both n and av are strongly determined by our heterogeneity cut-off score of s=0.46.
Now consider randomizing the labels (e.g. P-A) on our hybridizations and repeating the entire analysis, keeping the cut-off score fixed.
We typically get fewer, “weaker” patterns, with less contrast in the red-green patchwork. One such is on the next page.
500 independent random relabellings had a mean av value of 0.18, an SD of 0.07 and a max av value of 0.26, cf. 0.3 in our data.
Our clusters are definitely ‘non-random’ in some sense.
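The logic of this check can be sketched in a few lines. Here `rms` measures the strength of a pattern and `relabelling_null` recomputes a statistic under random permutations of the hybridization labels. The toy statistic (RMS of the per-label mean log ratio for one gene) is a hypothetical stand-in for "repeating the entire analysis", which would involve refitting and reclustering; the labels and values are invented.

```python
import random
from math import sqrt

def rms(pattern):
    """Root-mean-square 'strength' of a pattern."""
    return sqrt(sum(x * x for x in pattern) / len(pattern))

def relabelling_null(statistic, labels, n_perm=500, seed=0):
    """Values of `statistic` under random permutations of the hybridization labels."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        null.append(statistic(shuffled))
    return null

# Toy data: one gene, six hybridizations, two per label (hypothetical).
labels = ["P-A", "P-A", "D-A", "D-A", "V-A", "V-A"]
values = [1.0, 1.2, -0.8, -1.0, 0.1, -0.1]

def pattern_strength(labelling):
    """RMS of the per-label mean log ratio under a given labelling."""
    groups = {}
    for label, value in zip(labelling, values):
        groups.setdefault(label, []).append(value)
    return rms([sum(g) / len(g) for g in groups.values()])

observed = pattern_strength(labels)
null = relabelling_null(pattern_strength, labels)
print(round(observed, 3), round(max(null), 3), round(sum(null) / len(null), 3))
```

Comparing the observed value with the mean and maximum of the relabelled values mirrors the comparison of av = 0.3 with the mean of 0.18 and maximum of 0.26 reported on the slide.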

Random Real

Problem
We later tried all this with a different set of data, one which made use of reference mRNA, had generally lower S/N, and where the investigator sought fewer interesting patterns.
We found that the patterns the previous method discovered were similarly quite distinct in av values from those in randomly labelled hybs, but this time the av values were ‘significantly’ lower than random.
It all depends where you are on the curve.

Where next?
I feel that we need a new idea. The previous one doesn’t seem to have worked. Or did it? Just clustering and taking averages seems too easy... But maybe clustering is all there is to patterns, once we have decided on the appropriate and context dependent profile to cluster, and selected the genes, but I keep wondering…

Some statistical research stimulated by microarray data analysis
Experimental design: Churchill & Kerr
Image analysis: Zuzan & West, ...
Data visualization: Carr et al
Estimation: Ideker et al, ...
Multiple testing: Westfall & Young, Storey, ...
Discriminant analysis: Golub et al, ...
Clustering: Hastie & Tibshirani, Van der Laan, Fridlyand & Dudoit, ...
Empirical Bayes: Efron et al, Newton et al, ...
Multiplicative models: Li & Wong
Multivariate analysis: Alter et al
Genetic networks: D’Haeseleer et al
and more

In closing: The pervasiveness of microarray technology and the statistical problems that go with it
Hybridization of target DNA or RNA to large numbers of probes attached to a solid support in a microarray format has a much wider applicability.
All such applications have their own statistical problems.
Here are two relating to the previous lectures.

Meiosis data in which all exchanges are precisely located (from microarrays)
Organism: yeast
Figure courtesy of J Derisi

Exon Arrays can validate Exon Predictions and assemble Gene Structures
One or more Probes per Predicted Exon
Verify predicted exons on a genome-wide scale.
Group exons into genes via co-regulation.
Figure label: Predicted exon
This and the next slide courtesy of Rosetta

Tiling arrays can identify exons and refine gene structures
Oligonucleotides 60 bp in length (“60-mers”)
10 bp steps
Figure label: Predicted exon
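The tiling scheme (60-mers at 10 bp steps) can be illustrated with a short sketch. The function name and the toy sequence are hypothetical; real probe design involves further considerations that are not covered here.

```python
def tiling_probes(sequence, probe_len=60, step=10):
    """Return (start, probe) pairs for probes of probe_len bases placed every `step` bases."""
    last_start = len(sequence) - probe_len
    return [(i, sequence[i:i + probe_len]) for i in range(0, last_start + 1, step)]

toy = "ACGT" * 25                        # 100-base toy sequence
probes = tiling_probes(toy)
print([start for start, _ in probes])    # [0, 10, 20, 30, 40]
```

With a 10 bp step, consecutive 60-mers overlap by 50 bases, so each interior base is covered by several probes.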

Acknowledgments
Statistical collaborators: Yee Hwa Yang (Berkeley), Sandrine Dudoit (Berkeley), Ingrid Lönnstedt (Uppsala), Natalie Thorne (WEHI), Mauro Delorenzi (WEHI)
CSIRO Image Analysis Group: Michael Buckley, Ryan Lagerstorm
WEHI: Glenn Begley, Suzie Grant, Rob Good
PMCI: Chuang Fong Kong
Ngai Lab (Berkeley): Cynthia Duggan, Jonathan Scolnick, Dave Lin, Vivian Peng, Percy Luu, Elva Diaz, John Ngai
LBNL: Matt Callow
RIKEN Genomic Sciences Center: Yasushi Okazaki, Yoshihide Hayashizaki