Gene Expression BMI 731 week 5


Gene Expression BMI 731 week 5 Catalin Barbacioru Department of Biomedical Informatics Ohio State University

Thesis: the analysis of gene expression data is going to be big in 21st century statistics.
Many different technologies, including:
- High-density nylon membrane arrays
- Serial analysis of gene expression (SAGE)
- Short oligonucleotide arrays (Affymetrix)
- Long oligo arrays (Agilent)
- Fibre optic arrays (Illumina)
- cDNA arrays (Brown/Botstein)*

[Figure: Total microarray articles indexed in Medline, 1995-2001 (2001 projected). Axes: Year, Number of papers.]

Common themes
- Parallel approach to collection of very large amounts of data (by biological standards)
- Sophisticated instrumentation, requires some understanding
- Systematic features of the data are at least as important as the random ones
- Often more like industrial process than single investigator lab research
- Integration of many data types: clinical, genetic, molecular, databases

Biological background: transcription
RNA polymerase transcribes DNA into mRNA.
DNA (coding strand):   G T A A T C C T C
                       | | | | | | | | |
DNA (template strand): C A T T A G G A G
mRNA (being built):    G U A A U C C

Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein might be better, but is currently harder.

Reverse transcription
Clone cDNA strands, complementary to the mRNA.
mRNA: G U A A U C C U C
Reverse transcriptase
cDNA: C A T T A G G A G (many copies)
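The base-pairing in the two diagrams above can be sketched in a few lines of Python. This is only an illustration: the function and table names are made up for this example, and the strands are paired position by position, as drawn on the slides, rather than rewritten 5' to 3'.

```python
# Base-pairing tables; names are illustrative and not taken from the slides.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}   # template DNA -> mRNA
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}  # mRNA -> cDNA


def transcribe(template_dna: str) -> str:
    """Return the mRNA that pairs base-by-base with a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template_dna)


def reverse_transcribe(mrna: str) -> str:
    """Return the cDNA that pairs base-by-base with an mRNA strand."""
    return "".join(RNA_TO_CDNA[base] for base in mrna)


mrna = transcribe("CATTAGGAG")   # template strand from the transcription slide
cdna = reverse_transcribe(mrna)
print(mrna)  # GUAAUCCUC
print(cdna)  # CATTAGGAG
```

The cDNA has the same sequence as the original DNA template strand, which is why it can stand in for the gene on an array.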

cDNA microarray experiments
mRNA levels compared in many different contexts:
- Different tissues, same organism (brain v. liver)
- Same tissue, same organism (treatment (ttt) v. control (ctl), tumor v. non-tumor)
- Same tissue, different organisms (wild type (wt) v. knock-out (ko), transgenic (tg), or mutant)
- Time course experiments (effect of ttt, development)
- Other special designs (e.g. to detect spatial patterns)

DNA microarrays represent an important new method for determining the complete expression profile of a cell. Monitoring gene expression lies at the heart of a wide variety of medical and biological research projects, including classifying diseases, understanding basic biological processes, and identifying new drug targets.

Affymetrix® Instrument System: platform for GeneChip® Probe Arrays
Integrated, exportable, easy to use, versatile.

Photolithography

Synthesis of Ordered Oligonucleotide Arrays
[Figure: light passes through a mask onto the substrate (deprotection); a nucleotide (T, then C, and so on) is coupled at the deprotected positions; the cycle repeats.]
Light removes protecting groups at defined positions. A single nucleotide is washed over the chip and binds where the protecting group has been removed. Through successive steps, any sequence can be built up in any position on the chip. The number of steps depends on the length of the oligo, so the number of genes can be increased without increasing the number of steps.

Affymetrix GeneChip arrays

GeneChip® Probe Arrays
[Figure: GeneChip probe array, 1.28 cm across. Each 24 µm probe cell holds millions of copies of a specific oligonucleotide probe, shown hybridized to single-stranded, labeled RNA target. Image of a hybridized probe array with >200,000 different complementary probes.]
The core of the platform is our unique arrays:
- Oligonucleotides synthesized de novo (photolithography & combinatorial chemistry)
- Currently 65,000 (50 micron features) to 250,000 (24 micron) different oligos on commercially available products
- Each oligo represented in 10^7 to 10^8 full-length copies

Analysis of expression level from probe sets
[Figure: a single, contiguous gene set for the rat β-actin gene, with Perfect Match (PM), Mismatch (MM) and control features.]
Each pixel is quantitated and integrated for each oligo feature (range 0-25,000).
log(PM / MM) = difference score
All significant difference scores are averaged to create the “average difference” = expression level of the gene.
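As a rough sketch of the summary described above, the following Python averages the log(PM/MM) difference scores over a probe set. Several things here are assumptions rather than the slide's method: the slide does not give the log base (base 2 is used here), it does not say how "significant" scores are selected (so every probe pair is used), and the function name and intensities are hypothetical.

```python
import math


def probe_set_expression(pm, mm):
    """Average the log2(PM/MM) difference scores over one probe set.

    pm, mm: equal-length sequences of positive intensities, one entry per
    probe pair. Assumption: every pair is used, because the slide does not
    specify how "significant" scores are chosen.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and equal in length")
    scores = [math.log2(p / m) for p, m in zip(pm, mm)]
    return sum(scores) / len(scores)


# Hypothetical intensities for a 4-pair probe set (slide range: 0-25,000).
pm = [8000, 4000, 2000, 1000]
mm = [1000, 1000, 1000, 1000]
expression = probe_set_expression(pm, mm)  # mean of the log2 ratios 3, 2, 1, 0
```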

Expression screening by GeneChip
• each oligo sequence (20-25 mer) is synthesized as a 20 µm square (feature)
• each feature contains > 1 million copies of the oligo
• scanner resolution is about 2 µm (pixel)
• each gene is quantitated by 16-20 oligos and compared to equal # of mismatched controls
• 22,000 genes are evaluated with 20 matching oligos and 10 mismatched oligos = 480,000 features/chip
• 480,000 features are photolithographically synthesized onto a 2 x 2 cm glass substrate

Affymetrix GeneChip arrays
Global views of gene expression are often essential for obtaining comprehensive pictures of cell function. For example, it is estimated that between 0.2 and 10% of the 10,000 to 20,000 mRNA species in a typical mammalian cell are differentially expressed between cancer and normal tissues. Whole-genome analyses also benefit studies where the end goal is to focus on small numbers of genes, by providing an efficient tool to sort through the activities of thousands of genes, and to recognize the key players. In addition, monitoring multiple genes in parallel allows the identification of robust classifiers, called "signatures", of disease. Global analyses frequently provide insights into multiple facets of a project. A study designed to identify new disease classes, for example, may also reveal clues about the basic biology of disorders, and may suggest novel drug targets.

cDNA microarrays
In "spotted" microarrays, slides carrying spots of target DNA are hybridized to fluorescently labeled cDNA from experimental and control cells, and the arrays are imaged at two or more wavelengths.
Expression profiling involves the hybridization of fluorescently labeled cDNA, prepared from cellular mRNA, to microarrays carrying thousands of unique sequences. Typically, a set of target DNA samples representing different genes is prepared by PCR and transferred to a coated slide to form a 2-D array of spots with a center-to-center distance (pitch) of about 200 μm, providing a pan-genomic profile in an area of 3 cm² or less. cDNA samples from experimental and control cells are labeled with different color fluors (the cyanine dyes Cy5 and Cy3) and hybridized simultaneously to microarrays, and the relative levels of mRNA for each gene are then determined by comparing red and green signal intensities.

cDNA microarrays: Scanning Technology
Microarray slides are imaged with a modified fluorescence microscope designed for scanning large areas at high resolution (arrayWoRx, Applied Precision, Issaquah, WA, Affymetrix). Fluorescence illumination is obtained from a metal halide arc lamp focused onto a fiber optic bundle, the output of which is directed at the microarray slide, and emission is recorded through a microscope objective (Nikon) onto a cooled CCD (charge-coupled device) camera. Interference filters are used to select the excitation and emission wavelengths corresponding to the Cy3 and Cy5 fluorescent probes (Amersham Pharmacia). Each image covers a 2.4 x 2.4 mm area of the slide at 5-μm resolution. To scan the entire microarray, a series of images ("panels") is acquired by moving the slide under the microscope objective in 2.4-mm increments.

http://www.bio.davidson.edu/courses/genomics/chip/chip.swf

Biological question (differentially expressed genes, sample class prediction, etc.)
-> Experimental design
-> Microarray experiment
-> 16-bit TIFF files
-> Image analysis
-> (Rfg, Rbg), (Gfg, Gbg)
-> Normalization
-> R, G
-> Estimation, Testing, Clustering, Discrimination
-> Biological verification and interpretation

Some statistical questions
- Image analysis: addressing, segmenting, quantifying
- Normalisation: within and between slides
- Quality: of images, of spots, of (log) ratios
- Which genes are (relatively) up/down regulated?
- Assigning p-values to tests/confidence to results.

Some statistical questions, ctd
- Planning of experiments: design, sample size
- Discrimination and allocation of samples
- Clustering, classification: of samples, of genes
- Selection of genes relevant to any given analysis
- Analysis of time course, factorial and other special experiments
- ...and much more.

Some bioinformatic questions
- Connecting spots to databases, e.g. to sequence, structure, and pathway databases
- Discovering short sequences regulating sets of genes: direct and inverse methods
- Relating expression profiles to structure and function, e.g. protein localisation
- Identifying novel biochemical or signalling pathways
- ...and much more.

Part of the image of one channel false-coloured on a white (v. high) red (high) through yellow and green (medium) to blue (low) and black scale

Does one size fit all?

Segmentation: limitation of the fixed circle method
[Figure panels: SRG; Fixed Circle]
Inside the boundary is spot (foreground), outside is not.

Some local backgrounds
[Figure: single channel, grey scale]
We use something different again: a smaller, less variable value.

Quantification of expression
For each spot on the slide we calculate
    Red intensity (PM) = Rfg - Rbg
    Green intensity (MM) = Gfg - Gbg
where fg = foreground and bg = background, and combine them in the log (base 2) ratio
    Log2(Red intensity / Green intensity), i.e. Log2(PM / MM)
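A minimal Python sketch of this calculation for one spot follows. The function name and the example intensities are hypothetical, and returning None when a background-corrected intensity is not positive is an assumption; the slides do not say how such spots are handled.

```python
import math


def log_ratio(rfg, rbg, gfg, gbg):
    """Return log2 of background-corrected red over green intensity.

    Returns None when either corrected intensity is not positive, because
    the log ratio is undefined there (this handling is an assumption).
    """
    red = rfg - rbg      # Red intensity = Rfg - Rbg
    green = gfg - gbg    # Green intensity = Gfg - Gbg
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)


m_up = log_ratio(1100, 100, 600, 100)    # red 1000, green 500
m_down = log_ratio(350, 100, 1100, 100)  # red 250, green 1000
m_bad = log_ratio(90, 100, 600, 100)     # red foreground below background
print(m_up, m_down, m_bad)  # 1.0 -2.0 None
```

Working on the log2 scale makes a two-fold increase (+1) and a two-fold decrease (-1) symmetric about zero.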

Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.

                        Slides
Genes   slide 1   slide 2   slide 3   slide 4   slide 5   ...
1         0.46      0.30      0.80      1.51      0.90    ...
2        -0.10      0.49      0.24      0.06      0.46    ...
3         0.15      0.74      0.04      0.10      0.20    ...
4        -0.45     -1.03     -0.79     -0.56     -0.32    ...
5        -0.06      1.06      1.35      1.09     -1.09    ...

Gene expression level of gene 5 in slide 4 = Log2(Red intensity / Green intensity)
These values are conventionally displayed on a red (>0) yellow (0) green (<0) scale.
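One simple way to hold such a matrix in code is a list of rows, one per gene. The sketch below uses the five genes and five slides shown above; the names EXPRESSION, expression_level and display_colour are illustrative only.

```python
# Rows are genes 1-5, columns are slides 1-5 (values from the table above).
EXPRESSION = [
    [0.46, 0.30, 0.80, 1.51, 0.90],
    [-0.10, 0.49, 0.24, 0.06, 0.46],
    [0.15, 0.74, 0.04, 0.10, 0.20],
    [-0.45, -1.03, -0.79, -0.56, -0.32],
    [-0.06, 1.06, 1.35, 1.09, -1.09],
]


def expression_level(gene, slide):
    """Look up a log2 ratio using the 1-based gene and slide numbers."""
    return EXPRESSION[gene - 1][slide - 1]


def display_colour(log_ratio):
    """Map a log2 ratio to the conventional red/yellow/green display."""
    if log_ratio > 0:
        return "red"
    if log_ratio < 0:
        return "green"
    return "yellow"


value = expression_level(5, 4)
print(value, display_colour(value))  # 1.09 red
```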

The red/green ratios can be spatially biased. Top 2.5% of ratios red, bottom 2.5% of ratios green.

Affymetrix vs. cDNA Arrays
Affy strengths:
- highly reliable: synthesized in situ
- highly reproducible from run to run
- no clone maintenance or ‘drift’
- sealed fluidics and controlled temperature
- standardized chips increase database power
- excellent scanner
- complex, but very reliable labelling
- excellent cost/benefit ratio
- amenable to mutation and SNP detection

Affymetrix weaknesses/limitations
- not easily customized: $300K/chip
- high labeling cost: $170/chip
- high per chip cost: $350 to $1850
- limited choice of species
- requires knowledge of sequence
- not designed for competitive protocols

Limitations to all microarrays
- dynamic range of gene expression: very difficult to simultaneously detect low and high abundance genes accurately
- each gene has multiple splice variants: 2 splice variants may have opposite effects (e.g. trk); arrays can be designed for splicing, but complexity increases 5X
- translational efficiency is a regulated process: mRNA level does not necessarily correlate with protein level
- proteins are modified post-translationally: glycosylation, phosphorylation, etc.
- pathogens might have little ‘genomic’ effect

Analysis
In general the expression level of individual genes is measured by log(PM/MM) or log(R/G). Intensity-dependent normalization methods are preferred over global methods. To correct intensity and dye bias we used location and scale normalization methods, which are based on robust, locally linear fits (lowess). Global methods use linear regression models, combined with ANOVA.
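To make the idea of intensity-dependent normalization concrete, here is a small pure-Python sketch. It is not the procedure used for these slides: it performs location normalization only (no scale step), the local linear fit uses tricube weights without the robustness iterations of a full lowess, and the average log intensity A = 0.5 * log2(R * G), the function names and the frac parameter are all assumptions introduced for illustration. It is quadratic in the number of spots, so it suits small examples only.

```python
import math


def local_linear_fit(x, y, frac=0.5):
    """Tricube-weighted local linear fit of y on x (no robustness passes)."""
    n = len(x)
    k = max(2, int(math.ceil(frac * n)))  # number of points in each window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th nearest point
        w = [max(0.0, 1 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def normalize_intensity_dependent(red, green, frac=0.5):
    """Subtract the fitted intensity-dependent trend from M = log2(R/G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    trend = local_linear_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


# Synthetic self-self style data: no real differential expression, but a dye
# bias that grows linearly with average log intensity.
a_true = [6 + 0.5 * i for i in range(20)]
m_true = [0.2 * a - 1.0 for a in a_true]
red = [2 ** (a + m / 2) for a, m in zip(a_true, m_true)]
green = [2 ** (a - m / 2) for a, m in zip(a_true, m_true)]
normalized = normalize_intensity_dependent(red, green)
```

In this synthetic case the bias is exactly linear in A, so the normalized log ratios come out at essentially zero; a single global shift would leave a residual trend at low and high intensities.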

Normalization
Why? To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples.
How do we know it is necessary? By examining self-self hybridizations, where no true differential expression is occurring. We find dye biases which vary with overall spot intensity, location on the array, plate origin, pins, scanning parameters, ...

Analysis
[Figure panels: Pre-normalization; Post-normalization]

The simplest cDNA microarray data analysis problem is identifying differentially expressed genes using replicated slides.
There are a number of different aspects:
- First, between-slide normalization; then
- What should we look at: averages, SDs, t-statistics, other summaries?
- How should we look at them?
- Can we make valid probability statements?

Apo AI experiment (Matt Callow, LBNL)
Goal: to identify genes with altered expression in the livers of Apo AI knock-out mice (T) compared to inbred C57Bl/6 control mice (C).
- 8 treatment mice and 8 control mice
- 16 hybridizations: liver mRNA from each of the 16 mice (Ti, Ci) is labelled with Cy5, while pooled liver mRNA from the control mice (C*) is labelled with Cy3.
- Probes: ~6,000 cDNAs (genes), including 200 related to lipid metabolism.

Which genes have changed? When permutation testing is possible
1. For each gene and each hybridisation (8 ko + 8 ctl), use M = log2(R/G).
2. For each gene form the t statistic:
$$ t = \frac{\bar{M}_{ko} - \bar{M}_{ctl}}{\sqrt{\tfrac{1}{8}\left(s_{ko}^2 + s_{ctl}^2\right)}} $$
   where $\bar{M}_{ko}$ and $\bar{M}_{ctl}$ are the averages of the 8 ko and 8 ctl Ms, and $s_{ko}$ and $s_{ctl}$ are the corresponding SDs.
3. Form a histogram of 6,000 t values.
4. Do a normal q-q plot; look for values “off the line”.
5. Permutation testing (next lecture).
6. Adjust for multiple testing (next lecture).
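The t statistic in step 2 can be sketched for a single gene as below. This reads the slide's formula as the two-sample statistic with unpooled variances, with each squared SD divided by its group size (8 in this experiment); that reading, the function name and the example values are assumptions, and the data are made up rather than taken from the Apo AI experiment.

```python
import math
from statistics import mean, stdev


def t_statistic(ko, ctl):
    """Two-sample t statistic with unpooled variances for one gene.

    ko, ctl: log ratios M = log2(R/G) for the knock-out and control
    hybridisations. Each squared SD is divided by its own group size.
    """
    se = math.sqrt(stdev(ko) ** 2 / len(ko) + stdev(ctl) ** 2 / len(ctl))
    return (mean(ko) - mean(ctl)) / se


# Hypothetical Ms for one strongly down-regulated gene (8 ko + 8 ctl).
ko = [-3.1, -2.9, -3.0, -3.2, -2.8, -3.0, -3.1, -2.9]
ctl = [0.1, -0.1, 0.0, 0.2, -0.2, 0.0, 0.1, -0.1]
t = t_statistic(ko, ctl)
print(round(t, 1))
```

Repeating this for every gene gives the roughly 6,000 t values that go into the histogram and the normal q-q plot.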

[Figure: Histogram and normal q-q plot of t-statistics, Apo AI experiment]

Patterns, More Globally...
Can we identify genes with interesting patterns of expression across arrays? Two approaches:
1. Find the genes whose expression fits specific, predefined patterns.
2. Perform cluster analysis - see what expression patterns emerge.
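The first approach can be illustrated by scoring each gene's profile against a template with a correlation coefficient. The slides do not say which similarity measure is used; Pearson correlation, the 0.9 cut-off, the function names and the profiles below are all assumptions chosen for the sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def genes_matching(profiles, pattern, threshold=0.9):
    """Return names of genes whose profile correlates with the pattern."""
    return [g for g, p in profiles.items() if pearson(p, pattern) >= threshold]


# Predefined pattern: low on the first four arrays, high on the last four.
pattern = [0, 0, 0, 0, 1, 1, 1, 1]
profiles = {  # hypothetical log ratios across 8 arrays
    "gene_a": [0.1, -0.1, 0.0, 0.1, 1.2, 0.9, 1.1, 1.0],
    "gene_b": [1.0, 1.1, 0.9, 1.0, 0.0, 0.1, -0.1, 0.0],
    "gene_c": [0.5, -0.5, 0.4, -0.4, 0.5, -0.5, 0.4, -0.4],
}
matches = genes_matching(profiles, pattern)
print(matches)  # ['gene_a']
```

Cluster analysis, the second approach, drops the predefined pattern and instead groups genes by their similarity to one another.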

The 16 groups systematically arranged (6 point representation)