Lecture 10. Microarray and RNA-seq

Lecture 10. Microarray and RNA-seq: Identification of Differentially Expressed Genes (DEG)
Hyun Seok Kim, Ph.D., Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine
MES7594-01 Genome Informatics I (2015 Spring)

Gene expression

What is a microarray?
- Platforms: glass slides (cDNA array), chips (Affymetrix), glass beads (Illumina).
- Tens of thousands of oligonucleotide (or cDNA) probes are fixed on the surface of each platform.
- Microarrays can detect and quantify mRNA, microRNA, SNP, LOH, CNV, ...

Questions of Interest
- Determine the steady-state gene expression levels of a sample at whole-transcriptome scale.
- Identify differentially expressed genes between samples.
- Identify differentially regulated pathways or protein complexes.

Affymetrix GeneChip for mRNA quantification: about the Affy GeneChip platform
- Probes (25-mers) are synthesized on a chip using a photolithographic manufacturing process.
- At each x, y location of a GeneChip, millions of copies of a particular oligonucleotide are synthesized.
- Each gene is represented by a unique set of probe pairs, each consisting of a perfect match (PM) and a mismatch (MM) probe.
- MM helps increase the specificity of the PM signal.

Affymetrix GeneChip for mRNA quantification: about the Affy workflow
1. Isolate total RNA (biological replicates are needed).
2. Amplify and label the sample.
3. Inject the sample into the microarray.
4. Probe array hybridization and washing.
5. Probe array scanning and intensity quantification.
6. Intensity is translated into nucleic acid abundance.

Illumina BeadArray for mRNA quantification: BeadChip platform
- Each bead carries one type of oligo, with thousands of copies of that oligo per bead.
- Beads are deposited in wells on glass slides.
- The beads are decoded in a separate step that uses proprietary technology.

Affy vs Illumina (adapted from Dr. Chandran@pitt)
Affymetrix GeneChip | Illumina BeadArray
25-mer | Longer oligo
Probe synthesized on chips | Bead technology
Multiple probes/probeset | Single probe; multiple probes/transcript
.dat, .cel, .cdf, .chp file types | Image file processed by Bead Studio
Normalization by MAS5, RMA, GC-RMA, etc. | Normalization by average, quantile, RSN, etc.
TXT output for downstream analysis. Annotations can be updated.

RNA sequencing (adapted from Canadian Bioinformatics Workshop)
1. Samples of interest: condition 1 (normal colon), condition 2 (colon tumor).
2. Isolate RNAs.
3. Generate cDNA, fragment, size select, add linkers.
4. Sequence ends: 100s of millions of paired reads, 10s of billions of bases of sequence.
5. Map to genome, transcriptome, and predicted exon junctions.
6. Downstream analysis.

Pros and Cons of RNA-seq (versus microarray)
Pros
- More powerful in detecting lowly expressed genes
- Detects splice variants and fusion transcripts
- Measures allele-specific expression
- Discovers mutations
Cons
- Biased toward highly expressed genes (e.g. ribosomal, mitochondrial genes)
- More complicated analysis workflow (mapping to a reference genome)
- More expensive, e.g. HiSeq 2500, 2 x 100 bp, 4 Gb: $700/sample (vs. < $500/array)

RNA-seq: Experimental Design
- Single-end read: one read is sequenced from one end of each cDNA insert of the sample.
- Paired-end read: two reads (one from each end) are sequenced from each cDNA insert of the sample. Paired-end reads map better over repetitive regions and help detect fusions and novel transcripts.

RNA-seq workflows
- Sequencing: obtain raw data (FASTQ format)
- Quality control (optional): FASTX
- Workflow 1: tophat2 (align) -> cufflinks (transcript assembly) -> cuffmerge (merge assemblies) -> cuffdiff (DEGs)
- Workflow 2: bowtie2 (align) -> HTSeq-count (count by gene) -> edgeR or DESeq (DEGs)
- Fusion detection (optional): "chimerascan" or "defuse"

Normalization method (old)
- RPKM: reads per kilobase per million mapped reads (single end).
- RPKM = (10^9 * C) / (N * L), where C = number of reads mapped to a gene, N = total mapped reads in the experiment, and L = exon length of the gene in base pairs (the factor 10^9 combines the per-kilobase factor 10^3 and the per-million factor 10^6).
- The RPKM measure is inconsistent among samples.
- FPKM: fragments per kilobase per million mapped fragments (paired end).
- DEG discovery based on RPKM or FPKM is affected by gene length and is no longer recommended.
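The RPKM formula can be checked with a few lines of Python. This is a minimal sketch: the function name `rpkm` and the example numbers are made up for illustration, and the exon length is taken in base pairs, which is what the 10^9 factor implies.

```python
def rpkm(read_count, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon per million mapped reads.

    exon_length_bp is in base pairs; 1e9 = 1e3 (per kilobase) * 1e6 (per million).
    """
    return 1e9 * read_count / (total_mapped_reads * exon_length_bp)


# Hypothetical gene: 500 reads, 2,000 bp of exons, 20 million mapped reads.
value = rpkm(500, 20_000_000, 2_000)
print(value)  # 12.5
```

Doubling the exon length at the same read count halves the value, which is the length correction the measure is meant to provide.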

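Negative binomial model for RNA-seq (adapted from EMBL)
- Microarray intensities are continuous and, after log transformation, are usually modeled as approximately normal. RNA-seq data are read counts and need a count distribution.
- Under a Poisson distribution the variance equals the mean. RNA-seq counts from biological replicates do not follow it.
- In RNA-seq, genes with high mean counts (either because they are long or highly expressed) tend to show more variance between samples than genes with low mean counts, and more than a Poisson model predicts.
- The counts therefore fit a negative binomial distribution.
- edgeR and DESeq identify DEGs based on the negative binomial distribution.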
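A common way to write the negative binomial mean-variance relationship is variance = mean + dispersion * mean^2, which reduces to the Poisson case when the dispersion is zero. The sketch below uses that parameterization with a crude method-of-moments dispersion estimate. The function names and replicate counts are invented for illustration, and edgeR and DESeq estimate dispersion with more elaborate procedures than this.

```python
from statistics import mean, variance


def nb_variance(mu, dispersion):
    """Negative binomial variance; dispersion = 0 gives the Poisson variance."""
    return mu + dispersion * mu ** 2


def moment_dispersion(counts):
    """Crude method-of-moments estimate: (variance - mean) / mean^2, floored at 0."""
    m = mean(counts)
    v = variance(counts)  # sample variance across replicates
    return max(0.0, (v - m) / m ** 2)


# Hypothetical replicate counts for a lowly and a highly expressed gene.
low_gene = [8, 12, 10, 9, 11]
high_gene = [800, 1300, 950, 1100, 850]

print(moment_dispersion(low_gene))   # no excess over Poisson in this toy sample
print(moment_dispersion(high_gene))  # positive: more variance than Poisson allows
```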

A case study for practice session
GEO accession number: GSE41588

GEO entry of GSE41588
- Two platforms: Affy and HiSeq
- Matrix file: processed data
- Raw files: CEL, count files

Data Pre-processing
- Affy produces CEL-format files as raw data.
- A CEL file contains the feature quantifications.
- A CEL file still has the probes spread over the chip.
- Values still need to be summarized to the probe-set level; for example, 90525_at = 250 units.
- Probe set: a collection of probes designed to interrogate a given sequence.

CEL file to TXT file
- In going from .CEL to .TXT files to generate signal values, the multiple probes within a probe set are "averaged" to produce a single value for that gene/transcript.
- The CEL files must first be normalized to account for technical variation between the arrays.

Robust Multi-array Average (RMA)
1. Background adjust PM values from .CEL files.
2. Take the base-2 log of each background-adjusted PM intensity.
3. Quantile normalize the values from step 2 across all GeneChips.
4. Perform median polish separately for each probe set, with rows indexed by GeneChip and columns indexed by probe.
5. For each row, find the average of the fitted values from step 4 to use as the probe-set-specific expression measure for each GeneChip.
Output: .TXT files

Log Transformation
Reasons for working with log-transformed intensities:
- Spreads features more evenly across the intensity range
- Makes variability more constant across the intensity range
- Makes the distribution of intensities and errors closer to normal
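As a small illustration of the first point, a set of raw intensities spanning three orders of magnitude becomes evenly spread after a base-2 log. The intensity values below are made up.

```python
import math

# Hypothetical raw intensities covering a wide dynamic range.
intensities = [64, 256, 4096, 65536]

# After log2, a doubling of intensity always corresponds to a step of 1.
log2_values = [math.log2(x) for x in intensities]
print(log2_values)  # [6.0, 8.0, 12.0, 16.0]
```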

How to normalize?
Many methods exist:
- Median scaling: the median intensity should be the same for all chips
- Known genes: housekeeping or invariant genes
- Quantile normalization: RMA (Robust Multi-array Average), GC-RMA
The normalization method may differ depending on the array platform.
(Figure panels: raw data; after normalization.)
Reading materials
- GC-RMA: Wu et al. (2004), JASA, 99, 909-917.
- RMA: Irizarry et al. (2003), Nuc Acids Res, 31, e15.
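Median scaling is the simplest of these methods to sketch. The example below rescales each chip so that its median matches a common target; using the mean of the per-chip medians as the target is an assumption made here for illustration, and the intensities are invented.

```python
from statistics import median


def median_scale(chips):
    """Rescale each chip (a list of intensities) to a shared median.

    The target is the mean of the per-chip medians (an illustrative choice).
    """
    medians = [median(chip) for chip in chips]
    target = sum(medians) / len(medians)
    return [[x * target / m for x in chip] for chip, m in zip(chips, medians)]


# Two hypothetical chips; the second is uniformly 1.5x brighter.
chips = [[100.0, 200.0, 400.0], [150.0, 300.0, 600.0]]
scaled = median_scale(chips)
print(scaled)  # both chips now have median 250.0
```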

RMA: Quantile Normalization
1. After background adjustment, find the smallest log2(PM) on each chip.
2. Average the values from step 1.
3. Replace each value in step 1 with the average computed in step 2.
4. Repeat steps 1 through 3 for the second smallest values, third smallest values, ..., largest values.
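The quantile normalization steps can be sketched in plain Python. This is an illustrative version, not the code used by any RMA implementation: it assumes every chip has the same number of probes and does not treat tied values specially.

```python
def quantile_normalize(chips):
    """Quantile normalize a list of chips (each a list of log2 PM values)."""
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Steps 1-2 for every rank: average the r-th smallest value across chips.
    rank_means = [
        sum(sc[r] for sc in sorted_chips) / len(chips) for r in range(n_probes)
    ]
    normalized = []
    for chip in chips:
        # Probe indices ordered from smallest to largest value on this chip.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        # Step 3: replace the r-th smallest value with the r-th rank mean.
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two hypothetical chips with three probes each.
result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(result)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization every chip contains the same set of values (1.5, 3.5, 5.5), only assigned to different probes.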

RMA: Median Polish
- For a given probe set with J probe pairs, let $y_{ij}$ denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip i and probe j.
- Assume $y_{ij} = \mu_i + \alpha_j + e_{ij}$, where $\alpha_1 + \alpha_2 + \dots + \alpha_J = 0$. Here $\mu_i$ is the gene expression of the probe set on GeneChip i, $\alpha_j$ is the probe affinity effect for the jth probe in the probe set, and $e_{ij}$ is the residual for the jth probe on the ith GeneChip.
- Perform Tukey's median polish on the matrix of $y_{ij}$ values, with $y_{ij}$ in the ith row and jth column.
- The estimate of $\mu_i$ from median polish is the probe-set-specific measure of expression for GeneChip i after correcting for array effect and probe effect.
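A minimal sketch of Tukey's median polish for one probe set follows, with rows as GeneChips and columns as probes. The function name, the fixed iteration count, and the toy matrix are assumptions for illustration. Note that median polish centers the probe effects so their median (rather than their sum) is zero.

```python
from statistics import median


def median_polish(y, n_iter=10):
    """Tukey's median polish on a matrix y (rows: GeneChips, columns: probes).

    Returns (expression per chip, probe effects, residuals).
    """
    n_rows, n_cols = len(y), len(y[0])
    resid = [row[:] for row in y]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals into the row effects.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep column medians out of the residuals into the probe effects.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    expression = [overall + r for r in row_eff]
    return expression, col_eff, resid


# Toy probe set: two chips, three probes, exactly additive (no noise).
y = [[7.0, 8.0, 9.0], [9.0, 10.0, 11.0]]
expression, probe_effects, residuals = median_polish(y)
print(expression)     # [8.0, 10.0]
print(probe_effects)  # [-1.0, 0.0, 1.0]
```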

Differentially Expressed Genes (DEG)
Criteria for DEG discovery
- Amount of difference: fold change, signal-to-noise ratio
- Statistical significance: p-value, false discovery rate (FDR), odds ratio
Statistical methods
- Parametric: t-test
- Non-parametric: Wilcoxon rank-sum test
- Significance Analysis of Microarrays (SAM; permutation based)
- Empirical Bayes (Linear Models for Microarray Data, limma; Affy data)
- ANOVA (multiple factors; e.g. two different strains +/- drug)
Multiplicity of testing: p-value adjustments
- Methods: FDR, Bonferroni, etc.
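The two p-value adjustments named above can be sketched as follows. The FDR method shown is the Benjamini-Hochberg procedure, which is one common choice and an assumption here since the slide does not say which FDR method is meant; the p-values are invented.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four genes.
p = [0.01, 0.04, 0.03, 0.20]
bonf = bonferroni(p)
bh = benjamini_hochberg(p)
print(bonf)
print(bh)
```

With these numbers the Bonferroni values are never smaller than the Benjamini-Hochberg values, which reflects that Bonferroni is the more conservative correction.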

Limma & Empirical Bayes
- Limma is an R package for finding DEGs.
- It uses linear models:
  - Fitted to normalized intensities for each gene given a series of arrays
  - Design matrix: indicates which RNA samples have been applied to each array
  - Contrast matrix: specifies which comparisons you would like to make between the RNA samples
  - Can be used to compare two or more groups
- Assumption: normal distribution
- Uses empirical Bayes analysis to improve power with small sample sizes, by borrowing information across genes
- Output: p-values (adjusted for multiple testing)

Moderated/Bayesian t-test
- The ordinary t-test tests for a difference in means between two groups given the variability within each group.
- Moderated/Bayesian t-test: rather than estimating within-group variability separately for each gene, pool information from many similar genes.
- Advantage: reduces the occurrence of accidentally large t-statistics caused by accidentally small within-group variance.
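The idea can be sketched with the variance shrinkage used in Smyth (2004): the gene's own variance s^2 (with d degrees of freedom) is combined with a prior variance s0^2 (with d0 degrees of freedom) as (d0 * s0^2 + d * s^2) / (d0 + d). In limma the prior values are estimated from all genes; in this sketch they are passed in as hypothetical constants, and the expression values are invented.

```python
from math import sqrt
from statistics import mean, variance


def moderated_t(group_a, group_b, prior_var, prior_df):
    """Two-group t-statistic with the gene variance shrunk toward prior_var.

    prior_df = 0 gives the ordinary pooled-variance t-statistic.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    shrunk = (prior_df * prior_var + df * pooled) / (prior_df + df)
    return (mean(group_b) - mean(group_a)) / sqrt(shrunk * (1 / n_a + 1 / n_b))


# Hypothetical log2 values for one gene with an accidentally tiny variance.
control = [8.00, 8.01, 7.99]
treated = [8.20, 8.21, 8.19]

t_ordinary = moderated_t(control, treated, prior_var=0.0, prior_df=0)
# prior_var and prior_df are made-up stand-ins for values learned across genes.
t_moderated = moderated_t(control, treated, prior_var=0.04, prior_df=4)
print(t_ordinary, t_moderated)
```

The ordinary statistic is very large only because the three replicates happen to agree closely; shrinking the variance toward the prior brings the statistic down to a modest value.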

Further reading
- RNA-seq normalization: Dillies M-A et al. Briefings in Bioinformatics, 2012, 14, 671-683.
- Limma & eBayes: Smyth GK. Statistical Applications in Genetics and Molecular Biology, 2004, 3 (1), article 3.

Notice
Course homepage: http://wiki.tgilab.org/MES7594