Lecture 10. Microarray and RNA-seq


1 Lecture 10. Microarray and RNA-seq: Identification of Differentially Expressed Genes (DEG)
Hyun Seok Kim, Ph.D., Assistant Professor, Severance Biomedical Research Institute, Yonsei University College of Medicine
MES7594-01 Genome Informatics I (2015 Spring)

2 Gene expression

3 What is a microarray?
Platforms:
 - Glass slides (cDNA array)
 - Chips (Affymetrix)
 - Glass beads (Illumina)
Tens of thousands of oligonucleotide (or cDNA) probes are fixed on the surface of each platform.
Microarrays can detect and quantify mRNA, microRNA, SNP, LOH, CNV, ...
[Figure: cDNA, Affymetrix, and Illumina platforms]

4 Questions of Interest
 - Determine steady-state gene expression levels of a sample at whole-transcriptome scale.
 - Identify differentially expressed genes between samples.
 - Identify differentially regulated pathways or protein complexes.

5 Affymetrix GeneChip for mRNA quantification
About the Affy GeneChip platform:
 - Probes (25-mers) are synthesized on a chip using a photolithographic manufacturing process.
 - At each x, y location of a GeneChip, millions of copies of a particular oligonucleotide are synthesized.
 - Each gene is represented by a unique set of probe pairs: a perfect match (PM) probe and a mismatch (MM) probe.
 - MM helps increase the specificity of the PM signal.

6 Affymetrix GeneChip for mRNA quantification
About the Affy workflow:
1. Isolate total RNA (need biological replicates)
2. Sample amplification and labeling
3. Sample injected into microarray
4. Probe array hybridization, washing
5. Probe array scanning and intensity quantification
6. Intensity translated into nucleic acid abundance

7 Illumina BeadArray for mRNA quantification
BeadChip platform:
 - Each bead carries one type of oligo, with thousands of copies of that oligo per bead.
 - Beads are deposited in wells on glass slides.
 - The beads are decoded in a separate step using proprietary technology.

8 Affy vs Illumina
Affymetrix GeneChip | Illumina BeadArray
25-mer | Longer oligo
Probe synthesized on chips | Bead technology
Multiple probes/probeset | Single probe
.dat, .cel, .cdf, .chp file types | Image file processed by Bead Studio
Normalization by MAS5, RMA, GC-RMA etc. | Normalization by average, quantile, RSN etc.
Other points on the slide:
 - Multiple probes/transcript
 - TXT output for downstream analysis
 - Annotations can be updated
Adapted from Dr.

9 RNA sequencing
Samples of interest: Condition 1 (normal colon), Condition 2 (colon tumor)
1. Isolate RNAs
2. Generate cDNA, fragment, size select, add linkers
3. Sequence ends (100s of millions of paired reads, 10s of billions of bases of sequence)
4. Map to genome, transcriptome, and predicted exon junctions
5. Downstream analysis
Adapted from Canadian Bioinformatics Workshop

10 Pros and Cons of RNA-seq (versus microarray)
Pros
 - More powerful in detecting lowly expressed genes
 - Detects splicing variants and fusion transcripts
 - Measures allele-specific expression
 - Discovers mutations
Cons
 - Biased to highly expressed genes (e.g. ribosomal, mitochondrial genes)
 - More complicated analysis workflow (mapping to reference genome)
 - More expensive, e.g. HiSeq 2500, 100 bp x 2, 4 Gb -> $700/sample (vs. < $500/array)

11 RNA-seq: Experimental Design
 - Single-end read: one read sequenced from one end of each cDNA insert in the sample.
 - Paired-end read: two reads (one from each end) sequenced from each cDNA insert in the sample. Paired-end reads map better over repetitive regions and help detect fusions and novel transcripts.

12 RNA-seq workflows
 - Sequencing: obtain raw data (fastq format)
 - Quality control (optional): FASTX
 - Workflow 1: tophat2 (align) -> cufflinks (transcript assembly) -> cuffdiff (DEGs), cuffmerge (merge assemblies)
 - Workflow 2: bowtie2 (align) -> HTSeq-count (count by gene) -> edgeR or DESeq (DEGs)
 - Fusion detection (optional): "chimerascan" or "defuse"

13 Normalization method (old)
RPKM: reads per kilobase per million mapped reads (single end)
RPKM = (10^9 * C) / (N * L)
 - C = number of reads mapped to a gene
 - N = total mapped reads in the experiment
 - L = exon length in base pairs for a gene (the factor 10^9 combines the per-kilobase factor 10^3 and the per-million factor 10^6)
The RPKM measure is inconsistent among samples.
FPKM: fragments per kilobase per million mapped fragments (paired end).
DEG discovery based on RPKM and FPKM is affected by gene length (no longer recommended).
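The RPKM formula can be checked with a few lines of Python. This is a minimal sketch that assumes exon length is given in base pairs, which is what the 10^9 factor implies; the gene and its numbers are hypothetical.

```python
def rpkm(read_count, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon per million mapped reads.

    Assumes exon_length_bp is in base pairs, so the 10^9 factor
    covers both the per-kilobase and the per-million scaling.
    """
    return (1e9 * read_count) / (total_mapped_reads * exon_length_bp)


def rpkm_stepwise(read_count, total_mapped_reads, exon_length_bp):
    """Same quantity computed in two explicit steps."""
    reads_per_kb = read_count / (exon_length_bp / 1e3)
    return reads_per_kb / (total_mapped_reads / 1e6)


# Hypothetical gene: 500 reads, 2,000 bp of exon, 20 million mapped reads.
print(rpkm(500, 20_000_000, 2_000))           # 12.5
print(rpkm_stepwise(500, 20_000_000, 2_000))  # 12.5
```

Doubling the exon length while keeping the read count fixed halves the value, which is the length dependence the slide warns about.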

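14 Negative binomial RNA-seq
Microarray intensities are continuous and, after log transformation, are usually modeled as approximately normal. RNA-seq data are read counts. Counts from technical replicates are well described by a Poisson distribution, in which the variance equals the mean, but counts from biological replicates are not: genes with high mean counts (either because they are long or highly expressed) show more variance between samples than a Poisson model predicts. Such overdispersed data fit a negative binomial distribution. edgeR and DESeq identify DEGs based on the negative binomial distribution.
[Figure labels: Poisson, microarray, RNA-seq] Adapted from EMBL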
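The difference between the two count models shows up in their mean-variance relationship. The sketch below uses the common parameterization in which the negative binomial variance is the mean plus a dispersion term times the squared mean; the dispersion value 0.1 is a hypothetical choice for illustration, not an estimate from any data set.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance equals the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Negative binomial variance: mean + dispersion * mean^2.

    With dispersion = 0 this reduces to the Poisson variance.
    """
    return mean + dispersion * mean ** 2


# Hypothetical dispersion; real tools estimate it from the data.
DISPERSION = 0.1

for mean in (10, 100, 1000):
    print(mean,
          poisson_variance(mean),
          negative_binomial_variance(mean, DISPERSION))
```

The gap between the two variances widens as the mean grows, which is why highly expressed genes depart most from the Poisson model.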

15 A case study for practice session
GEO accession number: GSE41588

16 GEO entry of GSE41588
 - Two platforms: Affy and HiSeq
 - Matrix file: processed data
 - Raw files: CEL, count files

17 Data Pre-processing
 - Affy produces a CEL format file as raw data.
 - The CEL file contains the feature quantifications.
 - In the CEL file the probes are still spread over the chip.
 - Values still need to be summarized to probe set level; for example, 90525_at = 250 units.
 - Probe set: a collection of probes designed to interrogate a given sequence.

18 CEL file to TXT file
 - In going from a .CEL to a .TXT file to generate signal values, the multiple probes within a probe set are "averaged" to produce a single value for that gene/transcript.
 - The CEL files must first be normalized to account for technical variation between the arrays.

19 Robust Multi-array Average (RMA)
1. Background adjust PM values from .CEL files.
2. Take the base-2 log of each background-adjusted PM intensity.
3. Quantile normalize values from step 2 across all GeneChips.
4. Perform median polish separately for each probe set with rows indexed by GeneChip and columns indexed by probe.
5. For each row, find the average of the fitted values from step 4 to use as probe-set-specific expression measures for each GeneChip.
-> .TXT files

20 Log Transformation
Reasons for working with log-transformed intensities:
 - Spreads features more evenly across the intensity range
 - Makes variability more constant across the intensity range
 - Brings the distribution of intensities and errors closer to normal
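One practical consequence of the log scale is that up- and down-regulation become symmetric. The following sketch uses hypothetical intensity values to show that a 2-fold increase and a 2-fold decrease map to +1 and -1 on the log2 scale, whereas the raw ratios (2.0 and 0.5) are not symmetric around 1.

```python
import math


def log2_ratio(treated, control):
    """Log2 fold change of a treated intensity over a control intensity."""
    return math.log2(treated / control)


# Hypothetical intensities with a control level of 200 units.
doubled = log2_ratio(400, 200)  # 2-fold up
halved = log2_ratio(100, 200)   # 2-fold down
print(doubled, halved)  # 1.0 -1.0
```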

21 How to normalize?
Many methods:
 - Median scaling: median intensity for all chips should be the same
 - Known genes: housekeeping, invariant genes
 - Quantile normalization: RMA (Robust Multiarray Averaging), GC-RMA
Normalization method may differ depending on array platform.
Reading materials:
 - GC-RMA: Wu et al. (2004), JASA, 99.
 - RMA: Irizarry et al. (2003), Nuc Acids Res, 31, e15.
[Figure: raw data versus after normalization]
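Median scaling, the simplest of the methods listed, can be sketched as follows. Each chip is a list of intensities, and the common target median is taken as the mean of the per-chip medians; that choice of target is an assumption made for this example, and the input values are hypothetical.

```python
from statistics import median


def median_scale(chips):
    """Rescale each chip so that all chips share the same median intensity.

    chips: list of chips, each a list of raw (non-log) intensities.
    The target median is the mean of the chip medians (an assumption).
    """
    medians = [median(chip) for chip in chips]
    target = sum(medians) / len(medians)
    return [[value * target / chip_median for value in chip]
            for chip, chip_median in zip(chips, medians)]


# Two hypothetical chips; the second was scanned twice as bright.
raw = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
scaled = median_scale(raw)
print([median(chip) for chip in scaled])  # [300.0, 300.0]
```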

22 RMA: Quantile Normalization
1. After background adjustment, find the smallest log2(PM) on each chip.
2. Average the values from step 1.
3. Replace each value in step 1 with the average computed in step 2.
4. Repeat steps 1 through 3 for the second smallest values, third smallest values, ..., largest values.
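The quantile normalization steps above can be sketched in plain Python. This is a minimal illustration, assuming every chip has the same number of values and that ties need no special handling; the input numbers are hypothetical log2(PM) values.

```python
def quantile_normalize(chips):
    """Quantile normalize a list of chips (each a list of log2 PM values).

    For every rank (smallest, second smallest, ...), the values holding
    that rank on each chip are replaced by their average across chips.
    """
    n_values = len(chips[0])
    # orders[c][r] is the position on chip c of its r-th smallest value.
    orders = [sorted(range(n_values), key=chip.__getitem__) for chip in chips]
    normalized = [[0.0] * n_values for _ in chips]
    for rank in range(n_values):
        rank_mean = sum(chip[order[rank]]
                        for chip, order in zip(chips, orders)) / len(chips)
        for out, order in zip(normalized, orders):
            out[order[rank]] = rank_mean
    return normalized


# Two hypothetical chips with three probes each.
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(chips))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization every chip contains the same set of values (1.5, 3.5, 5.5); only their assignment to probes differs between chips.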

23 RMA: Median Polish
For a given probe set with J probe pairs, let y_ij denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip i and probe j.
Assume y_ij = μ_i + α_j + e_ij, where α_1 + α_2 + ... + α_J = 0.
 - μ_i: gene expression of the probe set on GeneChip i
 - α_j: probe affinity effect for the jth probe in the probe set
 - e_ij: residual for the jth probe on the ith GeneChip
Perform Tukey's Median Polish on the matrix of y_ij values with y_ij in the ith row and jth column.
μ_i from median polish is the probe-set-specific measure of expression for GeneChip i after correcting for array effect and probe effect.
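A minimal sketch of Tukey's median polish for one probe set is given below, with rows as GeneChips and columns as probes. It alternately sweeps out row and column medians for a fixed number of iterations (the iteration count is an assumption) and reports the overall level plus the row effect as the expression value for each GeneChip. Note that this procedure centers the probe effects at a median of zero rather than enforcing a sum of zero; the input matrix is hypothetical.

```python
from statistics import median


def median_polish(y, n_iter=10):
    """Tukey's median polish on a matrix y (rows: GeneChips, cols: probes).

    Returns (expression per GeneChip, probe effects, residual matrix).
    """
    n_rows, n_cols = len(y), len(y[0])
    resid = [list(row) for row in y]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the median out of every row.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [value - m for value in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep the median out of every column.
        for j in range(n_cols):
            m = median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    expression = [overall + r for r in row_eff]
    return expression, col_eff, resid


# Hypothetical probe set: 3 GeneChips x 3 probes, exactly additive.
y = [[7.0, 8.0, 9.0],
     [9.0, 10.0, 11.0],
     [8.0, 9.0, 10.0]]
expression, probe_effects, residuals = median_polish(y)
print(expression)     # [8.0, 10.0, 9.0]
print(probe_effects)  # [-1.0, 0.0, 1.0]
```

Because medians are used instead of means, a single outlying probe value ends up in the residuals and has little influence on the expression estimate.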

24 Differentially Expressed Genes (DEG)
Criteria for DEG discovery
 - Amount of difference: fold change, signal-to-noise ratio
 - Statistical significance: p-value, false discovery rate (FDR), odds ratio
Statistical methods
 - Parametric: t-test
 - Non-parametric: Wilcoxon rank-sum test
 - Significance Analysis of Microarrays (SAM; permutation based)
 - Empirical Bayesian (Linear Models for Microarray Data, LIMMA; Affy data)
 - ANOVA (multiple factors; e.g. two different strains +/- drug)
Multiplicity of testing: p-value adjustments
 - Methods: FDR, Bonferroni, etc.
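The two p-value adjustments named above can be sketched in a few lines. The FDR adjustment shown here is the Benjamini-Hochberg procedure, which is one common FDR method and is an assumption about which variant is meant; the p-values are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that
    # adjusted values never decrease with increasing raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
p = [0.01, 0.04, 0.03, 0.20]
print(bonferroni(p))
print(benjamini_hochberg(p))
```

With thousands of genes tested at once, Bonferroni is usually far more conservative than an FDR-based adjustment, which is why FDR is the more common choice for DEG lists.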

25 Limma & Empirical Bayesian
Limma is an R package to find DEGs.
It uses linear models
 - Fitted to normalized intensities for each gene given a series of arrays
 - Design matrix: indicates which RNA samples have been applied to each array
 - Contrast matrix: specifies which comparisons you would like to make between the RNA samples
 - Can be used to compare two or more groups
Assumption: normal distribution
Uses empirical Bayesian analysis to improve power in small sample sizes
 - Borrowing information across genes
Output: p-values (adjusted for multiple testing)

26 Moderated/Bayesian t-test
 - Ordinary t-test: tests for a difference in means between two groups given the variability within each group.
 - Moderated/Bayesian t-test: rather than estimating within-group variability separately for each gene, pool the information from many similar genes.
 - Advantage: avoids accidentally large t-statistics caused by accidentally small within-group variance.
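The effect of pooling can be illustrated with the variance shrinkage used in the moderated t-statistic (Smyth, 2004): the gene's own variance s^2 with d degrees of freedom is combined with a prior variance s0^2 with d0 degrees of freedom as (d0 * s0^2 + d * s^2) / (d0 + d). In limma the prior values are estimated from all genes; in this sketch they are hypothetical constants, as are the expression values.

```python
import math
from statistics import mean, variance


def t_statistics(group_a, group_b, prior_df, prior_var):
    """Return (ordinary t, moderated t) for a two-group comparison.

    prior_df and prior_var stand in for the values an empirical Bayes
    fit would estimate from all genes; here they are supplied by hand.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    # Shrink the gene-wise variance toward the prior variance.
    moderated_var = (prior_df * prior_var + df * pooled_var) / (prior_df + df)
    diff = mean(group_b) - mean(group_a)
    scale = math.sqrt(1 / n_a + 1 / n_b)
    ordinary_t = diff / (math.sqrt(pooled_var) * scale)
    moderated_t = diff / (math.sqrt(moderated_var) * scale)
    return ordinary_t, moderated_t


# Hypothetical log2 intensities with an accidentally tiny variance.
control = [8.00, 8.01, 7.99]
treated = [8.30, 8.31, 8.29]
ordinary, moderated = t_statistics(control, treated,
                                   prior_df=4, prior_var=0.04)
print(ordinary, moderated)
```

In this example the ordinary t-statistic is inflated by the very small within-group variance, while the moderated statistic is much smaller because the variance has been pulled toward the prior.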

27 Further reading
 - RNA-seq normalization: Dillies M-A et al. Briefings in Bioinformatics, 2012, 14.
 - Limma & eBayes: Smyth GK. Statistical Applications in Genetics and Molecular Biology, 2004, 3 (1), article 3.

28 Notice
Course homepage:

