RNA-seq Manpreet S. Katari
Abundance of mRNA is what we try to measure. Diagram: DNA → RNA → protein → phenotype, with cDNA made from the RNA.
Microarrays vs Northern blots: from Gene to Genome Science
Northern blot: limited by the number of lanes in the gel.
Microarray: a large number of DNA fragments are attached in a systematic way to a solid substrate, so mRNA levels can be measured for thousands of genes (nearly every gene in a genome) in parallel.
Microarrays permit the simultaneous analysis of the RNA expression of thousands of genes. For fully sequenced genomes, microarrays can be used to analyze the expression of every gene. Northern blots, on the other hand, are limited by the number of lanes on the gel and by the number of probes that can be used on the same blot. Northern blots normally have 20–40 lanes, and no more than three probes can be used simultaneously. Thus, microarrays increase the throughput by several orders of magnitude. DNA microarrays are an extremely powerful tool. Whereas we have traditionally examined genes in isolation, DNA microarrays allow us to see all the genes of a cell working together in concert, with expression levels for tens of thousands of genes measured at once. A microarray of 50,000 unique cDNAs allows the expression monitoring of the entire human genome in a single hybridization.
Evolution of Sequence Technology
Transcriptomics using RNA-seq
Genome-wide expression analysis
Goal: to measure RNA levels of all genes in a genome under various experimental conditions.
RNA levels vary with: cell type, developmental stage, external stimuli, disease state.
Time and location of expression provide information on genes’ function and interactions, and can be useful for many purposes, including disease diagnostics and medical applications.
Once every gene in a genome has been identified, it becomes feasible to measure each gene’s expression. One of the first goals along this line has been to measure the steady-state abundance of RNA made from each gene. There have also been ongoing attempts to measure the level of all proteins. (See the chapter on proteomics.) The levels of RNA vary depending on the cell type, the developmental stage, environmental stimuli, etc. For example, the RNAs expressed in a heart cell differ greatly from those expressed in a brain cell, and the RNAs expressed in fetal blood differ from those expressed in adult blood. In addition, exposure to high heat triggers the production of heat-shock RNAs, which are not present under normal conditions. Therefore, determination of the RNA levels found at a particular time and in a specific cell or organ can provide important information as to the function of the genes responsible for this expression. In addition, the spectrum or profile of RNAs found in a particular cell can be used as a means of disease diagnosis. For example, different types of cancer have been shown to have different RNA profiles. (See the example later in this chapter.)
For high-throughput transcriptomics studies, comparisons are almost always across experiments. (Figure labels: whole body, liver, lung, brain, kidney.)
Questions that can be addressed with genome-wide expression analysis: What genes have similar function? What regulatory pathways exist? Can we subdivide experiments or genes into meaningful classes? Can we correctly classify an unknown experiment or gene into a known class? Can we make better treatment decisions for a cancer patient based on his or her gene expression profile?
First two basic tasks in generating meaningful data for transcriptomics analysis:
1. Normalize or scale all samples and replicates to each other
2. Make a (statistical) statement about what changes are evident in the comparison
Microarrays Provides the mRNA level of thousands of genes (sometimes almost all known genes in a genome) in a given sample Sample=tissue (e.g., liver, brain), tissue in a specific environment or state (e.g., brain with cancer), etc.
Three types of arrays
Spotted microarrays: long dsDNA (typically genomic PCR products)
On-chip oligonucleotide synthesis by photolithography: Affymetrix (~25-mers)
On-chip oligonucleotide synthesis by ink-jet printing: Agilent (~60-mers)
Sample labeling
Fluorescent cDNA: cDNA is made using reverse transcriptase; fluorescently labeled nucleotides are added and incorporated into the cDNA.
cRNA + biotin: cDNA is made using reverse transcriptase; a linker with a T7 RNA polymerase recognition site is added; T7 polymerase and biotin-labeled RNA bases are added; the biotin label is incorporated into the cRNA.
Labeling of the target RNA is usually performed by generating a single-stranded cDNA, using the enzyme reverse transcriptase. One method of labeling uses fluorescently labeled nucleotides that are incorporated into the cDNA during the reverse-transcription reaction. This is generally the way the nucleotides labeled with the dyes Cy3 and Cy5 are incorporated into targets used in competitive hybridization. An alternative for labeling the target RNA population is first to make double-stranded cDNA and then to use a viral RNA polymerase to make cRNA. To accomplish this task, a linker is added to the cDNA that contains the recognition site for an RNA polymerase (e.g., T7 RNA polymerase). Labeling is done by adding modified RNA bases to the RNA polymerase reaction. This type of labeling is used for Affymetrix GeneChips as well as NimbleGen chips. The production of cRNA using T7 polymerase involves amplification of the original RNA population. Different labels are incorporated depending on the type of microarray experiment that is being performed. For experiments in which two different RNA populations are analyzed on the same microarray (competitive hybridization), two dyes are used that fluoresce at different wavelengths. The most commonly used dyes are Cy3 and Cy5. Labeling for hybridization to Affymetrix GeneChips and NimbleGen chips uses biotin-conjugated RNA bases. Fluorescently labeled avidin is then bound to the biotin.
The streptavidin/biotin system is of special interest because it has one of the largest free energies of association yet observed for noncovalent binding of a protein and a small ligand in aqueous solution (K_assoc ≈ 10^14 M^-1). The complexes are also extremely stable over a wide range of temperature and pH. The streptavidin protomer is organized as an 8-stranded beta-barrel. Pairs of the barrels bind together to form symmetric dimers, pairs of which in turn interdigitate with their dyad axes coincident to form the naturally occurring tetramer.
Microarray hybridization
(Figure labels: mRNA, cDNA, DNA microarray, samples.)
Spotted microarrays: competitive hybridization; two labeled cDNA samples (experimental and control) are hybridized to the same slide; Cy3 and Cy5 dye labeling, which fluoresce at different wavelengths.
Affymetrix GeneChips: one labeled RNA population per chip; biotin labeling, which binds to fluorescently labeled avidin; the comparison is made between hybridization intensities of the same oligonucleotides on different chips.
In addition to the differences in their manufacturing, spotted microarrays and GeneChips (as well as NimbleGen chips) differ in how the hybridization is performed. For spotted microarrays, usually the two labeled targets to be compared are hybridized to the same microarray. This procedure is known as competitive hybridization. For GeneChips, only one labeled target is hybridized to each chip. Comparisons are made at the analysis stage between hybridization intensities measured on two different chips. With competitive hybridization, one is measuring the relative difference between the signal intensity of two targets binding to the same spot of DNA. The practical reason for this approach is that there is often variability in the quality of the spotted DNA, in terms of amount and integrity. This measurement compensates for differences in the quality of the spot. Microarrays made with photolithography tend to have higher reproducibility from slide to slide, making competitive hybridization less important. For spotted-microarray hybridization, one target RNA is labeled with the fluorescent dye Cy3 and the other target with the fluorescent dye Cy5. Both targets are hybridized to the same microarray. The relative intensity of the hybridization is determined using confocal laser scanning microscopy.
Affymetrix system
What is the Affymetrix Signal?
Background subtraction:
The microarray is divided into sectors.
Probe signals are ordered and the lowest 2% is taken as the noise level.
A weighted mean of the background is subtracted from the signal, such that closer sectors are weighted more heavily.
Background Adjustment
Estimating the background effect: PM = true signal + background
Quantile Normalization
1. Sort each column disregarding gene order
2. Calculate row averages
3. Substitute average values for real ones
4. Restore gene order
Example: three genes (Gene A, Gene B, Gene C) measured in three experiments (Experiment 1, 2, 3); the cell values shown are 3, 100, 500, 17, 10, 150, 1000, 3 and 10. The row averages of the sorted columns are 5.3, 86.7 and 505.7; after substitution and restoring gene order, Gene A reads 5.3, 86.7, 505.7.
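The four steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any microarray package. The gene-to-value assignment in `raw` is an assumption, chosen so that the rank averages come out at the 5.3, 86.7 and 505.7 shown on the slide. Ties are broken by row order rather than averaged.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x experiments table (list of rows)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Step 1: sort each column, disregarding gene order.
    sorted_cols = [sorted(col) for col in columns]

    # Step 2: average across experiments at each rank (row averages).
    rank_means = [sum(col[r] for col in sorted_cols) / n_cols
                  for r in range(n_rows)]

    # Steps 3 and 4: give each value the average for its rank,
    # writing it back at the gene's original row.
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[i][j] = rank_means[rank]
    return normalized


genes = ["Gene A", "Gene B", "Gene C"]
# Hypothetical layout (assumption): rows are genes, columns are experiments.
raw = [
    [3, 100, 1000],
    [17, 3, 150],
    [10, 500, 10],
]

normalized = quantile_normalize(raw)
for name, row in zip(genes, normalized):
    print(name, [round(v, 1) for v in row])
```

After normalization every experiment has exactly the same set of values; only the gene that holds each value differs from column to column.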
Normalizing the Data
RPKM (reads per kilobase of exon per million mapped reads)
$\text{Score} = \frac{R}{N \times T}$
R = number of unique reads for the gene
N = size of the gene in kilobases (sum of exon lengths / 1000)
T = total number of reads in the library mapped to the genome / 1,000,000
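As a small worked example of the RPKM score, the function below applies the definition directly. The function name, argument names and the read counts are hypothetical and chosen only for illustration.

```python
def rpkm(unique_reads, exon_length_bp, total_mapped_reads):
    """RPKM = R / (N * T), following the slide's definitions."""
    n = exon_length_bp / 1000            # N: gene size in kilobases of exon
    t = total_mapped_reads / 1_000_000   # T: millions of mapped reads
    return unique_reads / (n * t)


# Hypothetical gene: 500 unique reads, 2,000 bp of exon,
# in a library with 10 million mapped reads.
example_score = rpkm(unique_reads=500, exon_length_bp=2000,
                     total_mapped_reads=10_000_000)
print(example_score)  # 500 / (2 * 10) = 25.0
```

Dividing by N removes the advantage that long genes have in collecting reads, and dividing by T makes libraries of different sequencing depth comparable.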
Figure 2 | Reproducibility, linearity and sensitivity. (a) Comparison of two brain technical replicate RNA-Seq determinations for all mouse gene models (from the UCSC genome database), measured in reads per kilobase of exon per million mapped sequence reads (RPKM), which is a normalized measure of exonic read density; R² = 0.96. (b) Distribution of uniquely mappable reads onto gene parts in the liver sample. Although 93% of the reads fall onto exons or the RNAFAR-enriched regions (see Fig. 3 and text), another 4% of the reads falls onto introns and 3% in intergenic regions. (c) Six in vitro–synthesized reference transcripts of lengths 0.3–10 kb were added to the liver RNA sample (1.2 × 10^4 to 1.2 × 10^9 transcripts per sample; R² > 0.99). (d) Robustness of RPKM measurement as a function of RPKM expression level and depth of sequencing. Subsets of the entire liver dataset (with 41 million mapped unique + splice + multireads) were used to calculate the expression level of genes in four different expression classes to their final expression level. Although the measured expression level of the 211 most highly expressed genes (black and cyan) was effectively unchanged after 8 million mappable reads, the measured expression levels of the other two classes (purple and red) converged more slowly. The fraction of genes for which the measured expression level was within ±5% of the final value is reported. 3 RPKM corresponds to approximately one transcript per cell in liver. The corresponding number of spliced reads in each subset is shown on the top x axis.
RNA-seq provides even more
Figure 4 | Candidate new and revised exons identified by the RNAFAR algorithm. (a) A 40-kb region encompassing the Mef2d gene, which is expressed in adult muscle (28 RPKM in muscle and 45 RPKM in brain), and a neighboring gene that is expressed at a much lower level in brain. RefSeq has only a single annotation for Mef2d, but UCSC has five (labeled α–ε). The α form corresponds to the RefSeq model, and γ is a muscle-specific isoform (ref. 23). The RNAFAR algorithm identified seven regions (red) enriched with reads that fell outside the NCBI gene annotations and were assigned by the algorithm to the Mef2d locus. (b) A 1.5-kb close-up of muscle-specific alternative splicing at the RNAFAR region labeled ‘B’ in panel a. The prevalence of splicing switches from the canonical exon in the brain sample to the RNAFAR exon in the muscle sample, as seen both in the ratio of spliced reads and in the number of reads falling on the two diagnostic exons. (c) The number of expected spliced reads for each gene model was predicted computationally, based on the number of introns and the exonic read density. The predicted number is then plotted against the number of splices observed (R² = 0.90). (d) The tissue distribution of genes with two splice isoforms in the same tissue.
Figure 3 Transcript abundance–dependent concordance between RNA-seq and microarray. (a) Root mean squared distance (RMSD in y axis) between pairs of rats for each chemical and averaged over all the chemicals by bins of genes. Expression levels ranged from high (0%) to low (100%) and each bin, A to S, contained 10% of the expressed genes. The analysis was performed on RNA-seq with six pipelines and the microarray with two normalization methods (RMA and MAS5). (b,c) For each chemical, the x axis represents the number of DEGs top ranked by the fold change with P < 0.05 for both platforms with equal numbers of up- and downregulation. The y axis represents the overlap (%) between platforms for a given number of ranked DEGs. Each line on the graph represents the overlap of DEG lists between two platforms for one chemical for above-median expressed genes (b) and below-median expressed genes (c).
Comparison of platforms for detecting gene expression: AFFY GeneChip vs Illumina
Criteria compared:
All protein-coding genes are represented
Can detect all the different types of RNA
Cost
Can determine gene regulation
Requires pre-existing knowledge of gene sequence
As the price of sequencing goes down, there will be almost no advantage of microarrays over RNA-seq.
Mapping Reads from RNA molecules
What is the advantage of mapping reads from RNA to the sequenced genome instead of to a database of all predicted RNA molecules?
We are not depending on the quality of annotation.
We are not assuming that we know about all of the RNA molecules in the cell.
How can we find reads mapping to splice junctions?
Create a separate database of all possible splice junctions.
Split reads in half and map them separately.
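The "split reads in half" idea can be illustrated with a toy example. All sequences below are made up, and exact string matching stands in for a real aligner. The sketch also assumes the junction falls exactly at the midpoint of the read, which real spliced aligners do not require.

```python
# Toy genome: two exons separated by an intron (hypothetical sequences).
EXON1 = "ACGTACGTTGCA"
INTRON = "GTAAGTCCCCCCTTTTAG"
EXON2 = "GGATCCATTGAC"
GENOME = EXON1 + INTRON + EXON2


def map_split_read(read, genome):
    """Map a read whole; if that fails, map its two halves separately.

    Returns ("contiguous", start), ("spliced", (intron_start, intron_end))
    or ("unmapped", None). Coordinates are 0-based and end-exclusive.
    """
    start = genome.find(read)
    if start != -1:
        return "contiguous", start

    half = len(read) // 2
    left, right = read[:half], read[half:]
    left_start = genome.find(left)
    if left_start == -1:
        return "unmapped", None
    # The right half must map downstream of the left half.
    right_start = genome.find(right, left_start + half)
    if right_start == -1:
        return "unmapped", None
    return "spliced", (left_start + half, right_start)


# A read from the mature mRNA that spans the exon1/exon2 junction.
spliced_read = EXON1[-6:] + EXON2[:6]
result = map_split_read(spliced_read, GENOME)
print(result)
```

The read does not occur as one contiguous stretch of the genome, but its halves map on either side of the intron, and the gap between them gives the candidate intron coordinates.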
Bowtie & TopHat
Cufflinks first starts with the output of any alignment tool such as TopHat
Then it assembles the isoforms by first identifying the reads that cannot be assembled together.
Then it calculates abundance.
Assembling the reads to identify transcripts.
CuffCompare
The program cuffcompare helps you:
Compare your assembled transcripts to a reference annotation
Track Cufflinks transcripts across multiple experiments (e.g. across a time course)
Output contains codes:
= match
c contained
j new isoform
u unknown, intergenic transcript
i single exon in intron region
Identification of splice junctions depends largely on the depth of sequencing coverage.
Cuffdiff
Can be used to find significant changes in transcript expression, splicing, and promoter use.
Inputs are:
Annotation to compare (can be output from cufflinks)
TopHat output from different samples
Options are similar to cufflinks; a different FDR cutoff can also be specified.
Which comparison is more convincing that genes are different?

COMPARISON A (GENE A)
            Rep1  Rep2  Rep3  Mean
Control       20    21    19    20
Treatment     30    31    29    30

COMPARISON B (GENE B)
            Rep1  Rep2  Rep3  Mean
Control       10    20    30    20
Treatment     20    30    40    30
t test
$t = \frac{\bar{x}_c - \bar{x}_t}{SE}$ (difference in the means divided by the standard error of the difference)
This test statistic can be used to evaluate the probability that the two means are the same, using critical values of t, where you select the probability of making a type I error (e.g., 0.05).
$\text{Var} = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$ (sum of squares of the differences from the mean, divided by n − 1)
Degrees of freedom = $n_t + n_c - 2$
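To make the test statistic concrete, the sketch below computes a pooled-variance (equal-variance) t statistic for the replicate values of Comparison A and Comparison B. The function name is hypothetical, and the sign convention (control minus treatment) is an assumption. The two-sided critical value of 2.776 for 4 degrees of freedom at 0.05 is hard-coded to keep the example free of external libraries.

```python
import math
from statistics import mean, variance


def pooled_t(control, treatment):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    nc, nt = len(control), len(treatment)
    df = nt + nc - 2
    # statistics.variance uses the n - 1 denominator.
    pooled_var = ((nc - 1) * variance(control)
                  + (nt - 1) * variance(treatment)) / df
    se = math.sqrt(pooled_var * (1 / nc + 1 / nt))
    return (mean(control) - mean(treatment)) / se, df


T_CRIT_DF4 = 2.776  # two-sided critical value of t, alpha = 0.05, df = 4

t_a, df_a = pooled_t([20, 21, 19], [30, 31, 29])   # Comparison A
t_b, df_b = pooled_t([10, 20, 30], [20, 30, 40])   # Comparison B

for label, t in (("A", t_a), ("B", t_b)):
    verdict = "significant" if abs(t) > T_CRIT_DF4 else "not significant"
    print(f"Comparison {label}: t = {t:.3f}, {verdict} at 0.05")
```

Both comparisons have the same difference in means (10), but only Comparison A, with its tight replicates, exceeds the critical value.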
Volcano plot: visualizing significance and fold change
Volcano plot: visualizing significance and fold change. What can you tell me about this point? Large difference in the mean values, but not significant. Must have high variance in measurements.
Volcano plot: visualizing significance and fold change. What can you tell me about this point? Small difference in the mean values, but highly significant. Must have low variance in measurements.
Assumptions of the t-test
Samples are drawn from normal distributions, i.e. our estimates of geneA and geneB are random samples from a normal distribution.
The variance of the two populations is equal.
There is no mean-variance relationship.
RNA-seq data
Count data (discrete): possible to get zero, cannot get a negative number.
Each sequence read is a random event drawn from a larger population.
Variance increases with the mean.
RNA-seq data: variance > mean
RNA-seq data are consistent with an over-dispersed Poisson: variance = a*mean, with a > 1.
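One simple way to see over-dispersion is to compare the sample variance of replicate counts with their mean. Under a pure Poisson model the ratio would be close to 1; a ratio well above 1 corresponds to a > 1. The counts and gene names below are hypothetical, and with so few replicates the ratio is only a rough estimate.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Estimate a in variance = a * mean from replicate read counts."""
    return variance(counts) / mean(counts)


# Hypothetical read counts for two genes across five replicates.
replicate_counts = {
    "gene_low": [4, 16, 7, 13, 10],
    "gene_high": [90, 100, 110, 140, 60],
}

ratios = {gene: dispersion_ratio(c) for gene, c in replicate_counts.items()}
for gene, a in ratios.items():
    print(f"{gene}: variance / mean = {a:.2f}")
```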
Should we treat a difference between 9 vs 12 reads the same as 900 vs 1200?
t test does not account for scale of the data
9 vs 12 reads: t = -3.6742, p-value = 0.02131
900 vs 1200 reads: t = -3.6742, p-value = 0.02131
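The scale invariance is easy to verify numerically. The replicate values below are hypothetical (the slide does not list them); they were chosen to have means of 9 and 12, and the second data set is the same values multiplied by 100. An equal-variance t statistic is assumed.

```python
import math
from statistics import mean, variance


def t_statistic(group1, group2):
    """Equal-variance two-sample t statistic (group1 minus group2)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1)
                  + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


# Hypothetical replicates with means 9 and 12.
low_a, low_b = [8, 9, 10], [11, 12, 13]
# The same data scaled by 100 (means 900 and 1200).
high_a = [100 * x for x in low_a]
high_b = [100 * x for x in low_b]

t_low = t_statistic(low_a, low_b)
t_high = t_statistic(high_a, high_b)
print(f"9 vs 12:     t = {t_low:.4f}")
print(f"900 vs 1200: t = {t_high:.4f}")
```

Multiplying every value by a constant scales the difference in means and the standard error by the same factor, so t is unchanged. A count-based model does not have this property, because the expected noise in counts depends on their magnitude.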
Test using a negative binomial model [glm.nb()]
9 vs 12 reads: p-value = 0.258
900 vs 1200 reads: p-value = 1.03e-05
RNA-seq pipeline Manpreet S. Katari
The basic workflow
1. Perform quality control: fastqc
2. Trim low-quality sequence: trimmomatic
3. Map the reads to the genome:
   Build the database: bowtie2
   Run the alignment: tophat
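The workflow can be written down as an ordered list of command lines. The sketch below only builds and prints the commands; it does not run them. File names, the output directory, the index name and the trimming thresholds are all hypothetical. It assumes single-end reads and that Trimmomatic is available through a `trimmomatic` wrapper script (some installations invoke it as `java -jar trimmomatic.jar` instead).

```python
import shlex


def build_pipeline(reads, genome_fasta, index_base="genome_index"):
    """Return the workflow steps as argument lists, in execution order."""
    trimmed = "trimmed_" + reads
    return [
        # 1. Quality control report for the raw reads.
        ["fastqc", reads],
        # 2. Trim low-quality sequence (single-end mode, example thresholds).
        ["trimmomatic", "SE", reads, trimmed,
         "SLIDINGWINDOW:4:20", "MINLEN:36"],
        # 3a. Build the Bowtie 2 index ("database") from the genome FASTA.
        ["bowtie2-build", genome_fasta, index_base],
        # 3b. Align the trimmed reads to the indexed genome with TopHat.
        ["tophat", "-o", "tophat_out", index_base, trimmed],
    ]


commands = build_pipeline("sample.fastq", "genome.fa")
for cmd in commands:
    print(shlex.join(cmd))
```

Each entry could be passed to `subprocess.run(cmd, check=True)` once the tools are installed. Keeping the steps as data makes it easy to inspect the commands before anything is executed.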