RNA-seq Manpreet S. Katari.



Abundance of mRNA is what we try to measure. Flow of information: DNA → RNA → protein → phenotype. The RNA is copied into cDNA for measurement.

Microarrays vs Northern blots: from Gene to Genome Science
Northern blot: limited by the number of lanes in the gel.
Microarray: a large number of DNA fragments are attached in a systematic way to a solid substrate, so mRNA levels can be measured for thousands of genes (~ every gene in a genome) in parallel.
Microarrays permit the simultaneous analysis of the RNA expression of thousands of genes. For fully sequenced genomes, microarrays can be used to analyze the expression of every gene. Northern blots, on the other hand, are limited by the number of lanes on the gel and by the number of probes that can be used on the same blot. Northern blots normally have 20–40 lanes, and no more than three probes can be used simultaneously. Thus, microarrays increase the throughput by several orders of magnitude. DNA microarrays are an extremely powerful tool. Whereas we have traditionally examined genes in isolation, DNA microarrays allow us to see all the genes of a cell working together in concert, with gene expression levels for tens of thousands of genes measured at once. A microarray of 50,000 unique cDNAs allows the expression monitoring of the entire human genome in a single hybridization.

Evolution of Sequence Technology

Transcriptomics using RNA-seq

Genome-wide expression analysis
Goal: to measure RNA levels of all genes in a genome under various experimental conditions.
RNA levels vary with: cell type, developmental stage, external stimuli, disease state.
Time and location of expression provide information on genes’ function and interactions, and can be useful for many purposes, including disease diagnostics and medical applications.
Once every gene in a genome has been identified, it becomes feasible to measure each gene’s expression. One of the first goals along this line has been to measure the steady-state abundance of RNA made from each gene. There have also been ongoing attempts to measure the level of all proteins. (See the chapter on proteomics.) The levels of RNA vary depending on the cell type, the developmental stage, environmental stimuli, etc. For example, the RNAs expressed in a heart cell differ greatly from those expressed in a brain cell, and the RNAs expressed in fetal blood differ from those expressed in adult blood. In addition, exposure to high heat triggers the production of heat-shock RNAs, which are not present under normal conditions. Therefore, determination of the RNA levels found at a particular time and in a specific cell or organ can provide important information as to the function of the genes responsible for this expression. In addition, the spectrum or profile of RNAs found in a particular cell can be used as a means of disease diagnosis. For example, different types of cancer have been shown to have different RNA profiles. (See the example later in this chapter.)

For high-throughput transcriptomics studies, comparisons are almost always across experiments (figure: whole body, liver, lung, brain and kidney samples).

Questions that can be addressed with genome-wide expression analysis: What genes have similar function? What regulatory pathways exist? Can we subdivide experiments or genes into meaningful classes? Can we correctly classify an unknown experiment or gene into a known class? Can we make better treatment decisions for a cancer patient based on his or her gene expression profile?

The first two basic tasks in generating meaningful data for transcriptomics analysis:
Normalize or scale all samples and replicates to each other.
Make a (statistical) statement about what changes are evident in the comparison.

Microarrays provide the mRNA level of thousands of genes (sometimes almost all known genes in a genome) in a given sample. Sample = tissue (e.g., liver, brain), tissue in a specific environment or state (e.g., brain with cancer), etc.

Three types of arrays
Spotted microarrays: long dsDNA (typically genomic PCR products)
On-chip oligonucleotide synthesis by photolithography: Affymetrix (~25-mers)
On-chip oligonucleotide synthesis by ink-jet printing: Agilent (~60-mers)

Sample labeling
Fluorescent cDNA: cDNA is made using reverse transcriptase; fluorescently labeled nucleotides are added; the labeled nucleotides are incorporated into the cDNA.
cRNA + biotin: cDNA is made using reverse transcriptase; a linker with a T7 RNA polymerase recognition site is added; T7 polymerase and biotin-labeled RNA bases are added; the biotin label is incorporated into the cRNA.
Labeling of the target RNA is usually performed by generating a single-stranded cDNA, using the enzyme reverse transcriptase. One method of labeling uses fluorescently labeled nucleotides that are incorporated into the cDNA during the reverse-transcription reaction. This is generally the way the nucleotides labeled with the dyes Cy3 and Cy5 are incorporated into targets used in competitive hybridization. Another alternative for labeling the target RNA population is first to make double-stranded cDNA and then to use a viral RNA polymerase to make cRNA. To accomplish this task, a linker is added to the cDNA that contains the recognition site for an RNA polymerase (e.g., T7 RNA polymerase). Labeling is done by adding modified RNA bases to the RNA polymerase reaction. This type of labeling is used for Affymetrix GeneChips as well as NimbleGen chips. The production of cRNA using T7 polymerase involves amplification of the original RNA population. Different labels are incorporated depending on the type of microarray experiment that is being performed. For experiments in which two different RNA populations are analyzed on the same microarray (competitive hybridization), two dyes are used that fluoresce at different wavelengths. The most commonly used dyes are Cy3 and Cy5. Labeling for hybridization to Affymetrix GeneChips and NimbleGen chips uses biotin-conjugated RNA bases. Fluorescently labeled avidin is then bound to the biotin.
The streptavidin/biotin system is of special interest because it has one of the largest free energies of association yet observed for noncovalent binding of a protein and a small ligand in aqueous solution (K_assoc = 10^14). The complexes are also extremely stable over a wide range of temperature and pH. The streptavidin protomer is organized as an 8-stranded beta-barrel. Pairs of the barrels bind together to form symmetric dimers, pairs of which in turn interdigitate with their dyad axes coincident to form the naturally occurring tetramer.

Microarray hybridization (figure: mRNA samples, cDNA, DNA microarray)
Spotted microarrays: competitive hybridization, with two labeled cDNA samples (experimental and control) hybridized to the same slide; Cy3 and Cy5 dye labeling, which fluoresce at different wavelengths.
Affymetrix GeneChips: one labeled RNA population per chip; biotin labeling, which binds to fluorescently labeled avidin; the comparison is made between hybridization intensities of the same oligonucleotides on different chips.
In addition to the differences in their manufacturing, spotted microarrays and GeneChips (as well as NimbleGen chips) differ in how the hybridization is performed. For spotted microarrays, usually the two labeled targets to be compared are hybridized to the same microarray. This procedure is known as competitive hybridization. For GeneChips, only one labeled target is hybridized to each chip. Comparisons are made at the analysis stage between hybridization intensities measured on two different chips. With competitive hybridization, one is measuring the relative difference between the signal intensity of two targets binding to the same spot of DNA. The practical reason for this approach is that there is often variability in the quality of the spotted DNA, in terms of amount and integrity. This measurement compensates for differences in the quality of the spot. Microarrays made with photolithography tend to have higher reproducibility from slide to slide, making competitive hybridization less important. For spotted-microarray hybridization, one target RNA is labeled with the fluorescent dye Cy3 and the other target with the fluorescent dye Cy5. Both targets are hybridized to the same microarray. The relative intensity of the hybridization is determined using confocal laser scanning microscopy.

Affymetrix system

What is the Affymetrix Signal?
Background subtraction:
The microarray is divided into sectors.
Within each sector the probe signals are ordered, and the lowest 2% is taken as the noise level.
A weighted mean of the background is subtracted from the signal, such that closer sectors are weighted more heavily.

Background adjustment: estimating the background effect.
PM = true signal + background (PM is the perfect-match probe intensity)

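Quantile Normalization
1. Sort each column, disregarding gene order
2. Calculate row averages
3. Substitute average values for real ones
4. Restore gene order
Example (Genes A, B and C in Experiments 1, 2 and 3; values shown on the slide: 3, 100, 500, 17, 10, 150, 1000, 3, 10): after sorting, the row averages are 5.3, 86.7 and 505.7. In every experiment the lowest value is replaced by 5.3, the middle value by 86.7 and the highest by 505.7, and the genes are then returned to their original order.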
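The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used by any particular package: the function name `quantile_normalize`, the example values and the tie handling (ties are broken by original row order) are all assumptions made here.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x experiments matrix (list of rows).

    Illustrative sketch: ties within a column are broken by row order.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Step 1: sort each column, disregarding gene order.
    sorted_cols = [sorted(col) for col in columns]

    # Step 2: average across experiments at each rank (the "row averages").
    rank_means = [sum(col[i] for col in sorted_cols) / n_cols
                  for i in range(n_rows)]

    # Steps 3 and 4: give each value the average for its rank, written back
    # at the gene's original position so that gene order is restored.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


# Hypothetical values (not the slide's table): rows are genes A, B, C.
example = [
    [2, 9, 7],
    [4, 3, 5],
    [6, 6, 12],
]
normalized = quantile_normalize(example)
for gene, row in zip("ABC", normalized):
    print(gene, [round(v, 2) for v in row])
```

After normalization every experiment contains exactly the same set of values; only the assignment of those values to genes differs between columns.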

Normalizing the Data
RPKM (reads per kilobase of exon per million mapped reads)
Score = R / (N × T)
R = number of unique reads for the gene
N = size of the gene in kilobases (sum of exon lengths / 1,000)
T = total number of reads in the library mapped to the genome / 1,000,000
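As a worked example of the RPKM score, the sketch below applies the definition directly. The function name `rpkm` and the input numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads.

    read_count         -- R, unique reads assigned to the gene
    exon_length_bp     -- summed exon length of the gene in base pairs
    total_mapped_reads -- reads in the library that mapped to the genome
    """
    n_kilobases = exon_length_bp / 1_000           # N
    t_millions = total_mapped_reads / 1_000_000    # T
    return read_count / (n_kilobases * t_millions)


# Hypothetical gene: 500 reads, 2 kb of exon, 10 million mapped reads.
score = rpkm(500, 2_000, 10_000_000)
print(score)
```

Dividing by N corrects for longer genes collecting more reads, and dividing by T corrects for libraries sequenced to different depths.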

Figure 2 | Reproducibility, linearity and sensitivity. (a) Comparison of two brain technical replicate RNA-Seq determinations for all mouse gene models (from the UCSC genome database), measured in reads per kilobase of exon per million mapped sequence reads (RPKM), which is a normalized measure of exonic read density; R^2 = 0.96. (b) Distribution of uniquely mappable reads onto gene parts in the liver sample. Although 93% of the reads fall onto exons or the RNAFAR-enriched regions (see Fig. 3 and text), another 4% of the reads falls onto introns and 3% in intergenic regions. (c) Six in vitro–synthesized reference transcripts of lengths 0.3–10 kb were added to the liver RNA sample (1.2 × 10^4 to 1.2 × 10^9 transcripts per sample; R^2 > 0.99). (d) Robustness of RPKM measurement as a function of RPKM expression level and depth of sequencing. Subsets of the entire liver dataset (with 41 million mapped unique + splice + multireads) were used to calculate the expression level of genes in four different expression classes to their final expression level. Although the measured expression level of the 211 most highly expressed genes (black and cyan) was effectively unchanged after 8 million mappable reads, the measured expression levels of the other two classes (purple and red) converged more slowly. The fraction of genes for which the measured expression level was within ±5% of the final value is reported. 3 RPKM corresponds to approximately one transcript per cell in liver. The corresponding number of spliced reads in each subset is shown on the top x axis.

RNA-seq provides even more

Figure 4 | Candidate new and revised exons identified by the RNAFAR algorithm. (a) A 40-kb region encompassing the Mef2d gene, which is expressed in adult muscle (28 RPKM in muscle and 45 RPKM in brain), and a neighboring gene that is expressed at a much lower level in brain. RefSeq has only a single annotation for Mef2d, but UCSC has five (labeled α–ε). The α form corresponds to the RefSeq model, and γ is a muscle-specific isoform (ref. 23). The RNAFAR algorithm identified seven regions (red) enriched with reads that fell outside the NCBI gene annotations and were assigned by the algorithm to the Mef2d locus. (b) A 1.5-kb close-up of muscle-specific alternative splicing at the RNAFAR region labeled ‘B’ in panel a. The prevalence of splicing switches from the canonical exon in the brain sample to the RNAFAR exon in the muscle sample, as seen both in the ratio of spliced reads and in the number of reads falling on the two diagnostic exons. (c) The number of expected spliced reads for each gene model was predicted computationally, based on the number of introns and the exonic read density. The predicted number is then plotted against the number of splices observed (R^2 = 0.90). (d) The tissue distribution of genes with two splice isoforms in the same tissue.

Figure 3 Transcript abundance–dependent concordance between RNA-seq and microarray. (a) Root mean squared distance (RMSD in y axis) between pairs of rats for each chemical and averaged over all the chemicals by bins of genes. Expression levels ranged from high (0%) to low (100%) and each bin, A to S, contained 10% of the expressed genes. The analysis was performed on RNA-seq with six pipelines and the microarray with two normalization methods (RMA and MAS5). (b,c) For each chemical, the x axis represents the number of DEGs top ranked by the fold change with P < 0.05 for both platforms with equal numbers of up- and downregulation. The y axis represents the overlap (%) between platforms for a given number of ranked DEGs. Each line on the graph represents the overlap of DEG lists between two platforms for one chemical for above-median expressed genes (b) and below-median expressed genes (c).

Comparison of platforms for detecting gene expression (Affymetrix GeneChip vs Illumina)
Criteria compared:
All protein-coding genes are represented
Can detect all the different types of RNA
Cost
Can determine gene regulation
Requires pre-existing knowledge of gene sequence
As the price of sequencing goes down, there will be almost no advantage of microarrays over RNA-seq.

Mapping reads from RNA molecules
What is the advantage of mapping reads from RNA to the genome sequence instead of to a database of all predicted RNA molecules?
We are not depending on the quality of the annotation.
We are not assuming that we know about all of the RNA molecules in the cell.
How can we find reads mapping to splice junctions?
Create a separate database of all possible splice junctions.
Split reads in half and map the halves separately.
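The "split reads in half" idea can be illustrated with a toy example. This is a deliberately simplified sketch: it uses exact string matching on made-up sequences and assumes the junction falls exactly at the midpoint of the read, whereas real spliced aligners tolerate mismatches and locate the junction anywhere in the read. The function name `map_split_read` and all sequences are hypothetical.

```python
def map_split_read(read, genome):
    """Map the two halves of a read separately by exact matching.

    Returns (left_start, right_start, gap_length), or None if either half
    fails to map. A gap_length > 0 suggests the read spans an intron.
    """
    half = len(read) // 2
    left, right = read[:half], read[half:]

    left_start = genome.find(left)
    if left_start == -1:
        return None
    left_end = left_start + len(left)

    # The right half must map downstream of the left half.
    right_start = genome.find(right, left_end)
    if right_start == -1:
        return None
    return left_start, right_start, right_start - left_end


# Made-up sequences: two exons separated by a 16-base intron.
EXON1 = "ACGTACGGTCA"
INTRON = "GTAAG" + "T" * 9 + "AG"
EXON2 = "CCGATTGCAAT"
genome = EXON1 + INTRON + EXON2

# A read from the spliced mRNA covers the end of exon 1 and start of exon 2.
read = EXON1[-6:] + EXON2[:6]

print(genome.find(read))            # the whole read does not map contiguously
print(map_split_read(read, genome))
```

The whole read fails to align to the genome because the intron is absent from the mRNA, while the two halves map on either side of it and the gap between them marks the candidate junction.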

Bowtie & TopHat

Cufflinks starts with the output of any alignment tool, such as TopHat.

It then assembles the isoforms by first identifying the reads that cannot be assembled together.

Finally, it calculates the abundance of each isoform.

Assembling the reads to identify transcripts.

CuffCompare
The program cuffcompare helps you:
Compare your assembled transcripts to a reference annotation
Track Cufflinks transcripts across multiple experiments (e.g. across a time course)
Output contains codes:
= match
c contained
j new isoform
u unknown, intergenic transcript
i single exon in intron region

Identification of splice junctions depends largely on the depth of sequence coverage.

Cuffdiff can be used to find significant changes in transcript expression, splicing, and promoter use. Inputs are: an annotation to compare (can be output from Cufflinks) and TopHat output from the different samples. Options are similar to those of Cufflinks; a different FDR cutoff can also be specified.

Which comparison is more convincing that genes are different?
Gene A (Comparison A)
Control: Rep1 = 20, Rep2 = 21, Rep3 = 19 (mean 20)
Treatment: Rep1 = 30, Rep2 = 31, Rep3 = 29 (mean 30)
Gene B (Comparison B)
Control: Rep1 = 10, Rep2 = 20, Rep3 = 30 (mean 20)
Treatment: Rep1 = 20, Rep2 = 30, Rep3 = 40 (mean 30)

t test
t = (difference in the means) / (standard error of the difference)
This test statistic can be used to evaluate the probability that the two means are the same, using critical values of t, where you select the probability of making a type I error (e.g., 0.05).
Var = (sum of squares of the differences from the mean) / (n − 1)
Degrees of freedom = n_t + n_c − 2
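To make the statistic concrete, the sketch below computes a pooled-variance (equal-variance) t statistic for Gene A and Gene B from the replicate comparison. The pooled form is an assumption made here because it matches the degrees of freedom n_t + n_c − 2; the function name `pooled_t` and the sign convention (control minus treatment) are also choices made for this illustration.

```python
import math
from statistics import mean, variance


def pooled_t(control, treatment):
    """Two-sample t statistic with pooled variance.

    Returns (t, degrees_of_freedom). statistics.variance() is the sample
    variance: sum of squared differences from the mean divided by n - 1.
    """
    n_c, n_t = len(control), len(treatment)
    df = n_t + n_c - 2
    pooled_var = ((n_c - 1) * variance(control)
                  + (n_t - 1) * variance(treatment)) / df
    std_error = math.sqrt(pooled_var * (1 / n_c + 1 / n_t))
    return (mean(control) - mean(treatment)) / std_error, df


t_a, df_a = pooled_t([20, 21, 19], [30, 31, 29])   # Gene A
t_b, df_b = pooled_t([10, 20, 30], [20, 30, 40])   # Gene B
print(f"Gene A: t = {t_a:.3f}, df = {df_a}")
print(f"Gene B: t = {t_b:.3f}, df = {df_b}")
```

Both genes have the same difference in means (10), but Gene A's replicates agree closely, so its standard error is ten times smaller and its t statistic ten times larger in magnitude than Gene B's.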

Volcano plot: visualizing significance and fold change

Volcano plot: visualizing significance and fold change What can you tell me about this point? Large difference in the mean values, but not significant. Must have high variance in measurements.

Volcano plot: visualizing significance and fold change What can you tell me about this point? Small difference in the mean values, but highly significant. Must have low variance in measurements.
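The two kinds of points discussed on these slides can be placed on a volcano plot with a small helper. This sketch assumes the common convention of log2 fold change on the x axis and −log10 of the p-value on the y axis (the slides do not state the axes); the function name `volcano_point` and the input numbers are hypothetical.

```python
import math


def volcano_point(mean_control, mean_treatment, p_value):
    """Return (log2 fold change, -log10 p-value) for one gene."""
    log2_fold_change = math.log2(mean_treatment / mean_control)
    neg_log10_p = -math.log10(p_value)
    return log2_fold_change, neg_log10_p


# Large difference in means but not significant: far from x = 0, low on y.
noisy_gene = volcano_point(10, 40, 0.30)

# Small difference in means but highly significant: near x = 0, high on y.
tight_gene = volcano_point(100, 110, 0.0001)

print(noisy_gene)
print(tight_gene)
```

Reading a point therefore amounts to comparing its horizontal distance from zero (effect size) with its height (significance); a mismatch between the two reflects the variance of the measurements.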

Assumptions of the t-test
Samples are drawn from normal distributions, i.e. our estimates of geneA and geneB are random samples from a normal distribution.
The variance of the two populations is equal.
There is no mean-variance relationship.

RNA-seq data
Count data (discrete)
Possible to get zero
Cannot get a negative number
Each sequence read is a random event drawn from a larger population.
Variance increases with the mean

RNA-seq data: variance > mean. RNA-seq data are consistent with an over-dispersed Poisson distribution: variance = a*mean, with a > 1.
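The factor a in variance = a*mean can be estimated for a single gene by dividing the sample variance of its replicate counts by their mean. The sketch below is only illustrative: the counts are invented and the function name `dispersion_factor` is an assumption. For a pure Poisson process the ratio would be close to 1.

```python
from statistics import mean, variance


def dispersion_factor(counts):
    """Estimate a in variance = a * mean from replicate read counts."""
    return variance(counts) / mean(counts)


# Hypothetical read counts for one gene across five replicates.
counts = [90, 110, 100, 130, 70]
a = dispersion_factor(counts)
print(f"mean = {mean(counts)}, variance = {variance(counts)}, a = {a}")
```

A value of a well above 1, as in this made-up example, is what is meant by over-dispersion: the replicates vary more than Poisson sampling alone would predict.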

Should we treat a difference between 9 and 12 reads the same as a difference between 900 and 1200?


The t test does not account for the scale of the data: both comparisons give the same result, t = -3.6742, p-value = 0.02131.
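The scale invariance is easy to reproduce. The replicate values below are assumptions (the slides show only the resulting statistics); they were chosen so that the group means are 9 vs 12 and 900 vs 1200. The helper `pooled_t` is a hypothetical pooled-variance t statistic written for this sketch.

```python
import math
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Pooled-variance two-sample t statistic (group_a minus group_b)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_error


# Assumed replicate counts with means 9 vs 12 ...
t_low = pooled_t([8, 9, 10], [11, 12, 13])
# ... and the same values multiplied by 100 (means 900 vs 1200).
t_high = pooled_t([800, 900, 1000], [1100, 1200, 1300])

print(f"{t_low:.4f} {t_high:.4f}")
```

Multiplying every count by 100 scales the difference in means and the standard error by the same factor, so the ratio is unchanged. The t test therefore cannot tell a 3-read difference from a 300-read difference, even though the latter rests on far more sequencing evidence.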

Test using a negative binomial model [glm.nb()]: the two comparisons now give different results, p-value = 0.258 and p-value = 1.03e-05.



RNA-seq pipeline Manpreet S. Katari

The basic workflow
Perform quality control: fastqc
Trim low-quality sequence: trimmomatic
Map the reads to the genome:
Build the database: bowtie2
Run the alignment: tophat
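The workflow can be written down as an ordered list of commands. The sketch below only builds and prints the command lines (a dry run); it does not execute anything. File names, the index name, the output directory, the trimming thresholds and the function name `build_pipeline` are all hypothetical, and it assumes single-end reads and that Trimmomatic is available through a `trimmomatic` wrapper script rather than `java -jar`.

```python
import shlex


def build_pipeline(reads, genome_fasta,
                   index_base="genome_index", out_dir="tophat_out"):
    """Return the workflow steps as a list of command argument lists."""
    trimmed = "trimmed_" + reads
    return [
        # 1. Quality control report for the raw reads.
        ["fastqc", reads],
        # 2. Trim low-quality sequence (single-end mode, example thresholds).
        ["trimmomatic", "SE", reads, trimmed,
         "SLIDINGWINDOW:4:15", "MINLEN:36"],
        # 3a. Build the Bowtie2 index (the "database") from the genome.
        ["bowtie2-build", genome_fasta, index_base],
        # 3b. Run the spliced alignment of the trimmed reads with TopHat.
        ["tophat", "-o", out_dir, index_base, trimmed],
    ]


for command in build_pipeline("sample.fastq", "genome.fa"):
    print(shlex.join(command))
```

Keeping the steps as argument lists makes it straightforward to pass each one to `subprocess.run` later, once the paths and parameters have been checked against the installed tool versions.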