Download presentation
Published byCornelius Fowler Modified over 8 years ago
1
High throughput expression analysis using RNA sequencing (RNAseq)
May 19, 2016
2
Lecture objectives Theory and practice of RNA sequencing (RNA-seq) analysis Rationale for sequencing RNA Challenges specific to RNA-seq Types of libraries for RNA-seq Sequencing coverage (depth) Basic themes of RNA-seq analysis work flows Terminology for RNA-seq Steps in RNAseq analysis Downstream interpretation of expression and differential estimates
3
Gene expression
4
RNA sequencing Isolate RNAs
Generate cDNA, fragment, size select, add linkers Samples of interest Condition 1 (normal colon) Condition 2 (colon tumor) Sequence ends Map to genome, transcriptome, and predicted exon junctions 100s of millions of paired reads 10s of billions bases of sequence Downstream analysis
5
Before you start: What is your experimental question?
Is a global expression experiment the best approach? What is your experimental model? Is global expression technically feasible? What is your budget? $1200 minimum for 3 replicates, 2 conditions on Ion Torrent Proton
6
Common analysis goals of RNA-Seq analysis
Gene expression and differential expression Alternative expression analysis Transcript discovery and annotation Allele specific expression Relating to SNPs or mutations Mutation discovery Fusion detection RNA editing miRNA & other non-coding RNA detection/differences
7
Why sequence RNA (versus DNA)?
Functional studies Genome may be constant but an experimental condition has a pronounced effect on gene expression e.g. Drug treated vs. untreated cell line e.g. Wild type versus knock out mice Some molecular features can only be observed at the RNA level Alternative isoforms, fusion transcripts, RNA editing Predicting all possible transcript sequences from genome sequence is difficult Alternative splicing, RNA editing, etc.
8
Why sequence RNA (versus DNA)?
Interpreting mutations that do not have an obvious effect on protein sequence ‘Regulatory’ mutations that affect what mRNA isoform is expressed and how much e.g. splice sites, promoters, exonic/intronic splicing motifs, etc. Prioritizing protein coding somatic mutations (often heterozygous) If the gene is not expressed, a mutation in that gene would be less interesting If the gene is expressed but only from the wild type allele, this might suggest loss-of-function (haploinsufficiency) If the mutant allele itself is expressed, this might suggest a candidate drug target
9
Challenges Sample Purity?, quantity?, quality? RNAs consist of small exons that may be separated by large introns Mapping reads to genome is challenging The relative abundance of RNAs vary wildly 105 – 107 orders of magnitude Since RNA sequencing works by random sampling, a small fraction of highly expressed genes may consume the majority of reads Ribosomal and mitochondrial genes most highly expressed RNAs come in a wide range of sizes Small RNAs must be captured separately PolyA selection of large RNAs may result in 3‘end bias RNA is fragile compared to DNA (easily degraded)
10
Challenges, cont. Experimental model:
Mammalian, eukaryote, bacteria or virus? Patient derived tissue or cells? Sufficient sample available for enough biological replicates to achieve statistical significance? Three biological replicates for cultured cells Three biological replicates for inbred mice, for each condition Human samples? As many as possible, assuming that you can identify reasonable controls
11
Biological versus technical replicates
12
Variability of RNA-seq libraries
The relative log expression should be constant across the libraries Libraries from different conditions should cluster together Reference: Risso D. et.al. Nature Biotech. (2014) 32:
13
RNA quality is key to success
Garbage in…garbage out Amount of RNA influences which type of library you can construct Quality of the RNA determines if the library construction will work well enough to give you usable data
14
Purity of RNA Pure RNA absorbs strongly at 260 nm
Ratio of absorbance 260/280 used to assess purity Ratio = 2.1 if pure ( acceptable) Ratio depends on pH if there are protein contaminants A260 remains constant, A280 changes if proteins are present Ratio 260/230 should also be close to 2 If <2, proteins, guanidinium isothiocyanate or phenol may be contaminants Always take a scanning measurement ( nm) Reference:
15
RNA purity, cont. The core facility will check your RNA quality using a Agilent Bioanalyzer Lab on a chip to perform capillary electrophoresis with a fluorescent dye that binds to RNA Determines RNA concentration & integrity Works with ng quantities of RNA (50 – 300 ng/ul) QC determined by 28S/18S ratio
16
MW of 28S RNA is ~2.5x more than 18S RNA (mammalian)
So a 28S/18S ratio of 2 is considered ideal
17
Agilent example / interpretation
RNA quality assessed by Agilent Bioanalyzer “RIN” = RNA integrity number 0 (bad) to 10 (good) RIN = 6.0 RIN = 10
18
Enrichment of desired RNA
mRNA makes up 1-3% of total RNA Ribosomal RNA >80%, with majority of that 18S/28S RNA polyA selection Not species specific Works reasonably well for gene expression analyses Ribosomal depletion (Ribo-Minus) Based on oligonucleotide probes that capture rRNA and remove it via binding to streptavadin-coated magnetic beads Species specific, or at least specific to taxonomic groups Necessary if interested in non-coding RNA analysis
19
Other RNA-seq library construction strategies
Size selection (before and/or after cDNA synthesis) Small RNAs (microRNAs) vs. large RNAs? Less bias with RNA fragmentation prior to cDNA synthesis Linear amplification insertion sites (viral, gene therapy) Ref: Nature Protocols (2011) 6:1026 Strand-specific sequencing Ref: Current Genomics (2013) 14:173 Exome capture Cancer Genomes, 1000 genomes, NHLBI genome sequencing Review on exome vs transcriptome for variant detection Exp. Rev. Mol. Diag. (2012) 12:
20
Sequence coverage (depth) needed for RNA-seq
Question being asked of the data Gene expression? Alternative expression? Mutation calling? Tissue type RNA preparation, quality of input RNA, library construction method, etc. Sequencing type read length, paired vs. unpaired, etc. Computational approach and resources Identify publications with similar goals Pilot experiment Work with your Core facility
21
Sequence coverage (depth)
C = LN/G C = coverage L = read length N = number of reads (depends on sequencer; Illumina v3 189 million reads /lane) G = haploid genome length 100 bp reads of human DNA on Illumina (single lane) C = (100 bp)*(189 x 106)/(3 x 109 bp) = 6.3 That is, each base will be sequenced between 6 & 7 times SNP discovery requires 10-30X coverage
22
Sequence coverage, cont.
100 bp reads of human transcriptome on Illumina v3 Human transcriptome: 50 million bp C = (100 bp)*(189 x 106)/(50 x 106 bp) = 378 Only need 30X coverage for gene expression analysis Put 10 RNAseq libraries on a single lane and still get enough coverage for the analysis Use barcodes that distinguish different libraries for analysis
23
Ion Torrent Proton, sequence coverage
Each chip is capable of producing million reads of ~200 bp in length C. neoformans transcriptome is 5 Mb Sequenced 6 libraries on 1 chip C = (200 bp)*(70 x 106)/(5 x 106 bp)*6 = 466 Sample name Total reads Total alignments Aligned Avg. coverage depth Avg. length 0min_1.fastq 10,839,087 10,735,147 99.04% 86.78 90.4 0min_2.fastq 8,665,789 8,570,574 98.90% 47.09 89.02 0min_3.fastq 5,949,041 5,858,474 98.48% 38.23 85.39 15min_1.fastq 12,886,290 12,729,709 98.78% 56.97 77.33 15min_2.fastq 2,828,245 2,767,603 97.86% 24.2 88.37 15min_3.fastq 5,781,444 5,718,533 98.91% 31.76 77.05
24
How many replicates? Statistical considerations suggest that you need at least 3 biological replicates to make valid comparisons More complex samples need more replicates to increase the signal above the biological variability Cost of experiment which is dependent on the depth of sequencing required and thus how many chips (Ion Torrent) or lanes (HiSeq) of sequencing you need Plan this well in advance in consultation with the core personnel Do as many replicates as is feasible and can afford
25
Steps in RNAseq workflow
Obtain raw data from sequencer Align/assemble reads Process alignment to quantify reads/gene feature Conduct differential analysis Summarize and visualize Create gene lists, prioritize candidates for validation, etc. Conduct gene enrichment or pathway analysis
26
Some nomenclature Sequencing file types: Fasta (sequence only) Fastq (contains quality scores) >SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
27
Alignment files SAM – Sequence Alignment/MAP coordinate file
Tab-delimited ASCII file with all of the information needed for the alignment of the reads to reference HWI-ST155_0544:7:2:7658:34048# Chr M * 0 0 GTAGAGGTAGGACCAACAAGGACCAAGTTTCCCTGTTCCAAC ghghhhhgcghh_hhhhhhhhhhhhhhhhghhhhhghhhhhh NM:i:0 NH:i:4 CC:Z:Chr10 CP:i: HWI-ST155_0544:7:61:11040:141129#0 0 Chr M * CTTTTCTGGCGTAACTTGGTTCCCTTTAGTTTGGAACAGATA hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghhchfhhhh NM:i:0 NH:i:3 CC:Z:Chr10 CP:i: HWI-ST155_0544:7:8:11214:130793#0 0 Chr M * CAAATAGTGGTTTGAAACCTATCAATCAAGTCACTTTCAAGT gggggggggggggggggeggdffffggggggggggggdggfc NM:i:0 NH:i:3 CC:Z:Chr10 CP:i: BAM – Binary version of SAM Machine readable, smaller more compact version More commonly used by viewers and analysis programs
28
Reference guided RNAseq alignment
You have a sequenced genome and a file with the gene model annotations Genome reference file is usually a fasta file with each chromosome as a separate entry GTF = gene transcript file GFF = gene feature file (GFF3) The GTF/GFF chromosome name must match the chromosome name in the genome reference file
29
RNAseq alignment
30
Number of reads mapped in the sample
RPKM (FPKM) Reads (fragments) per Kilobase Per Million RPKM = raw number of reads exon length X 1,000,000 Number of reads mapped in the sample In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. However: The number of fragments is biased towards larger genes Total number of fragments is related to total library depth Use of FPKM/RPKM normalizes for gene size and library depth
31
Alternatives to FPKM Raw read counts HTSeq (htseq-count)
Instead of calculating FPKM, simply assign reads/fragments to a defined set of genes/transcripts and determine “raw counts” HTSeq (htseq-count) Python code that converts aligned reads to counts You give it an alignment file and the associated transcript file and it will output a list of counts by feature FeatureCounts (part of Subread package) BAM file + GTF file => # reads/feature
32
How to quantify reads per feature?
33
Counts vs FPKM(RPKM)? Count files can be analyzed by a number of different R programs EdgeR DESeq2 NOISeq More robust statistical methods for differential expression Accommodates more sophisticated experimental designs with appropriate statistical tests
34
Gene expression differences
Once you have your data in RPKM or counts format, you can use a variety of tools to compare the different conditions and determine what genes are differentially expressed In the original study, the analysis was done by GTAC using Cufflinks/CuffDiff In the analysis to generate the lists you are using, I used an updated version of their alignment software, Subread to convert the alignment to counts/gene and DESeq2 for the differential analysis. Analysis was done at Gene rather than transcript level
35
How does cuffdiff work? Model variability in fragment count for each gene across replicates Estimate fragment count for each isoform with a measure of uncertainty from ambiguously mapped reads transcripts with more shared exons and few uniquely assigned fragments will have greater uncertainty Combine estimates of uncertainty and cross-replicate variability under a negative binomial model of fragment count variability to estimate count variances for each transcript These variance estimates are used during statistical testing to report significantly differentially expressed genes and transcripts. A whole lot of fancy statistical modeling -> a list of expression values for each gene & differences between conditions More replicates = more confidence
36
Multiple approaches advisable
37
Multiple testing correction
As more attributes are compared, it becomes more likely that the treatment and control groups will appear to differ on at least one attribute by random chance alone. Well known from array studies 10,000s genes/transcripts 100,000s exons With RNA-seq, even greater problem All the complexity of the transcriptome Almost infinite number of potential features Genes, transcripts, exons, juntions, retained introns, microRNAs, lncRNAs, etc, etc Bioconductor multtest
38
Lessons learned from microarray days
Hansen et al. “Sequencing Technology Does Not Eliminate Biological Variability.” Nature Biotechnology 29, no. 7 (2011): 572–573. Power analysis for RNA-seq experiments RNA-seq need for biological replicates RNA-seq study design
39
Source of gene list Alspach E, Flanagan KC, Luo X, Ruhland MK, Huang H, Pazolli E, Donlin MJ, Marsh T, Piwnica-Worms D, Monahan J, Novack DV, McAllister SS, Stewart SA. “p38MAPK plays a crucial role in stromal-mediated tumorigenesis.” Cancer Discov Jun;4(6):716-29 Performed RNA sequencing on young, senescent and p38MAPK-inhibited senescent human fibroblasts. 6 libraries: 2 young fibroblasts 2 senescent (late) fibroblasts 2 senescent (late) fibroblasts treated with p38MAPK inhibitor SB203580
40
Gene lists Differential expression (DE) was determined for:
senescent vs young (S vs Y) p38MAPK-inhibited senescent vs young (p38I vs Y) DE expressed genes were identified using the DESeq2 package Statistical cut-off was p < 0.01 Genes DE in S vs Y but NOT in p38I vs Y are considered to be p38MAPK dependent
41
Today in lab Explore the gene list using EBI-Biomart Excel tutorial
Use Excel to compare S vs Y to p38K vs Y to identify those genes specific to S vs Y S vs Y p38I vs Y p38MAPK dependent
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.