High throughput expression analysis using RNA sequencing (RNAseq)

Slides:



Advertisements
Similar presentations
RNA-seq library prep introduction
Advertisements

The Past, Present, and Future of DNA Sequencing
An Introduction to Studying Expression Data Through RNA-seq
Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis Yan Guo.
IMGS 2012 Bioinformatics Workshop: RNA Seq using Galaxy
12/04/2017 RNA seq (I) Edouard Severing.
Processing of miRNA samples and primary data analysis
Simon v2.3 RNA-Seq Analysis Simon v2.3.
Transcriptomics Breakout. Topics Discussed Transcriptomics Applications and Challenges For Each Systems Biology Project –Host and Pathogen Bacteria Viruses.
Peter Tsai Bioinformatics Institute, University of Auckland
DEG Mi-kyoung Seo.
RNA-seq: the future of transcriptomics ……. ?
RNAseq analysis Bioinformatics Analysis Team
RNA Lab (Isolation, quantification and qPCR analysis) MCB7300.
Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Transcriptomics Jim Noonan GENE 760.
MCB Lecture #21 Nov 20/14 Prokaryote RNAseq.
Ribosomal Profiling Data Handling and Analysis
High Throughput Sequencing
mRNA-Seq: methods and applications
Presented by Karen Xu. Introduction Cancer is commonly referred to as the “disease of the genes” Cancer may be favored by genetic predisposition, but.
Diabetes and Endocrinology Research Center The BCM Microarray Core Facility: Closing the Next Generation Gap Alina Raza 1, Mylinh Hoang 1, Gayan De Silva.
NGS Analysis Using Galaxy
Whole Exome Sequencing for Variant Discovery and Prioritisation
Li and Dewey BMC Bioinformatics 2011, 12:323
Expression Analysis of RNA-seq Data
Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING.
Amandine Bemmo 1,2, David Benovoy 2, Jacek Majewski 2 1 Universite de Montreal, 2 McGill university and Genome Quebec innovation centre Analyses of Affymetrix.
RNAseq analyses -- methods
Next Generation Sequencing and its data analysis challenges Background Alignment and Assembly Applications Genome Epigenome Transcriptome.
Next Generation DNA Sequencing
RNA-Seq Analysis Simon V4.1.
Verna Vu & Timothy Abreo
The iPlant Collaborative
RNA surveillance and degradation: the Yin Yang of RNA RNA Pol II AAAAAAAAAAA AAA production destruction RNA Ribosome.
1 Global expression analysis Monday 10/1: Intro* 1 page Project Overview Due Intro to R lab Wednesday 10/3: Stats & FDR - * read the paper! Monday 10/8:
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Introduction to RNAseq
Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.
ANALYSIS OF GENE EXPRESSION DATA. Gene expression data is a high-throughput data type (like DNA and protein sequences) that requires bioinformatic pattern.
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
No reference available
Lecture 12 RNA – seq analysis.
NCode TM miRNA Analysis Platform Identifies Differentially Expressed Novel miRNAs in Adenocarcinoma Using Clinical Human Samples Provided By BioServe.
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.
Canadian Bioinformatics Workshops
Library QA & QC Day 1, Video 3
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
From Reads to Results Exome-seq analysis at CCBR
RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.
Simon v RNA-Seq Analysis Simon v
Canadian Bioinformatics Workshops
RNA Quantitation from RNAseq Data
RNA-Seq for the Next Generation RNA-Seq Intro Slides
Cancer Genomics Core Lab
Dr. Christoph W. Sensen und Dr. Jung Soh Trieste Course 2017
Gene expression from RNA-Seq
RNA-Seq analysis in R (Bioconductor)
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
Canadian Bioinformatics Workshops
Learning to count: quantifying signal
Sequence Analysis - RNA-Seq 2
Sequence Analysis - RNA-Seq 1
RNA-Seq Data Analysis UND Genomics Core.
Presentation transcript:

High throughput expression analysis using RNA sequencing (RNAseq) May 19, 2016

Lecture objectives Theory and practice of RNA sequencing (RNA-seq) analysis Rationale for sequencing RNA Challenges specific to RNA-seq Types of libraries for RNA-seq Sequencing coverage (depth) Basic themes of RNA-seq analysis work flows Terminology for RNA-seq Steps in RNAseq analysis Downstream interpretation of expression and differential estimates

Gene expression

RNA sequencing Isolate RNAs Generate cDNA, fragment, size select, add linkers Samples of interest Condition 1 (normal colon) Condition 2 (colon tumor) Sequence ends Map to genome, transcriptome, and predicted exon junctions 100s of millions of paired reads 10s of billions bases of sequence Downstream analysis

Before you start: What is your experimental question? Is a global expression experiment the best approach? What is your experimental model? Is global expression technically feasible? What is your budget? $1200 minimum for 3 replicates, 2 conditions on Ion Torrent Proton

Common analysis goals of RNA-Seq analysis Gene expression and differential expression Alternative expression analysis Transcript discovery and annotation Allele specific expression Relating to SNPs or mutations Mutation discovery Fusion detection RNA editing miRNA & other non-coding RNA detection/differences

Why sequence RNA (versus DNA)? Functional studies Genome may be constant but an experimental condition has a pronounced effect on gene expression e.g. Drug treated vs. untreated cell line e.g. Wild type versus knock out mice Some molecular features can only be observed at the RNA level Alternative isoforms, fusion transcripts, RNA editing Predicting all possible transcript sequences from genome sequence is difficult Alternative splicing, RNA editing, etc.

Why sequence RNA (versus DNA)? Interpreting mutations that do not have an obvious effect on protein sequence ‘Regulatory’ mutations that affect what mRNA isoform is expressed and how much e.g. splice sites, promoters, exonic/intronic splicing motifs, etc. Prioritizing protein coding somatic mutations (often heterozygous) If the gene is not expressed, a mutation in that gene would be less interesting If the gene is expressed but only from the wild type allele, this might suggest loss-of-function (haploinsufficiency) If the mutant allele itself is expressed, this might suggest a candidate drug target

Challenges Sample Purity?, quantity?, quality? RNAs consist of small exons that may be separated by large introns Mapping reads to genome is challenging The relative abundance of RNAs vary wildly 105 – 107 orders of magnitude Since RNA sequencing works by random sampling, a small fraction of highly expressed genes may consume the majority of reads Ribosomal and mitochondrial genes most highly expressed RNAs come in a wide range of sizes Small RNAs must be captured separately PolyA selection of large RNAs may result in 3‘end bias RNA is fragile compared to DNA (easily degraded)

Challenges, cont. Experimental model: Mammalian, eukaryote, bacteria or virus? Patient derived tissue or cells? Sufficient sample available for enough biological replicates to achieve statistical significance? Three biological replicates for cultured cells Three biological replicates for inbred mice, for each condition Human samples? As many as possible, assuming that you can identify reasonable controls

Biological versus technical replicates

Variability of RNA-seq libraries The relative log expression should be constant across the libraries Libraries from different conditions should cluster together Reference: Risso D. et.al. Nature Biotech. (2014) 32:896-902

RNA quality is key to success Garbage in…garbage out Amount of RNA influences which type of library you can construct Quality of the RNA determines if the library construction will work well enough to give you usable data

Purity of RNA Pure RNA absorbs strongly at 260 nm Ratio of absorbance 260/280 used to assess purity Ratio = 2.1 if pure (1.8-2.0 acceptable) Ratio depends on pH if there are protein contaminants A260 remains constant, A280 changes if proteins are present Ratio 260/230 should also be close to 2 If <2, proteins, guanidinium isothiocyanate or phenol may be contaminants Always take a scanning measurement (220-300 nm) Reference: http://biomedicalgenomics.org/RNA_quality_control.html

RNA purity, cont. The core facility will check your RNA quality using a Agilent Bioanalyzer Lab on a chip to perform capillary electrophoresis with a fluorescent dye that binds to RNA Determines RNA concentration & integrity Works with ng quantities of RNA (50 – 300 ng/ul) QC determined by 28S/18S ratio

MW of 28S RNA is ~2.5x more than 18S RNA (mammalian) So a 28S/18S ratio of 2 is considered ideal

Agilent example / interpretation RNA quality assessed by Agilent Bioanalyzer “RIN” = RNA integrity number 0 (bad) to 10 (good) RIN = 6.0 RIN = 10

Enrichment of desired RNA mRNA makes up 1-3% of total RNA Ribosomal RNA >80%, with majority of that 18S/28S RNA polyA selection Not species specific Works reasonably well for gene expression analyses Ribosomal depletion (Ribo-Minus) Based on oligonucleotide probes that capture rRNA and remove it via binding to streptavadin-coated magnetic beads Species specific, or at least specific to taxonomic groups Necessary if interested in non-coding RNA analysis

Other RNA-seq library construction strategies Size selection (before and/or after cDNA synthesis) Small RNAs (microRNAs) vs. large RNAs? Less bias with RNA fragmentation prior to cDNA synthesis Linear amplification insertion sites (viral, gene therapy) Ref: Nature Protocols (2011) 6:1026 Strand-specific sequencing Ref: Current Genomics (2013) 14:173 Exome capture Cancer Genomes, 1000 genomes, NHLBI genome sequencing Review on exome vs transcriptome for variant detection Exp. Rev. Mol. Diag. (2012) 12:241-251

Sequence coverage (depth) needed for RNA-seq Question being asked of the data Gene expression? Alternative expression? Mutation calling? Tissue type RNA preparation, quality of input RNA, library construction method, etc. Sequencing type read length, paired vs. unpaired, etc. Computational approach and resources Identify publications with similar goals Pilot experiment Work with your Core facility

Sequence coverage (depth) C = LN/G C = coverage L = read length N = number of reads (depends on sequencer; Illumina v3 189 million reads /lane) G = haploid genome length 100 bp reads of human DNA on Illumina (single lane) C = (100 bp)*(189 x 106)/(3 x 109 bp) = 6.3 That is, each base will be sequenced between 6 & 7 times SNP discovery requires 10-30X coverage

Sequence coverage, cont. 100 bp reads of human transcriptome on Illumina v3 Human transcriptome: 50 million bp C = (100 bp)*(189 x 106)/(50 x 106 bp) = 378 Only need 30X coverage for gene expression analysis Put 10 RNAseq libraries on a single lane and still get enough coverage for the analysis Use barcodes that distinguish different libraries for analysis

Ion Torrent Proton, sequence coverage Each chip is capable of producing 60-80 million reads of ~200 bp in length C. neoformans transcriptome is 5 Mb Sequenced 6 libraries on 1 chip C = (200 bp)*(70 x 106)/(5 x 106 bp)*6 = 466 Sample name Total reads Total alignments Aligned Avg. coverage depth Avg. length 0min_1.fastq 10,839,087 10,735,147 99.04% 86.78 90.4 0min_2.fastq 8,665,789 8,570,574 98.90% 47.09 89.02 0min_3.fastq 5,949,041 5,858,474 98.48% 38.23 85.39 15min_1.fastq 12,886,290 12,729,709 98.78% 56.97 77.33 15min_2.fastq 2,828,245 2,767,603 97.86% 24.2 88.37 15min_3.fastq 5,781,444 5,718,533 98.91% 31.76 77.05

How many replicates? Statistical considerations suggest that you need at least 3 biological replicates to make valid comparisons More complex samples need more replicates to increase the signal above the biological variability Cost of experiment which is dependent on the depth of sequencing required and thus how many chips (Ion Torrent) or lanes (HiSeq) of sequencing you need Plan this well in advance in consultation with the core personnel Do as many replicates as is feasible and can afford

Steps in RNAseq workflow Obtain raw data from sequencer Align/assemble reads Process alignment to quantify reads/gene feature Conduct differential analysis Summarize and visualize Create gene lists, prioritize candidates for validation, etc. Conduct gene enrichment or pathway analysis

Some nomenclature Sequencing file types: Fasta (sequence only) Fastq (contains quality scores) >SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Alignment files SAM – Sequence Alignment/MAP coordinate file Tab-delimited ASCII file with all of the information needed for the alignment of the reads to reference HWI-ST155_0544:7:2:7658:34048#0 16 Chr1 2082 1 42M * 0 0 GTAGAGGTAGGACCAACAAGGACCAAGTTTCCCTGTTCCAAC ghghhhhgcghh_hhhhhhhhhhhhhhhhghhhhhghhhhhh NM:i:0 NH:i:4 CC:Z:Chr10 CP:i:1057787 HWI-ST155_0544:7:61:11040:141129#0 0 Chr1 2956 1 42M * 0 0 CTTTTCTGGCGTAACTTGGTTCCCTTTAGTTTGGAACAGATA hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghhchfhhhh NM:i:0 NH:i:3 CC:Z:Chr10 CP:i:1056913 HWI-ST155_0544:7:8:11214:130793#0 0 Chr1 3645 1 42M * 0 0 CAAATAGTGGTTTGAAACCTATCAATCAAGTCACTTTCAAGT gggggggggggggggggeggdffffggggggggggggdggfc NM:i:0 NH:i:3 CC:Z:Chr10 CP:i:1056224 BAM – Binary version of SAM Machine readable, smaller more compact version More commonly used by viewers and analysis programs

Reference guided RNAseq alignment You have a sequenced genome and a file with the gene model annotations Genome reference file is usually a fasta file with each chromosome as a separate entry GTF = gene transcript file GFF = gene feature file (GFF3) The GTF/GFF chromosome name must match the chromosome name in the genome reference file

RNAseq alignment

Number of reads mapped in the sample RPKM (FPKM) Reads (fragments) per Kilobase Per Million RPKM = raw number of reads exon length X 1,000,000 Number of reads mapped in the sample In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. However: The number of fragments is biased towards larger genes Total number of fragments is related to total library depth Use of FPKM/RPKM normalizes for gene size and library depth

Alternatives to FPKM Raw read counts HTSeq (htseq-count) Instead of calculating FPKM, simply assign reads/fragments to a defined set of genes/transcripts and determine “raw counts” HTSeq (htseq-count) Python code that converts aligned reads to counts You give it an alignment file and the associated transcript file and it will output a list of counts by feature FeatureCounts (part of Subread package) BAM file + GTF file => # reads/feature

How to quantify reads per feature?

Counts vs FPKM(RPKM)? Count files can be analyzed by a number of different R programs EdgeR DESeq2 NOISeq More robust statistical methods for differential expression Accommodates more sophisticated experimental designs with appropriate statistical tests

Gene expression differences Once you have your data in RPKM or counts format, you can use a variety of tools to compare the different conditions and determine what genes are differentially expressed In the original study, the analysis was done by GTAC using Cufflinks/CuffDiff In the analysis to generate the lists you are using, I used an updated version of their alignment software, Subread to convert the alignment to counts/gene and DESeq2 for the differential analysis. Analysis was done at Gene rather than transcript level

How does cuffdiff work? Model variability in fragment count for each gene across replicates Estimate fragment count for each isoform with a measure of uncertainty from ambiguously mapped reads transcripts with more shared exons and few uniquely assigned fragments will have greater uncertainty Combine estimates of uncertainty and cross-replicate variability under a negative binomial model of fragment count variability to estimate count variances for each transcript These variance estimates are used during statistical testing to report significantly differentially expressed genes and transcripts. A whole lot of fancy statistical modeling -> a list of expression values for each gene & differences between conditions More replicates = more confidence

Multiple approaches advisable

Multiple testing correction As more attributes are compared, it becomes more likely that the treatment and control groups will appear to differ on at least one attribute by random chance alone. Well known from array studies 10,000s genes/transcripts 100,000s exons With RNA-seq, even greater problem All the complexity of the transcriptome Almost infinite number of potential features Genes, transcripts, exons, juntions, retained introns, microRNAs, lncRNAs, etc, etc Bioconductor multtest http://www.bioconductor.org/packages/release/bioc/html/multtest.html

Lessons learned from microarray days Hansen et al. “Sequencing Technology Does Not Eliminate Biological Variability.” Nature Biotechnology 29, no. 7 (2011): 572–573. Power analysis for RNA-seq experiments http://euler.bc.edu/marthlab/scotty/scotty.php RNA-seq need for biological replicates http://www.biostars.org/p/1161/ RNA-seq study design http://www.biostars.org/p/68885/

Source of gene list Alspach E, Flanagan KC, Luo X, Ruhland MK, Huang H, Pazolli E, Donlin MJ, Marsh T, Piwnica-Worms D, Monahan J, Novack DV, McAllister SS, Stewart SA. “p38MAPK plays a crucial role in stromal-mediated tumorigenesis.” Cancer Discov. 2014 Jun;4(6):716-29 Performed RNA sequencing on young, senescent and p38MAPK-inhibited senescent human fibroblasts. 6 libraries: 2 young fibroblasts 2 senescent (late) fibroblasts 2 senescent (late) fibroblasts treated with p38MAPK inhibitor SB203580

Gene lists Differential expression (DE) was determined for: senescent vs young (S vs Y) p38MAPK-inhibited senescent vs young (p38I vs Y) DE expressed genes were identified using the DESeq2 package Statistical cut-off was p < 0.01 Genes DE in S vs Y but NOT in p38I vs Y are considered to be p38MAPK dependent

Today in lab Explore the gene list using EBI-Biomart Excel tutorial Use Excel to compare S vs Y to p38K vs Y to identify those genes specific to S vs Y S vs Y p38I vs Y p38MAPK dependent