RNA-Seq
Dr. Christoph W. Sensen and Dr. Jung Soh
Trieste Course 2017
What is RNA-Seq?
- An experimental protocol that uses next-generation sequencing technologies to sequence the messenger RNA molecules within a biological sample in an effort to determine the primary sequence and relative abundance of each mRNA
  Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet. 12(10):671-682
- Also known as “Whole Transcriptome Shotgun Sequencing” (WTSS)
Sequencing strategy
- Combination of ½-plate of 454 and 1 lane of 108PE Illumina sequencing
  - excellent depth and coverage
  - high-quality assemblies
- Submission of total RNA samples
  - improves quality control
  - takes better advantage of sequencing facilities
  - similar overall cost
- 76SE Illumina sequencing on selected species for comparative transcriptomics
[Workflow diagram labels: Metabolite profiling; Plant material; Biochemistry PIs; Total RNA extraction; Bioanalyzer (RNA quality); mRNA isolation; cDNA libraries; Genome Québec Innovation Centre; 454 (1/2-plate); Illumina 1 lane 108PE; Reference transcriptomes (75); repeat sequencing in rare cases of low-quality initial output; Bioinformatics]
RNA-Seq workflow
[Figure: RNA-Seq workflow diagram; visible label: intron]
Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 10(1):57-63.
RNA-Seq vs. microarray
Characteristics                    | RNA-Seq                                             | Microarray
Which transcripts?                 | All in a sample                                     | Only those for which probes are designed
Transcript sequence generation     | Yes                                                 | No
Low-abundance transcript detection |                                                     | Limited
Abundance info source              | Count (of the reads aligned to gene)                | Fluorescence level (of the probe spot for gene)
Resolution                         | Base                                                | Probe sequence
Background noise                   | Low                                                 | High
Additional info                    | Alternative splicing, transcriptome-level variation |
RNA-Seq data analysis
- Map reads
  - Input: lots of short reads, reference genome
  - Output: table of mapped loci per read
- Bin reads to features
  - Input: feature annotation (exons, genes, transcripts)
  - Output: table of counts per feature
  - Usually combined in a tool
- Normalize counts
  - Output: table of normalized quantification values per feature
- Detect differentially expressed (DE) features
  - Output: DE features
Mapping reads
- Need a reference genome
- Issues
  - Huge amounts of data
  - Reads spanning exon junctions
  - Alternative splicing
  - Reads mapping to multiple locations in the genome
- Most common mapping results formats
  - SAM: Sequence Alignment/Map
  - BAM: binary format of SAM
- Many tools
  - Bowtie, SOAP, BWA, SHRiMP, mrFAST, mrsFAST, ZOOM, SSAHA2, Mosaik
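The mapping issues listed above can be seen directly in SAM records. The following is a minimal sketch, not part of the original slides: it parses a tab-separated SAM alignment line and flags unmapped, junction-spanning, and multi-mapped reads. The names `SamRecord`, `parse_sam_line`, `is_unmapped`, `is_spliced`, and `is_multimapped` are hypothetical, the example reads are made up, and the multi-mapping check assumes the aligner writes the optional `NH` tag, which not every mapper does.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise FLAG field
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    tags: dict   # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamRecord:
    """Parse one SAM alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:  # optional fields follow the 11 mandatory ones
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), fields[5], tags)


def is_unmapped(rec: SamRecord) -> bool:
    # FLAG bit 0x4 means the read is unmapped.
    return bool(rec.flag & 0x4)


def is_spliced(rec: SamRecord) -> bool:
    # CIGAR operation N marks a skipped reference region, e.g. an intron.
    return "N" in rec.cigar


def is_multimapped(rec: SamRecord) -> bool:
    # Assumption: the aligner reports NH (number of reported alignments).
    return rec.tags.get("NH", 1) > 1


# Made-up example records for illustration only.
seq, qual = "A" * 36, "I" * 36
lines = [
    "\t".join(["read1", "0", "chr1", "100", "50", "20M500N16M", "*", "0", "0", seq, qual, "NH:i:1"]),
    "\t".join(["read2", "0", "chr1", "900", "1", "36M", "*", "0", "0", seq, qual, "NH:i:3"]),
    "\t".join(["read3", "4", "*", "0", "0", "*", "*", "0", "0", seq, qual]),
]
for rec in map(parse_sam_line, lines):
    print(rec.qname, is_unmapped(rec), is_spliced(rec), is_multimapped(rec))
```

In practice a library that reads BAM directly would be used instead of hand parsing; the sketch only shows where the relevant information sits in each record.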
Bowtie
Binning reads
- Need annotated features
  - Exons, genes, transcripts
- For each feature, the total number of reads mapped is produced
- Not directly comparable across features/samples yet
- Usually followed by normalization
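As an illustration of binning, the sketch below counts how many aligned reads overlap each annotated feature. It is a simplified assumption of how such counting could work, not the method of any specific tool: the function name `count_reads_per_feature`, the coordinate convention (1-based, inclusive), and the example data are all hypothetical, and a read that overlaps two features is counted once for each.

```python
def count_reads_per_feature(alignments, features):
    """Count reads overlapping each feature.

    alignments: iterable of (chrom, start, end) tuples, 1-based inclusive.
    features:   dict mapping feature name -> (chrom, start, end).
    Returns a dict mapping feature name -> read count.
    """
    counts = {name: 0 for name in features}
    for chrom, start, end in alignments:
        for name, (f_chrom, f_start, f_end) in features.items():
            # Two intervals overlap if each starts before the other ends.
            if chrom == f_chrom and start <= f_end and end >= f_start:
                counts[name] += 1
    return counts


# Made-up features and read alignments for illustration only.
features = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}
alignments = [
    ("chr1", 90, 125),    # overlaps the start of geneA
    ("chr1", 300, 335),   # inside geneA
    ("chr1", 790, 825),   # overlaps the start of geneB
    ("chr1", 600, 635),   # between the two genes
    ("chr2", 100, 135),   # different chromosome
]
print(count_reads_per_feature(alignments, features))
```

The nested loop is quadratic and only suitable for small examples; real data would need an indexed interval lookup.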
Normalizing counts
- Why normalize?
  - Longer features have more reads mapped
  - Deeper sequencing produces more reads
- RPKM is most frequently used
  - Reads Per Kilobase per Million reads
  - Defined as RPKM = C / (L × N)
    - C = number of reads mapped to a feature
    - L = length of the feature (in kilobases)
    - N = total number of reads from the sample (in millions)
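The RPKM definition above can be written as a small function. This is a minimal sketch with made-up numbers; the function name `rpkm` and its parameter names are hypothetical, and the inputs are taken in base pairs and raw read counts, then converted to kilobases and millions as the definition requires.

```python
def rpkm(read_count, feature_length_bp, total_reads):
    """Reads Per Kilobase per Million reads: C / (L * N).

    read_count:        C, reads mapped to the feature
    feature_length_bp: feature length in base pairs (converted to kilobases)
    total_reads:       total reads from the sample (converted to millions)
    """
    length_kb = feature_length_bp / 1_000
    total_millions = total_reads / 1_000_000
    return read_count / (length_kb * total_millions)


# Made-up example: a 2 kb gene and a 4 kb gene in a sample of 10 million reads.
# The longer gene collects twice as many reads but has the same RPKM.
print(rpkm(500, 2_000, 10_000_000))    # 500 / (2 * 10) = 25.0
print(rpkm(1_000, 4_000, 10_000_000))  # 1000 / (4 * 10) = 25.0
```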
RPKM examples http://jura.wi.mit.edu/bio/education/hot_topics/RNAseq/RNA_Seq.pdf
Gene model predicted for fungus Trametes versicolor using Augustus and RNA-seq hints
Above is a screenshot of a GBrowse instance for the fungal species Trametes versicolor for the Genozymes project. The project is sequencing both DNA and the transcriptome (RNA-seq), and COE is responsible for annotation. The example shows a gene predicted with the ab initio predictor Augustus (Confident models), using hints from RNA-seq to check the accuracy of the prediction.
- Hints are built from short-read alignments of Illumina RNA-seq spliced reads onto the genome (Mapped Reads)
- Spliced reads show direct evidence of introns (next slide)
- Hints are used with ab initio predictors (Augustus) during the training and prediction stages
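To illustrate how a spliced read gives direct evidence of an intron, the sketch below derives intron coordinates from an alignment's start position and CIGAR string, where the `N` operation marks a skipped reference region. It is an assumption-laden illustration only: the function name `introns_from_alignment` and the example alignments are hypothetical, and the output is not the hints file format that Augustus expects.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_alignment(pos, cigar):
    """Return intron intervals implied by a spliced alignment.

    pos:   1-based leftmost reference position of the alignment
    cigar: CIGAR string, e.g. "20M500N16M"
    Returns a list of (start, end) tuples, 1-based inclusive, one per N op.
    """
    introns = []
    ref = pos  # next reference position to be consumed
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return introns


# Made-up alignments for illustration only.
print(introns_from_alignment(100, "20M500N16M"))               # [(120, 619)]
print(introns_from_alignment(1000, "5S10M100N10M2I5M50N6M"))   # [(1010, 1109), (1125, 1174)]
print(introns_from_alignment(100, "36M"))                      # []
```

Collecting these intervals over all spliced reads, and counting how many reads support each one, gives the kind of intron evidence the slide describes.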
Splice Variants
“non-coding” RNA molecules LincRNA-p21 Tran et al., In press
Example MIRA Assembly
- Contig: T_rep_c1201
- Read members: 96
- Length: 2429 bp
Combined Assembly
- T_rep_c1201 is part of a 6-member contig
- 2 are partial transcripts assembled by PTA
Detecting Differential Expression
- Compare quantification values across samples or across features
- Most tools summarize/normalize counts and suggest DE features
  - Cufflinks/Cuffdiff, R packages (DESeq, edgeR, baySeq, TSPM), SAMtools
- DE features go through similar analysis to microarray data analysis (e.g. validation)
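The idea of comparing quantification values across samples can be sketched with a simple log2 fold change. This is a deliberately naive illustration and not how the tools named above work: they fit statistical models and use replicates, which this sketch ignores. The function names, the pseudocount of 1, the threshold of one log2 unit, and the example values are all assumptions.

```python
import math


def log2_fold_change(value_a, value_b, pseudocount=1.0):
    """log2 of (sample B / sample A), with a pseudocount to avoid division by zero."""
    return math.log2((value_b + pseudocount) / (value_a + pseudocount))


def candidate_de_features(sample_a, sample_b, min_abs_log2fc=1.0):
    """Return features whose absolute log2 fold change reaches the threshold.

    sample_a, sample_b: dicts mapping feature name -> normalized value (e.g. RPKM).
    Only features present in both samples are compared.
    """
    result = {}
    for feature in sample_a.keys() & sample_b.keys():
        lfc = log2_fold_change(sample_a[feature], sample_b[feature])
        if abs(lfc) >= min_abs_log2fc:
            result[feature] = lfc
    return result


# Made-up normalized values for two samples.
sample_a = {"gene1": 7.0, "gene2": 20.0, "gene3": 3.0}
sample_b = {"gene1": 31.0, "gene2": 22.0, "gene3": 0.0}
print(candidate_de_features(sample_a, sample_b))
```

Here gene1 is four-fold higher in sample B and gene3 is four-fold lower, while gene2 barely changes and is not reported. Without replicates and a significance test, such candidates are only a starting point for the validation step mentioned on the slide.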
Cufflinks
Cufflinks Tutorial https://docs.google.com/document/d/1t1gi2Djxd0ykMVe2bF8BVOBsOsPngjFh2999u3rZq-A/edit?hl=en&authkey=CKL1i8sD#
Anaerobic biocorrosion in reactors filled with WP-LS medium
SSV1 Replication Cycle (UV Induced)