RNA-Seq Software, Tools, and Workflows

RNA-Seq Software, Tools, and Workflows
Monica Britton, Ph.D. Sr. Bioinformatics Analyst June 2016 Workshop

Some mRNA-Seq Applications
Differential gene expression analysis Transcriptional profiling Identification of splice variants Novel gene identification Transcriptome assembly SNP finding RNA editing Assumption: Changes in transcription/mRNA levels correlate with phenotype (protein expression) Focus: RNA-Seq can be used in many experiments to answer biological questions.

Experimental Design What biological question am I trying to answer?
What types of samples (tissue, timepoints, etc.)? How much sequence do I need? Length of read? Platform? Single-end or paired-end? Barcoding? Pooling? Biological replicates: how many? Technical replicates: how many? Protocol considerations? Focus: Considerations before designing your experiment

Strand-Specific (Directional) RNA-Seq
Now the default for Illumina TruSeq kits Preserves orientation of RNA after reverse transcription to cDNA Informs alignments to genome Determine which genomic DNA strand is transcribed Identify anti-sense transcription (e.g., lncRNAs) Quantify expression levels more precisely Demarcate coding sequences in microbes with overlapping genes Very useful in transcriptome assemblies Allows precise construction of sense and anti-sense transcripts --3’ 5’--

Influence of the Organism
Novel – little/no previous sequencing (may need assembly) Non-Model some sequence available (draft genome or transcriptome assembly) Thousands of scaffolds, maybe tens of chromosomes Some annotation (ab initio, EST-based, etc.) Model – genome fully sequenced and annotated Multiple genomes available for comparison Well-annotated transcriptome based on experimental evidence Genetic maps with markers available Basic research can be conducted to verify annotations (mutants available) Focus: Consider how much annotated sequence is available for your organism, and the heterozygosity/ploidy.

Differential Gene Expression Generalized Workflow
Focus: introduction of the workflow – details on next slides

Read QC and Trimming Scythe Sickle
Stringency of trimming depends on read length and downstream processing. Alignment can often tolerate some amount of contamination. Various sizes of trimmed reads can introduce biases in alignments. Transcriptome assembly works best with aggressive trimming. Focus: introduction of the workflow – details on next slides

QC and Alignment You may want to align your reads to a few different datasets, including rRNA of your organism (or related organisms) Contaminant screening, if desirable Genome or transcriptome to generate counts for differential expression analysis Focus: introduction of the workflow – details on next slides First we’ll go through alignment to a gene set/transcriptome, then we’ll look at aligning to a genome…

Alignment – Choosing an Aligner (Pt. 1 – Gene Set)
Examples of gene sets: rRNA sequences potential contaminant sequences (such as PhiX), a transcriptome (eukaryotic or bacterial), where each sequence in the fasta represents a different transcript (mRNA) De novo transcriptome assembly A splicing-aware aligner is not needed bwa aln (<76 bp), bwa mem (>75 bp), or bowtie2 Requiring strand-specific orientation can be more difficult (requires filtering the bam file).

Library QC – rRNA Contamination
Total RNA sample rRNA mRNA After PolyA isolation After rRNA depletion Total RNA contains >95% rRNA Poly A selected libraries should contain <5% rRNA rRNA-depleted libraries should contain <15% rRNA Align reads to rRNA sequences from organism or relatives Higher than expected or inconsistent rRNA content may indicate degraded RNA or problems in library prep. Generally, don’t need to remove rRNA reads Focus: Need to do standard QA&I plus rRNA contamination check; but sample protocol has influence

Insert Size (determined from gene set alignments)
Adapter ~60 bases cDNA bases =Insert Size The “insert” is the cDNA (or RNA) ligated between the adapters. Typical insert size is bases, but can be larger. Insert size distribution depends on library prep method. Overlapping paired-end reads Focus: The cDNA insert is between the adapters. Showing SE and PE reads in relation to the insert. Gapped paired-end reads Single-end read 20 40 60 80 100 120 140 160 180 200

Paired-End Reads Read 1  Read 1   Read 2 INSERT SIZE (TLEN)
20 40 60 80 100 120 140 160 180 200 Read 1   Read 2 INSERT SIZE (TLEN) INNER DISTANCE (+ or -) Focus: The cDNA insert is between the adapters. Showing SE and PE reads in relation to the insert. 20 40 60 80 100 120 140 160 180 200 Read 1   Read 2

Generating Raw Counts from Gene Set Alignments
The sam/bam file can be filtered for just those pairs where both reads align in concordantly (proper pairs) to one transcript. sam2counts or samtools idxstats can be used for counting. Counts per transcript need to be added up to counts per gene before running stats software Focus: introduction of the workflow – details on next slides

Alignment to a Genome - Choosing a Reference
Human/mouse: GENCODE (uses Ensembl IDs) ( but may need some manipulation to work with certain software Ensembl genomes ( and Biomart ( Illumina igenomes ( sequencing_software/igenome.html) provides genome indexes for some software, and files with extra info for Tophat/cufflinks. NCBI genomes ( Many specialized databases (Phytozome, Patric, VectorBase, FlyBase, WormBase) “Do it yourself” genome assembly and gene-finding (don’t forget functional annotation) Focus: Choice of reference depends on experimental question, if your organism is annotated

Alignment – Choosing an Aligner (Pt. 2 – Genome)
Eukaryotic genes are separated into exons and introns, so if you are using a genome reference, you need an aligner that is either “splicing-aware” or won’t choke on spliced RNA reads. A splicing-aware aligner will recognize the difference between a short insert and a read that aligns across exon-intron boundaries Spliced Reads (joined by dashed line) Read Pairs (joined by solid line)

(Some) Splicing-Aware or Splicing-Agnostic Aligners
Tophat2 (bowtie2) – what we’ll be using in these exercises. (remember, igenomes); also integrates with cufflinks to find novel genes and transcripts HISAT/HISAT2 – successor to tophat2, and also works for alignment of populations (i.e., multiple human samples). I haven’t tried this yet. It’s on our Galaxy … try it out and report what happens. GSNAP – I haven’t used this; can handle (and produce) non-standard file formats. STAR – this is what I use. It’s very fast and doesn’t need a separate counting step. But it does need a lot of memory (>128G RAM to index a mammalian genome). bwa mem – a local aligner, is not splicing-aware, but can handle local alignments (doesn’t require the entire read to align). Focus: re-iterate main points, including file formats, link to stats talk to follow.

Generating Raw Counts from Genome Alignments
Reads aligning to a chromosome or scaffold must be parsed to assign them to a gene. We’ll need a “roadmap” to indicate where each gene (exon) is located in the genome Focus: introduction of the workflow – details on next slides

Generating Raw Counts from Genome Alignments
This roadmap is the GTF (Gene Transfer Format) file: Let’s look at the GTF fields in more detail … The left columns list source, feature type, and genomic coordinates The right column includes attributes, including gene ID, etc.

Fields in the GTF File (left to right)
Sequence Name (i.e., chromosome, scaffold, etc.) chr12 Source (program that generated the gtf file or feature) unknown Feature (i.e., gene, exon, CDS, start codon, stop codon) CDS Start (starting location on sequence) End (end position on sequence) Score Strand (+ or -) Frame (0, 1, or 2: which is first base in codon, zero-based) 2 Attribute (“;”-delimited list of tags with additional info) gene_id "PRMT8"; gene_name "PRMT8"; p_id "P10933"; transcript_id "NM_019854"; tss_id "TSS4368";

An Unusual GTF File

Generating Gene Counts from Genome Alignments
HTSeq-count – runs rather slowly; doesn’t work well on coordinate-sorted bam files. FeatureCounts – Runs faster, and will work on coordinate-sorted bam files. Has more parameters than HTSeq-count. STAR – Will generate counts equivalent to HTSeq-count with default parameters, in the same run as alignment. All of these can utilize strand-specific information in GTF file to assign reads from strand-specific libraries. But what about those reads that can’t be assigned unambiguously to a gene? Focus: re-iterate main points, including file formats, link to stats talk to follow.

Multiple-Mapping Reads (“Multireads”)
Some reads will align to more than one place in the reference, due to: Common domains, gene families Paralogs, pseudogenes, polyploidy etc. This can distort counts, leading to misleading expression levels If a read can’t be uniquely mapped, how should it be counted? Should it be ignored (not counted at all)? Should it be randomly assigned to one location among all the locations to which it aligns equally well? This may depend on the question you’re asking… …and also depends on the alignment and counting software you use.

Reads That Align to Gene Features
This read unambiguously aligns to Gene A This read also uniquely aligns to Gene A This read aligns uniquely, but could be unspliced This spliced read uniquely aligns to Gene A Genes A and B overlap, but the read is uniquely within Gene A Genes A and B overlap, the read is within Gene A, but overlaps Gene B Genes A and B overlap, and the read ambiguously maps to both genes From:

Read Count Definitions
When using tools such as HTSeq-count, FeatureCounts, and STAR, the output table will include counts of reads that can be assigned to each gene and some additional lines for reads that can’t be assigned, including: Ambiguous – read overlaps more than one gene Multiple Mapping – read maps equally well to more than one place in the genome No Feature – read maps uniquely to genome, but in an intergenic (or intronic) region Unmapped – read cannot be mapped to genome (based on run parameters)

Examples From Two Recent Experiments
Genome Human (hg38) Mouse (mm10) Annotation GENCODE Library Prep rRNA depletion polyA selection rRNA 8.8% 5.6% Aligned to Genome 98.6% 98.1% Assigned to Genes (exons) 51.8% 84.1% Ambiguous 2.2% 1.3% Multimapping 12.9% 6.2% No Feature 31.6% 6.5% Unmapped 1.4% 1.9%

How Well Did Your Reads Align to the Reference?
Calculate percentage of reads mapped per sample % Reads Mapped Incomplete Reference? Good Great! Sample Contamination? Focus: It’s important and informative to calculate % reads mapped per sample

Checking Your Results Key genes that may confirm sample ID
Knock-out or knock-down genes Genes identified in previous research Specific genes of interest Hypothesis testing Important pathways Experimental validation (e.g., qRT-PCR) The best way to determine if your analysis protocol accurately models your organism/experiment Ideally, validation should be conducted on a different set of samples Focus: Ways to check results to see if they make sense based on the expectations of your experiment

Bioinformatics analyses are in silico experiments The tools and parameters you choose will be influenced by factors including: Available reference/annotation Experimental design (e.g., pairwise vs. multi-factor) The “right” tools are the ones that best inform on your experiment Don’t just shop for methods that give you the answer you want Focus: introduction of the workflow – details on next slides

File Types Fastq Fasta (reference) gtf (gene features) sam/bam Focus: introduction of the workflow – details on next slides Text tables

Today’s Exercises – Differential Gene Expression
Today, we’ll analyze a few types of data you’ll typically encounter: Single-end reads (“tag counting” with well-annotated genomes) using RNA-Seq data from the same experiment as yesterday’s ChIP-Seq data. Paired-end reads (finding novel transcripts in a genome with incomplete annotation) Paired-end reads for differential gene expression when only a transcriptome is available (such as after a transcriptome assembly)

Today’s Exercises – Differential Gene Expression
And we’ll be using a few different software: Tophat to align spliced reads to a genome FeatureCounts to generate raw counts tables for… edgeR, for differential expression of genes Cufflinks to find novel genes and transcripts Bowtie2 to align reads to a transcriptome reference sam2counts to extract raw counts from bowtie2 alignments

Gene Construction

Alignment and Differential Expression
Read set(s) TopHat bam file(s) We will follow these steps in the first exercise. Existing annotation (GTF) Toptables, etc. edgeR/limma FeatureCounts

But, do we have all the genes?
For organisms with genomes, gene models are stored in gtf files Assumptions: The gtf file contains annotation for ALL transcripts and genes All splice sites, start/stop codons, etc. are correct Are these assumptions correct for every sequenced organism? RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation Method used depends on how much sequence information there is for the organism…

Gene Construction (Alignment) vs. Assembly
Trinity software Novel or Non-Model Organisms Genome-Sequenced Organisms Haas and Zody (2010) Nat. Biotech. 28:421-3

Gene / Transcriptome Construction
Annotation can be improved – even for well-annotated model organisms Identify all expressed exons Combine expressed exons into genes Find all splice variants for a gene Discover novel transcripts For newly sequenced organisms Validate ab initio annotation Comparison between different annotation sets Can assist in finding some types of contamination Reconstruction of rRNA genes Genomic/mitochondrial DNA in RNA library preps. Focus: Annotation of genomes can be improved with transcriptome assembly

Reference Annotation Based Transcript (RABT) Assembly
Read set(s) TopHat bam file(s) Read-set specific GTF(s) Cufflinks Existing annotation (GTF) [optional] Cuffmerge Merged GTF Final assembly (GTF and stats) Cuffcompare Toptables, etc. edgeR/limma FeatureCounts

A Few Words About Bacterial RNA-Seq

Eukaryotic and Bacterial Gene Structures are Different
Eukaryotes Gene structure includes introns and exons Splicing, poly-adenylation Each mRNA is a discrete molecule when translated Bacteria / Prokaryotes Individual genes and groups of genes in operons Generally, no splicing, no polyA One mRNA can contain coding sequences for multiple proteins

Bacterial RNA-Seq Considerations
rRNA depletion strategies may leave considerable amounts of non-coding RNA molecules Splicing-aware aligners (such as Tophat) may not be useful Reads from polycistronic mRNA may overlap two genes How would HTSeq-Count or FeatureCounts handle this? Compare alignments to the genome to alignments to transcriptome. Some aligners, such as bwa-mem, will report secondary alignments Transcriptome alignments can be used to generate counts table for edgeR Specialized software, such as Rockhopper (stand-alone,

RNA-Seq Software, Tools, and Workflows

Similar presentations

Presentation on theme: "RNA-Seq Software, Tools, and Workflows"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

RNA-Seq Software, Tools, and Workflows

Similar presentations

Presentation on theme: "RNA-Seq Software, Tools, and Workflows"— Presentation transcript:

Similar presentations

About project

Feedback