Introduction to RNA-Seq

Introduction to RNA-Seq Transcriptome profiling with iPlant Jason Williams iPlant / Cold Spring Harbor Laboratory

RNA-Seq in the Discovery Environment This training module is designed to demonstrate a workflow in the iPlant Discovery Environment using RNA-Seq for transcriptome profiling. Question: How can we compare gene expression levels using RNA-Seq data in Arabidopsis WT and hy5 genetic backgrounds?

Scientific Objective LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF). Mutations cause aberrant phenotypes in Arabidopsis morphology, pigmentation and hormonal response. We will use RNA-Seq to compare WT and hy5 to identify HY5-regulated genes. Source: http://www.gla.ac.uk/media/media_73736_en.jpg

Sample Dataset Experimental data downloaded from the NCBI Short Read Archive (GEO:GSM613465 and GEO:GSM613466) Two replicates each of RNA-Seq runs for Wild-type and hy5 mutant seedlings.

RNA-Seq Conceptual Overview This is a quick visual overview of transcriptome profiling via RNA-seq. It does not go into comparisons but we cover that with CuffDiff later. Image source: http://www.bgisequence.com

RNA-seq Sample Read Statistics Genome alignments from TopHat were saved as BAM files, the binary version of SAM (samtools.sourceforge.net/). Reads retained by TopHat are shown below.
Sequence run    WT-1          WT-2          hy5-1         hy5-2
Reads           10,866,702    10,276,268    13,410,011    12,471,462
Seq. (Mbase)    445.5         421.3         549.8         511.3
These are the read counts generated by TopHat as part of its alignment analysis. This is a modestly sized data set by NGS standards; good time to mention scalability, Data Store, etc.
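The Mbase row can be checked against the read counts with simple arithmetic. The sketch below assumes a read length of 41 bases, taken from the `length=41` field in the FASTQ headers of this dataset; the function name `megabases` is made up for illustration.

```python
# Read counts retained by TopHat, copied from the table above.
reads = {
    "WT-1": 10_866_702,
    "WT-2": 10_276_268,
    "hy5-1": 13_410_011,
    "hy5-2": 12_471_462,
}

# Assumption: every read is 41 bases long ("length=41" in the FASTQ headers).
READ_LENGTH = 41


def megabases(read_count, read_length=READ_LENGTH):
    """Total sequenced bases in millions, rounded to one decimal place."""
    return round(read_count * read_length / 1e6, 1)


mbase = {sample: megabases(count) for sample, count in reads.items()}
for sample, value in mbase.items():
    print(f"{sample}: {value} Mbase")
```

Under that assumption the computed values agree with the Seq. (Mbase) row of the table.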

RNA-Seq Data …Now What?
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/9?
@SRR070570.32 HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:3B?
@SRR070570.40 HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:

RNA-Seq Data - FastQ

[Slide graphic: the same FASTQ records as on the previous slide, with a figure labeled "Bioinformatician"]
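To make the FASTQ layout concrete, here is a minimal sketch of reading one record and decoding its quality string. It assumes the common four-lines-per-read layout and the Phred+33 (Sanger) quality offset; `FastqRecord`, `parse_fastq` and `phred_scores` are names made up for this example, not part of any tool used in the workshop.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per read."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string; offset 33 is an assumption (Sanger encoding)."""
    return [ord(char) - offset for char in quality]


# First record from the slide.
EXAMPLE = """@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
"""

record = next(parse_fastq(EXAMPLE))
scores = phred_scores(record.quality)
print(record.name, len(record.sequence), scores[:5], scores[-1])
```

With this offset, the run of `#` characters at the end of the read decodes to a Phred score of 2, which is why the tail of this read is considered very low quality.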

*Graphics taken from these publications

The Tuxedo Protocol *TopHat and Cufflinks require a sequenced genome

The Tuxedo Protocol

Most of RNA-Seq happens before analysis ENCODE Project RNA-Seq Standards

Your RNA-Seq Data
# Align the paired-end reads of each replicate to the genome with TopHat
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
# Assemble transcripts for each replicate with Cufflinks
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
# Merge the assemblies listed in assemblies.txt into one annotation
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
# Test for differential expression between conditions C1 and C2
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,\
./C1_R3_thout/accepted_hits.bam \
./C2_R1_thout/accepted_hits.bam,\
./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
Your transformed RNA-Seq Data

RNA-Seq Analysis Workflow TopHat (Bowtie) → Cufflinks → Cuffmerge → Cuffdiff → CummeRbund. Diagram labels: Your Data, iPlant Data Store, FASTQ, Discovery Environment, Atmosphere.

Moving your data in www.iplantc.org/ds2

Moving your data in iDrop Desktop – Java Program

Moving your data in iCommands

The iPlant Discovery Environment

Data preparation Decompress your data?

Data preparation Pre-process sequences if needed (e.g., Sabre for de-multiplexing reads, and Scythe for removing primer/adapter sequences) Image from: http://www.westburg.eu/lp/rna-seq-library-preparation

Data preparation FastQC – Quality Control http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Data preparation Per Base Sequence Quality (example plots: BAD, GOOD) The central red line is the median value. The yellow box represents the inter-quartile range (25-75%). The upper and lower whiskers represent the 10% and 90% points. The blue line represents the mean quality.
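The statistics drawn in the per-base quality plot can be sketched as follows. This is an illustration only: it assumes Phred+33 quality strings of equal length and uses a simple linear-interpolation percentile, which may differ from the method FastQC itself uses. The function names and the three short quality strings are made up.

```python
from statistics import mean, median


def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of an already sorted, non-empty list."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def per_base_quality(quality_strings, offset=33):
    """Summarize Phred scores position by position across reads."""
    summary = []
    # zip(*...) walks the reads column by column (one column per base position).
    for column in zip(*quality_strings):
        scores = sorted(ord(char) - offset for char in column)
        summary.append({
            "mean": mean(scores),                # blue line
            "median": median(scores),            # central red line
            "q25": percentile(scores, 0.25),     # bottom of the yellow box
            "q75": percentile(scores, 0.75),     # top of the yellow box
            "p10": percentile(scores, 0.10),     # lower whisker
            "p90": percentile(scores, 0.90),     # upper whisker
        })
    return summary


# Hypothetical three-base reads; real reads would come from a FASTQ file.
summary = per_base_quality(["II5", "I?5", "I5#"])
for position, stats in enumerate(summary, start=1):
    print(position, stats)
```

In this toy input the median drops from 40 at the first position to 20 at the last, the same downward trend toward the 3' end that the BAD example plot illustrates.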

Data preparation Per Sequence Quality Scores BAD GOOD Fail: most frequently observed mean quality is below 20 (1% error rate)
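The link between a quality of 20 and a 1% error rate comes from the Phred definition, Q = -10 log10(P). A small sketch of that conversion and of the "most frequently observed mean quality" check is below. It assumes Phred+33 encoding and rounds each read's mean to the nearest integer before counting; the function names and example reads are hypothetical.

```python
from collections import Counter


def error_probability(phred):
    """Probability that a base call is wrong, from Q = -10 * log10(P)."""
    return 10 ** (-phred / 10)


def mean_quality(quality, offset=33):
    """Mean Phred score of one read (offset 33 is an assumption)."""
    scores = [ord(char) - offset for char in quality]
    return sum(scores) / len(scores)


def most_common_mean_quality(quality_strings, offset=33):
    """Most frequently observed per-read mean quality, rounded to an integer."""
    histogram = Counter(round(mean_quality(q, offset)) for q in quality_strings)
    return histogram.most_common(1)[0][0]


# Hypothetical reads: two of high quality, one of low quality.
example = ["IIII", "IIII", "5555"]
mode = most_common_mean_quality(example)
print("most common mean quality:", mode)
print("fails the threshold of 20:", mode < 20)
print("error probability at Q20:", error_probability(20))
```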

Data preparation Per Base N Content BAD GOOD Fail: any position shows an N content of >20%.
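The per-base N content check can be illustrated with a short sketch that computes, for each position, the fraction of reads with an uncalled base and flags positions above the 20% threshold quoted on the slide. The function name and the four toy reads are made up.

```python
def per_base_n_content(sequences):
    """Fraction of reads carrying an N at each position."""
    if not sequences:
        return []
    longest = max(len(seq) for seq in sequences)
    fractions = []
    for position in range(longest):
        # Only reads long enough to cover this position are counted.
        bases = [seq[position] for seq in sequences if len(seq) > position]
        n_count = sum(1 for base in bases if base in ("N", "n"))
        fractions.append(n_count / len(bases))
    return fractions


# Hypothetical reads for illustration.
toy_reads = ["ACGN", "ACNN", "ACGT", "ANGT"]
fractions = per_base_n_content(toy_reads)
failing = [pos for pos, frac in enumerate(fractions) if frac > 0.20]
print(fractions)
print("positions (0-based) above 20% N:", failing)
```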

The iPlant Discovery Environment Sequence Length Distribution GOOD Fail: error if any of the sequences have zero length.

The iPlant Discovery Environment Overrepresented Sequences BAD Fail: module will issue an error if any sequence is found to represent more than 1% of the total
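A simplified version of the overrepresented-sequence check counts identical reads and reports any that exceed 1% of the total. This sketch counts exact duplicates across all reads; FastQC's own sampling and matching strategy is not reproduced here, and the input reads are synthetic.

```python
from collections import Counter
from itertools import product


def overrepresented(sequences, threshold=0.01):
    """Return {sequence: fraction} for sequences above the given fraction."""
    total = len(sequences)
    counts = Counter(sequences)
    return {
        seq: count / total
        for seq, count in counts.items()
        if count / total > threshold
    }


# Synthetic input: 195 distinct 4-mers plus one sequence repeated 5 times.
unique_reads = ["".join(p) for p in product("ACGT", repeat=4)][:195]
toy_reads = ["AAAAAAAA"] * 5 + unique_reads
flagged = overrepresented(toy_reads)
print(flagged)
```

In practice a flagged sequence is often an adapter or an rRNA fragment, which is one reason for the adapter-removal step mentioned under data preparation.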

TopHat Explain reference-sequence-based NGS read alignments. Explain that we are skipping the Cufflinks step because the Arabidopsis transcriptome is so well annotated that we can use the TAIR gene models as our reference transcripts for Cuffdiff.

TopHat TopHat is one of many applications for aligning short sequence reads to a reference genome. It uses the Bowtie aligner internally. Alternatives include BWA, MAQ, OLego, Stampy, Novoalign, etc. Emphasize that the TopHat aligner is one of many choices. Let them know that others are available in the DE and they can also integrate their own if they want to.

TopHat "TopHat has a number of parameters and options, and their default values are tuned for processing mammalian RNA-Seq reads. If you would like to use TopHat for another class of organism, we recommend setting some of the parameters with more strict, conservative values than their defaults. Usually, setting the maximum intron size to 4 or 5 Kb is sufficient to discover most junctions while keeping the number of false positives low." - TopHat User Manual

TopHat outputs in IGV

Assembling the Transcripts

Assembling the Transcripts Provide a mask file (GTF/GFF): "Tells Cufflinks to ignore all reads that could have come from transcripts in this GTF file." Include annotated rRNA, mitochondrial transcripts, and other abundant transcripts you wish to ignore. - Cufflinks User Manual

Assembling the Transcripts
1) transcripts.gtf: This GTF file contains Cufflinks' assembled isoforms. The first 7 columns are standard GTF, and the last column contains attributes, some of which are also standardized ("gene_id" and "transcript_id"). There is one GTF record per row, and each record represents either a transcript or an exon within a transcript.
2) isoforms.fpkm_tracking: This file contains the estimated isoform-level expression values (FPKM).
3) genes.fpkm_tracking: This file contains the estimated gene-level expression values (FPKM).
- Cufflinks User Manual
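As an illustration of the transcripts.gtf layout, the sketch below splits one GTF row into its fields and turns the attributes column into a dictionary. A GTF row has nine tab-separated fields in total, with the attributes last. The example row is hypothetical (made-up coordinates, IDs and FPKM), and `parse_gtf_line` is a name invented for this example.

```python
import re

# Names of the eight fixed GTF fields that precede the attributes column.
GTF_COLUMNS = (
    "seqname", "source", "feature", "start", "end", "score", "strand", "frame",
)


def parse_gtf_line(line):
    """Parse one GTF row into a dict; attributes become a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes look like: key "value"; key "value";
    record["attributes"] = dict(re.findall(r'(\S+) "([^"]*)";', fields[8]))
    return record


# Hypothetical row, for illustration only.
example_line = "\t".join([
    "chr1", "Cufflinks", "transcript", "1000", "2500", "1000", "+", ".",
    'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "12.5";',
])

record = parse_gtf_line(example_line)
print(record["feature"], record["start"], record["end"])
print(record["attributes"])
```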

Assembling the Transcripts - Cufflinks User Manual

Merging the Transcriptomes Cuffmerge is a meta-assembler: assembly of Cufflinks transcripts / reference-based assembly.

Comparing wild-type to hy5 transcriptomes

Comparing wild-type to hy5 transcriptomes Cuffdiff evaluates variation in read counts for each gene across the replicates; this estimate is used to calculate the significance of expression changes. Cuffdiff can identify genes that are differentially spliced or differentially regulated via promoter switching. Isoforms of a gene that have the same TSS (transcription start site) are grouped. The detection rate of differentially expressed genes/transcripts is strongly dependent on sequencing depth.

Comparing wild-type to hy5 transcriptomes Changes in fragment counts ≠ changes in expression. True expression is estimated by the sum of the length-normalized isoform read counts, so the entire transcript must be taken into account.
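A worked example may help show why equal fragment counts do not imply equal expression. The sketch uses the basic FPKM definition (fragments per kilobase of transcript per million mapped fragments); Cuffdiff's actual estimate relies on a statistical model and effective lengths, which are not reproduced here. The isoform lengths and counts are invented.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def gene_fpkm(isoforms, total_mapped_fragments):
    """Gene-level FPKM as the sum of its isoforms' FPKMs.

    `isoforms` is a list of (fragments, length_bp) pairs.
    """
    return sum(fpkm(f, length, total_mapped_fragments) for f, length in isoforms)


TOTAL = 10_000_000  # hypothetical number of mapped fragments per sample

# Hypothetical gene with a 4000 bp isoform and a 1000 bp isoform.
# Both conditions yield 400 fragments for the gene as a whole.
condition_a = [(400, 4000), (0, 1000)]   # all fragments from the long isoform
condition_b = [(0, 4000), (400, 1000)]   # all fragments from the short isoform

expr_a = gene_fpkm(condition_a, TOTAL)
expr_b = gene_fpkm(condition_b, TOTAL)
print(expr_a, expr_b)
```

Here the gene-level fragment count is identical in both conditions, yet the length-normalized expression differs fourfold because the fragments come from isoforms of different lengths.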

Cuffdiff Results
1) FPKM tracking files: Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group (tss_groups.fpkm_tracking tracks the summed FPKM of transcripts sharing a tss_id).
2) Count tracking files: Estimate of the number of fragments that originated from each transcript, primary transcript, and gene in each sample.
3) Read group tracking files: Expression and fragment count for each transcript, primary transcript, and gene in each replicate.
4) Differential expression tests: Tab-delimited files listing the results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences.
Plus several other outputs (differential splicing, CDS, promoter, etc.)
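Filtering a tab-delimited differential expression table can be sketched as below. The column names (`gene_id`, `status`, `log2(fold_change)`, `q_value`, `significant`) are assumed from the usual layout of Cuffdiff's gene-level test output and should be checked against your own file, which also contains further columns. The three rows are invented for illustration and are not results from this dataset.

```python
import csv
import io


def significant_genes(handle, max_q=0.05):
    """Return (gene_id, log2 fold change, q-value) for significant tests."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        if row["status"] != "OK":
            continue  # skip tests that were not run or failed
        q_value = float(row["q_value"])
        if row["significant"] == "yes" and q_value <= max_q:
            hits.append((row["gene_id"], float(row["log2(fold_change)"]), q_value))
    # Most significant first.
    return sorted(hits, key=lambda hit: hit[2])


# Invented example rows; assumed column names.
header = ["gene_id", "status", "log2(fold_change)", "q_value", "significant"]
rows = [
    ["GENE_A", "OK", "2.0", "0.001", "yes"],
    ["GENE_B", "OK", "0.1375", "0.8", "no"],
    ["GENE_C", "NOTEST", "0", "1", "no"],
]
table = "\n".join("\t".join(fields) for fields in [header] + rows) + "\n"

hits = significant_genes(io.StringIO(table))
print(hits)
```

To run this on a real file, pass an open file handle in place of the `io.StringIO` object.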

Differentially expressed genes Example filtered Cuffdiff results generated in the Discovery Environment.

Differentially expressed transcripts Example filtered Cuffdiff results generated in the Discovery Environment.

Density Plot

Scatter Plot

Volcano Plot

Expression Plots