IMGS 2012 Bioinformatics Workshop: RNA Seq using Galaxy

Typical RNA-Seq Project Work Flow: Tissue Sample -> Total RNA -> mRNA -> cDNA -> Sequencing -> FASTQ file -> QC -> TopHat -> Visualization -> Cufflinks -> Gene/Transcript/Exon Expression -> Statistical Analysis. Focus for today: the computational steps, starting from the FASTQ file. (JAX Computational Sciences Service)

RNASeq Tasks, Tools and File Formats
Quality Control: FastQC (FastQ, SangerFastQ)
Alignment: TopHat (SAM/BAM), viewed in IGV
Summarization: Cufflinks (GTF)
Differential Gene Expression: Cuffdiff, edgeR, DESeq, baySeq

The Galaxy interface has three panels: Tools (left), Dialog/Parameter Selection (center), and History (right).

ftp://ftp.ncbi.nlm.nih.gov/pub/church/GenomeAnalysis/h1-hESC_Sample_Dataset.fastq

Data upload review: Our data are RNA-Seq reads from H1 human embryonic stem cells, generated by the Caltech ENCODE project. They are single-end Illumina reads.

Prior to alignment, perform some quality control (QC) assessments of the data. Here we use FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

FastQC provides a wide range of QC checks. Here we will only look at “Per base sequence quality”.

Sequence quality per base position
Good data: consistent, high quality along the reads.
Bad data: high variance, with quality decreasing along the length of the read.
How to read the plot: the central red line is the median value. The yellow box represents the inter-quartile range (25-75%). The upper and lower whiskers represent the 10% and 90% points. The blue line represents the mean quality.
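The statistics behind this plot can be sketched in a few lines of Python. This is only an illustration, not FastQC's own code: it assumes Sanger-encoded qualities (ASCII offset 33), uses a simple nearest-rank percentile, and the four quality strings are made up for the example.

```python
from statistics import mean, median

def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string into Phred scores (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in quality_line]

def percentile(sorted_values, fraction):
    """Simple nearest-rank percentile on an already sorted list."""
    index = round(fraction * (len(sorted_values) - 1))
    return sorted_values[index]

def per_base_summary(quality_lines):
    """Summarize quality scores column by column (one column per read position)."""
    columns = {}
    for line in quality_lines:
        for pos, score in enumerate(phred_scores(line)):
            columns.setdefault(pos, []).append(score)
    summary = []
    for pos in sorted(columns):
        scores = sorted(columns[pos])
        summary.append({
            "position": pos + 1,               # 1-based position along the read
            "median": median(scores),          # red line in the plot
            "mean": mean(scores),              # blue line
            "q25": percentile(scores, 0.25),   # yellow box, lower edge
            "q75": percentile(scores, 0.75),   # yellow box, upper edge
            "p10": percentile(scores, 0.10),   # lower whisker
            "p90": percentile(scores, 0.90),   # upper whisker
        })
    return summary

# Made-up quality strings: 'I' = Q40, '5' = Q20, '+' = Q10
example_qualities = ["IIIII", "IIII5", "III5+", "II5++"]
summary = per_base_summary(example_qualities)
for row in summary:
    print(row)
```

In this toy input the quality falls off toward the end of the reads, which is the "bad data" pattern described above.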

[Figure: per base sequence quality for our data. x-axis: position along sequencing read; y-axis: quality score.]

Galaxy has several tools for trimming sequences, removing adapters, etc. prior to alignment. Using the information from FastQC, let’s trim our input sequences using an aggregate quality score threshold of 15.
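To make the idea of quality trimming concrete, here is a minimal sketch. It is not the algorithm of the Galaxy trimming tool: it assumes the aggregate is the mean over a small window and trims only from the 3' end. The function name, the window size, and the example read are all made up for illustration.

```python
def trim_3prime(sequence, qualities, threshold=15, window=3):
    """Drop bases from the 3' end until the mean quality of the last
    `window` retained bases is at least `threshold`."""
    end = len(sequence)
    while end > 0:
        start = max(0, end - window)
        window_scores = qualities[start:end]
        if sum(window_scores) / len(window_scores) >= threshold:
            break          # the tail of the read is now good enough
        end -= 1           # otherwise remove one more base and re-check
    return sequence[:end], qualities[:end]

# Hypothetical read whose quality drops toward the 3' end
seq = "ACGTACGT"
quals = [40, 40, 38, 35, 30, 12, 8, 5]

print(trim_3prime(seq, quals, threshold=15, window=1))
print(trim_3prime(seq, quals, threshold=15, window=3))
```

A window of 1 removes every trailing base below the threshold, while a larger window tolerates a few weak bases as long as the average stays above it.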

TopHat (http://tophat.cbcb.umd.edu/) is better suited to aligning RNA Seq data than general-purpose aligners such as Maq or BWA because it takes splicing into account during the alignment process. Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515. Trapnell et al. (2009). Bioinformatics 25:1105-1111.

Setting parameters for TopHat in Galaxy: be sure to use the quality-trimmed sequences!

Does it seem like your Galaxy jobs never finish? Galaxy is increasingly popular, so some of these computationally expensive processes can take a long time to run. Don’t restart your job or you will go to the end of the line! Your job will continue to run on the Galaxy servers even if you shut down your computer.

For now we have pre-computed data to illustrate the main points!

Visualizing alignments in Galaxy: when TopHat finishes, the alignments are available in BAM format.

You can look at the alignments in a variety of browsers. Which browser you choose is a matter of personal preference.

UCSC Browser: the track and its title are created automatically for you by Galaxy. UCSC also has controls to let you display many other kinds of annotations as tracks. Region shown: chr19:2,373,346-2,398,357

Click on an element in the TopHat track to see the details of the alignment…all of this information is stored in that very compact BAM file!!!
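BAM is the compressed binary form of SAM, so the alignment details shown in the browser correspond to the fields of a SAM record. The sketch below parses one SAM line and uses the CIGAR string to recover the aligned blocks. In the SAM format the `N` operation marks a skipped reference region, which is how a spliced alignment spanning an intron is recorded. The record itself (read name, coordinates, sequence) is invented for illustration and is not taken from the workshop data.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_blocks(pos, cigar):
    """Return (blocks, introns) as 1-based inclusive reference intervals."""
    blocks, introns = [], []
    ref = pos
    block_start = None
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in ("M", "=", "X", "D"):
            # These operations consume reference bases inside an aligned block
            if block_start is None:
                block_start = ref
            ref += length
        elif op == "N":
            # Skipped region (intron): close the current block
            if block_start is not None:
                blocks.append((block_start, ref - 1))
                block_start = None
            introns.append((ref, ref + length - 1))
            ref += length
        # I, S, H and P do not consume reference bases
    if block_start is not None:
        blocks.append((block_start, ref - 1))
    return blocks, introns

# Invented SAM record: a 36-base read split across two exons
sam_line = "\t".join([
    "read1", "0", "chr19", "2373400", "50", "20M500N16M",
    "*", "0", "0", "A" * 36, "I" * 36,
])
fields = sam_line.split("\t")
qname, flag, rname, pos, mapq, cigar = fields[:6]
blocks, introns = aligned_blocks(int(pos), cigar)
print(rname, blocks, introns)
```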

Launch IGV (Integrative Genomics Viewer)

Cufflinks (http://cufflinks.cbcb.umd.edu/) assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. Trapnell et al. (2010). Nature Biotechnology 28:511-515.

There are several ways to generate annotation files for Cufflinks to use. Here we will create an annotation file using the UCSC genome browser tool in Galaxy. There are many options for the features to include in the annotation file. Cufflinks expects the annotation in GTF format.
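For orientation, a GTF file is tab-delimited text with nine columns: sequence name, source, feature, start, end, score, strand, frame, and an attributes column of `key "value";` pairs. The sketch below parses one line. The example line, including its `gene_id` and `transcript_id` values, is invented and does not come from the UCSC download.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]

def parse_gtf_line(line):
    """Parse one GTF line into a dict; attributes become a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])   # GTF coordinates are 1-based, inclusive
    record["end"] = int(record["end"])
    attributes = {}
    for item in record["attributes"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record

# Invented example line (identifiers are placeholders)
example = ('chr19\texample_source\texon\t2373400\t2373419\t.\t+\t.\t'
           'gene_id "GENE_A"; transcript_id "GENE_A.1";')
exon = parse_gtf_line(example)
print(exon["seqname"], exon["start"], exon["end"], exon["attributes"])
```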

Once you have selected your annotations…you can send them directly to your history in Galaxy.

Setting parameters for Cufflinks: use the reference annotations you just downloaded…

Example of an RNA Seq data set in NCBI’s Gene Expression Omnibus (GEO). You don’t always need the raw sequences to do RNA Seq; you can start with a SAM or BAM file. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM521256

SAM files need to be converted into BAM format in order to run Cufflinks…. There’s a tool in Galaxy for that!!

Cufflinks output can be downloaded and viewed in Excel.

RPKM vs FPKM
Reads Per Kilobase of transcript per Million mapped reads (RPKM): used for single-end sequencing reads. Count the number of uniquely mappable reads that fall on the set of exons that constitute a gene prediction/model, then normalize by the length of that model and by the total number of mapped reads.
Fragments Per Kilobase of exon per Million fragments mapped (FPKM): used for paired-end sequence data, where the two reads of a pair come from one fragment and are counted once. FPKM is a normalized estimate of the number of fragments per transcript.
How FPKM is obtained here: TopHat aligns reads to the genome. Cufflinks assembles the aligned reads into transcript models. Cufflinks counts the fragments assigned to each transcript model to estimate FPKM. FPKM is used as an indication of expression level for a gene.
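The normalization itself is simple arithmetic: divide the count by the feature length in kilobases and by the number of mapped reads (or fragments) in millions. A minimal sketch follows. The function name and the example numbers are made up, and Cufflinks derives its fragment counts from a statistical model rather than from a plain tally.

```python
def per_kilobase_per_million(count, feature_length_bp, total_mapped):
    """count / (length in kb) / (total mapped in millions).

    Gives RPKM when `count` and `total_mapped` are reads,
    and FPKM when they are fragments.
    """
    return count * 1e9 / (feature_length_bp * total_mapped)

# Hypothetical gene: 500 reads on a 2,000 bp gene model, 10 million mapped reads
example_rpkm = per_kilobase_per_million(500, 2000, 10_000_000)
print(example_rpkm)

# The same reads on a gene model twice as long give half the value
longer_gene_rpkm = per_kilobase_per_million(500, 4000, 10_000_000)
print(longer_gene_rpkm)
```

The second call shows why length normalization matters: equal read counts do not mean equal expression when the gene models differ in length.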

Quantification of gene expression using RNA Seq can be complicated by reads that don’t map uniquely to the genome. RNA Seq by Expectation Maximization (RSEM) takes mapping uncertainty into account when estimating expression levels.
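The expectation-maximization idea can be illustrated with a toy example. This is a deliberately simplified sketch and not RSEM's actual model, which also accounts for transcript length, read quality, and other factors. Each read is represented only by the list of transcripts it is compatible with, and the read sets are invented.

```python
def em_abundances(read_compatibility, n_transcripts, n_iterations=50):
    """Estimate relative transcript abundances from reads that may map to
    several transcripts, by alternating two steps:
      E-step: split each read among its compatible transcripts in
              proportion to the current abundance estimates.
      M-step: re-estimate abundances from the fractional read counts.
    """
    theta = [1.0 / n_transcripts] * n_transcripts   # start from equal abundances
    for _ in range(n_iterations):
        counts = [0.0] * n_transcripts
        for compatible in read_compatibility:
            total = sum(theta[t] for t in compatible)
            for t in compatible:
                counts[t] += theta[t] / total
        theta = [c / len(read_compatibility) for c in counts]
    return theta

# Invented data: 3 reads unique to transcript 0, 1 unique to transcript 1,
# and 4 reads that map equally well to both.
reads = [(0,)] * 3 + [(1,)] + [(0, 1)] * 4
theta = em_abundances(reads, n_transcripts=2)
print(theta)
```

Splitting the four ambiguous reads evenly would give 5/8 and 3/8. The iterative estimate instead assigns more of them to the transcript that the unique reads already support.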

Differential Gene Expression: for RNA Seq data from multiple conditions, Cuffdiff can be used to detect significant differences in transcript expression. Is the abundance of transcripts different between two samples?

Is there a difference in total expression of a given gene due to treatment conditions? Tools: edgeR, DESeq, baySeq. http://www.ijbcb.org/DEB/php/onlinetool.php

Summing Up: Alignments, assemblies, and annotations are essential to using Next Gen sequence data for biological investigation. Know the strengths and weaknesses of each. Have fun, but be careful! Don’t just go along for the ride!

Tutorial Web Site: http://www.ncbi.nlm.nih.gov/staff/church/GenomeAnalysis/index.shtml This site will be accessible after the meeting. Check back for updates and new tutorials.

SEQanswers is a very active public discussion board on sequence analysis issues. http://seqanswers.com/