RNA-Seq visualization with CummeRbund

Slides:



Advertisements
Similar presentations
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Advertisements

DEG Mi-kyoung Seo.
© by Pearson Education, Inc. All Rights Reserved.
RNA-seq Analysis in Galaxy
Chapter 7 Managing Data Sources. ASP.NET 2.0, Third Edition2.
Introduction to SPSS Short Courses Last created (Feb, 2008) Kentaka Aruga.
Bacterial Genome Assembly | Victor Jongeneel Radhika S. Khetani
Before we start: Align sequence reads to the reference genome
NGS Analysis Using Galaxy
An Introduction to RNA-Seq Transcriptome Profiling with iPlant
RNA-Seq Visualization
Introduction to RNA-Seq and Transcriptome Analysis
Tutor: Prof. A. Taleb-Bendiab Contact: Telephone: +44 (0) CMPDLLM002 Research Methods Lecture 9: Quantitative.
ASP.NET Programming with C# and SQL Server First Edition
Expression Analysis of RNA-seq Data
Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.
BIF Group Project Group (A)rabidopsis: David Nieuwenhuijse Matthew Price Qianqian Zhang Thijs Slijkhuis Species: C. Elegans Project: Advanced.
RNAseq analyses -- methods
An Introduction to RNA-Seq Transcriptome Profiling with iPlant.
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
Introduction to RNA-Seq
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop RNA-Seq using the Discovery Environment And COGE.
RNA surveillance and degradation: the Yin Yang of RNA RNA Pol II AAAAAAAAAAA AAA production destruction RNA Ribosome.
Introduction to SPSS. Object of the class About the windows in SPSS The basics of managing data files The basic analysis in SPSS.
IPlant Collaborative Discovery Environment RNA-seq Basic Analysis Log in with your iPlant ID; three orange icons.
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
RNA-Seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis is doing the.
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop RNA-Seq visualization with cummeRbund.
Lecture 26: Reusable Methods: Enviable Sloth. Creating Function M-files User defined functions are stored as M- files To use them, they must be in the.
Bioinformatics for biologists Dr. Habil Zare, PhD PI of Oncinfo Lab Assistant Professor, Department of Computer Science Texas State University Presented.
The iPlant Collaborative
An Introduction to RNA-Seq Transcriptome Profiling with iPlant (
Downloading the MAXENT Software
RNA-Seq visualization with CummeRbund
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Introductory RNA-seq Transcriptome Profiling of the hy5 mutation in Arabidopsis thaliana.
Information Retrieval in Practice
Web Database Programming Using PHP
Dive Into® Visual Basic 2010 Express
Statistics Behind Differential Gene Expression
Simon v RNA-Seq Analysis Simon v
Introductory RNA-seq Transcriptome Profiling
RNA Quantitation from RNAseq Data
An Introduction to RNA-Seq Data and Differential Expression Tools in R
WS9: RNA-Seq Analysis with Galaxy (non-model organism )
Working in the Forms Developer Environment
Introduction to Visual Basic 2008 Programming
Getting Started with R.
RNA-Seq analysis in R (Bioconductor)
Data Virtualization Tutorial: XSLT and Streaming Transformations
By Dr. Madhukar H. Dalvi Nagindas Khandwala college
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
How to store and visualize RNA-seq data
Introductory RNA-Seq Transcriptome Profiling
DEPARTMENT OF COMPUTER SCIENCE
Workshop on Microbiome and Health
Kallisto: near-optimal RNA seq quantification tool
Statistical Analysis with Excel
Statistical Analysis with Excel
Digital Image Processing
Communication and Coding Theory Lab(CS491)
Learning to count: quantifying signal
Assessing changes in data – Part 2, Differential Expression with DESeq2
Additional file 2: RNA-Seq data analysis pipeline
Supporting High-Performance Data Processing on Flat-Files
Sequence Analysis - RNA-Seq 2
Transcriptomics – towards RNASeq – part III
Differential Expression of RNA-Seq Data
Presentation transcript:

RNA-Seq visualization with CummeRbund Atmosphere Workflow Joslynn Lee – Data Science Educator Cold Spring Harbor Laboratory, DNA Learning Center jolee@cshl.edu

RNA-Seq Experiment Overview In this example we will compare gene transcript abundance between wild type and mutant Arabidopisis strains. LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF) Mutations in the HY5 gene cause a myriad of aberrant phenotypes (morphology /pigmentation) in Arabidopsis We will use RNA-seq to compare expression levels for genes between WT and hy5- samples.

Useful resources Papers with some CuffDiff results *Graphics taken from these publications

Compare expression analysis using CuffDiff One option using Tuxedo Package Since we show users the Tuxedo Suite, we share visualization of CuffDiff. There are other tools and statistical analyses: cummeRbund was designed to process the multi-file output format for a 'cuffdiff' differential expression analysis. By design, cuffdiff produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes. cuffdiff also performs differential expression tests between supplied conditions.

Workflow of RNA-Seq in CyVerse Using a GUI Tophat (bowtie) Cufflinks Cuffmerge Cuffdiff CummeRbund Your Data CyVerse Data Store Discovery Environment Atmosphere FASTQ

Workflow of RNA-Seq in Atmosphere CummeRbund: visualization package for Cufflinks high-throughput sequencing data Any image w/R can work, and you could also search for an image with cummeRbund installed Description: Allows for persistent storage, access, exploration, and manipulation of Cufflinks high-throughput sequencing data. In addition, provides numerous plotting functions for commonly used visualizations.

cummeRbund is a visualization package for Cufflinks Available via Bioconductor All of the output files are related to each other through their various tracking_ids, but parsing through individual files to query for important result information requires both a good deal of patience and a strong grasp of command-line text manipulation. Enter cummeRbund, an R solution to aggregate, organize, and help visualize this multi-layered dataset promote rapid analysis of RNA-Seq data by aggregating, indexing, and allowing you easily visualize

Installing CummeRbund Available via Bioconductor Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data. CummeRbund is now included as part of the R/Bioconductor project. The links above point to the R packages directly, and every effort should be made to use the version provided by Bioconductor

Connect with VNC Viewer Visualization for Atmosphere Integrated Development Environment (IDE) for R Execute R code directly from the source editor Quickly jump to function definitions Easily manage multiple working directories using projects REDO

Output files of CuffLinks Find in Atmosphere #log into your dnasubway atmosphere account 1. Find cuffdiff output files on Atmosphere desktop Desktop/test_data/iplant_cuffdiff_out This is an abbreviated version of the raw reads we are visualizing. So the data isn’t that significant. Due to the demo process.

Output files of CuffLinks Find in Atmosphere 1. Find cuffdiff output files on Atmosphere desktop Desktop/test_data/iplant_cuffdiff_out 2. Install cummeRbund (once / not on Atmosphere) > source("https://bioconductor.org/biocLite.R") > biocLite("cummeRbund”) TIPS enter the working directory as a character string (enclose it in quotes) the separator between folders is forward slash (/) Use tab for autocomplete Use arrow up to re-enter previous entry

Output files of CuffLinks Find in Atmosphere 1. Find cuffdiff output files on Atmosphere desktop Desktop/test_data/iplant_cuffdiff_out 2. Install cummeRbund (once / not on Atmosphere) > source("https://bioconductor.org/biocLite.R") > biocLite("cummeRbund") 3. Load package using library function > library(cummeRbund) 4. Set working directory to “iplant_cuffdiff_out” > setwd(“/home/dnasubway09/Desktop/test_data/iplant_cuffdiff_out/”) TIPS enter the working directory as a character string (enclose it in quotes) the separator between folders is forward slash (/) Use tab for autocomplete Use arrow up to re-enter previous entry

Load cuffdiff data into the database Import files into R Studio CummeRbund begins by re-organizing output files of a cuffdiff analysis, and storing these data in a local SQLite database. To import your cuffdiff files: > cuff <- readCufflinks(rebuild=TRUE) > cuff <- assignment operator cuff variable, stores value Each R session should begin with a call to readCufflinks so as to initialize the database connection and create an object with the appropriate RSQLite connection information. The base class, cuffSet is a 'pointer' to cuffdiff data that are stored out-of-memory in a sqlite database. - See more at: http://compbio.mit.edu/cummeRbund/manual_2_0.html#tth_sEc2 The runInfo() method can be used to retrieve information about the actual cuffdiff run itself, including command-line arguments used to generate the results files. readCufflinks(): enables you to keep the entire dataset out of memory and significantly improve performance for large cufflinks datasets cuff : reads out the specific feature data genes, isoforms, TSS, CDS…

What do we visualize? CummeRbund was designed to provide analysis and visualization tools analogous to microarray data Global statistics and Quality Control How good is my data? Are my biological replicates equal? Creating gene sets from significantly regulated genes Exploring the relationships between conditions (control/treatment)

Global statistics and Quality Control A good place to begin is to evaluate the quality of the model fitting Mean counts, variance, and dispersion are all emitted, allowing you to visualize the estimated overdispersion for each sample as a quality control measure. Overdispersion is the presence of greater variability in a data set than would be expected based on a given model (in our case extra-Poisson variation with CuffDiff). >disp <- dispersionPlot(genes(cuff)) >disp Counts (x-axis) vs. dispersion (y-axis) “Counts” usually refers to the number of reads that align to a particular feature If you use Poisson model, you will overestimate differential expression A scatter plot comparing the mean counts against the estimated dispersion for a given level of features from a cuffdiff run WT (left) and hy5 (right)

Examples of overdispersed data

Squared Coefficient of Variation FPKM: Fragments per kilobase of transcript per million mapped reads (For paired-end sequencing data) The squared coefficient of variation is a normalized measure of cross-replicate variability that can be useful for evaluating the quality your RNA-seq data > genes.scv <- fpkmSCVPlot(genes(cuff)) > genes.scv Represents the relationship of the standard deviation to the mean Differences in SCV can result in lower numbers of differentially expressed genes due to a higher degree of variability between replicate FPKM estimates

Distribution of FPKM scores To assess the distributions of the genes in the two samples, you can use the csDensity plot. The density plot generated by cummeRbund can be thought of as a ‘smoothed’ representation of a histogram. >dens <- csDensity(genes(cuff)) >dens >densRep <- csDensity(genes(cuff), replicates=T) >densRep WT and hy5 samples

Distribution of FPKM scores Pairwise comparisons between conditions can be made by using a scatterplot. The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. > csScatter(genes(cuff),'WT','hy5',smooth=TRUE) (here we skip naming a variable) These plots can be useful for identifying global changes and trends in gene expression between pairs of conditions. Scatterplot of log10(FPKM) in WT and hy5 sample is shown

Selecting and Filtering Gene Sets Creating gene sets from significantly regulated genes Determine which genes are differentially regulated. cummeRbund makes accessing the results of these significance tests simple via getSig(): Returns the identifiers of significant genes in a vector format. >sig <- getSig(cuff, alpha=0.05, level =‘genes’) #An alpha value by which to filter multiple-testing corrected q-values to determine significance, 0.05 is default >length(sig) #returns the number of genes in the sig object

Selecting and Filtering Gene Sets Using the ‘getGenes’ function # Get the gene information 0.05 is default > sigGenes <- getGenes(cuff,sig)) Plot this in another scatter plot > csScatter(sigGenes, ‘WT’, ‘hy5’) A scatter plot comparing the FPKM values from two samples in a cuffdiff run.

Heatmap: Similar Expression Values Compare your short list of genes to the expression between WT and sample (hy5) There are several plotting functions available for gene-set-level visualization: The csHeatmap() function is a plotting wrapper that takes as input either a CuffGeneSet object (essentially a collection of genes and/or features) and produces a heatmap of FPKM expression values. > sigGenes <- getGenes(cuff, tail(sig,50)) #last 50 genes in the list we created #sig is large number of genes > csHeatmap(sigGenes, cluster=‘both’) WT hy5

Heatmap: Similar Expression Values Use replicates = TRUE for show the replicates > csHeatmap(sigGenes, cluster=‘both’, replicates=‘T’)

Volcano plots: identify those genes that are both significant as well as above a specified threshold of fold change List of many thousands of replicate datapoints between two conditions and one wishes to quickly identify the most-meaningful changes A volcano plot typically plots some measure of effect on the x-axis (typically the fold change negative log) and the statistical significance on the y-axis (typically the -log10 of the p-value). Genes that are highly significant changes appear higher on the plot VolcanoMatirx - Provide an alpha cutoff for visualizing significant genes csVolcano – regular volcano plot > csVolcanoMatrix(genes(cuff),'WT','hy5’)

Help: ask.iplantcollaborative.org Detailed instructions with videos, manuals, documentation in CyVerse Wiki Search by tag