Kallisto: near-optimal RNA seq quantification tool

Kallisto: near-optimal RNA seq quantification tool Discovery Environment

RNA seq Overview
Sequence: cDNAs using sequencing platform.
Analysis: Reads are mapped to reference or transcriptome. Mapped reads are counted per gene or per transcript. Counts are tested statistically for significant differences.

RNA seq analysis pipeline
QC, demultiplex, filter, and trim sequencing reads. Tools: FASTQC, Trimmomatic
Normalize sequencing reads. Tools: Diginorm, Trinity normalization
De novo assembly of transcripts (Tools: Trinity, SOAP-denovo), or map (align) sequencing reads to reference genome or transcriptome (Tools: Tophat, HISAT, STAR)
Annotate transcripts assembled and count mapped reads to estimate transcript abundance. Tools: Cufflinks
Perform statistical analysis to identify differential expression (or differential splicing) among samples or treatments. Tools: Cuffdiff, eXpress, DESeq2

“Alignment free” quantification

Kallisto- near optimal RNA seq quantification tool

Kallisto: Introduction of pseudoalignment instead of alignment (Nicolas Bray, Ph.D. thesis, 2014). RNA-Seq analysis of 30 million reads in 2.5 minutes; 500-1000x faster than previous approaches. Possible thanks to fast hashing techniques and pseudoalignment via the Target de Bruijn Graph. First RNA-Seq analysis approach that is tractable on a laptop while being as accurate as (or more accurate than) existing methods. Speed allows for bootstrapping to obtain uncertainty estimates, thus leading to new methods for differential analysis. https://math.berkeley.edu/~lpachter/group.html

RNA-Seq transcript abundance: Given a set of RNA-seq reads and a reference transcriptome, quantify the proportion of each transcript. RNA-seq reads: assume standard reads, single-end or paired-end. Reference transcriptome: does not require a genome reference; works only with a transcriptome. Proportion: corresponds to TPM (transcripts per million), i.e. "for every 1M transcripts expressed, how many are this one?"
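To make the TPM definition above concrete, here is a minimal sketch of the usual conversion from estimated read counts to TPM. The function name `tpm`, the toy counts and the effective lengths are illustrative assumptions, not values from the slides.

```python
def tpm(est_counts, eff_lengths):
    """Convert estimated read counts to transcripts per million (TPM)."""
    # Reads per base of effective length: a longer transcript yields more
    # reads per molecule, so counts are divided by length first.
    rates = [count / length for count, length in zip(est_counts, eff_lengths)]
    total = sum(rates)
    # Rescale so that the values sum to one million.
    return [rate / total * 1e6 for rate in rates]


# Hypothetical toy data: three transcripts.
counts = [100.0, 300.0, 600.0]
lengths = [1000.0, 1500.0, 3000.0]
values = tpm(counts, lengths)
```

In this toy case the second and third transcripts get the same TPM even though their read counts differ by a factor of two, because the third transcript is twice as long.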

Why Kallisto? Advantages: Pseudoalignment of reads preserves the key information needed for quantification. Blazing fast and accurate

What is pseudoalignment? Given a paired read, from which transcripts could it have originated? This is not nucleotide sequence alignment: it determines, for each read, not where in each transcript it aligns, but rather which transcripts it is compatible with. Pseudoalignments provide the sufficient statistics for the EM algorithm. How fast is pseudoalignment? The quantification of 78.6 million reads takes 14 minutes on a standard desktop using a single CPU core, i.e. roughly 5.6 million reads quantified per minute.

Why Kallisto? Most RNA-seq tools (Cufflinks, RSEM, eXpress, etc.) do RNA-seq analysis in two parts. Alignment: align reads to the transcriptome, or split reads over the genome. Quantification: convert the alignments to abundance metrics (FPKM, RPKM, TPM). There are two clusters of quantification tools, count-based vs. Expectation-Maximization (EM) based; the key difference is how they deal with ambiguous read alignments. Kallisto fuses the two steps: reads are pseudoaligned to the reference transcriptome, and the EM algorithm deconvolutes the pseudoalignments to obtain transcript abundances.
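The EM step can be illustrated with a small sketch. It assumes reads have already been grouped into equivalence classes (sets of compatible transcripts) with a count each. The function name `em_abundances`, the fixed iteration count and the toy data are assumptions made for illustration; this is a textbook-style EM, not kallisto's actual implementation.

```python
def em_abundances(eq_classes, eff_lengths, n_iter=200):
    """Estimate the fraction of reads from each transcript by simple EM.

    eq_classes maps a tuple of transcript indices (one equivalence class)
    to the number of reads pseudoaligned to exactly that set.
    """
    n = len(eff_lengths)
    alpha = [1.0 / n] * n  # start from a uniform guess
    for _ in range(n_iter):
        new = [0.0] * n
        for transcripts, count in eq_classes.items():
            # E-step: split the class count among its transcripts in
            # proportion to current abundance per unit of length.
            weights = [alpha[t] / eff_lengths[t] for t in transcripts]
            denom = sum(weights)
            if denom == 0.0:
                continue
            for t, w in zip(transcripts, weights):
                new[t] += count * w / denom
        # M-step: renormalize the expected counts to proportions.
        total = sum(new)
        alpha = [x / total for x in new]
    return alpha


# Hypothetical toy data: 10 reads unique to transcript 0, 30 unique to
# transcript 1, and 60 ambiguous reads compatible with both.
classes = {(0,): 10, (1,): 30, (0, 1): 60}
alpha = em_abundances(classes, eff_lengths=[1000.0, 1000.0])
```

With equal lengths, the ambiguous reads end up split 1:3, following the evidence from the uniquely assigned reads, so `alpha` converges to about 0.25 and 0.75. A purely count-based tool would have to discard or arbitrarily assign those 60 reads.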

Target de Bruijn Graph (T-DBG) http://arxiv.org/pdf/1505.02710v2.pdf Preprocess the transcriptome to create the T-DBG: create every k-mer in the transcriptome (k=31), build the de Bruijn graph, and color each k-mer by the transcripts it occurs in. Indexing is faster.
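The coloring idea can be sketched as a plain hash map from each k-mer to the set of transcripts that contain it. This is a simplification under stated assumptions: it omits the graph structure itself and reverse-complement handling, uses k=5 instead of 31 so the toy sequences stay readable, and the names `build_kmer_index`, `t1` and `t2` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        # Slide a window of width k across the transcript sequence.
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


# Hypothetical toy transcriptome; real use would take k=31.
toy_transcripts = {"t1": "ACGTACGGA", "t2": "TTACGGACC"}
index = build_kmer_index(toy_transcripts, k=5)
```

Here the k-mers `TACGG` and `ACGGA` carry both colors because they occur in both toy transcripts, while `ACGTA` is colored only by `t1`.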

Target de Bruijn Graph (T-DBG) http://arxiv.org/pdf/1505.02710v2.pdf Use the k-mers in a read to find which transcripts it could have come from. The goal is a pseudoalignment: which transcripts the read (pair) is compatible with, not an alignment of the nucleotide sequences.

Target de Bruijn Graph (T-DBG): Each k-mer appears in a set of transcripts. The intersection of all these sets is the pseudoalignment. Can jump over k-mers in the T-DBG that provide the same information. Jumping provides ~8x speedup over checking all k-mers.
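The intersection step can be sketched as follows. The k-mer-to-transcripts map is written out by hand so the example stands alone. The function name `pseudoalign` and the toy index are assumptions, the choice to ignore k-mers missing from the index is an assumption of this sketch, and the jumping optimization is deliberately left out: every k-mer is checked.

```python
def pseudoalign(read, index, k):
    """Return the set of transcripts compatible with every k-mer of a read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            # Assumption of this sketch: a k-mer absent from the index
            # (for example one with a sequencing error) is ignored.
            continue
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript can explain all k-mers seen so far
    return compatible or set()


# Hypothetical hand-written index for k=5.
toy_index = {
    "ACGTA": {"t1"},
    "TACGG": {"t1", "t2"},
    "ACGGA": {"t1", "t2"},
    "CGGAC": {"t2"},
}
ambiguous = pseudoalign("TACGGA", toy_index, k=5)
unique = pseudoalign("TACGGAC", toy_index, k=5)
```

The first read stays compatible with both toy transcripts, while the extra base in the second read adds the k-mer `CGGAC`, which narrows the intersection to `t2` alone. The first two k-mers return identical sets, which is the kind of redundancy that jumping exploits.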

Performance - Accuracy: Simulated 20 samples of 30M PE reads using the RSEM simulator. Accuracy is measured as relative difference. http://arxiv.org/pdf/1505.02710v2.pdf
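The slide does not spell out the formula, so the sketch below assumes one common symmetric definition of relative difference: the absolute difference divided by the mean of the two values, with 0 when both are zero. It is summarized by the median over transcripts. The function names and numbers are illustrative only.

```python
from statistics import median


def relative_difference(estimate, truth):
    """Symmetric relative difference (assumed definition, see lead-in)."""
    if estimate == 0 and truth == 0:
        return 0.0
    return abs(estimate - truth) / ((estimate + truth) / 2)


def median_relative_difference(estimates, truths):
    """Summarize accuracy over all transcripts with the median."""
    return median(relative_difference(e, t) for e, t in zip(estimates, truths))


# Hypothetical estimated vs. simulated ("true") abundances.
est = [9.0, 11.0, 30.0]
true = [10.0, 10.0, 10.0]
score = median_relative_difference(est, true)
```

Using the median keeps a single badly estimated transcript (the third one here) from dominating the summary.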

Performance - Speed: Total running time for running 20 samples on 20 cores. http://arxiv.org/pdf/1505.02710v2.pdf

Bootstrap: A new statistical feature of Kallisto, possible only because of its speed, is the bootstrap. The result is that we can accurately estimate the uncertainty in abundance estimates.
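A minimal sketch of the bootstrap idea, under stated assumptions: reads are resampled with replacement (done here by resampling equivalence-class counts), the EM is re-run on each resample, and the spread of the estimates is taken as the uncertainty. The compact EM assumes equal transcript lengths, and all names, the seed and the toy counts are hypothetical.

```python
import random
from collections import Counter
from statistics import stdev


def em(eq_counts, n, n_iter=100):
    """Compact EM over equivalence classes (equal lengths assumed)."""
    alpha = [1.0 / n] * n
    for _ in range(n_iter):
        new = [0.0] * n
        for transcripts, count in eq_counts.items():
            denom = sum(alpha[t] for t in transcripts)
            if denom == 0.0:
                continue
            for t in transcripts:
                new[t] += count * alpha[t] / denom
        total = sum(new)
        alpha = [x / total for x in new]
    return alpha


def bootstrap(eq_counts, n, n_boot, seed=0):
    """Re-estimate abundances on resampled reads, n_boot times."""
    rng = random.Random(seed)
    classes = list(eq_counts)
    weights = [eq_counts[c] for c in classes]
    n_reads = sum(weights)
    runs = []
    for _ in range(n_boot):
        # Draw n_reads reads with replacement, tracked per class.
        resampled = Counter(rng.choices(classes, weights=weights, k=n_reads))
        runs.append(em(resampled, n))
    return runs


# Hypothetical toy data: two transcripts, 100 reads in three classes.
toy_classes = {(0,): 10, (1,): 30, (0, 1): 60}
runs = bootstrap(toy_classes, n=2, n_boot=20)
# Standard deviation of each transcript's estimate across resamples.
sds = [stdev([run[t] for run in runs]) for t in range(2)]
```

Each bootstrap round repeats only the cheap EM step rather than the pseudoalignment, which is why a fast quantifier makes many rounds practical.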

Hands on Demo of Kallisto in DE

Detailed instructions with videos, manuals, documentation in Keep asking: ask.iplantcollaborative.org