Quantitative analyses using RNA-seq data

Classic quantification of gene expression using RNA-seq
- Mapping: alignment to the genome (HISAT2, STAR)
- Count reads per transcript: read count tables
- Normalization: FPKM, TPM

Normalised expression values: for gene/isoform length (Gene A vs. Gene B)
Gene  Raw reads  Length  Normalised reads
A     10         2       5
B     1 …
[…] a minimal set of paths that cover all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform. Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. Because only the ends of each fragment are sequenced, the length of each may be unknown. Assigning a fragment to different isoforms often implies a different length for it. Cufflinks incorporates the distribution of fragment lengths to help assign fragments to isoforms. It uses a negative binomial model estimated from data to obtain variance estimates from which p-values are computed.

Normalised expression values: for total number of mapped reads (Gene A, condition x vs. condition z)
Condition  Raw reads  Total mapped reads  Normalised reads
x          10         1000                0.01
z          5          500                 0.01

FPKM (Fragments Per Kilobase Million)
Step I: normalize by depth
GENE       REP1  REP2  REP3
A1 (2kb)   10    12    30
A2 (4kb)   20    25    60
A3 (1kb)   5     8     15
A4 (10kb)  0     0     1

FPKM (RPKM)
Step I: normalize by depth
GENE                       REP1  REP2  REP3
A1 (2kb)                   10    12    30
A2 (4kb)                   20    25    60
A3 (1kb)                   5     8     15
A4 (10kb)                  0     0     1
Sum all the counts         35    45    106
Scale by 1M (here, by 10)  3.5   4.5   10.6

FPKM (RPKM)
Step II: divide counts by scaling factor (COUNTS -> FPM)
SCALING FACTOR  3.5   4.5   10.6
GENE            REP1  REP2  REP3
A1 (2kb)        2.86  2.67  2.83
A2 (4kb)        5.71  5.56  5.66
A3 (1kb)        1.43  1.78  1.42
A4 (10kb)       0     0     0.09

FPKM (RPKM)
Step III: divide FPM by gene length in kb (FPM -> FPKM)
GENE       REP1  REP2  REP3
A1 (2kb)   1.43  1.33  1.42
A2 (4kb)   1.43  1.39  1.42
A3 (1kb)   1.43  1.78  1.42
A4 (10kb)  0     0     0.009
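
The three FPKM steps can be reproduced in a few lines of Python. This is a minimal sketch, not code from the presentation: the function name `fpkm` and the `per` parameter are illustrative, the A4 counts (0, 0, 1) are inferred from the column totals 35, 45 and 106, and `per=10` stands in for one million so that the numbers match the toy example.

```python
# Toy counts from the slides: gene -> (length in kb, [REP1, REP2, REP3]).
# Assumption: A4 = (0, 0, 1), inferred from the column totals 35, 45, 106.
GENES = {
    "A1": (2.0, [10, 12, 30]),
    "A2": (4.0, [20, 25, 60]),
    "A3": (1.0, [5, 8, 15]),
    "A4": (10.0, [0, 0, 1]),
}


def fpkm(genes, per=10.0):
    """Normalize by depth first, then by length.

    `per` is 10 to match the toy example; on real data it would be 1e6.
    """
    n_reps = len(next(iter(genes.values()))[1])
    # Step I: total counts per replicate, turned into a scaling factor.
    totals = [sum(counts[r] for _, counts in genes.values()) for r in range(n_reps)]
    scaling = [t / per for t in totals]
    # Step II (counts -> FPM) and Step III (FPM -> FPKM) in one pass.
    return {
        name: [counts[r] / scaling[r] / length_kb for r in range(n_reps)]
        for name, (length_kb, counts) in genes.items()
    }


fpkm_table = fpkm(GENES)
for name, values in fpkm_table.items():
    print(name, [round(v, 3) for v in values])
```

The printed values should agree with the FPKM table to within rounding.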

TPM (Transcripts Per Million)
TPM is similar to FPKM and RPKM, but it is calculated in a different order.
GENE       REP1  REP2  REP3
A1 (2kb)   10    12    30
A2 (4kb)   20    25    60
A3 (1kb)   5     8     15
A4 (10kb)  0     0     1

TPM (Transcripts Per Million)
Step I: normalize by gene length (COUNTS -> FPK)
GENE       REP1  REP2  REP3
A1 (2kb)   5     6     15
A2 (4kb)   5     6.25  15
A3 (1kb)   5     8     15
A4 (10kb)  0     0     0.1

TPM (Transcripts Per Million)
Step II: normalize by sequencing depth
GENE                       REP1  REP2   REP3
A1 (2kb)                   5     6      15
A2 (4kb)                   5     6.25   15
A3 (1kb)                   5     8      15
A4 (10kb)                  0     0      0.1
Sum all the FPKs           15    20.25  45.1
Scale by 1M (here, by 10)  1.5   2.025  4.51

TPM (Transcripts Per Million)
Step II: normalize by sequencing depth (FPK -> TPM)
GENE       REP1  REP2  REP3
A1 (2kb)   3.33  2.96  3.326
A2 (4kb)   3.33  3.09  3.326
A3 (1kb)   3.33  3.95  3.326
A4 (10kb)  0     0     0.02
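
The same toy data can be pushed through the TPM order of operations (length first, then depth). As before, this is an illustrative sketch: `tpm` and `per` are made-up names, A4 = (0, 0, 1) is inferred from the column totals, and `per=10` replaces one million to match the slides.

```python
# Toy counts from the slides: gene -> (length in kb, [REP1, REP2, REP3]).
# Assumption: A4 = (0, 0, 1), inferred from the column totals 35, 45, 106.
GENES = {
    "A1": (2.0, [10, 12, 30]),
    "A2": (4.0, [20, 25, 60]),
    "A3": (1.0, [5, 8, 15]),
    "A4": (10.0, [0, 0, 1]),
}


def tpm(genes, per=10.0):
    """Normalize by length first, then by depth.

    `per` is 10 to match the toy example; on real data it would be 1e6.
    """
    n_reps = len(next(iter(genes.values()))[1])
    # Step I: counts -> FPK (fragments per kilobase).
    fpk = {
        name: [c / length_kb for c in counts]
        for name, (length_kb, counts) in genes.items()
    }
    # Step II: sum the FPK values per replicate and derive the scaling factor.
    totals = [sum(values[r] for values in fpk.values()) for r in range(n_reps)]
    scaling = [t / per for t in totals]
    # FPK -> TPM.
    return {
        name: [values[r] / scaling[r] for r in range(n_reps)]
        for name, values in fpk.items()
    }


tpm_table = tpm(GENES)
for name, values in tpm_table.items():
    print(name, [round(v, 3) for v in values])
```

Because the depth normalization happens last, every replicate column of the result sums to `per`.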

FPKM vs. TPM
FPKM
GENE         REP1  REP2  REP3
A1 (2kb)     1.43  1.33  1.42
A2 (4kb)     1.43  1.39  1.42
A3 (1kb)     1.43  1.78  1.42
A4 (10kb)    0     0     0.009
Column sum   4.29  4.5   4.25
TPM
GENE         REP1  REP2  REP3
A1 (2kb)     3.33  2.96  3.326
A2 (4kb)     3.33  3.09  3.326
A3 (1kb)     3.33  3.95  3.326
A4 (10kb)    0     0     0.02
Column sum   10    10    10
The TPM columns sum to the same value (10) in every replicate; the FPKM columns do not.
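
One way to see the relationship between the two measures is that TPM is FPKM rescaled so that each sample sums to the same total. The sketch below starts from the rounded FPKM values of the toy example (with A2 and A4 cells filled in from the earlier steps, which is an assumption about the original slide), so the output matches the TPM table only to within rounding. The names `rescale_to_tpm` and `PER` are illustrative.

```python
# Rounded FPKM values per replicate, in gene order A1, A2, A3, A4.
# Assumption: repeated and zero cells are filled in from the earlier steps.
FPKM = {
    "REP1": [1.43, 1.43, 1.43, 0.0],
    "REP2": [1.33, 1.39, 1.78, 0.0],
    "REP3": [1.42, 1.42, 1.42, 0.009],
}
PER = 10.0  # toy stand-in for one million


def rescale_to_tpm(column, per=PER):
    """Divide each FPKM value by the column total, then scale to `per`."""
    total = sum(column)
    return [value / total * per for value in column]


for rep, column in FPKM.items():
    tpm_column = rescale_to_tpm(column)
    print(
        rep,
        "FPKM sum:", round(sum(column), 2),
        "TPM sum:", round(sum(tpm_column), 2),
        "TPM:", [round(v, 2) for v in tpm_column],
    )
```

Since every TPM column has the same total, a TPM value can be read as a share of the sample, which is not true of FPKM.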

Defying the paradigm of transcript quantification
- Regular mapping -> quantification
- Quasi-mapping -> quantification: mapping to the transcriptome, simple and fast
- Then differential expression with DESeq2, edgeR, limma or sleuth

Classic quantification of gene expression using RNA-seq vs. Salmon
- Mapping: alignment to the genome (HISAT2, STAR) -> count reads per transcript -> normalization -> read count tables, TPM
- Salmon: quasi-mapping to the transcriptome -> bias correction and quantification -> TPM

Quasi-mapping: let's speed up!
In many cases, not all the information provided by an alignment is necessary. Base-to-base alignment is slow, and to quantify we only need to know the position where each read maps.
Quasi-mapping (RapMap) is faster, and it produces mappings that meet or exceed the accuracy of existing popular aligners.
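
The idea of locating a read without base-to-base alignment can be illustrated with a toy k-mer index. This is only a sketch of the general principle under simplifying assumptions (exact matches only, tiny made-up sequences, hypothetical function names); it is not RapMap's actual data structure or algorithm.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of (transcript name, position) pairs where it occurs."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add((name, pos))
    return index


def map_read(read, index, k):
    """Return the (transcript, start) pairs consistent with every k-mer of the read.

    No alignment is computed: each k-mer votes for candidate start positions
    and only the candidates supported by all k-mers are kept.
    """
    candidates = None
    for offset in range(len(read) - k + 1):
        hits = {
            (name, pos - offset)
            for name, pos in index.get(read[offset:offset + k], ())
        }
        candidates = hits if candidates is None else candidates & hits
        if not candidates:
            return set()
    return candidates or set()


# Hypothetical toy transcriptome and read.
TRANSCRIPTS = {"t1": "ACGTACGGTCA", "t2": "TTACGGTCAGG"}
INDEX = build_kmer_index(TRANSCRIPTS, k=4)
mapping = map_read("ACGGTCA", INDEX, k=4)
print(sorted(mapping))
```

In this example the read is compatible with both transcripts, which is the kind of ambiguity that the quantification step has to resolve.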

RNA-seq biases Love et al. (2016) Nature Biotechnology

Salmon: Accounting for fragment sequence bias Love et al. (2016) Nature Biotechnology [Salmon] “It is the first transcriptome-wide quantifier to correct for fragment GC-content bias” Patro et al. (2017) Nature Methods

Online phase estimates:
- initial expression levels
- auxiliary parameters
- foreground bias models
- equivalence classes over the input fragments
Offline phase:
- refines these expression estimates
Online and offline phases optimize the estimates of transcript abundances.
Online: collapsed variational Bayesian inference
Offline: EM algorithm
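
To make the offline refinement more concrete, here is a plain EM iteration over equivalence classes. It is a textbook-style sketch under stated assumptions, not Salmon's implementation: there is no bias model, the equivalence classes, counts and effective lengths are invented, and the function name `em_abundances` is hypothetical.

```python
def em_abundances(eq_classes, eff_lengths, n_iter=200):
    """Estimate expected fragment counts per transcript with EM.

    eq_classes: {tuple of transcript names: number of fragments in the class}
    eff_lengths: {transcript name: effective length}
    """
    names = sorted(eff_lengths)
    total = sum(eq_classes.values())
    # Start from a uniform split of all fragments.
    alpha = {t: total / len(names) for t in names}
    for _ in range(n_iter):
        new_alpha = {t: 0.0 for t in names}
        for members, count in eq_classes.items():
            # E-step: share the class count in proportion to alpha / effective length.
            weights = {t: alpha[t] / eff_lengths[t] for t in members}
            norm = sum(weights.values())
            if norm == 0.0:
                continue
            # M-step contribution: accumulate the expected counts.
            for t in members:
                new_alpha[t] += count * weights[t] / norm
        alpha = new_alpha
    return alpha


def counts_to_tpm(alpha, eff_lengths):
    """Convert expected counts to TPM using the effective lengths."""
    rates = {t: alpha[t] / eff_lengths[t] for t in alpha}
    total = sum(rates.values())
    return {t: rate / total * 1e6 for t, rate in rates.items()}


# Hypothetical example: 60 fragments are ambiguous between t1 and t2.
EQ_CLASSES = {("t1",): 30, ("t1", "t2"): 60, ("t2",): 10}
EFF_LENGTHS = {"t1": 1000.0, "t2": 1000.0}

alpha = em_abundances(EQ_CLASSES, EFF_LENGTHS)
tpm = counts_to_tpm(alpha, EFF_LENGTHS)
print({t: round(v, 2) for t, v in alpha.items()})
print({t: round(v) for t, v in tpm.items()})
```

Working on equivalence classes rather than individual fragments keeps each iteration cheap, because fragments that map to the same set of transcripts are processed once with their count.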