RNA-Seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

Presentation transcript:

RNA-Seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520 Guest lecture by Wei Li

RNA-seq Protocol Martin and Wang Nat. Rev. Genet. (2011)

RNA-seq https://www.youtube.com/watch?v=V_4n8n5Z6I8 (RNA-Seq using Ion Proton)

Why RNA-seq, not microarray? No need to design microarray probes. Digital representation with a higher detection range. Can detect alternative splicing, fusions, and mutations.

RNA-seq Applications Gene expression; differential expression

RNA-seq Applications Alternative splicing, novel isoforms

RNA-seq Applications Novel genes or transcripts, lncRNA

RNA-seq Applications Detect gene fusions Mutations, RNA editing

RNA-seq Experimental Design and Analysis

Experimental Design Assessing biological variation requires biological replicates (no need for technical replicates). 3 replicates preferred, 2 OK, 1 only for exploratory assays (not good for publications).

Experimental Design For differential expression, don't pool RNA from multiple biological replicates. Batch effects still exist; try to be consistent or process all samples at the same time.

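Batch effect A research group's striking finding in 2014: "The human heart is more similar to the human brain than the mouse brain is", i.e. samples grouped by species rather than by tissue. (Figure: Human Heart, Human Brain, Mouse Brain.)

Figure legend: circles = human tissues; cones = mouse tissues.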

Batch effect Other researchers' response on Twitter

1st batch: human tissues; 2nd batch: human tissues; 3rd batch: mouse tissues; 4th batch: mouse tissues; 5th batch: human/mouse tissues

Batch effect

Batch effect Before experiments: careful design. After experiments: batch effect removal (ComBat).
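The idea of removing a batch effect after the experiment can be illustrated with a minimal sketch. It assumes log-scale expression values for one gene and a known batch label per sample, and it only subtracts each batch's mean. This is much cruder than ComBat's empirical Bayes adjustment and is not ComBat itself; the function name `center_batches` is hypothetical.

```python
from collections import defaultdict

def center_batches(values, batches):
    """Remove per-batch mean shifts for ONE gene (illustrative only, not ComBat).

    values  : log-expression values, one per sample
    batches : batch label of each sample, in the same order as values
    """
    if len(values) != len(batches):
        raise ValueError("values and batches must have the same length")
    overall = sum(values) / len(values)
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_mean = {b: sum(vs) / len(vs) for b, vs in by_batch.items()}
    # Shift every batch so that its mean equals the overall mean.
    return [v - batch_mean[b] + overall for v, b in zip(values, batches)]
```

If batch is fully confounded with the condition of interest (for example, all human samples in one batch and all mouse samples in another), this kind of centering also removes the biological signal, which is why careful design comes first.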

Experimental Design Library type: Ribo-minus (deplete the highly abundant rRNA); PolyA selection (mRNA, enriches for exons); strand-specific (needed for antisense lncRNA). Sequencing: SE is enough for expression; PE (resolves redundancy) for splicing and novel transcripts. Depth: 30-50M reads for differential expression, deeper for transcript assembly. Read length: longer for transcript assembly.

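Alignment Prefer splice-aware aligners such as TopHat and STAR (not DNASTAR). BWA is not splice-aware, so it suits alignment to transcript sequences rather than spliced alignment to the genome. Sometimes the first few bases of each read need to be trimmed.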

Quality Control: RSeQC Read qualities

Quality Control: RSeQC Nucleotide compositions

Quality Control: RSeQC Read count distribution and GC content

Quality Control: RSeQC Read count distributions across genes

Quality Control: RSeQC Insert size distribution and splicing junctions (figure: paired-end read and insert size)

Quality Control: RSeQC

Differential Expression

Differential expression You see that the expression of gene X doubles in condition B compared with condition A. How reliable is it? What's the chance of observing it at random? It all comes down to variation estimation! (Figure: three panels of expression in A vs. B; p-values shown: p=0.001, p=0.27.)

Differential expression Variation can be estimated if you have many biological replicates. But in practice, only 2-3 replicates are available. What to do next? Use proper statistical models.

Sequencing Read Distribution Poisson distribution: # events within an interval; mean = variance. But sequencing data are over-dispersed (variance > mean).
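Over-dispersion can be checked directly on replicate counts. The sketch below assumes the variance model commonly used for RNA-seq counts, variance = mean + dispersion × mean², and estimates the dispersion of one gene by the method of moments. The function name `dispersion_estimate` is hypothetical, and real tools estimate dispersion far more carefully than this.

```python
from statistics import mean, variance

def dispersion_estimate(counts):
    """Method-of-moments dispersion for one gene's replicate counts.

    Assumes variance = mean + dispersion * mean**2.
    Returns 0.0 when the counts are no more variable than Poisson.
    """
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    if m == 0:
        return 0.0
    return max(0.0, (v - m) / (m * m))
```

For example, replicate counts of 10, 20 and 30 have mean 20 and sample variance 100, five times what a Poisson model would allow, giving an estimated dispersion of 0.2.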

Sequencing Read Distribution Negative binomial Def: # of successes before r failures occur, if the probability of each success is p
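Written out under the definition on this slide (X = number of successes before the r-th failure, success probability p), the negative binomial has the following mass function, mean and variance. The symbols X, k and μ are introduced here for illustration.

```latex
\begin{aligned}
P(X = k) &= \binom{k + r - 1}{k}\, p^{k} (1 - p)^{r}, \qquad k = 0, 1, 2, \dots \\
\mu = \mathrm{E}[X] &= \frac{r p}{1 - p} \\
\mathrm{Var}(X) &= \frac{r p}{(1 - p)^{2}} = \mu + \frac{\mu^{2}}{r}
\end{aligned}
```

The variance exceeds the mean by μ²/r, so 1/r plays the role of a dispersion parameter, and the Poisson case (variance = mean) is recovered as r grows large.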

Differential Expression Negative binomial for RNA-seq. Variance estimated by borrowing information from all the genes (hierarchical models). Test whether the mean μi of gene i is the same between conditions, across samples j. FDR? Testing thousands of genes requires multiple-testing correction.

Differential expression edgeR, DESeq/DESeq2
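Because one test is run per gene, the p-values need a multiple-testing correction. A minimal sketch of the Benjamini-Hochberg FDR adjustment is given here as one common choice; the slide does not say which procedure is meant, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted p-value falls below the chosen threshold (say 0.05) are then called differentially expressed, with that threshold as the target false discovery rate.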

Expression Index RPKM (reads per kilobase of transcript per million reads in the library): corrects for sequencing depth and gene length; 1 RPKM ~ 0.3-1 transcript per cell; comparable between different genes within the same dataset; TopHat / Cufflinks. FPKM (fragments per kilobase per million): for PE libraries, where each read pair counts as one fragment, so there are about half as many fragments as reads. TPM (transcripts per million): normalizes to transcript copies instead of reads, because longer transcripts produce more reads; RSEM, HTSeq.
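The two expression indices can be sketched in a few lines, assuming one read count and one length in base pairs per transcript. The function names are hypothetical, and real tools use effective lengths and handle multi-mapped reads, which this ignores.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million reads in the library."""
    total = sum(counts)
    if total == 0:
        raise ValueError("library has no reads")
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (total * length)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    if total_rate == 0:
        raise ValueError("library has no reads")
    return [r * 1e6 / total_rate for r in rates]
```

TPM values always sum to one million within a sample, whereas the sum of RPKM values depends on the length distribution of the expressed transcripts.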

Differential Expression Should we do differential expression on RPKM/FPKM or TPM? Cufflinks: RPKM/FPKM. limma-voom and DESeq: take raw read counts as input rather than TPM. With count-based methods, power to detect DE increases with transcript length, e.g. Gene A (1 kb) vs. Gene B (8 kb). Continued development and updates.

Alternative Splicing Assign reads to splice isoforms (TopHat)

Alternative Splicing Different AS events

Alternative Splicing MATS: Multivariate Analysis of Transcript Splicing

Transcript Assembly Reference-based assembly: Cufflinks. De novo assembly: Trinity.

Transcript Assembly (Cufflinks) 1. Read mapping using TopHat. 2. Construct a graph of reads: "incompatible" fragments (reads) are those that are definitely NOT from the same transcript.

Transcript Assembly (Cufflinks) Incompatible

Transcript Assembly (Cufflinks) 3. Identify the minimum # of paths that cover all reads (each path is one possible transcript). Dilworth's theorem: in a partially ordered set P, the minimum number of chains needed to partition P equals the size of a maximum antichain in P (an antichain is a set of mutually incompatible fragments).
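The minimum number of chains can be computed with bipartite matching: it equals the number of elements minus the size of a maximum matching on the "can be followed by" relation. The sketch below assumes fragments are numbered 0..n-1 and that the relation is supplied already transitively closed. The function name and input format are hypothetical and simplified, not Cufflinks' actual data structures.

```python
def min_chain_cover(n, relation):
    """Minimum number of chains covering n elements of a partial order.

    relation : iterable of (i, j) pairs meaning "i can be followed by j in
               the same chain"; must be a transitively closed partial order.
    """
    succ = [[] for _ in range(n)]
    for i, j in relation:
        succ[i].append(j)
    match_right = [-1] * n  # match_right[j] = element chosen to precede j

    def try_augment(i, seen):
        # Augmenting-path search (Kuhn's algorithm).
        for j in succ[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_right[j] == -1 or try_augment(match_right[j], seen):
                match_right[j] = i
                return True
        return False

    matched = sum(1 for i in range(n) if try_augment(i, set()))
    return n - matched
```

With four fragments where 0 precedes everything, 3 follows everything, and fragments 1 and 2 are mutually incompatible, the answer is 2, matching the maximum antichain {1, 2}.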

Transcript Assembly (Cufflinks) 4. Transcript abundance estimation

Isoform Inference If given known set of isoforms Estimate x to maximize the likelihood of observing n
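A common way to maximize such a likelihood is expectation-maximization. The sketch below is a simplified illustration, not the method of any specific tool. It assumes each read comes with the list of known isoforms it is compatible with, that reads fall uniformly along an isoform, and that every isoform has a known length; all names are hypothetical.

```python
def em_isoform_abundance(read_compat, lengths, n_iter=200):
    """Estimate relative isoform abundances (summing to 1) by EM.

    read_compat : list of lists; read_compat[r] holds the indices of the
                  isoforms that read r is compatible with
    lengths     : length of each isoform
    """
    k = len(lengths)
    n_reads = len(read_compat)
    theta = [1.0 / k] * k  # fraction of reads generated by each isoform
    for _ in range(n_iter):
        expected = [0.0] * k
        for compat in read_compat:
            # E-step: P(read | isoform t) = 1 / length_t under uniform sampling.
            weights = [theta[t] / lengths[t] for t in compat]
            total = sum(weights)
            if total == 0:
                continue
            for t, w in zip(compat, weights):
                expected[t] += w / total
        # M-step: new read fractions are the expected read shares.
        theta = [e / n_reads for e in expected]
    # Convert read fractions to transcript fractions by dividing by length.
    rates = [theta[t] / lengths[t] for t in range(k)]
    norm = sum(rates)
    return [r / norm for r in rates]
```

Reads that are compatible with several isoforms are split in proportion to the current abundance estimates, which is where the uncertainty in isoform-level estimates comes from.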

Known Isoform Abundance Inference

Isoform Inference With a known isoform set, gene-level expression inference is sometimes great even though isoform abundances have big uncertainty (e.g. when the known set is incomplete). De novo isoform inference is a non-identifiable problem if RNA-seq reads are short and the gene is long with too many exons. Algorithm: Trinity.

De-novo transcriptome assembly

De Bruijn graph (1946) Used in the earliest human genome assemblies. Standard algorithm for genome assembly. A sequence of length k (a k-mer) can be represented as an edge between two sequences of length k-1 (its prefix and its suffix).

De Bruijn graph (1946)

De Bruijn graph How to do genome assembly? Sequences as nodes -> traverse all nodes in a graph -> Hamiltonian path problem -> NP-complete problem! De Bruijn graph: sequences as edges -> traverse all edges in a graph -> Eulerian path problem -> polynomial-time algorithm!
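The edge-based formulation can be illustrated with a toy assembler: build the graph with k-mers as edges between (k-1)-mers, then walk an Eulerian path with Hierholzer's algorithm. This sketch assumes error-free reads from a single strand and keeps each distinct k-mer once; real assemblers such as Trinity handle errors, coverage and repeats very differently. All function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers it points to, one edge per distinct k-mer."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    remaining = {u: list(vs) for u, vs in graph.items()}
    if not remaining:
        return []
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, vs in remaining.items():
        balance[u] += len(vs)
        for v in vs:
            balance[v] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((u for u, b in balance.items() if b == 1), next(iter(remaining)))
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path

def assemble(reads, k):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn_edges(reads, k))
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])
```

For example, the overlapping reads "ATGCCG" and "GCCGTA" with k = 4 are spelled back as "ATGCCGTA". When a (k-1)-mer repeats in the sequence, several Eulerian paths can exist and the reconstruction is no longer unique.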

Gene Fusion Seen more often in cancer samples. Still a bit hard to call. TopHat-Fusion in TopHat2. Maher et al., Nature 2009.

Other Applications RNA editing: change in RNA sequence after transcription; most frequent: A to I (behaves like G), C to U; evolved from mononucleotide deaminases; might be involved in RNA degradation. Circular RNA: mostly arises from splicing; varying length, abundance, and stability; possible function: sponge for RBP or miRNA.

Summary RNA-seq design considerations. Read mapping: TopHat, BWA, STAR. De novo transcriptome assembly: Trinity. Quality control: RSeQC. Expression index: FPKM and TPM. Differential expression: Cufflinks (versatile); limma-voom and DESeq (better variance estimates). Alternative splicing: MATS. Gene fusion, RNA editing, circular RNA.

Acknowledgement Alisha Holloway Simon Andrews Radhika Khetani