RNA-Seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520


1 RNA-Seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520
Guest lecture by Wei Li

2 RNA-seq Protocol Martin and Wang Nat. Rev. Genet. (2011)

3 RNA-seq https://www.youtube.com/watch?v=V_4n8n5Z6I8
(RNA-Seq using Ion Proton)

4 Why RNA-seq, not microarray?
No need to design microarray probes
Digital representation, higher detection range
Alternative splicing
Fusion
Mutations

5 RNA-seq Applications Gene expression; differential expression

6 RNA-seq Applications Alternative splicing, novel isoforms

7 RNA-seq Applications Novel genes or transcripts, lncRNA

8 RNA-seq Applications Detect gene fusions, mutations, RNA editing

9 RNA-seq Experimental Design and Analysis

10 Experimental Design
Assessing biological variation requires biological replicates (no need for technical replicates)
3 preferred, 2 OK, 1 only for exploratory assays (not good for publications)

11 Experimental Design
For differential expression, don’t pool RNA from multiple biological replicates
Batch effects still exist; try to be consistent or process all samples at the same time

12 Batch effect
A research group’s striking finding in 2014:
“Human heart is more similar to human brain than mouse brain”
[Figure labels: Human Heart, Mouse Brain, Human Brain]

13 Circles: human tissues
Cones: mouse tissues

14 Batch effect
Other researchers’ response on Twitter

15 1st batch: human tissues
2nd batch: human tissues
3rd batch: mouse tissues
4th batch: mouse tissues
5th batch: human/mouse tissues

16 Batch effect

17 Batch effect
Before experiments: careful design
After experiments: batch effect removal (ComBat)
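To make "batch effect removal" concrete, here is a minimal sketch of the simplest possible correction: per-gene mean centering within each batch. This is not ComBat, which fits an empirical Bayes model; the function name `center_batches` and the toy values are assumptions for illustration. It is only reasonable when conditions are balanced across batches, which is exactly what the human/mouse design above lacked.

```python
from collections import defaultdict
from statistics import mean

def center_batches(values, batches):
    """Remove additive batch shifts from one gene's (log) expression values.

    values  -- expression of a single gene, one number per sample
    batches -- batch label of each sample, in the same order as values
    Each batch is shifted so that its mean equals the overall mean.
    """
    overall = mean(values)
    by_batch = defaultdict(list)
    for v, b in zip(values, batches):
        by_batch[b].append(v)
    batch_mean = {b: mean(vs) for b, vs in by_batch.items()}
    return [v - batch_mean[b] + overall for v, b in zip(values, batches)]

# Toy gene: batch "b" reads 4 units higher than batch "a".
corrected = center_batches([1.0, 3.0, 5.0, 7.0], ["a", "a", "b", "b"])
```

If batch and condition are fully confounded (each species sequenced in its own batches), this kind of centering removes the biological signal together with the batch shift.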

18 Experimental Design
Ribo-minus (remove overly abundant rRNA)
PolyA (mRNA, enrich for exons)
Strand specific (anti-sense lncRNA)
Sequencing: SE is enough for expression; PE (resolves redundancy) for splicing, novel transcripts
Depth: 30-50M reads for differential expression, deeper for transcript assembly
Read length: longer for transcript assembly

19 Alignment
Prefer splice-aware aligners: TopHat, STAR (not DNASTAR); BWA is not itself splice-aware
Sometimes need to trim the beginning bases

20 Quality Control: RSeQC
Read qualities

21 Quality Control: RSeQC
Nucleotide compositions

22 Quality Control: RSeQC
Read count distribution and GC content

23 Quality Control: RSeQC
Read count distributions across genes

24 Quality Control: RSeQC
Insert size distribution and splicing junctions
[Figure labels: paired-end read, insert size]

25 Quality Control: RSeQC

26 Differential Expression

27 Differential expression
You see the expression of gene X double in condition B compared with condition A
How reliable is it? What is the chance of observing it at random?
It all comes down to variation estimation!
[Figure: three panels of expression in A vs. B with different spread; surviving labels: p=0.001, p=0.27]
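A minimal sketch of why the same two-fold change can be convincing or not: Welch's t statistic divides the difference in means by a standard error built from the within-group variances. The replicate values below are made up for illustration, and the slide's own p-values are not reproduced here.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se

# Both pairs have means 10 (A) and 20 (B), i.e. a doubling.
tight_a, tight_b = [9, 10, 11], [19, 20, 21]   # little replicate variation
noisy_a, noisy_b = [2, 10, 18], [4, 20, 36]    # large replicate variation

t_tight = welch_t(tight_a, tight_b)   # about 12.2
t_noisy = welch_t(noisy_a, noisy_b)   # about 0.97
```

The fold change is identical; only the variance estimate separates a strong result from an uninformative one.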

28 Differential expression
Variation can be estimated if you have many biological replicates
But in practice, only 2-3 replicates are available
What to do next? Proper statistical models

29 Sequencing Read Distribution
Poisson distribution: # events within an interval
Mean = Variance
But: sequencing data is over-dispersed (Variance > Mean)
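One quick way to see over-dispersion is the ratio of sample variance to sample mean, which sits near 1 for Poisson counts. The sketch below assumes made-up read counts for a single gene; `dispersion_index` is a hypothetical helper name.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    return variance(counts) / mean(counts)

# Made-up counts for one gene, both sets with mean 100.
technical = [88, 112, 93, 107, 100]    # spread roughly as Poisson predicts
biological = [60, 140, 85, 130, 85]    # biological replicates vary far more

ratio_technical = dispersion_index(technical)     # 0.965
ratio_biological = dispersion_index(biological)   # 11.375
```

A ratio well above 1 across many genes is the usual sign that a Poisson model will understate the variance.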

30 Sequencing Read Distribution
Negative binomial
Def: # of successes before r failures occur, if P(each success) is p
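Under that definition (X counts successes before the r-th failure, each trial succeeding with probability p), the standard formulas show directly why the negative binomial can absorb over-dispersion: its variance always exceeds its mean. The symbol μ for the mean is introduced here for convenience.

```latex
\begin{aligned}
P(X = k) &= \binom{k + r - 1}{k}\, p^{k} (1 - p)^{r}, \qquad k = 0, 1, 2, \dots \\
\mu = E[X] &= \frac{r p}{1 - p} \\
\operatorname{Var}(X) &= \frac{r p}{(1 - p)^{2}} = \mu + \frac{\mu^{2}}{r} > \mu
\end{aligned}
```

As r grows with μ held fixed, the extra term μ²/r vanishes and the distribution approaches the Poisson; 1/r plays the role of a dispersion parameter.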

31 Differential Expression
Negative binomial for RNA-seq
Variance estimated by borrowing information from all the genes (hierarchical models)
Test whether μi is the same for gene i between samples j
FDR?
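Since one test is run per gene, the p-values need a multiple-testing correction. A common choice is the Benjamini-Hochberg procedure; the slide only asks "FDR?", so treating BH as the method is an assumption. A minimal sketch in plain Python:

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values for four genes.
qvalues = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
# approximately [0.04, 0.0533, 0.0533, 0.5]
```

Genes with a q-value below the chosen threshold (say 0.05) are called differentially expressed; here only the first gene would pass.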

32 Differential expression
edgeR
DESeq/DESeq2

33 Expression Index
RPKM (reads per kilobase of transcript per million reads of library)
Corrects for coverage, gene length
1 RPKM ~ transcript / cell
Comparable between different genes within the same dataset
TopHat / Cufflinks
FPKM (fragments), PE libraries, RPKM/2
TPM (transcripts per million)
Normalizes to transcript copies instead of reads
Longer transcripts have more reads
RSEM, HTSeq
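The two indices differ only in the order of normalization, which a short sketch makes explicit. The function names and the two-gene toy data are assumptions; real tools use effective lengths and handle multi-mapped reads, which this ignores.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]

# Two transcripts with equal copy number: the 8 kb one yields 8x the reads.
counts = [100, 800]
lengths = [1000, 8000]
print(rpkm(counts, lengths))   # both values equal
print(tpm(counts, lengths))    # both 500000, summing to one million
```

TPM values always sum to one million within a sample, whereas the sum of RPKM values depends on the length distribution of the expressed transcripts.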

34 Differential Expression
Should we do differential expression on RPKM/FPKM or TPM?
Cufflinks: RPKM/FPKM
LIMMA-VOOM and DESeq: neither; both take read counts as input
With counts, power to detect DE is proportional to length
Continued development and updates
[Figure labels: Gene A (1kb), Gene B (8kb)]

35 Alternative Splicing Assign reads to splice isoforms (TopHat)

36 Alternative Splicing Different AS events

37 Alternative Splicing MATS: Multivariate Analysis of Transcript Splicing

38 Transcript Assembly
Reference-based assembly: Cufflinks
De novo assembly: Trinity

39 Transcript Assembly (Cufflinks)
1. Read mapping using TopHat
2. Construct a graph of reads
“Incompatible” fragments (reads) are definitely NOT from the same transcript

40 Transcript Assembly (Cufflinks)
Incompatible

41 Transcript Assembly (Cufflinks)
3. Identify the minimum # of paths that cover all reads (each path is one possible transcript)
Dilworth’s theorem: the minimum number of chains needed to partition P equals the size of a maximum antichain in P (an antichain is a set of mutually incompatible fragments)
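The minimum number of chains can be computed with bipartite matching: it equals the number of fragments minus the size of a maximum matching on the compatibility relation. The sketch below is an illustration under stated assumptions, not Cufflinks' code: fragments are numbered 0..n-1, `compatible` holds ordered pairs (i, j) meaning i can precede j in one transcript, and the relation is already transitively closed.

```python
def min_chain_cover_size(n, compatible):
    """Minimum number of chains (transcripts) covering n fragments.

    compatible -- set of (i, j) pairs: fragment i may be followed by j
                  in the same transcript (assumed transitively closed).
    """
    succ = {i: [j for (a, j) in compatible if a == i] for i in range(n)}
    match_right = {}  # fragment j -> the fragment i currently chained before it

    def try_augment(i, seen):
        # Kuhn's augmenting-path search for a bipartite matching.
        for j in succ[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in match_right or try_augment(match_right[j], seen):
                match_right[j] = i
                return True
        return False

    matching = sum(try_augment(i, set()) for i in range(n))
    return n - matching

# Fragments 1 and 2 are mutually incompatible (e.g. alternative exons);
# fragment 0 precedes both and fragment 3 follows both.
pairs = {(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)}
print(min_chain_cover_size(4, pairs))  # 2
```

The answer 2 matches the largest antichain {1, 2}, as Dilworth's theorem requires: at least two transcripts are needed to explain these fragments.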

42 Transcript Assembly (Cufflinks)
4. Transcript abundance estimation

43 Isoform Inference
If given a known set of isoforms:
Estimate x to maximize the likelihood of observing n
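A common way to carry out this maximization is expectation-maximization (EM). The sketch below assumes x is the vector of isoform read fractions and n the observed reads, each summarized by the set of isoforms it is compatible with; it ignores transcript length and fragment-size effects, so it illustrates the idea rather than any specific tool.

```python
def em_isoform_fractions(read_compat, n_isoforms, n_iter=200):
    """Estimate the fraction of reads coming from each isoform by EM.

    read_compat -- one tuple per read listing the isoform indices
                   that the read is compatible with.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        expected = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms.
        for isoforms in read_compat:
            total = sum(theta[k] for k in isoforms)
            for k in isoforms:
                expected[k] += theta[k] / total
        # M-step: renormalize the expected read counts.
        theta = [e / len(read_compat) for e in expected]
    return theta

# Toy gene: 60 reads fit both isoforms, 30 only isoform 0, 10 only isoform 1.
reads = [(0, 1)] * 60 + [(0,)] * 30 + [(1,)] * 10
print(em_isoform_fractions(reads, 2))  # close to [0.75, 0.25]
```

The shared reads end up split 3:1, the same ratio as the uniquely assignable reads, which is the maximum-likelihood solution for this toy case.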

44 Known Isoform Abundance Inference

45 Isoform Inference
With a known isoform set, gene-level expression inference is sometimes great even though isoform abundances have big uncertainty (e.g. known set incomplete)
De novo isoform inference is a non-identifiable problem if RNA-seq reads are short and the gene is long with too many exons
Algorithm: Trinity

46 De-novo transcriptome assembly

47

48 De Bruijn graph (1946)
Used in the earliest human genome assemblies
Standard algorithm for genome assembly
A sequence of length k can be represented as an edge between two sequences (length k-1)

49 De Bruijn graph (1946)

50 De Bruijn graph
How to do genome assembly?
Sequences as nodes -> traverse all nodes in a graph -> Hamiltonian path problem -> NP-complete problem!
De Bruijn graph: sequences as edges -> traverse all edges in a graph -> Eulerian path -> polynomial algorithm!
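The edge-traversal idea can be sketched in a few lines: every distinct k-mer becomes an edge between its (k-1)-mer prefix and suffix, and Hierholzer's algorithm walks each edge once. This is a toy under strong assumptions (error-free reads, no repeated k-mers, an Eulerian path exists); real assemblers such as Trinity add a great deal on top.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the suffixes of the distinct k-mers seen."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start where out-degree exceeds in-degree by one, else at any node.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn_edges(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])

print(assemble(["ATGGCGT", "GCGTGCA"], 4))  # ATGGCGTGCA
```

Each edge is visited once, so the running time is linear in the number of k-mers, in contrast to the Hamiltonian formulation.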

51 Gene Fusion
More often seen in cancer samples
Still a bit hard to call
TopHat-Fusion in TopHat2
Maher et al., Nature (2009)

52 Other Applications
RNA editing:
Change on RNA sequence after transcription
Most frequent: A to I (behaves like G), C to U
Evolves from mononucleotide deaminases, might be involved in RNA degradation
Circular RNA:
Mostly arise from splicing
Varying length, abundance, and stability
Possible function: sponge for RBP or miRNA

53 Summary
RNA-seq design considerations
Read mapping: TopHat, BWA, STAR
De novo transcriptome assembly: Trinity
Quality control: RSeQC
Expression index: FPKM and TPM
Differential expression
Cufflinks: versatile
LIMMA-VOOM and DESeq: better variance estimates
Alternative splicing: MATS
Gene fusion, RNA editing, circular RNA

54 Acknowledgement
Alisha Holloway, Simon Andrews, Radhika Khetani

