1
Working with RNA-Seq Data
Part 1: Working with RNA-Seq Data
2
RNA-seq: overview
Genome: .…TCTGAAACAATGCTTCAATCTAACTTATCATTCATTGGGA….
3
RNA-seq: overview
[Figure: genome with Gene A, Gene B, Gene C]
4
RNA-seq: overview
[Figure: genome with Gene A, Gene B, Gene C; transcripts Transcr. A, Transcr. A, Transcr. C]
5
RNA-seq: overview
[Figure: genome with Gene A, Gene B, Gene C; transcripts Transcr. A, Transcr. C; reads]
6
RNA-seq: overview
[Figure: genome with Gene A, Gene B, Gene C; reads; transcripts Transcr. A, Transcr. C]
7
RNA-seq: some details
Shattering
[Figure: genome with Gene A, Gene B, Gene C; transcripts Transcr. A, Transcr. C]
8
RNA-seq: some details
Adapter ligation
[Figure: genome with Gene A, Gene B, Gene C; transcripts Transcr. A, Transcr. C]
9
RNA-seq: some details
PCR amplification
[Figure: genome with Gene A, Gene B, Gene C; transcripts Transcr. A, Transcr. C]
10
RNA-seq: some details
“Reading”
[Figure: genome with Gene A, Gene B, Gene C; transcripts Transcr. A, Transcr. C]
11
RNA-seq: per-sample processing
Preprocessing:
  Adapter removal plus additional trimming
  Removing PCR duplicates
Mapping:
  Mapping on the set of known transcripts
  Mapping on the genome (and potential identification of novel transcripts)
  Combined strategy
Quantification of expression levels
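The two preprocessing steps can be illustrated on a toy scale. This is only a sketch: the adapter sequence and the reads are assumed example values, the helper names are hypothetical, and real pipelines use dedicated trimming and duplicate-marking tools that also handle partial adapter matches, base qualities, and mapping coordinates.

```python
def trim_adapter(read, adapter):
    """Cut the read at the first full occurrence of the adapter, if any."""
    index = read.find(adapter)
    return read if index == -1 else read[:index]


def drop_exact_duplicates(reads):
    """Keep the first copy of each sequence, preserving the input order."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


# Assumed example adapter and reads (illustration only).
ADAPTER = "AGATCGGAAGAGC"
raw_reads = [
    "TCTGAAACAATGAGATCGGAAGAGC",  # read that runs into the adapter
    "TCTGAAACAATG",               # same fragment without adapter
    "CTTCAATCTAACTT",
]

trimmed = [trim_adapter(read, ADAPTER) for read in raw_reads]
cleaned = drop_exact_duplicates(trimmed)
```

Note that collapsing identical sequences cannot tell PCR duplicates from natural duplicates of a highly expressed transcript.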
12
RNA-seq: Comments
PCR duplicate removal should be used with caution to avoid removing natural duplicates.
Valuable links:
- DNA-seq and variant calling
- RNA-seq, ChIP-seq data
- trimming
13
RNA-seq: processing
14
RNA-seq: processing
15
RNA-seq: expression level quantification
Standard measures:
Read counts (raw, expected)
FPKM – fragments per kilobase per million mapped reads:
  FPKM = number of reads mapped on the gene / ((total number of mapped reads, in millions) x (gene length, in kilobases))
TPM – transcripts per million:
  For one sample, TPMg = C x FPKMg, where C is chosen so that the sum of all TPMg is one million. The constant C differs from sample to sample.
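A minimal sketch of the two measures, assuming a hypothetical toy sample of three genes (the gene names, counts, and lengths are made up for illustration):

```python
def fpkm(counts, lengths_bp):
    """FPKM per gene: reads / (mapped reads in millions x gene length in kb)."""
    total_millions = sum(counts.values()) / 1e6
    return {
        gene: counts[gene] / (total_millions * (lengths_bp[gene] / 1e3))
        for gene in counts
    }


def tpm_from_fpkm(fpkm_values):
    """Rescale FPKM by a per-sample constant C so that the values sum to 1e6."""
    c = 1e6 / sum(fpkm_values.values())
    return {gene: c * value for gene, value in fpkm_values.items()}


# Hypothetical toy sample: read counts and gene lengths in base pairs.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

sample_fpkm = fpkm(counts, lengths_bp)
sample_tpm = tpm_from_fpkm(sample_fpkm)
```

Because C depends on the sum of FPKM values in the sample, two samples with identical FPKM for a gene can still have different TPM for it.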
16
RNA-seq: expression level quantification
Alternative definition of TPM:
  TPM = (number of reads mapped on the gene x mean read length x 10^6) / (gene length x T),
  where T is the sum over all genes of (number of reads mapped on the gene x mean read length) / gene length.
Each term of this sum represents the number of sampled transcripts corresponding to a gene, and T estimates the total number of sampled transcripts (molecules). Thus, TPM is an estimate of the number of transcripts corresponding to a gene in every million transcripts.
Details: Wagner G.P., Kin K., Lynch V.J. (Theory Biosci., 2012)
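The alternative definition can be sketched directly. The toy counts and lengths are hypothetical, and the function name is chosen here for illustration. Since the mean read length appears in both the numerator and T, it cancels, which the last two lines check by using two different read lengths.

```python
def tpm_direct(counts, lengths_bp, mean_read_length):
    """TPM from per-gene terms (reads x mean read length / gene length)."""
    rates = {
        gene: counts[gene] * mean_read_length / lengths_bp[gene]
        for gene in counts
    }
    t = sum(rates.values())  # estimate of the total number of sampled transcripts
    return {gene: rate * 1e6 / t for gene, rate in rates.items()}


# Hypothetical toy sample: read counts and gene lengths in base pairs.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 3000}

tpm_50 = tpm_direct(counts, lengths_bp, mean_read_length=50)
tpm_100 = tpm_direct(counts, lengths_bp, mean_read_length=100)
```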
17
RNA-seq: expression level quantification
Linear scale vs log scale
Relative differences are biologically more meaningful than absolute ones. Computations are simplified if log-scaling is performed:
  Log-scaled measure = log2(linear-scale measure + shift)
For relatively large values:
  a difference of 1 in log scale is a 2x difference in linear scale;
  a difference of 3 in log scale is an 8x difference in linear scale, etc.;
  a difference of -1 in log scale is a 2x difference in linear scale, but in the opposite direction.
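A short numeric illustration of the log-scaling formula. The shift of 1 is an assumed, commonly used choice, and the expression values are made up.

```python
import math


def log_scale(value, shift=1.0):
    """Log-scaled measure = log2(linear-scale measure + shift)."""
    return math.log2(value + shift)


# Two hypothetical linear-scale expression values.
high = log_scale(1023)  # log2(1024)
low = log_scale(127)    # log2(128)

log_difference = high - low       # difference in log scale
linear_ratio = 1023 / 127         # ratio in linear scale, close to 8
implied_ratio = 2 ** log_difference
```

For small values the shift dominates: with shift = 1, linear values 0 and 1 differ by exactly 1 in log scale, although their linear ratio is not 2x.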
18
Comparison: the role of preprocessing
No preprocessing
19
Comparison: the role of preprocessing
No PCR duplicate removal
20
Comparison: the role of preprocessing
Standard
21
Comparison: the role of preprocessing (output)
22
Comparison: the role of preprocessing
23
Comparison: the role of preprocessing
24
Extended pipeline
25
Extended pipeline
26
BREAK
27
Differential expression and pathway / gene set enrichment analysis
Part 2: Differential expression and pathway / gene set enrichment analysis
28
Differential expression analysis
Quantities related to the degree of differential expression:
  Difference between mean expression levels – fold change (please pay attention to scale);
  Statistical significance – p-value, adjusted p-value (e.g., FDR);
  Expression level magnitude (take care with low-expressed genes; consider excluding them from the analysis).
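Two of these quantities can be sketched in a few lines. The example assumes expression values that are already log2-scaled, so the fold change is a difference of means. It takes raw p-values as given rather than computing them, and uses the Benjamini-Hochberg procedure as one common way to obtain FDR-adjusted p-values. All numbers are made up.

```python
from statistics import mean


def log2_fold_change(log_group_a, log_group_b):
    """Difference of group means on the log2 scale (B relative to A)."""
    return mean(log_group_b) - mean(log_group_a)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log2-scaled expression of one gene in two groups.
control = [5.0, 5.2, 4.8]
treated = [7.0, 7.1, 6.9]
lfc = log2_fold_change(control, treated)  # about 2, i.e. about 4x in linear scale

# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
```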
29
Differential expression analysis
30
Differential expression analysis
31
Gene set / pathway enrichment analysis
Possible options:
  Use only lists (thresholding required): one of the standard tools here is the Database for Annotation, Visualization and Integrated Discovery – DAVID ( d.ncifcrf.gov/);
  Take into consideration degrees of differential expression;
  Additionally take into consideration pathway topology.
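The list-based option can be illustrated with a plain hypergeometric over-representation test. This is a generic sketch, not necessarily the exact statistic that DAVID reports, and the gene counts below are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_universe genes, of which n_in_set belong to the gene set."""
    total = comb(n_universe, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 genes in total, a pathway of 100 genes,
# 200 genes passing the differential expression threshold, 10 of them in the pathway.
p_value = enrichment_p_value(20000, 100, 200, 10)
```

Here the expected overlap under random selection is 200 x 100 / 20000 = 1 gene, so an overlap of 10 gives a small p-value. The result depends on the threshold used to build the list, which is the limitation the other two options address.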
32
Gene set / pathway enrichment analysis
33
Gene set / pathway enrichment analysis
34
BREAK
35
Unsupervised analysis
Part 3: Unsupervised analysis
36
Unsupervised analysis: PCA
37
Unsupervised analysis: PCA
38
Unsupervised analysis: PCA
39
Unsupervised analysis: hierarchical clustering
40
Unsupervised analysis: hierarchical clustering
41
Unsupervised analysis: hierarchical clustering
42
Unsupervised analysis: hierarchical clustering
43
Unsupervised analysis: hierarchical clustering
44
Unsupervised analysis: hierarchical clustering
45
Unsupervised analysis: hierarchical clustering
46
Unsupervised analysis: hierarchical clustering
47
Unsupervised analysis: hierarchical clustering
Dendrogram
48
Unsupervised analysis: hierarchical clustering
Dendrogram
49
Unsupervised analysis: PCA (15 genes)
50
Unsupervised analysis: PCA (15 genes)
51
Unsupervised analysis: hierarchical clustering, 15 genes
Dendrogram
52
Unsupervised analysis: hierarchical clustering, 15 genes
Dendrogram Luminal C-low N-like Basal
53
Gene annotation: ENSG to Gene Symbols plus GO
54
Unsupervised analysis: K-means, 15 genes
55
Unsupervised analysis: K-means, 15 genes
56
Unsupervised analysis: K-means, 15 genes
57
Unsupervised analysis: K-means, 15 genes
58
Unsupervised analysis: K-means, 15 genes
59
Unsupervised analysis: K-means, 15 genes
60
Unsupervised analysis: K-means, 15 genes
61
Unsupervised analysis: K-means, 15 genes
62
Unsupervised analysis: K-means, 15 genes
63
Unsupervised analysis: K-means, 15 genes
64
Unsupervised analysis: K-means, 15 genes
65
Unsupervised analysis: K-means, 15 genes
“The SUM52PE cell line was derived from a pleural effusion and was found to be negative for ER and PR expression, however the original primary tumor from this patient was positive for both hormone receptors”.
Chavez KJ, Garimella SV, Lipkowitz S. Triple negative breast cancer cell lines: one tool in the search for better treatment of triple negative breast cancer. Breast Dis. 2010; 32(1-2):35-48.
Ethier SP, Kokeny KE, Ridings JW, Dilts CA. erbB family receptor expression and growth regulation in a newly isolated human breast cancer cell line. Cancer Res. 1996; 56(4):
66
BREAK
67
Supervised analysis: classification
Part 4: Supervised analysis: classification
68
Supervised analysis: SVM with a linear kernel as an example
69
Supervised analysis: SVM with a linear kernel as an example
70
Supervised analysis: SVM with a linear kernel as an example
71
Supervised analysis: SVM with a linear kernel as an example
72
Supervised analysis: SVM with a linear kernel as an example
73
Supervised analysis: SVM with a linear kernel as an example
?
74
Supervised analysis: SVM with a linear kernel as an example
?
75
Supervised analysis: available methods
Linear Discriminant Analysis (LDA)
Quadratic Discriminant Analysis (QDA)
Random Forest
Support Vector Machine (SVM)
Naïve Bayes
76
Supervised analysis: 15 genes
77
BREAK
78
Separation of TCGA and breast cancer PDX samples
HANDS-ON: Separation of TCGA and breast cancer PDX samples
79
Analysis of a subset of breast cancer PDX samples
HANDS-ON: Analysis of a subset of breast cancer PDX samples