Part 1: Working with RNA-Seq Data
RNA-seq: overview
[Figure sequence: genome (…TCTGAAACAATGCTTCAATCTAACTTATCATTCATTGGGA…) with Gene A, Gene B, Gene C; transcripts (Transcr. A, Transcr. C); reads obtained from the transcripts; reads assigned back to Transcr. A and Transcr. C]
RNA-seq: some details
[Figure sequence: genome with Gene A, Gene B, Gene C and transcripts Transcr. A, Transcr. C, one step per slide]
- Shattering (fragmentation of the transcripts)
- Adapter ligation
- PCR amplification
- "Reading" (sequencing)
RNA-seq: per-sample processing
- Preprocessing:
  - Adapter removal plus additional trimming
  - Removal of PCR duplicates
- Mapping:
  - Mapping to the set of known transcripts
  - Mapping to the genome (and potential identification of novel transcripts)
  - Combined strategy
- Quantification of expression levels
RNA-seq: comments
PCR duplicate removal should be used with caution to avoid removing natural duplicates.
Valuable links:
- http://www.cureffi.org/2012/12/11/how-pcr-duplicates-arise-in-next-generation-sequencing/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4965708/ (DNA-seq and variant calling)
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4597324/ (RNA-seq, ChIP-seq data)
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3871669/ (trimming)
RNA-seq: processing
RNA-seq: expression level quantification
Standard measures:
- Read counts (raw, expected)
- FPKM (fragments per kilobase per million mapped reads):
  FPKM_g = (number of reads mapped to gene g) / ((total number of mapped reads, in millions) x (length of gene g, in kilobases))
- TPM (transcripts per million):
  within one sample, TPM_g = C x FPKM_g, where C is chosen so that the sum of TPM_g over all genes equals one million. The constant C differs from sample to sample.
RNA-seq: expression level quantification
Alternative definition of TPM:
  TPM_g = (number of reads mapped to gene g x mean read length x 10^6) / (length of gene g x T),
  where T is the sum over all genes of (number of reads mapped to the gene x mean read length) / gene length.
Each term of this sum represents the number of sampled transcripts corresponding to a gene, and T estimates the total number of sampled transcripts (molecules). Thus, TPM estimates the number of transcripts corresponding to a gene in every million transcripts.
Details: Wagner G.P., Kin K., Lynch V.J. (Theory Biosci., 2012) https://www.ncbi.nlm.nih.gov/pubmed/22872506
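The two TPM definitions can be checked against each other on a toy sample. The sketch below is only an illustration: the function names, the three-gene counts, the gene lengths, and the mean read length of 100 are all made up, and it assumes every mapped read is assigned to one of the listed genes.

```python
def fpkm(counts, lengths_bp):
    """FPKM per gene: reads / (mapped reads in millions * gene length in kb)."""
    total_millions = sum(counts) / 1e6
    return [c / (total_millions * (length / 1e3))
            for c, length in zip(counts, lengths_bp)]


def tpm_from_fpkm(fpkm_values):
    """Rescale FPKM by the sample-specific constant C so the values sum to 1e6."""
    c = 1e6 / sum(fpkm_values)
    return [c * v for v in fpkm_values]


def tpm_direct(counts, lengths_bp, mean_read_length):
    """Alternative definition: per-gene transcript estimates divided by their total T."""
    per_gene = [n * mean_read_length / length
                for n, length in zip(counts, lengths_bp)]
    t = sum(per_gene)
    return [x * 1e6 / t for x in per_gene]


# Toy sample (hypothetical numbers): three genes, all mapped reads assigned to them.
counts = [500, 1000, 500]
lengths_bp = [1000, 4000, 500]

fpkm_values = fpkm(counts, lengths_bp)
tpm_a = tpm_from_fpkm(fpkm_values)
tpm_b = tpm_direct(counts, lengths_bp, mean_read_length=100)

print([round(v) for v in fpkm_values])
print([round(v, 1) for v in tpm_a])
print([round(v, 1) for v in tpm_b])
```

Both routes give the same TPM values (roughly 285714, 142857, and 571429), and they sum to one million. The mean read length cancels out in the direct definition, which is why it does not have to be known precisely.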
RNA-seq: expression level quantification
Linear scale vs log scale
- Relative differences are biologically more meaningful than absolute ones.
- Computations are simplified by log-scaling: log-scaled measure = log2(linear-scale measure + shift).
- For relatively large values, a difference of 1 on the log scale is a 2x difference on the linear scale; a difference of 3 is an 8x difference, etc.; a difference of -1 is a 2x difference in the opposite direction.
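A small sketch of the log-scaling rule, showing why the "2x per unit" reading holds only for relatively large values. The shift of 1 and the expression values are assumptions chosen for illustration; the slides do not fix a particular shift.

```python
import math


def log_scale(value, shift=1.0):
    """Log-scaled measure = log2(linear-scale measure + shift)."""
    return math.log2(value + shift)


# Large values: an 8x linear difference is close to 3 units on the log scale.
diff_large = log_scale(800.0) - log_scale(100.0)

# Small values: the same 8x linear difference is compressed by the shift.
diff_small = log_scale(8.0) - log_scale(1.0)

print(round(diff_large, 2), round(diff_small, 2))
```

With these numbers the large-value difference is close to 3, while the small-value difference is only a little above 2, because the shift dominates when the linear values are small.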
Comparison: the role of preprocessing
- No preprocessing
- No PCR duplicate removal
- Standard
- (output)
Extended pipeline
BREAK
Part 2: Differential expression and pathway / gene set enrichment analysis
Differential expression analysis
Quantities related to the degree of differential expression:
- Difference between mean expression levels: fold change (pay attention to the scale, linear or log);
- Statistical significance: p-value, adjusted p-value (e.g., FDR);
- Expression level magnitude (use caution with low-expressed genes, e.g. consider excluding them from the analysis).
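The first two quantities can be sketched in a few lines. This is an illustration under assumptions, not the pipeline used in the course: expression values are taken to be already log2-scaled, the p-values are supplied as made-up numbers rather than computed by a test, and FDR adjustment is done with the Benjamini-Hochberg procedure (one common choice). The function names are hypothetical.

```python
def log2_fold_change(group_a, group_b):
    """Difference of mean log2-scale expression levels (group_b minus group_a)."""
    return sum(group_b) / len(group_b) - sum(group_a) / len(group_a)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log2-scale expression of one gene in two groups of samples.
control = [5.0, 5.2, 4.8]
treated = [7.0, 7.1, 6.9]
lfc = log2_fold_change(control, treated)

# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)

print(round(lfc, 3), [round(p, 4) for p in adj_p])
```

Here the log2 fold change is about 2, i.e. roughly a 4x difference on the linear scale, which is why the scale has to be stated whenever a fold change is reported.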
Gene set / pathway enrichment analysis
Possible options:
- Use only gene lists (thresholding required): one of the standard tools here is the Database for Annotation, Visualization and Integrated Discovery, DAVID (https://david.ncifcrf.gov/home.jsp, https://david-d.ncifcrf.gov/);
- Take into consideration the degrees of differential expression;
- Additionally take into consideration pathway topology.
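For the list-only option, a generic over-representation test can be sketched as follows. This is a plain hypergeometric upper-tail test, shown as an assumption about what a list-based tool computes in spirit; it is not DAVID's own scoring, and the function name and gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from a universe of n_universe genes, n_in_set of which are in the gene set."""
    max_overlap = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
               for k in range(n_overlap, max_overlap + 1))
    return tail / total


# Hypothetical numbers: 20 genes measured, 5 in the pathway,
# 4 pass the differential expression threshold, 3 of those are in the pathway.
p = enrichment_p_value(n_universe=20, n_in_set=5, n_selected=4, n_overlap=3)
print(round(p, 4))
```

The result depends on the threshold used to build the gene list, which is the weakness the other two options (using the degrees of differential expression, and pathway topology) try to address.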
BREAK
Part 3: Unsupervised analysis
Unsupervised analysis: PCA
Unsupervised analysis: hierarchical clustering
Unsupervised analysis: hierarchical clustering Dendrogram
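A minimal sketch of agglomerative hierarchical clustering, the bottom-up procedure whose merge history a dendrogram displays. The choice of Euclidean distance and average linkage is an assumption (the slides do not state which distance or linkage was used), and the points and function names are made up.

```python
from itertools import combinations
from math import dist


def average_linkage(points, cluster_a, cluster_b):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    total = sum(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(points, *pair))
        height = average_linkage(points, a, b)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges


# Hypothetical samples described by two features each.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
merges = agglomerate(points)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

Each entry of the merge history corresponds to one join in the dendrogram, with the linkage distance as the height of the join.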
Unsupervised analysis: PCA (15 genes)
Unsupervised analysis: hierarchical clustering, 15 genes
Dendrogram; cluster labels: Luminal, C-low, N-like, Basal
Gene annotation: ENSG to Gene Symbols plus GO
Unsupervised analysis: K-means, 15 genes
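A minimal sketch of the K-means iteration (alternating an assignment step and a center update step). It assumes Euclidean distance and explicitly supplied starting centers so that the run is deterministic; the points, the starting centers, and the function name are hypothetical and unrelated to the 15-gene data.

```python
from math import dist


def k_means(points, centers, max_iter=100):
    """Alternate assignment and center update until the centers stop moving."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical samples with two features and two starting centers.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, centers=[(1.0, 1.0), (9.0, 8.5)])
print(labels, centers)
```

The result depends on the starting centers and on the chosen number of clusters, so in practice several starts are usually compared.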
Unsupervised analysis: K-means, 15 genes
"The SUM52PE cell line was derived from a pleural effusion and was found to be negative for ER and PR expression, however the original primary tumor from this patient was positive for both hormone receptors."
References:
- Chavez KJ, Garimella SV, Lipkowitz S. Triple negative breast cancer cell lines: one tool in the search for better treatment of triple negative breast cancer. Breast Dis. 2010; 32(1-2): 35-48.
- Ethier SP, Kokeny KE, Ridings JW, Dilts CA. erbB family receptor expression and growth regulation in a newly isolated human breast cancer cell line. Cancer Res. 1996; 56(4): 899-907.
BREAK
Part 4: Supervised analysis: classification
Supervised analysis: SVM with a linear kernel as an example
Supervised analysis: SVM with a linear kernel as an example ?
Supervised analysis: available methods
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
- Random Forest
- Support Vector Machine (SVM)
- Naïve Bayes
Supervised analysis: 15 genes
BREAK
Hands-on: Separation of TCGA and breast cancer PDX samples
Hands-on: Analysis of a subset of breast cancer PDX samples