Working with RNA-Seq Data

Part 1: Working with RNA-Seq Data

RNA-seq: overview
[Figure, built up over several slides: a genome sequence (.…TCTGAAACAATGCTTCAATCTAACTTATCATTCATTGGGA….) annotated with Gene A, Gene B and Gene C; transcripts Transcr. A and Transcr. C; reads.]

RNA-seq: some details
[Figure, built up over several slides: steps applied to the transcripts of Gene A and Gene C, in order: shattering, adapter ligation, PCR amplification, “reading”.]

RNA-seq: per-sample processing
Preprocessing:
- Adapter removal plus additional trimming
- Removal of PCR duplicates
Mapping:
- Mapping to the set of known transcripts
- Mapping to the genome (and potential identification of novel transcripts)
- Combined strategy
Quantification of expression levels

RNA-seq: Comments
PCR duplicate removal should be used with caution to avoid removing natural duplicates (valuable links: DNA-seq and variant calling; RNA-seq, ChIP-seq data; trimming).

RNA-seq: processing

RNA-seq: expression level quantification
Standard measures:
- Read counts (raw, expected).
- FPKM (fragments per kilobase per million mapped reads):
  FPKM_g = (number of reads mapped to the gene) / ((total number of mapped reads, in millions) x (gene length, in kilobases))
- TPM (transcripts per million): for one sample, TPM_g = C x FPKM_g, where C is chosen so that the sum of all TPM_g is one million. The constant C differs from sample to sample.

Alternative definition of TPM:
  TPM_g = (number of reads mapped to the gene x mean read length x 10^6) / (gene length x T),
where T is the sum over all genes of (number of reads mapped to the gene x mean read length) / gene length.
Each term here represents the number of sampled transcripts corresponding to a gene, and T estimates the total number of sampled transcripts (molecules). Thus, TPM is an estimate of the number of transcripts corresponding to a gene in every million transcripts.
Details: Wagner G.P., Kin K., Lynch V.J. (Theory Biosci., 2012)
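The two TPM definitions above can be checked against each other with a minimal sketch. The gene names, read counts, gene lengths and mean read length below are hypothetical toy values, and the function names are illustrative only, not part of any pipeline used in the slides.

```python
def fpkm(counts, lengths_bp):
    """FPKM per gene: reads / (total mapped reads in millions * gene length in kb)."""
    total_millions = sum(counts.values()) / 1e6
    return {g: counts[g] / (total_millions * (lengths_bp[g] / 1e3)) for g in counts}


def tpm_from_fpkm(fpkm_values):
    """TPM_g = C * FPKM_g, with C chosen so that the TPM values sum to one million."""
    c = 1e6 / sum(fpkm_values.values())
    return {g: c * v for g, v in fpkm_values.items()}


def tpm_direct(counts, lengths_bp, mean_read_length):
    """Alternative definition: (reads * read length * 1e6) / (gene length * T)."""
    per_gene = {g: counts[g] * mean_read_length / lengths_bp[g] for g in counts}
    t = sum(per_gene.values())  # estimate of the total number of sampled transcripts
    return {g: v * 1e6 / t for g, v in per_gene.items()}


# Hypothetical toy data: read counts and gene lengths in base pairs.
counts = {"geneA": 500, "geneB": 1500, "geneC": 2000}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

fpkm_values = fpkm(counts, lengths_bp)
tpm_a = tpm_from_fpkm(fpkm_values)
tpm_b = tpm_direct(counts, lengths_bp, mean_read_length=100)

for gene in counts:
    print(gene, round(fpkm_values[gene], 1), round(tpm_a[gene], 1), round(tpm_b[gene], 1))
```

Both routes give the same TPM values because the mean read length and the FPKM scaling factors cancel within a sample; only the per-sample constant C changes between samples.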

Linear scale vs Log-scale
Relative differences are biologically more meaningful than absolute ones. Computations are simplified if log-scaling is performed:
  Log-scaled measure = log2(linear-scale measure + shift)
For relatively large values:
- a difference of 1 in log scale is a 2x difference in linear scale;
- a difference of 3 in log scale is an 8x difference in linear scale, etc.;
- a difference of -1 in log scale is a 2x difference in linear scale, but in the opposite direction.
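A small sketch of the log-scaling rule, showing why the "difference of 3 equals 8x" reading holds only for relatively large values. The shift of 1 and the expression values are assumptions chosen for illustration.

```python
import math


def to_log2(value, shift=1.0):
    """Log-scaled measure = log2(linear-scale measure + shift)."""
    return math.log2(value + shift)


# Hypothetical linear-scale values that differ by a factor of 8.
large_diff = to_log2(8000.0) - to_log2(1000.0)  # shift is negligible here
small_diff = to_log2(8.0) - to_log2(1.0)        # shift dominates the smaller value

print(round(large_diff, 3))
print(round(small_diff, 3))
```

For the large values the log-scale difference is very close to 3, while for the small values the same 8x ratio yields a visibly smaller difference because of the shift.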

Comparison: the role of preprocessing
No preprocessing

No PCR duplicate removal

Standard

Comparison: the role of preprocessing (output)

Extended pipeline

BREAK

Part 2: Differential expression and pathway / gene set enrichment analysis

Differential expression analysis
Quantities related to the degree of differential expression:
- Difference between mean expression levels: fold change (please pay attention to the scale).
- Statistical significance: p-value, adjusted p-value (e.g., FDR).
- Expression level magnitude (treat low-expressed genes with caution in the analysis).
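The quantities listed above can be combined as in the following minimal sketch. It assumes expression values are already on a log2 scale, so the fold change is a difference of means, and it uses the Benjamini-Hochberg procedure as one common way to obtain FDR-adjusted p-values. The gene names, expression values, raw p-values and thresholds are hypothetical, and the statistical test that would produce the raw p-values is not shown.

```python
from statistics import mean


def log2_fold_change(group_a, group_b):
    """Difference of mean log2-scale expression levels (group_b minus group_a)."""
    return mean(group_b) - mean(group_a)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log2-scale expression values for two groups of three samples.
expr = {
    "geneA": ([5.0, 5.2, 4.8], [7.1, 6.9, 7.0]),
    "geneB": ([3.0, 3.1, 2.9], [3.0, 3.2, 3.1]),
}
p_values = {"geneA": 0.001, "geneB": 0.40}  # hypothetical, from a test not shown

genes = list(expr)
lfc = {g: log2_fold_change(a, b) for g, (a, b) in expr.items()}
adj = dict(zip(genes, benjamini_hochberg([p_values[g] for g in genes])))

# Assumed thresholds: at least a 2x change and adjusted p-value of at most 0.05.
selected = [g for g in genes if abs(lfc[g]) >= 1.0 and adj[g] <= 0.05]
print(selected)
```

A filter on expression level magnitude could be added to the last step in the same way, for example by requiring a minimum mean expression in at least one group.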

Differential expression analysis

Gene set / pathway enrichment analysis
Possible options:
- Use only lists (thresholding required): one of the standard tools here is The Database for Annotation, Visualization and Integrated Discovery, DAVID ( d.ncifcrf.gov/).
- Take into consideration degrees of differential expression.
- Additionally take into consideration pathway topology.
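For the list-based option, a common way to score the overlap between a thresholded gene list and a gene set is a one-sided hypergeometric test. The sketch below illustrates that general idea only; it is not claimed to be the statistic DAVID computes, and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_universe genes, n_in_set of which belong to the gene set."""
    total = comb(n_universe, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 1000 measured genes, a pathway with 50 genes,
# 100 differentially expressed genes, 15 of them in the pathway.
p = enrichment_p_value(n_universe=1000, n_in_set=50, n_selected=100, n_overlap=15)
print(p)
```

With these counts about 5 overlapping genes would be expected by chance, so an overlap of 15 gives a small p-value. When many gene sets are tested, the resulting p-values would normally be adjusted for multiple testing.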

Gene set / pathway enrichment analysis

BREAK

Part 3: Unsupervised analysis

Unsupervised analysis: PCA

Unsupervised analysis: hierarchical clustering

Unsupervised analysis: hierarchical clustering
Dendrogram

Unsupervised analysis: PCA (15 genes)

Unsupervised analysis: hierarchical clustering, 15 genes
Dendrogram

Unsupervised analysis: hierarchical clustering, 15 genes
Dendrogram (cluster labels: Luminal, C-low, N-like, Basal)

Gene annotation: ENSG to Gene Symbols plus GO

Unsupervised analysis: K-means, 15 genes

Unsupervised analysis: K-means, 15 genes
“The SUM52PE cell line was derived from a pleural effusion and was found to be negative for ER and PR expression, however the original primary tumor from this patient was positive for both hormone receptors”.
Chavez KJ, Garimella SV, Lipkowitz S. Triple negative breast cancer cell lines: one tool in the search for better treatment of triple negative breast cancer. Breast Dis. 2010; 32(1-2):35-48.
Ethier SP, Kokeny KE, Ridings JW, Dilts CA. erbB family receptor expression and growth regulation in a newly isolated human breast cancer cell line. Cancer Res. 1996; 56(4):

BREAK

Part 4: Supervised analysis: classification

Supervised analysis: SVM with a linear kernel as an example

Supervised analysis: SVM with a linear kernel as an example

Supervised analysis: available methods
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
- Random Forest
- Support Vector Machine (SVM)
- Naïve Bayes
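As an illustration of one method from the list, here is a minimal Gaussian Naïve Bayes sketch written with the standard library only. The training samples (log-scale expression of two hypothetical genes), the use of "Luminal" and "Basal" as class labels, and the small variance floor are assumptions made for the example; in practice a library implementation would normally be used.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(samples, labels):
    """Estimate per-class priors and per-feature means and variances."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant added to each variance to avoid division by zero (assumption).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-6
            for col, m in zip(zip(*rows), means)
        ]
        model[y] = (n / len(samples), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(params):
        prior, means, variances = params
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda y: log_posterior(model[y]))


# Hypothetical training data: two features (genes) per sample.
samples = [[8.0, 2.0], [7.5, 2.5], [8.2, 1.8], [2.0, 7.9], [2.4, 8.1], [1.9, 7.5]]
labels = ["Luminal", "Luminal", "Luminal", "Basal", "Basal", "Basal"]

model = fit_gaussian_nb(samples, labels)
pred_first = predict_gaussian_nb(model, [7.8, 2.2])
pred_second = predict_gaussian_nb(model, [2.1, 8.0])
print(pred_first, pred_second)
```

The classifier treats features as independent within each class, which keeps it simple when the number of genes is large relative to the number of samples; performance should be assessed on samples not used for training.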

Supervised analysis: 15 genes

BREAK

Hands-on: Separation of TCGA and breast cancer PDX samples

Hands-on: Analysis of a subset of breast cancer PDX samples
