Working with RNA-Seq Data

1 Working with RNA-Seq Data
Part 1: Working with RNA-Seq Data

2 RNA-seq: overview
Genome: …TCTGAAACAATGCTTCAATCTAACTTATCATTCATTGGGA…

3 RNA-seq: overview
[Diagram labels: Genome; Gene A, Gene B, Gene C]

4 RNA-seq: overview
[Diagram labels: Genome; Gene A, Gene B, Gene C; Transcr. A, Transcr. A, Transcr. C]

5 RNA-seq: overview
[Diagram labels: Genome; Gene A, Gene B, Gene C; Transcr. A, Transcr. C; Reads]

6 RNA-seq: overview
[Diagram labels: Genome; Gene A, Gene B, Gene C; Transcr. A, Transcr. C; Reads shown next to Transcr. A and Transcr. C]

7 RNA-seq: some details
Step: Shattering (fragmentation of the transcripts)
[Diagram labels: Genome; Gene A, Gene B, Gene C; Transcr. A, Transcr. C]

8 RNA-seq: some details
Step: Adapter ligation
[Diagram labels: Genome; Gene A, Gene B, Gene C; Transcr. A, Transcr. C]

9 RNA-seq: some details
Step: PCR amplification
[Diagram labels: Genome; Gene A, Gene B, Gene C; Transcr. A, Transcr. C]

10 RNA-seq: some details
Step: "Reading" (sequencing)
[Diagram labels: Genome; Gene A, Gene B, Gene C; Transcr. A, Transcr. C]

11 RNA-seq: per-sample processing
Preprocessing:
- Adapter removal plus additional trimming
- Removing PCR duplicates
Mapping:
- Mapping on the set of known transcripts
- Mapping on genome (and potential identification of novel transcripts)
- Combined strategy
Quantification of expression levels

12 RNA-seq: Comments
PCR duplicate removal should be used with caution to avoid removing natural duplicates.
Valuable links:
- DNA-seq and variant calling
- RNA-seq, ChIP-seq data
- trimming

13 RNA-seq: processing

14 RNA-seq: processing

15 RNA-seq: expression level quantification
Standard measures:
- Read counts (raw, expected).
- FPKM (fragments per kilobase per million mapped reads): FPKM_g = (number of reads mapped on the gene) / ((total number of mapped reads, in millions) x (gene length, in kilobases)).
- TPM (transcripts per million): for one sample, TPM_g = C x FPKM_g, where C is chosen so that the sum of TPM_g over all genes is one million. The constant C differs from sample to sample.
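The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names and the three-gene toy sample are assumptions made for the example.

```python
def fpkm(counts, lengths_bp):
    """FPKM per gene: reads / (total mapped reads in millions * gene length in kb)."""
    total_millions = sum(counts) / 1e6
    return [c / (total_millions * (length / 1e3))
            for c, length in zip(counts, lengths_bp)]


def tpm_from_fpkm(fpkm_values):
    """TPM_g = C * FPKM_g, with C chosen so that the TPM values sum to one million."""
    c = 1e6 / sum(fpkm_values)
    return [c * v for v in fpkm_values]


# Hypothetical toy sample with three genes (read counts and gene lengths in bp).
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 1000]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm_from_fpkm(fpkm_values)

for name, f, t in zip(["geneA", "geneB", "geneC"], fpkm_values, tpm_values):
    print(f"{name}: FPKM={f:.1f} TPM={t:.1f}")
```

Because C depends on the sum of FPKM values in the sample, it has to be recomputed for every sample.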

16 RNA-seq: expression level quantification
Alternative definition of TPM:
TPM_g = (number of reads mapped on the gene x mean read length x 10^6) / (gene length x T),
where T is the sum over all genes of (number of reads mapped on the gene x mean read length) / gene length.
Each term in this sum represents the number of sampled transcripts corresponding to a gene, and T estimates the total number of sampled transcripts (molecules). Thus, TPM estimates the number of transcripts corresponding to a gene in every million transcripts.
Details: Wagner G.P., Kin K., Lynch V.J. (Theory Biosci., 2012)
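A small sketch of this alternative definition, under the assumption of a single mean read length per sample. The function name and the toy numbers are hypothetical. The mean read length appears in both the numerator and T, so it cancels out of the final value.

```python
def tpm_direct(counts, lengths_bp, mean_read_length):
    """TPM from read counts, gene lengths and mean read length (alternative definition)."""
    # Per-gene estimate of the number of sampled transcripts.
    per_gene = [c * mean_read_length / length
                for c, length in zip(counts, lengths_bp)]
    t = sum(per_gene)  # estimate of the total number of sampled transcripts
    return [x * 1e6 / t for x in per_gene]


# Hypothetical toy sample with three genes.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 1000]

tpm_100 = tpm_direct(counts, lengths_bp, mean_read_length=100)
tpm_50 = tpm_direct(counts, lengths_bp, mean_read_length=50)
print([round(v, 1) for v in tpm_100])
```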

17 RNA-seq: expression level quantification
Linear scale vs log scale
Relative differences are biologically more meaningful than absolute ones. Computations are simplified if log-scaling is performed:
Log-scaled measure = log2(linear-scale measure + shift)
For relatively large values:
- a difference of 1 in log scale is a 2x difference in linear scale;
- a difference of 3 in log scale is an 8x difference in linear scale, etc.;
- a difference of -1 in log scale is a 2x difference in linear scale, but in the opposite direction.
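The sketch below illustrates why the rule of thumb holds only for relatively large values. The shift of 1 is an assumption chosen for the example (the slide does not state a value), and the expression values are made up.

```python
import math


def log2_scale(value, shift=1.0):
    """Log-scaled measure = log2(linear-scale measure + shift); shift=1 is an assumed default."""
    return math.log2(value + shift)


# An 8x difference between large values is close to 3 on the log scale ...
large_diff = log2_scale(8000.0) - log2_scale(1000.0)
# ... but the same 8x ratio between small values is compressed by the shift.
small_diff = log2_scale(8.0) - log2_scale(1.0)

print(f"large values: {large_diff:.3f}, small values: {small_diff:.3f}")
```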

18 Comparison: the role of preprocessing
No preprocessing

19 Comparison: the role of preprocessing
No PCR duplicate removal

20 Comparison: the role of preprocessing
Standard

21 Comparison: the role of preprocessing (output)

22 Comparison: the role of preprocessing

23 Comparison: the role of preprocessing

24 Extended pipeline

25 Extended pipeline

26 BREAK

27 Differential expression and pathway / gene set enrichment analysis
Part 2: Differential expression and pathway / gene set enrichment analysis

28 Differential expression analysis
Quantities related to the degree of differential expression:
- Difference between mean expression levels: fold change (pay attention to the scale, linear or log).
- Statistical significance: p-value, adjusted p-value (e.g., FDR).
- Expression level magnitude (treat low-expressed genes with caution in the analysis).
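The first two quantities can be sketched as follows. This is an illustration under stated assumptions, not the method used in the presentation: the p-value comes from an exact permutation test (chosen because it needs only the standard library), the adjustment is Benjamini-Hochberg, the expression values are assumed to be log2-scaled already, and the gene names and numbers are made up.

```python
from itertools import combinations
from statistics import mean


def log2_fold_change(group_a, group_b):
    """Difference of mean log2 expression, group_b minus group_a."""
    return mean(group_b) - mean(group_a)


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for the difference in means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_b) - mean(group_a))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if abs(mean(b) - mean(a)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log2 expression values: (condition A, condition B) per gene.
genes = {
    "gene_up": ([1.0, 1.2, 0.9, 1.1], [4.0, 4.2, 3.9, 4.1]),
    "gene_flat": ([2.0, 2.1, 1.9, 2.0], [2.0, 1.9, 2.1, 2.0]),
}
lfc = {g: log2_fold_change(a, b) for g, (a, b) in genes.items()}
p_values = [permutation_p_value(a, b) for a, b in genes.values()]
adjusted = benjamini_hochberg(p_values)

for g, p, q in zip(genes, p_values, adjusted):
    print(f"{g}: log2FC={lfc[g]:.2f} p={p:.4f} FDR={q:.4f}")
```

With four samples per group the smallest attainable two-sided permutation p-value is 2/70, which shows how small sample sizes limit significance regardless of fold change.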

29 Differential expression analysis

30 Differential expression analysis

31 Gene set / pathway enrichment analysis
Possible options:
- Use only lists (thresholding required): one of the standard tools here is the Database for Annotation, Visualization and Integrated Discovery, DAVID ( d.ncifcrf.gov/).
- Take into consideration degrees of differential expression.
- Additionally take into consideration pathway topology.
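The list-based option can be illustrated with a plain hypergeometric over-representation test. This is a generic sketch and not DAVID's own scoring; the function name and the counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_set, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement from
    n_background genes, of which n_in_set belong to the gene set."""
    upper = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 background genes, a pathway with 5 genes,
# 4 genes passing the differential-expression threshold, 3 of them in the pathway.
p_enriched = hypergeom_enrichment_p(20, 5, 4, 3)
p_no_overlap = hypergeom_enrichment_p(20, 5, 4, 0)
print(f"p = {p_enriched:.4f}")
```

The result depends on the threshold used to build the gene list and on the choice of background set, which is the main limitation of list-only approaches.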

32 Gene set / pathway enrichment analysis

33 Gene set / pathway enrichment analysis

34 BREAK

35 Unsupervised analysis
Part 3: Unsupervised analysis

36 Unsupervised analysis: PCA

37 Unsupervised analysis: PCA

38 Unsupervised analysis: PCA

39 Unsupervised analysis: hierarchical clustering

40 Unsupervised analysis: hierarchical clustering

41 Unsupervised analysis: hierarchical clustering

42 Unsupervised analysis: hierarchical clustering

43 Unsupervised analysis: hierarchical clustering

44 Unsupervised analysis: hierarchical clustering

45 Unsupervised analysis: hierarchical clustering

46 Unsupervised analysis: hierarchical clustering

47 Unsupervised analysis: hierarchical clustering
Dendrogram

48 Unsupervised analysis: hierarchical clustering
Dendrogram
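How a dendrogram is built can be sketched with a small agglomerative procedure. The slides do not state the distance or linkage used, so Euclidean distance and average linkage are assumptions here, and the five two-gene samples are made up.

```python
from math import dist


def average_linkage(points):
    """Agglomerative clustering with average linkage and Euclidean distance.
    Returns the merge history as (members_a, members_b, distance) tuples."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for x in range(len(keys)):
            for y in range(x + 1, len(keys)):
                a, b = keys[x], keys[y]
                pair_sum = sum(dist(points[i], points[j])
                               for i in clusters[a] for j in clusters[b])
                d = pair_sum / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)  # merge b into a
    return merges


# Hypothetical samples described by two expression values each.
samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
merges = average_linkage(samples)
for left, right, height in merges:
    print(f"merge {left} + {right} at height {height:.2f}")
```

Each merge corresponds to one junction in the dendrogram, drawn at the height given by the merge distance.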

49 Unsupervised analysis: PCA (15 genes)

50 Unsupervised analysis: PCA (15 genes)

51 Unsupervised analysis: hierarchical clustering, 15 genes
Dendrogram

52 Unsupervised analysis: hierarchical clustering, 15 genes
Dendrogram; cluster labels: Luminal, C-low, N-like, Basal

53 Gene annotation: ENSG to Gene Symbols plus GO

54 Unsupervised analysis: K-means, 15 genes

55 Unsupervised analysis: K-means, 15 genes

56 Unsupervised analysis: K-means, 15 genes

57 Unsupervised analysis: K-means, 15 genes

58 Unsupervised analysis: K-means, 15 genes

59 Unsupervised analysis: K-means, 15 genes

60 Unsupervised analysis: K-means, 15 genes

61 Unsupervised analysis: K-means, 15 genes

62 Unsupervised analysis: K-means, 15 genes

63 Unsupervised analysis: K-means, 15 genes

64 Unsupervised analysis: K-means, 15 genes
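The iterations shown on these slides follow the usual alternation of assignment and centroid update. A minimal sketch, assuming Euclidean distance and explicitly supplied starting centroids (the function name and the toy points are hypothetical):

```python
from math import dist


def k_means(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical samples with two expression values each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = k_means(points, initial_centroids=[points[0], points[1]])
print(labels, centroids)
```

The result depends on the starting centroids, so in practice several random starts are typically compared.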

65 Unsupervised analysis: K-means, 15 genes
“The SUM52PE cell line was derived from a pleural effusion and was found to be negative for ER and PR expression, however the original primary tumor from this patient was positive for both hormone receptors”.
Chavez KJ, Garimella SV, Lipkowitz S. Triple negative breast cancer cell lines: one tool in the search for better treatment of triple negative breast cancer. Breast Dis. 2010; 32(1-2):35-48.
Ethier SP, Kokeny KE, Ridings JW, Dilts CA. erbB family receptor expression and growth regulation in a newly isolated human breast cancer cell line. Cancer Res. 1996; 56(4):

66 BREAK

67 Supervised analysis: classification
Part 4: Supervised analysis: classification

68 Supervised analysis: SVM with a linear kernel as an example

69 Supervised analysis: SVM with a linear kernel as an example

70 Supervised analysis: SVM with a linear kernel as an example

71 Supervised analysis: SVM with a linear kernel as an example

72 Supervised analysis: SVM with a linear kernel as an example

73 Supervised analysis: SVM with a linear kernel as an example
?

74 Supervised analysis: SVM with a linear kernel as an example
?

75 Supervised analysis: available methods
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
- Random Forest
- Support Vector Machine (SVM)
- Naïve Bayes

76 Supervised analysis: 15 genes

77 BREAK

78 Separation of TCGA and breast cancer PDX samples
Hands-on: Separation of TCGA and breast cancer PDX samples

79 Analysis of a subset of breast cancer PDX samples
Hands-on: Analysis of a subset of breast cancer PDX samples

