mRNASeq analysis using TCGA HNSC data
Vinay Kartha
Monti lab rotation project
11/25/2013
Expression data
mRNASeqv2 (Illumina HiSeq 2000)
Samples with data available: 340
Each sample has 6 associated files:
  junction_quantification.txt
  rsem.genes.results
  rsem.genes.normalized_results
  rsem.isoforms.results
  rsem.isoforms.normalized_results
  bt.exon_quantification.txt
Dataset reduction:
  Raw expression matrix: 20,531 genes
  Non-zero expression matrix: 20,200 genes
  Filtered expression matrix (CV >= 1.25): 7,091 genes
QC
Scatter plot of mean vs SD of expression for the non-zero expression data
Cutoff: CV = std dev / mean = 1.25
CV-filtered data: N = 340 samples; n = 7,091 genes
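The two reduction steps (dropping all-zero genes, then keeping genes with CV >= 1.25) can be sketched as below. This is a minimal illustration, not the code used for the slides: the `cv_filter` name, the dict-of-lists layout, the reading of "non-zero" as "not zero in every sample", and the use of the sample standard deviation are all assumptions.

```python
from statistics import mean, stdev

def cv_filter(expr, min_cv=1.25):
    """Keep genes whose coefficient of variation (sd / mean) is >= min_cv.

    expr maps a gene name to its expression values across samples
    (hypothetical layout). Genes that are zero in every sample are dropped
    first, which is one reading of the "non-zero expression matrix".
    """
    kept = {}
    for gene, values in expr.items():
        if all(v == 0 for v in values):
            continue  # all-zero gene: removed before computing CV
        m = mean(values)
        if m <= 0:
            continue  # CV is undefined for a non-positive mean
        if stdev(values) / m >= min_cv:  # sample SD (assumption)
            kept[gene] = values
    return kept

toy = {
    "geneA": [0, 0, 0, 0],      # all zero, dropped
    "geneB": [10, 11, 9, 10],   # low variability, CV well below 1.25
    "geneC": [0, 0, 0, 100],    # mean 25, SD 50, CV = 2.0
}
print(list(cv_filter(toy)))
```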
QC
Box plots of CV-filtered expression data across all samples
Panels: Log-transformed*, Asinh-transformed
* Pseudocount of 0.01 added
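The two transforms compared in these box plots can be sketched as follows. The slide does not state the log base; base 2 is assumed here because a log2 scale is used later for filtering. The function names are illustrative.

```python
import math

def log_transform(values, pseudocount=0.01, base=2):
    """log(x + pseudocount); the pseudocount keeps zero counts finite.

    Base 2 is an assumption; the slide only says "log-transformed".
    """
    return [math.log(v + pseudocount, base) for v in values]

def asinh_transform(values):
    """Inverse hyperbolic sine: defined at zero, so no pseudocount is needed."""
    return [math.asinh(v) for v in values]

counts = [0, 1, 10, 1000]
print(log_transform(counts))
print(asinh_transform(counts))
```

For large values asinh(x) approaches ln(2x), so the two transforms differ mainly near zero, where the choice of pseudocount dominates the log-transformed values.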
QC
Reference lines: CV = std dev / mean = 1.25; x = y
QC
Box plot of MAD-filtered expression data across all samples
Log-transformed*
* Pseudocount of 1 added
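A MAD-based filter can be sketched as below. The slide does not give the MAD cutoff, so this example ranks genes by median absolute deviation and keeps a caller-chosen number. `top_by_mad` and `n_keep` are hypothetical names, and the MAD is left unscaled (no 1.4826 consistency constant), which does not change the ranking.

```python
from statistics import median

def mad(values):
    """Unscaled median absolute deviation from the median."""
    med = median(values)
    return median([abs(v - med) for v in values])

def top_by_mad(expr, n_keep):
    """Return the n_keep genes with the largest MAD (hypothetical helper)."""
    ranked = sorted(expr, key=lambda gene: mad(expr[gene]), reverse=True)
    return ranked[:n_keep]

toy = {
    "geneA": [1, 1, 1, 1, 1],       # MAD 0
    "geneB": [1, 2, 3, 4, 100],     # MAD 1: one outlier barely moves it
    "geneC": [0, 10, 20, 30, 40],   # MAD 10
}
print(top_by_mad(toy, 1))
```

Unlike the CV, the MAD is robust to a single extreme sample, which is why geneB ranks below geneC here.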
Clustered gene expression profile
Sample clustering based on grade/stage?
See if expression is associated with clinical/phenotypic variables of interest
Grade:
  GX: Grade cannot be assessed (undetermined grade)
  G1: Well differentiated (low grade)
  G2: Moderately differentiated (intermediate grade)
  G3: Poorly differentiated (high grade)
  G4: Undifferentiated (high grade)
Stage:
  SI, SII, and SIII: Higher numbers indicate more extensive disease: larger tumor size and/or spread of the cancer beyond the organ in which it first developed to nearby lymph nodes and/or tissues or organs adjacent to the location of the primary tumor
  SIV: Cancer has spread to distant organs and tissues
For more information, see:
Sample clustering based on grade/stage?
Fisher’s exact test (k = 2)
Histological Grade
  Cluster   G1   G2   G3   G4   GX   NA   Total
  Total
  p = 2.98e-04 (< 0.05)
Pathological Stage
  Cluster   S1   S2   S3   S4A   S4B   NA   Total
  Total
  p = (< 0.05)
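A test of association between cluster membership and grade (or stage) on a table larger than 2 x 2 can be sketched as below. This is a Monte Carlo approximation to Fisher's exact test written with the standard library only, not the routine used for the slide: it permutes the grade labels (which keeps both margins fixed) and counts how often the permuted table is no more probable than the observed one. The function names and the toy labels are illustrative.

```python
import math
import random
from collections import Counter

def table_stat(rows, cols):
    """Sum of log(n_ij!) over the cells of the rows-by-cols contingency table.

    With both margins fixed, the table probability is proportional to
    1 / prod(n_ij!), so a larger value means a less probable table.
    """
    counts = Counter(zip(rows, cols))
    return sum(math.lgamma(n + 1) for n in counts.values())

def fisher_mc(rows, cols, n_perm=1000, seed=0):
    """Monte Carlo p-value for independence of two categorical labelings."""
    rng = random.Random(seed)
    observed = table_stat(rows, cols)
    shuffled = list(cols)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permuting labels preserves both margins
        if table_stat(rows, shuffled) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy labels: cluster assignment (k = 2) and grade for 40 samples.
cluster = [1] * 20 + [2] * 20
grade = ["G1"] * 20 + ["G3"] * 20   # cluster and grade perfectly aligned
print(fisher_mc(cluster, grade))
```

With 1000 permutations the smallest reportable p-value is about 1/1001, so very small exact p-values can only be bounded by this approach, not reproduced.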
Differential Expression with respect to Grade/Stage
340 samples (Total)
TCGA sample vial codes: 01A B 2 11A 37
Histological Grade distribution among samples:
  G1 30, G2 203, G3 87, G4 6, GX 13, NA 1
  G0 37, G1 25, G2 185, G3 77, G4 6, GX 9, NA 1
Pathological Stage distribution among samples:
  SI 18, SII 62, SIII 46, SIVA 162, SIVB 6, NA 46
  S0 37, SI 16, SII 47, SIII 41, SIVA 147, SIVB 6, NA 46
Differential Expression with respect to Grade/Stage
Cannot adjust expression for certain factors (Race/Ethnicity) due to missing phenotypic information
Remove samples with missing Grade/Stage information and samples from non-white patients
Grade:
  Before: G0 37, G1 25, G2 185, G3 77, G4 6, GX 9, NA 1
  After:  G0 32, G1 24, G2 156, G3 68, G4 6 (Total = 286)
Stage:
  Before: S0 37, SI 16, SII 47, SIII 41, SIVA 147, SIVB 6, NA 46
  After:  S0 32, SI 14, SII 42, SIII 34, SIVA 130, SIVB 3 (Total = 255)
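The sample filter described above can be sketched as follows. The record layout, the field names (`grade`, `stage`, `race`), and the `"WHITE"` label are hypothetical; the actual TCGA clinical annotation may encode these differently. GX and NA are both treated as missing, consistent with the reduced grade counts containing neither.

```python
MISSING = {None, "NA", "GX"}  # values treated as uninformative (assumption)

def filter_samples(samples, variable, race_field="race", keep_race="WHITE"):
    """Keep samples with a usable value for `variable` and the retained race label.

    `samples` is a list of dicts (hypothetical layout); `variable` is
    "grade" or "stage". Samples with missing race are also dropped.
    """
    return [
        s for s in samples
        if s.get(variable) not in MISSING and s.get(race_field) == keep_race
    ]

samples = [
    {"id": "s1", "grade": "G2", "stage": "SII", "race": "WHITE"},
    {"id": "s2", "grade": "GX", "stage": "SIVA", "race": "WHITE"},
    {"id": "s3", "grade": "G3", "stage": "NA", "race": "WHITE"},
    {"id": "s4", "grade": "G1", "stage": "SI", "race": None},
]
print([s["id"] for s in filter_samples(samples, "grade")])
print([s["id"] for s in filter_samples(samples, "stage")])
```

Filtering separately per variable matches the two different totals on the slide (286 for grade, 255 for stage).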
Adjust for gender?
Panels: DE wrt Grade, DE wrt Stage
No need to adjust for gender when it is associated with very few genes
Percentile-based gene filtering prior to DE testing
Further reduce the gene space prior to DE testing by filtering on each gene's 90th percentile of expression
Choose the threshold on log2(90th percentile) so that the number of genes is roughly halved
Grade (N = 286): 90th percentile >= 10.5, n = 5046
Stage (N = 255): 90th percentile >= 10.5, n = 5019
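The percentile filter can be sketched as below, assuming the 10.5 threshold is applied to log2 of each gene's 90th percentile across samples. The interpolation rule for the percentile and the function names are assumptions; a different percentile definition would shift borderline genes.

```python
import math

def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in 0..100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def percentile_filter(expr, q=90, min_log2=10.5):
    """Keep genes whose log2(q-th percentile of expression) is >= min_log2."""
    kept = {}
    for gene, values in expr.items():
        p = percentile(values, q)
        if p > 0 and math.log2(p) >= min_log2:
            kept[gene] = values
    return kept

toy = {
    "geneHigh": [2000] * 10,          # log2(2000) is about 10.97
    "geneLow": [100] * 10,            # log2(100) is about 6.64
    "geneSpike": [0] * 9 + [5000],    # 90th percentile is 500
}
print(list(percentile_filter(toy)))
```

Filtering on a high percentile rather than the mean keeps genes that are strongly expressed in only a subset of samples, provided that subset is larger than about 10% of the cohort.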
Differential Expression testing
Perform DE wrt Grade (N = 286; n = 5046) and Stage (N = 255; n = 5019)
  Tumor vs Normal (G0 vs G1+; S0 vs S1+)
  Within Grade/Stage comparison (G1 vs G2+; S2- vs S3+; excluding controls)
Permutation-based t-test with sliding ‘time-points’ and sample pooling
  S3- vs S4A+ => (S1 + S2 + S3) vs (S4A + S4B)
‘diffanal’ function from diffanal.R (CBM repository)
Number of permutations: 1000
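The sliding split with pooling can be sketched for a single gene as below. This is not the `diffanal` implementation, whose internals are not shown on the slides; it is a stand-alone illustration that pools the ordered levels on each side of every split point and computes a two-sided permutation p-value from a Welch t-statistic. All function names are illustrative.

```python
import math
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch t-statistic; each group needs at least two values."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def perm_pvalue(a, b, n_perm=1000, seed=0):
    """Two-sided permutation p-value for the difference between groups a and b."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def sliding_de(values_by_level, levels, n_perm=1000):
    """Test each split of the ordered levels, pooling samples on each side."""
    results = {}
    for i in range(1, len(levels)):
        lower, upper = levels[:i], levels[i:]
        a = [v for lev in lower for v in values_by_level[lev]]
        b = [v for lev in upper for v in values_by_level[lev]]
        results[f"{lower[-1]}- vs {upper[0]}+"] = perm_pvalue(a, b, n_perm)
    return results

# One gene's expression by stage (toy values); levels are given in clinical order.
expr = {"S1": [1.0, 1.2, 0.9], "S2": [1.1, 0.8, 1.0],
        "S3": [5.0, 5.2, 4.9], "S4A": [5.1, 4.8, 5.3]}
print(sliding_de(expr, ["S1", "S2", "S3", "S4A"]))
```

In a genome-wide run the same permutations would be applied to every gene and the p-values corrected for multiple testing; both are omitted here for brevity.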
DE testing by grade
  Comparison    No. DE genes
  G0 vs G1+
  G1 vs G2+     617
  G2- vs G3+    943
  G3- vs G4     456
DE testing by stage
  Comparison      No. DE genes
  S0 vs S1+       327
  S1 vs S2+       0
  S2- vs S3+      0
  S3- vs S4A+     0
DE genes: G0 vs G1+
DE genes: G1 vs G2+
DE genes: G2- vs G3+
DE genes: G3- vs G4
AhR targets
Variation of expression across grade: DPAGT1
Variation of expression across grade: TAZ
Variation of expression across grade: YAP1
Variation of expression across grade: PDGFRB
Sliding windows
The tool takes the Time groups in the order in which they appear in the ‘Time’ column
It does not order the groups by the levels of the Time factor
For example:
  Time
  G3 G2 … …
  Levels: G3 G2 …
This results in incorrect ordering of groups prior to sliding-window DE testing
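The ordering problem and one possible guard can be sketched as below. This is a Python illustration of the idea, not a patch to the tool: instead of taking groups in order of first appearance, the caller supplies the intended level order explicitly. Function names and toy labels are illustrative.

```python
def groups_in_appearance_order(time):
    """Groups in the order they first appear in the column (the problematic behavior)."""
    return list(dict.fromkeys(time))

def groups_in_level_order(time, levels):
    """Groups in the explicitly supplied level order; unknown labels are rejected."""
    present = set(time)
    unknown = present - set(levels)
    if unknown:
        raise ValueError(f"labels not in levels: {sorted(unknown)}")
    return [lev for lev in levels if lev in present]

time = ["G3", "G2", "G1", "G2", "G3"]
print(groups_in_appearance_order(time))                        # ['G3', 'G2', 'G1']
print(groups_in_level_order(time, ["G1", "G2", "G3", "G4"]))   # ['G1', 'G2', 'G3']
```

With appearance order, a split after the first group would compare G3 against G2 and G1 pooled rather than G1 against the higher grades, which is the mis-ordering described above.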
Future work
Perform GSEA/hyper-enrichment and pathway analyses
Perform oral cancer-specific analyses
Restrict anatomic sub-types to include only:
  Alveolar Ridge
  Base of tongue
  Buccal Mucosa
  Floor of mouth
  Hard Palate
  Hypopharynx
  Larynx
  Lip
  Oral cavity
  Oral tongue
  Oropharynx
  Tonsil