Driver mutations – Epigenetics – Transcriptomics

CrossHub 2: identifying associations between driver mutations, transcriptomic and epigenomic tumor-specific changes

G.S. Krasnov, A.A. Dmitriev, A.V. Snezhkina, V.N. Senchenko, A.V. Kudryavtseva
Engelhardt Institute of Molecular Biology RAS, Vavilova 32, Moscow, Russia

Abstract

The contribution of different mechanisms to the regulation of gene expression varies between tissues and tumors. Complementing predicted mRNA-miRNA and gene-transcription factor (TF) relationships with expression correlation analyses for specific tumor types highlights the interactions that are likely to have a functional impact in the biomaterial under study. We developed CrossHub software, which identifies the most probable TF-gene interactions in two ways: from ENCODE ChIP-Seq binding evidence or Jaspar predictions, and from co-expression in the data of The Cancer Genome Atlas (TCGA) project, the largest cancer omics resource. Similarly, CrossHub identifies mRNA-miRNA pairs that have predicted or validated binding sites (TargetScan, mirSVR, PicTar, DIANA microT, miRTarBase) and strong negative expression correlations. We observed partial consistency between ChIP-Seq or miRNA target predictions and gene-TF/miRNA co-expression, which demonstrates a link between these indicators. Additionally, CrossHub expression-methylation correlation analysis can be used to identify hypermethylated CpG sites or regions with the greatest potential impact on gene expression. Thus, CrossHub can outline the molecular portrait of a specific gene and assess the three most common sources of expression regulation: promoter/enhancer methylation, miRNA interference, and TF-mediated activation or repression.
The recently developed CrossHub 2 offers an algorithm that evaluates overall CpG methylation density in promoter and/or enhancer regions and links it with the presence of driver mutations; it also links gene expression changes with the presence of driver mutations. Using RTrans, an R script pack for the analysis of transcriptomic data, we were able to link these shifts with alterations in the most important biological processes (cell cycle, cell adhesion, metabolic reprogramming, etc.). According to our preliminary results, several cancer types (breast, colorectal, bladder, etc.) have a highly variable CpG hyper- and hypo-methylation profile across tumor specimens. This distribution strongly depends on the tumor subtype (for example, microsatellite instability or the CpG island methylator phenotype, CIMP, in colorectal cancer), the presence of driver mutations, and the overall mutation frequency in the genome. These results allow us to suggest a tumor classification method at the epigenetic level. It is based on the frequency of CpG hyper-/hypo-methylation in genomic regions of interest, which are selected according to the ENCODE genome state segmentation data.

CrossHub is freely available at https://sourceforge.net/projects/crosshub/.

CrossHub workflow

Complementing ENCODE ChIP-Seq data and Jaspar predictions with TCGA expression correlation analysis identifies gene-transcription factor interactions with a potential functional impact in a specific cancer subtype. Similarly, combining miRNA target predictions with gene-miRNA expression correlation profiling outlines the gene-miRNA interactions that likely take place in a particular tumor type. Expression-methylation correlation analysis identifies hypermethylated CpG sites or regions within promoters or enhancers (annotated with ENCODE) that may have the greatest impact on gene expression.
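The expression-methylation correlation step of the workflow can be illustrated with a minimal sketch. It is not CrossHub's own code: the function names, the CpG identifiers, and the toy values below are all hypothetical, and the only assumption taken from the text is that CpG sites are ranked by the correlation between their methylation (beta-values) and the expression of the gene across samples, with the most negative Spearman coefficients being of greatest interest.

```python
from math import sqrt

def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns None if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rs computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def rank_cpgs_by_expression_link(expression, beta_by_cpg):
    """Return (cpg_id, rs) pairs, most negative rs first; constant CpGs are skipped."""
    scored = []
    for cpg_id, betas in beta_by_cpg.items():
        rs = spearman(betas, expression)
        if rs is not None:
            scored.append((cpg_id, rs))
    return sorted(scored, key=lambda pair: pair[1])

# Toy data (hypothetical): expression of one gene in five samples and
# beta-values of three CpG sites in the same samples.
expression = [9.1, 7.4, 5.0, 3.2, 1.5]
beta_by_cpg = {
    "cg_A": [0.10, 0.20, 0.50, 0.70, 0.90],  # methylation rises as expression falls
    "cg_B": [0.50, 0.40, 0.60, 0.50, 0.45],  # no clear trend
    "cg_C": [0.90, 0.80, 0.60, 0.30, 0.20],  # methylation follows expression
}
ranked = rank_cpgs_by_expression_link(expression, beta_by_cpg)
```

In this toy input, `cg_A` comes first in `ranked` because its methylation is perfectly anti-correlated with expression. A real analysis would also need a significance estimate and multiple-testing correction, which the sketch omits.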
In addition, CrossHub enables conventional differential expression and methylation analysis. CrossHub identifies CIMP-like tumors based on promoter/enhancer CpG hyper-/hypo-methylation profiles. It selects the top 1%, 5% or 10% most hyper-methylated or most differentially methylated promoter or enhancer CpG sites, calculates the median methylation level of these marker sites in each sample, and uses this value as a measure of the overall methylation level.

CrossHub uses the MutSig algorithm to identify driver mutations among the pool of somatic ones. It creates a list of the top 100 driver genes for each TCGA cancer type and a list of the top 100 common drivers. It can then find associations between driver mutations, CIMP-like epigenetic features and transcriptomic changes, including alterations in biological processes (Gene Ontology) and cellular pathways (KEGG/Reactome). Additional RTrans scripts for R are required for this analysis (freely available upon request).

Driver mutations underlie malignant cell transformation and determine further tumor development pathways by causing transcriptomic shifts and epigenetic alterations.

Driver mutations – Epigenetics – Transcriptomics

Panel titles: Gene expression alterations in hyper-methylated tumors; Driver mutations – epigenetics associations; Driver mutations are associated with transcriptomic changes.

This figure illustrates the associations between driver mutations and transcriptomic changes in melanoma (SKCM). Plots in the frames show gene expression alterations between the group of samples that carry any somatic mutation in the presented drivers and the remaining specimens.

Alterations in p53 pathways that are associated with driver mutations in the MLL gene.

This table presents Spearman correlation coefficients between the presence of nonsynonymous mutations in potential driver genes (MutSig) and tumor epigenetic features, e.g. the overall hyper-methylation rate, which is assessed using different methods. TCGA colon cancer dataset.
“2xProm” means that CrossHub considers only CpG sites that are annotated as “promoter” in at least 2 out of 6 cell lines (ENCODE data). “DiffMeth / Top.10” means that CrossHub considers only the top 10% differentially methylated CpG sites. “90 perc.” means that CrossHub calculates the 90th percentile of beta-values for the chosen CpG sites (the beta-value is the fraction of methylated DNA at a given CpG position in a given sample) and analyzes the correlation between these 90th percentiles and the presence of driver mutations.

This figure illustrates the associations between tumor epigenetic features (CpG hyper-methylation levels) and transcriptomic shifts. Plots in the frames show gene expression alterations between groups of hyper-methylated TCGA tumor samples and the other specimens. Colorectal cancer (COAD) and lung squamous cell carcinoma (LUSC) show the most prominent transcriptomic alterations, especially in genes participating in the cell cycle, response to stress, developmental processes, cell motility and migration. Cell border color indicates whether a GO term is enriched with up- or down-regulated genes in hyper-methylated samples.

Axis label: expression LogFC, mut/wt (TCGA melanoma dataset, SKCM).

Predicting microRNA and transcription factor targets

Panel titles: Predicting common miRNA regulators; Predicting microRNA targets; Predicting transcription factor targets.

Here is an example of how CrossHub identifies common microRNA regulators. Spearman correlation coefficients between microRNA and target gene expression levels are presented. Bars indicate microRNA binding site prediction scores (TargetScan and other resources). The cluster of oncogenic miR-96, miR-182 and miR-183 is highlighted as the most probable regulator of the small serine CTD phosphatase (SCP) gene family and their potential target Rb (RB1).
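The selection of common miRNA regulators described above can be sketched as a simple filter. This is an illustration under stated assumptions, not CrossHub's implementation: the function name, the thresholds (`max_rs`, `min_score`, `min_genes`), and the placeholder miRNA and gene names are hypothetical, and the input values are invented for the example rather than taken from any dataset.

```python
from collections import defaultdict

def common_regulators(pairs, max_rs=-0.3, min_score=50, min_genes=2):
    """Find miRNAs that pass both filters for several genes of a family.

    pairs: iterable of (mirna, gene, rs, score), where rs is the Spearman
    correlation of miRNA and gene expression and score is a binding-site
    prediction score. A pair is kept if rs <= max_rs (strong negative
    co-expression) and score >= min_score (binding site predicted).
    All three thresholds are assumptions chosen for illustration.
    """
    supported_genes = defaultdict(set)
    for mirna, gene, rs, score in pairs:
        if rs <= max_rs and score >= min_score:
            supported_genes[mirna].add(gene)
    return {
        mirna: sorted(genes)
        for mirna, genes in supported_genes.items()
        if len(genes) >= min_genes
    }

# Hypothetical input with placeholder names; the numbers are not results.
pairs = [
    ("miR-X", "GENE_A", -0.45, 80),
    ("miR-X", "GENE_B", -0.38, 65),
    ("miR-X", "GENE_C", -0.10, 90),  # binding predicted, but weak correlation
    ("miR-Y", "GENE_A", -0.50, 20),  # strong correlation, but low binding score
    ("miR-Y", "GENE_B", -0.35, 70),
]
shared = common_regulators(pairs)
```

With this toy input only `miR-X` is returned, supported by `GENE_A` and `GENE_B`; `miR-Y` passes both filters for a single gene and is therefore dropped. Requiring agreement of both lines of evidence for several family members is one simple way to express the idea of a "common regulator".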
Distribution of genes over two parameters: ENCODE ChIP-Seq transcription factor (TF) binding score and Spearman gene-TF expression correlation coefficient (rs; colon cancer TCGA dataset). Two gene sets were analyzed: genes participating in glucose transport and metabolism (top) and genes encoding extracellular proteins (bottom). Genes with no ChIP-Seq evidence of TF binding are assigned a zero score. Circle size is proportional to the square root of the total read count for a gene. Circle color indicates the change in gene expression level in the tumor. The analysis was performed for two TFs that are strongly up-regulated in colon cancer: the well-known oncogenic protein Myc and CBX3, which is less extensively studied in the context of cancer. We compared the distributions of rs between genes that passed and did not pass score thresholds (TS). Several TS were selected: > 0 (any positive score) and the 25th, 50th, 75th, and 90th score percentiles. Vertical dashed lines indicate the mean values of rs for these groups. For each TS we observed a statistically significant difference between the distributions of rs, which indicates a link between these two characteristics: ChIP-Seq score and TF-gene co-expression.

Conventional features

CrossHub also performs conventional RNA-Seq differential expression analysis and calculates gene hyper- and hypo-methylation scores separately for promoters and enhancers. Correlation analysis with the clinical characteristics of specimens can also be performed. These examples of Excel workbooks were generated with CrossHub.

Distribution density of gene-microRNA pairs over expression level correlation coefficients (rs) and miRNA binding site scores according to TargetScan (A, B), DIANA microT (C), miRTarBase (D, E) and the overall score according to several algorithms (F). TargetScan (conservative sites) and DIANA microT showed the greatest mean rs bias among the analyzed prediction algorithms. This suggests that these databases are the most informative.
Distribution density is slightly asymmetrical for these databases, especially for high scores (d = 0.02-0.045). These areas are marked with an arrow. miRNA-gene relationships predicted by several algorithms showed a greater rs bias (d = 0.109; overall score range 100-225; F). This region likely contains the largest number of true miRNA-gene relationships with functional impact. Another region with a significant rs bias (d = 0.052, overall score > 400) mainly includes miRNA-gene relationships with strong experimental evidence, or with weak evidence coupled to miRNA binding site prediction by one or more algorithms.

Published in: Krasnov GS, Dmitriev AA, Melnikova NV, Zaretsky AR, Nasedkina TV, Zasedatelev AS, Senchenko VN, Kudryavtseva AV. CrossHub: a tool for multi-way analysis of The Cancer Genome Atlas (TCGA) in the context of gene expression regulation mechanisms. Nucleic Acids Res. 2016 Apr 20;44(7):e62. doi: 10.1093/nar/gkv1478.

For any questions please contact George Krasnov, gskrasnov@mail.ru