Background
Functional interdependencies among genes, transcription factors, and other molecular effectors define networks characterized by small-world properties and a strong community structure. Detecting disease-associated sub-networks is essential for understanding complex diseases. The task is not trivial, because at present it is difficult to validate putative disease modules biologically. Three module identification approaches have been proposed: 1) network-based; 2) expression-based; 3) pathway-based [1,2]. The first relies on network topology: modules are subsets of vertices that are densely connected to each other and sparsely connected to the remaining nodes. The second uses gene expression data to infer modules of genes with similar expression patterns, for example by clustering. The third detects expression changes in biological pathways, i.e. groups of genes that accomplish specific biological functions. Here we propose a new method that combines the strengths of all three approaches. Starting from gene expression data and the information contained in biological pathways, a novel algorithm detects perturbed functional communities. Perturbation is evaluated by multiple-group structural equation modeling (SEM), a statistical methodology for testing causal models. SEM has previously been applied in downstream analyses of gene expression data to determine significant edges and nodes in pathway models [3,4].

1. SELECTION OF THE KEY GENE SET
The key gene set can be the list of differentially expressed genes (DEGs) obtained from microarray or RNA-seq experiments, a list of genes in which a mutation was found, or an a priori list of genes associated with a disease.

2. BIOLOGICAL NETWORK SELECTION
Any type of directed network that includes the key gene set is acceptable, for example a network of pathways (KEGG, Reactome, etc.) or a protein-protein interaction network (STRING, BioGRID, etc.).

3. SHORTEST PATH MODEL (SPM)
Starting from the key gene set and the biological network in which the key genes act, all the shortest paths between every pair of key genes were determined. The shortest paths may include genes that are not in the key gene set. All the shortest paths were then merged to obtain the shortest path model.

4. SPM COMMUNITIES
Many community detection algorithms could be used (minimum-cut, hierarchical clustering, information-based algorithms, etc.). We used the walktrap community finding algorithm [5], which is based on random walks.

5. TESTING COMMUNITIES BY SEM
Each community can be seen as a causal model, so SEM can be used to verify which communities differ significantly between the biological conditions being compared. The covariance structures of the two groups were compared by a likelihood ratio test (LRT).

6. PUTATIVE DISEASE MODULE
The communities that were statistically significant between the biological groups (p-value < 0.05) were joined into a single network, the putative disease module, which represents the perturbed areas of the original network. The module was tested by Disease Ontology (DO) enrichment analysis [4].

Workflow figure: gene set and biological pathways, shortest path model (SPM), community detection, SEM testing, final disease module.

RESULTS
The algorithm was applied to a gene expression microarray experiment from GEO (GSE52471). The experiment aimed to identify signaling pathways and cellular signatures that may be targeted for the treatment of discoid lupus erythematosus (DLE). DLE is the most common skin manifestation of lupus.
Although DLE is frequent both in systemic lupus and in patients without extracutaneous manifestations, targeted treatments for DLE are lacking. Two groups were compared: patients affected by DLE (7 samples) and controls (13 samples). A differential analysis using a t-test with Benjamini-Hochberg correction identified 1371 DEGs. On these, a pathway analysis with Signaling Pathway Impact Analysis (SPIA) revealed 25 perturbed KEGG pathways, such as cytokine-cytokine receptor interaction, chemokine signaling pathway, and systemic lupus erythematosus. All 25 KEGG pathways were merged into a single graph with 1580 nodes and 7265 edges. The SPM was a graph of 331 nodes and 868 edges. Community analysis with the walktrap algorithm revealed 35 communities. Each community was tested by SEM to detect which areas of the original network are actually perturbed. 5 communities were found to be differently regulated between DLE samples and controls. These communities were joined to give a global picture of the perturbed areas, yielding a graph of 182 nodes and 436 edges. The final disease module was validated in two ways: by enrichment analysis based on Disease Ontology (DO) terms, which identified the diseases associated with the genes in the module, and by literature validation, to verify whether the genes in the final module had already been associated with DLE.

Figure panels: SPM communities; putative DLE module; enrichment analysis.

Conclusion
The procedure illustrated here aims to identify putative disease modules. The method builds on two characteristics of biological networks, community structure and the small-world phenomenon. Starting from a gene set of interest and the biological network in which the genes are involved, the SPM and then its community structure are determined. Each community is evaluated by SEM to detect which areas of the network are differently regulated between the biological groups. We tested the procedure on gene expression data from DLE. 1371 DEGs and 25 KEGG pathways were detected. Four significant communities were unveiled. The DO enrichment analysis highlighted diseases such as lupus erythematosus, melanoma, and ulcerative colitis, which share similar biological mechanisms that lead to different phenotypes. The module contains genes that are fundamental for the manifestation of the disease, such as TNF or MYD88. Given these encouraging results, the procedure will be tested on other gene expression data.

Reference
[1] Segal, E et al. Nature Genetics (2005), 37: 38-45.
[2] Wang, X et al. Current Opinion in Biotechnology (2008), 19(5): 482-491.
[3] Pepe, D & Grassi, M. BMC Bioinformatics (2014), 15(1): 132.
[4] Pepe, D & Hwan Do, J. BioChip Journal (2015), 1:1-8.
[5] Pons, P & Latapy, M. Computer and Information Sciences-ISCIS (2005), 284-293.

Acknowledgement
Thanks to BAMBI, FP7 project (GA 618024). Travel funding to ISMB/ECCB 2015 was generously provided by ECCB.
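
Illustrative sketch of the shortest path model (SPM) step
The SPM step can be illustrated with a minimal sketch, assuming the biological network is available as a list of unweighted directed edges. A breadth-first search from each key gene records every predecessor lying on a shortest path, and the edges of all shortest paths between pairs of key genes are merged. The function name `shortest_path_model`, the toy edge list, and the single-letter gene labels are hypothetical and chosen only for illustration; this is not the implementation used for the analysis above.

```python
from collections import deque


def shortest_path_model(edges, key_genes):
    """Return the union of edges lying on any shortest directed path
    between two distinct key genes (illustrative sketch of the SPM step)."""
    # Build adjacency lists for the directed, unweighted network.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])

    # Ignore key genes that do not appear in the network.
    keys = [g for g in key_genes if g in adj]
    spm_edges = set()
    for source in keys:
        # Breadth-first search that records every predecessor
        # lying on a shortest path from `source`.
        dist = {source: 0}
        preds = {source: []}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    preds[v] = [u]
                    queue.append(v)
                elif dist[v] == dist[u] + 1:
                    preds[v].append(u)
        # Walk back from each reachable key gene and collect path edges.
        for target in keys:
            if target == source or target not in dist:
                continue
            stack, seen = [target], {target}
            while stack:
                v = stack.pop()
                for p in preds[v]:
                    spm_edges.add((p, v))
                    if p not in seen:
                        seen.add(p)
                        stack.append(p)
    return spm_edges


# Toy directed network (hypothetical labels, not real pathway data).
toy_edges = [
    ("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E"),
    ("A", "X"), ("X", "Y"), ("Y", "Z"), ("Z", "E"), ("E", "W"),
]
spm = shortest_path_model(toy_edges, ["A", "C", "E"])
print(sorted(spm))
```

In this toy network, B and D are not key genes but enter the model because they lie on shortest paths between key genes, whereas the longer route through X, Y, and Z is left out.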