Functional annotation and identification of candidate disease genes by computational analysis of normal tissue gene expression data

L. Miozzi 1, U. Ala 1, R. Piro 2, F. Rosa 3, F. Di Cunto 1 and P. Provero 1
1 Dipartimento di Genetica, Biologia e Biochimica, Università di Torino, Torino, Italy; 2 INFN, Sezione di Torino, Torino, Italy; 3 ISI Foundation, Torino, Italy

Introduction
Among the open problems of molecular biology in the post-genomic era, the functional annotation of the human genome and the identification of genes involved in genetic diseases are especially important. Expression data on a genomic scale have been available for several years thanks to a set of new experimental techniques, and are widely believed to contain much information relevant to the solution of such problems. Here we present the results of a computational analysis of publicly available expression data on human normal tissues, based on the integration of data obtained with the two most important experimental platforms (microarrays and SAGE) and of different measures of dissimilarity between expression profiles. The building blocks of the procedure are the Gene Expression Neighborhoods (GENs), small sets of tightly coexpressed genes which are analyzed in terms of functional annotation and relevance to human diseases. This analysis provides putative functional annotations for many genes, and identifies promising candidate disease genes for experimental verification.

The "guilt by association" principle
The presented work is based on the following principle: "since there is a strong correlation between coexpression and functional relatedness, a gene found to be coexpressed with several others involved in the same biological process can be putatively given the same functional annotation" (Brazma A. and Vilo J., 2000, FEBS Lett. 480:17-24).
Method
In this work we analyze publicly available expression data on human normal tissues obtained with Affymetrix microarrays and with SAGE (Serial Analysis of Gene Expression). We considered 158 experiments concerning genes for Affymetrix and 62 experiments concerning genes for SAGE.

Different measures of dissimilarity between expression profiles have been defined and integrated: Euclidean distance and Pearson linear dissimilarity for the microarray data; Euclidean distance and a dissimilarity measure based on the Poisson distribution (developed in a different context by Van Helden J., 2004, Bioinformatics 20(3)) for the SAGE data.

The unit of functional analysis, named Gene Expression Neighborhood (GEN), has been defined as a gene plus its k nearest expression neighbors, with k typically a rather small number (the results we report were obtained with k=6). For each dataset and each choice of dissimilarity measure we identified a number of GENs equal to the number of genes represented in the dataset.

A GEN was considered functionally characterized if at least one Gene Ontology (GO) term was shared by the majority (K) of its genes (K=4 of the 7 genes in the results presented). To avoid overly generic GO terms, the analysis has been limited to terms shared by no more than a given maximum number M of genes in the whole experimental dataset under investigation (M=300 in the results presented). This limit ensures that the majority rule used to define functionally characterized GENs automatically implies statistically significant overrepresentation of the GO term involved.

The false discovery rate for the functionally characterized GENs has been estimated as follows: random GENs have been generated by reshuffling the gene names in the whole dataset (thus preserving the characteristics of the actual GENs, such as their degree of self-overlapping) and subjected to the same functional analysis.
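The GEN construction, the majority rule and the reshuffling of gene names can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the code used for the results: all function names, the toy expression profiles and the GO labels ("GO:A", "GO:B") are hypothetical, and the toy run uses k=2 and K=2 instead of k=6 and K=4 so that it fits a six-gene example.

```python
import math
import random
from collections import Counter

def pearson_dissimilarity(x, y):
    """1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # assumption: a flat profile is treated as uncorrelated
    return 1.0 - cov / (sx * sy)

def build_gens(profiles, dissimilarity, k=6):
    """One GEN per gene: the seed gene plus its k nearest expression neighbors."""
    gens = {}
    for seed, profile in profiles.items():
        others = sorted((g for g in profiles if g != seed),
                        key=lambda g: dissimilarity(profile, profiles[g]))
        gens[seed] = [seed] + others[:k]
    return gens

def characterize(gens, go_terms, K=4, M=300):
    """Return {seed: GO terms shared by at least K genes of its GEN}.

    Terms annotated to more than M genes of the dataset are ignored.
    """
    term_size = Counter(t for g in gens for t in go_terms.get(g, ()))
    characterized = {}
    for seed, members in gens.items():
        counts = Counter(t for g in members for t in go_terms.get(g, ()))
        shared = {t for t, c in counts.items() if c >= K and term_size[t] <= M}
        if shared:
            characterized[seed] = shared
    return characterized

def shuffled_annotations(go_terms, genes, rng):
    """Null model: reassign each annotation set to a randomly permuted gene name."""
    permuted = list(genes)
    rng.shuffle(permuted)
    return {new: go_terms.get(old, set()) for old, new in zip(genes, permuted)}

# Toy data (hypothetical): six genes measured in four tissues.
profiles = {
    "g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [1.1, 2.1, 2.9, 4.2],
    "g4": [4, 3, 2, 1], "g5": [8, 6, 4, 2], "g6": [1, 3, 2, 4],
}
go_terms = {"g1": {"GO:A"}, "g2": {"GO:A"}, "g4": {"GO:B"}, "g5": {"GO:B"}}

gens = build_gens(profiles, pearson_dissimilarity, k=2)
characterized = characterize(gens, go_terms, K=2, M=300)

# Same analysis on reshuffled gene names; the GEN structure is unchanged.
null_terms = shuffled_annotations(go_terms, list(profiles), random.Random(0))
null_characterized = characterize(gens, null_terms, K=2, M=300)
```

In this toy run the unannotated gene g3 falls in a GEN whose other two genes share "GO:A", which is the situation in which a putative annotation would be proposed. Comparing the number of characterized GENs in the real and reshuffled runs, averaged over many reshufflings, gives a rough estimate of the false discovery rate.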
A leave-one-out analysis has been performed to estimate how many known annotations the method can correctly recover.

Characterized GENs have been used to determine putative new functional annotations: for each functionally characterized GEN and for each GO term associated with it (shared by the majority of its genes), the same GO term has been putatively attributed to the genes in the GEN not yet associated with it.

Finally, we looked for functionally characterized GENs containing at least 3 genes associated with a genetic disease in the OMIM database. When the relevant OMIM entries were related to each other, the genes in the GEN not associated with OMIM entries have been considered interesting candidates for involvement in similar pathologies.

Workflow diagram (boxes): Publicly available expression data (Microarrays, SAGE); Integration of different quantitative measures of dissimilarity between expression profiles; Identification of Gene Expression Neighborhoods (GEN); GEN functional analysis using the controlled annotation vocabulary Gene Ontology; Estimation of false discovery rate; Leave-one-out; Putative new GO functional annotations; Integration with OMIM data; Potential new disease genes (OMIM).

Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray: 428788/
SAGE: 50/
Microarray+SAGE:

Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray: 318546/
SAGE: 48/ 82
Microarray+SAGE:

Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray: /
SAGE: 188/
Microarray+SAGE:

Euclidean | Pearson | Poisson | Euclidean+Pearson | Euclidean+Poisson
Microarray: 569950/
SAGE: 173/
Microarray+SAGE:

Conclusion
We have developed an approach to analyze and integrate information obtained with different experimental techniques and with different definitions of dissimilarity, which together explore several aspects of coexpression. The results demonstrate that this integration increases the amount of useful information obtained.
Results
The leave-one-out analysis showed that the method would have correctly identified 1026 GO annotations involving 644 genes and 94 GO terms (see Table 1).

Table 1 - Leave-one-out analysis results showing the number of GO annotations (a) and annotated genes (b) correctly identified.

Table 2 - Number of putative new functional GO annotations (c) and newly annotated genes (d).

Different dissimilarity measures describe different aspects of coexpression, which correlate with different kinds of functional annotation (see Tables 1 and 2): only a small fraction of GO annotations is predicted by more than one combination of dataset and dissimilarity measure. The distribution of GO terms among the three Gene Ontology branches also changes significantly across these combinations, showing that different combinations capture different aspects of coexpression.

We have obtained 2113 putative new GO annotations involving 1540 genes and 194 GO terms (see Table 2).

Fig. 1 - Distribution of the correctly identified GO annotations among the three GO branches (Biological process; Molecular function; Cellular component).

The integration of our functional annotation results with the OMIM database allowed us to identify at least 59 interesting candidate genes potentially involved in human genetic diseases (see Table 3).

Table 3 - List of candidate genes potentially involved in human genetic diseases.
Dataset | Disease | Gene
Microarray+Pearson | ACROMEGALOID FEATURES, OVERGROWTH, CLEFT PALATE, AND HERNIA | ENSG
Microarray+Pearson | AORTIC ANEURYSM, FAMILIAL THORACIC 1 | ENSG
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG
Microarray+Pearson | CHARCOT-MARIE-TOOTH DISEASE, AXONAL, TYPE 2G; CMT2G | ENSG
Microarray+Pearson | CHARCOT-MARIE-TOOTH DISEASE, DOMINANT INTERMEDIATE A | ENSG
Microarray+Pearson | CONVULSIONS, BENIGN FAMILIAL INFANTILE, 2 | ENSG
Microarray+Pearson | CONVULSIONS, FAMILIAL INFANTILE, WITH PAROXYSMAL CHOREOATHETOSIS; ICCA | ENSG
Microarray+Pearson | DEAFNESS, NEUROSENSORY, AUTOSOMAL RECESSIVE 46; DFNB46 | ENSG
Microarray+Pearson | EPILEPSY, IDIOPATHIC GENERALIZED, SUSCEPTIBILITY TO, 3; EIG3 | ENSG
Microarray+Pearson | EPILEPSY, PARTIAL, WITH VARIABLE FOCI | ENSG
Microarray+Pearson | FACIOSCAPULOHUMERAL MUSCULAR DYSTROPHY 1A; FSHMD1A | ENSG
Microarray+Pearson | MUSCULAR DYSTROPHY, LIMB-GIRDLE, TYPE 1F; LGMD1F | ENSG
Microarray+Pearson | PARKINSON DISEASE 3, AUTOSOMAL DOMINANT LEWY BODY; PARK3 | ENSG
Microarray+Pearson | POLYDACTYLY, PREAXIAL II; PPD2 | ENSG
Microarray+Pearson | ROSSELLI-GULIENETTI SYNDROME | ENSG
Microarray+Pearson | SCAPULOPERONEAL MYOPATHY; SPM | ENSG
Microarray+Pearson | VACUOLAR NEUROMYOPATHY | ENSG
Microarray+Pearson | VACUOLAR NEUROMYOPATHY | ENSG
Microarray+Pearson | ACROMEGALOID FEATURES, OVERGROWTH, CLEFT PALATE, AND HERNIA | ENSG
Microarray+Pearson | BREAST CANCER, TRANSLOCATION ASSOCIATED | ENSG
Microarray+Pearson | BREAST CANCER, DUCTAL, 1; BRCD1 | ENSG
Microarray+Pearson | ELECTROENCEPHALOGRAM, LOW-VOLTAGE | ENSG
Microarray+Pearson | EOSINOPHILIA, FAMILIAL | ENSG
Microarray+Pearson | MICROCEPHALY, PRIMARY AUTOSOMAL RECESSIVE, 4; MCPH4 | ENSG
Microarray+Pearson | MUSCULAR DYSTROPHY, CONGENITAL, 1B | ENSG
Microarray+Pearson | SCAPULOPERONEAL MYOPATHY; SPM | ENSG
Microarray+Pearson | TRIPHALANGEAL THUMB-POLYSYNDACTYLY SYNDROME | ENSG
Microarray+Pearson | TUMOR SUPPRESSOR GENE ON CHROMOSOME 11 | ENSG
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1F; CMD1F | ENSG
Microarray+Pearson | CARDIOMYOPATHY, DILATED, 1Q; CMD1Q | ENSG
Microarray+Pearson | DEAFNESS, AUTOSOMAL RECESSIVE 51; DFNB51 | ENSG
Microarray+Pearson | MYOPATHY, LIMB-GIRDLE, WITH BONE FRAGILITY | ENSG
Microarray+Euclidean | ARRHYTHMOGENIC RIGHT VENTRICULAR DYSPLASIA, FAMILIAL, 5; ARVD5 | ENSG
Microarray+Euclidean | NONCOMPACTION OF LEFT VENTRICULAR MYOCARDIUM, FAMILIAL ISOLATED, AUTOSOMAL DOMINANT 2 | ENSG
Microarray+Euclidean | SCAPULOPERONEAL MYOPATHY; SPM | ENSG
Microarray+Euclidean | MUSCULAR DYSTROPHY, CONGENITAL, 1B | ENSG
Microarray+Euclidean | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG
SAGE+Euclidean | ANEURYSM, INTRACRANIAL BERRY, 3 | ENSG
SAGE+Euclidean | MYOPIA 5 | ENSG
SAGE+Euclidean | MYOPIA 6 | ENSG
SAGE+Euclidean | NONCOMPACTION OF LEFT VENTRICULAR MYOCARDIUM, FAMILIAL ISOLATED, AUTOSOMAL DOMINANT 2 | ENSG
SAGE+Euclidean | MICROPHTHALMIA-CATARACT | ENSG
SAGE+Euclidean | EXFOLIATIVE ICHTHYOSIS, AUTOSOMAL RECESSIVE, ICHTHYOSIS BULLOSA OF SIEMENS-LIKE | ENSG
SAGE+Euclidean | MACULAR DYSTROPHY, RETINAL, 2, BULL'S EYE | ENSG
SAGE+Euclidean | CATARACT, CONGENITAL NUCLEAR, AUTOSOMAL RECESSIVE 1; CATCN1 | ENSG
SAGE+Euclidean | CARDIOMYOPATHY, DILATED, 1C; CMD1C | ENSG
SAGE+Euclidean | ARRHYTHMOGENIC RIGHT VENTRICULAR DYSPLASIA, FAMILIAL, 5; ARVD5 | ENSG
SAGE+Euclidean | ACHROMATOPSIA 1 | ENSG
SAGE+Euclidean | ACHROMATOPSIA 1 | ENSG
SAGE+Euclidean | CONE-ROD DYSTROPHY 5; CORD5 | ENSG
SAGE+Euclidean | CONE-ROD DYSTROPHY 5; CORD5 | ENSG
SAGE+Euclidean | POSTERIOR COLUMN ATAXIA WITH RETINITIS PIGMENTOSA; AXPC1 | ENSG
SAGE+Euclidean | MYOPIA 6 | ENSG
SAGE+Euclidean | GLAUCOMA 3, PRIMARY INFANTILE, B; GLC3B | ENSG
SAGE+Euclidean | MICROPHTHALMIA-CATARACT | ENSG
SAGE+Euclidean | DUPUYTREN CONTRACTURE | ENSG
SAGE+Euclidean | CORNEAL DYSTROPHY, CRYSTALLINE, OF SCHNYDER | ENSG
SAGE+Euclidean | CATARACT, AUTOSOMAL RECESSIVE, EARLY-ONSET, PULVERULENT | ENSG
SAGE+Euclidean | CATARACT, POSTERIOR POLAR 3 | ENSG