Set similarity: given two gene products, G 1 and G 2, we can consider them as being represented by collections of terms: Based on the two sets, the goal.

Set similarity: given two gene products, G 1 and G 2, we can consider them as being represented by collections of terms: Based on the two sets, the goal is to define a natural similarity between G 1 and G 2 and, denoted as : Two types of set similarity: element based (Dice, Jaccard, Cosine, fuzzy measure) pair of elements based (Maximum, Average, OWA, Choquet) Expression dimension: real measures (Euclidian measure, etc…). Sequence dimension: sequence similarity measure (Smith-Waterman, Needleman-Wunsch, etc…) GO and Abstract dimension: set similarity. Set Similarity Measures for Gene Matching Mihail Popescu #, James Keller +, Joyce Mitchell # # Department of Health Management and Informatics;+Department of Electrical and Computer Engineering; University of Missouri-Columbia, Columbia, MO 65211 Why Similarity Measures? For a unified clustering approach in a 4D gene space Gene space dimensions (4D): sequence, microarray expression, literature abstracts (articles), gene ontology (GO) Two dimensions are numeric (sequence, expression) and two symbolic The existent symbolic measures are not adequate: Dice, Jaccard: do not consider the weight of the elements Maximum and average usually overestimates the or underestimates the similarity, respectively Example: ATM (human ataxia telangiectasia mutated) and STK11 (serine/threonine kinase 11.) The geneticist assessed these two genes as quasi-similar (similarity ~0.5) because: they both have protein serine/threonine kinase enzyme activity (they share a kinase domain) They both cause cancers when mutated, including breast cancer. Possible similarity measures Example of Similarity Calculation for the Gene Ontology (GO) Dimension s(ATM, STK11)=? (GO dimension) Algorithm: 1. Retrieve LocusLink GO annotations: ATM={4674: “ protein serine/threonine kinase activity”, 3677: ” DNA binding”, 4428 ” inositol/phosphatidylinositol kinase activity”, 7131 : ” meiotic recombination”, 6281 : ” DNA repair”, 7165: ” signal transduction”, 5634: ” nucleus”, 16740: ” transferase activity”, 45786: ” negative regulation of cell cycle”} STK11={5524: “ ATP binding”, 4674: ” protein serine/threonine kinase activity”, 6468: ” protein amino acid phosphorylation”, 16740: ” transferase activity”} 2. Compute GO term densities using the Resnik formula [4], the normalized version [.] or the depth in the hierarchy (.) Example of Similarity Calculation for the Retrieved Abstracts Dimension Acknowledgements This research was supported by National Library of Medicine Biomedical and Health Informatics Research Training grant 2-T15-LM07089-11. References [1] C.D. Manning, H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, 2001. [2] R. Yager, “Criteria Aggregation Functions Using Fuzzy Measures and the Choquet Integral”, Int. Jour. of Fuzzy Systems, Vol.1, No. 2, December 1999. [3] J.J. Jiang, D.W. Conrath, “Semantic Similarity Based on Corpus Statistics and Lexical Ontology”, Proc. of Int. Conf. Research on Comp. Linguistics X, 1997, Taiwan. [4] P.W. Lord, R.D. Stevens, A. Brass, C.A. Goble, “Semantic similarity measure as a tool for exploring the gene ontology”, In Pacific Symposium on Biocomputing, pages 601-612, 2003. [5] M. Sugeno, Fuzzy measures and fuzzy integrals: a survey, (M.M. Gupta, G. N. Saridis, and B.R. Gaines, editors) Fuzzy Automata and Decision Processes, pp. 89-102, North-Holland, New York, 1977. [6] S. Raychaduri, R.B. Altman, “A literature-based method for assessing the functional coherence of a gene group”, Bioinformatics, 19(3), pp. 396:401, Feb. 2003. [7]. M. Grabisch, T. Murofushi, and M. Sugeno (eds.), Fuzzy Measures and Integrals: Theory and Applications, Springer-Verlag, 2000. [8]. Hvidsten TR, Komorowski J, Sandvik AK, Laegreid A. Predicting gene function from gene expressions and ontologies. Pac Symp Biocomput. 2001;:299- 310. [9]. Trupti Joshi. Cellular function prediction for hypothetical proteins using high-throughput data. MS thesis, University of Tennessee, Knoxville, 2003. [10]. Keller J, Popescu M, Mitchell J. Soft Computing Tools for Gene Similarity Measures in Bioinformatics, FLINT-CIBI 2003, Berkeley, Dec 15-18, 2003. Set similarity measures s(ATM, STK11)=? (Abstract dimension) Algorithm: Retrieve PubMed abstracts for ATM, STK11 Calculate all the pair-wise distances based on the MeSH indexing Keep the 4 best-matching pairs Find the impact factor for each journal: g(A i ), i=1…8 ATM12917635- Oncogene (6.737) 12970738- Oncogene (6.737) 14500819-Nucleic Acids Res. (6.373) 14499692-Science (23.329) STK1112183403 – Cancer Res (8.30) 12234250 – Biochem J (4.326) 12805220 - EMBO J. (12.459) 11853558- Biochem J (4.326) Calculate the confidence of the pair g(A 1, A 2 ) =g(A 1 )*g(A 2 ) and normalize using maximum value: The pair-wise similarity values calculated using FMS are: Similarity calculation: Using weighted average: s(ATM, STK11)=0.37 Using Choquet integral: s(ATM, STK11)=0.53 3. Compute the similarity: Conclusions For the GO dimension, the best method of assigning densities was normalizing the information content [4] by the maximum value The proposed fuzzy similarity measure (FMS) agrees better with our intuition of similarity: if the common elements have a high confidence, then the similarity is stronger. In addition, the non common terms have also a contribution to the similarity since the measure is computed apriori for each term set. The Choquet similarity measure is much more general, depending only on the fuzzy measure. In addition the optimal fuzzy measure can be learned from examples.

Set similarity: given two gene products, G 1 and G 2, we can consider them as being represented by collections of terms: Based on the two sets, the goal.

Similar presentations

Presentation on theme: "Set similarity: given two gene products, G 1 and G 2, we can consider them as being represented by collections of terms: Based on the two sets, the goal."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Set similarity: given two gene products, G 1 and G 2, we can consider them as being represented by collections of terms: Based on the two sets, the goal.

Similar presentations

Presentation on theme: "Set similarity: given two gene products, G 1 and G 2, we can consider them as being represented by collections of terms: Based on the two sets, the goal."— Presentation transcript:

Similar presentations

About project

Feedback