Set Similarity Measures for Gene Matching

Mihail Popescu #, James Keller +, Joyce Mitchell #
# Department of Health Management and Informatics; + Department of Electrical and Computer Engineering; University of Missouri-Columbia, Columbia, MO

Why Similarity Measures?

For a unified clustering approach in a 4D gene space.
Gene space dimensions (4D): sequence, microarray expression, literature abstracts (articles), gene ontology (GO).
Two dimensions are numeric (sequence, expression) and two are symbolic.
The existing symbolic measures are not adequate:
Dice, Jaccard: do not consider the weight of the elements.
Maximum and average: usually overestimate or underestimate the similarity, respectively.
Example: ATM (human ataxia telangiectasia mutated) and STK11 (serine/threonine kinase 11). The geneticist assessed these two genes as quasi-similar (similarity ~0.5) because:
they both have protein serine/threonine kinase enzyme activity (they share a kinase domain);
they both cause cancers when mutated, including breast cancer.

Possible similarity measures

Expression dimension: real measures (Euclidean measure, etc.).
Sequence dimension: sequence similarity measures (Smith-Waterman, Needleman-Wunsch, etc.).
GO and Abstract dimensions: set similarity.

Set similarity measures

Set similarity: given two gene products, G1 and G2, we can consider them as being represented by collections of terms. Based on the two sets, the goal is to define a natural similarity between G1 and G2, denoted as s(G1, G2).
Two types of set similarity:
element based (Dice, Jaccard, Cosine, fuzzy measure);
pair of elements based (Maximum, Average, OWA, Choquet).

Example of Similarity Calculation for the Gene Ontology (GO) Dimension

s(ATM, STK11) = ? (GO dimension)
Algorithm:
1. Retrieve LocusLink GO annotations:
ATM = {4674: "protein serine/threonine kinase activity", 3677: "DNA binding", 4428: "inositol/phosphatidylinositol kinase activity", 7131: "meiotic recombination", 6281: "DNA repair", 7165: "signal transduction", 5634: "nucleus", 16740: "transferase activity", 45786: "negative regulation of cell cycle"}
STK11 = {5524: "ATP binding", 4674: "protein serine/threonine kinase activity", 6468: "protein amino acid phosphorylation", 16740: "transferase activity"}
2. Compute GO term densities using the Resnik formula [4], the normalized version, or the depth in the hierarchy.
3. Compute the similarity.

Example of Similarity Calculation for the Retrieved Abstracts Dimension

s(ATM, STK11) = ? (Abstract dimension)
Algorithm:
1. Retrieve PubMed abstracts for ATM and STK11.
2. Calculate all the pair-wise distances based on the MeSH indexing.
3. Keep the 4 best-matching pairs.
4. Find the impact factor for each journal: g(A_i), i = 1…8.
ATM: Oncogene (6.737), Oncogene (6.737), Nucleic Acids Res. (6.373), Science (23.329)
STK11: Cancer Res (8.30), Biochem J (4.326), EMBO J. (12.459), Biochem J (4.326)
5. Calculate the confidence of each pair, g(A_1, A_2) = g(A_1) * g(A_2), and normalize using the maximum value.
6. The pair-wise similarity values calculated using FMS are:
7. Similarity calculation:
Using weighted average: s(ATM, STK11) = 0.37
Using Choquet integral: s(ATM, STK11) =

Conclusions

For the GO dimension, the best method of assigning densities was normalizing the information content [4] by the maximum value.
The proposed fuzzy similarity measure (FMS) agrees better with our intuition of similarity: if the common elements have a high confidence, then the similarity is stronger. In addition, the non-common terms also contribute to the similarity, since the measure is computed a priori for each term set.
The Choquet similarity measure is much more general, depending only on the fuzzy measure. In addition, the optimal fuzzy measure can be learned from examples.

Acknowledgements

This research was supported by National Library of Medicine Biomedical and Health Informatics Research Training grant 2-T15-LM.

References

[1] C.D. Manning, H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press.
[2] R. Yager, "Criteria Aggregation Functions Using Fuzzy Measures and the Choquet Integral", Int. Jour. of Fuzzy Systems, Vol. 1, No. 2, December.
[3] J.J. Jiang, D.W. Conrath, "Semantic Similarity Based on Corpus Statistics and Lexical Ontology", Proc. of Int. Conf. Research on Comp. Linguistics X, 1997, Taiwan.
[4] P.W. Lord, R.D. Stevens, A. Brass, C.A. Goble, "Semantic similarity measure as a tool for exploring the gene ontology", in Pacific Symposium on Biocomputing.
[5] M. Sugeno, "Fuzzy measures and fuzzy integrals: a survey", in M.M. Gupta, G.N. Saridis, and B.R. Gaines (eds.), Fuzzy Automata and Decision Processes, North-Holland, New York.
[6] S. Raychaudhuri, R.B. Altman, "A literature-based method for assessing the functional coherence of a gene group", Bioinformatics, 19(3), pp. 396-401, Feb.
[7] M. Grabisch, T. Murofushi, and M. Sugeno (eds.), Fuzzy Measures and Integrals: Theory and Applications, Springer-Verlag.
[8] T.R. Hvidsten, J. Komorowski, A.K. Sandvik, A. Laegreid, "Predicting gene function from gene expressions and ontologies", Pac Symp Biocomput., 2001.
[9] Trupti Joshi, Cellular function prediction for hypothetical proteins using high-throughput data, MS thesis, University of Tennessee, Knoxville.
[10] J. Keller, M. Popescu, J. Mitchell, "Soft Computing Tools for Gene Similarity Measures in Bioinformatics", FLINT-CIBI 2003, Berkeley, Dec 15-18.
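
Worked illustration

As a worked illustration of the element-based measures and the pair-confidence step described in this poster, the minimal Python sketch below computes the Jaccard and Dice scores for the ATM and STK11 GO annotations listed in the GO example, then the normalized pair confidences from the listed journal impact factors. The function names (jaccard, dice, pair_confidences) are illustrative and do not come from the poster. The sketch also assumes that the four retained abstract pairs match the two journal lists position by position, which the transcript does not confirm. It does not implement the fuzzy similarity measure or the Choquet integral.

```python
# GO term IDs for each gene, copied from the LocusLink annotations in the GO example.
ATM = {4674, 3677, 4428, 7131, 6281, 7165, 5634, 16740, 45786}
STK11 = {5524, 4674, 6468, 16740}


def jaccard(a: set, b: set) -> float:
    """|A & B| / |A | B|: every term counts equally, no weights."""
    return len(a & b) / len(a | b)


def dice(a: set, b: set) -> float:
    """2 * |A & B| / (|A| + |B|): every term counts equally, no weights."""
    return 2 * len(a & b) / (len(a) + len(b))


def pair_confidences(g_first, g_second):
    """g(A1) * g(A2) for each pair, divided by the largest product."""
    products = [x * y for x, y in zip(g_first, g_second)]
    top = max(products)
    return [p / top for p in products]


# Journal impact factors as listed in the abstract example.
# ASSUMPTION: the i-th ATM abstract is paired with the i-th STK11 abstract.
ATM_IF = [6.737, 6.737, 6.373, 23.329]
STK11_IF = [8.30, 4.326, 12.459, 4.326]

shared = ATM & STK11
j = jaccard(ATM, STK11)
d = dice(ATM, STK11)
conf = pair_confidences(ATM_IF, STK11_IF)

print("shared GO terms:", sorted(shared))
print("Jaccard:", round(j, 3))
print("Dice:", round(d, 3))
print("normalized pair confidences:", [round(c, 3) for c in conf])
```

The two GO sets share only terms 4674 and 16740, so Jaccard gives 2/11 (about 0.18) and Dice gives 4/13 (about 0.31). Both fall below the geneticist's assessment of about 0.5, which is consistent with the poster's point that these measures ignore the weight of the elements.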