Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature Hammad Afzal, Robert Stevens, Goran Nenadic School of Computer Science.

Presentation transcript:

Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature Hammad Afzal, Robert Stevens, Goran Nenadic School of Computer Science University of Manchester

Motivation
- A number of bioinformatics tools and resources are available for service use and composition
  - guesstimate is Web Services publicly available
  - how to find a service, what is out there to use?
  - provenance?
- Semantic annotation of bioinformatics services
  - annotate functional capabilities
  - e.g. Taverna, myGrid, myExperiment, EBI, BioMOBY
- Not only services and tools
  - databases, repositories, corpora

Motivation
- Manual curation
  - e.g. myGrid, BioCatalogue etc.
  - e.g. Taverna/Feta: only ~15-20% functionally described
  - backlog, and the number of services is growing
- Annotations combine
  - textual descriptions
  - ontological mappings

Example text ontological descriptions - multiple local align. - Soaplab

BioCatalogue
- Single registration point for Web Service providers
- Single search site for scientists and developers
- Place where the community can find contacts and meet the experts and maintainers of these services
- Community-sourced annotation, expert oversight
- Mixed annotations: free text, tags, controlled vocabularies, community ontologies

BioCatalogue Beta version at Launch June 2009 at ISMB

Our approach
- Collect service semantic descriptions by extracting and integrating information from text resources
  - full text bioinformatics journal publications
- Approach:
  - identify descriptors that are used for service and resource annotations
  - locate them in text
  - infer the annotations
  - textual evidence and mappings to an ontology

The rest of the talk
- Methodology
  - mining bioinformatics terminology
  - extraction of service description profiles
- Experiments and results
  - semi-automated curation
- What next?

Methodology
[Pipeline diagram. Components: Corpus; Information Retrieval; Sentence Filtering; Domain Ontology (e.g. myGrid); Identifying Topic Related Terms; Text Mining Engine (Information Extraction); Semantic Description of Services; Semantic Network of Services; Service Discovery]

Bioinformatics terminology
Learn bioinformatics terms from literature
1) get a corpus
2) get all terms
3) get seed examples
4) find relevant ones using term profiling and comparison to seed examples

Bioinformatics terminology
- Use seed terms to bootstrap
  - e.g. known descriptors used in existing service descriptions, either in literature or service repositories
- 250 terms identified, manual pruning after automatic term recognition
  - examples of lexical constituents and textual behaviour (pragmatics)
  - lexical profiling
  - contextual profiling

Bioinformatics terminology
- Lexical profiling
  - what is in the name
- Contextual profiling
  - characterise sentences in which terms appear (nouns, verbs and context-patterns)
- Comparing candidate term profiles to
  - average seed term
  - best-match
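The comparison of candidate term profiles against seed terms can be illustrated with a minimal sketch. The slides do not say which similarity measure was used, so cosine similarity over sparse feature counts is an assumption here, and the function names (`cosine`, `average_profile`, `score_candidate`) and the toy profiles are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse feature-count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_profile(profiles: list[Counter]) -> Counter:
    """Element-wise mean of the seed profiles (the 'average seed term')."""
    total = Counter()
    for profile in profiles:
        total.update(profile)
    return Counter({k: v / len(profiles) for k, v in total.items()})


def score_candidate(candidate: Counter, seeds: list[Counter]) -> dict[str, float]:
    """Score a candidate against the average seed and against its best-matching seed."""
    return {
        "average": cosine(candidate, average_profile(seeds)),
        "best_match": max(cosine(candidate, seed) for seed in seeds),
    }


# Toy contextual profiles: counts of words co-occurring with each term (made up).
seeds = [
    Counter({"align": 2, "sequence": 3, "search": 1}),
    Counter({"retrieve": 2, "database": 2, "sequence": 1}),
]
candidate = Counter({"align": 1, "sequence": 2})
unrelated = Counter({"patient": 3, "hospital": 1})

scores = score_candidate(candidate, seeds)
unrelated_scores = score_candidate(unrelated, seeds)
```

Candidates would then be ranked by either score. Which of the two ranks better is an empirical question that this sketch does not settle.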

Bioinformatics terminology
Two domain experts evaluated the top 300 terms

Semantic classes – myGrid
- Informatics concepts
  - general concepts of data, data structures, databases, metadata
- Bioinformatics concepts
  - domain-specific data sources and algorithms for searching and analysing data
  - e.g. Smith-Waterman algorithm

Semantic classes – myGrid
- Molecular biology concepts
  - higher level concepts used to describe bioinformatics data types, used as inputs and outputs in services
  - e.g. protein sequence, nucleic acid sequence
- Task concepts
  - generic tasks a service operation can perform
  - e.g. retrieving, displaying, aligning

Semantic classes
- Engineered from myGrid bioinformatics sub-ontology

Class: Examples
Algorithm: SigCalc algorithm, CHAOS local alignment, SNP analysis, KEGG Genome-based approach, GeneMark method, K-fold cross validation procedure
Application: PreBIND Searcher program, Apollo2Go Web Service, FLIP application, Apollo Genome Annotation curation tool, GenePix software, Pegasys system
Data: GeneBank record, Genome Microbial CoDing sequences, Drug Data report
Data resource: PIR Protein Information Resource, BIND database, TIGR dataset, BioMOBY Public Code repository

Semantic classes and instances

Service mentions
- Named-entity recognition (NER) task
- Recognition of service mentions using terminological (semantic) heads of automatically recognised terms
  - Apollo2Go Web Service is an Application
  - BIND database is a Data source
  - assign the corresponding semantic class
- Hearst patterns (co-ordinations, appositions, enumerations, etc.)
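As a rough illustration of head-based class assignment, the sketch below takes the last token of a recognised term as its terminological head and looks it up in a small head-to-class table. Both the "last token is the head" heuristic and the `HEAD_TO_CLASS` entries are assumptions made for this example. The class labels follow the semantic classes table, which uses "Data resource".

```python
from typing import Optional

# Hypothetical head-to-class table, loosely derived from the example terms on the slides.
HEAD_TO_CLASS = {
    "algorithm": "Algorithm",
    "method": "Algorithm",
    "procedure": "Algorithm",
    "program": "Application",
    "service": "Application",
    "application": "Application",
    "tool": "Application",
    "software": "Application",
    "system": "Application",
    "record": "Data",
    "report": "Data",
    "database": "Data resource",
    "dataset": "Data resource",
    "repository": "Data resource",
    "resource": "Data resource",
}


def classify_mention(term: str) -> Optional[str]:
    """Assign a semantic class from the term's head, assumed to be its last token."""
    tokens = term.split()
    if not tokens:
        return None
    return HEAD_TO_CLASS.get(tokens[-1].lower())


app_class = classify_mention("Apollo2Go Web Service")
db_class = classify_mention("BIND database")
unknown = classify_mention("SNP analysis")
```

Terms whose head is not in the table, such as "SNP analysis" here, stay unclassified. Pattern-based evidence such as the Hearst patterns mentioned on the slide would be one way to cover those cases.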

Semantic descriptors
- Recognition of phrases depicting semantic roles used to describe services
- Flexible dictionary look-up
  - terms from myGrid ontology
  - terms/noun phrases from existing descriptions of bioinformatics resources (collected from Taverna and other Web service providers)
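The slides do not define what makes the dictionary look-up "flexible". One plausible reading, sketched below, is matching after light normalisation (case, punctuation and a crude plural strip) with a longest-match scan over the sentence. The normalisation rules and all function names are assumptions for illustration only.

```python
import re


def normalise(phrase: str) -> tuple[str, ...]:
    """Lower-case, split on non-alphanumerics, and crudely strip a plural 's'."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    return tuple(
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    )


def build_index(descriptors: list[str]) -> dict[tuple[str, ...], str]:
    """Map each normalised descriptor to its original dictionary form."""
    return {normalise(d): d for d in descriptors}


def find_descriptors(sentence: str, index: dict[tuple[str, ...], str]) -> list[str]:
    """Left-to-right scan that prefers the longest dictionary match at each position."""
    tokens = normalise(sentence)
    max_len = max((len(key) for key in index), default=0)
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            hit = index.get(tokens[i:i + n])
            if hit is not None:
                found.append(hit)
                i += n
                break
        else:
            i += 1  # no descriptor starts at this token
    return found


index = build_index(["protein sequence", "nucleic acid sequence", "retrieving"])
hits = find_descriptors(
    "The service aligns Protein-Sequences and nucleic acid sequences.", index
)
```

Because the same normalisation is applied to the dictionary and to the text, even the crude plural rule stays consistent on both sides. A real system would more likely use a proper lemmatiser.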

Mining service descriptions

Extraction/functional rules
- Predicate-driven rules: each verb associated with the type of “information content” it provides

Extraction/functional rules
- Manually designed predicate-driven rules: Subject (Arg) – Verb (Predicate) – Object (Arg)
- Applied on dependency parsed sentences
  - Stanford parser
  - no phrase structures
  - complex sentences
  - information in sub-clause

Extraction/functional rules
- Phrase structures identified and integrated with the dependencies
- Predicate-dependent rules applied to extract specific ‘content’ and profile the services
- Profiles collated for all mentions
  - service name variation
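A Subject – Verb – Object rule over dependency output, followed by collation across name variants, might look roughly like the sketch below. It works on hand-written (head, relation, dependent) triples rather than calling a parser. The relation names `nsubj` and `dobj` come from the Stanford typed dependencies, while the predicate-to-type table, the pre-merged multiword names and the toy sentences are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical mapping from verb lemma to the type of information content it signals.
PREDICATE_TYPES = {
    "predict": "function",
    "align": "function",
    "require": "input",
    "return": "output",
}


def apply_rules(dependencies, lemmas):
    """Extract (subject, info_type, object) facts from (head, relation, dependent) triples."""
    subjects, objects = {}, {}
    for head, relation, dependent in dependencies:
        if relation == "nsubj":
            subjects[head] = dependent
        elif relation == "dobj":
            objects[head] = dependent
    facts = []
    for verb, subject in subjects.items():
        info_type = PREDICATE_TYPES.get(lemmas.get(verb, verb))
        if info_type is not None and verb in objects:
            facts.append((subject, info_type, objects[verb]))
    return facts


def collate(facts, variants):
    """Merge facts for all name variants of a service into one profile per service."""
    profiles = defaultdict(lambda: defaultdict(set))
    for subject, info_type, obj in facts:
        canonical = variants.get(subject, subject)
        profiles[canonical][info_type].add(obj)
    return profiles


# Toy input for "GeneClass predicts expression" and "The GeneClass algorithm requires motifs".
facts = apply_rules(
    [("predicts", "nsubj", "GeneClass"), ("predicts", "dobj", "expression")],
    {"predicts": "predict"},
) + apply_rules(
    [("requires", "nsubj", "GeneClass algorithm"), ("requires", "dobj", "motifs")],
    {"requires": "require"},
)
profiles = collate(facts, {"GeneClass algorithm": "GeneClass"})
```

Keeping the verb-to-type table separate from the extraction loop mirrors the idea that each predicate determines what kind of content the rule yields. Passive constructions and sub-clauses would need further rules that are not shown here.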

Semantic service profiles
- For a given service, collection of descriptors, including
  - parameters
  - links to other related instances
  - related myGrid ontology semantic labels
  - “informative” sentences

Example – GeneClass  Descriptors

Example – GeneClass  Functions, parameters

Example – GeneClass
- Sentences
  - We extend the original GeneClass algorithm to use all target genes for which both motif and expression data is available.
  - In order to study different aspects of target gene regulation we use different sets of motifs and parents with the GeneClass algorithm.
  - The GeneClass algorithm for predicting differential gene expression starts with a candidate set of motifs, representing known or putative regulatory element sequence patterns, and a candidate set of regulators or parents.

Experiments
- 2120 BMC Bioinformatics articles
  - full-text articles before March 2008
- Service descriptors dictionary
  - 471 descriptors from myGrid/Feta
  - 450 descriptors collected from other bioinformatics service/tools providers
- 108 predicates used

Experiments
- Number of candidate resources

Experiments
- Number of descriptions collected using rules

Evaluation of semantic profiles
- Evaluated for their capability to be used for semantic description of a given bioinformatics resource
- Rating scale: irrelevant, partially useful, useful
- Example sentences
  - HeatMapper: The HeatMapper tool has already proven to be very useful in several studies
  - Kalign: To compare Kalign to other MSA programs, the following test sets were used.
  - Cognitor: To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program

Evaluation of semantic profiles
- Two experiments:
  - 5 well-known resources with descriptions already available
    - excellent rating for sentences
    - average rating for semantic descriptors
    - predicate functions
  - 5 new, unknown resources
    - excellent rating for sentences
    - average rating for semantic descriptors
    - predicate functions

What next?
- Good recall, poor precision
  - context needs a better model
- Mining parameter values
  - sub-language of parameters
- Candidate service/resource mentions
  - an entity whose profile looks like a service
  - comparison of semantic profiles
  - network of services [ISMB 2009]
- Do we have good service ontologies?

Conclusion
- Literature mining approach to service description and annotation
- Aims
  - reduce curation efforts
  - provide semantic synopses of services for the Semantic Web
- Potential of text mining
  - integration with other annotation approaches
  - extracting the entire service context is still challenging

Acknowledgements
- gnTEAM (text extraction, analytics, mining): H. Yang, I. Spasic, H. Afzal, A. Gledson, J. Eales, M. Greenwood, F. Sarafraz
- myGrid team: Franck Tanoh
- BBSRC
  - “Mining term associations from literature to support knowledge discovery in biology”
  - “pubmed2ensembl”
  - “BioCatalogue”

Announcement
- Journal of BioMedical Semantics, published by BioMed Central, launched at ISMB 2009
- Topics include
  - Infrastructure for biomedical semantics
    - semantic resources and repositories
    - meta-data management and resource description
    - knowledge representation and semantic frameworks
    - Biomedical Semantic Web
    - life-long management of semantic resources
  - Semantic mining, annotation and analysis