Integrating BioMedical Text Mining Services into a Distributed Workflow Environment Rob Gaizauskas, Neil Davis, George Demetriou, Yikun Guo, Ian Roberts.

Presentation transcript:

Integrating BioMedical Text Mining Services into a Distributed Workflow Environment Rob Gaizauskas, Neil Davis, George Demetriou, Yikun Guo, Ian Roberts UK E-Science All Hands Meeting Nottingham September 1-3, 2004

Outline
• Introduction: Workflows, Web Services and Text Mining for Bioinformatics
• Two Case Studies: Graves’ Disease and Williams Syndrome
• Text Services
  – Text Collection Server
  – Text Services Workflow Server
  – Interface/Browsing Client
• Conclusions/Future Work

Workflows, Web Services and Text Mining for Bioinformatics
• Workflows
  – useful computational models for processes that require repeated execution of a series of complex analytical tasks
  – E.g. a biologist researching the genetic basis of a disease repeatedly
    • maps a reactive spot in microarray data to a gene sequence
    • uses a sequence alignment tool to find proteins/DNA of similar structure
    • mines info about these homologues from remote DBs
    • annotates the unknown gene sequence with this discovered info

Workflows, Web Services and Text Mining for Bioinformatics
• Web services
  – Processing resources that
    • are available via the Internet
    • use standardised messaging formats, such as XML
    • enable communication between applications without being tied to a particular operating system/programming language
  – Useful for bioinformatics, where the data used in research is
    • heterogeneous in nature: DB records, numerical results, NL texts
    • distributed across the Internet in research institutions around the world
    • available on a variety of platforms and via non-uniform interfaces

Workflows, Web Services and Text Mining for Bioinformatics
• Text mining
  – any process of revealing information (regularities, patterns or trends) in textual data
  – includes more established research areas such as information extraction (IE), information retrieval (IR), natural language processing (NLP), knowledge discovery from databases (KDD)
  – relevant to bioinformatics because of
    • explosive growth of biomedical literature
    • availability of some information in textual form only, e.g. clinical records

Workflows, Web Services and Text Mining for Bioinformatics
[Figure labels: Workflows; Web services; Text mining; Bioinformatics]

Context
• Objective: deliver text services for the myGrid and CLEF projects
• myGrid has adopted the workflow model for delivering an e-biologist’s workbench
  – Scufl workflow specification language
  – Taverna workflow design tool
  – Freefluo workflow enactment engine
• Problem: how to integrate text mining into a biological workflow?
  – Most text mining runs off-line and supports interactive browsing of results
  – Most workflows run end to end with no user intervention
  – What are the inputs to text mining to be?
• Solution: tap off the result of a workflow step and treat it as an implicit query

Two Case Studies in the Genetic Basis of Disease
• Graves’ Disease
  – an autoimmune condition affecting tissues in the thyroid and orbit
  – being investigated using micro-array methods
    • micro-array shows which genes are differentially expressed in normal patients vs patients with the disease = candidate genes
    • sequence alignment search (e.g. BLAST) finds genes/proteins with similar structure
    • function of these “homologues” may suggest function of candidate gene
  – key step for text mining follows the BLAST search
    • for homologous proteins, the BLAST report contains references to proteins in the Swissprot protein database
    • Swissprot records contain ids of abstracts describing the protein in the Medline abstract database
    • abstracts can be mined directly or used as “seed” documents to assemble a set of related abstracts

Two Case Studies in the Genetic Basis of Disease
• Williams Syndrome
  – congenital disorder resulting in mental retardation, caused by deletion of genetic material on the 7th chromosome
  – area in which deletions occur is not well characterised; better sequence info is becoming available
  – as new sequence information becomes available
    • gene finding software is run against it
    • BLAST is run against new putative genes to identify homologues whose function may be known
  – BLAST reports provide links to abstracts in the literature
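The step from a Swissprot record to the ids of its Medline abstracts can be illustrated with a small Python sketch. It assumes the record is available in the Swiss-Prot flat-file format, where literature cross-references appear on "RX" lines of the form "RX   MEDLINE=...; PubMed=...;". The function name and the sample record are hypothetical and are not taken from the slides.

```python
import re
from typing import List

# Assumption: literature cross-references sit on "RX" lines such as
#   RX   MEDLINE=00000001; PubMed=1111111;
PUBMED_RE = re.compile(r"PubMed=(\d+)")


def extract_pubmed_ids(swissprot_record: str) -> List[str]:
    """Return the PubMed ids found on RX lines, in order, without duplicates."""
    found: List[str] = []
    for line in swissprot_record.splitlines():
        if line.startswith("RX"):
            for pmid in PUBMED_RE.findall(line):
                if pmid not in found:
                    found.append(pmid)
    return found


# Made-up record used only to exercise the function.
SAMPLE_RECORD = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   100 AA.
RN   [1]
RX   MEDLINE=00000001; PubMed=1111111;
RN   [2]
RX   PubMed=2222222;
RN   [3]
RX   PubMed=1111111;
"""

seed_ids = extract_pubmed_ids(SAMPLE_RECORD)
```

The resulting ids would then serve as the "seed" documents from which a set of related abstracts is assembled.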

Text Services Architecture
[Figure: architecture diagram]
• Components: User Client; Workflow Server (Workflow Enactment); Medline Server
• Workflow boxes: Extract PubMed Id; Get Medline Abstract; Initial Workflow; Cluster Abstracts; Get Related Abstracts
• Data labels: Swissprot/Blast record; Workflow definition + parameters; Clustered PubMed Ids + titles; PubMed Ids; Term-annotated Medline abstracts; Medline Abstracts
• Medline: pre-processed offline to extract biomedical terms + indexed
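One way to read the workflow boxes in this figure is as a chain of service calls: extract PubMed ids from the Swissprot/Blast record, fetch related abstracts, then cluster. The Python sketch below is only an illustration of that composition under this reading. Every function here is a hypothetical stand-in (the real steps are web services enacted by the workflow engine), the abstract-fetching step is omitted for brevity, and the toy clustering rule is arbitrary.

```python
from typing import Callable, Dict, List


def run_text_workflow(
    record: str,
    extract_ids: Callable[[str], List[str]],
    get_related: Callable[[List[str]], List[str]],
    cluster: Callable[[List[str]], Dict[str, List[str]]],
) -> Dict[str, List[str]]:
    """Chain the steps: record -> seed ids -> seed + related ids -> clusters."""
    seed_ids = extract_ids(record)
    related_ids = get_related(seed_ids)
    # Merge seeds and related ids, keeping first-seen order and dropping repeats.
    all_ids = list(dict.fromkeys(seed_ids + related_ids))
    return cluster(all_ids)


# --- Toy stand-ins for the remote services (assumptions, not the real API) ---

def toy_extract_ids(record: str) -> List[str]:
    prefix = "PubMed="
    return [tok[len(prefix):].rstrip(";") for tok in record.split() if tok.startswith(prefix)]


TOY_RELATED = {"1": ["3"], "2": ["3", "4"]}


def toy_get_related(ids: List[str]) -> List[str]:
    return [rel for pmid in ids for rel in TOY_RELATED.get(pmid, [])]


def toy_cluster(ids: List[str]) -> Dict[str, List[str]]:
    # Arbitrary grouping rule, used only so the sketch returns something.
    clusters: Dict[str, List[str]] = {"odd": [], "even": []}
    for pmid in ids:
        clusters["odd" if int(pmid) % 2 else "even"].append(pmid)
    return clusters


result = run_text_workflow(
    "RX PubMed=1; PubMed=2;", toy_extract_ids, toy_get_related, toy_cluster
)
```

Passing the steps in as callables mirrors the idea that each box can be swapped for a different service without changing the overall workflow.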

Text Services Architecture
• 3-way division of labour is a sensible way to deliver distributed text mining services
  – Providers of e-archives, such as Medline, will make archives available via a web-services interface
    • Cannot offer tailored services for every application
    • Will provide core, common services
  – Specialist workflow designers will add value to basic services from the archive to meet their organization’s needs
  – Users will prefer to execute predefined workflows via standard light clients such as a browser
• Architecture appropriate for many research areas, not just bioinformatics

Text Collection Server
• Text collection is Medline
  – > 10 million abstracts since the 1950s
  – largest repository of biomedical abstracts
  – copies made available for research, updated annually
  – records contain semi-structured information annotated in XML
    • Unique id: PubMed id
    • Citation information: author(s), journal, year, etc.
    • Manually assigned controlled vocabulary keywords (MeSH terms)
    • Text of abstract
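As an illustration of the semi-structured record described above, the following Python sketch pulls the PubMed id, title, abstract text and MeSH terms out of a single citation using the standard library XML parser. The element names follow the publicly documented MEDLINE citation layout as best understood here and should be treated as an assumption; the sample record is invented and heavily simplified.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Union

# Invented, heavily simplified citation; real records carry many more fields.
SAMPLE_XML = """
<MedlineCitation>
  <PMID>1111111</PMID>
  <Article>
    <Journal><Title>Example Journal</Title></Journal>
    <ArticleTitle>A made-up title.</ArticleTitle>
    <Abstract><AbstractText>Made-up abstract text.</AbstractText></Abstract>
  </Article>
  <MeshHeadingList>
    <MeshHeading><DescriptorName>Graves Disease</DescriptorName></MeshHeading>
    <MeshHeading><DescriptorName>Thyroid Gland</DescriptorName></MeshHeading>
  </MeshHeadingList>
</MedlineCitation>
"""

Record = Dict[str, Union[str, None, List[str]]]


def parse_citation(xml_text: str) -> Record:
    """Extract the fields listed on the slide from one citation element."""
    root = ET.fromstring(xml_text.strip())
    mesh_path = "MeshHeadingList/MeshHeading/DescriptorName"
    return {
        "pmid": root.findtext("PMID"),
        "journal": root.findtext("Article/Journal/Title"),
        "title": root.findtext("Article/ArticleTitle"),
        "abstract": root.findtext("Article/Abstract/AbstractText"),
        "mesh": [d.text or "" for d in root.findall(mesh_path)],
    }


record = parse_citation(SAMPLE_XML)
```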

Text Collection Server (cont)
• Local copy
  – Loaded in MySQL, indexed on various fields, e.g. MeSH terms
  – Text portion indexed for search engines (Lucene, Madcow)
  – Text pre-processed with text mining tools
    • Tokenisation
    • Part-of-speech tagging
    • Term parsing
    • Terminology look-up and indexes built for term classes (proteins, genes, diseases, etc.)
• Server accepts web service calls to, e.g.
  – Return text of abstract given a PubMed id
  – Return MeSH terms of abstracts given PubMed ids
  – Return PubMed ids of abstracts with given MeSH terms
  – Return PubMed ids of abstracts matching a free text query
  – Return PubMed ids of abstracts containing a specific term
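The kinds of calls the server accepts can be mimicked with a small in-memory index. This Python sketch is a toy stand-in for the database and search-engine setup on the slide, not a description of the actual service. The class and method names are hypothetical, and requiring all given MeSH terms or query tokens to match (AND semantics) is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class ToyTextCollection:
    """In-memory stand-in for the abstract store and its indexes."""

    def __init__(self, records: Iterable[dict]) -> None:
        self._records: Dict[str, dict] = {r["pmid"]: r for r in records}
        self._by_mesh: Dict[str, Set[str]] = defaultdict(set)
        self._by_token: Dict[str, Set[str]] = defaultdict(set)
        for pmid, rec in self._records.items():
            for term in rec["mesh"]:
                self._by_mesh[term].add(pmid)
            # Crude whitespace tokenisation; real tools would do far more.
            for token in rec["abstract"].lower().split():
                self._by_token[token.strip(".,;:()")].add(pmid)

    def get_abstract(self, pmid: str) -> str:
        return self._records[pmid]["abstract"]

    def get_mesh_terms(self, pmids: Iterable[str]) -> Dict[str, List[str]]:
        return {p: list(self._records[p]["mesh"]) for p in pmids}

    def ids_with_mesh(self, terms: Iterable[str]) -> List[str]:
        sets = [self._by_mesh.get(t, set()) for t in terms]
        return sorted(set.intersection(*sets)) if sets else []

    def ids_matching(self, query: str) -> List[str]:
        sets = [self._by_token.get(t, set()) for t in query.lower().split()]
        return sorted(set.intersection(*sets)) if sets else []


# Invented records used only to exercise the index.
collection = ToyTextCollection([
    {"pmid": "1", "mesh": ["Graves Disease", "Thyroid Gland"],
     "abstract": "Thyroid autoimmunity in Graves disease."},
    {"pmid": "2", "mesh": ["Thyroid Gland"],
     "abstract": "Thyroid hormone receptors."},
])
```

For example, `collection.ids_with_mesh(["Thyroid Gland"])` returns both ids, while adding "Graves Disease" to the list narrows the result to the first record.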

Workflow Server
• Workflow server runs the Freefluo enactment engine to execute a Scufl workflow (designed using Taverna)
• Graves’ disease workflow:

Interface/Browsing Client
• Two components
  – Submit workflow for enactment
  – Explore results and launch follow-on queries
• Three types of follow-on search
  – Find other texts containing terms in the current text
  – Find texts containing a specific search string (free text search)
  – Find other texts “like” the current one (with the same MeSH terms)
• Implemented as a Java Swing applet for easy inclusion in portals
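The third kind of follow-on search, finding texts "like" the current one via MeSH terms, could be approximated as below. This Python sketch ranks other abstracts by how many MeSH terms they share with the current one. Ranking by the count of shared terms and the `min_shared` threshold are assumptions made for illustration; the slides do not say how the match is scored.

```python
from typing import Dict, List


def related_by_mesh(
    current: str, mesh_by_id: Dict[str, List[str]], min_shared: int = 1
) -> List[str]:
    """Rank other abstracts by the number of MeSH terms shared with `current`."""
    target = set(mesh_by_id[current])
    scored = []
    for pmid, terms in mesh_by_id.items():
        if pmid == current:
            continue
        shared = len(target & set(terms))
        if shared >= min_shared:
            scored.append((shared, pmid))
    # Most shared terms first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [pmid for _, pmid in scored]


# Invented MeSH assignments used only to exercise the function.
MESH_BY_ID = {
    "1": ["Graves Disease", "Thyroid Gland", "Autoimmunity"],
    "2": ["Graves Disease", "Thyroid Gland"],
    "3": ["Autoimmunity"],
    "4": ["Williams Syndrome"],
}

similar = related_by_mesh("1", MESH_BY_ID)
```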

Interface/Browsing Client
[Figure: screenshot of the browsing client, with labelled regions: Abstract body; MeSH Tree; Abstract Titles; Free text search; Search scope restrictors; Linked terms; Get Related Abstracts]

Conclusion
• Have implemented a set of text mining web services that run in a workflow to support biologists in exploring the genetic basis of disease
• Implementation based on a generic 3-component architecture (archive server, workflow server, browser client) with wider applicability
• Basic idea is to glean an implicit query from a workflow operation (e.g. sequence alignment)
  – Find abstracts of papers related to abstracts describing homologous proteins/genes of the gene of interest
  – Cluster results and present to user
• User can explore results and issue follow-on queries via a richly-featured graphical interface

Future Work
• Integrate in practice with the rest of the Graves’/Williams workflows in myGrid and get feedback from biologists
• Explore other interpretations of “relatedness” for abstracts in addition to MeSH terms
  – in assembling a corpus of related abstracts (e.g. vector space/language model notions of similarity)
  – in clustering results (e.g. k-means/agglomerative clustering)
• Explore other ways of deriving implicit queries from workflows, e.g. mining provenance data
• Explore further interface search filtering operations and interface design issues
• Scale up to process all of Medline for term/entity identification
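To make the "vector space" notion of relatedness concrete, here is a minimal Python sketch of cosine similarity between two texts represented as raw term-frequency vectors. It is a generic textbook formulation, not the authors' method. Whitespace tokenisation and the absence of any term weighting (such as tf-idf) are simplifying assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is treated as similar to nothing.
    return dot / (norm_a * norm_b)


score = cosine_similarity(
    "thyroid autoimmunity in graves disease",
    "graves disease and the thyroid gland",
)
```

Scores of this kind could feed either the assembly of a related-abstract corpus or a clustering step such as k-means, both of which the slide lists as options to explore.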