Published by Alannah Hardy. Modified over 9 years ago.
1
Integrating BioMedical Text Mining Services into a Distributed Workflow Environment
Rob Gaizauskas, Neil Davis, George Demetriou, Yikun Guo, Ian Roberts
UK E-Science All Hands Meeting, Nottingham, September 1-3, 2004
2
Outline
• Introduction: Workflows, Web Services and Text Mining for Bioinformatics
• Two Case Studies: Graves’ Disease and Williams Syndrome
• Text Services
  – Text Collection Server
  – Text Services Workflow Server
  – Interface/Browsing Client
• Conclusions/Future Work
3
Workflows, Web Services and Text Mining for Bioinformatics
• Workflows
  – useful computational models for processes that require repeated execution of a series of complex analytical tasks
  – E.g. a biologist researching the genetic basis of a disease repeatedly
    • maps a reactive spot in microarray data to a gene sequence
    • uses a sequence alignment tool to find proteins/DNA of similar structure
    • mines info about these homologues from remote DBs
    • annotates the unknown gene sequence with this discovered info
4
Workflows, Web Services and Text Mining for Bioinformatics
• Web services
  – Processing resources that
    • are available via the Internet
    • use standardised messaging formats, such as XML
    • enable communication between applications without being tied to a particular operating system/programming language
  – Useful for bioinformatics, where data used in research is
    • heterogeneous in nature: DB records, numerical results, NL texts
    • distributed across the Internet in research institutions around the world
    • available on a variety of platforms and via non-uniform interfaces
5
Workflows, Web Services and Text Mining for Bioinformatics
• Text mining
  – any process of revealing information (regularities, patterns or trends) in textual data
  – includes more established research areas such as information extraction (IE), information retrieval (IR), natural language processing (NLP), knowledge discovery from databases (KDD)
  – relevant to bioinformatics because of
    • explosive growth of biomedical literature
    • availability of some information in textual form only, e.g. clinical records
6
Workflows, Web Services and Text Mining for Bioinformatics
[Diagram labels: Workflows; Web services; Text mining; Bioinformatics]
7
Context
• Objective: deliver text services for the myGrid and CLEF projects
• myGrid has adopted the workflow model for delivering an e-biologist’s workbench
  – Scufl workflow specification language
  – Taverna workflow design tool
  – Freefluo workflow enactment engine
• Problem: how to integrate text mining into a biological workflow?
  – Most text mining runs off-line and supports interactive browsing of results
  – Most workflows run end to end with no user intervention
  – What should the inputs to text mining be?
• Solution: tap off the result of a workflow step and treat it as an implicit query
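The "tap off" idea can be illustrated with a minimal Python sketch. It is not Scufl, Taverna or Freefluo code: `run_workflow`, the step functions and the identifiers they return are all hypothetical stand-ins, chosen only to show a workflow running end to end while one intermediate result is copied out as an implicit query.

```python
from typing import Any, Callable, Dict, List, Tuple

# A workflow step is a (name, function) pair; this representation is an assumption.
Step = Tuple[str, Callable[[Any], Any]]


def run_workflow(steps: List[Step], data: Any,
                 taps: Dict[str, Callable[[Any], None]]) -> Any:
    """Run the steps in order; hand the output of any tapped step to its tap."""
    for name, func in steps:
        data = func(data)
        if name in taps:
            taps[name](data)  # side channel: the main workflow continues unchanged
    return data


# Hypothetical stand-ins for real workflow steps.
def align_sequences(sequence: str) -> List[str]:
    # Pretend alignment: return made-up identifiers of "homologous" entries.
    return ["P00001", "P00002"]


def annotate(hits: List[str]) -> str:
    return "annotated with " + ", ".join(hits)


# The tap collects the alignment output so it can seed a text mining query.
implicit_queries: List[List[str]] = []
result = run_workflow(
    [("align", align_sequences), ("annotate", annotate)],
    "ACGT",
    taps={"align": implicit_queries.append},
)
```

The workflow itself needs no user intervention; the text mining side only sees what the tap delivered.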
8
Two Case Studies in the Genetic Basis of Disease
• Graves’ Disease
  – an autoimmune condition affecting tissues in the thyroid and orbit
  – being investigated using micro-array methods
    • micro-array shows which genes are differentially expressed in normal patients vs patients with the disease (= candidate genes)
    • sequence alignment search (e.g. BLAST) finds genes/proteins with similar structure
    • function of these “homologues” may suggest function of candidate gene
  – key step for text mining follows BLAST search
    • for homologous proteins, the BLAST report contains references to proteins in the Swissprot protein database
    • Swissprot records contain ids of abstracts describing the protein in the Medline abstract database
    • abstracts can be mined directly or used as “seed” documents to assemble a set of related abstracts
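The step from a Swissprot record to abstract ids might be sketched as below. This assumes the record arrives in the Swiss-Prot flat-file format, where literature cross-references sit on `RX` lines in the form `PubMed=<id>;`. The function name `extract_pubmed_ids` and the sample record are invented for illustration and are not taken from the presented system.

```python
import re
from typing import List

# Matches the PubMed cross-reference on an RX line, e.g. "PubMed=1111111;".
PUBMED_REF = re.compile(r"PubMed=(\d+)")


def extract_pubmed_ids(record: str) -> List[str]:
    """Return the distinct PubMed ids cited on RX lines, in order of appearance."""
    ids: List[str] = []
    for line in record.splitlines():
        if line.startswith("RX"):
            for pmid in PUBMED_REF.findall(line):
                if pmid not in ids:
                    ids.append(pmid)
    return ids


# Made-up record fragment; the ids are placeholders, not real citations.
SAMPLE_RECORD = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   100 AA.
AC   P00000;
RN   [1]
RX   MEDLINE=00000001; PubMed=1111111;
RN   [2]
RX   PubMed=2222222;
RX   PubMed=1111111;
"""

seed_ids = extract_pubmed_ids(SAMPLE_RECORD)
print(seed_ids)  # ['1111111', '2222222']
```

The resulting ids are the “seed” documents: they can be fetched directly or used to assemble a wider set of related abstracts.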
9
Two Case Studies in the Genetic Basis of Disease
• Williams Syndrome
  – congenital disorder resulting in mental retardation, caused by deletion of genetic material on the 7th chromosome
  – area in which deletions occur is not well characterised; better sequence info is becoming available
  – as new sequence information becomes available
    • gene finding software is run against it
    • BLAST is run against new putative genes to identify homologues whose function may be known
  – BLAST reports provide links to abstracts in the literature
10
Text Services Architecture
[Architecture diagram. Components: User Client; Workflow Server (Workflow Enactment); Medline Server (Medline: pre-processed offline to extract biomedical terms + indexed). Workflow steps: Initial Workflow; Extract PubMed Id; Get Medline Abstract; Get Related Abstracts; Cluster Abstracts. Data labels: Swissprot/Blast record; Workflow definition + parameters; PubMed Ids; Medline Abstracts; Term-annotated Medline abstracts; Clustered PubMed Ids + titles]
11
Text Services Architecture
• A 3-way division of labour is a sensible way to deliver distributed text mining services
  – Providers of e-archives, such as Medline, will make archives available via a web-services interface
    • Cannot offer tailored services for every application
    • Will provide core, common services
  – Specialist workflow designers will add value to basic services from the archive to meet their organization’s needs
  – Users will prefer to execute predefined workflows via standard light clients such as a browser
• Architecture is appropriate for many research areas, not just bioinformatics
13
Text Collection Server
• Text collection is Medline (www.ncbi.nlm.nih.gov/)
  – > 10 million abstracts since the 1950s
  – largest repository of biomedical abstracts
  – copies made available for research, updated annually
  – records contain semi-structured information annotated in XML
    • Unique id: PubMed id
    • Citation information: author(s), journal, year, etc.
    • Manually assigned controlled vocabulary keywords (MeSH terms)
    • Text of abstract
14
Text Collection Server (cont)
• Local copy
  – Loaded in MySQL, indexed on various fields, e.g. MeSH terms
  – Text portion indexed for search engines (Lucene, Madcow)
  – Text pre-processed with text mining tools
    • Tokenisation
    • Part-of-speech tagging
    • Term parsing
    • Terminology look-up and indexes built for term classes (proteins, genes, diseases, etc.)
• Server accepts web service calls to, e.g.
  – Return text of abstract given a PubMed id
  – Return MeSH terms of abstracts given PubMed ids
  – Return PubMed ids of abstracts with given MeSH terms
  – Return PubMed ids of abstracts matching a free text query
  – Return PubMed ids of abstracts containing a specific term
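The shape of these calls can be sketched with an in-memory stand-in for the server. Everything here is an assumption made for illustration: the class `AbstractStore`, its method names, the choice to read "with given MeSH terms" as "with all of the given terms", and the naive substring matching used for free text. The real server sits on a database and search engine indexes and is reached through web service calls.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class AbstractStore:
    """Toy in-memory collection keyed by PubMed id (hypothetical interface)."""

    def __init__(self) -> None:
        self._text: Dict[str, str] = {}
        self._mesh: Dict[str, Set[str]] = {}
        self._by_mesh: Dict[str, Set[str]] = defaultdict(set)  # inverted index

    def add(self, pmid: str, text: str, mesh_terms: Iterable[str]) -> None:
        terms = set(mesh_terms)
        self._text[pmid] = text
        self._mesh[pmid] = terms
        for term in terms:
            self._by_mesh[term].add(pmid)

    def get_abstract(self, pmid: str) -> str:
        """Return text of abstract given a PubMed id."""
        return self._text[pmid]

    def get_mesh_terms(self, pmids: Iterable[str]) -> Dict[str, List[str]]:
        """Return MeSH terms of abstracts given PubMed ids."""
        return {pmid: sorted(self._mesh[pmid]) for pmid in pmids}

    def find_by_mesh(self, terms: Iterable[str]) -> List[str]:
        """Return ids of abstracts carrying all of the given MeSH terms."""
        id_sets = [self._by_mesh.get(term, set()) for term in terms]
        if not id_sets:
            return []
        return sorted(set.intersection(*id_sets))

    def find_by_text(self, query: str) -> List[str]:
        """Return ids of abstracts whose text contains every query word."""
        words = query.lower().split()
        return sorted(pmid for pmid, text in self._text.items()
                      if all(word in text.lower() for word in words))


# Made-up records for demonstration only.
store = AbstractStore()
store.add("1", "Thyroid autoantibodies in Graves disease.",
          ["Graves Disease", "Thyroid Gland"])
store.add("2", "Elastin deletion in Williams syndrome.", ["Williams Syndrome"])
```

Keeping an inverted index from MeSH term to ids makes the "ids given terms" call a set intersection rather than a scan of every record.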
16
Workflow Server
• Workflow server runs the Freefluo enactment engine to execute a Scufl workflow (designed using Taverna)
• Graves’ disease workflow: [workflow diagram]
18
Interface/Browsing Client
• Two components
  – Submit workflow for enactment
  – Explore results and launch follow-on queries
• Three types of follow-on search
  – Find other texts containing terms in the current text
  – Find texts containing a specific search string (free text search)
  – Find other texts “like” the current one (with the same MeSH terms)
• Implemented as a Java-Swing applet for easy inclusion in portals
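The third follow-on search, finding texts “like” the current one, could look roughly like the Python sketch below. The slides say only that likeness is based on having the same MeSH terms; ranking by the number of shared terms is an assumption added here, as are the function name `related_by_mesh` and the toy index.

```python
from typing import Dict, List, Set, Tuple


def related_by_mesh(current: str,
                    mesh_index: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Rank other abstracts by how many MeSH terms they share with `current`."""
    current_terms = mesh_index[current]
    scored: List[Tuple[str, int]] = []
    for pmid, terms in mesh_index.items():
        if pmid == current:
            continue
        shared = len(current_terms & terms)
        if shared:  # abstracts with no shared term are not "like" the current one
            scored.append((pmid, shared))
    # Most shared terms first; ties broken by id for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Toy index from abstract id to MeSH terms (placeholder values).
index = {"a": {"X", "Y"}, "b": {"X", "Y", "Z"}, "c": {"Y"}, "d": {"Q"}}
print(related_by_mesh("a", index))  # [('b', 2), ('c', 1)]
```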
19
Interface/Browsing Client
[Screenshot of the client. Labelled areas: Abstract body; MeSH Tree; Abstract Titles; Free text search; Search scope restrictors; Linked terms; Get Related Abstracts]
20
Conclusion
• Have implemented a set of text mining web services that run in a workflow to support biologists in exploring the genetic basis of disease
• Implementation based on a generic 3-component architecture (archive server, workflow server, browser client) with wider applicability
• Basic idea is to glean an implicit query from a workflow operation (e.g. sequence alignment)
  – Find abstracts of papers related to abstracts describing homologous proteins/genes of the gene of interest
  – Cluster results and present to user
• User can explore results and issue follow-on queries via a richly-featured graphical interface
21
Future Work
• Integrate in practice with the rest of the Graves’/Williams workflows in myGrid and get feedback from biologists
• Explore other interpretations of “relatedness” for abstracts in addition to MeSH terms
  – in assembling a corpus of related abstracts (e.g. vector space/language model notions of similarity)
  – in clustering results (e.g. k-means/agglomerative clustering)
• Explore other ways of deriving implicit queries from workflows, e.g. mining provenance data
• Explore further interface search filtering operations and interface design issues
• Scale up to process all of Medline for term/entity identification
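As one concrete reading of the vector space notion of similarity mentioned above, the following Python sketch computes cosine similarity between raw term-count vectors. It is an illustration of the general technique only, not something the presented system implements. Whitespace tokenisation and unweighted counts are simplifying assumptions; a real system would more likely use proper tokenisation and tf-idf style weighting.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words vector: lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up snippets standing in for abstracts.
doc1 = term_vector("thyroid receptor antibodies")
doc2 = term_vector("thyroid receptor gene")
doc3 = term_vector("elastin deletion")
```

Here `doc1` and `doc2` share two of three terms and score about 0.67, while `doc1` and `doc3` share none and score 0. The same pairwise scores could feed a k-means or agglomerative clustering step.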