April 21, 2005, EPSRC E-Science Meeting, NeSC. Real-time Text Mining for the Biomedical Literature: a collaboration between Discovery Net & myGrid. Rob Gaizauskas.

Presentation transcript:

April 21, 2005, EPSRC E-Science Meeting, NeSC
Real-time Text Mining for the Biomedical Literature: a collaboration between Discovery Net & myGrid
Rob Gaizauskas, Department of Computer Science, University of Sheffield
Moustafa M. Ghanem, Department of Computing, Imperial College London

Outline
Context
– Workflows, Services and Text Mining
– Discovery Net & myGrid
Aims and Objectives of New Project
Architecture of New System
– Integration of Existing Components
Approach to Text Mining
– Data Resources & Evaluation
– Techniques for GO Tagging
Interface and Results Presentation
Lessons Learnt So Far, Conclusions and Broader Applicability of Work

Workflows, Web Services and Text Mining for Bioinformatics
Workflows
– useful computational models for processes that require repeated execution of a series of complex analytical tasks
– e.g. a biologist researching the genetic basis of a disease repeatedly
    – maps a reactive spot in microarray data to a gene sequence
    – uses a sequence alignment tool to find proteins/DNA of similar structure
    – mines info about these homologues from remote DBs
    – annotates the unknown gene sequence with this discovered info

Workflows, Web Services and Text Mining for Bioinformatics
Web services
– Processing resources that
    – are available via the Internet
    – use standardised messaging formats, such as XML
    – enable communication between applications without being tied to a particular operating system/programming language
– Useful for bioinformatics, where data used in research is
    – heterogeneous in nature: DB records, numerical results, NL texts
    – distributed across the Internet in research institutions around the world
    – available on a variety of platforms and via non-uniform interfaces

Workflows, Web Services and Text Mining for Bioinformatics
Text mining
– any process of revealing information (regularities, patterns or trends) in textual data
– includes more established research areas such as information extraction (IE), information retrieval (IR), natural language processing (NLP), knowledge discovery from databases (KDD) and traditional data mining (DM)
– relevant to bioinformatics because of
    – explosive growth of biomedical literature
    – availability of some information in textual form only, e.g. clinical records

Workflows, Web Services and Text Mining for Bioinformatics
[Diagram relating Workflows, Web services, Text mining and Bioinformatics]

Discovery Net & myGrid
Discovery Net: An e-Science testbed for High Throughput Informatics
– £2.2M EPSRC Pilot Project
– Started Oct 01, ended March 05
– Service-based infrastructure/workflow model for Life Sciences, Environmental Modelling and Geo-hazard Modelling
– Infrastructure for mixed data mining / text mining
– Machine learning methods for text mining
myGrid: Directly Supporting the e-Scientist
– £3.5M EPSRC Pilot Project
– Started Oct 01, ends June 05
– Service-based infrastructure/workflow model for Life Sciences
– Infrastructure for Text Collection Server, Text Services Workflow Server and Interface/Browsing Client
– Service-based Terminology Servers

myGrid
Overall aim: develop an e-biologist's workbench, a platform allowing biologists to execute, analyze and repeat multi-stage in silico experiments involving distributed data, code and processing resources
– Workflow model for composing/executing processing components
– Web services for distribution
Problem: how to integrate text mining into a biological workflow?
– Most text mining runs off-line and supports interactive browsing of results
– Most workflows run end to end with no user intervention
– What are the inputs to text mining to be?
Solution: tap off the result of a workflow step and treat it as an implicit query

A myGrid example: studying the Genetic Basis of Disease
Graves' Disease
– an autoimmune condition affecting tissues in the thyroid and orbit
– being investigated using micro-array methods
    – micro-array shows which genes are differentially expressed in normal patients vs patients with the disease = candidate genes
    – sequence alignment search (e.g. BLAST) finds genes/proteins with similar structure
    – function of these homologues may suggest function of candidate gene
– key step for text mining follows the BLAST search for homologous proteins
    – BLAST report contains references to proteins in the Swissprot protein database
    – Swissprot records contain ids of abstracts describing the protein in the Medline abstract database
    – abstracts can be mined directly or used as "seed" documents to assemble a set of related abstracts
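The key text-mining step on this slide (from a Swissprot record to the ids of Medline abstracts) can be sketched as follows. This is a minimal illustration, assuming the record is available in Swiss-Prot flat-file form, where cross-reference lines start with `RX` and carry `PubMed=<id>;` entries. The sample record contents and the function name `extract_pubmed_ids` are invented for illustration and are not taken from the myGrid code.

```python
import re

# Hypothetical excerpt of a Swiss-Prot flat-file record (contents invented).
SAMPLE_RECORD = """\
ID   EXAMPLE_HUMAN           Reviewed;         100 AA.
RN   [1]
RX   PubMed=1234567; DOI=10.0000/example.1;
RN   [2]
RX   MEDLINE=99999999; PubMed=7654321;
RN   [3]
RX   PubMed=1234567;
"""

PUBMED_RE = re.compile(r"PubMed=(\d+);")


def extract_pubmed_ids(record):
    """Return PubMed ids cited on RX lines, in order, without duplicates."""
    ids = []
    for line in record.splitlines():
        if not line.startswith("RX"):
            continue  # only cross-reference lines carry PubMed ids
        for pmid in PUBMED_RE.findall(line):
            if pmid not in ids:
                ids.append(pmid)
    return ids


print(extract_pubmed_ids(SAMPLE_RECORD))
```

The resulting ids would then serve as the implicit query: each one identifies a seed abstract to fetch and, optionally, to expand into a set of related abstracts.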

myGrid Text Services Architecture
[Architecture diagram. Components: User Client; Workflow Server; Medline Server. Workflow Server labels: Workflow Enactment; Initial Workflow; Extract PubMed Id; Get Medline Abstract; Cluster Abstracts; Get Related Abstracts. Medline Server label: Medline, pre-processed offline to extract biomedical terms + indexed. Data-flow labels: Swissprot/Blast record; Workflow definition + parameters; PubMed Ids; Medline Abstracts; Term-annotated Medline abstracts; Clustered PubMed Ids + titles]

myGrid Text Services Architecture
3-way division of labour is a sensible way to deliver distributed text mining services
– Providers of e-archives, such as Medline, will make archives available via a web-services interface
    – Cannot offer tailored services for every application
    – Will provide core, common services
– Specialist workflow designers will add value to basic services from the archive to meet their organization's needs
– Users will prefer to execute predefined workflows via standard light clients such as a browser
Architecture appropriate for many research areas, not just bioinformatics

myGrid Interface/Browsing Client
[Screenshot of the client. Labelled areas: MeSH Tree; Abstract Titles; Abstract body; Free text search; Search scope restrictors; Linked terms; Get Related Abstracts]

Discovery Net: Adding text mining to e-Science workflows
[Workflow diagram. Labelled stages: Find Relevant Genes from Online Databases; Find Associations between Frequent Terms; Gene Expression Analysis]
DNet Workflow server executes DPML workflow and uses Discovery Net's InfoGrid data access and integration wrappers and web services

Text Mining in e-Science workflows
Problem: how to develop new distributed text mining applications using a workflow?
– Most text mining applications require the integration of a mixture of components (services) for text processing tasks (e.g. parsing and cleaning), natural language processing (e.g. named entity recognition), and statistics and data mining (e.g. classification, clustering, etc.).
– There are many design alternatives, and end users may want to prototype and compare alternative implementations.
– Once an application is developed, most workflows run end to end with no user intervention.
Solution: extend the service infrastructure to allow composition of text mining services.

Building text mining applications from workflows
Text Mining Pipelines: using workflow technologies to build text mining applications and services from finer-grain components/services
[Pipeline diagram: Text documents → Retrieval/Storage → Text docs → Text Processing → Text docs → Feature Extraction → Numerical Feature Vectors → Data Mining]
Retrieval/Storage: Indexing, Access Drivers, Storage
– Retrieve and organize relevant documents
Text Processing: Stemming, Stop-word filters, Pattern filters, Lexicon matching, Ontologies, NLP parsing, etc.
– Pre-process documents to enhance the ease of feature extraction
Feature Extraction
– Statistical: Word Counts, Pattern Extraction & Counts, etc.
– Domain-specific: Gene Name counts, etc.
– NLP-specific: Phrase counts, etc.
– Features are summarized into vector forms which are suitable for data mining
Data Mining: Classification, Clustering, Association, Statistical Analysis, Visual Analysis, etc.
– Results can be document characterization or hidden relationship extraction
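The text processing and feature extraction stages of such a pipeline can be sketched with small composable functions. This is only an illustrative sketch: the stop-word list, the crude suffix-stripping "stemmer" and all function names are assumptions made for the example, and do not reproduce the Discovery Net components.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "of", "in", "a", "is", "and", "to"}  # tiny illustrative list


def preprocess(text):
    """Text processing: lower-case, tokenise, drop stop words, crude stemming."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        # Strip at most one common suffix; a stand-in for a real stemmer.
        for suffix in ("ing", "ed", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        stems.append(tok)
    return stems


def build_vocabulary(token_lists):
    """Feature extraction, step 1: fix an ordered vocabulary for the collection."""
    return sorted({tok for tokens in token_lists for tok in tokens})


def to_vector(tokens, vocabulary):
    """Feature extraction, step 2: word counts as a numerical feature vector."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


docs = [
    "The gene is expressed in the thyroid.",
    "Expression of the gene activates proteins.",
]
token_lists = [preprocess(d) for d in docs]
vocab = build_vocabulary(token_lists)
vectors = [to_vector(t, vocab) for t in token_lists]
print(vocab)
print(vectors)
```

Keeping each stage as a separate function mirrors the slide's point about finer-grain components: any stage can be swapped (for example, a different stemmer) without touching the data mining step that consumes the vectors.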

Simplified Document Classification Workflow
Examples of Extracted Patterns
GENE_NAME protein
GENE_NAME express
express GENE_NAME
GENE_NAME mutant
GENE_NAME activity
activity GENE_NAME
GENE_NAME drosophila
Examples of Pattern Definitions
delet\s([a-z]*(\s)+)*genenam+\s
depend\s([a-z]*(\s)+)*genenam+\s
describ\s([a-z]*(\s)+)*genenam+\s
detect\s([a-z]*(\s)+)*genenam+\s
determin\s([a-z]*(\s)+)*genenam+\s
differ\s([a-z]*(\s)+)*genenam+\s
disc\s([a-z]*(\s)+)*genenam+\s
dna\s([a-z]*(\s)+)*genenam+\s
Predictive accuracy of relevance prediction, using Support Vector Machine classification
– Overall accuracy: 84.5%
– Precision: 78.11%
– Recall: 73.40%
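Pattern definitions of this form can be turned into binary document features with an ordinary regular-expression engine. The sketch below copies three of the slide's patterns verbatim; everything else is an assumption for illustration, in particular that the patterns are applied to stemmed, lower-case text in which gene names have been replaced by the token `genenam`, with single spaces and a trailing space.

```python
import re

# Three of the pattern definitions listed on the slide, copied verbatim.
PATTERN_DEFINITIONS = {
    "delet": r"delet\s([a-z]*(\s)+)*genenam+\s",
    "detect": r"detect\s([a-z]*(\s)+)*genenam+\s",
    "dna": r"dna\s([a-z]*(\s)+)*genenam+\s",
}
COMPILED = {name: re.compile(rx) for name, rx in PATTERN_DEFINITIONS.items()}


def pattern_features(normalised_text):
    """Return a 0/1 feature per pattern: 1 if the pattern occurs in the text."""
    return {
        name: int(bool(rx.search(normalised_text)))
        for name, rx in COMPILED.items()
    }


# Assumed normalised form: stemmed, lower-case, gene names replaced by
# 'genenam', single spaces, and a trailing space so the final \s can match.
example = "we detect high level of genenam in dna sampl "
print(pattern_features(example))
```

One design caveat: the nested quantifiers in `([a-z]*(\s)+)*` can backtrack heavily on long inputs with runs of whitespace, so collapsing whitespace before matching (as assumed here) is a sensible precaution.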

Text Meta Data Model
Build classifier training phase using workflow co-ordinating distributed services
Build prediction phase using workflow co-ordinating distributed services
Metadata Model: service interfaces only tell you how to invoke a remote service, but it is up to you to decide what information flows between services!

Aims & Objectives of New Project
Aim: to develop a unified real-time e-Science text-mining infrastructure that leverages the technologies and methods developed by both Discovery Net and myGrid
– Software engineering challenge: integrate complementary service-based text mining capabilities with different metadata models into a single framework
– Application challenge: annotate biomedical abstracts with semantic categories from the Gene Ontology
Deliverables:
– D1: A GO Annotation Service
– D2: A Generic Shared Infrastructure for Grid-enabled Biomedical Document Categorization
– D3: Infrastructure for Semantic Document Annotation
– D4: A Detailed Case Study (analysing/evaluating the GO annotator)
– D5: Developing a common framework for representing + exchanging information about:
    1. Data: biomedical documents/doc collections + metadata, biomedical dictionaries
    2. Intermediate data: document indexes and document feature vectors
    3. Text analysis results

GO TAG: A Novel Application
The GO TAG Application: Automatic Assignment of GO (Gene Ontology) Codes to Medline Documents

A Machine Learning Approach
[Figure: Overview of Training Phase]

Run-time System
[Figure: Overview of Run-time System]

GO Annotator – Version 1
Version 1a:
– Direct search for GO Annotation descriptions and synonyms in document text
– If description is found, document is labelled with this GO Annotation
– Description is also marked up in document
Version 1b:
– 1a + search for gene names extracted from yeast genome DB
– If gene name found, document labelled with GO annotation(s) associated with gene in DB
– Gene name also marked up in document
Termino web-service, hosted at Sheffield, provides lookup capability
This is wrapped in a Discovery Net workflow to include PubMed query, results visualization and performance calculations
Workflow is deployed as a web application for end users, which includes an applet to interactively browse results
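The Version 1a behaviour (find a GO description or synonym, label the document, mark up the match) can be sketched with a plain dictionary lookup. This is a stand-in, not the Termino service: the lexicon, its placeholder ids (`GO:EXAMPLE1`, `GO:EXAMPLE2`, which are not real GO identifiers), the `<go>` mark-up tag and the function name are all assumptions for illustration.

```python
import re

# Hypothetical mini-lexicon: placeholder ids mapped to a description and
# synonyms. These ids are NOT real GO identifiers.
GO_LEXICON = {
    "GO:EXAMPLE1": ["dna repair", "repair of dna"],
    "GO:EXAMPLE2": ["protein folding"],
}


def annotate(text):
    """Label a document with every id whose description or synonym occurs in
    the text, and mark up each occurrence in place."""
    labels = set()
    marked = text
    for go_id, phrases in GO_LEXICON.items():
        for phrase in phrases:
            pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
            if pattern.search(marked):
                labels.add(go_id)
                marked = pattern.sub(
                    lambda m, go_id=go_id: f'<go id="{go_id}">{m.group(0)}</go>',
                    marked,
                )
    return sorted(labels), marked


labels, marked = annotate("Mutants show defects in DNA repair and protein folding.")
print(labels)
print(marked)
```

Version 1b could be layered on the same function by adding a second lexicon that maps gene names to the annotation ids recorded for that gene, again as an assumption about structure rather than a description of the deployed service.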

GO Annotator – Version 1
[Figure: Underlying Discovery Net Workflow]

GO Annotator – Version 1: Underlying Discovery Net Workflow
Enter query and retrieve abstracts from PubMed.

GO Annotator – Version 1: Underlying Discovery Net Workflow
Use Termino to mark up abstracts with GO Annotations when a match for a GO Annotation description is found.

GO Annotator – Version 1: Underlying Discovery Net Workflow
Tabulate GO Annotations by PMID.

GO Annotator – Version 1: Underlying Discovery Net Workflow
Join PMIDs and matching GO Annotations with abstracts and titles.
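The tabulate and join steps of this workflow amount to a group-by on PMID followed by a join against the abstract table. The sketch below shows one way to do that in plain Python; the PMIDs, annotation ids and field names are invented, and treating the join as an inner join (abstracts with no annotation are dropped) is an assumption, since the slides do not say.

```python
from collections import defaultdict

# Hypothetical inputs: (PMID, annotation id) matches and abstract metadata.
matches = [("101", "GO:A"), ("102", "GO:B"), ("101", "GO:C"), ("101", "GO:A")]
abstracts = {
    "101": {"title": "Title one", "abstract": "Body one"},
    "102": {"title": "Title two", "abstract": "Body two"},
    "103": {"title": "Title three", "abstract": "Body three"},
}


def tabulate_by_pmid(pairs):
    """Group annotation ids by PMID, removing duplicate matches."""
    table = defaultdict(set)
    for pmid, go_id in pairs:
        table[pmid].add(go_id)
    return {pmid: sorted(ids) for pmid, ids in table.items()}


def join_with_abstracts(table, abstract_table):
    """Inner join on PMID: one row per abstract that has an annotation."""
    return [
        {"pmid": pmid, "go": table[pmid], **abstract_table[pmid]}
        for pmid in sorted(table)
        if pmid in abstract_table
    ]


table = tabulate_by_pmid(matches)
rows = join_with_abstracts(table, abstracts)
for row in rows:
    print(row)
```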

Workflow Deployment

GO Annotator – Version 2
Use Saccharomyces (Yeast) Genome Database as source of papers expertly curated with GO Annotations
Train classifier using these papers
– Hierarchical classification
– Training data sufficient to classify over 2000 GO Annotations
Classifier is then applied to assign GO Annotations to unseen papers
Main Issues:
– Choice of features to be extracted from the training documents
– Choice of feature reduction methods to produce accurate classification
– Choice of classification algorithm to be used

GO Annotator – Version 2
[Figure: Underlying Discovery Net Workflow]

GO Annotator – Version 2: Underlying Discovery Net Workflow
Papers expertly curated with GO Annotations from SGD database.

GO Annotator – Version 2: Underlying Discovery Net Workflow
Generate vector of features (frequent phrases) for each paper. This is used to train the classifier.

GO Annotator – Version 2: Underlying Discovery Net Workflow
Generate a Naïve Bayesian classification model.
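A Naïve Bayesian model of the kind named in this step can be sketched in a few lines. What follows is a generic multinomial Naïve Bayes with add-one smoothing over phrase features, not the project's implementation. It assigns a single label per document for brevity, whereas GO annotation assigns several, and the labels and phrases are invented.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (feature_list, label) pairs. Returns a model dict."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        class_counts[label] += 1
        feature_counts[label].update(features)
        vocabulary.update(features)
    return {
        "classes": class_counts,
        "features": feature_counts,
        "vocab": vocabulary,
        "n": len(examples),
    }


def predict(model, features):
    """Return the label with the highest log posterior for the features."""
    best_label, best_score = None, -math.inf
    v = len(model["vocab"])
    for label, n_docs in model["classes"].items():
        total = sum(model["features"][label].values())
        score = math.log(n_docs / model["n"])  # log prior
        for f in features:
            if f in model["vocab"]:  # ignore features never seen in training
                # Add-one (Laplace) smoothed log likelihood.
                score += math.log((model["features"][label][f] + 1) / (total + v))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data: frequent-phrase features with placeholder labels.
training = [
    (["cell cycle", "dna replication"], "GO:A"),
    (["dna replication", "dna repair"], "GO:A"),
    (["protein folding", "heat shock"], "GO:B"),
]
model = train(training)
print(predict(model, ["dna repair", "cell cycle"]))
print(predict(model, ["heat shock"]))
```

A multi-label variant could train one such binary model per GO term and keep every term whose score beats its complement; that is a common adaptation, offered here as a possibility rather than as what the project did.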

GO Annotator – Version 2: Underlying Discovery Net Workflow
Generate vector of features (frequent phrases) for each paper in the test data set. This is used to test the classifier.

GO Annotator – Version 2: Underlying Discovery Net Workflow
Apply classification model to test data to evaluate classification accuracy.

Interface + Results Presentation
[Screenshot. Labelled areas: GO Hierarchy; Abstract Titles; Abstract Bodies; GO Labels/Gene Names]

Achievements to date
Infrastructure Interoperability
– More than just remote web service invocation: interoperable metadata models
Mark 1 System Implemented
– Annotation based on terminology lookups
– 15% Recall & 5% Precision (exact matches for 18,000 GO terms)
    – Measures inadequate due to incompleteness of gold standard
In process of finalising Training Data Sets and Evaluation Metrics
– 4,922 papers referencing 2,455 GO Terms
Mark 2 Systems in Progress
– Naïve Bayesian Approach
– 41% Recall and 27% Precision
User Interfaces
Mark 3, 4, … Systems and Evaluation
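Recall and precision figures like those quoted for the Mark 1 and Mark 2 systems can be computed by comparing predicted annotations with a gold standard. The slides do not say how the figures were aggregated, so the sketch below assumes micro-averaging over (document, GO term) pairs with exact matching; the document ids and terms are invented.

```python
def precision_recall(predicted, gold):
    """predicted, gold: dicts mapping document id -> set of annotation ids.
    Returns micro-averaged (precision, recall) over exact matches."""
    tp = fp = fn = 0
    for doc_id in set(predicted) | set(gold):
        p = predicted.get(doc_id, set())
        g = gold.get(doc_id, set())
        tp += len(p & g)  # predicted and curated
        fp += len(p - g)  # predicted but not in the gold standard
        fn += len(g - p)  # curated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


predicted = {"d1": {"A", "B"}, "d2": {"C"}}
gold = {"d1": {"A"}, "d2": {"C", "D"}, "d3": {"E"}}
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

This also makes the slide's caveat concrete: when the gold standard is incomplete, a correct annotation that curators never recorded is counted in `fp`, which depresses measured precision.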

Implementation Options
Feature Vector Options
– Bag of words
– Frequent Phrases
– Key Phrases (Gene Names, Protein Names, MeSH terms, etc.)
Classifier Options
– Bayesian Classifiers
– Support Vector Machines
– Drag Push (a novel centroid based method)
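The first two feature vector options can be contrasted in a short sketch. The slides do not define "frequent phrases", so the code assumes one plausible reading: word n-grams that occur in at least `min_docs` documents. The function names, thresholds and sample tokens are illustrative only.

```python
from collections import Counter


def bag_of_words(tokens):
    """Bag of words: per-document counts of single tokens."""
    return Counter(tokens)


def ngrams(tokens, n):
    """All contiguous n-token phrases in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_phrases(token_lists, n=2, min_docs=2):
    """Phrases (n-grams) that occur in at least min_docs documents."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(ngrams(tokens, n)))  # count each doc once
    return sorted(p for p, df in doc_freq.items() if df >= min_docs)


token_lists = [
    ["cell", "cycle", "arrest"],
    ["cell", "cycle", "control"],
    ["dna", "repair"],
]
print(bag_of_words(token_lists[0]))
print(frequent_phrases(token_lists))
```

Bag of words keeps every token but loses word order, while the phrase option keeps a smaller set of multi-word features that recur across the collection.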

Lessons Learnt and Challenges to Face
Infrastructure
– Interoperability Issues
– Performance Issues:
    – Communication vs Persistence of remote server
    – Off-line vs on-line feature extraction
Text Mining
– Usability Issues
– Evaluation Issues