National Centre for Text Mining NaCTeM e-science and data mining workshop John Keane Co-Director, NaCTeM School of Informatics,

Slides:



Advertisements
Similar presentations
Supporting further and higher education 19th APAN Meetings in Bangkok Innovative Uses of Pervasive Broadband Network Is adoption of technology running.
Advertisements

Intute Repository Search Project A showcase for UK research output Sophia Jones SHERPA October.
Intute Repository Search Project An iterative approach to developing a national search service to support scholarly communication, teaching and learning.
National Centre for Text Mining John Keane NaCTeM Co-director University of Manchester.
The Economic and Social Data Service (ESDS) Kevin Schürer ESDS/UKDA ESDS Awareness Day 5 December 2003.
Supporting the Research Process The NaCTeM Text Mining Service William Black Informatics, Manchester.
DELOS Highlights COSTANTINO THANOS ITALIAN NATIONAL RESEARCH COUNCIL.
Facebook for scientists Titus Schleyer et al. 1 of 38 Digital Vita : Leveraging Personal Information Management Practices to Facilitate Research Collaborations.
ASCR Data Science Centers Infrastructure Demonstration S. Canon, N. Desai, M. Ernst, K. Kleese-Van Dam, G. Shipman, B. Tierney.
13 th September 2007 UK e-Science All Hands Meeting Text Mining Services to Support e-Research Brian Rea and Sophia Ananiadou National Centre for Text.
San Diego Supercomputer Center Analyzing the NSDL Collection Peter Shin, Charles Cowart Tony Fountain, Reagan Moore San Diego Supercomputer Center.
1. UKPMC ‘We exist for everyone who wants to do research – for academic, personal, or commercial purposes.’ - BL Strategy 2005/8.
1 Enriching UK PubMed Central SPIDER launch meeting, Wolfson College, Oxford Paul Davey, UK PubMed Central Engagement Manager.
Search Engines and Information Retrieval
Semantic Web and Web Mining: Networking with Industry and Academia İsmail Hakkı Toroslu IST EVENT 2006.
Fungal Semantic Web Stephen Scott, Scott Henninger, Leen-Kiat Soh (CSE) Etsuko Moriyama, Ken Nickerson, Audrey Atkin (Biological Sciences) Steve Harris.
Thee-Framework for Education & Research The e-Framework for Education & Research an Overview TEN Competence, Jan 2007 Bill Olivier,
UK e-Science and the White Rose Grid Paul Townend Distributed Systems and Services Group Informatics Research Institute University of Leeds.
A Data Curation Application Using DDI: The DAMES Data Curation Tool for Organising Specialist Social Science Data Resources Simon Jones*, Guy Warner*,
Luxembourg, Sep 2001 Pedro Fernandes Inst. Gulbenkian de Ciência, Oeiras, Portugal EMBER A European Multimedia Bioinformatics Educational Resource.
1 Digital Libraries and Evidence in the Developing World Context Dr. Jon Ferguson Senior Health Database Scientist IMMPACT Project University of Aberdeen.
GL12 Conf. Dec. 6-7, 2010NTL, Prague, Czech Republic Extending the “Facets” concept by applying NLP tools to catalog records of scientific literature *E.
Srihari-CSE730-Spring 2003 CSE 730 Information Retrieval of Biomedical Text and Data Inroduction.
CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Dr. Kurt Fendt, Comparative Media Studies, MIT MetaMedia An Open Platform for Media Annotation and Sharing Workshop "Online Archives:
EUROPEAN UNION Polish Infrastructure for Supporting Computational Science in the European Research Space The Capabilities of the GridSpace2 Experiment.
Search Engines and Information Retrieval Chapter 1.
Text Mining: Opportunities and Barriers John McNaught Deputy Director National Centre for Text Mining
University of Dublin Trinity College Localisation and Personalisation: Dynamic Retrieval & Adaptation of Multi-lingual Multimedia Content Prof Vincent.
25./ Final DIP Review, Innsbruck, Austria1 D11.22 DIP Project Presentation V5 Oct 2006 Presented at Final Review Innsbruck, Oct, 2006.
Per Møldrup-Dalum State and University Library SCAPE Information Day State and University Library, Denmark, SCAPE Scalable Preservation Environments.
Using the Open Metadata Registry (openMDR) to create Data Sharing Interfaces October 14 th, 2010 David Ervin & Rakesh Dhaval, Center for IT Innovations.
Flexible Text Mining using Interactive Information Extraction David Milward
 Copyright 2005 Digital Enterprise Research Institute. All rights reserved. Semantic Web services Interoperability for Geospatial decision.
E-science in the Netherlands Maria Heijne TU Delft Library Director / Chair Consortium of University Libraries and National Library.
Interpreting & communicating about computational models: using visual analytics as a bridge between policy makers and researchers Philippe J. Giabbanelli,
ODD-Genes: Accelerating data-driven scientific discovery NeSC Review 2003 NeSC
CLARIN work packages. Conference Place yyyy-mm-dd
Task-oriented approach to information handling support within web-based education Lora M. Aroyo 15 November 2001.
Scenarios for a Learning GRID Online Educa Nov 30 – Dec 2, 2005, Berlin, Germany Nicola Capuano, Agathe Merceron, PierLuigi Ritrovato
EVA Workshop, 26 March 2003, Florence, Italy1 COINE Cultural Objects In Networked Environments Anthi Baliou University of Macedonia,Library Thessaloniki,
SEEK Welcome Malcolm Atkinson Director 12 th May 2004.
Semantic based P2P System for local e-Government Fernando Ortiz-Rodriguez 1, Raúl Palma de León 2 and Boris Villazón-Terrazas 2 1 1Universidad Tamaulipeca.
BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.
Recent Developments in CLARIN-NL Jan Odijk P11 LREC, Istanbul, May 23,
CNI, 3rd April 2006 Slide 1 UK National Centre for Text Mining: Activities and Plans Dr. Robert Sanderson Dept. of Computer Science University of Liverpool.
Combining the strengths of UMIST and The Victoria University of Manchester “Use cases” Stephen Pickles e-Frameworks meets e-Science workshop Edinburgh,
Using Domain Ontologies to Improve Information Retrieval in Scientific Publications Engineering Informatics Lab at Stanford.
Scientific Workflow systems: Summary and Opportunities for SEEK and e-Science.
Enabling e-Research in Combustion Research Community T.V Pham 1, P.M. Dew 1, L.M.S. Lau 1 and M.J. Pilling 2 1 School of Computing 2 School of Chemistry.
Digital Libraries1 David Rashty. Digital Libraries2 “A library is an arsenal of liberty” Anonymous.
UKOLN is supported by: Introduction to UKOLN Dr Liz Lyon, Director UKOLN, University of Bath, UK Grand Challenge Meeting, June a centre.
26/05/2005 Research Infrastructures - 'eInfrastructure: Grid initiatives‘ FP INFRASTRUCTURES-71 DIMMI Project a DI gital M ulti M edia I nfrastructure.
Opportunities for Text Mining in Bioinformatics (CS591-CXZ Text Data Mining Seminar) Dec. 8, 2004 ChengXiang Zhai Department of Computer Science University.
A centre of expertise in digital information management Shaping the e-future? Grids, Web Services and Digital Libraries Professor Tony.
Organising social science data – computer science perspectives Simon Jones Computing Science and Mathematics University of Stirling, Stirling, Scotland,
DANIELA KOLAROVA INSTITUTE OF INFORMATION TECHNOLOGIES, BAS Multimedia Semantics and the Semantic Web.
1 e-Arts and Humanities Scoping an e-Science Agenda Sheila Anderson Arts and Humanities Data Service Arts and Humanities e-Science Support Centre King’s.
Clinical research data interoperbility Shared names meeting, Boston, Bosse Andersson (AstraZeneca R&D Lund) Kerstin Forsberg (AstraZeneca R&D.
Toward a common data and command representation for quantum chemistry Malcolm Atkinson Director 5 th April 2004.
NeOn Components for Ontology Sharing and Reuse Mathieu d’Aquin (and the NeOn Consortium) KMi, the Open Univeristy, UK
EUROPEAN UNION Polish Infrastructure for Supporting Computational Science in the European Research Space The Capabilities of the GridSpace2 Experiment.
A Semi-Automated Digital Preservation System based on Semantic Web Services Jane Hunter Sharmin Choudhury DSTC PTY LTD, Brisbane, Australia Slides by Ananta.
TDM in the Life Sciences Application to Drug Repositioning *
From CLEF to TrebleCLEF Promoting Technology Transfer
TRSS Terminology Registry Scoping Study
Data Warehousing and Data Mining
TDM=Text Mining “automated processing of large amounts of structured digital textual content for purposes of information retrieval, extraction, interpretation.
Presentation transcript:

National Centre for Text Mining NaCTeM e-science and data mining workshop John Keane Co-Director, NaCTeM School of Informatics, University of Manchester

2 What is text mining? Text mining attempts to discover new, previously unknown information by applying techniques from data mining, information retrieval, and natural language processing: –To identify and gather relevant textual sources –To analyse these to extract facts involving key entities and their properties –To combine the extracted facts to form new facts or to gain valuable insights.

3 Why text mining? Users overwhelmed by amount of text; 80% of information in textual form –Results of scientific experiments also reproduced in textual form –Critical information missed Only 12% of TOXLINE users find what they want –Lost in irrelevant documents

4 Why text mining? (cont.) Biological information –Disseminated over huge amount of documents –Dynamic (e.g. discovery of new genes) –Databases and controlled vocabularies encode only fraction of information and interactions Significant error rate in manually curated DBs –No agreement on naming (inconsistent terminology) –$800M for 12 years for discovery of new drug Reduced to 10 years if all relevant information known

5 Main problems? – Growing number and size of documents – Language variability and ambiguity – Difficult to assimilate/locate new knowledge without automated help Main purposes of TM? – M ake information buried in text accessible – Integrate information across articles – Help interpret experimental data – Update databases (semi)automatically Problem and Purpose

6 Blagosklonny & Pardee argued information necessary to understand some processes related to p53 gene implicitly available for 10 years in the literature before finally clarified Blagosklonny, M., Pardee, A., 2001: Conceptual Biology: Unearthing the Gems, Nature 416: 373. A simple motivating example

7 A vision for the future DBs with accurate, valid, exhaustive, rapidly updated data Drug discovery costs slashed; animal experimentation reduced through early identification of unpromising paths New insights gained through integration and exploitation of experimental results, DBs, and scientific knowledge Product development archives and patents yield new directions for R&D Searching yields facts rather than documents to read

8 SCALABLE TEXT MINING TERM & INFORMATION EXTRACTION USER INTERFACE ONTOLOGIES DATA MINING INFORMATION RETRIEVAL USERS TEXT MINING BIO-MEDICALBIO-MEDICAL ENGINEERINGENGINEERING SOCIAL SCIENCEDIGITAL LIBRARIES SCIENCE SERVICE ASPECTS USER EVALUATION PRODUCTION SOFTWARE SERVICE PROVISION ACCESSIBILITY BENCHMARKING

9 Information retrieval –Gather, select, filter, documents that may prove useful –Find what is known Information extraction –Partial, shallow language analysis –Find relevant entities, facts about entities –Find only what looking for Mining –Combine, link facts –Discover new knowledge, find new facts Resources: –ontologies, lexicons, terminologies, grammars, annotated corpora (machine learning, evaluation) Text mining tasks and resources

10 National Centre for Text Mining First such centre in the world Funding: JISC, BBSRC, EPSRC –£1M over 3 years Consortium investment –>800K including new chair in TM Location: –Manchester Interdisciplinary Biocentre (MIB) Focus: Bio, then extend to other domains To move towards self-sustainability –Extend services to industry

11 Consortium Manchester, Liverpool, Salford Service activity run by MIMAS (National Centre for Dataset Services), within MC (Manchester Computing) Self-funded partners –San Diego Supercomputing Center –UCalifornia, Berkeley –UGeneva –UTokyo Strong industrial & academic support –Astrazeneca, Xerox, EBI, Wellcome Trust, Sanger Institute, IBM, Unilever, ELRA, NowGEN, bionow

12 Competence within the consortium Manchester: TM (IE); standards; bioinformatics; parallel & distributed DM, HPC; HCI; ontologies & standards for semantic web; e-Science and GRID; curation of biodatabases Salford: Bio-TM; terminology; visualisation Liverpool: digital archives & IR; bioinformatics Manchester Computing: service provision; national data services (MIMAS); Supercomputing (HPC CSAR)

13 Self-funded Partners San Diego Supercomputing Centre: SKIDL toolkit for DM of high dimensional datasets; distributed and parallel computing SIMS, UCalifornia, Berkeley: probabilistic semantic grammar TM UTokyo: Bio TM; IE, GENIA, ontologies ISSCO, UGeneva: Standards-based evaluation methodologies for TM tools

14 Services Establish a high quality service provision for the UK academic community –Identify the ‘best of breed’ TM tools; Types of services –Facilitate access to TM tools, resources & support –Offer on-line use of resources and tools (also to guide and instruct) –Offer one-stop shop for complete, end-to-end processing

15 Critical aspects of service provision Scalability: most critical aspect of TM tool usage for a national service Consortium strong in distributed, high performance, GRID activity Focus on user-need related development whilst using experts to feed-in research results –need to separate research from TM service provision

16 Core development Creation of an infrastructure for TM –Standards based infrastructure for components and datasets to allow efficient distributed computing –Portal access Support for information retrieval –Use of a digital library system (Cheshire) to harvest and index data with improved index term weighting –Combine SKIDL (data mining toolkit, SDSC) with latent semantic analysis and probabilistic retrieval to extract text fragments for IE

17 Core development Support for inter-component communication, term management and IE –Produce a common annotation scheme to support UK TM, further developed from EU project –Collaboration ISO TC37/SC4 Term management –ATRACT workbench (Lion BioScience / EBI) Information Extraction –Manchester CAFETIERE

18 Core development Support for data grid technologies –Provide a means to connect to heterogeneous resources User interfaces, scientific data integration and mediation –Support user to set up data mining scenarios (via wizards) Support for advanced visualisation

19 Specific collaboration with EBI/SIB Swiss-Prot –Semi-auto produce highly precise terminology –Demonstrate precision of facts and associations mined from text that can be linked to terminology. Curation facilitation via text mining –Enhances access to, and retrieval of, literature –Keep DB links current/consistent with literature –Extracts pertinent facts from literature EBI tests suggest ~10% improvement over manual –Evident link with work of Digital Curation Centre Improve/construct/validate ontologies Evaluation for users/performance evaluation Component standardisation/ API/infrastructure/Common Annotation Scheme Exemplary Outcomes

20 Creation of NaCTeM responds to a widely-felt need for text mining to support research Critical to engage with –Users in academia and (eventually) industry –Providers of bioinformatics infrastructure and resources such as biodatabases TM can be used also in teaching & learning –Problem-based learning –CAL support (extraction of facts and examples) –Exploration of link between student experimental results and results reported in literature Conclusion

21 Introductory BioLink SIG (workshop at ISMB 04 in Glasgow) BIONLP.org Tutorials psb2001tutorials.html Background reading

22 Appendix: Types of services Access to TM tools developed from leading research Access to a selection of commercial text mining tools Access to ontology libraries Access to large and varied data sources Access to a library of data filtering tools On-line tutorials, briefings and white papers On-line advice to match specific requirements to TM solutions On-line TM & packaging of results involving GRID-based flexible composition of tools, resources and data by users TM tool trials and evaluations Collaborative development / enhancement of TM tools, annotated corpora and ontologies