BiOnym: A concept-mapping workflow for taxon names reconciliation. iMarine Board 5, Wednesday 25 June 2014, FAO, Rome, Italy.


BiOnym
A concept-mapping workflow for taxon names reconciliation
iMarine Board 5, Wednesday 25 June 2014, FAO, Rome, Italy
Fabio Fiorellato, Edward Vanden Berghe, Gianpaolo Coro, Nicolas Bailly, Caselyn Aldemita
FAO / CNR / FIN / VUB

‘Big Data’: data makes its way into biology
- Need for data integration
- Becoming a very realistic possibility
  - Management of DBs of millions of records
- Needs integration of small, restricted-scope datasets into massive databases
  - Intra-discipline integration (homogeneous)
  - Inter-discipline integration (heterogeneous)
- Individual studies are too small to inform on a scale commensurate with the problems humankind faces
  - Evidence-based management of living resources
  - Climate change, global warming…

Central role of taxon name reconciliation
[Diagram labels: Taxon name access; Taxon name enrichment; Taxon name reconciliation; Occurrence data access; Occurrence data enrichment; Occurrence data reconciliation; Environmental data access; Distribution modelling (openModeller, AquaMaps)]

The BiOnym Workflow

Taxonomic names are the keys…
- … keys to bind together information on the same taxon from different sources
- But there are problems
  - Different research groups use different spellings
  - Accidental misspellings
  - Synonym and homonym reconciliation (but outside the scope of BiOnym)

Some people can’t type
Real example from the OBIS point data database:
- Asthenognathas inaefaipes
- Asthenognathus inaeqipes
- Asthenognathus maefaipes
- Astheognathus inaequipes
- Asthenognathus inaeguipes
- Astheognathus inaeqinipes
- Asthenognathus inaequipes
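To make the scale of these typos concrete, the sketch below scores each variant against one spelling with a plain Levenshtein edit distance. This is only an illustration: the slides do not say which similarity measures BiOnym uses, and treating the last spelling in the list as the intended one is an assumption made here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


# Assumption: the last spelling on the slide is the intended name.
REFERENCE = "Asthenognathus inaequipes"
VARIANTS = [
    "Asthenognathas inaefaipes",
    "Asthenognathus inaeqipes",
    "Asthenognathus maefaipes",
    "Astheognathus inaequipes",
    "Asthenognathus inaeguipes",
    "Astheognathus inaeqinipes",
]

scores = {variant: similarity(variant, REFERENCE) for variant in VARIANTS}
for variant, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.2f}  {variant}")
```

Every variant stays within a few edits of the reference spelling, which is why a score threshold can recover them while an exact string comparison recovers none.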

Things can go very wrong with Excel
- Clupea harengus Linnaeus, 1758
- Clupea harengus Linnaeus, 1759
- Clupea harengus Linnaeus, 1760
- …
- Clupea harengus Linnaeus, 2254
- Clupea harengus Linnaeus, 2255
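A drag-fill accident like this one leaves a recognisable trace: the same name and author repeated with the year increasing by one on each row. The sketch below is a hypothetical pre-processing check, not part of BiOnym as described in the slides; the function name, the regular expression and the `min_run` threshold are all assumptions.

```python
import re
from collections import defaultdict

# Splits "Clupea harengus Linnaeus, 1758" into a prefix and a four-digit year.
PATTERN = re.compile(r"^(?P<prefix>.+),\s*(?P<year>\d{4})$")


def autofill_suspects(rows, min_run=3):
    """Return {prefix: first_year} for prefixes whose years climb by 1 on
    at least `min_run` consecutive rows, a typical drag-fill signature."""
    years_by_prefix = defaultdict(list)
    for row in rows:
        match = PATTERN.match(row.strip())
        if match:
            years_by_prefix[match.group("prefix")].append(int(match.group("year")))

    suspects = {}
    for prefix, years in years_by_prefix.items():
        run = longest = 1
        for previous, current in zip(years, years[1:]):
            run = run + 1 if current == previous + 1 else 1
            longest = max(longest, run)
        if longest >= min_run:
            # Assumption: the first year seen is the one that was typed by hand.
            suspects[prefix] = years[0]
    return suspects


rows = [
    "Clupea harengus Linnaeus, 1758",
    "Clupea harengus Linnaeus, 1759",
    "Clupea harengus Linnaeus, 1760",
    "Gadus morhua Linnaeus, 1758",
    "Gadus morhua Linnaeus, 1758",
]
print(autofill_suspects(rows))
```

A check like this only flags candidates for review; whether the first year is really the correct authority year still has to be confirmed against a reference list.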

Taxonomic names are the keys…
- … keys to bind together information on the same taxon from different sources
- But there are problems
  - Different research groups use different spellings
  - Accidental misspellings
- Reconciliation is a necessity, not a luxury!

Existing systems…
- … are not flexible
  - We need flexibility, as our use case will dictate what the ‘optimal’ behaviour of the system is
  - E.g. manual vs automatic systems
- … are often coupled to a single ‘reference list’
  - Using a different taxonomic scope for the test and reference lists only increases false positives
  - E.g. TaxaMatch with IRMNG…
- … don’t always have the throughput needed for large-scale projects
  - The largest database has approx. 20M names: too many pairs!
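The "too many pairs" point can be checked with simple arithmetic. In the sketch below only the 20M reference size comes from the slide; the input list size and the comparison rate are hypothetical figures chosen for illustration.

```python
REFERENCE_NAMES = 20_000_000        # from the slide: largest database, approx. 20M names
INPUT_NAMES = 100_000               # hypothetical size of a list to reconcile
COMPARISONS_PER_SECOND = 1_000_000  # hypothetical single-core throughput


def naive_comparisons(n_input: int, n_reference: int) -> int:
    """Number of name pairs if every input name is scored against every reference name."""
    return n_input * n_reference


total = naive_comparisons(INPUT_NAMES, REFERENCE_NAMES)
days = total / COMPARISONS_PER_SECOND / 86_400  # 86,400 seconds per day

print(f"{total:.1e} comparisons, about {days:.0f} days on one core")
```

Under these assumptions a brute-force run takes weeks on a single core, which is the motivation for cheaper early matching steps and for running on a distributed infrastructure.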

Our need
- A flexible, highly customisable, workflow-based approach to taxon name matching
  - User controls input
  - Output can be used as input in other processes
  - Running on high performance computing infrastructure
- BiOnym!

The BiOnym Workflow

Key concepts and features in BiOnym
- Real-world application of the concept-mapping principles
- Focused on marine taxonomy but extendible to other life zones, and embedded in a wider-scope technology (COMET)
- Provides a fully customisable workflow (order of matchers)
- Takes advantage of the iMarine distributed infrastructure
- The modular architecture enables developers to integrate third-party components, add new functionality or improve existing components with ease, and to add taxonomic authority files
- Based on standard and open formats (DwC, DwCA, …)
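The idea of a customisable order of matchers can be sketched as a chain in which each name is handed to the matchers in turn and stops at the first one that finds a hit. This is a minimal illustration, not the COMET or BiOnym API: the matcher functions, the `run_chain` helper, the stop-at-first-hit behaviour and the 0.9 cutoff are all assumptions.

```python
import difflib
from typing import Callable, Dict, List, Optional, Tuple

Matcher = Callable[[str, List[str]], Optional[str]]


def exact(name: str, reference: List[str]) -> Optional[str]:
    """Cheapest step: identical strings only."""
    return name if name in reference else None


def normalised(name: str, reference: List[str]) -> Optional[str]:
    """Ignore letter case and repeated whitespace."""
    key = " ".join(name.split()).lower()
    for candidate in reference:
        if " ".join(candidate.split()).lower() == key:
            return candidate
    return None


def fuzzy(name: str, reference: List[str]) -> Optional[str]:
    """Most expensive step: approximate match via the standard library."""
    hits = difflib.get_close_matches(name, reference, n=1, cutoff=0.9)
    return hits[0] if hits else None


def run_chain(names: List[str], reference: List[str],
              chain: List[Tuple[str, Matcher]]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Try matchers in the given order; record the hit and which matcher found it."""
    results: Dict[str, Tuple[Optional[str], Optional[str]]] = {}
    for name in names:
        results[name] = (None, None)
        for label, matcher in chain:
            hit = matcher(name, reference)
            if hit is not None:
                results[name] = (hit, label)
                break
    return results


REFERENCE = ["Asthenognathus inaequipes", "Clupea harengus"]
CHAIN = [("exact", exact), ("normalised", normalised), ("fuzzy", fuzzy)]
NAMES = ["Clupea harengus", "CLUPEA  harengus", "Asthenognathus inaeguipes", "Gadus morhua"]

results = run_chain(NAMES, REFERENCE, CHAIN)
for name, (hit, label) in results.items():
    print(f"{name!r} -> {hit!r} via {label}")
```

Reordering or shortening `CHAIN` changes both cost and behaviour: putting cheap matchers first means only the names they cannot resolve reach the expensive fuzzy step.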

The iMarine solution: existing state-of-the-art
A general-purpose concept mapping framework (COMET) was already available in FAO:
- based on an existing FAO product (limited to the fishing vessels domain) initially developed with the support of the Japanese trust fund
- domain independent (can be tailored to any custom domain with little effort)
- provided with all the necessary building blocks and components for general-purpose usage

The iMarine solution: the quest for integration
The integration of COMET inside iMarine was welcomed and expected. Its main challenges:
- Identify and define the custom domain (biological taxonomy)
- Design and implement:
  - custom COMET matchlets (engines assigning similarity scores to pairs of names)
  - additional, reusable tools for data interchange and data preparation (DwCA converter, input parser, pre- and post-processors)
- Enable components to be easily distributed among worker nodes inside the infrastructure
- Integration in the iMarine Statistical Manager
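Distributing the work among worker nodes is possible because each input name can be matched independently of the others. The sketch below splits the input into chunks, matches each chunk separately and merges the partial results. A local thread pool stands in for the infrastructure's worker nodes, and every function name, the chunk size and the cutoff are assumptions rather than the iMarine implementation.

```python
import difflib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional


def chunked(items: List[str], size: int) -> List[List[str]]:
    """Split the input list into consecutive chunks of at most `size` names."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def match_chunk(chunk: List[str], reference: List[str]) -> Dict[str, Optional[str]]:
    """Work unit for one worker: best approximate match per name, or None."""
    result: Dict[str, Optional[str]] = {}
    for name in chunk:
        hits = difflib.get_close_matches(name, reference, n=1, cutoff=0.85)
        result[name] = hits[0] if hits else None
    return result


def match_distributed(names: List[str], reference: List[str],
                      chunk_size: int = 2, workers: int = 4) -> Dict[str, Optional[str]]:
    """Fan chunks out to a pool and merge the partial results."""
    merged: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: match_chunk(chunk, reference),
                            chunked(names, chunk_size))
        for partial in partials:
            merged.update(partial)
    return merged


REFERENCE = ["Asthenognathus inaequipes", "Clupea harengus"]
NAMES = ["Astheognathus inaequipes", "Clupea harengus", "Gadus morhua"]

matches = match_distributed(NAMES, REFERENCE)
print(matches)
```

Because chunks share nothing but the read-only reference list, the same split-and-merge shape carries over to separate processes or machines, at the price of shipping the reference list to each worker.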

BiOnym System: Overview

Where are we?
- Tools available in the VRE Statistical Manager, for techies (!)
- Portlet available in the infrastructure, but…
  - … still to be integrated in the production part of the Biodiversity Research VRE…
  - … after testing by users outside iMarine
- Names can be matched from a file in the SM-VRE, not yet in the portlet
- Accessible as a web service under the WPS protocol
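For readers unfamiliar with WPS, the sketch below builds the two standard discovery requests in key-value form: `GetCapabilities`, which lists the processes a server offers, and `DescribeProcess`, which returns the inputs and outputs of one process. The endpoint URL and process identifier are placeholders, not the real iMarine addresses, and WPS version 1.0.0 is an assumption since the slides do not name a version. No request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical values: the slides give neither the endpoint nor the process identifier.
WPS_ENDPOINT = "https://example.org/wps/WebProcessingService"
PROCESS_ID = "org.example.BIONYM"


def wps_url(endpoint: str, request: str, **extra: str) -> str:
    """Build a key-value-pair WPS request URL for the given operation."""
    params = {"service": "WPS", "request": request}
    params.update(extra)
    return f"{endpoint}?{urlencode(params)}"


# Lists the processes the server exposes.
capabilities_url = wps_url(WPS_ENDPOINT, "GetCapabilities")

# Describes the inputs and outputs of one process (version 1.0.0 assumed).
describe_url = wps_url(WPS_ENDPOINT, "DescribeProcess",
                       version="1.0.0", identifier=PROCESS_ID)

print(capabilities_url)
print(describe_url)
```

The process description returned by a real server is what tells a client which inputs to supply when executing a matching run.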

The BiOnym interface in the Statistical Manager VRE
Never mind the small print.
- Step 1: Select your data.
- Step 2: Compose the matching process. This relies on infrastructure resources.
- Step 3: Review the results. These can be private and ‘for your eyes only’, or public.

Interface in the portlet: Advanced search

Matching results
VME-DB and iMarine Reports - 8th TCom Feb

Future work
- Within the framework of EC iMarine:
  - Finalise and fine-tune the interface
  - Analyse the feedback from the members of the biodiversity community who are still to be contacted
- Beyond September:
  - Several suggestions in the recently published technical report, see section 5
    - Postprocessing
    - Sharing matching results
  - Explore and fine-tune the WPS services

Thank you