Common Data Models and Protocols Richard White, Cardiff University Talk given at “Making Species Databases Interoperable”,

Presentation transcript:

Common Data Models and Protocols
Richard White, Cardiff University
Talk given at “Making Species Databases Interoperable”, Reading, 15 July
SPICE for Species 2000
Funded in the UK by the BBSRC/EPSRC Bioinformatics Initiative
Universities of Cardiff & Reading

Species 2000
The story so far...
Species 2000 is an international collaborative project to create and provide access to an authoritative and up-to-date checklist and index to all the world’s species.
How is it going to do this?

Species 2000 services to users
– Dynamic Checklist
– Annual Checklist
– Web site, including database links submitted by users or producers
– Distribution media, including downloaded data
– Index to species information (hyperlinks to SISs)
– Packaged functions providing services to other software

Species 2000 organisation
– Taxonomic hierarchy (or hierarchies)
– Global species databases (GSDs) and interim checklists: the species index
– Species information sources (SISs): regional faunas and floras, specialist or sectoral databases, web pages etc.
[Diagram labels: Species; GSD; interim checklists; SIS]

Merging & Linking
– Merging: the original databases are physically copied into a new combined database.
– Linking: the original databases remain separate, but are accessed through a single system.

Merging
1. The original databases are physically copied into a new combined database.
2. The user interacts with the new combined database.

Linking
1. The user interacts with an access system which does not itself contain data.
2. When the user requests data, it is fetched from the appropriate database.
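The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Species 2000 implementation: the two in-memory dictionaries stand in for source databases, and the names `merge` and `LinkedAccess` are hypothetical.

```python
from typing import Dict, List

# Two toy "databases" keyed by scientific name (illustrative records only).
DB_A: Dict[str, str] = {"Vicia faba": "record held by database A"}
DB_B: Dict[str, str] = {"Apis mellifera": "record held by database B"}


def merge(*sources: Dict[str, str]) -> Dict[str, str]:
    """Merging: physically copy every record into one new combined database."""
    combined: Dict[str, str] = {}
    for source in sources:
        combined.update(source)
    return combined


class LinkedAccess:
    """Linking: an access system that holds no data of its own."""

    def __init__(self, sources: List[Dict[str, str]]) -> None:
        self._sources = sources  # references to the originals, not copies

    def fetch(self, name: str) -> List[str]:
        # Data is fetched from the source databases only when requested.
        return [src[name] for src in self._sources if name in src]


merged = merge(DB_A, DB_B)
linked = LinkedAccess([DB_A, DB_B])

# A later edit to a source is visible through linking but not in the merged copy.
DB_A["Vicia faba"] = "updated record held by database A"
```

The last line shows the practical trade-off: the merged copy goes stale when a source changes, whereas the linked system always reflects the current state of each source.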

Architecture of Species 2000
[Diagram labels: User interface; Data collector; CAS (Common Access System) or “harness”; Protocol; Wrapper and GSD (three pairs); Distributed array of databases]

Need for communication
Different people are building the various components of the system:
– GSDs
– wrappers
– CAS
– user interface
We need to ensure they all have a common understanding of the data to avoid embarrassing mistakes.

Database wrappers
Only the interface to the CAS needs to speak CORBA.
Wrappers must:
– Translate CAS requests into a form suitable for the GSD (e.g. SQL) and translate responses back
– Deal with other kinds of heterogeneity, including schema heterogeneity
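The translation step can be illustrated with a small Python sketch. It assumes a GSD stored in SQLite with a table `names(id, scientific_name)`; that schema, the class name `SqliteWrapper` and the method `search` are all hypothetical, and the CORBA interface to the CAS is left out entirely.

```python
import sqlite3
from typing import List, Tuple


class SqliteWrapper:
    """Hypothetical wrapper: turns a CAS-style name search into SQL for one GSD."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection

    def search(self, search_string: str, limit: int = 10) -> List[Tuple[str, str]]:
        # Translate the request into the GSD's own query language (SQL here).
        # The table and column names are assumptions about this toy schema;
        # a real wrapper would map the GSD's actual schema onto the CDM.
        rows = self._conn.execute(
            "SELECT id, scientific_name FROM names "
            "WHERE scientific_name LIKE ? ORDER BY scientific_name LIMIT ?",
            (search_string + "%", limit),
        ).fetchall()
        # Translate the response back into plain (identifier, name) pairs.
        return [(str(row[0]), row[1]) for row in rows]


# Build a throwaway in-memory GSD to exercise the wrapper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, scientific_name TEXT)")
conn.executemany(
    "INSERT INTO names (id, scientific_name) VALUES (?, ?)",
    [(1, "Cytisus scoparius"), (2, "Cytisus striatus"), (3, "Caesalpinia crista")],
)
wrapper = SqliteWrapper(conn)
results = wrapper.search("Cytisus")
```

Because the CAS only ever sees the (identifier, name) pairs, a GSD with a different schema needs a different wrapper but no change to the CAS.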

Data flow through a wrapper
[Diagram labels: Divided wrapper; GSD; Wrapper interface; CAS; External wrapper; XML; Strings, e.g. CGI]

Common Data Model
We need a Common Data Model (CDM):
– A definition of the information being passed to and fro
– Human-readable, not machine-readable
– This is used as a reference when creating specific implementations for CGI/XML (DTD, XML Schema), Web Services, etc.

What does the CDM look like?
It defines the input (“request”) and output (“response”) for six fundamental operations which the system needs to be able to carry out.

Request Types 0-5
– Type 0: Get version of the CDM with which the GSD complies
– Type 3: Get information about the GSD
– Type 1: Search for a name in the GSD
– Type 2: Fetch “standard data” about a chosen species
– Type 4: Move up the taxonomic hierarchy
– Type 5: Move down the taxonomic hierarchy

Type 0 Request
Request:
– (nothing)
Response:
– CDMVersion

Type 3 Request
Request:
– GSDIdentifier
Response:
– GSDInfo (a set of fields including its name, date of last editing, etc.)

Type 1 Request
Request:
– SearchString, SearchType (scientific name, common name, unknown), SearchLimit (including higher taxon, maximum number of names to return)
Response:
– Number, SpeciesName[0:N]
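A CGI/XML implementation of this operation might serialise the response along the following lines. This is only a sketch: the element names `Response`, `Number` and `SpeciesName` are borrowed from the field names on the slide and are not taken from the actual Spice DTD, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_search_response(names: List[str]) -> str:
    """Serialise a name-search response; element names are illustrative only."""
    root = ET.Element("Response", {"type": "1"})
    ET.SubElement(root, "Number").text = str(len(names))
    for name in names:
        ET.SubElement(root, "SpeciesName").text = name
    return ET.tostring(root, encoding="unicode")


def parse_search_response(xml_text: str) -> List[str]:
    """Recover the list of names from the XML produced above."""
    root = ET.fromstring(xml_text)
    return [element.text or "" for element in root.findall("SpeciesName")]


xml_text = build_search_response(["Cytisus scoparius", "Cytisus striatus"])
round_trip = parse_search_response(xml_text)
```

Round-tripping the message through both functions is a quick check that the wrapper and the CAS agree on the structure of a response.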

Type 2 Request
Request:
– Identifier, GSDIdentifier
Response:
– StandardData (approximately the same as the Standard Data defined by Species 2000 and seen by the user)

Type 4 Request
Request:
– Identifier, GSDIdentifier
Response:
– HigherTaxon[0:N]

Type 5 Request
Request:
– Identifier, SearchLimit
Response:
– Taxon[0:N]
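The six request types lend themselves to a simple dispatch table inside a wrapper. The sketch below is an assumption-laden illustration: `RequestType`, `Request` and `ToyWrapper` are hypothetical names, only two of the six operations are answered, and the response dictionaries merely echo the field names used on the slides.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict


class RequestType(IntEnum):
    """The six CDM operations, numbered as on the slides."""
    GET_CDM_VERSION = 0
    SEARCH_NAME = 1
    GET_STANDARD_DATA = 2
    GET_GSD_INFO = 3
    MOVE_UP = 4
    MOVE_DOWN = 5


@dataclass(frozen=True)
class Request:
    kind: RequestType
    params: Dict[str, str]


class ToyWrapper:
    """Hypothetical wrapper answering two of the six request types."""

    def __init__(self) -> None:
        self._handlers: Dict[RequestType, Callable[[Dict[str, str]], dict]] = {
            RequestType.GET_CDM_VERSION: self._version,
            RequestType.GET_GSD_INFO: self._info,
        }

    def handle(self, request: Request) -> dict:
        handler = self._handlers.get(request.kind)
        if handler is None:
            raise NotImplementedError(f"Type {int(request.kind)} not supported here")
        return handler(request.params)

    def _version(self, params: Dict[str, str]) -> dict:
        # Type 0 takes no parameters and reports the CDM version complied with.
        return {"CDMVersion": "1.20"}

    def _info(self, params: Dict[str, str]) -> dict:
        # Type 3 returns descriptive fields about the GSD (toy values here).
        return {"GSDInfo": {"name": "Toy GSD", "identifier": params["GSDIdentifier"]}}


wrapper = ToyWrapper()
version = wrapper.handle(Request(RequestType.GET_CDM_VERSION, {}))
info = wrapper.handle(Request(RequestType.GET_GSD_INFO, {"GSDIdentifier": "toy"}))
```

Keeping the numeric codes in one enumeration means the CAS and every wrapper refer to the same operation by the same number.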

The “standard data”
This comprises the information about a species which Species 2000 wishes to provide:
– AVCNameWithRefs
– SynonymWithRefs
– CommonNameWithRefs
– Family (or other agreed higher taxon)
– Comment
– Scrutiny
– DataLink (links to the GSD’s or other web pages)
– Geography (list of places)
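As a rough sketch, the standard data could be held in a record type like the one below. The field names follow the slide, but the Python types, the optionality of each field and the helper class `NameWithRefs` are assumptions made for illustration, not part of the CDM definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NameWithRefs:
    """A name together with its literature references (assumed structure)."""
    name: str
    references: List[str] = field(default_factory=list)


@dataclass
class StandardData:
    """Field names follow the slide; the Python types are assumptions."""
    avc_name: NameWithRefs                                           # AVCNameWithRefs
    synonyms: List[NameWithRefs] = field(default_factory=list)       # SynonymWithRefs
    common_names: List[NameWithRefs] = field(default_factory=list)   # CommonNameWithRefs
    family: Optional[str] = None                                     # Family or other agreed higher taxon
    comment: Optional[str] = None                                    # Comment
    scrutiny: Optional[str] = None                                   # Scrutiny
    data_links: List[str] = field(default_factory=list)              # DataLink
    geography: List[str] = field(default_factory=list)               # Geography (list of places)


# Example record using names that appear later in the talk.
record = StandardData(
    avc_name=NameWithRefs("Cytisus scoparius"),
    synonyms=[NameWithRefs("Sarothamnus scoparius")],
)
```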

Where are we now?
Is the Spice Project finished?
– We have a fairly stable CDM (version 1.20 is about to be replaced with version 1.21)
– XML DTD exists
– Several CGI/XML implementations in Java and PHP, and a Web Service
– We have a working Spice system
– A few changes are anticipated:
  – geographical information
  – linking to further information sources
  – infraspecific taxa

“Intelligent” linking
Species 2000 is:
– not just a catalogue (which lists things)
– it is an index (which points to things)
It plans to provide links to take a user:
– from a species entry (from a GSD)
– to further sources of information about that particular species (Species Information Sources or SISs)

“Intelligent” linking
There are experimental “unintelligent” links already (as in the ILDIS GSD), which rely on exact name matching.
But there are issues in making links more intelligent.
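Exact name matching amounts to a plain dictionary lookup, which makes its limitation easy to see. In this sketch the index contents, the record key and the function name `exact_link` are invented for illustration; the two names are the ones the talk uses for the same species under different treatments.

```python
from typing import Dict, Optional

# Hypothetical index of one species information source: name -> record key.
SIS_INDEX: Dict[str, str] = {
    "Cytisus scoparius": "sis-record-001",
}


def exact_link(gsd_name: str, sis_index: Dict[str, str]) -> Optional[str]:
    """Return the SIS record key only if the names match character for character."""
    return sis_index.get(gsd_name)


found = exact_link("Cytisus scoparius", SIS_INDEX)
# The same species listed under a different name in another treatment is missed.
missed = exact_link("Sarothamnus scoparius", SIS_INDEX)
```

The lookup succeeds only when both resources happen to use the same name, which is exactly the weakness that more intelligent linking sets out to address.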

Data quality (again!)
How do we know the information is reliable?
One problem is the differing interpretation of species names (species concepts) in different resources.

LITCHI Project
A rule-based tool for the detection and repair of conflicts and merging of data in taxonomic databases

Summary of Litchi project
We modelled the knowledge integrity rules in a taxonomic treatment.
The knowledge tested is implicit in the assemblage of scientific names and synonyms used to represent each taxon.
Practical uses include detecting and resolving taxonomic conflicts when merging or linking two databases.
Version 2, now implemented, focuses on the creation of “cross-maps”.

Example 1
Checklist A
– Caesalpinia crista L. [accepted name]
Checklist B
– Caesalpinia crista L. [accepted name]
– Caesalpinia bonduc (L.) Roxb. [accepted name]
– Caesalpinia crista L., p.p. [synonym]

Example 2
In the case of the species Cytisus scoparius:
– Treatment A will list it as Cytisus scoparius (synonym Sarothamnus scoparius)
– Treatment B will list it as Sarothamnus scoparius (synonym Cytisus scoparius)
Treatment A recognises one genus, Cytisus.
Treatment B recognises two genera, Cytisus and Sarothamnus.
[Diagram labels: Genus Cytisus; Genus Sarothamnus; Genus Cytisus; Cytisus scoparius; Sarothamnus scoparius; Cytisus striatus; Sarothamnus striatus; Cytisus multiflorus; Cytisus praecox]
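The kind of conflict shown in Example 2 can be detected mechanically. The sketch below implements one simple rule chosen for illustration (a name that is accepted in one checklist but a synonym in the other); it is not LITCHI's actual rule set, and the data structure and function name are assumptions.

```python
from typing import Dict, List, Tuple

# Each checklist maps a name to its status: "accepted" or "synonym".
Checklist = Dict[str, str]

TREATMENT_A: Checklist = {
    "Cytisus scoparius": "accepted",
    "Sarothamnus scoparius": "synonym",
}
TREATMENT_B: Checklist = {
    "Sarothamnus scoparius": "accepted",
    "Cytisus scoparius": "synonym",
}


def status_conflicts(a: Checklist, b: Checklist) -> List[Tuple[str, str, str]]:
    """Return (name, status_in_a, status_in_b) for shared names whose status differs."""
    conflicts = []
    for name in sorted(a.keys() & b.keys()):
        if a[name] != b[name]:
            conflicts.append((name, a[name], b[name]))
    return conflicts


conflicts = status_conflicts(TREATMENT_A, TREATMENT_B)
```

Both names are flagged, which tells a curator that the two treatments disagree about which name is accepted before any merging or linking is attempted.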

Cross-mapping
So how can we make intelligent links work?
One way to make links appear more intelligent is to create and maintain “cross-maps” which describe how one or more taxa in one resource (such as the Species 2000 index) relate to one or more taxa in another resource.
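A cross-map can be pictured as a list of many-to-many entries between two resources. The sketch below is a minimal illustration: `CrossMapEntry`, `resolve` and the relationship labels "congruent" and "overlaps" are assumed names, and the second entry is only one possible reading of the Caesalpinia example in this talk.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CrossMapEntry:
    source_taxa: Tuple[str, ...]   # one or more taxa in the first resource
    relationship: str              # assumed vocabulary, e.g. "congruent", "overlaps"
    target_taxa: Tuple[str, ...]   # one or more taxa in the second resource


CROSS_MAP: List[CrossMapEntry] = [
    # The same species listed under different names in two treatments.
    CrossMapEntry(("Cytisus scoparius",), "congruent", ("Sarothamnus scoparius",)),
    # Assumed reading: one broad concept relating to two taxa in the other checklist.
    CrossMapEntry(
        ("Caesalpinia crista",), "overlaps", ("Caesalpinia crista", "Caesalpinia bonduc")
    ),
]


def resolve(cross_map: List[CrossMapEntry], source_name: str) -> List[str]:
    """Follow the cross-map from a taxon in one resource to taxa in the other."""
    targets: List[str] = []
    for entry in cross_map:
        if source_name in entry.source_taxa:
            targets.extend(entry.target_taxa)
    return targets


linked_names = resolve(CROSS_MAP, "Cytisus scoparius")
```

Unlike exact name matching, the lookup goes through a recorded relationship, so a link can be followed even when the two resources use different names.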

Litchi 2.2 in use
[Diagram labels: Checklist A; Checklist B; Rules; Heuristics; Concept relationships; Cross-map; Taxonomic intelligence; Read into system; Write; Conflict detection; Inference of concept relationships]

More about cross-maps
They may be created and maintained:
– manually by experts
– automatically or semi-automatically by LITCHI (as above)
– by monitoring the behaviour of users following species links
– by analysing data sets describing the taxa, when sufficient such data is available, using the usual species taxonomy tools (phenetic and cladistic analyses)

More about cross-maps
They may be held:
– by individual GSDs, describing how to link their species to selected related resources, as ILDIS has done for linking to the Northern Eurasia (aka USSR) database
– by Species 2000 as a repository and service to facilitate intelligent species links
– by an “intelligent linking engine”, as planned for Species 2000 Europa to link its two hubs

A dream
A system for managing intelligent species links using taxonomic concept relationships would maximise the potential of the plethora of species-based catalogues, indexes and rich species resources currently being assembled all over the world.
Perhaps on the Web, as with the current Spice/Species 2000 prototype.
Or...

The Grid
Or maybe on the Grid:
– One of its aims is to provide access to such knowledge sources as species checklists, synonymy servers, rich species data sets, and cross-maps, for example in the Biodiversity World project