Published by Alvin Stewart. Modified over 9 years ago.
1
Global Biodiversity Information Facility
2
The GBIF Information System – an Update
Hannu Saarenmaa et al.
Ecoinformatics Workshop, Brussels, 22 September 2004
Global Biodiversity Information Facility, www.gbif.org
3
Outline
1. Objective
2. Status of GBIF network
3. Data exchange standards
4. Protocol standards
5. Schema repository requirements
4
GBIF’s objective is
• to establish a distributed information infrastructure that serves primary biodiversity data
• with initial focus on species- and specimen-level data,
• with links to molecular, genetic and ecosystem levels
• to function as a global integrator
5
GBIF is a global integrator
6
Participants have expressed their willingness to...
• Share biodiversity data through nodes
• Formulate and implement the GBIF work programme
  • Voting Participants (to date 24 countries) make a yearly contribution based on GDP
  • GBIF central budget is $3M
  • Associate Participants (to date 15 countries/economies, 22 international organisations) cannot vote, but otherwise participate fully
• Make additional investments in biodiversity information and the necessary infrastructure
  • 90% of investment in GBIF happens within Participants
  • 10% centrally and only for providing the linking mechanism
7
2. Status of GBIF network
GBIF is building a distributed network of databases for sharing biodiversity data using a web services approach
8
GBIF network status Q3/2004
• What has been achieved
  • UDDI registry up and running since July 2003
  • Data Portal with global index opened 6 February 2004
  • Several DiGIR provider implementations available
  • Integration of the BioCASe network
  • Regional training workshops
• What to expect in the next 12 months
  • Integration of name providers (on-line)
  • Integration of image data
  • Release of Data Portal software and distributed architecture
  • Scaling up from 41 to 100 million records
  • Uses of data, such as through SpeciesBank
9
GBIF Architecture
[Figure: GBIF architecture. Components: User; Data Portal (Request Marshaller, Query Engine, Index, Cache, Metadata, Accounting); Registry of institutions, providers and services (UDDI); Data provider and Name provider (each with Provider Services and Resource Metadata). Message flows: provider query, available providers, metadata and name query, metadata response, full data query, full data response, metadata and statistics, synonyms, publish availability. Transports: SOAP, DiGIR, HTTP, other.]
10
The Registry
• Global yellow pages "marketplace" of shared biodiversity data
• Populated automatically by provider installations
• Based on UDDI (Systinet WASP) and web services
• Directory of Participants and data providers
• Services of the providers, i.e., datasources and datasets offered
• tModels of the standards that must be adhered to
• Open interfaces for portals and specialised search engines
• Registry is available to any portal or search engine
"You don’t get very far with web services unless you have a registry..." - Tom Gaskins, uddi.org
11
What role for WSDL?
• Links the standards and UDDI registry
The various data standards are represented by tModels
12
How does the GBIF registry work?
[Diagram labels: GBIF UDDI Registry; Services Registrations; Provider Registrations]
1) GBIF Secretariat and other developers create and populate the registry with descriptions of standards (tModels)
2) Museums and other data providers install data provider packages which are automatically registered
3) GBIF Participant is notified of new provider in their domain, for endorsement as a GBIF data provider
4) A global index queries the registry, caches metadata, and creates a unique identifier for each record (and name)
5) Specialised portals and search engines can be built by anybody to query the registry and the index
6) Scientists, decision-makers, and others can use portals to acquire data sets for analysis and synthesis
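The registry workflow can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class and function names (`Provider`, `Registry`, `build_index`) are hypothetical, nothing here uses a real UDDI API, the running counter is a stand-in for whatever identifier scheme the index actually applies, and restricting the index to endorsed providers is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Provider:
    name: str
    participant: str       # GBIF Participant in whose domain the provider sits
    standards: list[str]   # standards (tModels) the provider claims to support
    records: list[dict]    # stand-in for the provider's local database
    endorsed: bool = False


@dataclass
class Registry:
    standards: set[str] = field(default_factory=set)
    providers: dict[str, Provider] = field(default_factory=dict)
    notifications: list[tuple[str, str]] = field(default_factory=list)

    def add_standard(self, name: str) -> None:
        # Step 1: the registry is populated with descriptions of standards.
        self.standards.add(name)

    def register_provider(self, provider: Provider) -> None:
        # Step 2: a provider installation registers itself.
        unknown = [s for s in provider.standards if s not in self.standards]
        if unknown:
            raise ValueError(f"unregistered standards: {unknown}")
        self.providers[provider.name] = provider
        # Step 3: the responsible Participant is notified for endorsement.
        self.notifications.append((provider.participant, provider.name))

    def endorse(self, provider_name: str) -> None:
        self.providers[provider_name].endorsed = True

    def endorsed_providers(self) -> list[Provider]:
        return [p for p in self.providers.values() if p.endorsed]


def build_index(registry: Registry) -> dict[int, dict]:
    # Step 4: the index queries the registry and gives every record an
    # identifier (a simple counter here, purely for illustration).
    ids = count(1)
    index: dict[int, dict] = {}
    for provider in registry.endorsed_providers():
        for record in provider.records:
            index[next(ids)] = {"provider": provider.name, **record}
    return index
```

Steps 5 and 6 would then be any portal or search engine reading from `registry` and the dictionary returned by `build_index`.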
16
The Interim GBIF Data Sharing Agreement
1. Biodiversity data accessible via the GBIF network are openly and universally available to all users within the framework of the GBIF Data Use Agreement and with the terms and conditions that the data provider has identified in its metadata.
2. GBIF does not assert any intellectual property rights in the data that is made available through its network.
3. The data provider warrants that they have made the necessary agreements with the original owners of the data that it can make the data available through GBIF network.
4. The data provider makes reasonable efforts to ensure that the data they serve are accurate.
5. Responsibility regarding the restriction of access to sensitive data resides with the data provider.
6. The data provider includes stable and unique identifier in their data so that the owner of the data is known and for other necessary purposes.
7. GBIF Secretariat may cache a copy and serve full or partial data further to other users together with the terms and conditions for use set by the data provider. Queries of such data through the GBIF Secretariat are reported to the data provider.
8. Data providers are endorsed by a GBIF Participant, if applicable, before their metadata is made available by the GBIF Secretariat.
9. GBIF Secretariat is not responsible for data content or the use of the data.
10. GBIF Secretariat is not liable or responsible, nor are its employees or contractors, for the data contents; or for any loss, damage, claim, cost or expense however it may arise, from an inability to use the GBIF network.
17
The Interim GBIF Data Use Agreement
1. The quality and completeness of data cannot be guaranteed. Users employ these data at their own risk.
2. Users shall respect restrictions of access to sensitive data.
3. In order to make attribution of use for owners of the data possible, the identifier of ownership of data must be retained with every data record.
4. Users must publicly acknowledge, in conjunction with the use of the data, the data providers whose biodiversity data they have used. Data providers may require additional attribution of specific collections within their institution.
5. Users must comply with additional terms and conditions of use set by the data provider. Where these exist they will be available through the metadata associated with the data.
18
GBIF Architecture
[Figure: GBIF architecture diagram, repeated. Components: User, Data Portal, Registry (UDDI), Data provider, Name provider. Transports: SOAP, DiGIR, HTTP, other.]
20
Data provider software
• Each system entails
  • Provider software
  • Communication with the DiGIR (or BioCASe) protocol
  • Data standards Darwin Core, (ABCD,) Dublin Core
  • Configuration for each resource (local existing database)
  • Registration with GBIF UDDI registry
• Turn-key package for easy installation
  • Based on PHP and digir.sourceforge.net code
  • Packaged and supported by GBIF
  • Available now for Linux and Windows
  • Installs automatically
21
• Supported by helpdesk@gbif.org
• Turn-key package
• Based on PHP and DiGIR project code
• Available now for Linux and Windows
• Registration with GBIF UDDI registry
22
GBIF Data Repository Tool
• Upload and manage datasets in document formats such as spreadsheets and XML
• Parses the data into an embedded MySQL database that becomes available to the public as a DiGIR resource
• Owner can revoke release (data is deleted from database)
• Enable data custodians to manage and publish their own data
• Make available a simple data warehouse tool for those who want to host datasets for the community
23
Data quality is a problem
• Central data validation service (DVS) being planned
• The data provider can ask the DVS to run through its data and spot inconsistencies
• Requirement: A data dictionary
24
GBIF Architecture
[Figure: GBIF architecture diagram, repeated. Components: User, Data Portal, Registry (UDDI), Data provider, Name provider. Transports: SOAP, DiGIR, HTTP, other.]
25
GBIF Data Portal
• Gateway to data of the providers
• Search and browse data by name, country, etc.
• Download data and display simple maps
• Multilingual
• Maintains a cache of key data in case provider goes off-line
• Opened 6 February 2004
• Based on Java and MySQL, source code available later
26
Name Service: Major component of the global index
[Figure labels: User requests; GBIF Data Portal; Biodiversity Data Index; Taxonomic Name Service (ECAT); Catalogue of Life and other name providers; GBIF Data Nodes; Name Lists; Specimen Data; Observation Data; Links to other data]
34
The future globally distributed architecture of GBIF
35
3. Data exchange standards
36
Data exchange standards are the key
Data description in XML
• Institutions, providers, collections, and persons in various roles
• Specimen, observation
• Name, taxonomic concept
• Images
• Characters for identification
• Species information
Standards process
• GBIF works with TDWG
• Discussion, documentation
• Schema repository
• Open source sourceforge.net
Standards for protocols and data exchange
• SOAP / UDDI
• Darwin Core / DiGIR
• ABCD / BioCASe
• SDD / BioCASe
• UBIF
38
Darwin Core (and Mantle)
• TDWG standard in review
• 48 elements
• Metadata almost nonexistent (is in the protocol, not data itself)
• New version 2 being reviewed
• Extensibility wanted
  • Curatorial
  • Bacteriological
  • Paleontological
  • Trappers...
39
ABCD
• TDWG standard in review
• 300+ elements in a hierarchical structure
• Can model almost anything
• Metadata handling totally different from DiGIR/DwC
40
Image data standards
• JPEG2000
• Metadata from Dublin Core
41
SDD (Structured Descriptive Data)
• TDWG standard in review
• Description of characters of organisms
• XML standard for identification key interchange
• Distributed descriplets as semantic web of diagnostic/identification knowledge (cf. CYC)
42
SDD elements
• the root of an SDD document, and encloses all other elements
• used to specify metadata about the process (application or script)
• used to capture metadata about the project from which the document data are sourced.
• defines a list of characters and their states used to describe the entities described in the document.
• defines a list of entities (such as taxa and specimens) for which descriptions are provided in the document.
• provides for definitions of resources (images, notes, contributors etc) referred to elsewhere in the document.
• contains descriptions (either coded or marked-up natural language) of the document's entities.
43
Global Biodiversity Information Facility Slide courtesy of Kevin Thiele
44
Global Biodiversity Information Facility Slide courtesy of Kevin Thiele
45
Global Biodiversity Information Facility Slide courtesy of Kevin Thiele
46
SpeciesBank
• This is where it all comes together...
• Species home pages mushrooming, but no standard exists for species information pages
• Needed for identification, invasives, pest control, taxonomic review,...
47
4. Protocol standards
48
The DiGIR protocol
• DiGIR is lightweight. It is not SOAP, but could be payload on SOAP.
  • XML messaging on top of HTTP
  • Used for communication between data providers and data users
  • More lightweight and specialised than SOAP
• Enables single point of access (portal/search) to distributed information resources
  • Resource: a collection of data objects that conform to a common schema (DB records, XML documents)
  • Distributed resources comply with a federation schema
• Enables search & retrieval of structured data
  • Search for data values in context (semantics)
  • Results are presented as a structured data set
• Makes location and technical characteristics of the native resource transparent to the user
• The Distributed Generic Information Retrieval protocol was created by the TDWG/CODATA subgroup on biological collection data
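The idea of a single point of access over resources that share a federation schema can be sketched as follows. This is not the DiGIR message format: the real protocol exchanges XML over HTTP, whereas this sketch uses in-memory objects, and the names `FEDERATION_SCHEMA`, `Resource` and `federated_search` are hypothetical.

```python
# Concepts that every resource agrees to answer in (illustrative choice).
FEDERATION_SCHEMA = ("ScientificName", "Country")


class Resource:
    """A native data source plus its mapping onto the federation schema."""

    def __init__(self, name, rows, field_map):
        self.name = name
        self.rows = rows            # native records, with native column names
        self.field_map = field_map  # federation concept -> native column name

    def search(self, concept, value):
        native_column = self.field_map[concept]
        hits = [row for row in self.rows if row.get(native_column) == value]
        # Answer in federation terms so the caller never sees native columns.
        return [
            {c: row.get(self.field_map[c]) for c in FEDERATION_SCHEMA}
            for row in hits
        ]


def federated_search(resources, concept, value):
    """Send one query to every resource and merge the structured results."""
    if concept not in FEDERATION_SCHEMA:
        raise ValueError(f"{concept!r} is not part of the federation schema")
    results = []
    for resource in resources:
        for record in resource.search(concept, value):
            results.append({"resource": resource.name, **record})
    return results
```

Two resources with different native column names (say `taxon` in one and `name` in the other) return identically shaped records, which is the transparency the slide describes.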
49
A simple DiGIR architecture
[Figure labels: Databases; Data providers (have one or more databases to share and have installed DiGIR or BioCASe); Portals, search engines, and applications developed for various purposes]
50
Global Biodiversity Information Facility protocol
51
Unified protocol
• Merger of DiGIR and BioCASe
52
5. Requirements for a Biodiversity Schema Repository
53
Schema repository – version history
Structured store of data models for each schema (elements, datatypes, cardinality, enumerated values, annotations)
Modelling of version history for each schema
Single location from which to generate new machine-readable formats (.xsd, .dtd, ???)
[Diagram: Schema A with versions 1.0, 1.1; Schema B with versions 1.0, 1.1, 1.2]
54
Schema repository – documentation
Automated generation of standardised HTML, PDF and other human-readable documentation for each standard
Can easily generate documentation, e.g. of inter-version differences
[Diagram: Schema A with versions 1.0, 1.1; Schema B with versions 1.0, 1.1, 1.2]
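Generating documentation of inter-version differences could look roughly like this, assuming each stored schema version is a mapping from element name to its datatype and cardinality. The `Element` class, the two functions and the element names in the usage note are hypothetical and not taken from any real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    datatype: str
    cardinality: str  # for example "0..1" or "1..1"


def diff_versions(old, new):
    """Compare two versions of a schema, each a dict of name -> Element."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


def render_diff(schema, old_version, new_version, diff):
    """Produce a plain-text summary suitable for human-readable documentation."""
    lines = [f"{schema}: changes from {old_version} to {new_version}"]
    for kind in ("added", "removed", "changed"):
        names = ", ".join(diff[kind]) or "none"
        lines.append(f"  {kind}: {names}")
    return "\n".join(lines)
```

For example, with a made-up "Schema A" where version 1.1 adds a `Collector` element and makes `CatalogNumber` mandatory, `render_diff("Schema A", "1.0", "1.1", diff_versions(v10, v11))` lists `Collector` under added and `CatalogNumber` under changed.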
55
Schema repository – conceptual mappings
Identify relationships between elements in different versions of the same schema or different schemas
Generate software products to automate transformation between different versions and schemas (.xslt, .pl?, .java?)
How should we handle different content models for related concepts?
Should mappings simply be between different imported schemas, or should they all be made against some central set of concepts?
[Diagram: Schema A with versions 1.0, 1.1; Schema B with versions 1.0, 1.1, 1.2]
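One of the two options raised above, mapping every schema against a central set of concepts, can be sketched as follows. The central concept names (`catalog-number`, `institution-code`) are invented for this illustration, and treating the listed Darwin Core and ABCD elements as equivalent is an assumption of the sketch rather than a statement about either standard.

```python
# Each schema maps its own element names onto shared central concepts.
CONCEPT_OF = {
    "Darwin": {
        "CatalogNumber": "catalog-number",
        "InstitutionCode": "institution-code",
    },
    "ABCD": {
        "Unit/UnitID": "catalog-number",
        "OriginalSource/SourceInstitutionCode": "institution-code",
    },
}


def derive_mapping(source, target):
    """Derive a source-element -> target-element mapping via central concepts."""
    target_by_concept = {
        concept: element for element, concept in CONCEPT_OF[target].items()
    }
    return {
        element: target_by_concept[concept]
        for element, concept in CONCEPT_OF[source].items()
        if concept in target_by_concept
    }


def transform(record, source, target):
    """Rename the keys of a flat record; unmapped elements are dropped."""
    mapping = derive_mapping(source, target)
    return {mapping[key]: value for key, value in record.items() if key in mapping}
```

With a central concept set, adding a new schema needs one mapping table instead of one table per existing schema. The cost is that differing content models (the open question on the slide) must be reconciled at the concept level, which this flat renaming does not attempt.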
56
Schema repository – biodiversity object classes
Establish core datatypes for use in biodiversity informatics
Easily integrated into distributed query protocol (DiGIR/BioCASe next version) and could allow data providers and clients to share an understanding of what is being transferred (i.e. elements have an associated datatype)
Can include subclasses to allow for more precision (and e.g. to allow quick determination in many cases whether a taxon occurrence is associated with a specimen)
[Diagram labels: schemas Darwin Core 1.1, Curatorial 1.0, ABCD 1.49; object classes TaxonOccurrence, Observation, Specimen, LivingMaterial, Collection, Institution; supported concepts Darwin:, ABCD:Unit, Curatorial:, Microbial:, ABCD:OriginalSource]
57
Schema repository – configuring a data provider
Define the relationships between database tables (as with DiGIR/BioCASe provider software today)
Select object classes to support and concepts to offer (including option to select from extended lists of specialised schemas) based on live connection to schema repository to discover object classes and supported concepts
Map concepts to table columns (some may be automated using conceptual mappings from schema repository)
Map each selected object class to a query endpoint
The same model can be extended to cover additional object classes (e.g. TaxonName, TaxonConcept, Publication) all based on the same set of table relationships
[Diagram labels: Data Provider; table relationships between tables A, B, C, D, E; schemas Darwin Core 1.1, Curatorial 1.0, ABCD 1.49; object classes TaxonOccurrence, Observation, Specimen, LivingMaterial, Collection, Institution; supported concepts Darwin:, ABCD:Unit, Curatorial:, Microbial:, ABCD:OriginalSource]
Map Institution
  ABCD:OriginalSource/SourceInstitutionCode → D:instCode
  ABCD:OriginalSource/SourceName → D:instName
Map Specimen
  Darwin:CatalogNumber → B:id
  Darwin:ScientificName → E:sciName
  Darwin:Latitude → C:latitude
  Darwin:Longitude → C:longitude
  Curatorial:Disposition → B:state
  ABCD:Unit/UnitID → B:id
  ABCD:Unit/.../NameAuthorYearString → E:sciName
Query protocol endpoints
  http://www.myInst.org/query/institution/digir2.py
  http://www.myInst.org/query/specimen/digir2.py
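The Specimen mapping above can be turned into a query against the provider's own tables. The sketch below uses SQLite in place of whatever database a provider actually runs. The join keys (`specimen_id`) and the sample rows are hypothetical, since the slide names the tables and columns but not how the tables relate.

```python
import sqlite3

# Concept -> table.column, taken from the "Map Specimen" list.
SPECIMEN_MAP = {
    "Darwin:CatalogNumber": "B.id",
    "Darwin:ScientificName": "E.sciName",
    "Darwin:Latitude": "C.latitude",
    "Darwin:Longitude": "C.longitude",
    "Curatorial:Disposition": "B.state",
}

# Hypothetical table relationships; a real provider would configure these.
SPECIMEN_FROM = (
    "B JOIN C ON C.specimen_id = B.id "
    "JOIN E ON E.specimen_id = B.id"
)


def build_query(concepts):
    """Build a SELECT that returns the requested concepts under their own names."""
    # Only concepts present in SPECIMEN_MAP are accepted (KeyError otherwise),
    # so no caller-supplied text is ever spliced into the SQL.
    columns = ", ".join(f'{SPECIMEN_MAP[c]} AS "{c}"' for c in concepts)
    return f"SELECT {columns} FROM {SPECIMEN_FROM} ORDER BY B.id"


def fetch(conn, concepts):
    cursor = conn.execute(build_query(concepts))
    names = [column[0] for column in cursor.description]
    return [dict(zip(names, row)) for row in cursor]


def demo_connection():
    """Create a tiny in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE B (id TEXT PRIMARY KEY, state TEXT);
        CREATE TABLE C (specimen_id TEXT, latitude REAL, longitude REAL);
        CREATE TABLE E (specimen_id TEXT, sciName TEXT);
        INSERT INTO B VALUES ('44622', 'in collection');
        INSERT INTO C VALUES ('44622', 60.2, 25.0);
        INSERT INTO E VALUES ('44622', 'Papilio machaon');
        """
    )
    return conn
```

Calling `fetch(demo_connection(), ["Darwin:CatalogNumber", "Darwin:ScientificName"])` returns records keyed by concept name, so a client never needs to know the native table layout.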
58
Schema repository – querying a data provider
A client can discover the object types supported by each provider
The capabilities request allows a client to discover the schemas and concepts supported by the provider
The schema repository can provide information about the concepts, even if they are completely new to the client (including labels for use in user interface presentations, or linkages to web pages documenting the meaning of terms used as values for the concept)
Use in data quality control (validation)
[Diagram labels: User; Data Provider with query protocol endpoints http://www.myInst.org/query/institution/digir2.py and http://www.myInst.org/query/specimen/digir2.py; schemas Darwin Core 1.1, Curatorial 1.0, ABCD 1.49]
1. Capabilities request to discover endpoints with object classes and supported schemas and concepts
2. Search request to retrieve data for Specimen objects
3. Documentation request to get user interface labels for elements from schema repository
59
Schema repository – Data validation service
Data validation service can be used by the data provider and GBIF helpdesk for quality control of the shared data.
Reports inconsistencies in values and relationships so that they can be corrected.
Requires that value domains (lookup values) are included in the schema repository (SR).
The data validation service is proposed for GBIF WP 2005-06.
[Diagram labels: System administrator of data provider or GBIF helpdesk; Data Validation Service; Data Provider with query protocol endpoints http://www.myInst.org/query/institution/digir2.py and http://www.myInst.org/query/specimen/digir2.py; schemas Darwin Core 1.1, Curatorial 1.0, ABCD 1.49; flows: Query, Definitions, Values, Data, Report of inconsistencies, Corrections]
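A validation pass of this kind might look like the sketch below. The service was only proposed at the time, so everything here is an assumption: the function name, the report format, the value domain in the usage note, and the choice of a coordinate check as an example of an inconsistency in relationships.

```python
def validate(records, value_domains, required=()):
    """Return a list of (record position, field, problem) tuples.

    value_domains maps a field name to the set of values it may take,
    standing in for lookup values held in the schema repository.
    """
    report = []
    for position, record in enumerate(records):
        # Mandatory elements that are absent or empty.
        for name in required:
            if record.get(name) in (None, ""):
                report.append((position, name, "missing mandatory value"))

        # Values outside their declared value domain.
        for name, allowed in value_domains.items():
            value = record.get(name)
            if value not in (None, "") and value not in allowed:
                report.append((position, name, f"{value!r} is not in the value domain"))

        # A relationship between fields: coordinates must come as a valid pair.
        lat, lon = record.get("Latitude"), record.get("Longitude")
        if (lat is None) != (lon is None):
            report.append((position, "Latitude/Longitude", "only one coordinate present"))
        elif lat is not None and not (-90 <= lat <= 90 and -180 <= lon <= 180):
            report.append((position, "Latitude/Longitude", "coordinate out of range"))
    return report
```

A provider's administrator could run this with, for instance, `value_domains={"BasisOfRecord": {"specimen", "observation"}}` (an invented domain) and `required=("InstitutionCode",)`, then correct the reported records at the source.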
61
Information model
[Figure labels: GBIF Portal; Biodiversity Data Index (Records); Services Registry (Nodes, Services); Participant Nodes; Data Nodes; services: Taxonomic Name Service, Specimen/Observation Service, General Resource Service, Name List Service, …; records: Taxonomic Names, Specimen/Observation Records, HTML Pages, Images, …; relations: holds metadata for, provides index of, provide, supply]
62
Records need unique identifiers
• Why should each record have a globally unique ID?
  • To trace data back to the original source (specimens, images or other evidence)
  • To allow for updates/corrections
  • To indicate ownership
  • To remove duplicate data
• Issues
  • Lookup of codes needed when installing providers
  • Mandatory elements in data, but not always present in data source
  • Work with "coden providers" like Index Herbariorum & others to standardise codes
  • The most common question at helpdesk and training right now: What is my institution code and collection code?
63
How to construct a unique identifier
• LSID/URN with 5 elements
• Format
  URN:NetworkName:CodenProviderCode-InstitutionCode-CollectionCode-ObjectType:CatalogNumber:Version
• Example
  URN:gbif.net:ISC-DABUH-SEMF-Specimen:44622:1
  = "Insect and Spider Collections of the World" - "University of Helsinki, Department of Applied Biology" - "Lepidoptera Collection" - "Specimen" : Record 44622
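The format above is regular enough to build and parse mechanically. The following is a minimal sketch: the `RecordId` type and both function names are hypothetical, and the rules that no part may contain a colon and that the four codes may not contain a hyphen are assumptions made here so that the identifier can be split unambiguously.

```python
from typing import NamedTuple


class RecordId(NamedTuple):
    network: str
    coden_provider: str
    institution: str
    collection: str
    object_type: str
    catalog_number: str
    version: str


def build_identifier(rid: RecordId) -> str:
    codes = rid[1:5]  # coden provider, institution, collection, object type
    for part in rid:
        if not part or ":" in part:
            raise ValueError(f"empty part or ':' in {part!r}")
    for code in codes:
        if "-" in code:
            raise ValueError(f"'-' not allowed in code {code!r}")
    return "URN:{}:{}:{}:{}".format(
        rid.network, "-".join(codes), rid.catalog_number, rid.version
    )


def parse_identifier(text: str) -> RecordId:
    parts = text.split(":")
    if len(parts) != 5 or parts[0].upper() != "URN":
        raise ValueError(f"not a five-element URN: {text!r}")
    _, network, namespace, catalog_number, version = parts
    codes = namespace.split("-")
    if len(codes) != 4:
        raise ValueError(f"expected four hyphen-separated codes in {namespace!r}")
    return RecordId(network, *codes, catalog_number, version)
```

Building from the slide's own example values, `build_identifier(RecordId("gbif.net", "ISC", "DABUH", "SEMF", "Specimen", "44622", "1"))` yields `URN:gbif.net:ISC-DABUH-SEMF-Specimen:44622:1`, and `parse_identifier` reverses it.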
64
Summary
• Data remains under the control of providers
• Data standards and web services make it work
• Central registry and "marketplace" of shared and distributed data
• Anyone can build their vertical portals or specialised search engines on top of that
• Participant nodes: Major role in coordination and dissemination and data warehousing, possible local portal
• Data nodes: Register datasets, provide online access to database or repository