Global Biodiversity Information Facility (GBIF)
What is DiGIR? What is ABCD?
Presentation: Wouter Addink, ETI
Most slides made by: Giorgos Ksouris, GBIF Secretariat
Utrecht, 14 January
“Primary Biodiversity Data” Network
• GBIF is concerned with primary biodiversity data:
  • Specimens
  • Observations
  • Names
  • Species
  • Literature
  • Metadata on the above
• How will the data be contributed to the GBIF Network?
GBIF “Data Providers (Nodes)”
• Responsible for providing, through standard web exchange interfaces, metadata describing themselves and the data services they offer, and free access to biodiversity data.
• Should use a common data exchange format with a fixed structure that clearly defines how the information is to be shared.
• Data should be exchanged in a way that makes it as simple as possible to compare and merge information from different resources.
• GBIF therefore needs a simple model that allows institutions to share their data using structured formats, regardless of the formats they use in their own databases.
Data exchange standards
Models that allow data on individual specimens or observations to be structured and shared as XML documents that can be transmitted across the Internet:
• Darwin Core V2 (limited set of core data elements)
• ABCD V1.2 (complete set of all possible data elements in specimen and observation data)
Data exchange protocols
Define request and response message formats for standardized communication between provider and portal.
• DiGIR protocol
  • Uses open protocols and standards, such as HTTP, XML, and UDDI
  • De-couples protocol, software and semantics
  • Automates the establishment of a new data provider as much as possible
  • In use with Darwin Core in a few projects like MaNIS, but cannot be used with a complex XML Schema like ABCD
• BioCASE protocol
  • Based on DiGIR, but with a few improvements, such as the capability to use ABCD
  • Not compatible with DiGIR
  • Still under development
  • De-couples protocol, software and semantics better than DiGIR, but establishing a new data provider is more complex
• SOAP
  • Generic protocol using HTTP, XML and UDDI, not focused on specimen and observation data exchange
Data exchange format: ABCD
• Complex XML schema
• Covers the complete specimen and observation data domain
• Schema is used in the BioCASE project
• Hundreds of concepts (data elements)
• Schema includes:
  • Meta-data: Information about the source, from the name of the holding institution to copyright statements for the whole dataset.
  • Unit-data: Information regarding the records: specific copyrights, date of last modification, facts that don't fit in any other place, etc.
  • Gathering site: Information about the gathering site: gathering place, altitude, responsible person, etc.
  • Taxon identification: Possible identifications for this unit. Includes the taxon part and data on the identification event, such as who identified the unit.
  • Taxon name: Data about the name of the taxon. It is split into different parts for the different biological disciplines (botany, zoology, etc.), each with its own nomenclatural code. Includes data on the scientific name, higher taxon, etc.
Data exchange format: Darwin Core2
• XML schema
• In use for some time already (MaNIS project)
• Suitable for collections and observations data.
• 48 concepts (data elements); * = required element:
  DateLastModified *, InstitutionCode *, CollectionCode *, CatalogNumber *,
  ScientificName *, BasisOfRecord, Kingdom, Phylum,
  Class, Order, Family, Genus,
  Species, Subspecies, ScientificNameAuthor, IdentifiedBy,
  YearIdentified, MonthIdentified, DayIdentified, TypeStatus,
  CollectorNumber, FieldNumber, Collector, YearCollected,
  MonthCollected, DayCollected, JulianDay, TimeOfDay,
  ContinentOcean, Country, StateProvince, County,
  Locality, Longitude, Latitude, CoordinatePrecision,
  BoundingBox, MinimumElevation, MaximumElevation, MinimumDepth,
  MaximumDepth, Sex, PreparationType, IndividualCount,
  PreviousCatalogNumber, RelationshipType, RelatedCatalogItem, Notes
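A provider may want to check its records against the required elements before publishing. The following is a minimal sketch, assuming the five asterisked elements are DateLastModified, InstitutionCode, CollectionCode, CatalogNumber and ScientificName; the function name and the sample record are hypothetical and not part of any GBIF software.

```python
# Required Darwin Core2 elements, assuming the asterisks on the slide
# belong to these five names (the slide layout is ambiguous).
REQUIRED_ELEMENTS = (
    "DateLastModified",
    "InstitutionCode",
    "CollectionCode",
    "CatalogNumber",
    "ScientificName",
)


def missing_required(record: dict) -> list:
    """Return the required element names that are absent or empty in record."""
    return [
        name
        for name in REQUIRED_ELEMENTS
        if record.get(name) is None or str(record.get(name)).strip() == ""
    ]


# Hypothetical sample record, invented for illustration only.
sample = {
    "InstitutionCode": "XYZ",
    "CollectionCode": "LEP",
    "CatalogNumber": "4",
    "ScientificName": "Diarsia mendica",
    "Genus": "Diarsia",
}

print(missing_required(sample))
```

Running such a check over a whole table before registration gives a quick list of records that would be incomplete in the network.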
Software for GBIF “Data Providers (Nodes)”
GBIF has chosen to use DiGIR software and Darwin Core2 because:
• The provider software is stable
• It is easy to install and easy to use
• It is already used in the MaNIS network and some other projects
• Collection database models are rather easy to map against Darwin Core2 (but data providers will often miss data elements that are important for their database)
However, BioCASE software and ABCD will also be supported in the near future because:
• They will be in use in the BioCASE network
• BioCASE software has some improvements compared with DiGIR (but is still less easy to install and use)
• ABCD has more potential for the future than Darwin Core2
Data Provider within GBIF Architecture
[Architecture diagram. Labelled components: Portal (Request Manager, Query Engine, Cache, Metadata, Accounting, Resource Metadata Index); Data provider (Provider Services, Resource Metadata); Name provider (Provider Services, Resource Metadata; synonyms, GUIDs); UDDI Registry (Institutions, Services (Providers), AccessPoints). Labelled flows: provider query, available providers, publish availability, metadata and name query, metadata response, data query, data response, metadata and logs. Protocols shown: SOAP, DiGIR, HTTP.]
Web exchange interface: DiGIR
• Distributed Generic Information Retrieval (DiGIR) is a client/server protocol for retrieving information from distributed resources.
• Uses HTTP as the transport mechanism and XML for encoding messages sent between client and server.
• Three types of messages:
  • Metadata: get metadata about the provider and the resource(s) that it serves.
  • Search: find specimen and observation records based on search criteria, for example the name of a species and/or a rectangle defining an area on the earth’s surface and/or …
  • Inventory: get the set of distinct values associated with a single concept, for instance Species.
• Maps database models of collections to Darwin Core2 (suitable for exchange of specimen and observation data).
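The difference between the three message types can be illustrated with a small in-memory model. This is only a sketch of the semantics, not the DiGIR wire format: the `MessageType` enum, the `handle` function and the sample records are all hypothetical, and a real provider exchanges XML over HTTP.

```python
from enum import Enum


class MessageType(Enum):
    METADATA = "metadata"
    SEARCH = "search"
    INVENTORY = "inventory"


# Hypothetical resource content, invented for illustration only.
RECORDS = [
    {"CatalogNumber": "4", "Genus": "Diarsia", "Species": "mendica"},
    {"CatalogNumber": "6", "Genus": "Lycia", "Species": "lapponaria"},
    {"CatalogNumber": "7", "Genus": "Plutella", "Species": "maculipennis"},
    {"CatalogNumber": "8", "Genus": "Lycia", "Species": "hirtaria"},
]

# Hypothetical resource description returned by a metadata message.
RESOURCE_METADATA = {"name": "Training resource", "record_count": len(RECORDS)}


def handle(message_type, concept=None, value=None):
    """Answer one message against the in-memory resource."""
    if message_type is MessageType.METADATA:
        # Metadata: describe the provider/resource, no record data.
        return RESOURCE_METADATA
    if message_type is MessageType.SEARCH:
        # Search: return the records that match a criterion.
        return [r for r in RECORDS if r.get(concept) == value]
    if message_type is MessageType.INVENTORY:
        # Inventory: return the distinct values of a single concept.
        return sorted({r[concept] for r in RECORDS if concept in r})
    raise ValueError(f"unknown message type: {message_type!r}")


print(handle(MessageType.INVENTORY, concept="Genus"))
print(handle(MessageType.SEARCH, concept="Genus", value="Lycia"))
```

An inventory answers "which values exist?", while a search answers "which records have this value?"; a portal can use the first to build pick lists for the second.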
DiGIR: Advantages
• Provides a single point of access to one or many distributed information resources.
  • Resource: a collection of data objects that conform to a common schema.
• Enables search and retrieval of structured data.
• Makes the location and technical characteristics of the native resource transparent to the user.
• Not the only available software (BioCASE with the ABCD Schema is another candidate), but stable enough to be launched.
DiGIR Provider: How it Works
[Diagram. Labels: web server with DiGIR software; server; resources; provider metadata; resource metadata. Messages exchanged as XML over HTTP: Metadata message, Search/Inventory message.]
GBIF’s DiGIR Provider Package
• Encompasses the DiGIR Provider software, the Apache2 web server and PHP libraries.
• Requires only basic knowledge of the operating system from the user.
• Two available releases:
  • Linux (RedHat 7.3, 8, 9)
  • MS Windows (2000, XP)
• Supported databases:
  • MySQL
  • PostgreSQL
  • MS SQL Server
  • MS Access (only the MS Windows package)
• Offers automatic registration with the GBIF UDDI Registry
• Other features:
  • Caching (cleanup from the startup script)
  • Rotation of log files (web server, DiGIR provider)
DiGIR Provider Installation
• Completed in 4 steps:
  • Installation of GBIF’s DiGIR Provider package.
  • Definition of the provider’s metadata. (For a unique RecordIdentifier in the GBIF network, use the format ParticipantCode:InstitutionCode:CollectionCode.)
  • Definition of resource(s).
  • Registration with the GBIF UDDI registry.
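The RecordIdentifier format from step 2 can be assembled and checked with a few lines of code. This is a minimal sketch; the function names and the codes "NL", "XYZ" and "LEP" are hypothetical, and the rule that a code may not itself contain a colon is an assumption made so that the identifier can be split again unambiguously.

```python
def make_record_identifier(participant_code, institution_code, collection_code):
    """Join the three codes as ParticipantCode:InstitutionCode:CollectionCode."""
    parts = (participant_code, institution_code, collection_code)
    for part in parts:
        # Assumption: codes are non-empty and contain no colon.
        if not part or ":" in part:
            raise ValueError(f"invalid code: {part!r}")
    return ":".join(parts)


def split_record_identifier(identifier):
    """Split an identifier back into its three codes."""
    parts = identifier.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"not a three-part identifier: {identifier!r}")
    return tuple(parts)


# Hypothetical codes, invented for illustration only.
identifier = make_record_identifier("NL", "XYZ", "LEP")
print(identifier)
print(split_record_identifier(identifier))
```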
Becoming a GBIF Data provider in the Netherlands (1)
• Determine which data sets you can provide in structured electronic form (such as a database) and whether these data sets contain specimen data, observation data, species data or other biodiversity data. The data also needs to be maintained.
• Determine which data may be made available for public use. GBIF has decided to make all data in the network publicly available (this may change in the future). To avoid extra complexity, there will be no user restrictions such as password protection for data. Data that should not be available for public use should not be provided. For example, do not provide exact locations of endangered species that could be of use to hunters or illegal traders.
• Define an IPR (Intellectual Property Rights) policy for each data set.
• Information about the data sets (metadata) should be sent to NLBIF and will be kept in a central metadatabase. This information will also be available in the BioCASE metadata network.
Becoming a GBIF Data provider in the Netherlands (2)
Required metadata (the minimum metadata needed is the required metadata for DiGIR and for the BioCASE NoDIT database):
• A name, address, description and unique code for your organisation (see the GBIF website for codes already taken)
• A name, description and unique code for each dataset in your organisation
• The unique identifier used to identify a specimen
• A last-modified date for the dataset
• At least one contact name, address and phone number
Becoming a GBIF Data provider in the Netherlands (3)
• Check whether you can make your data available in one of the following database formats:
  • MySQL
  • PostgreSQL
  • MS SQL Server
  • MS Access (only the MS Windows package)
• Check whether you have a computer with internet access available, running
  • Linux (RedHat 7.3, 8, 9) or
  • MS Windows (2000, XP)
If this is the case, congratulations: you can maintain your own data node that uses the standard GBIF DiGIR provider software. In all other cases, please contact NLBIF. NLBIF can also provide data storage space for your datasets.
With DiGIR you might also be able to use DB2, Interbase, Frontbase, Informix, Visual FoxPro, PostgreSQL, Sybase, or another ODBC-compliant database. However, this is currently not supported by GBIF.
NLBIF Assistance
• Using the complete distribution of the DiGIR provider supplied by GBIF (including PHP, the Apache web server and automatic GBIF UDDI registration) is recommended. The GBIF helpdesk or NLBIF (ETI) can help you with technical installation problems.
• To use your data source with DiGIR, you need to map the data fields you want to publish 1:1 to Darwin Core V2 Schema elements (the software does not contain translator functions). For this you will probably need to create a view (if your database supports this) for some of the fields, or a separate database with the needed fields. Contact NLBIF if you need assistance with conversions.
• Because the GBIF network uses caching, it will take a few hours before your data is visible in the network.
• If you want a custom search interface on your dataset, please also contact NLBIF. NLBIF is developing several web modules for this purpose that will be used for collections like those from ZMA.
• You may use BioCASE and ABCD instead of DiGIR, for instance if you want to provide data that does not fit in Darwin Core, but it is recommended to start with the DiGIR provider.
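The 1:1 mapping through a database view can be sketched as follows. SQLite is used here only because it ships with Python and runs in memory; it is a stand-in for the databases supported by the GBIF package, and the `specimens` table, its column names and the codes 'XYZ' and 'LEP' are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows can be read by column name

# Hypothetical native table with its own column names.
conn.execute(
    "CREATE TABLE specimens (id INTEGER PRIMARY KEY, genus_name TEXT, "
    "species_name TEXT, collected_on TEXT, modified TEXT)"
)
conn.execute(
    "INSERT INTO specimens VALUES (4, 'Diarsia', 'mendica', '1972-06-14', "
    "'2003-11-20T22:50:00Z')"
)

# The view exposes one column per Darwin Core2 element, so that each
# published field maps 1:1 without any translation in the provider.
conn.execute(
    """
    CREATE VIEW darwin_core AS
    SELECT
        modified                                    AS DateLastModified,
        'XYZ'                                       AS InstitutionCode,
        'LEP'                                       AS CollectionCode,
        CAST(id AS TEXT)                            AS CatalogNumber,
        genus_name || ' ' || species_name           AS ScientificName,
        genus_name                                  AS Genus,
        species_name                                AS Species,
        CAST(substr(collected_on, 1, 4) AS INTEGER) AS YearCollected,
        CAST(substr(collected_on, 6, 2) AS INTEGER) AS MonthCollected,
        CAST(substr(collected_on, 9, 2) AS INTEGER) AS DayCollected
    FROM specimens
    """
)

record = dict(conn.execute("SELECT * FROM darwin_core").fetchone())
print(record)
```

The view does the splitting of the collection date and the concatenation of the name, which is the kind of conversion the provider software itself does not perform.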
GBIF network growth
• The global network started at the end of November 2003
• Currently about 28 data providers worldwide are connected, with about 8.5 million specimen and observation records
• The Netherlands are currently connected with 7 collections containing records
• With your help this can be … million records next year?!!
Darwin Core2 Elements (1)
• DateLastModified: ISO 8601 compliant stamp indicating the date and time in UTC (GMT) when the record was last modified. Example: the instant "November 5, 1994, 8:15:30 am, US Eastern Standard Time" would be represented as "1994-11-05T13:15:30Z".
• InstitutionCode: A "standard" code that identifies the institution to which the collection belongs. No global registry exists for assigning institutional codes. Use the code that is "standard" in your discipline.
• CollectionCode: A unique alphanumeric value which identifies the collection within the institution.
• CatalogNumber: A unique alphanumeric value which identifies an individual record within the collection. It is recommended that this value provides a key by which the actual specimen can be identified. If the specimen has several items, such as various types of preparation, this value should identify the individual component of the specimen.
• ScientificName: The full name of the lowest-level taxon the Catalogued Item can be identified as a member of; includes genus name, specific epithet, and subspecific epithet (zool.) or infraspecific rank abbreviation and infraspecific epithet (bot.). Use the name of a suprageneric taxon (e.g., family name) if the Catalogued Item cannot be identified to genus, species, or infraspecific taxon.
• BasisOfRecord: An abbreviation indicating whether the record represents an observation (O), living organism (L), specimen (S), germplasm/seed (G), etc.
• Kingdom: The kingdom to which the organism belongs.
• Phylum: The phylum (or division) to which the organism belongs.
• Class: The class name of the organism.
• Order: The order name of the organism.
• Family: The family name of the organism.
• Genus: The genus name of the organism.
• Species: The specific epithet of the organism.
• Subspecies: The subspecific epithet of the organism.
• ScientificNameAuthor: The author of a scientific name. Author string as applied to the accepted name. Can be more than one author (concatenated string). Should be formatted according to the conventions of the applicable taxonomic discipline.
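The DateLastModified rule (ISO 8601, expressed in UTC) can be checked against the slide's own example instant. This is a minimal sketch using only the Python standard library; the function name is hypothetical, and the fixed UTC-5 offset stands in for US Eastern Standard Time.

```python
from datetime import datetime, timedelta, timezone


def to_date_last_modified(local_time: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC stamp."""
    if local_time.tzinfo is None:
        # A naive datetime cannot be converted to UTC reliably.
        raise ValueError("a timezone-aware datetime is required")
    return local_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# US Eastern Standard Time as a fixed offset of five hours behind UTC.
EST = timezone(timedelta(hours=-5), "EST")

stamp = to_date_last_modified(datetime(1994, 11, 5, 8, 15, 30, tzinfo=EST))
print(stamp)  # 1994-11-05T13:15:30Z
```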
Darwin Core2 Elements (2)
• IdentifiedBy: The name(s) of the person(s) who applied the currently accepted Scientific Name to the Catalogued Item.
• YearIdentified: The year portion of the date when the Collection Item was identified; as four digits, e.g., 1906.
• MonthIdentified: The month portion of the date when the Collection Item was identified; as two digits [01..12].
• DayIdentified: The day portion of the date when the Collection Item was identified; as two digits [01..31].
• TypeStatus: Indicates the kind of nomenclatural type that a specimen represents. In particular, the type status may not apply to the name listed in the scientific name, i.e. the current identification. In rare cases, a single specimen may be the type of more than one name.
• CollectorNumber: An identifying "number" (really a string) applied to specimens (in some disciplines) at the time of collection. Establishes links between different parts/preparations of a single specimen and between field notes and the specimen.
• FieldNumber: A "number" (really a string) created at collection time to identify all material that resulted from a collecting event.
• Collector: The name(s) of the collector(s) responsible for collecting the specimen or taking the observation.
• YearCollected: The year (expressed as an integer) in which the specimen was collected. The full year must be expressed (e.g., "1972", not "72").
• MonthCollected: The month of the year in which the specimen was collected from the field. Possible values range from 1 to 12 inclusive.
• DayCollected: The day of the month on which the specimen was collected from the field. Possible values range from 1 to 31 inclusive.
• JulianDay: The ordinal day of the year, i.e., the day's position counted from January 1 of the same year. (January 1 is Julian Day 1.)
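The four collection-date elements can all be derived from a single calendar date. The sketch below uses the standard library's day-of-year value, which matches the definition above (January 1 is day 1); the function name is hypothetical.

```python
from datetime import date


def collection_date_elements(collected: date) -> dict:
    """Derive the Darwin Core2 collection-date elements from one date."""
    return {
        "YearCollected": collected.year,    # full year, e.g. 1972
        "MonthCollected": collected.month,  # 1..12
        "DayCollected": collected.day,      # 1..31
        # tm_yday is the ordinal day of the year; January 1 gives 1.
        "JulianDay": collected.timetuple().tm_yday,
    }


# 1972 is a leap year, so 1 March is day 31 + 29 + 1 = 61.
print(collection_date_elements(date(1972, 3, 1)))
```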
Darwin Core2 Elements (3)
• TimeOfDay: The time of day a specimen was collected, expressed as decimal hours from midnight local time (e.g., 12.0 = midday, 13.5 = 1:30 pm).
• ContinentOcean: The continent or ocean from which a specimen was collected.
• Country: The country or major political unit from which the specimen was collected. ISO values should be used. Full country names are currently in use. A future recommendation is to use ISO two-letter codes or the full name when searching.
• StateProvince: The state, province or region (i.e. the next political region smaller than Country) from which the specimen was collected.
• County: The county (or shire, or next political region smaller than State/Province) from which the specimen was collected.
• Locality: The locality description (place name plus, optionally, a displacement from the place name) from which the specimen was collected. Where a displacement from a location is provided, it should be in un-projected units of measurement.
• Longitude: The longitude of the location from which the specimen was collected. This value should be expressed in decimal degrees with a datum such as WGS-84.
• Latitude: The latitude of the location from which the specimen was collected. This value should be expressed in decimal degrees with a datum such as WGS-84.
• CoordinatePrecision: An estimate of how tightly the collecting locality was specified; expressed as a distance, in meters, that corresponds to a radius around the latitude-longitude coordinates. Use NULL where precision is unknown, cannot be estimated, or is not applicable.
• BoundingBox: This access point provides a mechanism for performing searches using a bounding box. A BoundingBox element is not typically present in the database, but rather is derived from the Latitude and Longitude columns by the data provider.
• MinimumElevation: The minimum distance in meters above (positive) or below sea level of the collecting locality.
• MaximumElevation: The maximum distance in meters above (positive) or below sea level of the collecting locality.
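Two of these elements are computed rather than stored: TimeOfDay is a decimal-hours conversion, and a BoundingBox search is a comparison against the Latitude and Longitude columns. The sketch below shows both under simple assumptions; the function names are hypothetical, the coordinates are illustrative, and the box test deliberately ignores boxes that cross the 180° meridian.

```python
from datetime import time


def time_of_day(local: time) -> float:
    """Express a local clock time as decimal hours from midnight."""
    return local.hour + local.minute / 60 + local.second / 3600


def in_bounding_box(lat, lon, min_lat, min_lon, max_lat, max_lon) -> bool:
    """Test whether a decimal-degree point lies inside a rectangle.

    Assumption: the rectangle does not cross the 180 degree meridian.
    """
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


print(time_of_day(time(13, 30)))  # 13.5, matching the 1:30 pm example

# Illustrative point and rectangle, invented for this sketch.
print(in_bounding_box(52.09, 5.12, 50.0, 3.0, 54.0, 8.0))
```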
Darwin Core2 Elements (4)
• MinimumDepth: The minimum distance in meters below the surface of the water at which the collection was made; all material collected was at least this deep. Positive below the surface, negative above (e.g. collecting above sea level in tidal areas).
• MaximumDepth: The maximum distance in meters below the surface of the water at which the collection was made; all material collected was at most this deep. Positive below the surface, negative above (e.g. collecting above sea level in tidal areas).
• Sex: The sex of a specimen. The domain should be a controlled set of terms (codes) based on community consensus. Proposed values: M = Male; F = Female; H = Hermaphrodite; I = Indeterminate (examined but could not be determined); U = Unknown (not examined); T = Transitional (between sexes; useful for sequential hermaphrodites).
• PreparationType: The type of preparation (skin, slide, etc.). Probably best to add this as a record element rather than an access point. Should be a list of preparations for a single collection record.
• IndividualCount: The number of individuals present in the lot or container. Not an estimate of abundance or density at the collecting locality.
• PreviousCatalogNumber: The previous (fully qualified) catalogue number of the Catalogued Item if the item was earlier identified by another Catalogue Number, either in the current catalogue or in another institution or catalogue. A fully qualified Catalogue Number is preceded by Institution Code and Collection Code, with a space separating each subelement. Referencing a previous Catalogue Number does not imply that a record for the referenced item is or is not present in the corresponding catalogue, or even that the referenced catalogue still exists. This access point is intended to provide a way to retrieve this record by a previously used identifier, which may be used in the literature. In future versions of this schema this attribute should be set-valued.
• RelationshipType: A named or coded value that identifies the kind of relationship between this Collection Item and the referenced Collection Item. Named values include "parasite of", "epiphyte on", "progeny of", etc. In future versions of this schema this attribute should be set-valued.
• RelatedCatalogItem: The fully qualified identifier of a related Catalogue Item (a reference to another specimen): Institution Code, Collection Code, and Catalogue Number of the related Catalogued Item, where a space separates the three subelements.
• Notes: Free text notes attached to the specimen record.
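Converting a native value domain to the proposed Sex codes is a typical small mapping task. The sketch below uses the six proposed codes from the slide; the function name is hypothetical, and accepting either the code or the full English label as input is an assumption about what a source database might contain.

```python
# Proposed Sex codes, as listed on the slide.
SEX_CODES = {
    "M": "Male",
    "F": "Female",
    "H": "Hermaphrodite",
    "I": "Indeterminate",
    "U": "Unknown",
    "T": "Transitional",
}


def normalise_sex(value):
    """Map a native value (a code or an English label) to a Sex code.

    Returns None for empty input and raises ValueError for values
    outside the controlled set.
    """
    if value is None or not value.strip():
        return None
    text = value.strip()
    if text.upper() in SEX_CODES:
        return text.upper()
    for code, label in SEX_CODES.items():
        if label.lower() == text.lower():
            return code
    raise ValueError(f"not in the controlled Sex vocabulary: {value!r}")


print([normalise_sex(v) for v in ("female", "m", " Hermaphrodite ", "")])
```

Raising an error for unknown values, rather than guessing, keeps uncontrolled terms out of the published data.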
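DiGIR & Darwin Core2: An Example
[Listing of a DiGIR response in XML; the markup and the date parts of the timestamps were lost in extraction. Recoverable fragments: a <content> element declaring the darwin, xsd and xsi namespaces, followed by three records that each carry a timestamp, "bioshare.com", "pyy", a number and a scientific name: 4 Diarsia mendica, 6 Lycia lapponaria, 7 Plutella maculipennis. The listing ends with the value "false".]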
Management of Resources: A Training DB
• Getting familiar with the training MS Access database:
  • Biotella: one of many available observation and specimen database tools
  • "Open source" Microsoft Access Basic application
  • Can export ABCD and DwC formats to the GBIF Data Repository Tool (in an upcoming version)
  • Can act as a resource for a DiGIR Provider
  • Training database populated with sample Lepidoptera data
Biotella Observation Database Schema: Main Tables
Mapping the Database against Darwin Core2
• Alternatives
  • Mapping within the database (faster queries with indexing, conversion of value domains, available in Biotella)
  • Mapping at the DiGIR Provider (no database work needed)
• Conversion of value domains
  • Big issue; let’s leave it as is for the time being
Registration with GBIF UDDI Registry
• Universal Description, Discovery & Integration (UDDI) is a special directory that provides methods for publishing and finding business and service information and specifications.
• UDDI is based on existing standards, such as XML and SOAP.
• Four primary data types:
  • businessEntity: represents basic business information, e.g. contact information, categorization, descriptions, etc.
  • businessService: describes a service provided by the business
  • bindingTemplate: contains an optional description of the service, the URL of its access point, and a reference to one or more tModels
  • tModel: abstract description of a particular specification or behaviour to which the Web service conforms
[Diagram: a businessEntity contains businessServices; each businessService contains a bindingTemplate; bindingTemplates reference tModels.]
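The containment between the four data types can be made concrete with a small data model. This is only a sketch of the relationships described above, not the UDDI API: the class and field names are simplified, and the institution, provider name and URL are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TModel:
    """Abstract description of a specification a service conforms to."""
    name: str
    description: str = ""


@dataclass
class BindingTemplate:
    """Access point of a service plus references to its tModels."""
    access_point: str
    tmodels: List[TModel]
    description: str = ""


@dataclass
class BusinessService:
    """A service offered by a business entity."""
    name: str
    bindings: List[BindingTemplate] = field(default_factory=list)
    description: str = ""


@dataclass
class BusinessEntity:
    """Basic information about the organisation offering services."""
    name: str
    services: List[BusinessService] = field(default_factory=list)
    description: str = ""


def access_points(entity: BusinessEntity, tmodel_name: str) -> List[str]:
    """List the access points of all services bound to a given tModel."""
    return [
        binding.access_point
        for service in entity.services
        for binding in service.bindings
        if any(t.name == tmodel_name for t in binding.tmodels)
    ]


# Hypothetical registry entry, invented for illustration only.
digir = TModel("DiGIR", "DiGIR protocol")
node = BusinessEntity(
    "Example Museum",
    services=[
        BusinessService(
            "provider.example.org",
            bindings=[BindingTemplate("http://provider.example.org/digir", [digir])],
        )
    ],
)
print(access_points(node, "DiGIR"))
```

A portal that wants all DiGIR providers effectively runs this kind of lookup against the registry: it selects bindings by tModel and collects their access points.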
Registration with GBIF UDDI Registry (2)
• Several steps make data useful in a UDDI registry:
  • Companies, organisations and standards bodies define tModels relevant to an industry, business or science, and register them in UDDI (DiGIR tModel).
  • Companies and organisations (business entities) register descriptions of themselves (Data Node) and define the services (DiGIR provider) they offer.
  • UDDI taxonomies are used for describing business entities (connection between GBIF Participant Node and Data Nodes).
  • Marketplaces, search engines, and business applications (GBIF portal, GBIF Participant Node portals) query the registry to discover services of interest at other companies.
Registration with GBIF UDDI Registry (3)
• Automatic registration with the GBIF UDDI registry.
• Uses the values of the elements defined as metadata of the provider (plus some extra information).
  • Business Entity
    • business name: {the of the institution}
    • description: {the location (URL) pointing to institution }
  • Business Service
    • service name: {the common of the provider} (your.server.name)
    • description: {the information of }
  • Binding Template
    • access point:
    • description: Access point of { }
• Demonstration
  • Registration of trainees’ DiGIR Providers
Exploration of GBIF UDDI Registry
Exploration of GBIF UDDI Registry (2)
• Find all business entities corresponding to Data Nodes under a Participant Node:
  • Access the URL
  • Click on the Browse link under the Taxonomies subtree.
  • Click on the gbif:nodes link.
  • Click on the Sweden link in the Categories box.
  • Press the Find business button.
Use of a Search Portal
Use of a Search Portal (1)
• Find all records of a database resource where the Darwin Core2 concept Genus contains the word Colias:
  • Access the URL and press the Build query button.
  • Click on one of the available resources in the Select data providers section.
  • Select Genus from the Select a concept selection list in the Select query conditions section. Select like from the Select a comparator selection list and type Colias in the adjacent text box.
  • Press the Submit query button.