What needs to be checked? Quality control procedures for OBIS data

Presentation transcript:

What needs to be checked?

Quality control procedures for OBIS data - background
Aim: help data providers & data managers
  Checking quality
  Checking completeness
  Detect (possible) errors
Assign quality flags to each available record
  Evaluation of fitness for purpose & use (NOT: good or bad)
  Filter out records with a certain quality standard
Example: Abra alba at latitude 24,53 & longitude 67,94 in 1983
  Record suitable for general distribution analysis (species occurrence)
  Record suitable for general temporal analysis (yearly trends)
  Record not suitable for seasonal analysis
  Record not suitable for abundance-related analyses (presence only)
Notes: this slide explains why we do QC. The example shows that we do not want to label a record as good or bad, but rather to indicate what it is and is not suitable for.
=> Communication with the provider can improve the quality of the contributing data
=> Users can take direct action based on the results

Quality control procedures for OBIS data - background
Approach:
  Automated process, within the database
    Allows creation of filters
    Allows feedback to data providers
  Online web services
    Freely available for use to everyone
    Allows direct feedback to the user (result reports)
=> Communication with the provider can improve the quality of the contributing data
=> Users can take direct action based on the results
Notes: the individual checks are not covered in detail here; they are the subject of the second presentation.

Technical:
  18 quality control steps, on individual record level
  10 outlier checks, on dataset or species level
  Each QC step = yes (1)/no (0) question
  Creation of a bit-sequence (2^(x-1)) => stored as an integer value for the QC => unique value for each possible combination

Example 1: QC steps 1, 2 and 4 evaluated positive
  QC step   Value   Bit-seq.
  1         1       2^(1-1) = 1
  2         1       2^(2-1) = 2
  3         0       0
  4         1       2^(4-1) = 8
  5         0       0
  TOTAL             11

Example 2: all five QC steps evaluated positive
  QC step   Value   Bit-seq.
  1         1       2^(1-1) = 1
  2         1       2^(2-1) = 2
  3         1       2^(3-1) = 4
  4         1       2^(4-1) = 8
  5         1       2^(5-1) = 16
  TOTAL             31

If a record is evaluated positive for a certain QC, then two to the power of (the unique number of this QC minus one) is added to the field QC in the table ‘Eurobis’. If, for example, a record is evaluated positive for QC number 3, the value 2^(3-1) = 4 is added to the field QC. This way, a bit sequence is created and stored as an integer value in the field QC, which makes it possible to store the results of different QCs in one (integer) field.
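The bit-sequence idea can be sketched in a few lines of Python. This is only an illustration of the arithmetic described on the slide, not the code running in the database; the function names `encode_qc` and `decode_qc` are hypothetical.

```python
def encode_qc(passed_steps):
    """Combine the numbers of the QC steps a record passed into one integer."""
    qc = 0
    for step in passed_steps:
        # Adds 2^(step-1); using OR means a repeated step is not counted twice.
        qc |= 1 << (step - 1)
    return qc


def decode_qc(qc):
    """Return the QC step numbers stored in the integer, in ascending order."""
    steps = []
    step = 1
    while qc:
        if qc & 1:
            steps.append(step)
        qc >>= 1
        step += 1
    return steps


print(encode_qc([1, 2, 4]))        # 11, as in the first example table
print(encode_qc([1, 2, 3, 4, 5]))  # 31, as in the second example table
print(decode_qc(11))               # [1, 2, 4]
```

Because every QC step owns its own bit, each combination of passed steps maps to exactly one integer and can be recovered from it.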

LifeWatch: home for a multitude of web services
Part of the European Strategy Forum on Research Infrastructures (ESFRI)
Distributed virtual laboratory:
  Biodiversity research
  Climatological & environmental impact studies
  Support development of ecosystem services
  Provide information for policy makers
Biodiversity observatories, databases, web services and modelling tools
Integration of existing systems, upgrades, new systems
LifeWatch wants:
  Standardization of species data
  Integration of distributed biodiversity data repositories & operating facilities
LifeWatch needs:
  Species information services

Notes: LifeWatch was established as part of the European Strategy Forum on Research Infrastructures (ESFRI). It is a distributed virtual laboratory that will be used for biodiversity research, for climatological and environmental impact studies, to support the development of ecosystem services and to provide information for policy makers in Europe. This large European research infrastructure will consist of several biodiversity observatories, databases, web services and modelling tools. It will integrate the existing systems, upgrading them where possible and developing new systems where needed.

Flemish contribution: Lifewatch.eu will build a distributed infrastructure for the study and monitoring of biodiversity in Europe, with participation from many institutes in EU countries and from existing European networks (MarBEF, LTER, PESI, GBIF, Species 2000, 4D4Life, ...). VLIZ and INBO have several existing systems to offer and are already participating as partners (LifeWatch preparatory project) or through networks (all of the above). The price is reasonable and the chances of success are high, because this is an upgrade of existing systems and the expertise of both partners is proven. The project is also innovative: integration on a very large scale, novel web technologies, development of new biosensors, and an infrastructure of a kind that is not yet available. Because it is available online, the infrastructure will benefit many scientists.

Marine observatory; taxonomic backbone. Note: ‘central’ does not mean one server or one database, but that it is central in the workflow mechanisms. Similarly, ‘integrating’ does not imply creating a new database.

LifeWatch offers a compilation and combination of several web services
These services = taxonomic backbone:
  Taxonomy access services
  Taxonomic editing environment
  Species occurrence services
  Catalogue services
LifeWatch infrastructure = interactive part of LifeWatch:
  Identify, analyze and design online data services, models and applications
  Make use of all LifeWatch data

LifeWatch web services
Login / password required
System keeps track of all your “jobs”

Taxonomic quality control
Taxon match: World Register of Marine Species (WoRMS)
LifeWatch taxon match:
  World Register of Marine Species (WoRMS)
  Integrated Taxonomic Information System (ITIS)
  Catalogue of Life (CoL)
  International Plant Names Index (IPNI)
  Index Fungorum (IF)
  PalaeoBiology Database (Palaeo-DB)
  Pan-European Species directories Infrastructure (PESI)

Taxonomic QC step by step
TAXON NAME X: match with WoRMS?
  Yes:
    Document the LSID
    Check habitat (marine/non-marine)
    Check taxonomic level (genus/species)
  No: match with other registers?
    Yes: is the taxon marine?
    No: contact the data provider for a secondary check and go through the matches again

The questions asked for each name:
  Is the taxon name available in WoRMS?
  Is the taxon equal to or more detailed than genus/species?
  Is the taxon marine or not?
Other taxonomic databases: Catalogue of Life (CoL), International Plant Names Index (IPNI), Integrated Taxonomic Information System (ITIS), … Also: Index Fungorum, Interim Register of Marine & Non-Marine Genera (IRMNG)
  Yes: contact the taxonomic editor to add the taxon to WoRMS
  No: add the taxon to the annotated list

World Register of Marine Species
  2004: MarBEF EU FP6 => European Register of Marine Species (ERMS)
  2007: further development to World Register
  Not just a name index, but an expert-based taxonomic database
  > 200 taxonomic editors, Steering Committee & data management team
  Permanently hosted at VLIZ
  Web-based system, including web services
  International standards, permanent LSIDs

WoRMS Taxon Match Tool
Freely available, no password/login required
This tool uses the following components:
  TAXAMATCH fuzzy matching algorithm by Tony Rees
  PHP/MySQL port of TAXAMATCH by Michael Giddens
  Scientific Names Parser by Dmitry Mozzherin

Prepare your own file: Plain text [TXT], Comma Separated [CSV] or Excel Sheet [XLS, XLSX]
For convenience => name the column “scientific_name”
Upload it onto the website
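A minimal sketch of preparing such a file with Python's standard `csv` module is given below. Only the column name “scientific_name” comes from the slide; the function name `write_taxon_file`, the file name and the example names are illustrative assumptions.

```python
import csv


def write_taxon_file(path, scientific_names):
    """Write one taxon name per row under a 'scientific_name' header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["scientific_name"])
        for name in scientific_names:
            # Strip stray whitespace so it cannot cause a spurious non-exact match.
            writer.writerow([name.strip()])


# Hypothetical input list; the file name is an arbitrary choice.
write_taxon_file("taxon_list.csv", ["Abra alba", "Verruca stroemia "])
```

The resulting file can then be uploaded through the web form of the tool.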

WoRMS taxon match results:
  Exact match
  Phonetic match
  Near_1 match
  Near_2 match
  No match
Check and verify everything that is not an exact match…
Some examples:
  Phonetic: Fragilaria aurivillii => Fragilaria aurivilii
  Near_1: Chaetoceros seychellarum => Chaetoceros seychellarus
  Near_2: Gammarus finnmarchius => Gammarus finmarchicus
  Near_2: Syllis armoricanus => Syllis armoricana
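The rule "check and verify everything that is not an exact match" can be applied to a result list as sketched below. The record layout (`submitted`, `matched`, `match_type`) is a hypothetical one chosen for this illustration and is not the actual export format of the tool; the non-exact name pairs are the examples from the slide.

```python
def needs_verification(results):
    """Return every result whose match type is anything other than 'exact'."""
    return [row for row in results if row["match_type"] != "exact"]


# Hypothetical result rows; the first one is an invented exact match.
results = [
    {"submitted": "Abra alba", "matched": "Abra alba", "match_type": "exact"},
    {"submitted": "Fragilaria aurivillii", "matched": "Fragilaria aurivilii",
     "match_type": "phonetic"},
    {"submitted": "Chaetoceros seychellarum", "matched": "Chaetoceros seychellarus",
     "match_type": "near_1"},
    {"submitted": "Syllis armoricanus", "matched": "Syllis armoricana",
     "match_type": "near_2"},
]

for row in needs_verification(results):
    print(f'{row["match_type"]}: {row["submitted"]} => {row["matched"]}')
```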

LifeWatch taxon match tool

Currently available taxon services
If a taxon is not in WoRMS:
  Send an email to info@marinespecies.org
  Let us know if it is available in any of the other registers

Use this report as feedback to your provider / WoRMS

Taxonomic quality control – ambiguous matches
Scientific name: Chondracanthus, unknown species
  Kingdom Plantae (Rhodophyta)
  Kingdom Animalia (Crustacea)
Scientific name: Alebion
  Alebion Krøyer, 1863 => Animalia, Crustacea, parasitic copepods
  Alebion Gray, 1867 => Animalia, Porifera => accepted as Iophon Gray, 1867

Taxonomic quality control – its importance illustrated…
“… In total, 6,172 unique taxon names were submitted …. After a thorough QC, however, this number was reduced to 4,525, mostly due to spelling variations and synonymy.”
“… Such [taxonomic] quality control is highly needed, since a misspelled or obsolete name could be compared to the introduction of a rare species, with adverse effects on further (biodiversity) calculations…”
Source: Vandepitte et al. (2010). Data integration for European marine biodiversity research: creating a database on benthos and plankton to study large-scale patterns and long-term changes. Hydrobiologia 644: 1-13

Geographic quality control
LifeWatch: Show on map
LifeWatch: Marine Regions Gazetteer services
  Get lat-lon by MRGID
  Get lat-lon by name
  Get Gazetteer name by lat-lon
  Get lat-lon by accepted name

Geographic QC – the concept
Communication with provider
  Before quality control      After quality control
  18°30’25’’N – 5°15’E        18.51 ; 5.25
  54,23N – 16.5S              54.23 ; -16.5
WGS84 = World Geodetic System 1984; the most used geographical reference system
Decimal degrees => easy to work with
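The conversion to decimal degrees shown in the first example row can be sketched as follows. The function names are hypothetical, and treating S and W as negative values is the usual convention rather than something stated on the slide.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds plus a hemisphere letter to decimal degrees."""
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # South and west are expressed as negative decimal degrees.
    return -value if hemisphere in ("S", "W") else value


def parse_decimal(text):
    """Parse a number written with either a decimal comma or a decimal point."""
    return float(text.strip().replace(",", "."))


print(round(dms_to_decimal(18, 30, 25, "N"), 2))  # 18.51
print(dms_to_decimal(5, 15, 0, "E"))              # 5.25
print(parse_decimal("54,23"))                     # 54.23
```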

Coordinates are indispensable
Coordinates = basis of a biogeographic information system
When no coordinates are provided…
  Check with the data provider / the source
  When they exist: complete the file & run QC
  When they do not exist:
    Derive them from a provided map
    Check Marine Regions to assign coordinates

Marine Regions = Standard, relational list of geographic names Coupled with information and maps of the geographic location Improve access and clarity of the different geographic, mainly marine names such as seas, sandbanks, ridges and bays http://www.marineregions.org

Fish species “A” present in Kenya
Marine species on land?
Link with the adjacent sea area: the Exclusive Economic Zone (EEZ)
Indicate precision!

The importance of geographical QC
Some examples:
  “Monitoring in Belgian part of the North Sea”: “+” & “-” signs switched
  “Monitoring in Kongsfjorden area”: latitude & longitude switched

Sightings and strandings of marine turtles around the coast of UK and Ireland Left: coordinates as received; right: corrected. Errors due to missing minus sign

What else to check…? Use common sense…

Dates
The OBIS data format check includes a check on the date format:
  Year: “1972” vs “72” vs “972”
  Month: between 1-12
  Day: between 1-31; the check takes into account the given month
but…
  Dataset from 1990, with a few records in 1909…
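A date format check along these lines could look like the sketch below. It is an assumption-laden illustration, not the OBIS implementation: the function name `check_date` and the choice to pass the fields as strings are hypothetical. A format check like this cannot catch the "1909 in a 1990 dataset" case, which needs common sense or a comparison with the rest of the dataset.

```python
import calendar


def check_date(year, month, day):
    """Return a list of problems for a date given as three strings."""
    problems = []
    year_ok = year.isdecimal() and len(year) == 4 and int(year) > 0
    if not year_ok:
        problems.append("year must be written with four digits")
    month_ok = month.isdecimal() and 1 <= int(month) <= 12
    if not month_ok:
        problems.append("month must be between 1 and 12")
    if not day.isdecimal():
        problems.append("day must be a number")
    elif month_ok:
        # Without a usable year, fall back to a leap year so 29 February is allowed.
        ref_year = int(year) if year_ok else 2000
        last_day = calendar.monthrange(ref_year, int(month))[1]
        if not 1 <= int(day) <= last_day:
            problems.append(f"day must be between 1 and {last_day} for this month")
    elif not 1 <= int(day) <= 31:
        problems.append("day must be between 1 and 31")
    return problems


print(check_date("1972", "2", "29"))  # [] (1972 is a leap year)
print(check_date("72", "2", "28"))
print(check_date("1990", "2", "30"))
```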

Units
OBIS can capture: counts, biomass, depth
Are units defined?
  Counts: individuals per m², cm², liter, m³
  Biomass: wet weight, dry weight, ash-free dry weight
  Depth: meter, centimeter
“I collected 4 individuals of species X from location Z” => Sample size? 10 cm² - 50 cm² - 1 m² - …?

Significance:
  Needs thorough documenting
  Know what you are dealing with
  Comparison
Convert to OBIS standards:
  Depth: in meter, positive values
  Abundance: NULL versus 0 (absence); positive values
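The two conventions can be illustrated with a short sketch. It assumes that a negative depth in a source file means "below the surface" and can simply be made positive, which should be confirmed with the data provider; the function names and unit codes are hypothetical.

```python
def depth_in_metres(value, unit):
    """Convert a depth to metres and express it as a positive value."""
    divisors = {"m": 1, "cm": 100}
    if unit not in divisors:
        raise ValueError(f"unit not defined: {unit!r}")
    # Assumption: the sign only encodes direction, so the absolute value is used.
    return abs(value) / divisors[unit]


def describe_abundance(value):
    """Distinguish a missing abundance (None/NULL) from a recorded absence (0)."""
    if value is None:
        return "not recorded"
    if value < 0:
        raise ValueError("abundance must be zero or positive")
    return "absence" if value == 0 else "present"


print(depth_in_metres(-250, "cm"))  # 2.5
print(describe_abundance(None))     # not recorded
print(describe_abundance(0))        # absence
```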

Quality control procedures for OBIS data - automated
Remember - technical:
  18 quality control steps, on individual record level
  10 outlier checks, on dataset or species level
  Each QC step = yes (1)/no (0) question
  Creation of a bit-sequence (2^(x-1)) => stored as an integer value for the QC => unique value for each possible combination
Some steps are not (yet) available through web services
  They will be identified once the data is in OBIS
  The harvest report will give an indication of possibly erroneous records
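Because the QC result is stored as one integer, filtering records on a required quality standard reduces to a bitwise comparison. The sketch below assumes records held as dictionaries with a `qc` key; that layout and the function names are illustrative only.

```python
def qc_mask(required_steps):
    """Build the integer that has the bit of every required QC step set."""
    mask = 0
    for step in required_steps:
        mask |= 1 << (step - 1)
    return mask


def filter_records(records, required_steps):
    """Keep the records that were evaluated positive for all required QC steps."""
    mask = qc_mask(required_steps)
    return [rec for rec in records if (rec["qc"] & mask) == mask]


# Hypothetical records with the QC totals 11 (steps 1, 2, 4) and 31 (steps 1 to 5).
records = [{"id": "a", "qc": 11}, {"id": "b", "qc": 31}]

print([rec["id"] for rec in filter_records(records, [1, 2])])  # ['a', 'b']
print([rec["id"] for rec in filter_records(records, [3])])     # ['b']
```

A user interested in, for example, seasonal analysis would choose the steps relevant to that purpose instead of asking whether a record is "good" or "bad".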

Geographic quality control – 3-dimensional check
Latitude and longitude are not 0
Latitude and longitude lie between -90/+90 and -180/+180
Latitude and longitude fall within a sea area (20 km buffer)
Depth value possible:
  Plot the corresponding latitude-longitude on the General Bathymetric Chart of the Oceans (GEBCO)
  Compare the GEBCO depth with the actual sampling depth
  Take into account a 100 m margin

  Taxon                      Given depth (m)   GEBCO depth (m)   Difference (m)   Remark
  Desmoscolex                2080              510               + 1570           Negative evaluation in QC
  Halieutichthys aculeatus   110               1140              - 1030           Needs to be looked at… Pancake batfish => usually bottom-dwelling…

Notes: this slide can be left out if necessary. We also check the depth: if the given depth deviates too strongly from the GEBCO depth, the record is flagged. People also need to think about it logically: this fish is bottom-dwelling, so the chance of actually finding it just below the surface is rather small.
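The range and depth checks can be sketched as below. Several points are assumptions made for this illustration: a position fails the zero check only when both values are exactly 0, the depth check fails only when the given depth exceeds the GEBCO depth by more than the margin, and the GEBCO depth is passed in as a number, because looking it up (like the sea-area check with its 20 km buffer) needs external data that is not reproduced here.

```python
def check_position(lat, lon):
    """Evaluate the zero and range checks for a latitude-longitude pair."""
    return {
        # Assumption: 0 ; 0 is treated as a placeholder for missing coordinates.
        "not_zero": not (lat == 0 and lon == 0),
        "in_range": -90 <= lat <= 90 and -180 <= lon <= 180,
    }


def depth_possible(given_depth_m, gebco_depth_m, margin_m=100):
    """Return True when the sampling depth does not exceed the sea floor plus margin."""
    return given_depth_m <= gebco_depth_m + margin_m


print(check_position(54.23, -16.5))  # {'not_zero': True, 'in_range': True}
print(depth_possible(2080, 510))     # False: deeper than the GEBCO sea floor
print(depth_possible(110, 1140))     # True, yet a bottom-dwelling fish this shallow still needs a look
```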

Quality control procedures for OBIS data - outlier analyses
Only performed on the OBIS database (= global coverage)
Geographic outliers – dataset level
  Analysis on dataset level
  Possible location outlier(s)
  Methodology based on centroid calculations and assuming a normal distribution => not applicable for strongly asymmetric datasets…
  Communication with the provider on the results
Map legend: centroid, no outlier, possible outlier
Dataset: “ICES Biological Community” (DOME)
  Some locations were also identified as incorrect in the record-level check of lat-lon (= land)
  Others were not identified through the record-level check of lat-lon (= sea), but were seen as potential outliers through the geographic outlier check
Provider communication:
  Antarctic locations are incorrect (data error)
  Northern locations are correct (sampling bias)
Notes: outliers at dataset level. Geographically, we look at which coordinates deviate statistically from the norm within the dataset. This can be difficult for datasets with a global spread. The outlier analyses are best combined with other QC steps (cf. the red circles). Such issues are communicated to the provider, in order to arrive at the correct solutions.
Source: Vandepitte et al. (2015). Fishing for data and sorting the catch […]. Database. DOI: 10.1093/database/bau125
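The slide only states that the method relies on centroid calculations and assumes a normal distribution. One possible reading is sketched below: compute the centroid of a dataset, measure each record's distance to it, and flag records lying more than `k` standard deviations beyond the mean distance. The plain averaging of coordinates, the haversine distance and the threshold `k = 3` are all assumptions of this sketch, not details taken from the OBIS procedure.

```python
import math
import statistics


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, a)))


def possible_outliers(points, k=3.0):
    """Flag (lat, lon) points unusually far from the dataset centroid."""
    if len(points) < 3:
        return []
    # Naive centroid: plain mean of the coordinates (unsuitable near the antimeridian).
    c_lat = statistics.fmean([lat for lat, _ in points])
    c_lon = statistics.fmean([lon for _, lon in points])
    dists = [haversine_km(c_lat, c_lon, lat, lon) for lat, lon in points]
    limit = statistics.fmean(dists) + k * statistics.stdev(dists)
    return [pt for pt, dist in zip(points, dists) if dist > limit]


# Invented example: 20 North Sea positions plus one Antarctic position.
points = [(51 + 0.01 * i, 3 + 0.01 * i) for i in range(20)] + [(-60.0, 3.0)]
print(possible_outliers(points))
```

As the slide warns, a rule like this breaks down for strongly asymmetric or globally spread datasets, so flagged records are candidates for discussion with the provider rather than confirmed errors.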

Environmental outliers – species level
Verruca stroemia (Crustacea: Cirripedia)
=> Check for outliers within the available distribution records of a species
=> Geography, depth, sea surface salinity (SSS), sea surface temperature (SST)
Map panels: Geography; Depth. Legend: centroid, no outlier, possible outlier
Dubious results => additional verification:
  World Register of Marine Species: literature & expert-based species distribution information => not in the Mediterranean, but present in the north
  Expert: ecological information => Mediterranean records are due to erroneous identifications in the field => depth range of the species varies between intertidal and 548 metres (available data only to 300 metres depth)
Notes: environmental outliers. Here geography, depth, SSS and SST are all taken into account. This map mainly shows that different analyses can give different results, and that users must apply common sense when using these data. QC flags are an aid for users to decide which data they do or do not use in their analyses. The ambiguous results (orange circles) were submitted to an expert: the Mediterranean data are errors, the northern data are OK.
Source: Vandepitte et al. (2015)

Questions?

Vandepitte L., Hernandez F., Claus S., Vanhoorne B., De Hauwere N., Deneudt K., Appeltans W. & Mees J. (2011). Analysing the content of the European Ocean Biogeographic Information System (EurOBIS): available data, limitations, prospects and a look at the future. Hydrobiologia 667(1): 1-14

Vandepitte L., Bosch S., Tyberghein L., Waumans F., Vanhoorne B., Hernandez F., De Clerck O. & Mees J. (2015). Finding what you need in a sea of data: Assessing the data quality, completeness and fitness for use of data in marine biogeographic databases. Database (2015): 1-14. doi: 10.1093/database/bau125