INCOFISH WP3 - Campinas, April 2006. WEB Tools and Data Cleaning. Alexandre Marino, Centro de Referência em Informação Ambiental (CRIA)
WEB Tools and Data Cleaning. These tools were developed within the scope of the speciesLink project, so in some cases they depend entirely on the architecture, the local database, and the libraries developed by CRIA. Data Cleaning started as an idea without a very clear direction and became a very specific system.
The speciesLink project was funded by FAPESP (São Paulo state agency) from October 2001 to October 2005.
Different data sources, software and systems (diagram): collections Col 1 to Col 5 run Win2000/Brahms, Linux/MySQL, Win98/Access, Win98/biota and FreeBSD/PostgreSQL. How can a single program and search interface reach all of them?
Protocol and Content Schema
- DiGIR protocol (Distributed Generic Information Retrieval): potential to be globally accepted
- DiGIR software (Java Portal & PHP Provider): collaborative development
- DarwinCore v.2: covers the basic content elements (taxonomic identification, location and date of collecting event)
System's Architecture (diagram)
- Presentation layer: speciesLink site; DiGIR Portal (Java); Perl
- Fast and stable connectivity (Collection A): Collection Management System, SQL, Data, PHP Provider
- Slow or unstable connectivity (Collections B and C): Collection Management System, SQL, Data Repository, Data SOAP client
- Mirror Server: SOAP Server, Postgres, SQL, PHP Provider
~40 connected collections ~ on-line records March/2006 JBRJ speciesLink network
WEB Tools: geoLoc, spOutlier, infoXY, conversor, speciesMapper, data cleaning
Tools: About geoLoc
- assists biological collections in geo-referencing their data
- the database includes approximately 110 thousand names of Brazilian localities, obtained from the Brazilian Institute of Geography and Statistics (IBGE), the GEOnet Names Server (GNS) and speciesLink/Fapesp
- algorithm based on concepts in the Egaz program (Shattuck 1997)
- capable of calculating a coordinate for a distance and direction
26 Noroeste-NW Campinas São Paulo
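Calculating a coordinate from a distance and direction, as geoLoc does, can be illustrated with the standard great-circle destination formula. This is only a sketch and not CRIA's implementation: the function name `offset_coordinate`, the spherical Earth radius, the distance unit (kilometres) and the approximate coordinates used for Campinas are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius of a spherical Earth

# Compass directions mapped to bearings in degrees clockwise from north.
BEARINGS = {"N": 0.0, "NE": 45.0, "E": 90.0, "SE": 135.0,
            "S": 180.0, "SW": 225.0, "W": 270.0, "NW": 315.0}


def offset_coordinate(lat, lon, distance_km, direction):
    """Return (lat, lon) in decimal degrees reached by travelling
    distance_km from (lat, lon) along the given compass direction."""
    bearing = math.radians(BEARINGS[direction])
    lat1 = math.radians(lat)
    lon1 = math.radians(lon)
    angular = distance_km / EARTH_RADIUS_KM  # angular distance in radians

    lat2 = math.asin(math.sin(lat1) * math.cos(angular)
                     + math.cos(lat1) * math.sin(angular) * math.cos(bearing))
    lon2 = lon1 + math.atan2(
        math.sin(bearing) * math.sin(angular) * math.cos(lat1),
        math.cos(angular) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lat2), math.degrees(lon2)


# Hypothetical example: a locality described as 26 km NW of Campinas.
# The Campinas coordinates are approximate and only illustrative.
campinas = (-22.9056, -47.0608)
print(offset_coordinate(*campinas, 26, "NW"))
```

A spherical model is usually adequate at the scale of a locality description; a production tool would also need to handle the datum of the source gazetteer.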
Tools: About spOutlier
- assists biological collections in identifying possibly suspect points in existing records
- uses techniques modified from Chapman 1999 to detect outliers in latitude, longitude and altitude
- allows users to indicate whether their data set is terrestrial or marine
- useful to biologists around the world who wish to identify possible errors in their data
1, , , 795
2, , , 805
3, , , 809
4, , , 815
5, , , 810
6, , , 790
7, , , 801
8, , , 700
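The idea of flagging suspect values can be sketched on the altitude column of the sample records above. spOutlier uses techniques modified from Chapman 1999, which are not reproduced here; as a stand-in, this sketch uses a different and simpler method, the modified z-score based on the median absolute deviation (MAD). The function name `suspect_indices` and the cut-off of 3.5 (a commonly used value) are assumptions.

```python
from statistics import median


def suspect_indices(values, threshold=3.5):
    """Return the positions (0-based) of values whose modified z-score,
    based on the median absolute deviation (MAD), exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread, nothing can be ranked as an outlier
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]


# Altitude values of records 1 to 8 from the sample data set above.
altitudes = [795, 805, 809, 815, 810, 790, 801, 700]
record_ids = [1, 2, 3, 4, 5, 6, 7, 8]

flagged = [record_ids[i] for i in suspect_indices(altitudes)]
print(flagged)
```

With these values only record 8 (altitude 700) is flagged. As in the tool itself, the result is a list of points to review, not a verdict that the record is wrong.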
marine
1, , , , aus, , , , , , id_teste, -45, -22
6, , , 71.37, eua, , , , , , , ,
Input/Output:
-degrees, min, sec
-decimal degrees
-UTM
DATUM:
-WGS84 (World)
-SAD69 (Brazil)
-Córrego Alegre (SP)
Example input (degrees, min, sec):
, , , d34'47"W, 52d3'47"N
34d19'23"E, 67d59'0"N
44d59'58"W, 21d59'58"S
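The simplest of these conversions, from degrees, minutes and seconds to decimal degrees, can be sketched as follows. The notation (`44d59'58"W`) is taken from the example input; the function name `dms_to_decimal`, the regular expression and the sign convention (South and West negative) are assumptions, and neither UTM nor datum conversion is covered.

```python
import re

# Pattern for strings such as 44d59'58"W:
# degrees, "d", minutes, "'", seconds, '"', hemisphere letter.
DMS_PATTERN = re.compile(r"""^(\d+)d(\d+)'(\d+(?:\.\d+)?)"([NSEW])$""")


def dms_to_decimal(text):
    """Convert a coordinate like 44d59'58"W to signed decimal degrees.
    South and West are returned as negative values."""
    match = DMS_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    return -value if hemisphere in "SW" else value


for coordinate in ['44d59\'58"W', '21d59\'58"S', '34d19\'23"E', '67d59\'0"N']:
    print(coordinate, "->", round(dms_to_decimal(coordinate), 6))
```

An incomplete value such as `d34'47"W` raises a `ValueError` in this sketch instead of being guessed.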
Plot georeferenced points on a map.
Available layers:
-World
-South and Central America
-Brazil
-São Paulo State
Trachurus trachurus Pteroscion pele Gaidropsarus biscayensis
Using Data (diagram): PostgreSQL; spOutlier and geoLoc; SOAP Web service (job1, job2); Maps (PostGIS)
Tools: About Data Cleaning
- aims to help curators identify possible errors and standardize data
- records are not modified
- the system just presents "suspect" records
How Data Cleaning Works (diagram)
- National collections (Col 1, Col 2, Col 3, ..., Col n) and international collections (Col 1, Col 2, ...)
- speciesLink Portal (Java)
- Detect Suspect Records (Perl)
- Local Database (PostgreSQL): dc_tax, dc_geo; Tables of Suspect Records
- Web: chart.pm (Perl)
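The principle that records are only flagged and never modified can be sketched as a read-only pass that builds a separate list of suspect records. This is not the Data Cleaning system's actual logic: the function name `find_suspect_records`, the record field names and the coordinate-range check are hypothetical and chosen only to keep the example short.

```python
def find_suspect_records(records):
    """Return a list of (record_id, reason) pairs for records that look
    suspect. The input records are only read, never modified."""
    suspects = []
    for record in records:
        lat, lon = record.get("latitude"), record.get("longitude")
        if lat is None or lon is None:
            continue  # nothing to check without coordinates
        if not -90 <= lat <= 90:
            suspects.append((record["id"], "latitude out of range"))
        if not -180 <= lon <= 180:
            suspects.append((record["id"], "longitude out of range"))
    return suspects


# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": "A-1", "latitude": -22.9, "longitude": -47.1},
    {"id": "A-2", "latitude": -122.9, "longitude": -47.1},
    {"id": "A-3", "latitude": None, "longitude": None},
]
print(find_suspect_records(records))
```

Keeping the suspects in a separate structure leaves the decision to correct a record with the curator of the collection.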
Demonstration on-line
Thank you! Obrigado!