Not our data, but we use it in research Wietse Dol, LEI-WUR 6 October 2014.



Wietse Dol
▪ PhD Econometrics
▪ 10 years University of Groningen (econometrics, sampling theory)
▪ 21 years LEI (many different departments)
▪ Data and models, i.e. use/reuse and quality; troubleshooter + statistical methods + ICT + user interfacing
▪ Not an IT specialist but a researcher (I build software because I use it myself)
▪ Many model projects and user interfaces for models (not only LEI)
▪ Since 2006: data, data quality ≡ MetaBase

LEI: Agricultural Economic Research Institute
▪ Part of Wageningen University & Research center (WUR)
▪ Part of the Social Science Group (SSG) within the WUR
▪ We are the research part of WUR/SSG (advising the Ministry of Economic Affairs) in The Hague
▪ Consultancy (applied research): ministries, EU, local government, industry, …
▪ Collecting data (farm data: FADN), building models, and agricultural content specialists

University vs. Research center
▪ University: teaching, publications, new theory and technology
▪ Research center:
  ● applied work/consultancy
  ● reusing things from the past (e.g. yearly publications)
  ● sharing knowledge (how to become a content specialist)/teaching for small groups
  ● working in groups (different disciplines)
  ● working in (inter)national groups with many different disciplines
Research centers have experience in data management.

Primary vs. Secondary research data
Research data: collected, observed, or created for the purpose of analysis, to produce and validate original research results.
▪ Primary data: you collect it yourself, targeted to answer/validate your questions.
▪ Secondary data: not yours, e.g. from a website.
There is more and more need for secondary data (primary data is expensive and takes a lot of time to collect).
Quality of data: meta-information and versioning are crucial.

Production data
Meta-information: source, version, dimension, definitions, etc. Without proper information you use the wrong data.
▪ Is FR with or without DOM?
▪ Is the production in tons or in euros?
▪ Does the year start on 1-1 and end on 31-12?
▪ What is the definition of tomato?
▪ Owner of the data/version of the data/conditions of use…

Product   Country   Year   Production
Tomato    NL
Wheat     BE
Sugar     FR
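One way to keep such meta-information attached to every value is to store it in the same record as the number. The sketch below is only an illustration: the class names, field names and all values are hypothetical placeholders, not MetaBase's data model and not real statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetaInfo:
    """Meta-information for one series; every field name here is hypothetical."""
    source: str              # owner/publisher of the data
    version: str             # release of the source that was downloaded
    unit: str                # e.g. "tonnes" or "EUR"
    territory_note: str      # e.g. whether FR includes the DOM
    period: str              # what a "year" covers
    product_definition: str  # what counts as the product
    conditions_of_use: str   # licence or usage restrictions


@dataclass(frozen=True)
class Observation:
    product: str
    country: str
    year: int
    production: float
    meta: MetaInfo


def describe(obs: Observation) -> str:
    """Render a value together with the information needed to interpret it."""
    return (
        f"{obs.product} {obs.country} {obs.year}: "
        f"{obs.production} {obs.meta.unit} "
        f"(source {obs.meta.source}, version {obs.meta.version})"
    )


# Placeholder values for illustration only, not real statistics.
example = Observation(
    product="Tomato",
    country="NL",
    year=2000,
    production=1.0,
    meta=MetaInfo(
        source="EXAMPLE-SOURCE",
        version="2014-10",
        unit="tonnes",
        territory_note="not applicable",
        period="1-1 to 31-12",
        product_definition="fresh tomatoes (placeholder definition)",
        conditions_of_use="placeholder",
    ),
)
print(describe(example))
```

Because the unit, source and version are printed with the number, a reader cannot pick up the value without also seeing how to interpret it.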

Lifecycle Model of data

Lifecycle Model: Data
▪ Use data
▪ How to get the data, filter it and store it
▪ Inspection and quality checks on the data
▪ How to make it available for others
▪ What scientific actions are done on the data
▪ Curate, preserve, versions, …
Diagram labels: Everybody, Not often, Seldom
Don’t do it alone: do it as a GROUP and communicate.

Types of databases according to MetaBase
▪ Statistical database
▪ Scientific database
▪ Meta-database

Statistical database: secondary data
Databases provided by international organizations like the EU, FAO, OECD and World Bank are in general statistical databases:
  ● Good web interfaces for downloading data
  ● Data are stored as they are received
  ● Data are consistent in their own domain
  ● No aggregations are made when underlying data are missing
  ● Not much attention for data checking
  ● No versioning system (data changes

Scientific versus Statistical database
▪ Problems with a statistical database:
  ● Different definitions of territories and commodities
  ● Typing errors
  ● Missing data
  ● Breaks in series
▪ Scientific database:
  ● Problems solved
  ● Transparency (original data sources and underlying assumptions are kept)
  ● Versioning of the data
  ● Essential for modeling and research

Structural design of a scientific database
▪ Key words for structural design (HarDFACTS project, IPTS 2007, done by vTI/LEI):
  ● Transparent
  ● Harmonised
  ● Complete
  ● Consistent
HarDFACTS: Harmonised Database for Agricultural Commodity Time Series
=> The amount of effort/cost scares institutes, but it is often a “hidden” cost.

Transparent
▪ Original data from the statistical database are stored
▪ Complete and consistent data are stored
▪ Original and completed data can be compared
▪ Calculation procedures are stored and can be repeated (scripting language)

Harmonised
The definition used here is to bring together the different international databases in one framework and to link the data through a unique coding system (keywords are classifications and tree structures, super-classifications).
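Linking sources through a unique coding system can be sketched as a concordance table that maps each (source, code) pair to one common code. The source names, codes and the `harmonise` helper below are hypothetical; the slides say MetaBase implements concordances in GAMS, so this Python version is only an illustration of the idea.

```python
# Hypothetical concordance: (source database, source code) -> common code.
CONCORDANCE = {
    ("SOURCE_A", "TOM"): "TOMATO",
    ("SOURCE_B", "0702"): "TOMATO",
    ("SOURCE_A", "WHT"): "WHEAT",
}


def harmonise(records):
    """Attach a common code to each record and keep the original fields.

    Returns (harmonised, unmapped). Unmapped records are returned rather
    than dropped, so that gaps in the concordance stay visible.
    """
    harmonised = []
    unmapped = []
    for rec in records:
        key = (rec["source"], rec["code"])
        if key in CONCORDANCE:
            # Original source and code are kept next to the common code.
            harmonised.append({**rec, "common_code": CONCORDANCE[key]})
        else:
            unmapped.append(rec)
    return harmonised, unmapped


# Placeholder records, not real statistics.
records = [
    {"source": "SOURCE_A", "code": "TOM", "value": 1.0},
    {"source": "SOURCE_B", "code": "0702", "value": 2.0},
    {"source": "SOURCE_B", "code": "9999", "value": 3.0},
]
linked, missing = harmonise(records)
print(len(linked), "linked,", len(missing), "without a concordance entry")
```

Keeping the original source and code in each harmonised record is what makes the result transparent: the common code can always be traced back.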

Complete
The definition used in MetaBase is that an econometric procedure is proposed to complete the new (time) series in the database (especially needed for models).

Consistent
The definition used here is that the interrelationship of the data in the database holds over classifications (time, territories and variables).
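Both ideas can be illustrated in a few lines. The sketch below uses plain linear interpolation as a stand-in for the econometric procedure (which the slides do not specify) and flags every filled value so that original and completed data stay distinguishable. The consistency check assumes the simplest case, an aggregate that must equal the sum of its parts. Function names and numbers are hypothetical.

```python
def complete_series(series):
    """Fill interior gaps (None) by linear interpolation.

    Returns (values, filled), where filled[i] is True for estimated values.
    Leading and trailing gaps are left as None.
    """
    values = list(series)
    filled = [False] * len(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            values[i] = values[left] + step * (i - left)
            filled[i] = True
    return values, filled


def is_consistent(total, parts, tolerance=1e-9):
    """True when an aggregate equals the sum of its parts (within tolerance)."""
    return abs(total - sum(parts)) <= tolerance


# Placeholder series, not real statistics.
values, filled = complete_series([10.0, None, None, 16.0, None])
print(values)   # [10.0, 12.0, 14.0, 16.0, None]
print(filled)   # [False, True, True, False, False]
print(is_consistent(30.0, [10.0, 20.0]))  # True
```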

Versioning of your research
Main reason for versioning: reproducibility
▪ The software you use changes: software versions
▪ Data changes/is updated/is corrected: data versions
▪ You discover errors in your research process or you improve the procedure: model versions
▪ Best advice: do not use a spreadsheet but software with a scripting language (SQL, R, GAMS, …) and store data in a database (with a good data model). The scripts then document how the original data was transformed into the data of your research.
▪ Store data and scripts in a version control system, e.g. SVN with a client like TortoiseSVN
▪ Do it as a group and (re)use others’ results.

Versioning 2
▪ Try to separate model (script) from data
▪ Make generic scripts when possible (re-use)
▪ Store script and data in separate SVN repositories
▪ Add meta-information to data as well as your scripts
▪ I.e. register versions of the software you use
▪ Test if your data and code also run on other computers
Example: Outlier testing in MetaBase
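Registering software versions next to the data can be automated. The sketch below builds a small version record with the Python version, the platform and a SHA-256 fingerprint of the data; the function names, the record layout and the script name are assumptions made for illustration, not part of MetaBase or SVN.

```python
import hashlib
import json
import platform
import sys


def data_fingerprint(rows):
    """SHA-256 over a canonical JSON dump, so identical data gives an identical hash."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def version_record(rows, script_name, script_version):
    """Meta-information to store alongside a result (hypothetical layout)."""
    return {
        "script": script_name,
        "script_version": script_version,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "data_sha256": data_fingerprint(rows),
    }


# Placeholder rows, not real statistics.
rows = [{"country": "NL", "year": 2000, "value": 1.0}]
print(json.dumps(version_record(rows, "clean_data.py", "r1"), indent=2))
```

Committing such a record together with the script makes it possible to check later whether a rerun used the same data and the same software.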

Land under permanent crop in Spain by Eurostat
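How MetaBase tests for outliers is not described in these slides. As a rough sketch of what a simple check on a yearly series could look like, the code below flags values using the modified z-score (median and median absolute deviation, with the commonly used 3.5 cut-off). The function name is hypothetical and the numbers are invented for illustration; they are not Eurostat figures.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices whose modified z-score exceeds the threshold."""
    centre = median(values)
    deviations = [abs(v - centre) for v in values]
    mad = median(deviations)
    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        return []
    return [i for i, d in enumerate(deviations) if 0.6745 * d / mad > threshold]


# Invented series with one suspicious jump.
series = [100, 102, 101, 103, 250, 104, 102]
print(mad_outliers(series))  # [4]
```

A check on levels like this one is crude for series with a strong trend; applying it to year-on-year changes is one possible refinement.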

Versioning 3
▪ Versioning looks time-consuming, but when you make mistakes it is easy to go back to an old situation. It is also a good first step in sharing data etc. It works very well in groups.
▪ Easy to see differences between versions.
▪ Versioning makes it possible to reproduce research, also in 5 years’ time.
▪ Frequency of versioning: some make a version every day. Practical advice: make a version when you have a publication.

MetaBase: data management for data

MetaBase
1. Many different data sources (e.g. FAO, Eurostat), all in the same user interface (SDMX, NetCDF)
2. Find data alternatives using meta-information
3. Search data content (e.g. oilseed)
4. All content easily available in research software
5. Recodings, aggregations and concordances are all implemented in GAMS
6. Statistical methods in GAMS and R
7. Versioning: Eurostat (monthly), FAO (twice per year)
8. Example:

Always play with your data and communicate
Wishes, problems, requests: