Dissemination Databases


Dissemination Databases: Meta data in Statistical Output Databases. Edwin de Jonge, Statistics Netherlands

Contents: Dissemination databases (what, why, how?); Cube model (model, advantages / disadvantages); Output meta data (purpose and types); Meta data issues (editorial, linguistic, time dependency, coordination)

Publication on Web: The Internet is the primary output channel of statistical offices. Web site(s) contain documents describing published statistics and data files (e.g. Excel sheets). Document centric! A user views/downloads a document; metadata are “document” properties. But increasingly web sites contain online output databases. Data centric! A user selects/views/downloads data; metadata are “data” properties.

Dissemination Metadata: Purpose: descriptive, to explain the meaning of the data. But also typical for dissemination: findability. How can data be found? Navigation and search engines need metadata. We will address this issue later.

Dissemination database? An online database containing published statistical output data, ideally all data ever published. A user can: search and select from the database and compose a table; view the table, make a chart or download the table. It can contain a large quantity of statistical numbers, e.g. StatLine (Statistics Netherlands) contains over 500 million facts.

Output data features: Output data = data? Yes, but: output data is macro data, and output data contains special data values: statistically disclosed, not present, not possible, unknown.
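One way to keep these special values apart from real numbers is to let each cell hold either a number or an explicit marker. The following is a minimal Python sketch; the names `Special`, `SYMBOLS` and `render` and the display symbols are illustrative assumptions, not the conventions of any particular output database.

```python
from enum import Enum
from typing import Union


class Special(Enum):
    """The four special cell values named on the slide."""
    DISCLOSED = "statistically disclosed"
    NOT_PRESENT = "not present"
    NOT_POSSIBLE = "not possible"
    UNKNOWN = "unknown"


# Hypothetical display symbols; real systems define their own conventions.
SYMBOLS = {
    Special.DISCLOSED: "x",
    Special.NOT_PRESENT: ".",
    Special.NOT_POSSIBLE: "-",
    Special.UNKNOWN: "?",
}

Cell = Union[int, float, Special]


def render(cell: Cell) -> str:
    """Format a cell for display: a symbol for special values, else a number."""
    if isinstance(cell, Special):
        return SYMBOLS[cell]
    return f"{cell:,.0f}"


print(render(371858))
print(render(Special.UNKNOWN))
```

Keeping the marker in the cell itself, rather than using a magic number such as -1, means a special value cannot be summed or averaged by accident.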

Output data features (2): Output data can have a status / versions: provisional, definitive, revised. Ideally old versions are still available; currently no system supports that feature. Output data is structured in a cube.

Cube Model: A dissemination database is a collection of cubes. Cube = multidimensional table. Some very similar cube models: OLAP, Sundgren, SDMX, others. Cube characteristics: describes features of a population; has dimensions; contains facts (values).

From table to cube: Example: on January 1st 2009 the male population in Amsterdam was 371,858. This fact can be dissected into features, dimensions and facts.

[Figure: cube “Inhabitants” with axes Sex, Period and Region; highlighted cell: Male, 2007, value 371,858]

OLAP Cube Model: Developed for data warehouses. Subject / measure: number of inhabitants (population). Dimensions of the (hyper)cube: Sex (Total, Male, Female), Region (e.g. Amsterdam), Period (e.g. January 1st 2007). The total cube has: Subjects (31) x Sex (3) x Region (1250) x Period (50) = 5.8 million cells!
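The cell count and the idea of a fact addressed by its dimension values can be sketched in a few lines of Python. The dimension sizes and the example fact come from the slides; the dictionary layout and the names `DIMENSIONS`, `cube` and `select` are assumptions made for illustration.

```python
from math import prod

# Dimension sizes as given on the slide.
sizes = {"subject": 31, "sex": 3, "region": 1250, "period": 50}
total_cells = prod(sizes.values())
print(total_cells)  # 5812500, i.e. about 5.8 million

# A cube as a mapping from one value per dimension to a fact.
DIMENSIONS = ("subject", "sex", "region", "period")
cube = {
    ("Inhabitants", "Male", "Amsterdam", "2009-01-01"): 371_858,
}


def select(cube, **criteria):
    """Return the cells whose dimension values match all given criteria."""
    index = {name: i for i, name in enumerate(DIMENSIONS)}
    return {
        key: value
        for key, value in cube.items()
        if all(key[index[dim]] == wanted for dim, wanted in criteria.items())
    }


print(select(cube, sex="Male", region="Amsterdam"))
```

Selecting on a dimension value, as `select` does, is the subpopulation selection (drill down) that the cube model makes possible.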

OLAP Cube Model (2): A cube has: Measures / subjects: aggregated quantitative variables, e.g. average age, number of inhabitants, total import. Dimensions: classifying variables, subdividing the population, e.g. Sex, NACE, Place of Birth. Values in a dimension are classification items: Male/Female, Amsterdam/London. The cube model is also used in OLAP tools.

Sundgren output model: Developed by Statistics Sweden (Sundgren). Another formulation of the cube model: α: population determining attributes, e.g. Dutch residents; β: aggregated variables (= measures), e.g. number of inhabitants; γ: classifying variables (= dimensions), e.g. region, sex; τ: time variable. Special role for time!

Cube Model Pro: Advantages: Top down view on variables of a population. Dimensions make it possible to select subpopulations (drill down). A cube is a large coherent dataset. Easy container for publishing data.

Cube Model Con: Disadvantages: More dimensions means more empty crossings, e.g. Inbound shipping: 7 dimensions -> 300 million combinations, with 1 million data cells (about 0.3%). Data in multiple cubes are not easily combined. For subject areas careful cube design is necessary. Art of Cubism (Willeboordse): minimize the number of cubes; minimize the number of empty cells; create core and satellite cubes.
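The empty-crossings problem is easy to quantify. The sketch below, in Python, computes the share of filled cells for a small invented cube and for the inbound shipping figures quoted on the slide; the toy dimensions and the name `fill_ratio` are hypothetical.

```python
from math import prod


def fill_ratio(cells: dict, dimensions: dict) -> float:
    """Share of possible dimension crossings that actually hold a value."""
    possible = prod(len(categories) for categories in dimensions.values())
    return len(cells) / possible


# Hypothetical toy cube: 3 x 2 x 2 = 12 possible crossings, 3 of them filled.
dimensions = {
    "flag": ["NL", "DE", "UK"],
    "cargo": ["bulk", "container"],
    "period": ["2008", "2009"],
}
cells = {
    ("NL", "bulk", "2008"): 10,
    ("NL", "container", "2009"): 7,
    ("DE", "bulk", "2009"): 4,
}
print(fill_ratio(cells, dimensions))  # 0.25

# Figures from the slide: 1 million data cells in 300 million combinations.
slide_ratio = 1_000_000 / 300_000_000
print(f"{slide_ratio:.2%}")  # 0.33%
```

Storing only the filled crossings, as the dictionary does, avoids materialising the empty ones, but users who browse the cube still meet them, which is why the design advice is to minimize empty cells.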

Output database software: Commercial: Beyond2020, SuperWeb. “Home brew”: PXWeb / PC-Axis (semi-commercial; Sweden, Denmark, Norway and others), OECDStat (OECD), StatLine (Netherlands), Genesis (Germany), many others.

Cube Metadata: A cube has many metadata items: variable names and descriptions; methodological description; footnotes; dimension names and descriptions; category names and descriptions. How can we structure these?

Dissemination Metadata: Remember? Purpose: descriptive, to explain the meaning of the data. But also typical for dissemination: findability. Types: data related (detailed); variable related (detailed); publication metadata (dissemination!).

Data related metadata: Data production metadata. Can be: description of the source of the data; description of the methodology used to produce the data; description of trends/anomalies in the dataset; status of the data (provisional/revised/etc). Implemented as: a description for the whole cube; a footnote attached to a single data cell of a cube. Data related metadata is mainly descriptive.

Variable related metadata: Metadata of the variables used in a cube: name and description of the variable; aggregation method used; unit of the variable (1,000 euro, kg etc.); name and description of the classification; name and description of the classification items (categories). Some databases support hierarchical relations within classifications, e.g. regional classifications. Variable related metadata is partly descriptive, but names are also useful for findability.

Publication metadata: Metadata of a cube related to publishing. Typical for dissemination. Many of these are Dublin Core (dc) metadata (or variants). Dublin Core: open standard for document metadata (on the web). Describe the cube as a document! Not only for human consumption but very often used by search engines. Publication metadata is mainly used for findability.

Publication metadata (2): Title (dc), Author (dc), Created (dc), Modified (dc), Source (dc), Description (dc), Summary, Published (dc), Spatial (dc), Spatial scope, Temporal (dc), Reporting period, Subject (dc), Frequency, Language, Subject Area, Statistical theme
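Describing the cube as a document can be illustrated by emitting Dublin Core terms as HTML meta elements, the form in which search engines typically encounter them. This Python sketch assumes the `DCTERMS.` naming convention for meta elements and uses invented metadata values; note that the Dublin Core term for an author is `creator`. The function name `dc_meta_tags` is hypothetical.

```python
from html import escape

# Hypothetical values for one cube; only the term names are Dublin Core.
cube_metadata = {
    "title": "Population by sex, region and period",
    "creator": "Statistics Netherlands",
    "modified": "2009-06-01",
    "language": "en",
    "description": 'Inhabitants on 1 January, "provisional" figures included',
}


def dc_meta_tags(metadata: dict) -> str:
    """Render Dublin Core terms as HTML <meta> elements, one per line."""
    tags = ['<link rel="schema.DCTERMS" href="http://purl.org/dc/terms/">']
    for term, value in metadata.items():
        content = escape(str(value), quote=True)  # protect quotes and < > &
        tags.append(f'<meta name="DCTERMS.{term}" content="{content}">')
    return "\n".join(tags)


print(dc_meta_tags(cube_metadata))
```

The same dictionary could feed the page shown to users and the markup read by search engines, so the two stay in step.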

Dissemination meta data issues: Meta data is never without difficulties: editorial, linguistic, coordination, time dependency.

Editorial problems: Many cubes use jargon or ambiguous or difficult language. Many cubes are prepared by technically skilled people who are less skilled in writing for the general public. But a cube is a publication medium! Assign an editor to each cube. Choose terms carefully and use clear language. Don’t put category definitions in the border of a table; use a clear term there and put the definition in a footnote.

Linguistic problems: Findability problem: users often search with a synonym or hypernym and find nothing. Synonym: “job” vs “occupation”, “business” vs “enterprise”. Hyper/hyponym: “vehicle” vs “car” vs “SUV”. Search should deal with synonyms in an understandable way. Most cube systems support only one language; translation to a different language results in a copy of the database, which creates synchronization problems.
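A simple way to handle synonyms and hypernyms in search is to expand the query before matching. The Python sketch below uses the word pairs from the slide; the cube titles, the data structures and the names `expand` and `search` are illustrative assumptions, and real search engines use richer matching.

```python
# Word relations taken from the slide; stored in lower case.
SYNONYM_GROUPS = [{"job", "occupation"}, {"business", "enterprise"}]
NARROWER = {"vehicle": {"car"}, "car": {"suv"}}  # hypernym -> hyponyms


def expand(term: str) -> set:
    """Return the term plus its synonyms and all narrower terms."""
    term = term.lower()
    terms = {term}
    for group in SYNONYM_GROUPS:
        if term in group:
            terms |= group
    stack = list(terms)
    while stack:  # follow hyponyms transitively (vehicle -> car -> suv)
        current = stack.pop()
        for narrower in NARROWER.get(current, ()):
            if narrower not in terms:
                terms.add(narrower)
                stack.append(narrower)
    return terms


def search(query: str, titles: list) -> list:
    """Return the titles containing the query or one of its expansions."""
    terms = expand(query)
    return [title for title in titles if terms & set(title.lower().split())]


# Hypothetical cube titles.
titles = ["Occupation by sex", "Car ownership by region", "Enterprise births"]
print(search("job", titles))      # ['Occupation by sex']
print(search("vehicle", titles))  # ['Car ownership by region']
```

Showing the user which expanded term produced a hit is one way to keep the behaviour understandable.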

Coordination problems: Within a cube, metadata can be clearly defined. Across cubes within a dissemination database this is more difficult: variables, classifications, dimensions and measures need to be managed centrally. Cubes don’t own their variable related metadata anymore; they share this metadata with other cubes. Synonym problem: depending on context, a different term may/must be used for identical metadata (a real problem!). Many cubes contain a small variation on a standard classification (but not the standard itself). Problem of homonyms: words written identically have different meanings.
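Central management with context-dependent terms can be sketched as a shared registry that cubes reference by identifier, with an optional local label per cube. This is a minimal Python illustration; the registry contents and the names `Variable`, `REGISTRY` and `Cube` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Variable:
    """A centrally managed variable definition shared by all cubes."""
    id: str
    name: str
    description: str


# Hypothetical central registry.
REGISTRY = {
    "enterprise": Variable("enterprise", "Enterprise", "Unit producing goods or services"),
    "sex": Variable("sex", "Sex", "Sex of the person"),
}


@dataclass
class Cube:
    """A cube references shared variables and may override their labels."""
    title: str
    variable_ids: list
    local_labels: dict = field(default_factory=dict)

    def label(self, variable_id: str) -> str:
        """Use the cube's own term if it has one, else the central name."""
        return self.local_labels.get(variable_id, REGISTRY[variable_id].name)


starters = Cube("Business starters", ["enterprise", "sex"], {"enterprise": "Business"})
print(starters.label("enterprise"))  # Business
print(starters.label("sex"))         # Sex
```

Because the local label is attached to the reference and not to the definition, two cubes can use different words for the same variable while the system still knows they are identical.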

Coordination problems (2): Within an NSI, coordinating the total website plus the dissemination database is also problematic: the web site and the dissemination db should share a common glossary / search / metadata system. ISTAT (Italy) is developing a system that addresses many of these problems. Across organisations (Eurostat/ISI/NSI) it is even more difficult: a centralized metadata model does not work. Maybe a federated metadata model (a combination of decentralized and centralized). Other option: use Semantic Web technology for sharing and publishing metadata.

Time dependency: Definitions of variables may change. Almost all cubes have a time dimension. If a measure changes, a new measure is added to the cube. If a dimension changes, new categories are added to the cube. The problem is that a changed dimension is now dependent on the selection in the time dimension (regions, for example), which leads to many empty cells. Currently no dissemination db addresses this issue.
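One conceivable approach, sketched here in Python, is to give each category a validity interval so that the categories offered depend on the selected period. The slides do not describe an implementation; the merged-municipality example and the names `Category`, `REGIONS` and `categories_for` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Category:
    """A classification item that is valid only during a range of years."""
    code: str
    label: str
    valid_from: int
    valid_to: Optional[int] = None  # None means still valid

    def valid_in(self, year: int) -> bool:
        return self.valid_from <= year and (
            self.valid_to is None or year <= self.valid_to
        )


# Hypothetical regional classification: A and B merge into AB in 2010.
REGIONS = [
    Category("A", "Municipality A", 1990, 2009),
    Category("B", "Municipality B", 1990, 2009),
    Category("AB", "Municipality AB", 2010),
]


def categories_for(dimension: list, year: int) -> list:
    """Return the codes of the categories that exist in the given year."""
    return [category.code for category in dimension if category.valid_in(year)]


print(categories_for(REGIONS, 2009))  # ['A', 'B']
print(categories_for(REGIONS, 2010))  # ['AB']
```

Without such intervals the cube holds all three regions for every period, and the crossings of A and B with 2010, or of AB with 2009, are the empty cells the slide refers to.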

Questions?