Dissemination Databases

1 Dissemination Databases
Meta data in Statistical Output Databases Edwin de Jonge, Statistics Netherlands

2 Contents
Dissemination databases: what, why, how? Cube model: model, advantages / disadvantages. Output meta data: purpose and types. Meta data issues: editorial, linguistic, time-dependency, coordination.

3 Publication on Web The Internet is the primary output channel of statistical offices: Web site(s), containing documents describing published statistics. Data files (e.g. Excel sheets) Document centric! A user views/downloads a document Metadata are “document” properties But increasingly they contain: online output databases Data centric! A user selects/views/downloads data Metadata are “data” properties

4 Dissemination Metadata
Purpose: Descriptive, to explain meaning of the data But also typical for dissemination: findability How can data be found? Navigation Search engines need metadata… We will address this issue later.

5 Dissemination database?
Online database containing published statistical output data, ideally all data ever published A user can: Search and select from database and compose a table View, make a chart or download table Can contain large quantity of statistical numbers E.g. StatLine (Statistics Netherlands) contains over 500 million facts

6 Output data features Output Data = Data Yes, but:
Output data is macro data Output data contains special data values: Statistically disclosed Not present Not possible Unknown
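
The special values listed above could be modelled as an explicit cell status next to the number. This is a minimal Python sketch, not the data model of any particular system: the names `CellStatus`, `Cell` and `format_cell` are hypothetical, and "statistically disclosed" is read here as a value withheld for disclosure control, which is an interpretation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CellStatus(Enum):
    """Hypothetical status codes for one output cell."""
    VALUE = "value"                # an ordinary published number
    SUPPRESSED = "suppressed"      # assumed reading of "statistically disclosed"
    NOT_PRESENT = "not present"
    NOT_POSSIBLE = "not possible"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Cell:
    status: CellStatus
    value: Optional[float] = None


def format_cell(cell: Cell) -> str:
    """Render a cell; special values show their status instead of a number."""
    if cell.status is CellStatus.VALUE:
        return f"{cell.value:,.0f}"
    return f"({cell.status.value})"


print(format_cell(Cell(CellStatus.VALUE, 371858)))
print(format_cell(Cell(CellStatus.UNKNOWN)))
```

Keeping the status separate from the number avoids overloading values such as 0 or -1 with a special meaning.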

7 Output data features (2)
Output Data can have status / versions Provisional Definitive Revised Ideally old versions are still available Currently no system supports that feature Output data is structured in a Cube
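
Keeping old versions available could look roughly like the sketch below: every published value is stored with its status and publication date, and a lookup returns the version that was current on a given day. The names `Version` and `as_of` are hypothetical and the figures are invented for illustration only.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    published: date
    status: str      # "provisional", "definitive" or "revised"
    value: float


def as_of(history: List[Version], day: date) -> Optional[Version]:
    """Return the version that was current on the given day, if any."""
    known = [v for v in history if v.published <= day]
    return max(known, key=lambda v: v.published) if known else None


# Invented figures for a single cell, for illustration only.
history = [
    Version(date(2009, 2, 1), "provisional", 100.0),
    Version(date(2009, 9, 1), "definitive", 102.0),
]

print(as_of(history, date(2009, 3, 1)))
print(as_of(history, date(2010, 1, 1)))
```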

8 Cube Model Dissemination database is collection of cubes
Cube = multidimensional table Some very similar cube models: OLAP, Sundgren, SDMX, others Cube characteristics: Describes features of a population Has dimensions. Contains facts (values)

9 From table to cube Example:
On Jan 1st 2009 the male population in Amsterdam was 371,858 This fact can be dissected into features, dimensions and facts

10 Cube: Inhabitants
[Figure: cube with axes Sex, Period and Region; the cell at Sex = Male, Period = 2007 holds the value 371,858]

11 OLAP Cube Model Developed for DataWareHouses Subject / Measure:
Number of inhabitants (population) Dimensions of (Hyper)Cube: Sex (Total, Male, Female) Region (e.g. Amsterdam) Period (e.g. January 1st 2007) Total cube has: Subjects (31) x Sex (3) x Region (1250) x Period (50) = 5.8 million cells!
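
The cell count can be checked with a few lines of Python. The sketch also shows one simple way to hold a fact: a mapping from one coordinate per dimension to the measured value. The dictionary layout and the date key format are assumptions made for illustration, not how any particular OLAP tool stores data.

```python
from math import prod

# Dimension sizes as given on the slide.
sizes = {"subject": 31, "sex": 3, "region": 1250, "period": 50}
total_cells = prod(sizes.values())
print(f"{total_cells:,}")  # 5,812,500

# One fact: the key holds one coordinate per dimension, the value is the measure.
key = ("Number of inhabitants", "Male", "Amsterdam", "2007-01-01")
cube = {key: 371_858}
print(cube[key])  # 371858
```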

12 OLAP Cube Model (2) Cube has:
Measures / subjects: Aggregated quantitative variable E.g. Average age, Number of inhabitants, Total import Dimensions: Classifying variables, subdividing population. E.g. Sex, NACE, Place of Birth Values in a dimension are classification items Male/Female, Amsterdam/London. Cube model also used in OLAP tools

13 Sundgren output model Developed by Statistics Sweden (Sundgren)
Other formulation of cube model: α : population determining attributes E.g. Dutch residents β : aggregated variables (= measures), number of inhabitants γ : classifying variables (= dimensions) Region, sex τ : time variable Special role for time!
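
The four parts of this formulation map naturally onto a small record type. The sketch below is only an illustration: the class name `SundgrenTable` and the method `olap_dimensions` are hypothetical, and the example values are the ones from the slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SundgrenTable:
    alpha: str                 # population-determining attributes
    beta: Tuple[str, ...]      # aggregated variables (measures)
    gamma: Tuple[str, ...]     # classifying variables (dimensions)
    tau: str                   # time variable, kept separate from gamma

    def olap_dimensions(self) -> Tuple[str, ...]:
        """In the OLAP view, time is just one more dimension."""
        return self.gamma + (self.tau,)


table = SundgrenTable(
    alpha="Dutch residents",
    beta=("Number of inhabitants",),
    gamma=("Region", "Sex"),
    tau="Period",
)
print(table.olap_dimensions())  # ('Region', 'Sex', 'Period')
```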

14 Cube model Pro Advantage: Top down view on variables of a population
Dimensions make it possible to select subpopulations (drill down) Cube is large coherent dataset Easy container for publishing data

15 Cube Model Con Disadvantage:
More dimensions mean more empty crossings E.g. Inbound shipping: 7 dimensions -> 300 million combinations, with 1 million datacells (about 0.3%) Data in multiple cubes are not easily combined. For subject areas careful cube design is necessary. Art of Cubism (Willeboordse) Minimize number of cubes Minimize number of empty cells Create core and satellite cubes
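
The sparsity in the inbound shipping example follows directly from the two rounded figures on the slide. A short Python check, with the helper name `fill_ratio` chosen here for illustration:

```python
def fill_ratio(filled_cells: int, combinations: int) -> float:
    """Share of possible dimension crossings that actually hold data."""
    return filled_cells / combinations


# Rounded figures from the slide: 7 dimensions, 300 million crossings.
ratio = fill_ratio(1_000_000, 300_000_000)
print(f"{ratio:.2%} filled, {1 - ratio:.2%} empty")  # 0.33% filled, 99.67% empty
```

With so few filled cells, storing only the non-empty crossings is usually far cheaper than materialising the full grid.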

16 Output database software
Commercial Beyond2020 SuperWeb “Home Brew” PXWeb / PCAXis (semi-commercial) Sweden, Denmark, Norway and others OECDStat (OECD) StatLine (Netherlands) Genesis (Germany) Many others

17 Cube Metadata Cube has many metadata items:
Variable names, descriptions Methodological description Footnotes Dimension names, descriptions Category names, descriptions How can we structure these?

18 Dissemination Metadata
Remember? Purpose: Descriptive, to explain meaning of the data But also typical for dissemination: findability Types: Data related (detailed) Variable related (detailed) Publication metadata (dissemination!)

19 Data related metadata Data production metadata Can be:
Description of source of data Description of methodology used to produce the data Description of trends/anomalies in dataset Status of data (provisional/revised/etc) Implemented as: Description for whole cube Footnote attached to single datacell of a cube Data related metadata is mainly descriptive

20 Variable related metadata
Metadata of variables used in cube Name and description of variable Aggregation method used Unit of variable (1,000 euro, kg etc.) Name and description of classification Name and description of classification items (categories) Some databases support hierarchical relations within classifications: e.g. regional classifications. Variable related metadata is partly descriptive but names are also useful for findability

21 Publication metadata Metadata of cube related to publishing.
Typical for dissemination Many of these are Dublin Core (dc) metadata (or variants). Dublin Core: open standard for document meta data (on the web). Describe the cube as a document! Not only for human consumption but very often used by search engines. Publication metadata is mainly used for findability

22 Publication metadata (2)
Title (dc); Author (dc); Created (dc); Modified (dc); Source (dc); Description (dc): summary; Published (dc); Spatial (dc): spatial scope; Temporal (dc): reporting period; Subject (dc); Frequency; Language; Subject Area: statistical theme
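
To make such metadata visible to search engines, a cube's web page can carry Dublin Core terms as HTML meta tags. The sketch below follows the common `DCTERMS.` prefix convention for embedding Dublin Core in HTML. The record values are invented, the function name `to_meta_tags` is hypothetical, and mapping the slide's "Author" to the Dublin Core term `creator` is an assumption.

```python
from html import escape

# Hypothetical example record; keys are Dublin Core term names.
record = {
    "title": "Inhabitants by sex, region and period",
    "creator": "Statistics Netherlands",   # assumed mapping of "Author"
    "description": "Number of inhabitants on 1 January",
    "spatial": "Netherlands",
    "language": "en",
}


def to_meta_tags(metadata: dict) -> str:
    """Render a metadata record as HTML <meta> tags, escaping the values."""
    lines = ['<link rel="schema.DCTERMS" href="http://purl.org/dc/terms/">']
    for term, value in metadata.items():
        content = escape(value, quote=True)
        lines.append(f'<meta name="DCTERMS.{term}" content="{content}">')
    return "\n".join(lines)


print(to_meta_tags(record))
```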

23 Dissemination meta data issues
Meta data is never without difficulties: Editorial Linguistic Coordination Time dependency

24 Editorial problems Problems:
Many cubes use jargon or ambiguous or difficult language Many cubes are prepared by technically skilled people, who are less skilled in writing for the general public But a cube is a publication medium! Assign an editor to each cube Choose terms carefully / use clear language Don’t put category definitions in the border of a table, but use a clear term. Put the definition in a footnote.

25 Linguistic problems Findability problem:
Users often search with a synonym/hypernym and find nothing: Synonym: “job” vs “occupation”, “business” vs “enterprise” Hyper/hyponym: “vehicle” vs “car” vs “SUV” Search should deal with synonyms in an understandable way. Most cube systems support only one language Translation to a different language results in a copy of the database: synchronization problems
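
One way a search could deal with synonyms and hypernyms is to expand the query before matching it against cube titles. The Python sketch below is a toy illustration under stated assumptions: the thesaurus contains only the examples from the slide, the cube titles are invented, and `expand` and `search` are hypothetical helper names.

```python
from typing import Dict, List, Set

# Hypothetical thesaurus built from the examples on the slide (all lower case).
SYNONYMS: Dict[str, Set[str]] = {
    "job": {"occupation"},
    "business": {"enterprise"},
}
NARROWER: Dict[str, Set[str]] = {
    "vehicle": {"car"},
    "car": {"suv"},
}


def expand(term: str) -> Set[str]:
    """Return the term plus its synonyms and all narrower terms."""
    terms = {term}
    for key, values in SYNONYMS.items():
        if term == key or term in values:
            terms |= {key} | values
    stack = [term]
    while stack:
        for child in NARROWER.get(stack.pop(), set()):
            if child not in terms:
                terms.add(child)
                stack.append(child)
    return terms


def search(query: str, titles: List[str]) -> List[str]:
    """Return titles containing the query or any expanded term as a word."""
    wanted = expand(query.lower())
    return [t for t in titles if wanted & set(t.lower().split())]


# Invented cube titles, for illustration only.
titles = ["Employment by occupation", "SUV registrations per region"]
print(search("job", titles))      # ['Employment by occupation']
print(search("vehicle", titles))  # ['SUV registrations per region']
```

Showing the user which expanded terms matched is one way to keep the behaviour understandable.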

26 Coordination problems
Within a cube, metadata can be clearly defined. Across cubes within a dissemination database this is more difficult: Variables, classifications, dimensions, measures need to be managed centrally Cubes don’t own their variable related metadata anymore. They share this metadata with other cubes. Synonym problem: dependent on context a different term may/must be used for identical metadata. (real problem!) Many cubes contain a small variation on a standard classification (but not standard) Problem of homonyms: words written identically have different meaning.

27 Coordination problems (2)
Within NSI / total website + dissemination database is also problematic Web site and dissemination db should share common glossary / search / metadata system ISTAT (Italy) is developing a system that addresses many of these problems Across organisations (Eurostat/ISI/NSI) even more difficult: Centralized metadata model does not work Maybe Federated metadata model: (combination of decentralized and centralized) Other option: Use semantic Web technology for sharing and publishing metadata.

28 Time dependency Definitions of variables may change.
Almost all cubes have a time dimension. If a measure changes A new measure is added to the cube If a dimension changes New categories are added to the cube Problem is that a changed dimension is now dependent on selection in time dimension! (regions for example) Many empty cells Currently no dissemination db addresses this issue
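
The dependency between a changed dimension and the time dimension can be made explicit by giving each category a validity period. The Python sketch below uses an invented regional merger (regions A and B become AB in 2005); the names `Category` and `categories_for` are hypothetical. It also counts how many cells of the full grid stay empty when all categories are simply added to the cube.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Category:
    code: str
    valid_from: int                  # first year the category exists
    valid_to: Optional[int] = None   # last year; None means still valid


def categories_for(year: int, categories: List[Category]) -> List[str]:
    """Return the codes of the categories that are valid in the given year."""
    return [c.code for c in categories
            if c.valid_from <= year and (c.valid_to is None or year <= c.valid_to)]


# Invented example: regions A and B merge into AB in 2005.
regions = [
    Category("A", 1990, 2004),
    Category("B", 1990, 2004),
    Category("AB", 2005),
]
print(categories_for(2000, regions))  # ['A', 'B']
print(categories_for(2010, regions))  # ['AB']

# Empty cells if every category is crossed with every year 1990..2010.
years = range(1990, 2011)
valid = sum(len(categories_for(y, regions)) for y in years)
empty = len(regions) * len(years) - valid
print(empty)  # 27
```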

29 Questions?