Documentation of statistics Metadata
Why metadata? I work in dissemination – Metadata sounds boring and / or like a job for librarians: Necessary to explain the origin and meaning of data. Supports “findability”: navigation, search engines. We and our users get lost without metadata.
Metadata is everywhere in the Generic Statistical Business Process Model Source: UNECE Secretariat - April 2009
A never ending demand Annual Danish user surveys since 2001. Every year users have placed more / better documentation as their number 1 priority. A number of improvements. No effect whatsoever. Documentation is mainly about metadata.
Why we can’t live without metadata
Many (different) ways of looking at metadata Let’s focus on those relevant to dissemination Purpose: Descriptive – to explain meaning of data ‘Findability’ – Navigation & search engines Data related Variable related Publication related
Data related / reference metadata Description of source of data Methodology used to produce data Status of data (provisional / revised / etc.) Implemented as: Quality declarations Footnotes attached to cells / tables Based on: Edwin de Jonge (CBS)
Quality declarations – reference metadata Administrative info Contents Time Accuracy Comparability Accessibility www.dst.dk/declarations
Quality declarations – Reference metadata Source: http://epp.eurostat.ec.europa.eu/portal/page/portal/population/data/main_tables
Footnote attached to table
Variable related metadata Name and description of variable Aggregation method used Unit (1,000, euro, kg, etc.) Name and description of classification Name and description of classification items (categories) Variable related metadata is partly descriptive but names are also important for ‘findability’ Based on: Edwin de Jonge (CBS)
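The elements listed on this slide can be pictured as a small data structure. The following is only a minimal sketch: the class names, field names and the example values (a marital status variable with three categories) are assumptions made for illustration, not the model used by any statistical office.

```python
from dataclasses import dataclass, field


@dataclass
class ClassificationItem:
    """One category of a classification (hypothetical structure)."""
    code: str
    name: str
    description: str = ""


@dataclass
class Classification:
    name: str
    description: str
    items: list[ClassificationItem] = field(default_factory=list)


@dataclass
class VariableMetadata:
    name: str
    description: str
    aggregation: str   # aggregation method used, e.g. "sum"
    unit: str          # e.g. "1,000", "euro", "kg"
    classification: Classification

    def search_terms(self) -> set[str]:
        """Names that a search index could use for 'findability'."""
        terms = {self.name.lower(), self.classification.name.lower()}
        terms.update(item.name.lower() for item in self.classification.items)
        return terms


# Illustrative values only.
marital_status = VariableMetadata(
    name="Population",
    description="Number of persons by marital status",
    aggregation="sum",
    unit="persons",
    classification=Classification(
        name="Marital status",
        description="Legal marital status of the person",
        items=[
            ClassificationItem("U", "Never married"),
            ClassificationItem("G", "Married"),
            ClassificationItem("F", "Divorced"),
        ],
    ),
)

print(sorted(marital_status.search_terms()))
```

The descriptive fields serve the reader of a table, while `search_terms` shows why the names themselves matter: they are what a search engine can match against.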
OECD example ©Statistics Denmark
Eurostat example
Metadata is readily available and useable in context of client's information need What is a projection? What is the difference between immigrants and descendants? Which countries are Western?
Presenting metadata –selective needs Ancestry click!
Metadata on a variable – civilstatus (marital status)
Publication metadata Metadata related to publishing Release calendars Also for search engines Dublin Core Standard for document metadata on the Internet Hidden metadata information supporting search engines
Publication metadata –release calendars
Publication metadata –release calendars Contact information Links to metadata Other publications
Publication metadata Many publication metadata are Dublin Core (dc) related and support search engines: Title (dc), Author (dc), Created (dc), Modified (dc), Source (dc), Description (dc), Summary (dc), Published (dc), Spatial (dc), Temporal (dc) – reporting period, Subject (dc), Frequency, Language, Subject Area, Statistical theme
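One common way to expose Dublin Core to search engines is as hidden `<meta>` elements in the page header, with names of the form `DC.title`. The sketch below generates such elements for a few of the core elements. The function name and the example values are hypothetical, and note that the Dublin Core element for an author is called `creator`.

```python
from html import escape


def dublin_core_meta(fields: dict[str, str]) -> str:
    """Render Dublin Core elements as HTML <meta> tags (illustrative helper)."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in fields.items():
        name = escape(f"DC.{element}", quote=True)
        content = escape(value, quote=True)
        lines.append(f'<meta name="{name}" content="{content}">')
    return "\n".join(lines)


# Example values are made up for illustration.
print(dublin_core_meta({
    "title": "Population projections",
    "creator": "Statistics Denmark",
    "subject": "Population & elections",
    "description": "Projected population by age and ancestry",
    "language": "en",
}))
```

The values are HTML-escaped so that characters such as `&` or quotes in a title do not break the page header.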
Dublin core supporting search
Metadata challenges Terminology / linguistics Coherence Output databases / changes over time Audience / Target groups
Terminology – What are our users talking about? Statistical terms: CPI, Employed, Salary, Income, Household, Family. Layman terms: Inflation, Working, Income/Salary, Family.
Dissemination metadata issues - Linguistics ‘Findability’: Users use synonyms / hyponyms to find data and find nothing Synonym: job <> occupation, business <> enterprises Hyper/hyponym: vehicles <> car <> SUV "Musical instrument" is a hypernym of "guitar" because musical instruments include guitars
Metadata should ensure coherence in contents: the same definitions, aggregations and classifications must be used across all subject areas and media; should build on internationally recognized nomenclatures; data sources must be technically coordinated => Statisticians <> Dissemination
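A search function can soften this problem by expanding the user's term with synonyms and broader terms before matching. The sketch below uses only the word pairs from this slide; the dictionaries, the function names and the table titles are assumptions for illustration, and the substring matching is deliberately naive.

```python
# Synonym pairs taken from the slide, stored in both directions.
SYNONYMS = {
    "job": {"occupation"},
    "occupation": {"job"},
    "business": {"enterprises"},
    "enterprises": {"business"},
}

# Narrower term -> broader term (hyponym -> hypernym).
HYPERNYMS = {
    "suv": "car",
    "car": "vehicles",
    "guitar": "musical instrument",
}


def expand_query(term: str) -> set[str]:
    """Return the term plus its synonyms and all broader terms."""
    term = term.lower()
    expanded = {term} | SYNONYMS.get(term, set())
    current = term
    while current in HYPERNYMS:  # walk up the hierarchy
        current = HYPERNYMS[current]
        expanded.add(current)
    return expanded


def search(term: str, table_titles: list[str]) -> list[str]:
    """Naive search: keep titles containing any expanded term."""
    terms = expand_query(term)
    return [t for t in table_titles if any(x in t.lower() for x in terms)]


# Hypothetical table titles.
TITLES = ["Vehicles by type and region", "Employment by occupation"]
print(search("SUV", TITLES))
print(search("job", TITLES))
```

Without the expansion step, both example queries would find nothing, which is the situation the slide describes.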
Inconsistent tables
Motorbike owner: 18 – 25 (A), 26 – 45 (B), > 46 (C)
Car owner: 18 – 29 (D), 30 – 41 (E), 41+ (F)
When creating tables / compiling statistics, detailed attention should be given to the harmonization of variable values, even across different subject areas. Otherwise you will end up being inconsistent both across time and across subject areas. In the example above we have two different variables, Car owner and Motorbike owner, distributed by age. Even if the data come from different surveys, so that it is not possible to cross-tabulate Motorbike owner and Car owner, it is still much better dissemination to use the same age groupings across all tables compiled by your organization. This is of course a nearly impossible task in real life. 1/17/2019
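Such inconsistencies can be caught mechanically when the groupings are stored as metadata rather than as free text. The sketch below encodes the two age groupings from the example as inclusive ranges and checks them. It rests on assumptions: "> 46" is read literally as 47 and over, the open-ended groups are capped at an arbitrary 120, and the function names are invented for illustration.

```python
OPEN_END = 120  # arbitrary stand-in for an open-ended upper bound (assumption)

# Age groupings from the example, as inclusive (low, high) ranges.
MOTORBIKE_OWNER = [(18, 25), (26, 45), (47, OPEN_END)]  # "> 46" read as 47+
CAR_OWNER = [(18, 29), (30, 41), (41, OPEN_END)]


def band_problems(bands):
    """Return ("gap" | "overlap", first_age, last_age) for adjacent bands."""
    problems = []
    ordered = sorted(bands)
    for (_, hi1), (lo2, hi2) in zip(ordered, ordered[1:]):
        if lo2 <= hi1:
            problems.append(("overlap", lo2, min(hi1, hi2)))
        elif lo2 > hi1 + 1:
            problems.append(("gap", hi1 + 1, lo2 - 1))
    return problems


def same_grouping(bands_a, bands_b):
    """True if two tables use exactly the same age groups."""
    return sorted(bands_a) == sorted(bands_b)


print(same_grouping(MOTORBIKE_OWNER, CAR_OWNER))
print(band_problems(MOTORBIKE_OWNER))
print(band_problems(CAR_OWNER))
```

Under this literal reading, the check flags not only that the two tables use different groupings but also that each grouping has an internal flaw: age 46 falls in no motorbike group, and age 41 falls in two car groups.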
Consistent structural metadata in the Danish model Centralized: variables, values, unit, ”by”/”and”, time, template for quality declaration. Decentralized: contents, footnote, contact person, quality declaration. Decentralized metadata is highly standardized through templates, guides and editorial overview.
Time dependency – output databases Definitions of variables may change. Almost all cubes have a time dimension. If a measure changes: a new measure is added. If a dimension changes: new categories are added. Change in dimension depends on selection in time dimension -> Many empty cells (region)
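The empty-cell effect can be illustrated by giving each category a validity period. This is a minimal sketch, assuming a hypothetical region dimension in which two old regions are replaced by one new region from 2007; the class, the function names and the codes are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Category:
    code: str
    valid_from: int
    valid_to: Optional[int] = None  # None means still valid

    def valid_in(self, year: int) -> bool:
        return self.valid_from <= year and (
            self.valid_to is None or year <= self.valid_to
        )


# Hypothetical region dimension: new categories were added when it changed.
REGIONS = [
    Category("old-A", 2000, 2006),
    Category("old-B", 2000, 2006),
    Category("new-X", 2007),
]


def empty_cells(categories, years):
    """(category, year) cells that can never hold data."""
    return [(c.code, y) for c in categories for y in years if not c.valid_in(y)]


def categories_for(categories, years):
    """Categories worth offering for the selected years."""
    return [c.code for c in categories if any(c.valid_in(y) for y in years)]


years = [2005, 2006, 2007, 2008]
print(len(empty_cells(REGIONS, years)), "of", len(REGIONS) * len(years), "cells are empty")
print(categories_for(REGIONS, [2008]))
```

Making the dimension depend on the time selection, as `categories_for` does, is one way to avoid presenting users with the empty cells in the first place.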
Metadata … for what? Metadata play a role when the users browse, search, select, comprehend, compare
Metadata – for whom? Staff: database administrators, statisticians, developers, managers. Users, internal/external: news media, international organisations, researchers, occasional users.
Documentation – metadata principles (*): ensure customers are identified for all metadata processes; make metadata 'active' to the greatest extent possible - also to Google; single authoritative source - 'registration authority'; reuse metadata; metadata is readily available and useable in context of client's information need. (*) www.statistics.gov.uk/events/q2006/downloads/W02_Penlington.pdf
And now back to work … card sorting