StatLine 4 metadata implementation
Edwin de Jonge, Statistics Netherlands
What is StatLine?
StatLine is the online output database of Statistics Netherlands.
–Primary output channel
–Contains all published data
–Current size: 1500 data cubes, 2 billion data cells, over 150 million facts
–Offers extensive functionality, including a very good search engine
StatLine in Business Architecture
StatLine in statistical process
What is StatLine 4?
A redesign of the current StatLine 3 dissemination software.
Reasons for the redesign:
–Improve coherence
–Changing publication policy
–Handle time dependence
–Archiving
–Many new features
StatLine coherence
Ideally: StatLine is coherent and consistent
Currently (StatLine 3):
–1500 independent data cubes
StatLine 4:
–Data cubes share metadata:
  –Centrally moderated, quality improvement
–Data cubes share data:
  –Each fact is stored once
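The "each fact stored once" idea can be illustrated with a small Python sketch. It is not the StatLine 4 implementation: the class names `FactStore` and `DataCube`, the key layout and the sample value are all assumptions made for illustration. Cubes hold references to facts in one shared store instead of private copies.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical fact key: a variable plus its (classification, category) coordinates.
FactKey = Tuple[str, FrozenSet[Tuple[str, str]]]


class FactStore:
    """Single shared store; every fact is kept exactly once."""

    def __init__(self) -> None:
        self._facts: Dict[FactKey, float] = {}

    def put(self, variable: str, coordinates: Dict[str, str], value: float) -> FactKey:
        # The same variable and coordinates always map to the same key,
        # so storing a fact again overwrites it instead of duplicating it.
        key = (variable, frozenset(coordinates.items()))
        self._facts[key] = value
        return key

    def get(self, key: FactKey) -> float:
        return self._facts[key]

    def __len__(self) -> int:
        return len(self._facts)


class DataCube:
    """A cube only references facts in the shared store."""

    def __init__(self, name: str, store: FactStore) -> None:
        self.name = name
        self.store = store
        self.keys: List[FactKey] = []

    def add(self, key: FactKey) -> None:
        self.keys.append(key)

    def values(self) -> List[float]:
        return [self.store.get(key) for key in self.keys]


store = FactStore()
# Placeholder value, not real StatLine data.
key = store.put("population", {"Sex": "male", "Time": "2005"}, 100.0)

cube_a = DataCube("cube A", store)
cube_b = DataCube("cube B", store)
cube_a.add(key)
cube_b.add(key)

# A correction to the fact is visible in both cubes at once.
store.put("population", {"Sex": "male", "Time": "2005"}, 101.0)
```

In this sketch both cubes return the corrected value because neither holds its own copy of the fact.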
StatLine 4 metadata management
Metadata management is centralized:
–What? Conceptual metadata:
  –Classifications
  –Variables
–By whom? Two organizational units:
  1. Coordination: maintaining the structure and meaning of classifications
  2. Dissemination: textual editing and translations
–Data producers own the data, but not the metadata.
–Result: every fact in StatLine 4 uses central classifications.
Classification status
In StatLine 4 each classification has a status:
–(Inter)national standard
–Coordinated (within Statistics Netherlands)
–Shared
  –Shared but not coordinated
–Private
  –Can only be used by 1 data cube
  –Only during conversion
This status is used for coordination purposes.
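The four status values could be modelled as a small enumeration. The Python sketch below is illustrative only: the identifiers `ClassificationStatus` and `may_be_used_by` are hypothetical, and it assumes that "private" is the only status that restricts how many data cubes may use a classification.

```python
from enum import Enum


class ClassificationStatus(Enum):
    """Status values named on the slide; the identifiers are hypothetical."""

    STANDARD = "(inter)national standard"
    COORDINATED = "coordinated within Statistics Netherlands"
    SHARED = "shared but not coordinated"
    PRIVATE = "private, used by one data cube only"


def may_be_used_by(status: ClassificationStatus, cube_count: int) -> bool:
    """Return True if a classification with this status may serve cube_count cubes.

    Assumption: only PRIVATE limits usage, to a single data cube.
    """
    if cube_count < 0:
        raise ValueError("cube_count must not be negative")
    if status is ClassificationStatus.PRIVATE:
        return cube_count <= 1
    return True
```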
Cristal model
StatLine 4 uses the Cristal model
–Model for classifications and variables (Van Bracht et al.)
–Focus on conceptual and value domain (ISO 11179)
Model elements:
–Category (value):
  –Value of a variable; creates a subpopulation, e.g. male (gender: male)
  –Can be part of another category (partial order)
–Level:
  –Set of disjoint categories
  –Equals a “flat” classification
Cristal model (2)
–Hierarchy:
  –Sequence of levels (total order) with contained categories
  –Every category in a hierarchy has 1 parent in a higher level
  –Equals a “hierarchical” classification
–Classification:
  –Set of hierarchies with contained levels and categories
  –Equals a family of hierarchical classifications
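The model elements (category, level, hierarchy, classification) can be sketched as plain Python data classes. This is a minimal illustration, not the Cristal schema itself. The class and field names are assumptions, the region codes are made-up sample data, and the check assumes that "higher level" means the level directly above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Category:
    code: str                     # value of a variable, e.g. "M" for male
    parent: Optional[str] = None  # code of the parent category, if any


@dataclass
class Level:
    """A set of disjoint categories: a "flat" classification."""

    name: str
    categories: List[Category] = field(default_factory=list)


@dataclass
class Hierarchy:
    """An ordered sequence of levels, listed from top to bottom."""

    name: str
    levels: List[Level] = field(default_factory=list)

    def check(self) -> List[str]:
        """List categories whose parent is not in the level directly above."""
        problems = []
        for upper, lower in zip(self.levels, self.levels[1:]):
            upper_codes = {cat.code for cat in upper.categories}
            for cat in lower.categories:
                if cat.parent not in upper_codes:
                    problems.append(
                        f"{cat.code}: parent {cat.parent!r} not in level {upper.name}"
                    )
        return problems


@dataclass
class Classification:
    """A family of hierarchies over shared levels and categories."""

    name: str
    hierarchies: List[Hierarchy] = field(default_factory=list)


# Made-up sample data: "XX" points at a parent missing from the level above.
region = Hierarchy("Region", [
    Level("Country", [Category("NL")]),
    Level("Province", [Category("UT", parent="NL"), Category("XX", parent="BE")]),
])
problems = region.check()
```

Here `problems` holds one entry, for the category `XX`, because its parent is not a category of the `Country` level.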
Cristal model (3)
–Classification versioning
  –Each metadata object has a lifetime (begin and end date)
  –Each metadata object can have a predecessor and a successor
  –Models versions of categories, levels and hierarchies
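Lifetimes with predecessor and successor links can be sketched as a doubly linked chain of versions. The Python below is an assumption-laden illustration: `VersionedObject`, `supersede` and `version_on` are hypothetical names, an open end date is represented as `None`, and the codes and dates are made-up sample values.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(eq=False)
class VersionedObject:
    """A metadata object (category, level or hierarchy) with a lifetime."""

    code: str
    begin: date
    end: Optional[date] = None  # None means the version is still valid
    predecessor: Optional["VersionedObject"] = field(default=None, repr=False)
    successor: Optional["VersionedObject"] = field(default=None, repr=False)

    def valid_on(self, day: date) -> bool:
        return self.begin <= day and (self.end is None or day <= self.end)


def supersede(old: VersionedObject, new: VersionedObject) -> None:
    """Link new as the successor of old, and old as the predecessor of new."""
    old.successor = new
    new.predecessor = old


def version_on(obj: VersionedObject, day: date) -> Optional[VersionedObject]:
    """Return the version in obj's chain that is valid on day, if any."""
    current: Optional[VersionedObject] = obj
    # Rewind to the earliest version, then walk forward through successors.
    while current.predecessor is not None:
        current = current.predecessor
    while current is not None:
        if current.valid_on(day):
            return current
        current = current.successor
    return None


# Made-up sample: a region code that changed on 1 January 2006.
old_region = VersionedObject("R-old", date(2005, 1, 1), date(2005, 12, 31))
new_region = VersionedObject("R-new", date(2006, 1, 1))
supersede(old_region, new_region)
```

Looking up a date in 2005 from either version of the chain yields `old_region`; a date from 2006 onward yields `new_region`.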
Cristal model (4)
Multilingual
–All textual properties are multilingual
  –E.g. Mannelijk (Dutch) -> Male
–All metadata and tables can be shown in each defined language
–All textual properties have popular versions
  –E.g. Consumer Price Index -> Inflation
–All metadata and tables can be shown in “popular” or “expert” mode
Object class: is stored, but not coordinated (yet)
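A textual property that varies by language and by mode can be sketched as a lookup keyed on both. The `Text` class below is hypothetical, and the fallback order (same language in expert mode, then a default language) is an assumption, not documented StatLine 4 behaviour.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Text:
    """A textual property stored per (language, mode) pair."""

    values: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def set(self, language: str, mode: str, value: str) -> None:
        self.values[(language, mode)] = value

    def get(self, language: str, mode: str = "expert",
            default_language: str = "nl") -> str:
        # Assumed fallback order: exact match, expert text in the same
        # language, then the same two lookups in the default language.
        candidates = [
            (language, mode),
            (language, "expert"),
            (default_language, mode),
            (default_language, "expert"),
        ]
        for key in candidates:
            if key in self.values:
                return self.values[key]
        raise KeyError(f"no text for {language}/{mode}")


gender = Text()
gender.set("nl", "expert", "Mannelijk")
gender.set("en", "expert", "Male")

cpi = Text()
cpi.set("en", "expert", "Consumer Price Index")
cpi.set("en", "popular", "Inflation")
```

With these entries, `cpi.get("en", "popular")` returns the popular title, while `gender.get("en", "popular")` falls back to the expert text because no popular version was stored.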
StatLine 4 conversion
All content of the current StatLine must be converted
–From 1500 independent cubes
–To 1500 coordinated cubes
Conversion means coordination!
–Total coordination -> very long conversion
–No coordination -> no added value
Ergo: partial classification coordination
Conversion strategy (1)
Strategy:
–Coordinate standardized metadata
–Allow non-standard metadata for a 2-year period
–Phased conversion
  –Preparation, conversion, coordination
Conversion strategy (2)
Preparation phase: until June 2006
–Collect and store standard classifications
  –E.g. Time, Region (50 versions), Age, Marital status, Sex, NACE
  –Including variations (disclosure control)
–For each data cube
  –Check usage of standard classifications
  –Non-standard classifications are marked “private”
–Define StatLine 4 structure
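The per-cube check in the preparation phase amounts to comparing the classifications a cube uses against the collected standard set. The Python sketch below assumes classifications can be matched by name, which is a simplification; the function name `mark_classifications` and the non-standard example "Custom product groups" are hypothetical.

```python
from typing import Dict, Iterable, Set

# Standard classifications named on the slide.
STANDARD_CLASSIFICATIONS: Set[str] = {
    "Time", "Region", "Age", "Marital status", "Sex", "NACE",
}


def mark_classifications(cube_classifications: Iterable[str],
                         standard: Set[str]) -> Dict[str, str]:
    """Label each classification used by a cube as "standard" or "private"."""
    return {
        name: ("standard" if name in standard else "private")
        for name in cube_classifications
    }


# Hypothetical cube that uses two standard and one non-standard classification.
marks = mark_classifications(
    ["Time", "Region", "Custom product groups"],
    STANDARD_CLASSIFICATIONS,
)
```

Only the classification missing from the standard set ends up marked "private", which makes it a candidate for later coordination.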
Conversion strategy (3)
Conversion phase (June 2006):
–Convert data cube
–Add missing metadata to the metadata server
–Check conversion
Coordination phase (November 2006):
–After conversion, StatLine 4 contains coordinated and private metadata
–Within two years all private metadata must be replaced with coordinated metadata
Benefits of StatLine 4 metadata
–Coordinated classifications and variables
–Uniform naming and description
–Standard/coordinated metadata can be downloaded
–Better comparability of data
–Better search results
Future improvements
StatLine 4.1
–Centralize population (object class) management:
  –E.g. person, enterprise
  –Model populations and subpopulations
Statistical process
–Centralize:
  –Process metadata
  –Quality metadata