Two Paradigms for Official Statistics Production
Boris Lorenc, Jakob Engdahl and Klas Blomqvist
Statistics Sweden
Preliminaries
The talk concerns data and knowledge about the external world, not data and knowledge about producing statistics (though it may have consequences for the latter)
Inspired by the various discussions of ongoing developments and initiatives within (official) statistics
May have some relevance for editing
Naturally, the views presented herein are those of the authors and do not necessarily reflect the policies of Statistics Sweden
Preliminaries (cont’d)
Transition from (many) stovepipes to (few) integrated system(s)
Among the intended goals:
  i. better integration of administrative data and survey data
  ii. better/faster response to new or changing user needs
What should an integrated system look like in order to satisfy these requirements? An answer is sought in the field of knowledge systems/cognitive systems
Agenda
I. Preliminaries
II. On some distinctions and results regarding knowledge/cognitive systems
III. Consequences for representing data in integrated systems for statistics production
IV. Further considerations for statistics methodology, including some thoughts regarding editing
Knowledge/Cognitive Systems
Computational
  symbolic: first-order predicate logic, other formal logics, etc.
  subsymbolic: artificial neural networks (ANNs), etc.
Other (noncomputational)
  embodied cognition, situated cognition, socially distributed cognition, etc.
Good for restricted domains with clear rules (e.g. chess), less good for open-world problems
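As a minimal sketch of the symbolic/subsymbolic distinction (our illustration, not part of the talk; the record fields and toy data are made up), the fragment below contrasts knowledge written down as an explicit, inspectable rule with knowledge encoded implicitly in learned weights:

```python
# Illustrative sketch only: symbolic vs. subsymbolic representation of the
# same toy question, "is this record plausible?".

# Symbolic: an explicit, human-readable rule in the spirit of a first-order
# predicate ("for all records r: ...").
def plausible_symbolic(record):
    """Hand-coded rule: turnover non-negative and fewer than 10 000 employees."""
    return record["turnover"] >= 0 and record["employees"] < 10_000

# Subsymbolic: a tiny perceptron trained on labelled examples; the "rule" can
# no longer be read off directly, only its behaviour can be observed.
def train_perceptron(examples, labels, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(examples, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

if __name__ == "__main__":
    rec = {"turnover": 1250.0, "employees": 42}
    print("symbolic rule says:", plausible_symbolic(rec))

    # Toy training data: (scaled turnover, scaled employee count) -> plausible?
    X = [(0.1, 0.2), (0.9, 0.1), (-0.5, 0.3), (0.4, 5.0)]
    y = [1, 1, 0, 0]
    print("learned weights and bias:", train_perceptron(X, y))
```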
Database developments
Relational model
  RDBMS (Relational Database Management System) implements first-order predicate logic
  database schema: a theory in predicate calculus
NoSQL
  schema-less (theory-less)
  examples: Google's BigTable; solutions underlying some functions at Amazon, Twitter, and Facebook
Perhaps related: Semantic Web
  how to structure documents into a “web of data”
  “a web of data that can be processed directly and indirectly by machines”
  uses the Resource Description Framework (RDF) rather than an RDBMS
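The contrast can be made concrete with a small sketch (ours, not from the talk; the table, field names and sample records are invented for illustration): the same business record held under a fixed relational schema, as schema-less documents, and as RDF-style triples.

```python
# Illustrative sketch only: relational (schema-first) vs. schema-less storage.
import json
import sqlite3

# Relational: the schema fixes in advance which predicates (columns) exist;
# queries are evaluated against that fixed "theory".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enterprise (id INTEGER PRIMARY KEY, name TEXT, turnover REAL)"
)
conn.execute("INSERT INTO enterprise VALUES (?, ?, ?)", (1, "Acme AB", 1250.0))
print(conn.execute("SELECT name, turnover FROM enterprise").fetchall())

# Schema-less: each document carries its own attributes; new attributes can be
# added per record without redefining a schema up front.
documents = [
    {"id": 1, "name": "Acme AB", "turnover": 1250.0},
    {"id": 2, "name": "Beta HB", "employees": 7, "scraped_prices": [9.9, 12.5]},
]
print(json.dumps(documents[1], indent=2))

# RDF-style (subject, predicate, object) triples, as used on the Semantic Web.
triples = [
    ("enterprise:1", "hasName", "Acme AB"),
    ("enterprise:1", "hasTurnover", 1250.0),
]
print(triples)
```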
Consequences
Paradigm I: Stovepipe + RDBMS
  ‘manual’ management of a fairly restricted domain
  single-purpose use
  likely requires expert assistance to users in search and requirements specification
Paradigm II: Integrated system + NoSQL
  automatic building of world knowledge pertaining to the domain
  multi-purpose use
  likely empowers users to explore the available data themselves and to consider the merits of requiring new data
Sampling theory considerations
In the context of Paradigm II:
  use of weights: what should they then reflect?
    inclusion probabilities (if known)?
    nonresponse information (including an assumed model)?
    auxiliary information pertaining to the specific variables to be estimated?
  use of models
    memorylessness vs. Bayesian statistics
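As a hedged illustration of the weighting question (our example, not a method proposed in the talk; the function name and toy figures are ours), the sketch below shows how a Horvitz-Thompson-style total could combine known inclusion probabilities with a simple assumed nonresponse model:

```python
# Illustrative sketch: design weights combining inclusion probabilities and a
# crude nonresponse adjustment in a weighted total.

def estimate_total(values, incl_probs, responded, resp_probs=None):
    """Weighted total over respondents.

    values     : observed study variable for the sampled units
    incl_probs : known inclusion probabilities pi_i
    responded  : booleans, whether unit i responded
    resp_probs : assumed response probabilities (uniform response model if None)
    """
    if resp_probs is None:
        rate = sum(responded) / len(responded)  # observed response rate
        resp_probs = [rate] * len(values)
    total = 0.0
    for y, pi, r, phi in zip(values, incl_probs, responded, resp_probs):
        if r:
            total += y / (pi * phi)  # weight = 1 / (inclusion prob * response prob)
    return total

# Toy example: five sampled units, one nonrespondent.
y = [10.0, 12.0, 8.0, 15.0, 11.0]
pi = [0.1, 0.1, 0.2, 0.2, 0.5]
resp = [True, True, False, True, True]
print(estimate_total(y, pi, resp))
```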
Editing
Editing for a purpose vs. editing “without a purpose”
  adherence to general specifications (‘concept validity’)
  self-learning (unsupervised) tools from computer science/ANN
  model congruence (especially building automatic models using methods from the KDD (Knowledge Discovery and Data Mining) field)
  more?
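One generic example of such an unsupervised, “without a purpose” check (our sketch, not a tool discussed in the talk; the variable name and data are invented) is flagging suspicious values with a robust z-score based on the median and the median absolute deviation:

```python
# Minimal sketch of an unsupervised editing check: robust z-score flagging.
import statistics

def flag_suspicious(values, threshold=3.5):
    """Return indices of values whose robust z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []
    flagged = []
    for i, v in enumerate(values):
        z = 0.6745 * (v - med) / mad  # 0.6745 makes MAD comparable to a std dev
        if abs(z) > threshold:
            flagged.append(i)
    return flagged

turnover = [101.0, 98.5, 103.2, 99.7, 1021.0, 100.4]  # one likely unit error
print(flag_suspicious(turnover))  # -> [4]
```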
Conclusions
The distinction is likely not as clear-cut as presented here; however, a trend is discernible:
  transition from “manual” to automatic processing
  a potentially increased need to use models
In building representations of “world knowledge”, in addition to RDBMS, pay attention to developments in NoSQL, Big Data, and similar
Perhaps strengthen work on
  general-purpose data editing
  automated data editing
  model use... (as already advanced in several contributions to the workshop)
Thank you