Efficiency and generalization as drivers

Presentation transcript:

Efficiency and generalization as drivers for responsive and standard statistical processes. A rule-engine system applied to statistical validation. Annalisa Cesaro, 15 March 2017, Brussels

AGENDA: Deterministic data editing and imputation in GSBPM in modern statistics; The EDI component and its application and technological architecture; The efficiency strategy and future enhancements; The EDI component in an integrated environment and future enhancements (for achieving platform neutrality)

Deterministic data editing and imputation in GSBPM in modern statistics. WHERE WE ARE MOVING… [Figure: move from a stovepipe model (redundancy, no sharing) to an Integrated Registers System (transversality, sharing all), driven by offer increment and data source variability. Diagram labels: social statistics, economic statistics, output dissemination, data collection, data processing, data integration, register inputs, data integration for entity identification, data integration for entity statistical characterization; sources: surveys, administrative registers, big data; supporting layers: statistical operations infrastructure, ICT, statistical infrastructure, standards and methods, services, applications and IT objects.]

Deterministic data editing and imputation in GSBPM in modern statistics. WHERE WE ARE MOVING… The same effort is work in progress at the European level, where sharing is being effectively promoted. Unfortunately, most cases of sharing have required significant work to integrate components into different processing and technology environments. This has led to the development of the Common Statistical Production Architecture (CSPA) and its implementation.

Deterministic data editing and imputation in SBR production. ONE RELEVANT APPLICATION. The designed validation framework has been used for Deterministic Statistical Data Editing (DSDE) within the yearly Statistical Business Register (SBR). In GSBPM terms, it covers sub-processes 5.3 (Review and validate) and 5.4 (Edit and impute). The SBR is updated yearly by integrating administrative and statistical sources: statistical units are identified starting from legal units, and the main structural and identification variables are estimated for each integrated unit by applying a robust methodology. In this validation phase it is generally possible to switch from a vertical database structure to a horizontal one, building a single table that holds all the needed information for each statistical unit. This structure makes it easy to define validation rules for each statistical unit and, when a linking key groups several units, to define validation rules inside disjoint groups. The DSDE validation process therefore applies to microdata: it examines each record to identify potential problems, errors and discrepancies, such as outliers and miscoding. It is run iteratively, and data are flagged for automatic or manual inspection or editing.
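The switch from a vertical to a horizontal structure can be sketched as follows. This is a minimal illustration, not the SBR implementation: the variable names (`employees`, `turnover`, `group_id`) and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical vertical (one row per unit and variable) input.
vertical_rows = [
    (1, "employees", 12),
    (1, "turnover", 950.0),
    (1, "group_id", "G1"),
    (2, "employees", 3),
    (2, "group_id", "G1"),
]


def to_horizontal(rows):
    """Build one record per statistical unit, with one key per variable."""
    table = defaultdict(dict)
    for unit_id, variable, value in rows:
        table[unit_id][variable] = value
    return dict(table)


def units_by_group(table, key="group_id"):
    """Group unit ids by the linking key, so rules can run inside disjoint groups."""
    groups = defaultdict(list)
    for unit_id, record in table.items():
        groups[record.get(key)].append(unit_id)
    return dict(groups)


base_table = to_horizontal(vertical_rows)
groups = units_by_group(base_table)
```

With one record per unit, a per-unit rule only needs to read that record, and a group rule only needs the records listed under one linking key.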

The EDI component and its application and technological architecture. WHAT IS NEEDED… The EDI component performs editing and imputation operations on a list of statistical entities, retrieving for each entity the input statistical variables to be processed. This list is called the base. The EDI component is driven by a set of deterministic rules, which may involve several entities linked by a coupling key (the rule links different entities of the same base by matching the coupling key). The EDI component processes the rules against a single relational staging table, which has one record per entity, collects all the input data for that entity, and records the correction actions produced by the EDI process.
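A minimal sketch of how deterministic rules might run over such a base, assuming hypothetical variables and two invented rules: one looks at a single entity, the other uses the coupling key to reach other entities of the same base. The correction actions are collected in a list, standing in for the output columns of the staging table.

```python
# Hypothetical base: one record per entity, with a coupling key and input variables.
base = [
    {"id": "A", "coupling_id": "K1", "employees": 10, "turnover": 500.0},
    {"id": "B", "coupling_id": "K1", "employees": -4, "turnover": None},
    {"id": "C", "coupling_id": "K2", "employees": 7, "turnover": 120.0},
]


def rule_non_negative_employees(record, base):
    """Single-entity rule: replace a negative employee count by its absolute value."""
    if record["employees"] is not None and record["employees"] < 0:
        return ("employees", abs(record["employees"]))
    return None


def rule_turnover_from_coupled(record, base):
    """Coupled rule: impute a missing turnover with the mean over coupled entities."""
    if record["turnover"] is not None:
        return None
    peers = [
        r["turnover"]
        for r in base
        if r["coupling_id"] == record["coupling_id"]
        and r["id"] != record["id"]
        and r["turnover"] is not None
    ]
    if not peers:
        return None  # nothing to impute from; leave the record untouched
    return ("turnover", sum(peers) / len(peers))


def apply_rules(base, rules):
    """Apply each rule to each entity in a fixed order and log every correction."""
    actions = []
    for record in base:
        for rule in rules:
            outcome = rule(record, base)
            if outcome is None:
                continue
            variable, new_value = outcome
            actions.append({
                "id": record["id"],
                "rule": rule.__name__,
                "variable": variable,
                "old": record[variable],
                "new": new_value,
            })
            record[variable] = new_value
    return actions


actions = apply_rules(base, [rule_non_negative_employees, rule_turnover_from_coupled])
```

Because the rules and their order are fixed, running the same base through `apply_rules` always yields the same corrections, which is what makes the process deterministic.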

The EDI component and its application and technological architecture. WHAT IS NEEDED… The EDI component needs: the list of @ID; the input selection strategy for populating several variables for each @ID, retrieving them from different sources (local, remote, or exposed as web services); the set of rules, defined against a fixed base table structure; the base table structure; a standardized output for evaluating and downloading the results of the executed editing and imputation process. TESTED APPROACH: SELECTION RULES.

The EDI component and its application and technological architecture. A STANDARD EDITING AND IMPUTATION PROCESS. [Figure: sources (Admin, Survey, Big Data, Foreign) integrated in terms of unique ID in an entity-attribute-value model, with attributes (properties) for entity identification and attributes (properties) for entity statistical characterization. Base table sustaining editing and imputation processing, with columns: Unique ID, Coupling ID, input variables (attributes), output variables (attributes). Process steps: rule definition, base table truncation, @ID list, selection input rules, edit and imputation rules, standard reports production.]

The EDI component and its application and technological architecture. A STANDARD EDITING AND IMPUTATION PROCESS: RULE DEFINITION IN A SQL-LIKE LANGUAGE. The base table structure has to be given. [Figure: layers of the Java-based web application: GUI, Actions, Services, DAOs, Entities.]
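To illustrate what a rule written in a SQL-like language against a given base table could look like, the sketch below stores each rule as a condition and an action and turns it into an UPDATE statement. It uses SQLite from the Python standard library purely as a stand-in; the table layout, column names, rule names and rule syntax are assumptions, not the syntax used by the EDI component.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical base table: identification columns, input variables, one output flag.
conn.execute(
    "CREATE TABLE base_table ("
    "id TEXT PRIMARY KEY, coupling_id TEXT, employees INTEGER, "
    "turnover REAL, edit_flag TEXT)"
)
conn.executemany(
    "INSERT INTO base_table VALUES (?, ?, ?, ?, NULL)",
    [("A", "K1", 10, 500.0), ("B", "K1", -4, 200.0), ("C", "K2", 0, 80.0)],
)

# Hypothetical rules: a SQL-like condition selects records, the action edits them.
rules = [
    {"name": "R1_negative_employees",
     "condition": "employees < 0",
     "action": "employees = ABS(employees)"},
    {"name": "R2_turnover_without_staff",
     "condition": "employees = 0 AND turnover > 0",
     "action": "edit_flag = 'MANUAL'"},
]


def run_rules(conn, rules):
    """Run each rule as one UPDATE and report how many records it touched."""
    report = {}
    for rule in rules:
        # Rule text is trusted here; a real system would validate it first.
        sql = f"UPDATE base_table SET {rule['action']} WHERE {rule['condition']}"
        report[rule["name"]] = conn.execute(sql).rowcount
    conn.commit()
    return report


report = run_rules(conn, rules)
```

Fixing the base table structure up front is what lets rules refer to columns by name without knowing where the values originally came from.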

The EDI component and its application and technological architecture. A STANDARD EDITING AND IMPUTATION PROCESS: RULE BASED PROCESS WEB EXECUTION MONITORING. The rule-based process has been implemented in Oracle. It relies on a performant database schema which, thanks to an engineered partitioning and indexing strategy, decouples the execution of rule-based processes from the downloading of their results. The process is wrapped in an Oracle procedure with standard parameters, whose scheduling may be controlled remotely (via web application or web service).

The EDI component and its application and technological architecture. A STANDARD EDITING AND IMPUTATION PROCESS: SELECTION INPUT RULES, EDIT AND IMPUTATION RULES. EFFICIENCY IN EXECUTION. [Figure: all data are split into data chunks, each a subset of the base table (each chunk is labelled 20000 in the figure). Mapping: active dedicated server processes map each entity in relation to a given rule. Reducing: a reduce step at the end aggregates the results for reporting purposes.]
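The chunk, map and reduce pattern in this slide can be sketched as below. The sketch assumes that 20000 is the chunk size, uses a thread pool as a stand-in for the dedicated Oracle server processes, and applies one invented rule; none of these details are confirmed by the presentation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 20000  # assumption: the figure's "20000" label read as the chunk size


def split_into_chunks(records, size=CHUNK_SIZE):
    """Split the base table into consecutive chunks of at most `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: evaluate one hypothetical rule on every entity of a chunk."""
    counts = Counter()
    for record in chunk:
        counts["failed" if record["employees"] < 0 else "passed"] += 1
    return counts


def run(records, workers=4, size=CHUNK_SIZE):
    """Process chunks concurrently, then reduce the partial counts for reporting."""
    chunks = split_into_chunks(records, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total = Counter()
    for partial in partials:  # reduce step
        total.update(partial)
    return len(chunks), total


# Hypothetical data: every tenth entity violates the rule.
records = [{"id": i, "employees": -1 if i % 10 == 0 else 5} for i in range(50000)]
n_chunks, summary = run(records)
```

Each chunk is processed independently, so the only shared work is the final aggregation, which stays cheap because it handles one small summary per chunk.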

The EDI component and its application and technological architecture. A STANDARD EDITING AND IMPUTATION PROCESS: STANDARD REPORTING FOR MONITORING AND DOWNLOADING.

The efficiency strategy and future enhancements. SELECTION INPUT RULES, EDIT AND IMPUTATION RULES: EFFICIENCY IN EXECUTION. [Charts: "Speed up w.r.t. best NoParallel" (values shown: 13%, 53%, 66%) and "Scalability level", comparing the configurations Parallel_P1, Parallel_P2, Parallel_P3, Parallel_NoP, Parallel_alone, NoParallel_launch1, NoParallel_launch2 and NoParallel_launch3.]

The efficiency strategy and future enhancements. SELECTION INPUT RULES, EDIT AND IMPUTATION RULES: EFFICIENCY IN EXECUTION. Evaluation of limit scenarios in which database resources become scarce. Under heavy load it could be useful to scale out the number of instantiated dedicated server processes, granting equal conditions to all the tasks into which a job is massively parallelized.

The EDI component in an integrated environment and future enhancements (for achieving platform neutrality). 1) Web service for input selection embedded in rule-based processes. 2) Web service exposing the generalized Oracle component, so that it can be integrated into any different architecture thanks to XML. Tests have been carried out on a small experimental dataset, using XML WebRowSet for exchanging metadata and data (any selection query may be exposed via web and consumed inside a rule-based process in the optimized Oracle server environment). Future experimental performance tests (like those carried out for the parallel component) will involve: the selection via web of sizeable datasets; the use of the engineered parallel engine when data are selected via web; and exposing the whole rule-based component as a web service, taking care of the security issue. [Figure: selection step retrieving data by consuming a SOAP web service (labelled 20000), followed by rule application; the rule application component is exposed via web as a SOAP web service, with input and output parameters, on top of the efficient execution server (Oracle).]

The EDI component in an integrated environment and future enhancements (for achieving platform neutrality). SAMPLE STANDARD XML FOR EXCHANGING QUERY RESULTS AS METADATA AND DATA

<webRowSet xmlns="http://java.sun.com/xml/ns/jdbc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://java.sun.com/xml/ns/jdbc http://java.sun.com/xml/ns/jdbc/webrowset.xsd">
  <metadata>
    <column-count></column-count>
    <column-definition>
      <column-index></column-index>
      <column-display-size></column-display-size>
      <column-label></column-label>
      <column-name></column-name>
      <schema-name></schema-name>
      <column-precision></column-precision>
      <column-scale></column-scale>
      <column-type-name></column-type-name>
    </column-definition>
    ...
  </metadata>
  <data>
    <currentRow>
      <columnValue></columnValue>
      …
    </currentRow>
  </data>
</webRowSet>
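A consumer of such a document could read the column names from the metadata section and pair them with the values of each row. The sketch below does this with Python's standard XML parser on a reduced, hypothetical sample: the column names and values are invented, and the sample leaves out most elements of the full WebRowSet schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by the WebRowSet sample in the slide.
NS = {"wrs": "http://java.sun.com/xml/ns/jdbc"}

# Reduced, hypothetical document with two columns and two rows.
SAMPLE = """<webRowSet xmlns="http://java.sun.com/xml/ns/jdbc">
  <metadata>
    <column-count>2</column-count>
    <column-definition>
      <column-index>1</column-index>
      <column-name>ID</column-name>
      <column-type-name>VARCHAR2</column-type-name>
    </column-definition>
    <column-definition>
      <column-index>2</column-index>
      <column-name>EMPLOYEES</column-name>
      <column-type-name>NUMBER</column-type-name>
    </column-definition>
  </metadata>
  <data>
    <currentRow>
      <columnValue>A</columnValue>
      <columnValue>10</columnValue>
    </currentRow>
    <currentRow>
      <columnValue>B</columnValue>
      <columnValue>4</columnValue>
    </currentRow>
  </data>
</webRowSet>"""


def parse_webrowset(xml_text):
    """Return (column names, rows), each row as a dict of column name to text value."""
    root = ET.fromstring(xml_text)
    names = [
        definition.findtext("wrs:column-name", namespaces=NS)
        for definition in root.findall("wrs:metadata/wrs:column-definition", NS)
    ]
    rows = []
    for row in root.findall("wrs:data/wrs:currentRow", NS):
        values = [value.text for value in row.findall("wrs:columnValue", NS)]
        rows.append(dict(zip(names, values)))
    return names, rows


names, rows = parse_webrowset(SAMPLE)
```

Values arrive as text; converting them according to `column-type-name` is left out of this sketch.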

THANK YOU FOR YOUR ATTENTION. Annalisa Cesaro (cesaro@istat.it), Monica Consalvi, Francesca Alonzi
