Efficiency and generalization as drivers

Efficiency and generalization as drivers for responsive and standard statistical processes. A rule-engine system applied to statistical validation.
Annalisa Cesaro, 15 March 2017, Brussels

AGENDA
- Deterministic data editing and imputation in GSBPM in modern statistics
- The EDI component and its application and technological architecture
- The efficiency strategy and future enhancements
- The EDI component in an integrated environment and future enhancements (for achieving platform neutrality)

Deterministic data editing and imputation in GSBPM in modern statistics
WHERE WE ARE MOVING…
[Diagram] From a stovepipe model (redundancy, no sharing), in which Social Statistics and Economic Statistics each run their own data collection, data processing and output dissemination over surveys and administrative registers, to an Integrated Registers System (transversality, sharing all).
Diagram labels for the integrated model: offer increment; data source variability; register inputs from surveys, administrative registers and big data; data integration for entity identification; data integration for entity statistical characterization; statistical operations infrastructure; statistical infrastructure, standards and methods; ICT services, applications and IT objects.

Deterministic data editing and imputation in GSBPM in modern statistics
WHERE WE ARE MOVING…
The same effort is in progress at European level, where sharing is being actively promoted. Unfortunately, most cases of sharing have required significant work to integrate components into different processing and technology environments. This has led to the development of the Common Statistical Production Architecture (CSPA) and its implementation.

ONE RELEVANT APPLICATION
Deterministic data editing and imputation in SBR production

The designed validation framework has been used for Deterministic Statistical Data Editing (DSDE) within the yearly Statistical Business Register (SBR). In particular, it covers phases 5.3 (Review, validate and edit) and 5.4 (Edit and Impute) of the GSBPM.

The SBR is updated yearly by integrating administrative and statistical sources: statistical units are identified starting from legal units, and the main structural and identification variables are estimated for each integrated unit by applying a robust methodology.

Generally, in this validation phase it is possible to switch from a vertical database structure to a horizontal one, building a single table that holds all the information needed for each statistical unit. This structure makes it easy to define validation rules for each statistical unit and, where a linking key groups several units, to define validation rules within disjoint groups.

Hence, the DSDE validation process applies to microdata and examines each record to identify potential problems, errors and discrepancies, such as outliers and miscoding. It is run iteratively. Data are flagged for automatic or manual inspection or editing.
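The switch from a vertical to a horizontal structure can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the slides do not show the SBR schema, so the attribute names (`group_id`, `employees`, `turnover`), the sample values and the two example rules are hypothetical.

```python
from collections import defaultdict

# Hypothetical vertical (entity-attribute-value) records: (unit_id, attribute, value).
vertical_rows = [
    ("U1", "group_id", "G1"),
    ("U1", "employees", 12),
    ("U1", "turnover", 340.0),
    ("U2", "group_id", "G1"),
    ("U2", "employees", 3),
    ("U3", "group_id", "G2"),
    ("U3", "turnover", 95.5),
]

def pivot_to_horizontal(rows, attributes):
    """Build one record per unit, with one column per attribute (None if missing)."""
    table = defaultdict(lambda: dict.fromkeys(attributes))
    for unit_id, attribute, value in rows:
        table[unit_id][attribute] = value
    return {unit_id: record for unit_id, record in sorted(table.items())}

def group_by_key(table, key):
    """Partition unit ids into disjoint groups that share the same linking key."""
    groups = defaultdict(list)
    for unit_id, record in table.items():
        groups[record[key]].append(unit_id)
    return dict(groups)

horizontal = pivot_to_horizontal(vertical_rows, ["group_id", "employees", "turnover"])
groups = group_by_key(horizontal, "group_id")

# Illustrative per-unit rule: flag units whose turnover is missing.
missing_turnover = [u for u, r in horizontal.items() if r["turnover"] is None]

# Illustrative within-group rule: total employees per group, ignoring missing values.
employees_per_group = {
    g: sum(horizontal[u]["employees"] or 0 for u in units)
    for g, units in groups.items()
}
```

Once every unit is a single record, a per-unit rule is a simple predicate on that record, and a group rule only needs the partition induced by the linking key.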

The EDI component and its application and technological architecture
WHAT IS NEEDED…
The EDI component performs editing and imputation operations on a list of statistical entities, retrieving for each entity the input statistical variables to be processed. This list is called the base.
The EDI component is based on a set of deterministic rules, which may involve several entities linked by a coupling key (a rule links different entities of the same base by matching the coupling key).
The EDI component processes the rules against a single relational staging table, which has one record per entity, collects all the input data for each entity, and records the correction actions produced by the EDI process.
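As a rough illustration of how deterministic rules, a coupling key and a log of correction actions fit together, here is a small in-memory sketch. It is not the EDI component itself: the `Rule` class, the `run_rules` function, the variable names (`employees`, `activity_code`) and both rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    target: str                                 # variable the rule may correct
    applies: Callable[[dict, list], bool]       # (record, coupled records) -> bool
    new_value: Callable[[dict, list], object]   # (record, coupled records) -> value

def run_rules(base, rules, coupling_key="coupling_id"):
    """Apply the rules in order to every record; return the correction actions."""
    by_key = {}
    for rec in base:
        by_key.setdefault(rec[coupling_key], []).append(rec)
    actions = []
    for rule in rules:
        for rec in base:
            # Coupled records: other entities of the same base with the same key.
            peers = [p for p in by_key[rec[coupling_key]] if p is not rec]
            if rule.applies(rec, peers):
                old = rec[rule.target]
                rec[rule.target] = rule.new_value(rec, peers)
                actions.append((rec["unit_id"], rule.name, rule.target, old, rec[rule.target]))
    return actions

# Hypothetical staging table: one record per entity.
base = [
    {"unit_id": 1, "coupling_id": "A", "employees": -4, "activity_code": "47.11"},
    {"unit_id": 2, "coupling_id": "A", "employees": 10, "activity_code": None},
    {"unit_id": 3, "coupling_id": "B", "employees": 7, "activity_code": None},
]

rules = [
    # Single-entity rule: a negative count is replaced by its absolute value.
    Rule("negative_employees", "employees",
         lambda r, peers: r["employees"] is not None and r["employees"] < 0,
         lambda r, peers: abs(r["employees"])),
    # Coupled rule: a missing code is taken from an entity with the same coupling key.
    Rule("code_from_coupled_unit", "activity_code",
         lambda r, peers: r["activity_code"] is None and any(p["activity_code"] for p in peers),
         lambda r, peers: next(p["activity_code"] for p in peers if p["activity_code"])),
]

actions = run_rules(base, rules)
```

Because the rules are deterministic and applied in a fixed order, rerunning the process on the same base yields the same corrections, and the action log records the old and new value of each one.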

TESTED SELECTION RULES APPROACH
WHAT IS NEEDED…
The EDI component needs:
- the list of entities;
- the input selection strategy for populating the variables by retrieving them from different sources (local, remote, or exposed as web services);
- the set of rules, referred to a fixed base table structure;
- the base table structure;
- standardized output for evaluating and downloading the results of the executed editing and imputation process.

A STANDARD EDITING AND IMPUTATION PROCESS
[Diagram] Integrated sources (Admin, Survey, BigData, Foreign), expressed in terms of a unique ID in an entity-attribute-value model, feed the base table that sustains editing and imputation processing.
Base table columns: Unique ID, Coupling ID, input variables (attributes), output variables (attributes).
Attribute types: attributes (properties) for entity identification; attributes (properties) for entity statistical characterization.
Process steps: rule definition; base table truncation; @ID list; selection input rules; edit and imputation rules; standard reports production.

RULE DEFINITION IN A SQL-LIKE LANGUAGE
A STANDARD EDITING AND IMPUTATION PROCESS: RULE DEFINITION IN A SQL-LIKE LANGUAGE
The base table structure has to be given.
Java-based web application, with layers: GUI, Actions, Services, DAOs, Entities.
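To make the idea of SQL-like rules over a fixed base table structure more concrete, here is a hedged sketch. The slides do not show the rule syntax, so the rule format (a name, a condition and an assignment expression), the table layout and the imputed values are all assumptions. SQLite is used only as a stand-in so that the example runs anywhere; the component described here runs on Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE base_table (
        unit_id       INTEGER PRIMARY KEY,
        coupling_id   TEXT,
        employees_in  INTEGER,  -- input variable
        employees_out INTEGER,  -- output variable
        rule_applied  TEXT      -- name of the rule that set the output
    )""")
conn.executemany(
    "INSERT INTO base_table (unit_id, coupling_id, employees_in) VALUES (?, ?, ?)",
    [(1, "A", 12), (2, "A", None), (3, "B", -5)],
)

# Hypothetical rule metadata: (name, condition, assignment expression), all
# written against the fixed base table structure.
rules = [
    ("R1_copy_valid", "employees_in >= 0", "employees_in"),
    ("R2_missing_or_negative", "employees_in IS NULL OR employees_in < 0", "0"),
]

def apply_rule(conn, name, condition, expression):
    """Render one rule as an UPDATE on records no earlier rule has touched.

    Rule text is interpolated into SQL, so it must come from a trusted rule
    repository, never from end-user input.
    """
    sql = (
        f"UPDATE base_table SET employees_out = {expression}, rule_applied = ? "
        f"WHERE rule_applied IS NULL AND ({condition})"
    )
    return conn.execute(sql, (name,)).rowcount

changed = {name: apply_rule(conn, name, cond, expr) for name, cond, expr in rules}
result = conn.execute(
    "SELECT unit_id, employees_out, rule_applied FROM base_table ORDER BY unit_id"
).fetchall()
```

Keeping the table structure fixed is what lets rules be stored as data: each one only has to name columns that are guaranteed to exist.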

RULE BASED PROCESS WEB EXECUTION MONITORING
A STANDARD EDITING AND IMPUTATION PROCESS: RULE-BASED PROCESS WEB EXECUTION MONITORING
The rule-based process has been implemented in Oracle. It relies on a high-performance database schema which, thanks to an engineered partitioning and indexing strategy, keeps rule-based process execution and downloading decoupled.
Its application is wrapped in an Oracle procedure with standard parameters, whose scheduling can be controlled remotely (web application, web service).

EDIT AND IMPUTATION RULES
A STANDARD EDITING AND IMPUTATION PROCESS: SELECTION INPUT RULES, EDIT AND IMPUTATION RULES
EFFICIENCY IN EXECUTION
[Diagram] All data are split into data chunks of 20000, each a subset of the base table. Active dedicated server processes handle the chunks, mapping each entity in relation to a given rule. At the end, a reduce step aggregates the results for reporting purposes.
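The chunk, map and reduce pattern in this slide can be sketched as follows. This is only an analogy under stated assumptions: Python threads stand in for the Oracle dedicated server processes, the rule (flag negative values) and the entity layout are hypothetical, and only the chunk size of 20000 comes from the slide.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 20000  # chunk size shown on the slide

def split_into_chunks(entities, size=CHUNK_SIZE):
    """Split the base into consecutive, disjoint subsets of at most `size` entities."""
    return [entities[i:i + size] for i in range(0, len(entities), size)]

def map_chunk(chunk):
    """Map step: evaluate a (hypothetical) rule on each entity of one chunk."""
    counts = Counter()
    for entity in chunk:
        counts["flagged" if entity["value"] < 0 else "ok"] += 1
    return counts

def reduce_counts(partials):
    """Reduce step: aggregate the per-chunk results for reporting."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical base of 50000 entities; every tenth one carries an invalid value.
entities = [{"id": i, "value": -1 if i % 10 == 0 else i} for i in range(50000)]
chunks = split_into_chunks(entities)

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_chunk, chunks))

summary = reduce_counts(partials)
```

Because the chunks are disjoint, the map step needs no coordination between workers, and the only shared work is the final aggregation.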

STANDARD REPORTING FOR MONITORING AND DOWNLOADING
The EDI component and its application and technological architecture A STANDARD EDITING AND IMPUTATION PROCESS STANDARD REPORTING FOR MONITORING AND DOWNLOADING

The efficiency strategy and future enhancements
EFFICIENCY IN EXECUTION (selection input rules, edit and imputation rules)
[Charts] Two charts, "Speed up w.r.t. best NoParallel" (values shown: 13%, 53%, 66%) and "Scalability level", compare the configurations Parallel_P1, Parallel_P2, Parallel_P3, Parallel_NoP, Parallel_alone, NoParallel_launch1, NoParallel_launch2 and NoParallel_launch3. Axis ticks run from 1 to 4 and from 5 to 30.

The efficiency strategy and future enhancements
EFFICIENCY IN EXECUTION (selection input rules, edit and imputation rules)
Evaluation of limit scenarios in which DB resources become scarce. Under heavy load it could be useful to scale out the number of instantiated dedicated server processes, granting equal conditions to all the tasks into which a job is massively parallelized.

The EDI component in an integrated environment and future enhancements (for achieving platform neutrality)

1) Web service for input selection embedded in rule-based processes.
2) Web service exposing the generalized Oracle component, so that it can be integrated into any different architecture thanks to XML.

Tests have been carried out on a small experimental dataset, using XML WebRowSet to exchange metadata and data. (Any selection query may be exposed via web and consumed inside a rule-based process in the optimized Oracle Server environment.)

Future experimental performance tests (like those carried out for the parallel component) will involve:
- the selection via web of sizeable datasets;
- the use of the engineered parallel engine also when data are selected via web;
- exposing the whole rule-based component as a web service, taking care of the security issue.

[Diagram] Selection: retrieving data by consuming a SOAP web service (label: 20000). Rule application: the rule application component is exposed via web as a SOAP web service, with input params and output params, on the efficient execution server (Oracle).

FOR EXCHANGING QUERY RESULTS AS METADATA AND DATA
SAMPLE STANDARD XML FOR EXCHANGING QUERY RESULTS AS METADATA AND DATA

<webRowSet xmlns="..." xmlns:xsi="..." xsi:schemaLocation="...">
  <metadata>
    <column-count></column-count>
    <column-definition>
      <column-index></column-index>
      <column-display-size></column-display-size>
      <column-label></column-label>
      <column-name></column-name>
      <schema-name></schema-name>
      <column-precision></column-precision>
      <column-scale></column-scale>
      <column-type-name></column-type-name>
    </column-definition>
    ...
  </metadata>
  <data>
    <currentRow>
      <columnValue></columnValue>
      ...
    </currentRow>
  </data>
</webRowSet>
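A consumer of this format could read the column names from the metadata and then build one record per `currentRow`. The sketch below is an assumption-laden illustration, not the component's code: the function names are hypothetical, the namespace URI in the sample is a placeholder (the real one is elided in the slide), and the parser ignores namespaces so that it does not depend on the missing value.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Drop any '{namespace}' prefix so the parser does not depend on the URI."""
    return tag.rsplit("}", 1)[-1]

def parse_webrowset(xml_text):
    """Return (column names, rows as dicts) from a WebRowSet-style document."""
    root = ET.fromstring(xml_text)
    columns, rows = [], []
    # iter() walks in document order, so metadata is read before the data rows.
    for elem in root.iter():
        name = local(elem.tag)
        if name == "column-name":
            columns.append(elem.text)
        elif name == "currentRow":
            values = [c.text for c in elem if local(c.tag) == "columnValue"]
            rows.append(dict(zip(columns, values)))
    return columns, rows

# Hypothetical sample; the namespace URI is a placeholder, not the real one.
sample = """<webRowSet xmlns="urn:example:placeholder">
  <metadata>
    <column-count>2</column-count>
    <column-definition><column-index>1</column-index><column-name>UNIT_ID</column-name></column-definition>
    <column-definition><column-index>2</column-index><column-name>EMPLOYEES</column-name></column-definition>
  </metadata>
  <data>
    <currentRow><columnValue>1</columnValue><columnValue>12</columnValue></currentRow>
    <currentRow><columnValue>2</columnValue><columnValue>7</columnValue></currentRow>
  </data>
</webRowSet>"""

columns, rows = parse_webrowset(sample)
```

Values arrive as text; a fuller consumer would use the `column-type-name` metadata to convert them, which this sketch leaves out.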

THANK YOU FOR YOUR ATTENTION
Annalisa Cesaro, Monica Consalvi, Francesca Alonzi

A STANDARD EDITING AND IMPUTATION PROCESS
The EDI component and its application and technological architecture
Process steps: rule definition; base table truncation; @ID list; selection input rules; edit and imputation rules; standard reports production.
