S-DWH Approach to statistical data management: The practice case of the SBS production process in ISTAT Francesco Altarocca, ISTAT Diego Bellisai, ISTAT.


1 S-DWH Approach to statistical data management: The practice case of the SBS production process in ISTAT Francesco Altarocca, ISTAT Diego Bellisai, ISTAT Antonio Laureti Palma, ISTAT

2 Summary
- What is the SBS-Frame
- SBS-Frame production process
- A Statistical Data Warehouse (S-DWH) for supporting the production workflow
- INSIDE: INtegrated StatIstical Datawarehouse Environment software
- INSIDE to support the SBS-Frame workflow
- Features of INSIDE

3 SBS-Frame in the context of business statistics
- The Frame allows ISTAT to obtain by summation the main economic aggregates required by the Eurostat SBS (Structural Business Statistics) Regulation.
- The Frame allows ISTAT to overcome the limitations of the estimation domains of the sample surveys: accurate estimates become possible for a large number of sub-populations.
- A detailed and multidimensional mapping of the enterprises is possible.
- It represents the new base for the National Accounts ESA 2010 (SEC 2010) estimates.
- It will be integrated at the micro level with foreign trade statistics (geographic features of exports, foreign-controlled enterprises) and labour statistics (employee characteristics, wage levels).

4 SBS-Frame variables
The SBS Frame is an archive of the main annual economic variables on all the active Italian enterprises (around 4.4 million units).
SBS Frame base variables are:
From the SBR:
1. Enterprise ID number
2. Economic activity
3. Number of employees
New variables derived for the SBS-Frame:
4. Turnover
5. Labour cost
6. Total purchases of goods and services
7. Value added at factor cost
8. Gross operating margin
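Because the Frame holds one record per enterprise, aggregates for any sub-population can be obtained by simple summation over the microdata. The following is a minimal sketch of that idea, not ISTAT's implementation: the field names (`ateco`, `turnover`, `labour_cost`) and all values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical Frame records: field names and values are illustrative only.
frame = [
    {"id": 1, "ateco": "10", "employees": 12, "turnover": 900.0, "labour_cost": 300.0},
    {"id": 2, "ateco": "10", "employees": 3, "turnover": 150.0, "labour_cost": 40.0},
    {"id": 3, "ateco": "47", "employees": 1, "turnover": 80.0, "labour_cost": 0.0},
]


def aggregate(records, by, variables):
    """Sum the given variables over the sub-populations defined by `by`."""
    totals = defaultdict(lambda: dict.fromkeys(variables, 0.0))
    for rec in records:
        for var in variables:
            totals[rec[by]][var] += rec[var]
    return dict(totals)


# Aggregates by economic activity, obtained by summing enterprise-level values.
totals = aggregate(frame, by="ateco", variables=["turnover", "labour_cost"])
```

Changing the `by` argument to any other classification variable gives the aggregates for a different set of sub-populations without touching the microdata.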

5 The Frame’s Sources
ID | Source | Description | Supplier | units | #vars
FS | Financial Statements | annual profit and loss statements of limited liability companies | Chambers of Commerce | 750K | ~300
SS | Sector Studies survey | SMEs with turnover in [30K-7.5M] euros | Italian Revenue Agency | 3.5M | ~60
UN | Tax returns form | unified model of tax declarations by legal form, containing economic information for different legal forms | Italian Revenue Agency | 4.4M | ~60
IRAP | Regional Tax on Productive Activities form | model of declaration for Regional Tax on Productive Activities payment | Italian Revenue Agency | 4.4M | ~70
SME | Small Medium Ent. Survey | sample survey on enterprises with less than 100 employees | ISTAT | 100K | ~200
RACLI | Labour Cost by Enterprise Reg. | Register of Labour Cost by Enterprise | ISTAT | 1.5M | ~20
SBR | Business Register | Italian official Business Register of Active Enterprises | ISTAT | 4.4M | ~20

6 The Frame’s Sources: integrated micro-data view
[Figure: unit-by-variable matrix. Rows are the SBR units 1 ... N (4.3 mil); columns are the variables supplied by each source. Source coverage: UNICO ≈ 97% of SMEs; SS ≈ 80% of SMEs; RACLI ≈ 33% of SMEs; FS ≈ 16% of SMEs; Survey; not covered ≈ 4%.]

7 The SBS-Frame process
[Diagram labels: sources FS, SS, UN, IRAP; data harmonization (consistency CHK, duplicate CHK, discard bad data, labour cost correction); standardized dataset; integration; SBR; RACLI; base FRAME; imputation, estimation of key vars; imputation of missing units; FRAME key variables; estimation, imputation of other variables; FRAME variables of detail; final dataset]
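The harmonization step named on this slide (duplicate check, consistency check, discard bad data) can be sketched as follows. This is only an illustration under stated assumptions: the slides do not give the actual rules, so the consistency rule used here (turnover must be present and non-negative) and the record fields are hypothetical.

```python
def harmonize(records):
    """Return (kept, discarded) after a duplicate check and a consistency check."""
    seen = set()
    kept, discarded = [], []
    for rec in records:
        if rec["id"] in seen:
            # Duplicate check: keep only the first record for each enterprise ID.
            discarded.append((rec, "duplicate"))
        elif rec["turnover"] is None or rec["turnover"] < 0:
            # Consistency check: assumed rule, not taken from the slides.
            discarded.append((rec, "inconsistent"))
        else:
            seen.add(rec["id"])
            kept.append(rec)
    return kept, discarded


# Made-up source records for one provision.
source = [
    {"id": "A", "turnover": 100.0},
    {"id": "A", "turnover": 100.0},
    {"id": "B", "turnover": -5.0},
    {"id": "C", "turnover": 20.0},
]
kept, discarded = harmonize(source)
```

Keeping the discarded records together with the reason, rather than silently dropping them, fits the requirement stated later in the deck to track methodological choices.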

8 The SBS-Frame process
[Diagram repeated from slide 7 with the same labels]

9 SBS-Frame process features
- annual activity
- complex workflow
- iterative activities
- many actors’ interactions
- different actor skills
- variability of the sources
- distributed computing
- tracking methodological choices
- replicability of results
- documenting processes
- storing distributed knowledge for safety in the production process

10 SBS-Frame process features
- The SBS-Frame production process is executed once a year.
- Due to source variability, the process has to be adjusted every year.
- The adjustment phase takes two months.
- At the end of the adjustment phase the process is executed only once.

11 scientific workflows
According to the IT literature, the SBS-Frame production process can be managed by a scientific workflow management system based on a data-centric approach.
A scientific workflow is a specialized workflow designed to compose and execute a series of complex computational or data manipulation steps.
Basic requirements are:
- to construct simple what-if scenarios
- to support iterative hypothesis testing activities
- to share knowledge
- to track methodological choices
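The requirements listed above (what-if scenarios, iterative testing, tracked methodological choices) can be illustrated with a toy workflow runner. This is a sketch under assumptions, not a description of any real workflow management system: the `Workflow` class and the `trim` and `scale` steps are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Workflow:
    """Runs named steps in order and logs the parameters used by each step."""
    steps: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def add(self, name: str, func: Callable[..., Any], **params: Any) -> "Workflow":
        self.steps.append((name, func, params))
        return self

    def run(self, data: Any, **overrides: dict) -> Any:
        for name, func, params in self.steps:
            # A what-if scenario overrides the default parameters of a step.
            used = {**params, **overrides.get(name, {})}
            data = func(data, **used)
            # Every methodological choice is recorded next to the step name.
            self.log.append({"step": name, "params": used})
        return data


def trim(values, threshold):
    """Hypothetical step: drop values above a threshold."""
    return [v for v in values if v <= threshold]


def scale(values, factor):
    """Hypothetical step: multiply every value by a factor."""
    return [v * factor for v in values]


wf = Workflow().add("trim", trim, threshold=100).add("scale", scale, factor=2)
baseline = wf.run([10, 50, 500])
what_if = wf.run([10, 50, 500], trim={"threshold": 1000})
```

After both runs, `wf.log` holds one entry per executed step, so the baseline and the what-if scenario can be compared and replicated from the recorded parameters.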

12 data-centric approach: scientific workflows
[Diagram labels: SETUP (reuse, composition); EXECUTION; ANALYSIS (evaluation, validation); DISSEMINATION; data warehouse]

13 Statistical Data Warehouse (S-DWH)
To support a data-centric approach we use a Statistical Data Warehouse (S-DWH) as a single central data store.
Basic requirements for the S-DWH are:
- an easy-to-use environment for accessing complex data
- control of information visibility
- support of multiple-purpose statistical information in a specific statistical domain
- a metadata-driven model
- a single integrated system

14 Layered S-DWH
From an architectural point of view, we can identify four conceptual layers in the S-DWH:
- access layer: for the final presentation, dissemination and delivery of the information sought;
- interpretation and data analysis layer: enables data analysis or evaluation for statistical design;
- integration layer: where all operational activities are carried out; in this layer data are integrated and transformed in order to increase the performance and usability of the upper layer;
- source layer: where data sources are stored, either internal data (from surveys or elaboration steps) or external data (from administrative provisions).

15 SBS-Frame process in S-DWH
[Diagram labels: IRAP, Unico, SME survey, SS, FS, SBR; preliminary treatment; variables mapping; data linking; integrated micro-data]

16 SBS-Frame process in S-DWH
[Diagram labels as on slide 15, plus: base Frame; Key VAR Frame; detail VAR Frame; missing imputation; treatment of inconsistent data; FRAME SBS; SBS views]

17 SBS-Frame process in S-DWH
[Diagram labels as on slide 16, plus the four layers: source, integration, interpretation, access]

18 SBS-Frame process in S-DWH
[Diagram labels as on slide 17, plus: data-centric approach workflow]

19 The implementation of INSIDE
To support the SBS-Frame production, the INSIDE (INtegrated StatIstical Datawarehouse Environment) software application has been implemented. INSIDE is the S-DWH used to support the SBS-Frame workflow.
INSIDE is:
- specialized in structural business statistics
- metadata driven
- based on 4 layers

20 INSIDE basic architecture

21 INSIDE: user roles
Role | Description
source mapper | a source expert responsible for the mapping of economic variables
data analyst | performs statistical analysis and is in charge of all or part of the statistical production process
data administrator | responsible for managing the data flows, user authorization and system maintenance
[Further table columns on the slide: Source, Integration, Interpretation, Access]

22 SBS-Frame process in S-DWH
[Diagram repeated from slide 17 with the same labels]

23 INSIDE: user: mapper
- Data mapping is the process of creating data element mappings between two distinct data models.
- The mapper is a source expert on economic variables, responsible for the coherent mapping between source variables and internal S-DWH variables.
- This operational activity is carried out in order to overcome the lack of control over source provisions.
[Diagram labels: source; integration; IRAP, SS, FS, SME survey, ...; variables mapping; internal dictionary; SBS-Frame]

24 INSIDE: user: mapper
- allows data analysts to access data regardless of their knowledge of the data-source layout
- works via a web interface directly on the source layer through a common metadata layer
- maps, automatically or manually, the source variables to the internal dictionary
- can search for keywords in the documentation attached to the source supply
[Diagram labels: source; integration; IRAP, SS, FS, SME survey, ...; variables mapping; internal dictionary; SBS-Frame]
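Automatic mapping of source variables to the internal dictionary can be pictured as ranking dictionary items by a similarity score. The slides do not say how INSIDE computes its percentage of association, so the sketch below swaps in plain string similarity from Python's standard `difflib` module as a stand-in; the dictionary entries and the source label are hypothetical.

```python
from difflib import SequenceMatcher


def suggest_mappings(source_var, dictionary, top=5):
    """Rank dictionary items by string similarity to a source variable label."""
    scored = [
        (item, round(100 * SequenceMatcher(None, source_var.lower(), item.lower()).ratio(), 1))
        for item in dictionary
    ]
    # Highest percentage first; keep only the best `top` candidates.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Hypothetical internal dictionary items.
dictionary = [
    "turnover",
    "labour cost",
    "number of employees",
    "value added at factor cost",
    "gross operating margin",
    "total purchases of goods and services",
]
suggestions = suggest_mappings("Total turnover", dictionary)
```

A mapper would then confirm the best candidate or pick another one by hand, which corresponds to the automatic and manual modes mentioned on the slide.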

25 SBS-Frame process in S-DWH
[Diagram repeated from slide 17 with the same labels]

26 INSIDE: user: analyst
Data analysts make the statistical evaluations. They work through an access-layer web interface to reach the interpretation-layer data. The environment is optimized for the analysis of large amounts of data. This allows:
- basic analysis
- data mining
- testing by hypotheses
- simulation
- model processes
- procedure setup
[Diagram labels: data analyst; access; interpretation; FRAME SBS; View SBS; View NA; SAS; WF]

27 INSIDE: user: analyst
The data analyst:
- extracts microdata from a list of selected data sources
- previews the extraction
- has access permission
- creates a view in a private area
- shares the created views
- can access their views through standard statistical software (SAS, R, Stata, ...) or through a database query language (SQL, PL/SQL)
- creates views as the starting points for further elaboration steps in the S-DWH life cycle
[Diagram labels: data analyst; access; interpretation; FRAME SBS; View SBS; View NA; SAS; WF]

28 INSIDE: user: data administrator
Data administrator:
- responsible for the data flow management
- loads data sources, metadata and microdata
- acceptance operations
- manages data in all layers
- optimizes the data warehouse
[Diagram labels: provisions; provision layout; dictionary; facts; view layout; docs; dimensions]

29 data model: metadata logical schemes
[Diagram labels: layers source, integration, interpretation, access; provisions; layouts; dictionary; provision view layouts; docs; dimensions; facts; timing; monitoring; user]

30 data model: data logical schemes
[Diagram labels: layers integration, interpretation, access; source tables; provisions; SBR; surveys; derived source; data hub; SBR, UNICO, FS, SS; DICTIONARY; fact tables; EMP CLASS, GEO, ATECO, JUR. FORM, FS DIM, SS DIM; views/marts; PROVISION SBS]

31 data model: metadata logical scheme
Key metadata concepts:
- PROVISION: set of information provided by a source (administrative or survey) relating to a certain period
- SUBPROVISIONS: microdata, record layout, metadata, classifications, documentation, which represent the detailed subsets of the provision
- METASOURCE: record layout, elementary variables contained in a source
- DOCUMENTATION: set of documents (general information, tax instructions, etc.) related to a certain provision
- DICTIONARY: items of the S-DWH, classified into measures, dimensions, unique identifiers and other non-identifying variables

32 data model: metadata logical scheme
Key metadata concepts:
- DW_META_SOURCE: key associative table; contains the association between the microdata table fields and the DW dictionary items
- DW_META_ATTRIB: dictionary of the DW concepts
- DW_META_SOURCE_PCT: table functional to the mapping operation; contains the calculated distances between the microdata table fields and the DW dictionary items
- DW_DIMENSION: description of, and physical references to, the dimensions used in the DW
- DW_FACT: description of, and physical references to, the facts contained in the DW
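The relationship between the first three metadata tables can be sketched with an in-memory SQLite database. Only the table names come from the slide; every column name, the sample rows and the use of SQLite in place of the Oracle database listed later in the deck are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DW_META_ATTRIB (          -- dictionary of the DW concepts
    attrib_id INTEGER PRIMARY KEY,     -- hypothetical column
    name      TEXT NOT NULL,           -- hypothetical column
    kind      TEXT NOT NULL            -- e.g. measure, dimension, identifier
);
CREATE TABLE DW_META_SOURCE (          -- microdata field -> dictionary item
    source_table TEXT NOT NULL,
    source_field TEXT NOT NULL,
    attrib_id    INTEGER NOT NULL REFERENCES DW_META_ATTRIB(attrib_id),
    PRIMARY KEY (source_table, source_field)
);
CREATE TABLE DW_META_SOURCE_PCT (      -- candidate matches with their score
    source_table TEXT NOT NULL,
    source_field TEXT NOT NULL,
    attrib_id    INTEGER NOT NULL REFERENCES DW_META_ATTRIB(attrib_id),
    pct          REAL NOT NULL
);
""")

# Made-up rows: one dictionary item and one confirmed mapping to it.
con.execute("INSERT INTO DW_META_ATTRIB VALUES (1, 'turnover', 'measure')")
con.execute("INSERT INTO DW_META_SOURCE VALUES ('FS_MICRODATA', 'REVENUE_FIELD', 1)")

# Resolve each mapped source field to its dictionary concept.
rows = con.execute("""
    SELECT s.source_table, s.source_field, a.name
    FROM DW_META_SOURCE AS s JOIN DW_META_ATTRIB AS a USING (attrib_id)
""").fetchall()
```

With a mapping table of this kind, a query can be written against dictionary concepts and translated to physical source fields, which is what lets analysts ignore the layout of each data source.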

33 INSIDE software application: user modules MAPPING VIEWER

34 selection list of provision types and supply period INSIDE software application: mapper welcome

35 provision list and time period INSIDE software application: mapper

36 mapping view results sources’ variables button for automatic mapping INSIDE software application: mapper S-DWH dictionary

37 automatic mapping results probabilistic matching, percentage of association manual matching INSIDE software application: mapper

38 detail of the selected meta-attribute probabilistic matching: details of possible matches and percentage of association for the first five results INSIDE software application: mapper

39 INSIDE architecture: two user modules MAPPING VIEWER VIDEO

40 INSIDE architecture: two user modules MAPPING VIEWER

41 INSIDE software application: viewer
The viewer is based on two sub-modules:
- View Builder: module for query building, with the possibility of a data preview;
- View Manager: module for the management of the views created.
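What a view builder does, composing a query from a select area and a where area and then offering a preview, can be sketched as below. This is an assumption-laden illustration, not the INSIDE code: the function `build_view_sql`, the table `fact_sbs`, its columns and its rows are all hypothetical, and SQLite stands in for the production database.

```python
import sqlite3


def build_view_sql(view_name, fact_table, select_fields, where=None):
    """Compose a CREATE VIEW statement from a select area and a where area."""
    # Identifiers are assumed to come from trusted metadata; they are not escaped.
    columns = ", ".join(select_fields)
    sql = f"CREATE VIEW {view_name} AS SELECT {columns} FROM {fact_table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sbs (id INTEGER, year INTEGER, ateco TEXT, turnover REAL)")
con.executemany(
    "INSERT INTO fact_sbs VALUES (?, ?, ?, ?)",
    [(1, 2012, "10", 900.0), (2, 2012, "47", 80.0), (3, 2011, "10", 700.0)],
)

# Select area: id and turnover. Where area: one supply year and one activity code.
sql = build_view_sql(
    "v_sample", "fact_sbs", ["id", "turnover"],
    where=["year = 2012", "ateco = '10'"],
)
con.execute(sql)

# Data preview: a bounded read of the new view.
preview = con.execute("SELECT * FROM v_sample LIMIT 10").fetchall()
```

Once created, the view can be read by any client that speaks SQL, which matches the earlier point that analysts reach their views from statistical packages or database query languages.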

42 facts list INSIDE software application: viewer: view builder building area: select area building area: where area

43 detail of variable INSIDE software application: viewer: view builder

44 search for variables through tags INSIDE software application: viewer: view builder

45 where conditions select condition fact links INSIDE software application: viewer: view builder

46 view preview INSIDE software application: viewer: view builder view name filter of yearly supply

47 query detail of the view recalled INSIDE software application: viewer: view manager view manager

48 query detail of the view INSIDE software application: viewer: view manager

49 view data preview INSIDE software application: viewer: view manager

50 view features: query structure, number of records, execution time INSIDE software application: viewer: view manager
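The three view features named on this slide (query structure, number of records, execution time) can be collected with a few lines of code. The sketch below uses SQLite and a hypothetical helper `view_features`; the table, view and values are made up, and the timing covers only the record count query.

```python
import sqlite3
import time


def view_features(con, view_name):
    """Return the stored query, the record count and the time taken to count."""
    (query,) = con.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = ?", (view_name,)
    ).fetchone()
    start = time.perf_counter()
    # The view name is assumed to be trusted; it is interpolated as an identifier.
    (n_records,) = con.execute(f"SELECT COUNT(*) FROM {view_name}").fetchone()
    elapsed = time.perf_counter() - start
    return {"query": query, "records": n_records, "seconds": elapsed}


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sbs (id INTEGER, turnover REAL)")
con.executemany("INSERT INTO fact_sbs VALUES (?, ?)", [(1, 900.0), (2, 80.0), (3, 0.0)])
con.execute("CREATE VIEW v_positive AS SELECT id FROM fact_sbs WHERE turnover > 0")

features = view_features(con, "v_positive")
```

Storing these figures alongside each view gives a simple record of what was run and how costly it was, which helps when views are shared between analysts.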

51 INSIDE architecture: two user modules MAPPING VIDEO VIEWER

52 2-tier system INSIDE data analyst: desktop application environment VIDEO INSIDE architecture: two user modules PUBLISHING & SHARING SERVICES CONTENT MANAGEMENT

53 2-tier system INSIDE data analyst: workflow environment INSIDE architecture: two user modules CONTENT MANAGEMENT PUBLISHING & SHARING SERVICES WORKFLOW SERVICE

54 INSIDE Technologies/FW
Technology | Description
Red Hat EL Server 5.11 | Web server operating system
PHP 5.3 | Application development language
APACHE 2.0 | Web server software
Yii 1.1 | MVC application framework
Bootstrap | HTML, CSS and JavaScript framework (included in Yii)
jQueryUI | User interface framework JavaScript library
PdfJS viewer | PDF viewer library
Highcharts | Interactive charting library in JavaScript
DataTables | jQuery plug-in for advanced interaction controls for HTML tables
Red Hat EL Server 5.8 | Database server operating system
Oracle 11g | Relational database management system

55 thanks for your attention

