Published by Peter Freeman. Modified over 8 years ago.
1
S-DWH Approach to statistical data management: The practice case of the SBS production process in ISTAT
Francesco Altarocca, ISTAT
Diego Bellisai, ISTAT
Antonio Laureti Palma, ISTAT
2
Summary
- What is the SBS-Frame
- SBS-Frame production process
- A Statistical Data Warehouse (S-DWH) for supporting the production workflow
- INSIDE: INtegrated StatIstical Datawarehouse Environment software
- INSIDE to support the SBS-Frame workflow
- Features of INSIDE
3
SBS-Frame in the context of business statistics
- The Frame allows ISTAT to obtain by sum the main economic aggregates required by the Eurostat SBS (Structural Business Statistics) Regulation
- The Frame allows ISTAT to overcome the limitations of the estimation domains of the sample surveys, making accurate estimates possible for a relevant number of sub-populations
- A detailed and multidimensional mapping of the enterprises is possible
- It represents the new base for the National Accounts SEC 2010 (ESA 2010) estimates
- It will be integrated at the micro level with foreign trade statistics (geographic features of export, foreign controlled enterprises) and labour statistics (employee characteristics, wage levels)
4
SBS-Frame variables
The SBS Frame is an archive of the main annual economic variables on all the active Italian enterprises (around 4.4 million units).
SBS Frame base variables are:
from the SBR
1- Enterprise ID number
2- Economic activity
3- Number of employees
new variables derived for the SBS-Frame
4- Turnover
5- Labour Cost
6- Total purchases of goods and services
7- Value-added at factor cost
8- Gross operating margin
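Because the Frame holds one record per enterprise, an aggregate for any domain can be obtained by summing over the units in that domain. The sketch below illustrates this with a reduced record layout; the class name, field names and sample values are assumptions made for illustration and are not taken from the ISTAT system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FrameRecord:
    """Reduced, hypothetical layout of one SBS-Frame unit."""
    enterprise_id: str       # from the SBR
    economic_activity: str   # from the SBR (activity code)
    employees: int           # from the SBR
    turnover: float          # derived for the SBS-Frame
    labour_cost: float       # derived for the SBS-Frame


def aggregate_by_activity(records):
    """Sum turnover and labour cost over all units of each economic activity."""
    totals = defaultdict(lambda: {"units": 0, "turnover": 0.0, "labour_cost": 0.0})
    for rec in records:
        bucket = totals[rec.economic_activity]
        bucket["units"] += 1
        bucket["turnover"] += rec.turnover
        bucket["labour_cost"] += rec.labour_cost
    return dict(totals)


# Invented sample records, for illustration only.
sample = [
    FrameRecord("E001", "10.11", 12, 500.0, 200.0),
    FrameRecord("E002", "10.11", 3, 250.0, 50.0),
    FrameRecord("E003", "47.11", 40, 1000.0, 300.0),
]
aggregates = aggregate_by_activity(sample)
```

The same function works for any grouping key (region, size class) by swapping the attribute used to select the bucket.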
5
The Frame’s Sources

| ID | Source | Description | Supplier | units | #vars |
| --- | --- | --- | --- | --- | --- |
| FS | Financial Statements | annual profit and loss statements of limited liability companies | Chambers of Commerce | 750K | ~300 |
| SS | Sector Studies survey | SMEs with Turnover in [30K-7.5M] euros | Italian Revenue Agency | 3.5M | ~60 |
| UN | Tax returns form | unified model of tax declarations by legal form, containing economic information for different legal forms | Italian Revenue Agency | 4.4M | ~60 |
| IRAP | Regional Tax on Productive Activities form | Model of declaration for Regional Tax on Productive Activities payment | Italian Revenue Agency | 4.4M | ~70 |
| SME | Small Medium Ent. Survey | sample survey on enterprises with less than 100 employees | ISTAT | 100K | ~200 |
| RACLI | Labour Cost by Enterprise Reg. | Register of Labour Cost by Enterprise | ISTAT | 1.5M | ~20 |
| SBR | Business Register | Italian official Business Register of Active Enterprises | ISTAT | 4.4M | ~20 |
6
The Frame’s Sources: integrated micro-data view
[Figure: coverage of the N (4.3 mil) SBR units by source: UNICO (97% of SMEs), SS (80% of SMEs), RACLI (33% of SMEs), FS (16% of SMEs), Survey; not covered (4%).]
7
The SBS-Frame process
[Figure: process flow. Data harmonization of FS, SS, UN and IRAP (consistency check, duplicate check, discard of bad data, labour cost correction) yields a standardized dataset; integration with SBR and RACLI yields the base FRAME; imputation and estimation of key variables and imputation of missing units yield the FRAME key variables; estimation and imputation of other variables yield the FRAME variables of detail and the final dataset.]
8
[Figure: the SBS-Frame process flow, repeated from the previous slide.]
9
SBS-Frame process features
- annual activity
- complex workflow
- iterative activities
- many actors’ interactions
- different actor skills
- variability of the sources
- distributed computing
- tracking methodological choices
- replicability of results
- documenting processes
- storing distributed knowledge for safety in the production process
10
SBS-Frame process features
- The SBS-Frame production process is executed once a year
- Due to source variability, it is necessary to make adjustments to the process yearly
- The adjustment phase takes two months
- At the end of the adjustment phase the process is executed only once
11
scientific workflows
According to the IT literature, the SBS-Frame production process can be managed by a scientific workflow management system based on a data-centric approach.
A scientific workflow is a specialized workflow designed to compose and execute a series of complex computational or data manipulation steps.
Basic requirements are:
- to construct simple what-if scenarios
- to support iterative hypothesis testing activities
- to share knowledge
- to track methodological choices
12
data-centric approach: scientific workflows
[Figure: workflow cycle around the data warehouse: SETUP (reuse, composition), EXECUTION, ANALYSIS (evaluation, validation), DISSEMINATION.]
13
Statistical Data Warehouse (S-DWH)
To support a data-centric approach we use a Statistical Data Warehouse (S-DWH) as a single central data store.
Basic requirements for the S-DWH are:
- an easy-to-use environment for accessing complex data
- control of information visibility
- support of multiple-purpose statistical information in a specific statistical domain
- a metadata-driven model
- a single integrated system
14
Layered S-DWH
From an architectural point of view, we can identify four conceptual layers in the S-DWH:
- access layer: for the final presentation, dissemination and delivery of the information sought;
- interpretation and data analysis layer: enables data analysis or evaluation for statistical design;
- integration layer: where all operational activities are carried out; in this layer data are integrated and transformed in order to increase performance and usability of the upper layer;
- source layer: the level where data sources are stored; internal data (from surveys or step elaboration) or external data (from administrative provisions).
15
SBS-Frame process in S-DWH
[Figure: the sources (IRAP, Unico, SME survey, SS, FS, SBR) pass through variables mapping, preliminary treatment and data linking to produce the integrated micro-data.]
16
SBS-Frame process in S-DWH
[Figure: as on the previous slide, extended downstream of the integrated micro-data with treatment of inconsistent data, missing imputation, base Frame, key variables Frame, detail variables Frame, FRAME SBS and SBS views.]
17
SBS-Frame process in S-DWH
[Figure: the same process diagram, with its steps assigned to the four layers: source, integration, interpretation, access.]
18
SBS-Frame process in S-DWH
[Figure: the same layered process diagram, annotated with the data-centric approach and the workflow.]
19
The implementation of INSIDE
To support the SBS-Frame production, the INSIDE (INtegrated StatIstical Datawarehouse Environment) software application has been implemented.
INSIDE is the S-DWH used to support the SBS-Frame workflow. INSIDE is:
- specialized in structural business statistics
- metadata driven
- based on 4 layers
20
INSIDE basic architecture
21
INSIDE: user roles

| Role | Description |
| --- | --- |
| source mapper | is a source expert responsible for mapping of economic variables |
| data analyst | performs statistical analysis and is in charge of all or part of the statistical production process |
| data administrator | responsible for managing the data flows, user authorization and system maintenance |

[Further table columns, one per layer: Source, Integration, Interpretation, Access.]
22
SBS-Frame process in S-DWH
[Figure: the layered process diagram (source, integration, interpretation, access), repeated.]
23
INSIDE: user: mapper
- data mapping is the process of creating data element mappings between two distinct data models
- the mapper is a source expert of economic variables, responsible for the coherent mapping between source variables and internal S-DWH variables
- this operational activity is carried out in order to overcome the lack of control in source provisions
[Figure: variables mapping from the sources (IRAP, …, SME survey, SS, FS) in the source layer to the internal dictionary of the SBS-Frame in the integration layer.]
24
INSIDE: user: mapper
- allows data analysts to access data regardless of their knowledge of the data-source layout
- works via a web interface directly on the source layer through a common metadata layer
- maps, automatically or manually, the source variables to the internal dictionary
- can search for keywords in the documentation attached to the source supply
[Figure: variables mapping from the sources (IRAP, …, SME survey, SS, FS) in the source layer to the internal dictionary of the SBS-Frame in the integration layer.]
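Once source variables are mapped to the internal dictionary, records from different sources can be read through the same variable names, which is what frees the analyst from knowing each source layout. A minimal sketch of that idea follows; the source field names, dictionary item names and the mapping itself are invented for illustration and do not come from INSIDE.

```python
# Hypothetical mapping: (source id, source field) -> internal dictionary item.
SOURCE_TO_DICTIONARY = {
    ("FS", "ricavi_vendite"): "TURNOVER",
    ("FS", "costo_personale"): "LABOUR_COST",
    ("IRAP", "IC1"): "TURNOVER",
}


def to_dictionary_record(source_id, raw_record, mapping=SOURCE_TO_DICTIONARY):
    """Rename the mapped fields of one source record to dictionary items.

    Fields without a mapping are dropped, since they have no
    counterpart in the internal dictionary.
    """
    out = {}
    for field, value in raw_record.items():
        item = mapping.get((source_id, field))
        if item is not None:
            out[item] = value
    return out


# Two records with different layouts end up with the same variable name.
fs_row = to_dictionary_record(
    "FS", {"ricavi_vendite": 120.0, "costo_personale": 45.0, "note": "x"}
)
irap_row = to_dictionary_record("IRAP", {"IC1": 118.0})
```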
25
SBS-Frame process in S-DWH
[Figure: the layered process diagram (source, integration, interpretation, access), repeated.]
26
INSIDE: user: analyst
Data analysts make the statistical evaluations. They work using an access-layer web interface to reach the interpretation layer data. The environment is optimized for the analysis of large amounts of data. This allows:
- basic analysis
- data mining
- hypothesis testing
- simulation
- model processes
- procedure setup
[Figure: the data analyst works from the access layer (SAS, WF) on the views (View SBS, View NA) built over FRAME SBS in the interpretation layer.]
27
INSIDE: user: analyst
- extracts microdata from a list of selected data sources
- previews the extraction
- has access permission
- creates a view in a private area
- shares the created views
- can access their views through standard statistical software (SAS, R, Stata, ...) or through database querying language (SQL, PL/SQL)
- creates views as the starting points for further elaboration steps in the S-DWH life cycle
[Figure: the data analyst works from the access layer (SAS, WF) on the views (View SBS, View NA) built over FRAME SBS in the interpretation layer.]
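The analyst steps listed above (extract, create a view, preview it) can be sketched with plain SQL. The example uses Python's built-in `sqlite3` module as a stand-in for the production database; the table name, column names, view name and data are all assumptions made for illustration.

```python
import sqlite3

# In-memory database standing in for the interpretation layer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE frame_sbs ("
    "enterprise_id TEXT, ateco TEXT, year INTEGER, turnover REAL)"
)
conn.executemany(
    "INSERT INTO frame_sbs VALUES (?, ?, ?, ?)",
    [
        ("E001", "10.11", 2013, 500.0),
        ("E002", "10.11", 2013, 250.0),
        ("E003", "47.11", 2012, 1000.0),
    ],
)

# The analyst's view: a named, reusable extraction of microdata.
conn.execute(
    """
    CREATE VIEW view_sbs_2013 AS
    SELECT enterprise_id, ateco, turnover
    FROM frame_sbs
    WHERE year = 2013
    """
)

# Preview of the extraction: only the first rows are fetched.
preview = conn.execute(
    "SELECT * FROM view_sbs_2013 ORDER BY enterprise_id LIMIT 10"
).fetchall()
```

Because the view is stored in the database rather than in a client session, any tool that can issue SQL against it sees the same extraction.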
28
INSIDE: user: data administrator
Data administrator:
- responsible for the data flow management
- loads data sources, metadata and microdata acceptance operations
- manages data in all layers
- optimizes the data warehouse
[Figure labels: provisions, provision layout, dictionary, facts, view layout, docs, dimensions.]
29
data model: metadata logical schemes
[Figure: metadata schemes across the source, integration, interpretation and access layers; labels: provisions, layouts, dictionary, provision, view layouts, docs, dimensions, facts, timing, monitoring, user.]
30
data model: data logical schemes
[Figure: data schemes by layer: source tables and provisions (SBR, surveys, derived source), data hub (integration), fact tables (interpretation), views/marts (access); labels include SBR, UNICO, FS, SS, FS DIM, SS DIM, EMP CLASS, GEO, ATECO, JUR. FORM, DICTIONARY, PROVISION, SBS.]
31
data model: metadata logical scheme
Key metadata concepts:
- PROVISION: set of information provided from a source (administrative or survey) relative to a certain period
- SUBPROVISIONS: microdata, record layout, metadata, classifications, documentation, which represent the detail subset of the provision
- METASOURCE: record layout, elementary variables contained in a source
- DOCUMENTATION: set of documents (general information, tax instructions, etc.) related to a certain provision
- DICTIONARY: items of the S-DWH, classified in measures, dimensions, unique identifiers and other non-identifying variables
32
data model: metadata logical scheme
Key metadata concepts:
- DW_META_SOURCE: key associative table; contains the association between the microdata table fields and the DW dictionary items
- DW_META_ATTRIB: dictionary of the DW concepts
- DW_META_SOURCE_PCT: table functional to the mapping operation; it contains the calculated distances between the microdata table fields and the DW dictionary items
- DW_DIMENSION: description of and physical references to the dimensions used in the DW
- DW_FACT: description of and physical references to the facts contained in the DW
33
INSIDE software application: user modules MAPPING VIEWER
34
Selection list of provision types and provision period INSIDE software application: mapper welcome
35
provision list and time period INSIDE software application: mapper
36
mapping view results sources’ variables button for automatic mapping INSIDE software application: mapper S-DWH dictionary
37
automatic mapping results probabilistic matching, percentage of association manual matching INSIDE software application: mapper
38
Detail of the selected meta-attribute; probabilistic matching: details of possible matches and percentage of association for the first five results INSIDE software application: mapper
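The slides do not say how the percentage of association is computed. As a stand-in, the sketch below scores each dictionary item against a source variable name with the string-similarity ratio from Python's standard `difflib` module and keeps the five best candidates, mirroring the "first five results" shown in the mapper. The function name, variable names and dictionary items are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def rank_candidates(source_field, dictionary_items, top_n=5):
    """Return the top_n dictionary items most similar to source_field.

    Similarity is difflib's ratio (0..1) expressed as a percentage;
    this is an illustrative choice, not the measure used by INSIDE.
    """
    scored = []
    for item in dictionary_items:
        ratio = SequenceMatcher(None, source_field.lower(), item.lower()).ratio()
        scored.append((item, round(100 * ratio, 1)))
    # Highest percentage first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Invented dictionary items and source variable name.
dictionary = [
    "turnover",
    "labour_cost",
    "value_added",
    "employees",
    "purchases",
    "gross_operating_margin",
]
ranking = rank_candidates("Turnover_total", dictionary)
```

A mapper would accept the top candidate when the score is high and fall back to manual matching otherwise.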
39
INSIDE architecture: two user modules MAPPING VIEWER VIDEO
40
INSIDE architecture: two user modules MAPPING VIEWER
41
The viewer is based on two sub-modules: View Builder: module for query building with the possibility of a data preview; View Manager: module for the management of views created INSIDE software application: viewer
42
facts list INSIDE software application: viewer: view builder building area: select area building area: where area
43
detail of variable INSIDE software application: viewer: view builder
44
search for variables through tags INSIDE software application: viewer: view builder
45
where conditions select condition fact links INSIDE software application: viewer: view builder
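The view builder's two building areas (select and where) map naturally onto the two halves of a SQL query. The sketch below shows one way such a query could be assembled from the user's choices; the function name, table name and column names are hypothetical, and identifiers are assumed to come from the fact list rather than from free text typed by the user.

```python
def build_view_query(fact_table, select_columns, where_conditions):
    """Assemble a parameterized SELECT from a select area and a where area.

    where_conditions is a list of (column, operator, value) tuples.
    Values are passed as bound parameters; identifiers are interpolated
    and are assumed to be chosen from a trusted list of facts.
    """
    allowed_ops = {"=", "<>", "<", "<=", ">", ">="}
    if not select_columns:
        raise ValueError("select area is empty")
    sql = f"SELECT {', '.join(select_columns)} FROM {fact_table}"
    clauses = []
    params = []
    for column, op, value in where_conditions:
        if op not in allowed_ops:
            raise ValueError(f"unsupported operator: {op}")
        clauses.append(f"{column} {op} ?")
        params.append(value)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


sql, params = build_view_query(
    "frame_sbs",
    ["enterprise_id", "turnover"],
    [("year", "=", 2013), ("employees", "<", 100)],
)
```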
46
view preview INSIDE software application: viewer: view builder view name filter of yearly supply
47
query detail of the view recalled INSIDE software application: viewer: view manager view manager
48
query detail of the view INSIDE software application: viewer: view manager
49
view data preview INSIDE software application: viewer: view manager
50
view features: query structure, number of records, execution time INSIDE software application: viewer: view manager
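The three view features named above (query structure, number of records, execution time) can all be obtained from the database itself. The following sketch uses `sqlite3` as a stand-in for the production database; the helper name, the table and view, and the sample data are assumptions for illustration.

```python
import sqlite3
import time


def view_features(conn, view_name):
    """Return the stored query, record count and count time for a view.

    view_name is interpolated into SQL and is assumed to be a trusted
    identifier taken from the list of existing views.
    """
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = ?",
        (view_name,),
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown view: {view_name}")
    start = time.perf_counter()
    (n_records,) = conn.execute(f"SELECT COUNT(*) FROM {view_name}").fetchone()
    elapsed = time.perf_counter() - start
    return {"query": row[0], "records": n_records, "seconds": elapsed}


# Invented table, view and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frame_sbs (enterprise_id TEXT, year INTEGER, turnover REAL)")
conn.executemany(
    "INSERT INTO frame_sbs VALUES (?, ?, ?)",
    [("E001", 2013, 500.0), ("E002", 2013, 250.0), ("E003", 2012, 1000.0)],
)
conn.execute(
    "CREATE VIEW view_sbs_2013 AS SELECT * FROM frame_sbs WHERE year = 2013"
)
features = view_features(conn, "view_sbs_2013")
```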
51
INSIDE architecture: two user modules MAPPING VIDEO VIEWER
52
2-tier system INSIDE data analyst: desktop application environment VIDEO INSIDE architecture: two user modules PUBLISHING & SHARING SERVICES CONTENT MANAGEMENT
53
2-tier system INSIDE data analyst: workflow environment INSIDE architecture: two user modules CONTENT MANAGEMENT PUBLISHING & SHARING SERVICES WORKFLOW SERVICE
54
INSIDE Technologies/FW

| Technology | Description |
| --- | --- |
| Red Hat EL Server 5.11 | Web server operating system |
| PHP 5.3 | Application development language |
| APACHE 2.0 | Web server software |
| Yii 1.1 | MVC application framework |
| Bootstrap | HTML, CSS and JavaScript framework (included in Yii) |
| jQueryUI | User interface framework JavaScript library |
| PdfJS viewer | PDF viewer library |
| Highcharts | Interactive charting library in JavaScript |
| DataTables | jQuery plug-in for advanced interaction controls to HTML table |
| Red Hat EL Server 5.8 | Database server operating system |
| Oracle 11g | Relational database management system |
55
thanks for your attention