“Mapping the GSBPM on a SDW architecture”
Antonio Laureti Palma, IT - Structural Business Statistics Unit, National Institute of Statistics – Italy
Workshop: ESSnet on Micro Data Linking and Data Warehousing in Statistical Production, 22 & 23 September 2011
Overview
The aim of this study is to define and contextualize a statistical data warehouse, so as to provide a framework that assists the development and definition of “data warehousing and data linking”. The data warehousing architecture presented here can be considered the IT conclusion of the activities of the first year of the ESSnet, while the proposed modelling approach indicates the roadmap for future IT representations in this context. The presentation covers:
- Data Warehousing as a Single Coherent Statistical Production System
- Statistical Data Warehousing: an Architecture Schema
- Modeling the Business Domain: Designer’s view of the GSBPM on the DWA schema
- Modeling the Data/Metadata Domain
- Conclusion
The Data Warehouse IT definition: In computing, a data warehouse is a database used for reporting. …the concept of data warehousing dates back to the late 1980s when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse" (from Wikipedia). ...as Bill Inmon says - “the data warehouse is at the center of the corporate information factory, which provides a logical framework for decision support environments and business management capabilities”. ...in essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to delivering business intelligence.
Data Warehousing for Enterprise
The centrality of the DW in an enterprise is obtained through an IT infrastructure that is transversal to all the operational systems. Data from the operational systems are Extracted, Transformed and Loaded (ETL) into the DW, where they become available to the DSS and MIS.
[Diagram: the enterprise production line (Marketing, Resources, Production, Distribution, Sales) feeds the Data Warehouse through ETL; the Data Warehouse serves the DSS (Decision Support System) and the MIS (Management Information System).]
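The ETL flow described above can be sketched in a few lines. This is a minimal illustration only, assuming an in-memory SQLite database as the warehouse; the two source systems, the `facts` table and all sample rows are hypothetical and not part of the presentation.

```python
import sqlite3

# Hypothetical rows "extracted" from two operational systems (values arrive as text).
SALES = [("2011-09-01", "widget", "12.50"), ("2011-09-02", "gadget", "7.00")]
PRODUCTION = [("2011-09-01", "widget", "40"), ("2011-09-02", "gadget", "25")]


def transform(rows, source, measure):
    """Cast the raw text value to float and tag each row with its origin."""
    return [(day, item, source, measure, float(value)) for day, item, value in rows]


def load(conn, rows):
    """Append transformed rows to the single warehouse fact table."""
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?, ?)", rows)


def build_warehouse():
    """Run extract, transform and load for every operational system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE facts (day TEXT, item TEXT, source TEXT, measure TEXT, value REAL)"
    )
    load(conn, transform(SALES, "sales", "revenue"))
    load(conn, transform(PRODUCTION, "production", "units"))
    conn.commit()
    return conn


def report(conn):
    """A DSS/MIS-style query that spans both operational systems."""
    cur = conn.execute(
        "SELECT item, measure, SUM(value) FROM facts "
        "GROUP BY item, measure ORDER BY item, measure"
    )
    return cur.fetchall()
```

Calling `report(build_warehouse())` returns one total per item and measure. The point of the sketch is that the reporting query never touches the operational systems directly, only the warehouse.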
Data Warehousing for Statistics
In an NSI, if the DW is mainly used to improve production efficiency, as in an enterprise, it is transversal to the statistical production line.
[Diagram: the statistical production line (Regulations, Resources, Surveys, Admin Data, Elaboration, Output) feeds the Data Warehouse through ETL; the Data Warehouse serves the DSS (Decision Support System) and the MIS (Management Information System).]
Data Warehousing for Statistics
In an NSI, if the DW is used both for “improving the production efficiency” (DSS-MIS) and for “creating the statistical product” (SD), then the DW is part of the production line. In this case the DW can be considered a single logical repository, the center of the information factory, for all information generated by the NSI.
[Diagram: within the statistical production line, Regulations, Resources, Surveys and Admin Data feed the Data Warehouse through ETL; the Data Warehouse serves SD (Statistical Dissemination), the DSS and the MIS.]
From the survey, two issues arise:
Single coherent system (questions 6 to 13): 15 countries declare that they do not have a single coherent system, although 11 of them are planning to change this... this situation will probably change substantially in the next five years... Current output requirements are not integrated into data systems for 10 countries, and the situation will probably change for half of them... Those who have a single coherent system do not want to change it; metadata and data input are fully integrated in the data system, as are admin data.
Motivation to start a DW (question 14): The main motivations are linked to the ways to (re)use data, the improvement of efficiency and process integration in business statistics production... Additional motivations are integrating the project into the organization's processing model, reducing the burden (cost and time) on survey respondents, and increasing consistency and quality.
Disadvantages of a stove-pipe-like production
In a stove-pipe production system, every production line corresponds to a specific domain of statistics, together with its own production system. For each domain the whole production process, from survey design to dissemination, takes place independently of the other domains, and each has its own data suppliers and user groups.
[Diagram: separate production lines for Structural Business Statistics (SBS), Short Term business Statistics (STS), Information Society (IS), Science Technology Innovation (STI) and I/O, each running from survey data and administrative data through data integration and elaboration to statistical output, alongside the Business Register.]
Data Warehousing as a Single Coherent System
In an NSI, a single coherent Data Warehousing System (DWSys) aims to improve production efficiency and to create the statistical products in a fully integrated way. From this point of view, the DWSys becomes the “effective” Information System of the whole statistical production line. The term DWSys should therefore refer to the interaction between People, Business Processes, Data and Technology. The Statistical Data Warehouse (SDW) can then be seen as a central statistical data store, regardless of the data’s source, for managing all available data of interest, improving the NSI’s ability to:
- (re)use data to create new data/new outputs;
- perform reporting;
- execute analysis;
- produce the necessary information.
DWSys Architectural description
A DWSys Architecture (DWA) for statistics is a rigorous description of the structure of the NSI production, comprising DWSys components (business entities or sub-processes), the externally visible properties of those components, and the relationships (e.g. the behavior) between them. The DWA should be a framework that defines how an NSI organizes the DWSys. It should:
- provide the mechanisms for communicating information about the relationships that are important in the architecture;
- provide the discipline to gather and organize the data and construct the views in a way that helps ensure integrity, accuracy and completeness;
- support the application of methods and the use of tools.
Layers of the enterprise architecture
In the context of creating an enterprise architecture, it is common to recognize four types of architecture, each corresponding to a particular architectural domain.
DWA – Business Domain
To provide a DWA that is as detailed as possible in the context of statistics production, we can articulate the business domain in four functional layers: data source layer, integration layer, interpretation and data analysis layer, access layer. Each layer has its own data domain structure:
- operational data, for data warehousing;
- metadata, the descriptive data of the SDW, usually used to manage, describe and monitor the information systems.
DWA layered business architecture
[Diagram: the four layers (Source, Integration, Interpretation & Data Analysis, Access) together with a Metadata Management component. Elements shown: Regulations, Resources, Statistical Surveys 1 to n, Admin Data 1 to n, Business Register, Staging Area, Primary Data, Data Marts, Dissemination, DSS, MIS.]
DWA - functional Layer
Source Database Layer: This layer is responsible for storing, physically or virtually, the data from internal (surveys) or external (archives) sources for statistical purposes. Typical data sources in the context of business statistics are:
- specific surveys, such as STS, ICT, CIS and SBS;
- Customs Agency;
- Revenue Agency;
- Chambers of Commerce;
- National Social Security Institute.
DWA - functional Layer
Integration layer: It is used for all integration and reconciliation activities on the data sources. This layer holds the set of applications that perform the main ETL, which manages:
- inconsistent coding for the same object: consistency is obtained through the coding defined by the data warehouse;
- adjustment of different units of measurement and inconsistent formats;
- alignment of inconsistent labels, where the same object is named differently; usually the data are identified according to the definition contained in the metadata of the system;
- incomplete or incorrect data: in this case the operation may require human intervention to resolve issues that cannot be predicted a priori;
- data linking, in which different sources enable the creation of extended, or new, units of analysis.
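As a rough illustration of the reconciliation tasks listed above, the sketch below harmonizes one record against warehouse metadata and then links two sources. It assumes that the coding, unit and label definitions live in simple lookup tables; `CODE_MAP`, `UNIT_FACTOR`, `LABEL_MAP`, the field names and the sample values are all hypothetical.

```python
# All mappings below are hypothetical stand-ins for warehouse metadata.
CODE_MAP = {"IT": "ITA", "Italy": "ITA", "ITA": "ITA"}       # coding defined by the DW
UNIT_FACTOR = {"EUR": 1.0, "kEUR": 1000.0}                    # conversion to euro
LABEL_MAP = {"turnover": "turnover", "revenue": "turnover"}  # one label per object


def harmonize(record):
    """Return (clean_record, issues); a non-empty issue list means manual review."""
    issues = []
    country = CODE_MAP.get(record.get("country"))
    variable = LABEL_MAP.get(record.get("variable"))
    factor = UNIT_FACTOR.get(record.get("unit"))
    value = record.get("value")
    if country is None:
        issues.append("unknown country code")
    if variable is None:
        issues.append("unknown variable label")
    if factor is None:
        issues.append("unknown unit")
    if value is None:
        issues.append("missing value")
    if issues:
        # Incomplete or incorrect data: hand the record over for human intervention.
        return None, issues
    clean = {
        "unit_id": record.get("unit_id"),
        "country": country,
        "variable": variable,
        "value_eur": value * factor,
    }
    return clean, []


def link(left, right):
    """Data linking: merge records from two sources that share a unit_id."""
    right_by_id = {r["unit_id"]: r for r in right}
    return [
        {**rec, **right_by_id[rec["unit_id"]]}
        for rec in left
        if rec["unit_id"] in right_by_id
    ]
```

A record that fails any lookup is returned with its list of issues rather than being silently corrected, which mirrors the point that some problems cannot be predicted a priori and need a person to resolve them.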
DWA - functional Layer
Interpretation and data analysis layer: The basic functions performed at this level are advanced analysis and the interpretation of data elaborations, both based on statistical algorithms. Here “statistical expert users” work with data at maximum granularity to produce information of strategic value. Only a limited number of users are allowed to access the data, in order to avoid degrading server performance. This is a strategy of “process of information delivery”, in which the demand for new statistical information does not involve the construction of new statistical production lines, but rather the creation of other data marts. Data marts are regarded as subsets of the DW, usually oriented to a specific business line or team. The results of these activities are unplanned aggregate data for the next (access) layer, or software rules to be developed for the next iteration.
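The idea that a new information demand is met by a new data mart, rather than a new production line, can be sketched as a function that derives an aggregate subset from the same warehouse data. The `WAREHOUSE` rows, field names and the `build_data_mart` helper are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical micro-data held at maximum granularity in the warehouse.
WAREHOUSE = [
    {"unit_id": 1, "domain": "SBS", "region": "North", "turnover": 100.0},
    {"unit_id": 2, "domain": "SBS", "region": "North", "turnover": 50.0},
    {"unit_id": 3, "domain": "SBS", "region": "South", "turnover": 70.0},
    {"unit_id": 4, "domain": "STS", "region": "North", "turnover": 30.0},
]


def build_data_mart(rows, domain, group_by, measure):
    """Select one statistical domain and aggregate a measure by a grouping variable."""
    totals = defaultdict(float)
    for row in rows:
        if row["domain"] == domain:
            totals[row[group_by]] += row[measure]
    return dict(totals)
```

For example, `build_data_mart(WAREHOUSE, "SBS", "region", "turnover")` yields regional SBS totals; a different demand would call the same function with other arguments instead of requiring a separate collection and processing chain.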
DWA - functional Layer
Access Layer: It is the layer for the final presentation of the information sought, addressed to a wide range of users who are not necessarily expert in business statistics or in IT tools. The tools available are:
- Specialized Business Intelligence tools: this extensive category, in terms of solutions on the market, includes tools to build queries and navigational tools (OLAP viewers), including Web browsers;
- Graphics and publishing tools: Business Intelligence tools are able to generate graphs and tables for their users; this solution consists essentially of just a couple of steps, to avoid inefficiency;
- Office Automation tools: this is a reassuring solution for users who come to the data warehouse context for the first time, as they are not forced to learn new complex instruments. The problem is that this solution, while adequate with regard to productivity and efficiency, is very restrictive in the use of the data warehouse, since these instruments have significant architectural and functional limitations.
DWA – Modeling the Business Domain
The designer's view of the business is also known as the analytical view, and there are various standards for modeling it. One of the most commonly used modeling standards is the Generic Statistical Business Process Model (GSBPM). The UNECE definition of the GSBPM (version 4) states: “The original intention was for the GSBPM to provide a basis for statistical organizations to agree on standard terminology to aid their discussions on developing statistical metadata systems and processes. The GSBPM should therefore be seen as a flexible tool to describe and define the set of business processes needed to produce official statistics”. So, in order to define a general and comprehensive architecture for statistical production, it may be useful to identify and locate the different phases of a generic statistical production process on the different functional layers of the DWA.
Generic Statistical Business Process Model
DWA - Mapping the GSBPM on DWA
The location of sub-processes on an SDW architecture is represented graphically in the next slides, with the SDW functional layers on the horizontal axis and the nine GSBPM phases on the vertical axis. Each element inside the graph is a sub-process; we consider GSBPM phases 4 to 7. This is only an example of model processing. Each case must be validated and discussed in its own operational context; this is just a basis for setting up and starting the modelling work for the next two years of the ESSnet. In this context, each sub-process must be regarded from a methodological, planning, technological or operational point of view. Blank sub-processes relate to methodological or planning metadata definitions, while brown sub-processes relate to operational or technological functions for data elaboration.
Designer's view - Mapping the GSBPM on DWA
Sub-processes of the GSBPM allocated on the functional layers of the DWA (columns: Source Layer, Integration Layer, Interpretation and analysis Layer, Access Layer).
7 Disseminate: 7.1 update output systems; 7.2 produce dissemination; 7.3 manage release of dissemination products; 7.4 promote dissemination; 7.5 manage user support.
6 Analyze: 6.1 prepare draft output; 6.2 validate outputs; 6.3 scrutinize and explain; 6.4 apply disclosure control; 6.5 finalize outputs.
Designer's view - Mapping the GSBPM on DWA
Sub-processes of the GSBPM allocated on the functional layers of the DWA (columns: Source Layer, Integration Layer, Interpretation and analysis Layer, Access Layer).
5 Process: 5.1 integrate data; 5.2 classify & code; 5.3 review, validate & edit; 5.4 impute; 5.5 derive new variables and statistical units; 5.6 calculate weights; 5.7 calculate aggregate; 5.8 finalize data files.
4 Collect: 4.1 select sample; 4.2 set up collection; 4.3 run collection; 4.4 finalize collection.
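One possible way to hold such a mapping in code is a grid keyed by (phase, layer), with each sub-process carrying its point of view. The sketch below is an assumption about representation only. The three placements and view tags in `EXAMPLE` are illustrative guesses, not the allocation shown in the slides, and every case would need validation in its own operational context.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    SOURCE = "Source"
    INTEGRATION = "Integration"
    INTERPRETATION = "Interpretation and data analysis"
    ACCESS = "Access"


class View(Enum):
    METADATA_DEFINITION = "methodological or planning"   # "blank" sub-processes
    DATA_ELABORATION = "operational or technological"    # "brown" sub-processes


@dataclass(frozen=True)
class SubProcess:
    code: str    # GSBPM sub-process number, e.g. "5.1"
    name: str
    layer: Layer
    view: View

    @property
    def phase(self):
        """GSBPM phase number, taken from the part of the code before the dot."""
        return int(self.code.split(".")[0])


def grid(subprocesses):
    """Group sub-process codes by (phase, layer) cell of the mapping."""
    cells = {}
    for sp in subprocesses:
        cells.setdefault((sp.phase, sp.layer), []).append(sp.code)
    return cells


# Illustrative placements only (assumed, not taken from the original figure).
EXAMPLE = [
    SubProcess("5.1", "integrate data", Layer.INTEGRATION, View.DATA_ELABORATION),
    SubProcess("5.2", "classify & code", Layer.INTEGRATION, View.DATA_ELABORATION),
    SubProcess("7.2", "produce dissemination", Layer.ACCESS, View.DATA_ELABORATION),
]
```

Keeping the layer and the view as explicit fields makes it straightforward to revise a placement after discussion without touching the rest of the model.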
Designer's view – Modeling the Data Domain Graphic scheme of layered architecture with a focus on “statistical data”:
SDA – Modeling the Meta Data Domain
Since our purpose is to refer to the IT infrastructure of an SDW, we consider only structured metadata, articulated as:
- Structural Metadata (SM): used for the description, identification and retrieval of statistical and quality information; they can also link the various components of the SDW;
- Process Metadata (PM): used to store information on data usage and on the maintenance of process administration, as well as the information needed for the automatic execution of workflows or management systems.
Both can be Active, when they enable operational use, manual or automated, by one or more processes, or Passive in all other uses.
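The two classification axes above (structural versus process, active versus passive) can be captured in a small data structure. This is a minimal sketch under the assumption that "active" simply means at least one process uses the item operationally; the class names, the `used_by` field and the two sample items are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    STRUCTURAL = "SM"
    PROCESS = "PM"


@dataclass(frozen=True)
class MetadataItem:
    name: str
    kind: Kind
    used_by: tuple = ()  # processes that use the item operationally (may be empty)

    @property
    def is_active(self):
        """Active when at least one process uses the item operationally, else passive."""
        return len(self.used_by) > 0


# Hypothetical examples, one per kind.
CLASSIFICATION = MetadataItem(
    "activity classification", Kind.STRUCTURAL, used_by=("5.2 classify & code",)
)
RUN_LOG = MetadataItem("imputation run log", Kind.PROCESS)
```

Here `CLASSIFICATION.is_active` is true because a process consumes it, while `RUN_LOG.is_active` is false, so it counts as passive.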
Designer's view - Modeling the Meta Data Domain Graphic scheme of layered architecture with a focus on “meta data”:
Conclusion
We have contextualized statistical production in a Data Warehousing Architecture, and in doing so introduced a general Enterprise Architecture vision for an SDW production system. We have shown how the GSBPM representation can be used to model the business domain of the SDW layered architecture, giving a complete operational view for the deployment of statistical production cases. Finally, we have shown the corresponding four-level data domain of the architecture for a Statistical Data Warehouse.