S-DWH Architecture (Recap):

Slides:



Advertisements
Similar presentations
Enterprise Architecture Framework in Statistics Poland
Advertisements

“Mapping the GSBPM on a SDW architecture”
C6 Databases.
Database Management3-1 L3 Database Management Santa R. Susarapu Ph.D. Student Virginia Commonwealth University.
Accessing Organizational Information—Data Warehouse
Chapter 3 Database Management
Components and Architecture CS 543 – Data Warehousing.
Business Intelligence System September 2013 BI.
Designing a Data Warehouse
Data Warehousing: Defined and Its Applications Pete Johnson April 2002.
LEVERAGING THE ENTERPRISE INFORMATION ENVIRONMENT Louise Edmonds Senior Manager Information Management ACT Health.
by Ha Do Statistical Standard Methodology and ITC Department
United Nations Economic Commission for Europe Statistical Division Applying the GSBPM to Business Register Management Steven Vale UNECE
BUSINESS INTELLIGENCE/DATA INTEGRATION/ETL/INTEGRATION AN INTRODUCTION Presented by: Gautam Sinha.
MDC Open Information Model West Virginia University CS486 Presentation Feb 18, 2000 Lijian Liu (OIM:
Data Warehouse & Data Mining
Understanding Data Warehousing
WP.5 - DDI-SDMX Integration E.S.S. cross-cutting project on Information Models and Standards Marco Pellegrino, Denis Grofils Eurostat METIS Work Session6-8.
Case Studies: Statistics Canada (WP 11) Alice Born Statistics UNECE Workshop on Statistical Metadata.
Data Warehousing at STC MSIS 2007 Geneva, May 8-10, 2007 Karen Doherty Director General Informatics Branch Statistics Canada.
M ETADATA OF NATIONAL STATISTICAL OFFICES B ELARUS, R USSIA AND K AZAKHSTAN Miroslava Brchanova, Moscow, October, 2014.
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie.
The Adoption of METIS GSBPM in Statistics Denmark.
Current and Future Applications of the Generic Statistical Business Process Model at Statistics Canada Laurie Reedman and Claude Julien May 5, 2010.
2 Copyright © Oracle Corporation, All rights reserved. Defining Data Warehouse Concepts and Terminology.
Explaining the statistical data warehouse (S-DWH)
1 Reviewing Data Warehouse Basics. Lessons 1.Reviewing Data Warehouse Basics 2.Defining the Business and Logical Models 3.Creating the Dimensional Model.
5 - 1 Copyright © 2006, The McGraw-Hill Companies, Inc. All rights reserved.
Toward Generic Systems Shifra Haar - Central Bureau of Statistics-Israel.
United Nations Economic Commission for Europe Statistical Division Mapping Data Production Processes to the GSBPM Steven Vale UNECE
MANAGING DATA RESOURCES ~ pertemuan 7 ~ Oleh: Ir. Abdul Hayat, MTI.
By N.Gopinath AP/CSE. There are 5 categories of Decision support tools, They are; 1. Reporting 2. Managed Query 3. Executive Information Systems 4. OLAP.
1 Technology in Action Chapter 11 Behind the Scenes: Databases and Information Systems Copyright © 2010 Pearson Education, Inc. Publishing as Prentice.
United Nations Economic Commission for Europe Statistical Division High-Level Group Achievements and Plans Steven Vale UNECE
Copyright 2010, The World Bank Group. All Rights Reserved. Principles, criteria and methods Part 2 Quality management Produced in Collaboration between.
Pilot Census in Poland Some Quality Aspects Geneva, 7-9 July 2010 Janusz Dygaszewicz Central Statistical Office POLAND.
Developing and applying business process models in practice Statistics Norway Jenny Linnerud and Anne Gro Hustoft.
Open GSBPM compliant data processing system in Statistics Estonia (VAIS) 2011 MSIS Conference Maia Ennok Head of Data Warehouse Service Data Processing.
Metadata By N.Gopinath AP/CSE Metadata and it’s role in the lifecycle. The collection, maintenance, and deployment of metadata Metadata and tool integration.
United Nations Oslo City Group on Energy Statistics OG7, Helsinki, Finland October 2012 ESCM Chapter 8: Data Quality and Meta Data 1.
Distributed Data Analysis & Dissemination System (D-DADS ) Special Interest Group on Data Integration June 2000.
ESS-net DWH ESSnet on microdata linking and data warehousing in statistical production Harry Goossens – Statistics Netherlands Head Data Service Centre.
Foundations of Information Systems in Business. System ® System  A system is an interrelated set of business procedures used within one business unit.
RECENT DEVELOPMENT OF SORS METADATA REPOSITORIES FOR FASTER AND MORE TRANSPARENT PRODUCTION PROCESS Work Session on Statistical Metadata 9-11 February.
1 Categories of data Operational and very short-term decision making data Current, short-term decision making, related to financial transactions, detailed.
ESS-net DWH ESSnet on microdata linking and data warehousing in statistical production.
5.8 Finalise data files 5.6 Calculate weights Price index for legal services Quality Management / Metadata Management Specify Needs Design Build CollectProcessAnalyse.
Harry Goossens Centre of Competence on Data Warehousing.
Describe a layered S-DWH Technology Architecture Information Systems Architecture Business Architecture.
The business process models and quality issues at the Hungarian Central Statistical Office (HCSO) Mr. Csaba Ábry, HCSO, Methodological Department Geneva,
Introduction to Quality Management Frameworks Eurostat, Luxembourg, January 2016 Process quality Dr Johanna Laiho-Kauranne.
Data Management: Data Processing Types of Data Processing at USGS There are several ways to classify Data Processing activities at USGS, and here are some.
1 Management Information Systems M Agung Ali Fikri, SE. MM.
Statistical process model Workshop in Ukraine October 2015 Karin Blix Quality coordinator
Managing Data Resources File Organization and databases for business information systems.
Introduction to Statistics Estonia Study visit of the State Statistical Service of Ukraine on Dissemination of Statistical Information and related themes.
Metadata models to support the statistical cycle: IMDB
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Data Warehouse.
Generic Statistical Business Process Model (GSBPM)
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie
Tomaž Špeh, Rudi Seljak Statistical Office of the Republic of Slovenia
Metadata in the modernization of statistical production at Statistics Canada Carmen Greenough June 2, 2014.
“The role of S-DWH in the ESS 2020 modernization process”
SDMX in the S-DWH Layered Architecture
Data Warehouse.
Data Warehousing Concepts
Chapter 3 Database Management
Mapping Data Production Processes to the GSBPM
Metadata used throughout statistics production
Presentation transcript:

S-DWH Architecture (Recap): Summary: A Data Warehouse as statistical production system. S-DWH Architectural Domains S- DWH layered architecture National Institute of Statistics – Italy Antonio Laureti Palma - IT Business Statistics 4th ESS-Net Workshop on “Micro data linking and warehousing in statistical production “Tallinn, Statistics Estonia March 20th and 21st, 2013 1 1

Data Warehouse In an enterprise a data warehouse (DWH) is an organized information collection which stores current and historical data used for creating reports for management. Reports are generally produced from already structured information or continual data mining carried out by experts. Data mining is realized by using advanced statistical methods, correlating “primary information ” from different production departments. The delivery of reports is carried out using “secondary information” stored in specialized Data Marts. The discovery of useful information aids business strategies which should increase efficiency in the production process.

S-DWH overview architecture Staging Area Data Integration Data Warehouse Data Mart DATA MINING PRODUCTION RESOURCES ETL secondary information secondary information DISTRIBUTION operational information primary information SALES primary information REPORTS EXTERNAL OPERATIONAL DECISION feedback

Statistical Data Warehouse In an Enterprise the information collected in a DWH derives from different production departments. For producing Primary Information, specific Extraction Transformation and Loading (ETL) procedures are needed. In a S-DWH (Statistical-DWA) the sources are direct data capturing or administrative archives and the ETL procedures must manage the effective statistical elaboration. A S-DWH is a coherent information collection of current and historical data of different statistical topics and domains. A high level of coordination is necessary both within different topics and within different operational phase activities, and between the topics and activities.

S-DWH Architectural domains To define and enable the evolution of a S-DWH requires a definition of a framework where key principles and models can be created, communicated and improved. A S-DWH framework comprises: business domain to align strategic objectives and tactical demands through a common understanding of the organization. Information architecture domain, to describe data base organization and management of data and metadata information. technology domain, i.e. the combined set of software, hardware and networks to develop and support IT services. 5

S-DWH business domain In the business domain, key business processes that should be considered are: Adapting or Redesigning of statistical regulation (timing and output); Managing changes from stove-pipe approaches to a coherent S-DWH; Redesigning operational management ( methodology and processes); Data management (security, custodianship and ownership, data and metadata); Quality management (assessment and control); Software and IT infrastructure management. 6

S-DWH information domain To organize a S-DWH information architecture, we group functionalities in four distinct layers: access layer is designed for a wide typology of users or informatics instruments for the final presentation of the information sought; interpretation and data analysis layer is designed for interactive non-structured human activities; integration layer is designed for the ETL functions, which should be realized automatically or semi-automatically; source layer is designed for storing and managing internal (surveys) or external (archives) raw data sources. 7 7

S-DWH information domain overview Source Layer Integration Layer Interpretation and Analysis Layer Access Layer Staging Area Data Integration Data Warehouse Data Mart CATI OPERATIONAL DECISION DATA MINING CAWI REPORTS CAPI ETL primary information secondary information … operational information primary information secondary information ADMIN

S-DWH information domain ESS-Net S-DWH functional definition: A statistical data warehouse is a central statistical data store, regardless of the data’s source, for managing all available data of interest, improving the NSI to: - (re)use data to create new data/new outputs; - perform reporting; - execute analysis; - produce the necessary information.

S-DWH information domain produce the necessary information perform reporting produce the necessary information new outputs ACCESS LAYER INTERPRETATION AND ANALYSIS LAYER execute analysis re-use data to create new data INTEGRATION LAYER produce the necessary information SOURCES LAYER

S-DWH technology domain In the technology domain the layered architecture reflects a conceptual organization in which we will consider the first two levels as pure statistical operational infrastructures, functional for acquiring , storing, editing and validating data, and the last two layers as the effective data warehouse, i.e. levels in which data are accessible for data analysis. These reflect two different IT environments, an operational where we support semi-automatic computer interaction systems and an analytical, the warehouse, where we maximize human free interaction. STATISTICAL DATA WAREHOUSE DATA WAREHOUSE ACCESS LAYER INTERPRETATION AND ANALYSIS LAYER OPERATIONAL DATA INTEGRATION LAYER SOURCES LAYER

S-DWH technology domain In the Technology domain each layer must support different process typologies: ACCESS LAYER MOLAP (Multidimensional Online Analytical Processing) uses specific analytical tools on a multidimensional data model. INTERPRETATION AND ANALYSIS LAYER ROLAP, Relational Online Analytical Processing, uses specific analytical tools on a relational dimensional data model which is easy to understand and does not require pre-compuation and storage of the information. INTEGRATION LAYER OLTP (OnLine Transaction Processing) is the typical operational activity for data editing. SOURCE LAYER Data mapping involves combining data residing in different sources and providing users with a unified view of these data.

Statistical Data Warehouse layered architecture Access Layer The Access Layer is the layer for the final presentation, dissemination and delivery of the information sought. This layer is designed for a wide typology of users and computer instruments. In this layer the data organization must support automatic dissemination systems and BI-Tools (Business Intelligence). In both cases, statistical information is structured in data models at micro and macro data levels.

DWA - Access Layer Specialized functionalities: SDMX interface (Statistical Data and Metadata eXchange) supports cross-border service interoperability for public administrations or organizations. Specialized BI-Tools is an extensive category, in terms of solutions on the market, of products for query building, data-navigating, web browsing, graphics and publishing. Office Automation tools: this is a reassuring solution for users who come for the first time to the data warehouse context, as they are not forced to learn new complex instruments. The problem is that this solution, while adequate with regard to productivity and efficiency, is very restrictive in the use of the data warehouse, since these instruments, have significant architectural and functional limitations;

DWA - Access Layer From the GSBPM we consider follow sub-processes:   7- Disseminate 7.1-”update output systems” , including re-formatting data and metadata into specific output databases. 7.2-”produce dissemination” is a sort of integration process between table, text and graphs. 7.3-”manage release of dissemination products” ensures that all elements in place for releasing. 7.4-  ” promote dissemination ”, it includes wikis, blog, customer relationship management tools. 7.5- ”manage user support” ensures that customer queries are recorded and provided within agreed deadlines.

Statistical Data Warehouse layered architecture Interpretation and data analysis layer The Interpretation and data analysis layer is specifically for statisticians and enables data mining. This is the effective data warehouse, and must support all kinds of statistical analysis on micro and macro data. Data evaluation in this layer supports the design of any new production processes or data re-use. The results expected of the human activities in this layer should then be statistical “services” useful for other phases of the elaboration process, from the sampling, to the set-up of instruments used in the Integration Layer until the generation of new possible statistical outputs. Activities on the Interpretation layer improve the S-DWH capabilities.

DWA - Interpretation and data analysis layer From the GSBPM we consider follow sub-processes : 1- Specify Needs: 1.5 - check data availability 2- Design: 2.1-design outputs 2.2-design variable descriptions 2.4-design frame and sample 2.5-design statistical methodology 2.6-design production systems 4- Collect: 4.1-select sample 5- Process 5.1-integrate data; 5.5-derive new variables and units; 5.6-calculate weights; 5.7-calculate aggregate; 6- Analyze 6.1-prepare draft output; 6.2-validate outputs; 6.3-scrutinize and explain; 6.4-apply disclosure control; 6.5-finalize outputs 7- Disseminate 7.1-update output systems, 9- Evaluate 9.1- gather evaluation inputs 9.2- conduct evaluation

Statistical Data Warehouse layered architecture Integration layer The integration layer is where all operational activities needed for all statistical elaboration processes are carried out. This means operations carried out automatically or manually by operators to produce statistical information in a common IT infrastructure. This are recurring (cyclic) activities involved in the running of the whole or any part of a statistical production process. Statistical elaboration should be organized in operational work flows for checking, cleaning, linking and harmonizing data-information in a common persistent area (Data Vault) where information is grouped by subject.

DWA - Integration layer Specialized functions: It is used for all integration and reconciliation activities of data sources. Into this layer we have the set of applications that perform the main ETL, which manages: inconsistent coding for the same object, the consistency is obtained by coding defined by the data warehouse; adjustment of the different units of measurement and inconsistent formats; alignment of inconsistent labels, same object named differently. Usually the data are identified according to the definition contained in the metadata of the system. incomplete or incorrect data; in this case operation may require human intervention to resolve issues not predictable a priori.   data linking, in which different sources enable the creation of extended, or new, units of analysis.

DWA - Integration layer From the GSBPM we consider follow sub-processes : 5- Process 5.1-integrate data; 5.2-classify & code; 5.3-review, validate & edit; 5.4-impute; 5.5-derive new variables and statistical units: 5.6-calculate weights; 5.7-calculate aggregate; 5.8-finalize data files 6- Analyze 6.1-prepare draft output; the presence of this sub-process in this layer is strictly related to regular production process, in which the measures estimated are regularly produced, as should in the STS

Statistical Data Warehouse layered architecture The Integration and Interpretation Layers are reciprocally functional to each other. Integration layer, always prepares the elaborated information for Interpretation layer: from raw data, just uploaded into the S-DWH and not yet included in a production process, to micro/macro statistical data at any elaboration step of any production processes. Otherwise in the interpretation layer it must be possible to easily access and analyze this micro/macro elaborated data of the production processes in any state of elaboration. This because methodologists should correct possible operational elaboration mistakes before, during and after any statistical production line, or design new elaboration processes for new surveys. In this way the new concept or strategy can generate a feedback toward Integration layer which is able to correct, or increase the quality, of the regular production lines.

Statistical Data Warehouse layered architecture Example of reciprocally functional between The Integration and Interpretation Layers At Estat level, official statistical output is defined by the Commission Regulation and it can be modified annually in order to reflect changes in society or the economy. The process of defining a new statistical output is realized by statistical experts in European task forces and working groups. Statistical experts should investigate using available primary data, at National or European levels, to support any new statistical proposal, which should involve a re-use of existing data or the design of a new process.

Statistical Data Warehouse layered architecture Example of reciprocally functional between The Integration and Interpretation Layers European task forces 1° new proposal STATISTICAL DATA WAREHOUSE 2° new regulation ACCESS LAYER INTERPRETATION AND ANALYSIS LAYER 3° new design strategy INTEGRATION LAYER 4° new output SOURCES LAYER

Statistical Data Warehouse layered architecture Case: produce the necessary information ACCESS LAYER 7 DISSEMINATE INTERPRETATION LAYER 2 DESIGN 9 EVALUATE 6 ANALYSIS INTEGRATION LAYER 3 BUILD 5 PROCESS SOURCE LAYER 4 COLLECT

Statistical Data Warehouse layered architecture Case: re-use data to create new data ACCESS LAYER 7 DISSEMINATE INTERPRETATION LAYER 5 PROCESS 2 DESIGN 9 EVALUATE 6 ANALYSIS INTEGRATION LAYER 5 PROCESS SOURCE LAYER 4 COLLECT

Statistical Data Warehouse layered architecture Source Layer The Source Layer is the level in which we locate all the activities related to storing and managing internal or external data sources. Typically, internal data are from direct data capturing carried out by CAWI, CAPI or CATI; while external data are from administrative archives, for example from Customs Agencies, Revenue Agencies, Chambers of Commerce, National Social Security Institutes.

DWA - Source layer Specialized functions: This level is responsible for, physically or virtually, storing the data from internal (surveys) or external (archives) sources for statistical purpose. Typical data sources, in the context of business statistics, are data from : specific direct data capturing CAWI, CATI, CAPI; archive from Customs Agency; archive from Revenue Agency; archive from Chambers of Commerce; archive from National Social Security Institute.

DWA - Source layer From the GSBPM we consider follow sub-processes : 4- Collect: 4.2-set up collection, ensures that the processes and technology are ready to collect data; 4.3-run collection, is where the collection is implemented, with different collection instruments being used to collect the data; 4.4-finalize collection, includes loading the collected data into a suitable electronic environment for further processing of the next layers.

Glossary: Data mining: an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. Data Warehouse is a central repository of data which is created by integrating data from one or more disparate sources. Data Mart is a subset of the data warehouse that is usually oriented to a specific business line or IT-Tolls. OLAP, OnLine Analytical Processing, is an approach to answering multi-dimensional analytical queries swiftly. Data Vault is a database modeling method that is designed to provide long-term historical storage of data coming in from multiple operational systems. Primary Information, is the original information which has been coded and cleaned in a common coherent environment. Secondary Information, is derived from primary information and could involve aggregate data. Business intelligence (BI) is a set of theories, methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information. BI-Tools are a set of IT-instruments used to analyze structured data and disseminates information with a topical focus. Graphics and publishing tools are able to generate graphs and tables for its users, this solution consists essentially in just a couple of steps to avoid inefficiency.

Glossary: CAPI, Computer-Assisted Personal Interviewing is an interviewing technique in which the respondent or interviewer uses a computer to answer the questions. CATI, Computer-Assisted Telephone Interviewing is a telephone surveying technique in which the interviewer follows a script provided by a software application. CAWI, Computer-assisted web interviewing is an Internet surveying technique in which the interviewer follows a script provided in a website. Graphics and publishing tools are able to generate graphs and tables for its users, this solution consists essentially in just a couple of steps to avoid inefficiency. Interoperability is the ability of two or more systems or components to exchange information and to use the information that has been exchanged. Operational system is refered to a system that is used to process the day-to-day transactions of an organization.

Modeling the Business Architecture In statistics, a possible standard definition of the production process is the Generic Statistical Business Process Model (GSBPM), by the 9 phases: 1 Specify Needs, 2 Design, 3 Build, 4 Collect, 5 Process, 6 Analyze, 7 Disseminate, 8 Archive, 9 Evaluate. Each phase is articulated by several sub statistical processes; which, according to process modeling theory, each sub-process should have a number of clearly identified attributes (input, output, owner, purpose, guide, enablers, feedback,..) 31 31

Generic Statistical Business Process Model (GSBPM) 32