Dátové sklady Pokročilé dátové technológie Genči.

Slides:



Advertisements
Similar presentations
1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.
Advertisements

Chapter 13 The Data Warehouse
ITEC 423 Data Warehousing and Data Mining Lecture 3.
By: Mr Hashem Alaidaros MIS 211 Lecture 4 Title: Data Base Management System.
Management Information Systems, Sixth Edition
Data Warehouse Architecture Sakthi Angappamudali Data Architect, The Oregon State University, Corvallis 16 th May, 2005.
ICS 421 Spring 2010 Data Warehousing (1) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 3/18/20101Lipyeow.
Chapter 9 DATA WAREHOUSING Transparencies © Pearson Education Limited 1995, 2005.
Distributed DBMSs A distributed database is a single logical database that is physically distributed to computers on a network. Homogeneous DDBMS has the.
Components and Architecture CS 543 – Data Warehousing.
DATA WAREHOUSING.
Data Warehouse success depends on metadata
Chapter 13 The Data Warehouse
1 © Prentice Hall, 2002 Chapter 11: Data Warehousing.
Data Warehouse Components
Data Warehousing DSCI 4103 Dr. Mennecke Introduction and Chapter 1.
Designing a Data Warehouse
Data Warehousing: Defined and Its Applications Pete Johnson April 2002.
Architecture and Infrastructure Module 2 G.Anuradha.
Leaving a Metadata Trail Chapter 14. Defining Warehouse Metadata Data about warehouse data and processing Vital to the warehouse Used by everyone Metadata.
Data Conversion to a Data warehouse Presented By Sanjay Gunasekaran.
BUSINESS INTELLIGENCE/DATA INTEGRATION/ETL/INTEGRATION AN INTRODUCTION Presented by: Gautam Sinha.
MDC Open Information Model West Virginia University CS486 Presentation Feb 18, 2000 Lijian Liu (OIM:
Dátové sklady Pokročilé dátové technológie Genči.
By N.Gopinath AP/CSE. Why a Data Warehouse Application – Business Perspectives  There are several reasons why organizations consider Data Warehousing.
PHASE 3: SYSTEMS DESIGN Chapter 7 Data Design.
Basic Concepts of Datawarehousing An Overview Prasanth Gurram.
Week 6 Lecture The Data Warehouse Samuel Conn, Asst. Professor
L/O/G/O Metadata Business Intelligence Erwin Moeyaert.
Understanding Data Warehousing
1 Brett Hanes 30 March 2007 Data Warehousing & Business Intelligence 30 March 2007 Brett Hanes.
Database Systems – Data Warehousing
Data Warehouse Concepts Transparencies
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie.
AN OVERVIEW OF DATA WAREHOUSING
I Information Systems Technology Ross Malaga 4 "Part I Understanding Information Systems Technology" Copyright © 2005 Prentice Hall, Inc. 4-1 DATABASE.
Data warehousing and online analytical processing- Ref Chap 4) By Asst Prof. Muhammad Amir Alam.
Discovering Computers Fundamentals Fifth Edition Chapter 9 Database Management.
1 Data Warehouses BUAD/American University Data Warehouses.
OLAP & DSS SUPPORT IN DATA WAREHOUSE By - Pooja Sinha Kaushalya Bakde.
1 Reviewing Data Warehouse Basics. Lessons 1.Reviewing Data Warehouse Basics 2.Defining the Business and Logical Models 3.Creating the Dimensional Model.
Data Warehouse Fundamentals
5 - 1 Copyright © 2006, The McGraw-Hill Companies, Inc. All rights reserved.
CISB594 – Business Intelligence
IS 325 Notes for Wednesday August 28, Data is the Core of the Enterprise.
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
MANAGING DATA RESOURCES ~ pertemuan 7 ~ Oleh: Ir. Abdul Hayat, MTI.
Sachin Goel (68) Manav Mudgal (69) Piyush Samsukha (76) Rachit Singhal (82) Richa Somvanshi (85) Sahar ( )
CHAPTER 7: ARCHITECTURAL COMPONENTS. CHAPTER OBJECTIVES  Understand data warehouse architecture  Examine how the architectural framework supports the.
Data Staging Data Loading and Cleaning Marakas pg. 25 BCIS 4660 Spring 2012.
By N.Gopinath AP/CSE. There are 5 categories of Decision support tools, They are; 1. Reporting 2. Managed Query 3. Executive Information Systems 4. OLAP.
1 Technology in Action Chapter 11 Behind the Scenes: Databases and Information Systems Copyright © 2010 Pearson Education, Inc. Publishing as Prentice.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
Metadata By N.Gopinath AP/CSE Metadata and it’s role in the lifecycle. The collection, maintenance, and deployment of metadata Metadata and tool integration.
7 Strategies for Extracting, Transforming, and Loading.
By N.Gopinath AP/CSE.  The data warehouse architecture is based on a relational database management system server that functions as the central repository.
1 Database Systems, 8 th Edition 1 Chapter 13 Business Intelligence and Data Warehouses Objectives In this chapter, you will learn: –How business intelligence.
Information Integration 15 th Meeting Course Name: Business Intelligence Year: 2009.
1 Copyright © Oracle Corporation, All rights reserved. Business Intelligence and Data Warehousing.
The Need for Data Analysis 2 Managers track daily transactions to evaluate how the business is performing Strategies should be developed to meet organizational.
Managing Data Resources File Organization and databases for business information systems.
Database Principles: Fundamentals of Design, Implementation, and Management Chapter 1 The Database Approach.
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
Data warehouse and OLAP
Chapter 13 The Data Warehouse
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie
An Introduction to Data Warehousing
THE ARCHITECTURAL COMPONENTS
Data Warehouse.
Data Warehousing Concepts
Presentation transcript:

Dátové sklady Pokročilé dátové technológie Genči

2 Obsah Literatúra Pojem INFORMÁCIA Motivácia pre DWH Bližší pohľad na DWH Štruktúra DWH Metadata Komponenty DWH Nástroje (Tools)

3 Literatúra [1] Lacko L.: Datové sklady, analýza OLAP a dolování dát s příklady …. Computer Press. Brno [2] Paulraj Ponniah: Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals John Wiley & Sons, Inc. ISBNs: (Hardback); (Electronic)

4

5 Literatúra (pokr.) [3] Ralph Kimball, Margy Ross: The Data Warehouse Toolkit. Second Edition Wiley Computer Publishing. [4] W. H. Inmon: Building the Data Warehouse Third Edition John Wiley & Sons, Inc.

6

7 Literatúra (pokr.) [5] Inmon W., Strauss D., Neushloss G.: DW 2.0: THE ARCHITECTURE FOR THE NEXT GENERATION OF DATA WAREHOUSING, Paperback, 400 pages, ISBN-13: , MORGAN KAUFFMAN

8

9 Pojem INFORMÁCIA Podľa firemnej literatúry ORACLE sa údaje stávajú informáciami, ak –máme údaje; –vieme, že máme údaje; –vieme, kde máme tieto údaje; –máme k nim prístup; –zdroju údajov môžeme dôverovať.

10 Hierarchia informačných úrovní Údaje Informácie Znalosti Múdrosť

11 Motivácia pre DWH Exekutíva potrebuje informácie (napr.) kvôli rozhodnutiu: –kde postaviť ďalší sklad; –ktorú produktovú líniu rozvíjať; –ktorý tržný segment by mal byť posilnený t.j. potrebuje realizovať strategické rozhodnutia a pre ne potrebuje strategickú informáciu

12 Strategická informácia Nemôžu ju poskytnúť OLTP systémy Neslúži pre denno-denné riadenie spoločnosti Dôležitá pre zdravý vývoj a prežitie spoločnosti Kritické rozhodnutia závisia od správnej (korektnej, patričnej) strategickej informácie

13 Požadované vlastnosti strategickej informácie

14 „Vstup“ dát

15 „Výstup“ informácií

16 Vyplývajúce protirečenia Organizácie majú veľké množstvo dát ale IT zdroje a systémy nie sú schopné efektívnym spôsobom toto množstvo dát premeniť na strategickú informáciu

17 Informačná kríza Nie kvôli nedostatku dát, ale preto, že dáta nie sú použiteľné pre strategické rozhodovanie Dôvody: –Údaje sú v spoločnostiach rozložené naprieč mnohými typmi nekompatibilných štruktúr a systémov –Údaje sú v spoločnostiach uložené v rôznych nezlúčiteľných systémoch, viacerých platformách a rozmanitých štruktúrach

18 These operational systems (order processing, inventory control, claims processing, outpatient billing,...) are not designed or intended to provide strategic information. If we need the ability to provide strategic information, we must get the information from altogether different types of systems. Only specially designed decision support systems or informational systems can provide strategic information.

19 Rozdiely

20 Processing Requirements in the New Environment Most of the processing in the new environment for strategic information will have to be analytical. There are four levels of analytical processing requirements: 1.Running of simple queries and reports against current and historical data 2.Ability to perform “what if ” analysis in many different ways 3.Ability to query, step back, analyze, and then continue the process to any desired length 4.Spot historical trends and apply them for future results

21 Data warehousing concept Take all the data you already have in the organization, clean and transform it, and then provide useful strategic information.

22 Data warehousing concept One of the most important approaches to the integration of data sources is based on a data warehouse architecture. In this architecture, data coming from multiple external data sources (EDSs) are extracted, filtered, merged, and stored in a central repository, called a data warehouse (DW). Data are also enriched by historical and summary information. From a technological point of view, a data warehouse is a huge database from several hundred GB to several dozens of TB. Thanks to this architecture, users operate on a local, homogeneous, and centralized data repository that reduces access time to data. Moreover, a data warehouse is independent of EDSs that may be temporarily unavailable. However, a data warehouse has to be kept up to date with respect to the content of EDSs, by being periodically refreshed.

Bližší pohľad na DWH

24 Functional definition of the data warehouse The data warehouse is an informational environment that: –Provides an integrated and total view of the enterprise –Makes the enterprise’s current and historical information easily available for decision making –Makes decision-support transactions possible without hindering operational systems –Renders the organization’s information consistent –Presents a flexible and interactive source of strategic information

25 DWH – zmes technológií

26 Bill Inmon’s definition Bill Inmon, considered to be the father of Data Warehousing provides the following definition: – “A Data Warehouse is a subject oriented, integrated, nonvolatile, and time variant collection of data in support of management’s decisions.”

27 The data in the data warehouse is –Separate –Available –Integrated –Time stamped –Subject oriented –Nonvolatile –Accessible

28 Subject-oriented

29 Integrated Data

30 Integrated Data (2) Before the data from various disparate sources can be usefully stored in a data warehouse, you have to: –remove the inconsistencies; –standardize the various data elements; –make sure of the meanings of data names in each source application.

31 Integrated Data (3) Before moving the data into the data warehouse, you have to go through a process of transformation, consolidation, and integration of the source data. Here are some of the items that would need standardization: –Naming conventions –Codes –Data attributes –Measurements

32 Nonvolatile Data

33 Time-Variant Data For an operational system, the stored data contains the current values. The data in the data warehouse is meant for analysis and decision making. A data warehouse, because of the very nature of its purpose, has to contain historical data, not just current values. Data is stored as snapshots over past and current periods. Every data structure in the data warehouse contains the time element.

34 Time-Variant Data (2) The time-variant nature of the data in a data warehouse –Allows for analysis of the past –Relates information to the present –Enables forecasts for the future

35 Data Granularity

36 DATA WAREHOUSES AND DATA MARTS

OVERVIEW OF THE COMPONENTS

38 Štruktúra DWH

39 Source data component Production systems Internal data (spreadsheets) Archived data (tapes) External data (stocks, interest rates, …)

40 Data Staging Component Data Extraction. Data Transformation. Data Loading.

41 Data Movement to the data Warehouse

42 Information Delivery Component

METADATA IN THE DATA WAREHOUSE

44 WHY METADATA IS IMPORTANT Users to compose and run the query can have several important questions: –Are there any predefined queries I can look at? –What are the various elements of data in the warehouse? –Is there information about unit sales and unit costs by product? –How can I browse and see what is available? –From where did they get the data for the warehouse? From which source systems? –How did they merge the data from the telephone orders system and the mail orders system? –How old is the data in the warehouse? –When was the last time fresh data was brought in? –Are there any summaries by month and product?

45 Metadata in a data warehouse contains the answers to questions about the data in the data warehouse.

46 Different definitions for metadata Data about the data Table of contents for the data Catalog for the data Data warehouse atlas Data warehouse roadmap Data warehouse directory Glue that holds the data warehouse contents together Tongs to handle the data The nerve center

47 Metadata in OLTP In operational systems we do not really have any easy and flexible methods for knowing the nature of the contents of the database. There is no great need for user-friendly interfaces to the database contents. The data dictionary or catalog is meant for IT uses only.

48 Metadata in DWH Users need sophisticated methods for browsing and examining the contents of the data warehouse. Users need to know the meanings of the data items. Users have to prevent them from drawing wrong conclusions from their analysis through their ignorance about the exact meanings. Without adequate metadata support, users of the large data warehouses are totally handicapped.

49 Types of Metadata Metadata in a data warehouse fall into three major categories: –Operational Metadata –Extraction and Transformation Metadata –End-User Metadata

50 Operational Metadata Data for the data warehouse comes from several operational systems of the enterprise. These source systems contain different data structures. The data elements selected for the data warehouse have various field lengths and data types. In selecting data from the source systems for the data warehouse, you split records, combine parts of records from different source files, and deal with multiple coding schemes and field lengths. When you deliver information to the end-users, you must be able to tie that back to the original source data sets. Operational metadata contain all of this information about the operational data sources.

51 Extraction and Transformation Metadata Extraction and transformation metadata contain data about the extraction of data from the source systems, namely, the extraction frequencies, extraction methods, and business rules for the data extraction. Also, this category of metadata contains information about all the data transformations that take place in the data staging area.

52 End-User Metadata The end-user metadata is the navigational map of the data warehouse. It enables the end-users to find information from the data warehouse. The end-user metadata allows the end-users to use their own business terminology and look for information in those ways in which they normally think of the business.

53

54

55

56

57 THE PROJECT TEAM

THE ARCHITECTURAL COMPONENTS

59 ARCHITECTURAL FRAMEWORK Flow of data

60 ARCHITECTURAL FRAMEWORK Control

61 TECHNICAL ARCHITECTURE The technical architecture of a data warehouse is the complete set of functions and services provided within its components. The technical architecture also includes the procedures and rules that are required to perform the functions and provide the services. The technical architecture also encompasses the data stores needed for each component to provide the services.

62 Data Acquisition Data acquisition covers the entire process of extracting data from the data sources, moving all the extracted data to the staging area, and preparing the data for loading into the data warehouse repository. The two major architectural components are source data and data staging.

63 Data Acquisition (2)

64 List of Functions and Services Data Extraction Select data sources and determine the types of filters to be applied to individual sources Generate automatic extract files from operational systems using replication and other techniques Create intermediary files to store selected data to be merged later Transport extracted files from multiple platforms Provide automated job control services for creating extract files Reformat input from outside sources Reformat input from departmental data files, databases, and spreadsheets Generate common application code for data extraction Resolve inconsistencies for common data elements from multiple sources

65 List of Functions and Services (2) Data Transformation Map input data to data for data warehouse repository Clean data, deduplicate, and merge/purge Denormalize extracted data structures as required by the dimensional model of the data warehouse Convert data types Calculate and derive attribute values Check for referential integrity Aggregate data as needed Resolve missing values Consolidate and integrate data

66 List of Functions and Services (3) Data Staging Provide backup and recovery for staging area repositories Sort and merge files Create files as input to make changes to dimension tables If data staging storage is a relational database, create and populate database Preserve audit trail to relate each data item in the data warehouse to input source Resolve and create primary and foreign keys for load tables Consolidate datasets and create flat files for loading through DBMS utilities If staging area storage is a relational database, extract load files

67 Data Storage Data storage covers the process of loading the data from the staging area into the data warehouse repository. All functions for transforming and integrating the data are completed in the data staging area. The prepared data in the data warehouse is like the finished product that is ready to be stacked in an industrial warehouse.

68 Data Storage (2)

69 Data Storage (3) List of Functions and Services Load data for full refreshes of data warehouse tables Perform incremental loads at regular prescribed intervals Support loading into multiple tables at the detailed and summarized levels Optimize the loading process Provide automated job control services for loading the data warehouse Provide backup and recovery for the data warehouse database Provide security Monitor and fine-tune the database Periodically archive data from the database according to preset conditions

70 Information Delivery Information delivery spans a broad spectrum of many different methods of making information available to users. For users, the information delivery component is the data warehouse.

71 Information Delivery (2) The information delivery component makes it easy for the users to access the information either directly from the enterprise-wide data warehouse, from the dependent data marts, or from the set of conformed data marts. Most of the information access in a data warehouse is through online queries and interactive analysis sessions.

72 Information Delivery (3)

73 Information Delivery (4) Almost all modern data warehouses provide for online analytical processing (OLAP). The primary data warehouse feeds data to proprietary multidimensional databases (MDDBs) where summarized data is kept as multidimensional cubes of information. The users perform complex multidimensional analysis using the information cubes in the MDDBs.

74 Functions and Services Provide security to control information access Monitor user access to improve service and for future enhancements Allow users to browse data warehouse content Simplify access by hiding internal complexities of data storage from users Automatically reformat queries for optimal execution Enable queries to be aware of aggregate tables for faster results Govern queries and control runaway queries Provide self-service report generation for users, consisting of a variety of flexible options to create, schedule, and run reports

75 Functions and Services (2) Store result sets of queries and reports for future use Provide multiple levels of data granularity Provide event triggers to monitor data loading Make provision for the users to perform complex analysis through online analytical processing (OLAP) Enable data feeds to downstream, specialized decisions support systems such as EIS and data mining

Tools

77 COLLECTION OF TOOLS In a data warehouse environment developers use third-party tools for different phases of the development: –code-generators for preparing in-house software for data extraction –accessing information through third-party query tools –creating reports with report writers

78

Basic purposes and features of the type of tool

80 Typy nástrojov Data Modeling Data Extraction Data Transformation Data Loading Data Quality Queries and Reports Online Analytical Processing (OLAP) Alert Systems Middleware and Connectivity Data Warehouse Management

81 Data Modeling Enable developers to create and maintain data models for the source systems and the data warehouse target databases. If necessary, data models may be created for the staging area. Provide forward engineering capabilities to generate the database schema. Provide reverse engineering capabilities to generate the data model from the data dictionary entries of existing source databases. Provide dimensional modeling capabilities to data designers for creating STAR schemas

82 Data Extraction Two primary extraction methods are available: bulk extraction for full refreshes and change-based replication for incremental loads. Tool choices depend on the following factors: source system platforms and databases, and available built-in extraction and duplication facilities in the source systems.

83 Data Transformation Transform extracted data into appropriate formats and data structures. Provide default values as specified. Major features include field splitting, consolidation, standardization, and deduplication.

84 Data Loading Load transformed and consolidated data in the form of load images into the data warehouse repository. Some loaders generate primary keys for the tables being loaded. For load images available on the same RDBMS engine as the data warehouse, precoded procedures stored on the database itself may be used for loading.

85 Data Quality Assist in locating and correcting data errors. May be used on the data in the staging area or on the source systems directly. Help resolve data inconsistencies in load images.

86 Queries and Reports Allow users to produce canned, graphic- intensive, sophisticated reports. Help users to formulate and run queries. Two main classifications are report writers, report servers.

87 Online Analytical Processing (OLAP) Allow users to run complex dimensional queries. Enable users to generate canned queries. Two categories of online analytical processing are multidimensional online analytical processing (MOLAP) and relational online analytical processing (ROLAP). MOLAP works with proprietary multidimensional databases that receive data feeds from the main data warehouse. ROLAP provides online analytical processing capabilities from the relational database of the data warehouse itself.

88 Alert Systems Highlight and get user’s attention based on defined exceptions. Provide alerts from the data warehouse database to support strategic decisions. Three basic alert types are: –from individual source systems, –from integrated enterprise-wide data warehouses, –from individual data marts.

89 Middleware and Connectivity Transparent access to source systems in heterogeneous environments. Transparent access to databases of different types on multiple platforms. Tools are moderately expensive but prove to be invaluable for providing interoperability among the various data warehouse components.

90 Data Warehouse Management Assist data warehouse administrators in day-to-day management. Some tools focus on the load process and track load histories. Other tools track types and number of user queries.

DW 2.0

92

93 There are several substantial differences between the first generation of data warehouses and DW html

94 The lifecycle of data. As data ages, its characteristics change. As a consequence, the data in DW 2.0 is divided into different sectors based on the age of the data. In the first generation of data warehouses, there was no such distinction. Unstructured data is a valid part of the data warehouse. Unstructured data is , spreadsheets, documents and so forth. Some of the most valuable information in the corporation resides in unstructured data. The first generation of data warehouses did not recognize that there was valuable data in the unstructured environment and that the data belonged in the data warehouse.

95 The way unstructured data is treated. Unstructured data exists in several forms in DW actual snippets of text, edited words and phrases, and matching text. The most interesting of these forms of unstructured data in the DW 2.0 environment is easily the matching text. In the structured environment, matches are made positively and surely. Not so with unstructured data. In DW 2.0, when matches are made, either between unstructured data and unstructured data or between unstructured data and structured data, the match is said to be probabilistic. The match may or may not be valid, and a probability of an actual match can be calculated or estimated. The concept of a probabilistic match is hard to fathom (porozumiet) for the person that has only dealt with structured systems, but it represents the proper way to link structured and unstructured data.

96 The need for close incorporation of metadata into the data warehouse. Metadata is the glue that holds the data together over its different states. Amazingly, the first generation of data warehousing omitted metadata as part of the infrastructure. The need for different levels of metadata. Metadata is found in many places today - multidimensional technology, data warehouses, database management system catalogs, spreadsheets, documents and extract, transform and load. There is little or no coordination of metadata from one architectural construct to another; however, there is still a need for a global repository. These sets of needs are recognized and addressed architecturally in DW 2.0. The recognition of the need for integrity of data as data passes from online processing to integrated processing. Because data is constantly changing (or at least subject to change), there is only fleeting (chvilkovy, letmy) integrity of data at the online level.

97 One other important distinction with DW 2.0 is that because DW 2.0 is trademarked, it enjoys legal protection. There is a strict and clearly stated definition of the architecture for DW 2.0, and no one except the original authors and architects can change the specifications. There is integrity, then, in the definition of DW 2.0. This architecture is fully described on the Web site All access to the Web site and all noncommercial usage of the material on the Web site is free. All commercial usage of the material is strictly prohibited.

98 The advantages of the DW 2.0 architecture include the ability to: –Hold data at the lowest detail, –Hold data to infinity (or at least to your retirement), –Not cost huge amounts of money, –Have integrity of data and still have online high-performance transaction processing, –Link structured data and unstructured data, –Tightly couple metadata to the data warehouse environment, –Support different kinds of processing without sacrificing response time, and –Support changes of data over time.

99