11/20/ :11 AMData Mining 1 Data Mining – CSE 9033 Chapter – 1; Data Warehousing Dr. Goutam Sarker, B.E., M.E., Ph.D.(Engineering), Fellow: IE(I),

Presentation transcript:

11/20/ :11 AMData Mining 1 Data Mining – CSE 9033 Chapter – 1; Data Warehousing Dr. Goutam Sarker, B.E., M.E., Ph.D.(Engineering), Fellow: IE(I), Fellow: IETE(I), Senior Member, IEEE, Associate Professor, CSE, NITD

Data Warehousing
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in support of management's decision-making process.

Subject-oriented: A data warehouse is organized around major subjects such as customers, products, and sales. Data are organized by subject rather than by application. For example, an insurance company using a data warehouse would organize its data by customer, premium, and claim instead of by product line (auto, life, etc.).

Non-volatile: A data warehouse is always a physically separate store of data, transformed from the operational environment. The data are not updated or changed in any way once they enter the data warehouse.

Time-varying: Data are stored in a data warehouse to provide a historical perspective.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases and flat files.

Multidimensional Data Model
At the core of the design of the data warehouse lies a multidimensional view of the data model. In a multidimensional data model, there is a set of numeric measures that are the main theme or subject of analysis.

The multidimensional model views data in the form of a data cube (or, more precisely, a hypercube). We shall in general use the term "data cube". The cube in our example has three dimensions, namely gender, profession, and year. Each dimension can be divided into sub-dimensions.

The figure shows a multidimensional view of the information in a two-dimensional statistical table. The corresponding data cube has three dimensions, namely gender, profession, and year. Each dimension can in turn be divided into sub-dimensions.
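
As a minimal sketch of such a cube, the cells can be held in a dictionary keyed by the three dimensions named on the slide. The professions, years, and counts below are invented for illustration, and `total_by` is a hypothetical helper, not part of the original material.

```python
from collections import defaultdict

# Hypothetical employment counts keyed by (gender, profession, year).
# All values are made up for illustration only.
cube = {
    ("female", "engineer", 2001): 120,
    ("female", "teacher", 2001): 300,
    ("male", "engineer", 2001): 180,
    ("male", "teacher", 2001): 150,
    ("female", "engineer", 2002): 140,
    ("male", "engineer", 2002): 175,
}

DIMENSIONS = ("gender", "profession", "year")


def total_by(cube, dimension):
    """Sum the measure over all other dimensions, keeping one dimension."""
    axis = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for coordinates, measure in cube.items():
        totals[coordinates[axis]] += measure
    return dict(totals)


by_gender = total_by(cube, "gender")
by_year = total_by(cube, "year")
```

Each cell holds one numeric measure, and every coordinate of the key is a position along one dimension; summing along the other axes gives the two-dimensional or one-dimensional tables that a flat statistical report would show.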

Dimension Modeling
Dimension modeling is a special technique for structuring data around business concepts. The following figure shows the dimension modeling for our example:

Fig. of page 14

OLAP Operations
Data analysis tools of this kind are called OLAP (On-Line Analytical Processing) tools. OLAP is mainly used to access live data online and to analyze it. The basic OLAP operations for the multidimensional model are described below:

Slicing and Dicing
Slicing and dicing are used to reduce the data cube by one or more dimensions. The slice operation performs a selection on one dimension of the given cube, resulting in a sub-cube. The figure shows a slice operation where the sales data are selected from the central cube for the dimension time, using the criterion time = "Q2".

Dicing: This operation selects a smaller data cube and analyzes it from different perspectives. The dice operation defines a sub-cube by performing a selection on two or more dimensions. The figure shows a dice operation.

Slice: time = 'Q2'
C[quarter, city, product] → C[city, product]
Dice: time = 'Q1' or 'Q2' and location = 'Mumbai' or 'Pune'
C[quarter, city, product] → C[quarter', city', product']
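
The two operations can be sketched on a small in-memory cube. This is only an illustration under assumed data: the sales figures, the product names, and the helper names `slice_cube` and `dice_cube` are not from the slides.

```python
# Hypothetical sales cube C[quarter, city, product]; all figures are invented.
sales = {
    ("Q1", "Mumbai", "laptop"): 10,
    ("Q1", "Pune", "laptop"): 4,
    ("Q1", "Delhi", "phone"): 7,
    ("Q2", "Mumbai", "phone"): 12,
    ("Q2", "Pune", "laptop"): 6,
    ("Q3", "Mumbai", "laptop"): 9,
}


def slice_cube(cube, quarter):
    """Fix one value on the quarter dimension and drop that dimension."""
    return {
        (city, product): value
        for (q, city, product), value in cube.items()
        if q == quarter
    }


def dice_cube(cube, quarters, cities):
    """Select on two dimensions; the result keeps all three dimensions."""
    return {
        (q, city, product): value
        for (q, city, product), value in cube.items()
        if q in quarters and city in cities
    }


q2_slice = slice_cube(sales, "Q2")
diced = dice_cube(sales, {"Q1", "Q2"}, {"Mumbai", "Pune"})
```

The slice result is keyed by (city, product) only, matching C[city, product], while the dice result is still three-dimensional but restricted to the selected quarters and cities.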

Warehouse Schema
Star Schema: A star schema is a modeling paradigm in which the data warehouse contains a single, large, central fact table and a set of smaller dimension tables, one for each dimension. The fact table contains the detailed summary data. Its primary key has one key per dimension. Each dimension is a single, highly denormalized table. Every tuple in the fact table consists of the fact or subject of interest, and the dimensions that provide that fact.

Each dimension table has a 1:N relationship with the fact table: one dimension row can be referenced by many fact rows. The advantages of the star schema are that it is easy to understand, makes hierarchies easy to define, reduces the number of physical joins, and requires little maintenance.

Let us consider the "Employment" data warehouse. We have the dimension tables and one fact table. The star schema is shown in the figure.
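
Since the figure is not reproduced here, the following is a hedged sketch of what a star schema for the employment example might look like, using Python's built-in `sqlite3` module. All table names, column names, and values are assumptions chosen to match the gender, profession, and year dimensions mentioned earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_gender     (gender_id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE dim_profession (profession_id INTEGER PRIMARY KEY,
                             profession TEXT, sector TEXT);
CREATE TABLE dim_year       (year_id INTEGER PRIMARY KEY, year INTEGER);

-- Fact table: one foreign key per dimension plus the numeric measure.
CREATE TABLE fact_employment (
    gender_id     INTEGER REFERENCES dim_gender(gender_id),
    profession_id INTEGER REFERENCES dim_profession(profession_id),
    year_id       INTEGER REFERENCES dim_year(year_id),
    employed      INTEGER,
    PRIMARY KEY (gender_id, profession_id, year_id)
);

INSERT INTO dim_gender VALUES (1, 'female'), (2, 'male');
INSERT INTO dim_profession VALUES (1, 'engineer', 'technical'),
                                  (2, 'teacher', 'education');
INSERT INTO dim_year VALUES (1, 2001), (2, 2002);
INSERT INTO fact_employment VALUES
    (1, 1, 1, 120), (2, 1, 1, 180), (1, 2, 1, 300), (1, 1, 2, 140);
""")

# One join per dimension used in the query is enough in a star schema.
rows = conn.execute("""
    SELECT p.profession, SUM(f.employed)
    FROM fact_employment AS f
    JOIN dim_profession AS p ON p.profession_id = f.profession_id
    GROUP BY p.profession
    ORDER BY p.profession
""").fetchall()
```

The fact table's primary key is composed of one key per dimension, and the profession dimension stays denormalized: its `sector` attribute lives in the same table rather than in a table of its own.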

Snowflake Schema
1. The star schema consists of a single fact table and a single denormalized table for each dimension.
2. To support attribute hierarchies, the dimension tables can be normalized to create snowflake schemas.
3. A snowflake schema consists of a single fact table and multiple dimension tables. As in the star schema, each tuple of the fact table consists of a (foreign) key pointing to each of the dimension tables that provide its multidimensional coordinates.

Advantages of the Snowflake Schema
1. A normalized table is easier to maintain.
2. Normalizing also saves storage space, since an unnormalized dimension table tends to be large and may contain redundant information.

Disadvantage of the Snowflake Schema
1. The snowflake structure may reduce the effectiveness of navigating across the tables because of the large number of join operations.
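
The trade-off can be illustrated with a small `sqlite3` sketch in which a profession dimension is normalized into a separate sector table. The table names, column names, and figures are assumptions for illustration, not the schema from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The sector attribute is moved out of the profession dimension,
-- so each sector name is stored once instead of once per profession.
CREATE TABLE dim_sector     (sector_id INTEGER PRIMARY KEY, sector TEXT);
CREATE TABLE dim_profession (profession_id INTEGER PRIMARY KEY,
                             profession TEXT,
                             sector_id INTEGER REFERENCES dim_sector(sector_id));
CREATE TABLE fact_employment (
    profession_id INTEGER REFERENCES dim_profession(profession_id),
    year          INTEGER,
    employed      INTEGER
);

INSERT INTO dim_sector VALUES (1, 'technical'), (2, 'education');
INSERT INTO dim_profession VALUES (1, 'engineer', 1), (2, 'programmer', 1),
                                  (3, 'teacher', 2);
INSERT INTO fact_employment VALUES (1, 2001, 300), (2, 2001, 200), (3, 2001, 450);
""")

# Grouping by sector now needs two joins instead of one.
by_sector = conn.execute("""
    SELECT s.sector, SUM(f.employed)
    FROM fact_employment AS f
    JOIN dim_profession AS p ON p.profession_id = f.profession_id
    JOIN dim_sector     AS s ON s.sector_id = p.sector_id
    GROUP BY s.sector
    ORDER BY s.sector
""").fetchall()
```

The normalized layout removes the repeated sector text, which is the storage and maintenance advantage, but every query that reaches the outer table pays for one more join.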

Fact Constellation or Galaxy Schema
A fact constellation, or galaxy schema, is a schema in which more than one fact table shares the same dimension tables.

Fig. Fact Constellation: Fact1 and Fact2 share the same dimension tables Dim2 and Dim3
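
A very small sketch of the structure in the caption: each fact table is described by the set of dimension tables it references, and the shared dimensions are their intersection. Fact1, Fact2, Dim2, and Dim3 come from the caption; Dim1 and Dim4 are assumed here so that each fact table also has a dimension of its own.

```python
# Each fact table lists the dimension tables it references.
# Dim1 and Dim4 are assumptions; only Dim2 and Dim3 are named in the caption.
fact_tables = {
    "Fact1": {"Dim1", "Dim2", "Dim3"},
    "Fact2": {"Dim2", "Dim3", "Dim4"},
}

# Dimensions referenced by every fact table in the constellation.
shared_dimensions = set.intersection(*fact_tables.values())
```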

Data Warehousing Architecture
The architecture is generally a three-tier architecture:
1. Tier 1 is essentially the warehouse server.
2. Tier 2 is the OLAP engine for analytical processing.
3. Tier 3 is a client containing reporting tools, visualization tools, data mining tools, querying tools, etc.
There is also a back-end process concerned with extracting data from multiple operational databases and from external sources; with cleaning, transforming, and integrating these data for loading into the data warehouse server; and, of course, with periodically refreshing the warehouse.

Fig. Three-tier architecture: 1st tier, 2nd tier, 3rd tier

Tier 1, Tier 2 and Tier 3
Tier 1 contains the main data warehouse. It can follow one of three models or any combination of them:
1. Single enterprise warehouse
2. Several departmental marts
3. Virtual warehouse

Tier 2 can implement the OLAP engine in one of three ways:
1. ROLAP: Relational OLAP
2. MOLAP: Multidimensional OLAP
3. External SQL OLAP

Data Marts
Data marts are partitions of the overall data warehouse.

Virtual Data Warehouse
This model creates a virtual view of databases, allowing the creation of a virtual warehouse. In a virtual warehouse, we have a logical description of all the databases and their structures. The data resources can be either local or remote. In this type of data warehouse, the data are not moved from the sources. Instead, users are given direct access to the data.
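
One way to picture this, as a sketch rather than a description of any particular product, is a view defined over several source databases. The example uses Python's `sqlite3` module with two attached in-memory databases standing in for independent sources; the database names, table layout, and `total_for` helper are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two separate in-memory databases stand in for independent source systems.
conn.execute("ATTACH DATABASE ':memory:' AS branch_a")
conn.execute("ATTACH DATABASE ':memory:' AS branch_b")
conn.execute("CREATE TABLE branch_a.sales (city TEXT, amount INTEGER)")
conn.execute("CREATE TABLE branch_b.sales (city TEXT, amount INTEGER)")
conn.execute("INSERT INTO branch_a.sales VALUES ('Mumbai', 100), ('Pune', 40)")
conn.execute("INSERT INTO branch_b.sales VALUES ('Mumbai', 60)")

# The view is only a logical description of the sources; no rows are copied.
conn.execute("""
    CREATE TEMP VIEW all_sales AS
    SELECT city, amount FROM branch_a.sales
    UNION ALL
    SELECT city, amount FROM branch_b.sales
""")


def total_for(city):
    """Query the virtual warehouse through a single SQL statement."""
    row = conn.execute(
        "SELECT SUM(amount) FROM all_sales WHERE city = ?", (city,)
    ).fetchone()
    return row[0]


before = total_for("Mumbai")
# A change in a source is visible immediately, because nothing was moved.
conn.execute("INSERT INTO branch_b.sales VALUES ('Mumbai', 5)")
after = total_for("Mumbai")
```

Because every query runs directly against the sources, the view always reflects their current state, which is also why such queries share load with the source systems.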

Advantages
1. It is possible to access remote data sources.
2. Data distributed across multiple sources can be accessed through a single SQL statement, that is, a single interface.
3. The data appear as a local source, and applications do not even need to know the physical location of the data.
4. A virtual database is easy, efficient, and fast.

Advantages (contd.)
5. In this type of data warehouse, the data are not moved from the sources. Instead, users are given direct access to the data.

Disadvantages
1. Since the queries must compete with production data transactions, performance can be considerably degraded.
2. Since there is no metadata, summary data, or history, all queries must be repeated, creating an additional burden on the system.
3. Above all, there is no cleaning or refreshing process involved, causing the queries to become very complex.

Metadata
The relationship between metadata and the data warehouse is the same as that between a card catalogue and a traditional library. Thus, metadata provide pointers to the data. In addition, metadata may cover data extraction, communication, modeling algorithms, and data usage statistics.

Types of Metadata
1. Build-time metadata: The metadata generated while a warehouse is being designed and developed are termed "build-time metadata".
2. Usage metadata: When the warehouse is in production, usage metadata, which are derived from build-time metadata, are an important tool for users and data administrators. They are used differently from build-time metadata.

OLAP Engine
The main function of the OLAP engine is to present the user with a multidimensional view of the data warehouse and to provide tools for OLAP operations. There are three options for the OLAP engine, as shown on the next slide.

Three Options for the OLAP Engine:
1. Specialized SQL Server: This model assumes that the warehouse organizes data in a relational structure and that the engine provides an SQL-like environment for OLAP tools.
2. Relational OLAP (ROLAP): In the ROLAP approach, the data do not need to be stored multidimensionally to be viewed multidimensionally.
3. Multidimensional OLAP (MOLAP): The third option is to have a special-purpose multidimensional data model for the data warehouse, with a MOLAP server for analysis.
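
The difference between the relational and multidimensional options can be sketched by storing the same facts both ways and answering the same query against each. This is a simplified illustration with invented data; real ROLAP and MOLAP servers are far more elaborate, and the function names are hypothetical.

```python
# Relational (ROLAP-style) storage: one row per fact.
rows = [
    ("Q1", "Mumbai", 10),
    ("Q1", "Pune", 4),
    ("Q2", "Mumbai", 12),
    ("Q2", "Pune", 6),
]


def rolap_total(rows, quarter):
    """Answer a cube query by scanning and aggregating relational rows."""
    return sum(amount for q, _city, amount in rows if q == quarter)


# Multidimensional (MOLAP-style) storage: a dense array addressed by position.
quarters = ["Q1", "Q2"]
cities = ["Mumbai", "Pune"]
array = [[0] * len(cities) for _ in quarters]
for q, city, amount in rows:
    array[quarters.index(q)][cities.index(city)] = amount


def molap_total(array, quarter):
    """Answer the same query by summing one row of the array."""
    return sum(array[quarters.index(quarter)])


rolap_q2 = rolap_total(rows, "Q2")
molap_q2 = molap_total(array, "Q2")
```

Both functions present the same multidimensional answer; only the storage differs. The array form has to be built from the rows first, which hints at why loading and updating a multidimensional store takes extra work.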

ROLAP vs. MOLAP
Advantages of MOLAP:
1. Relational tables are unnatural for multidimensional data.
2. Multidimensional arrays provide efficiency in storage and operations.

Advantages of MOLAP (contd.)
3. There is a mismatch between multidimensional operations and SQL.
4. For ROLAP to achieve efficiency, it has to perform operations outside current relational systems, which is the same as what MOLAP does.

Advantages of ROLAP
1. ROLAP integrates naturally with existing technology and standards.
2. MOLAP does not support ad hoc queries efficiently, because it is optimized for multidimensional operations.
3. Since data have to be downloaded into MOLAP systems, updating is difficult.

Advantages of ROLAP (contd.)
4. The efficiency of ROLAP can be achieved by using techniques such as encoding and compression.
5. ROLAP can readily take advantage of parallel relational technology.

Data Extraction
Data extraction is the process of extracting data for the warehouse from different sources. The data may come from a variety of sources, namely:
1. Production data
2. Legacy data
3. Internal office systems
4. External systems
5. Metadata

Data Cleaning
Data cleaning is essential for the construction of a quality data warehouse. The data cleaning techniques are:
1. Using transformation rules
2. Using domain-specific knowledge
3. Performing parsing and fuzzy matching
4. Auditing, i.e. discovering facts that flag unusual patterns
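
A minimal sketch of how these techniques might combine for a single field, using `difflib` from the Python standard library for the fuzzy-matching step. The reference list, the rename rules, and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib

# Domain knowledge: an assumed reference list of valid city names.
KNOWN_CITIES = ["Mumbai", "Pune", "Delhi"]

# Transformation rule: map legacy spellings to the current name.
RENAME_RULES = {"bombay": "Mumbai", "poona": "Pune"}


def clean_city(raw):
    """Return a canonical city name, or None if the value needs auditing."""
    value = raw.strip()  # parsing step: remove stray whitespace
    if value.lower() in RENAME_RULES:
        return RENAME_RULES[value.lower()]
    # Fuzzy matching: accept the closest known name if it is similar enough.
    matches = difflib.get_close_matches(
        value.title(), KNOWN_CITIES, n=1, cutoff=0.8
    )
    return matches[0] if matches else None


cleaned = [clean_city(v) for v in [" bombay ", "Mumbay", "pune", "Xyz"]]
```

Values that come back as `None` are the ones an auditing step would flag for manual review instead of loading them silently.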

Loading
A loading system should allow system administrators to monitor status; to cancel, suspend, or resume loading; to change the loading rate; and to restart loading after failures. There are three data loading techniques:
1. Batch loading
2. Sequential loading
3. Incremental loading
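
The slide only names the techniques, so the following sketch rests on an assumed reading: batch loading rebuilds the warehouse table from all source rows, while incremental loading adds only rows that arrived since the previous load. The row layout, the use of an increasing `id` as a high-water mark, and the function names are all hypothetical.

```python
def batch_load(source_rows):
    """Rebuild the warehouse table from scratch with every source row."""
    return {row["id"]: row for row in source_rows}


def incremental_load(warehouse, source_rows, last_loaded_id):
    """Add only rows newer than the last loaded id.

    Returns the new high-water mark so the next load can resume from it.
    """
    new_rows = [row for row in source_rows if row["id"] > last_loaded_id]
    for row in new_rows:
        warehouse[row["id"]] = row
    return max((row["id"] for row in new_rows), default=last_loaded_id)


source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
warehouse = batch_load(source)

# A new row appears in the source; only that row is loaded.
source.append({"id": 3, "amount": 30})
mark = incremental_load(warehouse, source, last_loaded_id=2)
```

Keeping the high-water mark outside the function also makes it straightforward to restart a load after a failure: the same call can be repeated with the last mark that was safely recorded.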

Refresh Function
When the data source is updated, we need to update the warehouse. This process is called the refresh function. Determining how frequently to refresh is an important issue. One extreme is to refresh on every update. This is very expensive, however, and is normally necessary only when OLAP queries need the most current data.
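
The cost difference between the two extremes can be sketched by counting refreshes under different policies. The `Warehouse` class and its `refresh_every` parameter are hypothetical and exist only to make the trade-off concrete.

```python
class Warehouse:
    """Toy warehouse that refreshes after a fixed number of source updates."""

    def __init__(self, refresh_every):
        self.refresh_every = refresh_every  # source updates between refreshes
        self.pending = []                   # updates not yet in the warehouse
        self.rows = []                      # data visible to OLAP queries
        self.refresh_count = 0

    def on_source_update(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.refresh_every:
            self.refresh()

    def refresh(self):
        self.rows.extend(self.pending)
        self.pending.clear()
        self.refresh_count += 1


eager = Warehouse(refresh_every=1)  # refresh on every update
lazy = Warehouse(refresh_every=5)   # refresh once per five updates
for i in range(10):
    eager.on_source_update(i)
    lazy.on_source_update(i)
```

After ten updates both warehouses hold the same rows, but the eager one has refreshed five times as often; in exchange, its queries never see stale data between refreshes.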

End of Data Warehousing (Chapter 1)