Published by Maria Harvey. Modified over 8 years ago.
1
11/20/2016 12:11 AM Data Mining 1
Data Mining – CSE 9033, Chapter 1: Data Warehousing
Dr. Goutam Sarker, B.E., M.E., Ph.D. (Engineering), Fellow: IE(I), Fellow: IETE(I), Senior Member, IEEE, Associate Professor, CSE, NITD
2
Data Warehousing
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in support of management's decision-making process.
3
Subject-oriented: A data warehouse is organized around major subjects such as customers, products and sales. Data are organized by subject rather than by application. For example, an insurance company using a data warehouse would organize its data by customer, premium and claim instead of by product line (auto, life, etc.).
4
Non-volatile: A data warehouse is always a physically separate store of data, transformed from the data found in the operational environment. The data are not updated or changed in any way once they enter the data warehouse.
5
Time-varying: Data are stored in a data warehouse to provide a historical perspective.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases and flat files.
6
Multidimensional Data Model
At the core of the design of the data warehouse lies a multidimensional view of the data model. In a multidimensional data model, there is a set of numeric measures that are the main theme or subject of analysis.
8
The multidimensional model views data in the form of a data cube (or, more precisely, a hypercube). We shall in general use the term “data cube”. The cube in our example has three dimensions, namely gender, profession and year. Each dimension can be divided into sub-dimensions.
9
The figure shows a multidimensional view of the information corresponding to a two-dimensional statistical table. The data cube corresponding to the model has three dimensions, namely gender, profession and year. Each dimension can in turn be divided into sub-dimensions.
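As a rough illustration of the data cube described above, the following Python sketch stores a numeric measure at each (gender, profession, year) coordinate and then totals it along one dimension. The employment counts and the helper name `totals_by` are invented for illustration and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical employment counts; the figures are invented for illustration.
records = [
    ("male", "engineer", 2015, 120),
    ("female", "engineer", 2015, 95),
    ("male", "teacher", 2015, 60),
    ("female", "teacher", 2016, 80),
    ("male", "engineer", 2016, 130),
]

# The cube maps a coordinate (gender, profession, year) to a numeric measure.
cube = defaultdict(int)
for gender, profession, year, count in records:
    cube[(gender, profession, year)] += count


def totals_by(cube, keep):
    """Sum the measure over every dimension whose index is not in `keep`."""
    result = defaultdict(int)
    for coords, value in cube.items():
        result[tuple(coords[i] for i in keep)] += value
    return dict(result)


# Dimension index 1 is "profession".
by_profession = totals_by(cube, keep=[1])
```

Here `by_profession` holds one total per profession, which is the kind of two-dimensional summary a statistical table would show.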
10
Dimension Modeling
Dimension modeling is a special technique for structuring data around business concepts. The following figure shows the dimension modeling for our example:
11
[Fig. of page 14]
12
OLAP Operations
Data analysis tools are called OLAP (On-Line Analytical Processing) tools. OLAP is mainly used to access live data online and to analyze it. The basic OLAP operations for the multidimensional model are described below:
13
Slicing and Dicing
Slicing and dicing are used to reduce the data cube by one or more dimensions. The slice operation performs a selection on one dimension of the given cube, resulting in a sub-cube. The figure shows a slice operation in which the sales data are selected from the central cube for the dimension time, using the criterion time = “Q2”.
14
Dicing: This operation selects a smaller data cube and analyzes it from different perspectives. The dice operation defines a sub-cube by performing a selection on two or more dimensions. The figure shows a dice operation.
16
Slice: time = ‘Q2’
C[quarter, city, product] → C[city, product]
Dice: (time = ‘Q1’ or ‘Q2’) and (location = ‘Mumbai’ or ‘Pune’)
C[quarter, city, product] → C[quarter’, city’, product’]
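The slice and dice selections above can be sketched in Python as follows. This is a minimal illustration, assuming a cube stored as a dictionary from (quarter, city, product) to a measure; the sales figures and the function names `slice_cube` and `dice_cube` are hypothetical.

```python
# Hypothetical sales cube: (quarter, city, product) -> units sold.
cube = {
    ("Q1", "Mumbai", "laptop"): 10,
    ("Q1", "Pune", "phone"): 7,
    ("Q2", "Mumbai", "laptop"): 12,
    ("Q2", "Delhi", "phone"): 5,
    ("Q3", "Pune", "laptop"): 9,
}


def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop it from the coordinates."""
    return {
        coords[:axis] + coords[axis + 1:]: measure
        for coords, measure in cube.items()
        if coords[axis] == value
    }


def dice_cube(cube, allowed):
    """Keep cells whose coordinate on each listed axis is in the allowed set.

    All dimensions are retained, so the result is a smaller cube.
    """
    return {
        coords: measure
        for coords, measure in cube.items()
        if all(coords[axis] in values for axis, values in allowed.items())
    }


# Slice: time = 'Q2' gives a cube over (city, product) only.
q2 = slice_cube(cube, axis=0, value="Q2")

# Dice: time in {Q1, Q2} and location in {Mumbai, Pune}.
sub = dice_cube(cube, {0: {"Q1", "Q2"}, 1: {"Mumbai", "Pune"}})
```

The slice result has one dimension fewer than the input, while the dice result keeps all three dimensions with fewer values on two of them.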
17
Warehouse Schema
Star Schema: A star schema is a modeling paradigm in which the data warehouse contains a single, large, central Fact Table and a set of smaller Dimension Tables, one for each dimension. The Fact Table contains the detailed summary data. Its primary key has one key per dimension. Each dimension is a single, highly denormalized table. Every tuple in the Fact Table consists of the fact or subject of interest, and the dimensions that provide that fact.
18
Each Dimension Table has a 1:N relationship with the Fact Table. The advantages of the star schema are that it is easy to understand, makes it easy to define hierarchies, reduces the number of physical joins, and requires little maintenance.
19
Let us consider the “Employment” data warehouse. We have the Dimension Tables and one Fact Table. The star schema is shown in the figure.
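A star schema for the “Employment” example might look like the sketch below, which uses Python's built-in sqlite3 module. The table names, column names, the extra sector attribute and all the data are assumptions made for illustration; the slides do not list the actual columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One small, denormalized table per dimension (hypothetical columns).
CREATE TABLE dim_gender     (gender_id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE dim_profession (profession_id INTEGER PRIMARY KEY,
                             profession TEXT, sector TEXT);
CREATE TABLE dim_year       (year_id INTEGER PRIMARY KEY, year INTEGER);

-- Central fact table: its primary key has one key per dimension.
CREATE TABLE fact_employment (
    gender_id     INTEGER REFERENCES dim_gender(gender_id),
    profession_id INTEGER REFERENCES dim_profession(profession_id),
    year_id       INTEGER REFERENCES dim_year(year_id),
    employed      INTEGER,
    PRIMARY KEY (gender_id, profession_id, year_id)
);

INSERT INTO dim_gender VALUES (1, 'female'), (2, 'male');
INSERT INTO dim_profession VALUES (1, 'engineer', 'industry'),
                                  (2, 'teacher', 'education');
INSERT INTO dim_year VALUES (1, 2015), (2, 2016);
INSERT INTO fact_employment VALUES
    (1, 1, 1, 95), (2, 1, 1, 120), (1, 2, 2, 80), (2, 1, 2, 130);
""")

# Each dimension is exactly one join away from the fact table.
rows = conn.execute("""
    SELECT p.profession, y.year, SUM(f.employed)
    FROM fact_employment AS f
    JOIN dim_profession AS p ON p.profession_id = f.profession_id
    JOIN dim_year AS y ON y.year_id = f.year_id
    GROUP BY p.profession, y.year
    ORDER BY p.profession, y.year
""").fetchall()
```

Because every dimension table joins directly to the fact table, a query never needs more than one join per dimension.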
21
Snowflake Schema
1. The star schema consists of a single fact table and a single denormalized dimension table for each dimension.
2. To support attribute hierarchies, the dimension tables can be normalized to create snowflake schemas.
3. A snowflake schema consists of a single fact table and multiple dimension tables. As in the star schema, each tuple of the fact table contains a (foreign) key pointing to each of the dimension tables that provide its multidimensional coordinates.
22
Advantages of the Snowflake Schema
1. A normalized table is easier to maintain.
2. Normalizing also saves storage space, since an unnormalized Dimension Table tends to be large and may contain redundant information.
23
Disadvantage of the Snowflake Schema
1. The snowflake structure may reduce the effectiveness of navigating across the tables, because of the large number of join operations.
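The trade-off described for the snowflake schema can be seen in a small sqlite3 sketch. It assumes a hypothetical profession dimension whose sector attribute has been normalized into its own table; all names and figures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked dimension: the sector attribute is moved to its own table,
-- so each sector name is stored once instead of once per profession.
CREATE TABLE dim_sector (sector_id INTEGER PRIMARY KEY, sector TEXT);
CREATE TABLE dim_profession (
    profession_id INTEGER PRIMARY KEY,
    profession    TEXT,
    sector_id     INTEGER REFERENCES dim_sector(sector_id)
);
CREATE TABLE fact_employment (
    profession_id INTEGER REFERENCES dim_profession(profession_id),
    employed      INTEGER
);

INSERT INTO dim_sector VALUES (1, 'industry'), (2, 'education');
INSERT INTO dim_profession VALUES
    (1, 'engineer', 1), (2, 'welder', 1), (3, 'teacher', 2);
INSERT INTO fact_employment VALUES (1, 215), (2, 40), (3, 80);
""")

# Reaching the sector now takes two joins instead of one.
by_sector = conn.execute("""
    SELECT s.sector, SUM(f.employed)
    FROM fact_employment AS f
    JOIN dim_profession AS p ON p.profession_id = f.profession_id
    JOIN dim_sector AS s ON s.sector_id = p.sector_id
    GROUP BY s.sector
    ORDER BY s.sector
""").fetchall()
```

The normalized sector table removes the repeated sector names, at the cost of an extra join whenever a query groups by sector.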
25
Fact Constellation or Galaxy Schema
A Fact Constellation or Galaxy Schema is a schema in which more than one Fact Table shares one or more of the same Dimension Tables.
26
Fig. Fact Constellation: Fact1 and Fact2 share the same Dimension Tables, Dim2 and Dim3
27
Data Warehousing Architecture
The architecture is generally a 3-tier architecture:
1. Tier 1 is essentially the warehouse server.
2. Tier 2 is the OLAP engine for analytical processing.
3. Tier 3 is a client containing reporting tools, visualization tools, data mining tools, querying tools, etc.
There is also a backend process, which is concerned with extracting data from multiple operational databases and from external sources; with cleaning, transforming and integrating this data for loading into the data warehouse server; and with periodically refreshing the warehouse.
28
[Figure: three-tier architecture, labeled 1st tier, 2nd tier and 3rd tier]
29
Tier 1, Tier 2 and Tier 3
Tier 1 contains the main data warehouse. It can follow one of three models, or any combination of these:
1. Single enterprise warehouse
2. Several departmental marts
3. Virtual warehouse
30
Tier 2 can follow one of three different designs for the OLAP engine, namely:
1. ROLAP: Relational OLAP
2. MOLAP: Multidimensional OLAP
3. External SQL OLAP
31
Data Marts
Data marts are partitions of the overall data warehouse.
32
Virtual Data Warehouse
This model creates a virtual view of the databases, allowing the creation of a virtual warehouse. In a virtual warehouse, we have a logical description of all the databases and their structures. The data sources can be either local or remote. In this type of data warehouse, the data are not moved from the sources. Instead, the users are given direct access to the data.
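One way to picture a virtual warehouse is a view defined over several source databases. The sketch below, which is only an illustration, uses sqlite3 with two attached in-memory databases standing in for independent operational sources; the database names, table layout and data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two attached databases stand in for independent operational sources.
conn.execute("ATTACH DATABASE ':memory:' AS north")
conn.execute("ATTACH DATABASE ':memory:' AS south")
conn.execute("CREATE TABLE north.orders (city TEXT, amount INTEGER)")
conn.execute("CREATE TABLE south.orders (city TEXT, amount INTEGER)")
conn.execute("INSERT INTO north.orders VALUES ('Delhi', 50)")
conn.execute("INSERT INTO south.orders VALUES ('Pune', 30)")

# The "virtual warehouse" is only a view: no rows are copied out of the sources.
# A TEMP view is used because SQLite lets temporary views span attached databases.
conn.execute("""
    CREATE TEMP VIEW all_orders AS
    SELECT 'north' AS region, city, amount FROM north.orders
    UNION ALL
    SELECT 'south' AS region, city, amount FROM south.orders
""")


def total():
    """Query every source through the single view."""
    return conn.execute("SELECT SUM(amount) FROM all_orders").fetchone()[0]


before = total()
# A change in a source is visible immediately, with no load or refresh step.
conn.execute("INSERT INTO south.orders VALUES ('Chennai', 20)")
after = total()
```

Every call to `total()` runs against the source tables themselves, which is why such queries share the load with whatever else is using those sources.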
33
Advantages
1. It is possible to access remote data sources.
2. Multiple distributed data sources can be accessed through a single SQL statement, that is, through a single interface.
3. The data appear as a local source, and applications do not even need to know the physical location of the data.
4. A virtual database is easy, efficient and fast.
34
Advantages (contd.)
5. In this type of data warehouse, the data are not moved from the sources. Instead, the users are given direct access to the data.
35
Disadvantages
1. Since the queries must compete with production data transactions, performance can be considerably degraded.
2. Since there is no metadata, no summary data and no history, all the queries must be repeated, creating an additional burden on the system.
3. Over and above this, there is no cleaning or refreshing process involved, causing the queries to become very complex.
36
Metadata
The relationship between metadata and the data warehouse is the same as the relationship between a card catalogue and a traditional library. Thus, metadata provides pointers to the data. In addition, metadata may contain data extraction, communication and modeling algorithms, as well as data usage statistics.
37
Types of Metadata
1. Build-time Metadata: Whenever we design and develop a warehouse, the metadata that we generate can be termed “build-time metadata”.
2. Usage Metadata: When the warehouse is in production, usage metadata, which is derived from build-time metadata, is an important tool for users and data administrators. This metadata is used differently from build-time metadata.
38
OLAP Engine
The main function of the OLAP engine is to present the user with a multidimensional view of the data warehouse and to provide tools for OLAP operations. There are three options for the OLAP engine, as shown in the next slide.
39
Three Options for the OLAP Engine:
1. Specialized SQL Server: This model assumes that the warehouse organizes data in a relational structure and that the engine provides an SQL-like environment for OLAP tools.
2. Relational OLAP (ROLAP): In the ROLAP approach, the data do not need to be stored multidimensionally to be viewed multidimensionally.
3. Multidimensional OLAP (MOLAP): The third option is to have a special-purpose Multidimensional Data Model for the data warehouse, with a Multidimensional OLAP (MOLAP) server for analysis.
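The difference between the ROLAP and MOLAP options can be sketched with plain Python data structures. This is a simplified illustration, not how any particular OLAP server works: the rows, dimension values and function names below are hypothetical.

```python
# ROLAP-style storage: one relational row per fact (quarter, city, amount).
rows = [
    ("Q1", "Mumbai", 10),
    ("Q1", "Pune", 7),
    ("Q2", "Mumbai", 12),
]

quarters = ["Q1", "Q2"]
cities = ["Mumbai", "Pune"]


def rolap_total(rows, quarter):
    """Answer a multidimensional question by filtering relational rows."""
    return sum(amount for q, _city, amount in rows if q == quarter)


# MOLAP-style storage: a dense array indexed by dimension position.
# Empty cells (here Q2/Pune) still occupy a slot.
array = [[0] * len(cities) for _ in quarters]
for q, city, amount in rows:
    array[quarters.index(q)][cities.index(city)] += amount


def molap_total(array, quarter):
    """Answer the same question by direct array indexing."""
    return sum(array[quarters.index(quarter)])


q1_relational = rolap_total(rows, "Q1")
q1_array = molap_total(array, "Q1")
```

Both functions return the same total; the relational version scans rows, while the array version goes straight to the cells for the requested quarter.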
40
ROLAP vs. MOLAP
Advantages of MOLAP:
1. Relational tables are unnatural for multidimensional data.
2. Multidimensional arrays provide efficiency in storage and operations.
41
Advantages (contd.)
3. There is a mismatch between multidimensional operations and SQL.
4. For ROLAP to achieve efficiency, it has to perform outside current relational systems, which is the same as what MOLAP does.
42
Advantages of ROLAP
1. ROLAP integrates naturally with existing technology and standards.
2. MOLAP does not support ad hoc queries efficiently, because it is optimized for multidimensional operations.
3. Since data has to be downloaded into MOLAP systems, updating is difficult.
43
Advantages (contd.)
4. The efficiency of ROLAP can be achieved by using techniques such as encoding and compression.
5. ROLAP can readily take advantage of parallel relational technology.
44
Data Extraction
Data extraction is the process of extracting data for the warehouse from different sources. The data may come from a variety of sources, namely:
1. Production data
2. Legacy data
3. Internal office systems
4. External systems
5. Metadata
45
Data Cleaning
Data cleaning is essential for the construction of a quality data warehouse. The data cleaning techniques are:
1. Using transformation rules.
2. Using domain-specific knowledge.
3. Performing parsing and fuzzy matching.
4. Auditing, i.e. discovering facts that flag unusual patterns.
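Three of the cleaning techniques listed above (a transformation rule, fuzzy matching and auditing) can be combined in a short Python sketch. It uses `difflib.get_close_matches` from the standard library; the reference list of cities, the 0.75 similarity cutoff and the function name `clean_city` are assumptions chosen for illustration.

```python
import difflib

# Hypothetical reference list (domain-specific knowledge).
CANONICAL_CITIES = ["Mumbai", "Pune", "Delhi"]


def clean_city(raw, cutoff=0.75):
    """Return (cleaned value, status) for one raw city string."""
    # Transformation rule: trim whitespace and normalize capitalization.
    normalized = raw.strip().title()
    if normalized in CANONICAL_CITIES:
        return normalized, "ok"
    # Fuzzy matching: accept the closest canonical name above the cutoff.
    matches = difflib.get_close_matches(
        normalized, CANONICAL_CITIES, n=1, cutoff=cutoff
    )
    if matches:
        return matches[0], "corrected"
    # Auditing: values that match nothing are flagged for review.
    return normalized, "flagged"


results = [clean_city(value) for value in ["  pune", "Mumbay", "Dehli", "Kolkata"]]
```

A lower cutoff corrects more misspellings but also risks mapping a genuinely new value onto the wrong canonical name, so flagged values are kept rather than guessed.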
46
Loading
A loading system should allow system administrators to monitor the status; to cancel, suspend or resume loading; to change the loading rate; and to restart loading after failures.
There are three different data loading techniques:
1. Batch Loading
2. Sequential Loading
3. Incremental Loading
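The slides name the loading techniques without defining them. As one possible reading of incremental loading, the sketch below appends only the source rows that arrived since the previous load, tracked by a high-water mark on a row id. The row layout and the function name `incremental_load` are hypothetical.

```python
def incremental_load(source_rows, warehouse, last_loaded_id):
    """Append only rows whose id is greater than the high-water mark.

    Returns the new high-water mark so the next load can resume from it.
    """
    new_rows = [row for row in source_rows if row["id"] > last_loaded_id]
    warehouse.extend(new_rows)
    return max((row["id"] for row in new_rows), default=last_loaded_id)


source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 7}]
warehouse = []

# First load picks up both existing rows.
mark = incremental_load(source, warehouse, last_loaded_id=0)

# A new row arrives in the source; the second load copies only that row.
source.append({"id": 3, "amount": 12})
mark = incremental_load(source, warehouse, mark)
```

Keeping the mark outside the function also makes it easy to restart a load after a failure from the last id that was stored.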
47
Refresh Function
When a data source is updated, we need to update the warehouse. This process is called the refresh function. Determining how frequently to refresh is an important issue. One extreme is to refresh on every update. This is very expensive, however, and is normally necessary only when OLAP queries need the most current data.
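The cost of refreshing on every update can be illustrated with a toy model. The class below is a hypothetical sketch in which `refresh_every` controls how many source updates accumulate before the warehouse is refreshed; a value of 1 corresponds to the extreme case of refreshing on every update.

```python
class Warehouse:
    """Toy warehouse that refreshes after a fixed number of source updates."""

    def __init__(self, refresh_every):
        self.refresh_every = refresh_every  # 1 = refresh on every update
        self.pending = []       # updates seen but not yet in the warehouse
        self.data = []          # rows visible to OLAP queries
        self.refresh_count = 0  # how many refreshes have been run

    def source_updated(self, row):
        """Record a source update and refresh once enough have accumulated."""
        self.pending.append(row)
        if len(self.pending) >= self.refresh_every:
            self.refresh()

    def refresh(self):
        """Move all pending updates into the warehouse."""
        self.data.extend(self.pending)
        self.pending.clear()
        self.refresh_count += 1


eager = Warehouse(refresh_every=1)
lazy = Warehouse(refresh_every=3)
for update in range(4):
    eager.source_updated(update)
    lazy.source_updated(update)
```

After four updates the eager warehouse has refreshed four times and is fully current, while the lazy one has refreshed once and still has one update pending.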
48
End of Data Warehousing (Chapter 1)