Data Warehousing Denis Manley Enterprise Systems FT228/3
Data Warehousing A data warehouse is a repository for all of an organisation's data. It is designed to present data in very user-friendly formats to end-users. Why is data warehousing necessary? Think of a typical company that uses computers for order processing, invoicing, etc. Data is gathered for every transaction that occurs. Different departments probably use different systems, each designed for a specific task.
Data Warehousing (cont’d) There will be lots of different database tables, in different formats, on different machines. There will be tie-ups between data items, but those tie-ups may not be obvious outside the applications that manipulate the tables. Also, the amount of data is constantly increasing. Some estimates state that the size of a typical data processing system doubles every two years. Both standard and ad-hoc reports must be catered for, and they must run in a reasonable time-frame.
Data Warehousing (cont’d) Some solutions to the issues listed can be implemented in the applications that manage specific data. E.g. if you want to ask “How much does this customer owe?”, then the original package is probably the one to use. But if you want to ask “Was this ad campaign more successful than that one?”, you require data from more disparate sources, and one application may not provide all of it.
Data Warehousing (cont’d) The new alternative is a Data Warehouse: a D.W. is a way of organising data from a wide range of formats, possibly from different locations, so it’s all available in a single location. Once this stage is complete, the collection of data is usually frequently replicated around multiple locations. This means users have a local copy of the data they need to inspect. This improves query run-times, and reduces communications overheads.
Data Warehousing (cont’d) The data is not only collected and joined, however. What the data means is described in a way that the end-users can easily understand, e.g. using the name “Customer Account Number” instead of “order>cacno” for a field. The data must be quickly accessible, so that users can find data quickly and easily.
Data Warehousing process Operational data in legacy systems (e.g. OLTP applications) feeds the warehouse through several stages. Data fusion: assembles diverse data. Data cleansing: “fixing” data. Meta data: records the transformations of data, origins of data, etc. Data migration: loads data and meta data into the warehouse periodically. The data warehouse is then queried by decision-support analysts.
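The stages above can be sketched in a few lines of code. This is only an illustrative toy pipeline: every function, field and system name below is invented, not taken from any particular warehousing product.

```python
# A minimal sketch of the warehouse load pipeline described above
# (data fusion -> cleansing -> migration). All names are hypothetical.

def fuse(*sources):
    """Data fusion: assemble rows from diverse operational sources."""
    fused = []
    for name, rows in sources:
        for row in rows:
            tagged = dict(row)
            tagged["_source"] = name   # record the origin, for the meta data
            fused.append(tagged)
    return fused

def cleanse(rows):
    """Data cleansing: 'fix' obviously bad values before loading."""
    for row in rows:
        if row.get("amount") is None:          # default for missing data
            row["amount"] = 0.0
        row["country"] = row.get("country", "unknown").strip().upper()
    return rows

def migrate(rows, warehouse):
    """Data migration: periodically load the cleansed rows into the warehouse."""
    warehouse.extend(rows)
    return len(rows)

# Two operational systems holding the same kind of data in different shapes.
orders = [{"amount": 125.0, "country": "ie"}]
invoices = [{"amount": None, "country": " Ireland "}]

warehouse = []
loaded = migrate(cleanse(fuse(("orders", orders), ("invoices", invoices))), warehouse)
print(loaded, warehouse[1]["country"])   # 2 IRELAND
```

Note how the two sources disagree on how a country is recorded; the cleansing step forces a single convention before anything reaches the warehouse.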
Data Warehousing Architecture A data warehouse is an architectural construct of an information system that provides users with current and historical decision support information that is hard to access or present in traditional operational data stores. It comprises a number of components, illustrated in figure 1.
Data Warehousing Architecture Figure 1: A data warehouse architecture
Data Warehousing Architecture Data warehouse database This is the cornerstone of the data warehouse (item 2 in figure 1). It is almost always implemented on a relational database management system. However, very large databases, ad hoc processing and the need for flexible user views (e.g. aggregates and drill-downs) are calling for different technological approaches: parallel relational database designs requiring the use of symmetric multiprocessors, massively parallel processors and clusters
Data Warehousing Architecture Speeding up traditional RDBMSs; the use of multidimensional databases (MDDBs), which are tightly coupled to on-line analytical processing
Data Warehousing Architecture Sourcing, acquisition, cleanup and transformation of data Implementing data warehouses involves extracting data from operational systems, including legacy systems, and putting it into a suitable format. The various tools are illustrated as item 1 in figure 1. These tools perform all the conversions, summarisations, key changes, structural changes, and condensations needed to transform disparate data into information that can be used by decision support tools
Data Warehousing Architecture This tooling produces the programs and control statements required to move data into the data warehouse from multiple operational systems. It also: Maintains the metadata. Removes unwanted data. Converts to common data names and definitions. Calculates summaries. Establishes defaults for missing data. Keeps track of source data definition changes
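Two of the tasks listed above, converting source field names to common warehouse names and calculating summaries, can be shown concretely. The field names and rows below are invented for illustration (compare the earlier “order>cacno” vs. “Customer Account Number” example).

```python
# Hypothetical illustration of name conversion and summary calculation.
COMMON_NAMES = {"cacno": "customer_account_number",
                "amt": "order_amount"}

def to_common_names(row):
    """Convert cryptic source field names to common warehouse definitions."""
    return {COMMON_NAMES.get(k, k): v for k, v in row.items()}

source_rows = [{"cacno": "C042", "amt": 19.5},
               {"cacno": "C042", "amt": 30.5},
               {"cacno": "C099", "amt": 12.0}]

rows = [to_common_names(r) for r in source_rows]

# Calculate a summary: total order amount per customer account.
summary = {}
for r in rows:
    key = r["customer_account_number"]
    summary[key] = summary.get(key, 0.0) + r["order_amount"]

print(summary)   # {'C042': 50.0, 'C099': 12.0}
```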
Data Warehousing Architecture The tools have to deal with: Database heterogeneity: DBMSs can differ greatly in data models, data access languages, etc. Data heterogeneity: the difference in the way data is defined and used, e.g. synonyms and different attributes for the same entity, etc.
Data Warehousing Architecture Metadata (item 3, figure 1): data about data that describes the data warehouse. Technical metadata: contains information for warehouse designers and administrators, e.g. information about data sources, transformation descriptions, rules used to perform data clean-up, access authorisation, information delivery history, data acquisition history, data access, etc. Business metadata: information that gives users an understanding of the information stored in the data warehouse, e.g. queries, reports, images, and data warehouse operational information such as data history and ownership
Meta Data Versioning In the operational environment, there tends to be a single instance of data and meta data at any one moment in time. In the data warehouse environment, there are multiple definitions of data and meta data over an historically long period of time. Therefore, versioning of data warehouse meta data is crucial to the success of the data warehouse vis-a-vis the end users’ ability to access and understand the data in the data warehouse.
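A minimal sketch of what versioned (time-variant) meta data looks like: every definition a field has ever had is kept, with the date it took effect, so an end user can see what a term meant at any point in history. The field name, definitions and dates below are invented.

```python
from datetime import date

# Time-variant meta data: each field keeps every definition it has had,
# together with the date that definition took effect. All values invented.
versions = {
    "product_category": [
        (date(1995, 1, 1), "Four top-level product lines"),
        (date(1997, 6, 1), "Six top-level product lines after the merger"),
    ]
}

def definition_as_of(field, when):
    """Return the definition that was in force on a given date."""
    current = None
    for effective, text in sorted(versions[field]):
        if effective <= when:
            current = text
    return current

print(definition_as_of("product_category", date(1996, 3, 1)))
print(definition_as_of("product_category", date(1998, 1, 1)))
```

This is exactly the lookup an end user needs when a 1996 report and a 1998 report disagree: the data may be right in both, under different definitions.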
Guidelines for Metadata Management Develop an Information Directory that integrates technical and business metadata. Keep the metadata current and accurate! Maintain the time variant history of the metadata! Provide meaningful descriptions and definitions (use business definitions, not technical definitions, where possible). The end users must be educated about what metadata is, how to access it, how to use it, etc.
Meta Data Answers Questions for Users of the Data Warehouse How do I find the data I need? What is the original source of the data? How was this summarization created? What queries are available to access the data? How have business definitions and terms changed over time? How do product lines vary across organizations? What business assumptions have been made?
The Role of Meta Data in the Data Warehouse Architecture Meta data enables data to become information, because with it you know what data you have, and you can trust it!
Data marts (item 4 figure 1) A data mart is a data store that is subsidiary to a data warehouse of integrated data. The data mart is directed at a partition of data (subject area) that is created for the use of a dedicated group of users, and is sometimes termed a “subject warehouse”. The data mart might be a set of denormalised, summarised or aggregated data that can be placed on the data warehouse database or, more often, placed on a separate physical store. Data marts are “dependent data marts” when the data is sourced from the data warehouse. Independent data marts represent fragmented solutions to a range of business problems in the enterprise; however, that approach should not be deployed, as it lacks the “data integration” that is associated with data warehouses.
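A dependent data mart can be as simple as an aggregated table built directly from the warehouse, as this SQLite sketch shows. The table and column names are invented for illustration; the point is that the mart is sourced from the warehouse, so integration is preserved.

```python
import sqlite3

# Hypothetical warehouse table with a few sales facts.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL)")
db.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?)",
               [("north", "widget", 100.0), ("north", "gadget", 50.0),
                ("south", "widget", 75.0)])

# A dependent data mart: denormalised, summarised, and sourced from
# the warehouse itself (never directly from the operational systems).
db.execute("""CREATE TABLE mart_sales_by_region AS
              SELECT region, SUM(amount) AS total
              FROM warehouse_sales GROUP BY region""")

for region, total in db.execute(
        "SELECT region, total FROM mart_sales_by_region ORDER BY region"):
    print(region, total)
```

An independent mart would instead repeat its own extract from the source systems, which is precisely the duplication the following slides warn about.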
Independent Data Marts Each independent mart repeats the whole extract, transform, clean, integrate and summarize effort against the systems of record: three independent marts means doing that work three times!
Independent Data Marts The result is significant and expensive duplication of effort and data.
Independent Data Marts Maintenance of proliferating, unarchitected marts is expensive and cumbersome.
Unarchitected Data Marts There may be metadata for some marts, but what about consistency and history?
Contrast: Architected Data Warehouse Data access goes through the data warehouse (atomic level plus the information directory), which feeds dependent (architected) departmental “marts”, each with the appropriate subset of metadata, all sourced from the systems of record.
Independent Data Marts vs. The Real Thing Unarchitected data marts: easy to do, but are the extracts, transformations, integrations and loads consistent? Is the redundancy managed? What is the impact on the sources? The architected data warehouse: architected to meet organizational as well as departmental requirements; data and results consistent; redundancy is managed; detailed history available for drill-down; metadata is consistent!
Independent Data Marts However, such marts are not necessarily all bad. They are often a valid solution to a pressing business problem: Extremely urgent user requirements. The absence of a budget for a full data warehouse. The decentralisation of business units.
Data Warehousing Architecture Access Tools (item 5 figure 1) The principal purpose of the data warehouse is to provide information for strategic decision making. The main tools used to achieve this objective are: Data query and reporting tools Executive information system tools On-line analytical processing tools Data mining tools
Data Warehousing Architecture Query and reporting tools Reporting tools: production reporting tools (e.g. generating operational reports) and report writers (inexpensive desktop tools). Managed query tools: shield users from the complexities of SQL and database structures by inserting a metalayer between the users and the database
A Few Definitions OLAP: On-Line Analytical Processing; a set of functionality that attempts to facilitate multidimensional analysis. Multidimensional Analysis: the ability to manipulate information by a variety of relevant categories or “dimensions” to facilitate analysis and an understanding of that data; has also been called “drill-down”, “drill-across” and “slice and dice”.
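“Slice and dice” is easy to see with a toy example: the same facts are summed along whichever dimension the analyst chooses, optionally after fixing (“slicing”) other dimensions. The facts and dimension names below are invented.

```python
# Illustrative multidimensional analysis over invented fact rows.
facts = [
    {"year": 1996, "region": "north", "product": "widget", "qty": 10},
    {"year": 1996, "region": "south", "product": "widget", "qty": 4},
    {"year": 1997, "region": "north", "product": "gadget", "qty": 6},
]

def rollup(facts, dimension):
    """Aggregate quantity along one chosen dimension."""
    totals = {}
    for f in facts:
        totals[f[dimension]] = totals.get(f[dimension], 0) + f["qty"]
    return totals

def slice_(facts, **fixed):
    """Hold some dimensions fixed (a 'slice') and keep the matching facts."""
    return [f for f in facts if all(f[d] == v for d, v in fixed.items())]

print(rollup(facts, "region"))                      # {'north': 16, 'south': 4}
print(rollup(slice_(facts, year=1996), "product"))  # {'widget': 14}
```

Real OLAP tools do the same manipulations, but against pre-built cubes or star schemas rather than a Python list.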
A Few Definitions Hypercube: a means of visually representing multidimensional data. Star Schema: a means of aggregating data based on a set of known dimensions, attempting to store data multidimensionally in a two-dimensional RDBMS. Snowflake Schema: an extension of the star schema, applying additional dimensions to the dimensions of a star schema in a relational environment.
A Few Definitions Multidimensional Database: also known as MDDB or MDDBS; a class of proprietary, non-relational database management tools that store and manage data in a multidimensional manner, as opposed to the two dimensions associated with traditional relational database management systems. OLAP Tools: a set of software products that attempt to facilitate multidimensional analysis; can incorporate data acquisition, data access, data manipulation, or any combination thereof.
A Few Definitions ROLAP: Relational OLAP; using an RDBMS to implement an OLAP environment. Typically involves a star schema to provide the multidimensional capabilities; the OLAP tool manipulates RDBMS star schema data.
A Few Definitions MOLAP (1): Multidimensional OLAP; using an MDDBS to store and access data. The MDDBS “directly” manages data multidimensionally, and usually requires proprietary (non-SQL) access tools. MOLAP (2): an OLAP tool that facilitates multidimensional capabilities without the need for a star schema; often utilizes a 3-tier environment, where a middle-tier server preprocesses data from an RDBMS. Some OLAP tools access an RDBMS directly and build “cubes” as a “fat” client.
How Can OLAP Be Accomplished? Use the Data Warehouse as the architected foundation for the organization’s informational processing requirements Use the appropriate design techniques to ensure that the data required is at the appropriate degree of granularity at the atomic level, and the appropriate degree of summarization at the departmental level Use the appropriate tools to either access relational data in a multidimensional manner, or manage multidimensional data
Data mining A critical success factor (CSF) for any business is its ability to use information effectively. This strategic use can come from discovering hidden, undetected and frequently valuable information about customers, suppliers, retailers, etc. This information can be used to formulate effective business, marketing and sales strategies. A relatively new technology that can be used to achieve this strategic advantage is “data mining”
Data visualisation Data visualisation is a method of presenting the output of the previously mentioned methods in such a way that the problem or solution is clearly visible to domain experts and even casual observers. It goes way beyond simple bar and pie charts. It is a collection of complex techniques that focus on determining the best way to display complex patterns on a 2-D computer monitor. Such techniques involve experimenting with various colours, shapes, 3-D images, sound and virtual reality to help users really see and feel the problem and solution.
DW administration and management (Item 6 figure 1) Managing data warehouses involves: Security and priority management. Monitoring updates from multiple sources. Data quality checks. Managing and updating metadata. Replicating, subsetting and distributing data. Backup and recovery. Data warehouse storage management, e.g. capacity planning, hierarchical storage management, purging of aged data.
Information delivery system (Item 7 figure 1) The IDS enables the process of subscribing for data warehouse information and having it delivered to one or more destinations of choice according to some user-specified scheduling algorithm. Delivery may be based on time of day or on completion of an external event. The IDS can be implemented with a client/server architecture, and now via the Internet/intranet and the World Wide Web.
Top 10 “Don’ts” of Data Warehousing
10. Pre-selecting Your Technical Environment This is a very common trap in which many organizations find themselves. It is traditional to select the hardware, software and other physical, technical components of a system as one of the earliest activities. However, a data warehouse is an unstable environment from a sizing perspective. How do you know the hardware/RDBMS/end user tool is appropriate for your data warehouse before conducting even the first round of analysis? If at all possible, wait to select your technical environment until after you have analyzed the business requirements for information, data, and potential systems of record.
9. Allowing Access Tool to Determine Data Architecture This is an extension of #10, but is important enough to list by itself. If you select an end user tool before developing your data architecture, it is very likely that that architecture will suffer at the hands of design requirements delivered by the tool. If you have to sacrifice design requirements in order to meet functional requirements of a tool, it is probably time to put that tool aside and select another one.
8. Unarchitected Data Marts OK. Data marts are good; they are an essential part of the data warehouse architecture. But to build only a data mart and to ignore the rest of the data warehouse (specifically the atomic level data and centralized meta data) will lead you down a path that will be more expensive and deliver lower-quality data than the alternative. The alternative is to architect and build the data warehouse incrementally, iteratively. Include data marts as departmental instances of the architecture, and populate them from the atomic level data. This will ensure accuracy across the architecture, and reduce costs by eliminating unnecessary population of stand-alone data marts.
7. Boiling the Ocean It is more efficient to implement the data warehouse in small, achievable and palatable chunks than to try to implement it all at once. When I say “boil the ocean”, I mean trying to do too many things for too many people all at the same time. There is an old adage: “You can’t have everything; where would you put it all?” The same holds true for a data warehouse. If you try to design, develop and implement a data warehouse that is all-encompassing as your first iteration, how will the users be able to use all that you delivered? And in the meantime, while you’ve been trying to meet all of their needs, you have failed to meet any needs. And users won’t forget that for a long time.
6. “If you build it they will come” If you design, develop and implement an operational system, such as an order processing system, that new system is typically going to replace an existing system. In other words, the old system goes away and the users have no choice but to use the new one. Not so with the data warehouse. “If you build it…” implies an analysis that includes only bottom-up activities. It is crucial to the success of a data warehouse that a top-down analysis of user requirements for information be conducted. After that, users must be tutored, mentored and otherwise have their hands held as part of the implementation of the data warehouse. Existence does not guarantee utilization and, therefore, value.
5. Putting ROI before RFI (Requirements for Information) It is very difficult to quantify the intangible benefits that a data warehouse can provide to an organization. How can you put a price on increased customer loyalty? Somewhere, sometime, someone has probably made this calculation. In most cases, however, the determination of how beneficial the data warehouse will be is based on criteria that were developed for operational systems. Just as you cannot use operational data to meet your strategic informational requirements, it is difficult to calculate the return on investment (ROI) of a data warehouse. In terms of benefits to the organization, it is more appropriate to concentrate on how well the data warehouse addresses the target users’ requirements for information.
4. No Committed User Involvement Write this down: The success of any data warehouse is directly proportional to the amount of end user participation! A data warehouse cannot be successful without active participation on the part of the target users. Period. If you do not have user participation, you will find yourself in a situation where you will build it and hope that they will come. If there is no serious user participation in a data warehouse project, you have to seriously question whether or not the organization truly needs a data warehouse.
3. No Dedicated DBA In many situations the lack of a dedicated database administrator (DBA) has prevented a data warehouse project from being completed 1) on time, or 2) successfully. “Borrowing” a DBA from the operational “pool” will only result in questions about the nature of the data warehouse data models and database design. “It’s too flat”, “not normalized properly”, “too much redundancy” and other criticisms are well suited to an operational system’s database design, but not a data warehouse. Considering that “data” is the first word in “data warehouse”, be sure you have a dedicated database administration resource committed to this important project.
2. No Meta Data Meta data is like documentation and training: Everyone knows it is necessary, but it usually gets dropped somewhere along the route to implementation. For the data warehouse, meta data is more important than just your typical documentation. Remember, in order to turn data into information you have to have the data, know that you have it, be able to access it, and trust it. Meta data is the means by which the users will be able to understand and trust the data. A time-variant record of where data came from, what happened to it along the way, where it is in the data warehouse, and what vehicles exist to access it will spell the difference between success and frustration.
1. Analysis Paralysis The inability to proceed past a sticking question. Wanting to “boil the ocean” and model/design everything before proceeding with development. Having to resolve political issues surrounding a “standard” or “common” definition. All of these things (and more!) will result in analysis paralysis. The 80/20 rule is very applicable to the development of a data warehouse. Execute 20% effort to get 80% of the total outcome, then move on to the next set of challenges and opportunities. Many data warehouse failures started when the development team stopped. Get your hands around an idea, understand what the users’ requirements for information are, and build something that produces something that can be evaluated. Don’t just stand there…do something!
Parallel Data Management A topic that’s closely linked to Data Warehousing is that of Parallel Data Management. The argument goes: if your main problem is that your queries run too slowly, use more than one machine at a time to make them run faster (Parallel Processing). Oracle uses this strategy in its warehousing products. There are two types of parallel processing: Symmetric Multiprocessing (SMP) and Massively Parallel Processing (MPP).
Parallel Data Management (cont’d) SMP means the O.S. runs and schedules tasks on more than one processor without distinction. In other words, all processors are treated equally in an effort to get the list of jobs done. MPP is more varied in its design, but essentially consists of multiple processors, each running its own program. The problem with MPP is harnessing all these processors to solve a single problem.
Parallel Data Management (cont’d) Regardless of the architecture used, there are still alternatives regarding the use of the parallel processing capabilities. In normal Transaction processing, each transaction runs on a separate processor, because transactions are small units of work that run in a reasonable time-span. However, the type of analysis carried out in data warehouse applications isn’t like that. Typically you want to run a query that looks at all the data in a set of tables. The problem is splitting that into chunks that can be assigned to the multiple processors.
Parallel Data Management (cont’d) There are two possible solutions to this problem: Static and Dynamic Partitioning. In Static Partitioning you break up the data into a number of sections. Each section is placed on a different processor with its own data storage and memory. The query is then run on each of the processors, and the results combined at the end to give the entire picture. This is like joining a queue in a supermarket. You stay with it until you reach the check-out.
Parallel Data Management (cont’d) The main problem with Static Partitioning is that you can’t tell how much processing the various sections need. If most of the relevant data is processed by one processor you could end up waiting almost as long as if you didn’t use parallel processing at all. In Dynamic Partitioning the data is stored in one place, and the data server takes care of splitting the query into multiple tasks, which are allocated to processors as they become available. This is like the single queue in a bank. As a counter position becomes free the person at the head of the queue takes that position
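The difference between the two queueing analogies can be made concrete with a toy cost model. This is a serial simulation, not real parallel code, and the chunk sizes are invented; the point is only how work gets assigned to workers under each scheme.

```python
# Toy cost model contrasting static and dynamic partitioning.
# Units of work per data chunk; one chunk is "hot" (most relevant data).
chunks = [8, 1, 1, 1, 1]

def static_cost(chunks, workers):
    """Static: each worker gets a fixed slice up front (supermarket queues)."""
    loads = [sum(chunks[i::workers]) for i in range(workers)]
    return max(loads)        # the query finishes when the slowest worker does

def dynamic_cost(chunks, workers):
    """Dynamic: workers pull the next chunk as they free up (single bank
    queue); approximated here by greedily assigning to the idlest worker."""
    loads = [0] * workers
    for c in sorted(chunks, reverse=True):
        loads[loads.index(min(loads))] += c
    return max(loads)

print(static_cost(chunks, 2))   # 10: one worker got the hot chunk plus others
print(dynamic_cost(chunks, 2))  # 8: the hot chunk effectively runs alone
```

With two workers, static partitioning leaves one worker doing almost all the work, which is the skew problem described above; dynamic partitioning keeps both busy.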
Parallel Data Management (cont’d) With Dynamic Partitioning the performance improvement can be dramatic, but the partitioning is out of the user’s hands.
2 Tiered Architecture Tier 1 (Enterprise Server): shared global data storage, shared application logic. Tier 2 (Client): end user functionality, data display, personal data storage, personal application logic. This slide defines the fundamentals of a two-tiered architecture.
2 Tiered Architecture in the DW Environment Enterprise Server (tier 1): atomic data acquisition, organizational level data, secondary data acquisition, departmental level data. Client (tier 2): data access processing, individual level data, data manipulation. This slide depicts where the various components of the data warehouse architecture would fall in a two-tiered technical architecture.
2 Tiered Architecture in the DW Environment This is a graphical representation of the two-tiered data warehouse architecture. The atomic level, metadata (information directory) and departmental levels would reside on the Enterprise Server, while any individual level data would be stored on the client. In this scenario the client is a “thin” one, with little processing beyond query and simple data manipulation.
2 Tiered Architecture in the DW Environment In this variation, a “fat” client at the second tier facilitates a considerable amount of data manipulation, freeing up capacity on the Enterprise Server for other, more global activities.
2 Tiered Architecture in the DW Environment In this variation the second tier is a “thin” client, and volumes of data are pre-processed (multidimensional preprocessing) on the Enterprise Server. This enables access to sophisticated analyses by more than one “thin” client.
3 Tiered Architecture Tier 1 (Enterprise Server): shared global data storage, shared application logic. Tier 2 (Departmental Server): shared local data, shared application logic. Tier 3 (Client): end user functionality, personal data storage, personal application logic. In a three-tiered technical environment a “middle” tier has been added for shared data and application logic.
3 Tiered Architecture in the DW Environment Enterprise Server (tier 1): atomic data acquisition, organizational level data. Departmental Server (tier 2): secondary data acquisition, departmental level data. Client (tier 3): data access processing, individual data, data manipulation. In the data warehouse, this middle tier can take on the responsibility for departmental level (data mart) processing. There is little or no impact on the third tier, while the first tier is relieved of some mid-level functionality.
3 Tiered Architecture in the DW Environment In this graphical representation of the three-tiered technical architecture, the departmental data has been moved to the middle tier. In reality, departmental instances can still exist on the first tier. In cases where the middle tier is geographically distant from the first tier, it is advisable to provide a subset of local metadata for the users of the middle tier’s data.
3 Tiered Architecture in the DW Environment This scenario also supports a “fat” client at the third tier, where the individual level data is maintained and manipulated.
3 Tiered OLAP Arch. in the DW Environment One of the biggest benefits of a three-tiered technical architecture for a data warehouse is the ability to facilitate tremendous amounts of pre-processing of data on the middle tier. Enterprise Server (tier 1): atomic data acquisition, organizational level data. Departmental Server (tier 2): secondary data acquisition, departmental level data, shared data access processing. “Thin” Client (tier 3): data access, individual data, data manipulation.
3 Tiered OLAP Arch. in the DW Environment In this case, the middle tier pre-processes data for use by many “thin” clients, rather than requiring the individual levels on the third tier to repetitively process the same data with complex algorithms.
The Atomic Schema [Diagram: a normalized atomic schema with entities such as Customer, Cust Purchases, Product Ref, Cust Averages, Outlet Reference and Sales Rep Ref, each carrying detailed keyed attributes.] Atomic level data is structured to support a wide variety of informational requirements across the organization. As a result, atomic data is too normalized to be easily accessed or understood by most end users. Data consistently needs to be aggregated into the same categories (dimensions). Multidimensional processing capabilities provide users with tremendous flexibility for most of their analysis requirements.
The Star Schema [Diagram: a central Fact Table (Dimension Keys 1 to 4, Facts 1 to n) joined to four Dimension Tables, each holding its Dimension Key, a Description and a set of aggregation levels.]
Dimension Table [Diagram: Dimension Table 1 with Dimension Key 1, Description 1 and Aggregation Levels 1.1 to 1.n.] Describes the data that has been organized in the Fact Table. The key should be the most detailed aggregation level necessary (e.g. country vs. county), if possible; otherwise surrogate keys may be necessary, but these will decrease the natural value of the key. Keep a manageable number of aggregation levels.
Fact Table [Diagram: Fact Table keyed by Dimension Keys 1 to 4, holding Facts 1 to n.] Quantifies the data that has been described by the Dimension Tables. The key is made up of a unique combination of values of the dimension keys, and ALWAYS contains a date or date dimension. Fact values should be additive: aggregations of quantities or amounts from the atomic level, with no percentages or ratios. The table may also contain non-additive, time-variant data.
For Example: [Diagram: fact table Purchases 1 (Days of Activity, Unit Price, Total Quantity, Total Dollars, Returned Qty, Returned Dollars, Promotion Qty), keyed by Sales Rep ID, Product Code, Cust ZIP Code, Customer Type and Week Ending Date, with dimension tables Customer Location (Cust ZIP Code, City, State/Province, Country), Selling Responsibility (Sales Rep ID, Sales Rep Name, Store ID, Store Name, Store Location, Sales Channel), Customer Type (Customer Type, Cust Type Desc), Product (Product Code, Product Name, Prod. Category, Product Type, Prod Sub Type) and Date Information (Week Ending Date, Month, Quarter, Year).]
Star Schema Query
Select E.Month, B.Customer_Type, C.Product_Type, D.Store_Location, sum(A.Total_Quantity)
From Purchases_1 A, Customer_Type B, Product C, Selling_Responsibility D, Date_Information E
Where B.Customer_Type = A.Customer_Type
and C.Product_Code = A.Product_Code
and D.Sales_Rep_ID = A.Sales_Rep_ID
and E.Week_Ending_Date = A.Week_Ending_Date
and E.Year = '1996'
and C.Product_Category = 'V'
Group by E.Month, B.Customer_Type, C.Product_Type, D.Store_Location;
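The shape of that query (join the fact table to each dimension on its key, constrain and group by dimension attributes) can be run end-to-end on a cut-down version of the schema. This SQLite sketch keeps only two dimensions and uses invented sample rows.

```python
import sqlite3

# A runnable, simplified star schema: one fact table, two dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE product (product_code TEXT PRIMARY KEY, product_type TEXT,
                      product_category TEXT);
CREATE TABLE date_info (week_ending_date TEXT PRIMARY KEY, month TEXT, year TEXT);
CREATE TABLE purchases (product_code TEXT, week_ending_date TEXT,
                        total_quantity INTEGER);
""")
db.executemany("INSERT INTO product VALUES (?,?,?)",
               [("P1", "van", "V"), ("P2", "car", "C")])
db.executemany("INSERT INTO date_info VALUES (?,?,?)",
               [("1996-01-07", "Jan", "1996"), ("1996-02-04", "Feb", "1996")])
db.executemany("INSERT INTO purchases VALUES (?,?,?)",
               [("P1", "1996-01-07", 5), ("P1", "1996-02-04", 3),
                ("P2", "1996-01-07", 9)])

# Join facts to each dimension on its key, constrain on dimension
# attributes, and group by the attributes of interest.
rows = db.execute("""
    SELECT d.month, p.product_type, SUM(f.total_quantity)
    FROM purchases f, product p, date_info d
    WHERE p.product_code = f.product_code
      AND d.week_ending_date = f.week_ending_date
      AND d.year = '1996' AND p.product_category = 'V'
    GROUP BY d.month, p.product_type
    ORDER BY d.month
""").fetchall()
print(rows)   # [('Feb', 'van', 3), ('Jan', 'van', 5)]
```

The 9 units of the category-'C' product are filtered out by the dimension constraint, exactly as the Product_Category = 'V' predicate does in the full query.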
Caution!: Overly Complex Dimension Number of aggregation levels within the dimension becomes unmanageable Logically or functionally incorrect combination of aggregation levels within a dimension
Answer #1: Split Dimension Split overly complex dimension into several logical, more manageable dimensions, based on business function of aggregation levels
Answer #2: The Snowflake Identify hierarchies of aggregation levels and “dimensionalize” the primary dimension. The secondary dimensions are descriptive of the primary dimension.
Answer: Distinct Time Period Fact Tables [Diagram: separate Weekly and Monthly fact tables, each keyed by Date and dimensions D1 to D4.] Create separate fact tables to account for different time periods. Date is still part of each fact table key, and the same dimension tables are used by both fact tables. This improves overall performance (loading and accessing) for each time period, and will not increase the amount of managed redundancy. Different time periods (weekly, monthly, accounting period, billing cycle) are required for different analysis purposes.