Data Models for Warehouse Session-12/13 Data Management for Decision Support.

Data Models
- Data models
  - relations
  - stars and snowflakes
  - cubes
- Operators
  - slice and dice
  - roll-up, drill-down
  - pivoting
  - other

Data Models
- Star schemas are database schemas that exploit the structure of data for decision-support queries.
  - Queries in a DSS tend to:
    - examine a set of factual transactions (POS, customer events)
    - analyze facts in a variety of ways (POS transactions by week or by store)
  - For example, in a retail store:
    - POS is at the center
    - Product information: SKU, hierarchy (section, department, BU)
    - Time information: day, week, month, year
    - Stores: store ID, hierarchy (region, city, locality)
    - Suppliers: supplier ID, location, discounts

Data Models
[Figure: Sales Transactions at the center, linked to Products, Time, Suppliers, and Stores]
Information is split between two classes: factual information and reference information.

Fact Data
- Fact data records factual events that occurred in the business: POS transactions, phone calls, banking transactions.
- Typically, about 70% of warehouse data is fact data.
- It is important to identify and define the structure correctly at the outset, because restructuring is an expensive process.
- The detailed content of a fact table is derived from the business requirements.
- Recorded facts do not change, because they are events of the past.

Dimension Data
- Information used for analyzing the elemental data, for example product hierarchy, time periods, customers, stores.
- It is the reference data used for the analysis of facts.
- Organizing the information in separate reference tables offers better query performance.
- It differs from fact data in that it changes over time, due to changes in the business or reorganization.
- It should be structured to permit rapid changes.

Fact and Dimensions
- Fact table
  - Millions to billions of rows
  - Multiple foreign keys
  - Numeric
  - Does not change
- Dimension table
  - Tens to millions of rows
  - One primary key
  - Textual description
  - Frequently modified

Decision Support Queries
- Examples:
  - Average number of sales of Haldiram products per store over the last month (various types within the brand)
  - Projected sales of Deepavali gift packs against actual sales
  - The top 20% of customers (by spending) over the last quarter
  - Customers with an average balance in excess of Rs. 25000 over the past year
- Each of these queries is based on factual data.

Decision Support Queries
- Examples (query subject, fact table, recorded attributes):
  - Sales of Haldiram: POS transaction (quantity sold, product, store, date and time, revenue realized)
  - Customer spend: membership card transaction (customer ID, store, transaction value, date and time)
  - Account balance: account transactions (customer account number, type of transaction, amount)

Star Schema
- The star schema is a data-modeling technique used to map multidimensional decision-support data into a relational database.
- Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.
- Four components:
  - Facts
  - Dimensions
  - Attributes
  - Attribute hierarchies

A Simple Star Schema

Star Schema
- Facts
  - Facts are numeric measurements (values) that represent a specific business aspect or activity.
  - The fact table contains facts that are linked through their dimensions.
  - Facts can be computed or derived at run time (metrics).
- Dimensions
  - Dimensions are qualifying characteristics that provide additional perspectives on a given fact.
  - Dimensions are stored in dimension tables.

Identifying Facts and Dimensions
1. Identify the elemental transactions
2. Determine the key dimensions
3. Check whether a fact is actually a dimension
4. Check whether a dimension is actually a fact

Identification: Step 1
- Examine the enterprise model and identify the transactions that are of interest, driven by business requirement analysis.
- These will be transactions that describe events fundamental to the business, e.g., number of calls for telecom, account transactions in banking.
- For each potential fact, ask: is this information operated upon by a business process? Consider daily sales versus POS: even if the system reports daily sales, POS may be the fact.
- The limits of current recording should not influence warehouse design.

Identification: Step 1
Sector and business, with the corresponding fact table:
- Retail
  - Sales: POS transaction
  - Shrinkage: stock movement and position
- Retail banking
  - Customer profiling: customer events
  - Profitability: account transactions
- Insurance
  - Product profitability: claims and receipts
- Telecom
  - Call analysis: call events
  - Customer analysis: customer events (install, disconnect, payment)

Identification: Step 2
- Look at the logical model to find the entities associated with the entities in the fact table. List all such logically associated entities.
- These are candidate references. The task is to find key dimension entities, which may not be directly associated.
- For example, in retail banking, account transactions are the candidate fact table. The account is a candidate reference, but the customer, although only indirectly related to the transaction, is a better choice.
  - Analyze account transactions by account?
  - Analyze how customers use our services?
- You store both relationships, but customer becomes a dimension.

Identification: Step 3
Check that a candidate fact is not actually a denormalized dimension table.
- Consider the following:
  - house details
  - cable laid
  - salesperson's visit
  - connected to the service
  - promotional material sent
  - subscription cancelled
  - …
- Home details: candidate fact
- Operational events
- Report on number of connections quarter-to-date
- Time lag between laying and subscription

Identification: Step 4
Check that a dimension is not actually a fact.
- A lot depends on the DSS requirements:
  - Customer can be a fact or a dimension
  - Promotions can be a fact or a dimension
- Ask questions using the other dimensions: by how many other dimensions can I view this entity?
  - Can I view promotions by time?
  - Can I view promotions by product?
  - Can I view promotions by store?
  - Can I view promotions by supplier?
- If the answer to these questions is yes, then it is a fact.

Star Schema
- Attributes
  - Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts.
  - Dimensions provide descriptive characteristics about the facts through their attributes.
[Table: Possible Attributes for Sales Dimensions]

Three Dimensional View Of Sales

Slice And Dice View Of Sales

Star Schema
- Attribute hierarchies
  - Attributes within dimensions can be ordered in a well-defined attribute hierarchy.
  - The attribute hierarchy provides a top-down data organization that is used for two main purposes:
    - Aggregation
    - Drill-down/roll-up data analysis

A Location Attribute Hierarchy

Attribute Hierarchies In Multidimensional Analysis

Star Schema
- Star schema representation
  - Facts and dimensions are normally represented by physical tables in the data warehouse database.
  - The fact table is related to each dimension table in a many-to-one (M:1) relationship.
  - Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.
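The representation described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the deck's own schema: the table names (product_dim, store_dim, time_dim, sales_fact), their columns, and the sample rows are all assumptions. It shows a fact table whose composite key consists of foreign keys into three dimension tables, and a fact row being rejected when it refers to a missing dimension row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical star schema: one fact table, three dimension tables.
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, sku TEXT, brand TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE time_dim    (time_key    INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE sales_fact (
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key   INTEGER NOT NULL REFERENCES store_dim(store_key),
    time_key    INTEGER NOT NULL REFERENCES time_dim(time_key),
    quantity    INTEGER,
    revenue     REAL,
    PRIMARY KEY (product_key, store_key, time_key)
);
INSERT INTO product_dim VALUES (1, 'SKU-1', 'Brand A');
INSERT INTO store_dim   VALUES (1, 'City X', 'North');
INSERT INTO time_dim    VALUES (1, '2024-01-01', '2024-01');
INSERT INTO sales_fact  VALUES (1, 1, 1, 3, 150.0);
""")

# Many fact rows may point at one dimension row (M:1), but a fact row
# that points at a non-existent product is rejected by the constraint.
try:
    conn.execute("INSERT INTO sales_fact VALUES (99, 1, 1, 1, 10.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

fact_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

After the run, fk_rejected is True and sales_fact still holds only the one valid row.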

Star Schema For Sales

Orders Star Schema

The Multi-Dimensional Model
- Example queries:
  - “Sales by product line over the past six months”
  - “Sales by store between 1990 and 1995”
- Fact table for measures
  - Key columns joining the fact table to the dimension tables: Prod Code, Time Code, Store Code
  - Numerical measures: Sales, Qty
- Dimension tables: Store Info, Product Info, Time Info, ...

Dimensional Modeling
- Dimensions are organized into hierarchies
  - E.g., Time dimension: days -> weeks -> quarters
  - E.g., Product dimension: product -> product line -> brand
- Dimensions have attributes

Dimension Hierarchies
- Store dimension: Stores -> District -> Region -> Total
- Product dimension: Products -> Brand -> Manufacturer -> Total

ROLAP: Dimensional Modeling Using Relational DBMS
- Special schema design: star, snowflake
- Special indexes: bitmap, multi-table join
- Special tuning: maximize query throughput
- Proven technology (relational model, DBMS); tends to outperform specialized MDDBs, especially on large data sets
- Products
  - IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

MOLAP: Dimensional Modeling Using the Multi-Dimensional Model
- MDDB: a special-purpose data model
- Facts stored in multi-dimensional arrays
- Dimensions used to index the array
- Sometimes built on top of a relational DB
- Products
  - Pilot, Arbor Essbase, Gentia
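The idea of storing facts in an array indexed by dimensions can be sketched in plain Python. This is only an illustration of the storage idea, not how any of the products named above are implemented. The dimension values (s1..s3, p1, p2, days 1 and 2) and the helpers put and get are hypothetical.

```python
# Hypothetical dimension members; each list fixes the order of one array axis.
STORES = ["s1", "s2", "s3"]
PRODUCTS = ["p1", "p2"]
DAYS = [1, 2]

# Each dimension maps a member to its position along the axis.
store_idx = {s: i for i, s in enumerate(STORES)}
product_idx = {p: i for i, p in enumerate(PRODUCTS)}
day_idx = {d: i for i, d in enumerate(DAYS)}

# Dense 3-D array of facts: cube[store][product][day], empty cells hold 0.
cube = [[[0] * len(DAYS) for _ in PRODUCTS] for _ in STORES]


def put(store, product, day, amt):
    """Store one fact at the cell addressed by its dimension members."""
    cube[store_idx[store]][product_idx[product]][day_idx[day]] = amt


def get(store, product, day):
    """Look a fact up directly by position, with no join or search."""
    return cube[store_idx[store]][product_idx[product]][day_idx[day]]


put("s1", "p1", 1, 12)
put("s3", "p1", 1, 50)
cell_count = sum(len(row) for plane in cube for row in plane)
```

The array reserves a cell for every combination of members (12 here) even though only two are filled, which is the usual trade-off of dense multi-dimensional storage.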

Star Schema (in RDBMS)

Star Schema Example

Star Schema with Sample Data

The “Classic” Star Schema
- A single fact table, with detail and summary data
- Fact table primary key has only one key column per dimension
- Each key is generated
- Each dimension is a single table, highly denormalized
Benefits: easy to understand, easy to define hierarchies, reduces the number of physical joins, low maintenance, very simple metadata
Drawbacks: summary data in the fact table yields poorer performance for summary levels; huge dimension tables are a problem

The “Classic” Star Schema
The biggest drawback: dimension tables must carry a level indicator for every record, and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including the aggregates for region and district, will be pulled from the fact table, producing an incorrect result.
Example:
    SELECT A.STORE_KEY, A.PERIOD_KEY, A.dollars
    FROM Fact_Table A
    WHERE A.STORE_KEY IN (SELECT STORE_KEY
                          FROM Store_Dimension B
                          WHERE region = 'North' AND Level = 2)
      AND etc...
Level is needed whenever aggregates are stored with detail facts.
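The effect of omitting the level constraint can be reproduced on a toy data set. The sketch below uses Python's sqlite3 module. The rows, the pseudo-store keys 10 and 100 for the district and region aggregates, and the level coding (1 = store, 2 = district, 3 = region) are assumptions made for illustration; the slide does not say which level its `Level = 2` refers to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dimension (
    store_key INTEGER PRIMARY KEY, name TEXT, region TEXT, level INTEGER
);
CREATE TABLE fact_table (store_key INTEGER, period_key INTEGER, dollars REAL);

-- Two real stores plus aggregate rows for their district and region.
-- Assumed coding: level 1 = store, 2 = district, 3 = region.
INSERT INTO store_dimension VALUES
    (1,   'Store A',          'North', 1),
    (2,   'Store B',          'North', 1),
    (10,  'District 1 total', 'North', 2),
    (100, 'North total',      'North', 3);

-- Detail facts and pre-computed summaries share one fact table.
INSERT INTO fact_table VALUES
    (1, 1, 100.0), (2, 1, 50.0),
    (10, 1, 150.0),
    (100, 1, 150.0);
""")

QUERY = """
SELECT SUM(a.dollars) FROM fact_table a
WHERE a.store_key IN (
    SELECT store_key FROM store_dimension WHERE region = 'North' {level_filter}
)
"""

# Forgetting the level filter adds the summaries on top of the detail.
without_level = conn.execute(QUERY.format(level_filter="")).fetchone()[0]
# Constraining to the store level counts each sale exactly once.
with_level = conn.execute(QUERY.format(level_filter="AND level = 1")).fetchone()[0]
```

Here with_level is 150.0, while without_level is 450.0: the same sales counted three times, in a result that otherwise looks plausible.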

The “Level” Problem
- Level is a problem because it creates potential for error. If the query builder, human or program, forgets about it, perfectly reasonable-looking WRONG answers can occur.
- One alternative: the FACT CONSTELLATION model...

The “Fact Constellation” Schema
[Figure: aggregate fact tables kept alongside the detail fact table]
- District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
- Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

The “Fact Constellation” Schema
In a fact constellation, aggregate tables are created separately from the detail; therefore it is impossible to pick up, for example, store detail when querying the District Fact Table.
Major advantage: no need for the “Level” indicator in the dimension tables, since no aggregated data is stored with lower-level detail.
Disadvantages: dimension tables are still very large in some cases, which can slow performance; the front end must be able to detect the existence of aggregate facts, which requires more extensive metadata.
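A separately stored aggregate table of this kind can be sketched as follows with sqlite3. The table and column names loosely follow the slide (store fact, district fact, product and period keys), but the exact names, the sample rows, and the use of CREATE TABLE ... AS SELECT to build the aggregate are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dimension (
    store_key INTEGER PRIMARY KEY, district_id INTEGER, region_id INTEGER
);
CREATE TABLE store_fact (
    store_key INTEGER, product_key INTEGER, period_key INTEGER,
    dollars REAL, units INTEGER
);
INSERT INTO store_dimension VALUES (1, 10, 100), (2, 10, 100), (3, 20, 100);
INSERT INTO store_fact VALUES
    (1, 1, 1, 100.0, 4), (2, 1, 1, 50.0, 2), (3, 1, 1, 30.0, 1);

-- The district-level aggregate lives in its own fact table, so a query
-- against it can never pick up store-level detail rows.
CREATE TABLE district_fact AS
SELECT d.district_id, f.product_key, f.period_key,
       SUM(f.dollars) AS dollars, SUM(f.units) AS units
FROM store_fact f
JOIN store_dimension d ON d.store_key = f.store_key
GROUP BY d.district_id, f.product_key, f.period_key;
""")

district_rows = conn.execute(
    "SELECT district_id, dollars, units FROM district_fact ORDER BY district_id"
).fetchall()
```

Querying district_fact needs no level filter. The cost is that the front end has to know that district_fact exists and when to use it.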

Another Alternative to “Level”
- Fact constellation is a good alternative to the star, but when dimensions have very high cardinality, the sub-selects in the dimension tables can be a source of delay.
- An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table: the “Snowflake Schema”...

The “Snowflake” Schema
[Figure: store dimension snowflaked into attribute tables, each joined to a fact table at its own level]
- Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
- Snowflaked district and region attribute tables: District_ID, District Desc., Region_ID, Region Desc., Regional Mgr.
- Store Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
- District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
- Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price

The “Snowflake” Schema
- No LEVEL in dimension tables
- Dimension tables are normalized by decomposing at the attribute level
- Each dimension table has one key for each level of the dimension's hierarchy
- The lowest-level key joins the dimension table to both the fact table and the lower-level attribute table
How does it work? The best way is for the query to be built by understanding which summary levels exist, finding the proper snowflaked attribute tables, constraining there for keys, and then selecting from the fact table.

The “Snowflake” Schema
- Additional features: the original Store Dimension table, completely denormalized, is kept intact, since certain queries can benefit from its all-encompassing content.
- In practice, start with a star schema and create the “snowflakes” with queries. This eliminates the need to create separate extracts for each table, and referential integrity is inherited from the dimension table.
Advantage: best performance when queries involve aggregation
Disadvantage: complicated maintenance and metadata, explosion in the number of tables in the database
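Creating the snowflakes with queries might look like the sketch below, again using sqlite3. The column names are adapted from the slide's Store Dimension. The names district_dim and region_dim, the split of columns between them, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The fully denormalized store dimension is kept intact.
CREATE TABLE store_dimension (
    store_key INTEGER PRIMARY KEY, store_desc TEXT, city TEXT, state TEXT,
    district_id INTEGER, district_desc TEXT,
    region_id INTEGER, region_desc TEXT, regional_mgr TEXT
);
INSERT INTO store_dimension VALUES
    (1, 'Store A', 'City X', 'ST', 10, 'District 10', 100, 'North', 'Mgr N'),
    (2, 'Store B', 'City Y', 'ST', 10, 'District 10', 100, 'North', 'Mgr N'),
    (3, 'Store C', 'City Z', 'ST', 20, 'District 20', 200, 'South', 'Mgr S');

-- Each snowflaked attribute table is derived from the dimension by a
-- query, so its keys are consistent with the dimension by construction.
CREATE TABLE district_dim AS
    SELECT DISTINCT district_id, district_desc, region_id FROM store_dimension;
CREATE TABLE region_dim AS
    SELECT DISTINCT region_id, region_desc, regional_mgr FROM store_dimension;
""")

districts = conn.execute(
    "SELECT * FROM district_dim ORDER BY district_id"
).fetchall()
regions = conn.execute(
    "SELECT * FROM region_dim ORDER BY region_id"
).fetchall()
```

A query at district level can then constrain district_dim for its keys and select from the district-level fact table, without touching the large store dimension.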

Advantages of ROLAP Dimensional Modeling
- Defines complex, multi-dimensional data with a simple model
- Reduces the number of joins a query has to process
- Allows the data warehouse to evolve with relatively low maintenance
- HOWEVER! Star schema and relational DBMS are not the magic solution
  - Query optimization is still problematic

Aggregates
- Add up amounts for day 1
- In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
- Result: 81

Aggregates
- Add up amounts by day
- In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

Another Example
- Add up amounts by day and product
- In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId
[Figure: arrows labeled drill-down and roll-up]
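The aggregate queries from these slides can be run end to end with sqlite3. The slides' SALE table contents are not preserved in the transcript, so the six rows below are assumed sample data, chosen so that the day-1 total comes to the 81 shown on the first Aggregates slide. The storeId column is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
# Assumed sample rows (not taken from the slides).
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "s1", 1, 12), ("p2", "s1", 1, 11), ("p1", "s3", 1, 50),
    ("p2", "s2", 1, 8), ("p1", "s1", 2, 44), ("p1", "s2", 2, 4),
])

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Add up amounts by day (coarser view).
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product (finer view, i.e. a drill-down).
by_day_product = conn.execute(
    "SELECT date, prodId, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()
```

Going from by_day to by_day_product is a drill-down; dropping prodId from the GROUP BY again is the corresponding roll-up.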

Aggregates
- Operators: sum, count, max, min, median, avg
- “Having” clause
- Using a dimension hierarchy
  - average by region (within store)
  - maximum by month (within date)
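Two of the points above, the HAVING clause and aggregation along a dimension hierarchy, can be sketched with sqlite3. The store table, the sample sale rows, and the threshold of 50 are assumptions. The store-to-region assignment (s1 in region A; s2 and s3 in region B) follows the Aggregation Using Hierarchies slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store (storeId TEXT PRIMARY KEY, region TEXT);
CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
INSERT INTO store VALUES ('s1', 'A'), ('s2', 'B'), ('s3', 'B');
-- Assumed sample rows.
INSERT INTO sale VALUES
    ('p1', 's1', 1, 12), ('p2', 's1', 1, 11), ('p1', 's3', 1, 50),
    ('p2', 's2', 1, 8),  ('p1', 's1', 2, 44), ('p1', 's2', 2, 4);
""")

# Average by region: join to the dimension table and group one level
# above the store in the hierarchy.
avg_by_region = conn.execute("""
    SELECT st.region, avg(s.amt)
    FROM sale s JOIN store st ON st.storeId = s.storeId
    GROUP BY st.region
    ORDER BY st.region
""").fetchall()

# HAVING filters on the aggregate itself: keep only days whose total
# exceeds an (arbitrary) threshold of 50.
big_days = conn.execute("""
    SELECT date, sum(amt) FROM sale
    GROUP BY date
    HAVING sum(amt) > 50
    ORDER BY date
""").fetchall()
```

WHERE would filter individual sale rows before grouping; HAVING is needed here because the condition applies to each group's total.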

ROLAP vs. MOLAP
- ROLAP: Relational On-Line Analytical Processing
- MOLAP: Multi-Dimensional On-Line Analytical Processing

The MOLAP Cube
[Figure: a fact table view shown beside the equivalent multi-dimensional cube; dimensions = 2]

3-D Cube
[Figure: a fact table view shown beside the equivalent multi-dimensional cube, with one layer each for day 1 and day 2; dimensions = 3]

Example
[Figure: cube with axes Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread), and Time (M, T, W, Th, F, S, S); visible cell values 10, 34, 56, 32, 12, 56; callout: 56 units of bread sold in LA on M]
- Dimensions: Time, Product, Store
- Attributes: Product (upc, price, …), Store (…), …
- Hierarchies:
  - Product -> Brand -> … (roll-up to brand)
  - Day -> Week -> Quarter (roll-up to week)
  - Store -> Region -> Country (roll-up to region)

Cube Aggregation: Roll-up
[Figure: day 1 and day 2 layers aggregated step by step to a grand total of 129; arrows labeled roll-up and drill-down]
Example: computing sums

Cube Operators for Roll-up
[Figure: day 1 and day 2 layers with their aggregates; grand total 129]
- sale(s1,*,*)
- sale(*,*,*)
- sale(s2,p2,*)
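The sale(store, product, day) notation, in which "*" stands for all values of a dimension, can be sketched as a small Python function over a dictionary of cells. The cell values are assumed sample data, chosen so that the grand total matches the 129 shown in the figure. The function body is an illustration of the notation, not the deck's implementation.

```python
from typing import Dict, Tuple, Union

# (store, product, day) -> amount. Assumed sample cells.
CUBE: Dict[Tuple[str, str, int], int] = {
    ("s1", "p1", 1): 12, ("s1", "p2", 1): 11, ("s3", "p1", 1): 50,
    ("s2", "p2", 1): 8, ("s1", "p1", 2): 44, ("s2", "p1", 2): 4,
}


def sale(store: str = "*", product: str = "*", day: Union[int, str] = "*") -> int:
    """Sum every cell matching the coordinates; '*' matches any value."""
    wanted = (store, product, day)
    return sum(
        amt
        for key, amt in CUBE.items()
        if all(w == "*" or w == k for w, k in zip(wanted, key))
    )


total_s1 = sale("s1", "*", "*")      # all products and days for store s1
grand_total = sale("*", "*", "*")    # the whole cube
s2_p2 = sale("s2", "p2", "*")        # store s2, product p2, all days
```

Each additional "*" rolls the cube up along one more dimension, and sale(*,*,*) is the fully rolled-up grand total.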

Extended Cube
[Figure: the cube extended with a “*” (all) layer alongside day 1 and day 2; example cell sale(*,p2,*)]

Aggregation Using Hierarchies
- Hierarchy: store -> region -> country
- Store s1 is in Region A; stores s2 and s3 are in Region B
[Figure: day 1 and day 2 layers aggregated by region]
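Rolling up along the store -> region level of this hierarchy can be sketched in plain Python. The store-to-region mapping is the one stated on the slide. The sales cells and the function name roll_up_store_to_region are assumptions for illustration.

```python
from collections import defaultdict

# From the slide: s1 is in Region A; s2 and s3 are in Region B.
STORE_TO_REGION = {"s1": "A", "s2": "B", "s3": "B"}

# (store, product, day) -> amount. Assumed sample cells.
SALES = {
    ("s1", "p1", 1): 12, ("s1", "p2", 1): 11, ("s3", "p1", 1): 50,
    ("s2", "p2", 1): 8, ("s1", "p1", 2): 44, ("s2", "p1", 2): 4,
}


def roll_up_store_to_region(sales):
    """Replace each store by its region and sum amounts per (region, day)."""
    totals = defaultdict(int)
    for (store, _product, day), amt in sales.items():
        totals[(STORE_TO_REGION[store], day)] += amt
    return dict(totals)


by_region_day = roll_up_store_to_region(SALES)
```

The hierarchy is just a lookup from a lower-level key to its parent. A further mapping from region to country would give the next roll-up step in the same way.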

Slicing
[Figure: day 1 and day 2 layers; the slice TIME = day 1 is selected]

Slicing & Pivoting

Summary of Operations
- Aggregation (roll-up)
  - aggregate (summarize) data to the next higher dimension element
  - e.g., total sales by city, year -> total sales by region, year
- Navigation to detailed data (drill-down)
- Selection (slice) defines a subcube
  - e.g., sales where city = 'Gainesville' and date = '1/15/90'
- Calculation and ranking
  - e.g., top 3% of cities by average income
- Visualization operations (e.g., pivot)
- Time functions
  - e.g., time average
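Two of the operations listed above, slice and pivot, can be sketched on a dictionary-based cube in plain Python. The cell values and the helper names slice_cube and pivot are assumptions for illustration. Here pivot is modeled simply as swapping the two remaining axes, which is one common reading of the term.

```python
# (store, product, day) -> amount. Assumed sample cells.
SALES = {
    ("s1", "p1", 1): 12, ("s1", "p2", 1): 11, ("s3", "p1", 1): 50,
    ("s2", "p2", 1): 8, ("s1", "p1", 2): 44, ("s2", "p1", 2): 4,
}


def slice_cube(sales, day):
    """Fix the time dimension to one value, leaving a (store, product) subcube."""
    return {(s, p): amt for (s, p, d), amt in sales.items() if d == day}


def pivot(subcube):
    """Swap the two axes of a 2-D view: (store, product) -> (product, store)."""
    return {(p, s): amt for (s, p), amt in subcube.items()}


day1 = slice_cube(SALES, day=1)   # selection: TIME = day 1
day1_by_product = pivot(day1)     # same cells, rows and columns exchanged
```

Slicing reduces the number of dimensions by one. Pivoting changes only how the remaining cells are laid out, so the totals are unchanged.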

Query & Analysis Tools
- Query building
- Report writers (comparisons, growth, graphs, …)
- Spreadsheet systems
- Web interfaces
- Data mining
