Designing the Data Warehouse / Data Marts: Methodologies and Techniques
Basic principles

Life cycle of the DW
[Diagram: Operational Databases → Warehouse Database through a first-time load, followed by repeated refreshes, then purge or archive]
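
The life cycle stages above (first-time load, refresh, purge) can be sketched in a few lines. This is a minimal illustration only, assuming an in-memory SQLite database stands in for the warehouse; the `sales` table, its columns, and the function names are hypothetical and do not come from the slides.

```python
import sqlite3

def first_time_load(wh, rows):
    """Create the (hypothetical) warehouse table and load the initial snapshot."""
    wh.execute(
        "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, sale_date TEXT, amount REAL)"
    )
    wh.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def refresh(wh, rows):
    """Add rows from the operational source, skipping ids already loaded."""
    wh.executemany("INSERT OR IGNORE INTO sales VALUES (?, ?, ?)", rows)

def purge(wh, cutoff_date):
    """Delete rows older than the cutoff and return how many were removed."""
    cur = wh.execute("DELETE FROM sales WHERE sale_date < ?", (cutoff_date,))
    return cur.rowcount

wh = sqlite3.connect(":memory:")
first_time_load(wh, [(1, "1998-12-30", 100.0), (2, "1999-01-05", 250.0)])
refresh(wh, [(2, "1999-01-05", 250.0), (3, "1999-02-01", 75.0)])  # row 2 is a repeat
removed = purge(wh, "1999-01-01")
remaining = wh.execute("SELECT sale_id FROM sales ORDER BY sale_id").fetchall()
```

A real refresh would usually also handle changed rows, and an archive step would copy rows elsewhere before deleting them; both are left out here to keep the sketch short.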

Oracle Warehouse Components
[Diagram labels: Any Data, Any Access, Any Source; operational data, external data; relational tools, OLAP tools, applications / Web; relational / multidimensional, text, image, audio, video, spatial, Web; Oracle Medi`]

Oracle Intelligence Tools
–Oracle Reports: IS develops users' views; current
–Oracle Discoverer: business users; tactical
–Oracle Express: analysts; strategic

Oracle Data Mart Suite
–Data Modeling: Oracle Data Mart Designer
–Data Management: Oracle Enterprise Manager
–Data Extraction: Oracle Data Mart Builder
–Data Access & Analysis: Discoverer & Oracle Reports
[Other diagram labels: Warehousing Engines, OLTP Engines, OLTP Databases, Data Mart Database, Oracle8, SQL*PLUS]

“Big Bang” Approach: Advantages and Disadvantages
Advantages:
–Warehouse is built as part of a major project (e.g. BPR)
–Gives a “big picture” of the data warehouse before the data warehousing project starts
Disadvantages:
–Involves high risk and takes longer
–Runs the risk of requirements changing along the way
–Is costly, and user support is harder to obtain

Incremental Approach to Warehouse Development
Multiple iterations
Shorter implementations
Validation of each phase
Phases: Strategy, Definition, Analysis, Design, Build, Production

Benefits of an Incremental Approach
Delivers a strategic data warehouse solution through incremental development efforts
Provides an extensible, scalable architecture
Quickly provides business benefits and ensures a much earlier return on investment
Allows a data warehouse to be built one subject or application area at a time
Allows the construction of an integrated data mart environment

Data Mart
A subset of a data warehouse that supports the requirements of a particular department or business function.
Characteristics include:
–Does not normally contain detailed operational data, unlike a data warehouse
–May contain certain levels of aggregation

Dependent Data Mart
[Diagram: Operational Systems (Marketing, Sales, Finance), Flat Files, and External Data feed the Data Warehouse, which feeds the Data Marts (Marketing, Sales, Finance, Human Resources)]

Independent Data Mart
[Diagram: Operational Systems, Flat Files, and External Data feed a Sales or Marketing data mart]

Reasons for Creating a Data Mart
To give users more flexible access to the data they need to analyse most often.
To provide data in a form that matches the collective view of a group of users.
To improve end-user response time.
Potential users of a data mart are clearly defined and can be targeted for support.

Reasons for Creating a Data Mart
To provide appropriately structured data as dictated by the requirements of the end-user access tools.
Building a data mart is simpler than establishing a corporate data warehouse.
The cost of implementing data marts is far less than that required to establish a data warehouse.

Data Mart Issues
Data mart functionality
Data mart size
Data mart load performance
User access to data in multiple data marts
Data mart Internet / intranet access
Data mart administration
Data mart installation

Example of a DW tool: OLAP
Rotate and drill down to successive levels of detail.
Create and examine calculated data interactively on large volumes of data.
Determine comparative or relative differences.
Perform exception and trend analysis.
Perform advanced analytical functions, for example forecasting, modeling, and regression analysis.

Original OLAP Rules
1. Multidimensional conceptual view
2. Transparency
3. Accessibility
4. Consistent reporting performance
5. Client-server architecture

Original OLAP Rules
6. Multiuser support
7. Unrestricted cross-dimensional operations
8. Intuitive data manipulation
9. Flexible reporting
10. Unlimited dimensions and aggregation levels

Relational Database Model
[Table: columns Attribute 1 Name, Attribute 2 Age, Attribute 3 Gender, Attribute 4 Emp No.; Rows 1 to 4 are Anderson, Green, Lee, and Ramos, with gender codes F, M, M, F]
The table above illustrates the employee relation.

Multidimensional Database Model
The data is found at the intersection of dimensions.
[Diagram: FINANCE cube with dimensions Store, GL_Line, Time; SALES cube with dimensions Store, Product, Time, Customer]
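
To make "data at the intersection of dimensions" concrete, here is a small sketch that models a SALES cube as a dictionary keyed by coordinate tuples. The store, product, and month values and the `roll_up` helper are invented for illustration; a real multidimensional engine stores and indexes cells very differently.

```python
from collections import defaultdict

# Hypothetical SALES cube: each cell sits at the intersection of
# (store, product, time) and holds one measure, units sold.
DIMENSIONS = ("store", "product", "time")
sales_cube = {
    ("Store 1", "Product A", "Jan"): 10,
    ("Store 1", "Product B", "Jan"): 4,
    ("Store 2", "Product A", "Jan"): 7,
    ("Store 1", "Product A", "Feb"): 12,
}

def roll_up(cube, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for coords, value in cube.items():
        totals[tuple(coords[p] for p in positions)] += value
    return dict(totals)

# A single cell is addressed by one value per dimension.
cell = sales_cube[("Store 1", "Product A", "Jan")]

# Rolling up collapses dimensions; keeping fewer dimensions gives coarser totals.
by_product = roll_up(sales_cube, ["product"])
by_store_time = roll_up(sales_cube, ["store", "time"])
```

Drilling down is the reverse operation: moving from `by_product` back to the finer `by_store_time` or to individual cells.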
Two dimensions
Three dimensions

Specialised Multidimensional Tool
Benefits:
–Quick access to very large volumes of data
–Extensive and comprehensive libraries of complex analysis functions
–Strong modeling and forecasting capabilities
–Can access multidimensional and relational database structures
–Caters for calculated fields
Disadvantages:
–Difficulty of changing the model
–Lack of support for very large volumes of data
–May require significant processing power

MOLAP Server
The application layer stores data in a multidimensional structure.
The presentation layer provides the multidimensional view.
Efficient storage and processing
Complexity hidden from the user
Analysis using preaggregated summaries and precalculated measures
[Diagram: Warehouse, MOLAP engine (application layer), DSS client]

ROLAP Server
The warehouse stores atomic data.
The application layer generates SQL for the three-dimensional view.
The presentation layer provides the multidimensional view.
[Diagram: Warehouse server, ROLAP engine (application layer), DSS client; the engine issues multiple SQL statements to the warehouse server]
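
The idea that a ROLAP application layer generates SQL against atomic data can be sketched as follows. This is an assumption-laden toy, not how Oracle Express or any specific ROLAP engine works: the `sales_fact` table, its columns, and `build_rollup_sql` are hypothetical names, and SQLite stands in for the warehouse server.

```python
import sqlite3

# Hypothetical fact table and measure names, chosen for this sketch.
FACT_TABLE = "sales_fact"
MEASURE = "sales_dollars"
ALLOWED_DIMENSIONS = ("store_id", "product_id", "day_id")

def build_rollup_sql(dimensions):
    """Return a GROUP BY query that sums MEASURE over the given dimensions."""
    # Only whitelisted column names are ever interpolated into the SQL text.
    if not dimensions or any(d not in ALLOWED_DIMENSIONS for d in dimensions):
        raise ValueError(f"unknown or empty dimension list: {dimensions!r}")
    cols = ", ".join(dimensions)
    return (f"SELECT {cols}, SUM({MEASURE}) FROM {FACT_TABLE} "
            f"GROUP BY {cols} ORDER BY {cols}")

# The warehouse holds atomic rows; each requested view becomes one SQL statement.
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE sales_fact (store_id, product_id, day_id, sales_dollars)")
wh.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    (1, 10, 1, 100.0), (1, 11, 1, 50.0), (2, 10, 1, 70.0), (1, 10, 2, 30.0),
])
by_store = wh.execute(build_rollup_sql(["store_id"])).fetchall()
by_store_product = wh.execute(build_rollup_sql(["store_id", "product_id"])).fetchall()
```

Each change of view by the user (a rotate or a drill-down) would translate into another generated statement, which is why the diagram shows multiple SQL requests.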

MOLAP
[Diagram: the Express user sends queries to Express Server and receives data; Express Server reads an MDDB that is filled from the Warehouse by periodic load]

ROLAP
[Diagram: the Express user sends queries to Express Server and receives data; Express Server keeps a data cache filled by live fetch from the Warehouse]
Also: Hybrid (HOLAP)

Choosing a Reporting Architecture
Business needs
Potential for growth
Interface
Enterprise architecture
Network architecture
Speed of access
Openness
[Chart: MOLAP and ROLAP compared on query performance (OK to Good) and analysis (Simple to Complex)]

Data Acquisition
Identify, extract, transform, and transport source data
Consider internal and external data
Perform gap analysis between source data and target database objects
Plan move of data between sources and target
Define first-time load and refresh strategy
Define tool requirements
Build, test, and execute data acquisition modules
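
Two of the steps above, gap analysis and transformation, can be illustrated with a short sketch. All column names, the mapping, and the sample record are hypothetical; the point is only to show the shape of the checks, not a complete acquisition module.

```python
# Hypothetical source record layout and target warehouse columns.
source_columns = {"cust_no", "cust_name", "order_dt", "amt"}
target_columns = {"customer_id", "customer_name", "order_date", "amount", "region"}

# Proposed mapping from source fields to target columns.
mapping = {
    "cust_no": "customer_id",
    "cust_name": "customer_name",
    "order_dt": "order_date",
    "amt": "amount",
}

def gap_analysis(source, target, mapping):
    """Return (source fields left unmapped, target columns with no source)."""
    unmapped_sources = sorted(source - set(mapping))
    unfilled_targets = sorted(target - set(mapping.values()))
    return unmapped_sources, unfilled_targets

def transform(record, mapping):
    """Rename source fields to target columns and normalise simple types."""
    row = {mapping[k]: v for k, v in record.items() if k in mapping}
    row["customer_name"] = row["customer_name"].strip().title()
    row["amount"] = float(row["amount"])
    return row

unmapped, unfilled = gap_analysis(source_columns, target_columns, mapping)
clean = transform(
    {"cust_no": 7, "cust_name": "  ada LOVELACE ", "order_dt": "1999-01-05", "amt": "12.50"},
    mapping,
)
```

Here the gap analysis reports that `region` has no source, which is the kind of finding that would prompt a search for an external data source or a derived column.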

Modeling
Warehouses differ from operational structures:
–Analytical requirements
–Subject orientation
Data must map to subject-oriented information:
–Identify business subjects
–Define relationships between subjects
–Name the attributes of each subject
Modeling is iterative
Modeling tools are available

Modeling the Data Warehouse
1. Defining the business model
2. Creating the dimensional model
3. Modeling summaries
4. Creating the physical model
[Diagram: select a business process, then step 1, steps 2 and 3, and step 4, ending in the physical model]

Identifying Business Rules
Product
–Type: PC, Server
–Monitor: 15 inch, 17 inch, 19 inch, None
–Status: New, Rebuilt, Custom
Location
–Geographic proximity: … miles, … miles, > 5 miles
Store
–Store > District > Region
Time
–Month > Quarter > Year

Creating the Dimensional Model
Identify fact tables
–Translate business measures into fact tables
–Analyze source system information for additional measures
–Identify base and derived measures
–Document additivity of measures
Identify dimension tables
Link fact tables to the dimension tables
Create views for users

Dimension Tables
Dimension tables have the following characteristics:
–Contain textual information that represents the attributes of the business
–Contain relatively static data
–Are joined to a fact table through a foreign key reference
[Diagram: Facts (units, price) joined to the Product, Channel, Customer, and Time dimensions]

Fact Tables
Fact tables have the following characteristics:
–Contain numeric measures (metrics) of the business
–May contain summarized (aggregated) data
–May contain date-stamped data
–Are typically additive
–Have a key that is typically a concatenated key composed of the primary keys of the dimensions
–Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables

Dimensional Model (Star Schema)
[Diagram: central fact table Facts (units, price) surrounded by the dimension tables Product, Channel, Customer, and Time]

Star Schema Model
Central fact table
Radiating dimensions
Denormalized model
Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_dollars, Sales_units, ...
Store Table: Store_id, District_id, ...
Item Table: Item_id, Item_desc, ...
Time Table: Day_id, Month_id, Period_id, Year_id
Product Table: Product_id, Product_desc, …
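
A cut-down version of this star schema can be tried out in SQLite. The sketch below keeps only some of the slide's columns, renames the Time table to `time_dim`, and uses invented sample rows; it is meant to show the fact-to-dimension joins and the concatenated key, not a production design.

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.executescript("""
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE store    (store_id INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE time_dim (day_id INTEGER PRIMARY KEY, month_id INTEGER, year_id INTEGER);
CREATE TABLE sales_fact (
    product_id    INTEGER REFERENCES product(product_id),
    store_id      INTEGER REFERENCES store(store_id),
    day_id        INTEGER REFERENCES time_dim(day_id),
    sales_dollars REAL,
    sales_units   INTEGER,
    PRIMARY KEY (product_id, store_id, day_id)  -- concatenated key of dimension keys
);
-- Invented sample rows for illustration only.
INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO store VALUES (10, 100), (11, 100);
INSERT INTO time_dim VALUES (19990105, 199901, 1999), (19990210, 199902, 1999);
INSERT INTO sales_fact VALUES
    (1, 10, 19990105, 200.0, 4),
    (2, 10, 19990105, 150.0, 1),
    (1, 11, 19990210, 100.0, 2);
""")

# A typical star query: join the fact table to the dimensions it needs,
# then aggregate the additive measures.
rows = wh.execute("""
    SELECT p.product_desc, t.month_id, SUM(f.sales_dollars), SUM(f.sales_units)
    FROM sales_fact AS f
    JOIN product  AS p ON p.product_id = f.product_id
    JOIN time_dim AS t ON t.day_id = f.day_id
    GROUP BY p.product_desc, t.month_id
    ORDER BY p.product_desc, t.month_id
""").fetchall()
```

Every dimension is exactly one join away from the fact table, which is what keeps star queries short.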

Star Schema Model
Advantages:
–Easy for users to understand
–Fast response to queries
–Simple metadata
–Supported by many front-end tools
Disadvantages:
–Less robust to change
–Slower to build
–Does not support history

Snowflake Schema Model
Sales Fact Table: Item_id, Store_id, Sales_dollars, Sales_units
Item Table: Item_id, Item_desc, Dept_id
Dept Table: Dept_id, Dept_desc, Mgr_id
Mgr Table: Dept_id, Mgr_id, Mgr_name
Product Table: Product_id, Product_desc
Store Table: Store_id, Store_desc, District_id
District Table: District_id, District_desc
Time Table: Week_id, Period_id, Year_id
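
The effect of normalizing the dimensions can be seen by querying a reduced version of this snowflake schema. The sketch below uses SQLite, keeps only the Item, Dept, Store, and District tables from the slide, and fills them with invented rows; it is an illustration under those assumptions.

```python
import sqlite3

wh = sqlite3.connect(":memory:")
wh.executescript("""
CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, dept_desc TEXT);
CREATE TABLE item (item_id INTEGER PRIMARY KEY, item_desc TEXT,
                   dept_id INTEGER REFERENCES dept(dept_id));
CREATE TABLE district (district_id INTEGER PRIMARY KEY, district_desc TEXT);
CREATE TABLE store (store_id INTEGER PRIMARY KEY, store_desc TEXT,
                    district_id INTEGER REFERENCES district(district_id));
CREATE TABLE sales_fact (item_id INTEGER REFERENCES item(item_id),
                         store_id INTEGER REFERENCES store(store_id),
                         sales_dollars REAL, sales_units INTEGER);
-- Invented sample rows for illustration only.
INSERT INTO dept VALUES (1, 'Hardware'), (2, 'Software');
INSERT INTO item VALUES (100, 'Monitor', 1), (101, 'Keyboard', 1), (200, 'Editor', 2);
INSERT INTO district VALUES (7, 'North');
INSERT INTO store VALUES (10, 'Main Street', 7);
INSERT INTO sales_fact VALUES (100, 10, 300.0, 2), (101, 10, 40.0, 4), (200, 10, 60.0, 1);
""")

# Reaching dept and district takes two joins each, because the
# dimension hierarchies are split into separate normalized tables.
by_dept_district = wh.execute("""
    SELECT d.dept_desc, di.district_desc, SUM(f.sales_dollars)
    FROM sales_fact AS f
    JOIN item     AS i  ON i.item_id = f.item_id
    JOIN dept     AS d  ON d.dept_id = i.dept_id
    JOIN store    AS s  ON s.store_id = f.store_id
    JOIN district AS di ON di.district_id = s.district_id
    GROUP BY d.dept_desc, di.district_desc
    ORDER BY d.dept_desc
""").fetchall()
```

The same question against a star schema would need two joins instead of four, which is one way to see the trade-off between easier maintenance and query cost.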

Snowflake Schema Model
Advantages:
–Direct use by some tools
–More flexible to change
–Provides for speedier data loading
Disadvantages:
–May become large and unmanageable
–Degrades query performance
–More complex metadata

Using Summary Data
Phase 3: Modeling summaries
Provides fast access to precomputed data
Reduces use of I/O, CPU, and memory
Is distilled from source systems and precalculated summaries
Usually exists in summary fact tables

Designing Summary Tables
[Table: columns Units, Sales (€), Store; rows Product A Total, Product B Total, Product C Total; summary types Average, Maximum, Total, Percentage]

Summary Tables Example
SALES FACTS
Sales    Region  Month
10,000   North   Jan 99
12,000   South   Feb 99
11,000   North   Jan 99
15,000   West    Mar 99
18,000   South   Feb 99
20,000   North   Jan 99
10,000   East    Jan 99
2,000    West    Mar 99
SALES BY MONTH/REGION
Month    Region  Tot_Sales$
Jan 99   North   41,000
Jan 99   East    10,000
Feb 99   South   30,000
Mar 99   West    17,000
SALES BY MONTH
Month    Tot_Sales
Jan 99   51,000
Feb 99   30,000
Mar 99   17,000
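
The two summary tables can be derived mechanically from the eight SALES FACTS rows. The sketch below does this in plain Python; the `summarize` helper is a hypothetical name, and in a warehouse the same result would normally come from a GROUP BY query or a stored summary table.

```python
from collections import defaultdict

# The eight SALES FACTS rows from the example: (sales, region, month).
sales_facts = [
    (10_000, "North", "Jan 99"),
    (12_000, "South", "Feb 99"),
    (11_000, "North", "Jan 99"),
    (15_000, "West", "Mar 99"),
    (18_000, "South", "Feb 99"),
    (20_000, "North", "Jan 99"),
    (10_000, "East", "Jan 99"),
    (2_000, "West", "Mar 99"),
]

def summarize(facts, key):
    """Total the sales measure for each group produced by `key(region, month)`."""
    totals = defaultdict(int)
    for sales, region, month in facts:
        totals[key(region, month)] += sales
    return dict(totals)

# SALES BY MONTH/REGION: one total per (month, region) pair.
sales_by_month_region = summarize(sales_facts, lambda region, month: (month, region))

# SALES BY MONTH: a coarser summary that drops the region dimension.
sales_by_month = summarize(sales_facts, lambda region, month: month)
```

Because sales is an additive measure, the monthly summary could equally be computed from the month/region summary instead of from the fact rows.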

Summary Management in Oracle8i
[Diagram labels: Sales, Product, Region, Time; Sales summary: City, State; Summary advisor: summary usage, space requirements, summary recommendations]

The Time Dimension
Time is critical to the data warehouse.
A consistent representation of time is required for extensibility.
How and where should it be stored?
[Diagram: Time dimension linked to Sales fact]
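
One common answer to "how should time be stored" is a dedicated time dimension table with one row per day and precomputed calendar attributes. The sketch below generates such rows; the column names (`day_id`, `month_id`, `quarter`, `year_id`, `iso_weekday`) are assumptions for illustration and are only partly taken from the star schema slide.

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "day_id": day.year * 10000 + day.month * 100 + day.day,  # e.g. 19990105
            "month_id": day.year * 100 + day.month,                  # e.g. 199901
            "quarter": (day.month - 1) // 3 + 1,
            "year_id": day.year,
            "iso_weekday": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
        })
        day += timedelta(days=1)
    return rows

time_dim = build_time_dimension(date(1999, 1, 1), date(1999, 12, 31))
first_day, last_day = time_dim[0], time_dim[-1]
```

Generating the rows once and sharing the table across all fact tables is what gives every data mart the same representation of months, quarters, and years.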