Download presentation
Presentation is loading. Please wait.
Published byBenedict Robbins Modified over 6 years ago
1
Data Warehouse Design Enrico Franconi CS 636
2
Implementing a Warehouse
Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing, indexing, ... Managing: Metadata, Design, ... CS 336
3
Monitoring Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … How to get data out? Replication tool Dump file Create report ODBC or third-party “wrappers” CS 336
4
Monitoring Techniques
Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring CS 336
5
Monitoring Issues Frequency Data transformation Standards (e.g., ODBC)
periodic: daily, weekly, … triggered: on “big” change, lots of changes, ... Data transformation convert data to uniform format remove & add fields (e.g., add date to get history) Standards (e.g., ODBC) Gateways CS 336
6
Wrapper Converts data and queries from one data model to another Queries Data Model B Data Model A Data Extends query capabilities for sources with limited capabilities Wrapper Source Queries CS 336
7
Wrapper Generation Solution 1: Hard code for each source
Solution 2: Automatic wrapper generation Wrapper Generator Definition Wrapper CS 336
8
Integration Data Cleaning Data Loading Derived Data CS 336 Client
Warehouse Source Query & Analysis Integration Metadata CS 336
9
Data Integration Receive data (changes) from multiple wrappers/monitors and integrate into warehouse Rule-based Actions Resolve inconsistencies Eliminate duplicates Integrate into warehouse (may not be empty) Summarize data Fetch more data from sources (wh updates) etc. CS 336
10
Data Cleaning Find (& remove) duplicate tuples
e.g., Jane Doe vs. Jane Q. Doe Detect inconsistent, wrong data Attribute values that don’t match Patch missing, unreadable data Insert default values Notify sources of errors found CS 336
11
Data Cleaning Migration (e.g., yen to dollars)
Scrubbing: use domain-specific knowledge (e.g., social security numbers) Fusion (e.g., mail list, customer merging) billing DB service DB customer1(Joe) customer2(Joe) merged_customer(Joe) CS 336
12
Loading Data in the Warehouse
Incremental vs. refresh Off-line vs. on-line Frequency of loading At night, 1x a week/month, continuously Parallel/Partitioned load CS 336
13
Warehouse Maintenance
Warehouse data materialized view Initial loading View maintenance Derived Warehouse Data indexes aggregates materialized views CS 336
14
Materialized Views Define new warehouse relations using SQL expressions does not exist at any source CS 336
15
Differs from Conventional View Maintenance...
Warehouses may be highly aggregated and summarized Warehouse views may be over history of base data Process large batch updates Schema may evolve CS 336
16
Differs from Conventional View Maintenance...
Base data doesn’t participate in view maintenance Simply reports changes Loosely coupled Absence of locking, global transactions May not be queriable CS 336
17
Warehouse Maintenance Anomalies
Materialized view maintenance in loosely coupled, non-transactional environment Simple example Data Warehouse Sold (item,clerk,age) Sold = Sale Emp Integrator Sales Comp. Sale(item,clerk) Emp(clerk,age) CS 336
18
Warehouse Maintenance Anomalies
Data Warehouse Sold (item,clerk,age) Integrator Sales Comp. Sale(item,clerk) Emp(clerk,age) 1. Insert into Emp(Mary,25), notify integrator 2. Insert into Sale (Computer,Mary), notify integrator 3. (1) integrator adds Sale (Mary,25) 4. (2) integrator adds (Computer,Mary) Emp 5. View incorrect (duplicate tuple) CS 336
19
Maintenance Anomaly - Solutions
Incremental update algorithms (ECA, Strobe, etc.) Research issues: Self-maintainable views What views are self-maintainable Store auxiliary views so original + auxiliary views are self-maintainable CS 336
20
Self-Maintainability: Examples
Sold(item,clerk,age) = Sale(item,clerk) Emp(clerk,age) Inserts into Emp If Emp.clerk is key and Sale.clerk is foreign key (with ref. int.) then no effect Inserts into Sale Maintain auxiliary view: Emp-clerk,age(Sold) Deletes from Emp Delete from Sold based on clerk CS 336
21
Self-Maintainability: Examples
Deletes from Sale Delete from Sold based on {item,clerk} Unless age at time of sale is relevant Auxiliary views for self-maintainability Must themselves be self-maintainable One solution: all source data But want minimal set CS 336
22
Partial Self-Maintainability
Avoid (but don’t prohibit) going to sources Sold=Sale(item,clerk) Emp(clerk,age) Inserts into Sale Check if clerk already in Sold, go to source if not Or replicate all clerks over age 30 Or ... CS 336
23
Warehouse Specification (ideally)
View Definitions Warehouse Configuration Module Warehouse Integration rules Change Detection Requirements Integrator Metadata Extractor/ Monitor Extractor/ Monitor Extractor/ Monitor ... CS 336
24
Processing ROLAP servers vs. MOLAP servers Index Structures
What to Materialize? Algorithms Client Warehouse Source Query & Analysis Integration Metadata CS 336
25
ROLAP Server Relational OLAP Server tools ROLAP utilities server
Special indices, tuning; Schema is “denormalized” relational DBMS CS 336
26
MOLAP Server Multi-Dimensional OLAP Server M.D. tools utilities
Product City Date milk soda eggs soap A B Sales M.D. tools utilities multi-dimensional server could also sit on relational DBMS CS 336
27
Index Structures (sketch)
Traditional Access Methods B-trees, hash tables, R-trees, grids, … Popular in Warehouses inverted lists bit map indexes join indexes text indexes CS 336
28
What to Materialize? Store in warehouse results useful for common queries Example: total sales day 2 . . . day 1 129 materialize CS 336
29
Materialization Factors
Type/frequency of queries Query response time Storage cost Update cost CS 336
30
Cube Aggregates Lattice
129 all city product date city, product city, date product, date use greedy algorithm to decide what to materialize day 2 day 1 city, product, date CS 336
31
Dimension Hierarchies
all state city CS 336
32
Dimension Hierarchies
all city product date city, product city, date product, date state city, product, date state, date state, product state, product, date not all arcs shown... CS 336
33
Interesting Hierarchy
all years weeks quarters conceptual dimension table months days CS 336
34
Managing Metadata Warehouse Design Tools CS 336 Client
Source Query & Analysis Integration Metadata CS 336
35
Metadata Administrative definition of sources, tools, ...
schemas, dimension hierarchies, … rules for extraction, cleaning, … refresh, purging policies user profiles, access control, ... CS 336
36
Metadata Business Operational business terms & definition
data ownership, charging Operational data lineage data currency (e.g., active, archived, purged) use stats, error reports, audit trails CS 336
37
Design Summary What data is needed? Where does it come from?
How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index? CS 336
38
Tools Development Planning & Analysis Warehouse Management
design & edit: schemas, views, scripts, rules, queries, reports Planning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management performance monitoring, usage patterns, exception reporting System & Network Management measure traffic (sources, warehouse, clients) Workflow Management “reliable scripts” for cleaning & analyzing data CS 336
39
Current State of Industry
Extraction and integration done off-line Usually in large, time-consuming, batches Everything copied at warehouse Not selective about what is stored Query benefit vs storage & update cost Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything CS 336
40
State of Commercial Practice ...
Connectivity to sources Apertus Information Builders Informix Enterprise Gateway Oracle Open Connect CA-Ingres gateway MS ODBC Platinum InfoHub Data extract, clean, transform, refresh CA-Ingres Replicator ETI-Extract IBM Data Joiner, Data Propagator Prism Warehouse manager SAS Access Sybase Replication Server Trinzic InfoPump CS 336
41
… State of Commercial Practice ...
Multidimensional Database Engines Arbor Essbase Oracle RIR Express Comshare Commader SAS System Warehouse Data Servers CA-Ingres Oracle 8 RedBrick Sybase IQ Informix Dynamic Server IBM DB2 ROLAP Servers HP Intelligent Warehouse Informix Metacube MicroStrategy DSS Server Information Advantage Asxys CS 336
42
… State of Commercial Practice
Multidimensional Analysis Kenan Systems Acumate Microsoft Excel Arbor Essbase Analysis server Cognos PowerPlay IQ Software IQ/Vision Lotus 123 SAS OLAP++ Business Objects Query/Reporting Environments IBM DataGuide SAS Access CA Visual Express Platinum Forest&Trees Informix ViewPoint Lots and lots of consulting!! CS 336
43
Future Directions Better performance Larger warehouses Easier to use
What are companies & research labs working on? CS 336
44
Research (1) Incremental Maintenance Data Consistency Data Expiration
Recovery Data Quality Error Handling (Back Flush) CS 336
45
Research (2) Rapid Monitor Construction Temporal Warehouses
Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data Conceptual Modelling CS 336
46
Conclusions Massive amounts of data and complexity of queries will push limits of current warehouses Need better systems: easier to use provide quality information CS 336
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.