Published by Amberly Hicks. Modified over 9 years ago.
1
Data Warehouse Design
Enrico Franconi
CS 636
2
CS 336
Implementing a Warehouse
- Monitoring: sending data from sources
- Integrating: loading, cleansing, ...
- Processing: query processing, indexing, ...
- Managing: metadata, design, ...
3
Monitoring
- Source types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, …
- How to get data out?
  - Replication tool
  - Dump file
  - Create report
  - ODBC or third-party “wrappers”
4
Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Screen scraping
- Application level monitoring
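As an illustration of the first technique (periodic snapshots), a monitor can compare two successive dumps of a source table and report only the differences. The sketch below is a minimal example under stated assumptions: each snapshot is a dictionary keyed by primary key, and the function name `diff_snapshots` and the sample rows are hypothetical, not taken from the slides.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key.

    Returns (inserts, deletes, updates), where updates maps a key to
    its (old_row, new_row) pair.
    """
    inserts = {k: v for k, v in new.items() if k not in old}
    deletes = {k: v for k, v in old.items() if k not in new}
    updates = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserts, deletes, updates


# Hypothetical snapshots of an Emp(clerk, age) table keyed by an id.
yesterday = {1: ("Mary", 25), 2: ("Joe", 31)}
today = {1: ("Mary", 26), 3: ("Ann", 40)}

inserts, deletes, updates = diff_snapshots(yesterday, today)
```

Snapshot differencing needs no cooperation from the source beyond a dump, but it only sees the net change between two snapshots; triggers or log shipping would be needed to capture every intermediate change.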
5
Monitoring Issues
- Frequency
  - periodic: daily, weekly, …
  - triggered: on “big” change, lots of changes, ...
- Data transformation
  - convert data to uniform format
  - remove & add fields (e.g., add date to get history)
- Standards (e.g., ODBC)
- Gateways
6
Wrapper
- Converts data and queries from one data model to another
- Extends query capabilities for sources with limited capabilities
[Figure: a Wrapper placed in front of a Source, passing queries and data between Data Model A and Data Model B]
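Both roles of a wrapper can be sketched in a few lines. In this hypothetical example the source is a pipe-delimited flat file that can only be read sequentially; the wrapper converts each record into a relational-style row (a dictionary) and adds selection and projection, which the source itself cannot do. The class name `FlatFileWrapper` and the sample data are assumptions made for illustration.

```python
import csv
import io


class FlatFileWrapper:
    """Presents a delimited flat file as a queryable set of rows."""

    def __init__(self, text, delimiter="|"):
        self._text = text
        self._delimiter = delimiter

    def scan(self):
        # Data conversion: flat-file records become attribute/value rows.
        reader = csv.DictReader(io.StringIO(self._text), delimiter=self._delimiter)
        for row in reader:
            yield dict(row)

    def select(self, predicate, columns=None):
        # Extended query capability: selection and projection are done
        # in the wrapper because the source only supports a full scan.
        for row in self.scan():
            if predicate(row):
                yield {c: row[c] for c in columns} if columns else row


# Hypothetical flat-file dump of Sale(item, clerk).
SOURCE = "item|clerk\nComputer|Mary\nPrinter|Joe\n"

wrapper = FlatFileWrapper(SOURCE)
marys_sales = list(wrapper.select(lambda r: r["clerk"] == "Mary", columns=["item"]))
```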
7
Wrapper Generation
- Solution 1: hard code for each source
- Solution 2: automatic wrapper generation
[Figure labels: Wrapper Generator, Definition]
8
Integration
- Data cleaning
- Data loading
- Derived data
[Figure: warehouse architecture with the layers Client, Query & Analysis, Warehouse, Metadata, Integration, Source; the Integration layer is the topic here]
9
Data Integration
- Receive data (changes) from multiple wrappers/monitors and integrate into warehouse
- Rule-based
- Actions
  - Resolve inconsistencies
  - Eliminate duplicates
  - Integrate into warehouse (may not be empty)
  - Summarize data
  - Fetch more data from sources (wh updates)
  - etc.
10
Data Cleaning
- Find (& remove) duplicate tuples
  - e.g., Jane Doe vs. Jane Q. Doe
- Detect inconsistent, wrong data
  - Attribute values that don’t match
- Patch missing, unreadable data
  - Insert default values
- Notify sources of errors found
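The duplicate-tuple case (Jane Doe vs. Jane Q. Doe) can be approached by normalizing names before comparing them. The sketch below uses one simple, hypothetical rule: strip punctuation, lowercase, and drop single-letter middle initials. Real cleaning tools use far richer matching; the function names here are illustrative only.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Lowercase, strip punctuation, and drop middle initials."""
    tokens = re.sub(r"[^\w\s]", "", name).lower().split()
    last = len(tokens) - 1
    # Keep the first and last token; drop one-letter tokens in between.
    kept = [t for i, t in enumerate(tokens) if len(t) > 1 or i in (0, last)]
    return " ".join(kept)


def find_duplicates(names):
    """Group names that normalize to the same key; return groups of size > 1."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


dupes = find_duplicates(["Jane Doe", "Jane Q. Doe", "John Smith"])
```

A rule this crude will also merge distinct people who share a first and last name, which is why candidate duplicates are usually confirmed against other attributes before removal.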
11
Data Cleaning
- Migration (e.g., yen to dollars)
- Scrubbing: use domain-specific knowledge (e.g., social security numbers)
- Fusion (e.g., mail list, customer merging)
[Figure: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe)]
12
Loading Data in the Warehouse
- Incremental vs. refresh
- Off-line vs. on-line
- Frequency of loading
  - At night, 1x a week/month, continuously
- Parallel/partitioned load
13
Warehouse Maintenance
- Warehouse data: materialized view
  - Initial loading
  - View maintenance
- Derived warehouse data
  - indexes
  - aggregates
  - materialized views
- View maintenance
14
Materialized Views
- Define new warehouse relations using SQL expressions
- The resulting relation does not exist at any source
15
Differs from Conventional View Maintenance...
- Warehouses may be highly aggregated and summarized
- Warehouse views may be over history of base data
- Process large batch updates
- Schema may evolve
16
Differs from Conventional View Maintenance...
- Base data doesn’t participate in view maintenance
  - Simply reports changes
  - Loosely coupled
  - Absence of locking, global transactions
  - May not be queryable
17
Warehouse Maintenance Anomalies
- Materialized view maintenance in loosely coupled, non-transactional environment
- Simple example: Sold = Sale ⋈ Emp
[Figure: the Sales source holds Sale(item,clerk) and the Comp. source holds Emp(clerk,age); both report to the Integrator, which maintains Sold(item,clerk,age) in the Data Warehouse]
18
Warehouse Maintenance Anomalies
1. Insert (Mary,25) into Emp, notify integrator
2. Insert (Computer,Mary) into Sale, notify integrator
3. For (1), integrator adds Sale ⋈ (Mary,25)
4. For (2), integrator adds (Computer,Mary) ⋈ Emp
5. View incorrect (duplicate tuple)
[Figure: Sales source with Sale(item,clerk), Comp. source with Emp(clerk,age), Integrator, and Data Warehouse with Sold(item,clerk,age)]
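The five steps above can be replayed in a short simulation. It assumes, as the scenario implies, that the integrator processes each notification only after both source inserts have already happened, and that it evaluates each incremental join against the current source state. The variable and function names are hypothetical.

```python
def replay_anomaly():
    sale = []      # source relation Sale(item, clerk)
    emp = []       # source relation Emp(clerk, age)
    sold = []      # warehouse view Sold(item, clerk, age), kept as a bag
    pending = []   # notifications not yet processed by the integrator

    # Steps 1 and 2: both sources change before the integrator reacts.
    emp.append(("Mary", 25))
    pending.append(("emp", ("Mary", 25)))
    sale.append(("Computer", "Mary"))
    pending.append(("sale", ("Computer", "Mary")))

    # Steps 3 and 4: each notification is joined with the CURRENT source
    # state, which already contains the other insert.
    for relation, row in pending:
        if relation == "emp":
            clerk, age = row
            sold += [(item, c, age) for item, c in sale if c == clerk]
        else:
            item, clerk = row
            sold += [(item, clerk, age) for c, age in emp if c == clerk]
    return sold


sold = replay_anomaly()
```

Step 5 shows up as two copies of the tuple `("Computer", "Mary", 25)` in `sold`, while the true join of the two sources contains it only once.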
19
Maintenance Anomaly - Solutions
- Incremental update algorithms (ECA, Strobe, etc.)
- Research issues: self-maintainable views
  - What views are self-maintainable
  - Store auxiliary views so original + auxiliary views are self-maintainable
20
Self-Maintainability: Examples
Sold(item,clerk,age) = Sale(item,clerk) ⋈ Emp(clerk,age)
- Inserts into Emp
  - If Emp.clerk is key and Sale.clerk is foreign key (with ref. int.) then no effect
- Inserts into Sale
  - Maintain auxiliary view: Emp − π_{clerk,age}(Sold)
- Deletes from Emp
  - Delete from Sold based on clerk
21
Self-Maintainability: Examples
- Deletes from Sale
  - Delete from Sold based on {item,clerk}
  - Unless age at time of sale is relevant
- Auxiliary views for self-maintainability
  - Must themselves be self-maintainable
  - One solution: all source data
  - But want minimal set
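The four update cases for Sold (the join of Sale and Emp on clerk) can be put together in one sketch that never queries the sources. It is one possible reading of the slides, under these assumptions: Emp.clerk is a key, Sale.clerk is a foreign key, Sale holds no duplicate (item, clerk) pairs, and the auxiliary view holds the Emp tuples whose clerk does not yet appear in Sold. The class name `SoldView` and its methods are hypothetical.

```python
class SoldView:
    """Sold(item, clerk, age) maintained from reported changes only."""

    def __init__(self):
        self.sold = set()   # tuples (item, clerk, age)
        self.aux = {}       # clerk -> age, for clerks with no tuple in Sold

    def _age_of(self, clerk):
        for _, c, age in self.sold:
            if c == clerk:
                return age
        return self.aux.get(clerk)

    def insert_emp(self, clerk, age):
        # Key + foreign key: a new clerk cannot have sales yet, so Sold
        # is unaffected; only the auxiliary view records the tuple.
        self.aux[clerk] = age

    def insert_sale(self, item, clerk):
        age = self._age_of(clerk)
        if age is None:
            raise KeyError(f"unknown clerk {clerk!r}")
        self.sold.add((item, clerk, age))
        self.aux.pop(clerk, None)

    def delete_emp(self, clerk):
        # Delete from Sold based on clerk.
        self.sold = {t for t in self.sold if t[1] != clerk}
        self.aux.pop(clerk, None)

    def delete_sale(self, item, clerk):
        # Delete from Sold based on {item, clerk}; if that was the clerk's
        # last sale, move the clerk back into the auxiliary view.
        age = self._age_of(clerk)
        self.sold = {t for t in self.sold if (t[0], t[1]) != (item, clerk)}
        if age is not None and self._age_of(clerk) is None:
            self.aux[clerk] = age


view = SoldView()
view.insert_emp("Mary", 25)
view.insert_sale("Computer", "Mary")
after_insert = set(view.sold)
view.delete_sale("Computer", "Mary")
```

The auxiliary view is itself maintained from the same change reports, which is the property the slide asks for: auxiliary views must themselves be self-maintainable.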
22
Partial Self-Maintainability
- Avoid (but don’t prohibit) going to sources
- Sold = Sale(item,clerk) ⋈ Emp(clerk,age)
- Inserts into Sale
  - Check if clerk already in Sold, go to source if not
  - Or replicate all clerks over age 30
  - Or...
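The first option for inserts into Sale (check Sold first, go to the source only on a miss) might look like the sketch below. The source lookup is passed in as a callback; `fetch_age` and the sample `EMP_SOURCE` dictionary are hypothetical stand-ins for a real query against the Emp source.

```python
def insert_sale(sold, item, clerk, fetch_age):
    """Add a sale to Sold; return True if the source had to be queried."""
    # Try to find the clerk's age in the warehouse view itself.
    age = next((a for _, c, a in sold if c == clerk), None)
    went_to_source = age is None
    if went_to_source:
        age = fetch_age(clerk)
    sold.add((item, clerk, age))
    return went_to_source


# Hypothetical source relation Emp(clerk, age) and a counter of lookups.
EMP_SOURCE = {"Mary": 25, "Joe": 31}
source_calls = []


def fetch_age(clerk):
    source_calls.append(clerk)
    return EMP_SOURCE[clerk]


sold = set()
first = insert_sale(sold, "Computer", "Mary", fetch_age)   # miss: asks the source
second = insert_sale(sold, "Printer", "Mary", fetch_age)   # hit: answered locally
```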
23
Warehouse Specification (ideally)
[Figure: several Extractor/Monitor components feed an Integrator, which feeds the Warehouse; alongside them sit the Metadata and a Warehouse Configuration Module, whose inputs are View Definitions, Integration Rules, and Change Detection Requirements]
24
Processing
- ROLAP servers vs. MOLAP servers
- Index structures
- What to materialize?
- Algorithms
[Figure: warehouse architecture with the layers Client, Query & Analysis, Warehouse, Metadata, Integration, Source]
25
ROLAP Server
- Relational OLAP server
- Special indices, tuning; schema is “denormalized”
[Figure: tools and utilities on top of a ROLAP server, which sits on a relational DBMS]
26
MOLAP Server
- Multi-dimensional OLAP server
- Could also sit on relational DBMS
[Figure: tools and utilities on top of a multi-dimensional server (M.D. tools); a Sales cube with dimensions Product (milk, soda, eggs, soap), City (A, B), and Date (1, 2, 3, 4)]
27
Index Structures (sketch)
- Traditional access methods
  - B-trees, hash tables, R-trees, grids, …
- Popular in warehouses
  - inverted lists
  - bit map indexes
  - join indexes
  - text indexes
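To make the bit map index concrete, the sketch below builds one bitmap per distinct column value, using a Python integer as the bit vector (bit i is set when row i has that value). A conjunctive query then becomes a bitwise AND. The column contents and function names are hypothetical.

```python
def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set if row i has it."""
    index = {}
    for pos, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << pos)
    return index


def matching_rows(bitmap):
    """Return the row positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical fact-table columns, one entry per row.
city = ["A", "B", "A", "B"]
product = ["milk", "milk", "soda", "eggs"]

city_idx = build_bitmap_index(city)
product_idx = build_bitmap_index(product)

# Rows where city = 'A' AND product = 'milk'.
rows = matching_rows(city_idx["A"] & product_idx["milk"])
```

Bitmaps suit warehouse columns with few distinct values, where each bitmap is dense enough to be worth storing and ANDs or ORs replace row-by-row filtering.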
28
What to Materialize?
- Store in warehouse results useful for common queries
- Example: total sales
[Figure: sales tables for day 1 and day 2, with the total sales aggregate marked “materialize”]
29
Materialization Factors
- Type/frequency of queries
- Query response time
- Storage cost
- Update cost
30
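Cube Aggregates Lattice
- Level 3: (city, product, date)
- Level 2: (city, product); (city, date); (product, date)
- Level 1: (city); (product); (date)
- Level 0: all
- Use greedy algorithm to decide what to materialize
[Figure: the lattice drawn next to the day 1 / day 2 sales tables]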
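The slide does not say which greedy algorithm is meant. The sketch below is one common benefit-based formulation, offered as an assumption: the finest view is always materialized, every query costs as many rows as its smallest materialized ancestor, and each round picks the view that reduces total cost the most. The view sizes are invented for illustration.

```python
def greedy_select(sizes, k):
    """Pick up to k views to materialize in addition to the finest view."""
    top = max(sizes, key=len)   # finest view, always kept
    chosen = [top]
    # cost[w]: rows scanned to answer w from its cheapest materialized ancestor.
    cost = {w: sizes[top] for w in sizes}

    def benefit(v):
        # Saving summed over every view w computable from v (w <= v: subset).
        return sum(max(0, cost[w] - sizes[v]) for w in sizes if w <= v)

    for _ in range(k):
        candidates = [v for v in sizes if v not in chosen]
        if not candidates:
            break
        best = max(candidates, key=benefit)
        if benefit(best) <= 0:
            break
        chosen.append(best)
        for w in sizes:
            if w <= best:
                cost[w] = min(cost[w], sizes[best])
    return chosen


V = frozenset
# Hypothetical row counts for each group-by in the lattice.
SIZES = {
    V({"city", "product", "date"}): 6000,
    V({"city", "product"}): 800,
    V({"city", "date"}): 6000,
    V({"product", "date"}): 5000,
    V({"city"}): 10,
    V({"product"}): 50,
    V({"date"}): 365,
    V(): 1,   # the "all" node: a single grand total
}

chosen = greedy_select(SIZES, k=2)
```

With these sizes the first pick is (city, product), because it cheaply answers four nodes of the lattice; (city, date) is never worth materializing since it is as large as the finest view.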
31
Dimension Hierarchies
- Levels, from coarsest to finest: all, state, city
32
Dimension Hierarchies
- (city, product, date)
- (city, product); (city, date); (product, date)
- (city); (product); (date)
- (state, product, date)
- (state, product); (state, date)
- (state)
- all
- Not all arcs shown...
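Moving along a dimension hierarchy is a roll-up: the (state, product, date) aggregate can be computed from the (city, product, date) aggregate without touching the base data. A minimal sketch, assuming a hypothetical city-to-state mapping and invented sales figures:

```python
from collections import defaultdict

# Hypothetical dimension table mapping each city to its state.
CITY_TO_STATE = {"San Jose": "CA", "Los Angeles": "CA", "Seattle": "WA"}


def roll_up_city_to_state(sales):
    """Aggregate (city, product, date) totals into (state, product, date)."""
    totals = defaultdict(int)
    for (city, product, date), amount in sales.items():
        totals[(CITY_TO_STATE[city], product, date)] += amount
    return dict(totals)


by_city = {
    ("San Jose", "milk", 1): 12,
    ("Los Angeles", "milk", 1): 30,
    ("Seattle", "milk", 1): 7,
}
by_state = roll_up_city_to_state(by_city)
```

This works because sums are distributive over the hierarchy; aggregates such as medians cannot be rolled up this way.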
33
Interesting Hierarchy
[Figure: a time hierarchy over the levels all, years, quarters, months, weeks, days, shown next to a conceptual dimension table]
34
Managing
- Metadata
- Warehouse design
- Tools
[Figure: warehouse architecture with the layers Client, Query & Analysis, Warehouse, Metadata, Integration, Source]
35
Metadata
- Administrative
  - definition of sources, tools, ...
  - schemas, dimension hierarchies, …
  - rules for extraction, cleaning, …
  - refresh, purging policies
  - user profiles, access control, ...
36
Metadata
- Business
  - business terms & definitions
  - data ownership, charging
- Operational
  - data lineage
  - data currency (e.g., active, archived, purged)
  - use stats, error reports, audit trails
37
Design Summary
- What data is needed?
- Where does it come from?
- How to clean data?
- How to represent in warehouse (schema)?
- What to summarize?
- What to materialize?
- What to index?
38
Tools
- Development
  - design & edit: schemas, views, scripts, rules, queries, reports
- Planning & Analysis
  - what-if scenarios (schema changes, refresh rates), capacity planning
- Warehouse Management
  - performance monitoring, usage patterns, exception reporting
- System & Network Management
  - measure traffic (sources, warehouse, clients)
- Workflow Management
  - “reliable scripts” for cleaning & analyzing data
39
Current State of Industry
- Extraction and integration done off-line
  - Usually in large, time-consuming batches
- Everything copied at warehouse
  - Not selective about what is stored
  - Query benefit vs. storage & update cost
- Query optimization aimed at OLTP
  - High throughput instead of fast response
  - Process whole query before displaying anything
40
State of Commercial Practice...
- Connectivity to sources
  - Apertus
  - Information Builders
  - Informix Enterprise Gateway
  - Oracle Open Connect
  - CA-Ingres gateway
  - MS ODBC
  - Platinum InfoHub
- Data extract, clean, transform, refresh
  - CA-Ingres Replicator
  - ETI-Extract
  - IBM Data Joiner, Data Propagator
  - Prism Warehouse Manager
  - SAS Access
  - Sybase Replication Server
  - Trinzic InfoPump
41
… State of Commercial Practice...
- Multidimensional Database Engines
  - Arbor Essbase
  - Oracle IRI Express
  - Comshare Commander
  - SAS System
- Warehouse Data Servers
  - CA-Ingres
  - Oracle 8
  - RedBrick
  - Sybase IQ
  - Informix Dynamic Server
  - IBM DB2
- ROLAP Servers
  - HP Intelligent Warehouse
  - Informix Metacube
  - MicroStrategy DSS Server
  - Information Advantage Axsys
42
… State of Commercial Practice
- Query/Reporting Environments
  - IBM DataGuide
  - SAS Access
  - CA Visual Express
  - Platinum Forest&Trees
  - Informix ViewPoint
- Multidimensional Analysis
  - Kenan Systems Acumate
  - Microsoft Excel
  - Arbor Essbase Analysis server
  - Cognos PowerPlay
  - IQ Software IQ/Vision
  - Lotus 123
  - SAS OLAP++
  - Business Objects
- Lots and lots of consulting!!
43
Future Directions
- Better performance
- Larger warehouses
- Easier to use
- What are companies & research labs working on?
44
Research (1)
- Incremental Maintenance
- Data Consistency
- Data Expiration
- Recovery
- Data Quality
- Error Handling (Back Flush)
45
Research (2)
- Rapid Monitor Construction
- Temporal Warehouses
- Materialization & Index Selection
- Data Fusion
- Data Mining
- Integration of Text & Relational Data
- Conceptual Modelling
46
Conclusions
- Massive amounts of data and complexity of queries will push limits of current warehouses
- Need better systems:
  - easier to use
  - provide quality information