Data Warehouse Design Enrico Franconi CS 636.

Presentation transcript:

Data Warehouse Design Enrico Franconi CS 636

Implementing a Warehouse. Monitoring: sending data from sources. Integrating: loading, cleansing, ... Processing: query processing, indexing, ... Managing: metadata, design, ... CS 336

Monitoring. Source types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … How to get data out? Replication tool; dump file; create report; ODBC or third-party “wrappers”.

Monitoring Techniques: periodic snapshots; database triggers; log shipping; data shipping (replication service); transaction shipping; polling (queries to source); screen scraping; application-level monitoring.
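Periodic snapshots only help the warehouse once two successive snapshots are compared. A minimal Python sketch, assuming each snapshot fits in memory as a dict keyed by primary key; the name `snapshot_diff` and the sample rows are illustrative and not from the slides:

```python
def snapshot_diff(old, new):
    """Compare two snapshots (dicts keyed by primary key) and return the changes."""
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserts, deletes, updates


# Hypothetical snapshots of an Emp(clerk, age) source taken on two days.
monday = {1: ("Mary", 25), 2: ("Joe", 31)}
tuesday = {1: ("Mary", 26), 3: ("Ann", 40)}
inserts, deletes, updates = snapshot_diff(monday, tuesday)
```

Only the three change sets need to be shipped to the integrator, not the full snapshot.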

Monitoring Issues. Frequency: periodic (daily, weekly, …) or triggered (on a “big” change, lots of changes, ...). Data transformation: convert data to a uniform format; remove and add fields (e.g., add date to get history). Standards (e.g., ODBC). Gateways.

Wrapper: converts data and queries from one data model to another, and extends query capabilities for sources with limited capabilities. [Figure: the wrapper sits between the source and its clients, exchanging queries and data between Data Model A and Data Model B.]
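As an illustration of a wrapper that extends a limited source, the sketch below wraps a flat file that can only be read sequentially and answers select-project queries over it. The class `FlatFileWrapper`, its `query` method, and the sample data are all hypothetical, and the query form (column list plus equality predicates) is an assumption made for brevity:

```python
import csv
import io


class FlatFileWrapper:
    """Hypothetical wrapper: exposes a CSV source as a queryable relation."""

    def __init__(self, text):
        self._text = text  # the source itself supports only a sequential scan

    def query(self, columns, where=None):
        """Answer a select-project query that the flat file cannot answer itself."""
        where = where or {}
        rows = csv.DictReader(io.StringIO(self._text))
        return [
            tuple(row[c] for c in columns)
            for row in rows
            if all(row[k] == v for k, v in where.items())
        ]


source = "item,clerk\nComputer,Mary\nPrinter,Joe\nMonitor,Mary\n"
wrapper = FlatFileWrapper(source)
result = wrapper.query(["item"], where={"clerk": "Mary"})
```

Here the relational-style query arrives in one data model and the wrapper translates it into a scan of the other.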

Wrapper Generation. Solution 1: hard code for each source. Solution 2: automatic wrapper generation. [Figure: Definition, Wrapper Generator, Wrapper.]

Integration: data cleaning; data loading; derived data. [Figure: warehouse architecture with Client, Query & Analysis, Warehouse, Metadata, Integration, and Source components.]

Data Integration: receive data (changes) from multiple wrappers/monitors and integrate into the warehouse. Rule-based. Actions: resolve inconsistencies; eliminate duplicates; integrate into warehouse (may not be empty); summarize data; fetch more data from sources (warehouse updates); etc.

Data Cleaning. Find (and remove) duplicate tuples, e.g., Jane Doe vs. Jane Q. Doe. Detect inconsistent, wrong data: attribute values that don’t match. Patch missing, unreadable data: insert default values. Notify sources of errors found.
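The Jane Doe vs. Jane Q. Doe case can be sketched with a crude matching key. This is only one possible heuristic (drop one-letter initials, ignore case and punctuation), and the names `name_key` and `find_duplicates` are made up for the example. Real cleaning tools rely on far richer domain knowledge:

```python
from collections import defaultdict


def name_key(name):
    """Normalize a name: lowercase, strip punctuation, drop one-letter initials."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    return tuple(t for t in tokens if len(t) > 1)


def find_duplicates(names):
    """Group names that share a normalized key and return groups of size > 1."""
    groups = defaultdict(list)
    for n in names:
        groups[name_key(n)].append(n)
    return [g for g in groups.values() if len(g) > 1]


dupes = find_duplicates(["Jane Doe", "Jane Q. Doe", "John Smith"])
```

A key this loose will also merge distinct people who share first and last names, so candidate groups would normally be reviewed or checked against other attributes before tuples are removed.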

Data Cleaning. Migration (e.g., yen to dollars). Scrubbing: use domain-specific knowledge (e.g., social security numbers). Fusion (e.g., mail list, customer merging). [Figure: customer1(Joe) and customer2(Joe), from the billing DB and the service DB, are fused into merged_customer(Joe).]

Loading Data in the Warehouse: incremental vs. refresh; off-line vs. on-line; frequency of loading (at night, once a week/month, continuously); parallel/partitioned load.

Warehouse Maintenance. Warehouse data ≈ materialized view: initial loading; view maintenance. Derived warehouse data: indexes; aggregates; materialized views.

Materialized Views: define new warehouse relations using SQL expressions; the resulting relation does not exist at any source.

Differs from Conventional View Maintenance... Warehouses may be highly aggregated and summarized. Warehouse views may be over the history of base data. Process large batch updates. Schema may evolve.

Differs from Conventional View Maintenance... Base data doesn’t participate in view maintenance: simply reports changes; loosely coupled; absence of locking, global transactions; may not be queryable.

Warehouse Maintenance Anomalies: materialized view maintenance in a loosely coupled, non-transactional environment. Simple example: the data warehouse holds Sold(item,clerk,age), defined as Sold = Sale ⋈ Emp; the integrator receives changes from the source, Sales Comp., which holds Sale(item,clerk) and Emp(clerk,age).

Warehouse Maintenance Anomalies. Setup: data warehouse with Sold(item,clerk,age); integrator; source Sales Comp. with Sale(item,clerk) and Emp(clerk,age). 1. Insert into Emp(Mary,25), notify integrator. 2. Insert into Sale(Computer,Mary), notify integrator. 3. (1) → integrator adds Sale ⋈ (Mary,25). 4. (2) → integrator adds (Computer,Mary) ⋈ Emp. 5. View incorrect (duplicate tuple).
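The five steps can be replayed in a few lines of Python. This is a sketch under simplifying assumptions: relations are plain lists, the warehouse view is kept as a bag, and the integrator answers each notification by joining the changed tuple against the current source state. The variable names are illustrative:

```python
sale, emp = [], []   # source relations Sale(item, clerk) and Emp(clerk, age)
sold = []            # warehouse view Sold(item, clerk, age), kept as a bag
pending = []         # notifications the integrator has not processed yet

# Steps 1 and 2: both source updates commit before the integrator reacts.
emp.append(("Mary", 25))
pending.append(("emp", ("Mary", 25)))
sale.append(("Computer", "Mary"))
pending.append(("sale", ("Computer", "Mary")))

# Steps 3 and 4: each notification is joined against the *current* source
# state, which already contains the other update.
for relation, row in pending:
    if relation == "emp":
        clerk, age = row
        sold += [(item, c, age) for item, c in sale if c == clerk]
    else:
        item, clerk = row
        sold += [(item, clerk, age) for c, age in emp if c == clerk]

# Step 5: the correct view has one tuple, the maintained view has two.
correct = [(i, c, a) for i, c in sale for c2, a in emp if c == c2]
```

The duplicate arises because the source state seen by the integrator is later than the state each notification describes.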

Maintenance Anomaly - Solutions. Incremental update algorithms (ECA, Strobe, etc.). Research issues: self-maintainable views. What views are self-maintainable? Store auxiliary views so that original + auxiliary views are self-maintainable.

Self-Maintainability: Examples. Sold(item,clerk,age) = Sale(item,clerk) ⋈ Emp(clerk,age). Inserts into Emp: if Emp.clerk is a key and Sale.clerk is a foreign key (with referential integrity), then no effect. Inserts into Sale: maintain auxiliary view Emp − π clerk,age(Sold). Deletes from Emp: delete from Sold based on clerk.

Self-Maintainability: Examples. Deletes from Sale: delete from Sold based on {item,clerk}, unless age at time of sale is relevant. Auxiliary views for self-maintainability must themselves be self-maintainable. One solution: all source data, but we want a minimal set.
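The four cases above can be put together in one small sketch. It assumes Emp.clerk is a key, Sale.clerk is a foreign key with referential integrity, and that the auxiliary view holds the employees not yet present in Sold. The class `SoldView` and its method names are hypothetical:

```python
class SoldView:
    """Sketch: Sold = Sale ⋈ Emp maintained without querying the sources."""

    def __init__(self):
        self.sold = set()  # tuples (item, clerk, age)
        self.aux = {}      # auxiliary view: clerk -> age for clerks not in Sold

    def _age_of(self, clerk):
        for _, c, age in self.sold:
            if c == clerk:
                return age
        return self.aux.get(clerk)

    def insert_emp(self, clerk, age):
        # No Sale row can reference a brand-new clerk, so Sold is unchanged.
        self.aux[clerk] = age

    def insert_sale(self, item, clerk):
        age = self._age_of(clerk)
        if age is None:
            raise KeyError(f"unknown clerk {clerk!r}: would have to query the source")
        self.sold.add((item, clerk, age))
        self.aux.pop(clerk, None)

    def delete_emp(self, clerk):
        self.sold = {t for t in self.sold if t[1] != clerk}
        self.aux.pop(clerk, None)

    def delete_sale(self, item, clerk):
        age = self._age_of(clerk)
        self.sold.discard((item, clerk, age))
        if age is not None and self._age_of(clerk) is None:
            self.aux[clerk] = age  # clerk left Sold, so it re-enters the auxiliary view


view = SoldView()
view.insert_emp("Mary", 25)
view.insert_sale("Computer", "Mary")
view.insert_emp("Joe", 31)
```

Note that `delete_sale` keeps the auxiliary view itself self-maintainable: the age needed to move a clerk back is read from the Sold tuple being deleted.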

Partial Self-Maintainability: avoid (but don’t prohibit) going to sources. Sold = Sale(item,clerk) ⋈ Emp(clerk,age). Inserts into Sale: check if clerk is already in Sold, go to source if not; or replicate all clerks over age 30; or ...

Warehouse Specification (ideally). [Figure: components shown are View Definitions, Warehouse Configuration Module, Integration Rules, Change Detection Requirements, Warehouse, Integrator, Metadata, and several Extractor/Monitors.]

Processing: ROLAP servers vs. MOLAP servers; index structures; what to materialize?; algorithms. [Figure: warehouse architecture with Client, Query & Analysis, Warehouse, Metadata, Integration, and Source components.]

ROLAP Server: Relational OLAP server. [Figure: tools and utilities, ROLAP server, relational DBMS.] Special indices, tuning; schema is “denormalized”.

MOLAP Server: Multi-Dimensional OLAP server. [Figure: M.D. tools and utilities on a multi-dimensional server holding a Sales cube with dimensions Product (milk, soda, eggs, soap), City (A, B), and Date (1, 2, 3, 4).] The multi-dimensional server could also sit on a relational DBMS.

Index Structures (sketch). Traditional access methods: B-trees, hash tables, R-trees, grids, … Popular in warehouses: inverted lists; bit map indexes; join indexes; text indexes.
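To show why bit map indexes suit warehouse queries, here is a minimal sketch that stores one bitmap per distinct value as a Python integer and answers a conjunctive predicate with a bitwise AND. The function names and the sample rows are illustrative, and a real system would use compressed bitmaps:

```python
def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose bit i is set if row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_rows(bitmap):
    """Return the row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


sales = [
    {"product": "milk", "city": "A"},
    {"product": "soda", "city": "B"},
    {"product": "milk", "city": "B"},
    {"product": "eggs", "city": "A"},
]
by_product = build_bitmap_index(sales, "product")
by_city = build_bitmap_index(sales, "city")

# product = 'milk' AND city = 'B' becomes a single bitwise AND.
hits = matching_rows(by_product["milk"] & by_city["B"])
```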

What to Materialize? Store in the warehouse results useful for common queries. Example: total sales. [Figure: sales for day 1, day 2, . . . are aggregated and the totals (e.g., 129) materialized.]

Materialization Factors: type/frequency of queries; query response time; storage cost; update cost.

Cube Aggregates Lattice. [Figure: lattice of group-bys: all; city, product, date; (city, product), (city, date), (product, date); (city, product, date); illustrated with the day 1 and day 2 tables and the total 129.] Use a greedy algorithm to decide what to materialize.
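The slide does not say which greedy algorithm is meant. One common formulation, in the style of Harinarayan, Rajaraman, and Ullman, repeatedly picks the view whose materialization saves the most rows scanned across all group-bys it can answer. The sketch below assumes that formulation; the row counts in `SIZE` are invented for illustration:

```python
DIMS = ("city", "product", "date")

# Hypothetical row counts for each group-by; not taken from the slides.
SIZE = {
    frozenset(DIMS): 6000,
    frozenset({"city", "product"}): 800,
    frozenset({"city", "date"}): 6000,
    frozenset({"product", "date"}): 5000,
    frozenset({"city"}): 20,
    frozenset({"product"}): 40,
    frozenset({"date"}): 365,
    frozenset(): 1,  # the "all" node
}


def cost(view, materialized):
    """Size of the cheapest materialized view that can answer `view`."""
    return min(SIZE[m] for m in materialized if view <= m)


def benefit(candidate, materialized):
    """Total rows saved, over every group-by the candidate can answer."""
    return sum(
        max(0, cost(v, materialized) - SIZE[candidate])
        for v in SIZE
        if v <= candidate
    )


def greedy_select(k):
    chosen = [frozenset(DIMS)]  # the base cuboid is always stored
    for _ in range(k):
        remaining = [v for v in SIZE if v not in chosen]
        chosen.append(max(remaining, key=lambda v: benefit(v, chosen)))
    return chosen


picked = greedy_select(2)
```

With these sizes the first pick is (city, product), because it is much smaller than the base cuboid and answers four group-bys; the second pick is date, which (city, product) cannot answer.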

Dimension Hierarchies: all, state, city (from coarsest to finest).

Dimension Hierarchies. [Figure: the lattice extended with state: all; city, product, date, state; (city, product), (city, date), (product, date), (state, date), (state, product); (city, product, date), (state, product, date). Not all arcs shown.]
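An arc such as (city, product, date) to (state, product, date) corresponds to a roll-up along the city-to-state hierarchy. A minimal sketch, assuming a hypothetical `STATE_OF` mapping and made-up fact rows:

```python
from collections import defaultdict

# Hypothetical city -> state mapping and fact rows (city, product, date, amount).
STATE_OF = {"San Jose": "CA", "Fresno": "CA", "Reno": "NV"}
facts = [
    ("San Jose", "milk", 1, 12),
    ("Fresno", "milk", 1, 7),
    ("Reno", "milk", 1, 5),
    ("San Jose", "soda", 2, 3),
]


def roll_up_to_state(rows):
    """Aggregate (city, product, date) facts to (state, product, date)."""
    totals = defaultdict(int)
    for city, product, date, amount in rows:
        totals[(STATE_OF[city], product, date)] += amount
    return dict(totals)


by_state = roll_up_to_state(facts)
```

Because every city belongs to exactly one state, the state-level aggregate can be computed from the city-level one without returning to the base data.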

Interesting Hierarchy. [Figure: time dimension with levels all, years, quarters, months, weeks, and days, shown alongside a conceptual dimension table.]

Managing: metadata; warehouse design; tools. [Figure: warehouse architecture with Client, Query & Analysis, Metadata, Integration, and Source components.]

Metadata. Administrative: definition of sources, tools, ...; schemas, dimension hierarchies, …; rules for extraction, cleaning, …; refresh, purging policies; user profiles, access control, ...

Metadata. Business: business terms & definitions; data ownership, charging. Operational: data lineage; data currency (e.g., active, archived, purged); use stats, error reports, audit trails.

Design Summary: What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index?

Tools. Development: design & edit schemas, views, scripts, rules, queries, reports. Planning & Analysis: what-if scenarios (schema changes, refresh rates), capacity planning. Warehouse Management: performance monitoring, usage patterns, exception reporting. System & Network Management: measure traffic (sources, warehouse, clients). Workflow Management: “reliable scripts” for cleaning & analyzing data.

Current State of Industry. Extraction and integration done off-line, usually in large, time-consuming batches. Everything copied at warehouse: not selective about what is stored; query benefit vs. storage & update cost. Query optimization aimed at OLTP: high throughput instead of fast response; process whole query before displaying anything.

State of Commercial Practice ... Connectivity to sources: Apertus; Information Builders; Informix Enterprise Gateway; Oracle Open Connect; CA-Ingres gateway; MS ODBC; Platinum InfoHub. Data extract, clean, transform, refresh: CA-Ingres Replicator; ETI-Extract; IBM Data Joiner, Data Propagator; Prism Warehouse manager; SAS Access; Sybase Replication Server; Trinzic InfoPump.

… State of Commercial Practice ... Multidimensional database engines: Arbor Essbase; Oracle IRI Express; Comshare Commander; SAS System. Warehouse data servers: CA-Ingres; Oracle 8; RedBrick; Sybase IQ; Informix Dynamic Server; IBM DB2. ROLAP servers: HP Intelligent Warehouse; Informix Metacube; MicroStrategy DSS Server; Information Advantage Axsys.

… State of Commercial Practice. Multidimensional analysis: Kenan Systems Acumate; Microsoft Excel; Arbor Essbase Analysis server; Cognos PowerPlay; IQ Software IQ/Vision; Lotus 1-2-3; SAS OLAP++; Business Objects. Query/reporting environments: IBM DataGuide; SAS Access; CA Visual Express; Platinum Forest&Trees; Informix ViewPoint. Lots and lots of consulting!!

Future Directions: better performance; larger warehouses; easier to use. What are companies & research labs working on?

Research (1): incremental maintenance; data consistency; data expiration; recovery; data quality; error handling (back flush).

Research (2): rapid monitor construction; temporal warehouses; materialization & index selection; data fusion; data mining; integration of text & relational data; conceptual modelling.

Conclusions. Massive amounts of data and complexity of queries will push the limits of current warehouses. Need better systems: easier to use; provide quality information.