Data Warehousing and OLAP. Warehousing ► Growing industry: $8 billion in 1998 ► Range from desktop to huge:  Walmart: 900-CPU, 2,700 disk, 23TB Teradata.

Slides:



Advertisements
Similar presentations
An overview of Data Warehousing and OLAP Technology Presented By Manish Desai.
Advertisements

OLAP Tuning. Outline OLAP 101 – Data warehouse architecture – ROLAP, MOLAP and HOLAP Data Cube – Star Schema and operations – The CUBE operator – Tuning.
Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
Data Warehouse Design Enrico Franconi CS 636. CS 3362 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...
Data Warehousing CPS216 Notes 13 Shivnath Babu. 2 Warehousing l Growing industry: $8 billion way back in 1998 l Range from desktop to huge: u Walmart:
Introduction to Data Warehousing CPS Notes 6.
Data Warehousing M R BRAHMAM.
CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #15.
Data Warehousing Overview
Lecture 1: Data Warehousing Based on the slides by Jeffrey D. Ullman and Hector Garcia-Molina at Stanford University 1.
1 1 Data Warehousing Decision-Support Systems  Data Analysis  OLAP  Extended aggregation features in SQL –Windowing and ranking  Implementation Techniques.
Data Warehousing and OLAP
Advanced Querying OLAP Part 2. Context OLAP systems for supporting decision making. Components: –Dimensions with hierarchies, –Measures, –Aggregation.
COMP 578 Data Warehousing And OLAP Technology Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University.
1 Lecture 10: More OLAP - Dimensional modeling
Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS Notes11.
Introduction to Data Warehousing Enrico Franconi CS 636.
Data Warehousing. On-Line Analytical Processing (OLAP) Tools The use of a set of graphical tools that provides users with multidimensional views of their.
CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.
Chapter 13 The Data Warehouse
DATA WAREHOUSE (Muscat, Oman).
CS346: Advanced Databases
M ODULE 5 Metadata, Tools, and Data Warehousing Section 4 Data Warehouse Administration 1 ITEC 450.
Data Conversion to a Data warehouse Presented By Sanjay Gunasekaran.
8/20/ Data Warehousing and OLAP. 2 Data Warehousing & OLAP Defined in many different ways, but not rigorously. Defined in many different ways, but.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
Joachim Hammer 1 Data Warehousing Overview, Terminology, and Research Issues Joachim Hammer.
Week 6 Lecture The Data Warehouse Samuel Conn, Asst. Professor
©Silberschatz, Korth and Sudarshan18.1Database System Concepts - 5 th Edition, Aug 26, 2005 Buzzword List OLTP – OnLine Transaction Processing (normalized,
1 Cube Computation and Indexes for Data Warehouses CPS Notes 7.
Data warehousing and online analytical processing- Ref Chap 4) By Asst Prof. Muhammad Amir Alam.
Data Warehousing Xintao Wu. Can You Easily Answer These Questions? What are Personnel Services costs across all departments for all funding sources? What.
1 Data Warehouses BUAD/American University Data Warehouses.
OLAP & DSS SUPPORT IN DATA WAREHOUSE By - Pooja Sinha Kaushalya Bakde.
Data Warehousing.
October 28, Data Warehouse Architecture Data Sources Operational DBs other sources Analysis Query Reports Data mining Front-End Tools OLAP Engine.
Decision Support and Date Warehouse Jingyi Lu. Outline Decision Support System OLAP vs. OLTP What is Date Warehouse? Dimensional Modeling Extract, Transform,
13 1 Chapter 13 The Data Warehouse Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel.
Ch3 Data Warehouse Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS Notes11.
Business Intelligence Transparencies 1. ©Pearson Education 2009 Objectives What business intelligence (BI) represents. The technologies associated with.
Data Warehousing.
Advanced Database Concepts
Copyright© 2014, Sira Yongchareon Department of Computing, Faculty of Creative Industries and Business Lecturer : Dr. Sira Yongchareon ISCG 6425 Data Warehousing.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.
An Overview of Data Warehousing and OLAP Technology
Chapter 111 Chapter 11 Information Integration Spring 2001 Prof. Sang Ho Lee School of Computing, Soongsil Univ.
Data Warehousing COMP3017 Advanced Databases Dr Nicholas Gibbins –
Data Warehousing and OLAP Outline u Models & operations u Implementing a warehouse u Future directions.
1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch , Ch. 22.
CSE6011 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...  Processing: Query processing, indexing,...
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
11/20/ :11 AMData Mining 1 Data Mining – CSE 9033 Chapter – 1; Data Warehousing Dr. Goutam Sarker, B.E., M.E., Ph.D.(Engineering), Fellow: IE(I),
Advanced Database Systems: DBS CB, 2nd Edition
Data Warehousing Overview CS245 Notes 12
Data warehouse.
Data Warehousing CIS 4301 Lecture Notes 4/20/2006.
Data warehouse and OLAP
Chapter 13 The Data Warehouse
Data Warehouse.
Data Warehouse Design Enrico Franconi CS 636.
Data Warehouse and OLAP
Data Warehousing Overview CS245 Notes 11
Data Warehousing and OLAP
Introduction to Data Warehousing
Data Warehouse and OLAP
Presentation transcript:

Data Warehousing and OLAP

Warehousing ► Growing industry: $8 billion in 1998 ► Range from desktop to huge:  Walmart: 900-CPU, 2,700 disk, 23TB Teradata system ► Lots of buzzwords, hype  slice & dice, rollup, MOLAP, pivot,...

Outline ► What is a data warehouse? ► Why a warehouse? ► Models & operations ► Implementing a warehouse ► Future directions

What is a Warehouse? ► Collection of diverse data  subject oriented  aimed at executive, decision maker  often a copy of operational data  with value-added data (e.g., summaries, history)  integrated  time-varying  non-volatile more

What is a Warehouse? ► Collection of tools  gathering data  cleansing, integrating,...  querying, reporting, analysis  data mining  monitoring, administering warehouse

Warehouse Architecture Client Warehouse Source Query & Analysis Integration Metadata

Why a Warehouse? ► Two Approaches:  Query-Driven (Lazy)  Warehouse (Eager) Source ?

Query-Driven Approach Client Wrapper Mediator Source

Advantages of Warehousing ► High query performance ► Queries not visible outside warehouse ► Local processing at sources unaffected ► Can operate when sources unavailable ► Can query data not stored in a DBMS ► Extra information at warehouse  Modify, summarize (store aggregates)  Add historical information

Advantages of Query-Driven ► No need to copy data  less storage  no need to purchase data ► More up-to-date data ► Query needs can be unknown ► Only query interface needed at sources ► May be less draining on sources

OLTP vs. OLAP ► OLTP: On Line Transaction Processing  Describes processing at operational sites ► OLAP: On Line Analytical Processing  Describes processing at warehouse

OLTP vs. OLAP ► Mostly updates ► Many small transactions ► Mb-Tb of data ► Raw data ► Clerical users ► Up-to-date data ► Consistency, recoverability critical ► Mostly reads ► Queries long, complex ► Gb-Tb of data ► Summarized, consolidated data ► Decision-makers, analysts as users OLTP OLAP

Data Marts ► Smaller warehouses ► Spans part of organization  e.g., marketing (customers, products, sales) ► Do not require enterprise-wide consensus  but long term integration problems?

Warehouse Models & Operators ► Data Models  relations  stars & snowflakes  cubes ► Operators  slice & dice  roll-up, drill down  pivoting  other

Star

Star Schema

Terms ► Fact table ► Dimension tables ► Measures

Dimension Hierarchies store sType cityregion  snowflake schema  constellations

Cube Fact table view: Multi-dimensional cube: dimensions = 2

3-D Cube day 2 day 1 dimensions = 3 Multi-dimensional cube:Fact table view:

ROLAP vs. MOLAP ► ROLAP: Relational On-Line Analytical Processing ► MOLAP: Multi-Dimensional On-Line Analytical Processing

Aggregates Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 81

Aggregates Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

Another Example Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId drill-down rollup

Aggregates ► Operators: sum, count, max, min, median, ave ► “Having” clause ► Using dimension hierarchy  average by region (within store)  maximum by month (within date)

Cube Aggregation day 2 day drill-down rollup Example: computing sums

Cube Operators day 2 day sale(c1,*,*) sale(*,*,*) sale(c2,p2,*)

Extended Cube day 2 day 1 * sale(*,p2,*)

Aggregation Using Hierarchies day 2 day 1 customer region country (customer c1 in Region A; customers c2, c3 in Region B)

Pivoting day 2 day 1 Multi-dimensional cube: Fact table view:

Implementing a Warehouse ► Monitoring: Sending data from sources ► Integrating: Loading, cleansing,... ► Processing: Query processing, indexing,... ► Managing: Metadata, Design,...

Monitoring ► Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … ► Incremental vs. Refresh new

Monitoring Techniques ► Periodic snapshots ► Database triggers ► Log shipping ► Data shipping (replication service) ► Transaction shipping ► Polling (queries to source) ► Screen scraping ► Application level monitoring  Advantages & Disadvantages!!

Monitoring Issues ► Frequency  periodic: daily, weekly, …  triggered: on “big” change, lots of changes,... ► Data transformation  convert data to uniform format  remove & add fields (e.g., add date to get history) ► Standards (e.g., ODBC) ► Gateways

Integration ► Data Cleaning ► Data Loading ► Derived Data Client Warehouse Source Query & Analysis Integration Metadata

Data Cleaning ► Migration (e.g., yen  dollars) ► Scrubbing: use domain-specific knowledge (e.g., social security numbers) ► Fusion (e.g., mail list, customer merging) ► Auditing: discover rules & relationships (like data mining) billing DB service DB customer1(Joe) customer2(Joe) merged_customer(Joe)

Loading Data ► Incremental vs. refresh ► Off-line vs. on-line ► Frequency of loading  At night, 1x a week/month, continuously ► Parallel/Partitioned load

Derived Data ► Derived Warehouse Data  indexes  aggregates  materialized views (next slide) ► When to update derived data? ► Incremental vs. refresh

Materialized Views ► Define new warehouse relations using SQL expressions does not exist at any source

Processing ► ROLAP servers vs. MOLAP servers ► Index Structures ► What to Materialize? ► Algorithms Client Warehouse Source Query & Analysis Integration Metadata

ROLAP Server ► Relational OLAP Server relational DBMS ROLAP server tools utilities Special indices, tuning; Schema is “denormalized”

MOLAP Server ► Multi-Dimensional OLAP Server multi- dimensional server M.D. tools utilities could also sit on relational DBMS Product City Date milk soda eggs soap A B Sales

Index Structures ► Traditional Access Methods  B-trees, hash tables, R-trees, grids, … ► Popular in Warehouses  inverted lists  bit map indexes  join indexes  text indexes

Inverted Lists... age index inverted lists data records

Using Inverted Lists ► Query:  Get people with age = 20 and name = “fred” ► List for age = 20: r4, r18, r34, r35 ► List for name = “fred”: r18, r52 ► Answer is intersection: r18

Bit Maps... age index bit maps data records

Using Bit Maps ► Query:  Get people with age = 20 and name = “fred” ► List for age = 20: ► List for name = “fred”: ► Answer is intersection: l Good if domain cardinality small l Bit vectors can be compressed

Join “Combine” SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT

Join Indexes join index

What to Materialize? ► Store in warehouse results useful for common queries ► Example: day 2 day total sales materialize

Materialization Factors ► Type/frequency of queries ► Query response time ► Storage cost ► Update cost

Cube Aggregates Lattice city, product, date city, productcity, dateproduct, date cityproductdate all day 2 day use greedy algorithm to decide what to materialize

Dimension Hierarchies all state city

Dimension Hierarchies city, product city, product, date city, date product, date city product date all state, product, date state, date state, product state not all arcs shown...

Interesting Hierarchy all years quarters months days weeks conceptual dimension table

Design ► What data is needed? ► Where does it come from? ► How to clean data? ► How to represent in warehouse (schema)? ► What to summarize? ► What to materialize? ► What to index?

Tools ► Development  design & edit: schemas, views, scripts, rules, queries, reports ► Planning & Analysis  what-if scenarios (schema changes, refresh rates), capacity planning ► Warehouse Management  performance monitoring, usage patterns, exception reporting ► System & Network Management  measure traffic (sources, warehouse, clients) ► Workflow Management  “reliable scripts” for cleaning & analyzing data

Current State of Industry ► Extraction and integration done off-line  Usually in large, time-consuming, batches ► Everything copied at warehouse  Not selective about what is stored  Query benefit vs storage & update cost ► Query optimization aimed at OLTP  High throughput instead of fast response  Process whole query before displaying anything