Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11.

Slides:



Advertisements
Similar presentations
An overview of Data Warehousing and OLAP Technology Presented By Manish Desai.
Advertisements

OLAP Tuning. Outline OLAP 101 – Data warehouse architecture – ROLAP, MOLAP and HOLAP Data Cube – Star Schema and operations – The CUBE operator – Tuning.
Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
April 30, Data Warehousing and OLAP Technology: An Overview  What is a data warehouse?  Data warehouse architecture  From data warehousing to.
Data Warehouse Design Enrico Franconi CS 636. CS 3362 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...
Data Warehousing CPS216 Notes 13 Shivnath Babu. 2 Warehousing l Growing industry: $8 billion way back in 1998 l Range from desktop to huge: u Walmart:
Introduction to Data Warehousing CPS Notes 6.
CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #15.
Data Warehousing Overview
Lecture 1: Data Warehousing Based on the slides by Jeffrey D. Ullman and Hector Garcia-Molina at Stanford University 1.
Data Warehousing and OLAP
Advanced Querying OLAP Part 2. Context OLAP systems for supporting decision making. Components: –Dimensions with hierarchies, –Measures, –Aggregation.
COMP 578 Data Warehousing And OLAP Technology Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University.
1 Lecture 10: More OLAP - Dimensional modeling
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 29 Overview of Data Warehousing and OLAP.
Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS Notes11.
Introduction to Data Warehousing Enrico Franconi CS 636.
CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.
Chapter 13 The Data Warehouse
Ch3 Data Warehouse part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
CS346: Advanced Databases
Data Warehousing: Defined and Its Applications Pete Johnson April 2002.
Ch3 Data Warehouse Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
M ODULE 5 Metadata, Tools, and Data Warehousing Section 4 Data Warehouse Administration 1 ITEC 450.
Data Conversion to a Data warehouse Presented By Sanjay Gunasekaran.
Dr. Bernard Chen Ph.D. University of Central Arkansas
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
Joachim Hammer 1 Data Warehousing Overview, Terminology, and Research Issues Joachim Hammer.
©Silberschatz, Korth and Sudarshan18.1Database System Concepts - 5 th Edition, Aug 26, 2005 Buzzword List OLTP – OnLine Transaction Processing (normalized,
Database Systems – Data Warehousing
Chapter 6: Foundations of Business Intelligence - Databases and Information Management Dr. Andrew P. Ciganek, Ph.D.
1 Cube Computation and Indexes for Data Warehouses CPS Notes 7.
Data warehousing and online analytical processing- Ref Chap 4) By Asst Prof. Muhammad Amir Alam.
1 Data Warehouses BUAD/American University Data Warehouses.
OLAP & DSS SUPPORT IN DATA WAREHOUSE By - Pooja Sinha Kaushalya Bakde.
Data Warehousing.
October 28, Data Warehouse Architecture Data Sources Operational DBs other sources Analysis Query Reports Data mining Front-End Tools OLAP Engine.
Data Warehousing and OLAP. Warehousing ► Growing industry: $8 billion in 1998 ► Range from desktop to huge:  Walmart: 900-CPU, 2,700 disk, 23TB Teradata.
Decision Support and Date Warehouse Jingyi Lu. Outline Decision Support System OLAP vs. OLTP What is Date Warehouse? Dimensional Modeling Extract, Transform,
Winter 2006Winter 2002 Keller, Ullman, CushingJudy Cushing 19–1 Warehousing The most common form of information integration: copy sources into a single.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
Data Warehousing.
Advanced Database Concepts
Copyright© 2014, Sira Yongchareon Department of Computing, Faculty of Creative Industries and Business Lecturer : Dr. Sira Yongchareon ISCG 6425 Data Warehousing.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support.
The Need for Data Analysis 2 Managers track daily transactions to evaluate how the business is performing Strategies should be developed to meet organizational.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.
An Overview of Data Warehousing and OLAP Technology
Data Warehousing COMP3017 Advanced Databases Dr Nicholas Gibbins –
Data Warehousing and OLAP Outline u Models & operations u Implementing a warehouse u Future directions.
1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch , Ch. 22.
CSE6011 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...  Processing: Query processing, indexing,...
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
11/20/ :11 AMData Mining 1 Data Mining – CSE 9033 Chapter – 1; Data Warehousing Dr. Goutam Sarker, B.E., M.E., Ph.D.(Engineering), Fellow: IE(I),
Advanced Database Systems: DBS CB, 2nd Edition
Data Warehousing Overview CS245 Notes 12
Data warehouse.
Data Warehousing CIS 4301 Lecture Notes 4/20/2006.
Data warehouse and OLAP
Chapter 13 The Data Warehouse
Data Warehouse.
Data Warehouse Design Enrico Franconi CS 636.
Data Warehouse and OLAP
Data Warehousing Overview CS245 Notes 11
Data Warehousing and OLAP
Data Warehousing Concepts
Data Warehouse and OLAP
Presentation transcript:

Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS Notes11

CS 245Notes11 2 Warehousing l Growing industry: $8 billion in 1998 l Range from desktop to huge: u Walmart: 900-CPU, 2,700 disk, 23TB Teradata system l Lots of buzzwords, hype u slice & dice, rollup, MOLAP, pivot,...

CS 245Notes11 3 Outline l What is a data warehouse? l Why a warehouse? l Models & operations l Implementing a warehouse l Future directions

CS 245Notes11 4 What is a Warehouse? l Collection of diverse data u subject oriented u aimed at executive, decision maker u often a copy of operational data u with value-added data (e.g., summaries, history) u integrated u time-varying u non-volatile more

CS 245Notes11 5 What is a Warehouse? l Collection of tools u gathering data u cleansing, integrating,... u querying, reporting, analysis u data mining u monitoring, administering warehouse

CS 245Notes11 6 Warehouse Architecture Client Warehouse Source Query & Analysis Integration Metadata

CS 245Notes11 7 Motivating Examples l Forecasting l Comparing performance of units l Monitoring, detecting fraud l Visualization

CS 245Notes11 8 Why a Warehouse? l Two Approaches: u Query-Driven (Lazy) u Warehouse (Eager) Source ?

CS 245Notes11 9 Query-Driven Approach Client Wrapper Mediator Source

CS 245Notes11 10 Advantages of Warehousing l High query performance l Queries not visible outside warehouse l Local processing at sources unaffected l Can operate when sources unavailable l Can query data not stored in a DBMS l Extra information at warehouse u Modify, summarize (store aggregates) u Add historical information

CS 245Notes11 11 Advantages of Query-Driven l No need to copy data u less storage u no need to purchase data l More up-to-date data l Query needs can be unknown l Only query interface needed at sources l May be less draining on sources

CS 245Notes11 12 OLTP vs. OLAP l OLTP: On Line Transaction Processing u Describes processing at operational sites l OLAP: On Line Analytical Processing u Describes processing at warehouse

CS 245Notes11 13 OLTP vs. OLAP l Mostly updates l Many small transactions l Mb-Tb of data l Raw data l Clerical users l Up-to-date data l Consistency, recoverability critical l Mostly reads l Queries long, complex l Gb-Tb of data l Summarized, consolidated data l Decision-makers, analysts as users OLTP OLAP

CS 245Notes11 14 Data Marts l Smaller warehouses l Spans part of organization u e.g., marketing (customers, products, sales) l Do not require enterprise-wide consensus u but long term integration problems?

CS 245Notes11 15 Warehouse Models & Operators l Data Models u relations u stars & snowflakes u cubes l Operators u slice & dice u roll-up, drill down u pivoting u other

CS 245Notes11 16 Star

CS 245Notes11 17 Star Schema

CS 245Notes11 18 Terms l Fact table l Dimension tables l Measures

CS 245Notes11 19 Dimension Hierarchies store sType cityregion  snowflake schema  constellations

CS 245Notes11 20 Cube Fact table view: Multi-dimensional cube: dimensions = 2

CS 245Notes D Cube day 2 day 1 dimensions = 3 Multi-dimensional cube:Fact table view:

CS 245Notes11 22 ROLAP vs. MOLAP l ROLAP: Relational On-Line Analytical Processing l MOLAP: Multi-Dimensional On-Line Analytical Processing

CS 245Notes11 23 Aggregates Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 81

CS 245Notes11 24 Aggregates Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

CS 245Notes11 25 Another Example Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId drill-down rollup

CS 245Notes11 26 Aggregates l Operators: sum, count, max, min, median, ave l “Having” clause l Using dimension hierarchy u average by region (within store) u maximum by month (within date)

CS 245Notes11 27 Cube Aggregation day 2 day drill-down rollup Example: computing sums

CS 245Notes11 28 Cube Operators day 2 day sale(c1,*,*) sale(*,*,*) sale(c2,p2,*)

CS 245Notes11 29 Extended Cube day 2 day 1 * sale(*,p2,*)

CS 245Notes11 30 Aggregation Using Hierarchies day 2 day 1 customer region country (customer c1 in Region A; customers c2, c3 in Region B)

CS 245Notes11 31 Pivoting day 2 day 1 Multi-dimensional cube: Fact table view:

CS 245Notes11 32 Query & Analysis Tools l Query Building l Report Writers (comparisons, growth, graphs,…) l Spreadsheet Systems l Web Interfaces l Data Mining

CS 245Notes11 33 Other Operations l Time functions u e.g., time average l Computed Attributes u e.g., commission = sales * rate l Text Queries u e.g., find documents with words X AND B u e.g., rank documents by frequency of words X, Y, Z

CS 245Notes11 34 Data Mining l Decision Trees l Clustering l Association Rules

CS 245Notes11 35 Decision Trees Example: Conducted survey to see what customers were interested in new model car Want to select customers for advertising campaign training set

CS 245Notes11 36 One Possibility age<30 city=sfcar=van likely unlikely YY Y N N N

CS 245Notes11 37 Another Possibility car=taurus city=sfage<45 likely unlikely YY Y N N N

CS 245Notes11 38 Issues l Decision tree cannot be “too deep” è would not have statistically significant amounts of data for lower decisions l Need to select tree that most reliably predicts outcomes

CS 245Notes11 39 Clustering age income education

CS 245Notes11 40 Another Example: Text l Each document is a vector u e.g., contains words 1,4,5,... l Clusters contain “similar” documents l Useful for understanding, searching documents international news sports business

CS 245Notes11 41 Issues l Given desired number of clusters? l Finding “best” clusters l Are clusters semantically meaningful? u e.g., “yuppies’’ cluster? l Using clusters for disk storage

CS 245Notes11 42 Association Rule Mining transaction id customer id products bought sales records: Trend: Products p5, p8 often bough together Trend: Customer 12 likes product p9 market-basket data

CS 245Notes11 43 Association Rule l Rule: {p 1, p 3, p 8 } l Support: number of baskets where these products appear l High-support set: support  threshold s l Problem: find all high support sets

CS 245Notes11 44 Finding High-Support Pairs l Baskets(basket, item) l SELECT I.item, J.item, COUNT(I.basket) FROM Baskets I, Baskets J WHERE I.basket = J.basket AND I.item = s; WHY?

CS 245Notes11 45 Example check if count  s

CS 245Notes11 46 Issues l Performance for size 2 rules big even bigger! l Performance for size k rules

CS 245Notes11 47 Implementing a Warehouse l Monitoring: Sending data from sources l Integrating: Loading, cleansing,... l Processing: Query processing, indexing,... l Managing: Metadata, Design,...

CS 245Notes11 48 Monitoring l Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … l Incremental vs. Refresh new

CS 245Notes11 49 Monitoring Techniques l Periodic snapshots l Database triggers l Log shipping l Data shipping (replication service) l Transaction shipping l Polling (queries to source) l Screen scraping l Application level monitoring  Advantages & Disadvantages!!

CS 245Notes11 50 Monitoring Issues l Frequency u periodic: daily, weekly, … u triggered: on “big” change, lots of changes,... l Data transformation u convert data to uniform format u remove & add fields (e.g., add date to get history) l Standards (e.g., ODBC) l Gateways

CS 245Notes11 51 Integration l Data Cleaning l Data Loading l Derived Data Client Warehouse Source Query & Analysis Integration Metadata

CS 245Notes11 52 Data Cleaning Migration (e.g., yen  dollars) l Scrubbing: use domain-specific knowledge (e.g., social security numbers) l Fusion (e.g., mail list, customer merging) l Auditing: discover rules & relationships (like data mining) billing DB service DB customer1(Joe) customer2(Joe) merged_customer(Joe)

CS 245Notes11 53 Loading Data l Incremental vs. refresh l Off-line vs. on-line l Frequency of loading u At night, 1x a week/month, continuously l Parallel/Partitioned load

CS 245Notes11 54 Derived Data l Derived Warehouse Data u indexes u aggregates u materialized views (next slide) l When to update derived data? l Incremental vs. refresh

CS 245Notes11 55 Materialized Views l Define new warehouse relations using SQL expressions does not exist at any source

CS 245Notes11 56 Processing l ROLAP servers vs. MOLAP servers l Index Structures l What to Materialize? l Algorithms Client Warehouse Source Query & Analysis Integration Metadata

CS 245Notes11 57 ROLAP Server l Relational OLAP Server relational DBMS ROLAP server tools utilities Special indices, tuning; Schema is “denormalized”

CS 245Notes11 58 MOLAP Server l Multi-Dimensional OLAP Server multi- dimensional server M.D. tools utilities could also sit on relational DBMS Product City Date milk soda eggs soap A B Sales

CS 245Notes11 59 Index Structures l Traditional Access Methods u B-trees, hash tables, R-trees, grids, … l Popular in Warehouses u inverted lists u bit map indexes u join indexes u text indexes

CS 245Notes11 60 Inverted Lists... age index inverted lists data records

CS 245Notes11 61 Using Inverted Lists l Query: u Get people with age = 20 and name = “fred” l List for age = 20: r4, r18, r34, r35 l List for name = “fred”: r18, r52 l Answer is intersection: r18

CS 245Notes11 62 Bit Maps... age index bit maps data records

CS 245Notes11 63 Using Bit Maps l Query: u Get people with age = 20 and name = “fred” l List for age = 20: l List for name = “fred”: l Answer is intersection: l Good if domain cardinality small l Bit vectors can be compressed

CS 245Notes11 64 Join “Combine” SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT

CS 245Notes11 65 Join Indexes join index

CS 245Notes11 66 What to Materialize? l Store in warehouse results useful for common queries l Example: day 2 day total sales materialize

CS 245Notes11 67 Materialization Factors l Type/frequency of queries l Query response time l Storage cost l Update cost

CS 245Notes11 68 Cube Aggregates Lattice city, product, date city, productcity, dateproduct, date cityproductdate all day 2 day use greedy algorithm to decide what to materialize

CS 245Notes11 69 Dimension Hierarchies all state city

CS 245Notes11 70 Dimension Hierarchies city, product city, product, date city, date product, date city product date all state, product, date state, date state, product state not all arcs shown...

CS 245Notes11 71 Interesting Hierarchy all years quarters months days weeks conceptual dimension table

CS 245Notes11 72 Algorithms l Query Optimization l Parallel Processing l Data Mining

CS 245Notes11 73 Example: Association Rules l How do we perform rule mining efficiently? l Observation: If set X has support t, then each X subset must have at least support t l For 2-sets: u if we need support s for {i, j} u then each i, j must appear in at least s baskets

CS 245Notes11 74 Algorithm for 2-Sets (1) Find OK products u those appearing in s or more baskets (2) Find high-support pairs using only OK products

CS 245Notes11 75 Algorithm for 2-Sets l INSERT INTO okBaskets(basket, item) SELECT basket, item FROM Baskets GROUP BY item HAVING COUNT(basket) >= s; l Perform mining on okBaskets SELECT I.item, J.item, COUNT(I.basket) FROM okBaskets I, okBaskets J WHERE I.basket = J.basket AND I.item = s;

CS 245Notes11 76 Counting Efficiently l One way: sort count & remove threshold = 3

CS 245Notes11 77 Counting Efficiently l Another way: remove scan & count keep counter array in memory threshold = 3

CS 245Notes11 78 Yet Another Way (1) scan & hash & count in-memory hash table threshold = 3 (2) scan & remove (4) remove (3) scan& count in-memory counters false positive

CS 245Notes11 79 Discussion l Hashing scheme: 2 (or 3) scans of data l Sorting scheme: requires a sort! l Hashing works well if few high-support pairs and many low-support ones item-pairs ranked by frequency frequency threshold iceberg queries

CS 245Notes11 80 Managing l Metadata l Warehouse Design l Tools Client Warehouse Source Query & Analysis Integration Metadata

CS 245Notes11 81 Metadata l Administrative u definition of sources, tools,... u schemas, dimension hierarchies, … u rules for extraction, cleaning, … u refresh, purging policies u user profiles, access control,...

CS 245Notes11 82 Metadata l Business u business terms & definition u data ownership, charging l Operational u data lineage u data currency (e.g., active, archived, purged) u use stats, error reports, audit trails

CS 245Notes11 83 Design l What data is needed? l Where does it come from? l How to clean data? l How to represent in warehouse (schema)? l What to summarize? l What to materialize? l What to index?

CS 245Notes11 84 Tools l Development u design & edit: schemas, views, scripts, rules, queries, reports l Planning & Analysis u what-if scenarios (schema changes, refresh rates), capacity planning l Warehouse Management u performance monitoring, usage patterns, exception reporting l System & Network Management u measure traffic (sources, warehouse, clients) l Workflow Management u “reliable scripts” for cleaning & analyzing data

CS 245Notes11 85 Current State of Industry l Extraction and integration done off-line u Usually in large, time-consuming, batches l Everything copied at warehouse u Not selective about what is stored u Query benefit vs storage & update cost l Query optimization aimed at OLTP u High throughput instead of fast response u Process whole query before displaying anything

CS 245Notes11 86 Future Directions l Better performance l Larger warehouses l Easier to use l What are companies & research labs working on?

CS 245Notes11 87 Research (1) l Incremental Maintenance l Data Consistency l Data Expiration l Recovery l Data Quality l Error Handling (Back Flush)

CS 245Notes11 88 Research (2) l Rapid Monitor Construction l Temporal Warehouses l Materialization & Index Selection l Data Fusion l Data Mining l Integration of Text & Relational Data

CS 245Notes11 89 Conclusions l Massive amounts of data and complexity of queries will push limits of current warehouses l Need better systems: u easier to use u provide quality information