1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch. 21.1-2, Ch. 22.

Slides:



Advertisements
Similar presentations
OLAP Tuning. Outline OLAP 101 – Data warehouse architecture – ROLAP, MOLAP and HOLAP Data Cube – Star Schema and operations – The CUBE operator – Tuning.
Advertisements

Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
Data Warehousing.
Data Warehousing Willem Visser RW334. Somebody is watching! Everybody seems to be recording your every move Loyalty cards Cookies – Facebook, Twitter,…
Data Warehouse Design Enrico Franconi CS 636. CS 3362 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...
Data Warehousing CPS216 Notes 13 Shivnath Babu. 2 Warehousing l Growing industry: $8 billion way back in 1998 l Range from desktop to huge: u Walmart:
Introduction to Data Warehousing CPS Notes 6.
CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #15.
Data Warehousing Overview
Lecture 1: Data Warehousing Based on the slides by Jeffrey D. Ullman and Hector Garcia-Molina at Stanford University 1.
Data Warehousing Xintao Wu. Evolution of Database Technology (See Fig. 1.1) 1960s: Data collection, database creation, IMS and network DBMS 1970s: Relational.
Data Warehousing and OLAP
Dr. M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2010 COMP207: Data Mining Data Warehousing COMP207: Data Mining.
Advanced Querying OLAP Part 2. Context OLAP systems for supporting decision making. Components: –Dimensions with hierarchies, –Measures, –Aggregation.
COMP 578 Data Warehousing And OLAP Technology Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University.
1 Lecture 10: More OLAP - Dimensional modeling
Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS Notes11.
Introduction to Data Warehousing Enrico Franconi CS 636.
CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.
Chapter 13 The Data Warehouse
Ch3 Data Warehouse part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
1 Data Warehousing and OLAP. 2 Data Warehousing & OLAP Defined in many different ways, but not rigorously.  A decision support database that is maintained.
Ch3 Data Warehouse Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
1 Data Warehouses C hapter 2. 2 Chapter 2 Outline Chapter 2 Outline – Introduction –Data Warehouses –Data Warehouse in Organisation – OLTP vs. OLAP –Why.
M ODULE 5 Metadata, Tools, and Data Warehousing Section 4 Data Warehouse Administration 1 ITEC 450.
Dr. Bernard Chen Ph.D. University of Central Arkansas
8/20/ Data Warehousing and OLAP. 2 Data Warehousing & OLAP Defined in many different ways, but not rigorously. Defined in many different ways, but.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
An overview of Data Warehousing and OLAP Technology
Data Warehousing.
©Silberschatz, Korth and Sudarshan18.1Database System Concepts - 5 th Edition, Aug 26, 2005 Buzzword List OLTP – OnLine Transaction Processing (normalized,
Chapter 6: Foundations of Business Intelligence - Databases and Information Management Dr. Andrew P. Ciganek, Ph.D.
1 Cube Computation and Indexes for Data Warehouses CPS Notes 7.
Data Warehousing Xintao Wu. Can You Easily Answer These Questions? What are Personnel Services costs across all departments for all funding sources? What.
1 Data Warehouses BUAD/American University Data Warehouses.
OLAP & DSS SUPPORT IN DATA WAREHOUSE By - Pooja Sinha Kaushalya Bakde.
Data Warehousing.
October 28, Data Warehouse Architecture Data Sources Operational DBs other sources Analysis Query Reports Data mining Front-End Tools OLAP Engine.
Data Warehousing and OLAP. Warehousing ► Growing industry: $8 billion in 1998 ► Range from desktop to huge:  Walmart: 900-CPU, 2,700 disk, 23TB Teradata.
Ch3 Data Warehouse Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
Winter 2006Winter 2002 Keller, Ullman, CushingJudy Cushing 19–1 Warehousing The most common form of information integration: copy sources into a single.
Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS Notes11.
Data Mining Data Warehouses.
Managing Data for DSS II. Managing Data for DS Data Warehouse Common characteristics : –Database designed to meet analytical tasks comprising of data.
January 21, 2016Data Mining: Concepts and Techniques 1 Chapter 3: Data Warehousing and OLAP Technology: An Overview What is a data warehouse? A multi-dimensional.
Data Warehousing.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support.
The Need for Data Analysis 2 Managers track daily transactions to evaluate how the business is performing Strategies should be developed to meet organizational.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.
An Overview of Data Warehousing and OLAP Technology
Data Warehousing and OLAP Outline u Models & operations u Implementing a warehouse u Future directions.
CSE6011 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...  Processing: Query processing, indexing,...
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
11/20/ :11 AMData Mining 1 Data Mining – CSE 9033 Chapter – 1; Data Warehousing Dr. Goutam Sarker, B.E., M.E., Ph.D.(Engineering), Fellow: IE(I),
Advanced Database Systems: DBS CB, 2nd Edition
Data Warehousing Overview CS245 Notes 12
Data Warehousing CIS 4301 Lecture Notes 4/20/2006.
Data warehouse and OLAP
Chapter 13 The Data Warehouse
OLAP Concepts and Techniques
Data Warehouse.
Data Warehouse Design Enrico Franconi CS 636.
Data Warehouse and OLAP
Data Warehousing Overview CS245 Notes 11
Data Warehousing and OLAP
Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009
Data Warehouse and OLAP
Presentation transcript:

1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch , Ch. 22

2 Outline What is a data warehouse? Why a data warehouse? Data Warehouse Models & operations Query & Analysis Tools Implementing a data warehouse Future directions

333 What is a Data Warehouse

4 Collection of diverse data  subject oriented  aimed at executive, decision maker  often a copy of operational data  with value-added data (e.g., summaries, history)  integrated  time-varying  non-volatile What is a Data Warehouse more Collection of tools gathering data cleansing, integrating,... querying, reporting, analysis data mining monitoring, administering warehouse

5 Data Warehouse Architecture What is a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata

6 Motivation Examples: Forecasting Comparing performance of units Monitoring, detecting fraud Visualization What is a Data Warehouse

777 Why a Data Warehouse

8 Two Approaches to Extract Information in the Enterprise: Query-Driven (Lazy) Warehouse (Eager) Why a Data Warehouse

9 Query Driven Approach (Federation): Why a Data Warehouse Client Wrapper Mediator Source

10 Warehouse Advantages Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse  Modify, summarize (store aggregates)  Add historical information Why a Data Warehouse

11 Query-driven Advantages No need to copy data  less storage  no need to purchase data More up-to-date data – real-time Query needs can be unknown Only query interface needed at sources May be less draining on sources Why a Data Warehouse

12 OLTP vs. OLAP OLTP: On Line Transaction Processing  Describes processing at operational sites OLAP: On Line Analytical Processing  Describes processing at warehouse What is a Data Warehouse

13 OLTP vs. OLAP What is a Data Warehouse Mostly updates Many small transactions Mb-Tb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical Mostly reads Queries long, complex GB-TB of data Summarized, consolidated data Decision-makers, analysts as users OLTP OLAP

14 Data Marts Smaller warehouses Spans part of organization  e.g., marketing (customers, products, sales) Do not require enterprise-wide consensus  but long term integration problems? What is a Data Warehouse

15 Data Warehouse Models & Operations

16 Data Models  relations  stars & snowflakes  cubes Operators  slice & dice  roll-up, drill down  pivoting  other Data Warehouse Models & Operations

17 Star Schema Data Warehouse Models & Operations

Star Schema 18 Data Warehouse Models & Operations Terms: Fact table: PK is the aggregate of the dimensions’ PK - 2 nd NF Dimension tables: “by” condition - one view for how the data will be analyzed – 3 rd NF Measures: aggregates – sales/day, sales/quarter, etc. Summary Tables: pre- summarized data to meet 90% of user queries. Use triggers/stored procedures to maintain summary tables current on insert/delete product prodId name price customer custId name address city store storeId city sale orderId date custId prodId storeId qty amt

19 Star Schema - Dimension Hierarchies Data Warehouse Models & Operations store sType cityregion

20 S nowflakes schema, a dimension table have the hierarchy broken into set of tables – more complicated and slower queries Data Warehouse Models & Operations time_key day day_of_the_week month quarter year time location_key street city_key location Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key item branch_key branch_name branch_type branch supplier_key supplier_type supplier city_key city state_or_province country city

21 Fact Constellations schema, multiple fact tables that share the set of dimensional tables; viewed as a collection of stars Data Warehouse Models & Operations time_key day day_of_the_week month quarter year time location_key street city_key location Sales Fact Table t item_key branch_key location_key units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key item branch_key branch_name branch_type branch supplier_key supplier_type supplier city_key city state_or_province country city time_key

22 Data Warehouse Models & Operations product prodId name price customer custId name address city store storeId city sale orderId date custId prodId storeId qty amt Fact table view: Multi-dimensional cube: dimensions = 2 Star Schema

23 3D Cube Data Warehouse Models & Operations day 2 day 1 dimensions = 3 Multi-dimensional cube:Fact table view:

24 ROLAP vs. MOLAP  ROLAP: Relational On-Line Analytical Processing  MOLAP: Multi-Dimensional On-Line Analytical Processing Aggregates  Add up amounts for day 1  In SQL: SELECT sum(amt) FROM SALE WHERE date = 1; Data Warehouse Models & Operations 81

25 Aggregates (Contd.)  Add up amounts by day  In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date; Data Warehouse Models & Operations

26 Aggregates (Contd.)  Add up amounts by day, product  In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId; Data Warehouse Models & Operations drill-down rollup

27 Aggregates (Contd.)  Operators: sum, count, max, min, median, ave  “Having” clause  Using dimension hierarchy average by region (within store) maximum by month (within date) Data Warehouse Models & Operations

28 Cube Aggregations Data Warehouse Models & Operations day 2 day Example: computing sums drill-down rollup

29 Cube Operators Data Warehouse Models & Operations day 2 day sale(c1,*,*) sale(*,*,*) sale(c2,p2,*)

30 Extended Cube Data Warehouse Models & Operations day 2 day 1 * sale(*,p2,*)

31 Aggregation Using Hierarchies Data Warehouse Models & Operations day 2 day 1 customer region country (customer c1 in Region A; customers c2, c3 in Region B)

32 Pivoting Data Warehouse Models & Operations day 2 day 1 Multi-dimensional cube: Fact table view:

33 Query & Analysis Tools

34 Query Building Report Writers (comparisons, growth, graphs,…) Spreadsheet Systems Web Interfaces Data Mining Query & Analysis Tools

35 Other Operations Time functions  e.g., time average Computed Attributes  e.g., commission = sales * rate Text Queries  e.g., find documents with words X AND B  e.g., rank documents by frequency of words X, Y, Z Query & Analysis Tools

36 Data Mining  Decision Trees  Clustering  Association Rules (i) Decision Tree:  Conducted survey to see what customers were interested in new model car  Want to select customers for advertising campaign Query & Analysis Tools training set

37 One possibility: Query & Analysis Tools age<30 city=sfcar=van likely unlikely YY Y N N N

38 Another Possibility: Query & Analysis Tools car=taurus city=sfage<45 likely unlikely YY Y N N N

39 Decision Tree Issues: Decision tree cannot be “too deep” would not have statistically significant amounts of data for lower decisions Need to select tree that most reliably predicts outcomes Query & Analysis Tools

40 (ii) Clustering: Query & Analysis Tools age income education

41 Another Example: Text Each document is a vector  e.g., contains words 1,4,5,... Clusters contain “similar” documents Useful for understanding, searching documents Query & Analysis Tools international news sports business

42 Clustering Issues: Given desired number of clusters? Finding “best” clusters Are clusters semantically meaningful?  e.g., “yuppies’’ cluster? Using clusters for disk storage Query & Analysis Tools

43 (iii) Association Rules Mining: Query & Analysis Tools customer id products bought sales records: Trend: Products p5, p8 often bought together Trend: Customer 12 likes product p9 market-basket data

44 Association Rule Rule: {p 1, p 3, p 8 } Support: number of baskets where these products appear High-support set: support  threshold s Problem: find all high support sets Query & Analysis Tools

45 Finding High-Support Pairs Baskets(basket, item) SELECT I.item, J.item, COUNT(I.basket) FROM Baskets I, Baskets J WHERE I.basket = J.basket AND I.item = s; Query & Analysis Tools WHY?

46 Example: Query & Analysis Tools check if count  s

47 Association Rule Issues: Performance for size 2 rules Performance for size k rules Query & Analysis Tools big even bigger!

48 Implementing a Data Warehouse

49 Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing, indexing,... Managing: Metadata, Design,... Implementing a Data Warehouse

50 Monitoring: Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … Incremental vs. Refresh Implementing a Data Warehouse new

51 Monitoring Techniques: Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring Implementing a Data Warehouse  Advantages & Disadvantages!!

52 Monitoring Issues: Frequency  periodic: daily, weekly, …  triggered: on “big” change, lots of changes,... Data transformation  convert data to uniform format  remove & add fields (e.g., add date to get history) Standards (e.g., ODBC) Gateways Implementing a Data Warehouse

53 Integration Data Cleaning Data Loading Derived Data Implementing a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata

54 Data Cleaning: Migration (e.g., yen  dollars) Scrubbing: use domain-specific knowledge (e.g., social security numbers) Fusion (e.g., mail list, customer merging) Auditing: discover rules & relationships (like data mining) Implementing a Data Warehouse billing DB service DB customer1(Joe) customer2(Joe) merged_customer(Joe)

55 Loading Data: Incremental vs. refresh Off-line vs. on-line Frequency of loading  At night, 1x a week/month, continuously Parallel/Partitioned load Implementing a Data Warehouse

56 Derived Data: Derived Warehouse Data  indexes  aggregates  materialized views (next slide) When to update derived data? Incremental vs. refresh Implementing a Data Warehouse

57 Materialized Views: Define new warehouse relations using SQL expressions Implementing a Data Warehouse does not exist at any source

Processing ROLAP servers vs. MOLAP servers Index Structures What to Materialize? Algorithms 58 Implementing a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata

ROLAP Server: Relational OLAP Server 59 Implementing a Data Warehouse relational DBMS ROLAP server tools utilities Special indices, tuning; Schema is “denormalized”

MOLAP Server: Multi-Dimensional OLAP Server 60 Implementing a Data Warehouse multi- dimensional server M.D. tools utilities could also sit on relational DBMS Product City Date milk soda eggs soap A B Sales

Index Structures Traditional Access Methods  B-trees, hash tables, R-trees, grids, … Popular in Warehouses  inverted lists  bit map indexes  join indexes  text indexes 61 Implementing a Data Warehouse

Inverted Lists 62 Implementing a Data Warehouse... age index inverted lists data records

Using Inverted Lists: Query:  Get people with age = 20 and name = “fred” List for age = 20: r4, r18, r34, r35 List for name = “fred”: r18, r52 Answer is intersection of “list for name” & “list for age”: r18 63 Implementing a Data Warehouse

Bit Map Index: 64 Implementing a Data Warehouse... age index bit maps data records

Using Bit Map Index: Query:  Get people with age = 20 and name = “fred” List for age = 20: List for name = “fred”: Answer is intersection: Good if domain cardinality small Bit vectors can be compressed 65 Implementing a Data Warehouse

Join “Combine” SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT; 66 Implementing a Data Warehouse

Join Index 67 Implementing a Data Warehouse join index

What to Materialize Store in warehouse results useful for common queries Example: 68 Implementing a Data Warehouse day 2 day total sales materialize

Materialization Factors: Type/frequency of queries Query response time Storage cost Update cost 69 Implementing a Data Warehouse

Cube Aggregates Lattice: 70 Implementing a Data Warehouse city, product, date city, productcity, dateproduct, date cityproductdate all day 2 day use greedy algorithm to decide what to materialize

Dimension Hierarchies: 71 Implementing a Data Warehouse all state city

Dimension Hierarchies: 72 Implementing a Data Warehouse city, product city, product, date city, date product, date city product date all state, product, date state, date state, product state not all arcs shown...

Interesting Hierarchy: 73 Implementing a Data Warehouse all years quarters months days weeks conceptual dimension table

Algorithms: Query Optimization Parallel Processing Data Mining 74 Implementing a Data Warehouse

Managing: Metadata Warehouse Design Tools 75 Implementing a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata

Metadata: Administrative  definition of sources, tools,...  schemas, dimension hierarchies, …  rules for extraction, cleaning, …  refresh, purging policies  user profiles, access control, Implementing a Data Warehouse

Metadata (Contd.): Business  business terms & definition  data ownership, charging Operational  data lineage  data currency (e.g., active, archived, purged)  use stats, error reports, audit trails 77 Implementing a Data Warehouse

Design: What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index? 78 Implementing a Data Warehouse

Tools: Development  design & edit: schemas, views, scripts, rules, queries, reports Planning & Analysis  what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management  performance monitoring, usage patterns, exception reporting System & Network Management  measure traffic (sources, warehouse, clients) Workflow Management  “reliable scripts” for cleaning & analyzing data 79 Implementing a Data Warehouse

Current State of Industry: Extraction and integration done off-line  Usually in large, time-consuming, batches Everything copied at warehouse  Not selective about what is stored  Query benefit vs. storage & update cost Query optimization aimed at OLTP  High throughput instead of fast response  Process whole query before displaying anything 80 Implementing a Data Warehouse

81 Future Directions

Better performance Larger warehouses Easier to use What are companies & research labs working on? 82 Future Directions

Research: Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling (Back Flush) 83 Future Directions

Research: Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data 84 Future Directions

85 END