Advanced Database Systems: DBS CB, 2nd Edition

Slides:



Advertisements
Similar presentations
An overview of Data Warehousing and OLAP Technology Presented By Manish Desai.
Advertisements

OLAP Tuning. Outline OLAP 101 – Data warehouse architecture – ROLAP, MOLAP and HOLAP Data Cube – Star Schema and operations – The CUBE operator – Tuning.
Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
Data Warehousing CPS216 Notes 13 Shivnath Babu. 2 Warehousing l Growing industry: $8 billion way back in 1998 l Range from desktop to huge: u Walmart:
Introduction to Data Warehousing CPS Notes 6.
CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #15.
Data Warehousing Overview
Lecture 1: Data Warehousing Based on the slides by Jeffrey D. Ullman and Hector Garcia-Molina at Stanford University 1.
Data Warehousing Xintao Wu. Evolution of Database Technology (See Fig. 1.1) 1960s: Data collection, database creation, IMS and network DBMS 1970s: Relational.
Data Warehousing and OLAP
1 Lecture 10: More OLAP - Dimensional modeling
Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS Notes11.
Introduction to Data Warehousing Enrico Franconi CS 636.
CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.
Chapter 13 The Data Warehouse
Ch3 Data Warehouse part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
1 Data Warehousing and OLAP. 2 Data Warehousing & OLAP Defined in many different ways, but not rigorously.  A decision support database that is maintained.
Ch3 Data Warehouse Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
Dr. Bernard Chen Ph.D. University of Central Arkansas
8/20/ Data Warehousing and OLAP. 2 Data Warehousing & OLAP Defined in many different ways, but not rigorously. Defined in many different ways, but.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
©Silberschatz, Korth and Sudarshan18.1Database System Concepts - 5 th Edition, Aug 26, 2005 Buzzword List OLTP – OnLine Transaction Processing (normalized,
1 Cube Computation and Indexes for Data Warehouses CPS Notes 7.
Data Warehousing Xintao Wu. Can You Easily Answer These Questions? What are Personnel Services costs across all departments for all funding sources? What.
OLAP & DSS SUPPORT IN DATA WAREHOUSE By - Pooja Sinha Kaushalya Bakde.
Data Warehousing.
October 28, Data Warehouse Architecture Data Sources Operational DBs other sources Analysis Query Reports Data mining Front-End Tools OLAP Engine.
Data Warehousing and OLAP. Warehousing ► Growing industry: $8 billion in 1998 ► Range from desktop to huge:  Walmart: 900-CPU, 2,700 disk, 23TB Teradata.
Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS Notes11.
Data Mining Data Warehouses.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.
An Overview of Data Warehousing and OLAP Technology
Data Warehousing and OLAP Outline u Models & operations u Implementing a warehouse u Future directions.
1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch , Ch. 22.
CSE6011 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...  Processing: Query processing, indexing,...
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
11/20/ :11 AMData Mining 1 Data Mining – CSE 9033 Chapter – 1; Data Warehousing Dr. Goutam Sarker, B.E., M.E., Ph.D.(Engineering), Fellow: IE(I),
Data Mining: Data Warehousing
Introduction to Data Warehousing
On-Line Application Processing
Information Management course
Data Warehousing Overview CS245 Notes 12
Data warehouse.
Data Warehousing CIS 4301 Lecture Notes 4/20/2006.
Data warehouse and OLAP
A multi-dimensional data model
Chapter 13 The Data Warehouse
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 4 —
Information Management course
OLAP Concepts and Techniques
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Data Warehouse.
Data Warehousing and OLAP Technology for Data Mining
Data Warehouse Design Enrico Franconi CS 636.
Chapter 2: Data Warehousing and OLAP Technology for Data Mining
Data Warehouse and OLAP
Overview of Data Warehousing and OLAP
Data Warehousing Overview CS245 Notes 11
Data Warehousing and OLAP
Introduction to Data Warehousing
Data Warehousing and Decision Support Chapter 25
Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009
Data Mining: Concepts and Techniques
Data Warehousing Concepts
Data Warehouse and OLAP
Presented by: Tek Narayan Adhikari
Presentation transcript:

Advanced Database Systems: DBS CB, 2nd Edition Data Warehouse, OLAP, Data Mining Ch. 21.1-2, Ch. 22

Outline What is a data warehouse? Why a data warehouse? Data Warehouse Models & operations Query & Analysis Tools Implementing a data warehouse Future directions

What is a Data Warehouse 3 3 3

What is a Data Warehouse Collection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history) integrated time-varying non-volatile Collection of tools gathering data cleansing, integrating, ... querying, reporting, analysis data mining monitoring, administering warehouse more

What is a Data Warehouse Data Warehouse Architecture Client Warehouse Source Query & Analysis Integration Metadata

What is a Data Warehouse Motivation Examples: Forecasting Comparing performance of units Monitoring, detecting fraud Visualization

Why a Data Warehouse 7 7 7

Why a Data Warehouse Two Approaches to Extract Information in the Enterprise: Query-Driven (Lazy) Warehouse (Eager)

Why a Data Warehouse Query Driven Approach (Query Federation): Client Wrapper Mediator Source

Why a Data Warehouse Warehouse Advantages Queries not visible outside warehouse Local processing at sources are unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse Modify, summarize (store aggregates) Add historical information

Why a Data Warehouse Query-driven (Federation) Advantages No need to copy data less storage no need to purchase data More up-to-date data – real-time Query needs can be unknown Only query interface needed at sources May be less draining on sources

What is a Data Warehouse OLTP vs. OLAP OLTP: On Line Transaction Processing Describes processing at operational sites OLAP: On Line Analytical Processing Describes processing at warehouse

What is a Data Warehouse OLTP vs. OLAP OLTP OLAP Mostly updates Many small transactions Mb-Gb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical Mostly reads Queries long, complex GB-TB of data Summarized, consolidated data Decision-makers, analysts as users

What is a Data Warehouse Data Marts Smaller warehouses Spans part of organization e.g., marketing (customers, products, sales) Do not require enterprise-wide consensus but long term integration problems?

Data Warehouse Models & Operations 15 15 15

Data Warehouse Models & Operations Data Models relations Star & Snowflakes & Fact Constallation/Galaxy cubes Operators slice & dice roll-up, drill down pivoting other

Data Warehouse Models & Operations Star Schema

Data Warehouse Models & Operations Star Schema Terms: Fact table: PK is the aggregate of the dimensions’ PK - 2nd NF Dimension tables: “by” condition - one view for how the data will be analyzed – 3rd NF Measures: aggregates – sales/day, sales/quarter, etc. Summary Tables: pre-summarized data to meet 90% of user queries. Use triggers/stored procedures to maintain summary tables current on insert/delete sale orderId date custId prodId storeId qty amt customer custId name address city product prodId name price store storeId city

Data Warehouse Models & Operations Star Schema - Dimension Hierarchies store sType city region

Data Warehouse Models & Operations Snowflakes schema, a dimension table have the hierarchy broken into set of tables – more complicated and slower queries time_key day day_of_the_week month quarter year time location_key street city_key location Sales Fact Table item_key branch_key units_sold dollars_sold avg_sales Measures item_name brand type supplier_key item branch_name branch_type branch supplier_type supplier city state_or_province country

Data Warehouse Models & Operations Fact Constellations schema, multiple fact tables that share the set of dimensional tables; viewed as a collection of stars time_key day day_of_the_week month quarter year time location_key street city province_or_state country location Sales Fact Table item_key branch_key units_sold dollars_sold avg_sales item_name brand type supplier_type item branch_name branch_type branch Shipping Fact Table shipper_key from_location to_location dollars_cost units_shipped shipper_name shipper_type shipper Measures

Data Warehouse Models & Operations product prodId name price customer custId address city store storeId sale orderId date qty amt Fact table view: Multi-dimensional cube: Star Schema dimensions = 2

Data Warehouse Models & Operations 3D Cube Fact table view: Multi-dimensional cube: day 2 day 1 dimensions = 3

Data Warehouse Models & Operations ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing MOLAP: Multi-Dimensional On-Line Analytical Processing Aggregates Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1; 81

Data Warehouse Models & Operations Aggregates (Contd.) Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date;

Data Warehouse Models & Operations Aggregates (Contd.) Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId; drill-down rollup

Data Warehouse Models & Operations Aggregates (Contd.) Operators: sum, count, max, min, median, ave “Having” clause Using dimension hierarchy average by region (within store) maximum by month (within date)

Data Warehouse Models & Operations Cube Aggregations day 2 day 1 129 . . . Example: computing sums drill-down rollup

Data Warehouse Models & Operations Cube Operators day 2 day 1 129 . . . sale(c1,*,*) sale(*,*,*) sale(c2,p2,*)

Data Warehouse Models & Operations Extended Cube day 2 day 1 * sale(*,p2,*)

Data Warehouse Models & Operations Aggregation Using Hierarchies day 2 day 1 customer region country (customer c1 in Region A; customers c2, c3 in Region B)

Data Warehouse Models & Operations Pivoting day 2 day 1 Multi-dimensional cube: Fact table view:

Query & Analysis Tools 33 33 33

Query & Analysis Tools Query Building Report Writers (comparisons, growth, graphs,…) Spreadsheet Systems Web Interfaces Data Mining

Query & Analysis Tools Other Operations Time functions e.g., time average Computed Attributes e.g., commission = sales * rate Text Queries e.g., find documents with words X AND B e.g., rank documents by frequency of words X, Y, Z

Query & Analysis Tools Data Mining (i) Decision Tree: Decision Trees Clustering Association Rules (i) Decision Tree: Conducted survey to see what customers were interested in new model car Want to select customers for advertising campaign training set

Query & Analysis Tools One possibility: age<30 Y N city=sf car=van likely unlikely likely unlikely

Query & Analysis Tools Another Possibility: car=taurus Y N city=sf age<45 Y Y N N likely unlikely likely unlikely

Query & Analysis Tools Decision Tree Issues: Decision tree cannot be “too deep” ☻ would not have statistically significant amounts of data for lower decisions Need to select tree that most reliably predicts outcomes

Query & Analysis Tools (ii) Clustering: age income education

Query & Analysis Tools Another Example: Text Each document is a vector e.g., <100110...> contains words 1,4,5,... Clusters contain “similar” documents Useful for understanding, searching documents international news sports business

Query & Analysis Tools Clustering Issues: Given desired number of clusters? Finding “best” clusters Are clusters semantically meaningful? e.g., “yuppies’’ cluster? Using clusters for disk storage

Query & Analysis Tools (iii) Association Rules Mining: products bought customer id sales records: market-basket data Trend: Products p5, p8 often bought together Trend: Customer 12 likes product p9

Query & Analysis Tools Association Rule Rule: {p1, p3, p8} Support: number of baskets where these products appear High-support set: support  threshold s Problem: find all high support sets

Query & Analysis Tools Finding High-Support Pairs Baskets(basket, item) SELECT I.item, J.item, COUNT(I.basket) FROM Baskets I, Baskets J WHERE I.basket = J.basket AND I.item < J.item GROUP BY I.item, J.item HAVING COUNT(I.basket) >= s; WHY?

Query & Analysis Tools Example: check if count  s

Query & Analysis Tools Association Rule Issues: Performance for size 2 rules Performance for size k rules even bigger! big

Implementing a Data Warehouse 48 48 48

Implementing a Data Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing, indexing, ... Managing: Metadata, Design, ...

Implementing a Data Warehouse Monitoring: Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … Incremental vs. Refresh new

Implementing a Data Warehouse Monitoring Techniques: Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring è Advantages & Disadvantages!!

Implementing a Data Warehouse Monitoring Issues: Frequency periodic: daily, weekly, … triggered: on “big” change, lots of changes, ... Data transformation convert data to uniform format remove & add fields (e.g., add date to get history) Standards (e.g., ODBC) Gateways

Implementing a Data Warehouse Integration Data Cleaning Data Loading Derived Data Client Warehouse Source Query & Analysis Integration Metadata

Implementing a Data Warehouse Data Cleaning: Migration (e.g., yen ð dollars) Scrubbing: use domain-specific knowledge (e.g., social security numbers) Fusion (e.g., mail list, customer merging) Auditing: discover rules & relationships (like data mining) billing DB service DB customer1(Joe) customer2(Joe) merged_customer(Joe)

Implementing a Data Warehouse Loading Data: Incremental vs. refresh Off-line vs. on-line Frequency of loading At night, 1x a week/month, continuously Parallel/Partitioned load

Implementing a Data Warehouse Derived Data: Derived Warehouse Data indexes aggregates materialized views (next slide) When to update derived data? Incremental vs. refresh

Implementing a Data Warehouse Materialized Views: Define new warehouse relations using SQL expressions does not exist at any source

Implementing a Data Warehouse Processing ROLAP servers vs. MOLAP servers Index Structures What to Materialize? Algorithms Client Warehouse Source Query & Analysis Integration Metadata

Implementing a Data Warehouse ROLAP Server: Relational OLAP Server tools utilities ROLAP server Special indices, tuning; Schema is “denormalized” relational DBMS

Implementing a Data Warehouse MOLAP Server: Multi-Dimensional OLAP Server Product City Date 1 2 3 4 milk soda eggs soap A B Sales M.D. tools utilities multi-dimensional server could also sit on relational DBMS

Implementing a Data Warehouse Index Structures Traditional Access Methods B-trees, hash tables, R-trees, grids, … Popular in Warehouses inverted lists bit map indexes join indexes text indexes

Implementing a Data Warehouse Inverted Lists . . . age index inverted lists data records

Implementing a Data Warehouse Using Inverted Lists: Query: Get people with age = 20 and name = “fred” List for age = 20: r4, r18, r34, r35 List for name = “fred”: r18, r52 Answer is intersection of “list for name” & “list for age”: r18

Implementing a Data Warehouse Bit Map Index: . . . age index bit maps data records

Implementing a Data Warehouse Using Bit Map Index: Query: Get people with age = 20 and name = “fred” List for age = 20: 1101100000 List for name = “fred”: 0100000001 Answer is intersection: 010000000000 Good if domain cardinality small Bit vectors can be compressed

Implementing a Data Warehouse Join “Combine” SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT;

Implementing a Data Warehouse Join Index join index

Implementing a Data Warehouse What to Materialize Store in warehouse results useful for common queries Example: day 2 day 1 129 . . . total sales materialize

Implementing a Data Warehouse Materialization Factors: Type/frequency of queries Query response time Storage cost Update cost

Implementing a Data Warehouse Cube Aggregates Lattice: 129 all city product date city, product city, date product, date use greedy algorithm to decide what to materialize day 2 day 1 city, product, date

Implementing a Data Warehouse Dimension Hierarchies: all state city

Implementing a Data Warehouse Dimension Hierarchies: all product city date city, product city, date product, date state city, product, date state, date state, product state, product, date not all arcs shown...

Implementing a Data Warehouse Interesting Hierarchy: all years weeks quarters conceptual dimension table months days

Implementing a Data Warehouse Algorithms: Query Optimization Parallel Processing Data Mining

Implementing a Data Warehouse Managing: Metadata Warehouse Design Tools Client Warehouse Source Query & Analysis Integration Metadata

Implementing a Data Warehouse Metadata: Administrative definition of sources, tools, ... schemas, dimension hierarchies, … rules for extraction, cleaning, … refresh, purging policies user profiles, access control, ...

Implementing a Data Warehouse Metadata (Contd.): Business business terms & definition data ownership, charging Operational data lineage data currency (e.g., active, archived, purged) use stats, error reports, audit trails

Implementing a Data Warehouse Design: What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index?

Implementing a Data Warehouse Tools: Development design & edit: schemas, views, scripts, rules, queries, reports Planning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management performance monitoring, usage patterns, exception reporting System & Network Management measure traffic (sources, warehouse, clients) Workflow Management “reliable scripts” for cleaning & analyzing data

Implementing a Data Warehouse Current State of Industry: Extraction and integration done off-line Usually in large, time-consuming, batches Everything copied at warehouse Not selective about what is stored Query benefit vs. storage & update cost Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything

Future Directions 81 81 81

Future Directions Better performance Larger warehouses Easier to use What are companies & research labs working on?

Future Directions Research: Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling (Back Flush)

Future Directions Research: Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data

END