On-Line Analytical Processing Warehousing Data Cubes (Data Mining) (slides borrowed from Stanford)

Overview
- Traditional database systems are tuned to many small, simple queries.
- Some new applications use fewer, more time-consuming, complex queries.
- New architectures have been developed to handle complex “analytic” queries efficiently.

The Data Warehouse
- The most common form of data integration.
- Copy sources into a single DB (warehouse) and try to keep it up-to-date.
- Usual method: periodic reconstruction of the warehouse, perhaps overnight.
- Frequently essential for analytic queries.

OLTP
- Most database operations involve On-Line Transaction Processing (OLTP).
- Short, simple, frequent queries and/or modifications, each involving a small number of tuples.
- Examples: answering queries from a Web interface, sales at cash registers, selling airline tickets.

OLAP
- Of increasing importance are On-Line Analytical Processing (OLAP) queries.
- Few, but complex, queries that may run for hours.
- Queries do not depend on having an absolutely up-to-date database.

OLAP Examples
1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer.
2. Analysts at Wal-Mart look for items with increasing sales in some region.

Common Architecture
- Databases at store branches handle OLTP.
- Local store databases are copied to a central warehouse overnight.
- Analysts use the warehouse for OLAP.

Loading the Data Warehouse
Source Systems (OLTP) → Data Staging Area → Data Warehouse
- Data is periodically extracted from the source systems.
- Data is cleansed and transformed in the staging area.
- Users query the data warehouse.

Terminology: ETL
- ETL = Extraction, Transformation, & Load
- Extraction: Get the data out of the source systems
- Transformation: Convert the data into a useful format for analysis
- Load: Get the data into the data warehouse (…and build indexes, materialized views, etc.)
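
The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the `sales` table layout, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from store systems (illustrative data).
source_rows = [
    {"store": "CA-01", "item": "widget", "amount": "19.99"},
    {"store": "ca-01", "item": "Widget ", "amount": "5.00"},
    {"store": "OR-07", "item": "gadget", "amount": None},  # missing value
]

def extract(rows):
    """Extraction: get the raw records out of the source system."""
    return list(rows)

def transform(rows):
    """Transformation: normalize formats; skip rows that cannot be used."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue  # a real pipeline might log or repair this instead
        cleaned.append(
            (row["store"].upper(), row["item"].strip().lower(), float(row["amount"]))
        )
    return cleaned

def load(conn, rows):
    """Load: insert into the warehouse table and build an index."""
    conn.execute("CREATE TABLE sales(store TEXT, item TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX sales_by_store ON sales(store)")
    conn.commit()

warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse
load(warehouse, transform(extract(source_rows)))
loaded = warehouse.execute(
    "SELECT store, item, amount FROM sales ORDER BY amount"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the terminology above; in practice the transformation step is usually where most of the effort goes.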

Data Integration is Hard
- Data warehouses combine data from multiple sources
- Data must be translated into a consistent format
- Data integration represents ~80% of effort for a typical data warehouse project!
- Some reasons why it’s hard:
  - Metadata is often poor or non-existent
  - Data quality is often bad
    - Missing or default values
    - Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California)
  - Inconsistent semantics
    - What is an airline passenger?

Federated Databases
- An alternative to data warehouses
- Data warehouse
  - Create a copy of all the data
  - Execute queries against the copy
- Federated database
  - Pull data from source systems as needed to answer queries
- “lazy” vs. “eager” data integration
[Figure: Data Warehouse (source systems, extraction into the warehouse, which receives the query and returns the answer) vs. Federated Database (a mediator receives the query, sends rewritten queries to the source systems, and returns the answer)]
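
The eager/lazy distinction can be illustrated with a toy sketch. The class names `Warehouse` and `Mediator` and the sample rows are assumptions made for this example; a real mediator would rewrite each query for every source system rather than scan Python lists.

```python
# Two hypothetical source systems, modeled as lists of (state, quantity) rows.
source_a = [("CA", 45), ("OR", 33)]
source_b = [("WA", 30)]
sources = [source_a, source_b]

class Warehouse:
    """Eager integration: copy everything once, then answer from the copy."""
    def __init__(self, sources):
        self.copy = [row for src in sources for row in src]

    def total(self):
        return sum(qty for _, qty in self.copy)

class Mediator:
    """Lazy integration: keep no copy; consult the sources for each query."""
    def __init__(self, sources):
        self.sources = sources

    def total(self):
        return sum(qty for src in self.sources for _, qty in src)

warehouse = Warehouse(sources)
mediator = Mediator(sources)

source_b.append(("WA", 42))  # a new sale arrives after the warehouse was built

stale_total = warehouse.total()  # answers from the old copy
fresh_total = mediator.total()   # sees the new row
print(stale_total, fresh_total)
```

The warehouse answer is stale until the next reload, while the mediator pays the cost of visiting the sources on every query.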

Star Schemas
A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales.
   - Often “insert-only.”
2. Dimension tables: smaller, generally static information about the entities involved in the facts.

Example: Star Schema
- Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged.
- The fact table is a relation:
    Sales(bar, beer, drinker, day, time, price)

Example, Continued
- The dimension tables include information about the bar, beer, and drinker “dimensions”:
    Bars(bar, addr, license)
    Beers(beer, manf)
    Drinkers(drinker, addr, phone)
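
As a rough sketch, the Sales star schema could be declared and queried as follows. The column types, the sample rows, and the manufacturer names are assumptions made for illustration; only the relation and attribute names come from the slides. An in-memory SQLite database is used so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Bars(bar TEXT PRIMARY KEY, addr TEXT, license TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, manf TEXT);
CREATE TABLE Drinkers(drinker TEXT PRIMARY KEY, addr TEXT, phone TEXT);
-- Fact table: each dimension attribute is the key of a dimension table.
CREATE TABLE Sales(
    bar TEXT REFERENCES Bars(bar),
    beer TEXT REFERENCES Beers(beer),
    drinker TEXT REFERENCES Drinkers(drinker),
    day TEXT,
    time TEXT,
    price REAL
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO Beers VALUES (?, ?)",
                 [("Bud", "Manf1"), ("Stout", "Manf2")])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)", [
    ("Joe's Bar", "Bud", "Ann", "Mon", "18:00", 3.0),
    ("Joe's Bar", "Bud", "Bob", "Tue", "19:30", 3.5),
    ("Sue's Bar", "Stout", "Ann", "Mon", "21:00", 4.0),
])

# A typical star-schema query: join the fact table to a dimension table
# and aggregate the dependent attribute by a dimension-table column.
revenue_by_manf = conn.execute("""
    SELECT Beers.manf, SUM(Sales.price)
    FROM Sales JOIN Beers ON Sales.beer = Beers.beer
    GROUP BY Beers.manf
    ORDER BY Beers.manf
""").fetchall()
print(revenue_by_manf)
```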

Visualization – Star Schema
[Figure: the Sales fact table at the center, holding dimension attributes and dependent attributes, surrounded by the dimension tables Bars, Beers, Drinkers, etc.]

Dimensions and Dependent Attributes
Two classes of fact-table attributes:
1. Dimension attributes: the key of a dimension table.
2. Dependent attributes: a value determined by the dimension attributes of the tuple.

Example: Dependent Attribute
- price is the dependent attribute of our example Sales relation.
- It is determined by the combination of dimension attributes: bar, beer, drinker, and the time (combination of day and time-of-day attributes).

Comparing Facts and Dimensions
Facts                 Dimensions
Narrow                Wide
Big (many rows)       Small (few rows)
Numeric               Descriptive
Growing over time     Static
Facts contain numbers, dimensions contain labels.

Cross Tabulation of sales by item-name and color
[Table: sales cross-tabulated by item-name and color]
- Such a table is an example of a cross-tabulation (cross-tab), also referred to as a pivot-table.
- A cross-tab is a table where
  - values for one of the dimension attributes form the row headers, and values for another dimension attribute form the column headers
  - each cell holds the value (or an aggregate of the values) of the measure for the tuples whose dimension values match that cell's row and column headers.

Marginals
- The data cube also includes aggregation (typically SUM) along the margins of the cube.
- The marginals include aggregations over one dimension, two dimensions,…

Visualization - Data Cube w/ Aggregation
[Figure: a cube with axes bar, beer, and drinker whose cells hold price; one face holds the SUM over all drinkers]

Example: Marginals
- Our 4-dimensional Sales cube includes the sum of price over each bar, each beer, each drinker, and each time unit (perhaps days).
- It would also have the sum of price over all bar-beer pairs, all bar-drinker-day triples,…

Structure of the Cube
- Think of each dimension as having an additional value *.
- A point with one or more *’s in its coordinates aggregates over the dimensions with the *’s.
- Example: Sales(“Joe’s Bar”, “Bud”, *, *) holds the sum over all drinkers and all time of the Bud consumed at Joe’s.
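
The `*` convention can be made concrete with a small sketch that materializes every cube point for a handful of facts. The fact rows and prices below are hypothetical, and a plain dictionary stands in for the cube.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (bar, beer, drinker, day, price).
sales = [
    ("Joe's Bar", "Bud", "Ann", "Mon", 3.0),
    ("Joe's Bar", "Bud", "Bob", "Tue", 3.5),
    ("Joe's Bar", "Stout", "Ann", "Mon", 4.0),
    ("Sue's Bar", "Bud", "Ann", "Mon", 2.5),
]

cube = defaultdict(float)
for *dims, price in sales:
    # Each fact contributes to every point obtained by replacing
    # any subset of its coordinates with "*".
    for mask in product([False, True], repeat=len(dims)):
        point = tuple("*" if starred else value
                      for value, starred in zip(dims, mask))
        cube[point] += price

joes_bud = cube[("Joe's Bar", "Bud", "*", "*")]  # all drinkers, all days
grand_total = cube[("*", "*", "*", "*")]         # everything
print(joes_bud, grand_total)
```

With four dimensions, each fact touches 2^4 = 16 cube points, which is why the number of stored aggregates grows quickly with the number of dimensions.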

Relational Representation
- Crosstabs can be represented as relations.
- The value all is used to represent aggregates.
- The SQL:1999 standard actually uses null values in place of all.

Three-Dimensional Data Cube
- A data cube is a multidimensional generalization of a crosstab.
- A three-dimensional object cannot be viewed in its entirety, but crosstabs can be used as views on a data cube.

Data Cube
- Axes of the cube represent attributes of the data records
  - e.g. color, month, state
  - Called dimensions
- Cells hold aggregated measurements
  - e.g. total $ sales, number of autos sold
  - Called facts
- Real data cubes have many more than 3 dimensions
[Figure: Auto Sales cube with axes month (Jul, Aug, Sep), state (CA, OR, WA), and color (Red, Blue, Gray)]

Slicing and Dicing
[Figure: the month (Jul, Aug, Sep) by state (CA, OR, WA) by color (Red, Blue, Gray) cube, a slice of it for color = Blue, and smaller pieces cut from that slice down to the Jul, Aug, Sep totals]

Querying the Data Cube
- Cross-tabulation
  - “Cross-tab” for short
  - Report data grouped by 2 dimensions
  - Aggregate across other dimensions
  - Include subtotals
- Operations on a cross-tab
  - Roll up (further aggregation)
  - Drill down (less aggregation)
[Table: Number of Autos Sold, with rows Jul, Aug, Sep, Total and columns CA, OR, WA, Total]

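Roll Up and Drill Down
[Figure: the month-by-state cross-tab of Number of Autos Sold (rows Jul, Aug, Sep, Total; columns CA, OR, WA, Total). Rolling up by Month leaves a single row of per-state totals; drilling down by Color splits that row into rows for Red, Blue, Gray, and Total.]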

Full Data Cube with Subtotals
- Pre-computation of aggregates → fast answers to OLAP queries
- Ideally, pre-compute all 2^n types of subtotals
- Otherwise, perform aggregation as needed
- Coarser-grained totals can be computed from finer-grained totals
  - But not the other way around
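
Both points, the 2^n kinds of subtotal and deriving coarser totals from finer ones, can be sketched briefly. The helper `roll_up` is a hypothetical name introduced here; the quantities are the state-by-month figures used in the deck's cross-tab example.

```python
from collections import defaultdict
from itertools import combinations

dimensions = ("state", "month", "color")

# All 2^n subsets of the dimensions: the kinds of subtotal a full cube holds.
groupings = [combo
             for r in range(len(dimensions) + 1)
             for combo in combinations(dimensions, r)]

# Finer-grained totals keyed by (state, month).
by_state_month = {
    ("CA", "Jul"): 45, ("CA", "Aug"): 50, ("CA", "Sep"): 38,
    ("OR", "Jul"): 33, ("OR", "Aug"): 36, ("OR", "Sep"): 31,
    ("WA", "Jul"): 30, ("WA", "Aug"): 42, ("WA", "Sep"): 40,
}

def roll_up(totals, keep):
    """Derive coarser totals by summing over the key positions not in `keep`."""
    coarser = defaultdict(int)
    for key, value in totals.items():
        coarser[tuple(key[i] for i in keep)] += value
    return dict(coarser)

by_state = roll_up(by_state_month, keep=[0])  # drop month
overall = roll_up(by_state, keep=[])          # drop state as well
print(len(groupings), by_state, overall)
```

This works because SUM can be combined from partial sums. The reverse direction is impossible: the per-state totals alone do not say how each total was split across months.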

Data Cube Lattice
Total
State | Month | Color
State, Month | State, Color | Month, Color
State, Month, Color
Rolling up moves toward Total; drilling down moves toward (State, Month, Color).

MOLAP vs. ROLAP
- MOLAP = Multidimensional OLAP
- Store data cube as multidimensional array
- (Usually) pre-compute all aggregates
- Advantages:
  - Very efficient data access → fast answers
- Disadvantages:
  - Doesn’t scale to large numbers of dimensions
  - Requires special-purpose data store

Sparsity
- Imagine a data warehouse for Safeway.
- Suppose dimensions are: Customer, Product, Store, Day
- If there are 100,000 customers, 10,000 products, 1,000 stores, and 1,000 days…
- …data cube has 1,000,000,000,000,000 cells!
- Fortunately, most cells are empty.
  - A given store doesn’t sell every product on every day.
  - A given customer has never visited most of the stores.
  - A given customer has never purchased most products.
- Multi-dimensional arrays are not an efficient way to store sparse data.
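
The arithmetic, and the usual remedy of storing only non-empty cells, can be checked with a short sketch. The `record_sale` helper and the sample coordinates are hypothetical; a dictionary keyed by coordinates plays the role that a relational fact table plays in ROLAP.

```python
customers, products, stores, days = 100_000, 10_000, 1_000, 1_000

# A dense array needs one cell per combination of coordinates.
dense_cells = customers * products * stores * days
print(f"{dense_cells:,}")

# Sparse alternative: store only the cells that actually hold a sale.
sparse_cube = {}

def record_sale(customer, product, store, day, amount):
    key = (customer, product, store, day)
    sparse_cube[key] = sparse_cube.get(key, 0.0) + amount

record_sale(17, 42, 3, 250, 9.5)
record_sale(17, 42, 3, 250, 0.5)  # same cell, so the amounts accumulate
record_sale(99, 7, 3, 251, 4.0)

print(len(sparse_cube))  # storage grows with the sales, not with the cube
```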

MOLAP vs. ROLAP
- ROLAP = Relational OLAP
- Store data cube in relational database
- Express queries in SQL
- Advantages:
  - Scales well to high dimensionality
  - Scales well to large data sets
  - Sparsity is not a problem
  - Uses well-known, mature technology
- Disadvantages:
  - Query performance is slower than MOLAP
  - Need to construct explicit indexes

Creating a Cross-tab with SQL
    SELECT state, month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY state, month
- Grouping attributes: state, month
- Measurements: SUM(quantity)
- Filters: color = 'Red'
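
The query can be tried against an in-memory SQLite database, with the WHERE clause placed before GROUP BY as SQL requires. This is a sketch under assumptions: the `sales` column types and the single Blue row are made up for the example, while the Red quantities are the ones shown in the deck's cross-tab.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales(state TEXT, month TEXT, color TEXT, quantity INTEGER)"
)
red_sales = {
    ("CA", "Jul"): 45, ("CA", "Aug"): 50, ("CA", "Sep"): 38,
    ("OR", "Jul"): 33, ("OR", "Aug"): 36, ("OR", "Sep"): 31,
    ("WA", "Jul"): 30, ("WA", "Aug"): 42, ("WA", "Sep"): 40,
}
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, 'Red', ?)",
    [(state, month, qty) for (state, month), qty in red_sales.items()],
)
# Hypothetical non-red row, which the filter should exclude.
conn.execute("INSERT INTO sales VALUES ('CA', 'Jul', 'Blue', 7)")

result = conn.execute("""
    SELECT state, month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY state, month
""").fetchall()

# Pivot the (state, month, sum) rows into a month-by-state grid.
states, months = ["CA", "OR", "WA"], ["Jul", "Aug", "Sep"]
cell = {(state, month): total for state, month, total in result}
print("     " + " ".join(f"{state:>4}" for state in states))
for month in months:
    print(f"{month:<5}" + " ".join(f"{cell[(state, month)]:>4}" for state in states))
```

The SQL returns one row per (state, month) group; turning those rows into a grid is left to the reporting layer, here a few lines of Python.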

What about the totals?
- SQL aggregation query with GROUP BY does not produce subtotals, totals
- Our cross-tab report is incomplete.
Query result:
    State  Month  SUM
    CA     Jul    45
    CA     Aug    50
    CA     Sep    38
    OR     Jul    33
    OR     Aug    36
    OR     Sep    31
    WA     Jul    30
    WA     Aug    42
    WA     Sep    40
Number of Autos Sold:
           CA   OR   WA   Total
    Jul    45   33   30   ?
    Aug    50   36   42   ?
    Sep    38   31   40   ?
    Total  ?    ?    ?    ?

One solution: a big UNION ALL
    -- Original query
    SELECT state, month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY state, month
    UNION ALL
    -- State subtotals
    SELECT state, 'ALL', SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY state
    UNION ALL
    -- Month subtotals
    SELECT 'ALL', month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY month
    UNION ALL
    -- Overall total
    SELECT 'ALL', 'ALL', SUM(quantity)
    FROM sales
    WHERE color = 'Red'
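
A quick way to check the four-part union is to run it on the deck's Red figures in an in-memory SQLite database. The table layout is an assumption; note that the string literal 'ALL' is written with single quotes, since double quotes denote identifiers in standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales(state TEXT, month TEXT, color TEXT, quantity INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, 'Red', ?)", [
    ("CA", "Jul", 45), ("CA", "Aug", 50), ("CA", "Sep", 38),
    ("OR", "Jul", 33), ("OR", "Aug", 36), ("OR", "Sep", 31),
    ("WA", "Jul", 30), ("WA", "Aug", 42), ("WA", "Sep", 40),
])

union_sql = """
SELECT state, month, SUM(quantity) FROM sales
WHERE color = 'Red' GROUP BY state, month
UNION ALL
SELECT state, 'ALL', SUM(quantity) FROM sales
WHERE color = 'Red' GROUP BY state
UNION ALL
SELECT 'ALL', month, SUM(quantity) FROM sales
WHERE color = 'Red' GROUP BY month
UNION ALL
SELECT 'ALL', 'ALL', SUM(quantity) FROM sales
WHERE color = 'Red'
"""

# Collect every detail row, subtotal, and the grand total in one lookup table.
totals = {(state, month): qty for state, month, qty in conn.execute(union_sql)}
print(totals[("CA", "ALL")], totals[("ALL", "Jul")], totals[("ALL", "ALL")])
```

The result has 9 detail rows, 3 state subtotals, 3 month subtotals, and 1 overall total, which fills in every cell of the cross-tab.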

A better solution
- “UNION ALL” solution gets cumbersome with more than 2 grouping attributes
  - n grouping attributes → 2^n parts in the union
- OLAP extensions added to SQL 99 are more convenient
  - CUBE, ROLLUP
    SELECT state, month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY CUBE(state, month)

Results of the CUBE query
    State  Month  SUM(quantity)
    CA     Jul    45
    CA     Aug    50
    CA     Sep    38
    CA     NULL   133
    OR     Jul    33
    OR     Aug    36
    OR     Sep    31
    OR     NULL   100
    WA     Jul    30
    WA     Aug    42
    WA     Sep    40
    WA     NULL   112
    NULL   Jul    108
    NULL   Aug    128
    NULL   Sep    109
    NULL   NULL   345
- Notice the use of NULL for totals
- Subtotals at all levels

ROLLUP vs. CUBE
- CUBE computes entire lattice
- ROLLUP computes one path through lattice
  - Order of GROUP BY list matters
  - Groups by all prefixes of the GROUP BY list
- GROUP BY ROLLUP(A,B,C) produces:
  - (A,B,C) groups
  - (A,B) subtotals
  - (A) subtotals
  - Total
- GROUP BY CUBE(A,B,C) produces:
  - (A,B,C) groups
  - Subtotals for the following: (A,B), (A,C), (B,C), (A), (B), (C)
  - Total
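
The difference between the two operators comes down to which grouping sets they enumerate: prefixes for ROLLUP, all subsets for CUBE. The helper names `rollup_sets` and `cube_sets` below are hypothetical and only illustrate that enumeration; they do not run any SQL.

```python
from itertools import combinations

def rollup_sets(columns):
    """Grouping sets for ROLLUP: every prefix of the list, longest first."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

def cube_sets(columns):
    """Grouping sets for CUBE: every subset of the list, largest first."""
    return [combo
            for n in range(len(columns), -1, -1)
            for combo in combinations(columns, n)]

rollup = rollup_sets(["A", "B", "C"])
cube = cube_sets(["A", "B", "C"])
print(rollup)  # n + 1 grouping sets; the empty tuple is the overall total
print(cube)    # 2^n grouping sets
```

Reordering the list changes the ROLLUP result (ROLLUP(C,B,A) yields (C,B) and (C) subtotals instead), while CUBE yields the same grouping sets in any order.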

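ROLLUP example
    SELECT color, month, state, SUM(quantity)
    FROM sales
    GROUP BY ROLLUP(color, month, state)
- This groups by (color, month, state), (color, month), (color), and the overall total: one path through the lattice.
[Figure: the data cube lattice (Total; State, Month, Color; State-Month, State-Color, Month-Color; State-Month-Color)]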
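
On an engine without ROLLUP support (SQLite is used here for that reason), the same rows can be produced by issuing one GROUP BY per prefix of the column list. This is an emulation sketch, not how a database implements ROLLUP; the sample rows are hypothetical, and NULL marks the rolled-up columns as in the CUBE results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales(color TEXT, month TEXT, state TEXT, quantity INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Red", "Jul", "CA", 45),
    ("Red", "Jul", "OR", 33),
    ("Red", "Aug", "CA", 50),
    ("Blue", "Jul", "CA", 20),
])

columns = ["color", "month", "state"]  # fixed names, not user input
rows = []
for n in range(len(columns), -1, -1):
    kept = columns[:n]  # the current prefix of the ROLLUP list
    # Columns that have been rolled up are reported as NULL.
    select_list = ", ".join(kept + ["NULL"] * (len(columns) - n))
    sql = f"SELECT {select_list}, SUM(quantity) FROM sales"
    if kept:
        sql += " GROUP BY " + ", ".join(kept)
    rows.extend(conn.execute(sql).fetchall())

for row in rows:
    print(row)
```

The output contains subtotals per (color, month) and per color, plus the overall total, but no per-month or per-state subtotals: those lie off the chosen path through the lattice.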