On-Line Analytical Processing, Warehousing, Data Cubes (Data Mining) (slides borrowed from Stanford)
Overview
- Traditional database systems are tuned to many small, simple queries.
- Some new applications use fewer, more time-consuming, complex queries.
- New architectures have been developed to handle complex “analytic” queries efficiently.
The Data Warehouse
- The most common form of data integration.
- Copy sources into a single DB (warehouse) and try to keep it up-to-date.
- Usual method: periodic reconstruction of the warehouse, perhaps overnight.
- Frequently essential for analytic queries.
OLTP
- Most database operations involve On-Line Transaction Processing (OLTP).
- Short, simple, frequent queries and/or modifications, each involving a small number of tuples.
- Examples: answering queries from a Web interface, sales at cash registers, selling airline tickets.
OLAP
- Of increasing importance are On-Line Analytical Processing (OLAP) queries.
- Few but complex queries, which may run for hours.
- Queries do not depend on having an absolutely up-to-date database.
OLAP Examples
1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer.
2. Analysts at Wal-Mart look for items with increasing sales in some region.
Common Architecture
- Databases at store branches handle OLTP.
- Local store databases are copied to a central warehouse overnight.
- Analysts use the warehouse for OLAP.
Loading the Data Warehouse
[Diagram: Source Systems (OLTP) → Data Staging Area → Data Warehouse]
- Data is periodically extracted from the source systems.
- Data is cleansed and transformed in the staging area.
- Users query the data warehouse.
Terminology: ETL
- ETL = Extraction, Transformation, & Load
- Extraction: Get the data out of the source systems.
- Transformation: Convert the data into a useful format for analysis.
- Load: Get the data into the data warehouse (…and build indexes, materialized views, etc.).
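The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the `sales` table layout, and the function names are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from two store systems (assumed data).
SOURCE_ROWS = [
    {"store": "CA-01", "item": " bud ", "price": "2.50"},
    {"store": "OR-07", "item": "Miller", "price": "3.00"},
]

def extract():
    """Extraction: get the data out of the source systems."""
    return list(SOURCE_ROWS)

def transform(rows):
    """Transformation: trim and normalize text, convert prices to numbers."""
    return [(r["store"], r["item"].strip().title(), float(r["price"])) for r in rows]

def load(conn, rows):
    """Load: insert into the warehouse table, then build an index."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (store TEXT, item TEXT, price REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS sales_item ON sales (item)")
    conn.commit()

# An in-memory SQLite database stands in for the warehouse (assumption).
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))
loaded = warehouse.execute(
    "SELECT store, item, price FROM sales ORDER BY store"
).fetchall()
print(loaded)
```

In a real warehouse each step is usually a separate scheduled job, and the transformation step carries most of the work.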
Data Integration is Hard
- Data warehouses combine data from multiple sources.
- Data must be translated into a consistent format.
- Data integration represents ~80% of effort for a typical data warehouse project!
- Some reasons why it’s hard:
  - Metadata is often poor or non-existent.
  - Data quality is often bad:
    - Missing or default values
    - Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California)
  - Inconsistent semantics:
    - What is an airline passenger?
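Two of the data-quality problems listed above, multiple spellings and missing or default values, can be handled by a small cleaning function. The sketch below is hypothetical: the variant table, the chosen canonical spelling, and the set of placeholder markers are assumptions made for illustration, and real projects need far larger, curated mappings.

```python
# Assumed mapping from known variants (lower-cased) to one canonical spelling.
CANONICAL = {
    "cal": "UC Berkeley",
    "uc berkeley": "UC Berkeley",
    "university of california": "UC Berkeley",
}

# Assumed placeholder strings that source systems use for "value unknown".
MISSING_MARKERS = {"", "n/a", "unknown"}

def clean(value):
    """Return the canonical spelling, or None when the value is missing."""
    if value is None:
        return None
    key = value.strip().lower()
    if key in MISSING_MARKERS:
        return None
    # Unknown names pass through unchanged (apart from trimming).
    return CANONICAL.get(key, value.strip())

cleaned = [clean(v) for v in
           ["Cal", " UC Berkeley", "University of California", "N/A", "MIT"]]
print(cleaned)
```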
Federated Databases
- An alternative to data warehouses.
- Data warehouse:
  - Create a copy of all the data.
  - Execute queries against the copy.
- Federated database:
  - Pull data from source systems as needed to answer queries.
- “Lazy” (federated) vs. “eager” (warehouse) data integration.
[Figure: Data Warehouse: data is extracted from the source systems into the warehouse, which answers queries. Federated Database: a mediator rewrites each query into queries on the source systems and returns the answer.]
Star Schemas
A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales.
   - Often “insert-only.”
2. Dimension tables: smaller, generally static information about the entities involved in the facts.
Example: Star Schema
- Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged.
- The fact table is a relation:
    Sales(bar, beer, drinker, day, time, price)
Example, Continued
- The dimension tables include information about the bar, beer, and drinker “dimensions”:
    Bars(bar, addr, license)
    Beers(beer, manf)
    Drinkers(drinker, addr, phone)
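The beer-sales star schema can be tried out in an in-memory SQLite database from Python. The table and column names come from the slides; the column types, the sample rows, and the manufacturer names are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: small, mostly static.
    CREATE TABLE Bars     (bar TEXT PRIMARY KEY, addr TEXT, license TEXT);
    CREATE TABLE Beers    (beer TEXT PRIMARY KEY, manf TEXT);
    CREATE TABLE Drinkers (drinker TEXT PRIMARY KEY, addr TEXT, phone TEXT);
    -- Fact table: one row per sale, referencing each dimension by its key.
    CREATE TABLE Sales (
        bar     TEXT REFERENCES Bars(bar),
        beer    TEXT REFERENCES Beers(beer),
        drinker TEXT REFERENCES Drinkers(drinker),
        day     TEXT,
        time    TEXT,
        price   REAL
    );
""")

# Made-up sample data (assumption).
conn.executemany("INSERT INTO Beers VALUES (?, ?)",
                 [("Bud", "Anheuser-Busch"), ("Miller", "Miller Brewing")])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)", [
    ("Joe's Bar", "Bud",    "Ann", "2024-01-05", "20:15", 2.5),
    ("Joe's Bar", "Bud",    "Bob", "2024-01-05", "21:00", 2.5),
    ("Sue's Bar", "Miller", "Ann", "2024-01-06", "19:30", 3.0),
])

# A typical star-schema query: join the fact table to a dimension table
# and aggregate the dependent attribute (price).
by_manf = conn.execute("""
    SELECT manf, SUM(price)
    FROM Sales JOIN Beers USING (beer)
    GROUP BY manf
    ORDER BY manf
""").fetchall()
print(by_manf)
```

Only the Beers dimension is populated here because the query touches nothing else; Bars and Drinkers would be joined the same way.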
Visualization – Star Schema
[Figure: the Sales fact table in the center, holding dimension attributes and dependent attributes, surrounded by the dimension tables Bars, Beers, Drinkers, etc.]
Dimensions and Dependent Attributes
Two classes of fact-table attributes:
1. Dimension attributes: the key of a dimension table.
2. Dependent attributes: a value determined by the dimension attributes of the tuple.
Example: Dependent Attribute
- price is the dependent attribute of our example Sales relation.
- It is determined by the combination of dimension attributes: bar, beer, drinker, and the time (combination of day and time-of-day attributes).
Comparing Facts and Dimensions
Facts                 Dimensions
Narrow                Wide
Big (many rows)       Small (few rows)
Numeric               Descriptive
Growing over time     Static
Facts contain numbers; dimensions contain labels.
Cross Tabulation of sales by item-name and color
- The table above is an example of a cross-tabulation (cross-tab), also referred to as a pivot table.
- A cross-tab is a table where:
  - values for one of the dimension attributes form the row headers;
  - values for another dimension attribute form the column headers;
  - the value in each cell is an aggregate of the measure (here, sales) over the tuples whose dimension values specify the cell.
Marginals
- The data cube also includes aggregation (typically SUM) along the margins of the cube.
- The marginals include aggregations over one dimension, two dimensions,…
Visualization - Data Cube w/ Aggregation
[Figure: a cube with axes bar, beer, and drinker whose cells hold price; one face holds the SUM over all drinkers]
Example: Marginals
- Our 4-dimensional Sales cube includes the sum of price over each bar, each beer, each drinker, and each time unit (perhaps days).
- It would also have the sum of price over all bar-beer pairs, all bar-drinker-day triples,…
Structure of the Cube
- Think of each dimension as having an additional value *.
- A point with one or more *’s in its coordinates aggregates over the dimensions with the *’s.
- Example: Sales(“Joe’s Bar”, “Bud”, *, *) holds the sum over all drinkers and all time of the Bud consumed at Joe’s.
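The * convention maps directly onto a dictionary keyed by coordinate tuples. The following is a minimal sketch, assuming a handful of made-up sales facts and SUM as the aggregate; `build_cube` is a hypothetical helper, not a standard function.

```python
from itertools import product

# Hypothetical facts: (bar, beer, drinker, day, price).
facts = [
    ("Joe's Bar", "Bud",    "Ann", "Mon", 2.5),
    ("Joe's Bar", "Bud",    "Bob", "Tue", 2.5),
    ("Joe's Bar", "Miller", "Ann", "Mon", 3.0),
    ("Sue's Bar", "Bud",    "Ann", "Tue", 2.0),
]

def build_cube(rows, n_dims):
    """Sum the measure into every point that a fact contributes to.

    Each fact contributes to 2**n_dims points: for every dimension we
    either keep its value or replace it by "*" (aggregate over it).
    """
    cube = {}
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        for mask in product((True, False), repeat=n_dims):
            point = tuple(d if keep else "*" for d, keep in zip(dims, mask))
            cube[point] = cube.get(point, 0) + measure
    return cube

sales = build_cube(facts, 4)
# Sum over all drinkers and all time of the Bud sold at Joe's:
print(sales[("Joe's Bar", "Bud", "*", "*")])
# Grand total over every dimension:
print(sales[("*", "*", "*", "*")])
```

Materializing every point this way is only practical for small inputs, since the number of points grows quickly with the number of dimensions.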
Relational Representation
- Crosstabs can be represented as relations.
- The value all is used to represent aggregates.
- The SQL:1999 standard actually uses null values in place of all.
Three-Dimensional Data Cube
- A data cube is a multidimensional generalization of a crosstab.
- We cannot view a three-dimensional object in its entirety, but crosstabs can be used as views on a data cube.
Data Cube
- Axes of the cube represent attributes of the data records.
  - e.g. color, month, state
  - Called dimensions
- Cells hold aggregated measurements.
  - e.g. total $ sales, number of autos sold
  - Called facts
- Real data cubes have >> 3 dimensions.
[Figure: “Auto Sales” cube with axes month (Jul, Aug, Sep), state (CA, OR, WA), and color (Red, Blue, Gray)]
Slicing and Dicing
[Figure: the Auto Sales cube (months Jul, Aug, Sep; states CA, OR, WA; colors Red, Blue, Gray), a slice of it for Blue, and a row of Blue totals by month]
Querying the Data Cube
- Cross-tabulation
  - “Cross-tab” for short
  - Report data grouped by 2 dimensions
  - Aggregate across other dimensions
  - Include subtotals
- Operations on a cross-tab
  - Roll up (further aggregation)
  - Drill down (less aggregation)
[Figure: cross-tab “Number of Autos Sold” with columns CA, OR, WA, Total and rows Jul, Aug, Sep, Total]
Roll Up and Drill Down
[Figure: the “Number of Autos Sold” cross-tab (columns CA, OR, WA, Total; rows Jul, Aug, Sep, Total). “Roll up by Month” reduces it to a single row with columns CA, OR, WA, Total; “Drill down by Color” then gives rows Red, Blue, Gray, Total under the same columns.]
Full Data Cube with Subtotals
- Pre-computation of aggregates → fast answers to OLAP queries.
- Ideally, pre-compute all 2^n types of subtotals (one for each subset of the n dimensions).
- Otherwise, perform aggregation as needed.
- Coarser-grained totals can be computed from finer-grained totals, but not the other way around.
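The last point can be made concrete: a coarser total is obtained by summing a finer one over the dimensions being dropped. The sketch below uses the (state, month) auto-sales figures that appear in the cross-tab example of these slides; `roll_up` is a hypothetical helper and SUM is assumed as the aggregate.

```python
def roll_up(finer, keep_positions):
    """Derive a coarser total by summing over the dropped key positions."""
    coarser = {}
    for key, total in finer.items():
        new_key = tuple(key[i] for i in keep_positions)
        coarser[new_key] = coarser.get(new_key, 0) + total
    return coarser

# Finest level available here: totals per (state, month).
by_state_month = {
    ("CA", "Jul"): 45, ("CA", "Aug"): 50, ("CA", "Sep"): 38,
    ("OR", "Jul"): 33, ("OR", "Aug"): 36, ("OR", "Sep"): 31,
    ("WA", "Jul"): 30, ("WA", "Aug"): 42, ("WA", "Sep"): 40,
}

by_state = roll_up(by_state_month, [0])   # keep state, drop month
by_month = roll_up(by_state_month, [1])   # keep month, drop state
grand_total = roll_up(by_state, [])       # drop everything

n_dims = 3  # state, month, color in the auto-sales cube
print(2 ** n_dims, "kinds of subtotal for", n_dims, "dimensions")
print(by_state, by_month, grand_total)
```

Going the other way is impossible: knowing that CA sold 133 autos says nothing about how those split across months.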
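Data Cube Lattice
[Lattice, from coarsest (top) to finest (bottom):]
    Total
    State | Month | Color
    State, Month | State, Color | Month, Color
    State, Month, Color
Moving down the lattice is a drill down; moving up is a roll up.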
MOLAP vs. ROLAP
- MOLAP = Multidimensional OLAP
  - Store data cube as multidimensional array
  - (Usually) pre-compute all aggregates
- Advantages:
  - Very efficient data access → fast answers
- Disadvantages:
  - Doesn’t scale to large numbers of dimensions
  - Requires special-purpose data store
Sparsity
- Imagine a data warehouse for Safeway.
- Suppose dimensions are: Customer, Product, Store, Day.
- If there are 100,000 customers, 10,000 products, 1,000 stores, and 1,000 days…
- …the data cube has 1,000,000,000,000,000 cells!
- Fortunately, most cells are empty.
  - A given store doesn’t sell every product on every day.
  - A given customer has never visited most of the stores.
  - A given customer has never purchased most products.
- Multi-dimensional arrays are not an efficient way to store sparse data.
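The cell count is simply the product of the dimension sizes, and a hash map keyed by coordinates is one common way to store only the non-empty cells. The sketch below checks the arithmetic and shows that idea; the integer IDs, the amounts in cents, and `record_sale` are made up for illustration.

```python
from math import prod

# Dimension sizes from the example above.
sizes = {"customer": 100_000, "product": 10_000, "store": 1_000, "day": 1_000}
dense_cells = prod(sizes.values())
print(f"{dense_cells:,} cells in a dense array")

# Sparse alternative: keep only the cells that actually hold a sale.
sparse_cube = {}

def record_sale(customer, product, store, day, cents):
    """Add an amount (in cents) to one cell, creating the cell on first use."""
    key = (customer, product, store, day)
    sparse_cube[key] = sparse_cube.get(key, 0) + cents

# Hypothetical sales, identified by made-up integer IDs.
record_sale(17, 42, 3, 100, 999)
record_sale(17, 42, 3, 100, 501)   # same cell, amounts accumulate
record_sale(23, 7, 3, 101, 250)
print(len(sparse_cube), "cells stored")
```

Storage then grows with the number of recorded facts rather than with the product of the dimension sizes.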
MOLAP vs. ROLAP
- ROLAP = Relational OLAP
  - Store data cube in relational database
  - Express queries in SQL
- Advantages:
  - Scales well to high dimensionality
  - Scales well to large data sets
  - Sparsity is not a problem
  - Uses well-known, mature technology
- Disadvantages:
  - Query performance is slower than MOLAP
  - Need to construct explicit indexes
Creating a Cross-tab with SQL
    SELECT state, month,       -- grouping attributes
           SUM(quantity)       -- measurement
    FROM sales
    WHERE color = 'Red'        -- filter (WHERE must precede GROUP BY)
    GROUP BY state, month      -- grouping attributes
What about the totals?
- An SQL aggregation query with GROUP BY does not produce subtotals or totals.
- Our cross-tab report is incomplete.

Number of Autos Sold (cross-tab):
            CA    OR    WA    Total
    Jul     45    33    30    ?
    Aug     50    36    42    ?
    Sep     38    31    40    ?
    Total   ?     ?     ?     ?

Query result:
    State   Month   SUM
    CA      Jul     45
    CA      Aug     50
    CA      Sep     38
    OR      Jul     33
    OR      Aug     36
    OR      Sep     31
    WA      Jul     30
    WA      Aug     42
    WA      Sep     40
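One way to see what is missing is to pivot the nine result rows into the cross-tab in application code and compute the margins by hand. The Python sketch below does that with the figures from the slide; it illustrates the gap and is not how a warehouse would normally produce the report.

```python
# Rows of the GROUP BY result: (state, month, SUM(quantity)).
rows = [
    ("CA", "Jul", 45), ("CA", "Aug", 50), ("CA", "Sep", 38),
    ("OR", "Jul", 33), ("OR", "Aug", 36), ("OR", "Sep", 31),
    ("WA", "Jul", 30), ("WA", "Aug", 42), ("WA", "Sep", 40),
]
states = ["CA", "OR", "WA"]
months = ["Jul", "Aug", "Sep"]

# Pivot: one cell per (state, month) pair.
cell = {(state, month): qty for state, month, qty in rows}

# The margins that the plain GROUP BY query does not return.
month_totals = {m: sum(cell[(s, m)] for s in states) for m in months}
state_totals = {s: sum(cell[(s, m)] for m in months) for s in states}
grand_total = sum(state_totals.values())

# Print the completed cross-tab.
print(f"{'':6}" + "".join(f"{s:>6}" for s in states) + f"{'Total':>6}")
for m in months:
    print(f"{m:6}" + "".join(f"{cell[(s, m)]:>6}" for s in states)
          + f"{month_totals[m]:>6}")
print(f"{'Total':6}" + "".join(f"{state_totals[s]:>6}" for s in states)
      + f"{grand_total:>6}")
```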
One solution: a big UNION ALL
    -- Original query
    SELECT state, month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY state, month
    UNION ALL
    -- State subtotals
    SELECT state, 'ALL', SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY state
    UNION ALL
    -- Month subtotals
    SELECT 'ALL', month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY month
    UNION ALL
    -- Overall total
    SELECT 'ALL', 'ALL', SUM(quantity)
    FROM sales
    WHERE color = 'Red'
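The four-part union can be run against an in-memory SQLite database to confirm that it fills in the margins. In this sketch the `sales` table layout, the per-row quantities (chosen to match the slide's figures), and the extra Blue row are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (state TEXT, month TEXT, color TEXT, quantity INTEGER)"
)
red_sales = [
    ("CA", "Jul", 45), ("CA", "Aug", 50), ("CA", "Sep", 38),
    ("OR", "Jul", 33), ("OR", "Aug", 36), ("OR", "Sep", 31),
    ("WA", "Jul", 30), ("WA", "Aug", 42), ("WA", "Sep", 40),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, 'Red', ?)", red_sales)
# A non-red row, to show that the WHERE filter applies in every branch.
conn.execute("INSERT INTO sales VALUES ('CA', 'Jul', 'Blue', 7)")

QUERY = """
    SELECT state, month, SUM(quantity) FROM sales
    WHERE color = 'Red' GROUP BY state, month
    UNION ALL
    SELECT state, 'ALL', SUM(quantity) FROM sales
    WHERE color = 'Red' GROUP BY state
    UNION ALL
    SELECT 'ALL', month, SUM(quantity) FROM sales
    WHERE color = 'Red' GROUP BY month
    UNION ALL
    SELECT 'ALL', 'ALL', SUM(quantity) FROM sales
    WHERE color = 'Red'
"""

report = {(state, month): total for state, month, total in conn.execute(QUERY)}
print(len(report), "rows; overall total =", report[("ALL", "ALL")])
```

With two grouping attributes the union has four branches; each extra attribute doubles that number.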
A better solution
- The “UNION ALL” solution gets cumbersome with more than 2 grouping attributes.
  - n grouping attributes → 2^n parts in the union
- OLAP extensions added to SQL:1999 are more convenient.
  - CUBE, ROLLUP
    SELECT state, month, SUM(quantity)
    FROM sales
    WHERE color = 'Red'
    GROUP BY CUBE(state, month)
Results of the CUBE query
    State   Month   SUM(quantity)
    CA      Jul     45
    CA      Aug     50
    CA      Sep     38
    CA      NULL    133
    OR      Jul     33
    OR      Aug     36
    OR      Sep     31
    OR      NULL    100
    WA      Jul     30
    WA      Aug     42
    WA      Sep     40
    WA      NULL    112
    NULL    Jul     108
    NULL    Aug     128
    NULL    Sep     109
    NULL    NULL    345
- Notice the use of NULL for totals.
- Subtotals at all levels.
ROLLUP vs. CUBE
- CUBE computes the entire lattice.
- ROLLUP computes one path through the lattice.
  - Order of the GROUP BY list matters.
  - Groups by all prefixes of the GROUP BY list.
GROUP BY ROLLUP(A,B,C) produces:
  - (A,B,C)
  - (A,B) subtotals
  - (A) subtotals
  - Total
GROUP BY CUBE(A,B,C) produces:
  - (A,B,C)
  - Subtotals for the following: (A,B), (A,C), (B,C), (A), (B), (C)
  - Total
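The difference between the two operators comes down to which grouping sets they enumerate: prefixes for ROLLUP, all subsets for CUBE. A short Python sketch makes this explicit; the function names are hypothetical, and the code only lists the groupings without computing any aggregates.

```python
from itertools import combinations

def rollup_groupings(cols):
    """ROLLUP: every prefix of the column list, longest first."""
    return [tuple(cols[:k]) for k in range(len(cols), -1, -1)]

def cube_groupings(cols):
    """CUBE: every subset of the column list, largest first."""
    return [combo
            for k in range(len(cols), -1, -1)
            for combo in combinations(cols, k)]

cols = ("A", "B", "C")
print(rollup_groupings(cols))   # n + 1 groupings
print(cube_groupings(cols))     # 2**n groupings
```

The empty tuple stands for the overall total. Reordering the columns changes the ROLLUP result but not the set of groupings produced by CUBE.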
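ROLLUP example
    SELECT color, month, state, SUM(quantity)
    FROM sales
    GROUP BY ROLLUP(color, month, state)
[Figure: the data cube lattice (Total; State, Month, Color; State-Month, State-Color, Month-Color; State-Month-Color)]
This query follows one path through the lattice: it groups by (color, month, state), then (color, month), then (color), and finally produces the overall total.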