Lecture 1: Data Warehousing Based on the slides by Jeffrey D. Ullman and Hector Garcia-Molina at Stanford University 1
2 Overview Traditional database systems are tuned to many, small, simple queries. Some new applications use fewer, more time-consuming, complex queries. New architectures have been developed to handle complex “analytic” queries efficiently.
3 The Data Warehouse The most common form of data integration. – Copy sources into a single DB (warehouse) and try to keep it up-to-date. – Usual method: periodic reconstruction of the warehouse, perhaps overnight. – Frequently essential for analytic queries.
4 OLTP Most database operations involve On-Line Transaction Processing (OLTP). – Short, simple, frequent queries and/or modifications, each involving a small number of tuples. – Examples: Answering queries from a Web interface, sales at cash registers, selling airline tickets.
5 OLAP Of increasing importance are On-Line Analytical Processing (OLAP) queries. – Few, but complex queries that may run for hours. – Queries do not depend on having an absolutely up-to-date database.
6 OLAP Examples 1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer. 2. Analysts at Wal-Mart look for items with increasing sales in some region.
7 Warehouse Architecture [Figure: warehouse architecture with components Client, Query & Analysis, Warehouse, Metadata, Integration, and Source.]
8 Why a Warehouse? Two Approaches: – Query-Driven (Lazy) – Warehouse (Eager)
9 Data Warehouse Databases at store branches handle OLTP. Local store databases copied to a central warehouse overnight. Analysts use the warehouse for OLAP.
10 Query-Driven Approach [Figure: a client queries a mediator, which reaches each source through a wrapper.]
Advantages of Warehousing High query performance Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse – Modify, summarize (store aggregates) – Add historical information 11
Advantages of Query-Driven No need to copy data – less storage – no need to purchase data More up-to-date data Query needs can be unknown Only query interface needed at sources May be less draining on sources 12
OLTP vs. OLAP OLTP: On Line Transaction Processing – Describes processing at operational sites OLAP: On Line Analytical Processing – Describes processing at warehouse 13
14 OLTP vs. OLAP OLTP: – Mostly updates – Many small transactions – MB-TB of data – Raw data – Clerical users – Up-to-date data – Consistency, recoverability critical OLAP: – Mostly reads – Queries long, complex – GB-TB of data – Summarized, consolidated data – Decision-makers, analysts as users
15 Star Schemas A star schema is a common organization for data at a warehouse. It consists of: 1. Fact table: a very large accumulation of facts such as sales. – Often “insert-only.” 2. Dimension tables: smaller, generally static information about the entities involved in the facts.
16 Example: Star Schema Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged. The fact table is a relation: Sales(bar, beer, drinker, day, time, price)
17 Example, Continued The dimension tables include information about the bar, beer, and drinker “dimensions”: Bars(bar, addr, license) Beers(beer, manf) Drinkers(drinker, addr, phone)
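The schema from the last two slides can be written down as table definitions. This is a minimal sketch using Python's built-in sqlite3 module; the column types and the foreign-key declarations are assumptions, since the slides give only the attribute names.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Dimension tables first, then the fact table that references them.
# Column types are assumed; the slides list attribute names only.
conn.executescript("""
CREATE TABLE Bars(bar TEXT PRIMARY KEY, addr TEXT, license TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, manf TEXT);
CREATE TABLE Drinkers(drinker TEXT PRIMARY KEY, addr TEXT, phone TEXT);
CREATE TABLE Sales(
    bar     TEXT REFERENCES Bars(bar),          -- dimension attribute
    beer    TEXT REFERENCES Beers(beer),        -- dimension attribute
    drinker TEXT REFERENCES Drinkers(drinker),  -- dimension attribute
    day     TEXT,
    time    TEXT,
    price   REAL                                -- dependent attribute
);
""")

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Each dimension attribute of Sales is the key of one dimension table, which is what gives the schema its star shape.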
18 Visualization – Star Schema [Figure: the fact table Sales, with its dimension attributes and dependent attributes, at the center; the dimension tables Bars, Beers, Drinkers, etc. surround it.]
19 Dimensions and Dependent Attributes Two classes of fact-table attributes: 1.Dimension attributes : the key of a dimension table. 2.Dependent attributes : a value determined by the dimension attributes of the tuple.
20 Example: Dependent Attribute price is the dependent attribute of our example Sales relation. It is determined by the combination of dimension attributes: bar, beer, drinker, and the time (combination of day and time-of-day attributes).
21 Approaches to Building Warehouses 1.ROLAP = “relational OLAP”: Tune a relational DBMS to support star schemas. 2.MOLAP = “multidimensional OLAP”: Use a specialized DBMS with a model such as the “data cube.”
22 ROLAP Techniques 1.Bitmap indexes : For each key value of a dimension table (e.g., each beer for relation Beers) create a bit-vector telling which tuples of the fact table have that value. 2.Materialized views : Store the answers to several useful queries (views) in the warehouse itself.
23 Typical OLAP Queries Often, OLAP queries begin with a “star join”: the natural join of the fact table with all or most of the dimension tables. Example: SELECT * FROM Sales, Bars, Beers, Drinkers WHERE Sales.bar = Bars.bar AND Sales.beer = Beers.beer AND Sales.drinker = Drinkers.drinker;
24 Typical OLAP Queries --- (2) The typical OLAP query will: 1.Start with a star join. 2.Select for interesting tuples, based on dimension data. 3.Group by one or more dimensions. 4.Aggregate certain attributes of the result.
25 Example: OLAP Query For each bar in Palo Alto, find the total sale of each beer manufactured by Anheuser-Busch. Filter: addr = “Palo Alto” and manf = “Anheuser-Busch”. Grouping: by bar and beer. Aggregation: Sum of price.
26 Example: In SQL
SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
GROUP BY bar, beer;
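To see the query produce an answer, it can be run against a few rows. The sketch below uses Python's sqlite3 module; every row of sample data is made up for illustration, and only the table and column names come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: two bars, two beers, four sales.
conn.executescript("""
CREATE TABLE Bars(bar TEXT, addr TEXT, license TEXT);
CREATE TABLE Beers(beer TEXT, manf TEXT);
CREATE TABLE Sales(bar TEXT, beer TEXT, drinker TEXT, day TEXT, time TEXT, price REAL);
INSERT INTO Bars VALUES ('Joe''s Bar', 'Palo Alto', 'L1'), ('Nut-House', 'Menlo Park', 'L2');
INSERT INTO Beers VALUES ('Bud', 'Anheuser-Busch'), ('Anchor Steam', 'Anchor');
INSERT INTO Sales VALUES
  ('Joe''s Bar', 'Bud',          'Jim',  'd1', '20:00', 3.0),
  ('Joe''s Bar', 'Bud',          'Bob',  'd1', '21:00', 3.5),
  ('Joe''s Bar', 'Anchor Steam', 'Jim',  'd2', '19:00', 5.0),
  ('Nut-House',  'Bud',          'Mary', 'd1', '22:00', 4.0);
""")

# Star join, filter on dimension data, group, aggregate.
rows = conn.execute("""
SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
GROUP BY bar, beer
""").fetchall()
print(rows)
```

Only the two Bud sales at Joe's Bar survive both filters, so the result is a single group.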
27 Using Materialized Views A direct execution of this query from Sales and the dimension tables could take too long. If we create a materialized view that contains enough information, we may be able to answer our query much faster.
28 Example: Materialized View Which views could help with our query? Key issues: 1. It must join Sales, Bars, and Beers, at least. 2. It must group by at least bar and beer. 3. It must not select out Palo-Alto bars or Anheuser-Busch beers. 4. It must not project out addr or manf.
29 Example --- Continued Here is a materialized view that could help:
CREATE VIEW BABMS(bar, addr, beer, manf, sales) AS
SELECT bar, addr, beer, manf, SUM(price) sales
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
GROUP BY bar, addr, beer, manf;
Since bar -> addr and beer -> manf, adding addr and manf to the GROUP BY does not split any (bar, beer) group. They are grouped on only because we need addr and manf in the SELECT.
30 Example --- Concluded Here’s our query using the materialized view BABMS:
SELECT bar, beer, sales
FROM BABMS
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch';
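The claim that BABMS answers the original query can be checked on a small instance. SQLite has no materialized views, so this sketch stands in for one with a table built once by CREATE TABLE ... AS SELECT; that substitution and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data, plus a table that plays the role of the
# materialized view BABMS (computed once, then queried like any table).
conn.executescript("""
CREATE TABLE Bars(bar TEXT, addr TEXT, license TEXT);
CREATE TABLE Beers(beer TEXT, manf TEXT);
CREATE TABLE Sales(bar TEXT, beer TEXT, drinker TEXT, day TEXT, time TEXT, price REAL);
INSERT INTO Bars VALUES ('Joe''s Bar', 'Palo Alto', 'L1'), ('Nut-House', 'Menlo Park', 'L2');
INSERT INTO Beers VALUES ('Bud', 'Anheuser-Busch'), ('Anchor Steam', 'Anchor');
INSERT INTO Sales VALUES
  ('Joe''s Bar', 'Bud',          'Jim',  'd1', '20:00', 3.0),
  ('Joe''s Bar', 'Bud',          'Bob',  'd1', '21:00', 3.5),
  ('Joe''s Bar', 'Anchor Steam', 'Jim',  'd2', '19:00', 5.0),
  ('Nut-House',  'Bud',          'Mary', 'd1', '22:00', 4.0);

CREATE TABLE BABMS AS
SELECT bar, addr, beer, manf, SUM(price) AS sales
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
GROUP BY bar, addr, beer, manf;
""")

# The query against the precomputed table: no join, no aggregation.
via_view = conn.execute("""
SELECT bar, beer, sales FROM BABMS
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
""").fetchall()

# The same question asked directly of the fact and dimension tables.
direct = conn.execute("""
SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
GROUP BY bar, beer
""").fetchall()

print(via_view, direct)
```

A table built this way goes stale as Sales grows, which is why a warehouse must also decide when to refresh its derived data.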
31 MOLAP and Data Cubes Keys of dimension tables are the dimensions of a hypercube. – Example: for the Sales data, the four dimensions are bar, beer, drinker, and time. Dependent attributes (e.g., price) appear at the points of the cube.
32 Visualization - Data Cubes [Figure: a cube with axes bar, beer, and drinker; each point holds a price.]
33 Marginals The data cube also includes aggregation (typically SUM) along the margins of the cube. The marginals include aggregations over one dimension, two dimensions,…
34 Visualization - Data Cube w/ Aggregation [Figure: the cube with axes bar, beer, and drinker, extended with a face holding the SUM of price over all drinkers.]
35 Example: Marginals Our 4-dimensional Sales cube includes the sum of price over each bar, each beer, each drinker, and each time unit (perhaps days). It would also have the sum of price over all bar-beer pairs, all bar-drinker-day triples,…
36 Structure of the Cube Think of each dimension as having an additional value *. A point with one or more *’s in its coordinates aggregates over the dimensions with the *’s. Example: Sales(“Joe’s Bar”, “Bud”, *, *) holds the sum over all drinkers and all time of the Bud consumed at Joe’s.
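The * convention can be made concrete in a few lines. The sketch below builds every cell of the extended cube from a list of facts; the facts themselves are hypothetical, and a real MOLAP system would not enumerate cells this naively.

```python
from collections import defaultdict
from itertools import product

# Hypothetical facts: (bar, beer, drinker, day, price).
sales = [
    ("Joe's Bar", "Bud", "Jim", "d1", 3.0),
    ("Joe's Bar", "Bud", "Bob", "d1", 3.5),
    ("Joe's Bar", "Anchor Steam", "Jim", "d2", 5.0),
    ("Nut-House", "Bud", "Mary", "d1", 4.0),
]

cube = defaultdict(float)
for *dims, price in sales:
    # A fact contributes to every cell obtained by replacing any subset
    # of its four coordinates with "*" (2**4 = 16 cells per fact).
    for mask in product([False, True], repeat=len(dims)):
        cell = tuple("*" if starred else value for value, starred in zip(dims, mask))
        cube[cell] += price

print(cube[("Joe's Bar", "Bud", "*", "*")])  # all drinkers, all time
print(cube[("*", "*", "*", "*")])            # grand total
```

Cells with no * are the original points of the cube; every other cell is one of the marginals.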
37 Drill-Down Drill-down = “de-aggregate” = break an aggregate into its constituents. Example: having determined that Joe’s Bar sells very few Anheuser-Busch beers, break down his sales by particular A.-B. beer.
38 Roll-Up Roll-up = aggregate along one or more dimensions. Example: given a table of how much Bud each drinker consumes at each bar, roll it up into a table giving total amount of Bud consumed for each drinker.
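39 Roll Up and Drill Down [Figure: three tables. “$ of Anheuser-Busch by drinker/bar” has columns Jim, Bob, Mary and rows Joe’s Bar, Nut-House, Blue Chalk. Rolling up by bar gives “$ of A-B / drinker”, one row with columns Jim, Bob, Mary. Drilling down by beer gives “$ of A-B Beers / drinker”, with columns Jim, Bob, Mary and rows Bud, M’lob, Bud Light.]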
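Roll-up and drill-down are the same aggregation applied with fewer or more dimensions kept. A small sketch follows; the names are taken from the figure, while the dollar amounts and the helper name aggregate are invented for illustration.

```python
from collections import defaultdict

DIMS = ("drinker", "bar", "beer")

# Hypothetical dollars of Anheuser-Busch beer: (drinker, bar, beer, amount).
facts = [
    ("Jim", "Joe's Bar", "Bud", 10),
    ("Jim", "Nut-House", "Bud Light", 5),
    ("Bob", "Joe's Bar", "M'lob", 7),
    ("Mary", "Blue Chalk", "Bud", 4),
    ("Mary", "Blue Chalk", "M'lob", 6),
]

def aggregate(facts, keep):
    """Sum the amount over every dimension that is not listed in keep."""
    positions = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in positions)] += row[-1]
    return dict(totals)

by_drinker_bar = aggregate(facts, ("drinker", "bar"))    # starting table
by_drinker = aggregate(facts, ("drinker",))              # roll up by bar
by_drinker_beer = aggregate(facts, ("drinker", "beer"))  # drill down by beer

print(by_drinker)
```

Note that drilling down needs the underlying facts (or a finer aggregate); it cannot be computed from the rolled-up table alone.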
40 Materialized Data-Cube Views Data cubes invite materialized views that are aggregations in one or more dimensions. Dimensions may not be completely aggregated --- an option is to group by an attribute of the dimension table.
41 Example A materialized view for our Sales data cube might: 1.Aggregate by drinker completely. 2.Not aggregate at all by beer. 3.Aggregate by time according to the week. 4.Aggregate according to the city of the bar.
Warehouse Models & Operators Data Models – relations – stars & snowflakes – cubes Operators – slice & dice – roll-up, drill down – pivoting – other 43
Star 44
Star Schema 45
Terms Fact table Dimension tables Measures 46
47 Dimension Hierarchies [Figure: the store dimension with attributes sType, city, and region.] – snowflake schema – constellations
48 Cube [Figure: a fact table view next to the equivalent multi-dimensional cube; dimensions = 2.]
49 3-D Cube [Figure: a fact table view next to the equivalent multi-dimensional cube with slices for day 1 and day 2; dimensions = 3.]
ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing MOLAP: Multi-Dimensional On-Line Analytical Processing 50
51 Aggregates Add up amounts for day 1. In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 (result shown on the slide: 81)
Aggregates 52 Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
53 Another Example Add up amounts by day, product. In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId. Going from the by-day totals to this result is a drill-down; going back is a rollup.
54 Aggregates Operators: sum, count, max, min, median, avg “Having” clause Using dimension hierarchy – average by region (within store) – maximum by month (within date)
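The three aggregate queries on SALE can be tried together. In this sketch the rows are hypothetical, chosen so that the day-1 total is 81 as on the slide, and the custId column is an assumed name; prodId is also added to the last SELECT so that each output row identifies its group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical SALE rows; custId is an assumed column name.
conn.executescript("""
CREATE TABLE SALE(prodId TEXT, custId TEXT, date INTEGER, amt INTEGER);
INSERT INTO SALE VALUES
  ('p1', 'c1', 1, 12), ('p2', 'c1', 1, 11), ('p1', 'c3', 1, 50),
  ('p2', 'c2', 1, 8),  ('p1', 'c1', 2, 44), ('p1', 'c2', 2, 4);
""")

# One total for a single day.
day1_total = conn.execute("SELECT sum(amt) FROM SALE WHERE date = 1").fetchone()[0]

# One total per day (rollup of the next query).
by_day = conn.execute(
    "SELECT date, sum(amt) FROM SALE GROUP BY date ORDER BY date"
).fetchall()

# One total per (day, product) pair (drill-down of the previous query).
by_day_product = conn.execute(
    "SELECT date, prodId, sum(amt) FROM SALE "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(day1_total, by_day, by_day_product)
```

Summing the (day, product) rows for a given day reproduces that day's total, which is the rollup relationship between the two results.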
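55 Cube Aggregation Example: computing sums. [Figure: the 3-D cube with slices for day 1 and day 2 is summed step by step; rollup moves toward coarser sums, drill-down toward finer ones.]
56 Cube Operators [Figure: the cube with slices for day 1 and day 2; aggregate cells are labeled sale(c1,*,*), sale(c2,p2,*), and sale(*,*,*).]
57 Extended Cube [Figure: the cube with slices for day 1 and day 2, extended with a * value on each dimension; example cell sale(*,p2,*).]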
58 Aggregation Using Hierarchies [Figure: the cube with slices for day 1 and day 2, aggregated along the hierarchy customer -> region -> country.] Customer c1 is in Region A; customers c2 and c3 are in Region B.
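Aggregating along a hierarchy means mapping each dimension value to its parent before summing. The sketch below uses the customer-to-region mapping stated on the slide; the sales rows are hypothetical.

```python
from collections import defaultdict

# From the slide: c1 is in Region A; c2 and c3 are in Region B.
REGION = {"c1": "A", "c2": "B", "c3": "B"}

# Hypothetical facts: (product, customer, day, amount).
sales = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

by_region_day = defaultdict(int)
for _product, customer, day, amount in sales:
    # Replace the customer by its region, then aggregate as usual.
    by_region_day[(REGION[customer], day)] += amount

print(dict(by_region_day))
```

A further step up the hierarchy (region to country) would apply a second mapping in the same way.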
59 Pivoting [Figure: a fact table view next to the corresponding multi-dimensional cube with slices for day 1 and day 2.]
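One way to read pivoting is as a choice of which two dimensions become the rows and columns of a cross-tabulation, with the remaining dimensions summed out. The sketch below takes that reading; the function name pivot, the column names, and the data are all assumptions.

```python
def pivot(rows, row_key, col_key, value):
    """Cross-tabulate rows: one table row per row_key value, one column per col_key value."""
    table = {}
    for r in rows:
        cells = table.setdefault(r[row_key], {})
        cells[r[col_key]] = cells.get(r[col_key], 0) + r[value]
    return table

# Hypothetical fact table view.
sales = [
    {"prodId": "p1", "custId": "c1", "date": 1, "amt": 12},
    {"prodId": "p2", "custId": "c1", "date": 1, "amt": 11},
    {"prodId": "p1", "custId": "c3", "date": 1, "amt": 50},
    {"prodId": "p2", "custId": "c2", "date": 1, "amt": 8},
    {"prodId": "p1", "custId": "c1", "date": 2, "amt": 44},
    {"prodId": "p1", "custId": "c2", "date": 2, "amt": 4},
]

by_product_customer = pivot(sales, "prodId", "custId", "amt")  # date summed out
by_day_product = pivot(sales, "date", "prodId", "amt")         # customer summed out

print(by_product_customer)
print(by_day_product)
```

The same facts support both layouts; pivoting changes only which dimensions the analyst sees as axes.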
Query & Analysis Tools Query Building Report Writers (comparisons, growth, graphs,…) Spreadsheet Systems Web Interfaces Data Mining 60
Other Operations Time functions – e.g., time average Computed Attributes – e.g., commission = sales * rate Text Queries – e.g., find documents with words X AND B – e.g., rank documents by frequency of words X, Y, Z 61
Implementing a Warehouse Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing, indexing,... Managing: Metadata, Design,... 62
Monitoring Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … Incremental vs. Refresh 63 new
Monitoring Techniques Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring 64 Advantages & Disadvantages!!
Monitoring Issues Frequency – periodic: daily, weekly, … – triggered: on “big” change, lots of changes,... Data transformation – convert data to uniform format – remove & add fields (e.g., add date to get history) Standards (e.g., ODBC) Gateways 65
66 Integration Data Cleaning Data Loading Derived Data [Figure: warehouse architecture diagram, as on slide 7.]
67 Data Cleaning Migration (e.g., yen -> dollars) Scrubbing: use domain-specific knowledge (e.g., social security numbers) Fusion (e.g., mail list, customer merging) Auditing: discover rules & relationships (like data mining) [Figure: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe).]
Loading Data Incremental vs. refresh Off-line vs. on-line Frequency of loading – At night, 1x a week/month, continuously Parallel/Partitioned load 68
Derived Data Derived Warehouse Data – indexes – aggregates – materialized views (next slide) When to update derived data? Incremental vs. refresh 69
Materialized Views Define new warehouse relations using SQL expressions 70 does not exist at any source
71 Processing ROLAP servers vs. MOLAP servers Index Structures What to Materialize? Algorithms [Figure: warehouse architecture diagram, as on slide 7.]
ROLAP Server Relational OLAP Server 72 relational DBMS ROLAP server tools utilities Special indices, tuning; Schema is “denormalized”
MOLAP Server Multi-Dimensional OLAP Server 73 multi- dimensional server M.D. tools utilities could also sit on relational DBMS Product City Date milk soda eggs soap A B Sales
Index Structures Traditional Access Methods – B-trees, hash tables, R-trees, grids, … Popular in Warehouses – inverted lists – bit map indexes – join indexes – text indexes 74
CS 245 Notes 12 75 Inverted Lists [Figure: an age index points to inverted lists, which in turn point to data records.]
76 Using Inverted Lists Query: – Get people with age = 20 and name = “fred” List for age = 20: r4, r18, r34, r35 List for name = “fred”: r18, r52 Answer is intersection: r18
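The lookup above can be sketched as follows. The record contents are invented so that the two inverted lists come out as on the slide; the helper name build_inverted_index is also an assumption.

```python
from collections import defaultdict

# Hypothetical records, chosen to reproduce the lists on the slide.
records = {
    "r4":  {"name": "ann",  "age": 20},
    "r18": {"name": "fred", "age": 20},
    "r34": {"name": "bob",  "age": 20},
    "r35": {"name": "sue",  "age": 20},
    "r52": {"name": "fred", "age": 31},
}

def build_inverted_index(records, attribute):
    """Map each value of the attribute to the list of record ids holding it."""
    index = defaultdict(list)
    for rid, record in records.items():
        index[record[attribute]].append(rid)
    return index

age_index = build_inverted_index(records, "age")
name_index = build_inverted_index(records, "name")

# A conjunctive query is the intersection of the two lists.
answer = sorted(set(age_index[20]) & set(name_index["fred"]))
print(answer)
```

Only the record ids in the intersection need to be fetched from the data file.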
77 Bit Maps [Figure: an age index points to bit maps, one per value, with one bit per data record.]
78 Using Bit Maps Query: – Get people with age = 20 and name = “fred” Take the bit map for age = 20 and the bit map for name = “fred”; the answer is their intersection (bitwise AND). [The bit vectors shown on the slide are not recoverable.] – Good if domain cardinality small – Bit vectors can be compressed
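Since the slide's bit vectors did not survive, here is a minimal sketch of the same idea with made-up data. Each bit map is stored as a Python integer whose bit i is set when record i has the value; the helper name build_bitmaps is an assumption.

```python
def build_bitmaps(values):
    """Return {value: int}, where bit i of the int is 1 iff values[i] == value."""
    bitmaps = {}
    for position, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << position)
    return bitmaps

# Hypothetical column data for five records (positions 0..4).
ages = [20, 31, 20, 45, 20]
names = ["ann", "fred", "fred", "bob", "sue"]

age_bitmaps = build_bitmaps(ages)
name_bitmaps = build_bitmaps(names)

# age = 20 AND name = "fred" is a single bitwise AND.
hits = age_bitmaps[20] & name_bitmaps["fred"]
matching = [i for i in range(len(ages)) if (hits >> i) & 1]
print(bin(hits), matching)
```

One bit map is needed per distinct value, which is why the technique suits attributes with small domains.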
79 Join “Combine” SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT
80 What to Materialize? Store in the warehouse results useful for common queries. Example: [Figure: from the cube with slices for day 1 and day 2, the total sales are computed and materialized.]
Materialization Factors Type/frequency of queries Query response time Storage cost Update cost 81
82 Cube Aggregates Lattice Lattice of group-bys, from finest to coarsest: {city, product, date}; then {city, product}, {city, date}, {product, date}; then {city}, {product}, {date}; then all. Use a greedy algorithm to decide what to materialize.
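The slide does not spell out the greedy algorithm, so the sketch below shows one common formulation: always keep the finest view, then repeatedly add the view whose materialization most reduces the total cost of answering every group-by, where a group-by costs the size of the smallest materialized view that contains its attributes. The view sizes are invented, and the function names are assumptions.

```python
# Hypothetical row counts for each group-by; frozenset() stands for "all".
SIZE = {
    frozenset({"city", "product", "date"}): 6000,
    frozenset({"city", "product"}): 1000,
    frozenset({"city", "date"}): 5000,
    frozenset({"product", "date"}): 4000,
    frozenset({"city"}): 50,
    frozenset({"product"}): 100,
    frozenset({"date"}): 365,
    frozenset(): 1,
}
TOP = frozenset({"city", "product", "date"})

def query_cost(view, materialized):
    """Cost of a group-by: size of the smallest materialized view it can be computed from."""
    return min(SIZE[m] for m in materialized if view <= m)

def benefit(candidate, materialized):
    """Total cost saved, over all group-bys the candidate can answer, if it is added."""
    return sum(
        max(0, query_cost(view, materialized) - SIZE[candidate])
        for view in SIZE
        if view <= candidate
    )

def greedy_select(k):
    """Pick k views beyond the top view, each time taking the one with the largest benefit."""
    materialized = [TOP]
    for _ in range(k):
        candidates = [v for v in SIZE if v not in materialized]
        materialized.append(max(candidates, key=lambda v: benefit(v, materialized)))
    return materialized[1:]

chosen = greedy_select(2)
print([sorted(view) for view in chosen])
```

With these sizes the first pick is {city, product}, because it is small and serves four group-bys; the second is {date}, which the first pick cannot answer. A fuller treatment would also weigh query frequency, storage, and update cost.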
Dimension Hierarchies 83 all state city
Dimension Hierarchies 84 city, product city, product, date city, date product, date city product date all state, product, date state, date state, product state not all arcs shown...
Interesting Hierarchy 85 all years quarters months days weeks conceptual dimension table
86 Managing Metadata Warehouse Design Tools [Figure: warehouse architecture diagram, as on slide 7.]
87 Metadata Administrative – definition of sources, tools,... – schemas, dimension hierarchies, … – rules for extraction, cleaning, … – refresh, purging policies – user profiles, access control,...
Metadata Business – business terms & definition – data ownership, charging Operational – data lineage – data currency (e.g., active, archived, purged) – use stats, error reports, audit trails 88
Design What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index? 89
Tools Development – design & edit: schemas, views, scripts, rules, queries, reports Planning & Analysis – what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management – performance monitoring, usage patterns, exception reporting System & Network Management – measure traffic (sources, warehouse, clients) Workflow Management – “reliable scripts” for cleaning & analyzing data 90
Current State of Industry Extraction and integration done off-line – Usually in large, time-consuming, batches Everything copied at warehouse – Not selective about what is stored – Query benefit vs storage & update cost Query optimization aimed at OLTP – High throughput instead of fast response – Process whole query before displaying anything 91
Future Directions Better performance Larger warehouses Easier to use What are companies & research labs working on? 92
Conclusions Massive amounts of data and complexity of queries will push limits of current warehouses Need better systems: – easier to use – provide quality information 93