Download presentation
Presentation is loading. Please wait.
Published byIliana Millner Modified over 10 years ago
1
12/18/20141 Lecture 10: Data Warehouses Introduction Operational vs. Warehouse Multidimensional Data Examples MOLAP vs ROLAP Dimensional Hierarchies OLAP Queries Demos Comparison with SQL Queries CUBE Operator Multidimensional Design Star/Snowflake Schemas Online Aggregation Implementation Issues Bitmap Index Constructing a Data Warehouse Views Materialized View Example Materialized View is an Index Issues in Materialized Views Maintaining Materialized Views
2
12/18/20142 Introduction In the late 80s and early 90s, companies began to use their DBMSs for complex, interactive, exploratory analysis of historical data. This was called Decision Support, and On- Line Analytic Processing (OLAP). DS slowed down the operation of the company, called On-Line Transaction Processing (OLTP). This led to the creation of Data Warehouses, separate from operational Databases.
3
12/18/20143 Operational vs Data Warehouse Requirements OLTP / Operational / ProductionDSS / Warehouse / DataMart Operate the business / ClerksDiagnose the business / Managers Short queries, small amts of dataopposite Queries change dataopposite Customer inquiry, Order Entry, etc.OLAP, Statistics, Visualization, Data Mining, etc. Legacy Applications, Heterogeneous databases Opposite Often DistributedOften Centralized (Warehouse) Current dataCurrent and Historical data
4
12/18/20144 Operational vs Data Warehouse Requirements, ctd OperationalWarehouse General E-R DiagramsMultidimensional data model common Locks necessaryNo Locks necessary Crash recovery requiredCrash recovery optional Smaller volume of dataHuge volume of data Need indexes designed to access small amounts of data Need indexes designed to access large volumes of data
5
12/18/20145 Data Warehousing Integrated data spanning long time periods, often augmented with summary information. Several terabytes to petabytes common. Interactive response times expected for complex queries; ad-hoc updates uncommon. Operational Data EXTRACT TRANSFORM LOAD REFRESH DATA WAREHOUSE Metadata Repository SUPPORTS OLAP DATA MINING
6
12/18/20146 Multidimensional Data In order to support OLAP, warehouse data is often structured multidimensionally, as measures and dimensions. Measure: Numeric attribute, e.g. sales amount Dimension: attribute categorizing the measure, e.g. product, store, date of sale. The fact table is a foreign key for each dimension, plus an attribute for each measure. There will also be a dimension table for each dimension. On the next page, the fact tables are red, the dimension tables are green.
7
12/18/20147 Examples of MultiDimensional Data Purchase(ProductID, StoreID, DateID, Amt) Product(ID, SKU, size, brand) Store(ID, Address, Sales District, Region, Manager) Date (ID, Week, Month, Holiday, Promotion) Claims(ProvID, MembID, Procedure, DateID, Cost) Providers(ID, Practice, Address, ZIP, City, State) Members(ID, Contract, Name, Address) Procedure (ID, Name, Type) Telecomm (CustID, SalesRepID, ServiceID, DateID) SalesRep(ID, Address, Sales District, Region, Manager) Service(ID, Name, Category)
8
12/18/20148 MOLAP vs ROLAP Multidimensional data can be stored physically in a (disk-resident, persistent) array; called MOLAP systems. Alternatively, can store as a relation; called ROLAP systems. The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table. E.g., Products(pid, locid, timeid, amt) Fact tables are much larger than dimensional tables.
9
12/18/20149 25.2 Multidimensional Data Model Collection of numeric measures, which depend on a set of dimensions. E.g., measure Amt, dimensions Product (key: pid), Location (locid), and Time (timeid). 8 10 10 30 20 50 25 8 15 1 2 3 timeid pid 11 12 13 pid timeidlocid amt locid Slice locid=1 is shown:
10
12/18/201410 Dimension Hierarchies For each dimension, some of the attributes may be organized in a hierarchy: PRODUCTTIMELOCATION pname week city PID date ZIP year category quarter state
11
12/18/201411 25.3 OLAP Queries Influenced by SQL and by spreadsheets. A common operation is to aggregate a measure over one or more dimensions. Find total sales. Find total sales for each city, or for each state. Find top five products ranked by total sales. Roll-up: Aggregating at different levels of a dimension hierarchy. E.g., Given total sales by city, we can roll-up to get sales by state.
12
12/18/201412 OLAP Queries Drill-down: The inverse of roll-up. E.g., Given total sales by state, can drill-down to get total sales by city. E.g., Can also drill-down on different dimension to get total sales by product for each state. Pivoting: Aggregation on selected dimensions. E.g., Pivoting on State and Year yields this cross-tabulation : 63 81 144 38 107 145 75 35 110 OR CA Total 2007 2008 2009 176 223 339 Total Slicing and Dicing: Equality and range selections on one or more dimensions.
13
12/18/201413 Cognos Demo Now we watch a demo of Cognos (bought by IBM) Dimensions: Products…Margin ranges Measure: Order value (sales) First pivot from Product dimension to Margin Range Notice how quickly the cube changes Slice to Low Margin, pivot to Product and Company Region Drill Down to High Tech, IDES AG Now the guilty product is clear.
14
12/18/201414 Tableau Demo http://www.tableausoftware.com/products/tour2 http://www.tableausoftware.com/products/tour2 Note the many measures. Pivot on sales, date (drill down to month), region as color. Clear date, pivot on product and drill down on subcategory. Change region from color to rows Move profit into color Change bars to circles Pivot on dates (columns)
15
12/18/201415 Comparison with SQL Queries The cross-tabulation obtained by pivoting can also be computed using a collection of SQLqueries: SELECT T.year, L.state, SUM (S.amt) FROM Sales S, Times T, Locations L WHERE S.timeid=T.timeid AND S.locid=L.locid GROUP BY T.year, L.state SELECT T.year, SUM (S.amt) FROM Sales S, Times T WHERE S.timeid=T.timeid GROUP BY T.year SELECT L.state,SUM (S.amt) FROM Sales S, Location L WHERE S.locid=L.locid GROUP BY L.state
16
12/18/201416 The CUBE Operator Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions. CUBE pid, locid, timeid BY SUM Sales Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form: SELECT SUM (S.amt) FROM Sales S GROUP BY grouping-list Lots of work on optimizing the CUBE operator!
17
12/18/201417 Example Multidimensional Design This kind of schema is very common in OLAP applications It is called a star schema What is “wrong” with it? pricecategorypnamepidcountrystatecitylocid amtlocidtimeidpid holiday_flagweekdatetimeidmonthquarteryear (Fact table) SALES TIMES PRODUCTSLOCATIONS
18
12/18/201418 Star/Snowflake Schemas Why normalize? Space Redundancy, anomalies Why unnormalize? Performance Which is more important in D. Warehouses? If normalized, it is a snowflake schema
19
12/18/201419 Online Aggregation Consider an aggregate query, e.g., finding the average sales by state. Can we provide the user with some information before the exact average is computed for all states? Can show the current “running average” for each state as the computation proceeds. Even better, if we use statistical techniques and sample tuples to aggregate instead of simply scanning the aggregated table, we can provide bounds such as “the average for Oregon is 2000 102 with 95% probability”. Should also use nonblocking algorithms!
20
12/18/201420 25.6 Implementation Issues New indexing techniques: Bitmap indexes, Join indexes, array representations, compression, precomputation of aggregations, etc. E.g., Bitmap index: sex custid name sex rating rating Bit-vector: 1 bit for each possible value. M F
21
12/18/201421 Bitmap Indexes Work when an attribute has few values, e.g. gender or rating Advantage: Small enough to fit in memory Many queries can be answered by bit-vector ops, e.g. females with rating = 3.
22
12/18/201422 25.7 Constructing a D. Warehouse Extract Is the data in native format? Clean How many ways can you spell Mr.? Errors, missing information Transform Fix semantic mismatches. E.g. Last+first vs. Name Load Do it in parallel or else…. Refresh Both data and indexes
23
12/18/201423 25.8,9 Views and Decision Support In large databases, precomputation is necessary for decent response times Examples: brain, google Example: Precompute daily sums for the cube. What can be derived from those precomputations? These precomputed queries are called Materialized Views (SQL Server: Indexed views).
24
12/18/201424 Materialized View Example CREATE VIEW DailySum(date, sumamt) AS SELECT date, SUM(amt) FROM Times Join Sales USING(timeid) GROUP BY date SELECT week, SUM (amt) FROM Times Join Sales USING(timeid) Group By week SELECT week, SUM (sumamt) FROM Times Join DailySum USING (week) GROUP BY week Mat. View Query Modified Query
25
12/18/201425 Pros and Cons of Materialized Views Pro: Modified query is a join of two small tables; original query is a join with one huge table. Con: Materialized views take up space, need to be updated.
26
12/18/201426 A Materialized View is an Index Recall the definition of an index Data structure that provides fast access to data Table indexes were of the form {(value, pointer)}, perhaps at leaf level of a search structure. This is different. Needs to be maintained as underlying tables change. Ideally, we want incremental view maintenance algorithms.
27
12/18/201427 What views should we materialize? Remember the software that automatically chooses optimal index configurations? The same software will choose optimal materialized views, given a workload and available space.
28
12/18/201428 What about the optimizer? Given a query and a set of materialized views, can we use the materialized views to answer the query? This is tricky. Best reference is [348]
29
12/18/201429 Refreshing Materialized Views How often should we refresh the materialized view? Many enterprises refresh warehouse data only weekly/nightly, so can afford to completely rebuild their materialized views. Others want their warehouses to be current, so materialized views must be updated incrementally if possible. Let's look at some simple examples.
30
12/18/201430 25.10 Maintaining Materialized Views* Incremental view maintenance Defn: make changes in view that correspond to changes in the base tables Example: V = SELECT a FROM R How is V modified if r is inserted to R? How is V modified if r is deleted from R?
31
12/18/201431 Maintaining Materialized Views* Consider V = R ⋈ S How is V modified if r is inserted to R? How is V modified if r is deleted from R? Consider V = SELECT g,COUNT(*) FROM R GROUP BY g How is V modified if r is inserted to R? How is V modified if r is deleted from R For more general cases, see [348]
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.