CS5226 2002 Data Warehouse & Performance Tuning Xiaofang Zhou School of Computing, NUS Office: S16-08-20 URL:

Slides:



Advertisements
Similar presentations
An overview of Data Warehousing and OLAP Technology Presented By Manish Desai.
Advertisements

OLAP Tuning. Outline OLAP 101 – Data warehouse architecture – ROLAP, MOLAP and HOLAP Data Cube – Star Schema and operations – The CUBE operator – Tuning.
Data Warehouse Tuning. 7 - Datawarehouse2 Datawarehouse Tuning Aggregate (strategic) targeting: –Aggregates flow up from a wide selection of data, and.
Data Warehousing CPS216 Notes 13 Shivnath Babu. 2 Warehousing l Growing industry: $8 billion way back in 1998 l Range from desktop to huge: u Walmart:
Jennifer Widom On-Line Analytical Processing (OLAP) Introduction.
Data Warehouse IMS5024 – presented by Eder Tsang.
1 Database Tuning Principles, Experiments and Troubleshooting Techniques Dennis Shasha Philippe Bonnet
Introduction to Data Warehousing. From DBMS to Decision Support DBMSs widely used to maintain transactional data Attempts to use of these data for analysis,
1 Database Tuning Principles, Experiments and Troubleshooting Techniques Dennis Shasha Philippe Bonnet
Advanced Querying OLAP Part 2. Context OLAP systems for supporting decision making. Components: –Dimensions with hierarchies, –Measures, –Aggregation.
Tuning Relational Systems I. Schema design  Trade-offs among normalization, denormalization, clustering, aggregate materialization, vertical partitioning,
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 29 Overview of Data Warehousing and OLAP.
Data Warehousing. On-Line Analytical Processing (OLAP) Tools The use of a set of graphical tools that provides users with multidimensional views of their.
CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.
AutoJoin: Providing Freedom from Specifying Joins Terrence Mason Lixin Wang
DATA WAREHOUSE (Muscat, Oman).
1 Data Warehousing and OLAP. 2 Data Warehousing & OLAP Defined in many different ways, but not rigorously.  A decision support database that is maintained.
CS346: Advanced Databases
SQL Server 2005 Performance Enhancements for Large Queries Joe Chang
SQL Server Parallel Data Warehouse: Supporting Large Scale Analytics José Blakeley, Software Architect Database Systems Group, Microsoft Corporation.
8/20/ Data Warehousing and OLAP. 2 Data Warehousing & OLAP Defined in many different ways, but not rigorously. Defined in many different ways, but.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
A Paradigm Shift in Database Optimization: From Indices to Aggregates Presented to: The Data Warehousing & Data Mining mini-track – AMCIS 2002 as Research-in-Progress.
Getting Started With Ingres VectorWise
Communicating with the Outside. Overview Package several SQL statements within one call to the database server Embedded procedural language (Transact.
OnLine Analytical Processing (OLAP)
1 Recovery Tuning Main techniques Put the log on a dedicated disk Delay writing updates to the database disks as long as possible Setting proper intervals.
Data warehousing and online analytical processing- Ref Chap 4) By Asst Prof. Muhammad Amir Alam.
1 Data Warehouses BUAD/American University Data Warehouses.
OLAP & DSS SUPPORT IN DATA WAREHOUSE By - Pooja Sinha Kaushalya Bakde.
Data Warehousing.
October 28, Data Warehouse Architecture Data Sources Operational DBs other sources Analysis Query Reports Data mining Front-End Tools OLAP Engine.
CS Operating System & Database Performance Tuning Xiaofang Zhou School of Computing, NUS Office: S URL:
© 1999 FORWISS FORWISS MISTRAL Performance of TPC-D Benchmark and Datawarehouses Prof. R. Bayer, Ph.D. Dr. Volker Markl Dept. of Computer Science, Technical.
Ch3 Data Warehouse Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
Winter 2006Winter 2002 Keller, Ullman, CushingJudy Cushing 19–1 Warehousing The most common form of information integration: copy sources into a single.
Schema Tuning. Outline Database design: Normalization –Problem of redundancy –Why? Functional dependency –How to solve? Decomposition –Objective of the.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
1 On-Line Analytic Processing Warehousing Data Cubes.
ADVANCED TOPICS IN RELATIONAL DATABASES Spring 2011 Instructor: Hassan Khosravi.
Data Mining Data Warehouses.
Business Intelligence Transparencies 1. ©Pearson Education 2009 Objectives What business intelligence (BI) represents. The technologies associated with.
CSE 5331/7331 F'071 CSE 5331/7331 Fall 2007 Dimensional Modeling Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University.
Data Warehousing.
Generalized Hash Teams for Join and Group-By Alfons Kemper Donald Kossmann Christian Wiesner Universität Passau Germany.
Advanced Database Concepts
Indexing OLAP Data Sunita Sarawagi Monowar Hossain York University.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.
An Overview of Data Warehousing and OLAP Technology
Data Warehouses and OLAP 1.  Review Questions ◦ Question 1: OLAP ◦ Question 2: Data Warehouses ◦ Question 3: Various Terms and Definitions ◦ Question.
Data Warehousing and OLAP Outline u Models & operations u Implementing a warehouse u Future directions.
Query Tuning. overview Use of Views CREATE VIEW techlocation AS SELECT ssnum, tech.dept, location FROM employee, tech WHERE employee.dept = tech.dept;
Presented By: Pedel Oppong-Abebrese,Pedel Oppong-Abebrese Michael Boadi, William Osei, Nana Amoa OforiMichael BoadiWilliam OseiNana Amoa Ofori DATA WAREHOUSING.
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
Data warehouse.
Data Warehousing CIS 4301 Lecture Notes 4/20/2006.
Value of Serializability
On-Line Analytic Processing
Data warehouse and OLAP
Data Warehouse.
On-Line Analytical Processing (OLAP)
Data Warehouse and OLAP
Data Warehousing: Data Models and OLAP operations
Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009
Introduction of Week 9 Return assignment 5-2
DATA CUBES E0 261 Jayant Haritsa Computer Science and Automation
Data Warehouse and OLAP
Presentation transcript:

CS Data Warehouse & Performance Tuning Xiaofang Zhou School of Computing, NUS Office: S URL:

2 Outline Part 1: A review of data warehousing Part 2: Data warehouse tuning

3 OLAP vs OLTP OLAP : On-line analytical processing Very large amount of historical data Read only Long and complicated sessions For a few trained users OLTP : On-line transaction processing Small amount of current data Frequent updates, deletions and insertions A very large number of fixed, short-duration transactions For many ad hoc users

4 DW Characteristics A DW provides access to date for complex analysis, knowledge discovery and decision support Subject oriented (vs application oriented) The data is organized around subjects (such as Sales), rather than operations applications (such as order processing). Non-volatile Not usually subject to change Integrated Data is consistent Time variant Historical data is recorded (Inmon 1992)

5 Data Warehouse Types Enterprise-wide data warehouses Large projects with massive investment of time and resources Virtual data warehouses Provide views of operational DBs that are materialised for efficiency Data marts Targeted to a subset of the organisation Also called department-level data warehouse Low-risk, low-cost, but hard to evolve … a data warehouse = a collection of data marts?

6 Process of Data Warehousing Databases Cleaning Reformatting OLAP DSS Data Mining Back flushing Other data inputs Data Warehouse Data Metadata Updates/New Data

7 Data Cube Sales data with three dimensions: region, product and fiscal period Product Region Fiscal period Sales

8 OLAP Data Models ROLAP: relational OLAP Use of relational technology, suitably adapted and extended The data is stored using tables But the analysis operations are carried out efficiently using special data structures MOLAP: multidimensional OLAP A more radical approach Storing data directly in a multi-dimensional form HOLAP: hybrid OLAP Data defined as in ROLAP but stored as in MLOAP

9 Multidimensional model is usually not 3NF But it is as simple as a spreadsheet two-dimensional matrix And its query performance is much better then in the relational model Multidimensional or Relational Region Fiscal period Sales

10 Tables Types in MD Model A multidimensional model consists a fact table and a number of dimension tables Fact table - contains measured variables, and identifies them with pointers to dimension tables Dimension table - tuples of attributes of dimension ProdID RegID OrderTime Quantity ItemCost Sales Product(Prod No, Name, Price,…) Region(Area, Description) Time(Date, PeriodNum, QuaterNum, Year) dimension tables fact table

11 Multidimensional Schema Star schema One fact table One table for each dimension Typically not normalised Snowflake schema A variation on the star schema Dimensional tables are organised into a hierarchy Can be in 3NF

12 Star Schema Example Sales Product Region Time

13 Snowflake Schema Example Sales Product Region Time Class …if there are queries about classes of products, after normalisation, it becomes a snowflake schema…

14 Complex Snowflake Schema Fact table Fiscal qtrs FQ dates Sales revenue Product Prod. name P. line Dimension tables

15 Typical Functionality of DW Pivoting Rotate data cube to show a different orientation of axes Roll-up Move up concept hierarchy, grouping into larger units along a dimension Drill-down Disaggregate to a finer-grained view Slice and dice Perform project operations on the dimensions Other operations, such as sorting, selection

16 Data Warehousing Technology Aggregate targeting Aggregates flow up from a wide selection of data, and then Targeted decisions flow down Workload type Broad: aggregate queries over ranges of values E.g., find the total sales by region and quarter. Deep: queries that require precise individualized information E.g., which frequent flyers have been delayed several times in the last month? Dynamic (vs. Static): queries that require up-to-date information E.g.. which nodes have the highest traffic now? Main approaches Multidimensional arrays Special indexes – bitmaps and multidimensional indexes Materialised views Optimised foreign key joins Approximation by sampling

17 Tuning Knobs Indexes Materialised views Approximation

18 Bitmap Indexes Suitable for deep and broad, static, many query attributes One bitmap per attribute value Bitmap size = the number of tuples Bitmap value: a[i]=1  tuple i has the value Compact, and easy to intersect several bitmaps (for multi-attribute queries) B-tree for selective attributes, bitmaps for unselective attributes … multiple B-tree indexes = multidimensional indexes?

19 Bitmaps - Data Settings: lineitem ( L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX, L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT ); create bitmap index b_lin_2 on lineitem(l_returnflag); create bitmap index b_lin_3 on lineitem(l_linestatus); create bitmap index b_lin_4 on lineitem(l_linenumber); rows ; cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.

20 Bitmaps - Queries Queries: 1 attribute select count(*) from lineitem where l_returnflag = 'N'; 2 attributes select count(*) from lineitem where l_returnflag = 'N' and l_linenumber > 3; 3 attributes select count(*) from lineitem where l_returnflag = 'N' and l_linenumber > 3 and l_linestatus = 'F';

21 Bitmaps Order of magnitude improvement compared to scan. Bitmaps are best suited for multiple conditions on several attributes, each having a low selectivity. ANRANR l_returnflag O F l_linestatus

22 Multidimensional Indexes Suitable for deep or broad, static, few query attributes with range queries Examples: quadtrees and R-trees The space is conceptually organised in grids Variable sizes and hierarchical Quadtree: space-centric, regular, recursive decomposition R-tree: data-centric, hierarchical bounding rectangles The problem of “dimensional curse”

23 Multidimensional Indexes - Data Settings: create table spatial_facts ( a1 int, a2 int, a3 int, a4 int, a5 int, a6 int, a7 int, a8 int, a9 int, a10 int, geom_a3_a7 mdsys.sdo_geometry ); create index r_spatialfacts on spatial_facts(geom_a3_a7) indextype is mdsys.spatial_index; create bitmap index b2_spatialfacts on spatial_facts(a3,a7); rows ; cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb drives (10000RPM), Windows 2000.

24 Multidimensional Indexes - Queries Queries: Point Queries select count(*) from fact where a3 = and a7 = ; select count(*) from spatial_facts where SDO_RELATE(geom_a3_a7, MDSYS.SDO_GEOMETRY(2001, NULL, MDSYS.SDO_POINT_TYPE(694014,928878, NULL), NULL, NULL), 'mask=equal querytype=WINDOW') = 'TRUE'; Range Queries select count(*) from spatial_facts where SDO_RELATE(geom_a3_a7, mdsys.sdo_geometry(2003,NULL,NULL, mdsys.sdo_elem_info_array(1,1003,3),mdsys.sdo_ordinate_array (10,800000, , )), 'mask=inside querytype=WINDOW') = 'TRUE'; select count(*) from spatial_facts where a3 > 10 and a and a7 < ;

25 Multidimensional Indexes Oracle 8i on Windows 2000 Spatial Extension: 2-dimensional data Spatial functions used in the query R-tree does not perform well because of the overhead of spatial extension.

26 Materialised Views Suitable for broad, static applications Also dynamic but at a cost Frequent queries are materialised View maintenance cost Incremental maintenance Self-maintainable data warehouses

27 Materialised Views - Data Settings: orders( ordernum, itemnum, quantity, purchaser, vendor ); create clustered index i_order on orders(itemnum); store( vendor, name ); item(itemnum, price); create clustered index i_item on item(itemnum); orders, stores, items; Cold buffer Oracle 9i Pentium III (1 GHz, 256 Kb), 1Gb RAM, Adapter with 2 channels, 3x18Gb drives (10000RPM), Linux Debian 2.4.

28 Materialised Views - Data Settings: create materialized view vendorOutstanding build immediate refresh complete enable query rewrite as select orders.vendor, sum(orders.quantity*item.price) from orders,item where orders.itemnum = item.itemnum group by orders.vendor;

29 Materialised Views - Transactions Concurrent Transactions: Insertions insert into orders values ( ,7825,562,'xxxxxx6944','vendor4'); Queries select orders.vendor, sum(orders.quantity*item.price) from orders,item where orders.itemnum = item.itemnum group by orders.vendor; select * from vendorOutstanding;

30 Materialised Views Graph: Oracle9i on Linux Total sale by vendor is materialized Trade-off between query speed- up and view maintenance: The impact of incremental maintenance on performance is significant. Rebuild maintenance achieves a good throughput. A static data warehouse offers a good trade-off.

31 Materialised View Maintenance Problem when large number of views to maintain. The order in which views are maintained is important: A view can be computed from an existing view instead of being recomputed from the base relations (total per region can be computed from total per nation). Let the views and base tables be nodes v_i Let there be an edge from v_1 to v_2 if it possible to compute the view v_2 from v_1. Associate the cost of computing v_2 from v_1 to this edge. Compute all pairs shortest path where the start nodes are the set of base tables. The result is an acyclic graph A. Take a topological sort of A and let that be the order of view construction.

32 Result Approximation Suitable for broad, dynamic, large data Examples When computing the average, one can increase sample sizes until no significant change of the value Increased accuracy with time Approximated results within given error bounds

33 Approximations - Data Settings: TPC-H schema Approximations insert into approxlineitem select top 6000 * from lineitem where l_linenumber = 4; insert into approxorders select O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT from orders, approxlineitem where o_orderkey = l_orderkey;

34 Approximations - Queries insert into approxsupplier select distinct S_SUPPKEY, S_NAME, S_ADDRESS, S_NATIONKEY, S_PHONE, S_ACCTBAL, S_COMMENT from approxlineitem, supplier where s_suppkey = l_suppkey; insert into approxpart select distinct P_PARTKEY, P_NAME, P_MFGR, P_BRAND, P_TYPE, P_SIZE, P_CONTAINER, P_RETAILPRICE, P_COMMENT from approxlineitem, part where p_partkey = l_partkey; insert into approxpartsupp select distinct PS_PARTKEY, PS_SUPPKEY, PS_AVAILQTY, PS_SUPPLYCOST, PS_COMMENT from partsupp, approxpart, approxsupplier where ps_partkey = p_partkey and ps_suppkey = s_suppkey; insert into approxcustomer select distinct C_CUSTKEY, C_NAME, C_ADDRESS, C_NATIONKEY, C_PHONE, C_ACCTBAL, C_MKTSEGMENT, C_COMMENT from customer, approxorders where o_custkey = c_custkey; insert into approxregion select * from region; insert into approxnation select * from nation;

35 Approximations - More queries Queries: Single table query on lineitem select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order from lineitem where datediff(day, l_shipdate, ' ') <= '120' group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;

36 Approximations - Still More Queries: 6-way join select n_name, avg(l_extendedprice * (1 - l_discount)) as revenue from customer, orders, lineitem, supplier, nation, region where c_custkey = o_custkey and l_orderkey = o_orderkey and l_suppkey = s_suppkey and c_nationkey = s_nationkey and s_nationkey = n_nationkey and n_regionkey = r_regionkey and r_name = 'AFRICA' and o_orderdate >= ' ' and datediff(year, o_orderdate,' ') < 1 group by n_name order by revenue desc;

37 Approximation Accuracy Good approximation for query Q1 on lineitem The aggregated values obtained on a query with a 6-way join are significantly different from the actual values - - for some applications may still be good enough.

38 Approximation Speedup Aqua approximation on the TPC-H schema 1% and 10% lineitem sample propagated. The query speed-up obtained with approximated relations is significant.

39 Summary In this module, we have covered: Basic concepts of data warehousing Data warehousing technologies How to optimise data warehouse performance