Presentation is loading. Please wait.

Presentation is loading. Please wait.

Decision Support, Data Warehousing, and OLAP Anindya Datta Director, iXL Center for E-Commerce Georgia Institute of Technology

Similar presentations


Presentation on theme: "Decision Support, Data Warehousing, and OLAP Anindya Datta Director, iXL Center for E-Commerce Georgia Institute of Technology"— Presentation transcript:

1 Decision Support, Data Warehousing, and OLAP Anindya Datta Director, iXL Center for E-Commerce Georgia Institute of Technology adatta@cc.gatech.edu

2 Outline Data Warehousing Architecture Terminology: OLTP vs. OLAP Technologies Products Research Issues References

3 Data Warehouse A decision support database that is maintained separately from the organization’s operational databases. A data warehouse is a  subject-oriented  integrated  time-varying  non-volatile

4 Data Warehouse collection of data that is used primarily in organizational decision making Decision support system

5 OLTP On-Line Transaction Processing OLTP  What we have been talking about in class  For day-to-day operations  Involves operational database  Queries to retrieve information from specific tuples

6 OLAP and Decision Support On-Line Analytical Processing OLAP is an element of decision support systems (DSS). Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.  What were the sales volumes by region and product category for the last year?  Which orders should we fill to maximize revenues?  Will a 10% discount increase sales volume sufficiently?

7 Three-Tier Architecture Warehouse database server  Almost always a relational DBMS OLAP servers  Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operations.  Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations. Clients  Query and reporting tools.  Analysis tools  Data mining tools (e.g., trend analysis, prediction)

8 Data Warehouse vs. Data Marts Enterprise warehouse: collects all information about subjects for entire organization.  Requires extensive business modeling  May take years to design and build Data Marts: Departmental subsets that focus on selected subjects: Marketing data mart  Faster roll out, but complex integration in the long run. Virtual warehouse: views over operational dbs  Materialize some summary views for efficient query processing  Easier to build  Requisite excess capacity on operational db servers

9 Data Warehousing Architecture Monitoring & Administration Metadata Repository ExtractTransformLoadRefresh Data Marts External Sources Operational dbs Serve OLAP servers Analysis Query/ Reporting Data Mining

10 OLTP vs. OLAP Clerk, IT Professional Day to day operations Application-oriented (E- R based) Current, Isolated Detailed, Flat relational Structured, Repetitive Short, Simple transaction Read/write Index/hash on prim. Key Tens Thousands 100 MB-GB Trans. throughput Knowledge worker Decision support Subject-oriented (Star, snowflake) Historical, Consolidated Summarized, Multidimensional Ad hoc Complex query Read Mostly Lots of Scans Millions Hundreds 100GB-TB Query throughput, response User Function DB Design Data View Usage Unit of work Access Operations # Records accessed #Users Db size Metric OLTPOLAP

11 OLAP for Decision Support Goal of OLAP is to support ad-hoc querying for the business analyst Business analysts are familiar with spreadsheets Extend spreadsheet analysis model to work with warehouse data Multidimensional view of data is the foundation of OLAP

12 Relational DBMS as Warehouse Server - ROLAP Schema design Specialized scan, indexing and join techniques Handling of aggregate views (querying and materialization) Supporting query language extensions beyond SQL Complex query processing and optimization Data partitioning and parallelism

13 Warehouse Database Schema ER design techniques not appropriate Design should reflect multidimensional view  Star Schema  Snowflake Schema  Fact Constellation Schema

14 Star Schema A single fact (subject) table and a single table for each dimension Every fact points to one tuple in each of the dimensions and has additional attributes Does not capture hierarchies directly Generated keys are used for performance and maintenance reasons

15 Example of a Star Schema Order No Order Date Customer No Customer Name Customer Address City SalespersonIDSalespersonNameCityQuota OrderNOSalespersonIDCustomerNOProdNoDateKeyCityNameQuantity Total Price ProductNOProdNameProdDescrCategoryCategoryDescriptionUnitPrice DateKeyDate CityNameStateCountry Order Customer Salesperson City Date Product Fact Table

16 Types of tables Fact table  (OrderNO, SalespersonID, CustomerNo, ProdNo, Total Price) Dimension tables  Order (Order No, Order Date)  Date (DateKey, Date)

17

18

19

20

21 Dimensional SQL version: SELECT SUM(SA.NoOfItems), P.PCategory, V.Vendor, C.Qtr, GroupBy(C.Qtr) FROM Calendar C, Store S, Product P, Sales SA WHERE C.CalendarKey = SA.CalendarKey AND S.StoreKey = SA. StoreKey AND P.ProductKey = SA. ProductKey AND P.Category = ‘Pacifica’ and S.RegionName = ‘Midwest’ AND C.DayofWeek = ‘Saturday’ AND C.Year = 2010 HAVING C.Qtr in (1,2) Non-dimensional SQL version: SELECT SUM(SW.NoOfItems), C.CategoryName, V.VendorName, EXTRACTQUARTER(ST.Date), GroupBy(EXTRACTQUARTER(ST.Tdate)) FROM Region R, Store S, SalesTransaction ST, SoldWithin SW, Product P, Vendor V, Category C WHERE R.RegionID = S. RegionID AND S.StoreID = ST.StoreID AND ST.Tid = SW.Tid AND SW.ProductID = P.ProductID AND P. VendorID = V. VendorID AND P.CateoryID = C.CategoryID AND P.Category = ‘Pacifica’ and R.RegionName = ‘Midwest’ AND EXTRACTWEEKDAY(St.Date) = ‘Saturday’, AND EXTRACTYEAR(ST.Date) = 2010 HAVING QUARTER(ST.Date) in (1,2)

22 Snowflake Schema Represent dimensional hierarchy directly by normalizing the dimension tables Easy to maintain Saves storage, but is alleged that it reduces effectiveness of browsing (Kimball)

23 Example of a Snowflake Schema Order No Order Date Customer No Customer Name Customer Address City SalespersonIDSalespersonNameCityQuota OrderNOSalespersonIDCustomerNOProdNoDateKeyCityNameQuantity Total Price ProductNOProdNameProdDescrCategoryCategoryUnitPrice DateKeyDateMonth CityNameStateCountry Order Customer Salesperson City Date Product Fact Table CategoryNameCategoryDescr MonthYear Year StateNameCountry Category State Month Year

24 Multidimensional Data Model Database is a set of facts (points) in a multidimensional space A fact has a measure dimension  quantity that is analyzed, e.g., sale, budget A set of dimensions on which data is analyzed  e.g., store, product, date associated with a sale amount Each dimension has a set of attributes  e.g., owner city and county of store Attributes of a dimension may be related by partial order  Hierarchy: e.g., county > city > street  Lattice: e.g., year>month> date, year>week>date

25 Multidimensional Data Model 10 47 30 12 JuiceColaMilkCream NYCLASF Sales Volume as a function of time, city and product 3/1 3/2 3/3 3/4 Date 221830

26 Operations in Multidimensional Data Model – Roll-up Aggregation (roll-up)  summarization over aggregate hierarchy – higher levels of summarization:  e.g., total sales by city and rollup to total sales by state or take daily sales totals and rollup to monthly sales totals Subtracting group by columns

27 100 475 310 121 JuiceColaMilkCream NYALCA 3/1 3/2 3/3 3/4 Date Roll-UP

28 Drill Down Navigation to detailed data (drill-down)  Used to adjust the level of detail – getting more detail – finer data  e.g., total sales by region and drill down to total sales by city or top 3% of cities Adding group by columns

29 Drill Down 5 20 20 7 JuiceColaMilkCream NYCLASF 3/1 am 3/1 pm 3/2 am 3/2 pm Date 5 27 10 5

30 Operations in Multidimensional Data Model Selection (slice) defines a subcube  e.g., sales where city = SF and Product = Cream Visualization Operations (e.g., Pivot)  Rotating summary tables (next page)

31 Selection - Slice 12221830 Cream SF 3/13/23/33/4

32 A Visual Operation: Pivot (Rotate) 10 47 30 12 JuiceColaMilkCream NYCLASF 3/1 3/2 3/3 3/4 Date Month Region Product

33 Indexing Techniques Exploiting indexes to reduce scanning of data is of crucial importance Bitmap Indexes Join Indexes

34 Bit Map Index Base Table Rating Index Region Index Customers where Region = W Rating = L And

35 BitMap Indexes Comparison, join and aggregation operations are reduced to bit arithmetic with dramatic improvement in processing time Significant reduction in space and I/O (30:1) Adapted for higher cardinality domains as well. Compression (e.g., run-length encoding) exploit

36 Issues in Handling of Aggregate Views (summary info) Important component for ROLAP Servers Logic for Aggregation Navigation  make optimum use of materialized aggregates to answer a query Choice of aggregate views to materialize

37 SQL Extensions for Front End Tools Extended Family of Aggregate functions  rank (top 10) and N-Tile (“top 30%” of all products)  Median, mode….. Reporting Features  running total, cumulative totals Results of multiple group by:  total sales by month and total sales by product

38 Approaches to OLAP Servers Relational OLAP (ROLAP)  Relational and Specialized Relational DBMS to store and manage warehouse data Example: Oracle Warehouse Builder, MetaCube (Informix) Multidimensional OLAP  Array-based storage structures Direct access to array data structures Example: Essbase (Arbor then Hyperion then Oracle)

39 ROLAP Data stored in relational tables

40 MOLAP – data stored in multidimensional array The storage model is an n-dimensional array – Cube  (1,1,1) subject value stored for juice, 3/1, SF  1,1,1= 10 1,2,1=37 1,3,1=40 1,4,1=12 Array representation has good indexing properties but very poor storage utilization when data is sparse

41 ROLAP vs MOLAP MOLAP – fast response times – data ready to display, cheaper if must start from scratch ROLAP – increased flexibility in choice of queries, cheaper if already have relational DB

42 ETL Data Extraction  Data cleaning Data may have errors from multiple sources Data Transformation  Convert from legacy/host format to warehouse format Data Load  Sort, summarize, consolidate, compute views, check integrity, build indexes, partition

43 Data Cleaning/Transformation Techniques Extraction – decide which data Uses domain-specific knowledge to do scrubbing Discover facts that flag unusual patterns (auditing)  Some dealer has never received a single complaint Products: QDB, SBStar, WizRule Transformation Rules  Example: translate “gender” to “sex” Products: Warehouse Manger (Prism), Extract (ETI)

44 Data Load Issues:  huge volumes of data to be loaded  small time window (usually at night) when the warehouse can be taken off-line  When to build indexes and summary tables  allow system administrator to monitor status, cancel suspend, resume load, or change load rate  restart after failure with no loss of data integrity Techniques:  sort input records on clustering key and use sequential I/O; build indexes and derived tables  sequential loads still too long  use parallelism and incremental techniques

45 Data Refresh Issues:  when to refresh on every update: too expensive, only necessary if OLAP queries need current data (e.g., up-the-minute stock quotes) periodically (e.g., every 24 hours, every week) or after “significant” events refresh policy set by administrator based on user needs and traffic Techniques:  Full extract from base tables read entire source table or database: expensive.  Incremental techniques (related to work on active DBS)

46 Issues to Consider Physical Design  design of summary tables, partitions, indexes Query processing  selecting appropriate summary tables  approximate answers  Runaway queries Warehouse Management  detecting runaway queries  resource management  incremental refresh techniques  computing summary tables during load

47 Research Issues Data warehouses in the cloud XML and data warehouses


Download ppt "Decision Support, Data Warehousing, and OLAP Anindya Datta Director, iXL Center for E-Commerce Georgia Institute of Technology"

Similar presentations


Ads by Google