Decision Support Systems: From Transaction Processing to Support for Decision Making (CIS 671)

1 Decision Support Systems1 From Transaction Processing to Support for Decision Making CIS 671

2 Decision Support Systems2 Computerized Information Systems Used to “run the business”. OSU Examples –Personnel & Payroll (ARMS) –Course Offerings –Students, including course enrollments and grades (estimated $30M to replace) –Inventory Transaction Processing

3 Decision Support Systems3 1st Generation DBMS Designed for Transaction Processing –Hierarchical (IBM IMS) –Network Management Information Systems –Added later –Mostly standard summary reports, produced on a regular basis

4 Decision Support Systems4 Relational DBMS Codd – particularly designed for “ad hoc” queries First uses for Transaction Processing Transaction Data now available on-line –Use it to help Decision Making –Ad Hoc

5 Decision Support Systems5 Decision Support Systems (DSS) Use comprehensive view of all aspects of business. –Different business units –Historical data –Summary information Classes of analysis tools: –Complex “traditional” SQL queries –Many “group-by” and “aggregation” queries (On Line Analytical Processing) –Exploratory data analysis - Data Mining

6 Decision Support Systems6 Data Warehousing Properties –Consolidated data from many sources –Spanning long time periods –Augmented with summary information Size: several gigabytes to terabytes

7 Decision Support Systems7 Data Warehouse Creation Integrate schemas from different groups –Semantic mismatches Different currencies Different names for same attributes Different structures for similar tables

8 Decision Support Systems8 Data Warehouse Creation, cont. Extract data from different operational databases and other external sources –Clean data - correct errors, fill in missing data –Transform data to match integrated schema –Load data into warehouse –Refresh data in a timely fashion –Purge very old data –Create metadata repository May be so large that it is in a separate database
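The extract, clean, transform, and load steps above can be sketched as a tiny pipeline. This is only an illustration: the two source layouts, the field names, and the exchange rate are hypothetical, chosen to mirror the semantic mismatches listed on the previous slide (different currencies, different names for the same attribute).

```python
# Hypothetical operational extracts; field names and the rate are assumptions.
US_ORDERS = [{"cust": "C1", "amount_usd": 120.0}, {"cust": "C2", "amount_usd": None}]
EU_ORDERS = [{"customer_id": "C3", "betrag_eur": 50.0}]
EUR_TO_USD = 1.10  # illustrative rate, not real data


def clean(rows, amount_key):
    """Clean step: drop rows whose amount is missing (a real system might impute instead)."""
    return [r for r in rows if r[amount_key] is not None]


def transform_us(row):
    """Map the US layout onto the integrated schema (customer_no, amount_usd)."""
    return {"customer_no": row["cust"], "amount_usd": row["amount_usd"]}


def transform_eu(row):
    """Rename the attribute and convert the currency to match the integrated schema."""
    return {"customer_no": row["customer_id"],
            "amount_usd": round(row["betrag_eur"] * EUR_TO_USD, 2)}


def load(warehouse, rows):
    """Load step: append transformed rows to the warehouse table."""
    warehouse.extend(rows)


warehouse = []
load(warehouse, [transform_us(r) for r in clean(US_ORDERS, "amount_usd")])
load(warehouse, [transform_eu(r) for r in clean(EU_ORDERS, "betrag_eur")])
```

Refresh, purge, and metadata handling are left out of the sketch; they would wrap around the same three functions.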

9 Decision Support Systems9 Data Warehouse - Provide Variety of Analytical Tools –Complex “traditional” SQL queries –OLAP query engine –Data mining algorithm –Information visualization tools –Statistical packages –Report generators

10 Decision Support Systems10 Data Mart Departmental subset of a data warehouse Top-down approach –Derive from the organization’s data warehouse –May be too hard to do all at once Bottom-up approach –Initially create departmental data marts –Integrate data marts into organizational data warehouse –If not done carefully, may be hard to integrate

11 Decision Support Systems11 OLTP vs. Data Warehouse DBs (from Toby J. Teorey, Database Modeling & Design, Morgan Kaufmann, 1999, p. 212)
OLTP: Transaction oriented; Thousands of users; Small (MB to several GB); Current data; Normalized data (many tables, few columns per table); Continuous update; Simple to complex queries
Data Warehouse: Subject oriented; Few users (about 100 or fewer); Large (hundreds of GB to several TB); Historical data; Denormalized data (few tables, many columns per table); Batch updates; Usually very complex queries

12 Decision Support Systems12 Complex “traditional” SQL queries Relational DBMS optimized for decision support –in contrast to a DBMS optimized for transaction processing Example: –Teradata machine from NCR

13 Decision Support Systems13 On Line Analytical Processing (OLAP) Multidimensional Databases (MDD)

14 Decision Support Systems14 Example from Finkelstein [Fink95]: Note that Branch, ProdID, Date → Sales, Returns. Note the multidimensionality of the SALES_INFO table.

15 Decision Support Systems15 Dimension Hierarchies LOCATION Territory Region Branch TIME Year Quarter Week Month Date PRODUCT Category ProdID

16 Decision Support Systems16 Possible queries: 1. How did product Widget sell in the last month, and how does this figure compare with sales over the last five years? How about by branch, region and territory? 2. Did this product sell better in different regions, and are there any regional trends? 3. Were there more returns of Widgets over the last year? Were these returns caused by defects? Were they manufactured in any particular plants?

17 Decision Support Systems17 Additional Possible query: 4. Do commissions and pricing affect how salespersons sell the product? Do particular salespersons do a better job of selling the product? Note that a "multidimensional" spreadsheet would be useful. Codd called this type of problem On Line Analytical Processing (OLAP), in contrast to On Line Transaction Processing (OLTP).

18 Decision Support Systems18 Codd's rules for OLAP: [Codd93] 1. Multi-Dimensional Concept View The user should be able to see the data as being multidimensional insofar as it should be easy to 'pivot' or 'slice and dice'. (See later.) 2. Transparency The OLAP functionality should be provided behind the user's existing software without adversely affecting the functionality of the 'host'. 3. Accessibility OLAP should allow the user to access diverse data stores but see the data within a common 'schema' provided by the OLAP tool.

19 Decision Support Systems19 OLAP Rules, cont. 4. Consistent Reporting Performance There should not be significant degradation in performance with large numbers of dimensions or large quantities of data. 5. Client-Server Architecture Since much of the data is on mainframes, and the users work on PCs, the OLAP tool must be able to bring the two together! 6. Generic Dimensionality Data dimensions must all be treated equally. Functions available for one dimension must be available for others.

20 Decision Support Systems20 OLAP Rules, cont. 7. Dynamic Sparse Matrix Handling The OLAP tool should be able to work out for itself the most efficient way to store sparse matrix data. 8. Multi User Support This is self-evident. 9. Unrestricted Cross-Dimensional Operations e.g., individual office overheads are allocated according to total corporate overheads divided in proportion to individual office sales.

21 Decision Support Systems21 OLAP Rules, cont. 10. Intuitive Data Manipulation Navigation should be done by operations on individual cells rather than menus. 11. Flexible Reporting Row and column headings must be capable of more than one dimension each, and of displaying subsets of any dimension. 12. Unlimited Dimensions and Aggregation Levels At least 15 dimensions may be required, and within each there may be many hierarchical levels.

22 Decision Support Systems22 Example from Finkelstein [Fink95]: Note that Branch, ProdID, Date → Sales, Returns. Note the multidimensionality of the SALES_INFO table.

23 Decision Support Systems23 “Pivoting” Cross Tabulation Sales by Date and Region

24 Decision Support Systems24 “Drill Down” (narrower category) Replace Region by Branch. “Rollup” (more general category) Replace Region by Territory.

25 Decision Support Systems25 OLAP Questions 1. Query language - how to say what's wanted. 2. Processing language - how to specify calculations: ratios, variances,.... 3. Data visualization - how to see the data. 4. Performance - time to process the query (5 second rule).

26 Decision Support Systems26 OLAP References [Codd93] E. F. Codd, S. B. Codd, and C.T. Salley, "Providing OLAP to User Analysts: An IT Mandate," Codd & Date Inc., 1993. [Fink95] Richard Finkelstein, "MDD: Database Reaches the Next Dimension," DATABASE Programming and Design, 8(4), April 1995.

27 Decision Support Systems27 Exploratory Data Analysis Data Mining Find interesting trends or patterns in large data sets. Statistics - Exploratory Data Analysis Artificial Intelligence - Knowledge Discovery and Machine Learning Much larger data sets

28 Decision Support Systems28 Mining for Association Rules Classic example: Market basket analysis –Record each customer transaction at a grocery store. –Try to identify sets of items purchased together.

29 Decision Support Systems29 Association Rule: {coke} → {chips} People who buy coke usually buy chips. Measures for Association Rule {LHS} → {RHS} Support: % of transactions containing this set of items. (2/5=40%) Confidence: given all transactions containing LHS items, the % that also contain the RHS (2/3=67%) Want both to be “reasonably” large.
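A minimal sketch of the two measures follows. The slide's transaction table is not part of the transcript, so the five baskets below are hypothetical, chosen so that the rule {coke} → {chips} has the same support (2/5) and confidence (2/3) as quoted.

```python
# Five hypothetical baskets, chosen so the numbers match the slide (2/5 and 2/3).
TRANSACTIONS = [
    {"coke", "chips"},
    {"coke", "chips", "bread"},
    {"coke", "milk"},
    {"bread", "milk"},
    {"chips", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)


rule_support = support({"coke", "chips"}, TRANSACTIONS)          # 2 of 5 baskets
rule_confidence = confidence({"coke"}, {"chips"}, TRANSACTIONS)  # 2 of 3 coke baskets
```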

30 Decision Support Systems30 On-Line Analytical Processing (OLAP) Part II: CIS 671 Elmasri & Navathe §26.1

31 Decision Support Systems31 Multi-dimensional View of Data Fact Table (also called cubes) –Dimension attributes –Dependent attributes (functions of the dimension attributes) Dimension Tables, potentially one for each dimension

32 Decision Support Systems32 OLAP Operations Roll-up – increase the level of aggregation Drill-down - decrease the level of aggregation Slice-and-dice - selection and projection, i.e., reduce dimensionality of the data Pivot – re-orient the dimensional view
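The four operations can be illustrated on a toy relational table. This is a sketch under assumptions: the `sales_info` and `location` tables, their columns, and all values are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_info (branch TEXT, prod_id TEXT, month TEXT, sales INTEGER);
    CREATE TABLE location (branch TEXT, region TEXT);
    INSERT INTO location VALUES ('B1', 'East'), ('B2', 'East'), ('B3', 'West');
    INSERT INTO sales_info VALUES
        ('B1', 'widget', 'Jan', 10), ('B2', 'widget', 'Jan', 20),
        ('B3', 'widget', 'Jan', 5),  ('B1', 'widget', 'Feb', 7),
        ('B3', 'gadget', 'Feb', 4);
""")

# Drill-down level: sales per branch.
by_branch = conn.execute(
    "SELECT branch, SUM(sales) FROM sales_info GROUP BY branch ORDER BY branch"
).fetchall()

# Roll-up: replace branch by the coarser region level.
by_region = conn.execute(
    "SELECT region, SUM(sales) FROM sales_info JOIN location USING (branch) "
    "GROUP BY region ORDER BY region"
).fetchall()

# Slice: fix one dimension value (product = widget), keep the others.
widget_by_month = conn.execute(
    "SELECT month, SUM(sales) FROM sales_info WHERE prod_id = 'widget' "
    "GROUP BY month ORDER BY month"
).fetchall()

# Pivot: cross-tabulate region (rows) against month (columns).
pivot = conn.execute(
    "SELECT region, "
    "SUM(CASE WHEN month = 'Jan' THEN sales ELSE 0 END) AS jan, "
    "SUM(CASE WHEN month = 'Feb' THEN sales ELSE 0 END) AS feb "
    "FROM sales_info JOIN location USING (branch) GROUP BY region ORDER BY region"
).fetchall()
```

Roll-up and drill-down differ only in which level of the location hierarchy appears in the GROUP BY clause.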

33 Decision Support Systems33 Implementation Approaches Relational OLAP (ROLAP) Servers –Data stored in a relational system –SQL extended To allow easy OLAP query expression To provide efficient OLAP query execution. Multidimensional OLAP (MOLAP) –Systems directly store multidimensional data in special data structures –OLAP operations implemented directly on these data structures. Hybrid OLAP (HOLAP) –Combines ROLAP and MOLAP. –Detail records (largest volume) in relational database. –Aggregations in separate, but “connected”, MOLAP store.

34 Decision Support Systems34 Example of a Star Schema
Sales (Fact) table: OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice
Order: OrderNo, OrderDate
Customer: CustomerNo, CustomerName, CustomerAddress, City
Salesperson: SalespersonID, SalespersonName, City, Quota
City: CityName, State, Region
Date: DateKey, Date, Month, Year
Product: ProdNo, ProdName, ProdDescr, Category, CategoryDescr, UnitPrice, QOH

35 Decision Support Systems35 Snowflake Schema
Sales (Fact) table: OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice
Order: OrderNo, OrderDate
Customer: CustomerNo, CustomerName, CustomerAddress, City
Salesperson: SalespersonID, SalespersonName, City, Quota
City: CityName, State, Region
Date: DateKey, Date, Month, Year
Product: ProdNo, ProdName, ProdDescr, Category, UnitPrice, QOH
Further normalized tables: Month (Month, Year); Category (CategoryName, CategoryDescr); State (State, Region); Year; Region

36 Decision Support Systems36 Data Cubes Precompute all possible aggregations. Required extra storage is tolerable. Little penalty to keep aggregate up-to-date if data does not change. Normally some aggregation of raw data is done before it is entered into the data cube.

37 Decision Support Systems37 Data Cube with Orders Accumulated
Sales table: SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue
Customer: CustomerNo, CustomerName, CustomerAddress, City
Salesperson: SalespersonID, SalespersonName, City, Quota
City: CityName, State
Date: DateKey, Date, Month
Product: ProdNo, ProdName, ProdDescr, Category, UnitPrice, QOH
Further normalized tables: Month (Month, Year); Category (CategoryName, CategoryDescr); State (State, Region); Year; Region
Note that the average for any aggregate can be calculated from TotalValue and Quantity.

38 Decision Support Systems38 Sample of Aggregates in the CUBE
Sales (SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
22111002‘Columbus’3300
CUBE(Sales) (SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
22111002‘Columbus’3300
22*1002‘Columbus’62222
22**2‘Columbus’2533000
***2‘Columbus’7590000
****‘Columbus’200503444
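One way to see how such a relation arises is to generate it from a handful of fact rows. The sketch below is an assumption-laden illustration, not the slide's data: the three `SALES` rows are hypothetical, and every combination of dimensions is rolled up to `*` by brute force.

```python
from collections import defaultdict
from itertools import product

DIMS = ("SalespersonID", "CustomerNo", "ProdNo", "DateKey", "CityName")

# Hypothetical fact rows: five dimension values, then Quantity and TotalValue.
SALES = [
    (22, 11, 100, 2, "Columbus", 3, 300),
    (22, 12, 100, 2, "Columbus", 3, 330),
    (23, 11, 101, 2, "Columbus", 1, 50),
]


def cube(rows, n_dims):
    """Return {dimension tuple with '*' markers: (sum Quantity, sum TotalValue)}."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        dims, (qty, value) = row[:n_dims], row[n_dims:]
        # Each dimension is either kept or replaced by '*': 2**n_dims keys per row.
        for mask in product((False, True), repeat=n_dims):
            key = tuple("*" if starred else d for d, starred in zip(dims, mask))
            totals[key][0] += qty
            totals[key][1] += value
    return {k: tuple(v) for k, v in totals.items()}


CUBE_SALES = cube(SALES, len(DIMS))
```

For example, `CUBE_SALES[(22, "*", 100, 2, "Columbus")]` holds the totals over all customers of salesperson 22 for product 100 on date key 2.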

39 Decision Support Systems39 How to answer query given the relation CUBE(Sales) Choose tuples in CUBE(Sales) with the following properties: 1. Query specifies value v for attribute a ⇒ tuple t has v in its component for a. 2. Query groups by attribute a ⇒ tuple t has any non-* value in its component for a. 3. Query neither groups by attribute a nor specifies a value for a ⇒ tuple t has * in its component for a.

40 Decision Support Systems40 How to answer query given the relation CUBE(Sales)
Cube(Sales) (SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
22111002‘Columbus’3300
22*1002‘Columbus’62222
22**2‘Columbus’2533000
***2‘Columbus’7590000
****‘Columbus’200503444
Query: select CustomerNo, avg(Price) from Sales where SalespersonID = 22 group by CustomerNo
Matching tuples in Cube(Sales): (22, c, *, *, *, n, v)
Result: (c, v/n)
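The three selection rules translate almost directly into code. The following is a sketch, assuming a cube stored as a dictionary from dimension tuples to (Quantity, TotalValue); the tuples listed are hypothetical, and the average is taken as TotalValue / Quantity, i.e. the v/n of the slide.

```python
# A few hypothetical CUBE(Sales) tuples:
# (SalespersonID, CustomerNo, ProdNo, DateKey, CityName) -> (Quantity, TotalValue)
DIMS = ("SalespersonID", "CustomerNo", "ProdNo", "DateKey", "CityName")
CUBE_SALES = {
    (22, 11, "*", "*", "*"): (4, 400),
    (22, 12, "*", "*", "*"): (6, 300),
    (23, 11, "*", "*", "*"): (1, 50),
    (22, "*", "*", "*", "*"): (10, 700),
    (22, 11, 100, "*", "*"): (4, 400),
}


def answer(cube, where, group_by):
    """Select cube tuples using the three rules and return {group: average}."""
    result = {}
    for key, (qty, value) in cube.items():
        ok = True
        for attr, comp in zip(DIMS, key):
            if attr in where:          # rule 1: component equals the given value
                ok = comp == where[attr]
            elif attr in group_by:     # rule 2: any non-* component
                ok = comp != "*"
            else:                      # rule 3: component must be *
                ok = comp == "*"
            if not ok:
                break
        if ok:
            group = tuple(c for a, c in zip(DIMS, key) if a in group_by)
            result[group] = value / qty  # average = TotalValue / Quantity
    return result


avg_by_customer = answer(CUBE_SALES, {"SalespersonID": 22}, {"CustomerNo"})
```

Only the tuples of the form (22, c, *, *, *) survive the filter, so no aggregation has to be done at query time.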

41 Decision Support Systems41 Cube Implementation by Materialized Views Dimensions may have hierarchies. –Product, Category –City, State, Region

42 Decision Support Systems42 Example: Materialized Views
Cube(Sales) (SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
City (CityName, State, Region)
insert into SalesV1 select SalespersonID, CustomerNo, Month, State, sum(Quantity) as Quantity, sum(TotalValue) as TotalValue from Sales join City on Sales.CityName = City.CityName join Date on Sales.DateKey = Date.DateKey group by SalespersonID, CustomerNo, Month, State;
insert into SalesV2 select SalespersonID, CustomerNo, Month, Region, sum(Quantity) as Quantity, sum(TotalValue) as TotalValue from Sales join City on Sales.CityName = City.CityName join Date on Sales.DateKey = Date.DateKey group by SalespersonID, CustomerNo, Month, Region;

43 Decision Support Systems43 Example: Query 1
select SalespersonID, sum(TotalValue) from Sales group by SalespersonID;
Answer by: select SalespersonID, sum(TotalValue) from SalesV1 group by SalespersonID;
or by: select SalespersonID, sum(TotalValue) from SalesV2 group by SalespersonID;

44 Decision Support Systems44 Example: Query 2
select SalespersonID, State, sum(TotalValue) from Sales group by SalespersonID, State;
Answer only by: select SalespersonID, State, sum(TotalValue) from SalesV1 group by SalespersonID, State;

45 Decision Support Systems45 Example: Query 3 select SalespersonID, State, date, sum(TotalValue) from Sales group by SalespersonID, State, Date; Cannot be answered by either SalesV1 or SalesV2. Thus must use Sales itself.
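The three example queries can be tried out on a toy database. This is a sketch under assumptions: the rows are hypothetical, the Month column is left out of both views for brevity, and the views are materialized with SQLite's CREATE TABLE ... AS rather than the INSERT statements of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE City (CityName TEXT, State TEXT, Region TEXT);
    CREATE TABLE Sales (SalespersonID INT, CustomerNo INT, CityName TEXT,
                        Quantity INT, TotalValue INT);
    INSERT INTO City VALUES ('Columbus', 'OH', 'Midwest'), ('Dayton', 'OH', 'Midwest'),
                            ('Chicago', 'IL', 'Midwest');
    INSERT INTO Sales VALUES (22, 11, 'Columbus', 3, 300), (22, 12, 'Dayton', 1, 100),
                             (22, 11, 'Chicago', 2, 250), (23, 11, 'Chicago', 1, 50);
    -- SalesV1 aggregates to State, SalesV2 to the coarser Region level.
    CREATE TABLE SalesV1 AS
        SELECT SalespersonID, CustomerNo, State,
               SUM(Quantity) AS Quantity, SUM(TotalValue) AS TotalValue
        FROM Sales JOIN City ON Sales.CityName = City.CityName
        GROUP BY SalespersonID, CustomerNo, State;
    CREATE TABLE SalesV2 AS
        SELECT SalespersonID, CustomerNo, Region,
               SUM(Quantity) AS Quantity, SUM(TotalValue) AS TotalValue
        FROM Sales JOIN City ON Sales.CityName = City.CityName
        GROUP BY SalespersonID, CustomerNo, Region;
""")


def query1(table):
    """Query 1: total value per salesperson, answerable from any of the three tables."""
    sql = (f"SELECT SalespersonID, SUM(TotalValue) FROM {table} "
           "GROUP BY SalespersonID ORDER BY SalespersonID")
    return conn.execute(sql).fetchall()


# Query 2 needs State, which SalesV1 keeps but SalesV2 has rolled up into Region.
query2_from_v1 = conn.execute(
    "SELECT SalespersonID, State, SUM(TotalValue) FROM SalesV1 "
    "GROUP BY SalespersonID, State ORDER BY SalespersonID, State"
).fetchall()
query2_from_base = conn.execute(
    "SELECT SalespersonID, State, SUM(TotalValue) "
    "FROM Sales JOIN City ON Sales.CityName = City.CityName "
    "GROUP BY SalespersonID, State ORDER BY SalespersonID, State"
).fetchall()
```

Query 3 groups by Date, which neither view retains, so it has no counterpart here other than the base table.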

46 Decision Support Systems46 Lattice of Views (figure). Time dimension: Days, Weeks, Months, Quarters, Years, All. Location dimension: City, State, Region, All.

47 Decision Support Systems47 Lattice of Materialized Views and Queries (figure). Nodes: Sales, SalesV1, SalesV2, and the queries Q1, Q2, Q3.

48 Decision Support Systems48 OLAP Example (Garcia-Molina, Ullman & Widom, Database System Implementation, Prentice Hall, 2000)
Automobile Sales Company: analyze sales of cars
Fact Table: Sales(serialNo, date, dealer, price)
Dimension Tables: Autos(serialNo, model, color); Dealers(name, city, state)
Time Dimension Table, probably not stored: Days(day, week, month, year), e.g. (5, 27, 7, 2000)

49 Decision Support Systems49 Assume a particular car model, say ‘Gobi’, is not selling as well as anticipated. How to analyze? Maybe it’s the color. Slice for ‘Gobi’. Dice for color. select color, sum(price) from Sales natural join Autos where model = ‘Gobi’ group by color; Doesn’t show anything interesting.

50 Decision Support Systems50 Gobi analysis, continuing What about time? Drill down for month. select color, month, sum(price) from Sales natural join Autos join Days on date = day where model = ‘Gobi’ group by color, month; Suppose we discover red Gobis have not sold well recently.

51 Decision Support Systems51 Gobi analysis, continuing Are red Gobis selling poorly for all dealers or just some? Drill down for dealer. select dealer, month, sum(price) from Sales natural join Autos join Days on date = day where model = ‘Gobi’ and color = ‘red’ group by dealer, month; Discover there are too few sales to show anything interesting.

52 Decision Support Systems52 Gobi analysis, continuing Rollup time from month to year and slice for last two years. select dealer, year, sum(price) from Sales natural join Autos join Days on date = day where model = ‘Gobi’ and color = ‘red’ and (year = ‘1999’ or year = ‘2000’) group by dealer, year; Does show variation. Now understand the problem better.
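The first and last steps of the Gobi analysis can be replayed on a toy database. This is only a sketch: the rows, dealer names, and the use of an integer day key are hypothetical, and SQLite stands in for whatever DBMS the example assumes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales (serialNo INT, date INT, dealer TEXT, price INT);
    CREATE TABLE Autos (serialNo INT, model TEXT, color TEXT);
    CREATE TABLE Days (day INT, week INT, month INT, year INT);
    INSERT INTO Days VALUES (1, 20, 5, 1999), (2, 27, 7, 2000);
    INSERT INTO Autos VALUES (1, 'Gobi', 'red'), (2, 'Gobi', 'red'),
                             (3, 'Gobi', 'blue'), (4, 'Other', 'red');
    INSERT INTO Sales VALUES (1, 1, 'Smith', 20000), (2, 2, 'Jones', 18000),
                             (3, 2, 'Smith', 21000), (4, 2, 'Smith', 15000);
""")

# Slice for 'Gobi', dice by color.
by_color = conn.execute("""
    SELECT color, SUM(price) FROM Sales NATURAL JOIN Autos
    WHERE model = 'Gobi' GROUP BY color ORDER BY color
""").fetchall()

# Slice further to red Gobis, then roll time up to the year level per dealer.
red_by_dealer_year = conn.execute("""
    SELECT dealer, year, SUM(price)
    FROM Sales NATURAL JOIN Autos JOIN Days ON date = day
    WHERE model = 'Gobi' AND color = 'red' AND year IN (1999, 2000)
    GROUP BY dealer, year ORDER BY dealer, year
""").fetchall()
```

The intermediate drill-downs by month differ only in replacing `year` with `month` in the SELECT and GROUP BY lists.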

53 Decision Support Systems53 Administration Lab assignments and HWs posted on the web. Clarifications/Questions? Please use appropriate online submit command Teams of 2 allowed but make contribution of each team member explicit especially in the lab assignment. Extra Credit assignment in lab. Bring questions to class on Thursday

54 Decision Support Systems54 Color codes: meaning → tuple representation (time in quarters, product, country, Tsales). time, product, country are dimension attributes; Tsales is total sales.
White squares (basic fact table) → (q, p, c, sales)
Green squares: total annual sales grouped by product and country → (*, p, c, Tsales)
Dark green squares: total annual sales grouped by product → (*, p, *, Tsales)
Orange squares: total annual sales grouped by quarter and country → (q, *, c, Tsales)
Dark orange squares: total annual sales grouped by quarter → (q, *, *, Tsales)
Grey: total annual sales grouped by country → (*, *, c, Tsales)
Other pair (quarter and product) not shown (need to pivot) → (q, p, *, Tsales)
Dark blue (all sales) → (*, *, *, sales)
Size of white cube = Q × P × C; size of colored cube = (Q+1) × (P+1) × (C+1). Why? Think of * as another category along each dimension.
Size of colored cube with hierarchy → even larger!
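The size formula can be checked by enumerating the cells directly. A small sketch, with hypothetical dimension members (4 quarters, 3 products, 2 countries):

```python
from itertools import product


def cube_cells(*dimension_values):
    """Enumerate every cell of the colored cube: each dimension gains a '*' member."""
    extended = [list(values) + ["*"] for values in dimension_values]
    return list(product(*extended))


quarters = ["Q1", "Q2", "Q3", "Q4"]
products = ["p1", "p2", "p3"]
countries = ["c1", "c2"]

cells = cube_cells(quarters, products, countries)
base_cells = [c for c in cells if "*" not in c]  # the white squares
```

Here `len(base_cells)` is Q × P × C = 24 and `len(cells)` is (Q+1) × (P+1) × (C+1) = 60.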

56 Decision Support Systems56 Aggregation causes Database Explosion in Large Multi-dimensional Applications as the Number of Dimensions Increases Based on Nigel Pendse, “Database Explosion”, www.olapreport.com/DatabaseExplosion.htm

57 Decision Support Systems57 Factors not causing data explosion Poor handling of data sparsity. –No more than factor of 4 vs. factors of 10s or 100s Type of database technology. –Although optimized storage technology will be significantly better. Lack of data compression. –Compression is helpful, but explosion still occurs. Software errors –Again, a different problem.

58 Decision Support Systems58 Multi-dimensional Database (MDB) can save significant space Keys, indexes & dimensional structures. –Not required or take far less space. Sparsity better suppressed. Data compressed. Example: –6-dimensional (including measures) banking cube –13 million row fact table –Relational fact table incl. indexes, but not aggregates: 5188 Mb –MOLAP cube including aggregations: 336 Mb –Well under 10% the space. –Much faster query processing.

59 Decision Support Systems59 Why is there a data explosion even without sparsity? Take a two-dimensional example. n: data from original source. m: data aggregations precalculated. p: on-the-fly results, not stored. Cell counts: $n^2$ for the source data, $(n+m)^2$ with the precalculated aggregations, $(n+m+p)^2$ with the on-the-fly results. Simplifying to $n=m=p$: $1n^2$, $4n^2$, $9n^2$. In 3 dimensions this becomes $1n^3$, $8n^3$, $27n^3$.

60 Decision Support Systems60 When Data is Sparse it’s much worse. One-dimensional data. Simple hierarchy. Black - actual data, red - nulls. Detailed level: 8 of 25 or 32%. Aggregated levels: 5 of 6 or 83%. Growth factor: 1.625 (13 cells based on 8 input cells)

61 Decision Support Systems61 Two dimensions: The problem gets worse. (Figure regions: detail data, aggregated data.) Potential input cells: 25*25 = 625. Potential aggregated cells: 6*6 + 6*25 + 6*25 = 336. More than 1 derived cell for every 2 possible input cells. In 6 dimensions, could have 2 or 3 derived cells per 1 input cell.
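The counts on this slide follow from treating each dimension as 25 detail members plus 6 aggregated members. A short sketch, assuming every dimension has the same shape (the function name is hypothetical):

```python
def potential_cells(detail, aggregated, dims):
    """Return (input cells, derived cells) for `dims` identical dimensions."""
    inputs = detail ** dims
    total = (detail + aggregated) ** dims
    return inputs, total - inputs


one_dim = potential_cells(25, 6, 1)
two_dim = potential_cells(25, 6, 2)

# Derived cells per potential input cell in 6 dimensions.
six_inputs, six_derived = potential_cells(25, 6, 6)
six_dim_ratio = six_derived / six_inputs
```

`two_dim` reproduces the 625 input and 336 aggregated cells, and `six_dim_ratio` lands between 2 and 3, in line with the last sentence of the slide.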

62 Decision Support Systems62 What about higher dimensions? One percent density, 6 of 625 input cells. Yields 29 computed cells. I.e., 35 total cells, only 6 input. Growth factor: 5.83. Growth factor per dimension: sqrt(5.83)=2.4. –Called compound growth factor (CGF). CGF is typically in the range 1.5 to 2.5. CGF increases as sparsity increases. With large dimensions, will often be more consolidation. –(Many thousands of products → more levels of groupings.) With CGF of 2.0, extra dimension with no increase in input data, will double size of fully computed database.
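The compound growth factor arithmetic can be written out as follows. This is a sketch; the function names and the one-million-cell input size are hypothetical, while the 35, 6, and 2.0 figures come from the slide.

```python
def compound_growth_factor(total_cells, input_cells, dims):
    """Per-dimension growth: the dims-th root of total cells over input cells."""
    return (total_cells / input_cells) ** (1.0 / dims)


def projected_size(input_cells, cgf, dims):
    """Size of the fully computed database for a given CGF and dimension count."""
    return input_cells * cgf ** dims


cgf = compound_growth_factor(35, 6, 2)        # about 2.4, as on the slide
size_5d = projected_size(1_000_000, 2.0, 5)   # hypothetical 1M input cells
size_6d = projected_size(1_000_000, 2.0, 6)   # one more dimension doubles it
```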

63 Decision Support Systems63 So what is the problem? Disk space increases. Can software handle this much data? Time to load and update database increases. –Could take days to load the database.

64 Decision Support Systems64 What to do? Avoid fully pre-calculating any multi-dimensional object with more than 5 sparse dimensions. Reduce sparsity of individual data objects: –Use good application design. What to pre-calculate? Data that is slow to calculate at run-time because it depends on many other cells or complex formulae. Data that is frequently viewed. Data that is the basis of many other calculations. Note: If too much is precalculated, performance may decrease because cache will not include as much useful data.

