Jerry Post McGraw-Hill/Irwin Copyright © 2005 by The McGraw-Hill Companies, Inc. All rights reserved. Database Management Systems Chapter 8 Data Warehouses.

DATABASE 2 Sequential Storage and Indexes  We picture tables as simple rows and columns, but they cannot be stored this way.  It takes too many operations to find an item.  Insertions require reading and rewriting the entire table.
ID  LastName   FirstName  DateHired
1   Reeves     Keith      1/29/98
2   Gibson     Bill       3/31/98
3   Reasoner   Katy       2/17/98
4   Hopkins    Alan       2/8/98
5   James      Leisha     1/6/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
8   Carpenter  Carlos     12/29/98
9   O'Connor   Jessica    7/23/98
10  Shields    Howard     7/13/98

DATABASE 3 Operations on Sequential Tables  Read entire table: easy and fast.  Sequential retrieval: easy and fast for one sort order.  Random read: very weak. Each row is equally likely to be requested (probability 1/N) and must be reached by sequential retrieval, so 1,000,000 rows means about 500,000 retrievals per lookup on average.  Delete: easy.  Insert/Modify: very weak.
Row  Prob.  # Reads
A    1/N    1
B    1/N    2
C    1/N    3
D    1/N    4
E    1/N    5
…    1/N    i
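
The "500,000 retrievals" figure follows from the table: if row i costs i reads and every row has probability 1/N, the expected cost is the sum of i/N, which is (N + 1)/2. A minimal sketch of that arithmetic (the function names are illustrative, not from the slides):

```python
from fractions import Fraction

def expected_reads(n):
    """Average reads when each of n rows is equally likely and row i costs i reads."""
    return sum(Fraction(i, n) for i in range(1, n + 1))

def expected_reads_closed_form(n):
    """Closed form of the same sum: (n + 1) / 2."""
    return Fraction(n + 1, 2)

# The 14-name list used on the binary search slide averages 7.5 lookups.
print(expected_reads(14))                             # 15/2
print(float(expected_reads_closed_form(1_000_000)))   # 500000.5
```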

DATABASE 4 Insert into Sequential Table  Insert Inez:  Find insert location.  Copy top to new file.  At insert location, add row.  Copy rest of file.
ID  LastName   FirstName  DateHired
8   Carpenter  Carlos     12/29/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
2   Gibson     Bill       3/31/98
4   Hopkins    Alan       2/8/98
5   James      Leisha     1/6/98
9   O'Connor   Jessica    7/23/98
3   Reasoner   Katy       2/17/98
1   Reeves     Keith      1/29/98
10  Shields    Howard     7/13/98
ID  LastName   FirstName  DateHired
8   Carpenter  Carlos     12/29/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
2   Gibson     Bill       3/31/98
5   James      Leisha     1/6/98
9   O'Connor   Jessica    7/23/98
3   Reasoner   Katy       2/17/98
1   Reeves     Keith      1/29/98
10  Shields    Howard     7/13/98
11  Inez       Maria      1/15/99

DATABASE 5 Binary Search  Given a sorted list of names, how do you find Jones?  Sequential search: Jones = 10 lookups. Average = 15/2 = 7.5 lookups. Min = 1, Max = 14.  Binary search: Find midpoint (14 / 2) = 7. Jones > Goetz, Jones < Kalida, Jones > Inez, Jones = Jones (4 lookups).  Max = log2(N) lookups: N = 1,000 gives Max = 10; N = 1,000,000 gives Max = 20.
Sorted list (14 entries); the number in parentheses is the order in which binary search probes that entry:
Adams
Brown
Cadiz
Dorfmann
Eaton
Farris
Goetz (1)
Hanson
Inez (3)
Jones (4)
Kalida (2)
Lomax
Miranda
Norman
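
The probe sequence on this slide can be traced with a short sketch. The function below is an illustrative implementation (its name and the decision to record probes are assumptions, not part of the slides); with the 14 names it visits Goetz, Kalida, Inez, and Jones, matching the four lookups listed.

```python
def binary_search(names, target):
    """Return (index, probes) for target in a sorted list; index is -1 if absent."""
    lo, hi = 0, len(names) - 1
    probes = []  # names examined, in order
    while lo <= hi:
        mid = (lo + hi) // 2
        probes.append(names[mid])
        if names[mid] == target:
            return mid, probes
        if names[mid] < target:
            lo = mid + 1   # target sorts after the midpoint
        else:
            hi = mid - 1   # target sorts before the midpoint
    return -1, probes

names = ["Adams", "Brown", "Cadiz", "Dorfmann", "Eaton", "Farris", "Goetz",
         "Hanson", "Inez", "Jones", "Kalida", "Lomax", "Miranda", "Norman"]

index, probes = binary_search(names, "Jones")
print(index, probes)
```

Each probe halves the remaining range, which is why the maximum grows with log2(N) rather than N.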

DATABASE 6 Pointers  When data is stored on a drive (or in RAM), the operating system allocates space with a function call and provides the location/address.  Physical address.  Virtual address (VSAM): imaginary drive values mapped to physical locations.  Relative address: distance from start of file, or from another reference point.
Figure labels: Data, Address, Key value, Address / pointer, Volume, Track, Cylinder/Sector, Byte Offset, Drive Head.

DATABASE 7 Indexed Sequential Storage  Common uses  Large tables.  Need many sequential lists.  Some random search, with one or two key columns.  Mostly replaced by B+-Tree.
Address  ID  LastName   FirstName  DateHired
A11      1   Reeves     Keith      1/29/98
A22      2   Gibson     Bill       3/31/98
A32      3   Reasoner   Katy       2/17/98
A42      4   Hopkins    Alan       2/8/98
A47      5   James      Leisha     1/6/98
A58      6   Eaton      Anissa     8/23/98
A63      7   Farris     Dustin     3/28/98
A67      8   Carpenter  Carlos     12/29/98
A78      9   O'Connor   Jessica    7/23/98
A83      10  Shields    Howard     7/13/98
ID  Pointer
1   A11
2   A22
3   A32
4   A42
5   A47
6   A58
7   A63
8   A67
9   A78
10  A83
LastName   Pointer
Carpenter  A67
Eaton      A58
Farris     A63
Gibson     A22
Hopkins    A42
James      A47
O'Connor   A78
Reasoner   A32
Reeves     A11
Shields    A83
Indexed for ID and LastName
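
As a rough illustration of how an index entry leads to a row, the sketch below stores the rows by address and searches the sorted LastName index with a binary search. The dictionary and list are stand-ins for disk storage and are assumptions for this example only; the addresses and names come from the slide.

```python
import bisect

# Rows keyed by storage address (stand-in for locations on disk).
rows = {
    "A11": (1, "Reeves", "Keith", "1/29/98"),
    "A22": (2, "Gibson", "Bill", "3/31/98"),
    "A32": (3, "Reasoner", "Katy", "2/17/98"),
    "A42": (4, "Hopkins", "Alan", "2/8/98"),
    "A47": (5, "James", "Leisha", "1/6/98"),
    "A58": (6, "Eaton", "Anissa", "8/23/98"),
    "A63": (7, "Farris", "Dustin", "3/28/98"),
    "A67": (8, "Carpenter", "Carlos", "12/29/98"),
    "A78": (9, "O'Connor", "Jessica", "7/23/98"),
    "A83": (10, "Shields", "Howard", "7/13/98"),
}

# LastName index: (key, pointer) pairs kept in key order.
last_name_index = sorted((row[1], address) for address, row in rows.items())
index_keys = [key for key, _ in last_name_index]

def find_by_last_name(name):
    """Binary-search the index, then follow the pointer to the row."""
    i = bisect.bisect_left(index_keys, name)
    if i < len(index_keys) and index_keys[i] == name:
        return rows[last_name_index[i][1]]
    return None

print(find_by_last_name("James"))
```

The table itself stays in its original order; only the small index has to be kept sorted for each searchable column.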

DATABASE 8 Linked List  Separate each element/key.  Pointers to next element.  Pointers to data.  Starting point.
Key        Node address  Next element  Data pointer
Carpenter  B87           B29           A67
Eaton      B29           B71           A58
Farris     B71           B38           A63
Gibson     B38           00            A22
Address  ID  LastName   FirstName  DateHired
A63      7   Farris     Dustin     3/28/98
A67      8   Carpenter  Carlos     12/29/98
A58      6   Eaton      Anissa     8/23/98
A22      2   Gibson     Bill       3/31/98

DATABASE 9 B-Tree  Store key values  Utilize binary search (or better).  Trees  Nodes  Root  Leaf (node with no children)  Levels / depth  Degree (maximum number of children per node)
Example tree, one level per line:
Level 1 (root): Hanson
Level 2: Dorfmann, Kalida
Level 3: Brown, Farris, Inez, Miranda
Level 4 (leaves): Adams, Cadiz, Eaton, Goetz, Jones, Lomax, Norman
Node detail in the figure (for example, Inez): Key, Data, with branches labeled <, =, and >.

DATABASE 10 Index Options: Bitmaps and Statistics  Bitmap index  A compressed index designed for non-primary key columns. Bit-wise operations can be used to quickly match WHERE criteria.  Analyze statistics  By collecting statistics about the actual data within the index, the DBMS can optimize the search path. For example, if it knows that only a few rows match one of your search conditions in a table, it can apply that condition first, reducing the amount of work needed to join tables.

DATABASE 11 Problems with Indexes  Each index must be updated when rows are inserted, deleted or modified.  Changing one row of data in a table with many indexes can result in considerable time and resources to update all of the indexes.  Steps to improve performance  Index primary keys  Index common join columns (usually primary keys)  Index columns that are searched regularly  Use a performance analyzer

DATABASE 12 Data Warehouse  [Figure: OLTP database (3NF tables, operations data, predefined reports); daily data transfer; data warehouse (star configuration, interactive data analysis); flat files.]

DATABASE 13 Data Warehouse Goals  Existing databases are optimized for Online Transaction Processing (OLTP).  Online Analytical Processing (OLAP) requires fast retrievals, and only bulk writes.  Different goals require different storage, so build a separate data warehouse to use for queries.  Extraction, Transformation, Transportation (ETT)  Data analysis  Ad hoc queries  Statistical analysis  Data mining (specialized automated tools)

DATABASE 14 Extraction, Transformation, and Transportation (ETT) Data warehouse: All data must be consistent. Customers Convert Client to Customer Apply standard product numbers Convert currencies Fix region codes Transaction data from diverse systems.

DATABASE 15 OLTP v. OLAP

DATABASE 16 Multidimensional Cube Time Sale Date Customer Location Category Pet Store Item Sales Amount = Quantity*Sale Price

DATABASE 17 Sales Date: Time Hierarchy  Levels: Year, Quarter, Month, Week, Day  Roll-up: To get higher-level totals  Drill-down: To get lower-level details

DATABASE 18 Star Design  Fact Table: Sales (Quantity, Amount=SalePrice*Quantity)  Dimension Tables: Products, Customer Location, Sales Date

DATABASE 19 Snowflake Design  Dimension tables can join to other dimension tables.
OLAPItems: SaleID, ItemID, Quantity, SalePrice, Amount
Merchandise: ItemID, Description, QuantityOnHand, ListPrice, Category
Sale: SaleID, SaleDate, EmployeeID, CustomerID, SalesTax
Customer: CustomerID, Phone, FirstName, LastName, Address, ZipCode, CityID
City: CityID, ZipCode, City, State

DATABASE 20 OLAP Computation Issues  Compute Quantity*Price in the base query for each row, then add the results to get $23.00.  If you use a Calculated Measure in the Cube, it will add first and multiply second to get $45.00, which is wrong.
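
The difference between the two orders of computation can be checked with a few lines of arithmetic. The slide's underlying rows are not in this transcript, so the two sale items below are hypothetical values chosen only because they reproduce the $23.00 and $45.00 totals.

```python
from decimal import Decimal

# Hypothetical (quantity, price) rows; NOT taken from the original slide.
sale_items = [(2, Decimal("4.00")), (3, Decimal("5.00"))]

# Correct: multiply within each row, then add the row amounts.
correct_total = sum(quantity * price for quantity, price in sale_items)

# Wrong: add quantities and prices separately, then multiply the sums.
wrong_total = (sum(quantity for quantity, _ in sale_items)
               * sum(price for _, price in sale_items))

print(correct_total)  # 23.00
print(wrong_total)    # 45.00
```

Storing the row-level amount in the fact table, as the star design slide does with Amount, avoids relying on the cube to apply the operations in the right order.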

DATABASE 21 OLAP Data Browsing

DATABASE 22 Microsoft Pivot Table

DATABASE 23 OLAP in SQL 99  GROUP BY two columns gives you totals for each month within each category. You do not get super-aggregate totals for the category, or the month, or the overall total.
SELECT Category, Month(SaleDate) AS Month, Sum(Quantity*SalePrice) AS Amount
FROM Sale INNER JOIN (Merchandise INNER JOIN SaleItem ON Merchandise.ItemID = SaleItem.ItemID) ON Sale.SaleID = SaleItem.SaleID
GROUP BY Category, Month(SaleDate);
Category  Month  Amount
Bird      1      $135.00
Bird      2      $45.00
Bird      3      $202.50
Bird      6      $67.50
Bird      7      $90.00
Bird      9      $67.50
Cat       1      $396.00
Cat       2      $113.85
Cat       3      $443.70
Cat       4      $2.25

DATABASE 24 SQL ROLLUP
SELECT Category, Month…, Sum …
FROM …
GROUP BY ROLLUP (Category, Month...)
Category  Month   Amount
Bird      1       135.00
Bird      2       45.00
…
Bird      (null)  607.50
Cat       1       396.00
Cat       2       113.85
…
Cat       (null)  1293.30
…
(null)    (null)  8451.79
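
To show what ROLLUP adds to a plain GROUP BY, here is a small sketch that builds the same three levels of rows by hand: detail rows, one subtotal per category, and a grand total. It uses only four of the slide's detail rows, so its subtotals are for that subset and will not match the slide's 607.50, 1293.30, or 8451.79. The function name and the use of None for (null) are assumptions for the example.

```python
from collections import defaultdict
from decimal import Decimal

# Subset of the slide's detail rows: (Category, Month, Amount).
rows = [
    ("Bird", 1, Decimal("135.00")),
    ("Bird", 2, Decimal("45.00")),
    ("Cat", 1, Decimal("396.00")),
    ("Cat", 2, Decimal("113.85")),
]

def rollup(rows):
    """Mimic GROUP BY ROLLUP (Category, Month); None marks a super-aggregate."""
    detail = defaultdict(Decimal)
    by_category = defaultdict(Decimal)
    grand_total = Decimal(0)
    for category, month, amount in rows:
        detail[(category, month)] += amount
        by_category[category] += amount
        grand_total += amount
    result = []
    for category in sorted(by_category):
        for key in sorted(k for k in detail if k[0] == category):
            result.append((key[0], key[1], detail[key]))
        result.append((category, None, by_category[category]))  # category subtotal
    result.append((None, None, grand_total))                     # overall total
    return result

for row in rollup(rows):
    print(row)
```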

DATABASE 25 Missing Values Cause Problems  If there are missing values in the groups, it can be difficult to identify the super-aggregate rows.
Category  Month   Amount
Bird      1       135.00
Bird      2       45.00
…
Bird      (null)  32.00    (missing date)
Bird      (null)  607.50   (super-aggregate)
Cat       1       396.00
Cat       2       113.85
…
Cat       (null)  1293.30
…
(null)    (null)  8451.79

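DATABASE 26 GROUPING Function  GROUPING returns 1 when the column has been rolled up into a super-aggregate row and 0 otherwise, so the missing-date row (Gm = 0) can be told apart from the Bird subtotal (Gm = 1).
SELECT Category, Month…, Sum …, GROUPING (Category) AS Gc, GROUPING (Month) AS Gm
FROM …
GROUP BY ROLLUP (Category, Month...)
Category  Month   Amount   Gc  Gm
Bird      1       135.00   0   0
Bird      2       45.00    0   0
…
Bird      (null)  32.00    0   0
Bird      (null)  607.50   0   1
Cat       1       396.00   0   0
Cat       2       113.85   0   0
…
Cat       (null)  1293.30  0   1
…
(null)    (null)  8451.79  1   1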

DATABASE 27 CUBE Option  CUBE adds the month totals across all categories (Gc = 1, Gm = 0) to the rows produced by ROLLUP.
SELECT Category, Month, Sum, GROUPING (Category) AS Gc, GROUPING (Month) AS Gm
FROM …
GROUP BY CUBE (Category, Month...)
Category  Month   Amount   Gc  Gm
Bird      1       135.00   0   0
Bird      2       45.00    0   0
…
Bird      (null)  32.00    0   0
Bird      (null)  607.50   0   1
Cat       1       396.00   0   0
Cat       2       113.85   0   0
…
Cat       (null)  1293.30  0   1
(null)    1       1358.8   1   0
(null)    2       1508.94  1   0
(null)    3       2362.68  1   0
…
(null)    (null)  8451.79  1   1

DATABASE 28 GROUPING SETS: Hiding Details
SELECT Category, Month, Sum
FROM …
GROUP BY GROUPING SETS (ROLLUP (Category), ROLLUP (Month), ( ) )
Category  Month   Amount
Bird      (null)  607.50
Cat       (null)  1293.30
…
(null)    1       1358.8
(null)    2       1508.94
(null)    3       2362.68
…
(null)    (null)  8451.79

DATABASE 29 SQL OLAP Analytical Functions
VAR_POP, VAR_SAMP: variance
STDDEV_POP, STDDEV_SAMP: standard deviation
COVAR_POP, COVAR_SAMP: covariance
CORR: correlation
REGR_R2: regression r-square
REGR_SLOPE, REGR_INTERCEPT: regression data (many)

DATABASE 30 SQL RANK Functions  DENSE_RANK does not skip numbers.
SELECT Employee, SalesValue,
  RANK() OVER (ORDER BY SalesValue DESC) AS rank,
  DENSE_RANK() OVER (ORDER BY SalesValue DESC) AS dense
FROM Sales
ORDER BY SalesValue DESC, Employee;
Employee  SalesValue  rank  dense
Jones     18,000      1     1
Smith     16,000      2     2
Black     16,000      2     2
White     14,000      4     3
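
The tie-handling difference between the two functions can be sketched in a few lines. This is an illustrative re-implementation for descending order only (the function name is an assumption); it reproduces the rank and dense columns of the slide's table.

```python
def rank_and_dense_rank(sales):
    """Return (employee, value, rank, dense_rank) with values ranked descending."""
    ordered = sorted((value for _, value in sales), reverse=True)
    rank, dense = {}, {}
    for position, value in enumerate(ordered, start=1):
        if value not in rank:
            rank[value] = position         # RANK: position of the first tied row
            dense[value] = len(dense) + 1  # DENSE_RANK: count of distinct values so far
    return [(name, value, rank[value], dense[value]) for name, value in sales]

sales = [("Jones", 18000), ("Smith", 16000), ("Black", 16000), ("White", 14000)]
for row in rank_and_dense_rank(sales):
    print(row)
```

After the two-way tie at 16,000, RANK jumps to 4 because three rows are ahead of White, while DENSE_RANK continues with 3.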

DATABASE 31 SQL OLAP Windows
SELECT Category, SaleMonth, MonthAmount,
  AVG(MonthAmount) OVER (PARTITION BY Category ORDER BY SaleMonth ASC ROWS 2 PRECEDING) AS MA
FROM qryOLAPSQL99
ORDER BY SaleMonth ASC;
Category  SaleMonth  MonthAmount  MA
Bird      200101     1500.00
Bird      200102     1700.00
Bird      200103     2000.00      1600.00
Bird      200104     2500.00      1850.00
…
Cat       200101     4000.00
Cat       200102     5000.00
Cat       200103     6000.00      4500.00
Cat       200104     7000.00      5500.00
…

DATABASE 32 Ranges: OVER  running_sum computes the total from the beginning through the current row.  running_sum2 does the same thing, but lists the range explicitly.  remaining_sum computes the total from the current row through the end of the query.
SELECT SaleDate, Value,
  SUM(Value) OVER (ORDER BY SaleDate) AS running_sum,
  SUM(Value) OVER (ORDER BY SaleDate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum2,
  SUM(Value) OVER (ORDER BY SaleDate RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS remaining_sum
FROM …
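
A small sketch of the two window sums, using three hypothetical daily values that are already in SaleDate order. It assumes every SaleDate is distinct; with RANGE, rows that share the same date are treated as peers and would all receive the same sum, which this sketch does not model.

```python
from itertools import accumulate

# Hypothetical values, one per distinct SaleDate, already sorted by date.
values = [1000, 1500, 2000]

# UNBOUNDED PRECEDING .. CURRENT ROW: total from the first row through this row.
running_sum = list(accumulate(values))

# CURRENT ROW .. UNBOUNDED FOLLOWING: total from this row through the last row.
remaining_sum = list(accumulate(reversed(values)))[::-1]

print(running_sum)    # [1000, 2500, 4500]
print(remaining_sum)  # [4500, 3500, 2000]
```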

DATABASE 33 LAG and LEAD Functions  LAG or LEAD: (Column, # rows, default)  Not part of standard yet? But are in SQL Server and Oracle.
SELECT SaleDate, Value,
  LAG(Value, 1, 0) OVER (ORDER BY SaleDate) AS prior_day,
  LEAD(Value, 1, 0) OVER (ORDER BY SaleDate) AS next_day
FROM …
ORDER BY SaleDate
SaleDate   Value  prior_day  next_day
1/1/2003   1000   0          1500      (prior is 0 from default value)
1/2/2003   1500   1000       2000
1/3/2003   2000   1500       2300
…
1/31/2003  3500   3200       0
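
The (Column, # rows, default) arguments can be illustrated with two short list functions. These are stand-ins written for this example, not a database API; the input is the first four daily values implied by the slide's table, so the last row's next_day falls back to the default here.

```python
def lag(values, offset=1, default=None):
    """Value from `offset` rows earlier, or `default` when there is no such row."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]

def lead(values, offset=1, default=None):
    """Value from `offset` rows later, or `default` when there is no such row."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]

values = [1000, 1500, 2000, 2300]  # already ordered by SaleDate
print(lag(values, 1, 0))   # [0, 1000, 1500, 2000]
print(lead(values, 1, 0))  # [1500, 2000, 2300, 0]
```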

DATABASE 34 Data Mining  Goal: To discover unknown relationships in the data that can be used to make better decisions. Databases Reports Queries OLAP Data Mining Transactions and operations Specific ad hoc questions Aggregate, compare, drill down Unknown relationships

DATABASE 35 Exploratory Analysis  Data Mining usually works autonomously.  Supervised/directed  Unsupervised  Often called a bottom-up approach that scans the data to find relationships  Some statistical routines, but they are not sufficient  Statistics relies on averages  Sometimes the important data lies in more detailed pairs

DATABASE 36 Common Techniques  Classification/Prediction/Regression  Association Rules/Market Basket Analysis  Clustering  Data points  Hierarchies  Neural Networks  Deviation Detection  Sequential Analysis  Time series events  Websites  Textual Analysis  Spatial/Geographic Analysis

DATABASE 37 Classification Examples  Examples  Which borrowers/loans are most likely to be successful?  Which customers are most likely to want a new item?  Which companies are likely to file bankruptcy?  Which workers are likely to quit in the next six months?  Which startup companies are likely to succeed?  Which tax returns are fraudulent?

DATABASE 38 Classification Process  Clearly identify the outcome/dependent variable.  Identify potential variables that might affect the outcome.  Supervised (modeler chooses)  Unsupervised (system scans all/most)  Use sample data to test and validate the model.  System creates weights that link independent variables to outcome. IncomeMarriedCredit HistoryJob StabilitySuccess 50000YesGood Yes 25000YesBad No 75000NoGood No

DATABASE 39 Classification Techniques  Regression  Bayesian Networks  Decision Trees (hierarchical)  Neural Networks  Genetic Algorithms  Complications  Some methods require categorical data  Data size is still a problem

DATABASE 40 Association/Market Basket  Examples  What items are customers likely to buy together?  What Web pages are closely related?  Others?  Classic (early) example:  Analysis of convenience store data showed customers often buy diapers and beer together.  Importance: Consider putting the two together to increase cross-selling.

DATABASE 41 Association Details (two items)  Rule evaluation (A implies B)
Support for the rule is measured by the percentage of all transactions containing both items: P(A ∩ B)
Confidence of the rule is measured by the share of transactions with A that also contain B: P(B | A) = P(A ∩ B) / P(A)
Lift is the potential gain attributed to the rule: the effect compared to other baskets without the effect. If it is greater than 1, the effect is positive: P(A ∩ B) / ( P(A) P(B) ) = P(B | A) / P(B)
Example: Diapers implies Beer, with P(D) = .7 and P(B) = .5
Support: P(D ∩ B) = .6
Confidence: P(B | D) = P(D ∩ B) / P(D) = .6 / .7 = .857
Lift: P(B | D) / P(B) = .857 / .5 = 1.714
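
The three measures are simple ratios, so the Diapers and Beer example can be recomputed directly from the slide's probabilities. The function name is an assumption made for this sketch.

```python
def rule_metrics(p_a, p_b, p_a_and_b):
    """Support, confidence, and lift for the rule A implies B."""
    support = p_a_and_b
    confidence = p_a_and_b / p_a   # P(B | A)
    lift = confidence / p_b        # equals P(A and B) / (P(A) * P(B))
    return support, confidence, lift

# Slide values: P(D) = .7, P(B) = .5, P(D and B) = .6
support, confidence, lift = rule_metrics(p_a=0.7, p_b=0.5, p_a_and_b=0.6)
print(round(support, 3), round(confidence, 3), round(lift, 3))  # 0.6 0.857 1.714
```

A lift of 1 would mean beer is bought just as often with diapers as without them; 1.714 suggests a positive association in this example.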

DATABASE 42 Association Challenges  If an item is rarely purchased, any other item bought with it seems important. So combine items into categories.  Some relationships are obvious.  Burger and fries.  Some relationships are meaningless.  Hardware store found that toilet rings sell well only when a new store first opens. But what does it mean?
Item      Freq.
1" nails  2%
2" nails  1%
3" nails  1%
4" nails  2%
Lumber    50%
Item           Freq.
Hardware       15%
Dim. Lumber    20%
Plywood        15%
Finish lumber  15%

DATABASE 43 Cluster Analysis  Examples  Are there groups of customers? (If so, we can cross-sell.)  Do the locations for our stores have elements in common? (So we can search for similar clusters for new locations.)  Do our employees (by department?) have common characteristics? (So we can hire similar, or dissimilar, people.)  Problem: Many dimensions and large datasets  Goal: Small intracluster distance, large intercluster distance

DATABASE 44 Geographic/Location  Examples  Customer location and sales comparisons  Factory sites and cost  Environmental effects  Challenge: Map data, multiple overlays
