1
Database Management Systems
Chapter 8: Data Warehouses and Data Mining
Jerry Post
McGraw-Hill/Irwin. Copyright © 2005 by The McGraw-Hill Companies, Inc. All rights reserved.
2
Sequential Storage and Indexes
We picture tables as simple rows and columns, but they cannot be stored this way.
- It takes too many operations to find an item.
- Insertions require reading and rewriting the entire table.

ID  LastName   FirstName  DateHired
1   Reeves     Keith      1/29/98
2   Gibson     Bill       3/31/98
3   Reasoner   Katy       2/17/98
4   Hopkins    Alan       2/8/98
5   James      Leisha     1/6/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
8   Carpenter  Carlos     12/29/98
9   O'Connor   Jessica    7/23/98
10  Shields    Howard     7/13/98
3
Operations on Sequential Tables
- Read entire table: easy and fast.
- Sequential retrieval: easy and fast for one order.
- Random read with sequential retrieval: very weak. The probability of any row is 1/N, so 1,000,000 rows means about 500,000 retrievals per lookup on average.
- Delete: easy.
- Insert/Modify: very weak.

Row  Prob.  # Reads
A    1/N    1
B    1/N    2
C    1/N    3
D    1/N    4
E    1/N    5
…    1/N    i
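The "500,000 retrievals" figure follows from the table: row i costs i reads and each row has probability 1/N, so the expected cost is (1 + 2 + … + N)/N = (N + 1)/2. A minimal sketch of that arithmetic (the function name is illustrative, not from the slides):

```python
def expected_reads(n_rows: int) -> float:
    """Average reads for a random lookup when row i costs i reads
    and every row is equally likely (probability 1/N)."""
    total_reads = sum(range(1, n_rows + 1))  # 1 + 2 + ... + N
    return total_reads / n_rows              # equals (N + 1) / 2


print(expected_reads(5))          # small table: 3.0
print(expected_reads(1_000_000))  # 500000.5, about 500,000 per lookup
```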
4
Insert into Sequential Table
Insert Inez:
- Find insert location.
- Copy top to new file.
- At insert location, add row.
- Copy rest of file.

Before (sorted by LastName):
ID  LastName   FirstName  DateHired
8   Carpenter  Carlos     12/29/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
2   Gibson     Bill       3/31/98
4   Hopkins    Alan       2/8/98
5   James      Leisha     1/6/98
9   O'Connor   Jessica    7/23/98
3   Reasoner   Katy       2/17/98
1   Reeves     Keith      1/29/98
10  Shields    Howard     7/13/98

After (new row placed between Hopkins and James):
ID  LastName   FirstName  DateHired
8   Carpenter  Carlos     12/29/98
6   Eaton      Anissa     8/23/98
7   Farris     Dustin     3/28/98
2   Gibson     Bill       3/31/98
4   Hopkins    Alan       2/8/98
11  Inez       Maria      1/15/99
5   James      Leisha     1/6/98
9   O'Connor   Jessica    7/23/98
3   Reasoner   Katy       2/17/98
1   Reeves     Keith      1/29/98
10  Shields    Howard     7/13/98
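The four insert steps can be sketched in Python, using a list in place of the file. The function name and the tuple layout are assumptions for illustration. Every existing row is copied, which is why inserts are so weak on sequential tables.

```python
def insert_sequential(rows, new_row, key=lambda r: r[1]):
    """Return a NEW list with new_row placed in key order.
    Mirrors the slide: copy the top, add the row, copy the rest."""
    out = []
    inserted = False
    for row in rows:
        if not inserted and key(new_row) < key(row):
            out.append(new_row)   # at the insert location, add the row
            inserted = True
        out.append(row)           # copy the existing row to the new file
    if not inserted:
        out.append(new_row)       # new key sorts after every existing row
    return out


# Rows sorted by LastName (column 1), as on the slide.
employees = [
    (8, "Carpenter", "Carlos", "12/29/98"),
    (6, "Eaton", "Anissa", "8/23/98"),
    (7, "Farris", "Dustin", "3/28/98"),
    (2, "Gibson", "Bill", "3/31/98"),
    (4, "Hopkins", "Alan", "2/8/98"),
    (5, "James", "Leisha", "1/6/98"),
    (9, "O'Connor", "Jessica", "7/23/98"),
    (3, "Reasoner", "Katy", "2/17/98"),
    (1, "Reeves", "Keith", "1/29/98"),
    (10, "Shields", "Howard", "7/13/98"),
]
updated = insert_sequential(employees, (11, "Inez", "Maria", "1/15/99"))
print([row[1] for row in updated])
```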
5
Binary Search
Given a sorted list of names, how do you find Jones?
- Sequential search: Jones = 10 lookups. Average = (14 + 1)/2 = 7.5 lookups. Min = 1, Max = 14.
- Binary search:
  1. Find midpoint (14 / 2) = 7: Jones > Goetz
  2. Jones < Kalida
  3. Jones > Inez
  4. Jones = Jones (4 lookups)
- Binary search Max ≈ log2(N): N = 1000 gives Max = 10; N = 1,000,000 gives Max = 20.

Sorted list (14 entries), with the binary search probe order in brackets:
Adams, Brown, Cadiz, Dorfmann, Eaton, Farris, Goetz [1], Hanson, Inez [3], Jones [4], Kalida [2], Lomax, Miranda, Norman
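A short Python sketch of the binary search on this slide, recording each name probed. The function name and return shape are illustrative choices; the list is the slide's 14 entries.

```python
import math


def binary_search(sorted_names, target):
    """Return (index, probes); probes lists the names examined, in order.
    index is -1 when the target is absent."""
    lo, hi = 0, len(sorted_names) - 1
    probes = []
    while lo <= hi:
        mid = (lo + hi) // 2              # midpoint of the remaining range
        probes.append(sorted_names[mid])
        if sorted_names[mid] == target:
            return mid, probes
        if sorted_names[mid] < target:
            lo = mid + 1                  # target is in the upper half
        else:
            hi = mid - 1                  # target is in the lower half
    return -1, probes


names = ["Adams", "Brown", "Cadiz", "Dorfmann", "Eaton", "Farris", "Goetz",
         "Hanson", "Inez", "Jones", "Kalida", "Lomax", "Miranda", "Norman"]
index, probes = binary_search(names, "Jones")
print(index, probes)

# Worst case grows with log2(N).
print(math.ceil(math.log2(1000)), math.ceil(math.log2(1_000_000)))
```

The probe sequence is Goetz, Kalida, Inez, Jones, matching the four lookups on the slide.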
6
Pointers
When data is stored on a drive (or in RAM):
- The operating system allocates space with a function call and provides a location/address.
- Physical address.
- Virtual address (VSAM): imaginary drive values mapped to physical locations.
- Relative address: distance from the start of the file, or from another reference point.

[Figure: a key value paired with an address/pointer that leads to the data. Address components shown: Volume, Track, Cylinder/Sector, Byte Offset, Drive Head.]
7
Indexed Sequential Storage
Common uses:
- Large tables.
- Need many sequential lists.
- Some random search, with one or two key columns.
- Mostly replaced by B+-Tree.

Data rows, with storage address:
Address  ID  LastName   FirstName  DateHired
A11      1   Reeves     Keith      1/29/98
A22      2   Gibson     Bill       3/31/98
A32      3   Reasoner   Katy       2/17/98
A42      4   Hopkins    Alan       2/8/98
A47      5   James      Leisha     1/6/98
A58      6   Eaton      Anissa     8/23/98
A63      7   Farris     Dustin     3/28/98
A67      8   Carpenter  Carlos     12/29/98
A78      9   O'Connor   Jessica    7/23/98
A83      10  Shields    Howard     7/13/98

ID index:
ID  Pointer
1   A11
2   A22
3   A32
4   A42
5   A47
6   A58
7   A63
8   A67
9   A78
10  A83

LastName index:
LastName   Pointer
Carpenter  A67
Eaton      A58
Farris     A63
Gibson     A22
Hopkins    A42
James      A47
O'Connor   A78
Reasoner   A32
Reeves     A11
Shields    A83

Indexed for ID and LastName.
8
Linked List
- Separate each element/key.
- Pointers to next element.
- Pointers to data.
- Starting point.

Index nodes:
Key        Node address  Next  Data pointer
Carpenter  B87           B29   A67
Eaton      B29           B71   A58
Farris     B71           B38   A63
Gibson     B38           00    A22

Data rows:
Address  ID  LastName   FirstName  DateHired
A67      8   Carpenter  Carlos     12/29/98
A58      6   Eaton      Anissa     8/23/98
A63      7   Farris     Dustin     3/28/98
A22      2   Gibson     Bill       3/31/98
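The pointer chain on this slide can be walked with a few lines of Python. This is a sketch with dictionaries standing in for storage addresses. Treating B87 (Carpenter) as the starting point and 00 as the end marker are assumptions read off the node values.

```python
# node address -> (key, next node address, data address), values from the slide
nodes = {
    "B87": ("Carpenter", "B29", "A67"),
    "B29": ("Eaton", "B71", "A58"),
    "B71": ("Farris", "B38", "A63"),
    "B38": ("Gibson", "00", "A22"),
}
# data address -> stored row
data = {
    "A67": (8, "Carpenter", "Carlos", "12/29/98"),
    "A58": (6, "Eaton", "Anissa", "8/23/98"),
    "A63": (7, "Farris", "Dustin", "3/28/98"),
    "A22": (2, "Gibson", "Bill", "3/31/98"),
}
END_OF_LIST = "00"  # assumed end marker


def walk(start_address):
    """Follow next-pointers from the starting point, yielding (key, row)."""
    address = start_address
    while address != END_OF_LIST:
        key, next_address, data_address = nodes[address]
        yield key, data[data_address]
        address = next_address


for key, row in walk("B87"):
    print(key, row)
```

The rows come back in key order even though they are stored at unrelated addresses.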
9
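B-Tree
- Store key values.
- Utilize binary search (or better).
- Trees: nodes; root; leaf (node with no children); levels / depth; degree (maximum number of children per node).

Example tree:
Level 1 (root):    Hanson
Level 2:           Dorfmann, Kalida
Level 3:           Brown, Farris, Inez, Miranda
Level 4 (leaves):  Adams, Cadiz, Eaton, Goetz, Jones, Lomax, Norman

[Figure: labels A through N below the tree; detail of the Inez node showing Key and Data, with <, >, and = comparisons.]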
10
Index Options: Bitmaps and Statistics
- Bitmap index: a compressed index designed for non-primary key columns. Bit-wise operations can be used to quickly match WHERE criteria.
- Analyze statistics: by collecting statistics about the actual data within the index, the DBMS can optimize the search path. For example, if it knows that only a few rows match one of your search conditions in a table, it can apply that condition first, reducing the amount of work needed to join tables.
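To illustrate how bit-wise operations match WHERE criteria, here is a minimal, uncompressed sketch that uses Python integers as bitmaps. The column values and the State column are made-up sample data, and production bitmap indexes add compression that is not shown here.

```python
def build_bitmaps(values):
    """Map each distinct value to an int whose bit i is set
    when row i holds that value."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps


def rows_of(bitmap):
    """List the row numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical low-cardinality columns for six rows (rows 0..5).
category = ["Bird", "Cat", "Dog", "Cat", "Bird", "Cat"]
state = ["KY", "TN", "KY", "KY", "TN", "KY"]

category_bits = build_bitmaps(category)
state_bits = build_bitmaps(state)

# WHERE Category = 'Cat' AND State = 'KY'  ->  one bit-wise AND
both = category_bits["Cat"] & state_bits["KY"]
# WHERE Category = 'Bird' OR Category = 'Dog'  ->  one bit-wise OR
either = category_bits["Bird"] | category_bits["Dog"]

print(rows_of(both), rows_of(either))
```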
11
Problems with Indexes
- Each index must be updated when rows are inserted, deleted, or modified.
- Changing one row of data in a table with many indexes can take considerable time and resources to update all of the indexes.

Steps to improve performance:
- Index primary keys.
- Index common join columns (usually primary keys).
- Index columns that are searched regularly.
- Use a performance analyzer.
12
Data Warehouse
[Figure: an OLTP database (3NF tables, operations data, predefined reports) and flat files feed a data warehouse (star configuration, interactive data analysis) through a daily data transfer.]
13
Data Warehouse Goals
- Existing databases are optimized for Online Transaction Processing (OLTP).
- Online Analytical Processing (OLAP) requires fast retrievals, and only bulk writes.
- Different goals require different storage, so build a separate data warehouse to use for queries.
- Extraction, Transformation, Transportation (ETT).
- Data analysis:
  - Ad hoc queries
  - Statistical analysis
  - Data mining (specialized automated tools)
14
Extraction, Transformation, and Transportation (ETT)
- Source: transaction data from diverse systems.
- Transformations (example: Customers):
  - Convert Client to Customer
  - Apply standard product numbers
  - Convert currencies
  - Fix region codes
- Data warehouse: all data must be consistent.
15
OLTP v. OLAP
16
Multidimensional Cube
[Figure: cube of Pet Store item sales with dimensions Time (Sale Date), Customer Location, and Category. Measure: Amount = Quantity * Sale Price.]
17
Sales Date: Time Hierarchy
Levels: Year, Quarter, Month, Week, Day
- Roll-up: to get higher-level totals.
- Drill-down: to get lower-level details.
18
Star Design
- Fact table: Sales (Quantity, Amount = SalePrice * Quantity)
- Dimension tables: Products, Customer Location, Sales Date
19
Snowflake Design
Dimension tables can join to other dimension tables.
- OLAPItems: SaleID, ItemID, Quantity, SalePrice, Amount
- Merchandise: ItemID, Description, QuantityOnHand, ListPrice, Category
- Sale: SaleID, SaleDate, EmployeeID, CustomerID, SalesTax
- Customer: CustomerID, Phone, FirstName, LastName, Address, ZipCode, CityID
- City: CityID, ZipCode, City, State
20
OLAP Computation Issues
- Compute Quantity*Price for each row in the base query, then add the row results to get $23.00.
- If you use a Calculated Measure in the cube, it will add first and multiply second to get $45.00, which is wrong.
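The difference between the two orders of computation is easy to reproduce. The slide's underlying rows are not shown in this transcript, so the two rows below are assumed values chosen only because they give the same $23.00 and $45.00 totals.

```python
# Assumed sample rows: (quantity, price). Not taken from the slide's data.
sale_items = [(3, 5.00), (2, 4.00)]

# Correct: multiply within each row in the base query, then add.
sum_of_products = sum(quantity * price for quantity, price in sale_items)

# Wrong: a calculated measure that adds each column first, then multiplies.
total_quantity = sum(quantity for quantity, _ in sale_items)
total_price = sum(price for _, price in sale_items)
product_of_sums = total_quantity * total_price

print(sum_of_products)  # 3*5 + 2*4
print(product_of_sums)  # (3 + 2) * (5 + 4)
```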
21
OLAP Data Browsing
22
Microsoft Pivot Table
23
OLAP in SQL 99
    SELECT Category, Month(SaleDate) AS Month, Sum(Quantity*SalePrice) AS Amount
    FROM Sale INNER JOIN (Merchandise INNER JOIN SaleItem
            ON Merchandise.ItemID = SaleItem.ItemID)
        ON Sale.SaleID = SaleItem.SaleID
    GROUP BY Category, Month(SaleDate);

Category  Month  Amount
Bird      1      $135.00
Bird      2      $45.00
Bird      3      $202.50
Bird      6      $67.50
Bird      7      $90.00
Bird      9      $67.50
Cat       1      $396.00
Cat       2      $113.85
Cat       3      $443.70
Cat       4      $2.25

GROUP BY two columns gives you totals for each month within each category. You do not get super-aggregate totals for the category, or the month, or the overall total.
24
SQL ROLLUP
    SELECT Category, Month…, Sum …
    FROM …
    GROUP BY ROLLUP (Category, Month...)

Category  Month   Amount
Bird      1       135.00
Bird      2       45.00
…
Bird      (null)  607.50
Cat       1       396.00
Cat       2       113.85
…
Cat       (null)  1293.30
…
(null)    (null)  8451.79
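To show what ROLLUP adds to a plain GROUP BY, the sketch below emulates it in Python: detail rows, then a subtotal per category with a null month, then one grand total. The function is illustrative, not a DBMS implementation. The input uses only the monthly rows shown on the GROUP BY slide, so the Bird subtotal matches the 607.50 above, while the Cat subtotal and grand total cover just the four Cat rows listed there.

```python
from collections import defaultdict
from decimal import Decimal


def rollup(rows):
    """Emulate GROUP BY ROLLUP (Category, Month) over
    (category, month, amount) rows; None stands for (null)."""
    detail = defaultdict(Decimal)       # (category, month) -> amount
    by_category = defaultdict(Decimal)  # category -> subtotal
    grand_total = Decimal("0")
    for category, month, amount in rows:
        detail[(category, month)] += amount
        by_category[category] += amount
        grand_total += amount

    result = []
    for category in sorted(by_category):
        for key in sorted(k for k in detail if k[0] == category):
            result.append((key[0], key[1], detail[key]))
        result.append((category, None, by_category[category]))  # super-aggregate
    result.append((None, None, grand_total))                    # overall total
    return result


sales = [
    ("Bird", 1, Decimal("135.00")), ("Bird", 2, Decimal("45.00")),
    ("Bird", 3, Decimal("202.50")), ("Bird", 6, Decimal("67.50")),
    ("Bird", 7, Decimal("90.00")), ("Bird", 9, Decimal("67.50")),
    ("Cat", 1, Decimal("396.00")), ("Cat", 2, Decimal("113.85")),
    ("Cat", 3, Decimal("443.70")), ("Cat", 4, Decimal("2.25")),
]
for row in rollup(sales):
    print(row)
```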
25
Missing Values Cause Problems
If there are missing values in the groups, it can be difficult to identify the super-aggregate rows.

Category  Month   Amount
Bird      1       135.00
Bird      2       45.00
…
Bird      (null)  32.00     (missing date)
Bird      (null)  607.50    (super-aggregate)
Cat       1       396.00
Cat       2       113.85
…
Cat       (null)  1293.30
…
(null)    (null)  8451.79
26
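GROUPING Function
    SELECT Category, Month…, Sum …,
        GROUPING (Category) AS Gc,
        GROUPING (Month) AS Gm
    FROM …
    GROUP BY ROLLUP (Category, Month...)

GROUPING returns 1 when the column is rolled up in that row and 0 otherwise, so a null Month with Gm = 0 is a missing date and a null Month with Gm = 1 is a super-aggregate.

Category  Month   Amount   Gc  Gm
Bird      1       135.00   0   0
Bird      2       45.00    0   0
…
Bird      (null)  32.00    0   0
Bird      (null)  607.50   0   1
Cat       1       396.00   0   0
Cat       2       113.85   0   0
…
Cat       (null)  1293.30  0   1
…
(null)    (null)  8451.79  1   1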
27
CUBE Option
    SELECT Category, Month…, Sum …,
        GROUPING (Category) AS Gc,
        GROUPING (Month) AS Gm
    FROM …
    GROUP BY CUBE (Category, Month...)

CUBE adds the month totals across all categories (Gc = 1, Gm = 0) to the ROLLUP results.

Category  Month   Amount   Gc  Gm
Bird      1       135.00   0   0
Bird      2       45.00    0   0
…
Bird      (null)  32.00    0   0
Bird      (null)  607.50   0   1
Cat       1       396.00   0   0
Cat       2       113.85   0   0
…
Cat       (null)  1293.30  0   1
(null)    1       1358.80  1   0
(null)    2       1508.94  1   0
(null)    3       2362.68  1   0
…
(null)    (null)  8451.79  1   1
28
GROUPING SETS: Hiding Details
    SELECT Category, Month…, Sum …
    FROM …
    GROUP BY GROUPING SETS ( ROLLUP (Category), ROLLUP (Month), ( ) )

Category  Month   Amount
Bird      (null)  607.50
Cat       (null)  1293.30
…
(null)    1       1358.80
(null)    2       1508.94
(null)    3       2362.68
…
(null)    (null)  8451.79
29
SQL OLAP Analytical Functions
Function                    Computes
VAR_POP, VAR_SAMP           variance
STDDEV_POP, STDDEV_SAMP     standard deviation
COVAR_POP, COVAR_SAMP       covariance
CORR                        correlation
REGR_R2                     regression r-square
REGR_SLOPE, REGR_INTERCEPT  regression data (many)
30
SQL RANK Functions
    SELECT Employee, SalesValue,
        RANK() OVER (ORDER BY SalesValue DESC) AS rank,
        DENSE_RANK() OVER (ORDER BY SalesValue DESC) AS dense
    FROM Sales
    ORDER BY SalesValue DESC, Employee;

Employee  SalesValue  rank  dense
Jones     18,000      1     1
Smith     16,000      2     2
Black     16,000      2     2
White     14,000      4     3

DENSE_RANK does not skip numbers after a tie; RANK does.
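The two ranking rules can be sketched in a few lines of Python, using the slide's four employees. The function name is illustrative. Ties here keep their input order, which is a simplification of the query's final ORDER BY.

```python
def rank_and_dense_rank(rows):
    """rows: (employee, sales_value) pairs.
    Returns (employee, value, rank, dense_rank), highest value first."""
    ordered = sorted(rows, key=lambda r: -r[1])  # stable sort: ties keep input order
    result = []
    rank = dense = 0
    previous = None
    for position, (employee, value) in enumerate(ordered, start=1):
        if value != previous:
            rank = position  # RANK jumps to the row position, skipping after ties
            dense += 1       # DENSE_RANK advances by one per distinct value
            previous = value
        result.append((employee, value, rank, dense))
    return result


sales_by_employee = [("Jones", 18000), ("Smith", 16000),
                     ("Black", 16000), ("White", 14000)]
for row in rank_and_dense_rank(sales_by_employee):
    print(row)
```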
31
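SQL OLAP Windows
    SELECT Category, SaleMonth, MonthAmount,
        AVG(MonthAmount) OVER (PARTITION BY Category
            ORDER BY SaleMonth ASC ROWS 2 PRECEDING) AS MA
    FROM qryOLAPSQL99
    ORDER BY SaleMonth ASC;

Category  SaleMonth  MonthAmount  MA
Bird      200101     1500.00
Bird      200102     1700.00
Bird      200103     2000.00      1600.00
Bird      200104     2500.00      1850.00
…
Cat       200101     4000.00
Cat       200102     5000.00
Cat       200103     6000.00      4500.00
Cat       200104     7000.00      5500.00
…

The MA values shown average the two prior months only. Under the SQL standard, ROWS 2 PRECEDING also includes the current row, so a conforming DBMS would return 1500.00, 1600.00, 1733.33, and 2066.67 for the four Bird rows.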
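A small Python sketch of the window frame may help. It follows the standard reading of ROWS 2 PRECEDING, where the frame is the current row plus up to two earlier rows within the partition. The function name is illustrative and the input is the slide's Bird partition.

```python
def moving_average(values, preceding=2):
    """AVG(...) OVER (ORDER BY ... ROWS <preceding> PRECEDING) for one partition:
    each frame is the current row plus up to `preceding` earlier rows."""
    result = []
    for i in range(len(values)):
        frame = values[max(0, i - preceding): i + 1]
        result.append(sum(frame) / len(frame))
    return result


bird_monthly = [1500.00, 1700.00, 2000.00, 2500.00]  # 200101 .. 200104
print([round(ma, 2) for ma in moving_average(bird_monthly)])
```

Each category is a separate partition, so the function would be applied to the Cat rows independently.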
32
Ranges: OVER
    SELECT SaleDate, Value,
        SUM(Value) OVER (ORDER BY SaleDate) AS running_sum,
        SUM(Value) OVER (ORDER BY SaleDate
            RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum2,
        SUM(Value) OVER (ORDER BY SaleDate
            RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS remaining_sum
    FROM …;

- running_sum computes the total from the beginning through the current row.
- running_sum2 does the same thing, but more explicitly lists the rows.
- remaining_sum computes the total from the current row through the end of the query.
33
LAG and LEAD Functions
    SELECT SaleDate, Value,
        LAG (Value, 1, 0) OVER (ORDER BY SaleDate) AS prior_day,
        LEAD (Value, 1, 0) OVER (ORDER BY SaleDate) AS next_day
    FROM …
    ORDER BY SaleDate

LAG or LEAD: (Column, # rows, default)

SaleDate   Value  prior_day  next_day
1/1/2003   1000   0          1500
1/2/2003   1500   1000       2000
1/3/2003   2000   1500       2300
…
1/31/2003  3500   3200       0

The first prior_day is 0 from the default value.
Not part of the standard yet? But they are in SQL Server and Oracle.
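The (Column, # rows, default) arguments behave like a shifted copy of the column. Below is a Python sketch over the first four daily values implied by the slide (1000, 1500, 2000, and the 2300 that appears as next_day for 1/3). Because the list is cut off after four days, the final next_day falls back to the default here.

```python
def lag(values, offset=1, default=0):
    """Value from `offset` rows earlier, or `default` before the first row."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=0):
    """Value from `offset` rows later, or `default` past the last row."""
    return [values[i + offset] if i + offset < len(values) else default
            for i in range(len(values))]


daily_values = [1000, 1500, 2000, 2300]  # ordered by SaleDate
for value, prior_day, next_day in zip(daily_values, lag(daily_values), lead(daily_values)):
    print(value, prior_day, next_day)
```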
34
Data Mining
Goal: To discover unknown relationships in the data that can be used to make better decisions.

[Figure: layers labeled Databases, Reports, Queries, OLAP, and Data Mining, annotated with: transactions and operations; specific ad hoc questions; aggregate, compare, drill down; unknown relationships.]
35
Exploratory Analysis
- Data mining usually works autonomously.
  - Supervised/directed
  - Unsupervised
- Often called a bottom-up approach that scans the data to find relationships.
- Some statistical routines, but they are not sufficient.
  - Statistics relies on averages.
  - Sometimes the important data lies in more detailed pairs.
36
Common Techniques
- Classification/Prediction/Regression
- Association Rules/Market Basket Analysis
- Clustering
  - Data points
  - Hierarchies
- Neural Networks
- Deviation Detection
- Sequential Analysis
  - Time series events
  - Websites
- Textual Analysis
- Spatial/Geographic Analysis
37
Classification Examples
- Which borrowers/loans are most likely to be successful?
- Which customers are most likely to want a new item?
- Which companies are likely to file bankruptcy?
- Which workers are likely to quit in the next six months?
- Which startup companies are likely to succeed?
- Which tax returns are fraudulent?
38
Classification Process
- Clearly identify the outcome/dependent variable.
- Identify potential variables that might affect the outcome.
  - Supervised (modeler chooses)
  - Unsupervised (system scans all/most)
- Use sample data to test and validate the model.
- System creates weights that link independent variables to outcome.

Income  Married  Credit History  Job Stability  Success
50000   Yes      Good                           Yes
25000   Yes      Bad                            No
75000   No       Good                           No
39
Classification Techniques
- Regression
- Bayesian Networks
- Decision Trees (hierarchical)
- Neural Networks
- Genetic Algorithms

Complications:
- Some methods require categorical data.
- Data size is still a problem.
40
Association/Market Basket Examples
- What items are customers likely to buy together?
- What Web pages are closely related?
- Others?

Classic (early) example: Analysis of convenience store data showed customers often buy diapers and beer together.
Importance: Consider putting the two together to increase cross-selling.
41
Association Details (two items)
Rule evaluation (A implies B):
- Support for the rule is measured by the percentage of all transactions containing both items: P(A ∩ B)
- Confidence of the rule is measured by the share of transactions with A that also contain B: P(B | A)
- Lift is the potential gain attributed to the rule: the effect compared to other baskets without the effect. If it is greater than 1, the effect is positive: P(A ∩ B) / ( P(A) P(B) ) = P(B | A) / P(B)

Example: Diapers implies Beer, with P(D) = .7 and P(B) = .5
- Support: P(D ∩ B) = .6
- Confidence: P(B | D) = P(D ∩ B) / P(D) = .6 / .7 = .857
- Lift: P(B | D) / P(B) = .857 / .5 = 1.714

These numbers illustrate the arithmetic only; in real data P(D ∩ B) cannot exceed P(B).
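The three measures can be computed directly from a list of baskets. The sketch below uses ten made-up baskets (not the slide's probabilities), chosen so that the counts are internally consistent; the function name is illustrative.

```python
def rule_metrics(baskets, a, b):
    """Support, confidence, and lift for the rule 'a implies b'."""
    n = len(baskets)
    p_a = sum(a in basket for basket in baskets) / n
    p_b = sum(b in basket for basket in baskets) / n
    support = sum(a in basket and b in basket for basket in baskets) / n  # P(A ∩ B)
    confidence = support / p_a                                            # P(B | A)
    lift = confidence / p_b                                               # P(B | A) / P(B)
    return support, confidence, lift


# Hypothetical data: 4 baskets with both items, 1 diapers only,
# 1 beer only, 4 with neither.
baskets = (
    [{"diapers", "beer"}] * 4
    + [{"diapers"}]
    + [{"beer"}]
    + [{"milk"}] * 4
)
support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
print(support, confidence, lift)
```

A lift above 1 means baskets with diapers contain beer more often than baskets in general do.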
42
Association Challenges
- If an item is rarely purchased, any other item bought with it seems important. So combine items into categories.
- Some relationships are obvious. Burger and fries.
- Some relationships are meaningless. Hardware store found that toilet rings sell well only when a new store first opens. But what does it mean?

Individual items:
Item      Freq.
1" nails  2%
2" nails  1%
3" nails  1%
4" nails  2%
Lumber    50%

Combined into categories:
Item           Freq.
Hardware       15%
Dim. Lumber    20%
Plywood        15%
Finish lumber  15%
43
Cluster Analysis Examples
- Are there groups of customers? (If so, we can cross-sell.)
- Do the locations for our stores have elements in common? (So we can search for similar clusters for new locations.)
- Do our employees (by department?) have common characteristics? (So we can hire similar, or dissimilar, people.)

Problem: Many dimensions and large datasets.
Goal: small intracluster distance, large intercluster distance.
44
Geographic/Location Examples
- Customer location and sales comparisons
- Factory sites and cost
- Environmental effects

Challenge: Map data, multiple overlays.