Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Warehousing CIS 4301 Lecture Notes 4/20/2006.

Similar presentations


Presentation on theme: "Data Warehousing CIS 4301 Lecture Notes 4/20/2006."— Presentation transcript:

1 Data Warehousing CIS 4301 Lecture Notes 4/20/2006

2 Emerging Database Technologies
Management of complex data Multimedia databases GIS Databases Genome Data Management Management of semistructured data XML Databases Sensor Networks - Management of Data Streams Approximate query processing Data Warehousing and OLAP Data Mining © COP Spring 2006

3 Warehousing Growing industry: $8 billion in 1998
Range from desktop to huge: Walmart: 900-CPU, 2,700 disk, 23TB Teradata system Lots of buzzwords, hype slice & dice, rollup, MOLAP, pivot, ... © COP Spring 2006

4 Outline What is a data warehouse? Why a warehouse? Models & operations
Implementing a warehouse Future directions © COP Spring 2006

5 What is a Warehouse? Collection of diverse data subject oriented
aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history) integrated time-varying non-volatile more © COP Spring 2006

6 What is a Warehouse? Collection of tools gathering data
cleansing, integrating, ... querying, reporting, analysis data mining monitoring, administering warehouse © COP Spring 2006

7 Warehouse Architecture
Client Client Query & Analysis Warehouse Metadata Integration Source Source Source © COP Spring 2006

8 Motivating Examples Forecasting Comparing performance of units
Monitoring, detecting fraud Visualization © COP Spring 2006

9 Why a Warehouse? ? Two Approaches: Query-Driven (Lazy)
Warehouse (Eager) ? Source Source © COP Spring 2006

10 Query-Driven Approach
Client Wrapper Mediator Source © COP Spring 2006 7

11 Advantages of Warehousing
High query performance Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse Modify, summarize (store aggregates) Add historical information © COP Spring 2006

12 Advantages of Query-Driven
No need to copy data less storage no need to purchase data More up-to-date data Query needs can be unknown Only query interface needed at sources May be less draining on sources © COP Spring 2006

13 OLTP vs. OLAP OLTP: On Line Transaction Processing
Describes processing at operational sites OLAP: On Line Analytical Processing Describes processing at warehouse © COP Spring 2006

14 OLTP vs. OLAP OLTP OLAP Mostly reads Mostly updates
Many small transactions Mb-Tb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical Mostly reads Queries long, complex Gb-Tb of data Summarized, consolidated data Decision-makers, analysts as users © COP Spring 2006

15 Data Marts Smaller warehouses Spans part of organization
e.g., marketing (customers, products, sales) Do not require enterprise-wide consensus but long term integration problems? © COP Spring 2006

16 Data Warehouse Reference Architecture
Query Manager / DSS Client Interfaces Data Mart Central Warehouse Operational Data Store ... Metadata Data Mart derived data WH Manager Extractor/ Monitor Extractor/ Monitor Extractor/ Monitor Source Source Source ... © COP Spring 2006

17 Warehouse Models & Operators
Data Models relations stars & snowflakes cubes Operators slice & dice roll-up, drill down pivoting other © COP Spring 2006

18 Multidimensional Data Model - Data Cube
Geography (cities) NY Measure: Sales SF LA Juice Milk Coke Cream Soap Bread 10 34 56 32 12 Three Dimensions: Time, Product, Geography Attributes (not shown in picture): Product (upc, price, …) Geography (lat, long, … Product (item) M T W Th F S S Time (day) 56 units of bread sold in LA on M © COP Spring 2006

19 Star © COP Spring 2006

20 Star Schema © COP Spring 2006

21 Terms Fact table Dimension tables Measures © COP Spring 2006

22 Dimension Hierarchies
sType store city region è snowflake schema è constellations © COP Spring 2006

23 Cube Fact table view: Multi-dimensional cube: dimensions = 2
© COP Spring 2006

24 3-D Cube Fact table view: Multi-dimensional cube: dimensions = 3 day 2
© COP Spring 2006

25 ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing
MOLAP: Multi-Dimensional On-Line Analytical Processing © COP Spring 2006

26 Aggregates Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1 81 © COP Spring 2006

27 Aggregates Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date © COP Spring 2006

28 Another Example Add up amounts by day, product
In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId rollup drill-down © COP Spring 2006

29 Aggregates Operators: sum, count, max, min, median, ave
“Having” clause Using dimension hierarchy average by region (within store) maximum by month (within date) © COP Spring 2006

30 Cube Aggregation Example: computing sums . . . 129 rollup drill-down
day 2 . . . day 1 129 drill-down rollup © COP Spring 2006

31 Cube Operators . . . sale(c1,*,*) 129 sale(c2,p2,*) sale(*,*,*) day 2
© COP Spring 2006

32 Extended Cube * day 2 sale(*,p2,*) day 1 © COP Spring 2006

33 Aggregation Using Hierarchies
day 2 day 1 customer region country (customer c1 in Region A; customers c2, c3 in Region B) © COP Spring 2006

34 Query & Analysis Tools Query Building
Report Writers (comparisons, growth, graphs,…) Spreadsheet Systems Web Interfaces Data Mining © COP Spring 2006

35 Other Operations Time functions Computed Attributes Text Queries
e.g., time average Computed Attributes e.g., commission = sales * rate Text Queries e.g., find documents with words X AND B e.g., rank documents by frequency of words X, Y, Z © COP Spring 2006


Download ppt "Data Warehousing CIS 4301 Lecture Notes 4/20/2006."

Similar presentations


Ads by Google