Introduction to Data Warehousing CPS 196.03 Notes 6.


1 Introduction to Data Warehousing CPS 196.03 Notes 6

2 Warehousing
- Growing industry: $30+ billion industry
- Range from desktop to huge:
  - Walmart: 900-CPU, 2,700-disk, 23 TB Teradata system (numbers from earlier part of this decade)
- Lots of buzzwords, hype
  - slice & dice, rollup, MOLAP, pivot, ...

3 Outline
- What is a data warehouse?
- Why a warehouse?
- Models & operations
- Implementing a warehouse

4 What is a Warehouse?
- Collection of diverse data
  - subject oriented
  - aimed at executive, decision maker
  - often a copy of operational data
  - with value-added data (e.g., summaries, history)
  - integrated
  - time-varying
  - non-volatile

5 What is a Warehouse? (continued)
- Collection of tools
  - gathering data
  - cleansing, integrating, ...
  - querying, reporting, analysis
  - data mining
  - monitoring, administering warehouse

6 Warehouse Architecture
[Figure: architecture diagram with components labeled Client, Query & Analysis, Warehouse, Metadata, Integration, Source]

7 Motivating Examples
- Forecasting
- Comparing performance of units
- Monitoring, detecting fraud
- Visualization

8 Why a Warehouse?
- Two approaches:
  - Query-Driven (Lazy)
  - Warehouse (Eager)
[Figure: a query ("?") posed against multiple sources]

9 Query-Driven Approach
[Figure: diagram with components labeled Client, Mediator, Wrapper, Source]

10 Advantages of Warehousing
- High query performance
- Queries not visible outside warehouse
- Local processing at sources unaffected
- Can operate when sources unavailable
- Can query data not stored in a DBMS
- Extra information at warehouse
  - Modify, summarize (store aggregates)
  - Add historical information

11 Advantages of Query-Driven
- No need to copy data
  - less storage
  - no need to purchase data
- More up-to-date data
- Query needs can be unknown
- Only query interface needed at sources
- May be less draining on sources

12 OLTP vs. OLAP
- OLTP: On-Line Transaction Processing
  - Describes processing at operational sites
- OLAP: On-Line Analytical Processing
  - Describes processing at warehouse

13 OLTP vs. OLAP
- OLTP
  - Mostly updates
  - Many small transactions
  - MB-GB of data
  - Raw data
  - Clerical users
  - Up-to-date data
  - Consistency, recoverability critical
- OLAP
  - Mostly reads
  - Queries long, complex
  - TB-PB of data
  - Summarized, consolidated data
  - Decision-makers, analysts as users

14 Data Marts
- Smaller warehouses
- Span part of an organization
  - e.g., marketing (customers, products, sales)
- Do not require enterprise-wide consensus
  - but long-term integration problems?

15 Warehouse Models & Operators
- Data Models
  - relations
  - stars & snowflakes
  - cubes
- Operators
  - slice & dice
  - roll-up, drill down
  - pivoting
  - other

16 Warehouse Models
- Modeling data warehouses: dimensions, measures
  - Star schema: a fact table in the middle connected to a set of dimension tables
  - Snowflake schema: a refinement of the star schema in which some dimension hierarchies are normalized into sets of smaller dimension tables, forming a shape similar to a snowflake
  - Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, hence also called a galaxy schema

17 Star
[Figure: star diagram; label: Measures]

18 Star Schema
[Figure: star schema diagram]

19 Another Example of Star Schema
- Sales Fact Table: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, state_or_province, country
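A minimal sketch of how a star schema like the one above is queried, using Python's built-in sqlite3 module. Only the time and item dimensions and two of the measures are kept to stay short, and every row value (brands, dates, amounts) is made up for illustration; nothing here comes from the original slides beyond the table and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (subset of the slide's columns)
CREATE TABLE "time" (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                     quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
-- Fact table: foreign keys into the dimensions plus the measures
CREATE TABLE sales (time_key INTEGER REFERENCES "time"(time_key),
                    item_key INTEGER REFERENCES item(item_key),
                    units_sold INTEGER, dollars_sold REAL);
-- Hypothetical sample rows
INSERT INTO "time" VALUES (1, 15, 1, 'Q1', 2004), (2, 20, 4, 'Q2', 2004);
INSERT INTO item VALUES (1, 'TV-A', 'Acme', 'TV'), (2, 'PC-B', 'Acme', 'PC');
INSERT INTO sales VALUES (1, 1, 2, 800.0), (1, 2, 1, 1200.0), (2, 1, 3, 1200.0);
""")

# Typical star join: fact table joined to each dimension, then grouped
# by dimension attributes and aggregated over a measure.
rows = conn.execute("""
SELECT t.quarter, i.type, SUM(s.dollars_sold)
FROM sales AS s
JOIN "time" AS t ON s.time_key = t.time_key
JOIN item AS i ON s.item_key = i.item_key
GROUP BY t.quarter, i.type
ORDER BY t.quarter, i.type
""").fetchall()

for quarter, item_type, dollars in rows:
    print(quarter, item_type, dollars)
```

The fact table carries only keys and measures; descriptive attributes such as quarter and type live in the dimension tables and are reached through joins.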

20 Terms
- Fact table
- Dimension tables
- Measures

21 Dimension Hierarchies
[Figure: hierarchy diagram with nodes labeled store, sType, city, region]
- snowflake schema
- constellations

22 Example of Snowflake Schema
- Sales Fact Table: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_key
  - supplier: supplier_key, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city_key
  - city: city_key, city, state_or_province, country

23 Example of Fact Constellation
- Sales Fact Table: time_key, item_key, branch_key, location_key
  - Measures: units_sold, dollars_sold, avg_sales
- Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
  - Measures: dollars_cost, units_shipped
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country
- shipper: shipper_key, shipper_name, location_key, shipper_type

24 Cube
[Figure: fact table view next to the equivalent multi-dimensional cube; dimensions = 2]
- Recall counters in Apriori

25 3-D Cube
[Figure: fact table view next to the equivalent multi-dimensional cube with planes for day 1 and day 2; dimensions = 3]

26 Aggregates
- Add up amounts for day 1
- In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
- Result: 81

27 Aggregates
- Add up amounts by day
- In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

28 Another Example
- Add up amounts by day, product
- In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId
- Drill-down moves from the by-day totals to this finer result; rollup goes the other way
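The three aggregate queries above can be tried with Python's built-in sqlite3 module. This is only a sketch: the transcript does not preserve the SALE table's rows, so the six rows below are assumed sample data, chosen so that the day-1 total matches the 81 shown on the slide, and the custId column name is likewise an assumption. In the last query prodId is included in the SELECT list so that each output row is labeled with its group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, custId TEXT, date INTEGER, amt INTEGER)")

# Assumed sample rows (not from the transcript): (prodId, custId, date, amt)
sample_rows = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", sample_rows)

# Add up amounts for day 1
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Add up amounts by day (the rolled-up view)
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product (the drilled-down view)
by_day_product = conn.execute(
    "SELECT date, prodId, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(day1_total)
print(by_day)
print(by_day_product)
```

Summing the by-day-and-product rows for a given day gives that day's row in the by-day result, which is the rollup relationship between the two queries.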

29 Aggregates
- Operators: sum, count, max, min, median, avg
- "Having" clause
- Using dimension hierarchy
  - average by region (within store)
  - maximum by month (within date)

30 Types of Measures in Data Cubes
- Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
  - E.g., count(), sum(), min(), max()
- Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  - E.g., avg(), min_N(), standard_deviation()
- Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
  - E.g., median(), mode(), rank()
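The three categories can be illustrated with a short sketch. The two partitions below are made-up numbers; the point is only which aggregates can be combined from per-partition results.

```python
from statistics import median

# Hypothetical data split into two partitions (e.g., two days of sales)
partitions = [[12, 11, 50, 8], [44, 4]]
all_values = [v for part in partitions for v in part]

# Distributive: sum() applied to the partial sums equals sum() over all data.
total_from_parts = sum(sum(part) for part in partitions)

# Algebraic: avg() is not distributive, but it is a function of two
# distributive sub-aggregates per partition, (sum, count).
partials = [(sum(part), len(part)) for part in partitions]
avg_from_parts = sum(s for s, _ in partials) / sum(n for _, n in partials)

# Holistic: median() cannot in general be rebuilt from a bounded summary
# of each partition; the median of the partition medians is a different number.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total_from_parts, sum(all_values))
print(avg_from_parts, sum(all_values) / len(all_values))
print(median_of_medians, true_median)
```

This distinction matters for precomputation: distributive and algebraic measures can be rolled up from stored sub-aggregates, while a holistic measure generally requires going back to the detailed data.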

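31 Cube Aggregation
- Example: computing sums
[Figure: cube with planes for day 1 and day 2 aggregated step by step down to a grand total of 129; arrows labeled rollup and drill-down]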

32 Cube Operators
[Figure: the same cube and its aggregates, with cells addressed as sale(c1,*,*), sale(c2,p2,*), and sale(*,*,*)]

33 Extended Cube
[Figure: cube extended with a "*" value along each dimension to hold the aggregates; example cell sale(*,p2,*)]
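One way to picture the extended cube is as a dictionary keyed by coordinates in which any coordinate may be "*". The sketch below assumes the coordinate order (customer, product, date) and uses assumed sample facts, since the figure's data is not preserved in the transcript; the rows are chosen so that the grand total matches the 129 that appears on the slides.

```python
from collections import defaultdict
from itertools import product

# Assumed sample facts: (customer, product, date, amt)
facts = [
    ("c1", "p1", 1, 12),
    ("c1", "p2", 1, 11),
    ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8),
    ("c1", "p1", 2, 44),
    ("c2", "p1", 2, 4),
]

# Each fact contributes to 2^3 = 8 cells: every coordinate is either kept
# or replaced by "*", which stands for "all values of this dimension".
cube = defaultdict(int)
for cust, prod, date, amt in facts:
    for cell in product((cust, "*"), (prod, "*"), (date, "*")):
        cube[cell] += amt

print(cube[("c1", "*", "*")])   # sale(c1,*,*)
print(cube[("*", "p2", "*")])   # sale(*,p2,*)
print(cube[("c2", "p2", "*")])  # sale(c2,p2,*)
print(cube[("*", "*", "*")])    # sale(*,*,*)
```

Materializing every cell this way is simple but grows quickly with the number of dimensions, which is why which aggregates to precompute is a design decision in a real warehouse.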

34 Cube Aggregates Lattice
- city, product, date
- (city, product); (city, date); (product, date)
- city; product; date
- all
[Figure: lattice drawn over the cube with planes for day 1 and day 2; grand total 129]
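The lattice can be enumerated mechanically: every subset of the dimensions is one group-by, and any node can be computed from a finer node above it rather than from the base data. The sketch below uses assumed sample facts (not preserved in the transcript), and `roll_up` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("city", "product", "date")

# Assumed base cuboid: (city, product, date) -> amt
base = {
    ("c1", "p1", 1): 12,
    ("c1", "p2", 1): 11,
    ("c3", "p1", 1): 50,
    ("c2", "p2", 1): 8,
    ("c1", "p1", 2): 44,
    ("c2", "p1", 2): 4,
}

def roll_up(cuboid, dims_in, dims_out):
    """Sum a cuboid over the dimensions in dims_in that are missing from dims_out."""
    keep = [dims_in.index(d) for d in dims_out]
    out = defaultdict(int)
    for key, amt in cuboid.items():
        out[tuple(key[i] for i in keep)] += amt
    return dict(out)

# One node per subset of dimensions, from (city, product, date) down to "all" = ().
lattice = {
    dims: roll_up(base, DIMS, dims)
    for r in range(len(DIMS), -1, -1)
    for dims in combinations(DIMS, r)
}

# Following a lattice arc: (city, date) -> (date) gives the same result
# as aggregating the base data directly.
date_via_city_date = roll_up(lattice[("city", "date")], ("city", "date"), ("date",))

for dims, cuboid in lattice.items():
    print(dims or "all", cuboid)
```

Computing a coarse node from an already materialized finer node works here because sum is distributive.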

35 Dimension Hierarchies
[Figure: hierarchy with levels all, state, city, from coarsest to finest]

36 Dimension Hierarchies
- city, product, date
- (city, product); (city, date); (product, date)
- city; product; date
- state, product, date
- (state, product); (state, date)
- state
- all
- not all arcs shown...

37 Interesting Hierarchy
[Figure: time hierarchy with levels all, years, quarters, months, weeks, days; shown next to a conceptual dimension table]

38 Aggregation Using Hierarchies
- Hierarchy: customer, region, country
- Customer c1 is in Region A; customers c2, c3 are in Region B
[Figure: cube with planes for day 1 and day 2 aggregated from customers to regions]
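Rolling up along a hierarchy amounts to mapping each value to its parent and re-aggregating. The sketch below takes the customer-to-region mapping from the slide (c1 in Region A; c2 and c3 in Region B); the fact rows themselves are assumed sample data, since the figure's numbers are not preserved in the transcript.

```python
from collections import defaultdict

# From the slide: customer -> region
REGION_OF = {"c1": "A", "c2": "B", "c3": "B"}

# Assumed sample facts: (customer, product, date, amt)
facts = [
    ("c1", "p1", 1, 12),
    ("c1", "p2", 1, 11),
    ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8),
    ("c1", "p1", 2, 44),
    ("c2", "p1", 2, 4),
]

# Roll up customer -> region while keeping the date dimension.
by_region_date = defaultdict(int)
for cust, _prod, date, amt in facts:
    by_region_date[(REGION_OF[cust], date)] += amt

# Roll up further by dropping the date dimension.
by_region = defaultdict(int)
for (region, _date), amt in by_region_date.items():
    by_region[region] += amt

print(dict(by_region_date))
print(dict(by_region))
```

In a relational warehouse the same mapping would live in a customer dimension table and the roll-up would be a join followed by GROUP BY region.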

39 Multidimensional Data
- Sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time
- Hierarchical summarization paths:
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month > Day, with Week as an alternative level above Day

40 Typical OLAP Operations
[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]

41 Typical OLAP Operations
- Roll up (drill-up): summarize data
  - by climbing up hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
  - from higher level summary to lower level summary or detailed data, or introducing new dimensions
- Slice and dice: project and select
- Pivot (rotate):
  - reorient the cube, visualization, 3D to series of 2D planes
- Other operations
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
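Slice and dice are the simplest of these operations to sketch: both are selections on dimension values. The helper names `slice_` and `dice` and the sample rows below are assumptions for illustration, not part of the original slides.

```python
# Assumed sample facts
facts = [
    {"cust": "c1", "prod": "p1", "date": 1, "amt": 12},
    {"cust": "c1", "prod": "p2", "date": 1, "amt": 11},
    {"cust": "c3", "prod": "p1", "date": 1, "amt": 50},
    {"cust": "c2", "prod": "p2", "date": 1, "amt": 8},
    {"cust": "c1", "prod": "p1", "date": 2, "amt": 44},
    {"cust": "c2", "prod": "p1", "date": 2, "amt": 4},
]

def slice_(rows, dim, value):
    """Slice: fix a single dimension to one value."""
    return [r for r in rows if r[dim] == value]

def dice(rows, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]

day1 = slice_(facts, "date", 1)
sub_cube = dice(facts, cust={"c1", "c2"}, prod={"p1"})

print(len(day1), sum(r["amt"] for r in day1))
print(len(sub_cube), sum(r["amt"] for r in sub_cube))
```

A slice leaves a cube with one dimension fewer, while a dice leaves a smaller cube with the same dimensions; roll-up or drill-down can then be applied to either result.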

42 Fig. 3.10 Typical OLAP Operations

43 Pivoting
- Pivot turns unique values from one column into unique columns in the output
[Figure: fact table view next to the multi-dimensional cube with planes for day 1 and day 2]
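A small sketch of the pivot described above: the unique values of one column (here the date) become the columns of the output, with one output row per product. The `pivot` helper and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

# Assumed sample facts in fact-table form
facts = [
    {"prod": "p1", "cust": "c1", "date": 1, "amt": 12},
    {"prod": "p2", "cust": "c1", "date": 1, "amt": 11},
    {"prod": "p1", "cust": "c3", "date": 1, "amt": 50},
    {"prod": "p2", "cust": "c2", "date": 1, "amt": 8},
    {"prod": "p1", "cust": "c1", "date": 2, "amt": 44},
    {"prod": "p1", "cust": "c2", "date": 2, "amt": 4},
]

def pivot(rows, row_key, col_key, value):
    """Turn the unique values of col_key into columns, summing value per cell."""
    columns = sorted({r[col_key] for r in rows})
    # Every output row starts with a zero in each column.
    table = defaultdict(lambda: dict.fromkeys(columns, 0))
    for r in rows:
        table[r[row_key]][r[col_key]] += r[value]
    return columns, dict(table)

columns, table = pivot(facts, "prod", "date", "amt")

print("prod", *[f"day {c}" for c in columns])
for prod in sorted(table):
    print(prod, *[table[prod][c] for c in columns])
```

Cells with no matching facts are filled with 0 here; leaving them empty would be an equally reasonable choice, depending on how missing combinations should be read.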

