What is Data Warehouse? Defined in many different ways.


1 What is Data Warehouse? Defined in many different ways.
A data warehouse is a decision support database that is maintained separately from the organization’s operational database.
It supports information processing by providing a solid platform of consolidated, historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
Data warehousing: the process of constructing and using data warehouses.

2 Subject-Oriented Integrated Time Variant Non-volatile
Subject-Oriented:
Organized around major subjects or facts, such as sales, enrollments, experiments, or events.
Focused on modeling and analysis for decision makers, not on daily operations or transactions.
Integrated:
Constructed (possibly) by integrating multiple, heterogeneous data sources.
The data must be cleaned and integrated to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different sources (e.g., hotel price: currency, tax, whether breakfast is covered).
This step is done at the time the data is moved to the data warehouse.
Time Variant:
The time horizon for the DW is significantly longer than that of operational systems.
Operational database: current-value data.
Data warehouse data: provides information from a historical perspective (e.g., the past 5-10 years).
Every structure in the data warehouse contains an element of time, explicitly or implicitly.
Operational data may or may not contain a “time element” (it is always assumed to be the “current value”).
Non-volatile:
A physically separate store of data, transformed from the operational environment.
Operational updates of DW data do not occur (every insert is considered a new item).
Does not require transaction processing, recovery, or concurrency control mechanisms.
Requires only two operations in data accessing: initial loading of data and reading of data.
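The hotel price example in the integration step can be sketched in code. This is a minimal illustration, not a prescribed cleaning procedure: the record fields, the exchange rates, and the target convention (USD, tax included, breakfast excluded) are all assumptions made up for the example.

```python
# Hypothetical exchange rates to USD; real rates would come from a reference source.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25, "CAD": 0.75}

def normalize_price(record):
    """Return the price in USD, tax included, breakfast excluded (assumed convention)."""
    price = record["price"]
    if not record["tax_included"]:
        price *= 1 + record["tax_rate"]      # add the tax the source left out
    if record["breakfast_included"]:
        price -= record["breakfast_price"]   # remove the bundled breakfast
    return round(price * RATES_TO_USD[record["currency"]], 2)

# Two made-up sources that quote the same room under different conventions.
source_a = {"price": 100.0, "currency": "USD", "tax_included": False,
            "tax_rate": 0.10, "breakfast_included": False, "breakfast_price": 0.0}
source_b = {"price": 96.0, "currency": "EUR", "tax_included": True,
            "tax_rate": 0.0, "breakfast_included": True, "breakfast_price": 8.0}

print(normalize_price(source_a), normalize_price(source_b))
```

Once both records are expressed under one convention they can be compared or aggregated; before integration the raw figures 100.0 and 96.0 are not comparable.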

3 Data Mining: Concepts and Techniques
Data Mining on a DW?
Data mining goes into MOUNTAINS of raw data for info gems.
(Figure: the data mining pipeline, with loop-backs between stages.)
Raw data: databases, smart files.
Data Cleaning/Integration: missing data, outliers, noise, errors.
Data Warehouse: cleaned, integrated, read-only, periodic, historical.
Task-relevant Data Selection: feature extraction, tuple selection.
Data Mining: OLAP, classification, clustering, rule mining.
Pattern Evaluation and Assay.
Visualization.
December 3, 2018

4 From Tables and Spreadsheets to Data Cubes
A data warehouse is usually based on a multidimensional data model, which views data in the form of a data cube describing the subject of interest (e.g., sales).
A data cube allows data to be modeled and viewed in multiple dimensions.
Auxiliary dimension tables are added to the central cube for additional information, e.g., for a sales cube: item (item_name, brand, type), time (day, week, month, quarter, year), salesman (name, addr, salary).
The fact cube contains measurement(s) (e.g., number_of_sales) and keys (references) to each of the related dimension tables.

5 A Sample Data Cube
Each cell contains a sales measurement, e.g., the number of sales (a cell may contain many other measurements of its product-date-country instance).
(Figure: a 3-D cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, PC, VCR), and Country (U.S.A, Canada, Mexico).)
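One way to picture such a cube in code is a mapping from (product, date, country) coordinates to the measurement. The sketch below is only an illustration: the fact rows and their sales counts are made up, and a dictionary is just one of many possible representations.

```python
from collections import defaultdict

# Made-up fact rows: (product, quarter, country, number_of_sales)
facts = [
    ("TV", "1Qtr", "U.S.A", 10),
    ("TV", "1Qtr", "U.S.A", 5),
    ("PC", "2Qtr", "Canada", 7),
    ("VCR", "4Qtr", "Mexico", 3),
]

def build_cube(rows):
    """Accumulate fact rows into cells keyed by (product, date, country)."""
    cube = defaultdict(int)
    for product, quarter, country, n in rows:
        cube[(product, quarter, country)] += n
    return dict(cube)

cube = build_cube(facts)
print(cube[("TV", "1Qtr", "U.S.A")])  # 15
```

Cells with no facts are simply absent from the mapping, which keeps a sparse cube small.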

6 Total of all product sales by country and quarter
Rollup (aggregate under +) along product, e.g., using the aggregate sum, gives total sales by country and date.
(Figure: the sample cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, PC, VCR), and Country (U.S.A, Canada, Mexico), plus a face holding the total of all product sales by country and quarter.)
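A rollup along one dimension can be sketched as summing all cells that agree on the remaining coordinates. The cube contents below are made up, and the function name `rollup` and the `dims` tuple are illustrative choices, not part of the slides.

```python
from collections import defaultdict

def rollup(cube, dims, drop):
    """Sum the cube over dimension `drop`, eliminating it from every key."""
    axis = dims.index(drop)
    out = defaultdict(int)
    for key, value in cube.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)

dims = ("product", "date", "country")
cube = {  # made-up sales counts
    ("TV", "1Qtr", "U.S.A"): 10,
    ("PC", "1Qtr", "U.S.A"): 4,
    ("VCR", "1Qtr", "Canada"): 2,
    ("TV", "2Qtr", "Canada"): 6,
}

by_date_country = rollup(cube, dims, "product")
print(by_date_country[("1Qtr", "U.S.A")])  # 14
```

The same function rolls up along date or country by changing the `drop` argument.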

7 Rollup along date (e.g., using the aggregate sum)
Total annual sales by country and product.
(Figure: the sample cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, PC, VCR), and Country (U.S.A, Canada, Mexico).)

8 Rollup along country (e.g., using the aggregate sum)
Total sales over all countries, by product and date.
(Figure: the sample cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, PC, VCR), and Country (U.S.A, Canada, Mexico).)

9 All rollups (e.g., using the aggregate sum)
(Figure: the sample cube, holding sales by product, country and quarter, surrounded by all of its rollups.)
Rolled up over one dimension: sales by product and country; sales by country and date; sales by product and date.
Rolled up over two dimensions: sales by product; sales by country; sales by date.
Rolled up over all three dimensions: total sales.

10 Cuboids Corresponding to the Cube
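0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product, date; product, country; date, country
3-D (base or fact) cuboid: product, date, country
Drilldown on product: moves down the lattice by adding the product dimension.
Rollup on country (sum over country): moves up the lattice by eliminating the country dimension.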
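The lattice of cuboids can be enumerated by taking every subset of the dimensions and summing over the ones left out. The following is a brute-force sketch on made-up data; the names `cuboid` and `all_cuboids` are illustrative, and a real system would not normally materialize every cuboid this way.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "date", "country")

def cuboid(cube, keep):
    """Aggregate the base cube down to the dimensions listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def all_cuboids(cube):
    """Return every cuboid, from the 0-D apex to the 3-D base."""
    return {keep: cuboid(cube, keep)
            for k in range(len(DIMS) + 1)
            for keep in combinations(DIMS, k)}

base = {  # made-up sales counts
    ("TV", "1Qtr", "U.S.A"): 10,
    ("PC", "1Qtr", "U.S.A"): 4,
    ("VCR", "1Qtr", "Canada"): 2,
    ("TV", "2Qtr", "Canada"): 6,
}

lattice = all_cuboids(base)
print(len(lattice))    # 8 cuboids for 3 dimensions
print(lattice[()])     # apex: {(): 22}
```

With n dimensions there are 2^n cuboids, which is why the count here is 8.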

11 Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions and measures.
Star schema (the simplest): a fact cube in the middle (star center) connected to the dimension tables (star points).
Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Fact constellation: multiple fact cubes share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema.

12 Example of Star Schema
Sales Fact Cube: date_key, product_key, country_key; measures: units_sold, dollars_sold, avg_sales
date dimension: date_key, day, day_of_the_week, month, quarter, year
product dimension: product_key, product_name, brand, type, supplier_type
country dimension: country_key, country_name, country_continent
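The star schema above can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table names (`date_dim`, `product_dim`, `country_dim`, `sales_fact`), the column types, and all inserted rows are invented for illustration; only the column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, day INTEGER,
    day_of_the_week TEXT, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT,
    brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE country_dim (country_key INTEGER PRIMARY KEY, country_name TEXT,
    country_continent TEXT);
CREATE TABLE sales_fact (
    date_key INTEGER REFERENCES date_dim(date_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    country_key INTEGER REFERENCES country_dim(country_key),
    units_sold INTEGER, dollars_sold REAL, avg_sales REAL);
""")

# Made-up rows for illustration only.
conn.executemany("INSERT INTO date_dim VALUES (?,?,?,?,?,?)",
                 [(1, 15, "Mon", 1, "1Qtr", 2018), (2, 20, "Fri", 4, "2Qtr", 2018)])
conn.executemany("INSERT INTO product_dim VALUES (?,?,?,?,?)",
                 [(1, "TV", "BrandA", "electronics", "wholesale"),
                  (2, "PC", "BrandB", "computer", "retail")])
conn.executemany("INSERT INTO country_dim VALUES (?,?,?)",
                 [(1, "U.S.A", "North America"), (2, "Canada", "North America")])
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?,?)",
                 [(1, 1, 1, 10, 5000.0, 500.0),
                  (1, 2, 1, 4, 4000.0, 1000.0),
                  (2, 1, 2, 6, 3000.0, 500.0)])

# Units sold by country and quarter: join the fact table to two dimensions.
rows = conn.execute("""
    SELECT c.country_name, d.quarter, SUM(f.units_sold)
    FROM sales_fact f
    JOIN country_dim c ON c.country_key = f.country_key
    JOIN date_dim d ON d.date_key = f.date_key
    GROUP BY c.country_name, d.quarter
    ORDER BY c.country_name, d.quarter
""").fetchall()
print(rows)  # [('Canada', '2Qtr', 6), ('U.S.A', '1Qtr', 14)]
```

The query never touches `product_dim`, so grouping by country and quarter alone is the relational counterpart of rolling up along product.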

13 Example of Snowflake Schema
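Sales Fact Cube: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key
supplier dimension (referenced from item by supplier_key): supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key
city dimension (referenced from location by city_key): city_key, city, province_or_street, country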

14 Example of Fact Constellation
Sales Fact Cube: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Cube: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
time dimension (shared): time_key, day, day_of_the_week, month, quarter, year
item dimension (shared): item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_street, country
shipper dimension (referenced from the Shipping Fact Cube by shipper_key): shipper_name, location_key, shipper_type

15 Visualizing a 5-D Data Cube
Sales volume = box size (4th dimension).
Product color = box color (5th dimension).
Visualization is very important and can be done for more than 3 dimensions.

16 Typical OLAP Operations
Roll up: summarize data, in one of two ways.
Grouped-by aggregation (dimension generalization): e.g., if date is initially in months, roll up to quarters (sum the groups by quarter).
Dimension reduction/elimination: aggregate over an entire dimension, eliminating it (e.g., slide 6: rollup by summing over all products, leaving just country and date).
Drill down: the reverse of roll-up, going from a higher-level summary to a lower-level summary (detailed data), or introducing new dimensions.
Slice and dice: project and select.
Pivot (rotate): reorient (re-order) the cube, for visualization and faster processing.

17 Partial Rollup: climbing up a concept hierarchy
Instead of eliminating Product altogether by summing over all products, roll up partially on Product, from (VCR, PC, TV) to comp (computer: includes PC only) and non-comp (non-computer: includes VCR + TV).
(Figure: the sample cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC, grouped into non-comp and comp), and Country (U.S.A, Canada, Mexico).)
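The partial rollup on Product can be sketched by mapping each product to its parent in the concept hierarchy before summing. The comp/non-comp grouping is the one given on the slide; the sales counts and the function name `partial_rollup` are made up for illustration.

```python
from collections import defaultdict

# Concept hierarchy from the slide: PC is "comp", TV and VCR are "non-comp".
PRODUCT_HIERARCHY = {"PC": "comp", "TV": "non-comp", "VCR": "non-comp"}

def partial_rollup(cube, hierarchy):
    """Replace each product by its parent category and sum the merged cells."""
    out = defaultdict(int)
    for (product, date, country), value in cube.items():
        out[(hierarchy[product], date, country)] += value
    return dict(out)

cube = {  # made-up sales counts
    ("TV", "1Qtr", "U.S.A"): 10,
    ("VCR", "1Qtr", "U.S.A"): 3,
    ("PC", "1Qtr", "U.S.A"): 4,
}

rolled = partial_rollup(cube, PRODUCT_HIERARCHY)
print(rolled[("non-comp", "1Qtr", "U.S.A")])  # 13
```

The Product dimension survives, only at a coarser level, which is what distinguishes this from eliminating the dimension.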

18 SLICE (e.g., slice off PC)
(Figure: the sample cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), and Country (U.S.A, Canada, Mexico), illustrating the slice for Product = PC.)

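19 DICE (e.g., dice off PC, the last two quarters, the country Mexico)
(Figure: the sample cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), and Country (U.S.A, Canada, Mexico), illustrating the sub-cube for Product = PC, Date in (3Qtr, 4Qtr), Country = Mexico.)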
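Slice and dice from slides 18 and 19 can be sketched as selections on the cube's coordinates. The sketch assumes the common reading in which a slice fixes one dimension to a single value (and drops that dimension), while a dice keeps the cells whose coordinates fall in chosen sets on several dimensions. The function names and the data are made up.

```python
DIMS = ("product", "date", "country")

def slice_cube(cube, dim, value):
    """Keep cells where `dim` equals `value`, then drop that dimension."""
    axis = DIMS.index(dim)
    return {key[:axis] + key[axis + 1:]: v
            for key, v in cube.items() if key[axis] == value}

def dice(cube, **allowed):
    """Keep cells whose coordinate on each named dimension is in the allowed set."""
    return {key: v for key, v in cube.items()
            if all(key[DIMS.index(d)] in vals for d, vals in allowed.items())}

cube = {  # made-up sales counts
    ("PC", "3Qtr", "Mexico"): 5,
    ("PC", "1Qtr", "Mexico"): 2,
    ("TV", "3Qtr", "Mexico"): 7,
    ("PC", "4Qtr", "Canada"): 1,
}

pc_slice = slice_cube(cube, "product", "PC")
pc_dice = dice(cube, product={"PC"}, date={"3Qtr", "4Qtr"}, country={"Mexico"})
print(pc_slice)
print(pc_dice)
```

The slice returns a 2-D result keyed by (date, country); the dice returns a smaller cube that still has all three dimensions.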

20 Pivot/Rotate
(Figure: the same cube shown in two orientations. The axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC), and Country (U.S.A, Canada, Mexico) are assigned to the primary, secondary, and tertiary positions in a different order in each orientation.)
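On a cube stored as a mapping from coordinate tuples to values, a pivot amounts to re-ordering the coordinates in every key. This is a minimal sketch with made-up data; the function name `pivot` and the chosen axis orders are illustrative only.

```python
def pivot(cube, dims, new_order):
    """Re-order the dimensions of every cell key; the values are unchanged."""
    idx = [dims.index(d) for d in new_order]
    return {tuple(key[i] for i in idx): value for key, value in cube.items()}

dims = ("product", "date", "country")
cube = {  # made-up sales counts
    ("TV", "1Qtr", "U.S.A"): 10,
    ("PC", "2Qtr", "Canada"): 7,
}

rotated = pivot(cube, dims, ("country", "product", "date"))
print(rotated[("U.S.A", "TV", "1Qtr")])  # 10
```

No measurement changes under a pivot; only the primary, secondary, and tertiary ordering of the axes does, which affects how the cube is displayed and traversed.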

21 Who-What-Where Cube (with WHEN as an attribute)
Subject: Who-What-Where-When border events.
Some important DWs for our region: Northern Border Security Data Warehouse.
(Figure: a cube with axes Who (p0 to p3), What (a0 to a3), and Where (l0 to l3); each cell holds a time value, and the labels e0 to e3 also appear in the figure.)
Person (WHO) Dimension Table, rows as recovered:
John Roe, 6’11, 220, Mot, farmer
Jill Wade, 5’9”, 140, Zap, exec
Ahmed Ali, 5’8”, 190, Fgo, teach
Jose Fox, 5’6”, 150, GF, labor
Activity (WHAT) Dimension Table (columns: Activity, Category, Pub Info, Reported by):
Activity0: category crossing, reported by USBP
Activity1: category felony, reported by FBI
Activity2: category commerce, reported by Dept of Commerce
Activity3: category recreation, Pub Info 1, reported by Chamber of Commerce
Location (WHERE) Dimension Table (columns: LOC, Surface T/F, LAT, LON, EL, TERRAIN, Class), rows l0 to l3:
Terrain and class values that appear: urban, river, rural, plains.
Numeric values that appear (column assignment unclear): 1, 49, 91, 95, 89, 90, 900, 2000, 897, 950.

22 Gene-Organism Dimension Table (chromosome,length)
PUBLIC (Ptree Unified BioLogical InformatiCs Data Cube and Dimension Tables)
Gene-Treatment-Organism Cube (subject = experiment): a cell is 1 iff that gene from that organism expresses at a threshold level under that treatment.
(Figure: a cube with axes Gene (g0 to g3), Organism (o0 to o3), and Treatment (t0 to t3); the labels e0 to e3 also appear in the figure.)
Gene Dimension Table attributes: SubCell-Location (Myta, Nucl, Ribo), Function (apop, meio, mito), StopCodonDensity (e.g., .1, .9), PolyA-Tail (e.g., 1).
Gene-Organism Dimension Table: chromosome, length.
Organism Dimension Table (columns: Organism, Species, Vert, Genome Size (million bp)):
human: Homo sapiens
mouse: Mus musculus, Vert 1, genome size 3000
fly: Drosophila melanogaster, genome size 185
yeast: Saccharomyces cerevisiae, genome size 12.1
Treatment Dimension Table (MIAME) attributes: LAB, PI, UNV, STR, CTY, STZ, ED, AD, S, H, M, N.

