What is a Data Warehouse? Defined in many different ways.

What is a Data Warehouse?
Defined in many different ways.
- A decision support database that is maintained separately from the organization's operational database.
- Supports information processing by providing a solid platform of consolidated, historical data for analysis.
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
- Data warehousing: the process of constructing and using data warehouses.

Subject-Oriented, Integrated, Time Variant, Non-volatile

Subject-Oriented
- Organized around major subject(s) or fact(s), such as sales, enrollments, experiments, events.
- Focused on modeling and analysis for decision makers, not on daily operations or transactions.

Integrated
- Constructed (possibly) by integrating multiple, heterogeneous data sources.
- The source data must be cleaned and integrated to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources (e.g., hotel price: currency, tax, breakfast covered, etc.).
- This step is done at the time the data is moved to the data warehouse.

Time Variant
- The time horizon for the DW is significantly longer than that of operational systems.
- Operational database: current value data.
- Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years).
- Every structure in the data warehouse contains an element of time, explicitly or implicitly, whereas operational data may or may not contain a "time element" (it is always assumed to be the "current value").

Non-volatile
- A physically separate store of data transformed from the operational environment.
- Operational updates of DW data do not occur (every insert is considered a new item).
- Does not require transaction processing, recovery, and concurrency control mechanisms.
- Requires only two operations in data accessing: initial loading of data and read of data.
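The integration step and the load/read-only access pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exchange rates, the record fields (`currency`, `tax_included`, `tax_rate`) and the `Warehouse` class are hypothetical and do not come from the slides.

```python
from datetime import date

# Hypothetical exchange rates to USD (illustrative values, not market data).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25, "MXN": 0.05}

def normalize(record):
    """Bring one source record to a single convention: USD, tax included."""
    price = record["price"] * RATES_TO_USD[record["currency"]]
    if not record["tax_included"]:
        price *= 1 + record["tax_rate"]
    return {"hotel": record["hotel"], "price_usd": round(price, 2)}

class Warehouse:
    """Append-only store exposing only the two operations: load and read."""

    def __init__(self):
        self._rows = []

    def load(self, records, load_date):
        # Cleaning/integration happens at load time; rows are never updated.
        for record in records:
            row = normalize(record)
            row["load_date"] = load_date  # explicit time element
            self._rows.append(row)

    def read(self, hotel):
        return [dict(row) for row in self._rows if row["hotel"] == hotel]

wh = Warehouse()
wh.load(
    [
        {"hotel": "Hotel A", "price": 100.0, "currency": "EUR",
         "tax_included": False, "tax_rate": 0.10},
        {"hotel": "Hotel B", "price": 80.0, "currency": "USD",
         "tax_included": True, "tax_rate": 0.0},
    ],
    load_date=date(2018, 12, 9),
)
print(wh.read("Hotel A"))
```

A later price for the same hotel would be loaded as a new row with its own `load_date` rather than overwriting the old one, which is what keeps the history available.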

Data Mining on a DW?
Data mining goes into MOUNTAINS of raw data for info gems.
[Figure: processing pipeline with the stages: raw database and smart files; Data Cleaning/Integration (missing data, outliers, noise, errors); Data Warehouse (cleaned, integrated, read-only, periodic, historical); Task-relevant Data Selection (feature extraction, tuple selection); Data Mining (OLAP, classification, clustering, rule mining); Pattern Evaluation and Assay; visualization; with loop backs between stages.]
Data Mining: Concepts and Techniques, December 9, 2018

From Tables and Spreadsheets to Data Cubes
- A data warehouse is usually based on a multidimensional data model, which views data in the form of a data cube describing the subject of interest (e.g., sales).
- A data cube allows data to be modeled and viewed in multiple dimensions.
- Auxiliary dimension tables are added to the central cube for additional information, e.g., for a sales cube: item (item_name, brand, type), time (day, week, month, quarter, year), salesman (name, addr, salary).
- The fact cube contains measurement(s) (e.g., number_of_sales) and keys (references) to each of the related dimension tables.

A Sample Data Cube
Each cell contains a sales measurement, e.g., the number of sales (a cell may contain many other measurements of product-date-country instances).
[Figure: 3-D cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, PC, VCR) and Country (U.S.A, Canada, Mexico).]

Rollup along product (e.g., using the aggregate, sum)
Rollup (aggregate under +) along product gives the total of all product sales by country and quarter, i.e., total sales by country and date.
[Figure: the sample cube (Date: 1Qtr to 4Qtr; Product: TV, PC, VCR; Country: U.S.A, Canada, Mexico) with the country-by-quarter totals shown on the face obtained by summing over Product.]

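Rollup along date (e.g., using the aggregate, sum)
Result: total annual sales by country and product.
[Figure: the sample cube (Date: 1Qtr to 4Qtr; Product: TV, PC, VCR; Country: U.S.A, Canada, Mexico) with the annual totals shown on the face obtained by summing over Date.]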

Rollup along country (e.g., using the aggregate, sum)
Result: total sales over all countries, by product and date.
[Figure: the sample cube (Date: 1Qtr to 4Qtr; Product: TV, PC, VCR; Country: U.S.A, Canada, Mexico) with the product-by-date totals shown on the face obtained by summing over Country.]

All rollups (e.g., using the aggregate, sum)
Aggregates labeled on the cube:
- sales by product, country and quarter (the cube cells)
- sales by product, country
- sales by country, date
- sales by country
- sales by product
- sales by date
- total sales
[Figure: the sample cube (Date: 1Qtr to 4Qtr; Product: TV, PC, VCR; Country: U.S.A, Canada, Mexico) with every rollup face, edge and corner labeled.]

Cuboids Corresponding to the Cube
- 0-D (apex) cuboid: all
- 1-D cuboids: product; date; country
- 2-D cuboids: product,date; product,country; date,country
- 3-D (base or fact) cuboid: product, date, country
Moving down the lattice is a drilldown (e.g., drilldown on product); moving up is a rollup (e.g., rollup on country = sum over country).
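The cuboid lattice above can be enumerated mechanically: each subset of the dimensions gives one cuboid, obtained by summing the base cells over the dimensions that are left out. The following is a minimal sketch; the sales counts are made up for illustration and the helper name `cuboid` is an assumption, not something defined in the slides.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "date", "country")

# Base (3-D) cuboid: made-up sales counts keyed by (product, date, country).
base = {
    ("TV", "1Qtr", "U.S.A"): 10,
    ("TV", "2Qtr", "Canada"): 5,
    ("PC", "1Qtr", "U.S.A"): 7,
    ("VCR", "1Qtr", "Mexico"): 3,
}

def cuboid(keep):
    """Sum the base cells over every dimension not listed in `keep`."""
    positions = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for cell, units in base.items():
        totals[tuple(cell[i] for i in positions)] += units
    return dict(totals)

# All 2**3 = 8 cuboids, from the apex (no dimensions) down to the base.
lattice = {
    keep: cuboid(keep)
    for size in range(len(DIMS) + 1)
    for keep in combinations(DIMS, size)
}

for keep, cells in lattice.items():
    print(f"{len(keep)}-D cuboid {keep or ('all',)}: {cells}")
```

With n dimensions (and no concept hierarchies) the lattice has 2^n cuboids, which is why real systems usually materialize only some of them.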

Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
- Star schema (simplest): a fact cube in the middle (star center) connected to the dimension tables (star points).
- Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
- Fact constellations: multiple fact cubes share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation.

Example of Star Schema
- date (dimension): date_key, day, day_of_the_week, month, quarter, year
- product (dimension): product_key, product_name, brand, type, supplier_type
- country (dimension): country_key, country_name, country_continent
- Sales Fact Cube: date_key, product_key, country_key; measures: units_sold, dollars_sold, avg_sales
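As a rough sketch, the star schema above can be created and queried with Python's built-in `sqlite3` module. The table names (`date_dim`, `product_dim`, `country_dim`, `sales_fact`) and all row values are assumptions made for this example; `avg_sales` is left out because it can be derived from the other two measures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key INTEGER PRIMARY KEY, day INTEGER, day_of_the_week TEXT,
    month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, product_name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT);
CREATE TABLE country_dim (
    country_key INTEGER PRIMARY KEY, country_name TEXT,
    country_continent TEXT);
-- Fact table: one key per dimension plus the measures.
CREATE TABLE sales_fact (
    date_key INTEGER REFERENCES date_dim(date_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    country_key INTEGER REFERENCES country_dim(country_key),
    units_sold INTEGER, dollars_sold REAL);
""")

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 15, "Mon", 1, "1Qtr", 2018),
    (2, 20, "Fri", 4, "2Qtr", 2018),
])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?, ?)", [
    (1, "TV", "BrandA", "electronics", "wholesale"),
    (2, "PC", "BrandB", "computer", "retail"),
])
conn.executemany("INSERT INTO country_dim VALUES (?, ?, ?)", [
    (1, "U.S.A", "North America"),
    (2, "Canada", "North America"),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 10, 5000.0),
    (1, 2, 1, 4, 4000.0),
    (2, 1, 2, 6, 3000.0),
    (2, 1, 1, 2, 1000.0),
])

# Roll up over product: totals by country and quarter.
rows = conn.execute("""
    SELECT c.country_name, d.quarter,
           SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN date_dim AS d ON f.date_key = d.date_key
    JOIN country_dim AS c ON f.country_key = c.country_key
    GROUP BY c.country_name, d.quarter
    ORDER BY c.country_name, d.quarter
""").fetchall()
print(rows)
```

Every analytical query follows the same shape: join the fact table to the dimensions it needs, then group by the dimension attributes of interest.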

Example of Snowflake Schema
- time (dimension): time_key, day, day_of_the_week, month, quarter, year
- item (dimension): item_key, item_name, brand, type, supplier_key
- supplier (normalized out of item): supplier_key, supplier_type
- branch (dimension): branch_key, branch_name, branch_type
- location (dimension): location_key, street, city_key
- city (normalized out of location): city_key, city, province_or_street, country
- Sales Fact Cube: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
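The step from a star-style dimension to a snowflake one is ordinary normalization. The sketch below splits `supplier_type` out of a star-style item table into a separate supplier table linked by `supplier_key`; the rows and the function name `snowflake_items` are hypothetical.

```python
# Star-style item dimension: supplier_type is repeated on every item row.
star_items = [
    {"item_key": 1, "item_name": "TV", "brand": "BrandA",
     "type": "electronics", "supplier_type": "wholesale"},
    {"item_key": 2, "item_name": "PC", "brand": "BrandB",
     "type": "computer", "supplier_type": "retail"},
    {"item_key": 3, "item_name": "VCR", "brand": "BrandA",
     "type": "electronics", "supplier_type": "wholesale"},
]

def snowflake_items(items):
    """Return (item rows with supplier_key, supplier rows)."""
    supplier_keys = {}  # supplier_type -> surrogate supplier_key
    item_rows = []
    for item in items:
        key = supplier_keys.setdefault(
            item["supplier_type"], len(supplier_keys) + 1)
        row = {k: v for k, v in item.items() if k != "supplier_type"}
        row["supplier_key"] = key
        item_rows.append(row)
    supplier_rows = [
        {"supplier_key": key, "supplier_type": supplier_type}
        for supplier_type, key in supplier_keys.items()
    ]
    return item_rows, supplier_rows

item_rows, supplier_rows = snowflake_items(star_items)
print(item_rows)
print(supplier_rows)
```

The trade-off is the usual one: the snowflake form stores each supplier type once, at the cost of an extra join when a query needs it.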

Example of Fact Constellation
- time (shared dimension): time_key, day, day_of_the_week, month, quarter, year
- item (shared dimension): item_key, item_name, brand, type, supplier_type
- branch (dimension): branch_key, branch_name, branch_type
- location (dimension): location_key, street, city, province_or_street, country
- shipper (dimension): shipper_key, shipper_name, location_key, shipper_type
- Sales Fact Cube: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- Shipping Fact Cube: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped

Visualizing a 5-D Data Cube
- Sales volume = box size (4th dimension)
- Product color = box color (5th dimension)
Visualization is very important and can be done for more than 3 dimensions.

Typical OLAP Operations
- Roll up: summarize data, in one of two ways:
  - Grouped-by aggregation (dimension generalization), e.g., if date is initially in months, roll up to quarters (sum the groups by quarter).
  - Dimension reduction/elimination: aggregating over an entire dimension, eliminating it (e.g., the rollup along product, which sums over all products and leaves just country and date).
- Drill down: the reverse of roll-up, going from a higher-level summary to a lower-level summary (detailed data), or introducing new dimensions.
- Slice and dice: project and select.
- Pivot (rotate): reorient (re-order) the cube, for visualization and faster processing.
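Both kinds of roll-up can be illustrated on a small cube held in a Python dict. This is a sketch under assumptions: the cube is keyed by (product, month, country), the sales counts are invented, and the function names are chosen for this example only.

```python
from collections import defaultdict

# Concept hierarchy for the date dimension: month number -> quarter label.
MONTH_TO_QUARTER = {m: f"{(m - 1) // 3 + 1}Qtr" for m in range(1, 13)}

# Made-up monthly sales keyed by (product, month, country).
monthly = {
    ("TV", 1, "U.S.A"): 4,
    ("TV", 2, "U.S.A"): 6,
    ("TV", 4, "U.S.A"): 5,
    ("PC", 3, "U.S.A"): 7,
    ("PC", 5, "Canada"): 2,
}

def roll_up_to_quarters(cube):
    """Dimension generalization: months are grouped and summed by quarter."""
    out = defaultdict(int)
    for (product, month, country), units in cube.items():
        out[(product, MONTH_TO_QUARTER[month], country)] += units
    return dict(out)

def roll_up_over_product(cube):
    """Dimension elimination: sum over all products, keep (date, country)."""
    out = defaultdict(int)
    for (_product, period, country), units in cube.items():
        out[(period, country)] += units
    return dict(out)

quarterly = roll_up_to_quarters(monthly)
by_quarter_country = roll_up_over_product(quarterly)
print(quarterly)
print(by_quarter_country)
```

Drill down is the opposite direction: going from `quarterly` back to `monthly` is only possible because the detailed data is still stored.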

Partial Rollup: climbing up a concept hierarchy
Instead of eliminating Product altogether by summing over all products, roll up partially on Product, from (VCR, PC, TV) to computer (includes PC only) and non-computer (includes VCR + TV).
[Figure: the sample cube (Date: 1Qtr to 4Qtr; Country: U.S.A, Canada, Mexico) with the Product axis regrouped from TV, VCR, PC into comp and non-comp.]

SLICE (e.g., slice off PC)
[Figure: the sample cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC) and Country (U.S.A, Canada, Mexico), with the PC slice separated from the rest.]

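DICE (e.g., dice off PC, the last two quarters, the country Mexico)
[Figure: the sample cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC) and Country (U.S.A, Canada, Mexico), with the sub-cube for PC, 3Qtr-4Qtr, Mexico separated from the rest.]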
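Slice and dice are both selections on the cube. The sketch below reads "slice" as fixing a single value on one dimension (which drops that dimension) and "dice" as restricting several dimensions to sets of values; that reading, the made-up sales counts and the function names are assumptions for this example.

```python
DIMS = ("product", "date", "country")

# Made-up sales counts keyed by (product, date, country).
cube = {
    ("TV", "1Qtr", "U.S.A"): 10,
    ("TV", "3Qtr", "Mexico"): 2,
    ("PC", "1Qtr", "Canada"): 4,
    ("PC", "3Qtr", "Mexico"): 6,
    ("PC", "4Qtr", "Mexico"): 1,
    ("VCR", "4Qtr", "Mexico"): 3,
}

def slice_cube(cube, dim, value):
    """Fix one dimension to a single value; the result has one dimension less."""
    pos = DIMS.index(dim)
    return {
        cell[:pos] + cell[pos + 1:]: units
        for cell, units in cube.items()
        if cell[pos] == value
    }

def dice(cube, **allowed):
    """Keep cells whose value on each named dimension is in the allowed set."""
    return {
        cell: units
        for cell, units in cube.items()
        if all(cell[DIMS.index(d)] in values for d, values in allowed.items())
    }

pc_slice = slice_cube(cube, "product", "PC")
pc_late_mexico = dice(
    cube, product={"PC"}, date={"3Qtr", "4Qtr"}, country={"Mexico"})
print(pc_slice)
print(pc_late_mexico)
```

The dice call mirrors the slide's example: PC, the last two quarters, and Mexico.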

Pivot/Rotate
[Figure: the same cube drawn in two orientations, each with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC) and Country (U.S.A, Canada, Mexico); the rotation changes which dimension is primary, secondary and tertiary.]
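A pivot changes only the order of the dimensions, not the cell values. A minimal sketch, assuming a dict-based cube keyed by (product, date, country) with made-up numbers; `pivot` is a name chosen for this example.

```python
DIMS = ("product", "date", "country")

# Made-up sales counts keyed by (product, date, country).
cube = {
    ("TV", "1Qtr", "U.S.A"): 10,
    ("PC", "2Qtr", "Canada"): 4,
    ("VCR", "4Qtr", "Mexico"): 3,
}

def pivot(cube, new_order):
    """Re-key every cell so the dimensions appear in `new_order`."""
    positions = [DIMS.index(d) for d in new_order]
    return {
        tuple(cell[i] for i in positions): units
        for cell, units in cube.items()
    }

# Make country the primary dimension, product secondary, date tertiary.
rotated = pivot(cube, ("country", "product", "date"))
print(rotated)
```

Since the cells themselves are untouched, pivoting back with the original order recovers the original cube.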

Some important DWs for our region: Northern Border Security Data Warehouse
Who-What-Where Cube (with WHEN as an attribute) for Who-What-Where-When Border Events (subject)
[Figure: cube with axes Who (p0, p1, p2, p3), What (a0, a1, a2, a3) and Where (l0, l1, l2, l3); each cell holds time values.]

Person (WHO) Dimension Table (column headers lost in extraction; values in source order reversed)
  Jose Fox    5'6"   150   GF    labor
  Ahmed Ali   5'8"   190   Fgo   teach
  Jill Wade   5'9"   140   Zap   exec
  John Roe    6'11   220   Mot   farmer

Activity (WHAT) Dimension Table
  Activity    Category     Pub Info   Reported by
  Activity0   crossing                USBP
  Activity1   felony                  FBI
  Activity2   commerce                Dept of Commerce
  Activity3   recreation   1          Chamber of Commerce

Location (WHERE) Dimension Table
  Columns: LOC, LAT, LON, EL, TERRAIN, Class, Surface T/F
  [Table: rows l0 to l3; only fragments of the values survive (l0: urban, 49; l2: river; l3: rural, plains, 1; other numeric values: 91, 95, 89, 90, 900, 2000, 897, 950).]

PUBLIC (Ptree Unified BioLogical InformatiCs Data Cube and Dimension Tables)
Gene-Treatment-Organism Cube (subject = experiment): a cell is 1 iff that gene from that organism expresses at a threshold level under that treatment.
[Figure: cube with axes Gene (g0, g1, g2, g3), Organism (o0, o1, o2, o3) and Treatment/experiment (e0, e1, e2, e3; t0, t1, t2, t3), filled with 0/1 values.]

Organism Dimension Table
  Organism   Species                    Vert   Genome Size (million bp)
  human      Homo sapiens
  fly        Drosophila melanogaster           185
  yeast      Saccharomyces cerevisiae          12.1
  mouse      Mus musculus               1      3000

Gene Dimension Table
  Columns: SubCell-Location (Myta, Nucl, Ribo), Function (apop, meio, mito), StopCodonDensity, PolyA-Tail
  [Table: values not recoverable apart from the fragments .1, .9 and 1.]

Gene-Organism Dimension Table (chromosome, length)
  [Table: entries are chromosome, length pairs such as 17, 78; 12, 60; Mi, 40; 1, 48; 10, 75; 7, 40; 14, 65; 16, 76; 9, 45; 1, 43; 0 where absent.]

Treatment Dimension Table (MIAME)
  Columns: LAB, PI, UNV, STR, CTY, STZ, ED, AD, S, H, M, N
  [Table: values not recoverable.]