Dr. M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2010 COMP207: Data Mining Data Warehousing COMP207: Data Mining.

Presentation transcript:


Today's Topics
• Data Warehouses
• Data Cubes
• Warehouse Schemas
• OLAP
• Materialisation

What is a Data Warehouse?
Most common definition: “A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.” - W. H. Inmon
• Corporate focused, assumes a lot of data, and typically sales related
• Data for a “Decision Support System” or “Management Support System”
• 1996 survey: Return on Investment of 400+%
Data Warehousing: the process of constructing (and using) a data warehouse

Data Warehouse
• Subject-oriented: Focused on important subjects, not transactions. Gives a concise view containing only the data useful for decision making.
• Integrated: Constructed from multiple, heterogeneous data sources, normally distributed relational databases that do not necessarily share a schema. Cleaning and pre-processing techniques are applied to handle missing, noisy and inconsistent data (sounds familiar, I hope).

Data Warehouse
• Time-variant: Holds different values for the same fields over time. An operational database only holds the current value; a data warehouse also keeps historical values.
• Nonvolatile: A physically separate store. Updates are not applied online, only in offline batch mode. Only read access is required, so there are no concurrency issues.

Data Warehouse
Data warehouses are distinct from:
• Distributed DB: Integrated via wrappers/mediators at query time, which is far too slow and makes semantic integration much more complicated. In a warehouse, integration is done before loading, not at run time.
• Operational DB: Only records the current value, and carries lots of extra information that is not useful for analysis. Schemas/models, access patterns, users and functions all differ, even though the warehouse data is derived from an operational DB.

OLAP vs OLTP
• OLAP: Online Analytical Processing (Data Warehouse)
• OLTP: Online Transaction Processing (Traditional DBMS)
OLAP data is typically historical, consolidated and multi-dimensional (e.g. product, time, location). Queries involve lots of full database scans across terabytes or more of data, typically using aggregation and summarisation functions. These are distinctly different uses from OLTP on the operational database.
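The contrast between the two access patterns can be sketched with Python's built-in sqlite3 module. The `sales` table, its columns and its rows are hypothetical, made up for illustration only; a real warehouse would hold far more data in a dedicated system.

```python
import sqlite3

# Hypothetical toy table; names and values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, "
    "city TEXT, quarter TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "laptop", "Liverpool", "Q1", 900.0),
        (2, "laptop", "Leeds", "Q1", 850.0),
        (3, "phone", "Liverpool", "Q2", 400.0),
        (4, "phone", "Leeds", "Q2", 450.0),
    ],
)

# OLTP-style access: touch a single row identified by its key.
conn.execute("UPDATE sales SET amount = 875.0 WHERE id = 2")
oltp_row = conn.execute(
    "SELECT product, amount FROM sales WHERE id = 2"
).fetchone()

# OLAP-style access: scan every row and summarise along dimensions.
olap_rows = conn.execute(
    "SELECT product, quarter, SUM(amount) FROM sales "
    "GROUP BY product, quarter ORDER BY product, quarter"
).fetchall()

print(oltp_row)
print(olap_rows)
```

The first query reads and writes one current value; the second reads the whole table and returns only aggregates, which is the kind of workload a warehouse is organised around.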

Data Cubes
Data is normally multi-dimensional, and can be thought of as a cube. Often there are 3 dimensions: time, location and product. There is no need to have just 3 dimensions, though: a cube for cars could have make, colour, price, location and time, for example.
(Image courtesy of IBM OLAP Miner documentation)

Data Cubes
• Many 'cuboids' can be constructed from the full cube by excluding dimensions.
• In an N-dimensional data cube, the cuboid with all N dimensions is the 'base cuboid'. The 0-dimensional cuboid, which aggregates over every dimension, is called the 'apex cuboid'.
• Can think of this as a lattice of cuboids... (Following lattice courtesy of Han & Kamber)

Lattice of Cuboids
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
• 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)
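The lattice for the four dimensions above can be enumerated mechanically: every subset of the dimensions is one cuboid, so N dimensions give 2^N cuboids. A minimal sketch follows; the function name `cuboid_lattice` is an assumption made for this example.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

def cuboid_lattice(dims):
    """Group every subset of dims (one cuboid each) by its number of dimensions."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

lattice = cuboid_lattice(dimensions)
total = sum(len(cuboids) for cuboids in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)
print("total cuboids:", total)
```

The empty tuple at level 0 stands for the apex cuboid ("all"), and the single tuple at level 4 is the base cuboid.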

Multi-dimensional Units
Each dimension can also be thought of in terms of different units.
• Time: decade, year, quarter, month, day, hour (and week, which isn't strictly hierarchical with the others!)
• Location: continent, country, state, city, store
• Product: electronics, computer, laptop, dell, inspiron
This is called a “Star-Net” model in data warehousing, and allows for various operations on the dimensions and the resulting cuboids.

Star-Net Model
(Figure: each radial line is a dimension, with its levels of abstraction marked along it.)
• Shipping Method: AIR-EXPRESS, TRUCK
• Customer Orders: ORDER, CONTRACTS
• Customer
• Product: PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM
• Organization: SALES PERSON, DISTRICT, DIVISION
• Promotion
• Geography: DISTRICT, REGION, COUNTRY
• Time: DAILY, QTRLY, ANNUALLY

Data Cube Operations
• Roll Up: Summarise data by climbing up a hierarchy. E.g. from monthly to quarterly, from Liverpool to England
• Drill Down: Opposite of Roll Up. E.g. from computer to laptop, from £ to £
• Slice: Remove a dimension by setting a value for it. E.g. location/product where time is Q1, 2007
• Dice: Restrict the cube by setting values for multiple dimensions. E.g. the sub-cube for Q1, Q2 / North American cities / 3 products
• Pivot: Rotate the cube (mostly for visualisation)
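The operations above can be sketched on a tiny cube held as a Python dictionary keyed by (time, location, product). The data, the helper names (`roll_up_time`, `slice_`, `dice`, `pivot`) and the `Q1,2007` label format are all assumptions made for illustration, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical base cells: (month, city, product) -> units sold.
facts = {
    ("2007-01", "Liverpool", "laptop"): 5,
    ("2007-02", "Liverpool", "laptop"): 3,
    ("2007-04", "Liverpool", "laptop"): 4,
    ("2007-01", "Toronto", "laptop"): 2,
    ("2007-02", "Toronto", "phone"): 7,
}

def quarter_of(month):
    """Map 'YYYY-MM' to a quarter label such as 'Q1,2007'."""
    year, mm = month.split("-")
    return f"Q{(int(mm) - 1) // 3 + 1},{year}"

def roll_up_time(cells):
    """Roll up: climb the time hierarchy from month to quarter, summing units."""
    out = defaultdict(int)
    for (month, city, product), units in cells.items():
        out[(quarter_of(month), city, product)] += units
    return dict(out)

def slice_(cells, axis, value):
    """Slice: fix one dimension to a single value and drop that dimension."""
    return {k[:axis] + k[axis + 1:]: v for k, v in cells.items() if k[axis] == value}

def dice(cells, allowed):
    """Dice: keep cells whose value on each listed axis is in the allowed set."""
    return {k: v for k, v in cells.items()
            if all(k[axis] in values for axis, values in allowed.items())}

def pivot(cells, order):
    """Pivot: reorder the dimensions; the stored values are unchanged."""
    return {tuple(k[i] for i in order): v for k, v in cells.items()}

quarterly = roll_up_time(facts)
q1_slice = slice_(quarterly, 0, "Q1,2007")
toronto_dice = dice(quarterly, {0: {"Q1,2007", "Q2,2007"}, 1: {"Toronto"}})
by_city_first = pivot(quarterly, (1, 0, 2))
```

Drill down is the reverse of `roll_up_time`: it needs the finer-grained cells (here `facts`) to still be available, since the monthly figures cannot be recovered from the quarterly totals alone.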

Data Cube Schemas
• Star Schema: Single fact table in the middle, with a connected set of dimension tables (hence a star)
• Snowflake Schema: Some of the dimension tables are further refined into smaller dimension tables (hence it looks like a snowflake)
• Fact Constellation: Multiple fact tables can share dimension tables (hence it looks like a collection of star schemas; also called a Galaxy Schema)

Star Schema
• Sales Fact Table: time_key, item_key, location_key, units_sold (measure value)
• Time Dimension: time_key, day, day_of_week, month, quarter, year
• Item Dimension: item_key, name, brand, type, supplier_type
• Location Dimension: location_key, street, city, state, country, continent
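The star schema on this slide can be written down as table definitions. The sketch below uses Python's sqlite3 module; the table names (`time_dim`, `item_dim`, `location_dim`, `sales_fact`), the column types and all sample rows are assumptions, while the column names follow the slide.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
    month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    state TEXT, country TEXT, continent TEXT);
-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER);
""")

# Illustrative rows only.
star.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 15, "Mon", 1, "Q1", 2007), (2, 20, "Fri", 4, "Q2", 2007)])
star.execute("INSERT INTO item_dim VALUES (1, 'Inspiron', 'Dell', 'laptop', 'wholesale')")
star.executemany("INSERT INTO location_dim VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, "1 Example Street", "Liverpool", "Merseyside", "UK", "Europe"),
                  (2, "2 Example Road", "Toronto", "Ontario", "Canada", "North America")])
star.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 5), (1, 1, 2, 2), (2, 1, 1, 4)])

# Units sold per quarter and country: one join per dimension used.
units_by_quarter_country = star.execute("""
    SELECT t.quarter, l.country, SUM(f.units_sold)
    FROM sales_fact f
    JOIN time_dim t ON t.time_key = f.time_key
    JOIN location_dim l ON l.location_key = f.location_key
    GROUP BY t.quarter, l.country
    ORDER BY t.quarter, l.country
""").fetchall()
print(units_by_quarter_country)
```

Every dimension is a single join away from the fact table, which is what gives the schema its star shape.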

Snowflake Schema
• Sales Fact Table: time_key, item_key, location_key, units_sold (measure value)
• Time Dimension: time_key, day, day_of_week, month, quarter, year
• Item Dimension: item_key, name, brand, type, supplier_key
• Loc Dimension: location_key, street, city_key
• City Dimension: city_key, city, state, country
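The effect of snowflaking the location dimension can be sketched in the same way. Only the location branch is modelled here to keep the example short; the table names, column types and rows are again assumptions, with column names taken from the slide.

```python
import sqlite3

snow = sqlite3.connect(":memory:")
snow.executescript("""
-- City attributes are stored once per city instead of once per street address.
CREATE TABLE city_dim (
    city_key INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE loc_dim (
    location_key INTEGER PRIMARY KEY, street TEXT,
    city_key INTEGER REFERENCES city_dim(city_key));
CREATE TABLE sales_fact (
    location_key INTEGER REFERENCES loc_dim(location_key),
    units_sold INTEGER);
""")

# Illustrative rows only.
snow.executemany("INSERT INTO city_dim VALUES (?, ?, ?, ?)",
                 [(1, "Liverpool", "Merseyside", "UK"),
                  (2, "Toronto", "Ontario", "Canada")])
snow.executemany("INSERT INTO loc_dim VALUES (?, ?, ?)",
                 [(1, "1 Example Street", 1),
                  (2, "2 Example Road", 1),
                  (3, "3 Example Avenue", 2)])
snow.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [(1, 5), (2, 3), (3, 2)])

# Reaching 'country' now takes two joins: fact -> location -> city.
units_by_country = snow.execute("""
    SELECT c.country, SUM(f.units_sold)
    FROM sales_fact f
    JOIN loc_dim l ON l.location_key = f.location_key
    JOIN city_dim c ON c.city_key = l.city_key
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(units_by_country)
```

The trade-off is less redundancy in the dimension tables in exchange for an extra join when a query needs the refined attributes.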

Fact Constellation
• Sales Fact Table: time_key, item_key, location_key, units_sold (measure value)
• Shipping Table: time_key, item_key, from_key, units_shipped
• Time Dimension: time_key, day, day_of_week, month, quarter, year
• Item Dimension: item_key, name, brand, type, supplier_key
• Loc Dimension: location_key, street, city_key
• City Dimension: city_key, city, state, country

OLAP Server Types
• ROLAP: Relational OLAP
  - Uses a relational DBMS to store and manage the warehouse data
  - Optimised for non-traditional access patterns
  - Lots of research into RDBMS to make use of!
• MOLAP: Multidimensional OLAP
  - Sparse array based storage engine
  - Fast access to precomputed data
• HOLAP: Hybrid OLAP
  - Mixture of both MOLAP and ROLAP

Data Warehouse Architecture
(Figure, also courtesy of Han & Kamber; data flows from the sources through to the front-end tools.)
• Data Sources: Operational DBs, Other sources
• Extract, Transform, Load, Refresh (with Monitor & Integrator and Metadata)
• Data Storage: Data Warehouse, Data Marts
• Serve
• OLAP Engine: OLAP Server
• Front-End Tools: Analysis, Query, Reports, Data mining

Materialisation
In order to compute OLAP queries efficiently, some of the cuboids need to be materialised (precomputed) from the data.
• None: Very slow, as the entire cube must be computed at run time
• Full: Very fast, but requires a LOT of storage space and time to compute all possible cuboids
• Partial: But which ones to materialise? One answer is the 'iceberg cube', so called because only part is materialised and the rest is "below water". Many cells in a cuboid will be empty or nearly so, so only the cells whose aggregate value (such as a count) meets a minimum threshold are materialised.
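The iceberg idea can be illustrated with a naive sketch that counts the rows falling in every cell of every cuboid and keeps only cells that reach a minimum count. The rows, the function name `iceberg_cube` and the `min_count` parameter are assumptions for this example; practical algorithms avoid computing the cells they are going to discard, which this sketch does not attempt.

```python
from collections import Counter
from itertools import combinations

# Hypothetical base rows, one per sale: (time, item, location).
rows = [
    ("Q1", "laptop", "Liverpool"),
    ("Q1", "laptop", "Liverpool"),
    ("Q1", "phone", "Leeds"),
    ("Q2", "laptop", "Liverpool"),
]
dims = ("time", "item", "location")

def iceberg_cube(rows, dims, min_count):
    """Return {cuboid: {cell: count}} keeping only cells with count >= min_count."""
    cube = {}
    for k in range(len(dims) + 1):
        for axes in combinations(range(len(dims)), k):
            # Project each row onto the chosen axes and count identical cells.
            counts = Counter(tuple(row[a] for a in axes) for row in rows)
            kept = {cell: n for cell, n in counts.items() if n >= min_count}
            if kept:
                cube[tuple(dims[a] for a in axes)] = kept
    return cube

cube = iceberg_cube(rows, dims, min_count=2)
for cuboid, cells in cube.items():
    print(cuboid, cells)
```

With `min_count=2`, sparse cells such as the single Q2 sale never get stored, while the apex cell (the total row count) always survives.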

Further Reading
and subsequent links