Lecture 1: Data Warehousing Based on the slides by Jeffrey D. Ullman and Hector Garcia-Molina at Stanford University 1.


2 Overview Traditional database systems are tuned to many, small, simple queries. Some new applications use fewer, more time-consuming, complex queries. New architectures have been developed to handle complex “analytic” queries efficiently.

3 The Data Warehouse The most common form of data integration. – Copy sources into a single DB (warehouse) and try to keep it up-to-date. – Usual method: periodic reconstruction of the warehouse, perhaps overnight. – Frequently essential for analytic queries.

4 OLTP Most database operations involve On-Line Transaction Processing (OLTP). – Short, simple, frequent queries and/or modifications, each involving a small number of tuples. – Examples: Answering queries from a Web interface, sales at cash registers, selling airline tickets.

5 OLAP Of increasing importance are On-Line Analytical Processing (OLAP) queries. – Few, but complex queries --- may run for hours. – Queries do not depend on having an absolutely up-to-date database.

6 OLAP Examples 1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer. 2. Analysts at Wal-Mart look for items with increasing sales in some region.

7 Warehouse Architecture [Diagram; labels: Client, Query & Analysis, Warehouse, Metadata, Integration, Source]

8 Why a Warehouse? Two approaches: – Query-Driven (Lazy) – Warehouse (Eager) [Diagram; label: Source]

9 Data Warehouse Databases at store branches handle OLTP. Local store databases copied to a central warehouse overnight. Analysts use the warehouse for OLAP.

10 Query-Driven Approach [Diagram; labels: Client, Mediator, Wrapper, Source]

11 Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse
– Modify, summarize (store aggregates)
– Add historical information

12 Advantages of Query-Driven
No need to copy data
– less storage
– no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
May be less draining on sources

13 OLTP vs. OLAP
OLTP: On-Line Transaction Processing
– Describes processing at operational sites
OLAP: On-Line Analytical Processing
– Describes processing at warehouse

14 OLTP vs. OLAP
OLTP: mostly updates; many small transactions; MB-TB of data; raw data; clerical users; up-to-date data; consistency, recoverability critical.
OLAP: mostly reads; queries long, complex; GB-TB of data; summarized, consolidated data; decision-makers, analysts as users.

15 Star Schemas A star schema is a common organization for data at a warehouse. It consists of: 1. Fact table: a very large accumulation of facts such as sales. – Often “insert-only.” 2. Dimension tables: smaller, generally static information about the entities involved in the facts.

16 Example: Star Schema Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged. The fact table is a relation: Sales(bar, beer, drinker, day, time, price)

17 Example, Continued The dimension tables include information about the bar, beer, and drinker “dimensions”: Bars(bar, addr, license) Beers(beer, manf) Drinkers(drinker, addr, phone)

18 Visualization – Star Schema [Figure: the fact table Sales, holding dimension attributes and dependent attributes, in the center, surrounded by dimension tables (Beers, Drinkers, Bars, etc.)]

19 Dimensions and Dependent Attributes Two classes of fact-table attributes: 1. Dimension attributes: the key of a dimension table. 2. Dependent attributes: a value determined by the dimension attributes of the tuple.

20 Example: Dependent Attribute price is the dependent attribute of our example Sales relation. It is determined by the combination of dimension attributes: bar, beer, drinker, and the time (combination of day and time-of-day attributes).

21 Approaches to Building Warehouses 1. ROLAP = “relational OLAP”: Tune a relational DBMS to support star schemas. 2. MOLAP = “multidimensional OLAP”: Use a specialized DBMS with a model such as the “data cube.”

22 ROLAP Techniques 1. Bitmap indexes: For each key value of a dimension table (e.g., each beer for relation Beers) create a bit-vector telling which tuples of the fact table have that value. 2. Materialized views: Store the answers to several useful queries (views) in the warehouse itself.

23 Typical OLAP Queries Often, OLAP queries begin with a “star join”: the natural join of the fact table with all or most of the dimension tables. Example: SELECT * FROM Sales, Bars, Beers, Drinkers WHERE Sales.bar = Bars.bar AND Sales.beer = Beers.beer AND Sales.drinker = Drinkers.drinker;
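The star schema and star join from the beer-sales example can be tried out with a minimal sketch like the one below. It uses Python's built-in sqlite3 module as a stand-in for a warehouse DBMS; the column types and all sample rows (addresses, license numbers, phone numbers, dates) are invented for illustration and are not from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small, mostly static.
CREATE TABLE Bars(bar TEXT PRIMARY KEY, addr TEXT, license TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, manf TEXT);
CREATE TABLE Drinkers(drinker TEXT PRIMARY KEY, addr TEXT, phone TEXT);
-- Fact table: one row per sale; bar, beer, drinker are dimension keys.
CREATE TABLE Sales(
    bar TEXT REFERENCES Bars(bar),
    beer TEXT REFERENCES Beers(beer),
    drinker TEXT REFERENCES Drinkers(drinker),
    day TEXT, time TEXT, price REAL
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO Bars VALUES (?, ?, ?)",
                 [("Joe's Bar", "Palo Alto", "L-001"),
                  ("Nut-House", "Palo Alto", "L-002")])
conn.executemany("INSERT INTO Beers VALUES (?, ?)",
                 [("Bud", "Anheuser-Busch"), ("M'lob", "Anheuser-Busch")])
conn.executemany("INSERT INTO Drinkers VALUES (?, ?, ?)",
                 [("Jim", "1 Main St", "555-0100"),
                  ("Mary", "2 Oak St", "555-0101")])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)",
                 [("Joe's Bar", "Bud", "Jim", "2024-01-01", "20:00", 3.0),
                  ("Joe's Bar", "M'lob", "Mary", "2024-01-01", "21:00", 4.0),
                  ("Nut-House", "Bud", "Jim", "2024-01-02", "19:30", 2.5)])

# Star join: the fact table joined with every dimension table on its key.
star_join = conn.execute("""
    SELECT Sales.bar, Sales.beer, Sales.drinker, Bars.addr, Beers.manf, Sales.price
    FROM Sales, Bars, Beers, Drinkers
    WHERE Sales.bar = Bars.bar
      AND Sales.beer = Beers.beer
      AND Sales.drinker = Drinkers.drinker
    ORDER BY Sales.rowid
""").fetchall()
```

Because every fact row matches exactly one row in each dimension table, the star join returns one widened row per sale.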

24 Typical OLAP Queries --- (2) The typical OLAP query will: 1. Start with a star join. 2. Select for interesting tuples, based on dimension data. 3. Group by one or more dimensions. 4. Aggregate certain attributes of the result.

25 Example: OLAP Query For each bar in Palo Alto, find the total sales of each beer manufactured by Anheuser-Busch. Filter: addr = “Palo Alto” and manf = “Anheuser-Busch”. Grouping: by bar and beer. Aggregation: Sum of price.

26 Example: In SQL SELECT bar, beer, SUM(price) FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch' GROUP BY bar, beer;

27 Using Materialized Views A direct execution of this query from Sales and the dimension tables could take too long. If we create a materialized view that contains enough information, we may be able to answer our query much faster.

28 Example: Materialized View Which views could help with our query? Key issues: 1. It must join Sales, Bars, and Beers, at least. 2. It must group by at least bar and beer. 3. It must not select out Palo Alto bars or Anheuser-Busch beers. 4. It must not project out addr or manf.

29 Example --- Continued Here is a materialized view that could help: CREATE VIEW BABMS(bar, addr, beer, manf, sales) AS SELECT bar, addr, beer, manf, SUM(price) sales FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers GROUP BY bar, addr, beer, manf; Since bar -> addr and beer -> manf, also grouping by addr and manf does not split the (bar, beer) groups any further. The two attributes are in the GROUP BY only because we need addr and manf in the SELECT.

30 Example --- Concluded Here’s our query using the materialized view BABMS: SELECT bar, beer, sales FROM BABMS WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch';
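As a rough check that the BABMS view really answers the original query, the sketch below runs both versions and compares them. It uses Python's sqlite3 module; SQLite has no materialized views, so the view is emulated here with CREATE TABLE ... AS SELECT, which stores the result once and does not refresh it. All sample rows, including the bar "Blue Chalk" being in San Jose and the beer "Other Ale", are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Bars(bar TEXT PRIMARY KEY, addr TEXT, license TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, manf TEXT);
CREATE TABLE Sales(bar TEXT, beer TEXT, drinker TEXT, day TEXT, time TEXT, price REAL);
""")
# Hypothetical sample data.
conn.executemany("INSERT INTO Bars VALUES (?, ?, ?)",
                 [("Joe's Bar", "Palo Alto", "L-001"),
                  ("Nut-House", "Palo Alto", "L-002"),
                  ("Blue Chalk", "San Jose", "L-003")])
conn.executemany("INSERT INTO Beers VALUES (?, ?)",
                 [("Bud", "Anheuser-Busch"), ("M'lob", "Anheuser-Busch"),
                  ("Other Ale", "Other Brewer")])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)",
                 [("Joe's Bar", "Bud", "Jim", "d1", "t1", 3.0),
                  ("Joe's Bar", "Bud", "Bob", "d1", "t2", 3.0),
                  ("Joe's Bar", "M'lob", "Mary", "d2", "t1", 4.0),
                  ("Nut-House", "Bud", "Jim", "d2", "t1", 2.5),
                  ("Blue Chalk", "Bud", "Bob", "d1", "t1", 3.5),
                  ("Joe's Bar", "Other Ale", "Mary", "d1", "t1", 5.0)])

FILTER = "WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'"

# Direct execution: join, filter, group, aggregate over the fact table.
direct = conn.execute(f"""
    SELECT bar, beer, SUM(price)
    FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
    {FILTER}
    GROUP BY bar, beer ORDER BY bar, beer
""").fetchall()

# Emulated materialized view: pre-joined and pre-aggregated, but still
# carrying addr and manf so that later queries can filter on them.
conn.execute("""
    CREATE TABLE BABMS AS
    SELECT bar, addr, beer, manf, SUM(price) AS sales
    FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
    GROUP BY bar, addr, beer, manf
""")

# The same question answered from the much smaller view.
via_view = conn.execute(
    f"SELECT bar, beer, sales FROM BABMS {FILTER} ORDER BY bar, beer"
).fetchall()
```

The two result lists agree because the view kept every attribute the query filters or groups on; had it dropped addr or manf, the filter could no longer be applied.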

31 MOLAP and Data Cubes Keys of dimension tables are the dimensions of a hypercube. – Example: for the Sales data, the four dimensions are bar, beer, drinker, and time. Dependent attributes (e.g., price) appear at the points of the cube.

32 Visualization - Data Cubes [Figure: a cube with axes bar, beer, and drinker; each point holds a price]

33 Marginals The data cube also includes aggregation (typically SUM) along the margins of the cube. The marginals include aggregations over one dimension, two dimensions,…

34 Visualization - Data Cube w/ Aggregation [Figure: the same cube with axes bar, beer, and drinker, plus a margin holding the SUM over all drinkers]

35 Example: Marginals Our 4-dimensional Sales cube includes the sum of price over each bar, each beer, each drinker, and each time unit (perhaps days). It would also have the sum of price over all bar-beer pairs, all bar-drinker-day triples,…

36 Structure of the Cube Think of each dimension as having an additional value *. A point with one or more *’s in its coordinates aggregates over the dimensions with the *’s. Example: Sales(“Joe’s Bar”, “Bud”, *, *) holds the sum over all drinkers and all time of the Bud consumed at Joe’s.
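The idea of * coordinates can be made concrete with a small sketch that materializes every point of the cube, including all marginals, from a list of facts. The function name build_cube and the sample facts are hypothetical; the code assumes SUM as the aggregate and that no real dimension value is the string "*".

```python
from collections import defaultdict
from itertools import product


def build_cube(facts):
    """Return a dict mapping each cube point to the SUM of the measure.

    facts: iterable of (coords, value), where coords is a tuple with one
    value per dimension. A point with '*' in a coordinate aggregates
    over that dimension.
    """
    cube = defaultdict(float)
    for coords, value in facts:
        # Each fact contributes to 2^n points: every coordinate is
        # either kept as-is or replaced by '*'.
        for mask in product((False, True), repeat=len(coords)):
            point = tuple("*" if starred else c
                          for c, starred in zip(coords, mask))
            cube[point] += value
    return dict(cube)


# Hypothetical facts with dimensions (bar, beer, drinker, day).
facts = [
    (("Joe's Bar", "Bud", "Jim", "d1"), 3.0),
    (("Joe's Bar", "Bud", "Bob", "d2"), 3.0),
    (("Joe's Bar", "M'lob", "Jim", "d1"), 4.0),
    (("Nut-House", "Bud", "Jim", "d1"), 2.5),
]
cube = build_cube(facts)

# Sum over all drinkers and all time of the Bud sold at Joe's Bar.
joes_bud = cube[("Joe's Bar", "Bud", "*", "*")]
grand_total = cube[("*", "*", "*", "*")]
```

Materializing all 2^n aggregates per fact is only practical for tiny examples; real systems choose which aggregates to store.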

37 Drill-Down Drill-down = “de-aggregate” = break an aggregate into its constituents. Example: having determined that Joe’s Bar sells very few Anheuser-Busch beers, break down its sales by particular A.-B. beer.

38 Roll-Up Roll-up = aggregate along one or more dimensions. Example: given a table of how much Bud each drinker consumes at each bar, roll it up into a table giving total amount of Bud consumed for each drinker.
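A minimal sketch of roll-up, assuming the table is held as a dictionary keyed by (drinker, bar) and that SUM is the aggregate. The helper name roll_up and the amounts are invented for illustration.

```python
from collections import defaultdict


def roll_up(table, keep):
    """Aggregate away every dimension whose index is not listed in keep.

    table: dict mapping coordinate tuples to a numeric measure.
    keep: indices of the dimensions to retain, in output order.
    """
    out = defaultdict(float)
    for coords, value in table.items():
        out[tuple(coords[i] for i in keep)] += value
    return dict(out)


# Hypothetical amounts of Bud per (drinker, bar).
bud_by_drinker_bar = {
    ("Jim", "Joe's Bar"): 45.0,
    ("Jim", "Nut-House"): 33.0,
    ("Bob", "Joe's Bar"): 50.0,
    ("Mary", "Blue Chalk"): 38.0,
}

# Roll up along the bar dimension: keep only the drinker (index 0).
bud_by_drinker = roll_up(bud_by_drinker_bar, keep=[0])
```

Drill-down is the reverse direction: it cannot be computed from the rolled-up table alone and requires going back to the finer-grained data.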

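39 Roll Up and Drill Down [Figure: three tables. (1) $ of Anheuser-Busch by drinker/bar: rows Joe’s Bar, Nut-House, Blue Chalk; columns Jim, Bob, Mary. (2) Roll up by bar: $ of A-B per drinker; columns Jim, Bob, Mary. (3) Drill down by beer: $ of A-B beers per drinker; rows Bud, M’lob, Bud Light; columns Jim, Bob, Mary. Cell values are not preserved in the transcript.]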

40 Materialized Data-Cube Views Data cubes invite materialized views that are aggregations in one or more dimensions. A dimension need not be aggregated completely --- an option is to group by an attribute of the dimension table.

41 Example A materialized view for our Sales data cube might: 1. Aggregate by drinker completely. 2. Not aggregate at all by beer. 3. Aggregate by time according to the week. 4. Aggregate according to the city of the bar.

43 Warehouse Models & Operators
Data Models
– relations
– stars & snowflakes
– cubes
Operators
– slice & dice
– roll-up, drill down
– pivoting
– other

44 Star [Figure not preserved in the transcript]

45 Star Schema [Figure not preserved in the transcript]

46 Terms Fact table; dimension tables; measures

47 Dimension Hierarchies [Figure; labels: store, sType, city, region] Related schema forms: snowflake schema, constellations

48 Cube [Figure: the same data shown as a fact table view and as a multi-dimensional cube; dimensions = 2]

49 3-D Cube [Figure: fact table view and multi-dimensional cube with slices for day 1 and day 2; dimensions = 3]

50 ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing. MOLAP: Multi-Dimensional On-Line Analytical Processing.

51 Aggregates Add up amounts for day 1. In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 Result: 81

52 Aggregates Add up amounts by day. In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

53 Another Example Add up amounts by day, product. In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId [Figure labels: drill-down, rollup]
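The three aggregate queries on SALE can be run end to end with the sketch below, again using Python's sqlite3 module. The slides' fact table did not survive in the transcript, so the rows here are hypothetical, chosen so that the day-1 total comes out to 81 as on the slide; the custId column name is also an assumption. The last query lists prodId in the SELECT so that each output row identifies its group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALE(prodId TEXT, custId TEXT, date INTEGER, amt INTEGER)")

# Hypothetical fact rows: (product, customer, day, amount).
conn.executemany("INSERT INTO SALE VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
])

# Total for one day.
day1_total = conn.execute(
    "SELECT sum(amt) FROM SALE WHERE date = 1").fetchone()[0]

# Totals by day (rolled up over products and customers).
by_day = conn.execute(
    "SELECT date, sum(amt) FROM SALE GROUP BY date ORDER BY date").fetchall()

# Totals by day and product (drilled down one level from by_day).
by_day_product = conn.execute(
    "SELECT date, prodId, sum(amt) FROM SALE "
    "GROUP BY date, prodId ORDER BY date, prodId").fetchall()
```

Summing the by_day_product rows for one day gives back the matching by_day row, which is the roll-up relationship between the two queries.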

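54 Aggregates
Operators: sum, count, max, min, median, avg
“Having” clause
Using dimension hierarchy
– average by region (within store)
– maximum by month (within date)

55 Cube Aggregation [Figure: cube with slices for day 1 and day 2; example of computing sums, with drill-down and rollup between levels]

56 Cube Operators [Figure: cube with slices for day 1 and day 2] Example points: sale(c1,*,*), sale(*,*,*), sale(c2,p2,*)

57 Extended Cube [Figure: cube with slices for day 1 and day 2, extended with a * value on each dimension] Example point: sale(*,p2,*)

58 Aggregation Using Hierarchies [Figure: cube with slices for day 1 and day 2] Hierarchy: customer, region, country (customer c1 in Region A; customers c2, c3 in Region B)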
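Aggregating along a dimension hierarchy amounts to mapping each member to its parent and summing. The sketch below uses the customer-to-region assignment stated on the slide (c1 in Region A; c2 and c3 in Region B); the per-customer amounts and the helper name are hypothetical.

```python
from collections import defaultdict

# From the slide: customer c1 is in Region A; c2 and c3 are in Region B.
REGION_OF = {"c1": "Region A", "c2": "Region B", "c3": "Region B"}


def roll_up_hierarchy(measure_by_child, parent_of):
    """Sum a measure from one hierarchy level up to its parent level."""
    totals = defaultdict(int)
    for child, value in measure_by_child.items():
        totals[parent_of[child]] += value
    return dict(totals)


# Hypothetical total sales per customer.
sales_by_customer = {"c1": 67, "c2": 12, "c3": 50}
sales_by_region = roll_up_hierarchy(sales_by_customer, REGION_OF)
```

The same function could be applied again with a region-to-country mapping to reach the next level of the hierarchy.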

59 Pivoting [Figure: fact table view and the corresponding multi-dimensional cube, with slices for day 1 and day 2]
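Pivoting can be sketched as turning fact-table rows into a two-dimensional cross-tab, where one dimension labels the rows and another labels the columns. The function name pivot and the sample facts are hypothetical.

```python
def pivot(rows, row_dim, col_dim, measure):
    """Build a nested dict: table[row value][column value] = summed measure."""
    table = {}
    for r in rows:
        cells = table.setdefault(r[row_dim], {})
        cells[r[col_dim]] = cells.get(r[col_dim], 0) + r[measure]
    return table


# Hypothetical fact-table view.
facts = [
    {"prodId": "p1", "date": 1, "amt": 62},
    {"prodId": "p2", "date": 1, "amt": 19},
    {"prodId": "p1", "date": 2, "amt": 48},
]

# Products as rows, days as columns.
by_product = pivot(facts, "prodId", "date", "amt")
# Swapping the two dimensions rotates the same data.
by_date = pivot(facts, "date", "prodId", "amt")
```

Missing cells (here, product p2 on day 2) are simply absent; a dense array representation would need an explicit empty value instead.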

60 Query & Analysis Tools
Query Building
Report Writers (comparisons, growth, graphs,…)
Spreadsheet Systems
Web Interfaces
Data Mining

61 Other Operations
Time functions
– e.g., time average
Computed Attributes
– e.g., commission = sales * rate
Text Queries
– e.g., find documents with words X AND B
– e.g., rank documents by frequency of words X, Y, Z

62 Implementing a Warehouse
Monitoring: Sending data from sources
Integrating: Loading, cleansing,...
Processing: Query processing, indexing,...
Managing: Metadata, Design,...

63 Monitoring
Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, …
Incremental vs. Refresh

64 Monitoring Techniques
Periodic snapshots
Database triggers
Log shipping
Data shipping (replication service)
Transaction shipping
Polling (queries to source)
Screen scraping
Application level monitoring
Each technique has advantages and disadvantages.

65 Monitoring Issues
Frequency
– periodic: daily, weekly, …
– triggered: on “big” change, lots of changes,...
Data transformation
– convert data to uniform format
– remove & add fields (e.g., add date to get history)
Standards (e.g., ODBC)
Gateways

66 Integration
Data Cleaning
Data Loading
Derived Data
[Diagram; labels: Client, Query & Analysis, Warehouse, Metadata, Integration, Source]

67 Data Cleaning
Migration (e.g., yen → dollars)
Scrubbing: use domain-specific knowledge (e.g., social security numbers)
Fusion (e.g., mail list, customer merging)
Auditing: discover rules & relationships (like data mining)
[Figure; labels: billing DB, service DB, customer1(Joe), customer2(Joe), merged_customer(Joe)]

68 Loading Data
Incremental vs. refresh
Off-line vs. on-line
Frequency of loading
– At night, 1x a week/month, continuously
Parallel/Partitioned load

69 Derived Data
Derived Warehouse Data
– indexes
– aggregates
– materialized views
When to update derived data?
Incremental vs. refresh
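The incremental-versus-refresh choice for a derived aggregate can be illustrated with a small sketch: refresh recomputes the totals from all loaded rows, while the incremental path folds only the newly arrived rows into the stored totals. The function names and data are hypothetical, and the incremental path shown here assumes insert-only data; deletions or updates at the source would need extra bookkeeping.

```python
from collections import Counter


def refresh(all_rows):
    """Recompute per-day totals from scratch over every loaded row."""
    totals = Counter()
    for day, amt in all_rows:
        totals[day] += amt
    return totals


def apply_increment(totals, new_rows):
    """Fold only the new rows into existing totals (insert-only data)."""
    for day, amt in new_rows:
        totals[day] += amt
    return totals


# Hypothetical (day, amount) rows already in the warehouse.
loaded = [(1, 12), (1, 11), (2, 44)]
# Rows that arrived since the last load.
new = [(2, 4), (3, 7)]

totals = refresh(loaded)
apply_increment(totals, new)
```

Both paths end in the same state; the incremental one touches only the new rows, which matters when the fact table is very large.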

70 Materialized Views Define new warehouse relations using SQL expressions. [Figure note: the derived relation does not exist at any source]

71 Processing
ROLAP servers vs. MOLAP servers
Index Structures
What to Materialize?
Algorithms
[Diagram; labels: Client, Query & Analysis, Warehouse, Metadata, Integration, Source]

72 ROLAP Server Relational OLAP Server [Figure: tools and utilities on top of a ROLAP server, which sits on a relational DBMS] Special indices, tuning; schema is “denormalized”

73 MOLAP Server Multi-Dimensional OLAP Server [Figure: M.D. tools and utilities on top of a multi-dimensional server, which could also sit on a relational DBMS; example Sales cube with labels Product, City, Date, milk, soda, eggs, soap, A, B]

74 Index Structures
Traditional Access Methods
– B-trees, hash tables, R-trees, grids, …
Popular in Warehouses
– inverted lists
– bit map indexes
– join indexes
– text indexes

75 Inverted Lists [Figure: an age index pointing to inverted lists, which point to data records] (slide footer: CS 245 Notes12)

76 Using Inverted Lists Query: – Get people with age = 20 and name = “fred” List for age = 20: r4, r18, r34, r35 List for name = “fred”: r18, r52 Answer is intersection: r18
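The inverted-list lookup can be sketched as follows. The record ids and the two lists match the slide (age = 20: r4, r18, r34, r35; name = “fred”: r18, r52); the other names and the age of r52 are invented to fill out the records, and the helper name is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(records, attr):
    """Map each value of attr to the list of record ids holding it."""
    index = defaultdict(list)
    for rid, rec in records.items():
        index[rec[attr]].append(rid)
    return index


# Record ids follow the slide; non-matching field values are hypothetical.
records = {
    "r4": {"name": "ann", "age": 20},
    "r18": {"name": "fred", "age": 20},
    "r34": {"name": "bob", "age": 20},
    "r35": {"name": "cat", "age": 20},
    "r52": {"name": "fred", "age": 30},
}

age_index = build_inverted_index(records, "age")
name_index = build_inverted_index(records, "name")

# Conjunctive query: intersect the two inverted lists.
answer = set(age_index[20]) & set(name_index["fred"])
```

Only the records in the intersection need to be fetched, so the data records themselves are never scanned.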

77 Bit Maps [Figure: an age index pointing to bit maps, one bit per data record]

78 Using Bit Maps Query: – Get people with age = 20 and name = “fred” Take the bit vector for age = 20 and the bit vector for name = “fred” (the vectors themselves are not preserved in the transcript); the answer is their intersection (bitwise AND). – Good if domain cardinality small – Bit vectors can be compressed
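A minimal sketch of a bitmap index, using Python integers as uncompressed bit vectors: bit i of a vector is set when record i has the given value, and a conjunctive query becomes a bitwise AND. The records and helper names are hypothetical.

```python
from collections import defaultdict


def build_bitmaps(records, attr):
    """Return one bit vector per distinct value of attr.

    Bit i (counting from the least significant bit) is 1 when
    records[i][attr] equals that value.
    """
    bitmaps = defaultdict(int)
    for pos, rec in enumerate(records):
        bitmaps[rec[attr]] |= 1 << pos
    return bitmaps


def positions(bits):
    """Decode a bit vector into the list of record positions it marks."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical data records, addressed by position.
records = [
    {"name": "ann", "age": 20},
    {"name": "fred", "age": 20},
    {"name": "bob", "age": 25},
    {"name": "fred", "age": 30},
]

age_bitmaps = build_bitmaps(records, "age")
name_bitmaps = build_bitmaps(records, "name")

# age = 20 AND name = "fred": intersect with a single bitwise AND.
hits = age_bitmaps[20] & name_bitmaps["fred"]
matching = positions(hits)
```

There is one bit vector per distinct value, which is why the approach suits attributes with a small domain.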

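79 Join “Combine” SALE, PRODUCT relations. In SQL: SELECT * FROM SALE, PRODUCT (as written, with no WHERE clause, this returns the Cartesian product; a join condition equating SALE.prodId with the key of PRODUCT is needed to pair each sale with its product)

80 What to Materialize? Store in warehouse results useful for common queries. [Figure: example cube with slices for day 1 and day 2; the total sales aggregate is marked “materialize”]

81 Materialization Factors
Type/frequency of queries
Query response time
Storage cost
Update cost

82 Cube Aggregates Lattice
Lattice levels, from most detailed to fully aggregated:
– city, product, date
– city, product; city, date; product, date
– city; product; date
– all
Use greedy algorithm to decide what to materialize. [Figure: cube with slices for day 1 and day 2]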
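The slide does not spell out the greedy algorithm, so the sketch below is one possible reading rather than the deck's method: a benefit-based heuristic that, in each round, materializes the view that most reduces the total cost of answering every view in the lattice. It assumes the cost of answering a view is the row count of the smallest materialized view that groups by a superset of its attributes. All row counts are hypothetical.

```python
from itertools import combinations


def lattice(dims):
    """All group-by sets over dims, from the full set down to 'all' (empty)."""
    return [frozenset(c)
            for r in range(len(dims), -1, -1)
            for c in combinations(dims, r)]


def greedy_select(sizes, top, k):
    """Pick k views to materialize in addition to the top view.

    sizes: dict mapping each view (a frozenset of attributes) to its
    estimated row count. View v can answer view w when w <= v.
    """
    chosen = [top]

    def cost(w):
        # Cheapest already-materialized view that can answer w.
        return min(sizes[v] for v in chosen if w <= v)

    def benefit(v):
        # Total cost saved, over all views v can answer, if v is added.
        return sum(max(0, cost(w) - sizes[v]) for w in sizes if w <= v)

    picks = []
    for _ in range(k):
        candidates = [v for v in sizes if v not in chosen]
        best = max(candidates, key=benefit)
        chosen.append(best)
        picks.append(best)
    return picks


views = lattice(("city", "product", "date"))
# Hypothetical row counts, in the same order as `views`.
sizes = dict(zip(views, [100, 50, 75, 90, 5, 20, 40, 1]))
picks = greedy_select(sizes, top=views[0], k=2)
```

With these sizes the first pick is the (city, product) view, which halves the cost of four views at once, and the second is (city). Different size estimates would change the picks, which is why the estimates matter more than the loop itself.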

83 Dimension Hierarchies [Figure; hierarchy levels: all, state, city]

84 Dimension Hierarchies [Figure: lattice extended with the state level. Nodes: city, product, date; city, product; city, date; product, date; city; product; date; state, product, date; state, date; state, product; state; all. Not all arcs shown.]

85 Interesting Hierarchy [Figure; levels: all, years, quarters, months, weeks, days; shown next to a conceptual dimension table]

86 Managing
Metadata
Warehouse Design
Tools
[Diagram; labels: Client, Query & Analysis, Warehouse, Metadata, Integration, Source]

87 Metadata
Administrative
– definition of sources, tools,...
– schemas, dimension hierarchies, …
– rules for extraction, cleaning, …
– refresh, purging policies
– user profiles, access control,...

88 Metadata
Business
– business terms & definition
– data ownership, charging
Operational
– data lineage
– data currency (e.g., active, archived, purged)
– use stats, error reports, audit trails

89 Design
What data is needed?
Where does it come from?
How to clean data?
How to represent in warehouse (schema)?
What to summarize?
What to materialize?
What to index?

90 Tools
Development
– design & edit: schemas, views, scripts, rules, queries, reports
Planning & Analysis
– what-if scenarios (schema changes, refresh rates), capacity planning
Warehouse Management
– performance monitoring, usage patterns, exception reporting
System & Network Management
– measure traffic (sources, warehouse, clients)
Workflow Management
– “reliable scripts” for cleaning & analyzing data

91 Current State of Industry
Extraction and integration done off-line
– Usually in large, time-consuming batches
Everything copied at warehouse
– Not selective about what is stored
– Query benefit vs storage & update cost
Query optimization aimed at OLTP
– High throughput instead of fast response
– Process whole query before displaying anything

92 Future Directions
Better performance
Larger warehouses
Easier to use
What are companies & research labs working on?

93 Conclusions
Massive amounts of data and complexity of queries will push limits of current warehouses
Need better systems:
– easier to use
– provide quality information