1 Advanced Database Systems: DBS CB, 2 nd Edition Data Warehouse, OLAP, Data Mining Ch , Ch. 22
2 Outline What is a data warehouse? Why a data warehouse? Data Warehouse Models & operations Query & Analysis Tools Implementing a data warehouse Future directions
333 What is a Data Warehouse
4 Collection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries, history) integrated time-varying non-volatile What is a Data Warehouse more Collection of tools gathering data cleansing, integrating,... querying, reporting, analysis data mining monitoring, administering warehouse
5 Data Warehouse Architecture What is a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata
6 Motivation Examples: Forecasting Comparing performance of units Monitoring, detecting fraud Visualization What is a Data Warehouse
777 Why a Data Warehouse
8 Two Approaches to Extract Information in the Enterprise: Query-Driven (Lazy) Warehouse (Eager) Why a Data Warehouse
9 Query Driven Approach (Federation): Why a Data Warehouse Client Wrapper Mediator Source
10 Warehouse Advantages Queries not visible outside warehouse Local processing at sources unaffected Can operate when sources unavailable Can query data not stored in a DBMS Extra information at warehouse Modify, summarize (store aggregates) Add historical information Why a Data Warehouse
11 Query-driven Advantages No need to copy data less storage no need to purchase data More up-to-date data – real-time Query needs can be unknown Only query interface needed at sources May be less draining on sources Why a Data Warehouse
12 OLTP vs. OLAP OLTP: On Line Transaction Processing Describes processing at operational sites OLAP: On Line Analytical Processing Describes processing at warehouse What is a Data Warehouse
13 OLTP vs. OLAP What is a Data Warehouse Mostly updates Many small transactions Mb-Tb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical Mostly reads Queries long, complex GB-TB of data Summarized, consolidated data Decision-makers, analysts as users OLTP OLAP
14 Data Marts Smaller warehouses Spans part of organization e.g., marketing (customers, products, sales) Do not require enterprise-wide consensus but long term integration problems? What is a Data Warehouse
15 Data Warehouse Models & Operations
16 Data Models relations stars & snowflakes cubes Operators slice & dice roll-up, drill down pivoting other Data Warehouse Models & Operations
17 Star Schema Data Warehouse Models & Operations
Star Schema 18 Data Warehouse Models & Operations Terms: Fact table: PK is the aggregate of the dimensions’ PK - 2 nd NF Dimension tables: “by” condition - one view for how the data will be analyzed – 3 rd NF Measures: aggregates – sales/day, sales/quarter, etc. Summary Tables: pre- summarized data to meet 90% of user queries. Use triggers/stored procedures to maintain summary tables current on insert/delete product prodId name price customer custId name address city store storeId city sale orderId date custId prodId storeId qty amt
19 Star Schema - Dimension Hierarchies Data Warehouse Models & Operations store sType cityregion
20 S nowflakes schema, a dimension table have the hierarchy broken into set of tables – more complicated and slower queries Data Warehouse Models & Operations time_key day day_of_the_week month quarter year time location_key street city_key location Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key item branch_key branch_name branch_type branch supplier_key supplier_type supplier city_key city state_or_province country city
21 Fact Constellations schema, multiple fact tables that share the set of dimensional tables; viewed as a collection of stars Data Warehouse Models & Operations time_key day day_of_the_week month quarter year time location_key street city_key location Sales Fact Table t item_key branch_key location_key units_sold dollars_sold avg_sales Measures item_key item_name brand type supplier_key item branch_key branch_name branch_type branch supplier_key supplier_type supplier city_key city state_or_province country city time_key
22 Data Warehouse Models & Operations product prodId name price customer custId name address city store storeId city sale orderId date custId prodId storeId qty amt Fact table view: Multi-dimensional cube: dimensions = 2 Star Schema
23 3D Cube Data Warehouse Models & Operations day 2 day 1 dimensions = 3 Multi-dimensional cube:Fact table view:
24 ROLAP vs. MOLAP ROLAP: Relational On-Line Analytical Processing MOLAP: Multi-Dimensional On-Line Analytical Processing Aggregates Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1; Data Warehouse Models & Operations 81
25 Aggregates (Contd.) Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date; Data Warehouse Models & Operations
26 Aggregates (Contd.) Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId; Data Warehouse Models & Operations drill-down rollup
27 Aggregates (Contd.) Operators: sum, count, max, min, median, ave “Having” clause Using dimension hierarchy average by region (within store) maximum by month (within date) Data Warehouse Models & Operations
28 Cube Aggregations Data Warehouse Models & Operations day 2 day Example: computing sums drill-down rollup
29 Cube Operators Data Warehouse Models & Operations day 2 day sale(c1,*,*) sale(*,*,*) sale(c2,p2,*)
30 Extended Cube Data Warehouse Models & Operations day 2 day 1 * sale(*,p2,*)
31 Aggregation Using Hierarchies Data Warehouse Models & Operations day 2 day 1 customer region country (customer c1 in Region A; customers c2, c3 in Region B)
32 Pivoting Data Warehouse Models & Operations day 2 day 1 Multi-dimensional cube: Fact table view:
33 Query & Analysis Tools
34 Query Building Report Writers (comparisons, growth, graphs,…) Spreadsheet Systems Web Interfaces Data Mining Query & Analysis Tools
35 Other Operations Time functions e.g., time average Computed Attributes e.g., commission = sales * rate Text Queries e.g., find documents with words X AND B e.g., rank documents by frequency of words X, Y, Z Query & Analysis Tools
36 Data Mining Decision Trees Clustering Association Rules (i) Decision Tree: Conducted survey to see what customers were interested in new model car Want to select customers for advertising campaign Query & Analysis Tools training set
37 One possibility: Query & Analysis Tools age<30 city=sfcar=van likely unlikely YY Y N N N
38 Another Possibility: Query & Analysis Tools car=taurus city=sfage<45 likely unlikely YY Y N N N
39 Decision Tree Issues: Decision tree cannot be “too deep” would not have statistically significant amounts of data for lower decisions Need to select tree that most reliably predicts outcomes Query & Analysis Tools
40 (ii) Clustering: Query & Analysis Tools age income education
41 Another Example: Text Each document is a vector e.g., contains words 1,4,5,... Clusters contain “similar” documents Useful for understanding, searching documents Query & Analysis Tools international news sports business
42 Clustering Issues: Given desired number of clusters? Finding “best” clusters Are clusters semantically meaningful? e.g., “yuppies’’ cluster? Using clusters for disk storage Query & Analysis Tools
43 (iii) Association Rules Mining: Query & Analysis Tools customer id products bought sales records: Trend: Products p5, p8 often bought together Trend: Customer 12 likes product p9 market-basket data
44 Association Rule Rule: {p 1, p 3, p 8 } Support: number of baskets where these products appear High-support set: support threshold s Problem: find all high support sets Query & Analysis Tools
45 Finding High-Support Pairs Baskets(basket, item) SELECT I.item, J.item, COUNT(I.basket) FROM Baskets I, Baskets J WHERE I.basket = J.basket AND I.item = s; Query & Analysis Tools WHY?
46 Example: Query & Analysis Tools check if count s
47 Association Rule Issues: Performance for size 2 rules Performance for size k rules Query & Analysis Tools big even bigger!
48 Implementing a Data Warehouse
49 Monitoring: Sending data from sources Integrating: Loading, cleansing,... Processing: Query processing, indexing,... Managing: Metadata, Design,... Implementing a Data Warehouse
50 Monitoring: Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, … Incremental vs. Refresh Implementing a Data Warehouse new
51 Monitoring Techniques: Periodic snapshots Database triggers Log shipping Data shipping (replication service) Transaction shipping Polling (queries to source) Screen scraping Application level monitoring Implementing a Data Warehouse Advantages & Disadvantages!!
52 Monitoring Issues: Frequency periodic: daily, weekly, … triggered: on “big” change, lots of changes,... Data transformation convert data to uniform format remove & add fields (e.g., add date to get history) Standards (e.g., ODBC) Gateways Implementing a Data Warehouse
53 Integration Data Cleaning Data Loading Derived Data Implementing a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata
54 Data Cleaning: Migration (e.g., yen dollars) Scrubbing: use domain-specific knowledge (e.g., social security numbers) Fusion (e.g., mail list, customer merging) Auditing: discover rules & relationships (like data mining) Implementing a Data Warehouse billing DB service DB customer1(Joe) customer2(Joe) merged_customer(Joe)
55 Loading Data: Incremental vs. refresh Off-line vs. on-line Frequency of loading At night, 1x a week/month, continuously Parallel/Partitioned load Implementing a Data Warehouse
56 Derived Data: Derived Warehouse Data indexes aggregates materialized views (next slide) When to update derived data? Incremental vs. refresh Implementing a Data Warehouse
57 Materialized Views: Define new warehouse relations using SQL expressions Implementing a Data Warehouse does not exist at any source
Processing ROLAP servers vs. MOLAP servers Index Structures What to Materialize? Algorithms 58 Implementing a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata
ROLAP Server: Relational OLAP Server 59 Implementing a Data Warehouse relational DBMS ROLAP server tools utilities Special indices, tuning; Schema is “denormalized”
MOLAP Server: Multi-Dimensional OLAP Server 60 Implementing a Data Warehouse multi- dimensional server M.D. tools utilities could also sit on relational DBMS Product City Date milk soda eggs soap A B Sales
Index Structures Traditional Access Methods B-trees, hash tables, R-trees, grids, … Popular in Warehouses inverted lists bit map indexes join indexes text indexes 61 Implementing a Data Warehouse
Inverted Lists 62 Implementing a Data Warehouse... age index inverted lists data records
Using Inverted Lists: Query: Get people with age = 20 and name = “fred” List for age = 20: r4, r18, r34, r35 List for name = “fred”: r18, r52 Answer is intersection of “list for name” & “list for age”: r18 63 Implementing a Data Warehouse
Bit Map Index: 64 Implementing a Data Warehouse... age index bit maps data records
Using Bit Map Index: Query: Get people with age = 20 and name = “fred” List for age = 20: List for name = “fred”: Answer is intersection: Good if domain cardinality small Bit vectors can be compressed 65 Implementing a Data Warehouse
Join “Combine” SALE, PRODUCT relations In SQL: SELECT * FROM SALE, PRODUCT; 66 Implementing a Data Warehouse
Join Index 67 Implementing a Data Warehouse join index
What to Materialize Store in warehouse results useful for common queries Example: 68 Implementing a Data Warehouse day 2 day total sales materialize
Materialization Factors: Type/frequency of queries Query response time Storage cost Update cost 69 Implementing a Data Warehouse
Cube Aggregates Lattice: 70 Implementing a Data Warehouse city, product, date city, productcity, dateproduct, date cityproductdate all day 2 day use greedy algorithm to decide what to materialize
Dimension Hierarchies: 71 Implementing a Data Warehouse all state city
Dimension Hierarchies: 72 Implementing a Data Warehouse city, product city, product, date city, date product, date city product date all state, product, date state, date state, product state not all arcs shown...
Interesting Hierarchy: 73 Implementing a Data Warehouse all years quarters months days weeks conceptual dimension table
Algorithms: Query Optimization Parallel Processing Data Mining 74 Implementing a Data Warehouse
Managing: Metadata Warehouse Design Tools 75 Implementing a Data Warehouse Client Warehouse Source Query & Analysis Integration Metadata
Metadata: Administrative definition of sources, tools,... schemas, dimension hierarchies, … rules for extraction, cleaning, … refresh, purging policies user profiles, access control, Implementing a Data Warehouse
Metadata (Contd.): Business business terms & definition data ownership, charging Operational data lineage data currency (e.g., active, archived, purged) use stats, error reports, audit trails 77 Implementing a Data Warehouse
Design: What data is needed? Where does it come from? How to clean data? How to represent in warehouse (schema)? What to summarize? What to materialize? What to index? 78 Implementing a Data Warehouse
Tools: Development design & edit: schemas, views, scripts, rules, queries, reports Planning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management performance monitoring, usage patterns, exception reporting System & Network Management measure traffic (sources, warehouse, clients) Workflow Management “reliable scripts” for cleaning & analyzing data 79 Implementing a Data Warehouse
Current State of Industry: Extraction and integration done off-line Usually in large, time-consuming, batches Everything copied at warehouse Not selective about what is stored Query benefit vs. storage & update cost Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything 80 Implementing a Data Warehouse
81 Future Directions
Better performance Larger warehouses Easier to use What are companies & research labs working on? 82 Future Directions
Research: Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling (Back Flush) 83 Future Directions
Research: Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data 84 Future Directions
85 END