Data Warehousing Overview
CS 245 Notes 11
Hector Garcia-Molina
Stanford University
Warehousing
- Growing industry: $8 billion in 1998
- Ranges from desktop to huge:
  - Walmart: 900-CPU, 2,700-disk, 23 TB Teradata system
- Lots of buzzwords, hype
  - slice & dice, rollup, MOLAP, pivot, ...
Outline
- What is a data warehouse?
- Why a warehouse?
- Models & operations
- Implementing a warehouse
- Future directions
What is a Warehouse?
- Collection of diverse data
  - subject oriented
  - aimed at executive, decision maker
  - often a copy of operational data
  - with value-added data (e.g., summaries, history)
  - integrated
  - time-varying
  - non-volatile
What is a Warehouse? (continued)
- Collection of tools
  - gathering data
  - cleansing, integrating, ...
  - querying, reporting, analysis
  - data mining
  - monitoring, administering warehouse
Warehouse Architecture
[Figure: architecture diagram with components Client, Query & Analysis, Warehouse, Metadata, Integration, Source]
Motivating Examples
- Forecasting
- Comparing performance of units
- Monitoring, detecting fraud
- Visualization
Why a Warehouse?
- Two approaches:
  - Query-Driven (Lazy)
  - Warehouse (Eager)
Query-Driven Approach
[Figure: diagram with components Client, Mediator, Wrapper, Source]
Advantages of Warehousing
- High query performance
- Queries not visible outside warehouse
- Local processing at sources unaffected
- Can operate when sources unavailable
- Can query data not stored in a DBMS
- Extra information at warehouse
  - Modify, summarize (store aggregates)
  - Add historical information
Advantages of Query-Driven
- No need to copy data
  - less storage
  - no need to purchase data
- More up-to-date data
- Query needs can be unknown
- Only query interface needed at sources
- May be less draining on sources
OLTP vs. OLAP
- OLTP: On-Line Transaction Processing
  - Describes processing at operational sites
- OLAP: On-Line Analytical Processing
  - Describes processing at warehouse
OLTP vs. OLAP
OLTP:
- Mostly updates
- Many small transactions
- MB-TB of data
- Raw data
- Clerical users
- Up-to-date data
- Consistency, recoverability critical
OLAP:
- Mostly reads
- Queries long, complex
- GB-TB of data
- Summarized, consolidated data
- Decision-makers, analysts as users
Data Marts
- Smaller warehouses
- Span part of an organization
  - e.g., marketing (customers, products, sales)
- Do not require enterprise-wide consensus
  - but long-term integration problems?
Warehouse Models & Operators
- Data Models
  - relations
  - stars & snowflakes
  - cubes
- Operators
  - slice & dice
  - roll-up, drill-down
  - pivoting
  - other
Star
[Figure not recovered]
Star Schema
[Figure not recovered]
Terms
- Fact table
- Dimension tables
- Measures
Dimension Hierarchies
[Figure: hierarchy over store, sType, city, region]
- snowflake schema
- constellations
Cube
[Figure: fact table view next to the equivalent multi-dimensional cube; dimensions = 2]
3-D Cube
[Figure: fact table view next to the equivalent multi-dimensional cube with slices day 1 and day 2; dimensions = 3]
ROLAP vs. MOLAP
- ROLAP: Relational On-Line Analytical Processing
- MOLAP: Multi-Dimensional On-Line Analytical Processing
Aggregates
- Add up amounts for day 1
- In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
- Result: 81
Aggregates
- Add up amounts by day
- In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
Another Example
- Add up amounts by day, product
- In SQL: SELECT date, prodId, sum(amt) FROM SALE GROUP BY date, prodId
- Moving from the by-day totals to the by-day, by-product totals is a drill-down; the reverse direction is a rollup
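The GROUP BY queries above can be tried on a tiny table. The following is a minimal sketch using Python's built-in sqlite3 module; the SALE rows and the storeId column are illustrative assumptions (the slide's figure data did not survive), chosen so that the day 1 total is 81.

```python
import sqlite3

# Hypothetical SALE rows (prodId, storeId, date, amt); not recovered from the slides.
rows = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALE (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO SALE VALUES (?, ?, ?, ?)", rows)

# Rolled-up view: one total per day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM SALE GROUP BY date ORDER BY date"
).fetchall()

# Drill-down: one total per (day, product).
by_day_product = conn.execute(
    "SELECT date, prodId, SUM(amt) FROM SALE "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(by_day)          # [(1, 81), (2, 48)]
print(by_day_product)  # [(1, 'p1', 62), (1, 'p2', 19), (2, 'p1', 48)]
```

Summing the drill-down rows for a given day gives back that day's rolled-up total, which is the relationship the two terms describe.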
Aggregates
- Operators: sum, count, max, min, median, avg
- “Having” clause
- Using dimension hierarchy
  - average by region (within store)
  - maximum by month (within date)
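Cube Aggregation
- Example: computing sums
[Figure: cube with slices day 1 and day 2, aggregated step by step; arrows labeled rollup and drill-down]
Cube Operators
[Figure: cube with slices day 1 and day 2 and its aggregates]
- Example cells: sale(c1,*,*), sale(*,*,*), sale(c2,p2,*)
Extended Cube
[Figure: cube with slices day 1 and day 2, extended with a * coordinate on each dimension]
- Example cell: sale(*,p2,*)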
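One way to read the sale(...) notation is as a lookup into a cube whose every dimension has an extra `*` ("all") coordinate. The sketch below builds such an extended cube in memory; the fact rows and the (store, product, day) coordinate order are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (store, product, day, amount).
facts = [
    ("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8), ("c1", "p1", 2, 44), ("c2", "p1", 2, 4),
]

def extended_cube(facts):
    """Sum amounts into every cell of the cube, including the '*' cells."""
    cube = defaultdict(int)
    for store, prod, day, amt in facts:
        # Each fact contributes to 2^3 cells: every coordinate is either
        # kept as-is or replaced by '*'.
        for cell in product((store, "*"), (prod, "*"), (day, "*")):
            cube[cell] += amt
    return dict(cube)

sale = extended_cube(facts)
print(sale[("c1", "*", "*")])   # 67: all sales at c1
print(sale[("*", "*", "*")])    # 129: grand total
print(sale[("c2", "p2", "*")])  # 8
print(sale[("*", "p2", "*")])   # 19
```

Precomputing every cell this way trades storage for lookup speed, which is the materialization question the later slides return to.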
Aggregation Using Hierarchies
[Figure: cube with slices day 1 and day 2]
- Hierarchy: customer -> region -> country
- (customer c1 in Region A; customers c2, c3 in Region B)
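Rolling up along a hierarchy amounts to replacing each coordinate by its parent and re-aggregating. A minimal sketch, using the customer-to-region assignment stated on the slide and hypothetical fact rows:

```python
from collections import defaultdict

# Customer -> region mapping as given on the slide.
region_of = {"c1": "A", "c2": "B", "c3": "B"}

# Hypothetical fact rows: (customer, product, day, amount).
facts = [
    ("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8), ("c1", "p1", 2, 44), ("c2", "p1", 2, 4),
]

def roll_up(facts, parent_of):
    """Aggregate amounts after mapping each customer to its parent in the hierarchy."""
    totals = defaultdict(int)
    for cust, prod, day, amt in facts:
        totals[(parent_of[cust], prod, day)] += amt
    return dict(totals)

by_region = roll_up(facts, region_of)
print(by_region[("B", "p1", 1)])  # 50
print(by_region[("A", "p1", 2)])  # 44
```

Applying the same function again with a region-to-country mapping would give the next level of the hierarchy.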
Pivoting
[Figure: fact table view next to the multi-dimensional cube with slices day 1 and day 2]
Query & Analysis Tools
- Query Building
- Report Writers (comparisons, growth, graphs, ...)
- Spreadsheet Systems
- Web Interfaces
- Data Mining
Other Operations
- Time functions
  - e.g., time average
- Computed Attributes
  - e.g., commission = sales * rate
- Text Queries
  - e.g., find documents with words X AND Y
  - e.g., rank documents by frequency of words X, Y, Z
Data Mining
- Decision Trees
- Clustering
- Association Rules
Decision Trees
- Example:
  - Conducted a survey to see which customers were interested in a new model car
  - Want to select customers for an advertising campaign
[Figure: survey results table, labeled "training set"]
One Possibility
[Figure: decision tree with tests age<30, city=sf, car=van; Y/N branches lead to leaves "likely" and "unlikely"]
Another Possibility
[Figure: decision tree with tests car=taurus, city=sf, age<45; Y/N branches lead to leaves "likely" and "unlikely"]
Issues
- Decision tree cannot be “too deep”
  - otherwise there would not be statistically significant amounts of data for the lower decisions
- Need to select the tree that most reliably predicts outcomes
Clustering
[Figure: points clustered in a space with axes age, income, education]
Another Example: Text
- Each document is a vector
  - e.g., contains words 1, 4, 5, ...
- Clusters contain “similar” documents
- Useful for understanding, searching documents
[Figure: document clusters labeled international news, sports, business]
Issues
- Given desired number of clusters?
- Finding “best” clusters
- Are clusters semantically meaningful?
  - e.g., “yuppies” cluster?
- Using clusters for disk storage
Association Rule Mining
[Figure: sales records table with columns transaction id, customer id, products bought; labeled "market-basket data"]
- Trend: Products p5, p8 often bought together
- Trend: Customer 12 likes product p9
Association Rule
- Rule: {p1, p3, p8}
- Support: number of baskets where these products appear
- High-support set: support >= threshold s
- Problem: find all high-support sets
Finding High-Support Pairs
- Baskets(basket, item)
- SELECT I.item, J.item, COUNT(I.basket)
  FROM Baskets I, Baskets J
  WHERE I.basket = J.basket AND I.item < J.item   (WHY?)
  GROUP BY I.item, J.item
  HAVING COUNT(I.basket) >= s;
Example
[Figure: Baskets table, the joined item pairs, and their counts]
- check if count >= s
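The pair-counting query can be run end to end with sqlite3. This is a sketch under assumptions: the basket contents and the threshold s = 3 are made up, and the pair condition is taken to be `I.item < J.item`, which counts each unordered pair once and excludes pairing an item with itself.

```python
import sqlite3

# Hypothetical market-basket data: basket id -> items bought.
baskets = {
    1: ["p1", "p2", "p3"],
    2: ["p1", "p2"],
    3: ["p2", "p3"],
    4: ["p1", "p2", "p3"],
}
s = 3  # support threshold (assumed value)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Baskets (basket INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO Baskets VALUES (?, ?)",
    [(b, item) for b, items in baskets.items() for item in items],
)

# Self-join on basket; I.item < J.item keeps one copy of each unordered pair.
pairs = conn.execute(
    """
    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= ?
    ORDER BY I.item, J.item
    """,
    (s,),
).fetchall()

print(pairs)  # [('p1', 'p2', 3), ('p2', 'p3', 3)]
```

The pair (p1, p3) appears in only two baskets, so it is filtered out by the HAVING clause.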
Issues
- Performance for size 2 rules
[Figure: Baskets table and its intermediate results, annotated "big" and "even bigger!"]
- Performance for size k rules
Implementing a Warehouse
- Monitoring: Sending data from sources
- Integrating: Loading, cleansing, ...
- Processing: Query processing, indexing, ...
- Managing: Metadata, Design, ...
Monitoring
- Source Types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, ...
- Incremental vs. Refresh
Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Screen scraping
- Application level monitoring
Advantages & Disadvantages!!
Monitoring Issues
- Frequency
  - periodic: daily, weekly, ...
  - triggered: on “big” change, lots of changes, ...
- Data transformation
  - convert data to uniform format
  - remove & add fields (e.g., add date to get history)
- Standards (e.g., ODBC)
- Gateways
Integration
- Data Cleaning
- Data Loading
- Derived Data
[Figure: architecture diagram with components Client, Query & Analysis, Warehouse, Metadata, Integration, Source]
Data Cleaning
- Migration (e.g., yen -> dollars)
- Scrubbing: use domain-specific knowledge (e.g., social security numbers)
- Fusion (e.g., mail list, customer merging)
  [Figure: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe)]
- Auditing: discover rules & relationships (like data mining)
Loading Data
- Incremental vs. refresh
- Off-line vs. on-line
- Frequency of loading
  - At night, 1x a week/month, continuously
- Parallel/Partitioned load
Derived Data
- Derived Warehouse Data
  - indexes
  - aggregates
  - materialized views
- When to update derived data?
- Incremental vs. refresh
Materialized Views
- Define new warehouse relations using SQL expressions
[Figure: a derived relation, annotated "does not exist at any source"]
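A derived relation and the "incremental vs. refresh" choice can be illustrated together. The sketch below uses sqlite3, which has no materialized-view statement, so a plain table created with CREATE TABLE ... AS stands in for one; the table dailyTotal, the helper insert_sale, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE SALE (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
    INSERT INTO SALE VALUES ('p1','c1',1,12), ('p2','c1',1,11), ('p1','c2',2,44);
    -- Stored result of an SQL expression; stands in for a materialized view.
    CREATE TABLE dailyTotal AS
        SELECT date, SUM(amt) AS total FROM SALE GROUP BY date;
    """
)

def insert_sale(prod, store, day, amt):
    """Insert a sale and maintain dailyTotal incrementally (no recomputation)."""
    conn.execute("INSERT INTO SALE VALUES (?, ?, ?, ?)", (prod, store, day, amt))
    cur = conn.execute(
        "UPDATE dailyTotal SET total = total + ? WHERE date = ?", (amt, day)
    )
    if cur.rowcount == 0:  # first sale of a new day
        conn.execute("INSERT INTO dailyTotal VALUES (?, ?)", (day, amt))

insert_sale("p2", "c3", 2, 6)
insert_sale("p1", "c1", 3, 5)

incremental = conn.execute(
    "SELECT date, total FROM dailyTotal ORDER BY date").fetchall()
# A full refresh recomputes the same relation from scratch.
refreshed = conn.execute(
    "SELECT date, SUM(amt) FROM SALE GROUP BY date ORDER BY date").fetchall()

print(incremental)  # [(1, 23), (2, 50), (3, 5)]
```

Incremental maintenance touches one row per new sale, while a refresh rescans SALE; both arrive at the same contents here.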
Processing
- ROLAP servers vs. MOLAP servers
- Index Structures
- What to Materialize?
- Algorithms
[Figure: architecture diagram with components Client, Query & Analysis, Warehouse, Metadata, Integration, Source]
ROLAP Server
- Relational OLAP Server
[Figure: tools and utilities on top of a ROLAP server, which sits on a relational DBMS]
- Special indices, tuning; schema is “denormalized”
MOLAP Server
- Multi-Dimensional OLAP Server
[Figure: M.D. tools and utilities on top of a multi-dimensional server; example Sales cube with dimensions Product (milk, soda, eggs, soap), City (A, B), Date]
- could also sit on relational DBMS
Index Structures
- Traditional Access Methods
  - B-trees, hash tables, R-trees, grids, ...
- Popular in Warehouses
  - inverted lists
  - bit map indexes
  - join indexes
  - text indexes
Inverted Lists
[Figure: age index pointing to inverted lists, which point to data records]
Using Inverted Lists
- Query:
  - Get people with age = 20 and name = “fred”
- List for age = 20: r4, r18, r34, r35
- List for name = “fred”: r18, r52
- Answer is intersection: r18
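The intersection step is small enough to write out. This sketch uses the two lists from the slide; the dictionary layout and the `lookup` helper are illustrative choices, not part of the notes.

```python
# Inverted lists from the slide: (attribute, value) -> ids of matching records.
inverted_lists = {
    ("age", 20): ["r4", "r18", "r34", "r35"],
    ("name", "fred"): ["r18", "r52"],
}

def lookup(index, **conditions):
    """Intersect the inverted lists of all attribute = value conditions."""
    lists = [set(index.get(cond, [])) for cond in conditions.items()]
    return sorted(set.intersection(*lists)) if lists else []

print(lookup(inverted_lists, age=20, name="fred"))  # ['r18']
```

Only the record ids are touched; the data records themselves are fetched after the intersection has narrowed the answer.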
Bit Maps
[Figure: age index pointing to bit maps, which correspond to data records]
Using Bit Maps
- Query:
  - Get people with age = 20 and name = “fred”
- List for age = 20: [bit vector shown on slide]
- List for name = “fred”: [bit vector shown on slide]
- Answer is intersection: [bit vector shown on slide]
- Good if domain cardinality small
- Bit vectors can be compressed
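With bit maps the intersection becomes a bitwise AND. The sketch below stores each bit vector as a Python integer, with bit i standing for record i; the five records are hypothetical, since the slide's bit vectors were not recovered.

```python
from collections import defaultdict

# Hypothetical data records: (name, age); position in the list is the record number.
records = [("fred", 20), ("ann", 25), ("fred", 30), ("bob", 20), ("fred", 20)]

def build_bitmaps(records, column):
    """One bit vector per distinct value; bit i is set if record i has that value."""
    bitmaps = defaultdict(int)
    for i, row in enumerate(records):
        bitmaps[row[column]] |= 1 << i
    return dict(bitmaps)

name_bitmaps = build_bitmaps(records, 0)
age_bitmaps = build_bitmaps(records, 1)

# age = 20 AND name = 'fred' is a single bitwise AND of two vectors.
hits = age_bitmaps[20] & name_bitmaps["fred"]
matching = [i for i in range(len(records)) if (hits >> i) & 1]

print(bin(age_bitmaps[20]), bin(name_bitmaps["fred"]), bin(hits))
print(matching)  # [0, 4]
```

There is one vector per distinct value, which is why the approach suits columns with small domain cardinality.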
Join
- “Combine” SALE, PRODUCT relations
- In SQL: SELECT * FROM SALE, PRODUCT
[Figure: SALE and PRODUCT tables and the joined result]
Join Indexes
[Figure: SALE and PRODUCT tables connected by a join index]
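A join index can be thought of as a precomputed map from each PRODUCT key to the SALE records that join with it. The following is a rough sketch; the record ids, column names, and the `sales_for` helper are assumptions for illustration, since the slide's tables did not survive.

```python
from collections import defaultdict

# Hypothetical relations; column names and values are assumptions.
product = {"p1": {"name": "bolt", "price": 10}, "p2": {"name": "nut", "price": 5}}
sale = {
    "r1": {"prodId": "p1", "storeId": "c1", "date": 1, "amt": 12},
    "r2": {"prodId": "p2", "storeId": "c1", "date": 1, "amt": 11},
    "r3": {"prodId": "p1", "storeId": "c3", "date": 1, "amt": 50},
}

# Join index: for each PRODUCT key, the ids of the SALE records that join with it.
join_index = defaultdict(list)
for rid, row in sale.items():
    join_index[row["prodId"]].append(rid)

def sales_for(prod_id):
    """Return joined (product, sale) rows by following the index, without scanning SALE."""
    return [{**product[prod_id], **sale[rid]} for rid in join_index.get(prod_id, [])]

print(sales_for("p1"))
```

The index is built once at load time, so queries that join on the product key pay only for the records they actually return.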
What to Materialize?
- Store in warehouse results useful for common queries
- Example:
[Figure: cube with slices day 1 and day 2; the total sales aggregate is marked "materialize"]
Materialization Factors
- Type/frequency of queries
- Query response time
- Storage cost
- Update cost
Cube Aggregates Lattice
[Figure: cube with slices day 1 and day 2, alongside the lattice of its aggregates]
- Level 3: (city, product, date)
- Level 2: (city, product), (city, date), (product, date)
- Level 1: (city), (product), (date)
- Level 0: all
- Use greedy algorithm to decide what to materialize
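The notes do not spell out the greedy algorithm, so the following is only one plausible sketch of a greedy selection over this lattice. It assumes a simple cost model (a query on a group-by costs as many rows as the smallest materialized view that contains its dimensions) and hypothetical view sizes; at each step it materializes the view that reduces total query cost the most.

```python
DIMS = ("city", "product", "date")

# Hypothetical row counts for each group-by in the lattice.
size = {
    frozenset(DIMS): 1000,
    frozenset({"city", "product"}): 600,
    frozenset({"city", "date"}): 900,
    frozenset({"product", "date"}): 300,
    frozenset({"city"}): 50,
    frozenset({"product"}): 20,
    frozenset({"date"}): 30,
    frozenset(): 1,  # "all"
}

def query_cost(view, materialized):
    """Rows read to answer `view`: the smallest materialized view covering its dimensions."""
    return min(size[m] for m in materialized if view <= m)

def benefit(candidate, materialized):
    """Total cost reduction over all views answerable from `candidate`."""
    return sum(
        max(0, query_cost(w, materialized) - size[candidate])
        for w in size if w <= candidate
    )

def greedy_select(k):
    """Pick up to k views beyond the top view, best benefit first."""
    chosen = [frozenset(DIMS)]  # the top view is always stored
    for _ in range(k):
        candidates = [v for v in size if v not in chosen]
        best = max(candidates, key=lambda v: benefit(v, chosen))
        if benefit(best, chosen) <= 0:
            break
        chosen.append(best)
    return chosen

for view in greedy_select(2):
    print(sorted(view))
```

With these made-up sizes the first pick is (product, date) and the second is (city); different sizes or query frequencies would change the choice, which is the point of the materialization factors listed earlier.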
Dimension Hierarchies
[Figure: hierarchy city -> state -> all]
Dimension Hierarchies
[Figure: aggregates lattice extended with the state level; not all arcs shown]
- Nodes: (city, product, date), (city, product), (city, date), (product, date), (state, product, date), (city), (product), (date), (state, date), (state, product), (state), all
Interesting Hierarchy
[Figure: time hierarchy over days, weeks, months, quarters, years, all; shown with a conceptual dimension table]
Algorithms
- Query Optimization
- Parallel Processing
- Data Mining
Example: Association Rules
- How do we perform rule mining efficiently?
- Observation: If set X has support t, then each subset of X must have support at least t
- For 2-sets:
  - if we need support s for {i, j}
  - then each of i, j must appear in at least s baskets
Algorithm for 2-Sets
(1) Find OK products
  - those appearing in s or more baskets
(2) Find high-support pairs using only OK products
Algorithm for 2-Sets
- INSERT INTO okBaskets(basket, item)
  SELECT basket, item FROM Baskets
  WHERE item IN (SELECT item FROM Baskets GROUP BY item HAVING COUNT(basket) >= s);
- Perform mining on okBaskets:
  SELECT I.item, J.item, COUNT(I.basket)
  FROM okBaskets I, okBaskets J
  WHERE I.basket = J.basket AND I.item < J.item
  GROUP BY I.item, J.item
  HAVING COUNT(I.basket) >= s;
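The two-step idea (filter to OK products, then count pairs among them) can also be sketched directly in Python. The basket contents and threshold below are hypothetical, and `high_support_pairs` is an illustrative name rather than anything from the notes.

```python
from collections import Counter
from itertools import combinations

def high_support_pairs(baskets, s):
    """Two-step pair mining: (1) find OK items, (2) count pairs of OK items only."""
    # Step 1: items appearing in s or more baskets.
    item_counts = Counter(item for items in baskets.values() for item in set(items))
    ok_items = {item for item, n in item_counts.items() if n >= s}

    # Step 2: count pairs, skipping any item that cannot be in a high-support pair.
    pair_counts = Counter()
    for items in baskets.values():
        kept = sorted(set(items) & ok_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= s}

# Hypothetical market-basket data.
baskets = {
    1: ["p1", "p2", "p3"],
    2: ["p1", "p2", "p4"],
    3: ["p2", "p3"],
    4: ["p1", "p2", "p3", "p5"],
}
print(high_support_pairs(baskets, s=3))  # {('p1', 'p2'): 3, ('p2', 'p3'): 3}
```

Items p4 and p5 each appear in one basket, so they are dropped in step 1 and never generate candidate pairs in step 2.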
Counting Efficiently
- One way: sort, then count & remove
[Figure: list of item pairs processed with threshold = 3]
Counting Efficiently
- Another way: scan & count, then remove
- keep counter array in memory
[Figure: list of item pairs processed with threshold = 3]
Yet Another Way
(1) scan & hash & count (in-memory hash table)
(2) scan & remove
(3) scan & count (in-memory counters)
(4) remove (false positives)
[Figure: list of item pairs processed with threshold = 3; one surviving pair is marked "false positive"]
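The hashing scheme can be sketched as follows. This is an approximation of the four steps rather than a faithful reproduction of the figure: steps (2) and (3) are folded into a single second pass, the bucket count is an arbitrary assumption, and the input list of pairs is made up.

```python
from collections import Counter

def frequent_pairs_by_hashing(pairs, threshold, num_buckets=8):
    """Find pairs occurring at least `threshold` times using a small bucket table."""
    def bucket_of(pair):
        return hash(pair) % num_buckets

    # (1) First scan: hash each pair to a bucket and count per bucket.
    bucket_counts = [0] * num_buckets
    for pair in pairs:
        bucket_counts[bucket_of(pair)] += 1

    # (2) + (3) Second scan: drop pairs in light buckets, count the rest exactly.
    # A rare pair can survive here if it shares a bucket with frequent ones.
    exact = Counter(
        pair for pair in pairs if bucket_counts[bucket_of(pair)] >= threshold
    )

    # (4) Remove the false positives.
    return {pair: n for pair, n in exact.items() if n >= threshold}

# Hypothetical stream of item pairs.
sample = [("A", "B"), ("C", "D"), ("A", "B"), ("E", "F"), ("C", "D"),
          ("A", "B"), ("C", "D"), ("G", "H"), ("C", "D")]
print(frequent_pairs_by_hashing(sample, threshold=3))
```

A pair in a bucket whose count is below the threshold cannot itself reach the threshold, so the first filter never discards a true answer; it only lets some false positives through for step (4) to remove.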
Discussion
- Hashing scheme: 2 (or 3) scans of data
- Sorting scheme: requires a sort!
- Hashing works well if few high-support pairs and many low-support ones
[Figure: item-pairs ranked by frequency, with a frequency threshold line; labeled "iceberg queries"]
Managing
- Metadata
- Warehouse Design
- Tools
[Figure: architecture diagram with components Client, Query & Analysis, Warehouse, Metadata, Integration, Source]
Metadata
- Administrative
  - definition of sources, tools, ...
  - schemas, dimension hierarchies, ...
  - rules for extraction, cleaning, ...
  - refresh, purging policies
  - user profiles, access control, ...
Metadata
- Business
  - business terms & definitions
  - data ownership, charging
- Operational
  - data lineage
  - data currency (e.g., active, archived, purged)
  - use stats, error reports, audit trails
Design
- What data is needed?
- Where does it come from?
- How to clean data?
- How to represent in warehouse (schema)?
- What to summarize?
- What to materialize?
- What to index?
Tools
- Development
  - design & edit: schemas, views, scripts, rules, queries, reports
- Planning & Analysis
  - what-if scenarios (schema changes, refresh rates), capacity planning
- Warehouse Management
  - performance monitoring, usage patterns, exception reporting
- System & Network Management
  - measure traffic (sources, warehouse, clients)
- Workflow Management
  - “reliable scripts” for cleaning & analyzing data
Current State of Industry
- Extraction and integration done off-line
  - Usually in large, time-consuming batches
- Everything copied at warehouse
  - Not selective about what is stored
  - Query benefit vs. storage & update cost
- Query optimization aimed at OLTP
  - High throughput instead of fast response
  - Process whole query before displaying anything
Future Directions
- Better performance
- Larger warehouses
- Easier to use
- What are companies & research labs working on?
Research (1)
- Incremental Maintenance
- Data Consistency
- Data Expiration
- Recovery
- Data Quality
- Error Handling (Back Flush)
Research (2)
- Rapid Monitor Construction
- Temporal Warehouses
- Materialization & Index Selection
- Data Fusion
- Data Mining
- Integration of Text & Relational Data
Conclusions
- Massive amounts of data and complexity of queries will push limits of current warehouses
- Need better systems:
  - easier to use
  - provide quality information