Download presentation
Presentation is loading. Please wait.
Published byAbigayle Pearson Modified over 9 years ago
1
Data Warehousing Overview CS245 Notes 11 Hector Garcia-Molina Stanford University CS 245 1 Notes11
2
CS 245Notes11 2 Outline l What is a data warehouse? l Why a warehouse? l Models & operations l Implementing a warehouse
3
CS 245Notes11 3 What is a Warehouse? l Collection of diverse data u subject oriented u aimed at executive, decision maker u often a copy of operational data u with value-added data (e.g., summaries, history) u integrated u time-varying u non-volatile more
4
CS 245Notes11 4 What is a Warehouse? l Collection of tools u gathering data u cleansing, integrating,... u querying, reporting, analysis u data mining u monitoring, administering warehouse
5
CS 245Notes11 5 Warehouse Architecture Client Warehouse Source Query & Analysis Integration Metadata
6
CS 245Notes11 6 Motivating Examples l Forecasting l Comparing performance of units l Monitoring, detecting fraud l Visualization
7
CS 245Notes11 7 Alternative to Warehousing l Two Approaches: u Query-Driven (Lazy) u Warehouse (Eager) Source ?
8
CS 245Notes11 8 Query-Driven Approach Client Wrapper Mediator Source
9
CS 245Notes11 9 Advantages of Warehousing l High query performance l Queries not visible outside warehouse l Local processing at sources unaffected l Can operate when sources unavailable l Can query data not stored in a DBMS l Extra information at warehouse u Modify, summarize (store aggregates) u Add historical information
10
CS 245Notes11 10 Advantages of Query-Driven l No need to copy data u less storage u no need to purchase data l More up-to-date data l Query needs can be unknown l Only query interface needed at sources l May be less draining on sources
11
CS 245Notes11 11 Warehouse Models & Operators l Data Models u relational u cubes l Operators
12
CS 245Notes11 12 Star
13
CS 245Notes11 13 Star Schema
14
CS 245Notes11 14 Terms l Fact table l Dimension tables l Measures
15
CS 245Notes11 15 Dimension Hierarchies store sType cityregion snowflake schema constellations
16
CS 245Notes11 16 Cube Fact table view: Multi-dimensional cube: dimensions = 2
17
CS 245Notes11 17 3-D Cube day 2 day 1 dimensions = 3 Multi-dimensional cube:Fact table view:
18
Operators l Traditional u selection u aggregation u... l Analysis u clean data u find trends u... CS 245Notes11 18 l Relational l Cube
19
CS 245Notes11 19 Aggregates Add up amounts for day 1 In SQL: SELECT sum(amt) FROM SALE WHERE date = 1 81
20
CS 245Notes11 20 Aggregates Add up amounts by day In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
21
CS 245Notes11 21 Another Example Add up amounts by day, product In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId drill-down rollup
22
CS 245Notes11 22 Aggregates l Operators: sum, count, max, min, median, ave l “Having” clause l Using dimension hierarchy u average by region (within store) u maximum by month (within date)
23
CS 245Notes11 23 Cube Aggregation day 2 day 1 129... drill-down rollup Example: computing sums
24
CS 245Notes11 24 Cube Operators day 2 day 1 129... sale(c1,*,*) sale(*,*,*) sale(c2,p2,*)
25
CS 245Notes11 25 Extended Cube day 2 day 1 * sale(*,p2,*)
26
CS 245Notes11 26 Aggregation Using Hierarchies day 2 day 1 customer region country (customer c1 in Region A; customers c2, c3 in Region B)
27
CS 245Notes11 27 Data Analysis l Decision Trees l Clustering l Association Rules
28
CS 245Notes11 28 Decision Trees Example: Conducted survey to see what customers were interested in new model car Want to select customers for advertising campaign training set
29
CS 245Notes11 29 One Possibility age<30 city=sfcar=van likely unlikely YY Y N N N
30
CS 245Notes11 30 Another Possibility car=taurus city=sfage<45 likely unlikely YY Y N N N
31
CS 245Notes11 31 Issues l Decision tree cannot be “too deep” è would not have statistically significant amounts of data for lower decisions l Need to select tree that most reliably predicts outcomes
32
CS 245Notes11 32 Clustering age income education
33
CS 245Notes11 33 Clustering age income education
34
CS 245Notes11 34 Another Example: Text l Each document is a vector u e.g., contains words 1,4,5,... l Clusters contain “similar” documents l Useful for understanding, searching documents international news sports business
35
CS 245Notes11 35 Issues l Given desired number of clusters? l Finding “best” clusters l Are clusters semantically meaningful? u e.g., “yuppies’’ cluster? l Using clusters for disk storage
36
CS 245Notes11 36 Association Rule Mining transaction id customer id products bought sales records: Trend: Products p5, p8 often bough together Trend: Customer 12 likes product p9 market-basket data
37
CS 245Notes11 37 Association Rule l Rule: {p 1, p 3, p 8 } l Support: number of baskets where these products appear l High-support set: support threshold s l Problem: find all high support sets
38
Implementation Issues l ETL (Extraction, transformation, loading) u Getting data to the warehouse u Entity Resolution l What to materialize? l Efficient Analysis u Association rule mining u... CS 245Notes11 38
39
CS 245Notes11 39 ETL: Monitoring Techniques l Periodic snapshots l Database triggers l Log shipping l Data shipping (replication service) l Transaction shipping l Polling (queries to source) l Screen scraping l Application level monitoring Advantages & Disadvantages!!
40
CS 245Notes11 40 ETL: Data Cleaning Migration (e.g., yen dollars) l Scrubbing: use domain-specific knowledge (e.g., social security numbers) l Fusion (e.g., mail list, customer merging) l Auditing: discover rules & relationships (like data mining) billing DB service DB customer1(Joe) customer2(Joe) merged_customer(Joe)
41
41 More details: Entity Resolution N: a A: b CC#: c Ph: e e1 N: a Exp: d Ph: e e2
42
42 Applications l comparison shopping l mailing lists l classified ads l customer files l counter-terrorism N: a A: b CC#: c Ph: e e1 N: a Exp: d Ph: e e2
43
Why is ER Challenging? l Huge data sets l No unique identifiers l Lots of uncertainty l Many ways to skin the cat 43
44
Taxonomy: Pairwise vs Global l Decide if r, s match only by looking at r, s? l Or need to consider more (all) records? 44 Nm: Pat Smith Ad: 123 Main St Ph: (650) 555-1212 Nm: Patrick Smith Ad: 132 Main St Ph: (650) 555-1212 Nm: Patricia Smith Ad: 123 Main St Ph: (650) 777-1111 or
45
Taxonomy: Pairwise vs Global l Global matching complicates things a lot! u e.g., change decision as new records arrive 45 Nm: Pat Smith Ad: 123 Main St Ph: (650) 555-1212 Nm: Patrick Smith Ad: 132 Main St Ph: (650) 555-1212 Nm: Patricia Smith Ad: 123 Main St Ph: (650) 777-1111 or
46
Taxonomy: Outcome l Partition of records u e.g., comparison shopping l Merged records 46 Nm: Pat Smith Ad: 123 Main St Ph: (650) 555-1212 Nm: Patricia Smith Ad: 123 Main St Ph: (650) 555-1212 (650) 777-1111 Hair: Black Nm: Patricia Smith Ad: 132 Main St Ph: (650) 777-1111 Hair: Black
47
47 Taxonomy: Outcome l Iterate after merging Nm: Tom Wk: IBM Oc: laywer Sal: 500K Nm: Tom Ad: 123 Main BD: Jan 1, 85 Wk: IBM Nm: Thomas Ad: 123 Maim Oc: lawyer Nm: Tom Ad: 123 Main BD: Jan 1, 85 Wk: IBM Oc: lawyer Nm: Tom Ad: 123 Main BD: Jan 1, 85 Wk: IBM Oc: lawyer Sal: 500K
48
Taxonomy: Record Reuse l One record related to multiple entities? 48 Nm: Pat Smith Sr. Ph: (650) 555-1212 Ad: 123 Main St Nm: Pat Smith Jr. Ph: (650) 555-1212 Nm: Pat Smith Sr. Ph: (650) 555-1212 Ad: 123 Main St Nm: Pat Smith Jr. Ph: (650) 555-1212 Ad: 123 Main St
49
Taxonomy: Record Reuse l Partitions 49 Merges r s t r s t rs st
50
Taxonomy: Record Reuse l Partitions 50 Merges r s t r s t rs st Record reuse complex and expensive!
51
51 Taxonomy: Multiple Entity Types person 1 person 2 Organization B Organization A brother member business member
52
52 Taxonomy: Multiple Entity Types p1 p2 p5 p7 a1 a2 a3 a5 a4 authors papers same??
53
53 Taxonomy: Exact vs Approximate products cameras resolved cameras CDs books... resolved CDs resolved books... ER
54
54 Taxonomy: Exact vs Approximate terrorists sort by age B Cooper 30 match against ages 25-35
55
Implementation Issues l ETL (Extraction, transformation, loading) u Getting data to the warehouse u Entity Resolution l What to materialize? l Efficient Analysis u Association rule mining u... CS 245Notes11 55
56
CS 245Notes11 56 What to Materialize? l Store in warehouse results useful for common queries l Example: day 2 day 1 129... total sales materialize
57
CS 245Notes11 57 Materialization Factors l Type/frequency of queries l Query response time l Storage cost l Update cost
58
CS 245Notes11 58 Cube Aggregates Lattice city, product, date city, productcity, dateproduct, date cityproductdate all day 2 day 1 129 use greedy algorithm to decide what to materialize
59
CS 245Notes11 59 Dimension Hierarchies all state city
60
CS 245Notes11 60 Dimension Hierarchies city, product city, product, date city, date product, date city product date all state, product, date state, date state, product state not all arcs shown...
61
CS 245Notes11 61 Interesting Hierarchy all years quarters months days weeks conceptual dimension table
62
Implementation Issues l ETL (Extraction, transformation, loading) u Getting data to the warehouse u Entity Resolution l What to materialize? l Efficient Analysis u Association rule mining u... CS 245Notes11 62
63
CS 245Notes11 63 Finding High-Support Pairs l Baskets(basket, item) l SELECT I.item, J.item, COUNT(I.basket) FROM Baskets I, Baskets J WHERE I.basket = J.basket AND I.item = s;
64
CS 245Notes11 64 Finding High-Support Pairs l Baskets(basket, item) l SELECT I.item, J.item, COUNT(I.basket) FROM Baskets I, Baskets J WHERE I.basket = J.basket AND I.item = s; WHY?
65
CS 245Notes11 65 Example
66
CS 245Notes11 66 Example check if count s
67
CS 245Notes11 67 Issues l Performance for size 2 rules big even bigger! l Performance for size k rules
68
CS 245Notes11 68 Association Rules l How do we perform rule mining efficiently?
69
CS 245Notes11 69 Association Rules l How do we perform rule mining efficiently? l Observation: If set X has support t, then each X subset must have at least support t
70
CS 245Notes11 70 Association Rules l How do we perform rule mining efficiently? l Observation: If set X has support t, then each X subset must have at least support t l For 2-sets: u if we need support s for {i, j} u then each i, j must appear in at least s baskets
71
CS 245Notes11 71 Algorithm for 2-Sets (1) Find OK products u those appearing in s or more baskets (2) Find high-support pairs using only OK products
72
CS 245Notes11 72 Algorithm for 2-Sets l INSERT INTO okBaskets(basket, item) SELECT basket, item FROM Baskets GROUP BY item HAVING COUNT(basket) >= s;
73
CS 245Notes11 73 Algorithm for 2-Sets l INSERT INTO okBaskets(basket, item) SELECT basket, item FROM Baskets GROUP BY item HAVING COUNT(basket) >= s; l Perform mining on okBaskets SELECT I.item, J.item, COUNT(I.basket) FROM okBaskets I, okBaskets J WHERE I.basket = J.basket AND I.item = s;
74
CS 245Notes11 74 Counting Efficiently l One way: threshold = 3
75
CS 245Notes11 75 Counting Efficiently l One way: sort threshold = 3
76
CS 245Notes11 76 Counting Efficiently l One way: sort count & remove threshold = 3
77
CS 245Notes11 77 Counting Efficiently l Another way: threshold = 3
78
CS 245Notes11 78 Counting Efficiently l Another way: scan & count keep counter array in memory threshold = 3
79
CS 245Notes11 79 Counting Efficiently l Another way: remove scan & count keep counter array in memory threshold = 3
80
CS 245Notes11 80 Counting Efficiently l Another way: remove scan & count keep counter array in memory threshold = 3
81
CS 245Notes11 81 Yet Another Way (1) scan & hash & count in-memory hash table threshold = 3
82
CS 245Notes11 82 Yet Another Way (1) scan & hash & count in-memory hash table threshold = 3 (2) scan & remove
83
CS 245Notes11 83 Yet Another Way (1) scan & hash & count in-memory hash table threshold = 3 (2) scan & remove false positive
84
CS 245Notes11 84 Yet Another Way (1) scan & hash & count in-memory hash table threshold = 3 (2) scan & remove (3) scan& count in-memory counters false positive
85
CS 245Notes11 85 Yet Another Way (1) scan & hash & count in-memory hash table threshold = 3 (2) scan & remove (4) remove (3) scan& count in-memory counters false positive
86
CS 245Notes11 86 Discussion l Hashing scheme: 2 (or 3) scans of data l Sorting scheme: requires a sort! l Hashing works well if few high-support pairs and many low-support ones
87
CS 245Notes11 87 Discussion l Hashing scheme: 2 (or 3) scans of data l Sorting scheme: requires a sort! l Hashing works well if few high-support pairs and many low-support ones item-pairs ranked by frequency frequency threshold iceberg queries
88
Implementation Issues l ETL (Extraction, transformation, loading) u Getting data to the warehouse u Entity Resolution l What to materialize? l Efficient Analysis u Association rule mining u... CS 245Notes11 88
89
Extra: Data Mining in the InfoLab CS 245Notes11 89 quarters Recommendations in CourseRank userq1q2q3q4 u1a: 5b: 5d: 5 u2a: 1e: 2d: 4f: 3 u3g: 4h: 2e: 3f: 3 u4b: 2g: 4h: 4e: 4 ua: 5g: 4e: 4
90
Extra: Data Mining in the InfoLab CS 245Notes11 90 quarters Recommendations in CourseRank userq1q2q3q4 u1a: 5b: 5d: 5 u2a: 1e: 2d: 4f: 3 u3g: 4h: 2e: 3f: 3 u4b: 2g: 4h: 4e: 4 ua: 5g: 4e: 4 u3 and u4 are similar to u Recommend h
91
Extra: Data Mining in the InfoLab CS 245Notes11 91 quarters Recommendations in CourseRank userq1q2q3q4 u1a: 5b: 5d: 5 u2a: 1e: 2d: 4f: 3 u3g: 4h: 2e: 3f: 3 u4b: 2g: 4h: 4e: 4 ua: 5g: 4e: 4 Recommend d (and f, h)
92
Sequence Mining l Given a set of transcripts, use Pr[x|a] to predict if x is a good recommendation given user has taken a. l Two issues... CS 245Notes11 92
93
Pr[x|a] Not Quite Right CS 245Notes11 93 transcriptcontaining 1- 2a 3x 4a -> x 5x -> a Pr[x|a] = 2/3 Pr[x|a ~x ] = 1/2 target user’s transcript: [... a.... || unknown ] recommend x?
94
User Has Taken >= 1 Course l User has taken T= {a, b, c} l Need Pr[x|T ~x ] l Approximate as Pr[x|a ~x b ~x c ~x ] l Expensive to compute, so... CS 245Notes11 94
95
CourseRank User Study CS 245Notes11 95 good, expected good, unexpected percentage of ratings
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.