1 Data Warehousing Overview (CS 245, Notes 11). Hector Garcia-Molina, Stanford University

2 Outline
- What is a data warehouse?
- Why a warehouse?
- Models & operations
- Implementing a warehouse

3 What is a Warehouse?
- Collection of diverse data
  - subject oriented
  - aimed at executives, decision makers
  - often a copy of operational data
  - with value-added data (e.g., summaries, history)
  - integrated
  - time-varying
  - non-volatile
(more on the next slide)

4 What is a Warehouse?
- Collection of tools
  - gathering data
  - cleansing, integrating, ...
  - querying, reporting, analysis
  - data mining
  - monitoring, administering warehouse

5 Warehouse Architecture (figure: clients issue queries & analysis against the warehouse; an integration layer feeds it from the sources; metadata describes the contents)

6 Motivating Examples
- Forecasting
- Comparing performance of units
- Monitoring, detecting fraud
- Visualization

7 Alternative to Warehousing
- Two approaches:
  - Query-driven (lazy)
  - Warehouse (eager)
(figure: a query "?" posed over multiple sources)

8 Query-Driven Approach (figure: clients, a mediator, wrappers, and the underlying sources)

9 Advantages of Warehousing
- High query performance
- Queries not visible outside warehouse
- Local processing at sources unaffected
- Can operate when sources unavailable
- Can query data not stored in a DBMS
- Extra information at warehouse
  - Modify, summarize (store aggregates)
  - Add historical information

10 Advantages of Query-Driven
- No need to copy data
  - less storage
  - no need to purchase data
- More up-to-date data
- Query needs can be unknown
- Only query interface needed at sources
- May be less draining on sources

11 Warehouse Models & Operators
- Data models
  - relational
  - cubes
- Operators

12 Star (figure: an example fact table with its dimension tables)

13 Star Schema (figure)

14 Terms
- Fact table
- Dimension tables
- Measures
(a small in-memory sketch of these terms follows)
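To make the terms concrete, here is a minimal Python sketch of a star schema held in memory. Only SALE, prodId, date, and amt appear elsewhere in these notes; the product and store dimension tables, their column names, and the sample values are assumptions for illustration.

    # Hypothetical star schema: one fact table plus dimension tables.
    product = {                     # dimension table: prodId -> attributes (assumed)
        "p1": {"name": "bolt", "price": 10},
        "p2": {"name": "nut", "price": 5},
    }
    store = {                       # dimension table: storeId -> attributes (assumed)
        "s1": {"city": "Palo Alto", "region": "A"},
        "s2": {"city": "Berkeley", "region": "B"},
    }
    sale = [                        # fact table: foreign keys into the dimensions plus the measure amt
        {"prodId": "p1", "storeId": "s1", "date": 1, "amt": 12},
        {"prodId": "p2", "storeId": "s1", "date": 1, "amt": 11},
        {"prodId": "p1", "storeId": "s2", "date": 2, "amt": 50},
    ]

    # A measure is the numeric fact we aggregate, e.g. the total sales amount:
    total_amt = sum(row["amt"] for row in sale)
    print(total_amt)                # 73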

15 Dimension Hierarchies
(figure: the store dimension rolls up to sType and to city, and city rolls up to region)
- snowflake schema
- constellations

16 Cube (figure: the same data shown as a fact table view and as a multi-dimensional cube; dimensions = 2)

17 3-D Cube (figure: fact table view vs. multi-dimensional cube with dimensions = 3; one 2-D slice per day, day 1 and day 2)

18 Operators
- Traditional
  - selection
  - aggregation
  - ...
- Analysis
  - clean data
  - find trends
  - ...
(figure labels: Relational, Cube)

19 Aggregates
Add up amounts for day 1. In SQL:
    SELECT sum(amt)
    FROM SALE
    WHERE date = 1
(result: 81)

20 Aggregates
Add up amounts by day. In SQL:
    SELECT date, sum(amt)
    FROM SALE
    GROUP BY date

21 Another Example
Add up amounts by day, product. In SQL:
    SELECT date, prodId, sum(amt)
    FROM SALE
    GROUP BY date, prodId
(relative to the previous query this is a drill-down; going back is a rollup)

22 Aggregates
- Operators: sum, count, max, min, median, ave
- "Having" clause
- Using dimension hierarchy (sketch below)
  - average by region (within store)
  - maximum by month (within date)
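As an illustration of aggregating up a dimension hierarchy, the following minimal Python sketch computes "average by region (within store)". The store -> region mapping and the sample rows are assumptions, not data from the notes.

    from collections import defaultdict

    store_region = {"s1": "A", "s2": "A", "s3": "B"}         # assumed store -> region hierarchy
    sale = [("s1", 12), ("s1", 11), ("s2", 50), ("s3", 8)]   # assumed (storeId, amt) facts

    totals, counts = defaultdict(float), defaultdict(int)
    for store_id, amt in sale:
        region = store_region[store_id]   # climb the hierarchy: store -> region
        totals[region] += amt
        counts[region] += 1

    avg_by_region = {r: totals[r] / counts[r] for r in totals}
    print(avg_by_region)                  # {'A': 24.33..., 'B': 8.0}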

23 Cube Aggregation (figure: example of computing sums over the day 1 / day 2 slices; marginal totals such as 129 are rolled-up sums; moving to finer totals is drill-down, to coarser totals is rollup)

24 Cube Operators (figure: aggregate cells written as sale(c1,*,*), sale(c2,p2,*), sale(*,*,*), where * means the dimension has been summed over)

25 Extended Cube (figure: the cube extended with * entries along every dimension; e.g., the plane sale(*,p2,*) holds totals for product p2 summed over customers and dates)
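The following Python sketch computes an extended cube in the spirit of slides 23-25: for every combination of dimensions it aggregates the sum, writing * for the rolled-up dimensions. The tiny fact table is an assumption for illustration.

    from itertools import combinations
    from collections import defaultdict

    # Assumed fact cells: (customer, product, date) -> amt
    facts = {
        ("c1", "p1", 1): 12, ("c1", "p2", 1): 11,
        ("c2", "p1", 1): 50, ("c2", "p2", 2): 8,
    }
    ndims = 3

    cube = defaultdict(float)
    for key, amt in facts.items():
        # For every subset of dimensions, replace the other positions by '*' and add amt.
        for k in range(ndims + 1):
            for kept in combinations(range(ndims), k):
                cell = tuple(key[i] if i in kept else "*" for i in range(ndims))
                cube[cell] += amt

    print(cube[("c1", "*", "*")])   # 23.0  -> sale(c1,*,*)
    print(cube[("*", "p2", "*")])   # 19.0  -> sale(*,p2,*)
    print(cube[("*", "*", "*")])    # 81.0  -> sale(*,*,*)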

26 Aggregation Using Hierarchies (figure: the customer dimension rolled up along customer -> region -> country; here customer c1 is in region A and customers c2, c3 are in region B)

27 Data Analysis
- Decision trees
- Clustering
- Association rules

28 Decision Trees
Example: a survey was conducted to see which customers were interested in a new car model; we want to select customers for an advertising campaign. The survey results form the training set (figure: training-set table).

29 One Possibility (decision-tree figure: the root tests age<30, with further tests city=sf and car=van below it; leaves are labeled likely / unlikely, reached via Y/N branches)

30 Another Possibility (decision-tree figure: the root tests car=taurus, with further tests city=sf and age<45 below it; leaves are labeled likely / unlikely)
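A decision tree like the two sketched above is just a set of nested tests. The Python function below is one possible reading of the slide-29 tree (age<30 at the root, then city=sf on one branch and car=van on the other); the branch layout is my interpretation of the figure, so treat it as illustrative.

    def likely_buyer(customer):
        """Hypothetical reading of the slide-29 tree: True means 'likely'."""
        if customer["age"] < 30:
            # Younger customers: the deciding test is the city.
            return customer["city"] == "sf"
        # Older customers: the deciding test is the car they drive.
        return customer["car"] == "van"

    print(likely_buyer({"age": 25, "city": "sf", "car": "taurus"}))  # True
    print(likely_buyer({"age": 40, "city": "la", "car": "van"}))     # True
    print(likely_buyer({"age": 40, "city": "sf", "car": "taurus"}))  # False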

31 Issues
- Decision tree cannot be "too deep"
  -> otherwise there would not be statistically significant amounts of data for the lower decisions
- Need to select the tree that most reliably predicts outcomes

32 Clustering (figure: data points in a space with axes age, income, education)

33 Clustering (same figure, with groups of nearby points circled as clusters)
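When the desired number of clusters is given, a standard way to find groups of nearby points is k-means. The sketch below is a bare-bones k-means on made-up (age, income, education) points; it is illustrative and not the notes' algorithm.

    import random

    def kmeans(points, k, iters=20):
        """Tiny k-means over equal-length numeric tuples."""
        centers = random.sample(points, k)
        for _ in range(iters):
            # Assign each point to its nearest center (squared Euclidean distance).
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k),
                        key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                clusters[i].append(p)
            # Recompute each center as the mean of its cluster (keep the old center if empty).
            centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        return centers, clusters

    random.seed(0)
    pts = [(25, 40, 16), (27, 45, 16), (60, 90, 12), (62, 95, 14), (40, 60, 18)]  # assumed data
    centers, clusters = kmeans(pts, k=2)
    print(centers)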

34 Another Example: Text
- Each document is a vector
  - e.g., contains words 1, 4, 5, ...
- Clusters contain "similar" documents
- Useful for understanding, searching documents
(figure: document clusters labeled international news, sports, business)

35 Issues
- Given desired number of clusters?
- Finding "best" clusters
- Are clusters semantically meaningful?
  - e.g., a "yuppies" cluster?
- Using clusters for disk storage

36 Association Rule Mining
Sales records (market-basket data): each record lists a transaction id, a customer id, and the products bought.
- Trend: products p5, p8 are often bought together
- Trend: customer 12 likes product p9

37 Association Rule
- Rule: {p1, p3, p8}
- Support: number of baskets where these products appear
- High-support set: support >= threshold s
- Problem: find all high-support sets (sketch below)
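Here is a small Python sketch of computing the support of an itemset over market-basket data and then listing the high-support 2-sets. The itemset {p1, p3, p8} is from the slide; the baskets and the threshold are assumptions.

    from itertools import combinations

    # Assumed market-basket data: each basket is the set of products in one transaction.
    baskets = [
        {"p1", "p3", "p8"}, {"p1", "p3", "p8", "p9"},
        {"p1", "p3"}, {"p3", "p8"}, {"p5", "p8"},
    ]

    def support(itemset):
        """Number of baskets that contain every product in the itemset."""
        return sum(1 for b in baskets if itemset <= b)

    print(support({"p1", "p3", "p8"}))   # 2

    s = 2                                # assumed threshold
    items = sorted(set().union(*baskets))
    high = [pair for pair in combinations(items, 2) if support(set(pair)) >= s]
    print(high)                          # [('p1', 'p3'), ('p1', 'p8'), ('p3', 'p8')]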

38 Implementation Issues
- ETL (extraction, transformation, loading)
  - Getting data to the warehouse
  - Entity resolution
- What to materialize?
- Efficient analysis
  - Association rule mining
  - ...

39 ETL: Monitoring Techniques
- Periodic snapshots
- Database triggers
- Log shipping
- Data shipping (replication service)
- Transaction shipping
- Polling (queries to source)
- Screen scraping
- Application-level monitoring
-> Advantages & disadvantages!!

40 ETL: Data Cleaning
- Migration (e.g., yen -> dollars)
- Scrubbing: use domain-specific knowledge (e.g., social security numbers)
- Fusion (e.g., mail list, customer merging)
- Auditing: discover rules & relationships (like data mining)
(figure: customer1 (Joe) from the billing DB and customer2 (Joe) from the service DB become merged_customer (Joe))

41 More Details: Entity Resolution (figure: record e1 with N: a, A: b, CC#: c, Ph: e; record e2 with N: a, Exp: d, Ph: e)

42 Applications
- comparison shopping
- mailing lists
- classified ads
- customer files
- counter-terrorism
(same two-record figure as the previous slide)

43 Why is ER Challenging?
- Huge data sets
- No unique identifiers
- Lots of uncertainty
- Many ways to skin the cat

44 Taxonomy: Pairwise vs Global
- Decide if r, s match only by looking at r, s?
- Or need to consider more (all) records?
(figure: the record [Nm: Pat Smith, Ad: 123 Main St, Ph: (650) 555-1212] could match [Nm: Patrick Smith, Ad: 132 Main St, Ph: (650) 555-1212] or [Nm: Patricia Smith, Ad: 123 Main St, Ph: (650) 777-1111])
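As a concrete (and much simplified) pairwise matcher for records like the ones in this figure, the Python sketch below compares name, address, and phone and declares a match above a score threshold. The field weights and the threshold are assumptions, not the notes' method.

    from difflib import SequenceMatcher

    def sim(a, b):
        """Rough string similarity in [0, 1]."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def pairwise_match(r, s, threshold=0.75):
        """Decide whether r and s refer to the same person, looking only at r and s."""
        score = (0.5 * sim(r["Nm"], s["Nm"]) +
                 0.3 * sim(r["Ad"], s["Ad"]) +
                 0.2 * (1.0 if r["Ph"] == s["Ph"] else 0.0))
        return score >= threshold

    r  = {"Nm": "Pat Smith",      "Ad": "123 Main St", "Ph": "(650) 555-1212"}
    s1 = {"Nm": "Patrick Smith",  "Ad": "132 Main St", "Ph": "(650) 555-1212"}
    s2 = {"Nm": "Patricia Smith", "Ad": "123 Main St", "Ph": "(650) 777-1111"}
    # With these assumed weights the first pair matches, the second does not.
    print(pairwise_match(r, s1), pairwise_match(r, s2))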

45 Taxonomy: Pairwise vs Global
- Global matching complicates things a lot!
  - e.g., may change a decision as new records arrive
(same three-record figure as the previous slide)

46 Taxonomy: Outcome
- Partition of records
  - e.g., comparison shopping
- Merged records
(figure: [Nm: Pat Smith, Ad: 123 Main St, Ph: (650) 555-1212] and [Nm: Patricia Smith, Ad: 132 Main St, Ph: (650) 777-1111, Hair: Black] merge into [Nm: Patricia Smith, Ad: 123 Main St, Ph: (650) 555-1212 and (650) 777-1111, Hair: Black])

47 Taxonomy: Outcome
- Iterate after merging
(figure: merging [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM] with [Nm: Thomas, Ad: 123 Maim, Oc: lawyer] gives [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer], which then matches [Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K] and merges into [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K])

48 Taxonomy: Record Reuse
- One record related to multiple entities?
(figure: records [Nm: Pat Smith Sr., Ph: (650) 555-1212, Ad: 123 Main St] and [Nm: Pat Smith Jr., Ph: (650) 555-1212]; after resolution, Ad: 123 Main St is attached to both the Sr. and Jr. records)

49 Taxonomy: Record Reuse
(figure: partitions vs merges over records r, s, t; with partitions each record belongs to one group, while with record reuse s can contribute to both merged records rs and st)

50 Taxonomy: Record Reuse
(same partitions-vs-merges figure)
- Record reuse is complex and expensive!

51 Taxonomy: Multiple Entity Types (figure: person 1 and person 2 are brothers; each is a member of an organization; Organization A and Organization B have a business relationship)

52 Taxonomy: Multiple Entity Types (figure: a bipartite graph of papers p1, p2, p5, p7 and authors a1-a5, asking whether two of the author nodes are the same)

53 Taxonomy: Exact vs Approximate (figure: ER applied per product category, turning cameras, CDs, books, ... into resolved cameras, resolved CDs, resolved books, ...)

54 Taxonomy: Exact vs Approximate (figure: approximate matching against a terrorist watch list sorted by age; a record for B Cooper, age 30, is matched against entries with ages 25-35)

55 Implementation Issues
- ETL (extraction, transformation, loading)
  - Getting data to the warehouse
  - Entity resolution
- What to materialize?
- Efficient analysis
  - Association rule mining
  - ...

56 What to Materialize?
- Store in the warehouse results useful for common queries
- Example (figure): materialize total sales computed from the day 1 / day 2 slices

57 Materialization Factors
- Type/frequency of queries
- Query response time
- Storage cost
- Update cost

58 Cube Aggregates Lattice
(figure: the lattice of group-bys
  {city, product, date}
  {city, product}   {city, date}   {product, date}
  {city}   {product}   {date}
  all)
- Use a greedy algorithm to decide what to materialize (sketch below)
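The notes say to use a greedy algorithm to pick the lattice views to materialize. Below is a minimal Python sketch in the spirit of the classic greedy view-selection idea: repeatedly materialize the view that most reduces the total cost of answering every view from its cheapest materialized ancestor. The view sizes are made-up numbers and the cost model is a simplified assumption.

    # Cube lattice: view -> estimated size in rows (sizes are illustrative assumptions).
    size = {
        ("city", "product", "date"): 6000,   # raw fact table, always available
        ("city", "product"): 600, ("city", "date"): 1200, ("product", "date"): 800,
        ("city",): 50, ("product",): 200, ("date",): 365,
        (): 1,                               # the 'all' view
    }

    def derivable(v, w):
        """View w can be answered from view v iff w's dimensions are a subset of v's."""
        return set(w) <= set(v)

    def total_cost(materialized):
        """Cost of answering every view from its cheapest materialized ancestor."""
        return sum(min(size[m] for m in materialized if derivable(m, v)) for v in size)

    def greedy(k):
        chosen = {("city", "product", "date")}
        for _ in range(k):
            best = max((v for v in size if v not in chosen),
                       key=lambda v: total_cost(chosen) - total_cost(chosen | {v}))
            chosen.add(best)
        return chosen

    print(greedy(2))   # the two extra views the greedy heuristic picks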

59 Dimension Hierarchies (figure: city rolls up to state, which rolls up to all)

60 Dimension Hierarchies
(figure: the cube lattice combined with the city -> state hierarchy, adding nodes such as {state, product, date}, {state, date}, {state, product}, {state}; not all arcs shown ...)

61 Interesting Hierarchy
(figure: a time hierarchy, represented by a conceptual dimension table; days roll up to weeks, and also to months -> quarters -> years, all of which roll up to all)

62 Implementation Issues
- ETL (extraction, transformation, loading)
  - Getting data to the warehouse
  - Entity resolution
- What to materialize?
- Efficient analysis
  - Association rule mining
  - ...

63 Finding High-Support Pairs
- Baskets(basket, item)
- In SQL:
    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket
      AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s;

64 Finding High-Support Pairs
- Baskets(basket, item)
- In SQL:
    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket
      AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s;
- WHY?

65 Example (figure: a sample Baskets table and the pair counts produced by the query)

66 Example (same figure; check whether each count >= s)

67 Issues
- Performance for size-2 rules: the self-join is big
- Performance for size-k rules: even bigger!

68 Association Rules
- How do we perform rule mining efficiently?

69 Association Rules
- How do we perform rule mining efficiently?
- Observation: if set X has support t, then each subset of X must have support at least t

70 Association Rules
- How do we perform rule mining efficiently?
- Observation: if set X has support t, then each subset of X must have support at least t
- For 2-sets:
  - if we need support s for {i, j}
  - then each of i and j must appear in at least s baskets

71 Algorithm for 2-Sets
(1) Find OK products: those appearing in s or more baskets
(2) Find high-support pairs using only OK products
(a Python sketch of these two passes follows)
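A direct Python rendering of these two passes (count single items first, then count pairs only over the surviving items) might look like this; the baskets and the threshold are assumptions.

    from collections import Counter
    from itertools import combinations

    baskets = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"], ["a", "b", "c"]]
    s = 2   # assumed support threshold

    # (1) Find OK products: those appearing in s or more baskets.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    ok = {item for item, c in item_counts.items() if c >= s}

    # (2) Count pairs using only OK products.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & ok)
        pair_counts.update(combinations(kept, 2))

    high_support_pairs = {pair: c for pair, c in pair_counts.items() if c >= s}
    print(high_support_pairs)   # {('a', 'b'): 2, ('a', 'c'): 3, ('b', 'c'): 3}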

72 Algorithm for 2-Sets
- In SQL:
    INSERT INTO okBaskets(basket, item)
    SELECT basket, item
    FROM Baskets
    WHERE item IN (SELECT item
                   FROM Baskets
                   GROUP BY item
                   HAVING COUNT(basket) >= s);

73 Algorithm for 2-Sets
- In SQL:
    INSERT INTO okBaskets(basket, item)
    SELECT basket, item
    FROM Baskets
    WHERE item IN (SELECT item
                   FROM Baskets
                   GROUP BY item
                   HAVING COUNT(basket) >= s);
- Perform mining on okBaskets:
    SELECT I.item, J.item, COUNT(I.basket)
    FROM okBaskets I, okBaskets J
    WHERE I.basket = J.basket
      AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s;

74 Counting Efficiently
- One way (figure: a list of item occurrences; threshold = 3)

75 Counting Efficiently
- One way: sort the occurrences (threshold = 3)

76 Counting Efficiently
- One way: sort, then count & remove items below the threshold (threshold = 3)

77 Counting Efficiently
- Another way (threshold = 3)

78 Counting Efficiently
- Another way: scan & count, keeping a counter array in memory (threshold = 3)

79 Counting Efficiently
- Another way: scan & count with an in-memory counter array, then remove items below the threshold (threshold = 3)


81 Yet Another Way
(1) scan & hash & count, using an in-memory hash table of counters (threshold = 3)

82 Yet Another Way
(1) scan & hash & count (in-memory hash table)
(2) scan & remove

83 Yet Another Way
(1) scan & hash & count (in-memory hash table)
(2) scan & remove; hashing collisions can leave false positives

84 Yet Another Way
(1) scan & hash & count (in-memory hash table)
(2) scan & remove (may keep false positives)
(3) scan & count with in-memory counters

85 Yet Another Way
(1) scan & hash & count (in-memory hash table)
(2) scan & remove (may keep false positives)
(3) scan & count with in-memory counters
(4) remove items below the threshold
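The "yet another way" slides describe a hash-based filter before exact counting (the same idea used in PCY-style algorithms). Here is a minimal Python sketch of that pipeline for counting items: bucket counts in pass 1, drop items whose bucket never reached the threshold in pass 2 (collisions can let false positives survive), exact counts and a final removal in pass 3. The data, the number of buckets, and the hash function are assumptions.

    from collections import Counter

    stream = ["a", "b", "a", "c", "a", "d", "b", "a", "e", "b"]   # assumed item occurrences
    threshold = 3
    NBUCKETS = 4          # deliberately small so collisions (false positives) can occur

    def bucket(item):
        return hash(item) % NBUCKETS

    # (1) scan & hash & count: only the bucket counters are kept in memory.
    bucket_counts = Counter(bucket(item) for item in stream)

    # (2) scan & remove: keep items whose bucket reached the threshold.
    candidates = [item for item in stream if bucket_counts[bucket(item)] >= threshold]

    # (3) scan & count the surviving candidates exactly, then (4) remove those below the threshold.
    exact = Counter(candidates)
    frequent = {item: c for item, c in exact.items() if c >= threshold}
    print(frequent)       # {'a': 4, 'b': 3} for this stream, regardless of collisions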

86 Discussion
- Hashing scheme: 2 (or 3) scans of the data
- Sorting scheme: requires a sort!
- Hashing works well if there are few high-support pairs and many low-support ones

87 Discussion
- Hashing scheme: 2 (or 3) scans of the data
- Sorting scheme: requires a sort!
- Hashing works well if there are few high-support pairs and many low-support ones
(figure: item pairs ranked by frequency against the frequency threshold; such threshold queries are called iceberg queries)

88 Implementation Issues
- ETL (extraction, transformation, loading)
  - Getting data to the warehouse
  - Entity resolution
- What to materialize?
- Efficient analysis
  - Association rule mining
  - ...

89 Extra: Data Mining in the InfoLab
Recommendations in CourseRank. Grades per user across quarters q1-q4:
- u1: a: 5, b: 5, d: 5
- u2: a: 1, e: 2, d: 4, f: 3
- u3: g: 4, h: 2, e: 3, f: 3
- u4: b: 2, g: 4, h: 4, e: 4
- u (target user): a: 5, g: 4, e: 4

90 Extra: Data Mining in the InfoLab
(same CourseRank table) u3 and u4 are similar to u, so recommend h.

91 Extra: Data Mining in the InfoLab
(same CourseRank table) Recommend d (and f, h).
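One simple way to turn "u3 and u4 are similar to u" into code is user-user collaborative filtering: score each course the target user has not taken by the grades that similar users gave it. The sketch below uses cosine similarity over the table on these slides (quarters ignored); the similarity measure and scoring rule are assumptions, not CourseRank's actual algorithm.

    from math import sqrt

    users = {                                   # grades from the slide, course -> grade
        "u1": {"a": 5, "b": 5, "d": 5},
        "u2": {"a": 1, "e": 2, "d": 4, "f": 3},
        "u3": {"g": 4, "h": 2, "e": 3, "f": 3},
        "u4": {"b": 2, "g": 4, "h": 4, "e": 4},
    }
    target = {"a": 5, "g": 4, "e": 4}           # user u

    def cosine(x, y):
        common = set(x) & set(y)
        if not common:
            return 0.0
        num = sum(x[c] * y[c] for c in common)
        return num / (sqrt(sum(v * v for v in x.values())) *
                      sqrt(sum(v * v for v in y.values())))

    sims = {uid: cosine(target, g) for uid, g in users.items()}
    scores = {}
    for uid, grades in users.items():
        for course, grade in grades.items():
            if course not in target:
                scores[course] = scores.get(course, 0.0) + sims[uid] * grade

    print(max(sims, key=sims.get))      # most similar user
    print(max(scores, key=scores.get))  # top recommendation

With these assumed weights the most similar users come out as u3 and u4 and the top-scoring course is h, consistent with slide 90.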

92 Sequence Mining
- Given a set of transcripts, use Pr[x|a] to predict whether x is a good recommendation given that the user has taken a.
- Two issues ...

93 Pr[x|a] Not Quite Right
- transcript 1 contains: - (neither a nor x)
- transcript 2 contains: a
- transcript 3 contains: x
- transcript 4 contains: a -> x (a, then x)
- transcript 5 contains: x -> a (x, then a)
Pr[x|a] = 2/3 (of transcripts 2, 4, 5 containing a, transcripts 4 and 5 also contain x), but Pr[x|a_~x] = 1/2 (of transcripts 2 and 4, which take a before ever taking x, only transcript 4 goes on to take x).
Target user's transcript: [ ... a ... || unknown ]; recommend x?
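Written out (the set notation is mine, not the notes'): a transcript t counts toward the corrected estimate only if it takes a before ever taking x.

    \Pr[x \mid a] =
      \frac{|\{\, t : a \in t \ \text{and}\ x \in t \,\}|}{|\{\, t : a \in t \,\}|} = \frac{2}{3},
    \qquad
    \Pr[x \mid a_{\sim x}] =
      \frac{|\{\, t : t \ \text{takes}\ a \ \text{and later takes}\ x \,\}|}
           {|\{\, t : t \ \text{takes}\ a \ \text{without having taken}\ x \,\}|} = \frac{1}{2}.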

94 User Has Taken >= 1 Course
- User has taken T = {a, b, c}
- Need Pr[x | T_~x]
- Approximate as Pr[x | a_~x, b_~x, c_~x]
- Expensive to compute, so ...

95 CourseRank User Study (figure: bar chart of the percentage of ratings that were "good, expected" vs "good, unexpected")

