Bellwether Analysis: Predicting Global Aggregates from Local Regions. Raghu Ramakrishnan, Yahoo! Research and University of Wisconsin-Madison.


1 Bellwether Analysis: Predicting Global Aggregates from Local Regions
Raghu Ramakrishnan, Yahoo! Research and University of Wisconsin-Madison
Bee-Chung Chen, Jude Shavlik, Pradeep Tamma, University of Wisconsin-Madison

2 Motivating Example
Bee-Chung Chen, Raghu Ramakrishnan, Jude Shavlik, Pradeep Tamma, Bellwether Cubes, VLDB 2006
A company wants to predict the first-year worldwide profit of a new item (e.g., a new movie) by using its historical database
–By looking at the features and profits of previous (similar) movies, we want to predict the expected total profit (total US sales at the end of the release year) for the new movie
  Wait a year and write a query! If you can’t wait, read this paper
–The most predictive “features” may be based on sales data gathered by releasing the new movie in many “regions” (different locations over different time periods)
  Example “region-based” features: 1st week sales in Peoria, week-to-week sales growth in Wisconsin, etc.
  Gathering this data has a cost (e.g., marketing expenses, waiting time)
Problem statement: Find the most predictive region features that can be obtained within a given “cost budget”

3 Key Ideas
Large datasets are rarely labeled with the targets that we wish to learn to predict
–But for the tasks we address, we can readily use OLAP queries to generate features (e.g., 1st week sales in Peoria) and even targets (e.g., profit) for mining
We use data-mining models as building blocks in the mining process, rather than thinking of them as the end result
–The central problem is to find data subsets (“bellwether regions”) that lead to predictive features which can be gathered at low cost for a new case

4 Outline
Motivating example
Basic bellwether analysis
Subset bellwether analysis
–Bellwether trees
–Bellwether cubes
Experimental results
Conclusion

5 Motivating Example
A company wants to predict the first year’s worldwide profit for a new item, by using its historical database
Database Schema: The combination of the underlined attributes forms a key
[Figure: database schema]

6 A Straightforward Approach
Build a regression model to predict item profit
By joining and aggregating tables in the historical database we can create a training set (item-table features plus the target, Profit):
  ItemID  Category  R&D Expense  Profit
  1       Laptop    500K         12,000K
  2       Desktop   100K         8,000K
  …       …         …            …
An example regression model:
  $\mathrm{Profit} = \beta_0 + \beta_1\,\mathrm{Laptop} + \beta_2\,\mathrm{Desktop} + \beta_3\,\mathrm{RdExpense}$
There is much room for accuracy improvement!

7 Using Regional Features
Example region: [1st week, Korea]
Regional features:
–Regional Profit: The 1st week profit in Korea
–Regional Ad Expense: The 1st week ad expense in Korea
A possibly more accurate model:
  $\mathrm{Profit}_{[1yr,\,All]} = \beta_0 + \beta_1\,\mathrm{Laptop} + \beta_2\,\mathrm{Desktop} + \beta_3\,\mathrm{RdExpense} + \beta_4\,\mathrm{Profit}_{[1wk,\,KR]} + \beta_5\,\mathrm{AdExpense}_{[1wk,\,KR]}$
Problem: Which region should we use?
–The smallest region that improves the accuracy the most
–We give each candidate region a cost
–The most “cost-effective” region is the bellwether region

8 Bellwether Analysis Basic Bellwether Problem

9 Basic Bellwether Problem
Historical database: DB
Training item set: I
Candidate region set: R
–E.g., { [1-n week, Location] }
Target generation query: $\tau_i(DB)$ returns the target value of item $i \in I$
–E.g., sum(Profit) over the ProfitTable records of item i in region [1-52, All]
Feature generation query: $\phi_{i,r}(DB)$, $i \in I_r$ and $r \in R$
–$I_r$: The set of items in region r
–E.g., [ $\mathrm{Category}_i$, $\mathrm{RdExpense}_i$, $\mathrm{Profit}_{i,[1\text{-}n,\,Loc]}$, $\mathrm{AdExpense}_{i,[1\text{-}n,\,Loc]}$ ]
Cost query: $\kappa_r(DB)$, $r \in R$, the cost of collecting data from r
Predictive model: $h_r(x)$, $r \in R$, trained on $\{(\phi_{i,r}(DB), \tau_i(DB)) : i \in I_r\}$
–E.g., linear regression model
[Figure: Location domain hierarchy]
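
The three queries on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `PROFIT_TABLE`, its column layout, the helper names, and the cost model (total ad expense inside the region) are all hypothetical, and the location hierarchy is reduced to a single "All" level.

```python
# Hypothetical fact table rows: (item_id, week, location, profit, ad_expense).
PROFIT_TABLE = [
    (1, 1, "KR", 10.0, 2.0),
    (1, 2, "KR", 12.0, 1.0),
    (1, 1, "US", 30.0, 5.0),
    (1, 2, "US", 28.0, 4.0),
    (2, 1, "KR", 4.0, 1.0),
    (2, 1, "US", 9.0, 2.0),
    (2, 2, "US", 11.0, 2.0),
]

def in_region(week, loc, region):
    """A region is (last_week, location): weeks 1..last_week at that location.
    The location "All" matches every location (a one-level stand-in for the hierarchy)."""
    last_week, region_loc = region
    return week <= last_week and region_loc in ("All", loc)

def rows_for(item_id, region, table):
    """Rows of one item that fall inside the region."""
    return [r for r in table if r[0] == item_id and in_region(r[1], r[2], region)]

def target_query(item_id, table):
    """Target generation query: total profit of the item in [1-52, All]."""
    return sum(r[3] for r in rows_for(item_id, (52, "All"), table))

def feature_query(item_id, region, table):
    """Feature generation query: [regional profit, regional ad expense],
    or None when the item has no data in the region."""
    rows = rows_for(item_id, region, table)
    if not rows:
        return None
    return [sum(r[3] for r in rows), sum(r[4] for r in rows)]

def cost_query(region, table):
    """Cost query. Assumed cost model: total ad expense spent inside the region."""
    return sum(r[4] for r in table if in_region(r[1], r[2], region))
```

For example, `feature_query(1, (1, "KR"), PROFIT_TABLE)` aggregates only item 1's first-week Korean rows, while `target_query(1, PROFIT_TABLE)` aggregates all of item 1's rows.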

10 Basic Bellwether Problem
[Figure: grid of candidate regions, weeks 1 to 52 by location (KR, USA, …, WI, WY, …)]
Features $\phi_{i,r}(DB)$: aggregate over data records in region r = [1-2, USA]
  ItemID  Category  …  Profit[1-2,USA]  …
  …       …         …  …                …
  i       Desktop   …  45K              …
  …       …         …  …                …
Target $\tau_i(DB)$: Total Profit in [1-52, All]
  ItemID  Total Profit
  …       …
  i       2,000K
  …       …
For each region r, build a predictive model $h_r(x)$, and then choose the bellwether region r such that:
–Coverage(r), the fraction of all items in the region, $\ge$ minimum coverage support
–Cost(r, DB) $\le$ cost threshold
–Error($h_r$) is minimized
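
The selection rule on this slide can be sketched as a loop over candidate regions. This is a minimal sketch, not the authors' implementation: it assumes a single numeric regional feature per item, a one-variable least-squares model, and training RMSE as the error measure; the function names and the `regions` dictionary layout are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

def rmse(xs, ys):
    """Training RMSE of the fitted line (an assumed error measure)."""
    a, b = fit_line(xs, ys)
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs))

def find_bellwether(regions, targets, budget, min_coverage):
    """regions: {name: {"cost": float, "feature": {item_id: value}}}
    targets: {item_id: target value}
    Returns (best_region_name, error), or (None, inf) if no region is feasible."""
    best, best_err = None, math.inf
    for name, info in regions.items():
        items = [i for i in info["feature"] if i in targets]
        coverage = len(items) / len(targets)
        # Skip infeasible regions: too expensive or covering too few items.
        if info["cost"] > budget or coverage < min_coverage or len(items) < 2:
            continue
        err = rmse([info["feature"][i] for i in items], [targets[i] for i in items])
        if err < best_err:
            best, best_err = name, err
    return best, best_err

targets = {1: 100.0, 2: 200.0, 3: 300.0, 4: 400.0}
regions = {
    "[1-1, KR]": {"cost": 1.0, "feature": {1: 10.0, 2: 20.0, 3: 30.0, 4: 40.0}},
    "[1-1, WI]": {"cost": 1.0, "feature": {1: 3.0, 2: 1.0, 3: 4.0, 4: 1.0}},
    "[1-52, All]": {"cost": 100.0, "feature": {1: 100.0, 2: 200.0, 3: 300.0, 4: 400.0}},
}
best, err = find_bellwether(regions, targets, budget=2.0, min_coverage=0.5)
```

In this toy data the full region [1-52, All] predicts perfectly but exceeds the budget, so the search settles on the cheap region whose feature tracks the target.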

11 Experiment on a Mail Order Dataset
[Figure: Error-vs-Budget plot (RMSE: Root Mean Square Error); the bellwether region [1-8 month, MD] is marked]
Bel Err: The error of the bellwether region found using a given budget
Avg Err: The average error of all the cube regions with costs under a given budget
Smp Err: The error of a set of randomly sampled (non-cube) regions with costs under a given budget

12 Experiment on a Mail Order Dataset
[Figure: Uniqueness plot; the bellwether region [1-8 month, MD] is marked]
Y-axis: Fraction of regions that are as good as the bellwether region
–The fraction of regions that satisfy the constraints and have errors within the 99% confidence interval of the error of the bellwether region
We have 99% confidence that [1-8 month, MD] is a quite unique bellwether region

13 Basic Bellwether Computation
OLAP-style bellwether analysis
–Candidate regions: Regions in a data cube
–Queries: OLAP-style aggregate queries
  E.g., Sum(Profit) over a region
Efficient computation:
–Use iceberg cube techniques to prune infeasible regions (Beyer-Ramakrishnan, SIGMOD 99; Han-Pei-Dong-Wang, SIGMOD 01)
  Infeasible regions: Regions with cost > B or coverage < C
–Share computation by generating the features and target values for all the feasible regions together
  Exploit distributive and algebraic aggregate functions
  Simultaneously generating all the features and target values reduces DB scans and repeated aggregate computation
[Figure: grid of candidate regions, weeks 1 to 52 by location (KR, USA, …, WI, WY, …)]
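
The idea of sharing computation across regions can be illustrated with Sum, which is distributive: one scan builds the finest-grained cells, locations are rolled up by adding child totals, and every [1-n, Location] window comes from a running sum over weeks. The sketch below assumes a hypothetical `parent` map for the location hierarchy and a hypothetical row layout; it does not show iceberg-style pruning of infeasible regions.

```python
from collections import defaultdict
from itertools import accumulate

def regional_sums(rows, n_weeks, parent):
    """rows: iterable of (item_id, week, state, profit).
    parent: maps each location to its parent in the hierarchy (top level has none).
    Returns {(item_id, n, location): total profit in weeks 1..n at that location}."""
    # Single scan: finest-grained cells keyed by (item, week, state).
    base = defaultdict(float)
    for item, week, state, profit in rows:
        base[(item, week, state)] += profit

    # Roll up locations: add each base cell to itself and to all its ancestors.
    weekly = defaultdict(float)
    for (item, week, loc), total in base.items():
        while loc is not None:
            weekly[(item, week, loc)] += total
            loc = parent.get(loc)

    # Running sums over weeks yield every [1-n, location] region without rescanning.
    out = {}
    for item, loc in {(i, l) for (i, _, l) in weekly}:
        per_week = [weekly.get((item, w, loc), 0.0) for w in range(1, n_weeks + 1)]
        for n, running in enumerate(accumulate(per_week), start=1):
            out[(item, n, loc)] = running
    return out

parent = {"WI": "USA", "WY": "USA", "USA": "All", "KR": "All"}
rows = [(1, 1, "WI", 5.0), (1, 2, "WI", 7.0), (1, 1, "WY", 1.0), (1, 2, "KR", 2.0)]
sums = regional_sums(rows, n_weeks=2, parent=parent)
```

Here `sums[(1, 2, "USA")]` is assembled from the WI and WY cells rather than from a second pass over `rows`.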

14 Bellwether Analysis Subset Bellwether Problem

15 Subset-Based Bellwether Prediction
Motivation: Different subsets of items may have different bellwether regions
–E.g., the bellwether region for laptops may be different from the bellwether region for clothes
Two approaches:
–Bellwether Tree
–Bellwether Cube
Bellwether regions by Category (rows) and R&D Expenses (columns):
  Category            Low       Medium    High
  Software  OS        [1-3,CA]  [1-1,NY]  [1-2,CA]
            …         …         …         …
  Hardware  Laptop    [1-4,MD]  [1-1,NY]  [1-3,WI]
            …         …         …         …
  …         …         …         …         …

16 Bellwether Tree
How to build a bellwether tree
–Similar to regression tree construction
–Starting from the root node, recursively split the current leaf node using the “best split criterion”
  A split criterion partitions a set of items into disjoint subsets
  Pick the split that reduces the error the most
–Stop splitting when the number of items in the current leaf node falls under a threshold value
–Prune the tree to avoid overfitting
[Figure: example tree with numbered nodes]

17 Bellwether Tree
How to split a node
–Split criterion:
  Numeric split: $A_k \le \theta$
  Categorical split: $A_k$
  ($A_k$ is an item-table feature)
–Pick the best split criterion
  Best split: The split that reduces the error the most, i.e., the one with the largest gap between the total parent error and the total child error
  Total parent error: find the bellwether region for S; h is the bellwether model for S
  Total child error: find the bellwether region for each $S_p$; $h_p$ is the bellwether model for $S_p$
  (S is the set of items at the parent node, and $S_p$ is the set of items at the pth child node)
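
A split can be scored by comparing the bellwether error at the parent with the summed bellwether errors of the children. The sketch below is an illustration under assumptions, not the paper's exact criterion: errors are unweighted sums of squared residuals from a one-variable linear fit, only categorical splits are handled, and the item layout (`"x"` for regional features, `"y"` for the target) and all function names are hypothetical.

```python
def sse_line(xs, ys):
    """Sum of squared residuals of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def bellwether_error(items, regions):
    """Error of the best region's model on this item subset (basic bellwether search)."""
    return min(
        sse_line([it["x"][r] for it in items], [it["y"] for it in items])
        for r in regions
    )

def split_gain(items, attribute, regions):
    """Total parent error minus total child error for a categorical split."""
    children = {}
    for it in items:
        children.setdefault(it[attribute], []).append(it)
    child_err = sum(bellwether_error(c, regions) for c in children.values())
    return bellwether_error(items, regions) - child_err

def best_split(items, attributes, regions):
    """Attribute whose split reduces the error the most."""
    return max(attributes, key=lambda a: split_gain(items, a, regions))

# Toy items: laptops follow the KR feature, clothes follow the WI feature.
items = [
    {"category": "laptop", "rd": "low", "x": {"KR": 1, "WI": 5}, "y": 10},
    {"category": "laptop", "rd": "high", "x": {"KR": 2, "WI": 1}, "y": 20},
    {"category": "laptop", "rd": "low", "x": {"KR": 3, "WI": 4}, "y": 30},
    {"category": "clothes", "rd": "high", "x": {"KR": 4, "WI": 1}, "y": 5},
    {"category": "clothes", "rd": "low", "x": {"KR": 2, "WI": 2}, "y": 10},
    {"category": "clothes", "rd": "high", "x": {"KR": 5, "WI": 3}, "y": 15},
]
chosen = best_split(items, ["rd", "category"], ["KR", "WI"])
```

Splitting on `category` lets each child pick its own bellwether region, which is why it beats the `rd` split on this toy data.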

18 Problem of Naïve Tree Construction
A naïve bellwether tree construction algorithm will scan the dataset $n \times m$ times
–n is the number of nodes
–m is the number of candidate split criteria
For each node: Try all candidate split criteria to find the best one
–It needs to scan the dataset m times
Idea: Extending the RainForest framework [Gehrke et al., 98]
[Figure: example tree with numbered nodes]

19 Efficient Tree Construction
Idea: Extending the RainForest framework [Gehrke et al., 98]
–Build the tree level by level
–Scan the entire dataset once per level and keep small sufficient statistics in memory (size: $O(n \times s \times c)$)
  Sufficient statistics for a split criterion: $|S_p|$ and Error($h_p \mid S_p$), for p = 1 to # of children
–Split all the nodes at that level after the scan based on the sufficient statistics
–Further improved by a hybrid algorithm
[Figure: example tree built level by level, one dataset scan per level (1st to 4th scan)]

20 Bellwether Cube
The number in a cell is the error of the bellwether region for that subset of items
Drilldown (Category and subcategory by R&D Expenses):
  Category            Low             Medium          High
  Software  OS        [1-3,CA]: 0.05  [1-1,NY]: 0.03  [1-2,CA]: 0.02
            …         …               …               …
  Hardware  Laptop    NULL            [1-1,NY]: 0.02  [1-4,WI]: 0.03
            Desktop   [1-4,MD]: 0.17  [1-4,WA]: 0.01  NULL
            …         …               …               …
  …         …         …               …               …
Rollup (Category by R&D Expenses):
  Category  Low             Medium          High
  Software  [1-4,CA]: 0.10  [1-2,CA]: 0.05  [1-2,CA]: 0.03
  Hardware  [1-4,MD]: 0.08  [1-1,IL]: 0.03  [1-4,WI]: 0.05
  …         …               …               …

21 Problem of Naïve Cube Construction
A naïve bellwether cube construction algorithm will conduct a basic bellwether search for the subset of items in each cell
–A basic bellwether search involves building a model for each candidate region
For each cell: Build a model for each candidate region
[Figure: cube cells over Category (Any; Software, Hardware, …; OS, Laptop, …) and R&D Expenses (Any; Low, Medium, High)]

22 Efficient Cube Construction
Idea: Transform model construction into computation of distributive or algebraic aggregate functions
–Let $S_1, \dots, S_n$ partition S
  $S = S_1 \cup \dots \cup S_n$ and $S_i \cap S_j = \emptyset$
–Distributive function: $\alpha(S) = F(\{\alpha(S_1), \dots, \alpha(S_n)\})$
  E.g., $\mathrm{Count}(S) = \mathrm{Sum}(\{\mathrm{Count}(S_1), \dots, \mathrm{Count}(S_n)\})$
–Algebraic function: $\alpha(S) = F(\{G(S_1), \dots, G(S_n)\})$
  $G(S_i)$ returns a fixed-length vector of values
  E.g., $\mathrm{Avg}(S) = F(\{G(S_1), \dots, G(S_n)\})$
  –$G(S_i) = [\mathrm{Sum}(S_i), \mathrm{Count}(S_i)]$
  –$F(\{[a_1, b_1], \dots, [a_n, b_n]\}) = \mathrm{Sum}(\{a_i\}) / \mathrm{Sum}(\{b_i\})$
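
The Avg example on this slide translates directly into code: each partition contributes a fixed-length summary, and the summaries alone determine the result for the union. The function names `g` and `f` below simply mirror G and F from the slide.

```python
def g(values):
    """G(S_i): fixed-length summary [Sum(S_i), Count(S_i)] of one partition."""
    return (sum(values), len(values))

def f(summaries):
    """F: combine the per-partition summaries into Avg(S) for the union S."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

# Avg over S = S_1 ∪ S_2 without revisiting the raw values.
avg = f([g([1, 2, 3]), g([4, 5])])
```

Note that averaging the two partition averages (2.0 and 4.5) would give the wrong answer; carrying the counts is what makes Avg algebraic.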

23 Efficient Cube Construction
Build models for each finest-grained cell
For higher-level cells, use data cube computation techniques to compute the aggregate functions
–For each finest-grained cell: Build models to find the bellwether region
–For each higher-level cell: Compute aggregate functions to find the bellwether region
[Figure: cube cells over Category (Any; Software, Hardware, …; OS, Laptop, …) and R&D Expenses (Any; Low, Medium, High)]

24 Efficient Cube Construction
Classification models:
–Use the prediction cube [Chen et al., 05] execution framework
Regression models: (Weighted linear regression model; builds on work in Chen-Dong-Han-Wah-Wang, VLDB 02)
–Having the sum of squared error (SSE) for each candidate region is sufficient to find the bellwether region
–SSE(S) is an algebraic function, where S is a set of items
–$\mathrm{SSE}(S) = q(\{ g(S_k) : k = 1, \dots, n \})$, where
  $S_1, \dots, S_n$ partition S
  $g(S_k) = \langle Y_k^{\top} W_k Y_k,\; X_k^{\top} W_k X_k,\; X_k^{\top} W_k Y_k \rangle$
  $q(\{\langle A_k, B_k, C_k \rangle : k = 1, \dots, n\}) = \sum_k A_k - \left(\sum_k C_k\right)^{\top} \left(\sum_k B_k\right)^{-1} \left(\sum_k C_k\right)$
  $Y_k$ is the vector of target values for set $S_k$ of items
  $X_k$ is the matrix of features for set $S_k$ of items
  $W_k$ is the weight matrix for set $S_k$ of items
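
The claim that SSE is algebraic can be checked numerically: compute the three sufficient statistics per partition, add them up, and recover the SSE of the weighted least-squares fit on the union. The sketch below uses plain Python lists and assumes diagonal weight matrices (passed as a weight list `w`); the helper `solve` and the toy data are illustrative only.

```python
def g(X, y, w):
    """Sufficient statistics of one partition, with W = diag(w):
    (Y'WY, X'WX, X'WY)."""
    d = len(X[0])
    a = sum(wi * yi * yi for wi, yi in zip(w, y))
    b = [[sum(wi * row[j] * row[k] for wi, row in zip(w, X)) for k in range(d)]
         for j in range(d)]
    c = [sum(wi * row[j] * yi for wi, row, yi in zip(w, X, y)) for j in range(d)]
    return a, b, c

def solve(m, v):
    """Solve m z = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [list(m[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def q(stats):
    """SSE of the fit on the union: sum(A) - sum(C)' (sum(B))^-1 sum(C)."""
    d = len(stats[0][2])
    a = sum(s[0] for s in stats)
    b = [[sum(s[1][j][k] for s in stats) for k in range(d)] for j in range(d)]
    c = [sum(s[2][j] for s in stats) for j in range(d)]
    beta = solve(b, c)  # regression coefficients for the union
    return a - sum(cj * bj for cj, bj in zip(c, beta))

# Two partitions of four items; each feature row is [1, x] (intercept plus one feature).
X1, y1 = [[1.0, 1.0], [1.0, 2.0]], [1.0, 3.0]
X2, y2 = [[1.0, 3.0], [1.0, 4.0]], [2.0, 5.0]
ones = [1.0, 1.0]
sse_merged = q([g(X1, y1, ones), g(X2, y2, ones)])
sse_direct = q([g(X1 + X2, y1 + y2, ones + ones)])
```

Both routes give the same SSE (2.7 for this data, up to rounding), so a higher-level cell never needs the raw items of its children.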

25 Bellwether Analysis Experimental Results

26 Experimental Results: Summary
We have shown the existence of bellwether regions on a real mail-order dataset
We characterize the behavior of bellwether trees and bellwether cubes using synthetic datasets
We show our computation techniques improve efficiency by orders of magnitude
We show our computation techniques scale linearly in the size of the dataset

27 Characteristics of Bellwether Trees & Cubes
Dataset generation: Use a random tree to generate different bellwether regions for different subsets of items
Parameters:
–Noise
–Concept complexity: # of tree nodes
Result:
–Bellwether trees & cubes have better accuracy than basic bellwether search
–Increased noise leads to increased error
–Increased complexity leads to increased error
[Figure panels: 15 nodes; Noise level: 0.5]

28 Efficiency Comparison
[Figure panels: Naïve computation methods; Our computation techniques]

29 Scalability

30 Conclusion
Promising data mining paradigm:
–Using OLAP queries to generate features and even targets for mining
–Using data-mining models as building blocks in the mining process, rather than thinking of them as the end result
–Exploiting the nested structure of OLAP queries to achieve efficient computation

