2002/4/10 IDSL seminar
Estimating Business Targets
  Advisor: Dr. Hsu
  Graduate: Yung-Chu Lin
  Data Source: Datta et al., KDD01, pp
Abstract
  Propose a new solution to the classical econometric task of frontier analysis
  Combine nearest neighbor methods with classical statistical methods
  Identify under-marketed customers
  Benchmark regional directory divisions
Outline
  Motivation
  Objective
  Historical approaches
  Target estimation methodology
  Case study
  Conclusion
  Personal opinion
Motivation
  Setting targets is a critical task
  Traditionally, the target of each entity is set to the average across the entities
  Two challenges
    – The characteristics of the entities heavily influence the outcome
    – The problem is inherently unsupervised
Objective
  Provide a methodology for estimating maximal or minimal targets in an unsupervised setting
  Set revenue target expectations for individual customers
  Set revenue targets for regional yellow page directories
Historical Approaches
  Mathematical programming
  Economics
Mathematical Programming
  [Equation: formula giving the target for xi, the vector for the ith observation]
  Sensitive to errors and outliers, since it assumes that all observed targets define the possible space
Economics
  [Equation: model in terms of g with a non-negative error term]
  Requires a model for the error term and for g
Target Estimation Methodology
  Nearest neighbor vs. clustering
  The neighborhoods
  The distance function
  Target estimation from the neighborhoods
  A heuristic for comparing neighborhoods
Nearest Neighbor vs. Clustering
  Time complexity
    – Clustering is better than nearest neighbor
  Problem with clustering
    – Two similar entities can fall into different clusters
    – The higher the dimension, the more serious the effect
    – Nearest neighbor does not have this problem
The Neighborhoods
  xi: the ith observation
  yi: the variable containing its target value
  ni: the neighborhood for xi, a set of observations {xi, xj, …}
The Distance Function
  Continuous variables are standardized
  e.g.
    – Continuous: (2,1) (3,4)
    – Nominal: (a,b) (a,c)
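The distance formula itself is not preserved on this slide. The sketch below shows one common way to combine the two kinds of variables, and it is an assumption rather than the authors' definition: squared differences for continuous values (taken as already standardized) and a 0/1 mismatch for nominal values. The function name `mixed_distance` is hypothetical. It is applied to the two example pairs from the slide.

```python
import math


def mixed_distance(cont_a, cont_b, nom_a, nom_b):
    """Distance between two observations with continuous and nominal parts.

    Assumption (not from the slide): continuous values contribute squared
    differences, nominal values contribute 1 when they differ and 0 when
    they match, and everything is combined under one square root.
    """
    cont_part = sum((a - b) ** 2 for a, b in zip(cont_a, cont_b))
    nom_part = sum(1 for a, b in zip(nom_a, nom_b) if a != b)
    return math.sqrt(cont_part + nom_part)


# Continuous example from the slide: (2,1) vs (3,4)
d_cont = mixed_distance((2, 1), (3, 4), (), ())

# Nominal example from the slide: (a,b) vs (a,c), one mismatch
d_nom = mixed_distance((), (), ("a", "b"), ("a", "c"))
```

Under these assumptions the continuous pair is sqrt(1 + 9) apart and the nominal pair is 1 apart, because only the second attribute differs.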
Target Estimation From the Neighborhoods
  Let yi(1), yi(2), …, yi(k) be the order statistics of the target values in ni, so that yi(1) is the largest
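The estimator built from these order statistics is not preserved on the slide. As a minimal sketch, assuming the target is read off as a single order statistic of the neighborhood's values, the code below sorts the values so that rank 1 is the largest, as on the slide. The function name `estimate_target`, the `rank` parameter, and the revenue figures are hypothetical. Choosing a rank above 1 to soften the effect of one outlier is an assumption, not something the slide states.

```python
def estimate_target(neighborhood_targets, rank=1, maximal=True):
    """Return the rank-th order statistic of a neighborhood's target values.

    With maximal=True the values are ordered from largest to smallest, so
    rank=1 is the largest value. With maximal=False the order is reversed,
    so rank=1 is the smallest value.
    """
    ordered = sorted(neighborhood_targets, reverse=maximal)
    if not 1 <= rank <= len(ordered):
        raise ValueError("rank must be between 1 and the neighborhood size")
    return ordered[rank - 1]


# Hypothetical target values for one neighborhood of five observations
revenues = [120.0, 300.0, 180.0, 90.0, 250.0]

largest = estimate_target(revenues)                   # y(1)
second = estimate_target(revenues, rank=2)            # y(2)
smallest = estimate_target(revenues, maximal=False)   # minimal frontier
```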
A Heuristic for Comparing Neighborhoods
  Maximal frontier
    – E(xi) ranges from 0 to 1
  Minimal frontier
    – E(xi) >= 1
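The formula for E(xi) is not preserved on the slide. One definition that reproduces both stated ranges is the ratio of the observed value yi to the frontier of its neighborhood, provided the neighborhood contains xi itself and all values are positive. The sketch below assumes that ratio and uses the plain maximum or minimum as the frontier. The name `efficiency` and the numbers are hypothetical.

```python
def efficiency(y_i, neighborhood_targets, maximal=True):
    """Ratio of an observation's value to its neighborhood frontier.

    Assumptions: the neighborhood includes y_i itself and every value is
    positive. Then the ratio lies in (0, 1] for a maximal frontier and is
    at least 1 for a minimal frontier.
    """
    frontier = max(neighborhood_targets) if maximal else min(neighborhood_targets)
    return y_i / frontier


# Hypothetical neighborhood that includes the observation's own value, 120.0
neighborhood = [120.0, 300.0, 180.0, 90.0, 250.0]

e_max = efficiency(120.0, neighborhood)                 # 120 / 300
e_min = efficiency(120.0, neighborhood, maximal=False)  # 120 / 90
```

A low value of `e_max` marks an observation that falls well short of its most similar peers, which is how the ratio makes neighborhoods comparable with one another.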
Case Study
  Target revenues for directory book advertisers
  Target revenue for regional directories
(1) Target Revenues for Directory Book Advertisers
  Goal
    – Find businesses that have low spending relative to those with otherwise similar characteristics
  Three categories of data available
    – Advertiser: e.g. number of employees
    – Directory: e.g. distribution size
    – Market: e.g. median household income
Calculating Nearest Neighbors
  Standardize continuous data: natural log
  K=4
  Weight the variables equally
    – But decrease the weights for many of the directory and market variables
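The steps on this slide can be sketched as a brute-force search. This is an illustration under stated assumptions, not the authors' implementation: continuous values are transformed with the natural log, distances are weighted Euclidean, and each neighborhood holds the observation itself plus its K=4 nearest neighbors. The function name, the sample rows (employees, distribution size, median household income), and the weights are all hypothetical.

```python
import math


def build_neighborhoods(rows, weights, k=4):
    """Return, for each row index i, the list [i] + indices of its k nearest rows.

    Assumptions: all values are positive so the natural log applies, and
    distance is weighted Euclidean on the log-transformed values.
    """
    logged = [[math.log(v) for v in row] for row in rows]
    result = []
    for i, xi in enumerate(logged):
        dists = []
        for j, xj in enumerate(logged):
            if j == i:
                continue
            d = math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, xi, xj)))
            dists.append((d, j))
        dists.sort()  # nearest first; ties broken by index
        result.append([i] + [j for _, j in dists[:k]])
    return result


# Hypothetical advertisers: (employees, distribution size, median household income)
rows = [
    (10, 1000, 40000),
    (12, 1100, 42000),
    (11, 900, 41000),
    (200, 50000, 60000),
    (9, 1000, 39000),
    (13, 1200, 40000),
    (250, 60000, 65000),
]
# Advertiser variable at full weight, directory and market variables reduced
weights = (1.0, 0.5, 0.5)

neighborhoods = build_neighborhoods(rows, weights, k=4)
```

The brute-force loop compares every pair of rows, so it only suits small data sets. A spatial index would be the usual replacement at scale.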
[Figure: Distribution of E(x) for Advertisers]
A Decision Tree to Predict phi - xi
(2) Target Revenue for Regional Directories
  Goal
    – Benchmark regional directory divisions
  Separate the data into two sets
    – Training set: 80%
    – Test set: 20%
  K=4
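An 80/20 split like the one on this slide can be sketched as follows. The slide does not say how the split was drawn, so the random shuffle, the fixed seed, and the function name `split_train_test` are assumptions made for illustration.

```python
import random


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and split it into training and test lists.

    Assumption: a simple random split. A fixed seed keeps it reproducible.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical directory identifiers
train, test = split_train_test(range(10))
```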
Book Type
  System book
    – An entire serving area
  System-neighborhood book
    – A smaller number of geographic areas in the franchise area
  Neighborhood book
    – Areas outside of the telephone company’s franchise area
[Figure: Four Different Distributions, labeled according to the legend]
[Figure panels: Neighborhood books, System books, Non-system books]
  The x-axis shows log(distribution) and the y-axis shows E(x)
Conclusion
  Present a general data mining methodology for estimating business targets by frontier analysis
  First case
    – Increase sales focus on the under-marketed customers
    – Increase the potential revenue by several million
  Second case
    – Estimate optimal revenue performance targets for directory divisions
    – The potential increase for directory books is at least several million dollars
Personal Opinion
  Combining several existing methodologies or disciplines can produce a powerful new one