Predictive Modeling in Data Management
Byung S. Lee
Computer Science, University of Vermont

Cost UDF Overview
Funding: US Department of Energy.
Title: Generating Cost Functions of User-Defined Functions.
Phase 1: preliminary studies.
Phase 2: core modeling techniques.
Phase 3: applications.

Cost UDF Problem
How long would this UDF take to run?

Phase 1
Approaches:
– Off-line training with cost data sets generated in the same batch.
– On-line training with cost data sets generated in incremental batches (a.k.a. self-tuning).
Models:
– Parametric or nonparametric regression.

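As a rough illustration of the two training approaches above, the sketch below fits a one-variable linear cost model (a simple parametric regression) from running sums, so the same model can be trained off-line in one batch or on-line in incremental batches. The class name `LinearCostModel`, the single input feature, and the sample observations are assumptions made for this example, not details from the project.

```python
class LinearCostModel:
    """Least-squares fit of cost = a + b * x, kept as running sums."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, batch):
        """Fold a batch of (x, cost) observations into the running sums."""
        for x, y in batch:
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        """Return (a, b); needs at least two distinct x values."""
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two distinct x values")
        b = (self.n * self.sxy - self.sx * self.sy) / denom
        a = (self.sy - b * self.sx) / self.n
        return a, b

    def predict(self, x):
        a, b = self.coefficients()
        return a + b * x


# Hypothetical cost observations: (input size, measured cost).
observations = [(x, 2.0 + 0.5 * x) for x in range(1, 9)]

# Off-line training: the whole data set arrives as a single batch.
offline = LinearCostModel()
offline.update(observations)

# On-line (self-tuning) training: the data arrives in incremental batches.
online = LinearCostModel()
for i in range(0, len(observations), 2):
    online.update(observations[i:i + 2])
```

Because the model keeps only running sums, both training schedules should arrive at the same coefficients; the on-line variant simply never needs the full data set in memory at once.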
Phase 1
UDFs:
– Financial time series aggregate functions:
    median(time series, start date, end date)
    nth moving window average(time series, start date, end date, window size)
– Keyword-based text search functions:
    “dog AND cat”
    “dog OR cat”
    “dog cat” within w words of each other.
– Spatial search operators:
    range(ref_point, distance)
    window(lower_left_point, upper_right_point)
    KNN(ref_point, K)

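To make the financial time series UDFs above concrete, here is a minimal sketch of a median and a moving window average over a date range, plus a timing wrapper that produces one cost data point per call. The function names, the (date, value) list representation, and the reading of the moving average as "one average per full window" are assumptions for illustration; the slides do not specify the actual implementations.

```python
import time
from datetime import date, timedelta
from statistics import median


def in_range(series, start, end):
    """Values of (date, value) pairs whose date lies in [start, end]."""
    return [v for d, v in series if start <= d <= end]


def median_udf(series, start, end):
    """Median of the values dated between start and end, inclusive."""
    return median(in_range(series, start, end))


def moving_average_udf(series, start, end, window):
    """Average of every full run of `window` consecutive values in range."""
    values = in_range(series, start, end)
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def measure_cost(udf, *args):
    """Run the UDF once and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = udf(*args)
    return result, time.perf_counter() - t0


# Hypothetical daily series: values 1.0 .. 10.0 starting on 2000-01-01.
day0 = date(2000, 1, 1)
series = [(day0 + timedelta(days=i), float(i + 1)) for i in range(10)]
day4 = day0 + timedelta(days=4)

med, med_cost = measure_cost(median_udf, series, day0, day4)
avgs, avg_cost = measure_cost(moving_average_udf, series, day0, day4, 3)
```

Each `(arguments, elapsed seconds)` pair gathered this way is the kind of cost data point a cost model could be trained on; which arguments serve as model inputs is a design choice left open here.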
Phase 2
Approaches:
– On-line training with cost data points generated one at a time.
– Assume limited main memory.
Models:
– Nonparametric techniques using multidimensional index structures.

Phase 2
Core modeling techniques:
– Incremental edited k-nearest neighbors.
– Memory-limited quadtrees.
Dr. Zhen He will give a quick overview of the recent Phase 2 efforts.

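The idea of a quadtree-based cost model under a memory limit can be sketched as follows. This is not the memory-limited quadtree technique from the project, whose details are not given in these slides; it is a simplified region quadtree that stores a running average cost per block, takes one data point at a time, and simply stops refining once a node budget is used up. The names `BoundedQuadtree`, `max_nodes`, `max_depth`, and `split_after`, and the restriction of inputs to the unit square, are all assumptions of this example.

```python
class Node:
    """One quadtree block: running count and sum of observed costs."""
    __slots__ = ("count", "total", "children")

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.children = {}  # quadrant index 0..3 -> Node


class BoundedQuadtree:
    """Average-cost quadtree over [0, 1) x [0, 1) with a node budget."""

    def __init__(self, max_nodes=64, max_depth=6, split_after=2):
        self.root = Node()
        self.nodes = 1
        self.max_nodes = max_nodes
        self.max_depth = max_depth
        self.split_after = split_after

    @staticmethod
    def _step(x, y, x0, y0, size):
        """Return (quadrant, x0, y0, size) of the child block holding (x, y)."""
        half = size / 2
        q = (1 if x >= x0 + half else 0) | (2 if y >= y0 + half else 0)
        return q, x0 + half * (q & 1), y0 + half * (q >> 1), half

    def insert(self, x, y, cost):
        """Fold one observation into every block on its root-to-leaf path."""
        node, x0, y0, size = self.root, 0.0, 0.0, 1.0
        for depth in range(self.max_depth + 1):
            node.count += 1
            node.total += cost
            if depth == self.max_depth:
                break
            q, nx0, ny0, nsize = self._step(x, y, x0, y0, size)
            child = node.children.get(q)
            if child is None:
                # Refine only well-populated blocks, and only within budget.
                if node.count <= self.split_after or self.nodes >= self.max_nodes:
                    break
                child = node.children[q] = Node()
                self.nodes += 1
            node, x0, y0, size = child, nx0, ny0, nsize

    def predict(self, x, y):
        """Average cost of the smallest existing block containing (x, y)."""
        node, x0, y0, size = self.root, 0.0, 0.0, 1.0
        estimate = None  # stays None while the tree is empty
        while node is not None and node.count > 0:
            estimate = node.total / node.count
            q, x0, y0, size = self._step(x, y, x0, y0, size)
            node = node.children.get(q)
        return estimate


# Hypothetical data: cheap calls in the lower-left, expensive in the upper-right.
tree = BoundedQuadtree()
for x, y, cost in [(0.2, 0.2, 1.0), (0.8, 0.8, 9.0), (0.3, 0.1, 1.0),
                   (0.7, 0.9, 9.0), (0.1, 0.3, 1.0), (0.9, 0.7, 9.0)]:
    tree.insert(x, y, cost)

# A tiny budget: the tree keeps learning averages but stops growing.
small = BoundedQuadtree(max_nodes=3)
for i in range(100):
    small.insert((i * 37 % 100) / 100, (i * 61 % 100) / 100, float(i))
```

Storing only a count and a sum per block keeps each node at constant size, so the node budget translates directly into a memory bound. A fuller design would presumably also reclaim or merge blocks when the budget is reached, which this sketch does not attempt.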
Phase 3
Additional core modeling techniques.
Abstraction of the problem to “efficient adaptive predictive modeling of incremental data.”
Applications that need:
– Value predictions.
– Class predictions.