Analysis of Uncertain Data: Smoothing of Histograms
Eugene Fink, Ankur Sarin, Jaime G. Carbonell
Density estimation problem
Convert a set of numeric data points to a smoothed approximation of the underlying probability density.
[Figure: example point set and its estimated density]
Techniques
- Manual estimates
- Histograms
- Curve fitting
Generalized histograms
[Figure: example histogram with interval chances 0.5, 0.3, ...]
General form:
prob_1: [min_1..max_1]
prob_2: [min_2..max_2]
…
prob_n: [min_n..max_n]
- Intervals do not overlap
- Probabilities sum to 1.0
Special cases
- Standard histogram
- Set of points
- Weighted points
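The generalized-histogram form above can be sketched as a small data structure. This is a minimal illustration, not the authors' code; the names `Bin`, `validate`, and `as_hist` are assumptions, and the special cases are encoded as degenerate bins.

```python
from dataclasses import dataclass

@dataclass
class Bin:
    lo: float    # min of the interval
    hi: float    # max of the interval
    prob: float  # probability of the interval

def validate(bins):
    """Check the two invariants: intervals (assumed sorted) do not
    overlap, and probabilities sum to 1.0."""
    for a, b in zip(bins, bins[1:]):
        assert a.hi <= b.lo, "intervals overlap"
    assert abs(sum(b.prob for b in bins) - 1.0) < 1e-9, "probabilities must sum to 1"

# Special cases as degenerate generalized histograms:
# - standard histogram: adjacent equal-width intervals
# - a set of n points: n zero-width intervals with probability 1/n each
# - weighted points: zero-width intervals with normalized weights
points = [1.0, 2.0, 4.0]
as_hist = [Bin(x, x, 1.0 / len(points)) for x in points]
validate(as_hist)
```

Encoding point sets and weighted points as zero-width intervals is what lets one algorithm handle all three special cases uniformly.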
Smoothing problem
Given a generalized histogram, construct its coarser approximation.
Input
- Initial distribution: a point set or a fine-grained histogram
- Distance function: a measure of similarity between distributions
- Target size: the number of intervals in the approximation
Standard distance measures
- Simple difference: ∫ |p(x) − q(x)| dx
- Kullback-Leibler: KL(p, q) = ∫ p(x) · log(p(x) / q(x)) dx
- Jensen-Shannon: (KL(p, (p+q)/2) + KL(q, (p+q)/2)) / 2
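For piecewise-uniform distributions, the simple-difference integral can be evaluated exactly, since both densities are constant between consecutive interval endpoints. The sketch below assumes bins are (lo, hi, prob) triples; the function names are illustrative, not from the paper.

```python
def density_at(bins, x):
    """Density of a piecewise-uniform distribution at x (0 outside all bins)."""
    for lo, hi, prob in bins:
        if lo <= x < hi:
            return prob / (hi - lo)
    return 0.0

def simple_difference(p_bins, q_bins):
    """Integrate |p(x) - q(x)| exactly: both densities are constant
    between consecutive endpoints, so sum over those pieces."""
    cuts = sorted({e for lo, hi, _ in p_bins + q_bins for e in (lo, hi)})
    total = 0.0
    for a, b in zip(cuts, cuts[1:]):
        mid = (a + b) / 2  # densities are constant on (a, b)
        total += abs(density_at(p_bins, mid) - density_at(q_bins, mid)) * (b - a)
    return total

# Uniform on [0, 1] vs. uniform on [0, 2]:
# |1.0 - 0.5| over [0, 1] plus |0.0 - 0.5| over [1, 2] = 1.0
print(simple_difference([(0.0, 1.0, 1.0)], [(0.0, 2.0, 1.0)]))  # 1.0
```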
Smoothing algorithm
Repeat: merge two adjacent intervals
Until: the histogram has the target size
Interval merging
Merging adjacent intervals prob_1: [min_1..max_1] and prob_2: [min_2..max_2] gives (prob_1 + prob_2): [min_1..max_2].
For each potential merge, calculate the distance between the histograms before and after the merge; perform the smallest-distance merge.
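The greedy loop above can be sketched as follows. This is an illustrative, self-contained version, not the authors' code: bins are (lo, hi, prob) triples, the distance is the simple-difference measure, and a quadratic rescan stands in for the priority queue that a faster implementation would use.

```python
def l1_distance(p, q):
    """Simple-difference distance between two piecewise-uniform densities."""
    def dens(bins, x):
        return next((pr / (hi - lo) for lo, hi, pr in bins if lo <= x < hi), 0.0)
    cuts = sorted({e for lo, hi, _ in p + q for e in (lo, hi)})
    return sum(abs(dens(p, (a + b) / 2) - dens(q, (a + b) / 2)) * (b - a)
               for a, b in zip(cuts, cuts[1:]))

def merged(bins, i):
    """Merge adjacent intervals i and i+1: bounds extend, probabilities add."""
    (lo1, _, p1), (_, hi2, p2) = bins[i], bins[i + 1]
    return bins[:i] + [(lo1, hi2, p1 + p2)] + bins[i + 2:]

def smooth(bins, target_size, distance=l1_distance):
    """Repeat the smallest-distance merge of two adjacent intervals
    until the histogram has target_size intervals."""
    while len(bins) > target_size:
        i = min(range(len(bins) - 1),
                key=lambda j: distance(bins, merged(bins, j)))
        bins = merged(bins, i)
    return bins

hist = [(0.0, 1.0, 0.4), (1.0, 2.0, 0.4), (2.0, 3.0, 0.1), (3.0, 4.0, 0.1)]
print(smooth(hist, 2))  # merges the equal-density neighbors first
```

Passing the distance function as a parameter mirrors the problem statement, where the distance measure is an explicit input rather than a fixed choice.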
Smoothing examples: Normal distribution
[Figures: 5000 points smoothed to 200, 50, and 10 intervals]
Smoothing examples: Geometric distribution
[Figures: 5000 points smoothed to 200, 50, and 10 intervals]
Running time
- Theoretical: O(n · log n), where n is the number of points
- Practical: O(n)
Empirical: 3.4 GHz Pentium, C++ code; (2.5 ± 0.5) · num-points microseconds.
[Plot: time (microseconds) vs. number of points]
Visual smoothing
Users usually prefer a smooth probability density, so we convert the piecewise-uniform distribution to a smooth curve by spline fitting.
Main results
- Density estimation
- Lossy compression of generalized histograms
Advantages
- Explicit specification of the distance measure and the compression level
- Effective representation for automated reasoning