Published by Susan Ramsey. Modified over 8 years ago.
1
Analysis of Uncertain Data: Smoothing of Histograms
Eugene Fink, Ankur Sarin, Jaime G. Carbonell
2
Density estimate problem
Convert a set of numeric data points to a smoothed approximation of the underlying probability density.
Example points: 11, 12, 17, 18, 19, 21, 22, 26, 27, 29 (plotted on an axis from 10 to 30)
3
Techniques
- Manual estimates
- Histograms
- Curve fitting
4
Generalized histograms
Example:
0.2 chance: [11..12]
0.5 chance: [17..22]
0.3 chance: [26..29]
General form:
prob_1: [min_1..max_1]
prob_2: [min_2..max_2]
…
prob_n: [min_n..max_n]
Constraints:
- Intervals do not overlap
- Probabilities sum to 1.0
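One possible way to hold a generalized histogram in code is a list of `(lo, hi, prob)` tuples, one per interval. The sketch below checks the two constraints listed on the slide. The tuple layout, the name `validate`, and the tolerance `tol` are illustrative assumptions, not part of the presentation.

```python
def validate(hist, tol=1e-9):
    """Return hist sorted by lower bound, or raise ValueError if it is malformed.

    hist is a list of (lo, hi, prob) tuples, one per interval (assumed layout).
    """
    ordered = sorted(hist)
    for lo, hi, prob in ordered:
        if lo > hi or prob < 0:
            raise ValueError(f"malformed interval: {(lo, hi, prob)}")
    # Constraint 1: intervals do not overlap. Sharing a single endpoint is
    # tolerated here, which is a choice made for this sketch.
    for (_, left_hi, _), (right_lo, _, _) in zip(ordered, ordered[1:]):
        if left_hi > right_lo:
            raise ValueError("intervals overlap")
    # Constraint 2: probabilities sum to 1.0 (up to rounding error).
    if abs(sum(prob for _, _, prob in ordered) - 1.0) > tol:
        raise ValueError("probabilities do not sum to 1.0")
    return ordered


# The example histogram from the slide.
example = [(11, 12, 0.2), (17, 22, 0.5), (26, 29, 0.3)]
validate(example)
```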
5
Special cases
- Standard histogram
- Set of points
- Weighted points
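As a rough illustration of how these special cases could map onto the general form, the sketch below converts each one into a list of `(lo, hi, prob)` tuples. Treating a point as a zero-width interval is an assumption made here for illustration, and all three function names are hypothetical.

```python
from collections import Counter


def from_points(points):
    """Set of points: each distinct value becomes a zero-width interval."""
    counts = Counter(points)
    n = len(points)
    return [(x, x, c / n) for x, c in sorted(counts.items())]


def from_weighted_points(weighted):
    """Weighted points: (value, weight) pairs, normalized so the weights sum to 1."""
    total = sum(w for _, w in weighted)
    return [(x, x, w / total) for x, w in sorted(weighted)]


def from_standard_histogram(start, width, counts):
    """Standard histogram: adjacent equal-width bins holding raw counts."""
    total = sum(counts)
    return [(start + i * width, start + (i + 1) * width, c / total)
            for i, c in enumerate(counts)]


# The ten example points from the density estimation slide.
points = [11, 12, 17, 18, 19, 21, 22, 26, 27, 29]
point_hist = from_points(points)
```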
6
Smoothing problem
Given a generalized histogram, construct its coarser approximation.
7
Input
- Initial distribution: A point set or a fine-grained histogram
- Distance function: A measure of similarity between distributions
- Target size: The number of intervals in an approximation
8
Standard distance measures
- Simple difference: ∫ |p(x) − q(x)| dx
- Kullback-Leibler: ∫ p(x) · log(p(x) / q(x)) dx
- Jensen-Shannon: (Kullback-Leibler(p, (p+q)/2) + Kullback-Leibler(q, (p+q)/2)) / 2
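For piecewise-uniform densities, these integrals reduce to finite sums, because both densities are constant between consecutive interval endpoints. The following sketch computes the three measures that way. It assumes histograms are lists of `(lo, hi, prob)` tuples with positive widths and uses the natural logarithm; the slide does not state the log base, and the function names are illustrative.

```python
import math


def density(hist, x):
    """Density of a piecewise-uniform distribution at x (0 outside every interval)."""
    for lo, hi, prob in hist:
        if lo <= x < hi:
            return prob / (hi - lo)
    return 0.0


def _segments(p, q):
    """Yield (width, p_density, q_density) for each piece where both are constant."""
    cuts = sorted({e for lo, hi, _ in list(p) + list(q) for e in (lo, hi)})
    for a, b in zip(cuts, cuts[1:]):
        mid = (a + b) / 2
        yield b - a, density(p, mid), density(q, mid)


def simple_difference(p, q):
    return sum(w * abs(dp - dq) for w, dp, dq in _segments(p, q))


def kullback_leibler(p, q):
    total = 0.0
    for w, dp, dq in _segments(p, q):
        if dp > 0:
            if dq == 0:
                return math.inf  # p has mass where q has none
            total += w * dp * math.log(dp / dq)
    return total


def jensen_shannon(p, q):
    total = 0.0
    for w, dp, dq in _segments(p, q):
        m = (dp + dq) / 2  # density of the mixture (p + q) / 2
        for d in (dp, dq):
            if d > 0:
                total += 0.5 * w * d * math.log(d / m)
    return total


p = [(11, 12, 0.2), (17, 22, 0.5), (26, 29, 0.3)]
q = [(11, 22, 0.7), (26, 29, 0.3)]  # first two intervals of p merged
```

With these inputs, `kullback_leibler(q, p)` is infinite because `q` places mass on the gap between 12 and 17 where `p` has none, while `jensen_shannon` stays finite and symmetric.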
9
Smoothing algorithm
Repeat:
  Merge two adjacent intervals
Until the histogram has the right size
10
Interval merging
Before: [min_1..max_1] with prob_1 and [min_2..max_2] with prob_2
After: [min_1..max_2] with prob_1 + prob_2
- For each potential merge, calculate the distance
- Perform the smallest-distance merge
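The greedy merge loop can be sketched as follows. This is a minimal version under stated assumptions: histograms are lists of `(lo, hi, prob)` tuples, the distance is the simple difference, and each candidate merge is scored against the current histogram rather than the original input (the slides do not say which). The names `merge_cost` and `smooth` are illustrative.

```python
def merge_cost(a, b):
    """Simple-difference distance caused by merging adjacent intervals a and b.

    Only the span [lo1, hi2] changes, so the integral is evaluated there.
    """
    lo1, hi1, p1 = a
    lo2, hi2, p2 = b
    span = hi2 - lo1
    if span == 0:
        return 0.0  # both intervals are the same single point
    dm = (p1 + p2) / span  # density of the merged interval
    return (abs(p1 - (hi1 - lo1) * dm)      # over the first interval
            + (lo2 - hi1) * dm              # over the gap, where the old density is 0
            + abs(p2 - (hi2 - lo2) * dm))   # over the second interval


def smooth(hist, target_size):
    """Merge adjacent intervals greedily until target_size intervals remain."""
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    hist = sorted(hist)
    while len(hist) > target_size:
        # Pick the adjacent pair whose merge changes the distribution least.
        i = min(range(len(hist) - 1),
                key=lambda k: merge_cost(hist[k], hist[k + 1]))
        lo, _, p1 = hist[i]
        _, hi, p2 = hist[i + 1]
        hist[i:i + 2] = [(lo, hi, p1 + p2)]
    return hist


example = [(11, 12, 0.2), (17, 22, 0.5), (26, 29, 0.3)]
coarse = smooth(example, 2)
```

On the three-interval example, merging the last two intervals costs about 0.533 and merging the first two costs about 0.636, so the sketch keeps [11..12] and produces [17..29] with probability 0.8. This version rescans every pair on each step, which is quadratic in the worst case; keeping the candidate merges in a priority queue is one common way to reach O(n · log n), though the slides do not say which data structure the authors used.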
11
Smoothing examples: Normal distribution
Panels: 5000 points; 200 intervals; 50 intervals; 10 intervals
12
Smoothing examples: Geometric distribution
Panels: 5000 points; 200 intervals; 50 intervals; 10 intervals
13
Running time
Theoretical: O(n · log n)
Practical: O(n)
14
Running time
3.4 GHz Pentium, C++ code: (2.5 ± 0.5) · num-points microseconds
Plot: time (microseconds) versus number of points, both axes from 10^2 to 10^6
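To put the reported rate in concrete terms, the short sketch below turns the (2.5 ± 0.5) microseconds-per-point figure into a range of seconds for a given input size. The function name and its parameters are illustrative, and the numbers apply only to the hardware and C++ code quoted on the slide.

```python
def estimated_seconds(num_points, micros_per_point=2.5, spread=0.5):
    """Return the (low, high) running-time estimate in seconds."""
    low = (micros_per_point - spread) * num_points / 1e6
    high = (micros_per_point + spread) * num_points / 1e6
    return low, high


low, high = estimated_seconds(10**6)
```

For a million points the estimate is between 2 and 3 seconds; for the 5000-point examples it is between 10 and 15 milliseconds.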
15
Visual smoothing
We convert a piecewise-uniform distribution to a smooth curve by spline fitting. The user usually prefers a smooth probability density.
16
Main results
- Density estimation
- Lossy compression of generalized histograms
17
Advantages
Explicit specification of
- Distance measure
- Compression level
Effective representation for automated reasoning