Analysis of Uncertain Data: Smoothing of Histograms
Eugene Fink, Ankur Sarin, Jaime G. Carbonell

Density estimation problem
Convert a set of numeric data points to a smoothed approximation of the underlying probability density.
Example points: 11, 12, 17, 18, 19, 21, 22, 26, 27, 29 (shown on an axis from 10 to 30)

Techniques
- Manual estimates
- Histograms
- Curve fitting

Generalized histograms
Example:
0.2 chance: [11..12]
0.5 chance: [17..22]
0.3 chance: [26..29]
General form:
prob_1: [min_1..max_1]
prob_2: [min_2..max_2]
…
prob_n: [min_n..max_n]
Constraints:
- Intervals do not overlap
- Probabilities sum to 1.0
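The two constraints above can be checked mechanically. The following is a minimal sketch, not the authors' code: the `(prob, lo, hi)` tuple layout, the function name `validate`, and the tolerance are assumptions made for illustration. It also assumes that two intervals may share an endpoint without counting as overlapping, as adjacent bins of a standard histogram do.

```python
def validate(hist, tol=1e-9):
    """Check a generalized histogram given as (prob, lo, hi) tuples.

    Raises ValueError if an interval is malformed, if two intervals
    overlap, or if the probabilities do not sum to 1.0 (within tol).
    """
    ordered = sorted(hist, key=lambda iv: iv[1])  # sort by lower bound
    for prob, lo, hi in ordered:
        if lo > hi:
            raise ValueError(f"interval [{lo}..{hi}] has min > max")
        if prob < 0:
            raise ValueError(f"negative probability {prob}")
    for (_, _, left_hi), (_, right_lo, _) in zip(ordered, ordered[1:]):
        if left_hi > right_lo:  # touching endpoints are allowed (assumption)
            raise ValueError("intervals overlap")
    total = sum(prob for prob, _, _ in ordered)
    if abs(total - 1.0) > tol:
        raise ValueError(f"probabilities sum to {total}, not 1.0")


# The example from the slide passes the check silently.
validate([(0.2, 11, 12), (0.5, 17, 22), (0.3, 26, 29)])
```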

Special cases
- Standard histogram
- Set of points
- Weighted points
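One way to read these special cases is that each of them maps directly onto the general form. The sketch below assumes that reading: a point becomes a zero-width interval [x..x], and a standard histogram becomes a list of contiguous bins. The function names and the `(prob, lo, hi)` tuple layout are hypothetical, chosen only for illustration.

```python
from collections import Counter


def from_points(points):
    """Set of points: each distinct value x becomes [x..x] with prob count/n."""
    counts = Counter(points)
    n = len(points)
    return [(c / n, x, x) for x, c in sorted(counts.items())]


def from_weighted_points(weighted):
    """Weighted points: (value, weight) pairs, weights normalized to sum to 1."""
    total = sum(w for _, w in weighted)
    return [(w / total, x, x) for x, w in sorted(weighted)]


def from_standard_histogram(edges, counts):
    """Standard histogram: contiguous bins [edges[i]..edges[i+1]] with counts."""
    total = sum(counts)
    return [(c / total, edges[i], edges[i + 1]) for i, c in enumerate(counts)]


points = [11, 12, 17, 18, 19, 21, 22, 26, 27, 29]
hist = from_points(points)  # ten zero-width intervals, each with prob 0.1
```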

Smoothing problem
Given a generalized histogram, construct its coarser approximation.

Input
- Initial distribution: A point set or a fine-grained histogram
- Distance function: A measure of similarity between distributions
- Target size: The number of intervals in an approximation

Standard distance measures
- Simple difference: ∫ |p(x) − q(x)| dx
- Kullback-Leibler: ∫ p(x) · log (p(x) / q(x)) dx
- Jensen-Shannon: (Kullback-Leibler (p, (p+q)/2) + Kullback-Leibler (q, (p+q)/2)) / 2
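For piecewise-uniform densities the integrals above reduce to finite sums. The sketch below assumes that both densities have already been expressed on a shared partition, with `p[i]` and `q[i]` the density values on bin i and `widths[i]` the bin width. These names and the shared-partition setup are assumptions for illustration. It uses the usual conventions that a bin with p = 0 contributes nothing to Kullback-Leibler and that the result is infinite when q = 0 where p > 0.

```python
import math


def simple_difference(p, q, widths):
    """Integral of |p(x) - q(x)| over a shared partition."""
    return sum(abs(pi - qi) * w for pi, qi, w in zip(p, q, widths))


def kullback_leibler(p, q, widths):
    """Integral of p(x) * log(p(x) / q(x)); infinite if q = 0 where p > 0."""
    total = 0.0
    for pi, qi, w in zip(p, q, widths):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi) * w
    return total


def jensen_shannon(p, q, widths):
    """Average of the two Kullback-Leibler distances to the mixture (p+q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kullback_leibler(p, m, widths) + kullback_leibler(q, m, widths)) / 2
```

Unlike Kullback-Leibler, the Jensen-Shannon measure is symmetric in p and q and stays finite even when the two densities have disjoint support, where it equals log 2.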

Smoothing algorithm
Repeat:
  Merge two adjacent intervals
Until the histogram has the right size

Interval merging
Two adjacent intervals, prob_1: [min_1..max_1] and prob_2: [min_2..max_2], merge into a single interval prob_1 + prob_2: [min_1..max_2].
- For each potential merge, calculate the distance
- Perform the smallest-distance merge
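The greedy loop can be sketched as follows. This is an illustrative version under several assumptions, not the authors' implementation: intervals are `(prob, lo, hi)` tuples with positive width, the distance is the simple difference, and each candidate merge is scored against the current histogram (the slides do not say whether the original or the current histogram is the reference). It rescans all adjacent pairs on every step, so it does not achieve the running time reported in the slides.

```python
def merge_cost(a, b):
    """Simple-difference distance caused by merging adjacent intervals a, b.

    The merged interval [lo1..hi2] carries prob p1 + p2 spread uniformly,
    including over any gap between the two original intervals.
    """
    (p1, lo1, hi1), (p2, lo2, hi2) = a, b
    w1, w2, gap = hi1 - lo1, hi2 - lo2, lo2 - hi1
    d = (p1 + p2) / (hi2 - lo1)  # density of the merged interval
    # |old density - d| * width equals |prob - d * width| on each piece;
    # the gap had density 0 before the merge and d after it.
    return abs(p1 - d * w1) + d * gap + abs(p2 - d * w2)


def smooth(hist, target_size):
    """Repeatedly perform the smallest-distance merge until target_size."""
    hist = sorted(hist, key=lambda iv: iv[1])
    while len(hist) > max(target_size, 1):
        i = min(range(len(hist) - 1),
                key=lambda k: merge_cost(hist[k], hist[k + 1]))
        (p1, lo1, _), (p2, _, hi2) = hist[i], hist[i + 1]
        hist[i:i + 2] = [(p1 + p2, lo1, hi2)]
    return hist


coarse = smooth([(0.2, 11, 12), (0.5, 17, 22), (0.3, 26, 29)], 2)
```

On the three-interval example, the cheaper merge under this cost is the one joining [17..22] with [26..29], so `coarse` keeps [11..12] and gains a single interval [17..29].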

Smoothing examples: Normal distribution
[Figure panels: 5000 points; 200 intervals; 50 intervals; 10 intervals]

Smoothing examples: Geometric distribution
[Figure panels: 5000 points; 200 intervals; 50 intervals; 10 intervals]

Running time
- Theoretical: O(n · log n)
- Practical: O(n)

Running time
3.4 GHz Pentium, C++ code: (2.5 ± 0.5) · num-points microseconds
[Plot: time (microseconds) versus number of points, both axes marked at 10^2, 10^4, 10^6]

Visual smoothing
We convert a piecewise-uniform distribution to a smooth curve by spline fitting. The user usually prefers a smooth probability density.

Main results
- Density estimation
- Lossy compression of generalized histograms

Advantages
Explicit specification of
- Distance measure
- Compression level
Effective representation for automated reasoning
