Dr. Richard Young, Optronic Laboratories, Inc.
Uncertainty budgets are an increasingly common requirement for measurements. Multiple measurements are generally required to estimate uncertainties, and multiple measurements can also decrease the uncertainties in results. How many measurement repeats are enough?
Here is an example probability distribution function of some hypothetical measurements. We can use a random number generator with this distribution to investigate the effects of sampling.
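As an illustration only (the distribution and numbers below are assumptions, not the measured data), a short Python sketch of drawing samples from such a generator and watching the sample standard deviation settle as more repeats are taken:

```python
# Minimal sketch, assuming the hypothetical distribution is normal with mean 0
# and standard deviation 10 (the value used later in the talk). The seed is
# fixed only so the example is repeatable.
import numpy as np

rng = np.random.default_rng(seed=1)
data = rng.normal(loc=0.0, scale=10.0, size=10_000)   # the 10,000-point data set

for n in (2, 5, 10, 100, 1_000, 10_000):
    s = np.std(data[:n], ddof=1)    # sample standard deviation, (n-1) denominator
    print(f"first {n:6d} samples: sample std = {s:5.2f}")
```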
Here is a set of 10,000 data points…
Plotting Sample # on a log scale better shows the behaviour at small sample numbers.
There is a lot of variation, but how is this affected by the data set?
Here we have results for 200 data sets.
The most probable value for the sample standard deviation of 2 samples is zero! Many samples are needed before 10, the population value, becomes the most probable result.
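A minimal sketch of this effect, simulating many data sets per sample size (the number of sets is increased here only to map out the histogram; all values are illustrative):

```python
# Sketch: histogram the sample standard deviation for many simulated data sets.
# For n = 2 the most probable (modal) value is near zero; the mode only
# approaches the population value of 10 as n grows.
import numpy as np

rng = np.random.default_rng(seed=2)
n_sets = 50_000                                    # many data sets per sample size

for n in (2, 5, 20, 100):
    sets = rng.normal(0.0, 10.0, size=(n_sets, n))
    s = np.std(sets, axis=1, ddof=1)               # one sample std per data set
    counts, edges = np.histogram(s, bins=80)
    mode = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])
    print(f"n = {n:3d}: most probable sample std ~ {mode:5.2f}, median = {np.median(s):5.2f}")
```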
Sometimes it is best to look at the CDF. The 50% level is where lower or higher values are equally likely.
What if the distribution was uniform instead of normal? The most probable value for >2 samples is 10.
Underestimated values are still more probable because the PDF is asymmetric.
Throwing a die is an example of a uniform random distribution. A uniform distribution is not necessarily random however. It may be cyclic e.g. temperature variations due to air conditioning. With computer controlled acquisition, data collection is often at regular intervals. This can give interactions between the cycle period and acquisition interval.
For symmetric cycles, any multiple of two data points per cycle will average to the average of the cycle.
Correct averages are obtained when full cycles are sampled, regardless of the phase. Unless synchronized, data collection may begin at any point (phase) within the cycle.
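A small sketch of this, assuming a sinusoidal cycle (the level, amplitude and sampling rate are arbitrary choices for illustration):

```python
# Averaging a symmetric cycle: whole cycles give the true mean at any phase,
# partial cycles do not. A sine wave stands in for the air-conditioning cycle.
import numpy as np

mean_level = 100.0            # underlying steady value (hypothetical)
amplitude = 10.0              # size of the cyclic variation (hypothetical)
points_per_cycle = 16         # regular computer-controlled acquisition interval

def cycle_average(n_points, phase):
    t = np.arange(n_points) / points_per_cycle            # time in cycle units
    signal = mean_level + amplitude * np.sin(2 * np.pi * t + phase)
    return signal.mean()

for phase in (0.0, 1.0, 2.5):                              # start anywhere in the cycle
    whole = cycle_average(points_per_cycle, phase)         # exactly 1 cycle
    partial = cycle_average(int(1.5 * points_per_cycle), phase)   # 1.5 cycles
    print(f"phase {phase:3.1f} rad: 1 cycle -> {whole:8.3f}, 1.5 cycles -> {partial:8.3f}")
```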
Again, whole cycles are needed to give good values. The value is not 10 because the sample standard deviation has an (n-1)^0.5 term in its denominator.
The population standard deviation is 10 at each complete cycle. Each cycle contains all the data of the population. The standard deviation for full cycle averages = 0.
Smoothing involves combining adjacent data points to create a smoother curve than the original. A basic assumption is that data contains noise, but the calculation does NOT allow for uncertainty. Smoothing should be used with caution.
What is the difference?
Here is a spectrum of a white LED. It is recorded at very short integration time to make it deliberately noisy.
A 25 point Savitzky-Golay smooth gives a line through the center of the noise.
The result of the smooth is very close to the same device measured with optimum integration time.
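A sketch of a 25-point Savitzky-Golay smooth using SciPy (the spectrum here is a made-up white-LED-like curve plus noise, and the polynomial order of 2 is an assumption; the talk's own data are not reproduced):

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical noisy white-LED-like spectrum: a narrow blue peak plus a broad
# phosphor peak, with added noise to mimic a very short integration time.
rng = np.random.default_rng(seed=3)
wavelength = np.linspace(400, 750, 1024)
clean = (np.exp(-0.5 * ((wavelength - 450) / 10) ** 2)
         + 0.6 * np.exp(-0.5 * ((wavelength - 560) / 50) ** 2))
noisy = clean + rng.normal(0.0, 0.05, wavelength.size)

# 25-point Savitzky-Golay smooth (window of 25 samples, 2nd-order polynomial).
smoothed = savgol_filter(noisy, window_length=25, polyorder=2)
```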
But how does the number of data points affect results? Here we have 1024 data points.
Now we have 512 data points.
Now we have 256 data points.
Now we have 128 data points.
A 25 point smooth follows the broad peak but not the narrower primary peak.
To follow the primary peak we need to use a 7 point smooth… But it doesn’t work so well on the broad peak.
Comparing to the optimum scan, the intensity of the primary peak is underestimated. This is because some of the higher signal data have been removed.
Beware of under-sampling peaks – you may underestimate or overestimate intensities.
Here is the original data again. What about other types of smoothing?
An exponential smooth shifts the peak. Beware of asymmetric algorithms!
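A minimal sketch of why: a simple one-sided exponential smooth only looks backwards, so the smoothed maximum lands late. The peak shape and smoothing factor are arbitrary choices for illustration:

```python
import numpy as np

x = np.linspace(400, 500, 201)                      # nm, hypothetical axis
peak = np.exp(-0.5 * ((x - 450) / 5) ** 2)          # narrow peak centred at 450 nm

def exponential_smooth(y, alpha=0.2):
    """One-sided exponential smooth: each output mixes the new point with the
    previous output only, so the result lags behind the data."""
    out = np.empty_like(y)
    out[0] = y[0]
    for i in range(1, len(y)):
        out[i] = alpha * y[i] + (1.0 - alpha) * out[i - 1]
    return out

smoothed = exponential_smooth(peak)
print("original peak at", x[np.argmax(peak)], "nm,",
      "smoothed peak at", x[np.argmax(smoothed)], "nm")   # shifted to longer wavelengths
```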
This is the optimum integration scan but with 128 points like the noisy example. With lower noise, can we describe curves with fewer points?
… 64 points.
… 32 points. Is this enough to describe the peak?
Interpolation is the process of estimating data between given points. National Laboratories often provide data that requires interpolation to be useful. Interpolation algorithms generally estimate a smooth curve.
There are many forms of interpolation: Lagrange, B-spline, Bézier, Hermite, cardinal spline, cubic, etc. They all have one thing in common: they go through each given point and hence ignore uncertainty completely. Generally, interpolation algorithms are local in nature and commonly use just 4 points.
The interesting thing about interpolating data containing random noise is you never know what you will get. Let’s zoom this portion…
The Excel curve can even double back. Uneven sampling can cause overshoots.
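A sketch of the effect with a cubic spline (the points are invented, and Excel's smoothed-line algorithm is not reproduced, but any interpolator forced through every point behaves similarly):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Unevenly sampled, noisy points: the spline must pass through every one,
# so it reproduces the noise and can overshoot between points.
rng = np.random.default_rng(seed=4)
x = np.array([400.0, 410.0, 412.0, 430.0, 450.0, 470.0, 500.0])   # uneven sampling
y = np.exp(-0.5 * ((x - 450.0) / 30.0) ** 2) + rng.normal(0.0, 0.02, x.size)

spline = CubicSpline(x, y)
x_fine = np.linspace(400.0, 500.0, 1001)
y_fine = spline(x_fine)
print("largest interpolated value:", y_fine.max(), " largest data value:", y.max())
```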
If a spectrum can be represented by a function, e.g. a polynomial, the closest “fit” to the data can provide smoothing and give the values between points. The “fit” is achieved by changing the coefficients of the function until it is closest to the data – a least-squares fit.
The squares of the differences between the values predicted by the function and those given by the data are added to give a “goodness of fit” measure. The coefficients are changed until the “goodness of fit” is minimized. Excel has a regression facility that performs this calculation.
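A minimal sketch of the least-squares idea in Python rather than Excel (the data and the 2nd-order polynomial are illustrative assumptions):

```python
import numpy as np

# Hypothetical data following a gentle curve plus noise.
rng = np.random.default_rng(seed=5)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x - 0.03 * x**2 + rng.normal(0.0, 0.1, x.size)

coeffs = np.polyfit(x, y, deg=2)               # coefficients that minimise the sum of squares
residuals = y - np.polyval(coeffs, x)
goodness_of_fit = np.sum(residuals ** 2)       # the "goodness of fit" measure being minimised
print("coefficients:", coeffs, " sum of squared residuals:", goodness_of_fit)
```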
Theoretically, any simple smoothly varying curve can be fitted by a polynomial. Sometimes it is better to “extract” the data you want to fit by some reversible calculation. This means you can use, say, 9th-order polynomials instead of 123rd-order to make the calculations easier.
NIST provide data at uneven intervals. To use the data, we have to interpolate to intervals required by our measurements.
NIST recommend fitting a high-order polynomial to the data values multiplied by λ⁵/exp(a + b/λ) for interpolation. The result looks good, but…
...on a log scale, the match is very poor at lower values.
When converted back to the original scale, lower values bear no relation to the data.
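For reference, a sketch of the transform-and-fit procedure just described. The lamp values are placeholders (not NIST data), a and b come from a simple Wien-type straight-line fit, and the 6th-order polynomial is an arbitrary choice:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Placeholder spectral irradiance values at uneven wavelength intervals (nm).
wl = np.array([250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1050, 1300], float)
E = wl**-5 * np.exp(38.0 - 7500.0 / wl) * (1 + 2e-4 * (wl - 800))

# Estimate a and b from a straight line through ln(E*wl^5) vs 1/wl, then fit a
# polynomial to E*wl^5 / exp(a + b/wl), the slowly varying remainder.
transformed = E * wl**5
b, a = np.polyfit(1.0 / wl, np.log(transformed), deg=1)
reduced = transformed / np.exp(a + b / wl)
poly = Polynomial.fit(wl, reduced, deg=6)

# Interpolate to the intervals required by the measurement.
wl_new = np.arange(250.0, 1301.0, 5.0)
E_new = poly(wl_new) * np.exp(a + b / wl_new) / wl_new**5
```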
The “goodness of fit” parameter is a measure of absolute differences, not relative differences. NIST use a weighting of 1/E² to give relative differences, and hence closer matching, but that is not easy in Excel. Large values tend to dominate smaller ones in the calculation. A large dynamic range of values should be avoided. We are trying to match data over 4 decades!
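For completeness, a sketch of 1/E² weighting outside Excel: NumPy applies the weight to the un-squared residual, so passing w = 1/E minimises the sum of (residual/E)², i.e. relative differences. The data and polynomial order are illustrative assumptions:

```python
import numpy as np
from numpy.polynomial import Polynomial

x = np.linspace(250.0, 1300.0, 30)
E = 1e-2 * np.exp((x - 250.0) / 120.0)             # placeholder data spanning ~4 decades

unweighted = Polynomial.fit(x, E, deg=6)            # absolute differences: large values dominate
weighted = Polynomial.fit(x, E, deg=6, w=1.0 / E)   # relative differences: all decades count

print("max relative error, unweighted:", np.abs((unweighted(x) - E) / E).max())
print("max relative error, weighted:  ", np.abs((weighted(x) - E) / E).max())
```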
Although NIST’s 1/E² weighting gives closer matches than this, to get the best results they split the data into 2 regions and calculate separate polynomials for each. This is a reasonable thing to do, but it can lead to local data effects and arbitrary splits that do not suit all examples. Is there an alternative?
A plot of the log of E·λ⁵ values vs. 1/λ is a gentle curve – almost a straight line. We can calculate a polynomial without splitting the data. The fact that we are fitting on a log scale means we are effectively using relative differences in the least-squares calculation.
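A sketch of this alternative (method 1), again with placeholder lamp values and an assumed 6th-order polynomial:

```python
import numpy as np
from numpy.polynomial import Polynomial

wl = np.linspace(250.0, 1300.0, 30)                               # nm, placeholder grid
E = wl**-5 * np.exp(38.0 - 7500.0 / wl) * (1 + 2e-4 * (wl - 800)) # placeholder irradiance

# Fit a polynomial to log(E*wl^5) against 1/wl: no split into regions, and the
# least-squares criterion now works on relative differences.
poly = Polynomial.fit(1.0 / wl, np.log(E * wl**5), deg=6)

wl_new = np.arange(250.0, 1301.0, 5.0)
E_new = np.exp(poly(1.0 / wl_new)) / wl_new**5                    # back to the original scale
```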
Incandescent lamp emission is close to that of a blackbody.
If we calculate a scaled blackbody curve as we would to get the distribution temperature… …and then divide the data by the blackbody...
...we get a smooth curve with very little dynamic range. The “fit” is not good because of the high initial slope and almost linear falling slope.
Plotting vs. 1/λ, as in alternative method 1, allows close fitting of the polynomial.
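A sketch of method 2, with placeholder lamp values, an assumed distribution temperature of 3100 K, an arbitrary scaling wavelength near 560 nm and a 6th-order polynomial:

```python
import numpy as np
from numpy.polynomial import Polynomial

h, c, k = 6.626e-34, 2.998e8, 1.381e-23             # Planck, speed of light, Boltzmann

def planck(wl_nm, T):
    wl = wl_nm * 1e-9
    return (2 * h * c**2 / wl**5) / (np.exp(h * c / (wl * k * T)) - 1.0)

wl = np.linspace(250.0, 1300.0, 30)                               # nm, placeholder grid
E = wl**-5 * np.exp(38.0 - 7500.0 / wl) * (1 + 2e-4 * (wl - 800)) # placeholder irradiance

T = 3100.0                                           # assumed distribution temperature
idx = np.argmin(np.abs(wl - 560.0))                  # scale the blackbody to the data near 560 nm
scale = E[idx] / planck(wl[idx], T)
ratio = E / (scale * planck(wl, T))                  # smooth curve with very little dynamic range

poly = Polynomial.fit(1.0 / wl, ratio, deg=6)        # close fit when plotted vs 1/wl
wl_new = np.arange(250.0, 1301.0, 5.0)
E_new = poly(1.0 / wl_new) * scale * planck(wl_new, T)
```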
Method 2 shows lower residuals, but there is not much difference.
All methods discussed give essentially the same result when converted back to the original scale.
None of the algorithms mentioned allow for uncertainty or assume it is constant. If we replaced the least-squares “goodness of fit” parameter with “most probable,” this would use the uncertainty we know is there to determine the best fit. Why is this not done? Difficult in Excel. Easy with custom programs.
From the data value (mean) and the standard deviation, we can calculate the PDF. The value from the fit has a probability that we can use.
Multiply the probabilities at each point to give the “goodness of fit” parameter. Use this parameter instead of the least-squares sum in the fit calculations. MAXIMIZE the “goodness of fit” parameter to obtain the best fit. The fit will be closest where uncertainties are lowest.
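A sketch of such a maximum-probability fit with a custom program rather than Excel. In practice the log of the product of probabilities is maximised (equivalently, its negative is minimised) to avoid numerical underflow; the data, uncertainties and 2nd-order polynomial are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical measurements with point-by-point standard deviations.
rng = np.random.default_rng(seed=6)
x = np.linspace(0.0, 10.0, 40)
sigma = 0.05 + 0.02 * x                      # uncertainty grows along the scan
y = 1.0 + 0.3 * x - 0.02 * x**2 + rng.normal(0.0, sigma)

def neg_log_probability(coeffs):
    """Negative log of the product of the per-point normal PDFs."""
    model = np.polyval(coeffs, x)
    return -np.sum(norm.logpdf(y, loc=model, scale=sigma))

start = np.polyfit(x, y, deg=2)              # ordinary least squares as a starting guess
best = minimize(neg_log_probability, start).x
print("least-squares coefficients:      ", start)
print("maximum-probability coefficients:", best)
```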
Standard deviations may be under-estimated with small samples. Cyclic variations should be integrated over complete cycle periods. Smoothing and interpolation should be used with caution: do not assume results are valid – check.
Polynomial fits can give good results, but: avoid large dynamic ranges, avoid complex curvatures, and avoid high initial slopes. All these manipulations ignore uncertainty (or assume it is constant). But least-squares fits can be replaced by maximum probability to take uncertainty into consideration.