1 Effects of Rounding on Data Quality Lawrence H. Cox, Jay J. Kim, Myron Katzoff, Joe Fred Gonzalez, Jr. U.S. National Center for Health Statistics.

2 Outline
I. Introduction
II. Four Rounding Rules
III. Mean and Variance
IV. Distance Measure
V. Concluding Comments
VI. References

3 I. Introduction
Reasons for rounding:
To round noninteger values to integer values for statistical purposes;
To enhance readability of the data;
To protect confidentiality of records in the file;
To keep the important digits only.

4 Purpose of this paper: Evaluate the effects of four rounding methods on data quality and utility in two ways: (1) bias and variance; (2) effects on the underlying distribution of the data, as determined by a distance measure.

5 Notation: x = qB + r, where B: base; q: quotient; r: remainder.

6 Types of rounding:
Unbiased rounding: E[R(r) | r] = r. Example (B = 10): R(3) is either 0 or 10, and E[R(3) | r = 3] = 3.
Sum-unbiased rounding: E[R(r)] = E(r).

7 II. Four rounding rules
1. Conventional rounding
1.1 B even: Suppose r = 0, 1, 2, ..., 9. In this case B = 10. If r >= B/2, round r up to 10; otherwise round r down to zero (0).

8 1.2 B odd: Round r up to B when r >= (B + 1)/2; otherwise round down to zero. Example: when B = 5, round r = 3 and r = 4 up to 5; otherwise, round down.

9 Assumptions for r and q: r follows a discrete uniform distribution; q follows a lognormal, Pareto of the second kind, or multinomial distribution. Thus, x has some mixed distribution, but the term qB dominates.

10 2. Modified conventional rounding: Same as conventional rounding, except that r = B/2 (for example, 5 when B = 10) is rounded up or down with probability ½ each.
3. Zero-restricted 50/50 rounding: Except for zero (0), which stays at zero, round r up or down with probability ½ each.

11 4. Unbiased rounding rule: Round r up with probability r/B, and round r down with probability 1 - r/B. Example (B = 10): for r = 1, P[R(1) = B] = 1/10 and P[R(1) = 0] = 9/10.
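The four rules can be put side by side in a short sketch. This is an illustration only, not the authors' code: the function names `round_remainder` and `round_value`, the rule labels, and the optional `rng` argument are all hypothetical, and the sketch assumes nonnegative integer x with x = qB + r.

```python
import random

def round_remainder(r, B, rule, rng=random):
    """Round a remainder r in {0, ..., B-1} to 0 or B under the named rule.

    Rule labels are illustrative names chosen for this sketch.
    """
    if not 0 <= r < B:
        raise ValueError("r must satisfy 0 <= r < B")
    if r == 0:
        return 0  # a zero remainder is never rounded up under any of the rules
    if rule == "conventional":
        # 2r >= B covers r >= B/2 (B even) and r >= (B+1)/2 (B odd)
        up = 2 * r >= B
    elif rule == "modified":
        # as conventional, except r == B/2 (possible only for even B) is a coin flip
        up = rng.random() < 0.5 if 2 * r == B else 2 * r > B
    elif rule == "fifty_fifty":
        # zero-restricted 50/50: every nonzero remainder goes up with probability 1/2
        up = rng.random() < 0.5
    elif rule == "unbiased":
        # round up with probability r/B, down with probability 1 - r/B
        up = rng.random() < r / B
    else:
        raise ValueError(f"unknown rule: {rule}")
    return B if up else 0

def round_value(x, B, rule, rng=random):
    """Round a nonnegative integer x = q*B + r to a multiple of B."""
    q, r = divmod(x, B)
    return q * B + round_remainder(r, B, rule, rng)
```

For example, `round_value(13, 10, "conventional")` returns 10, while `round_value(13, 10, "unbiased")` returns 20 with probability 3/10 and 10 otherwise. Passing a seeded `random.Random` instance as `rng` makes the randomized rules reproducible.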

12 III. Mean and variance
III.1 Mean and variance of unrounded number
r = 0, 1, 2, 3, ..., B-1.
$P(r) = \frac{1}{B}$ and $E(r) = \frac{1}{B}\sum_{r=0}^{B-1} r = \frac{B-1}{2}$.

13 $V(r) = \frac{B^2 - 1}{12}$. In general, when r and q are independent, $V(x) = B^2 V(q) + V(r)$.
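The moments of the unrounded remainder can be checked by direct enumeration. The sketch below assumes only that r is discrete uniform on 0, ..., B-1, as stated in the assumptions slide, and compares against the standard closed forms (B-1)/2 and (B^2-1)/12 for that distribution. The function name `uniform_remainder_moments` is hypothetical.

```python
from fractions import Fraction

def uniform_remainder_moments(B):
    """Exact mean and variance of r, uniform on {0, 1, ..., B-1}."""
    p = Fraction(1, B)  # P(r) = 1/B for every remainder
    mean = sum(p * r for r in range(B))
    var = sum(p * (r - mean) ** 2 for r in range(B))
    return mean, var

for B in (5, 10, 1000):
    mean, var = uniform_remainder_moments(B)
    print(B, mean, var)
```

Exact fractions are used rather than floats so that the comparison with the closed forms is an equality, not an approximation.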

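14 III.2 Conventional rounding when B is even. The B/2 remainders r = B/2, ..., B-1 are rounded up, so P[R(r) = B] = 1/2 and E[R(r)] = B/2, compared with E(r) = (B-1)/2 for the unrounded number.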

17 III.3 Conventional rounding when B is odd. Note that (B-1)/2 out of B elements can be rounded up. Hence P[R(r) = B] = (B-1)/(2B) and E[R(r)] = (B-1)/2, the same as E(r) for the unrounded number, so R(r) is sum-unbiased.

18 P[R(r) = B] for modified conventional rounding: $\frac{B-1}{2B}$.
P[R(r) = B] for 50/50 rounding: same as above. Number of elements which can be rounded up: B-1 of the B elements. Probability of rounding up is ½.

19 P[R(r) = B] for unbiased rounding: $\sum_{r=0}^{B-1} \frac{r}{B} \cdot \frac{1}{B} = \frac{B-1}{2B}$.
Modified conventional rounding, 50/50 rounding and unbiased rounding have the same mean, variance and MSE as conventional rounding with odd B.
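The equality of moments across rules can be checked exactly by averaging the round-up probability over the uniform remainder. This is a sketch under stated assumptions: r is uniform on 0, ..., B-1, R(r) takes only the values 0 and B, and bias is measured as E[R(r)] - E(r). The names `prob_up` and `rounded_moments` and the rule labels are hypothetical.

```python
from fractions import Fraction

def prob_up(r, B, rule):
    """P[R(r) = B | r] under each rule, as an exact fraction."""
    if r == 0:
        return Fraction(0)  # zero is never rounded up
    if rule == "conventional":
        return Fraction(int(2 * r >= B))
    if rule == "modified":
        if 2 * r == B:  # the midpoint B/2 exists only when B is even
            return Fraction(1, 2)
        return Fraction(int(2 * r > B))
    if rule == "fifty_fifty":
        return Fraction(1, 2)
    if rule == "unbiased":
        return Fraction(r, B)
    raise ValueError(f"unknown rule: {rule}")

def rounded_moments(B, rule):
    """Return (P[R = B], E[R], Var[R], bias) with r uniform on {0, ..., B-1}."""
    p = sum(prob_up(r, B, rule) for r in range(B)) / B
    mean = B * p                      # R is B with probability p, else 0
    var = B * B * p * (1 - p)
    bias = mean - Fraction(B - 1, 2)  # E[R(r)] - E(r)
    return p, mean, var, bias

for rule in ("conventional", "modified", "fifty_fifty", "unbiased"):
    print(rule, rounded_moments(10, rule))
```

With B = 10, only conventional rounding shows a nonzero bias (1/2); the other three rules share P[R(r) = B] = 9/20. With odd B, all four rules give identical results.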

20 IV. Distance measure. Assume that when x = 0, U = 0. Define [equation].

21 Re-expressing the numerator of U, we have [equation]. With conventional rounding with B = 10, [equation]. Then we have [equation].

22 Expected value of U. We define [equation].

23 IV.1 Conventional rounding with B even: [equation], which can be expressed as [equation].

24 Sum of 1/r terms. Recall the harmonic series: $H_n = \sum_{k=1}^{n} \frac{1}{k}$. The upper and lower bounds for the harmonic series: [equation].
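The bounds shown on this slide did not survive in the transcript. As a stand-in, the sketch below checks the classical integral-comparison bounds ln(n + 1) < H_n <= 1 + ln(n), which may or may not be the exact form the authors used. The function name `harmonic` is hypothetical.

```python
import math

def harmonic(n):
    """n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return math.fsum(1.0 / k for k in range(1, n + 1))

for n in (9, 999):  # B - 1 for the bases B = 10 and B = 1,000 used in the comparison
    lower = math.log(n + 1)
    upper = 1.0 + math.log(n)
    print(n, lower, harmonic(n), upper)
```

The gap between the two bounds stays below 1 for every n, which is why they are useful for bounding sums of 1/r terms when B is large.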

25 The upper bound for the first term of E(U) is [equation]. The second term of E(U) is [equation]. Note that the second term of E(U) is [equation].

26 IV.2 Modified conventional rounding with even B: this has the same E(U) as conventional rounding.
IV.3 50/50 rounding: The first term of E(U) is [equation]. The second term of E(U) is [equation].

27 IV.4 Unbiased rounding: The first term of E(U) is [equation]. The second term of E(U) is [equation].

28 IV.5 Comparison of four rounding rules

          Conventional or Mod. Conven.   50/50          Unbiased
Term 1    [equation]                     [equation]     [equation]
Term 2    [equation]                     [equation]     [equation]

29 Comparison of four rounding rules, B = 10

          Conventional or Mod. Conven.   50/50          Unbiased
Term 1    2.61                           11.49 (4.4)    4.5 (1.7)
Term 2    0.85                           2.85           1.65

Values in parentheses match the ratio of each Term 1 entry to the conventional Term 1 entry.

30 Comparison of four rounding rules, B = 1,000

          Conventional or Mod. Conven.   50/50          Unbiased
Term 1    194                            3,454 (18)     500 (2.6)
Term 2    83                             323            166

Values in parentheses match the ratio of each Term 1 entry to the conventional Term 1 entry.

31 IV.6 E(1/q) for log-normal distribution. Suppose [equation] and [equation]. Then x has a lognormal distribution, i.e., [equation].

32 Let [equation]. Then [equation], which is equivalent to [equation].
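The derivation on these two slides is not recoverable from the transcript. For orientation, here is the standard result under the assumption (ours, labeled as such) that ln q is normal with mean \mu and variance \sigma^2. The symbols \mu and \sigma^2 are not taken from the slides.

```latex
% Assumption (not recoverable from the slides): \ln q \sim N(\mu, \sigma^2).
% Then -\ln q \sim N(-\mu, \sigma^2), and the normal moment generating
% function E(e^{tY}) = \exp(\mu t + \sigma^2 t^2 / 2), evaluated at t = -1, gives
\begin{align*}
E\!\left(\frac{1}{q}\right)
  = E\!\left(e^{-\ln q}\right)
  = \exp\!\left(-\mu + \frac{\sigma^2}{2}\right).
\end{align*}
```

Because 1/q is itself lognormal under this assumption, E(1/q) has a closed form here, unlike the multinomial case treated later.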

33 IV.6 E(1/q) for Pareto distribution of the 2nd kind. The Pareto distribution of the second kind is [equation]. In the above, k = min(q). Let [equation].

34 IV.7 Upper bound for E(1/q) for multinomial distribution. The multinomial distribution has the form [equation], with each category count equal to 0, 1, 2, ...

35 In the binomial distribution, we let [equation]. When the variable is truncated at 1 from below, we have [equation]. Note that [equation] for all i.

36 In general, [equation]. Using the above relationship, we have the following: [equation], which does not generate any term having [symbol] or [symbol]. Hence, [equation].

37 The upper bound of the expected value is [equation]. Let [symbol] be the size of category j and [equation].
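The slides bound E(1/q) analytically. For a single binomial count truncated at 1 from below, the expectation can also be evaluated by direct summation, which is the quantity tabulated by Grab and Savage (1954). The sketch below is that brute-force evaluation, not the authors' bound. The function name and the parameters `n` and `p` are hypothetical.

```python
from math import comb

def expected_reciprocal_positive_binomial(n, p):
    """E(1/X | X >= 1) for X ~ Binomial(n, p), by direct summation."""
    if n < 1 or not 0 < p <= 1:
        raise ValueError("need n >= 1 and 0 < p <= 1")
    # P(X >= 1): the normalizing constant of the distribution truncated at 1
    norm = 1 - (1 - p) ** n
    total = sum(
        comb(n, x) * p**x * (1 - p) ** (n - x) / x
        for x in range(1, n + 1)
    )
    return total / norm

print(expected_reciprocal_positive_binomial(2, 0.5))
print(expected_reciprocal_positive_binomial(50, 0.2))
```

Direct summation is practical for moderate n, so it offers a way to gauge how tight an analytic upper bound is for a given category.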

38 V. Concluding comments. Various methods of rounding and, in some applications, various choices for the rounding base B are available. The question becomes: which method and/or base is expected to perform best in terms of data quality and preserving the distributional properties of the original data and, quantitatively, what is the expected distortion due to rounding?

39 The expected value of U, the distance measure, is intractable, so we derived its upper bound. The expected value of 1/q is also intractable for a multinomial distribution, so we derived an upper bound for it as well. There should be room for improvement. This paper provides a preliminary analysis toward answering these questions. In summary: In terms of bias, unbiased rounding is optimal.

40 In terms of the distance measure, conventional or modified conventional rounding performs best. In terms of protecting confidentiality, the 50/50 rounding rule is best.
VI. References
Grab, E.L. & Savage, I.R. (1954). Tables of the Expected Value of 1/X for Positive Bernoulli and Poisson Variables. Journal of the American Statistical Association 49, 169-177.
Johnson, N.L. & Kotz, S. (1969). Distributions in Statistics, Discrete Distributions. Boston: Houghton Mifflin Company.

41 Johnson, N.L. & Kotz, S. (1970). Distributions in Statistics, Continuous Univariate Distributions-1. New York: John Wiley and Sons, Inc.
Kim, Jay J., Cox, L.H., Gonzalez, J.F. & Katzoff, M.J. (2004). Effects of Rounding Continuous Data Using Rounding Rules. Proceedings of the American Statistical Association, Survey Research Methods Section, Alexandria, VA, 3803-3807 (available on CD).
Chvatal, Vasek. Harmonic Numbers, Natural Logarithm and the Euler-Mascheroni Constant. See www.cs.rutgers.edu/~chvatal/notes/harmonic.html

