Positive Skew and Empirical Likelihood Al Kvanli (akvanli@verizon.net) 2019
Background Book by John Neter and James Loebbecke – 1975
Background Nonstandard Mixtures of Distributions Study - 1988
National Research Council panel for this report
Background Journal of Business & Economic Statistics Article - 1998
Background The Canadian Journal of Statistics - 2003
The Central Limit Theorem and How It Works The Central Limit Theorem states that for large samples, the sample mean follows an approximate normal distribution regardless of the population shape. If we repeatedly obtained large samples, found the corresponding sample means, then made a histogram of the sample means, the result would be a histogram that is bell-shaped (normal) in appearance.
An Illustration: 5,000 Paid Amounts The histogram of the 5,000 paid amounts has a severe right skew. Mean = $75.
A computer-simulation approach was used to select a random sample of 40 claims from this universe, and the sample mean was determined. This procedure was repeated 1,000 times, resulting in the random selection of 40,000 claims with 1,000 corresponding sample means, and produced this histogram. Here, you can see the Central Limit Theorem at work: despite sampling from a very skewed population, the sample means are approximately normally distributed.
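For readers who want to reproduce this experiment without the Excel macro, here is a minimal Python sketch. The lognormal population and its parameters are assumptions standing in for the actual 5,000 paid amounts; only the shape (severe right skew, mean near $75) matters for the demonstration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)

# Stand-in for the 5,000 right-skewed paid amounts (lognormal shape assumed).
paid = rng.lognormal(mean=3.5, sigma=1.2, size=5000)
paid *= 75.0 / paid.mean()          # rescale so the universe mean is $75

# Draw 1,000 samples of 40 claims each and record the sample means.
means = np.array([rng.choice(paid, size=40, replace=False).mean()
                  for _ in range(1000)])

print(f"universe mean: ${paid.mean():.2f}, skewness: {skew(paid):.2f}")
print(f"mean of sample means: ${means.mean():.2f}, skewness: {skew(means):.2f}")
# The sample means cluster near $75 with far milder skew: the CLT at work.
```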
What We Would Like to Occur A reliable and accurate 2-sided 90% confidence interval for the population mean (µ) should contain the actual mean 90% of the time: (1) 5% of the time, the population mean will be below the lower limit, and (2) 5% of the time, the population mean will be above the upper limit. With highly skewed audit populations, however, the traditional lower limit is typically too small, so the population mean falls below the lower limit much less than 5% of the time.
What We Would Like to Occur Conversely, the percentage of time the population mean will be above the upper limit is much larger than 5%. We can envision this as a confidence interval that is “shifted too far to the left.” That is, both the lower and upper limits are too small.
How the Histogram of Sample Means was Constructed I have an Excel macro to sample a column of values (here, the 5,000 paid amounts) by randomly selecting 40 claims and deriving the sample mean and corresponding confidence interval for each sample. This procedure was repeated 1,000 times resulting in the random selection of 40,000 claims with 1,000 corresponding sample means and confidence intervals.
A Sample of Overpayments (Differences) An overpayment population was constructed by keeping the first 1,000 paid amounts, followed by 4,000 zero values. This represents the overpayment values when a 20% error rate is anticipated and all 1,000 overpaid claims were 100% overpaid (the entire claim amounts were disallowed). Using the same Excel macro, this population was sampled 1,000 times by randomly selecting 40 overpayments from the population of 5,000 overpayments (4,000 of which are zero).
The Overpayment Samples Assuming a 20% error rate, we would expect the mean of the overpayments to be approximately .8 x 0 + .2 x $75 = $15. In fact, this value is $15.20 (in cell C15). The 1,000 sample means are in Column G, and the limits for the corresponding 90% confidence intervals are in Columns H and I.
The Overpayment Samples According to cell C12, in 185 of the 1,000 samples, the actual mean ($15.20) is above the upper limit. If this were a true confidence interval, this should have occurred 5% of the time or 50 times. What this implies is that the upper limit is too small. According to cell C11, the actual mean was below the lower limit in 15 of the 1,000 samples (1.5% of them). Again, if this were a true confidence interval, this should have occurred in 5% of the generated samples. This implies that the lower limit is also too small.
The Actual Coverage Cell C13 tells us that only 800 of the 1,000 generated confidence intervals contained the actual universe mean of $15.20. So what we thought were 90% confidence intervals (the specified coverage) were, in fact, closer to 80% confidence intervals (the actual coverage). That is, the actual coverage is less than we specified.
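This coverage experiment is easy to replicate. The sketch below is a hypothetical Python reconstruction rather than the Excel macro: it builds a zero-inflated universe like the one described above (a lognormal stand-in for the 1,000 nonzero overpayments plus 4,000 zeros), draws 1,000 samples of 40, and counts how often the traditional t-based 90% interval misses the true mean on each side. No finite-population correction is applied, and the exact counts will differ from the slide's 185/15/800.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Zero-inflated overpayment universe: a lognormal stand-in for the 1,000
# nonzero overpayments, followed by 4,000 zeros (universe mean = $15.20).
nonzero = rng.lognormal(mean=3.5, sigma=1.2, size=1000)
nonzero *= 76.0 / nonzero.mean()          # 1,000 x $76 / 5,000 = $15.20
universe = np.concatenate([nonzero, np.zeros(4000)])
mu = universe.mean()

n, reps = 40, 1000
t = stats.t.ppf(0.95, df=n - 1)           # two-sided 90%: 5% in each tail
below = above = 0
for _ in range(reps):
    y = rng.choice(universe, size=n, replace=False)
    half = t * y.std(ddof=1) / np.sqrt(n)
    below += mu < y.mean() - half         # true mean below the lower limit
    above += mu > y.mean() + half         # true mean above the upper limit

print(f"true mean ${mu:.2f}: below lower limit {below}/{reps}, "
      f"above upper limit {above}/{reps}, coverage {reps - below - above}/{reps}")
```

Run with these assumptions, the "above" count comes out far larger than 50 and the "below" count far smaller, reproducing the left-shifted interval described above.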
The Empirical Likelihood Procedure We need a procedure that will shift the traditional confidence interval for the total overpayment to the right. The traditional approach also makes certain statistical assumptions which at times can be very tenuous. Empirical Likelihood (EL) estimation provides more reliable confidence intervals (i.e., a 90% confidence interval contains the true value closer to 90% of the time). This is because both the lower and upper limits are larger than those produced using the traditional procedure.
Why Use the EL Procedure? Consequently, a 90% lower limit is indeed a 90% lower limit. It does not require that problematic statistical assumptions be made (e.g., normality of the sample mean of the overpayment amounts). Statisticians often refer to such a procedure as a nonparametric procedure.
How Does EL Work? Consider the following function of µ (log is to the base e, a natural log):

f(µ) = 2 Σ log(1 + A(Yi − µ))   (1)

where the sum runs over i = 1, ..., n and A is the value satisfying

Σ (Yi − µ) / (1 + A(Yi − µ)) = 0   (2)

Here, the Yi's are the n sample overpayments (differences).
How Does EL Work? Procedure:
1. Pick a value of µ. (Note: µ will always be between YMIN and YMAX, where YMIN and YMAX are the minimum and maximum of the n values of Yi.)
2. Determine the value of A that satisfies (2).
3. Using these values of µ and A, calculate f(µ) using equation (1).
How Does EL Work? Procedure: The lower limit of the 90% two-sided confidence interval is the smallest value of µ for which f(µ) is ≤ 2.7055. The upper limit is the largest value of µ for which f(µ) is ≤ 2.7055. The confidence interval for the overpayment total can be obtained by multiplying the endpoints of the confidence interval for the mean by the population size (N).
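Here is a minimal Python sketch of steps 1-3 and the limit search. It assumes equations (1) and (2) as reconstructed above, uses scipy's brentq root finder in place of whatever search routine the macro performs, and ends with a hypothetical zero-inflated sample, not the presentation's data.

```python
import numpy as np
from scipy import optimize, stats

def el_stat(y, mu):
    """f(mu) = 2 * sum(log(1 + A*(Yi - mu))), with A solving equation (2)."""
    d = y - mu
    # A must keep every 1 + A*(Yi - mu) positive, which brackets the root.
    lo, hi = -1.0 / d.max(), -1.0 / d.min()
    pad = 1e-9 * (hi - lo)
    A = optimize.brentq(lambda a: np.sum(d / (1.0 + a * d)), lo + pad, hi - pad)
    return 2.0 * np.sum(np.log(1.0 + A * d))

def el_interval(y, confidence=0.90):
    """Smallest and largest mu with f(mu) <= the chi-square cutoff."""
    cut = stats.chi2.ppf(confidence, df=1)      # 2.7055 for 90%
    ybar = y.mean()
    eps = 1e-6 * (y.max() - y.min())            # mu must stay inside (YMIN, YMAX)
    h = lambda mu: el_stat(y, mu) - cut         # f grows as mu moves away from ybar
    lower = optimize.brentq(h, y.min() + eps, ybar)
    upper = optimize.brentq(h, ybar, y.max() - eps)
    return lower, upper

# Hypothetical sample: roughly 20% nonzero overpayments among n = 40 claims.
rng = np.random.default_rng(3)
y = np.where(rng.random(40) < 0.2, rng.lognormal(3.5, 1.2, 40), 0.0)
lo, hi = el_interval(y)
N = 5000
print(f"90% EL interval for the mean:  (${lo:,.2f}, ${hi:,.2f})")
print(f"90% EL interval for the total: (${N * lo:,.0f}, ${N * hi:,.0f})")
```

The final two lines illustrate the point made above: the interval for the total is just the interval for the mean multiplied by the population size N.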
How Does EL Work? [Figure: plot of f(µ) versus µ; the lower and upper limits of the 90% confidence interval are the two points where f(µ) crosses 2.7055.]
How Does EL Work? This illustration discussed a 90% confidence interval for the population mean, µ The chi-square value (using 1 degree of freedom) having a right-tail area of 10% is 2.7055
Where did the 2.7055 come from?
How Does EL Work? This right-tail area would be .20 for an 80% confidence interval. So, replace 2.7055 with 1.6424. This right-tail area would be .05 for a 95% confidence interval. So, replace 2.7055 with 3.8415.
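As a quick check, all three cutoffs can be reproduced with scipy's chi-square quantile function:

```python
from scipy.stats import chi2

# Chi-square critical values with 1 degree of freedom: a two-sided
# (1 - alpha) interval uses the point with right-tail area alpha.
for level, tail in ((0.80, 0.20), (0.90, 0.10), (0.95, 0.05)):
    print(f"{level:.0%} interval (right tail {tail}): {chi2.ppf(level, df=1):.4f}")
# 80%: 1.6424, 90%: 2.7055, 95%: 3.8415
```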
Using the Excel Macro I have an Excel macro that can be used for single-stage designs (using a simple random sample) or for stratified designs. First, click on (1) Input and enter the number of strata and desired confidence level. Then click on the OK button.
Using the Excel Macro Enter the sample size (365) and population size (301,410). Then click on (2) OK.
Using the Excel Macro Enter the sample values in column B and click on (3) Continue.
Macro Output
EL Result and RAT-STATS Result The 90% confidence interval using the traditional procedure (RAT-STATS) is $1,085,209 to $35,772,048. The 90% EL confidence interval for the population total is $6,466,372 to $42,357,599.
Is there published research supporting the EL procedure? Absolutely. Two articles and a textbook are listed on the next slide. The textbook by Art Owen is entirely devoted to the EL procedure. The lead author of the first article is J. N. K. Rao, the same Rao who created the RHC (Rao/Hartley/Cochran) multistage estimator contained in RAT-STATS. He is an absolute authority on the EL methodology.
References
Owen, A. B., Empirical Likelihood, Chapman and Hall, Boca Raton, FL, 2001 (304 pages).
Chen, J., Chen, S. Y., and Rao, J. N. K., "Empirical Likelihood Confidence Intervals for the Mean of a Population Containing Many Zero Values," The Canadian Journal of Statistics, Vol. 31, No. 1, March 2003, pp. 53-67.
Kvanli, A. H. and Schauer, R., "Is Your Agency Too Conservative: Deriving More Reliable Confidence Intervals," Journal of Government Financial Management, Vol. 54, No. 2, 2005.
Preview of the New Improved RAT-STATS (Beta Version)
Opening RAT-STATS Window
Input Screen for Variable SRS Appraisal User can specify the confidence level for the Empirical Likelihood Estimator
Output for New Version of RAT-STATS Includes Empirical Likelihood in addition to the Difference/Simple Expansion
Empirical Likelihood Pros 1) No assumptions regarding the shape of the difference (overpayment) population are made, so there are no distributional assumptions to challenge during an audit appeal. 2) The EL methodology produces more reliable confidence intervals (based on the simulation results in the Rao reference) with less conservative lower limits.
Empirical Likelihood Pros 3) Can be easily adapted to a stratified sampling design. 4) A set of difference (overpayment) values will always provide the same lower limit calculation (the results are easily replicated).
Empirical Likelihood Cons 1) There is no closed-form expression for this confidence interval; to determine a lower limit, one must search a particular interval for the value satisfying a certain equation. However, this procedure is very easy to program and well within the computing abilities of a standard personal computer. 2) This is a relatively new methodology, and many statisticians and auditors are not aware of it.