Disclosure detection & control in research environments
Felix Ritchie
Why are research environments special?
  Little disclosure control on input
  Few limits on processing
  Unpredictable, complex outputs
    – an infinity of “special cases”
  Manual review for disclosiveness required
Problems of reviewing research outputs
  Limited application of rules
  How do we ensure
    – consistency?
    – transparency?
    – security?
  How do we do this with few resources?
Classifying the research zoo
  Some outputs inherently “safe”
  Some inherently “unsafe”
  Concentrate on the unsafe
    – Focus training
    – Define limits
    – Discourage use
Safe versus unsafe
  Safe outputs
    – Will be released unless certain conditions arise
  Unsafe outputs
    – Won’t be released unless demonstrated to be safe
  Examples (* = conditions for release apply):

  Unsafe                    Safe                      Indeterminate
  Quantiles                 Linear regression*        Herfindahl indices
  Graphs                    Panel data estimates
  Aggregated tables         Estimated covariances*
  Cross-product matrices
Determining safety
  Key is to understand whether the underlying functional form is safe or unsafe
  Each output type assessed for risk of
    – Primary disclosure
    – Disclosure by differencing
Example: linear aggregates of data are unsafe
  Inherent disclosiveness:
    – Each data point needs to be assessed for threshold/dominance limits
    – => resource problem for large datasets
  Disclosure by differencing:
    – Differencing is feasible
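To make the differencing point concrete, here is a minimal sketch in Python. The data, the function names, and the minimum cell size of 3 units are illustrative assumptions, not taken from the slides; the 45% dominance limit is borrowed from the rule quoted on the Herfindahl slide. It shows two totals that each pass a threshold/dominance check, yet together reveal one firm's value exactly.

```python
def passes_threshold_and_dominance(values, min_units=3, max_share=0.45):
    """Hypothetical cell check: enough units, and no unit dominates the total."""
    total = sum(values)
    if len(values) < min_units:
        return False
    if total > 0 and max(values) / total >= max_share:
        return False
    return True


def difference_totals(total_with, total_without):
    """Value of the single unit that separates two otherwise identical totals."""
    return total_with - total_without


# Invented turnover figures for four firms, then the same group plus one more firm.
turnover = [120.0, 95.0, 80.0, 60.0]
turnover_plus_one = turnover + [150.0]

total_a = sum(turnover)
total_b = sum(turnover_plus_one)

# Each total is acceptable on its own, but the difference is one firm's turnover.
recovered = difference_totals(total_b, total_a)
print(passes_threshold_and_dominance(turnover))           # each cell passes alone
print(passes_threshold_and_dominance(turnover_plus_one))
print(recovered)
```

Checking each output in isolation is therefore not enough: the reviewer also has to consider what other totals on nearly the same group have been, or could be, released.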
Example: linear regression coefficients are safe
  Coefficients alone can’t identify a single data point
  No risk of differencing
  Exceptions
    – All right-hand-side variables public and an excellent fit (easily tested, can generate automatic limits on prediction)
    – All observations on a single person/company
    – Must be a valid regression
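The first exception (public right-hand-side variables plus an excellent fit) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data are invented, the single-regressor setup is the simplest possible case, and the R² limit of 0.99 is a hypothetical cut-off rather than a rule from the slides. The point is only that when the fit is near-perfect, anyone who knows the public variable can predict the confidential one closely.

```python
def ols_simple(x, y):
    """Ordinary least squares for y = intercept + slope * x (one regressor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance of y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot


def fit_too_good(r2, limit=0.99):
    """Hypothetical automatic test: flag regressions that predict almost exactly."""
    return r2 > limit


# Invented data: x is assumed publicly known, y is confidential.
public_x = [10.0, 20.0, 30.0, 40.0, 50.0]
confidential_y = [21.0, 40.5, 61.0, 80.5, 101.0]

intercept, slope = ols_simple(public_x, confidential_y)
r2 = r_squared(public_x, confidential_y, intercept, slope)

# An outsider with the published coefficients and public x can predict y.
predictions = [intercept + slope * xi for xi in public_x]
max_error = max(abs(p - yi) for p, yi in zip(predictions, confidential_y))

print(fit_too_good(r2), max_error)
```

A check of this kind is cheap to automate, which is presumably why the slide describes this exception as easily tested.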
Example: cross-product/variance-covariance matrices
  Cross-product matrix M = (X’X) is unsafe
    – Frequencies/totals identified by interaction with constant
    – And for any other categorical variables
  What about variance-covariance matrices?
    – V is unsafe: it can be inverted to produce M
    – But in the more general case, can’t create a table for X unless Z=X and W=I
    – => weighted covariance matrix is safe
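The claim that M = (X’X) exposes frequencies and totals can be checked with a short sketch. The design matrix below is invented, and the column meanings (a constant, an "exporter" dummy, and turnover) are hypothetical labels chosen for illustration. The sketch covers only the cross-product matrix, not the inversion of V.

```python
def cross_product(X):
    """Return M = X'X for a design matrix given as a list of rows."""
    k = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(k)]
            for i in range(k)]


# Invented design matrix. Columns: constant, exporter dummy, turnover.
X = [
    [1.0, 1.0, 50.0],
    [1.0, 0.0, 70.0],
    [1.0, 1.0, 30.0],
    [1.0, 0.0, 90.0],
    [1.0, 1.0, 20.0],
]

M = cross_product(X)

# Interactions with the constant give counts and totals directly.
n_obs = M[0][0]              # number of observations
n_exporters = M[0][1]        # frequency of the dummy category
total_turnover = M[0][2]     # total of the continuous variable

# Interactions with the dummy give the total within that category.
exporter_turnover = M[1][2]

print(n_obs, n_exporters, total_turnover, exporter_turnover)
```

In other words, releasing M amounts to releasing a table of counts and totals broken down by every categorical variable in X, so it needs the same checks as an aggregated table.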
Example: Herfindahl indices
  Composite index of industrial concentration
  Safe as long as at least 3 firms in the industry?
  No:
    – Quadratic term exacerbates dominance
    – If second-largest share is much smaller, √H ≈ share of largest firm
    – Standard dominance rule of largest unit < 45% share doesn’t prevent this
  Current tests for safety not very satisfactory
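A small numerical sketch shows why the dominance rule does not protect the largest firm here. It assumes the usual definition of the index as the sum of squared market shares, and the industry below (one firm at 44% and 56 firms at 1% each) is invented for illustration.

```python
import math


def herfindahl(shares):
    """Herfindahl index, assumed here to be the sum of squared market shares."""
    return sum(s * s for s in shares)


# Invented industry: one large firm at 44%, and 56 small firms at 1% each.
shares = [0.44] + [0.01] * 56

h = herfindahl(shares)
largest = max(shares)

# The largest firm is under the 45% dominance limit quoted above...
under_dominance_limit = largest < 0.45

# ...yet the square root of H is a close estimate of its share.
estimate = math.sqrt(h)
error = abs(estimate - largest)

print(under_dominance_limit, round(estimate, 4), round(error, 4))
```

Because every squared share is non-negative, √H can never be smaller than the largest share, so a published H always gives an upper bound on it. The bound is tight whenever the remaining firms are small.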
Questions?
Felix Ritchie
Microdata Analysis and User Support
Office for National Statistics