Data processing: German foreign trade statistics
ADVANCED ISSUES IN INTERNATIONAL TRADE IN GOODS STATISTICS
ESTP training course, 2 – 4 April 2014
Data processing
German foreign trade statistics
- Up to 30 million records per month
- First results 40 days after the reference period
- Highly efficient data processing is necessary
- The statistical value is unevenly distributed across records
- Most records have a limited effect on the results
The ASA System
ASA: Data submission monitoring
Data submission monitoring
- Monitoring of the most important enterprises (“Top 60”) for German foreign trade
- Checking of the variance
- Investigation of large deviations from the previous year or month
- Identification of unusual deviations for all enterprises
- Acceptance Factor: (Current Value - Mean Value) / Std. Dev.
- Fast correction or confirmation of unusual values
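The Acceptance Factor above can be sketched in a few lines of Python. This is only an illustration: the slide does not say which reference period, which standard deviation (sample or population) or which cut-off the ASA system uses, so the sample standard deviation, the `limit` of 3.0 and the example figures are assumptions.

```python
from statistics import mean, stdev


def acceptance_factor(current_value, previous_values):
    """Return (current value - mean value) / standard deviation."""
    mu = mean(previous_values)
    sigma = stdev(previous_values)  # sample standard deviation (assumption)
    if sigma == 0:
        raise ValueError("reference values show no variation")
    return (current_value - mu) / sigma


def is_unusual(current_value, previous_values, limit=3.0):
    """Flag a value whose factor exceeds a hypothetical cut-off."""
    return abs(acceptance_factor(current_value, previous_values)) > limit


# Hypothetical monthly values of one enterprise for the previous 12 months.
history = [100, 110, 90, 105, 95, 100, 102, 98, 107, 93, 101, 99]
print(acceptance_factor(120, history))
print(is_unusual(120, history))
```

A flagged value is not necessarily wrong; as the slide notes, it is either corrected or confirmed.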
Data submission monitoring
- Checking of data delivery
- Structural checks: data file format, field format, readability, statistics
- Delivery-specific checks
  - Declaration attributes: form, flow, specific number, doublet
  - Declaration-specific checks: tax number, serial errors
- Processing of serial errors in the data declarations
- A data delivery with more than 250 errors is generally rejected
- Approval of data declarations for the main (micro) data processing
The ASA system: Selective editing
The selective editing process
- Limited capacity for manual correction
- Important data records are corrected manually
- The vast majority of the data records have limited impact on the results
- Less important data records are corrected by automated procedures
  - Rule-based procedures
  - Hot-Deck procedure
  - Regression-based procedure
Selective editing: Threshold values
- Prioritization by CN8-specific threshold values
  - High-quality results for all commodity codes
  - Determination of the micro data that are important for the results
- The highest potential value of a record (according to statistical value, supplementary unit and net mass) is compared with the threshold value of the respective CN8 code
- Threshold values are calculated from the processed, error-free micro data of the previous 12 months
Selective editing: Threshold values
[Figure: distribution of records for one CN8 code, with the regions labelled <25% and >75%]
Threshold (75%) for CN8 code 22042979, flow arrivals: (6000 + 2300) / 2 = 4150
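The slide gives only the final arithmetic for the threshold, so the following Python sketch is one plausible reading rather than the documented ASA procedure. It assumes that the records of a CN8 code are sorted by value in descending order, that the largest records are accumulated until they cover 75% of the total, and that the threshold is the midpoint between the last record inside that group and the first record outside it. The function name `cn8_threshold` and all record values except 6000 and 2300 are hypothetical.

```python
def cn8_threshold(values, share=0.75):
    """Midpoint between the last record inside and the first outside the share."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("values must have a positive total")
    cumulative = 0
    for i, value in enumerate(ordered):
        cumulative += value
        if cumulative >= share * total:
            if i + 1 < len(ordered):
                return (value + ordered[i + 1]) / 2
            return float(value)  # every record is needed to reach the share
    return float(ordered[-1])


# Hypothetical error-free records of the previous 12 months for one CN8 code.
records = [10000, 8000, 6000, 2300, 2000, 1500, 1000, 200]
print(cn8_threshold(records))  # (6000 + 2300) / 2 = 4150.0
```

With these illustrative figures the three largest records cover 24000 of 31000, which is just over 75%, so the boundary falls between 6000 and 2300.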
Classification by fictional value
- The statistical value can be erroneous
- The fictional value (highest potential value) is less vulnerable to errors
- The fictional value is the maximum of:
  - the statistical value
  - the average statistical value per supplementary unit multiplied by the supplementary unit
  - the average statistical value per net mass multiplied by the net mass
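The definition of the fictional value translates directly into code. The sketch below is a minimal illustration; the parameter names and example figures are assumptions, and the handling of missing quantities is not covered because the slide does not describe it.

```python
def fictional_value(statistical_value, supplementary_unit, net_mass,
                    avg_value_per_unit, avg_value_per_kg):
    """Highest potential value of a record, as defined on the slide."""
    candidates = [
        statistical_value,
        avg_value_per_unit * supplementary_unit,  # value implied by the supplementary unit
        avg_value_per_kg * net_mass,              # value implied by the net mass
    ]
    return max(candidates)


# Hypothetical record: the declared value of 500 looks too low for the quantities.
value = fictional_value(
    statistical_value=500,
    supplementary_unit=1000,   # e.g. litres
    net_mass=1000,             # kg
    avg_value_per_unit=5.0,    # hypothetical CN8 average per supplementary unit
    avg_value_per_kg=4.8,      # hypothetical CN8 average per kg
)
print(value)          # 5000.0
print(value > 4150)   # compared with a CN8 threshold such as 4150
```

Because the maximum is taken, a record with an understated statistical value but large quantities is still treated as important.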
Selective editing: Validation checks
- The data records are compared with reference data in order to find errors and to prioritize them
- The reference data and validation rules are managed with the tool “BASE PL-Editor”
- The validation rules and the structure of the reference data are implemented in the ASA system through an XML file
- Findings are classified as (definite) errors or possible errors
Selective editing: Validation checks
- Errors
  - Invalid codes
  - Very unusual unit price
  - Invalid combinations
- Possible errors
  - Unlikely partner countries etc.
  - Unlikely unit price or value
  - Unlikely combinations
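The distinction between errors and possible errors can be illustrated with a small Python sketch. It is not the ASA rule set: the layout of the `reference` dictionary, the price ranges and the list of unlikely partner countries are all hypothetical, and the real rules are defined in the XML file mentioned earlier.

```python
def validate(record, reference):
    """Return a list of (severity, message) findings for one record."""
    cn8 = record["cn8"]
    if cn8 not in reference["valid_cn8"]:
        return [("error", "invalid CN8 code")]

    findings = []
    unit_price = record["value"] / record["net_mass"]
    hard_lo, hard_hi = reference["hard_price_range"][cn8]
    soft_lo, soft_hi = reference["soft_price_range"][cn8]
    if not hard_lo <= unit_price <= hard_hi:
        findings.append(("error", "very unusual unit price"))
    elif not soft_lo <= unit_price <= soft_hi:
        findings.append(("possible error", "unlikely unit price"))

    if record["partner"] in reference["unlikely_partners"].get(cn8, set()):
        findings.append(("possible error", "unlikely partner country"))
    return findings


# Hypothetical reference data for a single CN8 code.
reference = {
    "valid_cn8": {"22042979"},
    "hard_price_range": {"22042979": (0.2, 50.0)},   # outside: error
    "soft_price_range": {"22042979": (1.0, 10.0)},   # outside: possible error
    "unlikely_partners": {"22042979": {"IS"}},
}

record = {"cn8": "22042979", "partner": "IS", "value": 30000, "net_mass": 1000}
print(validate(record, reference))
```

Keeping two ranges per code is one simple way to separate values that must be corrected from values that only need confirmation.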
The ASA system: Selective editing
Selective editing: Automated correction
Deterministic error correction
- If-then correction rules
  - Effective when there is a strong correlation between variables, for example between CN8 code and mode of transport
- Typical errors, for example a numerical code instead of the ISO alpha code
- Numerical variables: the supplementary unit and net mass are corrected using the statistical value and the average ratio
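The three kinds of deterministic correction can be sketched as follows. The concrete rules are illustrative assumptions, not the rules used in the ASA system: the country mapping uses ISO 3166-1 numeric codes, the transport rule assumes that electrical energy (CN8 27160000) moves by fixed transport installations (mode 7), and the average value per kg is a hypothetical figure.

```python
# Hypothetical lookup: ISO 3166-1 numeric code -> ISO alpha-2 code.
NUMERIC_TO_ALPHA2 = {"276": "DE", "250": "FR", "528": "NL"}

# Hypothetical if-then rule: CN8 code -> the only plausible mode of transport.
CN8_TO_TRANSPORT = {"27160000": "7"}


def correct_country_code(code):
    """Replace a numerical country code with its alpha code, if known."""
    code = code.strip().upper()
    return NUMERIC_TO_ALPHA2.get(code, code)


def correct_mode_of_transport(cn8, mode):
    """Apply the if-then rule when the CN8 code determines the mode."""
    return CN8_TO_TRANSPORT.get(cn8, mode)


def correct_net_mass(statistical_value, avg_value_per_kg):
    """Derive the net mass from the statistical value and the average ratio."""
    return statistical_value / avg_value_per_kg


print(correct_country_code("250"))                   # FR
print(correct_mode_of_transport("27160000", "3"))    # 7
print(correct_net_mass(4800, avg_value_per_kg=4.8))  # about 1000 kg
```

Rules of this kind only work where one variable almost fully determines another, which is why the slide restricts them to strongly correlated variables.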
Selective editing: Automated correction
Hot-Deck error correction
- Erroneous micro data are corrected by imputing values from error-free micro data (donor records)
- Applied to categorical variables only
- Nearest-neighbor approach for donor determination
  - Calculation of the distance between the records
  - Weighting of the variables
  - In most cases the donor has the same CN8 code
- Outliers are avoided as donors
- The impact on the donor result is taken into account
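Selective editing: Automated correction
Hot-Deck donor determination
[Table: the erroneous record, potential donors 1 to 3 and the corrected record are compared on Variable 1, Variable 2 and Variable 3, with a distance column. The legible variable weights are w_1 = 1 and w_3 = 2; the remaining cell values were not preserved.]
Distance between records X and Y: $d_{XY} = \sum_k w_k \, |x_k - y_k|$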
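A nearest-neighbor donor search over categorical variables can be sketched as below. The assumptions are that a mismatch on a variable counts as a distance of 1 multiplied by that variable's weight, and that the first donor with the smallest distance is chosen. The variable names, weights and records are hypothetical, and the sketch leaves out the outlier and donor-impact checks mentioned on the previous slide.

```python
def weighted_distance(record, donor, weights):
    """Sum of the weights of all matching variables on which the records differ."""
    return sum(w for var, w in weights.items() if record[var] != donor[var])


def impute_from_nearest_donor(record, donors, weights, target):
    """Copy the target variable from the donor with the smallest distance."""
    best = min(donors, key=lambda d: weighted_distance(record, d, weights))
    corrected = dict(record)
    corrected[target] = best[target]
    return corrected, best


# Hypothetical weights: a differing CN8 code is penalised more heavily.
weights = {"cn8": 2, "partner": 1}

erroneous = {"cn8": "22042979", "partner": "FR", "transport": None}
donors = [
    {"cn8": "22042979", "partner": "IT", "transport": "3"},  # distance 1
    {"cn8": "22042110", "partner": "FR", "transport": "1"},  # distance 2
    {"cn8": "22042979", "partner": "FR", "transport": "3"},  # distance 0
]

corrected, donor = impute_from_nearest_donor(erroneous, donors, weights, "transport")
print(corrected["transport"])  # "3", taken from the third donor
```

Giving the CN8 code the largest weight reflects the remark that the donor has the same CN8 code in most cases.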
The ASA system: Outlier detection
Outlier detection
- Comparison of current results with the results of the previous 12 months
- Outliers are highlighted by the Acceptance Factor: (Current value - Mean value) / Std. dev.
- Detailed results at CN8 level
  - CN8 result
  - Partner country result
  - Statistical value, net mass, supplementary unit and their ratios