Validation and credibility checking procedures in UK trade-in-goods statistics HMRC Trade Statistics Don Priest
Introduction Data Collection Validation Credibility
Data Collection 1
Electronic submissions
Electronic data:
- EU: CSV file upload, online web form, EDIFACT submission
- Non-EU: CHIEF (Customs Handling of Import and Export Freight)
TS93
Data Collection 2
Volume of Data (2013)

                   Monthly Lines   Value (%)
Edifact            1 million+      18
Online Form        300,000         37
CSV                4 million+      46
Extrastat CHIEF    4 million+      100

Source: HMRC
Validation Invalid data must be corrected before it is processed Checking for correct data in the correct field
Validation 2
Looking for:
- Invalid commodity codes
- Invalid country codes
- Very large values
- Blank or zero value
- Cross-field checks: mode of transport/place of clearance, commodity to partner country, country of origin, mode of transport
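The single-field checks listed above could be sketched as follows. This is a minimal illustration, not HMRC's implementation: the code lists, the `TradeLine` record, the `validate` function and the "very large value" threshold are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical placeholder reference data; real code lists are far larger.
VALID_COMMODITY_CODES = {"01012100", "87032110"}
VALID_COUNTRY_CODES = {"DE", "FR", "US", "CN"}
# Assumed threshold for a "very large value"; not taken from the slides.
MAX_PLAUSIBLE_VALUE_GBP = 1_000_000_000


@dataclass
class TradeLine:
    commodity_code: str
    country_code: str
    value_gbp: Optional[float]


def validate(line: TradeLine) -> List[str]:
    """Return the list of validation failures for one trade line."""
    errors = []
    if line.commodity_code not in VALID_COMMODITY_CODES:
        errors.append("invalid commodity code")
    if line.country_code not in VALID_COUNTRY_CODES:
        errors.append("invalid country code")
    if not line.value_gbp:  # catches both None (blank) and 0
        errors.append("blank or zero value")
    elif line.value_gbp > MAX_PLAUSIBLE_VALUE_GBP:
        errors.append("very large value")
    return errors
```

An empty list means the line passed; any other result would have to be corrected before the line is processed.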
Front End Validation
EU data (online web form, CSV upload):
- Submission not allowed with invalid country or commodity codes
Non-EU data (CHIEF):
- Codes checked for validity (failure results in rejection): country, commodity, mode of transport, statistical procedure, CPC
- Codes checked for credibility (failure can be overridden): value to net mass and value to supplementary unit
Automatic Corrections
- Some validation problems are solved automatically
- Automatic correction can be used to correct simple problems
- Common problems, such as GB versus UK country codes and out-of-date commodity codes
- Mode of transport/place of clearance inter-field checks are also resolved automatically
- Optional net mass for Intrastat: net mass imputed based on supplementary unit
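A rough sketch of how such automatic corrections might look is given below. The lookup tables are hypothetical placeholders, the direction of the country-code fix (UK replaced by GB) is an assumption, and so is the idea of imputing net mass with a fixed kilograms-per-unit factor for each commodity.

```python
# Assumed direction of the fix: "UK" is replaced by "GB".
COUNTRY_ALIASES = {"UK": "GB"}
# Hypothetical mapping from out-of-date commodity codes to current ones.
SUPERSEDED_COMMODITY_CODES = {"OLD00001": "NEW00001"}
# Hypothetical conversion factors: kg per supplementary unit, by commodity.
KG_PER_SUPPLEMENTARY_UNIT = {"NEW00001": 2.5}


def auto_correct(line: dict) -> dict:
    """Return a corrected copy of a trade line; the input is not modified."""
    fixed = dict(line)
    fixed["country"] = COUNTRY_ALIASES.get(fixed["country"], fixed["country"])
    fixed["commodity"] = SUPERSEDED_COMMODITY_CODES.get(
        fixed["commodity"], fixed["commodity"]
    )
    # Impute a missing net mass from the supplementary unit when a factor exists.
    if fixed.get("net_mass_kg") is None and fixed.get("supp_units") is not None:
        factor = KG_PER_SUPPLEMENTARY_UNIT.get(fixed["commodity"])
        if factor is not None:
            fixed["net_mass_kg"] = fixed["supp_units"] * factor
    return fixed
```

Lines that match none of the tables pass through unchanged, so only the simple, well-understood problems are touched.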
Credibility
- Credibility checks ensure realism
- Compare values against quantities
- If the ratio is too high or low, we call it a Credibility Failure
- Parameters are created to test the data
Credibility 2 Exponential regression model for line of best fit. Equivalently:
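One way to fit such a line of best fit is sketched below. The functional form is an assumption, since it is not stated in this text: the sketch takes value = a * quantity^b and fits it by ordinary least squares on the logarithms. The function name `fit_log_log` is hypothetical.

```python
import math


def fit_log_log(quantities, values):
    """Fit log(value) = log(a) + b * log(quantity) by least squares.

    Returns (a, b, s), where s is the residual standard error in log space.
    Assumes at least three points, all strictly positive.
    """
    xs = [math.log(q) for q in quantities]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    log_a = mean_y - b * mean_x
    residuals = [y - (log_a + b * x) for x, y in zip(xs, ys)]
    # Two fitted parameters (log_a and b), hence n - 2 degrees of freedom.
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return math.exp(log_a), b, s
```

With perfectly proportional data, for example a constant unit price, the fit returns b close to 1 and a residual standard error close to 0.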
Credibility 3
Credibility 4 A score is given to the data to indicate how “incredible” it is The further from the norm, the higher the score
Credibility 5
Score based on ‘Standard Error of Model’, the ‘S’ parameter:
- MOF 0: < ± S from best fit
- MOF 9: ± 6 S from best fit
- Other integer MOF scores by linear extrapolation, 1-99
MOF = Magnitude of failure
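The scoring rule can be illustrated with a short sketch. The exact interpolation is an assumption: here the score is 0 inside one S, follows the straight line through (1 S, 0) and (6 S, 9) outside it, is rounded down to an integer, and is clamped to the range 1-99. The function name `mof_score` is hypothetical.

```python
import math


def mof_score(residual: float, s: float) -> int:
    """Map a distance from the best-fit line to a Magnitude of Failure score."""
    distance = abs(residual) / s  # distance in units of S
    if distance < 1.0:
        return 0  # within one S of the line: not a failure
    # Assumed straight line through (1 S, MOF 0) and (6 S, MOF 9).
    raw = 9.0 * (distance - 1.0) / 5.0
    # Any failure scores at least 1; scores are capped at 99.
    return min(99, max(1, math.floor(raw)))
```

Under these assumptions a line 6 S from the best fit scores 9, and anything further than about 56 S away is capped at 99.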
Credibility Method Problems Insufficient Data
Credibility Method Problems
If the ‘b’ parameter is below 1 (b<1) when the axes are flipped, we cannot trust the data. Instead, force b=1, i.e., the Mirror Trick.
Manual Checking
More MOF failures than staff to check!
Risk-based approach, always check:
- All items >= £100m (even if not MOF failures)
- All 5+ MOF failures > £0.3m
- All 50+ MOF failures
Other MOFs checked in descending order
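The selection rules above might be expressed as in the sketch below. The record layout, the `capacity` parameter (how many lines staff can check) and the choice to order the remaining failures by MOF score are assumptions made for illustration.

```python
def build_check_queue(lines, capacity):
    """Return the lines to check manually, must-check items first.

    Each line is a dict with "mof" (int) and "value_gbp" (number) keys.
    """
    always, others = [], []
    for line in lines:
        mof, value = line["mof"], line["value_gbp"]
        if (
            value >= 100_000_000               # all items >= £100m
            or (mof >= 5 and value > 300_000)  # 5+ MOF failures > £0.3m
            or mof >= 50                       # all 50+ MOF failures
        ):
            always.append(line)
        elif mof > 0:
            others.append(line)
    # Remaining failures are taken in descending MOF order (assumed sort key).
    others.sort(key=lambda item: item["mof"], reverse=True)
    remaining = max(0, capacity - len(always))
    return always + others[:remaining]
```

The must-check items are always returned, even if they exceed `capacity`; only the lower-priority failures are truncated.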
Manual Checking 2
Not just MOF failures; all of the below are checked:
- “Heavy by air”: air imports over 50,000 kg
- Sensitive commodities or commodity/country combinations. For example:
  - Military goods
  - Goods that appear to contravene embargoes
  - Import or export bans
Resources
IT
- TS93 mainframe: holds all trade and trader details
- Trade changes made by desktop PC interface to TS93
- SAS Unix-based database: updated each night from TS93
- PC SAS used to calculate MOF failures and other lines to be checked each morning
Resources 2
Human Resources (full-time equivalents)
- 20 Data Managers: manually check lines and contact traders
- 0.6 Statisticians: regularly update all parameters, manually adjust problematic parameters
Advantages and Disadvantages
Advantages
- Risk-based approach which minimises the impact of limited resources
Disadvantages
- TS93 mainframe is fairly inflexible: all change involves IT partners
- Current MOF system generates ~100 staff years of work; the majority will never be checked
Future Work
Auto correction
- To address the issue of most MOF failures not being checked
- Target low value/quantity lines
- Most likely using “Effective Value”, based on German “Fictitious Value”
Impact of SIMSTAT proposals
Any Questions? UK Trade Statistics contacts: Lisa Carter Kevin Sams Nicola Jobson