European Conference on Quality in Official Statistics Different Quality Tests on the Automatic Coding Procedure for the Economic Activities Descriptions.


European Conference on Quality in Official Statistics
Different Quality Tests on the Automatic Coding Procedure for the Economic Activities Descriptions
A. Ferrillo, S. Macchia, P. Vicari
ISTAT – Italian Institute of Statistics
Rome, 8 – 11 July 2008

The automated coding system used in Istat
ACTR (Automatic Coding by Text Recognition) was developed by Statistics Canada. It is a generalised system, that is, independent of both the classification and the language.
Before it can be used, it must be customised: the dictionary (reference file) has to be built, synonyms defined and the system adapted to the language.
Building the coding dictionary is the most demanding activity, since its quality strongly affects the performance of automatic coding.

ACTR
- Coding is preceded by a fairly sophisticated text standardisation phase, called “parsing”, which provides 14 different “parsing functions” (character mapping, deletion of trivial words, definition of synonyms, suffix removal, etc.). These remove grammatical or syntactical differences so that any two different descriptions with the same semantic content become identical.
- The parsed response to be coded is then compared with the parsed descriptions in the dictionary. If this search finds a perfect match, that is, a “direct match” (score = 10), a unique code is assigned; otherwise the software runs an algorithm to look for the most suitable partial matches (“indirect matching”).
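The parsing and direct-matching steps described above can be illustrated with a minimal Python sketch. It is not ACTR's implementation: the word lists, the suffix rule, the word sorting and the function names `parse` and `direct_match` are assumptions made for illustration, and only four of the 14 parsing functions are imitated. The code "33.1" is used only as an example label.

```python
import re
import unicodedata

# Hypothetical resources: ACTR's real parsing rules and dictionary
# format are not shown in the slides.
TRIVIAL_WORDS = {"of", "and", "the", "for"}
SYNONYMS = {"fixing": "repair"}
SUFFIXES = ("ing", "s")  # naive suffix removal, for illustration only


def parse(text):
    """Standardise a description so that equivalent wordings coincide."""
    # Character mapping: lower-case, strip accents, drop punctuation.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    words = re.findall(r"[a-z0-9]+", text)

    parsed = []
    for word in words:
        if word in TRIVIAL_WORDS:          # deletion of trivial words
            continue
        word = SYNONYMS.get(word, word)    # definition of synonyms
        for suffix in SUFFIXES:            # suffix removal
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        parsed.append(word)
    # Assumption: word order is ignored by sorting the distinct words.
    return " ".join(sorted(set(parsed)))


def direct_match(response, dictionary):
    """Return (code, 10) on a perfect match of parsed texts, else None."""
    code = dictionary.get(parse(response))
    return (code, 10) if code is not None else None


# Dictionary keyed by parsed description.
dictionary = {parse("Repair of pumps"): "33.1"}
print(direct_match("Pumps, repairing", dictionary))
```

A response that finds no perfect match would be handed to the partial ("indirect") matching algorithm, which this sketch does not attempt to reproduce.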

ACTR
As a result, the software returns:
- unique matches, when a unique code is assigned to a response phrase
- multiple matches, when several possible codes are proposed
- failed matches, when no matches are found
Its performance is measured through two indicators:
- Recall rate (coding rate): percentage of codes automatically assigned
- Precision rate: percentage of correct codes among those automatically assigned
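The two indicators can be computed as in the following sketch. It assumes that the recall rate counts unique matches over all responses and that the precision rate counts correct codes over unique matches; the record layout and the function name `coding_rates` are hypothetical.

```python
def coding_rates(outcomes):
    """Compute recall and precision rates, in percent.

    outcomes: list of (match_type, is_correct) pairs, where match_type is
    "unique", "multiple" or "failed", and is_correct is the expert
    coder's verdict (only meaningful for unique matches).
    """
    total = len(outcomes)
    unique = [ok for kind, ok in outcomes if kind == "unique"]
    recall = 100 * len(unique) / total
    precision = 100 * sum(1 for ok in unique if ok) / len(unique)
    return recall, precision


# Toy data: five responses, four coded automatically, three of them correctly.
outcomes = [
    ("unique", True),
    ("unique", True),
    ("unique", True),
    ("unique", False),
    ("failed", None),
]
recall, precision = coding_rates(outcomes)
print(recall, precision)
```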

Automated coding applications developed in Istat
The most important applications built in Istat, already used in different surveys, concern the following variables:
- Occupation
- Economic Activities
- Education level
- Country/Nationality
- Municipalities
The coding rate obtained for Economic Activities ranges from 50% for household surveys to 80% for business surveys.

ATECO 2007 – The new economic activities classification
ATECO 2007 is the national version of NACE Rev. 2, the European classification of economic activities.
The new NACE differs deeply from the previous one, and this has an impact on official statistics: 45% of the four-digit codes split into two or more new codes, and 35% of the five-digit codes split.

ATECO 2007 and ACTR
Updating the ACTR application for ATECO 2007 was a complex process involving several steps and problems:
- only part of the old classification at five-digit level (around 65%) translated directly into the new one; the rest had to be checked description by description,
- since the classification was very different, some descriptions were completely re-examined; in some cases it was necessary to divide old descriptions (for example, “Repair and installation of pumps”) because one part now falls under one code (Repair, group 33.1) and the other part under a different code (Installation, group 33.2),
- completely new activities were introduced,
- some old descriptions had to be deleted because they were completely obsolete (281 texts).

ATECO and ACTR
Texts in the dictionary:
ATECO ‘91: 27,306
ATECO ,745
ATECO ,587

ACTR: aims of the new application
- ACTR for surveys and Census: ACTR was already used for the 2001 Census and other surveys, so it is already tuned to the descriptions given by this type of respondent.
- ACTR for administrative sources: these descriptions differ from those of statistical surveys because they are often very long and there are no specifications or rules on how to describe the company’s activity. These texts have been treated in a specific way in order a) to obtain descriptions shorter than the original ones, and b) to delete redundancies and useless information.
- ACTR on web: a new tool that lets all users find their economic activity code.

Quality tests for ACTR 2007
To measure the quality of a procedure that has to code heterogeneous descriptions, different quality tests were planned. They differ both in the methodologies they use and in the samples they treat.
Three tests will be described: in two of them, the correctness of the codes assigned by the automatic coding application is assessed by expert coders, while in the third the assigned codes are compared with codes derived from some special surveys.

1) Quality test on descriptions of the Industry Census
A stratified random sample has been extracted from the 1,130,662 descriptions. The methodology adopted in drawing this sample optimises the analysis of results, so that very similar texts are examined only once (D’Orazio, Macchia 2002). Texts were stratified according to their frequency of occurrence; then, within each stratum, a simple random sample of texts was selected.
[Table: sample design by class of occurrences. Columns: classes of occurrences; number of original texts; number of different texts; hypothesised precision of automatic coding (π); margin of error; approximate optimal sample size; sampling fraction. Row values are not recoverable; the totals row reports 1,130,662 original texts and 228,738 different texts.]
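The sample design can be sketched as follows. The slides do not give the formula behind the "approximate optimal sample size", so this sketch assumes the standard sample size for estimating a proportion π with a given margin of error, with a finite population correction; it is not necessarily the method of D’Orazio and Macchia (2002). The function names, class bounds and sampling fractions are hypothetical.

```python
import math
import random
from collections import Counter


def sample_size(population, precision, margin, z=1.96):
    """Sample size for a proportion, with finite population correction.

    Assumed formula: n0 = z^2 * pi * (1 - pi) / e^2,
                     n  = n0 / (1 + (n0 - 1) / N).
    """
    n0 = z ** 2 * precision * (1 - precision) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


def stratified_sample(texts, upper_bounds, fractions, seed=0):
    """Stratify distinct texts by frequency of occurrence, then draw a
    simple random sample of distinct texts within each stratum.

    upper_bounds: increasing class limits; the last one should be math.inf.
    fractions: sampling fraction for each class, keyed by upper bound.
    """
    counts = Counter(texts)
    strata = {bound: [] for bound in upper_bounds}
    for text, freq in sorted(counts.items()):
        bound = next(b for b in upper_bounds if freq <= b)
        strata[bound].append(text)

    rng = random.Random(seed)
    sample = {}
    for bound, members in strata.items():
        k = min(len(members), math.ceil(fractions[bound] * len(members)))
        sample[bound] = rng.sample(members, k)
    return sample


# Toy data: two frequent texts and four texts occurring once.
texts = ["a"] * 5 + ["b"] * 5 + ["c", "d", "e", "f"]
sample = stratified_sample(texts, [1, math.inf], {1: 0.5, math.inf: 1.0})
print(sample_size(10000, 0.9, 0.02), sample)
```

Sampling distinct texts rather than original records is what lets each very similar text be examined only once, while the frequency classes keep track of how many original descriptions each sampled text represents.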

1) Quality test on descriptions of the Industry Census
Results in terms of recall rate
[Table: N, % and direct matches (score = 10) for Unique, Multiple and Failed matches and for the Total; cell values are not recoverable.]
The recall rate (78.47%) is fully satisfactory, even when analysed in detail. While the Unique matches are distributed among all the classes, there are no Failed matches in classes above 180 occurrences. In addition, 71.25% of Unique matches have a score equal to 10, which means that they are direct matches, and more than 53% of them belong to classes of occurrences greater than 91, which means that the dictionary was enriched consistently with the way respondents usually express themselves.

1) Quality test on descriptions of the Industry Census
Results in terms of precision rate
To assess the precision, the coded descriptions of this sample were submitted to expert coders.
[Table: N and % for C = Correct codes, W = Wrong codes, D = Codes assigned to descriptions impossible to code; cell values are not recoverable.]
As can be seen, the precision rate is higher than 95% and, when analysed by score, 98.09% of direct matches are correct, which is a satisfactory result. In addition, it has been verified that the percentages of correct and incorrect codes are uniformly distributed among all the classes of occurrences.

2) Quality test on descriptions of the Chambers of Commerce
To update the Business Register, Istat used different methodologies and sources:
- ACTR was used to analyse the descriptions from the Chambers of Commerce. A recall rate of 61%, corresponding to 84,117 coded descriptions, was obtained.
- Sector Studies are an administrative source that covers more than 70% of the Business Register; the quality of this source is particularly high. Sector Studies assign a five-digit code through a specific methodology that is not based on text analysis.

2) Quality test on descriptions of the Chambers of Commerce
For this test:
- The descriptions coded at the maximum level of detail were extracted from the ACTR coded dataset.
- Their codes were compared with those assigned through the Sector Studies (coinciding codes were assumed to be correct, since two different methodologies came to the same conclusion).
- The results showed that 67% of the extracted descriptions had equal codes, which can be considered a good indicator of quality.
- The quality analysis concerned the remaining descriptions, those with different codes; because of their large number (17,746 descriptions), a sample was extracted.
- Given the characteristics of these texts, it was not considered suitable to adopt the sampling strategy used for the first quality test. Instead, frequency classes of descriptions with non-coinciding codes were defined and a sample was then extracted proportionally within each class, with a margin of error of ±0.014%.

Comparison between the code assigned through ACTR and through Sector Studies
[Table: number of descriptions and quality control sample for each class: ACTR codes coinciding in the first 4 digits; in the first 3 digits; in the first 2 digits; in the first digit; not coinciding. Cell values are not recoverable; the totals are 17,746 descriptions and a quality control sample of 4,000.]
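The classes used in this comparison can be derived by counting how many leading digits two codes share, as in the sketch below. The function names and the example codes are hypothetical, and the codes are assumed to have five digits at the maximum level of detail, as stated for the Sector Studies codes.

```python
def coinciding_digits(code_a, code_b):
    """Count the leading digits shared by two codes, ignoring separators."""
    digits_a = [ch for ch in code_a if ch.isdigit()]
    digits_b = [ch for ch in code_b if ch.isdigit()]
    shared = 0
    for a, b in zip(digits_a, digits_b):
        if a != b:
            break
        shared += 1
    return shared


def agreement_class(actr_code, other_code, full_length=5):
    """Label a pair of codes with its class of agreement."""
    shared = coinciding_digits(actr_code, other_code)
    if shared >= full_length:
        return "coinciding"
    if shared == 0:
        return "not coinciding"
    plural = "s" if shared > 1 else ""
    return f"coinciding first {shared} digit{plural}"


# Illustrative codes only, not taken from the test data.
print(agreement_class("33.12.1", "33.12.9"))
print(agreement_class("33.12.1", "41.20.0"))
```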

2) Quality test on descriptions of the Chambers of Commerce
Precision rate
To assess the precision, the coded descriptions of this sample were submitted to expert coders, who classified them as:
- (A) correct codes according to ACTR
- (C) correct codes according to Chambers of Commerce
- (E) wrong codes according to both methodologies
- (D) doubtful codes according to both methodologies
[Table: n and % of A, C, E and D in the quality control sample (4,000 descriptions in total) for each class: coinciding first 4 digits; first 3 digits; first 2 digits; first digit; not coinciding. Cell values are not recoverable.]
As can be seen, the precision rate is high, from 80% to 94% in all classes, apart from the class coinciding only in the first digit. This is because that class is widely populated with very generic descriptions from the Construction sector, which has been strongly revised in the new classification.

3) Quality test on special surveys
When ATECO 2007 was almost finalised, it became evident that specific surveys were needed in those sectors where information was not available or the activities included were completely new.
In particular, it was decided to send a questionnaire to enterprises in the fields of:
- Information and Communication (section J);
- Architectural and engineering activities; technical testing and analysis (division 71);
- Research and experimental development on natural sciences and engineering (group 72.1);
- Specialised design activities (group 74.1);
- Services to buildings and landscape activities (division 81);
- Other professional, scientific and technical activities n.e.c.; Office administrative and support activities; Business support services activities n.e.c. (74.9; 82.1; 82.9).

3) Quality test on special surveys
The surveys were sent to around 45 thousand enterprises: all the enterprises with more than 10 employees and a sample of the smallest enterprises (1 – 9 employees).
The questionnaires were very simple. At the beginning of every questionnaire, a description of the economic activity no longer than 200 bytes was required.

3) Quality test on special surveys
The response rate was around 30%. For the quality test on these activities, only the questionnaires to which an ATECO code could be attributed by analysing the answers to the survey were considered (around 52%).
The coding rate was not very good (44.5%), but this was not considered a failure, since the survey concerned specific sectors for which it was already known that the dictionary had to be enriched. On the other hand, the precision rate was quite high (88.2%).
The main purpose of the survey was to enrich the ACTR dictionary in order to improve its performance.

Conclusions and results
- The performance of the application is satisfactory in terms of both recall rate and precision rate.
- The ACTR on WEB tool is proving very successful (an average of 9,000 queries in the first weeks).
- The dictionary continues to be enriched using both the descriptions entered in ACTR on WEB and those written in the special survey questionnaires.