Sicore The Insee Automatic Coding System François Bulot April 22, 2003.

Plan
- Introduction
- The knowledge bases
- How does the Sicore system work?
- An adequate management structure
- Some important results
- The software package
- Surveys

The Sicore project
- Launched in 1993 by Pascal Rivière
- Written by Éric Meyer and Bruno Berlemont
- Completed in May 1996
- Then taken over successively by Pierrette Schuhl, Frédérique Deschamps and François Bulot

The four main objectives of Sicore
- Build knowledge bases for the variables that can evolve over time
- Create an adequate management structure
- Write a generalized software package:
  - user-friendly
  - usable for any variable
  - usable for any language
- Provide a documented methodology

The knowledge bases: 4 kinds of information
- The reference file: texts and their codes
- The normalization rules:
  - maximum number of words
  - maximum length of each word
  - empty (and blank) characters
  - empty words
  - synonyms

The knowledge bases: 4 kinds of information (continued)
- The logical rules: additional variables
- The parameters of the learning algorithm, which describe:
  - the structure of the reference file
  - how to split the words and build the coding tree

How does the Sicore system work?
- First, the learning phase
- Second, the coding phase

1 - The learning phase: two steps to build the coding tree
- Step 1: normalization of the reference file
  - Remove empty characters
  - Remove empty words
  - Replace words (or groups of words) by their synonyms
  - Limit the number of words and the length of each word
  - Split each word into two-character pieces (bigrams)
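The normalization step above can be sketched in a few lines of Python. This is only an illustration, not Sicore's own code (Sicore is written in C). The rule values, the upper-casing, the choice to replace empty characters with spaces, and the names `normalize` and `bigrams` are all assumptions made for the example.

```python
# Hypothetical mini knowledge base; every value here is illustrative only.
EMPTY_CHARS = "'()-_,/"
EMPTY_WORDS = {"DE", "LA", "DU"}
SYNONYMS = {"PROF": "PROFESSEUR"}
MAX_WORDS = 5
MAX_WORD_LEN = 8


def normalize(text):
    """Return the list of normalized words for one wording."""
    # Assumption: empty characters become spaces so that words stay separated.
    cleaned = "".join(" " if c in EMPTY_CHARS else c for c in text.upper())
    words = cleaned.split()
    # Drop empty words, then replace the remaining words by their synonyms.
    words = [SYNONYMS.get(w, w) for w in words if w not in EMPTY_WORDS]
    # Limit the number of words and the length of each word.
    return [w[:MAX_WORD_LEN] for w in words[:MAX_WORDS]]


def bigrams(word):
    """Split a word into consecutive, non-overlapping two-character pieces."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


words = normalize("prof de mathematiques (lycee)")
pieces = [bigrams(w) for w in words]
```

Only single-word synonyms are handled here; replacing groups of words, as the slide mentions, would need a phrase-level lookup.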

Example

1 - The learning phase: two steps to build the coding tree (continued)
- Step 2: to build the coding tree, Sicore:
  - takes the normalized reference file as input
  - finds the word-piece position that carries the largest amount of information (Shannon information)
  - builds all the branches that correspond to this position
  - for each branch, again finds the position that carries the largest amount of information
  - builds the next branches
  - repeats this process until each branch uniquely identifies a code
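A minimal sketch of this tree construction, under stated assumptions: "amount of information" is taken to be the Shannon entropy of the pieces observed at a position, each text is a fixed-length tuple of pieces, and the node layout (`position`, `branches`, `code`, `codes`) is invented for the example. The slides do not give Sicore's exact criterion or data structures.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def build_tree(rows, positions):
    """rows: list of (pieces, code); pieces are equal-length tuples."""
    codes = {code for _, code in rows}
    if len(codes) == 1:
        # The branch uniquely identifies a code: stop here.
        return {"code": codes.pop()}
    if not positions:
        # No position left to split on: the leaf stays ambiguous.
        return {"codes": sorted(codes)}
    # Pick the position whose pieces carry the most information.
    best = max(positions, key=lambda p: entropy([pieces[p] for pieces, _ in rows]))
    groups = defaultdict(list)
    for pieces, code in rows:
        groups[pieces[best]].append((pieces, code))
    rest = [p for p in positions if p != best]
    return {
        "position": best,
        "branches": {piece: build_tree(group, rest) for piece, group in groups.items()},
    }


# Toy reference file: three wordings of three pieces each, codes are made up.
rows = [
    (("BO", "UL", "AN"), "A"),
    (("BO", "UC", "HE"), "B"),
    (("BO", "UL", "ON"), "C"),
]
tree = build_tree(rows, [0, 1, 2])
```

In this toy file the first position is useless (every text starts with "BO"), so the sketch splits on the third position, which separates all three codes in one step.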

Example

2 - The coding phase
- Normalization of the file to be coded
- Pattern recognition algorithm: determines a code using the coding tree
  - Failure: the pattern of the text is not recognized => no code
  - Complete success: the pattern is recognized and a code is obtained
  - Partial success: the pattern is recognized but the text is too ambiguous
- The decision step for a complete success: set of logical rules and additional variables => code
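The three outcomes and the decision step might look like the sketch below. The tree layout, the codes ("X1", "Y1", ...), the rule table and the function names `walk` and `decide` are hypothetical; the slides only state that logical rules and additional variables produce the final code.

```python
# Hypothetical coding tree: each inner node tests the piece at one position.
TREE = {
    "position": 0,
    "branches": {
        "BO": {
            "position": 1,
            "branches": {"UL": {"code": "X1"}, "UC": {"code": "X2"}},
        },
        "IN": {"codes": ["Y1", "Y2"]},  # ambiguous leaf
    },
}

# Hypothetical logical rules: (code from the tree, additional variable) -> final code.
RULES = {("X2", "self-employed"): "X2a"}


def walk(tree, pieces):
    """Follow the tree; return (status, result) for one normalized text."""
    node = tree
    while "position" in node:
        pos = node["position"]
        piece = pieces[pos] if pos < len(pieces) else ""
        node = node["branches"].get(piece)
        if node is None:
            return ("failure", None)       # pattern not recognized, no code
    if "code" in node:
        return ("complete", node["code"])  # a single code is obtained
    return ("partial", node["codes"])      # recognized, but too ambiguous


def decide(code, additional_variable):
    """Decision step after a complete success: apply the logical rules."""
    return RULES.get((code, additional_variable), code)


status, code = walk(TREE, ("BO", "UC"))
final_code = decide(code, "self-employed") if status == "complete" else None
```

Returning the candidate codes for a partial success is a design choice of this sketch; it leaves room for a human coder or an interactive tool to pick among them.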

Sicore circle

An adequate management structure
- To ensure that the knowledge bases are regularly updated
  - The variable expert, the Sicore expert
- To properly incorporate automatic coding into survey data processing
  - To ensure that all (3) concerned parties join forces to attain the common goal

The documented methodology
- As of now, 3 documents have been written:
  - the user's guide
  - a dictionary of the important words and concepts
  - the methodology guide: how Sicore works, how to construct the knowledge bases, how to verify the coherence of the knowledge bases
- The programmer's guide
- At the moment, only in French

Surveys coded for Occupation
- All INSEE surveys since the last Census (1999): surveys on living conditions (PCV), Household Consumption Survey, Health Survey, Continuous Employment Survey (LFS)…
- Before: PCV from 1997, the survey on household assets (patrimoine), test for the national Census (1997)
- Many regional surveys
- Surveys for other national organisations

Other variables
- Communes for the national Census
- Nationalities/countries for the Census
- Diploma and training levels
- Activities for the Time Use Survey
- Consumption products and shops for the Household Consumption Survey
- Geocoding on Réunion Island
- Activities of firms (4 sources: agriculture, the administrative body responsible for collecting social security payments, Chamber of Commerce, Guild Chamber)

The use for the French National Census in 1999
- Batch process
  - "Light" run: communes of study, of previous residence and of workplace; country of birth and of previous residence; nationality
  - "Heavy" run: present and previous occupations
- Interactive process
  - Pick-up coding for the present and previous occupations

News relating to Sicore
- Pick-up activities:
  - Occupation for the Census
  - Occupation for the EEC
  - Diploma/training level for the EEC
  - Occupation for the Health Survey and all the surveys with the common trunk
- Sicore under CAPI/BLAISE

Sicore’s main criteria
- Three criteria to be examined together:
  - Efficiency: percentage of records that are automatically coded
  - Accuracy: percentage of coded records that are correctly coded
  - Speed: average time to code one record
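As a worked example, the three criteria can be computed from a batch of coded records as follows. The function name `criteria`, the record layout and the sample data are assumptions for illustration; in particular, measuring accuracy requires a trusted "true" code for each record, for instance from manual checking.

```python
def criteria(results, elapsed_seconds):
    """Return (efficiency, accuracy, speed) for one coding run.

    results: list of (assigned_code, true_code); assigned_code is None
    when the record could not be coded automatically.
    """
    if not results:
        raise ValueError("no records to evaluate")
    total = len(results)
    coded = [(assigned, true) for assigned, true in results if assigned is not None]
    # Efficiency: share of records that received a code automatically.
    efficiency = len(coded) / total
    # Accuracy: share of the coded records whose code is correct.
    accuracy = sum(assigned == true for assigned, true in coded) / len(coded) if coded else 0.0
    # Speed: average time, in seconds, to code one record.
    speed = elapsed_seconds / total
    return efficiency, accuracy, speed


# Made-up sample: four records processed in two seconds.
sample = [("A", "A"), ("B", "C"), (None, "D"), ("E", "E")]
efficiency, accuracy, speed = criteria(sample, 2.0)
```

Note that accuracy is measured over coded records only, which is why the criteria have to be read together: a system can raise accuracy simply by refusing to code difficult texts, at the cost of efficiency.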

Occupations base
- Reference file: [...] lines; text = occupation + rank
- Normalization rules:
  - 10 empty characters: '()-_,/\+:
  - 299 empty words: "dand", "chevronné", "SMIG", ...
  - Synonyms: 2684 expressions, 775 synonyms
- Parameters of the learning phase: 5 words ( ), 8 priority bigrams, 3 redundancy bigrams
- Logical rules: 14 additional variables, 2933 tables, 524 codes
- Learning phase time: 8 seconds

Communes base
- Reference file: [...] lines (base: official geographical code)
- Normalization rules:
  - 8 empty characters: '()-_,/*
  - 58 empty words: "district", "canton", "cedex", ...
  - Synonyms: 126 expressions, 35 synonyms
- Parameters of the learning phase: 5 words ( ), 4 priority bigrams, no redundancy bigram
- Logical rules: 1 additional variable = date, 2291 tables, 4021 codes
- Learning phase time: 2 seconds

Countries (nationalities) base
- Reference file: 1542 lines
- Normalization rules:
  - 7 empty characters: '()-_,/
  - 29 empty words: "democratic", "republic", ...
  - Synonyms: 42 expressions, 14 synonyms
- Parameters of the learning phase: 4 words ( ), 3 priority bigrams, 2 redundancy bigrams
- Logical rules: none
- Learning phase time: less than 1 second

Several speeds
- Occupation (EEC): about 900 wordings per second
- Occupation (common trunk): about 1000 wordings per second
- Activities of Time Use: about 1700 wordings per second
- Commune: about 7000 wordings per second
- Nationality: about [...] wordings per second

Efficiencies for Occupation
- For the National Census ("Heavy" run):
  - Present occupation: 56.6% coded
  - Former occupation: 83.7%
- For the EEC (LFS): 80%
- For the household surveys (common trunk): between 75 and 80% of non-empty wordings

Efficiencies for other variables
- For the national Census: communes of place of work, of study or of previous home: 98.5%
- Countries/nationalities: 98.9%
- Time Use activities: 90%
- Household Consumption Survey:
  - Till receipts: 69.5%
  - Consumption board (other purchases): 75.3%
  - Shops: 91.8%
- Diploma (EEC): 90%

The software package
- Independent of the language and the variables used
- Written in C
- Available on PCs running Windows or Windows NT
- Also runs on IBM/MVS mainframes and Unix workstations, except for the expert interface
- 3 parts: the expert interface, the application program interface (API) package, and the object modules and include files package

Conclusion: the important elements
- Separation between software and knowledge bases
- A quick learning phase
- Many parameters
- Specific tools to help experts
- The use of local and global criteria
- Distinction between learning and coding phases
- Independence vis-à-vis variables and languages
- And only one piece of software to maintain