Published by Mark Bishop. Modified over 9 years ago.
1
Sicore: The Insee Automatic Coding System. François Bulot, April 22, 2003
2
Plan
- Introduction
- The knowledge bases
- How does the Sicore system work?
- An adequate management structure
- Some important results
- The software package
- Surveys
3
The Sicore project
- Launched in 1993 by Pascal Rivière
- Written by Éric Meyer and Bruno Berlemont
- Finished in May 1996
- Then managed successively by Pierrette Schuhl, Frédérique Deschamps and François Bulot
4
The four main objectives of Sicore
- Construct evolving knowledge bases for the variables
- Create an adequate management structure
- Write a generalized software package
  - User-friendly
  - For any variable
  - For any language
- Provide a documented methodology
5
The knowledge bases: 4 kinds of information
- The reference file: texts and codes
- The normalization rules
  - Maximum number of words
  - Maximum length of each word
  - Empty (and blank) characters
  - Empty words
  - Synonyms
6
The knowledge bases: 4 kinds of information (continued)
- The logical rules: additional variables
- The parameters of the learning algorithm, which describe:
  - the structure of the reference file
  - how to split the words and build the coding tree
7
How does the Sicore system work?
- First, the learning phase
- Second, the coding phase
8
1 - The learning phase: two steps to build the coding tree
The normalization step of the reference file:
- Remove empty characters
- Remove empty words
- Replace words (or groups of words) by their synonyms
- Limit the number of words and the length of each word
- Split each word into pieces of two characters: bigrams
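The normalization steps listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the Sicore implementation (which is written in C). The empty characters, empty words, synonym and limits below are invented values, and the names `normalize` and `bigrams` are hypothetical. Two further assumptions: empty characters are replaced by blanks so that they act as word separators, and only single-word synonyms are handled, whereas Sicore also replaces groups of words.

```python
# Hypothetical knowledge-base fragments, invented for illustration only.
EMPTY_CHARS = "'()-_,/"
EMPTY_WORDS = {"DE", "LA", "EN"}
SYNONYMS = {"INSTIT": "INSTITUTEUR"}
MAX_WORDS = 3
MAX_WORD_LENGTH = 12


def normalize(text):
    """Return the list of normalized words for one wording."""
    text = text.upper()
    # Assumption: empty characters become blanks, so they separate words.
    text = text.translate({ord(c): " " for c in EMPTY_CHARS})
    # Remove empty words.
    words = [w for w in text.split() if w not in EMPTY_WORDS]
    # Replace words by their synonyms (single words only in this sketch).
    words = [SYNONYMS.get(w, w) for w in words]
    # Limit the number of words and the length of each word.
    return [w[:MAX_WORD_LENGTH] for w in words[:MAX_WORDS]]


def bigrams(word):
    """Split a word into consecutive, non-overlapping pieces of two characters."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]
```

With these invented rules, `normalize("instit de la maternelle (remplacant)")` yields the three words `INSTITUTEUR`, `MATERNELLE` and `REMPLACANT`, and `bigrams("INSTITUTEUR")` splits the first one into `IN`, `ST`, `IT`, `UT`, `EU`, `R`.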
9
Example
10
1 - The learning phase: two steps to build the coding tree (continued)
To build the coding tree, Sicore:
- Takes the normalized reference file as input
- Finds the position of the word piece that gives the largest amount of information (Shannon information)
- Builds all the branches that correspond to this position
- For each branch, again finds the position that gives the largest amount of information
- Builds the next branches
- Repeats this process until each branch uniquely identifies a code
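The tree-building loop described on this slide can be sketched as a recursive split on bigram positions. It is a minimal sketch under stated assumptions, not Sicore's actual algorithm. "Amount of information" is read here as the Shannon entropy of the bigrams observed at a position, every entry is assumed to have the same number of pieces, and the names `entropy` and `build_tree` and the dictionary layout of the tree are hypothetical. The priority and redundancy bigrams mentioned elsewhere in the talk are not modelled.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    total = len(values)
    return -sum(n / total * math.log2(n / total) for n in Counter(values).values())


def build_tree(entries, used=()):
    """Build a coding tree from (pieces, code) pairs.

    pieces is a tuple of bigrams; all tuples are assumed to have equal length.
    Returns a leaf {"codes": [...]} or a node {"position": p, "branches": {...}}.
    """
    codes = sorted({code for _, code in entries})
    free = [p for p in range(len(entries[0][0])) if p not in used]
    # Stop when the branch identifies a single code or no position is left.
    if len(codes) == 1 or not free:
        return {"codes": codes}
    # Pick the unused position whose bigrams carry the most information.
    best = max(free, key=lambda p: entropy([pieces[p] for pieces, _ in entries]))
    groups = defaultdict(list)
    for pieces, code in entries:
        groups[pieces[best]].append((pieces, code))
    return {
        "position": best,
        "branches": {piece: build_tree(group, used + (best,))
                     for piece, group in groups.items()},
    }
```

A leaf that still holds several codes means the reference entries could not be separated by any remaining position. In this sketch such a leaf is kept as is rather than resolved.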
11
Example
12
2 - The coding phase
- Normalization of the file to be coded
- Pattern recognition algorithm: determines a code using the coding tree
  - Failure: the pattern of the text is not recognized => no code
  - Complete success: the pattern is recognized and a code is obtained
  - Partial success: the pattern is recognized but the text is too ambiguous
- The decision step for the complete success: set of logical rules and additional variables => code
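The three outcomes of the pattern recognition step can be illustrated by walking a small hand-written tree. This is a sketch only. The tree, its bigrams and its codes are invented, `code_text` is a hypothetical name, and treating "partial success" as reaching a leaf that holds several candidate codes is an assumption about what "too ambiguous" means. The decision step with logical rules and additional variables is not modelled.

```python
# Hypothetical hand-written coding tree; bigrams and codes are invented.
TREE = {
    "position": 1,
    "branches": {
        "ST": {"codes": ["A"]},
        "FI": {"codes": ["B"]},
        "UL": {"codes": ["C", "D"]},
    },
}


def code_text(pieces, tree=TREE):
    """Walk the tree with the bigrams of one normalized text.

    Returns (status, codes) with status in {"failure", "complete", "partial"}.
    """
    node = tree
    while "branches" in node:
        position = node["position"]
        piece = pieces[position] if position < len(pieces) else None
        node = node["branches"].get(piece)
        if node is None:
            # No branch matches this piece: the pattern is not recognized.
            return "failure", []
    codes = node["codes"]
    return ("complete" if len(codes) == 1 else "partial"), codes
```

For example, `code_text(("IN", "ST", "IT"))` reaches a single-code leaf, `code_text(("BO", "UL"))` ends on a leaf with two candidates, and `code_text(("XX", "YY"))` finds no matching branch.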
13
Sicore circle
14
An adequate management structure
- To ensure that the knowledge bases are regularly updated: the variable expert, the Sicore expert
- To properly incorporate automatic coding in survey data processing
- To ensure that all concerned parties (3) join forces to attain the common goal
15
The documented methodology As of now : 3 documents written The user's guide A dictionary with the important words and concepts The methodology guide : how Sicore works, how to construct the knowledge bases, how to verify the coherence of the knowledge bases The programmer's guide At the moment, only in French
16
Surveys coded for Occupation
- All INSEE surveys since the last Census (1999): surveys on living conditions (PCV), Household Consumption Survey, Health Survey, Continuous Employment Survey (LFS)…
- Before: PCV from 1997, the household wealth survey, test for the national Census (1997)
- Many regional surveys
- Surveys for other national organisations
17
Other variables
- Communes for the national Census
- Nationalities/countries for the Census
- Diploma and training levels
- Activities for the Time Use Survey
- Consumption products and shops for the Household Consumption Survey
- Geocoding in the Réunion Island
- Activities of the firms (4 sources: agriculture, administrative body responsible for collecting social security payments, Chamber of Commerce, Guild Chamber)
18
The use for the French National Census in 1999
- Batch process
  - "Slight" run: communes of study, of the previous place of residence, of the workplace; country of birth, of the previous place of residence; nationality
  - "Heavy" run: present and previous occupations
- Interactive process
  - Pick-up codification for the present and the previous occupations
19
News relating to Sicore
- Pick-Up Activities:
  - Occupation for the Census
  - Occupation for the EEC
  - Diploma/training level for the EEC
  - Occupation for the Health Survey and all the surveys with the common trunk
- Sicore under CAPI/BLAISE
20
Sicore's main criteria
Three criteria to be examined together:
- The efficiency: percentage of records that are automatically coded
- The accuracy: percentage of coded records that are correctly coded
- The speed: average time to code one record
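The three criteria can be computed from an evaluation sample along the following lines. This is a sketch under assumptions: each record is taken to carry the code assigned automatically (or `None` when no code was assigned) together with a reference code from a manual check, and the function name `criteria` is hypothetical.

```python
def criteria(records, elapsed_seconds):
    """Compute (efficiency %, accuracy %, seconds per record).

    records: list of (assigned_code, reference_code) pairs, where
    assigned_code is None when the record was not coded automatically.
    """
    coded = [(assigned, ref) for assigned, ref in records if assigned is not None]
    # Efficiency: share of all records that received a code.
    efficiency = 100.0 * len(coded) / len(records)
    # Accuracy: share of coded records whose code matches the reference.
    accuracy = (100.0 * sum(assigned == ref for assigned, ref in coded) / len(coded)
                if coded else None)
    # Speed: average time to code one record.
    speed = elapsed_seconds / len(records)
    return efficiency, accuracy, speed
```

Accuracy is measured on coded records only, which is why the criteria have to be read together: a system can reach high accuracy simply by declining to code the difficult records, at the cost of efficiency.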
21
Occupations base
- Reference file: 26784 lines; text = occupation + rank
- Normalization rules:
  - 10 empty characters: '()-_,/\+:
  - 299 empty words: "dand", "chevronné", "SMIG"...
  - Synonyms: 2684 expressions, 775 synonyms
- Parameters of the learning phase:
  - 5 words (2 - 12 - 12 - 12 - 12)
  - 8 priority bigrams, 3 redundancy bigrams
- Logical rules:
  - 14 additional variables
  - 2933 tables
  - 524 codes
- Learning phase time: 8 seconds
22
Communes base
- Reference file: 49006 lines (base: geographical official code)
- Normalization rules:
  - 8 empty characters: '()-_,/*
  - 58 empty words: "district", "canton", "cedex",...
  - Synonyms: 126 expressions, 35 synonyms
- Parameters of the learning phase:
  - 5 words (2 - 14 - 12 - 12 - 12)
  - 4 priority bigrams, no redundancy bigram
- Logical rules:
  - 1 additional variable = date
  - 2291 tables
  - 4021 codes
- Learning phase time: 2 seconds
23
Countries (nationalities) base
- Reference file: 1542 lines
- Normalization rules:
  - 7 empty characters: '()-_,/
  - 29 empty words: "democratic", "republic",...
  - Synonyms: 42 expressions, 14 synonyms
- Parameters of the learning phase:
  - 4 words (12 - 12 - 12 - 12)
  - 3 priority bigrams, 2 redundancy bigrams
- Logical rules: none
- Learning phase time: less than 1 second
24
Several speeds
- Occupation (EEC): about 900 wordings per second
- Occupation (Common Trunk): about 1000 wordings per second
- Activities of Time Use: about 1700 wordings per second
- Commune: about 7000 wordings per second
- Nationality: about 25000 wordings per second
25
Efficiencies for the Occupation
- For the National Census ("Heavy" run):
  - Present Occupation: 56.6% coded
  - Former Occupation: 83.7%
- For the EEC (LFS): 80%
- For the household surveys (common trunk): between 75 and 80% of non-empty wordings
26
Efficiencies for other variables
- For the national Census: communes of place of work, of study or previous home: 98.5%
- Countries/nationalities: 98.9%
- Time Use activities: 90%
- Household Consumption Survey:
  - Till receipts: 69.5%
  - Consumption board (other purchases): 75.3%
  - Shops: 91.8%
- Diploma (EEC): 90%
27
The software package
- Independence of the language and the variables used
- Written in C
- Available on PC with Windows or Windows NT
- Works on IBM/MVS mainframes and on Unix workstations, excluding the expert interface
- 3 parts: the expert interface, the application program interface (A.P.I.) package, the object modules and include files package
28
Conclusion, the important elements
- Separation between software and knowledge bases
- A quick learning phase
- Many parameters
- Specific tools to help experts
- The use of local and global criteria
- Distinction between learning and coding phases
- Independence vis-à-vis variables and languages
- And only one piece of software to maintain