Sicore The Insee Automatic Coding System François Bulot April 22, 2003.


1 Sicore: The Insee Automatic Coding System
François Bulot, April 22, 2003

2 Outline
- Introduction
- The knowledge bases
- How does the Sicore system work?
- An adequate management structure
- Some important results
- The software package
- Surveys

3 The Sicore project
- Launched in 1993 by Pascal Rivière
- Written by Éric Meyer and Bruno Berlemont
- Finished in May 1996
- Subsequently overseen, in turn, by Pierrette Schuhl, Frédérique Deschamps and François Bulot

4 The four main objectives of Sicore
- Build knowledge bases for the variables that can evolve over time
- Create an adequate management structure
- Write a generalized software package
  - User-friendly
  - For any variable
  - For any language
- Provide a documented methodology

5 The knowledge bases: 4 kinds of information
- The reference file: texts and their codes
- The normalization rules
  - Maximum number of words
  - Maximum length of each word
  - Empty (and blank) characters
  - Empty words
  - Synonyms

6 The knowledge bases: 4 kinds of information (continued)
- The logical rules: additional variables
- The parameters of the learning algorithm, which describe:
  - the structure of the reference file
  - how to split the words and build the coding tree

7 How does the Sicore system work?
- First, the learning phase
- Second, the coding phase

8 1 - The learning phase: two steps to build the coding tree
- The normalization step of the reference file
  - Remove empty characters
  - Remove empty words
  - Replace words (or groups of words) by their synonyms
  - Limit the number of words and the length of each word
  - Split each word into pieces of two characters: bigrams
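The normalization steps listed on this slide can be sketched in Python as follows. This is a minimal illustration, not Sicore's own C implementation: the empty characters, empty words, synonym table and limits are made-up sample values, the function name `normalize` is hypothetical, and replacing empty characters by blanks (rather than deleting them outright) is an assumption.

```python
# Illustrative sample values only; they are NOT taken from a real Sicore knowledge base.
EMPTY_CHARS = "'()-_,/"
EMPTY_WORDS = {"THE", "OF", "A"}
SYNONYMS = {"PRIMARY SCHOOL TEACHER": "SCHOOLTEACHER"}
MAX_WORDS = 4
MAX_WORD_LEN = 12


def normalize(text):
    """Return the text as a list of words, each word split into bigrams."""
    text = text.upper()
    # 1. Empty characters: replaced by blanks here (an assumption).
    text = text.translate({ord(c): " " for c in EMPTY_CHARS})
    # 2. Remove empty words.
    words = [w for w in text.split() if w not in EMPTY_WORDS]
    # 3. Replace words or groups of words by their synonyms.
    padded = " " + " ".join(words) + " "
    for expression, synonym in SYNONYMS.items():
        padded = padded.replace(" " + expression + " ", " " + synonym + " ")
    words = padded.split()
    # 4. Limit the number of words and the length of each word.
    words = [w[:MAX_WORD_LEN] for w in words[:MAX_WORDS]]
    # 5. Split each word into pieces of two characters (bigrams).
    return [[w[i:i + 2] for i in range(0, len(w), 2)] for w in words]


print(normalize("Primary school teacher (of the village)"))
```

With these sample rules, the three-word expression is collapsed into one synonym, which is then truncated to 12 characters before being cut into bigrams.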

9 Example

10 1 - The learning phase: two steps to build the coding tree (continued)
- To build the coding tree, Sicore:
  - Takes the normalized reference file as input
  - Computes the position of the word piece that carries the largest amount of information (Shannon information)
  - Builds all the branches that correspond to this position
  - For each branch, computes again the next position that carries the largest amount of information
  - Builds the next branches
  - Repeats this process until each branch uniquely identifies a code
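The tree construction described on this slide can be illustrated with a short Python sketch. It rests on several assumptions that the slides do not confirm: "amount of information" is read as the Shannon entropy of the bigrams observed at a given position, every reference text is a flat tuple of bigrams of equal length, ties go to the lowest position, and the names `entropy` and `build_tree` as well as the codes "A", "B", "C" are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    total = len(values)
    return -sum(n / total * math.log2(n / total) for n in Counter(values).values())


def build_tree(entries, unused=None):
    """entries: list of (bigram_tuple, code); all tuples have the same length."""
    if unused is None:
        unused = set(range(len(entries[0][0])))
    codes = {code for _, code in entries}
    if len(codes) == 1:
        # The branch uniquely identifies a code: stop here.
        return {"code": codes.pop()}
    if not unused:
        # No position left to split on: several codes remain possible.
        return {"ambiguous": sorted(codes)}
    # Pick the position whose bigrams carry the most information.
    best = max(sorted(unused), key=lambda p: entropy([b[p] for b, _ in entries]))
    branches = defaultdict(list)
    for bigrams, code in entries:
        branches[bigrams[best]].append((bigrams, code))
    return {
        "position": best,
        "children": {value: build_tree(sub, unused - {best})
                     for value, sub in branches.items()},
    }


reference = [
    (("BA", "KE", "R"), "A"),   # BAKER
    (("BA", "NK", "ER"), "B"),  # BANKER
    (("BU", "TC", "HE"), "C"),  # BUTCHE(R), truncated for the example
]
tree = build_tree(reference)
print(tree)
```

In this toy reference, position 0 only separates "BA" from "BU", whereas position 1 takes three distinct values, so the sketch splits on position 1 first and each branch already identifies a single code.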

11 Example

12 2 - The coding phase
- Normalization of the file to be coded
- Pattern recognition algorithm: determines a code using the coding tree
- Failure: the pattern of the text is not recognized => no code
- Complete success: the pattern is recognized and a code is obtained
- Partial success: the pattern is recognized but the text is too ambiguous
- The decision step for the complete success: set of logical rules and additional variables => code
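The three outcomes of the pattern recognition step can be sketched as a walk down a coding tree. The tree layout (nested dictionaries with `position`, `children`, `code` and `ambiguous` keys), the function name `code_text` and the sample tree are all assumptions made for illustration; the decision step with logical rules and additional variables is left out of this sketch.

```python
# Hypothetical coding tree: split on the bigram found at position 1.
TREE = {
    "position": 1,
    "children": {
        "KE": {"code": "A"},
        "NK": {"code": "B"},
        "TC": {"ambiguous": ["C", "D"]},
    },
}


def code_text(tree, bigrams):
    """Return (status, result) for a normalized text given as a list of bigrams."""
    node = tree
    while "position" in node:
        pos = node["position"]
        value = bigrams[pos] if pos < len(bigrams) else None
        if value not in node["children"]:
            # The pattern of the text is not recognized: no code.
            return ("failure", None)
        node = node["children"][value]
    if "code" in node:
        # The pattern is recognized and leads to a single code.
        return ("complete", node["code"])
    # The pattern is recognized but several codes remain possible.
    return ("partial", node["ambiguous"])


print(code_text(TREE, ["BA", "NK", "ER"]))
print(code_text(TREE, ["BU", "TC", "HE"]))
print(code_text(TREE, ["PI", "LO", "T"]))
```

Only the positions stored in the tree are inspected, so a text can be coded without comparing it against every line of the reference file.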

13 Sicore circle

14 An adequate management structure
- To ensure that the knowledge bases are regularly updated
  - The variable expert, the Sicore expert
- To properly incorporate automatic coding into survey data processing
  - To ensure that all concerned parties (3) join forces to attain the common goal

15 The documented methodology
- As of now: 3 documents written
  - The user's guide
  - A dictionary of the important words and concepts
  - The methodology guide: how Sicore works, how to construct the knowledge bases, how to verify the consistency of the knowledge bases
- The programmer's guide
- At the moment, only in French

16 Surveys coded for Occupation
- All INSEE surveys since the last Census (1999): surveys on living conditions (PCV), Household Consumption Survey, Health Survey, Continuous Employment Survey (LFS)…
- Earlier: PCV from 1997, the household wealth survey, test for the national Census (1997)
- Many regional surveys
- Surveys for other national organisations

17 Other variables
- Communes for the national Census
- Nationalities/countries for the Census
- Diploma and training levels
- Activities for the Time Use Survey
- Consumption products and shops for the Household Consumption Survey
- Geocoding on Réunion Island
- Activities of firms (4 sources: agriculture, administrative body responsible for collecting social security payments, Chamber of Commerce, Guild Chamber)

18 The use for the French National Census in 1999
- Batch process
  - "Light" run: communes of study, of previous residence and of the workplace; countries of birth and of previous residence; nationality
  - "Heavy" run: present and previous occupations
- Interactive process
  - Pick-up coding for the present and previous occupations

19 News relating to Sicore
- Pick-Up Activities:
  - Occupation for the Census
  - Occupation for the EEC
  - Diploma/training level for the EEC
  - Occupation for the Health Survey and all the surveys with the common trunk
- Sicore under CAPI/BLAISE

20 Sicore's main criteria
- Three criteria to be examined together:
  - Efficiency: percentage of records that are automatically coded
  - Accuracy: percentage of coded records that are correctly coded
  - Speed: average time to code one record
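As a worked illustration of these three criteria, the sketch below computes them for a small batch of records. It assumes that a reference code (for example from manual checking) is available for each record, that an uncoded record is represented by `None`, and that the elapsed time is measured by the caller; the function name `evaluate` and the sample data are hypothetical.

```python
def evaluate(records, elapsed_seconds):
    """records: list of (assigned_code, reference_code); assigned_code is None if uncoded.

    Returns (efficiency in %, accuracy in %, average seconds per record).
    """
    coded = [(assigned, ref) for assigned, ref in records if assigned is not None]
    # Efficiency: share of all records that received a code automatically.
    efficiency = 100 * len(coded) / len(records)
    # Accuracy: share of the coded records whose code matches the reference.
    correct = sum(assigned == ref for assigned, ref in coded)
    accuracy = 100 * correct / len(coded) if coded else 0.0
    # Speed: average time spent per record, coded or not.
    speed = elapsed_seconds / len(records)
    return efficiency, accuracy, speed


sample = [("A", "A"), ("B", "A"), (None, "C"), ("C", "C")]
efficiency, accuracy, speed = evaluate(sample, elapsed_seconds=0.004)
print(efficiency, accuracy, speed)
```

The sample shows why the criteria need to be read together: three of four records are coded (75% efficiency), but only two of those three are correct, so accuracy is about 67%.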

21 Occupations base
- Reference file: 26784 lines; text = occupation + rank
- Normalization rules:
  - 10 empty characters: '()-_,/\+:
  - 299 empty words: "dand", "chevronné", "SMIG"...
  - Synonyms: 2684 expressions, 775 synonyms
- Parameters of the learning phase:
  - 5 words (2 - 12 - 12 - 12 - 12)
  - 8 priority bigrams, 3 redundancy bigrams
- Logical rules: 14 additional variables, 2933 tables, 524 codes
- Learning phase time: 8 seconds

22 Communes base
- Reference file: 49006 lines (base: geographical official code)
- Normalization rules:
  - 8 empty characters: '()-_,/*
  - 58 empty words: "district", "canton", "cedex",...
  - Synonyms: 126 expressions, 35 synonyms
- Parameters of the learning phase:
  - 5 words (2 - 14 - 12 - 12 - 12)
  - 4 priority bigrams, no redundancy bigram
- Logical rules: 1 additional variable = date, 2291 tables, 4021 codes
- Learning phase time: 2 seconds

23 Countries (nationalities) base
- Reference file: 1542 lines
- Normalization rules:
  - 7 empty characters: '()-_,/
  - 29 empty words: "democratic", "republic",...
  - Synonyms: 42 expressions, 14 synonyms
- Parameters of the learning phase:
  - 4 words (12 - 12 - 12 - 12)
  - 3 priority bigrams, 2 redundancy bigrams
- Logical rules: none
- Learning phase time: less than 1 second

24 Coding speeds
- Occupation (EEC): about 900 wordings per second
- Occupation (Common Trunk): about 1000 wordings per second
- Activities of Time Use: about 1700 wordings per second
- Commune: about 7000 wordings per second
- Nationality: about 25000 wordings per second

25 Efficiencies for the Occupation
- For the National Census ("Heavy" run):
  - Present occupation: 56.6% coded
  - Former occupation: 83.7%
- For the EEC (LFS): 80%
- For the household surveys (common trunk): between 75 and 80% of non-empty wordings

26 Efficiencies for other variables
- For the national Census: communes of place of work, of study or of previous home: 98.5%
- Countries/nationalities: 98.9%
- Time Use activities: 90%
- Household Consumption Survey:
  - Till receipts: 69.5%
  - Consumption board (other purchases): 75.3%
  - Shops: 91.8%
- Diploma (EEC): 90%

27 The software package
- Independent of the language and of the variables used
- Written in C
- Available on PC under Windows or Windows NT
- Works on IBM/MVS mainframes and on Unix workstations, excluding the expert interface
- 3 parts: the expert interface, the application program interface (API) package, the object modules and include files package

28 Conclusion: the important elements
- Separation between software and knowledge bases
- A quick learning phase
- Many parameters
- Specific tools to help experts
- The use of local and global criteria
- Distinction between learning and coding phases
- Independence vis-à-vis variables and languages
- And only one piece of software to maintain

