Using OCR for Census Data Capture in China
National Bureau of Statistics of China




Background
- Five population censuses have been conducted: in 1953, 1964, 1982, 1990 and 2000.
- 1953 and 1964 censuses: manual tabulation.
- Since the 1982 census, computers have been used for data processing.
- 1982 and 1990 censuses: manual data entry.
- 2000 census: OCR used for data capture.
- 2006, the second agricultural census: OCR also used for data capture.
- The 2000 population census and the 2006 agricultural census are the two cases of OCR use for large-volume data capture.

Two cases of OCR for large-volume Census Data Capture
The data capture of the 2000 Population Census
- Census reference time: Nov. 1, 2000
- Data capture cycle: January to June 2001 (6 months)
- Scale:
  - Types of census form: 4
    - Short form: 49 census items, 90% of households, 360 million A4 double sheets in total
    - Long form: 95 census items, 10% of households, 40 million A3 double sheets in total
    - Other forms: death population, temporary residents; 10 million A4 double sheets in total
  - Original census data: about 64 GB
  - Image volume: 5.5 TB

Two cases of OCR for large-volume Census Data Capture
The Second National Agricultural Census
- Census reference time: end of 2006
- Data capture cycle: April to mid-July 2007 (100 days)
- Scale:
  - Types of census form: 8
  - Total census items: 541
  - Total agricultural families: 250 million
  - Total census forms: about 500 million pieces of paper
  - Original census data: about 300 GB
  - Image data: 40 TB
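As a rough back-of-envelope reading of the figures quoted for the two censuses, the sketch below derives the average image size per sheet and the average daily scanning load. It assumes decimal units (1 TB = 10**12 bytes), treats "6 months" as the 181 calendar days from January to June 2001, and assumes work was spread evenly over the period; none of these assumptions come from the slides.

```python
# Back-of-envelope figures derived from the numbers quoted in the slides.
# Assumptions (not from the source): decimal terabytes, calendar days,
# and an even workload over the whole data capture cycle.
TB = 10**12


def bytes_per_sheet(image_bytes, sheets):
    """Average stored image size per sheet."""
    return image_bytes / sheets


def sheets_per_day(sheets, days):
    """Average number of sheets that must be scanned per day."""
    return sheets / days


# 2000 Population Census: short, long and other forms (double sheets).
pop_sheets = 360_000_000 + 40_000_000 + 10_000_000
pop_days = 31 + 28 + 31 + 30 + 31 + 30  # January to June 2001
pop_image_bytes = 5.5 * TB

# 2006 Agricultural Census: about 500 million pieces of paper in 100 days.
agri_sheets = 500_000_000
agri_days = 100
agri_image_bytes = 40 * TB

for name, sheets, days, size in [
    ("2000 population census", pop_sheets, pop_days, pop_image_bytes),
    ("2006 agricultural census", agri_sheets, agri_days, agri_image_bytes),
]:
    kb = bytes_per_sheet(size, sheets) / 1000
    daily = sheets_per_day(sheets, days) / 1_000_000
    print(f"{name}: {kb:.1f} KB per sheet, {daily:.2f} million sheets per day")
```

On these assumptions the agricultural census works out to about 80 KB of image data per piece of paper and 5 million pieces per day, which helps explain why capture was spread across many sites.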

Organizational structure for data processing
[Diagram: census forms pass upward through the levels EA, village, town, county, prefecture, province and NBS. Step labels shown: "Checked & packed" (three times), "Coding", "Data capture, editing", "Data process". Figures shown: 5 million, 0.9 million, 113, 31.]
Data capture was decentralized at prefecture offices.

Function framework of OCR data capture (2006 agriculture census)
[Diagram of system functions. Modules named: task management, system management, scanning, OCR, editing and checkup, data management, image management.
Functions named: user management, system initialization, log management, space management, address base, client management, archiving management, process management, progress monitoring, receiving file, statistics summary, information display; scanner self-inspection, forms check, batch form scan, add scan, alternative scan, repeat scan, generation of image management ID, generation of census form management ID, image merger; check scan; recognition and editing of numeric data, Chinese characters, English characters and special characters; image reported; browse, enquiry, import, export, input, output, backup, restore, delete; QA.]

The Process of OCR data capture
The scanning module generates image files and transmits them to the image management module; it also transmits status information to the task management module. The task management module distributes tasks according to which OCR clients are idle.
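The task distribution step can be pictured with a minimal sketch: scanned batches wait in a queue and are handed to whichever OCR client is idle. The class, method names and client identifiers below are hypothetical illustrations; the slides do not describe how the real task management module is implemented.

```python
from collections import deque


class TaskManager:
    """Hypothetical model: hands scanned batches to idle OCR clients."""

    def __init__(self, client_ids):
        self.idle = deque(client_ids)  # clients with no task assigned
        self.busy = {}                 # client id -> batch id in progress
        self.pending = deque()         # batches waiting for a free client

    def submit(self, batch_id):
        """Called when the scanning module reports a new batch of images."""
        self.pending.append(batch_id)
        return self._dispatch()

    def finish(self, client_id):
        """Called when an OCR client reports that its batch is done."""
        batch_id = self.busy.pop(client_id)
        self.idle.append(client_id)
        self._dispatch()
        return batch_id

    def _dispatch(self):
        """Pair waiting batches with idle clients, oldest first."""
        assigned = []
        while self.pending and self.idle:
            client_id = self.idle.popleft()
            batch_id = self.pending.popleft()
            self.busy[client_id] = batch_id
            assigned.append((client_id, batch_id))
        return assigned


manager = TaskManager(["ocr-1", "ocr-2"])
first = manager.submit("batch-001")    # goes to ocr-1
second = manager.submit("batch-002")   # goes to ocr-2
third = manager.submit("batch-003")    # no idle client, so it waits
done = manager.finish("ocr-1")         # frees ocr-1, which picks up batch-003
```

A first-in, first-out queue is only one possible policy; a production system would also need to handle client failures and re-queue unfinished batches.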

The Process of OCR data capture
The OCR module recognizes numeric data and Chinese characters and transmits the results to the data management module; it also transmits status information to the task management module.

The Process of OCR data capture
The task management module dispatches the data to the edit module for editing. If the original image is needed, the image management module fetches the corresponding image for comparison. After editing, the cleansed data are returned to the data management module. When all data capture work is finished, the data are reported upward.

Quality Control
To ensure the quality of captured data, quality control is carried out in three stages: scanning, recognition and data editing.
- During scanning, the system recognizes the batch cover data and the scanner count and checks whether the total page count and total household count for each batch are consistent with the scanning results. It also compares the actual address code with the address code repository to ensure that address codes are valid, unique and correct.
- During recognition, the system collects real-time statistics on the rejection ratio and the suspect ratio. If either ratio is too high, the task administrator investigates the cause.
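The scanning-stage and recognition-stage checks can be sketched roughly as follows. The field names, the sample address codes and the alert thresholds (2% rejection, 5% suspect) are illustrative assumptions; the slides do not give the actual data layout or the limits used.

```python
def check_batch(cover, scanned_pages, scanned_households):
    """Compare batch cover totals with what the scanner actually produced."""
    problems = []
    if cover["pages"] != scanned_pages:
        problems.append(
            f"page count: cover {cover['pages']}, scanned {scanned_pages}")
    if cover["households"] != scanned_households:
        problems.append(
            f"household count: cover {cover['households']}, "
            f"scanned {scanned_households}")
    return problems


def check_address(code, repository, seen):
    """Return a problem description, or None if the code is valid and new."""
    if code not in repository:
        return "unknown address code"
    if code in seen:
        return "duplicate address code"
    seen.add(code)
    return None


def recognition_ratios(statuses):
    """statuses holds one of 'ok', 'suspect' or 'rejected' per character."""
    total = len(statuses)
    if total == 0:
        return 0.0, 0.0
    return statuses.count("rejected") / total, statuses.count("suspect") / total


def needs_attention(rejection, suspect, max_rejection=0.02, max_suspect=0.05):
    """Hypothetical thresholds; the real limits are not given in the slides."""
    return rejection > max_rejection or suspect > max_suspect


repository = {"110101001", "110101002"}  # made-up address codes
seen = set()
rejection, suspect = recognition_ratios(
    ["ok"] * 90 + ["suspect"] * 7 + ["rejected"] * 3)
```

In this sketch a batch with any non-empty problem list, or a client whose ratios trip `needs_attention`, would be routed to the task administrator for review.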

Quality Control
- During editing, the system checks that the recognized record count is consistent with the record count in the controller document, and checks basic logical relationships and value ranges. Items with errors in logical relationships or value ranges are flagged; the recognition results and the corresponding items from the original scanned images are displayed side by side in parallel windows, and convenient means of correction are provided for items that need to be modified.
- After the whole data set has been captured, quality is assured through sample-based quality checks covering all phases.
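Editing-stage checks of this kind are often expressed as a table of value ranges plus a set of cross-item rules. The sketch below is illustrative only: the item names, the coding scheme, the ranges and the age/marital-status rule are hypothetical and are not taken from the census forms.

```python
# Hypothetical value ranges: item name -> (lowest, highest) allowed code.
VALUE_RANGES = {"age": (0, 120), "sex": (1, 2)}


def minor_never_married(record):
    """Hypothetical rule: persons under 15 must be coded never married (1)."""
    return record["age"] >= 15 or record["marital_status"] == 1


# Cross-item rules, keyed by the items they relate.
LOGIC_RULES = {"age/marital_status": minor_never_married}


def flag_items(record):
    """Return the items an editor should compare against the scanned image."""
    flagged = []
    for item, (low, high) in VALUE_RANGES.items():
        if not low <= record[item] <= high:
            flagged.append(item)
    for name, rule in LOGIC_RULES.items():
        if not rule(record):
            flagged.append(name)
    return flagged


def record_count_matches(records, control_count):
    """Check the recognized record count against the control total."""
    return len(records) == control_count


records = [
    {"age": 34, "sex": 1, "marital_status": 2},   # passes every check
    {"age": 130, "sex": 3, "marital_status": 2},  # two range errors
    {"age": 9, "sex": 2, "marital_status": 2},    # logic error
]
flags = [flag_items(r) for r in records]
```

Keeping the rules in data structures rather than in branching code makes it easier to add or adjust checks for each form type.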

Main Problems and Solutions
In large-scale census data capture projects, we regard three problems as the most prominent:
1. How to enhance OCR recognition capability.
2. Availability and reliability of the system.
3. Project management.
What we have done:
1. Improved recognition of numeric characters. Two recognition algorithms, and two recognition engines based on them, were developed; after a series of on-site tests, the one better suited to the census project was chosen.
2. Improved recognition of Chinese characters. By collecting a large number of actual samples and training the recognizer, recognition of Chinese names was improved.
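Choosing between two recognition engines after on-site tests amounts to scoring both on the same labelled samples. The sketch below assumes each engine is a callable that returns the recognized text, or `None` when it rejects a field, and picks the engine with the higher accuracy, breaking ties by the lower rejection ratio. The stub engines and the selection criterion are assumptions; the slides do not say how the comparison was actually made.

```python
def evaluate(engine, samples):
    """Score one engine on (image, true_text) pairs."""
    correct = rejected = 0
    for image, truth in samples:
        result = engine(image)
        if result is None:       # engine refused to give an answer
            rejected += 1
        elif result == truth:
            correct += 1
    total = len(samples)
    return {"accuracy": correct / total, "rejection": rejected / total}


def choose_engine(engines, samples):
    """Pick the engine with the best accuracy, then the lowest rejection."""
    scores = {name: evaluate(fn, samples) for name, fn in engines.items()}
    best = max(
        scores,
        key=lambda n: (scores[n]["accuracy"], -scores[n]["rejection"]))
    return best, scores


# Stub engines standing in for real recognizers: each maps an image id to
# its answer, and returns None for a rejected field.
samples = [("img1", "7"), ("img2", "1"), ("img3", "4"), ("img4", "9")]
engine_a = {"img1": "7", "img2": "7", "img3": "4", "img4": None}.get
engine_b = {"img1": "7", "img2": "1", "img3": "4", "img4": None}.get

best, scores = choose_engine({"A": engine_a, "B": engine_b}, samples)
```

In practice the test set would be drawn from real filled-in forms from several sites, since handwriting styles and print quality vary by location.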

Main Problems and Solutions
3. Improved field-locating capability. To deal with print deviation and filling deviation, a smart locating algorithm was developed that minimizes their impact.
4. Enhanced recognition efficiency. The scanner's underlying software was improved to achieve the best match between hardware drivers and OCR software and to improve recognition efficiency.
5. Improved the quality of form filling. Filling standards were prescribed so that the OCR error rate, and with it the rejection rate, is reduced.
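The slides give no details of the smart locating algorithm. As a generic illustration of how print deviation can be compensated, the sketch below estimates a simple translation from reference marks printed on the form and shifts the field boxes of the form template accordingly. It assumes the deviation is a pure shift with no rotation or scaling, and all coordinates are made up.

```python
def estimate_offset(expected_marks, detected_marks):
    """Average displacement between template marks and detected marks."""
    count = len(expected_marks)
    dx = sum(d[0] - e[0] for e, d in zip(expected_marks, detected_marks)) / count
    dy = sum(d[1] - e[1] for e, d in zip(expected_marks, detected_marks)) / count
    return dx, dy


def shift_box(box, offset):
    """Move a field box (x, y, width, height) by the estimated offset."""
    x, y, width, height = box
    dx, dy = offset
    return (x + dx, y + dy, width, height)


# Template positions of two reference marks and where the scanner found them.
expected = [(10, 10), (200, 10)]
detected = [(13, 8), (203, 8)]

offset = estimate_offset(expected, detected)
age_box = shift_box((50, 40, 30, 12), offset)  # hypothetical "age" field
```

A real implementation would typically also correct for skew and scale, and would need a separate strategy for filling deviation, where respondents write outside the printed boxes.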

Main Problems and Solutions
6. Established regulations, working guidance and processes so that every data entry site works to uniform regulations, processes and standards.
7. Strengthened training. We organized centralized training and on-site training for users. Centralized training combined lectures with hands-on operation, which deepened users' familiarity with the system.
8. Organized multi-target pilots. We ran multiple pilots in many locations, each aimed at different targets.

Lessons Learned
- Use advanced technology to raise efficiency.
- Combine technical and administrative methods to resolve quality problems and security issues.
- Choose partners with strong system development and service capability.
- Prepare the project early.
- Manage the project together with partners.
- Training, pilot projects and management are the key to success.
- Control the printing quality of the census forms and the quality of form filling.
- Control project changes.

Prospect of the 2010 Population Census
Census time: Nov. 1, 2010
- Short form, long form and death population form
- Foreigners living in China are being considered for enumeration
Data capture in 2011
- OCR data capture will be the main data entry method
- Modify the existing agricultural census system and make some innovations
- Add more OCR equipment