Published by Sarah Hodge
1
UNSD-UNESCAP Regional Workshop on Census Data Processing: Contemporary technologies for data capture, methodology and practice of data editing, documentation and archiving Bangkok, Thailand, 15-19 September 2008 Data Editing: Introduction
2
Objectives of Session
- Editing is the procedure for detecting and correcting errors in data.
- Imputation is the procedure of assigning values to missing or inconsistent data.
- The objective of the session is to present an overview of the concepts and definitions, and to discuss their application and related issues.
3
Summary
- Types of Errors in the census process
- Objectives of Editing: Why do it?
- How to and Why Edit? Some illustrative examples
- Principles of Editing: How to do it
- Fatal versus Query Edits
- Micro-editing versus Macro-editing
- Manual versus automatic editing
- Impact of capture mode on editing
- Pitfalls of Over-editing
- Other considerations
4
Types of Errors in the Census Process
Coverage Errors
- Incomplete/inaccurate maps or EAs
- Incomplete canvassing of all units
- Duplicate counting
- Omission of persons unwilling to be enumerated
- Erroneous treatment of visitors or non-resident aliens (especially in relation to de jure versus de facto methods)
- Loss or destruction of census records after enumeration
- ……
5
Types of Errors in the Census Process
Content Errors
- Errors in questionnaire design
- Enumerator errors
- Respondent errors
- Coding errors
- Data entry errors
- Errors in computer editing
- Errors in tabulation
6
Types of Errors in the Census Process
Two types of errors at the processing stage:
- Those that block further processing
- Those that produce invalid or inconsistent results without interrupting the logical flow of subsequent processing operations
ALL errors of the first kind must be corrected, and as many of the second kind as possible.
7
Objectives of Editing: Why do it?
Objectives of editing (Granquist, 1984):
- “Tidy up data” so as to facilitate analysis (creation of a complete file)
- Identify types and sources of errors (for reporting on data quality)
- Improve quality of census data (for current and future censuses)
It is important not only to detect errors but also to identify their causes, in order to take appropriate corrective measures and improve overall quality.
8
How to Edit?
TABLE 1: 2010 Population by Age and Sex, Unedited and Edited

                              Unedited data                         Edited data
Age group            Total   Male  Female  Sex not reported    Total   Male  Female
Total                4,147  2,033   2,091                23    4,147  2,043   2,104
Less than 15 years   1,639    799     825                15    1,646    809     837
15 to 29 years       1,256    612     643                 1    1,260    614     646
30 to 44 years         727    356     369                 2      729    358     371
45 to 59 years         360    194     166                 0      362    195     167
60 to 74 years         116     54      59                 3      116     55      61
75 years and over       34     12      22                 0       34     12      22
Age not reported        15      6       7                 2
9
How to Edit?
TABLE 1: Population by Age and Sex, Unedited and Edited
How to deal with data “not reported”?
- Distribute the age unknowns and the sex unknowns in the same proportion as the corresponding known values.
- For example, of the 23 sex unknowns, (2,033/4,124) × 23 ≈ 11 go to males, where 4,124 = 2,033 + 2,091 is the number of persons with known sex, and the remaining 12 go to females by subtraction. The edited columns (right-hand side) of Table 1 show the result once both the sex unknowns and the age unknowns have been distributed.
- Similarly, distribute the 15 age unknowns across the 6 age groups in proportion to the known values; see the right-hand side of Table 1.
- This method could give biased results if the number of unknowns (non-responses) is high, since the distributions of knowns and unknowns may be very different.
- An improved strategy would be to use multivariate distributions involving other variables, such as the relationship between spouses, having a positive entry for number of children born, etc.
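The proportional allocation described on this slide can be sketched as follows. This is a minimal illustration, not the workshop's own procedure: the function name `distribute_unknowns` and the largest-remainder rounding rule are assumptions, and the denominator is taken to be the count of persons with known sex (2,033 + 2,091). A different denominator, a different rounding rule, or allocating within each age group separately can shift the split by a unit or two.

```python
def distribute_unknowns(known, n_unknown):
    """Allocate n_unknown cases across categories in proportion to known counts.

    Largest-remainder rounding (an assumption, not from the slides) keeps the
    allocated integers summing exactly to n_unknown.
    """
    total_known = sum(known.values())
    exact = {k: n_unknown * v / total_known for k, v in known.items()}
    alloc = {k: int(x) for k, x in exact.items()}  # floor, counts are >= 0
    leftover = n_unknown - sum(alloc.values())
    # Hand the remaining units to the categories with the largest fractions.
    by_fraction = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    return alloc


# Known-sex totals and the 23 "sex not reported" cases from Table 1.
known_sex = {"male": 2033, "female": 2091}
added = distribute_unknowns(known_sex, 23)
edited = {k: known_sex[k] + added[k] for k in known_sex}

print(added)   # {'male': 11, 'female': 12}
print(edited)  # {'male': 2044, 'female': 2103}
```

The same function can be applied to the 15 age unknowns by passing the six known age-group counts instead of the two sex counts.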
10
Why Edit?
TABLE 2: Population by Age with Unknowns for 2000 and 2010

                         Numbers          Percent
Age group             2010    2000     2010    2000
Total                4,147   3,319    100.0   100.0
Less than 15 years   1,639   1,348     39.5    40.6
15 to 29 years       1,256     902     30.3    27.2
30 to 44 years         727     538     17.5    16.2
45 to 59 years         360     200      8.7     6.0
60 to 74 years         116      89      2.8     2.7
75 years and over       34      25      0.8     0.8
Age not reported        15     217      0.4     6.5
11
Why Edit?
TABLES 2 and 3: Population by Age with and without Unknowns for 2000 and 2010
- Another problem is that unknowns may affect the analysis of trends.
- In Table 2, where unknowns are left as a separate category, the percentage of persons aged 15-29 years appears to increase from 27.2% in 2000 to 30.3% in 2010.
- Redistributing the unknowns may change this trend.
- In Table 3, after distributing the unknowns, the increase is only from 28.7% in 2000 to 29.3% in 2010.
12
Why Edit?
TABLE 3: Population by Age without Unknowns for 2000 and 2010

                         Numbers          Percent
Age group             2010    2000     2010    2000
Total                4,147   3,319    100.0   100.0
Less than 15 years   1,743   1,408     42.0    42.4
15 to 29 years       1,217     952     29.3    28.7
30 to 44 years         695     578     16.8    17.4
45 to 59 years         341     230      8.2     6.9
60 to 74 years         114     109      2.7     3.3
75 years and over       37      42      0.9     1.3
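The effect of the unknowns on the apparent trend can be checked directly from the counts in Tables 2 and 3. The sketch below only recomputes percentage shares from the tabulated numbers; the helper names `percent_shares` and `change_in_share` and the shortened age-group labels are illustrative, and the code does not reproduce whatever method was used to move from Table 2 to Table 3.

```python
GROUPS = ["<15", "15-29", "30-44", "45-59", "60-74", "75+"]

# Counts copied from Table 2 (unknowns kept as their own category).
table2 = {
    2010: dict(zip(GROUPS + ["not reported"], [1639, 1256, 727, 360, 116, 34, 15])),
    2000: dict(zip(GROUPS + ["not reported"], [1348, 902, 538, 200, 89, 25, 217])),
}
# Counts copied from Table 3 (unknowns already distributed).
table3 = {
    2010: dict(zip(GROUPS, [1743, 1217, 695, 341, 114, 37])),
    2000: dict(zip(GROUPS, [1408, 952, 578, 230, 109, 42])),
}


def percent_shares(counts):
    """Percentage share of each category, rounded to one decimal."""
    total = sum(counts.values())
    return {g: round(100 * n / total, 1) for g, n in counts.items()}


def change_in_share(table, group):
    """Change in a group's share from 2000 to 2010, in percentage points."""
    diff = percent_shares(table[2010])[group] - percent_shares(table[2000])[group]
    return round(diff, 1)


with_unknowns = change_in_share(table2, "15-29")     # 3.1 points
without_unknowns = change_in_share(table3, "15-29")  # 0.6 points
print(with_unknowns, without_unknowns)
```

Most of the apparent 3.1-point rise in the 15-29 share comes from the much higher share of "age not reported" cases in 2000 (6.5%) than in 2010 (0.4%).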
13
Principles of Editing: How to do it
In general the editing system should be:
- Minimalist (change only obvious errors and as few as possible)
- Automated (as much as possible, for both detection and correction)
- Systematic
- Consistent with other NSO statistical collections
- Compliant with UN or other international standards
14
Fatal versus Query Edits
Types of edits:
- Fatal edits: identify errors with certainty
- Query edits: identify suspected errors
Fatal edits identify fatal errors, which include invalid or missing entries as well as errors due to inconsistencies.
Query edits identify data items that fall outside subjective data bounds, or items that are relatively high or low compared with other data on the same questionnaire.
Fatal edits must be resolved. Query edits are more difficult to correct, bring fewer benefits than the detection and resolution of fatal edits, and add more to the cost of the process.
For query edits, subject-matter specialists should review the edits developed for pilot censuses and those developed during processing to make sure that each individual edit is worth its cost (e.g., look at hit rates, i.e. the share of flags that result in changes to the original data).
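The distinction between the two kinds of edits can be sketched in code. Everything specific here is hypothetical and chosen only for illustration: the field names (`age`, `sex`, `children_born`), the sex codes, the age ceiling of 120, and the rule that flags a number of children as high relative to age. A real edit specification would be set by subject-matter specialists.

```python
VALID_SEX = {"M", "F"}   # hypothetical code list
MAX_AGE = 120            # hypothetical upper bound for a valid age


def check_person(rec):
    """Return (fatal, query) lists of messages for one person record.

    Fatal edits flag entries that are certainly wrong (missing, invalid,
    or logically inconsistent). Query edits flag entries that are merely
    suspicious relative to other data on the same record.
    """
    fatal, query = [], []
    age = rec.get("age")
    sex = rec.get("sex")
    children = rec.get("children_born")

    if age is None or not (0 <= age <= MAX_AGE):
        fatal.append("age missing or out of range")
    if sex not in VALID_SEX:
        fatal.append("sex missing or invalid")
    if sex == "M" and children:
        # Assumes the fertility question is asked of women only.
        fatal.append("children ever born reported for a male")

    # Query edit with a subjective bound: many children for the reported age.
    if sex == "F" and age is not None and children is not None:
        if children > max(age - 12, 0):
            query.append("number of children high for reported age")
    return fatal, query


def hit_rate(n_flags, n_changed):
    """Share of flags that led to a change in the original data."""
    return n_changed / n_flags if n_flags else 0.0


print(check_person({"age": 34, "sex": "F", "children_born": 2}))
print(check_person({"age": None, "sex": "X"}))
print(check_person({"age": 16, "sex": "F", "children_born": 12}))
```

A query edit with a persistently low hit rate is a candidate for loosening or removal, since it costs review time without changing the data.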
15
Micro-editing versus Macro-editing
- Micro-editing: concerns ways to ensure the validity and consistency of individual data records and of relationships between records in a household
- Macro-editing: checks aggregated data to make sure that they are reasonable
- Example: if census results show a large percentage of persons without a reported age, imputing age at the micro level will produce a complete data set. But it is far more essential to make checks at the macro (aggregate) level to ensure that the imputation does not skew the overall age distribution.
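One possible form of the macro-level check mentioned in the example is to compare each age group's share before and after imputation. The sketch below uses the age-group counts from Table 1; the 0.25 percentage-point threshold, the function names, and the deliberately bad `skewed` imputation are assumptions made for illustration.

```python
GROUPS = ["<15", "15-29", "30-44", "45-59", "60-74", "75+"]

# Table 1: persons with a reported age, and totals after editing.
reported = dict(zip(GROUPS, [1639, 1256, 727, 360, 116, 34]))
edited = dict(zip(GROUPS, [1646, 1260, 729, 362, 116, 34]))

MAX_SHIFT_PP = 0.25  # hypothetical tolerance, in percentage points


def share_shift(before, after):
    """Change in each group's percentage share caused by imputation."""
    total_before, total_after = sum(before.values()), sum(after.values())
    return {
        g: 100 * after[g] / total_after - 100 * before[g] / total_before
        for g in before
    }


def macro_flags(before, after, max_shift_pp=MAX_SHIFT_PP):
    """Groups whose share moved by more than the tolerance."""
    shifts = share_shift(before, after)
    return [g for g, d in shifts.items() if abs(d) > max_shift_pp]


# A deliberately poor imputation: all 15 unknown ages assigned to "75+".
skewed = {**reported, "75+": reported["75+"] + 15}

print(macro_flags(reported, edited))  # []
print(macro_flags(reported, skewed))  # ['75+']
```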
16
Impact of Capture Mode on Editing
- Types of capture modes typically used: manual (key-entry), OMR, OCR/ICR, PDA, Internet
- For key-entry, PDA and Internet capture: some limited detection and correction of errors can be done in “real time”
- This is not possible for OMR or OCR/ICR (scanning of paper questionnaires), where editing is limited to “batch editing” after the fact
17
Manual versus Automated Editing
- Manual edits may be done in several places along the editing chain: by the enumerator, supervisor, field office worker, coder, key entry clerk, etc.
- The disadvantage is that manual editing consumes an enormous amount of time (months or years), energy (human resources) and money.
- If the data set is small, timing is not so crucial and a work force is available, then manual editing may be feasible.
- Automated editing reduces the time required, decreases the introduction of human error, and allows for the creation of an edit trail (and is therefore reproducible).
- Unlike manual editing, automated editing makes it feasible and efficient to impute responses based on other information in the questionnaire or on reported information for a unit with similar characteristics.
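Imputing from "a unit with similar characteristics" while keeping an edit trail could look like the sketch below. It uses a simple sequential hot-deck style rule (the most recent record with the same matching values acts as donor), which is a technique swapped in here for illustration and is not named on the slides. The field names and the trail format are also assumptions.

```python
def hot_deck_impute(records, field, match_on):
    """Fill missing `field` values from the latest matching donor record.

    Returns (imputed_records, edit_trail). The input list is not modified.
    Records with no earlier donor are left unimputed.
    """
    last_donor = {}  # matching key -> most recently reported value
    trail = []       # one entry per change, so the run can be audited
    imputed = []
    for index, original in enumerate(records):
        rec = dict(original)  # work on a copy, keep the unedited data intact
        key = tuple(rec.get(name) for name in match_on)
        if rec.get(field) is None:
            if key in last_donor:
                rec[field] = last_donor[key]
                trail.append({
                    "record": index,
                    "field": field,
                    "old": None,
                    "new": rec[field],
                    "rule": "sequential hot deck on " + "+".join(match_on),
                })
        else:
            last_donor[key] = rec[field]
        imputed.append(rec)
    return imputed, trail


people = [
    {"sex": "F", "relationship": "spouse", "age": 31},
    {"sex": "F", "relationship": "spouse", "age": None},
    {"sex": "M", "relationship": "head", "age": None},  # no donor yet
]
imputed, trail = hot_deck_impute(people, "age", ["sex", "relationship"])
print(imputed[1]["age"], imputed[2]["age"], len(trail))
```

Because the function returns a new list and a log of every change, rerunning it on the same unedited file gives the same result, which is the reproducibility benefit the slide attributes to automated editing.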
18
Pitfalls of Over-editing
- Reduced timeliness
- Increased costs
- Potential distortion of true values
- False sense of security
19
Other Considerations
Determination of tolerance levels for error detection
- For most items in a census, some small percentage of the respondents will not give “acceptable” responses, for whatever reason.
- Not every failure is pervasive, so not every failure is worthy of remedial action (see Pitfalls of Over-editing).
- Tolerance levels indicate the number of invalid and inconsistent responses allowed before editing teams take remedial action.
- They are decided by an editing team that includes both subject-matter and data processing specialists.
- They are typically low (1%-2%) for key items such as age and sex, and higher (5%-10%) for less critical items such as literacy and disability.
- Correction may occur by returning enumerators to the field, conducting telephone re-interviews, or applying specific knowledge of an area.
Learning from the editing process/quality assurance systems
- Positive and negative feedback loops need to be recorded to improve the quality of both the current census and future censuses and surveys.
- Audit trails, performance measures and diagnostic statistics are crucial.
- This is often the most important outcome of editing.
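A tolerance check of the kind described under "Determination of tolerance levels" might be sketched as follows. The specific thresholds are assumptions taken from the upper ends of the ranges quoted on the slide (2% for key items, 10% for less critical ones), and the function name is illustrative. The example counts are the "not reported" figures from Tables 1 and 2.

```python
# Hypothetical tolerance levels: upper ends of the ranges quoted on the slide.
TOLERANCE = {"age": 0.02, "sex": 0.02, "literacy": 0.10, "disability": 0.10}


def over_tolerance(invalid_counts, n_records, tolerance=TOLERANCE):
    """Return {item: failure rate} for items whose rate exceeds tolerance."""
    rates = {item: n / n_records for item, n in invalid_counts.items()}
    return {item: rate for item, rate in rates.items() if rate > tolerance[item]}


# 2010: 15 ages and 23 sexes not reported out of 4,147 persons (Table 1).
flags_2010 = over_tolerance({"age": 15, "sex": 23}, 4147)
# 2000: 217 ages not reported out of 3,319 persons (Table 2).
flags_2000 = over_tolerance({"age": 217}, 3319)

print(flags_2010)  # {} : both items are within tolerance
print(flags_2000)  # age fails: about 6.5% of ages are not reported
```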
20
Other Considerations
Cost of editing
- The cost of editing has not decreased much in the last 20 years, although the process has been rationalized by continuous exploitation of technological developments.
- In general, editing activities take a disproportionate amount of time (and therefore staff costs) relative to other activities.
- Excessive editing can delay census results.
Archiving
- Both edited and unedited data files should be preserved for later analysis, and in several places.
- Documentation should be complete enough for census planners to be able to reconstruct the same processes at a later date.
21
THANK YOU!