1
Estimating the accuracy of different geographical imputation methods
Kevin A. Henry, Ph.D., New Jersey Cancer Registry, Cancer Epidemiology Services
Frank Boscoe, Ph.D., New York State Cancer Registry
Paper Presentation: NAACCR Annual Meeting, 2007, Detroit, MI
2
Introduction
Geographical imputation: methods to assign a case a geographic location that is approximate or accurate given available geographic and demographic data.
The goal of geo-imputation is to assign a case a location at one geographical aggregate level based on information from one or more known geographical aggregates (Boscoe 2007).
Assigned locations can be:
- Area (e.g. census tract, block group)
- Point (e.g. latitude & longitude within census tract)
Geo-imputation example: zip code to census tract
Available case information: Zip code: '08648'; Race: 'Black'
Black population of zip code 08648: 3,260, split across four census tracts: 1,639 (50%), 692 (21%), 636 (19.5%), 293 (8.9%)
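The tract percentages in the example above follow directly from the tract counts. A minimal sketch of that arithmetic, assuming hypothetical tract labels (the slide gives only the counts, not the tract identifiers):

```python
# Black population counts for the four census tracts in zip code 08648,
# as listed on the slide. The tract labels are hypothetical placeholders.
tract_counts = {"tract_a": 1639, "tract_b": 692, "tract_c": 636, "tract_d": 293}


def tract_shares(counts):
    """Return each tract's share of the zip code's population."""
    total = sum(counts.values())
    return {tract: n / total for tract, n in counts.items()}


shares = tract_shares(tract_counts)
for tract, share in shares.items():
    print(f"{tract}: {share:.1%}")
```

Under this approach, a Black case known only by zip code 08648 would be assigned to the largest tract about half the time. Printed values may differ from the slide in the last digit, depending on rounding.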
3
Introduction
Why should we geo-impute?
- Studies can be biased due to the geographic non-randomness of ungeocoded cases or cases geocoded to a zip code centroid (Oliver et al. 2006).
- Cases geocoded to a zip code centroid may not be located in the correct census tract.
- Removing cases geocoded by zip code can result in selection bias.
- Cases geocoded to zip code centroids can inflate case counts at the location where the zip centroid falls.
Should we geo-impute?
- No systematic evaluation of geo-imputation has been completed to determine which method offers the best predictive power.
4
Study Objective
Examine the usefulness of geo-imputation for assigning census tracts to cases that have previously been geocoded only to a zip code centroid.
Study Questions
- What census tract demographic information (e.g. race, age) provides the best predictive value for assigning a case to the correct census tract?
- Is demographic-based geo-imputation better than two alternatives?
  1) Selecting census tracts within a zip code zone randomly
  2) Using the census tracts originally assigned to cases based on the zip code centroid location
5
Background: What is a zip code?
- ZIP ('Zone Improvement Plan') codes are linear features associated with specific roads or specific addresses.
- Zip code zones are created by digitizing boundaries around street ranges.
Figure labels: Census Tracts Falling Within Zip Code Zone; Street Segments Used for Geocoding; Zip Code Centroid
6
Background: New Jersey Zip Codes
- 558 zip code zones
- 92% of zip codes have 2 or more potential census tracts
- 1 zip code has 23 potential census tracts
- Average tracts per zip code: 6
Figure: Census Tracts Per Zip Code (tract frequency, 1 to 23, against percent, 0% to 25%)
7
Methods: Study Population
New Jersey residents diagnosed with breast, prostate and colorectal cancer, geocoded to a full street address (2000-2004, N=96,852, NJSCR)
Additional study exclusions (N=4,100):
- No age or race
- Invalid zip codes
- Invalid census tracts
- Cases geocoded to zip centroids with only one census tract
Registry variables: Race, Age, Census Tract, Zip Code, Census Tract Certainty
Comparison: census tracts assigned to cases in the imputed case data, compared with the 'truth' census tracts assigned to cases in the original case data
8
Methods: Demographic Data
Creation of census tract populations:
- 2000 Census block populations aggregated into zip codes (Tele Atlas, 2006).
- Census tract populations created to include only populations within the zip code.
- Cumulative probabilities calculated for each tract per zip code.
Figure: zip code 07524 example (labels: Total Tract Population, Census Block Population; values shown: 6,774 and 3,101)
2000 SF1 Census populations included:
- Total Population (P001001)
- White alone (P003003)
- Black or African Amer. alone (P003004)
- Asian alone (P003006)
- Hispanic or Latino (P004002)
- Total Population by age (P012003-P012049)
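The aggregation described on this slide can be sketched as follows. This is an illustration only: the block records, tract labels, and population values are hypothetical, and the assignment of each block to a zip code is assumed to have been done beforehand.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical block records: (zip_code, tract_id, block_population).
# Only blocks that fall inside the zip code are listed, so each tract's
# total covers just the part of the tract within that zip code.
blocks = [
    ("07524", "tract_1", 1200),
    ("07524", "tract_1", 800),
    ("07524", "tract_2", 600),
    ("07524", "tract_2", 500),
    ("07524", "tract_3", 900),
]


def tract_populations_by_zip(block_records):
    """Sum block populations into tract-within-zip populations."""
    totals = defaultdict(lambda: defaultdict(int))
    for zip_code, tract, population in block_records:
        totals[zip_code][tract] += population
    return totals


def cumulative_probabilities(tract_pops):
    """Return (tract, cumulative probability) pairs for one zip code.

    Assumes the zip code has a non-zero total population.
    """
    tracts = sorted(tract_pops)
    total = sum(tract_pops.values())
    cum = list(accumulate(tract_pops[t] / total for t in tracts))
    cum[-1] = 1.0  # guard against floating-point drift in the last value
    return list(zip(tracts, cum))


pops = tract_populations_by_zip(blocks)
cum_probs = cumulative_probabilities(pops["07524"])
print(cum_probs)
```

The same functions could be applied separately to each race or age population count to obtain group-specific probabilities.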
9
Method: Geo-imputation
Step 1: Calculate cumulative probabilities from census tract (CT) population
Step 2: Generate a random number (0-1) for each case
Step 3: Assign a census tract based on random number ranges
Example (zip code 07001, census tracts 1-4): tract percentages of 32.7%, 32.8%, 18.4% and 15.9% give cumulative probabilities of about .32, .65, .84 and 1 on a 0-1 scale.
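The three steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the pairing of the slide's percentages with tract labels 1-4 is an assumption, and the function names are hypothetical.

```python
import bisect
import random
from itertools import accumulate

# Tract percentages for zip code 07001 as shown on the slide. Which
# percentage belongs to which tract label (1-4) is an assumption.
shares = {1: 0.327, 2: 0.328, 3: 0.184, 4: 0.159}


def build_breakpoints(tract_shares):
    """Step 1: cumulative probabilities from tract shares."""
    tracts = list(tract_shares)
    total = sum(tract_shares.values())  # slide percentages sum to 99.8%
    cum = list(accumulate(tract_shares[t] / total for t in tracts))
    cum[-1] = 1.0  # make the final breakpoint exactly 1
    return tracts, cum


def impute_tract(tracts, cum, rng):
    """Steps 2 and 3: draw a random number and map it to a tract."""
    u = rng.random()  # uniform in [0, 1)
    return tracts[bisect.bisect_right(cum, u)]


tracts, cum = build_breakpoints(shares)
rng = random.Random(0)  # seeded so the sketch is repeatable
imputed = [impute_tract(tracts, cum, rng) for _ in range(10000)]
```

Each tract is chosen with probability proportional to its share, so over many cases the imputed distribution approaches the population distribution within the zip code.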
10
Methods: Test Samples
- Random samples for race and age groups, stratified by population density (quintiles).
- Geo-imputations completed for each subset.
- Imputed census tracts compared with the tracts from the original case data (truth).
- Each imputation was run 1,000 times.
- Results: boxplots of mean % of matches.
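The evaluation loop can be sketched as repeated imputation followed by a comparison with the known tracts. Everything here is hypothetical: the tract labels, population weights, and simulated 'truth' are placeholders, and the sketch uses 200 runs rather than 1,000 to keep it quick.

```python
import random
import statistics


def impute(weights, rng):
    """Draw one tract with probability proportional to its weight."""
    tracts = list(weights)
    return rng.choices(tracts, weights=[weights[t] for t in tracts], k=1)[0]


def match_rate(true_tracts, weights, rng):
    """Share of cases whose imputed tract equals the true tract."""
    hits = sum(impute(weights, rng) == truth for truth in true_tracts)
    return hits / len(true_tracts)


def mean_match_rate(true_tracts, weights, runs, seed=0):
    """Mean match rate over repeated imputation runs."""
    rng = random.Random(seed)
    return statistics.mean(
        match_rate(true_tracts, weights, rng) for _ in range(runs)
    )


# Hypothetical single zip code with three tracts.
pop_weights = {"t1": 500, "t2": 300, "t3": 200}
uniform_weights = {tract: 1 for tract in pop_weights}

# Simulated truth: cases distributed according to the population weights.
# Real case data will follow the population less closely than this.
truth_rng = random.Random(42)
true_tracts = [impute(pop_weights, truth_rng) for _ in range(500)]

weighted = mean_match_rate(true_tracts, pop_weights, runs=200)
uniform = mean_match_rate(true_tracts, uniform_weights, runs=200)
print(f"population-weighted: {weighted:.1%}, random: {uniform:.1%}")
```

Selecting a tract uniformly at random corresponds to the first alternative in the study questions; in this idealized setup it matches about one case in three, while population weighting does better because the simulated cases follow the weights.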
11
Results:
Figure: Mean percent correct (axis 10% to 35%) by population per square mile by census tract, rural to urban. Density groups: <1,132; 1,133 - 2,882; 2,883 - 5,078; 5,079 - 11,579; >11,579. Reference lines: no imputation (17.1%); random (13%).
12
Results:
Figure: Mean percent correct (axis 10% to 30%) by population used for imputation: Asian, White, Black, Hispanic, and random. Sample sizes shown: N=1,500; N=25,000; N=4,000; N=3,000; Asian, White, Black & Hispanic combined N=33,500. Values shown: 23.1%, 22%, 26.3%, 22.2%, 13%, 24.6%. Reference lines: no imputation (17.1%); total population (24%).
13
Results:
Figure: Mean percent correct (axis 10% to 30%) by age group: 40-44, 45-49, 50-54, 55-59, 60-61, 62-64, 65-66, 67-69, 70-74, 75-79, 80-84, >85. Reference lines: no imputation (17.1%); random (13%); age combined (24.9%).
14
Conclusion
- Geo-imputation provides a higher match rate than no imputation or randomly allocating tracts.
- Percent correct depends on population density.
- Percent correct for imputation based on race-specific population was slightly higher than for total population (23.1% vs 24%).
- States with larger rural populations would likely have better match rates than New Jersey.
- Geographic imputation offers some advantages and no serious drawbacks compared with the alternative of excluding ungeocoded cases from an analysis.
15
Thank you
Note: New Jersey case counts for breast, prostate, colorectal & cervical cancer (1997-2003; N=154,071). Data extracted from the NJ Registry analytical database, March 5, 2007.