1
IMPUTING MISSING VALUES FOR HIERARCHICAL POPULATION DATA
Overview of Database Research
Muhammad Aurangzeb Ahmad, Nupur Bhatnagar
2
Background
Along with harmonizing U.S. and international census data, the Minnesota Population Center improves data quality for historical U.S. census samples.
For modern samples, the U.S. Census has already allocated any missing data.
Data from older census years had to be converted into machine-readable form; errors and omissions were coded as Missing.
If the records with missing data are not representative of the dataset as a whole, throwing out the missing cases can affect the results of calculations in undesirable ways.
3
Problem Definition
Given: the 1850, 1860, 1870, and 1880+ datasets, each with a set of variables. The missing variable of interest is RELATE, which describes an individual's relationship to the head of household.
Constraints:
A person is not traceable across multiple years.
The position of people within a household is significant.
Household structure varies within a given year, from a minimum of 1 person to a maximum of 17.
Importance: The RELATE variable is vital to researchers for trend analysis of household structure.
Relate            Code   1850 / 1860 / 1870 / 1880
Head/Householder  01     Missing    Available
Spouse            02     Missing    Available
Child             03     Missing    Available
Child-in-law      04     Missing    Available
To find: Predict the RELATE variable for the years where it is missing.
4
Existing Approach: Hot Deck Allocation
FIRST PASS: Substitute the missing values using simple rules that are hard-coded explicitly.
Ex: Relate code 101: head of household. Relate code 201: spouse of the head of household. Relate code 301: child of the head of household. If missing, these relationships are generated by simple rules.
Result: Almost 75% of the missing values are imputed.
SECOND PASS: The remaining 25% of cases are assigned using the following process:
Persons having a relate code of 101, 201, or 301 are removed. The remaining persons are known as "donors".
For each qualifying donor record, a temporary table is created that comprises the predictor variables used for the missing relate code.
5
Approach (contd.)
THIRD PASS: The predictor variables of each record with a missing relate label are compared against the temporary tables of the donors.
If the first characteristic in a donor's temporary table matches the value of the corresponding predictor variable of the current record, a score is assigned to that donor. If the first characteristic does not match, the donor is ignored.
The process continues iteratively, comparing the characteristics of the donors with the predictor variables of the record with the missing label and increasing the scores of the donors.
Result: The donor with the maximum score supplies the missing relate label of the recipient record.
This version of the algorithm was implemented in Fortran and is now being converted into Java.
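The donor scoring described in the third pass can be sketched as follows. This is a minimal illustration, not the Fortran or Java implementation: the slides do not name the predictor variables or the exact scoring rule, so the predictors (sex, age band, marital status), the one-point-per-match rule, and the function names `score_donor` and `impute_relate` are all assumptions.

```python
from typing import Optional, Sequence, Tuple, List

def score_donor(recipient: Sequence, donor: Sequence) -> Optional[int]:
    """Return a match score for one donor, or None if the donor is ignored."""
    # A donor whose first characteristic does not match is ignored.
    if not recipient or not donor or recipient[0] != donor[0]:
        return None
    # Assumption: one point per matching characteristic, compared in order.
    return sum(1 for r, d in zip(recipient, donor) if r == d)

def impute_relate(recipient: Sequence,
                  donors: List[Tuple[Sequence, str]]) -> Optional[str]:
    """Return the relate code of the highest-scoring donor, or None."""
    best_code, best_score = None, -1
    for predictors, code in donors:
        score = score_donor(recipient, predictors)
        if score is not None and score > best_score:
            best_code, best_score = code, score
    return best_code

# Hypothetical donors: (sex, age band, marital status) -> relate code.
donors = [
    (("F", "20-29", "married"), "04"),
    (("F", "20-29", "single"), "05"),
    (("M", "20-29", "married"), "06"),
]
imputed = impute_relate(("F", "20-29", "married"), donors)
```

Here the first donor matches all three characteristics and so supplies its code; a recipient whose first characteristic matches no donor gets no imputed value.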
6
Results: Traditional Approach
Fortran results. Number of all persons: 502840
Code   # imputed   Percent correct
01     101228      99.37%
02     80256       98.6%
03     244175      99.01%
04     1822        79.04%
05     3343        87.28%
06     2123        87.58%
07     5335        85.77%
08     2566        89.59%
09     6800        87.6%
10     4449        79.56%
11     0           0%
12     37281       95.31%
13     2289        87.83%
8
Proposed Approach: Data Reformatting
Original layout (one row per person):
H_id  P_id  NUMPREC  Relate  Sex  Age
1     1     3        101     M    50
1     2     3        201     F    24
1     3     3        301     M    10
Reformatted layout (one row per household):
H_id  Relate_1  Relate_2  Relate_3  Sex_1  Sex_2  Sex_3  Age_1  Age_2  Age_3  NUMPREC
1     101       201       301       M      F      M      50     24     10     3
Data reformatting: To capture the entire household in a single row, the data was reformatted so that each row contains the characteristics of every person in a household.
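The reformatting step amounts to a long-to-wide pivot keyed on the household. The slides report that SQL Server 2005 was used for this; the plain-Python sketch below only illustrates the transformation on the three-person example, and the function name `to_wide` is hypothetical.

```python
from collections import defaultdict

def to_wide(person_rows):
    """Collapse one-row-per-person records into one row per household."""
    households = defaultdict(list)
    for row in person_rows:
        households[row["H_id"]].append(row)

    wide = []
    for h_id, members in households.items():
        members.sort(key=lambda r: r["P_id"])  # keep position order
        out = {"H_id": h_id}
        # Emit Relate_1..n, then Sex_1..n, then Age_1..n.
        for field in ("Relate", "Sex", "Age"):
            for m in members:
                out[f"{field}_{m['P_id']}"] = m[field]
        out["NUMPREC"] = members[0]["NUMPREC"]
        wide.append(out)
    return wide

rows = [
    {"H_id": 1, "P_id": 1, "NUMPREC": 3, "Relate": 101, "Sex": "M", "Age": 50},
    {"H_id": 1, "P_id": 2, "NUMPREC": 3, "Relate": 201, "Sex": "F", "Age": 24},
    {"H_id": 1, "P_id": 3, "NUMPREC": 3, "Relate": 301, "Sex": "M", "Age": 10},
]
wide_rows = to_wide(rows)
```

Sorting by P_id before emitting columns preserves each person's position in the household, which the problem definition treats as significant.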
9
Proposed Approach: Classification
Classification: the task of assigning objects to one of the predefined categories.
Inputs to the classifier: F(relate_characteristics), P(x), F(family_characteristics)
Output: relationship code (ex: 04, 05, 06)
P(x): position vector of the person whose relationship is being imputed in the family
F(family_characteristics): attributes of the persons belonging to the same household
F(relate_characteristics): attributes of the person whose relationship is being imputed
10
If I am a child-in-law and the fourth person in a four-person household, would my set of characteristics overlap with the set I would have as the fourth person in a five-person household?
11
Assigning categories: Segregation
Segregation is the process of separating complex structures into smaller, more meaningful clusters such that each cluster represents a part of the complex data independently.
In the census data, the variable NUMPREC reports the number of person records included in a household.
[Figure: Age for Child_in_law, Sibling, and Sibling_in_law, by number of persons in the household]
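Segregating on NUMPREC can be sketched as a simple grouping of household rows by size, so that each cluster can be analyzed on its own. The function name `segregate_by_size` and the sample rows are hypothetical; only the NUMPREC field comes from the slides.

```python
from collections import defaultdict

def segregate_by_size(household_rows):
    """Group household rows into clusters keyed by NUMPREC."""
    clusters = defaultdict(list)
    for row in household_rows:
        clusters[row["NUMPREC"]].append(row)
    return dict(clusters)

# Hypothetical household rows; only H_id and NUMPREC are shown.
households = [
    {"H_id": 1, "NUMPREC": 4},
    {"H_id": 2, "NUMPREC": 5},
    {"H_id": 3, "NUMPREC": 4},
]
clusters = segregate_by_size(households)
four_person_ids = [h["H_id"] for h in clusters[4]]
```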
12
Proposed Approach: Classification
4-person households (relate codes by position):
01 02 03 04
01 03 04 09
01 03 03 04
01 04 07 09
Age_04  Sex_04  Occ_04  CLASS
24      M       423     04
45      F       567     04
34      F       431     04
43      M       766     04
This representation does not capture the dependence of a person on the rest of the household.
Classification is the task of learning a target function f that maps each attribute set to one of the predefined class labels.
13
Proposed Approach: Classification
4-person households (relate codes by position):
01 02 03 04
01 03 04 09
01 03 03 04
01 04 07 09
04_age  04_position  Prev_age  Next_age  Head_age  Age_diff_head  Label
24      4            12        999       54        30             04
18      3            2         7         38        20             04
34      2            999       56        64        30             04
This representation takes into account the complete household and the position vector of the person in the household.
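The position-aware attributes in this table can be derived from a household's ages in position order. The sketch below is an illustration under stated assumptions: 999 is treated as a sentinel for a missing neighbor, prev and next are taken to be the adjacent person records, and Age_diff_head is head age minus the person's age. The slides do not spell out these rules, and the function name `position_features` is hypothetical.

```python
NO_NEIGHBOR = 999  # assumed sentinel for "no person at that position"

def position_features(ages, position):
    """Build position-aware features for one person in a household.

    ages: ages ordered by position in the household; ages[0] is the head.
    position: 1-based position of the person being classified.
    """
    i = position - 1
    age = ages[i]
    head_age = ages[0]
    prev_age = ages[i - 1] if i > 0 else NO_NEIGHBOR
    next_age = ages[i + 1] if i + 1 < len(ages) else NO_NEIGHBOR
    return {
        "age": age,
        "position": position,
        "prev_age": prev_age,
        "next_age": next_age,
        "head_age": head_age,
        "age_diff_head": head_age - age,  # assumed definition
    }

# Hypothetical 4-person household; the fourth person is being classified.
features = position_features([54, 50, 12, 24], position=4)
```

For the last person in the household there is no next record, so next_age falls back to the sentinel.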
14
Data analysis
[Bar graph: household size 4, relationship codes 04 to 09]
15
Results: MetaBagging using decision trees and a rule-based classifier
Precision  Recall  F-Measure  Class
0.783      0.802   0.793      04
0.915      0.856   0.885      05
0.765      0.9     0.827      06
0.917      0.885   0.901      07
0.736      0.764   0.75       08
0.967      0.961   0.964      09
Results: AdaBoost
Precision  Recall  F-Measure  Class
0.789      0.782   0.785      04
0.915      0.86    0.887      05
0.765      0.909   0.831      06
0.897      0.887   0.892      07
0.771      0.739   0.755      08
0.956      0.96    0.958      09
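Assuming the F-Measure column is the usual harmonic mean of precision and recall, it can be recomputed from the other two columns as a sanity check. The function name `f_measure` is illustrative, and small differences from the reported values are expected because the inputs are already rounded to three decimals.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Class 04 from the first table: precision 0.783, recall 0.802.
f_class_04 = f_measure(0.783, 0.802)
```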
16
Results: Existing vs. Data Mining
Recall measure: the number of correctly classified instances out of the number of relevant instances.
17
Validation
Metric used for comparison: recall, which measures the fraction of positive examples correctly predicted by the classifier.
recall(X) = number of correctly classified instances of class X / number of instances in class X
The 1880 and 1910 1% sample data sets were used to train and test the classifier.
SQL Server 2005 was used for data preprocessing, analysis, and data reformatting.
Weka was used for building and testing the classifier.
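The per-class recall defined above can be computed directly from true and predicted labels. This is a small sketch with made-up labels, not the Weka evaluation used in the study; the function name `recall_per_class` is hypothetical.

```python
from collections import Counter

def recall_per_class(true_labels, predicted_labels):
    """recall(X) = correctly classified instances of X / instances in X."""
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {cls: correct[cls] / totals[cls] for cls in totals}

# Made-up labels for illustration only.
true_labels = ["04", "04", "05", "05", "05", "06"]
predicted = ["04", "05", "05", "05", "04", "06"]
recalls = recall_per_class(true_labels, predicted)
```

In this toy example one of the two class 04 instances is recovered, so recall for class 04 is 0.5.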
18
Summary and Future Work
Higher accuracy than the existing hot deck allocation method when predicting relationship codes 04-09 for the year 1880.
The model was tested on the 1910 dataset, where the labels were known, and it achieved the desired accuracy.
Reduced time complexity of execution.
Future work: extend the approach to a larger number of households.
Future work: extend the approach to the detailed version of the relationship variable.
19
Questions?