IMPUTING MISSING VALUES FOR HIERARCHICAL POPULATION DATA
Overview of Database Research
Muhammad Aurangzeb Ahmad, Nupur Bhatnagar


Background
Along with harmonizing U.S. and international census data, the Minnesota Population Center improves data quality for historical U.S. census samples. For modern samples, the U.S. Census Bureau has already allocated any missing data. Data from older census years had to be converted into machine-readable form, and errors and omissions were coded as missing. If the records with missing data are not representative of the dataset as a whole, discarding the missing cases can affect the results of calculations in an undesirable way.

Problem Definition
Given: the 1850, 1860, and 1870 datasets with a set of variables. The missing variable of interest is the RELATE variable, which describes an individual's relationship to the head of household.
Constraints: A person is not traceable across multiple years. The position of people within a household is significant. Household structure varies within a given year, with household sizes ranging from 1 to 17 people.
Importance: The RELATE variable is vital for researchers performing trend analysis of household structure.
To find: Predict the RELATE variable for the years where it is missing.
Relate codes (the slide tabulated missing vs. available counts for each):
01 Head/Householder
02 Spouse
03 Child
04 Child-in-law

Existing Approach: Hot Deck Allocation
FIRST PASS: Substitute missing values using simple, explicitly hard-coded rules. For example:
Relate code 101: head of household
Relate code 201: spouse of the head of household
Relate code 301: child of the head of household
These relationships, if missing, are generated by simple rules. Result: almost 75% of the missing values are imputed.
SECOND PASS: The remaining 25% of cases are assigned using the following process. Persons with a relate code of 101, 201, or 301 are removed; the remaining persons are known as "donors". For each donor, a temporary table is created that comprises the predictor variables for the missing relate code of the record. A temporary table is created for each qualifying donor record.

Approach (contd.)
THIRD PASS: The predictor variables of each record with a missing relate label are compared against the temporary tables of the donors. If the first characteristic in the temporary table matches the value of the predictor variable of the current record, a score is assigned to that donor; if the first characteristic does not match, the donor is ignored. The process continues iteratively, comparing the characteristics of donors with the predictor variables of the missing record and increasing the donors' scores.
Result: The donor with the maximum score qualifies to substitute the missing relate label of the recipient record. This version of the algorithm was implemented in Fortran and is now being converted to Java.
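The donor-scoring passes described above can be sketched as follows. This is a minimal illustration, not the Fortran/Java implementation: the predictor variables (age, sex, occ) and the sample records are hypothetical stand-ins for the actual census fields.

```python
# Hot deck donor scoring, sketched from the slide's description.
# Assumed predictor ordering; the real variable list is not given here.
PREDICTORS = ["age", "sex", "occ"]

def score_donor(recipient, donor):
    """Return a match score, or None if the first characteristic differs
    (per the slides, such a donor is ignored outright)."""
    if recipient[PREDICTORS[0]] != donor[PREDICTORS[0]]:
        return None
    score = 1
    for var in PREDICTORS[1:]:
        if recipient[var] == donor[var]:
            score += 1
    return score

def impute_relate(recipient, donors):
    """Substitute the missing relate code from the best-scoring donor."""
    best, best_score = None, -1
    for donor in donors:
        s = score_donor(recipient, donor)
        if s is not None and s > best_score:
            best, best_score = donor, s
    return best["relate"] if best else None

donors = [
    {"age": 30, "sex": "M", "occ": 5, "relate": "04"},
    {"age": 30, "sex": "F", "occ": 5, "relate": "05"},
    {"age": 25, "sex": "M", "occ": 7, "relate": "06"},
]
recipient = {"age": 30, "sex": "F", "occ": 5}
```

Here the second donor matches on all three predictors and wins, while the third is ignored because its first characteristic (age) does not match.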

Results: Traditional Approach
Fortran results. Num of all persons:
# of imputed 01: , percent correct: 99.37%
# of imputed 02: 80256, percent correct: 98.6%
# of imputed 03: , percent correct: 99.01%
# of imputed 04: 1822, percent correct: 79.04%
# of imputed 05: 3343, percent correct: 87.28%
# of imputed 06: 2123, percent correct: 87.58%
# of imputed 07: 5335, percent correct: 85.77%
# of imputed 08: 2566, percent correct: 89.59%
# of imputed 09: 6800, percent correct: 87.6%
# of imputed 10: 4449, percent correct: 79.56%
# of imputed 11: 0, percent correct: 0%
# of imputed 12: 37281, percent correct: 95.31%
# of imputed 13: 2289, percent correct: 87.83%


Proposed Approach: Data Reformatting
Original layout, one row per person: H_id, P_id, NUMPREC, Relate, Sex, Age.
Reformatted layout, one row per household: H_id, Relate_1, Relate_2, Relate_3, Sex_1, Sex_2, Sex_3, Age_1, Age_2, Age_3, NUMPREC.
Data reformatting: in order to capture the entire household in a single row, the data was reformatted so that each row contains the characteristics of every person in the household.
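The person-to-household reformatting can be sketched with plain Python dicts. The field names mirror the slide (H_id, P_id, NUMPREC, Relate, Sex, Age); the sample values are invented, since the slide's own example rows did not survive extraction.

```python
from collections import defaultdict

# One row per person, as in the original long layout.
persons = [
    {"H_id": 1, "P_id": 1, "NUMPREC": 3, "Relate": "01", "Sex": "M", "Age": 40},
    {"H_id": 1, "P_id": 2, "NUMPREC": 3, "Relate": "02", "Sex": "F", "Age": 38},
    {"H_id": 1, "P_id": 3, "NUMPREC": 3, "Relate": "03", "Sex": "M", "Age": 10},
]

def reformat(rows):
    """Collapse person-level rows into one wide row per household,
    suffixing each person's attributes with their position (P_id)."""
    households = defaultdict(dict)
    for r in sorted(rows, key=lambda r: (r["H_id"], r["P_id"])):
        h = households[r["H_id"]]
        i = r["P_id"]
        h["H_id"] = r["H_id"]
        h["NUMPREC"] = r["NUMPREC"]
        h[f"Relate_{i}"] = r["Relate"]
        h[f"Sex_{i}"] = r["Sex"]
        h[f"Age_{i}"] = r["Age"]
    return list(households.values())

wide = reformat(persons)
```

The three person rows collapse into a single household row with Relate_1..3, Sex_1..3, and Age_1..3 columns.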

Proposed Approach: Classification
classifier(F(relate_characteristics), P(x), F(family_characteristics)) → relationship code, e.g. 04, 05, 06
Classification: the task of assigning objects to one of a set of predefined categories.
P(x): position vector within the family of the person whose relationship is being imputed.
F(family_characteristics): attributes of the persons belonging to the same household.
F(relate_characteristics): attributes of the person whose relationship is being imputed.

In a 4-person household: if I am a child-in-law and the 4th person in the household, compared with again being the 4th person but this time in a 5-person household, would my sets of characteristics overlap?

Assigning Categories: Segregation
Segregation is the process of separating complex structures into smaller, more meaningful clusters such that each cluster independently represents a part of the complex data. In the census data, the variable NUMPREC reports the number of person records included in a household.
[Diagram: clusters of relationship codes (sibling-in-law, sibling, child-in-law) plotted against age and number of persons, shown separately for codes 04 and 05.]
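Segregating by household size amounts to partitioning the household rows on NUMPREC, so that a separate model can be trained per household size. A minimal sketch, with invented sample rows:

```python
from collections import defaultdict

# Hypothetical wide household rows; only NUMPREC matters for segregation.
households = [
    {"H_id": 1, "NUMPREC": 4},
    {"H_id": 2, "NUMPREC": 5},
    {"H_id": 3, "NUMPREC": 4},
]

def segregate(rows):
    """Split households into one cluster per household size (NUMPREC),
    so each cluster can be modeled independently."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["NUMPREC"]].append(row)
    return dict(clusters)

clusters = segregate(households)
```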

Proposed Approach: Classification — 4-Person Household
[Table: sample rows with columns Age_04, Sex_04, Occ_04, CLASS.]
This representation does not capture the dependence of a person with respect to the household. Classification is the task of learning a target function f that maps each attribute set to one of the predefined class labels.

Proposed Approach: Classification — 4-Person Household
[Table: sample rows with features age_04, position, prev_age, next_age, head_age, age_diff_head, label.]
This representation takes into account the complete household and the position vector of the person in the household.
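The position-aware features listed above (prev_age, next_age, head_age, age_diff_head) can be sketched as follows. The ages and positions are invented sample data, and position 1 is assumed to be the head of household:

```python
def position_features(ages, position):
    """Build position-aware features for the person at `position`
    (1-based); ages[0] is assumed to be the head of household."""
    i = position - 1
    head_age = ages[0]
    return {
        "age": ages[i],
        "position": position,
        "prev_age": ages[i - 1] if i > 0 else None,
        "next_age": ages[i + 1] if i + 1 < len(ages) else None,
        "head_age": head_age,
        "age_diff_head": head_age - ages[i],
    }

# 4-person household: head aged 45, then persons aged 40, 18, 22.
feats = position_features([45, 40, 18, 22], position=4)
```

The fourth person's features then include the age of the neighboring person (18), the head's age (45), and the age gap to the head (23), capturing the within-household dependence the flat representation misses.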

Data Analysis
[Bar graph: household size distribution.]

Results: Data Mining
[Tables: precision, recall, and F-measure per class for bagging with decision trees and a rule-based classifier, and for AdaBoost.]

Results: Existing vs. Data Mining
Recall measure: the number of correctly classified instances out of the number of relevant instances.

Validation
Metric used for comparison — recall: the fraction of positive examples correctly predicted by the classifier.
recall(X) = number of correctly classified instances of class X / number of instances in class X
The 1880 and 1910 1% sample datasets were used to train and test the classifier. SQL Server 2005 was used for data preprocessing, analysis, and data reformatting; Weka was used for building and testing the classifier.
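The per-class recall defined above can be computed directly; the label vectors here are invented examples, not the actual evaluation data:

```python
def recall(y_true, y_pred, cls):
    """recall(X) = correctly classified instances of class X
    divided by all instances in class X."""
    relevant = [t for t, p in zip(y_true, y_pred) if t == cls]
    correct = [t for t, p in zip(y_true, y_pred) if t == cls and p == cls]
    return len(correct) / len(relevant) if relevant else 0.0

# Hypothetical relate-code labels for five persons.
y_true = ["04", "04", "05", "04", "05"]
y_pred = ["04", "05", "05", "04", "04"]
```

For class "04", two of the three true instances are predicted correctly, giving a recall of 2/3.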

Summary and Future Work
Higher accuracy when predicting relationship codes compared with the existing hot deck allocation method. Tested our model on the 1910 dataset, where the labels were known, and it achieved the desired accuracy. Reduced time complexity of execution. Future work: extend the approach to a larger number of households, and extend it to the detailed version of the relationship variable.

Questions?
