Replacing Missing Values Jukka Parviainen Tik-61.181 Special Course in Information Technology 27.10.1999.

Presentation transcript:

Slide 1: Replacing Missing Values
Jukka Parviainen, Tik-61.181 Special Course in Information Technology

Slide 2: Agenda
- Motivation
- Objectives
- Meaning for the conclusions
- Origin of missing values (MV)
- Detection of missing values
- Replacing missing values
- Examples

Slide 3: References
- Pyle: Data Preparation for Data Mining (DP for DM), chapter 8
- Hair, Anderson, Tatham, Black: Multivariate Data Analysis
- Bishop: Neural Networks for Pattern Recognition (NN for PR)

Slide 4: …missing values?

Slide 5: Motivation
- There are always MVs in a real data set
- MVs may have an impact on modeling; in fact, they can destroy it!
- MVs also contain information!
- Hint for the modeler: Avoid-Detect-Replace-Understand

Slide 6: “Definitions”
- Missing value: not captured in the data set (errors in data entry, transmission, ...)
- Empty value: no value exists in the population
- Outlier, out-of-range value

Slide 7: Objectives
- Controlled and understood by the modeler
- “Least harm”: no “new” information introduced into the data set
- Statistical estimation of MVs is not the primary issue; data mining (DM) is
- KISS: speed and simplicity
- PIE-I/O: training + testing + execution
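The training + testing + execution objective can be illustrated with a small sketch: replacement values are computed once from the training data, stored, and reused unchanged when new rows arrive. This is only an assumption about how the idea might be implemented; the function names, the variable names `temp` and `load`, and the choice of the mean as the stored value are illustrative and do not come from the slides.

```python
from statistics import mean


def fit_replacements(training_rows):
    """Compute one replacement value per variable from the training data only."""
    replacements = {}
    for var in training_rows[0]:
        valid = [row[var] for row in training_rows if row[var] is not None]
        # Assumption: every variable has at least one valid training value.
        replacements[var] = mean(valid)
    return replacements


def apply_replacements(row, replacements):
    """Fill missing entries (None) of one row with the stored training values."""
    return {var: (replacements[var] if value is None else value)
            for var, value in row.items()}


# Hypothetical training data; None marks a missing value.
training = [
    {"temp": 20.0, "load": 5.0},
    {"temp": None, "load": 7.0},
    {"temp": 24.0, "load": None},
]
stored = fit_replacements(training)

# At execution time a new row arrives with a missing value.
new_row = {"temp": None, "load": 6.5}
filled = apply_replacements(new_row, stored)
print(filled)
```

Keeping the fitted values separate from the step that applies them means the same replacement is used in training, testing, and execution, so the model never sees a replacement rule it was not trained with.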

Slide 8: Origin and Detection
- Missing data process
- Degree of randomness
  - nonrandom
  - missing at random
  - missing completely at random
- Detecting missing value patterns (MVP)
  - number of MVs in each variable/case
  - compare MVP to complete sets
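The detection steps listed on this slide can be sketched in a few lines of plain Python. The data, the variable names, and the use of `None` as the missing-value marker are assumptions made for illustration; a real data set may encode missing values differently (empty strings, sentinel codes, default values).

```python
from collections import Counter
from statistics import mean

# Hypothetical questionnaire data; None marks a missing value.
rows = [
    {"age": 34,   "income": 4200, "city": "Espoo"},
    {"age": None, "income": 3900, "city": "Turku"},
    {"age": 51,   "income": None, "city": None},
    {"age": None, "income": 3100, "city": "Oulu"},
]
variables = list(rows[0].keys())

# Number of MVs in each variable and in each case.
mv_per_variable = {v: sum(row[v] is None for row in rows) for v in variables}
mv_per_case = [sum(row[v] is None for v in variables) for row in rows]

# Missing value patterns: one tuple of True/False flags per case, tallied.
pattern_counts = Counter(
    tuple(row[v] is None for v in variables) for row in rows
)

# Compare a variable between cases where another variable is missing or present.
income_age_missing = [r["income"] for r in rows
                      if r["age"] is None and r["income"] is not None]
income_age_present = [r["income"] for r in rows
                      if r["age"] is not None and r["income"] is not None]

print(mv_per_variable)
print(mv_per_case)
print(pattern_counts.most_common())
print(mean(income_age_missing), mean(income_age_present))
```

A clear difference between the two group means would hint that the missingness of `age` is related to `income`, that is, the values are not missing completely at random. With four rows this is only a demonstration of the bookkeeping, not a statistical test.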

Slide 9: Replacing missing values
- Randomness of MVs?
- Methods
  - Use the complete data
  - Delete variable(s)/case(s)
  - Imputation methods...
  - Model based (ML, Bayes)
  - Use robust models
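The two simplest methods on this slide, deleting variables and using only complete cases, might look as follows. This is a minimal sketch: the function names, the `max_missing_share` threshold, and the sample data are hypothetical, and `None` is assumed to mark a missing value.

```python
def delete_sparse_variables(rows, max_missing_share):
    """Drop variables whose share of missing values exceeds the threshold."""
    n = len(rows)
    keep = [v for v in rows[0]
            if sum(r[v] is None for r in rows) / n <= max_missing_share]
    return [{v: r[v] for v in keep} for r in rows]


def delete_incomplete_cases(rows):
    """Keep only cases with no missing values (use the complete data)."""
    return [r for r in rows if all(value is not None for value in r.values())]


# Hypothetical data set with three variables.
rows = [
    {"a": 1.0,  "b": None, "c": 3.0},
    {"a": 2.0,  "b": None, "c": None},
    {"a": 3.0,  "b": 5.0,  "c": 1.0},
    {"a": None, "b": None, "c": 2.0},
]

# Deleting cases directly leaves a single case, because "b" is mostly missing.
only_complete = delete_incomplete_cases(rows)

# Dropping the sparse variable first preserves more cases.
reduced = delete_sparse_variables(rows, max_missing_share=0.5)
complete = delete_incomplete_cases(reduced)

print(len(only_complete), len(complete))
```

The order of the two deletions matters: removing one badly populated variable first can save many cases, which is relevant when only small amounts of data are available.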

Slide 10: Imputation methods
- Process of estimating MVs based on valid values of other variables/cases
- Techniques:
  - distribution characteristics from all available valid values
  - replacing: case substitution, mean substitution, cold deck, regression imputation
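Two of the replacement techniques named here, mean substitution and regression imputation, can be sketched as follows. The code assumes `None` marks a missing value and uses a single predictor with an ordinary least-squares line; the function names and data are illustrative only, and the slides do not prescribe a particular implementation.

```python
from statistics import mean


def mean_substitution(values):
    """Replace each missing value with the mean of the valid values."""
    valid = [v for v in values if v is not None]
    m = mean(valid)
    return [m if v is None else v for v in values]


def regression_imputation(x, y):
    """Fill missing y values from x with a least-squares line fitted on complete pairs."""
    pairs = [(xi, yi) for xi, yi in zip(x, y)
             if xi is not None and yi is not None]
    mx = mean([p[0] for p in pairs])
    my = mean([p[1] for p in pairs])
    sxx = sum((xi - mx) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in pairs)
    slope = sxy / sxx  # Assumption: x is not constant, so sxx > 0.
    intercept = my - slope * mx
    return [intercept + slope * xi if yi is None and xi is not None else yi
            for xi, yi in zip(x, y)]


# Hypothetical data: y is missing for the last case.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, None]

y_mean = mean_substitution(y)
y_reg = regression_imputation(x, y)
print(y_mean[-1], y_reg[-1])
```

In this example mean substitution fills in the center of the observed values, while regression imputation follows the relationship between `x` and `y`. Mean substitution leaves the variable's mean unchanged but shrinks its variance, which is one way an imputation method can distort the distribution of the data.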

Slide 11: Examples
- Polls, questionnaires
  - Planning more than essential
  - Human factors!
  - Small amounts of data
- Data from steel plant
  - Information system
  - Errors, default values
  - Lots of data

Slide 12: Questions
- Do software applications help or hide the effect of missing values? (SPSS Clementine)
- Execution/prediction phase of DM process?
- What to do with alpha variables?