Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Cleaning in Financial Modules Workshop in Frankfurt Mario Schnalzenberger.

Similar presentations


Presentation on theme: "Data Cleaning in Financial Modules Workshop in Frankfurt Mario Schnalzenberger."— Presentation transcript:

1 Data Cleaning in Financial Modules Workshop in Frankfurt Mario Schnalzenberger

2 Which Modules exactly EP –Pensions and Benefits (ep078*, ep094*) –Income from various sources (ep201, ep045, ep314, ep041, ep205, ep207, ep305) AS –Financial assets and dividends from them (as003, as005, as007, as009, as011, as015, as017, as058, as021, as024, as027, as030, as032*, as034*, as042, as049, as051) –Total debts (as051) –Savings of other HH members (as069) HC – out of pocket payments (hc045, hc047, hc049, hc051) FT – financial transfers (ft004*, ft011*, ft018) HO – rents and value of real property –(ho003, ho005, ho008, ho015, ho020, ho024, ho027, ho030) HH – household income (hh002, hh011, hh017) CO – amout spent (co002, co003, co004, co011)

3 1. Check for outliers Merge all relevant sections (include also IWER – id from dataset origin, makes it easier for the agency) Recode „Don‘t Know“, „Refuse“ as missing –(eg. use „mvdecode _all, mv(9e+20=.a \ 8e+20=.b)” in STATA) Use graphical methods (e.g. histograms if that is easier for you) Search for outliers in each section: use percentiles, min, max or other automatic procedures (see hadimvo)

4 Graphical check Here you see that without correcting the outliers, all data is concentrated in few bins Corrected data is distributed nicely

5 Automatic check for the FT section foreach var of varlist ft004_1 ft004_2 ft004_3 ft011_1 ft011_2 ft018_1 { noi disp "`var'" noi sum `var', detail hadimvo `var' if `var'>0, gen(g_`var') sort laptop date1 time1 noi list laptop sampid respid `var' if g_`var'==1 } egen g_ft=rsum(g_ft*) sort laptop date1 time1 noi disp "Rows with more than one outlier in FT" noi list laptop sampid respid g_ft if g_ft>1

6 Look at those outputs Find real outliers & typos Find currency problems (this may be hard in various countries) Make a list (eg. Excel – File) for the agency Those issues should be easy to fix.

7 In addition to ask the agency Check things which are duplicate in the data Gross and net income of same source (but Attention: sometimes apply to different years)! Income (ep078* ep094* and others) + hh002 + hh011 ~= hh017 It often occured that hh002 was set to the partners income (exactly), even though this is wrong (see question text)

8 Find absolute brackets Maximum pensions (maybe per type, in Austria 3850 Euro per month) Find some useful brackets for assets (we used 100000 Euro for many of them) Look for proportions of large values in the whole dataset (reasonable?) Look for wealthy households

9 Don‘t forget Usually we have a skewed distribution (monetary numbers) Therefore outliers are more easily found on the maximum side! No outliers will be usually found on the minimum side. Check also the minimal numbers (e.g. we had houses worth 0, 2 or 5 Euro, penisons of 0 or 1 Euro a month, …) – see also Dimitri‘s presentation

10 Other issues Periodicity of payments (e.g. in Austria ALL(!) pensions are paid monthly only) Sometimes the respondents told they got additional payments (last month, i.e. ep201 is higher than in usual month), but those payments (ep314) are equal to or even higher than ep201 This would mean they have no or even negative regular income payment (obviously wrong)


Download ppt "Data Cleaning in Financial Modules Workshop in Frankfurt Mario Schnalzenberger."

Similar presentations


Ads by Google