Error check in data Hein Stigum
Example data
HUMIS
–Birth cohort, 5 counties in Norway
–N=475 mother-child pairs
–Repeated questionnaires
Purpose
–Outcome: Growth after birth
–Exposure: Contaminants in mother’s milk
Mar-16 H.S. 2
Agenda
Potential problems
–String variables, Missing, …
Univariate
Bivariate
Multivariable
Individual growth
Potential problems
String variables
String to numeric:
    encode KJONN if KJONN!=" ", generate(sex3)
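A minimal sketch of the same conversion on toy data, to show what encode produces. The three observations and the values "Gutt" and "Jente" are assumptions made up for illustration; only the names KJONN and sex3 come from the slide.

```stata
* Toy data (assumed values): KJONN is a string variable, a single blank = not recorded
clear
set obs 3
generate str5 KJONN = "Gutt"
replace KJONN = "Jente" in 2
replace KJONN = " " in 3

* encode assigns integer codes to the distinct strings (in alphabetical order)
* and stores the original strings as a value label named after the new variable
encode KJONN if KJONN!=" ", generate(sex3)

* Check the result: sex3 is numeric, and missing (.) where KJONN was blank
tabulate sex3, missing
label list sex3
```

Excluding the blank value in the if-condition keeps it from becoming a category of its own; those observations end up as ordinary numeric missing instead.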
Missing
Univariate outliers
Commands for previous plot
    local i=1
    foreach var of varlist age1 weight1 fHCB BMI1 mHeight mWeight {
        graph hbox `var', marker(1, mlabel(id) msymbol(i) mlabpos(0) mlabangle(-90)) ///
            name(plt`i', replace)
        local ++i
    }
    graph combine plt1 plt2 plt3 plt4 plt5 plt6, col(2)
Bivariate outliers
Commands for previous plot
    * Note: text() needs y and x coordinates before the label; they are not given here
    twoway (scatter mWeight mHeight) ///
        (scatter mWeight mHeight if BMI1>35 | BMI1<16, mcol(red)) ///
        (qfit mWeight mHeight) ///
        (qfit mWeight mHeight if mHeight<185) ///
        , legend(off) text( "BMI>35", col(red)) ///
        ytitle("Mother's weight") xtitle("Mother's height")
Multivariable outliers: weight
Commands for previous plot
    gen agesq=age^2
    gen ageqb=age^3
    regress weight age agesq ageqb if age>=0 & age<1000
    capture: drop xb res
    predict xb, xb      /* predicted value */
    predict res, res    /* residuals */
    tw (scatter weight age) (scatter weight age if abs(res)>4000, mcol(red)) ///
        (line xb age, sort lcol(red)) if age>=0 & age<1000, legend(off)
Plot of individual growth patterns: weight versus age
Commands for previous plots
    * Individual growth patterns. Note: 16 pages of 30 plots each
    * Repeated measurements, long format, age nested in id
    sort id age                      /* sort by id-number and age */
    global d=30                      /* 30 plots per page */
    forvalues i=1(1)16 {             /* 16 pages*30 plots=480 subjects */
        local j=(`i'-1)*$d+1         /* plot subjects in id-interval: j<=id<=k */
        local k=`i'*$d
        twoway (line weight age, connect(ascending)) if id>=`j' & id<=`k' ///
            , by(id, compact title("Weight by age, `i'") note("")) ///
            ylabel(0(5000)15000) xlabel(0(200)800)
        graph export "H:\Projects\HUMIS\Weight gain\plt`i'.emf", replace  /* Enhanced Metafile Format */
    }                                /* end of loop */
    * Make a new Photo album in PowerPoint and add all plots.
    * This gives one plot per page at maximum size.
After new data merge
Plot of individual growth patterns: weight versus age
Individual plots in large datasets?
Scan 1 page (=30 curves) in 5 sec
–Hours used = 5N/(30*60*60)
Scan all
–If N=50 000, need 2.3 hours
May instead scan curves of subjects with medium to large residuals
–Residual>1000 finds
  190 of the 470 children = 40%
  12 of the 15 deviant growth patterns = 80%
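A rough sketch of the two ideas on this slide: the scan-time arithmetic, and plotting only the subjects with large residuals. The simulated data, the linear model, and the names maxres, newid and res`i' are assumptions for illustration, not the HUMIS analysis. The sketch also assumes that "Residual>1000" refers to the absolute residual, as in the earlier abs(res)>4000 filter.

```stata
* Scan time in hours: 5 seconds per page of 30 curves, N subjects
local N = 50000
display "Hours needed: " %4.1f 5*`N'/(30*60*60)

* Simulated long-format data (assumed): 20 subjects, 3 visits each
clear
set seed 1
set obs 60
generate id = ceil(_n/3)
bysort id: generate age = 200*_n
generate weight = 3500 + 10*age + rnormal(0, 800)

* Residuals from a simple growth model (linear here, to keep the sketch short)
regress weight age
predict res, res

* Largest absolute residual per subject; flag subjects above the cut-off
egen maxres = max(abs(res)), by(id)
egen newid = group(id) if maxres>1000 & maxres<.   /* consecutive numbers for flagged subjects */

* Plot only the flagged subjects, 30 per page
sort id age
summarize newid, meanonly
if r(N) > 0 {
    local pages = ceil(r(max)/30)
    forvalues i=1/`pages' {
        local j = (`i'-1)*30+1
        local k = `i'*30
        twoway (line weight age, connect(ascending)) if newid>=`j' & newid<=`k', ///
            by(id, compact note("")) name(res`i', replace)
    }
}
```

Renumbering the flagged subjects with group() keeps every page full, so the number of pages to scan shrinks in proportion to the share of subjects flagged.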
Summing up
Graph, outliers
–Uni: Boxplots
–Bi: Scatterplots
–Multi: Scatterplots + residuals
–Individual growth
Merge errors are not rare!