Creating a collection of standardized datasets on household consumption
Olivier Dupriez
World Bank, Development Data Group
6 June 2013

Initial objective
Calculate poverty PPPs
Had price data at basic heading level from the ICP; needed consumption shares “at the poverty line” for the same breakdown, to be used as weights.
See: A. Deaton and O. Dupriez, Purchasing power parity exchange rates for the global poor, American Economic Journal: Applied Economics, vol. 3 (2011), and also Global Poverty and Global Price Indexes

Intermediary output – data files
A collection of “standard” files
– Individual level: age, sex
– Household level: region, total expenditure (before and after fixing outliers), adult equivalents, household size, etc.
– Household + product level:
    Product code (original as in questionnaire, with labels) and COICOP code
    Value purchased, home produced, received, total
    Deflated (when available) / non-deflated
    NO information on quantities
– Format/structure of the data files is standard; content not so much

Multiple uses and users
Many potential applications
– IFC “Business Opportunities at the Base of the Pyramid”
– Micro-macro modeling
– Poverty/inequality analysis
– Assessment of reliability and relevance of surveys
    E.g., list all items related to health with percentage of respondents, for each survey
    E.g., list all categories not covered by questionnaires
– And many more

Method
Use household consumption/expenditure surveys
– A VERY diverse set of surveys (HBS, LSMS, HIES, etc.)
– Ex-post harmonization has limits
Map all products and services to COICOP
– From items in Brazil survey to fewer than 50 in other countries…
Annualize values by product/service and household
Fix outliers
No attempt to fill gaps (no imputation of values for missing products/services)
Generate the 3 standard files

Principle – Full replicability
One single Stata program per survey
– Calls one “generic” program to detect and fix outliers
Controlled vocabulary for file names, folder names
Survey ID to link to on-line metadata catalog

Mapping to COICOP
ICP/COICOP: 110 basic headings for household consumption
105 are relevant for household surveys
Situations:
– Many to one (e.g., long list of vegetables)
– One to one
– One to many (lack of detail in questionnaire)
– No data to one (questionnaire missed items)

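The four situations can be told apart mechanically once each questionnaire item is listed with the basic heading(s) it maps to. The sketch below is illustrative only: it is in Python rather than the Stata used for the actual collection, and the function name, item names and heading labels are all hypothetical (they are not real COICOP codes). It assumes every item maps to at least one heading.

```python
from collections import defaultdict

def classify_mapping(item_to_headings, all_headings):
    """Label each questionnaire item by mapping situation and list uncovered headings.

    item_to_headings: dict of item name -> non-empty list of basic headings.
    all_headings: every basic heading the survey is expected to cover.
    """
    # Invert the mapping to count how many items feed each heading.
    heading_to_items = defaultdict(list)
    for item, headings in item_to_headings.items():
        for heading in headings:
            heading_to_items[heading].append(item)

    status = {}
    for item, headings in item_to_headings.items():
        if len(headings) > 1:
            status[item] = "one to many"      # item too coarse for a single heading
        elif len(heading_to_items[headings[0]]) > 1:
            status[item] = "many to one"      # several items share one heading
        else:
            status[item] = "one to one"

    # "No data to one": headings that no questionnaire item maps to.
    missing = sorted(h for h in all_headings if h not in heading_to_items)
    return status, missing

# Hypothetical example data (labels are placeholders, not COICOP codes).
mapping = {
    "tomatoes": ["vegetables"],
    "onions": ["vegetables"],
    "rice": ["rice"],
    "bread and cereals": ["bread", "other_cereals"],
}
headings = ["vegetables", "rice", "bread", "other_cereals", "pork"]
status, missing = classify_mapping(mapping, headings)
```

Here `missing` lists the headings the questionnaire does not cover at all, which is the kind of per-survey listing mentioned under the uses of the collection.
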
Grouped categories
One to many: items in questionnaires are not always detailed enough to be mapped to one single COICOP basic heading

Missing categories
No questionnaire found to cover all 105 categories of products and services
On average, N basic headings missing
– Sometimes for known reasons (e.g., pork in Muslim countries)
– But questionnaire design needs improvement in all countries

Splitting grouped categories
Used breakdown from national accounts to split grouped categories (data obtained from ICP)

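One plausible reading of this step is a proportional allocation: the value reported for a grouped item is divided among its basic headings according to their relative size in the national accounts. The slides do not spell out the exact rule, so the sketch below is an assumption, written in Python for illustration, with hypothetical names and figures.

```python
def split_grouped_value(value, headings, sna_expenditure):
    """Allocate `value` across `headings` in proportion to national accounts expenditure.

    sna_expenditure: dict of basic heading -> national accounts expenditure
    (assumed non-negative, with a positive total over `headings`).
    """
    total = sum(sna_expenditure[h] for h in headings)
    if total <= 0:
        raise ValueError("national accounts total for these headings must be positive")
    return {h: value * sna_expenditure[h] / total for h in headings}

# Hypothetical figures: a household reports 100 on "bread and cereals",
# and national accounts show bread at three times other cereals.
sna = {"bread": 30.0, "other_cereals": 10.0}
shares = split_grouped_value(100.0, ["bread", "other_cereals"], sna)
```

Because the same national shares are applied to every household, the split preserves each household's total but cannot reflect differences in consumption patterns between households.
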
Correlation between SNA and surveys
From almost perfect (very few cases) to very low (many countries)

Annualization challenges
Some problematic items:
– Durables (use value/expenditure)
– Imputed rents
– Out-of-pocket health expenditure
– Ceremonies, etc.
– Food away from home
Validation: compare with official estimates when available, and with PovCal aggregates
– Never replicate exactly

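Leaving the problematic items aside, the basic annualization step can be sketched as scaling each reported value by its recall period and summing by household and product. This is a minimal Python illustration under assumed conventions (a 365-day year, recall period known in days, hypothetical record layout); the actual Stata programs may treat each survey differently.

```python
from collections import defaultdict

def annualize(value, recall_days):
    """Scale a value reported over `recall_days` days to a 365-day year."""
    if recall_days <= 0:
        raise ValueError("recall_days must be positive")
    return value * 365.0 / recall_days

def annual_totals(records):
    """Sum annualized values by (household id, product code).

    records: iterable of (household id, product code, value, recall period in days).
    """
    totals = defaultdict(float)
    for hh_id, product, value, recall_days in records:
        totals[(hh_id, product)] += annualize(value, recall_days)
    return dict(totals)

# Hypothetical records: rice reported twice with different recall periods
# (for example purchased and home produced), rent reported for the year.
records = [
    ("hh1", "rice", 70.0, 7),
    ("hh1", "rice", 30.0, 30),
    ("hh1", "rent", 1200.0, 365),
]
totals = annual_totals(records)
```

A flat scaling like this is exactly what breaks down for durables, ceremonies and other infrequent purchases, where a short recall period multiplied up to a year can badly overstate or understate annual consumption.
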
Detecting and fixing outliers
Top outliers only
Tried multiple options
Based on per capita or per household values, depending on item
Threshold: 75th percentile + 5 times interquartile range
Replace with maximum valid value (zero values not included in calculations)
If outlier for multiple items, consider “rich” household and do not fix
Would deserve a specific research project

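The core rule (upper threshold at the 75th percentile plus 5 times the interquartile range, zeros excluded, outliers replaced by the largest valid value) can be sketched as follows. This is a Python illustration, not the generic Stata program itself: the function name is hypothetical, the quartile definition is Python's "inclusive" method and may differ slightly from Stata's, and the per capita versus per household choice and the “rich” household exception are not implemented.

```python
import statistics

def fix_top_outliers(values, k=5.0):
    """Cap values above Q3 + k * IQR at the largest value that is not an outlier.

    Quartiles are computed on strictly positive values only; zeros pass through unchanged.
    """
    positive = [v for v in values if v > 0]
    if len(positive) < 2:
        return list(values)            # too few observations to compute quartiles

    q1, _, q3 = statistics.quantiles(positive, n=4, method="inclusive")
    threshold = q3 + k * (q3 - q1)

    # Largest observed value at or below the threshold ("maximum valid value").
    cap = max(v for v in positive if v <= threshold)
    return [cap if v > threshold else v for v in values]

# Hypothetical values for one item across eight households.
raw = [0, 10, 12, 14, 16, 18, 20, 500]
fixed = fix_top_outliers(raw)
# Q1 = 13, Q3 = 19, threshold = 19 + 5 * 6 = 49, so 500 is replaced by 20.
```

Only the top of the distribution is touched; low values and zeros are left as reported.
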
Outliers fixing – Significant impact
Example: change in Ginis

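To see why capping a few top values can move the Gini coefficient, the index can be computed before and after the fix. The sketch below uses the standard rank-based formula for an unweighted Gini on non-negative values; real survey estimates would use sampling weights, which are omitted here, and the numbers are hypothetical.

```python
def gini(values):
    """Unweighted Gini coefficient of non-negative values with a positive total."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    # G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n, ranks starting at 1.
    ranked_sum = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * ranked_sum / (n * total) - (n + 1.0) / n

# Hypothetical expenditures with one extreme value, before and after capping it.
before = [10, 12, 14, 16, 18, 20, 500]
after = [10, 12, 14, 16, 18, 20, 20]
change = gini(after) - gini(before)   # negative: capping the top value lowers inequality
```
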
Past and future
160 datasets “standardized”
– 90+ low and middle-income countries
Many more survey datasets available at WB; could expand and update the collection if resources are available
Conduct in-depth research work on outliers and formulate recommendations to countries
Feedback to countries on issues in questionnaire design
Dissemination of microdata?