SDMX: Enabling World Bank to automate data ingestion
Siddhesh Kaushik, World Bank
SDMX Global Conference (Oct 2-3 2017, Addis Ababa)

Good morning everyone.
Our Focus
World Development Indicators
- The primary World Bank collection of development indicators
- Cross-country data compiled from officially recognized sources
- The most current and accurate global development data, including national, regional and global estimates
- 1,400+ indicators across statistical domains, updated quarterly

The WDI is the World Bank's primary collection of cross-country comparable development indicators; it has over 1,400 indicators and is updated quarterly.
WDI Production Process
- Data Extraction: Download, Clean, Transform, Load
- Consolidate, Verify
- Publish

Data is pulled in from multiple agencies and in different formats. In the past we pulled data from the IMF via backdoor access to their server and populated an in-house query tool so staff could get easier access to the data. The data is then cleaned to remove aggregates we do not use, transformed into the format needed to load into the Data Management System, and loaded. The loaded data is consolidated, verified and published (a rough sketch of one such cycle follows below).
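For concreteness, here is a minimal sketch in R of what one manual clean-transform-load cycle amounted to, assuming a small wide-format indicator extract; the column names, aggregate codes and SQLite staging file are illustrative stand-ins, not the actual WDI production schema.

```r
# Illustrative sketch of a manual clean -> transform -> load cycle.
# The input data, aggregate codes and staging database are hypothetical.
library(DBI)
library(RSQLite)

# A tiny stand-in for a downloaded wide-format extract (one column per year).
raw <- data.frame(country_code   = c("IND", "DEU", "WLD"),
                  indicator_code = "NY.GDP.MKTP.CD",
                  Y2015 = c(2.1, 3.4, 75.0),
                  Y2016 = c(2.3, 3.5, 76.1),
                  stringsAsFactors = FALSE)

# Clean: drop aggregate rows we do not use (codes are illustrative).
aggregates <- c("WLD", "EUU", "ARB")
clean <- raw[!raw$country_code %in% aggregates, ]

# Transform: wide to long (country, indicator, year, value).
year_cols <- grep("^Y\\d{4}$", names(clean), value = TRUE)
long <- reshape(clean, direction = "long",
                varying = year_cols, v.names = "value",
                timevar = "year",
                times   = as.integer(sub("^Y", "", year_cols)),
                idvar   = c("country_code", "indicator_code"))

# Load into the Data Management System (a local SQLite file stands in here).
con <- dbConnect(SQLite(), "dms_staging.sqlite")
dbWriteTable(con, "indicator_staging", long, overwrite = TRUE)
dbDisconnect(con)
```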
Challenges in Data Extraction
- Different sources & formats
- Clean and validate
- Time consuming
- Manual process
- More quality checks

Many of you will have figured out the challenges from the previous slide, since most of us face similar problems. The major ones: data comes from different sources, in different formats, obtained through different methods. Excel files carry a lot of additional information meant for human consumption that has to be cleaned out. For security reasons the backdoor connection to the IMF was closed, and we had to run multiple queries on the IMF site to get a manageable amount of data onto a desktop; large databases like Direction of Trade could no longer be updated. The majority of the cleaning and validation work is manual, which makes it very time consuming. A manual process also means more quality checks, especially when dealing with large volumes of data.
SDMX to the rescue

SDMX, the "Super Data Machine Exchange", was called in to help us with the situation.
SDMX Implementation
- Scheduler
- SDMX Connector
- Mapping Management
- Transform
- Database Connector
- SDMX Web Service
- Dissemination

This is how we implemented it. A scheduler lets us specify the SDMX web service or file to be processed, the frequency of processing and the people to be notified on completion. Using a combination of the Eurostat SDMX Source and the R SDMX package we implemented a connector to fetch and read SDMX data. A mapping and transformation component lets us map SDMX codes to internal codes and merge dimensions to match our internal structure (a sketch of these steps follows below). Database connectors let us push to different databases, primarily the management and dissemination databases, with the flexibility to push to additional databases if needed. We then used Eurostat's SDMX-RI to provide an SDMX-enabled web service for WDI dissemination.
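Below is a minimal sketch of the connector, mapping and database steps, assuming the R SDMX package mentioned on the slide is rsdmx. The Eurostat dataflow, the code-mapping table and the SQLite target are illustrative stand-ins for the production configuration, not the actual implementation.

```r
# Sketch of connector -> mapping -> database push, under the assumptions above.
library(rsdmx)
library(DBI)
library(RSQLite)

# 1. SDMX connector: fetch a dataflow from a registered provider (Eurostat here;
#    the flow and period are illustrative).
sdmx <- readSDMX(providerId = "ESTAT", resource = "data",
                 flowRef = "avia_paoc", start = 2015, end = 2016)
df <- as.data.frame(sdmx)

# 2. Mapping: translate SDMX dimension codes to internal codes
#    (hypothetical lookup table; production mappings are configuration-driven).
code_map <- data.frame(sdmx_code     = c("DE", "FR"),
                       internal_code = c("DEU", "FRA"),
                       stringsAsFactors = FALSE)
df <- merge(df, code_map, by.x = "GEO", by.y = "sdmx_code", all.x = TRUE)

# 3. Database connector: push to the management database
#    (a local SQLite file stands in; production would append rather than overwrite).
con <- dbConnect(SQLite(), "management_db.sqlite")
dbWriteTable(con, "sdmx_ingest", df, overwrite = TRUE)
dbDisconnect(con)
```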
Benefits
- Single tool
- Easy to add new data sources
- Time saving
- Unexpected bonus: code reuse, access to 8,000+ datasets

The most visible benefit is time saving for staff, giving them more time for more useful work than cleaning data. We now have a single tool to pull and process SDMX data, and it can deposit the data into multiple systems. The databases we used to update through backdoor access are being updated again, and more frequently; data is now updated weekly. To accommodate a new SDMX source we only have to add configuration (a sketch follows after this slide). All of this leads to better data quality and timeliness, and the code can be reused for other activities. While the focus of the project was to automate the datasets needed for WDI, we realized we now have access to over 8,000 datasets from various agencies. For example, we can now access Eurostat data such as air traffic statistics that will be valuable to the Transport group in the Bank. This posed a challenge too: we could not pull all of this data by hand.
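As an illustration of the configuration point, a new source can be described as one more entry in a config list that a generic fetch routine and the scheduler work through. Everything below (provider IDs, flow IDs, schedules, addresses) is hypothetical and again assumes rsdmx.

```r
# Hypothetical configuration-driven source list; adding a source = adding an entry.
library(rsdmx)

sources <- list(
  list(name = "Eurostat air passengers", providerId = "ESTAT",
       flowRef = "avia_paoc", schedule = "weekly",
       notify = "wdi-team@example.org"),
  list(name = "Another agency dataset",  providerId = "OECD",
       flowRef = "SOME_FLOW",            # placeholder flow ID
       schedule = "weekly", notify = "wdi-team@example.org")
)

# Generic fetch routine the scheduler calls for each configured source.
fetch_source <- function(src) {
  message("Fetching ", src$name, " from ", src$providerId)
  as.data.frame(readSDMX(providerId = src$providerId,
                         resource = "data", flowRef = src$flowRef))
}

# The scheduler would loop over sources on their schedule; one pull shown here.
df <- fetch_source(sources[[1]])
```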
SDMX Browser
- Access to 8,000+ datasets from 9 organizations

To facilitate access to this data we built an SDMX browser, and saved a lot of time thanks to OECD making their work available on GitHub. (Screen design courtesy of OECD.)
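For a flavour of what the browser does at the data layer, rsdmx can enumerate known providers and the dataflows each one exposes. The provider registry bundled with rsdmx is not identical to the nine organizations in our browser, so treat this only as a sketch.

```r
# Enumerate SDMX providers known to rsdmx and list one provider's dataflows.
library(rsdmx)

providers <- as.data.frame(getSDMXServiceProviders())
head(providers)

# Catalogue of dataflows exposed by one provider (Eurostat here).
flows <- as.data.frame(readSDMX(providerId = "ESTAT", resource = "dataflow"))
head(flows)
```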
Future plans
- More Bank data in SDMX

We have around 56 datasets in our dissemination system and we will start SDMX-enabling them. We also need validation routines for the data being pulled in (sketched below). The other missing piece in the automation puzzle is metadata ingestion. Finally, we have to face the reality of Excel data and automate its ingestion, reusing a lot of the support components built for the SDMX automation.
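On the validation point, a routine along these lines is the kind of thing we have in mind; the column names and thresholds are illustrative, not the internal WDI structure.

```r
# Hedged sketch of a validation routine for incoming pulls; checks are illustrative.
validate_pull <- function(df) {
  issues <- character(0)

  # Required columns must be present.
  required <- c("internal_code", "year", "value")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0)
    issues <- c(issues, paste("missing columns:", paste(missing_cols, collapse = ", ")))

  # Observation values must be numeric and finite.
  if ("value" %in% names(df) && any(!is.finite(df$value)))
    issues <- c(issues, "non-numeric or missing observation values")

  # Years must fall in a plausible range.
  this_year <- as.integer(format(Sys.Date(), "%Y"))
  if ("year" %in% names(df) && any(df$year < 1960 | df$year > this_year))
    issues <- c(issues, "years outside the expected range")

  issues  # an empty character vector means the pull passed these checks
}

# Example: a clean one-row pull passes.
validate_pull(data.frame(internal_code = "DEU", year = 2016, value = 3.5))
```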
Thank you