WebDataNet Conference 2015 Salamanca, 26th – 28th May 2015 Big data and official statistics - progress at European level Fernando Reis, Big Data Task-Force European Commission (Eurostat)
The data deluge © Copyright Brett Ryder 2010
The data deluge The origins Decreasing storage price Source: JC McCallum http://www.jcmit.com/disk2014.htm Some methodology and references can be found in: JC McCallum, Price-Performance of Computer Technology (1st edition), pages 4-1 to 4-18 in The Computer Engineering Handbook, Vojin G. Oklobdzija, editor, CRC Press, Boca Raton, Florida, USA, 2002
The data deluge The origins Decreasing storage price Developments in distributed computing paradigm Copyright © 2014 The Apache Software Foundation.
The data deluge The 3 V’s of big data Volume (large N and large P) Velocity (data streaming and real-time statistics) Variety
The data deluge Organic data / exhaust data / digital footprint Data ubiquity: text, sound, images, video Emergent use of unstructured data Reality mining
The data deluge Communication Mobile phone data Social Media WWW Web Searches Businesses' Websites E-commerce websites Job advertisements Real estate websites Sensors Traffic loops Smart meters Vessel Identification Satellite Images Process generated data Flight Booking transactions Supermarket Cashier Data Financial transactions Crowd sourcing VGI websites (OpenStreetMap) Community pictures collection The key critical success factor in the action plan is an agreement at ESS level to embark on a number of concrete pilot projects. A real understanding of the implications of big data for official statistics can only be gained through hands-on experience, ‘learning by doing’. Different actors have already gained experience conducting pilots in their respective organisations at global, European and at national level. The basic principle for the identification of the pilots should be their potential for producing official statistics. In addition, the pilots should cover a variety of characteristics related to big data and production of statistics. On the basis of the principles described in the action plan and roadmap, the he ESS Big Data ask Force should make proposals for pilots which would have to be agreed by the ESSC. Participation in the pilots will be guided by the collaboration principles of the ESS Vision on a case-by-case approach.
Analytics This photo, “Cartoon: Big Data” is copyright (c) 2014 Thierry Gregorius and made available under an Attribution 2.0 Generic license.
Analytics How to deal with exhaust data? Massive datasets Dealt by machine learning / predictive analytics Massive datasets Foster machine learning Data science: a new discipline? Multiple inference Overfitting
Analytics Visualisation Data-driven applications Predictive analytics The end of theory Predictive analytics Ex: Google translate, voice recognition, suggestions systems, health applications New types of data products Official stat: chances of getting a new job Ex: Inflation Calculator: Bureau of Labor Statistics
Predicting Personality Using Mobile Phone-Based Metrics de Montjoye, Yves-Alexandre, et al. "Predicting personality using novel mobile phone-based metrics." Social Computing, Behavioral-Cultural Modeling and Prediction. Springer Berlin Heidelberg, 2013. 48-55.
Multidimensional Poverty Index (Lighter colour indicates higher poverty) Poverty map estimated based on mobile phone data Poverty map in finer granularity estimated based on mobile phone data Smith, Christopher, Afra Mashhadi, and Licia Capra. "Ubiquitous sensing for mapping poverty in developing countries." Paper submitted to the Orange D4D Challenge (2013).
Big data industry sector
An emergent market Monetisation of data: Data is the new oil Data as a new factor of production (competitive differentiating factor for businesses) A threat to official statistics? (ex: Argentina) Data ecosystem The cases of Google and Facebook
What does big data mean for official statistics? Change of paradigm From: finite population sampling methodology To: additional statistical modelling and machine learning from designers of data collection processes to designers of statistical products Privacy Use of digital footprint Data subject lack of control of data High data detail and insight from analytics
Big data strategy of the European Statistical system Scheveningen Memorandum ESS big data action plan
Scheveningen Memorandum September 2013 Big data strategy Roadmap Heads of the National Statistical Institutes of the EU
Scheveningen Memorandum on Big Data Examine the potential of Big Data sources for official statistics Official Statistics Big Data strategy as part of wider government strategy Address privacy and data protection Collaboration at European and global level Address need for skills Partnerships between different stakeholders (government, academics, private sector) Developments in Methodology, quality assessment and IT Adopt action plan and roadmap for the European Statistical System At their meeting in September 2013, the directors general of the European Statistical System agreed on a memorandum on Big Data. The so called Scheveningen memorandum called for the submission of the action plan and roadmap on Big Data. It also expresses the main points of such an action plan, such as the integration of a Big Data strategy for statistics into an overall government strategy, the need for skills, the collaboration at European as well as at global level, the need for methodological developments and the protection of personal data. The action plan and roadmap was adopted by the ESS at its meeting on 25 Sep 2014.
Big data strategy Start with concrete pilots 3 time-frames Short-term: pilots, where we are, what do we need Medium-term: Long-term: Review the roadmap
Big data sources integrated in ESS official statistics production Roadmap "As is" versus "To be" Long term Vision Medium term aims Short term objectives Topics developed for integration of Big Data into official statistics Government Strategy Pilots finalised IT infrastructure, Methods, Quality Framework Skills Partnerships Ethics > 2020 Big data sources integrated in ESS official statistics production By 2020 Commission big data strategy Identification and analysis of output portfolio Pilot projects launched Skills and training requirements identified Horizon 2020 Research Framework Programme Communication strategy Exchange with stakeholders Long Term vision beyond 2020: Big data sources integrated in ESS official statistics production By 2020, we want to prepare the grounds for the successful integration of Big Data into official statistics. The issue should be tackled by the topics which have already been identified in the Scheveningen memorandum and further developed in the action plan. Medium term aims: Topics developed for integration of Big Data into official statistics Government Strategy Pilots finalised IT infrastructure, Methods, Quality Framework Skills Partnerships Ethics Short term goals: Commission big data strategy Identification and analysis of output portfolio Pilot projects launched Skills and training requirements identified Horizon 2020 Research Framework Programme Communication strategy Exchange with stakeholders Long term vision Big data sources are available to the ESS (in such a way that business continuity is guaranteed) The European and national legislations enable Big Data usage Statistics graduates with data science skills available across the ESS Methods, tools, IT infrastructures and quality frameworks are reviewed and adjusted to new requirements Medium term aims Official Statistics integrated in governmental big data strategies at national levels across the ESS Pilots are finalised and provide input as regards implementation of statistics based on big data IT infrastructures are designed for processing big data sources Methodological and quality frameworks are developed (and/or reviewed) Data science skills integral part of official statistics education Public Private Partnerships are in place on big data and Official Statistics. Ethical guidelines for big data usage are in place Implementation of communication strategy ensures positive attitude of general public towards use of big data sources in official statistics. Short term goals Official statistics integrated in the Commission big data strategy Identification and analysis of output portfolio of big data sources Pilot projects on big data applications in official statistics launched and delivering first results The requirements in terms of skills and ESS professional training programmes are established Research lines on big data relevant to official statistics are included into the 2016-2017 Work Programme of the Horizon 2020 Research Framework Programme Communication to the general public on planned and ongoing big data activities of the ESS with users/policy makers on their needs, and with big data owners on data aspirations Exchange of information with stakeholders within the statistical system and the research community, e.g. a European conference on big data in Official Statistics By 2016
Ethics / Communication T O P I C S Policy Quality Skills Experience sharing Legislation IT Infrastructures Methods Ethics / Communication Pilots The slide shows the horizontal topics that have been developed in the action plan and roadmap. The ESSC reinforced in its opinion the legal and ethical issues as well as the development of appropriate skills. Some of the topics should be directly further developed such as policy, communication or skills, while other topics should be further elaborated as part of the pilots that arer planned to be run at EU level.
Blending of Sources and multipurpose Statistics Mobile Phone Data Tourism Statistics Population Statistics Migration Statistics Traffic Statistics Commuting Statistics Statistics Population Mobile phone data Smart Meters VGI websites Satellite Images A Big Data source could provide data for different statistical areas. For example, mobile phone data could be used in the context of tourism statistics or population statistics. The pilots should explore the potential of different data sources for different statistical domains. On the other hand, different statistical domains could be enhanced from information that is derived from different data sources.
Thank you for your attention Eurostat Task Force on Big Data Fernando Reis Eurostat Task Force on Big Data fernando.reis@ec.europa.eu https://github.com/reisfe/ https://twitter.com/reisfe/ https://linkedin.com/in/reisfe/