The value of information: the challenge of Big Data (El valor de la información: el reto del Big Data). Instituto de Estadística y Cartografía de Andalucía, 5 Feb 2016. Big data in official statistics in the European Statistical System: the Big Data Action Plan & Roadmap. EUROSTAT – Fernando Reis – 'Task Force Big Data'
Datafication. Sensors. Digital footprint. Good afternoon, I have the pleasure to welcome you to this session on the activities of Eurostat on big data and on possible collaboration between UNSD and Eurostat. <fade in "Datafication"> What I actually want to talk about at the beginning of this session is DATAFICATION. This concept was introduced in the May/June 2013 issue of Foreign Affairs, in an article by Kenneth Neil Cukier and Viktor Mayer-Schoenberger called “The Rise of Big Data”. Their example is how we quantify friendships with “likes”: everything we do, online or otherwise, ends up recorded for later examination in someone’s data storage units. Or maybe in multiple storage units, and maybe also for sale. They define datafication as a process of “taking all aspects of life and turning them into data”. For instance, Twitter 'datafies' stray thoughts and LinkedIn 'datafies' professional networks. Datafication is an interesting concept, and it led us to consider its importance with respect to people’s intentions about sharing their own data. We are being datafied all the time. Or rather, our actions are. When we “like” someone or something online, we should expect to be 'datafied'. When we merely browse the web, we are unintentionally, or at least passively, being datafied through cookies that we might or might not be aware of. And when we walk around in a store, or even on the street, we are being datafied in a completely unintentional way, via sensors, cameras, or Google Street View cars. <fade in "footprints & sensors"> As such, we can distinguish two "tools": the digital footprint passively left behind by an individual, and sensors actively gathering information.
So, knowingly or not, every one of you left a footprint behind when you switched on your mobile phone last night or this morning to call home, some weeks ago when you booked your flight via Amadeus or looked for a hotel via booking.com, some days ago when you checked via Google how to get to this building by bus in the morning, or this morning during breakfast when you wrote on Facebook that you were going to attend a very interesting session on big data. Our challenge as statisticians is to exploit this so-called datafication and use big data for producing statistics. What will be the impact of ubiquitous data collection and networking on official statistics?
Big Data and Official Statistics. What will be the impact of ubiquitous data collection and networking (mobile communication, Internet of [every]Things, social media, wearables, autonomous traffic, smart systems, …) on official statistics?
Expected benefits of using big data? Outward-looking: more adequate and flexible response to user needs; wider range of statistical products and services (without increasing burden); better understanding of quality aspects of new sources. Inward-looking: acquisition of new competences for NSIs; increased efficiency in producing statistics; we remain key players for statistical information. (self-explanatory)
Big data at Eurostat – key points. ESS (European Statistical System) Scheveningen Memorandum, Sept 2013: examine the potential of big data sources for official statistics; official statistics big data strategy as part of wider government strategy; address privacy and data protection; collaboration at European and global level; address need for skills; partnerships between different stakeholders (government, academics, private sector); developments in methodology, quality assessment and IT; adopt action plan and roadmap for the ESS. The real kick-off for the ESS work on big data was the Scheveningen Memorandum, adopted by the heads of the national statistical offices. List of objectives (bullet points). I will not go further into the details here, but let me pick two important results <next slide>.
Big data at Eurostat – key points. ESS (European Statistical System): Scheveningen Memorandum, Sep 2013; Task Force Big Data; Big Data Roadmap and Action Plan 1.0, June 2014; ESS Pilots 2016 - 2020. Implementation of ESS Vision 2020: Big Data project = integral part of the portfolio. European Commission: Communication "Towards a thriving data driven economy"; Private Public Partnership on big data. International cooperation (UNSD, UNECE, etc.): UN/ECE project “Big data in official statistics” (Sandbox); UNSD Global WG on Big Data. Firstly, the creation of a Task Force on Big Data. We have an internal TF here at Eurostat (of which Albrecht is a full-time member) and an ESS Task Force. The latter is chaired by Eurostat and is currently composed of 16 statistical offices, together with experts from the ECB, OECD and UNECE, academic experts and experts from other Commission services (DG CNECT, DG JRC). Secondly, the ESS Task Force drafted a Big Data Roadmap and Action Plan. An important axis of this roadmap and action plan concerns the ESS Pilots that will be carried out until 2019, but I will come back to this later. <fade in "ESS vision 2020"> Obviously, big data is also an important element for achieving the implementation of the ESS VISION 2020. <fade in "EC" and "International cooperation"> Apart from the work already mentioned, initiatives are also being taken at the level of the Commission, and Eurostat is cooperating closely with other international organisations such as the UNSD and UNECE, for instance via the Global Working Group on Big Data for Official Statistics.
Big Data Action Plan and Roadmap at a glance: Policy, Quality, Skills, Experience sharing, Legislation, IT Infrastructures, Methods, Ethics / Communication, Big data sources, Governance, Pilots
Challenges: cooperation, sharing of know-how; development of a sound methodology ("from design-based to model-based approach"); exploration & tentative implementation; looking for partners. Action (example): pilot projects, carried out by the Member States (ESSnet), 2015 – 2019 (FPA / SGA construction); exploring different big data sources (but also IT architecture, partnerships), developing generic guidelines and frameworks; establishing partnerships with data providers and with research and international organisations; cooperation with UN (lead) on a Methodological Framework. A first set of challenges refers to cooperation and the exchange of best practices, the methodology, and the transition into the "real use" of data. These are perhaps the areas that are closest to a statistician's heart. One way of tackling these issues is to launch a series of PILOT PROJECTS. A Framework Partnership Agreement between Eurostat and 20 NSIs was signed in Nov 2015. In Dec 2015 Eurostat launched the Special Grant Agreements that will provide the resources for the NSIs to carry out the work. In this context, close cooperation between the ESS and the GWG will be necessary in order to avoid duplication of work and ensure synergies between the two groups. These pilot projects will be an important pillar of the big data activities in the ESS in the coming years and should pave the way towards data production driven by big data.
Action (example) – continued. List of pilot projects (Framework Partnership Agreement signed): web scraping [job vacancies; enterprise characteristics]; smart meters [electricity consumption; temporary vacant dwellings]; AIS data [vessel identification systems]; mobile phone data. "The big data for official statistics competition" (2016).
Challenges: new skills for NSI staff (statisticians vs. data scientists?); computing capacity, hardware?; analytical tools, software?; storage? Action (example): training programme for European statisticians (ESTP); in the next years, dedicated courses on big data; focus on big data sources and on big data tools; acquiring the skills needed to assess sources and their quality, and the skills to use tools and to explore big data sources. Secondly, important enablers for a successful move towards big data are SKILLS and IT INFRASTRUCTURE. Our staff will slowly but steadily need new skills, and our IT architecture & infrastructure will need to adapt to the new sources. The impact on hardware needs will be significant. Experiments are ongoing, for instance the "sandbox" environment for big data experiments hosted by the Irish Central Statistics Office, a cooperation between, among others, Eurostat and UNECE. A concrete action in the pipeline is the set-up of a series of training courses under the umbrella of the ESTP. Our Task Force on Big Data is currently preparing the outline for such a programme for 2016. The courses will focus on sources and on tools and will be modulated so as to address basic/new users or management as well as more experienced users.
ESTP courses supporting big data (2016). [Course diagram; dates and titles in the order they appear on the slide:] 12 – 15 Sep; Big data sources - Web, Social media and text analytics; 29 Feb – 2 Mar; 21 – 24 Jun; Introduction to big data and its tools; Hands-on immersion on big data tools; Nowcasting; 7 – 10 Nov; Advanced big data sources - Mobile phone and other sensors; 5 – 7 Apr; 8 – 10 Jun; 24 – 26 Feb; The use of R in official statistics: model based estimates; Can a statistician become a data scientist?; Time-series econometrics. [Legend:] Big data courses; Methodology courses; Activity. Arrows represent suggested learning paths and not mandatory precedencies. The set of knowledge and skills that staff (statisticians) need in order to work on the processing of big data is too large to be well covered in one single course. Therefore, it needs to be covered by several courses with well-defined precedencies. [click for first animation] Besides these precedencies between the big data courses, there are important skills not exclusively related to big data which are required when working with big data sources (e.g. machine learning). These are covered by ESTP training courses in other domains, in particular methodology. [click for second animation] Methodology training courses also provide important skills required in the production of statistical products which build on the potential of big data sources, in particular nowcasting.
ESTP courses supporting big data (2016): course contents.
Big data courses:
Introduction to big data and its tools: Big data and the several digital traces people leave; Overview of big data sources: sensors and the IoT, process-mediated data, human-sourced data; The implications of big data for official statistics; International big data initiatives in official statistics; Privacy and personal data protection; Examples of use of big data for producing statistics; Methodological challenges of big data, e.g. over-fitting, multiple inference, and model-based inference; Visualisation and its importance in the analysis of big data; Data science and its role in big data analytics; Overview of big data tools, e.g. distributed computing.
Hands-on immersion on big data tools: Hadoop; Map Reduce; Pig and Hive; Spark; NoSQL databases; RHadoop.
Big data sources - Web, Social media and text analytics: Web scraping; Content and sentiment analysis on social media; Text mining.
Advanced big data sources - Mobile phone and other sensors: Mobile phone operators data; Road sensor data; Satellite images; Vessels and planes identification systems.
Methodology courses:
Can a statistician become a data scientist?: Methods of statistical inference: design-based, model-based and algorithm-based estimation; Statistical learning; Geo-spatial analysis; Network analysis and Web analytics; Graph database and advanced data visualisation.
The use of R in official statistics: model based estimates: Essentials of R; Descriptive statistics with R; Data visualization with R; Programming with R; Applications of R in an NSI.
Time-series econometrics: Introduction to time series analysis; Forecasting with time series models, uncertainty and confidence in forecasting; Univariate time series modelling: ARIMA, ARCH and GARCH models; Multivariate time series modelling: cointegration and VAR and VECM models; Other developments: nowcasting, combination of forecasting, etc.; Brief introduction to state space modelling.
Nowcasting.
Arrows represent suggested learning paths and not mandatory precedencies.
Challenges: integrating official statistics in big data strategies; getting access to data & continuity of access; data security & privacy concerns; compensate for the burden? Action (example): project on the analysis of legislation and strategy (but also ethics and communication), 2015-2017 (22 months); analysis for the EU and for Member States at national level. See also the Feasibility study on the use of mobile positioning data for tourism statistics (report on feasibility of access). Other important areas relate to the policy / political framework and the regulatory framework. Given the interaction between policy and regulation, it is very important to work on these areas in parallel and in close cooperation. One aspect of policy will be the integration of (official) statistics into any strategy related to big data. This is essential to put statistics on the map and to open doors to actually accessing data. It should be kept in mind that big data are often held or stored by private companies, e.g. mobile network operators. The discussion of access is not limited to initial entry but should include a long-term vision, in other words a certain continuity of access. This is a conditio sine qua non for a sound statistical system that is based, fully or partially, on big data sources. A main barrier to access is data security and privacy concerns, as was also highlighted in the feasibility study carried out with respect to tourism statistics. Another important challenge is finding a sustainable business model for big data in official statistics, taking into account the budgetary impact for statistical offices and for those "holding" the data. To address these questions, Eurostat recently launched a Call for Tender with the objective of analysing the legal frameworks at EU and national level.
Challenges: transversal challenges to all big data activities, namely quality and ethics & communication; big data vs. statistics: "goodness of fit" (concepts, representativeness, …); impact on public opinion of privacy and security concerns? Action (example): cooperation with UN (lead) on a quality framework for big data; project on the analysis of ethics and communication (but also legislation and strategy), 2015-2017 (22 months); analysis for the EU and for Member States at national level. As I already mentioned, all of the areas in the roadmap are interrelated. Two areas in particular are of a more horizontal, transversal nature. On the one hand, "quality": the quality framework as we know it will not be suited to the new data sources. Eurostat is contributing to the UN's work on a quality framework for big data. Quality issues will appear in the pilots, when assessing the access to data, etc. Just think of conceptual issues (can statistical definitions be maintained when using big data?), timeliness and flexibility of access, coverage and sampling issues, etc. On the other hand, "ethics and communication" will play an important if not decisive role. Policy makers and businesses will be reluctant to cooperate or to launch big data initiatives if public opinion does not support such approaches. Protection of data will become even more important than it already is now.
Big data sources. Communication: mobile phone data, social media. WWW: web searches, businesses' websites, e-commerce websites, job advertisements, real estate websites. Sensors: traffic loops, smart meters, vessel identification, satellite images. Process generated data: flight booking transactions, supermarket cashier data, financial transactions. Crowd sourcing: VGI websites (OpenStreetMap), community pictures collection. Mobile phone data: currently a focal data source for big data; exists in all countries (≠ accessible in all countries); many promising studies/experiments available; potential relevance to many areas of official statistics (synergies!); most available studies linking big data to tourism statistics are based on mobile phone data.
Mobile phone data Eurostat: Feasibility study on the use of mobile positioning data for tourism statistics (2012-2014) Included in the forthcoming ESS Pilots on Big Data (2016-2019) GWG Big Data Pilot NSIs (and tourism researchers) Many small or larger scale projects ongoing! GWG Big Data Task Team Mobile Phone Data
… slow data vs. quick data… Article released one day after the 2015 Easter weekend about tourism on the Belgian coast: 150 000 same-day visitors on Sunday, 400 000 during the entire long weekend. Data based on monitoring by the regional tourism board, in cooperation with the main mobile network operator Proximus and the road infrastructure administration. In comparison, Eurostat will receive data on same-day visitors for the 2nd quarter of 2015 (not a particular weekend) on 30 June 2016 (not the day after) for the entire country (not a coastal strip within a NUTS2 region). The methodology is not clear, but it is a nice example of how flash estimates based on big data decrease the relevance of official statistics.
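The timeliness gap in this comparison can be made concrete with a little date arithmetic. The sketch below is only illustrative: it assumes that the reference period "2nd quarter of 2015" ends on 30 June 2015 (the slide gives only the delivery date), and the variable names are invented for the example.

```python
from datetime import date

# Assumption (not stated on the slide): Q2 2015 ends on 30 June 2015.
reference_period_end = date(2015, 6, 30)

# From the slide: Eurostat receives the same-day visitor data on 30 June 2016.
official_delivery = date(2016, 6, 30)

# Lag between the end of the reference period and delivery of the official data.
official_lag_days = (official_delivery - reference_period_end).days

# From the slide: the flash estimate was published one day after the Easter weekend.
flash_lag_days = 1

print(f"Official statistics lag: {official_lag_days} days")
print(f"Flash estimate lag: {flash_lag_days} day")
```

Under that assumption the official figure arrives a full year after the end of the period it describes, against one day for the flash estimate.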
Big data = Multiple sources & Multiple outputs. Multiple sources for one output (Population Statistics): mobile phone data, smart meters, VGI websites, satellite images. One source for multiple outputs (Mobile Phone Data): tourism statistics, population statistics, migration statistics, traffic statistics, commuting statistics. We can expect that in the coming years, big data will influence our work via many different entry points. The wide range of big data sources will not replace the current statistics in the short term but will be used to enhance, improve and complement statistics, in many areas simultaneously. For example, mobile phone data can contribute to producing statistical data in different domains, such as tourism or mobility, but also population or migration. On the other hand, various big data sources can interact and contribute to providing statistical data for a specific domain. For example, population statistics can be fed by mobile phone data, volunteered geographic information sources and smart meters.
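The many-to-many relationship between sources and statistical domains can be sketched as a small lookup structure. This is a minimal illustration, not an ESS data model: the pairs below are only those named on this slide and in its notes, and the names `SOURCE_TO_DOMAIN` and `index_by` are hypothetical.

```python
from collections import defaultdict

# (big data source, statistical domain) pairs as read from the slide.
# Other combinations may exist; only these are named in the presentation.
SOURCE_TO_DOMAIN = [
    ("mobile phone data", "tourism"),
    ("mobile phone data", "population"),
    ("mobile phone data", "migration"),
    ("mobile phone data", "traffic"),
    ("mobile phone data", "commuting"),
    ("smart meters", "population"),
    ("VGI websites", "population"),
    ("satellite images", "population"),
]


def index_by(pairs, key_position):
    """Group the pairs by source (key_position=0) or by domain (key_position=1)."""
    index = defaultdict(set)
    for pair in pairs:
        index[pair[key_position]].add(pair[1 - key_position])
    return {key: sorted(values) for key, values in index.items()}


# One source feeding several outputs.
domains_by_source = index_by(SOURCE_TO_DOMAIN, 0)
# Several sources feeding one output.
sources_by_domain = index_by(SOURCE_TO_DOMAIN, 1)

print(domains_by_source["mobile phone data"])
print(sources_by_domain["population"])
```

Keeping the pairs as a flat list and deriving both views from it means that a newly explored source or domain only has to be added in one place.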
Lifecycle for the coming years? Inputs to domain STATISTICS: mobile phone data, payment cards data, HOUSEHOLD & BUSINESS SURVEYS, other big data. SHORT TERM: 'traditional' surveys as main input for tourism statistics; big data sources slowly becoming auxiliary information.
Lifecycle for the coming years? (2) Inputs to domain STATISTICS: mobile phone data, payment cards data, HOUSEHOLD & BUSINESS SURVEYS, other big data. MID TERM: weight of surveys decreases in favour of big data? Surveys no longer 'main filter' but 'one of the sources'?
Lifecycle for the coming years? (3) Inputs to domain STATISTICS: mobile phone data, payment cards data, HOUSEHOLD & BUSINESS SURVEYS, other big data; NEW: web (prices), bookings (nowcast/forecast). LONGER TERM: replacement of surveys continues (smaller samples, less frequent collection)? Enhanced tourism statistics via embedding of newer sources?
The statistical office of the future. Data flows in addition to surveys and censuses; embedded in data flow (smart statistics); product designers in addition to data collection designers; statistical modelling will be a major activity; from descriptive indicators to nowcasting (and forecasting); trust and quality will be key; new role in teaching digital literacy; accreditation and certification instead of pure production. Address issues linked to quality & transparency, privacy & confidentiality, access to third party data sources & data sharing, scientific standards & methodology, professional ethics, skills, … To close, and before handing over to the next speakers, let's jump a bit further in time and try to imagine what the statistical office of the future could look like. We will move away from the traditional sources that have often been in place since Quetelet, the scientist after whom this meeting room is named, put them at the core of statistical production. Surveys and censuses will be "competing" with data flows, and somehow the NSIs will become embedded in such data flows. In terms of skills, we will no longer design data collection but will be designing statistical products (using the available sources). Modelling and nowcasting will become common terminology. Partnerships and trust will become more important than ever. Users and producers will need new types of digital literacy and skills. In terms of quality, the NSI will lose control over the entire production chain from interview to indicator, and quality assessment will focus more on accreditation and certification of big data sources and of statistical output based on big data.
Thank you for your attention. Fernando Reis, Eurostat Task Force on Big Data. fernando.reis@ec.europa.eu https://github.com/reisfe/ https://twitter.com/reisfe/ https://linkedin.com/in/reisfe/