Big Data in Official Statistics: Generalities Antonino Virgillito
Big Data in Official Statistics: the International Setting - 1 Strategic vision of HLG, June 2011: «We are in a changeover from a society with little or no data available to one that has an abundance of data… … Another important point is that nowadays it is much easier to get data that cover more than the traditional national statistics users would need. We do not, however, have the mechanisms in place to make full use of these data»
Big Data in Official Statistics: the International Setting - 2 HLG Working paper 2013/6, January 2013: « Apart from generating new commercial opportunities in the private sector, Big data is also potentially very interesting as an input for official statistics; either for use on its own, or in combination with more traditional data sources such as sample surveys and administrative registers»
Big Data in Official Statistics: the International Setting - 3 Scheveningen Memorandum, September 2013: «Acknowledge that Big Data represent new opportunities and challenges for Official Statistics, and therefore encourage the European Statistical System and its partners to effectively examine the potential of Big Data sources in that regard.»
Big Data for OS: The Concept Big Data can be an input for official statistics: either for use on its own or in combination with more traditional data sources such as sample surveys and administrative registers
Big Data Sources: UNECE Classification
- Social Networks (human-sourced information)
- Traditional Business systems (process-mediated data)
- Internet of Things (machine-generated data)
Social Networks (human-sourced information)
- Interactions with news media and social media, job postings
- Humans interacting with devices (including mobile devices) produce data
Examples: blog posts, Twitter messages, user-generated maps
Traditional Business systems (process-mediated data)
- Data collected by traditional systems in a passive mode
Examples: web search logs, medical records, commercial transactions, banking/stock records
Internet of Things (machine-generated data)
- Sensors and machines used to measure and record events and situations in the physical world
Examples: traffic sensors, environmental sensors
How can Big Data be included in official statistics?
- As the only source (replacement/new statistics): traffic intensity statistics (NL) and the ‘Billion Prices’ project (MIT)
- As the main source, with survey/admin. data as benchmark: Google Trends-like approaches; (regular) benchmarking needed
- As an additional source for a statistic based on survey/admin. data, for example to enable small area estimation
- As ‘supplier’ of missing data, for example using data on level of education from the Internet to fill gaps in the education register
- But also for nowcasting and to increase timeliness!
- Don’t use it
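The "main source with survey/admin. data as benchmark" option can be illustrated with a minimal sketch. It assumes a simple pro-rata adjustment, which is only one of several possible benchmarking methods and is not necessarily the one used in the projects cited on the slide. The function name `prorata_benchmark` and all figures are hypothetical.

```python
from collections import defaultdict


def prorata_benchmark(indicator, benchmarks):
    """Rescale a high-frequency indicator so that its mean within each
    period equals the benchmark level for that period.

    indicator:  list of (period, value) pairs, e.g. monthly values
                labelled with the quarter they belong to
    benchmarks: dict mapping period -> level from the survey/admin source
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for period, value in indicator:
        sums[period] += value
        counts[period] += 1

    adjusted = []
    for period, value in indicator:
        if period in benchmarks and sums[period] != 0:
            # Ratio of the benchmark level to the indicator's period mean
            factor = benchmarks[period] * counts[period] / sums[period]
        else:
            # No benchmark available yet: leave the value unscaled
            factor = 1.0
        adjusted.append((period, value * factor))
    return adjusted


# Hypothetical monthly Big Data indicator and quarterly survey benchmarks
indicator = [("Q1", 90.0), ("Q1", 100.0), ("Q1", 110.0),
             ("Q2", 120.0), ("Q2", 130.0), ("Q2", 140.0)]
benchmarks = {"Q1": 50.0, "Q2": 52.0}

adjusted = prorata_benchmark(indicator, benchmarks)
for period, value in adjusted:
    print(period, round(value, 2))
```

The adjusted series keeps the month-to-month movement of the Big Data indicator inside each quarter, while its quarterly level is taken from the benchmark. Pro-rata scaling can introduce steps between periods, which is one reason more refined benchmarking methods exist.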
Why do Big Data look so appealing to NSIs?
Competitive pressure
- The private sector may take advantage of Big Data and produce more and more statistics that attempt to beat official statistics on timeliness and relevance
- The “Official Statistics” trademark could slowly lose reputation and relevance unless NSIs get on board
Funding constraints
- The economic crisis (2009-20??) urges organizations to look for ways to increase efficiency and cut costs
- Because traditional data collection is so cost-intensive, interest in alternative data sources and Big Data is growing
Why do Big Data look so appealing to NSIs?
Improving quality of traditional statistics
- Providing new auxiliary information that NSIs could exploit to:
  - Build and maintain better sampling frames
  - Design better samples
  - Build better calibration estimators
  - Further reduce nonresponse bias
- Reducing respondents’ burden
Potential for discovering new knowledge
- New well-being indicators
- Agriculture and environment statistics
- New measures of consumer confidence
- Consumer behavior beyond the Household Budget Survey (HBS)
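To make the "calibration estimators" item concrete, here is a minimal sketch of post-stratification, the simplest calibration estimator: design weights are rescaled so that weighted counts per stratum match totals known from an auxiliary source. It assumes the auxiliary totals are reliable; whether a given Big Data source can supply such totals is exactly the open question. The function name `poststratify`, the strata and all figures are hypothetical.

```python
from collections import defaultdict


def poststratify(sample, population_totals):
    """Return calibrated weights, one per sampled unit.

    sample:            list of dicts with keys 'stratum', 'weight', 'y'
    population_totals: dict mapping stratum -> known population count
                       (here assumed to come from an auxiliary source)
    """
    weighted_counts = defaultdict(float)
    for unit in sample:
        weighted_counts[unit["stratum"]] += unit["weight"]

    calibrated = []
    for unit in sample:
        # Adjustment factor: known total / estimated total in the stratum
        g = population_totals[unit["stratum"]] / weighted_counts[unit["stratum"]]
        calibrated.append(unit["weight"] * g)
    return calibrated


# Hypothetical sample in which the 'young' stratum is under-represented
sample = [
    {"stratum": "young", "weight": 10.0, "y": 1.0},
    {"stratum": "young", "weight": 10.0, "y": 0.0},
    {"stratum": "old", "weight": 10.0, "y": 1.0},
    {"stratum": "old", "weight": 10.0, "y": 1.0},
    {"stratum": "old", "weight": 10.0, "y": 0.0},
]
population_totals = {"young": 40.0, "old": 30.0}

weights = poststratify(sample, population_totals)
estimate = sum(w * unit["y"] for w, unit in zip(weights, sample))
print(weights, estimate)
```

After calibration the weights in each stratum sum to the known total, so the estimate of the total of `y` no longer inherits the imbalance of the sample across strata.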
Issues and Challenges
Legislative
- Regulating access to data
- Possibly different legislation country by country
Privacy
- Possible privacy-by-design strategies
Financial
- Private providers for Big Data
Management
- Including training
Issues and Challenges – Statistical methodology
- Representativeness: difficult to define target population, survey population and survey frame
- Methods for linking Big Data with statistical units (individuals, families, enterprises, …)
- Estimation procedures
- Quality of the results
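The linking challenge can be sketched with a deliberately simple deterministic match between Big Data records and a register of statistical units. It assumes that a normalised name is a usable key, which is rarely true in practice, where probabilistic record linkage is usually needed. The register, the records and the field names are hypothetical.

```python
def normalise(name):
    """Lower-case the name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def link(records, register):
    """Match records to register units on the normalised name.

    Returns (linked, unlinked): linked is a list of (record, unit_id)
    pairs, unlinked is the list of records with no match.
    """
    index = {normalise(unit["name"]): unit["unit_id"] for unit in register}
    linked, unlinked = [], []
    for record in records:
        unit_id = index.get(normalise(record["name"]))
        if unit_id is None:
            unlinked.append(record)
        else:
            linked.append((record, unit_id))
    return linked, unlinked


# Hypothetical business register and records extracted from the web
register = [
    {"unit_id": 1, "name": "ACME S.p.A."},
    {"unit_id": 2, "name": "Beta Srl"},
]
records = [
    {"name": "acme spa"},
    {"name": "BETA SRL"},
    {"name": "Gamma Ltd"},
]

linked, unlinked = link(records, register)
match_rate = len(linked) / len(records)
print(len(linked), len(unlinked), round(match_rate, 2))
```

The share of unlinked records is itself a quality indicator: records that cannot be attached to a statistical unit cannot be assigned to a target population, which ties the linking problem back to representativeness.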
Issues and Challenges – Collecting Big Data
- Big Data originated from the need to manage data that grew inside organizations as a consequence of their business: no collection involved
- Statistical offices are not in this situation, because our “input” data is always generated in the collection phase
- Big Data, too, has to be gathered from external sources
- Most common sources of Big Data for statistical purposes:
  - Datasets from external providers
  - Data extracted from the Internet
International Initiatives
- UN Global Working Group
- UNECE-HLG Big Data Sandbox
- ESSNet project on Big Data