The Sandbox Task Team Antonino Virgillito Project Consultant, UNECE

BACKGROUND

Task Team Objectives
The Project Proposal defines five specific objectives that state what successful completion of the work means:
SO1 Sources collection and manipulation: "Big Data" sources can be obtained, installed and manipulated with relative ease and efficiency on the chosen platform.
SO2 Production of quality statistics: The chosen sources can be processed to produce statistics (either mainstream or novel) which conform to appropriate quality criteria.
SO3 Correspondence with existing products: The resulting statistics correspond in a systematic and predictable way with existing mainstream products.
SO4 Cross-country sharing: The chosen platforms, tools, methods and datasets can be used in similar ways to produce analogous statistics in different countries.
SO5 CSPA-based sharing: The different participating countries can share tools, methods, datasets and results efficiently, operating on the principles established in the Common Statistical Production Architecture (CSPA).

The Sandbox
Shared computation environment for the storage and analysis of large-scale datasets, used as a platform for collaboration across participating institutions.
Created with support from:
- CSO, the Central Statistics Office of Ireland
- ICHEC, the Irish Centre for High-End Computing
Cluster of 28 machines, accessible through web and SSH.
Software: full Hadoop stack (Hortonworks Data Platform), visual analytics (Pentaho), R (RHadoop), RDBMS, NoSQL DB.
Objectives:
- Explore tools and methods
- Test the feasibility of producing Big Data-derived statistics
- Replicate outputs across countries
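To give a concrete flavour of how the Hadoop stack can be exercised by non-specialists, here is a minimal, illustrative Hadoop Streaming job in Python (a word count over files in HDFS). It is a sketch only: the jar path, input and output locations are assumptions to be adapted to the actual cluster.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- illustrative Hadoop Streaming job (mapper + reducer).
# Example submit command (paths are assumptions, adjust to the cluster):
#   hadoop jar hadoop-streaming.jar \
#       -files wordcount_streaming.py \
#       -mapper "python wordcount_streaming.py map" \
#       -reducer "python wordcount_streaming.py reduce" \
#       -input /data/sample_text -output /user/analyst/wordcount_out
import sys


def map_stage():
    # Emit one (word, 1) pair per token read from standard input.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)


def reduce_stage():
    # Input arrives sorted by key, so counts per word can be summed in one pass.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current and current is not None:
            print("%s\t%d" % (current, total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))


if __name__ == "__main__":
    map_stage() if sys.argv[1] == "map" else reduce_stage()
```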

Partnerships
- ICHEC assisted the task team in the testing and evaluation of Hadoop workflows and associated data analysis application software.
- Hortonworks provided free support on their open-source Hadoop platform (Hortonworks Data Platform).
- A trial license of the Pentaho Enterprise Platform product for visual analytics was provided by the Italian distributor BNova.

Task Team Overview
The Sandbox Task Team is composed of 38 participants from 18 national statistical institutes and international organizations. The team is organized around a set of "experiment teams", each focusing on topics related to a different statistical domain.
Timeline (Nov 2013 – Nov 2014): virtual sprint; first document; training in Rome; workshops in Heerlen and Rome; identification of topics of interest; presentations to the HLG; start of dataset acquisition; draft reports available.

EXPERIMENTS

Experiment teams: Social Media, Job Vacancies Ads, Mobile Phones, Web Scraping, Prices, Traffic Loops, Smart Meters.
Results are summarized with four indications: positive indication, "mixed" indication, more work needed / ongoing, negative indication.
Each experiment team produced a detailed report on its activity, already available in draft format on the wiki. A summary of the results is presented in the following slides.
(Icons made by Freepik, Elegant Themes, Icomoon and Icons8 from www.flaticon.com, licensed under Creative Commons BY 3.0.)

Social Media: Mobility Studies
Dataset: tweets generated in Mexico, Jan 2014 – Jul 2014 (42M tweets, 9.2 GB).
Analysis of mobility starting from the georeference data of individual tweets:
- Patterns of mobility to tourist cities
- Trans-border mobility
- Mobility statistics computed at a detailed territorial level
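As an illustration of the kind of processing involved (a sketch, not the team's actual code), the snippet below derives simple origin-destination flows from geotagged tweets, assuming each tweet has already been assigned a municipality code from its coordinates.

```python
# Illustrative sketch: count origin-destination moves between municipalities
# from geotagged tweets. Assumed input: (user_id, timestamp, municipality).
from collections import Counter
from itertools import groupby


def od_flows(tweets):
    flows = Counter()
    # Sorting by (user_id, timestamp) groups each user's tweets in time order.
    for _user, records in groupby(sorted(tweets), key=lambda t: t[0]):
        places = [municipality for _, _, municipality in records]
        for origin, destination in zip(places, places[1:]):
            if origin != destination:  # a move between two municipalities
                flows[(origin, destination)] += 1
    return flows
```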

Social Media: Sentiment Analysis
Dataset: tweets generated in Mexico, Jan 2014 – Jul 2014 (42M tweets, 9.2 GB).
A sentiment indicator was derived from the analysis of Mexican tweets, based on emoticons and media acronyms. Statistics Netherlands applied its methodology to relate sentiment to consumer confidence, a case of cross-country sharing of a method, with two differences: only emoticons were considered, and the Dutch study also used Facebook as a source.
The correlation is not as good as in the previous study based on Dutch data. A more accurate, language-based computation of sentiment is currently being carried out in Mexico, based on a partnership with a university.
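A minimal sketch of an emoticon-based indicator of this kind (the emoticon lists below are assumptions, and the methodology actually applied is more elaborate):

```python
# Illustrative sketch: daily sentiment as (positive - negative) / (positive + negative)
# emoticon counts. The emoticon sets below are assumptions, not the project's lists.
from collections import defaultdict

POSITIVE = {":)", ":-)", ":D", "=)", ";)"}
NEGATIVE = {":(", ":-(", ":'(", "D:"}


def daily_sentiment(tweets):
    """tweets: iterable of (date, text) pairs."""
    counts = defaultdict(lambda: [0, 0])  # date -> [positive, negative]
    for date, text in tweets:
        tokens = text.split()
        counts[date][0] += sum(token in POSITIVE for token in tokens)
        counts[date][1] += sum(token in NEGATIVE for token in tokens)
    return {d: (p - n) / float(p + n) for d, (p, n) in counts.items() if p + n}
```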

Mobile Phones
Dataset: four datasets from Orange, with call data from Ivory Coast (865M records, 31.4 GB).
The experiments are lagging behind because of the long time required for data acquisition: five months. Some preliminary outputs are already in place, as well as a plan of future experiments:
- Visual analysis of call location data
- User categories derived from call intensity patterns
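As an illustration of the "user categories from call intensity patterns" output, the sketch below (with an assumed record layout, not the actual Orange data schema) builds per-user hourly call profiles and applies a crude day/night categorisation.

```python
# Illustrative sketch: per-user call-intensity profiles by hour of day from
# call records, followed by a crude categorisation. Assumed input layout.
from collections import defaultdict


def hourly_profiles(calls):
    """calls: iterable of (user_id, call_datetime) pairs."""
    profiles = defaultdict(lambda: [0] * 24)
    for user, when in calls:
        profiles[user][when.hour] += 1
    return profiles


def categorise(profile, daytime=range(8, 20)):
    # Label users by whether most of their calls fall in daytime hours.
    day = sum(profile[h] for h in daytime)
    night = sum(profile) - day
    return "daytime caller" if day >= night else "night caller"
```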

Mobile Phones: Findings
Disaggregated data would allow mobility surveys to be replicated with a higher level of detail; aggregated data only allows the production of parallel indicators that integrate with traditional sources. Several methods have been proposed, but more time is needed for their implementation. A comparison with real data used in Slovenia will give more answers about the use in statistics and the cross-country sharing of methods and methodologies.

Consumer Price Index
Dataset: synthetic scanner data (11G records, 260 GB).
Goal: test the performance of Big Data technologies on large data sets through the computation of a simplified consumer price index on synthetic price data, comparing "traditional" and Big Data technologies.
The index computation script could be written with one of the high-level languages that are part of the Hadoop environment. Big Data tools are necessary, and achieve good scalability, when data grow beyond tens of GB. Future work will focus on methodology.
Work on scanner data is active in several NSIs; the data has the same structure, so methods can be shared. Novel statistics can be computed by working on the full data at large scale (no sampling).
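The stand-in sketch below shows the shape of such an index computation in plain Python, using an unweighted Jevons index (geometric mean of price relatives); it is not the script used in the Sandbox, which was written in a Hadoop high-level language.

```python
# Illustrative sketch: unweighted Jevons index per period over matched products,
# i.e. 100 * geometric mean of current/base price relatives.
from collections import defaultdict
from math import exp, log


def jevons_index(prices, base_period):
    """prices: iterable of (period, product_id, price) tuples."""
    by_period = defaultdict(dict)
    for period, product, price in prices:
        by_period[period][product] = price
    base = by_period[base_period]
    index = {}
    for period, current in by_period.items():
        matched = [p for p in current if p in base and base[p] > 0 and current[p] > 0]
        if matched:
            index[period] = 100 * exp(
                sum(log(current[p] / base[p]) for p in matched) / len(matched)
            )
    return index
```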

Smart Meters
Datasets: real data from Ireland and synthetic data from Canada (160M records, 2.5 GB).
Test of aggregation using Big Data tools: aggregation scripts were written quickly and could be used on both datasets, producing weekly consumption per hour of day over a year (IE) and hourly consumption per day for winter, summer and mid-seasons (CAN).
Future work: sharing methods through the use of synthetic data sets.
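A minimal sketch of the kind of aggregation script involved, assuming a simple (meter_id, timestamp, kWh) record layout common to both datasets:

```python
# Illustrative sketch: average consumption per hour of day across all meters.
from collections import defaultdict


def mean_consumption_by_hour(readings):
    """readings: iterable of (meter_id, timestamp, kwh) tuples."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _meter, when, kwh in readings:
        totals[when.hour] += kwh
        counts[when.hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}
```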

Job Vacancies
Dataset: data collected from job web portals (10K records/day, 2 MB/day).
A continuous daily collection of data from job web portals was set up to compute indices and statistics on job vacancies. Possible free and commercial data sources were identified in different countries, and different techniques for data collection and methodologies for data cleaning were tested.
Timeliness: a process was set up that collects and cleans data automatically, and the statistics were computed on a weekly basis.
Coverage: the collected sources were limited by the capability of the tools used and by the structure of the web sites, and could not guarantee all the variables necessary for computing the official job vacancy indicator. They can be used for a different, simplified indicator, for integration with other sources, or as a benchmark.
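A minimal, hypothetical collector (placeholder URL and CSS selector, not one of the portals actually used) showing the shape of the daily collection step:

```python
# Illustrative sketch: fetch a job-listing page once a day and append the
# vacancy titles, tagged with the collection date, to a CSV for later cleaning.
import csv
import datetime

import requests
from bs4 import BeautifulSoup

PORTAL_URL = "https://example-job-portal.test/vacancies"  # placeholder URL


def collect(out_path="vacancies.csv"):
    html = requests.get(PORTAL_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    today = datetime.date.today().isoformat()
    with open(out_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        # The selector below is an assumption; real portals differ widely.
        for ad in soup.select("div.vacancy"):
            writer.writerow([today, ad.get_text(" ", strip=True)])


if __name__ == "__main__":
    collect()
```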

Web Scraping
Dataset: websites of Italian enterprises (8 GB).
Test of automated, unassisted, massive mining of text data extracted from the web: 8,600 Italian websites, indicated by the 19,000 enterprises responding to the 2013 ICT survey, were scraped and the acquired texts were processed.
The Sandbox approach resulted in a significant performance improvement over the use of a single server. A comparison of different solutions for extracting data from the web, with recommendations about their use, was also produced.
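A minimal sketch (not Istat's actual pipeline) of the text-extraction step needed before the scraped pages can be mined:

```python
# Illustrative sketch: fetch a page and reduce it to the visible text only.
import requests
from bs4 import BeautifulSoup


def site_text(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-visible content
        tag.decompose()
    return " ".join(soup.get_text(" ").split())
```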

Traffic Loops
Dataset: data from 20,000 traffic loops located on 3,000 km of motorway in the Netherlands (156G records, 3 TB).
CBS will carry out the first test of the use of the Sandbox for pre-production statistics. Experiments on aggregation, cleaning and imputation have also been conducted on a subset of the data.
Processing pipeline: transformation, selection, cleaning and aggregation reduce the data from 3 TB to 10 GB and finally to 500 MB, from which the traffic index is computed.
The entire traffic dataset has been loaded into the Sandbox; a disk had to be physically shipped to Ireland because the dataset size did not allow network transfer.
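A minimal sketch (assumed record layout) of the cleaning and aggregation steps: minute-level loop readings aggregated into hourly vehicle counts per loop, with missing readings imputed by the mean of the observed ones.

```python
# Illustrative sketch: hourly vehicle counts per loop with simple mean imputation
# of missing minute-level readings. Assumed input layout.
from collections import defaultdict


def hourly_counts(readings):
    """readings: iterable of (loop_id, timestamp, vehicles_or_None) tuples."""
    observed = defaultdict(list)
    missing = defaultdict(int)
    for loop, when, vehicles in readings:
        key = (loop, when.date(), when.hour)
        if vehicles is None:
            missing[key] += 1
        else:
            observed[key].append(vehicles)
    counts = {}
    for key, values in observed.items():
        mean = sum(values) / float(len(values))
        counts[key] = sum(values) + mean * missing.get(key, 0)  # impute missing minutes
    return counts
```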

FINDINGS

Statistics
We showed some of the possible improvements that can be obtained using Big Data sources: cheaper, more timely and novel statistics.

Statistics
The results of the experiments were evaluated with respect to the initial objectives: SO1 Sources collection and manipulation, SO2 Production of quality statistics, SO3 Correspondence with existing products, SO4 Cross-country sharing, SO5 CSPA-based sharing. Details of the evaluation can be found in the Project Summary Report.

Skills
All available tools were used in the experiments by both researchers and technicians with no previous experience. The Sandbox can therefore serve as a capacity-building platform for participating institutions, which is crucial for building "data scientist" skills.
UNSD Big Data Questionnaire: projects in the planning stage were less likely to use tools generally associated with "Big Data"; often this decision was due to a lack of familiarity with new tools or a deficit of secure "Big Data" infrastructure (e.g. parallel processing and NoSQL data stores such as Hadoop).
UNECE Big Data Questionnaire: at present there is insufficient training in the skills identified as most important for people working with Big Data; skills on Hadoop/NoSQL databases were indicated as "planned in the near future" by the majority of organizations.

Technology
Big Data tools are necessary when dealing with data from hundreds of GB upwards and become effective starting from tens of GB; "traditional" tools perform better with smaller datasets.
Researchers and technicians should be able to master different tools and be ready to deal with immature software: the situation is highly dynamic, with frequent updates and new tools appearing all the time. Strong IT skills are needed to manage the tools, and support from software companies might be required in the early phases.

Acquisition
Seven datasets were loaded; the initial project proposal required "one or more". It is difficult to retrieve "interesting" (i.e., meaningful, disaggregated, …) datasets because of privacy and size issues; this also applies to web sources, which are only apparently easy to retrieve. There are also issues with quality, in terms of coverage and representativeness.

Sharing
Sharing of methods and datasets according to CSPA principles was achieved naturally: Big Data tools are de facto standardized, so there is not much need for CSPA-based integration, and many data sets have the same form in all countries. Methods can be developed and tested in the shared environment and then applied to their real counterparts within each NSI.
Privacy constraints on datasets limit the possibility of sharing; this can be partly bypassed through the use of synthetic data sets.

Recommendations for Extension
- Continuation of experiments: technical skills are now consolidated and can be used more effectively; some experiments started late and need more time for full development; further research of possible methodologies and further testing of the findings of other teams.
- More datasets can be loaded: satellite data, transport data, …
- Possibility of testing new models of partnership: moving data is too difficult, so why not try to involve partners in running our programs on their data, in their data centers?

DISCUSSION
The Role of Big Data in the Modernisation of Statistical Production and Services