IT Architectures for Handling Big Data in Official Statistics: the Case of Scanner Data in Istat


IT Architectures for Handling Big Data in Official Statistics: the Case of Scanner Data in Istat
Gianluca D’Amato, Annunziata Fiore, Domenico Infante, Antonella Simone, Giorgio Vinci, Antonino Virgillito
Istat Scanner Data Workshop, Rome, 2 October 2015

About me
Head of unit «Architectures for business intelligence, mobile and big data»
Project manager of the UNECE project Big Data in Official Statistics
Leader of the IT group in the Istat Scanner Data project

Abstract
In this talk we present the issues and challenges of dealing with large datasets such as those involved in the Scanner Data project at Istat.
We illustrate the IT architecture backing the testing phase of the project, currently in place, and the ideas for the production architecture.
We explain the motivations behind the design, as well as the solutions introduced as part of a broader approach to modernising the tools and techniques used for data storage and processing in Istat, in view of the future challenges posed by the adoption of Big Data and Data Science in NSIs.

Data Size - Current
Received and stored data from 6 chains in 6 provinces
30% of all stores available from Nielsen in these provinces
Time span of two and a half years
Space occupation of 1 year of microdata in the database:
– 426 million records
– 18 GB
– 550 stores
– 210,000 products

Data Size - Expected
A further 29 provinces will be received by the end of the year
Estimated size of 1 year of microdata:
– 47 GB (6 chains only)
– 142 GB (all available chains)

Actual occupation
2 years of microdata only: 36 GB
Current occupation of DB space, including indexes, views, aggregations, classifications…
[Chart labels: 200 GB; 100 GB; now; end provinces, 6 chains?; 540 GB; 2016? EVERYTHING!]

The Problem with Size…
Query time is not satisfactory
– Aggregated intermediate results in tables
– IT support required
– More space!
«Difficult» to extract data for analysis
Data growth is not predictable
DBAs do not guarantee proper backup when space occupation is over 500 GB

Architectural Elements
Data ingestion
– SFTP
– Custom Java code
Data architecture
– Relational DB
Tools
– SAS, Excel…
– Business Analytics Platform (MicroStrategy)

Business Analytics Platforms
Allow access to data stored in different data sources and represent it in a common multi-dimensional schema that is easier to query and navigate
– Automatically aggregate measures at different levels of dimensionality
Normally used in enterprise contexts to facilitate management of large, heterogeneous data warehouses
Enable interactive analysis
– Reports: results of queries in a tabular format that can be browsed and downloaded as an Excel or CSV file
– Dashboards: free navigation of data through the creation of interactive visualizations in drag-and-drop mode
First example of use in Istat with large data sizes

IT Architecture: Testing Phase
[Diagram components: SFTP, Load, Pre-process, DB, Views, SAS, MicroStrategy, Reports and Visualizations, Control Dashboard]

Data Ingestion
Data is sent by Nielsen in the form of compressed text files via SFTP (secure channel)
The SFTP server is located in the Istat data center and is protected by strict security policies

Data Ingestion
Received data are handled by programs written in Java
– Load: performs integrity checks on received files, loads data into the DB, logs received files and estimates discounts
– Pre-process: performs quality checks at record level, discards dirty data
The whole acquisition process is controlled by a web dashboard
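The two ingestion steps above can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the slides state that the actual programs are in Java. The record layout (`FIELDS`), the `;` delimiter, the use of gzip, the SHA-256 checksum as the file-level integrity check, and the specific record-level rules are all assumptions made for illustration; the slides do not describe them.

```python
import csv
import gzip
import hashlib
import io

# Hypothetical record layout: the real file format is not given in the slides.
FIELDS = ["store_id", "product_id", "week", "quantity", "turnover"]


def file_checksum(raw):
    """Return the SHA-256 hex digest of a received file (assumed integrity check)."""
    return hashlib.sha256(raw).hexdigest()


def load(raw, expected_checksum):
    """Verify a received compressed file, then parse it into a list of records."""
    if file_checksum(raw) != expected_checksum:
        raise ValueError("integrity check failed: checksum mismatch")
    text = gzip.decompress(raw).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, delimiter=";")
    return list(reader)


def is_clean(record):
    """Record-level quality check: no missing fields, non-negative numeric values."""
    if any(record[f] in (None, "") for f in FIELDS):
        return False
    try:
        return float(record["quantity"]) >= 0 and float(record["turnover"]) >= 0
    except ValueError:
        # Non-numeric quantity or turnover counts as dirty data.
        return False


def preprocess(records):
    """Discard dirty records; return the clean ones and the number discarded."""
    clean = [r for r in records if is_clean(r)]
    return clean, len(records) - len(clean)


# Made-up sample file with one clean record and three dirty ones.
sample = (
    "S1;P1;2015-W01;3;4.50\n"
    "S1;P2;2015-W01;;2.00\n"
    "S2;P1;2015-W01;-1;1.00\n"
    "S2;P3;2015-W01;2;abc\n"
)
raw = gzip.compress(sample.encode("utf-8"))
records = load(raw, file_checksum(raw))
clean, discarded = preprocess(records)
```

In a real pipeline the expected checksum would come from the sender or from a manifest rather than being computed locally, and the counts of loaded and discarded records are the kind of figures a control dashboard could display.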

Data Access and Analysis
Data can be accessed in two ways:
– Extraction from the DB: materialized views were created to facilitate import into SAS
– Use of a business analytics tool (MicroStrategy) for reporting, visualization and browsing of the data

Data Access and Analysis
In both access modes, the results of common queries were pre-computed at different levels of aggregation and provided as views or reports
This sped up access at the cost of additional space in the DB
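The trade-off described here, faster queries in exchange for extra storage, can be sketched with Python's built-in `sqlite3` module. SQLite has no materialized views, so summary tables built with `CREATE TABLE ... AS SELECT` stand in for them. The table names, column names, and sample rows are hypothetical and do not reflect the actual Istat schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical microdata table with a few made-up rows.
conn.execute(
    "CREATE TABLE microdata ("
    "province TEXT, week TEXT, store_id TEXT, quantity REAL, turnover REAL)"
)
conn.executemany(
    "INSERT INTO microdata VALUES (?, ?, ?, ?, ?)",
    [
        ("RM", "2015-W01", "S1", 3, 4.5),
        ("RM", "2015-W01", "S2", 2, 3.0),
        ("RM", "2015-W02", "S1", 1, 1.5),
        ("MI", "2015-W01", "S3", 4, 8.0),
    ],
)


def precompute(conn):
    """Store aggregates at two levels so common queries avoid scanning microdata."""
    conn.execute("DROP TABLE IF EXISTS agg_province_week")
    conn.execute(
        "CREATE TABLE agg_province_week AS "
        "SELECT province, week, SUM(quantity) AS quantity, SUM(turnover) AS turnover "
        "FROM microdata GROUP BY province, week"
    )
    # The coarser level is derived from the finer one, not from the microdata.
    conn.execute("DROP TABLE IF EXISTS agg_province")
    conn.execute(
        "CREATE TABLE agg_province AS "
        "SELECT province, SUM(quantity) AS quantity, SUM(turnover) AS turnover "
        "FROM agg_province_week GROUP BY province"
    )


precompute(conn)

# A common query now reads a small summary table instead of the microdata.
weekly = conn.execute(
    "SELECT province, week, turnover FROM agg_province_week ORDER BY province, week"
).fetchall()
totals = dict(conn.execute("SELECT province, turnover FROM agg_province").fetchall())
```

Each summary table duplicates information already present in the microdata, which is the additional space cost the slide refers to, and it has to be rebuilt or refreshed whenever new data is loaded.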

Preliminary Analysis
Analysis of quality of data
– Anomalies
– Distributions
– Nulls
– …
Exploratory analysis
– Time series of turnover and quantity
– Distribution per COICOP group
– …
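As an illustration of the kind of quality checks listed above, the following sketch counts nulls per field, tabulates the distribution of records per COICOP group, and flags unit-price anomalies. The field names, the sample records, and the anomaly rule (a unit price more than `factor` times above or below the product's median unit price) are assumptions for illustration only; the slides do not say which checks were actually applied.

```python
from collections import Counter
from statistics import median

# Made-up records; None marks a missing value.
records = [
    {"product_id": "P1", "coicop": "01.1", "quantity": 3, "turnover": 6.0},
    {"product_id": "P1", "coicop": "01.1", "quantity": 2, "turnover": 4.2},
    {"product_id": "P1", "coicop": "01.1", "quantity": 1, "turnover": 40.0},
    {"product_id": "P2", "coicop": "01.2", "quantity": 5, "turnover": None},
    {"product_id": "P2", "coicop": None, "quantity": 4, "turnover": 8.0},
]


def null_counts(records):
    """Count missing values per field."""
    counts = Counter()
    for r in records:
        for field, value in r.items():
            if value is None:
                counts[field] += 1
    return dict(counts)


def distribution(records, field):
    """Count records per distinct non-null value of a field."""
    return Counter(r[field] for r in records if r[field] is not None)


def unit_price(record):
    """Return turnover / quantity, or None when it cannot be computed."""
    if record["turnover"] is None or not record["quantity"]:
        return None
    return record["turnover"] / record["quantity"]


def price_anomalies(records, factor=5.0):
    """Flag records whose unit price is far from the product's median unit price."""
    prices = {}
    for r in records:
        p = unit_price(r)
        if p is not None:
            prices.setdefault(r["product_id"], []).append(p)
    medians = {pid: median(values) for pid, values in prices.items()}
    flagged = []
    for r in records:
        p = unit_price(r)
        if p is None:
            continue
        m = medians[r["product_id"]]
        if m > 0 and (p > factor * m or p < m / factor):
            flagged.append(r)
    return flagged


nulls = null_counts(records)
per_group = distribution(records, "coicop")
anomalies = price_anomalies(records)
```

With the sample data, the third record is flagged because its unit price of 40.0 is far above the median unit price for product P1.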

Turnover per Market

Total quantity per week and province

Navigation of DB

Production Data Platform
We are setting up a new data platform for the production architecture based on Big Data tools
– 7-node Hadoop cluster
– Hadoop: parallel storage and processing platform, de facto standard for Big Data
Features:
– All historical data always online for interactive analysis
– Possibility of retaining historical data indefinitely
– Construction of a global historical data warehouse of price data
– Database is used only for processing current, online data (computation of indexes)
– Easier to perform large-scale analysis

IT Architecture: Production Phase
[Diagram components: SFTP, Ingestion, Control Dashboard, Oracle DB, Hadoop, Current data, Historical data, Processing of indexes, Sample selection, Extraction for offline analysis, SAS, Enhanced data warehouse, Reports and Visualizations]

Conclusions
The scanner data project is a challenging testbed for experimenting with new approaches to IT support for analysis and production
The objective is to get faster results and more efficient processes
«Big Data» is not merely a matter of size but rather of new opportunities
Technology can give the answers; now it is time to ask new questions

Questions