A Novel IT Architecture for Statistical Data Collection Using Big Data Technology Domenico Aprile, Lorenzo Di Gaetano, Guido Drovandi, Antonino Virgillito.

A Novel IT Architecture for Statistical Data Collection Using Big Data Technology Domenico Aprile, Lorenzo Di Gaetano, Guido Drovandi, Antonino Virgillito Bruxelles, March 15th 2017

Introduction
The use of Big Data sources in the production of official statistics has been at the core of several initiatives at both national and international level in recent years.
Among the questions raised by the use of Big Data for statistics, a specific one concerns the novel IT tools available for handling Big Data.
These tools can cope with high volumes of data and a variety of data formats (loosely structured data).
Their use is not necessarily tied to the handling of unusual datasets: they can also prove useful for covering specific technical requirements.
They can coexist with traditional tools in a modern IT architecture.
This presentation focuses on the use of a NoSQL database in the context of a novel IT architecture for data collection.

The problem
Istat is currently developing a web-based tool for the generalized collection of questionnaires.
Each questionnaire has its own metadata structure, which is prone to frequent changes across survey editions.
Such changes normally require modifications that propagate through several connected software systems.

The idea
Use a NoSQL database to simplify the overall architecture by natively coping with heterogeneous and dynamically changing structures, and possibly to achieve better performance than traditional tools.

NoSQL databases
The term “NoSQL” indicates storage solutions that are not based on the traditional tabular data model of relational databases (RDBMS).
NoSQL databases often omit or limit typical RDBMS features such as secondary indexes and multi-row transactions.
The principle is to trade the consistency guarantees of relational databases, which introduce an operational overhead and may not be necessary for all applications, for benefits such as enhanced query performance or support for loosely structured data.
NoSQL databases are not meant to replace RDBMSs but rather to cover different kinds of requirements.
Several categories of NoSQL databases exist, which differ in the data model they use. Examples are document-based databases, which accept semi-structured content, and graph, key-value and column-based databases.

Column-based NoSQL
Key-value NoSQL databases are schemaless storage facilities in which each record simply consists of an identifier (key) and a value.
The focus of such databases is on reaching a high operation throughput by enabling random direct access to single rows in a table through the key.
However, the lack of any form of structure leaves the programmer with all the responsibility for correctly writing and interpreting the data stored in the value cell.
Column-based databases extend the key-value store concept by allowing the value part to be organized in a semi-structured schema, similar to a namespace organization, which still preserves some flexibility in the organization of data.
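The point about programmer responsibility in a key-value store can be illustrated with a minimal sketch. The `KeyValueStore` class and the JSON encoding below are assumptions made for illustration only; they do not describe any specific product or the tool developed at Istat.

```python
import json


class KeyValueStore:
    """Toy in-memory key-value store (hypothetical): keys and values are opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> bytes:
        return self._data[key]


def encode_record(record: dict) -> bytes:
    # The store knows nothing about the structure: the application chooses the encoding.
    return json.dumps(record, sort_keys=True).encode("utf-8")


def decode_record(raw: bytes) -> dict:
    # The reader must use the same convention as the writer, or the value is unreadable.
    return json.loads(raw.decode("utf-8"))


store = KeyValueStore()
store.put(b"1", encode_record({"first_name": "John", "last_name": "Doe"}))
person = decode_record(store.get(b"1"))
```

The store only ever sees bytes, so any agreement on field names and encoding lives entirely in application code.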

Figure 1: Data organization in column-based NoSQL databases
The data is organized in a four-dimensional general schema (row key, column family, column qualifier, timestamp), where the simple, flat value of key-value stores is replaced by a set of column families. Column families are common to all the rows in a table, and each contains a set of columns that can vary from one row to another. Each stored value also has a timestamp, which identifies the different versions of the data (Figure 1).

Column families: Personal Data, Professional Data
Column qualifiers: First Name, Last Name, Title, Salary

Key   First Name   Last Name   Title   Salary
1     John         Doe         Mr.     10000
2     Mary         White       Eng.    20000
3     Carl         Green       Phd.    35000
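The four-dimensional organization described above can be sketched as nested dictionaries. This is a toy model under stated assumptions, not the API of HBase or of any other database: the class `ColumnFamilyTable`, the family names `personal` and `professional`, the `department` qualifier and the integer timestamps are all hypothetical.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy model: row key -> column family -> qualifier -> {timestamp: value}."""

    def __init__(self, families):
        # Column families are fixed for the whole table.
        self.families = set(families)
        self._rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are free-form, so they can differ from one row to another.
        self._rows[row][family][qualifier][timestamp] = value

    def versions(self, row, family, qualifier):
        # Return a copy of all stored versions without creating empty entries.
        found = self._rows.get(row, {}).get(family, {}).get(qualifier, {})
        return dict(found)

    def get(self, row, family, qualifier):
        # Return the value with the highest timestamp, or None if absent.
        found = self.versions(row, family, qualifier)
        return found[max(found)] if found else None


table = ColumnFamilyTable(["personal", "professional"])
table.put("1", "personal", "first_name", "John", timestamp=1)
table.put("1", "professional", "salary", 10000, timestamp=1)
table.put("1", "professional", "salary", 12000, timestamp=2)  # hypothetical later version
table.put("2", "professional", "department", "IT", timestamp=1)  # qualifier only on row 2
```

Here `table.get("1", "professional", "salary")` returns the most recent version, while `versions` exposes the full history kept under the timestamp dimension.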

Web Architecture for Data Collection
Traditional architecture: a different DB schema is required for each survey.
NoSQL-based architecture: all surveys can be stored in the same table.

The proposed IT Architecture
The idea is that the data collection software stores all the questionnaires, for all the surveys it manages, in a column-based NoSQL database (NSDB).
The data in the NSDB component can then be passed to the successive phases of the survey, which, depending on the specific survey, can be implemented with different technologies and tools, including RDBMS, SAS or other kinds of NoSQL databases such as MongoDB.
This creates a level of decoupling: the NSDB acts merely as a temporary storage area backing data collection, and all the operations on microdata (checks, editing, etc.) are carried out after the data is extracted.
This modular scheme is a great simplification when applied to many different surveys, which are typically handled in heterogeneous ways in terms of software tools and methodology, and it can facilitate convergence towards a unified architecture.
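The decoupling described above can be sketched as follows: one staging table holds questionnaires of any structure, and each survey is extracted into flat records for whatever downstream tool it uses. This is only an illustration under assumptions: the `survey:respondent` row-key convention, the survey names, the field names and the CSV hand-off are hypothetical and are not taken from the presentation.

```python
import csv
import io


def store_questionnaire(table, survey, respondent, answers):
    # One table for every survey: the row key carries the survey identifier.
    table[f"{survey}:{respondent}"] = dict(answers)


def extract_survey(table, survey):
    # Select the rows of one survey and flatten them to a common column list.
    prefix = f"{survey}:"
    rows = {k[len(prefix):]: v for k, v in table.items() if k.startswith(prefix)}
    columns = sorted({q for answers in rows.values() for q in answers})
    records = [
        [respondent] + [answers.get(c, "") for c in columns]
        for respondent, answers in sorted(rows.items())
    ]
    return columns, records


def to_csv(columns, records):
    # Flat output that an RDBMS loader or a statistical package could import.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["respondent"] + columns)
    writer.writerows(records)
    return buffer.getvalue()


staging = {}  # stand-in for the NSDB table
store_questionnaire(staging, "labour", "r1", {"employed": "yes", "hours": 38})
store_questionnaire(staging, "labour", "r2", {"employed": "no"})
store_questionnaire(staging, "prices", "s1", {"item": "bread", "price": 1.2})

columns, records = extract_survey(staging, "labour")
labour_csv = to_csv(columns, records)
```

Questionnaires with different fields share the same staging table, and a structural change in one survey only affects the extraction step for that survey.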

Expected Benefits
We expect substantial technical benefits from this architecture, such as better overall performance of data inserts, easier maintenance and a more flexible and modular design. What we deem even more relevant, however, is the possible impact of this solution on organization and production.
Thanks to its virtually unlimited scalability in terms of size, the NoSQL storage may eventually become a single centralized store of raw survey microdata that remain available for access and analysis at all times.
This organization readily enables cross-survey analysis of survey paradata, making it possible to extract novel in-depth insights into the collection process that can help in assessing the quality of the collection phase as a whole across different areas.

Experiments
We performed a comparison test between a key-value table in our traditional RDBMS architecture, using Oracle and Hibernate ORM, and an experimental HBase table.
The test is based on a fixed number of read/write cycles, to compare performance and throughput.
We used a custom Java application for both the Oracle and the HBase approach.
We also tested HBase scalability by repeating the read/write cycles on three tables of increasing size, to see how table size affects performance.
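A test of this kind can be sketched as a small timing harness. The authors used a custom Java application against Oracle and HBase; the Python function below is only an assumed illustration of the idea of averaging over a fixed number of read/write cycles, with a plain dictionary standing in for the database and hypothetical names throughout.

```python
import time


def average_cycle_ms(write, read, cycles):
    """Run a fixed number of write-then-read cycles and return the mean time per cycle in ms."""
    start = time.perf_counter()
    for i in range(cycles):
        key = f"row-{i}"
        write(key, {"answer": i})
        read(key)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / cycles


# Stand-in backend: a dict instead of a real Oracle or HBase connection.
backend = {}
avg_ms = average_cycle_ms(backend.__setitem__, backend.__getitem__, cycles=1000)
```

Passing the `write` and `read` callables as parameters keeps the loop identical for every backend, so measured differences come from the storage layer rather than from the harness.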

Experimental Results
Read/write performance comparison

                       RDBMS   HBase
Read/write avg. (ms)   31.65   7

Data access is much faster with NoSQL: about 4.5 times faster in this test.

HBase scalability over increasing table size
Table size: 10000, 100000, 500000
Write time avg (ms): 13.4, 13.41
Read/write avg (ms): 7, 10

Access times in NoSQL increase sub-linearly with table size.

Conclusions
We showed that NoSQL databases are a viable option for complementing traditional data storage solutions.
Besides better performance of raw operations and a simplified programming model, NoSQL databases can enable interesting alternative ways of storing data.
For this reason, several possible use cases can be identified in statistical institutes around the idea of large-scale repositories of non-uniform data.