A Novel IT Architecture for Statistical Data Collection Using Big Data Technology Domenico Aprile, Lorenzo Di Gaetano, Guido Drovandi, Antonino Virgillito.


1 A Novel IT Architecture for Statistical Data Collection Using Big Data Technology
Domenico Aprile, Lorenzo Di Gaetano, Guido Drovandi, Antonino Virgillito
Bruxelles, March 15th 2017

2 Introduction
The use of Big Data sources in the production of official statistics has been at the core of several national and international initiatives in recent years.
Among the questions raised by the use of Big Data for statistics, a specific one concerns the novel IT tools available for handling Big Data, which can cope with high data volumes and a variety of data formats (loosely structured data).
The use of Big Data tools is not necessarily tied to the handling of unusual datasets: they can also prove useful for covering specific technical requirements, and they can co-exist with traditional tools in a modern IT architecture.
This presentation focuses on the use of a NoSQL database in the context of a novel IT architecture for data collection.

3 The problem
Istat is currently developing a web-based tool for the generalized collection of questionnaires. Each questionnaire has its own metadata structure, which is prone to frequent changes across survey editions. Each change normally involves modifications that propagate through several connected software systems.

4 The idea
Use a NoSQL database to simplify the overall architecture by natively coping with heterogeneous and dynamically changing structures, possibly also achieving better performance than traditional tools.

5 NoSQL databases
The term "NoSQL" indicates storage solutions that are not based on the traditional tabular data model of relational databases (RDBMS). NoSQL databases often omit or limit typical RDBMS features such as indexes and transactions.
The principle is to trade the consistency guarantees of relational databases, which introduce an operational overhead and may not be necessary for all applications, for overall benefits in terms of enhanced query performance or support for loosely structured data. NoSQL DBs are not meant to replace RDBMSs but rather to cover different kinds of requirements.
Several categories of NoSQL databases exist, which differ according to the kind of data model they use. Examples are document-based (which accept semi-structured content), graph, key-value and column-based databases.

6 Column-based NoSQL
Key-value NoSQL databases are schemaless storage facilities in which each record simply consists of an identifier (key) and a value. The focus of such databases is on reaching a high operation throughput, enabling random direct access to single rows in a table through the key. However, the lack of any form of structure leaves to the programmer all the responsibility for correctly writing and interpreting the data stored in the value cell.
Column-based databases extend the key-value store concept by allowing the value part to be organized in a semi-structured schema, similar to a namespace organization, which still preserves some flexibility in the organization of data.
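The difference between the two models can be sketched with plain Python dictionaries. This is only an illustration under stated assumptions: the store names, the family names `personal` and `professional`, and the field names are hypothetical, and no real NoSQL client is involved.

```python
import json

# Key-value model: the value is an opaque blob, so the application
# has to serialize it on write and interpret it on read.
kv_store = {}
kv_store["1"] = json.dumps({"first_name": "John", "salary": 10000})
record = json.loads(kv_store["1"])  # decoding is the caller's responsibility

# Column-based model: the value is organized as family -> qualifier -> value,
# so a single field can be addressed without decoding the whole record.
column_store = {
    "1": {
        "personal": {"first_name": "John", "last_name": "Doe"},
        "professional": {"salary": 10000},
    }
}
salary = column_store["1"]["professional"]["salary"]
```

In the first case the structure of the value exists only in application code; in the second it is visible to the store itself, while each row remains free to carry different qualifiers.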

7 Figure 1: Data organization in Column-based NoSQL Databases
The data model is articulated in a four-dimensional general schema, where the simple, flat value of key-value stores is replaced by a set of column families, which are common to all the rows in a table. Each column family contains a set of columns, which can vary from one row to another. A timestamp identifies each version of the stored data (Figure 1).

Column families: Personal Data, Professional Data
Column qualifiers: First Name, Last Name, Title, Salary

Key  First Name  Last Name  Title  Salary
1    John        Doe        Mr.    10000
2    Mary        White      Eng.   20000
3    Carl        Green      Phd.   35000
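The four dimensions (row key, column family, column qualifier, timestamp) can be sketched as a toy in-memory table. This is a minimal model, not the API of any real database: the class `ColumnTable`, its methods, and the updated salary value 12000 are assumptions made for illustration.

```python
class ColumnTable:
    """Toy model: row key -> family -> qualifier -> {timestamp: value}."""

    def __init__(self, families):
        # Column families are fixed per table and shared by all rows.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = (self.rows.setdefault(row_key, {})
                         .setdefault(family, {})
                         .setdefault(qualifier, {}))
        cell[timestamp] = value  # each write adds a new version

    def get(self, row_key, family, qualifier, timestamp=None):
        versions = self.rows.get(row_key, {}).get(family, {}).get(qualifier, {})
        if not versions:
            return None  # this row has no such column
        if timestamp is None:
            timestamp = max(versions)  # default to the latest version
        return versions.get(timestamp)


table = ColumnTable(["personal", "professional"])
table.put("1", "personal", "first_name", "John", timestamp=1)
table.put("1", "professional", "salary", 10000, timestamp=1)
table.put("1", "professional", "salary", 12000, timestamp=2)  # hypothetical update
table.put("2", "personal", "first_name", "Mary", timestamp=1)

latest = table.get("1", "professional", "salary")
original = table.get("1", "professional", "salary", timestamp=1)
missing = table.get("2", "professional", "salary")
```

Row "2" has no `salary` column at all, which is legal here: only the families are shared across rows, not the qualifiers inside them.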

8 Web Architecture for Data Collection
Traditional architecture: a different DB schema is required for each survey.
NoSQL-based architecture: all surveys can be stored in the same table.

9 The proposed IT Architecture
The idea is that the data collection software stores all the questionnaires, for all the surveys it manages, in a column-based NoSQL database (NSDB). The data in the NSDB component can then be passed to the successive phases of the survey, which, depending on the specific survey, can be implemented with different technologies and tools, including RDBMS, SAS or other kinds of NoSQL databases such as MongoDB.
This creates a level of decoupling: the NSDB acts merely as a temporary storage area backing data collection, and all the operations on microdata (checking, editing, etc.) are carried out after the data is extracted.
This modular scheme is a great simplification when applied to many different surveys, which are typically handled in heterogeneous ways in terms of software tools and methodology, and can facilitate convergence towards a unified architecture.
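The decoupling can be sketched as follows. This is an illustration under stated assumptions, not the actual Istat implementation: the in-memory dictionary stands in for the NSDB, and the composite row key `survey:respondent`, the function names, and the survey and field names are all hypothetical.

```python
import csv
import io

# One shared table: row key -> {qualifier: answer}. Questionnaires from
# different surveys carry different columns; no per-survey schema is declared.
nsdb = {}


def store_questionnaire(survey_id, respondent_id, answers):
    """Collection phase: write one questionnaire into the shared table."""
    nsdb[f"{survey_id}:{respondent_id}"] = dict(answers)


def extract_survey(survey_id):
    """Extraction phase: export one survey as CSV text for a downstream tool."""
    prefix = f"{survey_id}:"
    rows = {k: v for k, v in nsdb.items() if k.startswith(prefix)}
    # The flat schema is derived at extraction time from the stored columns.
    columns = sorted({q for answers in rows.values() for q in answers})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["row_key"] + columns,
                            restval="", lineterminator="\n")
    writer.writeheader()
    for key in sorted(rows):
        writer.writerow({"row_key": key, **rows[key]})
    return out.getvalue()


store_questionnaire("labour", "r1", {"employed": "yes", "hours": 38})
store_questionnaire("labour", "r2", {"employed": "no"})
store_questionnaire("housing", "r1", {"rooms": 4, "tenure": "owner"})

labour_csv = extract_survey("labour")
```

Adding a question to a survey edition only changes the dictionaries passed to `store_questionnaire`; the shared table needs no migration, and the tabular layout is rebuilt when the data is extracted.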

10 Expected Benefits
We expect substantial technical benefits from this architecture, such as better overall performance of data inserts, easier maintenance and a more flexible and modular design. What we deem even more relevant is the possible impact of this solution on organization and production.
The NoSQL storage, thanks to its virtually unlimited scalability in terms of size, may eventually become a single centralized store of raw survey microdata that is always available for access and analysis.
This organization easily enables cross-survey analysis of survey paradata, making it possible to extract novel in-depth insights into the collection process, which can help in assessing the quality of the collection phase as a whole across different areas.

11 Experiments
We performed a comparison test between a key-value table in our traditional RDBMS architecture, based on Oracle and Hibernate ORM, and an experimental HBase table.
The test is based on a fixed number of read/write cycles, to compare performance and throughput. We used a custom Java application for both the Oracle and the HBase approaches.
We also tested HBase scalability by repeating the read/write cycles on three tables of increasing size, to see how table size affects performance.
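The measurement loop of such a test can be sketched as below. The authors used a custom Java application whose code is not shown in the slides; this Python sketch only illustrates the idea of timing a fixed number of read/write cycles. The function `average_cycle_ms`, the key format, and the in-memory dictionary used as a stand-in backend are assumptions.

```python
import time


def average_cycle_ms(write, read, cycles):
    """Run `cycles` write-then-read cycles; return the mean time per cycle in ms."""
    start = time.perf_counter()
    for i in range(cycles):
        key = f"row-{i}"
        write(key, {"value": i})
        read(key)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / cycles


# Stand-in backend: an in-memory dict. In a real comparison, `write` and
# `read` would wrap the RDBMS and the NoSQL client calls respectively.
store = {}
avg_ms = average_cycle_ms(store.__setitem__, store.__getitem__, cycles=10_000)
print(f"average read/write cycle: {avg_ms:.4f} ms")
```

Passing the backend as a pair of callables keeps the loop identical for both systems, so any difference in the averages comes from the storage layer rather than from the harness.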

12 Experimental Results
Read/write performance comparison

                      RDBMS   HBase
Read/write avg. (ms)  31.65   7

Data access is much faster with NoSQL (about 4.5 times faster in this test).

HBase scalability over increasing table size

Table sizes: 10000, 100000, 500000
Write time avg (ms): 13.4, 13.41
Read/write avg (ms): 7, 10

Access times in NoSQL increase sub-linearly with table size.

13 Conclusions
We showed that NoSQL databases represent a viable alternative to complement traditional data storage solutions.
Besides better performance of raw operations and a simplified programming model, NoSQL databases can enable interesting alternative ways of storing data.
For this reason, several possible use cases can be identified in statistical institutes around the idea of large-scale repositories of non-uniform data.

