SDMX meeting Big Data technologies

Slides:



Advertisements
Similar presentations
Chen Zhang Hans De Sterck University of Waterloo
Advertisements

Based on the text by Jimmy Lin and Chris Dryer; and on the yahoo tutorial on mapreduce at index.html
CS525: Special Topics in DBs Large-Scale Data Management HBase Spring 2013 WPI, Mohamed Eltabakh 1.
HBase Presented by Chintamani Siddeshwar Swathi Selvavinayakam
An Information Architecture for Hadoop Mark Samson – Systems Engineer, Cloudera.
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
Thanks to our Sponsors! To connect to wireless 1. Choose Uguest in the wireless list 2. Open a browser. This will open a Uof U website 3. Choose Login.
Project By: Anuj Shetye Vinay Boddula. Introduction Motivation HBase Our work Evaluation Related work. Future work and conclusion.
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.
CSC 456 Operating Systems Seminar Presentation (11/13/2012) Leon Weingard, Liang Xin The Google File System.
SOFTWARE SYSTEMS DEVELOPMENT MAP-REDUCE, Hadoop, HBase.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
WINDOWS AZURE STORAGE SERVICES A brief comparison and overview of storage services offered by Microsoft.
Key/Value Stores CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.
Introduction to Hbase. Agenda  What is Hbase  About RDBMS  Overview of Hbase  Why Hbase instead of RDBMS  Architecture of Hbase  Hbase interface.
HBase Elke A. Rundensteiner Fall 2013
Cloudera Kudu Introduction
1 HBASE – THE SCALABLE DATA STORE An Introduction to HBase XLDB Europe Workshop 2013: CERN, Geneva James Kinley EMEA Solutions Architect, Cloudera.
Data Model and Storage in NoSQL Systems (Bigtable, HBase) 1 Slides from Mohamed Eltabakh.
SQL Server Analysis Services Understanding Unified Dimension Model (UDM)
Department of Computer Science, Johns Hopkins University EN Instructor: Randal Burns 24 September 2013 NoSQL Data Models and Systems.
Apache Accumulo CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook.
Gorilla: A Fast, Scalable, In-Memory Time Series Database
Microsoft Ignite /28/2017 6:07 PM
Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit
1 Gaurav Kohli Xebia Breaking with DBMS and Dating with Relational Hbase.
Pilot Kafka Service Manuel Martín Márquez. Pilot Kafka Service Manuel Martín Márquez.
OMOP CDM on Hadoop Reference Architecture
Protecting a Tsunami of Data in Hadoop
Databases and DBMSs Todd S. Bacastow January
and Big Data Storage Systems
A Novel IT Architecture for Statistical Data Collection Using Big Data Technology Domenico Aprile, Lorenzo Di Gaetano, Guido Drovandi, Antonino Virgillito.
Column-Based.
HBase Mohamed Eltabakh
Hadoop.
Real-time analytics using Kudu at petabyte scale
Software Systems Development
Statistical Information Systems Introducing SIS tool .Stat
How did it start? • At Google • • • • Lots of semi structured data
INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER
CS122B: Projects in Databases and Web Applications Winter 2017
Running virtualized Hadoop, does it make sense?
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
Chapter 14 Big Data Analytics and NoSQL
Big Data Technology.
CLOUDERA TRAINING For Apache HBase
Fundamentals of Information Systems, Sixth Edition
Scaling SQL with different approaches
Operational & Analytical Database
NOSQL.
Hadoop.
Gowtham Rajappan.
NOSQL databases and Big Data Storage Systems
Hadoop and NoSQL at Thomson Reuters
Extraction, aggregation and classification at Web Scale
Powering real-time analytics on Xfinity using Kudu
Ministry of Higher Education
SDMX Information Model
1 Demand of your DB is changing Presented By: Ashwani Kumar
Big Data - in Performance Engineering
Introduction to PIG, HIVE, HBASE & ZOOKEEPER
Database Vs. Data Warehouse
Hbase – NoSQL Database Presented By: 13MCEC13.
Cloud Computing for Data Analysis Pig|Hive|Hbase|Zookeeper
MAPREDUCE TYPES, FORMATS AND FEATURES
Database Management System
Presentation transcript:

SDMX meeting Big Data technologies Storing SDMX time series data in Hadoop SDMX Global Conference 2019 Budapest, 17/09/2019 Stefano Pambianco

Overview 1 2 3 Different options for storing data Data modelling and security aspects Why exploring BigData technologies 4 Conclusions

1 Why exploring BigData Technologies Some numbers

ECB Statistical Data Production in numbers Location, [Please select] [Please select] ECB Statistical Data Production in numbers >50 different data providers 400 datasets 400 million series 40 billion distinct observations 360 SDMX files per day 350GB/year of new data ingested ~1000 data producers/analysts More dataset coming with more granular data Name of the Author

2 Different options for storing data Evaluated approaches in Hadoop

Hadoop data storage options HDFS with Parquet HBase Kudu Column-oriented, NoSQL database management system Random read / write access in real time to a very large set of data. Fault tolerant (replication) Runs on top of HDFS Scalable and fast tabular storage SQL-like schema: finite number of typed columns Low-latency random access Fault tolerant (tablets replication) File based storage Performant batch ingestion Efficiently scanning large amounts of data Fault tolerant (replication) Only data append possible

Looking to statistical data production process Location, [Please select] [Please select] Looking to statistical data production process Prototypes have been implemented using HBase Name of the Author

3 Data modelling and security aspects Findings and ideas

HBase data model HBase is schema-less, column-oriented datastore designed to store denormalized data Namespace - Table - Column Family - Qualifier/Column - Cell (Timestamp&Value) - Versions The Apache HBase Data Model is made up of different logical components such as Namespaces, Tables, Rows, Column Families, Columns, Cells and Versions. A namespace is a logical grouping of tables analogous to a database in relation database systems.  HBase tables are logical collection of the multiple rows stored in a separate region partition. Rows are stored on the row key of the HBase table. A row is consists of row key and one or more qualifiers with values associated with them. The rows are sorted and stored based on the row key of the given table. One or more column in the HBase table are grouped together as Column Families. Column is identified by the column qualifier concatenated with column family name. For example, ColumnFamily:ColumnName. Rows in the HBase tables can have multiple columns. Columns in the HBase tables are grouped together into column families. Each row in HBase can have one or more column families and multiple columns associated with them.  A cell in HBase stores the data and is a unique combination of row key, column family, and column qualifier, and contains a value and a timestamp. The data stored in a cell is versioned and versions of data are identified by the timestamp associated with them. The version number is configurable and by default it is 3.

Storing SDMX time series in HBase An SDMX observation is identified by the values of its dimensions, it provides with an observation value for a given reference date and it has several attributes attached Table M.USD.EUR. SP00.A EXR: OBS ATTR STATS NUM_OBS 2018-12-31 FIRST_PERIOD HBase TS2 PerVal 20011231 Value 9 TS 2018-12-31 OBS_CONF 2018-12-31 OBS_STATUS 2018-12-31 HBase TS1 ObsVal 1.12 HBase TS1 ConfVal F HBase TS1 StatVal A HBase TS2 ObsVal 1.09 HBase TS2 ConfVal F HBase TS2 StatVal A The data model can be adapted based on access patterns and informational needs

Some insights on performance The work done until now is based on a prototype developed internally. Tests ran using a randomly generated dataset of 1.76M time series, each containing 1825 observations (daily observations for 5 years) for a total of more than 3 billion observations Data retrieval scenarios tested go from milliseconds to retrieve a TS or a partial/wildcarded TS to few seconds to retrieve wildcarded TSs or the full dataset. Data processing scenarios (e.g. aggregation on one and more than one dimensions, outlier detection, completeness checks) are in the order of seconds.

Security aspects Access Controller A prototype has been done to assess the impact on performance and the capabilities of HBase applying security rules to the data stored: Secure authentication using Kerberos Authorisation managed by HBase coprocessors: Access Controller Visibility Controller Visibility Controller No data leakage Minimal impact on data operations (less than 1 % = cost of metadata writing)

4 Conclusions and Q&A

BigData is a fit for SDMX data The combination of big data technologies with SDMX appears to be a powerful tool to cope with the future challenges in statistics Storing and retrieving data is very fast HBase data model provide with the flexibility to model the data according to the specific needs – same data can be modelled in different ways SDMX data model fits well into the HBase one Data security can be pushed at the level of the cell with no impact on performance

Q&A