Scalable data access with Impala
Zbigniew Baranowski, Maciej Grzybek, Daniel Lanza Garcia, Kacper Surdy

Outline
- What is Impala?
- Data access
- Data store
- Data loading
- Measured performance

What is Cloudera Impala?
Impala is a SQL engine for analytics that performs fast sequential full table/partition scans on top of the Hadoop Distributed File System (HDFS)
- SQL interface to the data
- Supports many file formats as a data source
- Back end written in C++
- Queries run in parallel on the cluster nodes
- Shared-nothing architecture -> data locality
- Scales for high capacity and throughput on commodity hardware

Impala query execution
[Architecture diagram: an application submits SQL over ODBC; each cluster node runs a Query Planner, Query Coordinator and Query Executor on top of HDFS; the executors scan HDFS in parallel and the result is returned to the application]

Data access
- Declarative access with SQL (see the example below)
  - ANSI SQL compliant -> not 100% compatible with Oracle
  - Support for user-defined functions written in C++ or Java
- Client interfaces
  - Application: JDBC, ODBC; access via database links (Oracle -> Impala) is possible
  - User: impala-shell -> analogous to sqlplus
- Can query data stored in HBase
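To illustrate the interface, a minimal sketch of a query as it could be typed into impala-shell or submitted over JDBC/ODBC; the table and column names here are hypothetical, not from our tests:

-- Hypothetical table and columns, for illustration only;
-- the same statement works from impala-shell, JDBC or ODBC.
SELECT prodsourcelabel, COUNT(*) AS jobs
FROM jobs_archive
WHERE modificationtime >= CAST('2013-01-01 00:00:00' AS TIMESTAMP)
GROUP BY prodsourcelabel
ORDER BY jobs DESC
LIMIT 10;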

Data store
- Table metadata stored in the Hive metastore
  - Types: (big)int, double, float, timestamp, string...
  - The data format is open => can be accessed by other tools
- Available structures
  - Tables (a declaration is sketched below)
    - No indexes => no primary/foreign keys => no constraints
    - List partitioning based on virtual columns
  - Views
- Data stored on HDFS; supported formats:
  - Text/CSV
  - Sequence files (binary key/value), RCFiles
  - Avro – binary
  - Parquet – binary, columnar store
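A minimal sketch of such a table declaration; the table and column names are assumptions for illustration. The partition columns are virtual: they map to HDFS directories rather than being stored in the data files.

-- Sketch: Parquet-backed table, list-partitioned on virtual columns.
CREATE TABLE sensor_data (
  variable_id INT,
  utc_stamp   TIMESTAMP,
  value       DOUBLE
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;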

Various file formats

What is the Parquet format?
- Parquet file format for additional performance
- Column-oriented storage => limits IO to the data actually needed
- Column statistics stored in file headers
- Compression
  - Snappy – fast, with average compression ratios (a conversion example is sketched below)
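As an example, rewriting an existing text table into Snappy-compressed Parquet could look like the following; sensor_data_csv is an assumed source table, and COMPRESSION_CODEC is an impala-shell query option:

-- CREATE TABLE ... AS SELECT rewrites the source data into
-- Parquet files compressed with Snappy.
SET COMPRESSION_CODEC=snappy;
CREATE TABLE sensor_data_parquet
STORED AS PARQUET
AS SELECT * FROM sensor_data_csv;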

Data storing
- No transactions! No updates!
- Sqoop -> import from other RDBMSs (including Oracle)
  - Bulk loading, multiple file formats supported...
- Bulk loading -> by mapping existing HDFS files (see the sketch below)
  - Any tool can be used to create the data files
- Regular DML – not performing well so far
- Real-time streaming with GoldenGate is an option (under investigation)
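A minimal sketch of the HDFS-mapping path; the path and table name are hypothetical. The files are moved under the table's HDFS directory, so the statement itself is a cheap metadata operation:

-- The files at the given path must already match the table's
-- declared format (e.g. delimited text for a text table).
LOAD DATA INPATH '/user/etl/incoming/2015-05'
INTO TABLE sensor_data_csv;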

Test hardware
Old hardware previously used for databases ( )
- 12x machines
  - 2x4 cores, 2.27 GHz
  - 24 GB of RAM
- 18x storage arrays, 12 disks each, attached to the servers via redundant FC links (2*4 Gb/s)

Initial test with physics data
Source: M. Limper, M. Grzybek
- ETL: ROOT -> CSV/Parquet -> Impala table
- Rows of hundreds of columns -> columnar store
- Queries: 5x reduced execution time compared to Oracle
  - The columnar nature of the data -> Parquet wins
- Complex queries handling the analysis
  - In Maaike's words: easier to handle the analysis than in ROOT itself
- Parallelization for free – done by the engine without any effort

Scalability

ATLAS PandaArch

SELECT *
FROM atlas_pandaarch.y2013_jobsarchived
WHERE prodsourcelabel IN ('panda', 'install', 'ddm', 'prod_test')
  AND modificationtime BETWEEN ' ' AND ' '

Only 3 partitions read! A full table (290 GB) scan takes 217 s

PandaArch – full data scans
- Oracle -> the current production system (ATLARC 12c) was used; executed on one instance out of two
- Parquet -> the current test Hadoop cluster (12 machines) was used; data was delivered to the client

Profiling
Signal value for a given variable (by ID) within a given time window:

SELECT utc_stamp, value
FROM data_numeric_part_by_impala
WHERE variable_id =
  AND utc_stamp BETWEEN CAST(' :00:00' AS TIMESTAMP)
                    AND CAST(' :59:59' AS TIMESTAMP)
ORDER BY utc_stamp

- CPU fully utilised on all cluster nodes
- Constant IO load of 2.8M blocks/s = 1.33 GB/s
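For reference, a sketch of how such behaviour can be inspected from impala-shell: EXPLAIN prints the plan without running the query, and PROFILE dumps the per-node runtime counters of the query that just executed.

EXPLAIN SELECT utc_stamp, value FROM data_numeric_part_by_impala LIMIT 10;
SELECT utc_stamp, value FROM data_numeric_part_by_impala LIMIT 10;
PROFILE;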

Summary
- Impala seems to be a good alternative to Oracle for data warehouse workloads
- Impala outperforms Oracle in our full table/partition scan tests thanks to its scalable architecture
- The Parquet and Avro file formats are the way to go
- More features to come – indexes, transactions...

Future plans
- Extension of the test cluster (up to 28 machines)
- Run pilot systems for LHCLOG, PVSS, ..., on preprod hardware
- We are open to new use cases and collaboration