

1 Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy

2 Outline What is Impala? Data access Supported data formats Data loading Measured performance

3 Outline What is Impala? Data access Data store Data loading Measured performance

4 What is Cloudera Impala?
- A SQL engine for analytics, performing fast sequential full table/partition scans on top of the Hadoop Distributed File System (HDFS)
- SQL interface to data
- Supports many file formats as a data source
- Back-end written in C++
- Queries run in parallel on the cluster nodes
- Shared-nothing architecture – data locality
- Scales to high capacity and throughput on commodity hardware

5 Impala query execution
[Diagram: the application submits SQL over ODBC; every cluster node runs a Query Planner, Query Coordinator and Query Executor on top of HDFS; the result is returned to the client]

6 Outline What is Impala? Data access Data store Data loading Measured performance

7 Data access
- Declarative access with SQL
  - ANSI SQL compliant -> not 100% compatible with Oracle
  - Support for stored procedures written in C++ or Java
- Client interfaces
  - Application: JDBC, ODBC; access via db links (Oracle -> Impala) is possible
  - User: impala-shell -> similar to sqlplus
- Can also query data stored in HBase
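As a sketch of the user interface above (hostname and query are hypothetical), impala-shell can run a single query non-interactively, much like sqlplus:

```shell
# Connect to an impalad daemon (default port 21000) and run one query
impala-shell -i impalad-node1 -q "SELECT COUNT(*) FROM atlas_pandaarch.y2013_jobsarchived"
```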

8 Outline What is Impala? Data access Data store Data loading Measured performance

9 Data store
- Table metadata stored in the Hive metastore
  - (big)int, double, float, timestamp, string...
  - data format is open => can be accessed by other tools
- Available structures
  - Tables
    - No indexes => no primary/foreign keys => no constraints
    - List partitioning based on virtual columns
  - Views
- Data stored on HDFS; supported formats:
  - Text/CSV, Sequence files (binary key/value), RCFiles
  - Avro – binary
  - Parquet – binary, columnar store
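A minimal sketch of a table using the structures above (table and column names are hypothetical): a Parquet-backed table, list-partitioned on a virtual column that is not stored in the data files themselves:

```sql
-- Hypothetical table; 'day' is a virtual partition column
CREATE TABLE sensor_data (
  variable_id BIGINT,
  utc_stamp   TIMESTAMP,
  value       DOUBLE
)
PARTITIONED BY (day STRING)
STORED AS PARQUET;
```

Queries filtering on the partition column only touch the matching HDFS directories.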

10 Various file formats

11 What is the Parquet format?
- Parquet file format for additional performance
- Column-oriented storage => limits IO to the data actually needed
- Column statistics in the file headers
- Compression
  - Snappy – fast, with average compression rates
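As an illustration (table names hypothetical), an existing table can be rewritten into Snappy-compressed Parquet with a CREATE TABLE ... AS SELECT; Snappy is Impala's default codec for Parquet, and it can also be set explicitly per session:

```sql
SET COMPRESSION_CODEC=snappy;   -- explicit; snappy is the default for Parquet
CREATE TABLE events_parquet STORED AS PARQUET
AS SELECT * FROM events_text;
```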

12 Outline What is Impala? Data access Data store Data loading Measured performance

13 Data storing
- No transactions! No updates!
- Sqoop -> from other RDBMSs (including Oracle)
  - Bulk loading, multiple file formats supported...
- Bulk loading -> by mapping existing HDFS files
  - Any tool can be used to create the data files
- Regular DML – not performing well so far
- Real-time streaming with GoldenGate is an option (under investigation)
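A sketch of the Sqoop path from Oracle (connection string, credentials and table name are hypothetical):

```shell
# Bulk-load an Oracle table into HDFS as Avro files, ready to be mapped by Impala
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost.example.org:1521/SERVICE \
  --username app_reader -P \
  --table JOBSARCHIVED \
  --target-dir /user/impala/jobsarchived \
  --as-avrodatafile
```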

14 Outline What is Impala? Data access Data store Data loading Measured performance

15 Test hardware
- Old hardware previously used for databases (2009-2014)
- 12x machines
  - 2x4 cores @ 2.27GHz
  - 24 GB of RAM
- 18x storage arrays, 12 disks each, attached to the servers via redundant FC links (2x 4Gb/s)

16 Initial test for physics data Source: M. Limper, M. Grzybek
- ETL: ROOT -> csv/parquet -> Impala table
  - Rows of hundreds of columns -> columnar store
- Queries
  - 5x shorter execution time compared to Oracle
  - Columnar nature of the data -> Parquet wins
- Complex queries for handling analysis
  - According to Maaike, analysis is easier to handle than in ROOT itself
- Parallelization for free – done by the engine without any effort
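The csv -> parquet step of the ETL above can be sketched in Impala itself (file layout and column names are hypothetical): map the existing CSV files as an external table, then rewrite them into a Parquet table:

```sql
-- Expose CSV files already sitting on HDFS, without copying them
CREATE EXTERNAL TABLE physics_csv (event_id BIGINT, px DOUBLE, py DOUBLE, pz DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/physics/csv';

-- Rewrite into columnar Parquet for the actual analysis queries
CREATE TABLE physics_parquet STORED AS PARQUET
AS SELECT * FROM physics_csv;
```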

17 Scalability

18 ATLAS PandaArch

SELECT *
FROM atlas_pandaarch.y2013_jobsarchived
WHERE PRODSOURCELABEL IN ('panda', 'install', 'ddm', 'prod_test')
AND MODIFICATIONTIME BETWEEN '2013-10-10' AND '2013-10-19'

Only 3 partitions read! A full table (290GB) scan takes 217s

19 PandaArch – full data scans
- Oracle -> the current production system (ATLARC 12c) was used; executed on one instance out of two
- Parquet -> the current test Hadoop cluster (12 machines) was used; results returned to the client

20 Profiling
Signal value for a given variable (by ID) within a given time window:

SELECT utc_stamp, value
FROM data_numeric_part_by_impala
WHERE variable_id = 929658
AND utc_stamp BETWEEN CAST('2012-08-10 00:00:00' AS TIMESTAMP)
                  AND CAST('2012-08-17 23:59:59' AS TIMESTAMP)
ORDER BY utc_stamp

- CPU fully utilised on all cluster nodes
- Constant IO load of 2.8M blocks/s = 1.33GB/s
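The two quoted IO figures are consistent with each other, assuming the 2.8M blocks/s refers to conventional 512-byte disk blocks; a quick check:

```python
# Sanity check of the reported IO load (assumption: 512-byte disk blocks)
blocks_per_second = 2.8e6
block_size_bytes = 512
throughput_gib = blocks_per_second * block_size_bytes / 1024**3
print(f"{throughput_gib:.2f} GiB/s")  # the slide quotes 1.33GB/s
```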

21 Summary
- Impala seems to be a good alternative to Oracle for data warehouse workloads
- Impala outperforms Oracle in our tests with full table/partition scans, thanks to its scalable architecture
- The Parquet and Avro file formats are the way to go
- More features to come – indexes, transactions...

22 Future plans
- Extension of the test cluster (up to 28 machines)
- Run pilot systems for LHCLOG, PVSS, ..., on preprod hardware
- We are open to new use cases and collaboration

