

1 Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy

2 Outline What is Impala? Data access Supported data formats Data loading Measured performance

3 Outline What is Impala? Data access Data store Data loading Measured performance

4 What is Cloudera Impala?
- A SQL engine for analytics, performing fast sequential full table/partition scans on top of the Hadoop Distributed File System (HDFS)
- SQL interface to data
- Supports many file formats as a data source
- Back-end written in C++
- Queries run in parallel on the cluster nodes
- Shared-nothing architecture – data locality
- Scales to high capacity and throughput on commodity hardware

5 Impala query execution
[Diagram: the application submits SQL over ODBC; every cluster node runs a Query Planner, Query Coordinator and Query Executor on top of HDFS; the result is returned to the client]

6 Outline What is Impala? Data access Data store Data loading Measured performance

7 Data access
- Declarative access with SQL
  - ANSI SQL compliant -> not 100% compatible with Oracle
  - Support for stored procedures written in C++ or Java
- Client interfaces
  - Application: JDBC, ODBC; access via db links (Oracle -> Impala) is possible
  - User: impala-shell -> similar to sqlplus
- Can also query data stored in HBase
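As a sketch of the user interface above (hostname and query are hypothetical), impala-shell can run a single query non-interactively, much like sqlplus:

```shell
# Connect to an impalad daemon (default port 21000) and run one query
impala-shell -i impalad-node1 -q "SELECT COUNT(*) FROM atlas_pandaarch.y2013_jobsarchived"
```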

8 Outline What is Impala? Data access Data store Data loading Measured performance

9 Data store
- Table metadata stored in the Hive metastore
  - (big)int, double, float, timestamp, string...
  - data format is open => can be accessed by other tools
- Available structures
  - Tables
    - No indexes => no primary/foreign keys => no constraints
    - List partitioning based on virtual columns
  - Views
- Data stored on HDFS; supported formats:
  - Text/CSV, Sequence files (binary key/value), RCFiles
  - Avro – binary
  - Parquet – binary, columnar store
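A minimal sketch of a table using the structures above (table and column names are hypothetical): a Parquet-backed table, list-partitioned on a virtual column that is not stored in the data files themselves:

```sql
-- Hypothetical table; 'day' is a virtual partition column
CREATE TABLE sensor_data (
  variable_id BIGINT,
  utc_stamp   TIMESTAMP,
  value       DOUBLE
)
PARTITIONED BY (day STRING)
STORED AS PARQUET;
```

Queries filtering on the partition column only touch the matching HDFS directories.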

10 Various file formats

11 What is the Parquet format?
- Parquet file format for additional performance
- Column-oriented storage => limits IO to the data actually needed
- Column statistics in the file headers
- Compression
  - Snappy – fast, with average compression rates
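As an illustration (table names hypothetical), an existing table can be rewritten into Snappy-compressed Parquet with a CREATE TABLE ... AS SELECT; Snappy is Impala's default codec for Parquet, and it can also be set explicitly per session:

```sql
SET COMPRESSION_CODEC=snappy;   -- explicit; snappy is the default for Parquet
CREATE TABLE events_parquet STORED AS PARQUET
AS SELECT * FROM events_text;
```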

12 Outline What is Impala? Data access Data store Data loading Measured performance

13 Data storing
- No transactions! No updates!
- Sqoop -> from other RDBMSs (including Oracle)
  - Bulk loading, multiple file formats supported...
- Bulk loading -> by mapping existing HDFS files
  - Any tool can be used to create the data files
- Regular DML – not performing well so far
- Real-time streaming with GoldenGate is an option (under investigation)
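A sketch of the Sqoop path from Oracle (connection string, credentials and table name are hypothetical):

```shell
# Bulk-load an Oracle table into HDFS as Avro files, ready to be mapped by Impala
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost.example.org:1521/SERVICE \
  --username app_reader -P \
  --table JOBSARCHIVED \
  --target-dir /user/impala/jobsarchived \
  --as-avrodatafile
```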

14 Outline What is Impala? Data access Data store Data loading Measured performance

15 Test hardware
- Old hardware previously used for databases (2009-2014)
- 12x machines
  - 2x4 cores @ 2.27GHz
  - 24 GB of RAM
- 18x storage arrays, 12 disks each, attached to the servers via redundant FC links (2x 4Gb/s)

16 Initial test for physics data Source: M. Limper, M. Grzybek
- ETL: ROOT -> csv/parquet -> Impala table
  - Rows of hundreds of columns -> columnar store
- Queries
  - 5x shorter execution time compared to Oracle
  - Columnar nature of the data -> Parquet wins
- Complex queries for handling analysis
  - According to Maaike, analysis is easier to handle than in ROOT itself
- Parallelization for free – done by the engine without any effort
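The csv -> parquet step of the ETL above can be sketched in Impala itself (file layout and column names are hypothetical): map the existing CSV files as an external table, then rewrite them into a Parquet table:

```sql
-- Expose CSV files already sitting on HDFS, without copying them
CREATE EXTERNAL TABLE physics_csv (event_id BIGINT, px DOUBLE, py DOUBLE, pz DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/physics/csv';

-- Rewrite into columnar Parquet for the actual analysis queries
CREATE TABLE physics_parquet STORED AS PARQUET
AS SELECT * FROM physics_csv;
```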

17 Scalability

18 ATLAS PandaArch

SELECT *
FROM atlas_pandaarch.y2013_jobsarchived
WHERE PRODSOURCELABEL IN ('panda', 'install', 'ddm', 'prod_test')
AND MODIFICATIONTIME BETWEEN '2013-10-10' AND '2013-10-19'

Only 3 partitions read! A full table (290GB) scan takes 217s

19 PandaArch – full data scans
- Oracle -> the current production system (ATLARC 12c) was used; executed on one instance out of two
- Parquet -> the current test Hadoop cluster (12 machines) was used; results returned to the client

20 Profiling
Signal value for a given variable (by ID) within a given time window:

SELECT utc_stamp, value
FROM data_numeric_part_by_impala
WHERE variable_id = 929658
AND utc_stamp BETWEEN CAST('2012-08-10 00:00:00' AS TIMESTAMP)
                  AND CAST('2012-08-17 23:59:59' AS TIMESTAMP)
ORDER BY utc_stamp

- CPU fully utilised on all cluster nodes
- Constant IO load of 2.8M blocks/s = 1.33GB/s
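The two quoted IO figures are consistent with each other, assuming the 2.8M blocks/s refers to conventional 512-byte disk blocks; a quick check:

```python
# Sanity check of the reported IO load (assumption: 512-byte disk blocks)
blocks_per_second = 2.8e6
block_size_bytes = 512
throughput_gib = blocks_per_second * block_size_bytes / 1024**3
print(f"{throughput_gib:.2f} GiB/s")  # the slide quotes 1.33GB/s
```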

21 Summary
- Impala seems to be a good alternative to Oracle for data warehouse workloads
- Impala outperforms Oracle in our tests with full table/partition scans, thanks to its scalable architecture
- The Parquet and Avro file formats are the way to go
- More features to come – indexes, transactions...

22 Future plans
- Extension of the test cluster (up to 28 machines)
- Run pilot systems for LHCLOG, PVSS, ..., on preprod hardware
- We are open to new use cases and collaboration

