Apache Kudu Zbigniew Baranowski

Intro

What is KUDU?

- New storage engine for structured data (tables) – does not use HDFS!
- Columnar store
- Mutable (insert, update, delete)
- Written in C++
- Apache-licensed – open source
- Quite new – version 1.0 recently released; first commit on October 11th, 2012 …and immature?

KUDU tries to fill the gap

HDFS excels at:
- Scanning large amounts of data at speed
- Accumulating data with high throughput

HBase (on HDFS) excels at:
- Fast random lookups by key
- Making data mutable

Table-oriented storage

A Kudu table has an RDBMS-like schema:
- Primary key (one or many columns), no secondary indexes
- Finite and constant number of columns (unlike HBase)
- Each column has a name and a type: boolean, int(8,16,32,64), float, double, timestamp, string, binary
- Horizontally partitioned (range, hash)
  - partitions are called tablets
  - tablets can have 3 or 5 replicas
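A minimal sketch of how such a schema and its partitioning map onto the Java client, assuming the org.apache.kudu 1.0 package names, a hypothetical table name and the column names used later in the deck; hash and range partitioning can be combined at table-creation time:

import java.util.Arrays;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.PartialRow;

public class PartitioningSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master.cern.ch:7051").build();

    Schema schema = new Schema(Arrays.asList(
        new ColumnSchema.ColumnSchemaBuilder("runnumber", Type.INT64).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("eventnumber", Type.INT64).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("datatype", Type.STRING).build()));

    // Hash-partition on runnumber into 16 buckets, then range-partition on eventnumber;
    // each tablet gets 3 replicas.
    CreateTableOptions options = new CreateTableOptions()
        .addHashPartitions(Arrays.asList("runnumber"), 16)
        .setRangePartitionColumns(Arrays.asList("eventnumber"))
        .setNumReplicas(3);

    // One explicit range partition covering eventnumber in [0, 1000000).
    PartialRow lower = schema.newPartialRow();
    lower.addLong("eventnumber", 0L);
    PartialRow upper = schema.newPartialRow();
    upper.addLong("eventnumber", 1000000L);
    options.addRangePartition(lower, upper);

    client.createTable("partitioning_example", schema, options);
    client.shutdown();
  }
}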

Data consistency

Writing:
- Single-row mutations are done atomically across all columns
- No multi-row ACID transactions

Reading:
- Tuneable freshness of the data: read whatever is available, or wait until all changes committed in the WAL are available
- Snapshot consistency: changes made during scanning are not reflected in the results; point-in-time queries are possible (based on a provided timestamp)
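A minimal sketch of how these read modes surface in the Java client, assuming the org.apache.kudu 1.0 client and a hypothetical table name; READ_LATEST returns whatever the replica has, while READ_AT_SNAPSHOT gives a consistent (optionally point-in-time) view:

import org.apache.kudu.client.AsyncKuduScanner;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduTable;

public class ReadModeSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master.cern.ch:7051").build();
    KuduTable table = client.openTable("kudu_example");

    // Default mode: read whatever is currently available on the tablet replica.
    KuduScanner latest = client.newScannerBuilder(table)
        .readMode(AsyncKuduScanner.ReadMode.READ_LATEST)
        .build();

    // Snapshot read: changes made while scanning are not reflected in the results.
    // An explicit timestamp (in microseconds) turns this into a point-in-time query.
    long snapshotMicros = System.currentTimeMillis() * 1000L;
    KuduScanner snapshot = client.newScannerBuilder(table)
        .readMode(AsyncKuduScanner.ReadMode.READ_AT_SNAPSHOT)
        .snapshotTimestampMicros(snapshotMicros)
        .build();

    latest.close();
    snapshot.close();
    client.shutdown();
  }
}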

Classical low-latency design

Kudu simplifies the Big Data deployment model for online analytics (low-latency ingestion and access).

[Diagram of the classical design: stream sources deliver events to a staging area; indexed data is flushed immediately for fast data access, while big files are flushed periodically to HDFS for batch processing.]

Implementing low latency with Kudu

[Diagram: the same stream sources write events directly into Kudu, which serves both batch processing and fast data access.]

Kudu Architecture

Architecture overview

Master server (can be multiple masters for HA):
- Stores metadata – table definitions
- Tablets directory (tablet locations)
- Coordinates cluster reconfigurations

Tablet servers (worker nodes):
- Write and read tablets
- Tablets are stored on local disks (no HDFS)
- Track the status of tablet replicas (followers)
- Replicate the data to followers

Tables and tablets

The master keeps a map of each table's tablets and their replica placement, e.g. for table TEST:

TabletID | Leader | Follower1 | Follower2
TEST1    | TS1    | TS2       | TS3
TEST2    | TS4    |           |
TEST3    |        |           |

[Diagram: tablet servers TabletServer1–4 each host a mix of leader and follower replicas of tablets TEST1–TEST3.]

Data changes propagation in Kudu (Raft consensus – https://raft.github.io)

[Sequence diagram, roughly: the client obtains the tablet locations from the master and sends the write (x rows) to tablet server X, which hosts the leader replica of tablet 1; the leader appends to its WAL and commits; the rows are also written to the follower replicas on tablet servers Y and Z, which append to their WALs and ACK, with their commits applied asynchronously; the client receives success.]

Insert into a tablet (without uniqueness check)

- Inserts go into the MemRowSet, a B+tree whose leaves are sorted by primary key
- The MemRowSet is flushed to DiskRowSets (32 MB each): a columnar store encoded similarly to Parquet, with rows sorted by PK
- Each DiskRowSet stores its PK {min, max} and Bloom filters for PK ranges, kept in a cached B-tree
- An interval tree keeps track of the PK ranges within the DiskRowSets
- There might be thousands of such sets per tablet

DiskRowSet compaction

- Periodic task
- Removes deleted rows
- Reduces the number of sets with overlapping PK ranges
- Does not create bigger DiskRowSets – the 32 MB size of each DRS is preserved

Example: DiskRowSet1 with PK range {A, G} and DiskRowSet2 with PK range {B, E} are compacted into DiskRowSet1 {A, D} and DiskRowSet2 {E, G}.

How columns are stored on disk (DiskRowSet)

- A B-tree index on the PK maps primary keys to row offsets
- For each column, a B-tree index maps row offsets to pages
- Column values are stored in 256 KB pages (plus page metadata) within the 32 MB DiskRowSet
- Pages are encoded with a variety of encodings, such as dictionary encoding, bitshuffle, or RLE
- Pages can be compressed: Snappy, LZ4 or zlib
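These per-column encodings and compressions are chosen when the columns are defined; a small sketch with the Java client, assuming the org.apache.kudu 1.0 package names and hypothetical column names:

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Type;

public class ColumnEncodingSketch {
  public static void main(String[] args) {
    // Low-cardinality string column: dictionary encoding + LZ4 block compression.
    ColumnSchema project = new ColumnSchema.ColumnSchemaBuilder("project", Type.STRING)
        .encoding(ColumnSchema.Encoding.DICT_ENCODING)
        .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.LZ4)
        .build();

    // Integer column: bitshuffle encoding + zlib compression.
    ColumnSchema lumiblockn = new ColumnSchema.ColumnSchemaBuilder("lumiblockn", Type.INT64)
        .encoding(ColumnSchema.Encoding.BIT_SHUFFLE)
        .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.ZLIB)
        .build();

    // Boolean flag: run-length encoding, no extra compression.
    ColumnSchema isTest = new ColumnSchema.ColumnSchemaBuilder("is_test", Type.BOOL)
        .encoding(ColumnSchema.Encoding.RLE)
        .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.NO_COMPRESSION)
        .build();

    System.out.println(project + "\n" + lumiblockn + "\n" + isTest);
  }
}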

Kudu deployment

3 options for deployment

- Build from source 
- Using RPMs
  - 1 core RPM, 2 service RPMs (master and tablet servers)
  - One shared config file
- Using Cloudera Manager
  - Click, click, click, done

Interfacing with Kudu

Table access and manipulation

- Operations on tables (NoSQL): insert, update, delete, scan
- Python, C++ and Java APIs
- Integrated with Impala & Hive (SQL), MapReduce, Spark
- Flume sink (ingestion)

Manipulating Kudu tables with SQL (Impala/Hive)

Table creation:

CREATE TABLE `kudu_example` (
  `runnumber`   BIGINT,
  `eventnumber` BIGINT,
  `project`     STRING,
  `streamname`  STRING,
  `prodstep`    STRING,
  `datatype`    STRING,
  `amitag`      STRING,
  `lumiblockn`  BIGINT,
  `bunchid`     BIGINT
)
DISTRIBUTE BY HASH (runnumber) INTO 64 BUCKETS
TBLPROPERTIES(
  'storage_handler'       = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name'       = 'example_table',
  'kudu.master_addresses' = 'kudu-master.cern.ch:7051',
  'kudu.key_columns'      = 'runnumber, eventnumber'
);

DMLs:

insert into kudu_example values (1, 30, 'test', ….);
insert into kudu_example select * from data_parquet;
update kudu_example set datatype = 'test' where runnumber = 1;
delete from kudu_example where project = 'test';

Queries:

select count(*), max(eventnumber) from kudu_example
  where datatype like '%AOD%' group by runnumber;
select * from kudu_example k, parquet_table p
  where k.runnumber = p.runnumber;

Creating a table with Java

import org.kududb.*;

// CREATING TABLE
String tableName = "my_table";
String KUDU_MASTER_NAME = "master.cern.ch";
KuduClient client = new KuduClient.KuduClientBuilder(KUDU_MASTER_NAME).build();

List<ColumnSchema> columns = new ArrayList<>();
columns.add(new ColumnSchema.ColumnSchemaBuilder("runnumber", Type.INT64)
    .key(true)
    .encoding(ColumnSchema.Encoding.BIT_SHUFFLE)
    .nullable(false)
    .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.SNAPPY)
    .build());
columns.add(new ColumnSchema.ColumnSchemaBuilder("eventnumber", Type.INT64)
    .key(true)
    .encoding(ColumnSchema.Encoding.BIT_SHUFFLE)
    .nullable(false)
    .compressionAlgorithm(ColumnSchema.CompressionAlgorithm.SNAPPY)
    .build());
// ……..
Schema schema = new Schema(columns);

List<String> partColumns = new ArrayList<>();
partColumns.add("runnumber");
partColumns.add("eventnumber");

CreateTableOptions options = new CreateTableOptions()
    .addHashPartitions(partColumns, 64)
    .setNumReplicas(3);
client.createTable(tableName, schema, options);

Inserting rows with Java

// INSERTING
KuduTable table = client.openTable(tableName);
KuduSession session = client.newSession();

Insert insert = table.newInsert();
PartialRow row = insert.getRow();
row.addLong(0, 1);
row.addString(2, "test");
// ….
session.apply(insert);   // stores the operation in memory on the client side (for batch upload)
session.flush();         // sends the data to Kudu
// ……..
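Updates and deletes follow the same pattern; a minimal sketch, assuming the table and session opened above (and that the flush mode is switched explicitly for client-side batching):

// Operations are applied through the same KuduSession.
// Optionally batch on the client and flush manually:
session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);

// UPDATE: key columns identify the row, non-key columns carry the new values.
Update update = table.newUpdate();
PartialRow urow = update.getRow();
urow.addLong("runnumber", 1);
urow.addLong("eventnumber", 30);
urow.addString("datatype", "test");
session.apply(update);

// DELETE: only the key columns are needed.
Delete delete = table.newDelete();
PartialRow drow = delete.getRow();
drow.addLong("runnumber", 1);
drow.addLong("eventnumber", 30);
session.apply(delete);

session.flush();   // send the batched operations to Kudu
session.close();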

Scanner in Java

// configuring column projection
List<String> projectColumns = new ArrayList<>();
projectColumns.add("runnumber");
projectColumns.add("dataType");

// setting a scan range on the primary key
Schema s = table.getSchema();
PartialRow start = s.newPartialRow();
start.addLong("runnumber", 8);
PartialRow end = s.newPartialRow();
end.addLong("runnumber", 10);

KuduScanner scanner = client.newScannerBuilder(table)
    .lowerBound(start)
    .exclusiveUpperBound(end)
    .setProjectedColumnNames(projectColumns)
    .build();

while (scanner.hasMoreRows()) {
    RowResultIterator results = scanner.nextRows();
    while (results.hasNext()) {
        RowResult result = results.next();
        System.out.println(result.getString(1));   // getting the 2nd projected column
    }
}
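Predicates on non-key columns can also be pushed down to Kudu through the scanner builder, so filtering happens on the tablet servers; a minimal sketch, assuming the org.apache.kudu 1.0 client and the table/column names used earlier in the deck:

import java.util.Arrays;
import org.apache.kudu.Schema;
import org.apache.kudu.client.*;

public class PredicateScanSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master.cern.ch:7051").build();
    KuduTable table = client.openTable("kudu_example");
    Schema schema = table.getSchema();

    // Push down "runnumber >= 169864" and "datatype = 'test'" to the tablet servers,
    // so only matching rows (and only the projected columns) travel back to the client.
    KuduPredicate onRun = KuduPredicate.newComparisonPredicate(
        schema.getColumn("runnumber"), KuduPredicate.ComparisonOp.GREATER_EQUAL, 169864L);
    KuduPredicate onType = KuduPredicate.newComparisonPredicate(
        schema.getColumn("datatype"), KuduPredicate.ComparisonOp.EQUAL, "test");

    KuduScanner scanner = client.newScannerBuilder(table)
        .addPredicate(onRun)
        .addPredicate(onType)
        .setProjectedColumnNames(Arrays.asList("runnumber", "eventnumber", "datatype"))
        .build();

    while (scanner.hasMoreRows()) {
      RowResultIterator results = scanner.nextRows();
      while (results.hasNext()) {
        RowResult row = results.next();
        System.out.println(row.getLong("runnumber") + " " + row.getString("datatype"));
      }
    }
    client.shutdown();
  }
}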

Spark with Kudu

wget http://central.maven.org/maven2/org/apache/kudu/kudu-spark_2.10/1.0.0/kudu-spark_2.10-1.0.0.jar
spark-shell --jars kudu-spark_2.10-1.0.0.jar

import org.apache.kudu.spark.kudu._
import org.apache.kudu.client.CreateTableOptions

// Read a table from Kudu
val df = sqlContext.read.options(
  Map("kudu.master" -> "kudu_master.cern.ch:7051",
      "kudu.table"  -> "kudu_table")).kudu

// Query using the DF API...
df.select(df("runnumber"), df("eventnumber"), df("db0"))
  .filter($"runnumber" === 169864).filter($"eventnumber" === 1).show()

// ...or register a temporary table and use SQL
df.registerTempTable("kudu_table")
sqlContext.sql("select id from kudu_table where id >= 5").show()

// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
val kuduContext = new KuduContext("kudu_master.cern.ch:7051")
kuduContext.createTable("test_table", df.schema, Seq("key"),
  new CreateTableOptions().setNumReplicas(1))

// Insert data
kuduContext.insertRows(df, "test_table")

Kudu security

To be done!

Performance (based on ATLAS EventIndex case)

Average row length

- Each row consists of 56 attributes: most of them strings, a few integers and floats
- Very good compaction ratio – the same as Parquet

Insertion rates (per machine, per partition) with Impala

- Average ingestion speed: worse than Parquet, better than HBase

Random lookup with Impala

- Good random data lookup speed – similar to HBase

Data scan rate per core with a predicate on a non-PK column (using Impala)

- Quite good data scanning speed – much better than HBase
- If natively supported predicate operations are used, it is even faster than Parquet

Remarks about Kudu performance

Ingestion speed depends on:
- memory available on the server
- latency and throughput of the device that stores the WALs
- performance of the follower servers

Data access speed by index depends on:
- size of the memory buffers (hot data in memory are looked up within 30 ms)
- storage latency (cold data are looked up within 300 ms from HDDs or CEPH)

Data scan speed depends on:
- the predicate (whether it can be pushed down to Kudu)
- the number of projected columns
- storage throughput

Kudu monitoring

Cloudera Manager

- A lot of metrics are published through the servers' HTTP endpoints
- All are collected by CM agents and can be plotted
- Predefined CM dashboards:
  - Monitoring of Kudu processes
  - Workload plots
- CM can also be used for Kudu configuration
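As an illustration of what the CM agents scrape, the same metrics can be pulled directly from a server's embedded web server; a minimal sketch, assuming the default web UI ports (8051 for masters, 8050 for tablet servers) and a hypothetical host name:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class KuduMetricsProbe {
  public static void main(String[] args) throws Exception {
    // The /metrics endpoint returns a JSON document with all published metrics.
    URL metrics = new URL("http://kudu-tserver.cern.ch:8050/metrics");
    try (BufferedReader in = new BufferedReader(new InputStreamReader(metrics.openStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}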

CM – Kudu host status

CM - Workload plots

CM - Resource utilisation

Observations & Conclusions

What is nice about Kudu

- The first one in the Big Data open-source world trying to combine a columnar store + indexing
- Simple to deploy
- It works (almost) without problems
- It scales (this depends on how the schema is designed): writing, accessing, scanning
- Integrated with the Big Data mainstream processing frameworks: Spark, Impala, Hive, MapReduce
- SQL and NoSQL on the same data
- Gives more flexibility in optimizing schema design compared to HBase (two levels of partitioning)
- Cloudera is pushing to deliver production-like quality of the software ASAP

What is bad about Kudu?

- No security (it should be added in the next releases)
  - authentication (who connected)
  - authorization (ACLs)
- Raft consensus does not always work as it should
  - Too frequent tablet leader changes (sometimes a leader cannot be elected at all)
  - The period without a leader is quite long (sometimes it never ends); this freezes updates on the tables
- Handling disk failures: you have to erase/reinitialize the entire server
- Only one index per table
- No nested types (but there is a binary type)
- Cannot control tablet placement on servers

When can Kudu be useful?

- When you have structured 'big data'
  - like in an RDBMS
  - without complex types
- When sequential and random data access is required simultaneously and has to scale
  - data extraction and analytics at the same time
  - time series
- When low ingestion latency is needed and a lambda architecture is too expensive

Learn more

Main page: https://kudu.apache.org/
Video: https://www.oreilly.com/ideas/kudu-resolving-transactional-and-analytic-trade-offs-in-hadoop
Whitepaper: http://kudu.apache.org/kudu.pdf
KUDU project: https://github.com/cloudera/kudu
Some Java code examples: https://gitlab.cern.ch:8443/zbaranow/kudu-atlas-eventindex
Get the Cloudera Quickstart VM and test it