Real-time analytics using Kudu at petabyte scale

Presentation transcript:

Real-time analytics using Kudu at petabyte scale Sridhar Alla & Shekhar Agrawal Comcast

Speakers
Sridhar Alla, Director, Data Engineering, Comcast. Engineering on a big data scale. sridhar_alla@cable.comcast.com
Shekhar Agrawal, Director, Data Science, Comcast. Data science on a big data scale. shekhar_agrawal@cable.comcast.com

Use cases
Update the activity history table with new customer information each month.
Update the viewing history table when the program guide is updated (see the sketch below).
Update metadata from upstream systems and models.
Load data into Impala using real-time feeds.
However, Kudu is not a system for OLTP operations, so our use cases do not involve transactions and rollbacks.
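
Kudu-backed tables accept row-level DML through Impala, which is what makes the update-style use cases above practical without rebuilding files. Below is a minimal, hypothetical sketch using the impyla client; the host, table, and column names are placeholders, and it assumes an Impala build with Kudu DML (UPDATE) support.

```python
# Minimal sketch: a row-level UPDATE against a Kudu-backed table via Impala.
# Host, table, and column names are hypothetical placeholders.
from impala.dbapi import connect

conn = connect(host='impala-daemon.example.com', port=21050)  # assumed daemon host
cur = conn.cursor()

# Unlike Parquet-on-HDFS tables, Kudu tables support row-level UPDATE through Impala.
cur.execute("""
    UPDATE viewing_history
    SET program_title = 'Updated Title'
    WHERE account_id = 12345
      AND event_time = '2017-01-15 20:00:00'
""")

cur.close()
conn.close()
```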

Old Architecture
MapReduce on 480 nodes running CDH 5.8
Spark 2.1.0 on 480 nodes
HBase 1.2 on 100 nodes
Impala 2.7 on 420 nodes

Old Architecture

Challenges
Keeping data current required rebuilding entire datasets or partitions by re-generating compressed CSV files and loading them into HDFS, which took several hours or days.
Rebuild operations consumed cluster capacity, limiting availability for other teams on a shared cluster.
There was no way to update a single row in a dataset without recreating the table or using a slower, complicated integration with HBase.

New Architecture
Stores tune events in Kudu; any data fixes are made directly in Kudu.
Stores metadata directly in Kudu; any data fixes are made directly in Kudu.
Spark Streaming updates Kudu in real time to support quick analytics.
A Spark job reads the raw events, sessionizes them, and updates Kudu (see the sketch below).
BI tools like Zoomdata work directly with Impala or Kudu to enable analytics.
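
The Spark-to-Kudu path described above can be sketched roughly as follows with PySpark and the kudu-spark connector. This is a hedged illustration, not the production job: the input path, schema, sessionization logic, master addresses, and table name are all assumptions, and the kudu-spark package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tune-events-to-kudu").getOrCreate()

# Raw tune events landed on HDFS (path and schema are placeholders).
raw = spark.read.json("hdfs:///data/raw/tune_events/")

# Toy sessionization: bucket each account's events into 30-minute windows.
sessions = (raw
            .withColumn("event_time", F.col("event_time").cast("timestamp"))
            .groupBy("account_id",
                     F.window("event_time", "30 minutes").alias("w"))
            .agg(F.count("*").alias("events_in_session"))
            .select("account_id",
                    F.col("w.start").alias("session_start"),
                    "events_in_session"))

# Write the sessionized rows to the Kudu table that Impala also queries.
(sessions.write
 .format("org.apache.kudu.spark.kudu")
 .option("kudu.master", "kudu-master-1:7051,kudu-master-2:7051")
 .option("kudu.table", "impala::default.tune_sessions")
 .mode("append")
 .save())
```

Zoomdata or any other BI tool then queries the same Kudu table through Impala, with no separate load step.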

New Architecture

Kudu at a high level
Agenda: need for another data store (why Kudu?); tables/schemas; reads/writes; consistency model; architecture (master/tablet); partitioning; replication via Raft.
>> Why Kudu: HBase offers low-latency random access and HDFS offers sequential-access storage; Kudu fills the gap between them.
>> Tables/schemas: a primary key is required and there are no secondary indexes. Tables are schema-specific; explicit column types help with encoding and expose SQL-like metadata to BI tools.
>> Writes: the primary key is required and there is no multi-row transactional API; writes are treated as individual transactions, though they are optimized by batching (see the sketch below).
>> Reads: comparison predicates and key-range scans. On-disk storage is columnar, so reading only a few columns is far more efficient.
>> Consistency: (a) snapshot consistency gives a read-your-writes guarantee for a single client, but causal dependence across clients (external consistency) is not guaranteed (same as HBase): client A writes a, client B writes b, and a third reader may see b without a. Kudu does allow propagating a timestamp token from one client to another manually. (b) Spanner-style consistency: if commit-wait is enabled, the client waits until any causally preceding writes have been taken care of, but this adds significant write latency.
>> Timestamps: Kudu does not allow user-supplied timestamps on writes (it maintains them itself). Users can supply a timestamp on reads, which allows point-in-time queries and ensures that the different distributed tasks that make up a single query read a consistent snapshot.
>> Architecture: a single master (for metadata) and tablet servers (for data). The master can be replicated for fault tolerance and quick recovery from failover; each machine hosts multiple tablets.
>> Partitioning: horizontal, multi-level partitioning; a partition schema contains zero or more hash-partition levels followed by an optional range partition. For example, if the key contains a large number of time-series events, the key should be hash-partitioned and the timestamp range-partitioned. Users can trade off query parallelism against concurrency for their workload by choosing the right partition scheme.
>> Replication via Raft: a write locates the leader replica. If that replica is still the leader, it commits the write to its local log and passes it to all follower replicas via Raft until they commit. If it is no longer the leader, it rejects the write, causing the client to refresh its metadata cache and resend the request.
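
The write and read paths described in these notes can be illustrated with the kudu-python client. This is a minimal sketch under assumptions: the master host, table name, and column names are placeholders, and the table is assumed to already exist.

```python
import kudu
from datetime import datetime

client = kudu.connect(host='kudu-master.example.com', port=7051)
table = client.table('tune_events')

# Writes: every operation carries the full primary key; there is no
# multi-row transaction, but a session batches operations until flush().
session = client.new_session()
session.apply(table.new_insert({'account_id': 42,
                                'event_time': datetime.utcnow(),
                                'channel': 'sports'}))
session.flush()

# Reads: column predicates and key-range scans; columnar storage makes
# narrow projections cheap.
scanner = table.scanner()
scanner.add_predicate(table['account_id'] == 42)
for row in scanner.open().read_all_tuples():
    print(row)
```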

Using Kudu
Finite table schema (unlike HBase and other columnar NoSQL stores); the primary key is clearly identifiable.
Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes, e.g.:
Time series. Examples: streaming market data; fraud detection and prevention; risk monitoring. Workload: inserts, updates, scans, lookups.
Machine data analytics. Examples: network threat detection. Workload: inserts, scans, lookups.
Online reporting. Examples: ODS. Workload: inserts, updates, scans, lookups.

Kudu

Kudu at Comcast
410-node Apache Kudu cluster (co-located with HDFS).
memory_limit_hard_bytes is set to 4 GB; block_cache_capacity_mb is set to 512 MB.
Migrate a Parquet table in HDFS to Kudu.
The table has ~300 columns. Note: Kudu currently supports a maximum of 300 columns per table.
~500 million records are inserted every day; 3 years of archive (~600 billion records).
The average row size of the table is 2 KB.

Kudu

Schema design – Data types
Source table column data types: bigint, decimal(38,10), double, int, string, timestamp.
Kudu table data types:
bigint stored as int64
decimal stored as double
int stored as int32
string stored as string
timestamp stored as unixtime_micros

Schema design – Column encoding
Encoding types available in Kudu:
Column type              Encodings
int8, int16, int32       plain, bitshuffle, run length
int64, unixtime_micros   plain, bitshuffle
float, double            plain, bitshuffle
bool                     plain, run length
string, binary           plain, prefix, dictionary
Dictionary encoding was used for string columns; bitshuffle encoding was used for ints and doubles (see the sketch below).
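
Assuming the kudu-python schema builder, the type and encoding choices above could be declared roughly as follows. The column names are placeholders, and the string encoding aliases ('bitshuffle', 'dictionary') are assumptions based on the Python client's documented names.

```python
import kudu

builder = kudu.schema_builder()

# Key columns (names are placeholders).
builder.add_column('account_id').type(kudu.int64).nullable(False)
builder.add_column('event_time').type(kudu.unixtime_micros).nullable(False)

# decimal(38,10) source columns stored as double, bitshuffle-encoded.
builder.add_column('amount').type(kudu.double).encoding('bitshuffle')

# int source columns stored as int32, bitshuffle-encoded.
builder.add_column('channel_id').type(kudu.int32).encoding('bitshuffle')

# string columns kept as string, dictionary-encoded.
builder.add_column('program_title').type(kudu.string).encoding('dictionary')

builder.set_primary_keys(['account_id', 'event_time'])
schema = builder.build()
```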

Table of 600 Billion Rows X 230 columns

Schema design – Partitioning
Design the partition scheme based on available server capacity.
Each partition is a tablet managed by a tablet server.
Design partitions so that many tablets are engaged during writes, and so that no tablet has to manage a large amount of data when full scans are needed.
A multilevel hash partition was used (see the sketch below):
The top-level hash was based on day and account id.
The child-level hash was based on the scan requirements of the queries.
4000 partitions were created (~150 million records per tablet).
Each tablet manages ~64 GB (with encoding and compression).
10 tablets per tablet server.
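
A partition scheme of this shape can be expressed with the kudu-python client roughly as below. The bucket counts, column names, and table name are illustrative assumptions, not the production values; 40 x 100 = 4000 tablets simply matches the partition count above.

```python
import kudu
from kudu.client import Partitioning

client = kudu.connect(host='kudu-master.example.com', port=7051)

# Schema along the lines of the previous sketch (only key columns shown plus one value column).
builder = kudu.schema_builder()
builder.add_column('event_day').type(kudu.int32).nullable(False)
builder.add_column('account_id').type(kudu.int64).nullable(False)
builder.add_column('event_time').type(kudu.unixtime_micros).nullable(False)
builder.add_column('channel').type(kudu.string)
builder.set_primary_keys(['event_day', 'account_id', 'event_time'])
schema = builder.build()

# Two hash levels: day/account at the top, a second level for scan parallelism.
partitioning = Partitioning()
partitioning.add_hash_partitions(column_names=['event_day', 'account_id'],
                                 num_buckets=40)
partitioning.add_hash_partitions(column_names=['event_time'],
                                 num_buckets=100)

client.create_table('tune_events', schema, partitioning)
```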

Data migration
Source data was in Parquet format in HDFS.
Decimals were converted to double using the cast function; timestamps were converted to int using the unix_timestamp function (see the sketch below).
Migration was tested using Apache Impala (incubating) and Apache Spark.
Spark is recommended for large inserts to ensure failures are handled.
Impala performance was ~40% faster than Spark for inserts.
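
When run through Impala, the conversions above amount to an INSERT ... SELECT with casts. A hedged sketch via impyla follows; the table and column names are placeholders.

```python
from impala.dbapi import connect

conn = connect(host='impala-daemon.example.com', port=21050)  # assumed daemon host
cur = conn.cursor()

# Copy the Parquet table into the Kudu table, converting types along the way.
cur.execute("""
    INSERT INTO tune_events_kudu
    SELECT
        account_id,
        CAST(amount AS DOUBLE)     AS amount,          -- decimal(38,10) -> double
        UNIX_TIMESTAMP(event_time) AS event_time_secs, -- timestamp -> bigint
        program_title
    FROM tune_events_parquet
""")

cur.close()
conn.close()
```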

Hardware design considerations
The Kudu cluster was co-located with HDFS.
Use static resource allocation to ensure YARN, Impala and Kudu each have proper resources available.
Ensure the disks that store write-ahead logs are not a bottleneck: aim for upwards of 150 MB/sec and 1000 IOPS.
KUDU-1835 (enable WAL compression) is resolved upstream and expected to be available in Kudu 1.3.

Update Benchmarks
Benchmark 1 – Tablet memory changes: 600K/sec. Tablet server memory was set to 8 GB.
Benchmark 2 – Iterations of tuning (best): ~1 million/sec. Increasing the size of the maintenance manager thread pool helped; changing multiple other parameters, including experimental ones, did not.
Benchmark 3 – Changed table partitioning: ~2.5 million/sec. Table partitions were changed so that inserts are now supported by 8000 partitions. Note: CPU, memory and disk need to support 20 tablets per tablet server.

WAL was the bottleneck... A single disk was not able to support the disk I/O requirements of the WAL. To support high write throughput, the WAL storage should be designed to support >1000 IOPS (writes).

Query Operations

Delete Operations

Impala
Impala 2.7 with 96 GB per daemon.
General-purpose SQL query engine for analytical and transactional workloads.
Supports high-speed queries with latencies as low as milliseconds (see the sketch below).
Runs directly against HDFS or Kudu.
High-performance engine written in C++.
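
For reference, a point lookup of the kind that returns in milliseconds is an ordinary SQL query through impyla; primary-key predicates let Kudu prune the scan to a few tablets. Table and column names are placeholders.

```python
from impala.dbapi import connect

conn = connect(host='impala-daemon.example.com', port=21050)  # assumed daemon host
cur = conn.cursor()

# Predicates on the Kudu primary key columns are pushed down to Kudu.
cur.execute("""
    SELECT account_id, event_time_secs, channel
    FROM tune_events_kudu
    WHERE account_id = 12345
      AND event_time_secs BETWEEN 1489000000 AND 1489086400
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```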

Impala

Impala

Questions?