Real-time analytics using Kudu at petabyte scale Sridhar Alla & Shekhar Agrawal Comcast
Speakers
  Sridhar Alla, Director, Data Engineering, Comcast. Engineering on a big data scale. sridhar_alla@cable.comcast.com
  Shekhar Agrawal, Director, Data Science, Comcast. Data science on a big data scale. shekhar_agrawal@cable.comcast.com
Use cases
  Update the activity history table with new customer information each month
  Update the viewing history table when the program guide is updated
  Update metadata from upstream systems and models
  Load data into Impala using real-time feeds
  Kudu is not an OLTP system, however, so none of our use cases rely on transactions or rollbacks
Old Architecture
  MapReduce on 480 nodes running CDH 5.8
  Spark 2.1.0 on 480 nodes
  HBase 1.2 on 100 nodes
  Impala 2.7 on 420 nodes
Old Architecture
Challenges
  Keeping data current meant rebuilding entire datasets or partitions by regenerating compressed CSV files and loading them into HDFS, which took several hours or even days.
  Rebuild operations consumed cluster capacity, limiting availability to other teams on the shared cluster.
  There was no way to update a single row without recreating the table or using a slower, more complicated integration with HBase.
New Architecture
  Tune events are stored in Kudu. Any data fixes are made directly in Kudu.
  Metadata is stored directly in Kudu. Any data fixes are made directly in Kudu.
  Spark Streaming updates Kudu in real time to support quick analytics.
  A Spark job reads the raw events, sessionizes them, and updates Kudu.
  BI tools such as Zoomdata work directly with Impala or Kudu to enable analytics.
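
The slides do not show how the Spark job sessionizes tune events. As a rough, single-process sketch of the idea (not the authors' Spark code), the snippet below groups events per account and starts a new session whenever the gap between consecutive events exceeds a threshold. The `TuneEvent` and `Session` types, the field names, and the 30-minute gap are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Assumption: events more than 30 minutes apart belong to different sessions.
SESSION_GAP_SECONDS = 30 * 60


@dataclass
class TuneEvent:
    account_id: str  # hypothetical field names
    channel: str
    ts: int  # event time, seconds since the epoch


@dataclass
class Session:
    account_id: str
    start_ts: int
    end_ts: int
    event_count: int


def sessionize(events: Iterable[TuneEvent],
               gap: int = SESSION_GAP_SECONDS) -> List[Session]:
    """Group events into per-account sessions separated by gaps > `gap`."""
    sessions: List[Session] = []
    latest: Dict[str, Session] = {}  # most recent session per account
    for ev in sorted(events, key=lambda e: (e.account_id, e.ts)):
        current = latest.get(ev.account_id)
        if current is not None and ev.ts - current.end_ts <= gap:
            # Close enough to the previous event: extend the session.
            current.end_ts = ev.ts
            current.event_count += 1
        else:
            # First event for the account, or the gap was exceeded.
            current = Session(ev.account_id, ev.ts, ev.ts, 1)
            latest[ev.account_id] = current
            sessions.append(current)
    return sessions
```

In a pipeline like the one described, each resulting session row would then be written to Kudu, keyed so that a later run can overwrite it when late events arrive.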
New Architecture
Kudu at a high level
  Why another data store?
  Tables/schemas
  Reads/writes
  Consistency model
  Architecture (master/tablet)
  Partitioning
  Replication via Raft
Notes
  Why Kudu: HBase offers low-latency random access and HDFS offers sequential-access storage. Kudu fills the gap between them.
  Tables/schemas: A primary key is required and there are no secondary indexes. Schemas are explicit: declared column types help with encoding and expose SQL-like metadata to BI tools and other clients.
  Writes: A primary key is required. There is no multi-row transactional API; each row operation is treated as an individual transaction, although operations are batched for efficiency.
  Reads: Scans support comparison predicates and key ranges. Data is stored on disk in columnar form, so reading only a few columns is far more efficient than reading whole rows.
  Consistency: (a) Snapshot consistency: a single client gets a read-your-writes guarantee, but causal dependence across clients (external consistency) is not guaranteed, as in HBase. If client A performs write a and client B then performs write b, a third reader may see b without a. Kudu does allow timestamp tokens to be propagated manually from one client to another. (b) Spanner-style consistency: with commit-wait enabled, the client waits until it is certain that any causally preceding writes are accounted for, at the cost of significantly higher write latency.
  Timestamps: Writes cannot specify a timestamp; Kudu assigns it. Reads can specify one, which enables point-in-time queries and ensures that the distributed tasks making up a single query read a consistent snapshot.
  Architecture: A single master manages the metadata and the tablet servers. The master can be replicated for fault tolerance and quick recovery on failover. Tablet servers hold the data, and each machine hosts multiple tablets.
  Partitioning: Tables are partitioned horizontally, with multilevel partitioning supported. A partition schema contains zero or more hash partitions followed by an optional range partition. For example, for a table holding a large number of time-series events per key, hash-partition on the key and range-partition on the timestamp. Choosing the right partition scheme lets users trade off query parallelism against concurrency for their workload.
  Replication: A write first locates the leader replica. If that replica is still the leader, it writes the operation to its local log and replicates it to the follower replicas via Raft until it is committed (Raft commits an entry once a majority of replicas have persisted it). If the replica is no longer the leader, it rejects the request, causing the client to refresh its metadata cache and resend the write.
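
The hash-then-range partitioning described in the notes can be pictured with a small stand-alone sketch. It is illustrative only: `zlib.crc32` stands in for Kudu's internal hash function, and the function name, arguments, and tablet numbering are assumptions, not Kudu's API.

```python
import bisect
import zlib
from typing import List


def tablet_for_row(key: str, ts: int, hash_buckets: int,
                   range_bounds: List[int]) -> int:
    """Map a row to a tablet index under hash(key) x range(ts) partitioning.

    range_bounds holds sorted split points: N split points define N + 1
    ranges, each with an inclusive lower bound. The total number of
    tablets is hash_buckets * (len(range_bounds) + 1).
    """
    # Hash level: spreads keys evenly, so concurrent writes hit many tablets.
    bucket = zlib.crc32(key.encode("utf-8")) % hash_buckets
    # Range level: keeps each time window in its own group of tablets.
    range_idx = bisect.bisect_right(range_bounds, ts)
    return range_idx * hash_buckets + bucket
```

With 4 hash buckets and split points `[100, 200]` there are 12 tablets. Writes for the current time range are spread over 4 tablets, while a scan for one key within one time range touches a single tablet.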
Using Kudu
  Finite table schema (unlike HBase and other columnar NoSQL stores)
  Primary key clearly identifiable
  Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes, e.g.:
  Time series
    Examples: streaming market data; fraud detection and prevention; risk monitoring
    Workload: inserts, updates, scans, lookups
  Machine data analytics
    Examples: network threat detection
    Workload: inserts, scans, lookups
  Online reporting
    Examples: ODS
    Workload: inserts, updates, scans, lookups
Kudu
Kudu at Comcast
  410-node Apache Kudu cluster (co-located with HDFS)
  memory_limit_hard_bytes is set to 4 GB
  block_cache_capacity_mb is set to 512 MB
  Migrate a Parquet table in HDFS to Kudu
    Table has ~300 columns (note: Kudu currently supports a maximum of 300 columns per table)
    ~500 million records are inserted every day
    3-year archive (~600 billion records)
    Average row size of the table is 2 KB
Kudu
Schema design – Data types
  Column data types of the source table: bigint, decimal(38,10), double, int, string, timestamp
  Kudu table data types:
    bigint stored as int64
    decimal stored as double
    int stored as int32
    string stored as string
    timestamp stored as unixtime_micros
Schema design – Column encoding
  Dictionary encoding was used for string columns.
  Bitshuffle encoding was used for ints and doubles.
  Encoding types available in Kudu:
    Column type               Encoding
    int8, int16, int32        plain, bitshuffle, run length
    int64, unixtime_micros    plain, bitshuffle
    float, double             plain, bitshuffle
    bool                      plain, run length
    string, binary            plain, prefix, dictionary
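
To give a feel for why dictionary encoding suits repetitive string columns and run-length encoding suits columns with long runs of the same value, here is a toy sketch of both ideas. It is not Kudu's on-disk format; the function names and sample values are made up for illustration.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with a small integer code into a dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        if v not in index:
            # First time this value is seen: assign the next code.
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def run_length_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs
```

A column with few distinct strings shrinks to a short dictionary plus a list of small integers, which is why dictionary encoding is a natural fit for the low-cardinality string columns mentioned above.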
Table of 600 Billion Rows X 230 columns
Schema design – Partitioning
  Design the partition scheme based on available server capacity.
  Each partition is a tablet managed by a tablet server.
  Design partitions so that many tablets are engaged during writes.
  Keep the amount of data per tablet small enough to support full scans when needed.
  Multilevel hash partitioning was used:
    Top-level hash was based on day and account id
    Child-level hash was based on the scan requirements of the queries
  4000 partitions were created (~150 million records per tablet)
  Each tablet manages ~64 GB (with encoding and compression)
  10 tablets per tablet server
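
The sizing figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the talk (600 billion rows, 4000 tablets, 410 nodes, 2 KB average row, ~64 GB per tablet). Treating 1 KB as 1024 bytes and reading the 2 KB figure as the uncompressed row size are assumptions.

```python
total_rows = 600e9               # ~600 billion records in the archive
tablets = 4000                   # partitions created
tablet_servers = 410             # nodes in the Kudu cluster
avg_row_bytes = 2 * 1024         # 2 KB average row (assumed uncompressed)
tablet_bytes = 64 * 1024 ** 3    # ~64 GB per tablet after encoding/compression

rows_per_tablet = total_rows / tablets
tablets_per_server = tablets / tablet_servers
raw_bytes_per_tablet = rows_per_tablet * avg_row_bytes
# Ratio of raw row bytes to stored bytes implied by the slide's figures.
implied_reduction = raw_bytes_per_tablet / tablet_bytes

print(f"rows per tablet:    {rows_per_tablet:,.0f}")
print(f"tablets per server: {tablets_per_server:.1f}")
print(f"implied reduction:  {implied_reduction:.1f}x")
```

The results line up with the slide: 150 million rows per tablet and just under 10 tablets per server. Under the stated assumptions, encoding and compression would be shrinking the data by roughly 4 to 5 times.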
Data migration
  Source data was in Parquet format in HDFS
  Decimals were converted to double using the cast function
  Timestamps were converted to int using the unix_timestamp function
  Migration was tested using Apache Impala (incubating) and Apache Spark
  Spark is recommended for large inserts to ensure failures are handled
  Impala was ~40% faster than Spark for inserts
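
The two conversions have consequences worth seeing concretely. The sketch below is plain Python rather than the Impala SQL used in the migration. It shows that a decimal(38,10) value can lose digits when stored as a double, and how a timestamp maps to microseconds since the Unix epoch, the unit Kudu's unixtime_micros type uses. Impala's unix_timestamp returns whole seconds, so a scale factor would presumably be needed; the slides do not say how that was handled. The function names are hypothetical.

```python
from datetime import datetime, timezone
from decimal import Decimal

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decimal_to_double(value: Decimal) -> float:
    """Convert a decimal to a 64-bit float (about 15-17 significant digits)."""
    return float(value)


def timestamp_to_unixtime_micros(ts: datetime) -> int:
    """Convert a datetime to integer microseconds since the Unix epoch."""
    if ts.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    delta = ts - EPOCH
    # Integer arithmetic avoids float rounding on large values.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


wide = Decimal("12345678901234567890.1234567890")
print(Decimal(decimal_to_double(wide)) == wide)  # False: precision was lost
print(timestamp_to_unixtime_micros(datetime(2017, 1, 1, tzinfo=timezone.utc)))
```

Values that need all 38 digits of precision, such as exact monetary amounts, would not survive the cast unchanged, so it is worth checking which columns can tolerate the approximation.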
Hardware design considerations
  The Kudu cluster was co-located with HDFS
  Use static resource allocation so that YARN, Impala and Kudu each have proper resources available
  Ensure the disks that store write-ahead logs (WAL) are not a bottleneck
    Ensure upwards of 150 MB/sec and 1000 IOPS
  KUDU-1835 – Enable WAL compression. Resolved upstream and expected to be available in Kudu 1.3.
Update Benchmarks
  Benchmark 1 – Table memory changes: 600K/sec
    Tablet server memory was set to 8 GB.
  Benchmark 2 – Iteration of tuning (best): ~1 million/sec
    Increasing the size of the maintenance manager thread pool helped.
    Changing multiple other parameters, including experimental parameters, did not help.
  Benchmark 3 – Change table partitions: ~2.5 million/sec
    Changed the table partitions so that inserts are now spread across 8000 partitions.
    Note: CPU, memory and disk need to support 20 tablets per tablet server.
WAL was the bottleneck...
  A single disk was not able to support the disk I/O requirement of the WAL.
  To support high write throughput, the WAL disks should be designed to support >1000 IOPS (writes).
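
To put the three throughput figures in perspective, the sketch below estimates how long it would take to load the ~600 billion row archive at each rate. It assumes the benchmark numbers are rows per second sustained across the whole cluster, which the slide does not state explicitly. The function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def backfill_days(total_rows: float, rows_per_sec: float) -> float:
    """Days needed to write total_rows at a sustained rows_per_sec."""
    return total_rows / rows_per_sec / SECONDS_PER_DAY


archive_rows = 600e9  # ~600 billion records, from the cluster overview slide
for label, rate in [("Benchmark 1", 600e3),
                    ("Benchmark 2", 1e6),
                    ("Benchmark 3", 2.5e6)]:
    print(f"{label}: {backfill_days(archive_rows, rate):.1f} days")
```

Under that assumption, the move from 4000 to 8000 partitions is the difference between a backfill of about a week and one of under three days.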
Query Operations
Delete Operations
Impala
  Impala 2.7 with 96 GB per daemon.
  General-purpose SQL query engine for analytical and transactional workloads.
  Supports high-speed queries that take as little as milliseconds.
  Runs directly against HDFS or Kudu.
  High-performance engine written in C++.
Impala
Questions?