Powering real-time analytics on Xfinity using Kudu
Sridhar Alla & Kiran Muglurmath, Comcast
Speakers
Sridhar Alla, Director, Solution Architecture, Comcast. Architecting and building solutions on a big data scale. sridhar_alla@cable.comcast.com
Kiran Muglurmath, Executive Director, Data Science, Comcast. Data science on a big data scale. kiran_muglurmath@cable.comcast.com
Use cases
Update the activity history table with new customer information each month.
Update the viewing history table when the program guide is updated.
Update metadata from upstream systems and models.
Load data into Impala using real-time feeds.
However, Kudu is not a system for OLTP operations, so our use cases do not involve transactions or rollbacks. (An example of this kind of in-place update is sketched below.)
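These use cases boil down to in-place updates issued through Impala against a Kudu-backed table. A minimal sketch, assuming a hypothetical table and columns (customer_profile_kudu, account_id, service_tier, last_refresh_day) that are not part of the deck, and an Impala/Kudu combination where UPDATE on Kudu tables is enabled:

-- Hypothetical monthly refresh: correct one customer's attributes in place,
-- with no CSV regeneration or table rebuild.
UPDATE kudupoc.customer_profile_kudu
SET    service_tier = 'PREFERRED',
       last_refresh_day = '2016-09-01'
WHERE  account_id = '1234567890';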
Current Architecture
MapReduce on 300 nodes running CDH 5.7
HBase 1.2
Impala 2.7 on 300 nodes
Current Architecture
Challenges
Keeping data current required rebuilding entire datasets or partitions by regenerating compressed CSV files and loading them into HDFS, which took hours or days.
Rebuild operations consumed cluster capacity, limiting availability for other teams in a shared cluster.
There was no way to update a single row in the dataset without recreating the table or using a slower, more complicated integration with HBase.
Technologies
Cloudera distribution CDH 5.7
Hadoop MapReduce (Hadoop 2.6)
HBase 1.2
Hive 1.1
Spark 1.6.2
Impala 2.7
Kudu 0.10
Kafka 0.9
BI tools such as Zoomdata and Tableau
Map Reduce
Map Reduce
The most widely used paradigm for distributed processing at PB scale.
Uses mappers and reducers to distribute the processing.
Relies on disk for almost everything, which makes it very scalable.
Map Reduce
Spark
Spark
Spark 1.6.2 using 50 executors/cores of 4 GB each.
Framework for large-scale distributed processing with a large set of operations, transformations, and actions (also supports MapReduce-like functionality).
Can also do batch processing (not as scalable as MapReduce due to memory constraints).
Supports real-time processing through Spark Streaming.
Supports machine learning through Spark ML.
Supports a SQL-like interface through Spark SQL.
Supports graph algorithms such as shortest paths through GraphX.
Spark
Impala
Impala
Impala 2.7 with 96 GB per daemon.
General-purpose SQL query engine for analytical and transactional workloads.
Supports high-speed queries with latencies as low as milliseconds.
Runs directly against Hadoop or Kudu.
High-performance engine written in C++.
Impala
Impala
Challenges
Keeping data current required rebuilding entire datasets or partitions by regenerating compressed CSV files and loading them into HDFS, which took hours or days.
Rebuild operations consumed cluster capacity, limiting availability for other teams in a shared cluster.
There was no way to update a single row in the dataset without recreating the table or using a slower, more complicated integration with HBase.
Someone asked us to try Kudu.
Idea! Let's make analytics on Xfinity faster!
What is Kudu?
The kudu is a sub-species of antelope found inhabiting mixed shrub woodland and savanna plains in eastern and southern Africa. http://a-z-animals.com/animals/kudu/
I meant the "other" Kudu
Complements HDFS and HBase.
Fast inserts/updates and efficient columnar scans enable multiple real-time analytic workloads across a single storage layer.
New Architecture
Stores tune events in Kudu; any data fixes are made directly in Kudu.
Stores metadata directly in Kudu; any data fixes are made directly in Kudu.
Spark Streaming updates Kudu on a real-time basis to support quick analytics.
A MapReduce job reads the raw events, sessionizes them, and updates Kudu (see the sketch below).
BI tools like Zoomdata work directly with Impala or Kudu to enable analytics.
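A minimal sketch of the ingest side of this architecture, in Impala SQL. The tune_events_kudu table, its columns, and the staging.sessionized_tune_events source are assumptions for illustration, not the production schema; the DDL style mirrors the table shown later in this deck.

-- Hypothetical Kudu table holding sessionized tune events.
CREATE TABLE kudupoc.tune_events_kudu (
  device_id STRING,
  session_start_ts BIGINT,
  channel STRING,
  duration_secs BIGINT,
  day STRING
)
DISTRIBUTE BY HASH(device_id) INTO 16 BUCKETS
TBLPROPERTIES (
  'kudu.key_columns'='device_id,session_start_ts',
  'kudu.master_addresses'='KuduMasterIP:7051',
  'storage_handler'='com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name'='tune_events_kudu');

-- Batch path: load already-sessionized events (the output of the sessionization
-- job, staged in a hypothetical Hive table) into Kudu through Impala.
INSERT INTO kudupoc.tune_events_kudu
SELECT device_id, session_start_ts, channel, duration_secs, day
FROM   staging.sessionized_tune_events;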
New Architecture
Kudu
Kudu
5 nodes running Kudu 0.10; the overall cluster has 300 nodes and 12 PB.
memory_limit_hard_bytes is set to 4 GB.
block_cache_capacity_mb is set to 512 MB.
Impala 2.7 runs on the same 5 nodes for data locality, with 96 GB of memory per daemon.
A separate Impala instance runs on the 300 nodes with a shared metastore.
Kudu
Using Kudu
Finite table schema (unlike HBase/columnar NoSQL).
Primary key clearly identifiable.
Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes, e.g.:
Time series. Examples: streaming market data; fraud detection and prevention; risk monitoring. Workload: inserts, updates, scans, lookups.
Machine data analytics. Examples: network threat detection. Workload: inserts, scans, lookups.
Online reporting. Examples: ODS. Workload: inserts, updates, scans, lookups.
Kudu Table Creation

HIVE
CREATE EXTERNAL TABLE rcr.monthly_data (
  sub_type STRING,
  sub_acct_no STRING,
  customer_account_id STRING,
  report_name STRING,
  gl_account STRING,
  report_date TIMESTAMP,
  service_code STRING,
  trans_code STRING,
  totalrevenue DECIMAL(38,10),
  bulk_type STRING,
  spa STRING,
  recordcount BIGINT,
  customer_mix STRING,
  accounting_period INT,
  prod_name STRING
)
PARTITIONED BY (day STRING, region STRING)
WITH SERDEPROPERTIES ('serialization.format'='1')
STORED AS PARQUET
LOCATION 'hdfs://hadoop/user/hive/warehouse/rcr.db/monthly_data'

Kudu (Impala)
CREATE TABLE kudupoc.rcr_monthly_data_kudu (
  customer_account_id STRING,
  day STRING,
  sub_type STRING,
  sub_acct_no STRING,
  report_name STRING,
  gl_account STRING,
  report_date STRING,
  service_code STRING,
  trans_code STRING,
  totalrevenue DOUBLE,
  bulk_type STRING,
  spa STRING,
  recordcount BIGINT,
  customer_mix STRING,
  accounting_period INT,
  prod_name STRING,
  region STRING
)
DISTRIBUTE BY HASH(customer_account_id, day) INTO 16 BUCKETS
TBLPROPERTIES (
  'kudu.key_columns'='customer_account_id,day',
  'kudu.master_addresses'='KuduMasterIP:7051',
  'storage_handler'='com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name'='rcr_monthly_data_kudu')
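The existing Parquet table can then be backfilled into the Kudu table with an Impala INSERT ... SELECT. This is a sketch rather than the deck's actual load job; note the casts from the Hive table's TIMESTAMP and DECIMAL columns to the Kudu table's STRING and DOUBLE columns.

-- Hypothetical backfill of the Kudu table from the Parquet-backed Hive table.
-- Columns are listed in the Kudu table's declared order.
INSERT INTO kudupoc.rcr_monthly_data_kudu
SELECT customer_account_id,
       day,
       sub_type,
       sub_acct_no,
       report_name,
       gl_account,
       CAST(report_date AS STRING),
       service_code,
       trans_code,
       CAST(totalrevenue AS DOUBLE),
       bulk_type,
       spa,
       recordcount,
       customer_mix,
       accounting_period,
       prod_name,
       region
FROM   rcr.monthly_data;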
Kudu
Zoomdata
Observations
Sample table in Kudu (4 nodes + 1 master): 1.06B rows, 40 columns of mixed datatypes, primary key (LONG INT).
Update 80 records by referring to the primary key: 0.25 seconds.
Update 600 records by referring to non-primary-key columns: 12 seconds.
Drop 81 records by referring to the primary key: 0.15 seconds.
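These timings correspond to Impala UPDATE/DELETE statements against the Kudu table of roughly the following shape; the table name, key column, filter columns, and literal values here are placeholders, not the actual test table.

-- Update ~80 records by primary key (hypothetical table and columns).
UPDATE kudupoc.sample_kudu_table
SET    status = 'INACTIVE'
WHERE  id IN (1001, 1002, 1003 /* ... remaining keys ... */);

-- Update ~600 records by non-primary-key columns (scans those columns).
UPDATE kudupoc.sample_kudu_table
SET    status = 'INACTIVE'
WHERE  region = 'NE' AND sub_type = 'BULK';

-- Drop ~81 records by primary key.
DELETE FROM kudupoc.sample_kudu_table
WHERE  id IN (2001, 2002, 2003 /* ... remaining keys ... */);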
Observations – query speed
32MM records per second; 0.03 microseconds per record; 30 milliseconds per million.
Query:
select count(*) as `count`, ds.`day` as `day`
from kudupoc.`rcr_monthly_data_kudu` as ds
group by ds.`day`
Observations – duplicate table
2MM records per second; 0.5 microseconds per record; 500 milliseconds per million.
Query:
create table rcrtest1 stored as parquet as
select * from rcr_monthly_data_kudu;
Next Steps – Scaling up
Scale to 480 nodes, ~20 PB.
Insert all 600+ billion events into Kudu.
Use Spark ML on Kudu queries to build models using the connectors.
Questions?