Powering real-time analytics on Xfinity using Kudu
Sridhar Alla & Kiran Muglurmath, Comcast
Speakers
Sridhar Alla, Director, Solution Architecture, Comcast. Architecting and building solutions on a big data scale. sridhar_alla@cable.comcast.com
Kiran Muglurmath, Executive Director, Data Science, Comcast. Data science on a big data scale. kiran_muglurmath@cable.comcast.com
Use cases
Update the activity history table with new customer information each month.
Update the viewing history table when the program guide is updated.
Update metadata from upstream systems and models.
Load data into Impala using real-time feeds.
However, Kudu is not a system for OLTP operations, so our use cases do not involve transactions or rollbacks. (An example of this kind of in-place update is sketched below.)
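These use cases boil down to in-place updates issued through Impala against a Kudu-backed table. A minimal sketch, assuming a hypothetical table and columns (customer_profile_kudu, account_id, service_tier, last_refresh_day) that are not part of the deck, and an Impala/Kudu combination where UPDATE on Kudu tables is enabled:

-- Hypothetical monthly refresh: correct one customer's attributes in place,
-- with no CSV regeneration or table rebuild.
UPDATE kudupoc.customer_profile_kudu
SET    service_tier = 'PREFERRED',
       last_refresh_day = '2016-09-01'
WHERE  account_id = '1234567890';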
Current Architecture
MapReduce on 300 nodes running CDH 5.7
HBase 1.2
Impala 2.7 on 300 nodes
Current Architecture
Challenges
Keeping data current required rebuilding entire datasets or partitions by regenerating compressed CSV files and loading them into HDFS, which took hours or days.
Rebuild operations consumed cluster capacity, limiting availability for other teams in a shared cluster.
There was no way to update a single row in the dataset without recreating the table or using a slower, more complicated integration with HBase.
Technologies
Cloudera distribution CDH 5.7
Hadoop MapReduce (Hadoop 2.6)
HBase 1.2
Hive 1.1
Spark 1.6.2
Impala 2.7
Kudu 0.10
Kafka 0.9
BI tools such as Zoomdata and Tableau
Map Reduce
Map Reduce
The most widely used paradigm for distributed processing at PB scale.
Uses mappers and reducers to distribute the processing.
Relies on disk for almost everything, which makes it very scalable.
Map Reduce
Spark
Spark
Spark 1.6.2 using 50 executors/cores of 4 GB each.
Framework for large-scale distributed processing with a large set of operations, transformations, and actions (also supports MapReduce-like functionality).
Can also do batch processing (not as scalable as MapReduce due to memory constraints).
Supports real-time processing through Spark Streaming.
Supports machine learning through Spark ML.
Supports a SQL-like interface through Spark SQL.
Supports graph algorithms such as shortest paths through GraphX.
Spark
Impala
Impala
Impala 2.7 with 96 GB per daemon.
General-purpose SQL query engine for analytical and transactional workloads.
Supports high-speed queries with latencies as low as milliseconds.
Runs directly against Hadoop or Kudu.
High-performance engine written in C++.
Impala
Impala
Challenges
Keeping data current required rebuilding entire datasets or partitions by regenerating compressed CSV files and loading them into HDFS, which took hours or days.
Rebuild operations consumed cluster capacity, limiting availability for other teams in a shared cluster.
There was no way to update a single row in the dataset without recreating the table or using a slower, more complicated integration with HBase.
Someone asked us to try Kudu.
Idea! Let's make analytics on Xfinity faster!
What is Kudu?
The kudu is a sub-species of antelope found inhabiting mixed shrub woodland and savanna plains in eastern and southern Africa. http://a-z-animals.com/animals/kudu/
I meant the "other" Kudu
Complements HDFS and HBase.
Fast inserts/updates and efficient columnar scans enable multiple real-time analytic workloads across a single storage layer.
New Architecture
Stores tune events in Kudu; any data fixes are made directly in Kudu.
Stores metadata directly in Kudu; any data fixes are made directly in Kudu.
Spark Streaming updates Kudu on a real-time basis to support quick analytics.
A MapReduce job reads the raw events, sessionizes them, and updates Kudu (see the sketch below).
BI tools like Zoomdata work directly with Impala or Kudu to enable analytics.
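A minimal sketch of the ingest side of this architecture, in Impala SQL. The tune_events_kudu table, its columns, and the staging.sessionized_tune_events source are assumptions for illustration, not the production schema; the DDL style mirrors the table shown later in this deck.

-- Hypothetical Kudu table holding sessionized tune events.
CREATE TABLE kudupoc.tune_events_kudu (
  device_id STRING,
  session_start_ts BIGINT,
  channel STRING,
  duration_secs BIGINT,
  day STRING
)
DISTRIBUTE BY HASH(device_id) INTO 16 BUCKETS
TBLPROPERTIES (
  'kudu.key_columns'='device_id,session_start_ts',
  'kudu.master_addresses'='KuduMasterIP:7051',
  'storage_handler'='com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name'='tune_events_kudu');

-- Batch path: load already-sessionized events (the output of the sessionization
-- job, staged in a hypothetical Hive table) into Kudu through Impala.
INSERT INTO kudupoc.tune_events_kudu
SELECT device_id, session_start_ts, channel, duration_secs, day
FROM   staging.sessionized_tune_events;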
New Architecture
Kudu
Kudu
5 nodes running Kudu 0.10; the overall cluster has 300 nodes and 12 PB.
memory_limit_hard_bytes is set to 4 GB.
block_cache_capacity_mb is set to 512 MB.
Impala 2.7 runs on the same 5 nodes for data locality, with 96 GB of memory per daemon.
A separate Impala instance runs on the 300 nodes with a shared metastore.
Kudu
Using Kudu
Finite table schema (unlike HBase/columnar NoSQL).
Primary key clearly identifiable.
Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes, e.g.:
Time series. Examples: streaming market data; fraud detection and prevention; risk monitoring. Workload: inserts, updates, scans, lookups.
Machine data analytics. Examples: network threat detection. Workload: inserts, scans, lookups.
Online reporting. Examples: ODS. Workload: inserts, updates, scans, lookups.
Kudu Table Creation

HIVE
CREATE EXTERNAL TABLE rcr.monthly_data (
  sub_type STRING,
  sub_acct_no STRING,
  customer_account_id STRING,
  report_name STRING,
  gl_account STRING,
  report_date TIMESTAMP,
  service_code STRING,
  trans_code STRING,
  totalrevenue DECIMAL(38,10),
  bulk_type STRING,
  spa STRING,
  recordcount BIGINT,
  customer_mix STRING,
  accounting_period INT,
  prod_name STRING
)
PARTITIONED BY (day STRING, region STRING)
WITH SERDEPROPERTIES ('serialization.format'='1')
STORED AS PARQUET
LOCATION 'hdfs://hadoop/user/hive/warehouse/rcr.db/monthly_data'

Kudu (Impala)
CREATE TABLE kudupoc.rcr_monthly_data_kudu (
  customer_account_id STRING,
  day STRING,
  sub_type STRING,
  sub_acct_no STRING,
  report_name STRING,
  gl_account STRING,
  report_date STRING,
  service_code STRING,
  trans_code STRING,
  totalrevenue DOUBLE,
  bulk_type STRING,
  spa STRING,
  recordcount BIGINT,
  customer_mix STRING,
  accounting_period INT,
  prod_name STRING,
  region STRING
)
DISTRIBUTE BY HASH(customer_account_id, day) INTO 16 BUCKETS
TBLPROPERTIES (
  'kudu.key_columns'='customer_account_id,day',
  'kudu.master_addresses'='KuduMasterIP:7051',
  'storage_handler'='com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name'='rcr_monthly_data_kudu')
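The existing Parquet table can then be backfilled into the Kudu table with an Impala INSERT ... SELECT. This is a sketch rather than the deck's actual load job; note the casts from the Hive table's TIMESTAMP and DECIMAL columns to the Kudu table's STRING and DOUBLE columns.

-- Hypothetical backfill of the Kudu table from the Parquet-backed Hive table.
-- Columns are listed in the Kudu table's declared order.
INSERT INTO kudupoc.rcr_monthly_data_kudu
SELECT customer_account_id,
       day,
       sub_type,
       sub_acct_no,
       report_name,
       gl_account,
       CAST(report_date AS STRING),
       service_code,
       trans_code,
       CAST(totalrevenue AS DOUBLE),
       bulk_type,
       spa,
       recordcount,
       customer_mix,
       accounting_period,
       prod_name,
       region
FROM   rcr.monthly_data;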
Kudu
Zoomdata
Observations
Sample table in Kudu (4 nodes + 1 master): 1.06B rows, 40 columns of mixed datatypes, primary key (LONG INT).
Update 80 records by referring to the primary key: 0.25 seconds.
Update 600 records by referring to non-primary-key columns: 12 seconds.
Drop 81 records by referring to the primary key: 0.15 seconds.
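These timings correspond to Impala UPDATE/DELETE statements against the Kudu table of roughly the following shape; the table name, key column, filter columns, and literal values here are placeholders, not the actual test table.

-- Update ~80 records by primary key (hypothetical table and columns).
UPDATE kudupoc.sample_kudu_table
SET    status = 'INACTIVE'
WHERE  id IN (1001, 1002, 1003 /* ... remaining keys ... */);

-- Update ~600 records by non-primary-key columns (scans those columns).
UPDATE kudupoc.sample_kudu_table
SET    status = 'INACTIVE'
WHERE  region = 'NE' AND sub_type = 'BULK';

-- Drop ~81 records by primary key.
DELETE FROM kudupoc.sample_kudu_table
WHERE  id IN (2001, 2002, 2003 /* ... remaining keys ... */);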
Observations – query speed
32MM records per second; 0.03 microseconds per record; 30 milliseconds per million.
Query:
select count(*) as `count`, ds.`day` as `day`
from kudupoc.`rcr_monthly_data_kudu` as ds
group by ds.`day`
Observations – duplicate table
2MM records per second; 0.5 microseconds per record; 500 milliseconds per million.
Query:
create table rcrtest1 stored as parquet as
select * from rcr_monthly_data_kudu;
Next Steps – Scaling up
Scale to 480 nodes, ~20 PB.
Insert all 600+ billion events into Kudu.
Use Spark ML on Kudu queries to build models using the connectors.
Questions?