Building realtime BI systems with HDFS and Kudu

Presentation transcript:

Building realtime BI systems with HDFS and Kudu

Drivers for Streaming Data

- Data Freshness: our customers need to look at the freshest data possible. They don't want to wait for an ETL job to run at night.
- Time to Analytic: don't wait to start visualizing; combine historical and streaming data together.
- Business Context: can I look at the stream alone, or do I need to see it in relation to historical events? Streaming data has a distinct context of time (insertion time, time of event, or other). Working with streaming data requires an interface that lets analysts navigate through time while keeping context about where they are and what they are looking at.

Streaming Data @ Zoomdata

- Visualizations react to new data as it is delivered.
- Users select a rolling window, or pin a start time to capture cumulative metrics (see the sketch below).
- Users can start, stop, and pause the stream.
- How we think of streaming: contrast realtime with near-realtime; think of it as a data DVR over an append-only stream.
- Challenges: pause, play, rewind, restatement, context.
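A rolling window or a pinned start time maps naturally onto a time predicate over the stream. A minimal Impala-style sketch (my illustration, not from the deck), assuming a hypothetical events table with a TIMESTAMP column event_ts:

-- Rolling window: aggregate only the last 15 minutes of events
SELECT fruit, COUNT(*) AS orders, SUM(price) AS revenue
FROM events
WHERE event_ts >= now() - INTERVAL 15 MINUTES
GROUP BY fruit;

-- Pinned start time: cumulative metrics since a fixed point
SELECT fruit, COUNT(*) AS orders, SUM(price) AS revenue
FROM events
WHERE event_ts >= '2016-06-01 00:00:00'
GROUP BY fruit;

Pausing or rewinding the "data DVR" then just means changing the time predicate, not touching the ingest pipeline.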

Typical Streaming Architectures

Event → queue (Kafka, JMS, RabbitMQ, etc.) → stream processor (Storm, Spark Streaming, Flink, etc.) → now what?

- HDFS: but what about query?
- Cassandra: no aggregation.
- Lambda: let's take a look at that for a sec.

Lambda

- Stream data to HDFS; keep the recent data in Avro.
- Do your own compactions to Parquet / ORC.
- Expose via Impala, Spark SQL, or other engines.
- Keep a hot Avro partition (for speed) with history in Parquet; make the schemas union compatible and project them as a single table via a view (see the sketch below).
- Works OK... but you are still doing a lot of manual data management. Oh, and what happens to non-commutative operations like distinct count?
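A minimal Impala sketch of the "single table via view" trick, assuming hypothetical orders_avro (hot, append-only) and orders_parquet (compacted history) tables with union-compatible schemas:

-- Analysts query one logical table; the Avro/Parquet split is hidden by the view.
CREATE VIEW orders AS
SELECT order_id, order_ts, fruit, price FROM orders_parquet
UNION ALL
SELECT order_id, order_ts, fruit, price FROM orders_avro;

This works fine for additive aggregates, but a distinct count cannot be precomputed per side and then combined, which is exactly the non-commutative problem called out above.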

Works OK... until you wish you could change the past.

Restatements… yeah, we went there

Txn ID | Item  | Price | Quantity | Partition
1      | Jeans | 25.00 | 2        | 2016-05-30
2      | Shirt | 10.00 |          | 2016-05-31
3      | Skirt | 20.00 |          | 2016-06-01

Restatement:
1      | Jeans | 25.00 | 3        |

Restatements… how you do it

Txn ID | Item  | Price | Quantity | Partition
1      | Jeans | 25.00 | 2        | 2016-05-30
2      | Shirt | 10.00 |          | 2016-05-31
3      | Skirt | 20.00 |          | 2016-06-01

Restatement:
1      | Jeans | 25.00 | 3        |

General algorithm:
1. Figure out which partition(s) are affected.
2. Recompute the affected partition(s) with the restated data.
3. Drop/replace the existing partition(s) with the new data (see the sketch below).

What if my restatement spans across partitions?
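A hedged Impala sketch of steps 2 and 3, with hypothetical table and column names: the restated rows sit in a restatements staging table, and orders is partitioned by part_day:

-- Step 2: recompute the affected day, preferring a restated row when one exists
CREATE TABLE orders_restated AS
SELECT o.txn_id, o.item, o.price,
       COALESCE(r.quantity, o.quantity) AS quantity
FROM orders o
LEFT JOIN restatements r ON o.txn_id = r.txn_id
WHERE o.part_day = '2016-05-30';

-- Step 3: drop/replace the existing partition with the recomputed data
INSERT OVERWRITE orders PARTITION (part_day = '2016-05-30')
SELECT txn_id, item, price, quantity FROM orders_restated;

A restatement that spans partitions forces you to repeat this for every affected partition, which is why the next slide matters.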

Enter Kudu

What is Kudu? "Kudu is an open source storage engine for structured data which supports low-latency random access together with efficient analytical access patterns." (Source: http://getkudu.io/kudu.pdf)

Why do you care? It makes management of streaming data for ad-hoc analysis MUCH easier, and it bridges the mental gap from random access to append only.

Why does Zoomdata care?

Impala + Kudu: Performance

- Nearly the same performance as Parquet for many similar workloads.
- Column-oriented storage format.
- Simplified data management model.
- Can handle a new class of streaming use cases and workloads.

Great… let's just use Kudu from now on:

- We can ingest data with great write throughput.
- Support analytic queries.
- Support random-access writes (see the sketch below).

What's not to love? Ship it!
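The random-access write path is what makes restatements cheap: instead of recomputing and swapping a partition, you update the row in place. A sketch against a hypothetical orders_kudu table keyed on txn_id, assuming an Impala build with Kudu UPDATE support:

-- Streaming ingest and restatement hit the same table
-- (columns: txn_id, item, price, quantity, sale_date)
INSERT INTO orders_kudu VALUES (4, 'Mango', 5.00, 1, '2016-06-02');

-- The Jeans restatement from earlier becomes a single-row, in-place update
UPDATE orders_kudu SET quantity = 3 WHERE txn_id = 1;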

There's a catch... it's your data model.

Good news! If you have figured this out with HDFS and Parquet, you're not too far off. Things to consider:

- Access pattern and partition scheme (similar to partitioning data in Parquet): it has a big role to play in the parallelism of your queries.
- Cardinality of your attributes: this affects what type of column encoding you decide to use.
- Key structure: you get only one, use it wisely.
- Concurrency and replication factor: more workload can be distributed across a higher number of nodes.

More on these topics can be found at: http://getkudu.io/docs/schema_design.html

Let's put it all together

- I have a fruit stand.
- I sell my fruits via phone order to remote buyers.
- My transactions look something like:

Orders(orderID, orderTS, fruit, price, customerID, customerPhone, customerAddress)

Impala DDL for Kudu

CREATE EXTERNAL TABLE `strata_fruits_expanded` (
  `_ts` BIGINT,
  `_id` STRING,
  `fruit` STRING,
  `country_code` STRING,
  `country_area_code` STRING,
  `phone_num` STRING,
  `message_date` BIGINT,
  `price` FLOAT,
  `keyword` STRING
)
DISTRIBUTE BY HASH (_ts) INTO 60 BUCKETS
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'strata_fruits_expanded',
  'kudu.master_addresses' = '10.xxx.xxx.xxx:7051',
  'kudu.key_columns' = '_ts, _id'
);

- Key: `_ts` and `_id` together form the table's key.
- Attributes: low cardinality attributes -- things I want to group by -- are great candidates for dictionary encoding.
- Partition scheme: how you distribute your data directly impacts your ability to process in parallel, as well as any predicate push-down type of operations Kudu can perform (example query below). For large tables, such as fact tables, aim for as many tablets as you have cores in the cluster -- but figure out what else you are running as well.
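As a usage sketch (my example, not from the deck), a typical dashboard query filters on the timestamp key and groups by a low-cardinality attribute; Kudu pushes the _ts predicate down to its scanners, and fruit is the kind of column that benefits from dictionary encoding:

SELECT fruit, COUNT(*) AS orders, AVG(price) AS avg_price
FROM strata_fruits_expanded
WHERE _ts >= 1464566400000   -- epoch millis, e.g. 2016-05-30 00:00:00 UTC
GROUP BY fruit
ORDER BY orders DESC;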

Let's see it in action… Demo

The generic pipeline (Event → Kafka, JMS, RabbitMQ, etc. → Storm, Spark Streaming, Flink, etc. → now what?) becomes:

Order → Kafka → StreamSets → Kudu
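Because StreamSets writes straight into the Kudu table, new orders are queryable from Impala as soon as they land. A hedged sketch of watching fresh rows arrive during the demo:

-- Count the orders that arrived in the last 60 seconds (epoch millis)
SELECT COUNT(*) AS recent_orders
FROM strata_fruits_expanded
WHERE _ts >= unix_timestamp(now()) * 1000 - 60000;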

Special Thanks

- Anton Gorshkov: for his original streaming-with-Kafka fruit stand demo.
- The Cloudera Kudu team: specifically Todd Lipcon, for all the insight into Kudu optimization.
- Nexmo: for use of their SaaS SMS service in this demo.
- StreamSets: for their dead-simple way to hook up a data pipeline from Kafka into Kudu.

Questions

For more info, come visit us in booth K222.
For more information, contact: ruhollah@zoomdata.com