Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Design Patterns For Real Time Streaming Analytics 19 Feb 2015 Sheetal Dolas Principal Architect,

Presentation transcript:

Page 1: Design Patterns For Real Time Streaming Analytics, 19 Feb 2015
Sheetal Dolas, Principal Architect, Hortonworks

Page 2: Who am I?
Principal Architect at Hortonworks
Most of my career has been in the field, solving real-life business problems
Last 5+ years in Big Data, including Hadoop, Storm etc.
Co-developed Cisco OpenSOC (

Page 3: Agenda
Streaming Architectural Patterns - Overview
Design Patterns
o What
o Why
o Illustrations
Q&A

Page 4: Streaming Architectural Patterns

Page 5: Real Time Streaming Architecture
Source Systems: Syslog, Machine Data, External Streams, Other Sources
Data Collection: Flume / Custom (Agent A, Agent B, ... Agent N)
Messaging System: Kafka (Topic A, Topic B, ... Topic N)
Real Time Processing: Storm (Topology A, Topology B, ... Topology N)
Storage: Search (Elastic Search / Solr), Low Latency NoSql (HBase), Historic (Hive / HDFS)
Access: Web Services, REST API, Web Apps, Analytic Tools (R / Python), BI Tools, Alerting Systems

Page 6: Lambda Architecture
New Data: Data Stream
Batch Layer: All Data, Pre-compute Views
Speed Layer: Stream Processing, Real Time View
Serving Layer: Batch View
Data Access: Query

Page 7: Kappa Architecture
Data Source: Data Stream
Stream Processing System: Job Version n, Job Version n + 1
Serving DB: Output table n, Output table n + 1
Data Access: Query

Page 8: Design Patterns

Page 9: Design Pattern – What is it?
A general reusable solution to a commonly occurring problem within a given context in software design.
Solution: Reusable
Problem: Commonly Occurring
Software Design: Contextual

Page 10: Design Patterns – Why?
Streaming use cases have distinct characteristics
o Unpredictable incoming data patterns
o Correlating multiple streams
o Out-of-sequence and late events
High scale and continuous streams pose new challenges
o Peaks and valleys
o Changing data characteristics over a period of time
o Maintaining the latency and throughput SLAs

Page 11: Streaming Patterns
Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture
Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows
Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events
Stream Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication

Page 12: Streaming Patterns – Being Discussed
Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture
Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows
Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events
Stream Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication

Page 13: External Lookup
Dynamic, High Speed Enrichments With External Data Lookup

Page 14: External Lookup - Description
Referencing frequently changing external system data for event enrichments, filters or validations, while minimizing event processing latencies and system bottlenecks and maintaining high throughput.

Page 15: External Lookup - Challenges
Increased latency due to frequent external system calls
Insufficient memory to hold all reference data
Scalability and performance issues with large reference data sets
Dynamic reference data needs frequent cache purges and refreshes
External systems can become a bottleneck

Page 16: External Lookup – Potential Options
Options: Always Fetch, Cache Everything, Partition and Cache on the go
Compared on: Performance, Scalability, Fault Tolerance

Page 17: External Lookup - A Reference Use Case
Real Time Credit Card Fraud Identification and Alert
o Credit card transaction data comes as a stream (typically through Kafka)
o An external system holds information about the card holder’s recent location
o Each credit card transaction is looked up against the user’s current location
o If the geographic distance between the credit card transaction location and the user’s most recent known location is significant, the credit card transaction is flagged as potential fraud
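The distance check in this use case can be sketched as below. This is a minimal illustration, not the presenter's code: the haversine formula, the function names and the 500 km threshold are all assumptions chosen for the example, since the slides only say the distance must be "significant".

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def is_potential_fraud(txn_location, user_location, threshold_km=500.0):
    """Flag a transaction that is farther than threshold_km (assumed value)
    from the user's most recent known location. Locations are (lat, lon)."""
    return haversine_km(*txn_location, *user_location) > threshold_km
```

In a real topology the threshold would likely also account for how old the last known location is, which the slides do not specify.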

Page 18: External Lookup - Topology Overview
Storm topology: Source Stream, Credit Card Transaction Spout, Partitioner Bolt, Fraud Analyzer Bolt, Fraud Alert, Alerting System
Partitioner Bolt: partitions data based on the area code of the mobile numbers
Fraud Analyzer Bolt: looks up the user’s current location from the external system and finds the geo distance between the transaction location and the user location; locally caches the user location data, and cache validity is time bound
External Reference Data: provides User Location Information

Page 19: External Lookup - Peek in the Bolts
Partitioner Bolt Instances 1, 2, ... n: stream is partitioned based on area code
Fraud Analyzer Bolt Instance 1: CA, NV, TX
Fraud Analyzer Bolt Instance 2: NY, CT, MA
Fraud Analyzer Bolt Instance n: FL, NC, OH
Each Fraud Analyzer Bolt instance keeps a local, time-sensitive cache (use a lightweight caching solution like Guava)

Page 20: External Lookup - Benefits of the approach
Only required data is cached (on demand)
Each bolt caches only its partition of the reference data
Data is cached locally, so trips to the external system are reduced
Cache is time sensitive
On-the-go cache building handles failures elegantly
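The "partition and cache on the go" idea can be sketched as follows. The slides suggest Guava for the cache; this standard-library Python stand-in only illustrates the behaviour, and the class name, the `loader` callback and the CRC-based partition function are assumptions made for the example.

```python
import time
import zlib


class TimeBoundCache:
    """Per-task cache that loads entries on demand and expires them after ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # called on a miss, e.g. a query to the external system
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested without sleeping
        self._entries = {}         # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]          # fresh entry: no trip to the external system
        value = self._loader(key)  # miss or expired: fetch and cache on the go
        self._entries[key] = (value, now + self._ttl)
        return value


def partition_for(area_code, num_tasks):
    """Route every event with the same area code to the same task index,
    so each task only ever caches its own slice of the reference data."""
    return zlib.crc32(area_code.encode("utf-8")) % num_tasks
```

Because the cache is rebuilt from misses, a restarted task simply starts empty and refills itself. Unlike Guava, this sketch has no size bound or eviction of unused keys.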

Page 21: External Lookup – Applicability
Stream processing depends on external data
External data is too large to be held in the memory of each task
External data keeps changing
External system has scalability limitations

Page 22: Responsive Shuffling

Page 23: Responsive Shuffling - Description
Automatically adjust shuffling for better performance and throughput during peaks and varying data skews in streams

Page 24: Responsive Shuffling - Challenges
Incoming data stream is unpredictable and can be skewed
Skew can change from time to time
Managing latency and throughput with skews is difficult
Since streams flow continuously, restarting the topology with new shuffling logic is not practical

Page 25: Shuffling – Potential Options
Options: Static Shuffle, Responsive Shuffle
Compared on: Latency & Throughput, System Reliability, Uptime

Page 26: Responsive Shuffling - A Reference Use Case
Optimized HBase Inserts
o Event data is stored in HBase after Storm processing
o Group events such that a bolt can insert more events into HBase with fewer trips to region servers
o Over a period of time, HBase regions can split/merge
o Automatically adjust the event grouping as the HBase region layout changes over time

Page 27: Example – HBase writes w/o responsive shuffling
App Bolt Instances 1, 2, 3 (100 events each): 300 events sent
HBase Bolt Instances 1, 2, 3 (100 events each): 300 events sent
Region Server Instances 1, 2, 3 (100 events each): 300 events received
9 trips to region servers

Page 28: Responsive Shuffling - Design

Page 29: Example – HBase writes with responsive shuffling
App Bolt Instances 1, 2, 3 (100 events each): 300 events sent
RS Aware Partitioner: automatically adapts to splitting/merging HBase regions
HBase Bolt Instances 1, 2, 3 (100 events each): 300 events sent
Region Server Instances 1, 2, 3 (100 events each): 300 events received
3 trips to region servers
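A region-aware partitioner of the kind shown on this slide could look roughly like the sketch below. It is a plain-Python illustration and not the deck's sample code: the class and method names are hypothetical, and the region layout is passed in as a list of region start keys rather than read from HBase.

```python
import bisect
from collections import defaultdict


class RegionAwarePartitioner:
    """Maps a row key to the index of the region whose key range contains it."""

    def __init__(self, region_start_keys):
        self.update_layout(region_start_keys)

    def update_layout(self, region_start_keys):
        # Call again whenever regions split or merge.
        # Assumes the first region starts at the empty key "", as in HBase.
        self._starts = sorted(region_start_keys)

    def region_index(self, row_key):
        # Rightmost region whose start key is <= row_key.
        return bisect.bisect_right(self._starts, row_key) - 1


def group_by_region(partitioner, row_keys):
    """Group row keys so that each group can be written to one region in one trip."""
    groups = defaultdict(list)
    for key in row_keys:
        groups[partitioner.region_index(key)].append(key)
    return dict(groups)
```

Refreshing the layout through `update_layout` is what makes the shuffle responsive: the topology keeps running while the grouping follows the current region boundaries.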

Page 30: Responsive Shuffling – Sample Code
In App Bolt
In RS Aware Partitioner
In Topology

Page 31: Responsive Shuffling - Benefits
Topology responds to changes in data patterns and adapts accordingly
Maintains a high level of SLA and throughput adherence
Minimizes the need for maintenance and hence downtime

Page 32: Responsive Shuffling - Applicability
Change in shuffle pattern does not impact the final outcome
Data stream has varying skews
Target/reference system specifications change over a period of time

Page 33: Out-of-Sequence Events

Page 34: Out-of-Sequence Events - Description
An out-of-sequence event is one that is received late: so late that you have already processed events that should have been processed after it.

Page 35: Out-of-Sequence Events - Challenges
Hard to determine whether all events in a given window have been received
Late events need to reference the relevant earlier data
Builds more pressure on processing components
Increased latency and degraded overall system performance

Page 36: Out-of-Sequence Events – Potential Options
Options: Drop, Wait, Fan Out
Compared on: Latency, Result Accuracy, Operational Ease

Page 37: Out-of-Sequence Events - Processing
Flow: Source Spout, Event Filter Bolt, then Typical Processing Bolt (ordered events) or Special Handling Bolt (out-of-sequence events)
Event Filter Bolt: monitors the events currently being processed and identifies out-of-sequence events
Special Handling Bolt: depending on the complexity of processing, this can be extended as a different topology
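The fan-out decision made by the Event Filter Bolt can be sketched as below. This is an assumed design for illustration only: it treats an event as out of sequence when its timestamp is older than the newest timestamp seen so far, with an optional `allowed_lateness` tolerance that the slides do not mention.

```python
class EventFilter:
    """Splits a stream into ordered and out-of-sequence events by event timestamp."""

    def __init__(self, allowed_lateness=0):
        self._max_seen = None                  # newest event timestamp seen so far
        self._allowed_lateness = allowed_lateness

    def route(self, event_time):
        """Return "ordered" or "out_of_sequence" for an event with this timestamp."""
        if self._max_seen is not None and event_time < self._max_seen - self._allowed_lateness:
            return "out_of_sequence"           # goes to the special handling bolt
        if self._max_seen is None or event_time > self._max_seen:
            self._max_seen = event_time
        return "ordered"                       # goes to the typical processing bolt
```

Keeping the filter this cheap is what lets the normal path hold its latency while the special handling path scales on its own.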

Page 38: Out-of-Sequence Events – Benefits
Separation of concerns
Maintains the overall throughput and latency requirements
Independent scaling of components

Page 39: Out-of-Sequence Events - Applicability
When the order of events matters
Processing out-of-sequence events needs special and complex logic
Stream has a relatively low volume of out-of-sequence events

Page 40: Thank You

Page 41: Appendix

Page 42: Data Security in Kafka

Page 43: Data Security in Kafka - Description
Ability to use Kafka as a secure data transfer mechanism.
Apache Kafka is a widely used messaging platform in streaming applications.
Unfortunately, Kafka does not have built-in support for authentication and authorization (yet).

Page 44: Data Security in Kafka - Flow
Source Systems: Sources, Syslog
Data Collection: Custom Collector, Encrypting Producer
Messaging System: Kafka (Encrypted Messages)
Real Time Processing: Storm (Kafka Spout, Decrypting Bolt, App Bolt)

Page 45: Data Security in Kafka – Encryption Details
Data Collection (Event Producer): encrypt event(s) w/ AES, then encrypt the AES key w/ RSA
Messaging System (Kafka Topic): Event(s) Envelope containing the Encrypted AES Key (w/ RSA) and the Encrypted Event (w/ AES)
Real Time Processing (Storm, Decrypting Bolt): decrypt the AES key w/ RSA, then decrypt event(s) w/ AES

Page 46: Data Security in Kafka – Encryption Details
RSA public/private keys are generated ahead of time and securely shared with the topology
The AES key is randomly generated and periodically refreshed
Only a user holding the appropriate RSA private key can read the data
One event or a batch of events can be encrypted together, as needed

Page 47: Data Security in Kafka - Applicability
Multiple applications want to use Kafka as the source of their streams
Data is sensitive and cannot be shared between applications
Other components in the pipeline are secured

Page 48: Micro Batching

Page 49: Micro Batching - Description
Micro-batching is a technique that allows a process or task to treat a stream as a sequence of small batches or chunks of data.
For incoming streams, the events can be packaged into small batches and delivered to a batch system for processing.

Page 50: Micro Batching - Challenges
Data delivery reliability
Unnecessary data duplication
Increased latency
Complexity in time-bound batching

Page 51: Micro Batching - Design Options
Thread-based model
Controller stream to trigger batch flush
Use of tick tuples

Page 52: Tick Tuples
Tick tuples are system-generated tuples that Storm can send to your bolt when you need to perform some action at a fixed interval.

Page 53: Micro Batching - Benefits
Takes advantage of system characteristics by batching events together
Adheres to processing latency needs by ensuring that batches are executed at fixed intervals
Prevents data loss by acknowledging events only after successful processing
Simple, elegant and easy-to-maintain code
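The tick-driven batching described above can be sketched as follows. The deck's own sample code is not preserved in this transcript, so this is an independent plain-Python illustration rather than Storm code: `write_batch` and `ack` are hypothetical callbacks standing in for the bulk write to the target system and the tuple acknowledgement, and `on_tick` stands in for receiving a tick tuple.

```python
class MicroBatcher:
    """Buffers events and flushes them as one batch on a size limit or a periodic tick."""

    def __init__(self, write_batch, ack, max_batch_size):
        self._write_batch = write_batch    # bulk write to the target system (hypothetical)
        self._ack = ack                    # acknowledges a single event (hypothetical)
        self._max_batch_size = max_batch_size
        self._pending = []

    def on_event(self, event):
        self._pending.append(event)
        if len(self._pending) >= self._max_batch_size:
            self._flush()

    def on_tick(self):
        # Called at a fixed interval; bounds how long an event can wait in the buffer.
        self._flush()

    def _flush(self):
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._write_batch(batch)           # if this raises, nothing in the batch is acked
        for event in batch:
            self._ack(event)               # ack only after the batch was written
```

Acking after the write means a failed batch is never acknowledged, so the upstream source can replay it; the price is possible duplicates, which matches the duplication challenge listed earlier.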

Page 54: Micro Batching - Applicability
Target systems are more efficient with bulk transactions
Processing a group of events is more efficient than processing individual events
End-to-end event latency is not highly sensitive

Page 55: Micro Batching – Sample Code

Page 56: Thank You