Apache Storm and Kafka Boston Storm User Group September 25, 2014 P. Taylor Goetz,

What is Apache Kafka?

A pub/sub messaging system. Re-imagined as a distributed commit log.

Apache Kafka Fast Scalable Durable Distributed

Apache Kafka Fast “A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.”

Apache Kafka Scalable “Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.”

Apache Kafka Durable “Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.”

Apache Kafka Distributed “Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.”

Apache Kafka: Use Cases Stream Processing Messaging Click Streams Metrics Collection and Monitoring Log Aggregation

Apache Kafka: Use Cases Greek letter architectures, which are really just streaming design patterns.

Apache Kafka: Under the Hood Producers/Consumers (Publish-Subscribe)

Apache Kafka: Under the Hood Producers write data to Brokers Consumers read data from Brokers This work is distributed across the cluster

Apache Kafka: Under the Hood Data is stored in topics. Topics are divided into partitions. Partitions are replicated.

Apache Kafka: Under the Hood Topics are named feeds to which messages are published.

Apache Kafka: Under the Hood Topics consist of partitions.

Apache Kafka: Under the Hood A partition is an ordered and immutable sequence of messages that is continually appended to.

Apache Kafka: Under the Hood Sequential disk access can be faster than random memory access!

Apache Kafka: Under the Hood Within a partition, each message is assigned a unique ID called an offset that identifies it.
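The partition-and-offset model above can be sketched in a few lines of Java. This is a conceptual illustration only, not Kafka's actual API or storage format: a partition behaves like an append-only sequence where each message's offset is simply its position in the log.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Conceptual sketch (not Kafka's API): a partition as an append-only,
 * immutable sequence of messages. Each message's offset is its position
 * in the log and never changes once assigned.
 */
class PartitionSketch {
    private final List<String> log = new ArrayList<>();

    /** Append a message and return the offset it was assigned. */
    long append(String message) {
        log.add(message);
        return log.size() - 1;
    }

    /** Read the message at a given offset; existing entries are never mutated. */
    String read(long offset) {
        return log.get((int) offset);
    }
}
```

Because messages are only ever appended, consumers can track their progress with a single number: the next offset to read.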

Apache Kafka: Under the Hood ZooKeeper is used to store cluster state information and consumer offsets.

Storm and Kafka A match made in heaven.

Data Source Reliability A data source is considered unreliable if there is no means to replay a previously-received message. A data source is considered reliable if it can somehow replay a message if processing fails at any point. A data source is considered durable if it can replay any message or set of messages given the necessary selection criteria.

Data Source Reliability Kafka is a durable data source.

Reliability in Storm Exactly once processing requires a durable data source. At least once processing requires a reliable data source. An unreliable data source can be wrapped to provide additional guarantees. With durable and reliable sources, Storm will not drop data. Common pattern: Back unreliable data sources with Apache Kafka (minor latency hit traded for 100% durability).
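The "wrap an unreliable source" pattern mentioned above can be sketched as follows. The names and structure here are illustrative, not a real Storm or Kafka API: emitted messages are buffered until acknowledged, so a failed message can be replayed, upgrading an unreliable source to at-least-once delivery.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of wrapping an unreliable source to provide replay (illustrative
 * names, not a real Storm API). Messages are buffered until acked, so a
 * failure can be answered by replaying the buffered copy.
 */
class ReplayableSource {
    private final Map<Long, String> pending = new LinkedHashMap<>();
    private long nextId = 0;

    /** Emit a message, remembering it until it is acknowledged. Returns its id. */
    long emit(String message) {
        long id = nextId++;
        pending.put(id, message);
        return id;
    }

    /** Processing succeeded: the buffered copy is no longer needed. */
    void ack(long id) {
        pending.remove(id);
    }

    /** Processing failed: return the buffered copy for replay (null if already acked). */
    String replay(long id) {
        return pending.get(id);
    }
}
```

Kafka provides this replay ability natively (messages stay on disk for the retention period), which is why backing unreliable sources with Kafka is such a common pattern.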

Storm and Kafka Apache Kafka is an ideal source for Storm topologies. It provides everything necessary for: At most once processing At least once processing Exactly once processing Apache Storm includes Kafka spout implementations for all levels of reliability. Kafka supports a wide variety of languages and integration points for both producers and consumers.

Storm-Kafka Integration Included in the Storm distribution: Core Storm Spout Trident Spouts (Transactional and Opaque-Transactional)

Storm-Kafka Integration Features: Ingest from Kafka Configurable start time (offset position): Earliest, Latest, Last, Point-in-Time Write to Kafka (next release)
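The configurable start position can be thought of as resolving a requested position against what the partition still retains. The sketch below is illustrative only (not the storm-kafka spout's actual API): `EARLIEST` and `LATEST` map to the ends of the retained offset range, and `LAST` resumes from a previously committed offset, clamped to what retention still allows.

```java
/**
 * Illustrative sketch of resolving a configurable start position against a
 * partition's retained offset range. Not the storm-kafka API; the enum and
 * method names are hypothetical.
 */
class StartOffsetSketch {
    enum StartPosition { EARLIEST, LATEST, LAST }

    /**
     * @param earliest  oldest offset still retained on the broker
     * @param latest    next offset to be written
     * @param committed offset saved by a previous run of the consumer
     */
    static long resolve(StartPosition pos, long earliest, long latest, long committed) {
        switch (pos) {
            case EARLIEST: return earliest;                      // replay everything retained
            case LATEST:   return latest;                        // only new messages
            case LAST:     return Math.max(earliest, committed); // resume, but not before retention
            default:       throw new IllegalArgumentException("unknown position: " + pos);
        }
    }
}
```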

Use Cases

Core Storm Use Case Cisco Open Security Operations Center (OpenSOC)

Analyzing 1.2 Million Network Packets Per Second in Real Time

OpenSOC: Intrusion Detection Breaches occur in seconds/minutes/hours, but take days/weeks/months to discover.

Data The 3 Vs of data (volume, velocity, variety) are not getting any smaller…

"Traditional Security analytics tools scale up, not out.” "OpenSOC is a software application that turns a conventional big data platform into a security analytics platform.” - James Sirota, Cisco Security Solutions

OpenSOC Conceptual Model

OpenSOC Architecture

PCAP Topology

Telemetry Enrichment Topology

Enrichment

Analytics Dashboards

OpenSOC Cisco

Trident Use Case Health Market Science Master Data Management

Health Market Science “Master File” database of every healthcare practitioner in the U.S. Kept up-to-date in near-real-time Represents the “truth” at any point in time (“Golden Record”)

Health Market Science Build products and services around the Master File Leverage those services to gather new data and updates

Master Data Management

Data In

Data Out

MDM Pipeline

Polyglot Persistence Choose the right tool for the job.

Data Pipeline

Why Trident? Aggregations and Joins Bulk update of persistence layer (Micro-batches) Throughput vs. Latency
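The throughput-versus-latency trade-off behind micro-batching can be sketched as follows. This is an illustrative stand-in, not Trident's actual API: updates are buffered and written to the persistence layer in bulk, issuing far fewer writes at the cost of a small delay.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch of micro-batched persistence (illustrative only, not Trident's API):
 * buffer individual updates and hand them to a bulk writer once a batch fills,
 * trading a little latency for much higher write throughput.
 */
class MicroBatcher {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> bulkWriter;
    int flushes = 0; // number of bulk writes issued so far

    MicroBatcher(int batchSize, Consumer<List<String>> bulkWriter) {
        this.batchSize = batchSize;
        this.bulkWriter = bulkWriter;
    }

    /** Buffer one update, flushing automatically when the batch is full. */
    void add(String update) {
        buffer.add(update);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Write any buffered updates as a single bulk call. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        bulkWriter.accept(new ArrayList<>(buffer));
        buffer.clear();
        flushes++;
    }
}
```

In the pipeline described here, the bulk writer's role is played by a Cassandra logged batch, as the `CassandraCqlState` code that follows shows.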

CassandraCqlState

public void commit(Long txid) {
    BatchStatement batch = new BatchStatement(Type.LOGGED);
    batch.addAll(this.statements);
    clientFactory.getSession().execute(batch);
}

public void addStatement(Statement statement) {
    this.statements.add(statement);
}

public ResultSet execute(Statement statement) {
    return clientFactory.getSession().execute(statement);
}

CassandraCqlStateUpdater

public void updateState(CassandraCqlState state, List<TridentTuple> tuples, TridentCollector collector) {
    for (TridentTuple tuple : tuples) {
        Statement statement = this.mapper.map(tuple);
        state.addStatement(statement);
    }
}

Mapper Implementation

public Statement map(List keys, Number value) {
    Insert statement = QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
    statement.value(KEY_NAME, keys.get(0));
    statement.value(VALUE_NAME, value);
    return statement;
}

public Statement retrieve(List keys) {
    Select statement = QueryBuilder.select()
            .column(KEY_NAME)
            .column(VALUE_NAME)
            .from(KEYSPACE_NAME, TABLE_NAME)
            .where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
    return statement;
}

Storm Cassandra CQL {tuple} —> CQL Statement Trident Batch == CQL Batch

Customer Dashboard

Thanks! P. Taylor Goetz,