Going Real-time: Data Collection and Stream Processing with Apache Kafka. Jay Kreps, Confluent. Hi, my name is Jay. I'm going to be talking about using Apache Kafka to build a central platform for data streams and stream processing: five years of work in 35 minutes.

Experience at LinkedIn. This story takes place at LinkedIn, where I was on the infrastructure team. I had previously built a distributed key-value store and was leading the Hadoop adoption effort.

2009: We want all our data in Hadoop! We had done some initial prototypes and gotten valuable things out of them, and we wanted a data lake/hub.

What is all our data? XML, CSV dumps, etc. The job was to translate the nasty data into pretty schemas in Hadoop. XML is super slow and labor-intensive to parse, and with CSV: what if the data has a comma?
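
A tiny illustration (not from the talk, with a made-up record) of the "what if the data has a comma?" problem: a naive split on "," silently corrupts any field that itself contains a comma, which is exactly the kind of manual fix-up this parsing work kept requiring.

    // Minimal sketch: why ad hoc CSV parsing of dumps is painful.
    public class CsvCommaProblem {
        public static void main(String[] args) {
            String row = "Jay Kreps,\"Palo Alto, CA\",Engineer";

            // Naive parsing splits the quoted city into two fields.
            String[] naive = row.split(",");
            System.out.println("naive field count = " + naive.length);   // prints 4, not 3

            // Getting it right means tracking quote state by hand (or using a real parser).
            java.util.List<String> fields = new java.util.ArrayList<>();
            StringBuilder current = new StringBuilder();
            boolean inQuotes = false;
            for (char ch : row.toCharArray()) {
                if (ch == '"') {
                    inQuotes = !inQuotes;
                } else if (ch == ',' && !inQuotes) {
                    fields.add(current.toString());
                    current.setLength(0);
                } else {
                    current.append(ch);
                }
            }
            fields.add(current.toString());
            System.out.println("correct field count = " + fields.size()); // prints 3
        }
    }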

Initial approach: "gut it out." Various ad hoc load jobs, with manual parsing of each new data source into a target schema. This got us 14% of the data, and it was not the kind of team I wanted to run.

Problems: data coverage. Many source systems (relational DBs, log files, metrics, messaging systems), many data formats, and constant change: new schemas, new data sources, data grandfathering. Only 14% of the data was in Hadoop, and adding N new non-Hadoop engineers implied adding O(N) Hadoop engineers just to keep up.

Needed: organizational scalability. We hadn't paid much attention to how data flows through the business; the focus had been on individual systems and algorithms. I believe the data is what's important.

How does everything else work? I got interested in things like schemas, metadata, dataflow, etc.

Relational database changes: real-time capture was mostly lossless but couldn't go back in time; batch dumps were mostly lossless but high-latency. And why CSV dumps?

NoSQL

User events: batch aggregation. Lossy, high-latency, and the data only went to the data warehouse/Hadoop.

Application logs: Splunk.

Messaging: low-throughput, lossy, and no real scalability story. No central system, and no integration with the batch world.

Metrics and operational data: real-time, but lossy and not multi-subscriber.

This is a giant mess. The reality was about 100x more complex: 300 services, ~100 databases, multiple datacenters. (Trolling suggestions: just load it all into Oracle, put it in search, etc.)

Impossible ideas: publish data from Hadoop to a search index; run a SQL query to find the biggest latency bottleneck; run a SQL query to find common error patterns; low-latency monitoring of database changes or user activity; incorporating popularity into real-time display and relevance algorithms; products that incorporate user activity; stop showing people the same things; mark people as interested in jobs. With N systems and data types, point-to-point integration means O(N^2) pipelines.

An infrastructure solution? Integrating each system by hand was not tenable: new systems and data were being created faster than we could integrate them, and not all problems are solvable with infrastructure alone. Extract the common pattern…

Idea: a stream data platform. We had Hadoop, which is great: that can be our warehouse/archive. But what about real-time data and real-time processing, all the messed-up stuff? Files are really just a stream of updates stored together.

First Attempt: Messaging systems! This problem is solved…don’t reinvent the wheel!

Problems: throughput, integration with batch systems, persistence, stream processing, ordering guarantees, and partitioning. Existing messaging systems were not suitable for ETL and not suitable for high throughput.

Second Attempt: Build Kafka! Reinvent the wheel! Initial time estimate: 3 months

What does it do? Producers, consumers, and topics, just like a messaging system.
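
As a rough sketch of what those primitives look like with the Kafka Java client (the broker address, topic name, and group id below are made up for illustration): a producer appends records to a topic, and a consumer subscribes and reads them back in order.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class ProducerConsumerSketch {
        public static void main(String[] args) {
            // Producer: append events to a topic ("page-views" is a hypothetical name).
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("page-views", "member-42", "/jobs"));
            }

            // Consumer: read the same topic back as an ordered stream.
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "demo-reader");
            c.put("auto.offset.reset", "earliest");
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(Collections.singletonList("page-views"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
                }
            }
        }
    }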

The commit log abstraction. What is different is the commit log, an idea stolen from distributed database internals. It is the key abstraction for systems, for real-time processing, and for data integration.
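
What the log abstraction buys you, in a minimal sketch (assuming a hypothetical db-changes topic and a recent Java client): each partition is a persistent, append-only sequence of records addressed by offset, so a reader can rewind to any point and replay history rather than only seeing new messages.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class LogReplaySketch {
        public static void main(String[] args) {
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("enable.auto.commit", "false"); // manual partition assignment, no consumer group needed
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                // A partition of a topic is an append-only log; every record has a stable offset.
                TopicPartition partition = new TopicPartition("db-changes", 0);
                consumer.assign(Collections.singletonList(partition));

                // Because the log is persistent, a reader can rewind and re-read from any point,
                // for example to rebuild a downstream table or recover a failed processor.
                consumer.seek(partition, 0L);

                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.println(r.offset() + ": " + r.value());
                }
            }
        }
    }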

Logs and publish-subscribe. The log gives you a very strong, very fast publish-subscribe mechanism.

Kafka: a modern distributed system for streams. The scalability of a filesystem: hundreds of MB/sec/server of throughput and many TB per server. The guarantees of a database: messages strictly ordered and all data persistent. Distributed by default: replication, a partitioning model, and producers, consumers, and brokers that are all fault-tolerant and horizontally scalable. Built like a modern distributed system.
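
A rough illustration of the partitioning and replication model, using today's Java admin client (which postdates this talk); the topic name, partition count, and replication factor below are arbitrary. Partitions spread a topic across brokers, replication makes it durable, and records with the same key land in the same partition, which is where the strict per-key ordering comes from.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.ExecutionException;

    public class PartitionedTopicSketch {
        public static void main(String[] args) throws ExecutionException, InterruptedException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            // 8 partitions spread the load across brokers; replication factor 3 survives broker failures.
            try (AdminClient admin = AdminClient.create(props)) {
                admin.createTopics(Collections.singletonList(
                        new NewTopic("user-activity", 8, (short) 3))).all().get();
            }

            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records with the same key hash to the same partition, so these two events
                // reach consumers in the order they were written.
                producer.send(new ProducerRecord<>("user-activity", "member-7", "viewed-profile"));
                producer.send(new ProducerRecord<>("user-activity", "member-7", "sent-message"));
            }
        }
    }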

Stream data platform. Let's dive into the elements of this platform. One of the big things this enables is stream processing, so let's dive into that. (Truviso.)

Stream processing a la carte: cat input | grep "foo" | wc -l. Like a unix pipe, this streams data between programs, but distributed, fault-tolerant, and elastically scalable.
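
A minimal "a la carte" processor in the spirit of that pipeline, written against the plain Java client (topic names and group id are hypothetical): consume one topic, filter like grep, keep a running count like wc -l, and publish the matches to another topic. A real job would also manage offsets and failures.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class GrepProcessorSketch {
        public static void main(String[] args) {
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "grep-foo");
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> in = new KafkaConsumer<>(c);
                 KafkaProducer<String, String> out = new KafkaProducer<>(p)) {
                in.subscribe(Collections.singletonList("input-lines"));
                long matches = 0;
                while (true) {
                    for (ConsumerRecord<String, String> r : in.poll(Duration.ofMillis(500))) {
                        if (r.value().contains("foo")) {          // grep "foo"
                            matches++;                            // wc -l, as a running count
                            out.send(new ProducerRecord<>("foo-lines", r.key(), r.value()));
                        }
                    }
                    System.out.println("lines matching so far: " + matches);
                }
            }
        }
    }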

Stream processing with frameworks: streams + processing = stream processing (think of the HDFS analogy). Most stream processing systems use Kafka, and some require it, e.g. for rollback recovery.

Unix pipes, modernized: cat /usr/share/dict/words | wc -l. You need a uniform data format and a uniform representation of a stream. Unix comes from the days of timesharing; now we want a distributed version that works at the data center level. With services, where a process runs has nothing to do with where you want its data. The differences: distributed (fault-tolerant), partitioned parallelism, and long-running processes.

On schemas: Bad Schemas < No Schemas < Good Schemas.

A principled notion of schemas and compatibility. Kafka itself is data-format agnostic (like HDFS), but you still need schemas; we used Avro. For database data, capture what you know about the table rather than throwing it all away with CSV. For event data (logs, user activity, metrics, etc.), the schema is a conversation between the users of the data and the publishers. This is where a schema repository comes in.
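
A sketch of what a schema-described event might look like in the Java client. The PageView schema and topic are hypothetical; the serializer and schema.registry.url configuration shown are Confluent's Avro serializer and schema registry, one implementation of the schema repository idea, and require the corresponding library on the classpath.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class AvroEventSketch {
        // A hypothetical event schema; in practice the schema is the contract agreed on
        // between the publishers of the data and its consumers.
        private static final String PAGE_VIEW_SCHEMA =
                "{ \"type\": \"record\", \"name\": \"PageView\", \"fields\": ["
              + "  { \"name\": \"memberId\",  \"type\": \"long\"   },"
              + "  { \"name\": \"page\",      \"type\": \"string\" },"
              + "  { \"name\": \"timestamp\", \"type\": \"long\"   } ] }";

        public static void main(String[] args) {
            Schema schema = new Schema.Parser().parse(PAGE_VIEW_SCHEMA);

            GenericRecord event = new GenericData.Record(schema);
            event.put("memberId", 42L);
            event.put("page", "/jobs");
            event.put("timestamp", System.currentTimeMillis());

            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // Registers the schema with the schema registry and enforces compatibility
            // rules when the schema later evolves.
            p.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
            p.put("schema.registry.url", "http://localhost:8081");

            try (KafkaProducer<String, Object> producer = new KafkaProducer<>(p)) {
                producer.send(new ProducerRecord<>("page-views", "member-42", event));
            }
        }
    }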

Put it all together: the stream data platform. (Trace the data flow through it.)

At LinkedIn, everything that happens in the company is a real-time stream: more than 500 billion messages written per day, more than 2.5 trillion messages read per day, roughly 1 PB of stream data, and tens of thousands of producer processes. It is the backbone for data stores (search, the social graph, the newsfeed, and, in progress, primary storage) and the basis for stream processing.

Best of all: my data munging team disappeared! The platform scaled to 1,000 data sources with half a person staffing it, and lots of other munging teams disappeared too.

Elsewhere. (Describe the adoption cycle.)

Why this is the future: system diversity is increasing, data diversity and volume are increasing, the world is getting faster, and the technology now exists. These are basically the three Vs of big data. And if you think the world needs to change (and you live in Silicon Valley), there is really only one option.

Confluent. Mission: make this a practical reality everywhere. The product: schema and metadata management, connectors for common systems, end-to-end monitoring of data flow, and stream processing integration. First release next week. There is only one thing you can do if you think the world needs to change and you live in Silicon Valley: quit your job and go do it.

Questions? Confluent: @confluentinc, http://confluent.io. Apache Kafka: http://kafka.apache.org, http://linkd.in/199iMwY. Me: @jaykreps. Office hours tomorrow at