CMSC 491 Hadoop-Based Distributed Computing Spring 2016 Adam Shook


Apache Kafka
CMSC 491 Hadoop-Based Distributed Computing, Spring 2016, Adam Shook

Overview
Kafka is "publish-subscribe messaging rethought as a distributed commit log":
Fast
Scalable
Durable
Distributed
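The "distributed commit log" idea can be sketched with a toy in-memory log (illustrative only; CommitLog and its methods are made-up names, not Kafka's API):

```python
class CommitLog:
    """Toy append-only log: records get sequential offsets and are never mutated."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to the new record

    def read(self, offset):
        # Readers replay the log from any offset they choose.
        return self.records[offset:]


log = CommitLog()
log.append("msg-0")  # offset 0
log.append("msg-1")  # offset 1
```

Publishers only ever append; subscribers only ever read forward from an offset. That is the entire contract the rest of the slides build on.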

Kafka adoption and use cases
LinkedIn: activity streams, operational metrics, data bus; 400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s) as of May 2014
Netflix: real-time monitoring and event processing
Twitter: part of their Storm real-time data pipelines
Spotify: log delivery (from 4 h down to 10 s), Hadoop
Loggly: log collection and processing
Mozilla: telemetry data
Also: Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, …

How fast is Kafka?
"Up to 2 million writes/sec on 3 cheap machines"
Using 3 producers on 3 different machines, with 3x asynchronous replication
Only 1 producer per machine because the NIC was already saturated
Sustained throughput as stored data grows (slightly different test configuration than the 2M writes/sec figure above)

Why is Kafka so fast?
Fast writes: while Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM.
Fast reads: it is very efficient to transfer data from the page cache to a network socket (on Linux, via the sendfile() system call).
The combination of the two = fast Kafka!
Example (operations): on a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as they serve data entirely from the cache.
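On Linux, the zero-copy read path can be demonstrated with Python's os.sendfile wrapper (a sketch of the mechanism, not Kafka code; the temporary file stands in for a log segment and the socket pair for a consumer connection):

```python
import os
import socket
import tempfile

# Write some bytes to a file (stand-in for a Kafka log segment on disk).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello, zero-copy")
    path = f.name

# A connected socket pair stands in for a consumer's network connection.
reader, writer = socket.socketpair()
with open(path, "rb") as segment:
    # The kernel copies file -> socket directly (page cache to NIC);
    # user space never touches the payload bytes.
    sent = os.sendfile(writer.fileno(), segment.fileno(), 0, 16)
writer.close()

received = reader.recv(64)
reader.close()
os.unlink(path)
```

Compare this with a read()/send() loop, which would copy every byte into user space and back out again; sendfile() skips both copies.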

A first look
The who is who: producers write data to brokers; consumers read data from brokers; all of this is distributed.
The data: data is stored in topics; topics are split into partitions, which are replicated.

A first look (architecture diagram)

Topics
Topic: the feed name to which messages are published. Example: "zerg.hydra"
Producers always append to the "tail" of a topic (think: appending to a file).
Kafka prunes the "head" based on age, max size, or key.
(Diagram: producers A1, A2, …, An append new messages at the tail on the broker(s); older messages sit toward the head.)
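The head-pruning behavior can be sketched as a retention policy over an in-memory log (illustrative only; prune and its parameters are invented names, not Kafka configuration keys):

```python
import time
from collections import deque

def prune(log, max_records=None, max_age_s=None, now=None):
    """Drop (timestamp, message) records from the head, Kafka-retention style:
    oldest records go first, by record count or by age, never from the tail."""
    now = time.time() if now is None else now
    while log:
        ts, _msg = log[0]  # head = oldest record
        too_many = max_records is not None and len(log) > max_records
        too_old = max_age_s is not None and (now - ts) > max_age_s
        if too_many or too_old:
            log.popleft()
        else:
            break
    return log
```

Note that consumers are unaffected by pruning as long as their offsets point past the head; only lagging readers can lose data to retention.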

Topics
Consumers use an "offset pointer" to track and control their read progress (and to decide the pace of consumption).
(Diagram: consumer groups C1 and C2 each maintain their own offset into the same topic while producers A1…An keep appending at the tail.)

Partitions A topic consists of partitions. Partition: ordered + immutable sequence of messages that is continually appended to

Partitions
The number of partitions of a topic is configurable.
The number of partitions determines the maximum consumer (group) parallelism (cf. the parallelism of Storm's KafkaSpout, set via builder.setSpout(,,N)).
(Diagram: consumer group A, with 2 consumers, reads from a 4-partition topic; consumer group B, with 4 consumers, reads from the same topic.)
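The partition-to-consumer relationship can be sketched with a simple round-robin assignment (a simplification; Kafka's real assignment strategies are negotiated through the group protocol, and all names here are invented):

```python
def assign(partitions, consumers):
    """Spread partitions round-robin across group members.
    Each partition goes to exactly one consumer in the group;
    consumers beyond the partition count end up idle."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

This makes the parallelism limit concrete: with 4 partitions, a 2-consumer group gets 2 partitions each, while a 5-consumer group leaves one consumer with nothing to read.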

Partition offsets
Offset: messages in a partition are each assigned a unique (per-partition), sequential id called the offset.
Consumers track their position via (offset, partition, topic) tuples.
(Diagram: consumer group C1's offset pointer within a partition.)
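Per-group offset tracking can be sketched as follows (illustrative only; poll and the positions map are invented names, but the key idea matches Kafka: each group advances its own offset per (topic, partition), so groups read the same data independently):

```python
# One partition's log; list index doubles as the message offset.
log = ["m0", "m1", "m2", "m3"]

# Each consumer group keeps its own (topic, partition) -> next-offset map.
positions = {
    "group-C1": {("zerg.hydra", 0): 0},
    "group-C2": {("zerg.hydra", 0): 0},
}

def poll(group, topic, partition, max_records=2):
    """Read the next batch for this group and 'commit' the advanced offset."""
    key = (topic, partition)
    start = positions[group][key]
    batch = log[start:start + max_records]
    positions[group][key] = start + len(batch)
    return batch
```

Because the broker never deletes a message on read, a slow group (or one replaying from offset 0) costs the fast groups nothing.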

Replicas of a partition
Replicas are "backups" of a partition. They exist solely to prevent data loss: clients never read from or write to follower replicas, so replicas do NOT increase producer or consumer parallelism.
Kafka tolerates (numReplicas - 1) dead brokers before losing data.
LinkedIn: numReplicas == 2, so 1 broker can die.
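The (numReplicas - 1) tolerance follows directly: a partition's data survives as long as at least one broker holding a replica is still alive. A one-line sketch (invented names, for illustration):

```python
def partition_survives(replica_brokers, dead_brokers):
    """Data is retained iff any broker hosting a replica is still alive."""
    return any(b not in dead_brokers for b in replica_brokers)
```

With numReplicas == 2 (the LinkedIn setting above), losing one broker is safe; losing both brokers that host the replicas loses the partition.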

Kafka Quickstart
Steps for downloading Kafka, starting a server, and creating a console-based consumer and producer.
Requires ZooKeeper to be installed and running.
https://kafka.apache.org/documentation.html#quickstart
https://github.com/adamjshook/hadoop-demos/tree/master/kafka
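For reference, the quickstart steps roughly correspond to these commands, run from the Kafka installation directory (script names and flags as in the 0.9-era Apache quickstart; newer releases replace --zookeeper/--broker-list with --bootstrap-server):

```shell
# Start ZooKeeper (required by Kafka releases of this era), then one broker
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

# Create a topic with one partition and a single replica
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 1 --partitions 1 --topic test

# Console producer: each line typed becomes one message
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

# Console consumer: replay the topic from offset 0
bin/kafka-console-consumer.sh --zookeeper localhost:2181 \
    --topic test --from-beginning
```

The producer and consumer should run in separate terminals; messages typed into the producer appear in the consumer as they arrive.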