CSCI5570 Large Scale Data Processing Systems


CSCI5570 Large Scale Data Processing Systems Distributed Stream Processing Systems James Cheng CSE, CUHK Slide Ack.: modified based on the slides from Nathan Marz, Mahender Immadi, Thirupathi Guduru and Karthick Ramasamy

Storm@Twitter Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel*, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy Ryaboy Twitter, Inc., *University of Wisconsin – Madison SIGMOD 2014

Twitter Storm Storm is currently one of the most popular stream processing systems Features: Efficient at-least-once message processing guarantee Flexible message dispatching schemes

Storm Core Concepts Tuple, Stream, Spout, Bolt, Topology, Task

Tuple and Stream Tuple: data unit (or message primitive); contains different fields (e.g., word and count) Stream: unbounded sequence of tuples

Spout Source of data streams Wrap a streaming data source and emit tuples Examples: Twitter Streaming API/Kafka

Bolt Abstraction of processing elements Consume tuples and may output tuples Examples: Filter/Aggregation/Join

Topology Job definition A DAG consisting of spouts, bolts, and edges

Task Each spout and bolt runs as multiple instances in parallel Each such instance is called a task

Stream Grouping When a tuple is emitted, which processing element does it go to?

Stream Grouping Shuffle grouping: send a tuple to a consumer processing element at random Fields grouping: hash (mod the number of consumer tasks) on one or more fields of the tuple, so tuples with the same field values go to the same consumer All grouping: replicate every tuple to all consumer processing elements Global grouping: send all tuples to a single processing element

Storm Word Count Topology (Job) (Diagram: Twitter Spout → shuffle grouping → Split Sentence → fields grouping → Word Count → global grouping → Report)

Streaming Word Count TopologyBuilder is used to construct topologies in Java

Streaming Word Count Define a spout in the topology with parallelism of 5 tasks

Streaming Word Count Split sentences into words with parallelism of 8 tasks Consumer decides what data it receives and how it gets grouped

Streaming Word Count Create a word count stream
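Putting the last few slides together, a minimal sketch of the topology wiring in Java is shown below. The spout/bolt class names (TwitterSpout, SplitSentence, WordCount, ReportBolt) and the count/report parallelism are placeholders; note how the groupings from the Stream Grouping slide appear as shuffleGrouping, fieldsGrouping and globalGrouping calls.
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;
    // Wire up the word count topology (inside main(), after defining the
    // spout and bolt classes).
    TopologyBuilder builder = new TopologyBuilder();
    // Spout wrapping the tweet source, run with a parallelism of 5 tasks.
    builder.setSpout("tweets", new TwitterSpout(), 5);
    // Split sentences into words with 8 tasks; shuffle grouping spreads
    // sentences randomly across the split tasks.
    builder.setBolt("split", new SplitSentence(), 8)
           .shuffleGrouping("tweets");
    // Count words; fields grouping sends the same word to the same task.
    builder.setBolt("count", new WordCount(), 12)
           .fieldsGrouping("split", new Fields("word"));
    // Report results; global grouping routes everything to a single task.
    builder.setBolt("report", new ReportBolt(), 1)
           .globalGrouping("count");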

Streaming Word Count

Streaming Word Count
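The code shown on the last two slides is not reproduced in this transcript; as a rough stand-in (a sketch, not the original slide code), a typical split/count bolt pair in the storm-starter style looks like this:
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    // Splits each incoming sentence into individual words.
    class SplitSentence extends BaseBasicBolt {
      @Override
      public void execute(Tuple tuple, BasicOutputCollector collector) {
        for (String word : tuple.getString(0).split("\\s+")) {
          collector.emit(new Values(word));
        }
      }
      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
      }
    }
    // Keeps a running count per word and emits (word, count) pairs.
    class WordCount extends BaseBasicBolt {
      private final Map<String, Integer> counts = new HashMap<>();
      @Override
      public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getStringByField("word");
        int count = counts.merge(word, 1, Integer::sum);
        collector.emit(new Values(word, count));
      }
      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
      }
    }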

Streaming Word Count Submitting topology to a cluster

Streaming Word Count Running topology in local mode
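A sketch of what the two run modes on these slides look like in code (the topology name, worker count, and the runOnCluster flag are placeholders; builder is the TopologyBuilder from the earlier sketch):
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.StormSubmitter;
    Config conf = new Config();
    conf.setNumWorkers(3);   // number of worker processes for this topology
    if (runOnCluster) {
      // Submit the topology to a Storm cluster through Nimbus.
      StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    } else {
      // Run the same topology inside the current JVM for testing.
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("word-count", conf, builder.createTopology());
      Thread.sleep(60_000);  // let it run for a while
      cluster.shutdown();
    }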

System Overview Nimbus (Master): distributes and coordinates the execution of the topology; failure monitoring Supervisor (Slave): spawns workers, which execute spouts or bolts and keep listening for tuples Zookeeper: coordination management (Diagram: Nimbus, Zookeeper, and Supervisors making up the Storm framework)

Nimbus and Zookeeper Nimbus: similar to JobTracker in Hadoop The user describes the topology as a Thrift object and sends the object to Nimbus; any programming language can be used to create a Storm topology (e.g., Summingbird) The user also uploads the user code to Nimbus Nimbus uses a combination of local disk and Zookeeper to store state about the topology: user code is stored on local disk, the topology Thrift objects are stored in Zookeeper Supervisors periodically tell Nimbus the topologies they are running and any vacancies to run more topologies; Nimbus does the match-making between pending topologies and supervisors (The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi and other languages.)

Nimbus and Zookeeper Zookeeper: coordination between Nimbus and Supervisors Nimbus and Supervisors are stateless, all their states are kept in Zookeeper or on local disk: key to Storm’s resilience If Nimbus service fails, workers still continue to make forward progress Supervisors restart the workers if they fail But if Nimbus is down, then users cannot submit new topologies If running topologies experience machine failures, they cannot be reassigned to different machines until Nimbus is revived

Supervisor The Supervisor runs on each Storm node: it receives assignments from Nimbus and spawns workers based on the assignment, and it monitors the health of the workers and respawns them if necessary Supervisor architecture: the Supervisor spawns three threads The main thread reads the Storm configuration, initializes the Supervisor's global map, creates a persistent local state in the file system, and schedules recurring timer events Three types of events => next page

Supervisor The heartbeat event: scheduled to run (e.g., every 15 sec) in the context of the main thread; the thread reports to Nimbus that the supervisor is alive The synchronize supervisor event: executed (e.g., every 10 sec) in the event manager thread; the thread is responsible for managing changes in the existing assignments; if the changes include the addition of new topologies, it schedules a synchronize process event

Supervisor The synchronize process event: runs (e.g., every 3 sec) under the context of the process event manager thread; the thread is responsible for managing worker processes that run a fragment of the topology on the same node as the supervisor; it reads worker heartbeats from the local state and classifies those workers as either valid, timed out, not started, or disallowed "timed out": the worker did not provide a heartbeat in the specified time frame and is assumed to be dead "not started": yet to be started because it belongs to a newly submitted topology, or to an existing topology whose worker is being moved to this supervisor "disallowed": should not be running, either because its topology has been killed or because the worker of the topology has been moved to another node

Workers and Executors Each worker process runs several executors inside a JVM Executors are threads within the worker process Each executor can run several tasks A task is an instance of a spout or a bolt A task is strictly bound to an executor (no dynamic reassignment, e.g., for load balancing, at the moment)
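For reference, the three levels of parallelism described above are configured per topology and per component; a small sketch (all numbers arbitrary, SplitSentence as in the earlier word count sketch):
    import org.apache.storm.Config;
    import org.apache.storm.topology.TopologyBuilder;
    Config conf = new Config();
    conf.setNumWorkers(2);   // 2 worker processes (JVMs) for the topology
    TopologyBuilder builder = new TopologyBuilder();
    // 4 executors (threads) running a total of 8 tasks of the split bolt,
    // i.e. 2 tasks per executor; tasks stay bound to their executors.
    builder.setBolt("split", new SplitSentence(), 4)
           .setNumTasks(8)
           .shuffleGrouping("tweets");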

Workers and Executors To route incoming and outgoing tuples, each worker process has two dedicated threads: a worker receive thread and a worker send thread

Workers and Executors Each executor also consists of two threads: the user logic thread and the executor send thread

Workers and Executors Worker receive thread: examines the destination task id of an incoming tuple and queues the incoming tuple to the appropriate in queue associated with its executor

Workers and Executors User logic thread: takes incoming tuples from the in queue, examines the destination task id, and then runs the actual task (a spout or bolt instance) for the tuple, and generates output tuple(s). These outgoing tuples are then placed in an out queue that is associated with this executor.

Workers and Executors Executor send thread: takes the tuples from the out queue and puts them in a global transfer queue. The global transfer queue contains all the outgoing tuples from executors in the worker process

Workers and Executors Worker send thread: examines each tuple in the global transfer queue and based on its task destination id, sends it to the next worker downstream. For outgoing tuples that are destined for a different task on the same worker, it writes the tuple directly into the in queue of the destination task.
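A toy model of the receive-side routing described on these slides (this is not Storm's actual code, which uses high-throughput queues and its own tuple type; the Msg record and the queue wiring here are made up for illustration):
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    // Placeholder message type carrying a destination task id.
    record Msg(int destTaskId, Object payload) {}
    // Worker receive thread: pull tuples arriving from other workers and
    // route each one to the in queue of the executor owning the destination task.
    class WorkerReceiveLoop implements Runnable {
      private final BlockingQueue<Msg> networkIn;
      private final Map<Integer, BlockingQueue<Msg>> executorInQueues;
      WorkerReceiveLoop(BlockingQueue<Msg> networkIn,
                        Map<Integer, BlockingQueue<Msg>> executorInQueues) {
        this.networkIn = networkIn;
        this.executorInQueues = executorInQueues;
      }
      @Override
      public void run() {
        try {
          while (!Thread.currentThread().isInterrupted()) {
            Msg m = networkIn.take();
            executorInQueues.get(m.destTaskId()).put(m);  // route by task id
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }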

Message Processing Guarantees (Fault Tolerance) At Most Once (e.g., S4): messages may be lost; minimum overhead At Least Once (e.g., Storm): messages will not be lost, but may be processed more than once; medium overhead Exactly Once (e.g., MillWheel): messages are processed exactly once; maximum overhead

Storm At-Least-Once Guarantee Each tuple emitted from a spout will be processed at least once, which is sufficient for idempotent operations Idempotent operation: applying the operation to the same input more than once does not change the output, i.e., f(f(x)) = f(x), where x is the input and f is the operation Examples: filter, maximum, minimum (safe to apply to the same input more than once) The implementation can be very efficient
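For instance (illustrative snippet), a running maximum is idempotent under replay while a plain counter increment is not:
    // Idempotent: re-applying the same input leaves the result unchanged.
    long max = Math.max(currentMax, x);
    max = Math.max(max, x);   // tuple replayed, max is unchanged
    // Not idempotent: re-applying the same input overcounts.
    count += 1;
    count += 1;               // tuple replayed, count is now wrong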

Storm At-Least-Once Guarantee: Implementation XORing a set of values in which every value appears twice gives 0: 1 ^ 2 ^ … ^ (N-1) ^ N ^ N ^ (N-1) ^ … ^ 2 ^ 1 = 0 Each tuple ID is XORed once when the tuple is created and once when it is consumed (acked), so the accumulated value returns to 0 once every tuple has gone through both phases

Storm At-Least-Once Guarantee Add an extra ACKer bolt that XORs each source tuple ID from the spouts together with the IDs of the new tuples generated by processing that source tuple Changes of the XOR value in the example: 001 -> 001^001^002^003 = 001 -> 001^002^003 = 0 The ACKer sends an ACK to Spout1 when the value becomes 0; Spout1 resends tuple1 if no ACK has been received for a long time (Diagram: Spout1 emits tuple1 to SplitterBolt, which emits tuple2 and tuple3 to WordCountBolt; the ACKer bolt tracks Tuple1:[Spout1, XOR value] going from 000 to 001 and back to 0, then acks tuple1 to Spout1)
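The XOR bookkeeping from this example can be traced in a few lines (tuple IDs here are the illustrative 001/002/003 values above; real Storm uses random 64-bit IDs):
    // One XOR accumulator per spout tuple, kept by the ACKer.
    long ackVal = 0;
    ackVal ^= 0x001;                  // Spout1 emits tuple1            -> 001
    ackVal ^= 0x001 ^ 0x002 ^ 0x003;  // SplitterBolt acks tuple1 and
                                      // emits tuple2, tuple3           -> 001
    ackVal ^= 0x002 ^ 0x003;          // WordCountBolt acks tuple2/3    -> 000
    // ackVal == 0: the tuple tree is complete, so ack tuple1 to Spout1.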

Storm@Twitter Runs on hundreds of servers (spread across multiple datacenters) at Twitter Several hundreds of topologies run on these clusters, some run on more than a few hundred nodes Many terabytes of data flows through the Storm clusters every day, generating several billions of output tuples

Storm@Twitter Storm topologies are used by a number of groups inside Twitter, including revenue, user services, search, and content discovery Simple things like filtering and aggregating the content of various streams at Twitter (e.g. computing counts) Also for more complex things like running simple machine learning algorithms (e.g. clustering) on stream data

Storm@Twitter Storm is resilient to failures, and continues to work even when Nimbus is down (the workers continue making forward progress) A machine can be taken down for maintenance without affecting the topology The 99th percentile latency for processing a tuple is close to 1 ms Cluster availability is 99.9% over a long period of time

Guaranteeing Message Processing Tuple tree

Guaranteeing Message Processing A spout tuple is not fully processed until all tuples in the tree have been completed If the tuple tree is not completed within a specified timeout, the spout tuple is replayed

Guaranteeing Message Processing Reliability API “Anchoring” creates a new edge in the tuple tree

Guaranteeing Message Processing Marks a single node in the tree as complete
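The reliability API calls on these two slides amount to an anchored emit plus an ack; a minimal bolt sketch using them (the class name and field names are placeholders):
    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    public class ReliableSplitSentence extends BaseRichBolt {
      private OutputCollector collector;
      @Override
      public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
      }
      @Override
      public void execute(Tuple input) {
        for (String word : input.getString(0).split("\\s+")) {
          // Anchoring: passing the input tuple makes each emitted word
          // a new edge (child) in the input's tuple tree.
          collector.emit(input, new Values(word));
        }
        // Acking: mark this node of the tuple tree as complete.
        collector.ack(input);
      }
      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
      }
    }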

Guaranteeing Message Processing Storm tracks tuple trees for you in an extremely efficient way and provides at-least-once guarantee

Transactional Topologies How do you do idempotent counting with an at-least-once delivery guarantee? Won't you overcount? Transactional topologies solve this problem and provide an exactly-once guarantee

Transactional Topologies An exactly-once guarantee for each individual tuple is expensive Instead, process small batches of tuples (Diagram: Batch 1, Batch 2, Batch 3)

Transactional Topologies If a batch fails, replay the whole batch Once a batch is completed, commit the batch Bolts can optionally be "committers" (Diagram: Batch 1, Batch 2, Batch 3)

Transactional Topologies Commits are ordered; if there is a failure during commit, the whole batch plus its commit is retried (Diagram: Commit 1 → Commit 2 → Commit 3 → Commit 4)
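A conceptual sketch of why ordered commits give exactly-once counts (this is the idea behind Storm's transactional topologies, not its actual API; KeyValueStore and StoredValue are made-up types): because batch i+1 only commits after batch i, a committer can store the transaction id next to the count, which makes the commit idempotent under replay.
    // Hypothetical state kept by a committer bolt.
    interface KeyValueStore {
      StoredValue get(String key);
      void put(String key, StoredValue value);
    }
    record StoredValue(long lastTxId, long count) {}
    // Inside a committer bolt: the stored transaction id tells us whether
    // this batch's update has already been applied.
    void commitBatch(KeyValueStore store, String key, long txId, long batchCount) {
      StoredValue current = store.get(key);
      if (current != null && current.lastTxId() == txId) {
        return;  // this batch was already committed; replaying it is a no-op
      }
      long base = (current == null) ? 0 : current.count();
      store.put(key, new StoredValue(txId, base + batchCount));
    }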